The Metaverse and its immersive environments are attracting significant attention due to their potential applications across fields ranging from healthcare to art. As their number grows, it becomes increasingly difficult to search through them effectively and identify those of interest to a user. Recent work has modeled Metaverses as multimedia-rich 3D scenarios. However, existing approaches to retrieving them via text have several shortcomings: they do not experiment with the joint analysis of the heterogeneous multimedia formats found within a Metaverse, they rely on small-scale datasets whose elements are randomly aggregated, and their retrieval methods consequently lack thematic coherence. To address these issues, we introduce SAVAGE, a novel synthetic dataset of 10,000 thematic exhibitions containing both real-world paintings and generated video artworks. We also propose HM3, a new hierarchical methodology for Metaverse Retrieval that captures all the contents of a room and integrates both images and videos, with training guided by a novel theme-aware loss function. Experiments on SAVAGE demonstrate the effectiveness of HM3 in modeling museums. The method also achieves considerable improvements on an existing dataset of Metaverses, with ablation studies and qualitative analyses confirming the utility of the proposed theme-aware loss function.