Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks

MACOCCO, IURI
2023-10-26
Abstract
Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are among the most important analysis tools. The Intrinsic Dimension provides a measure of the dimensionality of the hidden manifold from which data are sampled, even if the manifold is embedded in a space with a much higher number of features. This thesis tackles the still open problem of computing the Intrinsic Dimension (ID) of spaces characterised by non-Euclidean metrics. In particular, we focus on datasets where the distances between points are measured by means of Manhattan, Hamming or shortest-path metrics and can therefore only assume discrete values. This peculiarity has deep consequences for the way datapoints populate neighbourhoods and for the structure of the manifold. For this reason we develop a general-purpose, nearest-neighbours-based ID estimator with two distinctive features: the ability to select explicitly the scale at which the ID is computed, and a validation procedure to check the reliability of the resulting estimate. We then specialise the estimator to lattice spaces, where volumes are measured by means of Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we apply it to genomic sequences and discover an unexpectedly low ID, suggesting that evolutionary pressure strongly restrains the way the nucleotide bases are allowed to mutate. The same framework is then employed to profile how the ID of unweighted networks varies with scale. The diversity of the ID profiles obtained prompted us to use them as signatures to characterise the networks. Concretely, we employ the ID as a summary statistic within an Approximate Bayesian Computation framework in order to pinpoint the parameters of mechanistic generative network models of increasing complexity. We find that, by targeting the ID of a given network, other typical network properties are also recovered reasonably well.
As a final methodological development, we improve the ID estimator by adaptively selecting, for each datapoint, the largest neighbourhood with an approximately constant density. This offers a quantitative criterion for automatically selecting a meaningful scale at which the ID is computed and, at the same time, makes it possible to enforce the hypothesis underlying the method, leading to more reliable estimates.
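As a rough illustration of the lattice setting mentioned in the abstract, the sketch below counts the points of the integer lattice inside a Manhattan ball using the Ehrhart polynomial of the cross-polytope, and then recovers an integer dimension from the ratio of neighbour counts at two radii. It is not the estimator developed in the thesis: the function names, the restriction to integer dimensions, and the simple ratio-matching rule are all assumptions made here for illustration only.

```python
from itertools import product
from math import comb


def l1_ball_lattice_points(t: int, d: int) -> int:
    """Number of points of Z^d within Manhattan distance t of the origin.

    Closed form (Ehrhart polynomial of the d-dimensional cross-polytope):
    sum over k of 2^k * C(d, k) * C(t, k).
    """
    return sum(2**k * comb(d, k) * comb(t, k) for k in range(d + 1))


def brute_force_count(t: int, d: int) -> int:
    """Same count by explicit enumeration; only feasible for small t and d."""
    return sum(
        1
        for x in product(range(-t, t + 1), repeat=d)
        if sum(abs(c) for c in x) <= t
    )


def estimate_id_from_counts(
    n_inner: int, n_outer: int, t_inner: int, t_outer: int, max_d: int = 20
) -> int:
    """Hypothetical helper: pick the integer d whose volume ratio
    V(t_inner, d) / V(t_outer, d) is closest to the observed ratio of
    points found within the two radii (counts include the centre point).
    """
    observed = n_inner / n_outer

    def mismatch(d: int) -> float:
        expected = l1_ball_lattice_points(t_inner, d) / l1_ball_lattice_points(t_outer, d)
        return abs(expected - observed)

    return min(range(1, max_d + 1), key=mismatch)


if __name__ == "__main__":
    # Counts around a point of a fully occupied 3-dimensional lattice.
    inner = l1_ball_lattice_points(2, 3)
    outer = l1_ball_lattice_points(4, 3)
    print(estimate_id_from_counts(inner, outer, t_inner=2, t_outer=4))
```

The two radii play the role of the explicitly chosen scale: changing `t_inner` and `t_outer` changes the neighbourhood size at which the dimension is probed.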
Archive
https://hdl.handle.net/20.500.11767/134630
https://ricerca.unityfvg.it/handle/20.500.11767/134630
Rights
embargoed access
Subjects
  • Intrinsic Dimension

  • Data Science

  • Discrete Metric

  • Unsupervised Learning...

  • Network Theory

  • Genomic

  • Lattice

  • Manifold Learning

  • Settore FIS/02 - Fisi...

  • Settore FIS/07 - Fisi...
