Feature selection by Information Imbalance optimization: Clinics, molecular modeling and ecology

WILD, ROMINA
2024-12-03
Abstract
In feature selection, current methods are often limited by the types and dimensions of data they can handle. Supervised methods in particular are rigid regarding their target space, typically requiring it to be one-dimensional and of a specific type (e.g., continuous or categorical). This thesis introduces feature selection methods that mitigate these limitations using a statistic called the Information Imbalance, which identifies a low-dimensional subset of input features that best preserves the pairwise distance relations of the target feature space by ranking nearest neighbors. First, we derive a weighted Information Imbalance approach to handle class-imbalanced medical data, along with an optimization routine capable of managing missing data. A study on COVID-19 severity prediction showcases this approach, successfully isolating a 13-feature subset from a pool of roughly 150 features; this subset outperformed traditional feature selection methods in subsequent predictions of patient severity. We then introduce an Information Imbalance variant that can handle binary and categorical data, which we benchmark on Amazon Rainforest biodiversity data. By quantifying the relative information content of continuous features, such as average temperature, and categorical features, such as the label of the region in which the data were recorded, this method identifies plausible predictors of species richness and detects asymmetric information even between variables that are not correlated. Finally, we introduce a differentiable variant of the Information Imbalance, implemented in the easy-to-use Python package DADApy. The Differentiable Information Imbalance (DII) optimizes relative feature weights via gradient descent, addressing the combinatorial challenges of high-dimensional data. The weights correct for different units of measure and relative importance, and allow for feature selection through sparsity-inducing optimization approaches.
In molecular dynamics simulations, this method reduced the feature set to three collective variables that effectively describe a beta-hairpin peptide. In another application, on machine learning potentials, the input feature space was compressed, reducing run time while preserving accuracy.
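The rank-based statistic described in the abstract can be illustrated with a minimal numpy sketch. This is not the DADApy implementation; it is an assumed reading of the definition given above, in which the Information Imbalance Delta(A -> B) is the average rank, in space B, of each point's nearest neighbor in space A, normalized so that a value near 1 indicates that A carries no information about B and a value near 0 indicates that neighborhoods are fully preserved.

```python
import numpy as np

def information_imbalance(X_a, X_b):
    """Sketch of Delta(A -> B) for two feature spaces over the same
    N samples: the mean rank, in space B, of each sample's nearest
    neighbor in space A, rescaled by 2/N."""
    n = len(X_a)
    # Pairwise Euclidean distance matrices in both spaces.
    d_a = np.linalg.norm(X_a[:, None] - X_a[None, :], axis=-1)
    d_b = np.linalg.norm(X_b[:, None] - X_b[None, :], axis=-1)
    # Exclude self-distances so a point is never its own neighbor.
    np.fill_diagonal(d_a, np.inf)
    np.fill_diagonal(d_b, np.inf)
    # Nearest neighbor of each point in space A.
    nn_a = np.argmin(d_a, axis=1)
    # Rank matrix in space B (rank 1 = nearest neighbor).
    ranks_b = d_b.argsort(axis=1).argsort(axis=1) + 1
    # Average B-rank of the A-neighbors, normalized by N/2.
    return (2.0 / n) * ranks_b[np.arange(n), nn_a].mean()
```

If the two spaces are identical, every A-neighbor has rank 1 in B and the statistic takes its minimum value 2/N; for two unrelated random spaces it fluctuates around 1. The asymmetry noted in the abstract arises because Delta(A -> B) and Delta(B -> A) need not be equal.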
Archive
https://hdl.handle.net/20.500.11767/143290
https://ricerca.unityfvg.it/handle/20.500.11767/143290
Rights
open access
Subjects
  • Feature selection

  • Information Imbalance...

  • Differentiable Inform...

  • DII

  • clinical prediction

  • species richness pred...

  • feature weighting

  • high dimensional

  • Settore PHYS-06/A - F...

  • Settore PHYS-04/A - F...

  • Settore MATH-03/B - P...

