The Venus score for the assessment of the quality and trustworthiness of biomedical datasets

Chicco D.; Fabris A.; Jurman G.
2025
  • journal article

Journal
BIODATA MINING
Abstract
Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, however, they are often incomplete and impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before the release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. We distill the most important data governance requirements into ten questions tailored to the biomedical domain, comprising the Venus score. We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. 
Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data.
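As a rough illustration of how a ten-question score ranging from 0 to 10 could be tallied, the sketch below counts affirmative answers. The abstract does not state how answers map to points, so the one-point-per-"yes" rule, the function name `checklist_score`, and the example answers are all assumptions made for illustration; they are not the authors' method or data.

```python
from typing import Sequence

# The abstract states that the score consists of ten questions.
N_QUESTIONS = 10


def checklist_score(answers: Sequence[bool]) -> int:
    """Return the number of affirmative answers.

    Assumption (not taken from the abstract): each "yes" is worth one
    point, so ten questions give an integer score between 0 and 10.
    """
    if len(answers) != N_QUESTIONS:
        raise ValueError(f"expected {N_QUESTIONS} answers, got {len(answers)}")
    return sum(1 for answer in answers if answer)


# Hypothetical answers for an imaginary dataset (not real results).
example_answers = [True, True, False, True, True,
                   False, True, True, True, False]
example_score = checklist_score(example_answers)
print(example_score)  # prints 7
```

Keeping the answers as a plain list of booleans makes it easy to see which questions lowered the total, which matches the abstract's emphasis on fine-grained strengths and weaknesses rather than a single aggregate number.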
DOI
10.1186/s13040-024-00412-x
WOS
WOS:001392346300001
Archive
https://hdl.handle.net/11368/3106381
info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-85217522058
https://biodatamining.biomedcentral.com/articles/10.1186/s13040-024-00412-x
Rights
open access
license:creative commons
license uri:http://creativecommons.org/licenses/by/4.0/
FVG url
https://arts.units.it/bitstream/11368/3106381/1/s13040-024-00412-x-1.pdf
Subjects
  • Bioinformatics
  • Biomedical data quali...
  • Cheminformatics
  • Computational biology...
  • Data documentation
  • Data trustworthiness
  • Datasheets for Datase...
  • EU AI Act
  • Health informatics
  • Kaggle
  • Medical data
  • Medical text
  • Trustworthiness
  • Trustworthy data