medical research data repository

U.S. Department of Health & Human Services
National Institutes of Health
Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI)

Biomedical Data Repositories and Knowledgebases

About biomedical data repositories and knowledgebases.

To better support a modern data resource ecosystem, NIH makes a distinction between data repositories and knowledgebases. While both are important for advancing biomedical research, data repositories and knowledgebases can have unique functions, metrics for success, and sustainability needs.

Sustaining a healthy and productive data resource ecosystem means that each component:

Delivers scientific impact to the communities that it serves
Employs and promotes good data management practices and provides efficient operation for quality and services
Engages with the user community and continuously addresses their needs
Supports a process for data life-cycle analysis
Engages in exploration of the current landscape of biomedical data repository metrics to aid NIH in better understanding how datasets and repositories are used
Provides long-term preservation and trustworthy governance

Both data repositories and knowledgebases contribute to the NIH data resource ecosystem

Data Repositories

Biomedical data repositories accept the submission of relevant data from the research community to store, organize, validate, archive, preserve, and distribute data in compliance with the FAIR Data Principles.
Curation focuses on quality assurance and quality control.
Example: core data might include genome, transcriptome, and protein sequences or imaging or spectroscopic data

Knowledgebases

Biomedical knowledgebases extract, accumulate, organize, annotate, and link the growing body of information that is related to, and relies on, core datasets.
Significant levels of human curation are traditionally required.
Example: information about expression patterns, splicing variants, localization, protein-protein interaction, and pathway networks related to an organism or set of organisms; publication information

View Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) Data Sharing Resources.

Metrics and Lifecycle

Data repositories and knowledgebases exist on a spectrum of ability and readiness to adopt the desirable characteristics aligned with the FAIR and TRUST principles. Because research data resources, repositories, and datasets are critical, developing metrics to evaluate the usage, utility, and impact of a given repository is essential. To that end, NIH conducted a survey and organized a workshop to better understand both existing and desired lifecycle metrics. NIH then issued a report presenting the findings, which describe the metrics currently used within the biomedical repository community and can inform future NIH efforts to develop this space and to understand patterns of use across datasets and repositories.

Open Funding Opportunities

(Open) Promoting Data Reuse for Health Research (NOT-OD-24-096), April 30, 2024
(Open) Enhancement and Management of Established Biomedical Data Repositories and Knowledgebases (PAR-23-237), August 31, 2023
(Open) Early-stage Biomedical Data Repositories and Knowledgebases (PAR-23-236), August 31, 2023
FAQs for PAR-23-237 and PAR-23-236
Notice of Pre-Application Webinar for the NIH Biomedical Data Repositories and Knowledgebases Program (DRKB) (NOT-OD-24-097), April 11, 2024. A recording of the webinar can be accessed on this website: [Recording]

Closed Funding Opportunities

(Closed) Support for existing data repositories to align with FAIR and TRUST principles and evaluate usage, utility, and impact (NOT-OD-23-044), FAQs, January 5, 2023
(Closed) Support for existing data repositories to align with FAIR and TRUST principles and evaluate usage, utility, and impact (NOT-OD-22-069), January 31, 2022
(Closed) Administrative Supplements Available to Strengthen NIH-Funded Biomedical Data Repositories (NOT-OD-21-089), April 6, 2021
(Closed) Biomedical Data Repository (PAR-23-079), May 9, 2023
(Closed) Biomedical Knowledgebase (PAR-23-078), May 9, 2023
Biomedical Data Repository (PAR-20-089)
Biomedical Knowledgebase (PAR-20-097)

Funded Awards

PAR-20-089 and PAR-20-097 Awardees

PAR-20-089 and PAR-20-097 Award Recipients
Award IC	Principal Investigator	Project Title
	Nuno Bandeira
	Adam R. Ferguson
	Jeffrey C Hoch
	Jonathan Rosand
	Samuel S. Wu
	Anita Elzbieta Bandrowski
	Dinesh Barupal
	Alex Bateman
	Lindsay G Cowell	i-AKC: Integrated AIRR Knowledge Commons
	Michael K. Gilson
	Malachi Griffith
	Marc S. Halfon
	Carol Marie Hamilton
	Yongqun He
	Peter D Karp
	Teri Ellen Klein
	Elliot J. Lefkowitz
	Carolyn J. Mattingly
	Nicola Mulder
	Mark A. Musen
NHGRI	Helen Parkinson
	Lynn Marie Schriml
	Lincoln D. Stein
	Paul W Sternberg
	Paul D. Thomas
	Michael Tiemeyer
	Alexander Tropsha
	Jeremy Lyle Warner

View PAR-23-079

PAR-23-079 Award Recipients
Grant Number	Award IC	Principal Investigator	Project Title
		Mackenzie Cottrell	HIV Pharmacology Data Repository
		Joost B Wagenaar
		Kivanc Kose
		Naela McCarty	The Georgia Cystic Fibrosis Data Warehouse

View PAR-23-078

PAR-23-078 Award Recipients
Grant Number	Award IC	Principal Investigator	Project Title
		Norbert Perrimon

View NOT-OD-22-069 Awardees

NOT-OD-22-069 Award Recipients

		Antonella Zanobetti
		Christian Haselgrove
		Melissa Haendel
		Alex Bateman
		Susan Teitelbaum
GS-35F-0442V, 75N97021F00100		Alison Garcia
GS-35F-0442V, 75N97021F00100		Alison Garcia
HHSN26110071		Andrey Fedorov

View NOT-OD-21-089 Awardees

NOT-OD-21-089 Award Recipients

	Eric Ravussin	[...] was opened to allow people to independently search the cadre of available data. As we transition to increased usage, it is imperative that we align with the FAIR and TRUST principles and ensure that we can appropriately track usage, utility, and impact. In response to NOT-OD-21-089, we have developed a comprehensive but conservative one-year project to achieve these goals. In aim 1, we will improve “FAIR”-ness by adding existing data, increasing metadata, and establishing metrics for tracking usage. In aim 2, we will improve “TRUST”-worthiness by promoting and demonstrating the methods used for data collection. Finally, aim 3 will explore the possibility of certification. This repository provides unique data on nutrition and obesity, which should benefit researchers across the country for years to come.
	Molly A. Bogue
	Nadine Martin
	Brian MacWhinney
	Carl Kesselman
	Dalane Kitzman
	Paul Sternberg
	Linda Brzustowicz
	Adam Ferguson
	Vikash Gilja
	Ronna Hertzano
	Julius Fridriksson	Public sharing of the Aphasia Recovery Cohort
HHSN316201200036W	Atul Butte
HHSN316201300006W/HHSN27200002	Nada Midani
HHSN316201300006W/HHSN27200002	Nada Midani
75N94021D00001/75N94021F00001	Michael Keller
HHSN316201200054W	Jennifer Fostel

View NOT-OD-23-044 Awardees

NOT-OD-23-044 Award Recipients
Grant Number	Award IC	Principal Investigator	Project Title
		Julius Fridriksson
		Lincoln D Stein
		John E Marcotte
		Helen E Parkinson

Engage with the community by joining the [email protected] listserv. Instructions on how to join can be found here.

This page last reviewed on August 27, 2024

This page last reviewed on May 2, 2022

Finding Datasets, Data Repositories, and Data Standards

This online guide contains resources for finding data repositories for data preservation and access and locating datasets for reuse. The guide was developed as an online companion for the class Resources for Finding and Sharing Research Data. If you are NIH or HHS staff, please check out the NIH Library training schedule for upcoming classes.

If you need a one-on-one or group consultation on locating data repositories and datasets, please contact the NIH Library.

Some content of this guide is adapted from:

Read, Kevin; Surkis, Alisa (2018): Research Data Management Teaching Toolkit. figshare. ( https://figshare.com/articles/Research_Data_Management_Teaching_Toolkit/5042998 ) This work is licensed under Attribution 4.0 International (CC BY 4.0).

Resources to Locate Data Repositories

Domain-specific repositories
Generalist repositories
Information from the BMIC tables described above, listing repositories for sharing scientific data and repositories for accessing scientific data, can also be found at Sharing.nih.gov.
The portal covers data registries from across many academic disciplines.
Users can search by keyword or browse repositories by subject, content type, or country.
Choose Databases to search and browse data repositories.
Choose Collections to view data repositories, standards, and policies related to various topics.
Submit a Data Management and Sharing plan (DMSP) outlining how scientific data and any accompanying metadata will be managed and shared, taking into account any potential restrictions or limitations.
Comply with the Data Management and Sharing plan approved by the funding Institute or Center (IC).
Data Management & Sharing Policy Overview: Learn more about the 2023 Data Management & Sharing Policy, and find resources to assist with compliance.
Allowable Costs for Data Management and Sharing
Elements of an NIH Data Management and Sharing Plan
Selecting a Repository for Data Resulting from NIH-Supported Research
Protecting Privacy When Sharing Human Research Participant Data
Responsible Management and Sharing of American Indian/Alaska Native Participant Data
Research associated with a ZIA
Research associated with a clinical protocol that will undergo IC Initial Scientific Review
The plans will address the elements indicated in the Intramural Research Program Data Management and Sharing (IRP DMS) Plan template. The template addresses six NIH-recommended core elements and allows for the inclusion of IC-specific elements: Intramural Data Management and Sharing Plan Template (PDF)
See the 2023 NIH Data Management and Sharing Policy page in the OIR Sourcebook for additional guidance and resources.
See the library guide Data Management and Sharing Plan Resources for a detailed list of DMSP resources and IC-specific contacts.
Genomic Data Sharing Policy
NIH Institute and Center Data Sharing Policies
Intramural Human Data Sharing Policy
Other Sharing Policies
Find more information on Intramural Data Sharing from the NIH Office of Intramural Research.
Visit Sharing.nih.gov for guidance on Selecting a Data Repository and a list of potential Repositories for Sharing Scientific Data.

Issues to consider when finding a data repository to preserve and share data:

Required Repositories: Check the funder/publisher policies to see if there are required repositories where the data must be deposited.
You may need to anonymize and/or aggregate the data before sharing, or access to the data may need to be limited to researchers with specific permissions.
Intellectual Property: Be aware of who owns the intellectual property and if there are any licensing restrictions.
Required Data Standards: Be aware of the data standards (such as metadata and data formats) required for depositing the data in the repository.
Deposit and Storage Costs: Be aware of any costs associated with depositing/storing the data.

Find additional guidance at Sharing.nih.gov for Selecting a Data Repository.

Indexes datasets using the metadata descriptions that come directly from the dataset web pages using schema.org structure.
Contains more than 31 million datasets from more than 4,600 internet domains.
About half of these datasets come from .com domains, but .org and governmental domains are also well represented.
Dataset results are now also listed in general Google search results, according to a February 2023 blog post.
Filter results by date range, data type, source type (article or data repository), and source.
NLM also offers Center for Clinical Observational Investigations (CCOI) Dataset Profiles for exploring large-scale clinical datasets.
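
Because the indexing described in the first point above relies on schema.org metadata published on the dataset's own web page, a repository landing page would typically embed a `Dataset` description as JSON-LD. The following is a minimal sketch of generating such a record in Python. The helper name and every field value are hypothetical placeholders, and only a handful of schema.org `Dataset` properties are shown.

```python
import json


def build_dataset_jsonld(name, description, url, license_url, keywords):
    """Return a schema.org Dataset description serialized as JSON-LD.

    Hypothetical helper for illustration; it is not part of any library.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": list(keywords),
    }
    return json.dumps(record, indent=2)


# All values below are made-up placeholders, not a real dataset.
example_jsonld = build_dataset_jsonld(
    name="Example cohort imaging dataset",
    description="Placeholder description of a de-identified imaging dataset.",
    url="https://repository.example.org/datasets/123",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["imaging", "example"],
)
print(example_jsonld)
```

The resulting string would normally be placed inside a `<script type="application/ld+json">` element on the dataset's landing page so that crawlers can read it.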

Here’s a closer look at a few major cross-disciplinary repositories highlighted on the NIH Data Sharing Resources: Generalist Repositories page.

Browse or search and filter datasets by geographical location, subject, journal, or institution.
Filter by Item Type: Dataset.
Filter by Type: Dataset to view only dataset results.

The NIH Office of Data Science Strategy (ODSS) announced the Generalist Repository Ecosystem Initiative (GREI), which includes seven established generalist repositories that will work together to establish consistent metadata, develop use cases for data sharing, train and educate researchers on FAIR data and the importance of data sharing, and more. A series of recorded webinars is offered to learn about GREI and generalist repositories.

Some will also store the dataset.
Others provide recommendations of where to store the data.
Usually peer-reviewed.
GigaScience: An open access, open data, open peer-review journal from Oxford University Press focusing on “big data” research from the life and biomedical sciences.
Scientific Data: A peer-reviewed, open-access journal from Springer Nature that publishes descriptions of scientifically valuable datasets and research that advances the sharing and reuse of scientific data.
Sources of Dataset Peer Review: The University of Edinburgh maintains a list of peer-reviewed data publications.
The EU-funded FOSTER portal (an e-learning platform for training resources on Open Science) provides a list of Open Data Journals.
Walters, William H. 2020. “Data Journals: Incentivizing Data Access and Documentation Within the Scholarly Communication System.” Insights 33 (1): 18. DOI: http://doi.org/10.1629/uksg.510. Provides a list of data journals.
PubMed: Use the filter option “Article Attribute” > “Associated Data” to view only results with related data links. Data filters were originally added to PubMed and PubMed Central in 2018.
Web of Science: When viewing search results in Web of Science (All Databases), choose the Associated Data option under Quick Filters to view only search results that mention a data set, data study, or data repository in the Data Citation Index. The Data Citation Index includes records on over 14 million research data sets, 1.6 million data studies, and 405 thousand software from over 450 international data repositories in the sciences, social sciences, and arts and humanities.
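
The PubMed “Associated Data” filter can also be approximated programmatically. The sketch below builds a query URL for the NCBI E-utilities ESearch endpoint. It assumes that the filter corresponds to the search term `data[filter]`, which should be confirmed against current PubMed documentation. The function name is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_associated_data_query(topic, max_results=20):
    """Build an ESearch URL limited to PubMed records with associated data.

    Assumption: the "Associated Data" filter maps to the search term
    data[filter]; verify this against current PubMed documentation.
    """
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND data[filter]",
        "retmode": "json",
        "retmax": max_results,
    }
    # urlencode percent-escapes the brackets and parentheses in the term.
    return f"{ESEARCH_URL}?{urlencode(params)}"


query_url = build_associated_data_query("asthma")
print(query_url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a JSON document listing matching PubMed IDs; NCBI's usage guidelines on request rates apply.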

Issues to consider when re-using datasets include:

Who is the author of the dataset? What is their institutional affiliation?
Is there a peer-reviewed publication associated with the dataset?
Licensing: Check any license restrictions for the data. Many repositories will list the type of license the data is covered by (usually Creative Commons or Open Data Commons licenses).
Use the format defined by a style guide, like APA (see APA style manual examples for datasets).
In EndNote, you can define a reference as a dataset. EndNote will then format the reference into the correct dataset citation format for the selected style.
Learn more: NYU Libraries, Data Sources: How to Cite Data & Statistics

See the ELIXIR Research Data Management Kit (RDMkit) guide on Existing Data for additional considerations and resources when locating existing datasets for reuse.

Data/metadata standards and CDEs can help to make data more FAIR (findable, accessible, interoperable, and reusable; see FORCE11, The FAIR Data Principles).

DCC Disciplinary Metadata: Collections of metadata standards organized by discipline.
FAIRsharing.org: An online catalog that includes over 1750 data and metadata standards.
NIH CDE Repository: The NIH Common Data Elements (CDE) Repository provides access to structured human and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes.

Finding Datasets for Secondary Analysis

NIH Data Repositories

Examples of NIH Data Repositories

In general, NIH does not endorse or require sharing in any specific repository and encourages researchers to select the repository that is most appropriate for their data type and discipline (though such specification does exist for particular initiatives). To help researchers locate an appropriate resource for sharing their data, as well as to promote awareness of resources where datasets can be located for reuse, the Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintains lists of several types of data sharing resources:

Open NIH-supported domain-specific repositories that house data of a specific type or related to a specific discipline;
Other NIH-supported domain-specific resources, including repositories and knowledgebases, that have limitations on submitting and/or accessing data; and
Generalist repositories that house data regardless of type, format, content, or subject matter.
Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC): The goal of BioLINCC is to facilitate and coordinate the existing activities of the NHLBI Biorepository and the Data Repository and to expand their scope and usability to the scientific community through a single web-based user interface.
Data and Specimen Hub (DASH): NICHD DASH is a centralized resource for researchers to store and access de-identified data from NICHD-funded research studies for the purposes of secondary research use. It serves as a mechanism for NICHD-funded extramural and intramural investigators to share research data from studies in accordance with the NIH Data Sharing Policy and the NIH Genomic Data Sharing Policy.
EyeGENE® (NEI): The eyeGENE® Biorepository and corresponding Database contain family history and clinical eye exam data from subjects enrolled in the eyeGENE® Program, coupled to clinical-grade DNA samples. These data and samples are submitted by collaborators throughout the US and Canada, and the data are available on a controlled-access basis to researchers worldwide.
Inter-University Consortium for Political and Social Research (ICPSR): An international consortium of more than 750 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS): NIAGADS is a national genetics data repository facilitating access to genotypic and phenotypic data for Alzheimer's disease (AD). Data include GWAS, whole genome (WGS) and whole exome (WES), expression, RNA-Seq, and ChIP-Seq analyses.
OpenfMRI: The OpenfMRI database is a curated public repository of human and non-human brain imaging data collected using MRI techniques (potentially with additional PET, EEG, and MEG data). No registration or license agreement is required to obtain the data, which are distributed, by default, under a Public Domain license.
Last Updated: Sep 4, 2024 8:50 AM
URL: https://browse.welch.jhmi.edu/datasets

STARR is a data resource designed to improve researchers' access to healthcare data. STARR contains data from Stanford Health Care and the Stanford Children’s Hospital and supports diverse use cases and research applications. STARR offers raw data, analysis-ready data, linked data across different data modalities, support for different data models, multiple clinical data warehouses, data search and access tools, data de-identification pipelines, concierge services, training, and documentation.

Announcements for Stanford researchers:

May 29, 2024: Starting June 18, 2024, MRNs created at Stanford Hospitals will have 10 digits (versus the current 8 digits). This may affect your analytical workflow if you assume that MRNs have a fixed length. If you do not have access to the STARR user Slack channel or STARR Tools and need details on the ranges and allocations for the two hospitals, please request a consultation with one of the STARR team members.
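
For workflows affected by the MRN change announced above, one defensive approach is to validate MRNs against both lengths rather than a single fixed width. The sketch below is illustrative only: it assumes MRNs are digit-only strings of either 8 or 10 characters, and the function name and sample values are hypothetical rather than part of any STARR tooling.

```python
import re

# Assumption for illustration: MRNs are digit-only strings that are either
# 8 digits (older records) or 10 digits (records created after the change).
MRN_PATTERN = re.compile(r"\d{8}|\d{10}")


def normalize_mrn(raw_mrn):
    """Return the MRN as a stripped string, or raise ValueError if malformed.

    MRNs are handled as strings so that leading zeros survive; callers should
    avoid converting them to integers anywhere in the workflow.
    """
    text = str(raw_mrn).strip()
    if not MRN_PATTERN.fullmatch(text):
        raise ValueError(f"unexpected MRN format: {text!r}")
    return text


# Made-up identifiers, not real patient MRNs.
samples = ["01234567", " 0123456789 ", "123456789"]
for sample in samples:
    try:
        print(normalize_mrn(sample))
    except ValueError as err:
        print(f"rejected: {err}")
```

Keeping the accepted lengths in one pattern means that joins, padding logic, and column widths elsewhere in a pipeline can be checked against a single definition.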

Self-service tools

The self-service tools are designed to meet a range of use cases such as cohort analysis using a graphical user interface, SQL access to pre-IRB databases, access to de-identified clinical text, linking multi-modal pre-IRB data, complex phenotyping, and more.

Research support

Research IT and the Research Informatics Center host office hours and online tutorial sessions. We provide access to documentation and code, monitor a Slack channel, and provide mechanisms to file bug reports.

Consultation services

Where self-service tools and research support are insufficient, researchers can request additional data and technology services via consultation services.

Data Resources in the Health Sciences

Clinical Data

Introduction to Clinical Data

Defining Clinical Data Repositories

State of the Industry: Seven Characteristics of a Clinical Research Data Repository HIMSS

A Practical Guide to Clinical Data Warehousing Association for Clinical Data Management (ACDM)

Clinical data is a staple resource for most health and medical research. Clinical data is collected either during the course of ongoing patient care or as part of a formal clinical trial program. Clinical data falls into six major types:

Electronic health records
Administrative data
Claims data
Patient / Disease registries
Health surveys
Clinical trials data

See boxes below for examples of each major type.

For additional administrative and survey sources such as healthdata.gov, see Statistics Sources: Health Sciences.

For registry sources, see Data Repository Registries

Electronic health records are the purest type of electronic clinical data, obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), this record is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

Administrative data are often associated with electronic health records; these are primarily hospital discharge data reported to a government agency like AHRQ.

Healthcare Cost & Utilization Project (H-CUP): HCUPnet is a free, online query system based on data from the Healthcare Cost and Utilization Project (HCUP). It provides access to health statistics and information on hospital inpatient and emergency department utilization. The project includes a number of datasets and sample studies. Datasets are available for purchase. Datasets include the Nationwide Inpatient Sample, Kids Inpatient Database, State Inpatient Databases, State Ambulatory Surgery Databases, and State Emergency Department Databases.

Claims data describe the billable interactions (insurance claims) between insured patients and the healthcare delivery system. Claims data fall into four general categories: inpatient, outpatient, pharmacy, and enrollment. Claims data can be obtained from the government (e.g., Medicare) and/or commercial health firms (e.g., United HealthCare).

Basic Stand Alone (BSA) Medicare Claims Public Use Files (PUFs) This is the Basic Stand Alone (BSA) Public Use Files (PUF) for Medicare claims. This is a claim-level file in which each record is a claim incurred by a 5% sample of Medicare beneficiaries. Claims include inpatient/outpatient care, prescription drugs, DME, SNF, hospice, etc. There are some demographic and claim-related variables provided in every PUF.
Medicare Provider Utilization and Payment Data Data that summarize utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers.
Medicaid Data Sources The Medicaid Analytic eXtract data contains state-submitted data on Medicaid eligibility, service utilization and payments. The CMS-64 provides data on Medicaid and SCHIP Budget and Expenditure Systems.
Medicaid Statistical Information System MSIS is the basic source of state-submitted eligibility and claims data on the Medicaid population, their characteristics, utilization, and payments.

Disease registries are clinical information systems that track a narrow range of key data for certain chronic conditions such as Alzheimer's Disease, cancer, diabetes, heart disease, and asthma. Registries often provide critical information for managing patient conditions.

Global Alzheimer's Association Interactive Network (GAAIN) The Global Alzheimer’s Association Interactive Network (GAAIN) is a collaborative project that will provide researchers around the globe with access to a vast repository of Alzheimer’s disease research data and the sophisticated analytical tools and computational power needed to work with that data.
National Cardiovascular Data Registry (NCDR) The NCDR® is the American College of Cardiology’s worldwide suite of data registries helping hospitals and private practices measure and improve the quality of cardiovascular care they provide. The NCDR encompasses six hospital-based registries and one outpatient registry. There are currently more than 2,400 hospitals and nearly 1,000 outpatient providers participating in NCDR registries.
National Program of Cancer Registries CDC provides support for states and territories to maintain registries that provide high-quality data. Data collected by local cancer registries enable public health professionals to understand and address the cancer burden more effectively.
National Trauma Data Bank The National Trauma Data Bank® (NTDB) is the largest aggregation of trauma registry data ever assembled. The goal of the NTDB is to inform the medical community, the public, and decision makers about a wide variety of issues that characterize the current state of care for injured persons.
Surveillance, Prevention, and Management of Diabetes Mellitus DataLink (SUPREME DM)

To provide an accurate evaluation of population health, national surveys of the most common chronic conditions are generally conducted to provide prevalence estimates. National surveys are one of the few types of data collected specifically for research purposes, which makes them more widely accessible.

Medicare Current Beneficiary Survey: The Medicare Current Beneficiary Survey (MCBS) is a continuous, multipurpose survey of a nationally representative sample of the Medicare population. The central goals of MCBS are to determine expenditures and sources of payment for all services used by Medicare beneficiaries.
National Health & Nutrition Examination Survey (NHANES): The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations.
National Medical Expenditure Survey: The Medical Expenditure Panel Survey (MEPS) is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS is the most complete source of data on the cost and use of health care and health insurance coverage.
National Center for Health Statistics: A rich source of health data and statistics on a variety of topics.
CMS Data Navigator: Centers for Medicare & Medicaid Services - Research, Statistics, Data & Systems.
National Health and Aging Trends Study (NHATS): NHATS is a study of Medicare beneficiaries aged 65 years and older. The study is being conducted by the Johns Hopkins University Bloomberg School of Public Health, with data collection by Westat, and support from the National Institute on Aging. NHATS is intended to foster research that will guide efforts to reduce disability, maximize health and independent functioning, and enhance quality of life at older ages.
ClinicalTrials.gov: Registry and results database hosted by the NIH. Information on publicly and privately supported clinical studies from around the world.
Cochrane Library: The trials database, CENTRAL, is a component of the Cochrane Library. Reports of randomized and quasi-randomized clinical trials taken from Medline, Embase, and elsewhere.
WHO International Clinical Trials Registry Platform (ICTRP): Clinical trial registration data from over 15 trial registries, including registries from the European Union, Africa, China, Japan, Brazil, and Australia. Use "standard search" to look for NCT or ISRCTN numbers cited in articles.
European Union Clinical Trials Database: Protocol and results information on interventional clinical trials conducted in the EU. Good source of pediatric drug development trials.
CenterWatch: Portal for actively recruiting pharmaceutical industry-sponsored clinical trials.

Clinical research data may be available through national or discipline-specific organizations. Access is likely to be restricted, but the data can be requested through the proper channels.

Proprietary research data may also be available through individual agreements with private companies.

Biologic Specimen and Data Repository Information Coordinating Center (NHLBI): Listing of studies with resources available for searching and request via BioLINCC.
Biomedical Translational Research Information System (BTRIS): Research data available to the NIH intramural community only.
Clinical Data Study Request: Clinical trials data. Partners include pharmaceutical companies.
NIMH Clinical Trials - Limited Access Datasets: Requirements for access are at the bottom of the page.
YODA (Yale Open Data Access): Access to participant-level clinical research data and/or comprehensive reports of clinical research. Partners include Medtronic and Johnson & Johnson.
Last Updated: Sep 3, 2024 1:00 AM
URL: https://guides.lib.uw.edu/hsl/data

Recommended Repositories

All data, software and code underlying reported findings should be deposited in appropriate public repositories, unless already provided as part of the article. Repositories may be either subject-specific repositories that accept specific types of structured data and/or software, or cross-disciplinary generalist repositories that accept multiple data and/or software types.

If field-specific standards for data or software deposition exist, PLOS requires authors to comply with these standards. Authors should select repositories appropriate to their field of study (for example, ArrayExpress or GEO for microarray data; GenBank, EMBL, or DDBJ for gene sequences). PLOS has identified a set of established repositories, listed below, that are recognized and trusted within their respective communities. PLOS does not dictate repository selection for the data availability policy.

For further information on environmental and biomedical science repositories and field standards, we suggest utilizing FAIRsharing. Additionally, the Registry of Research Data Repositories (Re3Data) is a full-scale resource of registered data repositories across subject areas. Both FAIRsharing and Re3Data provide information on an array of criteria to help researchers identify the repositories most suitable for their needs (e.g., licensing, certificates and standards, policy, etc.).

If no specialized community-endorsed public repository exists, institutional repositories that use open licenses permitting free and unrestricted use or public domain, and that adhere to best practices pertaining to responsible sharing, sustainable digital preservation, proper citation, and openness are also suitable for deposition.

If authors use repositories with stated licensing policies, the policies should not be more restrictive than the Creative Commons Attribution (CC BY) license.

Cross-disciplinary repositories

Dryad Digital Repository
Harvard Dataverse Network
Network Data Exchange (NDEx)
Open Science Framework
Swedish National Data Service

Repositories by type

Biochemistry

*Data entered in the STRENDA DB submission form are automatically checked for compliance and receive a fact sheet PDF with warnings for any missing information.

Biomedical Sciences

Marine Sciences

SEA scieNtific Open data Edition (SEANOE)

Model Organisms

Neuroscience

Functional Connectomes Project International Neuroimaging Data-Sharing Initiative (FCP/INDI)
German Neuroinformatics Node/G-Node (GIN)
NeuroMorpho.org

Physical Sciences

Social Sciences

Inter-university Consortium for Political and Social Research (ICPSR)
Qualitative Data Repository
UK Data Service

Structural Databases

Taxonomic & Species Diversity

Unstructured and/or Large Data

PLOS would like to thank the Open Access Nature Publishing Group journal, Scientific Data, for their own list of recommended repositories.

Repository Criteria

The list of repositories above is not exhaustive, and PLOS encourages the use of any repository that meets the following criteria:

Dataset submissions should be open to all researchers whose research fits the scientific scope of the repository. PLOS’ list does not include repositories that place geographical or affiliation restrictions on submission of datasets.

Repositories must assign a stable persistent identifier (PID) for each dataset at publication, such as a digital object identifier (DOI) or an accession number.

Repositories must provide the option for data to be available under CC0 or CC BY licenses (or equivalents that are no less restrictive). Specifically, there must be no restrictions on derivative works or commercial use.
Repositories should make datasets available to any interested readers at no cost, and with no registration requirements that unnecessarily restrict access to data. PLOS will not recommend repositories that charge readers access fees or subscription fees.
Repositories must have a long-term data management plan (including funding) to ensure that datasets are maintained for the foreseeable future.
Repositories should demonstrate acceptance and usage within the relevant research community, for example, via use of the repository for data deposition for multiple published articles.
Repositories should have an entry in FAIRsharing.org so that they can be linked to the PLOS entry.
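The criteria above can be treated as a simple checklist when comparing candidate repositories. The following is a minimal sketch only: the `RepositoryProfile` class, its field names, and the `unmet_criteria` helper are hypothetical names chosen for illustration, and the yes/no answers would have to be filled in by hand from each repository's own documentation.

```python
from dataclasses import dataclass


@dataclass
class RepositoryProfile:
    """Hypothetical, hand-filled description of one candidate repository."""
    name: str
    open_to_all_in_scope: bool    # no geographical or affiliation limits on submission
    assigns_persistent_id: bool   # DOI or accession number assigned at publication
    offers_cc0_or_cc_by: bool     # no limits on derivative works or commercial use
    free_reader_access: bool      # no access or subscription fees for readers
    has_long_term_plan: bool      # long-term data management plan, including funding
    used_by_community: bool       # e.g. used for deposition by multiple published articles
    listed_in_fairsharing: bool   # has an entry in FAIRsharing.org


# Field name -> short label, in the order the criteria are listed above.
CRITERIA = {
    "open_to_all_in_scope": "open submission",
    "assigns_persistent_id": "persistent identifier",
    "offers_cc0_or_cc_by": "CC0 or CC BY option",
    "free_reader_access": "free reader access",
    "has_long_term_plan": "long-term plan",
    "used_by_community": "community usage",
    "listed_in_fairsharing": "FAIRsharing entry",
}


def unmet_criteria(repo: RepositoryProfile) -> list[str]:
    """Return the labels of every criterion the repository does not satisfy."""
    return [label for field, label in CRITERIA.items() if not getattr(repo, field)]


# Made-up example: a repository that satisfies everything except the FAIRsharing entry.
example = RepositoryProfile(
    name="Example Institutional Repository",
    open_to_all_in_scope=True,
    assigns_persistent_id=True,
    offers_cc0_or_cc_by=True,
    free_reader_access=True,
    has_long_term_plan=True,
    used_by_community=True,
    listed_in_fairsharing=False,
)
print(unmet_criteria(example))
```

An empty list means every criterion was marked as satisfied; it says nothing about whether the answers themselves are accurate.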

Please note, the list of recommended repositories is not actively maintained. Please use the resources at the top of the page and the criteria above to help select an appropriate repository.

Enabling HIPAA-Compliant Clinical Research at Stanford

Enabling Data Driven Clinical Research

STARR Tools

The STAnford Research Repository, or STARR, is Stanford Medicine's approved resource for working with clinical data for research purposes. The STARR IRB permits the collection and aggregation of all data generated at Stanford for clinical care purposes, and articulates the formal approval process each research project must follow in order to obtain and work with this data for research purposes.

STARR is the home of two web tools, one for Cohort Discovery, the other for Chart Review.

This step-by-step guide provides an overview of all available options for using the Cohort Discovery and Chart Review Tools. The most popular choice is self-provisioned chart review.

You are required to have both a fully sponsored SUNetID and Cardinal Key to access STARR Tools, as the login process requires you to authenticate to Google using your [email protected] identity.

Step 1: Cohort Discovery

Step 2: Compliance

Step 3: Chart Review

Step 1 - Cohort Discovery

All clinical data at Stanford Medicine, including EHR data from both hospitals as well as data from various clinical ancillary systems, is available for research through the auspices of the .

If this is your first time using clinical data for research, your first step should be to familiarize yourself with the . Its most powerful feature is the ability to search for text in clinical documents and reports, since so much clinical information is recorded in narrative rather than structured form.

The Cohort Discovery tool lets you count the approximate number of patients with the clinical characteristics of interest. If enough patients are found suitable for study, you can then save the list for subsequent online review of their charts.

For more information on the types of clinical data we have available both through Cohort Discovery and other tools and services, please refer to our clinical research data inventory.

Step 2 - Compliance

The Cohort Discovery Tool lets you see patient counts and some simple summary statistics, but most research projects then wish to delve deeper and either conduct online chart review or work with structured datasets extracted from the clinical research data repository.

In order to work with detailed clinical data for research purposes at Stanford Medicine, you must have either a valid IRB protocol or a letter of Non-Human Subjects (NHS) Determination from the Stanford IRB.

If you are not yet familiar with the Stanford IRB, you can read more about compliance here.

Step 3 - Chart Review & Data Download

Once you have an approved IRB protocol or a letter of Non-Human Subjects Research from the Stanford Research Compliance Office (RCO), you can use the Cohort Discovery Tool to provision a list of patients for review in the Chart Review Tool using this step-by-step guide.

The Chart Review tool has a built-in capability to export data in .csv file format.
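A .csv export can be inspected with standard tooling before any analysis. The sketch below assumes nothing about the real export layout: the column names `record_id` and `note_type` and the sample rows are invented for illustration, and `summarize_export` is a hypothetical helper, not part of the STARR tools.

```python
import csv
import io


def summarize_export(handle) -> dict:
    """Report the column names and row count of a CSV export."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return {"columns": reader.fieldnames or [], "row_count": len(rows)}


# Invented sample standing in for an exported file; with a real export,
# pass an open file handle instead, e.g. open(path, newline="").
sample = io.StringIO(
    "record_id,note_type\n"
    "1,Progress Note\n"
    "2,Radiology Report\n"
)
print(summarize_export(sample))
```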

If any information pertinent to your research is not available online, you can get in touch with the Research Informatics Center to request a custom data extraction.

Expediting the Translation of Research Results to Improve Human Health.

Featured News & Events

Explore the Areas in Which NIH Has Sharing Policies

Under NIH data sharing policies, investigators are encouraged to maximize the appropriate sharing of scientific data.

NIH expects data from large-scale genomic studies to be broadly and responsibly shared.

NIH expects that research tools developed with NIH funding be made accessible to other researchers.

NIH expects that unique model organisms be made available to the scientific community.

NIH expects clinical trials to be registered and summary results reported in ClinicalTrials.gov.

NIH expects that all peer-reviewed manuscripts be publicly available on PubMed Central.

Not sure where to start?

Accessing Data

NIH hosts some of the world’s largest biomedical data repositories. Learn what datasets are available, how to access them, and how to use them responsibly.

Resources Highlights

FAQs: 2023 Data Management & Sharing Policy

Find answers to frequently asked questions on the 2023 Data Management & Sharing Policy. Topics include budget, policy scope, compliance, and more.

Learning Resources

Couldn't attend a sharing-related webinar or workshop? Our Learning page has materials from past training events such as webinar recordings and slide decks.

NIH Institute and Center Data Sharing Policies

Many NIH Institutes, Centers, and Offices have their own sharing expectations. Browse our filterable table to see if your funding program has a policy that you may need to prepare for.

Policy Overview: Data Management and Sharing

Looking for a quick reference about the DMS Policy going into effect in January? Check out our Policy Overview page for a step-by-step walk through the policy expectations.

Writing a Data Management & Sharing Plan

Find guidance on developing a Data Management and Sharing Plan, including what elements to address and a link to our optional format page.

Informed Consent for Secondary Research with Data and Biospecimens

Need help developing informed consent documents for data sharing? See our new sample language and points to consider for informed consent.

Repositories for Sharing Scientific Data

Need help identifying the right repository for your data? Check out our filterable list of NIH-affiliated repositories.

News & Events

Latest News

NOT-OD-24-157: Implementation Update for Data Management and Access Practices Under the Genomic Data Sharing Policy
Recording & Resources Now Available for Dec. 14 FDP DMS Town Hall
Federal Demonstration Partnership (FDP) Data Management and Sharing (DMS): Updates and Planning for Phase 2

Latest Events

No upcoming events. See our News & Events page for a list of all events.

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions

Affiliations.

1 Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada.
2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada.
3 Research Institute, BC Children's Hospital, Vancouver, BC, Canada.
4 School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada.
5 Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada.
6 Department of Anesthesiology, Pharmacology and Therapeutics, University of British Columbia, Vancouver, BC, Canada.
PMID: 32852280
PMCID: PMC7484778
DOI: 10.2196/17687

Background: Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.

Objective: The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.

Methods: We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices.

Results: Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making).

Conclusions: IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.

Keywords: data aggregation; data analytics; data warehousing; database; health informatics; information storage and retrieval.

©Kristina K Gagalova, M Angelica Leon Elizalde, Elodie Portales-Casamar, Matthias Görges. Originally published in JMIR Formative Research (http://formative.jmir.org), 27.08.2020.

Conflict of interest statement

Conflicts of Interest: None declared.

Figure: Article selection process. The diagram shows the number of articles at each stage…

Figure: Architecture models identified from selected integrated data repositories (IDRs). Arrows indicate data output…

Figure: Common data types across IDRs. Columns show the main types of data collected…

Data Repository Guidance

Scientific Data mandates the release of datasets accompanying our Data Descriptors, but we do not ourselves host data. Instead, we ask authors to submit datasets to an appropriate public data repository. Data should be submitted to discipline-specific, community-recognized repositories where possible. Where a suitable discipline-specific resource does not exist, data should be submitted to a generalist repository.

Authors must deposit their data to a data repository as part of the manuscript submission process; manuscripts will not otherwise be sent for review. If data have not been deposited to a repository prior to manuscript submission, we offer a service to deposit them at figshare or Dryad during the submission process via our article submission platform. Data may also be deposited to these resources temporarily, if the main host repository does not support confidential peer review (see below).

Repositories need to meet our requirements for anonymous peer-review, data access, preservation, resource stability, licences and suitability for use by all researchers with the appropriate types of data:

Use open licences (CC0 and CC-BY, or their equivalents, are required in most cases). Exceptions will only be permitted for human-derived data that is considered sensitive (e.g. risk of participant identification, controls on specific uses, etc.), where we suggest data are shared under Data Usage Agreements (DUAs). We do not typically support the use of more restrictive CC licences - containing SA, NC or ND clauses - for either sensitive or non-sensitive datasets, other than where applied to third party data that has been re-used and the original licence needs to be retained.
Allow public access to data without barriers, such as formal application processes, unless required for sensitive human datasets requiring controlled access and Data Usage Agreements. Note that basic login functionalities, where data are captured for analytics purposes only, are accepted for non-sensitive datasets as long as immediate access is granted to the holder of the email address without manual checks; however, we encourage login-free HTTPS access without registration in most cases.
All data need to be available for peer review. Where logins or other barriers are required or temporarily applied, routes for confidential peer review of submitted datasets need to be provided that do not reveal the identity of the reviewer to the data owner/author of the associated article. Please consult with the repository to arrange this, or provide the data in a temporary location for peer review.
Ensure long-term persistence and preservation of datasets in their published form. All Data Descriptors need to be associated with live data, so long term preservation and persistence is required to avoid future correction or other action to ensure the integrity of the paper.
Provide stable persistent identifiers for submitted datasets. DOIs are the default for most non-omics datasets described in the journal.
Subject specific repositories that are supported and recognized within their scientific community are strongly encouraged - general repositories should be used where no suitable subject repository is available, or the repository does not meet the requirements above.
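The licence requirement in the list above (CC0 or CC-BY accepted, SA, NC or ND clauses generally not) lends itself to a quick automated screen. This is a rough sketch under stated assumptions: `is_open_cc_licence` is a hypothetical helper, it only recognises Creative Commons style identifiers, and it does not attempt to judge the "equivalents" or the third-party-data exception that the guidance also allows.

```python
import re

# Clauses the guidance describes as more restrictive than CC0 / CC-BY.
RESTRICTIVE_CLAUSES = {"SA", "NC", "ND"}


def is_open_cc_licence(licence: str) -> bool:
    """Return True for CC0 or plain CC-BY identifiers, False otherwise."""
    # Split identifiers such as "CC BY-NC 4.0" or "CC-BY-4.0" into tokens.
    tokens = [t for t in re.split(r"[\s\-_]+", licence.upper()) if t]
    if not tokens or not tokens[0].startswith("CC"):
        return False  # not a Creative Commons identifier; needs manual review
    if tokens[0] == "CC0":
        return True
    return "BY" in tokens and not (RESTRICTIVE_CLAUSES & set(tokens))


for candidate in ["CC0", "CC-BY-4.0", "CC BY-NC 4.0", "CC-BY-SA"]:
    print(candidate, is_open_cc_licence(candidate))
```

A False result only flags a licence for manual review; it does not mean the journal would reject it.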

The list below is intended as a guide for those who are unsure where to deposit their data, and provides examples of repositories from a number of disciplines. Please note this list does not constitute a formal or exclusive list of repositories accepted by the journal and there are many more repositories that meet our criteria than we are able to track. The list is no longer updated (since 2021), but is retained as a useful list of suggestions.

Authors may also wish to use external resources such as DataCite’s Repository Finder and the FAIRsharing registry to find an appropriate repository for their data. Please note that certain data types (e.g. most omics and crystallographic data) are subject to mandates on which repository should be used. Please see our policy on mandated data types for further information.

View data repositories

Biological sciences: Nucleic acid sequence ; Protein sequence ; Molecular & supramolecular structure ; Neuroscience ; Omics ; Taxonomy & species diversity ; Mathematical & modelling resources ; Cytometry and Immunology ; Imaging ; Organism-focused resources
Health sciences
Chemistry and Chemical biology
Earth, Environmental and Space sciences: Broad scope Earth & environmental sciences; Astronomy & planetary sciences; Biogeochemistry and Geochemistry; Climate sciences; Ecology; Geomagnetism & Palaeomagnetism; Ocean sciences; Solid Earth sciences
Materials science
Social sciences
Generalist repositories

Biological sciences

Nucleic acid sequence

Novel DNA sequence, novel RNA sequence, and novel genome assembly data must be deposited to repositories that are part of the International Nucleotide Sequence Database Collaboration (INSDC) or to those which are working towards INSDC inclusion (as listed below), unless there are privacy or ethics restrictions that prevent open sharing of such data. These data may in addition be deposited to regional and national repositories as required. For human data that requires special controls, please see our recommended health sciences repositories.

Raw sequencing data (reads or traces)

Genome assemblies

Annotated sequences

Sample metadata

Browse data and metadata standards endorsed by the Genome Standards Consortium

Genetic variation data

(human variations less than 50bp)
(human variations greater than 50bp)
(human genotype & phenotype)
(all species)
(GSA-Human)

Protein sequence

Molecular & supramolecular structure

These repositories accept structural data for small molecules; peptides and proteins (all); and larger assemblies (EMDB).

Small molecule crystallographic data should be uploaded to Dryad or figshare before manuscript submission, and should include a .cif file and structure factors for each structure. Both the structure factors and the structural output must have been checked using the IUCr's checkCIF routine, and a copy of the output must be included at submission, together with a justification for any alerts reported.

Neuroscience

These data repositories all accept human-derived data (NeuroMorpho.org and G-Node also accept data from other organisms). Please note that human-subject data submitted to OpenfMRI must be de-identified.


(formerly OpenfMRI)

Functional genomics

Functional genomics is a broad experimental category, and Scientific Data's recommendations in this discipline likewise bridge disparate research disciplines. Data should be deposited following the relevant community requirements where possible.

Please refer to the MIAME standard for microarray data. Molecular interaction data should be deposited with a member of the International Molecular Exchange Consortium (IMEx), following the MIMIx recommendations.

For data linking genotyping and phenotyping information in human subjects, we strongly recommend submission to dbGaP, EGA or JGA, which have mechanisms in place to handle sensitive data.

Metabolomics & Proteomics

We ask authors to submit proteomics data to members of the ProteomeXchange consortium (listed below), following the MIAPE recommendations.

Taxonomy & species diversity

(formerly LTER Network Information System Data Portal)

Mathematical & modelling resources

Cytometry and Immunology

Organism-focused resources

These resources provide information specific to a particular organism or disease pathogen. They may accept phenotype information, sequences, genome annotations and gene expression patterns, among other types of data. Incorporating data into these resources can be very valuable for promoting reuse within these specific communities; however, where applicable, we ask that data records be submitted both to a community repository and to one suitable for the type of data (e.g. transcriptome profiling; please see above).

Health sciences

Some of the repositories in this section are suitable for datasets requiring restricted data access, which may be required for the preservation of study participant anonymity in clinical datasets. We suggest contacting repositories directly to determine those with data access controls best suited to the specific requirements of your study.





(formerly Virtual Skeleton Database)

Chemistry and Chemical biology

Earth, Environmental and Space sciences

Broad scope Earth & environmental sciences





(DOIs only assigned to deposited data on request)

Astronomy & planetary sciences

Biogeochemistry and Geochemistry

Climate sciences


(formerly LTER Network Information System Data Portal)

Geomagnetism & Palaeomagnetism

Ocean sciences


(DOIs only assigned to deposited data on request)

Solid Earth sciences

Materials science

Social sciences

Generalist repositories

Scientific Data encourages authors to archive data to one of the above data-type specific repositories where possible. Where a data-type specific repository is not available, the following generalist repositories might be suitable. Generalist repositories may also be appropriate for archiving associated analyses, or experimental-control data, supplementing the primary data in a discipline-specific repository.

The generalist repositories listed below are able to accept data from all researchers, regardless of location or funding source. If your institution has its own generalist data repository, this can be used to host your data as long as the repository is able to mint DataCite DOIs and allows data to be shared under open terms of use (for example the CC0 waiver). Please note that if your chosen repository is unable to support confidential peer-review, you will be asked to temporarily deposit a copy of the dataset to one of our integrated generalist repositories to facilitate review of your article. Upon completion of peer review, the temporary copy will be erased. To use a repository which does not appear in the manuscript submission system, select 'DataCite DOI' as the repository name during the submission process.

		's manuscript submission system
$120 USD for first 20 GB, and $50 USD for each additional 10 GB
100 GB free per manuscript.	1 TB per dataset	- To qualify for the 100 GB of free storage, data must be uploaded to figshare via our submission system.
for datasets over 1 TB	2.5 GB per file, 10 GB per dataset	No
	5 GB per file, multiple files can be uploaded	No
	50 GB per dataset	No
	8 GB per file, no limit to dataset size	No
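The tiered fee quoted in the table ($120 USD for the first 20 GB, then $50 USD for each additional 10 GB) can be worked through as simple arithmetic. The sketch below is illustrative only: the repository name for that row is not recoverable from the table as extracted, so the function is named generically, and it assumes that a partly used 10 GB block is charged as a full block, which the source does not state.

```python
import math


def deposit_fee_usd(size_gb: float) -> int:
    """Estimate the deposit fee for a dataset of the given size in GB.

    Assumption (not stated in the source): any partial 10 GB block beyond
    the first 20 GB is charged as a whole block.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    base_fee, base_gb = 120, 20      # $120 covers the first 20 GB
    extra_fee, extra_gb = 50, 10     # $50 per additional 10 GB
    if size_gb <= base_gb:
        return base_fee
    extra_blocks = math.ceil((size_gb - base_gb) / extra_gb)
    return base_fee + extra_fee * extra_blocks


for size in (5, 20, 21, 45):
    print(size, "GB ->", deposit_fee_usd(size), "USD")
```

Under these assumptions a 45 GB dataset needs three additional blocks, giving 120 + 3 × 50 = 270 USD.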

DATA SHARING RESOURCES

NIH-Supported Data Sharing Resources

To help researchers locate an appropriate repository for sharing or accessing data, BMIC maintains lists of data sharing repositories. Domain-specific repositories are typically limited to data of a certain type or related to a certain discipline. Generalist repositories accept data regardless of data type, format, content, or disciplinary focus.

Domain-Specific Repositories

Last Reviewed: January 27, 2024

JMIR Form Res
v.4(8); 2020 Aug

What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions

Kristina K Gagalova

1 Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada

2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada

3 Research Institute, BC Children’s Hospital, Vancouver, BC, Canada

M Angelica Leon Elizalde

4 School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada

Elodie Portales-Casamar

5 Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada

Matthias Görges

6 Department of Anesthesiology, Pharmacology and Therapeutics, University of British Columbia, Vancouver, BC, Canada

Associated Data

Supplementary methods and results.

Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.

The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.

We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices.

Results

Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making).

Conclusions

IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.

Introduction

An electronic health record (EHR) is a system for the input, processing, storage, and retrieval of digital health data. EHR systems have been increasingly adopted in the United States over the past 10 years [ 1 ], and their use is spreading worldwide in both hospital and outpatient care settings [ 2 , 3 ]. An EHR is typically organized in a patient-centric manner and has become a powerful tool to store data in a time-dependent and longitudinal structure. EHR data can also be integrated into an enterprise data warehouse or integrated data repository (IDR). IDRs collect heterogeneous data from multiple sources and present them to the user through a comprehensive view [ 4 ]. Unlike EHRs, IDRs offer specialized analytical tools for researchers or analysts to perform data analyses.

An IDR is a significant institutional investment in terms of both initial costs and maintenance, but it offers the advantage of clinical data reuse beyond direct clinical care, such as for research and quality improvement studies. Secondary use of clinical data is a rapidly growing field [ 5 , 6 ]; an increasing number of institutions have implemented in-house IDRs and several others are developing IDRs for future research endeavors.

Unlike clinical practice, which focuses on enhancing the well-being of current patients, the purpose of an IDR is to produce generalized knowledge that can be extended to future patients. Typical applications of IDRs include retrospective analysis and hypothesis generation [ 7 ]. Some IDRs also support clinical applications, such as clinical decision support systems (CDSSs), that work alongside clinical practice to estimate risk factors or predictive scores associated with clinical treatments. CDSSs help to avoid medical errors and deliver efficient and safer care by assisting the provider with diagnosis, therapy planning, and treatment evaluation decisions [ 8 ]. All these applications are valuable resources that have the potential to improve the quality of health care [ 9 ] and reduce health costs if implemented appropriately [ 10 ].

Our study is motivated by the need to develop a pediatric IDR at our institution and by the lack of literature providing practical recommendations to apply during the initial development stages. Reviews by Shin et al [ 11 ] and Huser et al [ 12 ] highlighted the recommended characteristics when designing an IDR; however, they include only a limited number of example IDRs. Since 2014, the IDR landscape has evolved rapidly, and thus, we felt that more recent developments needed to be better addressed as well. A 2018 review by Hamoud et al [ 13 ] provided a comprehensive description of the most recent data warehouses, including information about their data content, processing, and main purpose; it also provided general recommendations for the implementation of an IDR, but no practical considerations to guide the planning stages.

This study compares the features of contemporary IDRs and presents some guiding principles for the design and implementation of a clinical research data warehouse. Our research objective was to identify the major features of contemporary IDRs and obtain a list of established architectures used in the field of health informatics. We expect that this review will be useful for other small- to medium-sized institutions that plan to implement an institutional IDR and have no extensive experience in the field.

Methods

We conducted a literature review and a targeted web-based search to identify the major existing IDRs and synthesized the retrieved information around key themes.

Literature Review Search

We performed a narrative review following the procedure described below. First, a literature search was conducted using Ovid MEDLINE (Medical Literature Analysis and Retrieval System Online) and IEEE Xplore (Institute of Electrical and Electronics Engineers Xplore), queried in March 2020 ( Figure 1 ). Articles were identified in 2 iterative phases. The first phase used an initial list of keywords querying for infrastructure purposes (data integration, such as linkage and harmonization) as well as infrastructure type and hospital setting ( Multimedia Appendix 1 : A1). The second phase search used additional keywords identified from the titles and abstracts of articles retrieved in the first phase ( Multimedia Appendix 1 : A1). Second, Google Scholar was queried for major article keywords (Integrated Data Repository) OR (Clinical Data Warehouse), and the first 150 retrieved hits were screened. The query was executed in a single search stage because the traditional search methods using Ovid MEDLINE and IEEE Xplore already produced exhaustive results.

Figure 1. Article selection process. The diagram shows the number of articles at each stage of selection for each of the 3 databases: MEDLINE (Medical Literature Analysis and Retrieval System Online), IEEE Xplore (Institute of Electrical and Electronics Engineers Xplore), and Google Scholar.

We selected peer-reviewed articles, published in the English language between January 2008 and March 2020, to include the most current data warehouse features. Non-English articles were excluded because of a lack of resources for translation. We retained articles for which the full text was available and removed duplicates. KG read the abstracts, and the articles describing specific data integration strategies, describing architecture structures, or providing more information about the data models were included. When it was unclear whether an article should be included, the authors EPC and MG were consulted. Duplicated articles were removed using EndNote reference management software (Clarivate Analytics). Additional articles providing the most up-to-date information about selected IDRs or cited by the selected articles were included in the selection process because they were considered relevant for the IDR definition.

Targeted Web-Based Search of Known Institutional IDRs

We manually queried nonpublished resources with the goal of adding contemporary data warehousing practices implemented in large North American hospitals. A convenience sample of hospitals known to be leaders in these types of data warehousing was suggested by EPC and MG.

Additionally, we browsed publicly available information on each of the targeted institutional websites ( Multimedia Appendix 1 : A2). This was complemented with relevant peer-reviewed articles cited in these websites related to the design, implementation, and applications of such repositories.

Manual Shortlisting for a Comparative Review Analysis

For the comparative review analysis, we performed a manual selection to shortlist articles specifically describing IDR architectures. The shortlisting considered the major focus of the article and the presence of significant details describing data integration, data processing, or database services. The selected articles were searched for related IDR projects and further web-based resources ( Table 1 and Multimedia Appendix 1 : A3).

Table 1. Institutions and major features of the integrated data repositories.

IDR	IDR scope	Architecture model	Standard common data model	Standard terminologies	Primary references
Biomedical Translational Research Information System (BTRIS)	General care	General	N/A	RED	[ ]
Deceased subjects (dsIDR)	Deceased subjects	General	N/A	RED	[ ]
Healthcare Enterprise Repository for Ontological Narration (HERON)	General care	General	i2b2	ICD-9/ICD-10, CPT, RxNorm, SNOMED-CT, NDFRT, NCI, FDB	[ , ]
Stanford Translational Research Integrated Database Environment (STRIDE)	General care	General	i2b2, OMOP	ICD-9, CPT, RxNorm, SNOMED-CT	[ , ]
STAnford Research Repository (STARR)	General care	General	i2b2, OMOP	ICD-9, CPT, RxNorm, SNOMED-CT	[ ]
HEGP CDW platform	General care, cardiovascular, cancer	General	i2b2	ICD-10, LOINC, SNOMED-CT	[ - ]
Hanover Medical School Translational Research framework (HaMSTR)	General care	General	i2b2	ICD-10, LOINC	[ ]
Clinical data warehouse	General care	General	i2b2	LOINC, NCI	[ ]
Prostate cancer research database	Cancer	General	N/A	N/A	[ ]
Maternal and Infant Data Hub (MIDH)	Perinatal	General	OMOP	ICD-9/ICD-10, SNOMED-CT	[ ]
CAncer Research for PErsonalized Medicine (CARPEM)	Cancer	General	Variant of i2b2 (tranSMART)	ICD-9/ICD-10, SNOMED-CT, ATC, GO, HPO	[ ]
Health Science South Carolina (HSSC) clinical data warehouse	General care	General	i2b2	N/A	[ ]
Data Warehouse for Translational Research (DW4TR)	Cancer	General	N/A	MeSH, SNOMED-CT, NCI, caBIG VCDE	[ , ]
VA EHR (Veterans Administration’s electronic health records)	General care	General with CDSS	N/A	ICD-9	[ ]
Models and simulation techniques for discovering diabetes influence factors (MOSAIC)	Diabetes	General with CDSS	i2b2	ICD-9, DRG, ATC	[ ]
China Stroke Data Center (CSDC)	Cerebrovascular	General with CDSS	N/A	N/A	[ ]
Methodist Environment for Translational Enhancement and Outcomes Research (METEOR)	General care	General with CDSS	Extension of i2b2	ICD-9, CPT	[ ]
Mayo Enterprise Data Trust (MEDT)	General care	General	i2b2	LexGrid	[ ]
Ovarian cancer registry	Cancer	General	i2b2	LexGrid	[ ]
Translational Research Center (TRC)	Cancer	Biobank-driven	i2b2	LexGrid	[ ]
Synthetic Derivative	General care	General	N/A	FDB, ICD-9, CPT	[ ]
BioVU	General care	Biobank-driven	N/A	FDB, ICD-9, CPT	[ ]
Biorepository Portal (BRP)	Cancer, pediatric	Biobank-driven	Harvest	N/A	[ ]
BioBankWarden (BBW)	Cancer	Biobank-driven	N/A	ICD-10, SNOMED-CT, LOINC, GO	[ ]
onco-i2b2	Cancer	Biobank-driven	i2b2	SNOMED-CT	[ ]
CLB-IT	Cancer	User-controlled application layer	N/A	ADICAP, ICD-O	[ ]
Federated Utah Research and Translational Health electronic Repository (FURTHeR)	Several	Federated	i2b2, OMOP, OpenMRS	ICD-9/ICD-10, LOINC, SNOMED-CT, RxNorm	[ ]
OpenFurther	Several	Federated	i2b2, OMOP, OpenMRS	ICD-9/ICD-10, LOINC, SNOMED-CT, RxNorm	[ ]
Pediatric Health Information System (PHIS+)	Pediatric	Based on FURTHeR	i2b2	LOINC, SNOMED-CT	[ ]
@neurIST platform	Cerebrovascular	Federated as in FURTHeR	N/A	@neurIST ontology	[ ]
Research Data Management System (RDMS)	Cancer	Federated as in FURTHeR	i2b2	ICD-10, SNOMED-CT	[ ]

a IDR: Integrated Data Repository

b The IDRs are defined by their data scope, architecture model (as defined by the major design class represented in Figure 2 ), standard common data model, standard terminology, and primary reference.

c N/A: not applicable, n=4 in Standard Terminology.

d RED: Research Entities Dictionary, n=1.

e i2b2: Informatics for Integrating Biology and the Bedside

f ICD-9/ICD-10/ICD-O: International Classification of Diseases, version 9/10, O for oncology, n=14.

g CPT: Current Procedural Terminology, n=4.

h RxNorm: standardized nomenclature for clinical drugs, n=3.

i SNOMED-CT: Systematized Nomenclature of Medicine Clinical Terms, n=11.

j NDFRT: National Drug File Reference Terminology, n=1.

k NCI: National Cancer Institute, n=2.

l FDB: First Databank, n=2.

m OMOP: Observational Medical Outcomes Partnership

n HEGP CDW: Hôpital Européen Georges Pompidou Clinical Data Warehouse, n=1.

o LOINC: Logical Observation Identifiers Names and Codes, n=5.

p tranSMART: Open-source data platform for translational research, n=1.

q ATC: Anatomical Therapeutic Chemical Classification, n=2.

r GO: Gene Ontology, n=2.

s HPO: Human Phenotype Ontology, n=1.

t MeSH: Medical Subject Headings, n=1.

u caBIG VCDE: the cancer Biomedical Informatics Grid Vocabulary and Data Elements Workspace, n=1.

v DRG: Diagnosis Related Group, n=1.

w LexGrid: Lexical Grid, n=1.

x CLB-IT: Léon Bérard Cancer Center-IT.

y ADICAP: Association pour le Développement de l'Informatique en Cytologie et en Anatomie Pathologique, n=1.

z OpenMRS: Open Medical Record System, n=1.

aa @neurIST ontology, n=1.

Literature Synthesis and Institution Characterization

Information from the literature was aggregated through thematic analysis and collapsed into 4 classes of IDR architectures. We evaluated the main features of the identified IDRs, such as data processing components, data characteristics, common terminologies, and data models. Features were summarized, compared, and contrasted. We extracted information about host institutions and divided them into small (≤500 beds), medium (501-1000 beds), and large (>1000 beds) institutions based on the number of beds listed on the institutions’ websites.
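
The size classes can be written as a small function. This is only an illustrative sketch: the thresholds come from the text, the function name is hypothetical, and an institution with exactly 500 beds is assigned to the small class here, following the ≤500 bound.

```python
def classify_institution(beds: int) -> str:
    """Return the size class for a bed count, using the review's thresholds."""
    if beds < 0:
        raise ValueError("bed count cannot be negative")
    if beds <= 500:
        return "small"    # up to and including 500 beds
    if beds <= 1000:
        return "medium"   # more than 500, up to 1000 beds
    return "large"        # more than 1000 beds
```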

Analysis of Word Content

Selected articles were uploaded into NVivo 12 (QSR International LLC) for qualitative analysis, specifically to count the word frequency in the selected papers. Words with a minimum length of 5 characters in the full text were counted, excluding stop words, and grouped by synonyms. The word frequency is represented as a word cloud, generated with R (R Foundation for Statistical Computing) and the wordcloud package, version 2.6.
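
The counting rule described above (minimum word length of 5, stop words excluded) can be illustrated with a short Python sketch. It is not the NVivo procedure used in the study: the stop-word list is a tiny placeholder, the sample text is invented, and synonym grouping is left out.

```python
import re
from collections import Counter

# Placeholder stop-word list for illustration; NVivo's list is not reproduced here.
STOP_WORDS = {"about", "after", "their", "there", "these", "those", "which", "would"}

def word_frequencies(text: str, min_length: int = 5) -> Counter:
    """Count lowercase words of at least min_length letters, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length and w not in STOP_WORDS)

# Invented sample text.
sample = (
    "The system stores information about the study. "
    "The system design supports these goals."
)
counts = word_frequencies(sample)
print(counts.most_common(3))
```

Short words such as "the" are dropped by the length filter, and "about" and "these" are dropped as stop words, so "system" is the only word counted twice in the sample.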

Citations Analysis

The references of the articles describing IDRs were downloaded in a semiautomated manner using Content Extractor and Miner software [ 50 ] to parse the full-text PDF files. References to web resources, video-cast meetings, and software were removed, and partial references were manually corrected. The references were grouped by first author and year of publication and loaded in R (R Foundation for Statistical Computing) and plotted with UpSetR [ 51 ].
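
The grouping step can be sketched as follows, assuming each reference has already been reduced to a first author and a publication year. This is an illustration in Python rather than the R and UpSetR workflow used in the study; the function names, article names, and references are all made up.

```python
from collections import Counter

def reference_key(first_author: str, year: int) -> tuple[str, int]:
    """Normalize a reference to the (first author, year) key used for grouping."""
    return (first_author.strip().lower(), year)

def shared_reference_counts(articles: dict[str, list[tuple[str, int]]]) -> Counter:
    """Count, for each reference key, how many articles cite it."""
    counts: Counter = Counter()
    for refs in articles.values():
        # A set ensures a reference listed twice in one article is counted once.
        counts.update({reference_key(author, year) for author, year in refs})
    return counts

# Invented articles and references, for illustration only.
articles = {
    "article_A": [("Smith", 2010), ("Lee", 2012)],
    "article_B": [("smith ", 2010), ("Chen", 2015)],
    "article_C": [("Smith", 2010), ("Lee", 2012), ("Lee", 2012)],
}
counts = shared_reference_counts(articles)
```

A reference cited by many articles shows up with a high count, which is the kind of overlap an UpSet plot visualizes across sets.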

Results

A total of 241 articles were identified in the literature search [ 11 , 13 - 19 , 21 - 29 , 31 , 33 - 35 , 37 , 43 , 44 , 47 - 49 , 52 - 264 ]; the largest number of articles were identified in IEEE Xplore (n=112), followed by MEDLINE (n=95), and Google Scholar (n=71). After removing duplicates (n=24), we added 3 articles that were frequently cited in the selected articles but were missing from our search results [ 30 , 36 , 42 ]. Three articles [ 38 , 40 , 45 ] were further added that provided additional details relevant to the review topic. Finally, 1 article was replaced by a more updated publication [ 265 ]. These 247 articles were combined with the targeted web-based search [ 32 , 39 - 41 , 266 - 269 ]; hence, we identified a total of 255 articles ( Figure 1 ). The most frequent words in the articles were system, information, study, project, and design ( Multimedia Appendix 1 : A4.1). A total of 79 of these 255 articles were published between 2014 and 2016, and 34 were published in 2019; this date range covers the full range of initially identified articles in this domain area ( Multimedia Appendix 1 : A4.2 and A4.3).

A total of 116 articles were presented in proceedings of international scientific conferences, particularly those published in the book series Studies in Health Technology and Informatics (n=23); this included the World Congress of Medical and Health Informatics and Medical Informatics Europe. The second most frequent proceedings were the American Medical Informatics Association annual symposium and joint summits on translational science (n=12). The most frequently observed journals were the Journal of the American Medical Informatics Association (n=9) and BioMed Central (BMC; n=8), with BMC Bioinformatics being the most common. More details about the individual conferences and journals can be found in Multimedia Appendix 1 : A4.4 and A4.5.

For this review, we focused on the 34 articles describing 29 IDRs for which sufficient design details were presented. The additional web resources describing 2 IDRs, Stanford Translational Research Integrated Database Environment (STRIDE) and Federated Utah Research and Translational Health Electronic Repository (FURTHeR), referred to novel projects STAnford Research Repository (STARR) [ 20 ] and OpenFurther [ 46 ], respectively, which increased the number of IDRs to 31 from 25 different institutions or collaborative projects ( Table 1 ). In reviewing the references in these 34 articles, we observed only a small overlap, with 1 reference [ 270 ] being found in common in a maximum of 11 articles ( Multimedia Appendix 1 : A5.1). The most frequently cited among the 34 are onco-Informatics for Integrating Biology and the Bedside (i2b2) [ 43 , 271 ], STRIDE [ 18 ], and the Mayo Clinic [ 36 ] IDRs, cited in 8, 5, and 4 articles, respectively ( Multimedia Appendix 1 : A5.2).

IDRs represent a variety of applications of health data warehousing for research. Although they share common characteristics, as described in detail below, they also demonstrate the many different purposes they can serve. For example, BioVU [ 40 ] and the Synthetic Derivative [ 39 ] at Vanderbilt University Medical Center are examples of a biobank-driven database that automatically couples patients’ clinical information to biological samples (biosamples). The power of this system is its connection between genotype and phenotype and its large number of biosamples (>50,000), which allows a rich set of cohort research studies. The Maternal and Infant Data Hub (MIDH) at Cincinnati Children’s Hospital Medical Center [ 27 ] is a regional perinatal data repository that integrates a large and diverse set of data from different institutions. The strength of the project is the combination of delivery and postdischarge hospital data and the linked mother and child data sets. The pilot database contains approximately 70,000 newborns and 42,000 pediatric postnatal visits. Another example is the Hanover Medical School Translational Research Framework (HaMSTR) at the Hanover Peter L. Reichertz Institute [ 24 ], which was developed to automatically load data from a clinical data repository into a standard data model that researchers can query; it is a successful example of fast data upload and query using data structures designed from standard data models available for clinical research.

Characteristics of the Institutions in the Selected IDR Sample

We identified 2 types of IDRs: those developed for use in a single institution (n=19) and those implemented for a collaborative project (n=12). The latter typically integrate patient data and provide project-specific tools. The median number of different institutional partners in a collaborative IDR is 6, with one of the partners acting as an organizational hub. The partners range from research institutes, laboratories, and private institutions to university medical centers.

The IDRs were further divided by their scope ( Table 1 ), which were classified as general or specialized medical care (cancer, pediatrics, perinatal, cerebrovascular, or cardiovascular). Seven of the 10 IDRs containing specialized data were collaborative projects, likely indicating the need to pool data from several institutions when dealing with smaller but more focused patient populations.

Four Major Architecture Models Used in Our Selected IDR Sample

We identified 4 overarching conceptual architectures that summarize the data layers in the selected IDRs ( Figure 2 ). Different institutions can implement multiple architectures for different purposes; we assigned each IDR to a category considering the major features of the IDR, as described in their respective articles.

Figure 2. Architecture models identified from selected integrated data repositories (IDRs). Arrows indicate data output because of a query (blue) and data input (orange) because of data integration or update. Continuous lines show data query and integration applied by research users, whereas dashed lines are data queries performed by operational or clinical users.

The general architecture model is the most common model, with 19 identified IDRs structured around medical data mining ( Figure 2 , General architecture with optional CDSS ). In outline, different data marts are transferred to a staging layer that harmonizes the input to a common data view; data are loaded into a common data warehouse and queried through an application layer that communicates with the user; a CDSS tool can provide added functionality. Hence, in this architecture, each data source is originally stored in an independent data mart, collecting data from a separate research or clinical source within the same institution. Data are processed in the staging layer, which reshapes the input to an integrated view through several steps of data linkage, transformation, and harmonization. The next stage of processing is loading the data into a single database connected to an application layer that provides the tools for end users, typically researchers, to access and analyze the data securely with different services.

An example of an IDR providing multiple services is the STRIDE architecture stack [ 18 ], which includes several services for data analysis or research data management. The articles describing METEOR [ 35 ], CSDC [ 34 ], models and simulation techniques for discovering diabetes-related factors (MOSAIC) [ 33 ], and Veterans Administration’s EHRs [ 32 ] provide further details about the integration of CDSS tools in the architecture. In these cases, the architecture model is divided into CDSS and data analysis modules, both of which communicate with the common database. The CDSS allows clinical staff to retrieve real-time individual patient records and to use analytical models to make risk predictions. The CDSS tools described by METEOR and MOSAIC, for example, learn from the clinical data stored in the data warehouse and estimate risk factors predicting hospital readmission or long-term complications.
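
The data flow of the general model can be made concrete with a minimal Python sketch. It is not taken from any of the reviewed IDRs: the data mart names, field names, and classes are hypothetical, and the warehouse is an in-memory list standing in for a database. It shows two marts being reshaped into a common view in the staging layer, loaded into one warehouse, and then reached through a cohort-level research query and a patient-level CDSS-style lookup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    patient_id: str
    code: str     # diagnosis code in the common data view
    source: str   # data mart the record came from

# Hypothetical data marts that name the same concepts differently.
admissions_mart = [{"pid": "p1", "dx": "E11"}, {"pid": "p2", "dx": "I63"}]
clinic_mart = [{"patient": "p1", "diagnosis": "I10"}]

def stage(admission_rows, clinic_rows) -> list[Record]:
    """Staging layer: reshape each mart's rows into the common data view."""
    staged = [Record(r["pid"], r["dx"], "admissions_mart") for r in admission_rows]
    staged += [Record(r["patient"], r["diagnosis"], "clinic_mart") for r in clinic_rows]
    return staged

class Warehouse:
    """Common data warehouse with the two access paths of the application layer."""

    def __init__(self, records):
        self._records = list(records)

    def cohort(self, code: str) -> set[str]:
        """Research query: which patients carry a given code?"""
        return {r.patient_id for r in self._records if r.code == code}

    def patient_record(self, patient_id: str) -> list[Record]:
        """CDSS-style query: every record held for one patient."""
        return [r for r in self._records if r.patient_id == patient_id]

warehouse = Warehouse(stage(admissions_mart, clinic_mart))
```

Keeping both access paths on the same warehouse object mirrors the description of CDSS and data analysis modules communicating with a common database.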

The Health Science South Carolina (HSSC) [ 29 ] IDR gathers data from different clinical systems implemented in various institutions, all of which are party to a data collaboration agreement that authorizes data aggregation in a single data warehouse. This data warehouse contains a longitudinal record for each individual across all institutions. Data processing and terminology mapping occur in a conceptual staging layer, as in the case of the general architecture model.

In the case of the Erlangen University Hospital IDR [ 25 ], terminology is mapped using manually curated vocabularies that are applied through an automatic workflow, which processes the raw data into the final data warehouse format. Other IDRs that make use of multiple terminologies are the Healthcare Enterprise Repository for Ontological Narration (HERON) [ 16 ], CAncer Research for PErsonalized Medicine (CARPEM) [ 28 ], and STRIDE [ 18 ], but further details of their mapping processes were not available.

The biobank-driven architecture model is built around a particular application, in this case, biobanking ( Figure 2 , Biobank-driven architecture ). This model is similar to the general architecture model but, in this case, the IDR is built around the biosamples database. The biosample data integration occurs at the staging layer. The main feature is that the model allows the biosample operational user to access the raw and identified biobank data source for quality control and biosample management. An example of a biobank-driven structure is the biorepository portal (BRP) [ 41 , 266 ], which allows for the automatic integration of biosamples with clinical data, while maintaining unrestricted access to the biorepository for the operational team. The Mayo Clinic and Vanderbilt University adopt the general and biobank-driven architecture models in parallel.

The user-controlled application layer architecture model has neither a dedicated staging layer nor a central data warehouse ( Figure 2 , User-controlled application layer ). Instead, data are processed in 2 stages: the first stage preprocesses the original data sources to a common format, and the user query then carries out the final data integration for output delivery, so the data are queried dynamically rather than loaded in advance. An example is the text mining technology at the Léon Bérard Cancer Center (CLB) [ 44 ], which indexes text documents during the preprocessing stage and returns the exact documents that match the users’ queries.
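
The two-stage idea, indexing during preprocessing and matching at query time, can be illustrated with a toy inverted index in Python. This is a generic sketch and does not describe the CLB software; the function names and document texts are invented.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> set[str]:
    """Split text into a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Stage 1 (preprocessing): map each term to the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def query(index: dict[str, set[str]], terms: str) -> set[str]:
    """Stage 2 (query time): return documents that contain every query term."""
    wanted = tokenize(terms)
    if not wanted:
        return set()
    hits = [index.get(term, set()) for term in wanted]
    return set.intersection(*hits)

# Invented report snippets, for illustration only.
documents = {
    "doc1": "Biopsy shows invasive carcinoma",
    "doc2": "No carcinoma detected in biopsy margin",
    "doc3": "Routine follow-up visit",
}
index = build_index(documents)
```

Nothing is integrated until `query` runs, which is the point of this architecture: the index is the only artifact prepared ahead of time.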

The federated architecture is implemented for heterogeneous data retrieval and integration across multiple institutions ( Figure 2 , Federated architecture , adapted from OpenFurther). In this case, institutions selectively share their data through an adaptor system that applies common preprocessing, with data integrated on-the-fly in a virtual data warehouse. The FURTHeR federated query platform [ 45 ] builds a virtual IDR that responds to the needs of the user and calls several services for data resolution on-the-fly and upon query. The architecture model is flexible and operates using several services for data integration. An application of FURTHeR is the Pediatric Health Information System+ project [ 47 ], which combines data from 6 institutions. The IDR uses a federation component, which aggregates and stores translated query results in a temporary, in-memory database for presentation and analysis by the researcher for the duration of the user’s session. Federated data integration was also proposed using a research data management system (RDMS) [ 49 ], which integrates clinical and biosample data from several institutions in Germany. @neurIST [ 48 ] is a large IDR dedicated to translational research that includes data, computing resources, and tools for researchers and clinicians. Its data are located across different sites and are securely shared with a grid infrastructure that allows federated data access.
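
A federated query can be sketched in a few lines of Python. This is an illustration of the general pattern only, not the FURTHeR or OpenFurther implementation: the site names, field names, adaptor functions, and the `FederatedSession` class are all hypothetical. Each adaptor translates one site's local rows into a shared shape, and the session keeps the combined results in memory until it is closed.

```python
from typing import Callable, Iterable

Row = tuple[str, str, str]  # (site, patient_id, diagnosis_code)

def site_a_adaptor(code: str) -> list[Row]:
    """Adaptor for a hypothetical site A, which stores 'mrn' and 'icd' fields."""
    rows = [{"mrn": "a-1", "icd": "I63"}, {"mrn": "a-2", "icd": "E11"}]
    return [("site_a", r["mrn"], r["icd"]) for r in rows if r["icd"] == code]

def site_b_adaptor(code: str) -> list[Row]:
    """Adaptor for a hypothetical site B, which uses different field names."""
    rows = [{"id": "b-7", "diagnosis_code": "I63"}]
    return [("site_b", r["id"], r["diagnosis_code"])
            for r in rows if r["diagnosis_code"] == code]

class FederatedSession:
    """Holds translated query results in memory for one user session."""

    def __init__(self, adaptors: Iterable[Callable[[str], list[Row]]]):
        self._adaptors = list(adaptors)
        self.results: list[Row] = []

    def run(self, code: str) -> list[Row]:
        """Send the query to every site and aggregate the translated rows."""
        self.results = [row for adaptor in self._adaptors for row in adaptor(code)]
        return list(self.results)

    def close(self) -> None:
        """Discard the temporary results when the session ends."""
        self.results = []

session = FederatedSession([site_a_adaptor, site_b_adaptor])
stroke_rows = session.run("I63")
session.close()
```

No site's data are copied into a permanent store; only the rows matching the query are held, and only until `close` is called.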

The 4 types of architecture present different analytics tools, data presentation logic, and query interfaces based on the type of user they serve. Users can be classified into 2 major groups. The first group, such as researchers and operational or business analysts, uses the IDR to identify important clinical features that occur at the level of patient cohorts. The second group, such as physicians and other health care professionals, uses the IDR to make decisions at an individual patient level, for example, to plan specific therapeutic interventions or predict risk. The first group is served by all the architecture models ( Research user in Figure 2 ). The general architecture model that incorporates a CDSS presents a clear separation of both user groups, which have different applications for IDR data, with CDSS queries being made by clinical users ( Figure 2 , General architecture with optional CDSS ). Similarly, the biobank-driven architecture model includes operational users who can directly query the information regarding patient biosamples for clinical applications ( Figure 2 , Biobank-driven architecture ).

Data Retrieval and Update Are Influenced by the IDR Architecture Model

Both data update and integration schedules in an IDR are important features that define the timeliness of data. Here, we describe some of the key limiting steps and their occurrence in the different IDR architecture models.

Data Retrieval

The data processing involved in extraction, transformation, and loading (ETL) is described in detail in the articles on the Biomedical Translational Research Information System (BTRIS) [ 14 ], HaMSTR [ 24 ], Mayo Translational Research Center (TRC) [ 38 ], CARPEM [ 28 ], onco-i2b2, Vanderbilt’s Synthetic Derivative [ 39 ] and BioVU [ 40 ], and BRP [ 41 ]. These IDRs represent the general and biobank-driven architecture models, which implement a staging layer for the ETL process. A temporal sequence of the ETL steps is as follows:

Data extraction from source(s): The source data are extracted by an automatic (or manual) process.
Deidentification: Identifiable patient features, such as demographics or localization, are removed before loading into the IDR. The biobank-driven IDRs implement an automated process of this step without the need for extensive institutional reviews. In addition to the deidentified data, BTRIS [ 14 ] and Vanderbilt’s Synthetic Derivative [ 39 ] maintain a parallel database with original identifiable patient entries for research purposes where appropriate.
Assignment of unique identifiers: Deidentified data are assigned unique patient identifiers that are used as a reference for linking.
Data transformation and standardization: Data are first checked for possible errors or missing values and are then transformed into a common format that is standard for all cohorts. Data may be subjected to transformation, such as the derivation of new values from the existing ones (pseudonymization) for maintaining privacy.
Standard terminology and ontology mapping: Data types are labeled with standard terminologies.
Data linkage: If the data are derived from multiple sources, they are linked and combined in the IDR.
Loading into the data warehouse: This is performed by either an update of existing data or a complete data re-import into the data warehouse.
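
The sequence of steps above can be sketched as a small Python pipeline. It is a simplified illustration under stated assumptions, not the procedure of any reviewed IDR: the source rows, field names, terminology map, and key are invented, and the study identifier is derived from the record number with a keyed hash at the moment the direct identifiers are dropped, which merges steps 2 and 3 for brevity.

```python
import hashlib
import hmac

# Invented source rows (step 1, extraction, is assumed to have produced these).
SOURCE_A = [{"name": "Ann Lee", "mrn": "001", "dx": "diabetes type 2", "hba1c": "7.9"}]
SOURCE_B = [{"name": "Ann Lee", "mrn": "001", "drug": "metformin"}]

# Toy local-term-to-standard-code map (ICD-10 E11 is type 2 diabetes mellitus).
TERMINOLOGY = {"diabetes type 2": "E11"}

# Assumption: in practice the key would be stored and managed outside the code.
SECRET_KEY = b"demo-key-not-for-production"

def research_id(mrn: str) -> str:
    """Step 3: derive a stable study identifier that does not expose the MRN."""
    return hmac.new(SECRET_KEY, mrn.encode(), hashlib.sha256).hexdigest()[:12]

def deidentify(row: dict) -> dict:
    """Steps 2-3: drop direct identifiers and attach the study identifier."""
    clean = {k: v for k, v in row.items() if k not in {"name", "mrn"}}
    clean["research_id"] = research_id(row["mrn"])
    return clean

def standardize(row: dict) -> dict:
    """Steps 4-5: coerce types and map local terms to standard codes."""
    out = dict(row)
    if "hba1c" in out:
        out["hba1c"] = float(out["hba1c"])
    if "dx" in out:
        out["dx_code"] = TERMINOLOGY.get(out.pop("dx"))  # None if unmapped
    return out

def link(*sources) -> dict:
    """Step 6: merge rows from all sources that share a study identifier."""
    linked: dict = {}
    for rows in sources:
        for row in rows:
            linked.setdefault(row["research_id"], {}).update(row)
    return linked

def run_etl(warehouse: dict) -> None:
    """Run the steps in order; step 7 here updates existing entries in place."""
    staged = [[standardize(deidentify(r)) for r in src] for src in (SOURCE_A, SOURCE_B)]
    for rid, record in link(*staged).items():
        warehouse.setdefault(rid, {}).update(record)

warehouse: dict = {}
run_etl(warehouse)
```

The load step shown is the incremental variant (update of existing data); a complete re-import would instead clear the warehouse before loading.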

The CLB [ 44 ] IDR (user-controlled application layer architecture model) uses specialized software to manipulate the content from unstructured data without using an ETL process. IDRs representing the fourth architecture model (federated) do not provide additional information on the ETL process in their respective articles.

Data Update

Five of the selected articles provide additional information about the frequency of data updates in their IDRs. BTRIS [ 14 ] and Vanderbilt’s Synthetic Derivative [ 39 ] argue for daily IDR updates as new source data accumulate daily. Onco-i2b2 [ 43 ] synchronizes data more often, as frequently as every 15 minutes. A real-time data update is presented by METEOR [ 35 ] and MOSAIC [ 33 ], which also integrate a CDSS in their architecture model and thus need this frequency to make actionable decisions. MOSAIC presents an example with asynchronous data update; although the CDSS is updated in real time, the demographics are synchronized only every 6 months. The general architecture model combined with a CDSS may require real-time data updates, whereas the general or the biobank-driven architecture models, without a CDSS, may have periodic updates that vary widely in frequency.
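
Mixed update frequencies of this kind can be expressed as per-source synchronization intervals. The Python sketch below is hypothetical: the source names are invented, the intervals only loosely echo the frequencies reported above, and real-time feeds are not modeled.

```python
from datetime import datetime, timedelta

# Hypothetical sources with intervals loosely modeled on the reported frequencies.
SYNC_INTERVALS = {
    "clinical_events": timedelta(minutes=15),
    "lab_results": timedelta(days=1),
    "demographics": timedelta(days=182),  # roughly every 6 months
}

def sources_due(last_synced: dict[str, datetime], now: datetime) -> list[str]:
    """Return, in alphabetical order, the sources whose interval has elapsed."""
    return sorted(
        name
        for name, interval in SYNC_INTERVALS.items()
        if now - last_synced[name] >= interval
    )

# Invented timestamps for illustration.
now = datetime(2020, 3, 1, 12, 0)
last_synced = {
    "clinical_events": now - timedelta(minutes=20),
    "lab_results": now - timedelta(hours=3),
    "demographics": now - timedelta(days=200),
}
due = sources_due(last_synced, now)
```

A scheduler built this way lets slowly changing data, such as demographics, refresh far less often than event data without a separate pipeline for each source.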

Major IDR Features: Data Type, Standard Terminology, and Common Data Model

We have listed the data types in 19 of the selected IDRs based on information in the articles ( Figure 3 ). The most common types of data are those extracted from EHRs, which include patient demographics, diagnoses, procedures, laboratory tests, and medications.

Figure 3. Common data types across IDRs. Columns show the main types of data collected in the selected IDRs. Gray-filled cells denote feature presence, with colors classifying the IDRs based on the examined architectures. Only 19 IDR articles contained enough information to be included in this figure. BRP: biorepository portal; BTRIS: biomedical translational research information system; CARPEM: cancer research for personalized medicine; CLB-IT: Léon Bérard Cancer Center Information Technology; DW4TR: Data Warehouse for Translational Research; EHR: electronic health record; HEGP: Hôpital Européen Georges Pompidou; HERON: health care enterprise repository for ontological narration; HSSC: Health Science, South Carolina; IDRs: integrated data repositories; Mayo Clinic-TRC: Mayo Clinic – Translational Research Center; METEOR: Methodist Environment for Translational Enhancement and Outcome Research; MIDH: Maternal and Infant Data Hub; MOSAIC: models and simulation techniques for discovering diabetes-related factors; Onco-i2b2; PHIS+: Pediatric Health Information System+; STARR: STAnford Research Repository; VUMC-BioVU: Vanderbilt University Medical Center–BioVU; VUMC-SD: Vanderbilt University Medical Center–Synthetic Derivative.

Several IDRs incorporate data from biosamples and their omics characterization, especially those based on the biobank-driven architecture model such as TRC [ 38 ], BRP [ 41 , 266 ], and BioVU [ 40 ]. Other examples of omics-based IDRs are CARPEM [ 28 ], Data Warehouse for Translational Research (DW4TR) [ 30 ], and @neurIST [ 48 ], which are dedicated to specific domains of research, namely cancer and cerebral aneurysm research.

Several types of images are part of modern IDRs, such as radiographic images in BTRIS [ 14 ] and document images in Methodist Environment for Translational Enhancement and Outcome Research (METEOR) [ 35 ]. In addition, medical reports are integrated in the IDRs. Clinical documents can be processed using natural language processing (NLP) algorithms to extract clinical conditions, medication types, and other features from common hospital procedures, which increases their utility through transformation into structured data. NLP modules are integrated in CLB-IT [ 44 ], which is specifically built for text processing entries, as well as BTRIS [ 14 ], METEOR [ 35 ], and onco-i2b2 [ 43 ].

IDRs including CDSS include outcome data types, which are relevant for calculating risk factors or predictive values in clinical domains. External data can also be integrated into the IDRs, including genomics data from disease model organisms (BTRIS) [ 14 ], patients from external sources (BTRIS [ 14 ] and DW4TR [ 30 ]), or environmental indices and geolocation (MIDH) [ 27 ].

Standard Terminology

Health information technology uses controlled terminologies to condense the information to a set of codes that can be manipulated more easily and automatically in data processing. We observed the adoption of both common [ 272 , 273 ] and specialized terminologies (eg, Anatomical Therapeutic Chemical Classification [ 274 ], human phenotype ontology [ 275 ], Gene Ontology [ 276 ]). The most broadly used were International Classification of Diseases (ICD)-9 and 10 for the classification of diseases, systematized nomenclature of medicine-clinical terms (SNOMED-CT) for a variety of medical domains, Logical Observation Identifiers, Names, and Codes for laboratory observations, and current procedural terminology for common procedures ( Table 1 ). These terminologies were utilized within the EHR and further integrated into the IDRs.
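
To illustrate how condensing information into codes makes it easier to process automatically, the following sketch maps varied free-text entries onto (terminology, code) pairs and then counts them. It is a toy example under stated assumptions: the `LOOKUP` table and the `Coded` type are hypothetical, and the codes shown should be checked against the official terminology releases.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Coded(NamedTuple):
    system: str  # terminology name, e.g. "ICD-10"
    code: str    # code within that terminology


# Hypothetical lookup table; illustrative only.
LOOKUP = {
    "type 2 diabetes": Coded("ICD-10", "E11"),
    "t2dm": Coded("ICD-10", "E11"),
    "essential hypertension": Coded("ICD-10", "I10"),
    "high blood pressure": Coded("ICD-10", "I10"),
}


def encode(text: str) -> Optional[Coded]:
    """Return the coded form of a free-text entry, or None if it is unmapped."""
    return LOOKUP.get(text.strip().lower())


entries = ["Type 2 diabetes", "T2DM", "high blood pressure", "unknown condition"]
coded = [encode(e) for e in entries]

# Different spellings collapse onto the same code, so counting becomes trivial.
counts = Counter(c for c in coded if c is not None)
unmapped = [e for e, c in zip(entries, coded) if c is None]
```

Entries that cannot be mapped are kept in `unmapped` rather than silently dropped, so they can be reviewed by a curator.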

Common Data Model

A common data model (CDM) is a standard data schema that enables data interoperability and sharing. Contemporary data warehouses propose an analytical platform built around the CDM that provides all the software components to construct and manage the data in a CDM. A few different CDMs have been developed and adopted by the wider clinical research community, although some institutions still favor using a custom data schema tailored to their specific needs. In our study, a standard CDM was adopted by 18 of the 29 IDRs. The most frequently applied CDM, found in 16 instances, is Informatics for Integrating Biology and the Bedside (i2b2) [ 277 ]. METEOR [ 35 ] applies i2b2 with an expanded schema, and CARPEM [ 28 ] applies tranSMART [ 278 ], which is a framework layered on top of i2b2, dedicated to integrating omics data with EHR data. Another popular CDM that has been used more frequently in recent years is the Observational Medical Outcomes Partnership (OMOP) [ 279 ], adopted by 3 IDRs, namely MIDH [ 27 ], OpenFurther [ 46 ], and STARR [ 20 ]. OpenFurther uses OpenMRS [ 280 ], which is an open-source software and CDM that delivers health care in low- and middle-income countries. The BRP [ 41 ] is the only example using Harvest as their CDM.
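
The interoperability benefit of a CDM can be sketched as follows: once each site converts its custom schema into one shared record type, a query is written once and runs over the pooled data. This is a hedged illustration only; `ConditionRecord`, the two site formats, and all field names are hypothetical and do not reproduce the actual i2b2 or OMOP table layouts.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConditionRecord:
    """Hypothetical shared schema; not the actual i2b2 or OMOP layout."""
    person_id: str
    condition_code: str
    start_date: date


def from_site_a(row: dict) -> ConditionRecord:
    # Site A (hypothetical) stores ISO date strings and a field named "dx".
    return ConditionRecord(row["mrn"], row["dx"], date.fromisoformat(row["onset"]))


def from_site_b(row: tuple) -> ConditionRecord:
    # Site B (hypothetical) stores positional tuples with day, month, year integers.
    pid, code, day, month, year = row
    return ConditionRecord(pid, code, date(year, month, day))


def count_condition(records: list, code: str) -> int:
    """One query, written once against the common model."""
    return sum(1 for r in records if r.condition_code == code)


pooled = [
    from_site_a({"mrn": "A-1", "dx": "E11", "onset": "2019-03-02"}),
    from_site_b(("B-7", "E11", 15, 6, 2020)),
    from_site_b(("B-9", "I10", 1, 1, 2021)),
]
e11_count = count_condition(pooled, "E11")
```

The per-site converter functions are where a custom schema pays its cost; everything downstream of them is shared.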

Principal Findings

Our review identified several institutions of various sizes and scopes that utilize an IDR. These IDRs contain data used for both research and clinical decision-making purposes. Structured data derived from natural language processing of clinical notes, clinical imaging, and omics data are the most recent big data types to be integrated with standard clinical observations. Owing to the large heterogeneity, however, integration is complex and tailored to the specific needs during IDR implementation and maintenance, as ETL necessitates a significant effort in both the initial modeling and the ongoing updates.

As a novel contribution, we proposed and classified IDR architectures into 4 major models that highlight the processing and integration steps. The most common architecture model employs a staging layer implemented before the data are loaded into the data warehouse.

A set of common features is applied across most IDRs: IDRs commonly use standard terminologies such as ICD-9/10 and SNOMED-CT, which are often already part of the EHR data. Several IDRs use an open-source translational research framework to model their data, as described by Huser et al [ 12 ]. We observed extensive use of the i2b2 CDM and the emergent adoption of the OMOP CDM, which makes it possible to map additional domain-specific terminologies. Interestingly, the application of PCORnet, one of the most recently implemented CDMs, which borrows from several other CDMs and is organized around patient outcomes [ 261 ], was not discussed in the sample of IDRs reviewed.

To safeguard the data in the IDR, data security and privacy need to be ensured from the initial steps of development. Data security is an important factor in all architecture types, with a particular need in collaborative projects that share data across jurisdictions. For example, in the general architecture of HSSC [ 29 ], data need to be stored in physically and logically secure facilities, where data management is extended to all the parties involved, and data need to be transmitted between the participating institutions through private high-speed networks. In the case of federated data warehouses, such as @neurIST [ 48 ], there is tight control of data flow between different institutions and clinical and research domains, following policies aligned with recommendations from the Legal and Ethics Advisory Board. Privacy, referring to the protection of patients’ personal information, emerged as an important feature, especially in the biobank-driven architecture; here, identifiable patient information is deleted from both the biosamples and the patient clinical data. Developers at the Children’s Hospital of Philadelphia and the Children’s Brain Tumor Tissue Consortium created an electronic Honest Broker (eHB) and Biorepository Portal (BRP) [ 41 ], which protect patient privacy by removing all exposure of the research staff to patient identifiers and automating the deidentification process. Following a different privacy-preservation approach, Vanderbilt’s Synthetic Derivative database [ 39 ] alters the patient data by obfuscating the true entries while preserving their time dependence.
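
One general way to obfuscate entries while preserving their time dependence is to shift all of a patient's dates by the same patient-specific offset, so that the intervals between events are unchanged. The sketch below illustrates that idea only; it is an assumption made for illustration and not a description of how the Synthetic Derivative is actually implemented. The `SECRET` key, the `offset_days` helper, and the 1 to 365 day range are hypothetical.

```python
import hashlib
from datetime import date, timedelta
from typing import List

SECRET = b"example-only-secret"  # hypothetical key; never hard-code one in practice


def offset_days(patient_key: str, max_shift: int = 365) -> int:
    """Deterministic per-patient shift in [1, max_shift] days, derived from a keyed hash."""
    digest = hashlib.sha256(SECRET + patient_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % max_shift + 1


def shift_dates(patient_key: str, events: List[date]) -> List[date]:
    """Shift every event of one patient backward by the same offset."""
    delta = timedelta(days=offset_days(patient_key))
    return [d - delta for d in events]


visits = [date(2020, 1, 10), date(2020, 1, 24), date(2020, 3, 1)]
shifted = shift_dates("patient-42", visits)

# The gaps between consecutive visits are identical before and after shifting.
gaps_before = [b - a for a, b in zip(visits, visits[1:])]
gaps_after = [b - a for a, b in zip(shifted, shifted[1:])]
```

Because the offset is derived from the patient key, repeated loads shift the same patient consistently, which matters when data are updated incrementally.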

Guiding Principles

The implementation of an IDR is subject to several factors that must be considered before development. We identified 2 major factors: (1) the data stored in the IDR and (2) the scope of the IDR, which is used either exclusively for research purposes or in combination with clinical or operational purposes, as shown in the general and biobank-driven architecture models. Data types, heterogeneity, and volume greatly influence system load, update, and query of the database. The scope of the IDR determines its primary end users (researchers, clinical users, or operational users), who have different needs and thus need access to different sets of tools to extract, analyze, and visualize the data. All these features influence both data latency and data synchronization, which are major elements in the model architecture. Moreover, available funding plays an important role in architecture decisions, as do considerations for future expansion.

Among the set of selected IDRs, we observed a number of collaborative projects that work within specialized medical domains, such as cancer or pediatrics. Collaborative IDRs are likely to integrate their data to increase the number of patients, thus increasing the statistical power of their respective cohorts.

On the basis of our analysis, we highlight the following guiding principles for small- to medium-sized institutions planning to implement an IDR:

The general architecture model, with or without CDSS, is the most straightforward to implement; the data staging layer facilitates ETL and data processing before loading into the data warehouse.
Select a standard CDM already in use by other institutions; both i2b2 and OMOP provide server and client services in a single platform that gives the user all the necessary tools to set up a structured IDR.
Wherever possible, adopt standard terminologies; we listed the most common terminologies derived from the integration with EHR data ( Table 1 ). One promising approach is to apply common terminologies in the first phases of IDR development and add other, more specialized terminologies later as the project scope expands.
Finally, the data update requirements, the ETL process design, and the level of automation should be carefully considered, as these are the limiting stages in data integration and update.

Commercial electronic medical record platforms such as Epic, Cerner, Meditech, and Allscripts are dominant in large institutions. However, although some information about how to query the underlying databases and about application programming interfaces to communicate with these systems is available, little information on transforming such data into an IDR is available in the literature, most likely because of their proprietary nature. Most vendors also sell tools for analysts to query and make use of data from these clinical production systems; however, these tools are not IDRs themselves and are not targeted toward secondary use for research.

As for lessons learned in the field, Epstein et al [ 281 ] demonstrate the feasibility of transferring the development of a perioperative data warehouse (schemas and processes) built on top of Epic’s database from one institution to another.

Comparison With Prior Work

In their review, Hamoud et al [ 13 ] provided general requirements for building a successful clinical data warehouse, recommending a top-down approach to the initial stages of development. They recommended considering all the individual components of the final system to decrease integration obstacles when dealing with heterogeneous data sources.

Three major factors contributing to the success of IDRs were identified by Baghal [ 231 ] when developing their in-house IDR: (1) organizational, enhancing the collaboration between different departments and researchers; (2) behavioral, building new professional relationships through frequent meetings and communication between departments; and (3) technical improvements to deploy new self-service tools that empower researchers. Collectively, these factors increase the utility and adoption of IDRs in clinical research.

In addition, Rizi and Roudsari [ 282 ] report lessons and barriers from their development of a public health data warehouse that IDR developers might want to consider; specifically, they advise not underestimating technical challenges such as those related to extracting data from other systems, difficulties in modeling and mapping of data, and data security and privacy. Other considerations include leveraging the IDR to improve data quality at the source, implementing a data governance framework from the beginning, and ensuring that key organizational stakeholders endorse the project early and strongly [ 282 ].

Limitations

Our search was not intended to be a systematic search; therefore, we may have missed some articles. An example of missing articles is those describing raw and unstructured data repositories, also referred to as data lakes , as these did not appear in our search results although we know they exist. One of the data lakes was presented by Foran et al [ 207 ] as a file reservoir, integrated in the data warehouse schema. For researchers to access those data, it was necessary to use a feeder database before their upload to the final data warehouse.

Furthermore, we were able to report on the IDRs and IDR features described in the literature, possibly omitting smaller institutions that are not actively publishing in peer-reviewed journals. In an attempt to mitigate this issue, we searched the representative institutional websites to retrieve additional details about the IDR architectures. As shown in Multimedia Appendix 1 [ 283 - 288 ]: A2, several organizations provide further details about their architecture in GitHub repositories or institutional Wiki pages, which can be explored for additional information besides the published literature.

This review includes articles and web resources shortlisted according to aspects of the IDR architectures that were considered relevant. Providing exhaustive coverage of all aspects of IDR implementation, such as tools designed to interact with the IDR, is better left for a dedicated review. An example of such tools is the Green Button project, which provides critical help in treating patients [ 289 - 292 ]. Examples of CDM-based tools, built around an application, are the @neurIST platform [ 48 ], @neurLink, and @neurFuse application suites that consist of research-oriented modules dedicated to knowledge discovery and image processing. CDSS tools such as Green Button, @neurIST applications, or many other existing frameworks are essential in providing sophisticated analyses to support clinicians, but are beyond the scope of our review.

There is significant potential in the implementation of IDRs in health institutions, and their importance is evident from the growing number of projects developed in the past 10 years. Despite the common trends in IDR implementation observed in this study, there are also many variations. There are 2 major design factors, namely data heterogeneity and IDR scope, which need to be carefully considered before embarking on the IDR design and planning process.

Finally, we aim to apply the knowledge presented in this study for the implementation of a pediatric IDR at our institution. By sharing our experience of planning and designing our IDR with those joining the field or planning to implement an IDR for research purposes, we hope to contribute to future IDR endeavors.

Acknowledgments

The project was supported, in part, by an Evidence to Innovation (E2i) Research Theme seed grant through the BC Children's Hospital Research Institute. The authors wish to thank Colleen Pawliuk for her help with the literature search strategy development and execution and Nicholas West and Zoltan Bozoky for editorial assistance.

Abbreviations

BMC	BioMed Central
BRP	biorepository portal
BTRIS	biomedical translational research information system
CARPEM	CAncer Research for PErsonalized Medicine
CDM	common data model
CDSS	clinical decision support system
CLB	Léon Bérard Cancer Center
DW4TR	Data Warehouse for Translational Research
EHR	electronic health record
ETL	extraction, transformation, and loading
FURTHeR	Federated Utah Research and Translational Health Electronic Repository
HaMSTR	Hanover Medical School Translational Research Framework
HSSC	Health Science, South Carolina
ICD	International Classification of Diseases
IDR	integrated data repository
IEEE Xplore	Institute of Electrical and Electronics Engineers Xplore
i2b2	Informatics for Integrating Biology and the Bedside
MEDLINE	Medical Literature Analysis and Retrieval System Online
METEOR	Methodist Environment for Translational Enhancement and Outcome Research
MIDH	Maternal and Infant Data Hub
MOSAIC	models and simulation techniques for discovering diabetes-related factors
NLP	natural language processing
OMOP	Observational Medical Outcomes Partnership
SNOMED-CT	systematized nomenclature of medicine-clinical terms
STARR	STAnford Research Repository
STRIDE	Stanford Translational Research Integrated Database Environment
TRC	Translational Research Center

Multimedia Appendix 1

Conflicts of Interest: None declared.


Triple Billion progress

Healthier populations

Worldwide, the number of additional people expected to be enjoying better health and wellbeing is projected to be 1.5bn (1.2bn – 1.8bn) by 2025 compared to 2018.

Universal health coverage

Worldwide, the number of additional people expected to be covered by essential services and not experiencing financial hardship is projected to be 585m (526.1m – 639.5m) by 2025 compared to 2018.

Health emergencies protection

Worldwide, the number of additional people expected to be protected from health emergencies is projected to be 776.9m (647.4m – 912.5m) by 2025 compared to 2018.

The triple billion targets provide a unified approach to accelerating progress towards the achievement of the health-related Sustainable Development Goals.

Estimates are calculated with 90% uncertainty intervals (UI) and presented as the upper and lower bounds in the closed brackets after the mean estimate.

Data at WHO

Welcome to your gateway to all public health data. These databases and platforms provide access to understandable and timely data, transforming lives by making health and inequality data findable, browsable, and usable.

Data collection

Good data is essential to good decision-making. Access WHO standards and solutions to collect timely, reliable and actionable data.

WHO dashboards deliver critical insights and dynamic visualizations on priority health topics and empower users to explore trends, identify patterns, and make informed decisions to improve health.

Division of Data, Analytics and Delivery for Impact

DDI IN FOCUS 2024
Data principles
Data protection and sharing policies

Expert advisory groups and networks

Family of International Classifications Network
Reference Group on Health Statistics
Technical Advisory Group on COVID-19 Mortality Assessment
Technical Advisory Group (TAG) for Universal Health and Preparedness Review (UHPR)
Health Data Collaborative

Regional observatories

Africa Health Observatory
PLISA Health Information Platform for the Americas
Eastern Mediterranean Health Observatory
European Health Information Gateway
South East Asia Health Information Platform
Western Pacific Health Data Platform

Handling imbalanced medical datasets: review of a decade of research

Open access
Published: 02 September 2024
Volume 57 , article number 273 , ( 2024 )

Mabrouka Salmi 1 , 2 ,
Dalia Atif 3 ,
Diego Oliva 4 ,
Ajith Abraham 5 &
Sebastian Ventura 2

Machine learning and medical diagnostic studies often struggle with the issue of class imbalance in medical datasets, complicating accurate disease prediction and undermining diagnostic tools. Despite ongoing research efforts, specific characteristics of medical data frequently remain overlooked. This article comprehensively reviews advances in addressing imbalanced medical datasets over the past decade, offering a novel classification of approaches into preprocessing, learning levels, and combined techniques. We present a detailed evaluation of the medical datasets and metrics used, synthesizing the outcomes of previous research to reflect on the effectiveness of the methodologies despite methodological constraints. Our review identifies key research trends and offers speculative insights and research trajectories to enhance diagnostic performance. Additionally, we establish a consensus on best practices to mitigate persistent methodological issues, assisting the development of generalizable, reliable, and consistent results in medical diagnostics.

1 Introduction

The class imbalance issue remains one of the main challenges in data mining. When one class is underrepresented in a dataset while the other(s) prevail, the class distribution is uneven. In such data, the prevalent class is called the majority class, while the one containing the rare cases is called the minority class. The minority class is usually ignored by machine learning algorithms, which prioritize the majority class (Sun et al. 2009 ). Because of the imbalance in the dataset, conventional machine learning algorithms are biased towards the class primarily present in the data, while the rare cases are neglected. The main reason for this problem is how machine learning algorithms are constructed; they assume balanced datasets (Krawczyk 2016 ). Balance is often not achieved in real-world datasets, and ill-prepared machine learning algorithms cannot assist in detecting the rare cases of interest, which is an immense concern in research.

Medical diagnosis data are becoming of great use and interest with the progress in big data and medicine (Haixiang et al. 2017 ). Hence, they are used to improve medical care and to create medical diagnosis aid systems. Machine learning helps in designing medical diagnosis systems (Huda et al. 2016 ; Xiao et al. 2021 ; Woźniak et al. 2023 ); however, imbalanced medical data hinders the performance of machine learning algorithms and thus the performance of medical diagnosis systems. Medical diagnosis data can be represented in two classes: one of non-diseased individuals (healthy) and the other of diseased individuals (unhealthy). Accurately predicting unhealthy individuals (diseased patients) in time allows early access to medical treatment and saves patients’ lives, which cannot be achieved without appropriate handling of class imbalance in medical datasets.

Intensive research has been conducted in the literature to deal with the issue of class imbalance in general. Consequently, several methods of learning from imbalanced data have been proposed, and they are grouped mainly into two approaches: data-level and algorithmic-level. The latter modifies the learning algorithms to account for the minority class, and the former handles class imbalance by modifying the data distribution, whether through undersampling, which eliminates instances from the majority class, oversampling, which creates synthetic instances of the minority class, or a hybrid of both to reduce the imbalance. In addition, researchers propose several basic and advanced class imbalance handling methods that are generally applied to various domains.
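
The two data-level operations can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a method from the reviewed literature: the oversampler interpolates between two random minority points, which is only a SMOTE-like idea (SMOTE itself interpolates toward k-nearest neighbours), and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, majority, rng):
    """Drop random majority instances until both classes have the same size."""
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]


def interpolation_oversample(X, y, minority, rng):
    """Add synthetic minority points on the segment between two random minority points."""
    mino = [x for x, label in zip(X, y) if label == minority]
    need = len(y) - 2 * len(mino)  # majority count minus minority count (binary case)
    X_new, y_new = list(X), list(y)
    for _ in range(need):
        a, b = rng.sample(mino, 2)
        t = rng.random()
        X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority)
    return X_new, y_new


rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(10)]
y = [0] * 8 + [1] * 2  # 8 majority instances, 2 minority instances

X_under, y_under = random_undersample(X, y, majority=0, rng=rng)
X_over, y_over = interpolation_oversample(X, y, minority=1, rng=rng)
under_counts = Counter(y_under)
over_counts = Counter(y_over)
```

Undersampling discards information from the majority class, whereas oversampling adds points that were never observed; a hybrid applies a milder version of each.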

Many literature reviews have been carried out on class imbalance, whether focusing on class imbalance handling methods only (Galar et al. 2011 ; Abd Elrahman and Abraham 2013 ; Spelmen and Porkodi 2018 ; Ali et al. 2019 ), both methods and applications (Haixiang et al. 2017 ; Kumar et al. 2021 ), or methods for a specific application field (Patel et al. 2020 ). However, class imbalance in medical diagnosis is not well highlighted, and the specificities of imbalanced medical data are not considered. Such specificities pose a unique challenge for working with medical data and require specialized techniques and methodologies to ensure the validity and generalizability of the findings. Improving existing medical diagnosis systems and human well-being relies on medical diagnosis research. Hence, researchers and practitioners in healthcare in general, and in medical diagnosis in particular, need to be aware of these factors and keep abreast of recent advancements in the field to identify their research starting points. In this work, we analyze the literature on handling imbalanced medical datasets and formulate the following research questions to cover the knowledge gaps.

RQ1 How can we develop a comprehensive framework for categorizing and evaluating imbalanced learning techniques tailored specifically to the complexities of medical datasets?

RQ2 What emerging trends and future trajectories are envisaged for tackling imbalanced medical data?

RQ3 What methodological techniques and procedural recommendations can mitigate class imbalance in research studies, with a focus on enhancing the validity and reliability of results?

We aim to emphasize the research on the intersection of class imbalance in structured data and medical diagnosis through a well-designed research methodology. This paper comprehensively reviews the last decade’s research and clusters the reviewed literature on imbalanced medical datasets into three main approaches by building on the existing classification of class imbalance methods (Krawczyk 2016 ): the preprocessing level, which comprises data-level and feature-level methods; the learning level, which comprises algorithmic methods; and combined techniques, which hybridize the two. Related research is meticulously classified into subgroups within each approach to specifically present the state-of-the-art and facilitate detailed tracking of advancements and areas for continued development. This review systematically extracts and presents detailed statistics on the medical datasets and evaluation metrics employed in existing literature, delineating the most and least commonly used resources to offer insights into prevailing research methodologies. It synthesizes prior research outcomes concerning class imbalance in medical datasets and discusses observations from the contextual analysis. This innovative exploration offers speculative insights into methodological concerns and practical aspects, critically evaluating the high performance of specific methodologies across diverse medical datasets. Subsequently, we acknowledge the inevitable limitations of our study due to non-reproducible experimental outcomes and other significant constraints encountered in the analysis of imbalanced medical data. In addition to presenting original contributions, this review identifies research trends in imbalanced medical datasets and highlights promising directions for future research that could enhance medical diagnosis performance. It also establishes best practices in this field, aiming to mitigate prevalent issues and proposing a consensus among researchers to guide future studies.

The structure of the review paper is as follows: Sect. 2 introduces the problem of class imbalance in medical datasets. Section 3 details the search methodology and describes the findings regarding used medical datasets and evaluation metrics. Section 4 presents the data-level approach proposed for imbalanced medical datasets, Sect. 5 presents the proposed learning-level solutions, and Sect. 6 contains the combined techniques proposed in the literature. Section 7 synthesizes the outcomes of research works on several imbalanced medical datasets. Section 8 discusses reflections on the synthesis, highlighting speculative insights, whereas the value and limitation of the observatory synthesis are pointed out in Sect. 9 . Section 10 summarizes the research trends and future directions in imbalanced medical datasets research. In Sect. 11 , we highlight the best practices amongst researchers in imbalanced medical data. Section 12 concludes the paper.

2 The problem of class imbalance in medical data

With the advancement of technologies, medical data is increasingly stored in the form of electronic medical records, where the historical medical data of an individual is saved and shared with authorized users (Fujiwara et al. 2020 ). Medical information includes demographic data, clinical tests, X-ray images, MRI, fMRI, EEG, and other data types. Access to voluminous medical data, along with progress in the application of machine learning, has been helpful for medical care specialists and clinicians. The effectiveness of machine learning in multiple domains encourages the construction of medical diagnosis aid systems to automate medical diagnosis and to help with the scarcity of medical experts in specific domains and places and with the vast demand for diagnosis of specific diseases. Those diagnosis systems are trained on historical medical data about a particular disease to perform well on new, unseen medical data and predict the disease. Such systems are constructed through well-designed processes that depend on the disease and its data availability, with the help of experts’ knowledge. Nonetheless, class imbalance in medical data complicates the task of machine learning algorithms and diagnostic systems.

While unhealthy people are naturally fewer than healthy people, class imbalance exists whenever the classes are unequally distributed in the dataset used to train machine learning algorithms. There are numerous sources of imbalance in medical data; however, they can be grouped into four patterns:

Bias in data collection : resulting from the fact that certain groups, such as non-diabetics, are underrepresented in research because they are underdiagnosed.

The prevalence of rare classes : in this case, the imbalance is inherent to the disease because certain conditions occur in 1 per 100,000 in the population, making the positive class rare.

Longitudinal studies : medical studies investigated over time can result in an imbalance in the dataset due to the discharge of certain patients (lost to follow-up) or the change of class over time (such as the progression of one stage to another in the case of cancer).

Data privacy and ethics : the sensitive nature of certain diseases, such as HIV, can limit access to positive-class data, resulting in imbalanced datasets.

An imbalanced dataset is defined by a disproportionate distribution between classes, where the Imbalance Ratio (IR), calculated as $IR = N_{maj} / N_{min}$, indicates the extent of this disproportion. In this formula, $N_{maj}$ and $N_{min}$ represent the number of instances in the majority and minority classes, respectively. In binary datasets, the degree of imbalance is usually expressed as IR : 1; the further IR exceeds 1, the more severe the imbalance.
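As a minimal sketch of the imbalance ratio defined above, the following snippet counts the class labels and divides the largest class count by the smallest. The function name `imbalance_ratio` and the toy label list are illustrative assumptions, not part of the reviewed literature.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return IR = N_maj / N_min for a sequence of class labels."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute IR")
    n_maj = max(counts.values())  # size of the majority class
    n_min = min(counts.values())  # size of the minority class
    return n_maj / n_min


# Hypothetical toy labels: 950 non-diseased (0) and 50 diseased (1) patients.
labels = [0] * 950 + [1] * 50
ir = imbalance_ratio(labels)
print(f"IR = {ir:.0f} : 1")
```

For multi-class data this sketch compares only the largest and smallest classes, which is one possible convention among several.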

Many existing classifiers exhibit an inductive bias that favors the majority class when trained on imbalanced datasets, often at the expense of the minority class. This results in suboptimal performance in less-represented classes. For instance, in diagnoses such as cancer risk or Alzheimer’s disease, patients are typically outnumbered by healthy individuals. Unfortunately, conventional classifiers tend to prioritize high overall accuracy, potentially leading to the misclassification of at-risk patients as healthy. Such errors in classification can have grave consequences, including the inappropriate discharge of patients in need of critical care. Additionally, this predisposition can lead to unfair treatment and ethical dilemmas, as it systematically disadvantages those requiring the most medical attention, raising significant concerns about equity in healthcare diagnostics.

Class imbalance handling methods are designed for general purposes rather than for medical diagnosis data. Applying such methods without considering the context of the disease in question or the data at hand may lead to uninterpretable and inaccurate results (Han et al. 2019). For example, synthetic minority data are generated to balance the medical data so that machine learning can learn equally from both classes (diseased and non-diseased patients). However, synthetic data needs to conform to the characteristics of the original medical data. Besides, machine learning algorithms for medical diagnosis need to be evaluated adequately when the data is imbalanced. Misclassifying a diseased patient is more costly than misclassifying a non-diseased patient: the first can lead to dangerous consequences that may affect the patient’s life, whereas the second may only lead to further clinical investigation (Fotouhi et al. 2019). Therefore, the evaluation of machine learning models for medical diagnosis relies mainly on measuring their predictive power for minority cases (diseased patients) (Han et al. 2019). However, a well-performing medical diagnosis system is expected to provide the best compromise between predicting diseased and non-diseased patients and to avoid all kinds of misclassification costs.

On the other hand, synthetic data must adhere to the original medical data’s characteristics. Otherwise, the automatic application of generic methods such as SMOTE (Chawla et al. 2002) may introduce biases and patterns not present in the original data, as well as irrelevant or biologically impossible information. This can degrade overall model performance for several reasons: an inaccurate representation of rare-case characteristics leads to unreliable model predictions; creating synthetic data only in the neighborhood of rare cases causes overfitting and generalization problems; and feature representation worsens when a variable’s impact on the target is increased, decreased, or reversed. Researchers have thus worked over the last decade on solutions that avoid these drawbacks, such as creating synthetic instances that are more representative of the underlying distribution, reducing the risk of inducing noise, and ensuring better generalization. Misdiagnosis occurs because rare cases are difficult to learn, and the need to stay up to date with the latest advances motivates researchers to incorporate improvements in the field into their work to maximize the utility of available data. Our motivation in this paper is therefore to classify pertinent techniques into several strata, to provide a critical review of the relevant literature, and to synthesize the outcomes of research on reference class imbalance datasets based on several metrics. This enriches the classification by highlighting the advantages and disadvantages of each stratum, thus opening up new research directions in the field.

3 Research methodology and basic statistics

This section details the search methodology used for data collection and the statistics describing the extracted data from the reviewed literature. The proposed review process follows most of the common guidelines proposed by Kitchenham ( 2004 ) for performing systematic literature reviews in software engineering research.

3.1 Data collection

Our search methodology defined the bibliographic databases, search keywords, inclusion and exclusion criteria, and time range for our literature review. We selected Google Scholar and Scopus as the bibliographic databases from which to collect papers. The search keywords are shown in Fig. 1. The inclusion and exclusion criteria used for paper selection and the search methodology process are illustrated in Fig. 2.

We used advanced search in the Google Scholar and Scopus databases to find papers, restricting the keyword search to titles only, with a time range from 2013 to 14/01/2023. We used keywords that capture class imbalance as a topic, like “imbalanced”, and keywords that capture papers treating medical data in general, like “medical”, as depicted in Fig. 1. In addition, based on some trials and to ensure the relevance of the results, we eliminated some search terms, like “diagnosis”, due to their widespread co-occurrence with the search keywords. The preliminary results were 409 papers in Google Scholar and 222 in Scopus, which we added to our reference manager. A first cleaning of the collected dataset, removing duplicates and some unrelated results, left 249 papers. Afterward, we scanned the titles, and the abstracts if needed, to retain only the papers relevant to our review scope and selection criteria. This second scan yielded 165 papers pertinent to our review topic. A final full-text scan of the remaining results left 150 papers, among them twelve without access. The papers reviewed in this article therefore number 137. The diagram in Fig. 2 illustrates the proposed methodology for data collection.

Fig. 1 The search keywords used

Fig. 2 The search methodology for the literature review

3.2 Analyses of used datasets and classification-based metrics

3.2.1 Medical datasets

We extracted all the datasets used in the reviewed research articles and grouped them by availability: public or private. We found that 95 papers (69%) used publicly available medical datasets, and 44 papers used private ones. Some research articles used both public and private datasets, and three papers did not clearly specify the datasets they employed. We investigated the public datasets by extracting their usage frequency and partitioned them into two groups: reference class imbalance medical datasets, which are frequently used (see Fig. 4) and are listed in Table 1, and non-reference class imbalance medical datasets, which are used less often. Figure 3 illustrates the non-reference datasets based on their commonness in research.

Table 1 displays the main characteristics of the medical datasets as used in the reviewed research, including the dataset size, the number of features, and the number of instances in each of the minority and majority classes. All the medical datasets are originally binary except for the “New Thyroid Disease Dataset,” which consists of three classes. For deeper insights into the procedural and contextual specifics of each dataset, readers are advised to refer to the detailed discussions in the referenced data sources and the foundational studies. The degree of imbalance varies from one dataset to another, and its assessment varies across studies: a dataset considered highly imbalanced in one research work may be treated as moderately imbalanced in another, and a dataset whose class imbalance is slight compared to another may nevertheless be more challenging. Although this points to the lack of a universally accepted quantification of imbalance severity, as discussed later in Sect. 8, the datasets in Table 1 are consistently treated as imbalanced across the literature. While reviewing the reference medical datasets, we identified an underrepresentation of certain medical domains, such as psychiatry and psychology. This absence may be linked to data scarcity, as stated by Kumar et al. (2023), or to the nature of these fields, which are often explored through unstructured, text-based data (Awon et al. 2022) and thus fall outside the primary scope of our structured data analysis.

Fig. 3 Non-reference class imbalance medical datasets

Fig. 4 Reference class imbalance medical datasets

3.2.2 Evaluation metrics and statistical tests

The reviewed research papers selected different evaluation metrics to assess the performance of their proposed approaches. Several metrics and statistical tests have been used in medical diagnosis with imbalanced datasets. We extracted all the metrics and statistical tests used in the reviewed literature and present the findings in Fig. 5 and Table 2. They are split into two groups: the first contains frequently used metrics, each used at least eight times in the literature, and the second contains the metrics and statistical tests used infrequently, at most seven times each.

As seen in Fig. 5, nine classification-based metrics are primarily used: AUC-ROC, Recall (also known as sensitivity), Precision, Specificity, F1-score, G-Mean, FPR (false positive rate), Matthews Correlation Coefficient (MCC), and Accuracy. Recall is the most used metric: 62.8% of papers selected it to assess their proposed approaches. With imbalanced data, and especially in medical diagnosis, the focus is on correctly classifying the minority class, which makes sensitivity essential in class imbalance research. Accuracy, AUC-ROC, and F1-score are also used to evaluate medical diagnosis systems. Accuracy is used in 57% of the reviewed literature, even though it reflects the overall performance of models and hides the misclassification of minority examples. Research emphasizes the use of recall (sensitivity) to measure a model’s performance in identifying minority samples, which in our case are unhealthy or diseased patients. However, we found that accuracy remains the second most used metric after sensitivity and is the sole metric used to evaluate the imbalanced classification model in several studies (Sajana and Narasingarao 2018b; Mohd et al. 2019; Babar 2021; Lan et al. 2022). The area under the ROC curve, also known as AUC-ROC or AUC, is used in 50.4% of the explored literature. It indicates the ability of the proposed approach to discriminate between the minority class and the majority class: the higher the AUC, the better the model discriminates between the classes. We therefore notice the importance attributed to the AUC in research on medical diagnosis with imbalanced data, where some researchers rely on it alone in their experimental analysis (Çinaroğlu 2017; Hassan and Amiri 2019). Another commonly used metric is the F-value, used in 49% of the reviewed literature. The F-value informs about balanced classification; the higher its value, the better the trade-off between precision and recall. The F-value accounts for both the misclassified minority examples and the misclassified majority examples. As a result, it evaluates the model’s performance in classifying both classes in binary classification. The high frequency of use of the F-value indicates attention to both minority and majority classes, and hence to the general performance of proposed approaches in handling imbalanced medical datasets.
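To make the metrics discussed above concrete, the sketch below derives them from the four cells of a binary confusion matrix, taking the diseased (minority) class as positive. The formulas are the standard textbook definitions; the helper names and the confusion-matrix counts are hypothetical and chosen only to show how a high accuracy can coexist with a low recall.

```python
import math


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp, fn, fp, tn):
    """Standard binary metrics; the positive class is the minority (diseased) class."""
    recall = _ratio(tp, tp + fn)          # sensitivity
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "fpr": _ratio(fp, fp + tn),       # equals 1 - specificity
        "f1": _ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": _ratio(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Hypothetical outcome on 1000 patients, 50 of whom are diseased:
# the model finds only 10 of the 50 diseased patients.
m = confusion_metrics(tp=10, fn=40, fp=5, tn=945)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With these made-up counts the accuracy is 0.955 while the recall is only 0.2, which illustrates why accuracy alone can hide the misclassification of minority examples.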

Fig. 5 Frequently used metrics

Another cluster of moderately used metrics contains specificity, precision, and geometric mean (GM). Specificity was used in 34% of papers, while precision and geometric mean were used in 30.7% and 28% of the included literature, respectively. Precision focuses on minimizing falsely predicted minority examples, so it matters for the good performance of medical diagnosis models; however, it is not as important as recall. Focusing only on recall minimizes the type II error (false negatives) in classification. Hence, we avoid predicting a diseased patient as non-diseased, and an accurate diagnosis saves human lives by allowing patients to access treatment as early as possible. However, if the falsely predicted minority examples are ignored rather than minimized, non-diseased patients are diagnosed as diseased, a type I error (false positive) in classification, which may impose extra costs on all parties (such as medical care providers and patients). The different degree of attention given to each classification error may explain the difference in the use of recall and precision in medical diagnosis research. Specificity and G-mean are used frequently, but less so than metrics like recall and accuracy. In some research works (Naghavi et al. 2019; Liu et al. 2020; Ibrahim 2022), specificity, G-mean, or both are selected alongside recall to evaluate the proposed approaches. The G-mean of sensitivity and specificity shows the compromise between both metrics; when reported with sensitivity, it also reveals the specificity score. Besides, specificity quantifies the model’s ability to identify the majority class, so knowing the specificity alongside the sensitivity illustrates the balance between them. This overlap in information largely explains the relatively lower use of specificity and G-mean compared to recall. However, other research works give considerable attention to recall without referring to specificity or G-mean (Sun et al. 2021; Mienye and Sun 2021; Shi et al. 2022), thereby ignoring the balance between correctly identifying diseased patients and correctly identifying non-diseased patients.
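The remark that G-mean, reported together with sensitivity, informs about specificity follows directly from the usual definition of G-mean as the geometric mean of the two. The short derivation below assumes that standard binary definition; the numeric values are hypothetical.

```latex
\[
\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}
\quad\Longrightarrow\quad
\text{Specificity} = \frac{\text{G-mean}^{2}}{\text{Sensitivity}}
\]
\[
\text{Sensitivity} = 0.80,\quad \text{G-mean} = 0.60
\quad\Longrightarrow\quad
\text{Specificity} = \frac{0.36}{0.80} = 0.45
\]
```

In this made-up case a seemingly good sensitivity hides a specificity below 0.5, which is the kind of imbalance between the two that reporting recall alone cannot reveal.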

Moreover, the Matthews correlation coefficient (MCC), which can inform about classification performance even better than the F-value and accuracy (Xu et al. 2020), is less frequently used. The false positive rate (FPR) is also considered, although less than other metrics: 9% of papers used it. It refers to the non-diseased cases misdiagnosed as diseased. Thus, we notice that researchers are increasingly turning to other metrics to quantify the performance of their proposed disease diagnosis models, such as MCC instead of accuracy and the F-value, and FPR instead of sensitivity. Some research works used them simultaneously with standard metrics to better analyze the model’s effectiveness (Shilaskar et al. 2017; Sadrawi et al. 2018; Cheng and Wang 2020).

Table 2 groups the uncommonly used metrics and statistical tests with their frequency of usage in the reviewed research. Statistical tests like the Friedman test, the Wilcoxon paired signed-rank test, and the Holm test are used, if only occasionally, which means researchers are turning to tools other than evaluation metrics to compare proposed approaches with existing ones. The area under the precision-recall curve, also known as AUC-PR, is used only six times, although it is known as an appropriate metric for imbalanced classification (Huo et al. 2022; Albuquerque et al. 2022). A high AUC-PR means high precision and recall; therefore, it summarizes the model’s predictive power in minority and majority classes. Other evaluation metrics are used in only a few studies; the need for metrics adapted to the proposed models may explain the variety of metrics used and the changes in their interpretation.
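The contrast between AUC-ROC and AUC-PR on imbalanced data can be sketched with a small example. The code below estimates AUC-PR by average precision (one common estimator, used here as an assumption) and computes AUC-ROC as the probability that a positive is ranked above a negative. The function names, labels, and scores are hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / n_pos


def roc_auc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 2 diseased patients among 100, one of them ranked
# below ten healthy patients.
y_true = [1] + [0] * 10 + [1] + [0] * 88
scores = [0.9] + [0.5] * 10 + [0.4] + [0.1] * 88

print(f"AUC-ROC           = {roc_auc(y_true, scores):.3f}")
print(f"average precision = {average_precision(y_true, scores):.3f}")
```

On this toy data the AUC-ROC stays close to 0.95 while the average precision drops to about 0.58, because ten false alarms ahead of one diseased patient weigh heavily on precision when positives are rare.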

4 Pre-processing level

4.1 Feature level

The feature level entails all methods that operate on the feature space to treat class imbalance in the data. One such method is feature selection, a widely used preprocessing procedure in different machine learning tasks that employs various techniques to retain discriminating features. Another is feature extraction, which creates new features from the initial feature space to keep most of the information in a smaller set of features. Both methods are generally used to deal with high-dimensional data, where the selected or extracted features are expected to be informative and to facilitate the learning process and model generalization. Alternatively, feature weighting is used in the literature to improve recognition of the class of interest, which is usually rare in medical applications such as medical diagnosis and risk prediction. Recently, these methods have been proposed to handle imbalanced learning, whether as stand-alone approaches (Zhang and Chen 2019a; Li et al. 2022; Shakhgeldyan et al. 2020), discussed in this section, or combined with other class imbalance techniques (Wang et al. 2020; Tang et al. 2021; Lijun et al. 2018).

Feature-level methods are used to tackle class imbalance and reduce dimensionality (Zhang and Chen 2019a; El-Baz 2015; Sridevi and Murugan 2014; Li et al. 2022). Zhang and Chen (2019a) selected the optimal features of breast tumors using an improved Laplacian score (LS), which offered a better compromise between computational effort and classification performance and surpassed rough set-EKNN (El-Baz 2015) and feature selection-multiple layer perceptron (FSMLP) (Sridevi and Murugan 2014). Similarly, in Li et al. (2022), selecting interpretable features using functional principal component analysis on longitudinal data achieved more accurate data categorization and reduced computational complexity. Filters and wrappers have been used in disease and mortality prediction, respectively (Venkatanagendra and Ussenaiah 2019; Shakhgeldyan et al. 2020). Feature selection using filters improved the classification performance of feed-forward NN, SVM, XGBoost, Random Forest, and LDA in Venkatanagendra and Ussenaiah (2019). A four-stage feature selection based on filter and wrapper methods exceeded random forest and logistic regression in Shakhgeldyan et al. (2020). Feature weighting yielded promisingly high discrimination between majority and minority data (Polat 2018; Baniasadi et al. 2020). Polat used similarity and clustering that consider the class label to weight each attribute’s data points, making them more linearly separable and yielding better results than random subsampling (Polat 2018). Baniasadi et al. applied linear interpolation for missing value imputation together with sample weighting (Baniasadi et al. 2020). Feature-level methods are mostly proposed when the imbalanced data is high-dimensional. Feature weighting, somewhat unexpectedly, provides promising results, so its efficiency in dealing with class imbalance regardless of dimensionality deserves investigation. Table 3 briefly describes the feature-level methods.
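As an illustration of what a filter-type feature selection does, the sketch below ranks features by the Fisher score, the ratio of between-class to within-class variance, and keeps the top k. The Fisher score is chosen here only as a simple, well-known filter criterion; it is not the method of any of the studies cited above, and the function names and toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature for binary labels (0 = majority, 1 = minority)."""
    v0 = [v for v, y in zip(values, labels) if y == 0]
    v1 = [v for v, y in zip(values, labels) if y == 1]
    overall = mean(values)
    between = (len(v0) * (mean(v0) - overall) ** 2
               + len(v1) * (mean(v1) - overall) ** 2)
    within = len(v0) * pvariance(v0) + len(v1) * pvariance(v1)
    return between / within if within > 0 else float("inf")


def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring features and all scores."""
    n_features = len(rows[0])
    scores = [fisher_score([row[j] for row in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical data: feature 0 separates the classes, feature 1 is mostly noise.
rows = [(1.0, 5.0), (1.2, 3.0), (0.9, 4.0), (1.1, 6.0), (3.0, 4.5), (3.2, 5.5)]
labels = [0, 0, 0, 0, 1, 1]
top, scores = select_top_k(rows, labels, k=1)
print("selected feature indices:", top)
```

A filter like this scores each feature independently of any classifier, whereas a wrapper would evaluate feature subsets by training a model on them.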

4.2 Data level

The data-level approach deals with class imbalance by modifying the data distribution to balance the dataset through oversampling, undersampling, or a combination of both. Oversampling increases the number of minority samples (rare cases) in the dataset using different techniques, and undersampling decreases the number of majority samples. Hybrid methods combine oversampling and undersampling to obtain evenly distributed data. Researchers commonly use data-level methods to address class imbalance because they are simple to implement in the preprocessing phase, independently of the learning process. The versatility of resampling in imbalanced learning in general has been noted earlier (Abd Elrahman and Abraham 2013; Haixiang et al. 2017); this section reviews its application to medical data.
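The two basic data-level operations can be sketched in a few lines. The snippet below implements random oversampling (duplicating minority samples) and random undersampling (dropping majority samples) until both classes have the same size. The function names, the `minority` and `seed` parameters, and the toy data are assumptions made for illustration.

```python
import random


def _split_indices(y, minority):
    """Return (minority indices, majority indices) for a label list."""
    min_idx = [i for i, c in enumerate(y) if c == minority]
    maj_idx = [i for i, c in enumerate(y) if c != minority]
    return min_idx, maj_idx


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority samples until the classes are balanced."""
    rng = random.Random(seed)
    min_idx, maj_idx = _split_indices(y, minority)
    extra = [rng.choice(min_idx) for _ in range(len(maj_idx) - len(min_idx))]
    idx = maj_idx + min_idx + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority=1, seed=0):
    """Keep a random subset of majority samples of the same size as the minority class."""
    rng = random.Random(seed)
    min_idx, maj_idx = _split_indices(y, minority)
    kept = rng.sample(maj_idx, len(min_idx))
    idx = kept + min_idx
    return [X[i] for i in idx], [y[i] for i in idx]


# Hypothetical dataset: 16 majority and 4 minority samples with one feature each.
X = [[float(i)] for i in range(20)]
y = [0] * 16 + [1] * 4

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("oversampled :", y_over.count(0), "vs", y_over.count(1))
print("undersampled:", y_under.count(0), "vs", y_under.count(1))
```

In practice such resampling would be applied to the training split only, so that the test data keeps the original class distribution.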

4.2.1 Oversampling

Oversampling prevails in imbalanced medical data classification and is widely used as a reference when assessing proposed class imbalance methods. The studies reviewed below use oversampling on its own to combat the imbalance issue.

Random oversampling with random forests showed optimal performance in identifying the severity of the Hepatitis C virus (Orooji and Kermani 2021). However, randomly duplicating original minority samples may lead to overfitting, which motivates the use of more advanced techniques. The popular SMOTE (Synthetic Minority Oversampling Technique), created by Chawla et al. (2002), achieved superior performance with a KNN classifier (Hassan and Amiri 2019) but, with logistic regression, produced results similar to threshold adjustment based on the Youden index (YI) (Albuquerque et al. 2022). Recently, SMOTE-based oversampling has emphasized the data distribution of disease samples (Xu et al. 2021; Sun et al. 2021). Xu et al. used SMOTE based on filtered k-means clustering (KNSMOTE) to overcome noise generation, overlapping, and borderline issues, and it outperformed traditional and cluster-based oversampling (Xu et al. 2021). Sun et al. integrated a multi-dimensional Gaussian probability hypothesis test into the addition of SMOTE-synthesized samples to the original minority samples (MDGPH-SMOTE), achieving better classification accuracy and recall (Sun et al. 2021).
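The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under stated assumptions (Euclidean distance, uniform random choice of base sample and neighbour), not the reference implementation of Chawla et al. (2002) nor any of the variants cited above; the function name `smote_sketch` and its parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two existing minority samples, the method never leaves the region already covered by the minority class, which is also why it can amplify noise or overlap when those samples sit close to the majority class.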

SMOTE has been adapted to various data contexts and combined with machine learning algorithms (Mustafa et al. 2017; Wang et al. 2013; Mohd et al. 2019). Farther Distance-based SMOTE was used along with PCA to handle high-dimensional imbalanced biomedical data, showing superiority over correlation and information gain (Mustafa et al. 2017). In a different approach, Wang et al. built a Minimum Spanning Tree based on the KNN graph of the minority data and then had SMOTE synthesize samples along the paths between two randomly selected samples (Wang et al. 2013). In multi-class medical data, SMOTE with an MLP model attained the highest accuracy (Mohd et al. 2019). In Sajana and Narasingarao (2018a), the authors intentionally balanced the initial data with SMOTE and then split it for training and testing a Naive Bayes classifier. Researchers have also investigated the real class of the artificial minority instances created by SMOTE (Sug 2016; Naseriparsa et al. 2020). Sug checked the class of the synthetic data using an MLP and trained tree classifiers accordingly; however, the results revealed insignificant differences (Sug 2016). Generating synthetic samples within regions with a high density of minority samples reduced the class mixture and exceeded SMOTE variants (Naseriparsa et al. 2020).

Alternatively, diverse other oversampling-based methods yielded positive results. Oversampling based on causal relationships between features exceeded CCR (combined cleaning and resampling algorithm), k-means SMOTE, GAN (Generative Adversarial Networks), and CUSBoost (cluster-based undersampling with boosting) (Luo et al. 2021). Oversampling using an improved ant colony algorithm to diagnose outpatients of TCM (Traditional Chinese medicine) exceeded traditional ML methods like C4.5 as well as SMOTE (Bi and Ma 2021).

The decomposition of minority data has been extensively studied as a step prior to sampling (López et al. 2013; Napierala and Stefanowski 2012, 2016), yet no universal method has emerged. In Han et al. (2019), the authors applied different sampling strategies based on minority data selection using a self-adaptive algorithm and enhanced the recognition of the minority class. Very recent research investigated how well synthetic samples fit the minority data (Rodriguez-Almeida et al. 2022); unexpectedly, the experiments revealed that higher similarity between synthetic and real data did not necessarily improve classification performance. Deep learning-based data generation approaches for structured data are emerging (Xiao et al. 2021; Lan et al. 2022). While GAN and SMOTE greatly increased the classification accuracy in Lan et al. (2022), combining SMOTE variants with conditional tabular generative adversarial networks (CTGAN) yielded unstable results (Rodriguez-Almeida et al. 2022). In contrast, a Wasserstein generative adversarial network (WGAN) applied to gene expression data outperformed popular sampling methods (Xiao et al. 2021).

Oversampling is used relatively often on its own to treat class imbalance in disease prediction. Besides using existing oversampling techniques and combining or improving them, we see two recent lines of research. The first considers the data distribution and its specificities in medical diagnosis while sampling minority examples. The second adopts generative adversarial networks for structured medical data, a nascent research topic, and a hybridization of both lines has also been observed. However, both research topics remain underexplored and open for investigation. Table 4 briefly describes the proposed oversampling techniques with their key ideas.

4.2.2 Undersampling

Undersampling decreases the number of prevalent-class examples by removing noisy data or uninformative duplicates, through basic techniques like random undersampling or advanced techniques like clustering-based ones. Although undersampling is less used than oversampling, inventive undersampling methods have been proposed in medical diagnosis research.

Random undersampling with Random Forest delivered superior performance in Covid-19 mortality prediction (Iori et al. 2022), Hereditary Angioedema disease diagnosis (Dai and Hua 2016), and melanoma prediction (Richter and Khoshgoftaar 2018). K-means clustering was integrated into undersampling and boosted the prediction of diseased patients (Augustine and Jereesh 2022; Neocleous et al. 2016; Babar and Ade 2016). Augustine and Jereesh balanced the data using random undersampling at the level of the generated clusters (Augustine and Jereesh 2022), while Neocleous et al. (2016) used k-nearest neighbours after clustering. Similarly, the authors in Babar and Ade (2016) designed a Multi-Layer Perceptron-based undersampling technique (MLPUS) using k-means clustering that outperformed SMOTE: samples close to the cluster centroids were iteratively used to train the MLP, and only the samples with the highest SM (Stochastic measure) values were added to the training data, which retains hard-to-learn samples. More simply, clustering the majority class into subsets equal in size to the minority class and combining each with the minority class for training modestly improved the results in Li et al. (2018) and sometimes outperformed SMOTE in Rahman and Davis (2013). However, ensembling base classifiers built on balanced subsets exceeded the BalanceCascade and EasyEnsemble undersampling techniques (Parvin et al. 2013). Salman and Vomlel further weighted instances using mutual information at each cluster, and their trained Tree-Augmented Naive Bayes (TAN) surpassed TAN with SMOTE (Salman and Vomlel 2017). Recently, Ibrahim used Salp swarm optimization to efficiently determine the clusters’ centres, which sometimes exceeded cluster-based sampling techniques (Ibrahim 2022).

Adding high-quality majority samples to the minority class has been suggested in various forms (Zhang et al. 2020; Wang et al. 2020). After randomly selecting a subsample of the majority samples, only those with high entropy were retained, based on a Gaussian Naive Bayes estimator, which sped up the undersampling process (Zhang et al. 2020). The results in Wang et al. (2020), obtained using Imbalanced Self-Paced Learning (ISPL) with logistic regression, significantly outpaced SMOTE and random undersampling. The authors in Al-Shamaa et al. (2020) separated majority class instances and minority class instances based on the Hellinger distance, and the majority instances most similar to their neighbouring minority instances were added to the original minority class. Investigations showed that the method performed better than Tomek links, random undersampling, and edited nearest neighbours.

The data distribution has been integrated into undersampling in distinct ways (Vuttipittayamongkol and Elyan 2020b; Kamaladevi and Venkatraman 2021). Vuttipittayamongkol and Elyan (2020b) identified overlapped instances using a recursive neighbourhood search and then discarded the overlapping majority class instances, while in Kamaladevi and Venkatraman (2021), the authors imputed noise samples using the mean and relabeled borderline samples based on Tversky similarity Indexive regression. Investigations illustrated promising results and better performance than Tomek links, random undersampling, and the edited nearest neighbours technique. Jain and colleagues (Jain et al. 2017, 2020) applied genetic algorithms to improve the recognition rate of diseased patients while maintaining a high rate of correct predictions for healthy patients. Their undersampling approach based on evolutionary optimization reduced the majority class samples by maximizing the geometric mean, significantly improving classification performance. Table 5 summarizes the main ideas of the undersampling techniques proposed in the reviewed literature, along with other information.

4.2.3 Hybrid methods and comparative studies of resampling techniques

Hybrid techniques, which combine undersampling of the majority class with oversampling of the minority class, are uncommonly used to deal with imbalanced medical data. In addition, comparative studies have contrasted various sampling techniques for reducing class discrepancy.

Resampling boosted the accuracy of liver disease detection (Arbain and Balakrishnan 2019). Fahmi et al. applied random resampling after weighting samples by the inverse proportions of the class distribution, which achieved better performance than SMOTE (Fahmi et al. 2022). A hybridization of ROSE, applied to the majority and minority classes, with K-means to select boundary samples and an SVM classifier improved the prediction of all diseases in Zhang and Chen (2019b).

SMOTE is commonly combined with various undersampling techniques (Shi et al. 2022; Xu et al. 2020; Wosiak and Karbowiak 2017). SMOTE-ENN with logistic regression identified chronic kidney disease patients at risk of end-stage disease remarkably well and exceeded the Cox proportional hazard model (Shi et al. 2022). The authors in Xu et al. (2020) repeatedly adjusted the oversampling ratio of SMOTE according to the misclassification rate of an RF trained on a subset of the data and combined it with ENN. This hybrid method maximized the MCC (Matthews Correlation Coefficient) and showed statistically significant improvements over different data-level methods. The classification performance of the hybridisation of SMOTE with random undersampling fluctuated in Wosiak and Karbowiak (2017). However, SMOTE with Tomek Links showed superior performance (Zeng et al. 2016).
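The cleaning half of a SMOTE-ENN style hybrid can be sketched as follows: after oversampling, edited nearest neighbours (ENN) removes every sample whose label disagrees with the majority vote of its k nearest neighbours. The sketch assumes Euclidean distance and applies the rule to samples of all classes; it is a simplified illustration rather than the procedure of any cited study, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with the vote of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical data: two well-separated groups plus one minority-labelled
# point lying inside the majority group (e.g. a poorly placed synthetic sample).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
     [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1,
     1, 1, 1, 1]

X_clean, y_clean = edited_nearest_neighbours(X, y)
print("removed:", [x for x in X if x not in X_clean])
```

In a full hybrid pipeline this step would follow an oversampler, so that synthetic minority points landing in majority territory are filtered out before training.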

Few novel hybrid sampling methods have been designed for imbalanced medical data (Babar 2021; Vuttipittayamongkol and Elyan 2020a). In Babar and Ade (2016), the authors combined MLPUS with the Majority Weighted Minority Oversampling Technique (MWMOTE), which assigns selection weights to important and borderline minority samples and then synthesizes new samples using a clustering approach. Investigation showed that the combination performed better than MLPUS and MWMOTE separately. The authors in Vuttipittayamongkol and Elyan (2020a) eliminated majority instances based on their degree of overlap and oversampled minority instances in borderline regions using Borderline SMOTE; they attained high performance based on boosting.

Many studies compared sampling techniques in cancer diagnosis (Fotouhi et al. 2019), no-show case detection (Krishnan and Sangar 2021), stroke diagnosis (Alamsyah et al. 2021), pediatric acute-condition detection (Wilk et al. 2016), chronic kidney disease prediction (Yildirim 2017), heart disease prediction (Fernando et al. 2022), lymph node metastasis prediction in stage T1 lung adenocarcinoma (Lv et al. 2022), osteoporosis detection (Werner et al. 2016), prediction of the risk of chronic kidney disease in cardiovascular disease patients (Vinothini and Baghavathi Priya 2020), and multi-minority medical data (Shilaskar and Ghatol 2019); however, results varied depending on the data used and the experimental configurations.

Hybrid approaches for imbalanced medical data seem to have received less attention than other advances in sampling techniques. Moreover, comparisons of sampling techniques aim to select the best one, yet the outcome of a balancing technique can vary with many factors, including the medical data used. Table 6 summarizes the hybrid techniques.

5 Learning level

Modifications concerning the learning algorithms are grouped under this section and further classified into subgroups depending on the similarities in the used algorithm as described in the following.

5.1 Cost-sensitive learning

Cost-sensitive learning assigns specific costs to misclassifying minority and majority samples. The true misclassification costs are usually unknown; in practice, the cost matrix is commonly set inversely proportional to the class distribution in the original data, so that more attention is given to the minority class.
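As a small illustration of costs set inversely proportional to the class distribution, the sketch below derives per-class costs from label counts and uses them in a minimum-expected-cost decision. The weighting formula n_samples / (n_classes * class_count) is one common convention (the same heuristic that scikit-learn applies for `class_weight="balanced"`); the function names and the toy probabilities are hypothetical.

```python
from collections import Counter

def inverse_frequency_costs(labels):
    """Cost of misclassifying a sample of each class, set to
    n_samples / (n_classes * class_count), so rarer classes cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def min_cost_decision(prob_by_class, costs):
    """Pick the class that minimises expected misclassification cost.
    Predicting class c incurs costs[t] * P(t) for every true class t != c."""
    def expected_cost(c):
        return sum(costs[t] * p for t, p in prob_by_class.items() if t != c)
    return min(prob_by_class, key=expected_cost)

# Hypothetical cohort: 90 healthy and 10 sick patients.
labels = ["healthy"] * 90 + ["sick"] * 10
costs = inverse_frequency_costs(labels)

# Even at a predicted probability of only 0.3 for "sick", the cost-sensitive
# rule flags the patient, because a missed sick patient is weighted more
# heavily than a false alarm.
decision = min_cost_decision({"healthy": 0.7, "sick": 0.3}, costs)
```

With these counts, the cost of missing a sick patient is nine times that of a false alarm, which is what shifts the decision toward the minority class.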

Normally, cost-sensitive learning in medical data outperforms cost-insensitive learning (Wu et al. 2020; He et al. 2016; Phankokkruad 2020; Nguyen et al. 2020). A radial basis neural network (RBF-NN) with a cost function that adapts to the sample distribution (Wu et al. 2020) outperformed other forms of RBF-NN, RBF-based ensemble methods, and single classifiers. He et al. used cost-sensitive neural networks that incorporate the cost into gradient descent (He et al. 2016); their investigation showed minimal costs and significant accuracy. A cost-sensitive XGBoost model with tuned class weights effectively diagnosed breast cancer (Phankokkruad 2020). Likewise, cost-sensitive versions of the multilayer perceptron and convolutional neural networks performed best in detecting Inflammatory Bowel Disease (IBD) (Nguyen et al. 2020). However, some traditional ML algorithms yielded results comparable to purpose-built cost-sensitive models: the decision rules algorithm and the ensemble of cost-sensitive SVMs performed indistinguishably (Zięba 2014), while Decision Tree and Logistic regression achieved better accuracy than their corresponding cost-sensitive models (Mienye and Sun 2021).

Some research newly defined the cost matrix (Huo et al. 2022; Zhu et al. 2018; Belarouci et al. 2016; Wan et al. 2014). The authors in Belarouci et al. (2016) introduced a version of the least mean square algorithm that assigns weights to samples according to their errors, and their investigations showed its superiority over SMOTE in breast cancer detection. Recently, Huo et al. used neural networks and set the misclassification costs as learnable parameters, which yielded high performance in risk prediction for both binary and multi-class classification (Huo et al. 2022). A class-weighted random forest, with class weighting for each classifier and threshold voting, gave very optimistic results in Zhu et al. (2018), while assigning weights based on a scoring function (RankCost) in Wan et al. (2014) outperformed cost-sensitive decision trees and AdaBoost.

5.2 Optimization techniques

Recent methods applied genetic algorithms to handle imbalanced medical data (Jain et al. 2020; Nalluri et al. 2020). Jain et al. (2020) optimized specificity and sensitivity: two models were constructed using the NSGA II algorithm and combined for the prediction of minority and majority samples. The hybrid multiobjective evolutionary learning approach of Nalluri et al. (2020) outperformed other optimization methods.
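The two objectives mentioned here, sensitivity and specificity, usually trade off against each other, which is why multiobjective search keeps a set of non-dominated candidates instead of a single winner. The sketch below shows only that non-dominated (Pareto) filtering step; it is not NSGA II, and the decision thresholds merely stand in for the candidate models an evolutionary search would propose. All names and data are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary predictions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def pareto_front(candidates):
    """Keep candidates whose (sensitivity, specificity) pair is not dominated,
    i.e. no other candidate is at least as good on both objectives and
    different on at least one."""
    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return [name for name, score in candidates.items()
            if not any(dominates(other, score) for other in candidates.values())]

# Hypothetical scores from one classifier: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1]

candidates = {}
for threshold in (0.25, 0.45, 0.55, 0.65, 0.95):
    y_pred = [int(s >= threshold) for s in scores]
    candidates[threshold] = sensitivity_specificity(y_true, y_pred)

front = pareto_front(candidates)
```

Here the thresholds 0.45 and 0.65 are dominated by 0.55, which matches or beats them on both objectives, so only three candidates remain on the front.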

5.3 Simple classifier

This approach consists of using conventional machine learning algorithms to classify imbalanced medical data, possibly with preprocessing or postprocessing procedures that tackle the imbalance issue and boost classification performance.

Hyperparameter tuning of SVM models sometimes improved patient detection (Ksiaa et al. 2021), while it performed similarly to cost-sensitive learning in Alzheimer’s prediction (Zhang et al. 2022). Feature elimination based on a contrast classification strategy demonstrated superior results compared to decision trees and SVM (Dhanusha et al. 2022). Modifying the classifiers themselves also gave good results: Alves et al. developed a generalization of complementary log-log link functions for binary regression that fitted the data better than binomial models (Alves et al. 2023). In a different direction, Kumar and Thakur proposed a fuzzy learning approach hybridizing adaptive and neighbor-weighted KNN for liver disease detection that outperformed Fuzzy Adaptive KNN (Kumar and Thakur 2019).

5.4 Ensemble learning

This approach combines a set of single classifiers to perform classification tasks. There are three types of ensembles: bagging, boosting, and stacking. Bagging builds multiple single classifiers on different samples of the primary dataset and then combines their predictions with simple statistics, such as voting or averaging. Boosting is an iterative approach that combines weak learners, where each learner focuses on the instances misclassified by the previous one, and generates predictions using a weighted average of the constructed models. Finally, stacking trains different classifier types on the same dataset and aggregates their predictions using another model (the combiner).
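Of the three ensemble types, bagging is the simplest to sketch. The example below trains several base learners on bootstrap samples and combines them by majority vote. Two choices are assumptions made for illustration only: the base learner is a nearest-centroid rule, and the bootstrap is drawn per class (stratified) so that a rare class never disappears from a replicate. Function names and data are hypothetical.

```python
import random
from collections import Counter

def fit_centroids(X, y):
    """Base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for d, v in enumerate(xi):
            acc[d] += v
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))

def fit_bagging(X, y, n_estimators=11, seed=0):
    """Train each base learner on a bootstrap sample drawn with replacement,
    separately within each class (stratified bootstrap, an assumption)."""
    rng = random.Random(seed)
    by_class = {}
    for i, yi in enumerate(y):
        by_class.setdefault(yi, []).append(i)
    models = []
    for _ in range(n_estimators):
        idx = [rng.choice(members)
               for members in by_class.values() for _draw in members]
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagging(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical imbalanced toy data: 8 majority (0) and 3 minority (1) samples.
X = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (0.5, 0.5), (0.9, 0.8), (0.2, 0.7),
     (0.6, 0.1), (0.3, 0.4), (4.2, 4.8), (4.9, 4.1), (4.5, 4.6)]
y = [0] * 8 + [1] * 3

models = fit_bagging(X, y)
label_near_minority = predict_bagging(models, (4.5, 4.5))
label_near_majority = predict_bagging(models, (0.5, 0.5))
```

Boosting and stacking differ mainly in how the base learners are trained and combined: boosting reweights the training samples between rounds, and stacking replaces the vote with a trained combiner.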

Various ensemble learning classifiers effectively diagnosed disease in imbalanced data (Zhao et al. 2022; Wei et al. 2017; Bhattacharya et al. 2017; Potharaju and Sreedevi 2016). A weighted ensemble-based KNN algorithm with feature extraction gave remarkable results in identifying the stage of Parkinson’s disease (Zhao et al. 2022). Similarly, an ensemble of KNN classifiers with the relief-F method for feature selection accurately predicted the responses of breast cancer patients to neoadjuvant chemotherapy (Gao et al. 2018). The authors in Wei et al. (2017) used XGBoost based on EasyEnsemble, and their investigations demonstrated strong results on large-scale imbalanced diabetes data. In Bhattacharya et al. (2017), the authors balanced the training subsets and employed a hierarchical meta-classification method; experiments showed that the random forest hierarchical meta-classifier detected later stages of chronic kidney disease with high performance and outperformed random oversampling and SMOTE. The majority voting ensemble of AdaBoost and Logistic regression outperformed each of the two models alone in heart disease detection (Rath et al. 2022). An ensemble built by bootstrapping the majority class with replacement and using majority voting detected different types of Parkinson’s disease well (Roy et al. 2023). In contrast, Zhao et al. (2021) ensembled various machine learning algorithms, where AdaBoost and XGBoost performed comparably and outperformed the other ensemble models. Mathew and Obradovic (2013) used homomorphic encryption to secure multi-party computation, with a distribution voting ensemble for cases where the collected encrypted data was imbalanced, illustrating the superiority of ensemble models over baseline models.

Random forest gave significant results compared to boosting and bagging techniques in the prediction of malaria (Sajana and Narasingarao 2018b) and thyroid disease (Çinaroğlu 2017). In a different approach, the authors in Guo et al. (2018) used an ensemble of rotation trees (ERT) including undersampling and feature extraction, and their investigations showed the statistically excellent performance of ERT compared to EasyEnsemble, Random Undersampling Random Forest (RURF), BalanceCascade, and bagging. In Potharaju and Sreedevi (2016), the authors developed ensembles of rule-based algorithms on SMOTE-balanced data, and the experiments showed that AdaBoost achieved the best accuracy.

5.5 Deep learning algorithms

Modifying the structure and parameters of neural networks and deep learning algorithms is one approach to tackling class imbalance in medical data and improving classification performance (Ghorbani et al. 2022; Izonin et al. 2022; Liu et al. 2019; Sribhashyam et al. 2022). The authors in Ghorbani et al. (2022) combined a graph convolutional network (GCN) algorithm with weighting networks and employed an iterative adversarial training process, demonstrating stability and superior performance compared to other GCN methods. An improved imbalanced probability neural network (IPNN) by Izonin et al. (2022) yielded high performance. In Liu et al. (2019), the authors automated the hyperparameter optimization (AutoHPO) of a deep neural network (DNN), including dimensionality reduction using PCA K-means and majority instance selection with batch reweighting using online learning; their investigation demonstrated that AutoHPO based on DNN outperformed DNN, XGB, etc. ResNet and GRU with a weighted focal loss function outperformed ResNet in multi-class heart disease detection (Rong et al. 2020). A stacked denoising autoencoder (SDA) for anomaly detection outperformed LSTM, SVM, MLP with Borderline SMOTE, and SVM with SMOTE (Alhassan et al. 2018). Recently, Sribhashyam et al. used a multi-instance neural network architecture that outperformed state-of-the-art methods for disease diagnosis (Sribhashyam et al. 2022).
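One loss-level modification mentioned above is the weighted focal loss. The sketch below implements the widely used binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single sample. Whether the cited work uses exactly this form is an assumption; the function name and the default values of `alpha` and `gamma` are illustrative only.

```python
import math

def weighted_focal_loss(p_pos, y, alpha=0.75, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample.

    p_pos: predicted probability of the positive (minority) class.
    y:     true label, 1 for positive and 0 for negative.
    alpha: weight on positive samples (negatives get 1 - alpha).
    gamma: focusing parameter; gamma = 0 gives weighted cross-entropy.
    """
    p_t = p_pos if y == 1 else 1.0 - p_pos        # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # class weight of the true class
    p_t = min(max(p_t, eps), 1.0)                 # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct minority sample contributes almost nothing,
# while a missed minority sample dominates the loss.
easy = weighted_focal_loss(0.95, 1)
hard = weighted_focal_loss(0.10, 1)
```

The factor (1 - p_t)^gamma down-weights samples the model already classifies well, so training effort concentrates on hard, often minority, samples.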

5.6 Unsupervised learning

Unsupervised learning approaches showed high performance and interpretability; however, they are rarely used (Zhou and Wong 2021; Chan et al. 2017). In Chan et al. (2017), the authors used pattern discovery and heuristic optimization of the geometric mean, which performed significantly well and surpassed logistic regression. More recently, the authors in Zhou and Wong (2021) identified relevant patterns by first establishing a matrix representing the co-occurrence frequency of pairs of values (as in association rules). They then built another matrix representing the deviation of these frequencies from the default frequencies (the counterpart of the covariance matrix in PCA), decomposed this matrix into several principal components (PCs), projected the pairs of values onto the resulting subspace, and selected clusters (patterns). Experiments demonstrated that the proposed algorithm outperformed CART, Naive Bayes, and logistic regression.

Regarding structured medical data, deep learning is yet to be fully explored as a potential solution for class imbalance, possibly because of the insufficiency of medical data or the complexity of the models. Another emerging research line is pattern recognition. Table 7 describes the learning-level techniques, including the year, the title, and the main idea.

6 Combined techniques and comparison of different approaches

Combining learning-level and data-level approaches is also considered for treating imbalanced medical data. In the last decade’s literature, studies contrasting different approaches or suggesting combined techniques are about as frequent as studies on learning approaches.

Recently, studies combined deep learning approaches with sampling techniques and outperformed state-of-the-art techniques (Feng and Li 2021; Woźniak et al. 2023). In Feng and Li (2021), the authors optimized a combination of Borderline SMOTE and ADASYN, called $\alpha$ DBASMOTE, in which only minority samples in the danger set are used for synthesis, and used a DenseNet convolutional neural network. Their investigation showed the higher performance of $\alpha$ DBASMOTE over Borderline SMOTE and ADASYN. The authors in Woźniak et al. (2023) combined oversampling by ADASYN and SMOTE with undersampling by Tomek Links and used a Bidirectional Long Short-Term Memory deep learning model, which produced promising results. Rath et al. built an ensemble of LSTM and GAN, with the GAN used for data generation, and the investigations showed excellent results in heart disease detection (Rath et al. 2021). An SVM based on active learning, which relied on the degree of each instance’s importance, yielded superior performance (Lee et al. 2015). Likewise, Suresh et al. (2022) used Radius SMOTE for balancing and a convolutional generative adversarial network for data generation with a modified CNN model; experimentation showed its optimal performance and lower computational time.

Preprocessing was integrated into class imbalance approaches (Cheng et al. 2022; Hallaji et al. 2021). Cheng et al. (2022) denoised signals and combined multi-scale features with ADASYN to balance different categories of Electrocardiogram (ECG) signals, while Britto and Ali (2021) proposed balancing and augmenting the data together with a deep learning model with adaptive weighting for minority classes. Hallaji et al. (2021) compared an adversarial imputation classification network (AICN) with hybrid models encompassing sampling and data imputation techniques. Miss-Forest performed best among the imputation techniques and SMOTE among the balancing techniques, while AICN outperformed them and remained stable across different missing value ratios. Ensemble learning combined with different approaches handled class imbalance in medical data better (Gan et al. 2020; Gupta and Gupta 2022). AdaCost with a tree-augmented naive Bayes network outperformed AdaCost variants (Gan et al. 2020), whereas experiments in Gupta and Gupta (2022) demonstrated the high performance of boosted ensemble stacking. Oversampling with an ensemble of PNNs and weighted voting significantly outperformed PNN, biased random forest, and random undersampling boosting (Yuan et al. 2021). Liu et al. used hybrid sampling by SMOTE and a cross-validated committee filter, followed by an ensemble of SVMs with weighted voting optimized by a simulated annealing genetic algorithm (SAGA) (Liu et al. 2020); their investigation showed its optimal performance compared to state-of-the-art classification models.

Sampling with ensemble learning, combined in different manners, effectively handled class imbalance in disease diagnosis (Naghavi et al. 2019; Kinal and Woźniak 2020; Li et al. 2021; Lamari et al. 2021). ADASYN for oversampling with a cost-sensitive ensemble classifier constructed on SVM, KNN, and MLP outperformed deep learning-based models in freezing of gait (FoG) prediction (Naghavi et al. 2019). Dynamic ensemble selection, in particular DES-KNN coupled with SMOTE, handled non-severely imbalanced data well (Kinal and Woźniak 2020). Likewise, SMOTE-ENN sampling with dynamic classifier selection using META-DES outperformed META-DES alone on imbalanced data (Lamari et al. 2021). Li et al. designed a harmonized-centred ensemble (HCE) approach that iteratively undersampled the majority class samples based on their classification hardness level (Li et al. 2021). Investigations demonstrated that HCE outperformed the Under-Bagging method, the RUSBoost method, and the self-paced ensemble learning framework (SPE). A SMOTE-based stacked ensemble with Bayesian optimization for hyperparameter tuning gave excellent results in breast cancer diagnosis (Cai et al. 2018). The combination of SMOTE with SVM and AdaBoost surpassed stacking and voting strategies (Wang et al. 2020). Undersampling using different techniques with AdaBoost for learning and prediction attained optimal results (Shaw et al. 2021). Feature extraction, along with random undersampling and XGBoost, effectively predicted acute kidney injury in intensive care unit patients and outperformed random oversampling, random forest, AdaBoost, KNN, and Naïve Bayes (Wang et al. 2020). Similarly, Liu et al. (2014) used random undersampling to train SVM classifiers and validated them on data synthesized by SMOTE, and specific weights were assigned to the SVMs accordingly; their investigation showed the effectiveness of the SVM ensemble in predicting cardiac complications of patients presenting with chest pain at the hospital emergency department.

Modifications of the random forest algorithm gave considerable results (Meher et al. 2014; Lyra et al. 2019). Meher et al. (2014) developed a combined random forest where each random forest was trained on a balanced subset of data clustered from the original data. According to experiments, the combined random forest outperformed weighted and biased random forests. A “nested forest” was developed by Lyra et al. (2019) using feature selection and reduction with random undersampling to create balanced subsets for decision tree training, and the best forests were used for sepsis prediction. In Fujiwara et al. (2020), the authors used boosting weights to iteratively select misclassified majority samples for the next CART classifier and oversampled the minority samples based on their distribution. Experiments demonstrated the superior performance of the approach in severely imbalanced medical data compared to random undersampling with boosting and SMOTE. In contrast, the authors in Silveira et al. (2022) combined manual oversampling by a nephrologist with automated oversampling by SMOTE and its variants, where the decision tree achieved superior and stable performance in the early detection of chronic kidney disease.

Research comparing class imbalance strategies in disease diagnosis (Drosou et al. 2014; Gupta et al. 2021; Wang et al. 2023) reported different outcomes. In a comparison of resampling and cost-sensitive learning approaches with SVM as the classifier (Drosou et al. 2014), the best performance was achieved by hybrid sampling (SMOTE and random undersampling) with SVM. The authors in Gupta et al. (2021) examined various class imbalance techniques, and extensive experiments showed that weighted XGBoost and a stacking ensemble of weighted classifiers performed best in breast cancer diagnosis. Additionally, feature selection, SMOTE, and cost-sensitive learning were employed with a variety of machine learning classifiers (Wang et al. 2023); three strategies achieved the best results in identifying patients with chronic obstructive pulmonary disease: cost-sensitive logistic regression, cost-sensitive SVM, and logistic regression with SMOTE.

Feature selection noticeably improved classification performance in imbalanced medical data (Porwik et al. 2016; Špečkauskienė 2016; Lijun et al. 2018; Razzaghi et al. 2019). Wrappers for feature selection with a parallel ensemble based on weighted KNN achieved better and more stable accuracy than C4.5 and naïve Bayes on multi-class imbalanced and incomplete HCV data (Porwik et al. 2016). Feature selection outperformed oversampling with SMOTE in multi-class Parkinson’s disease detection (Špečkauskienė 2016), where the Clinical Decision Support system in Špečkauskienė (2011) identified the best feature subset. Lijun et al. (2018) combined elastic net for feature selection with hybrid sampling using SMOTE and random undersampling and used a multi-class SVM; their investigations showed that it achieved superior overall accuracy. In a different approach, ensemble learning methods with SMOTE and feature selection outperformed single classifiers; in particular, random forest and bagging yielded the highest results (Razzaghi et al. 2019). In Tang et al. (2021), the authors combined feature selection and dimensionality reduction for biological data in breast cancer diagnosis and designed a twice-competitional ensemble method (TCEM) to select the optimal model, with promising results. Cheng and Wang applied Particle Swarm Optimization (PSO) for feature selection with SMOTE and random forest and achieved considerable breast cancer diagnosis results (Cheng and Wang 2020).

Optimization techniques were integrated into different approaches and largely improved medical diagnosis (Shilaskar et al. 2017; Sadrawi et al. 2018; Desuky et al. 2021). Shilaskar et al. (2017) combined hybrid sampling with a modified particle swarm optimization to optimize the kernel function of SVM. The authors in Sadrawi et al. (2018) used fuzzy C-means clustering to undersample the majority class and genetic algorithms (GA) to optimize the activation combination of an ensemble of activated ANN models. Including diversity within the ensemble and GA optimization yielded better results than single classifiers. Sampling using a crossover genetic operator with adaptive boosting, proposed by Desuky et al. (2021), improved classification performance more than SMOTE and safe-level SMOTE (SLSMOTE). Feature selection and Principal Component Analysis with random oversampling and ensemble voting outperformed SMOTE, SMOTE-ENN, and SMOTE-Tomek links (Alashban and Abubacker 2020). Srinivas et al. used rough set theory based on fuzzy C-means clustering, which outperformed the rough fuzzy classifier in heart disease detection (Srinivas et al. 2014). Table 8 describes all the combined techniques proposed for imbalanced medical data.

7 Synthesis of research outcomes on imbalanced medical datasets

Several benchmark imbalanced datasets appear in the studied medical diagnosis research. We give an overview of results on the most frequently studied ones, namely: “Pima Diabetes Dataset”, “Wisconsin Diagnostic Breast Cancer (WDBC)”, “Wisconsin Prognostic Breast Cancer (WPBC)”, “Haberman Dataset”, “SPECT Heart Dataset”, “Breast Cancer Dataset”, “Indian Liver Patient Dataset (ILPD)”, “Hepatitis-C Dataset”, “Cervical Cancer Dataset”, “Heart Disease Dataset”, “Breast Cancer Wisconsin Original Dataset”, “Parkinson’s Disease Dataset”, “New Thyroid Dataset”, “Chronic Kidney Disease Dataset”, “Thoracic Surgery Dataset”, “Liver Disorder Dataset”, “Mammographic Mass Dataset”. This synthesis consolidates the findings from research that used key imbalanced medical datasets, providing a cohesive understanding of how these datasets are analyzed within the framework of class imbalance.

This analysis is contextual: it relies on the class imbalance methodology employed by the authors of each study and on its performance as quantified by the evaluation metrics they selected. These experimental details were the ones most explicitly reported across the literature; clarifications on the underlying methodological procedures would make the observations more informative. We thus attempt to bridge the theoretical frameworks of machine learning with their practical applications in medical diagnostics, using an observational approach to offer a detailed overview of current practices and performance metrics. The overview highlights the use and effectiveness of these methods in different medical contexts without drawing new conclusions or conducting experimental analysis. It is important to note that this synthesis cannot be classified as experimental or deeply analytical due to several constraints. Consequently, our reflections on the setup and context of the synthesis are noted where relevant.

Eleven research papers on medical diagnosis in imbalanced data have employed the “Breast Cancer Wisconsin Original Dataset” in experimentation. Table 9 summarizes the results of each research work and mentions the approach used to tackle the class imbalance issue. This dataset presents an imbalance ratio of 1.90, and various class imbalance methods have been used to tackle it. The learning approach is the most prevalent and yields excellent performance in classifying breast cancer, where combined techniques are the most implemented (Yuan et al. 2021; Kinal and Woźniak 2020; Suresh et al. 2022; Cai et al. 2018) compared to cost-sensitive methods (Wu et al. 2020), ensemble methods (Guo et al. 2018), and optimization techniques (Nalluri et al. 2020). Scholars have also used data-level approaches, though less frequently; these achieved considerable performance in terms of different metrics and include a feature-level method (Zhang and Chen 2019a), an oversampling method (Mustafa et al. 2017), an undersampling method (Vuttipittayamongkol and Elyan 2020b), and a hybrid method (Zhang and Chen 2019b). Only slight differences in performance metrics are observed among the methodologies employed. However, the effectiveness of a method can be influenced by numerous factors, including the specific characteristics of the data, the complexity of the model, and the research goals. Despite these variations, the overall classification performance remains considerable, demonstrating robustness in addressing class imbalance within this dataset.
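For readers who want to reproduce an imbalance ratio such as the 1.90 quoted above, a minimal sketch follows. It assumes the ratio is defined as the size of the largest class divided by the size of the smallest class, and it assumes the commonly distributed class counts of this dataset (458 benign, 241 malignant); both assumptions should be checked against Table 1 and the copy of the data actually used.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Assumed class counts for the Breast Cancer Wisconsin Original data
# (458 benign, 241 malignant); verify them against the copy you use.
labels = ["benign"] * 458 + ["malignant"] * 241
ir = round(imbalance_ratio(labels), 2)
```

Under these assumptions the computed value rounds to 1.90. Other versions of a dataset, or copies with rows removed for missing values, can give a different ratio.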

Table 10 summarizes the findings from eleven distinct studies on the “Heart Disease Dataset,” each employing different strategies to tackle the challenges of class imbalance in medical diagnostics. This dataset exhibits an imbalance ratio of 1.20; other versions of the datasets exist that could be differently imbalanced. The researchers experimenting always refer to the version presented in Table 1 unless other details are reported. This dataset has seen a variety of approaches, with combined techniques being particularly prevalent, as demonstrated in the works by Gan et al. ( 2020 ), Kinal and Woźniak ( 2020 ), Shilaskar et al. ( 2017 ), Desuky et al. ( 2021 ) and Srinivas et al. ( 2014 ), which display a range of outcomes across key metrics such as accuracy, sensitivity, specificity, and more. Other approaches include undersampling (Jain et al. 2020 ), which yielded high accuracy and sensitivity, and oversampling (Rodriguez-Almeida et al. 2022 ), although specific performance metrics for the latter are not reported; whereas optimization techniques employed by Nalluri et al. ( 2020 ) showed superior performance with nearly perfect metrics, indicating potential advantages depending on the specific methodological implementations and study goals. The hybrid approach by Shilaskar and Ghatol ( 2019 ) and optimization efforts by Chan et al. ( 2017 ) also added to the diversity of results, though with mixed effectiveness. This analysis reveals variations in how different methods perform under the constraints of the same dataset, reflecting a spectrum of effectiveness in the tools and strategies deployed. Despite these differences, the collective outcomes contribute significantly to advancing the diagnostic capabilities for heart disease, illustrating the value of diverse methodological approaches in enhancing overall classification performance.

Table 11 synthesizes the outcomes from five research studies on the “Cervical Cancer Dataset,” focusing on various methodologies used for cervical cancer diagnosis. This dataset has the highest class imbalance among the reference medical datasets, as seen in Table 1. A predominant reliance on combined techniques is observed, as employed by Gan et al. (2020), Gupta and Gupta (2022), Kinal and Woźniak (2020), and Woźniak et al. (2023). Each study shows differing levels of effectiveness across metrics such as accuracy, AUC, precision, sensitivity, F-value, geometric mean, and specificity. Mienye and Sun (2021) utilized a cost-sensitive approach, which stands out with exceptional results, achieving perfect scores in accuracy, AUC, precision, and sensitivity. In contrast, the combined techniques exhibit a range of performances, with Woźniak et al. (2023) demonstrating notably high efficacy, almost reaching optimal scores across all evaluated metrics. This array of studies reflects the effectiveness of different learning strategies in diagnosing cervical cancer. It highlights the diversity in methodological success and underlines the particular strengths of more nuanced approaches, like the cost-sensitive method showcased by Mienye and Sun. Overall, two main learning methods are observed, and the aggregated findings from these studies highlight their contribution to advancements in cervical cancer diagnostics concerning the studied data.

Table 12 assembles findings from multiple research studies that have applied various approaches to the “Hepatitis Dataset,” characterized by an imbalance ratio of 3.84. This summary highlights how the twelve research papers employed different methods to address the challenges inherent in the imbalanced data, employing ensemble, cost-sensitive, hybrid, undersampling, oversampling, feature-level, combined techniques, and optimization strategies. Among the methodologies, the feature-level approach by Polat ( 2018 ) stands out with perfect scores across all metrics, showcasing the potential of finely tuned feature engineering in such contexts. Similarly, optimization techniques used by Nalluri et al. ( 2020 ) and combined techniques by Gupta and Gupta ( 2022 ) demonstrated high effectiveness, with near-perfect accuracy and other metrics. Conversely, approaches like the ensemble by Guo et al. ( 2018 ) and the hybrid technique by Wosiak and Karbowiak ( 2017 ) yielded more modest results, accentuating the variability in the efficacy of different methodologies within the same imbalanced dataset. The undersampling methods, particularly those implemented by Babar and Ade ( 2016 ) and Jain et al. ( 2020 ), showed remarkable improvements in handling class imbalance, reflected in their high accuracy and specificity. This aggregation of studies illustrates a broad expanse of success in managing class imbalance of the dataset, with some methods showing considerable effectiveness while others highlight areas for potential improvement.

Table 13 gathers the performance metrics from several studies that utilized the “Indian Liver Patient Dataset (ILPD)” and addressed its class imbalance ratio of 2.49. The table provides a broad overview of the effectiveness of different class imbalance approaches, including simple classifiers, undersampling, combined techniques, and optimization strategies. The results demonstrate a range of effectiveness across methodologies. The combined techniques employed by Gan et al. (2020), Yuan et al. (2021), and Kinal and Woźniak (2020) yielded mixed results. Gan et al. and Yuan et al. reported relatively lower specificities and sensitivities, while Kinal and Woźniak achieved a high specificity of 0.95, indicating that the success of combined techniques can vary significantly based on their specific configurations and the aspects of the data they prioritize. On the other hand, the simple classifier approach by Kumar and Thakur (2019) showed a high F-value and precision, suggesting that even straightforward models can perform effectively on this dataset. Undersampling, proposed by Jain et al. (2017, 2020), showed improvements in specificity and sensitivity, indicating its utility in enhancing model accuracy by addressing data imbalance. Meanwhile, Nalluri et al. (2020) applied optimization techniques, which resulted in balanced performance across all metrics. Taken together, the findings in this table show the varied effectiveness of each methodology in handling the dataset’s imbalance: each achieves high values in some metrics and lower values in others. This illustrates the necessity of selecting an appropriate method based on specific dataset characteristics and desired outcomes in diagnostic accuracy.

Table 14 assembles the results from diverse research methodologies for diagnosing breast cancer using the “Breast Cancer Dataset.” This dataset’s imbalance ratio of 2.38 has prompted researchers to employ a mix of techniques, including undersampling, cost-sensitive methods, ensemble approaches, hybrid strategies, and combined techniques. Undersampling is the most used, with varied results: Al-Shamaa et al. (2020) obtained modest specificity and sensitivity, in sharp contrast with Ibrahim (2022), which achieved high values across these metrics. Similarly, Babar and Ade (2016) and Jain et al. (2020) also utilized undersampling, with a particularly strong performance from the former. Wan et al. (2014) and Zięba (2014) applied cost-sensitive methods, showing lower performance metrics. Guo et al. (2018) employed an ensemble approach, yielding middling results, which suggests that achieving higher predictive accuracy through this method is difficult. In other studies, specific performance metrics are not fully detailed, highlighting a need for more comprehensive reporting. Babar (2021) implemented a hybrid method, achieving considerable accuracy, and Yuan et al. (2021) explored combined techniques and achieved an average trade-off between sensitivity and specificity. Significant variability in the reported outcomes is observed, suggesting ongoing challenges and complexities in diagnosing breast cancer on this particular imbalanced dataset.

Table 15 showcases the results of seven studies that applied various methodologies to the “SPECT Heart Dataset,” which has an imbalance ratio of 3.85. These methodologies use a variety of methods to improve diagnostic accuracy and address the dataset’s imbalance. The study by Polat (2018) indicates the efficacy of feature-level adjustments, yielding excellent performance metrics. Jain et al. (2017, 2020) employed undersampling techniques in both studies: the later study reports specificity, sensitivity, and accuracy, all at 0.88, while Jain et al. (2017) attained a geometric mean of 0.91, suggesting effective handling of class imbalance. Babar (2021) used a hybrid approach and achieved an accuracy of 0.84. Liu et al. (2020) and Kinal and Woźniak (2020) both opted for combined techniques, with varying levels of success across specific and general performance metrics. Nalluri et al. (2020) implemented optimization techniques, resulting in impressive specificity, sensitivity, and accuracy scores. The synthesis in Table 15 reflects the diverse strategies researchers can employ to tackle diagnostic challenges and underscores the difficulty of achieving high accuracy under class imbalance.
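For readers less familiar with the metrics compared in these tables, the following minimal sketch shows how sensitivity, specificity, accuracy, and the geometric mean are derived from a confusion matrix. The counts are hypothetical and chosen only so that sensitivity and specificity both equal 0.88; they are not taken from any reviewed study.

```python
import math


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, accuracy and G-mean from confusion counts.

    Assumes both classes are present, so neither denominator is zero.
    """
    sensitivity = tp / (tp + fn)  # share of diseased cases detected
    specificity = tn / (tn + fp)  # share of non-diseased cases cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "g_mean": g_mean,
    }


# Hypothetical counts: 50 diseased and 200 non-diseased cases.
metrics = diagnostic_metrics(tp=44, fn=6, tn=176, fp=24)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When sensitivity and specificity coincide, accuracy and the geometric mean take the same value regardless of the class proportions, which is consistent with a study reporting identical scores on all three of specificity, sensitivity, and accuracy.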

Table 16 groups the results of studies exploring various techniques for the “Haberman Dataset,” which exhibits an imbalance ratio of 2.78. This imbalance influences the choice of methodological approaches, which include sampling strategies, learning techniques, and combined techniques. The outcomes of sampling methods vary: the oversampling method of Xu et al. (2021) effectively mitigates class disparity, achieving optimal sensitivity and specificity; the results of Wang et al. (2013) show a modest sensitivity; and the undersampling technique of Jain et al. (2020) indicates relatively considerable performance. Other studies report a single metric: Jain et al. (2017), proposing an undersampling method, reported a high precision value, and Xu et al. (2020) used hybrid sampling and reported a high F-value, suggesting an effective balance between recall and precision. Mienye and Sun (2021) adopt a cost-sensitive technique, achieving notable sensitivity and precision. Ghorbani et al. (2022) and Izonin et al. (2022) leverage deep learning models, which excel at discerning complex patterns, with Izonin et al.’s findings excelling in sensitivity and precision. Liu et al. (2020) and Desuky et al. (2021) employ combined techniques, achieving balanced values across various metrics. Nalluri et al. (2020) explore optimization techniques for class imbalance, leading to average metric values. This synthesis shows the diversity of approaches to the Haberman Dataset’s imbalance. Recent studies achieve better sensitivity, and the findings reported for this dataset differ significantly: while a few studies achieved excellent performance, others still need to handle class imbalance more effectively and, more generally, to take better account of the nature of the medical data.

The results of the reviewed medical diagnosis research on the imbalanced WPBC dataset are presented in Table 17. This dataset exhibits an imbalance ratio of 3.21, and five studies proposed sampling methods to handle it. The oversampling study (Xu et al. 2021) indicates optimal performance; the studies employing undersampling (Jain et al. 2017, 2020) and hybrid sampling (Xu et al. 2020) report significant values of the performance metrics; whereas the hybrid sampling study of Zhang and Chen (2019b) indicates modest performance. The findings of the studies proposing combined techniques (Yuan et al. 2021; Liu et al. 2020) appear modest, although Liu et al.’s approach indicates a better balance between sensitivity and specificity. The optimization technique of Nalluri et al. (2020), however, is superior on the reported metrics. Overall, the results on the WPBC dataset point to the diversity of class imbalance methods and to the variation in the performance of the implemented methodologies.

Table 18 presents the findings of studies using the “Wisconsin Diagnostic Breast Cancer (WDBC)” dataset. These studies implemented a variety of approaches, including combined techniques, algorithmic-level modifications, and preprocessing methods, to address this dataset’s imbalance ratio of 1.68 and improve diagnostic accuracy. Combined techniques dominate the research landscape and are used in eight studies, including Shaw et al. (2021), which achieved perfect scores in sensitivity, accuracy, and precision. Similarly, Kinal and Woźniak (2020), Desuky et al. (2021), Cai et al. (2018) and Liu et al. (2020) showed excellent outcomes, underscoring the efficacy of these approaches across several metrics, whereas Yuan et al. (2021), Cheng and Wang (2020) and Gupta and Gupta (2022) reported high outcomes on specific performance metrics. Four studies employed cost-sensitive methods to make the model sensitive to cost discrepancies between classes. Belarouci et al. (2016) achieved ideal results in all evaluated metrics, illustrating the potential of these methods to balance predictive accuracy and cost considerations. Zhu et al. (2018), Wu et al. (2020) and Phankokkruad (2020) also showed significant improvements, particularly in specificity and F-values, while Nalluri et al. (2020) implemented optimization techniques and showed impressive results. Several studies used oversampling to correct the class imbalance (Naseriparsa et al. 2020; Luo et al. 2021; Lan et al. 2022; Xu et al. 2021), with Xu et al. achieving near-perfect sensitivity and specificity. The studies implementing hybrid sampling (Zhang and Chen 2019b; Xu et al. 2020) and undersampling (Jain et al. 2020) also presented superior performance, with Xu et al. (2020) scoring perfectly across multiple metrics. Zhang and Chen (2019a), in turn, applied feature-level modifications, resulting in high marks across sensitivity, specificity, and accuracy, and Izonin et al. (2022) used deep learning to deal with the class imbalance and achieved remarkable results. The methodologies listed in Table 18 reflect the diverse strategies researchers employ to tackle the class imbalance of the WDBC dataset. Although combined techniques are particularly prevalent, several different approaches reached optimal effectiveness. This overview underlines the variability in method effectiveness and highlights the ongoing advances in breast cancer diagnostics.

Twenty-nine research articles have used the “Pima Diabetes Dataset,” which exhibits an imbalance ratio of 1.87, in their experiments. Table 19 summarizes the diverse methodologies that the dataset’s imbalance and relevance have prompted in order to improve predictive accuracy: one feature-level method, fifteen sampling methods, seven learning-level approaches, and six combined techniques. The results of the oversampling method proposed by Rodriguez-Almeida et al. (2022) are not explicitly reported. The combined techniques of Yuan et al. (2021) yield an average geometric mean. The two sampling techniques proposed by Zeng et al. (2016) and Hassan and Amiri (2019) and the two learning methods suggested by Ghorbani et al. (2022) and Wu et al. (2020) show good overall performance in diabetes diagnosis, with AUC values in the range 0.8–0.88. The three sampling methods of Xu et al. (2020), Babar (2021) and Mustafa et al. (2017) yield excellent global performance, achieving values greater than 0.98 in F-value, accuracy, and AUC, respectively. We categorize the remaining works into four groups based on their sensitivity score. The oversampling method proposed by Wang et al. (2013) recognizes diabetes patients poorly. The methodologies proposed by Guo et al. (2018), Liu et al. (2020), Naseriparsa et al. (2020), Kamaladevi and Venkatraman (2021), Lamari et al. (2021), Ibrahim (2022) and Izonin et al. (2022) identify patients with diabetes moderately well. The approaches in Wan et al. (2014), Babar and Ade (2016), Zhang and Chen (2019b), Kinal and Woźniak (2020), Nalluri et al. (2020), Desuky et al. (2021) and Mienye and Sun (2021) attain a considerable detection of patients with the disease. Finally, the following methods classify the target (diseased) group excellently: the optimization technique in Jain et al. (2020), the feature-level method in Polat (2018), undersampling by Al-Shamaa et al. (2020), the combined techniques of Suresh et al. (2022), and oversampling in Xu et al. (2021); the latter shows excellent specificity as well. The findings outlined in Table 19 show the range of class imbalance strategies designed for diabetes prediction using the Pima Diabetes Dataset. The varied methodologies underscore the dynamic nature of medical diagnostics research, where each approach provides distinct advantages and faces specific challenges, as researchers seek more accurate and efficient diagnostic models.

Table 20 presents the results of several studies that used the “Parkinson’s Disease Dataset” to evaluate different diagnostic approaches. This dataset exhibits an imbalance ratio of 3.06, and the studies propose a range of methodologies to handle it: five preprocessing approaches, one optimization technique, and one combined techniques approach. The combined techniques strategy performs worst in detecting both diseased and non-diseased patients. The three sampling methods suggested by Sug (2016), Zeng et al. (2016) and Jain et al. (2017) and the optimization technique of Nalluri et al. (2020) achieve an excellent trade-off between diagnosing patients with and without Parkinson’s ($0.93<Sens<0.99$). Furthermore, the feature-level method in Polat (2018) and the undersampling method in Jain et al. (2020) correctly identify all cases. This analysis shows the diversity of the implemented methodologies and the variation in their effectiveness in classifying Parkinson’s cases: some methods reach optimal performance while others struggle to show comparable performance.

Table 21 presents the literature findings, within the review time range, on thyroid diagnosis using the “New Thyroid Dataset”: four combined techniques approaches and two learning approaches. All the methods diagnose patients reliably (sensitivity $> 0.99$). The combined techniques of Shilaskar et al. (2017) and Liu et al. (2020), however, perform optimally according to sensitivity, specificity, accuracy, and geometric mean. The proposed methodologies thus handle the class imbalance effectively, indicating that the challenges of this dataset have largely been overcome.

Among the reviewed literature, five studies analyzed the “Chronic Kidney Disease Dataset”; their outcomes are shown in Table 22. The results of the oversampling method proposed by Rodriguez-Almeida et al. (2022) are not clearly reported. Significant performance was reached by the hybrid method of Yildirim (2017) and the combined techniques of Suresh et al. (2022). Both the undersampling method of Jain et al. (2020) and the learning approach of Mienye and Sun (2021) perfectly diagnose patients with chronic kidney disease; the former also identifies non-diseased patients optimally. Although the studies adopted different class imbalance methods, performance on this chronic kidney disease data is broadly high.

Table 23 summarizes the results of the approaches evaluated on the “Thoracic Surgery Dataset”. The dataset presents an imbalance ratio of 5.14; therefore, studies proposed different class imbalance methods within their classification methodologies. The optimization technique proposed by Chan et al. (2017) obtains low values of both sensitivity and specificity. Good detection of diseased patients paired with poor detection of non-diseased patients, or the opposite, is observed in the following studies: the undersampling techniques proposed in Al-Shamaa et al. (2020) and Vuttipittayamongkol and Elyan (2020b), the optimization technique in Nalluri et al. (2020), and the cost-sensitive method in Zięba (2014). The combined techniques of Kinal and Woźniak (2020) yield an average geometric mean, while the undersampling methods of Jain et al. (2017, 2020) produce relatively superior values. Finally, optimal performance is attained by the feature-level method (Polat 2018). Overall, the effectiveness of the outlined methodologies differs throughout the analysis of the “Thoracic Surgery Dataset”; besides the class imbalance approach, other factors jointly affect the results.

Regarding liver disorder detection, six studies have been conducted using the “Liver Disorder Dataset”, and Table 24 shows their outcomes. The hybrid method proposed by Babar (2021) has the highest accuracy, demonstrating its superior global performance. The accuracy and area under the curve (AUC) of the cost-sensitive approach proposed by Wu et al. (2020), by contrast, indicate average overall performance. However, in medical diagnosis models, particularly with imbalanced data, more specific metrics, such as sensitivity and geometric mean, are considered in performance assessment. On these, the oversampling method of Wang et al. (2013) performs poorly in diagnosing diseased patients, with a sensitivity of 0.58. The undersampling technique, the optimization technique, and the combined techniques approach (Babar and Ade 2016; Nalluri et al. 2020; Shaw et al. 2021) reach good sensitivity values ($>0.82$); the latter also has higher accuracy, precision, and AUC and a better sensitivity, which may be attributed to its strong performance in identifying patients with and without liver disorder. Although various class imbalance methods were proposed, overall classification performance on this Liver Disorder data could be further improved.
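The reason accuracy alone is not sufficient on imbalanced data can be illustrated with a small sketch. The test set and both sets of predictions are hypothetical: a trivial model that labels every patient as non-diseased is compared with a model that actually detects most diseased patients.

```python
import math


def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and G-mean for binary labels (1 = diseased)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return accuracy, sensitivity, math.sqrt(sensitivity * specificity)


# Hypothetical test set: 84 non-diseased and 16 diseased patients.
y_true = [0] * 84 + [1] * 16
always_healthy = [0] * 100  # trivial majority-class model
# 70 true negatives, 14 false positives, 13 true positives, 3 false negatives.
cautious_model = [0] * 70 + [1] * 14 + [1] * 13 + [0] * 3

print("trivial model: ", evaluate(y_true, always_healthy))
print("cautious model:", evaluate(y_true, cautious_model))
```

In this constructed case the trivial model has the higher accuracy although it detects no diseased patient at all, whereas sensitivity and the geometric mean expose the difference between the two models.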

Four studies experimented with the “Mammographic Mass Dataset”; their findings are in Table 25. The hybrid strategy in Babar (2021) attained the best accuracy (0.88), indicating good overall performance. A good rate of lesion detection is achieved by the undersampling method in Babar and Ade (2016), the cost-sensitive method in Zhu et al. (2018), and the combined techniques of Desuky et al. (2021); the first two have equal sensitivity, that is, they are equally able to diagnose malignant breast lesions. The undersampling method offers a good compromise between sensitivity and specificity, with a higher geometric mean and accuracy. Although few studies used the Mammographic Mass data, the proposed methodologies perform relatively well overall.

8 Discussion

This section explores observations drawn from the contextual analysis. We reflect on the synthesis of the outcomes of previous research on the reference medical datasets and offer tentative insights into methodological concerns and practical aspects of investigating class imbalance in medical data.

Methodologies performance considering the class imbalance methods. For each medical dataset, we selected the approaches that showed high performance, which gave twenty-two highly performing methods on seventeen datasets; that is, on some medical datasets several research works report similar optimal results. The feature-level method proposed by Polat (2018), in which the data points of each attribute are weighted using similarity and clustering that take the class label into account, indicated optimal performance in handling class imbalance on three imbalanced medical datasets, namely the “Hepatitis-C Dataset”, the “Parkinson’s Disease Dataset”, and the “Thoracic Surgery Dataset”.

Similarly, in breast cancer diagnosis using both the “Breast Cancer Wisconsin Original Dataset” and the “Wisconsin Diagnostic Breast Cancer (WDBC)” dataset, and in heart disease detection using the “SPECT Heart Dataset”, the optimization-based research of Nalluri et al. (2020) showed the most effective classification. Briefly, the method of Nalluri et al. (2020) uses a hybrid evolutionary algorithm (EA) with multiobjective optimization, in which the fitness function is an SVM, along with two multiobjective scenarios and a population of non-dominated solutions and limit solutions. The oversampling method proposed by Xu et al. (2021) appeared to be the most successful approach on three imbalanced medical datasets: the “Haberman Dataset”, the “Wisconsin Prognostic Breast Cancer (WPBC)” dataset, and the “Pima Dataset”. In detail, this method uses a filtered k-means clustering to identify a new data matrix and then applies newly calculated sampling ratios and SMOTE to balance the data classes. The research adopting hybrid methods implied superior results on one dataset, the “Wisconsin Diagnostic Breast Cancer (WDBC)” dataset; this method combines oversampling by ROSE with sample selection by k-means to handle the imbalance in medical data (Zhang and Chen 2019b).

Overall, undersampling techniques showed high classification performance on five medical datasets. The undersampling method of Jain et al. (2020), based on genetic algorithms, could be perceived as the most efficient strategy for addressing class imbalance on the “Parkinson’s Disease Dataset”, the “Chronic Kidney Disease Dataset”, and the “Indian Liver Patient Dataset (ILPD)”. The undersampling methods of Babar and Ade (2016) and Vuttipittayamongkol and Elyan (2020b) produced the most promising results on the “Breast Cancer Dataset” and the “Heart Disease Dataset”, respectively. The former is a multilayer perceptron-based undersampling method, while the latter identifies the overlapping region of instances using a recursive neighbourhood search and then discards the majority instances in it to improve the visibility of minority instances. In cervical cancer diagnosis using the “Cervical Cancer Dataset”, the cost-sensitive approach suggested by Mienye and Sun (2021), a cost-sensitive random forest classifier, gave the optimal results. In breast cancer diagnosis using the “Wisconsin Diagnostic Breast Cancer (WDBC)” dataset, the cost-sensitive method of Belarouci et al. (2016) was as effective as the hybrid and combined techniques. It consists of a version of the least mean squares (LMS) algorithm that assigns weights to the samples according to their errors.
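The idea of overlap-based undersampling can be sketched as follows. This is a deliberately simplified, non-recursive variant written for illustration, not the algorithm of Vuttipittayamongkol and Elyan (2020b): it removes every majority instance that appears among the k nearest neighbours of some minority instance. The function name, the parameter `k`, and the toy data are assumptions.

```python
import math


def overlap_undersample(majority, minority, k=1):
    """Drop majority points found among the k nearest neighbours of any
    minority point (simplified, non-recursive overlap removal)."""
    pooled = list(majority) + list(minority)  # majority indices come first
    n_major = len(majority)
    to_drop = set()
    for j, query in enumerate(minority):
        self_index = n_major + j
        others = [i for i in range(len(pooled)) if i != self_index]
        others.sort(key=lambda i: math.dist(pooled[i], query))
        # Keep only neighbours that are majority instances.
        to_drop.update(i for i in others[:k] if i < n_major)
    return [p for i, p in enumerate(majority) if i not in to_drop]


# Toy data: one majority point sits inside the minority region.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.9, 3.0)]
minority = [(3.0, 3.0), (3.5, 3.5), (3.0, 4.0)]
kept = overlap_undersample(majority, minority, k=1)
print(kept)
```

Only the majority point lying in the minority region is discarded; the well-separated majority points are kept.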

The approaches proposed by Shilaskar et al. (2017) and Liu et al. (2020) appear to be the most effective for thyroid detection using the “New Thyroid Dataset”. Liu et al. (2020) combined SMOTE with a cross-validated committee filter (CVCF) and an SVM ensemble, and Shilaskar et al. (2017) combined oversampling and undersampling with an SVM optimized using genetic algorithms. Moreover, the combined techniques approach of Shaw et al. (2021) produces excellent results, as does the optimization-based approach of Nalluri et al. (2020). Shaw et al. undersample the majority class with three different techniques and then combine the selected samples with the minority class, using AdaBoost for prediction. Additionally, the combined techniques of Shaw et al. (2021) and Desuky et al. (2021) likely surpass other approaches on the “Liver Disorder Dataset” and the “Mammographic Mass Dataset”, respectively, delivering optimal diagnoses. The latter applies sampling with an ensemble based on a crossover genetic operator to handle class imbalance.

Among the studies reporting high classification performance on the reference medical datasets, preprocessing-level methods are prevalent, which is plausibly due to how widely they are used in the reviewed literature: around sixty-one papers addressed class imbalance through preprocessing, with hybrid sampling presented in 20 research works, undersampling in 18, and oversampling in 17. Even the studies based on combined techniques, which are likely to outperform, use sampling techniques. Moreover, the research proposing feature-level methods indicates promising results; this could be a prominent research line, especially in sensitive clinical applications, because it avoids the reliability issues of synthetic samples. Learning-level methods, for their part, are also reported as efficient on some medical datasets. The distinct specifics of the datasets detailed in Table 1, coupled with the diversity of methodologies explored in the existing literature, suggest that the findings are context-dependent and may not be broadly applicable, emphasizing the need for cautious interpretation and an understanding of the limited scope.

Objectives in class imbalance research for medical applications. The reference datasets presented in Table 1 are used repeatedly across methodological frameworks, whether to evaluate a class imbalance approach designed for diagnosis or to study a specific disease. A shared objective of those studies is the evaluation of the proposed approach on medical data exhibiting a certain degree of imbalance. The objectives normally set in ML for medical diagnosis research depend on the given data and on the medical application and its relevance; across the studied research, however, we observed a lack of specificity, with terms like ‘diagnosis,’ ‘prediction,’ ‘classification,’ and ‘early detection’ employed interchangeably. This could be attributed to the overarching challenges of class imbalance, which seem to outweigh the need for clear differentiation in study objectives. Regardless of the stated goal, the primary concern often remains the performance metrics of the learning algorithms under class imbalance, leading to a uniform approach in evaluating methodologies across different medical objectives. This issue is compounded by the general absence of transparent reporting in the literature, where distinctions between medical applications are often vague. Notably, this is less the case in works specifically targeting mortality prediction, which tend to demonstrate a clearer connection between methodological choices and their clinical implications. To enhance the clarity and applicability of research in this field, there is a need for more precise definitions of study objectives, specialized methodologies that directly address these objectives, and transparent reporting that links specific methodological approaches to their clinical outcomes.

Transparency in class imbalance approaches. The literature often lacks detailed descriptions of datasets, methodologies, and experimental implementations, which limits the depth of analysis to an exploratory level. For instance, descriptions of data-level methods, such as those based on sampling ratios, frequently omit details like the post-balancing data distribution. Even when aspects like evaluation techniques, preprocessing steps, and underlying learning algorithms are well documented, they add layers of complexity that complicate straightforward observational synthesis. As such, including diverse details from the reviewed works increases the complexity of the synthesis process and necessitates a more intensive investigative approach that transcends traditional observational efforts. This demands methodologies that go beyond mere description and rigorously examine methodologies, results, and their interrelations within the broader research landscape to achieve a more comprehensive analysis.

Standardization issues in class imbalance. The variability in class imbalance degrees across datasets reviewed herein spotlights a significant challenge in medical research. What may be deemed highly imbalanced in one study might only present as moderate in another, reflecting the quantitative differences and the diverse challenges each dataset presents. For example, slight imbalances in one dataset could be more problematic than severe imbalances in another, depending on factors such as the complexity of the medical conditions involved or data quality issues. This variability highlights the necessity for context-specific approaches in handling class imbalances, where the unique characteristics of each dataset are considered in the development and application of methodologies.

Furthermore, the absence of a universally accepted standard for quantifying the severity of class imbalance complicates the comparison of results across different studies and hampers the development of potentially broadly effective solutions. This lack of standardization calls for establishing clear metrics that could guide researchers in accurately classifying and reporting the degree of imbalance. Enhanced reporting standards and systematic analysis approaches are essential to facilitate a more consistent evaluation of method effectiveness across varied research contexts. By advocating for standardized quantification and comprehensive reporting, the research community can better understand the impact of class imbalance on medical diagnostics and develop more adaptable methodologies to improve the reliability and generalizability of outcomes in medical research.

9 Value and limitations

This comprehensive review of the literature on class imbalance contributes a new detailed classification of class imbalance methods and informative statistics on the evaluation metrics and medical datasets, and it is further enhanced with practical insights obtained by synthesizing the literature findings on the reference medical datasets with class imbalance. We aimed to extend the in-depth literature review with an overview of the experimental outcomes of proposed class imbalance methodologies that could not be reproduced. We also intended to provide the reader with a contextual analysis that describes the findings in light of the reported settings; it was, however, difficult to cover all the factors involved in previous research, because general descriptions often omit necessary configurations and methodological procedures.

Therefore, the experimental insights of this review, that is, the presented synthesis of research outcomes, have some limitations that could not be resolved in the current work. They stem from correspondence issues with the principal question and from the difficulty of establishing a thorough comparative analysis that controls all the factors affecting experiments with imbalanced medical data. These factors include, but are not limited to, the data size, the data dimensionality, the preprocessing procedures, the underlying learning algorithm, the imbalance ratio, and the parameters of the class imbalance method itself, such as the definition of the cost matrix in cost-sensitive learning methods or the imbalance ratio in data-level approaches.

Thus, the discussion of the overviewed findings is indicative and descriptive, and it points to the need for an exhaustive experimental review to derive decisive and generalizable conclusions. Despite these limitations, our work maps out the landscape of existing research and emphasizes the variability and complexity of approaches, suggesting a pressing need for standardization in research reporting and methodology. By highlighting these areas, we contribute to a deeper understanding of how class imbalance affects medical dataset analysis and point towards areas where further research and more refined methodologies are needed.

10 Trends and research directions

This section scrutinizes the predominant trends and emergent strategies in addressing class imbalance within medical datasets as identified in our comprehensive review of the past decade’s literature. We feature key methodological innovations and the evolving paradigms that have shaped current approaches to managing imbalances in medical diagnostic data. Our analysis outlines the methodologies and links them to their potential impacts on enhancing diagnostic accuracy and clinical applicability.

Oversampling. Researchers usually divide the minority class into three clusters: outliers (also called noise), safe samples, and in-danger (overlapped) samples. Where the distributions of the majority class and the minority class overlap, the samples are borderline samples, also known as in-danger or overlapped samples. The safe cluster contains only samples that lie within the minority distribution. Outliers or noisy samples lie at the extremes of the distribution, far from the mean of the minority samples. After this partition, some researchers keep and oversample only the safe samples (Xu et al. 2021) and treat in-danger samples and outliers as noise to be deleted. However, in-danger samples, that is, samples on the borderline, are important in discriminating the minority from the majority, especially in medical diagnosis and medical applications in general. Other researchers (Han et al. 2019) adopt a partition into four categories: noise samples, border samples, unstable samples, and stable samples, where only noisy samples are deleted. There is no unified partition of samples, and this is what differentiates one sampling algorithm from another. Furthermore, since the partition depends on the distribution of the samples, future research should explore how to derive it automatically from the data distribution so as to retain the characteristics of the primary data.
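One common way to obtain such a three-way partition is the neighbour-count rule used by Borderline-SMOTE, sketched below. It is given only as an illustration and is not the partition of Xu et al. (2021) or Han et al. (2019). For each minority sample, the m nearest neighbours in the whole training set are inspected: the sample is labelled noise if all of them belong to the majority class, in-danger if at least half do, and safe otherwise. The function name and the one-dimensional toy data are assumptions.

```python
import math


def partition_minority(minority, majority, m=3):
    """Label each minority sample as safe, danger or noise from the number
    of majority samples among its m nearest neighbours.

    Assumes the training set holds at least m + 1 samples.
    """
    pooled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    groups = {"safe": [], "danger": [], "noise": []}
    for idx, query in enumerate(minority):
        others = [i for i in range(len(pooled)) if i != idx]
        others.sort(key=lambda i: math.dist(pooled[i][0], query))
        n_major = sum(1 for i in others[:m] if pooled[i][1] == 0)
        if n_major == m:
            groups["noise"].append(query)    # surrounded by the majority
        elif n_major >= m / 2:
            groups["danger"].append(query)   # borderline sample
        else:
            groups["safe"].append(query)
    return groups


# Toy one-dimensional data, written as 1-tuples.
majority = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.2,)]
minority = [(2.1,), (5.0,), (5.9,), (6.5,), (7.0,)]
groups = partition_minority(minority, majority, m=3)
print(groups)
```

On this toy data the isolated sample at 2.1 is labelled noise, the sample at 5.0 is in danger, and the remaining three are safe. Which of these groups is then oversampled, kept, or deleted is exactly where the sampling algorithms discussed above differ.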

Undersampling. Research works such as the Tversky similarity-based undersampling (Kamaladevi and Venkatraman 2021) remove both the noise and the border samples from the majority class, which we consider a major mistake. Samples in the borderline space, also called the danger space, are hard to classify correctly, and they are the most critical samples: a classifier that learns to classify them is much more likely to succeed in classifying new samples, because they carry the recognition patterns of both minority and majority samples. They remain hard to learn from, because this is where the distributions of the minority class and the majority class intersect. Samples in the border space can therefore be exploited to improve the classification of imbalanced data rather than deleted, and dedicated methods can be developed for this purpose.

Algorithmic solutions complexity with preprocessing simplicity (deep learning and ensemble). Another existing trend is to use ensemble learning or complex algorithms such as deep learning (neural networks, graph neural networks) combined with optimization processes, without a preprocessing phase for the minority samples. Such works deal with class imbalance at the algorithmic level, by optimizing the parameters and/or structure of the classification algorithm, rather than at the data level. Similar works combine deep learning algorithms with cost-sensitive learning by adding misclassification weights to the training phase. What these approaches have in common is simple preprocessing and a focus on the learning phase. The learning phase, however, becomes complex in several works, which involve stacking, ensembles, deep learning algorithms with or without optimization, and cost-sensitive learning combined with deep learning or the previously mentioned methods.
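Adding misclassification weights to the training phase can be illustrated with a class-weighted binary cross-entropy. The sketch below is a generic example rather than the loss of any particular reviewed study. It assumes weights inversely proportional to class frequency, the same heuristic as the `class_weight="balanced"` option in scikit-learn; the function names and the toy labels are illustrative.

```python
import math


def class_weights(labels):
    """Weights inversely proportional to class frequency (assumed heuristic)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Mean binary cross-entropy with a per-class misclassification weight."""
    eps = 1e-12  # keeps the logarithm finite for probabilities of 0 or 1
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= weights[y] * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy labels: 8 non-diseased (0) and 2 diseased (1) samples.
weights = class_weights([0] * 8 + [1] * 2)
print(weights)

# The same confident mistake costs more on a diseased case.
missed_case = weighted_log_loss([1], [0.1], weights)
false_alarm = weighted_log_loss([0], [0.9], weights)
print(missed_case / false_alarm)
```

With these weights, a missed diseased case contributes four times as much to the loss as an equally confident false alarm, which is the mechanism by which cost-sensitive training shifts the classifier towards the minority class.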

Genetic algorithms for optimization Another prevailing trend is the use of genetic algorithms (GA) to optimize the learning classifier or the sampling technique. Researchers have proposed multi-objective functions, but these are not yet well explored and may be a future research direction. GA has been used in undersampling and yielded good results; however, the proposed GA-based algorithms do not optimize their own parameter settings, which could significantly improve the performance of such methods.

SMOTE performance SMOTE is consistently used as the reference in the comparative analysis of any work proposing a new method. Reviewing these results shows that SMOTE maintains stable and good performance no matter how the class imbalance severity changes or how the learning process is designed. Moreover, even when newly proposed methods (sampling or algorithmic techniques) surpass SMOTE in some classification metrics, SMOTE still achieves better or similar results on the remaining metrics. A sampling method that exceeds SMOTE on all classification assessment metrics has yet to be found, even though the known disadvantages of SMOTE techniques include synthesizing noisy and overlapped samples.

Feature selection Another approach in the literature uses feature selection to tackle class imbalance in medical data and reports good results. However, this approach is not well explored: only a few efforts on imbalanced medical datasets combine feature selection techniques with improved classifiers. In our context, many reviewed papers include feature selection in the preprocessing phase, but how feature selection can be an effective solution to class imbalance is a question that should be thoroughly discussed.

The compromise between sensitivity and specificity Another point regarding sampling in general, and imbalance in medical diagnosis in particular, is the problem of finding a compromise between correctly predicted diseased people and correctly detected non-diseased people, namely between sensitivity and specificity. In our context, the trade-off between these measures is discussed by only one study, although it is a long-standing issue that has been ignored. The reason is that the unhealthy class represents rare cases: the focus is on predicting unhealthy patients so as to provide early treatment and lessen dangerous complications, so it is treated as the class of interest. Nevertheless, intelligent systems for medical diagnosis or diagnostic aid should attend carefully to both classes as a more advanced level of intelligence. Future research may consider developing methods that perform well in classifying both diseased and non-diseased people, as an advance on existing approaches to imbalanced data classification.
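
The trade-off can be seen by sweeping the decision threshold of a scoring classifier. The sketch below assumes hypothetical predicted scores for three diseased and seven non-diseased patients and, as one possible selection criterion among several, picks the threshold with the largest Youden's J (sensitivity + specificity - 1).

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Classify score >= threshold as diseased and return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Hypothetical labels (1 = diseased) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05]

results = {t: sensitivity_specificity(y_true, scores, t) for t in (0.3, 0.5, 0.8)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")

# One possible compromise: maximise Youden's J = sensitivity + specificity - 1.
best_threshold = max(results, key=lambda t: sum(results[t]) - 1)
print("threshold with the largest Youden's J:", best_threshold)
```

Lowering the threshold catches more diseased patients at the price of more false alarms, so reporting only one of the two measures hides half of the picture.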

Enhancing ensemble learning Modifying the ensemble selection (for example, dynamic ensemble selection), modifying the structure of the ensemble members, and making the ensemble cost-sensitive all need more investigation to evaluate their effectiveness. Stacking is rarely found in the literature (Gupta and Gupta 2022); however, it shows considerable performance, as does combining ensembles with cost-sensitive learning.

Simple classifiers Postprocessing (hyperparameter fine-tuning) or preprocessing procedures such as feature selection show significant performance. Another study (Zhao et al. 2022) proposed a simple learning approach, an ensemble of KNN with weighted voting, that also leads to good results. Thus, simple, unsophisticated algorithms that are easy to implement and interpret, used without the classic solutions for handling class imbalance, achieved significant accuracy and recall in the reviewed papers on simple classifiers.

Synthetic data and original data The use of synthetic data is prevalent in addressing class imbalance in medical datasets. However, ensuring that these data accurately reflect the real-world characteristics of the original datasets is essential to prevent the introduction of biases that could compromise the fairness of medical diagnostics. Statistical tests that verify the similarity between synthetic and real data are necessary to maintain the integrity of medical models, as initiated in Rodriguez-Almeida et al. (2022). This coherence is vital for the accuracy of the models and for ensuring that they do not perpetuate or exacerbate existing disparities in diagnostic outcomes. Future research should focus on developing methods that produce synthetic data that are both representative and equitable, promoting fairness in medical diagnostics by adhering to rigorous standards that prevent bias and enhance the generalizability of research findings across diverse patient populations. This approach will support the broader goal of equitable healthcare by ensuring that advancements in medical diagnostics are accessible and beneficial to all population segments, thus upholding ethical standards in medical research and practice.
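
One way such a similarity check could be carried out, feature by feature, is a two-sample Kolmogorov-Smirnov test. The sketch below is an assumption on our part and does not reproduce the tests used in the cited work; the feature (a blood-pressure-like measurement), the sample sizes, and the two synthetic generators are hypothetical, and the critical value uses the usual large-sample approximation.

```python
import math
import random

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value; a larger statistic rejects 'same distribution'."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))

rng = random.Random(0)
real = [rng.gauss(120, 15) for _ in range(200)]             # hypothetical real feature
faithful = [rng.gauss(120, 15) for _ in range(200)]         # generator matching the real data
shifted = [rng.gauss(140, 15) for _ in range(200)]          # generator with a biased mean

critical = ks_critical_value(len(real), len(shifted))
d_faithful = ks_statistic(real, faithful)
d_shifted = ks_statistic(real, shifted)
print("critical value:", round(critical, 3))
print("faithful generator:", round(d_faithful, 3))
print("shifted generator:", round(d_shifted, 3))
```

A per-feature test of this kind only compares marginal distributions; checking that relationships between features are preserved would need additional multivariate checks.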

Interpretability and explainability In the last decade, many machine/deep learning algorithms have emerged to tackle the issue of class imbalance in medical diagnosis. We have observed a progressive evolution towards increasingly sophisticated and intricate models throughout the literature. While these algorithms frequently exhibit promising results in research environments, a significant disparity exists in their practical implementation within clinical settings. This discrepancy primarily stems from the need for more interpretability and trust among practitioners, especially in critical medical contexts. In light of these considerations, future research endeavours should prioritize the development of algorithms that address imbalanced diagnoses while offering interpretability. Such models promise to enhance transparency in decision-making processes, thereby enabling greater understanding and trust among practitioners. This, in turn, paves the way for improved acceptance and adoption rates. Diverse approaches can be explored to achieve explainability, including model-agnostic techniques and post-hoc explanations. Such strategies facilitate domain experts’ comprehension of complex model behaviours, even in cases where the proposed models lack inherent interpretability.

Computational efficiency and clinical deployment Another practical challenge associated with complex models in addressing imbalanced diagnoses is their computational efficiency, which directly influences their usability in clinical settings. By prioritizing computational efficiency in model development, researchers can effectively bridge the gap between sophisticated machine/deep learning models and their practical deployment by practitioners. This emphasis ensures that the models offer advanced capabilities and are feasible for real-world implementation.

Federated learning Another crucial research direction is addressing ethical concerns using federated learning models. This decentralized approach enables training models locally on distributed servers while safeguarding data privacy. Moreover, it proves advantageous in mitigating bias in data collection by training models across various geographic healthcare institutions. This broader representation holds the potential to yield more balanced models, particularly beneficial when class imbalance issues stem from bias in data collection rather than inherent population characteristics.
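
As a rough illustration of the aggregation step in such a decentralized setup, the sketch below assumes the widely used federated averaging scheme, in which each institution trains locally and only its model parameters, weighted by its number of samples, are combined on a central server. The parameter vectors and sample counts are hypothetical, and local training itself is not shown.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[d] * size for params, size in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]

# Hypothetical parameters learned locally at three hospitals; raw patient
# records never leave the hospitals, only these vectors are shared.
client_params = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
client_sizes = [100, 100, 200]

global_params = federated_average(client_params, client_sizes)
print(global_params)
```

In practice this step is repeated over many rounds, with the averaged parameters sent back to the institutions as the starting point for the next round of local training.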

Deep learning approach in tabular imbalanced medical data This has been an active research area lately, whether through data generation using GANs and their variants, graph-based deep learning approaches, or newly suggested probabilistic neural networks. Researchers addressing class imbalance in medical data have recently proposed sophisticated methodologies, which calls for a comparative analysis with traditional approaches to better elucidate their differences. Among these innovations, applying deep learning techniques such as Generative Adversarial Networks (GANs) for data generation, combined with classical machine learning algorithms, sampling, and cost-sensitive techniques, has yielded remarkable results.

Despite these technological advancements, the foundational aspect of data integrity remains critical. We cannot overstate the importance of establishing structured data collection designs that preserve the inherent population characteristics and ensure the representativeness of the collected sample. Such rigor in data collection is essential to avoid the injection of bias, which can skew the outcomes of even the most advanced analytical methods. As the field progresses, both cutting-edge technology and accurate data management must be harmonized to address the complexities of class imbalance in medical data fully. Additionally, integrating domain expertise in model training is crucial in ensuring these technologies are technically advanced and clinically relevant. Combining deep medical insights with innovative machine learning techniques enhances diagnostic tools’ accuracy and applicability, supporting the ultimate goal of improving healthcare outcomes through more sophisticated and informed data science practices.

11 Mispractices and consensus on handling imbalanced data

This literature review revealed mispractices in the strategies proposed for class imbalance, particularly in medical data. These mispractices prevent the accurate evaluation of proposed methodologies and undermine any comparative study. In this section, we present the common serious mistakes that exist in the literature and that scholars still make when treating imbalanced data, and we propose best practices instead. These best practices must be adopted in this line of research, and there is an implicit consensus among researchers on them. Stating this consensus in our literature review is therefore indispensable to advance the state of the art and, in the future, to obtain better and more effective tools to combat class imbalance. If these mispractices are not addressed, proposed methods may be evaluated inappropriately, and future research will miss sound starting points and build on wrongly presented methods.

Overall performance measures in class imbalance methods Using general metrics to evaluate the performance of models on imbalanced data remains a critical issue. According to this literature review, many research works selected only a few metrics, or even a single metric such as accuracy. Relying only on accuracy, AUC-ROC, and F-measure does not uncover the real effectiveness of the model, because of the imbalance in the data used. What is needed instead are metrics that reveal the model’s performance on each class. Therefore, a shift toward using sensitivity, specificity, and other per-class metrics is required.
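
A small worked example shows why a single overall metric can mislead. The sketch below assumes a hypothetical test set of 95 healthy and 5 diseased patients and a degenerate classifier that labels everyone healthy; the function name `per_class_report` is illustrative.

```python
def per_class_report(y_true, y_pred, positive=1):
    """Return sensitivity, specificity, and accuracy from true and predicted labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall on the diseased class
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # recall on the healthy class
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical imbalanced test set and a classifier that always predicts "healthy".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

report = per_class_report(y_true, y_pred)
print(report)
```

The accuracy of 95% looks excellent, yet the sensitivity of zero shows that not a single diseased patient was detected.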

Data partition with data augmentation Augmenting the minority samples in an imbalanced dataset is a way to balance it. It is commonly used in the literature and can be performed with any oversampling method. Usually, the selected sampling technique is applied to the training dataset, so the machine learning model learns on balanced data in which the classes are equally represented. Conventional machine learning models are designed for equally distributed data and expect the same of their training datasets. Consequently, when the training data are resampled, the learning algorithm gives equal attention to the majority and minority classes. Data partition divides the data into training and test sets to select the best-performing model, and only the training set participates in fitting the model. The test set should undergo the same preprocessing as the training set, but it should not be resampled: it must retain the class distribution of real-world data. Testing the trained model on balanced data in the context of class imbalance leads to unrealistic results and misleading estimates of prediction performance. Likewise, proposing a new sampling method and applying it before the train-test split gives a misleading picture of its effectiveness and makes any comparison with other research works useless. Highlighting this best practice for oversampling is therefore necessary to prevent such misconduct in future research.
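
The recommended order of operations can be sketched as follows: split first, then resample the training portion only. The sketch assumes a stratified split and plain random oversampling (duplicating minority samples with replacement) as a stand-in for whichever oversampling method is being studied; the helper names and the toy dataset are hypothetical.

```python
import random
from collections import Counter

def stratified_split(y, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, count in counts.items():
        pool = [x for x, label in zip(X, y) if label == cls]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out

# Hypothetical dataset: 90 majority and 10 minority samples, one feature each.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# 1) Split first, so the test set keeps the real-world class distribution.
train_idx, test_idx = stratified_split(y)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# 2) Oversample the training set only; the test set is left untouched.
X_train_bal, y_train_bal = random_oversample(X_train, y_train)

print("train before:", Counter(y_train))
print("train after: ", Counter(y_train_bal))
print("test:        ", Counter(y_test))
```

Oversampling before the split would instead let copies of the same minority sample land in both the training and test sets, which is one reason results obtained that way are overly optimistic.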

Consensus on performance evaluation metrics in class imbalance Researchers in class imbalance should share a common set of evaluation metrics for the benefit of future research. Establishing a set of mandatory metrics for work on imbalanced data appears unavoidable. The set should include a variety of metrics so as to measure the real performance of proposed approaches effectively. As a first attempt, we suggest the following: sensitivity, specificity, and accuracy.

12 Conclusion

This paper presents the first comprehensive review of the literature addressing class imbalance in medical data, analyzing a decade’s worth of research. Through a rigorous search methodology, 137 research articles were deemed relevant and subjected to a critical evaluation within a structured framework. First, the review introduces a novel classification of class imbalance methods, categorizing them into three primary approaches: preprocessing, learning, and combined techniques. This categorization facilitates a nuanced exploration of contemporary techniques by further subdividing them into detailed subclasses.

Specifically, the learning approach is divided into six subclasses: cost-sensitive learning, optimization techniques, simple classifiers, ensemble learning, deep learning algorithms, and unsupervised learning. Similarly, the preprocessing approach comprises two detailed subclasses. The third category consists of combined techniques and comparative studies of different approaches. Furthermore, the paper provides an extensive overview and descriptive statistics of the medical datasets and evaluation metrics utilized in the reviewed literature, thoroughly examining current research practices and conventions.

Moreover, by synthesizing the outcomes of previous studies on reference medical datasets, this review provides an exploratory overview of the field’s current state, identifying key trends and gaps that future research must address while clarifying related implications and the limited scope of our observational reflections. The trends found in the literature have been comprehensively explained, and the prominent future research directions have been pointed out, providing plausible starting points for research. Finally, we presented methodological strategies and procedural guidelines that can be implemented to improve research studies in class imbalance, with the aim of increasing the robustness, reliability, and generalizability of findings. The consensus should be broadly acknowledged so that the community aligns its practices toward devising optimal strategies for the class imbalance issue.

Availability of data and materials

Not applicable.

Code Availability

References

Abd Elrahman SM, Abraham A (2013) A review of class imbalance problem. J Netw Innov Comput 1(2013):332–340

Alamsyah ARB, Anisa SR, Belinda NS, Setiawan A (2021) Smote and nearmiss methods for disease classification with unbalanced data: case study: Ifls 5. Proc Int Confer Data Sci Offic Stat 2021:305–314

Alashban M, Abubacker NF (2020) Blood glucose classification to identify a dietary plan for high-risk patients of coronary heart disease using imbalanced data techniques. In: Computational science and technology: 6th ICCST 2019, Kota Kinabalu, Malaysia, 29–30 August 2019. Springer, pp 445–455

Albuquerque J, Medeiros AM, Alves AC, Bourbon M, Antunes M (2022) Comparative study on the performance of different classification algorithms, combined with pre-and post-processing techniques to handle imbalanced data, in the diagnosis of adult patients with familial hypercholesterolemia. PLoS One 17(6):1–19

Alhassan Z, Budgen D, Alshammari R, Daghstani T, McGough AS, Al Moubayed N (2018) Stacked denoising autoencoders for mortality risk prediction using imbalanced clinical data. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 541–546

Ali H, Salleh MNM, Saedudin R, Hussain K, Mushtaq MF (2019) Imbalance class problems in data mining: a review. Indones J Electr Eng Comput Sci 14(3):1560–1571

Al-Shamaa ZZ, Kurnaz S, Duru AD, Peppa N, Mirnezami AH, Hamady ZZ et al (2020) The use of Hellinger distance undersampling model to improve the classification of disease class in imbalanced medical datasets. Appl Bionics Biomech 2020:1–10

Alves JS, Bazán JL, Arellano-Valle RB (2023) Flexible cloglog links for binomial regression models as an alternative for imbalanced medical data. Biom J 65(3):2100325

Arbain AN, Balakrishnan BYP (2019) A comparison of data mining algorithms for liver disease prediction on imbalanced data. Int J Data Sci Adv Analyt (ISSN 2563-4429) 1(1):1–11

Augustine J, Jereesh A (2022) An ensemble feature selection framework for the early non-invasive prediction of Parkinson’s disease from imbalanced microarray data. In: Advances in computing and data sciences: 6th international conference, ICACDS 2022, Kurnool, India, April 22–23, 2022, revised selected papers, Part II. Springer, pp 1–11

Awon VK, Balloccu S, Wu Z, Reiter E, Helaouie R, Reforgiato Recupero D, Riboni D (2022) Data augmentation for reliability and fairness in counselling quality classification. In: Proceedings of the 1st workshop on scarce data in artificial intelligence for healthcare (SDAIH 2022). SciTePress

Babar V (2021) Classification of imbalanced data of medical diagnosis using sampling techniques. Commun Appl Electron 7:7–12

Babar V, Ade R (2016) A novel approach for handling imbalanced data in medical diagnosis using undersampling technique. Commun Appl Electron 5:36–42

Baniasadi A, Rezaeirad S, Zare H, Ghassemi MM (2020) Two-step imputation and adaboost-based classification for early prediction of sepsis on imbalanced clinical data. Crit Care Med 49(1):e91–e97

Belarouci S, Bouchikhi S, Chikh MA (2016) Comparative study of balancing methods: case of imbalanced medical data. Int J Biomed Eng Technol 21(3):247–263

Bhattacharya M, Jurkovitz C, Shatkay H (2017) Assessing chronic kidney disease from office visit records using hierarchical meta-classification of an imbalanced dataset. In: 2017 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 663–670

Bi W, Ma R (2021) Unbalanced data set processing method for colorectal cancer prediction in tcm diagnosis. In: 2020 IEEE international conference on E-health networking, application & services (HEALTHCOM). IEEE, pp 1–6

Britto CF, Ali ARH (2021) Prostate cancer diagnosis model with the handling of multi-class imbalance through the adaptive weighting based deep learning model. EFFLATOUNIA-Multidiscipl J 5(2):3204–3212

Cai T, He H, Zhang W (2018) Breast cancer diagnosis using imbalanced learning and ensemble method. Appl Comput Math 7(3):146–154

Chan TM, Li Y, Chiau CC, Zhu J, Jiang J, Huo Y (2017) Imbalanced target prediction with pattern discovery on clinical data repositories. BMC Med Inform Decis Mak 17:1–12

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

Cheng CH, Wang YC (2020) A novel multi-combined method for handling medical dataset with imbalanced classes problem. Adv Math: Sci J 9:6623–6629

Cheng Z, Liu Z, Yang G (2022) Diagnosis of arrhythmia based on multi-scale feature fusion and imbalanced data. In: 2022 7th international conference on machine learning technologies (ICMLT), pp 92–98

Çinaroğlu S (2017) Ensemble learning methods to deal with imbalanced disease and left-skewed cost data. Am J Bioinformat Res 7(1):1–8

Dai D, Hua S (2016) Random under-sampling ensemble methods for highly imbalanced rare disease classification. In: Proceedings of the international conference on data science (ICDATA), p 54

Desuky AS, Omar AH, Mostafa NM (2021) Boosting with crossover for improving imbalanced medical datasets classification. Bull Electr Eng Informat 10(5):2733–2741

Dhanusha C, Kumar AS, Villanueva L (2022) Enhanced contrast pattern based classifier for handling class imbalance in heterogeneous multidomain datasets of Alzheimer disease detection. In: Applications of artificial intelligence and machine learning: select proceedings of ICAAAIML 2021. Springer, pp 801–814

Drosou K, Georgiou S, Koukouvinos C, Stylianou S (2014) Support vector machines classification on class imbalanced data: a case study with real medical data. J Data Sci 12(4):727–753

El-Baz A (2015) Hybrid intelligent system-based rough set and ensemble classifier for breast cancer diagnosis. Neural Comput Appl 26:437–446

Fahmi A, Muqtadiroh FA, Purwitasari D, Sumpeno S, Purnomo MH (2022) A multi-class classification of dengue infection cases with feature selection in imbalanced clinical diagnosis data. Int J Intell Eng Syst 15(3):2022

Farquad MAH, Bose I (2012) Preprocessing unbalanced data using support vector machine. Decis Support Syst 53(1):226–233

Feng Y, Li J (2021) A novel $\alpha$ distance borderline-adasyn-smote algorithm for imbalanced data and its application in Alzheimer’s disease classification based on dense convolutional network. In: Journal of physics: conference series, vol 2031. IOP Publishing, p 012046

Fernando C, Weerasinghe P, Walgampaya C (2022) Heart disease risk identification using machine learning techniques for a highly imbalanced dataset: a comparative study. KDU J Multi Stud 4(2):43–55. https://doi.org/10.4038/kjms.v4i2.50

Fotouhi S, Asadi S, Kattan MW (2019) A comprehensive data level analysis for cancer diagnosis on imbalanced data. J Biomed Inform 90:103089

Fujiwara K, Huang Y, Hori K, Nishioji K, Kobayashi M, Kamaguchi M, Kano M (2020) Over-and under-sampling approach for extremely imbalanced and small minority data problem in health record analysis. Front Public Health 8:178

Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F (2011) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst, Man, Cybern Part C (Appl Rev) 42(4):463–484

Gan D, Shen J, An B, Xu M, Liu N (2020) Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis. Comput Ind Eng 140:106266

Gao T, Hao Y, Zhang H, Hu L, Li H, Li H, Hu L, Han B (2018) Predicting pathological response to neoadjuvant chemotherapy in breast cancer patients based on imbalanced clinical data. Pers Ubiquit Comput 22:1039–1047

Ghorbani M, Kazi A, Baghshah MS, Rabiee HR, Navab N (2022) Ra-gcn: graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal 75:102272

Guo H, Liu H, Wu CA, Liu W, She W (2018) Ensemble of rotation trees for imbalanced medical datasets. J Healthc Eng 2018:8902981. https://doi.org/10.1155/2018/8902981

Gupta S, Gupta MK (2022) A comprehensive data-level investigation of cancer diagnosis on imbalanced data. Comput Intell 38(1):156–186

Gupta R, Bhargava R, Jayabalan M (2021) Diagnosis of breast cancer on imbalanced dataset using various sampling techniques and machine learning models. In: 2021 14th international conference on developments in esystems engineering (DeSE). IEEE, pp 162–167

Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl 73:220–239

Hallaji E, Razavi-Far R, Palade V, Saif M (2021) Adversarial learning on incomplete and imbalanced medical data for robust survival prediction of liver transplant patients. IEEE Access 9:73641–73650

Han W, Huang Z, Li S, Jia Y (2019) Distribution-sensitive unbalanced data oversampling method for medical diagnosis. J Med Syst 43:1–10

Hassan MM, Amiri N (2019) Classification of imbalanced data of diabetes disease using machine learning algorithms. Age (Years) 21(81):24–33

He F, Yang H, Miao Y, Louis R (2016) A cost sensitive and class-imbalance classification method based on neural network for disease diagnosis. In: 2016 8th international conference on information technology in medicine and education (ITME). IEEE, pp 7–10

Huda S, Yearwood J, Jelinek HF, Hassan MM, Fortino G, Buckland M (2016) A hybrid feature selection with ensemble classification for imbalanced healthcare data: a case study for brain tumor diagnosis. IEEE Access 4:9145–9154

Huo Z, Qian X, Huang S, Wang Z, Mortazavi BJ (2022) Density-aware personalized training for risk prediction in imbalanced medical data. In: Machine learning for healthcare conference. PMLR, pp 101–122

Ibrahim MH (2022) A SALP swarm-based under-sampling approach for medical imbalanced data classification. Avrupa Bilim ve Teknoloji Dergisi 34:396–402

Iori M, Di Castelnuovo C, Verzellesi L, Meglioli G, Lippolis DG, Nitrosi A, Monelli F, Besutti G, Trojani V, Bertolini M et al (2022) Mortality prediction of covid-19 patients using radiomic and neural network features extracted from a wide chest x-ray sample size: A robust approach for different medical imbalanced scenarios. Appl Sci 12(8):3903

Izonin I, Tkachenko R, Greguš M (2022) I-pnn: an improved probabilistic neural network for binary classification of imbalanced medical data. In: Database and expert systems applications: 33rd international conference, DEXA 2022, Vienna, Austria, August 22–24, 2022, Proceedings, Part II. Springer, pp 147–157

Jain A, Ratnoo S, Kumar D (2017) Addressing class imbalance problem in medical diagnosis: a genetic algorithm approach. In: 2017 international conference on information, communication, instrumentation and control (ICICIC). IEEE, pp 1–8

Jain A, Ratnoo S, Kumar D (2023) A novel multi-objective genetic algorithm approach to address class imbalance for disease diagnosis. Int J Info Technol 15:1151–1166. https://doi.org/10.1007/s41870-020-00471-3

Kamaladevi M, Venkatraman V (2021) Tversky similarity based undersampling with Gaussian kernelized decision stump adaboost algorithm for imbalanced medical data classification. Int J Comp Commun Control 16(6):4291. https://doi.org/10.15837/ijccc.2021.6.4291

Kinal M, Woźniak M (2020) Data preprocessing for des-knn and its application to imbalanced medical data classification. In: Intelligent information and database systems: 12th Asian conference, ACIIDS 2020, Phuket, Thailand, March 23–26, 2020, Proceedings, Part I 12. Springer, pp 589–599

Kitchenham B (2004) Procedures for performing systematic reviews. Keele, UK, Keele Univer 33(2004):1–26

Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Progr Artif Intell 5(4):221–232

Krishnan U, Sangar P (2021) A rebalancing framework for classification of imbalanced medical appointment no-show data. J Data Inf Sci 6(1):178–192

Ksiaa W, Rejab FB, Nouira K (2021) Tuning hyperparameters on unbalanced medical data using support vector machine and online and active svm. In: Intelligent systems design and applications: 20th international conference on intelligent systems design and applications (ISDA 2020) held December 12–15, 2020. Springer, pp 1134–1144

Kumar P, Bhatnagar R, Gaur K, Bhatnagar A (2021) Classification of imbalanced data: review of methods and applications. In: IOP conference series: materials science and engineering, vol 1099. IOP Publishing, p 012077

Kumar V, Medda G, Recupero DR, Riboni D, Helaoui R, Fenu G (2023) How do you feel? Information retrieval in psychotherapy and fair ranking assessment. In: International workshop on algorithmic bias in search and recommendation. Springer, pp 119–133

Kumar P, Thakur RS (2019) Diagnosis of liver disorder using fuzzy adaptive and neighbor weighted k-nn method for lft imbalanced data. In: 2019 international conference on smart structures and systems (ICSSS). IEEE, pp 1–5

Lamari M, Azizi N, Hammami NE, Boukhamla A, Cheriguene S, Dendani N, Benzebouchi NE (2021) Smote–enn-based data sampling and improved dynamic ensemble selection for imbalanced medical data classification. In: Advances on smart and soft computing: proceedings of ICACIn 2020. Springer, pp 37–49

Lan ZC, Huang GY, Li YP, Rho S, Vimal S, Chen BW (2023) Conquering insufficient/imbalanced data learning for the internet of medical things. Neural Comput Appl 35:22949–22958. https://doi.org/10.1007/s00521-022-06897-z

Lee J, Wu Y, Kim H (2015) Unbalanced data classification using support vector machines with active learning on scleroderma lung disease patterns. J Appl Stat 42(3):676–689

Li Y, Hsu WW, Initiative ADN (2022) A classification for complex imbalanced data in disease screening and early diagnosis. Stat Med 41(19):3679–3695

Lijun L, Tingting J, Meiya H (2018) Feature identification from imbalanced data sets for diagnosis of cardiac arrhythmia. In: 2018 11th international symposium on computational intelligence and design (ISCID), vol 2. IEEE, pp 52–55

Liu N, Koh ZX, Chua ECP, Tan LML, Lin Z, Mirza B, Ong MEH (2014) Risk scoring for prediction of acute cardiac complications from imbalanced clinical data. IEEE J Biomed Health Inform 18(6):1894–1902

Liu T, Fan W, Wu C (2019) A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artif Intell Med 101:101723

Liu N, Li X, Qi E, Xu M, Li L, Gao B (2020) A novel ensemble learning paradigm for medical diagnosis with imbalanced data. IEEE Access 8:171263–171280

Li H, Wang X, Li Y, Qin C, Liu C (2018) Comparison between medical knowledge based and computer automated feature selection for detection of coronary artery disease using imbalanced data. In: BIBE 2018; international conference on biological information and biomedical engineering. VDE, pp 1–4

Li J, Xin B, Yang Z, Xu J, Song S, Wang X (2021) Harmonization centered ensemble for small and highly imbalanced medical data classification. In: 2021 IEEE 18th international symposium on biomedical imaging (ISBI). IEEE, pp 1742–1745

López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

Luo H, Liao J, Yan X, Liu L (2021) Oversampling by a constraint-based causal network in medical imbalanced data classification. In: 2021 IEEE international conference on multimedia and expo (ICME). IEEE, pp 1–6

Lv J, Chen X, Liu X, Du D, Lv W, Lu L, Wu H (2022) Imbalanced data correction based pet/ct radiomics model for predicting lymph node metastasis in clinical stage t1 lung adenocarcinoma. Front Oncol 12:61

Lyra S, Leonhardt S, Antink CH (2019) Early prediction of sepsis using random forest classification for imbalanced clinical data. In: 2019 computing in cardiology (CinC). IEEE, pp 1–4

Mathew G, Obradovic Z (2013) Distributed privacy-preserving decision support system for highly imbalanced clinical data. ACM Trans Manag Inf Syst (TMIS) 4(3):1–15

Meher PK, Rao AR, Wahi SD, Thelma B (2014) An approach using random forest methodology for disease risk prediction using imbalanced case-control data in gwas. Curr Med Res Pract 4(6):289–294

Mienye ID, Sun Y (2021) Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Informat Med Unlocked 25:100690

Mohd F, Abdul Jalil M, Noora NMM, Ismail S, Yahya WFF, Mohamad M (2019) Improving accuracy of imbalanced clinical data classification using synthetic minority over-sampling technique. In: Advances in data science, cyber security and IT applications: 1st international conference on computing, ICC 2019, Riyadh, Saudi Arabia, December 10–12, 2019, Proceedings, Part I. Springer, pp 99–110

Mustafa N, Li JP, Memon RA, Omer MZ (2017) A classification model for imbalanced medical data based on pca and farther distance based synthetic minority oversampling technique. Int J Adv Comput Sci Appl 8(1):61–67

Naghavi N, Miller A, Wade E (2019) Towards real-time prediction of freezing of gait in patients with Parkinson’s disease: addressing the class imbalance problem. Sensors 19(18):3898

Nalluri MR, Kannan K, Gao XZ, Roy DS (2020) Multiobjective hybrid monarch butterfly optimization for imbalanced disease classification problem. Int J Mach Learn Cybern 11:1423–1451

Napierala K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46:563–597

Napierala K, Stefanowski J (2012) Identification of different types of minority class examples in imbalanced data. In: Hybrid artificial intelligent systems: 7th international conference, HAIS 2012, Salamanca, Spain, March 28–30th, 2012. Proceedings, Part II, vol 7. Springer, pp 139–150

Naseriparsa M, Al-Shammari A, Sheng M, Zhang Y, Zhou R (2020) Rsmote: improving classification performance over imbalanced medical datasets. Health Inf Sci Syst 8:1–13

Neocleous AC, Nicolaides KH, Schizas CN (2016) Intelligent noninvasive diagnosis of aneuploidy: raw values and highly imbalanced dataset. IEEE J Biomed Health Inform 21(5):1271–1279

Nguyen HT, Tran TB, Bui QM, Luong HH, Le TP, Tran NC (2020) Enhancing disease prediction on imbalanced metagenomic dataset by cost-sensitive. Int J Adv Comput Sci Appl 11(7):651–3657. https://doi.org/10.14569/IJACSA.2020.0110778

Orooji A, Kermani F (2021) Machine learning based methods for handling imbalanced data in hepatitis diagnosis. Front Health Informat 10(1):57

Parvin H, Minaei-Bidgoli B, Alinejad-Rokny H (2013) A new imbalanced learning and dictions tree method for breast cancer diagnosis. J Bionanosci 7(6):673–678

Patel H, Singh Rajput D, Thippa Reddy G, Iwendi C, Kashif Bashir A, Jo O (2020) A review on classification of imbalanced data for wireless sensor networks. Int J Distrib Sens Netw 16(4):1550147720916404

Phankokkruad M (2020) Cost-sensitive extreme gradient boosting for imbalanced classification of breast cancer diagnosis. In: 2020 10th IEEE international conference on control system, computing and engineering (ICCSCE). IEEE, pp 46–51

Polat K (2018) Similarity-based attribute weighting methods via clustering algorithms in the classification of imbalanced medical datasets. Neural Comput Appl 30:987–1013

Porwik P, Orczyk T, Lewandowski M, Cholewa M (2016) Feature projection k-nn classifier model for imbalanced and incomplete medical data. Biocybern Biomed Eng 36(4):644–656

Potharaju SP, Sreedevi M (2016) Ensembled rule based classification algorithms for predicting imbalanced kidney disease data. J Eng Sci Technol Rev 9(5):201–207

Rahman MM, Davis DN (2013) Addressing the class imbalance problem in medical datasets. Int J Mach Learn Comput 3(2):224

Rath A, Mishra D, Panda G, Satapathy SC (2021) Heart disease detection using deep learning methods from imbalanced ecg samples. Biomed Signal Process Control 68:102820

Rath A, Mishra D, Panda G (2022) Imbalanced ecg signal-based heart disease classification using ensemble machine learning technique. Front Big Data 5:1021518. https://doi.org/10.3389/fdata.2022.1021518

Razzaghi T, Safro I, Ewing J, Sadrfaridpour E, Scott JD (2019) Predictive models for bariatric surgery risks with imbalanced medical datasets. Ann Oper Res 280:1–18

Richter AN, Khoshgoftaar TM (2018) Building and interpreting risk models from imbalanced clinical data. In: 2018 IEEE 30th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 143–150

Rodriguez-Almeida AJ, Fabelo H, Ortega S, Deniz A, Balea-Fernandez FJ, Quevedo E, Soguero-Ruiz C, Wägner AM, Callico GM (2023) Synthetic patient data generation and evaluation in disease prediction using small and imbalanced datasets. IEEE J Biomedi Health Info 27(6):2670–2680. https://doi.org/10.1109/JBHI.2022.3196697

Rong P, Luo T, Li J, Li K (2020) Multi-label disease diagnosis based on unbalanced ecg data. In: 2020 IEEE 9th data driven control and learning systems conference (DDCLS). IEEE, pp 253–259

Roy S, Roy U, Sinha D, Pal RK (2023) Imbalanced ensemble learning in determining Parkinson’s disease using keystroke dynamics. Expert Syst Appl 217:119522. https://doi.org/10.1016/j.eswa.2023.119522

Sadrawi M, Sun WZ, Ma MHM, Yeh YT, Abbod MF, Shieh JS (2018) Ensemble genetic fuzzy neuro model applied for the emergency medical service via unbalanced data evaluation. Symmetry 10(3):71

Sajana T, Narasingarao M (2018) Classification of imbalanced malaria disease using Naïve Bayesian algorithm. Int J Eng Technol 7(2.7):786–790

Sajana T, Narasingarao M (2018) An ensemble framework for classification of malaria disease. ARPN J Eng Appl Sci 13(9):3299–3307

Salman I, Vomlel J (2017) A machine learning method for incomplete and imbalanced medical data. In: Proceedings of the 20th Czech-Japan seminar on data analysis and decision making under uncertainty, pp 188–195

Shakhgeldyan K, Geltser B, Rublev V, Shirobokov B, Geltser D, Kriger A (2020) Feature selection strategy for intrahospital mortality prediction after coronary artery bypass graft surgery on an unbalanced sample. In: Proceedings of the 4th international conference on computer science and application engineering, pp 1–7

Shaw SS, Ahmed S, Malakar S, Sarkar R (2021) An ensemble approach for handling class imbalanced disease datasets. In: Proceedings of international conference on machine intelligence and data science applications: MIDAS 2020. Springer, pp 345–355

Shilaskar S, Ghatol A (2019) Diagnosis system for imbalanced multi-minority medical dataset. Soft Comput 23(13):4789–4799

Shilaskar S, Ghatol A, Chatur P (2017) Medical decision support system for extremely imbalanced datasets. Inf Sci 384:205–219

Shi X, Qu T, Van Pottelbergh G, Van Den Akker M, De Moor B (2022) A resampling method to improve the prognostic model of end-stage kidney disease: a better strategy for imbalanced data. Front Med 9:730748. https://doi.org/10.3389/fmed.2022.730748

Silveira ACD, Sobrinho Á, Silva LDD, Costa EDB, Pinheiro ME, Perkusich A (2022) Exploring early prediction of chronic kidney disease using machine learning algorithms for small and imbalanced datasets. Appl Sci 12(7):3673

Špečkauskien ̇eV (2015) Feature selection on imbalanced data set for the decision support of Parkinson’s disease. In Biomedical Engineering-2015: Proceedings of 19th International conference:[Kaunas, Lithuania, 26-2 November 2015]/Kaunas University of Technology. Biomedical Engineering Institute. Lithuanian Society of Biomedical Engineering. Kaunas: Technologija, 2015, pp. 10–14

Špečkauskien ̇eV (2011) Development and analysis of informational clinical decision support method. Phd thesis, Technologija, Kaunas

Spelmen VS, Porkodi R (2018) A review on handling imbalanced data. In: 2018 international conference on current trends towards converging technologies (ICCTCT). IEEE, pp 1–11

Sribhashyam S, Koganti S, Vineela MV, Kalyani G (2022) Medical diagnosis for incomplete and imbalanced data. In: Intelligent Data Engineering and Analytics: Proceedings of the 9th international conference on frontiers in intelligent computing: theory and applications (FICTA 2021). Springer, pp 491–499

Sridevi T, Murugan A (2014) A novel feature selection method for effective breast cancer diagnosis and prognosis. Int J Comput Appl 88(11):28–33

Srinivas K, Rao GR, Govardhan A (2014) Adapting rough-fuzzy classifier to solve class imbalance problem in heart disease prediction using fcm. Int J Med Eng Informat 6(4):297–318

Sug H (2016) More balanced decision tree generation for imbalanced data sets including the Parkinson’s disease data. Int J Biol Biomed Eng 10:115–123

Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23(04):687–719

Sun H, Wang A, Feng Y, Liu C (2021) An optimized random forest classification method for processing imbalanced data sets of Alzheimer’s disease. In: 2021 33rd Chinese control and decision conference (CCDC). IEEE, pp 1670–1673

Suresh T, Brijet Z, Subha T (2023) Imbalanced medical disease dataset classification using enhanced generative adversarial network. Comput Methods Biomech Biomed Eng 26(14):1702–1718. https://doi.org/10.1080/10255842.2022.2134729

Tang X, Cai L, Meng Y, Gu C, Yang J, Yang J (2021) A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic. IEEE Access 9:51659–51668

Tavares TR, Oliveira AL, Cabral GG, Mattos SS, Grigorio R (2013) Preprocessing unbalanced data using weighted support vector machines for prediction of heart disease in children. In: The 2013 international joint conference on neural networks (IJCNN). IEEE, pp 1–8

Venkatanagendra K, Ussenaiah M (2019) Xgb classification technique to resolve imbalanced heart disease data. Int J Res Electron Comput Eng 7(1):406–410

Vinothini A, Baghavathi Priya S (2020) Design of chronic kidney disease prediction model on imbalanced data using machine learning techniques. Indian J Comput Sci Eng 11(6):708–718

Vuttipittayamongkol P, Elyan E (2020a) Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson’s disease. Int J Neural Syst 30(08):2050043. https://doi.org/10.1142/S0129065720500434

Vuttipittayamongkol P, Elyan E (2020b) Overlap-based undersampling method for classification of imbalanced medical datasets. In: Artificial intelligence applications and innovations: 16th IFIP WG 12.5 international conference, AIAI 2020, Neos Marmaras, Greece, June 5–7, 2020, Proceedings, Part II, vol 16. Springer, pp 358–369

Wan X, Liu J, Cheung WK, Tong T (2014) Learning to improve medical decision making from imbalanced data without a priori cost. BMC Med Informat Decis Mak 14:1–9

Wang L, Zhao Z, Luo Y, Yu H, Wu S, Ren X, Zheng C, Huang X (2020) Classifying 2-year recurrence in patients with dlbcl using clinical variables with imbalanced data and machine learning methods. Comput Methods Programs Biomed 196:105567

Wang Y, Wei Y, Yang H, Li J, Zhou Y, Wu Q (2020) Utilizing imbalanced electronic health records to predict acute kidney injury by ensemble learning and time series model. BMC Med Informat Decis Mak 20(1):1–13

Wang X, Ren H, Ren J, Song W, Qiao Y, Ren Z, Zhao Y, Linghu L, Cui Y, Zhao Z et al (2023) Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data. Comput Methods Progr Biomed 230: https://doi.org/10.1016/j.cmpb.2023.107340

Wang J, Yao Y, Zhou H, Leng M, Chen X (2013) A new over-sampling technique based on svm for imbalanced diseases data. In: Proceedings 2013 international conference on mechatronic sciences, electric engineering and computer (MEC). IEEE, pp 1224–1228

Wang Q, Zhou Y, Zhang W, Tang Z, Chen X (2020) Adaptive sampling using self-paced learning for imbalanced cancer data pre-diagnosis. Expert Syst Appl 152:113334. https://doi.org/10.1016/j.eswa.2020.113334

Wei X, Jiang F, Wei F, Zhang J, Liao W, Cheng S (2017) An ensemble model for diabetes diagnosis in large-scale and imbalanced dataset. In: Proceedings of the computing frontiers conference, pp 71–78

Werner A, Bach M, Pluskiewicz W (2016) The study of preprocessing methods’ utility in analysis of multidimensional and highly imbalanced medical data. In: Proceedings of 11th international conference IIIS2016

Wilk S, Stefanowski J, Wojciechowski S, Farion KJ, Michalowski W (2016) Application of preprocessing methods to imbalanced clinical data: An experimental study. In: Information technologies in medicine: 5th international conference, ITIB 2016 Kamień Śląski, Poland, June 20–22, 2016 proceedings, vol 1. Springer, pp 503–515

Wosiak A, Karbowiak S (2017) Preprocessing compensation techniques for improved classification of imbalanced medical datasets. In: 2017 Federated conference on computer science and information systems (FedCSIS). IEEE, pp 203–211

Woźniak M, Wieczorek M, Siłka J (2023) Bilstm deep neural network model for imbalanced medical data of iot systems. Futur Gener Comput Syst 141:489–499

Wu JC, Shen J, Xu M, Liu FS (2020) An evolutionary self-organizing cost-sensitive radial basis function neural network to deal with imbalanced data in medical diagnosis. Int J Comput Intell Syst 13(1):1608–1618

Xiao Y, Wu J, Lin Z (2021) Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput Biol Med 135: https://doi.org/10.1016/j.compbiomed.2021.104540

Xu Z, Shen D, Nie T, Kou Y (2020) A hybrid sampling algorithm combining m-smote and enn based on random forest for medical imbalanced data. J Biomed Informat 107:103465

Xu Z, Shen D, Nie T, Kou Y, Yin N, Han X (2021) A cluster-based oversampling algorithm combining smote and k-means for imbalanced medical data. Inf Sci 572:574–589

Yildirim P (2017) Chronic kidney disease prediction on imbalanced data by multilayer perceptron: Chronic kidney disease prediction. In: 2017 IEEE 41st annual computer software and applications conference (COMPSAC), vol 2. IEEE, pp 193–198

Yuan X, Chen S, Sun C, Yuwen L (2021) A novel class imbalance-oriented polynomial neural network algorithm for disease diagnosis. In: 2021 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 2360–2367

Zeng M, Zou B, Wei F, Liu X, Wang L (2016) Effective prediction of three common diseases by combining smote with tomek links technique for imbalanced medical data. In: 2016 IEEE international conference of online analysis and computing science (ICOACS). IEEE, pp 225–228

Zhang J, Chen L (2019) Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis. Computer Assist Surg 24(sup2):62–72

Zhang H, Zhang H, Pirbhulal S, Wu W, Albuquerque VHCD (2020) Active balancing mechanism for imbalanced medical data in deep learning-based classification models. ACM Trans Multimedia Comput, Commun, Appl (TOMM) 16(1s):1–15

Zhang J, Chen L (2019a) Breast cancer diagnosis from perspective of class imbalance. Iran J Med Phys 16(3). https://doi.org/10.22038/ijmp.2018.31600.1373

Zhang F, Petersen M, Johnson L, Hall J, O’bryant SE (2022) Hyperparameter tuning with high performance computing machine learning for imbalanced Alzheimer’s disease data. Appl Sci 12(13):6670

Zhao YX, Yuan H, Wu Y (2021) Prediction of adverse drug reaction using machine learning and deep learning based on an imbalanced electronic medical records dataset. In: Proceedings of the 5th international conference on medical and health informatics, pp 17–21

Zhao H, Wang R, Lei Y, Liao WH, Cao H, Cao J (2022) Severity level diagnosis of Parkinson’s disease by ensemble k-nearest neighbor under imbalanced data. Expert Syst Appl 189:116113

Zhou PY, Wong AK (2021) Explanation and prediction of clinical data with imbalanced class distribution based on pattern discovery and disentanglement. BMC Med Informat Decis Mak 21(1):1–15

Zhu M, Xia J, Jin X, Yan M, Cai G, Yan J, Ning G (2018) Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access 6:4641–4652

Zięba M (2014) Service-oriented medical system for supporting decisions with missing and imbalanced data. IEEE J Biomed Health Informat 18(5):1533–1540

Download references

Funding for open access publishing: Universidad de Córdoba/CBUA. Spanish Ministry of Science and Innovation and the European Fund for Region Development, Grant: PID2020-115832-I00

Author information

Authors and affiliations

Laboratory of Applied Statistics (LASAP), National Higher School of Statistics and Applied Economics, Koléa, Tipaza, Algeria

Mabrouka Salmi

Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Cordoba, Cordoba, Spain

Mabrouka Salmi & Sebastian Ventura

Economics Department, University Center of Tipaza, Tipaza, Algeria

Dalia Atif

Depto. de Ingeniería Electro-Fotónica, Universidad de Guadalajara, CUCEI, Guadalajara, Jalisco, Mexico

Diego Oliva

School of Artificial Intelligence, Bennett University, Greater Noida, 201310, Uttar Pradesh, India

Ajith Abraham

Contributions

Mabrouka Salmi: Conceptualization-Equal, Data curation-Lead, Investigation-Lead, Methodology-Equal, Visualization-Lead, Writing – original draft-Lead, Writing – review & editing-Equal. Dalia Atif: Methodology-Equal, Writing – review & editing-Equal. Ajith Abraham: Methodology-Equal, Writing – review & editing-Equal. Diego Oliva: Methodology-Equal, Writing – review & editing-Equal. Sebastian Ventura: Conceptualization-Equal, Funding acquisition-Lead, Supervision-Lead, Writing – review & editing-Equal.

Corresponding author

Correspondence to Sebastian Ventura.

Ethics declarations

Conflict of interest

The authors declare no competing financial and/or non-financial interests about the described work.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Salmi, M., Atif, D., Oliva, D. et al. Handling imbalanced medical datasets: review of a decade of research. Artif Intell Rev 57, 273 (2024). https://doi.org/10.1007/s10462-024-10884-2

Accepted: 25 July 2024

Published: 02 September 2024

DOI: https://doi.org/10.1007/s10462-024-10884-2

Class imbalance
Medical datasets
Medical diagnosis
Machine learning

Open access
Published: 09 September 2024

Association between serum alkaline phosphatase and clinical prognosis in patients with acute liver failure following cardiac arrest: a retrospective cohort study

Yuequn Xie 1, Liangen Lin 1, Congcong Sun 2, Linglong Chen 1 & Wang Lv 1

European Journal of Medical Research volume 29, Article number: 453 (2024)

Background

Acute liver failure (ALF) following cardiac arrest (CA) poses a significant healthcare challenge, characterized by high morbidity and mortality rates. This study aims to assess the correlation between serum alkaline phosphatase (ALP) levels and poor outcomes in patients with ALF following CA.

Methods

A retrospective analysis was conducted utilizing data from the Dryad digital repository. The primary outcomes examined were intensive care unit (ICU) mortality, hospital mortality, and unfavorable neurological outcome. Multivariable logistic regression analysis was employed to assess the relationship between serum ALP levels and clinical prognosis. The predictive value was evaluated using receiver operating characteristic (ROC) curve analysis. Two prediction models were developed, and model comparison was performed using the likelihood ratio test (LRT) and the Akaike Information Criterion (AIC).

Results

A total of 194 patients were included in the analysis (72.2% male). Multivariate logistic regression analysis revealed that a one-standard deviation increase in ln-transformed ALP was independently associated with poorer prognosis: ICU mortality (odds ratio (OR) = 2.49, 95% confidence interval (CI) 1.31–4.74, P = 0.005), hospital mortality (OR = 2.21, 95% CI 1.18–4.16, P = 0.014), and unfavorable neurological outcome (OR = 2.40, 95% CI 1.25–4.60, P = 0.009). The area under the ROC curve for ICU mortality, hospital mortality, and unfavorable neurological outcome was 0.644, 0.642, and 0.639, respectively. Additionally, LRT analyses indicated that the ALP-combined model exhibited better predictive efficacy than the model without ALP.

Conclusions

Elevated serum ALP levels upon admission were significantly associated with poorer prognosis of ALF following CA, suggesting its potential as a valuable marker for predicting prognosis in this patient population.

Introduction

Acute liver failure (ALF) is a life-threatening complication encountered in critically ill patients within the intensive care unit (ICU). It affects 16–56% of patients suffering from post-cardiac arrest syndrome (PCAS) during their treatment and is associated with poor clinical outcomes, highlighting the liver's significant role in the pathophysiology of cardiac arrest (CA) syndrome [ 1 , 2 ]. Despite advancements in diagnosis and treatment, the mortality rate for patients with ALF following CA remains alarmingly high [ 1 ].

Alkaline phosphatase (ALP) is a glycoprotein primarily located on cell membranes and is predominantly found in various human tissues, including the liver, bone, placenta, kidney, and small intestine [ 3 ]. ALP is closely linked to several adverse risk factors, particularly inflammation, endothelial dysfunction, and coagulation [ 4 ]. Inflammation is pivotal in the pathogenesis of ALF, where reactive oxygen species production aggravates mitochondrial dysfunction and oxidative stress, resulting in hepatocyte necrosis [ 5 ]. During ALF, ALP activity increases in liver tissue, indicating a protective response to immunological liver damage by neutralizing endotoxins [ 6 ]. Additionally, ALP can dephosphorylate and detoxify lipopolysaccharides (LPS), thereby reducing liver inflammation by downregulating inflammatory pathways, including TLR4, TNF-α, IL-1β, and NF-κB [ 7 , 8 ]. Consequently, patients with ALF demonstrate a significant elevation in serum ALP levels. In clinical settings, serum ALP levels are well-established markers of skeletal or hepatobiliary dysfunction [ 9 , 10 ]. Recent findings suggest a relationship between serum ALP levels and the risk of cardiovascular diseases, as well as all-cause mortality in diverse populations, including those with chronic kidney disease, metabolic syndrome, and coronary artery disease [ 4 , 11 , 12 , 13 ]. Furthermore, several studies show a positive correlation between serum ALP levels and both all-cause mortality and poor functional prognosis following cerebrovascular disease [ 14 , 15 ]. However, the existence of a similar association between serum ALP levels and mortality or neurological outcomes in patients with ALF following CA remains unclear.

We hypothesized that elevated total serum ALP levels independently correlate with poorer prognosis in patients with ALF following CA. Thus, we conducted a longitudinal cohort study to assess the association between serum ALP levels and mortality/neurological outcome in this patient population.

Study population

This retrospective cohort study used data from the Dryad digital repository (https://doi.org/10.5061/dryad.qv6fp83). The original study was conducted at Erasme Hospital, Brussels, Belgium, from January 2007 to December 2015 and enrolled patients treated in the ICU after in-hospital cardiac arrest (IHCA) or out-of-hospital cardiac arrest (OHCA). It was reviewed and approved by the Comité d’Ethique Hospitalo-Facultaire Erasme-ULB (P2017/264), which waived the need for informed consent because of the retrospective design. We included 194 comatose patients (Glasgow Coma Scale < 9) with acute liver failure following IHCA or OHCA. Exclusion criteria were death within 24 h of admission ( n = 51), absence of liver function data ( n = 10), concurrent cirrhotic disease ( n = 14), and absence of ALF following CA ( n = 166) (Fig. 1). All comatose CA patients underwent 24-h targeted temperature management (TTM) with a target temperature of 32–34 °C, with midazolam and morphine for deep sedation and cisatracurium for shivering control. Post-resuscitation treatment followed established protocols [ 16 ].

Fig. 1 Flowchart of the study population with inclusion and exclusion criteria

Data collection

We collected demographic data, arrest characteristics, and comorbidity profiles for all patients. Disease severity was assessed upon admission using the Acute Physiology and Chronic Health Evaluation (APACHE) II score and the Sequential Organ Failure Assessment (SOFA) score. Initial laboratory assessments on admission following return of spontaneous circulation (ROSC) included blood gas analysis (pH, PaO2, PaCO2) and standard tests such as serum alanine transaminase (ALT), serum aspartate transaminase (AST), ALP, serum lactate dehydrogenase (LDH), C-reactive protein (CRP), serum gamma-glutamyl transpeptidase (GGT), serum lactate, serum creatinine, total bilirubin (normal range ≤ 1.2 mg/dL), international normalized ratio (INR) (normal range ≤ 1.2), prothrombin time (PT) (normal range > 70%), and platelet count (normal range 150–350 × 10³/mm³). Mean arterial pressure (MAP), mechanical ventilation, intra-aortic balloon pump (IABP), extracorporeal membrane oxygenation (ECMO), continuous renal replacement therapy (CRRT), vasoactive drug usage, and ICU length of stay were documented.

Definitions

ALF following CA was defined as an INR equal to or greater than 1.5 together with elevated total bilirubin levels, in the absence of chronic liver disease [ 1 , 2 ]. Shock was defined as a systolic arterial pressure below 90 mmHg despite appropriate volume expansion, necessitating vasopressor support (e.g., dopamine/dobutamine, adrenaline) for over 6 h. ICU and in-hospital mortality refer to all-cause mortality during the ICU stay and overall hospitalization, respectively. An unfavorable neurological outcome was defined as a Cerebral Performance Category (CPC) score of 3–5 at 90 days, while a favorable outcome was a CPC of 1–2 [ 17 ]. The CPC scale ranges from 1 (good cerebral functioning or mild disability) to 5 (brain death). CPC assessments were conducted prospectively by the general practitioner via telephone interviews during follow-up.
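
The definitions above can be read as simple classification rules. The following Python sketch is illustrative only and is not the authors' code: the function names are hypothetical, and the bilirubin cut-off of 1.2 mg/dL is an assumption taken from the normal range quoted under Data collection, since the text says only "elevated".

```python
def has_alf_after_ca(inr, total_bilirubin_mg_dl, chronic_liver_disease,
                     bilirubin_upper_normal=1.2):
    """ALF following CA: INR >= 1.5, elevated total bilirubin, no chronic liver disease.

    The bilirubin threshold is an assumption based on the stated normal
    range (<= 1.2 mg/dL); the source does not give an explicit cut-off.
    """
    return (inr >= 1.5
            and total_bilirubin_mg_dl > bilirubin_upper_normal
            and not chronic_liver_disease)


def neurological_outcome(cpc_at_90_days):
    """Map a CPC score (1-5) at 90 days to 'favorable' (1-2) or 'unfavorable' (3-5)."""
    if cpc_at_90_days not in (1, 2, 3, 4, 5):
        raise ValueError("CPC must be an integer from 1 to 5")
    return "favorable" if cpc_at_90_days <= 2 else "unfavorable"
```

For example, `has_alf_after_ca(1.6, 2.0, False)` satisfies all three criteria, whereas the same values with known chronic liver disease do not.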

Statistical analysis

Continuous variables were expressed as median (25th to 75th percentiles) or mean ± standard deviation (SD), while categorical variables were presented as frequencies (%). Continuous variables were compared using Student's t-test or the Wilcoxon rank-sum test, depending on the normality of their distribution. Pearson's Chi-squared test or Fisher's exact test was employed for categorical data. For baseline characteristics analysis, serum ALP levels were categorized into two groups according to the median: < 77 IU/L and ≥ 77 IU/L. Multivariable logistic regression analysis was used to estimate the odds ratio (OR) and corresponding 95% confidence interval (CI) for the risk of mortality/unfavorable neurological outcome according to serum ALP levels (dichotomized at the median, and per one-standard deviation increase in ln-transformed ALP) in patients with ALF following CA. We adjusted for age, sex, adrenaline, cardiac cause, out-of-hospital arrest, non-shockable rhythm, chronic anticoagulation, shock, vasopressor therapy, CRP, and MAP as potential confounders. Subsequently, restricted cubic spline models and smooth curve fitting were utilized to investigate the relationship between serum ALP levels and clinical prognosis. Interaction and stratified analyses were conducted based on subgroup variables, with interaction across subgroups assessed using likelihood ratio tests. Receiver operating characteristic (ROC) curves were generated, and the predictive performance of serum ALP levels on prognosis was assessed using the area under the curve (AUC). Model comparison was performed using the likelihood ratio test (LRT) and the Akaike Information Criterion (AIC).

All analyses were conducted using R Statistical Software (Version 4.2.2, The R Foundation) and Free Statistics Analysis Platform (Version 1.9, Beijing, China). Statistical significance was defined as a two-sided P value < 0.05.
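
To make the "per one-standard deviation increase of ln-transformed ALP" scaling concrete, the sketch below shows how such a predictor can be built and how a logistic-regression coefficient is turned into an OR with a Wald 95% CI. It is written in Python for illustration (the authors used R and the Free Statistics platform); the function names and the ALP values are hypothetical and are not study data.

```python
import math
import statistics


def standardized_ln(values):
    """Return z-scores of ln(values), so one unit equals one SD of ln-transformed ALP."""
    logs = [math.log(v) for v in values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation
    return [(x - mean) / sd for x in logs]


def odds_ratio_with_ci(beta, se, z_crit=1.96):
    """Convert a logit coefficient and its standard error to (OR, lower, upper)."""
    return (math.exp(beta),
            math.exp(beta - z_crit * se),
            math.exp(beta + z_crit * se))


# Hypothetical ALP values in IU/L, for illustration only.
alp = [45, 60, 77, 90, 150, 310]
z_scores = standardized_ln(alp)
```

With the predictor scaled this way, the exponentiated coefficient of `z_scores` in a fitted model is the OR per one-SD increase in ln(ALP), which is the quantity reported in the tables.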

Baseline characteristics of study participants

Table 1 presents the baseline characteristics of the participants ( n = 194) in the cohort study. The median age of the participants was 62.0 (52.0, 73.8) years, and 72.2% were male. The high-ALP group (≥ 77 IU/L) predominantly consisted of participants with non-shockable rhythms, IHCA, and chronic renal failure. Additionally, those in the high-ALP group were more likely to present with poorer laboratory results and to require more interventions than those in the low-ALP group.

Primary outcomes

Table 2 presents the incidence of each outcome among the 194 patients included in the study. Of these patients, 106 (54.6%) died in the ICU, 116 (59.8%) died during hospitalization, and 121 (62.4%) experienced an unfavorable neurological outcome. Participants in the high-ALP group had higher rates of ICU mortality, hospital mortality, and unfavorable neurological outcome (Fig. 2).
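
The reported proportions follow directly from the event counts and the cohort size. A minimal arithmetic check in Python (variable names are illustrative):

```python
n_total = 194
events = {
    "ICU mortality": 106,
    "hospital mortality": 116,
    "unfavorable neurological outcome": 121,
}

# Percentage of the cohort experiencing each outcome, rounded to one decimal.
percentages = {name: round(100 * count / n_total, 1)
               for name, count in events.items()}
```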

Relationship between serum ALP levels and clinical prognosis

Multivariate logistic regression analysis

Table 3 shows the results from multivariate logistic regression analysis. When ALP was considered as a continuous variable (ln-transformed), elevated serum ALP levels were associated with an increased risk of ICU mortality, hospital mortality, and unfavorable neurological outcome, with ORs of 2.49 [1.31–4.74], 2.21 [1.18–4.16], and 2.40 [1.25–4.60], respectively. When ALP was treated as a categorical variable, the association remained significant for all three outcomes (ORs of 2.53 [1.24–5.14], 2.29 [1.13–4.64], and 2.09 [1.02–4.29], respectively). Furthermore, a restricted cubic spline regression model indicated that the risk for all three outcomes increased linearly with rising serum ALP levels ( P for non-linearity = 0.816, 0.598 and 0.342, respectively) (Figure S1).
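
The reported ORs, confidence intervals, and P values can be cross-checked against one another. The sketch below assumes the intervals are Wald intervals on the log-odds scale (an assumption; the source does not say how they were computed) and recovers the implied two-sided P value from each OR and its 95% CI. The function name is hypothetical.

```python
import math


def wald_p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate the two-sided Wald P value implied by an OR and its 95% CI."""
    # Standard error on the log-odds scale, from the width of the interval.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Per-SD estimates for ln-transformed ALP as reported in the text.
reported = {
    "ICU mortality": (2.49, 1.31, 4.74),
    "hospital mortality": (2.21, 1.18, 4.16),
    "unfavorable neurological outcome": (2.40, 1.25, 4.60),
}
implied_p = {name: wald_p_from_or_ci(*vals) for name, vals in reported.items()}
```

Under this assumption the implied values land close to the P values given in the abstract (0.005, 0.014, and 0.009); small differences are expected because the published estimates are rounded.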

The predictive accuracy for clinical prognosis

The ROC curve results showed that the AUC was 0.644 (95% CI 0.567–0.722) for ICU mortality, 0.642 (95% CI 0.563–0.721) for hospital mortality, and 0.639 (95% CI 0.559–0.719) for unfavorable neurological outcome (Fig. 3 ).

Fig. 3 Comparison of ROC curves for predicting ICU mortality, hospital mortality, and unfavorable neurological outcome
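
For readers less familiar with the AUC, it equals the probability that a randomly chosen patient with the outcome has a higher marker value than a randomly chosen patient without it, so an AUC near 0.64 means ALP alone ranks such a pair correctly about 64% of the time. The Python sketch below computes the AUC from this pairwise definition; the function name and the ALP values are hypothetical and are not study data.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical serum ALP values (IU/L), for illustration only.
died = [120, 95, 77, 210, 60]
survived = [55, 70, 77, 64, 130]
example_auc = auc(died, survived)
```

This pairwise form is equivalent to the Mann-Whitney U statistic divided by the number of pairs; it is quadratic in sample size, which is acceptable for a cohort of this size.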

Subgroup analysis

We analyzed the risk stratification value of serum ALP levels for the primary endpoints in multiple subgroups of the enrolled patients, defined by age, sex, coronary artery disease, OHCA, and shock (Fig. 4). The positive association between serum ALP levels and all three outcomes was generally consistent across subgroups, with higher serum ALP levels associated with higher rates of ICU mortality, hospital mortality, and unfavorable neurological outcome.

Fig. 4 Subgroup analysis of the association between serum ALP levels and clinical prognosis

Adding the ALP to clinical information

Table 4 displays the evaluation of two multivariate models using LRT, AIC, and AUC. The LRT favored Model 2 (clinical information plus ALP) over Model 1 (clinical information alone) for all three outcomes. Specifically, for ICU mortality, AIC values decreased from 248.39 in Model 1 to 237.31 in Model 2, with AUC increasing from 0.732 to 0.768. Similarly, for hospital mortality, AIC values decreased from 243.57 to 235.23, with AUC increasing from 0.729 to 0.759. For unfavorable neurological outcome, AIC values decreased from 238.12 to 230.00, with AUC increasing from 0.733 to 0.767. These findings indicate that adding serum ALP levels to Model 1 improves the prognostic prediction model.
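
The AIC values and the LRT are linked, because AIC = 2k − 2 ln L for a model with k parameters and maximized likelihood L. For nested models fitted to the same patients, the likelihood-ratio statistic is therefore the AIC difference plus twice the number of added parameters. The Python sketch below applies this to the reported AIC values, assuming that Model 2 adds exactly one parameter (ln-transformed ALP) to Model 1; that assumption and the function names are not taken from the source.

```python
import math


def lrt_from_aic(aic_reduced, aic_full, extra_params=1):
    """Likelihood-ratio statistic recovered from the AICs of two nested models.

    AIC = 2k - 2 ln L, hence LRT = (AIC_reduced - AIC_full) + 2 * extra_params.
    """
    return (aic_reduced - aic_full) + 2 * extra_params


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# (AIC of Model 1, AIC of Model 2) as reported in the text.
aic_pairs = {
    "ICU mortality": (248.39, 237.31),
    "hospital mortality": (243.57, 235.23),
    "unfavorable neurological outcome": (238.12, 230.00),
}
lrt_p = {name: chi2_sf_df1(lrt_from_aic(a1, a2))
         for name, (a1, a2) in aic_pairs.items()}
```

Under the one-parameter assumption, all three implied P values fall below 0.01, which is consistent with the statement that the model including ALP fits better.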

In this study, we explored for the first time the association between serum ALP levels and prognosis in patients with ALF following CA. Our findings revealed that elevated serum ALP levels were independently and linearly associated with increased ICU mortality, hospital mortality, and unfavorable neurological outcome in this patient population. This association persisted after adjustment for confounders and remained robust in subgroup analyses. The addition of ALP to the original model enhanced its prognostic predictive ability. Furthermore, the inclusion of patients with both IHCA and OHCA broadens the generalizability of our findings. The results suggest that serum ALP levels could be a promising biomarker for predicting prognosis in patients with ALF following CA. Nevertheless, additional research is required to validate these findings and to better understand the underlying mechanisms.

Nearly all population-based studies have consistently shown an association between serum ALP levels and increased all-cause mortality. An analysis involving 34,147 adults from the National Health and Nutrition Examination Survey (NHANES) conducted from 1999 to 2014 revealed a positive correlation between serum ALP levels and both all-cause and cardiovascular mortality in the general population [ 4 ]. A multicenter randomized trial found that elevated serum ALP levels were associated with higher mortality rates among African Americans with stage III and stage IV chronic kidney disease [ 11 ]. Additionally, previous studies have indicated that elevated serum ALP levels not only predict in-hospital mortality following acute cerebral infarction but also correlate with poor neurological outcomes at three months [ 15 ], a pattern that also applies to patients with cerebral hemorrhage [ 14 ]. These findings align with those of our study, where multivariate logistic regression demonstrated that elevated serum ALP levels were associated with an increased risk of ICU mortality, hospital mortality, and unfavorable neurological outcome, with ORs of 2.49 [1.31–4.74], 2.21 [1.18–4.16], and 2.40 [1.25–4.60], respectively. Furthermore, treating ALP as a categorical variable and conducting subgroup analyses further validated the robustness of our results. This phenomenon may be attributed to the fact that serum ALP, although found in various organs, is primarily located in the liver. Therefore, elevated ALP levels may signal a worsening prognosis when hepatocyte necrosis occurs [ 18 , 19 ].
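As a plausibility check on the odds ratios quoted above, the sketch below back-calculates the standard error of each log odds ratio from its interval. It assumes that the bracketed ranges are 95% Wald confidence intervals, which this section does not state explicitly; the helper name is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

# (OR, lower bound, upper bound) as reported in the text.
REPORTED = {
    "ICU mortality": (2.49, 1.31, 4.74),
    "Hospital mortality": (2.21, 1.18, 4.16),
    "Unfavorable neurological outcome": (2.40, 1.25, 4.60),
}


def wald_summary(odds_ratio, lower, upper):
    """Return (standard error of ln OR, z statistic, geometric midpoint of the interval).

    ASSUMPTION: the interval is a symmetric 95% Wald interval on the log scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(odds_ratio) / se
    midpoint = math.sqrt(lower * upper)
    return se, z, midpoint


for outcome, values in REPORTED.items():
    se, z, midpoint = wald_summary(*values)
    print(f"{outcome}: SE(ln OR) = {se:.3f}, z = {z:.2f}, interval midpoint = {midpoint:.2f}")
```

Under that assumption, the geometric midpoint of each interval should sit close to the reported point estimate, which offers a simple way to spot transcription errors.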

Integrating serum ALP levels with clinical information significantly enhanced the model's performance, with AUC values of 0.768 for ICU mortality, 0.759 for hospital mortality, and 0.767 for unfavorable neurological outcome. While these results suggest that including ALP enhances model performance, careful consideration is necessary regarding its direct application in clinical practice. Additionally, our analysis using ROC curves demonstrated that serum ALP exhibited comparable predictive accuracy for adverse outcomes when compared to APACHE II and SOFA scores. Both the APACHE II and SOFA scores are recognized for their prognostic capabilities in predicting mortality and poor neurological outcome following CA [ 20 , 21 , 22 , 23 ]. The inclusion of hepatic dysfunction makes the SOFA score especially relevant for clinical prognostic assessment in patients with ALF [ 24 ]. Although the APACHE II score remains prognostically relevant for patients with ALF or following liver transplantation, its predictive accuracy may be limited [ 25 , 26 ]. Our findings indicate that among the three predictors, the SOFA score demonstrated the highest predictive efficacy, while the APACHE II score showed the lowest. Nonetheless, serum ALP continues to be a readily accessible and effective predictor.

Various mechanisms linking elevated serum ALP levels to poor outcomes in ALF following CA deserve consideration. Notably, serum ALP has been recognized as a surrogate marker of systemic inflammation, and multiple studies have reported an association between ALP and CRP [ 12 , 27 , 28 ]. Our study showed consistent findings, revealing a significant positive correlation between ALP and CRP levels ( P < 0.05). After ROSC, PCAS induces a sepsis-like syndrome characterized by elevated inflammatory markers, including CRP, and the resulting ALF further exacerbates the inflammatory response [ 18 , 29 ]. Numerous studies have confirmed that elevated CRP levels are indicative of high mortality and unfavorable neurological outcome following CA, with the underlying mechanism linked to the inflammatory response rather than necessarily indicating the presence of an infection [ 30 , 31 ]. Inflammation may thus explain the link between elevated serum ALP levels and poor outcomes. According to data from European Union transplant units, approximately 18% of ALF cases are attributed to pharmacological factors, with an increasing trend in recent years [ 32 ]. The use of adrenaline and various vasoactive drugs during cardiopulmonary resuscitation and post-resuscitation therapy can contribute to pharmacological liver injury and poorer outcomes in CA patients [ 33 , 34 ]. In our study, higher ALP levels were observed in individuals who frequently or heavily used vasoactive drugs ( P < 0.05). Therefore, pharmacological factors may provide another explanation for the correlation between elevated serum ALP levels and poor outcomes. However, even after adjusting for these factors, the association between ALP levels and adverse prognosis persisted, suggesting that this relationship operates through distinct mechanisms. Other mechanisms may also be at play, and future studies should investigate specific ALP isoenzymes to elucidate the pathophysiological links between serum ALP and poor outcomes.

Additionally, we observed that the high-level group had significantly lower blood glucose levels compared to the low-level group. Several factors may contribute to this finding. Firstly, following CA, the body's metabolic demands increase, particularly in the acute post-resuscitation phase [ 35 ]. Impaired liver function may hinder the liver's ability to supply adequate energy and metabolites, leading to higher energy expenditure and reduced serum glucose levels [ 36 ]. Secondly, the liver may experience direct damage from ischemia and reperfusion injury following CA, resulting in hepatocellular damage or necrosis and an increased release of liver enzymes such as ALP. Given the liver's critical role in regulating glucose (through glycogen synthesis, catabolism, and gluconeogenesis), lower glucose levels may indicate significant liver dysfunction [ 37 ]. Therefore, the observed hypoglycemia in the high-level group, along with elevated liver enzymes and significantly poorer outcomes, reinforces this possibility.

Our study has several limitations. Firstly, it is a single-center, observational, retrospective analysis of an existing database, which may limit the generalizability of our findings and hinder the ability to establish causal relationships. Secondly, despite adjusting for numerous potential confounders, we cannot entirely exclude the possibility of undetected confounders. Thirdly, our results are derived from a Belgian population, necessitating further validation in other populations. Lastly, the existing database only includes serum ALP measurements, without information on ALP isoforms, which limits our ability to infer the association of other ALP sources with increased mortality. However, it is important to note that, despite the single-center design, the results remained consistent, indicating the robustness of the findings within the study population.

The present study suggests that serum ALP may serve as a valuable marker for predicting the prognosis of patients with ALF following CA, although further confirmation of these findings is necessary.

Availability of data and materials

Data will be made available on request. Extra data can be accessed via the Dryad data repository at https://doi.org/10.5061/dryad.qv6fp83.

Abbreviations

ALP: Alkaline phosphatase
ALF: Acute liver failure
CA: Cardiac arrest
ICU: Intensive care unit
OR: Odds ratios
CI: Confidence interval
PCAS: Post-cardiac arrest syndrome
IHCA: In-hospital cardiac arrest
OHCA: Out-of-hospital cardiac arrest
TTM: Targeted temperature management
APACHE: Acute Physiology and Chronic Health Evaluation
SOFA: Sequential Organ Failure Assessment
ROSC: Return of spontaneous circulation
ALT: Alanine transaminase
AST: Aspartate transaminase
LDH: Lactate dehydrogenase
GGT: Gamma-glutamyl transpeptidase
INR: International normalized ratio
PT: Prothrombin time
IABP: Intra-aortic balloon pump
ECMO: Extracorporeal membrane oxygenation
CRRT: Continuous renal replacement therapy
CPC: Cerebral performance categories score
VIF: Variance inflation factor
MAP: Mean arterial pressure
AUC: Area under the curve
ROC: Receiver operating characteristic
LRT: Likelihood ratio test
AIC: Akaike Information Criterion
NHANES: National Health and Nutrition Examination Survey

References

1. Iesu E, Franchi F, Zama Cavicchi F, Pozzebon S, Fontana V, Mendoza M, et al. Acute liver dysfunction after cardiac arrest. PLoS ONE. 2018;13(11):e0206655.
2. Delignette MC, Stevic N, Lebossé F, Bonnefoy-Cudraz E, Argaud L, Cour M. Acute liver failure after out-of-hospital cardiac arrest: an observational study. Resuscitation. 2024;197:110136.
3. Tonelli M, Curhan G, Pfeffer M, Sacks F, Thadhani R, Melamed ML, et al. Relation between alkaline phosphatase, serum phosphate, and all-cause or cardiovascular mortality. Circulation. 2009;120(18):1784–92.
4. Yan W, Yan M, Wang H, Xu Z. Associations of serum alkaline phosphatase level with all-cause and cardiovascular mortality in the general population. Front Endocrinol (Lausanne). 2023;14:1217369.
5. Jaeschke H. Reactive oxygen and mechanisms of inflammatory liver injury: present concepts. J Gastroenterol Hepatol. 2011;26(Suppl 1):173–9.
6. Xu Q, Lu Z, Zhang X. A novel role of alkaline phosphatase in protection from immunological liver injury in mice. Liver. 2002;22(1):8–14.
7. Pike AF, Kramer NI, Blaauboer BJ, Seinen W, Brands R. A novel hypothesis for an alkaline phosphatase ‘rescue’ mechanism in the hepatic acute phase immune response. Biochim Biophys Acta. 2013;1832(12):2044–56.
8. Wu H, Wang Y, Yao Q, Fan L, Meng L, Zheng N, et al. Alkaline phosphatase attenuates LPS-induced liver injury by regulating the miR-146a-related inflammatory pathway. Int Immunopharmacol. 2021;101(Pt A):108149.
9. Siller AF, Whyte MP. Alkaline phosphatase: discovery and naming of our favorite enzyme. J Bone Miner Res. 2018;33(2):362–4.
10. Poupon R. Liver alkaline phosphatase: a missing link between choleresis and biliary inflammation. Hepatology. 2015;61(6):2080–90.
11. Beddhu S, Ma X, Baird B, Cheung AK, Greene T. Serum alkaline phosphatase and mortality in African Americans with chronic kidney disease. Clin J Am Soc Nephrol. 2009;4(11):1805–10.
12. Kim JH, Lee HS, Park HM, Lee YJ. Serum alkaline phosphatase level is positively associated with metabolic syndrome: a nationwide population-based study. Clin Chim Acta. 2020;500:189–94.
13. Ndrepepa G, Xhepa E, Braun S, Cassese S, Fusaro M, Schunkert H, et al. Alkaline phosphatase and prognosis in patients with coronary artery disease. Eur J Clin Invest. 2017;47(5):378–87.
14. Li S, Wang W, Zhang Q, Wang Y, Wang A, Zhao X. Association between alkaline phosphatase and clinical outcomes in patients with spontaneous intracerebral hemorrhage. Front Neurol. 2021;12:677696.
15. Guo W, Liu Z, Lu Q, Liu P, Lin X, Wang J, et al. Non-linear association between serum alkaline phosphatase and 3-month outcomes in patients with acute stroke: results from the Xi’an stroke registry study of China. Front Neurol. 2022;13:859258.
16. Tujjar O, Mineo G, Dell’Anna A, Poyatos-Robles B, Donadello K, Scolletta S, et al. Acute kidney injury after cardiac arrest. Crit Care. 2015;19(1):169.
17. Perkins GD, Jacobs IG, Nadkarni VM, Berg RA, Bhanji F, Biarent D, et al. Cardiac arrest and cardiopulmonary resuscitation outcome reports: update of the Utstein Resuscitation Registry Templates for Out-of-Hospital Cardiac Arrest: a statement for healthcare professionals from a task force of the International Liaison Committee on Resuscitation (American Heart Association, European Resuscitation Council, Australian and New Zealand Council on Resuscitation, Heart and Stroke Foundation of Canada, InterAmerican Heart Foundation, Resuscitation Council of Southern Africa, Resuscitation Council of Asia); and the American Heart Association Emergency Cardiovascular Care Committee and the Council on Cardiopulmonary, Critical Care, Perioperative and Resuscitation. Circulation. 2015;132(13):1286–300.
18. de Perez Ruiz Garibay A, Kortgen A, Leonhardt J, Zipprich A, Bauer M. Critical care hepatology: definitions, incidence, prognosis and role of liver failure in critically ill patients. Crit Care. 2022;26(1):289.
19. Haarhaus M, Brandenburg V, Kalantar-Zadeh K, Stenvinkel P, Magnusson P. Alkaline phosphatase: a novel treatment target for cardiovascular disease in CKD. Nat Rev Nephrol. 2017;13(7):429–42.
20. Donnino MW, Salciccioli JD, Dejam A, Giberson T, Giberson B, Cristia C, et al. APACHE II scoring to predict outcome in post-cardiac arrest. Resuscitation. 2013;84(5):651–6.
21. Blatter R, Amacher SA, Bohren C, Becker C, Beck K, Gross S, et al. Comparison of different clinical risk scores to predict long-term survival and neurological outcome in adults after cardiac arrest: results from a prospective cohort study. Ann Intensive Care. 2022;12(1):77.
22. Matsuda J, Kato S, Yano H, Nitta G, Kono T, Ikenouchi T, et al. The Sequential Organ Failure Assessment (SOFA) score predicts mortality and neurological outcome in patients with post-cardiac arrest syndrome. J Cardiol. 2020;76(3):295–302.
23. Pineton de Chambrun M, Bréchot N, Lebreton G, Schmidt M, Hekimian G, Demondion P, et al. Venoarterial extracorporeal membrane oxygenation for refractory cardiogenic shock post-cardiac arrest. Intensive Care Med. 2016;42(12):1999–2007.
24. Wehler M, Kokoska J, Reulbach U, Hahn EG, Strauss R. Short-term prognosis in critically ill patients with cirrhosis assessed by prognostic scoring systems. Hepatology. 2001;34(2):255–61.
25. Niewiński G, Starczewska M, Kański A. Prognostic scoring systems for mortality in intensive care units–the APACHE model. Anaesthesiol Intensive Ther. 2014;46(1):46–9.
26. Mitchell I, Bihari D, Chang R, Wendon J, Williams R. Earlier identification of patients at risk from acetaminophen-induced acute liver failure. Crit Care Med. 1998;26(2):279–84.
27. Kim J, Song TJ, Song D, Lee HS, Nam CM, Nam HS, et al. Serum alkaline phosphatase and phosphate in cerebral atherosclerosis and functional outcomes after cerebral infarction. Stroke. 2013;44(12):3547–9.
28. Webber M, Krishnan A, Thomas NG, Cheung BM. Association between serum alkaline phosphatase and C-reactive protein in the United States National Health and Nutrition Examination Survey 2005–2006. Clin Chem Lab Med. 2010;48(2):167–73.
29. Adrie C, Adib-Conquy M, Laurent I, Monchi M, Vinsonneau C, Fitting C, et al. Successful cardiopulmonary resuscitation after cardiac arrest as a “sepsis-like” syndrome. Circulation. 2002;106(5):562–8.
30. Meyer M, Wiberg S, Grand J, Kjaergaard J, Hassager C. Interleukin-6 receptor antibodies for modulating the systemic inflammatory response after out-of-hospital cardiac arrest (IMICA): study protocol for a double-blinded, placebo-controlled, single-center, randomized clinical trial. Trials. 2020;21(1):868.
31. Annborn M, Dankiewicz J, Erlinge D, Hertel S, Rundgren M, Smith JG, et al. Procalcitonin after cardiac arrest—an indicator of severity of illness, ischemia-reperfusion injury and outcome. Resuscitation. 2013;84(6):782–7.
32. Germani G, Theocharidou E, Adam R, Karam V, Wendon J, O’Grady J, et al. Liver transplantation for acute liver failure in Europe: outcomes over 20 years from the ELTR database. J Hepatol. 2012;57(2):288–96.
33. Shi X, Yu J, Pan Q, Lu Y, Li L, Cao H. Impact of total epinephrine dose on long term neurological outcome for cardiac arrest patients: a cohort study. Front Pharmacol. 2021;12:580234.
34. Ong ME, Tan EH, Ng FS, Panchalingham A, Lim SH, Manning PG, et al. Survival outcomes with the introduction of intravenous epinephrine in the management of out-of-hospital cardiac arrest. Ann Emerg Med. 2007;50(6):635–42.
35. Shoaib M, Kim N, Choudhary RC, Yin T, Shinozaki K, Becker LB, et al. Increased plasma disequilibrium between pro- and anti-oxidants during the early phase resuscitation after cardiac arrest is associated with increased levels of oxidative stress end-products. Mol Med. 2021;27(1):135.
36. Schneeweiss B, Pammer J, Ratheiser K, Schneider B, Madl C, Kramer L, et al. Energy metabolism in acute hepatic failure. Gastroenterology. 1993;105(5):1515–21.
37. Han HS, Kang G, Kim JS, Choi BH, Koo SH. Regulation of glucose metabolism from a liver-centric perspective. Exp Mol Med. 2016;48(3):e218.

Acknowledgements

The authors sincerely thank Enrica Iesu et al. for sharing their data.

Funding support was received from the Science and Technology Plan Project of Wenzhou Municipality (Y20220497).

Author information

Authors and affiliations.

Department of Emergency, The Third Affiliated to Shanghai University, Wenzhou People’s Hospital, No. 299 Guan Road, Louqiao Street, Ouhai District, Wenzhou, 325000, Zhejiang, China

Yuequn Xie, Liangen Lin, Linglong Chen & Wang Lv

Department of Scientific Research Center, The Third Affiliated to Shanghai University, Wenzhou People’s Hospital, Wenzhou, 325000, Zhejiang, China

Congcong Sun

Contributions

WL and YX planned and designed the study and wrote the manuscript. LL and LC contributed to the data cleaning and statistical analysis. WL, CS, and LL revised the manuscript for important intellectual content. All authors have read and approved the final version of the manuscript.

Corresponding author

Correspondence to Wang Lv.

Ethics declarations

Ethics approval and consent to participate.

The studies involving human participants were reviewed and approved by the Comité d’Ethique Hospitalo-Facultaire Erasme-ULB (P2017/264), which waived the need for informed consent because of the retrospective nature of the study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1. Curve fitting of the lnand clinical prognosis in patients with ALF following CA.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article

Xie, Y., Lin, L., Sun, C. et al. Association between serum alkaline phosphatase and clinical prognosis in patients with acute liver failure following cardiac arrest: a retrospective cohort study. Eur J Med Res 29, 453 (2024). https://doi.org/10.1186/s40001-024-02049-2

Received: 13 June 2024

Accepted: 02 September 2024

Published: 09 September 2024

DOI: https://doi.org/10.1186/s40001-024-02049-2

Clinical prognosis
Risk factor

European Journal of Medical Research

ISSN: 2047-783X

Toward Ensuring Data Quality in Multi-Site Cancer Imaging Repositories

1. Introduction

2. Materials and Methods

The first step is to create the Data Quality Conceptual Model which consists of the set of data quality metrics that will be assessed by this methodology.
The second step is to create a well-defined data-collection protocol, which is the product of a data integration procedure. This protocol constitutes a set of requirements that data should follow to be properly integrated into a multi-center repository and is designed to ensure data homogeneity among multiple-source data. The procedure is briefly described in previous work [ 20 ], but this article is extended to describe the steps in detail, as well as all the resulting requirements and data quality rules that ensure that the data are compliant with the metrics set through the conceptual model.
The third step is the quality assessment of the data provided through a Data Integration Quality Check Tool (DIQCT), which, based on the rules provided by step 2, checks if the quality requirements are met. The tool informs the user of corrective actions that need to be taken prior to the data provision to ensure that the data provided is of high quality. The tool was described in a previous publication [ 21 ], but in this article, an extended version is presented along with the evaluation results of 3 rounds of user experience assessment.
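To make the third step more concrete, the following sketch shows one way a rule-based check could pair each quality rule with a corrective action. It is not the DIQCT implementation; the two rules, the field names, and all identifiers are hypothetical and serve only to illustrate the idea of checking data against rules and reporting what the user should fix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    passes: Callable[[dict], bool]  # returns True when the record satisfies the rule
    corrective_action: str          # message shown to the data provider on failure


# HYPOTHETICAL rules; the actual ruleset is defined by the data-collection protocol.
RULES = [
    Rule("patient_id present",
         lambda r: bool(r.get("patient_id")),
         "fill in the patient identifier"),
    Rule("age within 0-120",
         lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
         "correct the age field"),
]


def assess(records, rules=RULES):
    """Return (record index, rule name, corrective action) for every violated rule."""
    report = []
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.passes(record):
                report.append((index, rule.name, rule.corrective_action))
    return report


if __name__ == "__main__":
    sample = [{"patient_id": "P1", "age": 54}, {"patient_id": "", "age": 130}]
    for row in assess(sample):
        print(row)
```

Keeping the rules as data rather than hard-coded branches means a new requirement from the collection protocol can be added without touching the assessment loop.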

2.2. Data Quality Conceptual Model

2.3. Data Quality Requirements and Rules Definition Procedure

2.3.1. Clinical Metadata and Structure

Identification: As a first step, an initial template was created for each cancer type—breast, colorectal, lung, prostate—incorporating domain knowledge from medical experts and the related literature. This first template included the methodology for separating collected data in different time points, as well as associating related imaging, laboratory, and histopathological examinations with each time point.
Review: These initial templates were circulated to the medical experts and reviewed. The medical experts shared their comments on the proposed protocol, and an asynchronous discussion took place to resolve controversial topics. In this step, fields were added, removed, or modified to fit the needs of the specific study.
Merge: After the review and the received comments, a consensus on each template was extracted and discussed thoroughly in a meeting with the medical experts to resolve homogenization issues.
Redefine: The data providers were asked to provide an example case for each cancer type. These example cases were reviewed for consistency between the entries deriving from different sites. Based on the inputs received, the allowable value sets were defined. The pre-final version of the templates was extracted in a homogenized way.
Standardize: At this point, the standardization of the fields’ content took place. Each one of the value sets was standardized to follow categorical values or medical standards. In applicable cases, terminologies based on medical standards, such as ICD-11 and ATC, were adopted.
Review and Refine: The templates were circulated again for verification.
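The outcome of the Redefine and Standardize steps is a set of allowable values per template field. A minimal sketch of how such value sets could be applied to a template entry is given below. The field names and value sets are hypothetical (only the four cancer types come from the text), and the real templates are not reproduced in this section.

```python
# HYPOTHETICAL allowable value sets for two template fields.
VALUE_SETS = {
    "sex": {"female", "male", "unknown"},
    "cancer_type": {"breast", "colorectal", "lung", "prostate"},
}


def invalid_fields(entry, value_sets=VALUE_SETS):
    """Return the sorted names of fields whose value lies outside its allowable set.

    Fields that are absent from the entry are ignored here; missing values are a
    completeness concern rather than a validity concern.
    """
    return sorted(
        field
        for field, allowed in value_sets.items()
        if field in entry and entry[field] not in allowed
    )


if __name__ == "__main__":
    print(invalid_fields({"sex": "F", "cancer_type": "lung"}))
```

In practice the value sets would be drawn from the adopted terminologies, such as ICD-11 or ATC codes, rather than from hand-written sets.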

2.3.2. Imaging Data

2.4. Data Integration Quality Check Tool

Version 1: In this version, the tool was implemented in two ways: (i) as an executable file (.exe), in which the pipeline along with all the dependencies was built as a directly executable file, and (ii) as a Docker image, in which the pipeline along with all the dependencies was built in a Docker container publicly available to all members of the consortium.
Version 2: To improve the usability of DIQCT, in the second version, a web application was implemented using R programming language and R Studio Shiny server, allowing the interactive execution of specific scripts through HTML pages. This application includes 5 components, and the execution of each of them is controlled by the user.
Version 3: The third release contains four additional components and some improvements in terms of efficiency and visualization.

2.5. Evaluation Methodology for the Quality Tool

3. Results: The INCISIVE Case

3.2. Data Quality Ruleset

3.3. The DIQCT

Structure and codification: In this component, errors related to the template structure and patient codification are reported. Specifically, this component checks and reports (i) The structure of the provided template. The provided template structure, in terms of tabs and columns, is compared to the one initially defined and circulated for use, and alterations are reported to the user for correction. This check is related to the second dimension, Accuracy. (ii) The patient IDs. The inserted patient unique identification numbers are checked to ensure that they follow the proper encoding and that the template contains no duplicate entries. The user must correct the reported errors before continuing. This check is related to both Accuracy and Uniqueness and to Rules 1 and 3.
Content Validity: In this component, errors related to the template content are reported. Specifically, this component checks if the standards and terminologies proposed for the allowable values of all fields of the templates are followed. It reports, for each patient separately in a different row, the fields of each tab that do not comply with the proposed value range. This component also checks if the time points provided are within the boundaries proposed in the collection protocol definition and if all time points provided are in the correct chronological order. The user must review erroneous entries. This check is related to Validity and Accuracy quality dimensions and Rules 2 and 4.
Case Completeness: This component presents an overview of the data provided. It depicts a summary for each patient in terms of what modalities are available at each time point, as well as the percentage of mandatory fields that are present for each patient and a list of the absent fields so the user can review the missing values and provide more information if possible. This report is related to Completeness and Rule 5.

The components of the second category are:
Template-image Consistency: This component has a dual role: (i) For each patient, the imaging modalities provided are recorded in the template at the corresponding time point. This component checks, for each entry, the agreement between the template and the images provided. (ii) If the provided images are compliant with the template, it renames the studies’ folders according to a predefined naming convention so they can be stored in a unified way. In case of inconsistency between the template and the folders, a message appears for the specific patient. The user must correct the reported errors before continuing. This component is related to Integrity and Rules 1 and 3.

The components of the third category are:
DICOM De-identification Protocol: DICOM files contain not only imaging information, such as intensity for each pixel, but also several valuable metadata crucial for the proper interpretation of the images. These metadata are stored in specific DICOM tags, each characterized by a group of two hexadecimal values. The de-identification protocol is defined by a list of tags and their respective actions, which could involve removing the value or replacing it with a new one. The main goal of this component is to verify whether a specific de-identification protocol has been correctly applied to the imaging data. To achieve this, the tool checks the metadata in all the DICOM files and suggests appropriate actions to ensure compliance with the protocol. Users can interact with the tool and choose among different protocols, making it highly versatile. The tool generates an output in a tabular format, listing the metadata that does not comply with the protocol, along with the path to the corresponding image and the corrective action that needs to be applied. Additionally, the tool provides a graphical representation of the most common errors using a bar chart. It is important to note that this component does not assess whether personal data are overlaid in the image as burned-in information. Its primary focus is on the proper handling of DICOM metadata to maintain data consistency and privacy. It is related to Consistency and Rule 7.
DICOM Validation: As mentioned earlier, DICOM metadata contain valuable information related to the acquisition protocol. The cornerstone of this component is the dciodvfy tool (https://dclunie.com/dicom3tools/dciodvfy.html, accessed on 27 August 2024), which performs several checks on DICOM files. First, it verifies attributes against the requirements of the Information Object Definitions (IODs) and Modules defined in DICOM PS 3.3 (Information Object Definitions). Second, it ensures that the encoding of data elements and values follows the encoding rules specified in DICOM PS 3.5 (Data Structures and Encoding). Third, it validates data element value representations and multiplicity against the data dictionary in DICOM PS 3.6 (Data Dictionary). Lastly, it checks the consistency of attributes that are expected to be identical across all instances of the same entity. Through these checks, the tool assesses the integrity and conformity of the DICOM data. However, the DICOM Standard Committee does not provide any official tool to ensure complete DICOM compliance. As a result, this tool does not guarantee that a DICOM file is entirely compliant with the standard, even if no errors are found during validation. Nevertheless, the tool does report major errors, such as missing mandatory attributes, invalid values in DICOM tags, encoding issues, or errors in the unique identifiers. As with the de-identification component, the outcome of the validation component is provided both as a table listing the errors found in each DICOM file and as bar charts showing the most common issues across all the DICOM files. This presentation helps users identify and address potential problems in the DICOM data. This component relates to Rule 6 and Validity.
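A hedged sketch of how the validator's output could be collected per file is given below. It assumes that the dciodvfy executable is available on the PATH, that it accepts the file path as its only argument, and that each finding is printed on its own line beginning with "Error" or "Warning". The function names are illustrative and are not part of the tool described here.

```python
import subprocess
from collections import Counter


def run_dciodvfy(path):
    """Run dciodvfy on one file and return its combined text output."""
    result = subprocess.run(["dciodvfy", path], capture_output=True, text=True)
    # Both streams are kept because the sketch does not rely on which one is used.
    return result.stdout + result.stderr


def parse_messages(output):
    """Split validator output into error lines and warning lines.

    Assumes each finding is on its own line and starts with 'Error' or 'Warning'.
    """
    errors, warnings = [], []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Error"):
            errors.append(line)
        elif line.startswith("Warning"):
            warnings.append(line)
    return errors, warnings


def summarize(reports):
    """Count how often each error message occurs across files.

    `reports` maps a file path to its list of error lines; the counts can
    feed a bar chart of the most common issues.
    """
    return Counter(msg for errors in reports.values() for msg in errors)
```

Keeping the parser separate from the subprocess call makes it possible to test the tabulation logic without the external tool installed.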
Image Requirements: These requirements concern the quality of the data used for analysis and for training the algorithms. Quality factors may include pixel size, slice thickness, field of view, etc. The component analyzes the DICOM files to check that the data are of similar quality and produces an outcome table listing the images that do not fulfill each requirement. This component relates to Rule 9 and Consistency.
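The requirement check can be illustrated with a small sketch. The attribute names follow common DICOM keywords, but the numeric ranges are hypothetical and are not taken from the paper. For simplicity, pixel spacing is treated as a single number, whereas in DICOM it is a pair of values.

```python
# Hypothetical requirements: attribute -> (min, max) in millimetres.
REQUIREMENTS = {
    "PixelSpacing": (0.3, 1.0),     # simplified to one value per image
    "SliceThickness": (0.5, 5.0),
}


def check_image_requirements(images, requirements=REQUIREMENTS):
    """Return {requirement: [paths of images that do not fulfil it]}.

    `images` maps an image path to a dict of already extracted attributes.
    A missing attribute is treated as a failure of that requirement.
    """
    failures = {name: [] for name in requirements}
    for path, attrs in images.items():
        for name, (low, high) in requirements.items():
            value = attrs.get(name)
            if value is None or not (low <= value <= high):
                failures[name].append(path)
    return failures
```

The returned mapping has one entry per requirement, which corresponds to the outcome table listing the images that fail each one.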
Required Imaging Modalities: These requirements concern the imaging modalities expected for each cancer type. AI developers collaborate with clinical experts to define these modalities, which are specific to each type of cancer. The component checks whether at least one of the imaging modalities defined for each cancer type is provided. For instance, if data from lung cancer patients are provided, the AI developers expect to access CT images or X-rays, as the implemented models rely on these types of images. This component relates to Rule 8 and Completeness.
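A minimal sketch of the modality check is shown below. Only the lung example (CT or X-ray) comes from the text; representing X-rays by the DICOM modality codes CR and DX, and the function and field names, are assumptions made for illustration.

```python
# Expected modalities per cancer type. Only the lung entry reflects the text;
# mapping "X-ray" to the DICOM codes CR and DX is an assumption.
EXPECTED = {"lung": {"CT", "CR", "DX"}}


def check_modalities(cancer_type, provided, expected=EXPECTED):
    """Check whether at least one expected modality is provided for a patient."""
    wanted = expected[cancer_type]
    found = wanted & set(provided)
    return {
        "fulfilled": bool(found),            # at least one expected modality present
        "found": sorted(found),
        "missing": sorted(wanted - found),   # useful for reporting partial coverage
    }
```

For example, `check_modalities("lung", ["CT", "MR"])` reports the requirement as fulfilled and lists the expected modalities that were not provided.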
Annotations: This requirement relates to the availability of annotation files in the correct series folder. Some AI tools may need segmentation files, so this component checks whether an annotation file of a specified format, such as a NIfTI file, is present in a folder. Additionally, the tool verifies whether the annotation file has the same number of pixels along the X and Y axes as the DICOM files located in the same folder. It also checks whether the number of slices in the annotation file matches the number of DICOM images in the folder. If more than one annotation file is found in the same folder, or if the annotation file is not in the series folder, an appropriate message is provided. The tool generates an outcome table listing all the annotation files found, along with any issues identified. This component relates to Rule 10 and Consistency.
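The annotation checks can be sketched as below. This is an illustration under stated assumptions: the accepted suffixes, the message texts, and the function name are hypothetical, and the annotation shape and DICOM dimensions are assumed to have been read beforehand (for example with a NIfTI reader and a DICOM library) with their in-plane axes in the same order.

```python
# Assumed file suffixes for NIfTI annotation files.
ANNOTATION_SUFFIXES = (".nii", ".nii.gz")


def check_annotation(folder_files, annotation_shape, dicom_rows_cols, n_dicom_images):
    """Return (annotation files found, list of issues) for one series folder.

    folder_files:     file names in the series folder
    annotation_shape: (size_1, size_2, n_slices) of the annotation, or None
    dicom_rows_cols:  in-plane size of the DICOM images, same axis order
    n_dicom_images:   number of DICOM images in the folder
    """
    issues = []
    annotations = [f for f in folder_files if f.lower().endswith(ANNOTATION_SUFFIXES)]
    if not annotations:
        issues.append("no annotation file in series folder")
    elif len(annotations) > 1:
        issues.append("more than one annotation file in series folder")
    if annotation_shape is not None:
        if tuple(annotation_shape[:2]) != tuple(dicom_rows_cols):
            issues.append("in-plane size differs from DICOM images")
        if annotation_shape[2] != n_dicom_images:
            issues.append("number of slices differs from number of DICOM images")
    return annotations, issues
```

An empty issue list means that exactly one annotation file was found and that its dimensions agree with the DICOM series in the same folder.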
DICOM Overall Patient Evaluation: This component summarizes the findings from the previous components and presents a table containing all the patients and the extent to which the quality requirements are met. Additionally, this component checks for any duplicate images that may exist for each patient and across the whole repository. Each requirement is included in a separate column of the table. If a requirement is fully met for a patient, the respective cell is colored green. If a requirement is only partially covered, for example, when not all the expected imaging modalities are provided, the cell is colored beige. If the requirement is not met at all, the cell is colored red. This component relates to Uniqueness and Completeness.
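The colour coding and the duplicate check can be sketched as follows. The text does not state how duplicates are detected, so comparing a SHA-256 hash of the file content is an assumption, as are the function names and the representation of images as in-memory bytes.

```python
import hashlib
from collections import defaultdict


def cell_colour(n_met, n_total):
    """Map the share of satisfied checks for one requirement to a cell colour."""
    if n_total and n_met == n_total:
        return "green"   # requirement fully met
    if n_met > 0:
        return "beige"   # requirement partially covered
    return "red"         # requirement not met at all


def find_duplicates(images):
    """Group image paths whose content is byte-for-byte identical.

    `images` maps a path to the file content as bytes. Hashing the content
    is an assumed strategy, not necessarily the one used by the tool.
    """
    groups = defaultdict(list)
    for path, content in images.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Because the grouping ignores folder boundaries, the same function covers duplicates within one patient and across the whole repository.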

3.4. Evaluation of the Tool


Kocarnik, J.M.; Compton, K.; Dean, F.E.; Fu, W.; Gaw, B.L.; Harvey, J.D.; Henrikson, H.J.; Lu, D.; Pennini, A.; Xu, R.; et al. Cancer Incidence, Mortality, Years of Life Lost, Years Lived with Disability, and Disability-Adjusted Life Years for 29 Cancer Groups From 2010 to 2019 A Systematic Analysis for the Global Burden of Disease Study 2019. JAMA Oncol. 2022 , 8 , 420–444. [ Google Scholar ] [ CrossRef ]
Ferlay, J.; Colombet, M.; Soerjomataram, I.; Parkin, D.M.; Piñeros, M.; Znaor, A.; Bray, F. Cancer statistics for the year 2020: An overview. Int. J. Cancer 2021 , 149 , 778–789. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Saslow, D.; Boetes, C.; Burke, W.; Harms, S.; Leach, M.O.; Lehman, C.D.; Morris, E.; Pisano, E.; Schnall, M.; Sener, S.; et al. American Cancer Society Guidelines for Breast Screening with MRI as an Adjunct to Mammography. CA Cancer J. Clin. 2007 , 57 , 75–89. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Wang, L.; Lu, B.; He, M.; Wang, Y.; Wang, Z.; Du, L. Prostate Cancer Incidence and Mortality: Global Status and Temporal Trends in 89 Countries From 2000 to 2019. Front. Public Health 2022 , 10 , 811044. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Siegel, R.L.; Miller, K.D.; Sauer, A.G.; Fedewa, S.A.; Butterly, L.F.; Anderson, J.C.; Cercek, A.; Smith, R.A.; Jemal, A. Colorectal cancer statistics, 2020. CA Cancer J. Clin. 2020 , 70 , 145–164. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Aberle, D.R.; Black, W.C.; Chiles, C.; Church, T.R.; Gareen, I.F.; Gierada, D.S.; Mahon, I.; Miller, E.A.; Pinsky, P.F.; Sicks, J.D. Lung Cancer Incidence and Mortality with Extended Follow-up in the National Lung Screening Trial. J. Thorac. Oncol. 2019 , 14 , 1732–1742. [ Google Scholar ] [ CrossRef ]
Bhinder, B.; Gilvary, C.; Madhukar, N.S.; Elemento, O. Artifi Cial intelligence in cancer research and precision medicine. Cancer Discov. 2021 , 11 , 900–915. [ Google Scholar ] [ CrossRef ]
Bizzo, B.C.; Almeida, R.R.; Michalski, M.H.; Alkasab, T.K. Artificial Intelligence and Clinical Decision Support for Radiologists and Referring Providers. J. Am. Coll. Radiol. 2019 , 16 , 1351–1356. [ Google Scholar ] [ CrossRef ]
Yin, J.; Ngiam, K.Y.; Teo, H.H. Role of Artificial Intelligence Applications in Real-Life Clinical Practice: Systematic Review. J. Med. Internet Res. 2021 , 23 , e25759. [ Google Scholar ] [ CrossRef ]
Martinez-Millana, A.; Saez-Saez, A.; Tornero-Costa, R.; Azzopardi-Muscat, N.; Traver, V.; Novillo-Ortiz, D. Artificial intelligence and its impact on the domains of universal health coverage, health emergencies and health promotion: An overview of systematic reviews. Int. J. Med. Inform. 2022 , 166 , 104855. [ Google Scholar ] [ CrossRef ]
Gillies, R.J.; Schabath, M.B. Radiomics improves cancer screening and early detection. Cancer Epidemiol. Biomark. Prev. 2020 , 29 , 2556–2567. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Chen, Z.H.; Lin, L.; Wu, C.F.; Li, C.F.; Xu, R.H.; Sun, Y. Artificial intelligence for assisting cancer diagnosis and treatment in the era of precision medicine. Cancer Commun. 2021 , 41 , 1100–1115. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Liu, M.; Wu, J.; Wang, N.; Zhang, X.; Bai, Y.; Guo, J.; Zhang, L.; Liu, S.; Tao, K. The value of artificial intelligence in the diagnosis of lung cancer: A systematic review and meta-analysis. PLoS ONE 2023 , 18 , e0273445. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Spadaccini, M.; Troya, J.; Khalaf, K.; Facciorusso, A.; Maselli, R.; Hann, A.; Repici, A. Artificial Intelligence-assisted colonoscopy and colorectal cancer screening: Where are we going? Dig. Liver Dis. 2024 , 56 , 1148–1155. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Yuan, J.; Hu, Z.; Mahal, B.A.; Zhao, S.D.; Kensler, K.H.; Pi, J.; Hu, X.; Zhang, Y.; Wang, Y.; Jiang, J.; et al. Integrated Analysis of Genetic Ancestry and Genomic Alterations across Cancers. Cancer Cell. 2018 , 34 , 549–560.e9. [ Google Scholar ] [ CrossRef ]
Carle, F.; Di Minco, L.; Skrami, E.; Gesuita, R.; Palmieri, L.; Giampaoli, S.; Corrao, G. Quality assessment of healthcare databases. Epidemiol. Biostat. Public Health 2017 , 14 , 1–11. [ Google Scholar ] [ CrossRef ]
Kahn, M.G.; Callahan, T.J.; Barnard, J.; Bauck, A.E.; Brown, J.; Davidson, B.N.; Estiri, H.; Goerg, C.; Holve, E.; Johnson, S.G.; et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. eGEMs 2016 , 4 , 18. [ Google Scholar ] [ CrossRef ]
Kim, K.-H.; Choi, W.; Ko, S.-J.; Chang, D.-J.; Chung, Y.-W.; Chang, S.-H.; Kim, J.-K.; Kim, D.-J.; Choi, I.-Y. Multi-center healthcare data quality measurement model and assessment using omop cdm. Appl. Sci. 2021 , 11 , 9188. [ Google Scholar ] [ CrossRef ]
Huser, V.; DeFalco, F.J.; Schuemie, M.; Ryan, P.B.; Shang, N.; Velez, M.; Park, R.W.; Boyce, R.D.; Duke, J.; Khare, R.; et al. Multisite Evaluation of a Data Quality Tool for Patient-Level Clinical Datasets. eGEMs 2016 , 4 , 24. [ Google Scholar ] [ CrossRef ]
Kosvyra, A.; Filos, D.; Fotopoulos, D.; Tsave, T.; Chouvarda, I. Towards Data Integration for AI in Cancer Research. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico, 1–5 November 2021; pp. 2054–2057. [ Google Scholar ] [ CrossRef ]
Kosvyra, A.; Filos, D.; Fotopoulos, D.; Tsave, O.; Chouvarda, I. Data Quality Check in Cancer Imaging Research: Deploying and Evaluating the DIQCT Tool. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC, Scotland, UK, 11–15 July 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 1053–1057. [ Google Scholar ] [ CrossRef ]
Laugwitz, B.; Held, T.; Schrepp, M. LNCS 5298—Construction and Evaluation of a User Experience Questionnaire ; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5298. [ Google Scholar ]
Pezoulas, V.C.; Kourou, K.D.; Kalatzis, F.; Exarchos, T.P.; Venetsanopoulou, A.; Zampeli, E.; Gandolfo, S.; Skopouli, F.; De Vita, S.; Tzioufas, A.G.; et al. Medical data quality assessment: On the development of an automated framework for medical data curation. Comput. Biol. Med. 2019 , 107 , 270–283. [ Google Scholar ] [ CrossRef ]
Wada, S.; Tsuda, S.; Abe, M.; Nakazawa, T.; Urushihara, H. A quality management system aiming to ensure regulatory-grade data quality in a glaucoma registry. PLoS ONE 2023 , 18 , e0286669. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Zaridis, D.I.; Mylona, E.; Tachos, N.; Pezoulas, V.C.; Grigoriadis, G.; Tsiknakis, N.; Marias, K.; Tsiknakis, M.; Fotiadis, D.I. Region-adaptive magnetic resonance image enhancement for improving CNN-based segmentation of the prostate and prostatic zones. Sci. Rep. 2023 , 13 , 714. [ Google Scholar ] [ CrossRef ] [ PubMed ]
Dovrou, A.; Nikiforaki, K.; Zaridis, D.; Manikis, G.C.; Mylona, E.; Tachos, N.; Tsiknakis, M.; Fotiadis, D.I.; Marias, K. A segmentation-based method improving the performance of N4 bias field correction on T2weighted MR imaging data of the prostate. Magn. Reson. Imaging 2023 , 101 , 1–12. [ Google Scholar ] [ CrossRef ] [ PubMed ]


Rule	Name	Brief Description
1	Naming conventions	(a) PatientId: XXX-YYYYYY; (b) Imaging examinations: PatientId_Modality_timepoint
2	Time points	Definition of 4 time points (Baseline, After 1st Treatment, 1st Follow-Up, and 2nd Follow-Up) and their periods
3	Structure	(a) Clinical metadata: template; (b) Folder structure
4	Value ranges	(a) Allowable type; (b) Actual value range
5	Mandatory fields	Definition of the minimum fields that should be present in the template
6	DICOM validation	Attribute verification, encoding validation, value representation and multiplicity check, attribute consistency
7	De-identification protocol	Definition of the de-identification profile
8	Expected imaging modalities	Definition of the list of imaging modalities per cancer type
9	Analysis requirements	Definition of the expected values that each imaging modality should have to be used for analysis
10	Annotation mismatch	(a) The number of DICOM images is the same as the number of slices in the annotation file; (b) The number of rows and the number of columns are identical for the annotation file and the images

Round	No. of Participants	Clinical Experts	Technical Experts	Windows	Linux	Mac
Round 1	9	3	6	7	1	1
Round 2	6	4	2	5	-	1
Round 3	7	4	3	5	1	1


Kosvyra, A.; Filos, D.T.; Fotopoulos, D.T.; Tsave, O.; Chouvarda, I. Toward Ensuring Data Quality in Multi-Site Cancer Imaging Repositories. Information 2024, 15, 533. https://doi.org/10.3390/info15090533



Repositories for Sharing Scientific Data
Repositories for Sharing Scientific Data - NIH data sharing
Biomedical Data Repositories and Knowledgebases
Data repositories and knowledgebases exist on a spectrum of ability and readiness to adopt the desirable characteristics aligned with FAIR and TRUST principles. Due to the critical nature of research data resources, repositories, and datasets, the development of metrics to evaluate the usage, utility, and impact of a given repository is essential.
Finding Datasets, Data Repositories, and Data Standards
Archived Clinical Research Datasets
Data Management and Sharing Information. The data repository houses the NINDS Division of Clinical Research (DCR) funded studies and trials in neurological areas such as stroke, Parkinson's disease, migraine, MS, and other neurologic disorders. The data requests and approvals are managed by NINDS. Imaging, biosamples, and other types of data are not supported by this repository.
Search NCBI databases
Search NCBI databases - NLM
Research Data Repositories & Databases
Research Data Repositories & Databases - Finding Datasets ...
STAnford medicine Research data Repository
STAnford medicine Research data Repository. STARR is a data resource that is designed to improve access to healthcare data by researchers. STARR contains data from Stanford Health Care, and the Stanford Children's Hospital and supports diverse use cases and research applications. STARR has raw data, analysis ready data, linked data across ...
Recommended Repositories
Recommended Repositories | PLOS ONE
The Clinical Research Data Repository of the US National Institutes of
The Columbia University Clinical Data Repository. The initial design of BTRIS has been based on experience with the creation of the Clinical Data Repository (CDR) at the Columbia University Medical Center in New York. That system has accrued patient care data since 1988 from many different sources, including laboratories, pharmacies, radiology departments, order entry, and clinician ...
Biomedical Data Repository Concepts and Management Principles
Biomedical data repository: Systems that accept submissions of relevant data from the research community to store, organize, validate, archive, preserve, and distribute the data, in compliance ...
Traits and types of health data repositories
A repository that adds levels of integration and quality to the primary (research or clinical) data of a single institution, to support flexible queries for multiple uses. Is broader in application than a registry. Collection: A library of heterogeneous data sets from more organizations than a warehouse or more sources than a registry.
Clinical Data
Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.
Datasets
Stanford Clinical Data. STARR, a research data repository with 20 years of fully identified clinical data (since 1998), includes, but is not limited to, nightly clinical data, Epic Clarity, from both Stanford Health Care (SHC, aka adult hospital) and Stanford Children's Health (SCH, aka Lucile Packard Children's Hospital or LKSC). STARR ...
Vivli
Vivli - Center for Global Clinical Research Data
Recommended Repositories
Additionally, the Registry of Research Data Repositories is a full scale resource of registered data repositories across subject areas. Both FAIRsharing and Re3Data provide information on an array of criteria to help researchers identify the repositories most suitable for their needs (e.g., licensing, certificates and standards, policy, etc.).
STAnford Research Repository (STARR) Tools
The STAnford Research Repository, or STARR, is Stanford Medicine's approved resource for working with clinical data for research purposes. The STARR IRB permits the collection and aggregation of all data generated at Stanford for clinical care purposes, and articulates the formal approval process each research project must follow in order to obtain and work with this data for research purposes.
Home Page
Clinical Trials. NIH expects clinical trials to be registered and summary results reported in ClinicalTrials.gov ... Secondary Research with Data and Biospecimens. ... Repositories for Sharing Scientific Data. Need help identifying the right repository for your data? Check out our filterable list of NIH-affiliated repositories. View ...
What You Need to Know Before Implementing a Clinical Research Data
Background: Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.
Data Repository Guidance
Data Repository Guidance | Scientific Data
Data Repository Finder
Open Domain-Specific Data Sharing Repositories
The 4D Nucleome Data Portal is an open repository for genomic and microscopy nuclear architecture datasets generated by members of the 4D Nucleome consortium or other relevant studies. In addition to curating and publicly sharing these datasets, the portal provides standardized bioinformatics and visualization tools to aid exploration and ...
What You Need to Know Before Implementing a Clinical Research Data
Another example is the Hanover Medical School Translational Research Framework (HaMSTR) at the Hanover Peter L. Reichertz Institute, which was developed to automatically load data from a clinical data repository into a standard data model that researchers can query; it is a successful example of fast data upload and query using data ...
Data at WHO
data.who.int. This world-class interactive digital platform provides WHO, partners, countries, and the public with access to usable, trusted health data. Welcome to your gateway to all public health data. These databases and platforms provide access to understandable and timely data, transforming lives by making health and inequality data ...
Handling imbalanced medical datasets: review of a decade of research
Machine learning and medical diagnostic studies often struggle with the issue of class imbalance in medical datasets, complicating accurate disease prediction and undermining diagnostic tools. Despite ongoing research efforts, specific characteristics of medical data frequently remain overlooked. This article comprehensively reviews advances in addressing imbalanced medical datasets over the ...
Association between serum alkaline phosphatase and clinical prognosis
Acute liver failure (ALF) following cardiac arrest (CA) poses a significant healthcare challenge, characterized by high morbidity and mortality rates. This study aims to assess the correlation between serum alkaline phosphatase (ALP) levels and poor outcomes in patients with ALF following CA. A retrospective analysis was conducted utilizing data from the Dryad digital repository.
Toward Ensuring Data Quality in Multi-Site Cancer Imaging Repositories
The proposed methodology can assist the deployment of big data centralized or distributed repositories with data from diverse data sources, thus facilitating the development of AI tools. ... AI is significantly transforming cancer research and clinical practice by enhancing diagnostic accuracy, personalizing treatment plans, and accelerating ...
Where is the research on sport-related concussion in Olympic athletes
Objectives This cohort study reported descriptive statistics in athletes engaged in Summer and Winter Olympic sports who sustained a sport-related concussion (SRC) and assessed the impact of access to multidisciplinary care and injury modifiers on recovery. Methods 133 athletes formed two subgroups treated in a Canadian sport institute medical clinic: earlier (≤7 days) and late (≥8 days ...
