Israel News Beep
A methodology for developing dermatological datasets: lessons from retrospective data collection for AI-based applications | BMC Medical Research Methodology

  • November 5, 2025

In recent years, artificial intelligence (AI) algorithms have demonstrated considerable potential in the medical domain [1]. Dermatology, in particular, has emerged as a leading area of application due to the visual nature of skin diseases and the growing availability of clinical imaging data [2, 3]. Among the most widely adopted algorithms are deep neural networks, which have shown notable performance in tasks such as skin cancer classification and lesion detection [4,5,6,7,8]. According to GLOBOCAN, skin cancer remains a major global health concern, with both melanoma skin cancer (MSC) and non-melanoma skin cancer (NMSC) showing rising incidence and mortality rates [9]. It has been estimated that one person dies of skin cancer every four minutes [9, 10]. Europe, Asia, Australia, and New Zealand are the regions with the highest incidence and mortality of MSC, while NMSC cases are more frequent in North America, Asia, Australia, and New Zealand [11]. Currently, skin cancer affects approximately 1 in every 5 000 individuals worldwide [12]. Dermatologists consider it a public health concern, and some even classify it as an epidemic because of its increasing incidence. The reported burden of disease is disproportionately high in regions with robust diagnostic infrastructure and reporting systems, which has concentrated research efforts in these geographic areas [11]. These environments feature well-curated dermatological datasets, which are essential for training and validating AI models and are predominantly derived from high-income countries [5, 7]. In contrast, low- and middle-income countries often face limitations in the development of dermatological datasets due to barriers such as limited access to dermatological care, insufficient digital infrastructure, and a lack of standardized protocols [6]. This geographic imbalance restricts the generalizability of AI models and may exacerbate health inequities by excluding underrepresented populations [5, 7].

Although publicly available dermatological datasets have improved access to clinical and dermoscopic images, they often lack comprehensive metadata, and many reflect demographic biases, particularly in the representation of skin tone [10]. Most of these datasets originate from high-income settings, which reinforces structural disparities and contributes to algorithmic bias [13, 14]. This lack of inclusivity is a key factor in what is increasingly referred to as ‘health data poverty’, a challenge especially pertinent in Latin America [15]. Wen et al. highlight the urgent need for standardized metadata protocols, citing inconsistencies between datasets and a general lack of methodological transparency in dataset construction [5]. These limitations hinder reproducibility and complicate cross-cohort comparisons.
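The representation gap described above can be made concrete with a simple audit of a dataset's metadata. The following is a minimal sketch, not a method from the cited studies: it assumes each record is a dictionary, and the field name `skin_type`, the category labels, and the 15% threshold are hypothetical choices made only for illustration.

```python
from collections import Counter


def representation_shares(records, field):
    """Return the share of records per value of `field`.

    Records where the field is absent or empty are counted under "missing",
    so incomplete metadata is visible rather than silently dropped.
    """
    counts = Counter(
        rec.get(field) if rec.get(field) not in (None, "") else "missing"
        for rec in records
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def underrepresented(shares, threshold):
    """List the recorded categories whose share falls below `threshold`."""
    return sorted(
        value for value, share in shares.items()
        if value != "missing" and share < threshold
    )


# Toy metadata; the field name and labels are hypothetical.
records = (
    [{"skin_type": "II"}] * 6
    + [{"skin_type": "III"}] * 2
    + [{"skin_type": "V"}]
    + [{}]  # a record with no skin-type annotation
)

shares = representation_shares(records, "skin_type")
flagged = underrepresented(shares, threshold=0.15)
```

Counting missing values as their own category reflects the point made in the text: incomplete metadata and skewed representation are separate problems, and an audit should expose both.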

One effort to standardize dermatological data builds on DICOM (Digital Imaging and Communications in Medicine), the international standard for medical images. In dermatology, the standard has been adopted for metadata encoding, image repositories, and electronic medical records. Although some work has begun to implement the DICOM standard successfully [16], its scope is limited because it has been implemented in the context of dermoscopy images. Implementing DICOM in dermatology therefore remains a complex task, especially for clinical images, because these are obtained through cameras on mobile devices. This type of acquisition lacks standardization and therefore generates variations in image quality [17]. Future adoption of the DICOM standard is expected to facilitate the use of AI, particularly by improving interoperability, that is, communication and integration between digital platforms and applications [18]. However, ethical, regulatory, medical-legal, and labor issues must be addressed to integrate the DICOM standard and AI into dermatology successfully.

The International Skin Imaging Collaboration (ISIC) remains the most prominent source of publicly accessible dermatological data [19]. However, it lacks a standardized methodology for dataset construction and has faced criticism for issues such as duplicate cases within training and test sets [10]. These methodological gaps underscore the need for a structured, reproducible workflow that ensures data quality, diversity, and integrity, especially when such datasets underpin a diagnostic tool meant for global deployment [20]. Despite an increase in published efforts to share dataset development pipelines, few studies present clearly defined, step-by-step methodologies [15, 16, 21,22,23,24]. A notable exception is PAD-UFES-20, focused on the Brazilian population, which documents a process running from clinical consultation to histopathological confirmation and secure data storage for cases involving suspected neoplasms. Metadata and clinical images are stored on a secure web server, and a final quality review is carried out for each sample [21].
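The duplicate-case problem mentioned above can be screened for, at least in its simplest form, before a dataset is released. The sketch below is an illustration under stated assumptions and not the procedure used by ISIC or any cited dataset: it detects only byte-identical files shared between a training split and a test split, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cross_split_duplicates(train_paths, test_paths):
    """Return (train_file, test_file) pairs whose contents are byte-identical."""
    # Index the training files by content digest.
    train_index = {}
    for path in train_paths:
        train_index.setdefault(file_digest(path), []).append(Path(path))

    # Any test file whose digest appears in the index is a leak.
    leaks = []
    for path in test_paths:
        for train_file in train_index.get(file_digest(path), []):
            leaks.append((train_file, Path(path)))
    return leaks
```

A content hash misses images that were resized, recompressed, or re-photographed, and it does not catch different images of the same lesion or patient. Those cases need perceptual similarity measures or patient-level identifiers, which this sketch does not attempt.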

The first dermatological dataset from Argentina reports metadata and images from 623 patients and describes the main phases of its development: data collection, selection, preprocessing, and technical validation [15]. The BCN20000 dataset from Spain similarly describes a structured pipeline that includes clinical data acquisition, preprocessing, and validation, and it comprises lesions from complex anatomical areas, including the nails and mucosa. Its creation involved several stages: image acquisition, image type separation, automated patient identification, linking images with clinical records, data registration, and technical validation [22]. Other examples include the SLICE-3D dataset [23], which integrates clinical snippets taken from 3D total-body photography, and DERM12345 [24], notable for its hierarchical classification system and inclusion of diverse skin types. The SIIM-ISIC 2020 Challenge dataset [16] outlines a seven-step protocol addressing image curation, duplicate detection, metadata registration, and validation. These examples illustrate the heterogeneity of methodologies used in dataset development (see Table 1) [15, 16, 21,22,23,24]. Although many include clinical oversight, workflows vary widely in terms of image acquisition, metadata documentation, and validation. The role of dermatologists is consistently emphasized across studies [15], but clinical environments often lack the infrastructure, training, or technical capacity to support robust dataset development [25]. Emerging smaller-scale datasets have helped address the information gap in certain countries across the Americas [15, 21]. These contributions are valuable for other countries in the region that are beginning to engage in dermatological research.

Table 1 Summary of recent studies reporting dermatological datasets

This study aims to contribute a structured, empirically grounded methodology for dermatological dataset development. More broadly, it seeks to expand the datasets available worldwide for artificial intelligence applications and to reduce disparities in skin cancer diagnosis among different populations. Our approach is informed by two real-world case studies, one conducted in Chile (Table 2) and the other in Mexico (Table 3), and is designed to address three critical gaps observed in current public datasets: (1) the need for multimodal datasets that integrate both images and associated metadata to more accurately reflect clinical practice and support deep learning applications, given that most existing public datasets are limited to image data only; (2) the absence of standardized metadata protocols, including clear guidelines on which clinical variables should be recorded, how they should be defined, and what criteria guide their inclusion; and (3) the limited availability of dermatological datasets from underrepresented regions such as Central and South America, which hinders the development of inclusive, generalizable AI tools [26]. Current classification systems rely predominantly on data from Europe, North America, and Oceania [5, 18], leading to globally imbalanced data representation [27]. In response to these challenges, we present a practical methodology co-developed by experts in dermatology and computer science. Our proposed framework simplifies each step of dataset construction. In addition to the methodological framework, we identify challenges arising during real-world clinical implementation and propose mitigation strategies. The rest of the article is organized as follows. The Methods section presents the proposed methodology. The Results section summarizes the two case studies in which we applied the proposed methodology. The ‘Developing dermatological datasets: challenges and recommendations’ section describes the main challenges in creating dermatological datasets and offers practical recommendations for developing retrospective and prospective datasets. The ‘Analysis based on previous efforts to create datasets’ section analyzes key aspects of previous initiatives for creating dermatological datasets. The Limitations section describes the limitations of the proposed methodology. Finally, the Conclusion section presents the conclusions.
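The second gap, the absence of a standardized metadata protocol, can be illustrated with a small data dictionary that states which variables are recorded and how each is defined, plus a check that flags records deviating from it. This is a minimal sketch and not the protocol proposed in the article: every variable name, allowed value, and range below is a hypothetical placeholder.

```python
# Hypothetical data dictionary: which variables are recorded and how
# each is defined. None of these entries come from the article.
DATA_DICTIONARY = {
    "age_years": {"type": int, "required": True, "min": 0, "max": 120},
    "sex": {"type": str, "required": True,
            "allowed": {"female", "male", "unknown"}},
    "anatomical_site": {"type": str, "required": True},
    "diagnosis": {"type": str, "required": True},
    "histopathology_confirmed": {"type": bool, "required": False},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, rule in dictionary.items():
        value = record.get(name)
        if value is None:
            if rule.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_type = not isinstance(value, rule["type"]) or (
            rule["type"] is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: value {value!r} not allowed")
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: above maximum {rule['max']}")
    # Variables outside the dictionary signal an undocumented field.
    for name in record:
        if name not in dictionary:
            problems.append(f"{name}: not in data dictionary")
    return problems


good = {"age_years": 54, "sex": "female",
        "anatomical_site": "forearm", "diagnosis": "nevus"}
bad = {"age_years": 154, "sex": "F", "diagnosis": "nevus", "phone": "n/a"}

good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Keeping the definitions in one declarative structure means that the same dictionary can be published alongside the dataset, which addresses the transparency concern raised earlier about how variables are chosen and defined.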

Table 2 Summary of the methodology applied to Chilean patient data

Table 3 Summary of the methodology applied to Mexican patient data
