Exploring the Landscape of Machine Learning Data Bugs: Frequency, Impacts, and Relationships for Enhanced Automation

Abstract

As machine learning (ML) systems are increasingly deployed in critical domains, addressing data-related issues is essential to ensure the reliability, fairness, and performance of AI models. This research investigates bugs in ML training data, analyzing their individual and combined impacts as well as their interrelationships. The novelty of this work lies in its comprehensive focus on bugs specific to ML training data—an area often overlooked compared to studies centered on model architectures or algorithms. By emphasizing data quality, this research highlights its crucial role in overall ML performance. Additionally, the combinatorial analysis of different bug types offers a new perspective on how data issues can interact and compound. The study is highly relevant to industrial automated testing, with a particular emphasis on quality assurance for ML training data. Its findings can support the development of automated tools aimed at detecting and preventing data-related bugs within ML pipelines.
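As a purely illustrative sketch of the kind of automated check such tooling might perform (not part of the presented work; the bug categories, function name, and thresholds below are assumptions chosen for illustration), the following Python snippet flags a few common training-data bugs, such as missing values, duplicate rows, severe class imbalance, and constant feature columns, using pandas:

# Illustrative sketch only: a minimal pre-training audit of a tabular dataset.
# The bug categories and thresholds are assumptions, not the study's taxonomy.
import pandas as pd


def audit_training_data(df: pd.DataFrame, label_col: str,
                        imbalance_ratio: float = 10.0) -> list[str]:
    """Return human-readable warnings for common training-data bugs."""
    warnings = []

    # Bug type 1: missing values in features or labels.
    missing = df.isna().sum()
    for col, count in missing[missing > 0].items():
        warnings.append(f"{col}: {count} missing values")

    # Bug type 2: exact duplicate rows, which can bias training or leak into test splits.
    dup_count = int(df.duplicated().sum())
    if dup_count:
        warnings.append(f"{dup_count} duplicate rows")

    # Bug type 3: severe class imbalance in the label column.
    counts = df[label_col].value_counts()
    if len(counts) > 1 and counts.max() / counts.min() > imbalance_ratio:
        warnings.append(f"class imbalance: {counts.to_dict()}")

    # Bug type 4: constant (zero-variance) feature columns that carry no signal.
    for col in df.columns.drop(label_col):
        if df[col].nunique(dropna=True) <= 1:
            warnings.append(f"{col}: constant column")

    return warnings


if __name__ == "__main__":
    data = pd.DataFrame({
        "feature_a": [1.0, 2.0, None, 2.0],
        "feature_b": [0, 0, 0, 0],
        "label": ["cat", "dog", "cat", "dog"],
    })
    for w in audit_training_data(data, label_col="label"):
        print("WARNING:", w)

Checks like these could be combined to study how individual bug types interact when they occur together in the same dataset, which is the combinatorial perspective the abstract describes.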
Document Type: Presentation
Howpublished: Presented at the 11th User Conference on Advanced Automated Testing (UCAAT)
Month: April
Year: 2025