A K-Means, Ward, and DBSCAN Repeatability Study
DOI: https://doi.org/10.37256/ccds.7220269455
Keywords: repeatability, reproducibility, clustering methods, K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ward method
Abstract
Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms, repeatability with bitwise identical results is also key to scientific integrity because it allows debugging. We decompose several popular clustering algorithms (K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Ward) into their fundamental steps and identify the conditions required to achieve repeatability at each stage. Using an example implementation with the Python library scikit-learn, we examine which aspects of each method are repeatable. Our results reveal non-repeatable behavior with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
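As an illustration of the kind of check the abstract describes, the following is a minimal sketch of a bitwise repeatability test for scikit-learn's K-Means under different OpenMP thread counts. It is not the authors' experimental protocol: the dataset (`make_blobs`), its size, the number of clusters, the seed, and the helper names `fit_centers` and `is_bitwise_repeatable` are all assumptions chosen for illustration. The thread cap uses `threadpoolctl`, which scikit-learn depends on. Whether any non-repeatability appears may depend on the hardware, the scikit-learn version, and the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits


def fit_centers(X, n_threads, seed=0):
    """Fit K-Means with a fixed seed while capping OpenMP threads (hypothetical helper)."""
    with threadpool_limits(limits=n_threads, user_api="openmp"):
        model = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    return model.cluster_centers_


def is_bitwise_repeatable(X, n_threads, n_runs=5):
    """Return True if all runs give byte-identical cluster centers (hypothetical helper)."""
    reference = fit_centers(X, n_threads).tobytes()
    return all(
        fit_centers(X, n_threads).tobytes() == reference
        for _ in range(n_runs - 1)
    )


# Synthetic data; sizes are illustrative assumptions, not values from the paper.
X, _ = make_blobs(n_samples=5000, n_features=8, centers=5, random_state=0)

for n_threads in (1, 2, 4):
    print(n_threads, "threads -> repeatable:", is_bitwise_repeatable(X, n_threads))
```

Comparing the raw bytes of the centers, rather than using a tolerance such as `np.allclose`, is what makes this a test of bitwise repeatability and not only of numerical agreement.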
License
Copyright (c) 2026 Anthony Bertrand, Engelbert Mephu Nguifo, Violaine Antoine, David R.C. Hill

This work is licensed under a Creative Commons Attribution 4.0 International License.
