A K-Means, Ward, and DBSCAN Repeatability Study

Authors

  • Anthony Bertrand, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
  • Engelbert Mephu Nguifo, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
  • Violaine Antoine, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
  • David R.C. Hill, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France

DOI:

https://doi.org/10.37256/ccds.7220269455

Keywords:

repeatability, reproducibility, clustering methods, K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ward method

Abstract

Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms, repeatability with bitwise identical results is also key to scientific integrity because it allows debugging. We decompose three very popular clustering algorithms, K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Ward, into their fundamental steps and identify the conditions required to achieve repeatability at each stage. We use an example implementation with the Python library scikit-learn to examine the repeatability of each method. Our results reveal non-repeatable behavior with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
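
As a rough illustration of what a bitwise repeatability check can look like, the sketch below fits scikit-learn's KMeans several times with the same seed and compares the raw bytes of the resulting centers and labels. It is not the authors' experimental protocol: the function name kmeans_is_repeatable, the synthetic data, its size, and the number of runs are all assumptions made for this example. It also relies on the threadpoolctl package, which caps every native thread pool (OpenMP and BLAS alike) rather than OpenMP alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits


def kmeans_is_repeatable(n_threads, n_runs=5, seed=0):
    """Return True if n_runs seeded K-Means fits give byte-identical outputs.

    Hypothetical helper for illustration; data and sizes are arbitrary.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2000, 8))  # synthetic data, same for every run

    fingerprints = set()
    for _ in range(n_runs):
        # Cap all native thread pools (OpenMP and BLAS) at n_threads.
        with threadpool_limits(limits=n_threads):
            model = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
        # Compare raw bytes rather than np.allclose: the check is bitwise.
        fingerprints.add(
            model.cluster_centers_.tobytes() + model.labels_.tobytes()
        )
    return len(fingerprints) == 1


if __name__ == "__main__":
    for n_threads in (1, 2, 4):
        print(n_threads, kmeans_is_repeatable(n_threads))
```

Whether the multi-threaded runs differ will depend on the machine, the scikit-learn version, and the data, so the sketch reports what it observes rather than assuming a particular outcome.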

Published

2026-04-02

How to Cite

1. Bertrand A, Nguifo EM, Antoine V, Hill DR. A K-Means, Ward, and DBSCAN Repeatability Study. Cloud Computing and Data Science [Internet]. 2026 Apr. 2 [cited 2026 Apr. 9];7(2):215-45. Available from: https://ojs.wiserpub.com/index.php/CCDS/article/view/9455