A K-Means, Ward, and DBSCAN Repeatability Study

Anthony Bertrand; Engelbert Mephu Nguifo; Violaine Antoine; David R.C. Hill

doi:10.37256/ccds.7220269455

Authors

Anthony Bertrand, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
Engelbert Mephu Nguifo, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
Violaine Antoine, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France
David R.C. Hill, University Clermont Auvergne, Clermont Auvergne INP, ENSM St Etienne, CNRS, LIMOS, Clermont-Ferrand, 63000, France

DOI:

https://doi.org/10.37256/ccds.7220269455

Keywords:

repeatability, reproducibility, clustering methods, K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ward method

Abstract

Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms, repeatability with bitwise identical results is also key to scientific integrity because it allows debugging. We decompose several very popular clustering algorithms (K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Ward) into their fundamental steps and identify the conditions required to achieve repeatability at each stage. We use an implementation example with the Python library scikit-learn to examine the repeatable aspects of each method. Our results reveal non-repeatable behavior with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
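
The abstract does not name the mechanism behind the K-Means result, so the following is only a hedged sketch of what "bitwise identical" means in practice and of one well-known way parallel code can lose it: floating-point addition is not associative, so a sum split into partial sums can differ in its last bits from the same sum taken in one pass. Everything here is hypothetical illustration, not the authors' method and not scikit-learn code: the helper names (`bit_pattern`, `bitwise_identical`, `ordered_sum`, `grouped_sum`) and the toy data are assumptions.

```python
import struct


def bit_pattern(values):
    """Return the raw IEEE-754 double-precision bytes of a sequence of floats."""
    return b"".join(struct.pack("<d", v) for v in values)


def bitwise_identical(a, b):
    """True only if both float sequences match bit for bit."""
    return bit_pattern(a) == bit_pattern(b)


def ordered_sum(values):
    """Plain left-to-right floating-point sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def grouped_sum(groups):
    """Sum each group separately, then combine the partial sums.

    Assumption: each group stands in for the share of work handled by
    one worker in a parallel reduction.
    """
    return ordered_sum([ordered_sum(g) for g in groups])


# Toy coordinate values of the points assigned to one cluster (made up).
# A centroid update would sum them and divide by the cluster size.
one_worker = grouped_sum([[0.1, 0.2, 0.3]])
one_worker_again = grouped_sum([[0.1, 0.2, 0.3]])
two_workers = grouped_sum([[0.1], [0.2, 0.3]])

# Same grouping, same order: bit-for-bit identical.
print(bitwise_identical([one_worker], [one_worker_again]))  # True

# Different grouping: mathematically equal, but the bits differ.
print(bitwise_identical([one_worker], [two_workers]))  # False
print(abs(one_worker - two_workers) < 1e-15)  # True
```

Comparing raw bytes rather than using a tolerance is deliberate: a tolerance-based check answers whether two runs agree numerically, whereas repeatability in the sense used above requires that they agree exactly.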
