Chest Disease Image Classification Based on Spectral Clustering Algorithm

Abstract: Nowadays, the emergence of new technologies gives rise to a huge amount of data in different fields such as public transportation, community services, and scientific research. With an aging population, healthcare is becoming more important in daily life, and reducing its public burden is a growing concern. For example, manually archiving massive numbers of electronic medical files, such as X-ray images, is impractical. However, precise classification is essential for further work, such as diagnosis. In this report, we applied a spectral clustering algorithm to classify chest disease X-ray images. We also employed the “pure” K-means algorithm for comparison. Three types of indexes are used to quantify the performance of both algorithms. Our analysis shows that spectral clustering can successfully classify chest X-ray images based on the presence of disease spots on the lungs and that its performance is superior to “pure” K-means clustering.


Introduction
Nowadays, chest diseases have become a significant threat to human health [1]. These diseases manifest in various types, each with its own set of symptoms and severity levels. To accurately diagnose chest diseases, medical professionals rely heavily on diagnostic imaging techniques, with X-ray being the most commonly used due to its speed and affordability [2]. X-ray has become a primary tool for screening and identifying various chest ailments such as pneumonia, pneumothorax, and masses [3].
Among these diseases, pneumonia affects about half a billion people and causes about 4 million deaths per year [4]. Lung cancer is another example: it is the malignant tumor with the highest incidence and mortality in the world, and the 5-year survival rate of patients is around 16% [5]. Early diagnosis and treatment can significantly reduce mortality resulting from chest diseases [6]. However, interpreting chest X-rays requires considerable professional knowledge, and X-ray images are typically read by radiologists [7]. Misdiagnosis still occurs because of the diverse pathological features of chest diseases and the potential fatigue or inexperience of radiologists [8]. Therefore, there is a crucial need to develop chest X-ray-assisted diagnostic algorithms that can aid radiologists in providing timely and effective treatment to patients with chest diseases [9].
Recently, artificial intelligence techniques have attracted attention worldwide because they can be applied in biomedical fields, including skin cancer diagnosis, standard plane detection and localization in fetal ultrasound, and lung nodule detection [10]. This technology has made some progress in chest disease diagnosis [11]. However, artificial intelligence-based diagnosis still has obvious drawbacks. For example, recognition is coarse-grained and relies on low-level features, while, compared with real images, the disease spots in chest X-rays occupy a small proportion of the image and their location relative to the remaining normal area varies [12]. As a consequence, it is hard to make connections between features, and finding the subtle, representative traits that fully characterize the object is not straightforward, so the fine-grained classification of chest diseases remains very challenging [13,14].
On the other hand, the imbalance between normal and disease X-ray images makes it challenging to design accurate diagnostic algorithms. This can lead to misdiagnosis, as the gradient direction tends to favor normal X-ray images [15,16]. Current algorithms can only detect whether an image shows disease, and have difficulty identifying the specific type of disease and the differences between types [17]. To address these problems, binary cross-entropy is used in this study to balance the number of disease and normal X-ray images and compensate for the loss function [18].
Among graph mining techniques for image classification, clustering algorithms are especially important, since they can categorize objects into different groups according to their similarities. Traditional clustering algorithms, such as K-means, are limited by their spherical assumptions, which can lead to local optima. Spectral clustering has developed rapidly in recent years to compensate for these limitations by leveraging the eigenvalues of the similarity matrix [19,20,21].
The spectral clustering algorithm was initiated by Fiedler. Unlike traditional clustering algorithms, spectral clustering does not rely on shape assumptions and can find the global optimum [22]. It has been widely used in very large-scale integration (VLSI) design, load balancing, parallel computing, and sparse matrix partitioning. Spectral clustering is advantageous in both theoretical and practical applications, and it can be applied to any distribution pattern without considering dependency assumptions [23,24]. In the biomedical field, spectral clustering can be used for image segmentation, including classifying normal and abnormal medical images such as pneumonia X-ray images. If successful, the algorithm could be applied to other chest diseases and other medical imaging modalities like magnetic resonance imaging (MRI) and ultrasound images.
Previous articles have shown that researchers are focusing on image classification. We plan to use the spectral clustering algorithm to classify normal and abnormal medical images, starting with pneumonia X-ray images and expanding to other chest diseases such as infiltration, effusion, mass, and nodule [25,26]. The algorithm will also be applied to other medical imaging modalities like MRI and ultrasound images. These automatic classification models can be used as a pre-screening step to save time and facilitate doctors in making diagnoses.
In this paper, we will present an overview of the relevant research related to our topic of interest, followed by a detailed description of our data and the methodology we will use. Then, we will discuss the data extraction and preprocessing steps, the application of various models supported by theoretical underpinnings, and a comparative analysis of the performance of these models. In the final section, we will provide a comprehensive evaluation of the results obtained and highlight potential areas for future research.

Related research
Image classification is an important task in the field of medical imaging, particularly for X-ray images. Over the years, various techniques and methods have been developed to classify X-ray images accurately. One of the most popular approaches for X-ray image classification is deep learning. Among those deep learning models, numerous studies have investigated the performance of convolutional neural networks (CNNs). For example, Rajpurkar et al. proposed a deep learning algorithm that achieved state-of-the-art performance in classifying 14 different thoracic diseases using X-ray images [27].
Transfer learning has also been investigated for X-ray image classification. It involves utilizing pre-trained models on large data sets to extract features. For instance, Zhang et al. used a transfer learning approach by utilizing a pretrained VGG-16 model for classifying chest X-ray images into normal and abnormal categories. Their study showed that the approach achieved high accuracy rates of over 90% [28].
In addition to utilizing the aforementioned deep learning techniques, traditional machine learning methods have also been studied for X-ray image classification. For instance, Hamed et al. developed a hybrid approach that combined histogram equalization with machine learning algorithms, such as K-Nearest Neighbor (KNN), Decision Tree, Support Vector Machine (SVM), etc. [29].
Overall, there have been significant developments in X-ray image classification using various techniques, ranging from deep learning to traditional machine learning methods. These studies have demonstrated high accuracy rates, which can be beneficial for early diagnosis and treatment of various diseases. However, there is still scope for improvement in terms of the different methods and their accuracy, as well as expanding the range of diseases that can be classified using X-ray images.

Dataset
The dataset in our study includes chest X-ray images that have been categorized as normal or abnormal, with the abnormal category being caused by pneumonia (as shown in Figure 1). To ensure the accuracy of the analysis of chest X-ray images, a quality control process was implemented in which all radiographs were initially screened to remove any scans that were deemed unreadable or of low quality. The images were evaluated and diagnosed by expert physicians before being cleared for use in the artificial intelligence (AI) system [30]. The dataset is quite large, with a total size of 1.15 GB and thousands of images. To improve efficiency and save time, a subset of 200 normal and 200 pneumonia images was randomly selected for further processing. The data can be downloaded from Kaggle, https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia [31].
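The random selection of 200 images per class can be sketched as follows. This is a minimal illustration rather than the script used in the study: the function name `sample_images`, the accepted file extensions, and the fixed seed are all assumptions made for the example.

```python
import random
from pathlib import Path


def sample_images(folder, n, seed=0):
    """Return n image paths drawn at random (without replacement) from folder."""
    # Assumption: images are stored as .jpeg/.jpg/.png files directly in `folder`.
    paths = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in {".jpeg", ".jpg", ".png"}
    )
    rng = random.Random(seed)  # fixed seed so the same subset can be drawn again
    return rng.sample(paths, n)
```

Calling the function once per class folder with `n=200` would give the two subsets; the folder paths depend on how the downloaded archive is unpacked.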

Spectral clustering classification
The overall spectral clustering classification procedure is depicted in the flowchart in Figure 2. First, all randomly selected images are imported and then renamed, converted to grayscale, and resized, as spectral clustering has specific requirements for the name, grayscale, and size of the images. These three steps are necessary for preparing the images for spectral clustering. Next, after spectral clustering classification is applied, the classified images are stored in two different folders. In spectral clustering, images are converted into nodes in a high-dimensional space. The distance between nodes determines their edge weights.
Once the nodes are created, a distance matrix is computed and used to generate the diagonal degree matrix. This degree matrix is then used to compute the Laplacian matrix, which is a key step in the spectral clustering algorithm. The Laplacian matrix helps to identify clusters of images based on their similarities and differences, and ultimately leads to the accurate classification of the images.
To build the distance matrix, the images are represented as a graph $G(V, E)$, where $V = (v_1, \ldots, v_n)$ is the set of all nodes and $E$ is the set of edges. $w_{ij}$ is the weight between $v_i$ and $v_j$, with $w_{ij} = w_{ji}$. If an edge exists between two nodes, $w_{ij} > 0$; otherwise, $w_{ij} = 0$. The degree $d_i$ of a node $v_i$ [Equation (1)] is the sum of the weights of all its adjacent edges, and is used to generate the Laplacian matrix for spectral clustering [32].
$$d_i = \sum_{j=1}^{n} w_{ij} \tag{1}$$
The degrees of all nodes are collected in an $n \times n$ degree matrix $D$, which is a diagonal matrix [Equation (2)] [33].
$$D = \mathrm{diag}(d_1, d_2, \ldots, d_n) \tag{2}$$
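As a small worked example of Equations (1) and (2), the degrees and the degree matrix can be computed from a weight matrix with NumPy. The 4-node weights below are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical symmetric weights for a 4-node graph (w_ij = w_ji, zero diagonal).
W = np.array([
    [0.0, 0.8, 0.1, 0.0],
    [0.8, 0.0, 0.5, 0.0],
    [0.1, 0.5, 0.0, 0.3],
    [0.0, 0.0, 0.3, 0.0],
])

d = W.sum(axis=1)  # degree of each node: the sum of its edge weights
D = np.diag(d)     # diagonal degree matrix with the degrees on the diagonal
```

For these weights the degrees are 0.9, 1.3, 0.9, and 0.3, and every off-diagonal entry of `D` is zero.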
The adjacency matrix is denoted $W$ [Equation (3)]. It is also an $n \times n$ matrix, whose entry in the $i$th row and $j$th column is $w_{ij}$. Usually, the weights among all nodes are used to construct this $n \times n$ similarity matrix [34].
$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{pmatrix} \tag{3}$$
There are three common methods to construct a similarity matrix: epsilon nearest neighbor, KNN, and fully connected graph.
In the epsilon nearest neighbor method, a distance threshold epsilon is set, and the Euclidean distance $s_{ij}$ [Equation (4)] is used to evaluate the distance between any two nodes $x_i$ and $x_j$ [35].
In the KNN method, all nodes are traversed to find their k nearest neighbors, and the edge weights $w_{ij}$ for these neighbors are computed. In a fully connected graph, all pairs of nodes are connected to each other with weights greater than zero.
Before clustering, the Laplacian matrix is computed as $L = D - W$, where $D$ is the diagonal degree matrix and $W$ is the adjacency matrix described earlier. The Laplacian matrix gives spectral clustering advantages over traditional clustering on data with complex shapes and varying densities in high-dimensional spaces for the following reasons: (a) It is a symmetric matrix, which follows directly because both $D$ and $W$ are symmetric matrices.
To prevent poor clustering results, it is necessary to limit the size of all sub-graphs. The most common way to do this is the normalized cut.
Based on the basic knowledge of the clustering algorithm, clustering can be used on various kinds of sample data and converges to the global optimum. If the input $D = (x_1, \ldots, x_n)$ contains $k_1$ dimensions after dimension reduction and $k_2$ dimensions after clustering, the output is a set of $k$ clusters $C(c_1, \ldots, c_k)$.
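The three graph constructions and the Laplacian $L = D - W$ can be sketched in NumPy as below. The weighting choices are assumptions made for this illustration and are not taken from the study: the epsilon graph uses a weight of 1 for every connected pair, the KNN graph is symmetrized by keeping an edge if either endpoint selects it, and the fully connected graph uses a Gaussian kernel with a hypothetical bandwidth `sigma`.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def epsilon_graph(X, eps):
    """Connect two nodes with weight 1 when their distance is at most eps."""
    W = (pairwise_distances(X) <= eps).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def knn_graph(X, k):
    """Connect each node to its k nearest neighbours, then symmetrize."""
    S = pairwise_distances(X)
    np.fill_diagonal(S, np.inf)  # a node is not its own neighbour
    idx = np.argsort(S, axis=1)[:, :k]
    W = np.zeros_like(S)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)  # keep an edge if either end chose it


def full_graph(X, sigma=1.0):
    """Fully connected graph with Gaussian (RBF) weights."""
    S = pairwise_distances(X)
    W = np.exp(-S ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W


def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W
```

Whichever construction is used, the resulting Laplacian is symmetric and each of its rows sums to zero, since the degree on the diagonal is the sum of the weights subtracted in that row.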

"Pure" K-means clustering classification
To provide a comparison, "pure" K-means is used to classify the images. The procedure is depicted in the flowchart in Figure 3. It is similar to the spectral clustering procedure, except for the last two steps. Unlike spectral clustering, the "pure" K-means algorithm does not require computing the distance matrix and Laplacian matrix, and it can be imported from sklearn with the number of clusters set to 2. The last step involves attaching the classified labels to the corresponding images, rather than creating two folders to store the classified images separately.
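A minimal sketch of such a K-means baseline with scikit-learn is given below. The helper name `kmeans_labels`, the choice to flatten each image into a pixel vector, and the `n_init` and `seed` values are assumptions for illustration; the study's exact settings are not stated.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_labels(images, n_clusters=2, seed=0):
    """Flatten each image to a vector and cluster the vectors with K-means."""
    # Assumption: all images share the same shape (after resizing).
    X = np.asarray(images, dtype=float).reshape(len(images), -1)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(X)  # one cluster label per image
```

The returned array holds one label per image, which corresponds to the label-attaching step described above.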

Measurement
Since clustering is an unsupervised learning method, common supervised learning measurements based on the confusion matrix, such as precision, recall, and accuracy, cannot be used in this project. To quantify the performance of spectral clustering and "pure" K-means clustering, three indexes are applied: the silhouette coefficient, the Davies-Bouldin index, and the Calinski-Harabasz index.

Silhouette coefficient
The silhouette coefficient, as shown in Equation (5), is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) of each sample. The coefficient ranges from -1 to 1, with a higher coefficient indicating better cluster classification. A coefficient near 0 indicates that the clusters overlap or are close to each other, and the classification is poor. A negative coefficient indicates that some data are assigned to the wrong clusters [37].
$$s = \frac{b - a}{\max(a, b)} \tag{5}$$
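To make the roles of a and b concrete, the coefficient of one point can be computed by hand and compared with scikit-learn's `silhouette_samples` and `silhouette_score`. The four 1-D points below are hypothetical toy data, not values from the study.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical 1-D toy data: two tight, well-separated groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Point 0 by hand: a is the distance to its only cluster mate,
# b is the mean distance to the points of the other cluster.
a = 1.0
b = (10.0 + 11.0) / 2
s0 = (b - a) / max(a, b)

per_sample = silhouette_samples(X, labels)  # one coefficient per point
overall = silhouette_score(X, labels)       # mean of the per-point coefficients
```

Here `s0` agrees with `per_sample[0]`, and the overall score is close to 1 because the two groups are far apart relative to their spread.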

Davies-Bouldin index
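The Davies-Bouldin index, as shown in Equation (6), is used to measure the average similarity of each cluster with its most similar cluster. The index ranges from 0 to infinity, with a lower value indicating better cluster classification. A value of 0 indicates perfect clustering, while a higher value indicates worse clustering [38].
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}} \tag{6}$$
Here $k$ is the number of clusters, $s_i$ is the average distance between the points of cluster $i$ and its centroid, and $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$.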
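The index can be checked by hand on a toy example and compared with scikit-learn's `davies_bouldin_score`. The sketch assumes the usual definition, in which each cluster's scatter is the mean distance of its points to the cluster centroid and the separation is the distance between centroids; the data are hypothetical.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Hypothetical 1-D toy data: two tight, well-separated groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# Scatter of each cluster: mean distance of its points to the centroid.
scatter = np.array([
    np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean()
    for c in (0, 1)
])
separation = np.linalg.norm(centroids[0] - centroids[1])

# With two clusters the maximum has a single candidate, so the index
# reduces to one similarity ratio.
db_manual = (scatter[0] + scatter[1]) / separation
db_sklearn = davies_bouldin_score(X, labels)
```

For these points the scatter is 0.5 per cluster and the centroids are 10 apart, so both values equal 0.1.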

Calinski-Harabasz index
The Calinski-Harabasz index, as shown in Equation (7), is also known as the variance ratio criterion. It measures the ratio of the between-cluster dispersion to the within-cluster dispersion. A higher index value indicates better cluster separation and classification, while a lower value indicates less distinct clusters [39].
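A hand computation on toy data illustrates the ratio and can be compared with scikit-learn's `calinski_harabasz_score`. The sketch assumes the standard form of the criterion, in which the between-cluster dispersion is divided by k - 1 and the within-cluster dispersion by n - k; the data are hypothetical.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Hypothetical 1-D toy data: two tight, well-separated groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

n, k = len(X), 2
overall_mean = X.mean(axis=0)

between = 0.0  # size-weighted squared distances of centroids to the overall mean
within = 0.0   # squared distances of points to their own centroid
for c in range(k):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

ch_manual = (between / (k - 1)) / (within / (n - k))
ch_sklearn = calinski_harabasz_score(X, labels)
```

Here the between-cluster dispersion is 100 and the within-cluster dispersion is 1, giving an index of 200 from both computations.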

Renaming
To begin the spectral clustering process, 200 normal images and 200 pneumonia images were randomly selected and imported. The images were not named sequentially, so a renaming process was carried out. For instance, the original names of the first 3 images were "normal 1," "normal 10," and "normal 20." After renaming, they became "NORMAL-1," "NORMAL-2," and "NORMAL-3," respectively. Figure 4 shows the first 20 renamed normal images.
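The renaming step can be sketched with `pathlib`. The helper name `rename_sequentially` and the temporary `__tmp_` prefix are assumptions for this example; files are processed in lexicographic order, which reproduces the ordering of the example names above.

```python
from pathlib import Path


def rename_sequentially(folder, prefix):
    """Rename every file in folder to '<prefix>-<i><suffix>' with i = 1, 2, 3, ..."""
    folder = Path(folder)
    files = sorted(p for p in folder.iterdir() if p.is_file())
    # Two passes, so a new name can never overwrite a file not yet renamed.
    # Assumption: no existing file already uses the '__tmp_' prefix.
    staged = [
        p.rename(p.with_name(f"__tmp_{i}{p.suffix}"))
        for i, p in enumerate(files, 1)
    ]
    return [
        p.rename(folder / f"{prefix}-{i}{p.suffix}")
        for i, p in enumerate(staged, 1)
    ]
```

Because `Path.rename` returns the new path (Python 3.8 and later), the function returns the renamed files in their new order.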

Grayscale conversion
After renaming, the images were converted to grayscale using the Open Source Computer Vision Library (OpenCV), a powerful image processing tool [40]. The cv2.cvtColor function from OpenCV was applied with the cv2.COLOR_BGR2GRAY conversion code to convert the images to grayscale. Figure 5 displays the first 20 normal images after grayscale conversion.

Resizing
After grayscale conversion, the images needed to be resized to a smaller size than their original size. This was accomplished using the Python Imaging Library (PIL), another powerful tool for image processing [41]. Figure 6 displays the first 20 resized normal images. The images are presented as large icons in their original renamed, grayscale-converted, and resized forms. The resized images are significantly smaller in size compared to the remaining images.
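The resizing step can be sketched with PIL's `Image.resize`. The target size of 64 x 64 pixels is a hypothetical value, since the size used in the study is not stated, and the blank demo image stands in for a grayscale X-ray.

```python
from PIL import Image


def resize_image(img, size=(64, 64)):
    """Return a copy of a PIL image resized to size = (width, height)."""
    # Assumption: a fixed target size shared by all images.
    return img.resize(size)


# Hypothetical stand-in for a grayscale X-ray (mode "L" = 8-bit gray).
demo = Image.new("L", (1024, 768), color=128)
small = resize_image(demo)
```

`resize` returns a new image and leaves the original unchanged, so the resized copies can be saved alongside or instead of the originals.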

Classification
The spectral clustering algorithm was then applied to the images. First, the images were loaded and Principal Component Analysis (PCA) was performed. Next, the distance matrix and Laplacian matrix were computed. After the eigenvectors were obtained, a feature matrix was created from the first k eigenvectors by stacking them as columns, and K-means clustering was applied to it. Here, k was set to 2. The classified images were saved into two different folders. Figure 7 shows the images classified by spectral clustering, with a sample of the first 30 images from each of the two folders.
Figures 8 and 9 display the results of spectral clustering and "pure" K-means clustering on the 400 images, which were expected to be divided into two separate folders of 200 normal images and 200 pneumonia images. In practice, after applying spectral clustering, 214 images are in one folder and 186 images are in the other. The 214-image folder contains 198 normal images and 16 pneumonia images, and the 186-image folder contains 2 normal images and 184 pneumonia images. In other words, 382 of the 400 images (95.5%) belong to the majority class of their folder, which indicates that the classification was mostly successful. However, the results of "pure" K-means clustering are very different. One folder contains 59 images and the other contains 341 images, a severe imbalance. The folder with 59 images has 22 normal images and 37 pneumonia images, while the folder with 341 images has 178 normal images and 163 pneumonia images. Neither folder is clearly dominated by one category, indicating an unsuccessful classification.
Table 1 shows the performance results of spectral clustering and "pure" K-means clustering. For the silhouette coefficient, spectral clustering scores 0.443, which is considerably higher than the "pure" K-means value of 0. A silhouette coefficient of 0 indicates overlapping clusters [42]. While 0.443 is not a "perfect" value, it is an improvement over the alternative. Spectral clustering also outperformed "pure" K-means on the Davies-Bouldin index, achieving a score of 1 compared with 8 for "pure" K-means; scores closer to zero indicate better partitioning [43]. Additionally, the Calinski-Harabasz index was higher for spectral clustering, at 311 compared with 2.8 for "pure" K-means, again showing that spectral clustering outperformed "pure" K-means clustering [44].
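The classification steps described in this section (PCA, distance matrix, Laplacian matrix, eigenvectors, K-means) can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function name `spectral_cluster`, the number of PCA components, the Gaussian kernel that turns distances into similarities, and the median-distance bandwidth are all choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def spectral_cluster(X, k=2, n_components=10, sigma=None, seed=0):
    """Cluster the rows of X with a basic unnormalized spectral clustering."""
    # Step 1: reduce the flattened images with PCA.
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    # Step 2: pairwise Euclidean distances between the reduced samples.
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Assumption: a Gaussian kernel converts distances into edge weights,
    # with the median nonzero distance as the default bandwidth.
    if sigma is None:
        sigma = np.median(dist[dist > 0])
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 3: Laplacian L = D - W, with D the diagonal degree matrix.
    L = np.diag(W.sum(axis=1)) - W

    # Step 4: eigenvectors of the symmetric Laplacian; eigh returns the
    # eigenvalues in ascending order, so the first k columns belong to
    # the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]  # first k eigenvectors stacked as columns

    # Step 5: K-means on the rows of the eigenvector matrix.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```

Each row of `X` is expected to be one flattened grayscale image, and the returned labels can then be used to copy the images into two folders. Building the full distance matrix takes memory quadratic in the number of images, which is manageable for 400 images but becomes costly for larger datasets.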

Conclusion
Based on the above results and analysis, it can be concluded that: (1) spectral clustering is able to successfully classify chest X-ray images based on the presence of disease spots on the lungs and (2) the performance of spectral clustering is superior to "pure" K-means clustering as shown by the three unsupervised indexes used in the evaluation.

In the Future
Despite achieving satisfactory results, there are still some limitations and areas for improvement in this study. Firstly, the computational efficiency of spectral clustering can be optimized, especially for larger datasets. This can be achieved by implementing parallel computing techniques or using more efficient algorithms for computing the distance and Laplacian matrices.
Secondly, to make the classification more meaningful and realistic, it is necessary to include more types of chest diseases in the dataset and to increase the number of clusters accordingly. This will provide a more comprehensive analysis of the chest X-ray images and improve the accuracy of the classification results.
Lastly, the "pure" K-means clustering method used for comparison performed poorly in this study, and further improvements are needed to make it more effective. This can include using more advanced initialization methods, such as K-means++, or applying feature engineering techniques to improve the quality of the input data. Overall, these improvements can enhance the performance and applicability of the spectral clustering algorithm in the field of medical image analysis.