GitHub - RevathyVenukuttan/Unsupervised-Machine-Learning: Unsupervised machine learning techniques (clustering) on EEG data to identify epileptic and non-epileptic seizures

Introduction

Epilepsy is a serious chronic neurological disorder characterized by sudden and recurrent aberrant neural activity, clinically expressed as seizures. Seizures usually last from a few seconds to a few minutes and can lead to dangerous, possibly life-threatening situations. Epileptic seizure detection therefore plays an important role in improving the treatment outcomes and quality of life of epileptic patients. Electroencephalogram (EEG) signals capture the electrical activity of the brain and thus provide valuable insight into electrical disturbances of neural activity. As such, EEG signals are considered important data for diagnosing epilepsy and predicting epileptic seizures. However, the traditional visual inspection of lengthy EEG recordings by experts and analysts is a tedious, error-prone and time-consuming process. Furthermore, the random nature of epileptic EEG recordings makes it more difficult to distinguish seizure-free signals from seizure signals. Hence, there is a strong demand for an automated detection system for early diagnosis and treatment of epilepsy.

An automatic epileptic classification system normally requires large datasets to train the classifier to high accuracy. Furthermore, the target categories in the training set rely on labels obtained manually by experts, and thus may be flawed. These limitations impede the successful use of supervised classification techniques. Instead, the present project uses two unsupervised learning methods, hierarchical and k-means clustering, to discriminate intracranial seizure EEGs from intracranial healthy EEGs recorded in pre-surgical patients.

The hypothesis is that both clustering methods would optimally produce two significant clusters across five broad spectral sub-bands of the EEG signal which are generally of clinical interest: delta (0-4 Hz), theta (4-8 Hz), alpha (8-16 Hz), beta (16-32 Hz) and gamma waves (32-64 Hz). These five frequency sub-bands provide a more accurate depiction of the neural activities underlying epileptic seizures, amplifying important changes in the EEG signal that would otherwise go unnoticed in the original full-spectrum signal. A further hypothesis is therefore that a band-wise approach makes it possible to identify the frequencies at which the discrimination between healthy and epileptic EEGs is most successful.

Description of EEG Dataset

This project runs seizure discrimination experiments on the publicly available EEG dataset provided by Bonn University. The dataset includes five different sets from the EEG archive of pre-surgical diagnosis. Of the five, two sets of intracranial EEG recordings were selected: (1) recordings from within the epileptogenic zone during seizure (i.e. seizure EEGs) and (2) recordings from the hippocampal formation of the opposite hemisphere of the brain (i.e. healthy EEGs). Each set includes 99 single-channel EEG recordings of 23.6 seconds duration. All the EEG signals are sampled at 173.6 Hz and digitized using a 12-bit analog-to-digital converter. The EEG data in the Bonn dataset is free of artifacts: prior to publication, the captured EEG segments containing artifacts had been deleted, and those containing subtle artifacts had been denoised using a band-pass filter with cut-off frequencies of 0.53 Hz and 40 Hz.

EEG Data Filtering

Generally, in healthy individuals, brain waves may be classified as belonging to one of the five wave groups described earlier. Therefore, the Fast Fourier Transform (FFT) was applied to obtain the spectrum of the EEG signals. The FFT is an efficient way of evaluating the Discrete Fourier Transform (DFT), which was computed using the eegkit R package (v. 1.0-4).
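As a rough illustration of the band-wise idea, the sketch below computes a naive DFT and sums the squared magnitudes that fall inside each sub-band listed in the introduction. It is written in standard-library Python rather than R and is not the code in FourierTransformation.R. The helper names (`dft`, `band_powers`), the synthetic 10 Hz test signal and the 128 Hz sampling rate are assumptions made for the example, and the project may summarize the spectrum differently.

```python
import cmath
import math

# Sub-band edges in Hz, as listed in the introduction.
BANDS = {
    "delta": (0, 4),
    "theta": (4, 8),
    "alpha": (8, 16),
    "beta": (16, 32),
    "gamma": (32, 64),
}

def dft(signal):
    """Naive O(n^2) discrete Fourier transform; an FFT yields the same values faster."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def band_powers(signal, sampling_rate):
    """Sum the squared DFT magnitudes that fall inside each sub-band."""
    n = len(signal)
    spectrum = dft(signal)
    powers = {name: 0.0 for name in BANDS}
    # Bins above the Nyquist frequency mirror the lower ones for real signals.
    for k in range(n // 2 + 1):
        freq = k * sampling_rate / n
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                powers[name] += abs(spectrum[k]) ** 2
    return powers

# Synthetic example (hypothetical data): a 10 Hz sine sampled at 128 Hz for one second.
fs = 128
wave = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
powers = band_powers(wave, fs)
dominant = max(powers, key=powers.get)
print(dominant)
```

For the synthetic sine, nearly all of the power lands in the band that contains 10 Hz. With real recordings, the per-band values for each segment could serve as the features passed to the clustering steps.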

Hierarchical Clustering

Briefly, given a collection of N unlabeled observations in p dimensions, algorithms for hierarchical clustering estimate all K = 1,…, N partitions of the data through a sequential optimization procedure. The sequence of steps can be implemented as either an agglomerative (bottom-up; AGNES) or divisive (top-down; DIANA) approach to produce the nested hierarchy of clusters. Agglomerative clustering begins with each observation belonging to one of N disjoint singleton clusters. Then, at each step, the two most similar clusters are joined until, after (N − 1) steps, all observations belong to a single cluster of size N. Divisive clustering proceeds in a similar but reversed manner. In this project, the focus is on agglomerative approaches, which are more often used in practice.

Commonly, in agglomerative clustering, the pairwise similarity of observations is measured using a dissimilarity function, such as Euclidean distance. Then, a linkage function is used to extend this notion of dissimilarity to pairs of clusters. The clusters identified using hierarchical algorithms depend heavily on the choice of both the dissimilarity and linkage functions.

The sequence of clustering solutions obtained by hierarchical clustering is naturally visualized as a binary tree, commonly referred to as a dendrogram. The spectrum of clustering solutions can be recovered from the dendrogram by cutting the tree at an appropriate height and taking the resulting subtrees as the clustering solution. In a dendrogram, the most similar clusters and observations are connected near the bottom of the tree. The quality of clustering was examined internally using the average silhouette method. To assess how well the detected clusters agree with the true cluster labels, the mean adjusted Rand index (ARI) is reported for each of the five wave groups, with larger values corresponding to higher agreement.
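The agglomerative procedure and the ARI described above can be sketched in a few lines. The example below is standard-library Python, not the R code in Clustering.R. Euclidean distance with average linkage is one possible choice, assumed here because the text does not state which linkage the project used. The function names and the six toy points are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(points, k):
    """Merge the two closest clusters until k remain; returns one label per point."""
    clusters = [[i] for i in range(len(points))]  # start from N singletons
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def adjusted_rand_index(truth, predicted):
    """ARI computed from the contingency table of two labelings."""
    table, rows, cols = {}, {}, {}
    for t, p in zip(truth, predicted):
        table[(t, p)] = table.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    index = sum(math.comb(c, 2) for c in table.values())
    sum_rows = sum(math.comb(c, 2) for c in rows.values())
    sum_cols = sum(math.comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / math.comb(len(truth), 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (index - expected) / (maximum - expected)

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
truth = [0, 0, 0, 1, 1, 1]
labels = agglomerative(points, 2)
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

Stopping the merges at k clusters plays the role of cutting the dendrogram at the height that leaves k subtrees. This brute-force search recomputes every linkage at every step, so it only suits small inputs.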

K-means Clustering

K-means clustering uses an iterative algorithm that starts with three user-specified parameters: the number of clusters K, the cluster initialization and the distance metric. The algorithm iterates between two steps: (1) assigning each data point to the cluster with the closest centroid and (2) recalculating the positions of the K centroids. However, the algorithm may converge to a local minimum rather than the global minimum. To reduce this effect and thus improve the clustering accuracy, the algorithm is run multiple times from different initializations. The optimal number of clusters can be determined using a variety of methods, of which the most commonly used are the elbow method, the average silhouette method and the gap-statistic method. This project uses the average silhouette method to calculate the optimal number of clusters for each frequency band. The average silhouette method measures the quality of clustering by determining how well each data point fits within its cluster. A high average value indicates good clustering.
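The two-step iteration, the repeated runs and the silhouette-based choice of K can be sketched as follows. This is standard-library Python for illustration only; the project itself used R packages. The restart count, the random seed, the rule of keeping the run with the lowest within-cluster sum of squares, and the toy points are all assumptions of the example.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random initialization; returns (labels, wss)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    wss = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wss

def kmeans(points, k, restarts=10, seed=0):
    """Run several initializations and keep the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])[0]

def average_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; assumes distinct points."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)                            # cohesion within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels = kmeans(points, 2)
best_k = max(range(2, 5), key=lambda k: average_silhouette(points, kmeans(points, k)))
print(labels, best_k)
```

On the toy data the average silhouette is highest at K = 2, which mirrors how the number of clusters would be chosen separately for each frequency band.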

Both hierarchical and k-means clustering analyses were performed in R using the factoextra (v. 1.0.5) and dendextend (v. 1.9.0) packages.

Repository files: Clustering.R, DataPreprocessing.R, FourierTransformation.R, README.md
