Predicting Parkinson’s Disease using R Programming

A beginner’s guide to predictive analytics on the Parkinson’s Disease dataset in R

This is my solo capstone project, the final part of the course leading to the ‘Data Science: Capstone’ certificate offered by Harvard University (HarvardX) through the edX platform.

I was fairly new to Data Science, and especially to R programming, so this course was truly challenging even though I had a rough idea of what Data Science is and what it takes to build a predictive model. Having completed the project after putting a lot of time and effort into it, I want to share some key ideas that guided me along the way. Let us start with Parkinson’s Disease itself and then move on to the Data Science part of the project…

What is Parkinson’s Disease?

Parkinson’s Disease: Symptoms; Illustration

Now that we know the problem this project deals with, let us look at the dataset I used.

Dataset

In this project, I have used the Parkinsons Data Set from the UCI Machine Learning Repository, also known as the Oxford Parkinson’s Disease Detection Dataset.

A portion of the Parkinson’s Disease Dataset is shown below:

Data Analysis: Some key points…

Before diving into Parkinson’s Disease prediction, let us look at the data in detail in order to understand the important features present therein. This can be done by keeping the following points in mind:

  • Check for null values in the data
  • Check for redundancy in the data
  • Remove unimportant attributes (ID, transaction number, etc.)
  • Understand the datatype of values of all attributes
  • Separate the input data from the target data
  • Compute the correlations between the input features
  • Dimensionality reduction and Feature selection of important attributes/components
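
The checklist above can be sketched in base R. This is a minimal illustration, not the project's actual code: the data frame `pd` is a synthetic stand-in for the real dataset, its column names (`name`, `f1`, `f2`, `f3`, `status`) are hypothetical, and the 0.9 correlation cutoff is an arbitrary assumption.

```r
# Synthetic stand-in for the Parkinson's data (hypothetical columns, not real measurements)
set.seed(1)
n  <- 20
f1 <- rnorm(n)
pd <- data.frame(
  name   = paste0("rec_", seq_len(n)),    # identifier column, carries no signal
  f1     = f1,
  f2     = 2 * f1 + rnorm(n, sd = 0.01),  # almost a scaled copy of f1
  f3     = rnorm(n),
  status = rep(c(0, 1), times = c(5, 15))
)

# 1. Null values per column
na_counts <- colSums(is.na(pd))

# 2. Redundant (duplicated) rows
n_duplicates <- sum(duplicated(pd))

# 3. Remove the identifier column
pd$name <- NULL

# 4. Datatype of every attribute
col_types <- sapply(pd, class)

# 5. Separate the inputs from the target
inputs <- pd[, setdiff(names(pd), "status")]
target <- factor(pd$status, levels = c(0, 1))

# 6. Correlation between the input features
cor_matrix <- cor(inputs)

# 7a. Feature selection: flag feature pairs with |r| above the assumed cutoff
cutoff <- 0.9
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
correlated_pairs <- data.frame(
  feature_a = rownames(cor_matrix)[high[, "row"]],
  feature_b = colnames(cor_matrix)[high[, "col"]]
)
print(correlated_pairs)

# 7b. Dimensionality reduction: PCA on standardized inputs
pca <- prcomp(inputs, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

One feature from each flagged pair is a candidate for removal, and `var_explained` shows how much of the variance each principal component carries. Packages such as corrplot and factoextra can visualize the same results, but base R is enough to see the idea.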

R programming aspect of the project…

Here are some online resources to learn R for Data Science provided by Analytics India Magazine…

Now let us take a look at the packages I have used in R Programming in order to complete the project. These packages are:

  • dplyr: grammar for data manipulation
  • corrplot: graphical display of a correlation matrix
  • mlbench: Machine Learning Benchmark Problems; a collection of benchmark datasets
  • caret: Classification And REgression Training; streamline model training
  • randomForest: Breiman and Cutler’s Random Forests for Classification and Regression
  • factoextra: Extract and Visualize results of Multivariate Data Analyses
  • FactoMineR: Exploratory Data Analysis Methods to summarize, visualize and describe datasets
  • CORElearn: Classification, Regression and Feature Evaluation; R port of data mining system
  • rmarkdown: Convert R Markdown documents into a variety of formats
  • knitr: Dynamic report generation with R

In this dataset, there are 48 records from healthy people and 147 records from patients with Parkinson’s Disease (each row is one voice recording). Because the two labels are far from evenly represented, this is a class imbalance problem. After applying techniques for handling class imbalance, we move on to classifying each record as ‘healthy’ or ‘with Parkinson’s disease’.
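
The article does not say which imbalance technique was applied, so the following is only a sketch of one common option: inspect the class distribution, then make a stratified 70/30 split so that both subsets keep roughly the original class ratio. The label vector is rebuilt from the counts quoted above; the seed and the helper logic are assumptions for illustration.

```r
# Labels rebuilt from the class counts quoted above: 48 healthy (0), 147 Parkinson's (1)
status <- factor(rep(c(0, 1), times = c(48, 147)), levels = c(0, 1))

# Class distribution and proportions
class_counts <- table(status)
class_props  <- round(prop.table(class_counts), 3)
print(class_counts)
print(class_props)

# Stratified 70/30 split: sample 70% of the row indices within each class
set.seed(42)  # arbitrary seed, for reproducibility only
train_idx <- unlist(lapply(split(seq_along(status), status), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))
test_idx <- setdiff(seq_along(status), train_idx)

# Both subsets keep roughly the original class ratio
print(prop.table(table(status[train_idx])))
print(prop.table(table(status[test_idx])))
```

Other options, such as up-sampling the minority class or down-sampling the majority class, follow the same pattern of first looking at `table(status)` and then resampling within each class.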

Prediction Model

To classify each record into one of two categories, 0 for healthy and 1 for Parkinson’s Disease, the model uses the Random Forest classifier from the CORElearn package. It is trained on a randomly selected 70% of the dataset and then used to predict the validation/test data.

Here, the model was trained to predict the attribute ‘status’ (the dependent variable) from 136 observations of training data, using CoreModel with the Random Forest classifier, and then tested on 45 observations of test/validation data to obtain the results.
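
A minimal sketch of this training step with CORElearn might look like the following. It is not the project's actual code: the data frame is synthetic, the feature names `f1` to `f3` are hypothetical, and the seed and `rfNoTrees = 100` are assumed settings. Only `CoreModel()`, `predict()` and `destroyModels()` come from the package itself.

```r
library(CORElearn)

# Synthetic stand-in for the dataset (hypothetical features, not real voice measurements)
set.seed(123)
n      <- 195
status <- factor(rep(c(0, 1), times = c(48, 147)), levels = c(0, 1))
pd <- data.frame(
  f1     = rnorm(n, mean = ifelse(status == "1", 1.5, 0)),
  f2     = rnorm(n, mean = ifelse(status == "1", -1, 0)),
  f3     = rnorm(n),
  status = status
)

# Random 70/30 split of the rows
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- pd[train_idx, ]
test  <- pd[-train_idx, ]

# Random forest via CORElearn, with 'status' as the dependent variable
model <- CoreModel(status ~ ., data = train, model = "rf", rfNoTrees = 100)

# Predicted class labels for the held-out rows
pred <- predict(model, test, type = "class")
print(table(predicted = pred, actual = test$status))

# Free the memory held by the underlying model
destroyModels(model)
```

With the real data, `pd` would be the cleaned data frame with `status` stored as a factor, so that CORElearn treats the task as classification rather than regression.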

Model Evaluation

For this project, the model predicted Parkinson’s Disease with 97.87% accuracy on the test/validation data.
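
Accuracy here means the share of test observations whose predicted label matches the true label. As a small sketch of how it can be computed from a confusion matrix in base R, the example below uses ten made-up labels; they are purely illustrative and are not the project's results.

```r
# Hypothetical labels for ten test rows (illustration only, not project results)
actual    <- factor(c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 1, 1, 1, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity and specificity, treating 1 (Parkinson's) as the positive class
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])

cat(sprintf("Accuracy: %.2f%%\n", 100 * accuracy))
cat(sprintf("Sensitivity: %.2f, Specificity: %.2f\n", sensitivity, specificity))
```

Because the classes are imbalanced, it is worth reading sensitivity and specificity alongside accuracy: a model that favours the majority class can score a high accuracy while still misclassifying many healthy records.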

Project Link

Acknowledgements

  • Harvard University and edX platform for providing this course
  • University of Oxford and UCI Machine Learning Repository for the dataset.

Thank you for your time.