Predicting Parkinson’s Disease using R Programming

Jan 7, 2021

A beginner’s guide to predictive analytics on Parkinson’s Disease Dataset in R Programming

This is my solo capstone project for the final part of my project course to obtain the ‘Data Science: Capstone certificate’ offered by Harvard University (HarvardX) through the edX platform.

Being fairly new to Data Science, and especially to R programming, I found this course truly challenging, even though I had some idea of what Data Science is and what it takes to build a predictive model. Having completed the project successfully after all the time and effort I put into it, I want to share the key ideas that guided me through it. So let us start with Parkinson’s Disease itself and then hop into the Data Science part of the project…

What is Parkinson’s Disease?

According to Oxford, Parkinson’s Disease is a progressive disease of the central nervous system, marked by tremor, muscular rigidity, and slow, imprecise movement, chiefly affecting middle-aged and elderly people. It can last for years or even be lifelong. Complications for a person dealing with Parkinson’s Disease include thinking difficulties, emotional changes and depression, swallowing problems, chewing and eating problems, sleep disorders, bladder problems and constipation. The disease may also prove fatal.

Parkinson’s Disease: Symptoms; Illustration

Now that we know the problem we are dealing with in this project, let us understand the dataset that I used for it.

Dataset

Finding a dataset for analyzing and solving medical problems is difficult, as most medical data is held by medical institutes that prefer to keep it confidential. In such a case, finding a dataset that meets your requirements and serves the purpose of easy diagnosis becomes quite a challenge.

In this project, I have used the Parkinsons Data Set from the UCI Machine Learning Repository, which is listed there as the Oxford Parkinson’s Disease Detection Dataset.

A portion of the Parkinson’s Disease Dataset is shown below:

Data Analysis: Some key points…

Before diving into Parkinson’s Disease prediction, let us look at the data in detail in order to understand the important features present therein. This can be done by keeping the following points in mind:

Check for null values in the data
Check for redundancy in the data
Remove unimportant attributes (ID, transaction number, etc.)
Understand the datatype of values of all attributes
Separate the input data from the target data
Compute correlations among the input features
Dimensionality reduction and Feature selection of important attributes/components
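
The checklist above can be sketched in base R roughly as follows. This is a minimal illustration on a small simulated data frame, not the project code: the column names (name, f1, f2, f3, status), the simulated values and the 0.9 correlation cutoff are all assumptions made for the example.

```r
# Simulated stand-in for the real data (hypothetical columns and values)
set.seed(1)
n  <- 40
f1 <- rnorm(n)
sim <- data.frame(
  name   = paste0("rec_", seq_len(n)),    # ID-like column, not a predictor
  f1     = f1,
  f2     = f1 + rnorm(n, sd = 0.01),      # deliberately a near-copy of f1
  f3     = rnorm(n),
  status = rep(c(0, 1), length.out = n)   # target: 0 = healthy, 1 = Parkinson's
)

# 1. Null (missing) values per column
na_counts <- colSums(is.na(sim))

# 2. Redundant (fully duplicated) rows
n_dup <- sum(duplicated(sim))

# 3. Remove unimportant attributes such as IDs
sim$name <- NULL

# 4. Datatype of every remaining attribute
col_types <- sapply(sim, class)

# 5. Separate input data from target data
inputs <- sim[, setdiff(names(sim), "status")]
target <- factor(sim$status, levels = c(0, 1))

# 6. Correlation among the inputs
cor_mat <- cor(inputs)

# 7a. Feature selection: drop one column of each highly correlated pair
#     (the 0.9 cutoff is an assumption)
high      <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
drop_cols <- unique(colnames(cor_mat)[high[, "col"]])
selected  <- inputs[, setdiff(names(inputs), drop_cols), drop = FALSE]

# 7b. Dimensionality reduction: PCA on centred and scaled inputs
pca           <- prcomp(inputs, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(na_counts)
print(n_dup)
print(col_types)
print(round(cor_mat, 2))
print(names(selected))
print(round(var_explained, 3))
```

On real data, the same steps apply once the simulated data frame is replaced by the one read from the dataset file. The correlation cutoff and the number of principal components to keep are choices to tune rather than fixed rules.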

R programming aspect of the project…

I used R for this capstone project because the course requires three deliverables for submission: the R code, an RMD (R Markdown) file, and a PDF report generated from that RMD file.

Here are some online resources to learn R for Data Science provided by Analytics India Magazine…

Now let us take a look at the R packages I used to complete the project. These packages are:

dplyr: grammar for data manipulation
corrplot: graphical display of a correlation matrix
mlbench: collection of Machine Learning benchmark problems (artificial and real-world datasets)
caret: Classification And REgression Training; streamline model training
randomForest: Breiman and Cutler’s Random Forests for Classification and Regression
factoextra: Extract and Visualize results of Multivariate Data Analyses
FactoMineR: Exploratory Data Analysis Methods to summarize, visualize and describe datasets
CORElearn: Classification, Regression and Feature Evaluation; R port of data mining system
rmarkdown: Convert R Markdown documents into a variety of formats
knitr: Dynamic report generation with R

In this dataset, we see that there are 48 records for healthy people and 147 records for patients with Parkinson’s Disease. Since the two labels (healthy and patients with Parkinson’s) are far from evenly represented, this can be thought of as a class imbalance problem. After applying techniques for dealing with class imbalance, we move on to classifying people into the categories ‘healthy’ or ‘with Parkinson’s disease’, as described below.
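
To make the imbalance concrete, the class counts quoted above can be tabulated, and a stratified split is one common way to keep the same class ratio in both the training and the test data. The sketch below is only an assumption about how this could be done in base R, since the article does not say which technique was used. The status vector is rebuilt from the two counts rather than read from the real file, and train_frac and the seed are illustrative.

```r
# Rebuild the label vector from the counts quoted in the text
status <- factor(rep(c(0, 1), times = c(48, 147)), levels = c(0, 1))

class_counts <- table(status)
class_share  <- prop.table(class_counts)

# Stratified split: sample the same fraction from each class separately
set.seed(42)        # illustrative seed
train_frac <- 0.7   # assumed training fraction
idx_by_class <- split(seq_along(status), status)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(train_frac * length(idx)))
}), use.names = FALSE)
test_idx <- setdiff(seq_along(status), train_idx)

print(class_counts)
print(round(class_share, 3))
print(table(status[train_idx]))
print(table(status[test_idx]))
```

Because each class is sampled separately, the share of Parkinson’s records stays close to 147/195 in both subsets, which a plain random split does not guarantee on a dataset this small.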

Prediction Model

To assign each person to one of two categories (0 for healthy, 1 for patients with Parkinson’s Disease), our classification model uses the Random Forest classifier from the CORElearn package. The model is trained on a randomly selected 70% of the dataset and is then used to predict the validation/test data.

Here, we trained our model against the attribute ‘status’ (the dependent variable) on 136 rows of training data, using CoreModel() with the Random Forest option, and then tested it on 45 rows of test/validation data to obtain our results.
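
A hedged sketch of what such a training step might look like with CORElearn is given below. It is not the project's code: the data frame sim and its feature columns f1 and f2 are simulated, the seed and the plain random 70/30 split are assumptions, and the CORElearn part only runs if the package is installed. The CORElearn calls assumed here are CoreModel() with model = "rf", predict() with type = "class", and destroyModels().

```r
# Simulated stand-in data (hypothetical features, not the real records)
set.seed(7)
n <- 195
status <- factor(rep(c(0, 1), times = c(48, 147)), levels = c(0, 1))
sim <- data.frame(
  f1 = rnorm(n, mean = as.integer(status)),  # shifted by class so there is signal
  f2 = rnorm(n),                             # pure noise feature
  status = status
)

# Random 70/30 split of the row indices
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
train_set  <- sim[train_rows, ]
test_set   <- sim[-train_rows, ]

if (requireNamespace("CORElearn", quietly = TRUE)) {
  # Random forest classifier trained against 'status'
  rf_model <- CORElearn::CoreModel(status ~ ., data = train_set, model = "rf")

  # Predicted class labels for the held-out rows
  pred_class <- predict(rf_model, newdata = test_set, type = "class")
  print(table(actual = test_set$status, predicted = pred_class))

  # Release the memory held by the model
  CORElearn::destroyModels(rf_model)
}
```

The target column is stored as a factor so that the model treats the task as classification rather than regression.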

Model Evaluation

By using modelEval() from the CORElearn package, the prediction model used above has been evaluated using metrics such as Accuracy, Precision, Recall, F1 score, etc.
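
For readers who want to see what these metrics mean, here is a minimal base R sketch that computes them by hand from the counts of a confusion matrix. It is an illustration rather than a replacement for modelEval(): the helper name binary_metrics and the toy label vectors are made up for the example, and class 1 (Parkinson’s) is treated as the positive class by assumption.

```r
# Hypothetical helper: accuracy, precision, recall and F1 for 0/1 labels
binary_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(actual == positive & predicted == positive)  # true positives
  tn <- sum(actual != positive & predicted != positive)  # true negatives
  fp <- sum(actual != positive & predicted == positive)  # false positives
  fn <- sum(actual == positive & predicted != positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall)
  )
}

# Toy labels, invented for illustration only
actual    <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 1, 1, 0, 1, 0, 0, 0)
m <- binary_metrics(actual, predicted)
print(round(m, 3))
```

With imbalanced classes, precision and recall are worth reading alongside accuracy, because a model that always predicts the majority class would already reach 147/195, or about 75%, accuracy on this dataset.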

For this project, the model used for predicting Parkinson’s Disease was found to be 97.87% accurate on the test/validation data.

Project Link

Link to my project is given here.

Acknowledgements

Special thanks to:

Harvard University and edX platform for providing this course
University of Oxford and UCI Machine Learning Repository for the dataset.

Thank you for your time.

Embracing this article with claps 👏 will be highly appreciated as it will encourage me to write more blogs about my journey in the field of analytics. See you all soon…