A beginner’s guide to predictive analytics on Parkinson’s Disease Dataset in R Programming
This is my solo capstone project, the final piece of course work required to obtain the ‘Data Science: Capstone’ certificate offered by Harvard University (HarvardX) through the edX platform.
Being fairly new to Data Science, and especially to R programming, I found this course truly challenging, even though I had a rough idea of what Data Science is and what it takes to build a predictive model. Having completed the project after putting a lot of time and effort into it, I want to share the key ideas that guided me along the way. So let us start with the topic of Parkinson’s Disease and then hop into the Data Science part of the project…
What is Parkinson’s Disease?
According to Oxford, Parkinson’s Disease is a progressive disease of the central nervous system, marked by tremor, muscular rigidity, and slow, imprecise movement, chiefly affecting middle-aged and elderly people. It can last for years or even be lifelong. Complications for a person living with Parkinson’s Disease include thinking difficulties, emotional changes and depression, problems with swallowing, chewing and eating, sleep disorders, bladder problems and constipation. The disease may also prove fatal.
Now that we know the problem we are dealing with in this project, let us look at the dataset I used.
Finding a dataset for analyzing and solving medical problems is difficult, because most medical data is held by medical institutes that prefer to keep it confidential. Finding a dataset that meets your requirements and is suitable for diagnosis is therefore quite a challenge.
In this project, I have used the Parkinsons Data Set from the UCI Machine Learning Repository, which originates from the Oxford Parkinson’s Disease Detection Dataset.
A portion of the Parkinson’s Disease Dataset is shown below:
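A minimal sketch of how the first rows can be inspected in R once the file has been downloaded from the repository. The rows written to the temporary file here are made-up placeholders, not real measurements, and only a few of the dataset's columns are included; in practice `path` would point to the downloaded file.

```r
# Tiny made-up stand-in for the downloaded file (values are NOT real data).
csv_lines <- c(
  "name,MDVP:Fo(Hz),MDVP:Jitter(%),status",
  "rec_01,120.1,0.0078,1",
  "rec_02,197.0,0.0029,0",
  "rec_03,154.3,0.0051,1"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# read.csv() replaces characters such as ':' and '(' with dots in column
# names, so "MDVP:Fo(Hz)" becomes "MDVP.Fo.Hz."
parkinsons <- read.csv(path, stringsAsFactors = FALSE)

head(parkinsons)   # first rows of the data
dim(parkinsons)    # number of rows and columns
names(parkinsons)  # column names after cleaning
```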
Data Analysis: Some key points…
Before diving into Parkinson’s Disease prediction, let us look at the data in detail in order to understand the important features present therein. This can be done by keeping the following points in mind:
- Check for null values in the data
- Check for redundancy in the data
- Remove unimportant attributes (ID, transaction number, etc.)
- Understand the datatype of values of all attributes
- Separate the input data from the target data
- Compute correlations among the input attributes
- Dimensionality reduction and Feature selection of important attributes/components
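The checklist above can be sketched in base R roughly as follows. This is an illustration on a synthetic stand-in data frame, not the project's actual code: the columns `f1`, `f2` and `f3` are hypothetical, and base functions (`cor()`, `prcomp()`) are used in place of packages such as corrplot or FactoMineR.

```r
set.seed(1)

# Synthetic stand-in data (NOT the real dataset): an ID column, three numeric
# features, and a binary target called 'status'.
n  <- 195
f1 <- rnorm(n)
df <- data.frame(
  name   = paste0("rec_", seq_len(n)),
  f1     = f1,
  f2     = f1 * 0.9 + rnorm(n, sd = 0.2),  # deliberately correlated with f1
  f3     = rnorm(n),
  status = rep(c(0, 1), times = c(48, 147))
)

# 1. Null (missing) values per column
colSums(is.na(df))

# 2. Redundant (duplicated) rows
sum(duplicated(df))

# 3. Drop identifier columns that carry no predictive information
df$name <- NULL

# 4. Data types of the remaining attributes
str(df)

# 5. Separate the inputs from the target
inputs <- df[, setdiff(names(df), "status")]
target <- factor(df$status, levels = c(0, 1))

# 6. Correlation between the inputs
cor_mat <- cor(inputs)
round(cor_mat, 2)

# 7. Dimensionality reduction with PCA on centred, scaled inputs
pca <- prcomp(inputs, center = TRUE, scale. = TRUE)
summary(pca)
```

A strongly correlated pair such as `f1` and `f2` is the kind of redundancy that the correlation step is meant to surface before feature selection.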
R programming aspect of the project…
I used R for this capstone project because the course requires the submission to consist of the R code, an RMD file, and a PDF report generated from the RMD file.
Here are some online resources to learn R for Data Science provided by Analytics India Magazine…
Now let us take a look at the packages I have used in R Programming in order to complete the project. These packages are:
- dplyr: grammar for data manipulation
- corrplot: graphical display of a correlation matrix
- mlbench: Machine Learning Benchmark Problems; a collection of benchmark datasets
- caret: Classification And REgression Training; streamline model training
- randomForest: Breiman and Cutler’s Random Forests for Classification and Regression
- factoextra: Extract and Visualize results of Multivariate Data Analyses
- FactoMineR: Exploratory Data Analysis Methods to summarize, visualize and describe datasets
- CORElearn: Classification, Regression and Feature Evaluation; R port of data mining system
- rmarkdown: Convert R Markdown documents into a variety of formats
- knitr: Dynamic report generation with R
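A small sketch for checking which of these packages are already installed before starting. The variable names are illustrative, and the install line is left commented out so that nothing is installed without intent.

```r
pkgs <- c("dplyr", "corrplot", "mlbench", "caret", "randomForest",
          "factoextra", "FactoMineR", "CORElearn", "rmarkdown", "knitr")

# TRUE for every package that can be loaded from the current library paths
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing from CRAN:
# install.packages(missing_pkgs)
```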
In this dataset, the ‘status’ column marks 48 recordings as healthy and 147 as Parkinson’s Disease (the rows are voice recordings, so one person can appear more than once). Because the two labels are far from evenly represented, this can be treated as a class imbalance problem. After applying techniques for handling class imbalance, we move on to classifying each case as ‘healthy’ or ‘with Parkinson’s disease’.
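To make the imbalance concrete, the sketch below computes the class shares from the counts given above and the accuracy a model would reach by always predicting the majority class. The random oversampling at the end is only one possible remedy and is an assumption for illustration; the article does not say which technique was actually used.

```r
# Class counts as reported in the article
counts <- c(healthy = 48, parkinsons = 147)

# Share of each class
props <- counts / sum(counts)
round(props, 3)

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy <- max(props)

# One possible remedy (an assumption, not necessarily the author's):
# randomly oversample the minority class until both classes are equal in size.
set.seed(42)
status       <- rep(c(0, 1), times = counts)
minority_idx <- which(status == 0)
n_extra      <- counts[["parkinsons"]] - counts[["healthy"]]
extra_idx    <- sample(minority_idx, size = n_extra, replace = TRUE)

balanced_status <- status[c(seq_along(status), extra_idx)]
table(balanced_status)
```

The baseline is about 75%, so any reported accuracy should be read against that figure. If oversampling is used, it is usually applied to the training portion only, so that duplicated rows do not leak into the test data.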
To classify cases into two categories, 0 for healthy and 1 for patients with Parkinson’s Disease, the model uses the Random Forest classifier from the CORElearn package. It is trained on a randomly selected 70% of the dataset and then used to predict the validation/test data.
Here, the model was trained to predict the attribute ‘status’ (the dependent variable) from 136 observations of the training data, using CoreModel with the Random Forest option, and then tested on 45 observations of the test/validation data to obtain the results.
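A rough sketch of the split-and-train step, run on a synthetic stand-in data frame with hypothetical features `f1` and `f2`. A plain 70% split of 195 rows gives 136 training rows; the size of the held-out set in this sketch simply follows from that split and may differ from the project's if rows were dropped beforehand. The CORElearn part only runs when the package is installed.

```r
set.seed(2021)

# Synthetic stand-in with the row count and class counts given in the article
n      <- 195
status <- factor(rep(c(0, 1), times = c(48, 147)))
df <- data.frame(
  f1     = rnorm(n, mean = as.integer(status)),  # weakly informative feature
  f2     = rnorm(n),                             # pure noise feature
  status = status
)

# Random 70/30 split of the row indices
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
nrow(train)  # 136 rows

# Random forest via CORElearn, only if the package is available
if (requireNamespace("CORElearn", quietly = TRUE)) {
  model <- CORElearn::CoreModel(status ~ ., data = train, model = "rf")
  pred  <- predict(model, newdata = test, type = "class")
  print(table(actual = test$status, predicted = pred))
  CORElearn::destroyModels(model)  # release the memory held by the model
}
```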
By using modelEval() from the CORElearn package, the prediction model used above has been evaluated using metrics such as Accuracy, Precision, Recall, F1 score, etc.
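For readers who want to see what these metrics mean, the sketch below computes them by hand from a confusion matrix, treating 1 (Parkinson’s Disease) as the positive class. The ten predictions are hypothetical and are not the project's results.

```r
# Hypothetical labels for ten test cases (1 = Parkinson's, 0 = healthy)
actual    <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 1)
predicted <- c(1, 1, 1, 1, 1, 0, 0, 0, 1, 1)

# Cells of the confusion matrix
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

With imbalanced classes, precision and recall are worth reading alongside accuracy, since accuracy alone can look high even when the minority class is predicted poorly.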
For this project, the model predicted Parkinson’s Disease on the test data with an accuracy of 97.87%.
Link to my project is given here.
Special thanks to:
- Harvard University and edX platform for providing this course
- University of Oxford and UCI Machine Learning Repository for the dataset.
Thank you for your time.
Embracing this article with claps 👏 will be highly appreciated as it will encourage me to write more blogs about my journey in the field of analytics. See you all soon…