Author - Himanshu Bobade
Diabetes is a major disease that, over time, leads to serious damage to the nerves and blood vessels. Knowing about the disease at an early stage makes proper treatment easier, and machine learning can help predict it. Classification algorithms such as Decision Tree, Random Forest, SVM, KNN, Logistic Regression and LDA can help classify whether a person has diabetes or not. This article covers applying these algorithms, training the models and deploying them using the Flask framework.
INTRODUCTION:
Diabetes is an emerging global epidemic and an increasingly serious global health problem. It is estimated that by the year 2025 about 300 million adults will have diabetes, and that three quarters of them will live in non-industrialized countries. Detecting the disease at its earliest stage is the need of the hour, and machine learning can help with that prediction.
The project mainly highlights machine learning algorithms such as Random Forest, Decision Tree, Logistic Regression, Support Vector Machines (SVMs), Linear Discriminant Analysis (LDA) and K-Nearest Neighbours (KNN).
The goal of this project is to find the algorithm with the best accuracy in order to predict the chance of having diabetes from a given set of input data. Before we get started, here is the data flow diagram for the project (fig 0.1):
![](https://static.wixstatic.com/media/6e3b57_e26d321873bb45ca8186ae6505f7a8bf~mv2.jpg/v1/fill/w_602,h_385,al_c,q_80,enc_auto/6e3b57_e26d321873bb45ca8186ae6505f7a8bf~mv2.jpg)
Fig. 0.1: Data Flow
The entire implementation is divided into 2 steps:
Step 1: Training the dataset over different classification algorithms.
Step 2: Deploying the model using Flask framework.
Step 1
For this project, the most relevant dataset available on the internet is used (the Pima Indians Diabetes Database, listed in the references). The data consists of parameters such as Pregnancies, Glucose, Blood Pressure, Insulin, Skin Thickness, BMI, Diabetes Pedigree Function and Age, plus the outcome, that is, whether the person has diabetes or not. The data contains outliers and irregularities, which we have handled using preprocessing techniques.
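The preprocessing itself is not shown in the article. The sketch below is one way it could look, not the author's confirmed method. It assumes the column names of the Kaggle CSV, treats a zero in the physiological columns as a missing value, and uses a few illustrative rows in place of the real file. It also produces the X_train, X_test, X_train_s and X_test_s names that appear in the later snippets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A few illustrative rows in the dataset's format, standing in for the real CSV.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 5, 3, 10],
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Assumption: a zero in these columns means "not recorded", so replace it
# with the median of the non-zero values in the same column.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
for col in zero_as_missing:
    median = df.loc[df[col] != 0, col].median()
    df[col] = df[col].astype(float).replace(0.0, median)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Hold out a test split; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps information from the test rows out of the model, which matters when the test accuracy is used to compare algorithms.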
Logistic Regression
The first algorithm we’ll use is Logistic Regression. Logistic Regression uses the sigmoid function to classify. It takes any real-valued number and maps it to a value between 0 and 1, but never exactly at those limits. The formula for the sigmoid function is sigmoid(value) = 1 / (1 + e^-value).
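As a minimal illustration of the formula (not of scikit-learn's internals), the sigmoid can be written in a few lines. In logistic regression the value passed in is a weighted sum of the input features; the function name and the sample inputs here are chosen only for the example.

```python
import math

def sigmoid(value: float) -> float:
    """Map a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 lands exactly in the middle.
for v in (-4.0, 0.0, 4.0):
    print(v, round(sigmoid(v), 3))
```

A result above 0.5 is usually read as the positive class (diabetic) and a result below 0.5 as the negative class.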
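from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit on the feature matrix X and the label vector y
clf_lr = LogisticRegression()
clf_lr.fit(X, y)
clf_lr.predict_proba(X)  # class probabilities for each row
y_pred = clf_lr.predict(X)
# Note: this accuracy is measured on the same data the model was trained on
print("Accuracy Score LR", accuracy_score(y, y_pred))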
The accuracy score is shown in fig 1.1:
![](https://static.wixstatic.com/media/6e3b57_da83063105ca4180a5302927513b0816~mv2.jpg/v1/fill/w_247,h_27,al_c,q_80,enc_auto/6e3b57_da83063105ca4180a5302927513b0816~mv2.jpg)
Fig 1.1 Logistic Regression accuracy score
We’ll convert our trained model into a pickle file so that we can use it to fetch output for given input value.
import pickle
pickle.dump(clf_lr, open('data/LogisticRegressionModel.pkl','wb'))
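To check that a pickled model behaves the same after loading, a round trip like the following can help. This is a sketch on toy data: the temporary directory, the toy arrays and the variable names are assumptions made for the example, not part of the project.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real features and labels.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "LogisticRegressionModel.pkl"
    # "with" closes the file even if dump or load raises.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("Restored model matches:", same_predictions)
```

Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote them.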
LDA
Linear discriminant analysis (LDA) is a method used in statistics and other fields to find a linear combination of features that separates two or more classes of objects or events. The following code shows how the algorithm is implemented; the output is shown in fig 1.2.
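from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf_lda = LinearDiscriminantAnalysis()
clf_lda.fit(X, y)
y_pred_lda = clf_lda.predict(X)
# As with Logistic Regression, this accuracy is measured on the training data
print("Accuracy score LDA", accuracy_score(y, y_pred_lda))
pickle.dump(clf_lda, open('data/LDA.pkl','wb'))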
![](https://static.wixstatic.com/media/6e3b57_583f7b2ff3f74043bb1d4bb44f7bc746~mv2.jpg/v1/fill/w_184,h_21,al_c,q_80,enc_auto/6e3b57_583f7b2ff3f74043bb1d4bb44f7bc746~mv2.jpg)
Fig 1.2 LDA accuracy score
KNN
The K-Nearest Neighbours (KNN) algorithm stores all the available data and classifies a new data point based on its similarity to the stored points. When new data appears, it is assigned to the category of its closest neighbours.
The following code shows how the algorithm is implemented; the output is shown in fig 1.3.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values of k; the grid search picks the best by cross-validation
params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30]}
grid_search_cv = GridSearchCV(KNeighborsClassifier(), params)
# X_train_s / X_test_s: presumably the scaled training and test features
grid_search_cv.fit(X_train_s, y_train)
optimised_KNN = grid_search_cv.best_estimator_
y_test_pred = optimised_KNN.predict(X_test_s)
print("Accuracy KNN", accuracy_score(y_test, y_test_pred))
# Save the tuned model (the original line referenced an undefined clf_knn_1)
pickle.dump(optimised_KNN, open('data/KNN.pkl','wb'))
![](https://static.wixstatic.com/media/6e3b57_f44c5bb0409d481fb35cf73488724466~mv2.jpg/v1/fill/w_139,h_17,al_c,q_80,enc_auto/6e3b57_f44c5bb0409d481fb35cf73488724466~mv2.jpg)
Fig 1.3 KNN accuracy score
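To make the idea behind KNN concrete, here is a small from-scratch sketch of the "nearest neighbours vote" step. It is not how scikit-learn implements it internally; the function name, the toy points and the choice of Euclidean distance are assumptions for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters, labelled 0 and 1.
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([1.1, 1.0])))
print(knn_predict(X_demo, y_demo, np.array([5.0, 5.1])))
```

Because the vote depends on distances, features with large numeric ranges can dominate the result, which is a common reason for scaling the inputs before fitting KNN.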
Decision Tree
A decision tree is an algorithm that classifies the dataset using a flowchart-like structure. It takes in the input values and splits the data on the parameters (features) along which the labels vary.
The following code shows how the algorithm is implemented; the output is shown in fig 1.4.
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier()
clf_dt = clf_dt.fit(X_train,y_train)
y_pred = clf_dt.predict(X_test)
# Accuracy on the held-out test split
print("Accuracy Decision tree:",metrics.accuracy_score(y_test, y_pred))
pickle.dump(clf_dt, open('data/DecisionTree.pkl','wb'))
![](https://static.wixstatic.com/media/6e3b57_55cf8fe279604fba8bf925ac5d427842~mv2.jpg/v1/fill/w_310,h_25,al_c,q_80,enc_auto/6e3b57_55cf8fe279604fba8bf925ac5d427842~mv2.jpg)
Fig 1.4 Decision Tree accuracy score
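The flowchart nature of a decision tree can be seen by printing the learned rules. The sketch below uses scikit-learn's export_text on a tiny invented dataset in which the label depends only on glucose; the numbers and the two feature names are assumptions for illustration, not values from the project.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Glucose", "BMI"]
# Invented rows: the label is 1 whenever glucose is high, whatever the BMI.
X_demo = [[85, 30], [90, 22], [100, 35], [110, 25],
          [140, 31], [150, 23], [160, 34], [170, 26]]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Each indented line is one question in the flowchart; leaves show the class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

On this data a single question about glucose is enough to separate the classes, so BMI never appears in the printed rules.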
SVM
SVMs classify the dataset using hyperplanes. The algorithm finds a hyperplane in an N-dimensional space that separates the data into categories.
The following code shows how the algorithm is implemented; the output is shown in fig 1.5.
from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(X_train, y_train)
y_pred = clf_svm.predict(X_test)
print("Accuracy for SVM:",metrics.accuracy_score(y_test, y_pred))
pickle.dump(clf_svm, open('data/SVM.pkl','wb'))
![](https://static.wixstatic.com/media/6e3b57_effa1eec85334c15a1c4a7723bb482e8~mv2.jpg/v1/fill/w_184,h_20,al_c,q_80,enc_auto/6e3b57_effa1eec85334c15a1c4a7723bb482e8~mv2.jpg)
Fig 1.5 SVM accuracy score
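For a linear kernel, the separating hyperplane can be read directly from the fitted model through its coef_ and intercept_ attributes. The following sketch, on invented two-dimensional points, shows that the side of the hyperplane a point falls on decides its class; the toy data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn import svm

# Two invented, clearly separated groups of points.
X_demo = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                   [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel="linear").fit(X_demo, y_demo)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane

# The sign of w.x + b says on which side of the hyperplane a point lies.
scores = X_demo @ w + b
manual_pred = (scores > 0).astype(int)

print("w =", w, "b =", b)
print("manual:", manual_pred, "sklearn:", clf.predict(X_demo))
```

With the eight diabetes features the same idea holds, only the hyperplane lives in an eight-dimensional space.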
Random Forest
Random Forest builds on decision trees. It trains multiple decision trees to label the data and then combines their predictions to get more accurate results.
The following code shows how the algorithm is implemented; the output is shown in fig 1.6.
from sklearn.ensemble import RandomForestClassifier
clf_rf=RandomForestClassifier(n_estimators=180)
clf_rf.fit(X_train,y_train)
y_pred=clf_rf.predict(X_test)
print("Accuracy Random Forest:",metrics.accuracy_score(y_test, y_pred))
pickle.dump(clf_rf, open('data/RandomForest.pkl','wb'))
![](https://static.wixstatic.com/media/6e3b57_0e9da1fd313940f7a95398cfa5163b7e~mv2.jpg/v1/fill/w_238,h_23,al_c,q_80,enc_auto/6e3b57_0e9da1fd313940f7a95398cfa5163b7e~mv2.jpg)
Fig 1.6 Random Forest accuracy score
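How the forest combines its trees can be checked by hand. In scikit-learn the individual fitted trees are exposed as estimators_, and the forest's prediction is based on the average of their class probabilities. The sketch below reproduces that on synthetic data; the dataset and the smaller number of trees are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 8 features, standing in for the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by every individual tree.
tree_probas = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)
manual_pred = forest.classes_[np.argmax(tree_probas, axis=1)]

print("Matches forest.predict:", np.array_equal(manual_pred, forest.predict(X_demo)))
```

Averaging many trees that were each trained on a different random sample of the data tends to smooth out the mistakes of any single tree.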
The following graph (fig 1.7) compares the accuracy of the algorithms:
![](https://static.wixstatic.com/media/6e3b57_8596e5af4b5b4330b01e33bb7a741962~mv2.jpg/v1/fill/w_575,h_374,al_c,q_80,enc_auto/6e3b57_8596e5af4b5b4330b01e33bb7a741962~mv2.jpg)
Fig 1.7 Algorithm Accuracy comparison
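A comparison like the one in the graph is easiest to trust when every model is scored on the same held-out split. In the snippets above, Logistic Regression and LDA are scored on the data they were fitted on, while the other four models use a test split, so the figures are not strictly comparable. A minimal sketch of a uniform comparison follows; it uses synthetic data and default settings (apart from those shown), so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (8 features, binary outcome).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=180, random_state=0),
}

# Fit each model on the training split and score it on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Print the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```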
Step 2
This part of the implementation covers deployment of the application. We need to build an app that takes in our data and gives us the desired output.
First, we need to load our trained models. We can do it by using:
modelrf = pickle.load(open('data/RandomForest.pkl','rb'))
modeldt = pickle.load(open('data/DecisionTree.pkl','rb'))
modelknn = pickle.load(open('data/KNN.pkl','rb'))
modellda = pickle.load(open('data/LDA.pkl','rb'))
modelsvm = pickle.load(open('data/SVM.pkl','rb'))
modellr = pickle.load(open('data/LogisticRegressionModel.pkl','rb'))
We will use the form.html file to collect the data from the user through the app. The interface is shown in fig 2.1.
![](https://static.wixstatic.com/media/6e3b57_c5b41b9694c949f894a0cdadd93ba037~mv2.jpg/v1/fill/w_602,h_322,al_c,q_80,enc_auto/6e3b57_c5b41b9694c949f894a0cdadd93ba037~mv2.jpg)
Fig 2.1 Input Interface
We fetch the user input using the following code:
preg= request.form['preg']
gluc= request.form['gluc']
blood_pressure= request.form['blood_pressure']
skin_th= request.form['skin_th']
insln= request.form['insln']
b_m_i= request.form['b_m_i']
d_p_func = request.form['d_p_func']
AGE = request.form['AGE']
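Form values arrive as strings, and a field may be empty or contain text that is not a number. One way to guard against that is a small helper that validates the fields before they reach the model. This is a sketch: the helper name and the error messages are hypothetical, and it accepts any mapping with a get method, which includes Flask's request.form.

```python
import numpy as np

# Field names as used in the article's form, in the order the model expects.
FIELDS = ["preg", "gluc", "blood_pressure", "skin_th",
          "insln", "b_m_i", "d_p_func", "AGE"]

def form_to_features(form):
    """Convert a mapping of form strings into a (1, 8) float array.

    Raises ValueError if a field is missing, empty or not a number.
    """
    values = []
    for name in FIELDS:
        raw = form.get(name, "").strip()
        if not raw:
            raise ValueError(f"missing value for {name!r}")
        values.append(float(raw))  # float() raises ValueError on bad input
    return np.array(values).reshape(1, -1)

# A plain dict behaves like submitted form data for this purpose.
example = {"preg": "2", "gluc": "120", "blood_pressure": "70", "skin_th": "20",
           "insln": "80", "b_m_i": "25.5", "d_p_func": "0.4", "AGE": "33"}
print(form_to_features(example))
```

Inside a view function this would be called as form_to_features(request.form), with the ValueError caught and turned into a message for the user.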
After fetching the input, we pass it to a trained model to get a prediction. Since Random Forest gave the highest accuracy, we use it to display our main output; the same method works for the other models as well.
The processed output is then sent to analysis.html, which shows the prediction (fig 2.2) and a comparison of the results from the different algorithms for the given input (fig 2.3). The prediction code is:
sample_data = [preg,gluc,blood_pressure,skin_th,insln, b_m_i,d_p_func,AGE]
clean_data = [float(i) for i in sample_data]
ex = np.array(clean_data).reshape(1,-1)
result_prediction = modelrf.predict(ex)
![](https://static.wixstatic.com/media/6e3b57_0005835379af4fe69eb03618eeb5518d~mv2.jpg/v1/fill/w_602,h_205,al_c,q_80,enc_auto/6e3b57_0005835379af4fe69eb03618eeb5518d~mv2.jpg)
Fig 2.2 Output prediction using Random Forest
![](https://static.wixstatic.com/media/6e3b57_3225017869b14ca9ac31c8251ce7492b~mv2.jpg/v1/fill/w_506,h_461,al_c,q_80,enc_auto/6e3b57_3225017869b14ca9ac31c8251ce7492b~mv2.jpg)
Fig 2.3 Output from all algorithms
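Putting the pieces together, a prediction route could look roughly like the sketch below. It is not the project's actual app: the /predict URL, the inline template that stands in for analysis.html, and the model trained on synthetic data (in place of the one loaded from data/RandomForest.pkl) are all assumptions made so the example runs on its own.

```python
import numpy as np
from flask import Flask, render_template_string, request
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

app = Flask(__name__)

# Stand-in model trained on synthetic data; the real app would load
# data/RandomForest.pkl instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
modelrf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

FIELDS = ["preg", "gluc", "blood_pressure", "skin_th",
          "insln", "b_m_i", "d_p_func", "AGE"]

# Inline stand-in for analysis.html.
RESULT_PAGE = "<p>Prediction: {{ 'Diabetic' if result == 1 else 'Not diabetic' }}</p>"

@app.route("/predict", methods=["POST"])
def predict():
    try:
        # A missing field raises a KeyError subclass, bad text a ValueError.
        clean_data = [float(request.form[name]) for name in FIELDS]
    except (KeyError, ValueError):
        return "Please fill in every field with a number.", 400
    ex = np.array(clean_data).reshape(1, -1)
    result_prediction = int(modelrf.predict(ex)[0])
    return render_template_string(RESULT_PAGE, result=result_prediction)

# Exercise the route with Flask's test client, without starting a server.
client = app.test_client()
response = client.post("/predict", data={name: "1" for name in FIELDS})
print(response.status_code, response.get_data(as_text=True))
```

Returning a 400 response for incomplete input keeps a bad form submission from turning into a server error.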
CONCLUSION
Thus, we took the dataset and trained six different algorithms on it, learned how the algorithms work and compared their accuracy scores. Deploying the result with the Flask framework also gave us a little taste of working with a web framework.
GitHub Code:
https://github.com/himanshubobade/Diabetes_prediction_using_flask_framework
References:
https://www.kaggle.com/uciml/pima-indians-diabetes-database
https://flask.palletsprojects.com/en/2.0.x/
https://monkeylearn.com/blog/classification-algorithms/