LUNG CANCER PREDICTION

Aug 26, 2021

AUTHOR – Thota Ashavanthini Krishna

The goal of this study is to show how feature selection can be used to obtain the relevant features required for machine learning. Lung cancer is an aggressive illness with a dismal prognosis: the 5-year survival rate is only 18%. Early diagnosis can significantly reduce lung cancer mortality, and several computer-aided diagnosis tools have been created for this purpose. Although lung cancer cannot be entirely prevented, the risk of developing it can be lowered, so early identification remains essential for patient survival. Classification techniques such as Naive Bayes, SVM, Decision Tree, and Logistic Regression were used to analyze lung cancer prediction. The main objective of the paper is to estimate lung cancer risk by using the SVM classification approach.

Introduction

Lung cancer is the world's deadliest cancer, with 75 percent of diagnosed individuals dying within five years of diagnosis. The prognosis is substantially better when cancers are discovered early: almost two-thirds of patients survive for at least five years if their tumours are tiny and limited to the lungs. AI algorithms that can detect ever-smaller lung tumours have been developed in response to this need for early detection. AI technology can help in two ways: by taking part of the load off overworked professionals and by finding lung spots that are not evident to the human eye. In this study, we try to predict lung cancer using three different algorithms:

Logistic regression classification
SVM (Support Vector Machine) classification
Decision tree classification

These are the important variables we will use for classifying lung cancer:

Age
Smokes
AreaQ
Alkhol

To make the algorithms easier to understand, we have chosen a dataset with few variables.

This is what the dataset looks like:

FIG:1

Data Visualization

Let us plot the distribution of the result variable as a bar plot.

FIG:2

We cannot draw any conclusion from this plot alone. Let us visualize the data in more detail, using all the variables and assigning them different colors. In this plot, we can compare the result variable with all the other variables, but we still cannot predict the outcome by eye.

FIG:4

Now, let us apply different algorithms and see how accurately each one predicts the result.
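The code in the following sections refers to X_train, X_test, Y_train and Y_test, which the article does not define. As a minimal sketch of one way to obtain such a hold-out split, the snippet below uses only the standard library. The rows are made-up placeholder values, not the real dataset, and the helper name holdout_split is hypothetical; the original notebook may well have used scikit-learn's train_test_split instead.

```python
import random

# Made-up placeholder rows in the column order Age, Smokes, AreaQ, Alkhol, Result.
# These are NOT values from the real dataset.
rows = [
    (35, 3, 5, 4, 1),
    (27, 20, 2, 5, 1),
    (30, 0, 5, 2, 0),
    (28, 0, 8, 1, 0),
    (68, 4, 5, 6, 1),
    (34, 0, 10, 0, 0),
    (58, 15, 10, 0, 0),
    (22, 12, 5, 2, 0),
    (45, 2, 6, 0, 0),
    (52, 18, 4, 5, 1),
]

def holdout_split(rows, test_fraction=0.2, seed=9):
    """Shuffle the rows and split them into train and test parts (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Features are every column but the last; the label is the last column (Result).
    X_train = [r[:-1] for r in train]
    Y_train = [r[-1] for r in train]
    X_test = [r[:-1] for r in test]
    Y_test = [r[-1] for r in test]
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = holdout_split(rows)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Fixing the seed keeps the split reproducible, which matters when scores from several models are compared on the same test rows.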

Logistic Regression Classification

Logistic regression is a supervised classification algorithm that can be applied to binary classification problems. It uses the logistic function, also called the sigmoid function: an S-shaped curve that takes any real number and maps it to a value between 0 and 1, but never exactly at those boundaries.
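To make the shape of the sigmoid concrete, here is a small sketch in plain Python. The function name sigmoid and the sample inputs are chosen for illustration only and do not come from the article's notebook.

```python
import math

def sigmoid(z):
    """Map a real number z to a value between 0 and 1 (mathematically never 0 or 1)."""
    # Two algebraically equal forms are used so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

In logistic regression, z is a weighted sum of the input features, and the sigmoid output is read as the probability of the positive class.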

In a classification problem, for a given set of features (or inputs) X, the target variable (or output) y can take only discrete values. Despite being used as a classifier, logistic regression is a regression model: it predicts the probability that a particular data entry belongs to the “1” category. Just as linear regression assumes that the data follows a linear function, logistic regression models the data using the sigmoid function. Now, let us implement the Logistic Regression algorithm on our dataset.

from sklearn.linear_model import LogisticRegression
# Define the model
logreg = LogisticRegression(C=10)
# Train the model
logreg.fit(X_train, Y_train)
# Predict target values
Y_predict1 = logreg.predict(X_test)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Rows of the confusion matrix are true labels, columns are predicted labels
logreg_cm = confusion_matrix(Y_test, Y_predict1)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(logreg_cm, annot=True, linewidths=0.7, linecolor='cyan', fmt='g', ax=ax, cmap="YlGnBu")
plt.title('Logistic Regression Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

Result –

FIG:5

Support Vector Machine (SVM) Classification –

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used to solve both classification and regression problems, although it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the two classes. Hyperplanes are therefore decision boundaries: data points are assigned to different classes depending on which side of the hyperplane they fall. Now, let us implement the Support Vector Machine algorithm on our dataset.
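As a brief aside before the scikit-learn version, the idea of classifying a point by the side of the hyperplane it falls on can be sketched in a few lines of plain Python. The weights w and offset b below are made-up values, not parameters learned from the dataset, and the function names are hypothetical. Note also that this sketch uses a linear boundary, whereas the RBF kernel used in the article produces a boundary that is curved in the original feature space.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class 1 to points on the non-negative side, class 0 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else 0

# Hypothetical hyperplane in a 2-feature space: x1 + x2 - 10 = 0.
w, b = (1.0, 1.0), -10.0

print(classify(w, b, (8.0, 6.0)))  # point on the positive side
print(classify(w, b, (2.0, 3.0)))  # point on the negative side
```

Training an SVM amounts to choosing w and b so that the margin between the hyperplane and the closest points of each class is as wide as possible.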

from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Define the SVM model: bagged RBF-kernel SVCs wrapped in a one-vs-rest classifier
svmcla = OneVsRestClassifier(BaggingClassifier(SVC(C=10, kernel='rbf', random_state=9, probability=True), n_jobs=-1))

# Train the model
svmcla.fit(X_train, Y_train)

# Predict target values
Y_predict2 = svmcla.predict(X_test)

# The confusion matrix
svmcla_cm = confusion_matrix(Y_test, Y_predict2)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(svmcla_cm, annot=True, linewidths=0.7, linecolor='cyan', fmt='g', ax=ax, cmap="YlGnBu")
plt.title('SVM Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

Result –

FIG:6

Decision Tree Algorithm –

Decision Tree is a supervised learning technique that can be used to solve both classification and regression problems, although it is most often used for classification. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.

A decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have several branches, whereas leaf nodes are the outputs of those decisions and do not branch any further. The decisions are made based on the features of the given dataset. The tree is a graphical representation of all the possible solutions to a problem or decision under the given conditions.
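The roles of decision nodes, branches and leaf nodes can be illustrated with a hand-written toy tree. This is only a sketch: the feature names Smokes and Alkhol come from the article, but the thresholds and the tree structure are invented for illustration and are not what DecisionTreeClassifier would learn from the data.

```python
# A hand-written toy tree; thresholds and structure are invented, not learned.
toy_tree = {
    "feature": "Smokes", "threshold": 10,
    "left": {"feature": "Alkhol", "threshold": 3,
             "left": {"leaf": 0},
             "right": {"leaf": 1}},
    "right": {"leaf": 1},
}

def predict(node, sample):
    """Follow decision nodes until a leaf node is reached, then return its outcome."""
    while "leaf" not in node:
        # Decision rule: go left if the feature value is at or below the threshold.
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(toy_tree, {"Smokes": 2, "Alkhol": 1}))
print(predict(toy_tree, {"Smokes": 20, "Alkhol": 0}))
```

Each dictionary with a "feature" key plays the role of a decision node, and each dictionary with a "leaf" key is a leaf node holding the predicted class.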

from sklearn.tree import DecisionTreeClassifier
# Define the model
dtcla = DecisionTreeClassifier(random_state=9)
# Train the model
dtcla.fit(X_train, Y_train)
# Predict target values
Y_predict4 = dtcla.predict(X_test)

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
dtcla_cm = confusion_matrix(Y_test, Y_predict4)
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(dtcla_cm, annot=True, linewidths=0.7, linecolor='cyan', fmt='g', ax=ax, cmap="YlGnBu")
plt.title('Decision Tree Classification Confusion Matrix')
plt.xlabel('Y predict')
plt.ylabel('Y test')
plt.show()

Result :

FIG:7

Now, let us compare the test scores of the three algorithms.

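import pandas as pd
# score_logreg, score_svmcla and score_dtcla hold each model's score on the test set
Testscores = pd.Series([score_logreg, score_svmcla, score_dtcla],
                       index=['Logistic Regression Score', 'Support Vector Machine Score', 'Decision Tree Score'])
print(Testscores)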

The output is:

Logistic Regression Score          1.0 
Support Vector Machine Score       1.0 
Decision Tree Score                1.0

So, all three algorithms classify every sample in the test set correctly. Note that the test set contains only six samples.

From the confusion matrices, we can read off the following values:

True Positive = 4
False Positive = 0
False Negative = 0
True Negative = 2

The confusion matrix is extremely useful for computing Recall, Precision, Specificity, Accuracy and, most importantly, the AUC-ROC curve.

FIG:8

In this case, Recall = TP / (TP + FN) = 4 / (4 + 0) = 1, which is 100%

FIG:9

In this case, Precision = TP / (TP + FP) = 4 / (4 + 0) = 1, which is also 100%

Accuracy = (TP + TN) / (TP + FP + FN + TN)

In this case, Accuracy = (4 + 2) / (4 + 0 + 0 + 2) = 1, which is also 100%

Hence, we can say that our model is 100% accurate on this test set.
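The metrics discussed above can also be computed directly from the four confusion-matrix counts. The following is a small sketch in plain Python; the function name classification_metrics is hypothetical, while the counts are the ones reported in this article.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from the four confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of actual positives that were found
        "precision": tp / (tp + fp),     # share of predicted positives that were right
        "specificity": tn / (tn + fp),   # share of actual negatives that were found
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Counts reported in the article: TP = 4, FP = 0, FN = 0, TN = 2.
metrics = classification_metrics(tp=4, fp=0, fn=0, tn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The sketch does not guard against a zero denominator, which would occur if, for example, the test set contained no positive samples.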

FIG:10

Conclusion :

Previously, a doctor had to perform multiple tests to determine whether a patient had lung cancer, which was a lengthy procedure. During diagnosis, a patient may be subjected to unnecessary examinations or tests to detect lung cancer. To reduce the processing time and the number of unnecessary check-ups, a preliminary test should be performed that warns both the patient and the doctor of the possibility of lung cancer. Machine learning techniques are being used extensively in the prediction and classification of medical data. Logistic Regression, SVM and Decision Tree are the machine learning algorithms used in this comparative study.

Github Link:

https://github.com/Ashvanthini9/Lung-Cancer-Detection

References :

Lung Cancer Detection and Classification using Deep Learning, Ruchita Tekade
Lung Cancer Detection Using Machine Learning Techniques, International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 8, Issue 2, February 2019
A Comparative Study of Lung Cancer Detection using Machine Learning Algorithms, Veena G Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham
https://www.kaggle.com/gargmanish/basic-machine-learning-with-cancer

Madras Scientific Research Foundation
