AUTHOR: Yeshwanth Buggaveeti
The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
The iris dataset contains measurements for 150 iris flowers from three different species.
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)
![](https://static.wixstatic.com/media/6e3b57_9a76f0b2543b4c939a3c19a8eb9f82a3~mv2.jpeg/v1/fill/w_980,h_301,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/6e3b57_9a76f0b2543b4c939a3c19a8eb9f82a3~mv2.jpeg)
FIG:1
The four features of the Iris dataset:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
![](https://static.wixstatic.com/media/6e3b57_5e5258ddd650464ca3fdf3f94c245162~mv2.jpeg/v1/fill/w_868,h_582,al_c,q_85,enc_auto/6e3b57_5e5258ddd650464ca3fdf3f94c245162~mv2.jpeg)
FIG:2
The problem is this: we are given some features of a flower, and based on these features we have to identify which category the flower belongs to. This is a classification problem, and we can solve it with supervised machine learning classification algorithms such as a decision tree. Decision trees and random forests can be used for both classification and regression.
Steps involved in classification:
For this classification problem on the Iris species data, we will do the following tasks.
First, we will load the Iris dataset file and then use some Python methods to explore and analyze the dataset.
Next, we will visualize the Iris dataset with the help of a pair plot (the EDA part).
Then we will split the data into a training set to train the model and a test set to check the accuracy score.
After that, we will build models such as a decision tree and a support vector machine, train them, and calculate their accuracy scores.
Finally, GridSearchCV can be used to tune the hyperparameters (C, gamma, kernel) of the SVC model to try to push the accuracy score towards 100%.
First, we import libraries such as pandas, NumPy, scikit-learn, matplotlib and seaborn, and then load the Iris dataset. pandas is used for data manipulation, NumPy for numerical calculations and arrays, and matplotlib for plotting graphs. Scikit-learn (sklearn) is a free machine learning library for Python; it provides various machine learning algorithms and helper functions (accuracy scoring, train/test splitting, etc.). seaborn is used for visualization. Next, we check the data information: from the image below we get an idea of the data, such as its size, its features and the type of each column.
![](https://static.wixstatic.com/media/6e3b57_bea338b216534e7c9789bc5ad54b89c6~mv2.jpg/v1/fill/w_850,h_789,al_c,q_85,enc_auto/6e3b57_bea338b216534e7c9789bc5ad54b89c6~mv2.jpg)
FIG:3
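As a rough sketch of this loading and inspection step: the article's CSV file is not shown, so the example below is an assumption that rebuilds an equivalent DataFrame from scikit-learn's bundled copy of the Iris data, reusing the column names that appear later in the article.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the article's CSV is not available here, so we build an
# equivalent DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df = pd.DataFrame(iris.data, columns=columns)

# Map the integer targets (0, 1, 2) to species names in "Iris-<name>" style.
df['Species'] = [f"Iris-{iris.target_names[t]}" for t in iris.target]

print(df.shape)                      # number of rows and columns
df.info()                            # column names, dtypes, non-null counts
print(df['Species'].value_counts())  # number of samples per species
```

If you have the CSV file locally, pd.read_csv with your own file path would replace the load_iris part.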
Exploratory Data analysis:
We create a pair plot of the dataset using seaborn (imported as sns) to see which flower species is most separable. Pair plots are a simple way to visualize the relationships between variables: they produce a matrix of plots, one for each pair of variables in the data, for a quick examination. In the image below, Iris setosa is the most clearly separated species: all the blue dots belong to Iris setosa. We can also visually confirm that there are three species in total.
![](https://static.wixstatic.com/media/6e3b57_49bbd111fe2143ccaac0f5cdba6e914d~mv2.jpg/v1/fill/w_672,h_476,al_c,q_80,enc_auto/6e3b57_49bbd111fe2143ccaac0f5cdba6e914d~mv2.jpg)
FIG:4
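The separation of Iris setosa seen in the pair plot can also be checked numerically. The sketch below is not from the original notebook: it assumes scikit-learn's bundled Iris data in place of the article's CSV and compares the petal length ranges per species. The pair plot itself would typically be drawn with sns.pairplot(df, hue='Species').

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the article's CSV.
iris = load_iris()
columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df = pd.DataFrame(iris.data, columns=columns)
df['Species'] = [str(iris.target_names[t]) for t in iris.target]

# Minimum and maximum petal length for each species.
petal_range = df.groupby('Species')['PetalLengthCm'].agg(['min', 'max'])
print(petal_range)

# Compare setosa's longest petal with the shortest petal of the other species.
setosa_max = petal_range.loc['setosa', 'max']
others_min = petal_range.drop('setosa')['min'].min()
print(setosa_max < others_min)  # True when the ranges do not overlap
```

When the ranges do not overlap, a single threshold on petal length is enough to separate setosa from the other two species.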
Splitting the dataset into train and test:
The code below splits the data for training and testing. We load the four feature columns into x and the species label into y; here y is the variable to predict, so given the inputs we can predict the species. Of course, we do not use all the data for training: we train the model on part of the data and test it on the rest. Testing the model is important to see how it behaves, that is, how accurate it is.
CODE
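# Features (x) and target labels (y); df is the DataFrame loaded earlier
x = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = df['Species']
# Encode the species names as integers (0, 1, 2)
from sklearn.preprocessing import LabelEncoder
lbe = LabelEncoder()
y = lbe.fit_transform(y)
# Hold out 20% of the samples for testing; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=50)
# Predict on the test data; dc is the DecisionTreeClassifier fitted further below, so run that step first
ypred = dc.predict(xtest)
ypred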
Output:
array([1, 1, 0, 0, 2, 2, 2, 0, 0, 1, 0, 2, 0, 2, 1, 0, 1, 0, 1, 2, 2, 1,
0, 2, 1, 2, 1, 1, 1, 2], dtype=int64)
Here we printed the predicted categories for the test samples. In the output, the categories are 0, 1 and 2 (setosa, versicolor and virginica); each prediction is based on the four feature values of a sample, which change from one category of flower to another. train_test_split divides the dataset into two parts, training data and testing data. With test_size=0.2, as in the code above, 20% of the samples (30) are held out for testing, which matches the 30 predictions in the output, and the remaining 80% (120) are used for training. The confusion matrix discussed later was computed with a 30% test split (45 samples). Random state: random_state can be any integer. It seeds the random number generator so that the splits you generate are reproducible. If we do not define a random state, a new random seed is used each time we run the code, and the train and test sets would contain different samples each time.
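To make the effect of random_state concrete, here is a minimal sketch. It assumes scikit-learn's bundled Iris data in place of the article's DataFrame and simply runs the same split twice with the same seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the article's DataFrame.
X, y = load_iris(return_X_y=True)

# Two splits with the same random_state select exactly the same test samples.
_, xtest_a, _, ytest_a = train_test_split(X, y, test_size=0.2, random_state=50)
_, xtest_b, _, ytest_b = train_test_split(X, y, test_size=0.2, random_state=50)
print(np.array_equal(xtest_a, xtest_b))

# test_size=0.2 holds out 20% of the 150 samples.
print(len(ytest_a))
```

Leaving random_state unset would draw a fresh shuffle on every run, so the two test sets would usually differ.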
from sklearn.tree import DecisionTreeClassifier
dc= DecisionTreeClassifier()
dc.fit(xtrain,ytrain)
This fits the decision tree classification algorithm on the training data. For the support vector machine algorithm, we use the code below.
# SVC lives in sklearn.svm, not sklearn.tree
from sklearn.svm import SVC
model = SVC(C=1, kernel='rbf', tol=0.001)
model.fit(xtrain, ytrain)
After building the model, we have to check how well it performs on data it has not seen. Here we get an accuracy of 96%, which is good: the model predicts the wrong category for about 4% of the test flowers.
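The accuracy check itself is not shown in the text, so the following is only a sketch of how it could be done with scikit-learn's accuracy_score. It assumes the bundled Iris data and a fixed random_state for the tree, neither of which comes from the original notebook, so the printed scores may differ from the figures quoted in this article.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for the article's DataFrame.
X, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=50)

# Decision tree; random_state=0 is an assumption added for repeatable results.
dc = DecisionTreeClassifier(random_state=0)
dc.fit(xtrain, ytrain)
tree_acc = accuracy_score(ytest, dc.predict(xtest))

# Support vector machine with the parameters used in the article.
model = SVC(C=1, kernel='rbf', tol=0.001)
model.fit(xtrain, ytrain)
svm_acc = accuracy_score(ytest, model.predict(xtest))

print(f"Decision tree accuracy: {tree_acc:.2f}")
print(f"SVM accuracy: {svm_acc:.2f}")
```

accuracy_score returns the fraction of test samples whose predicted label matches the true label.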
The confusion matrix is also a good way to see the model's performance. Here, 30% of the 150 samples (45 samples) were used for testing. In the confusion matrix, position (0,0) holds 16, meaning 16 flowers of category 0 were predicted correctly; position (1,1) holds 10, the flowers of category 1 predicted correctly; and position (2,2) holds 17, the flowers of category 2 predicted correctly. The entries at (0,0), (1,1) and (2,2) are correct predictions; all other entries are wrong predictions.
Conclusion:
The diagonal positions of the matrix are correct predictions and the rest are wrong. So, according to the confusion matrix, 2 of the 45 test samples were predicted wrongly and the remaining 43 are correct.
![](https://static.wixstatic.com/media/6e3b57_96a0a9f353d34aa1a552b50976f74bba~mv2.jpeg/v1/fill/w_980,h_706,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_96a0a9f353d34aa1a552b50976f74bba~mv2.jpeg)
FIG:5
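The link between the confusion matrix and the accuracy score can be worked out from the diagonal counts quoted in the text (16, 10 and 17 out of 45 test samples). The short sketch below only redoes that arithmetic; the off-diagonal positions are not given in the text, so they are not reconstructed here.

```python
# Diagonal counts quoted in the text: correct predictions per class.
correct_per_class = [16, 10, 17]
total_test_samples = 45  # 30% of the 150 samples

correct = sum(correct_per_class)
wrong = total_test_samples - correct
accuracy = correct / total_test_samples

print(correct, wrong)      # 43 2
print(round(accuracy, 3))  # 0.956
```

This is where the accuracy of roughly 96% comes from: 43 correct predictions out of 45.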
![](https://static.wixstatic.com/media/6e3b57_d6b6572353c34856ab2073b36f3e2c52~mv2.jpg/v1/fill/w_980,h_674,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_d6b6572353c34856ab2073b36f3e2c52~mv2.jpg)
FIG:6
In this block of code, we plotted the decision tree. This helps us understand how the Decision Tree Classifier works, that is, how its decisions are taken.
A classification tree is a sequence of questions, answers and classifications. In the tree above, X[2] is the petal length, and the first node tests the condition petal length <= 0.25. If the condition is true, the flower is classified as class 0 (setosa). Because setosa is a pure class at this point, the node is not divided further and forms a leaf node. If petal length > 0.25, there are still two possibilities (virginica, versicolor), so the tree forms a decision node that asks a further question. We keep doing this until we reach pure leaf nodes, which cannot be divided further. This is how a decision tree is formed. We can also control the depth of the decision tree.
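Controlling the depth can be sketched with the max_depth parameter of DecisionTreeClassifier, and the resulting questions can be printed as text with export_text. This example is an illustration rather than the article's code: it assumes the bundled Iris data, a depth limit of 2 and a fixed random_state, and its thresholds may differ from those in the figure.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled Iris data stands in for the article's DataFrame.
iris = load_iris()

# max_depth=2 limits the tree to at most two levels of questions.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(iris.data, iris.target)

# Print the learned rules as indented text, one line per node.
print(export_text(shallow_tree, feature_names=list(iris.feature_names)))
print(shallow_tree.get_depth())
```

A shallower tree is easier to read and less prone to overfitting, at the cost of possibly leaving some leaf nodes impure.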
Accuracy score for SVM
The SVM gives a lower accuracy score than the decision tree: its accuracy is 93%. On this train/test split, the decision tree therefore performs better than the SVM.
![](https://static.wixstatic.com/media/6e3b57_51d8e76e5af14067af668937d66721c9~mv2.jpeg/v1/fill/w_961,h_516,al_c,q_85,enc_auto/6e3b57_51d8e76e5af14067af668937d66721c9~mv2.jpeg)
FIG:7
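The steps listed at the start mention tuning C, gamma and kernel of the SVC with GridSearchCV. A minimal sketch of that tuning could look as follows. The candidate values in the grid and the use of 5-fold cross-validation are hypothetical choices for illustration, and the bundled Iris data is assumed in place of the article's DataFrame.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: the bundled Iris data stands in for the article's DataFrame.
X, y = load_iris(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=50)

# Hypothetical grid: these candidate values are illustrative, not from the article.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear'],
}

# Try every combination with 5-fold cross-validation on the training set.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(xtrain, ytrain)

print(search.best_params_)         # best combination found in the grid
print(search.best_score_)          # its mean cross-validated accuracy
print(search.score(xtest, ytest))  # accuracy of the refitted best model on the test set
```

Because the grid is searched with cross-validation on the training data only, the test set remains an independent check of the tuned model.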
Github:
https://github.com/yeshwanth69/Decision-treetsf/blob/master/Decision%20Tree.ipynb
Reference links:
https://www.kaggle.com/junyingzhang2018/classification-on-iris-data
https://medium.com/gft-engineering/start-to-learn-machine-learning-with-the-iris-flower-classification-challenge-4859a920e5e3
https://medium.com/@jebaseelanravi96/machine-learning-iris-classification-33aa18a4a983
https://www.ritchieng.com/machine-learning-iris-dataset/