Iris Class Prediction Project

Author: Oswald Codjoe

Background:

In this project, I undertake the task of predicting the class of sampled flowers based on their attributes. The flowers under consideration are irises. The dataset covers three varieties of iris (versicolor, setosa, and virginica) and four attributes: sepal length, sepal width, petal length, and petal width. I employ two classification algorithms for the task: (i) multinomial logistic regression and (ii) support vector classification (SVC). Information about both estimators/classifiers can be found in the documentation of scikit-learn (sklearn), a standard Python library.

PART ONE: MULTINOMIAL LOGISTIC REGRESSION

One: Importing relevant modules

In [1]:
import pandas as pd
import sklearn as skl

Two: Reading the dataset and storing it as a pandas object

In [2]:
d = pd.read_csv('iris.data')
# Showing the last three plants in the dataset
d.tail(3)
Out[2]:
5.1 3.5 1.4 0.2 Iris-setosa
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica
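
Aside: the raw iris.data file from the UCI repository has no header row, so read_csv silently promotes the first sample (a setosa) to column names. That is why the summaries below count 149 rows rather than 150, with only 49 setosa. A minimal sketch of a load that keeps all 150 rows, assuming the standard UCI column order:

cols = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
d_full = pd.read_csv('iris.data', header=None, names=cols)
d_full.shape  # (150, 5): all three classes have 50 samples each

The analysis below keeps the original 149-row load so the outputs shown remain reproducible.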

Three: Cleaning the dataset

In [3]:
# Renaming the columns of the dataset
d.columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
d.tail(3)
Out[3]:
Sepal Length Sepal Width Petal Length Petal Width Class
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica
In [4]:
# Generating an overall description/summary of the dataset 
d.describe()
Out[4]:
Sepal Length Sepal Width Petal Length Petal Width
count 149.000000 149.000000 149.000000 149.000000
mean 5.848322 3.051007 3.774497 1.205369
std 0.828594 0.433499 1.759651 0.761292
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.400000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [5]:
# Identifying the levels of the response variable, since describe() summarizes only
# the numeric columns above
d.groupby(['Class']).describe()
Out[5]:
Sepal Length Sepal Width ... Petal Length Petal Width
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Class
Iris-setosa 49.0 5.004082 0.355879 4.3 4.800 5.0 5.2 5.8 49.0 3.416327 ... 1.600 1.9 49.0 0.244898 0.108130 0.1 0.2 0.2 0.3 0.6
Iris-versicolor 50.0 5.936000 0.516171 4.9 5.600 5.9 6.3 7.0 50.0 2.770000 ... 4.600 5.1 50.0 1.326000 0.197753 1.0 1.2 1.3 1.5 1.8
Iris-virginica 50.0 6.588000 0.635880 4.9 6.225 6.5 6.9 7.9 50.0 2.974000 ... 5.875 6.9 50.0 2.026000 0.274650 1.4 1.8 2.0 2.3 2.5

3 rows × 32 columns

In [6]:
# Creating a new column in the dataset
d['New Class'] = d['Class']
d.tail(3)
Out[6]:
Sepal Length Sepal Width Petal Length Petal Width Class New Class
146 6.5 3.0 5.2 2.0 Iris-virginica Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica Iris-virginica
In [7]:
# Replacing the values in the new column with 0 if Iris-setosa, 1 if Iris-versicolor,
# and 2 if Iris-virginica. This puts the response variable in numeric form, like the
# explanatory variables. (scikit-learn can also fit directly on string labels, so the
# encoding is a convenience rather than a requirement.)
newd=d.replace({'New Class':{'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}})
newd.tail(3)
Out[7]:
Sepal Length Sepal Width Petal Length Petal Width Class New Class
146 6.5 3.0 5.2 2.0 Iris-virginica 2
147 6.2 3.4 5.4 2.3 Iris-virginica 2
148 5.9 3.0 5.1 1.8 Iris-virginica 2
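
An equivalent, slightly more idiomatic encoding uses scikit-learn's LabelEncoder, which derives integer codes from the sorted class names (and so happens to produce the same 0/1/2 mapping). A sketch:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(d['Class'])  # setosa -> 0, versicolor -> 1, virginica -> 2
le.classes_                           # the original string labels, in encoded order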

Four: Fitting the multinomial logistic regression model to the data

In [8]:
# Creating X, an explanatory variable matrix, and Y, a response variable vector
X=newd.drop(columns=['Class','New Class'])
Y=newd['New Class']
In [9]:
# Splitting the dataset into training data and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2) 
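
Since train_test_split shuffles at random, this 80/20 split, and hence every score reported below, varies from run to run. For a reproducible, class-balanced split, the call can be pinned down like this (the seed value 0 is arbitrary):

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)
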
In [10]:
# Creating a model object. This allows the model's parameters to be inspected and
# adjusted before fitting.
from sklearn.linear_model import LogisticRegression
mod1= LogisticRegression()
In [11]:
# Checking the parameters of the model, which I leave as is.
mod1.get_params()
Out[11]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
In [12]:
# Fitting the model to the training dataset
mod1.fit(X_train, Y_train)
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
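
The ConvergenceWarning above means lbfgs hit the default 100-iteration cap before converging. Either remedy the message suggests works; a sketch of both, raising max_iter and standardizing the features inside a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Option 1: allow more iterations
mod1_more_iter = LogisticRegression(max_iter=1000)
mod1_more_iter.fit(X_train, Y_train)

# Option 2: scale the features so lbfgs converges within the default cap
mod1_scaled = make_pipeline(StandardScaler(), LogisticRegression())
mod1_scaled.fit(X_train, Y_train)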

Five: Using the fitted model to predict classes of flowers in the test dataset.

In [13]:
Predicted_Class = mod1.predict(X_test)
Predicted_Class
Out[13]:
array([0, 2, 1, 2, 2, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 2, 0, 1,
       1, 1, 2, 1, 1, 2, 0, 2])

Six: Evaluating the model

In [14]:
# Checking how well the model has been fitted to the training dataset
mod1.score(X_train, Y_train) 
Out[14]:
0.9495798319327731
In [15]:
# Checking how well the model predicts classes in the test dataset
mod1.score(X_test, Y_test)
Out[15]:
1.0
In [16]:
# Checking model performance using a classification report
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
print(classification_report(Y_test,Predicted_Class))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        12

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

In [17]:
# Checking model performance using a confusion matrix
print(confusion_matrix(Y_test, Predicted_Class))
[[ 5  0  0]
 [ 0 13  0]
 [ 0  0 12]]
In [18]:
# Checking model performance using accuracy score
print(accuracy_score(Y_test,Predicted_Class))
1.0
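
A perfect score on a single 30-sample test split is encouraging but can be optimistic. Cross-validation averages performance over several train/test splits and gives a steadier estimate; a minimal sketch on the full data:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, Y, cv=5)
print(cv_scores.mean(), cv_scores.std())
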
In [19]:
# Saving the model for future use
# import pickle 
# pickle.dump(mod1, open('Logistic_Regression.pkl','wb'))
In [20]:
# Loading the model for future use.
# loaded_model = pickle.load(open('Logistic_Regression.pkl','rb'))
# loaded_model.score(X_test,Y_test)

PART TWO: SUPPORT VECTOR CLASSIFICATION (SVC)

One: Fitting the SVC model to the data

In [27]:
# Duplicating the dataset so that I don't overwrite the old one. Plain assignment
# (newd1 = newd) would only bind a second name to the same object, so .copy() is
# needed for a true duplicate.
newd1 = newd.copy()
In [28]:
# Creating X, an explanatory variable matrix, and Y, a response variable vector.
# Y holds the string class labels this time, so predictions below will be class
# names rather than integer codes.
X=newd1.drop(columns=['Class','New Class'])
Y=newd1['Class']
In [29]:
# Splitting the dataset into training data and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2) 
In [30]:
# Creating a model object. 
from sklearn import svm 
mod2 = svm.SVC()
In [31]:
# Checking the parameters of the model, which I leave as is.
mod2.get_params()
Out[31]:
{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
In [32]:
# Fitting the model to the training dataset
mod2.fit(X_train,Y_train)
Out[32]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
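
Note that svm.SVC() defaults to the RBF kernel, as the get_params() output above shows. If a genuinely linear decision boundary is wanted instead, the kernel can be set explicitly; a sketch:

# Linear-kernel variant of the same classifier
mod2_linear = svm.SVC(kernel='linear')
mod2_linear.fit(X_train, Y_train)
mod2_linear.score(X_test, Y_test)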

Two: Using the fitted model to predict the class of flowers in the test dataset

In [39]:
Predicted_Class = mod2.predict(X_test)
Predicted_Class
Out[39]:
array(['Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa'], dtype=object)

Three: Evaluating the model

In [36]:
# Checking how well the model learned patterns in the training dataset
mod2.score(X_train, Y_train)
Out[36]:
0.957983193277311
In [37]:
# Checking how well the model generalizes to the test data
mod2.score(X_test, Y_test)
Out[37]:
0.9333333333333333
In [40]:
# Checking model performance using a classification report
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
print(classification_report(Y_test,Predicted_Class))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.83      1.00      0.91        10
 Iris-virginica       1.00      0.85      0.92        13

       accuracy                           0.93        30
      macro avg       0.94      0.95      0.94        30
   weighted avg       0.94      0.93      0.93        30

In [41]:
# Checking model performance using a confusion matrix
print(confusion_matrix(Y_test, Predicted_Class))
[[ 7  0  0]
 [ 0 10  0]
 [ 0  2 11]]
In [42]:
# Checking model performance using accuracy score
print(accuracy_score(Y_test,Predicted_Class))
0.9333333333333333
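
Both errors in the confusion matrix are virginica flowers labelled versicolor, the two classes whose measurements overlap. One standard way to probe whether a different decision boundary helps is a grid search over C and gamma; a minimal sketch (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
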
In [43]:
# Saving the model for future use
# import pickle 
# pickle.dump(mod2, open('linearsvc.pkl','wb'))
In [ ]:
# Loading the model for future use.
# loaded_model = pickle.load(open('linearsvc.pkl','rb'))
# loaded_model.score(X_test,Y_test)