Iris Class Prediction Project

Author: Oswald Codjoe

Background:

In this project, I undertake the task of predicting the class of sampled flowers based on their attributes. The flowers under consideration are irises. The dataset covers three varieties of iris (versicolor, setosa, and virginica) and four attributes: sepal length, sepal width, petal length, and petal width. I employ two classification algorithms for the task: (i) multinomial logistic regression and (ii) support vector classification (SVC). Information about both estimators/classifiers can be found in the documentation of scikit-learn (sklearn), a standard Python library.

PART ONE: MULTINOMIAL LOGISTIC REGRESSION

One: Importing relevant modules

In [1]:
import pandas as pd
import sklearn as skl

Two: Reading the dataset and storing it as a pandas object

In [2]:
d = pd.read_csv('iris.data')
# Showing the last three plants in the dataset
d.tail(3)
Out[2]:
5.1 3.5 1.4 0.2 Iris-setosa
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica
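
Aside: the raw iris.data file from the UCI repository has no header row, so read_csv silently promotes the first sample (a setosa) to column names. That is why the summaries below count 149 rows rather than 150, with only 49 setosa. A minimal sketch of a load that keeps all 150 rows, assuming the standard UCI column order:

cols = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
d_full = pd.read_csv('iris.data', header=None, names=cols)
d_full.shape  # (150, 5): all three classes have 50 samples each

The analysis below keeps the original 149-row load so the outputs shown remain reproducible.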

Three: Cleaning the dataset

In [3]:
# Renaming the columns of the dataset
d.columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
d.tail(3)
Out[3]:
Sepal Length Sepal Width Petal Length Petal Width Class
146 6.5 3.0 5.2 2.0 Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica
In [4]:
# Generating an overall description/summary of the dataset 
d.describe()
Out[4]:
Sepal Length Sepal Width Petal Length Petal Width
count 149.000000 149.000000 149.000000 149.000000
mean 5.848322 3.051007 3.774497 1.205369
std 0.828594 0.433499 1.759651 0.761292
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.400000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [5]:
# Identifying the levels of the response variable, since describe() summarizes only
# the numeric columns above
d.groupby(['Class']).describe()
Out[5]:
Sepal Length Sepal Width ... Petal Length Petal Width
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Class
Iris-setosa 49.0 5.004082 0.355879 4.3 4.800 5.0 5.2 5.8 49.0 3.416327 ... 1.600 1.9 49.0 0.244898 0.108130 0.1 0.2 0.2 0.3 0.6
Iris-versicolor 50.0 5.936000 0.516171 4.9 5.600 5.9 6.3 7.0 50.0 2.770000 ... 4.600 5.1 50.0 1.326000 0.197753 1.0 1.2 1.3 1.5 1.8
Iris-virginica 50.0 6.588000 0.635880 4.9 6.225 6.5 6.9 7.9 50.0 2.974000 ... 5.875 6.9 50.0 2.026000 0.274650 1.4 1.8 2.0 2.3 2.5

3 rows × 32 columns

In [6]:
# Creating a new column in the dataset
d['New Class'] = d['Class']
d.tail(3)
Out[6]:
Sepal Length Sepal Width Petal Length Petal Width Class New Class
146 6.5 3.0 5.2 2.0 Iris-virginica Iris-virginica
147 6.2 3.4 5.4 2.3 Iris-virginica Iris-virginica
148 5.9 3.0 5.1 1.8 Iris-virginica Iris-virginica
In [7]:
# Replacing the values in the new column with 0 if Iris-setosa, 1 if Iris-versicolor,
# and 2 if Iris-virginica. This puts the response variable in numeric form, like the
# explanatory variables. (scikit-learn can also fit directly on string labels, so the
# encoding is a convenience rather than a requirement.)
newd=d.replace({'New Class':{'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}})
newd.tail(3)
Out[7]:
Sepal Length Sepal Width Petal Length Petal Width Class New Class
146 6.5 3.0 5.2 2.0 Iris-virginica 2
147 6.2 3.4 5.4 2.3 Iris-virginica 2
148 5.9 3.0 5.1 1.8 Iris-virginica 2
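
An equivalent, slightly more idiomatic encoding uses scikit-learn's LabelEncoder, which derives integer codes from the sorted class names (and so happens to produce the same 0/1/2 mapping). A sketch:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(d['Class'])  # setosa -> 0, versicolor -> 1, virginica -> 2
le.classes_                           # the original string labels, in encoded order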

Four: Fitting the multinomial logistic regression model to the data

In [8]:
# Creating X, an explanatory variable matrix, and Y, a response variable vector
X=newd.drop(columns=['Class','New Class'])
Y=newd['New Class']
In [9]:
# Splitting the dataset into training data and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2) 
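
Since train_test_split shuffles at random, this 80/20 split, and hence every score reported below, varies from run to run. For a reproducible, class-balanced split, the call can be pinned down like this (the seed value 0 is arbitrary):

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)
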
In [10]:
# Creating a model object. This allows the model's parameters to be inspected and
# adjusted before fitting.
from sklearn.linear_model import LogisticRegression
mod1= LogisticRegression()
In [11]:
# Checking the parameters of the model, which I leave as is.
mod1.get_params()
Out[11]:
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
In [12]:
# Fitting the model to the training dataset
mod1.fit(X_train, Y_train)
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
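
The ConvergenceWarning above means lbfgs hit the default 100-iteration cap before converging. Either remedy the message suggests works; a sketch of both, raising max_iter and standardizing the features inside a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Option 1: allow more iterations
mod1_more_iter = LogisticRegression(max_iter=1000)
mod1_more_iter.fit(X_train, Y_train)

# Option 2: scale the features so lbfgs converges within the default cap
mod1_scaled = make_pipeline(StandardScaler(), LogisticRegression())
mod1_scaled.fit(X_train, Y_train)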

Five: Using the fitted model to predict classes of flowers in the test dataset.

In [13]:
Predicted_Class = mod1.predict(X_test)
Predicted_Class
Out[13]:
array([0, 2, 1, 2, 2, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 2, 0, 1,
       1, 1, 2, 1, 1, 2, 0, 2])

Six: Evaluating the model

In [14]:
# Checking how well the model has been fitted to the training dataset
mod1.score(X_train, Y_train) 
Out[14]:
0.9495798319327731
In [15]:
# Checking how well the model predicts classes in the test dataset
mod1.score(X_test, Y_test)
Out[15]:
1.0
In [16]:
# Checking model performance using a classification report
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
print(classification_report(Y_test,Predicted_Class))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        12

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

In [17]:
# Checking model performance using a confusion matrix
print(confusion_matrix(Y_test, Predicted_Class))
[[ 5  0  0]
 [ 0 13  0]
 [ 0  0 12]]
In [18]:
# Checking model performance using accuracy score
print(accuracy_score(Y_test,Predicted_Class))
1.0
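
A perfect score on a single 30-sample test split is encouraging but can be optimistic. Cross-validation averages performance over several train/test splits and gives a steadier estimate; a minimal sketch on the full data:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, Y, cv=5)
print(cv_scores.mean(), cv_scores.std())
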
In [19]:
# Saving the model for future use
# import pickle 
# pickle.dump(mod1, open('Logistic_Regression.pkl','wb'))
In [20]:
# Loading the model for future use.
# loaded_model = pickle.load(open('Logistic_Regression.pkl','rb'))
# loaded_model.score(X_test,Y_test)

PART TWO: SUPPORT VECTOR CLASSIFICATION (SVC)

One: Fitting the SVC model to the data

In [27]:
# Duplicating the dataset so that I don't overwrite the old one. Plain assignment
# (newd1 = newd) would only bind a second name to the same object, so .copy() is
# needed for a true duplicate.
newd1 = newd.copy()
In [28]:
# Creating X, an explanatory variable matrix, and Y, a response variable vector.
# Y holds the string class labels this time, so predictions below will be class
# names rather than integer codes.
X=newd1.drop(columns=['Class','New Class'])
Y=newd1['Class']
In [29]:
# Splitting the dataset into training data and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2) 
In [30]:
# Creating a model object. 
from sklearn import svm 
mod2 = svm.SVC()
In [31]:
# Checking the parameters of the model, which I leave as is.
mod2.get_params()
Out[31]:
{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
In [32]:
# Fitting the model to the training dataset
mod2.fit(X_train,Y_train)
Out[32]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
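
Note that svm.SVC() defaults to the RBF kernel, as the get_params() output above shows. If a genuinely linear decision boundary is wanted instead, the kernel can be set explicitly; a sketch:

# Linear-kernel variant of the same classifier
mod2_linear = svm.SVC(kernel='linear')
mod2_linear.fit(X_train, Y_train)
mod2_linear.score(X_test, Y_test)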

Two: Using the fitted model to predict the class of flowers in the test dataset

In [39]:
Predicted_Class = mod2.predict(X_test)
Predicted_Class
Out[39]:
array(['Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa'], dtype=object)

Three: Evaluating the model

In [36]:
# Checking how well the model learned patterns in the training dataset
mod2.score(X_train, Y_train)
Out[36]:
0.957983193277311
In [37]:
# Checking how well the model generalizes to the test data
mod2.score(X_test, Y_test)
Out[37]:
0.9333333333333333
In [40]:
# Checking model performance using a classification report
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score
print(classification_report(Y_test,Predicted_Class))
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.83      1.00      0.91        10
 Iris-virginica       1.00      0.85      0.92        13

       accuracy                           0.93        30
      macro avg       0.94      0.95      0.94        30
   weighted avg       0.94      0.93      0.93        30

In [41]:
# Checking model performance using a confusion matrix
print(confusion_matrix(Y_test, Predicted_Class))
[[ 7  0  0]
 [ 0 10  0]
 [ 0  2 11]]
In [42]:
# Checking model performance using accuracy score
print(accuracy_score(Y_test,Predicted_Class))
0.9333333333333333
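
Both errors in the confusion matrix are virginica flowers labelled versicolor, the two classes whose measurements overlap. One standard way to probe whether a different decision boundary helps is a grid search over C and gamma; a minimal sketch (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]}
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
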
In [43]:
# Saving the model for future use
# import pickle 
# pickle.dump(mod2, open('linearsvc.pkl','wb'))
In [ ]:
# Loading the model for future use.
# loaded_model = pickle.load(open('linearsvc.pkl','rb'))
# loaded_model.score(X_test,Y_test)