Importing all the necessary libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as plotly
import plotly.express as px
plotly.offline.init_notebook_mode()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Framing the problem¶

The goal is to classify the species of an Iris flower given the dimensions of parts of the flower. The classification is binary: there are two classes, virginica and non-virginica.

We will be using logistic regression for this classification. We will also explore how different models will perform depending on the number of features (1, 2, 3 or 4) that we train them with.

Getting the data¶

In [ ]:
from sklearn import datasets
iris = datasets.load_iris(as_frame=True)
X, y = datasets.load_iris(return_X_y=True, as_frame=True)

Exploratory Data Analysis¶

Redefining the data¶

First, let's see what the data looks like.

In [ ]:
df = iris.frame
df
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

In [ ]:
iris.target_names
Out[ ]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

We will redefine the classes in our dataset. Originally, the target consists of 3 classes, but we only want two: virginica and non-virginica.

In [ ]:
df['target'] = df['target'].apply(lambda x: 'virginica' if iris.target_names[x] == 'virginica' else 'non-virginica')
df
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 non-virginica
1 4.9 3.0 1.4 0.2 non-virginica
2 4.7 3.2 1.3 0.2 non-virginica
3 4.6 3.1 1.5 0.2 non-virginica
4 5.0 3.6 1.4 0.2 non-virginica
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Exploring and Describing the data¶

Now that we have redefined our classes, let us explore our data, make some inferences and gain some insights.

In [ ]:
df.describe()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

In the above, we can see some descriptive statistics for all the species in our dataset, but we can further examine them separately based on the two classes/groups.

In [ ]:
df_virginica = df[df['target'] == 'virginica']
df_non_virginica = df[df['target'] == 'non-virginica']
In [ ]:
print('VIRGINICA')
display(df_virginica.describe())
VIRGINICA
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 50.00000 50.000000 50.000000 50.00000
mean 6.58800 2.974000 5.552000 2.02600
std 0.63588 0.322497 0.551895 0.27465
min 4.90000 2.200000 4.500000 1.40000
25% 6.22500 2.800000 5.100000 1.80000
50% 6.50000 3.000000 5.550000 2.00000
75% 6.90000 3.175000 5.875000 2.30000
max 7.90000 3.800000 6.900000 2.50000
In [ ]:
print('NON-VIRGINICA')
display(df_non_virginica.describe())
NON-VIRGINICA
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 100.000000 100.000000 100.000000 100.000000
mean 5.471000 3.099000 2.861000 0.786000
std 0.641698 0.478739 1.449549 0.565153
min 4.300000 2.000000 1.000000 0.100000
25% 5.000000 2.800000 1.500000 0.200000
50% 5.400000 3.050000 2.450000 0.800000
75% 5.900000 3.400000 4.325000 1.300000
max 7.000000 4.400000 5.100000 1.800000
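
As an aside, the same per-class summaries can be produced in a single call with a pandas groupby; a minimal sketch, equivalent to the two describe() calls above:

In [ ]:
# One summary table per class; .T transposes so the statistics read side by side.
display(df.groupby('target').describe().T)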

Based on the descriptive statistics above, there are a few things we can observe.

For sepal length, the mean of the virginica class is higher, suggesting that flowers in this class generally have a longer sepal length; we can also see this in the min and max values. This already hints that this feature will be useful for predicting our target, given the clear distinction between the two classes. The very similar standard deviations also show a similar spread within each class.

For sepal width, the non-virginica class has slightly larger values, but the two classes are very close and the distinction is not as clear as for sepal length. The stats here suggest that flowers in the virginica class generally have a slightly narrower sepal width.

Petal length shows the largest difference and the clearest distinction between the two classes: flowers in the virginica class have much longer petals. I expect this feature to be very important for determining the class, either on its own or in combination with other features like sepal length; more on this later.

Petal width follows the same direction as sepal length and petal length: virginica iris flowers have wider petals than non-virginica iris flowers.

Visualizing the Data¶

Let us first visualize our features individually.¶

In [ ]:
palette = sns.color_palette("husl", 2)
In [ ]:
for feature in iris.feature_names:
    sns.histplot(data=df, x=feature, hue='target', multiple="dodge", shrink=.8, palette=palette)
    plt.show()
[Figure: histograms of each of the four features, grouped by class]

Just as we postulated above, we can once again see from the histograms that the virginica class is very "different" from the other class in petal length and petal width.

In sepal length, there is a bit of overlap between the two classes, but you can still see that non-virginica flowers have shorter sepals and virginica flowers have longer ones.

It is also nice to see that sepal width is "almost" normally distributed and that the non-virginica class has a wider spread.

Let us look at another kind of histogram plot. https://seaborn.pydata.org/generated/seaborn.histplot.html

In [ ]:
for feature in iris.feature_names:
    sns.histplot(
        df, x=feature, y="target", hue="target", legend=False, palette=palette
    )
    plt.show()
[Figure: per-class histograms of each of the four features plotted against the target]

Again, similar observations.

Let us look at a box plot next to see the distributions.

In [ ]:
for feature in iris.feature_names:
    sns.boxplot(x='target', y=feature, data=df, palette=palette)
    plt.show()
    
[Figure: box plots of each of the four features by class]

The box plots confirm the same patterns we have been noticing, but in addition we can see a few outliers that we may or may not have to remove in order to maximize the accuracy of our models.

What we have below is called a violin plot. https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris

A violin plot is a statistical graphic for comparing probability distributions. It is similar to a box plot, with the addition of a rotated kernel density plot on each side.

https://en.wikipedia.org/wiki/Violin_plot


In [ ]:
for feature in iris.feature_names:
    sns.violinplot(x='target', y=feature, data=df, palette=palette)
    plt.show()
[Figure: violin plots of each of the four features by class]

Now let us visualize the features and targets against each other¶

Let's get more into this, especially the scatter plot.

In [ ]:
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='target', palette=palette)
plt.show()
[Figure: scatter plot of sepal length vs sepal width, colored by class]
In [ ]:
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='target', palette=palette)
plt.show()
[Figure: scatter plot of petal length vs petal width, colored by class]
In [ ]:
sns.scatterplot(data=df, x='sepal length (cm)', y='petal length (cm)', hue='target', palette=palette)
plt.show()
[Figure: scatter plot of sepal length vs petal length, colored by class]
In [ ]:
sns.scatterplot(data=df, x='sepal width (cm)', y='petal width (cm)', hue='target', palette=palette)
plt.show()
[Figure: scatter plot of sepal width vs petal width, colored by class]

Lovely looking graphs. If you look at the petal length vs petal width plot in particular, you can see what we have been saying all along: these features show an obvious distinction between the two classes.

Correlation Matrix¶

In [ ]:
df_numeric = df.copy()
df_numeric['target'] = df_numeric['target'].map({'virginica': 1, 'non-virginica': 0})
corr = df_numeric.corr()
corr
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
sepal length (cm) 1.000000 -0.117570 0.871754 0.817941 0.638020
sepal width (cm) -0.117570 1.000000 -0.428440 -0.366126 -0.135645
petal length (cm) 0.871754 -0.428440 1.000000 0.962865 0.721011
petal width (cm) 0.817941 -0.366126 0.962865 1.000000 0.769445
target 0.638020 -0.135645 0.721011 0.769445 1.000000
In [ ]:
sns.heatmap(corr, annot=True, cmap='coolwarm')
Out[ ]:
<Axes: >
[Figure: heatmap of the correlation matrix]

Above, we have statistical measures further validating our claims about how the features correlate with each other and with the target.

Here we can see that petal width has the strongest correlation with our target, with petal length not far behind. We can also see that petal width and petal length are strongly correlated with each other.
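
A quick way to rank the features by their correlation with the target, using the corr frame computed above:

In [ ]:
# Features sorted by their (signed) correlation with the binary target.
corr['target'].drop('target').sort_values(ascending=False)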

Data Preprocessing¶

To clean up our data, we check whether any rows have missing or null values, and whether there are any outliers that could skew our results.

In [ ]:
df.isnull().sum()
Out[ ]:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

All the rows are intact, as we already knew.

When we visualized the data we did notice some outliers, but for the scope of this notebook we won't remove them: they are few, and we don't have that much data to spare anyway. For reference, a quick way to count them is sketched below.
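
A minimal sketch using the usual 1.5×IQR rule (the same rule the box plots use), just for inspection:

In [ ]:
# Count values outside the 1.5*IQR whiskers per feature; we only inspect, nothing is dropped.
for feature in iris.feature_names:
    q1, q3 = df[feature].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[feature] < q1 - 1.5 * iqr) | (df[feature] > q3 + 1.5 * iqr)
    print(f"{feature}: {mask.sum()} outlier(s)")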

Splitting the Data¶

Splitting the data into training, validation and test sets, using 80%, 10% and 10% respectively (the 20% held-out portion is split in half between validation and test).

In [ ]:
X = df.drop('target', axis=1)
y = df['target']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
In [ ]:
print(X_train.shape, X_valid.shape, X_test.shape)
(120, 4) (15, 4) (15, 4)
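
As a side note, the split above is not stratified; with a dataset this small, passing stratify to train_test_split keeps the virginica / non-virginica proportions equal in every subset. A minimal sketch of the same 80/10/10 split, stratified (not used in the rest of the notebook):

In [ ]:
# Stratified version of the same split (sketch only, not used below).
X_train_s, X_temp_s, y_train_s, y_temp_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_valid_s, X_test_s, y_valid_s, y_test_s = train_test_split(
    X_temp_s, y_temp_s, test_size=0.5, random_state=42, stratify=y_temp_s)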

Logistic Regression¶

Because we are going to train 4 different models with different numbers of features, we will base the choice of features on the correlation between the features and the target, and among the features themselves.

In [ ]:
sns.heatmap(corr, annot=True, cmap='coolwarm')
Out[ ]:
<Axes: >
[Figure: heatmap of the correlation matrix (repeated for reference)]

For the one-feature model, we will use petal width, as this has the highest positive correlation with our target.

For the two-feature model, we will use petal width and petal length: they have the two highest correlations with our target (and are also strongly correlated with each other).

For the three-feature model, following the same reasoning, we will use petal width, petal length and sepal length.

In [ ]:
feature_sets = {
    "One feature": ['petal width (cm)'],
    "Two features": ['petal width (cm)', 'petal length (cm)'],
    "Three features": ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)'],
    "Four features": ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']
}
In [ ]:
models_dict = {}
results_dict = {}

for feature_set_name, feature_set in feature_sets.items():
    # Train a logistic regression model on this subset of features.
    model = LogisticRegression()
    model.fit(X_train[feature_set], y_train)
    models_dict[feature_set_name] = model

    # Predicted probability of the positive class (virginica) and the predicted label.
    y_valid_pred_proba = model.predict_proba(X_valid[feature_set])[:, 1]
    y_valid_pred = model.predict(X_valid[feature_set])

    # Collect the validation results for this model.
    results = pd.DataFrame({
        'Instance number': X_valid.index,
        'Probability of predicting virginica': y_valid_pred_proba,
        'Actual prediction by the model': y_valid_pred,
        'Ground truth': y_valid
    })
    results_dict[feature_set_name] = results

Model Evaluation¶

Scoring the Models¶

Let us see how each model performed.

In [ ]:
for feature_set_name, results in results_dict.items():
    print(f"Results for model with {feature_set_name}:")
    display(results)
Results for model with One feature:
Instance number Probability of predicting virginica Actual prediction by the model Ground truth
26 26 0.007232 non-virginica non-virginica
18 18 0.004901 non-virginica non-virginica
118 118 0.925283 virginica virginica
145 145 0.925283 virginica virginica
78 78 0.350797 non-virginica non-virginica
127 127 0.636206 virginica virginica
108 108 0.636206 virginica virginica
55 55 0.198051 non-virginica non-virginica
30 30 0.003318 non-virginica non-virginica
29 29 0.003318 non-virginica non-virginica
141 141 0.925283 virginica virginica
110 110 0.792805 virginica virginica
19 19 0.004901 non-virginica non-virginica
132 132 0.893300 virginica virginica
64 64 0.198051 non-virginica non-virginica
Results for model with Two features:
Instance number Probability of predicting virginica Actual prediction by the model Ground truth
26 26 0.000014 non-virginica non-virginica
18 18 0.000015 non-virginica non-virginica
118 118 0.998655 virginica virginica
145 145 0.897028 virginica virginica
78 78 0.210752 non-virginica non-virginica
127 127 0.585651 virginica virginica
108 108 0.936985 virginica virginica
55 55 0.150068 non-virginica non-virginica
30 30 0.000009 non-virginica non-virginica
29 29 0.000009 non-virginica non-virginica
141 141 0.870248 virginica virginica
110 110 0.782899 virginica virginica
19 19 0.000009 non-virginica non-virginica
132 132 0.952741 virginica virginica
64 64 0.016507 non-virginica non-virginica
Results for model with Three features:
Instance number Probability of predicting virginica Actual prediction by the model Ground truth
26 26 0.000015 non-virginica non-virginica
18 18 0.000011 non-virginica non-virginica
118 118 0.997968 virginica virginica
145 145 0.874726 virginica virginica
78 78 0.214948 non-virginica non-virginica
127 127 0.593969 virginica virginica
108 108 0.930751 virginica virginica
55 55 0.174016 non-virginica non-virginica
30 30 0.000011 non-virginica non-virginica
29 29 0.000011 non-virginica non-virginica
141 141 0.827581 virginica virginica
110 110 0.759765 virginica virginica
19 19 0.000009 non-virginica non-virginica
132 132 0.952516 virginica virginica
64 64 0.017705 non-virginica non-virginica
Results for model with Four features:
Instance number Probability of predicting virginica Actual prediction by the model Ground truth
26 26 0.000009 non-virginica non-virginica
18 18 0.000006 non-virginica non-virginica
118 118 0.998534 virginica virginica
145 145 0.873922 virginica virginica
78 78 0.207005 non-virginica non-virginica
127 127 0.572730 virginica virginica
108 108 0.946564 virginica virginica
55 55 0.170670 non-virginica non-virginica
30 30 0.000008 non-virginica non-virginica
29 29 0.000008 non-virginica non-virginica
141 141 0.820222 virginica virginica
110 110 0.728198 virginica virginica
19 19 0.000004 non-virginica non-virginica
132 132 0.956114 virginica virginica
64 64 0.016196 non-virginica non-virginica

To summarize the data in each table using just one value, we will be using the accuracy score provided by sklearn.

In [ ]:
from sklearn.metrics import accuracy_score
accuracies = {}

for feature_set_name, results in results_dict.items():
    accuracy = accuracy_score(y_valid, results['Actual prediction by the model'])
    accuracies[feature_set_name] = accuracy
    print(f"Accuracy for model with {feature_set_name}: {accuracy:.2f}")
    
Accuracy for model with One feature: 1.00
Accuracy for model with Two features: 1.00
Accuracy for model with Three features: 1.00
Accuracy for model with Four features: 1.00

What the above means is that, with our choice of features for each model, all the models achieve a 100% accuracy score on the validation data.

Decision Boundary Graph¶

In [ ]:
display(X_valid)
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
26 5.0 3.4 1.6 0.4
18 5.7 3.8 1.7 0.3
118 7.7 2.6 6.9 2.3
145 6.7 3.0 5.2 2.3
78 6.0 2.9 4.5 1.5
127 6.1 3.0 4.9 1.8
108 6.7 2.5 5.8 1.8
55 5.7 2.8 4.5 1.3
30 4.8 3.1 1.6 0.2
29 4.7 3.2 1.6 0.2
141 6.9 3.1 5.1 2.3
110 6.5 3.2 5.1 2.0
19 5.1 3.8 1.5 0.3
132 6.4 2.8 5.6 2.2
64 5.6 2.9 3.6 1.3
In [ ]:
feature_set_name = "One feature"
model = models_dict[feature_set_name]
feature_set = feature_sets[feature_set_name]

# Predictions of *this* model on the validation set (used to color the points).
y_valid_pred = model.predict(X_valid[feature_set])

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_valid[feature_set[0]], y=y_valid, hue=y_valid_pred)

# For one feature, the decision boundary sits where w*x + b = 0, i.e. x = -b/w.
boundary = (-model.intercept_ / model.coef_).item()
plt.axvline(x=boundary, color='red')

# Overlay the fitted sigmoid: P(virginica) as a function of petal width.
x_values = pd.DataFrame(np.linspace(X_valid[feature_set[0]].min(), X_valid[feature_set[0]].max(), 100), columns=[feature_set[0]])
y_values = model.predict_proba(x_values)[:, 1]
plt.plot(x_values, y_values, color='blue', alpha=0.2)

plt.title(f"Decision Boundary for {feature_set_name}")
plt.show()
[Figure: petal width vs class on the validation set, with the fitted sigmoid and the decision boundary]
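
For reference, all the boundary lines and planes drawn in this section come from the same equation: logistic regression predicts virginica when the estimated probability passes 0.5, i.e. where the linear score is zero,

$$\mathbf{w}^\top \mathbf{x} + b = 0 \;\Rightarrow\; x_1 = -\frac{b}{w_1} \;\text{(1 feature)}, \quad x_2 = -\frac{b + w_1 x_1}{w_2} \;\text{(2 features)}, \quad x_3 = -\frac{b + w_1 x_1 + w_2 x_2}{w_3} \;\text{(3 features)},$$

which is exactly what the -model.intercept_ / model.coef_ expressions in these cells compute.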
In [ ]:
feature_set_name = "Two features"
model = models_dict[feature_set_name]
feature_set = feature_sets[feature_set_name]

fig = px.scatter(X_valid, x=feature_set[0], y=feature_set[1],  color=y_valid_pred, color_discrete_map={0: 'blue', 1: 'red'})


MIN_FEATURE_VALUE = X_valid.min().min()
MAX_FEATURE_VALUE = X_valid.max().max()

decision_boundary_x1 = np.linspace(MIN_FEATURE_VALUE, MAX_FEATURE_VALUE, 10)
decision_boundary_x2 = -model.intercept_ / model.coef_[0][1] - model.coef_[0][0] / model.coef_[0][1] * decision_boundary_x1

fig.add_scatter(x=decision_boundary_x1, y=decision_boundary_x2, mode='lines', name='Decision Boundary', line=dict(color='red', width=1))

fig.update_layout(title_text=f'Decision Boundary for 2 features', width=800, height=400)
fig.show()
In [ ]:
import plotly.graph_objects as go

feature_set_name = "Three features"
model = models_dict[feature_set_name]
feature_set = feature_sets[feature_set_name]

fig = go.Figure()

X_valid_0 = X_valid[y_valid_pred == 'non-virginica']
X_valid_1 = X_valid[y_valid_pred == 'virginica']

fig.add_trace(go.Scatter3d(
    x=X_valid_0[feature_set[0]], 
    y=X_valid_0[feature_set[1]], 
    z=X_valid_0[feature_set[2]], 
    mode='markers',
    marker=dict(
        size=5,
        color='green',
        opacity=0.8
    ),
    name='non-virginica'
))

fig.add_trace(go.Scatter3d(
    x=X_valid_1[feature_set[0]], 
    y=X_valid_1[feature_set[1]], 
    z=X_valid_1[feature_set[2]], 
    mode='markers',
    marker=dict(
        size=5,
        color='purple',
        opacity=0.8
    ),
    name='virginica'
))

x1, x2 = np.meshgrid(np.linspace(MIN_FEATURE_VALUE, MAX_FEATURE_VALUE, 10), np.linspace(MIN_FEATURE_VALUE, MAX_FEATURE_VALUE, 10))
decision_boundary_x3 = -model.intercept_ / model.coef_[0][2] - model.coef_[0][0] / model.coef_[0][2] * x1 - model.coef_[0][1] / model.coef_[0][2] * x2

fig.add_trace(go.Surface(x=x1, y=x2, z=decision_boundary_x3, colorscale='Reds', opacity=0.6))

fig.update_layout(
    title_text=f'Decision Boundary for 3 features', 
    width=1200, 
    height=800,
    scene=dict(
        xaxis_title='petal width (cm)',
        yaxis_title='petal length (cm)',
        zaxis_title='sepal length (cm)'
    )
)

camera = dict(
    eye=dict(x=2, y=2, z=1.1)
)

fig.update_layout(scene_camera=camera)

fig.show()

Failure Modes¶

As we saw from the accuracy scores on the validation set, all the predictions were correct, so there is no way to "easily" identify instances where our models fail.

Let us go further and use the other metrics provided by sklearn.

Precision Score

In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

precisions = {}

for feature_set_name, results in results_dict.items():
    precision = precision_score(y_valid, results['Actual prediction by the model'], pos_label='virginica')
    precisions[feature_set_name] = precision
    print(f"Precision for model with {feature_set_name}: {precision:.2f}")
Precision for model with One feature: 1.00
Precision for model with Two features: 1.00
Precision for model with Three features: 1.00
Precision for model with Four features: 1.00
In [ ]:
recalls = {}

for feature_set_name, results in results_dict.items():
    recall = recall_score(y_valid, results['Actual prediction by the model'], pos_label='virginica')
    recalls[feature_set_name] = recall
    print(f"Recall score for model with {feature_set_name}: {recall:.2f}")
Recall score for model with One feature: 1.00
Recall score for model with Two features: 1.00
Recall score for model with Three features: 1.00
Recall score for model with Four features: 1.00
In [ ]:
f1s = {}

for feature_set_name, results in results_dict.items():
    f1 = f1_score(y_valid, results['Actual prediction by the model'], pos_label='virginica')
    f1s[feature_set_name] = f1
    print(f"F1 Score for model with {feature_set_name}: {f1:.2f}")
F1 Score for model with One feature: 1.00
F1 Score for model with Two features: 1.00
F1 Score for model with Three features: 1.00
F1 Score for model with Four features: 1.00
In [ ]:
confusion_matrices = {}

for feature_set_name, results in results_dict.items():
    confusion = confusion_matrix(y_valid, results['Actual prediction by the model'])
    confusion_matrices[feature_set_name] = confusion
    print(f"Confusion matrix for model with {feature_set_name}:\n {confusion}")
Confusion matrix for model with One feature:
 [[8 0]
 [0 7]]
Confusion matrix for model with Two features:
 [[8 0]
 [0 7]]
Confusion matrix for model with Three features:
 [[8 0]
 [0 7]]
Confusion matrix for model with Four features:
 [[8 0]
 [0 7]]

As we can see, still all perfect scores.

Now, I am more curious as to why this is happening. I have a few ideas:

    1. Our dataset: we have quite a small number of rows, 150. This could be one reason why we are getting such high accuracy scores.
    2. The proportion of the data used to train the models, 80%, is quite high. I played around with this figure and noticed that the lower the percentage of training data, the less accurate the models become.
    3. We are only predicting two classes, which "kind of" makes the models' job easy.
    4. There is a high positive correlation between our features and our target. This could also be why we are getting such high scores.
    5. We didn't do cross-validation (a quick sketch of how we might is shown after this list).
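
On that last point, here is a minimal cross-validation sketch (illustrative only; the rest of the notebook keeps the single train/validation/test split), assuming the feature_sets dict defined above:

In [ ]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# 5-fold stratified cross-validation accuracy for each feature set (sketch only).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for feature_set_name, feature_set in feature_sets.items():
    scores = cross_val_score(LogisticRegression(), X[feature_set], y, cv=cv)
    print(f"{feature_set_name}: mean CV accuracy = {scores.mean():.3f}")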

With that being said, something that caught my eye, and that can help guide our decisions when evaluating the models, is the predicted probability itself.

If you look at each of our models, they assign different probabilities to predicting virginica or non-virginica.

I believe that a model that assigns a low(er) probability of virginica to an instance whose ground truth is non-virginica is good, and likewise a very high probability of virginica when the ground truth is virginica is very good.

With the standard sklearn metrics all scoring 100%, I believe this simple probability-based score should do for our purposes.

In [ ]:
# we are using a 50% probability threshold.
def calculate_total_score(df):
    def calculate_score(row):
        prob = row['Probability of predicting virginica']
        ground_truth = row['Ground truth']

        if ground_truth == 'virginica':
            return prob - 0.5
        else:
            return 0.5 - prob

    df['Score'] = df.apply(calculate_score, axis=1)
    
    return df['Score'].sum()

In [ ]:
total_scores = {feature_set_name: calculate_total_score(df) for feature_set_name, df in results_dict.items()}

for feature_set_name, total_score in total_scores.items():
    print(f"Total score for model with {feature_set_name}: {total_score}")
Total score for model with One feature: 5.463795534529218
Total score for model with Two features: 6.146823413627794
Total score for model with Three features: 6.030550287627612
Total score for model with Four features: 6.002378652450679

From the above, we can see that the model with two features has the highest score based on our algorithm. This makes sense because, as we observed during our EDA, petal width and petal length have the strongest positive correlation with our target.

One thing to point out is that adding "less correlated" features to a model actually drags the probabilities down slightly, as we can see in the model with 4 features.

Another point to note is that the one-feature model scores lowest, even though its feature, petal width, has the highest correlation with our target. This suggests that more (positively correlated) features improve the confidence of our classification.

Now, while this scoring algorithm is by no means rigorous, I believe it should do for the scope of this analysis.
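
For comparison, sklearn's log_loss scores the predicted probabilities directly (lower is better) and would serve a similar purpose; a minimal sketch using the validation results gathered above:

In [ ]:
from sklearn.metrics import log_loss

# Log loss per model: penalises confident wrong probabilities, rewards confident correct ones.
for feature_set_name, results in results_dict.items():
    ll = log_loss((y_valid == 'virginica').astype(int),
                  results['Probability of predicting virginica'])
    print(f"Log loss for model with {feature_set_name}: {ll:.4f}")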

Recommend the Best Model.¶

Looking at the analysis above, I will choose the model with two features as our best model.

It uses the two most positively correlated features, it scored 100% on all the sklearn evaluation metrics, and it achieved the highest total in our own scoring algorithm.

Now let us see how the model performs on our test data.

In [ ]:
best_score = max(total_scores, key=total_scores.get)
best_model = models_dict[best_score]
best_model
Out[ ]:
LogisticRegression()
In [ ]:
# Predict the probabilities and classes on the test set using the best model
y_test_pred_proba = best_model.predict_proba(X_test[feature_sets[best_score]])[:, 1]
y_test_pred = best_model.predict(X_test[feature_sets[best_score]])

# Create a dataframe with the test results
test_results = pd.DataFrame({
    'Instance number': X_test.index,
    'Probability of predicting virginica': y_test_pred_proba,
    'Actual prediction by the model': y_test_pred,
    'Ground truth': y_test
})
In [ ]:
display(test_results)
Instance number Probability of predicting virginica Actual prediction by the model Ground truth
143 143 0.981925 virginica virginica
56 56 0.356495 non-virginica non-virginica
128 128 0.942506 virginica virginica
69 69 0.023739 non-virginica non-virginica
68 68 0.210752 non-virginica non-virginica
82 82 0.029036 non-virginica non-virginica
45 45 0.000007 non-virginica non-virginica
131 131 0.990823 virginica virginica
36 36 0.000004 non-virginica non-virginica
73 73 0.194982 non-virginica non-virginica
76 76 0.322391 non-virginica non-virginica
104 104 0.971436 virginica virginica
31 31 0.000011 non-virginica non-virginica
9 9 0.000006 non-virginica non-virginica
12 12 0.000004 non-virginica non-virginica

Another perfect performance, this time on the test set.

Let us calculate the score using scikit-learn and also with our own algorithm.

In [ ]:
accuracy = accuracy_score(y_test, test_results['Actual prediction by the model'])
print(f"Accuracy for model with {best_score}: {accuracy:.2f}")
Accuracy for model with Two features: 1.00
In [ ]:
calculate_total_score(test_results)
Out[ ]:
6.249262208930103

This is an even higher score than on our validation set with the same number of rows, which suggests the model did not overfit.

Lovely stuff.