PyCaret Tutorial
This tutorial does not cover PyCaret as a whole; instead, it walks through the entire pipeline for a single classification problem. We will use the wine recognition dataset:
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Note: This post was automatically generated from my Jupyter Notebook thanks to Adam Blomberg’s post.
What is PyCaret?
PyCaret is a library that automates the Machine Learning (ML) process while requiring very few lines of code. It builds on top of several other useful ML libraries, such as scikit-learn.
Check out their website for more information: PyCaret Homepage.
Installing PyCaret
Virtual environment
First, we create a virtual environment to keep things isolated and lightweight and to avoid carrying unnecessary packages.
Using venv:
$ python -m venv /path/to/new/virtual/environment
And to activate it:
$ source /path/to/new/virtual/environment/bin/activate
Using conda:
$ conda create --name venv pip
And to activate it:
$ conda activate venv
Actual PyCaret installation
We use pip to install it:
$ pip install pycaret
Or on the notebook:
! pip install pycaret
Dataset
We will import the dataset from Scikit-learn:
import sklearn
from sklearn.datasets import load_wine
print(f"Scikit-learn version: {sklearn.__version__}")
import pandas as pd
import numpy as np
Scikit-learn version: 0.23.2
data = load_wine()
df = pd.DataFrame(data=np.c_[data['data'], data['target']],
columns=data['feature_names'] + ['target'])
df.shape
(178, 14)
Let’s do some very brief EDA on the dataset:
display(df.head())
df.describe()
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 | 178.000000 |
mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.029270 | 0.361854 | 1.590899 | 5.058090 | 0.957449 | 2.611685 | 746.893258 | 0.938202 |
std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.709990 | 314.907474 | 0.775035 |
min | 11.030000 | 0.740000 | 1.360000 | 10.600000 | 70.000000 | 0.980000 | 0.340000 | 0.130000 | 0.410000 | 1.280000 | 0.480000 | 1.270000 | 278.000000 | 0.000000 |
25% | 12.362500 | 1.602500 | 2.210000 | 17.200000 | 88.000000 | 1.742500 | 1.205000 | 0.270000 | 1.250000 | 3.220000 | 0.782500 | 1.937500 | 500.500000 | 0.000000 |
50% | 13.050000 | 1.865000 | 2.360000 | 19.500000 | 98.000000 | 2.355000 | 2.135000 | 0.340000 | 1.555000 | 4.690000 | 0.965000 | 2.780000 | 673.500000 | 1.000000 |
75% | 13.677500 | 3.082500 | 2.557500 | 21.500000 | 107.000000 | 2.800000 | 2.875000 | 0.437500 | 1.950000 | 6.200000 | 1.120000 | 3.170000 | 985.000000 | 2.000000 |
max | 14.830000 | 5.800000 | 3.230000 | 30.000000 | 162.000000 | 3.880000 | 5.080000 | 0.660000 | 3.580000 | 13.000000 | 1.710000 | 4.000000 | 1680.000000 | 2.000000 |
All data is numeric and without missing values.
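A quick sanity check (plain pandas, not in the original notebook) confirms there are no missing values:
# Should print False: no missing values anywhere in the dataframe
print(df.isna().any().any())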
# Checking the classes
df['target'].value_counts()
1.0 71
0.0 59
2.0 48
Name: target, dtype: int64
The classes are not very imbalanced.
Splitting the dataset into train and test
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df,
test_size=0.20,
random_state=0,
stratify=df['target'])
Initializing PyCaret
We can now import PyCaret and start playing around with it!
import pycaret
from pycaret.classification import *
print(f"PyCaret version: {pycaret.__version__}")
PyCaret version: 2.3.1
PyCaret asks us to set up the data as the first step, providing the input features and the target variable. This setup function is how PyCaret initializes the pipeline for the preprocessing, modeling, and deployment steps to come. It requires two parameters: data, a pandas dataframe, and target, the name of the target column. There are other parameters, but they are all optional and we will not cover them at this point.
The setup function infers the features’ data types on its own, but it is worth double-checking, since it occasionally gets them wrong. After running setup, a table showing the features and their inferred data types is displayed, and we are asked to press enter if everything is correct, or to type quit otherwise, fix the types, and run setup again (a sketch of how to pass the types explicitly follows the output below). Getting the data types right is extremely important, because PyCaret automatically preprocesses the data based on each feature’s type, and preprocessing is a fundamental and critical part of doing ML properly.
# setup the dataset
grid = setup(data=df_train, target='target')
| | Description | Value |
|---|---|---|
0 | session_id | 6420 |
1 | Target | target |
2 | Target Type | Multiclass |
3 | Label Encoded | None |
4 | Original Data | (142, 14) |
5 | Missing Values | False |
6 | Numeric Features | 13 |
7 | Categorical Features | 0 |
8 | Ordinal Features | False |
9 | High Cardinality Features | False |
10 | High Cardinality Method | None |
11 | Transformed Train Set | (99, 13) |
12 | Transformed Test Set | (43, 13) |
13 | Shuffle Train-Test | True |
14 | Stratify Train-Test | False |
15 | Fold Generator | StratifiedKFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | clf-default-name |
21 | USI | 6d1b |
22 | Imputation Type | simple |
23 | Iterative Imputation Iteration | None |
24 | Numeric Imputer | mean |
25 | Iterative Imputation Numeric Model | None |
26 | Categorical Imputer | constant |
27 | Iterative Imputation Categorical Model | None |
28 | Unknown Categoricals Handling | least_frequent |
29 | Normalize | False |
30 | Normalize Method | None |
31 | Transformation | False |
32 | Transformation Method | None |
33 | PCA | False |
34 | PCA Method | None |
35 | PCA Components | None |
36 | Ignore Low Variance | False |
37 | Combine Rare Levels | False |
38 | Rare Level Threshold | None |
39 | Numeric Binning | False |
40 | Remove Outliers | False |
41 | Outliers Threshold | None |
42 | Remove Multicollinearity | False |
43 | Multicollinearity Threshold | None |
44 | Clustering | False |
45 | Clustering Iteration | None |
46 | Polynomial Features | False |
47 | Polynomial Degree | None |
48 | Trignometry Features | False |
49 | Polynomial Threshold | None |
50 | Group Features | False |
51 | Feature Selection | False |
52 | Feature Selection Method | classic |
53 | Features Selection Threshold | None |
54 | Feature Interaction | False |
55 | Feature Ratio | False |
56 | Interaction Threshold | None |
57 | Fix Imbalance | False |
58 | Fix Imbalance Method | SMOTE |
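If setup had inferred a type incorrectly, we would type quit and pass the types explicitly on the next run. A minimal sketch, assuming setup’s standard numeric_features parameter (every column here is numeric, so this is for illustration only):
# Hypothetical: bypass type inference by listing the numeric columns explicitly
numeric_cols = [c for c in df_train.columns if c != 'target']
grid = setup(data=df_train,
             target='target',
             numeric_features=numeric_cols)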
Among the optional parameters we skipped, there are options for preprocessing such as outlier removal, feature selection, feature encoding, dimensionality reduction, how to split between train and test set, and many more! Check out the documentation for more details.
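As a taste of those options, here is a hedged sketch of a richer setup call; the parameter names follow the PyCaret 2.x API, but double-check them against the documentation for your version:
# Illustrative only: the rest of this tutorial uses the plain setup() above
grid = setup(data=df_train,
             target='target',
             train_size=0.7,           # fraction used for the internal train split
             normalize=True,           # z-score scaling of the numeric features
             remove_outliers=True,     # drop likely outliers from the train set
             outliers_threshold=0.05,  # assumed fraction of outliers
             fix_imbalance=False)      # SMOTE resampling, unnecessary here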
Comparing different models
After setting up, the time to compare and evaluate different models has arrived! And this is done so easily that pretty much anyone with minimal knowledge of metrics could pick the best model. This is one of PyCaret’s coolest features and it saves us a lot of time.
By default, PyCaret uses 10-fold cross-validation, sorts the results by classification accuracy, returns the best model, and displays the results of all tested classifiers across several metrics.
best = compare_models()
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
rf | Random Forest Classifier | 0.9700 | 0.9975 | 0.9750 | 0.9770 | 0.9698 | 0.9545 | 0.9585 | 0.0790 |
nb | Naive Bayes | 0.9600 | 0.9950 | 0.9639 | 0.9678 | 0.9598 | 0.9394 | 0.9431 | 0.0060 |
et | Extra Trees Classifier | 0.9600 | 0.9967 | 0.9700 | 0.9703 | 0.9604 | 0.9384 | 0.9438 | 0.0670 |
ridge | Ridge Classifier | 0.9589 | 0.0000 | 0.9600 | 0.9686 | 0.9578 | 0.9365 | 0.9423 | 0.0050 |
lightgbm | Light Gradient Boosting Machine | 0.9500 | 1.0000 | 0.9589 | 0.9623 | 0.9486 | 0.9242 | 0.9315 | 0.1100 |
lda | Linear Discriminant Analysis | 0.9489 | 0.9936 | 0.9533 | 0.9619 | 0.9483 | 0.9211 | 0.9283 | 0.0060 |
qda | Quadratic Discriminant Analysis | 0.9389 | 0.9950 | 0.9167 | 0.9534 | 0.9319 | 0.9023 | 0.9135 | 0.0070 |
dt | Decision Tree Classifier | 0.9200 | 0.9369 | 0.9167 | 0.9330 | 0.9178 | 0.8773 | 0.8855 | 0.0060 |
lr | Logistic Regression | 0.9189 | 0.9888 | 0.9194 | 0.9314 | 0.9175 | 0.8762 | 0.8830 | 0.4420 |
gbc | Gradient Boosting Classifier | 0.8900 | 0.9876 | 0.8922 | 0.9127 | 0.8883 | 0.8290 | 0.8416 | 0.0770 |
ada | Ada Boost Classifier | 0.8489 | 0.9378 | 0.8378 | 0.8721 | 0.8440 | 0.7652 | 0.7782 | 0.0320 |
knn | K Neighbors Classifier | 0.7178 | 0.8830 | 0.6922 | 0.7521 | 0.7012 | 0.5651 | 0.5915 | 0.0090 |
svm | SVM - Linear Kernel | 0.5978 | 0.0000 | 0.5678 | 0.5420 | 0.5152 | 0.3702 | 0.4458 | 0.0070 |
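compare_models also accepts optional arguments. A few illustrative variants, assuming the PyCaret 2.x signature (none of these are run here):
best_f1 = compare_models(sort='F1')  # rank by F1 instead of accuracy
top3 = compare_models(n_select=3)    # return the three best models as a list
trees = compare_models(include=['rf', 'et', 'lightgbm'])  # restrict the candidates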
Showing the best classifier:
print(best)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=-1, oob_score=False, random_state=6420, verbose=0,
warm_start=False)
Tuning a model
The models used in the comparison run with their default hyperparameters. With PyCaret we can also tune a single model and pick its best hyperparameters, which is done by running a random grid search over a predefined search space. We can define a custom search grid as well, but we will not do so at this point. There are also many arguments for controlling early stopping, the number of iterations, which metric to optimize, and so on. Tuning returns a table similar to the one before, but each row now shows the result for one validation fold.
We will tune the K Neighbors Classifier, since it performed poorly with its default parameters. To do so, we pass the scikit-learn estimator directly to tune_model:
from sklearn.neighbors import KNeighborsClassifier
tuned_knn = tune_model(KNeighborsClassifier(), n_iter=100)
| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
0 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
1 | 0.9000 | 0.9300 | 0.9333 | 0.9250 | 0.9016 | 0.8438 | 0.8573 |
2 | 0.7000 | 0.7929 | 0.6944 | 0.7400 | 0.7067 | 0.5385 | 0.5471 |
3 | 0.8000 | 0.9083 | 0.8056 | 0.8250 | 0.7971 | 0.6970 | 0.7078 |
4 | 0.9000 | 0.9012 | 0.8889 | 0.9200 | 0.8956 | 0.8462 | 0.8598 |
5 | 0.8000 | 0.8298 | 0.7778 | 0.8450 | 0.7627 | 0.6923 | 0.7273 |
6 | 0.7000 | 0.8857 | 0.6944 | 0.7400 | 0.7067 | 0.5385 | 0.5471 |
7 | 0.9000 | 1.0000 | 0.9167 | 0.9333 | 0.9029 | 0.8485 | 0.8616 |
8 | 0.8000 | 0.8458 | 0.8333 | 0.9000 | 0.8000 | 0.7059 | 0.7500 |
9 | 0.7778 | 0.9000 | 0.6667 | 0.6296 | 0.6889 | 0.6250 | 0.6934 |
Mean | 0.8278 | 0.8994 | 0.8211 | 0.8458 | 0.8162 | 0.7335 | 0.7551 |
SD | 0.0910 | 0.0637 | 0.1079 | 0.1076 | 0.0993 | 0.1417 | 0.1364 |
print(tuned_knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='manhattan',
metric_params=None, n_jobs=None, n_neighbors=3, p=2,
weights='distance')
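We could also constrain the search ourselves with a custom grid. A minimal sketch using tune_model’s custom_grid and optimize arguments (argument names per the PyCaret 2.x API; the grid values are illustrative):
# Illustrative: random search over a hand-picked grid, optimizing F1
custom_grid = {'n_neighbors': list(range(1, 16)),
               'weights': ['uniform', 'distance'],
               'metric': ['euclidean', 'manhattan']}
tuned_knn_custom = tune_model(KNeighborsClassifier(),
                              custom_grid=custom_grid,
                              optimize='F1',
                              n_iter=50)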
Plotting a model
If a table alone is not enough, you can also plot a model’s results in several different ways. Here are a few examples:
plot_model(best, plot='boundary')
plot_model(best, plot='confusion_matrix')
plot_model(tuned_knn, plot='pr')
plot_model(tuned_knn, plot='class_report')
plot_model(tuned_knn, plot='confusion_matrix')
plot_model(tuned_knn, plot='auc')
Explainable AI
Most businesses do not like having a black-box model telling them what to do. It is extremely important to understand what the model does so that the business can take action to improve its results.
For this, we need to install the shap library:
$ pip install shap
Or on the notebook:
! pip install shap
The shap library is based on Shapley values, a concept from game theory, which it uses to compute feature importance. For now, PyCaret’s model interpretation is only available for tree-based models.
from sklearn.ensemble import RandomForestClassifier
rf = create_model(RandomForestClassifier())
| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
0 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
3 | 0.9000 | 1.0000 | 0.9167 | 0.9250 | 0.9000 | 0.8507 | 0.8636 |
4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
5 | 0.9000 | 1.0000 | 0.9167 | 0.9250 | 0.9000 | 0.8507 | 0.8636 |
6 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
7 | 0.9000 | 0.9667 | 0.9167 | 0.9200 | 0.8984 | 0.8438 | 0.8573 |
8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
9 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Mean | 0.9700 | 0.9967 | 0.9750 | 0.9770 | 0.9698 | 0.9545 | 0.9585 |
SD | 0.0458 | 0.0100 | 0.0382 | 0.0352 | 0.0461 | 0.0695 | 0.0635 |
Summary plot
interpret_model(rf, plot='summary')
Correlation plot
interpret_model(rf, plot='correlation')
Reason plot
All observation
interpret_model(rf, plot='reason')
A specific observation
interpret_model(rf, plot='reason', observation=10)
Predicting
Now that the exploratory phase is over, we can predict the results on unseen data. First, finalize_model retrains the chosen model on the entire dataset passed to setup, including the internal hold-out set:
best_final = finalize_model(best)
predictions = predict_model(best_final, data=df_test.drop('target', axis=1))
from sklearn.metrics import classification_report
print(classification_report(df_test['target'], predictions['Label']))
precision recall f1-score support
0.0 1.00 1.00 1.00 12
1.0 1.00 1.00 1.00 14
2.0 1.00 1.00 1.00 10
accuracy 1.00 36
macro avg 1.00 1.00 1.00 36
weighted avg 1.00 1.00 1.00 36
What now?
PyCaret lets you serve predictions directly from the cloud; check out their website if you want to do this.
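Before deploying anywhere, you will typically want to persist the finalized pipeline. A minimal sketch using PyCaret’s save_model and load_model (the file name is arbitrary):
# Persist the finalized pipeline to disk and reload it later
save_model(best_final, 'wine_rf_pipeline')  # writes wine_rf_pipeline.pkl
loaded = load_model('wine_rf_pipeline')
new_predictions = predict_model(loaded, data=df_test.drop('target', axis=1))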
Also, you can build ensemble models with PyCaret, as sketched below.
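For example, here is a hedged sketch of blending several trained models with blend_models (part of the PyCaret classification API; the chosen estimators are illustrative):
# Illustrative soft-voting ensemble of three individually created models
rf_model = create_model('rf')
et_model = create_model('et')
nb_model = create_model('nb')
blended = blend_models(estimator_list=[rf_model, et_model, nb_model],
                       method='soft')  # average the predicted class probabilities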
PyCaret supports not only Classification, but also Regression, Anomaly Detection, Clustering, Natural Language Processing, and Association Rule Mining.
References
[1] PyCaret Homepage