Best Practices For Your Machine Learning Workflow in Scikit-learn

3 min read · Mar 30, 2021

Over the course of my data science learning, I have had a ton of exposure to Scikit-learn. Scikit-learn is one of the most popular machine learning libraries for Python. In this post, I will discuss some important practices to keep in mind when using the Scikit-learn library to build effective models.

Keep Your Data Preprocessing Consistent

Scikit-learn provides numerous transformers and utilities for data transformations. Any transformation applied when training a model must also be applied to the test data and to any data the model sees in production. For example, if you use StandardScaler to scale X_train before fitting your model, you must apply that same fitted scaler to X_test (calling transform, not fitting it again) when getting predictions or model metrics. Inconsistent preprocessing leads to unexpected results and a model that performs poorly.

The code block below, courtesy of scikit-learn’s documentation, shows the wrong way: it predicts on the test data without processing it in the same manner as the training data that was used to fit the model.

>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_transformed = scaler.fit_transform(X_train)
>>> model = LinearRegression().fit(X_train_transformed, y_train)
>>> mean_squared_error(y_test, model.predict(X_test))
62.80...

The code block below shows the correct way: the test data is transformed with the scaler that was already fitted on the training data.

>>> X_test_transformed = scaler.transform(X_test)
>>> mean_squared_error(y_test, model.predict(X_test_transformed))
0.90...
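For readers who want to run the comparison end to end, here is a minimal self-contained sketch. The synthetic dataset, the deliberate shift and rescaling of the feature, and the variable names are assumptions made for illustration; they are not the data behind the numbers quoted above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature regression problem (illustrative assumption).
X, y = make_regression(n_samples=200, n_features=1, noise=1.0, random_state=42)
# Move the feature away from zero mean / unit variance so scaling matters.
X = X * 10.0 + 50.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Fit the scaler on the training data only, then fit the model on scaled data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LinearRegression().fit(X_train_scaled, y_train)

# Wrong: the model was trained on scaled inputs but sees raw inputs here.
mse_unscaled = mean_squared_error(y_test, model.predict(X_test))

# Right: reuse the already-fitted scaler on the test data.
mse_scaled = mean_squared_error(y_test, model.predict(scaler.transform(X_test)))

print(f"MSE without scaling the test set: {mse_unscaled:.2f}")
print(f"MSE with the training scaler applied: {mse_scaled:.2f}")
```

The exact numbers depend on the generated data, but the error on the untransformed test set should be far larger than the error on the properly transformed one.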

Avoid Data Leakage

Data leakage is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model’s utility when run in a production environment.

Data leakage will cause your model to appear to perform better than it should.

Here are some things to consider to avoid data leakage when using Scikit-learn:

· Never fit on test data. That applies to transformers (fit, fit_transform) as well as to models.

· When performing feature selection, use only the training data. Split your data into train and test sets first, then do feature selection on the training set.

· If you use an average or other statistic to feature engineer, impute, scale, or otherwise pre-process your data, you should only be using statistics computed from the training subset. Otherwise information from the test subset will influence the model.

· Watch out for leaky predictors and avoid using them as features. You should not use variables that were updated after the target was realized. For example, if you are building a model to predict whether people are sick, you shouldn’t fit the model using the fact that people have taken medicine. People take medicine because they are sick, not the other way around (usually). That information would not be available at prediction time, so the model would look accurate during evaluation and then perform poorly on unseen data!

Wrong

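>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.model_selection import train_test_split
>>> # Incorrect preprocessing: the entire data is transformed
>>> X_selected = SelectKBest(k=25).fit_transform(X, y)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X_selected, y, random_state=42)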

Right

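>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, random_state=42)
>>> select = SelectKBest(k=25)
>>> X_train_selected = select.fit_transform(X_train, y_train)
>>> # The test set is only transformed, never used for fitting
>>> X_test_selected = select.transform(X_test)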
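The same fit-on-train-only pattern applies to the statistics mentioned in the list above, such as the means used for imputation. Below is a rough sketch using SimpleImputer; the random data, the share of missing values, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Illustrative data with some missing values (assumed, not from the article).
rng = np.random.RandomState(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
X[rng.rand(100, 3) < 0.1] = np.nan  # knock out roughly 10% of the entries
y = rng.randint(0, 2, size=100)

# Split first, so the test rows cannot influence any statistic.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

imputer = SimpleImputer(strategy="mean")
# Column means are learned from the training rows only.
X_train_imputed = imputer.fit_transform(X_train)
# The test rows are filled in with those same training means.
X_test_imputed = imputer.transform(X_test)

print("Training means used for imputation:", imputer.statistics_)
```

Calling fit_transform on the full X before splitting would instead bake the test rows into the column means, which is exactly the leak described above.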

The above guidelines should be followed for the training and test folds in cross-validation as well.

Pipelines

Scikit-learn has a class called Pipeline that will help you avoid many of the above pitfalls. One of scikit-learn’s core developers recommends that everyone use Pipelines when working with scikit-learn.

“Pipelines allow you to encapsulate all the preprocessing steps, feature selections, scaling, encoding of variables, and so on together with the final supervised model that you usually have in a single estimator.” - Andreas Muller, Core Developer of Scikit-learn
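As a minimal sketch of what that encapsulation can look like, the example below chains scaling, feature selection, and a linear model into one estimator. The synthetic dataset, the step names, and the choice of k are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative assumption).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the last step is the final model.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", LinearRegression()),
])

# fit() fits every step on the training data only.
pipe.fit(X_train, y_train)

# predict() and score() apply the already-fitted steps to the test data.
predictions = pipe.predict(X_test)
r2 = pipe.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")
```

Because the scaler and selector live inside the pipeline, there is no separate transform call on the test set to forget.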

Pipelines are especially useful for cross-validation: because every step is refit on the training folds only, they prevent information from the test fold leaking into training and let the cross-validation score give a more accurate picture of model performance.
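One way to see the effect is to cross-validate on pure noise, where no honest model should beat chance. The sketch below assumes random features, random labels, and a LogisticRegression classifier chosen only because it is quick to fit; none of this comes from the sources.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Random features and random labels: there is nothing real to learn here.
rng = np.random.RandomState(42)
X = rng.standard_normal((200, 10000))
y = rng.randint(0, 2, size=200)

# Leaky: features are selected using ALL rows, including future test folds.
X_leaky = SelectKBest(f_classif, k=25).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection sits inside the pipeline, so it is refit on each
# training fold and never sees that fold's held-out rows.
pipe = make_pipeline(SelectKBest(f_classif, k=25),
                     LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Leaky cross-validation accuracy:  {leaky_score:.2f}")
print(f"Honest cross-validation accuracy: {honest_score:.2f}")
```

The leaky score tends to come out well above chance even though the labels are random, while the pipeline score should stay close to 0.5.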

Conclusion

There you have it: some important practices to follow when using Scikit-learn so that your models perform effectively on unseen data!

Sources

https://scikit-learn.org/stable/common_pitfalls.html

https://scikit-learn.org/stable/modules/compose.html#pipeline

https://drgabrielharris.medium.com/python-how-scikit-learn-0-20-optimal-pipeline-and-best-practices-dc4dd94d2c09

https://towardsdatascience.com/want-to-truly-master-scikit-learn-2-essential-tips-from-the-official-developer-himself-dada6ff56b99

https://en.wikipedia.org/wiki/Leakage_(machine_learning)

https://www.kaggle.com/dansbecker/data-leakage