The scikit-learn certification, Expert

The senior bar.
Set by the people who maintain scikit-learn.

The Expert Practitioner Certification is for senior data scientists who ship to production and lead other people's pipelines. Custom estimators, calibration, MLOps, and the judgement to debug a teammate's model and explain why.

Senior data scientist · Issued by Probabl · Verifiable credential
03 of three certification levels
150 min proctored online exam
50 + 2 multiple-choice questions plus two labs
75% to pass, graded by topic
What will be evaluated

Seven competencies of a senior data scientist.

The Expert certification is designed to ensure that our certified professionals possess both the conceptual understanding and the practical skills of a senior data scientist. The exam is graded against the seven areas below.

01

Expert-level machine learning

In-depth knowledge of machine learning algorithms, including emerging trends and best practices.

02

Algorithm development

Ability to develop and implement custom machine learning algorithms tailored to specific problems.

03

Model deployment

Expertise in deploying machine learning models into production environments, including knowledge of MLOps.

04

Research and innovation

Ability to conduct independent research and contribute to the development of new methods or tools.

05

Strategic planning

Involvement in long-term planning and strategy development for data science initiatives within the organization.

06

Strategic vision

Strong understanding of broader industry and market trends to shape the strategic direction of ML efforts.

07

Model diagnostics

Identify, troubleshoot, and resolve potential problems within the machine learning pipeline of other team members.

What do I need to know

Six topics. The shape of the Expert exam.

A step beyond Professional. Custom estimators, metadata routing, calibration, partial dependence, and the ops surface of getting a model running in production.

01

Machine learning concepts

The senior mental model. Loss functions, splitting criteria, calibration vs ranking power.

  • Supervised and unsupervised learning: regression, classification, clustering, dimensionality reduction
  • Model families: tree-based, linear, ensemble, neighbors
  • Loss functions and surrogate loss
  • Splitting criteria in decision trees
  • Filter, wrapper, and embedded methods for feature selection
  • Calibration (expected calibration error) vs ranking power (ROC AUC, Gini)
sklearn, topic-01.py
from sklearn.metrics import \
  brier_score_loss, roc_auc_score

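# calibration-sensitive: the Brier score is a
# proper scoring rule, lower is better
brier = brier_score_loss(y, p)
# ranking power: unchanged by any strictly
# increasing rescaling of p
auc = roc_auc_score(y, p)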
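The difference between calibration and ranking power can be sketched on synthetic data. scikit-learn does not ship an expected calibration error function, so the helper `expected_calibration_error` below is a hypothetical, simple equal-width-bin version written for illustration; the data, the bin count, and the cubic distortion are all assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Hypothetical helper: binned ECE with equal-width bins on [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitize against the interior edges so bin ids run from 0 to n_bins - 1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed positive rate and mean predicted probability,
            # weighted by the share of samples that fall in the bin.
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return ece


rng = np.random.default_rng(0)
p_true = rng.uniform(0.0, 1.0, size=20_000)  # synthetic "true" probabilities
y = rng.binomial(1, p_true)                  # labels drawn from them

p_calibrated = p_true
p_distorted = p_true ** 3  # strictly increasing on [0, 1]: same ranking

auc_calibrated = roc_auc_score(y, p_calibrated)
auc_distorted = roc_auc_score(y, p_distorted)
ece_calibrated = expected_calibration_error(y, p_calibrated)
ece_distorted = expected_calibration_error(y, p_distorted)
```

Under these assumptions the two AUC values coincide while the ECE of the distorted scores is far larger, which is why a model can rank well and still report probabilities that should not be trusted as-is.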
02

Model building and evaluation

Write your own estimator. Route metadata. Post-calibrate, and read the calibration plot honestly.

  • Create your own estimator (e.g. NearestCentroid, recommender systems, transformers)
  • Metadata routing across estimators and CV
  • Calibration plots with CalibrationDisplay, post-calibration with CalibratedClassifierCV
sklearn, topic-02.py
from sklearn.calibration \
  import CalibratedClassifierCV

cal = CalibratedClassifierCV(
  # named base_estimator before scikit-learn 1.2
  estimator=clf,
  method="isotonic",
  cv=5,
).fit(X_tr, y_tr)
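For the "create your own estimator" item, here is a minimal sketch of a nearest-centroid style classifier that follows the usual scikit-learn conventions. The class name `CentroidClassifier`, its `shrink` hyperparameter, and the toy blobs are hypothetical, chosen for illustration; this is not the library's own NearestCentroid.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class CentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical teaching estimator: predict the class of the closest centroid."""

    def __init__(self, shrink=0.0):
        # __init__ only stores hyperparameters, so get_params and clone work.
        self.shrink = shrink

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # Learned attributes end with an underscore and are set in fit only.
        self.classes_ = unique_labels(y)
        self.n_features_in_ = X.shape[1]
        overall = X.mean(axis=0)
        centroids = np.vstack([X[y == c].mean(axis=0) for c in self.classes_])
        # Pull each class centroid toward the overall mean by a fraction `shrink`.
        self.centroids_ = (1.0 - self.shrink) * centroids + self.shrink * overall
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]


X, y = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [0, 5], [5, 0]],  # well-separated toy clusters
    cluster_std=1.0,
    random_state=0,
)
scores = cross_val_score(CentroidClassifier(), X, y, cv=5)
cloned_params = clone(CentroidClassifier(shrink=0.1)).get_params()
```

Because the class keeps to those conventions, it drops into `cross_val_score`, `Pipeline`, and `GridSearchCV` like any built-in estimator.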
03

Interpretation and communication

Diagnose a colleague's pipeline. Read a partial dependence plot. Spot the leakage.

  • Partial dependence plots, non-linear impact on the target
  • Permutation importance
  • Diagnosing methodology: given a plot, name the failure
  • Pitfalls (e.g. feature selection inside or outside the pipeline)
  • Code comprehension and good practices
sklearn, topic-03.py
from sklearn.inspection import \
  PartialDependenceDisplay, \
  permutation_importance

PartialDependenceDisplay\
  .from_estimator(clf, X, [0, 1])
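The feature selection pitfall listed above can be reproduced in a few lines. This is a sketch on pure-noise data, so any score well above chance is an artifact of leakage; the sample size, feature count, and `k` are assumptions picked to make the effect obvious.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels independent of X

# Leaky: the selector sees every label, including those of the later test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: selection sits inside the pipeline and is refit on each training fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()
```

The leaky variant reports an accuracy far above the honest one even though there is nothing to learn, which is the kind of failure a reviewer is expected to name when reading a colleague's pipeline.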
04

Data preprocessing

Stitch sources together, derive features, read the plot before you choose the model family.

  • Loading parquet datasets
  • Reading plots to decide which family of models fits
  • Combining data from multiple sources
  • Adding new features, lagged features for time-based data
sklearn, topic-04.py
import pandas as pd

# assumes one row per store per day, sorted by date
df["sales_lag_7"] = (
  df.groupby("store")["sales"]
    .shift(7)
)
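Combining sources and building a lag feature can be sketched together. The two toy tables below are hypothetical stand-ins for files that would normally be read with `pd.read_parquet` (which needs a parquet engine such as pyarrow installed); column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy tables standing in for two data sources.
sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "2024-01-03", "2024-01-01", "2024-01-02",
        "2024-01-01", "2024-01-02", "2024-01-03",
    ]),
    "sales": [30, 10, 20, 5, 6, 7],
})
stores = pd.DataFrame({"store": ["A", "B"], "region": ["north", "south"]})

# validate= raises if a duplicated key in `stores` would multiply sales rows.
df = sales.merge(stores, on="store", how="left", validate="many_to_one")

# shift() works on row order, so sort by time within each store first.
df = df.sort_values(["store", "date"]).reset_index(drop=True)
df["sales_lag_1"] = df.groupby("store")["sales"].shift(1)
```

The first row of each store has no previous value, so its lag is missing; a downstream model or imputer has to handle that explicitly.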
05

Model selection and validation

Hyperparameter tuning with proper scoring rules. Choose the metric the calibration plot demands.

  • Hyperparameter tuning with proper scoring rules (calibration)
sklearn, topic-05.py
from sklearn.model_selection \
  import GridSearchCV
from sklearn.metrics import \
  make_scorer, brier_score_loss

scorer = make_scorer(
  brier_score_loss,
  greater_is_better=False,
  # needs_proba=True before scikit-learn 1.4
  response_method="predict_proba",
)
GridSearchCV(pipe, grid, scoring=scorer)
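An end-to-end version of tuning against proper scoring rules, as a sketch on synthetic data. It uses the built-in scorer strings `"neg_brier_score"` and `"neg_log_loss"` instead of a hand-built scorer; the dataset, pipeline, and grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe,
    grid,
    # Both are proper scoring rules; scikit-learn negates them so higher is better.
    scoring={"brier": "neg_brier_score", "log_loss": "neg_log_loss"},
    refit="brier",  # pick and refit the candidate with the best Brier score
    cv=5,
).fit(X, y)

best_brier = -search.best_score_  # flip the sign back to a plain Brier score
```

Tracking both metrics in `search.cv_results_` makes it easy to check whether the two rules agree on the best candidate.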
06

Model deployment

The MLOps surface area. Save it, load it, ship it, and know which serializer to use.

  • Saving and loading trained models with joblib, pickle, or skops
  • Trade-offs between serializers, security, and forward compatibility
sklearn, topic-06.py
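import skops.io as sio

sio.dump(model, "model.skops")
# list the types skops does not trust by
# default, and review them before loading
unknown = sio.get_untrusted_types(
  file="model.skops",
)
loaded = sio.load(
  "model.skops",
  trusted=unknown,
)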
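For comparison, a sketch of the joblib and pickle round trips named in the list above. Storing the scikit-learn version next to the model is an assumption about good practice rather than something the exam page prescribes; the model, file name, and bundle layout are illustrative.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    # Keep the training-time version with the model: pickled estimators are
    # not guaranteed to load correctly under a different scikit-learn version.
    joblib.dump({"model": model, "sklearn_version": sklearn.__version__}, path)
    # joblib.load unpickles, so only load files you produced or fully trust.
    bundle = joblib.load(path)

restored = bundle["model"]
same_predictions = np.array_equal(model.predict(X), restored.predict(X))

# Plain pickle round trip in memory, with the same trust caveat.
restored_pickle = pickle.loads(pickle.dumps(model))
```

Both formats can execute arbitrary code on load, which is the security trade-off that motivates a restricted format such as skops.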
The certification ladder

Three levels. You are at the top.

Three certifications, each matching a level and a typical data scientist career path. The Expert is the senior bar, the one we built for the people who will lead other practitioners.

YOU ARE HERE Level 03

Expert

Senior. Custom estimators, calibration, MLOps, diagnostics.

Get training with Skolar

Prepare with the
Expert course on Skolar.
Free to start.

The Expert track on Skolar matches this exam: custom estimators, metadata routing, partial dependence, calibration, and the model deployment surface, with notebooks written by the scikit-learn team.

skolar.probabl.ai, 3 courses
01 Associate Practitioner: 8 lessons, ~24 h
02 Professional: 10 lessons, ~32 h
03 Expert: 12 lessons, ~40 h
The exam, in brief

Logistics, plain.

Everything you need to plan your sitting, in six lines.

Format: Proctored online, via Webassessor
Duration: 150 minutes, 50 multiple-choice questions + 2 labs
Passing: 75%, graded by topic area
Languages: English, French coming Q4
Fee: $499 USD, one retake included
Validity: 3 years, renewable by re-examination
Ready when you are

Set the senior bar
with scikit-learn.

150 minutes. $499 USD. Multiple-choice, two hands-on labs, a credential issued by the maintainers themselves.