The scikit-learn certification, Professional

The mid-level scikit-learn exam.
Built by the people who maintain it.

The Professional Practitioner Certification is for working data scientists. It covers regularization, ensembles, feature engineering, nested cross-validation, and the judgement to pick a model and defend it to a stakeholder.

Mid-level data scientist. Issued by Probabl. Verifiable credential.
02 of three certification levels
120 min, proctored online exam
70 + 1, multiple-choice plus a hands-on lab
72% to pass, graded by topic
What will be evaluated

Seven competencies of a working mid-level data scientist.

The Professional certification is designed to ensure that our certified professionals possess both the conceptual understanding and the practical skills of a mid-level data scientist. The exam is graded against the seven areas below.

01

Advanced ML knowledge

Proficiency in a broad range of machine learning algorithms and the ability to select appropriate models for specific problems.

02

Programming expertise

Strong coding skills in Python, with experience in optimizing code for performance and scalability.

03

Data handling and engineering

Ability to handle large datasets, including data extraction, transformation, and loading processes.

04

Feature engineering

Experience in creating and selecting features to improve model performance.

05

Tuning and optimization

Proficiency in hyperparameter tuning, model selection, and ensemble methods to improve model performance.

06

Critical thinking

Ability to approach complex problems systematically and evaluate multiple solutions, including diagnosing issues in a model pipeline.

07

Business expertise

Understanding of how ML projects align with business goals, and the ability to translate technical results into actionable business insights.

What do I need to know

Five topics. The shape of the Professional exam.

A step beyond Associate. You need to recognize when a model is regularized correctly, when a CV strategy leaks, and how to communicate that to non-technical readers.

01

Machine learning concepts

The advanced mental model. Probabilistic outputs, regularization regimes, and what overfitting does to soft predictions.

  • Supervised and unsupervised, regression, classification, clustering, dimensionality reduction
  • Model families, tree-based, linear, ensemble, neighbors
  • Regularization, L1, L2, Elastic Net
  • Hard and soft predictions, predict vs predict_proba
  • Overfitting and underfitting, impact on soft predictions
sklearn, topic-01.py
from sklearn.linear_model \
    import LogisticRegression

clf = LogisticRegression(
  penalty="elasticnet",
  l1_ratio=0.5,
  solver="saga",
)
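The difference between hard and soft predictions can be sketched as follows. This is a minimal illustration on synthetic data, not exam material: the dataset, the split, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard = clf.predict(X_te)        # one class label per row
soft = clf.predict_proba(X_te)  # one probability per class, each row sums to 1

# For this estimator the hard prediction is the most probable class
from_soft = clf.classes_[np.argmax(soft, axis=1)]
```

The soft output carries the model's confidence, which is what metrics such as log loss or average precision consume. The hard output discards it.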
02

Model building and evaluation

Pick the baseline, regularize the noise, ensemble when warranted, and choose the metric that fits the problem.

  • Linear models as baselines
  • Handling correlation with regularization and feature selection
  • Bagging and boosting, the working ensemble methods
  • Choosing metrics for outliers and imbalanced settings
sklearn, topic-02.py
from sklearn.ensemble import \
  HistGradientBoostingClassifier
from sklearn.metrics import \
  average_precision_score

clf = HistGradientBoostingClassifier()
clf.fit(X_tr, y_tr)
ap = average_precision_score(
  y_te, clf.predict_proba(X_te)[:,1]
)
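Why the metric matters on imbalanced data can be sketched by comparing a trivial baseline with a linear baseline. The synthetic dataset, the 95/5 class split, and the names below are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split

# Roughly 5% positives, purely illustrative
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05],
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "dummy": DummyClassifier(strategy="prior"),    # always predicts the majority class
    "linear": LogisticRegression(max_iter=1000),   # linear baseline
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "average_precision": average_precision_score(
            y_te, model.predict_proba(X_te)[:, 1]
        ),
    }
```

The dummy model reaches a high accuracy simply by ignoring the rare class, while its average precision collapses to the share of positives in the test set. That gap is the reason to prefer a ranking metric in imbalanced settings.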
03

Interpretation and communication

Read the plot, name the failure mode, explain it without using the word probability twice.

  • Visualizing results with intermediate matplotlib and seaborn techniques
  • Interpreting model outputs and performance metrics
  • Communicating results to non-technical stakeholders
sklearn, topic-03.py
from sklearn.metrics import \
    PrecisionRecallDisplay

# from_estimator already draws the curve,
# so no extra .plot() call is needed
disp = PrecisionRecallDisplay\
  .from_estimator(
    clf, X_te, y_te,
  )
04

Data preprocessing

Heatmaps, PCA, polynomial features, label propagation. The shaping work that makes a real-world dataset trainable.

  • Loading parquet datasets
  • Heatmaps and PCA for first look
  • Identifying strongly correlated features
  • Missing values in the target via label propagation
  • Feature engineering with PolynomialFeatures, SplineTransformer
  • Combining features with FeatureUnion
sklearn, topic-04.py
from sklearn.pipeline import \
  FeatureUnion
from sklearn.preprocessing import \
  PolynomialFeatures, SplineTransformer

union = FeatureUnion([
  ("poly", PolynomialFeatures(2)),
  ("spline", SplineTransformer()),
])
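Filling missing target values with label propagation can be sketched as below. This assumes scikit-learn's convention of marking unlabeled rows with -1. The synthetic blobs, the 70% missing rate, and the kNN kernel settings are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

# Synthetic two-cluster data, purely illustrative
X, y_full = make_blobs(n_samples=200, centers=2, random_state=0)

# Hide about 70% of the targets; -1 marks an unlabeled row
rng = np.random.RandomState(0)
missing = rng.rand(len(y_full)) < 0.7
y = y_full.copy()
y[missing] = -1

# Labels spread from labeled rows to their neighbors in feature space
model = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y)
y_filled = model.transduction_  # one label per row, no -1 left
```

The filled labels are model guesses, so it is worth checking them against any held-back ground truth before training a downstream estimator on them.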
05

Model selection and validation

Group structure, non-i.i.d. data, nested CV, stable hyperparameters across folds.

  • Cross-validation with group structure and non-i.i.d. data
  • Hyperparameter tuning, GridSearchCV, RandomizedSearchCV
  • Stability of optimal hyperparameters via nested cross-validation
sklearn, topic-05.py
from sklearn.model_selection \
    import GridSearchCV, GroupKFold, \
    cross_val_score

# inner loop tunes, outer loop scores
# on held-out groups
inner = GridSearchCV(pipe, grid, cv=5)
outer = cross_val_score(
  inner, X, y, groups=g,
  cv=GroupKFold(5),
)
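Checking whether the tuned hyperparameter is stable across outer folds can be sketched with an explicit nested loop that keeps groups intact at both levels. The synthetic data, the 30 hypothetical groups, and the grid over C are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 30 hypothetical groups of 10 rows each
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
groups = np.repeat(np.arange(30), 10)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

outer_scores, chosen_C = [], []
for train, test in GroupKFold(n_splits=5).split(X, y, groups):
    # Inner search is group-aware too: groups are forwarded to its splitter
    inner = GridSearchCV(pipe, grid, cv=GroupKFold(n_splits=3))
    inner.fit(X[train], y[train], groups=groups[train])
    outer_scores.append(inner.score(X[test], y[test]))
    chosen_C.append(inner.best_params_["logisticregression__C"])

print("score per outer fold:", np.round(outer_scores, 3))
print("C chosen per outer fold:", chosen_C)
```

If the chosen value jumps around from fold to fold, the search is fitting noise in the validation splits, and the outer scores are the more trustworthy estimate of performance.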
The certification ladder

Three levels. You are on the second.

Three certifications, each matching a stage of a typical data scientist career path. Associate is the prerequisite mindset, Professional is the working bar, Expert is the bar of the people who maintain the library.

YOU ARE HERE Level 02

Professional

Mid-level. Regularization, ensembles, feature engineering, nested CV.

Get training with Skolar

Prepare with the Professional course on Skolar.
Free to start.

The Professional track on Skolar matches this exam: regularization, ensembles, feature unions, and nested validation, with notebooks and practice questions written by the scikit-learn team.

skolar.probabl.ai 3 courses
01
Associate Practitioner
8 lessons, ~24 h, Completed
Review
02
Professional
10 lessons, ~32 h, 2/10 complete
Continue
03
Expert
12 lessons, ~40 h
Locked
The exam, in brief

Logistics, plain.

Everything you need to plan your sitting, in six lines.

Format: Proctored online, via Webassessor
Duration: 120 minutes, 70 multiple-choice questions + 1 lab
Passing: 72%, graded by topic area
Languages: English, French coming Q4
Fee: $349 USD, one retake included
Validity: 3 years, renewable via Level 03
FAQ

Questions we get a lot.

Ready when you are

Certify the work you already do, with scikit-learn.

120 minutes. $349 USD. Multiple-choice plus a hands-on lab, and a credential issued by the maintainers themselves.