The scikit-learn certification, Associate

Prove you know scikit-learn.
From the team that ships it.

The Associate Practitioner Certification is for junior data scientists. A 90-minute exam covering fundamental ML, preprocessing, model selection, and evaluation, designed and graded by the maintainers of scikit-learn.

Junior data scientist level · Issued by Probabl · Verifiable credential
200M+ scikit-learn downloads / month
01 of three certification levels
90 min proctored online exam
70% to pass, 60 questions
What will be evaluated

Eight competencies of a working junior data scientist.

The certification is designed to ensure that our certified professionals possess both the conceptual understanding and the practical skills of a junior data scientist. The exam is graded against the eight areas below.

01

Fundamental ML

Proficiency in fundamental machine learning algorithms, knowing when to reach for a model, and when deep learning would be overkill.

02

Programming skills

Comfort in Python, especially scikit-learn, Pandas, and NumPy. The everyday surface a junior DS lives in.

03

Data manipulation

Cleaning, manipulating, and preprocessing data using Python libraries. Reshape it before you fit it.

04

Data visualization

Using Python plotting tools to inspect data and communicate results. Matplotlib first, seaborn for shape.

05

Statistical knowledge

Working understanding of statistics, probability, and hypothesis testing, enough to interpret a model score.

06

Model evaluation

Cross-validation, confusion matrices, ROC curves. Knowing what a good number actually means in context.

07

Attention to detail

Strong attention to detail to ensure data accuracy and model reliability. The work behind reproducibility.

08

Problem solving

Logical analysis of issues, including design choices for data pipelines and how to evaluate them.

What do I need to know

Five topics. The shape of the exam.

Each topic block lists the concepts and the scikit-learn surface area you will be tested on. If you can read the snippet that follows each list and explain what it does, you are on track.

01

Machine learning concepts

The mental model. What a learning algorithm is, how it learns, and what can go wrong.

  • Types of ML: supervised, unsupervised, semi-supervised
  • Model families: tree-based, linear, ensemble, neighbors
  • Key concepts: features, labels, training and test sets
  • Overfitting and underfitting
  • The bias / variance trade-off
sklearn, topic-01.py
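from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris is only a stand-in dataset so the snippet runs on its own
X, y = load_iris(return_X_y=True)

# By default, 25% of the rows are held out as the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)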
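
Overfitting and underfitting are easier to recognise with numbers in front of you. The sketch below is one possible illustration, not exam material: it assumes a synthetic dataset from `make_classification` with deliberately noisy labels, and compares train and test accuracy for decision trees of different depths. The dataset parameters and the chosen depths are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y randomly reassigns labels so a perfect test score is out of reach
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for depth in (1, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

A very shallow tree tends to score modestly on both sets (underfitting). The unrestricted tree memorises the training set, and the gap between its train and test accuracy is the signature of overfitting.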
02

Model building and evaluation

Fit, predict, score. The everyday loop, plus what the score actually tells you.

  • Splitting datasets with train_test_split
  • Training models with fit()
  • Predicting with predict()
  • Evaluating with accuracy, precision, recall, F1, MSE, R²
  • Interpreting a score against a dummy baseline
sklearn, topic-02.py
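from sklearn.linear_model import LogisticRegression

# X_tr, X_te, y_tr, y_te come from train_test_split (topic 01)
model = LogisticRegression()
model.fit(X_tr, y_tr)

# For a classifier, score() returns the mean accuracy on the given data
score = model.score(X_te, y_te)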
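
A score only means something next to a baseline. As a minimal sketch, assuming an imbalanced synthetic dataset (the 90/10 class weights are an arbitrary choice), the example below compares a logistic regression with `DummyClassifier`, which always predicts the most frequent class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 90% of the samples belong to one class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The baseline ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

On data this imbalanced the baseline already looks strong on accuracy alone, which is why the model's score should be read as an improvement over the baseline rather than as an absolute number.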
03

Interpretation and communication

Plotting results and explaining them to people who do not read confusion matrices for fun.

  • Visualizing results with matplotlib and seaborn
  • Reading a confusion matrix and an ROC curve
  • Explaining performance to non-technical stakeholders
  • Reporting uncertainty without hand-waving
sklearn, topic-03.py
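from sklearn.metrics import ConfusionMatrixDisplay

# `model` is the fitted classifier from topic 02.
# from_estimator() predicts on X_te and draws the matrix itself,
# so a further .plot() call would only draw it a second time.
ConfusionMatrixDisplay.from_estimator(model, X_te, y_te)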
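
The numbers behind the plots can also be computed directly. The sketch below is illustrative only: it assumes the breast cancer dataset bundled with scikit-learn and a scaled logistic regression, then computes a confusion matrix, a ROC AUC, and a cross-validated accuracy with its spread as one simple way to report uncertainty.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te))
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# ROC AUC is computed from probabilities (or scores), not hard predictions
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")

# Five folds give five scores: report the mean together with the spread
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

For a non-technical audience, the four confusion matrix cells translate into plain counts of "caught", "missed", and "false alarm" cases, which is often easier to discuss than a single metric.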
04

Data preprocessing

Most of the job. Loading, cleaning, encoding, the work that decides whether the model can learn anything at all.

  • Loading parquet datasets
  • Scatterplots and boxplots for first look
  • Spotting wrongly encoded columns (float as string, etc.)
  • Imputation with SimpleImputer
  • Feature scaling: StandardScaler, MinMaxScaler
  • Encoding with OrdinalEncoder, OneHotEncoder
  • Combining steps with ColumnTransformer
sklearn, topic-04.py
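from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num and cat are lists of numeric and categorical column names
pre = ColumnTransformer([
    ("num", StandardScaler(), num),
    ("cat", OneHotEncoder(), cat),
])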
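
The steps listed for this topic can be chained end to end. Here is a minimal sketch under stated assumptions: the four-row table, its column names, and the median imputation strategy are all invented for illustration. It repairs a numeric column stored as strings, then imputes, scales, and encodes inside a `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up table: "income" holds numbers stored as strings
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": ["30000", "52000", "61000", "not available"],
    "city": ["Paris", "Lyon", "Paris", "Lyon"],
})

# Fix the wrongly encoded column; entries that cannot be parsed become NaN
df["income"] = pd.to_numeric(df["income"], errors="coerce")

num = ["age", "income"]
cat = ["city"]

pre = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the median, then standardize
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num),
        # Categorical column: one binary column per category
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = pre.fit_transform(df)
print(X_out.shape)
```

Keeping the imputer and scaler inside the transformer means their statistics are learned from whatever data `fit` receives, so the same object can later sit in a pipeline and be fitted on training folds only.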
05

Model selection and validation

Choosing the right model, tuning it honestly, and knowing how stable the answer is.

  • Cross-validation: KFold, ShuffleSplit, and friends
  • Reading learning and validation curves
  • Hyperparameter tuning with GridSearchCV, RandomizedSearchCV
  • Stability of learned coefficients across splits
sklearn, topic-05.py
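from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline; any pipeline ending in an estimator with a C parameter works
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Pipeline parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1, 10]},
    cv=5,
).fit(X_tr, y_tr)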
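
Stability of learned coefficients can be checked by refitting the same model on several splits and comparing what it learns. The sketch below is one way to do that, assuming the bundled breast cancer dataset and ten random 80/20 splits (both arbitrary choices). It uses `cross_validate` with `return_estimator=True` to keep each fitted model.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Ten random 80/20 splits of the data
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
out = cross_validate(model, X, y, cv=cv, return_estimator=True)

# One row of coefficients per fitted pipeline (last step is the classifier)
coefs = np.array([est[-1].coef_.ravel() for est in out["estimator"]])

print(f"test accuracy: {out['test_score'].mean():.3f} +/- {out['test_score'].std():.3f}")
print("per-feature coefficient std:", coefs.std(axis=0).round(2))
```

Features whose coefficient varies widely, or changes sign, from split to split deserve caution before anyone reads meaning into them.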
The certification ladder

Three levels. Start with Associate.

Three certifications, each matching a stage of a typical data scientist career path. You are looking at the first one.

YOU ARE HERE Level 01

Associate Practitioner

Junior data scientist. Fundamental ML, preprocessing, evaluation.

Get training with Skolar

Prepare with three online courses.
Free to start.

Each course matches a certification level and reflects a data scientist's typical career path. Start with the Associate course: paced lessons, notebooks, and practice questions written by the scikit-learn team.

skolar.probabl.ai · 3 courses
01 Associate Practitioner: 8 lessons, ~24 h, 3/8 complete
02 Professional: 10 lessons, ~32 h (locked)
03 Expert: 12 lessons, ~40 h (locked)
The exam, in brief

Logistics, plain.

Everything you need to plan your sitting, in six lines.

Format: Proctored online (via Webassessor)
Duration: 90 minutes (60 multiple-choice questions)
Passing: 70% (graded by topic area)
Languages: English (French coming Q3)
Fee: $299 USD (one retake included)
Validity: 3 years (renewable via Level 02)
FAQ

Questions we get a lot.

Ready when you are

Get certified by the team that ships scikit-learn.

90 minutes. $299 USD. A credential issued by the maintainers themselves.