Databricks Machine Learning Engineer Associate

Extended exam preparation guide for the Databricks ML Engineer Associate certification exam with practice questions!

Section 1: Databricks Machine Learning

MLOps Best Practices

MLOps is the discipline of applying DevOps practices to ML workflows.

MLOps Components:

  1. Version Control: code, notebooks, and configuration tracked in Git
  2. CI/CD: automated testing and deployment pipelines
  3. Workflows (Lakeflow Jobs): orchestrate dependencies between data preprocessing, feature engineering, model training, and inference tasks. Inference runs at a different frequency than training.
  4. Model Registry: store and version trained model artifacts with metadata. In Databricks, this is the Unity Catalog registry.
  5. Model Serving: Databricks Model Serving (real-time REST endpoints) or Kubernetes for low-latency deployments
  6. Monitoring: Lakehouse Monitoring for drift detection, performance degradation, and data quality
  7. Data Version Control: Delta Lake time travel for reproducibility and rollback
  8. Feature Store: consistent feature computation between training and serving
  9. Vector Database / LLM Tracing / Human-in-the-loop: for GenAI workflows

MLOps Principles, how you build and operate:

  • Documentation: track decisions, configurations, and assumptions
  • Code quality: pre-commit hooks, unit tests (verify individual functions), integration tests (verify end-to-end flows)
  • Traceability and reproducibility: same code + same data = same model; enables easy rollback
  • Monitoring and alerting: application performance, infrastructure health, and business metrics

Production-readiness refactoring: separate functions/classes/modules, isolate configurations, add logging, package the project code.

ML Runtimes

Databricks ML runtimes are pre-configured cluster images designed for machine learning workloads:

  • Pre-installed ML libraries: scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, HuggingFace
  • MLflow pre-installed and pre-configured, no manual setup required
  • GPU-optimized variants available (CUDA, cuDNN pre-configured)
  • Eliminates library conflicts and manual dependency management
  • Consistent environment across the team, reducing "works on my machine" issues

AutoML

Databricks AutoML automates feature selection, model selection, hyperparameter search, and preprocessing. All runs are logged to an MLflow experiment. The best run is surfaced in the AutoML UI.

In 2026, the primary entry point for AutoML is the Genie Code Data Science Agent.

Key advantage — the opaque-box problem: AutoML generates a fully editable source code notebook for the best run. You can open it, read the pipeline, and customize it with domain expertise, unlike black-box AutoML tools that return a model you cannot modify.

Feature Store: Unity Catalog vs Workspace

| | Unity Catalog Feature Store | Workspace Feature Store (Legacy) |
|---|---|---|
| Scope | Account-level, cross-workspace | Single workspace only |
| Client | FeatureEngineeringClient | FeatureStoreClient (deprecated) |
| Discovery | Searchable across the organization | Workspace-local only |
| Governance | UC permission model, row/column-level security | Workspace-level permissions only |
| Lineage | Full lineage: which models use which features | Limited |
| Cross-workspace access | Yes | No |

Why UC Feature Store matters: discoverability across teams, full lineage tracking, training/serving skew prevention, and online serving support.

Creating and Writing Feature Store Tables

Use FeatureEngineeringClient.create_table(). The table must have a primary key.

python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

fe.create_table(
    name="catalog.schema.user_features",
    primary_keys=["user_id"],
    schema=features_df.schema,
    description="User-level behavioral features"
)

Always use mode="merge" when writing to an existing feature table; it upserts by primary key. Using "overwrite" destroys existing features.

python
fe.write_table(
    name="catalog.schema.user_features",
    df=features_df,
    mode="merge"
)
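
The difference between the two write modes can be seen without Spark. The sketch below is a plain-Python analogy, not the Feature Engineering client: it models a feature table as a dict keyed by primary key, and the helper names merge_write and overwrite_write are invented for illustration.

```python
# Plain-Python analogy for feature table write modes (not the Databricks API).
# A "table" is a dict keyed by primary key; each value is a row of features.

def merge_write(table, new_rows):
    """Upsert: update rows whose key exists, insert new keys, keep the rest."""
    merged = dict(table)
    merged.update(new_rows)
    return merged

def overwrite_write(table, new_rows):
    """Overwrite: the previous contents are discarded entirely."""
    return dict(new_rows)

existing = {1: {"spend": 10.0}, 2: {"spend": 20.0}}
incoming = {2: {"spend": 25.0}, 3: {"spend": 5.0}}

merged = merge_write(existing, incoming)
overwritten = overwrite_write(existing, incoming)

print(sorted(merged))       # [1, 2, 3]: user 1 survives, user 2 is updated
print(sorted(overwritten))  # [2, 3]: user 1 is gone
```

In this toy version, user 1 keeps its features under merge and loses them under overwrite, which is the behavior the rule above guards against.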

Training Models with Feature Store Lookups

Define FeatureLookup objects to specify which features to pull, then pass to create_training_set. Log with fe.log_model(), not standard MLflow autolog; this packages feature metadata with the model artifact so inference can perform automatic feature lookup.

python
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup
import mlflow

fe = FeatureEngineeringClient()

feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.user_features",
        feature_names=["30d_spend", "avg_txn_value", "days_since_last_login"],
        lookup_key="user_id"
    )
]

training_set = fe.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="churn"
)

training_df = training_set.load_df()
model = train_model(training_df)

fe.log_model(
    model=model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set
)

For batch inference with automatic feature lookup:

python
predictions = fe.score_batch(
    model_uri="models:/catalog.schema.model_name/1",
    df=primary_keys_df    # only primary key columns needed
)

Online vs Offline Feature Tables

| | Offline Feature Table | Online Feature Table |
|---|---|---|
| Storage | Delta Lake | Low-latency key-value database |
| Latency | Seconds to minutes | Milliseconds (point lookup) |
| Use case | Batch training, batch inference | Real-time model serving |
| Sync | Source of truth | Synced from offline table via CDF |

If your model uses automatic feature lookups and is deployed to a real-time serving endpoint, online tables are strictly required. An offline Delta table cannot return features in milliseconds.

MLflow Client API

Finding the best run:

python
from mlflow import MlflowClient

client = MlflowClient()
runs = client.search_runs(
    experiment_ids=["<experiment_id>"],
    filter_string="",
    order_by=["metrics.val_rmse ASC"],   # ASC for error metrics
    max_results=1
)
best_run = runs[0]
best_run_id = best_run.info.run_id

Manual logging:

python
import mlflow

with mlflow.start_run() as run:
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("val_rmse", val_rmse)
    mlflow.log_artifact("feature_importance.png")
    mlflow.sklearn.log_model(model, artifact_path="model")

What the MLflow UI exposes: parameters, metrics (with step history), artifacts, source notebook link, run metadata (ID, start time, duration, status), tags, model signature, and dataset information.

Registering Models in Unity Catalog

python
import mlflow

mlflow.set_registry_uri("databricks-uc")

model_uri = f"runs:/{run_id}/model"
registered_model = mlflow.register_model(
    model_uri=model_uri,
    name="catalog.schema.model_name"   # three-level UC namespace
)

| | Unity Catalog Registry | Workspace Registry |
|---|---|---|
| Scope | Account-level, cross-workspace | Single workspace |
| Access control | UC permission model | Workspace-level ACLs |
| Lineage | Full lineage: features → model → serving | Limited |
| Multi-env support | Share model across dev/staging/prod | Must copy artifacts manually |

In MLflow 3, Unity Catalog is the default registry. The workspace registry is legacy.

Promoting Code vs Promoting Models

Promote models (move the trained artifact through environments):

  • Training data is the same across dev / staging / prod
  • You want to validate the exact artifact that will run in production
  • Example: a batch scoring model trained on a static historical dataset

Promote code (move the training script; retrain in each environment):

  • The model must be trained on environment-specific data
  • Regulatory requirements mandate that the production model was trained on production data
  • Example: a fraud detection model that must be trained on live production transactions

Tags and Aliases

Tags are key-value labels for governance and filtering:

python
from mlflow import MlflowClient

client = MlflowClient()

client.set_registered_model_tag("catalog.schema.model_name", "team", "ml-platform")
client.set_model_version_tag("catalog.schema.model_name", "2", "validated", "true")
client.delete_registered_model_tag("catalog.schema.model_name", "team")
client.delete_model_version_tag("catalog.schema.model_name", "2", "validated")

Aliases replace stage transitions (Staging/Production) in Unity Catalog; they are mutable pointers to specific model versions:

python
client.set_registered_model_alias("catalog.schema.model_name", "champion", "3")
client.set_registered_model_alias("catalog.schema.model_name", "challenger", "4")

# Load by alias — no hardcoded version numbers in downstream code
champion = mlflow.pyfunc.load_model("models:/catalog.schema.model_name@champion")

When the challenger outperforms the champion, reassign the champion alias. No version numbers change in downstream code.
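
To make the "mutable pointer" idea concrete, here is a toy model in plain Python. It is not the MLflow API: the aliases dict and the resolve helper are hypothetical stand-ins for the registry and for loading by alias.

```python
# Toy model of registry aliases (not the MLflow API): an alias is a mutable
# name -> version pointer, and consumers resolve it at load time.

aliases = {"champion": 3, "challenger": 4}

def resolve(alias):
    """Return the model version the alias currently points to."""
    return aliases[alias]

before = resolve("champion")

# The challenger wins the comparison: repoint the champion alias.
aliases["champion"] = aliases["challenger"]

after = resolve("champion")
print(before, after)  # 3 4
```

The consumer keeps calling resolve("champion") and picks up version 4 without any change on its side.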

Sample Questions

Section 2: Data Preparation for Machine Learning

Summary Statistics

python
# .summary() — full descriptive statistics including percentiles
df.summary().show()
# Returns: count, mean, stddev, min, 25%, 50%, 75%, max for each column

# .describe() — subset (no percentiles)
df.describe().show()

# dbutils data summaries (richest option in Databricks notebooks)
dbutils.data.summarize(df)
# Returns interactive HTML summary with distribution histograms and null counts

Removing Outliers

Standard deviation method: best for normally distributed features:

python
from pyspark.sql import functions as F

mean_val = df.select(F.mean("feature")).collect()[0][0]
stddev_val = df.select(F.stddev("feature")).collect()[0][0]

df_clean = df.filter(F.abs(df["feature"] - mean_val) <= 3 * stddev_val)

IQR method: more robust for skewed distributions:

python
Q1, Q3 = df.approxQuantile("feature", [0.25, 0.75], 0.0)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_clean = df.filter((df["feature"] >= lower_bound) & (df["feature"] <= upper_bound))

Decision rule: Use IQR for skewed data; use standard deviation for approximately normal data.
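
A small plain-Python illustration of why the decision rule matters, using made-up right-skewed data. The quantile helper is a simple linear-interpolation stand-in for approxQuantile, and the sample standard deviation matches what F.stddev computes.

```python
import statistics

def quantile(sorted_vals, q):
    """Linear-interpolation quantile on pre-sorted data (illustrative helper)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

values = sorted([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 30, 40])  # right-skewed

# Standard deviation rule: keep values within 3 sample std devs of the mean
mean, std = statistics.mean(values), statistics.stdev(values)
sigma_outliers = [v for v in values if abs(v - mean) > 3 * std]

# IQR rule: keep values within 1.5 * IQR of the quartiles
q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
iqr = q3 - q1
iqr_outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(sigma_outliers)  # []
print(iqr_outliers)    # [30, 40]
```

The two extreme values inflate the mean and standard deviation enough to hide themselves from the 3-sigma rule, while the quartile-based bounds are unaffected by them.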

Visualizations

Categorical features:

  • Bar chart: shows frequency of each category
  • display(df.groupBy("category_col").count().orderBy("count", ascending=False))

Continuous features:

  • Histogram: shows distribution shape, skewness, and outliers
  • Box plot: shows median, IQR, and outliers in a single view

Comparing two categorical features: crosstab (df.stat.crosstab()), chi-squared test, grouped bar chart.

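Comparing two continuous features: Pearson correlation (df.stat.corr(), which supports only Pearson), Spearman rank correlation (pyspark.ml.stat.Correlation with method="spearman"; less sensitive to outliers), scatter plot. Use Pearson for linear relationships; use Spearman for monotonic but non-linear relationships or skewed data.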
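
The difference between the two coefficients shows up clearly on a monotonic but non-linear relationship. The following is a minimal pure-Python sketch on synthetic data; the pearson, ranks, and spearman helpers are written here for illustration and are not Spark functions.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]  # monotonic but strongly non-linear

print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Spearman comes out at 1 because the ordering is perfectly preserved, while Pearson is noticeably lower because the relationship is not a straight line.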

Imputing Missing Values

| Method | Best For | Caveat |
|---|---|---|
| Mean | Normally distributed continuous features | Sensitive to outliers |
| Median | Skewed continuous features | Robust to outliers |
| Mode | Categorical features | Can inflate one dominant category |

Always examine the distribution first: there is no one-size-fits-all imputation strategy.

python
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["col_a", "col_b"],
    outputCols=["col_a_imputed", "col_b_imputed"],
    strategy="mean"      # or "median"
)
imputer_model = imputer.fit(train_df)
train_imputed = imputer_model.transform(train_df)

# Mode imputation (categorical, manual); exclude nulls so the mode is never null
mode_val = (df.filter(df["cat_col"].isNotNull())
              .groupBy("cat_col").count()
              .orderBy("count", ascending=False)
              .first()[0])
df_imputed = df.fillna({"cat_col": mode_val})
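
The sensitivity of the mean to outliers is easy to check by hand. This plain-Python sketch uses an invented income column with one extreme value and two missing entries.

```python
import statistics

# Hypothetical incomes (in thousands) with two missing values and one outlier
incomes = [30, 32, 35, None, 38, 40, None, 500]
observed = [v for v in incomes if v is not None]

mean_fill = statistics.mean(observed)      # pulled upward by the 500
median_fill = statistics.median(observed)  # stays near the bulk of the data

imputed_mean = [mean_fill if v is None else v for v in incomes]
imputed_median = [median_fill if v is None else v for v in incomes]

print(mean_fill, median_fill)  # 112.5 36.5
```

Filling with 112.5 would place the missing rows far above every typical value, which is why the median is the safer choice for skewed features.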

One-Hot Encoding

One-hot encoding creates a binary column for every unique category, resulting in a sparse high-dimensional vector.

python
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")

pipeline = Pipeline(stages=[indexer, encoder])
ohe_model = pipeline.fit(train_df)
train_encoded = ohe_model.transform(train_df)

| Scenario | OHE Appropriate? | Reason |
|---|---|---|
| Low-cardinality nominal feature in a linear model | Yes | Linear models need independent coefficients per category |
| High-cardinality feature (hundreds of categories) | No | Curse of dimensionality; sparse matrix issues |
| Feature in a tree-based model (RF, GBM) | Not needed | Trees handle label-encoded integers natively |
| Ordinal feature (low/med/high) | No | Use ordinal encoding to preserve rank |

For high-cardinality features, group rare categories into "Other" before encoding, or use target encoding.
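
Grouping rare categories can be sketched in a few lines of plain Python. The group_rare helper and the min_count threshold are illustrative choices, not part of any Spark API.

```python
from collections import Counter

def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

cities = ["NYC"] * 5 + ["LA"] * 4 + ["Boise", "Tulsa", "Reno"]
grouped = group_rare(cities, min_count=2)

print(Counter(grouped))  # NYC: 5, LA: 4, Other: 3
```

Five distinct categories collapse to three, so a subsequent one-hot encoding produces a narrower vector. In practice the counts should come from the training set only.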

Log Scale Transformation

Apply log transformation when:

  • Feature or target has a highly right-skewed distribution (income, house prices, transaction amounts)
  • Values span multiple orders of magnitude
  • The relationship between feature and target is multiplicative rather than additive

python
from pyspark.sql import functions as F
df = df.withColumn("log_price", F.log("price"))

Critical: If you train on a log-transformed target, you must exponentiate predictions before computing evaluation metrics or interpreting results:

python
import numpy as np
predictions_original_scale = np.exp(log_predictions)
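
One practical wrinkle is that the logarithm is undefined at zero. A common workaround, shown here as an assumption rather than something the guide prescribes, is log1p on the way in and expm1 on the way back.

```python
import math

prices = [0.0, 9.0, 99.0, 999.0]

# math.log(0.0) raises ValueError; log1p(x) = log(1 + x) is defined at zero.
log_prices = [math.log1p(p) for p in prices]

# expm1(x) = exp(x) - 1 is the inverse of log1p, so the round trip
# recovers the original scale (up to floating-point error).
restored = [math.expm1(v) for v in log_prices]

print([round(v, 3) for v in log_prices])
print([round(v, 3) for v in restored])
```

Whichever transform is used for training, the matching inverse has to be applied to predictions: exp for log, expm1 for log1p.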

Sample Questions

Section 3: Model Development

Algorithm Selection

| Scenario | Recommended Algorithms |
|---|---|
| Binary classification, interpretability needed | Logistic Regression, Decision Tree |
| Binary classification, high performance | Random Forest, Gradient Boosting (XGBoost, LightGBM) |
| Multi-class classification | Random Forest, Gradient Boosting, Multinomial Logistic Regression |
| Regression, linear relationship | Linear Regression |
| Regression, non-linear relationship | Random Forest Regressor, Gradient Boosting Regressor |
| Clustering (no labels) | K-Means, DBSCAN |
| Recommendation / collaborative filtering | ALS via Spark ML |
| High-dimensional sparse data | Regularized linear models (L1/L2) |

Key decision factors: linear vs non-linear relationships, interpretability requirement, dataset size (Spark ML vs scikit-learn), labeled vs unlabeled data.

Data Imbalance

Class imbalance occurs when one class has far fewer instances than another. Standard accuracy becomes misleading; a model that always predicts the majority class can appear highly accurate while being useless.

Mitigation strategies:

  1. Cost-sensitive learning: class_weight="balanced" in scikit-learn, or weightCol in Spark ML. Directly penalizes the model for ignoring the minority class.
  2. Oversampling: SMOTE generates synthetic minority-class examples
  3. Undersampling: randomly remove majority-class records
  4. Appropriate metrics: use F1, ROC/AUC, or Precision-Recall AUC instead of accuracy
  5. Stratified splits: preserve class ratio in both train and test sets
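
The accuracy trap and the "balanced" weighting heuristic can both be checked on a toy label set. The sketch below uses plain Python with invented data; the weight formula n_samples / (n_classes * class_count) is the one scikit-learn documents for class_weight="balanced".

```python
from collections import Counter

labels = [0] * 95 + [1] * 5   # 5% minority class
always_majority = [0] * 100   # a "model" that never predicts the minority

accuracy = sum(p == y for p, y in zip(always_majority, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(always_majority, labels))
recall = true_pos / sum(y == 1 for y in labels)

# Balanced class weights: n_samples / (n_classes * class_count)
counts = Counter(labels)
weights = {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

print(accuracy, recall)  # 0.95 0.0
print(weights)           # minority class weighted 10.0, majority about 0.53
```

The useless model scores 95% accuracy with zero recall, and the balanced weights make each minority example count roughly 19 times as much as a majority example.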

Estimators vs Transformers

| | Estimator | Transformer |
|---|---|---|
| Definition | Learns parameters from data | Applies a fixed or learned transformation |
| Key method | .fit(df) → returns a fitted Model | .transform(df) → returns a new DataFrame |
| Examples | Unfitted: LinearRegression, RandomForestClassifier, StandardScaler | Fitted: LinearRegressionModel, RandomForestClassificationModel, StandardScalerModel |

A Pipeline is itself an Estimator. Calling .fit() trains all stages and returns a PipelineModel (a Transformer).
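
The fit/transform contract can be mimicked in a few lines of plain Python. The MeanCenterer classes below are invented for illustration and only imitate the shape of the Spark ML API.

```python
class MeanCenterer:
    """Estimator: fit() learns a parameter (the mean) and returns a model."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))


class MeanCentererModel:
    """Transformer: holds the learned parameter and only applies it."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


train = [1.0, 2.0, 3.0]
test = [10.0]

model = MeanCenterer().fit(train)  # the mean is learned from training data only
print(model.transform(test))       # [8.0]
```

Because the model carries only what it learned during fit, applying it to test data cannot leak test statistics into preprocessing.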

Training Pipelines

A Spark ML Pipeline chains transformers and a final estimator, ensuring consistent preprocessing across train and test sets.

python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier

indexer  = StringIndexer(inputCol="category", outputCol="category_idx")
encoder  = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")
assembler = VectorAssembler(inputCols=["feat1", "feat2", "category_ohe"], outputCol="features_raw")
scaler   = StandardScaler(inputCol="features_raw", outputCol="features")
rf       = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)

pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])
model    = pipeline.fit(train_df)        # Estimator: learns all parameters
predictions = model.transform(test_df)  # Transformer: applies all stages

Pipelines prevent data leakage: the scaler is fit only on training data, then applied consistently to test data.

Hyperparameter Tuning with Hyperopt

python
from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
import mlflow

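def objective(params):
    # hp.quniform samples floats; cast integer-valued hyperparameters first
    params = {**params,
              "max_depth": int(params["max_depth"]),
              "num_leaves": int(params["num_leaves"])}
    with mlflow.start_run(nested=True):
        model = train_model(params)          # user-defined training helper
        val_loss = evaluate(model, val_df)   # user-defined evaluation helper
        mlflow.log_metric("val_loss", val_loss)
    return {"loss": val_loss, "status": STATUS_OK}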

search_space = {
    "max_depth":     hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "num_leaves":    hp.quniform("num_leaves", 20, 150, 1)
}

best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=50,
    trials=SparkTrials()    # distributes trials across Spark workers
)

fmin minimizes the return value. For metrics where higher is better (e.g., accuracy), return -accuracy.
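
The sign flip can be demonstrated with a toy minimizer standing in for fmin. Everything here is illustrative: toy_accuracy is a made-up metric, and the random search is a simple substitute for TPE.

```python
import random

def toy_accuracy(threshold):
    """Hypothetical validation accuracy that peaks at threshold = 0.6."""
    return 1.0 - abs(threshold - 0.6)

def objective(threshold):
    # The optimizer minimizes, so return the negated accuracy as the loss.
    return -toy_accuracy(threshold)

rng = random.Random(0)
candidates = [rng.random() for _ in range(200)]

# Minimizing the negated metric selects the candidate with maximum accuracy.
best = min(candidates, key=objective)
print(round(best, 2), round(toy_accuracy(best), 2))
```

Returning the accuracy without the minus sign would make the same search converge on the worst candidate instead.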

Search Strategies

| Strategy | How it works | Efficiency | Best when |
|---|---|---|---|
| Grid search | Exhaustively tries every combination | Low | Small search spaces |
| Random search | Samples uniformly at random | Higher than grid | Large search spaces |
| Bayesian (TPE) | Uses prior results to focus search on promising regions | Highest | Large spaces; default in Hyperopt and Optuna |

Optuna key terms: Study (optimization session), Trial (single call to objective), Pruning (halts unpromising trials early), MLflowCallback (auto-logs each trial into MLflow with parent-child hierarchy).

Parallelizing Single-Node Models

Single-node models (scikit-learn, XGBoost) don't distribute internally, but you can parallelize the hyperparameter search:

python
# Hyperopt: SparkTrials runs trials on Spark workers
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=SparkTrials(parallelism=4))

Important: Optuna's n_jobs uses multi-threading on a single machine. Due to Python's GIL, this achieves concurrency but not true CPU parallelism. For genuine distribution, use SparkTrials (Hyperopt) or MlflowSparkStudy (Optuna).

Cross-Validation vs Train-Validation Split

| | Cross-Validation (k-fold) | Train-Validation Split |
|---|---|---|
| Benefit | More robust generalization estimate; each point used for validation exactly once | Fast and simple |
| Benefit | Uses all data for both training and validation | Lower memory and compute requirements |
| Downside | Computationally expensive: trains k models per hyperparameter combo | Estimate depends on which data ended up in which split |
| Downside | Risk of temporal leakage with random splits on time-series data | May overfit to the particular split chosen |

Use cross-validation when data is limited or you need a reliable generalization estimate. Use a train-validation split for large datasets, and for time-series data split chronologically rather than randomly.

python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

paramGrid = (ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=5)

cv_model = cv.fit(train_df)
best_model = cv_model.bestModel

Model count formula: (hyperparameter combinations) × k folds

Example: C=[0.1,1,10], kernel=['linear','rbf'], gamma=[0.01,0.1,1] with 5-fold CV → 3×2×3 = 18 combinations × 5 = 90 models.
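
The same count can be reproduced by enumerating the grid. This is a small standard-library sketch of the arithmetic above, not a tuning run.

```python
from itertools import product

grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1],
}
k_folds = 5

combinations = list(product(*grid.values()))  # every hyperparameter combo
cv_fits = len(combinations) * k_folds         # one fit per combo per fold

print(len(combinations), cv_fits)  # 18 90
```

Some implementations, including Spark's CrossValidator, refit the winning configuration once more on the full training set, so the total number of fits can be one higher than the formula gives.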

Classification Metrics

| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | % of correct predictions | Balanced classes, equal cost of all errors |
| Precision | Of predicted positives, how many are actually positive | False positives are costly (e.g., spam filter) |
| Recall | Of actual positives, how many were caught | False negatives are costly (e.g., medical diagnosis, fraud) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced data; both error types matter |
| Log Loss | Penalizes confident wrong predictions | When calibrated probabilities matter |
| ROC/AUC | Distinguishing ability across all thresholds | Comparing models independent of threshold |

AUC = 1.0 → perfect; AUC = 0.5 → random guessing; AUC below 0.5 (ROC curve under the diagonal) → worse than random, which usually means the labels or scores are inverted.
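
The "flipped labels" reading follows from the pairwise definition of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch on invented scores:

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]

auc_correct = auc(labels, scores)
auc_flipped = auc([1 - y for y in labels], scores)  # same scores, inverted labels

print(round(auc_correct, 3), round(auc_flipped, 3))  # 0.889 0.111
```

With no tied scores, inverting the labels maps an AUC of a to 1 - a, so a score well below 0.5 is a model that discriminates but points the wrong way.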

Clustering metrics: Silhouette score (higher = better-defined clusters), Elbow method (plot inertia vs K; pick the elbow).

Regression Metrics

| Metric | Units | Key property |
|---|---|---|
| R² | Unitless (typically 0 to 1) | Proportion of variance explained; 1.0 = perfect |
| RMSE | Same as target | Penalizes large errors more; most commonly used |
| MAE | Same as target | Robust to outliers; treats all errors equally |
| MSE | Squared units of target | Used internally in many optimizers |

RMSE vs MAE: use RMSE when large errors are especially costly; use MAE when you want equal treatment or outliers shouldn't dominate.
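
A quick numerical check of that difference, using two invented sets of prediction errors that differ only in one large miss:

```python
import math

def rmse(errors):
    """Root mean squared error: squaring amplifies large errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(e) for e in errors) / len(errors)

small = [1, -1, 1, -1]
with_outlier = [1, -1, 1, -9]

print(rmse(small), mae(small))                                # 1.0 1.0
print(round(rmse(with_outlier), 3), mae(with_outlier))        # 4.583 3.0
```

The single large error triples MAE but more than quadruples RMSE, which is the behavior to want when large misses are disproportionately costly.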

Metric Selection by Scenario

| Scenario | Best Metric | Why |
|---|---|---|
| Medical diagnosis: missing a positive case is catastrophic | Recall | Minimize false negatives |
| Spam filter: sending legitimate email to spam is costly | Precision | Minimize false positives |
| Fraud detection: imbalanced classes | F1, ROC/AUC | Balance precision/recall |
| Predicting house prices: large errors especially bad | RMSE | Penalizes large errors more |
| Predicting delivery time: outliers shouldn't dominate | MAE | Robust to outliers |
| Explaining to business stakeholders | R² | Intuitive: "explains X% of the variance" |

Exponentiating Log-Transformed Targets

When a regression model is trained on a log-transformed target, predictions are also on the log scale. Before computing RMSE or interpreting predictions, exponentiate back:

python
import numpy as np

log_predictions = model.predict(X_test)
predictions = np.exp(log_predictions)      # back to original scale
actual = np.exp(y_test_log)

rmse = np.sqrt(np.mean((predictions - actual) ** 2))

Bias-Variance Tradeoff

| | Training Error | Validation Error | Interpretation |
|---|---|---|---|
| High bias (underfitting) | High | High | Model too simple |
| High variance (overfitting) | Low | High | Model memorized training data |
| Well-fit | Low | Low | Generalizes to unseen data |

Fixes:

| Problem | Fixes |
|---|---|
| High bias | More complex model, add features, reduce regularization |
| High variance | More training data, stronger regularization, simpler model, early stopping |

As model complexity increases, training error monotonically decreases, but validation error forms a U-shape. The optimal model sits at the bottom of the validation error curve.
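
Picking the bottom of the U can be expressed directly. The error values and the diagnose thresholds below are invented for illustration; real cut-offs depend on the metric and the problem.

```python
# Hypothetical errors by model complexity (e.g., tree depth)
train_errors = {1: 0.30, 2: 0.20, 3: 0.12, 4: 0.06, 5: 0.02}
val_errors = {1: 0.32, 2: 0.23, 3: 0.18, 4: 0.21, 5: 0.27}

# The optimal complexity minimizes validation error, not training error.
best_complexity = min(val_errors, key=val_errors.get)

def diagnose(train_error, val_error, high=0.25, gap=0.1):
    """Rule-of-thumb reading of train/validation error (illustrative thresholds)."""
    if train_error >= high:
        return "high bias"
    if val_error - train_error >= gap:
        return "high variance"
    return "well-fit"

print(best_complexity)  # 3
for c in train_errors:
    print(c, diagnose(train_errors[c], val_errors[c]))
```

Training error keeps falling through complexity 5, but the widening gap to validation error marks the move from well-fit into overfitting.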

Sample Questions

Section 4: Model Deployment

Batch vs Streaming vs Real-Time

| | Batch | Streaming | Real-Time |
|---|---|---|---|
| Latency | High (minutes to hours) | Medium (seconds) | Low (milliseconds) |
| Throughput | Very high | High | Moderate |
| Input pace | Scheduled, periodic | Continuous stream | On-demand, request-by-request |
| Infrastructure | Databricks Jobs, Spark | Lakeflow Spark Declarative Pipelines | Databricks Model Serving |
| Best for | Nightly scoring, ETL | Event-driven inference, IoT | Fraud detection, recommendations, chatbots |

Batch is unsuitable when data changes faster than every ~30 minutes or stale predictions are harmful. Streaming is event-driven and not suitable for millisecond needs. Real-time requires online feature stores when using automatic feature lookups.

Deploying Custom Models

To deploy custom logic (pre/post-processing, output transformation), extend mlflow.pyfunc.PythonModel:

python
import mlflow
import mlflow.pyfunc

class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import pickle
        with open(context.artifacts["model_path"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        raw_predictions = self.model.predict(model_input)
        return ["high_risk" if p > 0.7 else "low_risk" for p in raw_predictions]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=CustomModel(),
        artifacts={"model_path": "model.pkl"}
    )

Batch Inference

python
import mlflow
import pandas as pd

model = mlflow.pyfunc.load_model("models:/catalog.schema.model_name/1")

pandas_df = pd.read_parquet("data/batch_input.parquet")
predictions = model.predict(pandas_df)

For Feature Store models, use fe.score_batch(): it handles feature lookups automatically from a Spark DataFrame of primary keys:

python
predictions_spark = fe.score_batch(
    model_uri="models:/catalog.schema.model_name/1",
    df=primary_keys_spark_df
)

Streaming Inference with Delta Live Tables

Use Lakeflow Spark Declarative Pipelines (formerly DLT) with the MLflow model loaded as a Spark UDF:

python
import dlt
import mlflow

predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/catalog.schema.model_name/1")

# Placeholder column names; replace with the model's actual input features
feature_cols = ["feat1", "feat2"]

@dlt.table
def inference_results():
    streaming_df = spark.readStream.table("source_table")
    return streaming_df.withColumn("prediction", predict_udf(*feature_cols))

DLT handles auto-scaling (Enhanced Autoscaling on by default), triggered and continuous modes, data expectations, schema evolution, and unified batch + streaming in the same pipeline code.

Real-Time Inference: Deploy and Query

python
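from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

client = WorkspaceClient()
# The SDK expects typed config objects rather than a raw dict
client.serving_endpoints.create(
    name="my-model-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[ServedEntityInput(
            entity_name="catalog.schema.model_name",
            entity_version="3",
            workload_size="Small",
            scale_to_zero_enabled=True
        )]
    )
)

python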
import requests, json

url = "https://<databricks-host>/serving-endpoints/my-model-endpoint/invocations"
response = requests.post(url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    data=json.dumps({"inputs": [[1.0, 2.0, 3.0]]})
)
predictions = response.json()

A/B Testing with Traffic Splitting

Configure a single endpoint to split traffic between model versions:

python
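from databricks.sdk.service.serving import Route, ServedEntityInput, TrafficConfig

# Each served entity gets a name; traffic_config routes a percentage to each name
client.serving_endpoints.update_config(
    name="my-model-endpoint",
    served_entities=[
        ServedEntityInput(name="v3", entity_name="catalog.schema.model_name", entity_version="3",
                          workload_size="Small", scale_to_zero_enabled=True),
        ServedEntityInput(name="v4", entity_name="catalog.schema.model_name", entity_version="4",
                          workload_size="Small", scale_to_zero_enabled=True),
    ],
    traffic_config=TrafficConfig(routes=[Route(served_model_name="v3", traffic_percentage=50),
                                         Route(served_model_name="v4", traffic_percentage=50)]),
)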

Monitor both versions' metrics; when the challenger demonstrates sufficient improvement, update the @champion alias.
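
What a percentage-based split means per request can be simulated locally. This is a toy weighted router in plain Python, not how Model Serving is implemented; the route names and the seed are arbitrary.

```python
import random
from collections import Counter

routes = {"version-3": 50, "version-4": 50}  # traffic percentages
rng = random.Random(42)                      # seeded for repeatability

def route_request():
    """Pick a model version for one request, weighted by traffic percentage."""
    return rng.choices(list(routes), weights=list(routes.values()), k=1)[0]

counts = Counter(route_request() for _ in range(10_000))
share_v3 = counts["version-3"] / 10_000

print(counts)
```

Each request is assigned independently, so the observed shares hover around the configured percentages rather than matching them exactly. Comparisons between versions should therefore use enough traffic for the difference in metrics to be meaningful.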

Sample Questions