Section 1: Databricks Machine Learning
MLOps Best Practices
MLOps is the discipline of applying DevOps practices to ML workflows.
MLOps Components:
- Version Control: code, notebooks, and configuration tracked in Git
- CI/CD: automated testing and deployment pipelines
- Workflows (Lakeflow Jobs): orchestrate dependencies between data preprocessing, feature engineering, model training, and inference tasks. Inference runs at a different frequency than training.
- Model Registry: store and version trained model artifacts with metadata. In Databricks, this is the Unity Catalog registry.
- Model Serving: Databricks Model Serving (real-time REST endpoints) or Kubernetes for low-latency deployments
- Monitoring: Lakehouse Monitoring for drift detection, performance degradation, and data quality
- Data Version Control: Delta Lake time travel for reproducibility and rollback
- Feature Store: consistent feature computation between training and serving
- Vector Database / LLM Tracing / Human-in-the-loop: for GenAI workflows
MLOps Principles (how you build and operate):
- Documentation: track decisions, configurations, and assumptions
- Code quality: pre-commit hooks, unit tests (verify individual functions), integration tests (verify end-to-end flows)
- Traceability and reproducibility: same code + same data = same model; enables easy rollback
- Monitoring and alerting: application performance, infrastructure health, and business metrics
Production-readiness refactoring: separate functions/classes/modules, isolate configurations, add logging, package the project code.
ML Runtimes
Databricks ML runtimes are pre-configured cluster images designed for machine learning workloads:
- Pre-installed ML libraries: scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, HuggingFace
- MLflow pre-installed and pre-configured, no manual setup required
- GPU-optimized variants available (CUDA, cuDNN pre-configured)
- Eliminates library conflicts and manual dependency management
- Consistent environment across the team, reducing "works on my machine" issues
AutoML
Databricks AutoML automates feature selection, model selection, hyperparameter search, and preprocessing. All runs are logged to an MLflow experiment. The best run is surfaced in the AutoML UI.
In 2026, the primary entry point for AutoML is the Genie Code Data Science Agent.
Key advantage: AutoML avoids the opaque-box problem. It generates a fully editable source code notebook for the best run, so you can open it, read the pipeline, and customize it with domain expertise. Black-box AutoML tools, by contrast, return a model you cannot inspect or modify.
Feature Store: Unity Catalog vs Workspace
| | Unity Catalog Feature Store | Workspace Feature Store (Legacy) |
|---|---|---|
| Scope | Account-level, cross-workspace | Single workspace only |
| Client | FeatureEngineeringClient | FeatureStoreClient (deprecated) |
| Discovery | Searchable across the organization | Workspace-local only |
| Governance | UC permission model, row/column-level security | Workspace-level permissions only |
| Lineage | Full lineage: which models use which features | Limited |
| Cross-workspace access | Yes | No |
Why UC Feature Store matters: discoverability across teams, full lineage tracking, training/serving skew prevention, and online serving support.
Creating and Writing Feature Store Tables
Use FeatureEngineeringClient.create_table(). The table must have a primary key.
from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
fe.create_table(
name="catalog.schema.user_features",
primary_keys=["user_id"],
schema=features_df.schema,
description="User-level behavioral features"
)
Always use mode="merge" when writing to an existing feature table; it upserts by primary key. Using "overwrite" destroys existing features.
fe.write_table(
name="catalog.schema.user_features",
df=features_df,
mode="merge"
)
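The difference between the two write modes can be sketched with a toy keyed table in plain Python. This is an illustration of upsert semantics only, not how Delta Lake or the Feature Engineering client is implemented; the `write_table` helper and the sample rows are hypothetical.

```python
def write_table(existing, new_rows, key, mode):
    """Toy model of a keyed table write. `existing` maps primary key -> row dict."""
    if mode == "overwrite":
        # Everything not present in the new batch is lost
        return {row[key]: dict(row) for row in new_rows}
    if mode == "merge":
        # Upsert: update rows whose key matches, insert the rest, keep untouched rows
        merged = {k: dict(v) for k, v in existing.items()}
        for row in new_rows:
            merged[row[key]] = {**merged.get(row[key], {}), **row}
        return merged
    raise ValueError(f"unsupported mode: {mode}")

table = {1: {"user_id": 1, "spend": 10.0}, 2: {"user_id": 2, "spend": 20.0}}
batch = [{"user_id": 2, "spend": 25.0}, {"user_id": 3, "spend": 5.0}]

merged = write_table(table, batch, "user_id", "merge")
overwritten = write_table(table, batch, "user_id", "overwrite")
```

After the merge, user 1 is still present and user 2 carries the new value; after the overwrite, user 1 is gone.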
Training Models with Feature Store Lookups
Define FeatureLookup objects to specify which features to pull, then pass to create_training_set. Log with fe.log_model(), not standard MLflow autolog; this packages feature metadata with the model artifact so inference can perform automatic feature lookup.
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup
import mlflow
fe = FeatureEngineeringClient()
feature_lookups = [
FeatureLookup(
table_name="catalog.schema.user_features",
feature_names=["30d_spend", "avg_txn_value", "days_since_last_login"],
lookup_key="user_id"
)
]
training_set = fe.create_training_set(
df=labels_df,
feature_lookups=feature_lookups,
label="churn"
)
training_df = training_set.load_df()
model = train_model(training_df)
fe.log_model(
model=model,
artifact_path="model",
flavor=mlflow.sklearn,
training_set=training_set
)
For batch inference with automatic feature lookup:
predictions = fe.score_batch(
model_uri="models:/catalog.schema.model_name/1",
df=primary_keys_df # only primary key columns needed
)
Online vs Offline Feature Tables
| | Offline Feature Table | Online Feature Table |
|---|---|---|
| Storage | Delta Lake | Low-latency key-value database |
| Latency | Seconds to minutes | Milliseconds (point lookup) |
| Use case | Batch training, batch inference | Real-time model serving |
| Sync | Source of truth | Synced from offline table via CDF |
If your model uses automatic feature lookups and is deployed to a real-time serving endpoint, online tables are strictly required. An offline Delta table cannot return features in milliseconds.
MLflow Client API
Finding the best run:
from mlflow import MlflowClient
client = MlflowClient()
runs = client.search_runs(
experiment_ids=["<experiment_id>"],
filter_string="",
order_by=["metrics.val_rmse ASC"], # ASC for error metrics
max_results=1
)
best_run = runs[0]
best_run_id = best_run.info.run_id
Manual logging:
import mlflow
with mlflow.start_run() as run:
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("val_rmse", val_rmse)
    mlflow.log_artifact("feature_importance.png")
    mlflow.sklearn.log_model(model, artifact_path="model")
What the MLflow UI exposes: parameters, metrics (with step history), artifacts, source notebook link, run metadata (ID, start time, duration, status), tags, model signature, and dataset information.
Registering Models in Unity Catalog
import mlflow
mlflow.set_registry_uri("databricks-uc")
model_uri = f"runs:/{run_id}/model"
registered_model = mlflow.register_model(
model_uri=model_uri,
name="catalog.schema.model_name" # three-level UC namespace
)
| | Unity Catalog Registry | Workspace Registry |
|---|---|---|
| Scope | Account-level, cross-workspace | Single workspace |
| Access control | UC permission model | Workspace-level ACLs |
| Lineage | Full lineage: features → model → serving | Limited |
| Multi-env support | Share model across dev/staging/prod | Must copy artifacts manually |
In MLflow 3, Unity Catalog is the default registry. The workspace registry is legacy.
Promoting Code vs Promoting Models
Promote models (move the trained artifact through environments):
- Training data is the same across dev / staging / prod
- You want to validate the exact artifact that will run in production
- Example: a batch scoring model trained on a static historical dataset
Promote code (move the training script; retrain in each environment):
- The model must be trained on environment-specific data
- Regulatory requirements mandate that the production model was trained on production data
- Example: a fraud detection model that must be trained on live production transactions
Tags and Aliases
Tags are key-value labels for governance and filtering:
from mlflow import MlflowClient
client = MlflowClient()
client.set_registered_model_tag("catalog.schema.model_name", "team", "ml-platform")
client.set_model_version_tag("catalog.schema.model_name", "2", "validated", "true")
client.delete_registered_model_tag("catalog.schema.model_name", "team")
client.delete_model_version_tag("catalog.schema.model_name", "2", "validated")
Aliases replace stage transitions (Staging/Production) in Unity Catalog; they are mutable pointers to specific model versions:
client.set_registered_model_alias("catalog.schema.model_name", "champion", "3")
client.set_registered_model_alias("catalog.schema.model_name", "challenger", "4")
# Load by alias — no hardcoded version numbers in downstream code
champion = mlflow.pyfunc.load_model("models:/catalog.schema.model_name@champion")
When the challenger outperforms the champion, reassign the champion alias. No version numbers change in downstream code.
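The "mutable pointer" behavior of aliases can be illustrated with a small in-memory stand-in. `ToyRegistry` is hypothetical and is not the MLflow API; it only shows why downstream code that loads by alias never needs to change when the champion is reassigned.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry (illustration only)."""

    def __init__(self):
        self.versions = {}  # version number -> model object
        self.aliases = {}   # alias name -> version number

    def register(self, model):
        version = len(self.versions) + 1
        self.versions[version] = model
        return version

    def set_alias(self, alias, version):
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.aliases[alias] = version  # reassigning simply moves the pointer

    def load(self, alias):
        return self.versions[self.aliases[alias]]

registry = ToyRegistry()
v1 = registry.register("model-A")
v2 = registry.register("model-B")

registry.set_alias("champion", v1)
before = registry.load("champion")

registry.set_alias("champion", v2)  # promote the challenger
after = registry.load("champion")
```

The caller uses `registry.load("champion")` both times; only the alias target changed, and version 1 remains available for rollback.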
Sample Questions
A data scientist wants to create a feature table to use in their models. They are working in a workspace with Unity Catalog enabled and want this feature table to be stored and governed by it. What is the correct way of creating this feature table?
Section 2: Data Preparation for Machine Learning
Summary Statistics
# .summary() — full descriptive statistics including percentiles
df.summary().show()
# Returns: count, mean, stddev, min, 25%, 50%, 75%, max for each column
# .describe() — subset (no percentiles)
df.describe().show()
# dbutils data summaries (richest option in Databricks notebooks)
dbutils.data.summarize(df)
# Returns interactive HTML summary with distribution histograms and null counts
Removing Outliers
Standard deviation method: best for normally distributed features:
from pyspark.sql import functions as F
mean_val = df.select(F.mean("feature")).collect()[0][0]
stddev_val = df.select(F.stddev("feature")).collect()[0][0]
df_clean = df.filter(F.abs(df["feature"] - mean_val) <= 3 * stddev_val)
IQR method: more robust for skewed distributions:
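# relativeError=0.0 computes exact quantiles, which can be expensive on large data;
# a small positive value (e.g. 0.01) trades accuracy for speed
Q1, Q3 = df.approxQuantile("feature", [0.25, 0.75], 0.0)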
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df.filter((df["feature"] >= lower_bound) & (df["feature"] <= upper_bound))
Decision rule: Use IQR for skewed data; use standard deviation for approximately normal data.
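A small pure-Python sketch, using a made-up sample, shows why the decision rule matters. One extreme value inflates both the mean and the standard deviation, so the 3-sigma filter can fail to remove it, while the IQR filter is unaffected. The function names and data are illustrative assumptions.

```python
import statistics

def remove_outliers_std(values, k=3.0):
    """Keep values within k sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= k * std]

def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

# Hypothetical skewed sample with a single extreme value
sample = [10, 12, 11, 13, 12, 11, 10, 14, 12, 500]

std_kept = remove_outliers_std(sample)
iqr_kept = remove_outliers_iqr(sample)
```

Here the value 500 survives the standard deviation filter because it drags the threshold up with it, whereas the IQR filter drops it and keeps the other nine points.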
Visualizations
Categorical features:
- Bar chart: shows frequency of each category
display(df.groupBy("category_col").count().orderBy("count", ascending=False))
Continuous features:
- Histogram: shows distribution shape, skewness, and outliers
- Box plot: shows median, IQR, and outliers in a single view
Comparing two categorical features: crosstab (df.stat.crosstab()), chi-squared test, grouped bar chart.
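Comparing two continuous features: Pearson correlation (df.stat.corr(), which computes Pearson only), Spearman correlation (rank-based, robust to outliers; available through pyspark.ml.stat.Correlation), scatter plot. Use Pearson for linear relationships; use Spearman for monotonic but non-linear relationships, skewed data, or data with outliers.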
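The difference between the two coefficients can be seen on a relationship that is perfectly monotonic but not linear. The sketch below implements both in plain Python; the helper names are illustrative, and the ranking helper assumes no tied values to stay short.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]  # monotonic but strongly non-linear

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
```

Spearman reports a perfect monotonic association, while Pearson is noticeably below 1 because the relationship is not a straight line.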
Imputing Missing Values
| Method | Best For | Caveat |
|---|---|---|
| Mean | Normally distributed continuous features | Sensitive to outliers |
| Median | Skewed continuous features | Robust to outliers |
| Mode | Categorical features | Can inflate one dominant category |
Always examine the distribution first: there is no one-size-fits-all imputation strategy.
from pyspark.ml.feature import Imputer
imputer = Imputer(
inputCols=["col_a", "col_b"],
outputCols=["col_a_imputed", "col_b_imputed"],
strategy="mean" # or "median"
)
imputer_model = imputer.fit(train_df)
train_imputed = imputer_model.transform(train_df)
# Mode imputation (categorical, manual); exclude nulls so the mode itself is never null
mode_val = (df.filter(df["cat_col"].isNotNull())
    .groupBy("cat_col").count()
    .orderBy("count", ascending=False)
    .first()[0])
df_imputed = df.fillna({"cat_col": mode_val})
One-Hot Encoding
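One-hot encoding creates a binary column for every unique category, resulting in a sparse high-dimensional vector. Spark's OneHotEncoder drops the last category by default (dropLast=True), so k categories produce a vector of length k-1.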
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")
pipeline = Pipeline(stages=[indexer, encoder])
ohe_model = pipeline.fit(train_df)
train_encoded = ohe_model.transform(train_df)
| Scenario | OHE Appropriate? | Reason |
|---|---|---|
| Low-cardinality nominal feature in a linear model | Yes | Linear models need independent coefficients per category |
| High-cardinality feature (hundreds of categories) | No | Curse of dimensionality; sparse matrix issues |
| Feature in a tree-based model (RF, GBM) | Not needed | Trees handle label-encoded integers natively |
| Ordinal feature (low/med/high) | No | Use ordinal encoding to preserve rank |
For high-cardinality features, group rare categories into "Other" before encoding, or use target encoding.
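Grouping rare categories before encoding can be sketched in plain Python as follows. The `min_count` threshold, the helper names, and the sample data are assumptions for illustration; in practice the threshold should be chosen from the training data only.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cities = ["NYC", "NYC", "LA", "LA", "Boise", "Tulsa", "NYC"]
grouped = group_rare(cities)
categories, encoded = one_hot(grouped)
```

Four raw categories collapse to three encoded columns, because the two singleton cities share the "Other" column.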
Log Scale Transformation
Apply log transformation when:
- Feature or target has a highly right-skewed distribution (income, house prices, transaction amounts)
- Values span multiple orders of magnitude
- The relationship between feature and target is multiplicative rather than additive
from pyspark.sql import functions as F
df = df.withColumn("log_price", F.log("price"))
Critical: If you train on a log-transformed target, you must exponentiate predictions before computing evaluation metrics or interpreting results:
import numpy as np
predictions_original_scale = np.exp(log_predictions)
Sample Questions
A data scientist needs to impute the missing values in a continuous feature. They want to do this with the least amount of effort but with correct results. Which strategy will do this?
Section 3: Model Development
Algorithm Selection
| Scenario | Recommended Algorithms |
|---|---|
| Binary classification, interpretability needed | Logistic Regression, Decision Tree |
| Binary classification, high performance | Random Forest, Gradient Boosting (XGBoost, LightGBM) |
| Multi-class classification | Random Forest, Gradient Boosting, Multinomial Logistic Regression |
| Regression, linear relationship | Linear Regression |
| Regression, non-linear relationship | Random Forest Regressor, Gradient Boosting Regressor |
| Clustering (no labels) | K-Means, DBSCAN |
| Recommendation / collaborative filtering | ALS via Spark ML |
| High-dimensional sparse data | Regularized linear models (L1/L2) |
Key decision factors: linear vs non-linear relationships, interpretability requirement, dataset size (Spark ML vs scikit-learn), labeled vs unlabeled data.
Data Imbalance
Class imbalance occurs when one class has far fewer instances than another. Standard accuracy becomes misleading; a model that always predicts the majority class can appear highly accurate while being useless.
Mitigation strategies:
- Cost-sensitive learning: class_weight="balanced" in scikit-learn, or weightCol in Spark ML. Directly penalizes the model for ignoring the minority class.
- Oversampling: SMOTE generates synthetic minority-class examples
- Undersampling: randomly remove majority-class records
- Appropriate metrics: use F1, ROC/AUC, or Precision-Recall AUC instead of accuracy
- Stratified splits: preserve class ratio in both train and test sets
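As a rough illustration of cost-sensitive learning, the sketch below computes per-class weights with the same heuristic scikit-learn uses for "balanced" weighting, n_samples / (n_classes * class_count). The function name and the 90/10 label split are illustrative assumptions; the resulting per-row weights are the kind of values one could place in a weight column.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical churn labels: 90% non-churn (0), 10% churn (1)
labels = [0] * 90 + [1] * 10

weights = balanced_class_weights(labels)
sample_weights = [weights[y] for y in labels]

# Total weight contributed by each class
total_majority = sum(w for w, y in zip(sample_weights, labels) if y == 0)
total_minority = sum(w for w, y in zip(sample_weights, labels) if y == 1)
```

Each minority row counts nine times as much as a majority row, so both classes contribute the same total weight to the loss.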
Estimators vs Transformers
| | Estimator | Transformer |
|---|---|---|
| Definition | Learns parameters from data | Applies a fixed or learned transformation |
| Key method | .fit(df) → returns a fitted Model | .transform(df) → returns a new DataFrame |
| Examples (unfitted) | LinearRegression, RandomForestClassifier, StandardScaler | — |
| Examples (fitted) | — | LinearRegressionModel, RandomForestClassificationModel, StandardScalerModel |
A Pipeline is itself an Estimator. Calling .fit() trains all stages and returns a PipelineModel (a Transformer).
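The fit/transform split can be mimicked with two toy classes. This is a plain-Python sketch of the pattern, not Spark ML code; `ToyScaler` and `ToyScalerModel` are hypothetical names.

```python
import statistics

class ToyScalerModel:
    """Fitted Transformer: holds learned parameters and applies them."""

    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

class ToyScaler:
    """Estimator: learns parameters from data and returns a fitted model."""

    def fit(self, values):
        return ToyScalerModel(statistics.fmean(values), statistics.pstdev(values))

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

model = ToyScaler().fit(train)       # parameters come from training data only
scaled_test = model.transform(test)  # test data reuses those parameters
```

Because the mean and standard deviation are learned once from `train` and then reused, the test set never influences the scaling parameters.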
Training Pipelines
A Spark ML Pipeline chains transformers and a final estimator, ensuring consistent preprocessing across train and test sets.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")
assembler = VectorAssembler(inputCols=["feat1", "feat2", "category_ohe"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])
model = pipeline.fit(train_df) # Estimator: learns all parameters
predictions = model.transform(test_df) # Transformer: applies all stages
Pipelines prevent data leakage: the scaler is fit only on training data, then applied consistently to test data.
Hyperparameter Tuning with Hyperopt
from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
import mlflow
def objective(params):
    # hp.quniform samples floats, so train_model should cast integer hyperparameters with int()
    with mlflow.start_run(nested=True):
        model = train_model(params)
        val_loss = evaluate(model, val_df)
        mlflow.log_metric("val_loss", val_loss)
    return {"loss": val_loss, "status": STATUS_OK}
search_space = {
"max_depth": hp.quniform("max_depth", 2, 10, 1),
"learning_rate": hp.loguniform("learning_rate", -5, 0),
"num_leaves": hp.quniform("num_leaves", 20, 150, 1)
}
best_params = fmin(
fn=objective,
space=search_space,
algo=tpe.suggest,
max_evals=50,
trials=SparkTrials() # distributes trials across Spark workers
)
fmin minimizes the return value. For metrics where higher is better (e.g., accuracy), return -accuracy.
Search Strategies
| Strategy | How it works | Efficiency | Best when |
|---|---|---|---|
| Grid search | Exhaustively tries every combination | Low | Small search spaces |
| Random search | Samples uniformly at random | Higher than grid | Large search spaces |
| Bayesian (TPE) | Uses prior results to focus search on promising regions | Highest | Large spaces; default in Hyperopt and Optuna |
Optuna key terms: Study (optimization session), Trial (single call to objective), Pruning (halts unpromising trials early), MLflowCallback (auto-logs each trial into MLflow with parent-child hierarchy).
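Grid and random search can be compared with a short plain-Python sketch. The objective below is a hypothetical stand-in for a validation loss, and the search ranges are arbitrary; Bayesian methods such as TPE are not reproduced here.

```python
import itertools
import random

def objective(learning_rate, max_depth):
    """Hypothetical validation loss with its minimum at lr=0.1, depth=6."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2

def grid_search(learning_rates, depths):
    """Evaluate every combination and return the best trial and the trial count."""
    trials = [((lr, d), objective(lr, d))
              for lr, d in itertools.product(learning_rates, depths)]
    return min(trials, key=lambda t: t[1]), len(trials)

def random_search(n_trials, seed=0):
    """Sample configurations independently and return the best trial and the trial count."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-3, 0)  # log-uniform over [0.001, 1]
        depth = rng.randint(2, 10)
        trials.append(((lr, depth), objective(lr, depth)))
    return min(trials, key=lambda t: t[1]), len(trials)

grid_best, grid_n = grid_search([0.001, 0.01, 0.1], [2, 6, 10])
rand_best, rand_n = random_search(n_trials=9)
```

Both strategies spend nine evaluations here. The grid is limited to three distinct values per hyperparameter, while random search tries up to nine distinct values of each, which is why it tends to scale better as the search space grows.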
Parallelizing Single-Node Models
Single-node models (scikit-learn, XGBoost) don't distribute internally, but you can parallelize the hyperparameter search:
# Hyperopt: SparkTrials runs trials on Spark workers
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
max_evals=50, trials=SparkTrials(parallelism=4))
Important: Optuna's n_jobs uses multi-threading on a single machine. Due to Python's GIL, this achieves concurrency but not true CPU parallelism. For genuine distribution, use SparkTrials (Hyperopt) or MlflowSparkStudy (Optuna).
Cross-Validation vs Train-Validation Split
| | Cross-Validation (k-fold) | Train-Validation Split |
|---|---|---|
| Benefit | More robust generalization estimate; each point used for validation exactly once | Fast and simple |
| Benefit | Uses all data for both training and validation | Lower memory and compute requirements |
| Downside | Computationally expensive: trains k models per hyperparameter combo | Estimate depends on which data ended up in which split |
| Downside | Risk of temporal leakage with random splits on time-series data | May overfit to the particular split chosen |
Use cross-validation when data is limited or you need a reliable generalization estimate. Use a train-validation split for large datasets; for time-series data, split chronologically so that validation rows come after training rows.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
paramGrid = (ParamGridBuilder()
.addGrid(rf.numTrees, [50, 100])
.addGrid(rf.maxDepth, [5, 10])
.build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
numFolds=5)
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel
Model count formula: (hyperparameter combinations) × k folds
Example: C=[0.1,1,10], kernel=['linear','rbf'], gamma=[0.01,0.1,1] with 5-fold CV → 3×2×3 = 18 combinations × 5 = 90 models.
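The count in the example can be checked by enumerating the grid. The helper name below is illustrative.

```python
import itertools

def count_cv_fits(param_grid, k_folds):
    """Return (number of hyperparameter combinations, number of models trained)."""
    combos = list(itertools.product(*param_grid.values()))
    return len(combos), len(combos) * k_folds

grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1],
}

n_combos, n_fits = count_cv_fits(grid, k_folds=5)
```

Note that tools which refit the best configuration on the full training set at the end (as Spark's CrossValidator and scikit-learn's GridSearchCV do by default) train one additional model beyond this count.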
Classification Metrics
| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | % of correct predictions | Balanced classes, equal cost of all errors |
| Precision | Of predicted positives, how many are actually positive | False positives are costly (e.g., spam filter) |
| Recall | Of actual positives, how many were caught | False negatives are costly (e.g., medical diagnosis, fraud) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced data; both error types matter |
| Log Loss | Penalizes confident wrong predictions | When calibrated probabilities matter |
| ROC/AUC | Distinguishing ability across all thresholds | Comparing models independent of threshold |
AUC = 1.0 → perfect; AUC = 0.5 → random guessing; AUC < 0.5 → worse than random, which often means the labels or scores are inverted.
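To make the accuracy pitfall concrete, the sketch below computes the threshold-based metrics from the table for two sets of predictions on a made-up 90/10 dataset: a model that always predicts the majority class, and one that finds some of the positives. The function name and the data are illustrative.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }

# Hypothetical labels: 10 positives followed by 90 negatives
y_true = [1] * 10 + [0] * 90

always_negative = [0] * 100
# Finds 6 of 10 positives and raises 3 false alarms
useful_model = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 87

baseline = binary_metrics(y_true, always_negative)
model = binary_metrics(y_true, useful_model)
```

The baseline scores 90% accuracy with zero recall and zero F1, while the second model's accuracy is only slightly higher even though it is far more useful.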
Clustering metrics: Silhouette score (higher = better-defined clusters), Elbow method (plot inertia vs K; pick the elbow).
Regression Metrics
| Metric | Units | Key property |
|---|---|---|
| R² | Unitless (at most 1; can be negative on held-out data) | Proportion of variance explained; 1.0 = perfect, 0 = no better than predicting the mean |
| RMSE | Same as target | Penalizes large errors more; most commonly used |
| MAE | Same as target | Robust to outliers; treats all errors equally |
| MSE | Squared units of target | Used internally in many optimizers |
RMSE vs MAE: use RMSE when large errors are especially costly; use MAE when you want equal treatment or outliers shouldn't dominate.
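A tiny worked example, with made-up numbers, shows how the two metrics react to the same total error distributed differently.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 11.0, 13.0]
small_errors = [11.0, 11.0, 12.0, 12.0]  # every prediction off by 1
one_big_miss = [10.0, 12.0, 11.0, 17.0]  # three exact, one off by 4

mae_small, rmse_small = mae(actual, small_errors), rmse(actual, small_errors)
mae_big, rmse_big = mae(actual, one_big_miss), rmse(actual, one_big_miss)
```

Both prediction sets have an MAE of 1.0, but the RMSE doubles from 1.0 to 2.0 when the error is concentrated in a single large miss.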
Metric Selection by Scenario
| Scenario | Best Metric | Why |
|---|---|---|
| Medical diagnosis — missing a positive case is catastrophic | Recall | Minimize false negatives |
| Spam filter — sending legitimate email to spam is costly | Precision | Minimize false positives |
| Fraud detection — imbalanced classes | F1, ROC/AUC | Balance precision/recall |
| Predicting house prices — large errors especially bad | RMSE | Penalizes large errors more |
| Predicting delivery time — outliers shouldn't dominate | MAE | Robust to outliers |
| Explaining to business stakeholders | R² | Intuitive: "explains X% of the variance" |
Exponentiating Log-Transformed Targets
When a regression model is trained on a log-transformed target, predictions are also on the log scale. Before computing RMSE or interpreting predictions, exponentiate back:
import numpy as np
log_predictions = model.predict(X_test)
predictions = np.exp(log_predictions) # back to original scale
actual = np.exp(y_test_log)
rmse = np.sqrt(np.mean((predictions - actual) ** 2))
Bias-Variance Tradeoff
| | Training Error | Validation Error | Interpretation |
|---|---|---|---|
| High bias (underfitting) | High | High | Model too simple |
| High variance (overfitting) | Low | High | Model memorized training data |
| Well-fit | Low | Low | Generalizes to unseen data |
Fixes:
| Problem | Fixes |
|---|---|
| High bias | More complex model, add features, reduce regularization |
| High variance | More training data, stronger regularization, simpler model, early stopping |
As model complexity increases, training error monotonically decreases, but validation error forms a U-shape. The optimal model sits at the bottom of the validation error curve.
Sample Questions
A data scientist is working on a model to predict customer churn. The dataset is highly imbalanced, with only 10% of instances representing churned customers. Which strategy directly mitigates the model's bias towards the non-churn class?
Section 4: Model Deployment
Batch vs Streaming vs Real-Time
| | Batch | Streaming | Real-Time |
|---|---|---|---|
| Latency | High (minutes to hours) | Medium (seconds) | Low (milliseconds) |
| Throughput | Very high | High | Moderate |
| Input pace | Scheduled, periodic | Continuous stream | On-demand, request-by-request |
| Infrastructure | Databricks Jobs, Spark | Lakeflow Spark Declarative Pipelines | Databricks Model Serving |
| Best for | Nightly scoring, ETL | Event-driven inference, IoT | Fraud detection, recommendations, chatbots |
Batch is unsuitable when data changes faster than every ~30 minutes or stale predictions are harmful. Streaming is event-driven and not suitable for millisecond needs. Real-time requires online feature stores when using automatic feature lookups.
Deploying Custom Models
To deploy custom logic (pre/post-processing, output transformation), extend mlflow.pyfunc.PythonModel:
import mlflow
import mlflow.pyfunc
class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Runs once when the model is loaded; deserialize the wrapped model
        import pickle
        with open(context.artifacts["model_path"], "rb") as f:
            self.model = pickle.load(f)
    def predict(self, context, model_input):
        # Assumes the wrapped model's predict() returns a risk score between 0 and 1
        raw_predictions = self.model.predict(model_input)
        return ["high_risk" if p > 0.7 else "low_risk" for p in raw_predictions]
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=CustomModel(),
        artifacts={"model_path": "model.pkl"}
    )
Batch Inference
import mlflow
import pandas as pd
model = mlflow.pyfunc.load_model("models:/catalog.schema.model_name/1")
pandas_df = pd.read_parquet("data/batch_input.parquet")
predictions = model.predict(pandas_df)
For Feature Store models, use fe.score_batch(): it handles feature lookups automatically from a Spark DataFrame of primary keys:
predictions_spark = fe.score_batch(
model_uri="models:/catalog.schema.model_name/1",
df=primary_keys_spark_df
)
Streaming Inference with Delta Live Tables
Use Lakeflow Spark Declarative Pipelines (formerly DLT) with the MLflow model loaded as a Spark UDF:
import dlt
import mlflow
from pyspark.sql import functions as F
# feature_cols: list of the input column names the model expects (defined elsewhere)
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/catalog.schema.model_name/1")
@dlt.table
def inference_results():
    streaming_df = spark.readStream.table("source_table")
    return streaming_df.withColumn("prediction", predict_udf(*feature_cols))
DLT handles auto-scaling (Enhanced Autoscaling on by default), triggered and continuous modes, data expectations, schema evolution, and unified batch + streaming in the same pipeline code.
Real-Time Inference: Deploy and Query
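from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput
client = WorkspaceClient()
# The SDK expects typed config objects rather than a raw dict;
# check field names against your installed databricks-sdk version
client.serving_endpoints.create(
    name="my-model-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.model_name",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)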
import requests, json
url = "https://<databricks-host>/serving-endpoints/my-model-endpoint/invocations"
response = requests.post(url,
headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
data=json.dumps({"inputs": [[1.0, 2.0, 3.0]]})
)
predictions = response.json()
A/B Testing with Traffic Splitting
Configure a single endpoint to split traffic between model versions:
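from databricks.sdk.service.serving import Route, ServedEntityInput, TrafficConfig
# Traffic is split by routes that reference each served entity by name
client.serving_endpoints.update_config(
    name="my-model-endpoint",
    served_entities=[
        ServedEntityInput(name="model-v3", entity_name="catalog.schema.model_name", entity_version="3", workload_size="Small", scale_to_zero_enabled=True),
        ServedEntityInput(name="model-v4", entity_name="catalog.schema.model_name", entity_version="4", workload_size="Small", scale_to_zero_enabled=True),
    ],
    traffic_config=TrafficConfig(routes=[
        Route(served_model_name="model-v3", traffic_percentage=50),
        Route(served_model_name="model-v4", traffic_percentage=50),
    ]),
)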
Monitor both versions' metrics; when the challenger demonstrates sufficient improvement, update the @champion alias.
Sample Questions
A company has a podcast platform with thousands of users. An anomaly detection algorithm runs on a 10-minute running window of user events. A machine learning engineer wants to deploy this into a production data pipeline handling tens of thousands of events per second, with dynamic compute resizing. Which approach meets these requirements?