Ananya Nair

Section 1: Databricks Machine Learning

MLOps Best Practices

MLOps is the discipline of applying DevOps practices to ML workflows.

MLOps Components:

Version Control: code, notebooks, and configuration tracked in Git
CI/CD: automated testing and deployment pipelines
Workflows (Lakeflow Jobs): orchestrate dependencies between data preprocessing, feature engineering, model training, and inference tasks. Inference typically runs on a different schedule than training.
Model Registry: store and version trained model artifacts with metadata. In Databricks, this is the Unity Catalog registry.
Model Serving: Databricks Model Serving (real-time REST endpoints) or Kubernetes for low-latency deployments
Monitoring: Lakehouse Monitoring for drift detection, performance degradation, and data quality
Data Version Control: Delta Lake time travel for reproducibility and rollback
Feature Store: consistent feature computation between training and serving
Vector Database / LLM Tracing / Human-in-the-loop: for GenAI workflows

MLOps Principles (how you build and operate):

Documentation: track decisions, configurations, and assumptions
Code quality: pre-commit hooks, unit tests (verify individual functions), integration tests (verify end-to-end flows)
Traceability and reproducibility: same code + same data = same model; enables easy rollback
Monitoring and alerting: application performance, infrastructure health, and business metrics

Production-readiness refactoring: separate functions/classes/modules, isolate configurations, add logging, package the project code.

ML Runtimes

Databricks ML runtimes are pre-configured cluster images designed for machine learning workloads:

Pre-installed ML libraries: scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, HuggingFace
MLflow pre-installed and pre-configured, no manual setup required
GPU-optimized variants available (CUDA, cuDNN pre-configured)
Eliminates library conflicts and manual dependency management
Consistent environment across the team, reducing "works on my machine" issues

AutoML

Databricks AutoML automates feature selection, model selection, hyperparameter search, and preprocessing. All runs are logged to an MLflow experiment. The best run is surfaced in the AutoML UI.

In 2026, the primary entry point for AutoML is the Genie Code Data Science Agent.

Key advantage (avoiding the opaque-box problem): AutoML generates a fully editable source notebook for the best run. You can open it, read the pipeline, and customize it with domain expertise, unlike black-box AutoML tools that return a model you cannot inspect or modify.

Feature Store: Unity Catalog vs Workspace

	Unity Catalog Feature Store	Workspace Feature Store (Legacy)
Scope	Account-level, cross-workspace	Single workspace only
Client	`FeatureEngineeringClient`	`FeatureStoreClient` (deprecated)
Discovery	Searchable across the organization	Workspace-local only
Governance	UC permission model, row/column-level security	Workspace-level permissions only
Lineage	Full lineage: which models use which features	Limited
Cross-workspace access	Yes	No

Why UC Feature Store matters: discoverability across teams, full lineage tracking, training/serving skew prevention, and online serving support.

Creating and Writing Feature Store Tables

Use FeatureEngineeringClient.create_table(). The table must have a primary key.

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

fe.create_table(
    name="catalog.schema.user_features",
    primary_keys=["user_id"],
    schema=features_df.schema,
    description="User-level behavioral features"
)

Always use mode="merge" when writing to an existing feature table; it upserts by primary key. Using "overwrite" destroys existing features.

fe.write_table(
    name="catalog.schema.user_features",
    df=features_df,
    mode="merge"
)

Training Models with Feature Store Lookups

Define FeatureLookup objects to specify which features to pull, then pass to create_training_set. Log with fe.log_model(), not standard MLflow autolog; this packages feature metadata with the model artifact so inference can perform automatic feature lookup.

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup
import mlflow

fe = FeatureEngineeringClient()

feature_lookups = [
    FeatureLookup(
        table_name="catalog.schema.user_features",
        feature_names=["30d_spend", "avg_txn_value", "days_since_last_login"],
        lookup_key="user_id"
    )
]

training_set = fe.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="churn"
)

training_df = training_set.load_df()
model = train_model(training_df)

fe.log_model(
    model=model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set
)

For batch inference with automatic feature lookup:

predictions = fe.score_batch(
    model_uri="models:/catalog.schema.model_name/1",
    df=primary_keys_df    # only primary key columns needed
)

Online vs Offline Feature Tables

	Offline Feature Table	Online Feature Table
Storage	Delta Lake	Low-latency key-value database
Latency	Seconds to minutes	Milliseconds (point lookup)
Use case	Batch training, batch inference	Real-time model serving
Sync	Source of truth	Synced from the offline table via Change Data Feed (CDF)

If your model uses automatic feature lookups and is deployed to a real-time serving endpoint, online tables are strictly required. An offline Delta table cannot return features in milliseconds.

MLflow Client API

Finding the best run:

from mlflow import MlflowClient

client = MlflowClient()
runs = client.search_runs(
    experiment_ids=["<experiment_id>"],
    filter_string="",
    order_by=["metrics.val_rmse ASC"],   # ASC for error metrics
    max_results=1
)
best_run = runs[0]
best_run_id = best_run.info.run_id

Manual logging:

import mlflow

with mlflow.start_run() as run:
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("val_rmse", val_rmse)
    mlflow.log_artifact("feature_importance.png")
    mlflow.sklearn.log_model(model, artifact_path="model")

What the MLflow UI exposes: parameters, metrics (with step history), artifacts, source notebook link, run metadata (ID, start time, duration, status), tags, model signature, and dataset information.

Registering Models in Unity Catalog

import mlflow

mlflow.set_registry_uri("databricks-uc")

model_uri = f"runs:/{run_id}/model"
registered_model = mlflow.register_model(
    model_uri=model_uri,
    name="catalog.schema.model_name"   # three-level UC namespace
)

	Unity Catalog Registry	Workspace Registry
Scope	Account-level, cross-workspace	Single workspace
Access control	UC permission model	Workspace-level ACLs
Lineage	Full lineage: features → model → serving	Limited
Multi-env support	Share model across dev/staging/prod	Must copy artifacts manually

In MLflow 3, Unity Catalog is the default registry. The workspace registry is legacy.

Promoting Code vs Promoting Models

Promote models (move the trained artifact through environments):

Training data is the same across dev / staging / prod
You want to validate the exact artifact that will run in production
Example: a batch scoring model trained on a static historical dataset

Promote code (move the training script; retrain in each environment):

The model must be trained on environment-specific data
Regulatory requirements mandate that the production model was trained on production data
Example: a fraud detection model that must be trained on live production transactions

Tags and Aliases

Tags are key-value labels for governance and filtering:

from mlflow import MlflowClient

client = MlflowClient()

client.set_registered_model_tag("catalog.schema.model_name", "team", "ml-platform")
client.set_model_version_tag("catalog.schema.model_name", "2", "validated", "true")
client.delete_registered_model_tag("catalog.schema.model_name", "team")
client.delete_model_version_tag("catalog.schema.model_name", "2", "validated")

Aliases replace stage transitions (Staging/Production) in Unity Catalog; they are mutable pointers to specific model versions:

client.set_registered_model_alias("catalog.schema.model_name", "champion", "3")
client.set_registered_model_alias("catalog.schema.model_name", "challenger", "4")

# Load by alias: no hardcoded version numbers in downstream code
import mlflow

champion = mlflow.pyfunc.load_model("models:/catalog.schema.model_name@champion")

When the challenger outperforms the champion, reassign the champion alias. No version numbers change in downstream code.

Sample Questions

A data scientist wants to create a feature table to use in their models. They are working in a workspace with Unity Catalog enabled and want this feature table to be stored and governed by it. What is the correct way of creating this feature table?

Section 2: Data Preparation for Machine Learning

Summary Statistics

# .summary() — full descriptive statistics including percentiles
df.summary().show()
# Returns: count, mean, stddev, min, 25%, 50%, 75%, max for each column

# .describe() — subset (no percentiles)
df.describe().show()

# dbutils data summaries (richest option in Databricks notebooks)
dbutils.data.summarize(df)
# Returns interactive HTML summary with distribution histograms and null counts

Removing Outliers

Standard deviation method: best for normally distributed features:

from pyspark.sql import functions as F

mean_val = df.select(F.mean("feature")).collect()[0][0]
stddev_val = df.select(F.stddev("feature")).collect()[0][0]

df_clean = df.filter(F.abs(df["feature"] - mean_val) <= 3 * stddev_val)

IQR method: more robust for skewed distributions:

# relativeError=0.0 computes exact quantiles, which can be expensive on large data
Q1, Q3 = df.approxQuantile("feature", [0.25, 0.75], 0.0)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_clean = df.filter((df["feature"] >= lower_bound) & (df["feature"] <= upper_bound))

Decision rule: Use IQR for skewed data; use standard deviation for approximately normal data.
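The decision rule can be checked on a small example. The sketch below uses plain Python on a hypothetical skewed sample (the values are invented for illustration) rather than a Spark DataFrame. It shows why the standard deviation rule can fail on skewed data: the extreme point inflates the standard deviation enough to hide itself.

```python
import statistics

# Hypothetical right-skewed sample: mostly small values plus one extreme point.
values = [10, 12, 11, 13, 12, 14, 11, 13, 12, 200]

# Standard deviation rule: keep points within 3 standard deviations of the mean.
mean = statistics.mean(values)
std = statistics.stdev(values)
kept_std = [v for v in values if abs(v - mean) <= 3 * std]

# IQR rule: keep points within 1.5 * IQR of the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
kept_iqr = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]

# The extreme value survives the 3-sigma filter but not the IQR filter.
print(200 in kept_std, 200 in kept_iqr)
```

Because quartiles are barely affected by a single extreme value, the IQR bounds stay tight around the bulk of the data.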

Visualizations

Categorical features:

Bar chart: shows frequency of each category
display(df.groupBy("category_col").count().orderBy("count", ascending=False))

Continuous features:

Histogram: shows distribution shape, skewness, and outliers
Box plot: shows median, IQR, and outliers in a single view

Comparing two categorical features: crosstab (df.stat.crosstab()), chi-squared test, grouped bar chart.

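Comparing two continuous features: Pearson correlation (df.stat.corr(), which computes Pearson only), Spearman rank correlation (pyspark.ml.stat.Correlation with method "spearman"; robust to outliers), scatter plot. Use Pearson for linear relationships; use Spearman for monotonic but non-linear relationships or heavily skewed data.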
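To make the Pearson/Spearman distinction concrete, here is a minimal pure-Python sketch (not the Spark API). The helper names pearson, ranks, and spearman are illustrative, and the rank helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..n (this sketch assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic but non-linear relationship

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(round(pearson_r, 3), round(spearman_r, 3))
```

Spearman is essentially 1 here because the ordering of y matches the ordering of x exactly, while Pearson falls below 1 because the relationship is not a straight line.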

Imputing Missing Values

Method	Best For	Caveat
Mean	Normally distributed continuous features	Sensitive to outliers
Median	Skewed continuous features	Robust to outliers
Mode	Categorical features	Can inflate one dominant category

Always examine the distribution first: there is no one-size-fits-all imputation strategy.
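As a quick illustration of why the distribution matters, the following plain-Python sketch imputes a hypothetical skewed income feature (values invented for this example) with the mean and with the median.

```python
import statistics

# Hypothetical skewed feature with missing entries represented as None.
incomes = [32_000, 35_000, None, 38_000, 41_000, None, 1_200_000]

observed = [v for v in incomes if v is not None]
mean_fill = statistics.mean(observed)      # pulled upward by the single large value
median_fill = statistics.median(observed)  # stays near the typical value

imputed_mean = [mean_fill if v is None else v for v in incomes]
imputed_median = [median_fill if v is None else v for v in incomes]

print(mean_fill, median_fill)
```

The mean fill lands far above every typical observation, so the imputed rows would look like high earners; the median fill stays representative.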

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["col_a", "col_b"],
    outputCols=["col_a_imputed", "col_b_imputed"],
    strategy="mean"      # or "median"
)
imputer_model = imputer.fit(train_df)
train_imputed = imputer_model.transform(train_df)

# Mode imputation (categorical, manual); exclude nulls so the mode is a real category
mode_val = (df.filter(df["cat_col"].isNotNull())
              .groupBy("cat_col").count()
              .orderBy("count", ascending=False)
              .first()[0])
df_imputed = df.fillna({"cat_col": mode_val})

One-Hot Encoding

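One-hot encoding represents each unique category as its own binary indicator, resulting in a sparse, high-dimensional vector. Spark's OneHotEncoder drops the last category by default (dropLast=True), so k categories produce a vector of length k-1.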

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")

pipeline = Pipeline(stages=[indexer, encoder])
ohe_model = pipeline.fit(train_df)
train_encoded = ohe_model.transform(train_df)

Scenario	OHE Appropriate?	Reason
Low-cardinality nominal feature in a linear model	Yes	Linear models need independent coefficients per category
High-cardinality feature (hundreds of categories)	No	Curse of dimensionality; sparse matrix issues
Feature in a tree-based model (RF, GBM)	Not needed	Trees handle label-encoded integers natively
Ordinal feature (low/med/high)	No	Use ordinal encoding to preserve rank

For high-cardinality features, group rare categories into "Other" before encoding, or use target encoding.
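Grouping rare categories can be sketched in a few lines of plain Python. The function name group_rare, the min_count threshold, and the sample data are illustrative assumptions; in practice the counts should be learned from the training set only and reused at inference time.

```python
from collections import Counter

def group_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

# Hypothetical column: two frequent cities and three that appear once each.
cities = ["NYC"] * 5 + ["LA"] * 4 + ["Boise", "Tulsa", "Fargo"]
grouped = group_rare(cities, min_count=2)

grouped_counts = Counter(grouped)
print(grouped_counts)
```

After grouping, the encoder sees three categories instead of five, which keeps the one-hot vector small.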

Log Scale Transformation

Apply log transformation when:

Feature or target has a highly right-skewed distribution (income, house prices, transaction amounts)
Values span multiple orders of magnitude
The relationship between feature and target is multiplicative rather than additive

from pyspark.sql import functions as F
# Natural log; defined only for price > 0
df = df.withColumn("log_price", F.log("price"))

Critical: If you train on a log-transformed target, you must exponentiate predictions before computing evaluation metrics or interpreting results:

import numpy as np
predictions_original_scale = np.exp(log_predictions)

Sample Questions

A data scientist needs to impute the missing values in a continuous feature. They want to do this with the least amount of effort but with correct results. Which strategy will do this?

Section 3: Model Development

Algorithm Selection

Scenario	Recommended Algorithms
Binary classification, interpretability needed	Logistic Regression, Decision Tree
Binary classification, high performance	Random Forest, Gradient Boosting (XGBoost, LightGBM)
Multi-class classification	Random Forest, Gradient Boosting, Multinomial Logistic Regression
Regression, linear relationship	Linear Regression
Regression, non-linear relationship	Random Forest Regressor, Gradient Boosting Regressor
Clustering (no labels)	K-Means, DBSCAN
Recommendation / collaborative filtering	ALS via Spark ML
High-dimensional sparse data	Regularized linear models (L1/L2)

Key decision factors: linear vs non-linear relationships, interpretability requirement, dataset size (Spark ML vs scikit-learn), labeled vs unlabeled data.

Data Imbalance

Class imbalance occurs when one class has far fewer instances than another. Standard accuracy becomes misleading; a model that always predicts the majority class can appear highly accurate while being useless.

Mitigation strategies:

Cost-sensitive learning: class_weight="balanced" in scikit-learn, or weightCol in Spark ML. Directly penalizes the model for ignoring the minority class.
Oversampling: SMOTE generates synthetic minority-class examples
Undersampling: randomly remove majority-class records
Appropriate metrics: use F1, ROC/AUC, or Precision-Recall AUC instead of accuracy
Stratified splits: preserve class ratio in both train and test sets
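For cost-sensitive learning, the "balanced" weighting documented by scikit-learn is n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python on a hypothetical 90/10 label split; the function name is illustrative. The resulting per-row weights are the kind of values you would place in a Spark ML weightCol.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced labels: 90 non-churn (0) and 10 churn (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# One weight per row, suitable for a weight column.
row_weights = [weights[y] for y in labels]
print(weights)
```

Each minority-class row counts nine times as much as a majority-class row, so both classes contribute equal total weight to the loss.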

Estimators vs Transformers

	Estimator	Transformer
Definition	Learns parameters from data	Applies a fixed or learned transformation
Key method	`.fit(df)` → returns a fitted Model	`.transform(df)` → returns a new DataFrame
Examples (unfitted)	`LinearRegression`, `RandomForestClassifier`, `StandardScaler`	—
Examples (fitted)	—	`LinearRegressionModel`, `RandomForestClassificationModel`, `StandardScalerModel`

A Pipeline is itself an Estimator. Calling .fit() trains all stages and returns a PipelineModel (a Transformer).
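The Estimator/Transformer contract can be illustrated without Spark. The toy classes below (ToyScaler and ToyScalerModel are hypothetical names, not Spark classes) mimic the pattern: fit() learns parameters and returns a fitted model, and transform() applies them.

```python
import statistics

class ToyScalerModel:
    """Transformer: applies parameters that were learned earlier."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

class ToyScaler:
    """Estimator: learns parameters from data and returns a fitted model."""
    def fit(self, values):
        return ToyScalerModel(statistics.mean(values), statistics.pstdev(values))

train, test = [2.0, 4.0, 6.0, 8.0], [10.0]

scaler_model = ToyScaler().fit(train)       # parameters come from train only
scaled_test = scaler_model.transform(test)  # test data never influences them
print(scaler_model.mean, scaled_test)
```

Fitting on the training split and only transforming the test split is the same discipline a Pipeline enforces automatically.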

Training Pipelines

A Spark ML Pipeline chains feature stages (Transformers and Estimators such as StringIndexer or StandardScaler) and typically ends with a model Estimator, ensuring consistent preprocessing across train and test sets.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier

indexer  = StringIndexer(inputCol="category", outputCol="category_idx")
encoder  = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")
assembler = VectorAssembler(inputCols=["feat1", "feat2", "category_ohe"], outputCol="features_raw")
scaler   = StandardScaler(inputCol="features_raw", outputCol="features")
rf       = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)

pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])
model    = pipeline.fit(train_df)        # Estimator: learns all parameters
predictions = model.transform(test_df)  # Transformer: applies all stages

Pipelines prevent data leakage: the scaler is fit only on training data, then applied consistently to test data.

Hyperparameter Tuning with Hyperopt

from hyperopt import fmin, tpe, hp, STATUS_OK, SparkTrials
import mlflow

def objective(params):
    with mlflow.start_run(nested=True):
        model = train_model(params)
        val_loss = evaluate(model, val_df)
        mlflow.log_metric("val_loss", val_loss)
    return {"loss": val_loss, "status": STATUS_OK}

# hp.quniform returns floats (e.g., 5.0); cast to int inside train_model where needed
search_space = {
    "max_depth":     hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "num_leaves":    hp.quniform("num_leaves", 20, 150, 1)
}

best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=50,
    trials=SparkTrials()    # distributes trials across Spark workers
)

fmin minimizes the "loss" value returned by the objective. For metrics where higher is better (e.g., accuracy), return -accuracy as the loss.

Search Strategies

Strategy	How it works	Efficiency	Best when
Grid search	Exhaustively tries every combination	Low	Small search spaces
Random search	Samples uniformly at random	Higher than grid	Large search spaces
Bayesian (TPE)	Uses prior results to focus search on promising regions	Highest	Large spaces; default in Hyperopt and Optuna
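The first two strategies in the table are simple enough to sketch in plain Python. The loss function below is a made-up surface with a known minimum, standing in for a real train-and-validate step; TPE is not shown because it requires a library such as Hyperopt or Optuna.

```python
import itertools
import random

def validation_loss(max_depth, learning_rate):
    """Hypothetical loss surface with its minimum at depth 6, rate 0.1."""
    return (max_depth - 6) ** 2 + 100 * (learning_rate - 0.1) ** 2

# Grid search: evaluate every combination of the listed values.
depths, rates = [2, 4, 6, 8], [0.01, 0.1, 1.0]
grid = list(itertools.product(depths, rates))
best_grid = min(grid, key=lambda p: validation_loss(*p))

# Random search: spend the same budget on randomly sampled points.
rng = random.Random(0)
samples = [(rng.randint(2, 10), 10 ** rng.uniform(-2, 0)) for _ in range(len(grid))]
best_random = min(samples, key=lambda p: validation_loss(*p))

print(best_grid, best_random)
```

Grid search only ever tries the listed values, so its cost multiplies with every added hyperparameter, whereas random search covers the continuous range with a budget you choose.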

Optuna key terms: Study (optimization session), Trial (single call to objective), Pruning (halts unpromising trials early), MLflowCallback (auto-logs each trial into MLflow with parent-child hierarchy).

Parallelizing Single-Node Models

Single-node models (scikit-learn, XGBoost) don't distribute internally, but you can parallelize the hyperparameter search:

# Hyperopt: SparkTrials runs trials on Spark workers
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=SparkTrials(parallelism=4))

Important: Optuna's n_jobs uses multi-threading on a single machine. Due to Python's GIL, this achieves concurrency but not true CPU parallelism. For genuine distribution, use SparkTrials (Hyperopt) or MlflowSparkStudy (Optuna).

Cross-Validation vs Train-Validation Split

	Cross-Validation (k-fold)	Train-Validation Split
Benefit	More robust generalization estimate; each point used for validation exactly once	Fast and simple
Benefit	Uses all data for both training and validation	Lower memory and compute requirements
Downside	Computationally expensive: trains k models per hyperparameter combo	Estimate depends on which data ended up in which split
Downside	Risk of temporal leakage with random splits on time-series data	May overfit to the particular split chosen

Use cross-validation when data is limited or you need a reliable generalization estimate. Use a train-validation split for large datasets, and for time-series data split chronologically rather than randomly.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

paramGrid = (ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=5)

cv_model = cv.fit(train_df)
best_model = cv_model.bestModel

Model count formula: (hyperparameter combinations) × k folds

Example: C=[0.1,1,10], kernel=['linear','rbf'], gamma=[0.01,0.1,1] with 5-fold CV → 3×2×3 = 18 combinations × 5 = 90 models.
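The arithmetic in this example can be verified by enumerating the grid in plain Python; the dictionary below simply restates the values from the example.

```python
import itertools

param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1, 1],
}
k_folds = 5

# Every combination of one value per hyperparameter.
combinations = list(itertools.product(*param_grid.values()))
total_fits = len(combinations) * k_folds

print(len(combinations), total_fits)  # prints: 18 90
```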

Classification Metrics

Metric	What it measures	Use when
Accuracy	% of correct predictions	Balanced classes, equal cost of all errors
Precision	Of predicted positives, how many are actually positive	False positives are costly (e.g., spam filter)
Recall	Of actual positives, how many were caught	False negatives are costly (e.g., medical diagnosis, fraud)
F1 Score	Harmonic mean of precision and recall	Imbalanced data; both error types matter
Log Loss	Penalizes confident wrong predictions	When calibrated probabilities matter
ROC/AUC	Distinguishing ability across all thresholds	Comparing models independent of threshold

AUC = 1.0 → perfect; AUC = 0.5 → random guessing; AUC < 0.5 (ROC curve below the diagonal) → worse than random, which often means the labels or scores are inverted.
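The threshold-based metrics in the table can be computed from the confusion counts. The sketch below is plain Python with invented labels (10% positives) and compares a model that always predicts the majority class against one with some signal.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced labels: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100
some_signal = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

majority = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, some_signal)
print(majority)
print(useful)
```

The two models have nearly the same accuracy, but only recall and F1 reveal that the first one never finds a positive case.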

Clustering metrics: Silhouette score (higher = better-defined clusters), Elbow method (plot inertia vs K; pick the elbow).

Regression Metrics

Metric	Units	Key property
R²	Unitless (≤ 1; can be negative on held-out data)	Proportion of variance explained; 1.0 = perfect, 0 = no better than predicting the mean
RMSE	Same as target	Penalizes large errors more; most commonly used
MAE	Same as target	Robust to outliers; treats all errors equally
MSE	Squared units of target	Used internally in many optimizers

RMSE vs MAE: use RMSE when large errors are especially costly; use MAE when you want equal treatment or outliers shouldn't dominate.
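A small worked example makes the RMSE/MAE difference visible. The numbers below are invented: two sets of predictions with the same total absolute error, one spread evenly and one concentrated in a single large miss, plus a deliberately bad set to show R² dropping below zero.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mean_actual = sum(actual) / len(actual)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return mae, rmse, 1 - ss_res / ss_tot

actual = [10, 20, 30, 40]
even_errors = [12, 18, 32, 38]    # every prediction is off by 2
one_big_miss = [10, 20, 30, 48]   # three exact, one off by 8
reversed_preds = [40, 30, 20, 10] # worse than predicting the mean

mae_even, rmse_even, r2_even = regression_metrics(actual, even_errors)
mae_big, rmse_big, r2_big = regression_metrics(actual, one_big_miss)
_, _, r2_bad = regression_metrics(actual, reversed_preds)

print(mae_even, rmse_even, mae_big, rmse_big, r2_bad)
```

MAE is identical for the first two prediction sets, while RMSE doubles for the one with a single large miss.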

Metric Selection by Scenario

Scenario	Best Metric	Why
Medical diagnosis — missing a positive case is catastrophic	Recall	Minimize false negatives
Spam filter — sending legitimate email to spam is costly	Precision	Minimize false positives
Fraud detection — imbalanced classes	F1, ROC/AUC	Balance precision/recall
Predicting house prices — large errors especially bad	RMSE	Penalizes large errors more
Predicting delivery time — outliers shouldn't dominate	MAE	Robust to outliers
Explaining to business stakeholders	R²	Intuitive: "explains X% of the variance"

Exponentiating Log-Transformed Targets

When a regression model is trained on a log-transformed target, predictions are also on the log scale. Before computing RMSE or interpreting predictions, exponentiate back:

import numpy as np

log_predictions = model.predict(X_test)
predictions = np.exp(log_predictions)      # back to original scale
actual = np.exp(y_test_log)

rmse = np.sqrt(np.mean((predictions - actual) ** 2))
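To see why the scale matters, the following plain-Python sketch uses invented prices and a hypothetical model whose log-scale predictions are each 0.1 too high, then computes RMSE on both scales.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical house prices and their log-transformed targets.
actual_prices = [100_000.0, 250_000.0, 900_000.0]
log_actual = [math.log(p) for p in actual_prices]

# Hypothetical model output on the log scale: each prediction is 0.1 too high.
log_predictions = [v + 0.1 for v in log_actual]

rmse_log = rmse(log_actual, log_predictions)           # in log units
predicted_prices = [math.exp(v) for v in log_predictions]
rmse_price = rmse(actual_prices, predicted_prices)     # in currency units

print(round(rmse_log, 3), round(rmse_price))
```

An error of 0.1 on the log scale corresponds to roughly a 10.5% overestimate of every price, which is tens of thousands in currency units and is the number stakeholders actually care about.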

Bias-Variance Tradeoff

	Training Error	Validation Error	Interpretation
High bias (underfitting)	High	High	Model too simple
High variance (overfitting)	Low	High	Model memorized training data
Well-fit	Low	Low	Generalizes to unseen data

Fixes:

Problem	Fixes
High bias	More complex model, add features, reduce regularization
High variance	More training data, stronger regularization, simpler model, early stopping

As model complexity increases, training error generally keeps decreasing, while validation error follows a U-shape. The optimal model sits at the bottom of the validation error curve.

Sample Questions

A data scientist is working on a model to predict customer churn. The dataset is highly imbalanced, with only 10% of instances representing churned customers. Which strategy directly mitigates the model's bias towards the non-churn class?

Section 4: Model Deployment

Batch vs Streaming vs Real-Time

	Batch	Streaming	Real-Time
Latency	High (minutes to hours)	Medium (seconds)	Low (milliseconds)
Throughput	Very high	High	Moderate
Input pace	Scheduled, periodic	Continuous stream	On-demand, request-by-request
Infrastructure	Databricks Jobs, Spark	Lakeflow Spark Declarative Pipelines	Databricks Model Serving
Best for	Nightly scoring, ETL	Event-driven inference, IoT	Fraud detection, recommendations, chatbots

Batch is unsuitable when data changes faster than every ~30 minutes or stale predictions are harmful. Streaming is event-driven and not suitable for millisecond needs. Real-time requires online feature stores when using automatic feature lookups.

Deploying Custom Models

To deploy custom logic (pre/post-processing, output transformation), extend mlflow.pyfunc.PythonModel:

import mlflow
import mlflow.pyfunc

class CustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import pickle
        with open(context.artifacts["model_path"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        # Assumes the wrapped model's predict() returns a score or probability per row
        raw_predictions = self.model.predict(model_input)
        return ["high_risk" if p > 0.7 else "low_risk" for p in raw_predictions]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=CustomModel(),
        artifacts={"model_path": "model.pkl"}
    )

Batch Inference

import mlflow
import pandas as pd

model = mlflow.pyfunc.load_model("models:/catalog.schema.model_name/1")

pandas_df = pd.read_parquet("data/batch_input.parquet")
predictions = model.predict(pandas_df)

For Feature Store models, use fe.score_batch(): it handles feature lookups automatically from a Spark DataFrame of primary keys:

predictions_spark = fe.score_batch(
    model_uri="models:/catalog.schema.model_name/1",
    df=primary_keys_spark_df
)

Streaming Inference with Delta Live Tables

Use Lakeflow Spark Declarative Pipelines (formerly DLT) with the MLflow model loaded as a Spark UDF:

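import dlt
import mlflow
from pyspark.sql import functions as F

# Hypothetical feature column names; replace with the model's input columns
feature_cols = ["feat1", "feat2"]

predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/catalog.schema.model_name/1")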

@dlt.table
def inference_results():
    streaming_df = spark.readStream.table("source_table")
    return streaming_df.withColumn("prediction", predict_udf(*feature_cols))

DLT handles auto-scaling (Enhanced Autoscaling on by default), triggered and continuous modes, data expectations, schema evolution, and unified batch + streaming in the same pipeline code.

Real-Time Inference: Deploy and Query

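from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

client = WorkspaceClient()
# The SDK expects typed config objects rather than a plain dict
client.serving_endpoints.create(
    name="my-model-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="catalog.schema.model_name",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)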

import requests, json

url = "https://<databricks-host>/serving-endpoints/my-model-endpoint/invocations"
response = requests.post(url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    data=json.dumps({"inputs": [[1.0, 2.0, 3.0]]})
)
predictions = response.json()

A/B Testing with Traffic Splitting

Configure a single endpoint to split traffic between model versions:

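from databricks.sdk.service.serving import Route, ServedEntityInput, TrafficConfig

model = "catalog.schema.model_name"
client.serving_endpoints.update_config(
    name="my-model-endpoint",
    served_entities=[
        ServedEntityInput(name="champion", entity_name=model, entity_version="3", workload_size="Small", scale_to_zero_enabled=True),
        ServedEntityInput(name="challenger", entity_name=model, entity_version="4", workload_size="Small", scale_to_zero_enabled=True),
    ],
    traffic_config=TrafficConfig(routes=[
        Route(served_model_name="champion", traffic_percentage=50),
        Route(served_model_name="challenger", traffic_percentage=50),
    ]),
)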

Monitor both versions' metrics; when the challenger demonstrates sufficient improvement, update the @champion alias.
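What a percentage split means for individual requests can be illustrated with a rough simulation. This is plain Python with hypothetical route names and a seeded random generator; it is not how Model Serving implements routing, only a sketch of the expected traffic shares.

```python
import random
from collections import Counter

# Hypothetical routes and their traffic percentages (must sum to 100).
routes = {"version_3": 50, "version_4": 50}
total_percentage = sum(routes.values())

# Assign 10,000 simulated requests to routes in proportion to the weights.
rng = random.Random(42)
assignments = rng.choices(list(routes), weights=list(routes.values()), k=10_000)
counts = Counter(assignments)

print(counts)
```

Each version receives close to half of the requests, but not exactly half, which is why A/B comparisons need enough traffic before a difference in metrics can be trusted.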

Sample Questions

A company has a podcast platform with thousands of users. An anomaly detection algorithm runs on a 10-minute running window of user events. A machine learning engineer wants to deploy this into a production data pipeline handling tens of thousands of events per second, with dynamic compute resizing. Which approach meets these requirements?