Data Drift

Drift detection for continuous and categorical distributions, in both batch and streaming modes. Import from fair_perf_ml.drift.

Supported metrics: Jensen–Shannon Divergence, Population Stability Index (PSI), Wasserstein Distance, Kullback–Leibler Divergence.
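As a rough illustration of what these four metrics measure, the sketch below computes each one over two probability vectors defined on the same bins. It is a textbook-style approximation, not the library's implementation: the function names, the EPS smoothing constant, and the unit-width assumption for the Wasserstein distance are all assumptions made for this example.

```python
import math

EPS = 1e-12  # assumed smoothing constant to avoid log(0); not taken from the library


def kl_divergence(p, q):
    """KL(p || q) for two probability vectors over the same bins."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric JS divergence: average KL of p and q against their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def psi(p, q):
    """Population Stability Index: sum of (q - p) * ln(q / p) over bins."""
    return sum((qi - pi) * math.log((qi + EPS) / (pi + EPS)) for pi, qi in zip(p, q))


def wasserstein_1d(p, q):
    """1-D Wasserstein distance assuming unit-width bins: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total


baseline = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]
scores = [f(baseline, shifted) for f in (jensen_shannon, psi, wasserstein_1d, kl_divergence)]
```

Jensen-Shannon and PSI are symmetric in their arguments, while Kullback-Leibler is not, so the order of baseline and candidate matters for the latter.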


Enums

fair_perf_ml.drift.base.DataDriftType

Bases: str, Enum

Currently supported methods of deriving the divergence between two distributions.

fair_perf_ml.drift.base.QuantileType

Bases: str, Enum

Supported methods for deriving the number of bins used when approximating a continuous distribution.


Batch functions

fair_perf_ml.drift.base.compute_drift_continuous_distribution

compute_drift_continuous_distribution(baseline_distribution: FloatingPointDataSlice, candidate_distribution: FloatingPointDataSlice, drift_metrics: list[DataDriftMetric], quantile_type: QuantileConfig | None = None) -> list[float]

Ad hoc computation of drift between two distributions of continuous data.

Parameters:

Name Type Description Default
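baseline_distribution FloatingPointDataSlice

The reference distribution, as continuous (floating-point) values.

required
candidate_distribution FloatingPointDataSlice

The distribution to compare against the baseline, as continuous (floating-point) values.

required
drift_metrics list[DataDriftMetric]

The divergence measures to compute.

required
quantile_type QuantileConfig | None

Rule used to derive the number of bins. When None, defaults to FreedmanDiaconis.

None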

Returns: list[float], one score per entry in drift_metrics, in the same order.

fair_perf_ml.drift.base.compute_drift_categorical_distribution

compute_drift_categorical_distribution(baseline_distribution: list[StringBound], candidate_distribution: list[StringBound], drift_metrics: list[DataDriftMetric]) -> list[float]

Ad hoc computation of drift between two distributions of categorical data.

Parameters:

Name Type Description Default
baseline_distribution list[StringBound]

The reference distribution, as a list of labels.

required
candidate_distribution list[StringBound]

The distribution to compare against the baseline, as a list of labels.

required
drift_metrics list[DataDriftMetric]

The divergence measures to compute.

required

Returns: list[float], one score per entry in drift_metrics, in the same order.


Batch classes

fair_perf_ml.drift.base.ContinuousDataDrift

Bases: DataDriftDiscreteBase[float, list[float]]

Detects distributional drift in continuous (floating-point) features between a fixed baseline dataset and a runtime dataset.

Internally, the baseline is summarized as a histogram. The number of bins is derived automatically from the baseline data using the selected quantile rule. Drift is then measured by comparing the runtime data's distribution against the baseline histogram using the chosen divergence metric.

This type is suited for batch analysis: you collect a runtime dataset and compare it against the baseline in one call. For long-running accumulation where data arrives incrementally, use the streaming variants instead.

Considerations
  1. Bin count is determined by the baseline data and the quantile rule. If the data does not support the target bin count, fewer bins will be used.
  2. Resetting the baseline recomputes the histogram from scratch using the same quantile rule.
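The histogram-based comparison described above can be sketched as follows. This is a simplified stand-in, not the library class: HistogramBaseline, its method names, the equal-width binning, the hand-picked num_bins, and the clamping of out-of-range runtime values into the edge bins are all assumptions made for illustration.

```python
class HistogramBaseline:
    """Hypothetical stand-in for a histogram-backed baseline; not the library class."""

    def __init__(self, baseline, num_bins):
        self.num_bins = num_bins
        self.reset(baseline)

    def reset(self, baseline):
        # Recompute the range and bin fractions from scratch for the new baseline.
        self.lo, self.hi = min(baseline), max(baseline)
        self.fractions = self._fractions(baseline)

    def _fractions(self, data):
        counts = [0] * self.num_bins
        span = self.hi - self.lo
        for x in data:
            idx = int((x - self.lo) / span * self.num_bins) if span > 0 else 0
            # Clamp so the maximum and any out-of-range values land in an edge bin.
            counts[min(max(idx, 0), self.num_bins - 1)] += 1
        return [c / len(data) for c in counts]

    def project(self, runtime):
        """Bin runtime data on the baseline's bins so the two are comparable."""
        return self._fractions(runtime)


baseline = HistogramBaseline([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], num_bins=4)
runtime_fractions = baseline.project([6.5, 7.0, 7.5, 9.0])
```

Once both datasets are expressed as fractions over the same bins, any divergence measure can be applied to the two vectors.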

num_bins property

num_bins: int

The number of histogram bins derived from the baseline dataset.

__init__

__init__(baseline_data: FloatingPointDataSlice, quantile_type: str | None = None) -> None

Initialize with a baseline dataset.

Parameters:

Name Type Description Default
baseline_data FloatingPointDataSlice

The reference distribution. Accepts a numpy array or any iterable of values castable to float.

required
quantile_type str | None

Controls how many histogram bins are derived from the baseline. Options: "FreedmanDiaconis" (default, IQR-based, robust to outliers), "Scott" (std-based, assumes roughly normal data), "Sturges" (log2-based, best for small datasets). Pass None to use the default. Also accepts a QuantileType enum value.

None
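For intuition about the three rules, here is a sketch using their standard textbook formulas. Whether the library uses exactly these constants, quantile definitions, and rounding is an assumption; the function names and the _bins_from_width helper are hypothetical.

```python
import math
import statistics


def _bins_from_width(data, width):
    """Convert a bin width into a bin count, falling back to 1 for degenerate data."""
    span = max(data) - min(data)
    if width <= 0 or span <= 0:
        return 1
    return max(1, math.ceil(span / width))


def sturges_bins(data):
    """Sturges: ceil(log2(n)) + 1 bins, driven only by the sample size."""
    return math.ceil(math.log2(len(data))) + 1


def scott_bins(data):
    """Scott: bin width 3.49 * std * n^(-1/3)."""
    width = 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)
    return _bins_from_width(data, width)


def freedman_diaconis_bins(data):
    """Freedman-Diaconis: bin width 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2 * (q3 - q1) * len(data) ** (-1 / 3)
    return _bins_from_width(data, width)


sample = [float(x) for x in range(1, 101)]
bin_counts = {
    "Sturges": sturges_bins(sample),
    "Scott": scott_bins(sample),
    "FreedmanDiaconis": freedman_diaconis_bins(sample),
}
```

Because Freedman-Diaconis depends on the interquartile range rather than the standard deviation, a few extreme values change its bin width far less than they change Scott's.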

reset_baseline

reset_baseline(new_baseline: FloatingPointDataSlice) -> None

Replace the baseline with a new dataset, recomputing the histogram.

Parameters:

Name Type Description Default
new_baseline FloatingPointDataSlice

The new reference distribution. Accepts a numpy array or any iterable of values castable to float.

required

compute_drift

compute_drift(runtime_data: FloatingPointDataSlice, drift_metric: DataDriftMetric) -> float

Compute a single drift score between runtime_data and the baseline.

Parameters:

Name Type Description Default
runtime_data FloatingPointDataSlice

The data collected at runtime. Accepts a numpy array or any iterable of values castable to float.

required
drift_metric DataDriftMetric

The divergence measure to use. Accepts a DataDriftType enum value or one of the strings "JensenShannon", "PopulationStabilityIndex", "WassersteinDistance", "KullbackLeibler".

required

Returns:

Type Description
float

The drift score as a float. Higher values indicate greater divergence from the baseline distribution.

compute_drift_multiple_criteria

compute_drift_multiple_criteria(runtime_data: FloatingPointDataSlice, drift_metrics: list[DataDriftMetric]) -> list[float]

Compute multiple drift scores against runtime_data in a single pass.

Parameters:

Name Type Description Default
runtime_data FloatingPointDataSlice

The data collected at runtime. Accepts a numpy array or any iterable of values castable to float.

required
drift_metrics list[DataDriftMetric]

A list of divergence measures to compute. Each entry accepts a DataDriftType enum value or a metric name string.

required

Returns:

Type Description
list[float]

A list of drift scores in the same order as drift_metrics.
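The order-preserving behavior can be pictured with a small name-keyed dispatch over already-binned distributions. This is only a sketch: score_all, the METRICS registry, and the eps smoothing are hypothetical, and only two of the four metric names are wired up to keep the example short.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) with a small assumed smoothing term."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def psi(p, q, eps=1e-12):
    """Population Stability Index with a small assumed smoothing term."""
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


# Hypothetical registry keyed by the metric name strings listed in this reference.
METRICS = {"KullbackLeibler": kl, "PopulationStabilityIndex": psi}


def score_all(baseline_probs, runtime_probs, metric_names):
    """Return one score per requested metric, in the order requested."""
    unknown = [name for name in metric_names if name not in METRICS]
    if unknown:
        raise ValueError(f"unsupported metrics: {unknown}")
    # The binned distributions are computed once by the caller and reused per metric.
    return [METRICS[name](baseline_probs, runtime_probs) for name in metric_names]


scores = score_all([0.5, 0.5], [0.9, 0.1], ["PopulationStabilityIndex", "KullbackLeibler"])
```

Reversing the requested names reverses the returned list, so results can be zipped back against the request.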

export_baseline

export_baseline() -> list[float]

Export the baseline as a normalized probability distribution.

Returns:

Type Description
list[float]

A list of floats, one per bin, where each value is the fraction of baseline samples that fall into that bin. Values sum to 1.0.

fair_perf_ml.drift.base.CategoricalDataDrift

Bases: DataDriftDiscreteBase[str, dict[str, float]]

num_bins property

num_bins: int

The number of histogram bins derived from the baseline dataset.

__init__

__init__(baseline_data: Sequence[StringBound]) -> None

Initialize with a baseline dataset.

Parameters:

Name Type Description Default
baseline_data Sequence[StringBound]

The reference distribution. Any iterable whose elements implement __str__.

required

reset_baseline

reset_baseline(new_baseline: Sequence[StringBound]) -> None

Replace the baseline with a new dataset, recomputing the label distribution.

Parameters:

Name Type Description Default
new_baseline Sequence[StringBound]

The new reference distribution. Any iterable whose elements implement __str__.

required

compute_drift

compute_drift(runtime_data: Sequence[StringBound], drift_metric: DataDriftMetric) -> float

Compute a single drift score between runtime_data and the baseline.

Parameters:

Name Type Description Default
runtime_data Sequence[StringBound]

The data collected at runtime. Any iterable whose elements implement __str__.

required
drift_metric DataDriftMetric

The divergence measure to use. Accepts a DataDriftType enum value or one of the strings "JensenShannon", "PopulationStabilityIndex", "WassersteinDistance", "KullbackLeibler".

required

Returns:

Type Description
float

The drift score as a float. Higher values indicate greater divergence from the baseline distribution.

compute_drift_multiple_criteria

compute_drift_multiple_criteria(runtime_data: Sequence[StringBound], drift_metrics: list[DataDriftMetric]) -> list[float]

Compute multiple drift scores against runtime_data in a single pass.

Parameters:

Name Type Description Default
runtime_data Sequence[StringBound]

The data collected at runtime. Any iterable whose elements implement __str__.

required
drift_metrics list[DataDriftMetric]

A list of divergence measures to compute. Each entry accepts a DataDriftType enum value or a metric name string.

required

Returns:

Type Description
list[float]

A list of drift scores in the same order as drift_metrics.

export_baseline

export_baseline() -> dict[str, float]

Export the baseline as a normalized label frequency distribution.

Returns:

Type Description
dict[str, float]

A dict mapping each label (including the overflow bin) to its fraction of the baseline dataset. Values sum to 1.0.
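A normalized label distribution with an overflow bin might look like the sketch below. The OVERFLOW key, the label_fractions helper, and the choice to pool labels unseen in the baseline into the overflow bin are assumptions; this page does not specify the key the library uses or exactly what it collects there.

```python
from collections import Counter

OVERFLOW = "__other__"  # hypothetical overflow label; the library's actual key may differ


def label_fractions(labels, known=None):
    """Normalized label frequencies; labels outside `known` are pooled into OVERFLOW."""
    as_text = [str(label) for label in labels]
    if known is None:
        known = set(as_text)
    counts = Counter(label if label in known else OVERFLOW for label in as_text)
    total = len(as_text)
    # Emit every known label, even with zero mass, so distributions stay aligned.
    fractions = {label: counts.get(label, 0) / total for label in sorted(known)}
    fractions[OVERFLOW] = counts.get(OVERFLOW, 0) / total
    return fractions


baseline = label_fractions(["cat", "dog", "dog", "bird"])
runtime = label_fractions(["dog", "fish", "dog", "cat"], known=set(baseline) - {OVERFLOW})
```

Keeping the same keys on both sides means the two dicts can be compared label by label with any of the divergence measures.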


Streaming — continuous

fair_perf_ml.drift.streaming.StreamingContinuousDataDriftFlush

Bases: DataDriftStreamingBase[float, dict[str, list[float]], list[float]]

fair_perf_ml.drift.streaming.StreamingContinuousDataDriftDecay

Bases: DataDriftStreamingBase[float, dict[str, list[float]], list[float]]


Streaming — categorical

fair_perf_ml.drift.streaming.StreamingCategoricalDataDriftFlush

Bases: DataDriftStreamingBase[StringBound, dict[str, float], dict[str, float]]

fair_perf_ml.drift.streaming.StreamingCategoricalDataDriftDecay

Bases: DataDriftStreamingBase[StringBound, dict[str, float], dict[str, float]]


Exceptions

fair_perf_ml.drift.base.DataDriftParameterValidationError

Bases: Exception

Raised when the user passes invalid data.