Data Drift
Drift detection for continuous and categorical distributions, in batch and streaming modes.
Import from fair_perf_ml.drift.
Supported metrics: Jensen–Shannon Divergence, Population Stability Index (PSI), Wasserstein Distance, Kullback–Leibler Divergence.
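For orientation, the four metrics can be sketched over two probability vectors defined on the same bins or labels. This is a minimal illustration using only the standard library, not the implementation in fair_perf_ml; the function names, the `EPS` floor, and the use of natural logarithms are all assumptions made for the example.

```python
import math

EPS = 1e-12  # assumption: small floor to avoid log(0); the library may treat zeros differently


def _normalize(weights):
    """Floor each non-negative weight at EPS and rescale so the result sums to 1."""
    floored = [max(w, EPS) for w in weights]
    total = sum(floored)
    return [w / total for w in floored]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), in nats. Not symmetric."""
    p, q = _normalize(p), _normalize(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, at most ln(2) in nats."""
    p, q = _normalize(p), _normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def psi(p, q):
    """Population Stability Index: sum of (q - p) * ln(q / p). Symmetric."""
    p, q = _normalize(p), _normalize(q)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def wasserstein_1d(p, q, bin_width=1.0):
    """1-D Wasserstein distance between histograms on the same equal-width bins."""
    p, q = _normalize(p), _normalize(q)
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi          # running difference of the two CDFs
        total += abs(cdf_gap) * bin_width
    return total
```

All four return 0 for identical inputs and grow as the candidate moves away from the baseline. Only the Wasserstein distance takes the ordering and spacing of the bins into account.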
Enums
fair_perf_ml.drift.base.DataDriftType
Bases: str, Enum
Currently supported methods of deriving the divergence between two distributions.
fair_perf_ml.drift.base.QuantileType
Bases: str, Enum
Supported methods for deriving the number of bins used when approximating a continuous distribution.
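As an illustration of a bin-count rule, the Freedman–Diaconis rule sets the bin width to 2 · IQR / n^(1/3) and divides the data range by that width. The sketch below uses only the standard library; the function name and the single-bin fallback for degenerate data are assumptions for this example and may not match how the library behaves.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Suggest a histogram bin count using the Freedman-Diaconis rule."""
    data = sorted(float(v) for v in values)
    n = len(data)
    if n < 2:
        return 1  # assumption: too little data for quartiles, use one bin
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    spread = data[-1] - data[0]
    if iqr <= 0 or spread <= 0:
        return 1  # assumption: degenerate data falls back to a single bin
    width = 2.0 * iqr / n ** (1.0 / 3.0)
    return max(1, math.ceil(spread / width))
```

Because the width depends on the interquartile range rather than the standard deviation, the rule is less sensitive to outliers than variance-based alternatives.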
Batch functions
fair_perf_ml.drift.base.compute_drift_continuous_distribution
compute_drift_continuous_distribution(baseline_distribution: FloatingPointDataSlice, candidate_distribution: FloatingPointDataSlice, drift_metrics: list[DataDriftMetric], quantile_type: QuantileConfig | None = None) -> list[float]
Ad hoc computation of drift between two distributions of continuous data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_distribution | FloatingPointDataSlice | Baseline (reference) sample of continuous values. | required |
| candidate_distribution | FloatingPointDataSlice | Candidate sample to compare against the baseline. | required |
| drift_metrics | list[DataDriftMetric] | Drift metrics to compute. | required |
| quantile_type | QuantileConfig \| None | Rule used to derive the number of histogram bins. Defaults to FreedmanDiaconis when None. | None |
Returns: list[float], one score per drift metric provided, in the same order as drift_metrics.
fair_perf_ml.drift.base.compute_drift_categorical_distribution
compute_drift_categorical_distribution(baseline_distribution: list[StringBound], candidate_distribution: list[StringBound], drift_metrics: list[DataDriftMetric]) -> list[float]
Ad hoc computation of drift between two distributions of categorical data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_distribution | list[StringBound] | Baseline (reference) sample of categorical labels. | required |
| candidate_distribution | list[StringBound] | Candidate sample to compare against the baseline. | required |
| drift_metrics | list[DataDriftMetric] | Drift metrics to compute. | required |
Returns: list[float], one score per drift metric provided, in the same order as drift_metrics.
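The element-wise mapping between requested metrics and returned scores can be sketched for categorical data as follows. This is a standalone illustration, not the library's code: the function `categorical_drift`, the string metric keys, and the additive smoothing constant are hypothetical and stand in for `DataDriftMetric` values, which are not listed here.

```python
import math
from collections import Counter


def _psi(p, q):
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def _kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical metric registry; the keys are placeholders for this sketch only.
METRICS = {"psi": _psi, "kl": _kl}


def _smoothed_freqs(counts, labels, eps=1e-6):
    """Relative frequency per label, with additive smoothing for unseen labels."""
    total = sum(counts.values())
    return [(counts.get(lab, 0) + eps) / (total + eps * len(labels)) for lab in labels]


def categorical_drift(baseline, candidate, metrics):
    """Return one score per entry in `metrics`, in the same order."""
    base_counts = Counter(str(x) for x in baseline)
    cand_counts = Counter(str(x) for x in candidate)
    # Use the union of labels so both vectors share the same axis.
    labels = sorted(set(base_counts) | set(cand_counts))
    p = _smoothed_freqs(base_counts, labels)
    q = _smoothed_freqs(cand_counts, labels)
    return [METRICS[name](p, q) for name in metrics]
```

Calling `categorical_drift(baseline, candidate, ["psi", "kl"])` yields a two-element list whose first entry is the PSI score and whose second is the KL score. Smoothing keeps labels that appear in only one of the two samples from producing an infinite score.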
Batch classes
fair_perf_ml.drift.base.ContinuousDataDrift
Bases: DataDriftDiscreteBase[float, list[float]]
Detects distributional drift in continuous (floating-point) features between a fixed baseline dataset and a runtime dataset.
Internally, the baseline is summarized as a histogram. The number of bins is derived automatically from the baseline data using the selected quantile rule. Drift is then measured by comparing the runtime data's distribution against the baseline histogram using the chosen divergence metric.
This type is suited for batch analysis: you collect a runtime dataset and compare it against the baseline in one call. For long-running accumulation where data arrives incrementally, use the streaming variants instead.
Considerations
- Bin count is determined by the baseline data and the quantile rule. If the data does not support the target bin count, fewer bins will be used.
- Resetting the baseline recomputes the histogram from scratch using the same quantile rule.
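The behaviour described above (bin edges fixed by the baseline, runtime data binned against those edges, and a histogram rebuilt on reset) can be sketched as below. This is an assumption-laden illustration rather than the library's implementation: the class name `BaselineHistogramSketch` is hypothetical, a fixed `n_bins` target replaces the quantile rule, equal-width bins are assumed, and "fewer bins" is modelled by capping the count at the number of distinct baseline values.

```python
import math
from bisect import bisect_right

EPS = 1e-9  # assumption: smoothing so empty bins never produce log(0)


def psi(p, q):
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


class BaselineHistogramSketch:
    """Hypothetical stand-in for a baseline-histogram drift detector."""

    def __init__(self, baseline_data, n_bins=10):
        self.n_bins = n_bins  # target bin count, kept across resets
        self.reset_baseline(baseline_data)

    def reset_baseline(self, new_baseline):
        """Recompute edges and baseline frequencies from scratch."""
        data = [float(v) for v in new_baseline]
        if not data:
            raise ValueError("baseline must not be empty")
        lo, hi = min(data), max(data)
        # Use fewer bins when the data cannot support the target count.
        bins = max(1, min(self.n_bins, len(set(data))))
        width = (hi - lo) / bins if hi > lo else 1.0
        self.edges = [lo + i * width for i in range(1, bins)]  # interior edges only
        self.baseline_freqs = self._bin(data)

    def _bin(self, data):
        """Bin values on the baseline edges; out-of-range values land in the end bins."""
        counts = [0] * (len(self.edges) + 1)
        for v in data:
            counts[bisect_right(self.edges, float(v))] += 1
        total = len(data)
        return [(c + EPS) / (total + EPS * len(counts)) for c in counts]

    def compute_drift_multiple_criteria(self, runtime_data, metric_fns):
        """Bin the runtime data once, then score it with every metric."""
        q = self._bin(runtime_data)
        return [fn(self.baseline_freqs, q) for fn in metric_fns]
```

Keeping only interior edges means runtime values outside the baseline range are still counted, in the first or last bin, so a shifted distribution shows up as mass piling into an end bin.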
__init__
Initialize with a baseline dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_data | FloatingPointDataSlice | The reference distribution. Accepts a numpy array or any iterable of values castable to float. | required |
| quantile_type | str \| None | Controls how many histogram bins are derived from the baseline. Options: | None |
reset_baseline
Replace the baseline with a new dataset, recomputing the histogram.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_baseline | FloatingPointDataSlice | The new reference distribution. Accepts a numpy array or any iterable of values castable to float. | required |
compute_drift
Compute a single drift score between runtime_data and the baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runtime_data | FloatingPointDataSlice | The data collected at runtime. Accepts a numpy array or any iterable of values castable to float. | required |
| drift_metric | DataDriftMetric | The divergence measure to use. Accepts a | required |
Returns:
| Type | Description |
|---|---|
| float | The drift score as a float. Higher values indicate greater divergence from the baseline distribution. |
compute_drift_multiple_criteria
compute_drift_multiple_criteria(runtime_data: FloatingPointDataSlice, drift_metrics: list[DataDriftMetric]) -> list[float]
Compute multiple drift scores against runtime_data in a single pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runtime_data | FloatingPointDataSlice | The data collected at runtime. Accepts a numpy array or any iterable of values castable to float. | required |
| drift_metrics | list[DataDriftMetric] | A list of divergence measures to compute. Each entry accepts a | required |
Returns:
| Type | Description |
|---|---|
| list[float] | A list of drift scores in the same order as |
fair_perf_ml.drift.base.CategoricalDataDrift
Bases: DataDriftDiscreteBase[str, dict[str, float]]
__init__
Initialize with a baseline dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| baseline_data | Sequence[StringBound] | The reference distribution. Any iterable whose elements implement | required |
reset_baseline
Replace the baseline with a new dataset, recomputing the label distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_baseline | Sequence[StringBound] | The new reference distribution. Any iterable whose elements implement | required |
compute_drift
Compute a single drift score between runtime_data and the baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runtime_data | Sequence[StringBound] | The data collected at runtime. Any iterable whose elements implement | required |
| drift_metric | DataDriftMetric | The divergence measure to use. Accepts a | required |
Returns:
| Type | Description |
|---|---|
| float | The drift score as a float. Higher values indicate greater divergence from the baseline distribution. |
compute_drift_multiple_criteria
compute_drift_multiple_criteria(runtime_data: Sequence[StringBound], drift_metrics: list[DataDriftMetric]) -> list[float]
Compute multiple drift scores against runtime_data in a single pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runtime_data | Sequence[StringBound] | The data collected at runtime. Any iterable whose elements implement | required |
| drift_metrics | list[DataDriftMetric] | A list of divergence measures to compute. Each entry accepts a | required |
Returns:
| Type | Description |
|---|---|
| list[float] | A list of drift scores in the same order as |
Streaming — continuous
fair_perf_ml.drift.streaming.StreamingContinuousDataDriftFlush
Bases: DataDriftStreamingBase[float, dict[str, list[float]], list[float]]
fair_perf_ml.drift.streaming.StreamingContinuousDataDriftDecay
Bases: DataDriftStreamingBase[float, dict[str, list[float]], list[float]]
Streaming — categorical
fair_perf_ml.drift.streaming.StreamingCategoricalDataDriftFlush
Bases: DataDriftStreamingBase[StringBound, dict[str, float], dict[str, float]]
fair_perf_ml.drift.streaming.StreamingCategoricalDataDriftDecay
Bases: DataDriftStreamingBase[StringBound, dict[str, float], dict[str, float]]
Exceptions
fair_perf_ml.drift.base.DataDriftParameterValidationError
Bases: Exception
Exception for when users pass invalid data in