How to Detect Drift in Feature Values Between Train and Inference Times
A common need when monitoring an ML model is to know whether a feature behaves differently at inference time than it did during model training and testing. A large difference in a feature's value distribution, or in its average value within a specific segment of the data, may reduce our confidence in the model's ability to handle that segment.
Mona makes it easy to get alerted on such cases.
To illustrate, we will use a generic example. Let's say we have a model which uses 3 features (in reality this number would be much larger) as its input, denoted as "feature_1", "feature_2" and "feature_3".
Let's also suppose the model outputs a score (denoted "score"), and for monitoring purposes we also track some metadata about each model training/inference run such as "user_country" and "user_age".
So an example JSON event being sent to Mona would look something like:
{
  "feature_1": 0.23,
  "feature_2": 0.667,
  "feature_3": 0.1,
  "score": 0.88,
  "user_age": 33,
  "user_country": "US",
  "model_version": "v5",
  "model_run_type": "train"
}
The event above contains two keys we haven't yet discussed: "model_version", which is used to compare inference runs only to train/test runs of the same model version, and "model_run_type", which states whether the run is a "train", "test" or "inference" run.
You can easily export all your training, test and inference data to Mona in such a way.
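As a minimal sketch of assembling such events before exporting them, the Python snippet below builds one event per model run. The helper `build_event` is hypothetical and is not part of Mona's SDK; the actual export call is deliberately left out, since it depends on the client you use.

```python
import json

ALLOWED_RUN_TYPES = {"train", "test", "inference"}


def build_event(features, score, user_age, user_country, model_version, model_run_type):
    """Assemble one flat event dict in the shape shown above (hypothetical helper)."""
    if model_run_type not in ALLOWED_RUN_TYPES:
        raise ValueError(f"unexpected model_run_type: {model_run_type!r}")
    event = dict(features)  # e.g. {"feature_1": 0.23, ...}
    event.update(
        {
            "score": score,
            "user_age": user_age,
            "user_country": user_country,
            "model_version": model_version,
            "model_run_type": model_run_type,
        }
    )
    return event


event = build_event(
    {"feature_1": 0.23, "feature_2": 0.667, "feature_3": 0.1},
    score=0.88,
    user_age=33,
    user_country="US",
    model_version="v5",
    model_run_type="train",
)
print(json.dumps(event, indent=2))
```

Validating "model_run_type" up front is a design choice of this sketch: a misspelled run type would otherwise silently fall outside both the benchmark and target sets defined later.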
The "fields" part of your Mona configuration may be trivial, and will just include the above fields, as follows (Mona will actually create such a suggested config for you automatically, given initial exported data):
{
  "YOUR-USER-ID": {
    "YOUR-CONTEXT-CLASS": {
      "fields": {
        "feature_1": {
          "type": "numeric"
        },
        "feature_2": {
          "type": "numeric"
        },
        "feature_3": {
          "type": "numeric"
        },
        "score": {
          "type": "numeric"
        },
        "user_age": {
          "type": "numeric",
          "segmentations": {
            "5_years": {
              "bucket_size": 5,
              "default": true
            }
          }
        },
        "user_country": {
          "type": "string"
        },
        "model_version": {
          "type": "string"
        },
        "model_run_type": {
          "type": "string"
        }
      }
    }
  }
}
For more information on fields definitions, see Fields Overview.
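To give a feel for what a suggested "fields" config might look like when derived from exported data, here is a rough sketch that infers a type per key from one sample event. It is only an illustration, under the assumption that numbers map to "numeric" and everything else maps to "string"; `suggest_fields` is a hypothetical helper and does not reflect how Mona actually builds its suggestion.

```python
import json
from numbers import Number


def suggest_fields(sample_event):
    """Map each key of a sample event to a minimal field definition (illustrative only)."""
    fields = {}
    for key, value in sample_event.items():
        # bool is a subclass of int in Python, so exclude it from "numeric".
        is_numeric = isinstance(value, Number) and not isinstance(value, bool)
        fields[key] = {"type": "numeric" if is_numeric else "string"}
    return fields


sample = {
    "feature_1": 0.23,
    "feature_2": 0.667,
    "feature_3": 0.1,
    "score": 0.88,
    "user_age": 33,
    "user_country": "US",
    "model_version": "v5",
    "model_run_type": "train",
}
fields = suggest_fields(sample)
# Segmentations such as the 5-year age buckets still have to be added by hand.
fields["user_age"]["segmentations"] = {"5_years": {"bucket_size": 5, "default": True}}
print(json.dumps({"fields": fields}, indent=2))
```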
Now that we have a good understanding of how the data looks, and how the fields are configured, we can use Mona's "stanzas" configurations to tell it to detect and alert on drifts in the feature average values between train/test time and inference time, on any given model version.
{
  "YOUR-USER-ID": {
    "YOUR-CONTEXT-CLASS": {
      "fields": {...},
      "stanzas": {
        "train_test_feature_drifts": {
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005,
              "segment_by": [
                "user_country",
                "user_age"
              ],
              "metrics": [
                "feature_1",
                "feature_2",
                "feature_3"
              ],
              "target_set_period": "14d",
              "benchmark_set_period": "",
              "min_anomaly_level": 0.25,
              "avoid_benchmark_target_overlap": false,
              "always_segment_baseline_by": [
                "model_version"
              ],
              "target_baseline_segment": {
                "model_run_type": [
                  {
                    "value": "inference"
                  }
                ]
              },
              "benchmark_baseline_segment": {
                "model_run_type": [
                  {
                    "value": "train"
                  },
                  {
                    "value": "test"
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
So what is happening here?
We've defined a single "verse" (which is the atomic configuration unit for Mona's Insights Generator) with type "AverageDrift".
"AverageDrift" typed verses look for statistically significant changes in the average value of metrics, between a "benchmark" dataset and a "target" dataset.
In this case, we defined (using the "metrics" parameter) that the relevant metrics to look for drifts in are the 3 different features.
Using the "min_anomaly_level" parameter, we also defined that a statistically significant change (drift) occurs when the difference between the benchmark and target averages is at least 0.25 times the standard deviation (STD) of the metric in the entire segment.
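The threshold can be made concrete with a small numeric sketch. It assumes that "STD of the metric in the entire segment" means the standard deviation over the benchmark and target values combined, and it ignores any further significance testing Mona may apply; `is_average_drift` is a hypothetical name.

```python
from statistics import mean, pstdev


def is_average_drift(benchmark, target, min_anomaly_level=0.25):
    """Return True when the averages differ by at least min_anomaly_level * std."""
    # Assumption: the std is taken over all values in the segment (both sets).
    segment_std = pstdev(list(benchmark) + list(target))
    if segment_std == 0:
        return False  # constant metric: no spread to measure a drift against
    return abs(mean(target) - mean(benchmark)) >= min_anomaly_level * segment_std


train_values = [0.20, 0.22, 0.25, 0.21, 0.24]      # average 0.224
inference_values = [0.30, 0.33, 0.31, 0.35, 0.32]  # average 0.322
print(is_average_drift(train_values, inference_values))
```

Here the averages differ by about 0.098, while 0.25 times the combined standard deviation is roughly 0.013, so this toy example would count as a drift.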
We tell Mona we care about such drifts in any user country or age group using the "segment_by" parameter. We also filter out any segment (e.g., a specific user age group within a specific country) that holds less than 0.5% of the data, using the "min_segment_size" parameter.
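A rough sketch of this segmentation step follows. It assumes events are plain dicts, that age buckets are aligned to multiples of the bucket size, and that the 0.5% is measured against the total number of events. Only the combined country-and-age segments are shown, the function names are hypothetical, and the exact bucketing Mona uses may differ.

```python
from collections import Counter


def segment_key(event, age_bucket_size=5):
    """Segment = (country, lower bound of the age bucket), e.g. ("US", 30)."""
    age_bucket = (event["user_age"] // age_bucket_size) * age_bucket_size
    return (event["user_country"], age_bucket)


def large_enough_segments(events, min_segment_size=0.005):
    """Keep segments holding at least min_segment_size of all events."""
    counts = Counter(segment_key(e) for e in events)
    total = len(events)
    return {seg: n for seg, n in counts.items() if n / total >= min_segment_size}


events = (
    [{"user_country": "US", "user_age": 33}] * 600
    + [{"user_country": "DE", "user_age": 41}] * 398
    + [{"user_country": "FR", "user_age": 27}] * 2  # 0.2% of the data
)
print(large_enough_segments(events))
```

The French segment holds only 2 of 1,000 events, which is below the 0.5% threshold, so it is dropped before any drift comparison takes place.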
We define the target set to be the last 14 days of inference data (using "target_set_period" and "target_baseline_segment").
We further define the benchmark set to be any train/test data, no matter the time. This is done using an empty "benchmark_set_period" together with "avoid_benchmark_target_overlap" set to false (which allows the benchmark set to include data from the latest 14 days), and by filtering on the relevant run types using "benchmark_baseline_segment".
Lastly, we "divide our world" into different model versions using "always_segment_baseline_by", which means that the target set for a given model version is only compared to the train/test data of that same model version.
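Put together, the way the benchmark and target sets are selected can be sketched as follows. This assumes each event carries a `timestamp` (a hypothetical field not shown in the example event) and mirrors the config in spirit only; it is not Mona's implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sets(events, now, target_period=timedelta(days=14)):
    """Group events per model_version into benchmark and target sets."""
    sets = defaultdict(lambda: {"benchmark": [], "target": []})
    for e in events:
        version_sets = sets[e["model_version"]]
        run_type = e["model_run_type"]
        if run_type in ("train", "test"):
            # Empty benchmark period: train/test data of any age is used.
            version_sets["benchmark"].append(e)
        elif run_type == "inference" and now - e["timestamp"] <= target_period:
            # Target: inference data from the last 14 days only.
            version_sets["target"].append(e)
    return dict(sets)


now = datetime(2024, 1, 31)
events = [
    {"model_version": "v5", "model_run_type": "train",
     "timestamp": datetime(2023, 6, 1), "feature_1": 0.2},
    {"model_version": "v5", "model_run_type": "inference",
     "timestamp": datetime(2024, 1, 25), "feature_1": 0.4},
    {"model_version": "v5", "model_run_type": "inference",
     "timestamp": datetime(2023, 12, 1), "feature_1": 0.9},  # older than 14 days
    {"model_version": "v4", "model_run_type": "test",
     "timestamp": datetime(2023, 3, 1), "feature_1": 0.1},
]
for version, s in split_sets(events, now).items():
    print(version, len(s["benchmark"]), len(s["target"]))
```

Each model version ends up with its own benchmark and target lists, so inference data for "v5" is never compared against train/test data for "v4".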
More details on how to use "AverageDrift" typed verses, and all the possible config parameters for them, can be found in the documentation for "AverageDrift" verses.