How to Create and Track Relevant Metrics for Classification Models

Classification problems are a common use case solved by ML models. In this short guide, we show how to easily create relevant metrics for such models and how to get alerted on changes and anomalies in them using Mona's configuration.

Binary Classification

The following example is of a binary classification model designed to detect fraudulent transactions.
The model predicts whether transactions made at specific banks are fraudulent or not.

Let's say that for each transaction the model outputs a score between 0 and 1 (named "transaction_fraud_score"), and that a threshold on this score is used to decide whether to tag the transaction as fraudulent. We also have some metadata for each transaction, such as "transaction_country", "bank_id" and more, as well as several features associated with each transaction.

The following JSON is an example of the data exported to Mona for a single transaction:

{
  "features": {...},
  "transaction_amount": 400,
  "transaction_country": "US",
  "bank_id": "32n54k-65nk3-545cxf",
  "transaction_fraud_score": 0.82
}
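
If you use Mona's Python SDK to export this data, the call could look roughly like the sketch below. This is a minimal sketch: it assumes the mona_sdk Client/MonaSingleMessage interface, and the API key, secret and context id values are placeholders; adjust it to whatever export mechanism you actually use.

# A minimal sketch, assuming Mona's Python SDK (mona_sdk) is used for exporting;
# the API key/secret and the context id value are placeholders.
from mona_sdk.client import Client, MonaSingleMessage

client = Client("YOUR-API-KEY", "YOUR-SECRET")

client.export(MonaSingleMessage(
    contextClass="TRANSACTIONS",          # the context class configured below
    contextId="transaction-12345",        # lets you add follow-up data later
    message={
        "features": {},                   # elided here, as in the example above
        "transaction_amount": 400,
        "transaction_country": "US",
        "bank_id": "32n54k-65nk3-545cxf",
        "transaction_fraud_score": 0.82,
    },
))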

In order to monitor the model and the certainty of its predictions, Mona allows you to use Field Build Functions to build new fields. In this case we can build a new field named "is_fraud_detected" that uses the "range_check" function to determine whether the "transaction_fraud_score" is above a threshold. The arguments of this function define a range from 0.8 (inclusive) and up, meaning that if the "transaction_fraud_score" is 0.8 or higher, the transaction is considered fraudulent.

Below we configure the "fields" section for such a use case.

{
  "YOUR-USER-ID": {
    "TRANSACTION": {
      "fields": {
        "feature_1": {...},
        ...
        "feature_N": {...},
        "transaction_fraud_score": {
          "type": "numeric"
        },
        "is_fraud_detected": {
          "type": "boolean",
          "source": "transaction_fraud_score",
          "function": "range_check",
          "args": [
            {
              "left": {
                "value": 0.8,
                "inclusive": true
              }
            }
          ]
        },
        "bank_id": {
          "type": "string"
        },
        "transaction_amount": {
          "type": "numeric"
        }
      }
    }
  }
}
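
To make the semantics of "range_check" concrete, here is an illustrative Python equivalent of what the "is_fraud_detected" field holds. Mona computes this internally from the configuration above; the snippet is only for clarity.

# Illustrative only: Mona builds "is_fraud_detected" server-side from the
# configuration above; this just spells out the range check it performs.
FRAUD_THRESHOLD = 0.8  # the "left" bound, marked as inclusive

def is_fraud_detected(transaction_fraud_score):
    # True when the score is inside the configured range [0.8, infinity).
    return transaction_fraud_score >= FRAUD_THRESHOLD

assert is_fraud_detected(0.82)        # the example transaction above
assert is_fraud_detected(0.8)         # inclusive lower bound
assert not is_fraud_detected(0.79)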

Now that we have defined the fields, including a specific field for whether the model detected a fraud or not, we can use Mona's "stanzas" configuration to tell it to detect and alert on drifts and outliers in "transaction_fraud_score", giving us visibility into how certain the model is in its predictions.

{
  "YOUR-USER-ID": {
    "TRANSACTIONS": {
      "fields": {...},
      "stanzas": {
        "detected_fraud_anomalies": {
          "baseline_segment": {
            "is_fraud_detected": [
              {
                "value": true
              }
            ]
          },
          "metrics": [
            "transaction_fraud_score"
          ],
          "segment_by": [
            "bank_id"
          ],
          "trend_directions": [
            "desc"
          ],
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005,
              "min_anomaly_level": 0.25,
              "target_set_period": "14d",
              "benchmark_set_period": "46d"
            },
            {
              "type": "AverageOutlier",
              "min_anomaly_level": 0.25,
              "time_period": "4w"
            }
          ]
        }
      }
    }
  }
}

As you can see, we have defined a stanza named "fraud_true_check", which includes two verses: AverageDrift and AverageOutlier.

Given such a configuration, Mona will filter the data to look only at transactions detected as fraudulent (using the "baseline_segment" param). Within this data it will look for (and alert on) any "bank_id" value (using the "segment_by" param) for which the average "transaction_fraud_score" is lower or diminishing (using the "trend_directions" param).

In short, we are looking for banks where the model is less certain of detected frauds.
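
To build intuition for what the AverageDrift verse compares (Mona's actual computation is internal and more involved), here is a rough Python sketch of the per-bank comparison the "fraud_true_check" stanza describes: the average score of detected frauds in a recent target window versus a benchmark window.

# Conceptual sketch only - not Mona's actual algorithm. It shows the kind of
# per-bank comparison the "fraud_true_check" stanza describes: the average
# "transaction_fraud_score" of detected frauds in a target window vs. a
# benchmark window; a negative delta matches the "desc" trend direction.
from statistics import mean

def fraud_score_drift_by_bank(target_records, benchmark_records):
    """Each record is a dict with "bank_id", "transaction_fraud_score"
    and the derived "is_fraud_detected" field."""
    def averages(records):
        by_bank = {}
        for r in records:
            if r["is_fraud_detected"]:  # the stanza's baseline_segment filter
                by_bank.setdefault(r["bank_id"], []).append(
                    r["transaction_fraud_score"])
        return {bank: mean(scores) for bank, scores in by_bank.items()}

    target, benchmark = averages(target_records), averages(benchmark_records)
    return {bank: target[bank] - benchmark[bank]
            for bank in target if bank in benchmark}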

The same can be done to detect banks where the model is less certain of non-fraud predictions, using the opposite filtering and trend_directions:

{
  "YOUR-USER-ID": {
    "TRANSACTIONS": {
      "fields": {...},
      "stanzas": {
        "fraud_true_check": {...},
        "fraud_false_check": {
          "baseline_segment": {
            "is_fraud_detected": [
              {
                "value": false
              }
            ]
          },
          "metrics": [
            "transaction_fraud_score"
          ],
          "segment_by": [
            "bank_id"
          ],
          "trend_directions": [
            "asc"
          ],
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005,
              "min_anomaly_level": 0.25,
              "target_set_period": "14d",
              "benchmark_set_period": "46d"
            },
            {
              "type": "AverageOutlier",
              "min_anomaly_level": 0.25,
              "time_period": "4w"
            }
          ]
        }
      }
    }
  }
}

Mona also allows you to send more data asynchronously and add it to already existing data based on a given context_id.

For example, consider a case where, sometime after the initial transaction data was sent, new "ground_truth" data stating whether the transaction was in fact a fraud is also exported to Mona.
Mona receives this follow-up data for each transaction as a new boolean field named "transaction_was_fraud" and adds it to the initial data, based on the context_id.
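
Assuming the same Python SDK as in the earlier sketch, the follow-up export could look roughly like this; the important part is reusing the same contextId that was sent with the original transaction data (the values here are placeholders).

# A minimal sketch, again assuming mona_sdk; only the new ground-truth field is
# sent, and the contextId ties it back to the original transaction data.
from mona_sdk.client import Client, MonaSingleMessage

client = Client("YOUR-API-KEY", "YOUR-SECRET")

client.export(MonaSingleMessage(
    contextClass="TRANSACTIONS",
    contextId="transaction-12345",              # same id as the original export
    message={"transaction_was_fraud": True},    # the follow-up ground truth
))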

Now that this data is also available, one can easily use Mona to get alerted on increasing rates of false positives/negatives, as seen below. This is done by adding the new "transaction_was_fraud" metric to both stanzas. The stanza below will alert on a diminishing percentage of actual frauds among the transactions the model detected as fraudulent.

{
  "YOUR-USER-ID": {
    "TRANSACTIONS:": {
      "fields": {...},
      "stanzas": {
        "fraud_true_check": {
          "baseline_segment": {
            "is_fraud_detected": [
              {
                "value": true
              }
            ]
          },
          "metrics": [
            "transaction_fraud_score",
            "transaction_was_fraud"
          ],
          "segment_by": [
            "bank_id"
          ],
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005,
              "min_anomaly_level": 0.25,
              "target_set_period": "14d",
              "benchmark_set_period": "46d",
              "trend_directions": [
                "desc"
              ]
            },
            {
              "type": "AverageOutlier",
              "min_anomaly_level": 0.25,
              "time_period": "4w"
            }
          ]
        }
      }
    }
  }
}

Multi-Class Classification

In this example, we will look at a language detection model which determines the language of a text message. In this case, the classification is not binary (true or false), but instead has many possible classes (English, French, Spanish, etc.).

For each text message, the model gives a score for each language it is trying to detect. The language with the highest score is the one "detected" by the model, as long as that score passes a given threshold; if no language scores above the threshold, then no language is detected ("undetermined").

Data about each text message is exported to Mona. Besides the language scores, Mona also receives metadata such as "country_of_text" and "length_of_text".

The following JSON is an example of one model run and includes all the data for a specific text message.

{
  "country_of_text": "FR",
  "length_of_text": 125,
  "language_scores": {"eng": 0.6, "spa": 0.3, "fre": 0.8}
}

Now we can configure the fields to define a monitoring schema for each text message.

The most interesting fields to monitor could be named "model_selected_language" and "model_selected_language_score", which Mona can easily calculate using the "get_top_key" and "get_top_value" field build functions, respectively.

Another field we can create, named "first_to_second_language_delta", holds the delta between the highest language score and the second highest (defined by another field named "model_second_language_score"), in order to determine how certain the model is about the selected language compared to the other languages.

Finally, we can create a new field named "is_language_undetermined" which, given the decision threshold, holds a boolean value stating whether no language class was chosen.

{
  "YOUR-USER-ID": {
    "TEXT": {
      "fields": {
        "country_of_text": {
          "type": "string"
        },
        "length_of_text": {
          "type": "numeric"
        },
        "model_selected_language": {
          "type": "string",
          "source": "language_scores",
          "function": "get_top_key"
        },
        "model_selected_language_score": {
          "type": "numeric",
          "source": "language_scores",
          "function": "get_top_value"
        },
        "model_second_language_score": {
          "type": "numeric",
          "source": "language_scores",
          "function": "get_nth_highest_value",
          "args": [
            1
          ]
        },
        "first_to_second_language_delta": {
          "type": "numeric",
          "sources": [
            "model_selected_language_score",
            "model_second_language_score"
          ],
          "function": "delta"
        },
        "is_language_undetermined": {
          "type": "boolean",
          "source": "model_selected_language_score",
          "function": "range_check",
          "args": [
            {
              "right": {
                "value": 0.75,
                "inclusive": false
              }
            }
          ]
        }
      }
    }
  }
}
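
To clarify what the derived fields above hold, here is an illustrative Python equivalent of the configured field build functions. Again, Mona computes these internally; the snippet only mirrors the configuration.

# Illustrative only: mirrors the field build functions configured above.
UNDETERMINED_THRESHOLD = 0.75  # the non-inclusive "right" bound of the range_check

def derive_fields(language_scores):
    ranked = sorted(language_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_lang, top_score = ranked[0]         # get_top_key / get_top_value
    second_score = ranked[1][1]             # get_nth_highest_value with arg 1
    return {
        "model_selected_language": top_lang,
        "model_selected_language_score": top_score,
        "model_second_language_score": second_score,
        "first_to_second_language_delta": top_score - second_score,     # delta
        "is_language_undetermined": top_score < UNDETERMINED_THRESHOLD, # range_check
    }

print(derive_fields({"eng": 0.6, "spa": 0.3, "fre": 0.8}))
# -> selects "fre" with score 0.8, delta of ~0.2 over "eng", undetermined: False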

With all the fields configured, we can now configure stanzas to search for anomalies in a number of ways.

{
  "YOUR-USER-ID": {
    "TEXT": {
      "fields": {...},
      "stanzas": {
        "decreasing_score": {
          "metrics": [
            "model_selected_language_score",
            "first_to_second_language_delta"
          ],
          "segment_by": [
            "country_of_text",
            "model_selected_language"
          ],
          "trend_directions": [
            "desc"
          ],
          "min_anomaly_level": 0.25,
          "target_set_period": "14d",
          "benchmark_set_period": "46d",
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005
            },
            {
              "type": "AverageOutlier",
              "time_period": "4w"
            }
          ]
        },
        "increasing_failure_rate": {
          "metrics": [
            "language_undefined"
          ],
          "segment_by": [
            "country_of_text",
            "model_selected_language"
          ],
          "trend_directions": [
            "asc"
          ],
          "min_anomaly_level": 0.25,
          "target_set_period": "14d",
          "benchmark_set_period": "46d",
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005
            },
            {
              "type": "AverageOutlier",
              "time_period": "4w"
            }
          ]
        }
      }
    }
  }
}

The first stanza - "decreasing_score" - configures Mona to search for drifts and outliers in a descending direction in the averages of "model_selected_language_score" and "first_to_second_language_delta", segmented by countries and languages, comparing the last 14 days to the 46 days before them.

The second stanza - "increasing_failure_rate" - configures Mona to do the same, but in an ascending direction, on the average of "is_language_undetermined" (i.e. the rate of undetermined classifications).

{
  "YOUR-USER-ID": {
    "TEXT": {
      "fields": {...},
      "stanzas": {
        "decreasing_score": {...},
        "increasing_failure_rate": {...},
        "distribution_changes": {
          "segment_baseline_by": [
            "country_of_text"
          ],
          "segment_by": [
            "model_selected_language"
          ],
          "min_anomaly_level": 0.25,
          "target_set_period": "14d",
          "benchmark_set_period": "46d",
          "verses": [
            {
              "type": "SegmentSizeDrift"
            }
          ]
        }
      }
    }
  }
}

This third stanza - "distribution_changes" - will search for drifts in the segment size of each "model_selected_language" within a baseline segmented by countries. This means Mona will alert us on any specific country in which the ratio of a specific detected language drifts significantly.
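
As a rough illustration of what this stanza looks at (Mona's SegmentSizeDrift computation is internal), the quantity of interest is the share of each detected language within each country, compared between the target and benchmark periods.

# Conceptual sketch only: per-country share of each detected language. Comparing
# these shares between the target and benchmark periods is the kind of change
# the "distribution_changes" stanza surfaces.
from collections import Counter, defaultdict

def language_shares_by_country(records):
    """records: dicts with "country_of_text" and "model_selected_language"."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["country_of_text"]][r["model_selected_language"]] += 1
    return {
        country: {lang: n / sum(langs.values()) for lang, n in langs.items()}
        for country, langs in counts.items()
    }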

Let's assume that in this use case, sometime after the initial data is exported, new "ground_truth" data stating what language was in fact used in the text message is also exported to Mona.
Mona receives this follow-up data for each text message as a new string field named "actual_lang_of_text" and adds it to the initial data, based on the context_id.

With this data we can now check the success of the model by adding a new field named "model_was_correct", which uses the "equality_check" Field Build Function to check whether "model_selected_language" is in fact equal to "actual_lang_of_text":

{
  "YOUR-USER-ID": {
    "TEXT": {
      "fields": {
        ...
        "actual_lang_of_text": {
          "type": "string"
        },
        "model_was_correct": {
          "type": "boolean",
          "function": "equality_check",
          "sources": [
            "model_selected_language",
            "actual_lang_of_text"
          ]
        }
      }
    }
  }
}

Now that this data is also available, one can easily use Mona to get alerted on decreasing rates of correct classifications by adding "model_was_correct" as a new metric in the "decreasing_score" stanza.

{
  "YOUR-USER-ID": {
    "TEXT": {
      "fields": {...},
      "stanzas": {
        "decreasing_score": {
          "metrics": [
            "model_selected_language_score",
            "first_to_second_language_delta",
            "model_was_correct"
          ],
          "segment_by": [
            "country_of_text",
            "model_selected_language"
          ],
          "trend_directions": [
            "desc"
          ],
          "min_anomaly_level": 0.25,
          "target_set_period": "14d",
          "benchmark_set_period": "46d",
          "verses": [
            {
              "type": "AverageDrift",
              "min_segment_size": 0.005
            },
            {
              "type": "AverageOutlier",
              "time_period": "4w"
            }
          ]
        },
        "increasing_failure_rate": {...},
        "distribution_changes": {...}
      }
    }
  }
}