- Extreme Gradient Boosting (XGBoost) is a valuable tool for training machine learning (ML) classifiers, but successive releases of those classifiers often produce surprise false positives (FPs) and false negatives (FNs).
- Surprise FPs consume threat researcher bandwidth and have a negative impact on customer confidence.
- CrowdStrike data scientists have developed a practical XGBoost custom objective function that retains the advantages of XGBoost while delivering more predictable model behavior, which reduces threat researcher cycles lost to surprise FP remediation.
Research is the cornerstone of CrowdStrike’s focus on innovation, and it enables us to stay a step ahead of the most sophisticated adversaries. The work of our dedicated team of researchers and data scientists is reflected in the industry-leading protection delivered by the AI-native CrowdStrike Falcon® platform. This team is not only involved in groundbreaking new developments — it is also constantly exploring ways to make existing cybersecurity technology more effective. This is the case with the newly identified (patent-pending) method for improving how XGBoost is used in ML model training.
CrowdStrike data scientists have identified a method for improving XGBoost classifier consistency between releases. This new XGBoost training method results in more predictable model behavior and less disruption in customer environments when new models are deployed. In addition, threat researchers spend significantly less time remediating the surprise FPs that are a noted downside to successive releases of XGBoost models.
This blog post outlines the current challenge with using XGBoost, the issue of surprise FPs, and the solution of an XGBoost custom objective function.
The Surprise FP Problem
XGBoost is a popular ML algorithm for creating robust, high-accuracy classifiers. It is considered to be one of the leading ML libraries for classification, regression, and ranking problems. With support for parallel processing, XGBoost can train models efficiently and quickly.
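To ground the discussion, the minimal sketch below shows how a generic XGBoost binary classifier is typically trained. The synthetic data, feature counts, and hyperparameters are illustrative placeholders, not CrowdStrike's production configuration.

```python
# Minimal sketch: training a generic XGBoost binary classifier.
# Data and hyperparameters are illustrative placeholders only.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for clean vs. dirty samples.
X, y = make_classification(n_samples=10_000, n_features=40,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",  # standard logistic objective
    "max_depth": 6,
    "eta": 0.1,
    "nthread": 8,                    # parallel tree construction
    "eval_metric": "auc",
}
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dtest, "test")], verbose_eval=50)

# Decision values (DVs): predicted probabilities on a fixed test set.
dvs = model.predict(dtest)
```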
However, there is a complication: Successive releases of XGBoost models — even when trained on the same data — can exhibit significant shuffling of detection probabilities on a fixed ordered test set of portable executable files. This shuffling represents a hidden risk within model deployment because models with a propensity for detection probability — or “decision value” (DV) — shuffling are unpredictable with respect to previously observed behavior. This unpredictability comes in the form of FNs and FPs. The objective leveraged to optimize XGBoost models can be manipulated to improve consistency between model releases, resulting in safer customer environments.
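One way to make this shuffling concrete is to score the same fixed test set with two successive models and compare the orderings of their decision values. The sketch below assumes two already-trained boosters (`model_n`, `model_n1`), a fixed `DMatrix` of known-clean samples (`dtest`), and placeholder thresholds; these names and the metrics used are illustrative, not the measurements CrowdStrike relies on.

```python
# Sketch: quantifying DV shuffling between successive model releases.
# model_n, model_n1 (trained xgboost.Booster objects) and dtest
# (a fixed xgboost.DMatrix of known-clean samples) are hypothetical names.
import numpy as np
from scipy.stats import spearmanr

dv_n = model_n.predict(dtest)    # decision values from model N
dv_n1 = model_n1.predict(dtest)  # decision values from model N+1 on the same samples

# Rank correlation close to 1.0 means the DV ordering is largely preserved;
# lower values indicate more shuffling between releases.
rho, _ = spearmanr(dv_n, dv_n1)
print(f"Spearman rank correlation between releases: {rho:.4f}")

# Known-clean samples below model N's threshold but above model N+1's
# threshold are candidate surprise FPs.
threshold_n, threshold_n1 = 0.90, 0.90  # placeholder thresholds tied to a target FPR
surprise_fps = np.sum((dv_n < threshold_n) & (dv_n1 >= threshold_n1))
print(f"Known-clean samples newly scored dirty: {surprise_fps}")
```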
An organization that deploys ML classifiers to customers must ensure each model release represents an improvement over the previous version. To accomplish this goal, understanding what constitutes an improvement from the customer’s point of view is key. An ML classifier considered in isolation should be optimized according to receiver operating characteristic (ROC) curve behavior near critical thresholds. However, no customer-facing ML classifier exists in isolation, so optimizing the model cannot be confined to maximizing the efficacy of a single release. Rather, developers must frame the optimization problem in terms of customer experience over time and through an ongoing series of model releases. In what follows, we will refer to successive model releases as “model N” and “model N+1.”
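For a single release evaluated in isolation, "ROC behavior near critical thresholds" can be summarized with a partial AUC restricted to the low-FPR region that matters for malware detection. The sketch below uses scikit-learn's `roc_auc_score` with `max_fpr` for that purpose; the variable names (`y_true`, `dvs`) and the 0.1% cutoff are assumptions for illustration.

```python
# Sketch: evaluating ROC behavior near a low-FPR operating region.
# y_true (ground-truth labels) and dvs (model decision values) on a
# held-out test set are hypothetical variable names.
from sklearn.metrics import roc_auc_score

full_auc = roc_auc_score(y_true, dvs)
# Standardized partial AUC restricted to FPR <= 0.1%.
partial_auc = roc_auc_score(y_true, dvs, max_fpr=0.001)
print(f"Full AUC: {full_auc:.4f}, partial AUC (FPR <= 0.1%): {partial_auc:.4f}")
```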
In the security space, endpoint malware detection models serve as the front line of defense against adversaries. Because the base rate of actual malware is low compared to that of benign files, the problem that plagues customers most often is FPs. Security analysts spend too much time investigating FPs. This hurts the SOC’s ability to respond quickly to real threats and contributes to alert fatigue.
FPs can be classified into two categories.
The first category consists of clean samples that are new to the customer environment and that the model falsely classifies as dirty. Minimizing FPs in this first category is a complex problem that can only be addressed by improvements in the subsequent model or by near real-time allowlisting.
The second category consists of surprise FPs. A surprise FP is a clean sample that is already known within the customer environment: the previous model scored it as clean, but the new model falsely scores it as dirty. This latter category is especially pernicious because it wastes valuable threat researcher time and erodes customer confidence in prior releases. Remediating FPs in this category is therefore critical and can be accomplished by first understanding the mathematical root of surprise FPs.
The Root of Surprise FPs
ML binary classifiers typically output a float within a fixed range of values. Model developers then set a threshold to transform a float output by the model into a binary decision. The best method for setting a threshold is tying it to a target FP rate (FPR), as the expected behavior for the model at that threshold can then be quantified. Because of different initial conditions (new data, different hyperparameters, etc.) in model training between releases, successive models can operate differently on the same sample. These differences manifest as different DVs between two successive models for the exact same set of samples.
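A simple way to tie a threshold to a target FPR is to take the appropriate quantile of decision values over a large corpus of known-clean samples, as in the sketch below. The target FPR value and variable names (`clean_dvs`, `dvs`) are placeholders for illustration.

```python
# Sketch: deriving a decision threshold from a target false positive rate.
# clean_dvs: decision values for a large corpus of known-clean samples;
# dvs: decision values for samples being scored (hypothetical names).
import numpy as np

target_fpr = 0.001  # e.g., allow 0.1% of clean samples to exceed the threshold
threshold = np.quantile(clean_dvs, 1.0 - target_fpr)

# Any sample whose DV meets or exceeds the threshold is scored dirty.
is_dirty = dvs >= threshold
```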
DVs that vary from one model to the next are not necessarily a problem if the ordering remains the same. Indeed, because thresholds are set by the target FPR, if all DVs shift down by some small value ε, then the given threshold would shift by the same value, resulting in identical model behavior. However, successive model releases do not typically preserve DV ordering. The far more common shuffling of DVs is what is responsible for surprise FPs.
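The sketch below illustrates this point with synthetic data: a uniform shift of all DVs leaves decisions unchanged once the threshold is re-derived from the target FPR, whereas a re-ordering of DVs changes which clean samples cross the threshold, even though the FPR itself stays fixed.

```python
# Sketch: a uniform DV shift is harmless when thresholds track the target FPR,
# while DV re-ordering (shuffling) produces surprise FPs. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
clean_dvs_n = rng.beta(2, 8, size=100_000)  # model N DVs for known-clean samples
target_fpr = 0.001

def threshold_for(dvs, fpr):
    return np.quantile(dvs, 1.0 - fpr)

flagged_n = clean_dvs_n >= threshold_for(clean_dvs_n, target_fpr)

# Case 1: model N+1 shifts every DV down by a constant; the threshold shifts too,
# so the same samples are flagged.
shifted = clean_dvs_n - 0.05
flagged_shifted = shifted >= threshold_for(shifted, target_fpr)
print("Decisions changed by uniform shift:", np.sum(flagged_n != flagged_shifted))  # expected: 0

# Case 2: model N+1 shuffles the DV ordering (modeled here as added noise);
# the FPR is unchanged, but different clean samples now cross the threshold.
shuffled = clean_dvs_n + rng.normal(0, 0.05, size=clean_dvs_n.shape)
flagged_shuffled = shuffled >= threshold_for(shuffled, target_fpr)
surprise_fps = np.sum(~flagged_n & flagged_shuffled)
print("Previously clean samples newly flagged (surprise FPs):", surprise_fps)
```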
Connecting DV Shuffling to Surprise FPs
DV shuffling can be visualized by considering a DV density plot, as shown in the chart below. The probability density of clean-sample DVs is shown in green, and that of dirty-sample DVs in red. Along the sample space axis are exactly two clean samples, shown as green dots. The leftmost green dot is scored dirty by model N and clean by model N+1, as indicated by the left-oriented blue arrow. The other sample is scored clean by model N and dirty by model N+1, as indicated by the right-oriented blue arrow. The red boundary around the rightmost instances of the green dots indicates a clean sample that is falsely scored as dirty by the given model version.
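Readers who want to reproduce a schematic along these lines can start from the sketch below, which draws clean and dirty DV densities for synthetic data and marks one clean sample whose DV crosses the threshold between releases. It is purely illustrative and does not use CrowdStrike data or the original chart's exact values.

```python
# Sketch: a schematic DV density plot with clean (green) and dirty (red)
# densities and one clean sample whose DV crosses the threshold between
# model N and model N+1. Synthetic data for illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
clean = rng.beta(2, 8, 50_000)   # clean-sample DVs cluster near 0
dirty = rng.beta(8, 2, 50_000)   # dirty-sample DVs cluster near 1

fig, ax = plt.subplots(figsize=(7, 3))
ax.hist(clean, bins=200, density=True, color="green", alpha=0.5, label="clean DV density")
ax.hist(dirty, bins=200, density=True, color="red", alpha=0.5, label="dirty DV density")

threshold = np.quantile(clean, 0.999)  # threshold tied to a target FPR
ax.axvline(threshold, color="black", linestyle="--", label="threshold")

# A clean sample scored clean by model N but dirty by model N+1 (surprise FP).
dv_model_n, dv_model_n1 = threshold - 0.05, threshold + 0.05
ax.annotate("", xy=(dv_model_n1, 0.5), xytext=(dv_model_n, 0.5),
            arrowprops=dict(color="blue", arrowstyle="->"))
ax.plot([dv_model_n, dv_model_n1], [0.5, 0.5], "go")

ax.set_xlabel("decision value (DV)")
ax.set_ylabel("probability density")
ax.legend()
plt.show()
```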