Why “Alerts as Code” is a winning strategy for system maintenance and analysis
While running multiple, independent clouds offers organizations many important benefits such as resiliency, flexibility and scalability, operating such an environment also increases complexity when it comes to monitoring. This is especially true when the various clouds have vastly differing levels of traffic, different service configurations or even different regulatory environments. While the clouds CrowdStrike operates are mostly homogeneous, there is still a significant challenge in maintaining alerts across the various clouds. Our engineering team runs hundreds of microservices in multiple distinct cloud environments. Those services emit dozens of application metrics across dozens of instances, which can be related to API calls, processing latency, event throughput or failures, each of which may be linked to automated alerts. During a recent count, we found over 500 unique application alerts — and that excludes alerts related to infrastructure or the data plane.
The engineering team uses Grafana, a multi-platform open source analytics and interactive visualization web application, to help manage these alerts. However, the alerting framework built into Grafana has several limitations:
- It does not allow the administrator to apply an alert against a template variable.
- There is no way to qualify how those alerts should be applied within a new environment if the clouds are substantially different in terms of size, load or traffic.
- When tuning the alerts, there is no simple way to synchronize the changes across the clouds and avoid overriding existing parameters for the alert.
Identifying a Flexible but Simple Solution
In trying to solve this issue, our team first considered taking the JSON that defines the alerts out of Grafana and putting it into version control. The team then imported all alerts into the new environment. However, we still ran into a problem in which the alerts no longer made sense within the context of the new environment because each cloud requires a different threshold. It also became incredibly time-consuming to synchronize updates across the clouds as we tuned the alerts.

Our team also considered Grafonnet, a Jsonnet library for generating Grafana dashboards. Jsonnet is essentially a templating language for producing JSON. It is very powerful, but it also has a steep learning curve, and there are many scenarios where it is difficult to make the language do exactly what you want. Ultimately, the level of commitment required for the entire engineering team to learn Jsonnet did not justify the benefit for this single use case.
As a result, we decided to use the Grafana tools software development kit (SDK) to generate alerts in code. This approach allowed us to specify alerts in a CrowdStrike-specific JSON format tailored to our needs. We then used additional programming to take that JSON and generate Grafana JSON from it. In doing so, the tooling is able to ascertain which clouds the alerts should be generated in based on the JSON file. We also built in some flexibility within the tooling, so that the alerts would be appropriately applied in cases where we have some variation between clouds. Our best practices for writing an effective alert in a multi-cloud environment are as follows:
- Set thresholds low enough to cover all environments
- Focus on alerts that pertain to failures or latency above some threshold instead of success below a particular threshold
- Use a ratio of failures to successes and alert on a percentage basis (e.g., > n% of API requests failing), as illustrated in the sketch after this list
- In the case of Kafka alerts, use the KafkaBacklogAlert (see below) rather than alerting on a specific numeric lag value
- When applicable, emit metrics based on availability rather than activity — e.g., insufficient available resources in a pool, or insufficient available memory. This is because pool sizes, memory and other resources may be configured differently depending on the cloud. As such, it is easier to alert when the availability approaches zero, as opposed to a configurable maximum.
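To make the ratio-based practice above concrete, here is a minimal sketch, assuming Graphite metrics, of how a failure-percentage query might be composed for a ThresholdAlert. The metric paths and the helper function are hypothetical and not taken from our tooling.

```go
package main

import "fmt"

// failureRatioQuery builds a Graphite query that expresses failures as a
// percentage of total API requests, so a single alert threshold can apply
// across clouds with very different traffic volumes. The metric paths are
// hypothetical examples, not our real metric names.
func failureRatioQuery(service string) string {
	failures := fmt.Sprintf("sumSeries(%s.*.meter.api.failures)", service)
	total := fmt.Sprintf("sumSeries(%s.*.meter.api.requests)", service)
	return fmt.Sprintf("asPercent(%s, %s)", failures, total)
}

func main() {
	// A ThresholdAlert of type "gt" with params [5] on this query would fire
	// when more than 5% of API requests fail, regardless of cloud size.
	fmt.Println(failureRatioQuery("microservice-name"))
}
```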
A Closer Look at “Alerts as Code”
Generating Alerts
In most cases, our alerts are grouped into individual JSON files for each microservice. However, some JSON files may cover a functional area rather than a particular microservice. There are also some JSON files that store information relevant to each individual cloud, such as AWS region, Graphite datasource ID, environment name or other constants. A command line tool was developed to:
- Read the common values and the microservice JSON file
- Generate dashboards in the Grafana JSON format
- Upload dashboards to Grafana
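As an illustration of the last two steps, the sketch below shows one way a generated dashboard could be pushed to a cloud's Grafana instance through Grafana's standard dashboard HTTP API. The URLs, file paths and function names are hypothetical assumptions; our actual tooling builds the dashboards with the Grafana tools SDK types.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// uploadDashboard pushes a generated dashboard to one cloud's Grafana using
// the standard dashboard HTTP API. The dashboard argument holds the
// Grafana-format JSON produced by the tooling; grafanaURL and apiKey would be
// per-cloud values read from the common JSON file.
func uploadDashboard(grafanaURL, apiKey string, dashboard map[string]interface{}) error {
	payload, err := json.Marshal(map[string]interface{}{
		"dashboard": dashboard,
		"overwrite": true, // replace the existing copy so every cloud stays in sync
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, grafanaURL+"/api/dashboards/db", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("grafana returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical usage: read an already generated Grafana dashboard from
	// disk and push it to a single cloud's Grafana instance.
	raw, err := os.ReadFile("generated/microservice-name.dashboard.json")
	if err != nil {
		panic(err)
	}
	var dashboard map[string]interface{}
	if err := json.Unmarshal(raw, &dashboard); err != nil {
		panic(err)
	}
	if err := uploadDashboard("https://grafana.cloud1.example.com", os.Getenv("GRAFANA_API_KEY"), dashboard); err != nil {
		panic(err)
	}
}
```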
Scraping Alerts
To migrate the existing manually built Grafana alerts into the new JSON format, a scraping capability was built into the command line tool. This capability:
- Automates the process of exporting a Grafana dashboard as JSON
- Finds any alerts specified on the dashboard
- Migrates the alerts into our in-house defined alert types
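A rough sketch of the export-and-find steps is shown below. It assumes Grafana's classic alerting, where each alert is embedded under a panel's "alert" key; the URL, dashboard UID and helper name are hypothetical, and a complete scraper would also need to handle panels nested inside rows.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// scrapeAlerts downloads a dashboard from Grafana and returns the classic
// alert definitions embedded in its panels, ready to be translated into our
// in-house alert types.
func scrapeAlerts(grafanaURL, apiKey, uid string) ([]map[string]interface{}, error) {
	req, err := http.NewRequest(http.MethodGet, grafanaURL+"/api/dashboards/uid/"+uid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// The export wraps the dashboard JSON in a "dashboard" envelope.
	var export struct {
		Dashboard struct {
			Panels []map[string]interface{} `json:"panels"`
		} `json:"dashboard"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&export); err != nil {
		return nil, err
	}

	var alerts []map[string]interface{}
	for _, panel := range export.Dashboard.Panels {
		if alert, ok := panel["alert"].(map[string]interface{}); ok {
			alerts = append(alerts, alert)
		}
	}
	return alerts, nil
}

func main() {
	alerts, err := scrapeAlerts("https://grafana.cloud1.example.com", os.Getenv("GRAFANA_API_KEY"), "dashboard-uid")
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d alerts to migrate\n", len(alerts))
}
```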
Alert Files
The following is an example alert file for a particular microservice that features some of our in-house developed alert types. In this case, the alerts will be generated in all environments at the pool level, with the exception of cloud2 and cloud3.pool2.

```json
{
  "name": "microservice-name",
  "scope": "pool",
  "notificationChannels": ["slack_quality", "pagerduty_cloud"],
  "includes": [],
  "excludes": ["cloud2", "cloud3.pool2"],
  "alerts": [
    {
      "alertType": "ThresholdAlert",
      "title": "a series of unfortunate events has occurred!",
      "query": "sumSeries(microservice-name.*.meter.events.unfortunate)",
      "type": "gt",
      "params": [1000],
      "message": "Olaf is a cruel, scheming man. It is recommended that you escape to find shelter with your quirky Uncle Monty. Or restart microservice-name. Up to you."
    },
    {
      "alertType": "KafkaBacklogAlert",
      "title": "Backlog Alert",
      "message": "service is lagging or the kafka consumer is stopped. This will delay or prevent any dependent processing.",
      "consumerGroup": "microservice-consumer-group",
      "epsMetric": "sumSeries(microservice-name.*.events.processed)",
      "backlogThreshold": 600
    }
  ]
}
```
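For illustration, a file like the one above could be modeled and decoded with Go types along these lines. The type and field names are a sketch based on the example, not the exact definitions used by our tooling.

```go
package alerts

import (
	"encoding/json"
	"fmt"
)

// AlertFile mirrors the per-microservice JSON file shown above. The type and
// field names here are illustrative; the real tooling's definitions may differ.
type AlertFile struct {
	Name                 string            `json:"name"`
	Scope                string            `json:"scope"`
	NotificationChannels []string          `json:"notificationChannels"`
	Includes             []string          `json:"includes"`
	Excludes             []string          `json:"excludes"`
	Alerts               []json.RawMessage `json:"alerts"` // decoded per alertType below
}

// alertHeader is decoded first so the tooling can dispatch on alertType
// before unmarshalling the type-specific fields.
type alertHeader struct {
	AlertType string `json:"alertType"`
}

// ThresholdAlert fires when a query crosses a configured threshold.
type ThresholdAlert struct {
	Title   string    `json:"title"`
	Query   string    `json:"query"`
	Type    string    `json:"type"`   // e.g., "gt" or "lt"
	Params  []float64 `json:"params"` // threshold value(s)
	Message string    `json:"message"`
}

// KafkaBacklogAlert fires when a consumer group's backlog grows too large
// relative to the service's processing rate.
type KafkaBacklogAlert struct {
	Title            string  `json:"title"`
	Message          string  `json:"message"`
	ConsumerGroup    string  `json:"consumerGroup"`
	EpsMetric        string  `json:"epsMetric"`
	BacklogThreshold float64 `json:"backlogThreshold"`
}

// DecodeAlerts turns the raw alert entries into concrete alert values.
func DecodeAlerts(file AlertFile) ([]interface{}, error) {
	var decoded []interface{}
	for _, raw := range file.Alerts {
		var head alertHeader
		if err := json.Unmarshal(raw, &head); err != nil {
			return nil, err
		}
		switch head.AlertType {
		case "ThresholdAlert":
			var a ThresholdAlert
			if err := json.Unmarshal(raw, &a); err != nil {
				return nil, err
			}
			decoded = append(decoded, a)
		case "KafkaBacklogAlert":
			var a KafkaBacklogAlert
			if err := json.Unmarshal(raw, &a); err != nil {
				return nil, err
			}
			decoded = append(decoded, a)
		default:
			return nil, fmt.Errorf("unknown alertType %q", head.AlertType)
		}
	}
	return decoded, nil
}
```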
Types of Alerts
- ThresholdAlert. Fires when a metric query crosses a configured threshold, such as a value greater than a given limit, as in the example above.
- RawAlert.
- KafkaBacklogAlert. Alerts when a Kafka consumer group's backlog grows too large relative to the service's processing rate (see the sketch after this list), rather than on a fixed lag value.
- KafkaLagAlert. An early Kafka alert type that uses a hard-coded threshold for lag. Mostly useful in scraping existing alerts, this alert has been largely discontinued in favor of KafkaBacklogAlert for increased flexibility.
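As a rough illustration of why KafkaBacklogAlert scales across clouds, the sketch below expresses backlog as time-to-clear by dividing consumer group lag by the service's events-per-second metric. The lag metric path and the interpretation of BacklogThreshold as seconds are assumptions for illustration, not the exact implementation.

```go
package main

import "fmt"

// backlogQuery sketches how a KafkaBacklogAlert could be evaluated as
// "seconds of backlog": consumer group lag divided by the events-per-second
// metric. The lag metric path is hypothetical, and treating BacklogThreshold
// as seconds is an assumption made for this illustration.
func backlogQuery(consumerGroup, epsMetric string) string {
	lag := fmt.Sprintf("sumSeries(kafka.consumer-group.%s.*.lag)", consumerGroup)
	return fmt.Sprintf("divideSeries(%s, %s)", lag, epsMetric)
}

func main() {
	// Under this interpretation, a backlogThreshold of 600 would mean roughly
	// ten minutes of backlog at the service's current processing rate.
	fmt.Println(backlogQuery("microservice-consumer-group",
		"sumSeries(microservice-name.*.events.processed)"))
}
```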
The Benefits of “Alerts as Code”
Our solution offers several important benefits to our organization:
- Parity. Our solution allows the team to ensure parity across all clouds, even as alerts are modified or augmented. Alerts can now be updated easily in all places at once, and version control provides an opportunity for formal reviews and viewing change history.
- Simplicity and efficiency. Developing our own JSON allowed our team to define alerts in a simplified form. Our team can also take some shortcuts that make it easier to set up an alert, as compared to doing so manually in Grafana. Likewise, if we need to make changes en masse and enforce standards, we can easily make updates with the tooling rather than doing so manually in Grafana.
Limitations/Challenges
While our current solution offers many valuable benefits, we did run into some roadblocks during the implementation. The Grafana tools SDK is built in Golang, while the Grafana JSON itself is highly dynamic and has no complete formal specification (as of this writing, the available documentation is incomplete). This is mostly an issue when unmarshalling Grafana JSON: in many cases Grafana will emit JSON that can't be successfully unmarshalled by the SDK. It also presents challenges when trying to emit working JSON from the SDK, as there is trial and error involved in building a valid dashboard that Grafana interprets correctly.
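As a small illustration of the unmarshalling problem, and one possible coping strategy (not necessarily what our tooling does), the sketch below falls back to a generic map when a typed decode of the dashboard JSON fails.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dashboard stands in for a typed SDK struct (such as the board type in the
// Grafana tools SDK). A typed decode can fail when Grafana emits a field in a
// shape the struct does not expect, for example a value that is sometimes a
// string and sometimes a number.
type dashboard struct {
	Title  string            `json:"title"`
	Panels []json.RawMessage `json:"panels"`
}

// decodeDashboard tries the typed decode first and falls back to a generic
// map so the document can still be inspected or partially processed.
func decodeDashboard(raw []byte) (interface{}, error) {
	var typed dashboard
	if err := json.Unmarshal(raw, &typed); err == nil {
		return typed, nil
	}
	var generic map[string]interface{}
	if err := json.Unmarshal(raw, &generic); err != nil {
		return nil, fmt.Errorf("dashboard is not valid JSON at all: %w", err)
	}
	return generic, nil
}

func main() {
	out, err := decodeDashboard([]byte(`{"title": "example", "panels": []}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", out)
}
```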
Conclusion
“Alerts as Code” is a much more powerful and flexible alerting solution that does not require excessive hand tuning or duplicated work to maintain synchronization across environments. It has proved to be a valuable solution for our team.

How does your organization manage alerts in a multi-cloud environment? Share your thoughts by tagging @CrowdStrike on social media.