Grafana Alerting in a Multi-cloud World

June 16, 2021

Engineering & Tech

Why “Alerts as Code” is a winning strategy for system maintenance and analysis

 

Running multiple, independent clouds offers organizations important benefits such as resiliency, flexibility and scalability, but it also makes monitoring more complex. This is especially true when the clouds have vastly different levels of traffic, different service configurations or even different regulatory environments. While the clouds CrowdStrike operates are mostly homogeneous, maintaining alerts across them is still a significant challenge. Our engineering team runs hundreds of microservices in multiple distinct cloud environments. Those services emit dozens of application metrics across dozens of instances, covering API calls, processing latency, event throughput and failures, and each metric may be linked to automated alerts. During a recent count, we found more than 500 unique application alerts, and that excludes alerts related to infrastructure or the data plane.

 

The engineering team uses Grafana, a multi-platform open source analytics and interactive visualization web application, to help manage these alerts. However, the alerting framework built into Grafana has several limitations:

 

  • It does not allow the administrator to apply an alert against a template variable.
  • There is no way to qualify how those alerts should be applied within a new environment if the clouds are substantially different in terms of size, load or traffic.
  • When tuning the alerts, there is no simple way to synchronize the changes across the clouds and avoid overriding existing parameters for the alert.

In this post, we review the CrowdStrike team’s solution for creating a framework that synchronizes alerts across clouds and also defines them in a way that accounts for different levels of traffic.

Identifying a Flexible but Simple Solution

In trying to solve this issue, our team first considered taking the JSON that defines the alerts out of Grafana and putting it into version control. The team then imported all alerts into the new environment. However, we still ran into a problem in which the alerts no longer made sense within the context of the new environment because each cloud requires a different threshold. It also became incredibly time-consuming to synchronize updates across the clouds as we tuned the alerts.

 

Our team also considered Grafonnet, a Jsonnet library for generating Grafana dashboards. Jsonnet is essentially a templating language for writing JSON that generates JSON. It is very powerful, but it also has a steep learning curve, and there are many scenarios where it is difficult to make the language do exactly what you want. The commitment required for the entire engineering team to learn Jsonnet was not justified by the benefit for this single use case.

 

As a result, we decided to use the Grafana tools software development kit (SDK) to generate alerts in code. This approach allowed us to specify alerts in a CrowdStrike-specific JSON format tailored to our needs. We then wrote tooling that converts that JSON into Grafana JSON. Based on the JSON file, the tooling determines which clouds each alert should be generated in. We also built some flexibility into the tooling so that alerts are applied appropriately where there is some variation between clouds. Our best practices for writing an effective alert in a multi-cloud environment are as follows:

 

  • Set thresholds low enough to cover all environments
  • Focus on alerts that pertain to failures or latency above some threshold instead of success below a particular threshold
  • Use a ratio of failures to successes and alert on a percentage basis (e.g., > n% of API requests failing)
  • In the case of Kafka alerts, use the KafkaBacklogAlert (see below) rather than alerting on a specific numeric lag value
  • When applicable, emit metrics based on availability rather than activity (e.g., insufficient available resources in a pool or insufficient available memory). Pool sizes, memory and other resources may be configured differently depending on the cloud, so it is easier to alert when availability approaches zero than when usage approaches a configurable maximum.
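
To illustrate why a percentage-based threshold carries over between clouds better than an absolute one, here is a minimal Python sketch. The function names and traffic figures are hypothetical and purely illustrative; the percentage is computed against total requests (failures plus successes), which is one reasonable reading of the "> n% of API requests failing" rule above.

```python
def failure_percentage(failures: float, successes: float) -> float:
    """Return failures as a percentage of all requests in the window."""
    total = failures + successes
    if total == 0:
        return 0.0  # no traffic in the window, so nothing has failed
    return 100.0 * failures / total


def should_alert(failures: float, successes: float, max_failure_pct: float) -> bool:
    """Fire when the failure percentage exceeds the shared threshold."""
    return failure_percentage(failures, successes) > max_failure_pct


# Hypothetical request counts for two clouds of very different size.
large_cloud = {"failures": 5_000, "successes": 95_000}
small_cloud = {"failures": 5, "successes": 95}

# The same 2% threshold fires in both clouds, because both fail 5% of requests.
alerts = [should_alert(c["failures"], c["successes"], 2.0)
          for c in (large_cloud, small_cloud)]

# An absolute threshold (more than 1,000 failures) would miss the small cloud.
absolute = [c["failures"] > 1_000 for c in (large_cloud, small_cloud)]
```

Because the threshold is relative to traffic, the same alert definition can be deployed unchanged to every cloud.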

A Closer Look at “Alerts as Code”

Generating Alerts

In most cases, our alerts are grouped into individual JSON files for each microservice. However, some JSON files may cover a functional area rather than a particular microservice. There are also some JSON files that store information relevant to each individual cloud, such as AWS region, Graphite data source ID, environment name or other constants. We developed a command line tool to:

  • Read the common values and the microservice JSON file
  • Generate dashboards in the Grafana JSON format
  • Upload dashboards to Grafana
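
The three steps above can be sketched roughly as follows. This is a simplified Python illustration, not the actual tool: the function names, the "datasource" and "environment" keys of the per-cloud file, and the shape of the generated dashboard are all assumptions, and the output is far smaller than a complete Grafana dashboard. The upload step is represented by a caller-supplied function rather than a real Grafana API call.

```python
import json


def build_dashboard(service: dict, cloud: dict) -> dict:
    """Turn one in-house alert file into a simplified dashboard dict."""
    panels = []
    for panel_id, alert in enumerate(service["alerts"], start=1):
        panels.append({
            "id": panel_id,
            "title": alert["title"],
            "datasource": cloud["datasource"],  # per-cloud constant (assumed key)
            "targets": [{"refId": "A", "target": alert["query"]}],
            "alert": {"name": alert["title"], "message": alert["message"]},
        })
    title = f"{service['name']} alerts ({cloud['environment']})"
    return {"title": title, "panels": panels}


def run(service_text: str, cloud_text: str, upload) -> dict:
    """Read both JSON documents, generate the dashboard, then upload it."""
    dashboard = build_dashboard(json.loads(service_text), json.loads(cloud_text))
    upload(dashboard)  # stand-in for whatever sends the JSON to Grafana
    return dashboard


# Hypothetical inputs: one microservice file and one per-cloud constants file.
service_text = json.dumps({
    "name": "microservice-name",
    "alerts": [{
        "title": "too many failures",
        "query": "sumSeries(microservice-name.*.meter.failures)",
        "message": "Failures are above the expected level.",
    }],
})
cloud_text = json.dumps({"environment": "cloud1", "datasource": "graphite-cloud1"})

uploaded = []  # collects dashboards instead of calling Grafana
dashboard = run(service_text, cloud_text, uploaded.append)
```

Keeping the per-cloud constants in their own file means the same microservice file can be rendered once per cloud without modification.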

Scraping Alerts

To migrate from the existing manually built Grafana alerts into the new JSON format, a scraping capability was built into the command line tool. This capability:

  • Automates the process of exporting a Grafana dashboard as JSON
  • Finds any alerts specified on the dashboard
  • Migrates the alerts into our in-house defined alert types
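
As a rough illustration of the last two steps, the Python sketch below walks an exported dashboard, collects every panel that carries an "alert" object and maps it to one of the alert types described later in this post. It assumes the dashboard JSON layout used by Grafana's built-in dashboard alerts (panel-level "alert", "conditions", "evaluator" and "targets" fields); the mapping rules and function names are hypothetical, not the actual tool's logic.

```python
def iter_panels(dashboard: dict):
    """Yield every panel, including panels nested in rows."""
    for panel in dashboard.get("panels", []):
        yield panel
        yield from panel.get("panels", [])  # collapsed rows nest their panels
    for row in dashboard.get("rows", []):   # older dashboards use top-level rows
        yield from row.get("panels", [])


def find_alert_panels(dashboard: dict) -> list:
    """Return the panels that define an alert."""
    return [p for p in iter_panels(dashboard) if "alert" in p]


def to_inhouse(panel: dict) -> dict:
    """Map a panel alert to a simplified in-house alert (assumed mapping)."""
    alert = panel["alert"]
    conditions = alert.get("conditions", [])
    targets = {t.get("refId"): t.get("target") for t in panel.get("targets", [])}
    if len(conditions) == 1:
        cond = conditions[0]
        ref_id = cond["query"]["params"][0]  # first param names the target
        return {
            "alertType": "ThresholdAlert",
            "title": alert.get("name", ""),
            "query": targets.get(ref_id, ""),
            "type": cond["evaluator"]["type"],
            "params": cond["evaluator"]["params"],
            "message": alert.get("message", ""),
        }
    # Anything more complex is kept verbatim as a RawAlert.
    return {
        "alertType": "RawAlert",
        "title": alert.get("name", ""),
        "conditions": conditions,
        "targets": panel.get("targets", []),
    }


# Hypothetical exported dashboard with one alerting panel and one plain panel.
exported = {"panels": [
    {"title": "plain graph", "targets": []},
    {"title": "failures",
     "targets": [{"refId": "A", "target": "sumSeries(svc.*.failures)"}],
     "alert": {"name": "failures high", "message": "check svc",
               "conditions": [{"evaluator": {"type": "gt", "params": [1000]},
                               "query": {"params": ["A", "5m", "now"]}}]}},
]}
scraped = [to_inhouse(p) for p in find_alert_panels(exported)]
```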

Alert Files

The following is an example alert file for a particular microservice that features some of our in-house developed alert types. In this case, the alerts will be generated in all environments at the pool level with the exception of cloud2 and cloud3.pool2.

{
    "name": "microservice-name",
    "scope": "pool",
    "notificationChannels": ["slack_quality", "pagerduty_cloud"],
    "includes": [],
    "excludes": ["cloud2", "cloud3.pool2"],
    "alerts": [
        {
            "alertType": "ThresholdAlert",
            "title": "a series of unfortunate events has occurred!",
            "query": "sumSeries(microservice-name.*.meter.events.unfortunate)",
            "type": "gt",
            "params": [1000],
            "message": "Olaf is a cruel, scheming man. It is recommended that you escape to find shelter with your quirky Uncle Monty. Or restart microservice-name. Up to you."
        },
        {
            "AlertType": "KafkaBacklogAlert",
            "Title": "Backlog Alert",
            "Message": "service is lagging or the kafka consumer is stopped. This will delay or prevent any dependent processing.",
            "ConsumerGroup": "microservice-consumer-group",
            "EpsMetric": "sumSeries(microservice-name.*.events.processed)",
            "BacklogThreshold": 600
        }
    ]
}
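
One way to interpret the "includes" and "excludes" fields is sketched below in Python. The matching rules are an assumption made for illustration: an empty "includes" list means every environment, and an entry such as "cloud2" covers every pool in that cloud while "cloud3.pool2" covers a single pool. The function names and the list of environments are hypothetical.

```python
def matches(target: str, pattern: str) -> bool:
    """True if the pattern names the target itself or its parent cloud."""
    return target == pattern or target.startswith(pattern + ".")


def select_targets(all_targets: list, includes: list, excludes: list) -> list:
    """Return the cloud.pool targets in which alerts should be generated."""
    selected = []
    for target in all_targets:
        included = not includes or any(matches(target, p) for p in includes)
        excluded = any(matches(target, p) for p in excludes)
        if included and not excluded:
            selected.append(target)
    return selected


# Hypothetical set of pools across three clouds.
all_targets = ["cloud1.pool1", "cloud2.pool1", "cloud3.pool1", "cloud3.pool2"]

# Mirrors the example file: everything except cloud2 and cloud3.pool2.
selected = select_targets(all_targets, includes=[],
                          excludes=["cloud2", "cloud3.pool2"])
```

Under these assumed rules, the example file would generate alerts in cloud1.pool1 and cloud3.pool1 only.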

Types of Alerts

  • ThresholdAlert. The simplest type of alert, the ThresholdAlert takes a single query and a threshold. It supports less than, greater than, within range, outside range and no value.
  • RawAlert. For cases where more flexibility is required, the RawAlert contains a set of conditions and targets in Grafana JSON form. This alert is highly verbose, so it should only be used when absolutely necessary.
  • KafkaBacklogAlert. An alert that can easily be deployed across clouds with differing levels of traffic. The KafkaBacklogAlert divides the current lag by the current events/messages per second, which expresses the accumulated backlog in terms of time rather than as a raw message count.
  • KafkaLagAlert. An early Kafka alert type that uses a hard-coded threshold for lag. Mostly useful when scraping existing alerts, it has largely been retired in favor of the more flexible KafkaBacklogAlert.
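
The difference between the two Kafka alert types can be illustrated with a short Python sketch. The traffic figures are invented, the function name is hypothetical, and treating a backlog threshold of 600 as a number of seconds is an assumption that follows from dividing messages by messages per second.

```python
def backlog_seconds(lag_messages: float, events_per_second: float) -> float:
    """Estimate how long the current lag would take to process."""
    if events_per_second <= 0:
        # A stopped consumer never catches up on a non-empty backlog.
        return float("inf") if lag_messages > 0 else 0.0
    return lag_messages / events_per_second


# Hypothetical consumers in a high-traffic and a low-traffic cloud.
large = {"lag": 6_000_000, "eps": 20_000}   # 300 seconds behind
small = {"lag": 60_000, "eps": 50}          # 1,200 seconds behind

backlog_threshold = 600  # assumed to be seconds
backlog_alerts = [backlog_seconds(c["lag"], c["eps"]) > backlog_threshold
                  for c in (large, small)]

# A KafkaLagAlert-style fixed threshold of 100,000 messages gets it backwards:
# it fires for the healthy large cloud and stays silent for the small one.
lag_alerts = [c["lag"] > 100_000 for c in (large, small)]
```

Here the small cloud is much further behind in time despite having far fewer unprocessed messages, which is why a single backlog threshold can be shared across clouds.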

Our framework includes a set of common fields available for all alert types. These include the alert title and message, as well as overrides for notifications or environments in case a particular alert needs different handling than the rest of the file. Several Grafana parameters can also be overridden with custom values (e.g., "for", "frequency" and "no_data_state").
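
The override behavior described above might be resolved along these lines. This Python sketch is an assumption about how such layering could work, not the framework's actual logic: the default values, the order of precedence (global defaults, then file-level values, then per-alert overrides) and the function name are all hypothetical.

```python
# Hypothetical global defaults for the overridable Grafana parameters.
GLOBAL_DEFAULTS = {"for": "5m", "frequency": "60s", "no_data_state": "no_data"}
FILE_LEVEL_KEYS = ("notificationChannels", "includes", "excludes")


def effective_settings(file_config: dict, alert: dict) -> dict:
    """Layer per-alert overrides on top of file-level values and defaults."""
    settings = dict(GLOBAL_DEFAULTS)
    for key in FILE_LEVEL_KEYS:
        settings[key] = file_config.get(key, [])
    for key in (*GLOBAL_DEFAULTS, *FILE_LEVEL_KEYS):
        if key in alert:  # the alert's own value wins when present
            settings[key] = alert[key]
    return settings


file_config = {
    "notificationChannels": ["slack_quality", "pagerduty_cloud"],
    "excludes": ["cloud2"],
}
# This alert is noisy, so it only notifies Slack and waits longer before firing.
noisy_alert = {"title": "noisy", "notificationChannels": ["slack_quality"],
               "for": "15m"}
quiet_alert = {"title": "quiet"}

noisy = effective_settings(file_config, noisy_alert)
quiet = effective_settings(file_config, quiet_alert)
```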

The Benefits of “Alerts as Code”

Our solution offers several important benefits to our organization:

 

 

  • Parity. Our solution allows the team to ensure parity across all clouds, even as alerts are modified or augmented. Alerts can now be updated easily in all places at once, and version control provides an opportunity for formal reviews and viewing change history.
  • Simplicity and efficiency. Developing our own JSON allowed our team to define alerts in a simplified form. Our team can also take some shortcuts that make it easier to set up an alert, as compared to doing so manually in Grafana. Likewise, if we need to make changes en masse and enforce standards, we can easily make updates with the tooling rather than doing so manually in Grafana.

Limitations/Challenges

While our current solution offers many valuable benefits, we did run into some roadblocks during the implementation. The Grafana tools SDK is written in Go, a statically typed language, while the Grafana JSON itself is highly dynamic and has no complete formal specification (as of this writing, the available documentation is incomplete). This is mostly an issue when unmarshalling Grafana JSON: in many cases, Grafana will emit JSON that the SDK cannot unmarshal successfully. It also presents challenges when trying to emit working JSON from the SDK, as building a valid dashboard that Grafana interprets correctly involves some trial and error.

Conclusion

“Alerts as Code” is a more powerful and flexible approach to alerting than building alerts by hand in Grafana, and it does not require excessive hand tuning or duplicated work to keep environments synchronized. It has proved to be a valuable solution for our team.

 

How does your organization manage alerts in a multi-cloud environment? Share your thoughts by tagging @CrowdStrike on social media.

 
