This blog was originally published March 8, 2022 on humio.com. Humio is a CrowdStrike Company.
Humio recently unveiled the results of its latest benchmark, where the log management platform achieved a new benchmark of 1 petabyte (PB) of streaming log ingestion per day. This benchmark showcases the power of Humio and its ability to scale with customers' growing data-ingestion needs at an industry-leading TCO. To get more insights into the benchmark, I asked five questions to the Humio engineer who conducted the 1PB benchmark, Grant Schofield, Humio Director of Infrastructure Engineering at CrowdStrike.
5 questions with Grant Schofield
1. When we think about Humio, one of the benefits is its modern, index-free architecture that allows for extremely fast searches. How is Humio's index-free architecture different from the approach taken by legacy log management solutions that are index-driven?

Traditional log management platforms index their data. At scale, this leads to the "high-cardinality" problem, resulting in search bottlenecks and the consumption of significant amounts of disk space. Index-free means that Humio is not bound by indexes and input masks that cause slower ingest, high infrastructure requirements (CPU cores, storage) and poor query performance. Humio is a time-series database that makes use of Bloom filters, allowing searches to be extremely efficient and solving the query-performance issue.
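To make the role of Bloom filters concrete, here is a minimal sketch in Python, not Humio's actual implementation: each data segment gets a small probabilistic filter built from its tokens, and a search only scans the segments whose filter says the term is possibly present. The segment names, filter size and hash count below are illustrative assumptions.

```python
# Minimal sketch (not Humio's implementation): a per-segment Bloom filter
# lets a search skip segments that definitely do not contain a term.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term):
        # Derive several bit positions from hashes of the term.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(term))

# Hypothetical segments of raw events; no index is built, only a filter per segment.
segments = {
    "segment-001": ["GET /index", "user=alice"],
    "segment-002": ["POST /login", "user=bob"],
}
filters = {}
for name, events in segments.items():
    bf = BloomFilter()
    for event in events:
        for token in event.split():
            bf.add(token)
    filters[name] = bf

query = "user=bob"
# Only segments that possibly contain the term are brute-force scanned.
to_scan = [name for name, bf in filters.items() if bf.might_contain(query)]
print(to_scan)
```

Because a Bloom filter can return false positives but never false negatives, skipping a segment is always safe, which is what keeps searches fast without maintaining a full index.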
2. What goes into a petabyte benchmark? What are the moving parts that make up such a massive ingest of data?

To perform a petabyte benchmark, we needed to be able to generate 1PB of data. Doing this in a single day required the use of our Humio load testing tool. We also had to assemble a collection of logs that we could use. In this case we used the following log types across four repositories (a simple generator sketch follows the list):
- Random data
- Access logs
- Corelight DNS logs
- VPC flow logs
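As a rough illustration of what a load generator like this does, here is a hedged sketch in Python; the ingest URL, token, repository names, batch size and payload shape are assumptions made for the example, not Humio's benchmark tooling or ingest API.

```python
# Hypothetical load generator: produce synthetic access-log lines and POST
# them in batches to an ingest endpoint. All endpoint details are placeholders.
import json
import random
import time
import urllib.request

INGEST_URL = "https://humio.example.com/ingest"   # hypothetical endpoint
TOKEN = "INGEST-TOKEN"                            # hypothetical ingest token
REPOS = ["random-data", "access-logs", "corelight-dns", "vpc-flow-logs"]

def fake_access_log():
    # Produce one synthetic access-log style line.
    return (f'{random.randint(1, 255)}.0.0.{random.randint(1, 255)} - - '
            f'[{time.strftime("%d/%b/%Y:%H:%M:%S +0000")}] '
            f'"GET /item/{random.randint(1, 10_000)} HTTP/1.1" 200 {random.randint(200, 5000)}')

def send_batch(repo, lines):
    # Batch events into one request to keep per-request overhead low.
    body = json.dumps([{"repository": repo, "messages": lines}]).encode()
    req = urllib.request.Request(
        INGEST_URL,
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    for _ in range(10):
        send_batch(random.choice(REPOS), [fake_access_log() for _ in range(1000)])
```

Sustaining 1PB per day means running many such generators in parallel and tuning batch sizes so the senders, not the ingest pipeline, remain the bottleneck under test.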
logs will matter in the event of a security incident or outage. Log data continues to grow exponentially. With legacy solutions, sooner or later you are going to be forced to accept shrinking retention, slow-performing queries, and a log management solution used only for dashboards and KPI tracking. Petabyte scale allows DevOps/ITOps and SecOps teams to bring all logs into the fold and do more than find answers: to ask more questions. This is why it's important to have a platform where you can quickly provision additional capacity and accommodate higher ingest as you need it, and in a cost-effective way.

5. I've seen some log management solutions that literally take up rows and rows in a datacenter.
Obviously a 45-node setup is not going to take up as much room, perhaps just a portion of a rack. How long would the 45 nodes for a petabyte of data take to set up and deploy?

Provided the infrastructure is in place and the EC2 instances have been provisioned, creating the cluster takes about ten minutes.
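For a sense of what provisioning the EC2 instances could look like, here is a minimal sketch using boto3; the AMI ID, instance type, region and tags are placeholders, not the configuration actually used in the benchmark.

```python
# Hypothetical provisioning step: launch 45 instances to back the cluster.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="i3en.24xlarge",      # placeholder instance type
    MinCount=45,
    MaxCount=45,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "humio-benchmark-node"}],
    }],
)
print([i["InstanceId"] for i in response["Instances"]])
```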