One engineer. One day. One petabyte of log data.

How index-free architecture can move you from rows of racks to a single rack in the datacenter

This blog was originally published March 8, 2022 on humio.com. Humio is a CrowdStrike Company.

Humio recently unveiled the results of its latest benchmark, in which the log management platform reached a new milestone of 1 petabyte (PB) of streaming log ingestion per day. This benchmark showcases the power of Humio and its ability to scale with customers’ growing data-ingestion needs at an industry-leading TCO. To get more insight into the benchmark, I put five questions to the engineer who conducted it: Grant Schofield, Humio Director of Infrastructure Engineering at CrowdStrike.

5 questions with Grant Schofield

1. When we think about Humio, one of the benefits is its modern, index-free architecture that allows for extremely fast searches. How is Humio’s index-free architecture different from the approach taken by the legacy log management solutions that are index-driven?

Traditional log management platforms index their data. At scale, this leads to the “high-cardinality” problem, resulting in search bottlenecks and the consumption of significant amounts of disk space. Index-free means that Humio is not bound by the indexes and input masks that cause slower ingest, high infrastructure requirements (CPU cores, storage) and poor query performance. Humio is a time-series database that uses Bloom filters to make searches extremely efficient, which addresses the query performance issue.

2. What goes into a petabyte benchmark? What are the moving parts that make up such a massive ingest of data?

 

To perform a petabyte benchmark, we needed to be able to generate 1 PB of data. Doing this in a single day required the use of our Humio load testing tool. We also had to get a collection of logs that we could use. In this case, we used the following log types across four repositories:

  • Random data
  • Access logs
  • Corelight DNS logs
  • VPC flow logs

For testing automation, we used an internal project called Megascops. We had to automate the provisioning of Humio clusters; then we used Strix to perform the load test and Asio to run the queries.

3. We’ve looked at presentations and reference architectures for other index-driven solutions that include infrastructure sizes of 400, 500 and even over 1,000 nodes to scale to 100 terabytes per day. You were able to scale to 10 times that amount, to 1 petabyte, with only 45 nodes?

We also used 18 nodes for the Kafka infrastructure, so we had a total of 63 nodes for our petabyte benchmark. But yes, that is quite a bit smaller than most 100 TB/day infrastructures. What happens with index-driven solutions is that you have to write data twice. When data is appended to a dataset, the index also has to be updated. This requires additional storage, and it takes additional memory and cores to accommodate it. When you compound this over hundreds of millions of logs per day, you end up with exponentially larger amounts of infrastructure to support it.

4. DevOps, ITOps and SecOps teams are routinely forced to abandon logs due to limitations around cost, performance and scale. How does petabyte-per-day scale change this?

This is a key point. We don’t want our customers to be in the position of trying to predict which logs will matter in the event of a security incident or outage. Log data continues to grow exponentially. With legacy solutions, sooner or later you are going to be forced to accept shrinking retention and slow-performing queries, and to use your log management solution only for dashboards and KPI tracking. Petabyte scale allows DevOps, ITOps and SecOps to start bringing all logs into the fold, not just to find answers but to ask more questions. This is why it’s important to have a platform on which you can quickly provision additional capacity and accommodate higher ingest as you need it, and in a cost-effective way.

5. I've seen some log management solutions that literally take up rows and rows in a datacenter. Obviously a 45-node setup is not going to take up as much room, perhaps just a portion of a rack. How long would the 45 nodes for a petabyte of data take to set up and deploy?

Provided the infrastructure is in place for it, and the EC2 instances have been provisioned, creating the cluster takes about ten minutes.
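
To put the node counts from question 3 in perspective, here is a back-of-the-envelope calculation of the average ingest rate they imply. It is a rough sketch rather than part of the benchmark itself: it assumes a decimal petabyte (10^15 bytes), perfectly even ingest over 24 hours, and an even split of the load across nodes, none of which the interview states. The function names are made up for illustration.

```python
# Back-of-the-envelope ingest rates implied by the figures in the interview.
# Assumptions (not from the benchmark): 1 PB = 10**15 bytes, ingest is spread
# evenly over 24 hours, and load is split evenly across nodes.

PETABYTE_BYTES = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

HUMIO_NODES = 45  # Humio nodes quoted in the interview
KAFKA_NODES = 18  # Kafka nodes quoted in the interview


def ingest_rate_gb_per_s(total_bytes: int, seconds: int = SECONDS_PER_DAY) -> float:
    """Average cluster-wide ingest rate in decimal gigabytes per second."""
    return total_bytes / seconds / 10**9


def per_node_mb_per_s(total_bytes: int, nodes: int, seconds: int = SECONDS_PER_DAY) -> float:
    """Average ingest rate per node in decimal megabytes per second."""
    return total_bytes / seconds / nodes / 10**6


if __name__ == "__main__":
    total_nodes = HUMIO_NODES + KAFKA_NODES
    print(f"Cluster-wide: {ingest_rate_gb_per_s(PETABYTE_BYTES):.2f} GB/s")
    print(f"Per Humio node ({HUMIO_NODES}): "
          f"{per_node_mb_per_s(PETABYTE_BYTES, HUMIO_NODES):.1f} MB/s")
    print(f"Per node incl. Kafka ({total_nodes}): "
          f"{per_node_mb_per_s(PETABYTE_BYTES, total_nodes):.1f} MB/s")
```

Under those assumptions, the cluster averages roughly 11.6 GB/s, or about 257 MB/s for each of the 45 Humio nodes. Real traffic is bursty, so peak rates during the benchmark may well have differed.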

 
