HDP 2.6 takes a huge step forward toward true data management by introducing SQL-standard ACID Merge to Apache Hive.
As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Implementing updates with hand-written Apache Hive or Apache Spark jobs is extremely complex. Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts, and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems, and those who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates were in progress.
This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.
Hive’s ACID Merge handles all the hard work of Hadoop data maintenance. Because Merge handles inserts, updates and deletes in a single pass, developers don’t need to implement multi-stage update pipelines or develop complex rollback/retry logic. In addition, Hive’s ACID system protects you against write conflicts and isolates readers from in-progress updates, meaning you can update and query data at the same time.
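To make the single-pass idea concrete, here is a minimal upsert sketch using Hive’s MERGE syntax. The table and column names (`customer`, `customer_updates`, `customer_id`, and so on) are illustrative, and the target is assumed to be a transactional (ACID, ORC, bucketed) Hive table:

```sql
-- Hypothetical upsert: update rows that already exist, insert the rest.
-- Target table is assumed to be created with TBLPROPERTIES ('transactional'='true').
MERGE INTO customer AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.name, s.email);
```

A single statement like this replaces what would otherwise be a multi-stage pipeline of joins, filtered inserts, and table rewrites.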
Figure 1: Complexity of the most common data maintenance tasks with and without MERGE
As Figure 1 shows, Merge lets you do the most common data maintenance tasks like upserts, SCD updates and data restatements in just one SQL query, far simpler than the complex and non-scalable approaches required with traditional Hive or Spark pipelines. Merge is 100% compatible with Hive LLAP, which means fast updates and queries are possible on the same engine.
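As an example of an SCD-style update, the sketch below shows a Type 2 pattern in a single MERGE. All table and column names are illustrative; the UNION ALL trick stages each changed customer twice, once with a real join key (so the current row is matched and closed out) and once with a NULL join key (so a new version row falls into the NOT MATCHED branch and is inserted):

```sql
-- Hypothetical SCD Type 2 sketch; customer_dim and customer_stage are assumed names.
MERGE INTO customer_dim AS t
USING (
  -- Pass 1: every staged row, keyed normally, to close changed current rows.
  SELECT s.customer_id AS join_key, s.customer_id, s.address
  FROM customer_stage s
  UNION ALL
  -- Pass 2: changed customers again with a NULL key, to insert new versions.
  SELECT NULL, s.customer_id, s.address
  FROM customer_stage s
  JOIN customer_dim d
    ON s.customer_id = d.customer_id AND d.is_current = true
  WHERE s.address <> d.address
) AS sub
ON t.customer_id = sub.join_key AND t.is_current = true
WHEN MATCHED AND t.address <> sub.address THEN
  UPDATE SET is_current = false, end_date = current_date
WHEN NOT MATCHED THEN
  INSERT VALUES (sub.customer_id, sub.address, true, current_date, CAST(NULL AS DATE));
```

Doing the same thing without MERGE would require separate update and insert stages plus logic to keep them consistent if one of them fails.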
To get a sense of Hive Merge’s scalability, we ran a benchmark based on the standard TPC-H benchmark. TPC-H includes data maintenance via a process it calls Refresh Functions, described in the TPC-H specification document. There are two Refresh processes: a New Sales Refresh, which inserts new records, and an Old Sales Refresh, which removes old data. The TPC-H schema has a table called orders that tracks all sales orders and a table called lineitem that contains up to 7 entries per order. The New Sales Refresh inserts net-new orders and line items; the Old Sales Refresh deletes items based on their primary key.
TPC-H supplies a data generator that produces batches of these inserts and deletes. Our data maintenance approach is to generate a large batch of inserts and deletes, load them into a staging table, and use Hive Merge to perform a Full Table Upsert that applies the inserts and deletes in a single pass. This pattern lines up well with real-world scenarios, where it’s common for an upstream operational system to deliver periodic database dumps every hour or every day.
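The Full Table Upsert described above might be sketched as follows. This is an illustrative reconstruction, not the exact benchmark query: the staging table name (`orders_staging`) and the `is_delete` flag distinguishing Old Sales rows from New Sales rows are assumptions, while the orders columns come from the standard TPC-H schema:

```sql
-- Hypothetical single-pass refresh of the TPC-H orders table:
-- matched rows flagged for deletion are removed, unmatched new rows are inserted.
MERGE INTO orders AS t
USING orders_staging AS s
ON t.o_orderkey = s.o_orderkey
WHEN MATCHED AND s.is_delete = true THEN
  DELETE
WHEN NOT MATCHED AND s.is_delete = false THEN
  INSERT VALUES (s.o_orderkey, s.o_custkey, s.o_orderstatus, s.o_totalprice,
                 s.o_orderdate, s.o_orderpriority, s.o_clerk, s.o_shippriority,
                 s.o_comment);
```

A corresponding statement would handle the lineitem table, keyed on l_orderkey and l_linenumber.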
Figure 2: Merge rate for Full Table Upserts using 25 and 50 GB staging tables.
Note that this benchmark measures only the Merge rate; no analytics queries ran simultaneously while the data was being kept up-to-date. Of course, Merge and queries can be run at the same time; supporting updates is simply a matter of provisioning additional capacity beyond the resources needed for analytics.
Figure 3 below shows the complete details of the cluster used for the test. As you can see, even though the cluster size is quite modest, we can still upsert more than 20 billion records per day.
Figure 3: Cluster details used for the benchmark.