3x Faster Interactive Query with Apache Hive LLAP

Feed: Hortonworks Blog – Hortonworks.
Author: Carter Shanklin.

One of the most exciting new features of HDP 2.6 from Hortonworks was the general availability of Apache Hive with LLAP. If you missed DataWorks Summit you’ll want to look at some of the great LLAP experiences our users shared, including Geisinger who found that Hive LLAP outperforms their traditional EDW for most of their queries, and Comcast who found Hive LLAP is faster than Presto for 75% of benchmark queries.

These great results are thanks to performance and stability improvements Hortonworks made to Hive LLAP resulting in 3x faster interactive query in HDP 2.6. This blog dives into the reasons HDP 2.6 is so much faster. We’ll also take a look at the massive step forward Hive has made in SQL compliance with HDP 2.6, enabling Hive to run all 99 TPC-DS queries with only trivial modifications to the original source queries.

Starting Off: 3x Performance Gains in HDP 2.6 with Hive LLAP

Let’s start out with a summary of runtimes between Hive LLAP on HDP 2.5 versus HDP 2.6, on an identical 9 node cluster (details at the end of the doc), using queries from the TPC-DS suite as used in previous benchmarks. Because of SQL gaps in HDP 2.5 and older versions, these queries had re-writes versus the source TPC-DS queries which are no longer needed in HDP 2.6 and are therefore referred to as “legacy queries”.

Hive Interactive Query

A few key callouts:

HDP 2.6 runs this set of queries in 5155.3 seconds, compared to 14983.4 seconds for HDP 2.5, almost a 3x improvement for HDP 2.6.
HDP 2.6 is faster than HDP 2.5 on almost all queries.
HDP 2.6 is more than 5x faster than HDP 2.5 on 8 queries.

Now let’s dive into the reasons that HDP 2.6 is so much faster than HDP 2.5.

Hive Working Smarter, Not Harder

Hive’s core execution engine and operators are highly optimized, and designed to work end-to-end with the ORCFile columnar format and LLAP’s in-memory cache. Given this level of optimization, the majority of Hive’s performance improvements in HDP 2.6 come from more sophisticated cost-based optimization along with dynamic runtime features that minimize the overall work Hive must do to satisfy a query. You can say that with HDP 2.6, Hive is working smarter and not harder, re-using its own work or avoiding it altogether when possible.

Hive - Key Performance Advancements - HDP 2.6

Hive LLAP Supports the Full Suite of 99 TPC-DS Queries

Past Hive benchmarks have focused on a subset of the TPC-DS queries. In large part this was because older versions of Hive lacked SQL features used by TPC-DS queries. With HDP 2.6 Hive is able to run all 99 TPC-DS queries with only trivial modifications (defined as simple, mechanical rewrites such as changing column names/aliases, adding columns to the select list and other simple transformations.) For reference we’ve posted the full set of trivially-modified queries used in the the remainder of this blog post in the updated Hortonworks testbench repository.

Comparing Apache Hive to Apache Impala

Apache Impala is a SQL-on-Hadoop engine built specifically for interactive query. For this benchmark we compared Hive to Impala using the identical set of 99 trivially-modified TPC-DS queries, which can be found in our GitHub repository.

To summarize the results, the aggregate runtime for all queries is similar across the two engines, but Hive is able to run all 99 queries compared to only 60 in Impala.

Comaring Hive to Impala

This graph breaks down a full comparison of all runtimes, across the 60 queries which could be run in both engines.

Hive versus Impala

When we drill into why Impala is only able to run 60 out of the full suite of 99 queries, this is what we find:

Impala SQL Gaps

These missing features are not obscure SQL features. Instead they are features offered by almost all commercial SQL products and an ever-growing list of open-source SQL tools like Apache Hive. Some of these limitations can be dealt with on a query-by-query basis (such as lack of a date datatype), but such changes are impractical across an entire SQL suite or workload. Other Impala limitations require extremely difficult workarounds, such as missing support for rollups or intersections. Overall, Hive offers much better workload portability if you are considering offloading workloads from a legacy EDW solution.

Hive LLAP Performance Across All 99 Queries

For reference, this graph shows Hive performance across all 99 trivially-modified TPC-DS queries.

Hive Query Timings

Hardware Configuration (Used in All Test Configurations)

9 worker nodes, each with:

256 GB RAM
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
2x HGST HUS726060AL4210 A7J0 disks for HDFS and YARN storage
Cisco VIC 1227 10 Gigabit Network Connection

HDP 2.6 Software Configuration

Ambari-managed HDP 2.6.2 stack
Hive version = 2.1.0.2.6.2.4-4
Tez version = 0.8.4.2.6.2.4-4
LLAP container size: 180GB (per node)
LLAP Heap Size: 128GB (per daemon)
LLAP Cache Size: 32GB (per daemon)
OS Setting: net.core.somaxconn set to 16k, ntpd and nscd running
Data: ORCFile format, scale 10,000, with daily partitions on fact tables (load scripts available in the GitHub repository).
Note: This HDP build will also be available in Hortonworks Data Cloud for AWS as version 1.16.5 within 7 days from this blog publication.

CDH 5.12 Software Configuration

CDH-managed CDH 5.12 stack
Impala version = 2.9
Impala was given all memory on the worker nodes, using all defaults provided by Cloudera Manager.
Data: Parquet format, scale 10,000, with daily partitions on fact tables.

3x Faster Interactive Query with Apache Hive LLAP

Starting Off: 3x Performance Gains in HDP 2.6 with Hive LLAP

Hive Working Smarter, Not Harder

Hive LLAP Supports the Full Suite of 99 TPC-DS Queries

Comparing Apache Hive to Apache Impala

Hive LLAP Performance Across All 99 Queries

Hardware Configuration (Used in All Test Configurations)

9 worker nodes, each with:

HDP 2.6 Software Configuration

CDH 5.12 Software Configuration

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112