Attunity Accelerates Data Loading and Transformation for Data Lakes

Feed: Hortonworks Blog – Hortonworks.
Author: Nadeem Asghar.

Attunity is a long-time Hortonworks partner that provides data optimization and data integration software to help Hortonworks customers address exploding data growth, efficiently manage the performance of BI and data warehouse systems, and realize the tremendous economies of Apache Hadoop®. Attunity solutions are certified on HDF and HDP and are YARN Ready. Together, Hortonworks and Attunity are committed to advancing Hadoop through community-led innovation. This new solution is one more example of that commitment.

Attunity Compose for Hive

By Jordan Martz, Director of Technology Solutions, Attunity

Attunity Compose for Hive automates the data pipeline to create analytics-ready data by leveraging the latest innovations in Hadoop such as the new ACID Merge SQL capabilities, available today in Apache Hive™ (included in HDP 2.6), to automatically and efficiently process data insertions, updates and deletions.

Attunity Compose for Hive was announced at the DataWorks Summit 2017 in San Jose, CA. Itamar Ankorion, Chief Marketing Officer at Attunity, explained: “We help large corporations around the world implement strategic data lake initiatives by making data available in real-time for analytics and enabling them to overcome the inherent challenges associated with building modern data systems. Attunity Compose for Hive directly addresses these challenges to automate the implementation of Hive. It works by eliminating complex and lengthy manual development work for faster and more efficient implementation of analytics-ready data sets.”

How Does Attunity Compose for Hive Work?

Attunity Compose for Hive automates the creation, loading and transformation of data into Hadoop Hive structures. It fully automates the pipeline of business intelligence (BI) ready data into Hive, to create both Operational Data Stores (ODS) and Historical Data Stores (HDS). Attunity Replicate integrates with Attunity Compose to accelerate data ingestion, data landing, SQL schema creation, data transformation and ODS & HDS creation/updates.

With Attunity Compose for Hive, you have:

Real-time data ingestion and landing. Leverage tight integration with Attunity Replicate to ingest data in batch or via change data capture (CDC), then copy that data to an on-premises or cloud target.
Comprehensive automation. Generate Hive schemas automatically for ODS and HDS targets, and apply all necessary data transformations seamlessly.
Continuous, non-disruptive data store updates. Leverage the ANSI SQL compliant ACID MERGE operation to process data insertions, updates and deletions in a single pass.
Transaction consistency. Partition updates by time to ensure each transaction update is processed holistically for maximum consistency.
Improved operational visibility. Support slowly changing dimensions to understand change impact, with a granular history of updates (such as customer address changes) kept within the Historical Data Store.
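
To make the slowly changing dimension idea concrete, here is a minimal Python sketch of how a history of customer address changes could be kept. It is an illustration only, not Attunity's implementation: the `HistoryRow` class, the `apply_change` function and the column names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryRow:
    """One version of a customer record (hypothetical layout)."""
    customer_id: int
    address: str
    valid_from: str            # ISO date on which this version became current
    valid_to: Optional[str]    # None while this version is still current


def apply_change(history: List[HistoryRow], customer_id: int,
                 address: str, changed_on: str) -> None:
    """Close the current version, if any, and append the new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.address == address:
                return  # nothing changed, so keep the current version open
            row.valid_to = changed_on
    history.append(HistoryRow(customer_id, address, changed_on, None))


history: List[HistoryRow] = []
apply_change(history, 1, "12 Oak St", "2017-01-05")
apply_change(history, 1, "98 Elm Ave", "2017-06-13")
```

After the second call, the first row is closed with the date of the change and the second row is the open, current version. An operational store would keep only the latest row, while a historical store keeps both.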

Data Automation to Hive in Five Steps

Step 1: Use Attunity Replicate to ingest data into Hadoop and partition the data

Attunity Replicate transfers data into Hadoop and the HDFS file system in parallelized formats via the WebHDFS and HttpFS protocols or over NFS, and connects to HCatalog via ODBC and HQL scripts. As data is loaded into Hadoop, it is partitioned, which creates the metadata needed to deliver consistent, transactionally verified datasets. Data files are uploaded to HDFS according to the maximum size and time definitions and are stored in a directory under the change table directory. Whenever the specified partition timeframe ends, a partition is created in Hive that points to the HDFS directory.
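
The time-based partitioning described above can be sketched in a few lines of Python. This is a simplified illustration, not how Attunity Replicate is implemented: the function names, the one-hour window and the sample change records are assumptions, and the maximum file size rule is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

EPOCH = datetime(1970, 1, 1)


def partition_start(ts: datetime, window: timedelta) -> datetime:
    """Round a timestamp down to the start of its partition window."""
    windows_elapsed = (ts - EPOCH) // window
    return EPOCH + windows_elapsed * window


def assign_partitions(changes: List[Tuple[datetime, str]],
                      window: timedelta) -> Dict[datetime, List[str]]:
    """Group change records by the time window they fall into."""
    partitions: Dict[datetime, List[str]] = defaultdict(list)
    for ts, payload in changes:
        partitions[partition_start(ts, window)].append(payload)
    return dict(partitions)


def closed_partitions(partitions: Dict[datetime, List[str]],
                      window: timedelta, now: datetime) -> List[datetime]:
    """Return only partitions whose timeframe has ended (safe to expose)."""
    return sorted(start for start in partitions if start + window <= now)


window = timedelta(hours=1)  # hypothetical partition timeframe
changes = [
    (datetime(2017, 6, 13, 10, 5), "update customer 1"),
    (datetime(2017, 6, 13, 10, 40), "insert order 7"),
    (datetime(2017, 6, 13, 11, 10), "delete order 3"),
]
partitions = assign_partitions(changes, window)
ready = closed_partitions(partitions, window, now=datetime(2017, 6, 13, 11, 30))
```

Here only the 10:00 partition is reported as ready, because the 11:00 window is still open at 11:30. Exposing a partition only once its timeframe has ended is what keeps each batch of changes transactionally consistent for downstream readers.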

Step 2: Connect to the Hadoop cluster and configure the CDC and ETL processes

The images below showcase the connections into Hive and into the source database, Northwind, a MySQL instance.

By optionally storing the history of changes through the Manage Metadata -> Save Changes screen, you can choose to design either an Operational or a Historical Data Store.

Step 3: Generate Hive LLAP code for loading data

Attunity Compose considers these key items while generating Hive ETL calls:

Extracting data from the sources (initial load and CDC)
Loading data into landing zone in transactionally consistent data partitions to maintain integrity
Transforming data in the landing zone from sequence to ORC format
Handling ETL for DELETE operations
Scaling to support a large number of sources, tables and truncations, taking parallel processing of tasks into consideration
Managing parallel ETL processes to prevent Hadoop cluster overload

When changes are made in the source system, the data is delivered to the [table]’_delivery’ zone, which serves as the final presentation layer.

Audits are carried throughout the process. Another set of tables holds the audit information per record: the [table]’_landing’ Hive tables, which have change tables and a record of the table’s partitions. The CDC partitions record when changes hit those partitions in ‘attrep_cdc_partitions’.

Reviewing the content shows the latest merged data. The latest updates and merges can be traced through the ‘I’ (Insert) and ‘U’ (Update) statements, and a reconciliation step is appended to the process wherever a delete occurred.
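
As a rough illustration of how change records reconcile into the delivered table, the Python sketch below applies inserts, updates and deletes to an in-memory target in one pass. It only mimics the effect of a merge and is not the Hive code that Attunity Compose generates. The `merge_changes` function and the sample rows are hypothetical, and the ‘D’ code for deletes is an assumption made by analogy with ‘I’ and ‘U’.

```python
from typing import Dict, Iterable, Tuple

# (operation, primary key, column values); 'D' for delete is an assumed code
Change = Tuple[str, int, dict]


def merge_changes(target: Dict[int, dict],
                  changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply change records in order and return the reconciled table."""
    merged = {key: dict(row) for key, row in target.items()}
    for op, key, values in changes:
        if op == "I":
            merged[key] = dict(values)
        elif op == "U":
            # keep existing columns and overwrite the ones that changed
            merged[key] = {**merged.get(key, {}), **values}
        elif op == "D":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown operation code: {op!r}")
    return merged


ods = {1: {"city": "London"}, 2: {"city": "Paris"}}
changes = [
    ("U", 1, {"city": "Leeds"}),
    ("D", 2, {}),
    ("I", 3, {"city": "Oslo"}),
]
delivered = merge_changes(ods, changes)
```

The result holds the updated row 1 and the new row 3, while row 2 is gone. In Hive, the equivalent work is done by a single MERGE statement rather than a row-by-row loop.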

Step 4: Configure the Parallelism and Optimizations needed
To avoid overloading the Hadoop cluster, throttle the run by limiting the number of SQL statements that execute at the same time. In the Manage ETL Set screen, under ETL Commands, open Settings and then Advanced to set the maximum number of concurrent database connections to use.
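
The effect of capping concurrent connections can be shown with a small Python sketch. It is a generic illustration of throttling, not the product's scheduler: the limit of 2, the `run_statement` function and the placeholder statements are assumptions, and the `time.sleep` call stands in for executing a real SQL statement.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_CONNECTIONS = 2  # hypothetical setting

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of statements seen running at once


def run_statement(statement: str) -> str:
    """Pretend to run one statement while tracking concurrency."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        time.sleep(0.01)  # stand-in for the actual statement execution
        return f"done: {statement}"
    finally:
        with _lock:
            _running -= 1


statements = [f"statement {i}" for i in range(6)]

# The pool size is the throttle: no more than this many statements run at once.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_CONNECTIONS) as pool:
    results = list(pool.map(run_statement, statements))
```

All six statements complete, but `peak_running` never exceeds the configured limit. Raising the limit shortens the total run time at the cost of more simultaneous load on the cluster.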