Channel: partitions – Cloud Data Architect

Amazon DynamoDB adaptive capacity now handles imbalanced workloads better by isolating frequently accessed items automatically


Feed: Recent Announcements.

Amazon DynamoDB adaptive capacity now handles imbalanced workloads better by isolating frequently accessed items automatically. If your application drives disproportionately high traffic to one or more items, DynamoDB will rebalance your partitions such that frequently accessed items do not reside on the same partition. This latest enhancement helps you maintain uninterrupted performance for your workloads. In addition, it helps you reduce costs by enabling you to provision throughput capacity more efficiently, instead of overprovisioning to accommodate uneven data access patterns. 


Ibrar Ahmed: Proposal for Global Indexes in PostgreSQL


Feed: Planet PostgreSQL.

A global index, by definition, is a single index on the parent table that maps to many underlying table partitions. The parent table itself does not have a single, unified underlying store, so it must therefore retrieve the data satisfying index constraints from physically distributed tables. In very crude terms, the global index accumulates data in one place so that data spanning multiple partitions is accessed in one go, as opposed to querying each partition individually.

Currently, there is no Global Index implementation available in PostgreSQL, and therefore I want to propose a new feature. I have sent a proposal to the community, and that discussion has now started. In this proposal, I ask for Global Index support just for B-Tree indexes and will consider other index methods later.

Terminologies used

  • Global Indexes

 A one-to-many index, in which one index maps to all the partitioned tables.

  • Partitioned Index (Index Partitioning)

When a global index becomes too large, it can itself be partitioned to keep the performance and maintenance overhead manageable. Partitioned indexes are not within the scope of this work.

  • Local Index

A local index is an index that is local to a specific table partition; i.e., it doesn’t span multiple partitions. So, when we create an index on a parent table, a separate index is created for each of its partitions. PostgreSQL uses the terminology of “partitioned index” when it refers to local indexes. This work will also fix this terminology in PostgreSQL so that the nomenclature remains consistent with other DBMSs.

Why Do We Need Global Index in PostgreSQL?

A global index is expected to bring two very important upgrades to the partitioning feature set in PostgreSQL: a significant improvement in read performance for queries that would otherwise target multiple local indexes of partitions, and the ability to add a unique constraint across partitions.

Unique Constraint

Data uniqueness is a critical requirement for building a unique index. For global indexes that span multiple partitions, uniqueness will have to be enforced on the index column(s) across all partitions. This effectively translates into a cross-partition unique constraint.

Performance

Currently, the pseudo index created on the parent table of partitions does not contain any data. Rather, it dereferences to the local indexes when an index search is required. This means that multiple indexes have to be evaluated and their results combined thereafter. With a global index, however, the data will reside with the global index declared on the parent table. This avoids the need for multi-level index lookups, so read performance is expected to be significantly higher in some cases. There will, however, be a negative performance impact during writes (insert/update) of data. This is discussed in more detail later on.

Creating a Global Index – Syntax

A global index may be created with the addition of a “GLOBAL” keyword to the index statement. Alternatively, one could specify the “LOCAL” keyword to create local indexes on partitions. We are suggesting calling this set of keywords “partition_index_type”. By default, partition_index_type will be set to LOCAL. A sample of the proposed CREATE INDEX syntax is included in the proposal sent to the community.

Pointing Index to Tuple

Currently, a CTID carries page and offset information for a known heap (table). In the context of global indexes, however, this information within an index is insufficient. Since the index is expected to carry tuples from multiple partitions (heaps), a CTID alone cannot link an index entry to a tuple. This requires additional data identifying the heap to be stored with each index entry.
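As a toy illustration of this point (plain Python, not PostgreSQL internals; all names are made up), compare what a local and a global index entry must store to locate a tuple:

# Illustrative only: a toy model of why a global index entry needs more than a CTID.
# A local index can address a tuple with just (block, offset) because the heap is known.
local_index = {
    "alice": (12, 3),           # CTID-like pointer within one partition's heap
}

# A global index spans partitions, so each entry must also record *which* heap
# the pointer refers to, e.g. (partition, block, offset).
global_index = {
    "alice": ("sales_2019", 12, 3),
    "bob":   ("sales_2020", 7, 1),
}

def lookup(key):
    partition, block, offset = global_index[key]
    return f"fetch tuple at block={block}, offset={offset} from partition {partition}"

print(lookup("bob"))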

Optimizer

The challenge for the optimizer is choosing between local and global indexes when both are present. There are many open questions, including how to evaluate the cost of scanning a global index: when should the LOCAL index be preferred over the GLOBAL index, and vice versa?

Write Performance and Vacuum

There will be some write performance degradation because every change in the partition tables must propagate upwards to the GLOBAL index on the parent table. This can be thought of as just another index on a table; however, the [slight] performance degradation will be due to the fact that the GLOBAL index may carry a much bigger dataset, with data from multiple partitions, resulting in higher tree traversal and update times. This applies to both write and vacuum processes.

It is still an open question, though, on how this will be handled within the code and how we can better optimize this process.

Conclusion

As we know, most major DBMS engines that have partitioning support also have support for a Global Index. PostgreSQL has very powerful partitioning support but lacks Global Index support. A Global Index not only ensures uniqueness across partitions but also improves read performance. I have sent the proposal to the PostgreSQL community, and while a discussion has started, it is a slow process. If you are an engineer and want to contribute, respond to that thread in the community. If you are a user and have some use cases, please share them on the same mail thread.

MongoDB Atlas Data Lake Lets Developers Create Value from Rich Modern Data 


Feed: AWS Partner Network (APN) Blog.
Author: AWS Admin.

By Benjamin Flast, Sr. Product Manager, Atlas Data Lake at MongoDB

With the proliferation of cost-effective storage options such as Amazon Simple Storage Service (Amazon S3), there should be no reason you can’t keep your data forever, except that with this much data it can be difficult to create value in a timely and efficient way.

This is why some data lakes have gradually turned into data swamps, providing little utility and accentuating the need for novel ways of analyzing rich multi-structured data created by modern applications, APIs, and devices.

MongoDB’s Atlas Data Lake enables developers to mine their data for insights with more storage options and the speed and agility of the Amazon Web Services (AWS) Cloud.

MongoDB is an AWS Partner Network (APN) Advanced Technology Partner with the AWS Data & Analytics Competency. Our general purpose database platform can unleash the power of software and data for developers and the applications they build. MongoDB is available as a service on AWS via MongoDB Atlas.

MongoDB Cloud Database (Atlas) already addresses diverse online operational and real-time analytics use cases with strict low-latency and geo-distribution requirements. Atlas Data Lake expands MongoDB’s reach into batch and offline workloads, where efficiently processing massive volumes of rapidly changing data is the central problem statement.

Examples include data exploration across historical archives, aggregating data for market research or intelligence products, processing granular Internet of Things (IoT) data as it lands, model training and building, and more.

If you are a developer or data engineer already familiar with the MongoDB Query Language (MQL) and aggregation framework, you can apply the same familiar tools and programming language drivers to tackle a new class of workloads.

Analyze Data Stored in Multiple Formats

Atlas Data Lake allows you to take richly structured data in a variety of formats (including JSON, BSON, CSV, TSV, Avro, and Parquet, with more coming), combine it into logical collections, and query it using MQL.

This allows you to get value from data immediately, without complex and time-consuming transformations, predefining a schema, loading the data into a table, or managing metadata.

Extract, Transform, Load (ETL) is still much of today’s work, and it’s becoming less cost effective to apply this legacy pattern to modern data of diverse shapes and huge volumes. With Atlas Data Lake, you can start exploring your data the minute it’s in Amazon S3 and you’ve configured your stores.

No Infrastructure to Set Up and Manage

Since Atlas Data Lake allows you to bring your own data in its current Amazon S3 structures, there’s no additional infrastructure to manage and no need to ingest your data elsewhere.

As such, we’ve separated compute from storage, which means Atlas Data Lake is an on-demand service where you only pay for what you use. This allows you to seamlessly scale each layer independently as needs change.

To get started, you just need to configure access to your Amazon S3 storage buckets through the intuitive user interface in MongoDB Atlas. Give read-only access if you’re only using the product for queries, or write access if you’d like to output the results of your queries and aggregations back to S3.

Atlas Data Lake is fully integrated with the rest of MongoDB Atlas in terms of billing, monitoring, and user permissioning for additional transparency and operational simplicity.

If you already have a data lake based on S3, or have created one with AWS Lake Formation, you can still use Atlas Data Lake. Simply set up your storage config to describe what’s been created in AWS Lake Formation, and you can start querying.

Parallelize and Optimize Workloads

Atlas Data Lake’s architecture uses multiple compute nodes to analyze each bucket’s data and parallelizes operations by breaking queries down and then dividing the work across multiple agents.

Atlas Data Lake can automatically optimize your workloads by utilizing compute in the region closest to your data. This is useful for data sovereignty-related needs, granting you the ability to specify which region your data needs to be processed in, and giving you a global view of data with greater security.


Figure 1 – Atlas Data Lake architecture diagram.

Leverage a Rich Ecosystem of Tools

Atlas Data Lake also works with existing MongoDB drivers, the MongoDB shell, MongoDB Compass, and the MongoDB BI Connector for SQL-powered business intelligence. Support for visualization with MongoDB Charts is coming soon.

This means your existing applications written in JavaScript, Python, Java, Ruby, Go, or any other programming language can access Atlas Data Lakes just like they access MongoDB today. For data science and machine learning (ML) workloads, you can use tools like R Studio or Jupyter Notebooks with the MongoDB R and Python drivers or the Spark Connector.
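For instance, a Python application can run aggregations against Atlas Data Lake over its connection string using PyMongo; the URI, database, and collection names below are placeholders:

# Minimal sketch using PyMongo; connection string, database, and collection
# names are placeholders for your own Atlas Data Lake values.
from pymongo import MongoClient

client = MongoClient("mongodb://<user>:<password>@<your-data-lake-uri>/?ssl=true")
db = client["sample_db"]

# The same MQL and aggregation framework syntax used against MongoDB works here.
pipeline = [
    {"$match": {"year": 2018}},
    {"$group": {"_id": "$CustomerID", "orders": {"$sum": 1}}},
    {"$sort": {"orders": -1}},
    {"$limit": 5},
]
for doc in db["purchases"].aggregate(pipeline):
    print(doc)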

Enable Data Exploration on Historical Archives

To bring Atlas Data Lake to life, I’d like to give you an example of how powerful it can be when combined with the MongoDB Aggregation Framework.

Imagine your company has just acquired a grocery store to add to its existing network of stores. Your boss tells you they want to run a promotion targeted at customers that have historically purchased the most at the new store.

You’re pointed to the data in your Amazon S3 and your boss asks you to pull the Customer IDs of top buyers across all files. The data files exist in a variety of formats, across several years, and they’d like you to focus on customers making purchases over the prior year. That said, the data looks something like this.


Figure 2 – Sample Amazon S3 bucket data.

Given this request, you decide to use Atlas Data Lake. The first step is to set up a data lake and configuration file.


Figure 3 – Data lake configuration and metrics.

You write your storage configuration to have one collection called ‘purchases’ with a definition that reflects data partitioned by year on Amazon S3. Each partition may contain any number of files of any supported format.

This configuration will enable Atlas Data Lake to only open partitions related to the data your query needs when you utilize the year in the query.


Figure 4 – Storage config example.
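The exact storage configuration schema is documented by MongoDB; as a rough illustration of the idea only (the field names below are assumptions, and the bucket name is a placeholder), a config mapping year-partitioned S3 prefixes to a single ‘purchases’ collection might look something like this, expressed as a Python dictionary:

# Rough illustration only: the field names approximate the idea described above
# and are not the authoritative Atlas Data Lake schema; consult the MongoDB docs.
storage_config = {
    "stores": [
        {"name": "s3store", "provider": "s3",
         "bucket": "acquired-store-data", "region": "us-east-1"}
    ],
    "databases": [
        {
            "name": "sales",
            "collections": [
                {
                    "name": "purchases",
                    # The {year int} segment is what lets the service open only the
                    # S3 prefix (partition) that a query for a given year needs.
                    "dataSources": [
                        {"storeName": "s3store", "path": "/purchases/{year int}/*"}
                    ],
                }
            ],
        }
    ],
}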

Once you’ve set the new storage config, you only need to write up a short aggregation pipeline and run it on your data lake, as shown below.

db.purchases.aggregate(
    [
        {
            /* To only look at data from the year 2018 
            you use a match stage in the aggregation
            which will utilize the storage config to
            only look at the data in the partition for 2018. */
            $match: { year: 2018 }
        },
        {
            /* Next, you group your data by CustomerID
            and add up the product of the UnitPrice
            and Quantity purchased. */
            $group: {
                _id: '$CustomerID',
                totalPurchaseAmount: { $sum: { $multiply: [
                                {$toDecimal: "$UnitPrice"},
                                {$toDecimal: "$Quantity"}
                            ]
                        }
                    },
                /* You also calculate the total quantity
                ordered by the user. */
                totalUnitsOrdered: { $sum: 
                    {$toDecimal: "$Quantity"}
                    }
                }
        },
        { 
            // You then sort your data in descending order. 
            $sort : { totalPurchaseAmount : -1}
        },
        /* And finally, you limit to 6 to pull
        back the top 6 (with the top amount
        being those purchases that are missing
        a Customer ID). */
        { $limit: 6}
    ])

You match documents from the year 2018, group individual transactions by CustomerID, sum each customer’s total quantity ordered and total purchase amount, sort in descending order, and limit to six (unfortunately, not every purchase was tagged with a Customer ID).

Just like that, you’ve got the top five buyers from the chain you’ve just acquired.


Figure 5 – Results of aggregation pipeline query.

This is a simple example of the power of Atlas Data Lake. As demonstrated in this example, it’s designed to allow you to get the most from your data.

Other ways you can use Atlas Data Lake include providing data exploration capabilities across user communities and tools, creating historical data snapshots, enabling feature engineering for machine learning, and more, with the least amount of upfront work and a low investment.

Summary

Atlas Data Lake provides a serverless parallelized compute platform that gives you a powerful and flexible way to analyze and explore your data on Amazon S3.

As you’ve seen in this post, Atlas Data Lake supports a variety of formats and doesn’t require a predefined schema, or any schema evolution over time. By decoupling compute from storage, we’ve enabled you to scale each as needed. Atlas Data Lake plugs into the MongoDB ecosystem, allowing you to get more from your data lake with the tools you already use.

In my example, I covered how to set up a data lake and connect to your S3 bucket. I also demonstrated how you can configure your data lake to efficiently open partitions of data and keep your queries performant. Finally, using the power of the aggregation pipeline, I ran sophisticated analysis on data stored in S3, demonstrating the potential of using existing queries created for MongoDB.

To learn more, check out the MongoDB data lake product page or read our documentation. You can also sign up for Atlas to get your own data lake up and running in just a few minutes.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.



MongoDB – APN Partner Spotlight

MongoDB is an AWS Competency Partner. Their modern, general purpose database platform is designed to unleash the power of software and data for developers and the applications they build.

Contact MongoDB | Solution Overview | AWS Marketplace


What is Map Reduce Programming and How Does it Work


Feed: Featured Blog Posts – Data Science Central.
Author: Divya Singh.

Introduction

Data Science is the study of extracting meaningful insights from data using various tools and techniques for the growth of a business. Although the field has existed since the early days of computing, the recent hype is a result of the huge amount of unstructured data being generated and the unprecedented computational capacity that modern computers possess.

However, there is a lot of misconception about the true meaning of this field, with many of the opinion that it is about predicting future outcomes from data. Though predictive analytics is a part of Data Science, it is certainly not all of what Data Science stands for. In an analytics project, the first and foremost task is to build the pipeline and gather the relevant data so that predictive analytics can be performed later on. The professional responsible for building such ETL pipelines and creating the systems for flawless data flow is the Data Engineer, and this field is known as Data Engineering.

Over the years, the role of Data Engineers has evolved a lot. Previously it was about building Relational Database Management Systems using Structured Query Language or running ETL jobs. These days, the plethora of unstructured data from a multitude of sources has resulted in the advent of Big Data: voluminous data in many different forms which carries a lot of information if mined properly.

Now, the biggest challenge that professionals face is to analyse these huge terabytes of data, which traditional file storage systems are incapable of handling. This problem was resolved by Hadoop, an open-source Apache framework built to process large data in the form of clusters. Hadoop has several components which take care of the data, and one such component is known as Map Reduce.

What is Hadoop?

Created by Doug Cutting and Mike Cafarella in 2006, Hadoop facilitates distributed storage and processing of huge data sets in the form of parallel clusters. HDFS, or Hadoop Distributed File System, is the storage component of Hadoop, where files of different formats can be stored and then processed using Map Reduce programming, which we cover later in this article.

HDFS runs on large clusters and follows a master/slave architecture. The metadata of a file, i.e., information about the position of the file’s blocks in the nodes, is managed by the NameNode, which is the master, while several DataNodes store the actual data. Some of the other components of Hadoop are –

  • Yarn – It manages the resources and performs job scheduling.
  • Hive – It allows users to write SQL-like queries to analyse the data.
  • Sqoop – Used for structured data transfer back and forth between the Hadoop Distributed File System and Relational Database Management Systems.
  • Flume – Similar to Sqoop, but it facilitates the transfer of unstructured and semi-structured data between HDFS and the source.
  • Kafka – A distributed messaging platform used with Hadoop.
  • Mahout – It is used to run Machine Learning algorithms on big data.

Hadoop is a vast topic, and a detailed explanation of each component is beyond the scope of this blog. However, we will dive into one of its components – Map Reduce – and understand how it works.

What is Map Reduce Programming

Map Reduce is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster, i.e., if you write a job using the MapReduce framework and a thousand machines are available, the job can potentially run across those thousand machines.

Big Data is not stored in a traditional way in HDFS. The data gets divided into small blocks which are stored on the respective DataNodes. No complete copy of the data is present in one centralized location, and hence a native client application cannot process the information right away. A framework is therefore needed that can handle the data residing as blocks on the respective DataNodes, move the processing to that data, and bring back the result. In a nutshell, data is processed in parallel, which makes processing faster.

To improve performance and efficiency, the idea of parallelization was developed: the process is automated and executed concurrently. The fragmented instructions can run on a single machine or on different CPUs. To gain direct disk access, multiple computers use a SAN, or Storage Area Network, which is a common type of clustered file system, unlike distributed file systems, which send the data over the network.

One term that is common in this master/slave architecture of data processing is load balancing, where tasks are spread among the processors to avoid overloading any DataNode. Dynamic balancers provide more flexibility than static balancers.

The Map-Reduce algorithm operates in three phases – the Mapper phase, the Sort and Shuffle phase, and the Reducer phase. It was originally designed as an abstraction that let Google engineers perform basic computations while hiding the details of fault tolerance, parallelization, and load balancing.

  • Map Phase – In this stage, the input data is mapped into intermediate key-value pairs on all the mappers assigned to the data.
  • Shuffle and Sort Phase – This phase acts as a bridge between the Map and the Reduce phases to decrease the computation time. The data is shuffled and sorted simultaneously based on the keys, i.e., all intermediate values from the mapper phase are grouped together with respect to their keys and passed on to the reduce function.
  • Reduce Phase – The sorted data is the input to the Reducer, which aggregates the values corresponding to each key and produces the desired output. (A minimal simulation of these three phases is sketched below.)
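To make the three phases concrete, here is a minimal single-process Python simulation of a word count job; it illustrates the programming model only and is not Hadoop code.

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: each mapper emits intermediate (key, value) pairs, here (word, 1).
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Shuffle and sort phase: group all values belonging to the same key together.
grouped = defaultdict(list)
for key, value in sorted(intermediate):
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key to produce the final output.
word_counts = {key: sum(values) for key, values in grouped.items()}
print(word_counts)  # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}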

How Map Reduce works

  • Across multiple machines, the Map invocations are distributed and the input data is automatically partitioned into M pieces of size sixteen to sixty four megabytes per piece. On a cluster of machines, many copies of the program are then started up.
  • Among the copies, one is the master copy while the rest are the slave copies. The master assigns M map and R reduce tasks to the slaves. Any idle worker would be assigned a task by the master.
  • The map task worker would read the contents of the input and pass key-value pairs to the Map function defined by the user. In the memory buffer, the intermediate key-value pairs would be produced.
  • The buffered pairs are periodically written to the local disk. The partitioning function (typically a hash of the key modulo R; see the short sketch after this list) then partitions them into R regions. The master forwards the locations of the buffered key-value pairs to the reduce workers.
  • The buffered data is read by the reduce workers after getting the location from the master. Once it is read, the data is sorted based on the intermediate keys grouping similar occurrences together.
  • The Reduce function defined by the user receives a set of intermediate values corresponding to each unique intermediate key that it encounters. The final output file consists of the appended output from the Reduce function.
  • The user program is woken up by the Master once all the Map and Reduce tasks are completed. In the R output files, the successful MapReduce execution output could be found.
  • The master checks each worker’s aliveness by sending periodic pings. If any worker does not respond to the ping, it is marked as failed after a certain point in time and its previous work is reset.
  • In case of failures, map tasks which were completed are re-executed, as their output would be inaccessible on the failed worker’s local disk. Tasks whose output is stored in the global file system need not be re-executed.
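The partitioning of intermediate pairs into R regions is typically just a hash of the key modulo R, so that equal keys always meet at the same reduce worker; a tiny Python sketch (Python’s built-in hash stands in for the stable hash a real framework would use):

R = 4  # number of reduce tasks / output regions

def partition(key, num_regions):
    # Route an intermediate key to one of R regions; a real framework uses a
    # stable hash so the mapping is identical on every machine.
    return hash(key) % num_regions

for key in ["the", "quick", "brown", "fox"]:
    print(key, "-> reduce region", partition(key, R))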

Some of the examples of Map Reduce programming are –

  • Map Reduce programming can count the frequencies of URL accesses. The map function processes the web page logs and emits a <URL, 1> pair for each access, and the Reduce function adds together all values for the same URL and outputs its total count.
  • Map Reduce programming can also be used to parse documents and count the number of words corresponding to each document.
  • For a given URL, the list of all the associated source URLs that link to it can be obtained with the help of Map Reduce.
  • Map Reduce programming can be used to calculate a per-host term vector. The Map function creates a hostname and term vector pair for each document; the Reduce function merges these per host, removes less frequent terms, and emits a final hostname, term vector pair.

Conclusion

Data Engineering is a key step in any Data Science project, and Map Reduce is undoubtedly an essential part of it. In this article we built a brief intuition about Big Data, provided an overview of Hadoop, explained Map Reduce programming and its workflow, and gave a few real-life applications of Map Reduce programming as well.


Using Kafka to throttle QPS on MySQL shards in bulk write APIs


Feed: Planet MySQL.
Author: Pinterest Engineering.

Qi Li | Software Engineer, Real-time Analytics

At Pinterest, backend core services are in charge of various operations on pins, boards, and users from both Pinners and internal services. While Pinners’ operations are identified as online traffic because of real-time response, internal traffic is identified as offline because processing is asynchronous, and real-time response is not required.

The services’ read and write APIs are shared between both kinds of traffic. The majority of Pinners’ operations on a single object (such as creating a board, saving a Pin, or editing user settings through web or mobile) are routed to one of the APIs to fetch and update data in datastores. Meanwhile, internal services use these APIs to take actions on a large number of objects on behalf of users (such as deactivating spam accounts or removing spam Pins).

To offload internal offline traffic from these APIs so that online traffic can be handled exclusively, with better reliability and performance, the write APIs should support batches of objects. A bulk write platform on top of Kafka is proposed and implemented here. This also ensures the high QPS from internal services is supported more efficiently and without restriction, while guaranteeing high throughput. In this post, we’ll cover the characteristics of internal offline traffic, the challenges we faced, and how we attacked them by building a bulk write platform in the backend core services.

Datastores and write APIs

At Pinterest, MySQL is one major datastore used to store content created by users. To store billions of Pins, boards, and other data for hundreds of millions of Pinners, many MySQL database instances form a MySQL cluster, which is split into logical shards to manage and serve the data more efficiently. All data is split across these shards.

To read and write data efficiently for one user, that user’s data is stored in the same shard so that APIs only need to fetch data from one shard without fan-out queries to various shards. To prevent any single request from occupying MySQL database resources for a long time, every query is configured with a timeout.

All write APIs of the core services were originally built for online traffic from Pinners and work well accepting only a single object, because a Pinner operates on a single object most of the time and the operation is lightweight. Even when Pinners take a bulk action, e.g., moving a number of Pins to a section of one board, the performance is still good because the number of objects isn’t very big and the write APIs can handle them one by one.

Challenges

The situation changes as more and more internal services use existing write APIs for various bulk operations (such as removing many Pins for a spam user within a short period of time or backfilling a new field for a huge number of existing Pins). As write APIs can only handle one object at a time, much higher traffic with spikes is seen in these APIs.

To handle more traffic, autoscaling of the services can be applied, but it does not necessarily solve the problem completely because the capacity of the system is restricted by the MySQL cluster, and with the existing architecture it’s hard to autoscale the MySQL cluster itself.

To protect the services and MySQL cluster, rate limiting is applied to write APIs.

Although throttling can help to some extent, it has several drawbacks that prevent backend core services from being more reliable and scalable.

  • Online and offline traffic to an API affect each other. If a spike of internal offline traffic happens, online traffic to the same API is affected with higher latency and degraded performance, which impacts the user experience.
  • As more and more internal traffic is sent to the API, the rate limits need to be bumped up carefully so that the APIs can serve more traffic without affecting existing traffic.
  • Rate limiting does not stop hot shards. When internal services write data for a specific user, e.g., ingesting a large number of feed Pins for a partner, all requests target the same shard. A hot shard is expected because of the spike of requests in a short period of time. The situation gets worse when update operations in MySQL are expensive.
  • As internal services need to handle a big number of objects within a short period of time and do not need a real-time response, requests that target the same shard can be combined together and handled asynchronously with one shared query to MySQL, to improve efficiency and save MySQL connection bandwidth. All combined batch requests should be processed at a controlled rate to avoid hot shards.

    Bulk write architecture

    The bulk write platform was architected to support high QPS for internal services with high throughput and zero hot shards. Also, migrating to the platform should be straightforward: clients simply call the new APIs.

    Bulk write APIs and Proxy

    To support write (update, delete, and create) operations on a batch of objects, a set of bulk write APIs is provided for internal services, which accepts a list of objects instead of a single object. This helps reduce QPS to the APIs dramatically compared to the regular write APIs.

    Proxy is a Finagle service that maps incoming requests to different batching modules, which combine requests to the same shard together, according to the type of object.

    Batching Module

    The batching module splits a batch request into small batches based on the operation type and object type, so that one batch of objects can be processed efficiently in MySQL, which has a timeout configured for each query.

    This was designed for two major considerations:

    • Firstly, the write rate to every shard should be configured to avoid hot shards, as shards may contain different numbers of records and perform differently. One batch request from the proxy contains objects on different shards. To control QPS accurately at the shard level, the batch request is split into batches based on the target shards. The ‘Shard Batching’ module splits requests by affected MySQL shards.
    • Secondly, each write operation has its own batch size. Operations on different object types have different performance because they update a different number of tables. For instance, creating a new Pin may change four to five different tables, while updating an existing Pin may change only two tables. Also, an update query may take a varying length of time, so a batch update for one object type may experience different latencies for different batch sizes. To make batch updates efficient, the batch size is configured differently for various write operations. ‘Operation Batching’ further splits these requests by type of operation. (A rough sketch of this two-level batching follows below.)
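A rough Python sketch of the two-level batching described above; the shard-routing function and batch sizes are made up for illustration:

from collections import defaultdict

# Hypothetical per-operation batch sizes: creates touch more tables than
# updates/deletes, so they get a smaller batch size.
BATCH_SIZE = {"create": 20, "update": 50, "delete": 50}

def shard_for(user_id, num_shards=4096):
    # Stand-in for the real shard-routing logic (one user's data lives on one shard).
    return user_id % num_shards

def batch_requests(requests):
    """Group requests by (shard, operation), then chunk each group by batch size."""
    groups = defaultdict(list)
    for req in requests:
        groups[(shard_for(req["user_id"]), req["op"])].append(req)   # shard batching
    for (shard, op), reqs in groups.items():
        size = BATCH_SIZE[op]                                        # operation batching
        for i in range(0, len(reqs), size):
            yield shard, op, reqs[i:i + size]

requests = [{"user_id": u, "op": "delete", "pin_id": p} for u in (7, 7, 9) for p in range(3)]
for shard, op, batch in batch_requests(requests):
    print(shard, op, len(batch))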

    Rate Limiter with Kafka

    All objects in a batch request from the batching module are on the same shard. A hot shard is expected if too many requests hit one specific shard; a hot shard affects all other queries to the same shard and degrades the performance of the system. To avoid this issue, all requests to one shard should be sent at a controlled rate, so the shard is not overwhelmed and can handle requests efficiently. To achieve this goal, one rate limiter is needed for every shard, and it controls all requests to that shard.

    To support high QPS from internal clients at the same time, all their requests should be stored temporarily in the platform and processed at a controlled speed. This is where Kafka is a good fit:

  • Kafka can handle very high QPS for writes and reads.
  • Kafka is a reliable distributed message storage system that buffers batch requests so that they can be processed at a controlled rate.
  • Kafka handles rebalancing of load and manages consumers automatically.
  • Each partition is assigned to one consumer exclusively (within the same consumer group), and that consumer can process requests with good rate limiting.
  • Requests in all partitions are processed by different consumer processors simultaneously, so throughput is very high.

  (In the architecture diagram, P denotes a Kafka partition and C a consumer processor.)

    Kafka Configuration

    Firstly, each shard in the MySQL cluster has a matching partition in Kafka, so that all requests to that shard are published to the corresponding partition and processed by one dedicated consumer processor with accurate QPS control. Secondly, a large number of consumer processors are run so that at most one or two partitions are assigned to one consumer processor, to achieve maximum throughput.
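A rough sketch of this shard-to-partition routing on the producer side (the kafka-python client is an assumption here; broker address, topic, and payload are placeholders):

# Sketch only: publish each batch request to the Kafka partition that matches
# its MySQL shard, so one consumer processor handles all traffic for that shard.
import json
from kafka import KafkaProducer  # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],          # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_batch(shard_id, batch):
    # Target the partition whose number equals the MySQL shard number.
    producer.send("bulk-write-requests", value=batch, partition=shard_id)

publish_batch(7, {"op": "delete", "objects": [101, 102, 103]})
producer.flush()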

    Consumer Processor

    The Consumer processor does rate-limiting of QPS on a shard with two steps:

    • Firstly, the number of requests that a consumer can pull from its partition at a time is configured.
    • Secondly, the consumer consults the per-shard configuration to get the precise number of batch requests that the shard can handle and uses the Guava RateLimiter to do rate control. For instance, some shards may only be able to handle low traffic because hot users are stored on those shards. (A rough sketch of such a consumer loop follows below.)
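A rough sketch of a consumer processor’s rate-limited loop (the post uses Guava’s RateLimiter, presumably from Java; the simple token-bucket stand-in and kafka-python usage below are illustrative assumptions):

# Sketch only: a consumer processor pinned to one partition (= one MySQL shard),
# draining it at that shard's configured QPS with a simple token-bucket limiter.
import time
from kafka import KafkaConsumer, TopicPartition  # assumption: kafka-python client

SHARD_QPS = {7: 50, 8: 20}   # hypothetical per-shard limits (e.g. shard 8 holds hot users)

def process_batch_against_mysql(batch):
    pass  # placeholder for the real batched MySQL write

def consume_shard(shard_id, topic="bulk-write-requests"):
    consumer = KafkaConsumer(
        bootstrap_servers=["kafka-broker:9092"],   # placeholder broker address
        group_id="bulk-write-consumers",
        max_poll_records=20,                       # step 1: cap records pulled at a time
    )
    consumer.assign([TopicPartition(topic, shard_id)])  # partition number == shard number

    qps = SHARD_QPS.get(shard_id, 10)
    tokens, last = qps, time.time()
    for message in consumer:
        now = time.time()
        tokens = min(qps, tokens + (now - last) * qps)  # step 2: refill at the shard's rate
        last = now
        if tokens < 1:
            time.sleep((1 - tokens) / qps)
            tokens = 1
        tokens -= 1
        process_batch_against_mysql(message.value)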

    Consumer processors can handle different failures with appropriate actions. To handle congestion in the threadpool, the consumer processor retries the task with a configured back-off time if the threadpool is full and busy with existing tasks. To handle failures in MySQL shards, it checks the response from the MySQL cluster to catch errors and exceptions and takes the appropriate action for each kind of failure. For instance, when it sees two consecutive timeout failures, it sends an alert to the system admin and stops pulling and processing requests for a configured wait time. With these mechanisms, the success rate of request processing is high.

    Results

    Several internal use cases have been launched on the bulk write platform with good performance. For instance, feed ingestion for partners uses the platform, and many improvements have been observed in both the time spent and the success rate of the process. The results of ingesting around 4.3 million Pins are shown in the original post.

    Also, hot shards, which previously caused a lot of issues, are no longer seen during feed ingestion.

    What’s next

    As more internal traffic is separated from the existing write APIs onto the new bulk write APIs, the performance of the APIs for online traffic is improving, with less downtime and lower latency. This helps make the systems more reliable and efficient.

    The next step for the new platform is to support more cases by extending existing operations on more object types.

    Acknowledgments

    Thanks to Kapil Bajaj, Carlo De Guzman, Zhihuang Chen and the rest of the Core Services team at Pinterest! Also special thanks to Brian Pin, Sam Meder from the Shopping Infra team for providing support.


    Using Kafka to throttle QPS on MySQL shards in bulk write APIs was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Beena Emerson: Investigating bulk load operation in partitioned tables


    Feed: Planet PostgreSQL.

    Partitioning introduced in PostgreSQL 10 increases performance in certain scenarios. In PostgreSQL 13, pgbench was extended to run benchmark tests for range and hash partitions by partitioning the pgbench_accounts table according to the parameters provided. Since pgbench_accounts is the largest table, the bulkload command COPY is used to populate it and pgbench logs the time taken to insert in that table. This blog will explore how this operation is affected by table partitioning. 
    Tests were performed for partition counts 200, 400, 600 and 800 on pgbench scales varying from 500 (5 GB) to 5000 (50GB) for both the range and hash partition type with the following settings:
    • pgbench thread/client count = 32
    • shared_buffers = 32GB
    • min_wal_size = 15GB
    • max_wal_size = 20GB
    • checkpoint_timeout=900
    • maintenance_work_mem=1GB
    • checkpoint_completion_target=0.9
    • synchronous_commit=on
    The hardware specification of the machine on which the benchmarking was performed is as follows:
    • IBM POWER8 Server
    • Red Hat Enterprise Linux Server release 7.1 (Maipo) (with kernel Linux 3.10.0-229.14.1.ael7b.ppc64le)
    • 491GB RAM
    • IBM,8286-42A CPUs (24 cores, 192 with HT)

    Range Partitions

    The red dotted lines show the performance for an unpartitioned table across different data sizes. It is evident that the range partitioned table takes slightly longer time but increasing the partition count hardly influences the load time.     

    Hash Partitions

    The amount of time taken by a hash partitioned table with the lowest partition count for lowest data size (200 parts, 5GB) is 60% more than that of the unpartitioned table and the time taken for the hash partitioned table with the highest partition count and largest size (800 parts, 50 GB) is 180% more than that of the unpartitioned table. It is obvious that the number of partitions has heavily impacted the load time. 

    Combination graphs

    Here are a few graphs that merge the results of the two partition types, range, and hash, to distinctly show the difference between the two types.  

    The above graph displays how the range and the hash partitioned tables with 400 partitions compare against a non-partitioned one for different data sizes. There is a general upward trend as the data size to be loaded increases in all cases, which is expected: the larger the data, the longer the time taken. The range-partitioned table takes about 20-25% more time than the unpartitioned table, while the hash-partitioned table shows a steep, substantial increase in load time as the data size increases.
    This graph presents how the range and hash partitioned table compare against a non-partitioned one for different partition counts at a data size of 50 GB. Here again, the range partition shows a steady 20-25% increase but the hash partitioned table exhibits a more dramatic change as the partition count increases. 

    Explaining the behavior

    Though there is an expected influence of size on the load time, the hash partitioned table is also profoundly impacted by the amount of partitions the table has. 
    To decipher this, let us look at how pgbench partitions and inserts data into the table. The pgbench_accounts table is partitioned on the column aid, which is called the partition key; aid values are generated in sequence from 1 to pgbench_scale * 100000 and inserted using the COPY command. In COPY, the tuple routing parameters are set up for the partitioned table and then the data is inserted. A range-partitioned table splits data based on the value, such that each partition holds all tuples with aid values that fall within its range, while a hash-partitioned table applies a modulo operator to the value being inserted, and the remainder determines the partition into which the value will be inserted.
    Since pgbench copies data ordered by the partition key, aid, the range-partitioned insertion behaves like the ideal case: the data of one partition is fully inserted before moving on to the next partition. For the hash-partitioned table, the first value is inserted in the first partition, the second value in the second partition, and so on until all the partitions are covered, then it loops back to the first partition again until all the data is exhausted. This is the worst-case scenario, where the target partition switches for every value inserted. As a result, the number of partition switches in a range-partitioned table is equal to the number of partitions, which in this test is not more than 800, while in hash partitioning the number of switches is equal to the amount of data being inserted, which is more than 5 lakh (500,000). This causes a massive difference in timing for the two partition types.
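The effect is easy to see with a small Python simulation that counts how often the target partition changes for key-ordered input under range versus hash routing (an illustration only; the real routing happens inside PostgreSQL’s COPY):

# Count how often the target partition changes when rows arrive ordered by the
# partition key (as pgbench's COPY does), for range vs. hash routing.
ROWS = 100_000          # stand-in for pgbench_scale * 100000
PARTS = 800

def range_partition(aid):
    return (aid - 1) * PARTS // ROWS        # contiguous aid ranges per partition

def hash_partition(aid):
    return aid % PARTS                      # stand-in for PostgreSQL's hash + modulo

def switches(route):
    prev, count = None, 0
    for aid in range(1, ROWS + 1):
        part = route(aid)
        if part != prev:
            count += 1
            prev = part
    return count

print("range switches:", switches(range_partition))  # ~number of partitions (800)
print("hash switches: ", switches(hash_partition))   # ~number of rows (100000)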

    Conclusion

    When using COPY for data ordered on the partition key column, no matter the size or the number of partitions, the operation would take about 20-25% more time than an unpartitioned table. If the data being copied is unordered with respect to the partition key then the time taken will depend on how often the partition has to be switched between insertions.
    Hence, to speed up bulk-load COPY operation, it is advisable to have data sorted according to the partition key of the table so that there is minimal partition switching ensuring lower operation time.

    Simplify ETL data pipelines using Amazon Athena’s federated queries and user-defined functions


    Feed: AWS Big Data Blog.

    Amazon Athena recently added support for federated queries and user-defined functions (UDFs), both in Preview. See Query any data source with Amazon Athena’s new federated query for more details. Jornaya helps marketers intelligently connect consumers who are in the market for major life purchases such as homes, mortgages, cars, insurance, and education.

    Jornaya collects data from a variety of data sources. Our main challenge is to clean and ingest this data into Amazon S3 to enable access for data analysts and data scientists.

    Legacy ETL and analytics solution

    In 2012, Jornaya moved from MySQL to Amazon DynamoDB for our real-time database. DynamoDB allowed a company of our size to receive the benefits of create, read, update, and delete (CRUD) operations with predictable low latency, high availability, and excellent data durability without the administrative burden of having to manage the database. This allowed our technology team to focus on solving business problems and rapidly building new products that we could bring to market.

    Running analytical queries on NoSQL databases can be tough. We decided to extract data from DynamoDB and run queries on it. This was not simple.

    Here are a few methods we use at Jornaya to get data from DynamoDB:

    • Leveraging EMR: We temporarily provision additional read capacity on DynamoDB tables and create transient EMR clusters to read data from DynamoDB and write it to Amazon S3.
      • Our Jenkins jobs trigger pipelines that spin up a cluster, extract data using EMR, and use the Amazon Redshift COPY command to load data into Amazon Redshift. This is an expensive process because we use excess read capacity. To lower EMR costs, we use Spot Instances.
    • Enabling DynamoDB Streams: We use a homegrown Python AWS Lambda function named Dynahose to consume data from the stream and write it to an Amazon Kinesis Data Firehose delivery stream. We then configure the Kinesis Data Firehose delivery stream to write the data to an Amazon S3 location. Finally, we use another homegrown Python Lambda function named Partition to ensure that the partitions corresponding to the locations of the data written to Amazon S3 are added to the AWS Glue Data Catalog, so that the data can be read using tools like AWS Glue, Amazon Redshift Spectrum, EMR, etc.

    The process is shown in the following diagram.

    We go through such pipelines because we want to ask questions about our operational data in a natural way, using SQL.

    Using Amazon Athena to simplify ETL workflows and enable quicker analytics

    Athena, a fully managed serverless interactive service for querying data in Amazon S3 using SQL, has been rapidly adopted by multiple departments across our organization. For our use case, we did not require an always-on EMR cluster waiting for an analytics query. Athena’s serverless nature is perfect for our use case. Along the way we discovered that we could use Athena to run extract, transform, and load (ETL) jobs.

    However, Athena is a lot more than an interactive service for querying data in Amazon S3. We also found Athena to be a robust, powerful, reliable, scalable, and cost-effective ETL tool. The ability to schedule SQL statements, along with support for Create Table As Select (CTAS) and INSERT INTO statements, helped us accelerate our ETL workloads.

    Before Athena, business users in our organization had to rely on engineering resources to build pipelines. The release of Athena changed that in a big way. Athena enabled software engineers and data scientists to work with data that would have otherwise been inaccessible or required help from data engineers.

    With the addition of query federation and UDFs to Athena, Jornaya has been able to replace many of our unstable data pipelines with Athena to extract and transform data from DynamoDB and write it to Amazon S3. The product and engineering teams at Jornaya noticed our reduced ETL failure rates. The finance department took note of lower EMR and DynamoDB costs. And the members of our on-call rotation (as well as their spouses) have been able to enjoy uninterrupted sleep.

    For instance, the build history of one ETL pipeline using EMR looked like this (the history of ETL pipeline executions is shown in this chart with the job execution id on the x-axis and the execution time in minutes on the y-axis):

    After migrating this pipeline to Athena and using federated queries to query DynamoDB, we were able to easily access data sources that we simply could not access previously, with queries like the following:

    CREATE TABLE "__TABLE_NAME__"
    WITH (
      external_location = '__S3_LOCATION__'
    , format = 'PARQUET'
    , parquet_compression = 'SNAPPY'
    , partitioned_by = ARRAY['create_day']
    ) AS
    SELECT DISTINCT
      d.key.s AS device_id
    , CAST(d.created.n AS DECIMAL(14, 4)) AS created
    , d.token.s AS token
    , c.industry AS industry_code
    , CAST(CAST(FROM_UNIXTIME(CAST(d.created.n AS DECIMAL(14, 4))) AS DATE) AS VARCHAR) AS create_day
    FROM "rdl"."device_frequency_stream" d
      LEFT OUTER JOIN "lambda::dynamodb"."default"."campaigns" c ON c.key = d.campaign_key
    WHERE d.insert_ts BETWEEN TIMESTAMP '__PARTITION_START_DATE__' AND TIMESTAMP '__PARTITION_END_DATE__'
      AND d.created.n >= CAST(CAST(TO_UNIXTIME(DATE '__START_DATE__') AS DECIMAL(14, 4)) AS VARCHAR)
      AND d.created.n < CAST(CAST(TO_UNIXTIME(DATE '__END_DATE__') AS DECIMAL(14, 4)) AS VARCHAR);
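Such a statement can then be submitted on a schedule with a few lines of boto3; in the sketch below the result bucket, workgroup, and the SQL string are placeholders:

# Minimal sketch: submit the (already templated) CTAS statement to Athena and
# wait for it to finish. Bucket, workgroup, and the SQL itself are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        WorkGroup="primary",
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/etl/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(5)

print(run_query("SELECT 1"))  # replace with the rendered CTAS statement above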

    We achieved a much more performant process with a build history shown in the following diagram:

    Conclusion

    Using one SQL query, we were able to process data from DynamoDB, convert that data to Parquet, apply Snappy compression, create the correct partitions in our AWS Glue Data Catalog, and ingest data to Amazon S3. Our ETL process execution time was reduced from hours to minutes, the cost was significantly lowered, and the new process is simpler and much more reliable. The new process using Athena for ETL is also future-proof due to extensibility. In case we need to import a dataset from another purpose-built data store that does not have a ready data source connector, we can simply use the data source connector SDK to write our own connector and deploy it in Production—a one-time effort that will cost us barely one day.

    Additionally, Athena federated queries have empowered Jornaya to run queries that connect data from not just different data sources, but from different data paradigms! We can run a single query that seamlessly links data from a NoSQL datastore, an RDS RDBMS, and an Amazon S3 data lake.


    About the Authors

    Manny Wald is the technical co-founder at Jornaya. He holds multiple patents and is passionate about the power of the cloud, big data and AI to accelerate the rate at which companies can bring products to market to solve real-world problems. He has a background in BI, application development, data warehousing, web services, and building tools to manage transactional and analytical information. Additionally, Manny created the internet’s first weekly hip hop turntablism mix show, is admitted to practice law at the state and federal levels, and plays basketball whenever he gets the chance.

    Janak Agarwal is a product manager for Athena at AWS.

    Query any data source with Amazon Athena’s new federated query


    Feed: AWS Big Data Blog.

    Organizations today use data stores that are the best fit for the applications they build. For example, for an organization building a social network, a graph database such as Amazon Neptune is likely the best fit when compared to a relational database. Similarly, for workloads that require flexible schema for fast iterations, Amazon DocumentDB (with MongoDB compatibility) is likely a better fit. As Werner Vogels, CTO and VP of Amazon.com, said: “Seldom can one database fit the needs of multiple distinct use cases.” Developers today build highly distributed applications using a multitude of purpose-built databases. In a sense, developers are doing what they do best: dividing complex applications into smaller pieces, which allows them to choose the right tool for the right job. As the number of data stores and applications increase, running analytics across multiple data sources can become challenging.

    Today, we are happy to announce a Public Preview of Amazon Athena support for federated queries.

    Federated Query in Amazon Athena

    Federated query is a new Amazon Athena feature that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena federated query, customers can submit a single SQL query and analyze data from multiple sources running on-premises or hosted on the cloud. Athena executes federated queries using Data Source Connectors that run on AWS Lambda. AWS has open-sourced Athena Data Source connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, Amazon Redshift, Amazon CloudWatch Logs, AWS CloudWatch Metrics, and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. Customers can use these connectors to run federated SQL queries in Athena across these data sources. Additionally, using the Query Federation SDK, customers can build connectors to any proprietary data source and enable Athena to run SQL queries against the data source. Since connectors run on Lambda, customers continue to benefit from Athena’s serverless architecture and do not have to manage infrastructure or scale for peak demands.

    Running analytics on data spread across applications can be complex and time consuming. Application developers pick a data store based on the application’s primary function. As a result, data required for analytics is often spread across relational, key-value, document, in-memory, search, graph, time-series, and ledger databases. Event and application logs are often stored in object stores such as Amazon S3. To analyze data across these sources, analysts have to learn new programming languages and data access constructs, and build complex pipelines to extract, transform and load into a data warehouse before they can easily query the data. Data pipelines introduce delays and require custom processes to validate data accuracy and consistency across systems. Moreover, when source applications are modified, data pipelines have to be updated and data has to be re-stated for corrections. Federated queries in Athena eliminate this complexity by allowing customers to query data in-place wherever it resides. Analysts can use familiar SQL constructs to JOIN data across multiple data sources for quick analysis or use scheduled SQL queries to extract and store results in Amazon S3 for subsequent analysis.

    The Athena Query Federation SDK extends the benefits of federated querying beyond AWS provided connectors. With fewer than 100 lines of code, customers can build connectors to proprietary data sources and share them across the organization. Connectors are deployed as Lambda functions and registered for use in Athena as data sources. Once registered, Athena invokes the connector to retrieve databases, tables, and columns available from the data source. A single Athena query can span multiple data sources. When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Since connectors are executed in Lambda, they can be used to access data from any data source on the cloud or on-premises that is accessible from Lambda.

    Data Source Connectors

    You can run SQL queries against new data sources by registering the data source with Athena. When registering the data source, you associate an Athena Data Connector specific to the data source. You can use AWS-provided open-source connectors, build your own, contribute to existing connectors, or use community or marketplace-built connectors. Depending on the type of data source, a connector manages metadata information; identifies specific parts of the tables that need to be scanned, read or filtered; and manages parallelism. Athena Data Source Connectors run as Lambda functions in your account.

    Each data connector is composed of two Lambda functions, each specific to a data source: one for metadata and one for record reading. The connector code is open-source and should be deployed as Lambda functions. You can also deploy the Lambda functions from the AWS Serverless Application Repository and use them with Athena. Once the Lambda functions are deployed, each produces a unique Amazon Resource Name (ARN). You must register these ARNs with Athena. Registering an ARN allows Athena to understand which Lambda function to talk to during query execution. Once both ARNs are registered, you can query the registered data source.
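At the time of this preview the registration happens through the Athena console (described later in this post); newer versions of the AWS SDK also expose it programmatically. A rough boto3 sketch, where the Parameters keys and the ARNs are assumptions to be checked against the current API reference:

# Rough sketch only: registering a connector's Lambda functions as an Athena
# data catalog via boto3. The Parameters keys shown here are assumptions; check
# the current API reference. ARNs, names, and region are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

athena.create_data_catalog(
    Name="dynamo",                # catalog name then used in queries, e.g. "dynamo"."default"."orders"
    Type="LAMBDA",
    Description="Athena DynamoDB connector",
    Parameters={
        "metadata-function": "arn:aws:lambda:us-east-1:123456789012:function:athena-dynamodb",
        "record-function":   "arn:aws:lambda:us-east-1:123456789012:function:athena-dynamodb",
    },
)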

    When a query runs on a federated data source, Athena fans out the Lambda invocations reading metadata and data in parallel. The number of parallel invocations depends on the Lambda concurrency limits enforced in your account. For example, if you have a limit of 300 concurrent Lambda invocations, Athena can invoke 300 parallel Lambda functions for record reading. For two queries running in parallel, Athena invokes twice the number of concurrent executions.

    Diagram 1 shows how Athena Federated Queries work. When you submit a federated query to Athena, Athena will invoke the right Lambda-based connector to connect with your Data Source. Athena will fan out Lambda invocations to read metadata and data in parallel.

    Diagram 1: Athena Federated Query Architecture

    Example

    This blog post demonstrates how data analysts can query data in multiple databases for faster analysis in one SQL query. For illustration purposes, consider an imaginary e-commerce company whose architecture leverages the following purpose-built data sources:

    1. Payment transaction records stored in Apache HBase running on AWS.
    2. Active orders, defined as customer orders not yet delivered, stored in Redis so that the processing engine can retrieve such orders quickly.
    3. Customer data such as email addresses, shipping information, etc., stored in DocumentDB.
    4. Product Catalog stored in Aurora.
    5. Order Processor’s log events housed in Amazon CloudWatch Logs.
    6. Historical orders and analytics in Redshift.
    7. Shipment tracking data in DynamoDB.
    8. A fleet of drivers performing last-mile delivery while using IoT-enabled tablets.

    Customers of this imaginary e-commerce company have a problem. They have complained that their orders are stuck in a weird state. Some orders show as pending even though they have actually been delivered while other orders show as delivered but have not actually been shipped.

    The company management has tasked the customer service analysts to determine the true state of all orders.

    Using Athena federated queries

    Using Athena’s query federation, the analysts can quickly analyze records from different sources. Additionally, they can set up a pipeline that extracts data from these sources, stores it in Amazon S3, and uses Athena to query it.

    Diagram 2 shows Athena invoking Lambda-based connectors to connect with data sources that are on premises and in the cloud in the same query. In this diagram, Athena is scanning data from S3 and executing the Lambda-based connectors to read data from HBase on EMR, DynamoDB, MySQL, Redshift, ElastiCache (Redis), and Amazon Aurora.

    Diagram 2: Federated Query Example.

     

    Analysts can register and use the following connectors found in this repository and run a query that:

    1. Grabs all active orders from Redis. (see athena-redis)
    2. Joins against any orders with ‘WARN’ or ‘ERROR’ events in CloudWatch Logs by using regex matching and extraction. (see athena-cloudwatch)
    3. Joins against our EC2 inventory to get the hostname(s) and status of the Order Processor(s) that logged the ‘WARN’ or ‘ERROR’. (see athena-cmdb)
    4. Joins against DocumentDB to obtain customer contact details for the affected orders. (see athena-docdb)
    5. Joins against a scatter-gather query sent to the Driver Fleet via Android Push notification. (see athena-android)
    6. Joins against DynamoDB to get shipping status and tracking details. (see athena-dynamodb)
    7. Joins against HBase to get payment status for the affected orders. (see athena-hbase)
    8. Joins against the advertising conversion data in BigQuery to see which promotions need to be applied if a re-order is needed. (see athena-bigquery)

    Data Source Connector Registration

    Analysts can register a data source connector using the Connect data source Flow in the Athena Query Editor.

    1. Choose Connect data source or Data sources on the Query Editor.
    2. Select the data source to which you want to connect, as shown in the following screenshot. You can also choose to write your own data source connector using the Query Federation SDK.
    3. Follow the rest of the steps in the UX to complete the registration. They involve configuring the connector function for your data source (as shown in the following screenshot), selecting a Name as the Catalog Name to use in your query, and providing a description.

    Sample Analyst Query

    Once the registration of the data source connectors is complete, the customer service analyst can write the following sample query to identify the affected orders in one SQL query, thus increasing the organization’s business velocity.


    WITH logs 
         AS (SELECT log_stream, 
                    message                                          AS 
                    order_processor_log, 
                    Regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId, 
                    Regexp_extract(message, '(.*):.*', 1)            AS log_level 
             FROM 
         "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams 
             WHERE  Regexp_extract(message, '(.*):.*', 1) != 'WARN'), 
         active_orders 
         AS (SELECT * 
             FROM   redis.redis_db.redis_customer_orders), 
         order_processors 
         AS (SELECT instanceid, 
                    publicipaddress, 
                    state.NAME 
             FROM   awscmdb.ec2.ec2_instances), 
         customer 
         AS (SELECT id, 
                    email 
             FROM   docdb.customers.customer_info), 
         addresses 
         AS (SELECT id, 
                    is_residential, 
                    address.street AS street 
             FROM   docdb.customers.customer_addresses),
         drivers
         AS ( SELECT name as driver_name, 
                     result_field as driver_order, 
                     device_id as truck_id, 
                     last_updated 
             FROM android.android.live_query where query_timeout = 5000 and query_min_results=5),
         impressions 
         AS ( SELECT path as advertisement, 
                     conversion
             FROM bigquery.click_impressions.click_conversions),
         shipments 
         AS ( SELECT order_id, 
                     shipment_id, 
                     from_unixtime(cast(shipped_date as double)) as shipment_time,
                     carrier
            FROM lambda_ddb.default.order_shipments),
         payments
         AS ( SELECT "summary:order_id", 
                     "summary:status", 
                     "summary:cc_id", 
                     "details:network" 
            FROM "hbase".hbase_payments.transactions)
    
    SELECT _key_            AS redis_order_id, 
           customer_id, 
           customer.email   AS cust_email, 
           "summary:cc_id"  AS credit_card,
           "details:network" AS CC_type,
           "summary:status" AS payment_status,
           impressions.advertisement as advertisement,
           status           AS redis_status, 
           addresses.street AS street_address, 
           shipments.shipment_time as shipment_time,
           shipments.carrier as shipment_carrier,
           driver_name     AS driver_name,
           truck_id       AS truck_id,
           last_updated AS driver_updated,
           publicipaddress  AS ec2_order_processor, 
           NAME             AS ec2_state, 
           log_level, 
           order_processor_log 
    FROM   active_orders 
           LEFT JOIN logs 
                  ON logs.orderid = active_orders._key_ 
           LEFT JOIN order_processors 
                  ON logs.log_stream = order_processors.instanceid 
           LEFT JOIN customer 
                  ON customer.id = customer_id 
           LEFT JOIN addresses 
                  ON addresses.id = address_id 
           LEFT JOIN drivers 
                  ON drivers.driver_order = active_orders._key_ 
           LEFT JOIN impressions
                  ON impressions.conversion = active_orders._key_
           LEFT JOIN shipments
                  ON shipments.order_id = active_orders._key_
           LEFT JOIN payments
                  ON payments."summary:order_id" = active_orders._key_

    Additionally, Athena writes all query results to an S3 bucket that you specify in your query. If your use case requires ingesting data into S3, you can use Athena’s query federation capabilities to register your data source, ingest the data to S3, and use CTAS or INSERT INTO statements to create partitions and metadata in the AWS Glue Data Catalog, as well as convert the data to a supported format.
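    A hedged sketch of this pattern follows, reusing the shipments connector from the example above. The destination table name, S3 location, and the assumption that the connector is registered as lambda_ddb are placeholders for illustration only:

    -- Sketch: snapshot a federated source into Parquet on S3, registered in Glue.
    CREATE TABLE shipments_snapshot
    WITH (format='PARQUET', 
    parquet_compression='SNAPPY', 
    partitioned_by=array['carrier'], 
    external_location = 's3://your-bucket/shipments-snapshot/') 
    AS
    SELECT order_id,
           shipment_id,
           from_unixtime(cast(shipped_date as double)) AS shipment_time,
           carrier
    FROM   lambda_ddb.default.order_shipments;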

    Conclusion

    In this blog, we introduced Athena’s new federated query feature. Using an example, we saw how to register and use Athena data source connectors to write federated queries that connect Athena to any data source accessible by AWS Lambda from your account. Finally, we saw that federated queries can be used not only to enable faster analytics, but also to extract, transform, and load data into your data lake in S3.

    Athena federated query is available in Preview in the us-east-1 (N. Virginia) region. Begin your Preview now by following these steps in the Athena FAQ.
    To learn more about the feature, please see the Connect to a Data Source documentation.
    To get started with an existing connector, please follow the Connect to a Data Source guide.
    To learn how to build your own data source connector using the Athena Query Federation SDK, please visit the Athena example in GitHub.


    About the Author

    Janak Agarwal is a product manager for Athena at AWS.


    Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena


    Feed: AWS Big Data Blog.

    Amazon Athena is an interactive query service that makes it easy to analyze the data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You can reduce your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. To learn more about best practices to boost query performance and reduce costs, see Top 10 Performance Tuning Tips for Amazon Athena.

    Overview

    This blog post discusses how to use Athena for extract, transform and load (ETL) jobs for data processing. This example optimizes the dataset for analytics by partitioning it and converting it to a columnar data format using Create Table as Select (CTAS) and INSERT INTO statements.

    CTAS statements create new tables using standard SELECT queries to filter data as required. You can also partition the data, specify compression, and convert the data into columnar formats like Apache Parquet and Apache ORC using CTAS statements. As part of the execution, the resultant tables and partitions are added to the AWS Glue Data Catalog, making them immediately available for subsequent queries.

    INSERT INTO statements insert new rows into a destination table based on a SELECT query statement that runs on a source table. If the source table’s underlying data is in CSV format and destination table’s data is in Parquet format, then INSERT INTO can easily transform and load data into destination table’s format. CTAS and INSERT INTO statements can be used together to perform an initial batch conversion of data as well as incremental updates to the existing table.

    Here is an overview of the ETL steps to be followed in Athena for data conversion:

    1. Create a table on the original dataset.
    2. Use a CTAS statement to create a new table in which the format, compression, partition fields and location of the new table can be specified.
    3. Add more data into the table using an INSERT INTO statement.

    This example uses a subset of NOAA Global Historical Climatology Network Daily (GHCN-D), a publicly available dataset on Amazon S3.

    This subset of data is available at the following S3 location:

    s3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/
    Total objects: 41727 
    Size of CSV dataset: 11.3 GB
    Region: us-east-1

    Procedure

    Follow these steps to use Athena for an ETL job.

    Create a table based on original dataset

    The original data is in CSV format with no partitions in Amazon S3. The following files are stored at the Amazon S3 location:

    2019-10-31 13:06:57  413.1 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0000
    2019-10-31 13:06:57  412.0 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0001
    2019-10-31 13:06:57   34.4 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0002
    2019-10-31 13:06:57  412.2 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0100
    2019-10-31 13:06:57  412.7 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0101

    Note that the file sizes are pretty small. Merging them into larger files and reducing the total number of files would lead to faster query execution. CTAS and INSERT INTO can help achieve this.

    Execute the following queries in the Athena console (preferably in us-east-1 to avoid inter-region Amazon S3 data transfer charges). First, create a database for this demo:

    CREATE DATABASE blogdb

    Now, create a table from the data above.

    CREATE EXTERNAL TABLE `blogdb`.`original_csv` (
      `id` string, 
      `date` string, 
      `element` string, 
      `datavalue` bigint, 
      `mflag` string, 
      `qflag` string, 
      `sflag` string, 
      `obstime` bigint)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/'

    Use CTAS to partition data and convert into parquet format with snappy compression

    Now, convert the data to Parquet format with Snappy compression and partition the data on a yearly basis. All these actions are performed using the CTAS statement. For the purpose of this blog, the initial table only includes data from 2015 to 2019. You can add new data to this table using the INSERT INTO command.

    The table created in Step 1 has a date field with the date formatted as YYYYMMDD (e.g. 20100104). The new table is partitioned on year. Extract the year value from the date field using the Presto function substr(“date”,1,4).

    CREATE table new_parquet
    WITH (format='PARQUET', 
    parquet_compression='SNAPPY', 
    partitioned_by=array['year'], 
    external_location = 's3://your-bucket/optimized-data/') 
    AS
    SELECT id,
             date,
             element,
             datavalue,
             mflag,
             qflag,
             sflag,
             obstime,
             substr("date",1,4) AS year
    FROM original_csv
    WHERE cast(substr("date",1,4) AS bigint) >= 2015
            AND cast(substr("date",1,4) AS bigint) <= 2019

    Once the query is successful, check the Amazon S3 location specified in the CTAS statement above. You should be able to see partitions and parquet files in each of these partitions, as shown in the following examples:

    1. Partitions:
      $ aws s3 ls s3://your-bucket/optimized-data/
                                 PRE year=2015/
                                 PRE year=2016/
                                 PRE year=2017/
                                 PRE year=2018/
                                 PRE year=2019/
    2. Parquet files:
      $ aws s3 ls s3://your-bucket/optimized-data/ --recursive --human-readable | head -5
      
      2019-10-31 14:51:05    7.3 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_1be48df2-3154-438b-b61d-8fb23809679d
      2019-10-31 14:51:05    7.0 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_2a57f4e2-ffa0-4be3-9c3f-28b16d86ed5a
      2019-10-31 14:51:05    9.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_34381db1-00ca-4092-bd65-ab04e06dc799
      2019-10-31 14:51:05    7.5 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_354a2bc1-345f-4996-9073-096cb863308d
      2019-10-31 14:51:05    6.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_42da4cfd-6e21-40a1-8152-0b902da385a1

    Add more data into table using INSERT INTO statement

    Now, add more data and partitions into the new table created above. The original dataset has data from 2010 to 2019. Since you added 2015 to 2019 using CTAS, add the rest of the data now using an INSERT INTO statement:

    INSERT INTO new_parquet
    SELECT id,
             date,
             element,
             datavalue,
             mflag,
             qflag,
             sflag,
             obstime,
             substr("date",1,4) AS year
    FROM original_csv
    WHERE cast(substr("date",1,4) AS bigint) < 2015

    List the Amazon S3 location of the new table:

     $ aws s3 ls s3://your-bucket/optimized-data/
                               PRE year=2010/
                               PRE year=2011/
                               PRE year=2012/
                               PRE year=2013/
                               PRE year=2014/
                               PRE year=2015/
                               PRE year=2016/
                               PRE year=2017/
                               PRE year=2018/
                               PRE year=2019/ 

    You can see that INSERT INTO is able to determine that “year” is a partition column and writes the data to Amazon S3 accordingly. There is also a significant reduction in the total size of the dataset thanks to compression and columnar storage in the Parquet format:

    Size of dataset after parquet with snappy compression - 1.2 GB
    

    You can also run INSERT INTO statements if more CSV data is added to the original table. Assume you have new data for the year 2020 added to the original Amazon S3 dataset. In that case, you can run the following INSERT INTO statement to add this data and the relevant partition(s) to the new_parquet table:

    INSERT INTO new_parquet
    SELECT id,
             date,
             element,
             datavalue,
             mflag,
             qflag,
             sflag,
             obstime,
             substr("date",1,4) AS year
    FROM original_csv
    WHERE cast(substr("date",1,4) AS bigint) = 2020

    Query the results

    Now that you have transformed data, run some queries to see what you gained in terms of performance and cost optimization:

    First, find the number of distinct IDs for every value of the year:

      1. Query on the original table:
        SELECT substr("date",1,4) as year, 
               COUNT(DISTINCT id) 
        FROM original_csv 
        GROUP BY 1 ORDER BY 1 DESC
      2. Query on the new table:
        SELECT year, 
          COUNT(DISTINCT id) 
        FROM new_parquet 
        GROUP BY  1 ORDER BY 1 DESC

        Original table: run time 16.88 seconds, data scanned 11.35 GB, cost $0.0567
        New table: run time 3.79 seconds, data scanned 428.05 MB, cost $0.002145
        Savings: 77.5% faster and 96.2% cheaper

    Next, calculate the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth in 2018:

        1. Query on the original table:
          SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
          FROM original_csv
          WHERE element IN ('TMIN', 'TMAX', 'PRCP') AND substr("date",1,4) = '2018'
          GROUP BY  1
        2. Query on the new table:
          SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
          FROM new_parquet 
          WHERE element IN ('TMIN', 'TMAX', 'PRCP') and year = '2018'
          GROUP BY  1
          Original table: run time 18.65 seconds, data scanned 11.35 GB, cost $0.0567
          New table: run time 1.92 seconds, data scanned 68.08 MB, cost $0.000345
          Savings: 90% faster and 99.4% cheaper

    Conclusion

    This post showed you how to perform ETL operations using CTAS and INSERT INTO statements in Athena. You can perform the first set of transformations using a CTAS statement. When new data arrives, use an INSERT INTO statement to transform and load data to the table created by the CTAS statement. Using this approach, you converted data to the Parquet format with Snappy compression, converted a non-partitioned dataset to a partitioned dataset, reduced the overall size of the dataset and lowered the costs of running queries in Athena.


    About the Author

     Pathik Shah is a big data architect for Amazon EMR at AWS.

    ICYMI: Serverless pre:Invent 2019


    Feed: AWS Compute Blog.
    Author: Eric Johnson.

    With Contributions from Chris Munns – Sr Manager – Developer Advocacy – AWS Serverless

    The last two weeks have been a frenzy of AWS service and feature launches, building up to AWS re:Invent 2019. As there has been a lot announced we thought we’d ship an ICYMI post summarizing the serverless service specific features that have been announced. We’ve also dropped in some service announcements from services that are commonly used in serverless application architectures or development.

    AWS re:Invent

    AWS re:Invent 2019

    We also want you to know that we’ll be talking about many of these features (as well as those coming) in sessions at re:Invent.

    Here’s what’s new!

    AWS Lambda

    On September 3, AWS Lambda started rolling out a major improvement to how AWS Lambda functions work with your Amazon VPC networks. This change brings both scale and performance improvements, and addresses several of the limitations of the previous networking model with VPCs.

    On November 25, Lambda announced that the rollout of this new capability has completed in 6 additional regions including US East (Virginia) and US West (Oregon).

    New VPC to VPC NAT for Lambda functions

    On November 18, Lambda announced three new runtime updates. Lambda now supports Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new language features and benefits so be sure to check out the linked release posts. These new runtimes are all based on an Amazon Linux 2 execution environment.

    Lambda has released a number of controls for both stream and async based invocations:

    • For Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams, it’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities will help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
    • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

    In addition to the above controls, Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set one destination for a success, and another for a failure. This unlocks really useful patterns for distributed event-based applications and can reduce code to send function results to a destination manually.

    Lambda Destinations

    Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Amazon Kinesis Data Streams and Amazon DynamoDB Streams. This allows for faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

    Lambda Parallelization Factor diagram

    Lambda now supports Amazon SQS FIFO queues as an event source. FIFO queues guarantee the order of record processing, compared to standard queues, which are unordered. FIFO queues support message grouping via a MessageGroupID attribute, which allows for parallel Lambda consumers of a single FIFO queue. This allows for high throughput of record processing by Lambda.

    Lambda now supports Environment Variables in AWS China (Beijing) Region, operated by Sinnet and the AWS China (Ningxia) Region, operated by NWCD.

    Lastly, you can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics tell you the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, spot outliers, and find hard-to-spot situations that create a poor customer experience for a subset of your users.

    AWS SAM CLI

    AWS SAM CLI deploy command

    The SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

    AWS Step Functions

    One of the powerful features of Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. Step Functions has expanded its integration with Amazon SageMaker to simplify machine learning workflows, and added a new integration with Amazon EMR, making it faster to build and easier to monitor EMR big data processing workflows.

    Step Functions step with EMR

    Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor and react to usage and spending trends on your AWS accounts.

    You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.

    AWS Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates.

    In US East (N. Virginia), US West (Oregon), and EU (Ireland), throughput has increased from 1,000 state transitions per second to 1,500 state transitions per second with bucket capacity of 5,000 state transitions. The default start rate for state machine executions has also increased from 200 per second to 300 per second, with bucket capacity of up to 1,300 starts in these regions.

    In all other regions, throughput has increased from 400 state transitions per second to 500 state transitions per second with bucket capacity of 800 state transitions. The default start rate for AWS Step Functions state machine executions has also increased from 25 per second to 150 per second, with bucket capacity of up to 800 state machine executions.

    Amazon SNS

    Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and resubmit them or analyze them to locate processing issues.

    Amazon CloudWatch

    CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

    CloudWatch ServiceLens

    CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

    On November 18, CloudWatch launched Embedded Metric Format to help you ingest complex high-cardinality application data in the form of logs and easily generate actionable metrics from them. You can publish these metrics from your Lambda function by using the PutLogEvents API or for Node.js or Python based applications using an open source library.

    Lastly, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

    AWS X-Ray

    X-Ray announced trace maps, which enable you to map the end to end path of a single request. Identifiers will show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

    X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. X-Ray supports tracing canary scripts throughout the application providing metrics on performance or application issues.

    X-Ray Service map with CloudWatch Synthetics

    Amazon DynamoDB

    DynamoDB announced support for customer managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows customers to bring your own key (BYOK) giving you full control over how you encrypt and manage the security of your DynamoDB data.

    It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

    Currently under preview, is another new DynamoDB capability to identify frequently accessed keys and database traffic trends. With this you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

    CloudWatch Contributor Insights for DynamoDB

    Last but far from least for DynamoDB is adaptive capacity, a feature which helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of overprovisioning for uneven data access patterns.

    AWS Serverless Application Repository

    The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge will appear next to your name in the SAR and will deep-link to your GitHub profile.

    SAR Verified developer badges

    AWS Code Services

    AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

    AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

    CodeBuild test trends

    AWS Amplify and AWS AppSync

    Instead of trying to summarize all the awesome things that our peers over in the Amplify and AppSync teams have done recently we’ll instead link you to their own recent summary: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

    AWS AppSync

    Still looking for more?

    We only covered a small bit of all the awesome new things that were recently announced. Keep your eyes peeled for more exciting announcements next week during re:Invent and for a future ICYMI Serverless Q4 roundup. We’ll also be kicking off a fresh series of Tech Talks in 2020 with new content helping to dive deeper on everything new coming out of AWS for serverless application developers.

    Announcing Amazon Redshift data lake export: share data in Apache Parquet format


    Feed: Recent Announcements.

    You can specify one or more partition columns so that unloaded data is automatically partitioned into folders in your Amazon S3 bucket. For example, you can choose to unload your marketing data and partition it by year, month, and day columns. This enables your queries to take advantage of partition pruning and skip scanning non-relevant partitions, improving query performance and minimizing cost.
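    As a hedged sketch, an unload of this kind might look like the following; the source table, S3 location, and IAM role are placeholders, and the partition columns are assumed to already exist in the table:

    UNLOAD ('SELECT * FROM marketing_events')
    TO 's3://your-bucket/marketing-data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS PARQUET
    PARTITION BY (year, month, day);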

    For more information, refer to the Amazon Redshift documentation.

    Amazon Redshift data lake export is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Amazon Redshift availability.

    NDB Parallel Query, part 1


    Feed: Planet MySQL
    ;
    Author: Mikael Ronström
    ;

    I will describe how NDB handles complex SQL queries in a number of
    blogs. NDB has the ability to parallelise parts of join processing.
    Ensuring that your queries make the best possible use of these
    parallelisation features enables applications to boost their
    performance significantly. It will also be a good base to explain
    any improvements we add to the query processing in NDB Cluster.

    NDB was designed from the beginning for extremely efficient key lookups
    and for extreme availability (less than 30 seconds of downtime per year
    including time for software change, meta data changes and crashes).

    Originally the model was single-threaded and optimised for 1-2 CPUs.
    The execution model uses an architecture where messages are sent
    between modules. This made it very straightforward to extend the
    architecture to support multi-threaded execution when CPUs with
    many cores became prevalent. The first multi-threaded version of NDB
    was version 7.0. This supported up to 7 threads working in parallel
    plus a large number of threads handling interaction with the file
    system.

    With the introduction of 7.0, the scan of a table, either using a
    range scan on an index or scanning the entire table, was automatically
    parallelised. So NDB has supported a limited form of parallel query
    already since the release of 7.0 (around 2011 I think).

    Now let’s use an example query, Q6 from DBT3 that mimics TPC-H.

    SELECT
        SUM(l_extendedprice * l_discount) AS revenue
    FROM
        lineitem
    WHERE
        l_shipdate >= '1994-01-01'
        AND l_shipdate < DATE_ADD( '1994-01-01' , INTERVAL '1' year)
        AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
        AND l_quantity < 24;

    The execution of this will use a range scan on the index on l_shipdate.
    This range is a perfectly normal range scan in NDB. Since range scans
    are parallelised, this query will execute using 1 CPU for each partition
    of the table. Assuming that we set up a cluster with default setup
    and with 8 LDM threads the table will be partitioned into 16 partitions.
    Each of those partitions will have a different CPU for the primary
    partition. This means that the range scans will execute on 16 CPUs in
    parallel.

    LDM (Local Data Manager) is the name of the threads in the NDB data
    nodes that manages the actual data in NDB. It contains a distributed
    hash index for the primary keys and unique keys, an ordered index
    implemented as a T-tree, a query handler that controls execution of
    lookups and scans and checkpointing and also handles the REDO log.
    Finally the LDM thread contains the row storage that has 4 parts.
    Fixed size parts of the row in memory, variable sized parts of the
    row in memory, dynamic parts of the row (absence of a column here
    means that it is NULL, so this provides the ability to ADD a column
    as an online operation) and finally a fixed size part that is stored
    on disk using a page cache. The row storage also contains an
    interpreter that can evaluate conditions, perform simple operations
    like add to support efficient auto increment.

    Now the first implementation of the NDB storage engine was implemented
    such that all condition evaluation was done in the MySQL Server. This
    meant that although we could scan the table in parallel, we still had
    a single thread to evaluate the conditions. This meant that to handle
    this query efficiently a condition pushdown is required. Condition
    pushdown was added to the MySQL storage engine API a fairly long time
    ago as part of the NDB development and can also benefit any other
    storage engine that can handle condition evaluation.
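
    A rough way to verify that a condition is being pushed down for your own
    queries is to look at EXPLAIN output; when the NDB storage engine accepts
    a pushed condition, this is typically reported in the Extra column (the
    exact wording depends on the MySQL/NDB version). A sketch using Q6:

    EXPLAIN
    SELECT SUM(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= '1994-01-01'
        AND l_shipdate < DATE_ADD('1994-01-01', INTERVAL '1' year)
        AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
        AND l_quantity < 24;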

    So the above contains 3 parts that can be parallelised individually.
    Scanning the data, evaluating the condition and finally performing
    the sum on the rows that match the condition.

    NDB currently parallelises the scan part and the condition evaluation
    part. The sum is handled by the MySQL Server. In this case the
    filtering factor is high, which means that the sum part is not a
    bottleneck in this query. The bottleneck in this query is scanning
    the data and evaluating the condition.

    In the terminology of relational algebra this means that NDB supports
    a parallelised SELECT operator for some filters. NDB also supports a
    parallel PROJECT operator. NDB doesn’t yet support a parallel
    AGGREGATE function.

    The bottleneck in this query is how fast one can scan the data and
    evaluate the condition. In version 7.6 we made a substantial
    optimisation of this part where we managed to improve a simple
    query by 240% through low-level optimisations of the code.
    With this optimisation NDB can handle more than 2 million rows
    per second per CPU with a very simple condition to evaluate. This
    query greatly benefits from this greater efficiency. Executing this
    query with scale factor 10 (60M rows in the lineitem table) takes
    about 1.5 seconds with the configuration above where 16 CPUs
    concurrently perform the scan and condition evaluation.

    A single-threaded storage engine is around 20x slower. With more
    CPUs available in the LDM threads the parallelisation will be even
    higher.

    Obviously there are other DBMSs focused on analytical queries
    that can handle this query even faster; NDB is focused on online
    applications with high write scalability and the highest
    availability. But we are also working to make query execution of
    complex SQL much faster so that online applications can analyze
    data in real time.

    Query Execution

    In the figure below we describe the execution flow for this query. As usual
    the query starts with parsing (unless it is a prepared statement) and after
    that the query is optimised.
    This query is executed as a single range scan against the lineitem table. Scans
    are controlled by a TC thread that ensures that all the fragments of the table are
    scanned. It is possible to control the parallelism of the query through the
    NDB API. In most of the cases the parallelism will be full parallelism. Each thread
    has a real-time scheduler and the scan in the LDM threads will be split up into
    multiple executions that will be interleaved with execution by other queries
    executing in parallel.
    This means that in an idle system this query will be able to execute at full speed.
    However even if there is lots of other queries going on in parallel the query will
    execute almost as fast as long as the CPUs are not overloaded.
    In the figure below we also show that control of the scan goes through the TC
    thread, but the result row is sent directly from the LDM thread to the NDB API.
    In the MySQL Server the NDB storage engine gets the row from the NDB API
    and returns it to the MySQL Server for the sum function and result processing.

    Query Analysis

    The query reads the lineitem table that has about 6M rows in scale
    factor 1. It reads them using an index on l_shipdate. The range
    consists of 909.455 rows to analyse and of those 114.160 rows are
    produced to calculate results of the sum.  In the above configuration
    it takes about 0.15 seconds for NDB to execute the query. There are
    some limitations to get full use of all CPUs involved even in this
    query that is related to batch handling. I will describe this in a
    later blog.

    Scalability impact

    This query is only positively impacted by any type of scaling. The
    more fragments the lineitem table is partitioned into, the more
    parallelism the query will use. So the only limitation to scaling
    is when the sum part starts to become the bottleneck.

    Next part

    In the next part we will discuss how NDB can parallelise a very
    simple 2-way join from the DBT3 benchmark. This is Q12 from
    TPC-H that looks like this.

    SELECT
            l_shipmode,
            SUM(CASE
                    WHEN o_orderpriority = '1-URGENT'
                            OR o_orderpriority = '2-HIGH'
                            THEN 1
                    ELSE 0
            END) AS high_line_count,
            SUM(CASE
                    WHEN o_orderpriority <> '1-URGENT'
                            AND o_orderpriority <> '2-HIGH'
                            THEN 1
                    ELSE 0
            END) AS low_line_count
    FROM
            orders,
            lineitem
    WHERE
            o_orderkey = l_orderkey
            AND l_shipmode IN ('MAIL', 'SHIP')
            AND l_commitdate < l_receiptdate
            AND l_shipdate < l_commitdate
            AND l_receiptdate >= '1994-01-01'
            AND l_receiptdate < DATE_ADD( '1994-01-01', INTERVAL '1' year)
    GROUP BY
            l_shipmode
    ORDER BY
            l_shipmode;

    This query introduces 3 additional relational algebra operators,
    a JOIN operator, a GROUP BY operator and a SORT operator.

    NDB Parallel Query, part 2


    Feed: Planet MySQL
    ;
    Author: Mikael Ronström
    ;

    In part 1 we showed how NDB can parallelise a simple query with only a single
    table involved. In this blog we will build on this and show how NDB can
    parallelise only some parts of a two-way join query. As an example we will use
    Q12 in DBT3:

    SELECT
            l_shipmode,
            SUM(CASE
                    WHEN o_orderpriority = '1-URGENT'
                            OR o_orderpriority = '2-HIGH'
                            THEN 1
                    ELSE 0
            END) AS high_line_count,
            SUM(CASE
                    WHEN o_orderpriority <> '1-URGENT'
                            AND o_orderpriority <> '2-HIGH'
                            THEN 1
                    ELSE 0
            END) AS low_line_count
    FROM
            orders,
            lineitem
    WHERE
            o_orderkey = l_orderkey
            AND l_shipmode IN ('MAIL', 'SHIP')
            AND l_commitdate < l_receiptdate
            AND l_shipdate < l_commitdate
            AND l_receiptdate >= '1994-01-01'
            AND l_receiptdate < DATE_ADD( '1994-01-01', INTERVAL '1' year)
    GROUP BY
            l_shipmode
    ORDER BY
            l_shipmode;

    This query when seen through the relational operators will first pass through
    a SELECT operator and a PROJECT operator in the data nodes. The JOIN operator
    will be executed on the lineitem and orders tables and the result of the JOIN operator
    will be sent to the MySQL Server. The MySQL Server will thereafter handle the
    GROUP BY operator with its aggregation function and also the final SORT operator.
    Thus we can parallelise the filtering, projection and join, but the GROUP BY
    aggregation and sorting will be implemented in the normal MySQL execution of
    GROUP BY, SUM and sorting.

    This query will be executed by first performing a range scan on the lineitem
    table and evaluating the condition that limits the number of rows to send to
    the join with the orders table. The join is performed on the primary key of
    the orders table. So the access to the orders table is a primary key lookup
    for each row that comes from the range scan on the lineitem table.

    In the MySQL implementation of this join, one row is fetched from the
    lineitem table, and for each such row a primary key lookup is performed
    in the orders table. This means that we can only handle one
    primary key lookup at a time unless we do something in the NDB storage
    engine. The execution of this query without pushdown join would make
    it possible to run the scans towards the lineitem table in parallel. The
    primary key lookups on the orders table would however execute serially,
    fetching only one row at a time. This would increase the query time in
    this case by a factor of around 5x. So by pushing the join down into
    the NDB data nodes we can make sure that the primary key lookups on the
    orders table are parallelised as well.

    To handle this the MySQL Server has the ability to push an entire join
    execution down to the storage engine. We will describe this interface in more
    detail in a later blog part.
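
    A hedged way to check whether a particular join was actually pushed down is
    again EXPLAIN; when NDB pushes a join, this is typically visible in the
    EXPLAIN output for the joined tables (the exact wording depends on the
    MySQL/NDB version). A sketch using the core of Q12:

    EXPLAIN
    SELECT o_orderpriority, l_shipmode
    FROM   orders, lineitem
    WHERE  o_orderkey = l_orderkey
           AND l_shipmode IN ('MAIL', 'SHIP')
           AND l_receiptdate >= '1994-01-01'
           AND l_receiptdate < DATE_ADD('1994-01-01', INTERVAL '1' year);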

    To handle this query in NDB we have implemented a special scan protocol that
    enables performing complex join operations. The scan will be presented with
    a parameter part for each table in the join operation that will describe the
    dependencies between the table and the conditions to be pushed to each table.

    This is implemented in the TC threads in the NDB data node. The TC threads in
    this case acts as parallel JOIN operators. The join is parallelised on the
    first table in the join, in this case the lineitem table. For each node in
    the cluster a JOIN operator will be created that takes care of scanning all
    partitions that have its primary partition in the node. This means that the
    scan of the first table and the join operator is always located on the same node.

    The primary key lookup is sent to the node where the data resides, in a cluster
    with 2 replicas and 2 nodes and the table uses READ BACKUP, we will always find
    the row locally. With larger clusters the likelihood that this lookup is sent
    over the network increases.

    Compared to a single-threaded storage engine, this query scales almost 30x
    using 2 nodes with 8 LDM threads each. As mentioned in the previous blog,
    NDB's implementation is very efficient, so the speedup benefits from this
    as well.

    This query is more efficiently implemented in MySQL Cluster 8.0.18 since we
    implemented support for comparing two columns, both from the same table and
    from different tables provided they have the same data type. This improved
    performance of this query by 2x. Previous to this the NDB interpreter could
    handle comparisons of the type col_name COMPARATOR constant, e.g.
    l_receiptdate >= '1994-01-01'.

    Query Execution

    In the figure below we show the execution flow for this query in NDB. As described
    above we have a module called DBSPJ in the TC threads that handle the JOIN
    processing. We have shown in the figure below the flow for the scan of the lineitem
    table in blue arrows. The primary key lookups have been shown with red arrows.
    In the figure below we have assumed that we’re not using READ BACKUP. We will
    describe in more detail the impact of READ BACKUP in a later part of this blog series.

    Query Analysis

    The query will read the lineitem in parallel using a range scan. This scan will
    evaluate 909.844 rows when using scale factor 1 in TPC-H. Of those rows there will
    be 30.988 rows that will evaluate to true. Each of those 30.988 rows will be sent to the
    NDB API but will also be reported to the DBSPJ module to issue parallel key lookups
    towards the orders table.

    As a matter of fact, this query will actually execute faster than the previous query
    we analysed (Q6 in TPC-H), although it does more work. Most of the work is done in
    the lineitem table; both Q6 and Q12 do almost the same amount of work in the range
    scan on lineitem. However, since there are fewer records to report back to the
    MySQL Server, parallelism is improved due to the batch handling in NDB.

    Scalability impact

    This query will scale very well with more partitions of the lineitem table
    and the orders table. As the cluster grows some scalability impact will
    come from a higher cost of the primary key lookups that have to be sent on
    the network to other nodes.

    Next Part

    In part 3 we will discuss how the MySQL Server and the NDB storage engine works
    together to define the query parts pushed down to NDB.

    Maximizing the Value of Your Cloud-Enabled Enterprise Data Lake by Tracking Critical Metrics


    Feed: AWS Partner Network (APN) Blog.
    Author: Gopal Wunnava.

    By Alberto Artasanchez, DBG Artificial Intelligence Lab Director at Accenture
    By Raniendu Singh, Senior Data Engineer at AWS
    By Gopal Wunnava, Principal Architect at AWS

    More than ever, consulting projects run with lean staffs and tight deadlines. It’s imperative to quickly demonstrate value-add and results. There are many resources, vendors, and tools to assist in the creation of an enterprise data lake, but the tooling needed to measure the success of a cloud-based enterprise data lake implementation is lacking.

    Successful data lake implementations can serve a corporation well for years. A key to success and longevity is to effectively communicate whether the implementation is adding value or not. However, most metrics for an enterprise data lake are not binary and are more granular than just saying the project is “green” or “red.”

    Accenture, an AWS Partner Network (APN) Premier Consulting Partner, recently had an engagement with a Fortune 500 company that wanted to optimize its Amazon Web Services (AWS) data lake implementation.

    As part of the engagement, Accenture moved the customer to better-suited services and developed metrics to closely monitor the health of its overall cloud environment.

    In this post, we will focus on specifying the different metrics you can use in your environment to properly assess the status of your cloud-based data lake. We’ll detail some of the data lake metrics that can be used to measure performance. To set the context, we’ll first introduce some basic data lake concepts.

    Data Lake Overview

    The data lake paradigm is not new, and many enterprises are either thinking about implementing one or in the middle of implementing one for their organization.

    An important concept to cover is the set of components that form the data lake. A data lake is most often divided into the following parts:

    • Transient data zone: This is a buffer used to temporarily host the data as you prepare to permanently move it to the landing data zone defined later.
    • Raw data zone: After quality checks and security transformations have been performed in the transient data zone, the data can be loaded into the raw data zone for permanent storage.
    • Trusted data zone: This is where the data is placed after it’s been checked to be in compliance with all government, industry, and corporate policies. It’s also been checked for quality.
    • Refinery data zone: In this zone, data goes through more transformation steps. Data here is integrated into a common format for ease of use. It goes through possible detokenization, more quality checks, and lifecycle management. This ensures the data is in a format you can easily use to create models and derive insights.
    • Sandboxes: Sandboxes are an integral part of the data lake because they allow data scientists, analysts, and managers to create unplanned exploratory use cases without the involvement of the IT department.

    Sample Data Lake Architecture

    Building a data lake is a non-trivial task requiring the integration of disparate technologies for data storage, ingestion, and processing, to name a few. Moreover, there are no standards for security, governance, and collaboration, which makes things more complicated.

    There are many other factors a business must investigate before selecting its technology stack. There are many tools that can be used to implement a data lake, which could exist purely on the cloud, on-premises, or use a hybrid architecture.

    Most data lakes take advantage of new open source and cloud services, but you could potentially use legacy technologies to implement your data lake if the business requirements sent you in that direction.

    Here is a sample architecture for a data lake Accenture recently created for another customer using cloud-based technologies from AWS.


    Figure 1 – Sample data lake physical architecture.

    Data Lake Characteristics to be Measured

    Before you can measure a data lake, you have to define what to measure. Here are a few characteristics to help you with these measurements:

    • Size: Together with variety and speed, these three characteristics are the oft-mentioned three Vs in many definitions of big data (volume, variety, velocity). How big is the data lake?
    • Governability: How easy is it to verify and certify the data in your lake?
    • Quality: What’s the quality of the data contained in the lake? Are some records and files invalid? Are there duplicates? Can you determine the source and lineage of the data?
    • Usage: How many visitors, sources, and downstream systems does the lake have? How easy is it to populate and access the data in the lake?
    • Variety: Does the data the lake is holding have many types? Are there many types of data sources that feed the lake? Can the data be extracted in different ways and formats, such as files, Amazon Simple Storage Service (Amazon S3), HDFS, traditional databases, and NoSQL?
    • Speed: How fast can you populate and access the lake?
    • Stakeholder and customer satisfaction: Users, downstream systems, and source systems are the data lake customers. We recommend periodically probing the data lake customers in a formal and measurable fashion—with a survey, for example—to get feedback and levels of satisfaction or dissatisfaction.
    • Security: Is the lake properly secured? Can only users with the proper access obtain data in the lake? Is the data encrypted? Is personally identifiable information (PII) properly masked for people without access?


    Figure 2 – Sample enterprise data lake executive dashboard.

    In the image above, you see a sample data lake executive dashboard for a Fortune 500 company, capturing many of the metrics discussed in this post.

    A dashboard of this nature can help you to quickly obtain a visual snapshot of how your data lake is performing, using key metrics such as data governance, data quality, growth, user base, and customer satisfaction.

    Measuring Data Lake Performance

    Now that we have laid out all of the necessary context to get into the core of this post, let’s explore some of the metrics that can be used to gauge the success of your data lake.

    A list of these metrics follows, but it’s not meant to be a comprehensive list. Rather, this is a starting point to generate the metrics applicable to your particular implementation.

    Size

    You may want to track two measurements: total lake size, and trusted zone size. For total lake size, the number itself might not be significant or provide any value. The lake could be full of useless data or valuable data. However, this number has a direct effect on your billing costs.

    One way to keep this number in check and reduce costs is to set up an archival or purge policy. Your documents are moved to long-term storage like Amazon Glacier, or they can be permanently deleted.

    Amazon S3 provides a convenient way to purge files by using lifecycle policies, or by using S3 Intelligent-Tiering, which optimizes costs by automatically moving data to the most cost-effective access tier, without performance impact or operational overhead. S3 Intelligent-Tiering stores objects in two access tiers: one that’s optimized for frequent access, and another lower-cost tier that’s optimized for infrequent access.

    For trusted zone size, the bigger the number the better. It’s a measure of how much “clean data” exists in the lake. You can dump enormous amounts of data into the raw data zone. If it’s never transformed, cleaned, and governed, however, the data is useless.

    Governability

    This might be a difficult characteristic to measure, but it’s an important one as not all data must be governed. The critical data needs to be identified and a governance layer should be added on top of it.

    There are many opportunities to track governability, such as:

    • Designate critical data elements (CDEs) and relate them at the dataset level to the data in the lake. Track the percentage of CDEs that are matched and resolved at the column level (see the sketch after this list).
    • Track the number of approved CDEs against the total CDEs.
    • Track the number of modifications done to CDEs after the CDEs have already been approved.
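
    As a hedged sketch, the approval ratio above could be computed with a query against a governance metadata store. The table and column names (governance.cde_registry, status) are hypothetical placeholders, not part of any specific product:

    SELECT round(100.0 * sum(CASE WHEN status = 'APPROVED' THEN 1 ELSE 0 END)
                 / count(*), 2) AS approved_cde_pct
    FROM   governance.cde_registry;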

    Quality

    Data quality does not need to be perfect; it just needs to be good enough depending on the domain. For example, if you’re using a dataset to generate this quarter’s financial report, the key metrics used to summarize the financial state of the company need to be as accurate as possible.

    However, it may be OK for some reports accompanying this financial data to contain typos in the names of people presenting the financial data to the public. Such typos are far less damaging to the company and can be fixed easily without attracting the attention of auditors.

    If the use case is trying to determine who should receive a marketing email, the data must still be fairly clean. However, if some of the emails are invalid, it’s not going to be a huge issue.

    Data quality dimensions that are normally measured are the following (a short SQL sketch of how a few of them can be checked appears after the list):

    • Completeness: What percentage of the data includes a value? It’s important that critical data such as customer names, phone numbers, and emails be complete. Completeness doesn’t impact non-critical data that much.
    • Uniqueness: Are there duplicates in the data when there shouldn’t be any?
    • Timeliness: Is the data being produced in time for it to be useful?
    • Validity: Does the data conform to the respective standards set for it?
    • Accuracy: How well does the data reflect the real-world scenario that it represents?
    • Consistency: How well does the data align with a pre-established pattern? A common example is dates where patterns can vary greatly.
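    As a rough illustration (not part of the original metric list), a few of these dimensions can be approximated with plain SQL, for example via Amazon Athena against the trusted zone. The schema, table, and column names below (trusted_zone.customers, email, customer_id) are hypothetical placeholders.

    -- Hypothetical completeness and uniqueness checks on a trusted-zone table.
    SELECT
        COUNT(*)                               AS total_rows,
        100.0 * COUNT(email) / COUNT(*)        AS email_completeness_pct,   -- completeness
        COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_customer_ids    -- uniqueness
    FROM trusted_zone.customers;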

    Usage

    Borrowing a term from the internet, you might want to track the number of page requests, as well as the number of visits and visitors to your data lake in general. Also, track individual components of the lake.

    Metrics we monitored in the dashboard we created for our Fortune 500 customer were the conversion rate (i.e. how many visitors turn into customers) and the retention rate (i.e. how many customers turn into long-term customers).

    Tracking these metrics gives you one indication of where to focus your efforts. If a certain section of the data lake is not getting much traffic, for example, you may want to consider retiring it.

    AWS provides a convenient way to track your usage metrics by running SQL queries against AWS CloudTrail logs with Amazon Athena.
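    As a minimal sketch, a query along the following lines could count daily S3 requests against the data lake bucket. It assumes a CloudTrail table (here called cloudtrail_logs) has already been defined in Athena, and my-data-lake-bucket is a placeholder to replace with your own bucket name.

    -- Hypothetical daily request counts against the data lake bucket.
    SELECT DATE(from_iso8601_timestamp(eventtime)) AS day,
           eventname,
           COUNT(*) AS requests
    FROM cloudtrail_logs
    WHERE eventsource = 's3.amazonaws.com'
      AND json_extract_scalar(requestparameters, '$.bucketName') = 'my-data-lake-bucket'
    GROUP BY 1, 2
    ORDER BY 1, 2;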

    Variety

    You should measure the variety of a couple of components of the data lake:

    • Ideally, data lakes should be able to ingest a wide variety of input types: relational database management systems (RDBMS), NoSQL databases such as Amazon DynamoDB, CRM application data, JSON, XML, emails, logs, etc.
      .
    • Even though the input data may be of many different types, you might want to homogenize the data in the lake as much as possible into one format and one storage type. This could be, for example, converting the data to a columnar format such as Parquet, which makes querying more efficient, and then storing the results in Amazon S3 buckets in your data lake. Adopting this technique as a standard approach enhances the overall user experience.

    Complete uniformity might not be achievable or even desired. For example, it doesn’t make sense to convert unstructured data into Parquet. Use this metric as a loose guideline and not a strict policy.

    Speed

    Two useful measurements to use when it comes to speed are:

    • How long it takes to update the trusted data zone from the moment you start the ingestion process.
    • How long it takes for users to access the data that they require.

    In both of these cases, it’s not required to squeeze every possible millisecond from the process. It just needs to be good enough. For example, if the nightly window to populate the data lake is four hours and the process is taking two hours, that may be acceptable.

    However, if you know your input data will double, you’ll want to find ways to speed up the process since you will be hitting the limit shortly. Similarly, if user queries are taking a few seconds and they are using the queries to populate reports, the performance might be acceptable. The time it takes to optimize the queries further may be better spent on other priorities.

    Customer Satisfaction

    Other than security, this is one of the most important metrics to continuously track. We’re all at the mercy of our customers, and in this case the customers are our data lake users. If you don’t have users in your lake, or your users are unhappy, don’t be surprised if your data lake initiative withers on the vine and eventually dies.

    You can track customer satisfaction in a variety of ways, ranging from the informal to the strict and formal. The most informal way is to periodically ask your project sponsor for a temperature reading.

    To formalize this metric, we recommend a formal survey of the data lake users. You can multiply those opinions by the level of usage by each of the survey participants. If the lake gets a bad grade from a few sporadic users and great grades from hardcore users, it probably means your data lake implementation has a steep learning curve. When users get familiar with it, though, they can be hyper-productive.

    Security

    It is paramount to ensure the data lake is secure and users have access only to their data. Even a small number of breaches is not acceptable: a single breach could mean critical data is compromised and used for nefarious purposes by competitors or other parties.

    Following are some AWS services and features that assist in data lake implementation and facilitate the tracking of security metrics.

    • AWS Lake Formation: One of the features of this service is that it provides centralized access controls for your data in the lake. Granular data access policies can be defined for your users and applications to protect your data, independently of the services used to access the data.
      .
      AWS Lake Formation ensures all of your data is described in a data catalog, giving you one central location to browse and query the data you have permission to access. AWS Lake Formation uses AWS Identity and Access Management (IAM) policies to control permissions, and IAM and SAML authenticated users can be automatically mapped to data protection policies that are stored in the data catalog.
      .
      After the rules are established, AWS Lake Formation can enforce access controls with fine-grained granularity at the table and column level for Amazon Redshift Spectrum and Amazon Athena users. EMR integration supports authorization of Active Directory, Okta, and Auth0 users for EMR and Zeppelin notebooks connected to EMR clusters.
      .
    • AWS Security Hub: This service provides AWS users with a central dashboard to track, aggregate, and measure security findings, and compare them against pre-established policies and compliance checks.
      .
      AWS Security Hub provides a “single pane of glass” view that serves as a starting point to get a sense of the overall health of a system. With AWS Security Hub, we can consolidate multiple AWS services like Amazon GuardDuty, Amazon Inspector, and Amazon Macie, as well as other APN Partner solutions, into one dashboard.
      .
    • Amazon Macie: Storing PII data incorrectly can carry big penalties to a company’s reputation, as well as to its bottom line via fines and lost business. To minimize this risk, Amazon Macie can be used to automatically scan your data lake to locate and flag errant PII in your repositories.

    Optimizing the Data Lake

    Once we start measuring the performance of the data lake, we can identify areas that can be improved and enhanced. AWS offers a variety of services that can assist in this task.

    One powerful and new service at our disposal is AWS Lake Formation, which allows you to easily set up data lakes. In the context of optimizing the data lake, though, AWS Lake Formation offers the following features:

    • Data discovery, catalog, and search: AWS Lake Formation automatically discovers all AWS data sources to which it’s provided access by your IAM policies. The service invokes AWS Glue to crawl through data sources such as Amazon S3, Amazon Relational Database Service (Amazon RDS), and AWS CloudTrail to ensure your data is described in a centralized data catalog.
      .
      While the crawlers automatically generate properties useful to describe your metadata, you can add custom labels to categorize and comment on business attributes, such as data sensitivity and criticality, at the table or column level. These custom labels also provide you with the opportunity to be creative in terms of labeling and describing key metrics we use to measure data lake performance as identified earlier.
      .
      AWS Lake Formation provides an intuitive user interface to perform text-based search and filtering on entities by type, classification, attribute, or free-form text.
      .
    • Increased performance: AWS Lake Formation facilitates data transformation into more performant formats like Parquet and ORC. It also optimizes data partitioning in Amazon S3 to improve performance and reduce costs. Raw data that’s loaded may be in partitions that are too small (requiring extra reads) or too large (reading more data than needed).
      .
    • Clean and deduplicate data: AWS Lake Formation provides a powerful feature called AWS Lake Formation FindMatches to help clean and prepare your data for analysis by providing deduplication and finding matching records using artificial intelligence.
      .
    • Increased security: AWS Lake Formation simplifies security management by enforcing encryption, leveraging the existing encryption capabilities of Amazon S3. The service records all activity in AWS CloudTrail, which enables governance, compliance, and operational and risk auditing capabilities.

    Getting Started

    In the AWS console, you can quickly start building data lakes and measuring their performance with AWS Lake Formation.

    Accenture has a strong partnership with AWS, with a joint group dedicated to helping customers accelerate their innovation using cloud as a catalyst.

    If you have questions or feedback about AWS Lake Formation, please email lakeformation-feedback@amazon.com.

    Summary

    In this post, we discussed a customer’s journey and some challenges they faced when implementing a data lake. We specified the elements of a well-formed data lake and presented a sample data lake architecture using cloud-based technologies from AWS.

    We also defined a few of the data lake characteristics that can be measured, and then laid out how these characteristics are then measured and tracked. Finally, we showed you how these measurements can be used to enhance the data lake.

    Just like the number of data lake definitions is vast, the number of potential metrics to use against your data lake is also large. The metrics we have laid out here should be a launching pad for your own custom metrics to measure your data lake’s success.

    We are interested in hearing about the metrics you find relevant, as well as the challenges you faced and the solutions that worked for you.



    Accenture – APN Partner Spotlight

    Accenture is an APN Premier Consulting Partner. A leading, global professional services company that provides an end-to-end solution to migrate to and manage operations on AWS, Accenture’s staff of 440,000+ includes more than 4,000 trained and 2,000 AWS Certified professionals.

    Contact Accenture | Practice Overview

    *Already worked with Accenture? Rate this Partner

    *To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.

    NDB Parallel Query, part 5

    $
    0
    0

    Feed: Planet MySQL
    ;
    Author: Mikael Ronström
    ;

    In this part we are going to analyze a bit more complex query than before.
    This query is a 6-way join.

    The query is:
    SELECT
            supp_nation,
            cust_nation,
            l_year,
            SUM(volume) AS revenue
    FROM
            (
                    SELECT
                            n1.n_name AS supp_nation,
                            n2.n_name AS cust_nation,
                            extract(year FROM l_shipdate) as l_year,
                            l_extendedprice * (1 - l_discount) AS volume
                    FROM
                            supplier,
                            lineitem,
                            orders,
                            customer,
                            nation n1,
                            nation n2
                    WHERE
                            s_suppkey = l_suppkey
                            AND o_orderkey = l_orderkey
                            AND c_custkey = o_custkey
                            AND s_nationkey = n1.n_nationkey
                            AND c_nationkey = n2.n_nationkey
                            AND (
                                    (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
                                    OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
                            )
                            AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'
            ) AS shipping
    GROUP BY
            supp_nation,
            cust_nation,
            l_year
    ORDER BY
            supp_nation,
            cust_nation,
            l_year;

    It is the inner SELECT that is the 6-way join. The outer part only deals with the
    GROUP BY aggregation and ORDER BY of the result set from the inner SELECT. As
    mentioned before, the GROUP BY aggregation and ORDER BY parts are handled by the
    MySQL Server, so the NDB join pushdown only deals with the inner SELECT.

    In the queries analysed previously the join order was pretty obvious. In this case it
    isn’t that obvious, but the selection of join order is still fairly straightforward.
    The selected join order is
    n1 -> supplier -> lineitem -> orders -> customer -> n2.

    Query analysis

    The query starts by reading 2 rows from the nation table; these 2 rows can come from
    the same TC thread or from separate TC threads. They start a new scan on the supplier
    table, which in turn generates the data for the next scan. The supplier table will
    return 798 rows that are used in the scan against the lineitem table. This assumes
    scale factor 1.

    This introduces a new thing to discuss. If this query had been executed in the
    MySQL Server, we would only be able to handle one row from the supplier table at a
    time. There have been some improvements in the storage engine API to handle this
    using the read multi range API, but that still means a lot of communication back and
    forth and starting up new scans. With NDB join processing we instead send a
    multi-range scan to the lineitem table, that is, one scan message that contains many
    different ranges. There will still be a new walk through the index tree for each
    range, but there is no need to send the scan messages again and again.

    Creation of these multi-ranges is handled as part of the join processing in the
    DBSPJ module.

    The join between the supplier table and the lineitem table contains one more interesting
    aspect. Here we join towards the column l_orderkey in the lineitem table. In many
    TPC-H queries the join against the lineitem table uses the order key as the join
    column. The order key is the first part of the primary key and is thus a candidate to
    use as the partition key. The TPC-H queries definitely improve by using the order key as
    the partition key instead of the primary key. This means that an order and all lineitems
    for that order are stored in the same LDM thread.
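    As a sketch of how this could be declared (assuming lineitem is an NDB table whose
    primary key starts with l_orderkey; the exact DDL depends on your TPC-H schema definition):

    ALTER TABLE lineitem
        PARTITION BY KEY (l_orderkey);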

    The scan on the lineitem table will produce 145,703 rows to join with the orders table.
    The remaining joins are performed through the primary key. Thus we will perform 145,703
    key lookups in the orders table, 145,703 key lookups in the customer table, and finally
    145,703 lookups against the nation table. The only filtering here happens on the last
    table, which decreases the number of result rows sent to the MySQL Server; the end
    result is 5,924 rows.

    This raises another point: it would be possible to increase parallelism in this
    query by storing the result rows in the DBSPJ module. However, this would increase the
    overhead, so it would improve parallelism at the cost of efficiency.

    Scalability impact

    If we make sure that the lineitem table is partitioned on the order key this query will
    scale nicely. There will be fairly small impact with more partitions since only the scan
    against the supplier table will be more costly in a larger cluster.

    One thing that will make the query cost more is when the primary key lookups are
    distributed instead of local. One table for which it is definitely a good idea to use
    FULLY REPLICATED is the nation table. This means that all those 145,703 key
    lookups will be handled inside a data node instead of over the network.

    The supplier table has only 10,000 rows compared to the lineitem table, which has
    6M rows. Thus it should definitely be possible to use FULLY REPLICATED also
    for this table. The customer table has 150,000 rows and is another candidate
    for FULLY REPLICATED.
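    A minimal sketch of marking such tables as fully replicated, using the NDB_TABLE
    comment option (assuming the tables already use ENGINE=NDB; adjust to your schema):

    ALTER TABLE nation   COMMENT='NDB_TABLE=FULLY_REPLICATED=1';
    ALTER TABLE supplier COMMENT='NDB_TABLE=FULLY_REPLICATED=1';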

    Since the MySQL Server will have to handle more than 300,000 rows in this query,
    this will be the main bottleneck for parallelism. This means that the query will have a
    parallelism of about 5. This is also the speedup we see compared to a single-threaded
    storage engine for this query. This bottleneck will be about the same even with
    larger clusters.

    Next Part

    I will take a break in this sequence of blogs for now and come back later with a
    description of some more involved queries and how NDB handles pushing down
    subqueries and parts of join queries.

    TSstudio 0.1.5 on CRAN

    $
    0
    0

    Feed: R-bloggers.
    Author: Rami Krispin.

    [This article was first published on Rami Krispin, and kindly contributed to R-bloggers.]

    A new version (0.1.5) of the TSstudio package was pushed to CRAN last month. The release includes new functions as well as bug fixes, and an update of the package license (modified from GPL-3 to MIT).

    New features

    • train_model – a flexible framework for training, testing, evaluating, and forecasting models. This function provides the ability to run multiple models with backtesting or single training/testing partitions. It will replace the ts_backtesting function, which will be deprecated in the next release.

    • plot_model – animates the performance of the train_model output on the backtesting partitions
    • plot_error – plots the error distribution of the train_model output
    • ts_cor – for ACF and PACF plots with seasonal lags; this function will replace the ts_acf and ts_pacf functions, which will be deprecated in the next release
    • arima_diag – a diagnostic plot for identifying the AR, MA, and differencing components of an ARIMA model

    Bug fixes

    • ts_seasonal – aligning the box plot color
    • ts_plot – setting the dash and marker mode for multiple time series




    SQL Server features implemented differently in MariaDB

    $
    0
    0

    Feed: MariaDB Knowledge Base Article Feed.
    Author: .

    Modern DBMSs implement several advanced features. While an SQL standard exists, the complete feature list is different for every database system. Sometimes different features can achieve the same purpose, but with different logic and different limitations. This is something to take into account when planning a migration.

    Some features are implemented by different DBMSs, with a similar logic and similar syntax. But there could be important differences that users should be aware of.

    This page has a list of SQL Server features that MariaDB implements in a different way, and SQL Server features for which MariaDB has an alternative feature. Minor differences are not taken into account here. The list is not exhaustive.

    • The list of supported data types is different.
    • There are relevant differences in transaction isolation levels.
    • JSON support is different.
    • Temporary tables are implemented and used differently.
    • The list of permissions is different.
    • Security policies. MariaDB allows you to achieve the same results by assigning permissions on views and stored procedures. However, this is not common practice and it’s more complicated than defining security policies.
    • Clustered indexes. In MariaDB, the physical order of rows is delegated to the storage engine. InnoDB uses the primary key as a clustered index.
    • Hash indexes. Only some storage engines support HASH indexes.
      • InnoDB has a feature called adaptive hash index, enabled by default. It means that in InnoDB all indexes are created as BTREE, and depending on how they are used, InnoDB could convert them from BTREE to hash indexes, or the other way around. This happens in the background.
      • MEMORY uses hash indexes by default, if we don’t specify the BTREE keyword.
      • See Storage Engine Index Types for more information.
    • Query store. MariaDB allows query performance analysis using the slow log and performance_schema. Some open source or commercial third-party tools read that information to produce statistics and make it easy to identify slow queries.
    • Temporal tables use a different (more standard) syntax on MariaDB. In MariaDB, the history is stored in the same table as the current data (but optionally in different partitions). MariaDB supports both SYSTEM_TIME and APPLICATION_TIME (a small syntax sketch follows this list).
    • Linked servers. MariaDB supports storage engines to read from, and write to, remote tables. When using the CONNECT engine, those tables could be in different DBMSs, including SQL Server.
    • NOT FOR REPLICATION
      • MariaDB supports replication filters to exclude some tables or databases from replication
      • It is possible to keep a table empty in a slave (or in the master) by using the BLACKHOLE storage engine.
      • The master can have columns that are not present in a slave (the other way around is also supported). Before using this feature, read carefully the Replication When the Master and Slave Have Different Table Definitions page.
      • With MariaDB it’s possible to prevent a trigger from running on slaves
      • It’s possible to run events without replicating them. The same applies to some administrative statements.
      • MariaDB superusers can run statements without replicating them, by using the sql_log_bin system variable.
      • Constraints and triggers cannot be disabled for replication, but it is possible to drop them on the slaves.
      • The CREATE TABLE ... IF NOT EXISTS syntax makes it easy to create a table on the master that already exists (possibly in a different version) on a slave.
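    As a small, hypothetical sketch of the temporal-table syntax mentioned in the list above (table, column, and partition names are illustrative):

    CREATE TABLE accounts (
        id      INT PRIMARY KEY,
        balance DECIMAL(12,2)
    ) WITH SYSTEM VERSIONING
      PARTITION BY SYSTEM_TIME (
          PARTITION p_hist HISTORY,
          PARTITION p_cur  CURRENT
      );

    -- Query the table as it was at a given point in time.
    SELECT * FROM accounts FOR SYSTEM_TIME AS OF TIMESTAMP '2019-12-01 00:00:00';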

    Content reproduced on this site is the property of its respective owners,
    and this content is not reviewed in advance by MariaDB. The views, information and opinions
    expressed by this content do not necessarily represent those of MariaDB or any other party.

    ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1

    $
    0
    0

    Feed: AWS Big Data Blog.

    Part 1 of this multi-post series discusses design best practices for building scalable ETL (extract, transform, load) and ELT (extract, load, transform) data processing pipelines using both primary and short-lived Amazon Redshift clusters. You also learn about related use cases for some key Amazon Redshift features such as Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

    Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows a step-by-step walkthrough to get started using Amazon Redshift for your ETL and ELT use cases.

    ETL and ELT

    There are two common design patterns when moving data from source systems to a data warehouse. The primary difference between the two patterns is the point in the data-processing pipeline at which transformations happen. This also determines the set of tools used to ingest and transform the data, along with the underlying data structures, queries, and optimization engines used to analyze the data. The first pattern is ETL, which transforms the data before it is loaded into the data warehouse. The second pattern is ELT, which loads the data into the data warehouse and uses the familiar SQL semantics and power of the Massively Parallel Processing (MPP) architecture to perform the transformations within the data warehouse.

    In the following diagram, the first represents ETL, in which data transformation is performed outside of the data warehouse with tools such as Apache Spark or Apache Hive on Amazon EMR or AWS Glue. This pattern allows you to select your preferred tools for data transformations. The second diagram is ELT, in which the data transformation engine is built into the data warehouse for relational and SQL workloads. This pattern is powerful because it uses the highly optimized and scalable data storage and compute power of MPP architecture.

    Redshift Spectrum

    Amazon Redshift is a fully managed data warehouse service on AWS. It uses a distributed, MPP, and shared nothing architecture. Redshift Spectrum is a native feature of Amazon Redshift that enables you to run the familiar SQL of Amazon Redshift with the BI application and SQL client tools you currently use against all your data stored in open file formats in your data lake (Amazon S3).

    A common pattern you may follow is to run queries that span both the frequently accessed hot data stored locally in Amazon Redshift and the warm or cold data stored cost-effectively in Amazon S3, using views with no schema binding for external tables. This enables you to independently scale your compute resources and storage across your cluster and S3 for various use cases.
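    For instance, a late-binding view can combine hot data in a local Amazon Redshift table with colder data exposed through a Redshift Spectrum external schema. The schema and table names here are placeholders, not part of the original post.

    -- Hypothetical late-binding view spanning hot (local) and cold (S3) data.
    CREATE VIEW analytics.all_orders AS
        SELECT order_id, order_date, amount FROM local_schema.recent_orders
        UNION ALL
        SELECT order_id, order_date, amount FROM spectrum_schema.archived_orders
    WITH NO SCHEMA BINDING;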

    Redshift Spectrum supports a variety of structured and unstructured file formats such as Apache Parquet, Avro, CSV, ORC, JSON to name a few. Because the data stored in S3 is in open file formats, the same data can serve as your single source of truth and other services such as Amazon Athena, Amazon EMR, and Amazon SageMaker can access it directly from your S3 data lake.

    For more information, see Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

    Concurrency Scaling

    Using Concurrency Scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency Scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down Concurrency Scaling resources to save you cost.

    The following diagram shows how the Concurrency Scaling works at a high-level:

    For more information, see New – Concurrency Scaling for Amazon Redshift – Peak Performance at All Times.

    Data lake export

    Amazon Redshift now supports unloading the result of a query to your data lake on S3 in Apache Parquet, an efficient open columnar storage format for analytics. The Parquet format is up to two times faster to unload and consumes up to six times less storage in S3, compared to text formats. You can also specify one or more partition columns, so that unloaded data is automatically partitioned into folders in your S3 bucket to improve query performance and lower the cost for downstream consumption of the unloaded data. For example, you can choose to unload your marketing data and partition it by year, month, and day columns. This enables your queries to take advantage of partition pruning and skip scanning of non-relevant partitions when filtered by the partitioned columns, thereby improving query performance and lowering cost. For more information, see UNLOAD.
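    A sketch of such an unload might look like the following; the table, S3 path, and IAM role ARN are placeholders.

    -- Hypothetical unload of marketing data as Parquet, partitioned by year/month/day.
    UNLOAD ('SELECT event_id, revenue, year, month, day FROM marketing.events')
    TO 's3://my-data-lake/marketing/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS PARQUET
    PARTITION BY (year, month, day);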

    Use cases

    You may be using Amazon Redshift either partially or fully as part of your data management and data integration needs. You likely transitioned from an ETL to an ELT approach with the advent of MPP databases due to your workload being primarily relational, familiar SQL syntax, and the massive scalability of MPP architecture.

    This section presents common use cases for ELT and ETL for designing data processing pipelines using Amazon Redshift.

    ELT

    Consider a batch data processing workload that requires standard SQL joins and aggregations on a modest amount of relational and structured data. You initially selected a Hadoop-based solution to accomplish your SQL needs. However, over time, as data continued to grow, the system didn’t scale well. You now find it difficult to meet your required performance SLA goals and face ever-increasing hardware and maintenance costs. Relational MPP databases bring an advantage in terms of performance and cost, and lower the technical barriers to processing data by using familiar SQL.

    Amazon Redshift has significant benefits based on its massively scalable and fully managed compute underneath to process structured and semi-structured data directly from your data lake in S3.

    The following diagram shows how Redshift Spectrum allows you to simplify and accelerate your data processing pipeline from a four-step to a one-step process with the CTAS (Create Table As) command.

    The preceding architecture enables seamless interoperability between your Amazon Redshift data warehouse solution and your existing data lake solution on S3 hosting other Enterprise datasets such as ERP, finance, and third-party for a variety of data integration use cases.
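    As a rough sketch of the one-step CTAS pattern (the schema and table names are hypothetical), a curated table can be created directly from an external Spectrum table over the S3 data lake:

    -- Hypothetical one-step ELT with CTAS over a Spectrum external table.
    CREATE TABLE curated.daily_sales AS
    SELECT sale_date, region, SUM(amount) AS total_amount
    FROM spectrum_schema.raw_sales        -- external table over S3
    WHERE sale_date >= '2019-01-01'
    GROUP BY sale_date, region;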

    The following diagram shows the seamless interoperability between your Amazon Redshift and your data lake on S3:

    When you use an ELT pattern, you can also use your existing ELT-optimized SQL workload while migrating from your on-premises data warehouse to Amazon Redshift. This eliminates the need to rewrite relational and complex SQL workloads into a new compute framework from scratch. With Amazon Redshift, you can load, transform, and enrich your data efficiently using familiar SQL with advanced and robust SQL support, simplicity, and seamless integration with your existing SQL tools. You also need the monitoring capabilities provided by Amazon Redshift for your clusters.

    ETL

    You have a requirement to unload a subset of the data from Amazon Redshift back to your data lake (S3) in an open and analytics-optimized columnar file format (Parquet). You then want to query the unloaded datasets from the data lake using Redshift Spectrum and other AWS services such as Athena for ad hoc and on-demand analysis, AWS Glue and Amazon EMR for ETL, and Amazon SageMaker for machine learning.

    You have a requirement to share a single version of a set of curated metrics (computed in Amazon Redshift) across multiple business processes from the data lake. You can use ELT in Amazon Redshift to compute these metrics and then use the unload operation with optimized file format and partitioning to unload the computed metrics in the data lake.

    You also have a requirement to pre-aggregate a set of commonly requested metrics from your end-users on a large dataset stored in the data lake (S3) cold storage using familiar SQL and unload the aggregated metrics in your data lake for downstream consumption. In other words, consider a batch workload that requires standard SQL joins and aggregations on a fairly large volume of relational and structured cold data stored in S3 for a short duration of time. You can use the power of Redshift Spectrum by spinning up one or many short-lived Amazon Redshift clusters that can perform the required SQL transformations on the data stored in S3, unload the transformed results back to S3 in an optimized file format, and terminate the unneeded Amazon Redshift clusters at the end of the processing. This way, you only pay for the duration in which your Amazon Redshift clusters serve your workloads.

    As shown in the following diagram, once the transformed results are unloaded in S3, you then query the unloaded data from your data lake either using Redshift Spectrum if you have an existing Amazon Redshift cluster, Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets (such as ERP, finance, and third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

    You can also scale the unloading operation by using the Concurrency Scaling feature of Amazon Redshift. This provides a scalable and serverless option to bulk export data in an open and analytics-optimized file format using familiar SQL.

    Best practices

    The following recommended practices can help you to optimize your ELT and ETL workload using Amazon Redshift.

    Analyze requirements to decide ELT versus ETL

    The MPP architecture of Amazon Redshift and its Spectrum feature are efficient and designed for high-volume relational and SQL-based ELT workloads (joins, aggregations) at massive scale. A common practice when designing an efficient ELT solution using Amazon Redshift is to spend sufficient time analyzing the following:

    • Type of data from source systems (structured, semi-structured, and unstructured)
    • Nature of the transformations required (usually encompassing cleansing, enrichment, harmonization, transformations, and aggregations)
    • Row-by-row, cursor-based processing needs versus batch SQL
    • Performance SLA and scalability requirements considering the data volume growth over time
    • Cost of the solution

    This helps to assess if the workload is relational and suitable for SQL at MPP scale.

    Key considerations for ELT

    For both ETL and ELT, it is important to build a good physical data model for better performance for all tables, including staging tables with proper data types and distribution methods. A dimensional data model (star schema) with fewer joins works best for MPP architecture including ELT-based SQL workloads. Consider using a TEMPORARY table for intermediate staging tables as feasible for the ELT process for better write performance, because temporary tables only write a single copy.

    A common rule of thumb for ELT workloads is to avoid row-by-row, cursor-based processing (a commonly overlooked finding for stored procedures). This is sub-optimal because such processing needs to happen on the leader node of an MPP database like Amazon Redshift. Instead, the recommendation for such a workload is to look for an alternative distributed processing programming framework, such as Apache Spark.

    Several hundreds to thousands of single record inserts, updates, and deletes for highly transactional needs are not efficient using MPP architecture. Instead, stage those records for either a bulk UPDATE or DELETE/INSERT on the table as a batch operation.
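    As a rough sketch of this batch pattern (table names, the S3 path, and the IAM role are placeholders), staged records can be merged in bulk rather than row by row:

    -- Hypothetical staging table plus batch delete/insert instead of row-by-row updates.
    CREATE TEMPORARY TABLE stage_orders (LIKE analytics.orders);

    COPY stage_orders
    FROM 's3://my-data-lake/incoming/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;

    BEGIN;
    DELETE FROM analytics.orders
    USING stage_orders
    WHERE analytics.orders.order_id = stage_orders.order_id;
    INSERT INTO analytics.orders SELECT * FROM stage_orders;
    COMMIT;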

    With the external table capability of Redshift Spectrum, you can optimize your transformation logic using a single SQL as opposed to loading data first in Amazon Redshift local storage for staging tables and then doing the transformations on those staging tables.

    Key considerations for data lake export

    When you unload data from Amazon Redshift to your data lake in S3, pay attention to data skew or processing skew in your Amazon Redshift tables. The UNLOAD command uses the parallelism of the slices in your cluster. Hence, if there is a data skew at rest or processing skew at runtime, unloaded files on S3 may have different file sizes, which impacts your UNLOAD command response time and query response time downstream for the unloaded data in your data lake.

    You should also cap the maximum file size at approximately 100 MB or less in the UNLOAD command for better performance for downstream consumption. Similarly, for S3 partitioning, a rule of thumb is to keep the number of partitions per table on S3 to a couple of hundred by choosing low-cardinality partitioning columns (year, quarter, month, and day are good choices) in the UNLOAD command. This avoids creating too many partitions, which in turn creates a large volume of metadata in the AWS Glue catalog, leading to high query times via Athena and Redshift Spectrum.

    To get the best throughput and performance under concurrency for multiple UNLOAD commands running in parallel, create a separate queue for unload queries with Concurrency Scaling turned on. This lets Amazon Redshift burst additional Concurrency Scaling clusters as required.

    Key considerations for Redshift Spectrum for ELT

    To get the best performance from Redshift Spectrum, pay attention to the maximum pushdown operations possible, such as S3 scan, projection, filtering, and aggregation, in your query plans for a performance boost. This is because you want to utilize the powerful infrastructure underneath that supports Redshift Spectrum. Using predicate pushdown also avoids consuming resources in the Amazon Redshift cluster.

    In addition, avoid complex operations like DISTINCT or ORDER BY on more than one column and replace them with GROUP BY as applicable. Amazon Redshift can push down a single column DISTINCT as a GROUP BY to the Spectrum compute layer with a query rewrite capability underneath, whereas multi-column DISTINCT or ORDER BY operations need to happen inside Amazon Redshift cluster.

    The Amazon Redshift optimizer can use external table statistics to generate better execution plans. Without statistics, an execution plan is generated based on heuristics, with the assumption that the S3 table is relatively large. It is recommended to set the table statistics (numRows) manually for S3 external tables.
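    A minimal sketch of setting this property on an external table (the schema, table, and row count are placeholders):

    -- Hypothetical example of setting numRows for a Spectrum external table.
    ALTER TABLE spectrum_schema.sales_external
    SET TABLE PROPERTIES ('numRows' = '170000');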

    For more information on Amazon Redshift Spectrum best practices, see Twelve Best Practices for Amazon Redshift Spectrum and How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

    Summary

    This post discussed the common use cases and design best practices for building ELT and ETL data processing pipelines for data lake architecture using a few key features of Amazon Redshift: Spectrum, Concurrency Scaling, and the recently released support for data lake export with partitioning.

    Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows you how to get started with a step-by-step walkthrough of a few simple examples using AWS sample datasets.

    As always, AWS welcomes feedback. Please submit thoughts or questions in the comments.


    About the Authors

    Asim Kumar Sasmal is a senior data architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on the AWS platform. He is passionate about working backwards from the customer ask, helping them think big, and diving deep to solve real business problems by leveraging the power of the AWS platform.

    Maor Kleider is a principal product manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

    Managing F5 BIG-IP Network Devices with Puppet

    $
    0
    0

    Feed: Puppet.com Blog RSS Feed.
    Author:
    ;

    Management of network devices is one of the exciting new features in Puppet Enterprise 2.0 and Puppet 2.7. In the initial release, support is limited to Cisco devices, but because Puppet is extensible via modules, we are able to build upon the existing framework and add support for F5 BIG-IP. Like most network appliances, installation of third party software is prohibited, which eliminates the ability to run an agent. Instead, Puppet takes advantage of F5 iControl API to interact and manage the device. F5 BIG-IP network appliances are capable of load balancing, SSL offloading, application monitoring, as well as many other advanced features, and now Puppet can manage these functionalities. Compared to traditional management methodologies and other third party tools that interact with F5, Puppet not only bridges the gap from deploying applications to bringing the service online to your customers, it also brings the unique benefit of the Puppet resource model to network devices. More specifically, the integration offers the ability to compare if a running configuration matches the desired configuration, and then enforce the changes once they’ve been reviewed.

    In this blog post, we will step through the process of installing the F5 module, configuring connectivity, and writing a simple manifest to manage an F5 device with Puppet. If you are unfamiliar with BIG-IP devices, you may want to consult devcentral.f5.com for more information on F5 features such as iRules and iControl API.

    In the following output, Puppet detects that the F5 device has the wrong iRule. We are running in simulation mode (--noop option) so Puppet only shows the changes that would be applied if we were enforcing the configuration.

    $ puppet device --noop
    ...
    notice: /Stage[main]//F5_rule[redirect_404]/definition: current_value: when HTTP_RESPONSE {
    if { [HTTP::status] eq "404" } {
    redirect to "http://www.puppetlabs.com/404/"
    }
    }, should be when HTTP_RESPONSE {
    if { [HTTP::status] eq "404" } {
    redirect to "http://www.puppetlabs.com/redirect/404/"
    }
    } (noop)
    notice: Finished catalog run in 5.69 seconds

    Now, running Puppet without the --noop option set, the iRules are changed to match our resource declaration.

    $ puppet device
    …
    notice: /Stage[main]//F5_rule[redirect_404]/definition: definition changed when HTTP_RESPONSE {
    if { [HTTP::status] eq "404" } {
    redirect to "http://www.puppetlabs.com/404/"
    }
    }, to when HTTP_RESPONSE {
    if { [HTTP::status] eq "404" } {
    redirect to "http://www.puppetlabs.com/redirect/404/"
    }
    }
    notice: Finished catalog run in 5.74 seconds

    In addition to iRules, the initial release supports key features to configure the device certificate, manage applications pool/poolmember/virtualserver, and monitor application health. The module is published on the Puppet Forge, along with comprehensive documentation of supported F5 resources. The latest development release is available on GitHub. The ability to extend Puppet is not limited to network devices, and the commands shown later in this post to install the module is applicable for other Puppet modules available on the Puppet Forge and GitHub.

    The puppet device command is a new application mode in Puppet intended to manage devices that can’t install Ruby/Puppet and run puppet agent. If you haven’t checked it out yet, I would review Brice’s introduction to network devices first. Here is a high-level overview of the entire communication process:

    Devices are managed through an intermediate proxy system where Puppet agent is installed. The proxy system stores a certificate on behalf of the device. In the case of F5, it should have the iControl gem installed, as well as the account information in device.conf needed to communicate with the device. The proxy connects to the Puppet master to retrieve the catalog on behalf of the F5 and applies changes as necessary.

    F5 module installation

    Before we get started with the installation process, there are two ways to install the F5 module. The first is via the puppet-module tool, which retrieves it from forge.puppetlabs.com (for stable releases of the module). The latest development release is accessible via git from GitHub. The instructions below are specific to Puppet Enterprise, but open source users can also install the module with some changes to the Puppet module path. Puppet Enterprise currently ships with the puppet-module gem, which is also freely available on rubygems.org. Eventually this will become a Puppet Face and turn into the command ‘puppet module’ (expected in a later release of 2.7). Onwards to the install process:

    # Puppet Enterprise:
    cd /etc/puppetlabs/puppet/modules
    puppet-module install puppetlabs-f5

    This should create a directory called f5. Older versions of the puppet-module tool might create a puppetlabs-f5 directory in the modules directory; in that case, rename it to f5.

    Installing from GitHub:

    # Puppet Enterprise:
    cd /etc/puppetlabs/puppet/modules
    git clone git@github.com:puppetlabs/puppetlabs-f5.git
    ln -s puppetlabs-f5 f5

    In Puppet 2.7, the transport between the proxy agent and the device supports telnet/ssh; however, neither is suitable for F5 devices. Instead, we rely on F5’s iControl API. The iControl gem should be installed on both the master and the proxy system. This gem is available in the F5 module files directory.

    # Puppet Enterprise:
    /opt/puppet/bin/gem install /etc/puppetlabs/puppet/modules/f5/files/f5-icontrol-10.2.0.2.gem

    Configuration and management

    At this point we have the module installed, so we should configure connectivity for the device. The configuration for network devices is stored by default in /etc/puppet/device.conf:

    [f5.puppetlabs.lan]
    type f5
    url https://username:password@f5.puppetlabs.lan/partition
    [f5.dev.puppetlabs.lan]
    type f5
    url https://username:password@f5.dev.puppetlabs.lan/partition

    You can also break down each device into its own configuration file, such as /etc/puppet/f5_device1.conf, /etc/puppet/f5_device2.conf …, which is especially helpful if you wish to run against each device separately. In the square brackets is the device certificate name, and the certificate management process is the same as for puppet agent certs. The device type is f5, and the url is https instead of telnet/ssh. Because F5 supports different partitions we can optionally specify one at the end; it will default to the ‘Common’ partition if it’s not provided. In the module, the f5::config defined resource type simplifies management of this configuration file on the proxy system:

    f5::config { 'f5.puppetlabs.lan':
        username => 'admin',
        password => 'password',
        url      => 'f5.puppetlabs.lan',
        target   => '/etc/puppetlabs/puppet/device/f5.puppetlabs.lan.conf',
    }

    Once this configuration file is in place, we can initiate a puppet device run on the proxy server.

    # execute on proxy server
    $ puppet device --deviceconf /etc/puppetlabs/puppet/device/f5.puppetlabs.lan.conf

    This should generate a certificate request on the master which should be signed:

    # execute on puppet master
    $ puppet cert -l
    f5.puppetlabs.lan (2A:0C:A0:F8:C6:EE:EF:9B:B3:49:74:D1:27:31:1B:60)
    $ puppet cert -s f5.puppetlabs.lan

    At this point the master should have a node name f5.puppetlabs.lan in site.pp with the appropriate f5 resources:

    node f5.puppetlabs.lan {
      f5_rule { 'redirect_404':
        ensure     => 'present',
        definition => 'when HTTP_RESPONSE {
    if { [HTTP::status] eq "404" } {
    redirect to "http://www.puppetlabs.com/redirect/404"
    }
    }',
      }
    
      f5_pool { 'webapp':
        ensure                          => 'present',
        action_on_service_down          => 'SERVICE_DOWN_ACTION_NONE',
        allow_nat_state                 => 'STATE_ENABLED',
        allow_snat_state                => 'STATE_ENABLED',
        lb_method                       => 'LB_METHOD_ROUND_ROBIN',
        member                          => {
          '10.10.0.1:80' => {'connection_limit' => '0',
                             'dynamic_ratio'    => '1',
                             'priority'         => '0',
                             'ratio'            => '1'},
          '10.10.0.2:80' => {'connection_limit' => '0', 
                             'dynamic_ratio'    => '1', 
                             'priority'         => '0',
                             'ratio'            => '1'},
          '10.10.0.3:80' => {'connection_limit' => '0',
                             'dynamic_ratio'    => '1',
                             'priority'         => '0',
                             'ratio'            => '1'}
        },
        minimum_active_member           => '1',
        minimum_up_member               => '0',
      }
    }

    When the puppet device command is executed again this will update the iRule and ensure the appropriate members are in the webapp pool:

    $ puppet device --deviceconf /etc/puppetlabs/puppet/device/f5.puppetlabs.lan.conf

    A limitation to watch out for in the current Puppet release is that ‘puppet apply’ and ‘puppet resource’ cannot modify network resources. However, we implemented a feature to allow puppet resource to query an F5 device. (For authors of types/providers: making changes to resources isn’t supported until apply_to_device in resource types is handled differently by the puppet apply/resource commands.) For now we use url facts to establish connectivity to specific F5 devices:

    export RUBYLIB=/etc/puppetlabs/puppet/modules/f5/lib/
    export FACTER_url=https://admin:password@f5.puppetlabs.lan/Common
    puppet resource f5_rule
    f5_rule { '_sys_https_redirect':
      ensure     => 'present',
      definition => '    when HTTP_REQUEST {
    set host [HTTP::host]
    HTTP::respond 302 Location "https://$host/"
    }',
    }
    f5_rule { '_sys_auth_ssl_cc_ldap':
      ensure     => 'present',
      definition => '    when CLIENT_ACCEPTED {
    set tmm_auth_ssl_cc_ldap_sid 0
    set tmm_auth_ssl_cc_ldap_done 0
    }
    when CLIENTSSL_CLIENTCERT {
    …

    If you don’t have Puppet Enterprise 2.0 in your environment yet, you can either download the Learning Puppet VM, or Puppet Enterprise installation packages. F5 also provides F5 LTM Virtual Edition (VE) for trial on VMWare. Please report any issues or bugs to https://tickets.puppetlabs.com/browse/MODULES under the modules section.

    Additional Resources

    Webinar: Time Series Data Capture & Analysis in MemSQL 7.0

    $
    0
    0

    Feed: MemSQL Blog.
    Author: Floyd Smith.

    With the MemSQL 7.0 release, MemSQL has added more special-purpose features, making it even easier to manage time series data within our best-of-breed operational database. These new features allow you to structure queries on time series data with far fewer lines of code and with less complexity. With time series features in MemSQL, we make it easier for any SQL user, or any tool that uses SQL, to work with time series data, while making expert users even more productive. In a recent webinar (view the recording here), Eric Hanson described the new features and how to use them.

    The webinar begins with an overview of MemSQL, then describes how customers have been using MemSQL for time series data for years, prior to the MemSQL 7.0 release. Then there’s a description of the time series features that MemSQL has added, making it easier to query and manage time series data, and a Q&A section at the end.

    Introducing MemSQL

    MemSQL is a very high-performance scalable SQL relational database system. It’s really good for scalable operations, both for transaction processing and analytics on tabular data. Typically, it can be as much as 10 times faster, and three times more cost-effective, than legacy database providers for large volumes under high concurrency.

    We like to call MemSQL the No-Limits Database because of its amazing scalability. It’s the cloud-native operational database that’s built for speed and scale. We have capabilities to support operational analytics. So, operational analytics is when you have to deliver very high analytical performance in an operational database environment where you may have concurrent updates and queries running at an intensive, demanding level. Some people like to say that it’s when you need “Analytics with an SLA.”

    Now, I know that everybody thinks they have an SLA when they have an analytical database, but when you have a really demanding SLA like requiring interactive, very consistent response time in an analytical database environment, under fast ingest, and with high concurrency, that’s when MemSQL really shines.

    We also support predictive ML and AI capabilities. For example, we’ve got some built-in functions for vector similarity matching. Some of our customers are using MemSQL in a deep learning environment to do things like face and image matching, and customers are prototyping applications based on deep learning, like fuzzy text matching. The built-in dot product and Euclidean distance functions we have can help you make those applications run with very high performance. (Nonprofit Thorn is one organization that uses these ML and AI-related capabilities at the core of their app, Spotlight, which helps law enforcement identify trafficked children. – Ed.)
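    As a rough sketch (not taken from the webinar; the faces table, embedding column, and query vector are hypothetical), such a similarity search can be expressed in SQL with the built-in functions mentioned above:

    -- Find the 10 stored vectors most similar to a query embedding.
    SELECT id,
           DOT_PRODUCT(embedding, JSON_ARRAY_PACK('[0.12, 0.87, 0.45, 0.33]')) AS score
    FROM faces
    ORDER BY score DESC
    LIMIT 10;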

    Also, people are using MemSQL when they need to move to cloud or replace legacy relational database systems. When they reach some sort of inflection point, like they know they need to move to cloud, they want to take advantage of the scalability of the cloud, they want to consider a truly scalable product, and so they’ll look at MemSQL. Also, when it comes time to re-architect the legacy application – if, say, the scale of data has grown tremendously, or is expected to change in the near future, people really may decide they need to find a more scalable and economical platform for their relational data, and that may prompt them to move to MemSQL.

    Here are examples of the kinds of workloads and customers we support: half of the top 10 banks in North America, two of the top three telecommunications companies in North America, over 160 million streaming media users, 12 of the Fortune 50 largest companies in the United States, and technology leaders from Akamai to Uber.

    If you want to think about MemSQL and how it’s different from other database products, you can think of it as a very modern, high-performance, scalable SQL relational database. We have all three: speed, scale, and SQL. We get our speed because we compile queries to machine code. We also have in-memory data structures for operational applications, an in-memory rowstore structure, and a disk-based columnstore structure.

    MemSQL is the No-Limits Database

    We compile queries to machine code and we use vectorized query execution on our columnar data structure. That gives us tremendous speed on a per-core basis. We’re also extremely scalable. We’re built for the cloud. MemSQL is a cloud-native platform that can gang together multiple computers to handle the work for a single database, in a very elegant and high-performance fashion. There’s no real practical limit to scale when using MemSQL.

    Finally, we support SQL. There are some very scalable database products out there in the NoSQL world that are fast for certain operations, like put and get-type operations that can scale. But if you try to use these for sophisticated query processing, you end up having to host a lot of the query processing logic in the application, even to do simple things like joins. It can make your application large and complex and brittle – hard to evolve.

    So SQL, the relational data model, was invented by EF Codd (PDF) – back around 1970 – for a reason. To separate your query logic from the physical data structures in your database, and to provide a non-procedural query language that makes it easier to find the data that you want from your data set. The benefits that were put forth when the relational model was invented are still true today.

    We’re firmly committed to relational database processing and non-procedural query languages with SQL. There’s tremendous benefits to that, and you can have the best of both. You can have speed, and you can have scale, along with SQL. That’s what we provide.

    How does MemSQL fit into the rest of your data management environment? MemSQL provides tremendous support for analytics, application systems like dashboards, ad-hoc queries, and machine learning. Also other types of applications like real-time decision-making apps, Internet of Things apps, dynamic user experiences. The kind of database technology that was available before couldn’t provide the real-time analytics that are necessary to give the truly dynamic user experience people are looking for today; we can provide that.

    MemSQL architectural chart CDC and data types

    We also provide tremendous capabilities for fast ingest and change data capture (CDC). We have the ability to stream data into MemSQL from multiple sources like file systems and Kafka. We have a feature called Pipelines, which is very popular, to automatically load data from file folders, AWS S3, and Kafka. You can transform data as it's flowing into MemSQL, with very little coding. We also support a very high-performance and scalable bulk load system.
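
    As a rough illustration of the Pipelines idea (the table name, pipeline name, and Kafka broker/topic here are all hypothetical, and the real options depend on your setup), creating and starting a Kafka pipeline might look something like this:

        -- Stream events from a Kafka topic directly into a table
        CREATE PIPELINE trade_events_pipeline AS
            LOAD DATA KAFKA 'kafka-broker.example.com/trade-events'
            INTO TABLE trade_events;

        -- Begin consuming from the topic in the background
        START PIPELINE trade_events_pipeline;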

    We have support for a large variety of data types, including relational data, standard structured data types, key-value, JSON, geospatial, time-oriented data, and more. We run everywhere: you can run MemSQL on-premises, you can run it in the cloud as a managed database platform, or you can use it as a service in our new Helios system, which was just delivered in September.

    We also allow people to self-host in the cloud. If they want full control over how their system is managed, they can self-host on all the major cloud providers and also run in containers; so, wherever you need to run, we are available.

    I mentioned scalability earlier, and I want to drill into that a little bit to illustrate how our platform is organized. MemSQL presents itself to the database client application as just a database: you have a connection string, you connect, you set your connection to use a database, and you can start submitting SQL statements. It's a single system image. The application doesn't really know that MemSQL is distributed, but underneath the sheets, it's organized as you see in this diagram.

    MemSQL node and leaf architecture

    There are one or more aggregator nodes, which are front-end nodes that the client application connects to. Then, there can be multiple back-end nodes. We call them leaf nodes. The data is horizontally partitioned across the leaf nodes – some people call this sharding. Each leaf node has one or more partitions of data. Those partitions are defined based on some data definition language (DDL); when you create your table, you define how to shard the data across nodes.

    MemSQL’s query processor knows how to take a SQL statement and divide it up into smaller units of work across the leaf nodes, and final assembly of the results is done by the aggregator node. Then, the results are sent back to the client. As you need to scale, you can add additional leaf nodes and rebalance your data, so it's easy to scale the system up and down as needed.
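
    To make the sharding idea concrete, here is a minimal sketch of a table with an explicit shard key (the table and column names are made up for illustration); rows are distributed across the leaf-node partitions by hashing the shard key columns:

        CREATE TABLE page_views (
            user_id    BIGINT NOT NULL,
            viewed_at  DATETIME(6) NOT NULL,
            url        VARCHAR(2048),
            SHARD KEY (user_id)   -- rows are hash-distributed across leaf partitions by user_id
        );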

    How Customers Have Used MemSQL for Time Series Data

    So with that background on MemSQL, let's talk about using MemSQL for time series data. First of all, for those of you who are not really familiar with time series, a time series is simply a time-ordered sequence of events of some kind. Typically, each time series entry has at least a time value and some sort of data value taken at that time. Here's an example time series of the price of a stock over a period of roughly an hour and a half.

    MemSQL time series stock prices

    You can see that the data moves up and down as you advance in time. Typically, data at any point in time is closely correlated to the immediately previous point in time. Here’s another example, of flow rate. People are using MemSQL for energy production, for example, in utilities. They may be storing and managing data representing flow rates. Here’s another example, a long-term time series of some health-oriented data from the US government, from the Centers for Disease Control, about chronic kidney disease over time.

    These are just three examples of time series data. Virtually every application that’s collecting business events of any kind has a time element to it. In some sense, almost all applications have a time series aspect to them.

    Let’s talk about time series database use cases. It’s necessary, when you’re managing time-oriented data, to store new time series events or entries, to retrieve the data, to modify time series data – to delete or append or truncate the data, or in some cases, you may even update the data to correct an error. Or you may be doing some sort of updating operation where you are, say, accumulating data for a minute or so. Then, once the data has sort of solidified or been finalized, you will no longer update it. There are many different modification scenarios for time series data.

    Another common operation on time series data is to do things like convert an irregular time series to a regular time series. For example, data may arrive with a random sort of arrival process, and the spacing between events may not be equal, but you may want to convert that to a regular time series. Like maybe data arrives every 1 to 10 seconds, kind of at random. You may want to create a time series which has exactly 1 data point every 15 seconds. That’s an example of converting from an irregular to a regular time series.

    MemSQL time series use cases

    Another kind of operation on time series is downsampling. That means you may have a time series with one tick every second, and you want one tick every minute. Another common operation is smoothing. You may use a simple smoothing technique, like a five-second moving average of a time series, where you average together the previous five seconds' worth of data from the series, or a more complex kind of smoothing, say, where you fit a curve through the data to smooth it, such as a spline curve. There are many, many more kinds of time series use cases.

    A little history about how MemSQL has been used for time series is important to give, for context. Customers already use MemSQL for time series event data extensively, using our previously shipped releases, before the recent shipment of MemSQL 7.0 and its time series-specific features. Lots of our customers store business events with some sort of time element. We have quite a few customers in the financial sector that are storing financial transactions in MemSQL. Of course, each of these has a time element to it, recording when the transaction occurred.

    MemSQL Time series plusses

    Also, lots of our customers have been using us for Internet of Things (IoT) events: for example, in utilities, in energy production, media and communications, and web and application development, such as advertising applications. As I mentioned before, MemSQL is really tremendous for fast and easy streaming. With our Pipelines capability, it's fast and easy to load data, and we have very high-performance insert data manipulation language (DML). You can do millions of inserts per second on a MemSQL cluster.

    We have a columnstore storage mechanism with tremendous compression, typically in the range of 5x to 10x compared to raw data, so it's easy to store a very large volume of historical data in a columnstore table in MemSQL. All of these capabilities (high scalability, high-performance SQL, fast and easy ingest, and high compression with columnar data storage) have made MemSQL a really attractive destination for people who are managing time series data.

    New Time Series Features in MemSQL 7.0

    (For more on what’s in MemSQL 7.0, see our release blog post, our deep dive into resiliency features, and our deep dive into MemSQL SingleStore. We also have a blog post on our time series features. – Ed.)

    Close to half of our customers are using time series in some form, or they look at the data they have as time series. What we wanted to do for the 7.0 release was to make time series querying easier. We looked at some of our customers’ applications, and some internal applications we had built on MemSQL for historical monitoring. We saw that, while the query language is very powerful and capable, it looked like some of the queries could be made much easier.

    MemSQL easy time series queries

    We wanted to provide a very brief syntax to let people write common types of queries, such as downsampling or converting irregular time series to regular time series. We wanted to make that really easy. We wanted to let more typical developers do things they couldn't do before with SQL because it was just too hard, and to let experts do more, and do it faster, so they could spend more time on other parts of their application rather than writing tricky queries to extract information from time series.

    That said, we were not trying to be the ultimate time series specialty package. For example, if you need curve fitting, very complex kinds of smoothing, or the ability to add together two different time series, we're not really trying to make those use cases as easy and fast as they can be. We're providing a conventional ability to manage large volumes of time series data, ingest the time series fast, and handle the typical and common query use cases easily through SQL. That's what we want to provide. If you need some of these specialty capabilities, you probably want to consider a more specialized time series product such as kdb+.

    Throughout the rest of the talk, I’m going to be referring a few times to an example based on candlestick charts. A candlestick chart is a typical kind of chart used in the financial sector to show high, low, open, and close data for a security, during some period of time – like an entire trading day, or by minute, or by hour, et cetera.

    MemSQL time series candlestick chart

    This graphic shows a candlestick chart where the little lines at the top and bottom of each candle show the high and low, respectively, and the box shows the open and close. To start off, I wanted to show a query using MemSQL 6.8 to calculate the information required to render a candlestick chart like the one you see here.

    MemSQL time series old and new code

    On the left side, this is a query that works in MemSQL 6.8 and earlier to produce a candlestick chart from a simple series of financial trade or transaction events. On the right-hand side, that’s how you write the exact same query in MemSQL 7.0. Wow. Look at that. It’s about one third as many characters as you see on the left, and also it’s much less complex.

    On the left, you see you’ve got a common table expression with a nested select statement that’s using window functions, sort of a relatively complex window function, and several aggregate functions. It’s using rank, and then using a trick to pick out the top-ranked value at the bottom. Anyway, that’s a challenging query to write. That’s an expert-level query, and even experts struggle a little bit with that. You might have to refer back to the documentation.

    I’ll go over this again in a little more detail, but just please remember this picture. Look how easy it is to manage time series data to produce a simple candlestick chart on the right compared to what was required previously. How did we enable this? We provide some new time series functions and capabilities in MemSQL 7.0 that allowed us to write that query more easily.

    New MemSQL time series functions

    We provide three new built-in functions: FIRST(), LAST(), and TIME_BUCKET(). FIRST() and LAST() are aggregate functions that provide the first or last value in a time window or group, based on some time value that defines an ordering. I'll say more about those in a few minutes. TIME_BUCKET() is a function that maps a timestamp to a one-minute, five-minute, one-hour, or one-day window, et cetera. It lets you do that in a very easy way, with a very brief syntax that's fairly easy to learn and remember.

    Finally, we’ve added a new designation called the SERIES TIMESTAMP column designation, which allows you to mark one of your columns as the time column for your time series. That allows some shorthand notations that I’ll talk about more.

    Time series timestamp example

    Here’s a very simple example table that holds time series data for financial transactions. We've got a ts column marked as the series timestamp; its data type is datetime(6), which is a standard datetime with six places to the right of the decimal point, so it's accurate down to the microsecond. Symbol is like a stock symbol, a character string of up to five characters. Price is a decimal with up to 18 digits and 4 places to the right of the decimal point. So it's a very simple time series table for financial information.
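
    In DDL form, the table just described would look roughly like this:

        CREATE TABLE tick (
            ts     DATETIME(6) SERIES TIMESTAMP,  -- microsecond precision, marked as the series time column
            symbol VARCHAR(5),                    -- stock symbol, up to five characters
            price  DECIMAL(18, 4)                 -- up to 18 digits, 4 to the right of the decimal point
        );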

    In the examples that follow, I'm going to use this simple data set. We've got two made-up stocks, ABC and XYZ, with some data that arrived on a single day, February 18th of next year, over a period of a few minutes. We'll use that data in the examples coming up.
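
    For instance, a few rows for ABC and XYZ could be loaded like this (the specific timestamps and prices below are made up for illustration):

        INSERT INTO tick VALUES
            ('2020-02-18 10:55:36.179760', 'ABC', 100.00),
            ('2020-02-18 10:57:26.179761', 'ABC', 101.50),
            ('2020-02-18 10:59:16.178763', 'XYZ', 102.50),
            ('2020-02-18 11:00:56.179769', 'XYZ', 102.00);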

    Let’s look in more detail at the old way of querying time series data with MemSQL using window functions. I want to produce, for each symbol and each hour, the high, low, open, and close. This uses a window function that partitions by the symbol and a time bucket, ordered by timestamp, with rows between unbounded preceding and unbounded following. "Unbounded" means that any aggregates we calculate over this window will be computed over the entire window.

    Old code for time series with SQL

    Then, we compute the rank, which is a serial number based on the sort order: 1, 2, 3, 4, 5, where one is first, two is second, and so forth. Then, the minimum and maximum over the window, and the first value and last value over the window. First value and last value are the very first and the very last values in the window, based on the sort order of the window. Then, you see the expression that takes the Unix timestamp of ts, divides it by 60 times 60, and then multiplies it by 60 times 60 again.

    This is a trick that people who manage time series data with SQL have learned: you can divide a timestamp by a window width and then multiply by the window width again, and that will chunk up a fine-grained timestamp into a coarser grain that is aligned on a window boundary. In this case, it's 60 times 60, or one hour. Then, finally, in the select block at the end, you're selecting the timestamp from above, the symbol, min price, max price, first, and last. But the block above produced an entry for every single point in the series, and we really only want one per window, so we pick out the top-ranked one.
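
    To give a feel for the shape of that pre-7.0 approach, here is a rough sketch (not the exact query from the slide) of the common-table-expression-plus-window-function version of the hourly candlestick query against the tick table:

        WITH bucketed AS (
            SELECT symbol, price, ts,
                   -- the divide-then-multiply trick: truncate ts to a one-hour boundary
                   FROM_UNIXTIME(UNIX_TIMESTAMP(ts) DIV (60 * 60) * (60 * 60)) AS ts_hour
            FROM tick
        ),
        ranked AS (
            SELECT ts_hour, symbol,
                   RANK() OVER (PARTITION BY ts_hour, symbol ORDER BY ts) AS r,
                   MIN(price) OVER (PARTITION BY ts_hour, symbol ORDER BY ts
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_price,
                   MAX(price) OVER (PARTITION BY ts_hour, symbol ORDER BY ts
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_price,
                   FIRST_VALUE(price) OVER (PARTITION BY ts_hour, symbol ORDER BY ts
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_price,
                   LAST_VALUE(price) OVER (PARTITION BY ts_hour, symbol ORDER BY ts
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_price
            FROM bucketed
        )
        SELECT ts_hour, symbol, min_price, max_price, first_price, last_price
        FROM ranked
        WHERE r = 1               -- keep only one row per (hour, symbol) window
        ORDER BY ts_hour, symbol;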

    Anyway, this is tricky. This is the kind of thing that will take an expert user several minutes to many minutes to write, with references back to the documentation. Can we do better than this? How can we do better? We introduced FIRST() and LAST() as regular aggregate functions in order to enable this kind of use case with less code. Here's a very basic example: SELECT FIRST(price, ts) FROM tick. The second argument to the FIRST() aggregate is a timestamp, and it's optional.

    If it’s not present, then we infer that you meant to use the series timestamp column of the table that you're querying. The first form is the full notation, but in the second query, you just say SELECT FIRST(price), LAST(price) FROM tick. That query implicitly uses the series timestamp column ts as the time argument, the second argument, to those aggregate functions. It just makes the query easier to write: you don't have to remember to explicitly put the series time value in the right place when you use those functions.
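
    In SQL, the two forms just described look like this:

        -- Full notation: the time column is passed explicitly as the second argument
        SELECT FIRST(price, ts), LAST(price, ts) FROM tick;

        -- Shorthand: the SERIES TIMESTAMP column ts is used implicitly
        SELECT FIRST(price), LAST(price) FROM tick;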

    Next, we have a new function for time bucketing. You don't have to write that tricky divide-and-then-multiply expression that I showed you before; it's much easier to use, and more intuitive. TIME_BUCKET() takes a bucket width, which is a character string like '5m' for five minutes, '1h' for one hour, and so forth, plus two optional arguments: the time and the origin.

    New code with MemSQL Time Series functions

    The time is optional, just like before. If you don't specify it, then we implicitly use the series timestamp column from the table that you're querying. Then, origin allows you to provide an offset. For example, if you want to do time bucketing but start at 8:00 AM every day (you want to bucket by day but start your day at 8 AM instead of midnight), then you can put in an origin argument.

    Again, this is far easier than the tricky math expression that we used for that candlestick query before. Here's an example of using origin with an 8 AM start. We've got this table t, where ts is the series timestamp and v is a value that's a double-precision float. You see the query there in the middle: you select TIME_BUCKET('1d', ts, origin), where the origin you provide is a date near the timestamps that you're working with, at 8 AM.
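
    A sketch of that kind of query follows; the origin value here is just an illustrative date near the data's timestamps:

        -- Bucket by day, but start each day at 8:00 AM instead of midnight
        SELECT TIME_BUCKET('1d', ts, '2020-02-18 08:00:00') AS day_bucket, v
        FROM t
        ORDER BY 1;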

    Then, in the example results, you can see that the day bucket boundaries start at 8 AM. Normally, you're not going to need to use an origin, but if you do need an offset, you can do that. Again, let's look at the new way of writing the candlestick chart query. We say SELECT TIME_BUCKET('1h'), which is a one-hour bucket, then the symbol, the minimum price, the maximum price, the first price, and the last price.

    Notice that in FIRST(), LAST(), and TIME_BUCKET(), we don't even have to refer to the timestamp column in the original data set, because it's implicit. Some of you may have worked with specialty products for managing web events, like Splunk or Azure Kusto, so this concept of using a time bucket function with an easy notation like this may be familiar to you from those kinds of systems.

    One of the reasons people like those products so much for the use cases they're designed for is that it's really easy to query the data; the queries are very brief. We try to bring that brevity for time series data to SQL with this new capability, with the series timestamp as an implicit argument to these functions. Then, you just group by 2, 1 (the symbol and the time bucket) and order by 2, 1. So, it's a very simple query expression.
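
    Putting those pieces together, a sketch of the MemSQL 7.0 version of the hourly candlestick query looks like this:

        SELECT TIME_BUCKET('1h') AS hour_bucket,   -- implicitly uses the SERIES TIMESTAMP column ts
               symbol,
               MIN(price)   AS low_price,
               MAX(price)   AS high_price,
               FIRST(price) AS open_price,         -- earliest price in the bucket, by ts
               LAST(price)  AS close_price         -- latest price in the bucket, by ts
        FROM tick
        GROUP BY 2, 1     -- the symbol and the time bucket
        ORDER BY 2, 1;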

    Just to recap, MemSQL has been great for time series ingest and storage for several years, and people have loved it for that. We have very fast ingest; powerful SQL capability, with time-oriented functions as part of our window function capability; high-performance query processing based on compilation to machine code and vectorization; and scalability through scale-out, along with the ability to support high concurrency, where you've got lots of writers and readers concurrently working on the same data set. Not to mention, we provide transaction support and easy manageability, and we're built for the cloud.

    Now, given all the capabilities we already had, we're making it even easier to query time series data with this new brief syntax: the new functions FIRST(), LAST(), and TIME_BUCKET(), plus the series timestamp concept, which allow you to write queries very briefly, without having to refer repeatedly and redundantly to the time column in your table.

    Time series functions recap

    This lets non-expert users do more than they could before, things they just weren't capable of doing with time series data, and it makes expert users more productive. I'd like to invite you to try MemSQL for free today, or contact Sales. Try it for free by using our free version, or go on Helios and do an eight-hour free trial. Either way, you can try MemSQL at no charge. Thank you.

    Q&A: MemSQL and Time Series

    Q. What’s the best way to age out old data from a table storing time series data?

    A. The life cycle management of time series data is really important in any kind of time series application. One of the things you need to do is eliminate or purge old data. It’s really pretty easy to do that in MemSQL. All you have to do is run a delete statement periodically to delete the old data. Some other database products have time-oriented partitioning capabilities, and their delete is really slow, so they require you to, for instance, swap out an old partition once a month or so to purge old data from a large table. In MemSQL, you don’t really need to do that, because our delete is really, really fast. We can just run a delete statement to delete data prior to a certain time, whenever you need to remove old data.
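
    For example, a periodic job could purge everything older than 30 days with a statement along these lines (the retention window is just an example):

        DELETE FROM tick
        WHERE ts < NOW() - INTERVAL 30 DAY;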

    Q. Can you have more than one time series column in a table?

    A. You can only designate one column in a table as the series timestamp. However, you can have multiple time columns in a table, and if you want to use different columns, you can use them explicitly with our new built-in time functions: FIRST(), LAST(), and TIME_BUCKET(). There's an optional time argument, so if you have a secondary time column on a table that's not your primary series timestamp, but you want to use it with some of those functions, you can do that. You just have to name the time column explicitly in the FIRST(), LAST(), and TIME_BUCKET() functions.
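
    For instance, assuming a hypothetical secondary time column named received_at, you would name it explicitly in each function call:

        -- received_at is a hypothetical second time column, not the SERIES TIMESTAMP
        SELECT TIME_BUCKET('5m', received_at) AS bucket,
               FIRST(price, received_at)      AS first_price,
               LAST(price, received_at)       AS last_price
        FROM tick
        GROUP BY 1;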

    Q. Does it support multi-tenancy?

    A. Does it support multi-tenancy? Sure. MemSQL supports any number of concurrent users, up to a very high number of concurrent queries. You can have multiple databases on a single cluster, and each application can have its own database if you want, so you can have multi-tenant applications running on the same cluster.

    Q. Does MemSQL keep a local copy of the data ingested or does it only keep references? If MemSQL keeps a local copy, how is it kept in sync with external sources?

    A. MemSQL is a database system. You create tables, you insert data into the tables, you query data in the tables, and you can update the data and delete it. When you add a record to MemSQL, a copy of that record is kept in MemSQL; it doesn't store data by reference, it stores copies of the data. If you want to keep it in sync with external sources, then as the external values change, you'll need to update the record that represents that information in MemSQL.

    Q. How can you compute a moving average on a time series in MemSQL?

    A. Sure, you can compute a moving average; it depends on how you want to do it. If you just want to average the data in each time bucket, you can use a plain average to do that. If you want a true moving average, you can use window functions and compute an average over a window as it moves. For example, you can average over a window from three preceding rows to the current row, to average the last four values.
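
    A sketch of that second kind of moving average, using a window over the three preceding rows plus the current row:

        SELECT ts, symbol, price,
               AVG(price) OVER (
                   PARTITION BY symbol
                   ORDER BY ts
                   ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
               ) AS moving_avg_price   -- average of the last four ticks per symbol
        FROM tick
        ORDER BY symbol, ts;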

    Q. Did you mention anything about Python interoperability? In any event, what Python interface capabilities do you offer?

    A. We do have Python interoperability, in that client applications written in just about any popular programming language can connect to MemSQL and insert data, query data, and so forth. We support connectivity to applications through drivers that are MySQL wire protocol-compatible. Essentially, any application software that can connect to a MySQL database and insert data, update data, and so forth can also connect to MemSQL.
    We have drivers for Python that allow you to write a Python application and connect it to MemSQL. In addition, in our Pipeline capability, we support what are called transforms. Those are programs or scripts that can be applied to transform batches of information that are flowing into MemSQL through the Pipeline. You can write transforms in Python as well.

    Q. Do I need to add indexes to be able to run fast select queries on time series data, with aggregations?

    A. Depending on the nature of the queries, how much data you have, and how much hardware you have, you may or may not need indexes to make certain queries run fast. It really depends on your data and your queries. If you have very large data sets, highly selective queries, and a lot of concurrency, you're probably going to want to use indexes. We support indexes on our rowstore table type, both ordered indexes and hash indexes.

    Then, for our columnstore table type, we have a primary sort key, which is like an index in some ways, as well as support for secondary hash indexes. However, the ability to shard your data across multiple nodes in a large cluster, and to use columnstore data storage structures with very fast vectorized query execution, makes it possible to run queries with response times of a fraction of a second on very large data sets, without an index.
    That can make things easier for you as an application developer: you can let the power of your computing cluster and database software do the work, and not have to be so clever about defining your indexes. Again, it really depends on the application.
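
    As a rough sketch of the two approaches mentioned here (the table names are illustrative, and the exact index options depend on your version and workload): a rowstore table with an ordered primary key, and a columnstore table with a sort key plus a secondary hash index:

        -- Rowstore table (the default table type in MemSQL 7.0) with an ordered primary key
        CREATE TABLE tick_live (
            ts     DATETIME(6) SERIES TIMESTAMP,
            symbol VARCHAR(5),
            price  DECIMAL(18, 4),
            PRIMARY KEY (symbol, ts)
        );

        -- Columnstore table: ts as the sort key, plus a secondary hash index on symbol
        CREATE TABLE tick_history (
            ts     DATETIME(6),
            symbol VARCHAR(5),
            price  DECIMAL(18, 4),
            KEY (ts) USING CLUSTERED COLUMNSTORE,   -- defines the columnstore and its sort key
            KEY (symbol) USING HASH                 -- secondary hash index
        );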

    Q. Can you please also talk about encryption and data access roles, management for MemSQL?

    A. With respect to encryption, for those customers who want to encrypt their data at rest, we recommend using Linux file system capabilities or cloud storage platform capabilities to encrypt the data at the storage layer, underneath the database system.
    Then, with respect to access control, MemSQL has a comprehensive set of data access capabilities. You can grant permission to access tables and views to different users or groups. We support single sign-on through a number of different mechanisms. We have a pretty comprehensive set of access control policies. We also support row-level security.
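
    For instance, a read-only reporting user might be set up along these lines (the database and user names are hypothetical):

        CREATE USER 'dashboard_reader'@'%' IDENTIFIED BY 'choose-a-strong-password';
        GRANT SELECT ON analytics_db.* TO 'dashboard_reader'@'%';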

    Q. What kind of row-locking issues will I run into when using many transactions, selects, updates, and deletes at once?

    A. MemSQL has multi-version concurrency control, so readers don't block writers and vice versa. Write-write conflicts usually happen at row-level lock granularity.

    Q. How expensive is it to reindex a table?

    A. CREATE INDEX is typically fast. I have not heard customers have problems with it.

    Q. Your reply on moving averages seems to pertain to simple moving averages, but how would you do exponential moving averages or weighted moving averages, where a window function may not be appropriate?

    A. For that you’d have to do it in the client application or in a stored procedure. Or consider using a different time series tool.

    Q. Are there any utilities available for time series data migration to / from existing datastores like Informix?

    A. For straight relational table migration, yes. But you'd probably have to do some custom work to move data from a time series DataBlade in Informix to regular tables in MemSQL.

    Q. Does the series timestamp accept an integer data type, or does it have to be a datetime data type?

    A. The data type must be time or datetime or timestamp. Timestamp is not recommended because it has implied update behavior.

    Q. Any plans to support additional aggregate functions with the time series functions? (e.g. we would have liked to get percentiles like first/last without the use of CTEs)

    A. PERCENTILE_CONT and PERCENTILE_DISC work in MemSQL 7.0 as regular aggregates. If you want other aggregates, let us know.

    Q. Where can I find more info on AI (ML & DL) in MemSQL?

    A. See the documentation for the dot_product and euclidean_distance functions, as well as past webinar recordings on this topic and this blog post: https://www.memsql.com/blog/memsql-data-backbone-machine-learning-and-ai/

    Q. Can time series data be associated with asset context and queried in asset context. (Like a tank, with temperature, pressure, etc., within the asset context of the tank name.)

    A. A time series record can have one timestamp and multiple fields. So I think you could use regular string table fields for context and numeric fields for metrics to plot and aggregate.

    Q. I'm guessing the standard role-based security model exists to restrict access to time series data?

    A. Yes.

    (End of Q&A)

    We invite you to learn more about MemSQL at https://www.memsql.com, or give us a try for free at https://www.memsql.com/free.
