
1.3 Billion NYC Taxi Rows into MemSQL Cloud


Feed: MemSQL Blog.
Author: Seth Luersen.

Experience teaches us that when loading data into a database, in whatever form (normalized, denormalized, schema-less, hierarchical, key-value, document, and so on), the devil is always in the data load.

For enterprise companies in the real-time economy, every second saved means improved efficiency, productivity, and profitability. Thankfully, MemSQL Cloud makes your enterprise data fast to load and easy to access. You can spin up a cluster in MemSQL Cloud in minutes and load data very quickly using MemSQL Pipelines. MemSQL scales out for fast parallel processing of very large data sets. MemSQL Pipelines deliver exactly-once semantics for duplicate-free, fast data ingest.

In this blog tutorial, we’ll illustrate rapid data ingest with MemSQL using real-world data from the New York City Taxi and Limousine Commission (NYC TLC) yellow taxi trip data set, which is over 200GB of CSV files. We’ll break the tutorial into several sections:

  • Pre-configuration requirements for everything you need to complete the tutorial, including AWS S3 and MemSQL Cloud
  • Downloading, compressing, and uploading the yellow taxi data to AWS S3
  • Loading the data with MemSQL Pipelines in about 24 minutes

Pre-configuration Requirements

To complete this tutorial, you can use a Mac OS, Linux, or Windows machine, and you’ll need a:

  • Bash compatible terminal
  • AWS account and S3 bucket with read and write privileges
  • AWS Command Line Interface (CLI)
  • MySQL compatible client to execute SQL statements
  • Running MemSQL cluster

AWS S3

First, log into your AWS account. To get started with AWS S3, review Getting Started with Amazon Simple Storage Service.

We will store approximately 44GB of compressed data versus 200GB of raw data, which typically costs about $4.50 a month. After loading the data into MemSQL, you can delete the S3 data and terminate the AWS account to avoid additional charges.

AWS Command Line Interface (CLI)

The AWS CLI lets you script S3 operations from the shell, such as creating a folder, listing folder contents, and uploading files. Follow the User Guide for installing the AWS CLI.
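For example, once the CLI is installed and configured, the basic S3 operations used in this tutorial look like this (bucket and folder names are placeholders):

    # Create a bucket in us-east-1
    aws s3 mb s3://your-bucket --region us-east-1

    # List the contents of a folder
    aws s3 ls s3://your-bucket/nyc_taxi/

    # Upload a file to a folder
    aws s3 cp yellow_tripdata_2009-01.csv.tar.gz s3://your-bucket/nyc_taxi/yellow_200901_201412/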

AWS Credentials

To upload data to S3 using the AWS CLI and extract files from S3 with MemSQL Pipelines, you’ll need AWS credentials for an IAM user. For an IAM user, you can easily generate an AWS Access Key ID and AWS Secret Access Key by following this guide.

S3 Bucket

Log in to the AWS Console and create a new S3 bucket for your account. If you are using MemSQL Cloud, create the S3 bucket in the us-east-1 region to avoid ingress and egress data charges. The NYC taxi data set includes yellow taxi, green vehicle, and for-hire vehicle (FHV) data. We’ll only use the yellow taxi data in our initial data load, as it is by far the largest data set. Because the yellow taxi schema changes over time, we’ll create yellow taxi subfolders to demarcate the schema changes. In your S3 bucket, create the following folder hierarchy:
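One reasonable layout, assuming folder names that mirror the pipeline names used later in this tutorial (the exact names are your choice; the schema-change boundaries reflect the NYC TLC data dictionary):

    your-bucket/
      nyc_taxi/
        yellow_200901_201412/    (2009-01 through 2014-12 format)
        yellow_201501_201606/    (2015-01 through 2016-06 format)
        yellow_201607_201612/    (2016-07 onward; location IDs replace coordinates)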

MySQL Compatible Client

MemSQL uses the open source MySQL protocol, so in order to execute Data Definition Language (DDL) and Data Manipulation Language (DML) statements you’ll need a MySQL compatible client. Here are some popular, free clients:

  • MySQL Workbench
  • Sequel Pro 
  • MySQL Client for Linux APT Repository, Debian Packages
    sudo apt-get update
    sudo apt-get install mysql-client
  • MySQL Client for Linux RPM, Red Hat Enterprise Linux/Oracle Linux/CentOS 5 systems
    sudo yum install mysql-community-client mysql-5.*

MemSQL Cluster

A MemSQL cluster consists of at least one master aggregator node and one leaf node. Although this tutorial is primarily for a cluster in MemSQL Cloud, it is easy to adapt it for any MemSQL cluster environment.

MemSQL Cloud

To experience a turnkey enterprise version of MemSQL, sign up for MemSQL Cloud – free credits up to $300 are available to new customers. You’ll get the full power and speed of MemSQL, but as a managed service that features optimum performance and world-class support. Once you’ve signed up, get started by reviewing the MemSQL Cloud Overview. You’ll need to set up the Security Groups for your own virtual private cloud, then create a cluster based on resource tiers.

You can complete this tutorial using the Small, Medium, or Large cluster sizes. If you have a need for speed, choose Large for a 4-unit MemSQL Cloud cluster and expect the NYC taxi data set to ingest completely in under 24 minutes for a cost of about five dollars. Based on the location of your S3 bucket, your mileage may vary because of the variable nature of AWS S3 network speeds.

You can spin up a cluster in minutes and terminate a cluster at any time. To avoid using up all your free credits or incurring additional charges, spin up a cluster when you are ready to load the data, and terminate it after completing this tutorial.

MemSQL On-Premises or Self-Managed Cloud

MemSQL runs natively on 64-bit Linux operating systems. Download and install MemSQL either on-premises or in your own self-managed cloud. For Windows or Mac OS, you can use the MemSQL Quickstart Docker image, which is a standalone edition of MemSQL. If using Windows 10, consider installing Bash using the Windows Subsystem for Linux Beta. This beta feature installs Ubuntu on Windows, and you can run MemSQL experimentally in this environment. Very cool.

NYC TLC Trip Data

The NYC TLC data files are as small as 300MB and as large as 2.5GB. With 8 years of monthly data, yellow taxi information accounts for about 90% of all data. Many yellow taxi cab files are approximately 2.4GB. When compressed, the files range from 300MB to almost 600MB.

Download, Compress, and Upload the Data

Here’s a rudimentary Bash script that does the work of downloading each file, compressing it, uploading the compressed file to the proper folder in S3, and then deleting both the raw CSV and tar.gz files from the machine where the script runs. The shell script requires the AWS CLI.
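A minimal sketch of such a script follows. The NYC TLC download URL pattern and the folder names are assumptions matching the layout shown earlier; adjust the variables to fit your environment:

    #!/usr/bin/env bash
    # Minimal sketch: download, compress, and upload each monthly yellow
    # taxi file, then delete the local copies. The URL pattern and folder
    # names are assumptions; verify them before running.
    set -e

    S3_BUCKET="your-bucket"                  # your S3 bucket name
    S3_BASE_FOLDER="nyc_taxi"                # base folder within the bucket
    export AWS_ACCESS_KEY_ID="your-access-key-id"
    export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

    BASE_URL="https://s3.amazonaws.com/nyc-tlc/trip+data"   # assumed NYC TLC URL

    for year in $(seq 2009 2016); do
      for month in $(seq -w 1 12); do
        csv="yellow_tripdata_${year}-${month}.csv"
        m=$((10#$month))                     # strip leading zero for comparisons

        # Route each file to the folder that matches its schema era
        if [ "$year" -le 2014 ]; then
          folder="yellow_200901_201412"
        elif [ "$year" -eq 2015 ] || [ "$m" -le 6 ]; then
          folder="yellow_201501_201606"
        else
          folder="yellow_201607_201612"
        fi

        curl -sS -O "${BASE_URL}/${csv}"     # download the raw CSV
        tar czf "${csv}.tar.gz" "${csv}"     # compress it
        aws s3 cp "${csv}.tar.gz" "s3://${S3_BUCKET}/${S3_BASE_FOLDER}/${folder}/"
        rm -f "${csv}" "${csv}.tar.gz"       # clean up local copies
        echo "Processed ${csv}"
      done
    done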

In order for the script to work with your S3 bucket and folder, specify values for these variables:

  • S3_BUCKET
  • S3_BASE_FOLDER
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

Save your changes and name the file nyc_tlc_taxi_files.sh. In the terminal where you are going to run the shell script, modify the file properties to make it executable:
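    chmod +x nyc_tlc_taxi_files.sh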

Next, create an AWS profile using the AWS CLI:
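    aws configure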

At the prompts, specify your values for the ‘AWS Access Key ID’ and ‘AWS Secret Access Key’. For ‘Default region name’, specify us-east-1. For ‘Default output format’, specify json.

From a Bash shell, run the nyc_tlc_taxi_files.sh script. There is no need to specify the AWS profile settings again, so press enter at each prompt. The script outputs the successful processing of each file:
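Assuming the sketch shown earlier, a run looks like this:

    ./nyc_tlc_taxi_files.sh
    Processed yellow_tripdata_2009-01.csv
    Processed yellow_tripdata_2009-02.csv
    ...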

Downloading, compressing, and uploading the 96 yellow taxi monthly data files from your local machine will take around six to eleven hours, depending on your connection speeds. So, if you are following along on your laptop, you may want to run the script overnight, keeping your machine plugged in with sleep disabled.

To Compress or Not to Compress?

With AWS S3, you ultimately pay for storage size and for data transmission. Depending on the region, it costs on average about five dollars a month to store 200GB. Compression brings that 200GB down to under 50GB. Because a MemSQL Pipeline for S3 can process compressed files, you will save money by using compression for large data sets on AWS S3.

Load the NYC Yellow Taxi Trip Data

After the Bash shell script completes and all of the NYC taxi trip data is in your S3 bucket folders, we can load it into MemSQL using S3 Pipelines.

MemSQL Database and Tables

To get started, we’ll create a database and two tables. Because the yellow taxi data has several schema changes over eight years as described in the data dictionary, we’ll create a staging table to handle the various data formats. After the initial S3 Pipeline data ingest, we’ll insert the staging data into a destination table.

Using a MySQL compatible client, connect to the DDL endpoint for your MemSQL Cloud cluster. If you are not using MemSQL Cloud, connect to your master aggregator to run DDL commands. To create the database and tables, execute the following SQL:
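A minimal sketch, assuming an abbreviated version of the yellow taxi column set from the data dictionary (the full tables would carry every column from each file format):

    CREATE DATABASE IF NOT EXISTS nyc_taxi;
    USE nyc_taxi;

    -- Staging table: permissive column types absorb the schema drift
    -- between the 2009-2014, 2015-2016/06, and 2016/07+ file formats.
    CREATE TABLE yellow_trips_staging (
      vendor_id VARCHAR(10),
      pickup_datetime DATETIME,
      dropoff_datetime DATETIME,
      passenger_count INT,
      trip_distance DECIMAL(10,2),
      pickup_longitude DOUBLE,
      pickup_latitude DOUBLE,
      dropoff_longitude DOUBLE,
      dropoff_latitude DOUBLE,
      payment_type VARCHAR(30),
      fare_amount DECIMAL(10,2),
      tip_amount DECIMAL(10,2),
      tolls_amount DECIMAL(10,2),
      total_amount DECIMAL(10,2),
      KEY (pickup_datetime) USING CLUSTERED COLUMNSTORE
    );

    -- Destination table: pickup and dropoff become geographypoint values
    -- so we can write fast geospatial queries later.
    CREATE TABLE yellow_trips (
      vendor_id VARCHAR(10),
      pickup_datetime DATETIME,
      dropoff_datetime DATETIME,
      passenger_count INT,
      trip_distance DECIMAL(10,2),
      pickup GEOGRAPHYPOINT,
      dropoff GEOGRAPHYPOINT,
      payment_type VARCHAR(30),
      fare_amount DECIMAL(10,2),
      tip_amount DECIMAL(10,2),
      tolls_amount DECIMAL(10,2),
      total_amount DECIMAL(10,2),
      KEY (pickup_datetime) USING CLUSTERED COLUMNSTORE
    );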

MemSQL Pipelines

Declared ever so simply but ever so powerfully with a CREATE PIPELINE statement, MemSQL Pipelines are a native construct in MemSQL. The pipeline syntax lets us extract one or more gzip-compressed (or uncompressed) source files from an S3 bucket folder, then load the data into a destination database table. In other words, we’ll end up with three MemSQL Pipelines to match our three S3 folders.

Create Pipelines

You’ll need to specify your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY credentials in the following CREATE PIPELINE statements. In the statements, we specify to skip all CSV parsing errors. The three pipelines will load data rapidly and in parallel into the staging table:
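A sketch of the first pipeline, assuming the bucket and folder names used earlier (the other two pipelines are identical apart from the name and source folder; a real pipeline would also map each era’s CSV columns to the staging columns):

    CREATE PIPELINE pipeline_yellow_200901_201412 AS
      LOAD DATA S3 'your-bucket/nyc_taxi/yellow_200901_201412'
      CONFIG '{"region": "us-east-1"}'
      CREDENTIALS '{"aws_access_key_id": "your-access-key-id", "aws_secret_access_key": "your-secret-access-key"}'
      SKIP ALL ERRORS
      INTO TABLE yellow_trips_staging
      FIELDS TERMINATED BY ',';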

Now, we can start the pipelines with:
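    -- pipeline names for the later eras are assumptions matching the folder layout above
    START PIPELINE pipeline_yellow_200901_201412;
    START PIPELINE pipeline_yellow_201501_201606;
    START PIPELINE pipeline_yellow_201607_201612;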

Verify that your pipelines are running:
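    SHOW PIPELINES;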

You can view the files in each batch, as identified by BATCH_ID, with the following query:
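A sketch against the pipelines metadata tables (column names can vary across MemSQL versions):

    SELECT pipeline_name, batch_id, batch_state,
           batch_source_partition_id AS file_name
    FROM information_schema.PIPELINES_BATCHES
    ORDER BY batch_id;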

Parallel Loading with MemSQL Pipelines

To determine the number of files to process in each pipeline batch, MemSQL uses the database partition count: each batch pairs one file with one database partition.

The number of distinct master partition ordinals equals the total number of database partitions.
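For example, assuming the database name used in this tutorial:

    SHOW PARTITIONS ON nyc_taxi;

Count the rows where Role is Master to get the total number of partitions.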

Load Skew

Smaller files in a pipeline batch will load faster than larger ones. Only when all files in a batch have successfully loaded in parallel does the next pipeline batch begin. For the NYC yellow taxi trip data set, the monthly files vary greatly in size, with differences as great as 210MB, so there will be a degree of load skew. The pipeline with the most files to process is pipeline_yellow_200901_201412. It processes six years of monthly data, which is 72 files. A database sized with 32 partitions will create three pipeline batches for a pipeline with 72 files: two full batches of 32 files each, followed by a final batch of 8 files.

To view the overall batch time in minutes, the total number of rows, and rows per second, execute the following query:
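A sketch of such a query; the metadata column names here are assumptions and can vary by MemSQL version:

    SELECT
      (MAX(batch_start_unix_timestamp + batch_time)
        - MIN(batch_start_unix_timestamp)) / 60 AS Total_Minutes,
      SUM(batch_rows_written) AS Total_Rows,
      SUM(batch_rows_written)
        / (MAX(batch_start_unix_timestamp + batch_time)
            - MIN(batch_start_unix_timestamp)) AS Rows_Per_Sec
    FROM information_schema.PIPELINES_BATCHES_SUMMARY
    WHERE batch_state = 'Succeeded';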

The results for loading all the yellow taxi data into a Large cluster in MemSQL Cloud are:

  • Total_Minutes = 23.94
  • Total_Rows = 1,308,985,065
  • Rows_Per_Sec = 911,305.258

Now that the load is complete, we can stop the pipelines:
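    STOP PIPELINE pipeline_yellow_200901_201412;
    STOP PIPELINE pipeline_yellow_201501_201606;
    STOP PIPELINE pipeline_yellow_201607_201612;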

From Staging to Destination

With 1.3 billion rows, there will be a few load errors. We can identify the files with the most errors using the following query:
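A sketch using the pipelines errors table (assuming BATCH_SOURCE_PARTITION_ID identifies the source file, as it does for S3 pipelines):

    SELECT batch_source_partition_id AS file_name, COUNT(*) AS error_count
    FROM information_schema.PIPELINES_ERRORS
    GROUP BY batch_source_partition_id
    ORDER BY error_count DESC
    LIMIT 10;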

There are two files with around 650 row-related errors out of 1.308 billion rows. Of course, there are a few issues with the data itself: there are rows with 0000-00-00 00:00:00 datetimes and rows with meaningless longitudes and latitudes. We can clean these up with some basic DELETE and UPDATE statements that take less than 20 seconds to complete.
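For example, a minimal cleanup sketch, assuming the staging columns from the DDL above:

    -- Remove rows with zeroed-out datetimes
    DELETE FROM yellow_trips_staging
    WHERE pickup_datetime = '0000-00-00 00:00:00'
       OR dropoff_datetime = '0000-00-00 00:00:00';

    -- Remove rows whose coordinates fall outside plausible bounds
    DELETE FROM yellow_trips_staging
    WHERE pickup_longitude  NOT BETWEEN -180 AND 180
       OR pickup_latitude   NOT BETWEEN  -90 AND 90
       OR dropoff_longitude NOT BETWEEN -180 AND 180
       OR dropoff_latitude  NOT BETWEEN  -90 AND 90;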

With the data cleaned up, we are now ready to copy the data into our destination table. As you may recall, the DDL for the yellow_trips table specifies the pickup and dropoff points as the MemSQL geographypoint data type, enabling us to write fast geospatial queries moving forward.
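A sketch of the copy, assuming the abbreviated columns above; GEOGRAPHY_POINT(longitude, latitude) constructs a geographypoint value:

    INSERT INTO yellow_trips
    SELECT
      vendor_id, pickup_datetime, dropoff_datetime,
      passenger_count, trip_distance,
      GEOGRAPHY_POINT(pickup_longitude, pickup_latitude),
      GEOGRAPHY_POINT(dropoff_longitude, dropoff_latitude),
      payment_type, fare_amount, tip_amount, tolls_amount, total_amount
    FROM yellow_trips_staging;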

Finally, we’ll clean up the staging table and update the statistics for the yellow_trips table:
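    DROP TABLE yellow_trips_staging;
    ANALYZE TABLE yellow_trips;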

Now, we’re ready to start analyzing the data!

Data Science Inspiration

For those looking for inspiration on what to analyze in the NYC yellow taxi data and how, take a look at Todd W. Schneider’s now-famous blog “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance”. As Todd writes, “taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York.”

Next Steps

The MemSQL story is one of relentless innovation and brilliant engineering.

Just five years ago, MemSQL posted “Loading half a billion records in 40 minutes”. That blog details how to chunk up manufactured data into 2000 equally sized files in order to load 500 million rows into MemSQL on a machine with 512GB memory and 64 cores.

Five years later, you can spin up a database cluster in MemSQL Cloud in minutes, and in about half the time load 1.3 billion rows of real-world data with a few simple CREATE PIPELINE statements. Try it yourself and sign up for MemSQL Cloud. New customers receive a $300 free credit.

To preview the next chapter in the amazing MemSQL story that includes an 80X improvement in columnstore query execution and AI similarity matching with SIMD functions, sign up for the MemSQL 6.0 Beta.

