
1.3 Billion NYC Taxi Rows into MemSQL Cloud


Feed: MemSQL Blog.
Author: Seth Luersen.

Experience teaches us that when loading data into a database, in whatever form (normalized, denormalized, schema-less, hierarchical, key-value, document, and so on), the devil is always in the data load.

For enterprise companies in the real-time economy, every second saved means improved efficiency, productivity, and profitability. Thankfully, MemSQL Cloud makes your enterprise data fast to load and easy to access. You can spin up a cluster in MemSQL Cloud in minutes and load data very quickly using MemSQL Pipelines. MemSQL scales out for fast parallel processing of very large data sets. MemSQL Pipelines deliver exactly-once semantics for duplicate-free, fast data ingest.

In this blog tutorial, we’ll illustrate rapid data ingest with MemSQL using real-world data from the New York City Taxi and Limousine Commission (NYC TLC) yellow taxi trip data set, which is over 200GB of CSV files. We’ll break the tutorial into several sections:

  • Pre-configuration requirements for everything you need to complete the tutorial, including AWS S3 and MemSQL Cloud
  • Downloading, compressing, and uploading the yellow taxi data to AWS S3
  • Loading the data with MemSQL Pipelines in about 24 minutes

Pre-configuration Requirements

To complete this tutorial, you can use a Mac OS, Linux, or Windows machine, and you’ll need a:

  • Bash compatible terminal
  • AWS account and S3 bucket with read and write privileges
  • AWS Command Line Interface (CLI)
  • MySQL compatible client to execute SQL statements
  • Running MemSQL cluster

AWS S3

First, log into your AWS account. To get started with AWS S3, review Getting Started with Amazon Simple Storage Service.

We will store approximately 44GB of compressed data versus 200GB of raw data, which typically costs about $4.50 a month. After loading the data into MemSQL, you can delete the S3 data and terminate the AWS account to avoid additional charges.

AWS Command Line Interface (CLI)

The AWS CLI lets you script S3 operations from the shell, such as creating a folder, listing folder contents, and uploading files. Follow the User Guide for installing the AWS CLI.
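For example, once the CLI is installed and configured, the basic S3 operations used in this tutorial look like this (bucket and folder names are placeholders):

    # Create a bucket in us-east-1
    aws s3 mb s3://your-bucket --region us-east-1

    # List the contents of a folder
    aws s3 ls s3://your-bucket/nyc_taxi/

    # Upload a file to a folder
    aws s3 cp yellow_tripdata_2009-01.csv.tar.gz s3://your-bucket/nyc_taxi/yellow_200901_201412/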

AWS Credentials

To upload data to S3 using the AWS CLI and extract files from S3 with MemSQL Pipelines, you’ll need AWS credentials for an IAM user. For an IAM user, you can easily generate an AWS Access Key ID and AWS Secret Access Key by following this guide.

S3 Bucket

Log in to the AWS Console and create a new S3 bucket for your account. If you are using MemSQL Cloud, create the S3 bucket in the us-east-1 region to avoid ingress and egress data charges. The NYC taxi data set includes yellow taxi, green vehicle, and for-hire vehicle (FHV) data. We’ll only use the yellow taxi data in our initial data load, as it is by far the largest data set. Because the yellow taxi schema changes over time, we’ll create yellow taxi subfolders to demarcate the schema changes. In your S3 bucket, create the following folder hierarchy:
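One reasonable layout, assuming folder names that mirror the pipeline names used later in this tutorial (the exact names are your choice; the schema-change boundaries reflect the NYC TLC data dictionary):

    your-bucket/
      nyc_taxi/
        yellow_200901_201412/    (2009-01 through 2014-12 format)
        yellow_201501_201606/    (2015-01 through 2016-06 format)
        yellow_201607_201612/    (2016-07 onward; location IDs replace coordinates)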

MySQL Compatible Client

MemSQL uses the open source MySQL protocol, so in order to execute Data Definition Language (DDL) and Data Manipulation Language (DML) statements you’ll need a MySQL compatible client. Here are some popular, free clients:

  • MySQL Workbench
  • Sequel Pro 
  • MySQL Client for Linux APT Repository, Debian Packages
    sudo apt-get update
    sudo apt-get install mysql-client
  • MySQL Client for Linux RPM, Red Hat Enterprise Linux/Oracle Linux/CentOS 5 systems
    sudo yum install mysql-community-client mysql-5.*

MemSQL Cluster

A MemSQL cluster consists of at least one master aggregator node and one leaf node. Although this tutorial is primarily for a cluster in MemSQL Cloud, it is easy to adapt it for any MemSQL cluster environment.

MemSQL Cloud

To experience a turnkey enterprise version of MemSQL, sign up for MemSQL Cloud – free credits up to $300 are available to new customers. You’ll get the full power and speed of MemSQL, but as a managed service that features optimum performance and world-class support. Once you’ve signed up, get started by reviewing the MemSQL Cloud Overview. You’ll need to set up the Security Groups for your own virtual private cloud, then create a cluster based on resource tiers.

You can complete this tutorial using the Small, Medium, or Large cluster sizes. If you have a need for speed, choose Large for a 4-unit MemSQL Cloud cluster and expect the NYC taxi data set to ingest completely in under 24 minutes for a cost of about five dollars. Based on the location of your S3 bucket, your mileage may vary because of the variable nature of AWS S3 network speeds.

You can spin up a cluster in minutes and terminate a cluster at any time. To avoid using up all your free credits or incurring additional charges, spin up a cluster when you are ready to load the data, and terminate it after completing this tutorial.

MemSQL On-Premises or Self-Managed Cloud

MemSQL runs natively on 64-bit Linux operating systems. Download and install MemSQL either on-premises or in your own self-managed cloud. For Windows or Mac OS, you can use the MemSQL Quickstart Docker image, which is a standalone edition of MemSQL. If using Windows 10, consider installing Bash using the Windows Subsystem for Linux Beta. This beta feature installs Ubuntu on Windows, and you can run MemSQL experimentally in this environment. Very cool.

NYC TLC Trip Data

The NYC TLC data files are as small as 300MB and as large as 2.5GB. With 8 years of monthly data, yellow taxi information accounts for about 90% of all data. Many yellow taxi cab files are approximately 2.4GB. When compressed, the files range from 300MB to almost 600MB.

Download, Compress, and Upload the Data

Here’s a rudimentary Bash script that does the work of downloading each file, compressing it, uploading the compressed file to the proper folder in S3, and then deleting both the raw CSV and tar.gz files from the machine where the script runs. The shell script requires the AWS CLI.
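A minimal sketch of such a script follows. The NYC TLC download URL pattern and the folder names are assumptions matching the layout shown earlier; adjust the variables to fit your environment:

    #!/usr/bin/env bash
    # Minimal sketch: download, compress, and upload each monthly yellow
    # taxi file, then delete the local copies. The URL pattern and folder
    # names are assumptions; verify them before running.
    set -e

    S3_BUCKET="your-bucket"                  # your S3 bucket name
    S3_BASE_FOLDER="nyc_taxi"                # base folder within the bucket
    export AWS_ACCESS_KEY_ID="your-access-key-id"
    export AWS_SECRET_ACCESS_KEY="your-secret-access-key"

    BASE_URL="https://s3.amazonaws.com/nyc-tlc/trip+data"   # assumed NYC TLC URL

    for year in $(seq 2009 2016); do
      for month in $(seq -w 1 12); do
        csv="yellow_tripdata_${year}-${month}.csv"
        m=$((10#$month))                     # strip leading zero for comparisons

        # Route each file to the folder that matches its schema era
        if [ "$year" -le 2014 ]; then
          folder="yellow_200901_201412"
        elif [ "$year" -eq 2015 ] || [ "$m" -le 6 ]; then
          folder="yellow_201501_201606"
        else
          folder="yellow_201607_201612"
        fi

        curl -sS -O "${BASE_URL}/${csv}"     # download the raw CSV
        tar czf "${csv}.tar.gz" "${csv}"     # compress it
        aws s3 cp "${csv}.tar.gz" "s3://${S3_BUCKET}/${S3_BASE_FOLDER}/${folder}/"
        rm -f "${csv}" "${csv}.tar.gz"       # clean up local copies
        echo "Processed ${csv}"
      done
    done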

In order for the script to work with your S3 bucket and folder, specify values for these variables:

  • S3_BUCKET
  • S3_BASE_FOLDER
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

Save your changes and name the file nyc_tlc_taxi_files.sh. In the terminal where you are going to run the shell script, modify the file properties to make it executable:
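    chmod +x nyc_tlc_taxi_files.sh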

Next, create an AWS profile using the AWS CLI:
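    aws configure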

At the prompts, specify your values for the ‘AWS Access Key ID’ and ‘AWS Secret Access Key’. For ‘Default region name’, specify us-east-1. For ‘Default output format’, specify json.

From a Bash shell, run the nyc_tlc_taxi_files.sh script. There is no need to specify the AWS profile settings again, so press enter at each prompt. The script outputs the successful processing of each file:
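Assuming the sketch shown earlier, a run looks like this:

    ./nyc_tlc_taxi_files.sh
    Processed yellow_tripdata_2009-01.csv
    Processed yellow_tripdata_2009-02.csv
    ...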

Downloading, compressing, and uploading the 96 yellow taxi monthly data files from your local machine will take around six to eleven hours, depending on your connection speeds. So, if you are following along on your laptop, you may want to run the script overnight, keeping your machine plugged in with sleep disabled.

To Compress or Not to Compress?

With AWS S3, you ultimately pay for storage size and for data transmission. Depending on the region, it costs on average about five dollars a month to store 200GB. Compression brings that 200GB down to under 50GB. Because a MemSQL Pipeline for S3 can process compressed files, you will save money by using compression for large data sets on AWS S3.

Load the NYC Yellow Taxi Trip Data

After the Bash shell script completes and all of the NYC taxi trip data is in your S3 bucket folders, we can load it into MemSQL using S3 Pipelines.

MemSQL Database and Tables

To get started, we’ll create a database and two tables. Because the yellow taxi data has several schema changes over eight years as described in the data dictionary, we’ll create a staging table to handle the various data formats. After the initial S3 Pipeline data ingest, we’ll insert the staging data into a destination table.

Using a MySQL compatible client, connect to the DDL endpoint for your MemSQL Cloud cluster. If you are not using MemSQL Cloud, connect to your master aggregator to run DDL commands. To create the database and tables, execute the following SQL:
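A minimal sketch, assuming an abbreviated version of the yellow taxi column set from the data dictionary (the full tables would carry every column from each file format):

    CREATE DATABASE IF NOT EXISTS nyc_taxi;
    USE nyc_taxi;

    -- Staging table: permissive column types absorb the schema drift
    -- between the 2009-2014, 2015-2016/06, and 2016/07+ file formats.
    CREATE TABLE yellow_trips_staging (
      vendor_id VARCHAR(10),
      pickup_datetime DATETIME,
      dropoff_datetime DATETIME,
      passenger_count INT,
      trip_distance DECIMAL(10,2),
      pickup_longitude DOUBLE,
      pickup_latitude DOUBLE,
      dropoff_longitude DOUBLE,
      dropoff_latitude DOUBLE,
      payment_type VARCHAR(30),
      fare_amount DECIMAL(10,2),
      tip_amount DECIMAL(10,2),
      tolls_amount DECIMAL(10,2),
      total_amount DECIMAL(10,2),
      KEY (pickup_datetime) USING CLUSTERED COLUMNSTORE
    );

    -- Destination table: pickup and dropoff become geographypoint values
    -- so we can write fast geospatial queries later.
    CREATE TABLE yellow_trips (
      vendor_id VARCHAR(10),
      pickup_datetime DATETIME,
      dropoff_datetime DATETIME,
      passenger_count INT,
      trip_distance DECIMAL(10,2),
      pickup GEOGRAPHYPOINT,
      dropoff GEOGRAPHYPOINT,
      payment_type VARCHAR(30),
      fare_amount DECIMAL(10,2),
      tip_amount DECIMAL(10,2),
      tolls_amount DECIMAL(10,2),
      total_amount DECIMAL(10,2),
      KEY (pickup_datetime) USING CLUSTERED COLUMNSTORE
    );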

MemSQL Pipelines

Declared ever so simply but ever so powerfully with a CREATE PIPELINE statement, MemSQL Pipelines are a native construct in MemSQL. The pipeline syntax lets us extract one or more gzip-compressed (or uncompressed) source files from an S3 bucket folder, then load the data into a destination database table. In other words, we’ll end up with three MemSQL Pipelines to match our three S3 folders.

Create Pipelines

You’ll need to specify your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY credentials in the following CREATE PIPELINE statements. In the statements, we specify to skip all CSV parsing errors. The three pipelines will load data rapidly and in parallel into the staging table:
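A sketch of the first pipeline, assuming the bucket and folder names used earlier (the other two pipelines are identical apart from the name and source folder; a real pipeline would also map each era’s CSV columns to the staging columns):

    CREATE PIPELINE pipeline_yellow_200901_201412 AS
      LOAD DATA S3 'your-bucket/nyc_taxi/yellow_200901_201412'
      CONFIG '{"region": "us-east-1"}'
      CREDENTIALS '{"aws_access_key_id": "your-access-key-id", "aws_secret_access_key": "your-secret-access-key"}'
      SKIP ALL ERRORS
      INTO TABLE yellow_trips_staging
      FIELDS TERMINATED BY ',';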

Now, we can start the pipelines with:
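    -- pipeline names for the later eras are assumptions matching the folder layout above
    START PIPELINE pipeline_yellow_200901_201412;
    START PIPELINE pipeline_yellow_201501_201606;
    START PIPELINE pipeline_yellow_201607_201612;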

Verify that your pipelines are running:
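    SHOW PIPELINES;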

You can view the files in each batch, as identified by BATCH_ID, with the following query:
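A sketch against the pipelines metadata tables (column names can vary across MemSQL versions):

    SELECT pipeline_name, batch_id, batch_state,
           batch_source_partition_id AS file_name
    FROM information_schema.PIPELINES_BATCHES
    ORDER BY batch_id;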

Parallel Loading with MemSQL Pipelines

To determine the number of files to process in each pipeline batch, MemSQL uses the database partition count: each batch pairs one file with one database partition.

The number of distinct master partition ordinals equals the total number of database partitions.
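For example, assuming the database name used in this tutorial:

    SHOW PARTITIONS ON nyc_taxi;

Count the rows where Role is Master to get the total number of partitions.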

Load Skew

Smaller files in a pipeline batch will load faster than larger ones. Only when all files in a batch have successfully loaded in parallel does the next pipeline batch begin. For the NYC yellow taxi trip data set, the monthly files vary greatly in size, with differences as great as 210MB, so there will be a degree of load skew. The pipeline with the most files to process is pipeline_yellow_200901_201412. It processes six years of monthly data, which is 72 files. A database sized with 32 partitions will create three pipeline batches for a pipeline with 72 files: two full batches of 32 files each, followed by a final batch of 8 files.

To view the overall batch time in minutes, the total number of rows, and rows per second, execute the following query:
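A sketch of such a query; the metadata column names here are assumptions and can vary by MemSQL version:

    SELECT
      (MAX(batch_start_unix_timestamp + batch_time)
        - MIN(batch_start_unix_timestamp)) / 60 AS Total_Minutes,
      SUM(batch_rows_written) AS Total_Rows,
      SUM(batch_rows_written)
        / (MAX(batch_start_unix_timestamp + batch_time)
            - MIN(batch_start_unix_timestamp)) AS Rows_Per_Sec
    FROM information_schema.PIPELINES_BATCHES_SUMMARY
    WHERE batch_state = 'Succeeded';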

The results for loading all the yellow taxi data into a Large cluster in MemSQL Cloud are:

  • Total_Minutes = 23.94
  • Total_Rows = 1,308,985,065
  • Rows_Per_Sec = 911,305.258

Now that the load is complete, we can stop the pipelines:
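    STOP PIPELINE pipeline_yellow_200901_201412;
    STOP PIPELINE pipeline_yellow_201501_201606;
    STOP PIPELINE pipeline_yellow_201607_201612;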

From Staging to Destination

With 1.3 billion rows, there will be a few load errors. We can identify the files with the most errors using the following query:
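A sketch using the pipelines errors table (assuming BATCH_SOURCE_PARTITION_ID identifies the source file, as it does for S3 pipelines):

    SELECT batch_source_partition_id AS file_name, COUNT(*) AS error_count
    FROM information_schema.PIPELINES_ERRORS
    GROUP BY batch_source_partition_id
    ORDER BY error_count DESC
    LIMIT 10;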

There are two files with around 650 row-related errors out of 1.308 billion rows. Of course, there are a few issues with the data itself: there are rows with 0000-00-00 00:00:00 datetimes and rows with meaningless longitudes and latitudes. We can clean these up with some basic DELETE and UPDATE statements that take less than 20 seconds to complete.
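For example, a minimal cleanup sketch, assuming the staging columns from the DDL above:

    -- Remove rows with zeroed-out datetimes
    DELETE FROM yellow_trips_staging
    WHERE pickup_datetime = '0000-00-00 00:00:00'
       OR dropoff_datetime = '0000-00-00 00:00:00';

    -- Remove rows whose coordinates fall outside plausible bounds
    DELETE FROM yellow_trips_staging
    WHERE pickup_longitude  NOT BETWEEN -180 AND 180
       OR pickup_latitude   NOT BETWEEN  -90 AND 90
       OR dropoff_longitude NOT BETWEEN -180 AND 180
       OR dropoff_latitude  NOT BETWEEN  -90 AND 90;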

With the data cleaned up, we are now ready to copy the data into our destination table. As you may recall, the DDL for the yellow_trips table specifies the pickup and dropoff points as the MemSQL geographypoint data type, enabling us to write fast geospatial queries moving forward.
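A sketch of the copy, assuming the abbreviated columns above; GEOGRAPHY_POINT(longitude, latitude) constructs a geographypoint value:

    INSERT INTO yellow_trips
    SELECT
      vendor_id, pickup_datetime, dropoff_datetime,
      passenger_count, trip_distance,
      GEOGRAPHY_POINT(pickup_longitude, pickup_latitude),
      GEOGRAPHY_POINT(dropoff_longitude, dropoff_latitude),
      payment_type, fare_amount, tip_amount, tolls_amount, total_amount
    FROM yellow_trips_staging;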

Finally, we’ll clean up the staging table and update the statistics for the yellow_trips table:
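    DROP TABLE yellow_trips_staging;
    ANALYZE TABLE yellow_trips;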

Now, we’re ready to start analyzing the data!

Data Science Inspiration

For those looking for inspiration on what to analyze in the NYC yellow taxi data and how, take a look at Todd W. Schneider’s now-famous blog “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance”. As Todd writes, “taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York.”

Next Steps

The MemSQL story is one of relentless innovation and brilliant engineering.

Just five years ago, MemSQL posted “Loading half a billion records in 40 minutes”. That blog details how to chunk up manufactured data into 2000 equally sized files in order to load 500 million rows into MemSQL on a machine with 512GB memory and 64 cores.

Five years later, you can spin up a database cluster in MemSQL Cloud in minutes, and in about half the time load 1.3 billion rows of real-world data with a few simple CREATE PIPELINE statements. Try it yourself and sign up for MemSQL Cloud. New customers receive a $300 free credit.

To preview the next chapter in the amazing MemSQL story that includes an 80X improvement in columnstore query execution and AI similarity matching with SIMD functions, sign up for the MemSQL 6.0 Beta.

