
Replication at Speed – System of Record Capabilities for MemSQL 7.0


Feed: MemSQL Blog.
Author: Nate Horan.

System of record capability is the holy grail for transactional databases. Companies need to run their most trusted workloads on a database that has many ways to ensure that transactions are completed and to back up completed transactions, with fast and efficient restore capability. MemSQL 7.0 includes new features that deliver very fast synchronous replication – including a second copy in the initial write operation, atomically – and incremental backup, which offers increased flexibility and reliability. With these features, MemSQL 7.0 offers a viable alternative for Tier 1 workloads that require system of record capability. When combined with MemSQL SingleStore, and MemSQL’s long-standing ability to combine transactions and analytics on the same database software, MemSQL 7.0 now offers unprecedented design and operational simplicity, lower costs, and higher performance for a wide range of workloads.

The Importance of System of Record Capability

The ability to handle system of record (SoR) transactional workloads is an important characteristic for a database. When a database serves as a system of record, it should never lose a transaction that it has told the user it has received.

In providing system of record capability, there’s always some degree of trade-off between the speed of a transaction and the degree of safety that the system provides against losing data. In MemSQL 7.0, two new capabilities move MemSQL much further into SoR territory: fast synchronous replication and incremental backups.

Synchronous replication means that a transaction is not acknowledged as complete – “committed” – until it’s written to primary storage, called the master, and also to a replica, called the slave. In MemSQL 7.0, synchronous replication can be turned on with a negligible performance impact.

Synchronous durability – requiring transactions to be persisted to disk before a commit – is an additional data safety tool. It does take time, but writing to disk on the master happens in parallel with sending the transaction to the slave; there is an additional wait while the transaction is written to disk on the second system. The performance penalty is, of course, greater than for synchronous replication alone.

Fast sync replication in MemSQL 7.0 makes it possible to run high availability with a small performance hit.

In addition to synchronous replication and synchronous durability capabilities, a system of record database needs flexible restore options. In MemSQL 7.0, we add incremental backups, greatly increasing backup flexibility. Incremental backups allow a user to run backups far more often, without additional impact on the system. An incremental backup stores only the data changed since the last backup, so the time it takes to run the backup, and the resources required to do so, are significantly reduced. This means a shorter RPO (Recovery Point Objective), which in turn means less data is lost in the event of an error that requires restoring a backup.

The rest of this blog post focuses on synchronous replication, a breakthrough feature in MemSQL 7.0.

Sync Replication in Action

Synchronous replication in releases before MemSQL 7.0 was very deliberate, and quite slow. Data was replicated as it was committed, so if there were lots of small commits, you would pay the overhead of sending many separate transactions, each with a small amount of data, over the network. In addition, data sent to the slave partition would be replayed into memory on that system, and then acknowledged by the slave to the master – and, finally, acknowledged in turn to the user. This was slow enough to restrict throughput in workloads that did many writes.

In MemSQL 7.0, we completely revamped how replication works. Commits are now grouped to amortize the cost of sending data on the network. The replication is also done lock-free. Lastly, the master doesn’t have to wait for the slave to replay the changes. As soon as the slave receives the data, an acknowledgement is sent back to the master, who then sends back success to the user.

Because MemSQL is a distributed database, it can implement a highly available system by keeping multiple copies of the data, and then failing over to another copy in the event that it detects a machine has failed. The following steps demonstrate why a single failure – a network partition, a node reboot, a node running out of memory or disk space – can’t cause data to be lost. In the next section, we’ll describe how this failure-resistant implementation is also made fast.

To provide failure resistance, here are the steps that are followed:

  1. A CREATE DATABASE command is received. The command specifies Sync Replication and Async Durability. MemSQL creates partitions on the three leaves, calling the partitions db_0, db_1, and db_2. (In an actual MemSQL database, there would be many partitions per leaf, but for this example we use one partition per leaf to make it simpler.)
  2. For redundancy 2 – that is, high availability (HA), with a master and slave copy of all data – the partitions are each copied to another leaf. Replication is then started, so that all changes on the master partition are sent to the slave partition.
  3. An insert hits db_1. The update is written to memory on the master, then copied to memory on the slave.
  4. The slave receives the page and acknowledges it to the master. The master database acknowledges the write to the master aggregator, which finally acknowledges it to the user. The write is considered committed.

This interaction between the master partition and its slave makes transactions failure-resistant. If either machine were to fail, the system still has an up-to-date copy of the data. It’s fast because of the asynchronous nature of log replay on the slave system: the acknowledgement to the primary system takes place after the log page is received, but before it’s replayed on the slave.

Making Log Page Allocation Distributed and Lock-Free

There’s still a danger to this speedy performance. Even if the number of transactions is large, if the transactions are all relatively small, they can be distributed smoothly across leaves, and fast performance is maintained. However, occasional large transactions – for instance, loading a large block of data – can potentially prevent any smaller transactions from occurring until the large operation is complete.

The bottleneck doesn’t occur on actual data updating, as this can be distributed. It occurs on the allocation of log pages. So, to make synchronous replication fast in MemSQL, we made log reservation and replication lock-free, reducing blocking. The largest difficulty in building the new sync replication was making the allocation of log pages distributed and lock-free. There are several pieces that work together to prevent locking.

The first part to understand is the replication log. A transaction interacts with the replication log in three steps: Reserve, Write out log record(s), and Commit.

The replication log is structured as an ordered sequence of 4KB pages, each of which may contain several transactions (if transactions are small), parts of different transactions, or just part of a transaction (if a transaction is > 4KB in size). Each 4KB page serves as a unit of group commit, reducing network traffic – full pages are sent, rather than individual transactions – and simplifying the code needed, as it operates mostly on standard-size pages rather than on variable-sized individual transactions.

To manage pages, each one is identified by a Log Sequence Number (LSN), a unique ID that starts at zero for the first page and increments by one with each subsequent page. Each page has a page header, a 48-byte structure. The header contains two LSNs: the LSN of the page itself, and the committed LSN – the LSN up to which all pages had been successfully committed at the time the page in question was created. So a page could have LSN 53, and also record that the committed LSN at the point this page was created was 48 – every page up to and including page 48 has been committed, but page 49 (and possibly also other, higher-numbered pages) has not been.

When a transaction wants to write something to the log, it first calls a reserve API, which gives it logical space in the log and enough physical resources that the write is guaranteed not to fail, barring the node itself crashing. Next, the transaction writes out into the log all the data that it wants. Finally, it calls the commit API, which is basically a signal to the log that the data is ready to be shipped over to the slave machine, to disk, or both.

With this background, we can look at how the log works internally. We have a 128-bit structure called the anchor in the log, which we use in order to implement a lock-free protocol for the log reservations. The anchor consists of two 64-bit numbers. One is the LSN of the current page in the log, and the other is the pointer into the page where the next payload of data can be written.

All threads operate on the anchor using the compare-and-swap (CAS) instruction, a CPU primitive that lets you check that a particular location in memory still holds an expected value and, if it does, change it atomically in a single operation. It is very useful for lock-free algorithms, as we will see in a moment.

MemSQL 7.0 Sync Replication Demonstration

Let’s say we have four threads, all trying to write to the log, and the anchor currently holds an LSN of 1000. Just for simplicity, I’m not going to show the second part of the anchor, only the LSN.

  1. As with all compare-and-swaps, the threads trying to write to the log start by loading the most recent LSN, which has the value 1000.
  2. Each thread reserves the number of pages it needs for the operation it’s trying to commit. In this case, Thread 1 is only reserving part of a page, so it wants to change the most recent LSN to 1001, while Thread 2 is reserving a large number of pages and is trying to change it to 2000. Both threads attempt to compare-and-swap (CAS) at the same time. In this example, Thread 2 gets there first and expects the LSN to be 1000, which it is. It performs the swap, replacing the LSN in the anchor with 2000. It now owns this broad swathe of pages and can stay busy with them for a long time.
  3. Then Thread 1 reads the anchor expecting it to be 1000. Seeing that it’s a different number, 2000, the compare fails.
  4. Thread 1 tries again, loading the new value of 2000 into its memory. It then goes on to succeed.

It’s important to note that the CAS operations are fast. Once a thread is successful, it starts doing a large amount of work to put its page together, write the log to memory, and send it. The CAS operation, by comparison, is much faster. Also, when it does fail, it’s because another thread’s CAS operation succeeded – there’s always work getting done. A thread can fail many times without a noticeable performance hit, for the thread or the system as a whole.

By contrast, in the previous method that MemSQL used, it was as if there were a large mutex (lock) around the LSN value. All the threads were forced to wait, instead of getting access and forming their pages in parallel. Compared to the new method, the older method was very slow.

On a failover, the master partition fails and its slave is promoted to master. The new master then replays all the updates it has received.

It is possible that the old master received a page that was not also forwarded to the slave, because that’s the point at which the primary failed. However, with synchronous replication this is no problem – the page that only got to the master would not have been acknowledged to the user. The user will then retry, and the new primary will perform the update, send it to the new slave, receive an acknowledgement of successful receipt, and acknowledge to the user that the update succeeded.

Performance Impact

In the best case, there’s one round trip required per transaction, from user to master to slave, and back from slave to master to user. This is a low enough communication overhead that it is mostly amortized across other transactions doing work.

As we mentioned above, the cost of turning on synchronous replication is a single-digit percentage impact on TPC-C, a high-concurrency OLTP benchmark. This makes the performance hit of adding a much better data consistency story effectively free for most users!

The steps above show the highlights, but there are many other interesting pieces that make the new synchronous replication work well. Just to name them, these features include async replication; multi-replica replication; chained replication, for higher degrees of HA; blob replication; garbage collection on blobs; divergence detection; and durability, which we’ve mentioned. Combined, all of these new features keep the impact of turning sync replication on very low, and give both the user and the system multiple ways to accomplish shared goals.

Conclusion

Synchronous replication without compromising MemSQL’s very fast performance opens MemSQL up to many new use cases that require system of record (SoR) capability. The incremental backup capability, also new in MemSQL 7.0, further supports SoR workloads.

We are assuming here that these workloads will run on MemSQL’s rowstore tables, which are kept in memory. Both rowstore and columnstore tables support different kinds of fast analytics.

So MemSQL can now be used for many more hybrid use cases in which MemSQL database software combines transactions and analytics, including joins and similar operations across multiple tables and different table types.

These hybrid use cases may get specific benefits from other MemSQL features in this release, such as MemSQL SingleStore. Our current customers are already actively exploring the potential for using these new capabilities with us. If you’re interested in finding out more about what MemSQL can do for you, download the MemSQL 7.0 Beta or contact MemSQL today.


Analyze Google Analytics data using Upsolver, Amazon Athena, and Amazon QuickSight


Feed: AWS Big Data Blog.

In this post, we present a solution for analyzing Google Analytics data using Amazon Athena. We’re including a reference architecture built on moving hit-level data from Google Analytics to Amazon S3, performing joins and enrichments, and visualizing the data using Amazon Athena and Amazon QuickSight. Upsolver is used for data lake automation and orchestration, enabling customers to get started quickly.

Google Analytics is a popular solution for organizations that want to understand the performance of their web properties and applications. Google Analytics data is collected and aggregated to help users extract insights quickly. This works great for simple analytics. It’s less than ideal, however, when you need to enrich Google Analytics data with other datasets to produce a comprehensive view of the customer journey.

Why analyze Google Analytics data on AWS?

Google Analytics has become the de-facto standard web analytics tool. It is offered for free at lower data volumes and provides tracking, analytics, and reporting.  It enables non-technical users to understand website performance by answering questions such as: where are users coming from? Which pages have the highest conversion rates? Where are users experiencing friction and abandoning their shopping cart?

While these questions can be answered within the Google Analytics UI, there are, however, some limitations, such as:

  • Data sampling: Google Analytics standard edition displays sampled data when running ad hoc queries on time periods that contain more than 500,000 sessions. Large websites can easily exceed this number on a weekly or even daily basis. This can create reliability issues between different reports, as each query can be fed by a different sample of the data.
  • Difficulty integrating with existing AWS stack: Many customers have built or are in the process of building their data and analytics platform on AWS. Customers want to use the AWS analytics and machine learning capabilities with their Google Analytics data to enable new and innovative use cases.
  • Joining with external data sources: Seeing the full picture of a business’ online activity might require combining web traffic data with other sources. Google Analytics does not offer a simple way to either move raw data in or out of the system. Custom dimensions in Google Analytics can be used, but they are limited to 20 for the standard edition and are difficult to use.
  • Multi-dimensional analysis: Google Analytics custom reports and APIs are limited to seven dimensions per query. This limits the depth of analysis and requires various workarounds for more granular slicing and dicing.
  • Lack of alternatives: Google Analytics 360, which allows users to export raw data to Google BigQuery, carries a hefty annual fee. This can be prohibitive for organizations. And even with this upgrade, the native integration is only with BigQuery, which means users still can’t use their existing AWS stack.

Building or buying a new web analytics solution (including cookie-based tracking) is also cost-prohibitive, and can interrupt existing workflows that rely on Google Analytics data.

Customers are looking for a solution to enable their analysts and business users to incorporate Google Analytics data into their existing workflows using familiar AWS tools.

Moving Google Analytics data to AWS: Defining the requirements

To provide an analytics solution with the same or better level of reporting as Google Analytics, we designed our solution around the following tenets:

  1. Analytics with a low technical barrier to entry: Google Analytics is built for business users, and our solution is designed to provide a similar experience. This means that beyond ingesting the data, we want to automate the data engineering work that goes into making the data ready for analysis.  This includes data retention, partitioning, and compression. All of this work must be done under the hood and remain invisible to the user querying the data.
  2. Hit-level data: Google Analytics tracks clickstream activity based on Hits – the lowest level of interaction between a user and a webpage. These hits are then grouped into Sessions – hits within a given time period, and Users – groups of sessions (more details here). The standard Google Analytics API is limited to session and user-based queries, and does not offer any simple way of extracting hit-level data. Our solution, however, does provide access to this granular data.
  3. Unsampled data: By extracting the data from Google Analytics and storing it on Amazon S3, we are able to bypass the 500K sessions limitation. We also have access to unsampled data for any query at any scale.
  4. Data privacy: If sensitive data is stored in Google Analytics, relying on third-party ETL tools can create risks around data privacy, especially in the era of GDPR. Therefore, our solution encrypts data in transit and relies exclusively on processing within the customer’s VPC.

Solution overview

The solution is built on extracting hit-level data and storing it in a data lake architecture on Amazon S3. We then use Amazon Athena and Amazon QuickSight for analytics and reporting. Upsolver, an AWS premier solution provider, is used to automate ingestion, ETL, and data management on S3. Upsolver also orchestrates the entire solution with a simple-to-use graphical user interface. The following diagram shows the high-level architecture of our solution.

Reference architecture showing the flow of data across Google Analytics, Amazon Athena and Amazon QuickSight

Using Upsolver’s GA connector we extract unsampled, hit-level data from Google Analytics. This data is then automatically ingested according to accepted data lake best practices and stored in an optimized form on Amazon S3. The following best practices are applied to the data:

  • Store data in Apache Parquet columnar file format to improve read performance and reduce the amount of data scanned per query.
  • Partition data by event (hit) time rather than by API query time.
  • Perform periodic compaction by which small files are merged into larger ones improving performance and optimizing compression.

Once data is stored on S3, we use Upsolver’s GUI to create structured fact tables from the Google Analytics data. Users can query them using Amazon Athena and Amazon Redshift. Upsolver provides simple to use templates to help users quickly create tables from their Google Analytics data.  Finally, we use Amazon QuickSight to create interactive dashboards to visualize the data.

The result is a complete view of our Google Analytics data. This view provides the level of self-service analytics that users have grown accustomed to, at any scale, and without the limitations outlined earlier.

Building the solution: Step by step guide

In this section, we walk through the steps to set up the environment, configure Upsolver’s Google Analytics plugin, extract the data, and begin exploring.

Step 1: Installation and permissions

  1. Sign up for Upsolver (can also be done via the AWS Marketplace).
  2. Allow Upsolver access to read data from Google Analytics and add new custom dimensions. Custom dimensions enable Upsolver to read non-sampled hit-level data directly from Google Analytics instead of creating parallel tracking mechanisms that aren’t as trustworthy.
  3. To populate the custom dimensions that were added to Google Analytics, allow Upsolver to run a small JavaScript code on your website. If you’re using GA360, this is not required.

Step 2: Review and clean the raw data

For supported data sources, Upsolver automatically discovers the schema and collects key statistics for every field in the table. Doing so gives users a glimpse into their data.

In the following screenshot, you can see schema-on-read information on the left side, stats per field and value distribution on the right side.

Screen shot of the Upsolver UI showing schema-on-read information on the left side, stats per field and value distribution on the right side

Step 3: Publishing to Amazon Athena

Upsolver comes with four templates for creating tables in your AWS based data lake according to the Google Analytics entity being analyzed:

  • Pageviews – used to analyze user flow and behavior on specific sections of the web property using metrics such as time on page and exit rate.
  • Events – user-defined interactions such as scroll depth and link clicks.
  • Sessions – monitor a specific journey in the web property (all pageviews and events).
  • Users – understand a user’s interaction with the web property or app over time.

All tables are partitioned by event time, which helps improve query performance.

Upsolver users can choose to run the templates as-is, modify them first or create new tables unique to their needs.

The following screenshot shows the schema produced by the Pageviews template:

Screen shot of the Upsolver UI showing the schema produced by the Pageviews template:

The following screenshot shows the Pageviews and Events tables as well as the Amazon Athena views for Sessions and Users generated by the Upsolver templates.

Screenshot showing the Pageviews and Events tables as well as the Athena views for Sessions and Users generated from the Upsolver templates.

The following are a couple example queries you may want to run to extract specific insights:

-- Popular page titles
SELECT page_title,
       Count(*) AS num_hits
FROM   ga_hits_pageviews
GROUP  BY page_title
ORDER  BY 2 DESC;

-- User aggregations from hit data
SELECT user_id,
       Count(*)                   AS num_hits,
       Count(DISTINCT session_id) AS num_of_sessions,
       Sum(session_duration)      AS total_sessions_time
FROM   ga_hits_pageviews
GROUP  BY user_id;

Step 4: Visualization in Amazon QuickSight

Now that the data has been ingested, cleansed, and written to S3 in a structured manner, we are ready to visualize it with Amazon QuickSight. Start by creating a dashboard to mimic the one provided by Google Analytics. But we don’t need to stop there. We can use QuickSight ML Insights to extract deeper insights from our data. We can also embed Amazon QuickSight visualizations into existing web portals and applications, making insights available to everyone.

Screenshot of QuickSight visualization showing several sections, one with a graph, several others with various statistics

Screen shot of QuickSight showing a global map with usage concentrations marked by bubbles, alongside a pie graph.

Screenshot of QuickSight showing a bar graph, alongside a table with various data values.

Conclusion

With minimal setup, we were able to extract raw hit-level Google Analytics data, prepare it, and store it in a data lake on Amazon S3. Using Upsolver, combined with Amazon Athena and Amazon QuickSight, we built a feature-complete solution for analyzing web traffic collected by Google Analytics on AWS.

Key technical benefits:

  • Schema on-read means data consumers don’t need to model the data into a table structure, and can instantly understand what their top dimensions are. For example, 85% of my users navigate my website using Google Chrome browser.
  • Graphical user interface that enables self-service consumption of Google Analytics data.
  • Fast implementation using pre-defined templates that map raw data from Google Analytics to tables in the data lake.
  • Ability to replay historical Google Analytics data stored on Amazon S3.
  • Ability to partition the data on Amazon S3 by hit time reducing complexity of handling late arriving events.
  • Optimize data on Amazon S3 automatically for improved query performance.
  • Automatically manage tables and partitions in AWS Glue Data Catalog.
  • Fully integrated with a suite of AWS native services – Amazon S3, Amazon Athena, Amazon Redshift and Amazon QuickSight.

Now that we have feature parity, we can begin to explore integrating other data sources such as CRM, sales, and customer profile to build a true 360-degree view of the customer.  Furthermore, you can now begin using AWS Machine Learning services to optimize traffic to your websites, forecast demand and personalize the user experience.

We’d love to hear what you think. Please feel free to leave a comment with any feedback or questions you may have.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors


Roy Hasson is the global business development lead of analytics and data lakes at AWS.
He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is a big Manchester United fan who enjoys cheering his team on and hanging out with his family.

Eran Levy is the director of marketing at Upsolver.

Mapping the Underlying Social Structure of Reddit


Feed: R-bloggers.
Author: Posts on Data Science Diarist.

Reddit is a popular website for opinion sharing and news aggregation. The site consists of thousands of user-made forums, called subreddits, which cover a broad range of subjects, including politics, sports, technology, personal hobbies, and self-improvement. Given that most Reddit users contribute to multiple subreddits, one might think of Reddit as being organized into many overlapping communities. Moreover, one might understand the connections among these communities as making up a kind of social structure.

Uncovering a population’s social structure is useful because it tells us something about that population’s identity. In the case of Reddit, this identity could be uncovered by figuring out which subreddits are most central to Reddit’s network of subreddits. We could also study this network at multiple points in time to learn how this identity has evolved and maybe even predict what it’s going to look like in the future.

My goal in this post is to map the social structure of Reddit by measuring the proximity of Reddit communities (subreddits) to each other. I’m operationalizing community proximity as the number of posts to different communities that come from the same user. For example, if a user posts something to subreddit A and posts something else to subreddit B, subreddits A and B are linked by this user. Subreddits connected in this way by many users are closer together than subreddits connected by fewer users. The idea that group networks can be uncovered by studying shared associations among the people that make up those groups goes way back in the field of sociology (Breiger 1974). Hopefully this post will demonstrate the utility of this concept for making sense of data from social media platforms like Reddit.

The data for this post come from an online repository of subreddit submissions and comments that is generously hosted by data scientist Jason Baumgartner. If you plan to download a lot of data from this repository, I implore you to donate a bit of money to keep Baumgartner’s database up and running (pushshift.io/donations/).

Here’s the link to the Reddit submissions data – files.pushshift.io/reddit/submissions/. Each of these files has all Reddit submissions for a given month between June 2005 and May 2019. Files are JSON objects stored in various compression formats that range between .017Mb and 5.77Gb in size. Let’s download something in the middle of this range – a 710Mb file for all Reddit submissions from May 2013. The file is called RS_2013-05.bz2. You can double-click this file to unzip it, or you can use the following command in the Terminal: bzip2 -d RS_2013-05.bz2. The file will take a couple of minutes to unzip. Make sure you have enough room to store the unzipped file on your computer – it’s 4.51Gb. Once I have unzipped this file, I load the relevant packages, read the first line of data from the unzipped file, and look at the variable names.

read_lines("RS_2013-05", n_max = 1) %>% fromJSON() %>% names
##  [1] "edited"                 "title"
      ##  [3] "thumbnail"              "retrieved_on"
      ##  [5] "mod_reports"            "selftext_html"
      ##  [7] "link_flair_css_class"   "downs"
      ##  [9] "over_18"                "secure_media"
      ## [11] "url"                    "author_flair_css_class"
      ## [13] "media"                  "subreddit"
      ## [15] "author"                 "user_reports"
      ## [17] "domain"                 "created_utc"
      ## [19] "stickied"               "secure_media_embed"
      ## [21] "media_embed"            "ups"
      ## [23] "distinguished"          "selftext"
      ## [25] "num_comments"           "banned_by"
      ## [27] "score"                  "report_reasons"
      ## [29] "id"                     "gilded"
      ## [31] "is_self"                "subreddit_id"
      ## [33] "link_flair_text"        "permalink"
      ## [35] "author_flair_text"

For this project, I’m only interested in three of these variables: the user name associated with each submission (author), the subreddit to which a submission has been posted (subreddit), and the time of submission (created_utc). If we could figure out a way to extract these three pieces of information from each line of JSON we could greatly reduce the size of our data, which would allow us to store multiple months worth of information on our local machine. Jq is a command-line JSON processor that makes this possible.

To install jq on a Mac, you need to make sure you have Homebrew (brew.sh/), a package manager that works in the Terminal. Once you have Homebrew, in the Terminal type brew install jq. I’m going to use jq to extract the variables I want from RS_2013-05 and save the result as a .csv file. To select variables with jq, list the JSON field names that you want like this: [.author, .created_utc, .subreddit]. I return these as raw output (-r) and render this as a csv file (@csv). Here’s the command that does all this:

jq -r '[.author, .created_utc, .subreddit] | @csv' RS_2013-05 > parsed_json_to_csv_2013_05

Make sure the Terminal directory is set to wherever RS_2013-05 is located before running this command. The file that results from this command will be saved as “parsed_json_to_csv_2013_05”. This command parses millions of lines of JSON (every Reddit submission from 05-2013), so this process can take a few minutes. In case you’re new to working in the Terminal, if there’s a blank line at the bottom of the Terminal window, that means the process is still running. When the directory name followed by a dollar sign reappears, the process is complete. This file, parsed_json_to_csv_2013_05, is about 118Mb, much smaller than 4.5Gb.

Jq is a powerful tool for automating the process of downloading and manipulating data right from your hard drive. I’ve written a bash script that lets you download multiple files from the Reddit repository, unzip them, extract the relevant fields from the resulting JSON, and delete the unparsed files (Reddit_Download_Script.bash). You can modify this script to pull different fields from the JSON. For instance, if you want to keep the content of Reddit submissions, add .selftext to the fields that are included in the brackets.
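
If you would rather drive the whole process from R, here is a rough sketch that wraps the same steps in system() calls. It assumes wget, bzip2, and jq are installed and on your PATH, and the month list is only an example – later months in the repository use different compression formats, so the unzip step may need adjusting.

# Download, unzip, parse, and clean up one month at a time (example months only)
months <- c("2013-03", "2013-04", "2013-05")
for (m in months) {
  system(paste0("wget https://files.pushshift.io/reddit/submissions/RS_", m, ".bz2"))
  system(paste0("bzip2 -d RS_", m, ".bz2"))
  system(paste0("jq -r '[.author, .created_utc, .subreddit] | @csv' RS_", m,
                " > parsed_json_to_csv_", gsub("-", "_", m)))
  file.remove(paste0("RS_", m))  # delete the large unparsed file
}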

Now that I have a reasonably sized .csv file with the fields I want, I am ready to bring the data into R and analyze them as a network.

Each row of the data currently represents a unique submission to Reddit from a user. I want to turn this into a dataframe where each row represents a link between subreddits through a user. One problem that arises from this kind of data manipulation is that there are more rows in the network form of this data than there are in the current form of the data. To see this, consider a user who has submitted to 10 different subreddits. These submissions would take up ten rows of our dataframe in its current form. However, this data would be represented by 10 choose 2, or 45, rows of data in its network form. This is every combination of 2 subreddits among those to which the user has posted. This number grows quadratically as the number of submissions from the same user increases. For this reason, the only way to convert the data into a network form without causing R to crash is to convert the data into a Spark dataframe. Spark is a distributed computing platform that partitions large datasets into smaller chunks and operates on these chunks in parallel. If your computer has a multicore processor, Spark allows you to work with big-ish data on your local machine. I will be using a lot of functions from the sparklyr package, which supplies a dplyr backend to Spark. If you’re new to Spark and sparklyr, check out RStudio’s guide for getting started with Spark in R (spark.rstudio.com/).

Once I have Spark configured, I import the data into R as a Spark dataframe.

sc <- spark_connect(master = "local")
reddit_data <- spark_read_csv(sc, "reddit_data", "parsed_json_to_csv_2013_05", header = FALSE)

To begin, I make a few changes to the data – renaming columns, converting the time variable from utc time to the day of the year, and removing submissions from deleted accounts. I also remove submissions from users who have posted only once – these would contribute nothing to the network data – and submissions from users who have posted 60 or more times – these users are likely bots.

reddit_data <- reddit_data %>%
          rename(author = V1, created_utc = V2, subreddit = V3) %>%
          mutate(dateRestored = timestamp(created_utc + 18000)) %>%
          mutate(day = dayofyear(dateRestored)) %>%
          filter(author != "[deleted]") %>% group_by(author) %>% mutate(count = count()) %>%
          filter(count < 60) %>% filter(count > 1) %>%
          ungroup()

Next, I create a key that gives a numeric id to each subreddit. I add these ids to the data, and select the variables “author”, “day”, “count”, “subreddit”, and “id” from the data. Let’s have a look at the first few rows of the data.

subreddit_key <- reddit_data %>% distinct(subreddit) %>% sdf_with_sequential_id()

      reddit_data <- left_join(reddit_data, subreddit_key, by = "subreddit") %>%
        select(author, day, count, subreddit, id)

      head(reddit_data)
## # Source: spark> [?? x 5]
      ##   author           day count subreddit             id
      ##   
      ## 1 Bouda            141     4 100thworldproblems  2342
      ## 2 timeXalchemist   147     4 100thworldproblems  2342
      ## 3 babydall1267     144    18 123recipes          2477
      ## 4 babydall1267     144    18 123recipes          2477
      ## 5 babydall1267     144    18 123recipes          2477
      ## 6 babydall1267     144    18 123recipes          2477

We have 5 variables. The count variable shows the number of times a user has posted to Reddit in May 2013, the id variable gives the subreddit’s numeric id, the day variable tells us what day of the year a submission has been posted, and the author and subreddit variables give user and subreddit names. We are now ready to convert this data to network format. The first thing I do is take an “inner_join” of the data with itself, merging by the “author” variable. For each user, the number of rows this returns will be the square of the number of submissions from that user. I filter this down to “number of submissions choose 2” rows for each user. This takes two steps. First, I remove rows that link subreddits to themselves. Then I remove duplicate rows. For instance, AskReddit-funny is a duplicate of funny-AskReddit. I remove one of these.

The subreddit id variable will prove useful for removing duplicate rows. If we can mutate two id variables into a new variable that gives a unique identifier to each subreddit pair, we can filter duplicates of this identifier. We need a mathematical equation that takes two numbers and returns a unique number (i.e. a number that can only be produced from these two numbers) regardless of number order. One such equation is the Cantor Pairing Function (wikipedia.org/wiki/Pairing_function):
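
The standard Cantor pairing function maps a pair of non-negative integers (k1, k2) to the unique number

C(k1, k2) = (k1 + k2)(k1 + k2 + 1)/2 + k2

Since this standard form depends on the order of its arguments, the function below swaps the final k2 term for max(id, id2), so that a pair of subreddit ids yields the same identifier regardless of which id comes first.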

Let’s define a function in R that takes a dataframe containing two id variables, runs the id variables through Cantor’s Pairing Function and appends the result to the dataframe, filters duplicate cantor ids from the dataframe, and returns the result. We’ll call this function cantor_filter.

cantor_filter <- function(df) {
        df %>% mutate(id_pair = .5*(id + id2)*(id + id2 + 1) + pmax(id, id2)) %>% group_by(author, id_pair) %>%
          filter(row_number(id_pair) == 1) %>% return()
      }

Next, I apply an inner_join to the Reddit data and apply the filters described above to the resulting dataframe.

reddit_network_data <- inner_join(reddit_data, reddit_data %>%
                              rename(day2 = day, count2 = count,
                              subreddit2 = subreddit, id2 = id),
                              by = "author") %>%
                 filter(subreddit != subreddit2) %>%
                 group_by(author, subreddit, subreddit2) %>%
                 filter(row_number(author) == 1) %>%
                 cantor_filter() %>%
                 select(author, subreddit, subreddit2, id, id2, day, day2, id_pair) %>%
                 ungroup %>% arrange(author)

Let’s take a look at the new data.

reddit_network_data
## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
      ## Please use `call_name()` instead.
      ## This warning is displayed once per session.
## Warning: `lang()` is deprecated as of rlang 0.2.0.
      ## Please use `call2()` instead.
      ## This warning is displayed once per session.
## # Source:     spark> [?? x 8]
      ## # Ordered by: author
      ##    author     subreddit     subreddit2        id   id2   day  day2  id_pair
      ##    
      ##  1 --5Dhere   depression    awakened        7644 29936   135   135   7.06e8
      ##  2 --Adam--   AskReddit     techsupport    15261 28113   135   142   9.41e8
      ##  3 --Caius--  summonerscho… leagueoflegen…    79     3   124   142   3.48e3
      ##  4 --Gianni-- AskReddit     videos         15261  5042   125   138   2.06e8
      ##  5 --Gianni-- pics          AskReddit       5043 15261   126   125   2.06e8
      ##  6 --Gianni-- movies        pics           20348  5043   124   126   3.22e8
      ##  7 --Gianni-- gaming        videos         10158  5042   131   138   1.16e8
      ##  8 --Gianni-- gaming        pics           10158  5043   131   126   1.16e8
      ##  9 --Gianni-- movies        AskReddit      20348 15261   124   125   6.34e8
      ## 10 --Gianni-- movies        videos         20348  5042   124   138   3.22e8
      ## # … with more rows

We now have a dataframe where each row represents a link between two subreddits through a distinct user. Many pairs of subreddits are connected by multiple users. We can think of subreddit pairs connected through more users as being more connected than subreddit pairs connected by fewer users. With this in mind, I create a “weight” variable that tallies the number of users connecting each subreddit pair and then filters the dataframe to unique pairs.

reddit_network_data <- reddit_network_data %>% group_by(id_pair) %>%
        mutate(weight = n()) %>% filter(row_number(id_pair) == 1) %>%
        ungroup

Let’s have a look at the data and see how many rows it has.

reddit_network_data
## # Source:     spark> [?? x 9]
      ## # Ordered by: author
      ##    author     subreddit   subreddit2    id   id2   day  day2 id_pair weight
      ##    
      ##  1 h3rbivore  psytrance   DnB            8     2   142   142      63      1
      ##  2 StRefuge   findareddit AlienBlue     23     5   133   134     429      1
      ##  3 DylanTho   blackops2   DnB           28     2   136   138     493      2
      ##  4 TwoHardCo… bikewrench  DnB           30     2   137   135     558      1
      ##  5 Playbook4… blackops2   AlienBlue     28     5   121   137     589      2
      ##  6 A_Jewish_… atheism     blackops2      6    28   139   149     623     14
      ##  7 SirMechan… Terraria    circlejerk    37     7   150   143    1027      2
      ##  8 Jillatha   doctorwho   facebookw…    36     9   131   147    1071      2
      ##  9 MeSire     Ebay        circlejerk    39     7   132   132    1120      3
      ## 10 Bluesfan6… SquaredCir… keto          29    18   126   134    1157      2
      ## # … with more rows
reddit_network_data %>% sdf_nrow
## [1] 744939

We’re down to ~750,000 rows. The weight column shows that many of the subreddit pairs in our data are only connected by 1 or 2 users. We can substantially reduce the size of the data without losing the subreddit pairs we’re interested in by removing these rows. I decided to remove subreddit pairs that are connected by 3 or fewer users. I also opt at this point to stop working with the data as a Spark object and bring the data into the R workspace as a dataframe. The network analytic tools I use next require working on regular dataframes, and our data is now small enough that we can do this without any problems. Because we’re moving into the R workspace, I save this as a new dataframe called reddit_edgelist.

 reddit_edgelist <- reddit_network_data %>% filter(weight > 3) %>%
        select(id, id2, weight) %>% arrange(id) %>%
        # Bringing the data into the R workspace
        dplyr::collect()

Our R dataframe consists of three columns: two id columns that provide information on connections between nodes and a weight column that tells us the strength of each connection. One nice thing to have would be a measure of the relative importance of each subreddit. A simple way to get this would be to count how many times each subreddit appears in the data. I compute this for each subreddit by adding the weight values in the rows where that subreddit appears. I then create a dataframe called subreddit_imp_key that lists subreddit ids by subreddit importance.

subreddit_imp_key <- full_join(reddit_edgelist %>% group_by(id) %>%
                                       summarise(count = sum(weight)),
                  reddit_edgelist %>% group_by(id2) %>%
                    summarise(count2 = sum(weight)),
                  by = c("id" = "id2")) %>%
                  mutate(count = ifelse(is.na(count), 0, count)) %>%
                  mutate(count2 = ifelse(is.na(count2), 0, count2)) %>%
                  mutate(id = id, imp = count + count2) %>% select(id, imp) 

Let’s see which subreddits are the most popular on Reddit according to the subreddit importance key.

left_join(subreddit_imp_key, subreddit_key %>% dplyr::collect(), by = "id") %>%
        arrange(desc(imp))
## # A tibble: 5,561 x 3
      ##       id    imp subreddit
      ##    
      ##  1 28096 107894 funny
      ##  2 15261 101239 AskReddit
      ##  3 20340  81208 AdviceAnimals
      ##  4  5043  73119 pics
      ##  5 10158  51314 gaming
      ##  6  5042  47795 videos
      ##  7 17856  47378 aww
      ##  8  2526  37311 WTF
      ##  9 22888  31702 Music
      ## 10  5055  26666 todayilearned
      ## # … with 5,551 more rows

These subreddits are mostly about memes and gaming, which are indeed two things that people commonly associate with Reddit.

Next, I reweight the edge weights in reddit_edgelist by subreddit importance. The reason I do this is that the number of users connecting subreddits is partially a function of subreddit popularity. Reweighting by subreddit importance, I control for the influence of this confounding variable.

reddit_edgelist <- left_join(reddit_edgelist, subreddit_imp_key, by = "id") %>%
                        left_join(., subreddit_imp_key %>% rename(imp2 = imp),
                                  by = c("id2" = "id")) %>%
        mutate(imp_fin = (imp + imp2)/2) %>% mutate(weight = weight/imp_fin) %>%
        select(id, id2, weight)

      reddit_edgelist
## # A tibble: 56,257 x 3
      ##       id   id2   weight
      ##    
      ##  1     1 12735 0.0141
      ##  2     1 10158 0.000311
      ##  3     1  2601 0.00602
      ##  4     1 17856 0.000505
      ##  5     1 22900 0.000488
      ##  6     1 25542 0.0185
      ##  7     1 15260 0.00638
      ##  8     1 20340 0.000320
      ##  9     2  2770 0.0165
      ## 10     2 15261 0.000295
      ## # … with 56,247 more rows

We now have our final edgelist. There are about 56,000 rows in the data, though most edges have very small weights. Next, I use the igraph package to turn this dataframe into a graph object. Graph objects can be analyzed using igraph’s clustering algorithms. Let’s have a look at what this graph object looks like.

reddit_graph <- graph_from_data_frame(reddit_edgelist, directed = FALSE)
## IGRAPH 2dc5bc4 UNW- 5561 56257 --
      ## + attr: name (v/c), weight (e/n)
      ## + edges from 2dc5bc4 (vertex names):
      ##  [1] 1--12735 1--10158 1--2601  1--17856 1--22900 1--25542 1--15260
      ##  [8] 1--20340 2--2770  2--15261 2--18156 2--20378 2--41    2--22888
      ## [15] 2--28115 2--10172 2--5043  2--28408 2--2553  2--2836  2--28096
      ## [22] 2--23217 2--17896 2--67    2--23127 2--2530  2--2738  2--7610
      ## [29] 2--20544 2--25566 2--3     2--7     2--7603  2--12931 2--17860
      ## [36] 2--6     2--2526  2--5055  2--18253 2--22996 2--25545 2--28189
      ## [43] 2--10394 2--18234 2--23062 2--25573 3--264   3--2599  3--5196
      ## [50] 3--7585  3--10166 3--10215 3--12959 3--15293 3--20377 3--20427
      ## + ... omitted several edges

Here we have a list of all of the edges from the dataframe. I can now use a clustering algorithm to analyze the community structure that underlies this subreddit network. The clustering algorithm I choose to use here is the Louvain algorithm. This algorithm takes a network and groups its nodes into different communities in a way that maximizes the modularity of the resulting network. By maximizing modularity, the Louvain algorithm groups nodes in a way that maximizes the number of within-group ties and minimizes the number of between-group ties.

Let’s apply the algorithm and see if the groupings it produces make sense. I store the results of the algorithm in a tibble with other relevant information. See code annotations for a more in-depth explanation of what I’m doing here.

reddit_communities <- cluster_louvain(reddit_graph)

      subreddit_by_comm <- tibble(
        # Subreddit ids (vertex names) pulled out of the communities object
        id = reddit_communities[] %>% unlist %>% as.numeric,
        # Creating a community ids column and using rep function with map to populate
        # a column with community ids created by
        # Louvain alg
        comm = rep(reddit_communities[] %>%
                     names, map(reddit_communities[], length) %>% unlist) %>%
                     as.numeric) %>%
        # Adding subreddit names
        left_join(., subreddit_key %>% dplyr::collect(), by = "id") %>%
        # Keeping subreddit name, subreddit id, community id
        select(subreddit, id, comm) %>%
        # Adding subreddit importance
        left_join(., subreddit_imp_key, by = "id")

Next, I calculate community importance by summing the subreddit importance scores of the subreddits in each community.

subreddit_by_comm <- subreddit_by_comm %>% group_by(comm) %>% mutate(comm_imp = sum(imp)) %>% ungroup

I create a tibble of the 10 most important communities on Reddit according to the subreddit groupings generated by the Louvain algorithm. This tibble displays the 10 largest subreddits in each of these communities. Hopefully, these subreddits will be similar enough that we can discern what each community represents.

comm_ids <- subreddit_by_comm %>% group_by(comm) %>% slice(1) %>% arrange(desc(comm_imp)) %>% .[["comm"]]

      top_comms <- list()
      for (i in 1:10) {
        top_comms[[i]] <- subreddit_by_comm %>% filter(comm == comm_ids[i]) %>% arrange(desc(imp)) %>% .[["subreddit"]] %>% .[1:10]
      }

      comm_tbl <- tibble(Community = 1:10,
                         Subreddits = map(top_comms, function(x) paste(x, collapse = " ")) %>% unlist)

Let’s have a look at the 10 largest subreddits in each of the 10 largest communities. These are in descending order of importance.

options(kableExtra.html.bsTable = TRUE)

      comm_tbl %>%
      kable("html") %>%
        kable_styling("hover", full_width = F) %>%
        column_spec(1, bold = T, border_right = "1px solid #ddd;") %>%
        column_spec(2, width = "30em")
Community Subreddits
1 funny AskReddit AdviceAnimals pics gaming videos aww WTF Music todayilearned
2 DotA2 tf2 SteamGameSwap starcraft tf2trade Dota2Trade GiftofGames SteamTradingCards Steam vinyl
3 electronicmusic dubstep WeAreTheMusicMakers futurebeats trap edmproduction electrohouse EDM punk ThisIsOurMusic
4 hockey fantasybaseball nhl Austin DetroitRedWings sanfrancisco houston leafs BostonBruins mlb
5 cars motorcycles Autos sysadmin carporn formula1 Jeep subaru Cartalk techsupportgore
6 web_design Entrepreneur programming webdev Design windowsphone SEO forhire startups socialmedia
7 itookapicture EarthPorn AbandonedPorn HistoryPorn photocritique CityPorn MapPorn AnimalPorn SkyPorn Astronomy
8 wow darksouls Diablo Neverwinter Guildwars2 runescape diablo3 2007scape swtor Smite
9 blackops2 battlefield3 dayz Eve Planetside aviation airsoft WorldofTanks Warframe CallOfDuty
10 soccer Seattle Fifa13 Portland MLS Gunners reddevils chelseafc football LiverpoolFC

The largest community in this table, community 1, happens to contain the ten most popular subreddits on Reddit. Although some of these subreddits are similar in terms of their content – many of them revolve around memes, for example – a couple of them do not (e.g. videos and gaming). One explanation is that this first group of subreddits represents mainstream Reddit. In other words, the people who post to these subreddits are generalist posters – they submit to a broad enough range of subreddits that categorizing these subreddits into any of the other communities would reduce the modularity of the network.

The other 9 communities in the figure are easier to interpret. Each one revolves around a specific topic. Communities 2, 8, and 9 are gaming communities dedicated to specific games; communities 4 and 10 are sports communities; the remaining communities are dedicated to electronic music, cars, web design, and photography.

In sum, we have taken a month’s worth of Reddit submissions, converted them into a network, and identified subreddit communities from them. How successful were we? On one hand, the Louvain algorithm correctly identified many medium-sized communities revolving around specific topics. It’s easy to imagine that the people who post to these groups of subreddits contribute almost exclusively to them, and that it therefore makes sense to think of them as communities. On the other hand, the largest community has some pretty substantively dissimilar subreddits. These also happen to be the largest subreddits on Reddit. The optimistic interpretation of this grouping is that these subreddits encompass a community of mainstream users. However, the alternative possibility is that this community is really just a residual category of subreddits that don’t really belong together but also don’t have any obvious place in the other subreddit communities. Let’s set this issue to the side for now.

In the next section, I visualize these communities as a community network and examine how this network has evolved over time.

In the last section, I generated some community groupings of subreddits. While these give us some idea of the social structure of Reddit, one might want to know how these communities are connected to each other. In this section, I take these community groupings and build a community-level network from them. I then create some interactive visualizations that map the social structure of Reddit and show how this structure has evolved over time.

The first thing I want to do is return to the subreddit edgelist, our dataframe of subreddit pairs and the strength of their connections, and merge this with community id variables corresponding to each subreddit. I filter the dataframe to only include unique edges, and add a variable called weight_fin, which is the average of the subreddit edge weights between each pair of communities. I also filter out links in the community-level edgelist that connect communities to themselves. I realize that there’s a lot going on in the code below. Feel free to contact me if you have any questions about what I’m doing here.

community_edgelist <- left_join(reddit_edgelist, subreddit_by_comm %>% select(id, comm), by = "id") %>%
        left_join(., subreddit_by_comm %>% select(id, comm) %>% rename(comm2 = comm), by = c("id2"= "id")) %>%
        select(comm, comm2, weight) %>%
        mutate(id_pair = .5*(comm + comm2)*(comm + comm2 + 1) + pmax(comm,comm2)) %>% group_by(id_pair) %>%
        mutate(weight_fin = mean(weight)) %>% slice(1) %>% ungroup %>% select(comm, comm2, weight_fin) %>%
        filter(comm != comm2) %>%
        arrange(desc(weight_fin))

I now have a community-level edgelist, with which we can visualize a network of subreddit communities. I first modify the edge weight variable to discriminate between communities that are more and less connected. I choose an arbitrary cutoff point (.007) and set all weights below this cutoff to 0. Although doing this creates a risk of imposing structure on the network where there is none, this cutoff will help highlight significant ties between communities.

community_edgelist_ab <- community_edgelist %>%
        mutate(weight =  ifelse(weight_fin > .007, weight_fin, 0)) %>%
        filter(weight!=0) %>% mutate(weight = abs(log(weight)))

The visualization tools that I use here come from the visnetwork package. For an excellent set of tutorials on network visualizations in R, check out the tutorials section of Professor Katherine Ognyanova’s website (kateto.net/tutorials/). Much of what I know about network visualization in R I learned from the “Static and dynamic network visualization in R” tutorial.

Visnetwork’s main function, visNetwork, requires two arguments, one for nodes data and one for edges data. These dataframes need to have particular column names for visnetwork to be able to make sense of them. Let’s start with the edges data. The column names for the nodes corresponding to edges in the edgelist need to be called “from” and “to”, and the column name for edge weights needs to be called “weight”. I make these adjustments.

community_edgelist_mod <- community_edgelist_ab %>%
        rename(from = comm, to = comm2) %>% select(from, to, weight) 

Also, visnetwork’s default edges are curved. I prefer straight edges. To ensure edges are straight, add a smooth column and set it to FALSE.

community_edgelist_mod$smooth <- FALSE

I’m now ready to set up the nodes data. First, I extract all nodes from the community edgelist.

community_nodes <- c(community_edgelist_mod %>% .[["from"]], community_edgelist_mod %>% .[["to"]]) %>% unique

Visnetwork has this really cool feature that lets you view node labels by hovering over them with your mouse cursor. I’m going to label each community with the names of the 4 most popular subreddits in that community.

comm_by_label <- subreddit_by_comm %>% arrange(comm, desc(imp)) %>% group_by(comm) %>% slice(1:4) %>%
        summarise(title = paste(subreddit, collapse = " "))

Next, I put node ids and community labels in a tibble. Note that the label column in this tibble has to be called “title”.

community_nodes_fin <- tibble(comm = community_nodes) %>% left_join(., comm_by_label, by = "comm")

I want the nodes of my network to vary in size based on the size of each community. To do this, I create a community importance key. I’ve already calculated community importance above. I extract this score for each community from the subreddit_by_comm dataframe and merge these importance scores with the nodes data. I rename the community importance variable “size” and the community id variable “id”, which are the column names that visnetwork recognizes.

comm_imp_key <- subreddit_by_comm %>% group_by(comm) %>% slice(1) %>%
        arrange(desc(comm_imp)) %>% select(comm, comm_imp)

community_nodes_fin <- community_nodes_fin %>% left_join(., comm_imp_key, by = "comm") %>%
        rename(size = comm_imp, id = comm)

One final issue is that my “mainstream Reddit/residual subreddits” community is so much bigger than the other communities that the network visualization will be overtaken by it if I don’t adjust the size variable. I remedy this by raising community size to the power of 0.3 (close to the cube root).

community_nodes_fin <- community_nodes_fin %>% mutate(size = size^.3)

I can now enter the nodes and edges data into the visNetwork function. I make a few final adjustments to the default parameters. Visnetwork now lets you use layouts from the igraph package. I use visIgraphLayout to set the position of the nodes according to the Fruchterman-Reingold Layout Algorithm (layout_with_fr). I also adjust edge widths and set highlightNearest to TRUE. This lets you highlight a node and the nodes it is connected to by clicking on it. Without further ado, let’s have a look at the network.

2013 Reddit Network.

The communities of Reddit do not appear to be structured into distinct categories. We don’t see a cluster of hobby communities and a different cluster of advice communities, for instance. Instead, we have some evidence to suggest that the strongest ties are among some of the larger subcultures of Reddit. Many of the nodes in the large cluster of communities above are ranked in the 2-30 range in terms of community size. On the other hand, the largest community (mainstream Reddit) is out on an island, with only a few small communities around it. This suggests that the ties between mainstream Reddit and some of Reddit’s more niche communities are weaker than the ties among the latter. In other words, fringe subcultures of Reddit are more connected to each other than they are to Reddit’s mainstream.

The substance of these fringe communities lends credence to this interpretation. Many of the communities in the large cluster are somewhat related in their content. There are a lot of gaming communities, several drug and music communities, a couple of sports communities, and a few communities that combine gaming, music, sports, and drugs in different ways. Indeed, most of the communities in this cluster revolve around activities commonly associated with young men. One might even infer from this network that Reddit is organized into two social spheres, one consisting of adolescent men and the other consisting of everybody else. Still, I should caution the reader against extrapolating too much from the network above. These ties are based on 30 days of submissions. It’s possible that something occurred during this period that momentarily brought certain Reddit communities closer together than they would be otherwise. There are links among some nodes in the network that don’t make much logical sense. For instance, the linux/engineering/3D-Printing community (which only sort of makes sense as a community) is linked to a “guns/knives/coins” community. This strikes me as a bit strange, and I wonder if these communities would look the same if I took data from another time period. Still, many of the links here make a lot of sense. For example, the Bitcoin/Conservative/Anarcho_Capitalism community is tied to the Anarchism/progressive/socialism/occupywallstreet community. The Drugs/bodybuilding community is connected to the MMA/Joe Rogan community. That one makes almost too much sense. Anyway, I encourage you to click on the network nodes to see what you find.

One of the coolest things about the Reddit repository is that it contains temporally precise information on everything that’s happened on Reddit from its inception to only a few months ago. In the final section of this post, I rerun the above analyses on all the Reddit submissions from May 2017 and May 2019. I’m using the bash script I linked to above to do this. Let’s have a look at the community networks from 2017 and 2019 and hopefully gain some insight into how Reddit has evolved over the past several years.

2017 Reddit Network.

Perhaps owing to the substantial growth of Reddit between 2013 and 2017, we start to see a hierarchical structure among the communities that we didn’t see in the previous network. A few of the larger communities now have smaller communities budding off of them. I see four such “parent communities”. One of them is the music community. There’s a musicals/broadway community, a reggae community, an anime music community, and a “deepstyle” (whatever that is) community stemming from it. Another parent community is the sports community, which has a few location-based communities, a lacrosse community, and a Madden community abutting it. The other two parent communities are porn communities. I won’t name the communities stemming from these, but as you might guess many of them revolve around more niche sexual interests.

This brings us to another significant change between this network and the one from 2013: the emergence of porn on Reddit. We now see that two of the largest communities involve porn. We also start to see some differentiation among the porn communities. There is a straight porn community, a gay porn community, and a sex-based kik community (kik is a messenger app). It appears that since 2013 Reddit has increasingly served some of the same functions as Craigslist, providing users with a place to arrange to meet up, either online or in person, for sex. As we’ll see in the 2019 network, this function has only continued to grow. This is perhaps due to the Trump Administration’s sex trafficking bill and Craigslist’s decision to shut down its “casual encounters” personal ads in 2018.

Speaking of Donald Trump, where is he in our network? As it turns out, this visualization belies the growing presence of Donald Trump on Reddit between 2013 and 2017. The_Donald is a subreddit for fans of Donald Trump that quickly became one of the most popular subreddits on Reddit during this time. The reason that we don’t see it here is that it falls into the mainstream Reddit community, and despite its popularity it is not one of the four largest subreddits in this community. The placement of The_Donald in this community was one of the most surprising results of this project. I had expected The_Donald to fall into a conservative political community. The reason The_Donald falls into the mainstream community, I believe, is that much of The_Donald consists of news and memes, the bread and butter of Reddit. Many of the most popular subreddits in the mainstream community are meme subreddits – Showerthoughts, dankmemes, funny – and the overlap between users who post to these subreddits and users who post to The_Donald is substantial.

2019 Reddit Network.

That brings us to May 2019. What’s changed from 2017? The network structure is similar – we have two groups, mainstream Reddit and an interconnected cluster of more niche communities. This cluster has the same somewhat hierarchical structure that we saw in the 2017 network, with a couple of large “parent communities” that are porn communities. This network also shows the rise of Bitcoin on Reddit. While Bitcoin was missing from the 2017 network, in 2019 it constitutes one of the largest communities on the entire site. It’s connected to a conspiracy theory community, a porn community, a gaming community, an exmormon/exchristian community, a tmobile/verizon community, and an architecture community. While some of these ties may be coincidental, some of them likely reflect real sociocultural overlaps.

That’s all I have for now. My main takeaway from this project is that Reddit consists of two worlds, a “mainstream” Reddit that is comprised of meme and news subreddits and a more fragmented, “fringe” Reddit that is made up of groups of porn, gaming, hobbyist, Bitcoin, sports, and music subreddits. This begs the question of how these divisions map onto real social groups. It appears that the Reddit communities outside the mainstream revolve around topics that are culturally associated with young men (e.g. gaming, vaping, Joe Rogan). Is the reason for this that young men are more likely to post exclusively to a handful of somewhat culturally subversive subreddits that other users are inclined to avoid? Unfortunately, we don’t have the data to answer this question, but this hypothesis is supported by the networks we see here.

The next step to take on this project will be to figure out how to allow for overlap between subreddit communities. As I mentioned, the clustering algorithm I used here forces subreddits into single communities. This distorts how communities on Reddit are really organized. Many subreddits appeal to multiple and distinct interests of Reddit users. For example, many subreddits attract users with a common political identity while also providing users with a news source. City-based subreddits attract fans of cities’ sports teams but also appeal to people who want to know about non-sports-related local events. That subreddits can serve multiple purposes could mean that the algorithm I use here lumped together subreddits that belong in distinct and overlapping communities. It also suggests that my mainstream Reddit community could really be a residual community of liminal subreddits that do not have a clear categorization. A clustering algorithm that allowed for community overlap would elucidate which subreddits span multiple communities. SNAP (Stanford Network Analysis Project) has tools in Python that seem promising for this kind of research. Stay tuned!




Introduction to the Partition By Window Function

$
0
0

Feed: Databasejournal.com – Feature Database Articles.
Author: .

T-SQL window functions perform calculations over a set of rows (known as a “window”) and return a single value for each row from the underlying query. A window (or windowing or windowed) function makes use of the values from the rows in a window to calculate the returned values.

A window is defined by using the OVER() clause. The OVER() T-SQL clause does the following:

  1. It defines window partitions by using the PARTITION BY clause.
  2. It orders rows within partitions by using the ORDER BY clause.

The OVER() clause can accept three different arguments:

  1. PARTITION BY – PARTITION BY splits the result set into partitions; the window function restarts its calculation each time the value of the partitioning column changes.
  2. ORDER BY – ORDER BY orders the rows (in the window only) the function evaluates.
  3. ROWS BETWEEN – ROWS BETWEEN enables you to further limit the rows in the window.

The main focus of this article is the PARTITION BY clause, but I may touch on some of the other clauses as well.

A Small Script

Let’s assume you are an avid car-sport fan, and you have kept track of the different drivers, the different cars, and the speeds accomplished with each car on specific dates. A query such as the following would give you detailed results.

SELECT
       SpeedTestID,
       SpeedTestDate,
       CarID,
       CarSpeed,
       ROW_NUMBER() OVER (PARTITION BY SpeedTestDate, CarID ORDER BY SpeedTestID) AS SpeedTestsDoneToday, --Number of the row within the window, ordered by SpeedTestID
       SUM(CarSpeed) OVER () AS CarSpeedTotal, --Grand total of CarSpeed for the entire result set
       SUM(CarSpeed) OVER (PARTITION BY SpeedTestDate) AS SpeedTotal, --Total CarSpeed for the row's SpeedTestDate
       SUM(CarSpeed) OVER (PARTITION BY SpeedTestDate, CarID) AS SpeedTotalPerCar, --Total CarSpeed for the row's SpeedTestDate AND Car
       AVG(CarSpeed) OVER (PARTITION BY SpeedTestDate) AS SpeedAvg, --Average CarSpeed for the row's SpeedTestDate
       AVG(CarSpeed) OVER (PARTITION BY SpeedTestDate, CarID) AS SpeedAvgPerCar --Average CarSpeed for the row's SpeedTestDate AND Car
FROM SpeedTests

Let’s break the code down.

The first few lines should be obvious; you should already know (especially if you’re reading this article) how to select data and specify the fields needed in the select query.

ROW_NUMBER() will display the current number of the row being calculated by the window statement. In this case, the window is partitioned by the date of the speed test and the car’s ID, and the rows are ordered by the SpeedTestID field.

The next few lines make use of aggregate functions to work out the grand total CarSpeed across all speed tests, the total CarSpeed for the row’s date, the total CarSpeed for a specific car on a specific date, as well as the corresponding averages.

Let’s take the example a few steps further!

Add the following into the query:

SUM(CarSpeed) OVER (ORDER BY SpeedTestDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SpeedRunningTotal, --Add all the CarSpeed values in rows up to and including the current row

SUM(CarSpeed) OVER (ORDER BY SpeedTestDate ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS SpeedSumLast4 --Add all the CarSpeed values in rows between the current row and the 3 rows before it

By using ROWS BETWEEN you narrow the scope to be evaluated by the window function. The function will simply begin and end where ROWS BETWEEN specifies.

Let’s get crazy. Add the next few lines to the script:

FIRST_VALUE(CarSpeed) OVER (ORDER BY SpeedTestDate) AS FirstSpeed, --FIRST_VALUE returns the first CarSpeed in the result set

LAST_VALUE(CarSpeed) OVER (ORDER BY SpeedTestDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastSpeed, --LAST_VALUE returns the last CarSpeed in the result set

Here, you made use of First_Value and Last_Value, which are quite aptly named. Get the first value and get the last value.

Even wackier… Add the next code:

LAG(CarSpeed, 1, 0) OVER (ORDER BY SpeedTestID) AS PrevSpeed, --LAG returns the CarSpeed from 1 row behind the current row

LEAD(CarSpeed, 3) OVER (ORDER BY SpeedTestID) AS NextSpeed, --LEAD returns the CarSpeed from 3 rows ahead

LAG gets the speed from one row before the current row, and LEAD gets the speed from three rows after it.

The reason I put these in is so that you can see the true power of window functions, as this is just the tip of the iceberg.

Conclusion

Window functions can be life savers by making complicated SQL calculations easy. Instead of writing massive SQL statements to work out certain logic, a window function encapsulates that logic and returns results row by row or window by window.

cary huang: A Guide to Basic Postgres Partition Table and Trigger Function

$
0
0

Feed: Planet PostgreSQL.

1. Overview

Table partitioning provides several performance improvements under heavy load; partitioning by inheritance has long been available, and native declarative partitioning was added in PostgreSQL 10. Partitioning refers to splitting one logically large table into smaller pieces, which in turn distributes heavy loads across those smaller pieces (also known as partitions).

There are several ways to define a partition table, such as declarative partitioning and partitioning by inheritance. In this article we will focus on a simple form of declarative partitioning by value range.

Later in this article, we will discuss how we can define a TRIGGER to work with a FUNCTION to make table updates more dynamic.

2. Creating a Table Partition by Range

Let’s define a use case. Say we are a world-famous IT consulting company and there is a database table called salesman_performance, which contains all the sales personnel worldwide and their lifetime sales revenue. Technically it is possible to have one table containing every salesperson in the world, but as the number of entries grows much larger, query performance may be greatly reduced.

Here, we would like to create 7 partitions, representing 7 different levels of sales (or ranks) like so:

CREATE TABLE salesman_performance (
        salesman_id int not NULL,
        first_name varchar(45),
        last_name varchar(45),
        revenue numeric(11,2),
        last_updated timestamp
) PARTITION BY RANGE (revenue);

Please note that we have to specify that it is a partition table by using the keyword “PARTITION BY RANGE”. It is not possible to alter an already created table and make it a partitioned table.

Now, let’s create 7 partitions based on revenue performance:

CREATE TABLE salesman_performance_chief PARTITION OF salesman_performance
        FOR VALUES FROM (100000000.00) TO (999999999.99);

CREATE TABLE salesman_performance_elite PARTITION OF salesman_performance
        FOR VALUES FROM (10000000.00) TO (99999999.99);

CREATE TABLE salesman_performance_above_average PARTITION OF salesman_performance
        FOR VALUES FROM (1000000.00) TO (9999999.99);

CREATE TABLE salesman_performance_average PARTITION OF salesman_performance
        FOR VALUES FROM (100000.00) TO (999999.99);

CREATE TABLE salesman_performance_below_average PARTITION OF salesman_performance
        FOR VALUES FROM (10000.00) TO (99999.99);

CREATE TABLE salesman_performance_need_work PARTITION OF salesman_performance
        FOR VALUES FROM (1000.00) TO (9999.99);

CREATE TABLE salesman_performance_poor PARTITION OF salesman_performance
        FOR VALUES FROM (0.00) TO (999.99);

Let’s insert some values into the “salesman_performance” table, with different users having different revenue performance:

INSERT INTO salesman_performance VALUES( 1, 'Cary', 'Huang', 4458375.34, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 2, 'Nick', 'Wahlberg', 340.2, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 3, 'Ed', 'Chase', 764.34, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 4, 'Jennifer', 'Davis', 33750.12, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 5, 'Johnny', 'Lollobrigida', 4465.23, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 6, 'Bette', 'Nicholson', 600.44, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 7, 'Joe', 'Swank', 445237.34, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 8, 'Fred', 'Costner', 2456789.34, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 9, 'Karl', 'Berry', 4483758.34, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 10, 'Zero', 'Cage', 74638930.64, '2019-09-20 16:00:00');
INSERT INTO salesman_performance VALUES( 11, 'Matt', 'Johansson', 655837.34, '2019-09-20 16:00:00');

Postgres will automatically route each inserted row to the respective partition based on its revenue range.

You may run the \d+ command to see the table and its partitions,

or examine just salesman_performance, which shows the partition key and ranges:

\d+ salesman_performance

We can also use EXPLAIN ANALYZE to see the query plan Postgres produces to scan each partition. The plan indicates how many rows of records exist in each partition.

EXPLAIN ANALYZE SELECT * FROM salesman_performance;

There you have it. This is a very basic partition table that distributes data by value range.

One of the advantages of using a partitioned table is that bulk loads and deletes can be done simply by adding or removing partitions (DROP TABLE). This is much faster and can entirely avoid the VACUUM overhead caused by DELETE.

When you update an entry, say salesman_id 1 has reached the “Chief” level of sales rank from “Above Average”:

UPDATE salesman_performance SET revenue = 445837555.34 where salesman_id=1;

You will see that Postgres automatically puts salesman_id 1 into the “salesman_performance_chief” partition and removes it from “salesman_performance_above_average”.

3. Delete and Detach Partition

A partition can be deleted completely simply by the “DROP TABLE [partition name]” command. This may not be desirable in some use cases.

The more recommended approach is to use “DETACH PARTITION” queries, which removes the partition relationship but preserves the data.

ALTER TABLE salesman_performance DETACH PARTITION salesman_performance_chief;

If a partition is missing, and a subsequent insertion has a value that no remaining partition covers, the insertion will fail.

INSERT INTO salesman_performance VALUES( 12, 'New', 'User', 755837555.34, current_timestamp);

=> This should fail, because no attached partition contains a range covering this revenue (755837555.34).

If we add back the partition for the missing range, then the above insertion will work:

ALTER TABLE salesman_performance ATTACH PARTITION salesman_performance_chief
FOR VALUES FROM (100000000.00) TO (999999999.99);

4. Create Function Using Plpgsql and Define a Trigger

In this section, we will use an example of subscriber and coupon code redemption to illustrate the use of a PL/pgSQL function and a trigger to correctly manage the distribution of available coupon codes.

First we will have a table called “subscriber”, which stores a list of users, and a table called “coupon”, which stores a list of available coupons.

CREATE TABLE subscriber (
    sub_id int not NULL,
    first_name varchar(45),
    last_name varchar(45),
    coupon_code_redeemed varchar(200),
    last_updated timestamp
);

CREATE TABLE coupon (
    coupon_code varchar(45),
    percent_off int CHECK (percent_off >= 0 AND percent_off<=100),
    redeemed_by varchar(100),
    time_redeemed timestamp
);

Let’s insert some records to the above tables:

INSERT INTO subscriber (sub_id, first_name, last_name, last_updated) VALUES(1,'Cary','Huang',current_timestamp);
INSERT INTO subscriber (sub_id, first_name, last_name, last_updated) VALUES(2,'Nick','Wahlberg',current_timestamp);
INSERT INTO subscriber (sub_id, first_name, last_name, last_updated) VALUES(3,'Johnny','Lollobrigida',current_timestamp);
INSERT INTO subscriber (sub_id, first_name, last_name, last_updated) VALUES(4,'Joe','Swank',current_timestamp);
INSERT INTO subscriber (sub_id, first_name, last_name, last_updated) VALUES(5,'Matt','Johansson',current_timestamp);

INSERT INTO coupon (coupon_code, percent_off) VALUES('CXNEHD-746353',20);
INSERT INTO coupon (coupon_code, percent_off)  VALUES('CXNEHD-653834',30);
INSERT INTO coupon (coupon_code, percent_off)  VALUES('CXNEHD-538463',40);
INSERT INTO coupon (coupon_code, percent_off)  VALUES('CXNEHD-493567',50);
INSERT INTO coupon (coupon_code, percent_off)  VALUES('CXNEHD-384756',95);

The tables now look like:

Say a subscriber redeems a coupon code; we need a FUNCTION to check whether the redeemed coupon code is valid (i.e., it exists in the coupon table). If valid, we will update the subscriber table with the redeemed coupon code and at the same time update the coupon table to indicate which subscriber redeemed the coupon and at what time.

CREATE OR REPLACE FUNCTION redeem_coupon() RETURNS trigger AS $redeem_coupon$
    BEGIN
    IF EXISTS ( SELECT 1 FROM coupon c where c.coupon_code = NEW.coupon_code_redeemed ) THEN
        UPDATE coupon SET redeemed_by=OLD.first_name, time_redeemed='2019-09-20 16:00:00' where  coupon_code = NEW.coupon_code_redeemed;
    ELSE
        RAISE EXCEPTION 'coupon code does not exist';
    END IF;
        RETURN NEW;
    END;
$redeem_coupon$ LANGUAGE plpgsql;

Next, we need to define a TRIGGER, which is invoked BEFORE UPDATE, to check the validity of a given coupon code.

CREATE TRIGGER redeem_coupon_trigger
  BEFORE UPDATE
  ON subscriber
  FOR EACH ROW
  EXECUTE PROCEDURE redeem_coupon();

\d+ subscriber should look like this:

Let’s have some users redeem invalid coupon codes; as expected, an exception will be raised if the coupon code is not valid.

UPDATE subscriber set coupon_code_redeemed='12345678' where first_name='Cary';
UPDATE subscriber set coupon_code_redeemed='87654321' where first_name='Nick';
UPDATE subscriber set coupon_code_redeemed='55555555' where first_name='Joe';

Let’s correct the above and redeem only valid coupon codes; this time there should not be any error.

UPDATE subscriber set coupon_code_redeemed='CXNEHD-493567' where first_name='Cary';
UPDATE subscriber set coupon_code_redeemed='CXNEHD-653834' where first_name='Nick';
UPDATE subscriber set coupon_code_redeemed='CXNEHD-384756' where first_name='Joe';

Now both tables should look like this, with their information cross-referenced.

And there you have it, a basic trigger function executed before each update.

5. Summary

With support for partitioned tables defined by value range, we can have Postgres automatically split the load of a very large table across many smaller partitions. This brings a lot of benefits in terms of performance and more efficient data management.

With a Postgres FUNCTION and TRIGGER working together as a duo, we can make general queries and updates more dynamic and automatic and achieve more complex operations. Since much of the complex logic can be defined and handled in a FUNCTION, which is then invoked at the appropriate moment as defined by a TRIGGER, an application integrated with Postgres has much less logic to implement itself.

Evaluating Model Performance by Building Cross-Validation from Scratch

$
0
0

Feed: R-bloggers.
Author: Lukas Feick.

Cross-validation is a widely used technique to assess the generalization performance of a machine learning model. Here at STATWORX, we often discuss performance metrics and how to incorporate them efficiently in our data science workflow. In this blog post, I will introduce the basics of cross-validation, provide guidelines to tweak its parameters, and illustrate how to build it from scratch in an efficient way.

Model evaluation and cross-validation basics

Cross-validation is a model evaluation technique. The central intuition behind model evaluation is to figure out if the trained model is generalizable, that is, whether the predictive power we observe while training is also to be expected on unseen data. We could feed it directly with the data it was developed for, i.e., meant to predict. But then again, there is no way for us to know, or validate, whether the predictions are accurate. Naturally, we would want some kind of benchmark of our model’s generalization performance before launching it into production. Therefore, the idea is to split the existing training data into an actual training set and a hold-out test partition which is not used for training and serves as the „unseen“ data. Since this test partition is, in fact, part of the original training data, we have a full range of „correct“ outcomes to validate against. We can then use an appropriate error metric, such as the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error (MAPE) to evaluate model performance. However, the applicable evaluation metric has to be chosen with caution as there are pitfalls (as described in this blog post by my colleague Jan). Many machine learning algorithms allow the user to specify hyperparameters, such as the number of neighbors in k-Nearest Neighbors or the number of trees in a Random Forest. Cross-validation can also be leveraged for „tuning“ the hyperparameters of a model by comparing the generalization error of different model specifications.

Common approaches to model evaluation

There are dozens of model evaluation techniques that are always trading off between variance, bias, and computation time. It is essential to know these trade-offs when evaluating a model, since choosing the appropriate technique highly depends on the problem and the data we observe. I will cover this topic once I have introduced two of the most common model evaluation techniques: the train-test-split and k-fold cross-validation. In the former, the training data is randomly split into a train and test partition (Figure 1), commonly with a significant part of the data being retained as the training set. Proportions of 70/30 or 80/20 are the most frequently used in the literature, though the exact ratio depends on the size of your data. The drawback of this approach is that this one-time random split can end up partitioning the data into two very imbalanced parts, thus yielding biased generalization error estimates. That is especially critical if you only have limited data, as some features or patterns could end up entirely in the test part. In such a case, the model has no chance to learn them, and you will potentially underestimate its performance.

train-test-split

A more robust alternative is the so-called k-fold cross-validation (Figure 2). Here, the data is shuffled and then randomly partitioned into k folds. The main advantage over the train-test-split approach is that each of the k partitions is iteratively used as a test (i.e., validation) set, with the remaining k-1 parts serving as the training sets in this iteration. This process is repeated k times, such that every observation is included in both training and test sets. The appropriate error metric is then simply calculated as a mean of all of the k folds, giving the cross-validation error.

This is more of an extension of the train-test split rather than a completely new method: that is, the train-test procedure is repeated k times. However, note that even if k is chosen to be as low as k=2, i.e., you end up with only two parts, this approach is still superior to the train-test-split in that both parts are iteratively chosen for training, so that the model has a chance to learn all the data rather than just a random subset of it. Therefore, this approach usually results in more robust performance estimates.

k-fold-cross-validation

Comparing the two figures above, you can see that a train-test split with a ratio of 80/20 is equivalent to one iteration of a 5-fold (that is, k=5) cross-validation where 4/5 of the data are retained for training, and 1/5 is held out for validation. The crucial difference is that in k-fold the validation set is shifted in each of the k iterations. Note that a k-fold cross-validation is more robust than merely repeating the train-test split k times: In k-fold CV, the partitioning is done once, and then you iterate through the folds, whereas in the repeated train-test split, you re-partition the data k times, potentially omitting some data from training.

Repeated CV and LOOCV

There are many flavors of k-fold cross-validation. For instance, you can do “repeated cross-validation” as well. In standard k-fold CV, once the data is divided into k folds, this partitioning is fixed for the whole procedure. In repeated CV, you repeat the process of shuffling and randomly partitioning the data into k folds a certain number of times, so you don’t risk excluding some portions by chance. You can then average over the resulting cross-validation errors of each run to get a global performance estimate.

Another special case of k-fold cross-validation is “Leave One Out Cross-Validation” (LOOCV), where you set k = n. That is, in each iteration, you use a single observation from your data as the validation portion and the remaining n-1 observations as the training set. While this might sound like a hyper robust version of cross-validation, its usage is generally discouraged for two reasons:

  • First, it’s usually very computationally expensive. For most datasets used in applied machine learning, training your model n-1 times is neither desirable nor feasible (although it may be useful for very small datasets).
  • Second, even if you had the computational power (and time on your hands) to endure this process, another argument advanced by critics of LOOCV from a statistical point of view is that the resulting cross-validation error can exhibit high variance. The cause of that is that your „validation set“ consists of only one observation, and depending on the distribution of your data (and potential outliers), this can vary substantially.

In general, note that the performance of LOOCV is a somewhat controversial topic, both in the scientific literature and the broader machine learning community. Therefore, I encourage you to read up on this debate if you consider using LOOCV for estimating the generalization performance of your model (for example, check out this and related posts on StackExchange). As is often the case, the answer might end up being „it depends“. In any case, keep in mind the computational overhead of LOOCV, which is hard to deny (unless you have a tiny dataset).

The value of k and the bias-variance trade-off

If k=n is not (necessarily) the best choice, then how do we find an appropriate value for k? It turns out that the answer to this question boils down to the notorious bias-variance trade-off. Why is that?

The value for k governs how many folds your data is partitioned into and therefore the size of (i.e., number of observations contained in) each fold. We want to choose k in a way that a sufficiently large portion of our data remains in the training set – after all, we don’t want to give too many observations away that could be used to train our model. The higher the value of k, the more observations are included in our training set in each iteration.

For instance, suppose we have 1,200 observations in our dataset; then with k=3 our training set would consist of (k-1)/k * N = 800 observations, but with k=8 it would include 1,050 observations. Naturally, with more observations used for training, you approximate your model’s actual performance (as if it were trained on the whole dataset), hence reducing the bias of your error estimate compared to a smaller fraction of the data. But with increasing k, the size of your validation partition decreases, and your error estimate in each iteration is more sensitive to these few data points, potentially increasing its overall variance. Basically, it’s choosing between the “extremes” of the train-test-split on the one hand and LOOCV on the other. The figure below schematically (!) illustrates the bias-variance performance and computational overhead of different cross-validation methods.

bias-variance-tradeoff

As a rule of thumb, with higher values for k, bias decreases and variance increases. By convention, values like k=5 or k=10 have been deemed to be a good compromise and have thus become the quasi-standard in most applied machine learning settings.

„These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.“

James et al. 2013: 184

If you are not particularly concerned with the process of cross-validation itself but rather want to seamlessly integrate it into your data science workflow (which I highly recommend!), you should be fine choosing either of these values for k and leaving it at that.

Implementing cross-validation in caret

Speaking of integrating cross-validation into your daily workflow—which possibilities are there? Luckily, cross-validation is a standard tool in popular machine learning libraries such as the caret package in R. Here you can specify the method with the trainControl function. Below is a script where we fit a random forest with 10-fold cross-validation to the iris dataset.

library(caret)

set.seed(12345)
inTrain <- createDataPartition(y = iris$Species, p = 0.8, list = FALSE)  # train/test split proportion assumed
fit.control <- trainControl(method = "cv", number = 10)                  # 10-fold cross-validation
fit.rf <- train(Species ~ ., data = iris[inTrain, ], method = "rf", trControl = fit.control)

We define our desired cross-validation method in the trainControl function, store the output in the object fit.control, and then pass this object to the trControl argument of the train function. You can specify the other methods introduced in this post in a similar fashion:

# Leave-One-Out Cross-validation:
fit.control <- trainControl(method = "LOOCV")

The old-fashioned way: Implementing k-fold cross-validation by hand

However, data science projects can quickly become so complex that the ready-made functions in machine learning packages are not suitable anymore. In such cases, you will have to implement the algorithm—including cross-validation techniques—by hand, tailored to the specific project needs. Let me walk you through a makeshift script for implementing simple k-fold cross-validation in R by hand (we will tackle the script step by step here; you can find the whole code on our GitHub).

Simulating data, defining the error metric, and setting k

# devtools::install_github("andrebleier/Xy")
library(tidyverse)
library(Xy)

sim <- Xy(n = 1000)                            # simulate regression data (argument names assumed)
sim_data <- sim$data                           # data frame of simulated outcome and predictors (accessor assumed)

rmse <- function(f, o) sqrt(mean((f - o)^2))   # root of the mean of the squared error

k <- 5

We start by loading the required packages and simulating some regression data with 1,000 observations with the Xy() package developed by my colleague André (check out his blog post on simulating regression data with Xy). Because we need some kind of error metric to evaluate model performance, we define our RMSE function, which is pretty straightforward: The RMSE is the root of the mean of the squared error, where error is the difference between our fitted (f) and observed (o) values—you can pretty much read the function from left to right. Lastly, we specify our k, which is set to the value of 5 in the example and is stored as a simple integer.

Partitioning the data

set.seed(12345)
sim_data <- sim_data %>%
  mutate(my.folds = sample(1:k, size = nrow(sim_data), replace = TRUE))

Next up, we partition our data into k folds. For this purpose, we add a new column, my.folds, to the data: We sample (with replacement) from 1 to the value of k, so 1 to 5 in our case, and randomly add one of these five numbers to each row (observation) in the data. With 1,000 observations, each number should be assigned about 200 times.

Training and validating the model

cv.fun <- function(this.fold, data) {

  train <- data %>% filter(my.folds != this.fold)
  validate <- data %>% filter(my.folds == this.fold)

  # simple linear model of the simulated outcome on the predictors
  # (the outcome name produced by Xy() is assumed to be y here)
  model <- lm(y ~ ., data = train %>% select(-my.folds))

  pred <- predict(model, newdata = validate) %>% as.vector()

  this.rmse <- rmse(f = pred, o = validate$y)

  return(this.rmse)
}

Next, we define cv.fun, which is the heart of our cross-validation procedure. This function takes two arguments: this.fold and data. I will come back to the meaning of this.fold in a minute, let’s just set it to 1 for now. Inside the function, we divide the data into a training and validation partition by subsetting according to the values of my.folds and this.fold: Every observation with a randomly assigned my.folds value other than 1 (so approximately 4/5 of the data) goes into training. Every observation with a my.folds value equal to 1 (the remaining 1/5) forms the validation set. For illustration purposes, we then fit a simple linear model with the simulated outcome and four predictors. Note that we only fit this model on the train data! We then use this model to predict() our validation data, and since we have true observed outcomes for this subset of the original overall training data (this is the whole point!), we can compute our RMSE and return it.

Iterating through the folds and computing the CV error

cv.error <- sapply(seq_len(k),
                   FUN = cv.fun,
                   data = sim_data) %>%
  mean()

cv.error

Lastly, we wrap the function call to cv.fun into a sapply() loop—this is where all the magic happens: Here we iterate over the range of k, so seq_len(k) leaves us with the vector [1] 1 2 3 4 5 in this case. We apply each element of this vector to cv.fun. In apply() statements, the iteration vector is always passed as the first argument of the function which is called, so in our case, each element of this vector at a time is passed to this.fold. We also pass our simulated sim_data as the data argument.

Let us quickly recap what this means: In the first iteration, this.fold equals 1. This means that our train set consists of all the observations where my.folds is not 1, and observations with a value of 1 form the validation set (just as in the example above). In the next iteration of the loop, this.fold equals 2. Consequently, observations with 1, 3, 4, and 5 form the training set, and observations with a value of 2 go to validation, and so on. Iterating over all values of k, this schematically provides us with the diagonal pattern seen in Figure 2 above, where each data partition at a time is used as a validation set.

To wrap it all up, we calculate the mean: This is the mean of our k individual RMSE values and leaves us with our cross-validation error. And there you go: We just defined our custom cross-validation function! This is merely a template: You can insert any model and any error metric. If you’ve been following along so far, feel free to try implementing repeated CV yourself or play around with different values for k.

Conclusion

As you can see, implementing cross-validation yourself isn’t all that hard. It gives you great flexibility to account for project-specific needs, such as custom error metrics. If you don’t need that much flexibility, enabling cross-validation in popular machine learning packages is a breeze. I hope that I could provide you with a sufficient overview of cross-validation and how to implement it both in pre-defined functions as well as by hand. If you have questions, comments, or ideas, feel free to drop me an e-mail.

References

  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer.
About the author

Lukas Feick

I am a data scientist at STATWORX. I have always enjoyed using data-driven approaches to tackle complex real-world problems, and to help people gain better insights.


STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.  





Design Principles for Big Data Performance

$
0
0

Feed: Featured Blog Posts – Data Science Central.
Author: Stephanie Shen.

The evolution of the technologies in Big Data in the last 20 years has presented a history of battles with growing data volume. The challenge of big data has not been solved yet, and the effort will certainly continue, with the data volume continuing to grow in the coming years. The original relational database management system (RDBMS) and the associated OLTP (Online Transaction Processing) model make it easy to work with data using SQL in all aspects, as long as the data size is small enough to manage. However, when the data reaches a significant volume, it becomes very difficult to work with, because it would take a long time, or sometimes even be impossible, to read, write, and process it successfully.

Overall, dealing with a large amount of data is a universal problem for data engineers and data scientists. The problem has manifested in many new technologies (Hadoop, NoSQL databases, Spark, etc.) that have bloomed in the last decade, and this trend will continue. This article is dedicated to the main principles to keep in mind when you design and implement a data-intensive process on a large data volume, whether that is data preparation for your machine learning applications or pulling data from multiple sources and generating reports or dashboards for your customers.

The essential problem of dealing with big data is, in fact, a resource issue: the larger the volume of the data, the more resources are required, in terms of memory, processors, and disks. The goal of performance optimization is to either reduce resource usage or fully utilize the available resources more efficiently, so that it takes less time to read, write, or process the data. The ultimate objectives of any optimization should include:

  1.  Maximized usage of memory that is available

  2.  Reduced disk I/O

  3.  Minimized data transfer over the network

  4.  Parallel processing to fully leverage multi-processors 

Therefore, when working on big data performance, a good architect is not only a programmer, but also possesses good knowledge of server architecture and database systems. With these objectives in mind, let’s look at 4 key principles for designing or optimizing your data processes or applications, no matter which tool, programming language, or framework you use.

Principle 1. Design based on your data volume

Before you start to build any data processes, you need to know the data volume you are working with: what the data volume will be to start with, and what it will grow into. If the data size is always small, design and implementation can be much more straightforward and faster. If the data starts out large, or starts small but will grow fast, the design needs to take performance optimization into consideration. The applications and processes that perform well for big data usually incur too much overhead for small data and slow the process down. On the other hand, an application designed for small data would take too long to complete on big data. In other words, an application or process should be designed differently for small data vs. big data. The reasons are listed in detail below:

  1. Because it is time-consuming to process large datasets from end to end, more breakdowns and checkpoints are required in the middle. The goal is two-fold: first, to allow one to check intermediate results or raise an exception earlier in the process, before the whole process ends; second, in the case that a job fails, to allow restarting from the last successful checkpoint, avoiding a re-start from the beginning, which is more expensive (a minimal checkpoint-restart sketch follows at the end of this principle). For small data, on the contrary, it is usually more efficient to execute all steps in one shot because of the short running time.

  2. When working with small data, the impact of any inefficiencies in the process also tends to be small, but the same inefficiencies could become a major resource issue for large data sets.

  3. Parallel processing and data partitioning (see below) not only require extra design and development time to implement, but also take more resources during running time, and should therefore be skipped for small data.

  4. When working with large data, performance testing should be included in the unit testing; this is usually not a concern for small data.

  5. Processing for small data can complete fast with the available hardware, while the same process can fail when processing a large amount of data due to running out of memory or disk space.

The bottom line is that the same process design cannot be used for both small data and large data processing. Large data processing requires a different mindset, prior experience of working with large data volumes, and additional effort in the initial design, implementation, and testing. On the other hand, do not assume “one size fits all” for the processes designed for big data, which could hurt the performance of small data.
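
To make the checkpoint idea from point 1 above concrete, here is a minimal, hypothetical Python sketch of a chunked job that records the last successfully completed chunk and resumes from it after a failure; the checkpoint file name and the process_chunk callback are illustrative placeholders, not something prescribed by this article.

import json
import os

CHECKPOINT_FILE = "job_checkpoint.json"   # hypothetical checkpoint location

def load_checkpoint():
    """Return the index of the last successfully processed chunk, or -1."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_completed_chunk"]
    return -1

def save_checkpoint(chunk_index):
    """Persist progress so a failed run can restart from this point."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_completed_chunk": chunk_index}, f)

def run_job(chunks, process_chunk):
    """Process chunks in order, skipping any that a previous run already finished."""
    last_done = load_checkpoint()
    for i, chunk in enumerate(chunks):
        if i <= last_done:
            continue                      # already processed in a previous run
        process_chunk(chunk)              # may raise; the checkpoint is not advanced
        save_checkpoint(i)                # record success before moving on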

Principle 2: Reduce data volume earlier in the process. 

When working with large data sets, reducing the data size early in the process is always the most effective way to achieve good performance. There is no silver bullet for the big data problem no matter how many resources and how much hardware you put in. So always try to reduce the data size before starting the real work. There are many ways to achieve this, depending on different use cases. Some common techniques, among many others, are listed below:

  1. Do not consume storage (e.g., padded space in a fixed-length field) when a field has a NULL value.

  2. Choose the data type economically. For example, if a number is never negative, use an unsigned integer type rather than a signed one; if there is no decimal part, do not use float.

  3. Encode text data with unique integer identifiers, because text fields can take much more space and should be avoided in processing (items 2 and 3 are illustrated in the sketch below).

  4. Data aggregation is always an effective method to reduce data volume when the lower granularity of the data is not needed.

  5. Compress data whenever possible.

  6. Reduce the number of fields: read and carry over only those fields that are truly needed.

  7. Leverage complex data structures to reduce data duplication. One example is to use an array structure to store a field’s repeated values in the same record instead of storing each value in a separate record, when the records share many other common key fields.

I hope the above list gives you some ideas as to how to reduce the data volume. In fact, the same techniques have been used in many database systems and in IoT edge computing. The better you understand the data and the business logic, the more creative you can be when trying to reduce the size of the data before working with it. The end result is a process that works much more efficiently with the available memory, disk, and processors.
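
As a small illustration of techniques 2 and 3 above, the following sketch (assuming pandas, with made-up column names) shrinks a dataframe by downcasting numeric types and dictionary-encoding text fields:

import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory footprint by choosing cheaper types (illustrative only)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # smallest integer type that still fits the values
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]):
            # dictionary-encode repeated text values as integer codes
            out[col] = out[col].astype("category")
    return out

# hypothetical example data
df = pd.DataFrame({"user_id": [1, 2, 3],
                   "country": ["US", "US", "DE"],
                   "amount": [1.5, 2.0, 3.25]})
print(df.memory_usage(deep=True).sum())
print(shrink(df).memory_usage(deep=True).sum())

On real tables with many repeated strings, the dictionary encoding alone often yields the largest savings.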

Principle 3: Partition the data properly based on processing logic

Enabling data parallelism is the most effective way to process data fast. As the data volume grows, the number of parallel processes grows; hence, adding more hardware will scale the overall data process without the need to change the code. For data engineers, a common method is data partitioning. There are many details regarding data partitioning techniques, which are beyond the scope of this article. Generally speaking, an effective partitioning should lead to the following results:

  1. Allow the downstream data processing steps, such as join and aggregation, to happen in the same partition. For example, partitioning by time periods is usually a good idea if the data processing logic is self-contained within a month.

  2. The size of each partition should be even, in order to ensure the same amount of time taken to process each partition.

  3. As the data volume grows, the number of partitions should increase, while the processing programs and logic stay the same.

Also, changing the partition strategy at different stages of processing should be considered to improve performance, depending on the operations that need to be done against the data. For example, when processing user data, hash partitioning by User ID is an effective way of partitioning. Then when processing users’ transactions, partitioning by time periods, such as month or week, can make the aggregation process a lot faster and more scalable.
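
Below is a minimal pure-Python sketch of the two partitioning schemes just mentioned: hash partitioning by User ID for per-user processing, then re-partitioning the same records by month for time-based aggregation. The record layout and partition count are invented for illustration.

import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # illustrative partition count

def hash_partition(records, key="user_id", n=NUM_PARTITIONS):
    """Group records into n buckets by a stable hash of the key column."""
    buckets = defaultdict(list)
    for rec in records:
        bucket = zlib.crc32(str(rec[key]).encode()) % n
        buckets[bucket].append(rec)
    return buckets

def partition_by_month(records, date_key="txn_date"):
    """Re-partition the same records by month (YYYY-MM prefix of the date)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[date_key][:7]].append(rec)
    return buckets

records = [
    {"user_id": 42, "txn_date": "2019-05-03", "amount": 10.0},
    {"user_id": 7,  "txn_date": "2019-06-11", "amount": 25.0},
]
by_user = hash_partition(records)        # suited to per-user processing
by_month = partition_by_month(records)   # suited to monthly aggregation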

Hadoop and Spark store the data in data blocks as the default operation, which enables parallel processing natively without needing programmers to manage it themselves. However, because their frameworks are very generic, treating all data blocks in the same way, they prevent the finer-grained control that an experienced data engineer could apply in his or her own program. Therefore, knowing the principles stated in this article will help you optimize process performance based on what’s available and what tools or software you are using.

Principle 4: Avoid unnecessary resource-expensive processing steps whenever possible

As stated in Principle 1, designing a process for big data is very different from designing for small data. An important aspect of the design is to avoid unnecessary resource-expensive operations whenever possible. This requires highly skilled data engineers with not just a good understanding of how the software works with the operating system and the available hardware resources, but also comprehensive knowledge of the data and the business use cases. In this article, I focus only on the top two resource-expensive operations to minimize in a data process: data sorting and disk I/O.

Putting the data records in a certain order is often needed when 1) joining with another dataset; 2) aggregating; 3) scanning; 4) deduplicating, among other things. However, sorting is one of the most expensive operations, requiring memory and processors, as well as disks when the input dataset is much larger than the available memory. To get good performance, it is important to be very frugal about sorting, following these principles:

  1. Do not sort again if the data is already sorted in the upstream or the source system.

  2. Usually, a join of two datasets requires both datasets to be sorted and then merged. When joining a large dataset with a small dataset, change the small dataset to a hash lookup. This allows one to avoid sorting the large dataset (see the sketch after this list).

  3. Sort only after the data size has been reduced (Principle 2) and within a partition (Principle 3).

  4. Design the process such that the steps requiring the same sort are together in one place to avoid re-sorting.

  5. Use the best sorting algorithm (e.g., merge sort or quick sort).
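
To illustrate point 2 in the list above, here is a small Python sketch that joins a large dataset with a small one via an in-memory hash lookup instead of a sort-merge join; the field names are invented for the example.

def hash_join(large_rows, small_rows, key="customer_id"):
    """Join a large dataset with a small one without sorting either side.

    Only the small dataset is loaded into a dictionary (the hash table);
    the large dataset is streamed through once.
    """
    lookup = {row[key]: row for row in small_rows}   # build the hash table once
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}                   # merge matching records

small = [{"customer_id": 1, "segment": "gold"}, {"customer_id": 2, "segment": "silver"}]
large = [{"customer_id": 1, "order": 101}, {"customer_id": 3, "order": 102}]
print(list(hash_join(large, small)))  # only customer_id 1 matches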

Another commonly considered factor is reducing disk I/O. There are many techniques in this area, which are beyond the scope of this article. Three common techniques to consider in this respect are listed below:

  1. Data compression

  2. Data indexing

  3. Performing multiple processing steps in memory before writing to disk

Data compression is a must when working with big data, because it allows faster reads and writes, as well as faster network transfer. Data file indexing is needed for fast data access, but at the expense of making writes to disk longer. Index a table or file only when it is necessary, while keeping in mind its impact on write performance. Lastly, perform multiple processing steps in memory whenever possible before writing the output to disk. This technique is not only used in Spark, but also in many database technologies.
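
The following standard-library-only sketch (file names and columns are made up) combines the last two ideas: it filters and aggregates entirely in memory and then performs a single compressed write at the end.

import csv
import gzip
import io

def process_and_write(input_path="events.csv", output_path="daily_totals.csv.gz"):
    """Filter, aggregate, and format in memory, then write once, compressed."""
    totals = {}
    with open(input_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["status"] != "ok":          # step 1: filter in memory
                continue
            day = row["timestamp"][:10]
            totals[day] = totals.get(day, 0.0) + float(row["value"])  # step 2: aggregate

    buf = io.StringIO()                        # step 3: format in memory
    writer = csv.writer(buf)
    writer.writerow(["day", "total"])
    for day, total in sorted(totals.items()):
        writer.writerow([day, total])

    with gzip.open(output_path, "wt") as out:  # single compressed write to disk
        out.write(buf.getvalue())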

In summary, designing big data processes and systems with good performance is a challenging task. The 4 basic principles illustrated in this article will give you a guideline to think both proactively and creatively when working with big data and other databases or systems. It often happens that the initial design does not lead to the best performance, primarily because of limited hardware and data volume in the development and test environments. Multiple iterations of performance optimization, therefore, are required after the process runs in production. Furthermore, an optimized data process is often tailored to certain business use cases. When the process is enhanced with new features to satisfy new use cases, certain optimizations could become invalid and require rethinking. All in all, improving the performance of big data is a never-ending task, which will continue to evolve with the growth of the data and the continued effort of discovering and realizing the value of the data.

Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue

$
0
0

Feed: AWS Big Data Blog.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. Many customers also like to use Amazon Redshift as an extract, transform, and load (ETL) engine to use existing SQL developer skillsets, to quickly migrate pre-existing SQL-based ETL scripts, and—because Amazon Redshift is fully ACID-compliant—as an efficient mechanism to merge change data from source data systems.

In this post, I show how to use AWS Step Functions and AWS Glue Python Shell to orchestrate tasks for those Amazon Redshift-based ETL workflows in a completely serverless fashion. AWS Glue Python Shell is a Python runtime environment for running small to medium-sized ETL tasks, such as submitting SQL queries and waiting for a response. Step Functions lets you coordinate multiple AWS services into workflows so you can easily run and monitor a series of ETL tasks. Both AWS Glue Python Shell and Step Functions are serverless, allowing you to automatically run and scale them in response to events you define, rather than requiring you to provision, scale, and manage servers.

While many traditional SQL-based workflows use internal database constructs like triggers and stored procedures, separating workflow orchestration, task, and compute engine components into standalone services allows you to develop, optimize, and even reuse each component independently. So, while this post uses Amazon Redshift as an example, my aim is to more generally show you how to orchestrate any SQL-based ETL.

Prerequisites

If you want to follow along with the examples in this post using your own AWS account, you need a Virtual Private Cloud (VPC) with at least two private subnets that have routes to an S3 VPC endpoint.

If you don’t have a VPC, or are unsure if yours meets these requirements, I provide an AWS CloudFormation template stack you can launch by selecting the following button. Provide a stack name on the first page and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

Scenario

For the examples in this post, I use the Amazon Customer Reviews Dataset to build an ETL workflow that completes the following two tasks which represent a simple ETL process.

  • Task 1: Move a copy of the dataset containing reviews from the year 2015 and later from S3 to an Amazon Redshift table.
  • Task 2: Generate a set of output files to another Amazon S3 location which identifies the “most helpful” reviews by market and product category, allowing an analytics team to glean information about high quality reviews.

This dataset is publicly available via an Amazon Simple Storage Service (Amazon S3) bucket. Complete the following tasks to get set up.

Solution overview

The following diagram highlights the solution architecture from end to end:

The steps in this process are as follows:

  1. The state machine launches a series of runs of an AWS Glue Python Shell job (more on how and why I use a single job later in this post!) with parameters for retrieving database connection information from AWS Secrets Manager and an .sql file from S3.
  2. Each run of the AWS Glue Python Shell job uses the database connection information to connect to the Amazon Redshift cluster and submit the queries contained in the .sql file.
    1. For Task 1: The cluster uses Amazon Redshift Spectrum to read data from S3 and load it into an Amazon Redshift table. Amazon Redshift Spectrum is commonly used as a means of loading data into Amazon Redshift. (See Step 7 of Twelve Best Practices for Amazon Redshift Spectrum for more information.)
    2. For Task 2: The cluster executes an aggregation query and exports the results to another Amazon S3 location via UNLOAD.
  3. The state machine may send a notification to an Amazon Simple Notification Service (SNS) topic in the case of pipeline failure.
  4. Users can query the data from the cluster and/or retrieve report output files directly from S3.

I include an AWS CloudFormation template to jumpstart the ETL environment so that I can focus this post on the steps dedicated to building the task and orchestration components. The template launches the following resources:

  • Amazon Redshift Cluster
  • Secrets Manager secret for storing Amazon Redshift cluster information and credentials
  • S3 Bucket preloaded with Python scripts and .sql files
  • Identity and Access Management (IAM) Role for AWS Glue Python Shell jobs

See the following resources for how to complete these steps manually:

Be sure to select at least two private subnets and the corresponding VPC, as shown in the following screenshot. If you are using the VPC template from above, the VPC appears as 10.71.0.0/16 and the subnet names are A private and B private.

The stack should take 10-15 minutes to launch. Once it displays Create Complete, you can move on to the next section. Be sure to take note of the Resources tab in the AWS CloudFormation console, shown in the following screenshot, as I refer to these resources throughout the post.

Building with AWS Glue Python Shell

Begin by navigating to AWS Glue in the AWS Management Console.

Making a connection

The Amazon Redshift cluster resides in a VPC, so you first need to create a connection using AWS Glue. Connections contain the properties, including VPC networking information, needed to access your data stores. You eventually attach this connection to your AWS Glue Python Shell job so that it can reach your Amazon Redshift cluster.

Select Connections from the menu bar, and then select Add connection. Give your connection a name like blog_rs_connection, select Amazon Redshift as the Connection type, and then select Next, as shown in the following screenshot.

Under Cluster, enter the name of the cluster that the AWS CloudFormation template launched, i.e., blogstack-redshiftcluster-####. Because the Python code I provide for this blog already handles credential retrieval, the rest of the database values you enter here are largely placeholders. The key information you are associating with the connection is networking-related.

Please note that you are not able to test the connection without the correct cluster information. If you are interested in doing so, note that Database name and Username are auto-populated after you select the correct cluster, as shown in the following screenshot. Follow the instructions here to retrieve the password from Secrets Manager and copy it into the Password field.

ETL code review

Take a look at the two main Python scripts used in this example:

Pygresql_redshift_common.py is a set of functions that can retrieve cluster connection information and credentials from Secrets Manager, make a connection to the cluster, and submit queries. By retrieving cluster information at runtime via a passed parameter, these functions allow the job to connect to any cluster to which it has access. You can package these functions into a library by following the instructions to create a Python .egg file (already completed as part of the AWS CloudFormation template launch). Note that AWS Glue Python Shell supports several Python libraries natively.

import pg
import boto3
import base64
from botocore.exceptions import ClientError
import json

#uses the Secrets Manager secret name to return connection and credential information
def connection_info(db):

	session = boto3.session.Session()
	client = session.client(
		service_name='secretsmanager'
	)

	get_secret_value_response = client.get_secret_value(SecretId=db)

	if 'SecretString' in get_secret_value_response:
		secret = json.loads(get_secret_value_response['SecretString'])
	else:
		secret = json.loads(base64.b64decode(get_secret_value_response['SecretBinary']))
		
	return secret


#creates a connection to the cluster
def get_connection(db,db_creds):

	con_params = connection_info(db_creds)
	
	rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (con_params['host'], con_params['port'], db, con_params['username'], con_params['password'])
	rs_conn = pg.connect(dbname=rs_conn_string)
	rs_conn.query("set statement_timeout = 1200000")
	
	return rs_conn


#submits a query to the cluster
def query(con,statement):
    res = con.query(statement)
    return res
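
If you want to rebuild the .egg yourself rather than rely on the one staged by the AWS CloudFormation template, a minimal setuptools script along the following lines should work. The package layout assumed here (a redshift_module folder containing an __init__.py and the pygresql_redshift_common.py shown above) is my assumption, chosen to mirror the .egg file name referenced later in this post.

# setup.py -- minimal packaging sketch for the connection helper module
from setuptools import setup

setup(
    name='redshift_module',
    version='0.1',
    packages=['redshift_module'],
)

Running python setup.py bdist_egg then produces an .egg file under dist/ that you can upload to the script bucket and reference from the job's Python library path.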

The AWS Glue Python Shell job runs rs_query.py when called. It starts by parsing job arguments that are passed at invocation. It uses some of those arguments to retrieve a .sql file from S3, then connects and submits the statements within the file to the cluster using the functions from pygresql_redshift_common.py. So, in addition to connecting to any cluster using the Python library you just packaged, it can also retrieve and run any SQL statement. This means you can manage a single AWS Glue Python Shell job for all of your Amazon Redshift-based ETL by simply passing in parameters on where it should connect and what it should submit to complete each task in your pipeline.

from redshift_module import pygresql_redshift_common as rs_common
import sys
from awsglue.utils import getResolvedOptions
import boto3

#get job args
args = getResolvedOptions(sys.argv,['db','db_creds','bucket','file'])
db = args['db']
db_creds = args['db_creds']
bucket = args['bucket']
file = args['file']

#get sql statements
s3 = boto3.client('s3') 
sqls = s3.get_object(Bucket=bucket, Key=file)['Body'].read().decode('utf-8')
sqls = sqls.split(';')

#get database connection
print('connecting...')
con = rs_common.get_connection(db,db_creds)

#run each sql statement
print("connected...running query...")
results = []
for sql in sqls[:-1]:
    sql = sql + ';'
    result = rs_common.query(con, sql)
    print(result)
    results.append(result)

print(results)

Creating the Glue Python Shell Job

Next, put that code into action:

  1. Navigate to Jobs on the left menu of the AWS Glue console page and from there, select Add job.
  2. Give the job a name like blog_rs_query.
  3. For the IAM role, select the same GlueExecutionRole you previously noted from the Resources section of the AWS CloudFormation console.
  4. For Type, select Python shell, leave Python version as the default of Python 3, and for This job runs select An existing script that you provide.
  5. For S3 path where the script is stored, navigate to the script bucket created by the AWS CloudFormation template (look for ScriptBucket in the Resources), then select the python/rs_query.py file.
  6. Expand the Security configuration, script libraries, and job parameters section to add the Python .egg file with the Amazon Redshift connection library to the Python library path. It is also located in the script bucket under python/redshift_module-0.1-py3.6.egg.

When all is said and done everything should look as it does in the following screenshot:

Choose Next. Add the connection you created by choosing Select to move it under Required connections. (Recall from the Making a connection section that this gives the job the ability to interact with your VPC.) Choose Save job and edit script to finish, as shown in the following screenshot.
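
The console steps above are all you need, but if you prefer to script job creation, a roughly equivalent definition can be created with boto3. The role name, bucket name, script path, and connection name below are placeholders standing in for the values from your own CloudFormation stack:

import boto3

glue = boto3.client('glue')

script_bucket = 'your-script-bucket'  # placeholder -- the ScriptBucket from the Resources tab

glue.create_job(
    Name='blog_rs_query',
    Role='GlueExecutionRole',  # the IAM role noted from the CloudFormation Resources tab
    Command={
        'Name': 'pythonshell',  # AWS Glue Python Shell job type
        'PythonVersion': '3',
        'ScriptLocation': 's3://' + script_bucket + '/python/rs_query.py',
    },
    DefaultArguments={
        # attach the packaged connection library to the Python library path
        '--extra-py-files': 's3://' + script_bucket + '/python/redshift_module-0.1-py3.6.egg',
    },
    Connections={'Connections': ['blog_rs_connection']},
    MaxCapacity=0.0625,  # Python Shell jobs run on a fraction of a DPU
)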

Test driving the Python Shell job

After creating the job, you are taken to the AWS Glue Python Shell IDE. If everything went well, you should see the rs_query.py code. Right now, the Amazon Redshift cluster is sitting there empty, so use the Python code to run the following SQL statements and populate it with tables.

  1. Create an external database (amzreviews).
  2. Create an external table (reviews) through which Amazon Redshift Spectrum can read the source data in S3 (the public reviews dataset). The table is partitioned by product_category because the source files are organized by category, but in general you should partition on frequently filtered columns (see #4).
  3. Add partitions to the external table.
  4. Create an internal table (reviews) local to the Amazon Redshift cluster. product_id works well as a DISTKEY because it has high cardinality and an even distribution, and it is most likely (although not explicitly part of this blog's scenario) a column that will be used to join with other tables. I choose review_date as the SORTKEY to efficiently filter out review data that is not part of my target query (after 2015). Learn more about how to best choose a DISTKEY/SORTKEY, as well as additional table design parameters for optimizing performance, by reading the Designing Tables documentation.
    CREATE EXTERNAL SCHEMA amzreviews 
    from data catalog
    database 'amzreviews'
    iam_role 'rolearn'
    CREATE EXTERNAL database IF NOT EXISTS;
    
    
    
    CREATE EXTERNAL TABLE amzreviews.reviews(
      marketplace varchar(10), 
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25), 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_headline varchar(25), 
      review_body varchar(1024), 
      review_date date, 
      year int)
    PARTITIONED BY ( 
      product_category varchar(25))
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/';
      
      
      
    ALTER TABLE amzreviews.reviews ADD
    partition(product_category='Apparel') 
    location 's3://amazon-reviews-pds/parquet/product_category=Apparel/'
    partition(product_category='Automotive') 
    location 's3://amazon-reviews-pds/parquet/product_category=Automotive'
    partition(product_category='Baby') 
    location 's3://amazon-reviews-pds/parquet/product_category=Baby'
    partition(product_category='Beauty') 
    location 's3://amazon-reviews-pds/parquet/product_category=Beauty'
    partition(product_category='Books') 
    location 's3://amazon-reviews-pds/parquet/product_category=Books'
    partition(product_category='Camera') 
    location 's3://amazon-reviews-pds/parquet/product_category=Camera'
    partition(product_category='Grocery') 
    location 's3://amazon-reviews-pds/parquet/product_category=Grocery'
    partition(product_category='Furniture') 
    location 's3://amazon-reviews-pds/parquet/product_category=Furniture'
    partition(product_category='Watches') 
    location 's3://amazon-reviews-pds/parquet/product_category=Watches'
    partition(product_category='Lawn_and_Garden') 
    location 's3://amazon-reviews-pds/parquet/product_category=Lawn_and_Garden';
    
    
    CREATE TABLE reviews(
      marketplace varchar(10),
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25) DISTKEY, 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_date date, 
      year int,
      product_category varchar(25))
      
      SORTKEY (
         review_date
        );

Do this first job run manually so you can see where all of the elements I’ve discussed come into play. Select Run Job at the top of the IDE screen. Expand the Security configuration, script libraries, and job parameters section. This is where you add in the parameters as key-value pairs, as shown in the following screenshot.

Key          Value
--db         reviews
--db_creds   reviewssecret
--bucket     your script bucket name (ScriptBucket in the CloudFormation Resources tab)
--file       sql/reviewsschema.sql

Select Run job to start it. The job should take a few seconds to complete. You can look for log outputs below the code in the IDE to watch job progress.
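
If you would rather start the run programmatically (for example, from a test script), the equivalent boto3 call looks roughly like the following; the bucket value is a placeholder for your own ScriptBucket name:

import boto3

glue = boto3.client('glue')

# the keys mirror the job parameters entered in the console above
response = glue.start_job_run(
    JobName='blog_rs_query',
    Arguments={
        '--db': 'reviews',
        '--db_creds': 'reviewssecret',
        '--bucket': 'your-script-bucket',  # placeholder
        '--file': 'sql/reviewsschema.sql',
    },
)
print(response['JobRunId'])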

Once the job completes, navigate to Databases in the AWS Glue console and look for the amzreviews database and reviews table, as shown in the following screenshot. If they are there, then everything worked as planned! You can also connect to your Amazon Redshift cluster using the Redshift Query Editor or with your own SQL client tool and look for the local reviews table.

Step Functions Orchestration

Now that you've had a chance to run a job manually, it's time to move on to something more programmatic that is orchestrated by Step Functions.

Launch Template

I provide a third AWS CloudFormation template for kickstarting this process as well. It creates a Step Functions state machine that calls two instances of the AWS Glue Python Shell job you just created to complete the two tasks I outlined at the beginning of this post.

For BucketName, paste the name of the script bucket created in the second AWS CloudFormation stack. For GlueJobName, type in the name of the job you just created. Leave the other information as default, as shown in the following screenshot. Launch the stack and wait for it to display Create Complete—this should take only a couple of minutes—before moving on to the next section.

Working with the Step Functions State Machine

State Machines are made up of a series of steps, allowing you to stitch together services into robust ETL workflows. You can monitor each step of execution as it happens, which means you can identify and fix problems in your ETL workflow quickly, and even automatically.

Take a look at the state machine you just launched to get a better idea. Navigate to Step Functions in the AWS Console and look for a state machine with a name like GlueJobStateMachine-######. Choose Edit to view the state machine configuration, as shown in the following screenshot.

It should look as it does in the following screenshot:

As you can see, state machines are created using JSON templates made up of task definitions and workflow logic. You can run parallel tasks, catch errors, and even pause workflows and wait for manual callback to continue. The example I provide contains two tasks for running the SQL statements that complete the goals I outlined at the beginning of the post:

  1. Load data from S3 using Redshift Spectrum
  2. Transform and write data back to S3

Each task contains basic error handling which, if caught, routes the workflow to an error notification task. This example is a simple one to show you how to build a basic workflow, but you can refer to the Step Functions documentation for an example of more complex workflows to help build a robust ETL pipeline. Step Functions also supports reusing modular components with Nested Workflows.
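
To make that structure concrete, here is a trimmed-down sketch of what such a state machine definition can look like, expressed as a Python dictionary and registered with boto3. The job arguments, .sql file name, topic ARN, and role ARN are placeholders, and the actual template launched by the CloudFormation stack is more complete:

import json
import boto3

# placeholder ARNs, bucket, and .sql file name -- substitute values from your own account
definition = {
    "StartAt": "ReadFilterJob",
    "States": {
        "ReadFilterJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # run the Glue job and wait for it
            "Parameters": {
                "JobName": "blog_rs_query",
                "Arguments": {
                    "--db": "reviews",
                    "--db_creds": "reviewssecret",
                    "--bucket": "your-script-bucket",
                    "--file": "sql/reviewsfilter.sql"
                }
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-failures",
                "Message": "ETL pipeline failure"
            },
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='GlueJobStateMachine',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsExecutionRole'
)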

SQL Review

The state machine will retrieve and run the following SQL statements:

INSERT INTO reviews
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_date, year, product_category
FROM amzreviews.reviews
WHERE year > 2015;

As I mentioned previously, Amazon Redshift Spectrum is a great way to run ETL using an INSERT INTO statement. This example is a simple load of the data as it is in S3, but keep in mind you can add more complex SQL statements to transform your data prior to loading.

UNLOAD ('SELECT marketplace, product_category, product_title, review_id, helpful_votes, AVG(star_rating) as average_stars FROM reviews GROUP BY marketplace, product_category, product_title, review_id, helpful_votes ORDER BY helpful_votes DESC, average_stars DESC')
TO 's3://bucket/testunload/'
iam_role 'rolearn';

This statement groups reviews by product, ordered by number of helpful votes, and writes to Amazon S3 using UNLOAD.

State Machine execution

Now that everything is in order, start an execution. From the state machine main page select Start an Execution.

Leave the defaults as they are and select Start to begin execution. Once execution begins you are taken to a visual workflow interface where you can follow the execution progress, as shown in the following screenshot.

Each of the queries takes a few minutes to run. In the meantime, you can watch the Amazon Redshift query logs to track the query progress in real time. These can be found by navigating to Amazon Redshift in the AWS Console, selecting your Amazon Redshift cluster, and then selecting the Queries tab, as shown in the following screenshot.

Once you see COMPLETED for both queries, navigate back to the state machine execution. You should see success for each of the states, as shown in the following screenshot.

Next, navigate to the data bucket in the S3 AWS Console page (refer to the DataBucket in the CloudFormation Resources tab). If all went as planned, you should see a folder named testunload in the bucket with the unloaded data, as shown in the following screenshot.

Inject Failure into Step Functions State Machine

Next, test the error handling component of the state machine by intentionally causing an error. An easy way to do this is to edit the state machine and misspell the name of the Secrets Manager secret in the ReadFilterJob task, as shown in the following screenshot.

If you want the error output sent to you, optionally subscribe to the error notification SNS Topic. Start another state machine execution as you did previously. This time the workflow should take the path toward the NotifyFailure task, as shown in the following screenshot. If you subscribed to the SNS Topic associated with it, you should receive a message shortly thereafter.

The state machine logs will show the error in more detail, as shown in the following screenshot.

Conclusion

In this post I demonstrated how you can orchestrate Amazon Redshift-based ETL using serverless AWS Step Functions and AWS Glue Python Shell jobs. As I mentioned in the introduction, these concepts can also be applied more generally to other SQL-based ETL, so use them to start building your own SQL-based ETL pipelines today!


About the Author

Ben Romano is a Data Lab solution architect at AWS. Ben helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.


The Beauty of a Shared-Nothing SQL DBMS for Skewed Database Sizes

$
0
0

Feed: MemSQL Blog.
Author: Eric Hanson.

The limitations of a typical, traditional relational database management system (RDBMS) have forced all sorts of compromises on data processing systems: from limitations on database size, to the separation of transaction processing from analytics. One such compromise has been the "sharding" of various customer accounts into separate database instances, partly so each customer could fit on a single computer server – but, in a typical power law, or Zipf, distribution, the larger databases don't fit. In response, database administrators have had to implement semi-custom sharding schemes. Here, we describe these schemes, discuss their limitations, and show how an alternative, MemSQL, makes them unnecessary.

Prologue

The primary purpose of many database implementations is to respond to a large volume of queries and updates, from many concurrent users and applications, with short, predictable response times. MemSQL is becoming well-known for delivering under these demanding conditions, in particular for large databases, bigger than will fit on a single server. And it can do so for both transactional (online transaction processing, or OLTP) and analytical (online analytics processing, or OLAP) applications.

These characteristics make MemSQL attractive for both:

  • In-house development teams that create different databases for different customers, and
  • Cloud service providers that create separate databases for each of their customers…

…when even just one, or a few, of the databases can be very large.

What follows are tales of two different database application architects who face the same problem—high skew of database size for different customer data sets, meaning a few are much larger than others—and address this problem in two different ways. One tries to deal with it via a legacy single-box database and through the use of “clever” application software. The other uses a scalable database that can handle both transactions and analytics—MemSQL. Judge for yourself who’s really the clever one.

The Story of the Hero Database Applications Architect

Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They came up with what seemed like a great idea. Each customer’s data would be placed in its own database. Then they would allocate one or more databases to a single-node database server. Each server would handle operational queries and the occasional big analytical query.

When a server filled up, they’d just allocate additional databases to a different server.

Everything was going great during development. The application code only had to be written once, for one scenario — all data for a customer fitting one DBMS server. If a database was big, no problem, just provision a larger server and put that database alone on that server. Easy.

Then they went into production. Everything was good. Success brought in bigger customers with more data. Data grew over time. The big customer data grew and grew. The biggest one would barely fit on a server. The architect started losing sleep. He kept the Xanax his doctor prescribed for anxiety in the top drawer and found himself dipping into it too often.

Then it happened. The biggest customer’s data would not fit on one machine anymore. A production outage happened. The architect proposed trimming the data to have less history, but the customers screamed. They needed 13 months minimum or else. He bought time by trimming to exactly 13 months. They only had two months of runway before they hit the wall again.

He got his top six developers together for an emergency meeting. They’d solve this problem by sharding the data for the biggest customer across several DBMS servers. Most queries in the app could be directed to one of the servers. The app developers would figure out where to connect and send the query. Not too hard. They could do it.

But some of the queries had aggregations over all the data. They could deal with this. They’d just send the query to every server, bring it back to the app, and combine the data in the app layer. His best developers actually thought this was super cool. It was way more fun than writing application software. They started to feel really proud of what they’d built.

Then they started having performance problems. Moving data from one machine to the other was hard. There were several ways they could do things. Which way should they do it? Then someone had the great idea to write an optimizer that would figure out how to run the query. This was so fun.

Around this time, the VP from the business side called the architect. She said the pace of application changes had slowed way down. What was going on? He proudly but at the same time sheepishly said that his top six app developers had now made the leap to be database systems software developers. Somehow, she did not care. She left his office, but it was clear she was not ready to let this lie.

He checked the bug count. Could it be this high? What were his people doing? He’d have to fix some of the bugs himself.

He started to sweat. A nervous lump formed in the pit of his stomach. The clock struck 7. His wife called and said dinner was ready. The kids wanted to see him. He said he’d leave by 8.

The Story of the Disciplined Database Applications Architect

Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They at first considered what seemed like a great idea. Each customer’s data would be placed in its own database on a single-node database server. But, asked the architect, what happens when there’s more data than will fit on one machine?

One of the devs on his team said he’d heard of this scale-out database called MemSQL that runs standard SQL and can do both operational and analytical workloads on the same system. If you run out of capacity, you can add more nodes and spread the data across them. The system handles it all automatically.

The dev had actually tried the free version of MemSQL for a temporary data mart and it worked great. It was really fast. And running it took half the work of running their old single-box DBMS. All their tools could connect to it too.

They decided to run just a couple of MemSQL clusters and put each customer’s data in one database on one cluster. They got into production. Things were going great; business was booming. Their biggest customer got really big really fast. It started to crowd out work for other customers on the same cluster. They could see a problem coming. How could they head it off?

They had planned for this. They just added a few nodes to the cluster and rebalanced the biggest database. It was all done online. It took an hour, running in the background. No downtime.

The VP from the business side walked in. She had a new business use case that would make millions if they could pull it off before the holidays. He called a meeting the next day with the business team and a few of his top developers. They rolled up their sleeves and sketched out the application requirements. Yeah, they could do this.

Annual review time came around. His boss showed him his numbers. Wow, that is a good bonus. He felt like he hadn’t worked too hard this year, but he kept it to himself. His golf score was down, and his pants still fit just like in college. He left the office at 5:30. His kids welcomed him at the door.

The Issue of Skewed Database Sizes

The architects in our stories are facing a common issue. They are building services for many clients, where each client’s data is to be kept in a separate database for simplicity, performance and security reasons. The database sizes needed by different customers vary dramatically, following what’s known as a Zipf distribution [Ada02]. In this distribution, the largest databases have orders of magnitude more data than the average ones, and there is a long tail of average and smaller-sized databases.

In a Zipf distribution of database sizes, the size of the database at rank r follows a pattern like

size(r) = C * r^(-b)

with b close to one, where r is the rank, and C is a constant, with the largest database having rank one, the second-largest rank two, and so on.

The following figure shows a hypothetical, yet realistic Zipf distribution of database sizes for b = 1.3 and C = 10 terabytes (TB). Because of the strong variation among database sizes, the distribution is considered highly skewed.
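
As a quick illustration of how steep this curve is, the short sketch below computes the sizes of a few ranks under the same assumed parameters (b = 1.3, C = 10 TB):

# hypothetical Zipf-distributed database sizes: size(r) = C * r**(-b)
C_TB = 10.0   # size of the largest (rank 1) database, in terabytes
b = 1.3

for rank in (1, 2, 4, 10, 100, 1000):
    size_tb = C_TB * rank ** (-b)
    print("rank %4d: %10.3f TB" % (rank, size_tb))

With these parameters the rank-1 database is 10 TB, while the database at rank 100 is only about 25 GB, so a handful of customers dominate the capacity planning.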

If your database platform doesn't support scale-out, it may be impossible to handle, say, the four largest customer databases when database size is distributed this way, unless you make tortuous changes to your application code and maintain them indefinitely.

I have seen this kind of situation in real life more than once. For example, the method of creating an application layer to do distributed query processing over sharded databases across single-node database servers, alluded to in the “hero” story above, was tried by a well-known advertising platform. They had one database per customer, and the database sizes were Zipf-distributed. The largest customers’ data had to be split over multiple nodes. They had to create application logic to aggregate data over multiple nodes, and use different queries and code paths to handle the same query logic for the single-box and multi-box cases.

Their top developers literally had to become database systems software developers. This took them away from application development and slowed the pace of application changes. Slower application changes took money off the table.

An Up-to-Date Solution for Skewed Database Sizes

Writing a distributed query processor is hard. It’s best left to the professionals. And anyway, isn’t the application software what really produces value for database users?

Today’s application developers don’t have to go the route of application-defined sharding and suffer the pain of building and managing it. There’s a better way. MemSQL supports transactions and analytics on the same database, on a single platform. It handles sharding and distributed query processing automatically. It can scale elastically via addition of nodes and online rebalancing of data partitions.

Some of our customers are handling this multi-database, Zipf-distributed size scenario by creating a database per customer and placing databases on one or more clusters. They get a “warm fuzzy feeling” knowing that they will never hit a scale wall, even though most of their databases fit on one machine. The biggest ones don’t always fit. And they know that, when a database grows, they can easily expand their hardware to handle it. They only have to write and maintain the app logic one way, one time, for all of their customers. No need to keep Xanax in the top drawer.

MemSQL doesn’t require performance compromises for transactions or analytics [She19]. Quite the contrary, MemSQL delivers phenomenal transaction rates and crazy analytics performance [She19, Han18] via:

  • in-memory rowstore structures [Mem19a],
  • multi-version concurrency control [Mem19c],
  • compilation of queries to machine code rather than interpretation [Mem19e], and
  • a highly-compressed, disk-based columnstore [Mem19b] with
  • vectorized query execution and use of single-instruction-multiple-data (SIMD) instructions [Mem19d].

Moreover, it supports strong data integrity, high availability, and disaster recovery via:

  • transaction support
  • intra-cluster replication of each data partition to an up-to-date replica (a.k.a. redundancy 2)
  • cluster-to-cluster replication
  • online upgrades.

Your developers will love it too, since it supports popular language interfaces (via MySQL compatibility) as well as ANSI SQL, views, stored procedures, and user-defined functions.
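
Because of that MySQL wire compatibility, any standard MySQL driver can talk to MemSQL. A minimal sketch with the PyMySQL driver might look like the following; the host, credentials, database, and table name are placeholders:

import pymysql

# placeholder connection details -- MemSQL speaks the MySQL protocol (port 3306 by default)
conn = pymysql.connect(host='memsql-aggregator.example.com',
                       port=3306,
                       user='app_user',
                       password='secret',
                       database='customer_db')

with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders")
    print(cur.fetchone())

conn.close()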

And it now supports delivery as a true platform as a service, Helios. Helios lets you focus even more energy on the application rather than running–let alone creating and maintaining–the database platform. Isn’t that where you’d rather be?

References

[Ada02] Lada A. Adamic, Zipf, Power-laws, and Pareto – a ranking tutorial, HP Labs, https://www.hpl.hp.com/research/idl/papers/ranking/ranking.html, 2002.

[Han18] Eric Hanson, Shattering the Trillion-Rows-Per-Second Barrier With MemSQL, https://www.memsql.com/blog/memsql-processing-shatters-trillion-rows-per-second-barrier/, 2018.

[Mem19a] Rowstore, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/rowstore/, 2019.

[Mem19b] Columnstore, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/columnstore/, 2019.

[Mem19c] MemSQL Architecture, https://www.memsql.com/content/architecture/, 2019.

[Mem19d] Understanding Operations on Encoded Data, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/understanding-ops-on-encoded-data/, 2019.

[Mem19e] Code Generation, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/code-generation/, 2019.

[She19] John Sherwood et al., We Spent a Bunch of Money on AWS And All We Got Was a Bunch of Experience and Some Great Benchmark Results, https://www.memsql.com/blog/memsql-tpc-benchmarks/, 2019.

Training, validation and testing for supervised machine learning models

$
0
0

Feed: SAS Blogs.
Author: Beth Ebersole.

Validating and testing our supervised machine learning models is essential to ensuring that they generalize well. SAS Viya makes it easy to train, validate, and test our machine learning models.

Training, validation and test data sets

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.

In ordinary least squares regression, the parameters are estimated so as to minimize the sum of the squared errors between the observed values and the values predicted by the model. This is illustrated below, where the predicted y = β0 + β1x. The y variable is on the vertical axis and the x variable is on the horizontal axis.
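
As a concrete (and entirely made-up) miniature version of this fit, the closed-form least-squares estimate can be computed in a few lines of Python:

import numpy as np

# hypothetical toy data: hours of training (x) vs. fingers/limbs lost (y)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([5, 4, 4, 2, 1, 1], dtype=float)

# ordinary least squares: choose beta0, beta1 minimizing sum((y - (beta0 + beta1*x))**2)
X = np.column_stack([np.ones_like(x), x])
beta, sse, _, _ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1 = beta

print("predicted y = %.2f + %.2f * x" % (beta0, beta1))
print("training SSE:", float(sse[0]))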

Validation data are used with each model developed in training, and the prediction errors are calculated. Depending on the model and the software, these prediction errors can be used to decide:

  • When to terminate the selection process
  • What effects (e.g., inputs, interactions, etc.) to include as the selection process proceeds, and/or
  • Which model to select

The validation errors calculated vary from model to model and may be measures such as the average squared error (ASE), mean squared error (MSE), error sum of squares (SSE), the negative log-likelihood, and so on. The validation ASE is often used in VDMML.

Note: "Average Squared Error and Mean Squared Error might appear similar. But they describe two completely different measures, where each is appropriate only for specific models. In linear models, statisticians routinely use the mean squared error (MSE) as the main measure of fit. The MSE is the sum of squared errors (SSE) divided by the degrees of freedom for error. (DFE is the number of cases [observations] less the number of weights in the model.) This process yields an unbiased estimate of the population noise variance under the usual assumptions.

For neural networks and decision trees, there is no known unbiased estimator. Furthermore, the DFE is often negative for neural networks. There exist approximations for the effective degrees of freedom, but these are often prohibitively expensive and are based on assumptions that might not hold. Hence, the MSE is not nearly as useful for neural networks as it is for linear models. One common solution is to divide the SSE by the number of cases N, not the DFE. This quantity, SSE/N, is referred to as the average squared error (ASE).

The MSE is not a useful estimator of the generalization error, even in linear models, unless the number of cases is much larger than the number of weights.”
– From Enterprise Miner documentation.

The validation data may be used several times to build the final model.

Test data is a hold-out sample that is used to assess the final selected model and estimate its prediction error. Test data are not used until after the model building and selection process is complete. Test data tell you how well your model will generalize, i.e., how well your model performs on new data. By new data I mean data that have not been involved in the model building nor the model selection process in any way.

The test data should never be used for fitting the model, for deciding what effects to include, nor for selecting from among candidate models. In addition, be careful of any leakage of information from the test data set into the other data sets. For example, if you take a mean of all of the data to impute missing values, do that separately for each of the three data sets (training, validation, and test). Otherwise, you will leak information from one data set to another.

Partitioning the Data

Simple random sampling: The observations selected for the subset of the data are randomly selected, i.e., each observation has an equal probability of being chosen.

Stratified sampling: The data are divided into segments or “strata,” and then observations are randomly selected from each stratum. For example, for large cat training, you might want to stratify your sample by lions and tigers to ensure that your training, validation, and test data sets all include the proportion of lions and tigers that exist in the total population. As shown below, in this case we take a random sample of 10% of the tigers and a random sample of 10% of the lions.

Oversampling: Oversampling is commonly used when you are looking at rare events, such as cancer cases. You may have a data set with 10,000 people and only 100 cancer cases. You may wish to oversample the cancer cases to get, for example, a 90:10 ratio of non-cancer to cancer cases. You could use 100% of the cancer cases (all 100) and a 9% random sample of the non-cancer cases (900).
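
The sampling schemes described above are also easy to reproduce in code outside of SAS. The sketch below uses pandas with a hypothetical animals table (the file name and column names are assumptions) to draw a stratified 10% sample and an oversampled rare-event set:

import pandas as pd

# hypothetical data: one row per animal, with a 'species' column (lion/tiger)
# and a binary 'cancer' flag used for the oversampling example
animals = pd.read_csv("animals.csv")

# stratified sampling: 10% of the lions and 10% of the tigers
stratified = animals.groupby("species").sample(frac=0.10, random_state=42)

# oversampling: keep all rare cases, plus a 9% random sample of the rest (roughly 90:10)
rare = animals[animals["cancer"] == 1]
common = animals[animals["cancer"] == 0].sample(frac=0.09, random_state=42)
oversampled = pd.concat([rare, common])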

How partitioning is accomplished

In VA|VS|VDMML Visual Interface

Partitioning became easier in the VA|VS|VDMML visual interface in 17w47 (December 2017) with VA|VS|VDMML 8.2 on Viya 3.3. To partition data, click the icon to the right of the data set name, and select Add partition data item, as shown below.

You may select either simple random sampling or stratified sampling and type the training percentage. You may also select a random number seed if you wish.

You may choose:

  • two partitions (training and validation) or
  • three partitions (training, validation, and test)

They will be assigned numbers as follows:

  • 0 = Validation
  • 1 = Training
  • 2 = Test

Note to anyone dealing with old versions of the software:

In the earlier (17w12, March 2017) release of VA/VS/VDMML 8.1 on Viya 3.2, you needed to have a binary partition indicator variable in your data set, with your data already partitioned, if you wished to split the data into Training and Validation data sets. For example, the partition indicator variable could be called "Partition" and could have two values, "Training" and "Validation." Or it could be called Training and the two values could be 0 and 1. It didn't matter what the name of the variable was, but it had to have only two values. As a data preparation step, you had to designate the partition indicator variable, indicating which value marked Training observations and which value marked Validation observations. In 8.1, if you did not have a binary partition variable already in your data set, you could not partition automatically on the fly.

There is no rule of thumb for deciding how much of your data set to apportion into training, validation, and test data. This should be decided by a data scientist or statistician familiar with the data. A common split is 50% for training, 25% for validation, and 25% for testing.
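
If you need to create a partition indicator column yourself ahead of time, a simple random assignment using the 0/1/2 coding listed earlier, with the common 50/25/25 split, might look like this; the input file name is a placeholder:

import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

# placeholder modeling table
df = pd.read_csv("model_data.csv")

# 0 = validation, 1 = training, 2 = test, drawn with a 25/50/25 split
df["partition"] = rng.choice([0, 1, 2], size=len(df), p=[0.25, 0.50, 0.25])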

Cross-validation is another method, but is not done automatically with the visual interface. For more information on cross-validation methods see: Funda Gunes’s blog on cross validation in Enterprise Miner or documentation on the SAS Viya CROSSVALIDATION statement.

Using SAS Viya procedures

PROC PARTITION lets you divide data into training, validation and test. In addition, most VS and VDMML supervised learning models have a PARTITION statement. (PROC FACTMAC is an exception.) Code for a partition statement within a PROC FOREST could look something like the code below.

This code would partition the full data set into 50% training data, 25% validation data, and 25% test data.

ALERT: If you have the same number of compute nodes, specifying the SEED= option in the PARTITION statement lets you repeatedly create the same partition data tables. However, changing the number of compute nodes changes the initial distribution of data, resulting in different partition data tables.

You can exercise more control over the partitioning of the input data table if you have created a partition variable in your data set. Then you must designate:

  1. The variable in the input data table as the partition variable (e.g., “role,” in the example below)
  2. A formatted value of that variable for each role (e.g., ‘TRAIN’, ‘VAL’, and ‘TEST’ as shown below)

Example code:

As I mentioned earlier, you can commonly use your validation data set to determine which model to select, or even when to stop the model selection process. For example, this is how you would do that with the METHOD= option of the SELECTION statement in PROC REGSELECT:

  • Stopping the Selection Process. Specify STOP= VALIDATE. “At step k of the selection process, the best candidate effect to enter or leave the current model is determined. Here, ‘best candidate’ means the effect that gives the best value of the SELECT= criterion; this criterion does not need to be based on the validation data. The Validation ASE for the model with this candidate effect added or removed is computed. If this Validation ASE is greater than the Validation ASE for the model at step k, then the selection process terminates at step k.”
  • Choosing the Model. Specify CHOOSE= VALIDATE. Then “the Validation ASE will be computed for your models at each step of the selection process. The smallest model at any step that yields the smallest Validation ASE will be selected.”

– Quoted from PROC REGSELECT documentation

Summary

SAS Viya makes it easy to train, validate, and test our machine learning models. This is essential to ensuring that our models not only fit our existing data, but that they will perform well on new data.

Now you know everything you need to know about training cats and dogs. Go ahead and try this at home. If you are as smart as the dolphins pictured below, you might even be able to train a human!

References and more information

What is DRBD?

$
0
0

Feed: Liquid Web.
Author: Jake Fellows.

Did you know that it is possible for your server to crash while your website remains online?

Highly reliable databases are critical if online services are to keep functioning in the event of a catastrophe. Deploying dedicated High Availability Databases therefore ensures they remain available, even if one node crashes.

DRBD: A Highly Available Tool That Can Help

This is where tools such as Distributed Replicated Block Device (DRBD) come in, enabling automatic failover capabilities to prevent downtime.

With a Distributed Replicated Block Device, whenever new data is written to disk, the block device uses the network to replicate data to the second node.

Through redundancy, businesses can protect themselves from downtime and financial loss, and get minimal to zero interruption during software and framework-related operations.

High Availability Database Hosting is ideal for mission-critical databases such as healthcare, government, eCommerce, big data or SaaS. Complex infrastructures can be hard to manage, but a DRBD delivers improved resiliency and optimizes disaster recovery, making them worth a significant investment.

"In traditional architectures, all it takes for hardware shutdown is for one component to crash."

In a High Availability environment, when a server crashes due to a hardware or software failure, the second server where all data has been replicated becomes active and takes over the workload. Thus, the hot spare ensures full redundancy and resilience.


Different Types of Highly Available Storage

DRBD is Linux-based open source software that operates at the kernel level and on which High Availability clusters are built. It is a good solution for data clusters, serving as a replacement for low-capacity solid state drive (SSD) storage solutions.

Easily integrated in any infrastructure including cloud, DRBD is used to mirror data, logical volumes, file systems, RAID devices (Redundant Array of Independent Disks) and block devices (HDD partitions) across the network to multiple servers in real time, through different types of replication. The other hosts need to have the same amount of free disk space in the hard drive disk partition as the primary node.

DRBD uses a block file to synchronize a number of tasks, including the two independent HDD partitions in the active and passive servers for read and write operations. When the hot standby takes over, there is zero downtime because it already contains a copy of all data.

"Remember, high availability is all about removing single points of failure from your infrastructure."


What is a Hot Standby or Secondary Node?

The hot server is a backup that allows load balancing to remove single points of failure. In active/passive mode, read and write (access or alter from memory) operations are run in the primary node.

An all-round tool, DRBD can add high availability to just about any application. DRBD can also work in an active/active environment, in particular as a popular approach to enabling load balancing in high availability (HA) cluster configurations. In this mode, servers run simultaneously so read and write operations are run on both servers at the same time, a process also known as shared-disk mode.

"DRBD is an enterprise-grade tool that simplifies the replacement of data storage and increases data availability."

DRBD supports both synchronous and asynchronous write operations, which will be further discussed below in relation to the three protocol setups.

In synchronous data replication, notifications are only delivered after write operations are finalized on all hosts, while in asynchronous replication, applications are notified as soon as operations are finalized locally, before the process moves on to the other hosts.

Primary and Secondary Nodes

Commonly, in a small-scale High Availability scenario with two redundant nodes, one is active (primary) and one is inactive (secondary). The secondary node, also known as a hot standby, already has a copy of the data through the network mirroring and replication provided by DRBD.

They are both connected to a single IP configuration, which means the hot spare will immediately take over operations in case of hardware failure. The switchover does not affect the High Availability databases, which remain 100% available.


How Does DRBD Replication Actually Work?

DRBD architecture is made up of two separate segments that ensure high-availability storage; the kernel module for DRBD behavior and user administrative tools to operate DRBD disks.

Because this architecture enables database mirroring and data replication through both synchronous and asynchronous write operations, DRBD is a flexible, virtual block device that can run on three replication protocols, known as Protocol A, Protocol B and Protocol C. All data replication is network transparent (invisible) to other applications using the same protocol.

  1. Protocol A constitutes asynchronous replication that can generate some data loss if host failover is forced. As previously explained, asynchronous data replication means that local write operations on the primary node (active/passive server situation) are considered achieved when local write operations are finished, and the mirrored data is available in the send buffer of the TCP transport framework. This setup is more common in replicating stacked resources in a wide area network.
  2. Protocol B involves memory synchronous (semi-sync) replication. In this deployment, no data is usually lost in failover. Local operations on the primary node are considered achieved once local disk write is complete, and the replicated data is available in the second node. Finalized write operations on the primary node may be deleted, however, if both nodes crash and data storage on the primary node is destroyed. This protocol is a variation of protocols A and C, and an example of how versatile DRBD can be in replication modes.
  3. Protocol C covers synchronous replication of local write operations and is the most popular scenario in production data replication. In this case, replication operations are considered achieved when replication confirmation is received on local and remote disks. DRBD is configured to use Protocol C by default, therefore to change the protocol setup reconfiguration in the file is necessary.

To confirm that the two hosts are indeed identical and all data was replicated, DRBD moves hashes and not data, which saves time and bandwidth.

In "split brain" situations, in which node communication failures result in both hosts being mistakenly identified as the primary, DRBD leverages a recovery algorithm that ensures storage does not become inconsistent.

Managed Hosting Can Help With Complex Infrastructures and DRBD High Availability

DRBD is workload agnostic and a great open-source tool, consisting of a kernel module, userspace management applications, and shell scripts. Organizations interested in DRBD virtual disks can take advantage of its open-source status and alter the software to accommodate their needs and applications.

Managing a complex infrastructure is not a task many businesses want on their plate, but Liquid Web can build and manage custom hosting environments to ensure peak performance, reduce team effort spent on configuration and achieve business objectives.

Not all companies have the proper resources to configure DRBD for their infrastructure; however, they can always rely on a managed service provider like Liquid Web to do the heavy lifting for High Availability, especially when its product offering includes enterprise-grade tools such as DRBD software and Heartbeat.


Best practices to scale Apache Spark jobs and partition data with AWS Glue

$
0
0

Feed: AWS Big Data Blog.

AWS Glue provides a serverless environment to prepare (extract and transform) and load large amounts of datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs. This series of posts discusses best practices to help developers of Apache Spark applications and Glue ETL jobs, big data architects, data engineers, and business analysts scale their data processing jobs running on AWS Glue automatically.

The first post of this series discusses two key AWS Glue capabilities to manage the scaling of data processing jobs. The first allows you to horizontally scale out Apache Spark applications for large splittable datasets. The second allows you to vertically scale up memory-intensive Apache Spark applications with the help of new AWS Glue worker types. The post also shows how to use AWS Glue to scale Apache Spark applications with a large number of small files commonly ingested from streaming applications using Amazon Kinesis Data Firehose. Finally, the post shows how AWS Glue jobs can use the partitioning structure of large datasets in Amazon S3 to provide faster execution times for Apache Spark applications.

Understanding AWS Glue worker types

AWS Glue comes with three worker types to help customers select the configuration that meets their job latency and cost requirements. These workers, also known as Data Processing Units (DPUs), come in Standard, G.1X, and G.2X configurations.

The standard worker configuration allocates 5 GB for Spark driver and executor memory, 512 MB for spark.yarn.executor.memoryOverhead, and 50 GB of attached EBS storage. The G.1X worker allocates 10 GB for driver and executor memory, 2 GB memoryOverhead, and 64 GB of attached EBS storage. The G.2X worker allocates 20 GB for driver and executor memory, 4 GB memoryOverhead, and 128 GB of attached EBS storage.

The compute parallelism (Apache Spark tasks per DPU) available for horizontal scaling is the same regardless of the worker type. For example, both standard and G.1X workers map to 1 DPU, each of which can run eight concurrent tasks. A G.2X worker maps to 2 DPUs, which can run 16 concurrent tasks. As a result, compute-intensive AWS Glue jobs that possess a high degree of data parallelism can benefit from horizontal scaling (more standard or G.1X workers). AWS Glue jobs that need high memory or ample disk space to store intermediate shuffle output can benefit from vertical scaling (more G.1X or G.2X workers).
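
Worker type and worker count are simply properties of the job, so vertical and horizontal scaling can be requested when the job is defined. A rough boto3 sketch follows; the job name, role, and script path are placeholders:

import boto3

glue = boto3.client('glue')

# placeholder names -- substitute your own job, role, and script location
glue.create_job(
    Name='memory-intensive-etl',
    Role='MyGlueServiceRole',
    Command={'Name': 'glueetl', 'ScriptLocation': 's3://my-bucket/scripts/etl_job.py'},
    GlueVersion='1.0',
    WorkerType='G.2X',        # 2 DPUs per worker: more memory and disk for heavy shuffles
    NumberOfWorkers=10,       # horizontal scale-out is controlled independently
)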

Horizontal scaling for splittable datasets

AWS Glue automatically supports file splitting when reading common native formats (such as CSV and JSON) and modern file formats (such as Parquet and ORC) from S3 using AWS Glue DynamicFrames. For more information about DynamicFrames, see Work with partitioned data in AWS Glue.

A file split is a portion of a file that a Spark task can read and process independently on an AWS Glue worker. By default, file splitting is enabled for line-delimited native formats, which allows Apache Spark jobs running on AWS Glue to parallelize computation across multiple executors. AWS Glue jobs that process large splittable datasets with medium (hundreds of megabytes) or large (several gigabytes) file sizes can benefit from horizontal scaling and run faster by adding more AWS Glue workers.

File splitting also benefits block-based compression formats such as bzip2. You can read each compression block on a file split boundary and process them independently. Unsplittable compression formats such as gzip do not benefit from file splitting. To horizontally scale jobs that read unsplittable files or compression formats, prepare the input datasets with multiple medium-sized files.
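
One practical way to prepare such input is a small one-off Spark job that rewrites the data into a controlled number of splittable files; the S3 paths and file count below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare-splittable-input").getOrCreate()

# rewriting gzip-compressed JSON as a fixed number of Parquet files gives
# downstream AWS Glue jobs splittable, medium-sized inputs
df = spark.read.json("s3://my-bucket/raw-gzip-json/")
df.repartition(64).write.mode("overwrite").parquet("s3://my-bucket/prepared-parquet/")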

Each file split (the blue square in the figure) is read from S3, deserialized into an AWS Glue DynamicFrame partition, and then processed by an Apache Spark task (the gear icon in the figure). Deserialized partition sizes can be significantly larger than the on-disk 64 MB file split size, especially for highly compressed splittable file formats such as Parquet or large files using unsplittable compression formats such as gzip. Typically, a deserialized partition is not cached in memory, and only constructed when needed due to Apache Spark’s lazy evaluation of transformations, thus not causing any memory pressure on AWS Glue workers. For more information on lazy evaluation, see the RDD Programming Guide on the Apache Spark website.

However, explicitly caching a partition in memory or spilling it out to local disk in an AWS Glue ETL script or Apache Spark application can result in out-of-memory (OOM) or out-of-disk exceptions. AWS Glue can support such use cases by using larger AWS Glue worker types with vertically scaled-up DPU instances for AWS Glue ETL jobs.

Vertical scaling for Apache Spark jobs using larger worker types

A variety of AWS Glue ETL jobs, Apache Spark applications, and new machine learning (ML) Glue transformations supported with AWS Lake Formation have high memory and disk requirements. Running these workloads may put significant memory pressure on the execution engine. This memory pressure can result in job failures because of OOM or out-of-disk space exceptions. You may see exceptions from Yarn about memory and disk space.

Exceeding Yarn memory overhead

Apache Yarn is responsible for allocating cluster resources needed to run your Spark application. An application includes a Spark driver and multiple executor JVMs. In addition to the memory allocation required to run a job for each executor, Yarn also allocates an extra overhead memory to accommodate for JVM overhead, interned strings, and other metadata that the JVM needs. The configuration parameter spark.yarn.executor.memoryOverhead defaults to 10% of the total executor memory. Memory-intensive operations such as joining large tables or processing datasets with a skew in the distribution of specific column values may exceed the memory threshold, and result in the following error message:

18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on ip-xxx:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.

Disk space

Apache Spark uses local disk on Glue workers to spill data from memory that exceeds the heap space defined by the spark.memory.fraction configuration parameter. During the sort or shuffle stages of a job, Spark writes intermediate data to local disk before it can exchange that data between the different workers. Jobs may fail due to the following exception when no disk space remains:

java.io.IOException: No space left on device
UnsafeExternalSorter: Thread 20 spilling sort data of 141.0 MB to disk (90 times so far)

AWS Glue job metrics

Most commonly, the out-of-disk exception above is the result of a significant skew in the dataset that the job is processing. You can identify the skew by monitoring the execution timeline of different Apache Spark executors using AWS Glue job metrics. For more information, see Debugging Demanding Stages and Straggler Tasks.

The following AWS Glue job metrics graph shows the execution timeline and memory profile of different executors in an AWS Glue ETL job. One of the executors (the red line) is straggling due to processing of a large partition, and actively consumes memory for the majority of the job’s duration.

With AWS Glue’s Vertical Scaling feature, memory-intensive Apache Spark jobs can use AWS Glue workers with higher memory and larger disk space to help overcome these two common failures. Using AWS Glue job metrics, you can also debug OOM and determine the ideal worker type for your job by inspecting the memory usage of the driver and executors for a running job. For more information, see Debugging OOM Exceptions and Job Abnormalities.

In general, jobs that run memory-intensive operations can benefit from the G.1X worker type, and those that run AWS Glue’s ML transforms or similar ML workloads can benefit from the G.2X worker type.

Apache Spark UI for AWS Glue jobs

You can also use AWS Glue’s support for the Spark UI to inspect and scale your AWS Glue ETL job by visualizing the Directed Acyclic Graph (DAG) of Spark’s execution, monitoring demanding stages and large shuffles, and inspecting Spark SQL query plans. For more information, see Monitoring Jobs Using the Apache Spark Web UI.

The following Spark SQL query plan on the Spark UI shows the DAG for an ETL job that reads two tables from S3, performs an outer-join that results in a Spark shuffle, and writes the result to S3 in Parquet format.

As seen from the plan, the Spark shuffle and subsequent sort operation for the join transformation takes the majority of the job execution time. With AWS Glue vertical scaling, each AWS Glue worker co-locates more Spark tasks, thereby saving on the number of data exchanges over the network.

Scaling to handle large numbers of small files

An AWS Glue ETL job might read thousands or millions of files from S3. This is typical for Kinesis Data Firehose or streaming applications writing data into S3. The Apache Spark driver may run out of memory when attempting to read a large number of files. When this happens, you see the following error message:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...

Apache Spark v2.2 can manage approximately 650,000 files on the standard AWS Glue worker type. To handle more files, AWS Glue provides the option to read input files in larger groups per Spark task for each AWS Glue worker. For more information, see Reading Input Files in Larger Groups.

You can reduce the excessive parallelism from the launch of one Apache Spark task to process each file by using AWS Glue file grouping. This method reduces the chances of an OOM exception on the Spark driver. To configure file grouping, you need to set groupFiles and groupSize parameters. The following code example uses AWS Glue DynamicFrame API in an ETL script with these parameters:

dyf = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://input-s3-path/"],
    'recurse':True,
    'groupFiles': 'inPartition',
    'groupSize': '1048576'}, 
    format="json")

You can set groupFiles to group files within a Hive-style S3 partition (inPartition) or across S3 partitions (acrossPartition). In most scenarios, grouping within a partition is sufficient to reduce the number of concurrent Spark tasks and the memory footprint of the Spark driver. In benchmarks, AWS Glue ETL jobs configured with the inPartition grouping option were approximately seven times faster than native Apache Spark v2.2 when processing 320,000 small JSON files distributed across 160 different S3 partitions. A large fraction of the time in Apache Spark is spent building an in-memory index while listing S3 files and scheduling a large number of short-running tasks to process each file. With AWS Glue grouping enabled, the benchmark AWS Glue ETL job could process more than 1 million files using the standard AWS Glue worker type.

groupSize is an optional field that allows you to configure the amount of data each Spark task reads and processes as a single AWS Glue DynamicFrame partition. Users can set groupSize if they know the distribution of file sizes before running the job. The groupSize parameter allows you to control the number of AWS Glue DynamicFrame partitions, which also translates into the number of output files. However, a groupSize that is too small can result in excessive task parallelism, while one that is too large can under-utilize the cluster.

By default, AWS Glue automatically enables grouping without any manual configuration when the number of input files or task parallelism exceeds a threshold of 50,000. The default value of the groupFiles parameter is inPartition, so that each Spark task only reads files within the same S3 partition. AWS Glue computes the groupSize parameter automatically, setting it to reduce excessive parallelism while keeping enough Spark tasks running in parallel to make use of the cluster’s compute resources.

Partitioning data and pushdown predicates

Partitioning has emerged as an important technique for organizing datasets so that a variety of big data systems can query them efficiently. A hierarchical directory structure organizes the data, based on the distinct values of one or more columns. For example, you can partition your application logs in S3 by date, broken down by year, month, and day. Files corresponding to a single day’s worth of data receive a prefix such as the following:

s3://my_bucket/logs/year=2018/month=01/day=23/

Predicate pushdowns for partition columns

AWS Glue supports pushing down predicates, which define a filter criteria for partition columns populated for a table in the AWS Glue Data Catalog. Instead of reading all the data and filtering results at execution time, you can supply a SQL predicate in the form of a WHERE clause on the partition column. For example, assume the table is partitioned by the year column and run SELECT * FROM table WHERE year = 2019. year represents the partition column and 2019 represents the filter criteria.

AWS Glue lists and reads only the files from S3 partitions that satisfy the predicate and are necessary for processing.

To accomplish this, specify a predicate using the Spark SQL expression language as an additional parameter to the AWS Glue DynamicFrame getCatalogSource method. This predicate can be any SQL expression or user-defined function that evaluates to a Boolean, as long as it uses only the partition columns for filtering.

This example demonstrates this functionality with a dataset of Github events partitioned by year, month, and day. The following code example reads only those S3 partitions related to events that occurred on weekends:

%spark

val partitionPredicate ="date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"

val pushdownEvents = glueContext.getCatalogSource(
   database = "githubarchive_month",
   tableName = "data",
   pushDownPredicate = partitionPredicate).getDynamicFrame()

Here you can use the SparkSQL string concat function to construct a date string. The to_date function converts it to a date object, and the date_format function with the ‘E’ pattern converts the date to a three-character day of the week (for example, Mon or Tue). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL, DataFrames and Datasets Guide and list of functions on the Apache Spark website.

Pruning AWS Glue Data Catalog partitions gives AWS Glue ETL jobs a significant performance boost. It reduces the time the Spark query engine spends listing files in S3 and reading and processing data at runtime. You can achieve further improvements by using predicates with higher selectivity to exclude additional partitions.

Partitioning data before and during writes to S3

By default, data is not partitioned when writing out the results from an AWS Glue DynamicFrame—all output files are written at the top level under the specified output path. AWS Glue enables partitioning of DynamicFrame results by passing the partitionKeys option when creating a sink. For example, the following code example writes out the dataset in Parquet format to S3 partitioned by the type column:

%spark

glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("path" -> "$outpath", "partitionKeys" -> Seq("type"))),
    format = "parquet").writeDynamicFrame(projectedEvents)

In this example, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter corresponds to the names of the columns used to partition the output in S3. When you execute the write operation, it removes the type column from the individual records and encodes it in the directory structure. To demonstrate this, you can list the output path using the following aws s3 ls command from the AWS CLI:

PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/

For more information, see aws s3 ls in the AWS CLI Command Reference.

In general, you should select columns for partitionKeys that are of lower cardinality and are most commonly used to filter or group query results. For example, when analyzing AWS CloudTrail logs, it is common to look for events that happened between a range of dates. Therefore, partitioning the CloudTrail data by year, month, and day would improve query performance and reduce the amount of data that you need to scan to return the answer.
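
As a sketch in the Python DynamicFrame API (reusing the glueContext object from the earlier grouping example; the output path and the cloudtrail_logs frame are placeholders rather than objects defined in this post), such a partitioned write might look like:

glueContext.write_dynamic_frame_from_options(
    frame = cloudtrail_logs,
    connection_type = "s3",
    connection_options = {
        "path": "s3://output-s3-path/",             # placeholder output location
        "partitionKeys": ["year", "month", "day"]   # low-cardinality columns commonly used in filters
    },
    format = "parquet")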

The benefit of output partitioning is two-fold. First, it improves execution time for end-user queries. Second, having an appropriate partitioning scheme helps avoid costly Spark shuffle operations in downstream AWS Glue ETL jobs when combining multiple jobs into a data pipeline. For more information, see Working with partitioned data in AWS Glue.

S3 or Hive-style partitions are different from Spark RDD or DynamicFrame partitions. Spark partitioning is related to how Spark or AWS Glue breaks up a large dataset into smaller and more manageable chunks to read and apply transformations in parallel. AWS Glue workers manage this type of partitioning in memory. You can control Spark partitions further by using the repartition or coalesce functions on DynamicFrames at any point during a job’s execution and before data is written to S3. You can set the number of partitions using the repartition function either by explicitly specifying the total number of partitions or by selecting the columns to partition the data.
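
As a rough sketch of what that looks like with the Python DynamicFrame API (the S3 paths and the partition count of 32 are placeholders, not values from this post):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the input dataset into a DynamicFrame (placeholder input path).
dyf = glueContext.create_dynamic_frame_from_options(
    "s3", {'paths': ["s3://input-s3-path/"]}, format="json")

# Consolidate the data into 32 Spark partitions before writing; this shuffles
# data between workers but bounds the number of output files.
repartitioned = dyf.repartition(32)

# Write the repartitioned result back to S3 (placeholder output path).
glueContext.write_dynamic_frame_from_options(
    frame = repartitioned,
    connection_type = "s3",
    connection_options = {"path": "s3://output-s3-path/"},
    format = "parquet")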

Repartitioning a dataset by using the repartition or coalesce functions often results in AWS Glue workers exchanging (shuffling) data, which can impact job runtime and increase memory pressure. In contrast, writing data to S3 with Hive-style partitioning does not require any data shuffle and only sorts it locally on each of the worker nodes. The number of output files in S3 without Hive-style partitioning roughly corresponds to the number of Spark partitions. In contrast, the number of output files in S3 with Hive-style partitioning can vary based on the distribution of partition keys on each AWS Glue worker.

Conclusion

This post showed how to scale your ETL jobs and Apache Spark applications on AWS Glue for both compute and memory-intensive jobs. AWS Glue enables faster job execution times and efficient memory management by using the parallelism of the dataset and different types of AWS Glue workers. It also helps you overcome the challenges of processing many small files by automatically adjusting the parallelism of the workload and cluster. AWS Glue ETL jobs use the AWS Glue Data Catalog and enable seamless partition pruning using predicate pushdowns. It also allows for efficient partitioning of datasets in S3 for faster queries by downstream Apache Spark applications and other analytics engines such as Amazon Athena and Amazon Redshift. We hope you try out these best practices for your Apache Spark applications on AWS Glue.

The second post in this series will show how to use AWS Glue features to batch process large historical datasets and incrementally process deltas in S3 data lakes. It also demonstrates how to use a custom AWS Glue Parquet writer for faster job execution.


About the Author

Mohit Saxena is a technical lead at AWS Glue. His passion is building scalable distributed systems for efficiently managing data in the cloud. He also enjoys watching movies and reading about the latest technology.

How to Migrate Oracle Workloads to VMware Cloud on AWS


Feed: AWS Partner Network (APN) Blog.
Author: Jayaraman VelloreSampathkumar.

By Jayaraman VelloreSampathkumar, Sr. Partner Solutions Architect at AWS

VMware Cloud on AWS is becoming a preferred choice to run Oracle workloads on Amazon Web Services (AWS).

Some Oracle workloads, like Oracle Real Application Cluster (RAC) or WebLogic application clusters, have specific networking and storage requirements, which can be met with VMware Cloud on AWS.

After you decide on an architecture, you should consider how to migrate your Oracle workloads to AWS.

This post details the various considerations and migration options users have for Oracle workloads. I will examine in-depth how to migrate an Oracle database from an on-premises environment to AWS using AWS Database Migration Service (AWS DMS).

VMware is an AWS Partner Network (APN) Advanced Technology Partner. VMware Cloud on AWS delivers a highly scalable, secure, and innovative service that allows organizations to seamlessly migrate and extend their on-premises VMware vSphere-based environments to the AWS Cloud running on next-generation Amazon Elastic Compute Cloud (Amazon EC2) bare metal infrastructure.

Migration Planning

Organizations have many options when migrating Oracle workloads from an on-premises environment to VMware Cloud on AWS. When selecting a migration path, you should consider:

  • Whether your workload is production or non-production.
  • The type of workload (databases, applications, containers).
  • SLA requirements (can it be offline for a few hours?).
  • Network bandwidth between your on-premises environment and the VMware Cloud on AWS Region.

Beyond the technical aspects, consider the technical skill set of your migration personnel. Oracle DBAs are familiar with Oracle migration methods like Oracle Recovery Manager (RMAN) backup/restore, Oracle Data Guard, or Oracle GoldenGate. VMware personnel may be familiar with VMware HCX and vMotion for migrations.

I recommend categorizing the application based on how critical it is to the business. The different buckets have different approaches and require different skills. For example, a critical production database may require the care and security provided by Oracle Data Guard. But for the hundreds of dev/test databases and applications that are less critical in nature, a mass migration using HCX is faster and less costly.

Network Connectivity

Network connectivity between your on-premises data center and VMware Cloud on AWS plays a key role in migration. For more information, see the APN Blog post, Simplifying Network Connectivity with VMware Cloud on AWS and AWS Direct Connect.

Many Oracle customers hope to retain the same IP address in VMware Cloud on AWS. This is called a Layer 2 extension for VMware Cloud on AWS, and this option works only if the Oracle database or application is already in VMware on-premises. For more information, see the documentation.

Migration Options

Let’s review some of the migration options available for Oracle workloads. Some of these are only appropriate if the Oracle workload is already running on VMware on-premises, while others can work even if you’re not already using VMware locally.

VMware HCX

VMware HCX provides live migration technologies to VMware Cloud on AWS, without the need to retrofit your on-premises VMware infrastructure. It also supports migration from vSphere 5.0+.

HCX has four methods of migration: HCX Bulk Migration, HCX vMotion, HCX Cold Migration, and HCX Replication-Assisted vMotion. The following table helps you choose the right HCX migration methodology for your needs.


Figure 1 – How to choose the right VMware HCX migration methodology for your needs. 

For a more in-depth look, see the APN Blog post, Migrating Workloads to VMware Cloud on AWS with Hybrid Cloud Extension (HCX).

Oracle Technologies

There are multiple options when using Oracle technologies to move to VMware Cloud on AWS.

For example, you can use Oracle RMAN backups to restore the databases in VMware Cloud on AWS. The AWS Storage Gateway file interface can run on the local VMware infrastructure and presents an NFS storage that can be mounted on an Oracle database server.

You can back up the database using Oracle RMAN, and write the RMAN backup set to the NFS storage, which is then copied to the designated Amazon Simple Storage Service (Amazon S3) bucket in the linked AWS account. You can copy the backup set from the S3 bucket to virtual machines (VMs) in VMware Cloud on AWS. Or you can choose to mount the S3 bucket as an NFS mount on VMware Cloud VMs using Storage Gateway.

Alternate Oracle technologies, such as Oracle Data Guard, Oracle Active Data Guard, or Oracle GoldenGate, can be used to replicate the on-premises database to the target database in VMware Cloud on AWS.

You can also consider AWS Direct Connect, depending on the amount of redo log generated in the source Oracle database. Oracle workloads based on the filesystem, such as Oracle WebLogic, can be replicated using traditional file syncing tools such as rsync.

AWS Database Migration Service

AWS DMS supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database engines such as Oracle to Amazon Aurora. With AWS DMS, you can perform one-time migrations as well as continuous replication of data from source to target.

AWS DMS is a platform-agnostic migration service. Oracle customers can use AWS DMS to migrate data from non-x86 platforms like Solaris, HP-UX, or even non-Oracle RAC databases to Oracle RAC databases.

Oracle RAC can be implemented on VMware Cloud on AWS, which supports multi-cast and shared storage requirements of Oracle RAC. We have published a reference architecture for Oracle RAC on VMware Cloud on AWS. For a step-by-step implementation guide, see the documentation.

Migrating Workloads with AWS DMS

The following diagram uses linked AWS accounts to run an AWS DMS instance.


Figure 2 – AWS DMS database migration architecture.

The steps to implement the migration architecture are as follows:

  1. Provision the AWS DMS instance.
  2. Create endpoints for source and target databases.
  3. Create and start AWS DMS tasks.
  4. Monitor the migration.

Provisioning the AWS DMS Instance

VMware Cloud on AWS creates cloud elastic network interfaces in a specific subnet in the linked AWS account. To provide low latency and high performance, you should provision your AWS DMS instance in the same subnet as the cloud network interface.

Your AWS DMS instance subnet also needs connectivity to the on-premises data center network where the source Oracle database exists. Depending upon the volume of data to be transferred, the connection could be a VPN-based connection or AWS Direct Connect.

AWS DMS offers a variety of instances for migration, but compute-optimized instances (C4) and memory-optimized instances (R4) are suitable for large production-class migration.

General purpose (T2) AWS DMS instances are useful for testing the initial connection and for migrating development instances. For large database migrations, I recommend AWS DMS 3.1.1 or higher, as it supports parallel full unload of tables and partitions for Oracle databases. The AWS DMS release notes contain detailed notes on features and updates.
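
If you prefer to script the provisioning, an AWS CLI call along these lines can create the replication instance; the identifier, instance class, storage size, and subnet group name below are placeholders for illustration rather than values from this walkthrough:

aws dms create-replication-instance \
    --replication-instance-identifier oracle-migration-dms \
    --replication-instance-class dms.r4.xlarge \
    --allocated-storage 100 \
    --replication-subnet-group-identifier linked-account-dms-subnets \
    --no-publicly-accessible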

As shown in Figure 3, on the AWS DMS instance creation screen you can choose the DMS instance class, DMS engine version (we recommend choosing the latest engine version), and Amazon EBS storage allocated to the instance.


Figure 3 – Screenshot of AWS DMS instance creation.

Log files and cached transactions consume allocated storage. For cached transactions, storage is used only when the cached transactions must be written to disk. Cached transactions from the source database are written when it takes time to load large tables in the target database.

The default storage is usually sufficient, but the allocated storage may be consumed when there are large tables and a high rate of row change in the source table. Monitor the storage-related metrics and scale up the storage when required.

Creating Endpoints for Source and Target Databases

You can create endpoints for source and target databases. AWS DMS supports Oracle databases versions 10.2 through 12.2.

AWS DMS supports change data capture (CDC) to perform continuous replication. To read through the redo log files and archived log files, it can use either Oracle LogMiner or Oracle Binary Reader.

AWS DMS supports Oracle ASM for transaction log access, but the extra connection attribute in DMS must include your ASM username and ASM server address.

useLogMinerReader=N;asm_user=;asm_server=:/+ASM

AWS DMS can use a separate Oracle user account in its source database to support replication. For a complete list of user account privileges and other replication considerations, see the documentation.

Creating and Starting AWS DMS Tasks

An AWS DMS task is where all the work happens. You can specify tables and schemas, and provide filtering conditions. Multiple AWS DMS tasks are usually created to facilitate parallel load of independent schemas within the database.

As you can see in Figure 4, you can name the AWS DMS task using Task Identifier, and choose the replication instance you previously created. You can also select the source and target database endpoints, and select the type of AWS DMS migration task.


Figure 4 – Screenshot of AWS DMS task creation.

You can set up a task to do a one-time migration of data from source to target. You can also create an ongoing replication task to provide continuous replication between the source and target. To read ongoing changes for Oracle databases, AWS DMS uses either the Oracle LogMiner API or Binary Reader (bfile API).

AWS DMS reads ongoing changes from the online or archived redo logs based on the system change number (SCN). There are two types of ongoing replication tasks:

  • Full load and CDC: Task migrates existing data and then updates the target based on the changes from the source data.
  • CDC only: Task migrates ongoing changes after you have data on the target database. The initial target database could have been created using Oracle RMAN backup and restore or the export/import utility. It’s important to know the last SCN of the backup.
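
The same task setup can be scripted. As a rough sketch (the ARNs, identifiers, and schema name are placeholders), a full load and CDC task could be created from the AWS CLI like this:

aws dms create-replication-task \
    --replication-task-identifier oracle-full-load-and-cdc \
    --migration-type full-load-and-cdc \
    --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SRCENDPOINT \
    --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TGTENDPOINT \
    --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:REPINSTANCE \
    --table-mappings '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "include-app-schema", "object-locator": {"schema-name": "APPSCHEMA", "table-name": "%"}, "rule-action": "include"}]}'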

To have the tasks start immediately, choose Start Task on Create. Or, you can start the task from the Action menu on the Database Migration Tasks page.

You can validate the number of rows inserted in target tables by choosing Enable Validation. A row count is performed on each target table and compared to the source table. However, this setting can be time-consuming and might affect the overall migration time.

Monitoring the Migration

You can monitor the progress of your task by checking on the task status and by monitoring the tasks control table dmslogs.awsdms_apply_exceptions. For a full list of control tables and their columns, see the documentation.

You can also monitor the progress of your tasks using Amazon CloudWatch. By using the console, AWS Command Line Interface (CLI), or AWS DMS API, you can monitor the task progress and the resources and network connectivity used.

As shown in Figure 5, you can monitor various migration task metrics to gain better insight into the performance of the AWS DMS migration task.


Figure 5 – AWS DMS migration task metrics console page.

Other Technologies

There are other technologies you could consider for Oracle migration. AWS Marketplace lists software solutions that can be easily deployed on AWS. You can connect these third-party software deployments to your on-premises over AWS Direct Connect or a virtual private network (VPN) solution and move data to VMware Cloud on AWS.

Conclusion

To migrate Oracle workloads to VMware Cloud on AWS, there are a number of migration methods to choose from based on your needs and existing system.

For online migrations, VMware technologies like vMotion and HCX help you migrate VM workloads from on-premises VMware clusters to VMware Cloud on AWS.

AWS DMS helps you to migrate and/or replicate Oracle databases to VMware Cloud on AWS with minimal downtime. It supports Oracle RAC workloads and can migrate across platforms such as HP-UX, IBM-AIX, and Solaris SPARC. You may also use native Oracle technologies, such as Oracle Data Guard or Oracle GoldenGate, to migrate workloads to AWS.

Cold migration or offline migration of Oracle workloads is also possible using Oracle RMAN, AWS Snowball, AWS Storage Gateway, or VMware HCX.

For additional information on VMware Cloud on AWS, please visit our website. If you’re interested in discussing your Oracle workloads on VMware Cloud on AWS, reach out to us at aws-vmware-cloud@amazon.com.

Beena Emerson: Benchmark Partition Table – 1


Feed: Planet PostgreSQL.

With the addition of declarative partitioning in PostgreSQL 10, it only made sense to extend the existing pgbench benchmarking module to create partitioned tables. A recently committed patch by Fabien Coelho for PostgreSQL 13 has made this possible.


The pgbench_accounts table can now be partitioned with the --partitions and --partition-method options, which specify the number of partitions and the partitioning method when we initialize the database.


pgbench -i --partitions=<number of partitions> [--partition-method=<range|hash>]

partitions : This must be a positive integer value
partition-method : Currently only range and hash are supported and the default is range.

pgbench will throw an error if the --partition-method is specified without a valid --partitions option.

For range partitions, the given range is equally split into the specified partitions. The lower bound of the first partition is MINVALUE and the upper bound of the last partition is MAXVALUE. For hash partitions, the number of partitions specified is used in the modulo operation.

Test Partitions

I performed a few tests using the new partition options with the following settings:
  • pgbench scale = 5000 (~63GB data + 10GB indexes)
  • pgbench thread/client count = 32
  • shared_buffers = 32GB
  • min_wal_size = 15GB
  • max_wal_size = 20GB
  • checkpoint_timeout=900
  • maintenance_work_mem=1GB
  • checkpoint_completion_target=0.9
  • synchronous_commit=on


The hardware specification of the machine on which the benchmarking was performed is as follows:
  • IBM POWER8 Server
  • Red Hat Enterprise Linux Server release 7.1 (Maipo) (with kernel Linux 3.10.0-229.14.1.ael7b.ppc64le)
  • 491GB RAM
  • IBM,8286-42A CPUs (24 cores, 192 with HT)



Two different types of queries were tested:
  1. Read-only default query: It was run using the existing -S option of pgbench.

  2. Range query: The following custom query, which searches for a range that is 0.2% of the total rows, was used.

\set v1 random(1, 100000 * :scale)
\set v2 :v1 + 1000000
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid BETWEEN :v1 AND :v2;
END;
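
Assuming the script above is saved to a file such as range_query.sql (the file name and the 300-second duration are illustrative choices, not taken from the post), it would be driven with pgbench’s custom-script mode, matching the 32 clients and threads listed above:

pgbench -c 32 -j 32 -T 300 -f range_query.sql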


Tests were run for both range and hash partition types. The following table shows the median of three tps readings taken and the tps increase in percentage when compared to the non-partitioned table. 

Non-partitioned baseline: 323331.60 tps (read-only default query), 35.36 tps (range query).

            |        Read-only Default Query        |              Range Query
 partitions | range tps   increase | hash tps   increase | range tps  increase | hash tps  increase
------------+----------------------+---------------------+---------------------+--------------------
        100 | 201648.82   -37.63 % | 208805.45  -35.42 % |     36.92    4.40 % |    35.31   -0.16 %
        200 | 189642.09   -41.35 % | 199718.17  -38.23 % |     37.63    6.42 % |    34.34   -2.90 %
        300 | 191242.31   -40.85 % | 203182.88  -37.16 % |     38.33    8.38 % |    34.01   -3.82 %
        400 | 186329.88   -42.37 % | 189118.42  -41.51 % |     49.43   39.78 % |    34.86   -1.44 %
        500 | 189727.31   -41.32 % | 195470.47  -39.54 % |     48.39   36.83 % |    33.19   -6.13 %
        600 | 185143.62   -42.74 % | 191237.48  -40.85 % |     45.42   28.44 % |    32.42   -8.32 %
        700 | 179190.37   -44.58 % | 178999.73  -44.64 % |     42.18   19.29 % |    32.57   -7.91 %
        800 | 170432.79   -47.29 % | 173027.42  -46.49 % |     45.82   29.57 % |    31.38  -11.28 %

Read-only Default Query
In this type of OLTP point query, we are selecting only one row. Internally, an index scan is performed on the pgbench_accounts_pkey for the value being queried. In the non-partitioned case, the index scan is performed on the only index present. However, for the partitioned case, the partition details are collected and then partition pruning is carried out before performing an index scan on the selected partition. 
As seen on the graph, the different types of partitions do not show much change in behavior because we would be targeting only one row in one particular partition. This drop in performance for the partitioned case can be attributed to the overhead of handling a large number of partitions. The performance is seen to slowly degrade as the number of partitions is increased.

Range Custom Query

In this type of query, one million rows, which are about 0.2% of the total entries, are targeted in sequence. In the non-partitioned case, the single primary key index is searched for the whole of the given range. As in the previous case, for the partitioned table, partition pruning is attempted before the index scan is performed on the smaller indexes of the selected partitions.

Given the way the different partition types sort out the rows, the given range being queried will only be divided amongst at most two partitions in the range type but it would be scattered across all the partitions for hash type. As expected the range type fares much better in this scenario given the narrowed search being performed. The hash type performs worse as it is practically doing a full index search, like in the non-partitioned case, along with bearing the overhead of partition handling.
We can discern that range-partitioned tables are very beneficial when the majority of the queries are range queries. We have not seen any benefit for hash partitions in these tests, but they are expected to fare better in certain scenarios involving sequential scans. We can conclude that the partition type and other partition parameters should be set only after thorough analysis, as an incorrect partitioning implementation can tremendously decrease the overall performance.

I want to extend a huge thank you to all those who have contributed to this much-needed feature, which makes it possible to benchmark partitioned tables – Fabien Coelho, Amit Kapila, Amit Langote, Dilip Kumar, Asif Rehman, and Alvaro Herrera.

  
This blog is also published on postgresrocks.

Hubert ‘depesz’ Lubaczewski: Waiting for PostgreSQL 13 – pgbench: add --partitions and --partition-method options.


Feed: Planet PostgreSQL.

On 3rd of October 2019, Amit Kapila committed patch:

pgbench: add --partitions and --partition-method options.
 
These new options allow users to partition the pgbench_accounts table by
specifying the number of partitions and partitioning method.  The values
allowed for partitioning method are range and hash.
 
This feature allows users to measure the overhead of partitioning if any.
 
Author: Fabien COELHO
 
Alvaro Herrera
Discussion: https://postgr.es/m/alpine.DEB.2.21..7008@lancre

This is an interesting addition. If you’re not familiar with pgbench, it’s a tool that benchmarks a PostgreSQL instance.

Running it happens in two phases:

  1. initialize: pgbench -i -s …
  2. run benchmark: pgbench …

Obviously, partitioning has to be done at initialization, so let’s first try a simple run:

=$ pgbench -i -s 100
...
done in 10.03 s (drop tables 0.00 s, create tables 0.02 s, generate 6.70 s, vacuum 1.28 s, primary keys 2.03 s).

In the test database, I see:

$ \d+
                                List OF relations
 Schema |       Name       | TYPE  | Owner | Persistence |  SIZE   | Description
--------+------------------+-------+-------+-------------+---------+-------------
 public | pgbench_accounts | TABLE | pgdba | permanent   | 1281 MB |
 public | pgbench_branches | TABLE | pgdba | permanent   | 40 kB   |
 public | pgbench_history  | TABLE | pgdba | permanent   | 0 bytes |
 public | pgbench_tellers  | TABLE | pgdba | permanent   | 80 kB   |
(4 ROWS)

Now, let’s retry initialize, with partitioning:

=$ pgbench -i -s 100 --partitions=10
...
done in 19.70 s (drop tables 0.00 s, create tables 0.03 s, generate 7.34 s, vacuum 10.24 s, primary keys 2.08 s).

and content is:

$ \d+
                                       List OF relations
 Schema |        Name         |       TYPE        | Owner | Persistence |  SIZE   | Description
--------+---------------------+-------------------+-------+-------------+---------+-------------
 public | pgbench_accounts    | partitioned TABLE | pgdba | permanent   | 0 bytes |
 public | pgbench_accounts_1  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_10 | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_2  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_3  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_4  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_5  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_6  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_7  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_8  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_9  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_branches    | TABLE             | pgdba | permanent   | 40 kB   |
 public | pgbench_history     | TABLE             | pgdba | permanent   | 0 bytes |
 public | pgbench_tellers     | TABLE             | pgdba | permanent   | 80 kB   |
(14 ROWS)

with main table looking like this:

$ \d+ pgbench_accounts
                            Partitioned TABLE "public.pgbench_accounts"
  COLUMN  |     TYPE      | Collation | NULLABLE | DEFAULT | Storage  | Stats target | Description
----------+---------------+-----------+----------+---------+----------+--------------+-------------
 aid      | INTEGER       |           | NOT NULL |         | plain    |              |
 bid      | INTEGER       |           |          |         | plain    |              |
 abalance | INTEGER       |           |          |         | plain    |              |
 filler   | CHARACTER(84) |           |          |         | extended |              |
Partition KEY: RANGE (aid)
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
Partitions: pgbench_accounts_1 FOR VALUES FROM (MINVALUE) TO (1000001),
            pgbench_accounts_10 FOR VALUES FROM (9000001) TO (MAXVALUE),
            pgbench_accounts_2 FOR VALUES FROM (1000001) TO (2000001),
            pgbench_accounts_3 FOR VALUES FROM (2000001) TO (3000001),
            pgbench_accounts_4 FOR VALUES FROM (3000001) TO (4000001),
            pgbench_accounts_5 FOR VALUES FROM (4000001) TO (5000001),
            pgbench_accounts_6 FOR VALUES FROM (5000001) TO (6000001),
            pgbench_accounts_7 FOR VALUES FROM (6000001) TO (7000001),
            pgbench_accounts_8 FOR VALUES FROM (7000001) TO (8000001),
            pgbench_accounts_9 FOR VALUES FROM (8000001) TO (9000001)

If I’d use hash based partitioning, it would be:

=$ pgbench -i -s 100 --partitions=10 --partition-method=hash
...
done in 11.98 s (drop tables 0.12 s, create tables 0.03 s, generate 7.40 s, vacuum 1.93 s, primary keys 2.51 s).

created:

$ \d+
                                       List OF relations
 Schema |        Name         |       TYPE        | Owner | Persistence |  SIZE   | Description
--------+---------------------+-------------------+-------+-------------+---------+-------------
 public | pgbench_accounts    | partitioned TABLE | pgdba | permanent   | 0 bytes |
 public | pgbench_accounts_1  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_10 | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_2  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_3  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_4  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_5  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_6  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_7  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_8  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_accounts_9  | TABLE             | pgdba | permanent   | 128 MB  |
 public | pgbench_branches    | TABLE             | pgdba | permanent   | 40 kB   |
 public | pgbench_history     | TABLE             | pgdba | permanent   | 0 bytes |
 public | pgbench_tellers     | TABLE             | pgdba | permanent   | 80 kB   |
(14 ROWS)
 
$ \d+ pgbench_accounts
                            Partitioned TABLE "public.pgbench_accounts"
  COLUMN  |     TYPE      | Collation | NULLABLE | DEFAULT | Storage  | Stats target | Description
----------+---------------+-----------+----------+---------+----------+--------------+-------------
 aid      | INTEGER       |           | NOT NULL |         | plain    |              |
 bid      | INTEGER       |           |          |         | plain    |              |
 abalance | INTEGER       |           |          |         | plain    |              |
 filler   | CHARACTER(84) |           |          |         | extended |              |
Partition KEY: HASH (aid)
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
Partitions: pgbench_accounts_1 FOR VALUES WITH (modulus 10, remainder 0),
            pgbench_accounts_10 FOR VALUES WITH (modulus 10, remainder 9),
            pgbench_accounts_2 FOR VALUES WITH (modulus 10, remainder 1),
            pgbench_accounts_3 FOR VALUES WITH (modulus 10, remainder 2),
            pgbench_accounts_4 FOR VALUES WITH (modulus 10, remainder 3),
            pgbench_accounts_5 FOR VALUES WITH (modulus 10, remainder 4),
            pgbench_accounts_6 FOR VALUES WITH (modulus 10, remainder 5),
            pgbench_accounts_7 FOR VALUES WITH (modulus 10, remainder 6),
            pgbench_accounts_8 FOR VALUES WITH (modulus 10, remainder 7),
            pgbench_accounts_9 FOR VALUES WITH (modulus 10, remainder 8)

When running the actual test, pgbench detects the partitioning setup on its own:

=$ pgbench -j $( nproc ) -c $( nproc ) -T 30
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
partition method: hash
partitions: 10
query mode: simple
number of clients: 8
number of threads: 8
duration: 30 s
number of transactions actually processed: 49460
latency average = 4.853 ms
tps = 1648.313046 (including connections establishing)
tps = 1648.411195 (excluding connections establishing)

while for range based partitioning it looks like:

=$ pgbench -j $( nproc ) -c $( nproc ) -T 30
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
partition method: range
partitions: 10
query mode: simple
number of clients: 8
number of threads: 8
duration: 30 s
number of transactions actually processed: 51453
latency average = 4.665 ms
tps = 1714.829850 (including connections establishing)
tps = 1715.280907 (excluding connections establishing)

and without partitioning:

=$ pgbench -j $( nproc ) -c $( nproc ) -T 30
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 8
duration: 30 s
number of transactions actually processed: 52600
latency average = 4.563 ms
tps = 1753.153104 (including connections establishing)
tps = 1753.661528 (excluding connections establishing)

This is pretty cool, thanks to all involved.


Column Histograms on Percona Server and MySQL 8.0


Feed: Planet MySQL.
Author: Corrado Pandiani.

From time to time you may have experienced that MySQL was not able to find the best execution plan for a query. You felt the query should have been faster. You felt that something didn’t work, but you didn’t realize exactly what.

Maybe some of you did tests and discovered there was a better execution plan that MySQL wasn’t able to find (forcing the order of the tables with STRAIGHT_JOIN for example).

In this article, we’ll see a new interesting feature available on MySQL 8.0 as well as Percona Server for MySQL 8.0: the histogram-based statistics.

Today, we’ll see what a histogram is, how you can create and manage it, and how MySQL’s optimizer can use it.

Just for completeness, histogram statistics have been available on MariaDB since version 10.0.2, with a slightly different implementation. Anyway, what we’ll see here is related to Percona Server and MySQL 8.0 only.

What is a histogram

We can define a histogram as a good approximation of the data distribution of the values in a column.

Histogram-based statistics were introduced to give the optimizer more execution plans to investigate and solve a query. Until then, in some cases, the optimizer was not able to find out the best possible execution plan because non-indexed columns were ignored.

With histogram statistics, now the optimizer may have more options because also non-indexed columns can be considered. In some specific cases, a query can run faster than usual.

Let’s consider the following table to store departing times of the trains:
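
A minimal sketch of such a table follows; the table name and the id column are assumptions for illustration, and only the departure_time column is referenced later:

CREATE TABLE trains (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  departure_time TIME NOT NULL
  -- note: no index on departure_time
);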

We can assume that during peak hours, from 7 AM until 9 AM, there are more rows, and during the night hours we have very few rows.

Let’s take a look at the following two queries:
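
The two queries would have roughly this shape, using the sketch above; the exact time ranges are assumptions:

-- Query 1: peak hours, matches many rows
SELECT * FROM trains WHERE departure_time BETWEEN '07:00:00' AND '09:00:00';

-- Query 2: night hours, matches very few rows
SELECT * FROM trains WHERE departure_time BETWEEN '01:00:00' AND '03:00:00';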

Without any kind of statistics, the optimizer assumes by default that the values in the departure_time column are evenly distributed, but they aren’t. In reality the first query returns many more rows than the second, something the optimizer cannot anticipate under that assumption.

Histograms were invented to provide the optimizer with a good estimation of the number of rows returned. This seems trivial for the simple queries we have seen so far. But now think about the same table being involved in JOINs with other tables. In such a case, the number of rows returned can be very important for the optimizer when deciding the order in which to consider the tables in the execution plan.

A good estimation of the rows returned lets the optimizer place a table in the first stages of the plan when it returns few rows. This minimizes the total number of rows in the final cartesian product, and the query can run faster.

MySQL supports two different types of histograms: “singleton” and “equi-height”. Common for all histogram types is that they split the data set into a set of “buckets”, and MySQL automatically divides the values into the buckets and will also automatically decide what type of histogram to create.

Singleton histogram

  • one value per bucket
  • each bucket stores
    • value
    • cumulative frequency
  • well suited for equality and range conditions

Equi-height histogram

  • multiple values per bucket
  • each bucket stores
    • minimum value
    • maximum value
    • cumulative frequency
    • number of distinct values
  • not really equi-height: frequent values are in separate buckets
  • well suited for range conditions

How to use histograms

The histogram feature is available and enabled on the server, but histograms are not created automatically, so at first the optimizer has nothing to use. Without an explicit creation, the optimizer works the same as usual and cannot get any benefit from histogram-based statistics.

There is some manual operation to do. Let’s see.

In the next examples, we’ll use the world sample database you can download from here: https://dev.mysql.com/doc/index-other.html

Let’s start executing a query joining two tables to find out all the languages spoken on the largest cities of the world, with more than 10 million people.
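
A query of that shape against the world sample database might look like the following; the selected column list is an assumption, but the join and the Population filter match the discussion that follows:

SELECT cl.Language, ci.Name, ci.Population
FROM city ci
JOIN countrylanguage cl ON cl.CountryCode = ci.CountryCode
WHERE ci.Population > 10000000;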

The query takes 0.04 seconds. It’s not a lot, but consider that the database is very small. Use the BENCHMARK function to have more relevant response times if you like.

Let’s see the EXPLAIN:

Indexes are used for both the tables and the estimated cartesian product has 984 * 18 = 17,712 rows.

Now generate the histogram on the Population column. It’s the only column used for filtering the data and it’s not indexed.

For that, we have to use the ANALYZE command:
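
Using the table, column, and bucket count from this example, the statement would be:

ANALYZE TABLE city UPDATE HISTOGRAM ON Population WITH 1024 BUCKETS;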

We have created a histogram using 1024 buckets. The number of buckets is not mandatory, and it can be any number from 1 to 1024. If omitted, the default value is 100.

The number of buckets affects the reliability of the statistics. The more distinct values you have, the more buckets you need.

Let’s have a look now at the execution plan and execute the query again.

The execution plan is different, and the query runs faster.

We can notice that the order of the tables is the opposite of before. Even if it requires a full scan, the city table is in the first stage. It’s because of the filtered value that is only 0.06. It means that only 0.06% of the rows returned by the full scan will be joined with the following table. So, it’s only 4188 * 0.06% = 2.5 rows. In total, the estimated cartesian product is 2.5 * 984 = 2,460 rows. This is significantly lower than the previous execution and explains why the query is faster.

What we have seen sounds a little counterintuitive, doesn’t it? In fact, until MySQL 5.7, we were used to considering full scans as very bad in most cases. In our case, instead, forcing a full scan using a histogram statistic on a non-indexed column lets the query get optimized. Awesome.

Where are the histogram statistics

Histogram statistics are stored in the column_statistics table in the data dictionary and are not directly accessible by the users. Instead the INFORMATION_SCHEMA.COLUMN_STATISTICS table, which is implemented as a view of the data dictionary, can be used for the same purpose.

Let’s see the statistics for our table.
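
A simple way to do that, filtering on the example table and column, is:

SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, JSON_PRETTY(HISTOGRAM) AS histogram
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE TABLE_NAME = 'city' AND COLUMN_NAME = 'Population';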

For each bucket, we can see the minimum and maximum values, the cumulative frequency, and the number of distinct values. Also, we can see that MySQL decided to use an equi-height histogram.

Let’s try to generate a histogram on another table and column.

In this case, a singleton histogram was generated.

Using the following query we can see more human-readable statistics.
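
One way to get such a view, assuming the equi-height bucket layout of [minimum value, maximum value, cumulative frequency, number of distinct values] in the histogram JSON, is to unnest the buckets with JSON_TABLE:

SELECT b.*
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS cs,
     JSON_TABLE(cs.HISTOGRAM->'$.buckets', '$[*]'
       COLUMNS (
         bucket      FOR ORDINALITY,
         min_value   VARCHAR(64) PATH '$[0]',
         max_value   VARCHAR(64) PATH '$[1]',
         cum_freq    DOUBLE      PATH '$[2]',
         n_distinct  INT         PATH '$[3]'
       )) AS b
WHERE cs.TABLE_NAME = 'city' AND cs.COLUMN_NAME = 'Population';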

Histogram maintenance

Histogram statistics are not automatically recalculated. If you have a table that is very frequently updated with a lot of INSERTs, UPDATEs, and DELETEs, the statistics can become stale very quickly. Having unreliable histograms can lead the optimizer to make the wrong choice.

When you find a histogram was useful to optimize a query, you need to also have a scheduled plan to refresh the statistics from time to time, in particular after doing massive modifications to the table.

To refresh a histogram you just need to run the same ANALYZE command we have seen before.

To completely drop a histogram you may run the following:
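
For the example column used here, that would be:

ANALYZE TABLE city DROP HISTOGRAM ON Population;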

Sampling

The histogram_generation_max_mem_size system variable controls the maximum amount of memory available for histogram generation. The global and session values may be set at runtime.

If the estimated amount of data to be read into memory for histogram generation exceeds the limit defined by the variable, MySQL samples the data rather than reading all of it into memory. Sampling is evenly distributed over the entire table.

The default value is 20000000 but you can increase it in the case of a large column if you want more accurate statistics. For very large columns, pay attention not to increase the threshold more than the memory available in order to avoid excessive overhead or outage.
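
For example, to raise the limit for the current session before regenerating a histogram on a large column (the value here is arbitrary):

SET SESSION histogram_generation_max_mem_size = 200000000;
ANALYZE TABLE city UPDATE HISTOGRAM ON Population WITH 1024 BUCKETS;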

Conclusion

Histogram statistics are particularly useful for non-indexed columns, as shown in the example.

Execution plans that can rely on indexes are usually the best, but histograms can help in some edge cases or when creating a new index is a bad idea.

Since this is not an automatic feature, some manual testing is required to investigate if you really can get the benefit of a histogram. Also, the maintenance requires some scheduled and manual activity.

Use histograms if you really need them, but don’t abuse them since histograms on very large tables can consume a lot of memory.

Usually, the best candidates for a histogram are the columns with:

  • values that do not change much over time
  • low cardinality values
  • uneven distribution

Install Percona Server 8.0, test and enjoy the histograms.

Further reading on the same topic: Billion Goods in Few Categories – How Histograms Save a Life?

Sept 2019: “Top 40” New R Packages


Feed: R-bloggers.
Author: R Views.

This article was first published on R Views and kindly contributed to R-bloggers.

One hundred and thirteen new packages made it to CRAN in September. Here are my “Top 40” picks in eight categories: Computational Methods, Data, Economics, Machine Learning, Statistics, Time Series, Utilities, and Visualization.

Computational Methods

eRTG3D v0.6.2: Provides functions to create realistic random trajectories in a 3-D space between two given fixed points (conditional empirical random walks), based on empirical distribution functions extracted from observed trajectories (training data), and thus reflect the geometrical movement characteristics of the mover. There are several small vignettes, including sample data sets, linkage to the sf package, and point cloud analysis.

freealg v1.0: Implements the free algebra in R: multivariate polynomials with non-commuting indeterminates. See the vignette for the math.

HypergeoMat v3.0.0: Implements Koev & Edelman’s algorithm (2006) to evaluate the hypergeometric functions of a matrix argument, which appear in random matrix theory. There is a vignette.

opart v2019.1.0: Provides a reference implementation of standard optimal partitioning algorithm in C using square-error loss and Poisson loss functions as described by Maidstone (2016), Hocking (2016), Rigaill (2016), and Fearnhead (2016) that scales quadratically with the number of data points in terms of time-complexity. There are vignettes for Gaussian and Poisson squared error loss.

Data

cde v0.4.1: Facilitates searching, download and plotting of Water Framework Directive (WFD) reporting data for all water bodies within the UK Environment Agency area. This package has been peer-reviewed by rOpenSci. There is a Getting Started Guide and a vignette on output reference.

eph v0.1.1: Provides tools to download and manipulate data from the Argentina Permanent Household Survey. The implemented methods are based on INDEC (2016).

leri v0.0.1: Fetches Landscape Evaporative Response Index (LERI) data using the raster package. The LERI product measures anomalies in actual evapotranspiration, to support drought monitoring and early warning systems. See the vignette for examples.

rwhatsapp v0.2.0: Provides functions to parse and digest history files from the popular messenger service WhatsApp. There is a vignette.

tidyUSDA v0.2.1: Provides a consistent API to pull United States Department of Agriculture census and survey data from the National Agricultural Statistics Service (NASS) QuickStats service. See the vignette.

Economics

bunching v0.8.4: Implements the bunching estimator from economic theory for kinks and knots. There is a vignette on Theory, and another with Examples.

fixest v0.1.2: Provides fast estimation of econometric models with multiple fixed-effects, including ordinary least squares (OLS), generalized linear models (GLM), and the negative binomial. The method to obtain the fixed-effects coefficients is based on Berge (2018). There is a vignette.

raceland v1.0.3: Implements a computational framework for a pattern-based, zoneless analysis, and visualization of (ethno)racial topography for analyzing residential segregation and racial diversity. There is a vignette describing the Computational Framework, one describing Patterns of Racial Landscapes, and a third on SocScape Grids.

Machine Learning

biclustermd v0.1.0: Implements biclustering, a statistical learning technique that simultaneously partitions, and clusters rows and columns of a data matrix in a manner that can deal with missing values. See the vignette for examples.

bbl v0.1.5: Implements supervised learning using Boltzmann Bayes model inference, enabling the classification of data into multiple response groups based on a large number of discrete predictors that can take factor values of heterogeneous levels. See Woo et al. (2016) for background, and the vignette for how to use the package.

corporaexplorer v0.6.3: Implements Shiny apps to dynamically explore collections of texts. Look here for more information.

fairness v1.0.1: Offers various metrics to assess and visualize the algorithmic fairness of predictive and classification models using methods described by Calders and Verwer (2010), Chouldechova (2017), Feldman et al. (2015), Friedler et al. (2018), and Zafar et al. (2017). There is a tutorial for the package.

imagefluency v0.2.1: Provides functions to collect image statistics based on processing fluency theory that include scores for several basic aesthetic principles that facilitate fluent cognitive processing of images: contrast, complexity / simplicity, self-similarity, symmetry, and typicality. See Mayer & Landwehr (2018) and Mayer & Landwehr (2018) for the theoretical background, and the vignette for an introduction.

ineqJD v1.0: Provides functions to compute and decompose Gini, Bonferroni, and Zenga 2007 point and synthetic concentration indexes. See Zenga M. (2015), Zenga & Valli (2017), and Zenga & Valli (2018) for more information.

lmds v0.1.0: Implements Landmark Multi-Dimensional Scaling (LMDS), a dimensionality reduction method scaleable to large numbers of samples, because rather than calculating a complete distance matrix between all pairs of samples, it only calculates the distances between a set of landmarks and the samples. See the README for an example.

modelStudio v0.1.7: Implements an interactive platform to help interpret machine learning models. There is a vignette, and look here for a demo of the interactive features.

nlpred v1.0: Provides methods for obtaining improved estimates of non-linear cross-validated risks obtained using targeted minimum loss-based estimation, estimating equations, and one-step estimation. Cross-validated area under the receiver operating characteristic curve (LeDell et al. (2015)) and other metrics are included. There is a vignette on small sample estimates.

pyMTurkR v1.1: Provides access to the latest Amazon Mechanical Turk’ (‘MTurk’) Requester API (version ‘2017–01–17’), replacing the now deprecated MTurkR package.

stagedtrees v1.0.0: Creates and fits staged event tree probability models, probabilistic graphical models capable of representing asymmetric conditional independence statements among categorical variables. See Görgen et al. (2018), Thwaites & Smith (2017), Barclay et al. (2013), and Smith & Anderson (doi:10.1016/j.artint.2007.05.004) for background, and look here for an overview.

Statistics

confoundr v1.2: Implements three covariate-balance diagnostics for time-varying confounding and selection-bias in complex longitudinal data, as described in Jackson (2016) and Jackson (2019). There is a Demo vignette and another Describing Selection Bias from Dependent Censoring

distributions3 v0.1.1: Provides tools to create and manipulate probability distributions using S3. Generics random(), pdf(), cdf(), and quantile() provide replacements for base R’s r/d/p/q style functions. The documentation for each distribution contains detailed mathematical notes. There are several vignettes: Intro to hypothesis testing, One-sample sign tests,
One-sample T confidence interval, One-sample T-tests, Z confidence interval for a mean, One-sample Z-tests for a proportion, One-sample Z-tests, Paired tests, and Two-sample Z-tests.

dobin v0.8.4: Implements a dimension reduction technique for outlier detection, which constructs a set of basis vectors for outlier detection that bring outliers to the forefront using fewer basis vectors. See Kandanaarachchi & Hyndman (2019) for background, and the vignette for a brief introduction.

glmpca v0.1.0: Implements a generalized version of principal components analysis (GLM-PCA) for dimension reduction of non-normally distributed data, such as counts or binary matrices. See Townes et al. (2019) and Townes (2019) for details, and the vignette for examples.

immuneSIM v0.8.7: Provides functions to simulate full B-cell and T-cell receptor repertoires using an in-silico recombination process that includes a wide variety of tunable parameters to introduce noise and biases. See Weber et al. (2019) for background, and look here for information about the package.

irrCAC v1.0: Provides functions to calculate various chance-corrected agreement coefficients (CAC) among two or more raters, including Cohen’s kappa, Conger’s kappa, Fleiss’ kappa, Brennan-Prediger coefficient, Gwet’s AC1/AC2 coefficients, and Krippendorff’s alpha. There are vignettes on benchmarking, Calculating Chance-corrected Agreement Coefficients, and Computing weighted agreement coefficients.

LPBlg v1.2: Given a postulated model and a set of data, provides functions that estimate a density and derive a deviance test to assess whether the data distribution deviates significantly from the postulated model. See Algeri S. (2019) for details.

SynthTools v1.0.0: Provides functions to support experimentation with partially synthetic data sets. Confidence interval and standard error formulas have options for either synthetic data sets or multiple imputed data sets. For more information, see Reiter & Raghunathan (2007).

Time Series

fable v0.1.0: Provides a collection of commonly used univariate and multivariate time series forecasting models, including automatically selected exponential smoothing (ETS) and autoregressive integrated moving average (ARIMA) models. There is an Introduction and a vignette on transformations.

nsarfima v0.1.0.0: Provides routines for fitting and simulating data under autoregressive fractionally integrated moving average (ARFIMA) models, without the constraint of stationarity. Two fitting methods are implemented: a pseudo-maximum likelihood method and a minimum distance estimator. See Mayoral (2007) and Beran (1995) for reference.

Utilities

nc v2019.9.16: Provides functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. There is a vignette on capture first match and another on capture all match.

pins v0.2.0: Provides functions that “pin” remote resources into a local cache in order to work offline, improve speed, avoid recomputing, and discover and share resources in local folders, GitHub, Kaggle and RStudio Connect. There is a Getting Started Guide and vignettes on Extending Boards, Using GitHub Boards, Using Kaggle Boards, Using RStudio Connect Boards, Using Website Boards, Using Pins in RStudio, Understanding Boards, and Extending Pins.

queryparser v0.1.1: Provides functions to translate SQL SELECT statements into lists of R expressions.

rawr v0.1.0: Retrieves pure R code from popular R websites, including github, kaggle, datacamp, and R blogs made using blogdown.

Visualization

FunnelPlotR v0.2.1: Implements Spiegelhalter (2005) Funnel plots for reporting standardized ratios, with overdispersion adjustment. The vignette offers examples.

ggBubbles v0.1.4: Implements mini bubble plots to display more information for discrete data than traditional bubble plots do. The vignette provides examples.

gghalves v0.0.1: Implements a ggplot2 extension for easy plotting of half-half geom combinations: think half boxplot and half jitterplot, or half violinplot and half dotplot.




Infrastructure repair with Bolt


Feed: Puppet Blog Feed.
Author: Cas Donoghue

Not to beat on our own drum or anything, but Puppet is a great tool for managing and configuring your entire infrastructure. It allows you to ensure trusted and consistent state across all your nodes and lets you update that configuration with the click of a button to deploy changes to your Puppet codebase.

But what happens when the agent cannot reach the master for these configuration updates?

Puppet versions prior to 5.5.17, 6.4.4, and 6.8.0 had a long-standing bug that prevented the agent from using a proxy when the HTTP_PROXY environment variable was defined. This issue, along with a number of other HTTP proxy issues, was recently fixed, so the agent now correctly respects those variables and settings. However, in some cases, this means that the Puppet agent may attempt to connect to Puppet Server via the previously ignored proxy.

In many environments, an HTTP proxy is configured to only allow connections from internal hosts to external hosts, and it will reject any attempt to “reflect” off of the proxy from an internal host to another internal host. In these environments, Puppet agents may no longer be able to connect to their Puppet Server after upgrading to 5.5.17, 6.4.4, or 6.8.0+. And since the agent can’t get a catalog, you can’t use Puppet to remedy the issue.

You can, however, use Bolt to remedy the issue! Bolt is well suited to solve a problem like this because it does not rely on agents getting a catalog from Puppet Server. That means that we can use it for out-of-band infrastructure repair.

Configuring agents to not use a proxy

If you’re in such an environment and you’ve upgraded Puppet then it’s likely that you’ve lost Puppet control over your agents. If your proxy doesn’t allow internal connections, then agent runs will fail.

To resolve this issue, the agent should be configured to bypass the proxy and connect directly to the Puppet Server by adding the FQDN of the Puppet Server to the NO_PROXY environment variable. This can also be accomplished using the no_proxy setting in the latest releases; however, that setting will be overridden by the HTTP_PROXY environment variable until a fix is released in 6.9.0 (and backported to 5.5.18 & 6.4.5).
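
On a single Linux agent, the manual fix boils down to something like the following sketch, where puppet.example.com is a placeholder for your Puppet Server's FQDN:

# append the Puppet Server FQDN to the global NO_PROXY variable
echo 'NO_PROXY=puppet.example.com' >> /etc/environment

# and set the equivalent puppet setting (available in the latest releases)
puppet config set no_proxy puppet.example.com --section main

Doing that by hand on every agent does not scale, which is exactly where Bolt comes in.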

This guide will show you how to use Bolt to ensure that both the system environment variable NO_PROXY and the no_proxy puppet setting include your Puppet Server’s FQDN. Note that changes to the global NO_PROXY environment variable will affect all child processes that Puppet executes or services that it starts.

Bolt target group setup

Note: We need to run different commands to check and set environment variables for Windows and Linux nodes. To make it easier to do so, we’ll configure target groups for each based on PuppetDB queries. If you already have similar groups configured in your infrastructure, then you can skip this section.

If you don’t already have Bolt running, then you can find instructions for that on the installation page. To configure the target groups, we’ll use the Bolt PuppetDB plugin. Before we can use the plugin we need to configure Bolt to connect to PuppetDB. For this example I will authenticate with PuppetDB using a PE RBAC token.

I obtain a token with puppet-access login -l 0 and an SSL CA cert, and save both to a directory called proxy_patch. In my case, the CA cert of interest was obtained by copying the cert stored at /etc/puppetlabs/puppet/ssl/certs/ca.pem from my Puppet master host to my laptop.
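
As a rough sketch of that preparation (the master hostname is hypothetical, and puppet-access writes its token to ~/.puppetlabs/token by default):

mkdir -p ~/proxy_patch

# log in to PE RBAC and save a token (the -l flag sets its lifetime)
puppet-access login -l 0
cp ~/.puppetlabs/token ~/proxy_patch/token

# copy the CA cert from the Puppet master
scp root@puppet-master.example.com:/etc/puppetlabs/puppet/ssl/certs/ca.pem ~/proxy_patch/ca.pem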

Now I create a Bolt configuration file bolt.yaml with the following configuration:

---
puppetdb:
  server_urls: ["https://ox6m3vjvwj66xlr.delivery.puppetlabs.net:8081"]
  cacert: ~/proxy_patch/ca.pem
  token: ~/proxy_patch/token

Now that I have Bolt configured to connect to PuppetDB, I can write a Bolt inventory file to organize connection information for Puppet agent nodes queried from the database. We will use the version 2 inventory format, which adds support for the PuppetDB plugin.

---
version: 2
groups:
  - name: linux-agents
    targets:
      - _plugin: puppetdb
        query: inventory[certname]{facts.os.family != "windows"}
        uri: facts.networking.hostname
    config:
      transport: ssh
      ssh:
        user: root
        private-key: ~/.ssh/id_rsa

In the example inventory file we configure the PuppetDB plugin to query for targets that are not Windows. For each fact set that is returned, a target uri is set to facts.networking.hostname. It is important to note that the values under the targets section are generated dynamically for each target that matches the query, while the static information about how to connect to those dynamically generated targets lives in the group-level config section. So in this case I set the transport to ssh. The ssh transport is configured to use the ssh login root and a private key stored in my .ssh directory. You can find more information about configuring Bolt transports in the Bolt documentation: https://puppet.com/docs/bolt/latest/bolt_configuration_options.html.
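
Before connecting to anything, you can sanity-check that the plugin resolves targets at all; recent Bolt releases include an inventory subcommand for this (treat the exact output as version-dependent):

$ bolt inventory show --targets linux-agents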

At this point we can verify that we can connect to the agent nodes. We will try running a simple Bolt command to echo the hostname for all the targets generated using the plugin. For example:

$ bolt command run hostname --targets linux-agents
Started on tp5t3a5vq63c0ef...
Started on ox6m3vjvwj66xlr...
Finished on tp5t3a5vq63c0ef:
  STDOUT:
    tp5t3a5vq63c0ef
Finished on ox6m3vjvwj66xlr:
  STDOUT:
    ox6m3vjvwj66xlr
Successful on 2 nodes: tp5t3a5vq63c0ef,ox6m3vjvwj66xlr
Ran on 2 nodes in 0.52 sec

Now that we have connected to our Linux nodes, let’s also add configuration for our Windows nodes.

---
version: 2
groups:
  - name: linux-agents
    targets:
      - _plugin: puppetdb
        query: inventory[certname]{facts.os.family != "windows"}
        uri: facts.networking.hostname
    config:
      transport: ssh
      ssh:
        user: root
        private-key: ~/.ssh/id_rsa
  - name: windows-agents
    targets:
      - _plugin: puppetdb
        query: inventory[certname]{facts.os.family = "windows"}
        uri: facts.networking.hostname
    config:
      transport: winrm
      winrm:
        ca-cert: ~/proxy_patch/ca.pem
        user: Administrator
        password:
          _plugin: prompt
          message: Winrm password please

Notice that in the static configuration for the Windows agents there is another plugin reference. In this case we use the prompt plugin to get the WinRM password from the Bolt operator. Because the prompt plugin is used at the group level, the user will only be prompted once when the windows-agents targets are requested, and that value will be used to authenticate with all targets.

Bolt’s puppet_conf task

Bolt ships with some useful modules for managing infrastructure. We can use the puppet_conf module which contains a task for getting and setting Puppet configuration. We can examine the task information with the following command:

$ bolt task show puppet_conf

puppet_conf - Inspect puppet agent configuration settings

USAGE:
bolt task run --nodes <node-name> puppet_conf action=<value> section=<value> setting=<value> value=<value>

PARAMETERS:
- action: Enum[get, set]
    The operation (get, set) to perform on the configuration setting
- section: Optional[String[1]]
    The section of the config file. Defaults to main
- setting: String[1]
    The name of the config entry to set/get
- value: Optional[String[1]]
    The value you are setting. Only required for set

MODULE:
built-in module

Now that we have two target groups, one for our Windows nodes and one for our Linux nodes, we can use Bolt to check the settings and environment variables for all the nodes in these groups.

First, let’s use a task to check if nodes have the no_proxy setting with the puppet_conf Bolt task. In this example, we’re looking at the Linux nodes, but the task will be the same for Windows nodes:

$ bolt task run puppet_conf action=get setting=no_proxy --targets linux-agents
Started on tp5t3a5vq63c0ef...
Started on ox6m3vjvwj66xlr...
Finished on tp5t3a5vq63c0ef:
  {
    "status": "localhost, 127.0.0.1",
    "setting": "no_proxy",
    "section": "main"
  }
Finished on ox6m3vjvwj66xlr:
  {
    "status": "localhost, 127.0.0.1",
    "setting": "no_proxy",
    "section": "main"
  }
Successful on 2 nodes: tp5t3a5vq63c0ef,ox6m3vjvwj66xlr
Ran on 2 nodes in 2.67 sec

We see that the setting does not include our Puppet Server FQDN.

Similarly we can check if the NO_PROXY environment variable is set. Here we are checking our Windows nodes. To do the same on our Linux nodes, we’d use the command echo $NO_PROXY instead:

$ bolt command run 'Write-Host $env:NO_PROXY' -t windows-agents
Winrm password please:
Started on x4yml978aq0ct77...
Finished on x4yml978aq0ct77:
Successful on 1 node: x4yml978aq0ct77
Ran on 1 node in 1.1 sec

We see that nothing is printed and thus the environment variable is unset.
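
Spelled out for the Linux group, that same check is simply the following, and before the fix it would likewise come back empty on affected nodes:

$ bolt command run 'echo $NO_PROXY' -t linux-agents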

Fixing the issue

Now that we have an idea of what we need to accomplish, it is time to use Bolt’s most powerful capability: the plan. We want to set global environment variables on both Windows and Linux nodes as well as configure Puppet settings.

Plans live in modules, so let’s create a module called proxy_patch under site-modules and create a file called init.pp.
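
A minimal sketch of that layout, run from the same directory that holds bolt.yaml:

mkdir -p site-modules/proxy_patch/plans
touch site-modules/proxy_patch/plans/init.pp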

$ tree
.
├── bolt.yaml
├── ca.pem
├── inventory.yaml
├── Puppetfile
├── site-modules
│   └── proxy_patch
│       └── plans
│           └── init.pp
└── token

3 directories, 6 files

Save the following plan to init.pp:

plan proxy_patch(TargetSpec $nodes, String $no_proxy_fqdn_list){
  # Split targets into windows and linux OS
  $resolved_targets = get_targets($nodes)
  $resolved_targets.apply_prep
  $partition = $resolved_targets.partition |$target| {$target.facts['os']['family'] == 'windows'}
  $windows_targets = $partition[0]
  $nix_targets = $partition[1]

  # Use windows_env module to set global NO_PROXY environment var
  apply($windows_targets) {
    windows_env { 'NO_PROXY':
      ensure    => present,
      mergemode => clobber,
      value     => "${no_proxy_fqdn_list}"
    } ~>
    service { 'puppet':
      ensure => 'running'
    }
  }

  # Use stdlib module to set global NO_PROXY environment var
  apply($nix_targets) {
    file_line { "no_proxy_env_var":
      ensure  => present,
      line    => "NO_PROXY=${no_proxy_fqdn_list}",
      path    => "/etc/environment",
    } ~>
    service { 'puppet':
      ensure => 'running'
    }
  }

  # Add the 'no_proxy' option to puppet conf
  run_task('puppet_conf', $resolved_targets, 'action' => 'set', 'setting' => 'no_proxy', 'value' => $no_proxy_fqdn_list)
}

The plan accepts two parameters, $nodes and $no_proxy_fqdn_list. The $nodes parameter represents the targets we wish to run the plan against. The FQDN list is a comma-separated list containing the FQDN of your Puppet Server. It is important to note that with this implementation the NO_PROXY environment variable will always be replaced with the $no_proxy_fqdn_list argument. You may consider modifying the plan to add some logic to query the current contents of NO_PROXY and append the $no_proxy_fqdn_list argument as you see fit.

The first step of the plan partitions the targets based on operating system. We want to use different modules for managing system environment variables based on target OS. We accomplish environment variable management by applying Puppet manifest code. Specifically, we use the windows_env resource to set the NO_PROXY environment variable on Windows targets and the file_line resource from the stdlib module to manage /etc/environment on our Linux targets. In both cases we notify the Puppet service that environment variables have changed.

Once we have set the environment variable we can use a task from the puppet_conf module to configure the no_proxy Puppet setting. Note that the task is cross-platform, so we do not need different invocations based on target OS!

Before we can run the plan, we need to download the modules puppet-windows_env and puppetlabs-stdlib (the puppet_conf module ships with the Bolt system packages). In order to do that, save the following to a file called Puppetfile in the same directory as bolt.yaml.

mod 'puppet-windows_env', '3.2.0'
mod 'puppetlabs-stdlib', '6.1.0'

We can use Bolt to install those modules with: bolt puppetfile install.

Now that we have the required modules, we can run the plan. We invoke the plan against all targets in our inventory (which are passed to the $nodes plan parameter), supplying the FQDN list for our Puppet Server.

$ bolt plan run proxy_patch no_proxy_fqdn_list='localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140' -t all
Winrm password please:
Starting: plan proxy_patch
Starting: install puppet and gather facts on tp5t3a5vq63c0ef, ox6m3vjvwj66xlr, x4yml978aq0ct77
Finished: install puppet and gather facts with 0 failures in 6.95 sec
Starting: apply catalog on x4yml978aq0ct77
Finished: apply catalog with 0 failures in 8.29 sec
Starting: apply catalog on tp5t3a5vq63c0ef, ox6m3vjvwj66xlr
Finished: apply catalog with 0 failures in 4.69 sec
Starting: task puppet_conf on tp5t3a5vq63c0ef, ox6m3vjvwj66xlr, x4yml978aq0ct77
Finished: task puppet_conf with 0 failures in 5.24 sec
Finished: plan proxy_patch in 25.18 sec
Plan completed successfully with no result


Now let’s verify the environment variables were set as expected for both the Windows and Linux based targets.

$ bolt command run 'Write-Host $env:NO_PROXY' -t windows-agents
Winrm password please:
Started on x4yml978aq0ct77...
Finished on x4yml978aq0ct77:
  STDOUT:
    localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140
Successful on 1 node: x4yml978aq0ct77
Ran on 1 node in 0.99 sec
$ bolt command run 'echo $NO_PROXY' -t linux-agents
Started on ox6m3vjvwj66xlr...
Started on tp5t3a5vq63c0ef...
Finished on ox6m3vjvwj66xlr:
  STDOUT:
    localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140
Finished on tp5t3a5vq63c0ef:
  STDOUT:
    localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140
Successful on 2 nodes: tp5t3a5vq63c0ef,ox6m3vjvwj66xlr
Ran on 2 nodes in 0.54 sec

We have confirmed that the environment variables have been updated with a Bolt command! Now let’s check the Puppet setting with the puppet_conf task:

$ bolt task run puppet_conf action=get setting=no_proxy --targets all
Winrm password please:
Started on tp5t3a5vq63c0ef...
Started on ox6m3vjvwj66xlr...
Started on x4yml978aq0ct77...
Finished on ox6m3vjvwj66xlr:
  {
    "status": "localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140",
    "setting": "no_proxy",
    "section": "main"
  }
Finished on tp5t3a5vq63c0ef:
  {
    "status": "localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140",
    "setting": "no_proxy",
    "section": "main"
  }
Finished on x4yml978aq0ct77:
  {
    "status": "localhost,127.0.0.1,https://q6b6x52w8k8xv1i.delivery.puppetlabs.net:8140",
    "setting": "no_proxy",
    "section": "main"
  }
Successful on 3 nodes: tp5t3a5vq63c0ef,ox6m3vjvwj66xlr,x4yml978aq0ct77
Ran on 3 nodes in 6.8 sec

Now we’ve also confirmed that we have updated the no_proxy settings and can breathe easy knowing our Windows and Linux Puppet agents will not be cut off from communicating with our Puppet Server by attempting to use a proxy connection instead of connecting directly.

Cas Donoghue is a software engineer at Puppet.


Temporal Tables Part 3: Managing Historical Data Growth


Feed: Clustrix Blog.
Author: Alejandro Infanzon.

This is part 3 of a 5-part series. If you want to start from the beginning, see Temporal Tables Part 1: Introduction & Use Case Example.

Up until now, we haven’t crisply defined what is meant by SYSTEM_TIME in the above examples. With the DDL statement above, the time that is recorded is the time that the change arrived at the database server. This suffices for many use cases, but in some cases, particularly when debugging the behavior of queries at specific points in time, it is more important to know when the change was committed to the database. Only at that point does the data become visible to other users of the database. MariaDB can record temporal information based on the commit time by using transaction-precise system versioning. Two extra columns, start_trxid and end_trxid, must be manually declared on the table:

CREATE TABLE purchaseOrderLines(
    purchaseOrderID              INTEGER NOT NULL
  , LineNum                      SMALLINT NOT NULL
  , status                       VARCHAR(20) NOT NULL
  , itemID                       INTEGER NOT NULL
  , supplierID                   INTEGER NOT NULL
  , purchaserID                  INTEGER NOT NULL
  , quantity                     SMALLINT NOT NULL
  , price                        DECIMAL (10,2) NOT NULL
  , discountPercent              DECIMAL (10,2) NOT NULL
  , amount                       DECIMAL (10,2) NOT NULL
  , orderDate                    DATETIME
  , promiseDate                  DATETIME
  , shipDate                     DATETIME
  , start_trxid                  BIGINT UNSIGNED GENERATED ALWAYS AS ROW START
  , end_trxid                    BIGINT UNSIGNED GENERATED ALWAYS AS ROW END
  , PERIOD FOR SYSTEM_TIME(start_trxid, end_trxid) 
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING;

The rows now contain columns that represent the start and end transaction IDs for the change, as recorded in the TRANSACTION_REGISTRY table in the mysql system schema.

Temporal table: example 1

If you need to return the transaction commit time information from your temporal queries, you will need to join with this TRANSACTION_REGISTRY table, returning the commit_timestamp:

SELECT 
    commit_timestamp
  , begin_timestamp
  , purchaseOrderID
  , LineNum
  , status
  , itemID
  , supplierID
  , purchaserID
  , quantity
  , price
  , amount
FROM purchaseOrderLines, mysql.transaction_registry
WHERE start_trxid = transaction_id;

The commit_timestamp shows when the change became visible to all sessions in the database (the most common scenario); return begin_timestamp instead if you care about when the transaction that made the change began.

Temporal table: example 2

Capturing the history of changes to a table does not come without some cost.  As we showed earlier, one insert with three subsequent updates results in 4 rows being stored in the database.

Temporal table: example 3

For smaller tables, or tables that have infrequent changes to their rows, this may not be a problem.  The storage and performance impact of additional rows might be insignificant compared to other activity.  However, high-volume tables with many changes to rows may want to consider techniques for managing the growth of the historical data.

The first option is to disable temporal tracking for specific columns when appropriate. This is accomplished by using the WITHOUT SYSTEM VERSIONING modifier on specific columns:

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID         INTEGER NOT NULL
  , LineNum                 SMALLINT NOT NULL
  , status                  VARCHAR(20) NOT NULL
  , itemID                  INTEGER NOT NULL
  , supplierID              INTEGER NOT NULL
  , purchaserID             INTEGER NOT NULL
  , quantity                SMALLINT NOT NULL
  , price                   DECIMAL(10,2) NOT NULL
  , discountPercent         DECIMAL(10,2) NOT NULL
  , amount                  DECIMAL(10,2) NOT NULL
  , orderDate               DATETIME
  , promiseDate             DATETIME
  , shipDate                DATETIME
  , comments                VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING;

Partitioning is another popular technique for managing the growth of historical data in temporal tables.  The CURRENT keyword is understood by the partitioning logic when used on temporal tables with system versioning.  Isolating the historical versions of the rows into their own partition is as simple as:

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID         INTEGER NOT NULL
  , LineNum                 SMALLINT NOT NULL
  , status                  VARCHAR(20) NOT NULL
  , itemID                  INTEGER NOT NULL
  , supplierID              INTEGER NOT NULL
  , purchaserID             INTEGER NOT NULL
  , quantity                SMALLINT NOT NULL
  , price                   DECIMAL (10,2) NOT NULL
  , discountPercent         DECIMAL (10,2) NOT NULL
  , amount                  DECIMAL (10,2) NOT NULL
  , orderDate               DATETIME
  , promiseDate             DATETIME
  , shipDate                DATETIME
  , comments                VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING
    PARTITION BY SYSTEM_TIME (
        PARTITION p_hist HISTORY
      , PARTITION p_cur CURRENT
);

This technique is especially powerful because partitions will be pruned when executing queries.  Queries that access the current information will quickly skip historical data and only interact with the smaller data and associated indexes on the current partition.
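
As a quick illustrative check (a sketch against the table defined above; the exact plan output will vary), you can ask the optimizer which partitions a query touches. A query without a FOR SYSTEM_TIME clause reads only current rows, so it should be pruned down to the p_cur partition:

EXPLAIN PARTITIONS
SELECT purchaseOrderID, LineNum, status
FROM PurchaseOrderLines
WHERE status = 'OPEN';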

Partitioning becomes an even more powerful tool when combined with interval definitions, dividing historical data into buckets that can then be managed individually.

CREATE TABLE PurchaseOrderLines (
    purchaseOrderID          INTEGER NOT NULL
  , LineNum                  SMALLINT NOT NULL
  , status                   VARCHAR(20) NOT NULL
  , itemID                   INTEGER NOT NULL
  , supplierID               INTEGER NOT NULL
  , purchaserID              INTEGER NOT NULL
  , quantity                 SMALLINT NOT NULL
  , price                    DECIMAL (10,2) NOT NULL
  , discountPercent          DECIMAL (10,2) NOT NULL
  , amount                   DECIMAL (10,2) NOT NULL
  , orderDate                DATETIME
  , promiseDate              DATETIME
  , shipDate                 DATETIME
  , comments                 VARCHAR(2000) WITHOUT SYSTEM VERSIONING
  , PRIMARY KEY (purchaseOrderID, LineNum)
) WITH SYSTEM VERSIONING
    PARTITION BY SYSTEM_TIME INTERVAL 1 WEEK (
        PARTITION p0 HISTORY
      , PARTITION p1 HISTORY
      , PARTITION p2 HISTORY
      ...
      , PARTITION p_cur CURRENT
);

Once a temporal table is partitioned based on intervals, administrators can use the Transportable Tablespaces feature of the InnoDB storage engine and the EXCHANGE PARTITION command syntax to manage table growth.  Copying, dropping, and restoring partitions become simple data definition language (DDL) commands and file system operations, avoiding the performance impact of changing individual rows.
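
A hypothetical sketch of that kind of maintenance, assuming an archive table PurchaseOrderLinesArchive whose definition matches the partition being moved (the usual EXCHANGE PARTITION restrictions apply):

-- move the oldest history bucket into a standalone archive table
ALTER TABLE PurchaseOrderLines
  EXCHANGE PARTITION p0 WITH TABLE PurchaseOrderLinesArchive;

-- or simply discard a history bucket that is no longer needed
ALTER TABLE PurchaseOrderLines DROP PARTITION p0;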

Continue to Temporal Tables Part 4: Application Time to learn more.

Julian Markwort: Introduction and How-To: etcd clusters for Patroni


Feed: Planet PostgreSQL.

etcd is one of several solutions to a problem that is faced by many programs that run in a distributed fashion on a set of hosts, each of which may fail or need rebooting at any moment.

One such program is Patroni; I’ve already written an introduction to it as well as a guide on how to set up a highly-available PostgreSQL cluster using Patroni.

In that guide, I briefly touched on the reason why Patroni needs a tool like etcd.
Here’s a quick recap:

  • Each Patroni instance monitors the health data of a PostgreSQL instance.
  • The health data needs to be stored somewhere where all other Patroni instances can access it.
  • Based on this data, each Patroni instance decides what actions have to be taken to keep the cluster as a whole healthy.
  • A Patroni instance may decide that it needs to promote its PostgreSQL instance to become a primary, because it registered that there is currently no primary.
  • That Patroni instance needs to be sure that while it attempts to promote the database, no other Patroni instances can try to do the same. This process is called the “leader-race”, where the proverbial finish line consists of acquiring the “leader” lock. Once a Patroni instance has acquired this lock, the others cannot acquire it unless the new leader gives it up, or fails to extend its time to live.

The challenge now lies in providing a mechanism that makes sure that only a single Patroni instance can be successful in acquiring said lock.

In conventional (non-distributed) computing systems, this condition would be guarded by a device which enables mutual exclusion, also known as a mutex. A mutex is a software solution that helps make sure that a given variable can only be manipulated by a single program at any given time.

For distributed systems, implementing such a mutex is more challenging:

The programs that contend for the variable need to send their request for change, and then a decision has to be made somewhere as to whether this request can be accepted or not. Depending upon the outcome of this decision, the request is answered by a response indicating “success” or “failure”. However, because any of the hosts in your cluster may become unavailable, it would be ill-advised to make this decision-making mechanism a centralized one.

Instead, a tool is needed that provides a distributed decision-making mechanism. Ideally, this tool could also take care of storing the variables that your distributed programs try to change. Such a tool, in general, is called a Distributed Consensus Store (DCS), and it makes sure to provide the needed isolation and atomicity required for changing the variables that it guards in mutual exclusion.

An example of one such tool is etcd, but there are others: consul, cockroach, with probably more to come. Several of them (etcd, consul) base their distributed decisions on the RAFT protocol, which includes a concept of leader election. As long as the members of the RAFT cluster can decide on a leader by voting, the DCS is able to function properly and accept changes to the data, as the leader of the current timeline is the one who decides if a request to change a variable can be accepted or not.
A good explanation and visual example of the RAFT protocol can be found here.

In etcd and consul, requests to change the keys and values can be formed to include conditions, like “only change this variable if it wasn’t set to anything before” or “only change this variable if it was set to 42 before”.

Patroni uses these requests to make sure that it only sets the leader-lock if it is not currently set, and it also uses it to extend the leader-lock, but only if the leader-lock matches its own member name. Patroni waits for the response by the DCS and then continues its work. For example, if Patroni fails to acquire the leader-lock, it will not promote or initialize the PostgreSQL instance.
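
With etcd's v2 HTTP API (the same API used by the curl calls later in this post), such conditional requests look roughly like the sketch below; the key path and member name are made up for illustration:

# succeed only if the key does not yet exist (the leader race)
curl -s -X PUT 'http://10.88.0.41:2379/v2/keys/demo/leader?prevExist=false' -d value=member-1 -d ttl=30

# refresh the lock only if we still own it
curl -s -X PUT 'http://10.88.0.41:2379/v2/keys/demo/leader?prevValue=member-1' -d value=member-1 -d ttl=30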

On the other hand, should Patroni fail to renew the time to live that is associated with the leader-lock, it will stop the PostgreSQL database in order to demote it, because otherwise, it cannot guarantee that no other member of the cluster will attempt to become a leader after the leader key expired.

One of the steps I described in the last blog post – the step to deploy a Patroni cluster – sets up an etcd cluster:

etcd > etcd_logfile 2>&1 &

However, that etcd cluster consists only of a single member.

Problems with single-member etcd clusters

In this scenario, as long as that single member is alive, it can elect itself to be the leader and thus can accept change requests. That’s how our first Patroni cluster was able to function atop this single etcd member.

But if this etcd instance were to crash, it would mean total loss of the DCS. Shortly after the etcd crash, the Patroni instance controlling the PostgreSQL primary would have to stop said primary, as it would be impossible to extend the time to live of the leader key.

To protect against such scenarios, wherein a single etcd failure or planned maintenance might disrupt our much desired highly-available PostgreSQL cluster, we need to introduce more members to the etcd cluster.

Three member etcd clusters

The number of cluster members that you choose to deploy is the most important consideration for deploying any DCS. There are two primary aspects to consider:

How likely is it that an etcd member fails or is offline for maintenance?

Usually, etcd members don’t fail just by chance. So if you take care to avoid load spikes and storage or memory issues, you’ll want to consider at least two cluster members, so you can shut one down for maintenance.

How can a majority be found in the voting process?

By design, RAFT clusters require an absolute majority for a leader to be elected. This absolute majority depends upon the number of total etcd cluster members, not only those that are available to vote. This means that the number of votes required for a majority does not change when a member becomes unavailable. The number I just told you, two, was only a lower boundary for system stability and maintenance considerations.

In reality, a two-member DCS will not work, because once you shut down one member, the other member will fail to get a majority vote for its leader candidacy, as 1 out of 2 is not more than 50%. If we want to be able to selectively shut members down for maintenance, we will have to introduce a third node, which can then take part in the leader election process. Then there will always be two voters, out of the total three members, to choose a leader; two-thirds is more than 50%, so the demands for absolute majority are met.

simple etcd cluster for Patroni

Placement of etcd cluster members in different data centers

The second most important consideration for etcd cluster deployments is the placement of the etcd nodes. Due to the aforementioned leader key expiry and the need for Patroni to refresh it at the beginning of each loop, and the fact that failure to achieve this will result in a forced stop of the leader, the placement of your etcd cluster members indirectly dictates how Patroni will react to network partitions and to the failure of the minority of nodes.

For starters, the simplest setup resides in only one data center (“completely biased cluster member placement”), so all Patroni and etcd members are not influenced by issues that cut off network access to the outside. At the same time, such a setup deals with the loss of a minority of etcd members in a simple way: it does not care – so long as there is still a majority of etcd members left that can talk to each other and to whom Patroni can still talk.

completely biased cluster member placement

But if you’d prefer a cross-data center setup, where the replicating databases are located in different data centers, etcd member placement becomes critical.

If you’ve followed along, you will agree that placing all etcd members in your primary data center (“completely biased cluster member placement”) is not wise. Should your primary data center be cut off from the outside world, there will be no etcd left for the Patroni members in the secondary data center to talk to, so even if they were in a healthy state, they could not promote. An alternative would be to place the majority of members in the first data center and a minority in the secondary one (“biased cluster member placement”).

Additionally, if one etcd member in your primary data center is stopped – leaving your first data center without a majority of etcd members – and the secondary data center becomes unavailable as well, your database leader in the first data center will be stopped.

biased cluster member placement

You could certainly mitigate this corner case manually by placing one etcd member in a tertiary position (“tertiary decider placement”), outside of both your first and second data centers, with a connection to each of them. The two data centers should then contain an equal number of etcd members.

This way, even if either data center becomes unavailable, the remaining data center, together with the tertiary etcd member can still come to a consensus.
Placing one etcd cluster member in a tertiary location can also have the added benefit that its perception of networking partitions might be closer to the way that your customers and applications perceive network partitions.

Placement of a tertiary cluster member as a decider

tertiary decider placement

In the biased placement strategies mentioned above, consensus may be reached within a data center that is completely cut off from everything else. Placing one member in the tertiary position means that consensus can only be reached with the members in the data center that still have an intact uplink.

To increase the robustness of such a cluster even more, we can even place more than one node in a tertiary position, to mitigate against failures in a single tertiary member and network issues that could disconnect a single tertiary member.

You see, there are quite a lot of things to consider if you want to create a really robust etcd cluster. The above-mentioned examples are not an exhaustive list, and several other factors could influence your member placement needs. However, our experience has shown that etcd members seldom fail (unless there are disk latency spikes or CPU thrashing) and that most customers want a biased solution anyway, as the primary data center is the preferred one. Usually, in biased setups, it is possible to add a new member to the secondary data center and exclude one of the members from the primary data center, in order to move the bias to the second data center. In this way, you can keep running your database even if the first data center becomes completely inaccessible.

With the placement considerations out of the way, let’s look at how to create a simple etcd server. For demonstration purposes, we will constrain this setup to three hosts, 10.88.0.41, 10.88.0.42, 10.88.0.43.

There are a couple of different methods to configure and start (“bootstrap”) etcd clusters; I will outline two of them:

  • The static bootstrap method requires knowledge of the network addresses of all cluster members. All addresses are then listed explicitly in the configuration file of each cluster member. This approach is great for learning and also easy to troubleshoot, as you can always look at the configuration file to see which hosts belong to the cluster.
  • The discovery bootstrap method is a little more implicit and works better with setups where addresses may change regularly or where there may not even be real addresses available, for example: in container clusters where all containers can be reached using the same address, but via different ports.

Some settings need to be written into the etcd config regardless of the setup method.

Since etcd uses two different communication channels – one for peer communication with other etcd cluster members, another for client communication with users and applications – some of the configuration parameters may look almost identical, but they are nevertheless essential.

Each etcd member needs the following:

  • name: A unique name within the etcd cluster.
  • listen-peer-urls: A URL which states where etcd should listen for requests made by its peers. Usually, this is an IP or hostname reachable from the other etcd members’ hosts.
  • listen-client-urls: A URL which states where etcd should listen for requests made by clients. Usually, this is an IP or hostname reachable from the hosts where users and applications run. If you also want to allow local connections by users and applications, you can add a URL based on the local IP address.
  • initial-advertise-peer-urls: A URL which states which address other members of the cluster should use to connect to this etcd cluster member.
  • advertise-client-urls: A URL which states which address clients should use if they want to connect to this etcd cluster member. Should only be an address reachable from the network, not a local one, even if this member listens to local connections.

The above parameters are required for each etcd member and they should be different for all etcd members, as they should all have different addresses — or, at least, different ports.

Setting up an etcd cluster using static bootstrap

For the static bootstrap method, all etcd members additionally need the following parameters:

  • initial-cluster-token: A token that is unique to your cluster. Simply generate some random characters.
  • initial-cluster: By far the most important component required for a successful etcd bootstrap. This is a comma-separated string that lists all cluster members and their advertised peer-urls. Example: ‘centos_test_1=http://10.88.0.41:2380,centos_test_2=http://10.88.0.42:2380,centos_test_3=http://10.88.0.43:2380′

Now, if you start an instance of etcd on each of your nodes with fitting configuration files, they will try to talk to each other and – provided your configuration is correct and no firewall rules are blocking traffic – bootstrap the etcd cluster.
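
To make that concrete, here is a minimal sketch of what the configuration file for the first member might look like; the cluster token is a made-up value, and the full files used for this post are in the archive mentioned below:

name: centos_test_1
listen-peer-urls: 'http://10.88.0.41:2380'
listen-client-urls: 'http://10.88.0.41:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'http://10.88.0.41:2380'
advertise-client-urls: 'http://10.88.0.41:2379'
initial-cluster-token: 'patroni-demo-cluster'
initial-cluster: 'centos_test_1=http://10.88.0.41:2380,centos_test_2=http://10.88.0.42:2380,centos_test_3=http://10.88.0.43:2380'
# not listed above, but tells etcd that this is a brand-new cluster rather than one being joined
initial-cluster-state: 'new'

The other two members use the same initial-cluster and initial-cluster-token values with their own name and URLs, and each can be started with etcd --config-file pointing at its file.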

As this blog post is already on the rather lengthy side, I have created an archive for you to download that contains logs that show the configuration files used as well as the output of the etcd processes.

Setting up an etcd cluster using discovery bootstrap

Another bootstrap approach, as I mentioned earlier, relies on an existing etcd cluster for discovery. You don’t necessarily have to have a real etcd cluster, though. Keep in mind that a single etcd process already acts as a cluster, albeit without high availability. Alternatively, you can use the public discovery service located at discovery.etcd.io.

For the purpose of this guide, I will create a temporary local etcd cluster that can be reached by all members-to-be:

bash # etcd --name bootstrapper --listen-client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://0.0.0.0:2379

We don’t need any of the peer parameters here as this bootstrapper is not expected to have any peers.

A special directory and a key specifying the expected number of members of the new cluster both need to be created.
We will generate a unique discovery token for this new cluster:

bash # UUID=$(uuidgen)
bash # echo $UUID
860a192e-59ae-4a1a-a73c-8fee7fe403f9

The following call to curl creates the special size key, which implicitly creates the directory for cluster bootstrap in the bootstrapper’s key-value store:

bash # curl -X PUT http://10.88.0.1:2379/v2/keys/_etcd/registry/${UUID}/_config/size -d value=3

The three etcd members-to-be now only need (besides the five basic parameters listed earlier) to know where to reach this discovery service, so the following line is added to each member’s configuration:

discovery: 'http://10.88.0.1:2379/v2/keys/_etcd/registry/860a192e-59ae-4a1a-a73c-8fee7fe403f9/'

Now, when the etcd instances are launched, they will register themselves in the directory that we created earlier. As soon as the number of members specified in _config/size have gathered there, they will bootstrap a new cluster on their own.
At this point, you can safely terminate the bootstrapper etcd instance.

The output of the discovery bootstrap method along with the config files can also be found in the archive.

Checking etcd healthiness

To check whether the bootstrap was successful, you can call the etcdctl cluster-health command.

bash # etcdctl cluster-health
member 919153442f157adf is healthy: got healthy result from http://10.88.0.41:2379
member 939c8672c1e24745 is healthy: got healthy result from http://10.88.0.42:2379
member c38cd15213ffca05 is healthy: got healthy result from http://10.88.0.43:2379
cluster is healthy

If you want to run this command somewhere other than on the hosts that contain the etcd members, you will have to specify the endpoints to which etcdctl should talk directly:

bash # etcdctl cluster-health --endpoints 'http://10.88.0.41:2379,http://10.88.0.42:2379,http://10.88.0.43:2379'

 

Running etcd as a Service

Usually, you’ll want to run etcd as some sort of daemon that is started whenever your server is started, to protect against intermittent failures and negligence after planned maintenance.

If you’ve installed etcd via your operating system’s package manager, a service file will already have been installed.
The service file is built in such a way that all of the necessary configuration parameters are loaded via environment variables. These can be set in the /etc/etcd/etcd.conf file and have a slightly different notation compared to the parameters that we’ve placed in YAML files in the examples above.
To convert the YAML configuration, you simply need to convert all parameter names to upper case, change dashes (-) to underscores (_), and prefix them with ETCD_.
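
For example, the YAML settings from the static bootstrap example above would end up in /etc/etcd/etcd.conf looking roughly like this (the packaged service files expect the ETCD_ prefix):

ETCD_NAME="centos_test_1"
ETCD_LISTEN_PEER_URLS="http://10.88.0.41:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.88.0.41:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.88.0.41:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.88.0.41:2379"
ETCD_INITIAL_CLUSTER_TOKEN="patroni-demo-cluster"
ETCD_INITIAL_CLUSTER="centos_test_1=http://10.88.0.41:2380,centos_test_2=http://10.88.0.42:2380,centos_test_3=http://10.88.0.43:2380"
ETCD_INITIAL_CLUSTER_STATE="new"

After that, something like systemctl enable --now etcd (the exact unit name may vary by distribution) brings the member up on every boot.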

Caveats

If you try to follow this guide on Ubuntu or Debian and install etcd via apt, you will run into issues.
This is because anything that resembles a server is automatically started as a Systemd service once installation has completed. You need to stop this instance, otherwise you won’t be able to run your own etcd cluster members on ports 2379 and 2380.
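
Assuming the packaged unit is simply called etcd, something like this clears the way:

# stop the automatically started instance and keep it from coming back on reboot
systemctl stop etcd
systemctl disable etcd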

With the recent update to etcd 3.4, the v2 API of etcd is disabled by default. Since the etcd v3 API is currently not usable with Patroni (due to missing support for multiple etcd endpoints in the library, see this pull request), you’ll need to manually re-enable support for the v2 API by adding enable-v2: true to your config file.

Bootstrapping an etcd cluster can be quite difficult if you go into it blindfolded. However, once the key concepts of etcd clusters are understood and you’ve learned what exactly needs to go into the configuration files, you can bootstrap an etcd cluster quickly and easily.

While there are lots of things to consider for member placement in more complex cross-data center setups, a simple three node cluster is probably fine for any testing environment and for setups which only span a single data center.

Do keep in mind that the cluster setups I demonstrated were stripped of any security considerations for ease of playing around. I highly recommend that you look into the different options for securing etcd. You should at least enable role name and password authentication, and server certificates are probably a good idea to encrypt traffic if your network might be susceptible to eavesdropping attacks. For even more security, you can add client certificates as well. Of course, Patroni works well with all three of these security mechanisms in any combination.
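
As a starting point, enabling password authentication is only a couple of etcdctl calls; this is a sketch (you will be prompted for the root password, and equivalent commands exist in both the v2 and v3 flavours of etcdctl):

bash # etcdctl --endpoints 'http://10.88.0.41:2379' user add root
bash # etcdctl --endpoints 'http://10.88.0.41:2379' auth enable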

This post is part of a series.
Besides this post the following articles have already been published:
PostgreSQL High-Availability and Patroni – an Introduction.
Patroni: Setting up a highly available PostgreSQL cluster

The series will also cover:
– configuration and troubleshooting
– failover, maintenance, and monitoring
– client connection handling and routing
– WAL archiving and database backups using pgBackrest
– PITR of a Patroni cluster using pgBackrest
