Channel: partitions – Cloud Data Architect

Azure SDK August 2019 preview and a dive into consistency


Feed: Microsoft Azure Blog.
Author: Maggie Pint.

The second previews of Azure SDKs which follow the latest Azure API Guidelines and Patterns are now available (.Net, Java, JavaScript, Python). These previews contain bug fixes, new features, and additional work towards guidelines adherence.

What’s New

The SDKs have many new features, bug fixes, and improvements. Some of the new features are below, but please read the release notes linked above and changelogs for details.

  • Storage Libraries for Java now include Files and Queues support.
  • Storage Libraries for Python have added Async versions of the APIs for Files, Queues, and Blobs.
  • Event Hubs libraries across languages have expanded support for sending multiple messages in a single call by adding the ability to create a batch, which avoids the error scenario where a call exceeds size limits and gives batch-size control to developers with bandwidth concerns.
  • Event Hubs libraries across languages have introduced a new model for consuming events via the EventProcessor class which simplifies the process of checkpointing today and will handle load balancing across partitions in upcoming previews.

Diving deeper into the guidelines: consistency

These Azure SDKs represent a cross-organizational effort to provide an ergonomic experience to every developer on every platform, and, as mentioned in the previous blog post, developer feedback helped define the following set of principles:

  • Idiomatic
  • Consistent
  • Approachable
  • Diagnosable
  • Compatible

Today we will deep dive into consistency.

Consistent

Feedback from developers and user studies has shown that consistent APIs are generally easier to learn and remember. To guide the Azure SDKs toward consistency, the guidelines contain the consistency principle:

  • Client libraries should be consistent within the language, consistent with the service and consistent between all target languages. In cases of conflict, consistency within the language is the highest priority and consistency between all target languages is the lowest priority.
  • Service-agnostic concepts such as logging, HTTP communication, and error handling should be consistent. The developer should not have to relearn service-agnostic concepts as they move between client libraries.
  • Consistency of terminology between the client library and the service is a good thing that aids in diagnosability.
  • All differences between the service and client library must have a good, articulated reason for existing, rooted in idiomatic usage.
  • The Azure SDK for each target language feels like a single product developed by a single team.
  • There should be feature parity across target languages. This is more important than feature parity with the service.

Let’s look closer at the second bullet point, “Service-agnostic concepts such as logging, HTTP communication, and error handling should be consistent.” Developers pointed out APIs that worked nicely on their own, but weren’t always perfectly consistent with each other. For example:

Blob storage used a skip/take style of paging, while returning a sync iterator as the result set:

let marker = undefined;
do {
  const listBlobsResponse = await containerURL.listBlobFlatSegment(
    Aborter.none,
    marker
  );

  marker = listBlobsResponse.nextMarker;
  for (const blob of listBlobsResponse.segment.blobItems) {
    console.log(`Blob: ${blob.name}`);
  }
} while (marker);

Cosmos used an async iterator to return results:

for await (const results of this.container.items.query(querySpec).getAsyncIterator()) {
  console.log(results.result);
}

Event Hubs used a ‘take’ style call that returned an array of results of a specified size:

const myEvents = await client.receiveBatch("my-partitionId", 10);

Developers using all three of these services together indicated that they had to work harder to remember each pattern, or refresh their memory by reviewing code samples.

The Consistency SDK Guideline

The JavaScript guidelines specify how to handle this situation in the section Modern and Idiomatic JavaScript:

☑ YOU SHOULD use async functions for implementing asynchronous library APIs.

If you need to support ES5 and are concerned with library size, use async when combining asynchronous code with control flow constructs, and use promises for simpler code flows; async adds code bloat when transpiled (especially when targeting ES5).

☑ DO use Iterators and Async Iterators for sequences and streams of all sorts.

Both iterators and async iterators are built into JavaScript and easy to consume. Other streaming interfaces (such as node streams) may be used where appropriate as long as they’re idiomatic.

In a nutshell, it says that when an asynchronous call returns a sequence (that is, a list), async iterators are preferred.

In practice, this is how that principle is applied in the latest Azure SDK Libraries for Storage, Cosmos, and Event Hubs.

Storage, using an async iterator to list blobs:
for await (const blob of containerClient.listBlobsFlat()) {
  console.log(`Blob: ${blob.name}`);
}

Cosmos, still using async iterators to list items:
for await (const resources of resources.container.items
    .readAll({ maxItemCount: 20 })
    .getAsyncIterator()) {
  console.log(resources.doc.id);
}

Event Hubs – now using an async iterator to process events:
for await (const events of consumer.getEventIterator()) {
  console.log(`${events}`);
}

As you can see, a service-agnostic concept—in this case paging—has been standardized across all three services.

Feedback

If you have feedback on consistency or think you’ve found a bug after trying the August 2019 Preview (.Net, Java, JavaScript, Python), please file an Issue or pull request on GitHub (guidelines, .Net, Java, JavaScript, Python), or reach out to @AzureSDK on Twitter. We welcome contributions to these guidelines and libraries!


Improved MySQL Query Performance With InnoDB Multi Value Indexes


Feed: Planet MySQL
Author: Dave Stokes

Multi-Valued Indexes are going to change the way you think about using JSON data and the way you architect your data. Before MySQL 8.0.17 you could store data in JSON arrays but trying to search on that data in those embedded arrays was tricky and usually required a full table scan.  But now it is easy and very quick to search and to access the data in JSON arrays.

Multi-Valued Indexes

A Multi-Valued Index (MVI) is a secondary index defined on a column made up of an array of values. We are all used to traditional indexes where you have one value per index entry, a 1:1 ratio. An MVI can have multiple index entries for each data record, so you can have multiple postal codes, phone numbers, or other attributes from one JSON document indexed for quick access. See Multi-Valued Indexes for details.

For a very simple example, we will create a table. Note the casting of the $.nbr key/values as an unsigned array.

mysql> CREATE TABLE s (id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -> name CHAR(20) NOT NULL,
    -> j JSON,
    -> INDEX nbrs( (CAST(j->'$.nbr' AS UNSIGNED ARRAY)))
    -> );
Query OK, 0 rows affected (0.11 sec)

Then add in some data. The goal is to have a set of multiple values available under the ‘nbr’ key where each number in the array represents some enumerated attribute.
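The insert statements themselves are not shown in this excerpt, but rows matching the result set below could be added with something like:

INSERT INTO s (name, j) VALUES
  ('Moe',   '{"nbr": [1, 7, 45]}'),
  ('Larry', '{"nbr": [2, 7, 55]}'),
  ('Curly', '{"nbr": [5, 8, 45]}'),
  ('Shemp', '{"nbr": [3, 6, 51]}');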
mysql> SELECT * FROM s;
+----+-------+---------------------+
| id | name  | j                   |
+----+-------+---------------------+
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
|  3 | Curly | {"nbr": [5, 8, 45]} |
|  4 | Shemp | {"nbr": [3, 6, 51]} |
+----+-------+---------------------+
4 rows in set (0.00 sec)
So we want to search on one of the values in the 'nbr' arrays. Before 8.0.17, you could probably manage with very elaborate JSON_CONTAINS() or JSON_EXTRACT() calls that have to handle multiple positions in that array. But with MySQL 8.0.17 you can check whether a desired value is a member of the array very easily, and there is a new function, MEMBER OF(), that can take advantage of MVIs.
mysql> SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr");
+----+-------+---------------------+
| id | name  | j                   |
+----+-------+---------------------+
|  1 | Moe   | {"nbr": [1, 7, 45]} |
|  2 | Larry | {"nbr": [2, 7, 55]} |
+----+-------+---------------------+
2 rows in set (0.00 sec)
So we had two records with the number 7 in the array. Think about how many times you have multiple uses of things like postcodes, phone numbers, credit cards, or email addresses tied to a master record. Now you can keep all that within one JSON document and not have to make multiple dives into the data to retrieve that information. Imagine you have a 'build sheet' for a complex product, say a car, and you want to be able to quickly find the ones with certain attributes (GPS, tinted windows, and red leather seats). An MVI gives you a way to quickly and efficiently search for these attributes.

And for those curious about the query plan:

mysql> EXPLAIN SELECT * FROM s WHERE 7 MEMBER OF (j->"$.nbr")\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
   partitions: NULL
         type: ref
possible_keys: nbrs
          key: nbrs
      key_len: 9
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)
And yes, the optimizer handles the new indexes easily. There are some implementation notes at the end of this blog entry that you will want to familiarize yourself with to make sure you know all the fine points of using MVIs.

A Bigger Example

Let's create a table with one million rows with randomly created data inside a JSON array. Let us use a very simple table with a primary key and a JSON column that will supply the JSON array for the secondary index.

mysql> desc a1;
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| data  | json             | YES  |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

I wrote a quick PHP script to generate data on STDOUT to a temporary file, and that temporary file was fed in using the MySQL source command. It is my personal preference to load data this way, and probably a bit of a personality quirk, but it does allow me to truncate or drop table definitions and re-use the same data.




<?php
for ($x = 1; $x < 1000000; $x++) {
    $i = rand(1, 10000000);
    $j = rand(1, 10000000);
    $k = rand(1, 10000000);
    echo "INSERT into a1 (id,data) VALUES (NULL,'{\"nbr\":[$i,$j,$k]}');\n";
}
?>


An example line from the file looks like this:

INSERT into a1 (id,data) VALUES (NULL,'{"nbr":[8526189,5951170,68]}');


The entries in the array should have a pretty large cardinality, with values ranging between 1 and 10,000,000, especially considering there are only 1,000,000 rows.

Array subscripts in JSON start with a 0 (zero). And remember, for future reference, that the way to get to the third item in the array would be SELECT data->>"$.nbr[2]". And if we wanted to check $.nbr[0] to $.nbr[N], we would have to explicitly check each one. Not pretty, and expensive to perform.
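For example, a pre-8.0.17-style search across the first three positions might look like the hypothetical query below, which is verbose and cannot use an index on the array values:

-- Checking each array position explicitly: verbose and forces a full table scan
SELECT id, data->>"$.nbr"
FROM a1
WHERE data->>"$.nbr[0]" = 99999
   OR data->>"$.nbr[1]" = 99999
   OR data->>"$.nbr[2]" = 99999;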

My benchmark system is an older laptop with an i5 processor and 8GB of RAM, filled with Ubuntu goodness. So hopefully this is a worst-case scenario for hardware, as nobody would run such old and slow gear in production, right (nobody runs gear slower than me, wink-wink nudge-nudge)? The reason for using such an antiquated system is that comparisons would (or should) show similar gains on a percentage basis.

So let us start by looking for $.nbr[0] = 99999. I added one record with all three elements in the array set to five nines to make for a simple example.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9718585
     filtered: 100
        Extra: Using where
1 row in set, 1 warning (0.0004 sec)
Note (code 1003): /* select#1 */ select `test`.`a1`.`id` AS `id`,json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr')) AS `data->>"$.nbr"` from `test`.`a1` where (json_unquote(json_extract(`test`.`a1`.`data`,'$.nbr[0]')) = 99999)

And there are no indexes available to be used and it is a full table scan, as indicated in the type: ALL above.  The query runs in about 0.61 seconds.

In the previous example we created the index with the table, but this time it is created after the table. I could have used ALTER TABLE too.

CREATE INDEX data__nbr_idx ON a1( (CAST(data->'$.nbr' AS UNSIGNED ARRAY)) );

So the first trial query:

SELECT id, data->>"$.nbr"
FROM a1
WHERE data->>"$.nbr[2]" = 99999;

We have to pick a specific entry in the array, as we cannot search each item of the array (at least until we can use MVIs). The query runs in about 0.62 seconds, a fraction slower but close enough for me to say they are the same time. And EXPLAIN shows this is a full table scan that does not take advantage of the index just created. So how do we access this new index and take advantage of the MVIs?

New Functions To The Rescue

There are new functions that can take advantage of MVIs when used in the WHERE clause of a query against InnoDB tables. One of those functions is MEMBER OF().

SELECT id, data->>"$.nbr"
FROM a1
WHERE 99999 MEMBER OF (data->"$.nbr");

This query runs in 0.001 seconds, which is much faster than the previous time of 0.61 seconds! And we are searching all the data in the array, not just one slot in the array. So if we do not know whether the data we want is in $.nbr[0] or $.nbr[N], we can search all of the array entries easily. So we are actually looking at more data, and at a much faster rate.

We can also use JSON_CONTAINS() and JSON_OVERLAPS(); see Three New JSON Functions in MySQL 8.0.17 for details. These three functions are designed to take full advantage of Multi-Valued Indexes.

SELECT id, data->>"$.nbr"
FROM a1
WHERE JSON_CONTAINS(data->'$.nbr', CAST('[99999,99999]' AS JSON));

+---------+-----------------------+
| id      | data->>"$.nbr"        |
+---------+-----------------------+
| 1000000 | [99999, 99999, 99999] |
+---------+-----------------------+
1 row in set (0.0013 sec)

SELECT id, data->>"$.nbr"
FROM a1
WHERE JSON_OVERLAPS(data->'$.nbr', CAST('[99999,99999]' AS JSON));

+---------+-----------------------+
| id      | data->>"$.nbr"        |
+---------+-----------------------+
| 1000000 | [99999, 99999, 99999] |
+---------+-----------------------+
1 row in set (0.0012 sec)

Fine Points

  • You can create MVIs with CREATE TABLE, ALTER TABLE, or CREATE INDEX statements, just like any other index. The values are cast as a same-type scalar in a SQL array; a virtual column is transparently generated with all the values of the array, and then a functional index is created on the virtual column.
  • Only one MVI can be used in a composite index.
  • You can use MEMBER OF(), JSON_CONTAINS(), or JSON_OVERLAPS() in the WHERE clause to take advantage of MVIs. But once again, you can use those three functions on non-MVI JSON data too.
  • DML for MVIs works like other DML for indexes, but you may have more than one insert/update for a single clustered index record.
  • Empty arrays are not added to the index, so do not try to search for empty values via the index.
  • MVIs do not support ordering of values, so do not use them for primary keys! And no ASC or DESC either!!
  • You are limited to 644,335 keys and 10,000 bytes by InnoDB for a single record. The limit is the size of a single InnoDB undo log page, so you should get up to 1,250 integer values.
  • MVIs can not be used in a foreign key specification.

And check the cardinality of your data. Having a very narrow range of numbers indexed will not really gain extra performance.

IBM Db2 for z/OS Useful Features


Feed: Databasejournal.com – Feature Database Articles.
Author: .

Since late 2016, Db2 V12 for the z/OS platform has been generally available. Along with many new features, V12 was the last version on the z/OS platform to be delivered as a complete software upgrade. Future features will be delivered using a new system called continuous delivery, where subsets of features are delivered as “function levels”. Sites may then choose which features and functions they wish to implement.

This article reviews some of the more useful features of Db2 and SQL that have been delivered as new functionality in the latest releases.

High Speed Data Load

Analytics processing against a big data solution, a data warehouse, or a combination of data sources requires loading lots and lots of data quickly. Depending upon the methods used, it is possible that data are inconsistent until loading is complete. Consider a day’s worth of online sales data that contains new orders from new customers. If you load your big data application with the new orders first, their corresponding customers will not exist; load the customers first, and analytics for those customers will show no orders!

In general, massive data loads can encounter bottlenecks if done serially. Some issues include network capacity, disk storage capacity, hot spots in data and index pages containing concurrent updates, and logging volume. In the past, the DBA has attempted to alleviate these bottlenecks with creative database object designs, including data partitioning schemes, special index clustering, embedding significant free space (i.e., empty space) in objects, and so forth.

In Db2 V12, there is a new method and algorithm to support heavy inserts. This algorithm detects when a mass insert operation is happening that involves non-clustering indexes and manages multiple threads that execute concurrently.

Coupled with this new algorithm, IBM also includes a DRDA Fast Load software product that enhances high-speed data loading from distributed clients. The DBA no longer needs to import data from external sources, convert to a local encoding scheme, re-format and clean it, and then execute a data load utility. The DRDA Fast Load product does all of this directly from a remote process. Data transformations can be accomplished using SQL.

Enhancing Analytics Even Further

With business analytics (BI) becoming more and more common, it is only natural that companies have implemented BI software that accesses a big data application. On the z/OS platform, IBM’s solution is the IBM Db2 Analytics Accelerator (IDAA). (This formerly stand-alone hardware unit is now available in a version that is embedded in the zSystems hardware chassis.)

With the IDAA as your main big data application, mass-insert operations once again become a concern. Luckily, there are several Db2 V12 features that can address these concerns.

First, Db2 allows any table to exist in any of three states:

  1. only in the Db2 database;
  2. only in the IDAA; and,
  3. existing in both.

Regardless of configuration, the Db2 Optimizer will intercept incoming SQL queries and determine which of the table(s) will be used to generate results. For a typical operational query, the Optimizer should reference the in-Db2 table; alternatively, for an analytic query the best choice will probably be the table in the IDAA. This is because operational queries usually reference only a few tables and return a small set of rows. In contrast, analytic queries can mention tens of tables, and sometimes require accessing millions (or billions) of rows in order to return the desired result.

Enter a new Db2 feature: high-speed multi-row insert for IDAA tables. If the table exists only in the IDAA, this feature provides a method of getting lots of data into IDAA-based tables quickly while avoiding the locking and potential outage of a Load utility.

Ultra-large Tables

Long ago, tables were considered large if they contained millions of rows. Later, billions of rows became typical, with total storage sizes in the tens of gigabytes. Today, tables containing hundreds of gigabytes of data are becoming more common, especially with the advent of big data, and terabyte-sized tables are on the horizon.

In versions of Db2 before version 12, the practical limit to the size of a table was 256 gigabytes per partition in a 64-partition tablespace, for a total of 16,000 gigabytes (16 terabytes). Partition sizes and number of partitions per tablespace were not easily changeable. If the DBA mis-sized the tablespace, it might have been possible to alter the tablespace size parameters, but this necessitated a reorganization of the entire tablespace. This reorg would require an amount of work disk space approximately the size of the tablespace; and, during the reorg the entire tablespace would not be available.

In Db2 version 12, IBM implements a new database object. This is a universal tablespace type that permits specification of partition size (DSSIZE) for each partition, increases the maximum partition size from 256 gigabytes to 1 terabyte, and increases the maximum number of partitions allowed. At its greatest extent, a tablespace can now be sized at 4,000 terabytes (4 petabytes), and the table in that tablespace can now contain up to 256 trillion rows.

While it will be rare that such a table would be implemented in native Db2, such tables might exist in the IDAA as accelerator-only tables. Indeed, loading such a table in Db2 would require a massive amount of disk space, as well as large amounts of elapsed time and CPU cycles.

The Latest Features and Functions

As noted above, IBM is delivering new features and functions using a “continuous delivery” model. Since late 2016, there have been several features delivered in this fashion. Two of the more useful ones are noted here.

Data encryption and data compression. IBM deliberately designed Version 12 of Db2 to integrate with zSystems hardware. The latest hardware, the IBM z14, contains new hardware options for cryptographic processing. This is implemented using special central processor units. Db2 can use this new feature for data encryption using standard SQL language syntax. In addition, if the DBA has specified that a tablespace is to use data compression for disk storage, no software code needs to be executed in order to compress data for storage and decompress for retrieval; instead, the z14 can do this transparently to and from disk. This means a significant reduction in CPU and elapsed time for compressed tables.
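As a rough illustration (the object names here are hypothetical), the DBA-facing DDL for requesting compression is unchanged; the z14 hardware assist is transparent once the tablespace is defined this way:

-- Request compression for a Db2 for z/OS tablespace; on z14 hardware the
-- compression and decompression are performed transparently to applications.
ALTER TABLESPACE MYDB.MYTS COMPRESS YES;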

Another similar feature involves data compression of Db2 index data. In general, indexes contain entries for each data row that specify the key for the index and a physical pointer to the row in which it resides. For example, an index on the Customer table may contain entries for the column Customer-Number. The DBA would then typically use the CLUSTER parameter to indicate that this index would be used to attempt to maintain data rows in the table in sorted order by the key value. Db2 now allows options that can implement index data compression without losing the sort order of index entries. This is called order-preserving compression.

Attached processor use during Db2 Load and Reorg. IBM allows sites to install several different versions of attached processors to the zSystem hardware. These processors can then function as CPUs for certain specified applications with one important difference: IBM will not charge for that processor’s use. This can be a great benefit for some shops that use any zSystems licensing mechanism that is dependent upon CPU cycles, where typically an average of the peak CPU use during a period is used to charge the customer. Applications that use an attached processor will not incur charges, so DBAs are always on the lookout for ways to do this.

One popular option is called the z Integrated Information Processor, or zIIP. This chip can be installed in most zSystems hardware, and the operating system recognizes it automatically on installation. Db2 version 12 also recognizes the chip, and the latest version of the Load and Reorganization utilities are now eligible for execution on the zIIP.

Typically, Load is used to load data into a table. The Reorg utility unloads table data, sorts it, then loads it back in. Hence, both utilities have a phase where they load data into the table. During this phase, called the reload phase, the utilities also save in work datasets all data required for any table indexes. Once the table is loaded, the index data is then sorted and loaded into the appropriate index objects. It is this phase that is now eligible for zIIP execution.

In these days of big data and expanding table sizes, this can result in a significant cost reduction. IBM estimates that in some cases up to 90 percent of the normal CPU usage spent by Load and Reorg can now be eliminated by using zIIP processors.

Summary

While IBM delivered Db2 Version 12 in late 2016, it is only now that many shops are using its most useful features. This is partly due to slow acceptance of the new version, but mostly because many shops are only now growing their applications and solutions to meet customer demands.

Data warehouse and big data applications grow naturally over time, as it is these historical data that are being used by analytical queries to show trends and make predictions. As tables grow and analytical queries become more frequent and more complex, the DBA must respond to both table size growth and data load time increases, with the corresponding increase in CPU cycles to support all that I/O.

The Db2 features noted here (high-speed data load, mass insert to IDAA-only tables, ultra-large table size, hardware-enhanced data encryption and compression, and use of zIIP processors) permit IT to continue to support growth in data volume while managing costs and complexity.

# # #

See all articles by Lockwood Lyon

What’s new for SQL Server 2019 Analysis Services RC1


Feed: Microsoft Power BI Blog | Microsoft Power BI.
Author: .

We find great pleasure in announcing RC1 of SQL Server 2019 Analysis Services (SSAS 2019). The SSAS 2019 release is now feature complete! We may still make performance improvements for the RTM release.

RC1 introduces the following features:

  • Custom ordering of calculation items in calculation groups
  • Query interleaving with short query bias for high-concurrency workloads
  • Online attach for optimized synchronization of read-only replicas
  • Improved performance of Power BI reports over SSAS multidimensional
  • Governance setting to control Power BI cache refreshes

SSAS 2019 features announced in previous CTPs are recapped here:

  • Calculation groups for calculation reusability in complex models (CTP 2.3)
  • Governance settings to protect server memory from runaway queries (CTP 2.4)
  • Many-to-many relationships can help avoid unnecessary “snowflake” models (CTP 2.4)
  • Dynamic measure formatting with calculation groups (CTP 3.0)

We think you'll agree that the pace of delivery from the Analysis Services engine team has been phenomenal lately. The SSAS 2019 release demonstrates Microsoft's continued commitment to all our enterprise BI customers, whether on premises, in the cloud, or hybrid.

Calculation groups

Calculation groups address the issue of proliferation of measures in complex BI models often caused by common calculations like time-intelligence. SSAS models are reused throughout large organizations, so they tend to grow in scale and complexity.

The public preview of calculation groups was announced for SSAS 2019 in the CTP 2.3 blog post, and Azure Analysis Services on the Azure Updates blog. Calculation groups will soon be launched and supported in Power BI Premium initially through the XMLA endpoint.

We are grateful for the tremendous enthusiasm from the community for this landmark feature for Analysis Services tabular models. Specifically, we’d like to thank Marco Russo and Alberto Ferrari for their excellent series of articles, and of course Daniel Otykier for ensuring that Tabular Editor supported calculation groups since the very first preview (CTP 2.3).

Here’s a link to the official documentation page for calculation groups: https://aka.ms/CalculationGroups. It contains detailed examples for time-intelligence, currency conversion, dynamic measure formats, and how to set the precedence property for multiple calculation groups in a single model. We plan to keep this article up to date as we make enhancements to calculation groups and the scenarios covered.

Custom ordering calculation items in calculation groups (Ordinal property)

RC1 introduces the Ordinal property for custom ordering of calculation items in calculation groups. As shown in the following image, this property ensures calculation items are shown to end-users in a more intuitive way:

SQL Server Data Tools (SSDT) support for calculation groups is being worked on and is planned by SQL Server 2019 general availability. In the meantime, in addition to Tabular Editor, you can use SSAS programming and scripting interfaces such as TOM and TMSL. The following snippet of JSON-metadata from a model.bim file shows the required properties to set up an Ordinal column for sorting purposes:

{
    "tables": [
        {
            "name": "Time Intelligence",
            "description": "Utility table for time-Intelligence calculations.",
            "columns": [
                {
                    "name": "Time Calc",
                    "dataType": "string",
                    "sourceColumn": "Name",
                    "sortByColumn": "Ordinal Col"
                },
                {
                    "name": "Ordinal Col",
                    "dataType": "int64",
                    "sourceColumn": "Ordinal",
                    "isHidden": true
                }
            ],
            "partitions": [
                {
                    "name": "Time Intelligence",
                    "source": {
                        "type": "calculationGroup"
                    }
                }
            ],
            "calculationGroup": {
                "calculationItems": [
                    {
                        "name": "YTD",
                        "description": "Generic year-to-date calculation.",
                        "calculationExpression": "CALCULATE(SELECTEDMEASURE(), DATESYTD(DimDate[Date]))",
                        "ordinal": 1
                    },
                    {
                        "name": "MTD",
                        "description": "Generic month-to-date calculation.",
                        "calculationExpression": "CALCULATE(SELECTEDMEASURE(), DATESMTD(DimDate[Date]))",
                        "ordinal": 2
                    }
                ]
            }
        }
    ]
}

Query interleaving with short query bias

SSAS models are reused throughout large organizations, so often require high user concurrency. Query interleaving in RC1 allows system configuration for improved user experiences in high-concurrency scenarios.

By default, the Analysis Services tabular engine works in a first-in, first-out (FIFO) fashion with regards to CPU. This means, for example, if one expensive/slow storage-engine query is received and followed by two otherwise fast queries, the otherwise fast queries can potentially get blocked waiting for the expensive query to complete. This is represented by the following diagram which shows Q1, Q2 and Q3 as the respective queries, their duration and CPU time.

Query interleaving with short query bias allows concurrent queries to share CPU resources, so fast queries are not blocked behind slow ones. Short-query bias means fast queries (defined by how much CPU each query has already consumed at a given point in time) can be allocated a higher proportion of resources than long-running queries. In the following illustration, the Q2 and Q3 queries are deemed “fast” queries and therefore allocated more CPU than Q1.

Query interleaving is intended to have little or no performance impact on queries that run in isolation; a single query can still consume as much CPU as it does with the FIFO model.

For details on how to set up query interleaving, please see the official documentation page: https://aka.ms/QueryInterleaving

Online attach

Online attach can be used for synchronization of read-only replicas in on-premises query scale-out environments.

To perform an online-attach operation, use the AllowOverwrite option of the Attach XMLA command. This operation may require double the model memory to keep the old version online while loading the new version.


<Attach xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Folder>C:\Program Files\Microsoft SQL Server\MSAS15\OLAP\Data\AdventureWorks.0.db</Folder>
  <AllowOverwrite>True</AllowOverwrite>
</Attach>

A typical usage pattern could be as follows:

  1. DB1 (version 1) is already attached on read-only server B.
  2. DB1 (version 2) is processed on the write server A.
  3. DB1 (version 2) is detached and placed on a location accessible to server B (either via a shared location, or using robocopy, etc.).
  4. The command with AllowOverwrite=True is executed on server B with the new location of DB1 (version 2).
    • Without this new feature, the administrator is first required to detach the database and then attach the new version of the database. This leads to downtime when the database is unavailable to users, and queries against it will fail.
    • When this new flag is specified, version 1 of the database is deleted atomically within the same transaction with no downtime. However, it comes at the cost of having both databases loaded simultaneously into memory.

Improved performance of Power BI reports over SSAS multidimensional

RC1 introduces optimized DAX query processing for commonly used DAX functions including SUMMARIZECOLUMNS and TREATAS. This can provide considerable performance benefits for Power BI reports over SSAS multidimensional. This enhancement has been referred to as “Super DAX MD”.

To make end-to-end use of this feature, you will need a forthcoming version of Power BI Desktop. Once the Power BI release is shipped and validated, we intend to enable it by default in a SSAS 2019 CU.

Governance setting for Power BI cache refreshes

The ClientCacheRefreshPolicy governance setting to control cache refreshes was first announced for Azure Analysis Services on the Azure blog in April. This property is now also available in SSAS 2019 RC1.

The Power BI service caches dashboard tile data and report data for initial load of Live Connect reports. This can cause an excessive number of cache queries being submitted to SSAS, and in extreme cases can overload the server.

1500 compatibility level

The SSAS 2019 modeling features work with the new 1500 compatibility level. 1500 models cannot be deployed to SQL Server 2017 or earlier or downgraded to lower compatibility levels.

Download Now

To get started with SQL Server 2019 RC1, find download instructions on the SQL Server 2019 web page. Enjoy!

How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime


Feed: AWS Big Data Blog.

A large EMEA company recently decided to migrate their on-premises IBM Netezza data warehouse to Amazon Redshift. Given the volume of data (270TB uncompressed and more than 27K tables), the number of interconnected systems that made use of this data (more than 4.5K business processes), and the zero downtime requirements, we understood that the project would be a challenge. Since the company planned to decommission the data center where the data warehouse was deployed in less than a year’s time, there were also time constraints in place.

The data warehouse is a central piece for the company; it allows users across units to gather data and generate the daily reports required to run the business. In just a few years, business units accessing the cluster increased almost 3x, with 5x the initial number of users, executing 50x the number of daily queries for which the cluster had been designed. The legacy data warehouse was not able to scale to cover their business needs anymore, resulting in nightly ETL processes running outside of their time boundaries, and live queries taking too long.

The general dissatisfaction among the business users — along with the proximity of the data center decommissioning — moved the company to plan the migration, putting its IT department in charge of the definition of the new architecture, and the execution of the project.

Amazon Redshift, Amazon Web Services’ (AWS) fast, scalable OLAP data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake, was the perfect fit to solve their problems. Not only does Amazon Redshift provide full elasticity for future growth, and features such as concurrency scaling to cover high demand peaks, it also offers a whole ecosystem of analytics services to be easily integrated.

In this article, we explain how this customer performed a large-scale data warehouse migration from IBM Netezza to Amazon Redshift without downtime, by following a thoroughly planned migration process, and leveraging AWS Schema Conversion Tool (SCT) and Amazon Redshift best practices.

Preparing the migration

Large enterprise customers typically use data warehouse systems as a central repository for data analytics, aggregating operational data from heterogeneous transactional databases and running analytical workloads to serve analyst teams through business intelligence applications and reports. Using AWS, customers can benefit from the flexibility of having multiple compute resources processing different workloads, each workload scaling as the demand grows.

In this section, we describe the steps that we followed to prepare the migration of this data warehouse from IBM Netezza to Amazon Redshift.

Identifying workloads and dependencies

Customers typically have three different types of workloads running in their data warehouses:

  1. Batch processes: Long-running processes that require many resources and low concurrency, such as ETL jobs.
  2. Ad hoc queries: Short queries with high concurrency, such as analysts querying data.
  3. Business workloads: Typically mixed workloads, such as BI applications, reports, and dashboards.

In this customer’s case, they were building business data marts through complex ETL jobs, running statistical models, generating reports, and allowing analysts to run ad hoc queries. In essence, these applications are divided into two groups of workloads: batch and ad hoc queries. The on-premises platform was always saturated and struggling to deal with the level of concurrency demanded by these workloads while offering acceptable performance.

The following diagram shows the architecture that the customer had before the migration:

By using Amazon Redshift, the customer is able to fulfill the requirements of every analytical workload. Within the old data warehouse, shown in the above diagram, two different managed service providers managed two sets of independent data and workloads. For the new architecture, the decision was to split the data warehouse into two different Amazon Redshift clusters to serve those different workloads, as described in the following section. Within each of these clusters, the customer is able to manage resources and concurrency for different applications under the same workload by configuring Amazon Redshift Workload Management (WLM). A typical WLM setup is to match every workload with a queue, so in this case, each cluster had two queues configured: batch and ad hoc, each with a different number of slots and assigned memory.

Sizing the target architecture

For heterogeneous migrations, like this one, a comprehensive analysis should be performed on the source database, collecting enough data to design the new architecture that supports both data and applications.

The AWS Schema Conversion Tool was the perfect companion for this project, as the customer was able to automate the reports and generate an assessment that helped estimate the migration complexity for different objects, e.g. data types, UDFs, and stored procedures.

In a typical database migration, customers categorize large tables by number of rows. However, when migrating to columnar databases, such as Amazon Redshift, it is essential to also assess table width (that is, number of columns) from the very beginning. While columnar databases are generally more efficient than row-based databases in storing data, wide tables with few rows may have a negative impact on columnar databases. To estimate the minimum table size required for each table in Amazon Redshift, use this formula from the AWS Knowledge Center.
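At the time of writing, that Knowledge Center formula is commonly quoted along the following lines; treat the exact constants as an assumption and check the current AWS documentation:

Minimum table size = block_size (1 MB) x (number_of_user_columns + 3 system columns) x number_of_populated_slices x number_of_table_segments

In other words, every populated slice reserves at least one 1 MB block per column, so a very wide table can occupy a non-trivial amount of storage even when it holds few rows.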

For this customer, there was a clear separation between core applications, and a large isolated business application with minimum data dependencies and a different set of users. As a result, one of the main architectural decisions was to use two different Amazon Redshift clusters:

  • Primary cluster: Holds the core schemas and most of the data, and serves most of the business applications. Due to high storage requirements and long batch processes that run here, the recommended Amazon Redshift node type for this cluster is the dense storage family.
  • Secondary cluster: Purpose-built cluster for a single application that demands high I/O. The recommended Amazon Redshift node type for this cluster is the dense compute family.

Planning the migration

There are several approaches when it comes to database migration. A must-have requirement for many migrations is to minimize downtime, which was the main driver for the migration pattern described in this post.

One of the challenges when migrating databases is to keep the data updated in both systems, capturing the changes on the source and applying them to the destination database during the migration. By definition, data warehouses shouldn't be used to run transactional (OLTP) workloads, but rather long-running ETL processes and analytical (OLAP) workloads. Usually, those ETL processes update data in batches, typically on a daily basis. This simplifies the migration, because when the ETL processes that load data into the data warehouse are run in parallel against both target systems during the migration, change data capture (CDC) is not required.

The following image summarizes how we planned this migration with the customer, following a parallel approach to minimize downtime on their production data warehouse:

The main steps of this process are (1) data migration, (2) technical validation, (3) data sync, and (4) business validation, as described in the following section.

Running the migration

Full data migration

The initial data migration was the first milestone of the project. The main requirements for this phase were: (1) minimize the impact on the data source, and (2) transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service or AWS Schema Conversion Tool).

For this heterogeneous migration, the customer used the AWS Schema Conversion Tool (SCT). The SCT enabled them to run the data migration, provisioning multiple virtual machines in the same data center where IBM Netezza was installed, each running an AWS SCT Data Extractor agent. These data extractors are Java processes that connect directly to the source database and migrate data in chunks to the target database.

Sizing data extraction agents

To estimate the number of data extractor agents needed for the migration, consider this rule of thumb: One data extractor agent per 1 TB of compressed data on the source. Another recommendation is to install extraction agents on individual computers.

For each agent, consider the following hardware general requirements:

  • CPU: 4 cores. Lots of transformations and a large number of packets to process during data migration.
  • RAM: 16 GB. Data chunks are kept in memory before dumping to disk.
  • Disk: 100 GB / ~500 IOPS. Intermediate results are stored on disk.
  • Network: at least 1 Gbit (10 Gbit recommended). While provisioning the resources, it is recommended to reduce the number of network hops from the source to the AWS SCT data extraction agents.

Follow this documentation in order to go through the installation steps for the data extraction agents.

Depending on the size of the data to be migrated and the network speed, you may also run data extractor agents on top of EC2. For large data warehouses and in order to minimize downtime and optimize data transfer, it is recommended to deploy the data extractor agents as close as possible to the source. For example, in this migration project, 24 individual SCT extractor agents were installed in the on-premises data center for concurrent data extraction, and in order to speed up the process. Due to the stress that these operations generate on the data source, every extraction phase was run during weekends and off-hours.

The diagram below depicts the migration architecture deployed during the data migration phases:

Creating data extraction tasks

The source tables were migrated in parallel, on a table-by-table basis, using the deployed SCT extraction agents. These extraction agents authenticate using a valid user on the data source, which allows you to adjust the resources available for that user during the extraction. Data was processed locally by SCT agents and uploaded to S3 through the network (via AWS Direct Connect). Note that other migration scenarios might require the use of AWS Snowball devices. Check the Snowball documentation to decide which transfer method is better for your scenario.

As part of the analysis performed while planning the migration, the customer identified large tables, e.g. tables with more than 20 million rows or 1 TB in size. In order to extract data from those tables, they used the virtual partitioning feature of AWS SCT, creating several sub-tasks and parallelizing the data extraction process for those tables. We recommend creating two groups of tasks for each schema that migrates: one for small tables and another one for large tables using virtual partitions.

These tasks can be defined and created before running, so everything is ready for the migration window. Visit the following documentation to create, run, and monitor AWS SCT Data Extraction Tasks.

Technical validation

Once the initial extracted data was loaded to Amazon Redshift, data validation tests were performed in parallel, using validation scripts developed by the partner teams involved in the migration. The goal at this stage is to validate production workloads, comparing IBM Netezza and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase are the following:

  • Count number of objects and rows on each table.
  • Compare the same random subset of data in both IBM Netezza and Amazon Redshift for all migrated tables, validating that data is exactly the same row by row.
  • Check incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Deal with tables with large varchar columns.
  • Confirm that processes do not crash when connected with target environment.
  • Validate daily batch job runs (job duration, number of rows processed).

You’ll find the right techniques to execute most of those activities in Top 10 Performance Tuning Techniques for Amazon Redshift.

Data synchronization

During this phase, the customer again migrated the tables and schemas that lost synchronization with the source during the Technical Validation phase. By using the same mechanism described in the Full Data Migration section, and since the ETL processes that generate the data marts are already running on the future system, data is kept updated after this synchronization phase.

Business validation

After the second data migration was successfully performed and the data movement was technically validated, the last remaining task was to involve the data warehouse users in the final validation. These users from different business units across the company accessed the data warehouse using a variety of tools and methods: JDBC/ODBC clients, Python code, PL/SQL procedures, custom applications, etc. It was central to the migration to make sure that every end user had verified and adapted their processes to work seamlessly with Amazon Redshift before the final cut-over was performed.

This phase took around three months and consisted of several tasks:

  • Adapt business users’ tools, applications, and scripts to connect to Amazon Redshift endpoints.
  • Modify users' data load and dump procedures, replacing data movement to / from shared storage via ODBC / JDBC with COPY / UNLOAD operations from / to S3 (see the sketch after this list).
  • Modify any incompatible query, taking into account Amazon Redshift PostgreSQL implementation nuances.
  • Run business processes, both against IBM Netezza and Amazon Redshift, and compare results and execution times, being sure to notify any issue or unexpected result to the team in charge of the migration, so the case can be analyzed in detail.
  • Tune query performance, taking into account table sort keys and making extensive use of the EXPLAIN command in order to understand how Amazon Redshift plans and executes queries.
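As a rough sketch of that COPY / UNLOAD replacement (the bucket, prefix, table, and IAM role names here are hypothetical):

-- Export data from Amazon Redshift to S3 (replaces a dump to shared storage)
UNLOAD ('SELECT * FROM sales.orders')
TO 's3://example-migration-bucket/unload/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
GZIP;

-- Load data from S3 into Amazon Redshift (replaces an ODBC / JDBC bulk load)
COPY sales.orders
FROM 's3://example-migration-bucket/load/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV
GZIP;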

This business validation phase was key to have all end users aligned and ready for the final cut-over. Following Amazon Redshift best practices enabled end users to leverage the capabilities of their new data warehouse.

Soft cut-over

After all the migration and validation tasks had been performed, every ETL, business process, external system, and user tool was successfully connected and tested against Amazon Redshift.

This is when every process can be disconnected from the old data warehouse, which can be safely powered off and decommissioned.

Conclusion

In this blog post, we described the steps taken to perform a successful large-scale data warehouse migration from an on-premises IBM Netezza to Amazon Redshift. These same steps can be extrapolated to any other source data warehouse.

While this article describes a pure lift-and-shift migration, this is just the beginning of a transformation process towards a full-fledged corporate data lake. There are a series of next steps necessary in order to gain full advantage of the powerful analytics tools and services provided by AWS:

  • Activate Amazon Redshift’s Concurrency scaling feature on interactive query queues so clusters scale seamlessly on high usage periods without needing to provision the clusters for peak capacity.
  • Create a data lake in S3 and offload less accessed data, keeping warm and hot data on the Amazon Redshift clusters for higher performance.
  • Leverage Amazon Redshift Spectrum to be able to combine cold and hot data on analytic queries when required by the business needs.
  • Use Amazon Athena to be able to query cold data without affecting the data warehouse performance.

It is worth pointing out several takeaways that we feel are central to achieving a successful large-scale migration to Amazon Redshift:

  • Start with a PoC to make an accurate initial sizing of the Amazon Redshift cluster.
  • Create a detailed migration plan, which includes a clear procedure for every affected system.
  • Have end users fully aligned with the migration process, and make sure all their processes are validated before the final cutover is performed.
  • Follow Amazon Redshift best practices and techniques to leverage its full capabilities and performance.
  • Engage with the AWS account team from early stages and throughout the whole process. They are the point of contact with AWS specialists, Professional Services, and partners in order to bring the migration project to a successful conclusion.

We hope you found this post useful. Please feel free to leave a comment or question.


About the Authors

Guillermo Menendez Corral is a solutions architect at Amazon Web Services. He has over 12 years of experience designing and building SW applications and currently provides architectural guidance to AWS customers, with a focus on Analytics and Machine Learning.

Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.

ahsan hadi: Horizontal scalability with Sharding in PostgreSQL – Where it is going Part 2 of 3.


Feed: Planet PostgreSQL.

Declarative Partitioning

So far we have discussed scalability: what scalability is, why and when you need it, and what the different types of scalability are. Now we are starting to get into the meat of this topic and will discuss declarative partitioning and sharding in PostgreSQL. The sharding functionality is being laid on top of the declarative partitioning functionality in PostgreSQL.

Declarative partitioning was released in PostgreSQL 10; prior to declarative partitioning, PostgreSQL used table inheritance and plpgsql triggers to provide table partitioning. The example below shows how a table can be partitioned using the declarative partitioning syntax introduced in PG 10:
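The example in the original post is an image that did not survive syndication; a minimal sketch of the PG 10 syntax, using a measurement table range-partitioned on logdate (the partition key referenced later in this series), would look like this:

CREATE TABLE measurement (
    city_id   int not null,
    logdate   date not null,
    peaktemp  int,
    unitsales int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2019 PARTITION OF measurement
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

CREATE TABLE measurement_y2020 PARTITION OF measurement
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');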

Declarative partitioning provides native support for partitioning in PostgreSQL. Using the syntax shown in the example above, the user can create a partitioned table, which divides the table into pieces called partitions. All rows inserted into the partitioned table are routed to one of the partitions based on the partition key.

A lot of performance improvement for declarative partitioning was added in PostgreSQL 11, including better code for partition pruning. Partition pruning is the ability to eliminate certain partitions from the search based on the quals provided in the WHERE predicate.
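For instance, with the sketch above, a query that filters on the partition key only touches the matching partition:

-- Only measurement_y2019 is scanned; the other partitions are pruned from the plan.
EXPLAIN SELECT * FROM measurement WHERE logdate = '2019-07-01';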

Sharding in PostgreSQL

Sharding is the ability to partition a table across one or more foreign servers. With declarative partitioning, as shown above, the table can be partitioned into multiple partitions living on the same database server. Sharding allows the table to be partitioned in such a way that the partitions live on external foreign servers while the parent table lives on the primary node where the user creates the sharded table. All the foreign servers used for a sharded table are PostgreSQL foreign servers; other foreign servers, i.e. MongoDB, MySQL, etc., are not supported.

The example below shows how a sharded table can be created in PostgreSQL today. We will then talk about the approach/architecture that the community is following in order to add sharding, discuss what is already done for built-in sharding and what the important remaining pieces are, and also highlight the challenges.

The main parent table is created on the main server. On the remote server, you simply create the table that backs the partition; it corresponds to the parent partitioned table created on the primary server. In this example the foreign server is shard_1.

On the main server, the steps of creating the postgres_fdw extension with the appropriate permissions, creating the foreign server where the partition will live, and creating the user mapping need to be carried out. Then, also on the main server, the partitioned table is created; the difference between a normal partition and this one is that we specify the foreign server, in this case shard_1, where we have created the backing table.
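The screenshots from the original post are missing here; a minimal sketch of the whole setup, with hypothetical names (a measurement table partitioned on logdate, a foreign server shard_1 pointing at host shard1, and an app_user role), might look like this:

-- On the main server: the parent (sharded) table
CREATE TABLE measurement (
    city_id  int not null,
    logdate  date not null,
    peaktemp int
) PARTITION BY RANGE (logdate);

-- On the remote server (shard1): the table that will back the remote partition
CREATE TABLE measurement_y2019 (
    city_id  int not null,
    logdate  date not null,
    peaktemp int
);

-- Back on the main server: FDW setup and the foreign partition
CREATE EXTENSION postgres_fdw;

CREATE SERVER shard_1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1', port '5432', dbname 'postgres');

CREATE USER MAPPING FOR app_user SERVER shard_1
    OPTIONS (user 'app_user', password 'secret');

CREATE FOREIGN TABLE measurement_y2019 PARTITION OF measurement
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
    SERVER shard_1;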

Using the example above, the user can create a sharded table where the partitions live on a foreign server. Please note that the partitions need to be created manually on the foreign servers. Once this is set up, all queries will be routed to their specific partitions using the partition pruning logic.

ahsan hadi: Horizontal scalability with Sharding in PostgreSQL – Where it is going Part 3 of 3.


Feed: Planet PostgreSQL.

Built-in Sharding Architecture

The built-in sharding feature in PostgreSQL uses the FDW-based approach. FDWs are based on the SQL/MED specification, which defines how an external data source can be accessed from the PostgreSQL server. PostgreSQL provides a number of foreign data wrappers (FDWs) that are used for accessing external data sources: postgres_fdw is used for accessing a Postgres database running on an external server, MySQL_fdw is used for accessing a MySQL database from PG, MongoDB_fdw is used for accessing MongoDB, and so on.

The diagram below explains the current approach to built-in sharding in PostgreSQL: the partitions are created on foreign servers, the PostgreSQL FDW is used for accessing those servers, and using the partition pruning logic the planner decides which partitions to access and which to exclude from the search.

Push Down Capabilities

Push-down in this context is the ability to push parts of the foreign query to the foreign servers in order to decrease the amount of data travelling from the foreign server to the parent node. The two basic push-down techniques that have been part of postgres_fdw from the start are SELECT target-list pushdown and WHERE clause pushdown.
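For illustration, consider a hypothetical query against the measurement table from the earlier sketches:

SELECT logdate, peaktemp
FROM measurement
WHERE logdate >= DATE '2019-07-01' AND logdate < DATE '2019-08-01';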

In the query above, the planner will decide which partition to access based on the partition key, i.e. logdate in this case. The WHERE clause will be pushed down to the foreign server that contains the respective partition. Those are the basic push-down capabilities available in postgres_fdw.

The sharding feature requires more advanced push-down capabilities in order to push as many operations as possible down to the foreign servers containing the partitions, minimising the data sent over the wire to the parent node.

Join, sort, and aggregate push-down have been added to postgres_fdw over the last few major releases, which is a decent set of push-down capabilities. The good thing about these features is that they already benefit a number of use cases even though the entire sharding feature is not yet in place.

What’s remaining and associated challenges

There are still a number of important features remaining before we can say that we have a sharding feature in PostgreSQL. In this section we are going to discuss these features and their challenges. I am sure there are other features related to database cluster management, i.e. backup/failover or monitoring, that are not in this list.

1- 2PC for foreign data wrapper transactions

Currently FDW transactions don't support two-phase commit. This means that if you are using multiple foreign servers in a transaction and one part of the transaction fails on one foreign server, then the entire transaction on all foreign servers is supposed to fail. This feature is required in order to guarantee data consistency across the database cluster.

This feature is required in order to support OLTP workloads, hence it is very important for the sharding feature.

The design proposal and patches for this feature have been sent to hackers over the last several years, but they haven't gotten enough community interest, hence the design of this feature is still outstanding.

2- Parallel foreign scan

When a single query involves multiple foreign scans, today all of those foreign scans are executed sequentially, one after another. Parallel foreign scan means executing multiple foreign scans in parallel. This feature is really important for OLAP use cases, for example if you are running an AVG query on a large partitioned table that is divided over a large number of partitions. Currently the AVG operation is sent to each foreign server sequentially, and the result from each foreign server is sent to the parent node, which aggregates the results and sends them back to the client. Once we have parallel foreign scan, the average operations on all the foreign servers will be executed in parallel and their results sent to the parent node. The parent node will aggregate the data and send the results to the client.

This is a key piece needed for completing the sharding feature: we currently have aggregate pushdown that sends the aggregates down to the foreign server, but we don't have the functionality to run the aggregate operations on all the partitions in parallel.

This feature is particularly important for the OLAP use case: the idea of having a large number of foreign servers containing partitions of a large partitioned table, with the aggregate operation running on all the foreign servers in parallel, is very powerful.

The infrastructure for the parallel foreign scan feature is asynchronous query execution, which is a major change in PostgreSQL. Some work has been done on this, but it feels like it is still a release or two away from being committed. Once asynchronous query execution is done, it will be easier to add parallel foreign scan functionality.

3- Shard management

The partitions on foreign servers are currently not created automatically; as described in the “Sharding in PostgreSQL” section, the partitions need to be created manually on the foreign servers. This can be a very tedious task if you are creating a partitioned table with a large number of partitions and sub-partitions.

The shard management feature is supposed to provide the ability to auto-create the partitions and sub-partitions on the foreign servers. This will make the creation of sharded tables very easy.

Without going into the design details of how this feature will be implemented, the basic idea is that the sharded table syntax will be built on top of the declarative partitioning syntax. postgres_fdw will be used to push the DDL down to the foreign servers, even though FDWs are only meant to do SELECT or DML; performing DDL on an external source is not part of the SQL/MED specification. In any case, the design of this feature is out of scope for this blog.

Work on this feature has not yet started in the community; the development team at HighGo is planning to work on it.

4- Global Transaction Manager / Snapshot Manager

This is another very important and difficult feature that is mandatory for sharding. The purpose of the global transaction/snapshot manager is to provide global transactional consistency. The problem described in item 1 of this section, “2PC for foreign data wrapper transactions”, also ties in with the global transaction manager.

Let's suppose you have two concurrent clients that are using a sharded table: client #1 is trying to access a partition that is on server 1, and client #2 is also trying to access that same partition on server 1. Client 2 should get a consistent view of the partition, i.e. any changes (updates etc.) made to the partition during client 1's transaction shouldn't be visible to client 2. Once client 1's transaction gets committed, the changes will be visible to all new transactions. The global transaction manager is supposed to ensure that every global transaction gets a consistent view of the database cluster. All concurrent clients using the database cluster (with tables sharded across multiple foreign servers) should see a consistent view of the database cluster.

This is a hard problem to solve, and companies like Postgres Professional have tried to solve it using an external transaction manager. So far no solution seems to have been accepted by the community, and right now there is no visible concentrated effort to implement the global transaction manager in core or even as an external component.

There has also been mention of other approaches, like Clock-SI (snapshot isolation for partitioned tables), which is followed by other successful projects like Google Cloud Spanner and YugaByte for solving the same problem.

Conclusion

This concludes all three blogs of this series: horizontal scalability with sharding is imperative for PostgreSQL. It is possible that only some workloads need sharding today in order to solve their problems, but I am sure everyone wants to know that PostgreSQL has an answer to this problem. It is also important to note that sharding is not a solution for all big data or highly concurrent workloads; you need to pick workloads where a larger table can be logically partitioned across partitions and the queries benefit from the pushdown and other capabilities of the sharded cluster.

As I mentioned in the initial section of this blog, the first target for the sharding feature, once it is complete, is to be able to speed up a long-running complex query. This would be an OLAP query; that is not to say sharding would not benefit OLTP workloads as well. The data would be partitioned across multiple servers instead of a single server.

Another important exercise that the sharding team should start soon is benchmarking with the capabilities already part of PostgreSQL. I know that without parallel foreign scan it is not possible to speed up a real OLAP query that uses multiple partitions. However, the process of benchmarking should begin soon: we need to identify the type of workload that should benefit from sharding, the performance without sharding, and the performance to expect with a sharded cluster. I don't think we can expect the performance to scale linearly as we add more shards to the cluster.

Another important point I would like to mention here is that there has been criticism about using the FDW machinery for implementing built-in sharding. There have been suggestions to go lower level in order to handle cross-node communication more efficiently. The answer given by a senior community member is a good one: we are using the FDW machinery to implement this feature because that's the quickest and least error-prone route. The FDW functionality is already tried and tested; if we try to implement this using an approach that's more complex and sophisticated, it will require a lot of resources and a lot of time before we can produce something that we can call sharding.

It will take more than a few companies investing their resources to build this big feature, so more and more companies should come together on implementing it in the community, because it is worth it.

Asif Rehman: An Overview of Replication in PostgreSQL Context


Feed: Planet PostgreSQL.

Replication is a critical part of any database system that aims to provide high availability (HA) and an effective disaster recovery (DR) strategy. This blog is aimed at establishing the role of replication in a database system. It gives a general overview of replication and its types, as well as an introduction to the replication options in PostgreSQL.

The term replication describes the process of sharing information between one or more software or hardware systems to ensure reliability, availability, and fault tolerance. These systems can be located in the same vicinity, be on a single machine, or be connected over a wide network. Replication can be divided broadly into hardware and software categories. We'll explore these categories briefly; however, the main focus of this blog is database replication, so let's first understand what constitutes a database replication system.

In a nutshell, database replication describes the process of copying data from one database instance to one or more other database instances. Again, these instances can be in the same location or connected over a wide network.

Hardware-Based Replication

Let's start with hardware replication. Hardware-based replication keeps multiple connected systems in sync. This syncing of data is done at the storage level: as soon as an I/O is performed on the system, it's propagated to the configured storage modules/devices/systems. This type of replication can be done for the entire storage or only for selected partitions. The biggest advantage of this type of solution is that (generally) it's easier to set up and is independent of software, which also makes it perform better; however, it reduces flexibility and control over the replication. Here are a few pros of this type of replication:

Real-time – all the changes are applied to subsequent systems immediately.
Easier to Setup – no scripting or software configurations are required.
Independent of Application – replication happens at the storage layer and is independent of OS/software application.
Data Integrity and Consistency – The mirroring happens at the storage layer, which effectively makes an exact copy of the storage disk, so data integrity and consistency are automatically ensured.

Although hardware replication has some very appealing advantages, it comes with its own limitations. It generally relies on vendor lock-in, i.e. the same type of hardware has to be used, and often it's not very cost-effective.

Software-Based Replication

Software-based replication can range from generic to product-specific solutions. Generic solutions tend to emulate hardware replication by copying the data between different systems at the software level: the software responsible for performing the replication copies each bit written to the source storage and propagates it to the destination system(s). Product-specific solutions, on the other hand, are tailored to the product's requirements and are generally meant for a specific product. Software-based replication has its pros and cons. On one hand, it provides flexibility and control over how the replication is done, is usually very cost-effective, and provides a much richer set of features. On the other hand, it requires a lot of configuration as well as continuous monitoring and maintenance.

Database Replication

Having discussed replication and its different types, let's now turn our focus toward database replication.

Before going into more detail, let's start by discussing the terminology used to describe the different components of a replication system in the database world. Primary-Standby, Master-Slave, Publisher-Subscriber, and Master-Master/Multimaster are the terms most often used to describe the database servers participating in a replication setup.

The terms Primary, Master, and Publisher are used to describe the active node that propagates the changes it receives to the other nodes, whereas the terms Standby, Slave, and Subscriber describe the passive nodes that receive the propagated changes from the active nodes. In this blog, we will use Primary to describe the active node and Standby for the passive node.

The database replication can be configured in Primary-Standby and Multi-master configurations.

In a Primary-Standby configuration, only one instance receives the data changes and then it distributes them among the standbys.

Primary-Standby Configuration

In a multi-master system, by contrast, each database instance can receive data changes and then propagates those changes to the other instances.

Multi-Master Configuration

Synchronous/Asynchronous Replication

At the core of the replication process is the ability of the primary node to transmit data to standbys. Much like any other data transfer strategy, this can happen in a synchronous way where primary waits for all standbys to confirm that they have received and written the data on disk. The client is given a confirmation when these acknowledgments are received. Alternatively, the primary node can commit the data locally and transmit the transaction data to standbys whenever possible with the expectation that standbys will receive and write the data on disk. In this case, standbys do not send confirmation to the primary. Whereas the former strategy is called synchronous replication, this is referred to as asynchronous replication. Both hardware and software-based solutions support synchronous and asynchronous replication.

Following are diagrams that show both types of replication strategies in a graphical way.

Since the main difference between the two strategies is the acknowledgment of data written on the standby systems; there are advantages and disadvantages in using both techniques. Synchronous replication may be configured in a way such that all systems are up-to-date and that the replication is done in real-time. But at the same time, it adversely impacts the performance of the system since the primary node waits for standbys’ confirmations.

The asynchronous replication tends to give better performance as there is no wait time attributed to confirmation messages from standbys.

Cascading Replication

So far we have seen that only the primary node transmits the transaction data to standbys. This is a load that can be distributed among multiple systems; i.e. a standby that has received the data and written on disk can transmit it onwards to standbys that are configured to receive data from it. This is called cascading replication where replication is configured between primary and standbys, and standby(s) and standbys. Following is a visual representation of cascading replication.

Cascading Replication

Standby Modes

Warm standby is a term used to describe a standby server that receives changes from the primary but does not accept any direct client connections.

Hot standby is a term used to describe a standby server that also accepts client connections.

PostgreSQL Replication

In the end, I would like to share the replication options available for PostgreSQL.

PostgreSQL offers built-in log streaming and logical replication options. The built-in replication is only available in the primary-standby configuration.

Streaming Replication (SR)

Also known as physical or binary replication in Postgres, streaming replication streams binary data to the standbys using WAL (write-ahead log) records. Any change made on the primary server is first written to the WAL files before it is written to disk, and the database server can stream these records to any number of standbys. Since it is a binary copy of the data, there are certain limitations: this type of replication can only be used when all of the changes have to be replicated to the standbys, so it cannot be used if only a subset of the changes is required. The standby servers either do not accept connections, or if they do, they can only serve read-only queries. The other limitation is that it cannot be used between different versions of the database server; the whole setup, consisting of multiple database servers, has to be on the same version.

Streaming replication is asynchronous by default; however, it is not difficult to turn on synchronous replication. This kind of setup is perfect for achieving high availability (HA): if the main server fails, one of the standbys can take its place since it is almost an exact copy of it. Here is the basic configuration:

# First create a user with the replication role in the primary database; this user will be used by standbys to establish a connection with the primary:
CREATE USER foouser REPLICATION;

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser          trust

# Take a base backup on the standby:
pg_basebackup -h  -U foouser -D ~/standby

# In postgresql.conf following entries are needed
wal_level=replica # should be set to replica for streaming replication
max_wal_senders = 8 # number of standbys to allow connection at a time

# On the standby, create a recovery.conf file in its data directory with the following contents:
standby_mode=on
primary_conninfo='user=foouser host= port='

Now start the servers and a basic streaming replication should work.
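To make the setup synchronous rather than asynchronous, a minimal sketch (assuming the standby's primary_conninfo sets application_name=standby1) adds the following to postgresql.conf on the primary:

synchronous_commit = on
synchronous_standby_names = 'standby1'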

Logical Replication

Logical replication is relatively new in PostgreSQL compared to streaming replication. Streaming replication is a byte-by-byte copy of the entire data, so it does not allow replicating a single table or a subset of the data from the primary server to the standbys. Logical replication enables replicating particular objects in the database to the standbys instead of the whole database. It also allows replicating data between different versions of the database server, and it allows the standby to accept both read and write queries. However, be aware that no conflict resolution system is implemented: if both the primary and the standbys are allowed to write to the same table, more likely than not there will be data conflicts that stop the replication process, and the user must resolve these conflicts manually. The primary and the standbys may, however, write to completely disjoint sets of data, which avoids any need for conflict resolution. Here is the basic configuration:

# First create a user with the replication role in the primary database; this user will be used by standbys to establish a connection with the primary:
CREATE USER foouser REPLICATION;

# Add it to the pg_hba.conf to allow authentication for this user:
# TYPE  DATABASE        USER       ADDRESS                 METHOD
host    replication     foouser          trust

# For logical replication, a base backup is not required; either of the two database instances can be used to create this setup.

wal_level=logical # for logical replication, wal_level needs to be set to logical in postgresql.conf file.

# On the primary server, create some tables to replicate and a publication that lists these tables.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);
INSERT INTO t1 ....;
INSERT INTO t2 ....;
CREATE PUBLICATION foopub FOR TABLE t1, t2;

# On the standby, the structure of the above tables needs to be created, as DDL is not replicated.
CREATE TABLE t1 (col1 int, col2 varchar);
CREATE TABLE t2 (col1 int, col2 varchar);

# On the standby, create a subscription for the above publication.
CREATE SUBSCRIPTION foosub CONNECTION 'host= port= user=foouser' PUBLICATION foopub;

With this, basic logical replication is achieved.
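A quick, illustrative way to verify that the subscription is working:

-- On the primary
INSERT INTO t1 VALUES (1, 'replicated row');

-- On the standby, the row should appear shortly afterwards
SELECT * FROM t1;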

I hope this blog helped you understand replication in general and, from a PostgreSQL perspective, the concepts of streaming and logical replication. Furthermore, I hope it helps you better design a replication environment for your needs.


Achieve up to 16x better Spark performance with Amazon EMR release 5.26.0


Feed: Recent Announcements.

With EMR release 5.26.0, Spark users benefit from all the new Spark performance optimizations introduced in EMR release 5.24.0 and 5.25.0 without the need to make any configuration or code changes. The following optimizations are enabled by default in the 5.26.0 release:

  • Dynamic partition pruning – Allows the Spark engine to infer relevant partitions at runtime, saving time and compute resources both by reading less data from storage and by reducing the number of records that need to be processed (see the example query after this list).
  • DISTINCT before INTERSECT – Eliminates duplicate values in each input collection prior to computing the intersection, which improves performance by reducing the amount of data shuffled between hosts.
  • Flattening scalar subqueries – Helps in situations where multiple different conditions need to be applied to rows from a specific table, preventing the table from being read multiple times for each condition.
  • Optimized join reorder – Dynamically reorders joins to execute smaller joins with filters first, reducing the processing required for larger subsequent joins.
  • Bloom filter join – Filters table joins dynamically to include only relevant rows, reducing the amount of data processed by Spark and improving query runtime performance.
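As a hedged illustration of the first optimization, dynamic partition pruning typically kicks in for star-schema joins. In a query like the following (the tables are hypothetical, with sales assumed to be partitioned by sale_date), Spark can skip sales partitions whose dates cannot match the filtered dimension rows:

SELECT d.d_quarter, SUM(s.amount) AS total_sales
FROM sales s
JOIN date_dim d ON s.sale_date = d.d_date
WHERE d.d_quarter = '2019Q1'
GROUP BY d.d_quarter;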

Please visit Optimizing Spark Performance documentation and the EMR 5.26.0 release notes for details on these optimizations.

Also included in EMR 5.26.0 is a Beta integration with AWS Lake Formation and new versions of Apache HBase 1.4.10 and Apache Phoenix 4.14.2. Please see Integrating Amazon EMR with AWS Lake Formation (Beta) for more details on the integration.

Amazon EMR release 5.26.0 is now available in all supported regions for Amazon EMR.

The integration between AWS Lake Formation and Amazon EMR is in Beta, and is available in the US East (N. Virginia), and US West (Oregon) regions.

You can stay up to date on EMR releases by subscribing to the feed for EMR release notes. Use the icon at the top of the EMR Release Guide to link the feed URL directly to your favorite feed reader.

Extract Oracle OLTP data in real time with GoldenGate and query from Amazon Athena


Feed: AWS Big Data Blog.

This post describes how you can improve performance and reduce costs by offloading reporting workloads from an online transaction processing (OLTP) database to Amazon Athena and Amazon S3. The architecture described allows you to implement a reporting system and have an understanding of the data that you receive by being able to query it on arrival. In this solution:

  • Oracle GoldenGate generates a new row on the target for every change on the source to create Slowly Changing Dimension Type 2 (SCD Type 2) data.
  • Athena allows you to run ad hoc queries on the SCD Type 2 data.

Principles of a modern reporting solution

Advanced database solutions use a set of principles to help them build cost-effective reporting solutions. Some of these principles are:

  • Separate the reporting activity from the OLTP. This approach provides resource isolation and enables databases to scale for their respective workloads.
  • Use query engines running on top of distributed file systems like Hadoop Distributed File System (HDFS) and cloud object stores, such as Amazon S3. The advent of query engines that can run on top of open-source HDFS and cloud object stores further reduces the cost of implementing dedicated reporting systems.

Furthermore, you can use these principles when building reporting solutions:

  • To reduce licensing costs of the commercial databases, move the reporting activity to an open-source database.
  • Use a log-based, real-time, change data capture (CDC), data-integration solution, which can replicate OLTP data from source systems, preferably in real-time mode, and provide a current view of the data. You can enable the data replication between the source and the target reporting systems using database CDC solutions. The transaction log-based CDC solutions capture database changes noninvasively from the source database and replicate them to the target datastore or file systems.

Prerequisites

If you use GoldenGate with Kafka and are considering cloud migration, you can benefit from this post. This post also assumes prior knowledge of GoldenGate and does not detail steps to install and configure GoldenGate. Knowledge of Java and Maven is also assumed. Ensure that a VPC with three subnets is available for manual deployment.

Understanding the architecture of this solution

The following workflow diagram (Figure 1) illustrates the solution that this post describes:

  1. Amazon RDS for Oracle acts as the source.
  2. A GoldenGate CDC solution produces data for Amazon Managed Streaming for Apache Kafka (Amazon MSK). GoldenGate streams the database CDC data to the consumer. Kafka topics in an MSK cluster receive the data from GoldenGate.
  3. The Apache Flink application running on Amazon EMR consumes the data and sinks it into an S3 bucket.
  4. Athena analyzes the data through queries. You can optionally run queries from Amazon Redshift Spectrum.

Data Pipeline

Figure 1

Amazon MSK is a fully managed service for Apache Kafka that makes it easy to provision Kafka clusters with a few clicks, without the need to provision servers and storage or configure Apache ZooKeeper manually. Kafka is an open-source platform for building real-time streaming data pipelines and applications.

Amazon RDS for Oracle is a fully managed database that frees up your time to focus on application development. It manages time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.

GoldenGate is a real-time, log-based, heterogeneous database CDC solution. GoldenGate supports data replication from any supported database to various target databases or big data platforms like Kafka. GoldenGate’s ability to write the transactional data captured from the source in different formats, including delimited text, JSON, and Avro, enables seamless integration with a variety of BI tools. Each row has additional metadata columns including database operation type (Insert/Update/Delete).

Flink is an open-source, stream-processing framework with a distributed streaming dataflow engine for stateful computations over unbounded and bounded data streams. EMR supports Flink, letting you create managed clusters from the AWS Management Console. Flink also supports exactly-once semantics with the checkpointing feature, which is vital to ensure data accuracy when processing database CDC data. You can also use Flink to transform the streaming data row by row or in batches using windowing capabilities.

S3 is an object storage service with high scalability, data availability, security, and performance. You can run big data analytics across your S3 objects with AWS query-in-place services like Athena.

Athena is a serverless query service that makes it easy to query and analyze data in S3. With Athena and S3 as a data source, you define the schema and start querying using standard SQL. There’s no need for complex ETL jobs to prepare your data for analysis, which makes it easy for anyone familiar with SQL skills to analyze large-scale datasets quickly.

The following diagram shows a more detailed view of the data pipeline:

  1. RDS for Oracle runs in a Single-AZ.
  2. GoldenGate runs on an Amazon EC2 instance.
  3. The MSK cluster spans across three Availability Zones.
  4. Kafka topic is set up in MSK.
  5. Flink runs on an EMR Cluster.
  6. Producer Security Group for Oracle DB and GoldenGate instance.
  7. Consumer Security Group for EMR with Flink.
  8. Gateway endpoint for S3 private access.
  9. NAT Gateway to download software components on GoldenGate instance.
  10. S3 bucket and Athena.

For simplicity, this setup uses a single VPC with multiple subnets to deploy resources.

Figure 2

Configuring single-click deployment using AWS CloudFormation

The AWS CloudFormation template included in this post automates the deployment of the end-to-end solution that this blog post describes. The template provisions all required resources including RDS for Oracle, MSK, EMR, S3 bucket, and also adds an EMR step with a JAR file to consume messages from Kafka topic on MSK. Here’s the list of steps to launch the template and test the solution:

  1. Launch the AWS CloudFormation template in the us-east-1 Region.
  2. After successful stack creation, obtain the GoldenGate hub server public IP from the Outputs tab of the CloudFormation stack.
  3. Log in to the GoldenGate hub server using the IP address from step 2 as ec2-user, and then switch to the oracle user: sudo su - oracle
  4. Connect to the source RDS for Oracle database using the sqlplus client and provide the password (source): [oracle@ip-10-0-1-170 ~]$ sqlplus source@prod
  5. Generate database transactions using SQL statements available in oracle user’s home directory.
    SQL> @s
    
     SQL> @s1
    
     SQL> @s2
  6. Query the STOCK_TRADES table from the Amazon Athena console. After committing transactions on the source database, it takes a few seconds for the changes to become available to Athena for querying.

Manually deploying components

The following steps describe the configurations required to stream Oracle-changed data to MSK and sink it to an S3 bucket using Flink running on EMR. You can then query the S3 bucket using Athena. If you deployed the solution using AWS CloudFormation as described in the previous step, skip to the Testing the solution section.

  1. Prepare an RDS source database for CDC using GoldenGate. The RDS source database version is Enterprise Edition 12.1.0.2.14. For instructions on configuring the RDS database, see Using Oracle GoldenGate with Amazon RDS. This post does not consider capturing data definition language (DDL).
  2. Configure an EC2 instance for the GoldenGate hub server. Configure the GoldenGate hub server using the Oracle Linux server 7.6 (ami-b9c38ad3) image in the us-east-1 Region. The GoldenGate hub server runs the GoldenGate extract process that extracts changes in real time from the database transaction log files. The server also runs a replicat process that publishes database changes to MSK. The GoldenGate hub server requires the following software components:
  • Java JDK 1.8.0 (required for GoldenGate big data adapter).
  • GoldenGate for Oracle (12.3.0.1.4) and GoldenGate for big data adapter (12.3.0.1).
  • Kafka 1.1.1 binaries (required for GoldenGate big data adapter classpath).
  • An IAM role attached to the GoldenGate hub server to allow access to the MSK cluster for GoldenGate processes running on the hub server. Use the GoldenGate (12.3.0) documentation to install and configure the GoldenGate for Oracle database. The GoldenGate Integrated Extract parameter file is eora2msk.prm.
    EXTRACT eora2msk
    SETENV (NLSLANG=AL32UTF8)
    
    USERID ggadmin@ORCL, password ggadmin
    TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 256)
    EXTTRAIL /u01/app/oracle/product/ogg/dirdat/or
    LOGALLSUPCOLS
    
    TABLE SOURCE.STOCK_TRADES;

    The logallsupcols extract parameter ensures that a full database table row is generated for every DML operation on the source, including updates and deletes.

  1. Create a Kafka cluster using MSK and configure a Kafka topic. You can create the MSK cluster from the AWS Management Console, using the AWS CLI, or through an AWS CloudFormation template.
  • Use the list-clusters command to obtain a ClusterArn and a Zookeeper connection string after creating the cluster. You need this information to configure the GoldenGate big data adapter and Flink consumer. The following code illustrates the commands to run:
    $aws kafka list-clusters --region us-east-1
    {
        "ClusterInfoList": [
            {
                "EncryptionInfo": {
                    "EncryptionAtRest": {
                        "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:xxxxxxxxxxxx:key/717d53d8-9d08-4bbb-832e-de97fadcaf00"
                    }
                }, 
                "BrokerNodeGroupInfo": {
                    "BrokerAZDistribution": "DEFAULT", 
                    "ClientSubnets": [
                        "subnet-098210ac85a046999", 
                        "subnet-0c4b5ee5ff5ef70f2", 
                        "subnet-076c99d28d4ee87b4"
                    ], 
                    "StorageInfo": {
                        "EbsStorageInfo": {
                            "VolumeSize": 1000
                        }
                    }, 
                    "InstanceType": "kafka.m5.large"
                }, 
                "ClusterName": "mskcluster", 
                "CurrentBrokerSoftwareInfo": {
                    "KafkaVersion": "1.1.1"
                }, 
                "CreationTime": "2019-01-24T04:41:56.493Z", 
                "NumberOfBrokerNodes": 3, 
                "ZookeeperConnectString": "10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181", 
                "State": "ACTIVE", 
                "CurrentVersion": "K13V1IB3VIYZZH", 
                "ClusterArn": "arn:aws:kafka:us-east-1:xxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3", 
                "EnhancedMonitoring": "DEFAULT"
            }
        ]
    }
  • Obtain the IP addresses of the Kafka broker nodes by using the ClusterArn.
    $aws kafka get-bootstrap-brokers --region us-east-1 --cluster-arn arn:aws:kafka:us-east-1:xxxxxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3
    {
        "BootstrapBrokerString": "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092"
    }
  • Create a Kafka topic. The solution in this post uses the same name as table name for Kafka topic.
    ./kafka-topics.sh --create --zookeeper 10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181 --replication-factor 3 --partitions 1 --topic STOCK_TRADES
  1. Provision an EMR cluster with Flink. Create an EMR cluster 5.25 with Flink 1.8.0 (advanced option of the EMR cluster), and enable SSH access to the master node. Create and attach a role to the EMR master node so that Flink consumers can access the Kafka topic in the MSK cluster.
  2. Configure the Oracle GoldenGate big data adapter for Kafka on the GoldenGate hub server. Download and install the Oracle GoldenGate big data adapter (12.3.0.1.0) using the Oracle GoldenGate download link. For more information, see the Oracle GoldenGate 12c (12.3.0.1) installation documentation. The following is the GoldenGate producer property file for Kafka (custom_kafka_producer.properties):
    #Bootstrap broker string obtained from Step 3
    bootstrap.servers= 10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092
    #bootstrap.servers=localhost:9092
    acks=1
    reconnect.backoff.ms=1000
    value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    # 100KB per partition
    batch.size=16384
    linger.ms=0

    The following is the GoldenGate properties file for Kafka (Kafka.props):

    gg.handlerlist = kafkahandler
    gg.handler.kafkahandler.type=kafka
    gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
    #The following resolves the topic name using the short table name
    #gg.handler.kafkahandler.topicName=SOURCE
    gg.handler.kafkahandler.topicMappingTemplate=${tableName}
    #The following selects the message key using the concatenated primary keys
    gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}
    gg.handler.kafkahandler.format=json_row
    #gg.handler.kafkahandler.format=delimitedtext
    #gg.handler.kafkahandler.SchemaTopicName=mySchemaTopic
    #gg.handler.kafkahandler.SchemaTopicName=oratopic
    gg.handler.kafkahandler.BlockingSend =false
    gg.handler.kafkahandler.includeTokens=false
    gg.handler.kafkahandler.mode=op
    goldengate.userexit.writers=javawriter
    javawriter.stats.display=TRUE
    javawriter.stats.full=TRUE
    
    gg.log=log4j
    #gg.log.level=INFO
    gg.log.level=DEBUG
    gg.report.time=30sec
    gg.classpath=dirprm/:/home/oracle/kafka/kafka_2.11-1.1.1/libs/*
    
    javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

    The following is the GoldenGate replicat parameter file (rkafka.prm):

    REPLICAT rkafka
    -- Trail file for this example is located in "AdapterExamples/trail" directory
    -- Command to add REPLICAT
    -- add replicat rkafka, exttrail AdapterExamples/trail/tr
    TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
    REPORTCOUNT EVERY 1 MINUTES, RATE
    GROUPTRANSOPS 10000
    MAP SOURCE.STOCK_TRADES, TARGET SOURCE.STOCK_TRADES;
  3. Create an S3 bucket and directory with a table name underneath for Flink to store (sink) Oracle CDC data.
  4. Configure a Flink consumer to read from the Kafka topic and write the CDC data to an S3 bucket. For instructions on setting up a Flink project using the Maven archetype, see Flink Project Build Setup. The following code example is the pom.xml file, used with the Maven project. For more information, see Getting Started with Maven.
    
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>org.apache.flink</groupId>
      <artifactId>flink-quickstart-java</artifactId>
      <version>1.8.0</version>
      <packaging>jar</packaging>

      <name>flink-quickstart-java</name>
      <url>http://www.example.com</url>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <slf4j.version>@slf4j.version@</slf4j.version>
        <log4j.version>@log4j.version@</log4j.version>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>

      <dependencies>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-java</artifactId>
          <version>1.8.0</version>
          <scope>compile</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-hadoop-compatibility_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-connector-filesystem_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-java_2.11</artifactId>
          <version>1.8.0</version>
          <scope>compile</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-s3-fs-presto</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-connector-kafka_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-clients_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-actor_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-protobuf_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
      </dependencies>

      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>shade</goal>
                </goals>
                <configuration>
                  <artifactSet>
                    <excludes>
                      <exclude>org.apache.flink:*</exclude>
                    </excludes>
                  </artifactSet>
                  <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                      <mainClass>flinkconsumer.flinkconsumer</mainClass>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                      <resource>reference.conf</resource>
                    </transformer>
                  </transformers>
                  <relocations>
                    <relocation>
                      <pattern>org.codehaus.plexus.util</pattern>
                      <shadedPattern>org.shaded.plexus.util</shadedPattern>
                      <excludes>
                        <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                        <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                      </excludes>
                    </relocation>
                  </relocations>
                  <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.5</version>
            <configuration>
              <archive>
                <manifest>
                  <mainClass>flinkconsumer.flinkconsumer</mainClass>
                </manifest>
              </archive>
            </configuration>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
        </plugins>
      </build>

      <profiles>
        <profile>
          <id>build-jar</id>
          <activation>
            <activeByDefault>false</activeByDefault>
          </activation>
        </profile>
      </profiles>
    </project>

    Compile the following Java program using mvn clean install and generate the JAR file:

    package flinkconsumer;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.slf4j.LoggerFactory;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import akka.actor.ActorSystem;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import com.typesafe.config.Config;
    import org.apache.flink.streaming.connectors.fs.*;
    import org.apache.flink.streaming.api.datastream.*;
    import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
    import java.util.stream.Collectors;
    import java.util.Arrays;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import java.io.*;
    import java.net.BindException;
    import java.util.*;
    import java.util.Map.*;
    import java.util.Arrays;
    
    public class flinkconsumer{
    
        public static void main(String[] args) throws Exception {
            // create Streaming execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setBufferTimeout(1000);
            env.enableCheckpointing(5000);
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092");
            properties.setProperty("group.id", "flink");
            properties.setProperty("client.id", "demo1");
    
        DataStream<String> message = env.addSource(new FlinkKafkaConsumer<>("STOCK_TRADES", new SimpleStringSchema(), properties));
            env.enableCheckpointing(60_00);
            env.setStateBackend(new FsStateBackend("hdfs://ip-10-0-3-12.ec2.internal:8020/flink/checkpoints"));
    
        RollingSink<String> sink = new RollingSink<>("s3://flink-stream-demo/STOCK_TRADES");
           // sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HHmm"));
           // The bucket part file size in bytes.
               sink.setBatchSize(400);
          message.map(new MapFunction<String, String>() {
                private static final long serialVersionUID = -6867736771747690202L;
                @Override
                public String map(String value) throws Exception {
                    //return " Value: " + value;
                    return value;
                }
            }).addSink(sink).setParallelism(1);
            env.execute();
        }
    }

    Log in as a Hadoop user to an EMR master node, start Flink, and execute the JAR file:

    $ /usr/bin/flink run ./flink-quickstart-java-1.7.0.jar

  5. Create the stock_trades table from the Athena console. Each JSON document must be on a new line.
    CREATE EXTERNAL TABLE `stock_trades`(
      `trade_id` string COMMENT 'from deserializer', 
      `ticker_symbol` string COMMENT 'from deserializer', 
      `units` int COMMENT 'from deserializer', 
      `unit_price` float COMMENT 'from deserializer', 
      `trade_date` timestamp COMMENT 'from deserializer', 
      `op_type` string COMMENT 'from deserializer')
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
      's3://flink-cdc-demo/STOCK_TRADES'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'transient_lastDdlTime'='1561051196')

    For more information, see Hive JSON SerDe.

Testing the solution

To test that the solution works, complete the following steps:

  1. Log in to the source RDS instance from the GoldenGate hub server and perform insert, update, and delete operations on the stock_trades table:
    $sqlplus source@prod
    SQL> insert into stock_trades values(6,'NEW',29,75,sysdate);
    SQL> update stock_trades set units=999 where trade_id=6;
    SQL> insert into stock_trades values(7,'TEST',30,80,SYSDATE);
    SQL>insert into stock_trades values (8,'XYZC', 20, 1800,sysdate);
    SQL> commit;
  2. Monitor the GoldenGate capture from the source database using the following stats command:
    [oracle@ip-10-0-1-170 12.3.0]$ pwd
    /u02/app/oracle/product/ogg/12.3.0
    [oracle@ip-10-0-1-170 12.3.0]$ ./ggsci
    
    Oracle GoldenGate Command Interpreter for Oracle
    Version 12.3.0.1.4 OGGCORE_12.3.0.1.0_PLATFORMS_180415.0359_FBO
    Linux, x64, 64bit (optimized), Oracle 12c on Apr 16 2018 00:53:30
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats eora2msk
  3. Monitor the GoldenGate replicat to a Kafka topic with the following:
    [oracle@ip-10-0-1-170 12.3.0]$ pwd
    /u03/app/oracle/product/ogg/bdata/12.3.0
    [oracle@ip-10-0-1-170 12.3.0]$ ./ggsci
    
    Oracle GoldenGate for Big Data
    Version 12.3.2.1.1 (Build 005)
    
    Oracle GoldenGate Command Interpreter
    Version 12.3.0.1.2 OGGCORE_OGGADP.12.3.0.1.2_PLATFORMS_180712.2305
    Linux, x64, 64bit (optimized), Generic on Jul 13 2018 00:46:09
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats rkafka
  4. Query the stock_trades table from the Athena console, for example with a query like the one shown below.
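A hedged example of such a query, using the columns from the table definition above:

SELECT trade_id, ticker_symbol, units, unit_price, op_type
FROM stock_trades
ORDER BY trade_date DESC
LIMIT 10;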

Summary

This post illustrates how you can offload reporting activity to Athena with S3 to reduce reporting costs and improve OLTP performance on the source database. This post serves as a guide for setting up a solution in the staging environment.

Deploying this solution in a production environment may require additional considerations, for example, high availability of GoldenGate hub servers, different file encoding formats for optimal query performance, and security considerations. Additionally, you can achieve similar outcomes using technologies like AWS Database Migration Service instead of GoldenGate for database CDC and Kafka Connect for the S3 sink.


About the Authors

Sreekanth Krishnavajjala is a solutions architect at Amazon Web Services.

Vinod Kataria is a senior partner solutions architect at Amazon Web Services.

A Technical Introduction to MemSQL


Feed: MemSQL Blog.
Author: John Sherwood.

John Sherwood, a senior engineer at MemSQL on the query optimizer team, presented at MemSQL’s Engineering Open House in our Seattle offices last month. He gave a technical introduction to the MemSQL database, including its support for in-memory rowstore tables and disk-backed columnstore tables, its SQL support and MySQL wire protocol compatibility, and how aggregator and leaf nodes interact to store data and answer queries simultaneously, scalably, and with low latencies. He also went into detail about code generation for queries and query execution. Following is a lightly edited transcript of John’s talk. – Ed.

This is a brief technical backgrounder on MemSQL, our features, our architecture, and so on. MemSQL: we exist. Very important first point. We have about 50 engineers scattered across our San Francisco and Seattle offices for the most part, but also a various set of offices across the rest of the country and the world.

With any company, and especially a database company, there is the question of why do we specifically exist? There’s absolutely no shortage of database products out there, as probably many of you could attest from your own companies.

Technical Introduction to MemSQL database 1

Scale-out is of course a bare minimum these days, but the primary feature of MemSQL has traditionally been the in-memory rowstore which allows us to circumvent many of the issues that arise with disk-based databases. Along the way, we’ve added columnstore, with several of its own unique features, and of course you’re presented all this functionality through a MySQL wire protocol-compatible interface.

Technical Introduction to MemSQL database 2

The rowstore requires that all the data can fit in main memory. By completely avoiding disk IO, we were able to make use of a variety of techniques to speed up the execution, with minimal principal latencies. The columnstore is able to leverage coding techniques that – with code generation and modern hardware – allow for incredibly fast scans.

The general market we find ourselves in is: companies who have large, shifting datasets, who are looking for very fast answers, ideally with minimal changes in latency, as well as those who have large historical data sets, who want very quick, efficient queries.

So, from 20,000 feet as mentioned, we scale out as well as up. At the very highest level, our cluster is made up of two kinds of nodes, leaves and aggregators. Leaves actually store data, while aggregators coordinate the data manipulation language (DML). There's a single aggregator which we call the master aggregator – actually, in our codebase, we call it the Supreme Leader – which is actually responsible for coordinating the data definition language (DDL) and is the closest thing we have to a Hadoop-style namenode, et cetera, that actually runs our cluster.

Technical Introduction to MemSQL database 3

As mentioned, the interface at MemSQL is MySQL compatible with extensions and our basic idiom remains the same: database, tables, rows. The most immediate nuance is that our underlying system will automatically break a logical database into multiple physical partitions, each of which is visible on the actual leaf. While we are provisionally willing to shard data without regard to what the user gives us, we much prefer it if you actually use a shard key which allows us to set up convenient joins, et cetera, for actual exploration of data.
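As an illustration (the table and columns here are hypothetical, not from the talk), a shard key is declared as part of CREATE TABLE:

CREATE TABLE events (
    user_id    BIGINT NOT NULL,
    event_time DATETIME,
    payload    JSON,
    SHARD KEY (user_id)
);

Rows with the same user_id land on the same partition, which is what makes joins on user_id cheap to set up across the cluster.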

The aggregator then is responsible for formulating query plans, bridging out across leaves as necessary to service the DML. Of particular note is that the engine that we use is able to have leaves perform computations with the same full amount of functionality that the aggregator itself can perform, which allows us to perform many worthwhile optimizations across the cluster.

Technical Introduction to MemSQL database 4

A quick, more visual example will better show what I'm talking about. Here we have an example cluster. We have a single master aggregator and three leaf nodes. A user has given us the very imaginatively named database "db" which we're supposed to create. Immediately the aggregator's job is to stripe this into multiple sub-databases, here shown as db_0 through db_2. In practice, we find that a database per physical core on the host works best; it allows parallelization and so on, but drawing out 48 of these boxes per host would probably be a little bit much.

Technical Introduction to MemSQL database 5

So beyond just creating the database, as mentioned, we have a job as a database to persist data, and running on a single host does not get you very far in the modern world. And so, we have replication. We do this by database partition, replicating data from each leaf to a chosen slave.
So as you can see here, we've created a cluster such that there is no single point of failure. If a node goes down, such as this leaf mastering db_2, the other leaf that currently masters db_0 will be promoted, step up, and start serving data.

Technical Introduction to MemSQL database 6

I’d also note that while I’m kind of hand waving a lot of things, all this does take place under a very heavy, two phase commit sort of thing. Such that we do handle failures properly, but for hopefully obvious reasons, I’m not going to go there.

So in a very basic example, let’s say a user is actually querying this cluster. As mentioned, they talked to the master aggregator that’s shown as the logical database, db as mentioned, which they treat as just any other data. The master aggregator in this case is going to have to fan out across all the leaves, query them individually and merge the results.

One thing that I will note here, is that I mentioned that we can actually perform computations on the leaves, in a way that allows us not to do so on the master. Here we have an order-by clause, which we actually push down to each leaf. Perhaps there was actually an index on A that we take advantage of.

Technical Introduction to MemSQL database 7

Here the master aggregator will simply combine, merge, and stream the results back. We can easily imagine that even for this trivial example, if each leaf is using its full storage for this table, the master aggregator (on homogeneous hardware at least) will not be able to do a full quicksort, or whatever algorithm you want to use, and actually sort all the data without spooling. So even this trivial example shows how our distributed architecture allows faster speeds.

Before I move on, here’s an example of inserts. Here, as with point lookups and so on in the DML, we’re able to determine the exact leaf that owns this row from its shard key.

Technical Introduction to MemSQL database 8

So here we talk to a single leaf, which then transparently, without the master aggregator necessarily knowing about it, replicates that down to db_1’s slave on the other host, giving us durability, replication, all that good stuff.
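A sketch of such an insert, again with assumed names:

INSERT INTO t (a, b) VALUES (42, 'some payload');

The aggregator hashes the shard key value (42), determines which partition owns it, say db_1, and hands the row straight to the leaf mastering that partition; that leaf’s replication stream then carries it to db_1’s slave.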

Again, as a database, we are actually persisting all of the data that has been entrusted to us. We draw a slight nuance between durability, meaning the actual persistence on a single host, and replication across multiple hosts.

Like many databases, the strategy that we use for this is a streaming write-ahead-log which allows us to rephrase the problem from, “How do I stream transactions across the cluster?” to simply, “How do I actually replicate pages in an ordered log across multiple hosts?” As mentioned, this works at the database level, which means that there’s no actual concept of a schema, of the actual transactions themselves, or the row data. All that happens is that this storage layer is responsible for replicating these pages, the contents of which it is entirely agnostic to.

Technical Introduction to MemSQL database 9

The other large feature of MemSQL is its code generation. Essentially, the classic way for a database to work is by injecting what we would call, in the C++ world, virtual functions. The idea is that in the common case you might have an operator comparing a field of a row to a constant value.

Technical Introduction to MemSQL database 10

In a normal database you might inject an operator class that holds that constant value, do a virtual function lookup to actually check it, and go on with our lives. The nuance here is that this is suboptimal in a couple of ways. The first is that if we’re going through a function pointer, a function call, we’re not inlined. The second is simply that in making that call, we’re having to dynamically look it up. Code generation, on the other hand, allows us to make those decisions beforehand, well before anything actually executes. This lets us make the basic optimizations any engine would make around the common case, but it also allows us to do very complex things outside of queries in a kind of precognitive way.

An impressive thing for most people when they look through our code base is just the amount of metadata we collect. We have huge amounts of data on the various columns, on the tables and databases, and everything else. If at runtime we were to attempt to read all of this, look at it, and make decisions on it, we would be hopelessly slow. Instead, by using code generation, we’re able to make all the decisions up front, efficiently generate code, and go on with our lives without the runtime cost. A huge lever for us is the fact that we use an LLVM toolchain under the hood, such that by generating LLVM IR – intermediate representation – we can take advantage of the entire toolchain they’ve built up; in fact, the same toolchain that we all love – or would love if we actually used it for our main code base – in our day-to-day lives. We get all of those advantages: function inlining, loop unrolling, vectorization, and so on.

And so between those two features we have the ability to build a next generation, amazing, streaming database.

Jonathan Katz: Just Upgrade: How PostgreSQL 12 Can Improve Your Performance


Feed: Planet PostgreSQL.

PostgreSQL 12, the latest version of the “world’s most advanced open source relational database,” is being released in the next few weeks, barring any setbacks. This follows the project’s cadence of providing a raft of new database features once a year, which is quite frankly, amazing and one of the reasons why I wanted to be involved in the PostgreSQL community.

In my opinion, and this is a departure from previous years, PostgreSQL 12 does not contain one or two single features that everyone can point to and say that “this is the ‘FEATURE’ release,” (partitioning and query parallelism are recent examples that spring to mind). I’ve half-joked that the theme of this release should be “PostgreSQL 12: Now More Stable” — which of course is not a bad thing when you are managing mission critical data for your business.

And yet, I believe this release is a lot more than that: many of the features and enhancements in PostgreSQL 12 will just make your applications run better without doing any work other than upgrading!

(…and maybe rebuild your indexes, which, thanks to this release, is not as painful as it used to be)!

It can be quite nice to upgrade PostgreSQL and see noticeable improvements without having to do anything other than the upgrade itself. A few years back when I was analyzing an upgrade of PostgreSQL 9.4 to PostgreSQL 10, I measured that my underlying application was performing much more quickly: it took advantage of the query parallelism improvements introduced in PostgreSQL 10. Getting these improvements took almost no effort on my part (in this case, I set the max_parallel_workers config parameter).

Having applications work better by simply upgrading is a delightful experience for users, and it’s important that we keep our existing users happy as more and more people adopt PostgreSQL.

So, how can PostgreSQL 12 make your applications better just by upgrading? Read on!

Major Improvements to Indexing

Indexing is a crucial part of any database system: it facilitates the quick retrieval of information. The fundamental indexing system PostgreSQL uses is called a B-tree, which is a type of index that is optimized for storage systems.

It’s very easy to take for granted the statement CREATE INDEX ON some_table (some_column); as PostgreSQL does a lot of work to keep the index up-to-date as the values it stores are continuously inserted, updated, and deleted. Typically, it just seems to work.

However, a problem with PostgreSQL indexes is that they can bloat and take up extra space on disk, which can also lead to performance penalties when both retrieving and updating data. In this case by “bloat,” I mean inefficiencies in how the index structure is maintained, which may or may not be related to garbage tuples that are removed by VACUUM (and a hat tip to Peter Geoghegan for this fact). Index bloat can be very noticeable on workloads where an index is modified heavily.

PostgreSQL 12 makes significant improvements to how B-tree indexes work, and experiments using TPC-C-like tests showed a 40% reduction in space utilization on average. This not only reduces the amount of time spent maintaining B-tree indexes (i.e. writes), but also provides benefits to how quickly data can be retrieved, given the indexes are overall a lot smaller.

Applications that make heavy updates to their tables, typically in the OLTP family (“online transaction processing“) should see noticeable improvements to their disk utilization as well as query performance. And less disk utilization means your database has more room to grow before you need to upgrade your infrastructure.

Based on your upgrade strategy, you may need to rebuild your B-tree indexes to take advantage of these improvements (for example, pg_upgrade will not automatically rebuild your indexes). In prior versions of PostgreSQL, if you had large indexes on your tables, this could lead to a significant downtime event, as an index rebuild would block any modifications to a table. But here is another place where PostgreSQL 12 shines: you can now rebuild your indexes concurrently with the REINDEX CONCURRENTLY command, and avoid that potential downtime!
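For example, taking the kind of index created earlier (a minimal sketch; substitute your own index or table name):

-- rebuild a single index without blocking writes to the table
REINDEX INDEX CONCURRENTLY some_table_some_column_idx;

-- or rebuild every index on a table in one go
REINDEX TABLE CONCURRENTLY some_table;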

There are also other parts of PostgreSQL’s indexing infrastructure that received improvements in PostgreSQL 12. One of the things that falls into the “just works” category involves the write-ahead log, aka WAL. The write-ahead log serves an important function, as it records every transaction that occurs in PostgreSQL; it is fundamental to features such as crash safety and replication, and is used by applications for archival and point-in-time-recovery. The write-ahead log also means that additional information needs to be written to disk, which can have performance ramifications.

PostgreSQL 12 reduces the overhead of the WAL records generated by GiST, GIN, and SP-GiST indexes while an index is being built. This has multiple noticeable benefits, including less space on disk required for these WAL records and faster replays of this data, such as during crash recovery or point-in-time-recovery. If you use any of these types of indexes in your applications (for instance, geospatial applications powered by PostGIS make heavy use of the GiST index type), this is yet another feature that will make a noticeable impact without you having to lift a finger.
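If you are unsure whether this applies to you, these are the kinds of index definitions that benefit (hypothetical tables, shown only for illustration; the GiST example assumes PostGIS is installed):

-- GiST index on a PostGIS geometry column
CREATE INDEX ON places USING gist (geom);

-- GIN index for jsonb containment queries
CREATE INDEX ON events USING gin (payload jsonb_path_ops);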

Partitioning is Bigger, Better, Faster

PostgreSQL 10 introduced declarative partitioning. PostgreSQL 11 made it much easier to use. PostgreSQL 12 lets you really scale your partitions.

PostgreSQL 12 received significant performance improvements to the partitioning system, notably around how it can process tables that have thousands of partitions. For example, a query that only affects a few partitions on a table with thousands of them will perform significantly faster. In addition to seeing performance improvements on those types of queries, you should also see an improvement in INSERT speed on tables with many partitions as well.
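As a reminder of what declarative partitioning looks like, here is a minimal time-series-style sketch (names are illustrative); PostgreSQL 12’s improved partition pruning is what keeps a query that touches only one or two of these partitions fast, even when there are thousands of them:

CREATE TABLE measurements (
    recorded_at timestamptz NOT NULL,
    device_id   int NOT NULL,
    reading     double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE measurements_2019_09 PARTITION OF measurements
    FOR VALUES FROM ('2019-09-01') TO ('2019-10-01');
CREATE TABLE measurements_2019_10 PARTITION OF measurements
    FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');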

Writing data with COPY, which is a great way to bulk load data (here’s an example of JSON ingestion) to partitioned tables, also received a boost in PostgreSQL 12. Using COPY was already fast; PostgreSQL 12 has made it noticeably faster.

All of the above makes it possible to store even larger data sets in PostgreSQL while making it easier to retrieve the data and, even better, it should just work. Applications that tend to have a lot of partitions, e.g. ones that record time series data, should see noticeable performance improvements just with an upgrade.

And while it may not broadly fall under the “make better just by upgrading” category, PostgreSQL 12 allows you to create foreign keys that reference partitioned tables, eliminating a “gotcha” that you may have experienced with partitioning.
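A minimal sketch of that previously impossible arrangement (assumed table names; note the unique key on the partitioned parent must include the partition column):

CREATE TABLE devices (
    id   bigint PRIMARY KEY,
    name text
) PARTITION BY HASH (id);
CREATE TABLE devices_p0 PARTITION OF devices FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE devices_p1 PARTITION OF devices FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- new in PostgreSQL 12: the referenced table may itself be partitioned
CREATE TABLE readings (
    device_id bigint REFERENCES devices (id),
    reading   double precision
);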

WITH Queries Get a Big Boost

When the inlined common table expression patch was committed (aka CTEs, aka WITH queries) I could not wait to write an article on how big a deal this was for PostgreSQL application developers. This is one of those features where you can see your applications get faster, well, if you make use of CTEs.

I’ve often found that developers that are new to SQL like to make use of CTEs: if you write them in a certain way, it can feel like you’re writing an imperative program. I also enjoyed rewriting those queries to not use CTEs and demonstrate a performance gain. Alas, these days are now gone.

PostgreSQL 12 now allows a certain kind of CTE to be inlined: one that has no side effects (a plain SELECT) and is referenced only once later in the query. If I had collected statistics on the CTE queries I used to rewrite, the majority would fall into this group. This will help developers write code that feels more readable and is now performant as well.
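A sketch of the pattern (the orders table here is hypothetical): in PostgreSQL 12 the first form is inlined, so the filter on customer_id can use an index on orders instead of scanning a materialized intermediate result; the old behavior remains available with the MATERIALIZED keyword:

-- inlined automatically in PostgreSQL 12 (side-effect free, referenced once)
WITH recent_orders AS (
    SELECT * FROM orders WHERE ordered_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;

-- force the pre-12 behavior if you relied on the CTE as an optimization fence
WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders WHERE ordered_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;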

What’s better is that PostgreSQL 12 will optimize the execution of this SQL without you having to do any additional work. And while I may no longer have to optimize this type of query pattern, it’s certainly better that PostgreSQL is continuing to improve its query optimizations.

Just-in-Time (JIT) Is Now a Default

For PostgreSQL 12 systems that support LLVM, just-in-time compilation, aka “JIT,” is enabled by default. In addition to providing JIT support for some internal operations, queries that have expressions (e.g. “x + y”, a simple expression) in select lists (i.e. what you write after “SELECT”), use aggregates, or have expressions in WHERE clauses, among others, can utilize JIT for a performance boost.

As JIT is enabled by default in PostgreSQL 12, you can see performance boosts without doing anything, but I would recommend testing out your application on PostgreSQL 11, where JIT was introduced, to measure how your queries perform and see if you need to do any tuning.
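A quick way to check and experiment (big_table and its columns are stand-ins for your own data):

SHOW jit;                          -- 'on' by default in PostgreSQL 12 builds with LLVM
EXPLAIN (ANALYZE, COSTS OFF)
SELECT sum(x + y) FROM big_table;  -- the plan output gains a "JIT:" section once the
                                   -- estimated cost crosses jit_above_cost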

What about the other new features in PostgreSQL 12?

There are a lot of new features in PostgreSQL 12 that I am really excited about, from the ability to introspect JSON data using the standard SQL/JSON path expressions, to a type of multifactor authentication available using the clientcert=verify-full setting, to generated columns, and many more. These are for a different blog post.

Much like my experience going to PostgreSQL 10, I believe PostgreSQL 12 provides a similar ability to improve your overall experience just by upgrading. Of course, your mileage may vary: as I did with my PostgreSQL 10 upgrade, test your application under similar production system conditions first before making the switch. Even if PostgreSQL 12 is “now more stable” as I suggested, you should always extensively test your applications before moving them into production.

Redis Load Handling vs Data Integrity: Tradeoffs in Distributed Data Store Design


Feed: Blog – Hazelcast.
Author: Greg Luck.

Introduction

We all know that selecting the right technology for your business-critical systems is hard. You first have to decide what characteristics are most important to you, and then you need to identify the technologies that fit that profile. The problem is that you typically only get a superficial view of how technologies work, and thus end up making decisions on limited information.

Selecting amongst in-memory technologies – and especially distributed systems – can be especially challenging since many of the system’s attributes may not be easy to uncover. One topic, in particular, is how well a system protects you from data loss. All distributed systems use replication to try to reduce the risk of data loss due to hardware failure, but how the replication performs can vary by systems. The level of data safety you get is determined by the architectural design decisions built into the system. In this blog, we want to reinforce that design differences, including the nearly imperceptible ones, lead to materially different levels of data safety.

Recently, we were benchmarking Redis versus Hazelcast at high throughput to the point of network saturation. We were perplexed because Redis was reporting 50% higher throughput even though both systems were saturating the network using identical payloads. Network saturation should have been the limiting factor and with both systems writing to the primary and replica partitions, throughput should have been identical. 

After much investigation, we learned that as workload grows, at some point Redis almost immediately stops replication and continues to skip replication while workload remained high. It was faster because Redis was writing only to the master, significantly raising the risk of lost data. This came as a big surprise to us and we suspect it will do the same for Redis users.

By default, Redis is sacrificing the safety of data to perform faster under high loads. Once the load finishes (or in this case the benchmark finishes) the replication is restarted and the replica shard re-syncs using the master RDB file, and everything appears to work normally. Hence, if you only verify consistency after the benchmark, the risky behavior will likely go unnoticed.

We expect that most users with busy systems are silently experiencing this scenario, unbeknownst to them, as it takes a master node failure under load to show up as data loss.

This means two critical things:

  1. Redis can look great in a benchmark, but
  2. Redis will lose data if a master shard is lost when the workload is high enough that the “partial sync” stops.

In this blog, we will explore, in detail, the mystery of how Redis loses data under load.

Watch the Movie

Here we show, with first Redis and then Hazelcast, populating a cache under load and then killing the master node. Redis loses data; Hazelcast doesn’t.

The Mystery

Testing Hazelcast and Redis at scale, we found that Redis was reporting 305,000 operations per second whereas Hazelcast reported 240,000. The entry value size was 110KB and both systems were configured for 1 replica.

The peculiar thing was that we were both saturating a 50Gbps network.

The Redis Benchmark Does Not Compute

We started turning over different rocks to see if we could find the root cause of the difference.

Data Compression

Could Redis possibly be using data compression causing smaller network payloads that could lead to better performance overall?

We closely monitored Redis’ communication between members and clients with network capture tools to make sure there was no compression on the payloads exchanged between endpoints. We identified no compression on the communication links, apart from the RDB files, which is a documented feature that we anticipated. This, of course, explains nothing toward our benchmark figures.

Network Bandwidth

Is the available bandwidth on the boxes enough to sustain the load we are generating? 50Gbps translates to roughly 6.25GB/sec, but the actual throughput can be affected by the number of flows between endpoints and/or cloud quotas on them.

Profiling

Last but not least, we needed to profile the Hazelcast members, ideally with Java Flight Recorder, to capture as much data as possible that could help in pinpointing the exact root of our slower benchmark result.

All JFR recordings and `prof` analysis on the members and clients show no significant hotspots that would justify the difference. We could reason about most of the output.

Network Utilisation

While inspecting the network load on the boxes during a test run, we noticed that both benchmarks reported the same network utilization.

In both cases, the outbound link was saturated, almost at 50Gbps at all times. We know for certain that the reported throughput of both systems is different, nevertheless, the reported network utilization is the same. 

Both systems were able to saturate the network, the payloads were the same, so how could throughput be different?

So some more theories?

  • Either Hazelcast is spending more of the bandwidth for other network interactions, that we obviously didn’t consider, or…
  • Redis is doing something less than we believed it was doing. 

A Closer Look at Redis

At Hazelcast we use a tool called Simulator, which is used for performance and soak testing Hazelcast and other systems. The Simulator output, demonstrating the actual throughput of the Redis cluster during the benchmark was:

All system statistics read healthy, from every point-of-view. CPU, I/O, Memory, all well within healthy ranges and similar from host to host. 

However, looking at detailed, per-process CPU utilization, we noticed that the Redis replica shards were utilizing 0% of a core. That means they were not doing any work, which can only mean that they were not getting any data. This is the smoking gun.

Having a closer look at the logs, we notice the following output:

Master logs

3403:M 02 Jul 2019 11:57:15.114 – Accepted 10.0.3.246:56918

3403:M 02 Jul 2019 11:57:15.538 # Client id=549 addr=10.0.3.145:50480 fd=670 name= age=2 idle=2 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=62 oll=2447 omem=268494184 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

3403:M 02 Jul 2019 11:57:15.595 # Connection with replica 10.0.3.145:7016 lost.

Replica logs

3332:S 02 Jul 2019 11:57:13.538 * Connecting to MASTER 10.0.3.17:7000

3332:S 02 Jul 2019 11:57:13.539 * MASTER <-> REPLICA sync started

3332:S 02 Jul 2019 11:57:13.539 * Non blocking connect for SYNC fired the event.

3332:S 02 Jul 2019 11:57:13.539 * Master replied to PING, replication can continue…

3332:S 02 Jul 2019 11:57:13.539 * Partial resynchronization not possible (no cached master)

3332:S 02 Jul 2019 11:57:13.557 * Full resync from master: 41416b4cb4c33baa6a7a32b360cc58e9c767f144:4164722921

3332:S 02 Jul 2019 11:57:38.672 # I/O error reading bulk count from MASTER: Resource temporarily unavailable

At this stage, we start suspecting that there is no replication going on.

Which is confirmed by running the ‘INFO’ command on redis-cli, which reports:

master_link_status: down

Mystery Solved

Redis is turning off copying to the replica and can thus use all of the network bandwidth to write to the master only.

Consequences: Data Loss

No ongoing replication under system load means that the data is not safe in the event of a master node failure. If we don’t actively replicate the entries and the host goes down or a network partition takes place, all entries not written to the replica are lost.

In-Depth Analysis of Redis Logs, Code and Documentation

Investigating the logs and the code, it seems that the reason we hit this issue is a combination of the rate of events alongside the size of the entries. The following log message points to the ‘client-output-buffer-limit’ setting, which by default for replicas is set to 256MB hard limit, or 64MB soft limit:

Client id=549 addr=10.0.3.145:50480 fd=670 name= age=2 idle=2 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=62 oll=2447 omem=268494184 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

According to the documentation:

# The client output buffer limits can be used to force disconnection of clients

# that are not reading data from the server fast enough for some reason (a

# common reason is that a Pub/Sub client can't consume messages as fast as the

# publisher can produce them).

#

# The limit can be set differently for the three different classes of clients:

#

# normal -> normal clients including MONITOR clients

# replica  -> replica clients

# pubsub -> clients subscribed to at least one pubsub channel or pattern

#

# The syntax of every client-output-buffer-limit directive is the following:

#

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>

#

# A client is immediately disconnected once the hard limit is reached, or if

# the soft limit is reached and remains reached for the specified number of

# seconds (continuously).

# So for instance if the hard limit is 32 megabytes and the soft limit is

# 16 megabytes / 10 seconds, the client will get disconnected immediately

# if the size of the output buffers reach 32 megabytes, but will also get

# disconnected if the client reaches 16 megabytes and continuously overcomes

# the limit for 10 seconds.

#

# By default normal clients are not limited because they don't receive data

# without asking (in a push way), but just after a request, so only

# asynchronous clients may create a scenario where data is requested faster

# than it can read.

#

# Instead there is a default limit for pubsub and replica clients, since

# subscribers and replicas receive data in a push fashion.

#

# Both the hard or the soft limit can be disabled by setting them to zero.

client-output-buffer-limit normal 0 0 0

client-output-buffer-limit replica 256mb 64mb 60

client-output-buffer-limit pubsub 32mb 8mb 60

This state of things forces the replication link to go down, and once this happens Redis gets into an endless loop of:

  • Attempting connection with Primary
  • Connection succeeded
  • Attempting partial sync with X offset
  • Partial sync fails due to lack of backlog
  • Attempting FULL sync
  • Full sync fails due to I/O timeout (see. repl-timeout)
  • Connection closes

This happens for the duration of the benchmark. Once load eases out, the full sync is able to complete, successfully bringing the two shards up to date.

The most important part of this finding is that Redis offers no feedback loop. There is no way for a client/producer to be aware of this situation, thus, exposing users to potential data loss, while from a developer’s perspective, everything is behaving. 

The only feedback offered is through careful reading of the logs and the CLI INFO command. With data durability at risk, one would expect that Redis would either back-pressure the producers or inform them by rejecting new writes. However, it appears that replicas are treated similarly to any other form of client.

Forcing Redis to Not Lose Data Under Load

We decided to see if there was a way to configure Redis to not lose data under load. There are a few things in the configuration that seem to help prevent this state of things between primary and replica nodes. 

Client Output Buffer

The setting which seems to be the most relevant to data loss is:

client-output-buffer-limit replica 256mb 64mb 60

According to the documentation, this acts as a protection against slow clients, however, in our case the clients are the replica nodes themselves. Since we pushed so much traffic through Redis, the replicas weren’t able to consume it as fast, creating an accumulation of data in the output buffer of the shard. This limit is adjustable, and ideally, it should prevent the disconnection from happening. 

The default settings limit the buffer to a hard maximum of 256mb (per shard). That means that for 16 primary shards, the buffers can grow up to 4GB on that single host. This is not bad, but if we want to tune this to allow for longer connectivity for replication, then a value of 1GB could, in the worst case, occupy 16GB of memory on that single host. This value only accounts for the single replica connection buffer, not the live data set, nor any other buffers used in the process.

In other words, we need to accommodate such a choice during planning/sizing, but that is an impractically high amount to set for an output buffer.

Replication Backlog Size

Another setting is the ‘repl-backlog-size’ which by default is 1mb. According to the documentation:

# Set the replication backlog size. The backlog is a buffer that accumulates

# replica data when replicas are disconnected for some time, so that when a replica

# wants to reconnect again, often a full resync is not needed, but a partial

# resync is enough, just passing the portion of data the replica missed while

# disconnected.

#

# The bigger the replication backlog, the longer the time the replica can be

# disconnected and later be able to perform a partial resynchronization.

#

# The backlog is only allocated once there is at least a replica connected.

That has nothing to do with the replication link going offline in the first place, but nevertheless, it does tell us that with a well-sized backlog we should be able to survive temporary network downtimes between the nodes. 

This ‘well-sized’ part is of course a nightmare for dev-ops, because it requires planning and an understanding of the application’s storage needs, which, if changed, will need to be readjusted. In our case, we were storing payloads of 110KB, which are on the heavy side of the scale. Choosing a big backlog, in our case an optimized setting of 1GB instead of 1MB, means that we have storage for ~100 requests per shard on that box at the expense of 1GB of memory utilization per shard (on that box).

In other words, it’s also quite expensive in storage and offers a small time-window at this payload size for the partial sync to be used. However, under this kind of load, if you want to avoid the bandwidth cost of a FULL replication, and you want your replicas to be up-to-date sooner, then it should be adjusted.

Getting Feedback – Min Replica

Within the Redis replication configuration, Redis offers a ‘min-replicas-to-write’ setting. It is a way to force Redis to provide some feedback to the caller that things went wrong – i.e. that it was unable to replicate.

From the configuration:

# It is possible for a master to stop accepting writes if there are less than

# N replicas connected, having a lag less or equal than M seconds.

#

# The N replicas need to be in "online" state.

#

# The lag in seconds, that must be <= the specified value, is calculated from

# the last ping received from the replica, that is usually sent every second.

#

# This option does not GUARANTEE that N replicas will accept the write, but

# will limit the window of exposure for lost writes in case not enough replicas

# are available, to the specified number of seconds.

#

# For example to require at least 3 replicas with a lag <= 10 seconds use:

#

# min-replicas-to-write 3

# min-replicas-max-lag 10

#

# Setting one or the other to 0 disables the feature.

#

# By default min-replicas-to-write is set to 0 (feature disabled) and

# min-replicas-max-lag is set to 10.

To explain this better: if during a write a replica is OFFLINE or has not responded to heartbeats for more than X (the lag setting) seconds, the write will be rejected. On the client side you get an appropriate response, which most Java clients (in our case Jedis) surface as an exception.

redis.clients.jedis.exceptions.JedisDataException: NOREPLICAS Not enough good replicas to write.

This doesn’t fix the data loss problem, but it does provide a way to get feedback to the caller that the replicas are behind and data loss is possible. 

Moreover, this is not on by default. By default, the number of required replicas is 0. Meaning you will not get notified. 

Forced Sync with WAIT Operation

A forced sync is another solution, and probably the best way to guarantee the safety of your data. However, it comes with a cost: the cost of issuing one more command (i.e. ‘WAIT’) per operation, or at least periodically. What this does is guarantee that all replicas are up to date when it completes.

This extra command makes your application logic slightly more complicated; you now have to control how often to send this command and in rare and unfortunate timings it could lead to global pause of the application for X seconds while all writes are issuing the same command at the same time. 

Issuing WAIT will dramatically slow down the throughput of Redis. 

While it is part of the Redis command set, it is not implemented in all clients. For Java developers, neither of the two Java clients, Jedis and Lettuce, supports the ‘WAIT’ command, so this is not possible to do from Java.

Throttling

Last but not least, the only solution we were able to use and rely on to achieve replication of our data was to externally throttle the rate of requests. For the benchmark, this means that we set Simulator to latency mode, which uses a pre-configured rate limiter that ensures no more requests than the configured rate are issued in a given period of time.

Interestingly, using this approach we were able to get our Redis test cluster up to 220,000 ops/second, which was 20,000 ops/second less than Hazelcast. Any rate above this triggered a loss of writes to the replica and potential data loss. 

In benchmarks, it’s quite trivial to implement such a limiter because we rely on a static figure for the rate, per client. In real-world applications, we don’t have constant requests equally from all clients. 

That means that we need a distributed rate limiter that distributes a global capacity in slots to all clients, according to the load and needs of each one respectively. Quite a hassle.

Therefore, it is impractical to utilize throttling in an application.

How to Reproduce

Reproducer 1

For this reproducer, all you need is Java and a laptop. It uses a 2 node cluster of Redis and of Hazelcast running locally on the laptop. 

It uses the code shown in the video

Redis reproducer showing data loss: https://github.com/alparslanavci/redis-lost-mydata

The same test running against Hazelcast with no data loss:

https://github.com/alparslanavci/hazelcast-saved-mydata

Reproducer 2

This reproducer uses the detailed Simulator code outlined in the benchmarking exercise that discovered this problem. It is pretty involved to set up for anyone outside of Hazelcast Engineering, but is included here for transparency and completeness.

Hardware Environment

  • Host type: EC2 c5n.9xlarge
  • vCPU: 36
  • RAM: 96GiB
  • Network: 50Gbps
  • Number of member instances: 4
  • Number of client instances: 25

* Notice the number of client instances needed to generate enough load and bring the software to its absolute limits. More on that in the future.

Software Environment
Configuration (Hazelcast / Redis)

  • Clients: 100 (4 on each host)
  • Client type: JAVA / Jedis
  • JVM settings: G1GC
  • Number of threads per client: 32
  • Entry in-memory format: HD (72GB) / N/A
  • Replica count: 1
  • Replication type: Async
  • IO threads: 36 / N/A
  • Shards: N/A / 32 primary per host, 32 replicas per host
  • Nearcache: 20 %

The benchmark was done using the stress test suite by Hazelcast and Simulator (https://github.com/hazelcast/hazelcast-simulator) which provides support for both Hazelcast and Jedis drivers. 

Test Case

  • Number of maps: 1
  • Entry count: 450,000
  • Payload size: 110KB (random bytes)
  • Read/write ratio: 3 / 1 (75% / 25%)
  • Duration: 1 hour

Source code for the test is available here: ByteByteMapTest

How to Run

Starting the benchmark

  • Clone benchmark repo redis-nobackups
  • Install Hazelcast Simulator locally (https://github.com/hazelcast/hazelcast-simulator#installing-simulator)
  • Launch 29 EC2 instances as described under Hardware Environment
    • Tag the instances with the same placement
    • Make sure you use ENA supported AMIs to avail for the high bandwidth links between nodes in the same placement. 
  • CD in the benchmark directory
  • Collect private and public IPs of the EC2 instances, and update the agents.txt or redis_agents.txt file for Hazelcast or Redis, respectively.
  • Run `provisioner --install`
  • SSH to the first 4 instances, and install Redis

For Redis

  • Start 32 shards per instance on the first 4 instances
  • Once the shards are started, form the cluster with 1 replica using the command below:
    • redis-cli --cluster create --cluster-replicas 1
    • Wait until cluster is formed observing message: 
      • [OK] All nodes agree about slots configuration.
      • >>> Check for open slots…
      • >>> Check slots coverage…
      • [OK] All 16384 slots covered.
  • Run `./run-jedis.sh`

For Hazelcast

  • Run `./run-hazelcast.sh`
Showing Data Loss

Wait until the throughput stabilizes to its max, which is usually the best indicator that replication is no longer occurring. 

Verify link status between the primary and replicas:

  • Connect to one of the primary shards using redis-cli and check ‘master_link_status’ via the ‘INFO’ command
  • Once verified, kill the processes from the box (simulate networking issues or dead member). 
    • Run `killall -9 redis` to stop all Redis processes
  • Stop the benchmark
  • Wait until failover is complete
  • Assert number of total insertions (benchmark report) versus number of entries in the cluster
    • A single box can host multiple shard processes and effectively all of them will experience the same issue with replicas. The data loss is not limited to a single shard (process). It’s more likely that a node (physical or virtual) will go offline or crash than a single process crashes.
  • Getting the DBSIZE of all primary shards in the cluster and comparing it to the number of puts from the simulator. In our case, 2,779,336 entries were inserted, but we only measured 2,209,922 at the end of the test.

    The gap between what was inserted and what is measured at the end is heavily affected by the time difference between killing the processes and stopping the benchmark. The longer the benchmark is running after processes are down, the bigger the difference. Also note that if you wait too long to check, Redis will have performed a full sync.

Conclusion

Given Redis’ design decisions around dropping replicas under load, there are two major implications: firstly for Redis users and secondly for those comparing benchmarks between Redis and other systems which do properly replicate data. 

Implications for Redis Users

If Redis users thought that setting up replicas meant they had data redundancy and multiple failures were required to lose data, they were wrong. A single point of failure on the master node is enough to potentially lose data.

Redis can also be configured to read from replicas. Users probably expect that replicas are eventually consistent with low inconsistency windows. Our testing shows that this inconsistency can live on for hours or days if high load continues.

As we showed above, there is currently no practical way, covering all scenarios, to stop Redis from falling behind on replicas. It is possible to carefully configure large client output buffers and replication backlogs for a given application, possibly up to tens of GBs as in our case, but this would require further testing to gain assurance.

Implications for Hazelcast vs Redis Benchmarking

Hazelcast is very fast. We know that, which is why we refused to accept that Redis was outperforming us on an at-scale test. We think it is critical that users understand that benchmarks should be apples to apples and that a configured data safety level be the actual level you are getting.

This goes for Redis benchmarking against other systems and also the headline performance numbers that Redis Labs put out. 

Once we rate-limited Redis and then took it to the maximum without data loss, Hazelcast still beat it… without the risk of data loss due to a failed replication.

Amazon EC2 Partition Placement Groups are Now Available in the AWS GovCloud (US) Regions


Feed: Recent Announcements.

Partition Placement Groups are a new placement group strategy which help reduce the likelihood of correlated failures for large distributed and replicated workloads such as HDFS, HBase and Cassandra running on EC2. Partition placement groups spread EC2 instances across logical partitions and ensure that instances in different partitions do not share the same underlying hardware, thus containing the impact of hardware failure to a single partition. In addition, partition placement groups offer visibility into the partitions and allow topology aware applications to use this information to make intelligent data replication decisions, increasing data availability and durability. 

To learn more about partition placement groups, see the user guide for EC2 Partition Placement Groups. 

Get the most IOPS out of your physical volumes using LVM.


Feed: Planet MySQL.
Author: MyDBOPS.

Hope everyone is aware of LVM (Logical Volume Manager), an extremely useful tool for handling storage at various levels. LVM basically functions by layering abstractions on top of physical storage devices, as illustrated below.

Below is a simple diagrammatic expression of LVM

       sda1  sdb1        (PV:s on partitions or whole disks)
          \   /
           \ /
         Vgmysql         (VG)
          / | \
         /  |  \
      data log tmp       (LV:s)
       |    |    |
      xfs  ext4 xfs      (filesystems)

IOPS is an extremely important resource when it comes to storage: it defines the performance of the disk. Let’s not forget PIOPS (Provisioned IOPS), one of the major selling points of AWS and other cloud vendors for production machines such as databases. Since the disk is the slowest component in the server, we can compare the major components as below.

Consider CPU in speed range of Fighter Jet, RAM in speed range of F1 car and hard Disk in speed range of bullock cart. With modern hardware improvement, IOPS is also seeing significant improvement with SSD’s.

In this blog, we are going to see merging and striping of multiple HDD drives to reap the benefit of the disks’ combined IOPS.

Below is the Disk attached to my server, Each is an 11TB disk with Max supported IOPS of 600.

# lsblk
NAME   MAJ:MIN  RM  SIZE  RO  TYPE  MOUNTPOINT
sda      8:0     0   10G    0   disk
sda1     8:1     0   10G    0   part        
sdb      8:16    0   10.9T  0   disk
sdc      8:32    0   10.9T  0   disk
sdd      8:48    0   10.9T  0   disk
sde      8:64    0   10.9T  0   disk
sdf      8:80    0   10.9T  0   disk
sdg      8:96    0   10.9T  0   disk

sda is the root partition, sd[b-g] is the attached HDD disk,

With mere merging of these disks you only get space management, since the disks are clubbed together in a linear fashion. With striping, our aim is to get 600*6 = 3600 IOPS, or at least a value somewhere around 3.2k to 3.4k.

Now let’s proceed to create the PV (Physical volume)

# pvcreate /dev/sd[b-g]
Physical volume "/dev/sdb" successfully created.
Physical volume "/dev/sdc" successfully created.
Physical volume "/dev/sdd" successfully created.
Physical volume "/dev/sde" successfully created.
Physical volume "/dev/sdf" successfully created.
Physical volume "/dev/sdg" successfully created.

Validating the PV status:

# pvs
PV VG Fmt Attr PSize PFree
/dev/vdb lvm2 --- 10.91t 10.91t
/dev/vdc lvm2 --- 10.91t 10.91t
/dev/vdd lvm2 --- 10.91t 10.91t
/dev/vde lvm2 --- 10.91t 10.91t
/dev/vdf lvm2 --- 10.91t 10.91t
/dev/vdg lvm2 --- 10.91t 10.91t

Let’s proceed to create a volume group (VG) with a physical extent of 1MB, (PE is similar to block size with physical disks) and volume group name as “vgmysql” combining the PV’s

#vgcreate -s 1M vgmysql /dev/vd[b-g] -v
Wiping internal VG cache
Wiping cache of LVM-capable devices
Wiping signatures on new PV /dev/vdb.
Wiping signatures on new PV /dev/vdc.
Wiping signatures on new PV /dev/vdd.
Wiping signatures on new PV /dev/vde.
Wiping signatures on new PV /dev/vdf.
Wiping signatures on new PV /dev/vdg.
Adding physical volume '/dev/vdb' to volume group 'vgmysql'
Adding physical volume '/dev/vdc' to volume group 'vgmysql'
Adding physical volume '/dev/vdd' to volume group 'vgmysql'
Adding physical volume '/dev/vde' to volume group 'vgmysql'
Adding physical volume '/dev/vdf' to volume group 'vgmysql'
Adding physical volume '/dev/vdg' to volume group 'vgmysql'
Archiving volume group "vgmysql" metadata (seqno 0).
Creating volume group backup "/etc/lvm/backup/vgmysql" (seqno 1).
Volume group "vgmysql" successfully created

Will check the volume group status as below with VG display

# vgdisplay -v  
--- Volume group ---
VG Name           vgmysql
System ID
Format            lvm2
Metadata Areas     6
MetadataSequenceNo 1
VG Access          read/write
VG Status          resizable
MAX LV             0
Cur LV             0
Open LV            0
Max PV             0
Cur PV             6
Act PV             6
VG Size            65.48 TiB
PE Size            1.00 MiB
Total PE           68665326
Alloc PE / Size    0 / 0
Free PE / Size     68665326 / 65.48 TiB
VG UUID 51KvHN-ZqgY-LyjH-znpq-Ufy2-AUVH-OqRNrN

Now our volume group is ready. Let’s proceed to create the Logical Volume (LV) with a stripe size of 16K, equivalent to the page size of MySQL (InnoDB), striped across the 6 attached disks.

# lvcreate -L 7T -I 16k -i 6 -n mysqldata vgmysql
Rounding size 7.00 TiB (234881024 extents) up to stripe boundary size 7.00 TiB (234881028 extents).
Logical volume "mysqldata" created.

-L  volume size
-I  stripe size
-i  number of stripes (equal to the number of disks)
-n  LV name
vgmysql – the volume group to use

lvdisplay to provide a complete view of the Logical volume

# lvdisplay -m
--- Logical volume ---
LV Path           /dev/vgmysql/mysqldata
LV Name           mysqldata
VG Name           vgmysql
LV UUID           Y6i7ql-ecfN-7lXz-GzzQ-eNsV-oax3-WVUKn6
LV Write Access   read/write
LV Creation host, time warehouse-db-archival-none, 2019-08-26 15:50:20 +0530
LV Status          available
# open             0
LV Size            7.00 TiB
Current LE         7340034
Segments           1
Allocation         inherit
Read ahead sectors auto
- currently set to 384
Block device       254:0
--- Segments ---
Logical extents 0 to 7340033:
  Type       striped
  Stripes   6
Stripe size 16.00 KiB

Now we will proceed to format with XFS and mount the partition

# mkfs.xfs /dev/mapper/vgmysql-mysqldata

Below are the mount options used

/dev/mapper/vgmysql-mysqldata on /var/lib/mysql type xfs (rw,noatime,nodiratime,attr2,nobarrier,inode64,sunit=32,swidth=192,noquota)

Now let’s proceed with the FIO test to have IO benchmark.

Command:

#fio --randrepeat=1 --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=16k --numjobs=10 --size=512M --runtime=60 --time_based --iodepth=64 --group_reporting

Result:

read : io=1467.8MB, bw=24679KB/s, iops=1542, runt= 60903msec
slat (usec): min=3, max=1362.7K, avg=148.74, stdev=8772.92
clat (msec): min=2, max=6610, avg=233.47, stdev=356.86
lat (msec): min=2, max=6610, avg=233.62, stdev=357.65
write: io=1465.1MB, bw=24634KB/s, iops=1539, runt= 60903msec
slat (usec): min=4, max=1308.1K, avg=162.97, stdev=8196.09
clat (usec): min=551, max=5518.4K, avg=180989.83, stdev=316690.67
lat (usec): min=573, max=5526.4K, avg=181152.80, stdev=317708.30

We get the desired IOPS of ~3.1k with the merged and striped LVM, compared to the normal 600 IOPS of a single disk.

Key Take-aways:

  • Management of storage becomes very easy with LVM
  • Distributed IOPS with striping helps in enhancing disk performance
  • LVM snapshots

Downsides:

Every tool has its own downsides; we should embrace them, considering the use case it serves best, i.e. IOPS in our case. One major downside I can think of is that if any one of the disks fails with this setup, there will be potential data loss/data corruption.

Work Around:

  • To avoid this data loss/data corruption we have set up HA by adding 3 slaves for this setup in production
  • Have a regular backup of the striped LVM with xtrabackup, MEB, or via snapshots
  • RAID 0 also serves the same purpose as striped LVM.

Featured Image by Carl J on Unsplash


Fabien Coelho: Data Loading Performance of Postgres and TimescaleDB


Feed: Planet PostgreSQL.

Postgres is the leading feature-full independent open-source relational database, steadily increasing in popularity for the past 5 years. TimescaleDB is a clever extension to Postgres which implements time-series related features, including under-the-hood automatic partitioning, and more.

Because he knows how I like investigate Postgres (among other things) performance, Simon Riggs (2ndQuadrant) prompted me to look at the performance of loading a lot of data into Postgres and TimescaleDB, so as to understand somehow the degraded performance reported in their TimescaleDB vs Postgres comparison. Simon provided support, including provisioning 2 AWS VMs for a few days each.

Summary

The short summary for the result-oriented enthusiast is that for the virtual hardware (AWS r5.2xl and c5.xl) and software (Pg 11.[23] and 12dev, TsDB 1.2.2 and 1.3.0) investigated, the performance of loading up to 4 billion rows into standard and partitioned tables is great, with Postgres leading as it does not have the overhead of managing dynamic partitions and has a smaller storage footprint to manage. A typical loading speed figure on the c5.xl VM with 5 data per row is over 320 Krows/s for Postgres and 225 Krows/s for TimescaleDB. We are talking about bites of 100 GB ingested per hour.

The longer summary for the performance testing enthusiast is that such an investigation is always much more tricky than it looks. Although you are always measuring something, what it really is is never that obvious, because it depends on what actually limits the performance: the CPU spent on Postgres processes, the disk IO bandwidth or latency… or even the process of generating fake data. Moreover, performance on a VM, with the underlying hardware shared between users, tends to vary, so that it is hard to get definite and stable measures, with significant variation (about 16%) from one run to the next being the norm.

Test Scenario

I basically reused the TimescaleDB scenario where many devices frequently send timestamped data points which are inserted in batches of 10,000 rows into a table with an index on the timestamp.

All programs used for these tests are available on GitHub.

CREATE TABLE conditions(time TIMESTAMPTZ, dev_id INT, data1 FLOAT8, …, dataX FLOAT8);
CREATE INDEX conditions_time_idx ON conditions(time);

I used standard tables and tables partitioned per week or month. Although the initial scenario inserts X=10 data per row, I used X=5 for most tests so as to emphasize index and partitioning overheads.
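For reference, and only as a sketch (the exact scripts used are in the GitHub repository mentioned above), the table variants could be declared roughly like this:

-- TimescaleDB: turn the plain table into a hypertable (automatic time partitioning)
SELECT create_hypertable('conditions', 'time');

-- plain Postgres with declarative monthly partitions instead (column list abbreviated)
CREATE TABLE conditions (time TIMESTAMPTZ, dev_id INT, data1 FLOAT8)
    PARTITION BY RANGE (time);
CREATE TABLE conditions_2019_08 PARTITION OF conditions
    FOR VALUES FROM ('2019-08-01') TO ('2019-09-01');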

For filling the tables, three approaches have been used:

  • a dedicated perl script that outputs a COPY, piped into psql: piping means that data generation and insertion work in parallel, but generation may possibly be too slow to saturate the system.

  • a C program that does the same, although about 3.5 times faster.

  • a threaded load-balanced libpq C program which connects to the database and fills the target with a COPY. Although generation and insertion are serialized in each thread, several connections run in parallel.

Performances

All in all I ran 140 over-a-billion row loadings: 17 in the r5.2xl AWS instance and 123 on the c5.xl instance; 112 runs loaded 1 billion rows, 4 runs loaded 2 billion rows and 24 runs loaded 4 billion rows.

First Tests on a R5.2XL Instance

The first serie of tests used a r5.2xl memory-optimized AWS instance (8 vCPU, 64 GiB) with a 500 GB EBS (Elastic Block Store) gp2 (General Purpose v2) SSD-based volume attached.

The rationale for this choice, which will be proven totally wrong, was that the database loading would be limited by holding the table index in memory, because if it spilled to disk the performance would suffer. I hoped to see the same performance degradation depicted in the TimescaleDB comparison when the index reached the available memory size, and I wanted that not too soon.

The VM ran Ubuntu 18.04 with Postgres 11.2 and 12dev installed from apt.postgresql.org and TimescaleDB 1.2.2 from their ppa. Postgres’ default configuration was tuned thanks to timescaledb-tune, to which I added a checkpoint_timeout:

shared_preload_libraries = 'timescaledb'
shared_buffers = 15906MB
effective_cache_size = 47718MB
maintenance_work_mem = 2047MB
work_mem = 40719kB
timescaledb.max_background_workers = 4
max_worker_processes = 15
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
wal_buffers = 16MB
min_wal_size = 4GB
max_wal_size = 8GB
default_statistics_target = 500
random_page_cost = 1.1
checkpoint_completion_target = 0.9
max_connections = 50
max_locks_per_transaction = 512
effective_io_concurrency = 200
checkpoint_timeout = 1h

Then I started to load 1 to 4 billion rows with fill.pl ... | psql. Although this means that the producer and consumer run on the same host and can thus interfere with one another, I wanted to avoid running on two boxes and having potential network bandwidth issues between them.

For 1 billion rows, the total size is 100 GB (79 GB table and 21 GB index) on Postgres with standard or partitioned (about 11 weeks filled) tables, and 114 GB for TimescaleDB. For 4 billion rows we reach 398 GB (315 GB table + 84 GB index, now larger than memory) for standard Postgres and 457 GB (315 GB table + 142 GB index) for TimescaleDB. TimescaleDB storage requires 15% more space, the addition being used by the index.
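These figures can be checked directly from psql; for a plain (non-partitioned) table, something like the following reports the table, index and total on-disk sizes:

SELECT pg_size_pretty(pg_relation_size('conditions'))       AS table_size,
       pg_size_pretty(pg_indexes_size('conditions'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('conditions')) AS total_size;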

The next image shows the average speed of loading 4 billion rows in 400,000 batches of 10,000 rows on the r5.2xl VM with the psql-piping approach. All Postgres (standard, weekly or monthly partitions) tests load between 228 and 268 Krows/s, let us say an average of 248 Krows/s, while TimescaleDB loads at 183 Krows/s. TimescaleDB loads performance is about 26% below Postgres, which shows no sign of heavily decreasing performance over time.

I could have left it at that, job done, round of applause. However, I like digging. Let us have a look at the detailed loading speed for the first Postgres 12dev standard tables run and for the TimescaleDB run.

In both runs we can see two main modes: one dense high speed mode with pseudo-periodic upward or downward spikes, and a second sparse low speed mode around 65 Krows/s. The average falls between these two modes. In order to get a (hopefully) clearer view, the next figure shows the sorted raw loading speed performance of all the presented runs.

We can clearly see the two main modes: one long high speed flat line encompassing 92 to 99% of each run, and a dwindling low-performance tail for the remaining 8 to 1%, with most measures around 65 Krows/s. For the high speed part, all Postgres runs perform consistently at about 280 Krows/s. The TimescaleDB run performs at 245 Krows/s, a 13% gap: this is about the storage gap, as Postgres has 15% less data to process and store, thus the performance is 18% better on this part. For the low speed part, I think that it is mostly related to index storage (page eviction and checkpoints), which interrupts the normal high speed flow. As the TimescaleDB index is 69% larger, more batches are concerned, which explains the larger low speed mode in the end and the further 10% performance gap. Then you can add some unrelated speed variations (we are on a VM with other processes running and doing IOs), which add +-8% to our measures, and we have a global explanation for the figures.

Now, some depressing news: although the perl script was faster than the loading (I checked that fill.pl > /dev/null runs a little faster than when piped to psql), the margin was small, and you have to take into account how piping works, with processes interrupted and restarted based on the filling and consumption of the intermediate buffer, so that it is possible that I was partly running a data-generation CPU-bound test.

I rewrote the perl script in C and started again on a smaller box, which will give… better performance.

Second Tests on a C5.XL Instance

The second series used a c5.xl CPU-optimized AWS instance (4 vCPU, 8 GiB), with the same kind of volume attached. The rationale for this choice is that I did not encounter any performance issue in the previous test when the index reached the memory size, so I did not really need a memory-enhanced instance in the first place, but I was possibly limited by CPU, so the faster the CPU the better.

Otherwise the installation followed the same procedure as described in the previous section, which resulted in updated versions (pg 11.3 and ts 1.3.0) and these configuration changes to adapt the settings to the much smaller box:

shared_buffers = 1906MB
effective_cache_size = 5718MB
maintenance_work_mem = 976000kB
work_mem = 9760kB
max_worker_processes = 11
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_locks_per_transaction = 64

The next two figures show the average and sorted loading speed for 4 billion data on the c5.xl instance, with Postgres 11.3 and TimescaleDB 1.3.0. Postgres performance leads at 325 Krows/s, then both Postgres weekly and monthly partitioned tables come in around 265 Krows/s, then finally TimescaleDB, which takes about 44% more time than Postgres, at 226 Krows/s.

I implemented a threaded libpq-based generator, initially without load balancing and then with it, which allows several connections to keep the server busy in parallel. The next figure shows the averaged loading performance of the psql-pipe approach compared to two threads, which gave the best overall performance on the 4 vCPU VM.

The upper lines show the loading speed of batches for Postgres vs TimescaleDB. The lower lines show the same with the two-thread loading approach. Although the performance per batch is lower, two batches are running in parallel, hence the better overall performance. The bump at the end of the Postgres parallel run is due to the lack of load balancing in the version used for this run. It is interesting to note that Postgres incurs a size penalty on the index when the load is parallel.

Conclusion

This is the first time I have run such a precise data loading benchmark, trying to replicate the results advertised in the TimescaleDB documentation, which show Postgres loading performance degrading quickly.

I failed to achieve that: both tools perform consistently well, with Postgres v11 and v12 leading the way in raw loading performance, but also without the expected advantages of timeseries optimizations.

I’m used to running benches on bare metal; using a VM was a first. It is harder to interpret results because you do not really know what is going on, which is a pain.

Top 9 Most Common Commands In MongoDB | Architecture of MongoDB


Feed: CronJ.
Author: Ayush Goel.

As discussed in my last blog, there are two types of databases: SQL and NoSQL. So in this blog, we will be looking at one type of NoSQL database, MongoDB.

MongoDB is an example of a document database, which means it stores data in JSON-like documents. One of the biggest reasons for its popularity is its highly flexible data model, which allows data of many different types to be combined and stored together.

MongoDB Architecture

Let’s first discuss the architecture of MongoDB. A typical MongoDB architecture has 3 parts- 

Database: A database can be described as the physical container for data. Each database consists of several files on the file system, and multiple databases can exist on a single MongoDB server.

Collection: A collection is simply a group of database documents. It is the equivalent of a table in SQL. The biggest advantage is that there is no need to define a schema for a collection: within a single collection, different documents can have different fields.

Document: A document is simply a set of key-value pairs. Documents have dynamic schemas, which allows their content to be flexible.

MongoDB Commands

Now, as we have a fair bit of idea about the architecture of MongoDB, let’s move on to the section on how to build the database and query it.

1. Database Creation

  • MongoDB does not have a dedicated command for creating a database (though it does support creating collections explicitly). A database is created automatically when values are saved into one of its collections for the first time. The command mentioned below switches to a database named ‘database’: if it doesn’t exist yet, it will be created on the first write; if it does exist, it is simply selected.
  • Command: use database

2. Dropping Databases

  • The command mentioned below is used to drop a database. This command will work on the database we are currently using.
  • Command: db.dropDatabase()

3. Creating a Collection

  • The command mentioned below is used to create a collection. This is also an optional command as MongoDB creates a collection automatically when data is inserted for the first time.
  • Command: db.createCollection(name, options)

Name: The string type which specifies the name of the collection to be created

Options: A document which specifies options for the collection, such as its maximum size and indexing. It is an optional parameter (see the example below).
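
For instance, a hypothetical capped collection (the name and size values below are made up for illustration) could be created explicitly like this:

db.createCollection("logs", { capped: true, size: 5242880, max: 5000 })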

4. Showing Collections

  • The following command is used to display all collections.
  • Command: db.getCollectionNames()

5. Projection

  • Many times only specific fields of a document are required rather than the whole document.
  • The find() method displays all fields of a document by default. To select specific fields, a projection document listing fields with value 1 or 0 is passed as a second argument: 1 shows a field and 0 hides it, so only the fields marked with 1 are returned (see the example below).
  • Command: db.COLLECTION_NAME.find({},{KEY:1})
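
A small hedged example, assuming a hypothetical users collection with name and email fields:

db.users.find({}, { name: 1, email: 1, _id: 0 })

This returns every document but shows only its name and email fields; note that _id is included by default unless it is explicitly suppressed with 0.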

6. Date Operator

  • This command is used for adding date and time to the database.
  • Command:
    Date() – It returns the current date as a string.
    new Date() – It returns the current date as a Date object.

7. $not Operator

  • $not performs a logical NOT operation on the specified operator expression and selects only those documents that do not match it. This also includes documents that do not contain that particular field. The $and and $or operators work along similar lines (see the example below).
  • Command: { field: { $not: { <operator-expression> } } }
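
For example, against a hypothetical inventory collection:

db.inventory.find({ price: { $not: { $gt: 1.99 } } })

This matches documents whose price is not greater than 1.99, including documents that have no price field at all.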

8. Delete Commands

    • The below-mentioned commands are used for delete operations (see the examples below)-
    • Commands:
      db.collection.remove() – It deletes the documents that match a filter (only a single document when the justOne option is set).
      db.collection.deleteOne() – It deletes only the first matched document, even if the filter selects more than one document.
      db.collection.deleteMany() – It deletes all the documents that match the specified filter.
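
A quick hedged sketch against a hypothetical users collection:

db.users.deleteOne({ status: "inactive" })   // removes only the first matching document
db.users.deleteMany({ status: "inactive" })  // removes every matching document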

9. Where Command

  • To pass either a string containing a JavaScript expression or a full JavaScript function to the query system, the following operator can be used. Similarly, $in can be used to check whether a given field’s value is equal to any of the given values.
  • Command: $where

Differences from Traditional RDBMS

The major differences are-

  • It is a collection-based, document-based and field-based database whereas RDBMS is a table-based, row-based and column-based database.
  • It is a non-relational database and RDBMS by definition is relational.
  • MongoDB gives Javascript Client for querying. No such feature in RDBMS.
  • MongoDB has a dynamic schema whereas RDBMS has a fixed schema.
  • MongoDB is easy to set up as compared to RDBMS.
  • MongoDB is good for hierarchical data storage as compared to RDBMS.
  • MongoDB can be horizontally scaled whereas RDBMS can be vertically scaled.

Why use MongoDB?

Lastly, let’s discuss some key features of MongoDB highlighting why this database should be used –

    • Queries: MongoDB supports both ad-hoc queries and document-based queries.
    • Index Support: MongoDB can support indexing.
    • Replication: MongoDB supports Master-Slave replication making it easier to recover data when downtime occurs.
    • Multiple Servers: MongoDB database can run over multiple servers. Data is duplicated to foolproof the system in the case of hardware failure.
    • Auto-sharding: This process distributes data across multiple physical partitions called shards. Due to this process, MongoDB supports automatic load balancing.
    • MapReduce: It supports MapReduce feature of big-data.
    • Failure Handling: MongoDB allows for easy ways to cope with failures. Huge numbers of replicas are made which give out increased protection and data availability against various kinds of database downtimes.
    • Schema-less Database: It is a schema-less database written in C++.
    • Document-oriented Storage: It uses the BSON format which is a JSON-like format.

David Rowley: PostgreSQL 12: Partitioning is now faster


Feed: Planet PostgreSQL.

Table partitioning has been evolving since the feature was added to PostgreSQL in version 10.  Version 11 saw some vast improvements, as I mentioned in a previous blog post.

During the PostgreSQL 12 development cycle, there was a big focus on scaling partitioning to make it not only perform better, but perform better with a larger number of partitions. Here I’d like to talk about what has been improved.

COPY Performance:

Bulk loading data into a partitioned table using COPY is now able to make use of bulk-inserts.  Previously only one row was inserted at a time.

[Chart: COPY FROM loading performance by number of partitions]

The COPY speed does appear to slow with higher numbers of partitions, but in reality it tails off with fewer rows per partition: in this test, as the partition count grows, the rows per partition shrink. The reason for the slowdown is how the COPY code sets up multi-insert buffers of up to 1000 tuple slots per partition. In the fewer-partitions case, these slots are reused more often, hence performance is better. In reality, this performance tail-off is unlikely to occur, since you’re likely to have more than 12.2k rows per partition.

INSERT Performance:

In PostgreSQL 11 when INSERTing records into a partitioned table, every partition was locked, no matter if it received a new record or not.  With larger numbers of partitions and fewer rows per INSERT, the overhead of this could become significant.

In PostgreSQL 12, we now lock a partition just before the first time it receives a row.  This means if we’re inserting just 1 row, then only 1 partition is locked. This results in much better performance at higher partition counts, especially when inserting just 1 row at a time. This change in the locking behaviour was also teamed up with a complete rewrite of the partition tuple routing code.  This rewrite massively reduces the overhead of setting up the tuple routing data structures during executor startup.
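
As a hedged illustration (hypothetical table and column names, not the schema used for the charts), the difference matters most for statements like the single-row INSERT below, where PostgreSQL 12 locks only the one partition the row is routed to:

CREATE TABLE measurements (
    ts        timestamptz NOT NULL,
    device_id bigint,
    reading   numeric
) PARTITION BY RANGE (ts);

CREATE TABLE measurements_2019_01 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE measurements_2019_02 PARTITION OF measurements
    FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

-- PostgreSQL 12 locks only measurements_2019_01 here;
-- PostgreSQL 11 would lock every partition of measurements.
INSERT INTO measurements VALUES ('2019-01-15 10:00:00+00', 1, 42);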

[Chart: single-row INSERT performance by number of partitions]

You can see that the performance in PostgreSQL 12 is fairly consistent no matter how many partitions the partitioned table has.

SELECT Performance:

Back in PostgreSQL 10, the query planner would check the constraint of each partition one-by-one to see if it could possibly be required for the query.  This meant a per-partition overhead, resulting in planning times increasing with higher numbers of partitions. PostgreSQL 11 improved this by adding “partition pruning”, an algorithm which can much more quickly identify matching partitions.  However, PostgreSQL 11 still did some unnecessary processing and still loaded meta-data for each partition, regardless of if it was pruned or not.

PostgreSQL 12 changes things so this meta-data loading is performed after partition pruning. This results in significant performance improvements in the query planner when many partitions are pruned.

The chart below shows the performance of a SELECT of a single row from a HASH partitioned table partitioned on a BIGINT column, which is also the PRIMARY KEY of the table.  Here partition pruning is able to prune all but the one needed partition.

Once again it is fairly clear that PostgreSQL 12 improves things significantly here. Performance does tail off just a little bit still at the higher partition counts, but it’s still light years ahead of PostgreSQL 11 on this test.
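
For reference, a minimal sketch of the kind of table and point query used in this test (hypothetical names, and only four partitions instead of the thousands behind the chart):

CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    balance numeric
) PARTITION BY HASH (id);

CREATE TABLE accounts_p0 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE accounts_p1 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE accounts_p2 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE accounts_p3 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Partition pruning leaves a single partition in the plan.
SELECT * FROM accounts WHERE id = 42;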

Other Partitioning Performance Improvements:

Ordered partition scans:

The planner is now able to make use of the implicit order of LIST and RANGE partitioned tables.  This allows the use of the Append operator in place of the MergeAppend operator when the required sort order is the order defined by the partition key. This is not possible for HASH partitioned tables since various out of order values can share the same partition.  This optimization reduces useless sort comparisons and provides a good boost for queries that use a LIMIT clause.
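
Reusing the hypothetical measurements table from the earlier sketch, and assuming each partition can return rows in ts order (for example via an index on ts), a query like the following can now use a plain Append and stop as soon as the LIMIT is satisfied, instead of merge-sorting across partitions:

SELECT * FROM measurements ORDER BY ts LIMIT 10;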

Get rid of single sub-plan Append and MergeAppend nodes:

This is a fairly trivial change which eliminates the Append and MergeAppend nodes when the planner sees it’s only got a single sub-node.  It was quite useless to keep the Append / MergeAppend node in this case as they’re meant to be for appending multiple subplan results together.  There’s not much to do when there’s already just 1 subplan.   Removing these does also give a small performance boost to queries as pulling tuples through executor nodes, no matter how trivial they are, is not free.  This change also allows some queries to partitioned tables to be parallelized which previously couldn’t be.

Various performance improvements to run-time partition pruning:

A fair bit of optimization work was also done around run-time partition pruning to reduce executor startup overheads.  Some work was also done to allow PostgreSQL to make use of Advanced Bit Manipulation instructions which gives PostgreSQL’s Bitmapset type a performance boost. This allows supporting processors to perform various operations 64-bits at a time in a native operation. Previously all these operations trawled through the Bitmapset 1 byte at a time. These Bitmapsets have also changed from 32-bits to 64-bits on 64-bit machines. This effectively doubles the performance of working with larger Bitmapsets.

Some changes were also made to the executor to allow range tables (for storing relation meta-data) to be found in O(1) rather than O(N) time, where N is the number of tables in the range table list.  This is particularly useful as each partition in the plan has a range table entry, so looking up the range table data for each partition was costly when the plan contained many partitions.

With these improvements and using a RANGE partitioned table partitioned by a timestamp column, each partition storing 1 month of data, the performance looks like:

[Chart: run-time partition pruning performance by number of partitions]

You can see that PostgreSQL 12’s gain gets bigger with more partitions. However, those bars taper off at higher partition counts. This is because I formed the query in a way that makes plan-time pruning impossible. The WHERE clause has a STABLE function, which the planner does not know the return value of, so cannot prune any partitions.  The return value is evaluated during executor startup and run-time pruning takes care of the partition pruning.  Unfortunately, this means the executor must lock all partitions in the plan, even the ones that are about to be run-time pruned.  Since this query is fast to execute, the overhead of this locking really shows with higher partition counts. Improving that is going to have to wait for another release.

The good news is that if we change the WHERE clause swapping out the STABLE function call for a constant, the planner is able to take care of pruning:

[Chart: plan-time partition pruning performance by number of partitions]

The planning overhead shows here: with few partitions, the performance of PostgreSQL 12 is not as high as with the generic plan and run-time pruning. With larger numbers of partitions, the performance does not tail off as much when the planner is able to perform the pruning.  This is because the query plan contains only 1 partition for the executor to lock and unlock.
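
To make the distinction concrete, here is a hedged sketch of the two query shapes against the hypothetical measurements table from earlier (now() simply stands in for whatever STABLE expression the benchmark used):

-- STABLE function: the planner cannot evaluate it, so all partitions stay
-- in the plan and run-time pruning happens at executor startup.
SELECT * FROM measurements WHERE ts = now();

-- Constant: the planner prunes at plan time, leaving a single partition.
SELECT * FROM measurements WHERE ts = '2019-01-15 10:00:00+00';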

Summary:

You can see from the graphs above that we’ve done a lot to improve partitioning in PostgreSQL 12. However, please don’t be too tempted by the graphs above into designing all your partitioning strategies around large numbers of partitions.  Be aware that there are still cases where too many partitions can cause the query planner to use more RAM and become slow.  When performance matters, and it generally always does, we highly recommend you run workload simulations. This should be done away from the production server, with various numbers of partitions, to see how they affect your performance. Have a read of the best practices section of the documentation for further guidance.

Test Environment:

All tests were run on an Amazon AWS m5d.large instance using pgbench. The transactions per seconds tests were measured over 60 seconds.

The following settings were changed:


shared_buffers = 1GB
work_mem = 256MB
checkpoint_timeout = 60min
max_wal_size = 10GB
max_locks_per_transaction = 256

All transactions per second counts were measured using a single PostgreSQL connection.

How To Bulk Import Data Into InnoDB Cluster?


Feed: Planet MySQL.
Author: Mirko Ortensi.

If you need to do bulk importing into InnoDB Cluster, it is certainly possible to do so by using any of:
Unfortunately both imports will add load to the instances and to the channels interconnecting them: data imported on the primary instance needs to be replicated to the rest of the instances, and the bigger the data to import, the higher the load (which could end up affecting the whole cluster). The import operation could be batched to reduce the load, and Group Replication at least allows throttling the workload with flow control, or splitting messages into several smaller ones with the group_replication_communication_max_message_size option.

How to import data into InnoDB Cluster?

But in case the data is a whole table (MySQL 8 also adds the flexibility to swap partitions and tables, which may come in handy), or the data can first be loaded into an InnoDB table, there’s a simple way to have an arbitrary amount of data pushed to InnoDB Cluster, and it takes advantage of the tablespace copying feature. I made a quick test to import a table.

I created the table t5 on an arbitrary instance and added a few rows. Then I exported it as in the instructions (this does nothing but flush the table and create an auxiliary .cfg file for definition validation at import time; using it is not mandatory but recommended):

FLUSH TABLES t5 FOR EXPORT;
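
For completeness, the export side of the procedure looks roughly like this hedged sketch; the table stays locked until UNLOCK TABLES is issued, so the .ibd (and .cfg) files must be copied out of the source datadir while the lock is held:

FLUSH TABLES t5 FOR EXPORT;
-- copy test/t5.ibd (and test/t5.cfg) from the source datadir at this point
UNLOCK TABLES;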

On the InnoDB Cluster setup, I created the table t5 with the same definition via the primary, then, again on the primary:

ALTER TABLE t5 DISCARD TABLESPACE;

This will remove the t5.ibd tablespace on all 3 instances, and with a simple SELECT I made sure that this is as expected:

mysql> select * from test.t5;
ERROR 1814 (HY000): Tablespace has been discarded for table 't5'

After that, I copied t5.ibd from the former instance into the related schema folder on *each* GR node.
Let’s check the initial GTID set:

mysql> select @@GLOBAL.GTID_EXECUTED;
+------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                     |
+------------------------------------------------------------+
| 550fa9ee-a1f8-4b6d-9bfe-c03c12cd1c72:1-270:1000011-1000014 |
+------------------------------------------------------------+
1 row in set (0.00 sec)

Then on the primary, did:

mysql> ALTER TABLE t5 IMPORT TABLESPACE;
Query OK, 0 rows affected, 1 warning (0.03 sec)

I am lazy and did not perform validation using the .cfg file (more on this in the instructions):

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                 |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1810 | InnoDB: IO Read error: (2, No such file or directory) Error opening './test/t5.cfg', will attempt to import without schema verification |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

And the tablespace is loaded from the local file system into each GR member. The new GTID set is:

mysql> select @@GLOBAL.GTID_EXECUTED;
+------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                     |
+------------------------------------------------------------+
| 550fa9ee-a1f8-4b6d-9bfe-c03c12cd1c72:1-271:1000011-1000014 |
+------------------------------------------------------------+
1 row in set (0.00 sec)

Let’s test it’s all ok:

mysql> select * from test.t5;
+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
|    7 |
|  777 |
+------+
8 rows in set (0.00 sec)

So the data will be available on the rest of the nodes at no bandwidth and protocol cost; only this statement is replicated, through the binlog:

# at 2714
#190919  1:34:34 server id 1  end_log_pos 2821  Query   thread_id=34    exec_time=0     error_code=0
SET TIMESTAMP=1568849674/*!*/;
ALTER TABLE t5 IMPORT TABLESPACE
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Most important, I broke nothing!

mysql> SELECT MEMBER_ID, MEMBER_STATE FROM performance_schema.replication_group_members;
+--------------------------------------+--------------+
| MEMBER_ID                            | MEMBER_STATE |
+--------------------------------------+--------------+
| 43a67ea3-e1a0-11e7-8a9a-0021f66e910e | ONLINE       |
| 45ab082d-e1a0-11e7-8b73-0021f66e910e | ONLINE       |
| 48be28c0-e1a0-11e7-8c19-0021f66e910e | ONLINE       |
+--------------------------------------+--------------+
3 rows in set (0.00 sec)

To wrap up: instead of loading GB or TB of data into InnoDB Cluster and having the cluster replicate it, this trick can push your data at almost no replication cost.
Comments are welcome!

Faster, Smarter, Better: Optimizations for Neo4j Graph Algorithms


Feed: Neo4j Graph Database Platform.
Author: Jocelyn Hoppa.

Discover the latest improvements to the Neo4j Graph Algorithms library.

We’re happy to share recent improvements to the Neo4j Graph Algorithms library. These updates include optimizations at several layers, improved configuration and usability, as well as specific feature requests.

A big “thank you” to those who provided suggestions on how to better serve data scientists in production environments, ranging from graph analytics to graph-enhanced machine learning!

In this post, we’ll summarize the operational improvements from 3.5.6.1 through 3.5.9.0. We’ll also dive into a bit more detail on specific algorithm enhancements.

You can install the latest algorithms library directly from the Neo4j Desktop in the ‘Plugins’ section of your project or on the Neo4j Download Center.

[Image: Neo4j Download Center]

Neo4j Graph Algorithms Library Infrastructure Improvements

The infrastructure improvements involved a few visible changes, such as helping users plan for resource requirements and providing more baseline graph information, as well as changes “under the hood” to improve performance and stomp out a few bugs.

Compute Memory Requirements Ahead of Time

The most anticipated change in overall operations is the ability to compute memory requirements ahead of time. The algo.memrec procedure determines memory needs for loading a graph (or Cypher projection) into memory and for running specific algorithms: PageRank, Label Propagation, Connected Components (Union Find) and Louvain.

This feature is super handy for getting the memory configuration right. Because Neo4j graph algorithms run completely in heap memory, the memory needed for analytics workloads is likely different from that of your transaction workloads.

Memrec allows you to anticipate the memory that will be used to load your graph or run your algorithms, and should be used as a first step to make sure your configuration is appropriate. More detail on the memrec procedure can be found in this post on the Neo4j Graph Algorithms release.

Specify Different Concurrencies

We’ve also split out the concurrency configuration for algorithms, allowing users to specify different concurrencies for reading data into memory and for writing results back to the graph after an algorithm completes. There’s more detail in the above mentioned blog post, but this enables users to more finely tune their environment for different uses and goals.

Conduct Faster Reads/Writes

To increase overall performance, we’ve switched how we encode relationship weights in the in-memory graph from an array map to a parallel array. This change results in faster read/write speeds for weighted relationships and a lower memory footprint.

Load Graphs More Efficiently

The default graph loader has been updated from heavy to huge graphs with optimizations and bug fixes that have drastically accelerated node loading. The huge graph loader also now supports relationship deduplication when duplicateRelationships is configured.

Use Smarter Information

To simplify gathering basic information about our graphs, we added relationship counts and degree distribution statistics to the procedures for graph loading and graph info. We include the min/max and mean number of relationships as well as the number of nodes in a range of degree percentiles.

Data scientists can use this type of data to understand relationship densities and distributions, which can impact algorithm choices and results. It’s also a quick way to check if the graph loaded as you expected and makes debugging simpler and more straightforward.

We have also added a new procedure, algo.graph.list(), that returns a list of the loaded graphs with their basic information. This is especially helpful when you’ve loaded many named graphs and can’t remember the names or even how many you have. (We’ve all been there!)

With the named graph list, house cleaning becomes much simpler because you can remove your extra graphs without having to restart the database.

Finally, there are also numerous other improvements from better error handling and bug fixes, to additional information that enhance usability. For example, algorithms will not execute when the provided node labels or relationship types don’t exist, and instead return an instructive error.

Neo4j Graph Algorithms Library Algorithm Enhancements

In regards to the algorithms themselves, we’ve made some general enhancements to accelerate results, as well as enterprise-grade optimizations for our product supported algorithms (PageRank, UnionFind /Connected Components, Louvain and Label Propagation). This aligns with our dual strategy to have many new algorithms created by our Labs team and a set of core algorithms with enterprise features supported and developed by our product engineering team.

As the product team implements new algorithms or variants of existing algorithms, we are giving users early access to these new features via a new beta namespace (algo.beta.), without breaking existing syntax.

Whenever new beta features are rolled out, they will be explained in the release notes. As we test and mature beta algos, they will eventually graduate into the primary namespace.

General Algorithm Improvements

Users can now terminate product-supported algorithms during graph loading and result write. Previously, termination was only enabled during the actual computation of the algorithm. Algorithms may be terminated by ctrl-c in Cypher shell or clicking (x) in Neo4j browser.

We’ve also increased histogram accuracy in community detection algorithms by increasing the number of significant digits to 5 (from 2), enabling us to give accurate results for the 100th percentile of nodes per community. For those who do not want the histogram output, you can prune down the statistical results using the YIELD clause.

PageRank

The PageRank algorithm now includes a tolerance parameter, which allows the user to run PageRank until values stabilize within the specified tolerance window. PageRank usually converges on results after enough iterations; in some scenarios it’s not mathematically possible, and in other cases it would simply take too long.

So when we only use an iteration limit, we are left to hope values converge in that iteration window and that we are not iterating unnecessarily and wasting time. Alternatively, tolerance allows users to terminate pageRank as soon as values have stabilized. This gives better results as well as potentially decreasing calculation times.

Previous versions allowed users to specify the number of iterations, which provides performance consistency. Using both a tolerance parameter and a maximum number of iterations is a best practice to balance and tune for accuracy and performance needs.

Label Propagation

The Label Propagation algorithm now has the option to produce identical results when run multiple times on the same graph: the initial start node is seeded and ties are broken by selecting the smaller community label. Both behaviours are configurable via the seedProperty setting.

This option enables users to choose either a traditional approach which selects start nodes and breaks ties randomly or a deterministic approach. This change helps users in production settings where consistency is important.

As part of this update, the default parameter values for seeding and writing have been removed and users must specify the seedProperty (seedProperty is the new name for the former partitionProperty parameter.)

To prevent accidental overwrites (when users run an algorithm multiple times on the same graph and incorrectly write results back to the same property), we now require users to specify a value for the writeProperty parameter. If no writeProperty is specified, results will not be written back to the graph.

Label Propagation is the first algorithm with a beta implementation (algo.beta.labelPropagation), which previews the new syntax for seeding labels.

Connected Components (Union Find)

We have a new parallel implementation of Connected Components (Union Find) that makes better use of the available threads and consumes less memory.

These improvements are all included under algo.unionFind and obviate the need for previous experimental variations (algo.unionFind.forkJoin, algo.unionFind.forkJoinMerge, algo.unionFind.queue) which have been deprecated. Union Find is now significantly faster and requires less memory in heap to complete.

Like Label Propagation, the Union Find algorithm now also has an added option for a seedProperty parameter to set initial partition values. This enables users to preserve the original community IDs even after executing the algorithm multiple times and with the addition of new data.

To make seeding efficient, it runs concurrently on multiple threads and only writes on changed/new properties. This feature was built specifically for users who need to run Label Propagation on the same data set – with new data added incrementally – in an efficient way.

We also added a consecutiveIds parameter to allow users to specify that partitions should be labeled with successive integers. This eliminates numeric gaps in our community ID labels for easier reading.

Louvain Modularity

The Louvain Modularity algorithm has several optimizations to increase performance.

We optimized the way two internal data structures are populated, which reduces run time, and have reduced the algorithm’s memory footprint. We also removed some indirection to streamline processes and reduced the time taken to read relationship weights.

Conclusion

It’s exciting to see data scientists test the limits (and sometimes beyond!) of what can be accomplished using graph algorithms. We hope you find these recent optimizations beneficial and put the new features to good use.

And as always, please let us know how we can continue to improve, what capabilities you need or new algorithms should be included!

If you’re just learning about graph algorithms or want some hands-on material, download a free copy of the O’Reilly Graph Algorithms book and discover how to develop more intelligent solutions.

Download My Free Copy
