
RealTime CDC From MySQL Using AWS MSK With Debezium

Feed: Planet MySQL
Author: Searce Engineering (Bhuvanesh)

(Architecture diagram credit: AWS)

CDC is becoming more popular nowadays. Many organizations that want to build a real-time or near-real-time data pipeline and reports use CDC as the backbone to power their real-time reports. Debezium is an open-source product from Red Hat, and it supports multiple databases (both SQL and NoSQL). Apache Kafka is at the core of this.

Managing and scaling Kafka clusters is not easy for everyone. If you are using AWS for your infra, then let AWS manage the cluster: AWS MSK is a managed Kafka service. We are going to configure Debezium with AWS MSK.

If you have already worked with AWS MSK, then you might be familiar with the configuration file. It is similar to an RDS parameter group, but here you need to upload a file with your parameter names and their values. If you are using MSK for the first time, it may be a bit confusing. No worries, I'll give you the steps to do this. You can change the configuration file even after you launch the cluster, but it's a good practice to create the configuration file before launching the cluster. Unfortunately, the AWS CLI is the only way to create the configuration file.

1. Create a conf file:

Create a file on your desktop (or anywhere) with the following parameters. For Debezium, the automatic topic creation parameter is required, so I'll use only that one for now. You can add more parameters if you want. Copy and paste the content below into a conf file called kafka-custom-conf.

If you don't enable automatic topic creation, you'll see the following error:

Dec 20 18:32:19 ip-172-31-44-220 connect-distributed[23563]: [2019-12-20 18:32:19,845] WARN [Producer clientId=connector-producer-mysql-connector-db01-0] Error while fetching metadata with correlation id 294 : {mysql-db01=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1051)

kafka-custom-conf file:

auto.create.topics.enable = true
zookeeper.connection.timeout.ms = 1000

2. Upload the conf file to AWS

Install AWS CLI on your workstation and run the following command.

NOTE: I'm going to use Kafka version 2.3.1. If you are going to use a different Kafka version, change the --kafka-versions value.

aws kafka create-configuration --name "kafka-231-custom-conf" --description "Example configuration description." --kafka-versions "2.3.1" --server-properties file://C:\Users\rbhuv\Desktop\kafka-custom-conf

Once you run the command, it'll give you the following output.

{
    "Arn": "arn:aws:kafka:us-east-1:0000111222333:configuration/kafka-231-custom-conf/6061ca2d-10b7-46c6-81c0-7fae1b208452-7",
    "CreationTime": "2019-12-20T18:38:17.103000+00:00",
    "LatestRevision": {
        "CreationTime": "2019-12-20T18:38:17.103000+00:00",
        "Description": "Example configuration description.",
        "Revision": 1
    },
    "Name": "kafka-231-custom-conf"
}
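If you want to double-check what was uploaded, you can read the revision back using the ARN returned above (a quick check; note that the server properties come back base64-encoded):

aws kafka describe-configuration-revision --arn "arn:aws:kafka:us-east-1:0000111222333:configuration/kafka-231-custom-conf/6061ca2d-10b7-46c6-81c0-7fae1b208452-7" --revision 1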

3. (Optional) Update existing cluster with this conf file

If you are going to create a new cluster then ignore this step.

Note: if you forgot to take a note of the configuration ARN (from step 2), you can get it from the CLI with aws kafka list-configurations.
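For example, you can pull just the ARN of the configuration we created above (a small sketch; the --query filter assumes the name kafka-231-custom-conf used earlier):

aws kafka list-configurations --query 'Configurations[?Name==`kafka-231-custom-conf`].Arn' --output text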

You also need the current version of the cluster (this is not the Kafka software version; it is a version AWS assigns to your cluster). Run aws kafka list-clusters; this will give you the current version, like this: "CurrentVersion": "K2EUQ1WTGCTBG2".
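If you have several clusters, you can narrow the output down to the one you care about (a small sketch; the cluster name searce-bigdata is the one used later in this post):

aws kafka list-clusters --cluster-name-filter "searce-bigdata" --query 'ClusterInfoList[].CurrentVersion' --output text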

Create a configuration info file called configuration-info.json which contains the ARN and revision of your new conf file.

{
    "Arn": "arn:aws:kafka:us-east-1:0000111222333:configuration/kafka-231-custom-conf/6061ca2d-10b7-46c6-81c0-7fae1b208452-7",
    "Revision": 1
}

Now run the command below to update your Kafka cluster with the new configuration file.

aws kafka update-cluster-configuration --cluster-arn "arn:aws:kafka:us-east-1:0000111222333:cluster/searce-bigdata/599c6202-ec40-455a-afa8-d7c5916d7bc2-7" --configuration-info file://C:\Users\rbhuv\Desktop\configuration-info.json --current-version "K2EUQ1WTGCTBG2"

This will give you the following output.

{
    "ClusterArn": "arn:aws:kafka:us-east-1:0000111222333:cluster/searce-bigdata/599c6202-ec40-455a-afa8-d7c5916d7bc2-7",
    "ClusterOperationArn": "arn:aws:kafka:us-east-1:0000111222333:cluster-operation/searce-bigdata/599c6202-ec40-455a-afa8-d7c5916d7bc2-7/519396ad-1df2-46aa-8858-ba2c49f06c3c"
}
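The update runs asynchronously, so you can poll the operation with the ClusterOperationArn returned above (a quick check):

aws kafka describe-cluster-operation --cluster-operation-arn "arn:aws:kafka:us-east-1:0000111222333:cluster-operation/searce-bigdata/599c6202-ec40-455a-afa8-d7c5916d7bc2-7/519396ad-1df2-46aa-8858-ba2c49f06c3c"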

The MSK launch console is very easy to use, and you can select the options as you need. I'm just pointing out a few options where you need to focus.
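If you prefer to script the cluster creation too, something along these lines should work. This is only a rough sketch, not the exact cluster used in this post: the subnet and security group IDs are placeholders, and it's worth double-checking the field names with aws kafka create-cluster help.

cat > brokernodegroupinfo.json <<'EOF'
{
    "InstanceType": "kafka.m5.large",
    "ClientSubnets": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1", "subnet-0123456789abcdef2"],
    "SecurityGroups": ["sg-0123456789abcdef0"]
}
EOF

# The broker count must be a multiple of the number of client subnets.
aws kafka create-cluster --cluster-name "searce-bigdata" --kafka-version "2.3.1" --number-of-broker-nodes 3 --broker-node-group-info file://brokernodegroupinfo.json --configuration-info file://configuration-info.json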

Install Java and Confluent Connector binaries:

sudo apt-get update
sudo apt-get install default-jre
wget -qO - https://packages.confluent.io/deb/5.3/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/5.3 stable main"
sudo apt-get update && sudo apt-get install confluent-hub-client confluent-common confluent-kafka-2.12

Configure the Distributed connector properties:

The Connect worker stores its configs, offsets, and status in internal Kafka topics. If those topics can't satisfy the brokers' replication requirements (for example, a replication factor higher than what the cluster can keep in sync), the worker keeps retrying with errors like the one below, which is why the storage replication factors are set explicitly in the properties file.

Dec 20 11:42:36 ip-172-31-44-220 connect-distributed[2630]: [2019-12-20 11:42:36,290] WARN [Producer clientId=producer-3] Got error produce response with correlation id 844 on topic-partition connect-configs-0, retrying (2147482809 attempts left). Error: NOT_ENOUGH_REPLICAS (org.apache.kafka.clients.producer.internals.Sender:637)

/etc/kafka/connect-distributed.properties file:

vi /etc/kafka/connect-distributed.properties

bootstrap.servers=b-1.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-3.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-2.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092
group.id=debezium-cluster
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
offset.storage.replication.factor=2
config.storage.replication.factor=2
status.storage.replication.factor=2

Install Debezium MySQL connector and S3 connector:

confluent-hub install debezium/debezium-connector-mysql:latest
confluent-hub install confluentinc/kafka-connect-s3:latest
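You can confirm that the plugins landed in the directory we reference in plugin.path (assuming the default confluent-hub install location):

ls /usr/share/confluent-hub-components/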

Start the connector service:

You can run the Confluent connector worker as a systemd service.

vi /lib/systemd/system/confluent-connect-distributed.service

[Unit]
Description=Apache Kafka - connect-distributed
Documentation=http://docs.confluent.io/
After=network.target
[Service]
Type=simple
User=cp-kafka
Group=confluent
ExecStart=/usr/bin/connect-distributed /etc/kafka/connect-distributed.properties
TimeoutStopSec=180
Restart=no
[Install]
WantedBy=multi-user.target

Start the service

systemctl enable confluent-connect-distributed
systemctl start confluent-connect-distributed
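Once it's up, verify both the service and the Connect REST API before registering any connectors (port 8083 is the Kafka Connect default used throughout this post):

systemctl status confluent-connect-distributed
curl -s localhost:8083/connector-plugins | jq

The plugin list should include io.debezium.connector.mysql.MySqlConnector and io.confluent.connect.s3.S3SinkConnector once the connectors from the previous step are installed.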

Configure MySQL Connector:

Note: Before configuring the MySQL connector, make sure you have enabled the binlog, that the MySQL port is accessible from the Debezium EC2 instance, and that you have a MySQL user with the required permissions (refer to the Debezium docs).
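As a rough sketch of those prerequisites (the user and password here match the connector config below; adjust them for your environment and confirm the exact privilege list against the Debezium MySQL connector docs):

# my.cnf settings Debezium expects (restart MySQL after changing them):
#   server-id        = 1
#   log_bin          = mysql-bin
#   binlog_format    = ROW
#   binlog_row_image = FULL

# Create the user the connector will use:
mysql -u root -p -e "CREATE USER 'bhuvi'@'%' IDENTIFIED BY 'your_strong_pass';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'bhuvi'@'%';
FLUSH PRIVILEGES;"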

Create a file called mysql.json (this is my example conf file; you can refer to the Debezium docs for the meaning of these parameters).

Note: From line number 14 onwards I have added some filters to bring only the new data from MySQL to my consumer app (by default, Debezium adds some metadata info along with the MySQL data, but I don't want it). Also, make sure the bootstrap servers and MySQL credentials are correct.

{
  "name": "mysql-connector-db01",
  "config": {
    "name": "mysql-connector-db01",
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.server.id": "1",
    "tasks.max": "3",
    "database.history.kafka.bootstrap.servers": "b-1.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-3.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-2.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092",
    "database.history.kafka.topic": "schema-changes.mysql",
    "database.server.name": "mysql-db01",
    "database.hostname": "172.31.84.129",
    "database.port": "3306",
    "database.user": "bhuvi",
    "database.password": "your_strong_pass",
    "database.whitelist": "bhuvi,new,test",
    "internal.key.converter.schemas.enable": "false",
    "transforms.unwrap.add.source.fields": "ts_ms",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}

If you don't want to customize anything, then just remove everything after the database.whitelist line.

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @mysql.json
curl localhost:8083/connectors/mysql-connector-db01/status | jq

{
  "name": "mysql-connector-db01",
  "connector": {
    "state": "RUNNING",
    "worker_id": "172.31.44.151:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "172.31.44.151:8083"
    }
  ],
  "type": "source"
}

Now insert something into any table in the test database (or any other database in the whitelist, because we have whitelisted only those databases for CDC capture).

use test;
create table rohi (
id int,
fn varchar(10),
ln varchar(10),
phone int
);
insert into rohi values (2, 'rohit', 'ayare', '87611');

We can get these values from the Kafka brokers. Listen to the topic below:

mysql-db01.test.rohi

This is the combination of servername.databasename.tablename, where the server name is the value you set as database.server.name in the mysql.json file.

kafka-console-consumer --bootstrap-server b-1.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-3.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092,b-2.searce-bigdata.XXXXXXXXX.kafka.us-east-1.amazonaws.com:9092 --topic mysql-db01.test.rohi --from-beginning
{"id":1,"fn":"rohit","ln":"ayare","phone":87611,"__ts_ms":0}
{"id":1,"fn":"rohit","ln":"ayare","phone":87611,"__ts_ms":0}
{"id":1,"fn":"rohit","ln":"ayare","phone":87611,"__ts_ms":0}
{"id":1,"fn":"rohit","ln":"ayare","phone":87611,"__ts_ms":0}

It’ll start copying the historical data and start capturing real-time CDC.

I want to send this data to an S3 bucket. Make sure the Debezium EC2 instance has an IAM role attached with write access to the target S3 bucket. (Alternatively, install the awscli and configure an access and secret key, but that's not recommended.)
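If you need to create that access, a rough sketch of a minimal inline policy attached to the instance role is shown below (the role name debezium-ec2-role and policy name s3-sink-access are placeholders, and you should confirm the exact list of S3 actions the sink needs against the Confluent S3 connector docs):

aws iam put-role-policy --role-name debezium-ec2-role --policy-name s3-sink-access --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"],
    "Resource": ["arn:aws:s3:::searce-00000", "arn:aws:s3:::searce-00000/*"]
  }]
}'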

Create s3.json file.

{
  "name": "s3-sink-db01",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "searce-00000",
    "name": "s3-sink-db01",
    "tasks.max": "3",
    "s3.region": "us-east-1",
    "s3.part.size": "5242880",
    "s3.compression.type": "gzip",
    "timezone": "UTC",
    "locale": "en",
    "flush.size": "10000",
    "rotate.interval.ms": "3600000",
    "topics.regex": "mysql-db01.(.*)",
    "internal.key.converter.schemas.enable": "false",
    "key.converter.schemas.enable": "false",
    "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "internal.value.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
    "path.format": "YYYY/MM/dd/HH",
    "partition.duration.ms": "3600000",
    "rotate.schedule.interval.ms": "3600000"
  }
}

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @s3.json
curl localhost:8083/connectors/s3-sink-db01/status | jq

{
  "name": "s3-sink-db01",
  "connector": {
    "state": "RUNNING",
    "worker_id": "172.31.44.151:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "172.31.44.151:8083"
    },
    {
      "id": 1,
      "state": "RUNNING",
      "worker_id": "172.31.44.151:8083"
    },
    {
      "id": 2,
      "state": "RUNNING",
      "worker_id": "172.31.44.151:8083"
    }
  ],
  "type": "sink"
}

Insert 10,000 rows into the rohi table, then check the S3 bucket. It'll save the data in JSON format with GZIP compression, in hour-wise partitions.
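To spot-check what landed in S3 (a quick sketch; by default the S3 sink writes objects under a topics/ prefix in the bucket, using the hourly path format configured above):

aws s3 ls s3://searce-00000/topics/ --recursive | head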

Conclusion:

The MySQL and S3 config files here are just references; if you want more customization or need help understanding any parameter, please refer to the Debezium and Confluent documentation. Also, in this example blog I'm doing the S3 upload as a micro-batch (every 1 hour or every 10,000 rows added/modified); if you want it closer to real time, modify the config file accordingly.

If you want to do the same setup with Kafka on EC2 instead of AWS MSK, please refer to the following link.

