How To Keep Elasticsearch Synced With A RDBMS Using Logstash
In order to take advantage of the powerful search capabilities offered by Elasticsearch, many businesses will deploy Elasticsearch alongside existing relational databases. In such a scenario, it will likely be necessary to keep Elasticsearch synchronized with the data that is stored in the associated relational database. Therefore, in this blog post I will show how Logstash can be used to efficiently copy records and to synchronize updates from a relational database into Elasticsearch. The code and methods presented here have been tested with MySQL, but should theoretically work with any RDBMS.

The versions used for this post are as follows:

Elasticsearch: 7.1.1
Logstash: 7.1.1
Java: 1.8.0_162-b12
As records are written to or updated in MySQL, they must include a field containing their update or insertion time. This allows Logstash to request records with an update or insertion time that is newer than the last record that was received in the previous iteration of the polling loop.

If the above conditions are satisfied, we can configure Logstash to periodically request all new or modified records from MySQL and then write them into Elasticsearch. The Logstash code for this is presented later in this blog.

The MySQL database and table used in this post can be configured as follows:
CREATE DATABASE es_db;
USE es_db;
DROP TABLE IF EXISTS es_table;
CREATE TABLE es_table (
  id BIGINT(20) UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY unique_id (id),
  client_name VARCHAR(32) NOT NULL,
  modification_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  insertion_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
There are a few interesting parameters in the above MySQL configuration:

es_table: This is the name of the MySQL table that records will be read from and then synchronized to Elasticsearch.
id: This is the unique identifier for this record. Notice that “id” is defined as the PRIMARY KEY as well as a UNIQUE KEY. This guarantees each “id” only appears once in the current table. This will be translated to “_id” for updating or inserting the document into Elasticsearch.
client_name: This is a field that represents the user-defined data that will be stored in each record. To keep this blog simple, we only have a single field with user-defined data, but more fields could be easily added. This is the field that we will modify to show that not only are newly inserted MySQL records copied to Elasticsearch, but that updated records are also correctly propagated.
modification_time: When a record is inserted or updated, MySQL will cause its value to be set to the time of the modification. This modification time allows us to pull out any records that have been modified since the last time Logstash requested documents from MySQL.
insertion_time: This field is mostly for demonstration purposes and is not strictly necessary for synchronization to work correctly. We use it to track when a record was originally inserted into MySQL.
Given the above configuration, records can be written to and updated in MySQL as follows:
UPDATE es_table SET client_name = <new client name> WHERE id = <id>;

INSERT INTO es_table (id, client_name) VALUES (<id>, <client name when created>)
  ON DUPLICATE KEY UPDATE client_name = <client name when updated>;
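For example (an illustrative sequence using this post's sample data; the exact timestamp values depend on when the statements run), inserting a row and later updating it leaves insertion_time fixed while modification_time advances:

INSERT INTO es_table (id, client_name) VALUES (1, 'Jim Carrey');
-- ... some time later ...
UPDATE es_table SET client_name = 'Jimbo Kerry' WHERE id = 1;
-- insertion_time still holds the original insert time, while
-- modification_time now reflects the UPDATE, thanks to the
-- ON UPDATE CURRENT_TIMESTAMP clause in the table definition
SELECT id, client_name, insertion_time, modification_time FROM es_table WHERE id = 1;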
Synchronization code
The following Logstash pipeline implements the synchronization code that is described in the previous section:
input {
  jdbc {
    jdbc_driver_library => "<path>/mysql-connector-java-8.0.16.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://<MySQL host>:3306/es_db"
    jdbc_user => <my username>
    jdbc_password => <my password>
    jdbc_paging_enabled => true
    tracking_column => "unix_ts_in_secs"
    use_column_value => true
    tracking_column_type => "numeric"
    schedule => "*/5 * * * * *"
    statement => "SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs FROM es_table WHERE (UNIX_TIMESTAMP(modification_time) > :sql_last_value AND modification_time < NOW()) ORDER BY modification_time ASC"
  }
}
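The filter and output sections of the pipeline, which the bullets below describe, take roughly the following shape (a sketch reconstructed from those descriptions; the index name rdbms_sync_idx matches the queries shown later in this post):

filter {
  mutate {
    # copy the MySQL "id" into a metadata field so it can drive the
    # Elasticsearch _id without creating a new field on the document
    copy => { "id" => "[@metadata][_id]" }
    # drop fields we do not want written to Elasticsearch
    remove_field => ["id", "@version", "unix_ts_in_secs"]
  }
}
output {
  # uncomment to print each event for debugging
  # stdout { codec => "rubydebug" }
  elasticsearch {
    index => "rdbms_sync_idx"
    document_id => "%{[@metadata][_id]}"
  }
}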
There are a few areas from the above pipeline that should be highlighted:
tracking_column: This field specifies the field “unix_ts_in_secs” (described below) that is used for tracking the last document read by Logstash from MySQL, and is stored on disk in .logstash_jdbc_last_run. This value will be used to determine the starting value for documents that Logstash will request in the next iteration of its polling loop. The value stored in .logstash_jdbc_last_run can be accessed in the SELECT statement as “:sql_last_value”.
unix_ts_in_secs: This is a field that is generated by the above SELECT statement, and which contains the “modification_time” as a standard Unix timestamp (seconds since the epoch). This field is referenced by the “tracking_column” that we just discussed. A Unix timestamp is used for tracking progress rather than a normal timestamp, as a normal timestamp may cause errors due to the complexity of correctly converting back and forth between UTC and the local timezone.
sql_last_value: This is a built-in parameter that contains the starting point for the current iteration of Logstash’s polling loop, and it is referenced in the SELECT statement line of the above jdbc input configuration. This is set to the most recent value of “unix_ts_in_secs”, which is read from .logstash_jdbc_last_run. This is used as the starting point for documents to be returned by the MySQL query that is executed in Logstash’s polling loop. Including this variable in the query guarantees that insertions or updates that have previously been propagated to Elasticsearch will not be re-sent to Elasticsearch. A worked example of this substitution is sketched after this list.
schedule: This uses cron syntax to specify how often Logstash should poll MySQL for changes. The specification of "*/5 * * * * *" tells Logstash to contact MySQL every 5 seconds.
modification_time < NOW(): This portion of the SELECT is one of the more difficult concepts to explain and therefore it is explained in detail in the next section.
filter: In this section we simply copy the value of “id” from the MySQL record into a metadata field called “_id”, which we will later reference in the output to ensure that each document is written into Elasticsearch with the correct “_id” value. Using a metadata field ensures that this temporary value does not cause a new field to be created. We also remove the “id”, “@version”, and “unix_ts_in_secs” fields from the document, as we do not wish for them to be written to Elasticsearch.
output: In this section we specify that each document should be written to Elasticsearch, and should be assigned an “_id” which is pulled from the metadata field that we created in the filter section. There is also a commented-out rubydebug output that can be enabled to help with debugging.
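To make the interplay between :sql_last_value and the tracking column concrete, here is a sketch of the query Logstash would execute on one polling iteration, assuming (hypothetically) that .logstash_jdbc_last_run currently holds the value 1560862736:

SELECT *, UNIX_TIMESTAMP(modification_time) AS unix_ts_in_secs
FROM es_table
WHERE (UNIX_TIMESTAMP(modification_time) > 1560862736
       AND modification_time < NOW())
ORDER BY modification_time ASC;

Only rows modified after that instant, and before the second in which the query runs, are returned; the largest unix_ts_in_secs among the returned rows becomes :sql_last_value for the next iteration.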
At first glance, the above approach looks like it should work correctly, but there are edge cases where it may miss some documents. For example, let's consider a case where MySQL is inserting 2 documents per second and where Logstash is executing its SELECT statement every 5 seconds. This is demonstrated in the following diagram, where each second is represented by T0 to T10, and the records in MySQL are represented by R1 through R22. We assume that the first iteration of Logstash’s polling loop takes place at T5 and it reads documents R1 through R11, as represented by the cyan boxes. The value stored in sql_last_value is now T5, as this is the timestamp on the last record (R11) that was read. We also assume that just after Logstash has read documents from MySQL, another document R12 is inserted into MySQL with a timestamp of T5.
In the next iteration of the above SELECT statement, we only pull documents where the time is
greater than T5 (as instructed by WHERE (UNIX_TIMESTAMP(modification_time) > :sql_last_value)),
which means that record R12 will be skipped. This is shown in the diagram below, where the
cyan boxes represent the records that are read by Logstash in the current iteration, and the
grey boxes represent the records that were previously read by Logstash.
Notice that with this scenario’s SELECT, the record R12 will never be written to Elasticsearch.

One way to address this would be to change the SELECT to pull documents whose modification time is greater than or equal to :sql_last_value. However, this implementation is also not ideal. In this case, the problem is that the most recent document(s) read from MySQL in the most recent time interval will be re-sent to Elasticsearch. While this does not cause any issues with respect to correctness of the results, it does create unnecessary work. As in the previous section, after the initial Logstash polling iteration, the diagram below shows which documents have been read from MySQL.

When we execute the subsequent Logstash polling iteration, we pull all documents where the time is greater than or equal to T5. This is demonstrated in the following diagram. Notice that record R11 (shown in purple) will be sent to Elasticsearch again.
Neither of the previous two scenarios is ideal. In the first scenario data can be lost, and in the second scenario redundant data is read from MySQL and sent to Elasticsearch.

The remedy is the combination used in the pipeline above: pull only documents whose timestamp is strictly greater than :sql_last_value and that also satisfy modification_time < NOW(), so that each polling iteration reads only fully elapsed seconds. This is demonstrated by the following diagram, where the current Logstash poll is executed at T5. Notice that because modification_time < NOW() must be satisfied, only documents up to, but not including, those in the period T5 will be read from MySQL. Since we have retrieved all documents from T4 and no documents from T5, we know that sql_last_value will then be set to T4 for the next Logstash polling iteration.
The diagram below demonstrates what happens on the next iteration of the Logstash poll.
Since UNIX_TIMESTAMP(modification_time) > :sql_last_value, and because sql_last_value is set to
T4, we know that only documents starting from T5 will be retrieved. Additionally, because only
documents that satisfy modification_time < NOW() will be retrieved, only documents up to and
including T9 will be retrieved. Again, this means that all documents in T9 are retrieved, and that
sql_last_value will be set to T9 for the next iteration. Therefore this approach eliminates the
risk of retrieving only a subset of MySQL documents from any given time interval.
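Before querying Elasticsearch, a few test records need to exist in MySQL. For example (id 1 and the name 'Jim Carrey' are confirmed by the results below; the other two rows are illustrative):

INSERT INTO es_table (id, client_name) VALUES (1, 'Jim Carrey');
INSERT INTO es_table (id, client_name) VALUES (2, 'Mike Myers');
INSERT INTO es_table (id, client_name) VALUES (3, 'Bryan Adams');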
Once the JDBC input schedule has triggered reading records from MySQL and written them to Elasticsearch, we can execute the following Elasticsearch query to see the documents in Elasticsearch:
GET rdbms_sync_idx/_search

which returns a response whose hits section looks as follows:
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "rdbms_sync_idx",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"insertion_time" : "2019-06-18T12:58:56.000Z",
"@timestamp" : "2019-06-18T13:04:27.436Z",
"modification_time" : "2019-06-18T12:58:56.000Z",
"client_name" : "Jim Carrey"
}
},
Etc …
We can then update the document that corresponds to _id=1 in MySQL as follows:

UPDATE es_table SET client_name = 'Jimbo Kerry' WHERE id = 1;

The next iteration of the JDBC input's polling loop picks up this change, which correctly updates the document identified by _id of 1. We can look directly at the document in Elasticsearch by executing the following:
GET rdbms_sync_idx/_doc/1
{
  "_index" : "rdbms_sync_idx",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "insertion_time" : "2019-06-18T12:58:56.000Z",
    "@timestamp" : "2019-06-18T13:09:30.300Z",
    "modification_time" : "2019-06-18T13:09:28.000Z",
    "client_name" : "Jimbo Kerry"
  }
}
Notice that the _version is now set to 2, that the modification_time is now different than the insertion_time, and that the client_name field has been correctly updated to the new value. The @timestamp field is not particularly interesting for this example, and is added by Logstash by default.
An upsert into MySQL can be done as follows, and the reader of this blog can verify that the correct information will be reflected in Elasticsearch:
INSERT INTO es_table (id, client_name) VALUES (4, 'Bob is new') ON DUPLICATE
KEY UPDATE client_name='Bob exists already';
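After the next polling interval has fired, one way to verify is to fetch the new document directly; the id of 4 follows from the upsert above:

GET rdbms_sync_idx/_doc/4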
Conclusion
In this blog post I showed how Logstash can be used to synchronize Elasticsearch with a
relational database. The code and methods presented here were tested with MySQL, but
should theoretically work with any RDBMS.
If you have any questions about Logstash or any other Elasticsearch-related topics, have a look at our Discuss forums for valuable discussion, insights, and information. Also, don't forget to try out our Elasticsearch Service, the only hosted Elasticsearch and Kibana offering powered by the creators of Elasticsearch.