2.
• Introductions
• Disaster Recovery (DR), RTO and RPO
• Apache Solr and State of CDCR
• Test scenarios and Assumptions
• CDCR architecture, cross-region VPC peering, and SOLR
configuration
• Demonstrate CDCR
• Observations
• Questions
Agenda
3.
Sr. Consultant with SearchStax (formerly known as Measured
Search) since December 2015.
Currently serving as a Search Engineer with Allstate Insurance
Company on their SOLR Enterprise Search team.
Previous clients include United Airlines, US Bank.
Extensive experience with middleware applications such as
TIBCO, IBM Websphere, etc.
SearchStax company information: https://www.searchstax.com/
The company was named one of the “Top 20 Open Source
Software Solutions for 2017” by CIOReview Magazine
Nishant Karve
About Me
4.
Why do we need DR plans?
1. You’re only as strong as your weakest link.
An ideal disaster recovery plan would place your production
servers in a top tier data center with no single point of failure on
the power and network connections.
2. Customer retention is costly after a DR.
While on average it’s much cheaper to retain a customer than to
acquire a new one, re-acquiring an old customer after an IT
disaster is very expensive.
3. Customers expect perfection.
With ever increasing competition and the varied choices
available to the customer, we are nearing a phase where the
customer expects perfection from your online service.
4. Machines and hardware fail.
With the highly distributed nature of computing it’s quite
obvious that machines will fail. Fried motherboards, faulty
network switches, corrupted hard drives all contribute to a
disaster.
5.
Any talk about disaster recovery is incomplete without discussing RPO and RTO.
RPO (Recovery Point Objective): Focuses on data and your company’s loss tolerance in relation to
your data. It is determined by looking at the time between backups and the amount of data that could
be lost in between backups.
RTO (Recovery Time Objective): The target time you set for the recovery of your IT and business
activities after a disaster has struck. The goal here is to calculate how quickly you need to recover,
which can then dictate the DR classification (Tier 0 - Tier 7, where Tier 0 indicates no off-site data and
hence possibly no recovery).
While they may be different, both metrics need to be considered to develop an effective DR plan.
[Timeline diagram: RPO spans from the last known good copy of data to the point DR is initiated; RTO spans from DR initiation through data restoration until normal business resumes.]
RTO and RPO
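The two metrics reduce to simple time deltas along that timeline. A minimal sketch, with entirely hypothetical timestamps chosen for illustration:

```python
from datetime import datetime

# Hypothetical timeline, purely for illustration.
last_good_copy   = datetime(2018, 3, 1, 2, 0)   # last known good copy of data
dr_initiated     = datetime(2018, 3, 1, 6, 30)  # disaster strikes, DR initiated
business_resumed = datetime(2018, 3, 1, 9, 30)  # data restored, business resumed

# RPO: data (measured in time) at risk between the last backup and the disaster.
achieved_rpo = dr_initiated - last_good_copy
# RTO: elapsed time from DR initiation until normal business resumes.
achieved_rto = business_resumed - dr_initiated

print(achieved_rpo)  # 4:30:00 -> up to 4.5 hours of data could be lost
print(achieved_rto)  # 3:00:00 -> 3 hours to recover
```

If the achieved RPO or RTO exceeds what the business can tolerate, the backup interval or the DR tier has to change.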
6.
What is Apache Solr?
Solr is the popular, blazing-fast, open source
enterprise search platform built on Apache
Lucene™
8.
CDCR Introduction
• Disasters strike without notice. IT Companies are always prepared with a redundant copy of their
database(s) in one or multiple databases on a secondary site, possibly far away from the primary.
• Cross data center replication (CDCR) is about keeping writes in sync across data centers to ensure
business continuity during a disaster.
• DR plans are extremely important for a consistent user experience and customer retention. Customer
retention is easier than acquiring new customers.
• CDCR can also be used for replicating a subset of your Production data to provide a production like test
environment for your developers and testers.
9.
CDCR prior to out of box support in Apache Solr
[Diagram: the client APPLICATION writes (1, 2) to both DC1 and DC2 directly; each data center holds Partition 1 through Partition 4.]
10.
1. Onus on Application to write to both data centers is
taken away.
2. Synchronization of data happens out of the box between
two Solr clusters.
3. Bi-directional is supported with minimal configuration
changes.
4. Multiple collections can also be replicated.
5. Asynchronous data transfer.
CDCR Support in Apache Solr post 6.6.x
11.
• Test A: CDCR on AWS: Across 2 regions (Virginia and Ohio)
using AWS provided VPC peering.
• Test B: CDCR on AWS: within the same region but different
availability zones.
• CDCR on premise: out of scope for this discussion; however,
the solution works equally well in on-premise data centers.
CDCR Scenarios
12.
Several assumptions were made while testing CDCR on AWS. They are outlined below.
• SOLR 7.2 was used for evaluation.
• Bidirectional CDCR, which is a new offering in SOLR 7.2, was tested.
• A single-node SOLR cluster with an external zookeeper was used in each region for cross-region CDCR.
For the test within the same region, two separate clusters were used.
• VPC peering is not available for all AWS regions. Inter-Region VPC Peering is available in AWS US East
(N. Virginia), US East (Ohio), US West (Oregon) and EU (Ireland). Virginia and Ohio were used as two data
centers with VPC peering enabled between them.
• No performance test was done around the CDCR process. 10,000 documents were indexed using curl
and basic out of the box settings for CDCR.
• Custom VPCs were created with different CIDR ranges, since the default VPC CIDR ranges in Virginia,
Ohio and Ireland overlapped. As per AWS, you cannot peer two VPCs if their CIDR ranges overlap.
Assumptions
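The overlap constraint can be verified up front with Python's standard library before any peering request is made. A quick sketch using the custom CIDR ranges from these tests (the 172.31.0.0/16 default-VPC CIDR is an assumption for illustration):

```python
import ipaddress

virginia_vpc = ipaddress.ip_network("10.0.0.0/16")  # custom VPC CIDR, Virginia
ohio_vpc     = ipaddress.ip_network("20.0.0.0/16")  # custom VPC CIDR, Ohio

# AWS rejects a VPC peering request if the two CIDR blocks overlap.
print(virginia_vpc.overlaps(ohio_vpc))  # False -> safe to peer

# Two default VPCs sharing the same CIDR would overlap, which is why
# custom VPCs with distinct ranges were created for this test.
default_a = ipaddress.ip_network("172.31.0.0/16")
default_b = ipaddress.ip_network("172.31.0.0/16")
print(default_a.overlaps(default_b))  # True -> cannot peer
```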
13.
Test A: Cross-region CDCR - AWS setup
For bidirectional CDCR to work across regions, the two regions' VPCs have to be peered. VPC peering
ensures that data can flow through the CDCR replicator across the network. This
section describes the AWS setup required to peer two VPCs across regions.
14.
AWS Setup
• Step 1: Create a VPC (CIDR 10.0.0.0/16) and a public subnet, attach an Internet gateway to the
VPC, and spin up two EC2 instances in the Virginia region. One EC2 instance will host the
zookeeper on 2181 and the other EC2 instance will host the SOLR instance on 8983.
• Step 2: Create a VPC (CIDR 20.0.0.0/16) and a public subnet, attach an Internet gateway to the
VPC, and spin up two EC2 instances in the Ohio region. One EC2 instance will host the
zookeeper on 2181 and the other EC2 instance will host the SOLR instance on 8983.
• Step 3: Create security groups in each region to allow inbound traffic on the following ports.
Port 22: For administering the EC2 instance. For Production, ensure that you allow traffic only
for the IP addresses that are authorized to install SOLR and Zookeeper on the EC2 instance.
Port 2181: For Zookeeper traffic
Port 8983: For SOLR traffic
Ephemeral traffic for all TCP ports between 0-65535 for the Security group itself.
15.
VPC Peering..continued
• From the Virginia VPC, generate a VPC peering request.
• Go to the Ohio VPC and accept the peering request.
• Once the peering request is accepted, a peering connection ID is generated. Add a route using this ID in each
region's route table, along with the other region's CIDR range. This ensures that cross-region traffic
from IPs within that CIDR range is routed over the peering connection.
16.
VPC Peering..continued
• Following is the route table for the Ohio VPC. 10.0.0.0/16 is the CIDR for the Virginia region.
• Following is the route table for the Virginia VPC. 20.0.0.0/16 is the CIDR for the Ohio region.
17.
Testing Cross region setup
• Before you start installing SOLR and Zookeeper on the EC2 instances, ensure that you are able to ping the
private IP addresses of each EC2 instance across regions. This will save you a lot of troubleshooting headaches
later on. Please note: since the security group does not have a rule to allow inbound ICMP traffic, ping
may not work. Allow ICMP traffic for testing, and disable the rule once the test is complete.
• It’s very critical that all EC2 instances can see each other for a successful CDCR. For on-premise networks
ensure all firewalls, ACL’s and other settings are in place before proceeding with the CDCR setup.
Duplicate the following on both regions.
1) Ensure that Java 1.8 is installed on both machines
2) Install SOLR 7.2 on one EC2 instance. Ensure that the /etc/hosts file on the EC2 instance reflects the
public IP of the machine.
3) Install Zookeeper 3.4.6 on another EC2 instance.
4) Start the Zookeeper instance on port 2181. If you would like to start the zookeeper on a different port,
adjust the security groups accordingly.
5) Start the SOLR instance in cloud mode (SolrCloud) on port 8983, pointing it at the Zookeeper on
port 2181. If you would like to start the SOLR instance on a different port, adjust the security
groups accordingly. CDCR does not work if the SOLR deployment mode is standalone.
Install SOLR and Zookeeper
18.
CDCR setup
For the test, default configuration files were used for creating a core. The following modifications were made to
the solrconfig.xml to enable it for CDCR. At a very high level, the idea is to allow the
Virginia cluster's zookeeper to talk to the Ohio cluster's zookeeper and vice versa.
For the sake of this conversation, let’s assume that the Virginia DC is our Source DC for indexing and the
Ohio DC is the target. As per the official SOLR documentation only one DC can act as a source for
indexing documents at a given time. If for any reason a decision is made to flip the primary data center,
then the new source for indexing will be the Ohio data center. For search queries, both data centers can
be used at the same time. The setup is more like an Active-Passive setup than an active-active DR cluster.
[Diagram: the Virginia DC (SOURCE) PUSHes updates to the Ohio DC (TARGET); each DC runs its own SOLR instance and zkHost.]
19.
Virginia DC setup..solrconfig.xml
• Make the following changes in the solrconfig.xml for the Virginia cluster. This enables CDCR from the Virginia
cluster to the Ohio cluster through the VPC-peered network.
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost"><<OhioZK:2181>></str>
    <str name="source">music</str> <!-- Source collection in Virginia -->
    <str name="target">music</str> <!-- Target collection in Ohio -->
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst> <!-- Missing in the documentation -->
</requestHandler>
20.
Virginia DC setup..continued
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog class="solr.CdcrUpdateLog">
<str name="dir">${solr.ulog.dir:}</str>
<!--Any parameters from the original <updateLog> section -->
</updateLog>
</updateHandler>
• Make the following changes in the solrconfig.xml for the Ohio cluster. This enables CDCR from the
Ohio cluster to the Virginia cluster through the VPC-peered network.
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost"><<VirginiaZK:2181>></str>
    <str name="source">music</str> <!-- Source collection in Ohio -->
    <str name="target">music</str> <!-- Target collection in Virginia -->
  </lst>
Ohio DC Setup..solrconfig.xml
21.
Ohio DC Setup..continued
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog class="solr.CdcrUpdateLog">
<str name="dir">${solr.ulog.dir:}</str>
<!--Any parameters from the original <updateLog> section -->
</updateLog>
</updateHandler>
Ensure that openSearcher in the solrconfig.xml is set to true if you want to make the documents
searchable once they are committed to the index.
• Data sourced from https://www.kaggle.com/edumucelli/spotifys-worldwide-daily-song-ranking/data.
• This dataset contains the daily ranking of the 200 most-listened songs in 53 countries during 2017 and
2018 by Spotify users. I have chosen 10,000 such records for the demo.
track: Name of the song
artist: Artist who performed it
streams: Total number of streams
url: URL on Spotify
era: Release date
region: Region code
22.
Enabling CDCR
• In the previous slides we enabled the configuration files for CDCR. SOLR offers a CDCR API to
interact with the CDCR handler added in the solrconfig.xml. This CDCR API allows us to set the
direction of the CDCR, check for the status of the CDCR, disable buffer, check for CDCR logs and
errors. Let’s enable CDCR from Virginia to Ohio.
• http://<<VirginiaSOLR:8983>>/solr/music/cdcr?action=DISABLEBUFFER
http://<<OhioSOLR:8983>>/solr/music/cdcr?action=DISABLEBUFFER
http://<<VirginiaSOLR:8983>>/solr/music/cdcr?action=START
• Nothing else needs to be done in the target data center.
This will set the Virginia cluster as the Primary data center where the indexing queries should go.
Once this cluster receives the updates, they will be forwarded to the Ohio data center through the
“replica” element setting in the solrconfig.xml. The search queries can occur on both data centers.
• The documents, once indexed in the Virginia cluster, should be available in the Ohio cluster. The replication
speed is largely dependent on the AWS backbone (due to VPC peering). I have noticed that a document is
available within 100 ms in the other region. I have tried to replicate 10,000 documents using curl and
didn’t see any performance degradation.
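The CDCR API calls above are plain HTTP GETs against the collection's /cdcr handler. A small helper, my own sketch rather than anything shipped with Solr, makes the enable sequence explicit (hostnames are placeholders, as on the slide):

```python
def cdcr_url(solr_host, collection, action):
    """Build a Solr CDCR API URL. Actions include START, STOP, STATUS,
    ENABLEBUFFER, DISABLEBUFFER, QUEUES, OPS, and ERRORS."""
    return "http://{}/solr/{}/cdcr?action={}".format(solr_host, collection, action)

# The enable sequence from this slide: disable buffers on both sides,
# then START on the source (Virginia).
enable_sequence = [
    cdcr_url("VirginiaSOLR:8983", "music", "DISABLEBUFFER"),
    cdcr_url("OhioSOLR:8983", "music", "DISABLEBUFFER"),
    cdcr_url("VirginiaSOLR:8983", "music", "START"),
]
for url in enable_sequence:
    print(url)  # issue each with curl or urllib.request.urlopen
```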
23.
CDCR tips
While CDCR is a great way to ensure that you always have a backup of your primary DC, it does
come with some limitations.
• CDCR is unlikely to be satisfactory with bulk operations
• CDCR works robustly when the Source and Target data centers have the same number of shards in
the collection.
• Running CDCR with the indexes on HDFS is currently not supported. There is an ongoing JIRA
issue for the same.
• Configuration files are not automatically synched between data centers as previously mentioned.
• Always stop the CDCR process if your backup data center is going to be out of service for an
indefinite amount of time.
24.
CDCR Tweaks
While the documentation clearly states that CDCR indexing operations should occur only on one data center
at a time (Active-Passive), I tried enabling CDCR in both directions.
Disclaimer: Before you implement this in your production cluster make sure that you understand the full
implications of enabling CDCR across both regions. I tested with around 10,000 documents that were
indexed at the same time on both clusters. I was able to successfully see 20,000 documents in each
cluster indicating that the data was successfully replicated in either cluster. This demonstrates that an
active active setup is possible and works well. However, a proper performance test should be conducted
with your use case to guarantee safe operation.
To enable CDCR on both clusters perform the following actions
On Virginia cluster
http://<<VirginiaSOLR:8983>>/solr/music/cdcr?action=DISABLEBUFFER
http://<<VirginiaSOLR:8983>>/solr/music/cdcr?action=START
On Ohio cluster
http://<<OhioSOLR:8983>>/solr/music/cdcr?action=DISABLEBUFFER
http://<<OhioSOLR:8983>>/solr/music/cdcr?action=START
Index documents on both data centers and they should be replicated across.
25.
Why stop here?
• The current flavor of CDCR supports replicating data to one or more target data centers. This
opens up a plethora of opportunities for interesting setups. Here is what I tried:
[Diagram: the Virginia data center (Production Primary) replicates to the Ohio data center (Production Secondary) and to an Ireland data center or an on-premise data center. Virginia's solrconfig.xml lists zkHost entries for Ohio and Ireland; Ohio's lists Virginia and Ireland; all three VPCs are peered pairwise. Clients use the Ireland or on-premise cluster for research and analytics.]
When indexing occurs in Virginia, it’s replicated to the
Ohio cluster and the Ireland cluster.
In the event of an outage the CDCR direction is
flipped from Ohio to Virginia using the CDCR API.
The data indexed from Ohio to Virginia is available in
Ireland as well.
The data in Ireland can be used by your Research
and Analytics group.
In rare scenarios it can also be used as a secondary
backup.
26.
Setup
The setup to replicate data to two data centers from a
source data center is pretty straightforward.
Repeat the replica element in the source data center's
solrconfig.xml for each cluster you want the data synced to.
If the CDCR direction is flipped and Ohio becomes
the new primary, the requirement is to sync data to
Virginia and Ireland.
Hence the solrconfig.xml files for Virginia and Ohio
have the zookeeper settings for Ireland as well.
The Ireland cluster is used purely as either a secondary
backup cluster or a cluster for research and
analytics.
Ensure that you create the appropriate collections on
each data center.
VPC peering needs to be done between Virginia and
Ohio, Ohio and Ireland and Virginia and Ireland.
Ensure that the CIDR block ranges in all 3 VPC’s
don’t overlap before VPC peering them.
Virginia solrconfig.xml
<lst name="replica">
  <str name="zkHost"><<OhioZK:2181>></str>
  <str name="source">music</str>
  <str name="target">music</str>
</lst>
<lst name="replica">
  <str name="zkHost"><<IrelandZK:2181>></str>
  <str name="source">music</str>
  <str name="target">music</str>
</lst>
Ohio solrconfig.xml
<lst name="replica">
  <str name="zkHost"><<VirginiaZK:2181>></str>
  <str name="source">music</str>
  <str name="target">music</str>
</lst>
<lst name="replica">
  <str name="zkHost"><<IrelandZK:2181>></str>
  <str name="source">music</str>
  <str name="target">music</str>
</lst>
27.
Example of a sync between two data centers
and your corporate data center.
28.
Few pointers
There are certain rules to follow while replicating data across regions.
1. The 3 peered VPC regions do not replicate data transitively. If Virginia is peered to
Ohio, Ohio is peered to Ireland, and Ireland in turn is peered to Virginia, data added in Virginia
will not loop back to Virginia. Cross-region replicated data only moves a single hop: if data is
added to the Virginia cluster, it ends up only in the regions it replicates to directly, and is not
forwarded to any other regions that are peered downstream of the direct ones.
2. In the scenario below, documents indexed in Virginia will not end up in California or Ireland via
Ohio.
[Diagram: Doc A is indexed in Virginia (index update). Virginia is VPC peered to Ohio (Replica 1); Ohio is VPC peered to Ireland and to California (Replica 2). The update travels only from Virginia to Ohio and is not forwarded on.]
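The single-hop rule can be modeled as a toy lookup, my own illustration rather than Solr code: a cluster forwards updates only to the targets listed in its own replica elements, and replicated updates are never forwarded onward.

```python
# Targets each cluster's solrconfig.xml replica elements point at
# (topology from the scenario above).
replica_targets = {
    "Virginia": ["Ohio"],
    "Ohio": ["Ireland", "California"],
}

def clusters_with_doc(indexed_at):
    """Clusters that end up holding a document indexed at `indexed_at`.
    CDCR forwards one hop only: no transitive replication."""
    return {indexed_at, *replica_targets.get(indexed_at, [])}

print(sorted(clusters_with_doc("Virginia")))
# ['Ohio', 'Virginia'] -- Ireland and California never see the document
```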
29.
Few pointers…continued
• Cross region VPC peering is constrained to the regions where cross region VPC peering is offered via AWS.
Third party tools can be used to extend the peering regions.
• The CIDR blocks for the regions you want to peer should not overlap. This is to ensure that the IP addresses
provided by AWS do not conflict across the regions.
• In the configuration I tried, I deployed the SOLR instances in a public subnet. The instances can instead be
deployed in a private subnet fronted by a NAT gateway, with an EC2 instance in the public subnet handling the
user queries. This shields the SOLR instances from the internet.
• The indexes can be stored on S3 as well to take advantage of the multi-level replication and data lifecycle
management. However, doing so would introduce latencies when the EC2 connection tries to reach out to S3
for indexing/querying data.
• Cross region replication can also be achieved by taking a snapshot of the EBS volume on the primary and
copying the image to the other region. However, this will not provide a real time index copy for DR reasons.
• For the example scenario I have used a standard SSD volume. In production scenarios, it’s highly
recommended to use Provisioned IOPS SSD for better performance. If your index/search requirements are
high, use EBS-backed EC2 instances. For cross-region replication, use network-optimized EC2 instances
for high volume queries.
• The SOLR configuration files for your core should be stored in a versioning system, e.g. GitHub. The
configuration files have to be applied separately on each DC. To update the configuration, stop indexing on
the secondary and ensure that you enable the BUFFER on the primary. Apply the configuration changes
to the secondary. Once the secondary is up, it will consume all the accrued updates from the primary
BUFFER. Move all indexing to the secondary and then follow the same steps to update the primary. It all
depends on the type of configuration changes that are pushed.
30.
Test B: CDCR on AWS: same region, different AZs
• This test was conducted on a single region but different availability zones. US-EAST-1 (N. Virginia) was
chosen for this test.
• 4 EC2 (2 SOLR and 2 ZK) instances were deployed in a public subnet. 2 instances were deployed in
AZ1 and the other 2 in AZ2.
• Configuration changes are exactly the same as TEST A.
• Since the communication between all the servers occur in the same region no additional changes are
needed in the route tables.
• Data was easily synced across the SOLR instances, which mimicked Data Center 1 and Data
Center 2.
31.
Cluster consistency
• Data synched between 2 regions is eventually consistent. From my tests I observed that data is
synched immediately within the 2 peered regions. However, I tested with a smaller volume of data.
During peak production volumes, when indexing and querying is occurring on the primary cluster,
it’s imperative that there is some kind of check to ensure that the data is synched correctly to your
backup SOLR cluster.
• When a document is indexed to the Primary cluster, SOLR uses the _version_ field to generate a
unique version number for the document. This version number is indexed, along with other data
elements in the document, to the Primary data center. The same version number is used while
synching the data to the secondary data center. SOLR does not generate a new version number for
the same document while updating the index to the secondary data center.
• To ensure cluster consistency, one can build a small utility to keep a check of these version
numbers across the two clusters. If DOC A is indexed in the primary cluster with VERSION 1, then
the same document will get indexed to the secondary cluster with the same version number. This
utility will compare the version numbers for the two documents. If they match, then the document is
in sync. If the version number in the Primary cluster is higher than the one in secondary then the
secondary has not received the update yet (eventual consistency). In this case the utility needs to
execute the same comparison on the document after some time.
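The comparison step of such a utility can be sketched as a pure function over {doc_id: _version_} maps pulled from each cluster. Fetching the maps (e.g. with /select?fl=id,_version_) is omitted; this is my own sketch, not an existing tool:

```python
def compare_versions(primary, secondary):
    """Classify primary documents by sync state against the secondary.
    Both arguments are {doc_id: _version_} maps."""
    in_sync, lagging = [], []
    for doc_id, version in primary.items():
        if secondary.get(doc_id) == version:
            in_sync.append(doc_id)   # same _version_ -> replicated
        else:
            lagging.append(doc_id)   # missing or older -> re-check later
    return in_sync, lagging

primary   = {"DOC_A": 1001, "DOC_B": 1002, "DOC_C": 1003}
secondary = {"DOC_A": 1001, "DOC_B": 900}  # DOC_B stale, DOC_C absent
in_sync, lagging = compare_versions(primary, secondary)
print(in_sync)   # ['DOC_A']
print(lagging)   # ['DOC_B', 'DOC_C'] -- eventual consistency: retry after a delay
```

Documents still lagging after several retries would warrant investigation rather than another retry.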