www.impetus.com
Handling
Data Corruption
in Elasticsearch
This white paper focuses on handling data corruption in Elasticsearch. It describes
how to recover data from corrupted Elasticsearch indices and re-index that
data into a new index. The paper also introduces Lucene's index file terminology.
What is Elasticsearch?

Elasticsearch is an open-source, schema-free, RESTful search engine built
on Apache Lucene. It provides a stand-alone database server for data intake and
storage in a format optimized for language-based searches, and a JSON-based
access API for ease of use.

An Elasticsearch cluster can be scaled horizontally by adding new nodes at
runtime to handle growing data volumes. It uses zen discovery for internal
coordination between the nodes of a cluster. Failover and high availability
can be achieved through replication and a distributed cluster setup.
Data Replication
Data replication is used for high data availability. For example, if the replication
factor is 1, there is one replica of each primary shard. With replication, the
chances of data loss are low: if a primary shard fails, a replica of that shard
takes over to keep the cluster in a stable state, and any query or other
operation is then served by that replica. This makes the data recoverable as
long as replication is enabled.
However, data replication has its own limitations, notably storage cost. When
users choose not to replicate because of storage constraints, recovering the
data of an index whose primary shard gets corrupted is a major challenge.
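For reference, the replica count is an ordinary index-level setting that can be changed at runtime. A sketch of enabling one replica per primary shard (the index name my_index is a placeholder, and a node is assumed to be listening on localhost:9200):

```shell
# Set one replica per primary shard for an existing index
# ('my_index' is a placeholder index name)
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
```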
Data Recovery from Corrupted Index
Data can be recovered from a corrupted index by reading the data files of the
index and re-indexing them into a new index. To recover the data this way,
however, all fields must be stored in Elasticsearch, which persists and indexes
the data as Lucene files.
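Because only stored fields can be read back from the field-data files, the mapping has to mark fields as stored up front. A hypothetical mapping fragment (type and field names are illustrative; the syntax follows the 0.90-era API):

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "title":   { "type": "string", "store": "yes" },
        "content": { "type": "string", "store": "yes" }
      }
    }
  }
}
```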
Each shard of the index may contain multiple segments; if any of them is
corrupt, the index becomes unstable. To make the data searchable, the index
must be in a stable state, which can be ensured in two ways:
• Run the optimize operation on the index to merge all segments of a shard
into one. This may cause data loss, because it drops the reference to the
segment whose data got corrupted.
• Recover the data by reading the data files and re-indexing them.
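The first option corresponds to the optimize API; a sketch of forcing a merge down to a single segment per shard (the index name is a placeholder, endpoint as in the 0.90-era API):

```shell
# Merge all segments of each shard of 'my_index' into one
# (assumes a node on localhost:9200; 'my_index' is a placeholder)
curl -XPOST 'http://localhost:9200/my_index/_optimize?max_num_segments=1'
```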
Lucene uses many files for an index. The table below highlights the four major
files that can be used to recover the data:

Name           Extension   Brief Description
Fields         .fnm        Stores information about the fields
Field Index    .fdx        Contains pointers to field data
Field Data     .fdt        The stored fields for documents
Segment Info   .si         Stores metadata about a segment
Note: If any of these files is corrupt, data loss is possible when there is no
replication.
There are four steps to recover data from a corrupted index, detailed below.

Identify corrupted shards of index

Before recovery, identify the shard id of the corrupted shard of the index.
Corrupted shards can be identified by their UNASSIGNED state. First, however,
ensure that the whole cluster is running and all nodes are up. The list of
unassigned shards can be obtained from the Elasticsearch cluster state. There
are different ways of getting the cluster state, for example a curl request:

$ curl -XGET 'http://localhost:9200/_cluster/state'
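The cluster state response can be large; one way to spot the unassigned entries is to pretty-print it and grep (a sketch; assumes a cluster on localhost:9200 and a Python interpreter used only for formatting):

```shell
# Show the lines of the cluster state that mention the UNASSIGNED state
curl -s -XGET 'http://localhost:9200/_cluster/state' \
  | python -m json.tool \
  | grep -n 'UNASSIGNED'
```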
Identify shard's index directory

The shard directory can be derived from the Elasticsearch home and the cluster
name. If there is only one node on the machine, the shard id and index name
identify the shard directory:

String shardDir = new StringBuilder().append(esHome).append("/")
        .append(dataDirectoryName).append("/").append(clusterName)
        .append("/nodes/0/indices/").append(indexName).append("/")
        .append(shardId).append("/index").toString();
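The same path logic can be condensed into a single format string; a self-contained sketch with illustrative values (the home directory, cluster, and index names below are made up, not from a real deployment):

```java
public class ShardPathExample {
    // Mirrors the StringBuilder logic:
    // <esHome>/<dataDir>/<cluster>/nodes/0/indices/<index>/<shard>/index
    static String shardIndexDir(String esHome, String dataDir,
                                String clusterName, String indexName, int shardId) {
        return String.format("%s/%s/%s/nodes/0/indices/%s/%d/index",
                esHome, dataDir, clusterName, indexName, shardId);
    }

    public static void main(String[] args) {
        // Hypothetical single-node layout
        System.out.println(shardIndexDir("/opt/elasticsearch", "data",
                "elasticsearch", "myindex", 0));
        // prints /opt/elasticsearch/data/elasticsearch/nodes/0/indices/myindex/0/index
    }
}
```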
Read data of corrupted shard using .fdt, .fdx files

An index may contain a number of segments, which must be identified so that
the data of each segment can be read. After reading a document from a
segment, you can insert it into another index.

Sample code to read data from an index using the .fdt, .fdx, .fnm, and .si
files is given below:

public void readAndReindexData(String indexName, String indexDir,
        String newIndexName) {
    try {
        Codec codec = new Lucene42Codec();
        File indexDirectory = new File(indexDir);
        Directory dir = FSDirectory.open(indexDirectory);
        List<String> segmentList = new ArrayList<String>();
        // Identify the segments by listing the files in the shard
        // directory; each segment has a .si file
        for (File f : FileUtils.listFiles(indexDirectory,
                new RegexFileFilter("_.*\\.si"), null)) {
            String s = f.getName();
            segmentList.add(s.substring(0, s.indexOf('.')));
        }
        int total = 0;
        // Iterate over each segment of the shard and re-index its documents
        for (String segmentName : segmentList) {
            try {
                IOContext ioContext = new IOContext();
                SegmentInfo segmentInfos = codec.segmentInfoFormat()
                        .getSegmentInfoReader()
                        .read(dir, segmentName, ioContext);
                Directory segmentDir;
                if (segmentInfos.getUseCompoundFile()) {
                    segmentDir = new CompoundFileDirectory(dir,
                            IndexFileNames.segmentFileName(segmentName, "",
                                    IndexFileNames.COMPOUND_FILE_EXTENSION),
                            ioContext, false);
                } else {
                    segmentDir = dir;
                }
                // Collect the field information
                FieldInfos fieldInfos = codec.fieldInfosFormat()
                        .getFieldInfosReader()
                        .read(segmentDir, segmentName, ioContext);
                StoredFieldsReader storedFieldsReader = codec.storedFieldsFormat()
                        .fieldsReader(segmentDir, segmentInfos, fieldInfos, ioContext);
                total = total + segmentInfos.getDocCount();
                for (int i = 0; i < segmentInfos.getDocCount(); ++i) {
                    try {
                        DocumentStoredFieldVisitor visitor =
                                new DocumentStoredFieldVisitor();
                        storedFieldsReader.visitDocument(i, visitor);
                        Document doc = visitor.getDocument();
                        // Get the list of fields of the document
                        List<IndexableField> list = doc.getFields();
                        Map<String, Object> tempMap =
                                new HashMap<String, Object>();
                        for (IndexableField indexableField : list) {
                            tempMap.put(indexableField.name(),
                                    indexableField.stringValue());
                        }
                        // Re-index the document in the new index
                        this.index(tempMap, newIndexName);
                    } catch (Exception e) {
                        System.out.println("Couldn't get document " + i
                                + ", stored fields corruption.");
                    }
                }
            } catch (Exception e) {
                // The segment could not be read; skip it
            }
        }
        System.out.println(total + " documents recovered.");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
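The segment discovery at the top of the method reduces to stripping the extension from each _N.si file name; a self-contained sketch with made-up file names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentNameExample {
    // Keep only names matching the _N.si pattern and drop the extension,
    // as the FileUtils.listFiles loop above does
    static List<String> segmentNames(List<String> files) {
        List<String> names = new ArrayList<String>();
        for (String f : files) {
            if (f.matches("_.*\\.si")) {
                names.add(f.substring(0, f.indexOf('.')));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(segmentNames(
                Arrays.asList("_0.si", "_0.fdt", "_1.si", "segments.gen")));
        // prints [_0, _1]
    }
}
```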
Re-index data in new index

When you read a document from the index, it contains the _uid and _source
fields. The document id can be obtained from the _uid field. Before indexing
the document, remove the _uid and _source fields, because these two fields
are added automatically whenever a document is indexed.

Sample code to re-index the documents using the same document ids is given
below:

// Re-index the document in the new index
private void index(Map<String, Object> record, String newIndexName) {
    String docId = ((String) record.get("_uid")).split("#")[1];
    String mappingType = ((String) record.get("_uid")).split("#")[0];
    record.remove("_uid");
    record.remove("_source");
    IndexRequest indexRequest = new IndexRequest(newIndexName,
            mappingType, docId);
    indexRequest.source(record);
    BulkRequestBuilder bulkRequestBuilder = client.prepareBulk();
    bulkRequestBuilder.add(indexRequest);
    bulkRequestBuilder.execute().actionGet();
}

Testing Environment:
Elasticsearch - 0.90.5
Java - 1.6.45
Operating System - RHEL

Conclusion

As data volumes grow rapidly, replicating data becomes costly for
organizations in terms of storage. The approach described in this paper
addresses that challenge effectively, helping organizations recover data from
a corrupted Elasticsearch index even without replication.

About Impetus

Impetus is a Software Solutions and Services Company with deep technical
maturity that brings you thought leadership, proactive innovation, and a
track record of success. Our Services and Solutions portfolio includes
Carrier grade large systems, Big Data, Cloud, Enterprise Mobility, and Test
and Performance Engineering.

Visit www.impetus.com or write to us at inquiry@impetus.com

© 2014 Impetus Technologies, Inc. All rights reserved. Product and company
names mentioned herein may be trademarks of their respective companies.
August 2014