Impetus White Paper- Handling Data Corruption in Elasticsearch
Handling Data Corruption in Elasticsearch

This white paper focuses on handling data corruption in Elasticsearch. It describes how to recover data from corrupted Elasticsearch indices and re-index that data into a new index. The paper also introduces the Lucene index file terminology involved.
What is Elasticsearch?

Elasticsearch is an open-source, schema-free, RESTful search engine built on Apache Lucene. It provides a stand-alone database server for data intake and storage in a format optimized for language-based searches, along with a JSON-based access API for ease of use.

An Elasticsearch cluster can be scaled horizontally by adding new nodes at runtime to cater to increasing data volumes. It uses Zen discovery for internal coordination between the nodes of a cluster. Failover and high availability can be achieved through replication and a distributed cluster setup.
Data Replication

Data replication is used for high data availability. For example, if the replication factor is 1, there is one replica of each primary shard. With replication in place, the chances of data loss are low: if a primary shard fails, its replica is used to keep the cluster in a stable state, and any query or other operation is then served by that replica. Replication therefore lets us recover the data after a shard failure.

However, data replication has its own limitations, notably the extra storage it consumes. Where users choose not to replicate because of storage constraints, recovering the data of an index whose primary shard gets corrupted is a major challenge.
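The storage overhead of replication can be seen with a quick calculation. This sketch assumes a hypothetical index with 5 primary shards and a replication factor of 1:

```shell
# Hypothetical index layout: 5 primary shards, replication factor 1.
primaries=5
replicas=1
# Every primary shard gets one full copy per replica, doubling storage here.
total_shards=$((primaries * (1 + replicas)))
echo "$total_shards shard copies to store"
```

With replication factor 1 the cluster stores twice the data; this is the storage cost the paper refers to.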
Data Recovery from Corrupted Index
Data can be recovered from a corrupted index by reading the data files of the index and re-indexing them into a new index. For this to work, however, all fields must be stored in Elasticsearch, which persists and indexes the data as Lucene files.

Each shard of the index may have multiple segments, and a corrupted segment makes the index unstable. To make the data searchable, the index must be in a stable state, which can be reached in two ways:

• Run the optimize operation on the index to merge all segments of a shard into one. This may cause data loss, since it removes the reference to the segment whose data got corrupted.
• Recover the data by reading the data files and re-indexing them.
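As a sketch of the first option, the optimize API of the Elasticsearch 1.x-era releases this paper targets can merge each shard down to a single segment. The index name myindex and the localhost endpoint are assumptions; a running cluster is required:

```shell
# Merge all segments of each shard of "myindex" into one (illustrative;
# assumes a cluster listening on localhost:9200).
curl -XPOST 'http://localhost:9200/myindex/_optimize?max_num_segments=1'
```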
Lucene uses many files for an index. The table below highlights the four major
files that can be used to recover the data:
Name           Extension   Brief Description
Fields         .fnm        Stores information about the fields
Field Index    .fdx        Contains pointers to field data
Field Data     .fdt        The stored fields for documents
Segment Info   .si         Stores metadata about a segment
Note: If any of these files is corrupted, data loss is possible when the replication factor is zero.
There are four steps to recover data from the corrupted index, which are
detailed below:
Identify corrupted shards of the index

Before data recovery, it is important to identify the shard id of the corrupted shard of the index. Corrupted shards can be identified by the UNASSIGNED state of a shard; however, you need to ensure that the whole cluster is running and all the nodes are up. The list of unassigned shards can be found in the Elasticsearch cluster state, which can be retrieved in different ways, for example with a curl request:

$ curl -XGET 'http://localhost:9200/_cluster/state'
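Unassigned shards appear in the routing_table section of the cluster state. The JSON fragment below is hand-written for illustration (the index name myindex and shard number 2 are assumptions), but the same grep works on real cluster-state output:

```shell
# Illustrative, trimmed routing_table fragment as the /_cluster/state API
# might return it; shard 2 of "myindex" has no assigned copy.
cat > cluster_state_sample.json <<'EOF'
{"routing_table":{"indices":{"myindex":{"shards":{
  "0":[{"state":"STARTED","shard":0,"index":"myindex"}],
  "2":[{"state":"UNASSIGNED","shard":2,"index":"myindex"}]}}}}}
EOF
# Pick out the entries whose state is UNASSIGNED.
grep -o '"state":"UNASSIGNED","shard":[0-9]*' cluster_state_sample.json
```

The shard number printed this way is the shard id needed in the next step.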
Identify the shard's index directory

You can locate the shard directory using logic based on the Elasticsearch home and the cluster name. If there is only one node on the machine, the shard id and the index name are enough to identify the shard directory:

String shardDir = new StringBuilder()
        .append(esHome).append("/")
        .append(dataDirectoryName).append("/")
        .append(clusterName)
        .append("/nodes/0/indices/")
        .append(indexName).append("/")
        .append(shardId)
        .append("/index").toString();
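The same path can be assembled in the shell. Every value below is a placeholder for your own installation, and the nodes/0 component assumes a single node per machine:

```shell
# Placeholder values -- substitute those of your own installation.
ES_HOME=/opt/elasticsearch
DATA_DIR=data
CLUSTER_NAME=elasticsearch
INDEX_NAME=myindex
SHARD_ID=2
# Shard 2 of "myindex" on the only node of this machine.
SHARD_DIR="$ES_HOME/$DATA_DIR/$CLUSTER_NAME/nodes/0/indices/$INDEX_NAME/$SHARD_ID/index"
echo "$SHARD_DIR"
```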
Read data of the corrupted shard using .fdt and .fdx files

An index may contain a number of segments, which you need to identify before reading the data of each particular segment. After reading a document from a segment, you can insert it into another index.

Sample code to read data from an index using the .fdt, .fdx, .fnm, and .si files (against the Lucene 4.2 APIs) is given below:

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.RegexFileFilter;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DocumentStoredFieldVisitor;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.store.CompoundFileDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;

public void readAndReindexData(String indexName, String indexDir, String newIndexName) {
    try {
        Codec codec = new Lucene42Codec();
        File indexDirectory = new File(indexDir);
        Directory dir = FSDirectory.open(indexDirectory);
        List<String> segmentList = new ArrayList<String>();
        // Identify the segment list by listing the files in the shard
        // directory. Each segment has a .si file.
        for (File f : FileUtils.listFiles(indexDirectory,
                new RegexFileFilter("_.*\\.si"), null)) {
            String name = f.getName();
            segmentList.add(name.substring(0, name.indexOf('.')));
        }
        int total = 0;
        // Iterate over each segment of the shard and re-index its documents.
        for (String segmentName : segmentList) {
            try {
                IOContext ioContext = IOContext.READ;
                SegmentInfo segmentInfo = codec.segmentInfoFormat()
                        .getSegmentInfoReader().read(dir, segmentName, ioContext);
                // A segment's files may be packed into one compound (.cfs) file.
                Directory segmentDir;
                if (segmentInfo.getUseCompoundFile()) {
                    segmentDir = new CompoundFileDirectory(dir,
                            IndexFileNames.segmentFileName(segmentName, "",
                                    IndexFileNames.COMPOUND_FILE_EXTENSION),
                            ioContext, false);
                } else {
                    segmentDir = dir;
                }
                // Collect the field information (.fnm).
                FieldInfos fieldInfos = codec.fieldInfosFormat()
                        .getFieldInfosReader().read(segmentDir, segmentName, ioContext);
                // Open a reader over the stored fields (.fdx and .fdt).
                StoredFieldsReader storedFieldsReader = codec.storedFieldsFormat()
                        .fieldsReader(segmentDir, segmentInfo, fieldInfos, ioContext);
                total = total + segmentInfo.getDocCount();
                for (int i = 0; i < segmentInfo.getDocCount(); ++i) {
                    try {
                        DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor();
                        storedFieldsReader.visitDocument(i, visitor);
                        Document doc = visitor.getDocument();
                        // Get the list of fields of the document.
                        List<IndexableField> fields = doc.getFields();
                        Map<String, Object> tempMap = new HashMap<String, Object>();
                        for (IndexableField indexableField : fields) {
                            tempMap.put(indexableField.name(), indexableField.stringValue());
                        }
                        // Re-index the document in the new index.
                        this.index(tempMap, newIndexName);
                    } catch (Exception e) {
                        System.out.println("Couldn't get document " + i
                                + ", stored fields corruption.");
                    }
                }
            } catch (Exception e) {
                System.out.println("Couldn't read segment " + segmentName + ": " + e);
            }
        }
        System.out.println(total + " documents recovered.");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
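The segment-discovery step of the code above can be mimicked in the shell to sanity-check a shard directory before running the recovery. The mock directory and files under /tmp are purely illustrative:

```shell
# Build a mock shard index directory with two segments (_0 and _1).
mkdir -p /tmp/shard0/index
touch /tmp/shard0/index/_0.si /tmp/shard0/index/_0.fdt \
      /tmp/shard0/index/_0.fdx /tmp/shard0/index/_1.si
# Each segment is identified by its .si file, as in the Java loop above.
for f in /tmp/shard0/index/_*.si; do
  echo "segment: $(basename "$f" .si)"
done
```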
Re-index data in the new index

When you read a document from the index, it contains the _uid and _source fields. You can get the document id from the _uid field. Before indexing the document into the new index, you need to remove the _uid and _source fields, because Elasticsearch adds these two fields by default whenever a document is indexed.
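A sketch of the clean-up step described above, with a recovered document represented as name=value lines; the field names other than _uid and _source are invented for illustration:

```shell
# A recovered document as the field map might look before clean-up.
cat > recovered_doc.txt <<'EOF'
_uid=logs#1
_source={"msg":"hello","level":"INFO"}
msg=hello
level=INFO
EOF
# Drop the _uid and _source entries before re-indexing the rest.
grep -v -e '^_uid=' -e '^_source=' recovered_doc.txt
```

Only the user-defined fields survive the filter and are submitted to the new index.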