5. Goal
• Like Splunk
• Indexing streaming log data
• Search log data in real-time
6. Big data
• Data sets too large and complex for a conventional database
• Difficult to process with traditional data processing tools
• 3Vs
• Volume : Large quantity of data
• Variety : Diverse set of data
• Velocity : Speed at which data arrives
Source: Wikipedia
7. About Fastcatsearch
• Distributed system
• Fast indexing
• Fast queries
• Popular keyword
• GS certification
• 70+ references
• Open source
• Muti-platform
• Easy web management tool
• Dictionary management
• Plugin extension
17. Fastcatsearch
HDFS Indexer
Merger
Segment Segment Segment Searcher
Index File
Issue
- Segment file commit
- Doc deletion
18. Import using Flume
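import java.net.URI;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Read every SequenceFile under dirPath and queue the parsed events.
// uriPath, dirPath, parseEvent and eventQueue are defined elsewhere.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uriPath), conf);
FileStatus[] status = fs.listStatus(new Path(dirPath));
for (int i = 0; i < status.length; i++) {
    // The reader option depends on status[i], so build it inside the loop.
    SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath());
    // try-with-resources closes the reader after each file.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt)) {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString());
            if (parsedEvent != null) {
                eventQueue.add(parsedEvent);
            }
        }
    }
}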
19. Making index segment
• Index has multiple segments
• Document writer
• Index writer
• Search index writer
• Field index writer
• Group index writer
20. Segment commit issue
• Update / Delete documents
• Not in-place update
• Append and delete operation
• Deletion for previous segments
• Mark as deleted
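The append-and-delete scheme above can be sketched roughly as follows. This is a minimal illustration, not Fastcatsearch's actual code: the class names `SegmentSketch` and `AppendOnlyIndexSketch`, the string document ids, and applying deletion marks immediately rather than at commit time are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical segment: append-only document storage plus a bitmap of deletion marks.
final class SegmentSketch {
    final List<String> ids = new ArrayList<>();   // document ids, in append order
    final List<String> docs = new ArrayList<>();  // document bodies, never rewritten
    final BitSet deleted = new BitSet();          // bit i set = document i is marked deleted

    void append(String id, String body) {
        ids.add(id);
        docs.add(body);
    }

    // Marks every live copy of the id as deleted; the stored document stays where it is.
    void markDeleted(String id) {
        for (int i = 0; i < ids.size(); i++) {
            if (ids.get(i).equals(id)) {
                deleted.set(i);
            }
        }
    }

    int liveCount() {
        return ids.size() - deleted.cardinality();
    }
}

// Hypothetical index: a list of committed segments plus one segment being filled.
final class AppendOnlyIndexSketch {
    final List<SegmentSketch> committed = new ArrayList<>();
    private SegmentSketch pending = new SegmentSketch();

    // An update is never in place: mark older copies deleted, then append the new copy.
    void addOrUpdate(String id, String body) {
        delete(id);
        pending.append(id, body);
    }

    // Deletion only sets marks, including in previously committed segments.
    void delete(String id) {
        for (SegmentSketch segment : committed) {
            segment.markDeleted(id);
        }
        pending.markDeleted(id);
    }

    // Commit turns the pending segment into a new committed segment.
    void commit() {
        if (!pending.ids.isEmpty()) {
            committed.add(pending);
            pending = new SegmentSketch();
        }
    }

    int liveCount() {
        int count = pending.liveCount();
        for (SegmentSketch segment : committed) {
            count += segment.liveCount();
        }
        return count;
    }
}
```

Updating a document that lives in an older segment leaves that segment's data untouched and only flips a bit in its deletion bitmap, which is why deleted documents keep occupying space until a merge.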
21. Segment merge issue
• Performance
• 2(n+m) in time and space
• Size Compaction - Deleted docs removed.
segment #1 + segment #2 + segment #3 -> merge to new segment -> segment #4
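As a rough sketch of the compaction step, a merge can copy only the live documents of its input segments into one new segment. The `MergeSketch` and `Seg` names and the in-memory representation are hypothetical and chosen for illustration; a real merge would also rebuild the index structures on disk.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class MergeSketch {
    // Hypothetical segment: documents plus a bitmap of deletion marks.
    static final class Seg {
        final List<String> docs;
        final BitSet deleted;

        Seg(List<String> docs, BitSet deleted) {
            this.docs = docs;
            this.deleted = deleted;
        }
    }

    // Reads every input segment once and writes the live documents, in order,
    // into one new segment. Documents marked as deleted are dropped here.
    static Seg merge(List<Seg> inputs) {
        List<String> merged = new ArrayList<>();
        for (Seg segment : inputs) {
            for (int i = 0; i < segment.docs.size(); i++) {
                if (!segment.deleted.get(i)) {
                    merged.add(segment.docs.get(i));
                }
            }
        }
        // The new segment starts with no deletion marks.
        return new Seg(merged, new BitSet());
    }
}
```

Because the inputs are read in full and a fresh segment is written alongside them, the merge temporarily needs space for both the old and the new data.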
22. Segment merge issue
• Why merge?
• Segment count grows fast
• Search index = Search all leaf segments in turn
• Document deletion
23. Inverted Indexing
Posting index term1
term3 term5 term7
Postings
term1 posting1 term2 posting2 term3 posting3
term4 posting4 term5 posting5 term6 posting6
Good for sequential writing to disk
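The layout on this slide can be sketched as below: postings are built in memory, then written in term order in a single forward pass, while a sparse posting index records the byte offset of every few terms. This is an illustrative sketch only. Whitespace tokenization, the `interval` parameter, and the record layout (term, count, document numbers) are assumptions, not the Fastcatsearch file format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // Builds term -> ascending document numbers. TreeMap keeps the terms sorted.
    static TreeMap<String, List<Integer>> build(List<String> docs) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int docNo = 0; docNo < docs.size(); docNo++) {
            for (String term : docs.get(docNo).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docNo) {
                    list.add(docNo);
                }
            }
        }
        return postings;
    }

    // Writes all postings in term order in one sequential pass and returns a
    // sparse posting index: every interval-th term mapped to its byte offset.
    static TreeMap<String, Integer> write(TreeMap<String, List<Integer>> postings,
                                          int interval,
                                          ByteArrayOutputStream sink) {
        TreeMap<String, Integer> postingIndex = new TreeMap<>();
        DataOutputStream out = new DataOutputStream(sink);
        try {
            int termNo = 0;
            for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
                if (termNo % interval == 0) {
                    postingIndex.put(entry.getKey(), out.size()); // bytes written so far
                }
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().size());
                for (int docNo : entry.getValue()) {
                    out.writeInt(docNo);
                }
                termNo++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return postingIndex;
    }
}
```

A lookup would find the closest preceding term in the posting index, seek to its offset, and scan forward. The writer never seeks backwards, which is what makes this layout friendly to sequential disk writes.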
24. Inverted Indexing
What about a B-tree?
block
block block block
Memory
block block block block block block
block block block block block block block block
Flushing causes a large amount of random writes to disk
File
25. Search in realtime
seg #1 seg #2 seg #3 seg #4
1. Newly created segment
Searchable data
26. Search in realtime
seg #1 seg #2 seg #3 seg #4
2. Merge segments
Searchable data
27. Search in realtime
seg #1 seg #2 seg #3 seg #4 seg #5
3. New merged segment
4. Remove segments
Searchable data
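One way to keep data searchable through the four steps on these slides is to let searchers read an immutable snapshot of the segment list and replace that snapshot atomically. The sketch below assumes this approach; the `RealtimeSegmentsSketch` class, its method names, and segments modeled as plain lists of strings are hypothetical and not taken from Fastcatsearch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class RealtimeSegmentsSketch {
    // Searchers always read one immutable snapshot of the segment list.
    private final AtomicReference<List<List<String>>> live =
            new AtomicReference<>(Collections.<List<String>>emptyList());

    // Step 1: a newly created segment becomes searchable once it is published.
    void publish(List<String> segment) {
        live.updateAndGet(current -> {
            List<List<String>> next = new ArrayList<>(current);
            next.add(segment);
            return Collections.unmodifiableList(next);
        });
    }

    // Steps 2-4: merge the inputs, then add the merged segment and remove the
    // inputs in one swap, so a searcher never sees a document twice or not at all.
    void mergeAndSwap(List<List<String>> inputs) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : inputs) {
            merged.addAll(segment);
        }
        live.updateAndGet(current -> {
            List<List<String>> next = new ArrayList<>();
            for (List<String> segment : current) {
                if (!isInput(inputs, segment)) {
                    next.add(segment);
                }
            }
            next.add(merged);
            return Collections.unmodifiableList(next);
        });
    }

    // Compares by identity so that two segments with equal content stay distinct.
    private static boolean isInput(List<List<String>> inputs, List<String> segment) {
        for (List<String> input : inputs) {
            if (input == segment) {
                return true;
            }
        }
        return false;
    }

    // Searches every live segment in turn and collects the matching documents.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : live.get()) {
            for (String doc : segment) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    List<List<String>> snapshot() {
        return live.get();
    }
}
```

A search that started before the swap keeps reading its old snapshot, so the old segments can only be physically removed after those searches finish.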