Use bulk requests for indexing, creating, updating and deleting
Measure bulk size in bytes, not in number of documents
If in doubt, use smaller batch sizes
Parallelize multiple bulk requests
Use asynchronous calls (see the sketch below)
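A minimal sketch of size-capped bulk indexing, assuming the official elasticsearch-py client (a reasonably recent pre-8.x version) against a local node; index, type and field names are invented:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()  # assumes a node on localhost:9200

    def actions():
        # A generator keeps memory flat; each dict becomes one bulk item.
        for i in range(100000):
            yield {"_index": "logs", "_type": "log", "_id": i,
                   "_source": {"message": "event %d" % i}}

    # Cap batches by byte size, not only by document count.
    success, errors = helpers.bulk(es, actions(),
                                   chunk_size=1000,
                                   max_chunk_bytes=5 * 1024 * 1024)

    # helpers.parallel_bulk (available in newer client versions) runs
    # several bulk requests concurrently; it returns a generator that
    # must be consumed.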
Turn off refresh while indexing (refresh_interval: -1)
Delay flushes
Throttle merging
Maybe increase indices.memory.index_buffer_size
Set replicas to zero during the initial bulk load only; restore them afterwards
Disable index warmers while bulk indexing
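For example, the relevant settings can be flipped around a bulk load; a sketch using the elasticsearch-py client with 1.x-era setting names:

    # Before the bulk load: no refresh, no replicas.
    es.indices.put_settings(index="logs", body={
        "index": {"refresh_interval": "-1",
                  "number_of_replicas": 0}})

    # ... run the bulk indexing ...

    # Afterwards: restore refresh and replicas.
    es.indices.put_settings(index="logs", body={
        "index": {"refresh_interval": "1s",
                  "number_of_replicas": 1}})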
Do you really need your _all field?
Reconsider whether you need the _source field and individually stored fields
Reduce analysis
Disable field norms where you don't need them
Drop term frequencies and positions where you don't need them
not_analyzed is your friend
Dynamic mapping is for playtime, not production
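A sketch of such a trimmed-down mapping, in 1.x-era syntax (type, field and index names are invented; exact option names vary by version):

    es.indices.put_mapping(index="logs", doc_type="log", body={
        "log": {
            "_all": {"enabled": False},             # skip the catch-all field
            "dynamic": "strict",                    # no surprise fields in production
            "properties": {
                "status": {"type": "string",
                           "index": "not_analyzed",     # exact value, no analysis
                           "norms": {"enabled": False}, # no length normalization
                           "index_options": "docs"}     # no term freqs/positions
            }
        }
    })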
Filters do no scoring
Filter results can be cached
Most simple filters are cached by default, but not all (geo filters are not)
Compound filters are not cached
Explicitly control caching with _cache (example below)
bool filters consult the cache for their sub-filters, but and/or/not filters don't
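For instance, caching can be forced on a normally uncached filter; a sketch in the 1.x query DSL (index and field names invented):

    es.search(index="venues", body={
        "query": {"filtered": {
            "query": {"match_all": {}},
            "filter": {"geo_distance": {
                "distance": "10km",
                "location": {"lat": 52.52, "lon": 13.40},
                "_cache": True   # geo filters are not cached unless asked
            }}
        }}
    })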
Moving Target
Consider the scope of the filter: you probably want a filtered query
A top-level filter is applied after the query, but not in a "filtered query"
Regular queries query first and filter afterwards
A filtered query filters first
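The difference, sketched in the 1.x DSL (a top-level post_filter versus a filtered query; names invented):

    # Scores everything, filters only the final hits:
    es.search(index="products", body={
        "query": {"match": {"title": "shoes"}},
        "post_filter": {"term": {"in_stock": True}}})

    # Filters first, then queries only the remaining documents:
    es.search(index="products", body={
        "query": {"filtered": {
            "query": {"match": {"title": "shoes"}},
            "filter": {"term": {"in_stock": True}}}}})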
Elements of a bool filter are executed sequentially
Place the most restrictive filter first
Accelerator filter: an additional, cheap filter on general terms
Better for caching
Reduces the work for heavyweight filters (see the sketch below)
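A sketch of that ordering: a cheap, cacheable term filter runs first and shrinks the candidate set before an expensive geo filter sees it (names invented):

    es.search(index="venues", body={
        "query": {"filtered": {
            "query": {"match_all": {}},
            "filter": {"bool": {"must": [
                # Cheap accelerator filter first: cacheable and selective.
                {"term": {"city": "berlin"}},
                # Heavyweight filter second: sees far fewer candidates.
                {"geo_distance": {"distance": "2km",
                                  "location": {"lat": 52.52, "lon": 13.40}}}
            ]}}
        }}
    })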
Pagination
Don't load too many results at once
Avoid deep pagination
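Deep from/size pagination forces every shard to collect from+size hits. One common alternative (not from the original notes) is to scroll through results, e.g. with the client's scan helper:

    from elasticsearch import helpers

    # Streams all matching documents without deep from/size pages.
    for hit in helpers.scan(es, index="logs",
                            query={"query": {"match": {"message": "error"}}}):
        print(hit["_id"])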
Index-time vs. query-time optimizations: try to do the work at index time
E.g. a prefix query vs. edge n-grams (see the sketch below)
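A sketch of moving prefix matching to index time with an edge n-gram filter (1.x-era settings; analyzer, index and field names invented, exact type names vary by version):

    es.indices.create(index="products", body={
        "settings": {"analysis": {
            "filter": {"front_edges": {"type": "edge_ngram",
                                       "min_gram": 1, "max_gram": 10}},
            "analyzer": {"autocomplete": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "front_edges"]}}}},
        "mappings": {"product": {"properties": {
            "name": {"type": "string",
                     "index_analyzer": "autocomplete",   # n-grams at index time
                     "search_analyzer": "standard"}}}}   # plain terms at query time
    })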
Warm up "common queries" with index warmers
Turn on the slow log
Use multi-search if applicable
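Multi-search batches several searches into one request; a sketch with the Python client (indices and queries invented):

    responses = es.msearch(body=[
        {"index": "products"},                     # header for query 1
        {"query": {"match": {"name": "shoe"}}},    # body for query 1
        {"index": "logs"},                         # header for query 2
        {"query": {"term": {"level": "error"}}},   # body for query 2
    ])
    for resp in responses["responses"]:
        print(resp["hits"]["total"])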
Load lazily as much as possible
Hide fields that are needed less often
Load data only once during pagination
Sorting, for example:
Field data is stored in RAM
Expensive for the JVM: garbage-collection issues
The OS file system cache can take care of that instead (doc values on disk)
Slightly slower
Test them!
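A sketch of enabling disk-based doc values for a sort field (1.x-era mapping; names invented):

    es.indices.put_mapping(index="logs", doc_type="log", body={
        "log": {"properties": {
            "timestamp": {"type": "date",
                          "doc_values": True}}}})  # served via the FS cache, not heap

    # Sorting on the field now reads doc values from disk:
    es.search(index="logs", body={
        "query": {"match_all": {}},
        "sort": [{"timestamp": "desc"}]})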
An update is a delete + add
Partial updates still read and reindex the whole document
Even "small" updates can be expensive
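A partial update, sketched with the Python client (old-style doc_type API; names invented). Convenient, but internally it still rewrites the whole document:

    es.update(index="users", doc_type="user", id="42",
              body={"doc": {"last_login": "2015-01-01"}})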
Sequential IDs allow optimized (binary) storage
Java's UUIDs are truly random
Internally, Elasticsearch uses Flake IDs
Multiple shards allow parallel writes
Multiple replicas allow parallel reads
Replicas make indexing more expensive
But they add safety
Sharding makes reads slower:
A round trip for accurate scoring (gathering distributed term statistics)
A second round trip for the search itself
A reduce step on the coordinating node
A third round trip to retrieve the final set of documents
Two rules of distributed search:
1. Distributed search is expensive!
2. Searching multiple indices is the same as searching multiple shards
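The scoring round trip is opt-in; a sketch using the search_type parameter (available in pre-7 versions):

    # Extra round trip to fetch global term statistics for accurate scores:
    es.search(index="logs", search_type="dfs_query_then_fetch",
              body={"query": {"match": {"message": "error"}}})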
Routing only works for isolated "chunks" of data in the same index
"Users", for example
The routing key overrides the shard key
Popular example: the user ID
Multiple users will share a shard
Shards will differ in size
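A sketch of routing by user ID with the Python client (names invented): documents and searches for one user then hit a single shard, but a filter is still needed because other users share it:

    # Index all of a user's documents with their ID as the routing value.
    es.index(index="messages", doc_type="message", id="m1",
             routing="user42", body={"user": "user42", "text": "hello"})

    # A search routed the same way only touches that user's shard.
    es.search(index="messages", routing="user42",
              body={"query": {"filtered": {
                  "query": {"match": {"text": "hello"}},
                  "filter": {"term": {"user": "user42"}}}}})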
Alternative: aliases
Move large users out to their own index
Have an alias point to all the indices
Drawback: the cluster state becomes big, with high network impact
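A sketch of the alias approach (index and routing names invented): aliases can carry a filter and a routing value, so each user looks like their own index:

    es.indices.update_aliases(body={"actions": [
        # Small users stay in the shared index, pinned to one shard.
        {"add": {"index": "users_shared", "alias": "user_42",
                 "routing": "42",
                 "filter": {"term": {"user_id": "42"}}}},
        # A large user gets a dedicated index behind the same kind of alias.
        {"add": {"index": "user_1001_idx", "alias": "user_1001"}},
    ]})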
Use existing client libraries
If Java, prefer the NodeClient
Alternative: the TransportClient
HTTP:
Use long-lived connections
Check HTTP chunking
Raise the maximum number of file descriptors
Avoid swapping
ES_HEAP_SIZE (set Xms = Xmx)
Leave enough memory to the OS
Give half the machine's memory to ES
Not more than 32 GB (the limit for compressed object pointers)
If using doc values, a few GB should be enough
Use a concurrent GC
The default is CMS; maybe try G1
Check your Java version
Avoid virtualisation
Beware of noisy neighbours
Storage:
Use local disks
Use SSDs
RAID 0 (replicas already provide redundancy)