SlideShare ist ein Scribd-Unternehmen logo
1 von 46
RUNNING & SCALING
LARGE ELASTICSEARCH
CLUSTERS
FRED DE VILLAMIL, DIRECTOR OF
INFRASTRUCTURE
@FDEVILLAMIL
OCTOBER 2017
BACKGROUND
• FRED DE VILLAMIL, 39 ANS, TEAM
COFFEE @SYNTHESIO,
• LINUX / (FREE)BSD USER SINCE 1996,
• OPEN SOURCE CONTRIBUTOR SINCE
1998,
• LOVES TENNIS, PHOTOGRAPHY, CUTE
OTTERS, INAPPROPRIATE HUMOR AND
ELASTICSEARCH CLUSTERS OF UNUSUAL
SIZE.
WRITES ABOUT ES AT
HTTPS://THOUGHTS.T37.NET
ELASTICSEARCH @SYNTHESIO
• 8 production clusters, 600 hosts, 1.7PB storage, 37.5TB
RAM, average 15k writes / s, 800 search /s, some inputs >
200MB.
• Data nodes: 6 core Xeon E5v3, 64GB RAM, 4*800GB SSD
RAID0. Sometimes bi Xeon E5-2687Wv4 12 core (160
watts!!!).
• We agregate data from various cold storage and make them
searchable in a giffy.
AN ELASTICSEARCH CLUSTER OF UNUSUAL
SIZE
ENSURING HIGH
AVAILABILITY
NEVER GONNA GIVE YOU UP
• NEVER GONNA LET YOU DOWN,
• NEVER GONNA RUN AROUND AND
DESERT YOU,
• NEVER GONNA MAKE YOU CRY,
• NEVER GONNA SAY GOODBYE,
• NEVER GONNA TELL A LIE & HURT YOU.
AVOIDING DOWNTIME & SPLIT BRAINS
• RUN AT LEAST 3 MASTER NODES INTO 3 DIFFERENT
LOCATIONS.
• NEVER RUN BULK QUERIES ON THE MASTER
NODES.
• ACTUALLY NEVER RUN ANYTHING BUT
ADMINISTRATIVE TASKS ON THE MASTER NODES.
• SPREAD YOUR DATA INTO 2 DIFFERENT LOCATION
WITH AT LEAST A REPLICATION FACTOR OF 1 (1
PRIMARY, 1 REPLICA).
RACK AWARENESS
ALLOCATE A
RACK_ID TO THE
DATA NODES FOR
EVEN REPLICATION.
RESTART A WHOLE
DATA CENTER
@ONCE WITHOUT
DOWNTIME.
RACK AWARENESS + QUERY NODES ==
MAGIC
ES PRIVILEGES THE
DATA NODES WITH
THE SAME RACK_ID
AS THE QUERY.
REDUCES LATENCY
AND BALANCES THE
LOAD.
RACK AWARENESS + QUERY NODES + ZONE ==
MAGIC + FUN
ADD ZONES INTO THE
SAME RACK FOR
EVEN REPLICATION
WITH HIGHER
FACTOR.
USING ZONES FOR FUN & PROFIT
ALLOWING EVEN REPLICATION
WITH A HIGHER FACTOR WITHIN
THE SAME RACK.
ALLOWING MORE RESOURCES
TO THE MOST FREQUENTLY
ACCESSED INDEXES.
…
AVOIDING MEMORY
NIGHTMARE
HOW ELASTICSEARCH USES THE MEMORY
• Starts with allocating memory for Java heap.
• The Java heap contains all Elasticsearch buffers
and caches + a few other things.
• Each Java thread maps a system thread: +128kB
off heap.
• Elected master uses 250kB to store each shard
information inside the cluster.
ALLOCATING MEMORY
• Never allocate more than 31GB
heap to avoid the compressed
pointers issue.
• Use 1/2 of your memory up to
31GB.
• Feed your master and query
nodes, the more the better
(including CPU).
MEMORY LOCK
• Use memory_lock: true at
startup.
• Requires ulimit -l
unlimited.
• Allocates the whole heap at once.
• Uses contiguous memory regions.
• Avoids swapping (you should
disable swap anyway).
CHOSING THE RIGHT GARBAGE COLLECTOR
• ES runs with CMS as a default
garbage collector.
• CMS was designed for heaps <
4GB.
• Stop the world garbage
collection last too long & blocks
the cluster.
• Solution: switching to G1GC
(default in Java9, unsupported).
CMS VS G1GC
• CMS: SHARED CPU TIME WITH THE APPLICATION.
“STOPS THE WORLD” WHEN TOO MANY MEMORY TO
CLEAN UNTIL IT SENDS AN OUTOFMEMORYERROR.
• G1GC: SHORT, MORE FREQUENT, PAUSES. WON’T
STOP A NODE UNTIL IT LEAVES THE CLUSTER.
• ELASTIC SAYS: DON’T USE G1GC FOR REASONS,
SO READ THE DOC.
G1GC OPTIONS
+USEG1GC: ACTIVATES G1GC
MAXGCPAUSEMILLIS: TARGET FOR MAX GC PAUSE
TIME.
GCPAUSEINTERVALMILLIS:TARGET FOR COLLECTION
TIME SPACE
INITIATINGHEAPOCCUPANCYPERCENT: WHEN TO START
COLLECTING?
CHO0SING THE RIGHT STORAGE
• MMAPFS : MAPS LUCENE FILES ON
THE VIRTUAL MEMORY USING
MMAP. NEEDS AS MUCH MEMORY
AS THE FILE BEING MAPPED TO
AVOID ISSUES.
• NIOFS : APPLIES A SHARED LOCK
ON LUCENE FILES AND RELIES
ON VFS CACHE.
BUFFERS AND CACHES
• ELASTICSEARCH HAS MULTIPLE CACHES & BUFFERS, WITH DEFAULT VALUES,
KNOW THEM!
• BUFFERS + CACHE MUST BE < TOTAL JAVA HEAP (OBVIOUS BUT…).
• AUTOMATED EVICTION ON THE CACHE, BUT FORCING IT CAN SAVE YOUR LIFE
WITH A SMALL OVERHEAD.
• IF YOU HAVE OOM ISSUES, DISABLE THE CACHES!
• FROM A USER POV, CIRCUIT BREAKERS ARE A NO GO!
MANAGING LARGE INDEXES
INDEX DESIGN
• VERSION YOUR INDEX BY MAPPING: 1_*, 2_* ETC.
• THE MORE SHARDS, THE BETTER ELASTICITY, BUT
THE MORE CPU AND MEMORY USED ON THE
MASTERS.
• PROVISIONNING 10GB PER SHARDS ALLOWS A
FASTER RECOVERY & REALLOCATION.
REPLICATION TRICKS
• NUMBER OF REPLICAS MUST BE 0 OR ODD.
CONSISTENCY QUORUM: INT( (PRIMARY +
NUMBER_OF_REPLICAS) / 2 ) + 1.
• RAISE THE REPLICATION FACTOR TO SCALE FOR
READING
UP TO 100% OF THE DATA / DATA NODE.
ALIASES
• ACCESS MULTIPLE INDICES AT ONCE.
• READ MULTIPLE, WRITE ONLY ONE.
EXAMPLE ON TIMESTAMPED INDICES:
"18_20171020": { "aliases": { "2017": {}, "201710": {}, "20171020": {} } }
"18_20171021": { "aliases": { "2017": {}, "201710": {}, "20171021": {} } }
Queries:
/2017/_search
/201710/_search
AFTER A MAPPING CHANGE & REINDEX, CHANGE THE ALIAS:
ROLLOVER
• CREATE A NEW INDEX WHEN TOO OLD OR
TOO BIG.
• SUPPORT DATE MATH: DAILY INDEX
CREATION.
• USE ALIASES TO QUERY ALL
ROLLOVERED INDEXES.
PUT "logs-000001" { "aliases": { "logs": {} } }
POST /logs/_rollover { "conditions": { "max_docs": 10000000 } }
DAILY OPERATIONS
CONFIGURATION CHANGES
• PREFER CONFIGURATION FILE UPDATES TO API CALL FOR
PERMANENT CHANGES.
• VERSION YOUR CONFIGURATION CHANGES SO YOU CAN
ROLLBACK, ES REQUIRES LOTS OF FINE TUNING.
• WHEN USING _SETTINGS API, PREFER TRANSIENT TO
PERSISTENT, THEY’RE EASIER TO GET RID OF.
RECONFIGURING THE WHOLE CLUSTER
LOCK SHARD REALLOCATION & RECOVERY:
"cluster.routing.allocation.enable" : "none"
OPTIMIZE FOR RECOVERY:
"cluster.routing.allocation.node_initial_primaries_recoveries": 50
"indices.recovery.max_bytes_per_sec": "2048mb"
RESTART A FULL RACK, WAIT FOR NODES TO COME
THE REINDEX API
• IN CLUSTER AND CLUSTER TO CLUSTER REINDEX API.
• ALLOWS CROSS VERSION INDEXING: 1.7 TO 5.1…
• SLICED SCROLLS ONLY AVAILABLE STARTING 6.0.
• ACCEPT ES QUERIES TO FILTER THE DATA TO REINDEX.
• MERGE MULTIPLE INDEXES INTO 1.
BULK INDEXING TRICKS
LIMIT REBALANCE:
"cluster.routing.allocation.cluster_concurrent_rebalance": 1
"cluster.routing.allocation.balance.shard": "0.15f"
"cluster.routing.allocation.balance.threshold": "10.0f"
DISABLE REFRESH:
"index.refresh_interval:" "0"
NO REPLICA:
"index.number_of_replicas:" "0" // having replica index n times in Lucene, adding one just "rsync" the data.
ALLOCATE ON DEDICATED HARDWARE:
OPTIMIZING FOR SPACE & PERFORMANCES
• LUCENE SEGMENTS ARE IMMUTABLE, THE MORE YOU
WRITE, THE MORE SEGMENTS YOU GET.
• DELETING DOCUMENTS DOES COPY ON WRITE SO NO
REAL DELETE.
index.merge.scheduler.max_thread_count: default CPU/2 with min 4
POST /_force_merge?only_expunge_deletes: faster, only merge segments with deleted
POST /_force_merge?max_num_segments: don’t use on indexes you write on!
WARNING: _FORCE_MERGE HAS A COST IN CPU AND I/OS.
MINOR VERSION UPGRADES
• CHECK YOUR PLUGINS COMPATIBILITY, PLUGINS
MUST BE COMPILED FOR YOUR MINOR VERSION.
• START UPGRADING THE MASTER NODES.
• UPGRADE THE DATA NODES ON A WHOLE RACK AT
ONCE.
OS LEVEL UPGRADES
• ENSURE THE WHOLE CLUSTER RUNS THE SAME JAVA
VERSION.
• WHEN UPGRADING JAVA, CHECK IF YOU DON’T HAVE
TO UPGRADE THE KERNEL.
• PER NODE JAVA / KERNEL VERSION AVAILABLE IN THE
_STATS API.
MONITORING
CAPTAIN OBVIOUS, YOU’RE MY ONLY HOPE!
• GOOD MONITORING IS BUSINESS ORIENTED MONITORING.
• GOOD ALERTING IS ACTIONABLE ALERTING.
• DON’T MONITORE THE CLUSTER ONLY, BUT THE WHOLE PROCESSING
CHAIN.
• USELESS METRICS ARE USELESS.
• LOSING A DATACENTER: OK. LOSING DATA: NOT OK!
MONITORING TOOLING
• ELASTICSEARCH X-
PACK,
• GRAFANA…
LIFE, DEATH & _CLUSTER/HEALTH
• A RED CLUSTER MEANS AT LEAST 1
INDEX HAS MISSING DATA. DON’T
PANIC!
• USING LEVEL={INDEX,SHARD} AND AN
INDEX ID PROVIDES SPECIFIC
INFORMATION.
• LOTS OF PENDING TASKS MEANS YOUR
CLUSTER IS UNDER HEAVY LOAD AND
SOME NODES CAN’T PROCESS THEM
FAST ENOUGH.
• LONG WAITING TASKS MEANS YOU
HAVE A CAPACITY PLANNING PROBLEM.
USE THE _CAT API
• PROVIDES GENERAL INFORMATION
ABOUT YOUR NODES, SHARDS,
INDICES AND THREAD POOLS.
• HIT THE WHOLE CLUSTER, WHEN
IT TIMEOUTS YOU’VE PROBABLY
HAVING A NODE STUCK IN
GARBAGE COLLECTION.
MONITORING AT THE CLUSTER LEVEL
• USE THE _STATS API FOR PRECISE INFORMATION.
• MONITORE THE SHARDS REALLOCATION, TOO MANY
MEANS A DESIGN PROBLEM.
• MONITORE THE WRITES AND CLUSTER WIDE, IF THEY
FALL TO 0 AND IT’S UNUSUAL, A NODE IS STUCK IN
GC.
MONITORING AT THE NODE LEVEL
• USE THE _NODES/{NODE}, _NODES/{NODE}/STATS
AND _CAT/THREAD_POOL API.
• THE GARBAGE COLLECTION DURATION &
FREQUENCY IS A GOOD METRIC OF YOUR NODE
HEALTH.
• CACHE AND BUFFERS ARE MONITORED ON A NODE
LEVEL.
• MONITORING I/OS, SPACE, OPEN FILES & CPU IS
CRITICAL.
MONITORING AT THE INDEX LEVEL
• USE THE {INDEX}/_STATS API.
• MONITORE THE DOCUMENTS / SHARD RATIO.
• MONITORE THE MERGES, QUERY TIME.
• TOO MANY EVICTIONS MEANS YOU HAVE A CACHE
CONFIGURATION LEVEL.
TROUBLESHOOTING
WHAT’S REALLY GOING ON IN YOUR CLUSTER?
• THE _NODES/{NODE}/HOT_THREADS API TELLS WHAT HAPPENS ON
THE HOST.
• THE ELECTED MASTER NODES TELLS YOU MOST THING YOU NEED
TO KNOW.
• ENABLE THE SLOW LOGS TO UNDERSTAND YOUR BOTTLENECK &
OPTIMIZE THE QUERIES. DISABLE THE SLOW LOGS WHEN YOU’RE
DONE!!!
• WHEN NOT ENOUGH, MEMORY PROFILING IS YOUR FRIEND.
MEMORY PROFILING
• LIVE MEMORY OR HPROF FILE AFTER A CRASH.
• ALLOWS YOU TO TO KNOW WHAT IS / WAS IN YOUR
BUFFERS AND CACHES.
• YOURKIT JAVA PROFILER AS A TOOL.
TRACING
• KNOW WHAT’S REALLY HAPPENING IN YOUR JVM.
• LINUX 4.X PROVIDES GREAT PERF TOOLS, LINUX 4.9 EVEN
BETTER:
• LINUX-PERF,
• JAVA PERF MAP.
• VECTOR BY NETFLIX (NOT VEKTOR THE TRASH METAL
BAND).
QUESTIONS
?
@FDEVILLAMI
L
@SYNTHESIO
SLIDES: HTTP://BIT.DO/ELASTICSEARCH-SYSADMIN-201

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Data Factory V2 新機能徹底活用入門
Data Factory V2 新機能徹底活用入門Data Factory V2 新機能徹底活用入門
Data Factory V2 新機能徹底活用入門
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Amazon Aurora 100% 활용하기
Amazon Aurora 100% 활용하기Amazon Aurora 100% 활용하기
Amazon Aurora 100% 활용하기
 
DBP-009_クラウドで実現するスケーラブルなデータ ウェアハウス Azure SQL Data Warehouse 解説
DBP-009_クラウドで実現するスケーラブルなデータ ウェアハウス Azure SQL Data Warehouse 解説DBP-009_クラウドで実現するスケーラブルなデータ ウェアハウス Azure SQL Data Warehouse 解説
DBP-009_クラウドで実現するスケーラブルなデータ ウェアハウス Azure SQL Data Warehouse 解説
 
스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...
스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...
스타트업 나홀로 데이터 엔지니어: 데이터 분석 환경 구축기 - 천지은 (Tappytoon) :: AWS Community Day Onlin...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Emergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data HubEmergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data Hub
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
 
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
AWS Lambda를 기반으로한 실시간 빅테이터 처리하기
 
Introduction to Batch Processing on AWS
Introduction to Batch Processing on AWSIntroduction to Batch Processing on AWS
Introduction to Batch Processing on AWS
 
クラウドDWHにおける観点とAzure Synapse Analyticsの対応
クラウドDWHにおける観点とAzure Synapse Analyticsの対応クラウドDWHにおける観点とAzure Synapse Analyticsの対応
クラウドDWHにおける観点とAzure Synapse Analyticsの対応
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
 
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
 
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
 
Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나
Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나
Amazon SageMaker 모델 배포 방법 소개::김대근, AI/ML 스페셜리스트 솔루션즈 아키텍트, AWS::AWS AIML 스페셜 웨비나
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
 
Datastax Enterpriseをはじめよう
Datastax EnterpriseをはじめようDatastax Enterpriseをはじめよう
Datastax Enterpriseをはじめよう
 

Ähnlich wie Running & Scaling Large Elasticsearch Clusters

M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
Edward Capriolo
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
Gaurav "GP" Pal
 

Ähnlich wie Running & Scaling Large Elasticsearch Clusters (20)

Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
CASSANDRA MEETUP - Choosing the right cloud instances for success
CASSANDRA MEETUP - Choosing the right cloud instances for successCASSANDRA MEETUP - Choosing the right cloud instances for success
CASSANDRA MEETUP - Choosing the right cloud instances for success
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Scaling Elasticsearch at Synthesio
Scaling Elasticsearch at SynthesioScaling Elasticsearch at Synthesio
Scaling Elasticsearch at Synthesio
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?
 
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... Cassandra
 
Hadoop at datasift
Hadoop at datasiftHadoop at datasift
Hadoop at datasift
 
Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt
 
Running MySQL on Linux
Running MySQL on LinuxRunning MySQL on Linux
Running MySQL on Linux
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
MySQL Performance Tuning London Meetup June 2017
MySQL Performance Tuning London Meetup June 2017MySQL Performance Tuning London Meetup June 2017
MySQL Performance Tuning London Meetup June 2017
 

Mehr von Fred de Villamil

Mehr von Fred de Villamil (9)

Scaling your Engineering Team
Scaling your Engineering TeamScaling your Engineering Team
Scaling your Engineering Team
 
Hiring and Managing Happy Engineers - CTO Pizza #3
Hiring and Managing Happy Engineers - CTO Pizza #3Hiring and Managing Happy Engineers - CTO Pizza #3
Hiring and Managing Happy Engineers - CTO Pizza #3
 
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without DowntimeMigrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Without Downtime
 
Devops commando - Paris Devops 2016-04
Devops commando - Paris Devops 2016-04Devops commando - Paris Devops 2016-04
Devops commando - Paris Devops 2016-04
 
The Commando Devops
The Commando DevopsThe Commando Devops
The Commando Devops
 
How People Use Iphone
How People Use IphoneHow People Use Iphone
How People Use Iphone
 
Zendcon Performance Oci8
Zendcon Performance Oci8Zendcon Performance Oci8
Zendcon Performance Oci8
 
Applications Web En Entreprise Avec Ruby On Rails Benefices Et Limitations Gu...
Applications Web En Entreprise Avec Ruby On Rails Benefices Et Limitations Gu...Applications Web En Entreprise Avec Ruby On Rails Benefices Et Limitations Gu...
Applications Web En Entreprise Avec Ruby On Rails Benefices Et Limitations Gu...
 
Presentation Rails
Presentation RailsPresentation Rails
Presentation Rails
 

Kürzlich hochgeladen

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 

Kürzlich hochgeladen (20)

Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 

Running & Scaling Large Elasticsearch Clusters

  • 1. RUNNING & SCALING LARGE ELASTICSEARCH CLUSTERS FRED DE VILLAMIL, DIRECTOR OF INFRASTRUCTURE @FDEVILLAMIL OCTOBER 2017
  • 2. BACKGROUND • FRED DE VILLAMIL, 39 ANS, TEAM COFFEE @SYNTHESIO, • LINUX / (FREE)BSD USER SINCE 1996, • OPEN SOURCE CONTRIBUTOR SINCE 1998, • LOVES TENNIS, PHOTOGRAPHY, CUTE OTTERS, INAPPROPRIATE HUMOR AND ELASTICSEARCH CLUSTERS OF UNUSUAL SIZE. WRITES ABOUT ES AT HTTPS://THOUGHTS.T37.NET
  • 3. ELASTICSEARCH @SYNTHESIO • 8 production clusters, 600 hosts, 1.7PB storage, 37.5TB RAM, average 15k writes / s, 800 search /s, some inputs > 200MB. • Data nodes: 6 core Xeon E5v3, 64GB RAM, 4*800GB SSD RAID0. Sometimes bi Xeon E5-2687Wv4 12 core (160 watts!!!). • We agregate data from various cold storage and make them searchable in a giffy.
  • 4. AN ELASTICSEARCH CLUSTER OF UNUSUAL SIZE
  • 6. NEVER GONNA GIVE YOU UP • NEVER GONNA LET YOU DOWN, • NEVER GONNA RUN AROUND AND DESERT YOU, • NEVER GONNA MAKE YOU CRY, • NEVER GONNA SAY GOODBYE, • NEVER GONNA TELL A LIE & HURT YOU.
  • 7. AVOIDING DOWNTIME & SPLIT BRAINS • RUN AT LEAST 3 MASTER NODES INTO 3 DIFFERENT LOCATIONS. • NEVER RUN BULK QUERIES ON THE MASTER NODES. • ACTUALLY NEVER RUN ANYTHING BUT ADMINISTRATIVE TASKS ON THE MASTER NODES. • SPREAD YOUR DATA INTO 2 DIFFERENT LOCATION WITH AT LEAST A REPLICATION FACTOR OF 1 (1 PRIMARY, 1 REPLICA).
  • 8. RACK AWARENESS ALLOCATE A RACK_ID TO THE DATA NODES FOR EVEN REPLICATION. RESTART A WHOLE DATA CENTER @ONCE WITHOUT DOWNTIME.
  • 9. RACK AWARENESS + QUERY NODES == MAGIC ES PRIVILEGES THE DATA NODES WITH THE SAME RACK_ID AS THE QUERY. REDUCES LATENCY AND BALANCES THE LOAD.
  • 10. RACK AWARENESS + QUERY NODES + ZONE == MAGIC + FUN ADD ZONES INTO THE SAME RACK FOR EVEN REPLICATION WITH HIGHER FACTOR.
  • 11. USING ZONES FOR FUN & PROFIT ALLOWING EVEN REPLICATION WITH A HIGHER FACTOR WITHIN THE SAME RACK. ALLOWING MORE RESOURCES TO THE MOST FREQUENTLY ACCESSED INDEXES. …
  • 13. HOW ELASTICSEARCH USES THE MEMORY • Starts with allocating memory for Java heap. • The Java heap contains all Elasticsearch buffers and caches + a few other things. • Each Java thread maps a system thread: +128kB off heap. • Elected master uses 250kB to store each shard information inside the cluster.
  • 14. ALLOCATING MEMORY • Never allocate more than 31GB heap to avoid the compressed pointers issue. • Use 1/2 of your memory up to 31GB. • Feed your master and query nodes, the more the better (including CPU).
  • 15. MEMORY LOCK • Use memory_lock: true at startup. • Requires ulimit -l unlimited. • Allocates the whole heap at once. • Uses contiguous memory regions. • Avoids swapping (you should disable swap anyway).
  • 16. CHOSING THE RIGHT GARBAGE COLLECTOR • ES runs with CMS as a default garbage collector. • CMS was designed for heaps < 4GB. • Stop the world garbage collection last too long & blocks the cluster. • Solution: switching to G1GC (default in Java9, unsupported).
  • 17. CMS VS G1GC • CMS: SHARED CPU TIME WITH THE APPLICATION. “STOPS THE WORLD” WHEN TOO MANY MEMORY TO CLEAN UNTIL IT SENDS AN OUTOFMEMORYERROR. • G1GC: SHORT, MORE FREQUENT, PAUSES. WON’T STOP A NODE UNTIL IT LEAVES THE CLUSTER. • ELASTIC SAYS: DON’T USE G1GC FOR REASONS, SO READ THE DOC.
  • 18. G1GC OPTIONS +USEG1GC: ACTIVATES G1GC MAXGCPAUSEMILLIS: TARGET FOR MAX GC PAUSE TIME. GCPAUSEINTERVALMILLIS:TARGET FOR COLLECTION TIME SPACE INITIATINGHEAPOCCUPANCYPERCENT: WHEN TO START COLLECTING?
  • 19. CHO0SING THE RIGHT STORAGE • MMAPFS : MAPS LUCENE FILES ON THE VIRTUAL MEMORY USING MMAP. NEEDS AS MUCH MEMORY AS THE FILE BEING MAPPED TO AVOID ISSUES. • NIOFS : APPLIES A SHARED LOCK ON LUCENE FILES AND RELIES ON VFS CACHE.
  • 20. BUFFERS AND CACHES • ELASTICSEARCH HAS MULTIPLE CACHES & BUFFERS, WITH DEFAULT VALUES, KNOW THEM! • BUFFERS + CACHE MUST BE < TOTAL JAVA HEAP (OBVIOUS BUT…). • AUTOMATED EVICTION ON THE CACHE, BUT FORCING IT CAN SAVE YOUR LIFE WITH A SMALL OVERHEAD. • IF YOU HAVE OOM ISSUES, DISABLE THE CACHES! • FROM A USER POV, CIRCUIT BREAKERS ARE A NO GO!
  • 22. INDEX DESIGN • VERSION YOUR INDEX BY MAPPING: 1_*, 2_* ETC. • THE MORE SHARDS, THE BETTER ELASTICITY, BUT THE MORE CPU AND MEMORY USED ON THE MASTERS. • PROVISIONNING 10GB PER SHARDS ALLOWS A FASTER RECOVERY & REALLOCATION.
  • 23. REPLICATION TRICKS • NUMBER OF REPLICAS MUST BE 0 OR ODD. CONSISTENCY QUORUM: INT( (PRIMARY + NUMBER_OF_REPLICAS) / 2 ) + 1. • RAISE THE REPLICATION FACTOR TO SCALE FOR READING UP TO 100% OF THE DATA / DATA NODE.
  • 24. ALIASES • ACCESS MULTIPLE INDICES AT ONCE. • READ MULTIPLE, WRITE ONLY ONE. EXAMPLE ON TIMESTAMPED INDICES: "18_20171020": { "aliases": { "2017": {}, "201710": {}, "20171020": {} } } "18_20171021": { "aliases": { "2017": {}, "201710": {}, "20171021": {} } } Queries: /2017/_search /201710/_search AFTER A MAPPING CHANGE & REINDEX, CHANGE THE ALIAS:
  • 25. ROLLOVER • CREATE A NEW INDEX WHEN TOO OLD OR TOO BIG. • SUPPORT DATE MATH: DAILY INDEX CREATION. • USE ALIASES TO QUERY ALL ROLLOVERED INDEXES. PUT "logs-000001" { "aliases": { "logs": {} } } POST /logs/_rollover { "conditions": { "max_docs": 10000000 } }
  • 27. CONFIGURATION CHANGES • PREFER CONFIGURATION FILE UPDATES TO API CALL FOR PERMANENT CHANGES. • VERSION YOUR CONFIGURATION CHANGES SO YOU CAN ROLLBACK, ES REQUIRES LOTS OF FINE TUNING. • WHEN USING _SETTINGS API, PREFER TRANSIENT TO PERSISTENT, THEY’RE EASIER TO GET RID OF.
  • 28. RECONFIGURING THE WHOLE CLUSTER LOCK SHARD REALLOCATION & RECOVERY: "cluster.routing.allocation.enable" : "none" OPTIMIZE FOR RECOVERY: "cluster.routing.allocation.node_initial_primaries_recoveries": 50 "indices.recovery.max_bytes_per_sec": "2048mb" RESTART A FULL RACK, WAIT FOR NODES TO COME
  • 29. THE REINDEX API • IN CLUSTER AND CLUSTER TO CLUSTER REINDEX API. • ALLOWS CROSS VERSION INDEXING: 1.7 TO 5.1… • SLICED SCROLLS ONLY AVAILABLE STARTING 6.0. • ACCEPT ES QUERIES TO FILTER THE DATA TO REINDEX. • MERGE MULTIPLE INDEXES INTO 1.
  • 30. BULK INDEXING TRICKS LIMIT REBALANCE: "cluster.routing.allocation.cluster_concurrent_rebalance": 1 "cluster.routing.allocation.balance.shard": "0.15f" "cluster.routing.allocation.balance.threshold": "10.0f" DISABLE REFRESH: "index.refresh_interval:" "0" NO REPLICA: "index.number_of_replicas:" "0" // having replica index n times in Lucene, adding one just "rsync" the data. ALLOCATE ON DEDICATED HARDWARE:
  • 31. OPTIMIZING FOR SPACE & PERFORMANCES • LUCENE SEGMENTS ARE IMMUTABLE, THE MORE YOU WRITE, THE MORE SEGMENTS YOU GET. • DELETING DOCUMENTS DOES COPY ON WRITE SO NO REAL DELETE. index.merge.scheduler.max_thread_count: default CPU/2 with min 4 POST /_force_merge?only_expunge_deletes: faster, only merge segments with deleted POST /_force_merge?max_num_segments: don’t use on indexes you write on! WARNING: _FORCE_MERGE HAS A COST IN CPU AND I/OS.
  • 32. MINOR VERSION UPGRADES • CHECK YOUR PLUGINS COMPATIBILITY, PLUGINS MUST BE COMPILED FOR YOUR MINOR VERSION. • START UPGRADING THE MASTER NODES. • UPGRADE THE DATA NODES ON A WHOLE RACK AT ONCE.
  • 33. OS LEVEL UPGRADES • ENSURE THE WHOLE CLUSTER RUNS THE SAME JAVA VERSION. • WHEN UPGRADING JAVA, CHECK IF YOU DON’T HAVE TO UPGRADE THE KERNEL. • PER NODE JAVA / KERNEL VERSION AVAILABLE IN THE _STATS API.
  • 35. CAPTAIN OBVIOUS, YOU’RE MY ONLY HOPE! • GOOD MONITORING IS BUSINESS ORIENTED MONITORING. • GOOD ALERTING IS ACTIONABLE ALERTING. • DON’T MONITORE THE CLUSTER ONLY, BUT THE WHOLE PROCESSING CHAIN. • USELESS METRICS ARE USELESS. • LOSING A DATACENTER: OK. LOSING DATA: NOT OK!
  • 36. MONITORING TOOLING • ELASTICSEARCH X- PACK, • GRAFANA…
  • 37. LIFE, DEATH & _CLUSTER/HEALTH • A RED CLUSTER MEANS AT LEAST 1 INDEX HAS MISSING DATA. DON’T PANIC! • USING LEVEL={INDEX,SHARD} AND AN INDEX ID PROVIDES SPECIFIC INFORMATION. • LOTS OF PENDING TASKS MEANS YOUR CLUSTER IS UNDER HEAVY LOAD AND SOME NODES CAN’T PROCESS THEM FAST ENOUGH. • LONG WAITING TASKS MEANS YOU HAVE A CAPACITY PLANNING PROBLEM.
  • 38. USE THE _CAT API • PROVIDES GENERAL INFORMATION ABOUT YOUR NODES, SHARDS, INDICES AND THREAD POOLS. • HIT THE WHOLE CLUSTER, WHEN IT TIMEOUTS YOU’VE PROBABLY HAVING A NODE STUCK IN GARBAGE COLLECTION.
  • 39. MONITORING AT THE CLUSTER LEVEL • USE THE _STATS API FOR PRECISE INFORMATION. • MONITORE THE SHARDS REALLOCATION, TOO MANY MEANS A DESIGN PROBLEM. • MONITORE THE WRITES AND CLUSTER WIDE, IF THEY FALL TO 0 AND IT’S UNUSUAL, A NODE IS STUCK IN GC.
  • 40. MONITORING AT THE NODE LEVEL • USE THE _NODES/{NODE}, _NODES/{NODE}/STATS AND _CAT/THREAD_POOL API. • THE GARBAGE COLLECTION DURATION & FREQUENCY IS A GOOD METRIC OF YOUR NODE HEALTH. • CACHE AND BUFFERS ARE MONITORED ON A NODE LEVEL. • MONITORING I/OS, SPACE, OPEN FILES & CPU IS CRITICAL.
  • 41. MONITORING AT THE INDEX LEVEL • USE THE {INDEX}/_STATS API. • MONITORE THE DOCUMENTS / SHARD RATIO. • MONITORE THE MERGES, QUERY TIME. • TOO MANY EVICTIONS MEANS YOU HAVE A CACHE CONFIGURATION LEVEL.
  • 43. WHAT’S REALLY GOING ON IN YOUR CLUSTER? • THE _NODES/{NODE}/HOT_THREADS API TELLS WHAT HAPPENS ON THE HOST. • THE ELECTED MASTER NODES TELLS YOU MOST THING YOU NEED TO KNOW. • ENABLE THE SLOW LOGS TO UNDERSTAND YOUR BOTTLENECK & OPTIMIZE THE QUERIES. DISABLE THE SLOW LOGS WHEN YOU’RE DONE!!! • WHEN NOT ENOUGH, MEMORY PROFILING IS YOUR FRIEND.
  • 44. MEMORY PROFILING • LIVE MEMORY OR HPROF FILE AFTER A CRASH. • ALLOWS YOU TO TO KNOW WHAT IS / WAS IN YOUR BUFFERS AND CACHES. • YOURKIT JAVA PROFILER AS A TOOL.
  • 45. TRACING • KNOW WHAT’S REALLY HAPPENING IN YOUR JVM. • LINUX 4.X PROVIDES GREAT PERF TOOLS, LINUX 4.9 EVEN BETTER: • LINUX-PERF, • JAVA PERF MAP. • VECTOR BY NETFLIX (NOT VEKTOR THE TRASH METAL BAND).