Scale Large Log Management Deployment with Elastic

www.semplicityinc.com
FROM THE TRENCHES: SCALING
A LARGE LOG MANAGEMENT
DEPLOYMENT
War stories, tips & gotchas – it’s all here.
Prepared by SEMplicity, Inc.
George Boitano
(617) 524-0171
gboitano@semplicityinc.com
© Copyright 2019 SEMplicity, Inc.

Connecting the Best of Both Worlds
Who We Are
– SEMplicity provides Elastic professional services for legacy SIEM
modernization.
– SEMplicity is an official licensed Elastic Managed Services Provider (MSP).
– SEMplicity is the largest Micro Focus services provider for ArcSight with 10
years of experience with legacy SIEM and log management.

Modern Log Management with Elastic

The Challenge
A retailer needs much faster log search response times:
• using legacy log storage, searches spanning time periods of more than an
hour take minutes or hours to return;
• legacy log storage is very expensive at present, and getting worse as log
volumes increase;
• legacy log storage technology is frozen, without any roadmap for advanced
visualizations, machine learning, etc.
Enter FastSearch, our Elastic log management deployment:
• planned users include SOC analysts, incident response and hunt team;
• initial use case: fast searching of log records using keywords and free text;
• follow-on use case: Elastic storage of sensitive compliance logs (PCI, HIPAA) at
evidentiary standards;
• roadmap includes advanced analyst and executive visualizations, alerting,
incident response research dashboards, and unsupervised machine learning
anomaly detection.

Requirements Metrics
Metric        Service Level
Retention     30 days or more
Volume        Approximately 120K events per second (EPS)
Storage       More than 31.3 TB per day
Performance   30-day single keyword search returns in under 6 seconds

Log Source        30-Day Storage
Windows Servers   55 TB
Firewalls         270 TB
Web Proxies       110 TB
Other Sources     About 500 TB

Elastic Cloud Enterprise (ECE)
We deployed Elastic Cloud Enterprise (ECE) internally on repurposed hardware.
Here are the benefits:
• Sensitive or regulated data is stored and available within the internal
network.
• Centralized management of Elasticsearch deployments for:
  • Provisioning
  • Scaling
  • Monitoring
  • Upgrades (minimal to no downtime)
  • Backup and restore

Hardware
Our biggest challenge involved designing the ECE (Elastic Cloud Enterprise)
installation for our client based on the hardware we inherited:
• 60 or so Red Hat servers with 256 GB memory and varying storage
capabilities;
• Some with small SSD drives, some with only spinning disks, some with both;
• Most servers have between 19 TB and 24 TB of storage available.
In order to get the most out of the available resources, we decided upon an
ECE implementation with the RAM:Storage ratio of 1:98, as described later.

High Level ECE Design

Availability Zones
Here is the Elastic recommended configuration for 3 availability zones:

Availability Zones
Due to hardware (disk) available, we consolidated this design a bit. Here is the
actual design we used:

ECE Clusters
The first cluster (now called a deployment) we set up was for sizing. This
starts as a single-node single-shard cluster. The disk allocated and number
of nodes/shards changes depending on the event source, so we try to
keep enough free space available in this sizing cluster for onboarding.

Determining Storage Density
Elastic Cloud Enterprise (ECE) provisions clusters at a ratio of 1 GB of RAM for
every 32 GB of storage. Since we have servers with 256 GB RAM and 24 TB of
storage, we arrived at a memory-to-storage ratio of 1:98 for each allocator. This
can be calculated with the following formula:
• Storage / RAM = Storage Density
This gets more complex when you think about the instance size you plan to use.
ECE allows you to create instances with RAM allocations like this:
• For large deployments, smaller instances can be problematic. We have seen
ingest issues with instances of 16 GB and smaller (remember, we're dealing with
high volumes). An instance with 64 GB RAM will be allocated 16 processors,
whereas an instance with 16 GB RAM will be allocated 4 processors.
• Calculate storage density with instance size in mind. You want to make
sure your storage density will allow for the maximum number of instances to
be created.

Storage Density Example
Say you have a server you plan to use as an ECE allocator with 24 TB of storage
and 256 GB RAM.
• The storage density would be roughly 1:93.
• If you decide to use mostly 64 GB nodes (which, annoyingly, ECE calls
instances), that would be 5.9 TB of storage per node. That's roughly 4 nodes per
allocator.
• I'm rounding off numbers here, though. Realistically, it's 93.75 GB of storage
to every 1 GB of RAM. That means setting your RAM:Disk ratio to 1:93 leaves
around 192 GB of storage unused. This is compounded when you take into
consideration that only 4 nodes will fit on each allocator.
• Using 1:93 as your storage density, you will realistically only get 23.8 TB of
available storage.
This isn’t a huge problem normally, but when you have systems with different
RAM to Disk ratios, it gets difficult to avoid wasted disk space.
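
The arithmetic in this example can be reproduced with a short Python sketch. It assumes 1 TB = 1000 GB (which matches the figures on the slide) and that the ratio is entered as a whole number; the function and variable names are illustrative only, not part of ECE.

```python
# Allocator figures from the slide; 1 TB is treated as 1000 GB (assumption).
STORAGE_GB = 24_000
RAM_GB = 256


def allocator_plan(storage_gb, ram_gb, instance_ram_gb, ratio):
    """Estimate how much disk an allocator hands out at a whole-number ratio."""
    instances = ram_gb // instance_ram_gb      # instances that fit in RAM
    per_instance_gb = instance_ram_gb * ratio  # disk given to each instance
    usable_gb = instances * per_instance_gb    # disk that can be allocated
    return {
        "exact_density": storage_gb / ram_gb,
        "instances": instances,
        "per_instance_gb": per_instance_gb,
        "usable_gb": usable_gb,
        "unused_gb": storage_gb - usable_gb,
    }


plan = allocator_plan(STORAGE_GB, RAM_GB, instance_ram_gb=64, ratio=93)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Running the same function against each hardware profile before installation shows how much disk a given ratio would strand on servers with different RAM and storage.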

Shard Sizing (Number of Shards)
There are different approaches or methods available for Shard Sizing. This is how
we arrived at the number of shards and nodes to handle the requirement.
In addition to setting up a sizing deployment, we also set up a monitoring
deployment to view the indexing statistics as well as Logstash performance. With
the index mapping template tuned for disk optimization, we started sending Proxy
events which had been enriched with Logstash and properly parsed for keyword,
IP, text and other field types.
• Determine the daily index size by indexing for a couple of days (during the
week).
  • 1300 GB (2600 GB with replica) + 25% for growth = 3250 GB
• Determine the total index size based on the number of retention days.
  • 3250 GB * 30 days = 97500 GB
• Determine the number of shards. A good rule of thumb is to shoot for shard
sizes of 60 GB or less.
  • 3250 GB / 60 GB = 54 shards
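
As a rough sketch, the three steps above can be written as a small Python calculation. The function and parameter names are hypothetical; the inputs are the figures from this slide. Note that the slide rounds 54.17 to 54 shards, while a strict "60 GB or less" ceiling would give 55.

```python
def daily_index_gb(primary_gb, replicas=1, growth=0.25):
    """Daily index size including replica copies and growth headroom."""
    return primary_gb * (1 + replicas) * (1 + growth)


def total_index_gb(daily_gb, retention_days):
    """Total storage needed for the whole retention period."""
    return daily_gb * retention_days


def shard_count(daily_gb, target_shard_gb=60):
    """Shards per daily index, rounded to the nearest whole shard."""
    return round(daily_gb / target_shard_gb)


daily = daily_index_gb(1300)       # measured primary size per day, in GB
total = total_index_gb(daily, 30)  # 30-day retention
shards = shard_count(daily)

print(daily, total, shards)
```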

Instance or Node Sizing
Because of the higher EPS for this log source, we decided to go with instances of
64 GB RAM and 6.13 TB of storage. To determine the number of nodes or
instances:
a) We have the total size of the index for the 30-day retention period (97500
GB), which is derived from the daily index size.
b) We have the total size of a node, 6130 GB.
c) The number of nodes would be 97500 / 6130 = 16 nodes. Since we have 3
zones and nodes are distributed equally, we went with 18 nodes total.
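
The node count can be sketched the same way: divide the total index size by the per-node storage, round up, then round up again to a multiple of the number of zones. This is a minimal illustration with hypothetical names, using the numbers from this slide.

```python
import math


def nodes_needed(total_index_gb, node_storage_gb, zones):
    """Smallest node count that holds the index and divides evenly across zones."""
    raw = math.ceil(total_index_gb / node_storage_gb)  # nodes by capacity alone
    per_zone = math.ceil(raw / zones)                  # equal share per zone
    return per_zone * zones


nodes = nodes_needed(total_index_gb=97500, node_storage_gb=6130, zones=3)
print(nodes)
```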

Shard Sizing Details
A couple things to keep in mind about shard sizing:
• Generally speaking, the more (smaller) shards you have, the faster ingest will
be (to a point). More shards will also slow search times;
• Fewer (larger) shards will have the opposite effect;
• This is very dependent on EPS, number of nodes, and total index size;
• It is highly recommended to set up a Sizing Cluster with Monitoring and
thoroughly test each event source prior to sizing your production clusters.
Your search/ingest requirements will vary and these requirements will directly
impact your shard size and number of shards.

Logstash Architecture
Determining Logstash Architecture (for us) involved a lot of testing for each
log source.
Luckily, our data was already being collected in various ways and sent to
Kafka. Nearly all of it had been processed by ArcSight, so it was already in
CEF (common event format).
Pulling data from Kafka with Logstash is simple. You subscribe to the Kafka
topic (ours are separated by event type) using a Logstash input plugin.
Even with this simplification, much testing was required to determine how many
instances of Logstash were needed for the EPS output from Kafka.

Logstash Architecture Details
• We decided to leverage two of our lower disk servers for Logstash instances.
• Logstash does not run under ECE. You can deploy it as a Docker container, but
we are not doing that.
• We do send Logstash metrics and health data to our monitoring cluster, to
help with tuning and debugging.
• Here is one of the configurations for collecting Proxy events:

Tuning Logstash Ingestion
When pulling data from Kafka, the number of Kafka partitions available is
important:
• Logstash can only leverage as many consumer threads as there are
partitions available;
• If a proxy topic has 60 available partitions, Logstash can only leverage 60
consumer threads: any more than that will simply remain idle and unused.
Depending on the EPS and the filters used by Logstash, you may consider
splitting the partitions across several Logstash instances.

Tuning Logstash Applied
For our Proxy cluster, we split the topic among 4 Logstash instances each running
15 consumer threads.
• Each Logstash instance could leverage 15 consumer threads for a total of 60.
• General guidelines for ingestion:
  • Lower EPS (6k/s): fewer Logstash instances with a higher number of
  consumer threads each;
  • Higher EPS (45k/s): more Logstash instances with a lower number of
  consumer threads each.
• It's also important to note that raising the Logstash JVM heap, up to a
maximum of 30 GB, can improve throughput for each instance.
Using default settings, Logstash instances seem to max out around 2,000 EPS. By
testing different setups, you can improve this drastically.
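
The partition arithmetic behind this split can be sketched in Python. The helper names are hypothetical; the only rule encoded is the one stated in these slides, that consumer threads beyond the partition count stay idle.

```python
def threads_per_instance(partitions, instances):
    """Consumer threads each instance can usefully run for one topic."""
    return partitions // instances


def idle_threads(partitions, instances, threads_each):
    """Threads that would sit idle because no partition is left for them."""
    return max(0, instances * threads_each - partitions)


# Proxy topic from the slides: 60 partitions split across 4 Logstash instances.
print(threads_per_instance(60, 4))  # threads per instance
print(idle_threads(60, 4, 15))      # 4 x 15 threads: none idle
print(idle_threads(60, 4, 20))      # 4 x 20 threads: the surplus is idle
```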

Logstash Mapping
CEF is pretty good at normalizing data. However, there are some things you can
do with Logstash to further enrich events.
• Concatenating fields, dropping fields, or mapping IP geoip data, for instance:

Logstash all_content and copy_to
Our client also requested that we add all fields to a single field called
“all_content”.
• Quite often analysts may not know the field(s) in which a certain string
resides.
• This increases ingest workload and storage by quite a bit; however, it can be
quite useful for searching the whole event for strings.
To implement this, modify the CEF Logstash mapping template by adding a
copy_to parameter to each field that should be copied.
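
A minimal sketch of what such a mapping fragment might look like, built in Python and printed as JSON. The field names below are hypothetical placeholders rather than the client's actual template, and only the `properties` section is shown; `copy_to` itself is a standard Elasticsearch mapping parameter.

```python
import json


def with_copy_to(fields, target="all_content"):
    """Build a 'properties' fragment where every field is copied into target."""
    props = {
        name: {"type": field_type, "copy_to": target}
        for name, field_type in fields.items()
    }
    props[target] = {"type": "text"}  # the catch-all field analysts search
    return {"properties": props}


# Hypothetical CEF-style fields, for illustration only.
example_fields = {"deviceVendor": "keyword", "requestUrl": "keyword", "message": "text"}
mapping = with_copy_to(example_fields)
print(json.dumps(mapping, indent=2))
```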

Logstash Indexing
Disable indexing on fields that are not going to be searched, like certain numeric
fields.
This is specified in the Logstash mapping template. It speeds up ingestion and
reduces storage required.
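
As an illustration, the sketch below marks selected fields as not indexed by setting the Elasticsearch mapping parameter `index` to false. The field names are hypothetical and the fragment is not the actual template used in this deployment.

```python
import json


def disable_indexing(properties, unsearched):
    """Return a copy of a 'properties' mapping with indexing off for some fields."""
    updated = {name: dict(spec) for name, spec in properties.items()}
    for name in unsearched:
        updated[name]["index"] = False  # field is stored but not searchable
    return updated


# Hypothetical fields, for illustration only.
base = {"bytesIn": {"type": "long"}, "bytesOut": {"type": "long"},
        "requestUrl": {"type": "keyword"}}
tuned = disable_indexing(base, ["bytesIn", "bytesOut"])
print(json.dumps({"properties": tuned}, indent=2))
```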

Problem: Ingestion Delays
Early on, while still setting up and tuning ECE Clusters, we were frequently
making changes, growing and shrinking nodes, etc.
During this process, we discovered that growing Deployments can
sometimes result in an issue where new indices are not properly
spread across cluster instances.
We would frequently end up with a Deployment containing 20+ instances,
where only one or two instances were creating all shards and replicas.
ECE tries to keep all shards equally distributed, so when you create a new
instance, all new shards are created there until its shard count matches that of
the older instances.
This causes some serious problems with ingestion when you have 120k+
EPS.

Ingestion Delay Symptoms
Symptoms:
1. Logs are delayed in becoming available for search. Logs should be
available within 1 minute of ingestion. We were seeing delays of several
hours between ingestion and becoming available.
2. More than 3 shards are allocated to an instance or a node:
• As indicated in the previous slide, all shards for new indices were
created on or allocated to new instances.
Solution:
a) Confirm that the routing allocation is set to “all” so that the
shards are allocated evenly across the available instances.
b) To make maximum use of the available processing capacity, set the CPU
hard limit to “false” under the Data section of the advanced Elastic
configuration.

Setting cluster.routing.allocation
Make sure that cluster.routing.allocation.enable is set to “all”:
GET _cluster/settings?include_defaults=true&filter_path=**.routing.allocation.enable*
If the output shows “enable” : “none”, you can reset it to the default (“all”)
by setting the value to null.
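
A minimal sketch of the reset request, with the body built in Python. It assumes the override was stored under persistent settings; if the GET output shows it under transient, use that scope instead. The helper name is hypothetical. Python's None serializes to JSON null, which is what clears the override.

```python
import json


def reset_allocation_body(scope="persistent"):
    """Request body that clears the routing allocation override in one scope."""
    return {scope: {"cluster.routing.allocation.enable": None}}


body = json.dumps(reset_allocation_body())
print("PUT _cluster/settings")
print(body)
```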

Problem: Boot Loops & ECE Debugging
Symptom:
• If there is a syntax error in a user bundle or configuration file, ECE
does not report the actual error;
• Instead, we only see a “boot loop error detected” message in the ECE
Admin UI as it tries to apply the config changes to the new instances of
Elasticsearch within each Docker container.
Solution:
• Make sure that the deployment strategy is set to “rolling”.
This helped us review the log file for the actual syntax errors. If set to any
strategy other than rolling, any new instances created are terminated after
the failure. This deletes the log files, making diagnosis of the root cause very
difficult.

Cross-Cluster Search
• When dealing with large amounts of data, it becomes necessary to have
not only multiple indices, but multiple clusters.
• In ECE, clusters are referred to as Deployments. Each Deployment is a
secure logical silo. This was designed around a multi-tenant architecture,
with each tenant using its own Kibana instance to access the data in its
cluster. The idea was to prevent cross-cluster searching across multiple clients.
• In cases like ours, where a single customer has enough data to warrant
multiple Deployments, cross-cluster search is necessary. Its absence is a
drawback for large customers using ECE.
While ECE does not currently support cross-cluster search, it is planned.
We've been assured it will be included in the next major release, which
should be early February at the latest.

Take-Aways
Know your EPS and plan for it to increase:
• EPS may increase when indexed due to replica shards or field mapping
Take time to consider your hardware prior to installing ECE such that you
can use similar RAM:Disk ratios;
• Identical Hardware will make your job easier in the long run.
Don't forget to set aside hardware for your Logstash architecture:
• It’s tempting and possible to put Logstash on your ECE servers, but for
large deployments they will need large amounts of memory which will
constrain Elasticsearch resources on that server.
Consider indexing options on fields to reduce the number of writes to
disk:
• Reduce keywords mapped;
• You may not need to index some fields.
