Case Study
Elasticsearch Ingest @ Cisco Intercloud
Agenda
• Express Overview of StreamSets Data Collector
Kirit Basu, Product Management, StreamSets
• Introduction to Elastic
Catherine Johnson, Solutions Architect, Elastic
• Implementing Shipped Analytics Using StreamSets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Performance Management
for Data Flows
© 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc.
StreamSets At a Glance
History: Founded by Informatica and Cloudera veterans.
Mission: Bring operational excellence to managing data in motion.
Challenge: Move data efficiently and with quality in the face of change.
Solution: Open source software enabling performance management of data flows.
Use cases: Hadoop Ingest, Search Ingest, Message Broker Enablement, Log Shipping, Cloud Migration, IoT, ...
Momentum: Thousands of downloads, hundreds of companies using.
StreamSets Data Collector
Adaptable Flows for Efficiency
Design ingest pipelines with minimal coding and
maximum flexibility.
Data Flow KPIs for Control
Monitor and act on data flow performance and
data quality.
Containerized Architecture for Agility
Operate continuously in the face of constant
change.
Open source software for the rapid
development and reliable operation of
complex data flows.
Get Started with StreamSets
http://streamsets.com/opensource
https://github.com/streamsets/datacollector/
#streamsets
March 2016
Introduction to Elastic
Software that makes massive amounts of
structured and unstructured data usable for
search, logging, analytics, and more in mission
critical systems and applications
Examples: Elastic Stack Use Cases
Logging
IT Operations
Application Management
Security Analytics
Analytics Search
Marketing Insights
Business Development
Customer Sentiment
Website Search
Internal/Intranet Search
URL Search
Internal Systems/Applications External Systems/Applications
Developers IT/Ops Business Users
Elastic Solves Many Developer Use Cases
Social
Location
User Activity
Machine
(Log files)
Documents
Handles Complex
& Diverse Data
Meets Today’s Core
Developer Requirements
Developer requirements
Many users / use cases
Fast data processing
Large data volumes
Data quality & integrity
Cross-source insights
Solves Critical
Use Cases
Application
Search
Embedded
Search
Logging
Security
Analytics
Operational
Analytics
More …
The Elastic Stack
Ingest
Store, Index,
& Analyze
User Interface
Plugins Monitoring Security Alerting
Elastic Cloud: Hosted Elasticsearch
Thank you!
www.elastic.co
Implementing Shipped Analytics Using
Streamsets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Tymofii Polekhin, Software Engineer
Agenda
MANTL & Shipped
Shipped Analytics for Shipped
Why do we need Shipped Analytics?
Architecture and Data Flow
Streamsets Pipelines
End to end dataflow and performance with Elasticsearch
Benefits of Streamsets
Demo
Microservices managed and scaled separately
Microservices managed by Mesos in a single platform
Microservices architecture for Mesos frameworks and other components
CIS/AWS/Metastack/vSphere/UCS…
Terraform
Spark
Executor N
Spark
Executor 1
Spark
Scheduler
Kafka
Broker N
Kafka
Broker 1
Kafka
Scheduler
Docker Docker
Traefik Microservices …
REST API
REST API
Scripted provisioning
Direct provisioning
Policy, Auto-scaling
VM1 or BM1, VM2 or BM2, VM3 or BM3, VM4 or BM4, VM5 or BM5
Shipped Analytics Cluster
Probe
Probe
Probe
• Both Shipped and Shipped Analytics running on MANTL
• Shipped Analytics – infra and app logs and metrics analysis
mesos-master
mesos-slave
marathon
zookeeper
consul
syslog
frameworks
collectd
cpu
memory
interface
disk
df
load
docker
zookeeper
marathon
mesos-slave
mesos-master
CollectD and Filebeat processes
running on every node in the
cluster.
Infrastructure Layer
Zookeeper Cluster Consul Cluster
Mesos Cluster
Marathon Framework
Kafka Cluster
topbeat filebeat
journalbeat dockerbeat
• Experimenting with Elastic Beats (unified arch., closer to micro-services model)
• Elastic Beats to replace collectd plugins and cAdvisor for containers
<file | top | *>beat collectd
logstash
DNS SRV beats.logstash.service.consul
Data normalization
Tagging
Cluster name decoration
Logstash runs as a single process per cluster, discoverable through the
standard inter-cluster discovery mechanism. It receives metrics from
collectd and logs from Filebeat on every slave, normalizes the data,
and sends it to the desired output.
DNS SRV collectd.logstash.service.consul
NOTE: Logstash currently runs in a Docker container on every node; we will move to a Filebeat and Logstash Mesos framework soon.
logstash
Kafka 0.9.0.0 supports SSL authentication and data encryption for producers.
This is a must-have when sending data to an external destination over a WAN.
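As a minimal sketch of the producer-side settings this implies, the snippet below renders a Kafka producer properties file with SSL enabled. The property names are standard Kafka client settings; the broker address, file paths, and password are placeholders and do not come from the deck.

```python
def ssl_producer_properties(bootstrap, truststore, keystore, password):
    """Render Kafka producer properties that enable SSL (placeholder values)."""
    props = {
        "bootstrap.servers": bootstrap,
        # Encrypt traffic and verify the broker's certificate.
        "security.protocol": "SSL",
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": password,
        # A client keystore is only needed when brokers require client authentication.
        "ssl.keystore.location": keystore,
        "ssl.keystore.password": password,
        "ssl.key.password": password,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


producer_config = ssl_producer_properties(
    bootstrap="kafka.example.internal:9093",        # hypothetical broker address
    truststore="/etc/kafka/client.truststore.jks",  # hypothetical path
    keystore="/etc/kafka/client.keystore.jks",      # hypothetical path
    password="changeit",                            # placeholder secret
)
print(producer_config)
```

The keystore entries cover the "SSL authentication" half of the picture; encryption alone needs only the truststore.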
Sending data to central SA
cluster for long-term analytics
SSL encryption
WAN
kafka
SSL authentication
Shipped cluster
Shipped Analytics
StreamSets running in Mesos
Spark Cluster mode processing
data from multiple source
Shipped clusters and storing it
in Elasticsearch cluster.
kafka
elasticsearch
Streamsets Spark Streaming Cluster
Spark Job
Master instance
Spark Job Spark Job Spark Job
Lambda Reference Architecture
Monitoring / Analytics Cluster (local, Texas-3)
Global Monitoring / Analytics Cluster (global, Texas-1)
Monitoring / Analytics Cluster (local, Ams. -1 )
Monitoring / Analytics Cluster (local, Lon.-1)
Local components and deployment are the same as global, just smaller
Real-time and batch processing (Lambda), anomaly detection, visualization
SSL
Kafka
SSL
SSL
MQTT
Divide nodes by role for more
stable cluster operation and
ease of scalability
3 master/search nodes
5 live data nodes
3 archive data nodes
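One way to express this role split is through per-node settings. The sketch below is an assumption, not the deck's actual configuration: `node.master` and `node.data` are the standard role switches in Elasticsearch releases of that period, while the `node.box_type` attribute used to tell live and archive nodes apart is a common convention that the deck does not confirm.

```python
# Assumed per-role settings; the deck gives only the role names and counts.
ROLE_SETTINGS = {
    "master/search": {"node.master": True, "node.data": False},
    "live/data": {"node.master": False, "node.data": True, "node.box_type": "live"},
    "archive/data": {"node.master": False, "node.data": True, "node.box_type": "archive"},
}
ROLE_COUNTS = {"master/search": 3, "live/data": 5, "archive/data": 3}


def render_node_config(role):
    """Return elasticsearch.yml-style lines for one node of the given role."""
    settings = ROLE_SETTINGS[role]
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


for role, count in ROLE_COUNTS.items():
    print(f"# {count} x {role}")
    print(render_node_config(role))
```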
[Diagram: master/search nodes, live/data nodes, and archive/data nodes. Labels: Shards=5 Replicas=4; Shards=5 Replicas=1. Node spec shown: CPU=4, RAM=30GB, HDD=4TB.]
StreamSets pipelines process incoming messages and transform them
according to business logic requirements: normalizing metrics,
parsing log lines, and extracting important information using
GROK filters or scripts.
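A grok filter is essentially a regular expression with named captures. As a rough illustration of the log-parsing step, here is a plain-Python stand-in using named groups. The log format and field names are hypothetical; the actual patterns used in the pipeline are not shown in the deck.

```python
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_log_line("2016-03-01T12:00:00 ERROR task failed on slave-3")
print(fields)
```

Lines that do not match return `None`, so a caller can route them to an error path instead of indexing half-parsed records.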
Cluster Name Decorator
Fields Type Normalization
Metrics/Logs Stream Splitter
ES Logs Output
General GROK Filters
Float Value Truncate
ES Metrics Output
Shipped GROK Logic
Marathon
• StreamSets instances running in Docker containers in Marathon
o Easy deployment and scaling
o Fast upgrade to a newer version
• Issues we faced with this approach:
o Containers were killed by Marathon
o We needed to re-import the pipeline every time we launched a container
Marathon
• While working with StreamSets to resolve the OOM issue, we increased
container memory and SDC heap size
• At first all looked normal and we thought it had just been
starved of resources, but several days later SDC was killed again
• We increased memory and heap even more, to 16G, but that bought just
another day or two before it was killed again
• It looked like the SDC heap was constantly filling with data
that never went away, eventually killing the container
• GC was also working hard, and sometimes we got freezes
of up to 60 seconds
• We decided to move out of Docker
Marathon
• StreamSets reads JSON messages from the Kafka cluster and writes
to the Elasticsearch cluster
o Deserializing and serializing JSON was very slow in a
single-threaded process
o A Kafka consumption performance test showed:
▪ JSON format: 5k records/sec avg
▪ Text format: 50k records/sec avg
▪ Binary format: 250k records/sec avg
• The StreamSets team was very proactive about this issue,
and within 2 days we received a fix for multi-threaded JSON parsing
o New testing showed:
▪ JSON format: 66k records/sec avg
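To put those numbers in proportion, a quick calculation over the reported averages (the figures are from the slide; the variable names are ours):

```python
# Average Kafka consumption rates reported on the slide, in records/sec.
rates = {
    "json_single_threaded": 5_000,
    "text": 50_000,
    "binary": 250_000,
    "json_multi_threaded": 66_000,
}

json_speedup = rates["json_multi_threaded"] / rates["json_single_threaded"]
binary_vs_new_json = rates["binary"] / rates["json_multi_threaded"]

print(f"JSON parsing speedup after the fix: {json_speedup:.1f}x")
print(f"Binary is still {binary_vs_new_json:.1f}x faster than fixed JSON")
```

The fix brings JSON roughly 13x closer to the other formats, though binary remains the fastest option by a clear margin.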
Marathon
• StreamSets never failed because of internal logic bugs,
but we kept seeing the oom-killer, and recovery was
not automated
• We decided to leave Docker and run SDC natively on the host,
still using Marathon for scaling and failover
• Without Docker, we can now upload our pipeline on SDC startup,
and it starts working as soon as the instance has loaded
We can freely scale up and down whenever we need
We also got rid of the oom-killer issue
Each of our 3 SDC instances has already processed ~3B messages, with no issues!
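For illustration, a non-Docker Marathon app for SDC might look roughly like the definition below. The field names (`id`, `cmd`, `cpus`, `mem`, `instances`) are standard Marathon app fields; the app id, the wrapper script that starts SDC and uploads the pipeline, and the resource values are all hypothetical.

```python
import json

# Marathon app definition without a "container" section, so the command runs natively.
sdc_app = {
    "id": "/analytics/sdc",                 # hypothetical app id
    "cmd": "./start-sdc-with-pipeline.sh",  # hypothetical wrapper script
    "cpus": 1,                              # illustrative resource values
    "mem": 2048,
    "instances": 3,                         # the deck runs three SDC instances
}


def scaled(app, instances):
    """Return a copy of the app definition with a new instance count."""
    return {**app, "instances": instances}


print(json.dumps(scaled(sdc_app, 5), indent=2))
```

Scaling up or down then amounts to submitting the same definition with a different `instances` value, and Marathon relaunches the command on failure.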
• The StreamSets pipeline consumes metrics gathered by collectd
and logs gathered by Logstash from 4 different clusters
(including its own), transforms and decorates them, and sends them to
Elasticsearch for storage and analytics.
• First, we consume messages from a Kafka topic at an
average of 5,000 messages per second. The consumer
itself parses the JSON and passes the records on.
• The next stage is a JavaScript script that decorates each message
with a cluster name, based on the instance hostname in that
message
• Finally, we exclude Marathon events from the stream, sending
them directly to ES
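The decoration step can be sketched as follows. The deck's stage is written in JavaScript inside StreamSets; this is a plain-Python stand-in, and the hostname prefixes, cluster names, and record field names are all hypothetical.

```python
# Hypothetical hostname-prefix-to-cluster mapping; the real mapping is not shown in the deck.
CLUSTER_BY_PREFIX = {
    "tx3-": "texas-3",
    "ams1-": "amsterdam-1",
}


def decorate_with_cluster(record, default="unknown"):
    """Return a copy of the record with a 'cluster' field derived from its host."""
    host = record.get("host", "")
    cluster = next(
        (name for prefix, name in CLUSTER_BY_PREFIX.items() if host.startswith(prefix)),
        default,
    )
    return {**record, "cluster": cluster}


print(decorate_with_cluster({"host": "tx3-worker-07", "message": "started"}))
```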
• The next stage splits the stream into 2 parts: logs and metrics
• Metrics are sent straight to ES without any transformation
• Logs are the most interesting part:
o We pull Docker container logs out of the stream,
delete the “time” field (a duplicate timestamp), and
send them to ES
o We separate logs from specific clusters, because we
need to apply special logic to them
o Separation is done by mapping IPs to clusters in
the pipeline in real time
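A simplified sketch of this splitting logic, written in plain Python rather than as StreamSets stages. The `type` and `source` field names and the destination labels are assumptions made for the example, and the cluster-specific branch is left out.

```python
def route(record):
    """Return (destination, record) for one message; the field names are assumptions."""
    if record.get("type") == "metric":
        # Metrics go to Elasticsearch unchanged.
        return "es_metrics", record
    if record.get("source") == "docker":
        # Drop the redundant "time" field from Docker container logs.
        cleaned = {key: value for key, value in record.items() if key != "time"}
        return "es_logs", cleaned
    return "es_logs", record


print(route({"type": "metric", "value": 0.42}))
print(route({"type": "log", "source": "docker", "time": "12:00", "message": "ok"}))
```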
• We collect data from several Mesos clusters and need to
correlate container metrics with their logs
• We use appID, taskID and runID to identify a specific container's
logs
• Container logs themselves have all three of these, while mesos-master
and mesos-agent logs lack runID
• All unidentified data is discarded
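One reading of "unidentified" is "missing any of the three IDs". Under that assumption, the filter can be sketched like this (the record layout is hypothetical):

```python
REQUIRED_IDS = ("appID", "taskID", "runID")


def container_key(record):
    """Return the (appID, taskID, runID) tuple, or None if any ID is missing."""
    if all(record.get(name) for name in REQUIRED_IDS):
        return tuple(record[name] for name in REQUIRED_IDS)
    return None


def keep_identified(records):
    """Discard records that cannot be tied to a specific container."""
    return [record for record in records if container_key(record) is not None]


sample = [
    {"appID": "web", "taskID": "t1", "runID": "r1", "message": "container log"},
    {"appID": "web", "taskID": "t1", "message": "mesos-agent log without runID"},
]
print(keep_identified(sample))
```

The same tuple can serve as a join key when correlating a container's metrics with its logs.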
Current Shipped Analytics prod cluster configuration:
Kafka Cluster: 7 brokers with 4 CPUs and 16GB RAM each
Logstash topic for all incoming messages, with 7 partitions and 2 replicas
Current data flow averages 5,000 messages/sec to Kafka
Current data size averages 1.2MB/sec to Kafka
StreamSets: 3 instances with identical pipeline configuration reading from the Kafka cluster
The 7 partitions are split between the 3 instances as 3/2/2
All 3 instances run natively on the host (non-Docker) under Marathon
Marathon restarts a failed instance with automatic pipeline upload and start
Elasticsearch: 7 nodes with 4 CPUs, 16GB RAM and 2TB storage each
Each metric is written to its own index, for a total of 15 indexes
Each index has 5 primary shards and 5 replica shards
Total doc count: 17.5B Total doc size: 3.84TB
1-day doc count: ~500M 1-day doc size: ~120GB
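A few derived figures help sanity-check the configuration listing above. The inputs are the numbers as stated; treating 1 MB as 10^6 bytes and counting "5 replica shards" as five additional shards per index are our assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the configuration listing above.
indexes, primaries_per_index, replicas_per_index = 15, 5, 5
partitions, sdc_instances = 7, 3
docs_per_day = 500_000_000
kafka_bytes_per_sec, kafka_msgs_per_sec = 1.2e6, 5_000

total_shards = indexes * (primaries_per_index + replicas_per_index)

# Spread partitions as evenly as possible across instances.
base, extra = divmod(partitions, sdc_instances)
partition_split = [base + 1 if i < extra else base for i in range(sdc_instances)]

docs_per_sec = docs_per_day / SECONDS_PER_DAY
avg_message_bytes = kafka_bytes_per_sec / kafka_msgs_per_sec

print(total_shards)              # 150
print(partition_split)           # [3, 2, 2]
print(round(docs_per_sec))       # 5787
print(round(avg_message_bytes))  # 240
```

The ~5,800 documents/sec implied by the daily count is in line with the stated average of 5,000 messages/sec into Kafka.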
StreamSets is a great product to work with, and the team is super helpful and works fast
• Lots of input and output connectors, huge processing capabilities
• Very intuitive and rich user interface
• Easy to create pipelines visually, instead of writing code
• Clear data flow paths
• Small resource consumption relative to performance
• Can easily handle up to 10k records/sec to Elasticsearch with 1 CPU and 2GB RAM
• Simple configuration and deployment process
• Open source(!)
• Fast logic changes with minimum downtime
• Preview mode(!): check every stage before throwing all your data at it
• Rich data transformation possibilities
• GROK filters: easy to migrate from Logstash
• Smart error handling
• Reliable: not once did StreamSets crash by itself; only Docker, Marathon, and Mesos issues
Thank You!

Weitere ähnliche Inhalte

Was ist angesagt?

Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
HostedbyConfluent
 
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Cask Data
 

Was ist angesagt? (20)

Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
Feed Your SIEM Smart with Kafka Connect (Vitalii Rudenskyi, McKesson Corp) Ka...
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
 
About CDAP
About CDAPAbout CDAP
About CDAP
 
#BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask
#BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask #BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask
#BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask
 
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS...
 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...Introducing a horizontally scalable, inference-based business Rules Engine fo...
Introducing a horizontally scalable, inference-based business Rules Engine fo...
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to..."Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
 
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In...
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 

Andere mochten auch

Andere mochten auch (11)

Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big DataBad Data is Polluting Big Data
Bad Data is Polluting Big Data
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Ten canoes
Ten canoesTen canoes
Ten canoes
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Ähnlich wie Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud

Ähnlich wie Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud (20)

Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
Kafka streams decoupling with stores
Kafka streams decoupling with storesKafka streams decoupling with stores
Kafka streams decoupling with stores
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john mallory
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK StackCentralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Instrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with EnvoyInstrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with Envoy
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Enabling Microservices Frameworks to Solve Business Problems
Enabling Microservices Frameworks to Solve  Business ProblemsEnabling Microservices Frameworks to Solve  Business Problems
Enabling Microservices Frameworks to Solve Business Problems
 

Kürzlich hochgeladen

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Kürzlich hochgeladen (20)

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 

Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud

  • 1. Case Study Elasticsearch Ingest @ Cisco Intercloud
  • 2. Agenda • Express Overview of StreamSets Data Collector Kirit Basu, Product Management, StreamSets • Introduction to Elastic CatherineJohnson, Solutions Architect, Elastic • Implementing Shipped Analytics Using StreamSets and Elasticsearch Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group Group
  • 4. © 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. History Founded by Informatica and Cloudera veterans. Mission Bring operational excellence to managing data in motion. Challenge Move data efficiently and with quality in the face of change. Solution Open source software enabling performance management of data flows. Use cases Hadoop Ingest, Search Ingest, Message Broker Enablement, Log Shipping, Cloud Migration, IoT, ... Momentum Thousands of downloads, hundreds of companies using. StreamSets At a Glance
  • 5. © 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. StreamSets Data Collector Adaptable Flows for Efficiency Design ingest pipelines with minimal coding and maximum flexibility. Data Flow KPIs for Control Monitor and act on data flow performance and data quality. Containerized Architecture for Agility Operate continuously in the face of constant change. Open source software for the rapid development and reliably operation of complex data flows.
  • 6. Get Started with StreamSets http://streamsets.com/opensource https://github.com/streamsets/datacollector/ #streamsets
  • 8. Software that makes massive amounts of structured and unstructured data usable for search, logging, analytics, and more in mission critical systems and applications
  • 9. Examples: Elastic Stack Use Cases Logging IT Operations Application Management Security Analytics Analytics Search Marketing Insights Business Development Customer Sentiment Website Search Internal/Intranet Search URL Search Internal Systems/Applications External Systems/Applications Developers IT/Ops Business Users
  • 10. Elastic Solves Many Developer Use Cases • Handles Complex & Diverse Data: Social, Location, User Activity, Machine (log files), Documents • Meets Today’s Core Developer Requirements: Many users / use cases, Fast data processing, Large data volumes, Data quality & integrity, Cross-source insights • Solves Critical Use Cases: Application Search, Embedded Search, Logging, Security Analytics, Operational Analytics, More …
  • 11. The Elastic Stack • Ingest • Store, Index, & Analyze • User Interface • Plugins: Monitoring, Security, Alerting • Elastic Cloud: Hosted Elasticsearch
  • 13. Implementing Shipped Analytics Using StreamSets and Elasticsearch • Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group • Tymofii Polekhin, Software Engineer
  • 14. Agenda • MANTL & Shipped • Shipped Analytics for Shipped • Why do we need Shipped Analytics? • Architecture and Data Flow • StreamSets Pipelines • End-to-end dataflow and performance with Elasticsearch • Benefits of StreamSets • Demo
  • 15. Microservices architecture for Mesos frameworks and other components • Microservices managed by Mesos in a single platform • Microservices managed and scaled separately [Diagram labels: infrastructure (CIS/AWS/Metastack/vSphere/UCS…) provisioned through Terraform as VM1 to VM5 or BM1 to BM5; Spark Scheduler with Spark Executors 1..N; Kafka Scheduler with Kafka Brokers 1..N; Docker; Traefik; Microservices with REST APIs; scripted provisioning; direct provisioning; policy, auto-scaling]
  • 16.
  • 17. Shipped Analytics Cluster [Diagram: Shipped Analytics cluster with three probes] • Both Shipped and Shipped Analytics run on MANTL • Shipped Analytics: infra and app logs and metrics analysis
  • 19. Infrastructure Layer: Zookeeper Cluster, Consul Cluster, Mesos Cluster, Marathon Framework, Kafka Cluster. Beats: topbeat, filebeat, journalbeat, dockerbeat • Experimenting with Elastic Beats (unified architecture, closer to the microservices model) • Elastic Beats to replace collectd plugins and cAdvisor for containers
  • 20. <file | top | *>beat and collectd → logstash (data normalization, tagging, cluster name decoration). DNS SRV records: beats.logstash.service.consul and collectd.logstash.service.consul. Logstash is a single process per cluster, discoverable with the standard inter-cluster discovery mechanism. It gets metrics from collectd and logs from filebeat on every slave, normalizes the data, and sends it to the desired output. NOTE: currently Logstash runs in a Docker container on every node; we will be moving to a Filebeat and Logstash Mesos framework soon.
  • 21. Sending data to the central SA cluster for long-term analytics: logstash in the Shipped cluster → kafka in Shipped Analytics, across the WAN with SSL authentication and SSL encryption. Kafka 0.9.0.0 supports SSL authentication and data encryption for producers. This is must-have security when sending data to an external destination over a WAN.
  • 22. StreamSets running in Mesos Spark cluster mode, processing data from multiple source Shipped clusters and storing it in an Elasticsearch cluster. [Diagram labels: kafka; StreamSets Spark Streaming cluster with a master instance and several Spark jobs; elasticsearch]
  • 23. Lambda Reference Architecture • Global Monitoring / Analytics Cluster (global, Texas-1) • Monitoring / Analytics Clusters (local: Texas-3, Ams.-1, Lon.-1) • Local components and deployment are the same as global, just smaller • Real-time and batch processing (Lambda), anomaly detection, visualization [Diagram labels on the links: SSL, Kafka, MQTT]
  • 24. Divide nodes by role for more stable cluster operation and ease of scalability • 3 master/search nodes • 5 live data nodes • 3 archive data nodes [Diagram labels: Shards=5, Replicas=4; Shards=5, Replicas=1; CPU=4, RAM=30GB, HDD=4TB (shown three times)]
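As a rough illustration of what the two shard settings on slide 24 imply, the sketch below counts shard copies per index under the usual Elasticsearch meaning of "replicas" (extra copies of each primary shard). The transcript does not say which setting belongs to which node tier, so the code does not assign them; the function name is made up for this example.

```python
def shard_copies(primary_shards: int, replicas: int) -> int:
    """Total shard copies for one index: every primary plus its replicas."""
    return primary_shards * (1 + replicas)


# The two settings listed on the slide; tier assignment is not stated there.
slide_settings = {
    "Shards=5, Replicas=4": (5, 4),
    "Shards=5, Replicas=1": (5, 1),
}

for label, (primaries, replicas) in slide_settings.items():
    print(label, "->", shard_copies(primaries, replicas), "shard copies per index")
```

More replicas multiply storage and indexing work, which is one reason to keep heavily replicated indexes on a separate group of nodes.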
  • 25. StreamSets pipelines process incoming messages and transform them according to business logic requirements: normalizing metrics, parsing log lines, and extracting important information using GROK filters or scripts. Pipeline stages: Cluster Name Decorator; Fields Type Normalization; Metrics/Logs Stream Splitter; ES Logs Output; General GROK Filters; Float Value Truncate; ES Metrics Output; Shipped GROK Logic
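A GROK filter is essentially a named-group regular expression applied to a log line. The sketch below shows the idea with Python's `re` module instead of real GROK syntax; the log layout, field names, and function name are assumptions made for illustration and are not taken from the Shipped pipelines.

```python
import re

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


print(parse_log_line("2016-03-01T12:00:00 ERROR disk full"))
```

Lines that do not match return `None`, so a caller can route them to an error stream rather than index half-parsed records.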
  • 26. Marathon • StreamSets instances running in Docker containers in Marathon o Easy deployment and scaling o Fast upgrade to newer versions • Issues we faced with this approach: o Containers were killed by Marathon o We needed to re-import the pipeline every time we launched a container
  • 27. Marathon • Working with StreamSets to resolve the OOM issue, we increased container memory and SDC heap size • At first all looked normal and we thought SDC had simply been starved of resources, but several days later it was killed again • We increased memory and heap even more, to 16G, but that bought just another day or two before it was killed again • It looked like the SDC heap was constantly filling with data that never went away, which eventually got the container killed • GC was also working hard, and sometimes we got freezes of up to 60 seconds • Decided to move out of Docker
  • 28. Marathon • StreamSets reads JSON messages from the Kafka cluster and outputs to the Elasticsearch cluster o De-serializing and serializing JSON was very slow with a single-threaded process o A Kafka consumption performance test showed: - JSON format: 5k records/sec avg - Text format: 50k records/sec avg - Binary format: 250k records/sec avg • The StreamSets team was very proactive with these issues, and in 2 days we received a fix for multi-threaded JSON parsing o New testing showed: - JSON format: 66k records/sec avg
  • 29. Marathon • StreamSets has never failed because of internal logic bugs, but we kept seeing the oom-killer pop up, and recovery was not automated • We decided to leave Docker and run SDC natively on the host, still using Marathon for scaling and failover • Without Docker, we can now upload our pipeline on SDC startup, and it starts working as soon as the instance has loaded • We can freely scale up/down whenever we need • We also got rid of the oom-killer issue
  • 30. Each one of our 3 SDC instances already processes ~3B messages, with no issues!
  • 31. • The StreamSets pipeline consumes metrics gathered by collectd and logs gathered by logstash from 4 different clusters (including itself), transforms and decorates them, and sends them to Elasticsearch for storage and analytics. • First, we consume messages from a Kafka topic at an average of 5,000 messages per second. The consumer itself parses the JSON format and passes the records on. • The next stage is a JavaScript script that decorates messages with a cluster name, based on the instance hostname in each message • Finally, we exclude Marathon events from the stream, sending them directly to ES
  • 32. • The next stage splits the stream into 2 parts: logs and metrics • Metrics are sent straight to ES without any transformation • Logs are the most interesting part: o We pull Docker container logs from the stream, delete the “time” field (a duplicate timestamp), and send them to ES o We separate logs from specific clusters, because we need to apply special logic to them o Separation is done by mapping IPs to clusters in the pipeline in real time
  • 33. • We collect data from several Mesos clusters and need to correlate container metrics with their logs • We use appID, taskID, and runID to identify a specific container's logs • Container logs themselves have all three of these, while mesos-master and mesos-agent logs lack runID • All unidentified data is discarded
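The flow described on slides 31 to 33 (decorate with a cluster name, route Marathon events and metrics, strip the duplicate "time" field from Docker logs, discard unidentified records) can be sketched in plain Python. This is not the actual StreamSets pipeline or its JavaScript stage: the record field names (`host`, `type`, `source`), the hostname prefixes, the destination labels, and the rule that appID and taskID are required while runID is optional are all assumptions for illustration.

```python
# Hypothetical hostname-prefix -> cluster-name table; real values are not in the slides.
CLUSTER_BY_HOST_PREFIX = {"shipped-a-": "cluster-a", "shipped-b-": "cluster-b"}


def decorate(record):
    """Return a copy of the record with a cluster name derived from its hostname."""
    host = record.get("host", "")
    cluster = next(
        (name for prefix, name in CLUSTER_BY_HOST_PREFIX.items() if host.startswith(prefix)),
        "unknown",
    )
    return {**record, "cluster": cluster}


def container_key(record):
    """Identify the container a log line belongs to, or None if unidentified.

    Assumption: appID and taskID are required; runID is optional because
    mesos-master and mesos-agent logs lack it.
    """
    app_id, task_id = record.get("appID"), record.get("taskID")
    if not app_id or not task_id:
        return None
    return (app_id, task_id, record.get("runID"))


def route(record):
    """Return (destination, record), or None when the record is discarded."""
    record = decorate(record)
    if record.get("source") == "marathon":
        return ("es-events", record)  # Marathon events leave the stream early
    if record.get("type") == "metric":
        return ("es-metrics", record)  # metrics get no further transformation
    if record.get("source") == "docker":
        # Drop the duplicate timestamp carried in the "time" field.
        record = {key: value for key, value in record.items() if key != "time"}
    if container_key(record) is None:
        return None  # unidentified log data is discarded (applied to logs only here)
    return ("es-logs", record)


print(route({"host": "shipped-a-01", "type": "metric", "value": 1.5}))
```

Each step returns a new dict instead of mutating its input, which keeps the stages easy to test on their own.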
  • 34. Current ShippedAnalytics prod cluster configuration:
      Kafka Cluster:
        o 7 brokers with 4 CPU and 16GB RAM each
        o Logstash topic for all incoming messages, with 7 partitions and 2 replicas
        o Current data flow averages 5,000 messages/sec to Kafka
        o Current data size averages 1.2MB/sec to Kafka
      StreamSets:
        o 3 instances with identical pipeline configuration reading from the Kafka cluster
        o 7 partitions are split between the 3 instances as 3/2/2
        o All 3 instances run natively on the host (non-Docker) with Marathon
        o Marathon restarts a failed instance with automatic pipeline upload and start
      Elasticsearch:
        o 7 nodes with 4 CPU, 16GB RAM, and 2TB storage each
        o Each metric is written to its own index, for a total of 15 indexes
        o Each index has 5 primary shards and 5 replica shards
        o Total doc count: 17.5B
        o Total doc size: 3.84TB
        o 1-day rate (count): ~500M
        o 1-day rate (size): ~120GB
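A quick back-of-the-envelope check of the slide 34 figures, as a sketch: it spreads 7 partitions over 3 consumers as evenly as possible and converts the average Kafka rates into daily totals. The helper name is made up, and the daily numbers assume the averages hold for a full 24 hours and use decimal units (1 GB = 1000 MB).

```python
def split_partitions(partitions: int, consumers: int) -> list:
    """Spread partitions over consumers as evenly as possible."""
    base, extra = divmod(partitions, consumers)
    return [base + 1 if i < extra else base for i in range(consumers)]


SECONDS_PER_DAY = 24 * 60 * 60

partition_split = split_partitions(7, 3)     # the 3/2/2 split from the slide
messages_per_day = 5000 * SECONDS_PER_DAY    # about 432 million messages
gb_per_day = 1.2 * SECONDS_PER_DAY / 1000    # about 104 GB on the Kafka side

print(partition_split, messages_per_day, round(gb_per_day, 1))
```

These estimates land in the same range as the reported daily rates of roughly 500M documents and 120GB; the Kafka wire size and the stored Elasticsearch size need not match exactly.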
  • 35. StreamSets is a great product to work with, and the team is super helpful and works fast • Lots of input and output connectors, huge processing capabilities • Very intuitive and rich user interface • Easy to create pipelines visually instead of writing code • Clear data flow paths • Small resource consumption compared to performance • Can easily handle up to 10k records/sec to Elasticsearch with 1 CPU and 2GB RAM • Simple configuration and deployment process • Open source(!) • Fast logic changes with minimum downtime • Preview mode(!): check every stage before throwing all your data at it • Rich data transformation possibilities • GROK filters: easy to migrate from Logstash • Smart error handling • Reliable: not once did StreamSets crash by itself; only Docker, Marathon, and Mesos issues