STREAMING 4 BILLION MESSAGES PER DAY
LESSONS LEARNED
Angelos Petheriotis (@apetheriotis)
HiveHome
What does HiveHome do?
Provides a range of different sensors that all work together
to build a smart, connected home.
How is Big Data generated at Connected Home?
more devices to be released
How is it accessible?
Avro messages through contracted Kafka topics
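To illustrate what a "contracted" topic means in practice, here is a sketch of an Avro schema such a topic might carry (the record and field names are hypothetical, not HiveHome's actual contract):

```python
import json

# Hypothetical Avro schema for a contracted sensor topic.
# Field names are illustrative, not HiveHome's actual contract.
internal_temperature_schema = {
    "type": "record",
    "name": "InternalTemperature",
    "namespace": "com.example.sensors",
    "fields": [
        {"name": "deviceId", "type": "string"},
        {"name": "houseId", "type": "string"},
        {"name": "timestamp", "type": "long"},      # epoch millis
        {"name": "temperature", "type": "double"},  # degrees Celsius
    ],
}

# Registering the schema up front is what makes the topic "contracted":
# producers and consumers agree on this shape via the Schema Registry.
print(json.dumps(internal_temperature_schema, indent=2))
```

Because the schema lives in the Schema Registry, a producer cannot silently publish messages that break downstream consumers.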
Some numbers?
4+ billion messages per day from input topics
to the Data Platform (increasing by thousands every day)
Is it useful?
Many CH & BG services are based solely on Big Data projects.
Processing 50K msgs/s from IoT devices, you learn to:
* design a microservices architecture that won't wake you up at 03:00 for a simple restart
* not duplicate stuff (code or configs): a significant % of our time we are plumbers.. let's make our lives easy
* be resilient to failures, especially when dealing with stateful applications
* communicate/collaborate with data scientists (mathematicians != engineers)
Try to:
* Decouple applications
* Stick to single responsibility principle
* Make apps portable
* Make apps immutable
* Make testing portable and easy
Docker & Kubernetes
GoCD (CI/CD)
Decouple ETL
Police EL with Schema Registry
Microservices in real-time pipelines
Monolithic approach (T + L)
Average the internal temperatures per house per 30 minutes and persist to ES.
Pros
* We only support/monitor one app!
* All in one place; you don't have to remember git repos etc.
Cons
* The job has 2 responsibilities
* Hard to test
* If we want to persist to Cassandra we need to reprocess the messages
* We cannot reuse the app
Microservices-based approach (separate T and L stages feeding ES and C*)
Pros
* 1 responsibility per app
* Easy to replace the ES load job with a Cassandra job
* Easy to replay data
* We CAN generalise/reuse the L stage
Cons
* We need to support/monitor 2 apps :(
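To make the T stage concrete, here is a minimal, framework-free sketch of the transformation itself: averaging internal temperatures per house over 30-minute tumbling windows. Field names are assumptions; the real job consumed from and produced to Kafka rather than plain lists.

```python
from collections import defaultdict

WINDOW_MS = 30 * 60 * 1000  # 30-minute tumbling windows

def average_per_house(readings):
    """readings: iterable of (house_id, epoch_millis, temperature).

    Returns {(house_id, window_start_millis): average_temperature},
    i.e. the records a separate L stage would persist to ES (or C*).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for house_id, ts, temp in readings:
        window_start = ts - (ts % WINDOW_MS)  # align to window boundary
        acc = sums[(house_id, window_start)]
        acc[0] += temp
        acc[1] += 1
    return {key: s / n for key, (s, n) in sums.items()}

readings = [
    ("house-1", 0, 20.0),
    ("house-1", 10 * 60 * 1000, 22.0),  # same 30-min window as above
    ("house-1", 40 * 60 * 1000, 18.0),  # falls into the next window
    ("house-2", 5 * 60 * 1000, 21.0),
]
print(average_per_house(readings))
# {('house-1', 0): 21.0, ('house-1', 1800000): 18.0, ('house-2', 0): 21.0}
```

Keeping T free of any sink-specific code is exactly what lets the same output feed either an ES or a Cassandra load job.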
We've gone through what our infrastructure looks like.
Let's see what we deploy on that infrastructure.
We used to write a lot of Spark apps for E & L operations:
> internalTemperature to Elasticsearch
> internalTemperature to Cassandra
> motionDetected to Elasticsearch
> deviceSignal to Cassandra
> …
But we replaced our Spark jobs, because we ended up with:
* Duplicated code all over the place for simple tasks
* Too many GitHub repos; hard to keep them all in your head
* Too much time to provision a small cluster to test the app
* Many resources (£££) wasted because of Spark's master/driver dependencies
The goal was to define the E & L stages *once* as
a generic, reusable component that handles:
* Offset management
* Serialization / deserialization
* Partitioning / scalability
* Fault tolerance / failover
* Schema Registry integration
Kafka Connect to the rescue
Kafka Connect:
* Suitable for EL operations (no T here)
* No driver/master/worker roles
* No dependency on ZooKeeper
* Uses the well-tested Kafka consumers/producers
* Configurable via a REST API
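"Configurable via a REST API" means deploying a sink is a JSON POST rather than a code deploy. A sketch of how that looks (the connector name and topic are hypothetical; the connector class shown is Confluent's ES sink, and property names vary per connector, so check your connector's docs):

```python
import json
import urllib.request

# Hypothetical Elasticsearch sink config; the name and topic are
# illustrative, and exact property names depend on the connector.
connector_config = {
    "name": "internal-temperature-es-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "internalTemperature",
        "tasks.max": "2",
    },
}

def deploy(connect_url="http://localhost:8083"):
    """POST the config to a Kafka Connect worker's REST API."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(connector_config).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)  # requires a running Connect worker
```

Scaling is then a config change (`tasks.max`) instead of resizing a Spark cluster.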
But by default you need to write some code for every application,
for its domain-specific transformations.
KCQL (Kafka Connect Query Language)
KCQL is a SQL-like syntax allowing streamlined configuration of Kafka sink connectors.
Examples:
* INSERT INTO transactionIndex SELECT * FROM transactionTopic
* INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic
Available operations: rename/ignore fields, repartition messages and many more
(https://github.com/datamountaineer/kafka-connect-query-language)
What it looks like today
Monitor your KC apps
* JMX metrics and logs from the app (JMX metrics provide detailed granularity on the state of the KC app)
* Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
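Besides JMX and the UI, Connect's REST API exposes per-connector health via its standard `GET /connectors/<name>/status` endpoint. A small poller sketch (the connector name and the alerting side are left to you):

```python
import json
import urllib.request

def connector_is_healthy(name, connect_url="http://localhost:8083"):
    """Return True if the connector and all its tasks report RUNNING."""
    with urllib.request.urlopen(f"{connect_url}/connectors/{name}/status") as resp:
        return is_running(json.load(resp))

def is_running(status):
    # Connect's status payload: overall connector state plus one entry per task.
    return (status["connector"]["state"] == "RUNNING"
            and all(t["state"] == "RUNNING" for t in status["tasks"]))

# Example of the payload shape the status endpoint returns:
sample = {
    "name": "internal-temperature-es-sink",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
}
print(is_running(sample))  # False: one task has failed
```

A failed task does not always fail the whole connector, which is why the check must look at both levels.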
E & L stages are now solid, well defined,
with minimal duplication and highly reusable.
T needs some polishing. Time to rethink our T stage.
What was the problem?
Spark is great but not always the best option:
-> it has the notion of micro-batches
-> handling state is not optimal
-> you need shared storage to store checkpoints and state
-> you need a cluster with master, driver & workers
SPARK :(
From Spark to Kafka Streams
Kafka Streams is great because:
-> it is cluster- and framework-free
-> it uses Kafka to store the state
-> it exposes the state via an API
-> it has no notion of micro-batches
-> KTables
-> no need for ZooKeeper
So we rewrote one of our CPU-heavy jobs in Kafka Streams.
Results:
-> Again: no need to worry about where to store checkpoints. Everything is stored in Kafka.
-> No need for a cluster. Just execute `java -jar app.jar`
-> Less scripting!
-> We needed to do funny stuff to make it work with Scala :(
And now we have:
-> 50% fewer resources used in some cases. Better CPU/memory utilisation across instances.
-> Easier auto-scaling. Just start more instances of your app and Kafka Streams will scale automatically.
-> Happier devops, because they worry about the infrastructure and not the frameworks on top of it.
And since the state is exposed through an API,
we now know what happens inside the app at any given time.
So far we have described the engineering side of the Data
Platform team.
Let's see who uses the data from our platform.
Data Science @ HiveHome
Some of the projects:‹
-> Energy Breakdown
Break down energy usage into categories (lighting, cooking, etc.) knowing only the total hourly
consumed energy (patent pending)
-> Heating Failure Alert
Try to identify if a boiler is not working properly, knowing only the internal temperature of a house
Data Science @ Connected Home
What do scientists need?
-> as much data as possible
-> as soon as possible
-> as accessible as possible
Data Science @ Connected Home
How to work with data scientists:
* Be proactive. Have the data ready in advance.
* Keep the data in a flexible datastore, e.g. Elasticsearch rather than Cassandra.
* Side-by-side development during each iteration of a model. (Scientists do not unit test!)
* Jupyter/Zeppelin notebooks. Easily run and scale a model across your clusters.
So what did we actually learn
(apart from all the cool stuff we can add to our CVs)?
* Decouple everything.
* When you start copying code and configs -> tools down, and rethink your application setup.
* Try new technologies. The initial learning curve will pay off later.
* Work closely with data scientists so they develop a mindset similar to an engineer's.