Aleksei Udatšnõi – Crunching thousands of events per second in nearly real time - NoSQL matters Barcelona 2014

Imagine you have a product which generates up to 10 thousand events per second, or around 1 billion events per day. This live stream of data needs to be tracked, processed and presented to end users in a visually appealing way. The solution needs to be integrated into a traditional web application. That is the real use case at Softonic. In this talk we will show how it was solved at Softonic. We use a stack of Big Data technologies to process and store a live stream of data and present the results to users in nearly real time. This real-life solution is built around the Hadoop ecosystem and includes Flume, Hive, Oozie and Impala. We will show how to store and query such volumes of data using a NoSQL database and how to build a scalable end-user web application on a nearly real-time data feed.

  1. Crunching thousands of events per second in nearly real time
     Aleksei Udatšnõi, Lead Software Engineer @ Softonic
     NoSQL matters Barcelona, 22 November 2014
     Photo by Chris Loxton / CC BY
  2. Softonic use case
  3. Softonic use case
     • Software guide, reviews, news, downloads
     • Originally built on LAMP stack
     • 100M+ visitors monthly
     • 1000+ tracking events/sec generated
  4. Big Data challenges
     • Track large volume of events
     • Process the stream of events in near real time
     • Present insights to stakeholders
  5. Softonic Developer Center
  6. Softonic Developer Center
     developer.softonic.com
  7. Legacy RDBMS-based solution
     INSERT INTO tracking VALUES ..
  8. Big Data architecture
  9. Data ingestion with Flume
     Photo by BilfingerSE / CC BY
  10. Data ingestion with Flume
      • Tracking events in real time
      • Accumulating into larger chunks
      • Ingesting into HDFS
  11. Data ingestion with Flume
  12-14. Flume considerations
      • Flume vs. alternatives (Kafka, Storm)
      • Prepare failover plan
      • Configuring rolling interval
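
The "accumulating into larger chunks" and "rolling interval" points above can be illustrated with a minimal Python sketch. It is not Flume's implementation or API: `RollingBatcher`, its parameters and the injected clock are hypothetical names for illustration. The idea resembles the roll settings of Flume's HDFS sink (for example `hdfs.rollInterval` and `hdfs.rollCount`): events are buffered, and the buffer is flushed as one chunk when it is old enough or large enough. The 1800-second interval mirrors the 30-minute delivery mentioned in the conclusions.

```python
import time
from typing import Callable, List, Optional


class RollingBatcher:
    """Buffers events and flushes them as one chunk when a roll condition is met.

    Hypothetical helper for illustration only; this is not Flume's API.
    """

    def __init__(self, flush: Callable[[List[str]], None],
                 roll_interval_s: float, roll_count: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._roll_interval_s = roll_interval_s
        self._roll_count = roll_count
        self._clock = clock
        self._buffer: List[str] = []
        self._opened_at: Optional[float] = None

    def append(self, event: str) -> None:
        # The chunk's age is measured from its first event.
        if self._opened_at is None:
            self._opened_at = self._clock()
        self._buffer.append(event)
        age = self._clock() - self._opened_at
        # In this sketch the roll conditions are only checked on append.
        if len(self._buffer) >= self._roll_count or age >= self._roll_interval_s:
            self.roll()

    def roll(self) -> None:
        # Hand the current chunk to the sink and start a fresh buffer.
        if self._buffer:
            self._flush(self._buffer)
        self._buffer = []
        self._opened_at = None


# Usage with a fake clock so the example is deterministic.
chunks: List[List[str]] = []
now = [0.0]
batcher = RollingBatcher(chunks.append, roll_interval_s=1800, roll_count=3,
                         clock=lambda: now[0])
for event in ["a", "b", "c", "d"]:
    batcher.append(event)      # "a", "b", "c" roll by count
now[0] = 1800.0
batcher.append("e")            # "d", "e" roll by age
```

A short interval delivers data sooner but produces many small files on HDFS; a long one does the opposite, which is presumably why the interval needs tuning.
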
  15. Data processing
      Photo by bonzoWiltsUK / CC BY
  16. Data processing
      • Extracting KPIs from unstructured data
      • Aggregating events by date, app, customer
  17. Data processing
      From raw data
        {"id":1234, "country": "AR", "date": "20141113"} .. {}
        {"id":1234, "country": "AR", "date": "20141113"} .. {}
        {"id":1234, "country": "ES", "date": "20141113"} .. {}
      To aggregated
        1234  AR  112
        1234  ES   56
        1234  MX   40
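
The step from raw JSON events to aggregated counts can be sketched in a few lines of Python. This is a stand-in for illustration, not the Hive job used in the talk; the function name `aggregate` and the sample lines are assumptions, and the grouping key (id, country) follows the example on the slide above.

```python
import json
from collections import Counter
from typing import Iterable


def aggregate(lines: Iterable[str]) -> Counter:
    """Count events per (id, country), skipping lines that cannot be used."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed record: skip rather than fail the whole batch
        if not isinstance(event, dict):
            continue
        key = (event.get("id"), event.get("country"))
        if None in key:
            continue  # record lacks a field needed for grouping
        counts[key] += 1
    return counts


# Hypothetical sample input, one JSON event per line.
raw = [
    '{"id": 1234, "country": "AR", "date": "20141113"}',
    '{"id": 1234, "country": "AR", "date": "20141113"}',
    '{"id": 1234, "country": "ES", "date": "20141113"}',
    'not json',
]
totals = aggregate(raw)
for (app_id, country), n in sorted(totals.items()):
    print(app_id, country, n)
```

Skipping unusable lines is a design choice of this sketch; a production job might instead route them to a separate location for inspection.
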
  18. Real-time querying
      Pre-requisites
      • Data is read from HDFS
      • Advanced filtering
      • Minimal latency
      • Integrated into existing ecosystem
  19. Real-time querying
      Options considered
      • Exporting aggregated data to MongoDB
      • Apache Spark
      • Presto
      • Cloudera Impala
  20. Impala
      Photo by Colin J. McMechan / CC BY
  21. Impala
      • Engine which enables real-time queries in Apache Hadoop
      • Based on Google's Dremel paper
      • Open source
      • Integrated into CDH
      • Outperforms alternative solutions (Shark, Presto)
  22-24. Impala considerations
      • Sharing resources with MapReduce jobs
      • Consistent data format is required
      • Invalidating metadata
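
The "invalidating metadata" consideration can be made concrete with a small Python sketch that only builds the statements to run after new data lands. `REFRESH` and `INVALIDATE METADATA` are real Impala statements; the table name `tracking_daily`, the column names and the helper function are hypothetical, and the slides do not say which statement the Softonic pipeline actually issued.

```python
import re
from typing import List

# Accepts "table" or "db.table"; rejects anything that is not a plain identifier.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def statements_after_ingest(table: str, new_table: bool) -> List[str]:
    """Return the Impala statements to run once a batch has been written.

    Hypothetical helper: it only builds SQL strings and does not talk to Impala.
    """
    if not _IDENT.match(table):
        raise ValueError(f"not a valid table identifier: {table!r}")
    # REFRESH picks up new data files in a table Impala already knows about;
    # INVALIDATE METADATA is for tables created or altered outside Impala.
    sync = f"INVALIDATE METADATA {table}" if new_table else f"REFRESH {table}"
    # Assumed schema: id, country, events (not taken from the slides).
    query = (f"SELECT id, country, SUM(events) AS events "
             f"FROM {table} GROUP BY id, country")
    return [sync, query]


for statement in statements_after_ingest("tracking_daily", new_table=False):
    print(statement)
```

Validating the identifier before interpolating it keeps the sketch safe to call with untrusted input, since table names cannot be passed as query parameters.
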
  25-26. Putting it all together
  27. Conclusions
      • Data ingested by applications in real time
      • Ingesting 1000+ events/sec
      • Data delivered to HDFS every 30 minutes
      • Average query response time < 1 sec
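
The volumes behind these figures can be checked with a few lines of arithmetic, shown here as a Python sketch. The rates (1000 events/sec sustained, up to 10 thousand events/sec in the abstract) and the 30-minute delivery interval come from the talk; the assumption that the rate stays constant over the whole period is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_in(rate_per_s: int, seconds: int) -> int:
    """Events produced at a constant rate over a period (simplifying assumption)."""
    return rate_per_s * seconds


per_roll = events_in(1_000, 30 * 60)               # one 30-minute delivery to HDFS
per_day = events_in(1_000, SECONDS_PER_DAY)        # a day at the sustained rate
peak_per_day = events_in(10_000, SECONDS_PER_DAY)  # a day at the abstract's peak rate

print(f"{per_roll:,} events per 30-minute batch")   # 1,800,000
print(f"{per_day:,} events per day at 1000/sec")    # 86,400,000
print(f"{peak_per_day:,} events per day at 10k/sec")  # 864,000,000
```

So each delivery to HDFS carries on the order of 1.8 million events, and the peak rate works out to 864 million events per day, which matches the abstract's "around 1 billion".
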
  28. References
      https://developer.softonic.com/
      http://flume.apache.org/
      http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-1-x/Impala/impala.html
      http://research.google.com/pubs/pub36632.html
      http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
  29. Thank you!
      @udachny
