Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Streaming Analytics

1.583 Aufrufe

Veröffentlicht am

This was presented in KDD 2016 conference in San Francisco, CA.

Veröffentlicht in: Daten & Analysen
  • I like this service ⇒ www.HelpWriting.net ⇐ from Academic Writers. I don't have enough time write it by myself.
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier
  • My brother found Custom Writing Service ⇒ www.WritePaper.info ⇐ and ordered a couple of works. Their customer service is outstanding, never left a query unanswered.
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

Streaming Analytics

  1. 1. Streaming Analytics Ashish Gupta (LinkedIn) Neera Agarwal
  2. 2. What is a Data Stream ● Unbounded Data ● Data arriving continuously at high rate ● Too large to first store and then process ● Need to be processed in one pass ● May display Temporal Locality - patterns may evolve over time
  3. 3. Contrast with Batch Processing 1. Process Bounded Files. Files by ingestion time. Last 15 minutes, last 1 hour, last 1 day 2. Program can go back and forth in data. Do multipass processing. 3. Sessions and joins can span files. 4. Many machine learning algorithms need full batch of data to train. 5. Very high Latency, but very high throughput as well. a. Wait for files to arrive. Ie wait for file window to close. b. Processing the whole file(s) take time.
  4. 4. Streaming Applications ● Joining Clicks and Impressions ● Mobile applications - User activity ● Session based analysis ● Fraud detection ● Industrial IOT ● LinkedIn’s Streaming Standardization Platform
  5. 5. Lambda Architecture New Data Streaming System Batch System (Hadoop) Application Speed DB Batch DB Speed Layer Batch Layer
  6. 6. Streaming System Streaming Systems Architecture New Data Streaming System Hadoop Store and Serve SQL and NoSql DB Sources Kafka Real-time Consumers Process and Analyze Consumers Kafka New Data New Data Save Log Spark Flink Storm Samza Kafka Streams Consumers
  7. 7. What is Streaming Analytics “ Continuous processing on unbounded data” “Software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format to identify simple and complex patterns to visualize business in real-time, detect urgent situations, and automate immediate actions.” - Forrester
  8. 8. Streaming Concepts Time Window Order Correctness Delayed data Out of order data Event Time Processing Time Fixed Window Sliding Window Sessions Consistency At least Once Exactly Once Checkpointing
  9. 9. Streaming Concepts - Time New Data Streaming System Kafka New Data New Data Event Time Ingestion Time Processing Time
  10. 10. Streaming Concepts - Windows Fixed Window/ Tumbling Window Sliding Window Session Window
  11. 11. Kafka Streams Processing Model Mini Batch Event level Event level Event level Event level Guarantee Exactly Once Exactly Once At least once At least once At least once State Management Yes Yes No Yes Yes Latency Medium Low Low Low Low Built in primitives Batch and streaming Batch and streaming Low Level API Low level API Streaming only Back Pressure Yes Yes No via Kafka via Kafka Open Source Streaming Systems
  12. 12. Case Study: LinkedIn Standardization Platform User edits profile Company Geo Title Education Seniority …. Search Feed ….. Job Recommendation Standardizer LinkedIn Profile
  13. 13. Pattern : External Lookup/Stream to Table Join Decide based on size of the data, latency needs and QPS of external systems Streaming System Service External Cache External DB Internal Cache
  14. 14. Pattern: Stream to Stream Joining ● Joins are expensive ● If partitions of two streams not collocated, then expensive shuffle ● Broadcast join if one file is small
  15. 15. Pattern: Reprocessing Streaming System (Samza) Hadoop SQL and NoSql DB Real-time Consumers Consumers Title, Geo, Company Education ….. DB Log Events DB Snapshot Kafka KafkaDatabus Databus Standardizer with ML Models Kafka
  16. 16. Pattern: Reprocessing Streaming System (Samza) Hadoop SQL and NoSql DB Real-time Consumers Consumers Title, Geo, Company Education ….. DB Log Events DB Snapshot Kafka KafkaDatabus Databus rewind 2 hours Standardizer with ML Models Kafka
  17. 17. Apache Kafka ● Highly scalable messaging system ● Distributed commit log ● Developed in LinkedIn back in 2010 ● At LinkedIn - more than 1.4 trillion messages per day across over 1400 brokers ● Distributed, partitioned, replicated ● Message retention - based on time and size
  18. 18. Some Kafka use cases ● Queuing/Messaging ● Metrics ● Auditing ● Logging Producer Producer Consumer Consumer Consumer Broker Broker Broker Broker Kafka Cluster Send messages Fetch messages Producer
  19. 19. References ● MillWheel: http://research.google.com/pubs/pub41378.html ● DataFlow:http://research.google.com/pubs/pub43864.html ● Samza: http://samza.apache.org/ ● Spark Streaming Paper: Discretized streams ● Big Data Stream Mining (KDD’15) ● Models and Issues in Data Stream Systems
  20. 20. Contact Us Ashish Gupta - ahgupta@linkedin.com https://www.linkedin.com/in/guptash Neera Agarwal - neera8work@gmail.com https://www.linkedin.com/in/neera-agarwal-21b9473

×