The IoT and big data

In the next five years, 15 to 40 billion additional connected devices are expected to hit the market. How can we handle data at such volume and velocity?

An introduction to Dynamo storage systems, Riak, Cassandra, time-series databases and edge analytics.

Published in: Engineering

  1. The IoT and Big Data, April 2017
  2. Augury - the mechanical diagnostics platform of the internet-of-things
  3. Augury - the mechanical diagnostics platform of the internet-of-things
  4. Me - Gal Ben-Haim, Head of Architecture @ Augury
     https://www.linkedin.com/in/gbenhaim
     https://github.com/bsphere
  5. Agenda
     - IoT data, access patterns and storage tiers
     - Time-series for IoT
     - Dynamo storage systems
     - Data processing architectures
     - Edge analytics
  6. IoT data - categories
     - Device metadata - device info, configuration, ...
     - Users - user info, preferences, billing, devices, ...
     - Raw data - streamed/ingested measurements and events
     - Aggregated data - data calculated over a time range
  7. IoT data - raw
  8. IoT data - aggregated
  9. IoT data - properties
     - Velocity - rate of measurements
     - Volume - number of devices, grows over time
     - Variety - different sensor types
     - Veracity - corrupt data, out-of-order, schema changes, …
  10. IoT data - value
      - Info/metadata is critical
      - Aggregated data is more valuable than raw data (there are always exceptions)
      - The value of raw data goes down over time
  11. Data access patterns
      Category                      | Read                             | Write           | Update
      Device/user info and metadata | Many reads, all over the dataset | Occasional      | Occasional
      Raw data                      | Mostly recent data               | High throughput | Rare
      Aggregated data               | Many reads, mostly recent data   | Periodic        | Rare
  12. Data access patterns - storage tiers
      Tier             | Purpose
      Mission critical | High performance data for processing
      Operational data | Dashboards, reports and online analytics
      Historical data  | Offline analytics
      Archive          | Regulatory, disaster recovery, unpredicted situations
  13. IoT data is a time-series! Why?
      - Sensor data is timestamped
      - Queries are usually across time ranges (data processing, online/offline analytics)
      - Historical data is moved to a different storage tier by time range
  14. CAP theorem
  15. Dynamo systems
      - Based on Amazon’s Dynamo whitepaper
      - Masterless distributed architecture
      - No single point of failure
      - Eventual consistency (AP over C), self healing (active anti-entropy)
      - Linear scalability, predictable performance
      - Various implementations:
        - AWS DynamoDB (K/V)
        - Cassandra (wide-column, Facebook) / ScyllaDB
        - Riak (K/V, Basho)
        - Voldemort (K/V, LinkedIn)
        - Dynomite (generic - Redis, Netflix)
  16. Dynamo systems - coordination
  17. Dynamo systems - consistent hashing “ring”
  18. Dynamo systems - replication strategy
  19. Dynamo systems - handling failures
  20. Dynamo systems - conflict resolution
      - Timestamps (last write wins)
      - Vector clocks / context / dotted versions
      - CRDTs (Conflict-free Replicated Data Types)
      - Application layer (r_val, siblings)
      - Immutable data is your best friend!
  21. Data modeling
      Non-goals
      - Minimize the number of writes
      - Minimize data duplication
      Goals
      - Spread the data evenly around the cluster
      - Minimize the number of partitions read
      - Avoid eventual consistency conflicts
      - Avoid large partitions
  22. Data modeling - Cassandra
      CREATE TABLE playlists (
        id uuid,
        song_order int,
        song_id uuid,
        title text,
        album text,
        artist text,
        PRIMARY KEY (id, song_order)
      );
  23. Dynamo time-series databases
      - Roll your own (Cassandra, Riak, DynamoDB, …)
      - KairosDB (Cassandra backend)
      - RiakTS
      Choose partitioning based on rate of writes and rate/type of queries - tradeoff between latency and throughput
  24. Cassandra time-series example
      CREATE TABLE temperature_by_day (
        weatherstation_id text,
        date text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY ((weatherstation_id, date), event_time)
      );
  25. RiakTS example
      CREATE TABLE GeoCheckin (
        region VARCHAR NOT NULL,
        state VARCHAR NOT NULL,
        time TIMESTAMP NOT NULL,
        weather VARCHAR NOT NULL,
        temperature DOUBLE,
        PRIMARY KEY (
          (region, state, QUANTUM(time, 15, 'm')),
          region, state, time
        )
      );
  26. Data processing architectures - Lambda architecture
  27. Data processing architectures - Kappa architecture
  28. Data processing architectures - Zeta architecture
  29. Data processing tools landscape
      - Apache Kafka
      - Apache Storm
      - Apache Spark (+ Spark Streaming)
      - Apache Flink
      - Apache Beam (Google Dataflow)
      - Apache Samza
      - There are more… Apache Apex, Apache Ignite, Twitter Heron, AWS Kinesis, ...
  30. Edge analytics
  31. Edge analytics
      - Data is processed near the source, not all data is sent back to the cloud
      - Critical for massive deployments, low latency, low bandwidth
      - Edge computing is done on device/gateway/on-premise servers
      - Analytics model is trained centrally and then deployed, using standards such as PMML
      - New complexity level - dataset for training models, multiple model versions and management, security, storage, compute horsepower, ...
      - Emerging platforms like AWS Greengrass
  32. We’re hiring - www.augury.com/about/careers
  33. Questions?
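
The consistent hashing “ring” and replication strategy named on slides 17 and 18 can be sketched roughly as follows. This is a toy Python illustration, not the implementation used by Cassandra, Riak or any other system listed in the deck: the `HashRing` class, the choice of MD5, the 64 virtual nodes per physical node and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hashing ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns several tokens so load spreads more evenly.
        self._tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        owners = []
        for offset in range(len(self._tokens)):
            _, node = self._tokens[(start + offset) % len(self._tokens)]
            if node not in owners:
                owners.append(node)
                if len(owners) == replicas:
                    break
        return owners


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.preference_list("sensor-42")  # three distinct replica nodes
```

Because a key's owner is the first token found clockwise from its hash, adding a node only takes over the keys that fall just before its new tokens; the rest of the data stays where it is.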

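Two of the conflict resolution strategies from slide 20, timestamps (last write wins) and CRDTs, can be contrasted with a small sketch. The `lww_merge` function and the `GCounter` class below are hypothetical names written for this example; a grow-only counter is just one of the simplest CRDTs, and real systems ship richer types.

```python
from dataclasses import dataclass, field


def lww_merge(a, b):
    """Last write wins: keep the (timestamp, value) pair with the later timestamp."""
    return a if a[0] >= b[0] else b


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    counts: dict = field(default_factory=dict)

    def increment(self, replica_id, amount=1):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        merged = dict(self.counts)
        for replica_id, count in other.counts.items():
            merged[replica_id] = max(merged.get(replica_id, 0), count)
        return GCounter(merged)

    @property
    def value(self):
        return sum(self.counts.values())


# Two replicas accept writes concurrently, e.g. during a network partition.
a, b = GCounter(), GCounter()
a.increment("replica-a", 2)
b.increment("replica-b", 3)
merged = a.merge(b)                  # both increments survive the merge
lww = lww_merge((100, 2), (101, 3))  # the write stamped 100 is silently dropped
```

The merge is commutative and idempotent, so replicas can exchange state in any order and still converge. Last write wins is simpler but discards one of the concurrent updates, which is one reason immutable, append-only data is so convenient in these systems.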
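
The time-based partitioning used in the Cassandra and RiakTS examples on slides 24 and 25 comes down to mapping each timestamp to a bucket. The sketch below shows that mapping in Python under some assumptions: `day_bucket` and `quantum_start` are hypothetical helper names, timestamps are timezone-aware, days are cut in UTC, and quanta are aligned to the Unix epoch.

```python
from datetime import datetime, timezone


def day_bucket(station_id: str, event_time: datetime):
    """Partition key in the style of (weatherstation_id, date): one partition per day."""
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (station_id, day)


def quantum_start(event_time: datetime, minutes: int = 15) -> datetime:
    """Floor a timestamp to the start of its N-minute quantum."""
    size = minutes * 60
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


reading_time = datetime(2017, 4, 3, 7, 38, 12, tzinfo=timezone.utc)
partition = day_bucket("1234ABCD", reading_time)
quantum = quantum_start(reading_time)
```

A smaller bucket keeps partitions small and spreads writes across the cluster, but a query over a long time range then has to touch more partitions, which is the latency/throughput tradeoff mentioned on slide 23.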