Big data should be simple

  1. 1. Big Data Should Be Simple (Dori Waldman, Big Data Lead)
  3. 3. Our Mission. [Diagram: high-level architecture, fully scalable and elastic; flow of ad requests (RTB and non-RTB) and impressions / clicks]
  4. 4. InnerActive Main DB Challenges. Big data challenges are complicated and include features such as:
     ● Massive cross-dimension aggregation with real-time drill-down, for presenting ad ROI…
     ● Analytics (Databricks / Zeppelin) on raw data
     ● Cube (dynamic reports)
     ● Anomaly detection
     ● Audience targeting, like Google AdWords:
       ○ Offline process to analyze users
       ○ Real-time targeting request
     ● ML: filtering bad traffic, targeting recommendations
  5. 5. Main DB Stack. [Diagram labels: Aggregations, Configuration, Analytic - Cube, Analytic, KV - Cache, SSOT, Q]
  6. 6. Aggregation
  7. 7. Before Spark Structured Streaming
  8. 8. Massive Aggregations: Lambda Flow
     ● Raw data arrives from Kafka
     ● Aggregation is needed by:
       ○ Country
       ○ Country, OS and AppId
     ● Batch / Stream
     [Diagram labels: Secor; Stream; Hourly + Daily; Batch; Analytic]
     Stream: aggregates hourly data. Batch: recovers hourly data from raw data. Parquet is used in order to avoid a read-update-save cycle in C* (Cassandra).
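
The two aggregation levels named on this slide can be sketched in plain Python. This is only an illustration of grouping by one or several dimensions, not the Spark job behind the slide; the event fields (`country`, `os`, `app_id`, `impressions`, `clicks`) and the sample rows are assumptions.

```python
from collections import defaultdict

# Hypothetical raw events; the field names are assumptions for illustration.
events = [
    {"country": "US", "os": "Android", "app_id": "a1", "impressions": 1, "clicks": 0},
    {"country": "US", "os": "iOS", "app_id": "a2", "impressions": 1, "clicks": 1},
    {"country": "DE", "os": "Android", "app_id": "a1", "impressions": 1, "clicks": 0},
    {"country": "US", "os": "Android", "app_id": "a1", "impressions": 1, "clicks": 1},
]


def aggregate(rows, dimensions, metrics=("impressions", "clicks")):
    """Sum the metrics for every distinct combination of the given dimensions."""
    totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        for metric in metrics:
            totals[key][metric] += row[metric]
    return dict(totals)


# Coarse aggregation: one total per country.
by_country = aggregate(events, ["country"])
# Fine aggregation: one total per (country, OS, app) combination.
by_country_os_app = aggregate(events, ["country", "os", "app_id"])
```

Because the totals are plain sums, re-applying the same batch twice would double them, which is why the following slides care about exactly-once processing.
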
  9. 9. In short...
     ● Exactly-once semantics:
       ○ Recovery from failure (YARN also helps)
       ○ Recovery from updates (checkpoint)
         ■ Non-idempotent data (aggregation)
         ■ Non-transactional DB (Cassandra)
     ● Offset management:
       ○ Save in the Spark checkpoint? No, we use ZooKeeper (ZK)
     ● Out-of-order data
     ● Flow: we save data in HDFS, and the folder name contains the “from” offset
       ○ We update ZK with the “from” offset at the beginning of each stream iteration
       ○ If Spark crashes, we retrieve the offsets from ZK during streaming initialization (the “from” of the last iteration)
       ○ Data is partitioned by date to manage out-of-order data
  10. 10. In short... Example:
      ● Crash before the data was updated in the DB (only ZK was updated):
        ○ Read Kafka offsets 200-300
        ○ Update ZK with 200
        ○ Crash
        ○ Recover
        ○ Get offsets from ZK
        ○ Read again from 200, possibly up to 305
        ○ Write to HDFS: no data lost
      ● Crash after the data was updated in the DB (and ZK):
        ○ Read Kafka offsets 200-300
        ○ Update ZK with 200
        ○ Update the data in HDFS folder “200”
        ○ Crash
        ○ Recover
        ○ Get offsets from ZK
        ○ Read from 200 up to 305
        ○ Overwrite the data in HDFS folder “200”: no duplication
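
The recovery scenario above can be replayed with a small in-memory simulation. It is a sketch under stated assumptions: plain dicts stand in for ZooKeeper and HDFS, a list stands in for the Kafka topic, and the names `run_iteration`, `zk` and `hdfs` are hypothetical.

```python
def run_iteration(log, start, until, zk, hdfs):
    """Process offsets [start, until) and return the next "from" offset."""
    zk["from"] = start                    # 1. record the "from" offset first
    batch = log[start:until]              # 2. read the offset range
    hdfs[f"offset={start}"] = sum(batch)  # 3. overwrite the folder named after "from"
    return until


log = [1] * 305        # stand-in for a Kafka topic: 305 events, each counting 1
zk, hdfs = {}, {}      # stand-ins for ZooKeeper and HDFS

next_offset = run_iteration(log, 0, 200, zk, hdfs)
next_offset = run_iteration(log, next_offset, 300, zk, hdfs)

# Crash: the in-memory next_offset is lost, only zk and hdfs survive.
recovered_from = zk["from"]   # 200, the "from" of the last iteration

# After recovery more data has arrived, so the range now runs to 305.
next_offset = run_iteration(log, recovered_from, len(log), zk, hdfs)

total = sum(hdfs.values())
```

Since the output folder is keyed by the "from" offset, replaying from 200 replaces the earlier partial result instead of adding to it, so `total` matches the number of events in the log.
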
  11. 11. Can we make it simple? We could use the checkpoint and enjoy built-in offset management, fixing data with a batch job only during deployments, but that is a maintenance nightmare and not aligned with CI/CD
  12. 12. Spark 2.1 API: https://www.youtube.com/watch?v=UQiuyov4J-4
  13. 13. Analytic
  14. 14. Analytic: ETL, Dashboard
      ● Data format: a Spark batch job converts raw data to Parquet format, which significantly reduces query time
      ● ETL: once we have Parquet files, we run several ETLs
      ● Dashboard: scheduled Databricks jobs update the dashboard
  15. 15. Multi-Dimension
  16. 16. Druid
      ● Cube
      ● Time series
      ● Columnar DB
      ● Fast
      ● ThetaSketch
      ● Loads data by stream / batch
      ● Great UI
      ● Complicated to deploy
  17. 17. Druid Components
  18. 18. Druid Usage
      ● Hourly - Stream
        ○ Segments become available once they are closed; we use Tranquility (Kafka consumer) to create hourly segments.
      ● Daily - Batch
        ○ We use batch ingestion (firehose) to create daily segments from hourly segments (from the start of the day until now)
      ● Hourly - Batch
        ○ We use EMR batch jobs to fix hours missing due to streaming failures, creating hourly segments from S3
  19. 19. Druid Usage. [Diagram: Hourly - Stream; Daily - Batch; Hourly - Batch (EMR)]
  20. 20. Anomaly Detector
  21. 21. Flow. We use the Twitter anomaly detector (https://github.com/twitter/AnomalyDetection). A scheduled process queries the DB and generates a CSV file with the time-series data; the CSV file is the input for anomaly detection, and an email is sent in case of an anomaly.
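
The CSV hand-off in this flow can be sketched as follows. Note the substitution: Twitter's AnomalyDetection is an R package, so the detector here is a simple median/MAD outlier rule used purely as a stand-in, not the algorithm that package implements. The column names, the sample counts and the threshold of 3.5 are assumptions, and the email step is left out.

```python
import csv
import io
import statistics


def to_csv(points):
    """Serialize (timestamp, count) pairs the way the scheduled job might."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["timestamp", "count"])
    writer.writerows(points)
    return buffer.getvalue()


def find_anomalies(csv_text, threshold=3.5):
    """Return timestamps whose robust z-score (median/MAD) exceeds the threshold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row["count"]) for row in rows]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread at all, nothing can be flagged by this rule
    return [
        row["timestamp"]
        for row, value in zip(rows, values)
        if 0.6745 * abs(value - median) / mad > threshold
    ]


# Hypothetical hourly impression counts with one obvious spike.
hourly = [
    ("2017-01-01 00:00", 100),
    ("2017-01-01 01:00", 102),
    ("2017-01-01 02:00", 98),
    ("2017-01-01 03:00", 101),
    ("2017-01-01 04:00", 500),
    ("2017-01-01 05:00", 99),
    ("2017-01-01 06:00", 100),
]
anomalies = find_anomalies(to_csv(hourly))
```

In a setup like the one described, a non-empty `anomalies` list would be the trigger for the notification email.
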
  22. 22. Targeting
  23. 23. Requirement: send an ad to a 20-year-old male from New York...
      Data Types
      ● A regular DMP can provide per-user information, e.g. that the user is male, 20 years old, from New York, likes sports and has a car (taxonomy).
      ● Internal DMP: InnerActive sees across publishers and demand sources, and monitors the user ad experience, such as which users click more and on what, and which types of video ad a user watches to the end...
      Data Collection and Merge
      ● 8T of daily raw data needs to be aggregated to provide relevant information per user, and then merged with data from several DMPs.
  24. 24. Create Audiences (offline)
  25. 25. Flow
      ● Spark batch jobs generate hourly / daily / monthly aggregations per user from S3 (a Kafka consumer uploads data to S3) → 800M rows with InnerActive DMP data
      ● Data is merged with DMP data (aligned with the taxonomy)
      ● Rows are ready to use and can be loaded into the DB for analytic usage, and later for audience selection through the console.
  26. 26. Raw Data
      ● 120 columns with:
        ○ Dimensions: Age, Gender, Country
        ○ Metrics: #Clicks, #Impressions, #Video
      ● 20 columns with multi-value dimensions:
        ○ Domains: [CNN.com, Ebay.com]
        ○ Interests: [Sport, Shopping, Cars]
      ● We check whether the DB supports advanced features for future requirements, such as a Map: {number of clicks per domain}, in addition to the user's total number of clicks
  27. 27. DB & Queries: POC Requirements
      ● Queries over 400M users in 1-2 sec (not all queries took 2s)
      ● Complex queries that still avoid joins (multi-valued / Map)
      ● Statistics functionality such as HyperLogLog / ThetaSketch
      ● JDBC compliance
      ● Easy to maintain
      ● Low concurrency
      ● Scale, scale, scale
      ● Support / community
      ● Price estimation
  28. 28. DB - How we chose: 1. Columnar database
      ● Minimizes disk reads: for “select age,name from tbl where age >30”, instead of reading all rows with all their columns and then selecting only 2 fields, a columnar DB reads data from only 2 columns and returns only those 2 columns
      ● High compression
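
The difference in data touched by the query on this slide can be made concrete with a toy model. It is only a sketch: Python lists stand in for disk layout, "fields read" is a rough proxy for I/O, and the table contents are invented for illustration.

```python
# Hypothetical table with four columns; only "age" and "name" are queried.
rows = [
    {"name": "Ann", "age": 34, "country": "US", "clicks": 12},
    {"name": "Bob", "age": 25, "country": "DE", "clicks": 3},
    {"name": "Cid", "age": 41, "country": "IL", "clicks": 7},
]

# The same data laid out column by column.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def query_row_store(table):
    """select age,name from tbl where age > 30, scanning whole rows."""
    fields_read = 0
    result = []
    for row in table:
        fields_read += len(row)  # every column of the row is read
        if row["age"] > 30:
            result.append((row["age"], row["name"]))
    return result, fields_read


def query_column_store(table):
    """The same query, touching only the two referenced columns."""
    ages, names = table["age"], table["name"]
    fields_read = len(ages) + len(names)
    result = [(age, name) for age, name in zip(ages, names) if age > 30]
    return result, fields_read


row_result, row_fields = query_row_store(rows)
col_result, col_fields = query_column_store(columns)
```

Both layouts return the same rows, but the columnar scan reads two fields per row instead of four; with 120 or more columns, as in the raw data described earlier, the gap grows accordingly.
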
  29. 29. DB - How we chose: 2. MPP. The data is big (0.5-1T) and needs to be split across machines to get results within ~2 sec
      ● Massively Parallel Processing:
        ○ Multiple processors work on different parts of the program.
        ○ Each processor has its own operating system and memory.
        ○ It speeds up huge databases that deal with massive amounts of data.
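
The scatter-gather idea behind MPP can be sketched on a single machine. This is an assumption-heavy illustration: threads stand in for separate nodes (which in a real MPP system have their own OS and memory), and the user schema, shard count and query are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical user rows; the schema is an assumption for illustration.
users = [
    {"user_id": i, "age": 20 + (i % 30), "country": ("US", "DE", "IL")[i % 3]}
    for i in range(1000)
]


def split_into_shards(rows, shard_count):
    """Assign each row to a shard by user id, as a sharded store might."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row["user_id"] % shard_count].append(row)
    return shards


def partial_count(shard):
    """Work done by one node: count users older than 30 per country."""
    return Counter(row["country"] for row in shard if row["age"] > 30)


def scatter_gather(shards):
    """Run the partial query on every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_count, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


result = scatter_gather(split_into_shards(users, 4))
```

Each shard only scans its own slice, so the per-node work shrinks as shards are added, while the merge step stays cheap because it combines small per-country counters.
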
  30. 30. DB - How we chose: 3. Fast: in-memory DB or OS cache?
      Main memory layers (trading speed against price): RAM / in-process memory such as a Java HashMap; operating system cache; disk (SSD or not)
      ● In memory
        ○ Row data resides in the computer’s random access memory (RAM), not on physical disk
      ● OS cache
        ○ Columnar/compressed data is on disk; executing a query loads the data into the OS memory
  31. 31. DB - How we chose: 3. Fast: in-memory DB or OS cache?
      ● We found that queries are faster on columnar data loaded into the OS cache than on row data in memory, because the row store needs to read all the data (our queries are random, so an index will not help)
      ● Columnar compressed data reduces storage size, so more raw data fits into the OS cache; the disadvantage is that we pay more CPU time to decompress the data per query
  32. 32. Let the Game Begin. [Diagram: FS with OS cache vs. in-memory]
  33. 33. What was not tested
  34. 34. ClickHouse
      ● JDBC compliance
      ● Very fast
      ● FS columnar DB (loads data to OS memory)
      ● Multi-value support: “IN”
      ● HyperLogLog support
      ● Easy to install + scale
      ● Loads CSV data
      ● Built for our use case
      ● No support; open source community?
  35. 35. MemSQL (commercial)
      ○ Combination of Aggregator nodes, which keep data in memory as row data, and Leaf nodes, which keep data in FS as columnar data
      ○ Increase Aggregators to better handle concurrency
      ○ Increase Leaves to minimize latency
      ○ We checked both options (row data in memory, columnar data in FS)
  36. 36. MemSQL: Memory vs. Cache
      ■ JDBC compliance
      ■ Very fast
      ■ Columnar DB + memory
      ■ Multi-value support? JSON search; we avoid joins
      ■ Easy to install + scale
      ■ Loads CSV data
      ■ Support / commercial
  37. 37. RedShift
      ■ JDBC compliance
      ■ Fast
      ■ Columnar DB + memory
      ■ No multi-value support: you can use “like”; joins are not fast enough
      ■ Service
      ■ Commercial
      ■ High price
      Built upon PostgreSQL, saves data in columnar format, no index, scalable, each query runs on its node (MPP)
  38. 38. Solr. We tried Solr although it handles data differently: Solr indexes documents and allows very rich queries. Solr fits our complex query requirements using its faceted navigation. Once you run queries, the data is loaded to cache and it provides good query time; we pay in time when loading/indexing the data. We did not finish the POC on Solr, as we decided to continue with another solution.
  39. 39. Presto. Supported by Teradata
  40. 40. Presto: How It Works
      ● Uses connectors to MySQL / HDFS / Cassandra
      ● MPP, shared-nothing across machines
      ● Priority queries
      ● Can load Parquet (columnar compressed format); no need for joins
      ● Supports Hive queries
  41. 41. Presto On Steroids
      ● Raptor is a Presto connector (configuration) that loads data onto the Presto machines (sharded on SSD)
      ● Instead of querying HDFS or S3, it queries its local SSD, while repeated queries use the OS cache
        ○ Short load time (<1H) from remote HDFS to local SSDs (we also tried Presto on top of Alluxio, and HDFS & Presto on the same machine)
  42. 42. Presto + Raptor: Load Data Flow
      ● Create a table using Hive (data located in the HDFS cluster)
      ● Create a table in Raptor
        ○ CREATE TABLE raptor.audienceTbl AS SELECT * FROM hive.default.audience;
      ● Load data with Raptor onto the Presto machines
        ○ INSERT INTO raptorTbl SELECT * FROM hiveTbl WHERE day = <day>
      ● In case of a Raptor failure, we can continue using Presto to query HDFS via Hive, and in the meantime load the HDFS data into Raptor again.
  43. 43. Presto On Steroids
      ● Another option instead of Raptor is LLAP (Hive cache)
  44. 44. Presto On Steroids
      ● Another option instead of Raptor is the Presto Memory Connector (RAM); we are going to test it
  45. 45. Presto UI: AirPal (Airbnb) https://github.com/airbnb/superset
  46. 46. And The Winner Is. We did not look for the fastest DB in the world. We looked for valid solutions for our use case, and we checked functionality, price, etc. Some solutions failed the functionality requirements, like MariaDB (columnar); some were ruled out by high price (RedShift). We decided to continue with Presto + Raptor: we can use our Parquet data as is, load it to HDFS, load it onto the Presto machines...
  47. 47. RealTime Targeting
  48. 48. Serving flow
      ● Once we have selected audiences, the next phase is matching audiences to the adRequest in real time
      ● We use a Spark batch job that creates a key-value mapping between userId and audiences, and we upload it to Aerospike
      ● During the adRequest flow we retrieve the user's audiences (~4ms)...
      Key-value store requirements: scale, in-memory KV, lots of reads, updated once a day in batch
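
The serving flow can be sketched as an inversion of audience membership into a per-user lookup table. This is a minimal illustration, not the Spark job or the Aerospike client API: a plain dict stands in for the key-value store, and the audience names, user ids and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical output of the offline audience selection: audience -> member user ids.
audiences = {
    "sports_fans_ny": {"u1", "u2"},
    "car_owners": {"u2", "u3"},
}


def build_user_index(audience_members):
    """Batch step: invert audience -> users into userId -> sorted audience list."""
    index = defaultdict(list)
    for audience in sorted(audience_members):
        for user_id in audience_members[audience]:
            index[user_id].append(audience)
    return dict(index)


# A dict stands in for the in-memory key-value store, reloaded once a day.
kv_store = build_user_index(audiences)


def audiences_for(user_id):
    """Request-time step: a single key lookup per ad request."""
    return kv_store.get(user_id, [])
```

Doing the inversion offline keeps the request path down to one read by key, which fits the read-heavy, once-a-day-update pattern the slide describes.
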
  49. 49. MLlib - next time
  50. 50. Thank you. Dori.waldman@inner-active.com https://www.linkedin.com/in/doriwaldman/ https://www.slideshare.net/doriwaldman