
Big Data Science - hype?

  1. Big Data Science Hype? Levente Török, Blinkbox Music Ltd ... GE Hungary
  2. Disclaimer: All statements appearing in the slides or in the presentation represent my personal opinion. They are not connected to any company or any person I have had or have a connection to. I make these statements with the risk of error.
  3. Summary - Big data? Data Science? Hype? - Continuous improvement of online systems - A/B testing
  4. Data Science, hype? Harvard Business Review in 2012
  5. Data Science, hype? Forbes in 2015: "Whether employers know or don't know what data scientists do, they have been using—in rapidly-growing numbers—the term 'data scientist' in job descriptions in the past two years, as Indeed.com's data demonstrates."
  6. Developers, developers ...
  7. "Data Science" in media. Yahoo Finance: "If you take a cue from the Harvard Business Review, the title goes to data scientists. That's right. Data scientist, as in, the type of person who can write computer code and harness the power of big data to come up with innovative ways to use it for companies like Google (GOOG), Amazon (AMZN), and Yahoo! (YHOO)."
  8. "Data Science" in media. Nature Jobs:
  9. Data Science, what is this? Wikipedia: "Data Science is the extraction of knowledge from data, which is a continuation of the field of data mining and predictive analytics"
  10. Data? Science...? 1) Big Data Engineer: Hive, YARN, Spark, Impala 2) Data Miner: SAS, KNIME, RapidMiner, Weka, IBM Clementine 3) Big Data & Data Miner: Apache Mahout, Spark MLlib, Spark GraphX, Apache Giraph, GraphLab ?
  11. Data Scientist? Big data - big failure: if an algorithm doesn't work on small data, it won't work on big data. 4) A Data Scientist is a real scientist: follows scientific principles in data modeling - conjectures hypotheses on the statistical structure of the data - validates them offline and online - improves the model iteratively
  12. Tools: R / Python / C++ http://bit.ly/1B3bSS1
  13. Tools: verdict. other -> R -> Python = 0.44 * 0.26 = 0.11; other -> Python -> R = 0.23 * 0.18 = 0.04. Is this correct? However ... what?
  14. Improving Online Systems. Examples: recommender systems (i.e. RecSys) - What to listen to next? What ad to display? Anomaly detection: Is this user/system behaviour "normal"? Is this system going to fail soon?
  15. Data Flow in Online Sys: online sys -> log -> daily aggregation -> long-term storage -> batch model building; log -> queue -> async online model updates -> near-optimal online data model
  16. The major difficulty (daily aggregates -> datasource -> batch model training; queue -> online model training): 1. batch model training starts at 4:00 and finishes at 4:30; 2. new online model updates start at 4:30 and would finish at 5:10 with all the events from 0:00 to 4:30, but new events arrived in the meantime ... -> streaming architectures
  17. Offline data modelling (diagram: Train, Test, Model, Prediction, Parameters)
  18. Offline modeling. 1. Data splits for train / test / quiz: - time based: e.g. 2 weeks / 1 day / 1 day - entity based: set of users - session based: set of sessions of users. Test data preparation: manual pos/neg sample data points labeled, or injected. 2. Train by batch training: given a data set, we try to fit the model to the data set, controlled by model parameters.
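The time-based split above (e.g. 2 weeks / 1 day / 1 day) can be sketched as follows; the event dicts and the "timestamp" field are illustrative assumptions, not the talk's actual pipeline.

```python
# Sketch of a time-based train/test/quiz split.
# Events are hypothetical dicts with a "timestamp" field.
from datetime import datetime, timedelta

def time_based_split(events, split_start, train_days=14, test_days=1, quiz_days=1):
    """Partition events into consecutive train/test/quiz time windows."""
    train_end = split_start + timedelta(days=train_days)
    test_end = train_end + timedelta(days=test_days)
    quiz_end = test_end + timedelta(days=quiz_days)
    train = [e for e in events if split_start <= e["timestamp"] < train_end]
    test = [e for e in events if train_end <= e["timestamp"] < test_end]
    quiz = [e for e in events if test_end <= e["timestamp"] < quiz_end]
    return train, test, quiz
```

An entity-based or session-based split would filter on user or session IDs instead of timestamps, but the three-way partition is the same.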
  19. Offline data modelling. 3. Prediction phase: given a model - for each user we met in train, we give predictions - for each event we can see in the test set, we predict its likelihood. 4. Evaluation phase: the similarity of prediction and test data is measured - RecSys: NDCG, Recall, Precision, AUC, ... 20 different metrics - artificially labelled data set for anomaly detection: C2B (AUC), weighted AUC ... - Sanity check! -> Q/A team
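One of the RecSys metrics listed above, NDCG, is simple enough to sketch directly; the function name and the graded-relevance input are illustrative, not from the talk.

```python
import math

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    `ranked_relevances` are graded relevance scores in the order the model
    ranked the items (position 0 = top of the list).
    """
    dcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant items down the list lowers the score, which is what makes it a useful offline proxy for recommendation quality.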
  20. Offline data modelling. 4. Parameter search in parallel: the output of the search is the parameter vector (+ model id) that returns the optimal solution offline according to our belief. NB: usually we are unsure which offline measure is going to reflect the best online results, so we have a number of optimal parameter vectors according to different offline measures.
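A parallel parameter search of the kind described above can be sketched like this; the parameter grid and the offline score function are stand-ins for real training-and-evaluation runs, not the talk's system.

```python
# Minimal sketch of a parallel grid search over model parameters.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def offline_score(params):
    """Stand-in for 'train a model with params, return its offline metric'.

    This toy score peaks at (reg, lr) = (0.1, 0.01).
    """
    reg, lr = params
    return -((reg - 0.1) ** 2 + (lr - 0.01) ** 2)

def parameter_search(grid):
    """Evaluate every parameter vector in parallel; return the best one."""
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(offline_score, grid))
    best = max(range(len(grid)), key=lambda i: scores[i])
    return grid[best], scores[best]
```

In practice the same search is run once per offline measure, yielding the several "optimal" parameter vectors the slide mentions.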
  21. A/B testing (diagram): Train_A -> Model_A -> Online pred_A -> Performance_A; Train_B -> Model_B -> Online pred_B -> Performance_B. ??
  22. Online performance tuning (diagram): Parameters -> Train_A -> Model_A -> Online pred_A -> Performance_A; Train_B -> Model_B -> Online pred_B -> Performance_B
  23. Online traffic split adj. (diagram): Train_A -> Model_A -> Online pred_A -> Performance_A; Train_B -> Model_B -> Online pred_B -> Performance_B
  24. Offline-Online matching. Offline measures: NDCG, AUC, ...; online measure: Avg Sess Len. Ranks per model - A: NDCG 1, AUC 1, Avg Sess Len 1; B: NDCG 2, AUC 3, Avg Sess Len 3; C: NDCG 3, AUC 2, Avg Sess Len 2. Compare with Pearson's corr. coeff.
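The comparison above reduces to correlating rank vectors. A dependency-free Pearson correlation sketch (on rank vectors this is equivalent to Spearman's rank correlation):

```python
def pearson_corr(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Applied to the slide's ranks: AUC ranks the models [1, 3, 2], identical to the online Avg Sess Len ranking [1, 3, 2], giving r = 1.0, while NDCG's [1, 2, 3] gives r = 0.5 - so here AUC would be the offline metric that best tracks the online measure.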
  25. On-line testing. 5. A/B testing: - control model - tested model (model with an offline-optimal parameter set). 6. Evaluation of online results. Measures: - session length, station length - return rate, CLTV. Filter and compare models -> wow!
  26. On-line testing. 7. Run many models one-by-one according to phase 4. 8. Figure out the best offline metrics: compare order statistics of offline and online models (i.e. Pearson's correlation) to figure out which of the offline metrics matter the most for online performance.
  27. Model comparisons. Problems: 1. On day 1 A is better, on day 2 B is better. 2. The version with the longest session length != the version with the highest full-play ratio of tracks. 3. Outliers dominate the session length average: - a number of users listen to the service "forever" - bouncing users pollute the session length average with high noise.
  28. A/B testing. 1. Version A: control group. 2. Version B: treatment group. With n_A, n_B users, we have k_A, k_B successes. Is it enough if I compare k_A / n_A with k_B / n_B?
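Comparing k_A / n_A with k_B / n_B directly is not enough, because the difference may be noise. The standard frequentist check is a two-proportion z-test (normal approximation); this sketch is a textbook recipe, not necessarily the talk's method.

```python
import math

def two_proportion_z_test(k_a, n_a, k_b, n_b):
    """Two-sided z-test for the difference of two conversion rates."""
    p_a, p_b = k_a / n_a, k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF, via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With the numbers used later in the deck (n_A = 150, k_A = 18; n_B = 145, k_B = 14) the raw rates differ (0.120 vs 0.097) but the test is far from significant, which is exactly the trap the slide's question points at.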
  29. A/B testing? Questions: - What if one day A wins and the next day B wins? - How many users should I use for testing? - How long should I run the test? - What if we have A, B, C ... versions we want to test?
  30. Classical Statistics. Hypothesis testing: - Does treatment B have any effect? - up to probability (1 - alpha) - given a sample size of N. Even the most well-known A/B testing platforms can lead you to illusory results. Command: "Sample size estimator"
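A sample size estimator of the kind mentioned above can be sketched with the usual normal-approximation formula for two proportions; the z-values are hard-coded for alpha = 0.05 (two-sided) and 80% power, which are assumptions, not the talk's settings.

```python
import math

def sample_size_per_group(p0, mde):
    """Approximate users needed per group to detect an absolute lift `mde`
    over a baseline conversion rate p0, at alpha = 0.05 (two-sided), 80% power.
    """
    z_alpha, z_beta = 1.96, 0.8416  # fixed critical values for these settings
    p1 = p0 + mde
    p_bar = (p0 + p1) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / mde ** 2)
```

Note how the required n grows roughly with 1 / mde^2: halving the detectable lift quadruples the traffic needed, which answers the slide's "how many users / how long" questions.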
  31. Binomial? Note that: Binomial distribution: P(k | n, p) = C(n, k) p^k (1-p)^(n-k). Beta distribution: f(p; alpha, beta) = p^(alpha-1) (1-p)^(beta-1) / B(alpha, beta), where B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta).
  32. New statistics: n_A = 150, k_A = 18; n_B = 145, k_B = 14. The major question:
  33. New statistics: n_A = 150, k_A = 18; n_B = 145, k_B = 14. The major question - Chance2beat: the probability that one variant beats the other, read off from the densities f_A(x; ...) and f_B(x; ...) over x.
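The Chance2beat quantity can be estimated by Monte Carlo from Beta posteriors over the two conversion rates (a uniform Beta(1, 1) prior is assumed here; this is one common Bayesian A/B recipe, not necessarily the talk's exact implementation).

```python
import random

def chance_to_beat(k_a, n_a, k_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_B > p_A) under independent
    Beta(k + 1, n - k + 1) posteriors for the two conversion rates."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(k_a + 1, n_a - k_a + 1)
        p_b = rng.betavariate(k_b + 1, n_b - k_b + 1)
        wins += p_b > p_a
    return wins / draws
```

With the slide's numbers the estimate comes out around one quarter: B has not convincingly beaten A, and more data is needed before declaring a winner.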
  34. Chance 2 beat. - This is a probability that we want to increase by testing. For example, f_A and f_B can be: - Gaussians - distributions with priors - empirical distributions, or - small-sample data sets used directly. - Sometimes that is not enough: use bootstrapping!
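When no convenient closed-form posterior applies (e.g. session-length averages dominated by the outliers slide 27 warns about), a percentile bootstrap is the fallback the slide points to. A minimal sketch, with the statistic supplied by the caller:

```python
import random

def bootstrap_ci(sample, statistic, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic.

    Resamples `sample` with replacement n_boot times, computes `statistic`
    on each resample, and returns the empirical (alpha/2, 1 - alpha/2)
    quantiles of those values.
    """
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        statistic([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Because it resamples the observed data directly, the same function works for medians or trimmed means, which blunt the outlier problems in session-length comparisons.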
  35. Thanks
