
BIG2016 - Lessons Learned from building real-life user-focused Big Data systems

2,941 views

Published on

Invited talk at the BIG2016 conference - colocated with WWW 2016 in Montreal, Canada

Published in: Technology


  1. Lessons Learned from building real-life user-focused Big Data Systems. Xavier Amatriain (@xamat) www.quora.com/profile/Xavier-Amatriain 04/12/16
  2. A bit about
  3. Our Mission: “To share and grow the world’s knowledge” • Millions of questions & answers • Millions of users • Thousands of topics • ...
  4. What we care about: Demand, Quality, Relevance
  5. Lessons Learned
  6. More Data vs. Better Models
  7. More data or better models? Really? Anand Rajaraman: VC, Founder, Stanford Professor
  8. More data or better models? Sometimes, it’s not about more data
  9. More data or better models? Norvig: “Google does not have better algorithms, only more data” Many features / low-bias models
  10. More data or better models? Sometimes, it’s not about more data
  11. How useful is Big Data? ● “Everybody” has Big Data ○ Does everyone need it? ○ E.g. how many users do you need to compute an MF with 100 factors? ● Smart (e.g. stratified) sampling can produce results that are as good (or better)!
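The smart-sampling point above can be sketched in a few lines of Python; the user events and the 10% sampling fraction below are made up for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw `fraction` of rows from each stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed user base: 90% casual users, 10% power users.
events = ([{"user_type": "casual", "id": i} for i in range(900)]
          + [{"user_type": "power", "id": i} for i in range(100)])
sample = stratified_sample(events, key=lambda r: r["user_type"], fraction=0.1)
```

A plain uniform 10% sample could easily under-represent the power users; stratifying guarantees every segment keeps its share.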
  12. Sometimes you do need a (more) complex model
  13. Better models and features that “don’t work” ● E.g. you have a linear model and have been selecting and optimizing features for that model ■ A more complex model with the same features -> improvement not likely ■ More expressive features with the same model -> improvement not likely ● More complex features may require a more complex model ● A more complex model may not show improvements with a feature set that is too simple
  14. Model selection is also about hyperparameter optimization
  15. Hyperparameter optimization ● Automate hyperparameter optimization by choosing the right metric ○ But is it as simple as choosing the max? ● Bayesian optimization (Gaussian Processes) works better than grid search ○ See spearmint, hyperopt, AutoML, MOE...
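Even plain random search illustrates the automation idea (the Bayesian optimizers listed above then spend trials more cleverly). The toy `validation_loss` below is an assumed stand-in for a real cross-validation run, not any particular system's objective:

```python
import random

def validation_loss(lr, reg):
    # Assumed toy objective: pretend the sweet spot is lr=0.03, reg=0.1.
    return (lr - 0.03) ** 2 + 0.5 * (reg - 0.1) ** 2

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random, keep the best validation loss."""
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 0.1), "reg": rng.uniform(0.0, 1.0)}
        loss = validation_loss(**params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params

best_loss, best_params = random_search(200)
```

A Bayesian optimizer would instead fit a surrogate (e.g. a Gaussian Process) to the observed (params, loss) pairs and pick the next trial by expected improvement rather than sampling blindly.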
  16. Supervised vs. (plus) Unsupervised Learning
  17. Supervised/Unsupervised Learning ● Unsupervised learning as dimensionality reduction ● Unsupervised learning as feature engineering ● The “magic” behind combining unsupervised/supervised learning ○ E.g. 1: clustering + kNN ○ E.g. 2: Matrix Factorization ■ MF can be interpreted as ● Unsupervised: ○ Dimensionality reduction à la PCA ○ Clustering (e.g. NMF) ● Supervised: ○ Labeled targets ~ regression
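The dual reading of MF above can be made concrete with a minimal SGD factorizer (toy ratings invented for illustration): the same learned factors act as a supervised fit of the labeled cells and, at the same time, as low-dimensional embeddings of users and items:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01,
              epochs=1000, seed=0):
    """Fit user/item factors to a sparse rating list [(u, i, r)] by SGD."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # supervised view: fit labels
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V  # unsupervised view: k-dimensional embeddings

# Made-up toy ratings (user, item, rating).
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0)]
U, V = factorize(ratings, n_users=3, n_items=2)

def predict(u, i):
    return sum(U[u][f] * V[i][f] for f in range(2))
```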
  18. Supervised/Unsupervised Learning ● One of the “tricks” in Deep Learning is how it combines unsupervised/supervised learning ○ E.g. stacked autoencoders ○ E.g. training of convolutional nets
  19. Everything is an ensemble
  20. Ensembles ● The Netflix Prize was won by an ensemble ○ Initially BellKor was using GBDTs ○ BigChaos introduced an ANN-based ensemble ● Most practical applications of ML run an ensemble ○ Why wouldn’t you? ○ At least as good as the best of your methods ○ Can add completely different approaches (e.g. CF and content-based) ○ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
  21. Ensembles & Feature Engineering ● Ensembles are the way to turn any model into a feature! ● E.g. don’t know if the way to go is Factorization Machines, Tensor Factorization, or RNNs? ○ Treat each model as a “feature” ○ Feed them into an ensemble
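Treating each model as a feature is essentially stacking. A minimal sketch: two hypothetical base scorers (stand-ins for, say, an FM and an RNN) feed a logistic-regression ensemble layer; all models and data here are invented for illustration:

```python
import math

# Two hypothetical base scorers; each maps an example x = [x0, x1] to [0, 1].
def model_a(x):
    return min(1.0, max(0.0, 0.8 * x[0]))

def model_b(x):
    return min(1.0, max(0.0, 0.3 * x[0] + 0.5 * x[1]))

def stack_features(x):
    """Each base model's output becomes one feature of the ensemble layer."""
    return [model_a(x), model_b(x), 1.0]  # trailing 1.0 = bias term

def train_ensemble(data, lr=0.5, epochs=500):
    """Logistic regression over the stacked features (SGD on log-loss)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            f = stack_features(x)
            p = 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
            w = [wi - lr * (p - y) * fi for wi, fi in zip(w, f)]
    return w

# Invented labeled examples: y=1 means the item was "good".
data = [([1.0, 0.2], 1), ([0.9, 0.9], 1), ([0.1, 0.1], 0), ([0.2, 0.0], 0)]
w = train_ensemble(data)

def ensemble_predict(x):
    f = stack_features(x)
    return 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
```

In a real system the base models would be trained on a separate fold so their outputs don't leak labels into the ensemble layer.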
  22. The Master Algorithm? It definitely is the ensemble!
  23. The Lost Art of Feature Engineering
  24. Feature Engineering ● Main properties of a well-behaved ML feature: ○ Reusable ○ Transformable ○ Interpretable ○ Reliable ● Reusability: you should be able to reuse features in different models, applications, and teams ● Transformability: besides directly reusing a feature, it should be easy to use a transformation of it (e.g. log(f), max(f), ∑f_t over a time window...)
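Transformability in practice can be as simple as shared helpers applied to any raw feature series; the upvote counts below are invented:

```python
import math

def rolling(values, window, fn):
    """Apply fn over a trailing window at each position (max/sum-over-window transforms)."""
    return [fn(values[max(0, i - window + 1):i + 1]) for i in range(len(values))]

upvotes = [1, 4, 2, 8, 0]                         # raw daily feature f
log_upvotes = [math.log(1 + v) for v in upvotes]  # log(f), with +1 smoothing
max_3d = rolling(upvotes, 3, max)                 # max(f) over a 3-day window
sum_3d = rolling(upvotes, 3, sum)                 # ∑ f over a 3-day window
```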
  25. Feature Engineering ● Main properties of a well-behaved ML feature: ○ Reusable ○ Transformable ○ Interpretable ○ Reliable ● Interpretability: in order to do any of the previous, you need to be able to understand the meaning of features and interpret their values ● Reliability: it should be easy to monitor and detect bugs/issues in features
  26. Feature Engineering Example - Quora Answer Ranking: What is a good Quora answer? • Truthful • Reusable • Provides explanation • Well formatted • ...
  27. Feature Engineering Example - Quora Answer Ranking: How are those dimensions translated into features? • Features that relate to the answer quality itself • Interaction features (upvotes/downvotes, clicks, comments...) • User features (e.g. expertise in topic)
  28. Implicit signals beat explicit ones (almost always)
  29. Implicit vs. Explicit ● Many have acknowledged that implicit feedback is more useful ● Is implicit feedback really always more useful? ● If so, why?
  30. Implicit vs. Explicit ● Implicit data is (usually): ○ Denser, and available for all users ○ More representative of user behavior vs. user reflection ○ More related to the final objective function ○ Better correlated with A/B test results ● E.g. rating vs. watching
  31. Implicit vs. Explicit ● However: ○ It is not always the case that direct implicit feedback correlates well with long-term retention ○ E.g. clickbait ● Solution: ○ Combine different forms of implicit + explicit feedback to better represent the long-term goal
  32. Be thoughtful about your training data
  33. Defining training/testing data ● Training a simple binary classifier for good/bad answers ○ Defining positive and negative labels -> a non-trivial task ○ Is this a positive or a negative? ● A funny uninformative answer with many upvotes ● A short uninformative answer by a well-known expert in the field ● A very long informative answer that nobody reads/upvotes ● An informative answer with grammar/spelling mistakes ● ...
  34. Other training data issues: time traveling ● Time traveling: using features that originated after the event you are trying to predict ○ E.g. your upvoting an answer is a pretty good predictor of you reading that answer, especially because most upvotes happen AFTER you read the answer ○ Tricky when you have many related features ○ Whenever I see an offline experiment with huge wins, I ask: “Is there time traveling?”
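One cheap guard against time traveling is to timestamp every feature value and drop anything recorded at or after the event being predicted; the feature log below is a made-up (user, answer) example:

```python
def features_at(feature_log, event_time):
    """Keep only feature values recorded strictly before the prediction event."""
    return {name: value for name, value, t in feature_log if t < event_time}

# (feature_name, value, timestamp); we are predicting a read event at t=100.
log = [
    ("answer_length", 1200, 0),
    ("author_expertise", 0.9, 0),
    ("upvoted_by_user", 1, 105),  # the upvote happened AFTER the read
]
clean = features_at(log, event_time=100)
```

The upvote feature would look hugely predictive offline precisely because it leaks the outcome; filtering by timestamp removes it.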
  35. Your model will learn what you teach it to learn
  36. Training a model ● The model will learn according to: ○ The training data (e.g. implicit and explicit) ○ The target function (e.g. probability of a user reading an answer) ○ The metric (e.g. precision vs. recall) ● Example 1 (made up): ○ Optimize the probability of a user going to the cinema to watch a movie and rate it “highly”, using purchase history and previous ratings. Use NDCG of the ranking as the final metric, with only movies rated 4 or higher as positives.
  37. Example 2 - Quora’s feed ● Training data = implicit + explicit ● Target function: value of showing a story to a user ~ weighted sum of actions: v = ∑_a v_a · 1{y_a = 1} ○ Predict probabilities for each action, then compute the expected value: v_pred = E[V | x] = ∑_a v_a · p(a | x) ● Metric: any ranking metric
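Once per-action probabilities are predicted, the expected value above is a direct weighted sum; the action weights v_a and probabilities here are made-up illustrative numbers, not Quora's actual ones:

```python
# Per-action values v_a (assumed weights for illustration).
action_value = {"upvote": 5.0, "share": 8.0, "click": 1.0, "downvote": -10.0}

def expected_value(p_action):
    """v_pred = E[V | x] = sum over actions a of v_a * p(a | x)."""
    return sum(action_value[a] * p for a, p in p_action.items())

# Predicted p(a | x) for two candidate stories (invented numbers).
story_scores = {
    "story_1": expected_value({"upvote": 0.10, "share": 0.02,
                               "click": 0.50, "downvote": 0.01}),
    "story_2": expected_value({"upvote": 0.02, "share": 0.01,
                               "click": 0.70, "downvote": 0.10}),
}
ranked = sorted(story_scores, key=story_scores.get, reverse=True)
```

Note that story_2 has the higher click probability but ranks lower once the downvote risk is priced in, which is exactly what the weighted-sum target is for.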
  38. Offline testing ● Measure model performance using (IR) metrics ● Offline performance is an indication for deciding which follow-up A/B tests to run ● A critical (and mostly unsolved) issue is how well offline metrics correlate with A/B test results
  39. Learn to deal with presentation bias
  40. 2D navigational modeling: attention ranges from “more likely to see” to “less likely” across the page
  41. The curse of presentation bias ● Users can only click on what you decide to show ● But what you decide to show is the result of what your model predicted is good ● Simply treating things you show as negatives is not likely to work ● Better options: ○ Correct for the probability a user will click on a position -> attention models ○ Explore/exploit approaches such as MABs
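Correcting for position can be sketched with inverse-propensity weighting: divide an item's observed CTR by the (assumed) probability that a user even examines that position. The examination probabilities and counts below are invented:

```python
# Assumed probability that a user examines each ranked position at all.
examine_prob = [1.0, 0.7, 0.4, 0.2]

def debiased_ctr(clicks, views, position):
    """Observed CTR corrected for how often the position is even looked at."""
    return (clicks / views) / examine_prob[position]

# Item A shown at position 0, item B buried at position 3 (made-up counts).
naive_a, naive_b = 5 / 10, 2 / 10
fair_a = debiased_ctr(5, 10, position=0)
fair_b = debiased_ctr(2, 10, position=3)
```

Naively, item A looks better (0.5 vs. 0.2); after correction, item B earns more clicks per examination, which is why attention models (or explore/exploit) matter before treating unclicked impressions as negatives.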
  42. You don’t need to distribute your ML algorithm
  43. Distributing ML ● Most of what people do in practice can fit into a multi-core machine ○ Smart data sampling ○ Offline schemes ○ Efficient parallel code ● Dangers of “easy” distributed approaches such as Hadoop/Spark ● Do you care about costs? How about latencies?
  44. Distributing ML ● Example of optimizing computations to fit into one machine ○ Spark implementation: 6 hours, 15 machines ○ Developer time: 4 days ○ C++ implementation: 10 minutes, 1 machine ● Most practical applications of Big Data can fit into a (multicore) implementation
  45. Conclusions
  46. ● In data, size is not all that matters ● Understand dependencies between data, models & systems ● Choose the right metric & optimize what matters ● Be thoughtful about: ○ Your ML infrastructure/tools ○ The interaction between data and UX
  47. Questions?