Building data "Py-pelines"


  1. Founded in 2010, TravelBird’s focus is to bring back the joy of travel by providing inspiration to explore and simplicity in discovering new destinations. Active in eleven markets across Europe and inspiring three million travelers daily via email, web, and mobile app.
Our Values
● Inspiring: Prompting you to visit a place you’d never thought about before.
● Curated & local: Proudly introducing travellers to the very best their destinations have to offer, with insider tips and local insight.
● Simple & easy: Taking care of the core elements of your journey, and there for you every step of the way.
  2. Our Team’s Role: Applying Data to Solve Problems
● Invoicing and liability risk modeling
● Marketing budgeting/attribution management
● CRM + personalization
● Email channel management
● Business intelligence
● Data gathering + enrichment (ETL)
● Data warehousing / big data analytics
And all done in Python
  3. Our Architecture (Overall)
● Fully AWS hosted
● Mixture of permanent hosts, auto-scaled, and dynamically launched (ex for ML jobs)
● Production is built in Django + MySQL
● Data Science architecture (interesting stuff in red) is:
○ Postgres + Vertica for databases
○ Kinesis for event buffering
○ Spark for ML
○ Airflow + Rundeck for scheduling
○ Redis for RT data
○ S3 + HDFS + GFS for storage
And Python for EVERYTHING
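As a minimal sketch of what "Kinesis for event buffering" can look like from Python, here is one hedged example using boto3; the stream name, region, and payload fields are assumptions, not taken from the slides.

```python
import json
import boto3

# Hypothetical stream name and region; the real buffering setup is not shown in the slides.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

def publish_event(event: dict, stream: str = "events") -> None:
    """Push one JSON-encoded event onto a Kinesis stream for downstream consumers."""
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "anonymous")),
    )

publish_event({"type": "page_view", "user_id": 42, "url": "/amsterdam"})
```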
  4. The Why-thon of Python
● We use it in production, so any dev can work on our data stack
● Best libraries available for ML/DL, visualization, data integration/transformation, anything you want to do with data
● It works for EVERYTHING in machine learning, even with big data, so it allows our data scientists to do data engineering as well
● It’s fast enough, and good hardware is cheaper than wasted dev time/resources
  5. The event pipeline architecture
  6. Python at the heart of it
● Python, Python and only Python
● Benefit from its great ecosystem (uwsgi, supervisord, Flask, 0mq, boto3, click, etc.)
● Some design patterns:
○ Pipelining down to the lowest level (use queues and monitor them)
○ JSON, JSON everywhere
○ Exploit polymorphism
○ Processes, not threads
○ And most of all: keep it lean, easier to understand, easier to maintain
● Building the bibrary: it’s got the BEST modules
○ Abstract from low level AWS details
○ Utilities
○ Centralized configuration
○ Event library
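A minimal sketch of the "pipelining with queues, JSON everywhere, processes not threads" patterns listed above; the stage names and message fields are illustrative, not from the talk.

```python
import json
from multiprocessing import Process, Queue

def producer(out_q: Queue) -> None:
    # Every message on the queue is plain JSON ("JSON, JSON everywhere").
    for i in range(3):
        out_q.put(json.dumps({"event_id": i, "type": "booking_view"}))
    out_q.put(None)  # sentinel to shut the pipeline down

def consumer(in_q: Queue) -> None:
    while True:
        raw = in_q.get()
        if raw is None:
            break
        event = json.loads(raw)
        print("processed", event["event_id"])

if __name__ == "__main__":
    q = Queue()  # the queue is also the natural place to hook monitoring
    stages = [Process(target=producer, args=(q,)), Process(target=consumer, args=(q,))]
    for p in stages:   # processes, not threads
        p.start()
    for p in stages:
        p.join()
```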
  7. The event library: a standard (2)
  8. The event library: a standard (3)
The event library also standardizes the life cycle:
● Decoding
● (De)serializing
● Processing
Easy to store, easy to move around, easy to work with.
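The slides show the event standard only as images; as a hedged sketch under that description, a polymorphic event class with a standardized decode/serialize/process life cycle could look like this (class and field names are assumptions).

```python
import json

class Event:
    """Base class: every event knows how to (de)serialize itself and be processed."""
    type = "generic"

    def __init__(self, payload: dict):
        self.payload = payload

    @classmethod
    def deserialize(cls, raw: bytes) -> "Event":
        # Decoding + deserializing from the wire format (JSON bytes).
        return cls(json.loads(raw.decode("utf-8")))

    def serialize(self) -> bytes:
        return json.dumps({"type": self.type, "payload": self.payload}).encode("utf-8")

    def process(self) -> None:
        raise NotImplementedError

class PageViewEvent(Event):
    type = "page_view"

    def process(self) -> None:
        # Downstream consumers only see the Event interface (exploit polymorphism).
        print("page viewed:", self.payload.get("url"))

event = PageViewEvent.deserialize(b'{"url": "/amsterdam"}')
event.process()
```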
  9. Deploy, Test, Quality Assurance
● Deployment
● Testing
● Monitoring
● Logging
  10. Our nightly ML job chain (one of many!)
● 93 tasks consisting of:
○ Creation of Spark clusters on spot workers
○ (Lots of) Spark models
○ Keras models on deployed spot workers (we LOVE spot!)
○ Database queries and data aggregations
○ Output merges S3 -> DWH
● The beauty of Python tooling? This is built and managed by data scientists, not engineers
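Airflow is one of the schedulers named earlier; a tiny Airflow 1.x-style sketch of how a fragment of such a job chain could be declared, with hypothetical DAG, task, and script names (the real 93-task chain is not shown).

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_ml_chain",          # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

launch_cluster = BashOperator(task_id="launch_spark_spot_cluster",
                              bash_command="python launch_cluster.py", dag=dag)
train_models = BashOperator(task_id="run_spark_models",
                            bash_command="spark-submit train_models.py", dag=dag)
merge_output = BashOperator(task_id="merge_s3_to_dwh",
                            bash_command="python merge_output.py", dag=dag)

# Three illustrative tasks standing in for the ~93 in the real chain.
launch_cluster >> train_models >> merge_output
```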
  11. Our Tools: PySpark, Keras, and Good Old Python
PySpark
● Used for all the big, sexy analytics
○ Regression on billions of records
○ Collaborative filtering
■ Average domain has 15k products and 1.5M training users
● PySpark instead of Scala allows recycling of all our custom Python libraries into ML jobs (rather than rewriting)
● In modern Spark, performance in Python and Scala is about the same (when using Spark functionality)
Keras
● Used for all the small, sexy analytics
○ Deep learning on session purchase propensity
○ Predicting sellout dates using RNNs
● Keras is easier and cleaner to read than raw TensorFlow
● Spark deep learning functionality is underdeveloped at this time
● In deep learning, TF is #1 and Keras #2, so Keras + TF is … #12? Great community and development
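A hedged sketch of what a small Keras purchase-propensity classifier can look like; the layer sizes, feature count, and data are made up, since the real models are not shown on the slides.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Dummy session features and purchase labels; stand-ins for the real training data.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dense(32, activation="relu"),
    Dense(1, activation="sigmoid"),   # probability of purchase in this session
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
```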
  12. An example job: User-Item Ratings
(Flow diagram on the slide; stages shown: Observed User-Item Propensity, User History Ratings, Collaborative Filtering (ALS), Ratings Adjustment, Collaborative Filtering (ALS), Current Items and Users, Calculate Scores for Current User-Item Pairs, User Features, Item Features, Re-weight based on feature data (ex airport preference), Write out to S3)
  13. PySpark, Easy as 1-2-3
● This is a simplified version of that model in 20 lines of Python, skipping one step
● A data scientist familiar with Python can be working productively in Spark in a few days
● Easy, fast modeling means we can keep iteration time low, increasing the number of tests
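The 20-line example itself appears only as an image on the slide; below is a hedged reconstruction of the same idea with pyspark.ml's ALS. Column names, S3 paths, and hyperparameters are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("user_item_ratings").getOrCreate()

# Observed user-item propensities derived from user history (path is hypothetical).
ratings = spark.read.parquet("s3://bucket/user_item_ratings/")

als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    implicitPrefs=True,          # propensities, not explicit ratings
    rank=20,
    coldStartStrategy="drop",
)
model = als.fit(ratings)

# Score only the user-item pairs that are currently live.
current_pairs = spark.read.parquet("s3://bucket/current_user_item_pairs/")
scores = model.transform(current_pairs)

# Re-weighting on feature data (ex airport preference) would happen here.
scores.write.mode("overwrite").parquet("s3://bucket/user_item_scores/")
```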
  14. How mails are built, the short version
● Each domain is built and sent independently, allowing easy restarts in the event of issues and better parallelization
● The job on average takes two hours to build and schedule 2.5 million emails and synchronize the same data to Redis
● It looks complicated, but this complex job chain of 62 steps is 85 lines of Python, easy to modify and maintain
● Tasks consist of:
○ Database creation of 35 million content records
○ Real time generation and capture of 7.5 million events
○ Launching and spinning down of 16 AWS workers
○ Syncing of >50 million records to Redis
○ Syncing three different APIs (AND Google Sheets! BIG DATA)
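One of the steps above is syncing tens of millions of records to Redis; as a minimal sketch under assumed key names, a batched sync with redis-py pipelines could look like this.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def sync_records(records):
    """Write records to Redis in batches, using a pipeline to cut round trips."""
    pipe = r.pipeline(transaction=False)
    for i, rec in enumerate(records, 1):
        # Key layout is an assumption; it is not described in the slides.
        pipe.set("mail:content:%s" % rec["id"], json.dumps(rec))
        if i % 10000 == 0:   # flush every 10k records
            pipe.execute()
    pipe.execute()

sync_records([{"id": 1, "subject": "Weekend in Amsterdam"}])
```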
  15. How do we go from ranks to mails?
  16. Every mail begins as a template
  17. All templates are used to generate dynamic SQL to build the mail content for each recipient
Why dynamic SQL?
● Our database hates transactions but loves batch
● Our data scientists can understand what’s going on and contribute (versus complex frameworks)
Total runtime for 800k mails with >20 different templates, all personal? Six minutes
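The generated SQL itself is not shown on the slides; as a hedged illustration of the batch-over-transactions idea, template rules could be rendered into one batch INSERT ... SELECT per content module. Table, column, and rule names here are invented.

```python
# Hypothetical template definition: which module fills which slot of the mail.
template_rules = [
    {"slot": 1, "module": "top_deal"},
    {"slot": 2, "module": "nearby_offers"},
]

def build_content_sql(mail_id: int, rules) -> str:
    """Render one batch INSERT ... SELECT per module instead of per-recipient transactions."""
    statements = []
    for rule in rules:
        statements.append(
            "INSERT INTO mail_content (mail_id, recipient_id, slot, item_id)\n"
            "SELECT {mail_id}, s.recipient_id, {slot}, s.item_id\n"
            "FROM scored_items s\n"
            "WHERE s.module = '{module}'".format(
                mail_id=mail_id, slot=rule["slot"], module=rule["module"]
            )
        )
    return ";\n".join(statements)

print(build_content_sql(42, template_rules))
```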
  18. What happens when mails are built?
1. 15k records are picked up via PyODBC into Pandas based on mail, segment, and desired send hour
2. For each sub:
   a. We identify and build a dictionary for each module based on defined template rules
   b. Special modules are injected based on upcoming travel, retention campaigns, etc.
   c. We determine a custom subject line using a Bayesian Bandit based on past subjects, content being sent, customer segmentation (ex preferred device type), and predicted open rate
   d. We add custom URL parameters to trigger experience changes based on past behavior
3. Those dicts are sent to RabbitMQ for consumption by our mailer (based on Django) to transform the JSON into pretty HTML
4. After successful transfer to RabbitMQ, those mails are marked as sent and the next batch is started
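Step 2c mentions a Bayesian Bandit for subject lines; a minimal Thompson-sampling sketch of that idea, choosing a subject from past opens and sends, is below. The subjects and counts are invented, and the real bandit also conditions on content, segmentation, and predicted open rate, which this sketch omits.

```python
import random

# Hypothetical per-subject statistics: opens and sends observed so far.
subjects = {
    "Your weekend escape is waiting": {"opens": 120, "sends": 1000},
    "Last seats to Lisbon":           {"opens": 95,  "sends": 700},
    "Handpicked deals for you":       {"opens": 40,  "sends": 450},
}

def pick_subject(stats) -> str:
    """Thompson sampling: draw an open rate from each subject's Beta posterior, pick the max."""
    best_subject, best_draw = None, -1.0
    for subject, s in stats.items():
        draw = random.betavariate(1 + s["opens"], 1 + s["sends"] - s["opens"])
        if draw > best_draw:
            best_subject, best_draw = subject, draw
    return best_subject

print(pick_subject(subjects))
```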
  19. THIS is a mail
● This mail consists of four content blocks, 50% of which were decided at runtime
● Generating this mail took 0.007 seconds (including all database transactions); rendering to HTML takes another 0.04 seconds
● Every element in the args is personal except utm_medium, utm_source, and utm_content
