
Create a Data Science Lab with Microsoft and Open Source tools

Published in: Technology

  1. Create a Data Science Lab with Microsoft and Open Source Tools. Marcel Franke, pmOne AG, Germany
  2. About me: Marcel Franke. Practice Lead Advanced Analytics & Data Science, pmOne AG (Germany, Austria, Switzerland). More than 10 years of experience with large-scale data warehouses based on SQL Server. Blog: dwjunkie.wordpress.com
  3. What is data science?
  4. The definition: "Data science incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products." Source: http://en.wikipedia.org/wiki/Data_science
  5. A brief look into history
  7. The beginnings of gambling: Gambling has existed since 3000 BC. The first games were based on dice. Origins in China and Mesopotamia. * Source: Tiemeyer, E.; Zsifkovitis, H.: Information als Führungsmittel, München: Computerwoche Verlag 1995
  8. Scientific foundations: 17th century, the paradox of the Chevalier de Méré. Pascal and Fermat discussed the paradox in several letters. The beginning of the theory of probability. * Source: http://de.wikipedia.org/wiki/De-M%C3%A9r%C3%A9-Paradoxon
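
The slide does not spell out the arithmetic behind the paradox. In the usual statement, de Méré compared two bets: at least one six in 4 rolls of one die, and at least one double six in 24 rolls of two dice. The minimal Python sketch below computes both probabilities exactly. The function name `prob_at_least_one` is chosen here for illustration and does not come from the presentation.

```python
from fractions import Fraction


def prob_at_least_one(success: Fraction, trials: int) -> Fraction:
    """Probability of at least one success in `trials` independent tries."""
    return 1 - (1 - success) ** trials


# Bet 1: at least one six in 4 rolls of a single die.
p_one_six = prob_at_least_one(Fraction(1, 6), 4)

# Bet 2: at least one double six in 24 rolls of two dice.
p_double_six = prob_at_least_one(Fraction(1, 36), 24)

print(f"one six in 4 rolls:         {float(p_one_six):.4f}")
print(f"double six in 24 rolls:     {float(p_double_six):.4f}")
print(f"bet 1 favourable (> 1/2)?   {p_one_six > Fraction(1, 2)}")
print(f"bet 2 favourable (> 1/2)?   {p_double_six > Fraction(1, 2)}")
```

The first bet comes out slightly above one half and the second slightly below, which is why the seemingly proportional scaling from 4 rolls to 24 rolls fails.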
  9. The science in Data Science: calculating probabilities, pattern recognition, calculation of analytical variance, machine learning, simulations, predictions
  10. BI, Data Mining & Prediction
  12. What do companies do today?
  13. Walmart: the pioneer of data analytics. Source: Data Unser, Dr. Bloching. Images: walmart.com, yourdealz.de, squidoo.com, fuzzybrew.com
  14. Visa: 80% correct prediction of divorces within the next 5 years. Reason: divorce is the highest risk for private insolvency. Source: visa.de
  15. Customers need to find the right case. What do consumers really do? "Blonde looks somehow different." "The new washing powder is really great…"
  16. Data can be accessed easily…
  17. … but it's hard to analyze.
  19. How does this fit with Big Data?
  20. Our starting point… Structured data and unstructured data. Harmonize and generate information (role of the "Data Scientist"). "Big Data": volume, variety, velocity
  21. Typical Big Data Architecture (diagram). Labels: Big Data Analytics, Excel, Big Data Advanced Analytics, PowerPivot, Big Data Preparation (SQL, MapReduce), unstructured data, structured data, massively parallel processing, Big Data Storage Platform
  22. "[Facebook] started in the Hadoop world. We are now bringing in relational to enhance that. We're kind of going [in] the other direction." "We've been there, and [we] realized that using the wrong technology for certain kinds of problems can be difficult. We started at the end and we're working our way backwards, bringing in both." Ken Rudin, Director of Analytics for Facebook. Source: http://tdwi.org/articles/2013/05/06/facebooks-relationalplatform.aspx?j=192038&e=marcel.franke@pmone.com&l=50_HTML&u=3967541&mid=1060748&jb=84&m=1
  23. A few words about "R" • R is a language and environment for statistical computing and graphics • R is open source under the GNU General Public License • Most widely used statistical software • Everything happens in memory • Comes with a package manager (~5000 packages) • Also provides graphics capabilities
  24. Samples of R
  25. How to approach projects?
  26. Starting point: Problems that we already know from the BI world are further exacerbated by big data. • The complexity of systems grows constantly • The amount of data grows exponentially (= Big Data) • The need for change is more frequent and reaches ever deeper into business rules • Solutions can no longer be planned in advance
  27. Solution option 1: classic, deterministic. Everything can be planned and designed at the drawing board…
  28. How do products and components, and the relationships between them, behave as a system? Source: Cesar Hidalgo
  29. Solution option 2: learn from "Mother Nature" • How does nature deal with complex non-linear systems? • Evolution: variation and selection, "trial and error". "It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change." (Charles Darwin)
  30. A candlestick?
  31. 45 iterations. Technology helps to speed up iterations.
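
The variation-and-selection idea from slides 29 to 31 can be sketched as a simple mutate-and-keep-if-better loop. This is only an illustration under stated assumptions and not the method used in the presentation: the function `evolve`, the toy `fitness` function with its optimum at 3.0, the step size, and the seed are all hypothetical. The default of 45 iterations simply echoes the number on slide 31.

```python
import random


def evolve(fitness, start, iterations=45, step=1.0, seed=0):
    """Trial and error: mutate the current best, keep the mutant only if it scores higher."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best = start
    best_score = fitness(best)
    history = [best_score]
    for _ in range(iterations):
        candidate = best + rng.gauss(0, step)  # variation
        score = fitness(candidate)
        if score > best_score:                 # selection
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


def fitness(x):
    """Hypothetical toy objective: highest (zero) at x = 3.0."""
    return -(x - 3.0) ** 2


best, history = evolve(fitness, start=0.0)
print(f"best candidate after {len(history) - 1} iterations: {best:.3f}")
print(f"fitness went from {history[0]:.3f} to {history[-1]:.3f}")
```

Because a candidate is only accepted when it improves the score, the recorded fitness never gets worse from one iteration to the next. Faster iterations therefore translate directly into more chances to improve.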
  32. Laboratory & Factory
  33. The laboratory: trial and error, pattern recognition, analytical apps
  34. An efficient laboratory for experiments (diagram). Labels: Power Pivot, in-memory, Microsoft Excel, Power View, unstructured data, Power Query, source systems, Power Map, SQL Server, structured data, OLE DB, OData, web server logs, sensor data, Data Marketplace, SAP, databases
  35. The factory: easy to consume, integrated in the business process, analyze on mass data, host it and run it, at enterprise scale, for the real-time enterprise
  36. Stable Big Data Architecture (diagram). Labels: Prediction & Data Science, front ends & mobile, Windows Azure, on-premises, source systems, unstructured data, web server logs, sensor data, HDInsight, SQL Server PDW, Data Marketplace, structured data, SAP, databases
  37. How do we scale?
  38. The battle
  39. How do we scale? Relational data & compute: SQL Server 2012 Parallel Data Warehouse, half rack, InfiniBand. Analytical data & compute: HP DL 385, 40 cores, 2 TB RAM, Fusion-io card
  40. What is Revolution Analytics? • Founded in 2007 • Aim: evolve R for high performance • Offers R packages for faster performance and greater stability • Enterprise and community products • Stand-alone, scale-out (HPC), on Hadoop
  41. How do we handle our data? R-ODBC: 10 MB/s. Flat file export: 80 MB/s. Steps: data preparation, data transfer, predictive scripts
  42. Results • Generate predictions for 30,000 customers – 50,000 rows per customer, 54 columns – Customer goal: 5 minutes – Our solution: 7,500 customers in 5 minutes – Benchmark: 1 minute • Revolution Analytics ODBC driver does not work with PDW • Standard R ODBC driver reads data at 10 MB/s • Workaround via flat file export • RDS format faster than CSV
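
The figures on slides 41 and 42 imply a simple back-of-the-envelope calculation, sketched below in Python. The customer counts, the 5-minute goal, and the two transfer rates are taken from the slides. The assumption of a constant scoring rate, the helper names, and the 1,000 MB example volume are hypothetical and only serve to make the ratios concrete.

```python
def minutes_for(customers: int, rate_per_minute: float) -> float:
    """Minutes needed to score `customers`, assuming a constant rate."""
    return customers / rate_per_minute


def transfer_seconds(size_mb: float, mb_per_second: float) -> float:
    """Seconds needed to move `size_mb` megabytes at a constant throughput."""
    return size_mb / mb_per_second


# Figures from the slides.
measured_rate = 7_500 / 5   # customers scored per minute
total_customers = 30_000
goal_minutes = 5

projected_minutes = minutes_for(total_customers, measured_rate)
required_speedup = projected_minutes / goal_minutes

# Hypothetical data volume, used only to compare the two transfer paths.
EXAMPLE_SIZE_MB = 1_000
odbc_seconds = transfer_seconds(EXAMPLE_SIZE_MB, 10)       # R-ODBC at 10 MB/s
flat_file_seconds = transfer_seconds(EXAMPLE_SIZE_MB, 80)  # flat file at 80 MB/s

print(f"projected time for all customers: {projected_minutes} min")
print(f"speedup still needed for the goal: {required_speedup}x")
print(f"ODBC: {odbc_seconds} s, flat file: {flat_file_seconds} s")
```

Under these assumptions, the measured rate would need roughly a fourfold improvement to meet the customer goal, and the flat file export moves the same volume eight times faster than the standard R ODBC driver.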
  43. Other solutions? • R in database • R on Hadoop – RHadoop – Revolution Analytics RHadoop
  44. Other solutions? • Services & Cloud
  45. THANK YOU! • For attending this session and PASS SQLRally Nordic 2013, Stockholm