
Cwin16 tls-datalab for scientists


  1. DataLab for Scientists – A new way of working. S. ANGELI / C. CORMONT / A. GREVIN, 29/09/2016
  2. Scientists' methodology for data analytics. Copyright © Capgemini 2015. All Rights Reserved.
     - Diagram labels: posed problem; raw bulk data; preprocessing and automatic reconciliation; data frames; descriptive modelling; predictive modelling
     - Algorithms for categorical data, numerical data, and textual data
     - Statistical tests, contingency and correlation matrices, factorial analysis, hierarchical clustering, important variable extraction, semantic graph construction
     - Logistic regression, discriminant analysis, decision trees, k-means, Kohonen maps, supervised neural networks, document analysis
  3. Current issues with traditional tools and ways of working. Copyright © 2016 Capgemini and Sogeti. All rights reserved.
     - No scalability
     - Lack of collaborative tools
     - Lack of visualization tools
     - Several tools and languages: lack of integration
     - Need to accelerate the path from R&D to real industrialization
  4. Inside the highly iterative journey
     - Roles: Applications Expert, Data Scientist, Business Process Expert, Technical Architect
     - Steps shown: Define Use Case; Architecture Design & Infra Setup; Data Extraction; Data Manipulation; Run Mathematical Algorithms; Understand Results
     - Tools listed: Hortonworks, RHadoop, MapReduce
  5. Our use case
     - Data size: 200 MB to 20 GB
     - R loads data into local memory, so the job takes too long (hours) or becomes impossible
     - Options: 1) RStudio & RHadoop, 2) Dataiku & pySpark
  6. First architecture with RStudio & RHadoop
  7. Dataiku: an integrated data science platform (REX IMC2 | 29/09/2016)
     - One product, one environment, one platform, for all your data science projects and predictive applications
     - Design: quickly create your predictive models and the associated workflows by combining visual components and programming languages in a common environment
     - Production: deploy your predictive applications in production using advanced automation of workflows, and expose your machine learning models via APIs
     - For clickers: acquire, prepare, filter, join, copy your data with visual components…
     - For coders: use your favorite (big data) programming languages to add arbitrary custom logic…
  8. Dataiku vs other software
     - Categories compared: ETL; analysis and ML; visualization
     - Tools shown: Dataiku, Talend, Scikit Learn, Spotfire, QlikView
  9. Results: RStudio-Hadoop vs Dataiku-Spark
     - RStudio-Hadoop: ~20 min; code only; IDE; charts with R; fractionate workflows
     - Dataiku-Spark: ~7 min; code, visual operations, and visual workflows; notebook; charts with Dataiku; run partial jobs
  10. DEMO
  11. BACK UP
  12. RStudio - RHadoop
     - RHadoop: MapReduce, HDFS, HBase, and Avro APIs
     - mapper(..[R function]) . . reducer(..[R function])
     - Use R functions to manipulate data
     - Diagram: map and reduce stages, each paired with disk
  13. Dataiku with Spark (PySpark – Rspark)
     - map(), reduce(), reduceByKey(), filter(), join(), group()
     - Use Spark functions to manipulate data
     - Diagram: map and reduce stages, each paired with memory
  14. Data in Dataiku and Spark
     - Dataiku Dataset: tables in Dataiku; Dataiku API
     - Spark DataFrames: distributed, SQL-like tables; DataFrames API
     - Spark RDD: distributed list
  15. Dataiku – Spark – Yarn
     - Dataiku Server
     - Spark (R, Python, SQL)
     - Yarn: NodeManager / Resource Manager
     - Executors (three shown)
  16. RStudio-Hadoop vs Dataiku-Spark
     - RStudio-Hadoop: ~20 min; code only; IDE; data preparation needed; charts with R; fractionate workflows
     - Dataiku-Spark: ~7 min; code, visual operations, and visual workflows; notebook; data preparation needed; charts with Dataiku and R; run partial jobs
  17. About Capgemini and Sogeti (www.capgemini.com, www.sogeti.com)
     The information contained in this presentation is proprietary. Copyright © 2016 Capgemini and Sogeti. All rights reserved. Rightshore® is a trademark belonging to Capgemini.
     With more than 180,000 people in over 40 countries, Capgemini is a global leader in consulting, technology and outsourcing services. The Group reported 2015 global revenues of EUR 11.9 billion. Together with its clients, Capgemini creates and delivers business, technology and digital solutions that fit their needs, enabling them to achieve innovation and competitiveness. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model.
     Sogeti is a leading provider of technology and software testing, specializing in Application, Infrastructure and Engineering Services. Sogeti offers cutting-edge solutions around Testing, Business Intelligence & Analytics, Mobile, Cloud and Cyber Security. Sogeti brings together more than 23,000 professionals in 15 countries and has a strong local presence in over 100 locations in Europe, USA and India. Sogeti is a wholly-owned subsidiary of Cap Gemini S.A., listed on the Paris Stock Exchange.
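
Slides 12 and 13 name a mapper/reducer pair and a set of Spark-style functions (map(), filter(), reduce(), reduceByKey()) without showing what they compute. The sketch below imitates those operations in plain, single-process Python so the semantics can be tried without a cluster. It is not RHadoop or PySpark code: the sample records and the helper names run_mapreduce and reduce_by_key are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records (key, value); not taken from the presentation.
records = [("a", 3), ("b", 5), ("a", 4), ("c", 1), ("b", 2)]


def mapper(record):
    """Emit (key, value) pairs for one input record."""
    key, value = record
    return [(key, value)]


def reducer(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(data, mapper, reducer):
    """Local stand-in for a MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in data:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


def reduce_by_key(pairs, func):
    """Local imitation of a reduceByKey step: fold values per key with func."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


# Slide 12 style: one mapper and one reducer function.
mapreduce_totals = run_mapreduce(records, mapper, reducer)

# Slide 13 style: chain filter, map, and a per-key reduction.
filtered = filter(lambda kv: kv[1] > 1, records)
doubled = map(lambda kv: (kv[0], kv[1] * 2), filtered)
chained_totals = reduce_by_key(doubled, lambda x, y: x + y)

# A plain reduce collapses all values into a single result.
grand_total = reduce(lambda x, y: x + y, (v for _, v in records))

print(mapreduce_totals, chained_totals, grand_total)
```

In this local version every intermediate result lives in memory and nothing is distributed. The sketch only illustrates what each operation computes, not how Hadoop or Spark schedule the work or where they store intermediate data.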
