
Architecting an Advanced Analytics Platform

Presentation by Georgios Gkekas at the O'Reilly Software Architecture Conference on Feb 27, 2018 in New York.

Published in: Business & Finance

  1. Architecting an Advanced Analytics Platform. Georgios Gkekas, ING. Enabling Machine and Deep Learning. New York, 27 February 2018
  2. ING is a global financial services provider serving more than 35 million customers. In the Netherlands we are the banking sector market leader with 10 million primary customers. Customers: 35 million private, corporate and institutional customers. Countries: more than 40, in Europe, Asia, Australia, North and South America. Employees: 52,000 worldwide, 12,416 in NL. Map legend: Market leaders (Benelux), Growth markets, Commercial Banking, Challengers.
  3. This year, ING was named Best Bank in the World.
  4. Exploding expectations, set by tech giants, not banks: Personal and Relevant; Instant and Seamless
  5. People engage with platforms more than ever: to connect to friends & family, measure your health, find information, make travel arrangements, buy products, etc.
  6. New technologies like robotics and AI drive massive efficiency gains, and data is a core competency
  7. We must find new ways to be relevant
  8. How empowering people drives growth. Improving the customer experience means giving our customers new and ever more reasons to interact. Better customer experience, more interactions, more data.
  9. Keep pace with small, fast, agile competitors.
  10. Where we were: a different approach in every market. Where we're heading: united IT, united processes, united data, united way of working, united support functions, shared services, one brand.
  11. Center of excellence in advanced analytics • Specialists in machine learning and big data technologies • Support business units • Experiment with new data • Technology • Methods • Sources • Training and knowledge transfers • Development of exploration environment • Fundamental research
  12. International Advanced Analytics Team: one team split over 2 locations, 21 FTEs – 2 STAs
  13. Show me the platform • High-level architecture • Data science tooling / software architecture • Security architecture • Data architecture • Data science on production • Future architecture
  14. Target user groups: Administrators, Data Scientists, Business Users, Data Owners. Activities shown: model / code development, ad-hoc visualization, training; visualization, dashboarding, decision making; ETL pipeline, data up-/downloads, batch processing calls; administration, configuration, CI/CD deployments.
  15. Capability Architecture (diagram labels): Data Access; Data Processing; Sqoop; Development Tools; Terminal Server; Visualization Tools; Dashboards; ETL Tooling; Streamlining; Hadoop Data Storage and Data Processing Services; Higher-level frameworks; Administration; User Access Area; GPUs; Hadoop Distributed File System; MapReduce; Hive; Pig; Spark; Flink; Administration & Monitoring (Ambari + Icinga); Atlas (planned); Ranger; Data Governance & Lineage; IDEs; Jupyter Notebooks; Java; Scala; Python; R; Visualization; Dashboards; Administrators; SQL DBs; Data Lakes; Data Scientists; Business Users; Data Owners; Models Development; Decision Making; Data Management; ETL Service; Export; Semi-production; Remote Desktop; APIs; External Data Sources; Microservices
  16. Technological Stack
  17. Data Architecture
  18. Data Architecture 1 (diagram labels): Source 1 … Source N; Layer 0 (raw data in HDFS, ingestion folder): read-only access for data scientists; Layer 1 (transformed data, project folder) and Layers S1 … SN (exploration data): full access for data scientists; Model 1 … Model K; Models on Production; External Systems
  19. Data Architecture 2: raw data, deep data and exploratory data
      /projects/<project>
      |__ ingestion (read access only)
      |__ project (read + write)
          |__ parquet (deep data)
          |__ features (exploratory data)
          |__ exploration (exploratory data)
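The folder layout on this slide can be sketched as a small path helper. This is only an illustration: the names `LAYOUT`, `layer_path` and `can_write` are hypothetical, and the sketch assumes that `parquet`, `features` and `exploration` sit underneath the `project` folder, which the flattened slide text suggests but does not state outright.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: layer name -> (sub-path under the project root, access mode).
# Folder names come from the slide; the nesting under "project" is an assumption.
LAYOUT = {
    "ingestion": ("ingestion", "r"),                  # raw data, read access only
    "parquet": ("project/parquet", "rw"),             # deep data
    "features": ("project/features", "rw"),           # exploratory data
    "exploration": ("project/exploration", "rw"),     # exploratory data
}


def layer_path(project: str, layer: str) -> PurePosixPath:
    """Return the HDFS-style path of a layer inside /projects/<project>."""
    sub_path, _mode = LAYOUT[layer]
    return PurePosixPath("/projects") / project / sub_path


def can_write(layer: str) -> bool:
    """True if the layer is writable according to the slide's access notes."""
    return "w" in LAYOUT[layer][1]


if __name__ == "__main__":
    print(layer_path("churn", "parquet"))
    print(can_write("ingestion"))
```

Keeping the access mode next to the path mirrors the slide's point that the ingestion folder is read-only while the project folder is read + write.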
  20. Partitioning Design Pattern • Organize your data into meaningful buckets • Every data entity / table is split into several files according to a column, usually a date column • Gains: performance, code simplification, simpler incremental updates
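A minimal in-memory sketch of the partitioning pattern, assuming rows are plain dictionaries and the partition column holds ISO date strings. The function names and the `<column>=<value>` directory naming are illustrative choices, not taken from the platform's code.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows into buckets named '<column>=<value>', one bucket per distinct value."""
    parts = defaultdict(list)
    for row in rows:
        parts[f"{column}={row[column]}"].append(row)
    return dict(parts)


def incremental_update(parts, new_rows, column):
    """Append new rows to their buckets and return only the bucket names that changed."""
    touched = set()
    for row in new_rows:
        key = f"{column}={row[column]}"
        parts.setdefault(key, []).append(row)
        touched.add(key)
    return sorted(touched)


if __name__ == "__main__":
    rows = [
        {"dt": "2018-02-26", "amount": 10},
        {"dt": "2018-02-26", "amount": 4},
        {"dt": "2018-02-27", "amount": 7},
    ]
    parts = partition_by(rows, "dt")
    print(sorted(parts))
    print(incremental_update(parts, [{"dt": "2018-02-28", "amount": 1}], "dt"))
```

The incremental update touches only the bucket for the new date, which is the gain the slide lists: older partitions never need to be rewritten.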
  21. Data Science Tooling
  22. Model Development Lifecycle: Data Sources, Data Ingestion & Preparation, Data Inspection, Data Cleaning, Feature Engineering, Feature & Model Selection, Model Explanation
  23. IAA Tooling for Model Development Lifecycle. Data Sources. Data Ingestion & Preparation: ETL Service (API-based, microservice). Data Inspection: Table Profiler (API-based, utility). Data Cleaning: manual work. Feature Engineering: Autofeat (software library). Feature & Model Selection: ZIP (software library). Model Explanation: FIT (software library).
  24. ETL Service Architecture: ETL Microservice Layer; HDFS target /filepath/partition=<value> containing part1, part2, …, partN; split file into small chunks; pluggable post-upload actions
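The two ideas on the ETL slide, splitting an upload into small parts under a partition folder and running pluggable post-upload actions, can be sketched as follows. Everything here is a hypothetical illustration: the function names, the part naming (`part1` … `partN`, as on the slide) and the callback signature are assumptions, and nothing is actually written to HDFS.

```python
def split_into_parts(data: bytes, part_size: int):
    """Split a payload into consecutive chunks of at most part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def part_paths(filepath: str, partition_value: str, n_parts: int):
    """Build target paths of the form <filepath>/partition=<value>/part<i>."""
    return [
        f"{filepath}/partition={partition_value}/part{i}"
        for i in range(1, n_parts + 1)
    ]


def upload(data: bytes, filepath: str, partition_value: str, part_size: int, actions=()):
    """Plan an upload and run each post-upload action with the resulting paths."""
    parts = split_into_parts(data, part_size)
    paths = part_paths(filepath, partition_value, len(parts))
    for action in actions:  # pluggable post-upload hooks (assumed signature)
        action(paths)
    return dict(zip(paths, parts))


if __name__ == "__main__":
    log = []
    plan = upload(b"abcdefgh", "/data/tx", "2018-02-27", 3, actions=[log.extend])
    print(list(plan))
```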
  25. ETL Service: curl -v -k -u <user> -X POST -F file=@<file> -F project=<project> https://<link>/api/uploadFile
  26. Table Profiler Architecture
  27. Table Profiler Results
  28. Autofeat generation: the Autofeat Generation Utility takes Data and a Features Specification and produces Enhanced Data. Supported feature types: • Window-based: Features based on SQL window functions are fully supported. • Grouping-based: Features generated for a whole group of data are available. • Historical: TimeSince and LastValue are the two implemented historical features. TimeSince is the amount of time since the same values previously appeared in the data, given a certain condition. LastValue is similar to TimeSince, but copies the previous value instead of calculating the time difference. • Direct: Features that transform a column in the original dataframe into one or more new columns are supported. At the moment, one-hot encoding and numeric binning are available. • Custom: Users can define their own feature types and plug those into autofeat.
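To make the two historical feature types concrete, here is a plain-Python sketch of what TimeSince and LastValue compute. It is not the autofeat library's API, which the slides do not show; the function names and row format are hypothetical, the "condition" is simplified to "same value in a key column", and rows are assumed to be sorted by time.

```python
def last_value(rows, key, value):
    """For each row, the previous value seen for the same key (None on first occurrence)."""
    previous = {}
    out = []
    for row in rows:
        out.append(previous.get(row[key]))
        previous[row[key]] = row[value]
    return out


def time_since(rows, key, time):
    """For each row, the time elapsed since the same key last appeared (None on first occurrence)."""
    last_seen = {}
    out = []
    for row in rows:
        prev = last_seen.get(row[key])
        out.append(None if prev is None else row[time] - prev)
        last_seen[row[key]] = row[time]
    return out


if __name__ == "__main__":
    rows = [
        {"customer": "a", "t": 1, "amount": 10},
        {"customer": "b", "t": 2, "amount": 5},
        {"customer": "a", "t": 4, "amount": 7},
    ]
    print(last_value(rows, "customer", "amount"))
    print(time_since(rows, "customer", "t"))
```

Both functions share the same bookkeeping, which matches the slide's remark that LastValue is like TimeSince but copies the earlier value instead of taking a time difference.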
  29. Autofeat Specification
  30. ZIP
  31. "Semi"-Production Support
  32. What production challenges concern data scientists most? (Bar chart, scale 0 to 3.5; categories: Data Quality Check, Performance Monitoring, Standard Data Science Pipeline, Web services/User interface, Model Replicability; annotation: High Priority!)
  33. Target Solution: PMML & Openscoring. What we want: a standard, efficient and transparent pipeline for productionizing data science models. What we can do: use Openscoring as middleware for scoring standards-based PMML data science models. Diagram labels: Data Scientists; Local Operations; PMML Models; Openscoring REST API; Web Services; Scoring; Insights
  34. What is a PMML model? PMML: Predictive Model Markup Language; PMML version: 4.3. Diagram sections: Pre-processing, Model, Scoring, Performance. Diagram labels: Association Rules; Baseline Models; Bayesian Network; Cluster Models; Gaussian Process; General Regression; k-Nearest Neighbours; Naive Bayes; Neural Network; Random Forest; Regression; Ruleset; Scorecard; Sequences; Text Models; Time Series; Decision Trees; Vector Machine; Univariate Statistics; Partitions; Predictive Model Quality; Clustering Model Quality; Gains/Lift Charts; Ranking Quality Information; ROC Graph; Confusion Matrix; Field Correlations; Result Explanation; Model Verification; Data Dictionary; Mining Schema; Data Transformations; Output and Targets; Normalization; Discretization; Value mapping; Text Indexing; Built-in Functions; Aggregation; …; Usage Type: active, target, sup..; Missing, invalid, outlier treatment; Header; Operation type, data type; Statistical Methods; Predicted values; Rules; Clustered ID and affinity; Warning…; Input Streams; Output Streams; Multivariate ANOVA; …; Multiple Models
  35. How to score PMML models? Steps: (1) uploading dataset; (2) POST request for executing a model, with sub-steps 2.a, 2.b, 2.c; (3) downloading the results. Diagram labels: Clients; Ingestion Server; Python Data Preparation and Scoring; Data Preparation; PMML Scoring; PMML data preparation and scoring; Openscoring.io REST API Services; Auto-deployment from Git; Features; POST Request; Read Models; Export Server; IAA Semi-production Environment
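The "POST request for executing a model" step could look roughly like the sketch below, which only builds the request and does not send it. The endpoint shape (`<base>/model/<id>`) and the JSON body with `id` and `arguments` follow the public Openscoring REST API as commonly documented, but that is an assumption here and should be checked against the deployed version; the base URL, model id and field names are hypothetical.

```python
import json
from urllib.request import Request


def build_scoring_request(base_url: str, model_id: str, record_id: str, arguments: dict) -> Request:
    """Build (but do not send) a POST request that asks a scoring service to evaluate one record."""
    body = json.dumps({"id": record_id, "arguments": arguments}).encode("utf-8")
    return Request(
        url=f"{base_url}/model/{model_id}",  # assumed Openscoring-style endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical base URL, model id and feature names, for illustration only.
    req = build_scoring_request(
        "http://localhost:8080/openscoring",
        "churn_model",
        "record-001",
        {"age": 42, "balance": 1250.0},
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req) against a running service.
```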
  36. Enterprise Security
  37. New GDPR requirement: Accountability. ING must demonstrate compliance with the following processing principles. Personal data must be: • Processed lawfully, fairly and in a transparent way. • Collected for specific, explicit and legitimate purposes. • Limited to what is necessary. • Accurate and, when necessary, kept up to date. • Kept no longer than is necessary for the purposes. • Protected by appropriate security. ING must maintain a record of the: • Name and contact details of the DPE and DPO. • Purposes of the processing. • Categories of individuals and categories of personal data. • Categories of data recipients, including their location. • Retention period. • General description of the technical and organizational security measures. • Transfers to third (non-adequate) countries. The new Data Repository is also a cornerstone for ensuring accountability: A. Implement Data Repository (What). B. Adhere to the Reference Architecture Data protection (How). C. Mandatory GDPR-proof Privacy Risk Assessment (Why). For this reason, building a data repository is a prerequisite for becoming GDPR compliant.
  38. Workflow-based roles assignment
  39. Security Architecture, currently. Steps: 1. LDAP login, access controlled through LDAP roles. 2. Get TGT* with keytab. 3.1. Controlled access through Ranger ACLs*. 3.2. Work with the cluster as technical project user. Diagram labels: ING Group; Entry point firewall; Access Area (Access Nodes, Linux users); Internal firewall; Hadoop Area (Hadoop Nodes); LDAP; Kerberos; Ranger; SSO; Authentication; Authorization. *TGT: Ticket Granting Ticket; *AD: Active Directory; *ACL: Access Control List
  40. Information Security Strategies & Concepts (work in progress). Steps: 1. AD login, access controlled through AD roles. 3.1. Controlled access through Ranger ACLs*. 3.2. Work with the cluster. Diagram labels: ING Group; Entry point firewall; Access VLAN (Access Nodes); Internal firewall; Hadoop VLAN (Hadoop Nodes); AD; AD Synchronization of Roles; AD trust & Roles Sync; IPA; Ramon; Corporate Role Assignment Process; Role provisioning; Kerberos; Ranger. *TGT: Ticket Granting Ticket; *AD: Active Directory; *ACL: Access Control List
  41. Moving Forward
  42. IPC, Data Lake, TPA & Advanced Analytics exploration environment. IPC: ING Private Cloud; Data Lake; Exploration Environment; TPA: Touch Point Architecture
  43. Proposed Solution
  44. Roadmap. 2015: Advanced Analytics Platform (DiBa). 2017: ING Central Advanced Analytics Platform (IPC + Data Lake). 2018: ING Family (Interhyp, BoB, TMB) AA Platform (Public Cloud). 2019: Friends (WB customers) AA Platform (Public Cloud + Fee Income). 2020: Public AA Platform with Bank Level Security.
  45. Thank you!