
Make your data fly - Building data platform in AWS


AWS Community Day Nordics presentation by Kimmo Kantojärvi and Roope Parviainen from Solita Oy


  1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Community Day NORDICS, Clarion Hotel Helsinki, March 21, 2018
  2. Make your data fly - Building data platform in AWS
     Kimmo Kantojärvi & Roope Parviainen
  3. Today's topics
     ● We are...
     ● Architectural evolution
     ● Making Data DevOps work
     ● How to cope with the data challenges
     ● Our experiences with a couple of the components/services, plus some tips & tricks: EMR, Redshift, Airflow, visualization tools
  4. We are...
  5. Kimmo (@kimmokantojarvi)
     ● Coding architect
     ● 15 years in the data business
     ● AWS Certified Solutions Architect - Professional
     ● Ilves fan
     Roope
     ● Data Architect #HandsDirty
     ● Professional love for data for 5 years
     ● Software Development × DW × data platforms × IoT
     ● AWS Certified Solutions Architect - Professional
  6. We are a data and customer value driven transformation company
     ▪ 96% of our 186 clients recommend us
     ▪ Over 2 million daily users in maintained services
     ▪ Extensive partner network in tech and insight
     Founded 1996 · 650 employees · 6 cities · 4 countries · €76M turnover 2017 · 20% avg. profitable growth per annum
  7. Offering
     ● Consulting and service design: we help our customers create new services by understanding their customers and managing the change.
     ● Data, analytics and AI: we build capabilities and intelligence that help develop and create new business opportunities.
     ● Digital services: we build and deliver new business and service technologies and infrastructure.
     ● DevOps and cloud services: we chase results and take care of our customers and their services.
  8. (image slide)
  9. Architectural evolution
  10. It used to be so simple ;) Source → ETL → DW → BI
  11. Today the architecture is much more versatile, enabled by cloud
  12. What happened?
     From:
     ● On-premise
     ● Few key technologies
     ● Closed solutions from big players
     ● Investments
     ● Compute & storage combined
     ● Data pull/batch
     ● Schema-on-write
     ● GUI
     ● Long projects, big lead times
     To:
     ● Cloud
     ● Various specific technologies
     ● Open source
     ● Flexible cost structure
     ● Separation of compute & storage
     ● Data push/stream
     ● Schema-on-read
     ● Code
     ● Agile methods, need to deliver fast
  13. (image slide)
  14. Various options to load & process data
     ● Traditional
       ○ SQL
       ○ ETL tools
       ○ Integration tools
     ● APIs
     ● AWS services
       ○ Glue
       ○ EMR
       ○ Kinesis
       ○ IoT
       ○ EC2/Lambda
       ○ S3
     ● Processing/streaming engines
       ○ Spark
       ○ Flink
       ○ Storm
       ○ Presto/Hive
     ● Custom code
       ○ R, Python, etc.
       ○ Machine learning
     Make sure your new systems are built to share data!
  15. Offloading data processing with EMR (+ Spark)
     ● Suitable for processing large amounts of data and complex calculations
     ● Java, Scala, Python
     ● Combine SQL, Python generators and Spark dataframes - win-win!
     ● Very cost-effective with spot instances
     ● Some learning curve (understanding configuration, behaviour and metrics)
     ● Not all EC2 instance types available
     ● Ramp-up time ~10 min - not ideal for short tasks unless run continuously
     ● Testing code locally is challenging (e.g. py-test + Spark plugin)
  16. (Architecture diagram) EMR picks up code.zip plus job & environment configurations from S3, copies data from S3 and DynamoDB, and unloads results to S3 and Redshift. 60 x c3.xlarge process 10B rows in 1 hour for about €3.5; 1000 SQL queries replaced with 1000 lines of Python & Spark.
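A quick back-of-envelope check of the cost figure on the slide above (60 x c3.xlarge, 10B rows, roughly €3.5 for one hour):

```python
# Back-of-envelope check of the slide's EMR spot cost figure.
instances = 60            # c3.xlarge spot instances
hours = 1.0               # job duration
total_cost_eur = 3.5      # total cost quoted on the slide

cost_per_instance_hour = total_cost_eur / (instances * hours)
rows_per_euro = 10_000_000_000 / total_cost_eur

print(round(cost_per_instance_hour, 4))  # ≈ 0.0583 EUR per instance-hour
print(f"{rows_per_euro:.2e} rows per euro")
```

At under six euro cents per instance-hour, the spot price is a fraction of on-demand pricing, which is what makes this batch pattern so cheap.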
  17. (image slide)
  18. Checking spot price history before bidding (boto3):

     import boto3

     ec2_client = boto3.client('ec2', region_name='eu-west-1')
     # Spot price history for c3.xlarge in eu-west-1a over a three-week window
     response = ec2_client.describe_spot_price_history(
         AvailabilityZone='eu-west-1a',
         StartTime='2018-03-01',
         EndTime='2018-03-21',
         InstanceTypes=['c3.xlarge'],
         ProductDescriptions=['Linux/UNIX'],
         MaxResults=100
     )
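From a price-history response like the one above, one way to derive a bid is to pad the observed maximum a little. A minimal sketch over a hypothetical, simplified response (real `describe_spot_price_history` responses carry more fields, such as timestamps and availability zones):

```python
# Pick a spot bid from price history: a sketch over a hypothetical
# response shaped like EC2's describe_spot_price_history output.
sample_response = {
    "SpotPriceHistory": [
        {"InstanceType": "c3.xlarge", "SpotPrice": "0.048"},
        {"InstanceType": "c3.xlarge", "SpotPrice": "0.052"},
        {"InstanceType": "c3.xlarge", "SpotPrice": "0.045"},
    ]
}

prices = [float(p["SpotPrice"]) for p in sample_response["SpotPriceHistory"]]
# Bid a bit above the observed maximum to reduce interruptions
bid = round(max(prices) * 1.2, 4)
print(bid)
```

The 20% margin is an illustrative choice; the trade-off is between interruption risk and cost ceiling.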
  19. (image slide)
  20. So many data storage options nowadays
     ● File/object storage: S3
     ● Data warehouses: Redshift, Snowflake
     ● Traditional databases: RDS (MySQL, PostgreSQL, MariaDB, MSSQL, Oracle)
     ● NoSQL databases: DynamoDB, MongoDB, Cassandra
     ● In-memory databases: Exasol
     ● GPU databases: MapD, BrytlytDB
     ● Time series databases: Kdb+, InfluxDB
     ● Caches: Redis, Memcached
  21. Redshift performance requires planning & design
     ● Redshift is a cluster and each node holds its own slice of the data → data distribution affects query performance and data loading
     ● Optimal to query a few wide tables rather than join many narrow tables together
       ○ E.g. data vault modeling is a bit challenging from a query performance point of view
     ● Each table requires a minimum amount of storage → more nodes → higher minimum storage
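Distribution and sort keys are declared at table creation time. A sketch that generates such DDL; the table and column names are made up for illustration:

```python
# Sketch: generate a Redshift CREATE TABLE with distribution and sort keys.
# Table and column names are hypothetical.
def create_table_ddl(table, columns, distkey, sortkeys):
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n    {cols}\n)\n"
        f"DISTSTYLE KEY\nDISTKEY ({distkey})\n"
        f"SORTKEY ({', '.join(sortkeys)});"
    )

ddl = create_table_ddl(
    "f_orders",
    [("order_id", "BIGINT"), ("customer_id", "BIGINT"), ("order_ts", "TIMESTAMP")],
    distkey="customer_id",   # co-locate rows that are joined on customer_id
    sortkeys=["order_ts"],   # enables range-restricted scans on time
)
print(ddl)
```

Choosing the join column as the distribution key keeps matching rows on the same node and avoids shuffling data between nodes at query time.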
  22. In addition to data distribution, the query queue (WLM) setup is important
     ● Max 500 concurrent connections per cluster, but only max 50 query slots
     ● Each slot takes its own share of the memory: 50 slots → memory split into 1/50 parts
     ● Can be used to control long-running (maybe not so smart) queries made by users
       ○ E.g. failover after 5 min to a queue with fewer resources
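The memory dilution from adding slots is easy to quantify. A tiny sketch, with a hypothetical 100 GB of usable queue memory:

```python
# Sketch: how the WLM slot count dilutes per-query memory in a queue.
def memory_per_slot_mb(total_memory_mb, slots):
    # Each slot gets an equal share of the queue's memory
    return total_memory_mb // slots

# e.g. a queue with 100 GB of usable memory (hypothetical figure)
print(memory_per_slot_mb(100_000, 5))   # 20000 MB per query
print(memory_per_slot_mb(100_000, 50))  # 2000 MB per query
```

More slots mean more concurrency but less memory per query, which can push large sorts and joins to disk; that is why maxing out the 50 slots is rarely the right call.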
  23. In Spectrum we trust
     ● Store part of the data in S3 (e.g. Parquet + Snappy), access it as an external table with SQL
     ● Separate Spectrum compute layer
     ● Read-only: you still need to process the data into S3, and Redshift unload supports only CSV at the moment
     ● Athena and Spectrum seem to be faster if you have no joins, just a single table
     ● VPC support not available yet
     https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
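Exposing Parquet files in S3 as a Spectrum external table is a one-time DDL statement. A sketch that builds one; the schema, table, column, and bucket names are all hypothetical:

```python
# Sketch: build a Spectrum CREATE EXTERNAL TABLE statement for Parquet in S3.
# Schema, table, and bucket names are hypothetical.
def external_table_ddl(schema, table, columns, s3_path):
    cols = ", ".join(f"{n} {t}" for n, t in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} ({cols})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_path}';"
    )

ddl = external_table_ddl(
    "spectrum", "events",
    [("event_id", "BIGINT"), ("payload", "VARCHAR(256)")],
    "s3://my-data-lake/events/",
)
print(ddl)
```

The external schema itself must first be created against a catalog (e.g. the Glue Data Catalog); after that, the table can be joined with local Redshift tables in ordinary SQL.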
  24. Spectrum-related wish list
     ● VPC support
     ● Write/delete too, to allow schema-on-write
     ● Redshift unload to Parquet/Avro
     ● Some control over compute, or control over the cost structure
  25. Redshift still requires some maintenance
     ● Tasks taken care of by AWS
       ○ Backups
       ○ Resizing
       ○ Node/disk replacement
       ○ Query caching
     ● Built-in maintenance processes which the user controls
       ○ Analyze → the query optimizer needs to know the tables
       ○ Vacuum → sort data in the correct order and free up storage from deleted data
       ○ Compression → optimize table compression
     ● https://github.com/awslabs/amazon-redshift-utils
       ○ Great toolset for maintenance and reviewing system status
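The analyze and vacuum steps above are plain SQL commands, so they are easy to script per table. A sketch (the table names are made up):

```python
# Sketch: generate routine VACUUM/ANALYZE maintenance statements
# for a list of tables (table names are hypothetical).
def maintenance_sql(tables):
    stmts = []
    for t in tables:
        stmts.append(f"VACUUM FULL {t};")  # re-sort rows, reclaim deleted space
        stmts.append(f"ANALYZE {t};")      # refresh query optimizer statistics
    return stmts

for stmt in maintenance_sql(["staging.orders", "dw.h_customer"]):
    print(stmt)
```

In practice these statements would be executed against the cluster on a schedule, e.g. from an Airflow task during a quiet window, since VACUUM competes with user queries for resources.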
  26. Some other tips with Redshift
     ● With 3-year full prepayment, break-even comes after 1 year = the commitment is effectively only 1 year
       ○ 5.12 TB = 32 x dc2.xlarge = 2 x dc2.8xlarge ≈ $90k/year on demand
       ○ All-upfront 3 years ≈ $31k/year
     ● Publish directly from staging and model later → faster visible results for business users
     ● A lot of interesting development going on (especially Spectrum)
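The break-even claim follows directly from the two figures on the slide:

```python
# Back-of-envelope check of the slide's reserved-instance break-even claim:
# ~$90k/year on demand vs a 3-year all-upfront deal at ~$31k/year.
on_demand_per_year = 90_000
all_upfront_total = 3 * 31_000   # ~$93k paid upfront for 3 years

break_even_years = all_upfront_total / on_demand_per_year
print(round(break_even_years, 2))  # 1.03 - the upfront payment pays off in about a year
```

So if the cluster is needed for more than roughly a year, the 3-year all-upfront price already wins, which is the sense in which the commitment is "only 1 year".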
  27. (image slide)
  28. Sharing your data
     ● APIs
     ● Integration tools
     ● BI tools
     ● AWS services
       ○ QuickSight
       ○ Athena
       ○ API Gateway
       ○ S3
  29. Visualizing the data
     ● The first phase in generating value from data is to visualize it
     ● A general-purpose BI/analytics tool does not (always) cope with e.g.
       ○ vast amounts of data
       ○ special visualization needs
     ● Right tool for the right purpose, "mix and match"
       ○ PowerBI/Birst/QuickSight plus custom d3.js / trending tool / Grafana / Kibana
       ○ Multiple data sources
         ■ Virtualization of data sources
         ■ Data catalogs and understandability
  30. Visualizing the data (comparison image)
  31. Right tool for the right purpose
  32. (image slide)
  33. Fast and slow data - same but different
     ● Platforms have to be able to ingest both slow and fast data
       ○ Batches are simply not enough
       ○ Data streams & event-driven data loads
     ● Different endpoints/integrations (SFTP, HTTP REST, MQTT, data dumps)
     ● Different data pipelines and databases
       ○ Even for the same data, based on usage needs
       ○ Orchestration of the whole becomes difficult
       ○ Parallelism when loading
  34. (image slide)
  35. Managing the data flow
     ● Open source
       ○ Airflow
       ○ Oozie
       ○ Luigi
       ○ Jenkins
     ● Traditional ETL & integration tools
     ● AWS services
       ○ Batch
       ○ Step Functions
     ● Custom code
       ○ Lambda
  36. Airflow
     ● Visualization and management of the whole data load
       ○ SQL
       ○ Command line
       ○ Python/Java/etc.
     ● Suitable for batch loading
     ● Loads can be generated programmatically based on metadata
     ● Parallel/multiple loads, managing parallelism
     ● Load history
     ● Logs available directly
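The "loads generated programmatically based on metadata" point can be sketched without Airflow itself: derive task names and dependencies from table metadata, then (inside a real DAG file) turn them into operators. The metadata below is made up:

```python
# Sketch: derive load tasks and their dependencies from table metadata,
# as one might do inside an Airflow DAG file. Metadata is hypothetical.
tables = [
    {"name": "customer", "depends_on": []},
    {"name": "order", "depends_on": ["customer"]},
    {"name": "order_line", "depends_on": ["order"]},
]

tasks = {t["name"]: f"load_{t['name']}" for t in tables}
dependencies = [
    (tasks[dep], tasks[t["name"]])
    for t in tables for dep in t["depends_on"]
]
# In a real DAG each task would become an operator running generated SQL,
# and each (upstream, downstream) pair would be wired as upstream >> downstream.
print(dependencies)
```

Because the loop runs at DAG-parse time, adding a table to the metadata automatically adds its load task and ordering, with no hand-written DAG changes.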
  37. Airflow (screenshot)
  38. Airflow (screenshot)
  39. Airflow (screenshot)
  40. Things to consider
     ● Batch vs. streaming need to be handled separately
     ● Airflow has some flaws
       ○ The GUI is not always up to date
       ○ Scanning DAG statuses takes time
     ● If you have a lot of custom-code Lambdas running at different times, how do you manage parallelism and monitoring?
  41. Making Data DevOps work
  42. Data DevOps
     ● Target: achieve deployment processes similar to software projects
     ● Was not even possible earlier because of poor support in traditional tools
     ● To be effective and scalable it should be metadata-driven
       ○ Code generated based on metadata
     ● Need to focus on following good coding practices
     ● Version management for everything
       ○ Infrastructure as code
       ○ Recursive schema changes
       ○ Data load changes
       ○ Report changes?
  43. Data DevOps - Agile Data Engine
     ● Based on our previous experience/projects, now formalized and bundled as a product
     ● Enabled by AWS services, difficult to implement on-premises
     ● Design once, deploy to multiple runtime environments
     ● Functionality
       ○ Data modelling, load mapping, data vault automation
       ○ Continuous deployment management
       ○ Metadata-driven ELT execution and concurrency control
  44. (image slide)
  45. Data modeling and why data vault
     ● Data vault is a modeling and development method
     ● Hub = business entity, Satellite = all details, Link = join between entities
     ● Well-defined principles for development, naming conventions, etc.
     (Diagram: H_CUSTOMER and H_ORDER hubs with S_CUSTOMER and S_ORDER satellites, connected through the L_CUSTOMER_ORDER link)
  46. Data vault is one of the key enablers for increasing speed with a schema-on-write approach
     ● Data model split into pieces, allowing loads in multiple steps/parts
     ● Data loads can be auto-generated
     ● Many-to-many links allow representing any business situation
     ● History of changes stored built-in via the satellite structure
     ● A standard development model makes personnel changes easier
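The "data loads can be auto-generated" point is what makes data vault pair well with metadata-driven tooling: a hub load follows the same template for every entity. A sketch, with hypothetical staging table and business key names:

```python
# Sketch: auto-generate a data vault hub load from entity metadata.
# Schema, source table, and business key names are hypothetical.
def hub_load_sql(hub, business_key, source):
    # Insert only business keys not yet present in the hub
    return (
        f"INSERT INTO dv.{hub} ({business_key}, load_ts, record_source)\n"
        f"SELECT DISTINCT s.{business_key}, CURRENT_TIMESTAMP, '{source}'\n"
        f"FROM staging.{source} s\n"
        f"LEFT JOIN dv.{hub} h ON h.{business_key} = s.{business_key}\n"
        f"WHERE h.{business_key} IS NULL;"
    )

sql = hub_load_sql("h_customer", "customer_id", "crm_customers")
print(sql)
```

Satellite and link loads follow equally mechanical templates (change detection for satellites, key-pair insertion for links), so a few generators cover the entire warehouse.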
  47. How to survive the data challenges
  48. #saddata
     ● "Data you are forced to collect even though no one wants it as a customer, no one needs it in your business, and no one can find or utilize it" - Jarno Kartela, AWS Summit Stockholm, 2017
     ● So basically: consider what data you are collecting; it all adds maintenance overhead, and you need to keep GDPR in mind
  49. Handling malicious data
     ● Typically not considered
     ● The source could be a 3rd-party service or a system with poor data validation/handling
     ● Probably best to create a separate landing account and run security checks on the data before pushing it forward
  50. Simple tasks to secure data
     ● Encrypt
       ○ S3 buckets
       ○ RDS & Redshift
       ○ EBS volumes
     ● Just block accesses
       ○ Network ACLs
       ○ Security groups
       ○ S3 bucket policies
     ● Set up notifications on changes
     ● Prevent opening access, e.g. with a bucket policy:

     {
       "Version": "2008-10-17",
       "Statement": [
         {
           "Effect": "Deny",
           "Principal": "*",
           "Action": "*",
           "Resource": "arn:aws:s3:::my-bucket/*",
           "Condition": {
             "StringNotEqualsIfExists": {
               "aws:SourceVpc": "vpc-abcdefg"
             },
             "NotIpAddressIfExists": {
               "aws:SourceIp": ["1.1.1.1/32"]
             }
           }
         }
       ]
     }
  51. There is no single data platform to answer all your needs
     ● How do you remove customer data from Parquet files in S3 (as required by GDPR)?
     ● How do you manage access to S3, Redshift, Tableau, etc. in a centralized manner?
     ● No centralized metadata management (maybe Glue in the future)
  52. Credits: Harri Kallio, Tero Honko
  53. Thank you! Questions?
