Pipedrive DW on AWS
November 3, 2016
Erkki Suurna
Talking points
DW role in Pipedrive
AWS services PDW uses
DW infrastructure
VPC, S3, Redshift, RDS, EC2, Kinesis, ELB
Security - KMS, IAM, S3 encryption
Hadoop and Spark stack
Self developed ETL in Python (300+ tasks running daily)
PDW main goal
Support a data-driven organisation
SaaS KPI metrics reporting to executive level
Product instrumentation and analysis
Feedback to end user
Business intelligence - data acquisition and transformation into meaningful information for business analysis
PDW data platform
Disparate data sources are processed into one unified trusted data level (3TB)
25+ models and aggregates
100+ tables
Sensitive data is obfuscated
Encourage the use of aggregated models instead of many canned reports
Process 100k+ backups daily: 330 GB compressed => 3-4 TB of data
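One way the "sensitive data is obfuscated" point could work is keyed hashing, which keeps values joinable across tables without exposing them. The deck does not say which method PDW uses, so this is only a minimal sketch: the use of HMAC-SHA256, the key, and the field names are all assumptions.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would come from a secret store.
SECRET_KEY = b"example-key-not-for-production"

def obfuscate(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, non-reversible token for a sensitive value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()

def obfuscate_row(row: dict, sensitive_fields: set) -> dict:
    """Copy a row, replacing the sensitive fields with tokens."""
    return {
        name: obfuscate(val) if name in sensitive_fields and val is not None else val
        for name, val in row.items()
    }

# Hypothetical row; only the "email" column is treated as sensitive here.
row = {"user_id": 42, "email": "Jane@Example.com", "plan": "gold"}
safe = obfuscate_row(row, {"email"})
```

Because the same input always yields the same token, aggregated models can still group and join on the obfuscated column.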
AWS services PDW uses - VPC
VPC
Logically isolated section, where you can define your virtual network
Every PDW service is in a separate subnet and has separate security groups
US east region, 5 availability zones
Everything is closed by default
Infra is not permanent, script everything, because you will recreate it one day
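The "script everything" advice can start with the network layout itself. The sketch below carves one subnet per service out of a VPC range using only the standard library. The CIDR block and service names are hypothetical, and actually creating the subnets in AWS is left out.

```python
import ipaddress

def plan_subnets(vpc_cidr: str, services: list, new_prefix: int = 24) -> dict:
    """Assign each service its own subnet carved out of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for service in services:
        try:
            plan[service] = str(next(subnets))
        except StopIteration:
            raise ValueError("VPC range too small for the requested services") from None
    return plan

# Hypothetical CIDR and service names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["redshift", "rds", "etl", "spark"])
for service, cidr in plan.items():
    print(f"{service}: {cidr}")
```

Keeping the plan in code means the same layout can be recreated when the infrastructure is rebuilt.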
AWS services PDW uses - S3
Simple Storage Service
Central point for all services
Not a file system, more like endless storage
Encryption - AES-256
Lifecycle policy and storage classes
Distribute object keys in the bucket across key prefixes for better performance (100 TPS)
Multipart upload.
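The prefix tip above can be sketched by prepending a short hash of the filename to the object key. This is an assumption about how the distribution might be done; the file names and the prefix length are hypothetical.

```python
import hashlib

def prefixed_key(filename: str, prefix_len: int = 4) -> str:
    """Prepend a short hash of the filename so keys spread across prefixes."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{filename}"

# Hypothetical backup file names that would otherwise share one prefix.
names = [f"backup-2016-11-03-{i:05d}.sql.gz" for i in range(5)]
keys = [prefixed_key(n) for n in names]
```

The hash is deterministic, so a reader can rebuild the key from the filename alone. This reflects the guidance at the time of the talk; S3 has since raised its per-prefix request rates, so the trick matters less today.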
AWS services PDW uses - Redshift and RDS
Redshift 3 node 6TB cluster
Based on Postgres
MPP - great product for analytical queries
Automatic backup
Scales up to 300 TB with dense compute nodes, up to 2 PB with dense storage nodes
No stored procedures nor functions
Load data from S3
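Loading from S3 into Redshift is done with the COPY command. A minimal sketch of building such a statement for gzip-compressed CSV files follows. The table name, bucket path, and IAM role ARN are hypothetical, and the deck does not show the options PDW actually uses.

```python
def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for gzip-compressed CSV files on S3.

    The arguments are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CSV GZIP;"
    )

# Hypothetical table, bucket and role names, for illustration only.
sql = build_copy_sql(
    "staging.deals_delta",
    "s3://example-pdw-bucket/deltas/deals/",
    "arn:aws:iam::123456789012:role/example-redshift-copy",
)
print(sql)
```

The resulting string would be executed through any PostgreSQL-compatible driver connected to the cluster.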
Relational Database Service
Multi AZ, automatic backup and failover
ETL and Spark backend use Postgres as metastore
Snapshots of the Pipedrive app's backend MySQL databases
AWS services PDW uses - EC2
Elastic Compute Cloud
Script all instances
Pre-built images (we use mainly Amazon Linux because it is managed by Amazon)
Spot instances (instance vs vCPU)
Auto Scaling Group
Access Control
KMS to encrypt and decrypt secrets
Identity and Access Management (IAM)
Security groups act as a virtual firewall at the instance level
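Decrypting a secret with KMS could look roughly like the sketch below. The helper takes the client as a parameter; with boto3 that would be `boto3.client("kms")`, whose `decrypt` call accepts `CiphertextBlob` and returns the bytes under `Plaintext`. The `FakeKms` class is a hypothetical offline stand-in so the example runs without AWS; it does no real cryptography.

```python
import base64

def decrypt_secret(kms_client, ciphertext_b64: str) -> str:
    """Decrypt a base64-encoded KMS ciphertext and return it as text."""
    response = kms_client.decrypt(CiphertextBlob=base64.b64decode(ciphertext_b64))
    return response["Plaintext"].decode("utf-8")

class FakeKms:
    """Offline stand-in that mimics the shape of the KMS decrypt response."""

    def decrypt(self, CiphertextBlob: bytes) -> dict:
        # Reverses the bytes; a placeholder for real decryption.
        return {"Plaintext": CiphertextBlob[::-1]}

# Hypothetical "ciphertext" as it might be stored in a config file.
encoded = base64.b64encode(b"drowssap-bd").decode("ascii")
secret = decrypt_secret(FakeKms(), encoded)
```

Injecting the client keeps the helper testable and leaves credentials and region handling to IAM and the caller.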
AWS Kinesis
Scalable buffer
A shard is the base throughput unit
1MB/sec data input
2MB/sec data output
up to 1000 PUT records/sec
24h retention
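The per-shard limits above give a simple sizing rule: take the largest of the three requirements. A small sketch, with hypothetical workload numbers:

```python
import math

# Per-shard limits as listed on the slide.
IN_MB_PER_SHARD = 1.0
OUT_MB_PER_SHARD = 2.0
RECORDS_PER_SHARD = 1000.0

def shards_needed(in_mb_per_sec: float, out_mb_per_sec: float, records_per_sec: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        math.ceil(in_mb_per_sec / IN_MB_PER_SHARD),
        math.ceil(out_mb_per_sec / OUT_MB_PER_SHARD),
        math.ceil(records_per_sec / RECORDS_PER_SHARD),
        1,  # a stream always has at least one shard
    )

# Hypothetical workload: 3.5 MB/s in, 5 MB/s out, 2500 records/s.
shards = shards_needed(3.5, 5.0, 2500)
```

Here the input rate is the bottleneck, so the example workload needs 4 shards.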
Hadoop and Spark stack
HDFS - new DW concept
NameNode is a micro instance
2 DataNodes provide 10+ TB of storage
Spark - lightning fast processing cluster
ETL cluster in YARN mode
Ad-hoc cluster in standalone mode
APIs
Python, Scala, Java, SQL, GraphX, Streaming, MLlib
Formats
Source data in JSON
Destination data in Parquet
Change deltas in CSV for Redshift import
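The "change deltas in CSV" step can be illustrated in plain Python, without Spark: compare two snapshots keyed by primary key and write only the new or changed rows. This is a simplified sketch with hypothetical column names. It ignores deletes and is not the deck's actual ETL code.

```python
import csv
import io

def change_delta(previous: dict, current: dict) -> list:
    """Return rows that are new or changed since the previous snapshot."""
    return [row for key, row in sorted(current.items()) if previous.get(key) != row]

def to_csv(rows: list, columns: list) -> str:
    """Serialize rows to header-less CSV with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buffer.getvalue()

# Hypothetical snapshots keyed by primary key.
previous = {1: {"id": 1, "status": "open"}, 2: {"id": 2, "status": "open"}}
current = {
    1: {"id": 1, "status": "open"},
    2: {"id": 2, "status": "won"},
    3: {"id": 3, "status": "open"},
}
delta_csv = to_csv(change_delta(previous, current), ["id", "status"])
```

Only rows 2 and 3 end up in the file, which keeps the Redshift import small compared with reloading the whole table.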
Current setup
r3.8xlarge (32 cores, 244 GB memory, 2 x 320 GB SSD, $2.66 per hour on demand, spot saves 80-90%)
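As a quick worked example of the spot saving, assuming the $2.66 quoted above is the hourly on-demand price and that the 80-90% saving applies uniformly:

```python
ON_DEMAND_HOURLY = 2.66  # r3.8xlarge price quoted on the slide, in USD

def spot_hourly(on_demand: float, saving: float) -> float:
    """Return the effective hourly price after a fractional spot saving."""
    return round(on_demand * (1.0 - saving), 3)

low = spot_hourly(ON_DEMAND_HOURLY, 0.90)   # best case: 90% saving
high = spot_hourly(ON_DEMAND_HOURLY, 0.80)  # worst case: 80% saving
print(f"spot price range: ${low:.3f} to ${high:.3f} per hour")
```

Under these assumptions the instance costs roughly $0.27 to $0.53 per hour on the spot market. Real spot prices fluctuate over time.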
Data visualisation tools
Tableau
Desktop licence for development
Tableau Online serves interactive reports to end users
Re:dash
Fast and simple data visualisation
Grafana
Technical dash for infra monitoring
Zeppelin
Part of the Spark stack
25+ interpreters available, currently we use Markdown, Shell, Scala, Python, SQL
Spark Zeppelin
demo
Interpreters
Shell
SQL
Python
Scala
Next steps
Spark
Spark Streaming POC
Kafka POC
Enable real-time dashboard on top of ElasticSearch + Kibana
POC based on graphics-intensive machines (utilize GPU in Spark)
Alluxio - data storage in memory
Q & A

The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Pipedrive DW on AWS: Redshift, S3, Spark and Python ETL

  • 1. Pipedrive DW on AWS; November 3, 2016; Erkki Suurna
  • 2. Talking points: DW role in Pipedrive; AWS services PDW use; DW infrastructure; VPC, S3, Redshift, RDS, EC2, Kinesis, ELB; Security - KMS, IAM, S3 encryption; Hadoop and Spark stack; Self developed ETL in Python (300+ tasks running daily)
  • 3. PDW main goal: Support data driven organisation; SaaS KPI metrics reporting to executive level; Product instrumentation and analysis; Feedback to end user; Business intelligence - data acquisition and transformation into meaningful information for business analysis
  • 4. PDW data platform: Disparate data sources are processed into one unified trusted data level (3 TB); 25+ models and aggregates; 100+ tables; Sensitive data is obfuscated; Encourage use of aggregated models instead of many canned reports; Process 100+k backups daily (330 GB compressed => 3-4 TB of data)
  • 5. AWS services PDW use - VPC: Logically isolated section, where you can define your virtual network; Every PDW service is in a separate subnet and has separate security groups; US east region, 5 availability zones; Everything is closed by default; Infra is not permanent, script everything, because you will recreate it one day
  • 6. AWS services PDW use - S3: Simple Storage Service; Central point to all services; Not a file system, more like endless storage; Encryption - AES-256; Lifecycle policy and storage classes; Distribute filenames in bucket by filename prefix for better performance (100 TPS); Multipart upload
  • 7. AWS services PDW use - Redshift and RDS: Redshift 3 node 6 TB cluster; Based on Postgres; MPP - great product for analytical queries; Automatic backup; Scale in compute dense mode up to 300 TB, in storage dense mode up to 2 PB; No stored procedures or functions; Load data from S3; Relational Database Service; Multi AZ, automatic backup and failover; ETL and Spark backend use Postgres as metastore; Snapshots of Pipedrive app backend MySQL databases
  • 8. AWS services PDW use - EC2: Elastic Compute Cloud; Script all instances; Pre-built images (we use mainly Amazon Linux because it is managed by Amazon); Spot instances (instance vs vCPU); Auto Scaling Group; Access Control; KMS to encrypt and decrypt secrets; Identity and Access Management (IAM); Security group as virtual firewall on instance level
  • 9. AWS Kinesis: Scalable buffer; A shard is the base throughput unit; 1 MB/sec data input; 2 MB/sec data output; up to 1000 PUT records/sec; 24h retention
  • 10. Hadoop and Spark stack: HDFS - new DW concept; Namenode is a micro instance; 2x datanodes provide 10+ TB storage; Spark - lightning fast processing cluster; ETL cluster in YARN mode; Ad-hoc cluster in standalone mode; APIs: Python, Scala, Java, SQL, GraphX, Streaming, MLlib; Formats: source data in JSON, destination data in Parquet, change deltas in CSV for Redshift import; Current setup: r3.8xlarge (32 cores, 244 GB memory, 2x320 GB SSD, $2.66, spot saves 80-90%)
  • 11. Data visualisation tools: Tableau Desktop licence for development; Tableau Online serves interactive reports to end users; Re:dash: fast and simple data visualisation; Grafana: technical dashboard for infra monitoring; Zeppelin: part of the Spark stack, 25+ interpreters available, currently we use Markdown, Shell, Scala, Python, SQL
  • 13. Next steps: Spark; Spark Streaming POC; Kafka POC; Enable real-time dashboard on top of ElasticSearch + Kibana; POC based on graphic intensive machines (utilize GPU in Spark); Alluxio - data storage in memory
  • 14. Q & A
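
The per-shard Kinesis limits on slide 9 (1 MB/sec in, 2 MB/sec out, up to 1000 PUT records/sec) translate into a simple sizing calculation. The following is a minimal sketch of that arithmetic, not code from the talk: the function name `shards_needed` and the workload figures in the usage note are hypothetical, and the limits are the ones quoted on the slide, which may differ from current AWS quotas.

```python
import math

# Per-shard limits as quoted on slide 9 (assumed still applicable).
SHARD_INPUT_MB_PER_SEC = 1.0
SHARD_OUTPUT_MB_PER_SEC = 2.0
SHARD_PUT_RECORDS_PER_SEC = 1000


def shards_needed(input_mb_per_sec, output_mb_per_sec, records_per_sec):
    """Return the smallest shard count that satisfies all three limits.

    Hypothetical helper: each limit yields its own minimum shard count,
    and the stream needs the largest of them (at least one shard).
    """
    by_input = math.ceil(input_mb_per_sec / SHARD_INPUT_MB_PER_SEC)
    by_output = math.ceil(output_mb_per_sec / SHARD_OUTPUT_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_PUT_RECORDS_PER_SEC)
    return max(1, by_input, by_output, by_records)
```

For an illustrative workload of 3.5 MB/sec written, 4 MB/sec read and 2500 records/sec, `shards_needed(3.5, 4, 2500)` is driven by the input limit, since write volume needs more shards than either the read volume or the record rate.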