SlideShare ist ein Scribd-Unternehmen logo
1 von 19
POSTGRESQL
MONITORING IN
ZALANDO
Helsinki
PostgreSQL
Meetup
October 2019
2
POSTGRESQL IN
ZALANDO
3
POSTGRESQL IN ZALANDO
Zalando works with PostgreSQL since approx. 2010 (running in DC)
2015 - migration to AWS (using RDS);
Patroni and Spilo have been started.
2016-2017 - using provided PostgreSQL clusters
(based on Spilo/STUPS);
2018-... - using PostgreSQL operator.
4
When a new postgresql custom resource appears,
the operator creates:
1. StatefulSet for PostgreSQL/Patroni cluster
2. Service for master node (ClusterIP or LB)
3. Service for replica nodes (ClusterIP or LB)
4. Extra DNS names for the services if needed
POSTGRESQL OPERATOR
5
ZMON - ZALANDO
MONITORING SYSTEM
6
Few facts about ZMON
● Created in 2013 during a HackWeek, in production since 2014;
● Optimized for our use-case: autonomous teams, multiple
Kubernetes clusters, common central storage;
● Open-sourced (APL2.0) and available from GitHub;
● Stores time-series data in KairosDB (based on top of Cassandra);
● Stores infrastructure data in PostgreSQL.
7
What’s important for us?
● ZMON uses the “pull” model, its workers fetch the metrics
● ZMON is a distributed monitoring system, the workers are running
in every K8s cluster;
● Autodiscover results in the most up-to-date view of our apps;
● Checks are centralized and may be shared between the teams;
● PostgreSQL Patroni/Spilo clusters are fully supported.
8
INFRASTRUCTURE
MONITORING
9
What are the metrics we need?
● Kubernetes
○ Pods CPU
○ Network I/O
○ Free space in Persistent Volumes
○ Open TCP connections for PostgreSQL processes
● AWS
○ EBS I/O
○ ELB throughput (rarely)
○ Backup S3 bucket
10
Prometheus
11
POSTGRESQL INTERNAL
METRICS
12
How to collect internal metrics
● ZMON worker is running in the same K8s cluster as the
PostgreSQL nodes;
● It supports running SQL queries natively;
● We need only to figure out the proper tables/views to select from.
13
What are the metrics we need?
● Server
○ idle transactions
○ failed login attempts
● Tables
○ size / seq_scans / inserts / updates / deletes
● Indexes
○ size / scans
● Backups
○ WAL archiver status
○ age of last backup
pg_stat_activity
pg_stat_archiver
pg_stat_user_tables
pg_stat_user_indexes
S3 bucket check
14
Idle transactions
15
Index scans
It looks like a problem!
16
But what’s about authentication?
● ZMON worker uses its own credentials to connect to all the databases in the
cluster. The credentials are separate from the application credentials;
● ZMON worker user is unique in every K8s cluster: robot_cluster_name;
● PostgreSQL role robot_zmon is created during the database setup to authorize
the ZMON user to access the database objects;
● ZMON may also be authorized to run queries against the application tables or
views, but the permissions for the ZMON role should be granted explicitly in DB:
GRANT SELECT ON ALL TABLES
IN SCHEMA public TO robot_zmon;
17
Application metrics
18
Key Takeaways
● Get useful metrics from your infrastructure (K8s, AWS, …)
● Collect PostgreSQL metrics from the pg_stat_… views
● Get your application metrics by querying your tables or views directly
(but beware of the performance impact)
● Treat your monitoring tool as another application, not as the superuser
Find out more about our
Culture, People & Jobs:
● ON SOCIAL MEDIA:
Linkedin @Zalando SE
Facebook @ Inside Zalando
Instagram @insidezalando
Twitter @ZalandoTech
● CAREER WEBSITE: jobs.zalando.com
● CORPORATE WEBSITE:
corporate.zalando.com
Feel free to reach me:
Uri Savelchev
<uri.savelchev@zalando.fi>

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan AgrawalApache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Databricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 

Was ist angesagt? (20)

Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
GCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native ArchitecturesGCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native Architectures
 
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan AgrawalApache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
 
Deeplearning
Deeplearning Deeplearning
Deeplearning
 
IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015IEEE International Conference on Data Engineering 2015
IEEE International Conference on Data Engineering 2015
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on Embeddings
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 

Ähnlich wie PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando

Patroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companionPatroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companion
Alexander Kukushkin
 

Ähnlich wie PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando (20)

What's New in PostgreSQL 9.3
What's New in PostgreSQL 9.3What's New in PostgreSQL 9.3
What's New in PostgreSQL 9.3
 
Postgre sql best_practices
Postgre sql best_practicesPostgre sql best_practices
Postgre sql best_practices
 
Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)Gobblin @ NerdWallet (Nov 2015)
Gobblin @ NerdWallet (Nov 2015)
 
Postgre sql best_practices
Postgre sql best_practicesPostgre sql best_practices
Postgre sql best_practices
 
An Introduction to Postgresql
An Introduction to PostgresqlAn Introduction to Postgresql
An Introduction to Postgresql
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
 
PASS Spanish Recomendaciones para entornos de SQL Server productivos
PASS Spanish   Recomendaciones para entornos de SQL Server productivosPASS Spanish   Recomendaciones para entornos de SQL Server productivos
PASS Spanish Recomendaciones para entornos de SQL Server productivos
 
PostgreSQL Sharding and HA: Theory and Practice (PGConf.ASIA 2017)
PostgreSQL Sharding and HA: Theory and Practice (PGConf.ASIA 2017)PostgreSQL Sharding and HA: Theory and Practice (PGConf.ASIA 2017)
PostgreSQL Sharding and HA: Theory and Practice (PGConf.ASIA 2017)
 
What's New in Postgres 9.4
What's New in Postgres 9.4What's New in Postgres 9.4
What's New in Postgres 9.4
 
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKSPostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
 
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKSPostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
 
TechEvent PostgreSQL Best Practices
TechEvent PostgreSQL Best PracticesTechEvent PostgreSQL Best Practices
TechEvent PostgreSQL Best Practices
 
Prashant_Agrawal_CV
Prashant_Agrawal_CVPrashant_Agrawal_CV
Prashant_Agrawal_CV
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
 
Real World Storage in Treasure Data
Real World Storage in Treasure DataReal World Storage in Treasure Data
Real World Storage in Treasure Data
 
Patroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companionPatroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companion
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinPGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on Kubernetes
 
Presto
PrestoPresto
Presto
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando

  • 3. 3 POSTGRESQL IN ZALANDO Zalando works with PostgreSQL since approx. 2010 (running in DC) 2015 - migration to AWS (using RDS); Patroni and Spilo have been started. 2016-2017 - using provided PostgreSQL clusters (based on Spilo/STUPS); 2018-... - using PostgreSQL operator.
  • 4. 4 When a new postgresql custom resource appears, the operator creates: 1. StatefulSet for PostgreSQL/Patroni cluster 2. Service for master node (ClusterIP or LB) 3. Service for replica nodes (ClusterIP or LB) 4. Extra DNS names for the services if needed POSTGRESQL OPERATOR
  • 6. 6 Few facts about ZMON ● Created in 2013 during a HackWeek, in production since 2014; ● Optimized for our use-case: autonomous teams, multiple Kubernetes clusters, common central storage; ● Open-sourced (APL2.0) and available from GitHub; ● Stores time-series data in KairosDB (based on top of Cassandra); ● Stores infrastructure data in PostgreSQL.
  • 7. 7 What’s important for us? ● ZMON uses the “pull” model, its workers fetch the metrics ● ZMON is a distributed monitoring system, the workers are running in every K8s cluster; ● Autodiscover results in the most up-to-date view of our apps; ● Checks are centralized and may be shared between the teams; ● PostgreSQL Patroni/Spilo clusters are fully supported.
  • 9. 9 What are the metrics we need? ● Kubernetes ○ Pods CPU ○ Network I/O ○ Free space in Persistent Volumes ○ Open TCP connections for PostgreSQL processes ● AWS ○ EBS I/O ○ ELB throughput (rarely) ○ Backup S3 bucket
  • 12. 12 How to collect internal metrics ● ZMON worker is running in the same K8s cluster as the PostgreSQL nodes; ● It supports running SQL queries natively; ● We need only to figure out the proper tables/views to select from.
  • 13. 13 What are the metrics we need? ● Server ○ idle transactions ○ failed login attempts ● Tables ○ size / seq_scans / inserts / updates / deletes ● Indexes ○ size / scans ● Backups ○ WAL archiver status ○ age of last backup pg_stat_activity pg_stat_archiver pg_stat_user_tables pg_stat_user_indexes S3 bucket check
  • 15. 15 Index scans It looks like a problem!
  • 16. 16 But what’s about authentication? ● ZMON worker uses its own credentials to connect to all the databases in the cluster. The credentials are separate from the application credentials; ● ZMON worker user is unique in every K8s cluster: robot_cluster_name; ● PostgreSQL role robot_zmon is created during the database setup to authorize the ZMON user to access the database objects; ● ZMON may also be authorized to run queries against the application tables or views, but the permissions for the ZMON role should be granted explicitly in DB: GRANT SELECT ON ALL TABLES IN SCHEMA public TO robot_zmon;
  • 18. 18 Key Takeaways ● Get useful metrics from your infrastructure (K8s, AWS, …) ● Collect PostgreSQL metrics from the pg_stat_… views ● Get your application metrics by querying your tables or views directly (but beware of the performance impact) ● Treat your monitoring tool as another application, not as the superuser
  • 19. Find out more about our Culture, People & Jobs: ● ON SOCIAL MEDIA: Linkedin @Zalando SE Facebook @ Inside Zalando Instagram @insidezalando Twitter @ZalandoTech ● CAREER WEBSITE: jobs.zalando.com ● CORPORATE WEBSITE: corporate.zalando.com Feel free to reach me: Uri Savelchev <uri.savelchev@zalando.fi>

Hinweis der Redaktion

  1. AWS S3 bucket is used for WAL backups managed by Patroni
  2. - Previously use CloudWatch, but there are problems: inavailability, cost. Still in-use for LBs; - Now relies upon Prometheus - both for metrics collection and temporary in-cluster storage; - Metrics are coming from Linux kernel and Kubernetes (pods resource requests/limits); - Finally Prometheus is queried by ZMON agent; - Get CPU, memory in use, Volume free space, IOPS, Network OPS, etc.