4. Metrics
SLA & Performance Metrics
e.g.1: API
• (99%)
e.g.2: Call Center
• Abandonment Rate: Percentage of calls abandoned while waiting to be answered.
• ASA (Average Speed to Answer): Average time it takes for a call to be answered
by the service desk.
• TSF (Time Service Factor): Percentage of calls answered within a defined
timeframe, e.g., 80% of calls within 20 seconds.
• FCR (First-Call Resolution): Percentage of incoming calls resolved on the first
call, without a callback and without the caller having to call the helpdesk back
to finish resolving the case.
• TAT (Turn-Around Time): Time taken to complete a certain task.
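These definitions translate directly into code. Below is a minimal sketch that computes abandonment rate, ASA, and TSF from a list of call records; the record layout (`wait`, `answered`) is a hypothetical stand-in, not a real call-center schema:

```python
def call_center_metrics(calls, tsf_threshold=20.0):
    """Compute (abandonment rate, ASA, TSF) from call records.

    Each call is a dict with (hypothetical field names):
      - 'wait': seconds the caller waited before answer or hang-up
      - 'answered': True if answered, False if abandoned
    """
    total = len(calls)
    answered = [c for c in calls if c["answered"]]
    abandoned = total - len(answered)

    abandonment_rate = abandoned / total if total else 0.0
    # ASA: average wait of answered calls only
    asa = sum(c["wait"] for c in answered) / len(answered) if answered else 0.0
    # TSF: share of answered calls picked up within the threshold (e.g. 20 s)
    within = sum(1 for c in answered if c["wait"] <= tsf_threshold)
    tsf = within / len(answered) if answered else 0.0
    return abandonment_rate, asa, tsf

calls = [
    {"wait": 5, "answered": True},
    {"wait": 30, "answered": True},
    {"wait": 12, "answered": False},  # abandoned while waiting
    {"wait": 15, "answered": True},
]
rate, asa, tsf = call_center_metrics(calls)
# rate = 0.25; asa = 50/3 s; tsf = 2/3 (2 of 3 answered within 20 s)
```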
5. Benchmarking
The quality of a service must be measured, evaluated,
… benchmarked.
And we must have a set of approaches for benchmarking.
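One basic approach, which the later DaStor slides also use, is to time every call and aggregate throughput and latency over the run. A minimal sketch of such a harness; the in-memory write is just a stand-in for the operation under test:

```python
import time

def benchmark(op, n_ops):
    """Run op() n_ops times; return (throughput ops/s, avg latency s, max latency s)."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_ops):
        t0 = time.perf_counter()
        op()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return n_ops / elapsed, sum(latencies) / n_ops, max(latencies)

# Example: benchmark a trivial in-memory write (stand-in for a real store)
store = {}
tput, avg_lat, max_lat = benchmark(lambda: store.__setitem__("k", "v"), 10_000)
```

A real benchmark would run many such loops concurrently and merge the per-client numbers, as the DaStor write/read benchmarks below do.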
11. A Summary of these Concepts
[Diagram: clients Client-1 … Client-N issue requests to a server whose pool of work threads processes them, illustrating throughput, latency, and concurrency.]
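The three quantities in the diagram are tied together by Little's law: average concurrency = throughput × latency. A quick check of this relationship:

```python
def concurrency(throughput_ops_per_s, latency_s):
    """Little's law: average number of requests in flight in the system."""
    return throughput_ops_per_s * latency_s

# e.g. 80K ops/s at 0.5 ms average latency implies ~40 requests in flight
in_flight = concurrency(80_000, 0.0005)
```

This is a useful sanity check on benchmark numbers: if measured concurrency, throughput, and latency do not roughly satisfy the law, something in the measurement is off.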
18. Benchmark for Write API
[Charts: cluster overview, write throughput, and write latency.]
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-numbers,
and puts them into DaStor one by one.
– Key space: 50 million
– Size of a CDR: Thrift-compacted encoding, ~200 bytes
✓ Throughput: average ~80K ops/s; per node: average ~9K ops/s
✓ Latency: average ~0.5 ms
• Bottleneck: network (and memory)
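The write workload above can be sketched as a simple multi-threaded client loop. `dastor_put` and the payload generation are hypothetical stand-ins for DaStor's actual write API, and the op count is shrunk for the sketch:

```python
import random
import threading

KEY_SPACE = 50_000_000   # 50 million users/phone-numbers
CDR_SIZE = 200           # ~200 bytes per CDR (Thrift-compacted in the real benchmark)

def dastor_put(key, value):
    """Hypothetical stand-in for the real DaStor write API."""
    pass

def write_client(n_ops, results, idx):
    """One client thread: write n_ops random CDRs, one by one."""
    done = 0
    for _ in range(n_ops):
        user = random.randrange(KEY_SPACE)   # random key in the 50M space
        cdr = random.randbytes(CDR_SIZE)     # random CDR payload
        dastor_put(user, cdr)
        done += 1
    results[idx] = done                      # per-thread count, merged below

# 6 client threads per node, as in the benchmark
N_CLIENTS, N_OPS = 6, 1000
results = [0] * N_CLIENTS
threads = [threading.Thread(target=write_client, args=(N_OPS, results, i))
           for i in range(N_CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total_ops = sum(results)
```

Each thread keeps its own counter and the totals are summed after `join()`, avoiding a shared counter that would need a lock.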
19. Benchmark for Read API
• Each node runs 8 clients (threads), 72 clients in total.
• Each client picks a random user-id/phone-number out of the 50-million
space and gets its most recent 20 CDRs (one page) from DaStor.
• All clients read CDRs of the same day/bucket.
[Histogram: distribution of read latency; percentage of read ops per 100 ms bucket.]
✓ Throughput: average ~140 ops/s; per node: average ~16 ops/s
✓ Latency: average ~500 ms; 97% < 2 s (SLA)
• Bottleneck: disk I/O (random seeks); CPU load is very low
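The "97% < 2 s" SLA figure is a latency quantile over the recorded samples. A minimal sketch of computing the average and a nearest-rank quantile from a latency series (the sample values are made up for illustration):

```python
import math

def latency_stats(samples_ms, q=0.97):
    """Return (average, q-quantile) of latency samples.

    Uses the nearest-rank method: the smallest value such that at least
    a fraction q of the samples are <= it.
    """
    s = sorted(samples_ms)
    avg = sum(s) / len(s)
    rank = max(1, math.ceil(q * len(s)))
    return avg, s[rank - 1]

# Hypothetical latency samples in milliseconds; one slow outlier
samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 5000]
avg, p97 = latency_stats(samples)
# avg = 950.0; with 10 samples, nearest rank for q=0.97 is 10, so p97 = 5000
```

Note how a single slow outlier barely moves the average but dominates the high quantile, which is why SLAs are usually stated as quantiles rather than averages.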
22. Measuring throughput and latency
• On the server side:
• Maintain an operation count and the time
cost for every client call.
• Every monitoring interval, pull or push the
current throughput and latency to the
monitoring tool (Ganglia/Zabbix) or a console.
• Throughput = sum of counts / time interval
• Latency = average (sum of latencies / sum of counts),
max, min, quantiles, …
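The server-side scheme above (a per-call count plus time cost, flushed every monitoring interval) can be sketched as follows; the push to Ganglia/Zabbix is replaced by simply returning the snapshot values:

```python
import threading

class CallMetrics:
    """Accumulate operation count and total latency; snapshot per interval."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._latency_sum = 0.0

    def record(self, latency_s):
        # Called once per client call with its measured time cost.
        with self._lock:
            self._count += 1
            self._latency_sum += latency_s

    def snapshot(self, interval_s):
        # Called every monitoring interval; resets the counters so the
        # next interval starts fresh.
        with self._lock:
            count, lat = self._count, self._latency_sum
            self._count, self._latency_sum = 0, 0.0
        throughput = count / interval_s                # ops per second
        avg_latency = lat / count if count else 0.0    # seconds per op
        return throughput, avg_latency

m = CallMetrics()
for _ in range(300):
    m.record(0.002)                 # 300 calls of 2 ms each
tput, avg = m.snapshot(interval_s=10.0)
# tput = 30.0 ops/s over the 10 s interval; avg latency ~0.002 s
```

Max, min, and quantiles would additionally require keeping the individual samples (or a sketch such as a histogram) rather than just the running sum.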
Code in GitLab and Gerrit