1. Things you should know about Scalability!
WJAX 2011, 08.11.2011, Munich
Robert Mederer
Copyright © 2011 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
2. Abstract
Things you should know about Scalability!
Delivering architecture@internet-scale poses several challenges that
must be solved before an architecture is ready for extreme scale. This
session is about the art of scale, scalability, and scaling of web
architectures. It gives an overview of challenges, good practices and
solutions for achieving high scalability in web-based systems.
3. Who am I?
Robert Mederer
Lead Architecture & Execution
Anni-Albers-Straße 11
80807 München
Mobile: +49-175-57-68012
robert.mederer@accenture.com
Experience
• 2000 - 2005: Technology Architect and Software Engineer in several projects
• 2006: Technical Architecture Lead, Integration and Execution Architecture for Location-Based Service Provider
• 2009: Technical Architecture Lead, Frontend and Execution Architecture for a Government Agency
• 2009/2010: Technical Architecture and front-office integration build lead, Integration and Execution Architecture Financial Services Agency
• 2011: Architect and QA for Location Based Services Platform
4. Accenture
High performance achieved
Company Profile
• Global management consulting, technology services and outsourcing company
• 236,000 employees
• Rank 47 among the “Best Global Brands 2008”
• Top 100 Employer
• 28 of the DAX-30-Companies
• 96 of the Fortune-Global-100
• More than three-quarters of the Fortune-Global-500
• 87 of our Top 100-clients have been with us for 10 or more years
Worldwide Revenues: $25.5 billion (in US$ billion, as of August 31, 2011)
[Chart: revenue split by Communications & High Tech, Resources, Public Service, Financial Services, Products]
5. Local Accenture … ???
Geographic unit
• Austria
• Switzerland
• Germany
Employees
• >6000
• We are hiring!
Exciting Technology work
• Large scale projects (100+ people / multiple years)
• Most challenging requirements
– Stock Exchange / Banking / Trading Systems
– AEMS Mobility Platform
– Large Scale Web Applications (> 1M page views / day)
– Batch Architectures
[Map: Berlin, Düsseldorf, Frankfurt, Erlangen/Nürnberg, Munich, Vienna, Zurich]
6. Agenda
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
9. Introduction | Question
Audience?
Who are you?
– Developers,
– Architects,
– IT Manager
How large is your application (QPS)?
– 10-100?
– 100-1000?
– 1000-10000?
– 10000+?
How large is your total database?
– < 10 GB?
– 10 GB-100 GB?
– 100 GB-1 TB?
– 1 TB-10 TB?
– 10 TB+?
10. Introduction | What is Performance?
How do I know if I have a performance problem?
If your system is
slow for a single user
11. Introduction | What is Scalability?
How do I know if I have a scalability problem?
If your system is
fast for an individual user
but slow under high load
12. Introduction | What is Performance?
Non-Functional Testing
Performance Testing of Web based systems
Definition
• Performance testing is defined as the technical investigation to determine
or validate the speed, scalability, and/or stability characteristics of the web
based system under test.
• Performance-related activities, such as testing and tuning, are concerned
with achieving response times, throughput, and resource-utilization levels
that meet the performance objectives for the application (SLA) under test.
Key Types of Performance Testing:
• Performance Testing: "Will it be fast enough?"
• Load Testing: "Will it support all of my clients?"
• Stress Testing: "What happens if something goes wrong?"
• Capacity Testing: "What do I need to plan for when I get more customers?"
Source: Thomas Werft, Performance Engineer at Accenture
13. Introduction | What is Performance?
Non-Functional Testing
Performance Testing of Web based systems
Key performance indicators (criteria, KPI, description):
Response Time (first / last byte in ms)
• Average: An average is a value found by adding all of the numbers in a set together and then dividing them by the quantity of numbers in the set.
• Percentile (Target 98%): A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.
• Median: A median is simply the middle value in a data set when sequenced from lowest to highest.
Throughput (QPS)
• Requests per Second; Transactions per Second: Throughput is the number of units of work that can be handled per unit of time; for instance, requests per second, calls per day, hits per second, reports per year, etc.
Resource Utilization
• Processor; Memory; Disk I/O; Network I/O: Resource utilization is the cost of the project in terms of system resources. Utilization is the percentage of time that a resource is busy servicing user requests. The remaining percentage of time is considered idle time.
Results are used for Performance Engineering, Performance Tuning
Source: Thomas Werft, Performance Engineer at Accenture
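The response-time and throughput KPIs above can be computed from raw measurements in a few lines. The following is a minimal sketch, not the tooling used in the talk: the sample values are hypothetical, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def response_time_kpis(samples_ms, target_pct=98):
    """Average, median and target percentile of response times in ms."""
    return {
        "average": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "percentile": percentile(samples_ms, target_pct),
    }


def throughput_qps(request_count, duration_s):
    """Requests handled per second over a measurement window."""
    return request_count / duration_s


# Hypothetical response times in ms; one slow outlier skews the average.
samples = [40, 42, 45, 47, 50, 52, 55, 60, 65, 900]
kpis = response_time_kpis(samples)
print(kpis)
print(throughput_qps(request_count=len(samples), duration_s=2.0))
```

With these hypothetical samples the median stays near the typical request while the average is pulled up by the single outlier, which is why a percentile target is usually a better SLA measure than the average alone.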
14. Introduction | What is Scalability?
Scalability
Definition
A system’s capacity to uphold the
same performance under heavier
volumes.
Source: Patterns for Performance and Operability: Building and Testing Enterprise Software, Chris Ford et. al., 2008
15. Introduction | What is Scalability?
Vertical Scalability
Is achieved by increasing the capacity of a single node
• CPU,
• Memory,
• Bandwidth, …
Simple Process
• Application is generally not affected by
those changes
Classical examples are supercomputers like
• HP Integrity Superdome
• IBM Mainframe
Source: Hewlett-Packard
16. Introduction | What is Scalability?
Horizontal Scalability
• Application is spread on a cluster with several nodes
• Nodes can be added to scale out
Produces overhead
- Keep cluster consistent
- Node error detection and
handling
- Communication between nodes
• May be used to increase
reliability and availability
• Distributed Systems and Programs like
– SETI@Home
– World Wide Web
– Domain Name Service
Source: Space Sciences Laboratory, U.C. Berkeley
17. Introduction | Scalability Trade-Offs | Availability vs. Consistency
CAP Theorem (Brewer‘s Theorem)
• Consistency – all clients see the same data at the same time
• Availability – all clients can find all data even in the presence of failure
• Partition Tolerance – the system keeps working even when a node has failed or nodes cannot reach each other
[Diagram: Consistency, Availability and Partition Tolerance; having all three at once is marked "Impossible"]
Source: PODC-keynote, Towards Robust Distributed Systems, Dr. Eric A. Brewer, 2000
18. Introduction | Scalability Trade-Offs | Availability vs. Consistency
CAP Theorem
Normally, only two of these properties can be achieved for any shared-data
system
Consistency + Availability
• High data integrity
• Single site, cluster database, LDAP, etc.
• 2-phase commit, data replication, etc.
Consistency + Partition Tolerance
• Distributed database, distributed locking, etc.
• Pessimistic locking, etc.
Availability + Partition Tolerance
• High scalability
• Distributed cache, DNS, etc.
• Optimistic locking, expiration/leases (timeout), etc.
Source: “Architecting Cloudy Applications”, David Chou
19. Introduction | Scalability Trade-Offs | Availability vs. Consistency
Data and Scalability
Distributed non-relational data store solutions must relax guarantees around consistency, partition tolerance and availability, resulting in systems optimized for different combinations of properties.
Consistent & Available
• RDBMSs (MySQL, Postgres, etc.)
• Greenplum
• Vertica
Available & Partition Tolerant
• Cassandra
• SimpleDB
• CouchDB
• Riak
• Dynamo
• Voldemort
• Tokyo Cabinet
• KAI
Consistent & Partition Tolerant
• BigTable
• HyperTable
• HBase
• MongoDB
• Terrastore
• Scalaris
• BerkeleyDB
• MemcacheDB
• Redis
Data Models Key: Relational (comparison), Key-Value, Column-Oriented, Document-Oriented
Source: Visual Guide to NoSQL Systems, http://blog.nahurst.com/tag/cap
20. Introduction | Scalability Trade-Offs | Availability vs. Consistency
Data and Scalability
Analysis and Classification
21. Introduction | Scalability Trade-Offs | Availability vs. Consistency
Data and Scalability
ACID - Do I really need it?
Relational databases were originally designed for transactional data processing
– reliably processing and maintaining data integrity – on different HW architectures.
In order to guarantee transactional integrity, the traditional relational database
management system (RDBMS) was architected to guarantee four core properties:
Atomicity, Consistency, Isolation and Durability (ACID).
Atomicity
A database is said to be atomic if, when one part of the transaction fails, the entire transaction fails and the database state is left unchanged.
Consistency
A database is said to be consistent if it remains in a consistent state after any transaction. Therefore, if a transaction violates the consistency of the database (e.g. the value is not the right type) then the transaction should be rolled back.
Isolation
A database is said to be isolated if transactions can’t have access to data currently being modified by another transaction.
Durability
A database is said to be durable if it recovers all of the committed transactions in the system even after system failure.
22. Introduction | Scalability Trade-Offs | Availability vs. Consistency
BASE
Modern Internet systems: focused on BASE
• Basically Available
• Soft-state (or scalable)
• Eventually consistent
Example: Amazon outage in April 2011 brought thousands
of customers down, including Pfizer, Netflix, Quora,
Foursquare, Reddit, …
• The Amazon.com 2010 Shareholder Letter Focusses on Technology
• http://www.allthingsdistributed.com/2011/04/the_amazoncom_2010_shareholder.html
• http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
• http://www.nytimes.com/2011/04/23/technology/23cloud.html
• http://www.allthingsdistributed.com/2007/12/eventually_consistent.html Dec. 2007
23. Introduction | Scalability Trade-Offs | Availability vs. Consistency
ACID vs. BASE
ACID
• Strong consistency for transactions highest priority
• Availability less important
• Pessimistic
• Complex mechanisms
BASE
• Availability and scaling highest priorities
• Weak consistency
• Optimistic
• Simple and fast
24. Introduction | Scalability Trade-Offs - Latency vs. Throughput
Network Latency vs. Throughput
Network protocols such as TCP have an inherent throughput bottleneck that
becomes more severe with increased packet loss and latency
Source: http://www.asperasoft.com/en/technology/shortcomings_of_TCP_2/the_shortcomings_of_TCP_file_transfer_2
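To get a feeling for how latency and packet loss limit throughput, the sketch below uses the well-known Mathis approximation for a single TCP connection: throughput ≤ (MSS / RTT) · (C / √p). This formula is not taken from the slide or its source; it is an assumption chosen for illustration, with C ≈ 1.22 and hypothetical input values.

```python
import math


def tcp_throughput_upper_bound_bps(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Approximate upper bound in bit/s for one TCP connection (Mathis approximation).

    mss_bytes: maximum segment size in bytes
    rtt_s:     round-trip time in seconds
    loss_rate: packet loss probability, 0 < loss_rate <= 1
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_s must be > 0 and loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Hypothetical link: 1460-byte segments, 50 ms round trip, 1 % packet loss.
base = tcp_throughput_upper_bound_bps(1460, 0.05, 0.01)
print(f"{base / 1e6:.2f} Mbit/s")
```

Under this approximation, doubling the round-trip time halves the bound, and four times the packet loss halves it as well, regardless of how much bandwidth the link itself offers.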
25. Introduction | Scalability and Edge Computing
Edge Computing
Transferring data or services from a centralized point to the
edge of the network
• Processing load is distributed
• Closer to the user
• Decreases latency
• Lower cost of hardware
• Increases service levels
• Greater flexibility in responding to
service requests
• Seasonal spikes in demand can be
off-loaded to other edge servers
26. Introduction | Caching
Caching and Types of Caches
Object cache
• Store objects for the application to be reused
• Cache data from database or generated by application
• E.g. ehCache, memcached, etc.
Application Cache
• Speed up performance or minimize resources used
• Proxy caching / Reverse proxy caching
• E.g. Squid, Varnish, etc
Content Delivery Network (CDN)
• Faster response time and fewer requests on the origin servers
• Push content closer to end user
• E.g. Akamai, Savvis, Mirror Image Internet, Netscaler, Amazon
CloudFront, etc
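The object cache idea from this slide can be sketched as a small in-process cache with expiration. This is an illustrative sketch only and does not reflect the API of ehCache or memcached; the class name `TtlCache`, its methods and the loader function are hypothetical.

```python
import time


class TtlCache:
    """Tiny in-process object cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit, backend is not touched
        value = loader(key)          # cache miss: e.g. a database query
        self._entries[key] = (now + self._ttl, value)
        return value


def load_user_from_database(user_id):
    # Hypothetical stand-in for an expensive database call.
    return {"id": user_id, "name": f"user-{user_id}"}


cache = TtlCache(ttl_seconds=30)
print(cache.get_or_load(42, load_user_from_database))  # loads from the backend
print(cache.get_or_load(42, load_user_from_database))  # served from the cache
```

The expiration time is the trade-off knob: a longer TTL takes more load off the database but serves stale data for longer.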
27. Introduction | Caching
CDN
Abstract architecture of a Content Delivery Network (CDN)
Source: Content Delivery Network (CDN) Research Directory, http://ww2.cs.mu.oz.au/~apathan/CDNs.html
28. Introduction | Caching
CDN
Basic interaction flows in a CDN environment
Source: Basic interaction flows in a CDN environment, http://ww2.cs.mu.oz.au/~apathan/CDNs.html
29. Introduction
Basics
Load Balancing
Definition:
• Methodology to distribute workload across multiple computers
or a computer cluster, network links, central processing units,
disk drives, or other resources, to achieve optimal resource
utilization, maximize throughput, minimize response time, and
avoid overload
• Using multiple components with load balancing, instead of a
single component, may increase reliability through
redundancy. The load balancing service is usually provided by
dedicated software or hardware, such as a multilayer switch or
a Domain Name System server.
30. Introduction
Load Balancing (Major) Usage
Server LB
• Distributing the load across multiple servers
• Target is to scale beyond the capacity of one server, and to tolerate a server failure.
Global Server LB
• Directing users to different data center sites consisting of server farms
• Target is to provide users with fast response time and to tolerate a complete data center failure (availability, business continuity, disaster recovery, geographic routing)
Firewall LB
• Distribute the load across multiple firewalls
• Target is to scale beyond the capacity of one firewall, and tolerate a firewall failure.
Transparent Cache Switching
• Transparently directs traffic to caches to accelerate the response time for clients
• Or improve the performance of web servers by offloading the static content to caches.
Source: Load Balancing Servers, Firewalls, and Caches by Chandra Kopparapu; John Wiley & Sons © 2002
31. Introduction
Basics
Load Balancing Algorithms
Random Allocation
• Pros: Simple to implement.
• Cons: Can overload one server while others remain under-utilized.
Round-Robin Allocation
• Pros: Better than random allocation because the requests are equally
divided among the available servers in an orderly fashion.
• Cons: Ignores the processing overhead of individual requests and falls
short when the servers in the group do not have identical specifications.
Weighted Round-Robin Allocation
• Pros: Takes care of the capacity of the servers in the group.
• Cons: Does not consider the advanced load balancing requirements such
as processing times for each individual request.
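The three allocation strategies can be sketched in a few lines. This is a simplified illustration, not how a production load balancer is implemented; the server names and weights are hypothetical, and the weighted variant simply repeats each server according to its weight.

```python
import itertools
import random


def random_allocation(servers, rng=random):
    """Pick any server with equal probability; may pile requests on one server by chance."""
    return rng.choice(servers)


def round_robin(servers):
    """Endless iterator that hands out servers in a fixed order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Endless iterator in which a server with weight w appears w times per cycle.

    weights: dict mapping server name to a positive integer weight (capacity).
    """
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


servers = ["app-1", "app-2", "app-3"]          # hypothetical server names
print(random_allocation(servers))

rr = round_robin(servers)
print([next(rr) for _ in range(6)])

wrr = weighted_round_robin({"big": 2, "small": 1})
print([next(wrr) for _ in range(6)])
```

None of these strategies looks at the current load or the cost of an individual request, which is exactly the limitation named in the cons above.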
32. Introduction
Basics
Server Load Balancing
• Hardware
– Barracuda Networks
– Cisco Systems
– Citrix Systems
– F5 Networks (BigIp)
– Etc.
• Software
– HAProxy
– Apache HTTP Server with mod_proxy for Tomcat
– …
Simple Load Balancing over DNS (List of IPs with round robin)
Does that work?
Problem:
• No real load balancing due to TTL of DNS
• No health check for service availability
33. Introduction | Load Balancing
Global Server Load Balancing
• Functionality
– DNS based routing
– Based on IP GEO database (Geographic routing)
– Assumption: Local DNS for client
• Provider
– F5 Networks (Global Load Balancing Solutions)
– UltraDNS (Traffic Controller Service)
– Level3 (Traffic Manager, BCDR Solution)
34. Introduction | Load Balancing
Global Server Load Balancing
Characteristics / Usage
• Increase application availability in event of entire site failure or overload
(Business Continuity, Disaster Recovery)
• Scale application performance by load balancing traffic across multiple
sites (Edge Computing (together with CDN))
• Need for more granularity and control in directing Web traffic
• More flexibility in building and managing Internet infrastructures
– E.g. Site based downtime management during release upgrade
• Cons: Does not always work, because it assumes a DNS resolver local to the
client (with public DNS services or DNS over VPN, the lookup can fail to return
the nearest server location)
– (see: http://www.royans.net/arch/fixing-gslb-global-server-load-balancing/)
• Fix: Google proposed a DNS extension so that routing can be based on the
client / end-user IP rather than the DNS resolver IP
(see: http://googlecode.blogspot.com/2010/01/proposal-to-extend-dns-protocol.html )
35. Agenda
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
36. Case Study – Internet Scale Web Services
Case Study – Non-Functional Requirements
World map with figures per region:
• USA: 50 million
• EU: 30 million
• ASIA: 15 million
• AU: 2 million
User groups:
• Web Browser users
• Mobile users
Availability: 99.99 %
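An availability target translates directly into a downtime budget. The short calculation below is a sketch that assumes a 365-day year and counts every minute of unavailability against the budget; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Minutes of downtime permitted within the period for a given availability target."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target} % -> {minutes:.2f} minutes per year")
```

For the 99.99 % target of the case study this leaves roughly 52.6 minutes of downtime per year, which is why a complete data center failure has to be covered by failover rather than repair.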
37. Case Study – Internet Scale Web Services
Case Study – Non-Functional Requirements
USA: 2 data centers
• New York
• San Francisco
Peak: 20,000 QPS
EU: 2 data centers
• Frankfurt
• London
Peak: 10,000 QPS
ASIA: 1 data center
• Singapore
Peak: 5,000 QPS
AU: 1 data center
• Sydney
Peak: 3,000 QPS
38. Case Study – Internet Scale Web Services
Case Study – Non-Functional Requirements
Performance in Case of Failure
USA: Failover New York ↔ San Francisco: 40,000 QPS
EU: Failover Frankfurt ↔ London: 20,000 QPS
AU / ASIA: Failover Singapore ↔ Sydney: 8,000 QPS
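The failover figures can be read as the combined peak of both sites in a failover pair, since the surviving site has to carry the load of its partner. The sketch below assumes that the peak QPS values stated for the regions apply per data center; the slides do not say this explicitly, but the sums then match the failover numbers.

```python
# Peak QPS per data center (assumption: the regional peak applies to each site).
FAILOVER_PAIRS = {
    "USA": {"New York": 20_000, "San Francisco": 20_000},
    "EU": {"Frankfurt": 10_000, "London": 10_000},
    "AU / ASIA": {"Singapore": 5_000, "Sydney": 3_000},
}


def required_failover_qps(pair):
    """QPS a single surviving site must handle if its partner fails at peak."""
    return sum(pair.values())


for region, pair in FAILOVER_PAIRS.items():
    print(region, required_failover_qps(pair))
```

In other words, each site of a pair has to be sized for the sum of both peaks, not only for its own traffic.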
39. Case Study – Internet Scale Web Services
Case Study – Non-Functional Requirements
Response Times
RESTful Web Services:
• Calculate service: 100 ms (50 ms latency)
• Binary service: 60 ms (50 ms latency)
• Search service: 50 ms (50 ms latency)
40. Case Study – Internet Scale Web Services
Case Study – Non-Functional Requirements
Data
100 TByte in each geography
- Binary (video, image, …)
- Index data
41. Agenda
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
42. Case Study – Solution
43. Case Study – Solution
44. Agenda
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
45. Further Topics
• Organization
– People, Process and Tools
– Governance (Lifecycle management)
• Where do I find the truth in a highly scaled and
distributed architecture?
– Logging
• Log Analytics (e.g. Scribe (not really), Splunk)
– End-to-end data visualization
46. Agenda
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
47. Conclusion
Content Caching
Reverse proxy Caching
• Fast and Scales well
• Dealing with invalidation is tricky
• Direct cache invalidation scales badly
• Instead, change URLs of modified resources
• Old ones will drop out of cache naturally
CDN – Content Delivery Network
• Faster response time and fewer requests on the origin servers
• No 100% control of caching. Based on internal statistics (Akamai).
• Operated by 3rd parties. Already in place. Not free.
• Once something is cached on CDN, assume that it never changes
• Sometimes does load balancing as well
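The advice to change the URL of a modified resource instead of invalidating caches can be sketched by embedding a content hash in the file name. This is one possible scheme, not necessarily the one used in the case study; the function name, the example path and the digest length are assumptions.

```python
import hashlib
import posixpath


def versioned_url(path, content, digest_chars=8):
    """Return the path with a short content hash inserted before the extension.

    The URL changes whenever the content changes, so reverse proxies and CDNs
    fetch the new version while the old one simply ages out of the cache.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    root, ext = posixpath.splitext(path)
    return f"{root}.{digest}{ext}"


old_css = b"body { color: black; }"
new_css = b"body { color: navy; }"
print(versioned_url("/static/app.css", old_css))
print(versioned_url("/static/app.css", new_css))
```

Pages that reference the resource must be rendered with the new URL, so the HTML itself should be cached only briefly or not at all.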
48. Conclusion
Common Concepts of Scalable Architecture
[Diagram: the following concepts arranged around "7 Habits of Good Distributed Systems"]
• parallelization
• asynchronous
• idempotent operations
• partitioned data
• fault-tolerance
• optimistic concurrency
• shared nothing
• loosely coupled
Source: "Architecting Cloudy Applications", David Chou
Source: highscalability.com
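Of the concepts listed, optimistic concurrency is easy to show in code: instead of locking a record, a writer states which version it read and the write is rejected if the record has changed in the meantime. The following is a single-process sketch with hypothetical class and function names; a real data store would perform the version check atomically on the server side.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """In-memory key-value store that tracks a version number per key."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys have version 0 and value None."""
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Store value only if the key is still at expected_version; return the new version."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, transform, max_attempts=3):
    """Read, transform and write back; re-read and retry if another writer got in first."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, version, transform(value))
        except VersionConflict:
            continue
    raise VersionConflict(f"{key}: gave up after {max_attempts} attempts")


store = VersionedStore()
store.write("counter", 0, 1)
update_with_retry(store, "counter", lambda v: v + 1)
print(store.read("counter"))
```

No lock is held between read and write, so readers and writers never block each other; the price is that a conflicting writer has to retry.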
49. Conclusion
Questionnaire
• Is there a need to scale my application?
– Vertical scaling is easier to achieve (Cost)
– Use horizontal scaling only when required (Complexity)
• Is there a plan to prove your designed solution?
– Plan to do a lot of realistic proofs of concept
• Is there a one-size-fits-all solution?
– NO!
• How important is ACID?
– Is BASE enough?
– Can a NoSQL solution be used?
50. References
The Art of Scalability: Scalable Web Architecture, Processes and Organizations for the Modern
Enterprise; Michael T. Fisher, Martin L. Abbott; Addison-Wesley Professional; 1 edition
Scalability Rules: 50 Principles for Scaling Web Sites; Martin L. Abbott, Michael T. Fisher Addison-
Wesley Professional; 1 edition (May 15, 2011)
Scalable Internet Architectures; Theo Schlossnagle; Sams; 1 edition (July 31, 2006)
Building Scalable Web Sites; Cal Henderson; O'Reilly
Websites: HighScalability.com, infoQ.com, Qcon.com, …
51. Thank You!
Contribution and Review:
Bukowski, Markus; Conradt, Steffen; Jacobs, Mareike; Krogemann,
Markus; Peuker, Jan; Van Isacker, Pieter; Wagenknecht, Dominik; Wagner,
Hubert; Werft, Thomas; Zakotnik, Jure