This is a short slide deck from a talk given by Eric Anderson of CopperEgg to the Austin Cloud Users Group on August 23rd, 2011. The talk covers CopperEgg's migration across Amazon's services, over to Rackspace Cloud, and then back to Amazon. The importance of super real-time monitoring becomes clear by the end.
Austin Cloud Users Group - August 23rd, 2011
1. copperegg
Austin CUG - August 23rd, 2011
(presented by Eric Anderson)
anderson@copperegg.com
Wednesday, August 24, 11
2. About Us
CopperEgg
• Founded spring 2010
• Super real-time monitoring and analytics
About me (Eric Anderson)
• SysAdmin - Centaur - 1999-2007
• 1400 compute nodes, ~50-100 file servers, ~200 misc systems, hundreds of TBs
• Software Engineer - StorSpeed - 2007-2010
• built distributed file system cache for NAS acceleration product
• Co-Founder/COO - CopperEgg - 2010-Present
3. Why Cloud?
Important Differences:
• Installs in seconds - copy/paste systems
• No configuration required - anyone can do it
4. Why Cloud?
Important Differences:
All reliable and business-worthy systems need something like this:
Physical:
• Physical security
• Redundant power
• Redundant AC
• Redundant & fast network
• Peak hardware
• Spare equipment
• Physical space (storage of spare stuff too)
• People to manage physical infrastructure
• Hardware repairs
Cloud:
• Redundant infrastructure
• Multi-AZ, Regions, storage, etc
• Resilient Applications
• Designed for failure
• Performance measurement
• Automatic failover/recovery
• Security of your infrastructure
• Monitoring - up/down/status
• Visibility into system as a whole
• Don’t rely on cloud vendor!
• Delayed, inaccurate
5. Why Cloud? (for CopperEgg)
Why did we go cloud?
• Needed to get building fast
• We didn’t know what we needed
• Just-in-time scaling
• Keep costs low and still provide awesome service levels
• Easy deployment for developers
• Test different scenarios, try new setups, etc
• We use it for everything!
• code repositories, tickets, email, phone, alerting, etc
6. What we were building
Storage analytics product
• visualize network attached storage in real-time
• massive amounts of data
• analyzing 10 billion ops/day in beta, in real-time
• super real-time (seconds vs minutes)
Requirements:
• highly available
• super responsive
• gobble large amounts of analytics data in real-time
• historical data for 2 yrs
• great UI
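To put the "10 billion ops/day" figure in per-second terms, here is a quick back-of-the-envelope sketch. It assumes the load is spread evenly across the day, which the slide does not claim; real traffic is likely burstier, so peaks would be higher. The function name `average_ops_per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_ops_per_second(ops_per_day: int) -> float:
    """Average rate, assuming ops are spread evenly over the day (an assumption)."""
    return ops_per_day / SECONDS_PER_DAY


beta_ops_per_day = 10_000_000_000  # "10 billion ops/day in beta"
rate = average_ops_per_second(beta_ops_per_day)
print(f"{rate:,.0f} ops/second on average")  # 115,741 ops/second on average
```

Even as an average, that is well over 100,000 operations to ingest every second, which helps explain why insert performance comes up repeatedly in the slides that follow.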
7. Where we started
+ SimpleDB
Bad:
• Outgrew it before we outgrew it
• Slow!
So then what?
8. Amazon RDS to save the day!
+ SimpleDB
+ RDS
Good:
• Faster than SimpleDB
• Could scale the storage
Bad:
• Realized it still would not handle our dataset
• Inserts were too slow
So then what?
9. MySQL on EC2 to save the day!
+ SimpleDB
+ RDS
EC2 + MySQL
Good:
• Faster than RDS
• Increased insert performance
• Using some cheats to get the insert rate up
Bad:
• Still not good enough insert performance..
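The slides do not say which "cheats" were used to raise the insert rate. As an illustration only (an assumption, not the talk's actual method), one widely used technique is to batch many rows into a single statement and transaction instead of committing row by row. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL; the table `ops` and its columns are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `ops` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ops (ts INTEGER, op_count INTEGER)")
    return conn


def insert_one_by_one(conn, rows):
    """Slow path: one INSERT and one commit per row."""
    for row in rows:
        conn.execute("INSERT INTO ops (ts, op_count) VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn, rows, batch_size=1000):
    """Faster path: send rows in batches and commit once per batch."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO ops (ts, op_count) VALUES (?, ?)", batch)
        conn.commit()


def count_rows(conn) -> int:
    return conn.execute("SELECT COUNT(*) FROM ops").fetchone()[0]


rows = [(ts, ts * 2) for ts in range(2500)]
conn = make_db()
insert_batched(conn, rows)
print(count_rows(conn))  # 2500
```

Both functions store the same rows; the batched version simply pays the per-transaction overhead far less often. How much this helps depends on the database and storage, so it needs measuring on the real system.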
So then what?
10. MySQL on Rackspace Cloud
+ SimpleDB
+ RDS
+ MySQL
EC2 + MySQL
Good:
• Faster than Amazon (CPU)
• Seemed cheaper
Bad:
• No easy way to scale across different zones or regions
• No way to expand storage per instance (whole instance only - costly!)
• Then we got the bill: they charge for data xfer between instances - OUCH
So then what?
11. Back to Amazon!
+ SimpleDB
+ RDS
+ MySQL
EC2 + MySQL
EC2, EBS, MongoDB
Why did we move back?
• Lots of great services: S3, EC2, EBS, Route 53, ELB (we use all of these)
• Even more: SQS, SES, etc
• Multiple regions and availability zones
• Scale-as-you-need: storage, memory, cpu, redundancy
• Documentation
We’re still happy with this.. (9 months and running)
12. What’s this NoSQL thing?
Realized maybe MySQL was not the best choice
• How about a NoSQL database?
• So we tested and measured every one we thought was worth looking at:
• Redis
• Tokyo Tyrant, Kyoto Cabinet
• Cassandra
• MongoDB
• etc, etc, etc (there are a lot)
13. MongoDB won
MongoDB won the award - why?
• Redundant
• Scalable
• Persistent data-store
• Handles large amounts of data
• Awesome user community
• Vendor support
• Open source
• Lots of momentum
14. Where are we now?
Needed a way to monitor our site:
• Requirements:
• Know right away when problems occur
• See into the performance of the system
• See historical trends as we grow the business
• Super real-time product needs super real-time monitoring
• Not satisfied with existing solutions
• slow updates (1m or 5m is way too slow - not real-time)
• not ‘cloud friendly’
• pain to maintain
• some are pricey
15. Not real-time?
Then what *is* real-time?
• Smallest amount of time you can comfortably have poor service before someone notices and changes their behavior.
• Example:
• Web site can only be slow/unavailable for a few seconds before people leave
• Email can be slow for tens of seconds before people get grumpy (or less depending on the people!)
• Twitter - well, we’ll leave that one for you to decide
So, if seconds is the yardstick for measuring poor performance, why do we monitor every 1 or 5 minutes?
16. CPU Usage: 5min sampling
[Chart: CPU usage (%), 0-100 scale, 5:00 PM to 5:05 PM, 5-minute samples]
Here’s what a 5 minute sample provides
• Doesn’t look like much is happening
• Users should not be complaining, right?
17. CPU Usage: 1min sampling
[Chart: CPU usage (%), 0-100 scale, 5:00 PM to 5:05 PM, 1-minute samples]
Same data - 1 minute sample
• Looks like there was some kind of CPU activity between 5:01pm and 5:02pm
• Still no issue though - right?
18. CPU Usage: 5 second sampling
[Chart: CPU usage (%), 0-100 scale, 5:00 PM to 5:05 PM, 5-second samples]
Same data - 5s sampling
• Becomes clear there was something happening:
• between 5:01:10pm - 5:01:25pm
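The effect shown across the three charts can be reproduced with a small sketch. The numbers are hypothetical, since the slides' underlying data is not available: a CPU idling at 10% that is pegged at 100% for the 15 seconds from 5:01:10 to 5:01:25. It also assumes each sample is the average over its interval; a monitor that takes point-in-time readings could miss the spike entirely.

```python
def downsample(per_second, window):
    """Average consecutive `window`-second chunks of a per-second series."""
    return [
        sum(per_second[i:i + window]) / window
        for i in range(0, len(per_second), window)
    ]


# Hypothetical data: 300 seconds starting at 5:00:00 PM, idle at 10% CPU,
# pegged at 100% from second 70 (5:01:10) up to second 85 (5:01:25).
cpu = [10.0] * 300
for second in range(70, 85):
    cpu[second] = 100.0

for window in (300, 60, 5):
    samples = downsample(cpu, window)
    print(f"{window:>3}s sampling: peak {max(samples):.1f}%")

# Output:
# 300s sampling: peak 14.5%
#  60s sampling: peak 32.5%
#   5s sampling: peak 100.0%
```

The same 15-second saturation shows up as 14.5% at 5-minute sampling, 32.5% at 1-minute sampling, and the full 100% only at 5-second sampling.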
19. So we rolled our own
RevealCloud
• Turns out a lot of people agreed with us
• Highlights:
• Built on our super real-time analytics engine
• Updates in seconds vs minutes
• Easy to install, no config required
• Great looking and usable interface
• Works anywhere (public/private cloud, VM, bare metal)