2. iRobot 2017 | 2
• Founded in 1990
• Defense and security: circa 2000
• Roomba: 2002
• Roomba 900 = cloud connectivity: 2015
• Migrated to AWS: 2016
• Now exclusively focused on consumer
robots
About iRobot
We are THE robot company
3. iRobot 2017 | 3
• Founded in 1976
• IT consulting for 15+ years
• Hopped over to iRobot in 2015
• Manage the AWS implementation across
iRobot
• Primary focus on the cloud connected
Robot ecosystem
• Contact me: akammerer@irobot.com
About Aaron
He is THE AWS platform manager
4. iRobot 2017 | 4
• Embodying good ops:
• Good situational awareness
• Ability to navigate dynamic, challenging
landscapes with agility
• Can fix anything with the tools available
• A steady hand, calm and collected
About Operations
5. iRobot 2017 | 5
Our Team
Well, you go to war with the army you have
(well we’re actually not too shabby)
6. iRobot 2017 | 6
• Build faster
• POCs, testing, etc. fly
• Operate leaner
• Skip the pain of learning to scale
• Important for a historically hardware-oriented
company – we LIKE to build stuff here!
• Cost saving:
• Perhaps net-neutral between tightly
managed servers and AWS Managed Svcs
• Huge savings in internal operations,
development, and monitoring effort
So we can…
Why serverless on AWS?
Outsource servers, OS, and mid-tier applications to the pros
Serverless increases our agility
7. iRobot 2017 | 7
• Provides Rules Engine, Device Gateway,
Certs, Authentication/Authorization, Registry,
Shadows
• Tons of infrastructure supporting these
features that we rely on AWS to maintain
for us
• Just one of the 25 services we utilize
Prime Example – AWS IoT
Why serverless on AWS?
No need to reinvent any wheels
8. iRobot 2017 | 8
• Add photo of missions
So that we can focus on our apps:
9. iRobot 2017 | 9
• Millions of robots sold per year
• Not all are connected, but the majority will be soon
• iRobot Home production application:
• 100+ Lambda functions
• 25 AWS services
• 0 unmanaged EC2 instances
• Development and internal AWS footprint:
• ~50 accounts, growing constantly
• 1000s of Lambda deploys per day
• Low single digit FTE supporting operations
iRobot Scale
Currently running and managing
Lots of stuff!
11. iRobot 2017 | 11
• Moving from servers to serverless is a bit like
the change from on-prem to cloud
• It’s easier, in many respects, but it’s not
without its own idiosyncratic issues
• You stand on the shoulders of giants (Tim
Wagner is pretty tall), through outsourcing
these operations
• But outsourcing doesn’t mean you do zero
work
• Being clear about this organizationally is
important
DiffOps
No such thing as a free lunch
12. iRobot 2017 | 12
• Red/black Deployment Paradigm
• Proprietary CloudFormation deployments
• A deployment comprises a complete application
stack
- API Gateway, Lambda, CloudFront, Kinesis, etc.
• Data sources are maintained separately and
protected from accidental updating, etc
iRobot stack
Production ecosystem – Deployment
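The red/black paradigm above can be sketched roughly as follows. This is a minimal in-memory model, not iRobot's proprietary CloudFormation tooling: the `Stack`, `Environment`, and `red_black_deploy` names and the `is_healthy` callback are all hypothetical, and a real cutover would go through CloudFormation and DNS or API Gateway changes rather than a Python assignment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stack:
    """One complete application stack (API Gateway, Lambda, etc.)."""
    name: str
    version: str


@dataclass
class Environment:
    """Tracks which stack takes live traffic. Data sources are kept separate."""
    live: Optional[Stack] = None
    retired: List[Stack] = field(default_factory=list)


def red_black_deploy(env: Environment, candidate: Stack,
                     is_healthy: Callable[[Stack], bool]) -> bool:
    """Stand up `candidate` beside the live stack, then cut over if healthy."""
    if not is_healthy(candidate):
        # The candidate never took traffic, so the live stack is untouched.
        return False
    previous = env.live
    env.live = candidate  # cut traffic over in one step
    if previous is not None:
        env.retired.append(previous)  # kept around for a quick rollback
    return True


# Hypothetical usage: one good release, then one that fails its health check.
env = Environment(live=Stack("app-v1", "1.0"))
ok = red_black_deploy(env, Stack("app-v2", "2.0"), lambda s: True)
bad = red_black_deploy(env, Stack("app-v3", "3.0"), lambda s: False)
```

Because the data sources sit outside the stack, replacing the whole stack in one step does not put stored data at risk.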
13. iRobot 2017 | 13
• SumoLogic
• Essential for log sleuthing
• Get all data associated with an artifact
immediately across all accounts
• Provides quantitative metrics on fleet health
• Alarms and notifications
• Of course, we use CloudWatch as well
iRobot stack
Production ecosystem – Monitoring
14. iRobot 2017 | 14
• ADFS – both our AWS console and
command line point of entry
• Ensures ease of access across environments
for developers
• Removes reliance on long-lived access keys
• Multi-region backup using Data Pipeline and
S3 cross-region replication
• S3 as a cross account data messenger, or
hub in a hub and spoke data sharing model
• Multi-account/region rollouts of foundational
architectures
• Standardized IAM roles, policies
• CloudTrail implementation
• Logging infrastructure (Sumologic pumpers, etc)
iRobot stack – multi-account considerations
Bits and pieces
15. iRobot 2017 | 15
• S3 has good bucket policy support for cross
account interaction
• Simply send data to an accepting bucket in
the other account, which can listen for the
object events.
• Primarily for very loosely coupled
applications
• Our CloudTrail data is aggregated into one
bucket, then processed by Sumologic
• Have also used a Lambda client/server model
for more tightly coupled use cases
• Central ‘server’ Lambda can be called by ‘client’
Lambdas in other accounts, limiting scope in the
‘server’ account, without requiring APIs, etc.
iRobot stack – S3 cross-account data transfer
Easily integrating applications
Account 1 Account 2
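As an illustration of the bucket-policy approach above, here is a sketch that builds a policy letting a second account put objects into a receiving bucket. The bucket name, account ID, and the `cross_account_put_policy` helper are placeholders rather than anything from iRobot's setup, and the ACL condition is one common convention, assumed here, for making sure the receiving account owns what lands in its bucket.

```python
import json


def cross_account_put_policy(bucket: str, sender_account_id: str) -> dict:
    """Build an S3 bucket policy that lets another account write objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowSenderAccountPut",
            "Effect": "Allow",
            # Hypothetical sender account; replace with the real account ID.
            "Principal": {"AWS": f"arn:aws:iam::{sender_account_id}:root"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            # Assumed convention: require the sender to grant the bucket
            # owner full control of each uploaded object.
            "Condition": {
                "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
            },
        }],
    }


policy = cross_account_put_policy("example-receiving-bucket", "111111111111")
policy_json = json.dumps(policy, indent=2)  # ready to attach to the bucket
```

The receiving account can then subscribe to the bucket's object-created events, which keeps the two applications loosely coupled.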
16. iRobot 2017 | 16
• Use ADFS to run scripts on all accounts
• Foundational roles, limit checking, support
utilization
• Maintain a data structure of all ADFS and
other foundational IAM roles/policies
• Tracked in source control
• Can be run idempotently in any account
• New accounts can be provisioned quickly
• Roll out standardized logging infrastructure
• Sumologic lambda infrastructure
• CloudTrail implementation
• API Gateway/IoT logging parameters
• Consolidate billing
• Then send summaries to Sumologic via a cron’d
Lambda, for billing alerts, granular reports, trends
iRobot stack – multi-account considerations
How to manage all 50+ accounts
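The idea of a source-controlled data structure of roles that can be run idempotently in any account might look something like this. It is a sketch under stated assumptions: roles are modeled as plain dictionaries, `plan_role_changes` and `apply_roles` are hypothetical names, and a real run would call IAM in each account instead of updating a dict.

```python
from typing import Dict, List, Tuple

RoleSpec = Dict[str, dict]  # role name -> policy document


def plan_role_changes(desired: RoleSpec,
                      existing: RoleSpec) -> Tuple[List[str], List[str]]:
    """Compare the tracked role definitions with what an account has."""
    to_create = sorted(n for n in desired if n not in existing)
    to_update = sorted(n for n in desired
                       if n in existing and existing[n] != desired[n])
    return to_create, to_update


def apply_roles(desired: RoleSpec, existing: RoleSpec) -> RoleSpec:
    """Return the account state after creating or updating drifted roles."""
    to_create, to_update = plan_role_changes(desired, existing)
    updated = dict(existing)
    for name in to_create + to_update:
        updated[name] = desired[name]
    return updated


# Hypothetical role definitions, as they might be kept in source control.
desired = {
    "adfs-developer": {"Action": "lambda:*"},
    "adfs-readonly": {"Action": "s3:Get*"},
}
account = {"adfs-readonly": {"Action": "s3:*"}}  # drifted from the spec

first_plan = plan_role_changes(desired, account)
account = apply_roles(desired, account)
second_plan = plan_role_changes(desired, account)  # nothing left to do
```

Running the same script a second time produces an empty plan, which is what makes it safe to run against every account, including newly provisioned ones.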
17. iRobot 2017 | 17
• Same granularity in the platform as
production
• But orders of magnitude more churn
• Exercises the account limits
• Tests metrics to determine relevance and
meaning
• Bonus – Developer activity provides
additional visibility into how the platform is
currently behaving
• Higher volume of deployments in many different
AWS accounts means problems found quickly
• This can alert us prior to problems hitting prod
DeveloperOperations
Can help with visibility
Developers can be platform testers, canaries, and guinea pigs
18. iRobot 2017 | 18
• No provider is immune to problems
• Small effects are more common than big
outages
• More services = blips could be encountered
more frequently
• This comes with the territory
• Setting expectations organizationally is
important
• Architecting robustly is key
- Event based
- Async
- Microservices
The cloud has weather
19. iRobot 2017 | 19
• First, do no harm, gather data
• What is actually impacted? Current transactions
or new deployments?
• Contact AWS Enterprise Support
• Start the ball rolling toward the service teams if it
turns out this has a platform component
• Additionally consult the big board, as well as
the Twitterverse to gauge whether many
customers are affected
• Start working the diagnosis –
• Our code or platform?
Reacting to incidents
Errors abound, what do we do?
20. iRobot 2017 | 20
• Dig in:
• Execute runbooks, consult CloudWatch,
Sumologic, CloudWatch Logs
• Root cause, etc
• From Enterprise Support:
• Get updates on platform health
• Gain insights into more opaque aspects of
services – hot partitions on DynamoDB for
instance
• Take direct action when possible –
• Ex. Kinesis stream iterator age increasing?
Reshard.
Reacting to incidents cont’d
It’s not you it’s me
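The Kinesis example above, resharding when iterator age climbs, could be reduced to a decision rule like the one below. The threshold, the shard cap, and the `target_shard_count` name are assumptions for illustration, not values from the talk. The iterator age would typically come from the CloudWatch `GetRecords.IteratorAgeMilliseconds` metric, and the result could be passed to the Kinesis `UpdateShardCount` API, which allows at most a doubling per call.

```python
def target_shard_count(current_shards: int, iterator_age_ms: float,
                       max_age_ms: float = 60_000,
                       max_shards: int = 64) -> int:
    """Suggest a shard count for a stream whose consumers are falling behind.

    `max_age_ms` and `max_shards` are illustrative defaults, not recommendations.
    """
    if current_shards < 1:
        raise ValueError("a stream has at least one shard")
    if iterator_age_ms <= max_age_ms:
        return current_shards  # consumers are keeping up; leave it alone
    # Double the shards, staying under the assumed cap.
    return min(current_shards * 2, max_shards)


healthy = target_shard_count(4, iterator_age_ms=10_000)
lagging = target_shard_count(4, iterator_age_ms=120_000)
capped = target_shard_count(40, iterator_age_ms=120_000)
```

Keeping the decision separate from the API call makes it easy to put in a runbook and to review before anyone changes a production stream.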
21. iRobot 2017 | 21
• Serverless requires a change in mindset
• These incidents can be opaque
• Feeling out of control of your own destiny
can be frustrating
• But the truth: you’d probably not do a better
job
• And in fact, you would likely do a lot worse
• And actions still need to be taken:
• Alert management to potential impact
• Proactively reach out to customer base
• Activate cross-region failover, etc.
Reacting to platform outages
When it’s a Cloud Provider problem
When it’s the platform’s problem, we still have work to do
22. iRobot 2017 | 22
• Biggest operational downside: visibility
• You only know what the provider tells you
• Architecture
• Security
• Operations
• How do they actually do all of the stuff they
do?
• Many known unknowns and unknown
unknowns
• Unknown unknown unknowns: what you
don’t know that they don’t know they don’t
know
Visibility
23. iRobot 2017 | 23
• AWS IoT today has 30+ metrics
• At launch, it had <10
• Without throttling metrics, thing shadow
updates, or web socket metrics it was hard to
debug issues
• Especially early on with small numbers of robots
• Can I connect? How many publishes?
• Load scale, are we over our limits?
Visibility
Metrics are our portal: Example – AWS IoT
More is better
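The question "are we over our limits?" can be answered mechanically once the metrics exist. The sketch below assumes usage and limit figures have already been collected into dictionaries; the metric names, the numbers, and the `limits_near_capacity` helper are hypothetical and do not correspond to actual AWS IoT metric names.

```python
from typing import Dict, List


def limits_near_capacity(usage: Dict[str, float], limits: Dict[str, float],
                         threshold: float = 0.8) -> List[str]:
    """Return the metric names whose usage is at or above threshold * limit."""
    flagged = []
    for name, limit in limits.items():
        used = usage.get(name, 0.0)
        if limit > 0 and used / limit >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical figures for a single account.
usage = {"connections": 900, "publishes_per_second": 200}
limits = {"connections": 1000, "publishes_per_second": 1000}
near = limits_near_capacity(usage, limits)
```

A check like this only works for the limits that have a matching metric, which is why the growth from fewer than 10 metrics to 30+ matters.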
24. iRobot 2017 | 24
• Enterprise Support has been a valuable
resource
• They are our eyes and ears within AWS
• Engage with them to run load tests,
understand account limits
• Our AWS Support team has made the effort
to understand our technology choices
• All of our AWS users, company-wide, benefit
from being able to create tickets
Visibility
AWS Enterprise Support
AWS Enterprise support, thumbs up!
25. iRobot 2017 | 25
• Personal Health Dashboard
• When performance is degraded, status is
important for ops to show evidence that it isn’t a
problem with our software
• Per-account service health means AWS can
update those affected customers more directly
• Metrics, metrics, metrics
• Service teams are always on the lookout for
which new metrics to include – connect with
them and share your requests!
• Kinesis shard-level metrics and Lambda iterator
ages were added with user input and make a real
difference in understanding system performance
The future of improved AWS visibility
Looking toward the horizon
26. iRobot 2017 | 26
• Absolutely
• Without serverless in general and AWS in
particular, iRobot would not have been able
to build and run a scalable, low-cost
production cloud application as
efficiently as we have today
So - Is serverless worth it?
Serverless is Manageable and it Works for Us
Personal favorite aspects of our AWS Platform implementation
We also support developeroperations – which help support the platform
Increased latency – Kinesis empties a little slowly, but catches up
With more services, we see these blips affect more pieces of our infrastructure, and it may be difficult to pinpoint exactly where problems are happening
The recent S3 outage was due to user error. It’s easy to play armchair hyperscale cloud operator and say you’d have prevented it.
AWS has an excellent commitment to security, and many certifications, but there are a lot of areas that certifications don’t cover and security details aren’t divulged