ITV's Common Platform
Tom Clark, Head of Common Platform, ITV plc
An introduction to the people, process and technology behind the cloud platform that underpins all of ITV's key applications - from the system that pays Ant & Dec to the ITV Hub. Touches on hiring, building a culture, devops at scale, $everything as code, and more.
DevOps Enterprise Summit London 2016
#DOES16 @tomonocle
[Slide diagram. Labels: Product; ecosystems prd, dev; infrastructure environments infraprd, infradev (Consul, Jenkins, Sensu, ELK, Grafana); application environments prd, stg, sit (Applications); Product Account, VPC.]
Tom Clark
tom.clark@itv.com | @tomonocle
ITV blog
http://io.itv.com/
Autonomy, Mastery & Purpose (Pink)
http://www.danpink.com/
Pioneers, Settlers, Town Planners (Wardley)
http://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html
Terrafile (Ben Snape)
http://bensnape.com/2016/01/14/terraform-design-patterns-the-terrafile/
Editor's notes
Hello!
I’m Tom Clark, Head of Common Platform at ITV
To introduce + give background
In the industry for 15 years now, working as a contract sysadmin, infrastructure architect, and also as a Perl developer
Seen a lot across many orgs - Jaguar Cars, BBC, Global Radio, ITV, plus two of my own startups
Couple of years ago went travelling, grew a big beard, got a motorbike, rode around Asia, came back, wanted a new challenge - shaved off my beard and went permanent with ITV last year
Report to the Director of Infrastructure, who reports to the CTO, who reports to the Board
Please tweet me, it makes my Mum very proud
Integrated producer/broadcaster - means we make stuff as well as having the ability to distribute it ourselves
Founded in 1955, you probably knew it as “Channel 3” growing up
ITV you know today was born in 2004 when the regions like Carlton/Central/Granada/LWT merged
Member of the FTSE100 with a turnover of £3bn in 2015
In 2015: most watched entertainment show, drama, soap and sporting event
Reach 75% ABC1s
98% of commercial shows >5m viewers on ITV
Doing a lot with little - only 5,500 staff
First a quick guide to how ITV is set up
Studios: makes stuff
Commercial: sells stuff
Broadcast: distributes stuff on-air
Online: distributes stuff… Online (where the ITV Hub lives)
Shared Services: everything else - HR, Legal, Finance, etc.
Almost bust
Outsourced infrastructure to MSP - save money
VMs in weeks - fine because we were mostly waterfall
Scripted installs - paper scripts.
Still dev + ops
Added Puppet to clean up (X Windows, r* services, etc), then to configure apps
Added CI
Added monitoring
Rob Taylor - pioneer
MVP, thin slice devops team
Waited weeks for VMs, no progress
Went rogue, got the credit card and went to AWS
Whole stack up and running in six weeks XXXX
Not _quite_ that gung-ho, but essentially accurate
Showed that “devops”/product teams worked
Showed that cloud worked
Fast-forward to March year
Large modernisation programme ramping up
Asked to take Rob’s great work and industrialise it
I’m a settler in the Wardley sense, I come in after the pioneer and take it to the next level
Now 14 instances
13 engineers
Hosts internal and external systems
ITV Hub (VOD platform)
Talent payment
Playout scheduling
Sales systems
Content delivery
“COTS” in 2017
This is the story of how we did that
What is the problem we’re always being asked to solve at ITV
FTSE100
Answer to Adam and the shareholders
Very lean
Automate the boring repetitive stuff
Concentrate on the interesting fun stuff
Automate more = more time = automate more = more time
Virtuous cycle
Law of accelerating returns
Rising tide lifts all boats
Allows you to make assumptions - don’t have to double check - faster!
If it looks like this here, it’ll look like that there too
Can steal from other teams
Want to get the benefits of standardisation without having to move in lock-step
How many systems can a failure affect
Lots of historical shared infrastructure
Must’ve been a big failure at some point, and someone said “I know! More process!”
Change Approval Board: “please sir, can I do a release?”
Always actively thinking about how it can be limited
Allows you to say “I don’t care” - incredibly powerful
Allows you to devolve responsibility to the product team
Run a command that did something you didn’t expect
Had to push on a door with a handle
Now imagine a green button marked “Stop!”
They’ve all violated the principle of least astonishment
All broken the standard behaviour we expect, and surprised us, astonished us, and that’s bad
Slows you down, reduces trust, makes you second-guess the system
What is the most obvious behaviour? Do that
Want boring, predictable system
Give people responsibility and they’ll want to make it work. Quality through psychology, not process
Don’t say “make it good!” - people should want to make it good, and if they don’t they shouldn’t be on the team
Devs, testers, Platform Engineers - all together
Dev, stage, production - they run it end-to-end
No more operations teams
Will talk about how they apply to the three classic areas:
People
Process
Technology
Used to have one common repo “linux_puppet” - VERY BRITTLE
Now one ‘infra’ repo per product - limited blast radius, higher SNR
Adopted roles + profiles
Treat internal modules like external modules
One repo per profile!
SEMVER! Changelogs! Releases! PRs!
Puppetfile
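The point of SEMVER on every profile repo is that a consumer can tell from the version number alone how risky an upgrade is. A minimal sketch in Python, assuming plain MAJOR.MINOR.PATCH tags; the function names are illustrative and not part of any ITV tooling:

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse '1.2.3' or 'v1.2.3' into a tuple of three ints."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def upgrade_kind(current: str, candidate: str) -> str:
    """Classify moving from `current` to `candidate`.

    Returns 'none' when the candidate is not newer, otherwise
    'major', 'minor' or 'patch' for the first component that differs.
    """
    cur, new = parse_version(current), parse_version(candidate)
    if new <= cur:
        return "none"
    if new[0] != cur[0]:
        return "major"  # may contain breaking changes: read the changelog
    if new[1] != cur[1]:
        return "minor"  # new features, meant to be backwards compatible
    return "patch"      # fixes only


if __name__ == "__main__":
    print(upgrade_kind("1.4.2", "2.0.0"))
```

A team pinning a profile in its Puppetfile could use a check like this to decide whether a bump needs a changelog review or can go straight through.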
Early adopters
Built our own modules to express ITV standards
Same as puppet, one repo per module
Terraform/goism of remote modules == odd (to me)
Not easy to see what modules were being used
Lots of places to change versions
Must be a better way - why not a file listing modules and versions, and tool to dump them into your repo
Mentioned to Efstathios - came back the next day with “terrafile” which did exactly that with rake
Ben refined it, made it into a gem.
Now used by every team, allows you to see very clearly what modules a project is using
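The idea of "a file listing modules and versions, and a tool to dump them into your repo" can be sketched as follows. This is not the terrafile gem itself (that is Ruby/rake, see the link in the references); it is a small Python illustration under assumptions: the module names, git URLs and the `vendor/modules` destination are hypothetical, and the file is shown as an in-memory dict rather than parsed from disk.

```python
from typing import Dict, List

# Hypothetical module list: one entry per module, each pinned to a version tag.
TERRAFILE: Dict[str, Dict[str, str]] = {
    "vpc": {"source": "git@example.com:acme/terraform-vpc.git", "version": "1.2.0"},
    "consul": {"source": "git@example.com:acme/terraform-consul.git", "version": "0.3.1"},
}


def clone_commands(
    terrafile: Dict[str, Dict[str, str]],
    dest_dir: str = "vendor/modules",  # assumed destination directory
) -> List[List[str]]:
    """Build one `git clone` command per module, pinned to its version tag.

    The commands are returned rather than executed so they can be
    inspected (or passed to subprocess.run) by the caller.
    """
    commands = []
    for name in sorted(terrafile):
        entry = terrafile[name]
        commands.append([
            "git", "clone", "--depth", "1",
            "--branch", entry["version"],  # --branch also accepts a tag name
            entry["source"],
            f"{dest_dir}/{name}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in clone_commands(TERRAFILE):
        print(" ".join(cmd))
```

Keeping every module and version in one place is what makes it easy to see what a project uses and leaves a single place to change a version.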
Loosely coupled, highly aligned
Tooling/automation to make it easy
Common vocabulary - standard terms: products, ecosystems, environments
Standard measures - does it apply to the product, ecosystem or environment
Every product gets a pair of accounts, mapped to the dev and prd ecosystems - blast radius reduction
Side benefits for billing, security, API rate limits and support costs.
Isolating services per ecosystem, per product leads to higher SNR - helps prevent alert fatigue
Don’t need to think “Is that alert for me” - it must be by definition
And we apply the same pattern to every instance
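The "pair of accounts per product" pattern means an alert's account identifies its owner unambiguously. A rough Python sketch, assuming a hypothetical `<product>-<ecosystem>` naming scheme and made-up product names (the talk does not give the real account names):

```python
from typing import Dict, List, Tuple

ECOSYSTEMS = ("dev", "prd")  # the two ecosystems named in the talk


def accounts_for(product: str) -> Dict[str, str]:
    """Return the account name per ecosystem (naming scheme is hypothetical)."""
    return {eco: f"{product}-{eco}" for eco in ECOSYSTEMS}


def build_index(products: List[str]) -> Dict[str, Tuple[str, str]]:
    """Map every account name back to its (product, ecosystem) pair."""
    index: Dict[str, Tuple[str, str]] = {}
    for product in products:
        for eco, account in accounts_for(product).items():
            index[account] = (product, eco)
    return index


if __name__ == "__main__":
    index = build_index(["hub", "payments"])  # hypothetical product names
    print(index["hub-prd"])
```

Because no account is shared, the lookup never returns more than one owner, which is the "it must be for me by definition" property.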
Technology would be useless without people to use it
Team of hundreds at the MSP
Two pizza approach
Brains not bodies
Smart - the ability to adapt to change, because the technology we use today won’t be the technology we use tomorrow and you have to be smart to keep up
Kind - the ability to fit into the team - essentially “don’t be a douche”. We don’t have room for brilliant jerks
Give people a chance and they’ll probably surprise you.
Perfect example is Cameron
Responded to Hacker News job posting
Only 18 months experience, wasn’t sure we had capacity for a junior
Great Skype chat
Great coffee chat
Brought him in for an interview
Smashed it out of the park
Web architecture whiteboard session
Invented sticky loadbalancing from first principles
Throw them in at the deep end with arm bands and a life guard and they’ll probably be fine
So once you’ve hired these smart and kind people, what next?
Daniel Pink’s theory of motivation is based upon three things
Autonomy
Mastery
Purpose
Autonomy - the freedom to make your own decisions. You’ve hired smart people, so let them do smart person stuff.
Trust, but set high standards.
Give them a map and a compass, not directions
Mastery - the opportunity to become brilliant at something through training and practice.
Purpose - the belief that what we do actually matters
Coronation Street, not saving lives but entertaining them which is the next best thing
So once we’ve given these smart and kind people autonomy, mastery and purpose - what do we do with them?
Two sets of engineers at ITV
Blast radius again - rather than contending for a central resource, embed them in the division, leave it up to the Technology Director to decide.
They report to me, but their workload is dictated by the division
Generally first responders for the product
Force multipliers on the team - make the team more effective
Influencing operational quality from the start when it’s cheap!
Responsible for the “concept” of the Common Platform itself
Common puppet modules
Common terraform modules
Tooling
Best practice
Complete/“batteries included” - should get you started out of the box
Heavy R+D
Used to be done in a project/product timeline - “1 week, 2 weeks, 4 weeks? Huh!” Blew their minds
Incubation of new hires
Second-opinion-as-a-service
Not dictators - custodians. Accepting PRs from other engineers.
Look around and steal - let chaos reign, rein in chaos
Reinventing the wheel - different every time
Boring stuff like logging, monitoring, deployment etc should be standard
That’s what we do with the core team
Do the work once and do it brilliantly
Share with the other teams
Frees the product engineer to focus on their product
Easy way to make changes to the entire estate
Feeds into the Core team, they update the relevant modules, everyone upgrades
I say “upgrade” - how does that work?
Underpins the platform
SEMVER again
Stored in a github repo
Will eventually release monthly
Still in beta (0.1)
Essentially the owner’s manual
High level goals
Why we’re doing this
Every decision we make should service one or more of these goals
Quality = doing the right thing once rather than the wrong thing twice
Simplicity = as simple as possible, but no simpler
Value = as small as possible, as large as necessary
Mostly common sense
Problem is common sense isn’t that common
The musts, must nots, shoulds, should nots
Hopefully you’d all agree, but now it’s explicit
Note it doesn’t mention puppet
This document is abstracted from implementation detail
Standards, practices and principles: the day-to-day detail
Defines ecosystems
Defines how AWS looks
DNS standards
Alerting standards
For example “my development server just sneezed at 02:00 - should I page the on-call?”... No
Development server sneezing at 02:00 meets none of those criteria
“ITV Hub is down at prime time” - urgent: yes, important: yes, actionable: most likely.
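The two examples above can be expressed as a simple paging rule. This Python sketch assumes an alert should page only when it is urgent, important and actionable all at once; that reading, and the `Alert` type, are assumptions for illustration rather than the wording of the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    summary: str
    urgent: bool
    important: bool
    actionable: bool


def should_page(alert: Alert) -> bool:
    """Page the on-call only if all three criteria hold (assumed rule)."""
    return alert.urgent and alert.important and alert.actionable


DEV_SNEEZE = Alert("development server sneezed at 02:00", False, False, False)
HUB_DOWN = Alert("ITV Hub is down at prime time", True, True, True)

if __name__ == "__main__":
    for alert in (DEV_SNEEZE, HUB_DOWN):
        print(alert.summary, "->", "page" if should_page(alert) else "do not page")
```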
Not actually part of the spec
Clean code
Clear comments
Peer review
Don’t want chapter and verse - good code can be good documentation
More change in parallel, safely, than ever before
VMs in minutes, not weeks
Initial environments in weeks, not months (still too slow though!)
Performance has improved - enhanced monitoring, more eyes - sunlight is the best disinfectant
Reliability has improved, teams suffer the pain and want to fix
Finding talent.
Low supply, high demand.
Does the community need to build a devops academy to grow the next generation?
That’s it - whistlestop tour
Still on the journey, lots more still to do
Thank you
Any questions?