What does it take to get an application into production? Many processes, tools and automation surround that application to deliver it to the customer. As it becomes more common for development teams to autonomously deliver and run their software, the focus of the traditional operational teams shifts towards an as-a-service mindset. But how is such a team positioned within the company? And is Platform Engineering any different from Software Engineering?
In this talk I’ll share my experiences as a platform engineer and explain why I believe that every company should be conscious about why and how to setup this responsibility. I’ll also discuss the biggest challenges surrounding it - and how to tackle them.
22. Definition time: “Platform Engineering”
The composition and integration of a set of processes, tools and automation
(components) to build a coherent platform with the goal of empowering developers
to be able to easily build, maintain and operate their business logic.
24. Why Platform Engineering
“[T]he reality is that state of the art cloud native technology is still too hard to use if
every product engineering team has to individually solve common problems
around networking, observability, deployment, provisioning, caching, data storage,
etc.”
https://medium.com/@mattklein123/the-human-scalability-of-devops-e36c37d3db6a
29. Who has the responsibility?
1 or 2 platform engineers
1 platform team
multiple platform teams
30. Multiple approaches
● “Just hand me your source code on a USB stick. I’ll put it on the server and
make sure and that it keeps working”
Or…
● “Create a pull request <here> to automatically get a Git repository, CI/CD,
filled with a Hello World application, that deploys all the way to production”
31. Challenges we’ll talk about
1. Protecting the platform for organizational scalability
2. Building an opinionated platform
3. Specifying contracts
4. Collaborating with your users
36. The question to ask for each new feature
How much support will I need to provide after I have delivered this feature?
Less (or equal) is the only acceptable answer
37. Reducing toil
“Toil is the kind of work tied to running a production service that tends to be
manual, repetitive, automatable, tactical, devoid of enduring value, and that
scales linearly as a service grows.”
https://landing.google.com/sre/sre-book/chapters/eliminating-toil/
38. Examples of toil
● Manually creating Git repositories and granting the correct people access
● Manually creating CI/CD pipelines based on another project
● Running load tests from a laptop with a security token that you can not share
● Putting secrets into your secret management solution
44. Self-service instead of toil
● Manually creating Git repositories and granting the correct people access
● Manually creating CI/CD pipelines based on another project
● Running load tests from a laptop with a security token that you can not share
● Putting secrets into your secret management solution
45. Scalable support
Developer: Hello, can you help me with “something” for feature X?
Platform: Did you check the docs for feature X?
Developer: Yes, but it doesn’t mention anything about “something”
Platform: I see, give me as second
Developer: Sure
Platform: I edited the docs to include “something”, does this answer
your question? https://github.com/org/docs/pull/1337
Developer: Totally, thanks!
46. How to protect yourselves
● Automate. Think “as a service”.
● Prefer managed solutions above self-hosted. Every minute you spend on
maintaining a self-hosted solution is time not spend on improving the platform
usability
● Prefer documentation above single answers
● Make the platform as simple as possible
55. Self-service
● Lambda languages v.s. Lambda runtimes
● Hosted, shared CI/CD solution vs. BYO CI/CD solution
● Default Docker / EC2 images vs. built your own
● Default network with every application vs. create your own VPC/Subnet
59. Business value
What is the business value of each development team maintaining their own;
● CI/CD server
● Monitoring solution
● Network
● ...
60. Getting buy-in
Team Manifesto
We build everything with developers. We’ll always select a pioneering team that
has interest in a new service/feature, and work with them to make sure we pick
and configure the right tool and solve the root problem.
61. How to sell an opinionated platform
● Simple to use, high-level abstractions; developers shouldn’t want to use
anything else
● Make everyone aware of the constant balance seeking
● Build the features together with developers
73. Shared Responsibility
● Git repository automation
● CI/CD autoscaling
● Network ACLs
● Kubernetes cluster
● Logging / Metrics agent on EC2,
listening on specific port
Platform Development
● Creating Git repositories
● CI/CD configuration
● Security groups
● Application in Kubernetes
● Shipping logs/metrics to specific
port
74. Continuously test your responsibility
● Are my agents autoscaling?
● Are my NACLs correct?
● Is my Kubernetes cluster stable?
● Are the logging/metrics agent operable?
→ Make this transparent!
78. Codify your contract
execution:
- concurrency: 5
hold-for: 20m
ramp-up: 5m
scenarios:
requests:
- method: GET
url: https://example.com
modules:
blazemeter:
projectName: Team Name
Must specify testName:
https://url/to/docs
79. $> git clone example/template/skeleton
Codify your microservices
80. Default microservice
● Unit testing
● Integration testing
● Log example
● Metric example
● Secret example
● CI/CD to production
● ...
81. How to define your contracts
● Be aware of the contract and make the company aware
● Minimize the gray area
● Codify your contract
○ Shift left
○ Examples / blueprints
● Make sure the company understand the interface to your team
84. Scalable support
Developer: Hello, can you help me with “something” for feature X?
Platform: Did you check the docs for feature X?
Developer: Yes, but it doesn’t mention anything about “something”
Platform: I see, give me as second
Developer: Sure
Platform: I edited the docs to include “something”, does this answer
your question? https://github.com/org/docs/pull/1337
Developer: Totally, thanks!
85. Getting buy-in
Team Manifesto
We build everything with developers. We always select a pioneering team that has
interest in a new service/feature, and work with them to make sure we pick and
configure the right tool and solve the root problem.
87. Imagine… a power outage
You’re at home, watching TV, and BAM: TV is out, lights are out. What do you do?
● Check the other rooms to see the scope of the outage
● Check with the neighbours if it’s just your home or more
● Check with the power company
88. How to collaborate with your users
● Teach everyone how the platform works, in a scalable way
○ Documentation
○ Workshops
● Enable collaboration between development teams
● Make it clear how you work
89. Summarizing
1. You need a platform - and someone should own it
2. Always think about scale - both organizational and technical
3. Opinions are OK
4. Specify contracts, but be careful about what you support
5. Collaborate
I mention it last but it should really be at top.
Is all my data encrypted at rest?
Is all traffic encrypted?
What about patch management?
And there is more. Networking, storage, distributed tracing, e.t.c.
It’s expectation management that causes failure. If you underestimate the work required to get a solid platform up and running, you won’t have one and all developers will need longer to work on their applications.
Components: see previous examples
“Easy” is a vague word, but it should be your goal. Keep it easy. Developers are already doing complex things, they don’t need more.
Note the word “Operate”: develops should operate their application in production. It’s 2019.
First of all, it’s efficient. If every developers needs to solve I mentioned earlier, everyone is doing the same work.
In fact, it takes specific knowledge. Because: slide with quote
Responsibility / ownership. Everything should be owned.
VPC’s in AWS are not hard, but also not trivial. Every developer can surely pick this up, but it’s not very efficient if every developers needs to understand it to this level.
In other words: specific knowledge is required.
Development teams own logical groups of services: Platform teams owns platform services
Ownership is of course important in general. We’ll come back to who owns what later (contracts).
Ownership is important, but no different from “normal” (other) applications. Which is the bridge to the next slide: there is no difference.
Development teams own logical groups of services: Platform teams owns platform services
Ownership is of course important in general. We’ll come back to who owns what later (contracts).
Ownership is important, but no different from “normal” (other) applications. Which is the bridge to the next slide: there is no difference.
“Specific knowledge” is also not rare. When a team builds an event sourcing mechanism… they need specific knowledge required for that.
But I’ll just say “Platform team”
Lambda was always opinionated.
Introducing a new language takes time. And not only new languages: also version updates for existing languages (e.g. Python 2.7, 3.6, 3.7)
Your best people will leave if they can not pick up work that excites them.
Operational teams typically use Kanban instead of Scrum/sprints, because of the operational work that constantly comes in and that is also important. This means a lot of context switching, and constant battles about priority. If you work on “projects”, just like other development teams, this will happen way less.
Provide self-service solutions for toil. The platform team does not have to be involved anymore.
Provide self-service solutions for toil. The platform team does not have to be involved anymore.
Also mention how this is the happy path. Dev may also say “I have no time” or “Where are the docs”. You have also raised awareness about documentation. You need to educate people that you work like this. More on education later.
Last point is bridge to next section. Because “as simple as possible” typically means “opinionated”. Having to make choices leads to complexity.
Back to the AWS Lambda example. Lambda has always been very opinionated: only a specific set of languages and versions was supported.
That changed when they released custom runtimes. Still opinionated, but on a lower abstraction level.
Going back to the previous point about organizational scalability: less orange scales better.
Higher abstractions lead to more opinionated solutions. It means developers can easily use them, but are limited to what they can pick.
When you lower the abstraction of your platform features, more responsibility goes to the developer. E.g. with Lambda again;
Give us code in language X, version Y. High abstraction.
Give us an HTTP endpoint. Lower abstraction.
Contracts!
Provide self-service solutions for toil. The platform team does not have to be involved anymore.
Of course there is an alternative: running it yourself on EC2
The team that does this themselves may go faster, because they use the tools they are familiar with. However, how much do you lose company-wide because of the fragmentation?
The team that does this themselves may go faster, because they use the tools they are familiar with. However, how much do you lose company-wide because of the fragmentation?
Again back to the AWS Lambda example. When they released custom runtimes - and became a bit less opinionated - they changed their contract. Or actually: appended it
Before: give us a function in language X, version Y
Now: either this ^^ or: provide us an HTTP endpoint that we’ll sent a specific request to
Contract can be “double”: either specific use of feature, or lower abstraction.
So a contract says: “this is where the platform ends, and this is where you come in”.
Same when one microservice is talking to another.
Underlying implementation is hidden away. Makes life easier for everyone.
The (REST) contract to another microservice is typically very clear. Do the same with your platform interfaces.
The (REST) contract to another microservice is typically very clear. Do the same with your platform interfaces.
Blazemeter example: if you want to know how to create the YAML, check the blazemeter documentation.
The way you work together with the platform team is also a “contract”.
Again, AWS Lambda.
If you would ask AWS to add an additional language, do you think they would do it?
If you would ask AWS to increase the maximum timeout time - just for you - would they do it?
How is that any different from asking your platform team?
How do you expect either AWS or your Ops team to prioritize your request?
This is “helping”. Not only the question they asked, but also that this is the way we provide support.
The team that does this themselves may go faster, because they use the tools they are familiar with. However, how much do you lose company-wide because of the fragmentation?
“Collaboration” is also making sure developers collaborate with each other.
Of course if there is a big production outage, you may want to involve other teams sooner.