Matej Konecny discusses how his team at Atlassian manages their microservices architecture. In the early days, their applications were built as WAR files running on EC2 with autoscaling and logs stored in CloudTrail, which made monitoring and incident handling difficult. They have since implemented a unified PaaS across Atlassian using Docker Compose with automated provisioning of containers and resources. This common platform enforces best practices and allows sidecars to be reused. Incident detection and communication are now automated through tools like OpsGenie, and changes are easier to track. Weekly TechOps meetings cover alerts, KPIs, and code health to improve oversight of services.
9. Our early days...
EC2
Applications were built as WARs and ran on EC2/EBS. We used autoscaling.
Logs
In CloudTrail. Difficult to search for clues.
Monitoring
Only what is built into AWS, like CloudWatch.
18. INCIDENT HANDLING TODAY
T+0: Detect the problem
T+5: Automatic escalation to service owner
T+10: Check the changes globally
T+X: Rollback, turn off feature or hotfix
Later: Post Incident Review
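The timeline on this slide can be read as an ordered list of stages with offsets from detection. The following is only a minimal sketch of that reading, not the team's tooling: the `Stage` type, the `expected_stage` helper, and the pairing of each T+ label with a step in slide order are assumptions, and the slide does not state the time unit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    # Offset from detection in the slide's T+ units (unit not stated on the
    # slide). None means the slide gives no fixed offset (T+X, Later).
    starts_at: Optional[int]


# Stage names are taken from the slide; pairing them with the T+ labels in
# order is an assumption.
TIMELINE = [
    Stage("Detect the problem", 0),
    Stage("Automatic escalation to service owner", 5),
    Stage("Check the changes globally", 10),
    Stage("Rollback, turn off feature or hotfix", None),
    Stage("Post Incident Review", None),
]


def expected_stage(elapsed: int) -> Stage:
    """Return the latest fixed-offset stage that should have started by `elapsed`."""
    if elapsed < 0:
        raise ValueError("elapsed must be >= 0")
    started = [s for s in TIMELINE if s.starts_at is not None and s.starts_at <= elapsed]
    return started[-1]
```

A dashboard or chat bot could call `expected_stage(elapsed)` to remind responders which step should already be under way.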
28. INCIDENT COMMUNICATION
HOT ticket raised
Zoom/Slack/Statuspage
DETECTION
PAGE ACCEPTED
Investigation
OpsGenie auto-page
Resolve
38. Weekly TechOps meeting
Signal vs Noise
We check that the alerts raised are meaningful.
Check KPIs
Did the service meet all the defined Service Level Objectives (SLOs)?
Code health
Analyze the test coverage and the technical debt backlog.
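The first two checks in this meeting lend themselves to simple arithmetic. Here is a minimal sketch, assuming an availability-style SLO and a per-week count of alerts that led to action; the `WeeklyServiceReport` fields, the 99.9% default target, and the function names are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class WeeklyServiceReport:
    service: str
    total_requests: int
    failed_requests: int
    alerts_raised: int
    alerts_actionable: int  # alerts that led to a real action (assumed metric)


def availability(report: WeeklyServiceReport) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if report.total_requests == 0:
        return 1.0
    return 1 - report.failed_requests / report.total_requests


def meets_slo(report: WeeklyServiceReport, target: float = 0.999) -> bool:
    """Check KPIs: did the service meet the (assumed) availability target?"""
    return availability(report) >= target


def alert_signal_ratio(report: WeeklyServiceReport) -> float:
    """Signal vs noise: share of raised alerts that were actionable."""
    if report.alerts_raised == 0:
        return 1.0
    return report.alerts_actionable / report.alerts_raised


report = WeeklyServiceReport("example-service", 1_000_000, 500, 20, 5)
print(meets_slo(report), alert_signal_ratio(report))
```

A low signal ratio suggests alerts to tune or delete, while a missed SLO points to reliability work for the following week.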
44. What's next?
Knowledge silos
Automatically measure how well each team member knows each service to reduce knowledge silos.
Service costs
Gain more insight into the total cost of ownership of each service and each new feature built.
Runbooks
Organize the runbooks and make them searchable. Run more war games (fire drills).
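The slide does not say how knowledge of a service would be measured. One possible proxy, shown here purely as an assumption, is each person's share of commits per service: a service where one author dominates is flagged as a potential silo. The `knowledge_silos` function, its input shape, and the 0.8 threshold are hypothetical.

```python
from collections import Counter, defaultdict


def knowledge_silos(commits, threshold=0.8):
    """Flag services where one author holds at least `threshold` of the commits.

    `commits` is an iterable of (author, service) pairs. Commit share is an
    assumed stand-in for "how well a team member knows a service".
    Returns {service: (top_author, share)}.
    """
    per_service = defaultdict(Counter)
    for author, service in commits:
        per_service[service][author] += 1

    silos = {}
    for service, counts in per_service.items():
        top_author, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        if share >= threshold:
            silos[service] = (top_author, share)
    return silos


# Made-up data for illustration only.
commits = (
    [("ana", "billing")] * 9
    + [("bo", "billing")]
    + [("ana", "search")] * 2
    + [("bo", "search")] * 2
)
print(knowledge_silos(commits))
```

Code review participation or on-call history could be fed in the same way; commit counts are just the easiest signal to collect.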
49. To sum up...
Platform
Use a common platform and define guiding principles.
Monitoring
Collect metrics and aggregate the logs in a central location.
Standards
Enforce standards when deploying and developing.
Learn
Incorporate learning from the incidents. They are invaluable!