Some of the biggest growing pains we've experienced with our microservice architecture at Jet have been in preparing for system outages. I'll cover the benefits of choosing F# specifically (and functional programming in general), as well as Azure, for a chaos program; follow up with a discussion of why your team needs to implement a chaos testing program; and finish by showing our methods and code in depth.
9. We plan to be the new Amazon.com
Launched July 22, 2015
• Both Apple & Android named our app as one of their top 5 for 2015
• Over 20k orders per day
• Over 10.5 million SKUs
• #4 marketplace worldwide
• 700 microservices
We’re hiring!
http://jet.com/about-us/working-at-jet
10. Our stack
Azure: Web sites, Cloud services, VMs, Service bus queues, Service bus topics, Blob storage, Table storage, Queues, Hadoop, DNS, Active Directory, SQL Azure
F# ecosystem: F#, Paket, FSharp.Data, Chessie, Unquote, SQLProvider, Deedle, FAKE, FSharp.Async
Other languages & tools: R, Python, React, Node, Angular, SAS, Storm, Elasticsearch, Xamarin, Microservices, Consul, Kafka, PDW, Splunk, Redis, SQL, Puppet, Jenkins, Apache Hive, Apache Tez
14. Chaos Engineering is…
Controlled experiments on a distributed system
that help you build confidence in the system’s
ability to tolerate the inevitable failures.
16. Principles of Chaos Engineering
1. Define “normal”.
2. Assume “normal” will continue in both a control group and an experimental group.
3. Introduce chaos: servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Look for a difference in behavior between the control group and the experimental group.
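Step 4 can be sketched as a comparison of a steady-state metric between the two groups. This is a hedged illustration only: the function names, sample counts, and tolerance are made up for the example.

```fsharp
// Hedged sketch of step 4: compare a steady-state metric (error rate)
// between the control and experimental groups. Names and thresholds
// are illustrative, not from the talk.
let errorRate (requests: int) (errors: int) = float errors / float requests

// "Normal" holds if the experimental group stays within tolerance of control.
let withinNormal (control: float) (experimental: float) (tolerance: float) =
    abs (control - experimental) <= tolerance

let controlRate      = errorRate 10000 12   // group without chaos injected
let experimentalRate = errorRate 10000 15   // group with chaos injected
printfn "%b" (withinNormal controlRate experimentalRate 0.001)  // true -> system tolerated the failure
```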
17. Going further
Build a Hypothesis around Normal Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
From http://principlesofchaos.org/
24. SQL restarts & geo-replication
Start
- Checks the source db for write access
- Renames db on destination server (to create a new one)
- Creates a geo-replication in the destination region
Stop
- Shuts down cloud services writing to source db
- Sets source db as read-only
- Ends continuous copy
- Allows writes to secondary db
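The stop sequence above can be sketched as one async workflow. This is a hedged sketch, not Jet's actual code: the helper functions below are stubs (the printfn bodies stand in for real Azure management calls).

```fsharp
// Hedged sketch of the stop sequence, with stubbed-out helpers.
let stopWriters db = async { printfn "stopping cloud services writing to %s" db }
let setReadOnly db = async { printfn "setting %s read-only" db }
let endContinuousCopy src dst = async { printfn "ending continuous copy %s -> %s" src dst }
let allowWrites db = async { printfn "allowing writes on %s" db }

let stopWithFailover source secondary = async {
    do! stopWriters source                   // 1. no more writers on the source
    do! setReadOnly source                   // 2. read-only source -> no corrupt data
    do! endContinuousCopy source secondary   // 3. break the geo-replication link
    do! allowWrites secondary }              // 4. promote the secondary

stopWithFailover "orders-east" "orders-west" |> Async.RunSynchronously
```

The ordering is the point: writes stop before the source goes read-only, and the copy link is broken before the secondary accepts writes.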
28. “I have now delivered three business critical projects written in F#. I am still waiting for the first bug to come in.”
— Simon Cousins, UK power company
29. Concise and powerful code
public abstract class Transport{ }
public abstract class Car : Transport {
public string Make { get; private set; }
public string Model { get; private set; }
public Car (string make, string model) {
this.Make = make;
this.Model = model;
}
}
public abstract class Bus : Transport {
public int Route { get; private set; }
public Bus (int route) {
this.Route = route;
}
}
public class Bicycle: Transport {
public Bicycle() {
}
}
type Transport =
| Car of Make:string * Model:string
| Bus of Route:int
| Bicycle
C# F#
Trivial to pattern match on!
31. Concise and powerful code

C#:
public abstract class Transport { }
public class Car : Transport {
    public string Make { get; private set; }
    public string Model { get; private set; }
    public Car(string make, string model) {
        this.Make = make;
        this.Model = model;
    }
}
public class Bus : Transport {
    public int Route { get; private set; }
    public Bus(int route) {
        this.Route = route;
    }
}
public class Bicycle : Transport {
    public Bicycle() { }
}

F#:
type Transport =
    | Car of Make:string * Model:string
    | Bus of Route:int
    | Bicycle
    | Train of Line:int

let getThereVia (transport: Transport) =
    match transport with
    | Car (make, model) -> ...
    | Bus route -> ...
    | Bicycle -> ...

Warning FS0025: Incomplete pattern matches on this expression. For example, the value 'Train' may indicate a case not covered by the pattern(s)
Should test and make sure that what we’re building can actually handle that.
Systems should have more variance!
Part of how they detected the Volkswagen test fiasco: no variability in tests!
Not enough chaos!
You might not think you need chaos testing because you already do all these other kinds of testing.
But have you tested for timeouts? Have you tested your failover setups?
All the automated and manual testing can't necessarily find issues like CSS failing to load because of a timeout.
Not a good experience!
Just because you've tested your components, doesn't mean the interactions between them are solid
3-5 mins
Let's back up.
400-1000 microservices.
Right tool for the job company. Chaos testing is the right tool.
Azure:
- Storage queues for at-least-once delivery
- Service bus for at-most-once delivery, transactions, persistence, and ordering guarantees
- Kafka for event distribution
5-7 mins
More even distribution of complexity: easier to create & maintain services (very small amount of logic), harder to manage all services.
Shift complexity from biz logic to infrastructure. Better than one giant monolithic solution with 25 projects!
Break all the things!
Why is breaking things good? Just causes inconvenience. Ugh!
(upgrading things is bad because breaks things!)
What could possibly go wrong?
paraphrased from http://principlesofchaos.org/
Controlled burn to stave off proper forest fire.
Paraphrased from http://principlesofchaos.org/
1: The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior.
2: Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event.
3: Prod, to guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system.
4: Running experiments manually is labor-intensive and ultimately unsustainable.
10-15 mins
With chaos testing, outages happen when engineers are available, awake, and able to resolve them.
Engineers start designing for failure.
Prevents later outages & makes systems immune to failure.
Self service. Devs will WANT to test failure. (can I break chaos testing?? And have everything work??)
If availability matters, you should be testing for it.
This is a best practices move.
12-17 mins
Chaos engineering at Netflix:
Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.
Janitor Monkey searches for unused resources and disposes of them.
Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down.
There’s also WazMonkey for Azure: it can’t stop individual instances, only an entire cluster.
We use our in-house architecture + F#. Chaos should be part of the continuous integration cycle.
Startup. First goal, launch. Second, get everything right.
(much socialization.)
Takes a lot to convince devs that it’s not just okay, but valuable to break things at all. Much less in prod.
Netflix: can't test in QA because not "real system". Too expensive to create exact backup.
We have all service restarts logged in Splunk, plus tests in Jenkins; cross-referencing the two lets us check whether chaos caused failures.
Architecture diagram for all of this -> show parts that are breaking.
Setting as read-only ensures no corrupt data.
Having the chaos tooling already in place made this much easier to automate.
15-20 mins
Before we show code, let’s cover why we chose F#
The Jet story on why we chose F#.
F# for pricing engine because:
- less code leading to fewer bugs
- robust and safer code because of the type system.
Validation, logging, other cross-cutting concerns.
Using attributes to inject an extra function call before your functions is common in Web API, and other frameworks
We needed to handle cross-cutting concerns for services NOT based on HTTP. Very difficult to do in a generic way.
Once inputs and outputs are defined, code is just a series of transformations. Can compose or pipe the operations.
This helps with composing handlers and avoiding handler hell.
Subscriptions and handlers should be defined as a series of functions, outside the microservice. Functional! Using AsyncSeq.
OO Patterns: command pattern, visitor pattern are both just higher order functions.
HO fns often can more easily express a complex work flow.
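For instance, the command pattern collapses into a plain higher-order function: instead of an ICommand interface with an Execute() method, you just pass a function. This is an illustrative sketch; `runWithRetry` and its usage are made-up names, not from the talk's codebase.

```fsharp
// The command pattern as a higher-order function: the "command" is
// just a function value, and the invoker adds retry behavior around it.
let runWithRetry (attempts: int) (command: unit -> 'a) : 'a =
    let rec go n =
        try command ()
        with _ when n > 1 -> go (n - 1)   // retry while attempts remain
    go attempts

// usage: retries the "command" up to 3 times before giving up
let result = runWithRetry 3 (fun () -> 40 + 2)
```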
Now, why F# specifically?
Fewer bugs come from less code.
Rewrote into prod in 6 weeks!!
Option type is a simple discriminated union.
Both are idiomatic.
C# version still lacks structural equality.
Would need to override equality, comparison, AND GetHashCode.
Also not proper immutability because private set is still mutable.
-- there should be no setter, and the backing field should be readonly.
Pattern matching is like a switch statement, but WAY more powerful.
Option type is a simple discriminated union.
Both are idiomatic.
C# version still lacks structural equality (based on contents): that means overriding the equality implementation, the comparison implementation, AND GetHashCode. (generates a hash code from your object instance).
Also not proper immutability because private set is still mutable. To get real immutability, there should be no setter, and the backing field should be readonly.
Pattern matching is like a switch statement, but WAY more powerful.
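The structural-equality point can be seen directly: for an F# union, equality, comparison, and GetHashCode are all generated, and two separately constructed values compare by their contents.

```fsharp
// Structural equality for free on the Transport union from the slides:
// no overrides of Equals, comparison operators, or GetHashCode needed.
type Transport =
    | Car of Make:string * Model:string
    | Bus of Route:int
    | Bicycle

let a = Car ("Mini", "Cooper")
let b = Car ("Mini", "Cooper")
printfn "%b" (a = b)         // true: equal because the contents are equal
printfn "%b" (a = Bicycle)   // false
```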
Thor.
unit vs case vs pallet.
E.g.: TickSpec: StepDefinitions depends on TickSpec: Feature
Decreased complexity = more time to get details correct.
30-35 mins
ORM
Why microservices?
F#: data in -> data out; inputs -> outputs. A function.
A microservice = just a function that naturally maps to an f# script, with input/out via the network.
Also, naming: Best practices = naming services after the function they perform.
Bad: “sku service”, good: “import skus”
No explicit meeting where we decided to do microservices, one day we just realized we had a bunch of microservices
Decode >> Handle >> Interpret
This allows one composed function to be called to implement the service.
Being functional allows composition here.
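The Decode >> Handle >> Interpret shape can be sketched as a single composed function. This is a hedged illustration: the types and names below are invented for the example, not Jet's actual pipeline.

```fsharp
// Hedged sketch: a microservice body as Decode >> Handle >> Interpret.
type Message  = { Payload: string }
type Command  = DoWork of string
type Response = Ack of string

let decode (m: Message)   = DoWork m.Payload           // bytes/message -> command
let handle (DoWork s)     = sprintf "handled %s" s     // business logic
let interpret (r: string) = Ack r                      // result -> output message

// The whole service is just the composition of the three stages:
let service = decode >> handle >> interpret

printfn "%A" (service { Payload = "reprice sku" })
```

Because each stage is a plain function, cross-cutting concerns (logging, validation) can be composed in without touching the business logic.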
Within the chaos microservice handle:
`compute` is of type ComputeManagementClient: it provides the ability to work with most of the Azure compute landscape – hosted services, locations, virtual machines, and so on.
I’ve removed additional logging (e.g., checking for only one service role instance).
These bits of code, btw are all part of one microservice for chaos.
Just cloud services. Not VMs.
Add bits on torch maybe.
Async.ParallelIgnore:
- It doesn’t keep track of the results of the asyncs it executes; this way it can schedule an unbounded sequence of them.
- It allows us to place a bound on parallelism at a given point in time. This is useful because too much parallelism can create contention, reducing throughput and increasing latency.
- Finally, it starts each computation on the calling thread. This is faster in many scenarios, but if you still want to queue the work item, that is possible via Async.SwitchToThreadPool.
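A minimal sketch of what a bounded, result-discarding combinator like this can look like, using a semaphore to cap in-flight work. This is not Jet's implementation (for one thing, Async.Start below queues to the thread pool rather than starting on the calling thread); it only illustrates the bounded fire-and-forget idea.

```fsharp
open System.Threading

// Hedged sketch in the spirit of Async.ParallelIgnore: results are
// discarded, so the input sequence can be unbounded; at most
// `parallelism` computations are in flight at once.
let parallelIgnore (parallelism: int) (computations: seq<Async<unit>>) = async {
    let semaphore = new SemaphoreSlim(parallelism)
    for computation in computations do
        // Wait for a free slot, which bounds concurrent work...
        do! semaphore.WaitAsync() |> Async.AwaitTask
        // ...then fire and forget, releasing the slot on completion.
        async {
            try do! computation
            finally semaphore.Release() |> ignore }
        |> Async.Start }
```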
35-40 mins
- Manual testers noticed search not resulting in data.
Then Pricing team saw timeouts.
Then realized: Search team wasn’t handling those timeouts properly, began throwing “Operation Canceled” exceptions.
(Should have been caught at a high-level)
Then we found out: Elasticsearch was completely down, and the issues had cascaded downstream (from search to pricing back to search, etc.).
We mapped the Elasticsearch downtime to a chaos-testing restart of Elasticsearch and found out it was a single instance (so we had brought down the entire service).
It took days for DevOps to bring this instance back up, and it was then moved to a proper cluster in QA.
Chaos restarted Redis: we hadn’t enabled DNS for our Redis cluster (necessary to support automatic failover!)
We updated checkpointing to a new system, but never updated our failover scripts.
Tie back: if this (CSS, etc.) fails here, we find it and can fix it.
So basically, in conclusion:
40-45 mins
In conclusion:
Chaos testing should be part of your best practices.
It can help you find and fix errors in your systems before your customers encounter them.