2. Agenda
• Fault Injection testing – why we need it?
• Fault Scenarios and the fallacies
• DiFIT – High level architecture
• DiFIT – Tech Stack
• Q&A
3. Fault Injection testing – why we need it?
• Distributed systems are unreliable!
The 8 fallacies of distributed computing - James Gosling
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Application fault examples
1. Target service instance(s) or cluster down
2. High Availability scenarios for infrastructure pieces (Best Effort, NSPOF, Session Failover)
3. Impact of one-off resource-intensive operations like large report generation, garbage collection, crons
System fault examples
1. Network timeout
2. Disk full
3. FD (file descriptors) reaching limits
4. Network interface is down
4. How to test for faults?
[Diagram: Service 1 and Service 2, with a failure (✗) marked between them]
Challenges in manual testing:
• Know the commands
• How to test operations like bringing down the network?
• Repeatability
Wouldn't it be easier if we had something like:
/v0.1/services/service2/stop [PUT]
Payload: {"service_port":80,"host":"192.168.1.50","forceful":0}
Response:
204 OK
404, SERVICE_NOT_FOUND
400, COULD_NOT_STOP_SERVICE
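A minimal sketch of how a test might prepare the stop call shown on the slide, using only the Python standard library. The path, payload fields, and response codes come from the slide; the controller base URL and the helper names `build_stop_request` and `describe_outcome` are assumptions made for illustration and are not part of DiFIT.

```python
import json
import urllib.request

# Assumption: the controller's base URL is not given on the slide.
BASE_URL = "http://difit-controller.example:8080"

# Status codes and error names as listed on the slide.
OUTCOMES = {
    204: "stopped",
    404: "SERVICE_NOT_FOUND",
    400: "COULD_NOT_STOP_SERVICE",
}


def build_stop_request(service, host, port, forceful=0, base_url=BASE_URL):
    """Build (but do not send) the PUT request from the slide."""
    payload = {"service_port": port, "host": host, "forceful": forceful}
    return urllib.request.Request(
        url=f"{base_url}/v0.1/services/{service}/stop",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def describe_outcome(status):
    """Map an HTTP status code to the outcome named on the slide."""
    return OUTCOMES.get(status, f"unexpected status {status}")


request = build_stop_request("service2", "192.168.1.50", 80)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually send it, a test could pass `request` to `urllib.request.urlopen`; 4xx responses surface as `urllib.error.HTTPError`, whose `code` attribute can be fed to `describe_outcome`.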
5. DiFIT – A typical test flow
[Overlaid slide titles: How to break? · An example · The Complete Picture]
[Diagram labels: X-Unit, HTTP Request, RESTful Controller, UI Controller, Backend, DiFIT Agent; Website, OMS, Fulfillment, Logistics, Supply Chain, Message Queue, Retry Queue; failure points marked ✗]
9. The best way to avoid failure is to fail constantly.
Editor's notes
Question: How many of you do adverse scenario testing? Every web app is now moving towards distributed deployment, and it has become increasingly important to test adverse scenarios in addition to regular testing. Which brings us to the agenda of today's talk. We start with the fallacies of distributed applications: assumptions that are nowhere close to reality. Then an overview of the various adverse scenarios. Then the architecture of the framework, which adds repeatability and automation to adverse scenario testing.
Overview: SCP: 20 distributed services plus core infrastructure pieces like MQ, cache, databases. Give a brief overview of Flipkart SCM. Around 20 different services, including application and infra pieces. Around 75 lakh to 1 crore (7.5 to 10 million) messages processed every day. Business shuts down if MQ doesn't work properly. Extremely challenging to test.
• Find out how to do the required operation
• ssh into the system
• Run the command
• Wait for the server to stop
• Do the operation
• Verify the behavior
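The "wait for the server to stop" step is the one most often done by eye in a manual run. A rough sketch of how it could be automated is to poll the service's TCP port until connections are refused. The helper names `is_port_open` and `wait_until_down`, the timeouts, and the use of a TCP connect as the liveness check are all assumptions for illustration; the notes do not say how DiFIT detects that a service has stopped.

```python
import socket
import time


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_down(host, port, deadline_s=10.0, poll_s=0.2):
    """Poll until host:port stops accepting connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if not is_port_open(host, port):
            return True
        time.sleep(poll_s)
    return False


# Local demonstration: a throwaway listener stands in for the target service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
host, port = listener.getsockname()

was_up = is_port_open(host, port)
listener.close()  # simulates the service being stopped
went_down = wait_until_down(host, port, deadline_s=5.0)
print(was_up, went_down)
```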
Start: Explain the flow (stripped-down version). Different points of failure: DB failures and MQ failures, including master down, slave down, etc. Network failures: timeout, connection refused. Application down. Work with the example of an app failure, say Fulfillment. The retry queue comes into the picture: the message will go to retry, and the test needs to verify this. How to verify? Manually bring down FF (Fulfillment) and test. What if there were an agent that allowed us to do this remotely through a controller? What is even better? Language-agnostic REST APIs, which can be called from any xUnit framework.
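The Fulfillment example can be sketched as an xUnit-style test. Everything below is an in-memory simulation: `FakeController` and `Pipeline` are hypothetical stand-ins for a fault-injection controller and the message flow, not DiFIT classes, and the re-delivery step after restart is an assumption about how the retry queue behaves.

```python
class FakeController:
    """Hypothetical in-memory stand-in for a fault-injection controller."""

    def __init__(self):
        self.down = set()

    def stop(self, service):
        self.down.add(service)

    def start(self, service):
        self.down.discard(service)

    def is_up(self, service):
        return service not in self.down


class Pipeline:
    """Toy model of the flow: deliver to fulfillment, else park in the retry queue."""

    def __init__(self, controller):
        self.controller = controller
        self.delivered = []
        self.retry_queue = []

    def send(self, message):
        if self.controller.is_up("fulfillment"):
            self.delivered.append(message)
        else:
            self.retry_queue.append(message)

    def drain_retries(self):
        # Re-send everything that was parked while the service was down.
        pending, self.retry_queue = self.retry_queue, []
        for message in pending:
            self.send(message)


def test_message_goes_to_retry_when_fulfillment_is_down():
    controller = FakeController()
    pipeline = Pipeline(controller)

    controller.stop("fulfillment")  # inject the fault
    try:
        pipeline.send("order-1")
        assert pipeline.retry_queue == ["order-1"]
        assert pipeline.delivered == []
    finally:
        controller.start("fulfillment")  # always restore the service

    pipeline.drain_retries()
    assert pipeline.delivered == ["order-1"]
    assert pipeline.retry_queue == []
    return pipeline


result = test_message_goes_to_retry_when_fulfillment_is_down()
```

The `try`/`finally` around the fault keeps the environment recoverable even when an assertion fails, which is what makes such a test repeatable.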
Talk about STAF. It enables command execution on remote machines. Peer-to-peer software with a very small memory footprint; lightweight. Talk about the various libraries and how they are written on top of STAF. Talk about the pluggability of the agent. The DiFIT APIs provide an abstraction over DiFIT commands that can be executed on remote machines, and ship as a jar that can be used directly. The DiFIT REST interface is written using Dropwizard, which glues together Jetty, Jersey, Jackson, Hibernate, etc. and provides an easy configuration mechanism. Talk a bit about the discovery feature and how it helps to find out which operations can be performed.