Most of the time, enterprises focus on optimizing Mean Time to Resolution (MTTR) over Mean Time to Detect (MTTD). However, in a distributed, resilient microservices architecture, the mean time to detect an incident is crucial to keeping the environment stable. As many organizations move from a monolithic architecture to a distributed Microservices Architecture along with cloud-native adoption, these environments include thousands of components interacting in complex, rapidly changing deployments over multiple tiers. Therefore, a large number of events, metrics, and data are produced at each and every node. Today’s dynamic cloud-native environments use multiple different technologies and aggregated tools. Currently, two techniques are utilized for monitoring in the industry: one is known as observability and the other is surveillance. In this session, Vanji will review how to employ observability and surveillance technologies to reduce MTTD in a cloud-native environment.
5. Monolithic Application (continued)
1 Regardless of logical modularity, the application is packaged as a single monolith.
2 Packaging depends on the language … .war, .jar or directory structure
3 Simple to test and deploy
4 If simple, what is the issue? Simple and easy only at the beginning
6. Many Pain Points!!!!
● Increasingly difficult to make code changes
● Disrupts agile development
● Over time, no single developer will understand the entire code. Changes will be error-prone
● CI/CD would become painful
● Scaling would be difficult
● An issue in one component could potentially bring down the entire application
● Stuck with a single language
Problems with monolithic applications
9. Microservices Architecture pattern
1 An application written as small interconnected services, each implementing distinct functionality
2 Self-contained; each maintains its own datastores
3 Each service may expose a REST API, and most services use other services
4 Services may also use other inter-process communication methods to interact, such as queues
11. Many Pain Points!!!!
● What is the size limit of a microservice?
● Inherent complexity of distributed systems: handling transactions (partial failures)
● Multiple databases
● Need for advanced technology (service mesh, service discovery, circuit breaker,
containers, orchestration etc)
Drawbacks of the Microservices Architecture
14. How Do an Amateur and a Pro Diagnose Issues?
https://picture-funny.blogspot.com/2006/09/mechanical-genius.html
master diagnostic technician Kurt Juergens, of Foxborough
15. Optimize MTTD - Mean Time to Detect
Observability vs Surveillance
I am Vanji from California, and I work as a Solutions Engineer with WSO2, an open source integration company. During my free time, I volunteer with the Apache Synapse project as a project committee member and a committer.
In this talk, I will mainly cover Microservices Architecture and how to optimize the mean time to detect, better known as MTTD.
As a very first step, let's quickly review what an API is.
For example, I have illustrated a POS application's functionality to show very simple monolithic application behaviour. Functionalities are packaged together as one single package, exposed and interconnected to external systems.
Regardless of logical modularity, the application is packaged as a single monolith.
Packaging depends on the language … .war, .jar or directory structure
Simple to test and deploy
If simple, what is the issue? Simple and easy only at the beginning
Many Pain Points!!!!
Increasingly difficult to make code changes
Disrupts agile development
Over time, no single developer will understand the entire code. Changes will be error-prone
CI/CD would become painful
Scaling would be difficult
An issue in one component could potentially bring down the entire application
Stuck with a single language
Since code size grows over time, it becomes increasingly difficult for a single developer to understand the entire code. Hence, it becomes difficult to make a change, which disrupts continuous integration (CI) and continuous deployment (CD), say weekly.
How would a CPU-intensive component make use of special hardware compared to a memory-intensive component? You will have to make compromises.
In this illustration, I have transformed the previously discussed monolithic POS application into a Microservices Architecture.
As you can see, now each and every functionality is broken into different, independently deployable modules, and communication between the modules is now governed by standard protocols; for example, here it is RESTful APIs.
So, to summarize: an application written as small interconnected services, each implementing distinct functionality.
Self-contained; each maintains its own datastores.
Each service may expose a REST API, and most services use other services.
Services may also use other inter-process communication methods to interact, such as queues.
Further,
A microservice could implement a web UI and not expose a REST API
A microservice doesn’t necessarily expose an API, but often it does
A microservice could communicate via other means like a queue for IPC
A Web UI service may invoke other services that have REST APIs, or use other means like asynchronous message-based communication, to render the page.
Separate databases ensure loose coupling but may require duplicating data.
Faster and focused development
Easy deployment and thus easy CI/CD: fewer dependencies
Demand based scalability and flexibility
Reduced downtime due to modularity
Applications become cloud native
“Microservice” is a confusing term, as some people advocate writing small services of under 300 lines of code. But the microservice idea is to sufficiently decompose a monolith so as to gain its advantages.
By distributing the system, IPC mechanisms need to be brought in.
Multiple databases need to be updated for consistency.
In cloud-native, container-based microservices, hard-coding an endpoint is not an option; thus, service discovery is needed.
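The service-discovery idea above can be sketched in a few lines. This is a minimal, hypothetical in-memory registry (the `ServiceRegistry` name and API are assumptions, not any real product); in practice this role is played by tools like Consul, etcd, or Kubernetes DNS.

```python
# Minimal sketch of client-side service discovery, assuming a simple
# in-memory registry. The ServiceRegistry class is hypothetical; real
# deployments use Consul, etcd, or Kubernetes DNS instead.
import random

class ServiceRegistry:
    """Maps a logical service name to the live instances behind it."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" strings

    def register(self, name, endpoint):
        self._instances.setdefault(name, []).append(endpoint)

    def deregister(self, name, endpoint):
        self._instances.get(name, []).remove(endpoint)

    def resolve(self, name):
        # Pick one registered instance at random (naive load balancing);
        # callers never hard-code an endpoint.
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no instances registered for {name!r}")
        return random.choice(instances)

registry = ServiceRegistry()
registry.register("payment-service", "10.0.0.5:8080")
registry.register("payment-service", "10.0.0.6:8080")
endpoint = registry.resolve("payment-service")  # one of the two instances
```

When an instance scales out or dies, only the registry changes; callers keep resolving by name.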
Mean Time to Recovery vs. Mean Time to Detect
These concepts are widely used in the tech industry regardless of monolithic or microservices architecture. First of all, we should understand the significance of the difference between MTTR and MTTD.
MTTD deals with how quickly it is possible to identify an issue and minimize the occurrence of incidents.
MTTR deals with how to recover quickly after an incident or issue has happened. By then, as you see in the picture, the damage has already occurred.
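To make the two averages concrete, here is a small worked sketch computing MTTD and MTTR from incident records. The record fields (`started`, `detected`, `resolved`) are hypothetical, and the timestamps are plain minutes for simplicity.

```python
# Sketch: computing MTTD and MTTR from incident records. Field names are
# hypothetical; times are minutes since the incident began, for brevity.
incidents = [
    {"started": 0, "detected": 12, "resolved": 45},
    {"started": 0, "detected": 4,  "resolved": 30},
    {"started": 0, "detected": 8,  "resolved": 60},
]

# MTTD: average time from the moment the failure starts until it is noticed.
mttd = sum(i["detected"] - i["started"] for i in incidents) / len(incidents)

# MTTR: average time from detection until full recovery.
mttr = sum(i["resolved"] - i["detected"] for i in incidents) / len(incidents)

print(f"MTTD = {mttd:.1f} min, MTTR = {mttr:.1f} min")
# MTTD = 8.0 min, MTTR = 37.0 min
```

Note how shrinking the `detected` timestamps lowers MTTD directly, while everything this talk covers about recovery only moves MTTR.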
With organizations moving from a monolithic architecture to a distributed Microservices Architecture along with cloud-native adoption, these environments include thousands of components interacting in complex, rapidly changing deployments over multiple tiers. Therefore, a large number of events, metrics, and data are produced at each and every node. Today’s dynamic cloud-native environments use multiple different technologies and aggregated tools. Currently, two techniques are utilized for monitoring in the industry: one is known as observability and the other is surveillance. If there is no proper system for monitoring or observing the ecosystem, organizations will never be able to quickly detect or resolve damaging problems. Hence, going forward we will cover the tooling, monitoring tools, KPIs, alerting mechanisms, and observability techniques that significantly reduce MTTD.
Observability is the critical pillar for reducing MTTD.
By definition, observability is collecting diagnostics data across all the stacks to identify and debug production problems, and also to provide critical signals about usage to enable a highly adaptive and scalable system. Observability is primarily driven by six different dimensions to understand the environment:
Monitoring
Log aggregation
Tracing
Visualization
Alerting
People
A fundamental aspect of monitoring is to collect, process, aggregate, and display real-time quantitative data about an ecosystem, measuring metrics at three levels: network, machine, and application. Such monitoring produces error counts and types, processing times, memory usage, and server lifetimes.
For example, monitoring the performance of a given JVM-based application can simply be done using JConsole, collecting metrics on CPU usage, memory usage, the number of threads running, etc.
However, for larger enterprises with distributed applications, it is not feasible to just target monitoring a single JVM or machine. Instead, Application Performance Monitoring (APM) tools should be in place to facilitate the monitoring of multiple functional dimensions. For example, Dynatrace, AppDynamics, New Relic, Datadog, and Apache SkyWalking are full-fledged monitoring and analytics providers that enable APM.
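The APM products named above are full platforms, but the raw data they work from is simple. This sketch shows the application-level side of metric collection: counting errors and timing requests. `MetricStore` and its methods are invented names for illustration, not any vendor's API.

```python
# Sketch of application-level metric collection: the kind of raw data an
# APM agent exports. MetricStore and its method names are hypothetical.
import time
from collections import defaultdict

class MetricStore:
    def __init__(self):
        self.counters = defaultdict(int)   # e.g. error counts by type
        self.timings = defaultdict(list)   # request durations in seconds

    def incr(self, name):
        self.counters[name] += 1

    def observe(self, name, seconds):
        self.timings[name].append(seconds)

    def avg(self, name):
        samples = self.timings[name]
        return sum(samples) / len(samples) if samples else 0.0

metrics = MetricStore()

def handle_request():
    start = time.perf_counter()
    try:
        pass  # ... real request handling would go here ...
    except Exception:
        metrics.incr("errors.unhandled")
        raise
    finally:
        # Every request is timed, success or failure.
        metrics.observe("request.duration", time.perf_counter() - start)

handle_request()
print(f"avg duration: {metrics.avg('request.duration'):.6f}s")
```

A real agent would additionally flush these counters and timings to a collector on an interval.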
Traditionally, monolithic applications employed logging frameworks to provide insight into what happened when something failed in the system. To understand or recognize the failure, looking at the log statements with the correct timestamp and context is usually enough, and most of the information will be revealed if the logs are correctly defined during development. However, in a distributed microservices architecture, having logs alone is not enough to understand and see the big picture.
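One common way to give distributed logs the missing cross-service context is to stamp every line with a correlation ID. The sketch below does this with Python's standard `logging` module; the `correlation_id` field name is an assumption, not a standard.

```python
# Sketch: stamping every log line with a correlation ID so that lines
# from different services can later be joined into one request's story.
# The "correlation_id" field name is an assumption, not a standard.
import io
import logging
import uuid

correlation_id = str(uuid.uuid4())  # normally taken from an incoming header

stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [cid=%(correlation_id)s] %(message)s"))

logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# LoggerAdapter injects the extra field into every record automatically.
log = logging.LoggerAdapter(logger, {"correlation_id": correlation_id})
log.info("order received")
log.warning("inventory low")

print(stream.getvalue())  # both lines carry the same cid=... tag
```

If every service propagates and logs the same ID, a later search for that ID reconstructs the whole request across the system.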
Tracing can be easily understood with the analogy of a medical angiogram. An angiogram is a technique used to find blockages in the heart by injecting an X-ray-sensitive dye, which makes blockage detection possible through dynamic X-ray snapshots as the dye moves through the blood vessels. Detecting bottlenecks in this manner allows the necessary action to be taken to fix the issue, rather than searching everywhere or replacing the entire heart.
Likewise, tracing is heavily utilized in distributed software ecosystems to profile and monitor communication or transactions between multiple systems, including networks and applications. Further, tracing also helps to understand the flow between services, with an overview of application-level transparency. Zipkin, Jaeger, Instana, Datadog, Apache SkyWalking, and Appdash are a few examples of distributed tracing tools that support the OpenTracing specification.
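The core mechanism behind those tools, spans that share a trace ID as a request flows through operations, can be hand-rolled in miniature. This is a toy sketch, not the OpenTracing or OpenTelemetry API; `Span` and `finished_spans` are invented names.

```python
# Hand-rolled miniature of tracing-span propagation. This is a toy, not
# the OpenTracing/OpenTelemetry API; real spans are shipped to a
# collector such as Zipkin or Jaeger.
import contextvars
import time
import uuid

current_trace = contextvars.ContextVar("current_trace", default=None)
finished_spans = []  # a real tracer would export these to a collector

class Span:
    """Context manager recording one timed operation under a shared trace id."""

    def __init__(self, operation):
        self.operation = operation

    def __enter__(self):
        trace_id = current_trace.get()
        if trace_id is None:          # root span starts a new trace
            trace_id = uuid.uuid4().hex
        self.trace_id = trace_id
        self.token = current_trace.set(trace_id)
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        finished_spans.append({
            "trace": self.trace_id,
            "op": self.operation,
            "duration": time.perf_counter() - self.start,
        })
        current_trace.reset(self.token)

with Span("checkout"):             # root span
    with Span("charge-card"):      # children inherit the same trace id
        pass
    with Span("reserve-stock"):
        pass

print(finished_spans)  # three spans, all with one shared trace id
```

Grouping the finished spans by trace ID and sorting by time reconstructs the request's path, which is exactly the "angiogram" view a tracing UI renders.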
There are endless different varieties of logs like application logs, security logs, audit logs, access logs and more. In a single application, the complexity of all these logs is manageable. However, in a distributed architecture, there are many applications or services working together to complete a single business functionality.
For example, ordering a pizza involves checking the store availability, making the payment, placing the order, fulfilling the order, enabling tracking, shipping schedule placement, and many other activities.
In the event of an error in such a complex transaction, tracing may pinpoint the location to search for the root cause.
However, if the application-centric logs are distributed across the different components, it will be a nightmare to find the exact issue, and the time taken to find the relevant logs could make the situation more critical.
Therefore, having a centralized location to collect and index all the logs that belong to the enterprise is critical to make detecting the exact location of the issue more efficient.
Currently, there are multiple tools in the market to achieve log aggregation; mainly Splunk, Sumo Logic, Elastic, and Graylog play important roles in the log aggregation market.
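At its core, log aggregation means merging per-service log streams into one time-ordered view, which those products do at scale after shipping and indexing. A stdlib sketch of just the merge step, with invented log entries and plain integer timestamps for brevity:

```python
# Sketch of the core of log aggregation: merging already-sorted
# per-service log streams into one chronological view. Entries and
# integer timestamps are invented for brevity.
import heapq

payment_log = [(2, "payment", "card charged"), (9, "payment", "receipt sent")]
order_log = [(1, "order", "order placed"), (5, "order", "order confirmed")]
shipping_log = [(7, "shipping", "label printed")]

# heapq.merge lazily interleaves sorted streams; tuples compare by their
# first element, so the result is ordered by timestamp.
merged = list(heapq.merge(payment_log, order_log, shipping_log))

for ts, service, message in merged:
    print(f"t={ts} [{service}] {message}")
```

With all services feeding one indexed store like this, "what happened around t=5?" becomes a single query instead of a hunt across hosts.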
There are tools that collect data, logs, or metrics in a centralized location. However, if the collected data and logs do not provide any meaningful information, they will not be useful. Most APM tools and log aggregators provide data visualizations to depict a holistic view based on the criteria provided. For example, the host with the highest number of error messages can be identified easily with visualization.
Another epic example is correlating two different errors that happened on separate hosts or applications and these can be created using time series aggregation charts.
The Visualization of the data is not only restricted to Errors and Exceptions but also it can be used to understand the behavioral monitoring of the application users. For example, if a user over-uses an API, data visualization could help to detect abusive behavior.
Searching logs and data can be helpful to speed up the debugging process and resolve issues.
But in reality, manually monitoring the visualizations to detect incidents is not practical.
Hence, creating automated alerts is critical. Common scenarios that require alerts include the sudden failure of one or more APIs, a sudden increase in the response time of one or more APIs, and a change in the pattern of API resource access. These alerts can result in an email, a phone call, an instant message, or a PagerDuty page. The important aspect of alerting is that, when a predefined condition is met or violated, the necessary stakeholders are informed with the right amount of information rather than too much data.
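A minimal version of such an alert is a threshold on the error rate over a sliding window. In this sketch the window size, threshold, and `notify` function are all assumptions standing in for real alerting configuration and channels.

```python
# Sketch of a threshold alert: fire when the error rate over a sliding
# window exceeds a limit. WINDOW, THRESHOLD, and notify() are invented
# stand-ins for real alerting config and channels (email, PagerDuty, ...).
from collections import deque

WINDOW = 10          # look at the last 10 requests
THRESHOLD = 0.3      # alert above a 30% error rate

recent = deque(maxlen=WINDOW)
alerts = []

def notify(message):
    # Stand-in for the real channel: email, phone call, IM, PagerDuty.
    alerts.append(message)

def record_request(ok):
    recent.append(0 if ok else 1)
    rate = sum(recent) / len(recent)
    # Only judge once the window is full, to avoid noisy startup alerts.
    if len(recent) == WINDOW and rate > THRESHOLD:
        notify(f"error rate {rate:.0%} over last {WINDOW} requests")

for ok in [True] * 6 + [False] * 4:   # 40% failures once the window fills
    record_request(ok)

print(alerts)
```

Real systems add deduplication and cool-down periods so one sustained incident produces one page, not hundreds.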
Collecting data in a random manner, with different views of the same random data, does not really reveal anything at all. Real-world surveillance is used to monitor activities by police or security, and later may be used as evidence of crimes or other facts.
Likewise, surveillance is used to enforce targeted observation of the system to make sure that functionality and performance are not violating the intended behavior.
As an example, let’s take applications that handle real-time traffic or process a high payload and tend to be memory intensive. The probability of such an application consuming too much memory is high, and if it is not properly designed and developed to handle this, it may use up too much memory and crash. Detecting these leaks or abnormal memory usage is critical to uninterrupted service.
Optimizing the infrastructure to minimize the mean time to detect incidents in a microservices architecture ensures that an organization has established appropriate, systematic techniques to employ observability and surveillance technologies effectively, finding incidents right away and keeping the system stable.