COMMERCIAL IN CONFIDENCE Copyright 2018 FUJITSU LIMITED
Real-Time Web Server Log Analytics Using Apache Spark and Kafka
Ankit Gupta
Data, Big Data & Modern Big Data Approaches
CONCEPT | TRADITIONAL DATA | TRADITIONAL BIG DATA | MODERN BIG DATA (Spark)
Data Sources | Relational; Files; Message queues | Relational; Files; Message queues; Data service | Relational; Files; Message queues; Data service; NoSQL
Integration Analysis | Minimal | Medium | Faster time to market; Modeled by analytical transformations
Real-time | Minimal real time | Minimal real time | In real time or die
Data Access | Primarily batch | Batch | Micro-batch (Spark Streaming)
Open Source Technologies | Fully embraced | Minimal | TCO rules
Need for Real-Time Analytics
When referring to “analytics,” people often think of manipulating an existing set of structured data to yield insights. “Real-time analytics” takes this definition a step further: new data is constantly appended to the existing data set, and the updated data set is continuously re-analyzed for new insights. For analytics to be real-time, data must be ingested immediately upon creation and results delivered within seconds, so that those interpreting the data can react right away.
• The following use cases show why real-time analytics is critical to performance and user experience, and highlight the key capabilities that enable it in each layer of your system or application:
• The Application Layer
With your development team preparing for a big push to production, you’re worried about unforeseen issues immediately following the deployment. Testing in development will never provide an exact replica of what will happen in production. Therefore, the more you are able to view and monitor your logs in real time, the faster you will be able to address and rectify issues. While big issues may be easy to spot, real-time analytics can also help you identify small issues building over time that could eventually slow down your application and degrade the user experience. Batch-processed analytics can only give you a historical analysis of your system’s data, whereas real-time analytics lets you identify anomalous patterns in your data as they occur. A log analytics tool that offers “anomaly alerts” can help you identify early warning signs of larger issues.
Need for Real-Time Analytics …
• The Database Layer
Imagine that, over the course of several minutes, your popular e-commerce application hasn’t received any orders. Where’s the first place you’d look for a possible issue? You may first check whether your website is still reachable from a browser. Then you may check your server logs. Or perhaps you check your APM tool? Or a web analytics tool? Are they all saying the same thing? Or nothing at all? When you notice there aren’t any errors in your code and traffic to your website appears to have remained steady, you decide to investigate your database. Only then, after wasting time investigating other scenarios, do you see that your database was improperly configured in the last deployment and has reached its row limit. How many sales have you lost while guessing where to investigate? Without log-based, real-time analytics, database errors can go undiscovered, often noticed only after a period of conspicuous inactivity and investigation. With a real-time aggregated log analytics service, database errors stream into the same single view as the rest of your system’s log events as they occur. Alerts on database errors can be generated just as easily as alerts for the rest of your environment. Tools that offer custom tagging of specific event types can also help you spot database-specific errors as they occur.
• Server/Hosting Layer
Let’s say your mobile app was just featured on Product Hunt and you’re suddenly experiencing a spike in traffic. Luckily, your app runs in an auto-scaling environment and handles the load without issue. When the traffic later subsides and your servers scale back, you decide to analyze the distribution of 400 errors over time. But how will you access data from the servers that were scaled down? If you weren’t sending those log files to a central location in real time, that data is lost forever. In this scenario, centralizing your logs in real time is crucial to capturing all relevant data.
Use Case Model 1: Web Server Log Analysis / Potential Security Log Sources
Web server log analysis and statistics generator: we analyze the web server logs to compute the following statistics for further data analysis and to create reports and dashboards:
• Response counts by different HTTP response codes
• Response content size
• IP address of the clients to assess where the highest web traffic is coming from
• Top end point URLs to identify which services are accessed more than others
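The four statistics listed above can be sketched in plain Python. This is only an illustration, not the deck's Spark job: it assumes the logs are in Apache Common Log Format (the slides do not state the format), and the names `LOG_PATTERN`, `parse_line`, `compute_stats`, and the sample lines are made up for this example.

```python
import re
from collections import Counter

# Assumption: Apache Common Log Format. The slides do not name the log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; count it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def compute_stats(lines, top_n=3):
    """Compute response-code counts, content-size stats, top IPs and top endpoints."""
    records = [r for r in map(parse_line, lines) if r is not None]
    sizes = [r["size"] for r in records]
    return {
        "status_counts": Counter(r["status"] for r in records),
        "size_min": min(sizes) if sizes else 0,
        "size_max": max(sizes) if sizes else 0,
        "size_avg": sum(sizes) / len(sizes) if sizes else 0.0,
        "top_ips": Counter(r["ip"] for r in records).most_common(top_n),
        "top_endpoints": Counter(r["endpoint"] for r in records).most_common(top_n),
    }


# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jan/2018:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2018:10:00:01 +0000] "GET /api/orders HTTP/1.1" 500 0',
    '10.0.0.1 - - [01/Jan/2018:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [01/Jan/2018:10:00:03 +0000] "GET /missing HTTP/1.1" 404 -',
]

stats = compute_stats(SAMPLE_LINES)
```

Lines that do not match the pattern are skipped rather than raising an error, which is usually the safer choice when a stream may contain malformed records.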
Event type | Log patterns to look for
Successful user login | “Accepted password”, “Accepted publickey”, “session opened”
Failed user login | “authentication failure”, “failed password”
User log-off | “session closed”
User account change or deletion | “password changed”, “new user”, “delete user”
Sudo actions | “sudo: … COMMAND=…”, “FAILED su”
Service failure | “failed” or “failure”
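A minimal sketch of how the patterns in this checklist could be used to tag log lines by event type. The category names, the `classify` helper, and the sample lines are hypothetical. The ordering is a design assumption: the generic “failed”/“failure” patterns are checked last so that more specific events are not mislabeled as service failures.

```python
import re

# Patterns taken from the checklist above; matching is case-insensitive.
# Order matters: the first matching category wins, so the generic
# "failed|failure" rule comes last.
EVENT_PATTERNS = [
    ("successful_login", r"accepted password|accepted publickey|session opened"),
    ("failed_login", r"authentication failure|failed password"),
    ("logoff", r"session closed"),
    ("account_change", r"password changed|new user|delete user"),
    ("sudo_action", r"sudo:.*command=|failed su"),
    ("service_failure", r"failed|failure"),
]
COMPILED = [(name, re.compile(p, re.IGNORECASE)) for name, p in EVENT_PATTERNS]


def classify(line):
    """Return the first matching event type, or None if nothing matches."""
    for name, pattern in COMPILED:
        if pattern.search(line):
            return name
    return None


# Hypothetical sample lines, for illustration only.
SAMPLE_AUTH_LINES = [
    "sshd[1234]: Accepted password for alice from 10.0.0.5 port 22 ssh2",
    "sshd[1234]: Failed password for root from 10.0.0.9 port 22 ssh2",
    "sudo: alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/bin/ls",
    "pam_unix(sshd:session): session closed for user alice",
    "systemd[1]: nginx.service failed",
    "cron[99]: job finished",
]

labels = [classify(line) for line in SAMPLE_AUTH_LINES]
```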
Use Case Model 2: Checklist for Security on Windows
Look at both inbound and outbound activities.
Examples below show log excerpts from Cisco ASA logs; other devices have similar functionality.
Event type | Log patterns to look for
Traffic allowed on firewall | “Built … connection”, “access-list … permitted”
Traffic blocked on firewall | “access-list … denied”, “deny inbound”, “Deny … by”
Bytes transferred (large files?) | “Teardown TCP connection … duration … bytes …”
Bandwidth and protocol usage | “limit … exceeded”, “CPU utilization”
Detected attack activity | “attack from”
User account changes | “user added”, “user deleted”, “User priv level changed”
Administrator access | “AAA user …”, “User … locked out”, “login failed”
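As a rough illustration of the “Bytes transferred (large files?)” check, the sketch below pulls the duration and byte count out of lines shaped like the “Teardown TCP connection … duration … bytes …” pattern and flags large transfers. The regular expression is built only from that pattern. The sample lines, the helper name `large_transfers`, and the 1 MB threshold are assumptions made for this example, not real device output.

```python
import re

# Built from the checklist pattern: "Teardown TCP connection … duration … bytes …"
TEARDOWN = re.compile(
    r"Teardown TCP connection .*?duration (?P<duration>\S+) bytes (?P<bytes>\d+)"
)


def large_transfers(lines, threshold_bytes):
    """Return (duration, bytes) for teardown lines at or above the threshold."""
    hits = []
    for line in lines:
        match = TEARDOWN.search(line)
        if match and int(match.group("bytes")) >= threshold_bytes:
            hits.append((match.group("duration"), int(match.group("bytes"))))
    return hits


# Hypothetical sample lines, for illustration only.
SAMPLE_FIREWALL_LINES = [
    "Teardown TCP connection 1001 for outside:203.0.113.7/443 "
    "to inside:10.0.0.5/51000 duration 0:01:30 bytes 7340032",
    "Teardown TCP connection 1002 for outside:203.0.113.8/80 "
    "to inside:10.0.0.6/51001 duration 0:00:02 bytes 512",
    "Built outbound TCP connection 1003 for outside:203.0.113.9/443",
]

flagged = large_transfers(SAMPLE_FIREWALL_LINES, threshold_bytes=1_000_000)
```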
Use Case: Background
• We'll look at a web server log analytics use case to show how Spark Streaming can run analytics on continuously generated data streams, computing the following statistics for further data analysis and for reports and dashboards:
• IP addresses of the clients, to assess where the highest web traffic is coming from.
• Top endpoint URLs, to identify which services are accessed more than others.
• Streaming Data Analytics: Spark Streaming is an extension of the core Spark API that makes it easy to build fault-tolerant processing of real-time data streams. Streaming data is a continuous flow of data records generated by sources such as sensors, server traffic, and online searches. Examples include user activity on websites, monitoring data, server logs, and other event data. Streaming data processing applications support live dashboards, real-time online recommendations, and instant fraud detection.
Spark Streaming divides the live stream of data into batches (called micro-batches) of a predefined interval (N seconds) and treats each batch as a Resilient Distributed Dataset (RDD). We can then process these RDDs with operations such as map, reduce, reduceByKey, join, and window. The results of these RDD operations are returned in batches. We usually write these results to a data store for further analytics, to generate reports and dashboards, or to send event-based alerts.
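The micro-batch idea can be illustrated without Spark. The sketch below is a plain-Python simulation, not the Spark Streaming API: it groups timestamped events into fixed N-second batches and then applies a map step followed by a reduceByKey-style sum to each batch. The function names and the sample events are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Returns a list of (batch_start, [values]) sorted by batch_start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = (timestamp // interval_seconds) * interval_seconds
        batches[batch_start].append(value)
    return sorted(batches.items())


def reduce_by_key(pairs):
    """Sum the values for each key, in the spirit of reduceByKey with addition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_ips_per_batch(events, interval_seconds):
    """Count client IPs separately within each micro-batch."""
    results = []
    for batch_start, ips in micro_batches(events, interval_seconds):
        pairs = [(ip, 1) for ip in ips]                      # map step
        results.append((batch_start, reduce_by_key(pairs)))  # reduce-by-key step
    return results


# Hypothetical events: (seconds since start, client IP).
EVENTS = [
    (0, "10.0.0.1"), (1, "10.0.0.2"), (4, "10.0.0.1"),
    (5, "10.0.0.1"), (9, "10.0.0.3"), (12, "10.0.0.2"),
]

per_batch = count_ips_per_batch(EVENTS, interval_seconds=5)
```

One simplification to keep in mind: this sketch emits a batch only when at least one event falls into the interval, so intervals with no events produce no entry.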
Kafka-Spark Streaming Architecture
[Figure: architecture diagram. Labels: DATA PRODUCER; INGESTION-LAYER, AGGREGATION-LAYER, ANALYSIS-LAYER, STORAGE-LAYER]
Kafka Mechanism
Applications called producers send messages (records) to a Kafka node (broker), and other applications called consumers process those messages. Messages are stored in a topic, and consumers subscribe to the topic to receive new messages.
Apache Kafka is a distributed streaming platform that lets you:
• Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
• Store streams of records in a fault-tolerant, durable way.
• Process streams of records as they occur.
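To make the producer, topic, and consumer roles concrete, here is a toy in-memory model. It is not the Kafka client API and leaves out brokers, partitions, and replication. The `Topic` class and its methods are hypothetical names invented for this sketch. What it does show is that records are appended to a log and each consumer reads from its own position, so one consumer reading a record does not remove it for others.

```python
class Topic:
    """A toy, in-memory stand-in for a single-partition topic (not Kafka itself)."""

    def __init__(self, name):
        self.name = name
        self._log = []      # append-only list of records
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        """Producer side: append a record and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def subscribe(self, consumer, from_beginning=True):
        """Register a consumer, starting at the oldest or the next new record."""
        self._offsets[consumer] = 0 if from_beginning else len(self._log)

    def poll(self, consumer, max_records=10):
        """Consumer side: return unread records and advance this consumer's offset."""
        start = self._offsets[consumer]
        records = self._log[start:start + max_records]
        self._offsets[consumer] = start + len(records)
        return records


topic = Topic("weblogs")
topic.publish("GET /a")
topic.subscribe("dashboard")                     # reads from the beginning
topic.subscribe("alerts", from_beginning=False)  # reads only new records
topic.publish("GET /b")

dashboard_records = topic.poll("dashboard")
alert_records = topic.poll("alerts")
```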
Spark Mechanism
The main() method of the program runs in the driver. The driver is the process that runs the user code (the driver program), which creates the SparkContext, creates RDDs, and performs transformations and actions.
The driver program splits the Spark application into tasks and schedules them to run on executors. The task scheduler resides in the driver and distributes tasks among the workers. The two key roles of the driver are:
-> Converting the user program into tasks.
-> Scheduling tasks on the executors.
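The two driver roles can be mimicked with Python's standard library. This is an analogy rather than Spark: a thread pool stands in for the executors, and the helper names are hypothetical. The "driver" code splits the input into partitions (one task each), schedules the tasks on the pool, and merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Driver role 1: turn the job into independent tasks, one per partition."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_status_codes(partition):
    """The task each worker runs: count status codes in its own partition."""
    counts = {}
    for status in partition:
        counts[status] = counts.get(status, 0) + 1
    return counts


def run_job(records, num_executors=2):
    partitions = split_into_partitions(records, num_executors)
    # Driver role 2: schedule the tasks on a pool of workers.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(count_status_codes, partitions))
    # Merge the partial results back in the "driver".
    merged = {}
    for partial in partial_results:
        for status, count in partial.items():
            merged[status] = merged.get(status, 0) + count
    return merged


job_result = run_job([200, 200, 404, 500, 200, 404], num_executors=3)
```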
Technologies Used
• Zookeeper
• Apache Kafka
• Kafka Clients- Producer/Consumer
• Kafka Connect
• Apache Spark Streaming
• Scala
• Power BI – Visualization
Environment- Cloudera 6-Node Cluster
Kafka Producer Configuration
Clickstream Data Generated from Weblog server
Submit jar file in Client mode to the Spark cluster
Visualization of Spark Streaming Applications
The first visualization is the DAG (Directed Acyclic Graph)
Processed DStream Batch
Statistics during the execution
While the data stream is being sent to Kafka and processed by the Spark Streaming consumer, execution statistics are displayed, including the input rate (the number of events per second) and the processing time in milliseconds.
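The two statistics named above are simple to compute for a single batch. The sketch below is a hypothetical helper, not how Spark gathers its metrics: it derives the input rate from the record count and the batch interval, and measures processing time around an arbitrary processing function. The `keeping_up` flag is an added convenience based on the usual rule of thumb that a streaming job is stable only if each batch is processed within the batch interval.

```python
import time


def timed_batch(records, batch_interval_seconds, process):
    """Run `process` on one batch and report input rate and processing time."""
    start = time.perf_counter()
    process(records)
    elapsed = time.perf_counter() - start
    return {
        # Events received per second over the batch interval.
        "input_rate": len(records) / batch_interval_seconds,
        # Wall-clock time spent processing this batch, in milliseconds.
        "processing_time_ms": elapsed * 1000.0,
        # True if the batch finished within its interval.
        "keeping_up": elapsed <= batch_interval_seconds,
    }


# Hypothetical batch: 500 records received over a 5-second interval.
batch_stats = timed_batch(list(range(500)), batch_interval_seconds=5, process=sum)
```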
Output Stored in HDFS
Dashboard – ClickStream Analytics on PowerBI
Apache Spark Streaming -Real time web server log analytics
