NLP Techniques for Log Analysis
Jacob Perkins, CTO @ Insight Engines
● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
1. Tokenization
2. Feature Extraction
3. Classification
4. Clustering
5. Anomaly Detection
Topics
• Split text into tokens
• Many options beyond whitespace
• Works on any arbitrary text
• NLTK has many tokenizers
Tokenization
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
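To make "many options beyond whitespace" concrete, here is a minimal sketch that tokenizes the sample record three ways using only the standard library. The second pattern is the same regular expression NLTK's WordPunctTokenizer is built on; the third, LOG_TOKEN_RE, is a hypothetical log-aware pattern invented for this illustration and is not taken from the talk.

```python
import re

# The sample record from the slide.
LOG = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
    "39480627 wksh: HANDLING TELNET CALL (User: root, "
    "Branch: ABCDE, Client: 10101) pid=9644"
)

def whitespace_tokens(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()

def wordpunct_tokens(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

# Hypothetical pattern (an assumption for this sketch): keep clock times,
# slash dates and key=value pairs together, split everything else.
LOG_TOKEN_RE = re.compile(
    r"\d{2}:\d{2}:\d{2}"   # clock time such as 19:18:40
    r"|\d{2}/\d{2}/\d{2}"  # date such as 04/30/10
    r"|\w+=\w+"            # key=value such as pid=9644
    r"|\w+"                # any other word or number
    r"|[^\w\s]"            # any single punctuation character
)

def log_tokens(text):
    """Tokenize with the log-aware pattern above."""
    return LOG_TOKEN_RE.findall(text)

for tokenize in (whitespace_tokens, wordpunct_tokens, log_tokens):
    print(tokenize.__name__, tokenize(LOG))
```

Whitespace splitting leaves tokens such as "(User:" and "10101)" with punctuation attached, while the word/punctuation split breaks timestamps into pieces; a pattern tuned to the log format sits between the two.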
• Edit distance (a.k.a. Levenshtein distance)
• Fuzzywuzzy
• Can use to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
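The edit distance in the Google vs Go0gle example can be computed with the classic dynamic-programming recurrence. This is a plain-Python sketch rather than the Fuzzywuzzy API; the helper name similar and its max_distance parameter are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similar(a, b, max_distance=1):
    """Hypothetical helper: treat strings within max_distance as similar."""
    return edit_distance(a, b) <= max_distance

print(edit_distance("Google", "Go0gle"))
```

A small max_distance is useful for spotting look-alike hostnames or usernames; larger values quickly start matching unrelated strings.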
• Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
• Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
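As a sketch of the regex and date parsing items, the following pulls named fields out of the sample record. It uses datetime.strptime from the standard library as a stand-in for dateutil, and it assumes the embedded date 04/30/10 is month/day/year. The field names (app_time, session, program and so on) are guesses made for this illustration; the slides do not name them.

```python
import re
from datetime import datetime

# Field names are assumptions for this sketch, not part of the log format.
RECORD_RE = re.compile(
    r"(?P<app_time>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<session>\d+) (?P<program>\w+): (?P<message>[A-Z ]+) "
    r"\(User: (?P<user>\w+), Branch: (?P<branch>\w+), "
    r"Client: (?P<client>\d+)\) pid=(?P<pid>\d+)"
)

def parse_record(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = RECORD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumes month/day/two-digit-year order.
    fields["app_time"] = datetime.strptime(fields["app_time"], "%m/%d/%y %H:%M:%S")
    fields["pid"] = int(fields["pid"])
    return fields

line = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
    "39480627 wksh: HANDLING TELNET CALL (User: root, "
    "Branch: ABCDE, Client: 10101) pid=9644"
)
print(parse_record(line))
```

A hand-written pattern like this only covers one record type, which is the motivation for grammars and automatic log parsers when the formats are many or unknown.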
● Bigram: (acmepayroll, syslog)
● Trigram: (HANDLING, TELNET, CALL)
● Skipgram: (syslog, HANDLING, CALL)
Ngram Features
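The three feature types above can be generated with a few lines of Python. In this sketch a skipgram is taken to be an n-gram that may skip up to k tokens after its first token, which is one common definition and an assumption here; the token list is a simplified version of the sample record with the dates and numbers dropped.

```python
from itertools import combinations

def ngrams(tokens, n):
    """All runs of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """n-grams that may skip up to k tokens after the first token."""
    grams = []
    for i, head in enumerate(tokens):
        window = tokens[i + 1:i + n + k]  # candidates for the other n-1 slots
        for tail in combinations(window, n - 1):
            grams.append((head,) + tail)
    return grams

# Simplified token list (assumption: dates and numbers already removed).
tokens = ["acmepayroll", "syslog", "HANDLING", "TELNET", "CALL"]

print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(skipgrams(tokens, 3, 1))
```

With k=0 the skipgrams reduce to ordinary n-grams, so the same function can produce all three feature sets.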
• acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
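One way to reproduce the mappings above is a single regular-expression substitution. The rule set below is inferred from the five examples on the slide and is an assumption; the talk does not spell out the exact rules.

```python
import re

# Order matters: a capitalized word must be tried before a run of capitals.
SHAPE_RE = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|\d+")

def _shape_of_run(match):
    run = match.group(0)
    if run.isdigit():
        return "nn"  # run of digits
    if run.islower():
        return "aa"  # run of lowercase letters
    if run.isupper():
        return "AA"  # run of uppercase letters
    return "Aa"      # capitalized word

def token_shape(token):
    """Replace letter and digit runs with shape codes; keep punctuation."""
    return SHAPE_RE.sub(_shape_of_run, token)

for token in ["acmepayroll", "User", "ABCDE", "10101", "pid=9644"]:
    print(token, "->", token_shape(token))
```

Because punctuation is kept, key=value tokens retain their structure, so records with the same layout but different values collapse to the same shape string.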
Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
• Count tokens across all records & types (e.g. ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
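A minimal sketch of the counting idea: compute how many records of a type contain each token, then report the tokens in a record that appear in only a small fraction of records. The max_fraction threshold and the sample records are made up for illustration.

```python
from collections import Counter

def document_frequencies(records):
    """Count, for each token, how many records contain it."""
    df = Counter()
    for tokens in records:
        df.update(set(tokens))  # count each token once per record
    return df

def rare_tokens(tokens, df, n_records, max_fraction=0.2):
    """Tokens of one record that occur in at most max_fraction of records."""
    return [t for t in tokens if df[t] / n_records <= max_fraction]

# Made-up ssh records: nine look alike, one contains an unusual token.
records = [["sshd", "Accepted", "password"] for _ in range(9)]
records.append(["sshd", "Accepted", "backdoor"])

df = document_frequencies(records)
print(rare_tokens(records[-1], df, len(records)))
```

On raw logs nearly every timestamp and id would count as rare, so this works best after values have been normalized, for example with token shapes.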
1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is usually the most accurate
• Scikit-learn has many options
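The three steps above can be sketched with a tiny multinomial naive Bayes classifier written in plain Python. This is a stand-in chosen so the example has no dependencies, not the method used in the talk; in practice one of the scikit-learn classifiers would take its place. The class name, the training records and the labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, feature_lists, labels):
        for feats, label in zip(feature_lists, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict_proba(self, feats):
        """Return a dict mapping each class to its probability."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # class prior
            for f in feats:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((counts[f] + self.alpha) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

# Made-up training data: step 1 (feature extraction) is plain tokens here.
train = [
    (["sshd", "Accepted", "password"], "ssh"),
    (["sshd", "Failed", "password"], "ssh"),
    (["cron", "session", "opened"], "other"),
    (["kernel", "usb", "device"], "other"),
]
clf = NaiveBayes().fit([f for f, _ in train], [y for _, y in train])
print(clf.predict_proba(["sshd", "Failed"]))
```

Returning probabilities rather than a single label is what makes the later anomaly checks possible: a record whose best class has low probability is itself worth a look.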
● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify?
3. What is not ssh but does classify?
Log Classification Anomalies
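Questions 2 and 3 amount to comparing each record's declared type with what a one-vs-rest classifier predicts. The sketch below accepts any predict callable; the keyword rule used in the demo is a trivial stand-in for a trained classifier, and the records are made up.

```python
def classification_anomalies(records, labels, predict, target="ssh"):
    """Split disagreements between declared type and predicted type.

    Returns (missed, unexpected):
      missed     - labeled as target but not classified as target
      unexpected - not labeled as target but classified as target
    """
    missed, unexpected = [], []
    for record, label in zip(records, labels):
        predicted = predict(record)
        if label == target and predicted != target:
            missed.append(record)
        elif label != target and predicted == target:
            unexpected.append(record)
    return missed, unexpected

def toy_predict(tokens):
    """Stand-in classifier (assumption): anything mentioning sshd is ssh."""
    return "ssh" if "sshd" in tokens else "other"

records = [
    ["sshd", "Accepted", "password"],   # labeled ssh, classifies as ssh
    ["HANDLING", "TELNET", "CALL"],     # labeled ssh, does not classify
    ["cron", "started", "sshd"],        # labeled other, classifies as ssh
    ["kernel", "usb", "device"],        # labeled other, classifies as other
]
labels = ["ssh", "ssh", "other", "other"]

missed, unexpected = classification_anomalies(records, labels, toy_predict)
print("ssh type but does not classify:", missed)
print("not ssh but does classify:", unexpected)
```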
Features:
● Description
● Rules / thresholds
● Log record features
Labels = priority level (high, medium, low)
Alert Classification
● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
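The "far from centroid" check can be sketched with a small k-means written in plain Python. Starting centroids are passed in explicitly so the result is deterministic, and the two-dimensional points are made-up feature vectors; a real run would use a scikit-learn clusterer over extracted features. Requires Python 3.8 or later for math.dist.

```python
import math

def kmeans(points, init_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids

def assign(points, centroids):
    """For each point, return (cluster index, distance to that centroid)."""
    result = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        nearest = distances.index(min(distances))
        result.append((nearest, distances[nearest]))
    return result

# Made-up feature vectors: two tight groups plus one stray record.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (2.5, 2.5)]
centroids = kmeans(points, init_centroids=[points[0], points[3]])
assignments = assign(points, centroids)

# The record farthest from its own centroid is the first anomaly candidate.
farthest_point, (cluster, distance) = max(
    zip(points, assignments), key=lambda pair: pair[1][1])
print(farthest_point, cluster, round(distance, 3))
```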
● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
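To illustrate "learn from good data set, identify new records that don't fit", here is a deliberately simple one-class detector: it remembers every token seen in known-good records and scores a new record by the fraction of its tokens that were never seen. This is a toy stand-in, not a method from the talk; scikit-learn's OneClassSVM, IsolationForest and LocalOutlierFactor are the kind of options the slide refers to. The class name, threshold and sample records are assumptions.

```python
class UnseenTokenDetector:
    """Toy novelty detector trained only on known-good records."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off, tune per data set
        self.known = set()

    def fit(self, good_records):
        for tokens in good_records:
            self.known.update(tokens)
        return self

    def score(self, tokens):
        """Fraction of tokens never seen in the good data (0.0 to 1.0)."""
        if not tokens:
            return 0.0
        unseen = sum(1 for t in tokens if t not in self.known)
        return unseen / len(tokens)

    def is_anomaly(self, tokens):
        return self.score(tokens) > self.threshold

# Made-up good records.
good = [
    ["sshd", "Accepted", "password", "for", "root"],
    ["sshd", "Failed", "password", "for", "guest"],
]
detector = UnseenTokenDetector().fit(good)

print(detector.is_anomaly(["sshd", "Accepted", "password", "for", "guest"]))
print(detector.is_anomaly(["wksh", "HANDLING", "TELNET", "CALL"]))
```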
● Tokenization
● Feature extraction
● Classification
● Clustering
● Anomaly detection
Summary
• NLTK
• Scikit-Learn
• Gensim
• Logpai
• Text-processing.com
• Streamhacker.com
References
● Investigator: plain English log search -> multiple visualizations & recommendations on what to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines
Thank you!

Editor's notes

  1. Punctuation in weird places
  2. NLP example: can’t
  3. Trained on WSJ news articles
  4. Grammars ~= multi-line regex
  5. Bigram & Trigram features can add a lot to classification & clustering accuracy
  6. Use token shapes to normalize? Technique based on TF/IDF & search indexing to identify high information words
  7. Sentiment used a lot for marketing analytics
  8. One vs all classification.
  9. Triage, identify false positives or negatives
  10. Topic modeling is a different type of clustering