Natural language processing and
machine learning
Nikola Milosevic
What is AI?
• Intelligence exhibited by a machine
• A flexible agent that interacts with its environment and
takes actions to maximize its chance of achieving a given goal
Popular AI
What is machine learning?
• Subfield of computer science that explores
how machines can learn to perform a
task without being explicitly programmed
Data mining generally
Types of machine learning
• Supervised learning
• Semi-supervised learning
• Unsupervised learning
• Reinforcement learning
Machine learning problems
• Classification
• Clustering
• Regression
Testing the model
• Iteratively improve the model
• Test multiple algorithms – find the best one
• No free lunch theorem: no single algorithm is best for every problem
• Feedback loop for feature selection
• Confusion matrix
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
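The accuracy formula above can be computed directly from the four confusion matrix counts. The following is a minimal sketch in plain Python; the function names and the toy labels are illustrative assumptions, not part of the original slides.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary task.

    `positive` marks which label counts as the positive class
    (an assumption of this sketch).
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, really negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, really positive
            else:
                tn += 1  # predicted negative, really negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```

Accuracy can be misleading when classes are imbalanced, which is one reason to inspect the full confusion matrix rather than the single number.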
Examples of ML frameworks and
algorithms
• scikit-learn
– Python library
– Implementation of the most useful algorithms
– Naïve Bayes, SVM, Random forests, decision
trees…
• Keras
– Python library implementing most common building blocks
for neural networks
Text data
• About 80% of the data in organizations is in text
format
• Harder to analyse than structured data
• Huge amount of textual documents
– In biomedicine alone, 2200 scientific papers are
published every day
• Growing exponentially
Main goals of text mining
• Make communication easier (e.g. translation)
• Automate some processes (e.g.
conversational agents/chatbots)
• Do data mining on textual and unstructured
data
Process overview
Challenges
• The man saw a woman with the telescope.
– Who has the telescope?
• Multiple senses, synonyms,
homonyms, irony
• Grammar and context can help
• Acronyms
Approaches
• Rule based
– Human-defined rules to extract information
– Needs experts who know how people express
certain things
– Writing the rules is quite laborious
• Machine learning based
– The machine learns what to extract from
human-labelled examples
– Needs annotated corpora (usually fairly large)
• These are expensive and laborious to create
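The rule-based approach can be illustrated with a single hand-written rule. The sketch below uses a regular expression to pull dose mentions out of text; the pattern, the unit list, and the example sentences are assumptions made up for illustration.

```python
import re

# One human-defined rule: a number followed by a unit of measure.
# The unit list (mg, ml, g) is a hypothetical choice for this sketch.
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|ml|g)\b", re.IGNORECASE)


def extract_doses(text):
    """Return (amount, unit) pairs matched by the rule."""
    return [(float(amount), unit.lower())
            for amount, unit in DOSE_PATTERN.findall(text)]


sentence = "Patients received 500 mg of aspirin and 2.5 ml of syrup daily."
print(extract_doses(sentence))  # [(500.0, 'mg'), (2.5, 'ml')]

# The same rule misses a phrasing its author did not anticipate.
print(extract_doses("Patients took five hundred milligrams."))  # []
```

The second call shows why rule writing is laborious: each new way of expressing the same information needs another rule, whereas a machine learning approach would need annotated examples of it instead.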
Levels of analysis
• Lexical
– Analysis of words
• Syntactic
– Analysis of organization of words
(phrases, sentences)
• Semantic
– Analysis of meaning
• Sometimes pragmatic
– Analysis of how certain words and phrases are used
in context. Why did the author choose them?
Steps
Lexical and syntactic processing
• Part of speech tagging
• Parsing
– Constituency
– Dependency
Stanford parser
Semantic processing
• Text classification
– Sentiment analysis (positive/negative)
– Classification by topics (politics/sport/business)
– Authorship detection (Tolkien, Rowling, Shakespeare)
• Named entity recognition
• Topic modelling (unsupervised)
• Search
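Text classification such as sentiment analysis can be sketched with a small Naïve Bayes classifier. The version below is a simplified multinomial Naïve Bayes with add-one smoothing, written in plain Python; the class name, the whitespace tokenizer, and the four training sentences are assumptions for illustration, and a real system would need far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                # Smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny made-up training set.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "boring plot awful acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("great acting"))       # positive
print(model.predict("awful boring film"))  # negative
```

The same structure applies to topic classification or authorship detection; only the labels and the training texts change.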
Sequence modelling
• Machine learning technique useful for named
entity recognition
• Conditional random fields (CRF) or recurrent
neural networks (often LSTM)
Feature engineering
• Selecting important features that help extract
information
• Features can be:
– Words, PoS tags, word shapes, vocabulary features,
etc.
• The right features depend on the task and methodology
• Iterative process of selecting features and checking
whether performance improves
• Some features may confuse the algorithm
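The kinds of features listed above can be sketched as a function that turns one token and its neighbours into a feature dictionary, the usual input format for sequence models such as CRFs. The feature names, the `<START>`/`<END>` markers, and the example tokens are assumptions of this sketch.

```python
import re


def word_shape(token):
    """Map characters to a coarse shape: X upper, x lower, d digit."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return shape


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "shape": word_shape(token),
        "starts_upper": token[:1].isupper(),
        "suffix3": token[-3:].lower(),
        # Context features: neighbouring words, with boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Made-up example sentence, already split into tokens.
tokens = ["Dr", "Smith", "took", "IL-2", "in", "2019"]

print(word_shape("IL-2"))         # XX-d
print(word_shape("2019"))         # dddd
print(token_features(tokens, 3))
```

In an iterative workflow, features are added to or removed from this dictionary and the model is retrained to see whether performance improves.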
Search
• Finds documents that are the most relevant
for a given user query
• Common techniques include TF-IDF weighting
and cosine similarity
• May additionally use links pointing to the document,
positions of matched words, and similar signals
to rank the retrieved documents
• Apache Lucene and Solr (Java); Python libraries
also exist
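Ranking documents with TF-IDF and cosine similarity can be sketched in a few lines of plain Python. TF-IDF has several variants; this sketch assumes raw term counts and an IDF of `log(N / df) + 1`, and the function names and documents are made up for illustration. Production engines such as Lucene use more refined scoring.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs_tokens):
    """Inverse document frequency: rarer terms get higher weight."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse vector (dict) of term frequency times IDF."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return (score, document) pairs, most relevant first."""
    docs_tokens = [tokenize(d) for d in documents]
    idf = build_idf(docs_tokens)
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vec, tfidf_vector(tokens, idf)), doc)
              for tokens, doc in zip(docs_tokens, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "machine learning for text classification",
    "cooking pasta with tomato sauce",
    "deep learning for image recognition",
]

results = search("text classification", documents)
for score, doc in results:
    print(round(score, 3), doc)
```

Documents that share no terms with the query get a score of 0, so only the first document is ranked as relevant here.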
Language models
• Used as features for classification and other
NLP tasks
• Capture some basic characteristics of language
• The most naïve (but also frequently used) is
called Bag of Words
• Neural networks use more advanced
models: word2vec, GloVe,
ELMo, BERT…
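The Bag of Words representation mentioned above can be sketched as a vector of word counts over a fixed vocabulary. The function names and the two example sentences are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct word to a fixed vector position."""
    words = sorted({tok for text in texts for tok in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are dropped
            vector[vocabulary[word]] = count
    return vector


texts = ["the dog bit the man", "the man bit the dog"]
vocabulary = build_vocabulary(texts)

print(vocabulary)                          # {'bit': 0, 'dog': 1, 'man': 2, 'the': 3}
print(bag_of_words(texts[0], vocabulary))  # [1, 1, 1, 2]
print(bag_of_words(texts[1], vocabulary))  # [1, 1, 1, 2]
```

Both sentences get the same vector even though they mean different things, because word order is discarded. Handling context like this is one motivation for the more advanced models listed above.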
Useful tools and libraries
• Apache OpenNLP – Java
• Apache Lucene – Java, C#
• Stanford CoreNLP – Java
• NLTK – Python
• GATE – GUI tool
• SharpNLP
• ...
• Weka – for machine learning (GUI)
