SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Downloaden Sie, um offline zu lesen
Statistical Tools for Linguists

        Cohan Sujay Carlos
           Aiaioo Labs
           Bangalore
Text Analysis and Statistical Methods

 • Motivation
 • Statistics and Probabilities
 • Application to Corpus Linguistics
Motivation
• Human Development is all about Tools
  – Describe the world
  – Explain the world
  – Solve problems in the world
• Some of these tools
  – Language
  – Algorithms
  – Statistics and Probabilities
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1000 teachers, 100 students per
   class, and 3 years of teaching per student

   –12000 years
 • If we had 100,000 teachers

   –120 years
Motivation – Algorithms for Education Policy

 • 300 to 400 million people are illiterate
 • If we took 1 teacher, 10 students per class,
   and 3 years of teaching per student.
 • Then each student teaches 10 more students.

   – about 30 years
 • We could turn the whole world literate in

   – about 34 years
Motivation – Algorithms for Education Policy


 Difference:

 Policy 1 is O(n) time
 Policy 2 is O(log n) time
Motivation – Statistics for Linguists

 We have shown that:
 Using a tool from computer science, we can
 solve a problem in quite another area.

                   SIMILARLY

 Linguists will find statistics to be a handy tool
 to better understand languages.
Applications of Statistics to Linguistics


     • How can statistics be useful?
     • Can probabilities be useful?
Introduction to Aiaioo Labs
• Focus on Text Analysis, NLP, ML, AI
• Applications to business problems
• Team consists of
  – Researchers
     • Cohan
     • Madhulika
     • Sumukh
  – Linguists
  – Engineers
  – Marketing
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet
  – Google terabyte corpus (with annotations?)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
  – Wordnet (set of rules about the real world)
  – Google terabyte corpus (real world)
Approach to corpus construction
• The problem: ‘word semantics’
• What is better?
    – Wordnet (not countable)
    – Google terabyte corpus (countable)



For training machine learning algorithms, the latter might be more valuable,
just because it is possible to tally up evidence on the latter corpus.

Of course I am simplifying things a lot and I don’t mean that the former is not
valuable at all.
Approach to corpus construction
 So if you are constructing a corpus on
 which machine learning methods might
 be applied, construct your corpus so that
 you retain as many examples of surface
 forms as possible.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
Problem : Spelling

1.   Field
2.   Wield
3.   Shield
4.   Deceive
5.   Receive
6.   Ceiling

                       Courtesy of http://norvig.com/chomsky.html
Rule-based Approach


    “I before E except after C”

-- an example of a linguistic insight




                 Courtesy of http://norvig.com/chomsky.html
Probabilistic Statistical Model:
• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’
  and ‘cei’ in a large corpus

P(IE) = 0.0177
P(EI) = 0.0046
P(CIE) = 0.0014
P(CEI) = 0.0005

                         Courtesy of http://norvig.com/chomsky.html
Words where ie occur after c
•   science
•   society
•   ancient
•   species




                   Courtesy of http://norvig.com/chomsky.html
But you can go back to a Rule-based
             Approach


  “I before E except after C only if C is not
               preceded by an S”

    -- an example of a linguistic insight


                      Courtesy of http://norvig.com/chomsky.html
What is a probability?

• A number between 0 and 1
• The sum of the probabilities on all outcomes is 1

Heads                  Tails




• P(heads) = 0.5
• P(tails) = 0.5
Estimation of P(IE)



P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
What is Estimation?



P(“UN”) = C(“UN”) / C(all words in my corpus)
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong
How do you annotate?
• The problem: ‘named entity classification’
• What is better?
  – Per, Org, Loc, Prod, Time
  – Right, Wrong



      It depends on whether you care about
      precision or recall or both.
What are Precision and Recall



 Classification metrics used to compare ML
 algorithms.
Classification Metrics

     Politics                   Sports

The UN Security            Warwickshire's Clarke
Council adopts its first   equalled the first-class
clear condemnation of      record of seven

    How do you compare two ML algorithms?
Classification Quality Metrics
               Point of view = Politics

                            Gold - Politics       Gold - Sports

Observed - Politics   TP (True Positive)      FP (False Positive)


Observed - Sports     FN (False Negative)     TN (True Negative)
Classification Quality Metrics
               Point of view = Sports

                           Gold - Politics       Gold - Sports

Observed - Politics   TN (True Negative)     FN (False Positive)


Observed - Sports     FP (False Negative)    TP (True Positive)
Classification Quality Metric - Accuracy
                   Point of view = Sports

                               Gold - Politics      Gold – Sports

    Observed - Politics   TN (True Negative)     FN (False Positive)


    Observed - Sports     FP (False Negative)    TP (True Positive)
Metrics for Measuring Classification Quality
                      Point of View – Class 1

                              Gold Class 1               Gold Class 2

   Observed Class 1     TP                          FP


   Observed Class 2     FN                          TN




                Great metrics for highly unbalanced corpora!
Metrics for Measuring Classification Quality




  F-Score = the harmonic mean of Precision and Recall
F-Score Generalized


            1
 F
       1           1
        (1   )
       P           R
Precision, Recall, Average, F-Score

                 Precision          Recall         Average           F-Score

 Classifier 1   50%              50%             50%                50%


 Classifier 2   30%              70%             50%                42%


 Classifier 3   10%              90%             50%                18%




                 What is the sort of classifier that fares worst?
How do you annotate?
So if you are constructing a corpus for a
machine learning tool where only
precision matters, all you need is a corpus
of presumed positives that you mark as
right or wrong (or the label and other).

If you need to get good recall as well, you
will need a corpus annotated with all the
relevant labels.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the corpus
How much data should you annotate?
  • The problem: ‘named entity classification’
  • What is better?
    – 2000 words per category (each of Per, Org,
      Loc, Prod, Time)
    – 5000 words per category (each of Per, Org,
      Loc, Prod, Time)
Small Corpus – 4 Fold Cross-Validation

          Split          Train Folds         Test Fold

       First Run    • 1, 2, 3          • 4


       Second Run   • 2, 3, 4          • 1


       Third Run    • 3, 4, 1          • 2


       Fourth Run   • 4, 1, 2          • 3
Statistical significance in a paper


                              significance         estimate
                                                         variance




    Remember to take Inter-Annotator Agreement into account
How much do you annotate?
So you increase the corpus size till that
the error margins drop to a value that the
experimenter considers sufficient.

The smaller the error margins, the finer
the comparisons the experimenter can
make between algorithms.
Applications to Corpus Linguistics
•   What to annotate
•   How to develop insights
•   How to annotate
•   How much data to annotate
•   How to avoid mistakes in using the
    corpus
Avoid Mistakes
• The problem: ‘train a classifier’
• What is better?
  – Train with all the data that you have, and
    then test on all the data that you have?
  – Train on half and test on the other half?
Avoid Mistakes
• Training a corpus on a full corpus and
  then running tests using the same corpus
  is a bad idea because it is a bit like
  revealing the questions in the exam
  before the exam.
• A simple algorithm that can game such a
  test is a plain memorization algorithm
  that memorizes all the possible inputs
  and the corresponding outputs.
Corpus Splits

        Split            Percentage

Training        • 60%


Validation      • 20%


Testing         • 20%


Total           • 100%
How do you avoid mistakes?
Do not train a machine learning algorithm on the
‘testing’ section of the corpus.

During the development/tuning of the algorithm,
do not make any measurements using the
‘testing’ section, or you’re likely to ‘cheat’ on the
feature set, and settings. Use the ‘validation’
section for that.

I have seen researchers claim 99.7% accuracy on
Indian language POS tagging because they failed
to keep the different sections of their corpus
sufficiently well separated.

Weitere ähnliche Inhalte

Was ist angesagt?

Generalized transition graphs
Generalized transition graphsGeneralized transition graphs
Generalized transition graphsArham Khan G
 
Push Down Automata (PDA) | TOC (Theory of Computation) | NPDA | DPDA
Push Down Automata (PDA) | TOC  (Theory of Computation) | NPDA | DPDAPush Down Automata (PDA) | TOC  (Theory of Computation) | NPDA | DPDA
Push Down Automata (PDA) | TOC (Theory of Computation) | NPDA | DPDAAshish Duggal
 
Compiler design-lab-manual v-cse
Compiler design-lab-manual v-cseCompiler design-lab-manual v-cse
Compiler design-lab-manual v-cseravisharma159932
 
Propositional logic
Propositional logicPropositional logic
Propositional logicRushdi Shams
 
asymptotic analysis and insertion sort analysis
asymptotic analysis and insertion sort analysisasymptotic analysis and insertion sort analysis
asymptotic analysis and insertion sort analysisAnindita Kundu
 
Chapter 4 introduction to assessing language skills
Chapter 4  introduction to assessing language skillsChapter 4  introduction to assessing language skills
Chapter 4 introduction to assessing language skillsIES JFK
 
Pushdown Automata Theory
Pushdown Automata TheoryPushdown Automata Theory
Pushdown Automata TheorySaifur Rahman
 
Rules of inference
Rules of inferenceRules of inference
Rules of inferenceharman kaur
 
SCSJ3553 - Artificial Intelligence Final Exam paper - UTM
SCSJ3553 - Artificial Intelligence Final Exam paper - UTMSCSJ3553 - Artificial Intelligence Final Exam paper - UTM
SCSJ3553 - Artificial Intelligence Final Exam paper - UTMAbdul Khaliq
 
Theory of Automata and formal languages unit 2
Theory of Automata and formal languages unit 2Theory of Automata and formal languages unit 2
Theory of Automata and formal languages unit 2Abhimanyu Mishra
 
CONTEXT FREE GRAMMAR
CONTEXT FREE GRAMMAR CONTEXT FREE GRAMMAR
CONTEXT FREE GRAMMAR Zahid Parvez
 
NFA Converted to DFA , Minimization of DFA , Transition Diagram
NFA Converted to DFA , Minimization of DFA , Transition DiagramNFA Converted to DFA , Minimization of DFA , Transition Diagram
NFA Converted to DFA , Minimization of DFA , Transition DiagramAbdullah Jan
 
Minimization of DFA
Minimization of DFAMinimization of DFA
Minimization of DFAkunj desai
 

Was ist angesagt? (20)

Generalized transition graphs
Generalized transition graphsGeneralized transition graphs
Generalized transition graphs
 
Push Down Automata (PDA) | TOC (Theory of Computation) | NPDA | DPDA
Push Down Automata (PDA) | TOC  (Theory of Computation) | NPDA | DPDAPush Down Automata (PDA) | TOC  (Theory of Computation) | NPDA | DPDA
Push Down Automata (PDA) | TOC (Theory of Computation) | NPDA | DPDA
 
Compiler design-lab-manual v-cse
Compiler design-lab-manual v-cseCompiler design-lab-manual v-cse
Compiler design-lab-manual v-cse
 
Propositional logic
Propositional logicPropositional logic
Propositional logic
 
asymptotic analysis and insertion sort analysis
asymptotic analysis and insertion sort analysisasymptotic analysis and insertion sort analysis
asymptotic analysis and insertion sort analysis
 
NFA to DFA
NFA to DFANFA to DFA
NFA to DFA
 
Correlation.pptx
Correlation.pptxCorrelation.pptx
Correlation.pptx
 
Chapter 4 introduction to assessing language skills
Chapter 4  introduction to assessing language skillsChapter 4  introduction to assessing language skills
Chapter 4 introduction to assessing language skills
 
Pushdown Automata Theory
Pushdown Automata TheoryPushdown Automata Theory
Pushdown Automata Theory
 
Context free grammar
Context free grammar Context free grammar
Context free grammar
 
Logic
LogicLogic
Logic
 
Rules of inference
Rules of inferenceRules of inference
Rules of inference
 
NFA & DFA
NFA & DFANFA & DFA
NFA & DFA
 
SCSJ3553 - Artificial Intelligence Final Exam paper - UTM
SCSJ3553 - Artificial Intelligence Final Exam paper - UTMSCSJ3553 - Artificial Intelligence Final Exam paper - UTM
SCSJ3553 - Artificial Intelligence Final Exam paper - UTM
 
Theory of Automata and formal languages unit 2
Theory of Automata and formal languages unit 2Theory of Automata and formal languages unit 2
Theory of Automata and formal languages unit 2
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Undecidabality
UndecidabalityUndecidabality
Undecidabality
 
CONTEXT FREE GRAMMAR
CONTEXT FREE GRAMMAR CONTEXT FREE GRAMMAR
CONTEXT FREE GRAMMAR
 
NFA Converted to DFA , Minimization of DFA , Transition Diagram
NFA Converted to DFA , Minimization of DFA , Transition DiagramNFA Converted to DFA , Minimization of DFA , Transition Diagram
NFA Converted to DFA , Minimization of DFA , Transition Diagram
 
Minimization of DFA
Minimization of DFAMinimization of DFA
Minimization of DFA
 

Ähnlich wie Statistics for linguistics

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
ISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to StatisticsISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to StatisticsAndrea Arcuri
 
To requirements and beyond...
To requirements and beyond...To requirements and beyond...
To requirements and beyond...SQALab
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
To requirements and beyond...
To requirements and beyond...To requirements and beyond...
To requirements and beyond...SQALab
 
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Les Perelman
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.pptmanaswidebbarma1
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needsIvan Berlocher
 
R Data Structures (Part 1)
R Data Structures (Part 1)R Data Structures (Part 1)
R Data Structures (Part 1)Victor Ordu
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Sunil Kumar Kopparapu
 
Survey Research in Software Engineering
Survey Research in Software EngineeringSurvey Research in Software Engineering
Survey Research in Software EngineeringDaniel Mendez
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning AnalyticsXavier Ochoa
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016George Roth
 

Ähnlich wie Statistics for linguistics (20)

NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
ISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to StatisticsISSTA'16 Summer School: Intro to Statistics
ISSTA'16 Summer School: Intro to Statistics
 
To requirements and beyond...
To requirements and beyond...To requirements and beyond...
To requirements and beyond...
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Search quality in practice
Search quality in practiceSearch quality in practice
Search quality in practice
 
To requirements and beyond...
To requirements and beyond...To requirements and beyond...
To requirements and beyond...
 
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 
R Data Structures (Part 1)
R Data Structures (Part 1)R Data Structures (Part 1)
R Data Structures (Part 1)
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.Do you Mean what you say? Recognizing Emotions.
Do you Mean what you say? Recognizing Emotions.
 
Survey Research in Software Engineering
Survey Research in Software EngineeringSurvey Research in Software Engineering
Survey Research in Software Engineering
 
Multimodal Learning Analytics
Multimodal Learning AnalyticsMultimodal Learning Analytics
Multimodal Learning Analytics
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016Unstructured data processing webinar 06272016
Unstructured data processing webinar 06272016
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 

Mehr von aiaioo

Document Analysis with Deep Learning
Document Analysis with Deep LearningDocument Analysis with Deep Learning
Document Analysis with Deep Learningaiaioo
 
Deep Learning through Pytorch Exercises
Deep Learning through Pytorch ExercisesDeep Learning through Pytorch Exercises
Deep Learning through Pytorch Exercisesaiaioo
 
Learning Non-Linear Functions for Text Classification
Learning Non-Linear Functions for Text ClassificationLearning Non-Linear Functions for Text Classification
Learning Non-Linear Functions for Text Classificationaiaioo
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Toolsaiaioo
 
Fun with Text - Managing Text Analytics
Fun with Text - Managing Text AnalyticsFun with Text - Managing Text Analytics
Fun with Text - Managing Text Analyticsaiaioo
 
Arduino for Indian Languages
Arduino for Indian LanguagesArduino for Indian Languages
Arduino for Indian Languagesaiaioo
 
Fun with Text - Hacking Text Analytics
Fun with Text - Hacking Text AnalyticsFun with Text - Hacking Text Analytics
Fun with Text - Hacking Text Analyticsaiaioo
 
Vaklipi (Natural Language Programming and Queries)
Vaklipi (Natural Language Programming and Queries)Vaklipi (Natural Language Programming and Queries)
Vaklipi (Natural Language Programming and Queries)aiaioo
 
Rules engines to machine learning
Rules engines to machine learningRules engines to machine learning
Rules engines to machine learningaiaioo
 
Aiaioo labs - Only Slightly Futuristic
Aiaioo labs - Only Slightly FuturisticAiaioo labs - Only Slightly Futuristic
Aiaioo labs - Only Slightly Futuristicaiaioo
 

Mehr von aiaioo (10)

Document Analysis with Deep Learning
Document Analysis with Deep LearningDocument Analysis with Deep Learning
Document Analysis with Deep Learning
 
Deep Learning through Pytorch Exercises
Deep Learning through Pytorch ExercisesDeep Learning through Pytorch Exercises
Deep Learning through Pytorch Exercises
 
Learning Non-Linear Functions for Text Classification
Learning Non-Linear Functions for Text ClassificationLearning Non-Linear Functions for Text Classification
Learning Non-Linear Functions for Text Classification
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Tools
 
Fun with Text - Managing Text Analytics
Fun with Text - Managing Text AnalyticsFun with Text - Managing Text Analytics
Fun with Text - Managing Text Analytics
 
Arduino for Indian Languages
Arduino for Indian LanguagesArduino for Indian Languages
Arduino for Indian Languages
 
Fun with Text - Hacking Text Analytics
Fun with Text - Hacking Text AnalyticsFun with Text - Hacking Text Analytics
Fun with Text - Hacking Text Analytics
 
Vaklipi (Natural Language Programming and Queries)
Vaklipi (Natural Language Programming and Queries)Vaklipi (Natural Language Programming and Queries)
Vaklipi (Natural Language Programming and Queries)
 
Rules engines to machine learning
Rules engines to machine learningRules engines to machine learning
Rules engines to machine learning
 
Aiaioo labs - Only Slightly Futuristic
Aiaioo labs - Only Slightly FuturisticAiaioo labs - Only Slightly Futuristic
Aiaioo labs - Only Slightly Futuristic
 

Kürzlich hochgeladen

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Statistics for linguistics

  • 1. Statistical Tools for Linguists Cohan Sujay Carlos Aiaioo Labs Bangalore
  • 2. Text Analysis and Statistical Methods • Motivation • Statistics and Probabilities • Application to Corpus Linguistics
  • 3. Motivation • Human Development is all about Tools – Describe the world – Explain the world – Solve problems in the world • Some of these tools – Language – Algorithms – Statistics and Probabilities
  • 4. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1000 teachers, 100 students per class, and 3 years of teaching per student –12000 years • If we had 100,000 teachers –120 years
  • 5. Motivation – Algorithms for Education Policy • 300 to 400 million people are illiterate • If we took 1 teacher, 10 students per class, and 3 years of teaching per student. • Then each student teaches 10 more students. – about 30 years • We could turn the whole world literate in – about 34 years
  • 6. Motivation – Algorithms for Education Policy Difference: Policy 1 is O(n) time Policy 2 is O(log n) time
  • 7. Motivation – Statistics for Linguists We have shown that: Using a tool from computer science, we can solve a problem in quite another area. SIMILARLY Linguists will find statistics to be a handy tool to better understand languages.
  • 8. Applications of Statistics to Linguistics • How can statistics be useful? • Can probabilities be useful?
  • 9. Introduction to Aiaioo Labs • Focus on Text Analysis, NLP, ML, AI • Applications to business problems • Team consists of – Researchers • Cohan • Madhulika • Sumukh – Linguists – Engineers – Marketing
  • 10. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 11. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet – Google terabyte corpus (with annotations?)
  • 12. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (set of rules about the real world) – Google terabyte corpus (real world)
  • 13. Approach to corpus construction • The problem: ‘word semantics’ • What is better? – Wordnet (not countable) – Google terabyte corpus (countable) For training machine learning algorithms, the latter might be more valuable, just because it is possible to tally up evidence on the latter corpus. Of course I am simplifying things a lot and I don’t mean that the former is not valuable at all.
  • 14. Approach to corpus construction So if you are constructing a corpus on which machine learning methods might be applied, construct your corpus so that you retain as many examples of surface forms as possible.
  • 15. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 16. Problem : Spelling 1. Field 2. Wield 3. Shield 4. Deceive 5. Receive 6. Ceiling Courtesy of http://norvig.com/chomsky.html
  • 17. Rule-based Approach “I before E except after C” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
  • 18. Probabilistic Statistical Model: • Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’ and ‘cei’ in a large corpus P(IE) = 0.0177 P(EI) = 0.0046 P(CIE) = 0.0014 P(CEI) = 0.0005 Courtesy of http://norvig.com/chomsky.html
  • 19. Words where ie occur after c • science • society • ancient • species Courtesy of http://norvig.com/chomsky.html
  • 20. But you can go back to a Rule-based Approach “I before E except after C only if C is not preceded by an S” -- an example of a linguistic insight Courtesy of http://norvig.com/chomsky.html
  • 21. What is a probability? • A number between 0 and 1 • The sum of the probabilities on all outcomes is 1 Heads Tails • P(heads) = 0.5 • P(tails) = 0.5
  • 22. Estimation of P(IE) P(“IE”) = C(“IE”) / C(all two letter sequences in my corpus)
  • 23. What is Estimation? P(“UN”) = C(“UN”) / C(all words in my corpus)
  • 24. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 25. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong
  • 26. How do you annotate? • The problem: ‘named entity classification’ • What is better? – Per, Org, Loc, Prod, Time – Right, Wrong It depends on whether you care about precision or recall or both.
  • 27. What are Precision and Recall Classification metrics used to compare ML algorithms.
  • 28. Classification Metrics Politics Sports The UN Security Warwickshire's Clarke Council adopts its first equalled the first-class clear condemnation of record of seven How do you compare two ML algorithms?
  • 29. Classification Quality Metrics Point of view = Politics Gold - Politics Gold - Sports Observed - Politics TP (True Positive) FP (False Positive) Observed - Sports FN (False Negative) TN (True Negative)
  • 30. Classification Quality Metrics Point of view = Sports Gold - Politics Gold - Sports Observed - Politics TN (True Negative) FN (False Positive) Observed - Sports FP (False Negative) TP (True Positive)
  • 31. Classification Quality Metric - Accuracy Point of view = Sports Gold - Politics Gold – Sports Observed - Politics TN (True Negative) FN (False Positive) Observed - Sports FP (False Negative) TP (True Positive)
  • 32. Metrics for Measuring Classification Quality Point of View – Class 1 Gold Class 1 Gold Class 2 Observed Class 1 TP FP Observed Class 2 FN TN Great metrics for highly unbalanced corpora!
  • 33. Metrics for Measuring Classification Quality F-Score = the harmonic mean of Precision and Recall
  • 34. F-Score Generalized 1 F 1 1   (1   ) P R
  • 35. Precision, Recall, Average, F-Score Precision Recall Average F-Score Classifier 1 50% 50% 50% 50% Classifier 2 30% 70% 50% 42% Classifier 3 10% 90% 50% 18% What is the sort of classifier that fares worst?
  • 36. How do you annotate? So if you are constructing a corpus for a machine learning tool where only precision matters, all you need is a corpus of presumed positives that you mark as right or wrong (or the label and other). If you need to get good recall as well, you will need a corpus annotated with all the relevant labels.
  • 37. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 38. How much data should you annotate? • The problem: ‘named entity classification’ • What is better? – 2000 words per category (each of Per, Org, Loc, Prod, Time) – 5000 words per category (each of Per, Org, Loc, Prod, Time)
  • 39. Small Corpus – 4 Fold Cross-Validation Split Train Folds Test Fold First Run • 1, 2, 3 • 4 Second Run • 2, 3, 4 • 1 Third Run • 3, 4, 1 • 2 Fourth Run • 4, 1, 2 • 3
  • 40. Statistical significance in a paper significance estimate variance Remember to take Inter-Annotator Agreement into account
  • 41. How much do you annotate? So you increase the corpus size till that the error margins drop to a value that the experimenter considers sufficient. The smaller the error margins, the finer the comparisons the experimenter can make between algorithms.
  • 42. Applications to Corpus Linguistics • What to annotate • How to develop insights • How to annotate • How much data to annotate • How to avoid mistakes in using the corpus
  • 43. Avoid Mistakes • The problem: ‘train a classifier’ • What is better? – Train with all the data that you have, and then test on all the data that you have? – Train on half and test on the other half?
  • 44. Avoid Mistakes • Training a corpus on a full corpus and then running tests using the same corpus is a bad idea because it is a bit like revealing the questions in the exam before the exam. • A simple algorithm that can game such a test is a plain memorization algorithm that memorizes all the possible inputs and the corresponding outputs.
  • 45. Corpus Splits Split Percentage Training • 60% Validation • 20% Testing • 20% Total • 100%
  • 46. How do you avoid mistakes? Do not train a machine learning algorithm on the ‘testing’ section of the corpus. During the development/tuning of the algorithm, do not make any measurements using the ‘testing’ section, or you’re likely to ‘cheat’ on the feature set, and settings. Use the ‘validation’ section for that. I have seen researchers claim 99.7% accuracy on Indian language POS tagging because they failed to keep the different sections of their corpus sufficiently well separated.