NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING FOR DISCOVERY

MSU Law
Electronic Discovery
Fall 2012
Week 9

GOALS

                     Understand the BLACK BOX.
 Natural language processing
    Mathematical and linguistic concepts
    Models of representation
    Real-world application

 Machine learning
    Common pre-processing and learning algorithms
    Real-world application

 Communicate with software and service vendors!




© Bommarito Consulting
BLACK BOX

 How do we characterize a black box?




[Diagram: a black box labeled with Inputs, Parameters, and Outputs. Example values shown: 3, English, medium.]

BLACK BOX

 Secret: Most black boxes are very similar inside.

 We're going to learn to identify the common parts.




NATURAL LANGUAGE PROCESSING

 Definition: Dealing with real-world text in an automated,
  reproducible way.

 Often referred to as NLP.

 Used somewhat interchangeably with computational
  linguistics.




NATURAL LANGUAGE PROCESSING

Let's start with some text.

   “Hurricane Sandy grounded 3,200 flights scheduled for today and
   tomorrow, prompted New York to suspend subway and bus service and
   forced the evacuation of the New Jersey shore as it headed toward land
   with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its
   path north, may be capable of inflicting as much as $18 billion in
   damage when it barrels into New Jersey tomorrow and knock out power
   to millions for a week or more, according to forecasters and risk
   experts.”

   (Bloomberg article on Sandy)




NATURAL LANGUAGE PROCESSING

What kind of questions can we ask?
 Basic
    What is the structure of the text?
        Paragraphs
        Sentences
        Tokens/words
    What are the words that appear in this text?
        Nouns
            Subjects
            Direct objects
        Verbs

 Advanced
    What are the concepts that appear in this text?
    How does this text compare to other text?




NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization

   “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.”



                 • Segment Types
                    • Paragraphs
                    • Sentences
                    • Tokens


NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization
But how does it work?

 Paragraphs
    Two consecutive line breaks
    A hard line break followed by an indent

 Sentences
    Period, except abbreviation, ellipsis within quotation, etc.

 Tokens and Words
    Whitespace
    Punctuation

Remember what real-world text looks like – think text messages and email.
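
The rules above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product works: the abbreviation list, the function names, and the sample text are all made up for this example, and real-world text will break these simple rules quickly.

```python
import re

# Illustrative abbreviation list (an assumption); real systems use far larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "corp.", "e.g.", "i.e.", "etc."}

def split_paragraphs(text):
    """Split on a blank line, or on a line break followed by an indent."""
    parts = re.split(r"\n\s*\n|\n(?=[ \t]+\S)", text)
    return [part.strip() for part in parts if part.strip()]

def split_sentences(paragraph):
    """End a sentence at '.', '!' or '?' unless the word is a known
    abbreviation or ends in an ellipsis."""
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        ends_sentence = word.endswith((".", "!", "?"))
        is_exception = word.lower() in ABBREVIATIONS or word.endswith("...")
        if ends_sentence and not is_exception:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Split on whitespace and punctuation, keeping numbers such as
    '3,200' and hyphenated words such as 'life-threatening' whole."""
    return re.findall(r"\$?\d[\d,.]*\d|\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Made-up sample text: two paragraphs, three sentences.
sample = "Dr. Lee sent 3,200 emails. Most were short.\n\nThe rest were life-changing!"

paragraphs = split_paragraphs(sample)
sentences = [s for p in paragraphs for s in split_sentences(p)]
tokens = tokenize(sentences[0])
```

Note that "Dr." does not end the first sentence only because it is in the abbreviation list; an abbreviation missing from the list would split the sentence in the wrong place.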


NATURAL LANGUAGE PROCESSING

Segmentation and Tokenization
   “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.”



 Paragraphs: 2
 Sentences: 2
 Words: 561
    ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
     'today', 'and', 'tomorrow', …]


NATURAL LANGUAGE PROCESSING

What kind of questions can we ask?
We now have an ordered list of tokens.

['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for',
'today', 'and', 'tomorrow', …]

      Does the phrase “quote stuffing” occur in the text?
      How many times does “Sandy” occur?
      How often does “outage” occur after “power”?
      What percentage of tokens are numbers?
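
Each of these questions becomes a short function once the text is an ordered list of tokens. The sketch below uses only the ten tokens shown on the slide; the function names and the regular expression for "number" are assumptions made for illustration.

```python
import re

# The first ten tokens from the slide.
tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights",
          "scheduled", "for", "today", "and", "tomorrow"]

def contains_phrase(tokens, phrase):
    """True if the words in `phrase` appear consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def count_token(tokens, word):
    """How many times `word` occurs."""
    return tokens.count(word)

def count_following(tokens, first, second):
    """How many times `second` appears immediately after `first`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

def fraction_numeric(tokens):
    """Share of tokens that look like numbers, e.g. '65', '3,200', '$18'."""
    numeric = [t for t in tokens if re.fullmatch(r"\$?\d[\d,]*(?:\.\d+)?", t)]
    return len(numeric) / len(tokens)

has_quote_stuffing = contains_phrase(tokens, ["quote", "stuffing"])
sandy_count = count_token(tokens, "Sandy")
outage_after_power = count_following(tokens, "power", "outage")
numeric_share = fraction_numeric(tokens)
```

The phrase and "occurs after" questions depend on token order, so they can only be answered from a representation that keeps the order.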




NATURAL LANGUAGE PROCESSING

An Aside on Storage

Data: The word 'the' ten times and the word 'a' ten times.


 Representation 1 - Ordered List:
   ['the', 'a', 'the', 'a', 'the', 'a', …]

 Representation 2 - Term Frequency:
   [('the', 10), ('a', 10)]




NATURAL LANGUAGE PROCESSING

An Aside on Storage
 Representation 1 - Ordered List:
   ['the', 'a', 'the', 'a', 'the', 'a', …]

 Representation 2 - Frequency Map:
   [('the', 10), ('a', 10)]

 Tradeoffs
    Total space
    Ease of answering certain questions
    Information about context

 Not all software makes the same choice!
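
The tradeoff can be made concrete with the slide's own data. This is a small sketch using Python's standard `collections.Counter`; the variable and function names are illustrative assumptions, and the alternating order of the list is assumed from the pattern shown.

```python
from collections import Counter

# Representation 1: ordered list (20 stored items, order preserved).
ordered = ["the", "a"] * 10

# Representation 2: frequency map (2 stored items, order discarded).
frequency = Counter(ordered)

def count_pairs(tokens, first, second):
    """Context question: how often does `second` directly follow `first`?"""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)

the_count = frequency["the"]                   # easy with the frequency map
a_after_the = count_pairs(ordered, "the", "a")  # only possible with the ordered list
```

The frequency map is smaller and answers "how many?" directly, but it cannot answer any question about what appears next to what.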


NATURAL LANGUAGE PROCESSING

Stopwording, Stemming, Parsing, and Tagging
       Stopwording
         Removing “filler” words like prepositions, auxiliary or infinitive verbs, and
          conjunctions.

       Stemming
         Matching declined nouns like dog/dogs or child/children.
         Matching conjugated verbs like run/ran.

       Parsing
         Determining the “structure” of a sentence, typically as represented by a
          grade school sentence diagram (requires grammar definition; we'll skip).

       Tagging
         Identifying the part of speech of each token in a sentence.



NATURAL LANGUAGE PROCESSING

Stopwording
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.

     Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New
   York suspend subway bus service forced evacuation New Jersey shore headed toward
   land life-threatening wind rain.

    System, killed many 65 people Caribbean path north, may capable inflicting much
   $18 billion damage barrels New Jersey tomorrow knock power millions week, according
   forecasters risk experts.
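
A stopword filter of this kind can be sketched as follows. The stopword list here is a small hand-picked assumption, chosen so that it reproduces the first paragraph of the slide's output; real stopword lists are longer and differ between tools.

```python
import string

# Illustrative stopword list (an assumption, not the list used for the slide).
STOPWORDS = {"a", "an", "and", "as", "for", "in", "it", "of",
             "on", "or", "the", "to", "with"}

def remove_stopwords(text):
    """Drop stopwords, comparing case-insensitively and ignoring
    punctuation attached to a word."""
    kept = []
    for word in text.split():
        bare = word.strip(string.punctuation).lower()
        if bare not in STOPWORDS:
            kept.append(word)
    return " ".join(kept)

sentence = ("Hurricane Sandy grounded 3,200 flights scheduled for today and "
            "tomorrow, prompted New York to suspend subway and bus service and "
            "forced the evacuation of the New Jersey shore as it headed toward "
            "land with life-threatening wind and rain.")

filtered = remove_stopwords(sentence)
```

Whether a word such as "toward" counts as a stopword is a choice made by whoever builds the list, which is one reason two tools can return different results for the same search.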




NATURAL LANGUAGE PROCESSING

Stopwording + Stemming
    Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
   prompted New York to suspend subway and bus service and forced the evacuation of
   the New Jersey shore as it headed toward land with life-threatening wind and rain.

    The system, which killed as many as 65 people in the Caribbean on its path north,
   may be capable of inflicting as much as $18 billion in damage when it barrels into New
   Jersey tomorrow and knock out power to millions for a week or more, according to
   forecasters and risk experts.

    Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York
   suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten
   wind rain.

    System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion
   damag barrel New Jersey tomorrow knock power million week, accord forecast risk
   expert.
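
To show the basic idea of stemming, here is a deliberately naive suffix stripper. It is not the algorithm that produced the output above (which drops more endings, e.g. "evacuation" to "evacu"); the suffix list, the minimum stem length, and the function name are assumptions for illustration only.

```python
# Suffixes are checked in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least `min_stem`
    characters remain. Words with no matching suffix are returned unchanged."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)]
    return word

stems = [naive_stem(w) for w in ["flights", "grounded", "inflicting", "forced", "dogs"]]
```

Suffix stripping alone matches dog/dogs but not irregular pairs such as child/children or run/ran; those need a lookup table or a more sophisticated method.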




NATURAL LANGUAGE PROCESSING

Tagging
   Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow,
  prompted New York to suspend subway and bus service and forced the evacuation of
  the New Jersey shore as it headed toward land with life-threatening wind and rain.

   The system, which killed as many as 65 people in the Caribbean on its path north,
  may be capable of inflicting as much as $18 billion in damage when it barrels into New
  Jersey tomorrow and knock out power to millions for a week or more, according to
  forecasters and risk experts.

    [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights',
   'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
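
Once tokens carry tags, filtering by part of speech is a simple lookup. The tags above appear to follow the Penn Treebank convention, in which noun tags begin with "NN" and verb tags begin with "VB"; the sketch below relies on that assumption and uses only the ten tagged tokens shown. The function name is hypothetical.

```python
# The first ten (token, tag) pairs from the slide.
tagged = [("Hurricane", "NNP"), ("Sandy", "NNP"), ("grounded", "VBD"),
          ("3,200", "CD"), ("flights", "NNS"), ("scheduled", "VBN"),
          ("for", "IN"), ("today", "NN"), ("and", "CC"), ("tomorrow", "NN")]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix`, in document order."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
```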




NATURAL LANGUAGE PROCESSING

Back to the black box.




[Diagram: a black box labeled with Inputs, Parameters, and Outputs. Example values shown: 3, English, medium.]

NATURAL LANGUAGE PROCESSING

 Let's say that we're investigating Enron for accounting fraud
related to its reserve reporting and transfers.

 We want to look for any material that discusses reserves and
profits in the same sentence. However, we want cases where
these words are used as nouns; we're not interested in dinner
reservations.


             Inputs           Parameters     Output
             Memos            Stopword: No   Memos
             Research         Stem: Yes      Research
             Emails           Tag: Yes       Emails
             Texts            Search: …      Texts
             Transcriptions                  Transcriptions
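
The search described above (stemming on, tagging on, both terms as nouns in one sentence) might look like the sketch below. The two sentences and their tags are invented and hand-tagged for this example, and the stems and function names are assumptions; it is not a reconstruction of any vendor's search.

```python
def is_noun(tag):
    """Penn Treebank style noun tags start with 'NN' (assumed tag set)."""
    return tag.startswith("NN")

def mentions_all_as_nouns(tagged_sentence, stems):
    """True if every stem matches at least one noun in the sentence."""
    found = set()
    for word, tag in tagged_sentence:
        if not is_noun(tag):
            continue
        for stem in stems:
            if word.lower().startswith(stem):
                found.add(stem)
    return found == set(stems)

STEMS = ("reserv", "profit")

# Hypothetical, hand-tagged sentences.
accounting = [("Reserves", "NNS"), ("were", "VBD"), ("moved", "VBN"),
              ("to", "TO"), ("boost", "VB"), ("profits", "NNS")]
dinner = [("Please", "UH"), ("reserve", "VB"), ("a", "DT"), ("table", "NN"),
          ("to", "TO"), ("celebrate", "VB"), ("our", "PRP$"), ("profits", "NNS")]

accounting_hit = mentions_all_as_nouns(accounting, STEMS)
dinner_hit = mentions_all_as_nouns(dinner, STEMS)
```

The second sentence is rejected because "reserve" is tagged as a verb there, which is the effect the "Tag: Yes" parameter is meant to achieve.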

NATURAL LANGUAGE PROCESSING

 In general, all document search and discovery software
combines the elements discussed above.
      Segment
      Tokenize
      Stopword
      Stem
      Parse
      Tag
      Store
      Search
      Retrieve
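
A toy version of the tokenize, stopword, store, search, and retrieve steps can be written as an inverted index, a structure that maps each token to the documents containing it. The document names, texts, stopword list, and function names below are all hypothetical; this is a sketch of the general idea, not of any specific product.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption).
STOPWORDS = {"a", "an", "and", "are", "at", "the", "to", "were"}

def preprocess(text):
    """Tokenize, lowercase, and remove stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_index(documents):
    """Store: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in preprocess(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Search and retrieve: ids of documents containing every query term."""
    terms = preprocess(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical documents.
documents = {
    "memo-1": "Reserves were transferred to increase profits.",
    "email-7": "Dinner reservations are at eight.",
}

index = build_index(documents)
hits = search(index, "reserves profits")
```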




NATURAL LANGUAGE PROCESSING

 How do they differ?
      Interface and ease-of-use
      De-duplication and versioning
      Supported languages
      Optical character recognition (OCR)
      File formats, e.g., Word, WordPerfect, PDF, HTML
      Ability to scale to large databases.




MACHINE LEARNING

 Definition: Automated classification and prediction on data.

 Examples:
      Product recommenders, a la Amazon
      Computer vision – is it a cat?
      Sentiment analysis
      Topic classification
      Document clustering


 At least two stages to machine learning:
    Training
    Classification



MACHINE LEARNING

Learning

 Machine learning requires “learning” or “training.”

 There are two types of training:
    Supervised
    Unsupervised


 The goal of training is to determine a mapping from input
  features to a set of target classes.




MACHINE LEARNING

Learning
  Imagine a student given a small list of organisms and
descriptions. The student is tasked with assigning the organisms to
groups based on these descriptions. Where do the groups come
from?

 Supervised: The teacher provides the answers.
 Unsupervised: The teacher provides nothing.

 When the student is done with the task, the teacher checks the
student's responses and decides if the student has learned.

 In our example, the teacher will typically provide the “canonical” domains
and kingdoms of biology. However, most real-world problem domains are
not so well-studied.



MACHINE LEARNING

Learning

 What if the teacher gave the student some of the answers?

 This is semi-supervised learning.

 Supervised: The teacher provides the answers.
 Semi-supervised: The teacher provides some answers.
 Unsupervised: The teacher provides nothing.




MACHINE LEARNING

Classification

 The student has now learned to map from an organism's
description to a group.

 Now, the student is sent out into the field to use their
knowledge to classify newly discovered organisms. They
observe the organisms and document the features they learned
to use. Then, they apply the learned rules to determine the
class of organism.
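
The two stages, training and then classification, can be sketched with a very simple supervised method: average the features of each labeled group, then assign a new organism to the closest average. The features (number of legs, number of wings), the example organisms, and the function names are made up for illustration; real systems use many more features and different algorithms.

```python
from collections import defaultdict

def train(examples):
    """Training stage: average the feature vectors of each labeled class."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(centroids, features):
    """Classification stage: pick the class whose average is closest."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=squared_distance)

# Hypothetical labeled examples: (legs, wings) -> group.
examples = [
    ((4, 0), "mammal"),  # dog
    ((4, 0), "mammal"),  # cat
    ((2, 2), "bird"),    # sparrow
    ((2, 2), "bird"),    # eagle
    ((6, 0), "insect"),  # ant
    ((6, 4), "insect"),  # bee
]

model = train(examples)
new_organism = classify(model, (6, 4))  # a newly observed six-legged, four-winged organism
```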




MACHINE LEARNING

This is exactly how predictive coding works!

 Organisms : Documents
 Descriptions : Natural language features or models
 Semi-supervised : Sample coding

 The goal of predictive coding in discovery is to learn to classify
documents based on natural language features, typically into
relevant/irrelevant or privileged/unprivileged.
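
As one possible illustration, a Naive Bayes classifier can be trained on a handful of coded sample documents and then used to code new ones. The documents, tokens, and labels below are invented, and the function names are assumptions; commercial predictive coding tools may use entirely different algorithms and features.

```python
import math
from collections import Counter

def train_naive_bayes(coded_documents):
    """coded_documents: list of (tokens, label) pairs from sample coding."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for tokens, label in coded_documents:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def score(model, tokens):
    """Log-probability score per class, with add-one smoothing so that
    unseen words do not zero out a class."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_prob = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            count = word_counts[label][token] + 1
            log_prob += math.log(count / (total_words + len(vocabulary)))
        scores[label] = log_prob
    return scores

def classify(model, tokens):
    """Return the label with the highest score."""
    scores = score(model, tokens)
    return max(scores, key=scores.get)

# Hypothetical sample coding by a reviewer.
coded = [
    (["reserve", "transfer", "profit"], "relevant"),
    (["profit", "report", "reserve"], "relevant"),
    (["lunch", "friday", "menu"], "irrelevant"),
    (["friday", "golf", "outing"], "irrelevant"),
]

model = train_naive_bayes(coded)
prediction = classify(model, ["reserve", "profit", "memo"])
```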




MACHINE LEARNING

Some Machine Learning Algorithms
 Supervised
    Statistical models
       Bayesian, e.g., Naïve Bayes Classification
       Frequentist, e.g., Ordinary Least Squares.
    Neural Networks (NN)
    Support Vector Machines (SVM)
    Random Forests (RF)
    Genetic Algorithms (GA)
 Semi/unsupervised
    Neural Networks (NN)
    Clustering
          K-means
          Hierarchical
          Radial Basis (RBF)
          Graph
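
To give a feel for unsupervised clustering, here is a minimal one-dimensional K-means sketch. No labels are provided; the algorithm only groups values that are close together. The input values, the starting centroids, and the function name are assumptions chosen to keep the example small.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Repeat two steps: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical measurements, e.g. document lengths in pages.
values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[0, 5])
```

The number of groups (here two) has to be chosen in advance, and the groups come back without names; deciding what each group means is left to the reviewer.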

MACHINE LEARNING

Notes on Algorithm Diversity

 Not all algorithms return scores; some are binary.
    True, True, False
    0.9, 0.7, 0.1
 Not all algorithms support more than two classes.
    Cat, Dog, Mouse
    Cat, Not Cat
 Not all algorithms scale similarly.
    1M documents = 1 day
    10M documents = {10 days, 100 days, 1000 days}
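
Two of these points can be worked through numerically. Scores can be turned into binary answers by choosing a cutoff, and the three run times above are what one would expect if processing time grew linearly, quadratically, or cubically with the number of documents. Both the 0.5 cutoff and the growth-rate reading are assumptions for illustration; the function names are hypothetical.

```python
def to_binary(scores, threshold=0.5):
    """Convert scores to True/False using a cutoff (0.5 is an assumed default)."""
    return [score >= threshold for score in scores]

def projected_days(base_days, base_docs, new_docs, exponent):
    """Estimate run time if cost grows as (number of documents) ** exponent."""
    return base_days * (new_docs / base_docs) ** exponent

binary = to_binary([0.9, 0.7, 0.1])

# 1M documents take 1 day; project 10M documents under three growth rates.
estimates = [projected_days(1, 1_000_000, 10_000_000, k) for k in (1, 2, 3)]
```

A score-returning algorithm leaves the choice of cutoff to the user, so the same model can produce different sets of "relevant" documents depending on where the line is drawn.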




THANKS!

        You can get these slides on my blog – http://bommaritollc.com/blog/.




                              Michael J Bommarito II
                                 CEO, Bommarito Consulting, LLC
                                 Email: michael@bommaritollc.com
                                 Web: http://bommaritollc.com/




REFERENCES

 Books and Wiki Pages
     A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß.
         http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
     Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya,
      Zhang, Damerau.
         http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
     The Elements of Statistical Learning.
         http://www-stat.stanford.edu/~tibs/ElemStatLearn/
     Wiki – Machine Learning.
         http://en.wikipedia.org/wiki/Machine_learning
     Wiki – Machine Learning Algorithms.
         http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
 Software
     Natural Language Toolkit (NLTK).
         http://nltk.org/
     Stanford NLP Group.
         http://nlp.stanford.edu/software/
     Weka.
         http://www.cs.waikato.ac.nz/ml/weka/
     R.
         http://www.r-project.org/
     SAS Predictive Analytics and Data Mining.
         http://www.sas.com/technologies/analytics/datamining/index.html

Natural Language Processing and Machine Learning for Discovery

  • 1. NATURAL LANGUAGE PROCESSING MSU Law AND MACHINE LEARNING Electronic Discovery Fall 2 01 2 FOR DISCOVERY Week 9
  • 2. GOALS Understand the BLACK BOX.  Natural language processing  Mathematical and linguistic concepts  Models of representation  Real-world application  Machine learning  Common pre-processing and learning algorithms  Real-world application  Communicate with software and service vendors! © Bommarito Consulting
  • 3. BLACK BOX  How do we characterize a black box? 3 English medium Inputs Parameters Outputs © Bommarito Consulting
  • 4. BLACK BOX  Secret: Most black boxes are ? very similar inside.  We‟re going to learn to identify the common parts. © Bommarito Consulting
  • 5. NATURAL LANGUAGE PROCESSING  Definition: Dealing with real-world text in an automated, reproducible way.  Often referred to as NLP.  Used somewhat interchangeably with computational linguistics. © Bommarito Consulting
  • 6. NATURAL LANGUAGE PROCESSING Let‟s start with some text. “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy) © Bommarito Consulting
  • 7. NATURAL LANGUAGE PROCESSING What kind of questions can we ask?  Basic  What is the structure of the text?  Paragraphs  Sentences  Tokens/words  What are the words that appear in this text?  Nouns  Subjects  Direct objects  Verbs  Advanced  What are the concepts that appear in this text?  How does this text compare to other text? © Bommarito Consulting
  • 8. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” • Segments Types • Paragraphs • Sentences • Tokens © Bommarito Consulting
  • 9. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization But how does it work?  Paragraphs  Two consecutive line breaks  A hard line break followed by an indent  Sentences  Period, except abbreviation, ellipsis within quotation, etc.  Tokens and Words  Whitespace  Punctuation Remember what real -world text looks like – think text and email. © Bommarito Consulting
  • 10. NATURAL LANGUAGE PROCESSING Segmentation and Tokenization “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”  Paragraphs: 2  Sentences: 2  Words: 561 .  ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow„, …] © Bommarito Consulting
  • 11. NATURAL LANGUAGE PROCESSING What kind of questions can we ask? We now have an ordered list of tokens. ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow„, …]  Does the word phrase “quote stuffing” occur in the text?  How many times does “Sandy” occur?  How often does “outage” occur after “power?”  What percentage of tokens are numbers? © Bommarito Consulting
  • 12. NATURAL LANGUAGE PROCESSING An Aside on Storage D ata: The word „the‟ ten times and the word ‘a’ ten times.  Representation 1 - Ordered List:  [‘the‟, „a‟, „the‟, „a‟, „the‟, „a‟, …]  Representation 2 – Term Frequency:  [(„the‟, 10), („a‟, 10)] © Bommarito Consulting
  • 13. NATURAL LANGUAGE PROCESSING An Aside on Storage  Representation 1 - Ordered List:  [‘the‟, „a‟, „the‟, „a‟, „the‟, „a‟, …]  Representation 2 - Frequency Map:  [(„the‟, 10), („a‟, 10)]  Tradeoffs  Total space  Ease of answering certain questions  Information about context  Not all software make the same choice! © Bommarito Consulting
  • 14. NATURAL LANGUAGE PROCESSING Stopwording, Stemming, Parsing, and Tagging  Stopwording  Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.  Stemming  Matching declined nouns like dog/dogs or child/children.  Matching conjugated verbs like run/ran.  Parsing  Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we‟ll skip).  Tagging  Identifying the part of speech of each token in a sentence. © Bommarito Consulting
  • 15. NATURAL LANGUAGE PROCESSING Stopwording Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts. © Bommarito Consulting
  • 16. NATURAL LANGUAGE PROCESSING Stopwording + Stemming Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert. © Bommarito Consulting
  • 17. NATURAL LANGUAGE PROCESSING Tagging Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts. [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …] © Bommarito Consulting
  • 18. NATURAL LANGUAGE PROCESSING Back to the black box. 3 English medium Inputs Parameters Outputs © Bommarito Consulting
  • 19. NATURAL LANGUAGE PROCESSING Let‟s say that we‟re investigating Enron for accounting fraud related to its reserve reporting and transfers. We want to look for any material that discusses reserves and profits in the same sentence. However, we want cases where these words are used as nouns; we‟re not interested in dinner reservations. Inputs Parameters Output Memos Stopword: No Memos Research Stem: Yes Research Emails Tag: Yes Emails Texts Search: … Texts Transcriptions Transcriptions © Bommarito Consulting
  • 20. NATURAL LANGUAGE PROCESSING In general, all document search and discovery software combines the elements discussed above.  Segment  Tokenize  Stopword  Stem  Parse  Tag  Store  Search  Retrieve © Bommarito Consulting
  • 21. NATURAL LANGUAGE PROCESSING
    How do they differ?
    • Interface and ease-of-use
    • De-duplication and versioning
    • Supported languages
    • Optical character recognition (OCR)
    • File formats, e.g., Word, WordPerfect, PDF, HTML
    • Ability to scale to large databases
  • 22. MACHINE LEARNING
    • Definition: Automated classification and prediction on data.
    • Examples:
      • Product recommenders, a la Amazon
      • Computer vision – is it a cat?
      • Sentiment analysis
      • Topic classification
      • Document clustering
    • At least two stages to machine learning:
      • Training
      • Classification
  • 23. MACHINE LEARNING: Learning
    • Machine learning requires “learning” or “training.”
    • There are two types of training:
      • Supervised
      • Unsupervised
    • The goal of training is to determine a mapping from input features to a set of target classes.
  • 24. MACHINE LEARNING: Learning
    Imagine a student given a small list of organisms and descriptions. The student is tasked with assigning the organisms to groups based on these descriptions. Where do the groups come from?
    • Supervised: The teacher provides the answers.
    • Unsupervised: The teacher provides nothing.
    When the student is done with the task, the teacher checks the student's responses and decides if the student has learned.
    In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
  • 25. MACHINE LEARNING: Learning
    What if the teacher gave the student some of the answers? This is semi-supervised learning.
    • Supervised: The teacher provides the answers.
    • Semi-supervised: The teacher provides some answers.
    • Unsupervised: The teacher provides nothing.
  • 26. MACHINE LEARNING: Classification
    The student has now learned to map from an organism's description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
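The two stages, training and classification, can be shown with the student analogy in code. This is a supervised sketch using a nearest-centroid rule, picked for brevity rather than because the slides prescribe it. The features (number of legs, has wings, has fur) and the labels are invented for illustration.

```python
import math
from collections import defaultdict


def train(examples):
    """Training: average the feature vectors of each labeled group."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(centroids, features):
    """Classification: pick the group whose average is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Teacher-provided answers. Features: (legs, has wings, has fur), all hypothetical.
labeled = [
    ((6, 1, 0), "insect"),
    ((6, 0, 0), "insect"),
    ((4, 0, 1), "mammal"),
    ((2, 0, 1), "mammal"),
]
centroids = train(labeled)

# Out in the field: organisms the student has not seen before.
print(classify(centroids, (6, 1, 0)))
print(classify(centroids, (4, 0, 1)))
```

The learned "rules" here are just two average feature vectors; everything the student knows about a group is compressed into that centroid.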
  • 27. MACHINE LEARNING
    This is exactly how predictive coding works!
    • Organisms : Documents
    • Descriptions : Natural language features or models
    • Semi-supervised : Sample coding
    The goal of predictive coding in discovery is to learn to classify documents based on natural language features, typically into relevant/irrelevant or privileged/unprivileged.
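As a hedged illustration of predictive coding, the sketch below trains a small Naïve Bayes classifier on a handful of reviewer-coded documents and then codes new ones as relevant or irrelevant. The sample documents are invented, the features are plain whitespace-split words, and real systems use far larger samples and richer features.

```python
import math
from collections import Counter


def train_naive_bayes(coded_sample):
    """Count words per class from the reviewer-coded sample."""
    word_counts = {}
    doc_counts = Counter()
    vocabulary = set()
    for text, label in coded_sample:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((counts[token] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented sample coding: a reviewer has labeled four documents.
coded_sample = [
    ("reserve transfer boosted reported profit", "relevant"),
    ("adjust reserve before profit report", "relevant"),
    ("lunch order for friday", "irrelevant"),
    ("parking pass renewal friday", "irrelevant"),
]
model = train_naive_bayes(coded_sample)

print(classify(model, "profit reserve adjustment"))
print(classify(model, "friday lunch"))
```

The word "adjustment" never appears in the coded sample, so it contributes nothing beyond the smoothing term; the call on that document rests on "profit" and "reserve".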
  • 28. MACHINE LEARNING: Some Machine Learning Algorithms
    • Supervised
      • Statistical models
        • Bayesian, e.g., Naïve Bayes Classification
        • Frequentist, e.g., Ordinary Least Squares
      • Neural Networks (NN)
      • Support Vector Machines (SVM)
      • Random Forests (RF)
      • Genetic Algorithms (GA)
    • Semi/unsupervised
      • Neural Networks (NN)
      • Clustering
        • K-means
        • Hierarchical
        • Radial Basis (RBF)
        • Graph
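For contrast with the supervised methods, here is a bare-bones K-means sketch on a single numeric feature. No labels are supplied; the algorithm invents the groups. The feature (document length in pages), the data, and the naive "first k points" initialization are all assumptions made for the example.

```python
def k_means_1d(points, k, iterations=10):
    """Cluster one-dimensional points into k groups."""
    centroids = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical feature: document length in pages.
lengths = [1, 2, 3, 40, 42, 44]
centroids, clusters = k_means_1d(lengths, k=2)
print(centroids)
print(clusters)
```

The result separates short documents from long ones without anyone having said that "short" and "long" are the categories of interest, which is both the appeal and the risk of unsupervised methods.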
  • 29. MACHINE LEARNING: Notes on Algorithm Diversity
    • Not all algorithms return scores; some are binary.
      • True, True, False
      • 0.9, 0.7, 0.1
    • Not all algorithms support more than two classes.
      • Cat, Dog, Mouse
      • Cat, Not Cat
    • Not all algorithms scale similarly.
      • 1M documents = 1 day
      • 10M documents = {10 days, 100 days, 1000 days}
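Two of these points reduce to simple arithmetic. The sketch below turns scores into binary calls with a cutoff, and projects runtime under the assumption that it grows as a power of the number of documents. Reading the slide's 10, 100, and 1000 days as linear, quadratic, and cubic scaling is an interpretation, and the 0.5 threshold is an arbitrary choice.

```python
def to_binary(scores, threshold=0.5):
    """Turn per-document scores into yes/no calls at an assumed cutoff."""
    return [score >= threshold for score in scores]


def projected_days(base_docs, base_days, new_docs, exponent):
    """Assume runtime grows as (number of documents) ** exponent."""
    return base_days * (new_docs / base_docs) ** exponent


print(to_binary([0.9, 0.7, 0.1]))

# 1M documents take 1 day; how long do 10M take under each growth rate?
for exponent in (1, 2, 3):
    print(exponent, projected_days(1_000_000, 1, 10_000_000, exponent))
```

A score-based tool lets the reviewer move the threshold after the fact; a binary tool has already made that choice. Likewise, a tenfold increase in volume costs ten times as long only if the algorithm scales linearly.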
  • 30. THANKS!
    You can get these slides on my blog – http://bommaritollc.com/blog/.
    • Michael J Bommarito II
    • CEO, Bommarito Consulting, LLC
    • Email: michael@bommaritollc.com
    • Web: http://bommaritollc.com/
  • 31. REFERENCES
    • Books and Wiki Pages
      • A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß. http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
      • Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau. http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
      • The Elements of Statistical Learning. http://www-stat.stanford.edu/~tibs/ElemStatLearn/
      • Wiki – Machine Learning. http://en.wikipedia.org/wiki/Machine_learning
      • Wiki – Machine Learning Algorithms. http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
    • Software
      • Natural Language Toolkit (NLTK). http://nltk.org/
      • Stanford NLP Group. http://nlp.stanford.edu/software/
      • Weka. http://www.cs.waikato.ac.nz/ml/weka/
      • R. http://www.r-project.org/
      • SAS Predictive Analytics and Data Mining. http://www.sas.com/technologies/analytics/datamining/index.html