October 2013

Machine Learning for Language Technology

Lecture 7:
Learning from Massive Datasets
Marina Santini, Uppsala University
Department of Linguistics and Philology
Outline
Watch the pitfalls
Learning from massive datasets
Data Mining
Text Mining – Text Analytics
Web Mining
Big Data
Programming Languages and Frameworks for Big Data
Big Textual Data & Commercial Applications
Events, MeetUps, Coursera

Lect. 7: Learning from Massive Datasets
Practical Machine Learning

Data Mining
Data mining is the extraction of implicit, previously
unknown and potentially useful information from data
(Witten and Frank, 2005)



Watch out!
Machine Learning is not just about:

1. Finding data and blindly applying learning algorithms to it
2. Blindly comparing machine learning methods:
   1. Model complexity
   2. Representativeness of training data distribution
   3. Reliability of class labels

Remember: Practitioners’ expertise counts!

Massive Datasets
Space and Time
Three ways to make learning feasible (the old way):
Small subset
Parallelization
Data chunks

The new way:
Develop new algorithms with lower computational complexity
Increase background knowledge
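
The "data chunks" idea listed above can be sketched in a few lines. The example below is a minimal illustration, assuming a multinomial Naive Bayes text classifier whose counts are updated one chunk at a time, so that the full dataset never has to be held in memory at once. The class name `IncrementalNB`, the helper `chunks` and the toy documents are hypothetical and do not come from the lecture.

```python
import math
from collections import Counter, defaultdict


class IncrementalNB:
    """Multinomial Naive Bayes whose counts are updated chunk by chunk."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def partial_fit(self, chunk):
        """Update the counts from one chunk of (tokens, label) pairs."""
        for tokens, label in chunk:
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the most probable class, using add-one smoothing."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy data, invented for illustration only.
data = [
    ("great film loved it".split(), "pos"),
    ("terrible plot hated it".split(), "neg"),
    ("loved the great acting".split(), "pos"),
    ("hated the terrible ending".split(), "neg"),
]

model = IncrementalNB()
for chunk in chunks(data, 2):  # only one chunk is processed at a time
    model.partial_fit(chunk)

prediction = model.predict("great acting".split())
```

In a real setting the chunks would be read from disk or a stream rather than sliced from an in-memory list; the point is only that the model state (the counts) is much smaller than the data.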
Domain Knowledge


Metadata
Semantic relation
Causal relation
Functional dependencies




Text Mining
Actionable information
Comprehensible information
Problems
Text Analytics

Definition: Text Analytics
A set of NLP techniques that provide some structure
to textual documents and help identify and extract
important information.



Set of NLP (Natural Language Processing) techniques

Common components of a text analytics package are:
Tokenization
Morphological Analysis
Syntactic Analysis
Named Entity Recognition
Sentiment Analysis
Automatic Summarization
Etc.
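
Tokenization, the first component in the list above, can be illustrated with a small regular-expression tokenizer. This is only a sketch under simplifying assumptions (English-like text and a hand-written pattern); the function name `tokenize` and the pattern are illustrative and are not taken from any of the packages mentioned in this lecture.

```python
import re

# Words (optionally with an internal apostrophe or hyphen), numbers, or
# single punctuation marks. The pattern is a deliberate simplification.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*|\d+(?:[.,]\d+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into a list of word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Told him that I had pizza for lunch. Response? No fair")
```

Later components such as morphological analysis or named entity recognition typically consume a token list like this one, which is why tokenization errors tend to propagate through the rest of a pipeline.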

NLP at Coursera (www.coursera.org)

NLP is pervasive
Ex: spell-checkers
Google Search
Google Mail
Facebook
Office Word
[…]

NLP is pervasive
Ex: Named Entity Recognition
Opinion mining
Brand Trends
Conversation clouds on web magazines and online newspapers…

Sentiment Analysis

Text Analytics Products and Frameworks


Commercial Products:
Attensity
Clarabridge
Temis
Lexalytics
Texify
SAS
SPSS
IBM Cognos
etc.

Open Source Frameworks:
GATE
NLTK
UIMA
openNLP
etc.

However… (I)


NLP tools and applications (both commercial and
open source) are not perfect. Research is still very
active in all NLP fields.

Ex: Syntactic Parser


Connexor



What about parsing a tweet?
“My son, Ky/o, asked me for the first time today how my
DAY was . . . I about melted. Told him that I had pizza for
lunch. Response? No fair” (Twitter Tutorial 1: How to
Tweet Well)



Why NLP and Text Analytics for Text Mining?


Why is it important to know that a word is a noun,
a verb or the name of a brand?

Broadly speaking (think about these as features for
a classification problem!):

Nouns and verbs (a.k.a. content words): Nouns are
important for topic detection; verbs are important if you
want to identify actions or intentions.
Adjectives = sentiment identification.
Function words (a.k.a. stop words) are important for
authorship attribution, plagiarism detection, etc.
etc.
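
To make the "features for a classification problem" remark concrete, the sketch below turns a tokenized text into a vector of relative function-word frequencies, the kind of feature often used for authorship attribution. The word list is a tiny hand-picked assumption (real stop-word lists are much longer), and the function name `function_word_features` is hypothetical.

```python
from collections import Counter

# A tiny, hand-picked function-word list (an assumption for illustration).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "i", "for", "it"]


def function_word_features(tokens):
    """Return the relative frequency of each function word in `tokens`."""
    lowered = [tok.lower() for tok in tokens]
    counts = Counter(lowered)
    total = len(lowered) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


text = "Told him that I had pizza for lunch"
features = function_word_features(text.split())
```

The resulting list has one position per function word, so vectors from different texts are directly comparable and can be fed to any of the classifiers covered earlier in the course.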
However… (II)


At present, the main pitfall of many NLP applications
is that they are not flexible enough to:
Completely disambiguate language
Identify how language is used in different types of
documents (a.k.a. genres).

For instance, language is used differently in tweets
than in emails, the language used in emails differs
from the language used in academic papers, etc.
Tweaking NLP tools to handle different types of
text, or resolving language ambiguity in an ad-hoc
manner, is often time-consuming, difficult and
unrewarding…

What for?










Text summarization
Document clustering
Authorship attribution
Automatic metadata extraction
Entity extraction
Information extraction
Information discovery
ACTIONABLE INTELLIGENCE

Actionable Textual Intelligence


Business Intelligence (BI) + Customer Analytics + Social
Network Analytics + Crisis Intelligence […] = Actionable
Intelligence



Actionable Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! must give the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!

Big Data


BIG DATA [Wikipedia]:


Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and
process the data within a tolerable elapsed time. Big data sizes
are a constantly moving target, as of 2012 ranging from a few
dozen terabytes to many petabytes of data in a single data
set. With this difficulty, new platforms of "big data" tools are being
developed to handle various aspects of large quantities of data.



Examples include Big Science, web logs, RFID, sensor networks,
social networks, social data (due to the social data revolution),
Internet text and documents, Internet search indexing, call detail
records, astronomy, atmospheric science, genomics, biogeochemical,
biological, and other complex and often interdisciplinary scientific
research, military surveillance, medical records, photography
archives, video archives, and large-scale e-commerce.

Big Unstructured
TEXTUAL Data

Merrill Lynch is one of the world's leading
financial management and advisory companies,
providing financial advice.



“Merrill Lynch estimates that more than 85 percent of all
business information exists as unstructured data –
commonly appearing in e-mails, memos, notes from call
centers and support operations, news, user groups,
chats, reports, letters, surveys, white papers,
marketing material, research, presentations and web
pages.” [DM Review Magazine, February 2003 Issue]



ECONOMIC LOSS!

Simple search is not enough…


Of course, it is possible to use simple search. But
simple search is unrewarding, because it is based on
single terms.

“a search is made on the term felony. In a simple search,
the term felony is used, and everywhere there is a
reference to felony, a hit to an unstructured document is
made. But a simple search is crude. It does not find
references to crime, arson, murder, embezzlement,
vehicular homicide, and such, even though these crimes
are types of felonies” [ Source: Inmon, B. & A. Nesavich,
"Unstructured Textual Data in the Organization" from
"Managing Unstructured data in the organization",
Prentice Hall 2008, pp. 1–13]
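
The limitation described in the quotation can be illustrated with a toy comparison between a single-term search and a search expanded with narrower terms. This is a sketch that assumes a hand-made term hierarchy: the dictionary `NARROWER_TERMS`, the function names and the sample documents are all hypothetical, and plain substring matching stands in for a real search engine.

```python
# Hand-made hierarchy: a term mapped to more specific terms (illustrative only).
NARROWER_TERMS = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}


def simple_search(term, documents):
    """Return indices of documents that contain the literal term."""
    return [i for i, doc in enumerate(documents) if term in doc.lower()]


def expanded_search(term, documents):
    """Return indices of documents containing the term or a narrower term."""
    terms = [term] + NARROWER_TERMS.get(term, [])
    return [
        i for i, doc in enumerate(documents)
        if any(t in doc.lower() for t in terms)
    ]


docs = [
    "The suspect was charged with a felony.",
    "Police are investigating a case of arson downtown.",
    "The company reported record profits.",
    "He was convicted of embezzlement last year.",
]

simple_hits = simple_search("felony", docs)
expanded_hits = expanded_search("felony", docs)
```

Here the simple search finds only the first document, while the expanded search also retrieves the documents about arson and embezzlement. The hard part in practice is obtaining the hierarchy, which is where domain knowledge and text analytics come in.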
Programming languages and
frameworks for big data

http://www.r-project.org/

R


R is a statistical programming language. It is a free
software programming language and a software
environment for statistical computing and graphics. The
R language is widely used among statisticians and data
miners for developing statistical software and data
analysis. Polls and surveys of data miners are showing
R's popularity has increased substantially in recent years
(wikipedia)

MeetUps: R in Stockholm

Can R help out?


Can R help overcome NLP shortcomings and open a
new direction in order to extract useful information
from Big TEXTUAL Data?

Existing literature for linguists


Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical
Introduction. De Gruyter Mouton. New Edition.

Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical
Introduction. Routledge, Taylor & Francis Group (companion website).

R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction
to Statistics using R. Cambridge.
….



Companion website by Stefan Th. Gries


BNC=British National Corpus (PoS tagged)

BNC


The British National Corpus (BNC) is a 100 million word collection of
samples of written and spoken language from a wide range of
sources, designed to represent a wide cross-section of British
English from the later part of the 20th century, both spoken and
written. The latest edition is the BNC XML Edition, released in 2007.



The corpus is encoded according to the Guidelines of the Text
Encoding Initiative (TEI) to represent both the output from CLAWS
(automatic part-of-speech tagger) and a variety of other structural
properties of texts (e.g. headings, paragraphs, lists etc.). Full
classification, contextual and bibliographic information is also
included with each text in the form of a TEI-conformant header.
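
As a rough illustration of what PoS-tagged XML looks like to a program, the sketch below reads word elements and their CLAWS tags from a single sentence. The fragment is hand-written to imitate the word-level markup of the BNC XML Edition, and the element and attribute names (`w`, `c5`, `hw`, `pos`) are an assumption to be checked against the corpus documentation; it is not real corpus data. The function name `tagged_words` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating BNC XML word markup (not real corpus data).
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def tagged_words(sentence_xml):
    """Return (word, C5 tag) pairs for every <w> element in the sentence."""
    root = ET.fromstring(sentence_xml)
    return [((w.text or "").strip(), w.get("c5")) for w in root.iter("w")]


pairs = tagged_words(SAMPLE)
```

Word and tag pairs extracted this way are a typical starting point for the frequency counts and feature extraction discussed in this lecture.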

R & the BNC: Excerpt from Google Books

What about Big Textual Data?





Non-standardized language
Non-standard texts
Electronic documents of all kinds, e.g. formal,
informal, short, long, private, public, etc.

Non-distributed systems

Open Source:
R
Scala (also distributed systems)
Rapid Miner
Weka
…

Commercial:
SPSS
SAS
MatLab
…

The name Scala is a portmanteau of "scalable" and "language", signifying
that it is designed to grow with the demands of its users. James Strachan,
the creator of Groovy, described Scala as a possible successor to Java.

From The Economist:
The Big Data scenario

Commercial applications for Big Textual Data


Recorded Future: web intelligence (anticipating
emerging threats, future trends, anticipating
competitors’ actions, etc.)

Gavagai: large-scale textual analysis (prediction
and future trends)

Thanks to Staffan Truvé for the following slides

Size

In a few pictures…

Metrics, structure and time

Metric

Structure

Time

Facts

Pipeline

Multi-Language

Text Analytics

Predictions

Gavagai




Jussi Karlgren (PhD in Stylistics in Information Retrieval)
Magnus Sahlgren (PhD thesis in distributional semantics)
Fredrik Olsson (PhD thesis in Active Learning)
(co-workers at SICS)

The indeterminacy of translation is a thesis propounded by 20th-century
American analytic philosopher W. V. Quine.
Quine uses the example of the word "gavagai" uttered by a
native speaker of the unknown language Arunta upon seeing a
rabbit. A speaker of English could do what seems natural and
translate this as "Lo, a rabbit." But other translations would be
compatible with all the evidence he has: "Lo, food"; "Let's go
hunting"; "There will be a storm tonight" (these natives may be
superstitious)… (wikipedia)

Ethersource presented
Thanks to F. Olsson for the following slides

Associations

Language is flux

Learning from use

Scope

Architecture

Web vs printed world

Noise…

Multi-linguality

SICS
Watch the videos!

Big Data MeetUp, Stockholm

BIG DATA communities

Future Directions in Machine Learning for
Language Technology





Deluge of data
Little linguistic analysis in the realm of big-data real-world platforms and applications
Top-down systems cannot efficiently deal with
irregularity and unpredictability of big textual data
Data-driven systems can make it. However,
…we know that computers are not at ease with natural
languages used by humans, unless they learn how to
learn linguistic structure underlying natural language from
data…

For a data-driven approach…






Annotated datasets that are needed for complete
supervised machine learning are costly, time-consuming
and require specialist expertise.
Is complete supervision even thinkable when we talk
about tera-, peta- or yottabytes? How big should
the training set then be?
Alternative solutions:
Semi-supervised methods (combination of labelled and
unlabelled data)
Weakly supervised methods (human-constructed rules
are typically used to guide the unsupervised learner)
Unsupervised learning results still cannot compete with
supervised learning in many tasks…
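
One common semi-supervised scheme is self-training: a model trained on the few labelled examples labels the unlabelled ones, and its most confident guesses are added to the training set. The sketch below is a minimal illustration, assuming a nearest-centroid classifier and a distance margin as the confidence measure. All names (`self_train`, `centroid`, `distance`) and the toy points are hypothetical, and this is only one of many possible semi-supervised methods.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train(labelled, unlabelled, rounds=3, per_round=2):
    """Grow the labelled set with the most confidently classified points.

    labelled:   list of (vector, label) pairs
    unlabelled: list of vectors
    """
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        by_label = {}
        for vec, lab in labelled:
            by_label.setdefault(lab, []).append(vec)
        centroids = {lab: centroid(vecs) for lab, vecs in by_label.items()}
        scored = []
        for vec in pool:
            dists = sorted((distance(vec, c), lab) for lab, c in centroids.items())
            # Confidence: gap between the nearest and second-nearest centroid.
            margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
            scored.append((margin, vec, dists[0][1]))
        scored.sort(key=lambda item: item[0], reverse=True)
        for _, vec, lab in scored[:per_round]:
            labelled.append((vec, lab))  # trust the model's own prediction
            pool.remove(vec)
    return labelled


# Toy data, invented for illustration: two labelled seeds, four unlabelled points.
seed = [([0.0, 0.0], "A"), ([10.0, 10.0], "B")]
unlabelled = [[1.0, 0.5], [9.0, 9.5], [2.0, 2.0], [8.0, 7.5]]
result = self_train(seed, unlabelled)
```

The risk of this scheme is that early mistakes are reinforced in later rounds, which is one reason why such methods do not automatically match fully supervised learning.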
A new way to explore: Incomplete Supervision


Relies on partially labelled data:

“Human experts – or possibly a crowd of laymen –
annotate text with some linguistic structure related to the
structure that one wants to predict. This data is then used for
partially supervised learning with a statistical model that
exploits the annotated structure to infer the linguistic
structure of interest.” p. 4

Example






“…it is possible to construct accurate and robust part-of-speech
taggers for a wide range of languages, by combining (1) manually
annotated resources in English, or some other language for which
such resources are already available, with (2) a crowd-sourced
target-language specific lexicon, which lists the potential parts of
speech that each word may take in some context, at least for a
subset of the words.
Both (1) and (2) only provide partial information for the part-of-speech
tagging task. However, taken together they turn out to
provide substantially more information than either taken alone.” p. 46
Oscar Täckström “Predicting Linguistic Structure with Incomplete and
Cross-Lingual Supervision” PhD Thesis, Uppsala University, 2013
(http://soda.swedish-ict.se/5513/)
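
The intuition behind combining two partial sources of information can be sketched very simply: a lexicon that lists the possible tags of a word is used to filter the predictions of some tagger. This toy example is not the model used in the thesis; it assumes that per-tag scores are already available from some source (for instance a tagger transferred from another language), and the function name `constrained_tag`, the lexicon and the scores are all invented for illustration.

```python
def constrained_tag(word, scores, lexicon):
    """Pick the best-scoring tag among those the lexicon allows for `word`.

    scores:  dict mapping tag -> score from some tagger (assumed given)
    lexicon: dict mapping word -> set of possible tags (may be incomplete)
    """
    allowed = lexicon.get(word.lower())
    candidates = {t: s for t, s in scores.items() if not allowed or t in allowed}
    if not candidates:  # lexicon tags unknown to the tagger: fall back
        candidates = scores
    return max(candidates, key=candidates.get)


# Toy lexicon and scores, invented for illustration.
lexicon = {"book": {"NOUN", "VERB"}, "the": {"DET"}}
scores = {"ADJ": 0.5, "NOUN": 0.3, "VERB": 0.2}

tag_with_lexicon = constrained_tag("book", scores, lexicon)
tag_without_lexicon = constrained_tag("unknownword", scores, lexicon)
```

With the lexicon, the implausible top-scoring tag is ruled out for "book"; for a word the lexicon does not cover, the tagger's scores are used unchanged. Neither source is sufficient on its own, which mirrors the point made in the quotation.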

Conclusions






This course is an introduction to
“Machine Learning for Language
Technology”.
You get a flavour of the
problems we come across when
devising models for enabling
machines to analyse and make
sense of natural human
language.

The next big big big step is to
bring as much linguistic
awareness as possible into big
data.

Reading


Witten and Frank (2005) Ch. 8

Thanx for your attention!

Unit-V; Pricing (Pharma Marketing Management).pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 

Lecture 7: Learning from Massive Datasets

  • 1. October 2013. Machine Learning for Language Technology. Lecture 7: Learning from Massive Datasets. Marina Santini, Uppsala University, Department of Linguistics and Philology.
  • 2. Outline: Watch the pitfalls. Learning from massive datasets: Data Mining; Text Mining – Text Analytics; Web Mining; Big Data. Programming Languages and Frameworks for Big Data. Big Textual Data & Commercial Applications. Events, MeetUps, Coursera.
  • 3. Practical Machine Learning
  • 4. Data Mining: Data mining is the extraction of implicit, previously unknown and potentially useful information from data (Witten and Frank, 2005).
  • 5. Watch out! Machine learning is not just about: (1) finding data and blindly applying learning algorithms to it; (2) blindly comparing machine learning methods: (a) model complexity, (b) representativeness of the training data distribution, (c) reliability of class labels. Remember: practitioners' expertise counts!
  • 6. Massive Datasets: Space and Time. Three ways to make learning feasible (the old way): small subset; parallelization; data chunks. The new way: develop new algorithms with lower computational complexity; increase background knowledge.
  • 7. Domain Knowledge. Metadata: semantic relation; causal relation; functional dependencies.
  • 8. Text Mining: actionable information; comprehensible information; problems. Text Analytics.
  • 9. Definition: Text Analytics. A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.
  • 10. Set of NLP (Natural Language Processing) techniques. Common components of a text analytics package are: tokenization; morphological analysis; syntactic analysis; named entity recognition; sentiment analysis; automatic summarization; etc.
  • 11. NLP at Coursera (www.coursera.org)
  • 12. NLP is pervasive. Ex: spell-checkers: Google Search; Google Mail; Facebook; Office Word; […]
  • 13. NLP is pervasive. Ex: Named Entity Recognition: opinion mining; brand trends; conversation clouds on web magazines and online newspapers…
  • 14. Sentiment Analysis
  • 15. Text Analytics Products and Frameworks. Commercial products: Attensity; Clarabridge; Temis; Lexalytics; Texify; SAS; SPSS; IBM Cognos; etc. Open source frameworks: GATE; NLTK; UIMA; openNLP; etc.
  • 16. However… (I) NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP fields.
  • 17. Ex: Syntactic Parser: Connexor. What about parsing a tweet? “My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair” (Twitter Tutorial 1: How to Tweet Well)
  • 18. Why NLP and Text Analytics for Text Mining? Why is it important to know that a word is a noun, or a verb, or the name of a brand? Broadly speaking (think about these as features for a classification problem!): Nouns and verbs (a.k.a. content words): nouns are important for topic detection; verbs are important if you want to identify actions or intentions. Adjectives = sentiment identification. Function words (a.k.a. stop words) are important for authorship attribution, plagiarism detection, etc.
  • 19. However… (II) At present, the main pitfall of many NLP applications is that they are not flexible enough to: completely disambiguate language; identify how language is used in different types of documents (a.k.a. genres). For instance, language is used differently in tweets than in emails, and the language used in emails differs from the language used in academic papers, etc. Tweaking NLP tools for different types of text, or resolving language ambiguity in an ad hoc manner, is often time-consuming, difficult and unrewarding…
  • 20. What for? Text summarization; document clustering; authorship attribution; automatic metadata extraction; entity extraction; information extraction; information discovery; ACTIONABLE INTELLIGENCE.
  • 21. Actionable Textual Intelligence. Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Intelligence. Actionable Intelligence is information that: 1. must be accurate and verifiable; 2. must be timely; 3. must be comprehensive; 4. must be comprehensible; 5. !!! must give the power to make decisions and to act straightaway !!!; 6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!
  • 22. Big Data. BIG DATA [Wikipedia]: Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data. Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
  • 23. Big Unstructured TEXTUAL Data. Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice. “Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue] ECONOMIC LOSS!
  • 24. Simple search is not enough… Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms. “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization", in "Managing Unstructured Data in the Organization", Prentice Hall 2008, pp. 1–13]
  • 25. Programming languages and frameworks for big data
  • 26. R (http://www.r-project.org/). R is a statistical programming language. It is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years (Wikipedia).
  • 28. MeetUps: R in Stockholm
  • 29. Can R help out? Can R help overcome NLP shortcomings and open a new direction in order to extract useful information from Big TEXTUAL Data?
  • 30. Existing literature for linguists. Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New edition. Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website). R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge. …
  • 31. Companion website by Stefan Th. Gries. BNC = British National Corpus (PoS tagged)
  • 32. BNC. The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007. The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.
  • 33. R & the BNC: Excerpt from Google Books
  • 34. What about Big Textual Data? Non-standardized language; non-standard texts; electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.
  • 35. Not distributed systems. Open source: R; Scala (also distributed systems); RapidMiner; Weka; … Commercial: SPSS; SAS; MATLAB; … The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.
  • 36. From The Economist: The Big Data scenario
  • 37. Commercial applications for Big Textual Data. Recorded Future: web intelligence (anticipating emerging threats, future trends, anticipating competitors' actions, etc.). Gavagai: large-scale textual analysis (prediction and future trends).
  • 38. Thanks to Staffan Truvé for the following slides
  • 39. Size
  • 40. In a few pictures…
  • 41. Metrics, structure and time
  • 42. Metric
  • 43. Structure
  • 44. Time
  • 45. Facts
  • 46. Pipeline
  • 47. Multi-Language
  • 48. Text Analytics
  • 49. Predictions
  • 50. Gavagai. Jussi Karlgren (PhD in Stylistics in Information Retrieval); Magnus Sahlgren (PhD thesis in distributional semantics); Fredrik Olsson (PhD thesis in Active Learning) (co-workers at SICS). The indeterminacy of translation is a thesis propounded by 20th-century American analytic philosopher W. V. Quine. Quine uses the example of the word "gavagai" uttered by a native speaker of the unknown language Arunta upon seeing a rabbit. A speaker of English could do what seems natural and translate this as "Lo, a rabbit." But other translations would be compatible with all the evidence he has: "Lo, food"; "Let's go hunting"; "There will be a storm tonight" (these natives may be superstitious)… (Wikipedia)
  • 51. Ethersource presented. Thanks to F. Olsson for the following slides.
  • 52. Associations
  • 53. Language is flux
  • 54. Learning from use
  • 55. Scope
  • 56. Architecture
  • 57. Web vs printed world
  • 59. Multi-linguality
  • 60. SICS. Watch the videos!
  • 61. Big Data MeetUp, Stockholm
  • 62. BIG DATA communities
  • 63. Future Directions in Machine Learning for Language Technology. Deluge of data. Little linguistic analysis in the realm of big-data real-world platforms and applications. Top-down systems cannot efficiently deal with the irregularity and unpredictability of big textual data. Data-driven systems can make it. However, we know that computers are not at ease with natural languages used by humans, unless they learn how to learn the linguistic structure underlying natural language from data…
  • 64. For a data-driven approach… Annotated datasets that are needed for complete supervised machine learning are costly, time-consuming and require specialist expertise. Is complete supervision even thinkable when we talk about tera-, peta- or yottabytes? How big should the training set be, then? Alternative solutions: semi-supervised methods (combination of labelled and unlabelled data); weakly supervised methods (human-constructed rules are typically used to guide the unsupervised learner). Unsupervised learning results still cannot compete with supervised learning in many tasks…
  • 65. A new way to explore: Incomplete Supervision. Relies on partially labelled data: “Human experts, or possibly a crowd of laymen, annotate text with some linguistic structure related to the structure that one wants to predict. This data is then used for partially supervised learning with a statistical model that exploits the annotated structure to infer the linguistic structure of interest.” p. 4
  • 66. Example. “…it is possible to construct accurate and robust part-of-speech taggers for a wide range of languages, by combining (1) manually annotated resources in English, or some other language for which such resources are already available, with (2) a crowd-sourced target-language specific lexicon, which lists the potential parts of speech that each word may take in some context, at least for a subset of the words. Both (1) and (2) only provide partial information for the part-of-speech tagging task. However, taken together they turn out to provide substantially more information than either taken alone.” p. 46. Oscar Täckström, “Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision”, PhD Thesis, Uppsala University, 2013 (http://soda.swedish-ict.se/5513/)
  • 67. Conclusions. This course is an introduction to “Machine Learning for Language Technology”. You get a flavour of the problems we come across when devising models for enabling machines to analyse and make sense of natural human language. The next big big big step is to bring as much linguistic awareness as possible into big data.
  • 68. Reading: Witten and Frank (2005) Ch. 8
  • 69. Thanks for your attention!
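
The point of slide 24, that a single-term search misses documents about specific kinds of felony, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the `FELONY_TAXONOMY` dictionary, the function names and the sample documents are hypothetical, and real systems would use a proper lexical resource and tokenization rather than substring matching.

```python
# Hypothetical hand-made taxonomy: maps a general term to more specific terms.
# The specific terms are taken from the examples quoted on slide 24.
FELONY_TAXONOMY = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}


def simple_search(term, documents):
    """Return indices of documents that contain the single term (case-insensitive)."""
    term = term.lower()
    return [i for i, doc in enumerate(documents) if term in doc.lower()]


def expanded_search(term, documents, taxonomy):
    """Search for the term and for every more specific term listed in the taxonomy."""
    terms = [term] + taxonomy.get(term.lower(), [])
    hits = set()
    for t in terms:
        hits.update(simple_search(t, documents))
    return sorted(hits)


# Invented sample documents, for illustration only.
docs = [
    "The suspect was charged with a felony.",
    "Police are investigating an arson attack on the warehouse.",
    "The accountant was convicted of embezzlement.",
    "The council approved the new parking rules.",
]

print(simple_search("felony", docs))                     # [0]
print(expanded_search("felony", docs, FELONY_TAXONOMY))  # [0, 1, 2]
```

The simple search finds only the document that literally mentions "felony", while the expanded search also retrieves the arson and embezzlement documents, which is the kind of background knowledge the slide argues is needed.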

Editor's notes

  1. We need tools to analyse this huge amount of textual data and extract the information we need.
  2. Orthographic check: is something written correctly or not? Vital for searching.
  3. What is a named entity? Names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages.
  4. If you try with longer texts or with another genre, results are not reliable.
  5. Business intelligence (BI) is the ability of an organization to collect, maintain, and organize data. This produces large amounts of information that can help develop new opportunities. Identifying these opportunities, and implementing an effective strategy, can provide a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Customer Experience Management (CEM) is the practice of actively listening to the Voice of the Customer through a variety of listening posts, analyzing customer feedback to create a basis for acting on better business decisions and then measuring the impact of those decisions to drive even greater operational performance and customer loyalty. Through this process, a company strategically organizes itself to manage a customer's entire experience with its product, service or company. Companies invest in CEM to improve customer retention.
  6. A tweet: My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair. Language is highly ambiguous. Fair = reasonable and acceptable // treating everyone equally. Fair = a form of outdoor entertainment, at which there are large machines to ride on and games in which you can win prizes // an event at which people or businesses show and sell their products. Play fair: to do something in a fair and honest way.
  7. Information discovery is too vague.
  8. Problem of size + a problem of diverse data = heterogeneous data. Radio-frequency identification (RFID).
  9. Much effort has been allocated to improving big native numeric data: balance sheets, income reports, financial and business reports, etc. Merrill Lynch – financial management and advisory (www.ml.com/). Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services. E-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, etc. are different genres, i.e. different types of text. For example, emails and white papers are both textual genres but they differ a lot from each other. They might deal with the same topic, but in a completely different way. So the type of information related to the same topic can vary according to genre.
  10. Felony = any grave crime, such as murder, rape, or burglary…
  11. Professor of Linguistics, Department of Linguistics, University of California, Santa Barbara
  12. N-grams; average sentence and word length; indexing; split infinitives.
  13. Stockholm–Umeå Corpus (Joakim)
  14. Descriptive statistics; analytical statistics; multifactorial methods. Token/type ratio: the type-token ratio (TTR) is a measure of vocabulary variation within a written text or a person's speech. The type-token ratios of two real-world examples are calculated and interpreted. The type-token ratio is shown to be a helpful measure of lexical variety within a text. It can be used to monitor changes in children and adults with vocabulary difficulties. The number of tokens is the number of words. In the sample text, several of these tokens are repeated: for example, the token "again" occurs two times, the token "are" occurs three times, and the token "and" occurs five times. Of the total of 87 tokens in this text there are 62 so-called types. The relationship between the number of types and the number of tokens is known as the type-token ratio (TTR). For Text 1 it is calculated as follows: Type-Token Ratio = (number of types / number of tokens) × 100 = (62 / 87) × 100 = 71.3%. The more types there are in comparison to the number of tokens, the more varied the vocabulary is, i.e. there is greater lexical variety. Source: http://www.speech-therapy-information-and-resources.com/type-token-ratio.html
  15. http://youtu.be/qqfeUUjAIyQ
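
The type-token ratio described in note 14 can be computed in a few lines. The following is a minimal sketch, not the method used by the cited source: the function names are hypothetical, and the tokenizer is an assumption (lowercased runs of letters and apostrophes), so counts on real text depend on how tokens are defined.

```python
import re


def ttr_from_counts(n_types, n_tokens):
    """Type-token ratio as a percentage: (types / tokens) * 100."""
    if n_tokens <= 0:
        raise ValueError("number of tokens must be positive")
    return n_types / n_tokens * 100


def type_token_ratio(text):
    """Tokenize text (assumption: lowercase letter/apostrophe runs) and return its TTR."""
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)  # distinct word forms
    return ttr_from_counts(len(types), len(tokens))


# Worked example from note 14: 62 types among 87 tokens.
print(round(ttr_from_counts(62, 87), 1))  # 71.3

# Invented sample sentence: 13 tokens, 8 types.
sample = "the cat sat on the mat and the dog sat on the rug"
print(round(type_token_ratio(sample), 1))  # 61.5
```

Because longer texts repeat more words, TTR values are only comparable between texts of similar length.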