[Pie chart: share by bank. Bank 1: 15%, Bank 2: 15%, Bank 3: 5%, Bank 4: 21%, Bank 5: 20%, Bank 6: 23%]
Operations Analysis
We are raising our tablet forecast.
[Dependency tree for the sentence above: root "raising" with auxiliary "are"; "We" is the subject (subj), "forecast" the object (obj), modified by "our" (DET) and "tablet"; constituent labels S and NP.]
Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected

Extracted fields:
Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
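The field extraction above can be sketched with a single regular expression. The pattern below is my own illustration based on the example log line, not SystemT's actual parser:

```python
import re

# Hypothetical sketch: split the example syslog line into the fields
# shown on the slide (Time, Host, Process, Category, Message).
LOG_RE = re.compile(
    r"(?P<time>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\d+(?:\.\d+){3}) "
    r"(?P<process>\d+): "
    r"(?P<category>%[\w-]+): "
    r"(?P<message>.+)"
)

def parse_syslog(line):
    """Return a dict of named fields, or an empty dict on no match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else {}

record = parse_syslog(
    "Oct 1 04:12:24 9.1.1.3 41865: "
    "%PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected"
)
```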
Singapore 2012 Annual Report (136-page PDF): identify the line item for Operating expenses in the Income statement (a financial table in the PDF document), identify the note breaking down the Operating expenses line item, and extract the opex components.
Examples of entities of interest (Customer or competitor? Good or bad?):
Intel's 2013 capex is elevated at 23% of sales, above the average of 16%.
FHLMC reported a $4.4bn net loss and requested $6bn in capital from Treasury.
I'm still hearing from clients that Merrill's website is better.
Sample document (lorem ipsum filler) containing the sentence "Tomorrow, we will meet Mark Scott, Howard Smith and …" in two places.

Tokenization (preprocessing step)

Level 1:
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps

• Rule priority used to prefer First over Caps.
• Lossy Sequencing: annotations dropped because the input to the next stage must be a sequence.
– First preferred over Last since it was declared earlier.
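The rule-priority problem above can be illustrated with a toy cascade stage. The gazetteer contents below are hypothetical, chosen so that "Scott" is ambiguous between a first and a last name:

```python
# Toy sketch of a rigid-priority cascading-grammar stage: each token gets
# at most one annotation, so competing annotations are silently dropped.
first_gaz = {"Mark", "Howard", "Scott"}   # hypothetical FirstGaz contents
last_gaz = {"Scott", "Smith"}             # hypothetical LastGaz contents

def level1(tokens):
    out = []
    for t in tokens:
        # Rule priority: First wins; the competing Last annotation on the
        # same token is dropped (the "lossy sequencing" problem).
        if t in first_gaz:
            out.append((t, "First"))
        elif t in last_gaz:
            out.append((t, "Last"))
        elif t[0].isupper():
            out.append((t, "Caps"))
    return out

anns = level1("Tomorrow , we will meet Mark Scott".split())
```

Because "Scott" is labeled First and its Last annotation is dropped, a later First+Last rule can never match "Mark Scott".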
Sample document (lorem ipsum filler), again containing mentions of "Tomorrow, we will meet Mark Scott, Howard Smith and …".

Tokenization (preprocessing step)

Level 1:
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps

Level 2:
First Last → Person
First Caps → Person
First → Person

Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results.
AQL Language → Optimizer → Operator Graph

AQL: specify extractor semantics declaratively (express the logic of computation, not control flow).
Optimizer: choose an efficient execution plan that implements the semantics.
Operator Graph: the optimized execution plan, executed at runtime.
Data model:
Document (text: String)
Person (first: Span, last: Span, fullname: Span)
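The two tuple types above can be sketched as plain dataclasses. Encoding a Span as begin/end character offsets matches the slide's usage; the minimal classes themselves are my own sketch:

```python
from dataclasses import dataclass

# Sketch of the data model: tuples with Span-typed fields, where a Span
# is a pair of (begin, end) character offsets into the document text.
@dataclass(frozen=True)
class Span:
    begin: int
    end: int

    def text(self, doc: str) -> str:
        """Materialize the covered text from the document string."""
        return doc[self.begin:self.end]

@dataclass
class Person:
    first: Span
    last: Span
    fullname: Span

doc = "we will meet Mark Scott and"
p = Person(first=Span(13, 17), last=Span(18, 23), fullname=Span(13, 23))
```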
[Diagram: the Dictionary operator applied to an input tuple. The dictionary contains Mark, Scott, Anna, …; the input tuple holds a Document ("… we will meet Mark Scott and …"); each output tuple pairs the Document with one matched Span (Output Tuple 1: Span 1; Output Tuple 2: Span 2).]
Operator graph over the text "…Tomorrow, we will meet Mark Scott, Howard Smith …":

Dictionary <First> and Dictionary <Last> operators and a Regex <Caps> operator produce candidate annotations (Tomorrow, Mark, Scott, Howard, Smith, …), possibly overlapping. Two Join operators pair them (<First> <Last> and <First> <Caps>, each yielding Mark Scott and Howard Smith), a Union merges the join results with the remaining <First> matches (Mark, Scott, Howard), and a Consolidate operator resolves the overlaps, leaving Mark Scott and Howard Smith.

Explicit operator for resolving ambiguity.
Input may contain overlapping annotations (no Lossy Sequencing problem).
Output may contain overlapping annotations (no Rigid Matching Regimes problem).
create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
(matches a <First> annotation followed within 0 tokens by a <Caps> annotation)
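A minimal Python sketch of what FollowsTok and CombineSpans compute, assuming whitespace tokenization (an assumption; SystemT uses a real tokenizer):

```python
# Spans are (begin, end) character offsets into the document text.
def tokens_between(text, a, b):
    """Number of whitespace-delimited tokens strictly between spans a and b."""
    return len(text[a[1]:b[0]].split())

def follows_tok(text, a, b, lo, hi):
    """FollowsTok(a, b, lo, hi): span b starts lo..hi tokens after a ends."""
    return a[1] <= b[0] and lo <= tokens_between(text, a, b) <= hi

def combine_spans(a, b):
    """CombineSpans: the smallest span covering both input spans."""
    return (min(a[0], b[0]), max(a[1], b[1]))

text = "we will meet Mark Scott and"
first = (13, 17)   # "Mark"
caps = (18, 23)    # "Scott"
```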
create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
<First><Caps>
<First><Last>
<First>
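The consolidate clause above resolves overlapping candidates. A sketch of its containment policy (one of several policies; this simple version keeps only spans not contained in a longer candidate):

```python
# Sketch of "consolidate": among overlapping candidate spans, discard any
# span strictly contained in another candidate, keeping the longest matches.
def consolidate(spans):
    keep = []
    for s in spans:
        contained = any(
            o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans
        )
        if not contained:
            keep.append(s)
    return keep

# Candidates for "... meet Mark Scott ...": <First> alone and <First><Caps>.
cands = consolidate([(13, 17), (13, 23)])   # "Mark" vs. "Mark Scott"
```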
The consolidate clause is an explicit means of resolving ambiguity (no Rigid Priority problem), and the input may contain overlapping annotations (no Lossy Sequencing problem).
Language to express NLP algorithms → AQL

Core operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations, …
Built on top: Deep Syntactic Parsing, Semantic Role Labels, ML Training & Scoring.
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
create dictionary AbbrevDict from file
'abbreviation.dict';
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1), R.match)));
Equivalent AQL Implementation
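The same logic can be mirrored in a few lines of Python. The abbreviation set is a stand-in for the contents of 'abbreviation.dict' (note it holds the word without its trailing period, since the candidate boundary sits after the last word):

```python
import re

# Sketch of the AQL above: match candidate sentence boundaries, then drop
# those whose preceding word is a known abbreviation.
ABBREVS = {"Mr", "Dr", "etc"}              # hypothetical dictionary contents
BOUNDARY = re.compile(r"[.?!]+\s|\n\s*\n")

def sentence_boundaries(text):
    out = []
    for m in BOUNDARY.finditer(text):
        left = text[:m.start()].rstrip()
        words = left.split()
        last_word = words[-1] if words else ""
        if last_word not in ABBREVS:       # ~ Not(ContainsDict(...))
            out.append(m.start())
    return out

bounds = sentence_boundaries("Mr. Smith arrived. He sat down.")
```

Here the boundary after "Mr." is rejected, while the one after "arrived." is kept.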
Tokenization overhead is paid only once.

Plan A: Join over <First> and <Caps> (Caps followed within 0 tokens of First).
Plan B (Restricted Span Evaluation): for each First, extract the text to the right and identify Caps starting within 0 tokens.
Plan C (Restricted Span Evaluation): for each Caps, extract the text to the left and identify First ending within 0 tokens.
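Plan B's restricted span evaluation can be sketched as follows. The names and the fixed 20-character window are my own simplifications (a real plan would restrict by tokens, not characters):

```python
import re

# Sketch of restricted span evaluation (Plan B): instead of running the
# Caps regex over the whole document, evaluate it only on the text
# immediately to the right of each First match.
CAPS = re.compile(r"[A-Z]\w+")
FIRST_NAMES = {"Mark", "Howard"}           # hypothetical First matches

def plan_b(text):
    results = []
    for m in re.finditer(r"\w+", text):
        if m.group() in FIRST_NAMES:
            right = text[m.end():m.end() + 20]   # small right-context window
            c = CAPS.match(right.lstrip())       # Caps must start immediately
            if c:
                results.append(m.group() + " " + c.group())
    return results

pairs = plan_b("we will meet Mark Scott and Howard Smith today")
```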
[Chart: throughput (KB/sec) vs. average document size (KB). SystemT is 10~50x faster than an open-source entity tagger. [Chiticariu et al., ACL’10]]
[Chiticariu et al., ACL’10]

| Dataset | Document Size (Range) | Document Size (Average) | ANNIE Throughput (KB/sec) | SystemT Throughput (KB/sec) | ANNIE Average Memory (MB) | SystemT Average Memory (MB) |
|---|---|---|---|---|---|---|
| Web Crawl | 68 B – 388 KB | 8.8 KB | 42.8 | 498.8 | 201.8 | 77.2 |
| Medium SEC Filings | 240 KB – 0.9 MB | 401 KB | 26.3 | 703.5 | 601.8 | 143.7 |
| Large SEC Filings | 1 MB – 3.4 MB | 1.54 MB | 21.1 | 954.5 | 2683.5 | 189.6 |
PersonPhone example over "Anna at James St. office (555-5555) …", with Person annotations ("Anna", "James") and a Phone annotation ("555-5555"):
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, Phone N
where Follows(P.name, N.number, 0, 30);
Person tuples: t1 ("Anna"), t2 ("James"). Phone tuple: t3 ("555-5555").
Provenance of each output tuple is a Boolean expression over its inputs: t1 ∧ t3 and t2 ∧ t3.
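How provenance flows through the join can be sketched directly. The tuple IDs follow the slide; the character offsets are hypothetical:

```python
# Sketch: the PersonPhone join records, for each output tuple, the set of
# input tuples it was derived from (its Boolean-conjunction provenance).
def join_with_provenance(persons, phones):
    out = []
    for pid, (pname, pend) in persons.items():
        for nid, (num, nstart) in phones.items():
            if 0 <= nstart - pend <= 30:          # Follows(..., 0, 30)
                out.append(((pname, num), {pid, nid}))
    return out

persons = {"t1": ("Anna", 4), "t2": ("James", 13)}   # (name, span end)
phones = {"t3": ("555-5555", 26)}                    # (number, span start)
result = join_with_provenance(persons, phones)
```

Each output carries {t1, t3} or {t2, t3}, so removing an input tuple immediately identifies which outputs disappear.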
[Timeline, 2013–2017, of courses teaching SystemT: UC Santa Cruz (full Graduate class); U. Washington (Grad); U. Oregon (Undergrad); U. Aalborg, Denmark (Grad); UIUC (Grad); U. Maryland Baltimore County (Undergrad); UC Irvine (Grad); NYU Abu-Dhabi (Undergrad); U. Washington (Grad); U. Oregon (Undergrad); U. Maryland Baltimore County (Undergrad); …; UC Santa Cruz, 3 lectures in one Grad class; SystemT MOOC.]
create dictionary PurchaseVerbs as
('buy.01', 'purchase.01', 'acquire.01', 'get.01');
create view Relation as
select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
from Action A, Role R
where
MatchesDict('PurchaseVerbs', A.verbClass)
and Equals(A.aid, R.aid)
and Equals(R.type, 'A1');
ACL ‘15, ‘16, EMNLP ‘16, COLING ’16a, ‘16b, ‘16c
Ease of Programming
Ease of Sharing
R1: create view Phone as
Regex('\d{3}-\d{4}', Document, text);
R2: create view Person as
Dictionary('first_names.dict', Document, text);
Dictionary file first_names.dict:
anna, james, john, peter…
R3: create table PersonPhone(match span);
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details.
(Person annotations: Anna, James, James; Phone annotations: 555-5555, 777-7777)
[Operator graph for R1–R3 over "Anna at James St. office (555-5555), …": the Dictionary operator (FirstNames.dict) over Doc produces the Person match "James"; the Regex operator (/\d{3}-\d{4}/) produces the Phone match "555-5555"; the Join with Follows(name, phone, 0, 60) produces the PersonPhone span "James … 555-5555".]
Goal: remove "James … 555-5555" from the output.

The provenance graph (Doc feeding R1's Regex operator /\d{3}-\d{4}/ and R2's Dictionary operator over 'firstName.dict', combined in R3 by ⋈ Follows(F.match, P.match, 0, 60) with Merge(F.match, P.match) as match) yields three high-level changes (HLCs):
HLC 1: remove 555-5555 from the output of R1's Regex op.
HLC 2: remove James from the output of R2's Dictionary op.
HLC 3: remove the James…555-5555 tuple from the output of R3's join op.
The same HLCs map to concrete low-level changes (LLCs):
LLC 1: remove 'James' from FirstNames.dict.
LLC 2: add a filter predicate on a street suffix in the right context of the match.
LLC 3: reduce the character gap between F.match and P.match from 60 to 10.
Analogous changes apply to other operators: dictionary predicates (ContainsDict()) and join (⋈) predicates such as Contains, IsContained, and Overlaps on the PersonPhone spans.
• Input: set of HLCs, provenance graph, labeled results.
• Output: list of LLCs, ranked based on improvement in F1-measure.
• Algorithm:
– For each operator Op, consider all HLCs (ti, Op).
– For each HLC, enumerate all possible LLCs.
– For each LLC:
• Compute the set of local tuples it removes from the output of Op.
• Propagate these removals up through the provenance graph to compute the effect on the end-to-end result.
– Rank LLCs.
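The final ranking step can be sketched as follows. The LLC names and their hypothetical end-to-end result sets are illustrative only; in the algorithm above they would come from propagating removals through the provenance graph:

```python
# Sketch: rank candidate low-level changes (LLCs) by the F1-measure of
# the result set each one would produce, against the labeled gold set.
def f1(predicted, gold):
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    p, r = tp / len(predicted), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def rank_llcs(llc_results, gold):
    """Return LLC names sorted by descending F1 of their result sets."""
    return sorted(llc_results, key=lambda llc: -f1(llc_results[llc], gold))

gold = {"Anna-555", "Bob-111"}
llc_results = {                      # hypothetical effect of each LLC
    "LLC1": {"Anna-555", "Bob-111", "James-555"},   # keeps a false positive
    "LLC2": {"Anna-555"},                           # removes a true positive
    "LLC3": {"Anna-555", "Bob-111"},                # exactly the gold set
}
ranking = rank_llcs(llc_results, gold)
```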
[Figure: the output tuples of Op shown as correct (+) and incorrect (−) marks; the incorrect tuples are the ones to remove from the output of Op.]
[Charts: precision and recall across refinement iterations (Baseline, I1–I5) for Enron, ACE, CoNLL, and EnronPP.]

Precision improves greatly after a few iterations, while recall remains fairly stable.
Precision: % of identified results that are correct. Recall: % of correct labels that are identified.
Tasks: Person extraction on formal text (CoNLL, ACE); Person and PersonPhone extraction on informal text (Enron).
[Chart: F1-measure of the generated refinements compared with Expert A's manual refinements (1 hour, 9 refinements).]

Person extraction on informal text (Enron): almost all of the expert's refinements are among the top 12 generated refinements, and are produced in 2 minutes.
[Architecture overview: a Development Environment with AQL extractors (e.g., create view ProductMention as select … from … where …; create view IntentToBuy as …) and discovery tools for AQL development; cost-based optimization compiles the extractors; the SystemT Runtime processes input documents into extracted objects.]
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent, and usable. Existing solutions are either rule-based, built on cascading grammars with expressivity and efficiency issues, or black-box machine-learning solutions that lack transparency.

Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling, based on a solid theoretical foundation [PODS’13, PODS’14] and shipping with more than 10 IBM products.
• AQL: a declarative language for building extractors that outperform the state of the art [ACL’10]; multilingual SRL-enabled [ACL’15, ACL’16, EMNLP’16, COLING’16].
• A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13, VLDB’15, NAACL’15].
• Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14].
• Highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record’09, SIGMOD’09].

For details and the online class, visit: https://ibm.biz/BdF4GQ

Weitere ähnliche Inhalte

Andere mochten auch

Mining at scale with latent factor models for matrix completion
Mining at scale with latent factor models for matrix completionMining at scale with latent factor models for matrix completion
Mining at scale with latent factor models for matrix completionFabio Petroni, PhD
 
LCBM: Statistics-Based Parallel Collaborative Filtering
LCBM: Statistics-Based Parallel Collaborative FilteringLCBM: Statistics-Based Parallel Collaborative Filtering
LCBM: Statistics-Based Parallel Collaborative FilteringFabio Petroni, PhD
 
Topic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusTopic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusRubén Izquierdo Beviá
 
KafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesKafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesRubén Izquierdo Beviá
 
The Power of Declarative Analytics
The Power of Declarative AnalyticsThe Power of Declarative Analytics
The Power of Declarative AnalyticsYunyao Li
 
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRubén Izquierdo Beviá
 
CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)Rubén Izquierdo Beviá
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
 
DutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsDutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsRubén Izquierdo Beviá
 
Transparent Machine Learning for Information Extraction: State-of-the-Art and...
Transparent Machine Learning for Information Extraction: State-of-the-Art and...Transparent Machine Learning for Information Extraction: State-of-the-Art and...
Transparent Machine Learning for Information Extraction: State-of-the-Art and...Yunyao Li
 
HSIENA: a hybrid publish/subscribe system
HSIENA: a hybrid publish/subscribe systemHSIENA: a hybrid publish/subscribe system
HSIENA: a hybrid publish/subscribe systemFabio Petroni, PhD
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesYunyao Li
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationRubén Izquierdo Beviá
 
CORE: Context-Aware Open Relation Extraction with Factorization Machines
CORE: Context-Aware Open Relation Extraction with Factorization MachinesCORE: Context-Aware Open Relation Extraction with Factorization Machines
CORE: Context-Aware Open Relation Extraction with Factorization MachinesFabio Petroni, PhD
 

Andere mochten auch (16)

Mining at scale with latent factor models for matrix completion
Mining at scale with latent factor models for matrix completionMining at scale with latent factor models for matrix completion
Mining at scale with latent factor models for matrix completion
 
LCBM: Statistics-Based Parallel Collaborative Filtering
LCBM: Statistics-Based Parallel Collaborative FilteringLCBM: Statistics-Based Parallel Collaborative Filtering
LCBM: Statistics-Based Parallel Collaborative Filtering
 
Topic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusTopic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpus
 
KafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesKafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF files
 
The Power of Declarative Analytics
The Power of Declarative AnalyticsThe Power of Declarative Analytics
The Power of Declarative Analytics
 
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
 
CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
 
DutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsDutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systems
 
Transparent Machine Learning for Information Extraction: State-of-the-Art and...
Transparent Machine Learning for Information Extraction: State-of-the-Art and...Transparent Machine Learning for Information Extraction: State-of-the-Art and...
Transparent Machine Learning for Information Extraction: State-of-the-Art and...
 
HSIENA: a hybrid publish/subscribe system
HSIENA: a hybrid publish/subscribe systemHSIENA: a hybrid publish/subscribe system
HSIENA: a hybrid publish/subscribe system
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense Disambiguation
 
CORE: Context-Aware Open Relation Extraction with Factorization Machines
CORE: Context-Aware Open Relation Extraction with Factorization MachinesCORE: Context-Aware Open Relation Extraction with Factorization Machines
CORE: Context-Aware Open Relation Extraction with Factorization Machines
 
Juan Calvino y el Calvinismo
Juan Calvino y el CalvinismoJuan Calvino y el Calvinismo
Juan Calvino y el Calvinismo
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 

Ähnlich wie 2017-01-25-SystemT-Overview-Stanford

2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animationdiannepatricia
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTdiannepatricia
 
SystemT: Declarative Information Extraction
SystemT: Declarative Information ExtractionSystemT: Declarative Information Extraction
SystemT: Declarative Information ExtractionYunyao Li
 
Functional Programming You Already Know
Functional Programming You Already KnowFunctional Programming You Already Know
Functional Programming You Already KnowKevlin Henney
 
System programmin practical file
System programmin practical fileSystem programmin practical file
System programmin practical fileAnkit Dixit
 
Hierarchical free monads and software design in fp
Hierarchical free monads and software design in fpHierarchical free monads and software design in fp
Hierarchical free monads and software design in fpAlexander Granin
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxcargillfilberto
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxdrandy1
 
Martin Chapman: Research Overview, 2017
Martin Chapman: Research Overview, 2017Martin Chapman: Research Overview, 2017
Martin Chapman: Research Overview, 2017Martin Chapman
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxmonicafrancis71118
 
C language by Dr. D. R. Gholkar
C language by Dr. D. R. GholkarC language by Dr. D. R. Gholkar
C language by Dr. D. R. GholkarPRAVIN GHOLKAR
 
Clean code _v2003
 Clean code _v2003 Clean code _v2003
Clean code _v2003R696
 
Bca2020 data structure and algorithm
Bca2020   data structure and algorithmBca2020   data structure and algorithm
Bca2020 data structure and algorithmsmumbahelp
 

Ähnlich wie 2017-01-25-SystemT-Overview-Stanford (20)

2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
SystemT: Declarative Information Extraction
SystemT: Declarative Information ExtractionSystemT: Declarative Information Extraction
SystemT: Declarative Information Extraction
 
Second Level Cache in JPA Explained
Second Level Cache in JPA ExplainedSecond Level Cache in JPA Explained
Second Level Cache in JPA Explained
 
Functional Programming You Already Know
Functional Programming You Already KnowFunctional Programming You Already Know
Functional Programming You Already Know
 
System programmin practical file
System programmin practical fileSystem programmin practical file
System programmin practical file
 
Learn C
Learn CLearn C
Learn C
 
C notes for exam preparation
C notes for exam preparationC notes for exam preparation
C notes for exam preparation
 
Hierarchical free monads and software design in fp
Hierarchical free monads and software design in fpHierarchical free monads and software design in fp
Hierarchical free monads and software design in fp
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
Martin Chapman: Research Overview, 2017
Martin Chapman: Research Overview, 2017Martin Chapman: Research Overview, 2017
Martin Chapman: Research Overview, 2017
 
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docxCOMM 166 Final Research Proposal GuidelinesThe proposal should.docx
COMM 166 Final Research Proposal GuidelinesThe proposal should.docx
 
C language by Dr. D. R. Gholkar
C language by Dr. D. R. GholkarC language by Dr. D. R. Gholkar
C language by Dr. D. R. Gholkar
 
Object-Oriented Programming Using C++
Object-Oriented Programming Using C++Object-Oriented Programming Using C++
Object-Oriented Programming Using C++
 
Structures-2
Structures-2Structures-2
Structures-2
 
Clean code _v2003
 Clean code _v2003 Clean code _v2003
Clean code _v2003
 
C#, What Is Next?
C#, What Is Next?C#, What Is Next?
C#, What Is Next?
 
Bca2020 data structure and algorithm
Bca2020   data structure and algorithmBca2020   data structure and algorithm
Bca2020 data structure and algorithm
 
Functions
FunctionsFunctions
Functions
 

2017-01-25-SystemT-Overview-Stanford

  • 1. 1
  • 2. 2
  • 5. 5
  • 6. 6
  • 7. 7 We are raising our tablet forecast. S are NP We S raising NP forecastNP tablet DET our subj obj subj pred Dependency Tree Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected Time Oct 1 04:12:24 Host 9.1.1.3 Process 41865 Category %PLATFORM_ENV-1- DUAL_PWR Message Faulty internal power supply B detected
  • 8. Example task over the Singapore 2012 Annual Report (136-page PDF): identify the Operating expenses line item in the Income statement (a financial table in the PDF document), then identify the note breaking down that line item and extract its opex components.
  • 9. 9
  • 10. Example sentences and the annotation questions they raise: "Intel's 2013 capex is elevated at 23% of sales, above the average of 16%." "FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury." "I'm still hearing from clients that Merrill's website is better." Customer or competitor? Good or bad? Which entity is of interest?
  • 11. 11
  • 13. 13
  • 14. 14
  • 16. Cascading-grammar example, Level 1 (after a tokenization preprocessing step), over a document containing "Tomorrow, we will meet Mark Scott, Howard Smith":
    Gazetteer[type = LastGaz] → Last
    Gazetteer[type = FirstGaz] → First
    Token[~ "[A-Z]\w+"] → Caps
    Rule priority is used to prefer First over Caps. Lossy Sequencing: annotations are dropped because the input to the next stage must be a sequence, so First is preferred over Last since it was declared earlier.
  • 17. Cascading-grammar example continued, over the same document. Level 1 (after tokenization): Gazetteer[type = LastGaz] → Last; Gazetteer[type = FirstGaz] → First; Token[~ "[A-Z]\w+"] → Caps. Level 2:
    First Last → Person
    First Caps → Person
    First → Person
    Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results.
  • 19. 19
  • 20. AQL Language → Optimizer → Operator Graph: the developer specifies extractor semantics declaratively in AQL (expressing the logic of the computation, not the control flow); the optimizer chooses an efficient execution plan that implements those semantics; the optimized execution plan is executed at runtime.
  • 21. 21
  • 23. 23 23
  • 24. Data model example (diagram): a Dictionary operator applied to an input tuple holding the document "… we will meet Mark Scott and …" produces output tuples, each pairing the document with a matched span (Output Tuple 1 with Span 1, Output Tuple 2 with Span 2).
  • 25. 25
  • 26. Algebraic operator graph over "…Tomorrow, we will meet Mark Scott, Howard Smith…": Dictionary<First>, Dictionary<Last>, and Regex<Caps> feed Join<First, Caps> and Join<First, Last>, whose results are combined by Union and then Consolidate, yielding "Mark Scott" and "Howard Smith". The input to each operator may contain overlapping annotations (no Lossy Sequencing problem), the output may contain overlapping annotations (no Rigid Matching Regimes), and Consolidate is an explicit operator for resolving ambiguity.
  • 27. AQL for a First span followed within 0 tokens by a Caps span (<First> <Caps>):

    create view FirstCaps as
    select CombineSpans(F.name, C.name) as name
    from First F, Caps C
    where FollowsTok(F.name, C.name, 0, 0);
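As a rough illustration of what the FollowsTok predicate and CombineSpans function compute, here is a Python sketch over character-offset spans. The whitespace tokenizer and the offsets are simplifying assumptions for the example, not SystemT's actual implementation:

```python
def tokenize(text):
    """Toy whitespace tokenizer: return (begin, end) offsets of each token."""
    spans, pos = [], 0
    for tok in text.split():
        begin = text.index(tok, pos)
        spans.append((begin, begin + len(tok)))
        pos = begin + len(tok)
    return spans

def follows_tok(text, span1, span2, min_tok, max_tok):
    """True if span2 starts after span1 with min_tok..max_tok tokens between."""
    if span2[0] < span1[1]:
        return False
    between = [t for t in tokenize(text)
               if span1[1] <= t[0] and t[1] <= span2[0]]
    return min_tok <= len(between) <= max_tok

def combine_spans(span1, span2):
    """Smallest span covering both input spans."""
    return (min(span1[0], span2[0]), max(span1[1], span2[1]))

text = "Tomorrow, we will meet Mark Scott, Howard Smith"
first = (23, 27)   # "Mark"
caps = (28, 33)    # "Scott"
print(follows_tok(text, first, caps, 0, 0))   # the two spans are adjacent
print(text[slice(*combine_spans(first, caps))])
```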
  • 28. AQL for the full Person view, covering <First><Caps>, <First><Last>, and <First> alone:

    create view Person as
    select S.name as name
    from (
      (select CombineSpans(F.name, C.name) as name
       from First F, Caps C
       where FollowsTok(F.name, C.name, 0, 0))
      union all
      (select CombineSpans(F.name, L.name) as name
       from First F, Last L
       where FollowsTok(F.name, L.name, 0, 0))
      union all
      (select * from First F)
    ) S
    consolidate on name;
  • 29. The same Person view, annotated: the consolidate clause is an explicit clause for resolving ambiguity (no Rigid Priority problem), and the input may contain overlapping annotations (no Lossy Sequencing problem):

    create view Person as
    select S.name as name
    from (
      (select CombineSpans(F.name, C.name) as name
       from First F, Caps C
       where FollowsTok(F.name, C.name, 0, 0))
      union all
      (select CombineSpans(F.name, L.name) as name
       from First F, Last L
       where FollowsTok(F.name, L.name, 0, 0))
      union all
      (select * from First F)
    ) S
    consolidate on name;
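To make the consolidation step concrete, here is a hedged Python sketch of one containment-based consolidation policy (AQL supports several policies; this one simply keeps spans that are not contained in any other span, and is only an illustration):

```python
def consolidate(spans):
    """Keep only spans not strictly contained in another candidate span."""
    def contained(s, t):
        return t != s and t[0] <= s[0] and s[1] <= t[1]
    return [s for s in spans if not any(contained(s, t) for t in spans)]

# Candidate Person spans over "... meet Mark Scott, Howard Smith":
# "Mark Scott", "Mark", "Howard Smith", "Howard"
candidates = [(23, 33), (23, 27), (35, 47), (35, 41)]
print(consolidate(candidates))   # only the two maximal spans survive
```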
  • 30. AQL: a language to express NLP algorithms. Core operators: tokenization, parts of speech, dictionaries, regular expressions, span operations, relational operations, aggregation operations. Layered on top: deep syntactic parsing, semantic role labels, ML training & scoring.
  • 31. Java implementation of sentence boundary detection (~200 lines) versus the equivalent AQL implementation. Representative excerpt of the Java SentenceChunker class (the helper methods containsSentenceBoundary, setDocumentForObtainingBoundaries, getNextCandidateBoundary, doesNotBeginWithPunctuation, getLastWord, and isFullSentence are omitted here for space):

    package com.ibm.avatar.algebra.util.sentence;

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.regex.Matcher;

    public class SentenceChunker {
      private Matcher sentenceEndingMatcher = null;
      private HashSet<String> abbreviations = new HashSet<String>();

      /** Constructor that takes in the abbreviations directly. */
      public SentenceChunker(String[] abbreviations) {
        for (String abbr : abbreviations) {
          this.abbreviations.add(abbr);
        }
      }

      public ArrayList<Integer> getSentenceOffsetArrayList(String doc) {
        ArrayList<Integer> sentenceArrayList = new ArrayList<Integer>();
        int boundary;
        int currentOffset = 0;
        sentenceArrayList.add(new Integer(0));
        do {
          /* Get the next tentative boundary for the sentence string. */
          setDocumentForObtainingBoundaries(doc);
          boundary = getNextCandidateBoundary();
          if (boundary != -1) {
            String candidate = doc.substring(0, boundary + 1);
            String remainder = doc.substring(boundary + 1);
            /*
             * If the last character of the candidate is part of an abbreviation
             * (as detected by regex), the candidate is not a full sentence.
             */
            while (!(doesNotBeginWithPunctuation(remainder)
                && isFullSentence(candidate))) {
              int nextBoundary = getNextCandidateBoundary();
              if (nextBoundary == -1) {
                break;
              }
              boundary = nextBoundary;
              candidate = doc.substring(0, boundary + 1);
              remainder = doc.substring(boundary + 1);
            }
            if (candidate.length() > 0) {
              sentenceArrayList.add(new Integer(currentOffset + boundary + 1));
              currentOffset += boundary + 1;
            }
            doc = remainder;
          }
        } while (boundary != -1);
        if (doc.length() > 0) {
          sentenceArrayList.add(new Integer(currentOffset + doc.length()));
        }
        sentenceArrayList.trimToSize();
        return sentenceArrayList;
      }

      // ... remaining helper methods ...
    }

    Equivalent AQL implementation:

    create dictionary AbbrevDict from file 'abbreviation.dict';

    create view SentenceBoundary as
    select R.match as boundary
    from (
      extract regex /(([.?!]+\s)|(\n\s*\n))/
        on D.text as match
      from Document D
    ) R
    where Not(ContainsDict('AbbrevDict',
        CombineSpans(LeftContextTok(R.match, 1), R.match)));
  • 32. 32
  • 33. 33
  • 34. 34
  • 35. Alternative execution plans for "First followed within 0 tokens by Caps" (tokenization overhead is paid only once). Plan A: evaluate First and Caps independently, then Join. Plan B (Restricted Span Evaluation): identify First, extract the text to its right, and identify Caps starting within 0 tokens. Plan C: identify Caps, extract the text to its left, and identify First ending within 0 tokens.
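To illustrate the idea behind Plan B (restricted span evaluation), the sketch below finds dictionary matches first and evaluates the Caps regex only in the right context of each match, instead of matching Caps over the whole document and joining. The dictionary contents and regexes here are assumptions for the example, not SystemT code:

```python
import re

FIRST_NAMES = {"mark", "howard"}      # assumed toy First dictionary
CAPS = re.compile(r"\s([A-Z]\w+)")    # Caps pattern, anchored right after First

def plan_b(text):
    """Restricted span evaluation: match First, then run the Caps regex
    only on the text immediately to the right of each First match."""
    results = []
    for m in re.finditer(r"[A-Za-z]\w*", text):
        if m.group().lower() in FIRST_NAMES:
            caps = CAPS.match(text, m.end())   # evaluate only at this position
            if caps:
                results.append(text[m.start():caps.end(1)])
    return results

print(plan_b("Tomorrow, we will meet Mark Scott, Howard Smith"))
# → ['Mark Scott', 'Howard Smith']
```

Anchoring the regex at `m.end()` with `Pattern.match(text, pos)` is what avoids rescanning the whole document per candidate.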
  • 36. Chart: throughput (KB/sec) vs. average document size (KB), comparing an open-source entity tagger with SystemT. SystemT is 10–50x faster. [Chiticariu et al., ACL'10]
  • 37. [Chiticariu et al., ACL'10]

    Dataset            | Document Size              | Throughput (KB/sec) | Average Memory (MB)
                       | Range           | Average  | ANNIE  | SystemT    | ANNIE   | SystemT
    Web Crawl          | 68 B – 388 KB   | 8.8 KB   | 42.8   | 498.8      | 201.8   | 77.2
    Medium SEC Filings | 240 KB – 0.9 MB | 401 KB   | 26.3   | 703.5      | 601.8   | 143.7
    Large SEC Filings  | 1 MB – 3.4 MB   | 1.54 MB  | 21.1   | 954.5      | 2683.5  | 189.6
  • 38. 38
  • 40. Provenance example over "Anna at James St. office (555-5555) …", with the PersonPhone view joining Person and Phone:

    create view PersonPhone as
    select P.name as person, N.number as phone
    from Person P, Phone N
    where Follows(P.name, N.number, 0, 30);

    An output tuple t3 is produced from Person tuple t1 and Phone tuple t2 (t1 → t3, t2 → t3); its provenance is the Boolean expression t1 ∧ t2.
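A minimal sketch of how a join operator can record provenance alongside each output tuple, so a wrong result can be traced back to the inputs that produced it. The tuple ids, offsets, and dictionary of spans are illustrative assumptions, not SystemT internals:

```python
def join_with_provenance(persons, phones, min_gap=0, max_gap=30):
    """Join Person and Phone spans like Follows(P.name, N.number, 0, 30),
    recording for each output tuple which input tuples it was derived from."""
    out = []
    for pid, pspan in persons:
        for nid, nspan in phones:
            gap = nspan[0] - pspan[1]          # characters between the spans
            if min_gap <= gap <= max_gap:
                out.append({"span": (pspan[0], nspan[1]),
                            "provenance": ("AND", pid, nid)})
    return out

# "Anna at James St. office (555-5555)": "James" (from "James St.") is a
# spurious Person match, so the join produces a wrong tuple we can trace.
persons = [("p_anna", (0, 4)), ("p_james", (8, 13))]
phones = [("n_1", (26, 34))]
print(join_with_provenance(persons, phones))
```

Both Person candidates fall within 30 characters of the phone number, and the provenance expression on each output pinpoints which input tuple to remove.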
  • 41. 41
  • 42. SystemT in university courses:
    2013: UC Santa Cruz (full graduate class)
    2014: UC Santa Cruz (3 lectures in one graduate class)
    2015: U. Washington (grad), U. Oregon (undergrad), U. Aalborg, Denmark (grad), UIUC (grad), U. Maryland Baltimore County (undergrad)
    2016: UC Irvine (grad), NYU Abu-Dhabi (undergrad), U. Washington (grad), U. Oregon (undergrad), U. Maryland Baltimore County (undergrad), …
    2017: SystemT MOOC
  • 43. 43
  • 44. 44
  • 45. 45
  • 46. Example: semantic-role-based extraction (ACL '15, '16; EMNLP '16; COLING '16a, '16b, '16c):

    create dictionary PurchaseVerbs as
      ('buy.01', 'purchase.01', 'acquire.01', 'get.01');

    create view Relation as
    select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
    from Action A, Role R
    where MatchesDict('PurchaseVerbs', A.verbClass)
      and Equals(A.aid, R.aid)
      and Equals(R.type, 'A1');
  • 47. 47
  • 48. 48
  • 49. 49
  • 50. 50
  • 51. 51
  • 54. Example extractor over "Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details." (Person matches on Anna and both occurrences of James; Phone matches on 555-5555 and 777-7777):

    R1: create view Phone as Regex('\d{3}-\d{4}', Document, text);

    R2: create view Person as Dictionary('first_names.dict', Document, text);
        (dictionary file first_names.dict: anna, james, john, peter, …)

    R3: create table PersonPhone(match span);
        insert into PersonPhone
        select Merge(F.match, P.match) as match
        from Person F, Phone P
        where Follows(F.match, P.match, 0, 60);
  • 56. Goal: remove the wrong result James ↔ 555-5555 from the output. Candidate high-level changes (HLCs) over the operator graph (Regex '\d{3}-\d{4}' for R1 and Dictionary 'firstName.dict' for R2 over Doc, feeding R3's join ⋈ Follows(F.match, P.match, 0, 60) with Merge(F.match, P.match)):
    HLC 1: remove 555-5555 from the output of R1's Regex operator.
    HLC 2: remove James from the output of R2's Dictionary operator.
    HLC 3: remove James ↔ 555-5555 from the output of R3's join operator.
  • 57. The same goal (remove James ↔ 555-5555 from the output), now with concrete low-level changes (LLCs) attached to the HLCs:
    LLC 1: remove 'James' from FirstNames.dict.
    LLC 2: add a filter predicate on a street suffix in the right context of the match.
    LLC 3: reduce the character gap between F.match and P.match from 60 to 10.
  • 58. Candidate low-level change operators (diagram): for Dictionary operators, filtering with ContainsDict(); for join operators (⋈) in the PersonPhone example, alternative span predicates such as Contains, IsContained, and Overlaps.
  • 59. Generating and ranking LLCs.
    Input: set of HLCs, provenance graph, labeled results.
    Output: list of LLCs, ranked based on improvement in F1-measure.
    Algorithm:
    – For each operator Op, consider all HLCs (ti, Op).
    – For each HLC, enumerate all possible LLCs.
    – For each LLC: compute the set of local tuples it removes from the output of Op, then propagate these removals up through the provenance graph to compute the effect on the end-to-end result.
    – Rank the LLCs.
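The ranking loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions (each candidate LLC is pre-mapped to the set of end-to-end output tuples it would remove; the function names and example tuples are made up), not the paper's implementation:

```python
def f1(outputs, gold):
    """F1-measure of a set of extracted results against labeled results."""
    tp = len(outputs & gold)
    if not outputs or not gold or tp == 0:
        return 0.0
    p, r = tp / len(outputs), tp / len(gold)
    return 2 * p * r / (p + r)

def rank_llcs(outputs, gold, candidate_llcs):
    """candidate_llcs maps an LLC name to the set of end-to-end output
    tuples its removals propagate to; rank LLCs by resulting F1."""
    scored = [(f1(outputs - removed, gold), name)
              for name, removed in candidate_llcs.items()]
    return sorted(scored, reverse=True)

outputs = {"Anna/555-5555", "James/555-5555", "James/777-7777"}
gold = {"Anna/555-5555", "James/777-7777"}
llcs = {"drop 'James' from dict": {"James/555-5555", "James/777-7777"},
        "tighten gap to 10 chars": {"James/555-5555"}}
print(rank_llcs(outputs, gold, llcs))
```

Note how the ranking exposes the trade-off: dropping 'James' from the dictionary removes a correct result along with the wrong one, so the gap refinement scores higher.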
  • 60. Figure: the output tuples of an operator Op, with the tuples to remove from Op's output marked (+/- scatter); candidate changes (e.g. Dictionary-entry removals) are enumerated against these tuples, with O(n²) enumeration cost in n.
  • 61. Charts: precision and recall (0–90%) across refinement iterations (Baseline, I1–I5) on Enron, ACE, CoNLL, and EnronPP. Precision improves greatly after a few iterations, while recall remains fairly stable. Precision: % of identified results that are correct; recall: % of correct labels that are identified. Tasks: Person extraction on formal text (CoNLL, ACE); Person and PersonPhone extraction on informal text (Enron).
  • 62. Chart: F1-measure for Person extraction on informal text (Enron). Almost all of the expert's refinements are among the top 12 generated refinements, produced in 2 minutes, versus Expert A's 9 refinements after 1 hour.
  • 63. 63
  • 64. Summary. Challenge: building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent, and usable. Existing solutions are either rule-based systems built on cascading grammars, with expressivity and efficiency issues, or black-box machine-learning systems that lack transparency. Our solution: a declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling, based on a solid theoretical foundation [PODS'13, PODS'14] and shipping with more than 10 IBM products. Components: AQL, a declarative language for building extractors that outperform the state of the art [ACL'10]; multilingual SRL support [ACL'15, ACL'16, EMNLP'16, COLING'16]; a suite of novel development tooling leveraging machine learning and HCI [EMNLP'08, VLDB'10, ACL'11, CIKM'11, ACL'12, EMNLP'12, CHI'13, SIGMOD'13, ACL'13, VLDB'15, NAACL'15]; cost-based optimization for text-centric operations [ICDE'08, ICDE'11, FPL'13, FPL'14]; a highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record'09, SIGMOD'09]. Development environment: discovery tools for AQL development (create view ProductMention as select … from … where …; create view IntentToBuy as select … from … where …), with the SystemT Runtime mapping input documents to extracted objects. For details and the online class visit: https://ibm.biz/BdF4GQ
  • 65. 65