7.
We are raising our tablet forecast.
(Figure: dependency tree of the sentence — subject "We", predicate "are raising", object "our tablet forecast"; "forecast" is modified by "tablet" and the determiner "our".)
Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected

Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
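Fields like these can be pulled out of a well-formed syslog line with a single regular expression. A minimal Python sketch (the pattern and field names are illustrative, not the actual extractor):

```python
import re

LOG = ("Oct 1 04:12:24 9.1.1.3 41865: "
       "%PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected")

# Illustrative pattern: timestamp, host, process id, category, free-text message.
PATTERN = re.compile(
    r"(?P<time>\w+ \d+ [\d:]+) "
    r"(?P<host>[\d.]+) "
    r"(?P<process>\d+): "
    r"(?P<category>%[\w-]+): "
    r"(?P<message>.*)"
)

fields = PATTERN.match(LOG).groupdict()
print(fields["category"])  # %PLATFORM_ENV-1-DUAL_PWR
```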
8.
Singapore 2012 Annual Report (136-page PDF)
1. Identify the line item for Operating expenses in the Income statement (a financial table in the PDF document).
2. Identify the note breaking down the Operating expenses line item, and extract the opex components.
10.
"Intel's 2013 capex is elevated at 23% of sales, above the average of 16%."
"FHLMC reported a $4.4bn net loss and requested $6bn in capital from Treasury."
"I'm still hearing from clients that Merrill's website is better."

For each entity of interest: customer or competitor? Good or bad?
16. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Tokenization
(preprocessing step)
Level 1
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
• Rule priority used to prefer First over Caps.
• Lossy Sequencing: annotations dropped because input to next stage must be a sequence
  – First preferred over Last since it was declared earlier
17. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e
sagittis Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra
lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque
id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent
Tokenization
(preprocessing step)
Level 1
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
Level 2
First Last → Person
First Caps → Person
First → Person
Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results
20. AQL Language → Optimizer → Operator Graph
• AQL Language: specify extractor semantics declaratively (express the logic of computation, not control flow)
• Optimizer: choose an efficient execution plan that implements the semantics
• Operator Graph: optimized execution plan executed at runtime
26. Operator graph for "……Tomorrow, we will meet Mark Scott, Howard Smith …":
• Dictionary<First>, Dictionary<Last>, and Regex<Caps> produce candidate spans (Tomorrow, Mark, Scott, Howard, Smith).
• Join <First> <Caps> and Join <First> <Last> each yield "Mark Scott" and "Howard Smith".
• Union collects all candidates, including the single matches "Mark", "Scott", "Howard". Input may contain overlapping annotations (no Lossy Sequencing problem); output may contain overlapping annotations (no Rigid Matching Regimes).
• Consolidate — an explicit operator for resolving ambiguity — keeps "Mark Scott" and "Howard Smith".
27. create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
<First> followed within 0 tokens by <Caps>
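The FollowsTok join and CombineSpans can be mimicked over (begin, end) character spans. A rough Python sketch — the dictionaries are toy stand-ins for the First gazetteer and the Caps rule, and token adjacency is approximated by "separated only by whitespace", a simplification of real tokenization:

```python
import re

TEXT = "Tomorrow, we will meet Mark Scott, Howard Smith"

# Toy stand-ins for the First gazetteer and the Caps regex rule.
first = [m.span() for m in re.finditer(r"\bMark\b|\bHoward\b", TEXT)]
caps = [m.span() for m in re.finditer(r"\b[A-Z]\w+", TEXT)]

def follows_tok0(a, b, text):
    """Rough FollowsTok(a, b, 0, 0): b starts right after a, only whitespace between."""
    return a[1] <= b[0] and text[a[1]:b[0]].strip() == ""

def combine(a, b):
    """CombineSpans: the smallest span covering both inputs."""
    return (a[0], b[1])

pairs = [combine(f, c) for f in first for c in caps if follows_tok0(f, c, TEXT)]
print([TEXT[b:e] for b, e in pairs])  # ['Mark Scott', 'Howard Smith']
```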
28. create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
<First><Caps>
<First><Last>
<First>
29. create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
• Explicit consolidate clause for resolving ambiguity (no Rigid Priority problem)
• Input may contain overlapping annotations (no Lossy Sequencing problem)
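The consolidate clause keeps, among overlapping candidates, only the maximal spans. A simplified Python sketch of containment-based consolidation (one of several policies; the span offsets below are illustrative values for the "Mark Scott, Howard Smith" example):

```python
def contains(a, b):
    """True if span a strictly contains span b."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def consolidate(spans):
    """Keep only spans not contained in any other span."""
    return [s for s in spans if not any(contains(t, s) for t in spans)]

# Candidates from the union: full names plus the single tokens they contain.
candidates = [(23, 33), (35, 47), (23, 27), (35, 41)]  # 'Mark Scott', 'Howard Smith', 'Mark', 'Howard'
print(consolidate(candidates))  # [(23, 33), (35, 47)]
```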
30. AQL: a language to express NLP algorithms
Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations, …
Built on the core: Deep Syntactic Parsing, ML Training & Scoring, Semantic Role Labels
31.
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
create dictionary AbbrevDict from file
  'abbreviation.dict';

create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
         on D.text as match from Document D ) R
where
  Not(ContainsDict('AbbrevDict',
      CombineSpans(LeftContextTok(R.match, 1), R.match)));

Equivalent AQL Implementation
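The same logic — find candidate boundaries, then discard those preceded by an abbreviation — fits in a few lines of Python. The patterns and abbreviation set below are illustrative stand-ins for SentenceConstants and abbreviation.dict:

```python
import re

ABBREVIATIONS = {"Mr", "Dr", "St"}  # illustrative stand-in for abbreviation.dict
BOUNDARY = re.compile(r"[.?!]+\s|\n\s*\n")  # mirrors the regex in the AQL view

def sentence_boundaries(doc):
    """Offsets of sentence boundaries, skipping abbreviation-induced false positives."""
    result = []
    for m in BOUNDARY.finditer(doc):
        words = re.findall(r"\w+", doc[:m.start()])
        if words and words[-1] in ABBREVIATIONS:
            continue  # e.g. the period in "Mr. Smith" is not a boundary
        result.append(m.start())
    return result

print(sentence_boundaries("Mr. Smith arrived. He sat down."))  # [17]
```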
35. Tokenization overhead is paid only once.
Plan A: Join — First (followed within 0 tokens) Caps
Plan B (Restricted Span Evaluation): First — extract text to the right, identify Caps starting within 0 tokens
Plan C (Restricted Span Evaluation): Caps — extract text to the left, identify First ending within 0 tokens
40. PersonPhone
Document: "Anna at James St. office (555-5555) …."

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, Phone N
where Follows(P.name, N.number, 0, 30);

Person tuples t1, t2 and Phone tuple t3 produce PersonPhone tuples (t1, t3) and (t2, t3).
Provenance: Boolean expression.
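Provenance as a Boolean expression can be modeled directly: each output tuple carries the conjunction (here, the set) of the input tuples it came from, and removing an input falsifies every output that depends on it. A toy Python sketch, assuming both Person tuples satisfy the Follows predicate as on the slide:

```python
from itertools import product

person = {"t1": "Anna", "t2": "James"}
phone = {"t3": "555-5555"}

# Each PersonPhone output records its provenance: the AND of its input tuples.
person_phone = {(p, n): {p, n} for p, n in product(person, phone)}

def surviving(removed):
    """Outputs whose provenance does not mention any removed input tuple."""
    return [out for out, prov in person_phone.items() if not (prov & removed)]

print(surviving(set()))    # [('t1', 't3'), ('t2', 't3')]
print(surviving({"t2"}))   # [('t1', 't3')] -- removing t2 falsifies (t2, t3)
```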
42. 2013–2017:
• 2013: UC Santa Cruz (full Graduate class)
• 2014: U. Washington (Grad); U. Oregon (Undergrad); U. Aalborg, Denmark (Grad)
• UIUC (Grad); U. Maryland Baltimore County (Undergrad); UC Irvine (Grad); NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad); U. Oregon (Undergrad); U. Maryland Baltimore County (Undergrad); …
• UC Santa Cruz, 3 lectures in one Grad class
• SystemT MOOC
46. create dictionary PurchaseVerbs as
('buy.01', 'purchase.01', 'acquire.01', 'get.01');

create view Relation as
select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
from Action A, Role R
where MatchesDict('PurchaseVerbs', A.verbClass)
  and Equals(A.aid, R.aid)
  and Equals(R.type, 'A1');

ACL '15, '16, EMNLP '16, COLING '16a, '16b, '16c
54.
R1: create view Phone as
  Regex('\d{3}-\d{4}', Document, text);

R2: create view Person as
  Dictionary('first_names.dict', Document, text);

Dictionary file first_names.dict:
  anna, james, john, peter…

R3: create table PersonPhone(match span);
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);

Document: "Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details."
56. Goal: remove "James 555-5555" from output

Operator graph: R1 — Regex '\d{3}-\d{4}' on text; R2 — Dictionary 'firstName.dict' on text; R3 — join on Follows(F.match, P.match, 0, 60), then Merge(F.match, P.match) as match.

High-level changes (HLCs):
• HLC 1: Remove 555-5555 from the output of R1's Regex op.
• HLC 2: Remove James from the output of R2's Dictionary op.
• HLC 3: Remove James 555-5555 from the output of R3's join op.
57. Goal: remove "James 555-5555" from output (same operator graph and HLCs as slide 56)

Low-level changes (LLCs):
• LLC 1: Remove 'James' from FirstNames.dict
• LLC 2: Add a filter predicate on street suffix in the right context of the match
• LLC 3: Reduce the character gap between F.match and P.match from 60 to 10
59.
• Input: set of HLCs, provenance graph, labeled results
• Output: list of LLCs, ranked based on improvement in F1-measure
• Algorithm:
  – For each operator Op, consider all HLCs (ti, Op)
  – For each HLC, enumerate all possible LLCs
  – For each LLC:
    • Compute the set of local tuples it removes from the output of Op
    • Propagate these removals up through the provenance graph to compute the effect on the end-to-end result
  – Rank LLCs
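The enumerate-propagate-rank loop above can be sketched generically. Everything here — the F1 helper, representing an LLC as the set of local tuples it removes — is an illustrative simplification, not the actual algorithm's data structures:

```python
def f1(predicted, correct):
    """Standard F1 over sets of result tuples."""
    tp = len(predicted & correct)
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(correct)
    return 2 * p * r / (p + r)

def rank_llcs(llcs, provenance, output, labels):
    """Rank candidate LLCs by end-to-end F1 after applying each.

    llcs: {name: set of local tuples the LLC removes}
    provenance: {output tuple: set of local tuples it depends on}
    """
    scored = []
    for name, removed in llcs.items():
        # Propagate removals: an output tuple dies if any input it depends on is removed.
        remaining = {t for t in output if not (provenance[t] & removed)}
        scored.append((f1(remaining, labels), name))
    return sorted(scored, reverse=True)

output = {"anna-5555", "james-5555"}
provenance = {"anna-5555": {"t1", "t3"}, "james-5555": {"t2", "t3"}}
labels = {"anna-5555"}
llcs = {"remove James from dict": {"t2"}, "remove 555-5555 regex match": {"t3"}}
print(rank_llcs(llcs, provenance, output, labels))
```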
60. (Figure: the output tuples of an operator Op, with the subset of tuples to remove from the output of Op highlighted.)
61. (Two charts: precision and recall over refinement iterations Baseline, I1–I5, for the Enron, ACE, CoNLL, and EnronPP datasets.)
Precision improves greatly after a few iterations, while recall remains fairly stable.
Precision: % of identified results that are correct. Recall: % of correct labels that are identified.
Person extraction on formal text (CoNLL, ACE); Person and PersonPhone extraction on informal text (Enron).
62. (Chart: F1-measure of generated refinements vs. an expert's manual refinements.)
Almost all of the expert's refinements are among the top 12 generated refinements, produced in 2 minutes; Expert A needed 1 hour for 9 refinements.
Person extraction on informal text (Enron).
64. Development Environment

AQL Extractor:
create view ProductMention as
select ... from ... where ...;

create view IntentToBuy as
select ... from ... where ...;

Cost-based optimization; discovery tools for AQL development.
SystemT Runtime: Input Documents → Extracted Objects
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent, and usable. Existing solutions are either rule-based systems built on cascading grammars, with expressivity and efficiency issues, or black-box machine-learning systems that lack transparency.
Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling, based on solid theoretical foundations [PODS'13, PODS'14] and shipping in over 10 IBM products.
AQL: a declarative language for building extractors that outperform the state of the art [ACL'10]
Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16]
A suite of novel development tooling leveraging
machine learning and HCI [EMNLP’08, VLDB’10,
ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13,
SIGMOD’13, ACL’13,VLDB’15, NAACL’15]
Cost-based optimization for
text-centric operations
[ICDE’08, ICDE’11, FPL’13, FPL’14]
Highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record'09, SIGMOD'09]
For details and
Online Class visit:
https://ibm.biz/BdF4GQ