7.
We are raising our tablet forecast.
(Figure: dependency tree of the sentence — subject "We", predicate "are raising", object "our tablet forecast"; "forecast" is modified by "tablet" and the determiner "our".)
Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected

Time: Oct 1 04:12:24
Host: 9.1.1.3
Process: 41865
Category: %PLATFORM_ENV-1-DUAL_PWR
Message: Faulty internal power supply B detected
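Fields like these can be pulled out of a well-formed syslog line with a single regular expression. A minimal Python sketch (the pattern and field names are illustrative, not the actual extractor):

```python
import re

LOG = ("Oct 1 04:12:24 9.1.1.3 41865: "
       "%PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected")

# Illustrative pattern: timestamp, host, process id, category, free-text message.
PATTERN = re.compile(
    r"(?P<time>\w+ \d+ [\d:]+) "
    r"(?P<host>[\d.]+) "
    r"(?P<process>\d+): "
    r"(?P<category>%[\w-]+): "
    r"(?P<message>.*)"
)

fields = PATTERN.match(LOG).groupdict()
print(fields["category"])  # %PLATFORM_ENV-1-DUAL_PWR
```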
8.
Singapore 2012 Annual Report (136-page PDF)
1. Identify the line item for Operating expenses in the Income statement (a financial table in the PDF document).
2. Identify the note breaking down the Operating expenses line item, and extract the opex components.
10.
"Intel's 2013 capex is elevated at 23% of sales, above the average of 16%."
"FHLMC reported a $4.4bn net loss and requested $6bn in capital from Treasury."
"I'm still hearing from clients that Merrill's website is better."

For each entity of interest: customer or competitor? Good or bad?
16. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Tokenization
(preprocessing step)
Level 1
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
• Rule priority used to prefer First over Caps.
• Lossy Sequencing: annotations dropped because input to next stage must be a sequence
  – First preferred over Last since it was declared earlier
17. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e
sagittis Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra
lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque
id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent
Tokenization
(preprocessing step)
Level 1
Gazetteer[type = LastGaz] → Last
Gazetteer[type = FirstGaz] → First
Token[~ "[A-Z]\w+"] → Caps
Level 2
First Last → Person
First Caps → Person
First → Person
Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results
20. AQL Language → Optimizer → Operator Graph
• AQL Language: specify extractor semantics declaratively (express the logic of computation, not control flow)
• Optimizer: choose an efficient execution plan that implements the semantics
• Operator Graph: optimized execution plan executed at runtime
26. Operator graph for "……Tomorrow, we will meet Mark Scott, Howard Smith …":
• Dictionary<First>, Dictionary<Last>, and Regex<Caps> produce candidate spans (Tomorrow, Mark, Scott, Howard, Smith).
• Join <First> <Caps> and Join <First> <Last> each yield "Mark Scott" and "Howard Smith".
• Union collects all candidates, including the single matches "Mark", "Scott", "Howard". Input may contain overlapping annotations (no Lossy Sequencing problem); output may contain overlapping annotations (no Rigid Matching Regimes).
• Consolidate — an explicit operator for resolving ambiguity — keeps "Mark Scott" and "Howard Smith".
27. create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
<First> followed within 0 tokens by <Caps>
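The FollowsTok join and CombineSpans can be mimicked over (begin, end) character spans. A rough Python sketch — the dictionaries are toy stand-ins for the First gazetteer and the Caps rule, and token adjacency is approximated by "separated only by whitespace", a simplification of real tokenization:

```python
import re

TEXT = "Tomorrow, we will meet Mark Scott, Howard Smith"

# Toy stand-ins for the First gazetteer and the Caps regex rule.
first = [m.span() for m in re.finditer(r"\bMark\b|\bHoward\b", TEXT)]
caps = [m.span() for m in re.finditer(r"\b[A-Z]\w+", TEXT)]

def follows_tok0(a, b, text):
    """Rough FollowsTok(a, b, 0, 0): b starts right after a, only whitespace between."""
    return a[1] <= b[0] and text[a[1]:b[0]].strip() == ""

def combine(a, b):
    """CombineSpans: the smallest span covering both inputs."""
    return (a[0], b[1])

pairs = [combine(f, c) for f in first for c in caps if follows_tok0(f, c, TEXT)]
print([TEXT[b:e] for b, e in pairs])  # ['Mark Scott', 'Howard Smith']
```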
28. create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
<First><Caps>
<First><Last>
<First>
29. create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
• Explicit consolidate clause for resolving ambiguity (no Rigid Priority problem)
• Input may contain overlapping annotations (no Lossy Sequencing problem)
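The consolidate clause keeps, among overlapping candidates, only the maximal spans. A simplified Python sketch of containment-based consolidation (one of several policies; the span offsets below are illustrative values for the "Mark Scott, Howard Smith" example):

```python
def contains(a, b):
    """True if span a strictly contains span b."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b

def consolidate(spans):
    """Keep only spans not contained in any other span."""
    return [s for s in spans if not any(contains(t, s) for t in spans)]

# Candidates from the union: full names plus the single tokens they contain.
candidates = [(23, 33), (35, 47), (23, 27), (35, 41)]  # 'Mark Scott', 'Howard Smith', 'Mark', 'Howard'
print(consolidate(candidates))  # [(23, 33), (35, 47)]
```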
30. AQL: a language to express NLP algorithms
Core Operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations, …
Built on the core: Deep Syntactic Parsing, ML Training & Scoring, Semantic Role Labels
31.
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("\n", " "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %d\n",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* part of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("\n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
create dictionary AbbrevDict from file
  'abbreviation.dict';

create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+\s)|(\n\s*\n))/
         on D.text as match from Document D ) R
where
  Not(ContainsDict('AbbrevDict',
      CombineSpans(LeftContextTok(R.match, 1), R.match)));

Equivalent AQL Implementation
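The same logic — find candidate boundaries, then discard those preceded by an abbreviation — fits in a few lines of Python. The patterns and abbreviation set below are illustrative stand-ins for SentenceConstants and abbreviation.dict:

```python
import re

ABBREVIATIONS = {"Mr", "Dr", "St"}  # illustrative stand-in for abbreviation.dict
BOUNDARY = re.compile(r"[.?!]+\s|\n\s*\n")  # mirrors the regex in the AQL view

def sentence_boundaries(doc):
    """Offsets of sentence boundaries, skipping abbreviation-induced false positives."""
    result = []
    for m in BOUNDARY.finditer(doc):
        words = re.findall(r"\w+", doc[:m.start()])
        if words and words[-1] in ABBREVIATIONS:
            continue  # e.g. the period in "Mr. Smith" is not a boundary
        result.append(m.start())
    return result

print(sentence_boundaries("Mr. Smith arrived. He sat down."))  # [17]
```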
35. Tokenization overhead is paid only once.
Plan A: Join — First (followed within 0 tokens) Caps
Plan B (Restricted Span Evaluation): First — extract text to the right, identify Caps starting within 0 tokens
Plan C (Restricted Span Evaluation): Caps — extract text to the left, identify First ending within 0 tokens
40. PersonPhone
Document: "Anna at James St. office (555-5555) …."

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, Phone N
where Follows(P.name, N.number, 0, 30);

Person tuples t1, t2 and Phone tuple t3 produce PersonPhone tuples (t1, t3) and (t2, t3).
Provenance: Boolean expression.
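Provenance as a Boolean expression can be modeled directly: each output tuple carries the conjunction (here, the set) of the input tuples it came from, and removing an input falsifies every output that depends on it. A toy Python sketch, assuming both Person tuples satisfy the Follows predicate as on the slide:

```python
from itertools import product

person = {"t1": "Anna", "t2": "James"}
phone = {"t3": "555-5555"}

# Each PersonPhone output records its provenance: the AND of its input tuples.
person_phone = {(p, n): {p, n} for p, n in product(person, phone)}

def surviving(removed):
    """Outputs whose provenance does not mention any removed input tuple."""
    return [out for out, prov in person_phone.items() if not (prov & removed)]

print(surviving(set()))    # [('t1', 't3'), ('t2', 't3')]
print(surviving({"t2"}))   # [('t1', 't3')] -- removing t2 falsifies (t2, t3)
```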
42. 2013–2017:
• 2013: UC Santa Cruz (full Graduate class)
• 2014: U. Washington (Grad); U. Oregon (Undergrad); U. Aalborg, Denmark (Grad)
• UIUC (Grad); U. Maryland Baltimore County (Undergrad); UC Irvine (Grad); NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad); U. Oregon (Undergrad); U. Maryland Baltimore County (Undergrad); …
• UC Santa Cruz, 3 lectures in one Grad class
• SystemT MOOC
46. create dictionary PurchaseVerbs as
('buy.01', 'purchase.01', 'acquire.01', 'get.01');

create view Relation as
select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
from Action A, Role R
where MatchesDict('PurchaseVerbs', A.verbClass)
  and Equals(A.aid, R.aid)
  and Equals(R.type, 'A1');

ACL '15, '16, EMNLP '16, COLING '16a, '16b, '16c
54.
R1: create view Phone as
  Regex('\d{3}-\d{4}', Document, text);

R2: create view Person as
  Dictionary('first_names.dict', Document, text);

Dictionary file first_names.dict:
  anna, james, john, peter…

R3: create table PersonPhone(match span);
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);

Document: "Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details."
56. Goal: remove "James 555-5555" from output

Operator graph: R1 — Regex '\d{3}-\d{4}' on text; R2 — Dictionary 'firstName.dict' on text; R3 — join on Follows(F.match, P.match, 0, 60), then Merge(F.match, P.match) as match.

High-level changes (HLCs):
• HLC 1: Remove 555-5555 from the output of R1's Regex op.
• HLC 2: Remove James from the output of R2's Dictionary op.
• HLC 3: Remove James 555-5555 from the output of R3's join op.
57. Goal: remove "James 555-5555" from output (same operator graph and HLCs as slide 56)

Low-level changes (LLCs):
• LLC 1: Remove 'James' from FirstNames.dict
• LLC 2: Add a filter predicate on street suffix in the right context of the match
• LLC 3: Reduce the character gap between F.match and P.match from 60 to 10
59.
• Input: set of HLCs, provenance graph, labeled results
• Output: list of LLCs, ranked based on improvement in F1-measure
• Algorithm:
  – For each operator Op, consider all HLCs (ti, Op)
  – For each HLC, enumerate all possible LLCs
  – For each LLC:
    • Compute the set of local tuples it removes from the output of Op
    • Propagate these removals up through the provenance graph to compute the effect on the end-to-end result
  – Rank LLCs
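The enumerate-propagate-rank loop above can be sketched generically. Everything here — the F1 helper, representing an LLC as the set of local tuples it removes — is an illustrative simplification, not the actual algorithm's data structures:

```python
def f1(predicted, correct):
    """Standard F1 over sets of result tuples."""
    tp = len(predicted & correct)
    if tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(correct)
    return 2 * p * r / (p + r)

def rank_llcs(llcs, provenance, output, labels):
    """Rank candidate LLCs by end-to-end F1 after applying each.

    llcs: {name: set of local tuples the LLC removes}
    provenance: {output tuple: set of local tuples it depends on}
    """
    scored = []
    for name, removed in llcs.items():
        # Propagate removals: an output tuple dies if any input it depends on is removed.
        remaining = {t for t in output if not (provenance[t] & removed)}
        scored.append((f1(remaining, labels), name))
    return sorted(scored, reverse=True)

output = {"anna-5555", "james-5555"}
provenance = {"anna-5555": {"t1", "t3"}, "james-5555": {"t2", "t3"}}
labels = {"anna-5555"}
llcs = {"remove James from dict": {"t2"}, "remove 555-5555 regex match": {"t3"}}
print(rank_llcs(llcs, provenance, output, labels))
```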
60. (Figure: the output tuples of an operator Op, with the subset of tuples to remove from the output of Op highlighted.)
61. (Two charts: precision and recall over refinement iterations Baseline, I1–I5, for the Enron, ACE, CoNLL, and EnronPP datasets.)
Precision improves greatly after a few iterations, while recall remains fairly stable.
Precision: % of identified results that are correct. Recall: % of correct labels that are identified.
Person extraction on formal text (CoNLL, ACE); Person and PersonPhone extraction on informal text (Enron).
62. (Chart: F1-measure of generated refinements vs. an expert's manual refinements.)
Almost all of the expert's refinements are among the top 12 generated refinements, produced in 2 minutes; Expert A needed 1 hour for 9 refinements.
Person extraction on informal text (Enron).
64. Development Environment

AQL Extractor:
create view ProductMention as
select ... from ... where ...;

create view IntentToBuy as
select ... from ... where ...;

Cost-based optimization; discovery tools for AQL development.
SystemT Runtime: Input Documents → Extracted Objects
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent, and usable. Existing solutions are either rule-based systems built on cascading grammars, with expressivity and efficiency issues, or black-box machine-learning systems that lack transparency.
Our Solution: A declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling, based on solid theoretical foundations [PODS'13, PODS'14] and shipping in over 10 IBM products.
AQL: a declarative language for building extractors that outperform the state of the art [ACL'10]
Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16]
A suite of novel development tooling leveraging
machine learning and HCI [EMNLP’08, VLDB’10,
ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13,
SIGMOD’13, ACL’13,VLDB’15, NAACL’15]
Cost-based optimization for
text-centric operations
[ICDE’08, ICDE’11, FPL’13, FPL’14]
Highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record'09, SIGMOD'09]
For details and
Online Class visit:
https://ibm.biz/BdF4GQ