SlideShare ist ein Scribd-Unternehmen logo
1 von 58
Downloaden Sie, um offline zu lesen
Foundational Research
Propelled by Text Analytics
Benny Kimelfeld
LogicBlox
Preamble
• Myself:
– Ph.D. @ HebrewU (DB uncertainty + search)
– IBM Almaden (DB theory, IR, Text Analytics)
– LogicBlox (ML in DB, Prob. Programming)
– Technion IL (Associate Prof., next year)
• This talk:
 Infrastructure for text analytics
+ DB theory, formal languages, NLP, data mining,
computational complexity, …
2
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
Outline
3
Text Analytics Matters
Some important applications are based on the
analysis of text-centric data; for example:
Semantic Search
Semantic understanding & indexing of
content to better match user's intent
Life-Science Mining
Extract knowledge bases from
scientific publications
e-Commerce
Comparison Shopping extracts &
compares inventory from online sources
CRM / BI
Monitor customer’s social-media activity
for sentiment & business leads
Log Analysis
Summarize, visualize and analyze logs
produced by machines
4
Database Management Systems
• Old news: Data management is involved!
– Data semantics, query/analysis semantics, storage,
query evaluation, indices, consistency, transactions,
backup, privacy, recovery, …
– From-scratch engineering is highly challenging
• Motivation to the concept of a general-purpose
Database Management System
– Most notably: relational model (pioneered by Edgar F.
Codd in 1969) and SQL
5
“Big Data” Phenomena
Proprietary data in orgs.
(enterprises, governments, …)
Proliferation of publically open
data sources (Web, social, …)
Past: Present:
Massive-data analyses incurred
high machinery/personnel cost
Business models (cloud, crowd,
opensource) facilitate analyses
Data structured/controlled by
admins, e-forms, software, …
Uncontrolled data from humans’
free text, heterogeneous kbs, …
Analyses by specialized teams
of heavily trained experts
Analyses by a wide community
featuring a wide range of skills
6
“By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of
big data to make effective decisions.”
“Big data: The next frontier for innovation, competition, and productivity”
McKinsey Report, May 2011
We need dev. & management systems to
facilitate value extraction from Big Data
by a wide range of users / skills
7
Core Task: Information Extraction (IE)
“Information Extraction (IE) is the name given to any process
which selectively structures and combines data which is found,
explicitly stated or implied, in one or more texts. The final
output of the extraction process varies; in every case, however, it
can be transformed so as to populate some type of database.”
J. Cowie and Y. Wilks., Handbook of
Natural Language Processing, 2000
“Information extraction is the identification, and consequent or concurrent
classification and structuring into semantic classes, of specific
information found in unstructured data sources, such as natural language
text, making the information more suitable for information processing tasks.”
M. F. Moens, Information Extraction: Algorithms
and Prospects in a Retrieval Context, 2006
→data-in-text
(unstructured)
data-in-db
(structured)
In short:
8
Popular Classes of IE Tasks
• Named Entity Recognition
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
person person organization
organization
9
Popular Classes of IE Tasks
AdvisedB
y
WorksIn
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
• Named Entity Recognition
• Relation Extraction
10
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Graduation
Where?
Who?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
11
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Education
Start End
Graduation
When?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
12
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
SameEntity
SameEntity
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
• Coreference Resolution
13
ariu
lmaden
A
m. com
Yunyao Li
IBM Research - Almaden
San Jose, CA
yunyaol i @us. i bm. com
Frederick R. Reiss
IBM Research - Almaden
San Jose, CA
f r r ei ss@us. i bm. com
stract
a” analytics over unstruc-
enewed interest in infor-
E). We surveyed the land-
ies and identied amajor
industry and academia:
ominatesthecommercial
garded as dead-end tech-
mia. We believe the dis-
he way in which the two
ethebenets and costsof
mia’s perception that rule-
research challenges. We
mportance of rule-based
Commercial*Vendors*(2013)*
NLP*Papers*
(200392012)*
100%$
50%$
0%$
3.5%*
21%$
75%$
Rule,$
Based$
Hybrid$
Machine$
Learning$
Based$
45%*
22%$
33%$
Implementa@ons*of*En@ty*Extrac@on*
Large*Vendors*
67%*
17%$
17%$
All*Vendors*
IE Paradigms: Rules & Statistics
• Rules
• ML classification
• Probabilistic graphical models
• Soft logic
[Chiticariu, Li, Reiss, EMNLP’13]
• EMNLP, ACL, NAACL, 2003-
2012
• 54 industrial vendors (Who’s
Who in Text Analytics, 2012)
“[…] rules are effective,
interpretable, and are easy
to customize by non-experts
to cope with errors.”
Gupta & Manning, CONLL’14
14
+
NLP
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
Outline
15
Xlog: Datalog for IE
• Extension of (non-recursive) Datalog
• Use case: DBLife (db research kb: dblife.cs.wisc.edu)
• Data types: string, document, span
– Focus on single-document programs
• “Procedural predicates” (p-predicates) are user-defined
functions that produce relations over spans
– Example: sentence(doc, span)
• Query-plan optimization
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Same string, different spans
Span [42,47)
16
Xlog Example
“Declarative Information Extraction using Datalog with Embedded Extraction Predicates”
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Regex.
(string)
Unary
regex
formula
Binary
regex
formula
17
• Datalog syntax
– Types: string, span
• Built in collection of p-predicates
– Various types of built-in regex formulas
– Linguistic: deep parsing, coreference
resolution, named-entity extractor
Instaread: Datalog + NLP
Binary regex
formulas
Unary regex
formulas
[Hoffmann, 2012]
18
IBM SystemT: SQL for IE
• Engine for AQL: SQL-like declarative IE lang.
– AQL = Annotation Query Language
• SystemT = AQL + Runtime + Dev. Tooling
– [Chiticariu et al., ACL 2010]: position SystemT as a
high-quality and high-efficiency IE solution
– System and IDE demos in ACL 2011, SIGMOD 2011
• Commercial product, high academic presence
– Integration on public financial records [Hernández et al., EDBT’13,
Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10,
ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et
al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et
al., Interact’13], social media [Sindhwani et al., IBM Journal 2011]
19
SystemT’s AQL Example
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
regex + join w/ previous views
projection
union
Cleaning
Unary regex formulas
20
Formal Framework
• Repeated concept: Extend a relational query
language with text transducers (p-predicates,
usually regex formulas)
• Research challenge: theoretical underpinnings
of this combined document/relation model
• Expressive power
– Query-plan optimization: Can we rewrite an operator via “easier”
building blocks?
– System extensions: Can we express a new operation using
existing ones, or prove impossibility?
• Next: a formal framework
– With Fagin, Reiss, Vansummeren, PODS’13, JACM
21
22
Terminology
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Company CEO CompanyCEO
[1,14)
(Kaspersky Lab)
[19,36)
(Eugene Kaspersky)
[1,36)
[42,47)
(Intel)
[52,65)
(Paul Otellini)
[42,65)
Relation over spans from the document
Document
Span [52,65)
Document Spanners
Document d Relation over the spans of d
Kaspersky Lab CEO Eugene
Kaspersky said Intel CEO Paul Otellini
and the Intel board had no idea what
they were in for when the company
announced it was acquiring McAfee
on August 19, 2010.
x y z
[1,14) [30,36) [1,36)
[42,47) [52,65) [42,65)
[102,110) [115,125) [102,125)
Document Spanner: a function that maps every
doc. (string) into a relation over the doc.’s spans
More formally:
• Finite alphabet of symbols
• A spanner maps each doc. d ∈ * into a relation over the spans [i,j) of d
• The relation has a fixed signature (set of attributes)
− The attributes come from an infinite domain of variables x, y, z, …
23
Spanners as Regex Formulas
• Regular expression with embedded variables
• Examples:
• Restriction: each “evaluation” (parse tree) assigns
one span to each variable (see [Fagin et al., PODS’13])
Ordinary regex Span variable
 .* x{dddd} .*
 .* in w{Alabama | Alaska | Arizona | …} .*
 (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
Representation system for spanners
24
Spanners as Datalog w/ Regex
• Non-recursive Datalog (NR-Datalog)
• Operate over a document (not a relational db)
Token(x) := [ (Îľ | .*_) x{[a-zA-Z]+} ( ((,V_) .*) | Îľ) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*}] , Loc(z)
Carter_from_Plains,_Georgia,_Washington
_from_Westmoreland,_Virginia
x z
[1,7)
Carter
[13,28)
Plains,_Georgia
[30,40)
Washington
[46,69)
Westmoreland,_Virginia
EDBs = Spanners!
Another representation
system for spanners
Quer
y goal
25
Spanners as Automata
0,1 0 1
Ordinary
NFA
1 0 0 1 1 1 0 1
Var-Stack
Automaton
1 0 0 1 1 1 0 1x{
y{
}
}
y{x{ } }
Var-Set
Automaton
1 0 0 1 1 1 0 1x{
}y
y{x{ }x }y
}x
0,1 0 1
0,1 0 1
• In an accepting run, each variable opens and later closes exactly once
⇒ Each accepting run defines an assignment to the variables
• Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
Close most recent
Close x
y
x
x
y
Another representation system for spanners
y{
26
Study of Expressive Power
Spanners definable by
regex formulas=
Spanners definable by
var-stack automata
Spanners definable by
var-set automata =
Spanners definable by
NR Datalog w/ regex formulas
27
x{
y{
}
}
0,1 0 1
x{
}y
}x
0,1 0 1
y{
.*x{.*}_from_z{.*}.*}
Token(x) := [ (Îľ | .*_) x{[a-zA-Z]+} ( ((,V_) .*) | Îľ) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*}] , Loc(z)
Consequences
• Connections between Datalog+regex
spanners and other language formalisms
– Classic string relations [Berstel 79]
– Graph queries (CRPQs) [Cruz et al. 87]
• Extension with string equality & difference
– Expressiveness / closure properties
• Principles for cleaning inconsistencies
– Follow up work [PODS’14]
– Next…
28
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
Outline
29
Next, highlight 3 lines of foundational research that
were motivated by our work on text analytics:
1. Database inconsistency w/ repair priorities
2. Frequent subgraph mining
3. Update propagation
30
• Extractors may produce inconsistent results
– Data artifacts
– Developer limitations
• Rather than repairing the existing extractors,
common practice is to clean (intermediate) results
– SystemT “consolidators” [Chiticariu et al.10]
– GATE/JAPE “controls” [Cunningham 02]
– Implicit in other rule systems, e.g., WHISK [Soderland 99]
– POSIX regex disambiguation [Fowler 03]
Cleaning IE Inconsistencies
33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303
Person2
Person1
Address1
31
SystemT Consolidators
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
Other policies
built in
32
Five GATE/JAPE Controls
All
Once
First
AppeltBrin
.* x{dd+} .*Sequence 12345 and sequence 12.
Document Spanner
Screenshots from GATE UI
33
Cleaning via Prioritized Repairs
• Problem: existing policies are ad-hoc; how to
expose a language for user declaration?
• [Fagin, K, Reiss, Vansummeren 2014]: spanner
formalism for declarative cleaning
• Key: prioritized repairs [Staworko, et al. 12]
• Idea: Extend extraction programs with
– Denial constraints: which facts are in conflict?
– Priority declarations: preference between facts
• Captures SystemT, GATE, WHISK, POSIX, …
• We are now trying to improve our understanding
of prioritized repairs…
34
Prioritized Repairs: Definition
Database
Denial
Constraints
Collection of facts Which sets of facts
cannot co-exist?
Priority
Relation
Binary “is preferred to”
relation
• [Arenas, Bertossi, Chomicki 99]: Inconsistent DB
represents a set of (equally likely) “repairs”
 Then we can ask for the “possible” or “consistent” query answers
• [Staworko, Chomicki, Marcinkowski 12] add priorities:
• Let A and B be two consistent subsets of the database
• Say that A improves B if we can obtain A from B by a
“profitable” exchange of facts (precision later…)
• A repair is a consistent subset that cannot be improved
Inconsistent Database Instance
35
Example
professor university city
Monica ubiobio ConcepciĂłn
Monica carleton Ottawa
Jorge uchile Santiago
Jorge ubiobio Santiago
Pablo uchile Santiago
Violated constraints (functional
dependencies):
• professor  university, city
(“key constraint”)
• university  city
professor university city
Monica ubiobio ConcepciĂłn
Monica carleton Ottawa
Jorge uchile Santiago
Jorge ubiobio Santiago
Pablo uchile Santiago
professor university city
Monica ubiobio ConcepciĂłn
Monica carleton Ottawa
Jorge uchile Santiago
Jorge ubiobio Santiago
Pablo uchile Santiago
“Ordinary” repairs [Arenas et al. 99]
Tuple priority  some repairs can be discarded [Staworko et al.] 36
A improves B if we get A from B by removing tuples & adding
tuple; each removed preferred to by some added
Complexity of Testing Improvability
Theorem:
 In the case of a single functional dependency
or two keys per relation, improvability can be
tested in polynomial time
 In any other combination of FDs, the
problem is NP-complete!
university faculty dean
UChile Economics Agosin
Technion CS Yavneh
Stanford Law Magill
two keys
37
Can a consistent subset be improved?
Recent work (unpublished)
w/ Fagin & Kolaitis
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply
dependency
parsing
38
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
I
want
buy
gift advisor
1. Apply
dependency
parsing
2. Find freq.
recurring
patterns
39
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
= 3
g1 g2 g3 g4
Freq.
Freq. Max.
Freq.
Max.
Maximal Frequent Subgraphs
Complexity Study
• Naturally, there has been a lot of work on this problem
– SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], …
• But little was known about the computational complexity
• Studied: impact of assumptions on comp. complexity
– Graph properties (e.g., trees, treewidth, etc.)
– Label repeatability
– Bounded #results desired
– Bounded threshold
• This work led to novel complexity results and a new
methodologies for mining maximal subgraphs
– [K & Kolaitis, ACM PODS’13, ACM TODS]
• Next, some complexity nuggets 
41
Complexity Nuggets
• Good news: If labels do not repeat in each input
graph, then there are PTime solutions when
– The threshold is bounded; or
– Graphs are trees & few results are desired
• In general graphs w/o label repetition, you can
find 2 results in PTime
– Bad news: But finding 3rd is NP-hard!
– Bad news: And if labels repeat and graphs are
trees, then finding 2nd is already NP-hard!
• Even for a bounded threshold
42
Improving Dictionaries w/ Feedback
text fragments
(sentences, tables, rows, …)
join
IBM , San Jose
company
occurrences
address
occurrencescompanies, countries, …
Apple , CupertinoIBM , Armonk
IE IE
IE
auto. suggest a “good”
fix to the IE program
Web data
“good” = small effect
on other results
Yahoo! , Cupertino Goo
43
View Updates
• View-update problem: Translate an update on a view to
an update on the base relations
• Deletion propagation as a special case
– Update is delete(a set of view tuples)
• Motivation:
– Classic: database/view maintenance
• DB access only through views, hidden join keys, etc.
– Debugging
• [K&al.12]: deletion propagation for debugging text extractors
– Database causality [Meliou&al.10]
• Intuition: good propagation provides a good explanation of why we
have the tuples to begin with
• [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis,
Repairs and View-Updates in Databases”
44
Example: File Access
GroupFile
group file
ai a.txt
ai b.txt
db a. txt
db b.txt
os a.txt
UserGroup
user group
Emma ai
Emma db
Olivia os
Olivia db
Jacob ai
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
[Cui&Widom01; Buneman&al.02]
Access
user file
Emma a.txt
Emma b.txt
Olivia a.txt
Olivia b.txt
Jacob a.txt
Jacob b.txt
= ⋈
45
Example: File Access
= ⋈
GroupFile
group file
ai a.txt
ai b.txt
db a. txt
db b.txt
os a.txt
UserGroup
user group
Emma ai
Emma db
Olivia os
Olivia db
Jacob ai
Access
user file
Emma a.txt
Emma b.txt
Olivia a.txt
Olivia b.txt
Jacob a.txt
Jacob b.txt
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
[Cui&Widom01; Buneman&al.02]
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
46
Example: File Access
GroupFile
group file
ai a.txt
ai b.txt
db a. txt
db b.txt
os a.txt
UserGroup
user group
Emma ai
Emma db
Olivia os
Olivia db
Jacob ai
Access
user file
Emma a.txt
Emma b.txt
Olivia a.txt
Olivia b.txt
Jacob a.txt
Jacob b.txt
= ⋈
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
[Cui&Widom01; Buneman&al.02]
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
Decision variant is NP-complete [Buneman et al. 02]
47
Trichotomy in Complexity
We have established a precise (easily testable) criterion
that partition all cases into 3 categories:
1. The problem is solvable in PTime, and even via a
straightforward algorithm [Buneman et al. 2001]
2. The problem is NP-hard, but constant-ratio
approximable in PTime (ILP relaxation)
3. The problem is inapproximable for every ratio
Fix a schema (w/ fds) and a CQ w/o self joins
What is the complexity of finding a solution with a minimal side effect?
[K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
48
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
Outline
49
Summary
• Text analytics & IE
• Rule systems for IE
• A formal framework for rules, relating IE to
traditional DB concepts such as Datalog
• Research directions motivated by IE
– Prioritized repairs
– Graph mining
– Update propagation
50
Outlook: DB w/ Deep Text Support
• We need a uniform & elegant data/query model to
combine structured data & text; usefulness for querying
both text and relations
• We need a principled, simple & transparent probability
model + effective quality + practical execution cost
• We need to balance between automation and control:
from full specification by experts to feature generation for
nonexperienced
– Maximally realize the potential of every developer!
– LogicBlox is working on incorporating ML in Datalog!
51
BACKUP SLIDES
52
Room for Both
Statistical
Solution
Rule
System
Feature Engineering
Model Space, Runtime
Cleaning + Post Proc.
Cleaning + Post Proc.
Building blocks
(e.g., dictionaries, NER)
“What doesn’t work: Anything requiring high
precision and full automation”
Feldman & Ungar, KDD’08 tutorial on text mining
53
String DB, Spanners, Interval Algebra
Kaspersky Kaspersky
Intel Otellini
IBM Rometty
[10,20) [16,26)
[32,37) [50,58)
[105,108) [121,128)
[10,20) [16,26)
[32,37) [50,58)
[105,108) [121,128)
String Databases Interval AlgebraSpanners
Atomic value: string Atomic value: span
(pointing to doc)
Atomic value: interval
(no text)
Join by string conditions
(e.g., x is a substring of y)
Join by interval conditions
(e.g., x is a sub-interval of y)
Join by interval+string
conditions (e.g., x a
token in y)
Apps: text predicates in DBs
[Grahne & al. 99] [Benedikt &
al. 03], string manipulation
[Bonner & Mecca 98]
[Ginsburg and Wang 98]
App: IE Apps: temporal reasoning
[Allen 83] [Vilain & Kautz
86] [Nebel & BĂźrckert 95]
[Krokhin et al. 03]
54
55
Imp. 1: Connection to Known Concepts
• Connection to Recognizable Relations [Berstel 79]
– These are unions of cross products of regular languages
– THM: The class of regular spanners is closed under
a string-selection predicate iff the predicate is a
recognizable relation
• Connection to CRPQs [Cruz et al. 87]
– Conjunctive Regular Path Queries have been studied as a
query language for labeled graphs
– THM: Regular spanners have the same expressive
power as unions of CRPQs on paths “with marked
endpoints”
• Up to some simple and necessary adaptation between the models
S I G M O D
Path with marked endpoints
Imp. 2: Adding String Equality
NR Datalog w/ regex formulas
Regular Spanners
Regularstr= Spanners
+ String-equality predicate
(+substring-of, prefix-of, …)
…application from Jane Doe,
social 012-345-6789, on Mar
20th… identified as John Doe,
012-345-6789, ask us to…
x1 x2
[117,125)
(Jane Doe)
[875,883)
(John Doe)
⋮ ⋮
NameSSN(x,y) := …
SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2)
Same string,
different spans
56
Difference with String Equality
• Are regularstr= spanners closed under difference?
– Why should they? Only positive operators are used…
– However, regex formulas (our EDBs) can introduce
“negative” operations (NFAs closed under complement)
• THM: The class of regular spanners is closed under
difference
• PROP: The class of regularstr= spanners is closed
under string-inequality selection
• THM: The class of regularstr= spanners is closed
under string-containment selection, but then, not
under non-string-containment selection!
• COR: The class of regularstr= is not closed under
difference
57
Formal Optimization Problem
Fixed: • Schema S w/ fun. dependencies
• Conjunctive query Q
Input: • Database instance I over S
• Set A⊆ Q(I) of answers to delete
Output: J ⊆ I s.t. Q(J) ∩ A = ∅
Goal: Minimize |(Q(I) – A) – Q(J)|
Side Effect
58

Weitere ähnliche Inhalte

Was ist angesagt?

Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
hplap
 
18231979 Data Mining
18231979 Data Mining18231979 Data Mining
18231979 Data Mining
Raghav agrawal
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
Mounia Lalmas-Roelleke
 
CV
CVCV
CV
Jie Bao
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
butest
 

Was ist angesagt? (20)

SemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic WebSemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic Web
 
Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal Konsep Dasar Information Retrieval - Edi faizal
Konsep Dasar Information Retrieval - Edi faizal
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Ir 01
Ir   01Ir   01
Ir 01
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
 
Fqas09
Fqas09Fqas09
Fqas09
 
BIG DATA RESEARCH
BIG DATA RESEARCHBIG DATA RESEARCH
BIG DATA RESEARCH
 
Managing Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS caseManaging Metadata for Science and Technology Studies: the RISIS case
Managing Metadata for Science and Technology Studies: the RISIS case
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
20 most popular data scientists
20 most popular data scientists20 most popular data scientists
20 most popular data scientists
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
 
18231979 Data Mining
18231979 Data Mining18231979 Data Mining
18231979 Data Mining
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
 
CV
CVCV
CV
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
 

Ähnlich wie Text Analytics - JCC2014 Kimelfeld

Post 1What is text analytics How does it differ from text mini.docx
Post 1What is text analytics How does it differ from text mini.docxPost 1What is text analytics How does it differ from text mini.docx
Post 1What is text analytics How does it differ from text mini.docx
stilliegeorgiana
 
Post 1What is text analytics How does it differ from text mini
Post 1What is text analytics How does it differ from text miniPost 1What is text analytics How does it differ from text mini
Post 1What is text analytics How does it differ from text mini
anhcrowley
 
Data and Information Integration: Information Extraction
Data and Information Integration: Information ExtractionData and Information Integration: Information Extraction
Data and Information Integration: Information Extraction
IJMER
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
ArmyTrilidiaDevegaSK
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
CSCJournals
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
CSCJournals
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
Andre Freitas
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
Salford Systems
 

Ähnlich wie Text Analytics - JCC2014 Kimelfeld (20)

Post 1What is text analytics How does it differ from text mini.docx
Post 1What is text analytics How does it differ from text mini.docxPost 1What is text analytics How does it differ from text mini.docx
Post 1What is text analytics How does it differ from text mini.docx
 
Post 1What is text analytics How does it differ from text mini
Post 1What is text analytics How does it differ from text miniPost 1What is text analytics How does it differ from text mini
Post 1What is text analytics How does it differ from text mini
 
Department of Commerce App Challenge: Big Data Dashboards
Department of Commerce App Challenge: Big Data DashboardsDepartment of Commerce App Challenge: Big Data Dashboards
Department of Commerce App Challenge: Big Data Dashboards
 
Synthesys Technical Overview
Synthesys Technical OverviewSynthesys Technical Overview
Synthesys Technical Overview
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
 
Data and Information Integration: Information Extraction
Data and Information Integration: Information ExtractionData and Information Integration: Information Extraction
Data and Information Integration: Information Extraction
 
Open IE tutorial 2018
Open IE tutorial 2018Open IE tutorial 2018
Open IE tutorial 2018
 
Database Essay
Database EssayDatabase Essay
Database Essay
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 

Mehr von Pedro Contreras Flores

Mehr von Pedro Contreras Flores (20)

El dilema de las redes sociales
El dilema de las redes sociales El dilema de las redes sociales
El dilema de las redes sociales
 
Tipos de sistemas de informaciĂłn
Tipos de sistemas de informaciĂłnTipos de sistemas de informaciĂłn
Tipos de sistemas de informaciĂłn
 
Servicio de informaciĂłn para bibliotecas
Servicio de informaciĂłn para bibliotecasServicio de informaciĂłn para bibliotecas
Servicio de informaciĂłn para bibliotecas
 
GestiĂłn del conocimiento
GestiĂłn del conocimientoGestiĂłn del conocimiento
GestiĂłn del conocimiento
 
Business intelligence (bi) y big data0
Business intelligence (bi) y big data0Business intelligence (bi) y big data0
Business intelligence (bi) y big data0
 
Bibliotecas moviles y calidad
Bibliotecas moviles y calidadBibliotecas moviles y calidad
Bibliotecas moviles y calidad
 
Sistemas y servicios de informacion intro
Sistemas y servicios de informacion introSistemas y servicios de informacion intro
Sistemas y servicios de informacion intro
 
Plataforma de DigitalizaciĂłn
Plataforma de DigitalizaciĂłnPlataforma de DigitalizaciĂłn
Plataforma de DigitalizaciĂłn
 
Red de transporte urbano
Red de transporte urbanoRed de transporte urbano
Red de transporte urbano
 
Packing
PackingPacking
Packing
 
Hormigas arfificiales - Mauro San MartĂ­n
Hormigas arfificiales - Mauro San MartĂ­nHormigas arfificiales - Mauro San MartĂ­n
Hormigas arfificiales - Mauro San MartĂ­n
 
TecnologĂ­as de la informaciĂłn
TecnologĂ­as de la informaciĂłnTecnologĂ­as de la informaciĂłn
TecnologĂ­as de la informaciĂłn
 
Modelamiento y simulaciĂłn
Modelamiento y simulaciĂłnModelamiento y simulaciĂłn
Modelamiento y simulaciĂłn
 
Java 3D
Java 3DJava 3D
Java 3D
 
Complementos de programaciĂłn
Complementos de programaciĂłnComplementos de programaciĂłn
Complementos de programaciĂłn
 
4 memoria dinamica
4 memoria dinamica4 memoria dinamica
4 memoria dinamica
 
3 recursividad
3 recursividad3 recursividad
3 recursividad
 
2 punteros y lenguaje c
2 punteros y lenguaje c2 punteros y lenguaje c
2 punteros y lenguaje c
 
ProgramaciĂłn grafica en lenguaje c
ProgramaciĂłn grafica en lenguaje cProgramaciĂłn grafica en lenguaje c
ProgramaciĂłn grafica en lenguaje c
 
2 archivos
2 archivos2 archivos
2 archivos
 

KĂźrzlich hochgeladen

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
kumargunjan9515
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
kumargunjan9515
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 

KĂźrzlich hochgeladen (20)

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service AvailableVastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 

Text Analytics - JCC2014 Kimelfeld

  • 1. Foundational Research Propelled by Text Analytics Benny Kimelfeld LogicBlox
  • 2. Preamble • Myself: – Ph.D. @ HebrewU (DB uncertainty + search) – IBM Almaden (DB theory, IR, Text Analytics) – LogicBlox (ML in DB, Prob. Programming) – Technion IL (Associate Prof., next year) • This talk:  Infrastructure for text analytics + DB theory, formal languages, NLP, data mining, computational complexity, … 2
  • 3. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 3
  • 4. Text Analytics Matters Some important applications are based on the analysis of text-centric data; for example: Semantic Search Semantic understanding & indexing of content to better match user's intent Life-Science Mining Extract knowledge bases from scientific publications e-Commerce Comparison Shopping extracts & compares inventory from online sources CRM / BI Monitor customer’s social-media activity for sentiment & business leads Log Analysis Summarize, visualize and analyze logs produced by machines 4
  • 5. Database Management Systems • Old news: Data management is involved! – Data semantics, query/analysis semantics, storage, query evaluation, indices, consistency, transactions, backup, privacy, recovery, … – From-scratch engineering is highly challenging • Motivation to the concept of a general-purpose Database Management System – Most notably: relational model (pioneered by Edgar F. Codd in 1969) and SQL 5
  • 6. “Big Data” Phenomena Proprietary data in orgs. (enterprises, governments, …) Proliferation of publically open data sources (Web, social, …) Past: Present: Massive-data analyses incurred high machinery/personnel cost Business models (cloud, crowd, opensource) facilitate analyses Data structured/controlled by admins, e-forms, software, … Uncontrolled data from humans’ free text, heterogeneous kbs, … Analyses by specialized teams of heavily trained experts Analyses by a wide community featuring a wide range of skills 6
  • 7. “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” “Big data: The next frontier for innovation, competition, and productivity” McKinsey Report, May 2011 We need dev. & management systems to facilitate value extraction from Big Data by a wide range of users / skills 7
  • 8. Core Task: Information Extraction (IE) “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database.” J. Cowie and Y. Wilks., Handbook of Natural Language Processing, 2000 “Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks.” M. F. Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context, 2006 →data-in-text (unstructured) data-in-db (structured) In short: 8
  • 9. Popular Classes of IE Tasks • Named Entity Recognition From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. person person organization organization 9
  • 10. Popular Classes of IE Tasks AdvisedB y WorksIn From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. • Named Entity Recognition • Relation Extraction 10
  • 11. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Graduation Where? Who? • Named Entity Recognition • Relation Extraction • Event Extraction 11
  • 12. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Education Start End Graduation When? • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE 12
  • 13. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. SameEntity SameEntity • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE • Coreference Resolution 13
  • 14. ariu lmaden A m. com Yunyao Li IBM Research - Almaden San Jose, CA yunyaol i @us. i bm. com Frederick R. Reiss IBM Research - Almaden San Jose, CA f r r ei ss@us. i bm. com stract a” analytics over unstruc- enewed interest in infor- E). We surveyed the land- ies and identied amajor industry and academia: ominatesthecommercial garded as dead-end tech- mia. We believe the dis- he way in which the two ethebenets and costsof mia’s perception that rule- research challenges. We mportance of rule-based Commercial*Vendors*(2013)* NLP*Papers* (200392012)* 100%$ 50%$ 0%$ 3.5%* 21%$ 75%$ Rule,$ Based$ Hybrid$ Machine$ Learning$ Based$ 45%* 22%$ 33%$ Implementa@ons*of*En@ty*Extrac@on* Large*Vendors* 67%* 17%$ 17%$ All*Vendors* IE Paradigms: Rules & Statistics • Rules • ML classification • Probabilistic graphical models • Soft logic [Chiticariu, Li, Reiss, EMNLP’13] • EMNLP, ACL, NAACL, 2003- 2012 • 54 industrial vendors (Who’s Who in Text Analytics, 2012) “[…] rules are effective, interpretable, and are easy to customize by non-experts to cope with errors.” Gupta & Manning, CONLL’14 14 + NLP
  • 15. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 15
  • 16. Xlog: Datalog for IE • Extension of (non-recursive) Datalog • Use case: DBLife (db research kb: dblife.cs.wisc.edu) • Data types: string, document, span – Focus on single-document programs • “Procedural predicates” (p-predicates) are user-defined functions that produce relations over spans – Example: sentence(doc, span) • Query-plan optimization [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. Same string, different spans Span [42,47) 16
  • 17. Xlog Example “Declarative Information Extraction using Datalog with Embedded Extraction Predicates” [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Regex. (string) Unary regex formula Binary regex formula 17
  • 18. • Datalog syntax – Types: string, span • Built in collection of p-predicates – Various types of built-in regex formulas – Linguistic: deep parsing, coreference resolution, named-entity extractor Instaread: Datalog + NLP Binary regex formulas Unary regex formulas [Hoffmann, 2012] 18
  • 19. IBM SystemT: SQL for IE • Engine for AQL: SQL-like declarative IE lang. – AQL = Annotation Query Language • SystemT = AQL + Runtime + Dev. Tooling – [Chiticariu et al., ACL 2010]: position SystemT as a high-quality and high-efficiency IE solution – System and IDE demos in ACL 2011, SIGMOD 2011 • Commercial product, high academic presence – Integration on public financial records [HernĂĄndez et al., EDBT’13, Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10, ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et al., Interact’13], social media [Sindhwani et al., IBM Journal 2011] 19
  • 20. SystemT’s AQL Example [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] regex + join w/ previous views projection union Cleaning Unary regex formulas 20
  • 21. Formal Framework • Repeated concept: Extend a relational query language with text transducers (p-predicates, usually regex formulas) • Research challenge: theoretical underpinnings of this combined document/relation model • Expressive power – Query-plan optimization: Can we rewrite an operator via “easier” building blocks? – System extensions: Can we express a new operation using existing ones, or prove impossibility? • Next: a formal framework – With Fagin, Reiss, Vansummeren, PODS’13, JACM 21
  • 22. 22 Terminology Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. Company CEO CompanyCEO [1,14) (Kaspersky Lab) [19,36) (Eugene Kaspersky) [1,36) [42,47) (Intel) [52,65) (Paul Otellini) [42,65) Relation over spans from the document Document Span [52,65)
  • 23. Document Spanners Document d Relation over the spans of d Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. x y z [1,14) [30,36) [1,36) [42,47) [52,65) [42,65) [102,110) [115,125) [102,125) Document Spanner: a function that maps every doc. (string) into a relation over the doc.’s spans More formally: • Finite alphabet of symbols • A spanner maps each doc. d ∈ * into a relation over the spans [i,j) of d • The relation has a fixed signature (set of attributes) − The attributes come from an infinite domain of variables x, y, z, … 23
  • 24. Spanners as Regex Formulas • Regular expression with embedded variables • Examples: • Restriction: each “evaluation” (parse tree) assigns one span to each variable (see [Fagin et al., PODS’13]) Ordinary regex Span variable  .* x{dddd} .*  .* in w{Alabama | Alaska | Arizona | …} .*  (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | … Representation system for spanners 24
  • 25. Spanners as Datalog w/ Regex • Non-recursive Datalog (NR-Datalog) • Operate over a document (not a relational db) Token(x) := [ (Îľ | .*_) x{[a-zA-Z]+} ( ((,V_) .*) | Îľ) ] State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*] Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*] CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*] Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y) RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*}] , Loc(z) Carter_from_Plains,_Georgia,_Washington _from_Westmoreland,_Virginia x z [1,7) Carter [13,28) Plains,_Georgia [30,40) Washington [46,69) Westmoreland,_Virginia EDBs = Spanners! Another representation system for spanners Quer y goal 25
  • 26. Spanners as Automata 0,1 0 1 Ordinary NFA 1 0 0 1 1 1 0 1 Var-Stack Automaton 1 0 0 1 1 1 0 1x{ y{ } } y{x{ } } Var-Set Automaton 1 0 0 1 1 1 0 1x{ }y y{x{ }x }y }x 0,1 0 1 0,1 0 1 • In an accepting run, each variable opens and later closes exactly once ⇒ Each accepting run defines an assignment to the variables • Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples Close most recent Close x y x x y Another representation system for spanners y{ 26
  • 27. Study of Expressive Power Spanners definable by regex formulas= Spanners definable by var-stack automata Spanners definable by var-set automata = Spanners definable by NR Datalog w/ regex formulas 27 x{ y{ } } 0,1 0 1 x{ }y }x 0,1 0 1 y{ .*x{.*}_from_z{.*}.*} Token(x) := [ (Îľ | .*_) x{[a-zA-Z]+} ( ((,V_) .*) | Îľ) ] State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*] Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*] CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*] Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y) RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*}] , Loc(z)
  • 28. Consequences • Connections between Datalog+regex spanners and other language formalisms – Classic string relations [Berstel 79] – Graph queries (CRPQs) [Cruz et al. 87] • Extension with string equality & difference – Expressiveness / closure properties • Principles for cleaning inconsistencies – Follow up work [PODS’14] – Next… 28
  • 29. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 29
  • 30. Next, highlight 3 lines of foundational research that were motivated by our work on text analytics: 1. Database inconsistency w/ repair priorities 2. Frequent subgraph mining 3. Update propagation 30
  • 31. • Extractors may produce inconsistent results – Data artifacts – Developer limitations • Rather than repairing the existing extractors, common practice is to clean (intermediate) results – SystemT “consolidators” [Chiticariu et al.10] – GATE/JAPE “controls” [Cunningham 02] – Implicit in other rule systems, e.g., WHISK [Soderland 99] – POSIX regex disambiguation [Fowler 03] Cleaning IE Inconsistencies 33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303 Person2 Person1 Address1 31
  • 32. SystemT Consolidators [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] Other policies built in 32
  • 33. Five GATE/JAPE Controls All Once First AppeltBrin .* x{dd+} .*Sequence 12345 and sequence 12. Document Spanner Screenshots from GATE UI 33
  • 34. Cleaning via Prioritized Repairs • Problem: existing policies are ad-hoc; how to expose a language for user declaration? • [Fagin, K, Reiss, Vansummeren 2014]: spanner formalism for declarative cleaning • Key: prioritized repairs [Staworko, et al. 12] • Idea: Extend extraction programs with – Denial constraints: which facts are in conflict? – Priority declarations: preference between facts • Captures SystemT, GATE, WHISK, POSIX, … • We are now trying to improve our understanding of prioritized repairs… 34
  • 35. Prioritized Repairs: Definition Database Denial Constraints Collection of facts Which sets of facts cannot co-exist? Priority Relation Binary “is preferred to” relation • [Arenas, Bertossi, Chomicki 99]: Inconsistent DB represents a set of (equally likely) “repairs”  Then we can ask for the “possible” or “consistent” query answers • [Staworko, Chomicki, Marcinkowski 12] add priorities: • Let A and B be two consistent subsets of the database • Say that A improves B if we can obtain A from B by a “profitable” exchange of facts (precision later…) • A repair is a consistent subset that cannot be improved Inconsistent Database Instance 35
  • 36. Example professor university city Monica ubiobio ConcepciĂłn Monica carleton Ottawa Jorge uchile Santiago Jorge ubiobio Santiago Pablo uchile Santiago Violated constraints (functional dependencies): • professor  university, city (“key constraint”) • university  city professor university city Monica ubiobio ConcepciĂłn Monica carleton Ottawa Jorge uchile Santiago Jorge ubiobio Santiago Pablo uchile Santiago professor university city Monica ubiobio ConcepciĂłn Monica carleton Ottawa Jorge uchile Santiago Jorge ubiobio Santiago Pablo uchile Santiago “Ordinary” repairs [Arenas et al. 99] Tuple priority  some repairs can be discarded [Staworko et al.] 36 A improves B if we get A from B by removing tuples & adding tuple; each removed preferred to by some added
  • 37. Complexity of Testing Improvability Theorem:  In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time  In any other combination of FDs, the problem is NP-complete! university faculty dean UChile Economics Agosin Technion CS Yavneh Stanford Law Magill two keys 37 Can a consistent subset be improved? Recent work (unpublished) w/ Fagin & Kolaitis
  • 38. IE with Recurring Patterns I want to buy my advisor a gift. I really want to buy a gift to my advisor. I want to buy a gift to the secretary and to my advisor. 1. Apply dependency parsing 38 [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
  • 39. IE with Recurring Patterns I want to buy my advisor a gift. I really want to buy a gift to my advisor. I want to buy a gift to the secretary and to my advisor. I want buy gift advisor 1. Apply dependency parsing 2. Find freq. recurring patterns 39 [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
  • 40. = 3 g1 g2 g3 g4 Freq. Freq. Max. Freq. Max. Maximal Frequent Subgraphs
  • 41. Complexity Study • Naturally, there has been a lot of work on this problem – SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], … • But little was known about the computational complexity • Studied: impact of assumptions on comp. complexity – Graph properties (e.g., trees, treewidth, etc.) – Label repeatability – Bounded #results desired – Bounded threshold • This work led to novel complexity results and a new methodologies for mining maximal subgraphs – [K & Kolaitis, ACM PODS’13, ACM TODS] • Next, some complexity nuggets  41
  • 42. Complexity Nuggets • Good news: If labels do not repeat in each input graph, then there are PTime solutions when – The threshold is bounded; or – Graphs are trees & few results are desired • In general graphs w/o label repetition, you can find 2 results in PTime – Bad news: But finding 3rd is NP-hard! – Bad news: And if labels repeat and graphs are trees, then finding 2nd is already NP-hard! • Even for a bounded threshold 42
  • 43. Improving Dictionaries w/ Feedback text fragments (sentences, tables, rows, …) join IBM , San Jose company occurrences address occurrencescompanies, countries, … Apple , CupertinoIBM , Armonk IE IE IE auto. suggest a “good” fix to the IE program Web data “good” = small effect on other results Yahoo! , Cupertino Goo 43
  • 44. View Updates • View-update problem: Translate an update on a view to an update on the base relations • Deletion propagation as a special case – Update is delete(a set of view tuples) • Motivation: – Classic: database/view maintenance • DB access only through views, hidden join keys, etc. – Debugging • [K&al.12]: deletion propagation for debugging text extractors – Database causality [Meliou&al.10] • Intuition: good propagation provides a good explanation of why we have the tuples to begin with • [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis, Repairs and View-Updates in Databases” 44
  • 45. Example: File Access GroupFile group file ai a.txt ai b.txt db a. txt db b.txt os a.txt UserGroup user group Emma ai Emma db Olivia os Olivia db Jacob ai Access(u,f) :– UserGroup(u,g), GroupFile(g,f) Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! [Cui&Widom01; Buneman&al.02] Access user file Emma a.txt Emma b.txt Olivia a.txt Olivia b.txt Jacob a.txt Jacob b.txt = ⋈ 45
  • 46. Example: File Access = ⋈ GroupFile group file ai a.txt ai b.txt db a. txt db b.txt os a.txt UserGroup user group Emma ai Emma db Olivia os Olivia db Jacob ai Access user file Emma a.txt Emma b.txt Olivia a.txt Olivia b.txt Jacob a.txt Jacob b.txt Access(u,f) :– UserGroup(u,g), GroupFile(g,f) [Cui&Widom01; Buneman&al.02] Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! 46
  • 47. Example: File Access GroupFile group file ai a.txt ai b.txt db a. txt db b.txt os a.txt UserGroup user group Emma ai Emma db Olivia os Olivia db Jacob ai Access user file Emma a.txt Emma b.txt Olivia a.txt Olivia b.txt Jacob a.txt Jacob b.txt = ⋈ Access(u,f) :– UserGroup(u,g), GroupFile(g,f) [Cui&Widom01; Buneman&al.02] Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! Decision variant is NP-complete [Buneman et al. 02] 47
  • 48. Trichotomy in Complexity We have established a precise (easily testable) criterion that partition all cases into 3 categories: 1. The problem is solvable in PTime, and even via a straightforward algorithm [Buneman et al. 2001] 2. The problem is NP-hard, but constant-ratio approximable in PTime (ILP relaxation) 3. The problem is inapproximable for every ratio Fix a schema (w/ fds) and a CQ w/o self joins What is the complexity of finding a solution with a minimal side effect? [K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14] 48
  • 49. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 49
  • 50. Summary • Text analytics & IE • Rule systems for IE • A formal framework for rules, relating IE to traditional DB concepts such as Datalog • Research directions motivated by IE – Prioritized repairs – Graph mining – Update propagation 50
  • 51. Outlook: DB w/ Deep Text Support • We need a uniform & elegant data/query model to combine structured data & text; usefulness for querying both text and relations • We need a principled, simple & transparent probability model + effective quality + practical execution cost • We need to balance between automation and control: from full specification by experts to feature generation for nonexperienced – Maximally realize the potential of every developer! – LogicBlox is working on incorporating ML in Datalog! 51
  • 53. Room for Both Statistical Solution Rule System Feature Engineering Model Space, Runtime Cleaning + Post Proc. Cleaning + Post Proc. Building blocks (e.g., dictionaries, NER) “What doesn’t work: Anything requiring high precision and full automation” Feldman & Ungar, KDD’08 tutorial on text mining 53
  • 54. String DB, Spanners, Interval Algebra Kaspersky Kaspersky Intel Otellini IBM Rometty [10,20) [16,26) [32,37) [50,58) [105,108) [121,128) [10,20) [16,26) [32,37) [50,58) [105,108) [121,128) String Databases Interval AlgebraSpanners Atomic value: string Atomic value: span (pointing to doc) Atomic value: interval (no text) Join by string conditions (e.g., x is a substring of y) Join by interval conditions (e.g., x is a sub-interval of y) Join by interval+string conditions (e.g., x a token in y) Apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98] App: IE Apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & BĂźrckert 95] [Krokhin et al. 03] 54
  • 55. 55 Imp. 1: Connection to Known Concepts • Connection to Recognizable Relations [Berstel 79] – These are unions of cross products of regular languages – THM: The class of regular spanners is closed under a string-selection predicate iff the predicate is a recognizable relation • Connection to CRPQs [Cruz et al. 87] – Conjunctive Regular Path Queries have been studied as a query language for labeled graphs – THM: Regular spanners have the same expressive power as unions of CRPQs on paths “with marked endpoints” • Up to some simple and necessary adaptation between the models S I G M O D Path with marked endpoints
  • 56. Imp. 2: Adding String Equality NR Datalog w/ regex formulas Regular Spanners Regularstr= Spanners + String-equality predicate (+substring-of, prefix-of, …) …application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to… x1 x2 [117,125) (Jane Doe) [875,883) (John Doe) ⋮ ⋮ NameSSN(x,y) := … SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2) Same string, different spans 56
  • 57. Difference with String Equality • Are regularstr= spanners closed under difference? – Why should they? Only positive operators are used… – However, regex formulas (our EDBs) can introduce “negative” operations (NFAs closed under complement) • THM: The class of regular spanners is closed under difference • PROP: The class of regularstr= spanners is closed under string-inequality selection • THM: The class of regularstr= spanners is closed under string-containment selection, but then, not under non-string-containment selection! • COR: The class of regularstr= is not closed under difference 57
  • 58. Formal Optimization Problem Fixed: • Schema S w/ fun. dependencies • Conjunctive query Q Input: • Database instance I over S • Set A⊆ Q(I) of answers to delete Output: J ⊆ I s.t. Q(J) ∊ A = ∅ Goal: Minimize |(Q(I) – A) – Q(J)| Side Effect 58