2.5 quintillion bytes of data are created every day. That's 25 followed by 17 zeros, or the capacity of several million laptop hard drives. Big data can be truly overwhelming, so how does one go about making sense of it?
1. Big Data and Veracity Challenges
Text Mining Workshop, ISI Kolkata
L. Venkata Subramaniam
IBM Research India
Jan 8, 2014
2. The Four Dimensions of Big Data
Volume: Data at Rest. Terabytes to exabytes of existing data to process
Velocity: Data in Motion. Streaming data, milliseconds to seconds to respond
Variety: Data in Many Forms. Structured, unstructured, text, multimedia
Veracity*: Data in Doubt. Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations
* Truthfulness, accuracy or precision, correctness
3. We’ve Moved into a New Era of Computing!
In order to realize new opportunities, you need to think beyond traditional sources of data
The term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools
Volume: 12+ terabytes of Tweets created daily.
Velocity: 5+ million trade events per second.
Variety: 100’s of different types of data.
Veracity: Only 1 in 3 decision makers trust their information.
Data sources: Transactional & Application Data, Machine Data, Social Data, Enterprise Content
• Volume • Velocity • Variety • Variety
• Structured • Semi-structured • Highly unstructured • Highly unstructured
• Ingestion • Veracity • Volume • Throughput
4. Volume is growing, so are Veracity issues
By 2015, 80% of all available data will be uncertain
By 2015 the number of networked devices will be double the entire global population. All sensor data has uncertainty.
The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content.
Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.
[Chart: Global Data Volume in Exabytes and Aggregate Uncertainty %, 2005 to 2015]
Multiple sources: IDC, Cisco
5. What is Big Data? Big Data applies to information that can’t be processed or analyzed using traditional processes or tools
[Figure: application areas plotted by data volume, velocity, variety against uncertainty (1/veracity)]
Traditional Data & Processing: precise, authoritative, well formed
Data Uncertainty at Scale: inconsistent, imprecise, uncertain, unverified, spontaneous, ambiguous, deceptive
Application areas shown: Telco Profiles; Call Detail Records; Market Trends; Smart Grid; Smarter Cities; Weather Modeling; Sensor Data; Smarter Traffic; Smarter Water; Portfolio Risk; Market Feeds; Credit Card Transactions; Medical Transcription; Electronic Data Interchange; CRM; Customer Records; Text, Audio, Video; Contact Centers; Retail; Fraud; SWIFT; Account Management; Homeland Security; Disease Progression; Patient Records; Predictive Modeling of Outcomes; Social Network Data; Services
6. Social media users in India
[Chart: India, number of users in millions, by platform: Facebook, Twitter, Linkedin, Youtube, Google Plus]
Platforms table (values as extracted): Facebook, Twitter, Linkedin; 45 Million, 15 Million, 15 Million
7. Veracity issues arise due to:
Process Uncertainty: Processes contain “randomness”
Examples: Uncertain travel times; Semiconductor yield
Data Uncertainty: Data input is uncertain
Examples: Text Entry (Intended Spelling vs. Actual Spelling); GPS Uncertainty; Testimony; Ambiguity {Paris Airport}; Contaminated?; Rumors; Conflicting Data {John Smith, Dallas} {John Smith, Kansas}
Model Uncertainty: All modeling is approximate
Examples: Fitting a curve to data; Forecasting a hurricane (www.noaa.gov)
9. Up to 10,000 times more noisy
Big Data, Fast Data, Noisy Data
Social Media Communication is meant for Friends
30% world population on the internet and increasing fast
Type of Text                         WER
SMS (texting)                        50%
Tweets                               35%
ASR                                  30%
Web queries                          15%
OCR                                  5%
Newswire Text (WSJ, Reuters, NYT)    0.005%
55 million Tweets per day
Lead Generation, Disaster Tracking
Large Dimensional, uncertain, unverified
Noisy, Informal, Implicit and Contextual Conversations:
I’ll see ya tomo
RIP Jackson
I’m lookie out 4 a car 2 burn rubber on the streets of LA
What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!
There are more social networking accounts than people in the world
Social Networking overtakes Search: Facebook becomes the most visited website ahead of Google
Big Data: “More video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been airing 24/7 since 1948.”
10. SMS
0 there – there
1 aint – are not
2 no – no
3 doubt – doubt
4 there – there
5 hon – honey
6 im – I am
7 gonna – going
8 be – be
9 takin – taking
10 it – it
11 4 – for
12 life – life
13 u – You
14 wont – wont
15 b – be
16 rida – rid of
17 me – me
18 lol – laugh out loud
19 Ray – (NAME)
Texting Language: Over 50% of the words are written in non-standard ways
Spontaneous Language: Use of slang, ungrammatical, no punctuations, no case information
Mixing of Languages: Many SMS contain text in a mix of two or more languages
Type of Noise             %
Deletion of Characters    48%
Phonetic Substitution     33%
Abbreviations             5%
Dialectical Usage         4%
Deletion of Words         1.2%
52% words were non-standard
101 SMSes (Contractor et al., 2010)
11. Speech Recognition
Recognition Errors: 10-40% Word Error Rates
Spontaneous Language: Use of slang, use of fillers like um and ah, ungrammatical, false starts, no punctuations, no case information
Mixing of Languages: Contain words from two or more languages
SPEAKER 1: windows thanks for calling and you can learn yes i don't mind it so then i went to
SPEAKER 2: well and ok bring the machine front end loaded with a standard um and that's um it's a desktop machine and i did that everything was working wonderfully um I went ahead connected into my my network um so i i changed my network settings to um to my home network so i i can you know it's showing me for my workroom um and then it is said it had to reboot in order for changes to take effect so i rebooted and now it's asking me for a password which i never i never said anything up
SPEAKER 1: ok just press the escape key i can doesn't do anything can you pull up so that i mean
12. Historical Text
Non Standard Spellings: No notion of the importance of
having a single spelling for each word. Letters would be
added or removed to ease line justification.
New words: New words, words that are variants of
present vocabulary words
Different Language Style: Different grammar, language model.
OCR: Character substitution errors, missed punctuations.
(Baron et al., 2009)
13. Emails, Blogs, Tweets, Online Chat,……
Chat Logs
[12:51:13 PM] Geetha: alrite
[12:52:01 PM] Richa: id has valid pw not expired
[12:52:49 PM] Geetha: can't get to theh site
[12:53:04 PM] Richa: network connection may be slow
[12:54:39 PM] Geetha: ok Im able to now
[12:54:53 PM] Richa: should I reset the password
14. What is Noisy Text?
Any kind of difference in the surface form of an electronic text
from the intended, correct or original text (Knoblock et al.,
2007)
Noise can be at the lexical level {b4, before, befour}
Resulting in substitution, insertion, deletion, transposition, run-on, and split.
Noise can be at morphological, syntactic, discourse level {I can hear u, I can hear you, I can here you}
Resulting in substitution, insertion, deletion, transposition of words and the introduction of out of vocabulary words.
15. Classifying Noise
Lexical Errors (Subramaniam et al., 2009)
Missing characters {before > bef}
Extra characters {raster > raaster}
Phonetic substitution {before > b4, late > l8}
Abbreviations {laugh out loud > lol, United Nations > UN}
Syntactical Errors (Kukich, 1992; Foster et al., 2007)
Missing word {What are the subjects? > What the subjects?}
Extra word {Was that in the summer? > Was that in the summer it?}
Real word spelling errors {She could not comprehend. > She could no comprehend.}
Agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}
Dialectical usage {I’m going to be there > I’m gonna b there}
16. Techniques for Automatically Detecting Lexical Errors (Kukich, 1992)
Nonword error detection
Efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon
Two approaches
N-gram
Look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings.
Good for identifying errors made by OCR devices
But unusual/foreign language valid words will be marked and nice-looking mistakes will be marked valid
Dictionary based
Input string appears in a dictionary? If not, the string is flagged as a misspelled word.
But nearly two-thirds of the words in a dictionary did not appear in an eight million word corpus of New York Times text, and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study)
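The two nonword detection approaches can be sketched as follows. This is a minimal illustration, not the method of any cited system: the tiny LEXICON, the choice of character trigrams, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would use a large word list.
LEXICON = {"before", "form", "from", "there", "doubt", "honey", "taking", "life"}

def char_ngrams(word, n=3):
    """Return character n-grams of a word, padded with start/end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_table(lexicon, n=3):
    """Precompile a table of n-gram frequencies from the lexicon."""
    table = Counter()
    for word in lexicon:
        table.update(char_ngrams(word, n))
    return table

def ngram_suspect(word, table, n=3, min_count=1):
    """N-gram approach: flag a word containing a nonexistent or rare n-gram."""
    return any(table[g] < min_count for g in char_ngrams(word, n))

def dictionary_suspect(word, lexicon):
    """Dictionary approach: flag any string that is not in the lexicon."""
    return word.lower() not in lexicon

TABLE = build_ngram_table(LEXICON)
```

Neither check flags "form" typed in place of "from", which is the real-word error case that isolated-word detection cannot catch.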
17. Techniques for Automatically Detecting Incorrect (Syntax) Grammar (Foster et al., 2007)
Efficient methods to detect word sequences that do not form a grammatical sentence
Three Approaches
N-gram
Classifies a sentence as ungrammatical if it contains an unusual part of speech sequence
Precision-grammar
Classifies a sentence using a parser and a broad-coverage hand-written grammar
Probabilistic-parsing
Finds sentences with parsing error
18. Quantifying Noise (Subramaniam et al., 2009)
Quantifying Lexical Errors {Before, b4, befour, befor, bfore}
Edit Distance
Good for measuring surface level deviation from original
Perplexity
Good for measuring deviation from underlying language structure at character level
Quantifying Semantic Errors {I came to LA yesterday. I am still jet lagged., Came la yester day still jetlagged, Came 2 LA ystrday stil jetl8d}
WER
Good for measuring real word errors (speech recognition errors)
Perplexity
Good for measuring deviation from “proper”
BLEU
Good for comparing a candidate translation against multiple reference translations
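Edit distance and WER are the same computation at two granularities: characters for lexical noise, words for recognition errors. The sketch below assumes unit costs for insertion, deletion and substitution (standard Levenshtein distance); the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

# Surface deviation of each noisy variant from "before".
variants = {v: edit_distance("before", v) for v in ["b4", "befour", "befor", "bfore"]}
```

Under these costs "b4" is the furthest from "before" (distance 5), even though a human reader decodes it instantly, which is one reason surface measures alone understate how recoverable texting language is.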
19. Spelling Correction (Kukich, 1992)
Isolated Word Correction
Minimum edit distance techniques
Similarity key techniques
Probabilistic techniques
N-gram-based techniques
Rule-based techniques
Will not catch typos resulting in correctly spelled words {form, from}
Estimates put real word errors at 30% of all word errors
Context-Dependent Word Correction
Parsing
Language models
Can errors be ignored and still meaningful interpretation be done? {I am coming with you, I comes with you}
20. SMS Text Normalization
dis is n eg 4 txtin lang
This is an example for Texting language
Extreme corruption of words and sentences
Models for SMS language are lacking
Tomorrow never dies!!! Occurrence in a 1000 sms corpus:
2moro (9), tomoz (25), tomoro (12), tomrw (5), tom (2), tomra (2), tomorrow (24), tomora (4), tomm (1), tomo (3), tomorow (3), 2mro (2), morrow (1), tomor (2), tmorro (1), moro (1)
21. Finding Canonical Sets (Acharyya, 2009)
Learn mappings
costmer, castumar, kustamar, coustomber > customer
How can we do it in an unsupervised way?
Find some invariant, that does not change in spite of corruptions
Buckets of context seem invariant!
<..Back Bucket...> sceam <..Front Bucket...>
sceam : sms(2) new(5) recharge(4) tel-provider(2) about(3)
<..Back Bucket...> scheme <..Front Bucket...>
scheme : sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1)
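One way to test whether two context buckets are "invariant" is to compare them as bags of words. The sketch below uses cosine similarity, which is an assumption for illustration and not necessarily the measure used in the cited work. The sceam and scheme counts come from the slide; the "movie" bucket is a hypothetical unrelated word added for contrast.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of context words."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Context buckets from the slide (back and front buckets treated as one bag).
sceam = Counter({"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3})
scheme = Counter({"sms": 4, "new": 2, "activate": 3, "tel-provider": 2,
                  "about": 1, "recharge": 1})
# Hypothetical bucket for an unrelated word, for comparison only.
movie = Counter({"ticket": 4, "show": 3, "new": 1})

sim_variant = cosine(sceam, scheme)
sim_unrelated = cosine(sceam, movie)
```

A noisy form would then be mapped to the canonical word whose bucket is most similar, subject to some threshold that would have to be tuned on real data.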
22. SMS Based FAQ Retrieval (Kothari et al., 2009)
SMS Question: how 2 actvate romng on me hanset
FAQ Database
Q: How do I activate Roaming
A: Dial *567*2# from your handset
Q: What are the rates for roaming within India
A: Roaming rates on prepaid connections are 60 Paise per minute
SMS Answer: Dial *567*2# from your handset
Goal is to find the Question Q* that best matches the SMS S
• A scoring function Score(Q) assigns a score to each question Q in the FAQ dataset. The score measures how closely the question matches the SMS string S.
23. FAQ Retrieval Problem Formulation
SMS is treated as a sequence of tokens $S = s_1, s_2, \ldots, s_n$
Let $\Theta$ denote the questions in the FAQ corpus where each question $Q \in \Theta$ is treated as a set of tokens
Goal is to find the question $Q^*$ that best matches the SMS $S$
24. Method
For each token $s_i$, a list $L_i$ consisting of all terms from the dictionary that are variants of $s_i$ is constructed. Variants are sorted in the descending order of their weight
This space is searched to find the closest matching FAQ question.
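The formulation and method can be sketched end to end on the two FAQ entries shown earlier. This is a simplified illustration under stated assumptions: the variant weight is `difflib.SequenceMatcher.ratio()`, the 0.7 threshold is arbitrary, and the search is exhaustive. The weighting and search used by Kothari et al. are not specified in the slides.

```python
from difflib import SequenceMatcher

# FAQ entries taken from the slide.
FAQ = {
    "How do I activate Roaming": "Dial *567*2# from your handset",
    "What are the rates for roaming within India":
        "Roaming rates on prepaid connections are 60 Paise per minute",
}

def variants(token, vocabulary, threshold=0.7):
    """List L_i: dictionary terms that are variants of token s_i, best first."""
    scored = [(SequenceMatcher(None, token, term).ratio(), term)
              for term in vocabulary]
    return sorted(((w, t) for w, t in scored if w >= threshold), reverse=True)

def score(question, sms_tokens, vocabulary):
    """Score(Q): sum over SMS tokens of the best variant weight found in Q."""
    q_terms = set(question.lower().split())
    total = 0.0
    for s in sms_tokens:
        weights = [w for w, term in variants(s, vocabulary) if term in q_terms]
        total += max(weights, default=0.0)
    return total

def best_question(sms):
    """Return Q*, the FAQ question with the highest score for this SMS."""
    tokens = sms.lower().split()
    vocabulary = {t for q in FAQ for t in q.lower().split()}
    return max(FAQ, key=lambda q: score(q, tokens, vocabulary))

q_star = best_question("how 2 actvate romng on me hanset")
answer = FAQ[q_star]
```

Both questions match "romng" through "roaming", so the decision here rests on "how" and "actvate"; a practical scorer would probably also weight rare terms more heavily.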
25. Extracting Dialog Models (Negi et al., 2009)
Huge number of repetitive calls at contact centers
Building task oriented dialog systems
Task specific information – concepts, subtasks
Task structure - manual encoding
Using large amounts of human to human conversation data
Extracting dialogue models using human-to-human conversations
28. Finding Patterns with Gaps
Need for patterns capturing variations in expressions
Have you rented a car from us before
Have you rented a car before
Have you rented a car from <Rent_Agency> before
Mining regular expression patterns over tokens or entity types
Each pattern represented as a token sequence
[rented car before]
Token sequences mined efficiently using extension of apriori algorithm
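Matching a token-sequence pattern with gaps can be sketched as an ordered subsequence search with a bound on how many tokens may be skipped between consecutive pattern tokens. The `max_gap` parameter and the function name are assumptions for illustration; the slides do not state a gap limit.

```python
def matches_with_gaps(pattern, tokens, max_gap=3):
    """True if pattern tokens occur in order, skipping at most max_gap tokens
    between consecutive pattern tokens."""
    def search(p_idx, start, limit):
        if p_idx == len(pattern):
            return True
        for i in range(start, min(limit, len(tokens))):
            if tokens[i] == pattern[p_idx]:
                # The next pattern token must appear within max_gap skipped tokens.
                if search(p_idx + 1, i + 1, i + 2 + max_gap):
                    return True
        return False
    # The first pattern token may appear anywhere in the utterance.
    return search(0, 0, len(tokens))

PATTERN = ["rented", "car", "before"]
UTTERANCES = [
    "Have you rented a car from us before",
    "Have you rented a car before",
    "Have you rented a car from <Rent_Agency> before",
]
results = [matches_with_gaps(PATTERN, u.lower().split()) for u in UTTERANCES]
```

All three utterance variants from the slide match the single pattern, which is the point of allowing gaps.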
29. Association Analysis
Total number of possible itemsets is exponential ($2^N$)
Brute-force technique infeasible
Support filtering is necessary
• To eliminate spurious patterns
• To avoid exponential search
- Support has anti-monotone property: $X \subseteq Y$ implies $\sigma(Y) \le \sigma(X)$
Efficient algorithms have been designed to exhaustively find all itemsets/patterns with sufficiently high support
Given d items, there are $2^d$ possible candidate itemsets
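The anti-monotone property is what makes level-wise search feasible: a candidate itemset can be discarded without counting as soon as one of its subsets is infrequent. The following is a minimal Apriori-style sketch over unordered itemsets; the toy transactions and the support threshold are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (anti-monotone): every (k-1)-subset must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support}
    return frequent

# Hypothetical token sets, one per utterance.
TRANSACTIONS = [
    {"rented", "car", "before"},
    {"rented", "car", "before", "from"},
    {"rented", "car"},
    {"car", "before"},
]
FREQUENT = apriori(TRANSACTIONS, min_support=3)
```

Here {rented, car, before} is never counted, because its subset {rented, before} already falls below the threshold.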
30. Utterance Normalization
Identify concepts
Named Entity Annotation
Rule based annotator for annotations such as location, date, car model, and amount
“I want to pick it up from <location> on <date>”
Grouping of utterances
Find patterns with gaps and represent each utterance by them along with unigrams and bi-grams
Agent and customer utterances are clustered separately using an off-the-shelf clustering algorithm
31. Finding Subtasks and ordering
Customer and agents engage in similar kinds of interactions to accomplish an objective
Represent each call with agent utterance and customer utterance cluster labels
Subtasks
Patterns of cluster labels (agents) with possible gaps
Lot of variability in customer utterances
Vertical pattern mining
[Figure: a call shown as a sequence of cluster labels C1, C1, C2, C3, C3, …, Cn]
32. Subtask Preconditions
Utterance pre-conditions
Customer utterances that indicate start of a subtask
“please make this booking” for “make payment” subtask
Frequent features from customer utterances
Flow pre-conditions
Only logical orders of subtasks are allowed
“make payment” subtask cannot be executed unless “gather pick-up information” subtask has been executed.
Collection of all the subtasks that precede the subtask
34. Data Fusion
Problem
Given multiple data points about an entity, create a single object representation while resolving conflicting data values
Difficulties
Null values: Subsumption and complementation
Contradictions in data values
Uncertainty & truth: Discover the true value and model uncertainty in this process
Metadata: Preferences, recency, correctness
Lineage: Keep original values and their origin
Implementation in DBMS: SQL, extended SQL, UDFs, etc.
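A few of these difficulties can be made concrete with a small sketch. It assumes one particular resolution policy, "most recent non-null value wins", and records lineage alongside the fused object. The record layout, source names and values are hypothetical, and a real fusion system would combine several policies rather than rely on recency alone.

```python
from datetime import date

def fuse(records):
    """Fuse per-source records about one entity into a single object.

    Policy (an assumption): the most recent non-null value wins.
    Returns the fused object and the lineage of every field.
    """
    fused, lineage = {}, {}
    # Oldest first, so later (more recent) non-null values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["as_of"]):
        for field, value in rec["values"].items():
            if value is None:
                continue  # complementation: a null never overrides a known value
            lineage.setdefault(field, []).append((rec["source"], value))
            fused[field] = value
    return fused, lineage

# Hypothetical records about the same person from two sources.
RECORDS = [
    {"source": "crm", "as_of": date(2012, 5, 1),
     "values": {"name": "Ashok Kumar", "city": "N Delhi", "phone": "555-0100"}},
    {"source": "social", "as_of": date(2013, 9, 1),
     "values": {"name": "A Kumar", "city": "New Delhi", "phone": None}},
]
FUSED, LINEAGE = fuse(RECORDS)
```

The fused name becomes "A Kumar" because it is newer, which shows why recency by itself is a weak policy and why the slide also lists preferences and correctness as metadata.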
35. 360 Context
Analyze social data in the context of enterprise data to build entity and event profiles and establish linkages between them for online and offline analysis
Entity (people, products, events) Insights
The problem: What are the key product interests of person A?
Solution: Over time learn about the person’s product interests from her social media postings
The problem: What is the location and trajectory of person B?
Solution: Gives the current location and locations in the past
The problem: What life events happened in person A’s life in the past x months?
Solution: List significant events like marriage, birth of a child, relocation, etc.
The problem: What are the events of interest happening in a given location?
Solution: Lists the top events in a given geography
The problem: What is the sentiment on a given product?
Solution: Gives the sentiment on a product
Key Sustained Value Factor: Understand customers wants and needs better
Social Data: propensities/sentiment/intent
Enterprise Databases: core customer view/transactions
What MDM 360 does?
• event Detection
• entity Linkages
• sentiment
• event Profiles
• entity Profiles
User Domains / Application Domains
Smarter Commerce: intent to purchase for customers
Smarter Cities: real-time public safety events
Builds an entity’s complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.
36. Extraction Challenges: Stages of intent
Stage
Example
Wishing for an event
“I just want to graduate, get a job, get a car, and
live with my boyfriend”
Anticipating an event
“Im getting a car for graduation yay!!!!!”
During an event
“At disneyworld :D”
Post event / continuous state
“Apparently I got a raise at work three months ago
and didn't know? Sweeeeeeeeeet”
Hobby
“Loves to fish, travel and frequent concerts. Down to earth, athletic, professional, 40 and single. Loves the outdoors, working out, travel and younger fit guys for dating.”
37. Extraction Challenges: Detecting filtering conditions
Filter
Example
Spam
“Need a New #Credit Card for your #Business or
online #Ebay store? Compare and Apply Online.
http://retweet.it/r/We0iai”
Sarcasm, jokes
“I thought I was having a stroke this afternoon but it
turns out it was too many Starbucks Refreshers plus
my leg falling asleep.”
Resolve ambiguous meaning
“In the words of @LNSmooth23 I'm retiring from the
nightlife”
Non-personal
“My mom is buying a house, but why in Willingboro”
38. 360-degree Profiles from Social Media
Social Media based 360-degree Event and Individual Profiles
Event Detection
• Identifiers: what, where, when…
• Attributes: severity, urgency…
Personal Attributes
• Identifiers: name, address, age, gender, occupation…
• Interests: sports, pets, cuisine…
• Life Cycle Status: marital, parental
• Relationships: family, friends, co-workers, work and interest network
Timely Insights on Events
• Event Detection
• Public Safety Events
• Plans for public disturbances
• Sentiment around events
• Citizen sentiment
Timely Insights on Individuals
• Intent to participate in public events
• Instigation for causing public damage
• Sentiment on events, govt policies
• Current Location
• Hate messages
Personal Interests
• Personal preferences or political leanings
• Activity History
Intent
We must support the movement, I am going to the rally at Jantar Mantar tomorrow
Anna Hazare has a point when he says politicians are corrupt and need to be taught a lesson. The rally starts at 10.
Public Safety Events
Mamta Deedi is also joining Anna, Ramdev, n Kejriwal. She is going to do Anshan at Delhi Jantar mantar. Ye sab public ki kahke le rahe hain.
So Its Mamata's day out tomorrow at #JantarMantar. #Rally.
Location announcements
I'm at Karir Square http://4sq.com/fYReSj
39. More data: Customer intent extracted from social media provides context
Prior Business Transactions
Social Data
Entity Extraction, Fact Discovery, Intent & Sentiment
450M+ tweets/day. Millions of tweets yield one company-specific fact: Customer ready to buy a DSLR camera today, possibly at a nearby mall
Intent: “Buying a DSLR today!”
Influencers: Michael’s online friends offer lots of advice
“Go for the best, DP2000”
“Thrza gr8 deal on ZX-550 @ the mall”
Text Analytics used to extract intent from Social Media
“Wifey’s birthday tomorrow, looking for a killer dslr”: Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe
“Maybe I should buy her that purple roadster, while I’m at it. ;-) lol”: Sarcasm, Wishful Thinking; Intent to Purchase, Gift Type?
“In NYC area this w/e, any good malls nearby?”: Potential Locations and Activity; Region & City Location, Timeframe, Intent to Shop
Resultant fact base contains billions of facts, and is incrementally updated
Fact segmentation or clustering is rapid enough to drive a business decision
40. Matching Twitter profiles to Corporate Data
• Linking Social Media profiles with Employee database
• Several extensions are possible, for example, linking with Citizens and Security databases
Twitter: 45M profiles
Social media profiles (name, address, gender, age, employment, relationship, …)
Employment filter: Social media profiles of IBM employees and their network
Attributes used: Name (first, last), work location, job description
Employee Directory: 460K entries
Name: (first, middle, last, preferred)
Home location: city, (state), country
Employment: company + role
Work location: (city, state, zip, country)
Job description
Current Demo focused on Name and Location matching, as well as EmployeeOf information
Choice of social media profile attributes for linking constrained by availability of IBM BluePage attributes
Resolution
Semantic Name Variations
Bill Chamberlin vs. Chamberlain, William H.
C. Mohan vs. Mohan Chandrasekaran (Mohan)
Geo Proximity
Saratoga, CA vs. San Jose, CA
New Jersey vs. New York
Job Role Disambiguation
“Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”
41. Example Result
• Semantic name variations: Twitter name is a close variation of the IBM names
• Geo Proximity: Work locations are within 25mi of the Twitter location
• Job Role Disambiguation : description in Twitter profile matches HR role
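The geo proximity rule can be sketched with the haversine great-circle distance. The coordinates below are approximate city-centre values chosen for illustration, and comparing two single points is a simplification; how the demo geocoded locations is not stated in the slides.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def geo_proximate(a, b, max_miles=25.0):
    """True if two (lat, lon) points are within max_miles of each other."""
    return haversine_miles(*a, *b) <= max_miles

# Approximate coordinates (assumptions, for illustration only).
SARATOGA_CA = (37.2638, -122.0230)
SAN_JOSE_CA = (37.3382, -121.8863)
NEW_YORK_NY = (40.7128, -74.0060)
```

With these values `geo_proximate(SARATOGA_CA, SAN_JOSE_CA)` holds, since the two cities are roughly nine miles apart, while San Jose and New York are far outside the 25 mile radius.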
42. Common Data Problems
• Lack of information standards
• Different formats & structures across different systems
Ashok Kumar | Four sixteen Street 8, Anand Niketan, Delhi
A Kumar | #416 Anand Niketan, N Delhi, 21
Mr. Ashok Kr | 416 Anand Niketan, New Delhi, India 110021
• Data surprises in individual fields
• Data misplaced in the database
• Special characters in the data
Email                        Tax ID         Telephone
91,,,,                       228-02-1975    6173380300
ranivrgeoi@yahoo.co.in       025-37-1888    415-392-2000
,CYRUS_DASTUR@HOTMAIL.COM    34-2671434     3380321
HP 15 State St.              508-466-1200   Orlando
• The redundancy nightmare
• Duplicate records with a lack of standards
90328574   IBM                     187 N.Pk. Str. Salem NH 01456
90328575   I.B.M. Inc.             187 N.Pk. St. Salem NH 01456
90238495   Int. Bus. Machines      187 No. Park St Salem NH 04156
90233479   International Bus. M.   187 Park Ave Salem NH 04156
90233489   Inter-Nation Consults   15 Main Street Andover MA 02341
90345672   I.B. Manufacturing      Park Blvd. Bostno MA 04106
43. Address Variations…
• Spelling variations, hyphenation, abbreviations
  • I-344 | Sarojini Nagar | N Delhi | 23
  • 344 Block J | Sarojni Ngr | New Delhi | 110023
  • 344 Block I | Sarojni Ngr | New Delhi | 110023
• Multiple Ways of writing the same field
  • 13B | Link Road | Versova | Mumbai
  • 18 Block M | Bandra Versova Link Rd. | Versova | Mumbai
• Missing Address Fields
  • 4, Block C | ISID Campus | V. Kunj | New Delhi | 110070
  • 4C | ISID Campus | Institutional Area | V. Kunj | New Delhi | 110070
• Errors
  • 4C | ISID Campus | Institutional Area | V. Kunj, New Delhi | 110007
44. Regional variations in Addresses across
India
Addresses in different regions contain words of the local language even when the
addresses are written in English
Ex : The commonly used word to describe a street type is “Gali” in Northern
India whereas “Beedhi/Veedhi” is the commonly used term in Southern India
Street Intersections and Street Information containing multiple Street Type Identifiers
like Cross and Main are extensively found in the Southern Indian regions
Ex : “3rd Main, 4th B Cross”
Sector and Pocket Information are found primarily in North Indian Addresses
Ex : “Sector 5, Pocket 2A 2nd Block”
Regional differences in writing addresses necessitate bifurcation of standardization
rules based on regions.
45. Investigating the Data
Take the Example: 123 St. Virginia St.
Parsing: Separates multi-valued fields into individual pieces
  123 | St. | Virginia | St.
Lexical Analysis: Determines business significance of individual pieces
  123 (Number) | St. (Street Type) | Virginia (Alpha) | St. (Street Type)
Context Sensitive: Identifies various data structures and content
  123 (Number) | St. Virginia (Street Name) | St. (Street Type)
“The instructions for handling the data are inherent within the data itself.”
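The three stages can be sketched on the slide's example. This is a toy illustration only: the street-type list and the single context rule (everything between a leading number and a trailing street type is the street name) are assumptions, not the rule set of any product.

```python
# Hypothetical, deliberately small list of street-type tokens.
STREET_TYPES = {"st", "st.", "rd", "rd.", "ave", "ave."}

def parse(address):
    """Parsing: separate a multi-valued field into individual pieces."""
    return address.split()

def lexical(tokens):
    """Lexical analysis: classify each piece in isolation."""
    tagged = []
    for tok in tokens:
        if tok.isdigit():
            cls = "Number"
        elif tok.lower() in STREET_TYPES:
            cls = "Street Type"
        else:
            cls = "Alpha"
        tagged.append((tok, cls))
    return tagged

def contextual(tagged):
    """Context-sensitive pass: reinterpret pieces using their position.

    Assumed rule: with a leading Number and a trailing Street Type,
    everything in between forms the Street Name.
    """
    if (len(tagged) >= 3 and tagged[0][1] == "Number"
            and tagged[-1][1] == "Street Type"):
        name = " ".join(tok for tok, _ in tagged[1:-1])
        return [tagged[0], (name, "Street Name"), tagged[-1]]
    return tagged

TOKENS = parse("123 St. Virginia St.")
LEXICAL = lexical(TOKENS)
RESULT = contextual(LEXICAL)
```

The first "St." is a street type lexically but part of the street name in context, which is the ambiguity the slide is illustrating.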
46. Sample Standardized Output
Sample Address Input:
“SANT KRUPA BUILDING, 2ND FLOOR, CHHEDA RD, NR S V JOSHI HIGH SCHOOL, DOMBIVALI (E), THANE. INDIA.”
Standardization Output:
DoorNo: 20
Floor Value: 2nd FLOOR
Building Name: SANT KRUPA
Building Type: BUILDING
Street Name: CHHEDA
Street Type: ROAD
Landmark Position: NEAR
Landmark: S V JOSHI HIGH SCHOOL
Area: DOMBIVALLI EAST
City: THANE
District: THANE
State: MAHARASHTRA
47. Input Addresses vs Standardized Addresses
Sr.No 1
Input address: A38/91 KONIA . . VARANASI INDIA
Standardized address: A38/91 KONIA VARANASI VARANASI UTTARPRADESH INDIA
Highlights: Autopopulation of state
Sr.No 2
Input address: VILL BASUDEVPUR PO KHANJANCHAK DURGACHAK HALDIA TAMLUK INDIA
Standardized address: DURGACHAK ,HALDIA,VILLAGEBASUDEVPUR PO-KHANJANCHAK TAMLUK EAST MIDNAPORE WESTBENGAL INDIA
Highlights: Rural address handling
Sr.No 3
Input address: NEAR RAJGHAR GIRLS SCHOOL LACHIT NAGAR HOUSE NO 5 ULUBARI GUWAHATI ASSAM GUWAHATI INDIA
Standardized address: 5 NEAR RAJGHAR GIRLS SCHOOL ULUBARI LACHIT NAGAR GUWAHATI KAMRUP ASSAM INDIA
Highlights: Maintaining a standard format across addresses (house no precedes Landmark information)
Sr.No 4
Input address: 1/15, PREMJYOTI CO OP HSG SOC., RAMBAUG - 5, KALYAN (W), MAHARASHTRA 421301 BHIWANDI INDIA
Standardized address: 1/15 PREMJYOTI COOPERATIVE HOUSING SOCIETY,RAMBAUG 5 KALYAN WEST BHIWANDI THANE MAHARASHTRA 421301
Highlights: Standardization of tokens
Sr.No 5
Input address: 3/2,FIRINGI DANGA ROAD, P.O.MALLICKPARA SERAMPORE-3 CALCUTTA INDIA
Standardized address: 3/2,FIRINGI DANGA ROAD, SERAMPORE-3 P.O.MALLICKPARA KOLKATA WESTBENGAL INDIA
Highlights: Standardization of tokens
48. Two Methods to Decide a Match
Are these two records a match?
RHITU K      KAZANGIAN   128 MAIN ST     02111   12/8/62
RITU KUMAR   KAZANGIAN   128 MAINE RD    02110   12/8/62
Letter grades: B B A A B D B A = BBAABDBA
Weights: +5 +2 +20 +3 +4 -1 +7 +9 = +49
Deterministic Decisions Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
49. A Closer Look at Probabilistic Matching
RHITU K      KAZANGIAN   128 MAIN ST     02111   12/8/62
RITU KUMAR   KAZANGIAN   128 MAINE RD    02110   12/8/62
Weights: +5 +2 +20 +3 +4 -1 +7 +9 = 49
The weighted score is a relative measure of the probability of a match; it expresses the amount of information content for all of the fields compared
The CUTOFF is the score above which good matches are found
[Chart: Histogram of Weights. x-axis: weight, -50 to 60; y-axis: # of Pairs, 0 to 4000; series: UnMatched, Matched]
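The two decision methods can be contrasted on the example pair. In this sketch the per-field weights and the grade pattern are the ones shown on the slides, but the field names attached to them, the cutoff value of 30, and the contents of the grade table are assumptions for illustration.

```python
# Weights from the slide, in the order shown; the field names are assumed.
FIELD_WEIGHTS = {
    "first_name": 5, "middle_name": 2, "last_name": 20, "house_number": 3,
    "street_name": 4, "street_type": -1, "zip": 7, "birth_date": 9,
}

def probabilistic_decision(field_weights, cutoff):
    """Sum the per-field weights and compare the total to the cutoff."""
    score = sum(field_weights.values())
    return score, score >= cutoff

# Hypothetical stand-in for the vendor-delivered table of grade patterns.
GRADE_TABLE = {"AAAAAAAA": "Match", "BBAABDBA": "Suspect"}

def deterministic_decision(grades, table):
    """Look the combined letter grades up in a fixed decision table."""
    return table.get(grades, "Fail")

SCORE, IS_MATCH = probabilistic_decision(FIELD_WEIGHTS, cutoff=30)
DECISION = deterministic_decision("BBAABDBA", GRADE_TABLE)
```

The deterministic table needs an entry for every grade pattern it should accept, whereas the probabilistic score degrades gradually as individual fields disagree.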
50. The Value of Information Content
Information Content is measured both at the field and at the field value level and is calculated automatically
Discriminating Value represents the significance of one field versus another in contributing to a match
For example a Gender Code contributes less information than a Tax-Id Number
Frequency represents the significance of one value in a field over another value
For example in a Last-Name Field, “SMITH” contributes less information than “ROUTZAHN”
Probabilistic Matching uses the automatically generated measures of Information Content to achieve the highest match rates possible utilizing a scientifically-justifiable methodology
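One simple way to see why rare values carry more information is the self-information of a value, $\log_2(1/p)$, where $p$ is the value's relative frequency in the field. The sketch below uses that measure as an illustrative stand-in; the slides do not give the formula actually used, and the sample columns are made up.

```python
import math
from collections import Counter

def value_weights(values):
    """Weight of each distinct value: log2(total / count), in bits."""
    counts = Counter(values)
    total = len(values)
    return {v: math.log2(total / c) for v, c in counts.items()}

# Hypothetical columns from a 16-record file.
LAST_NAMES = ["SMITH"] * 8 + ["KUMAR"] * 7 + ["ROUTZAHN"]
GENDER = ["F", "M"] * 8
TAX_ID = [f"ID-{i:02d}" for i in range(16)]  # every value unique

NAME_W = value_weights(LAST_NAMES)
GENDER_W = value_weights(GENDER)
TAX_W = value_weights(TAX_ID)
```

Agreement on "ROUTZAHN" is worth 4 bits here against 1 bit for "SMITH", and any tax ID agreement outweighs a gender agreement, which mirrors both examples on the slide.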
51. Data Framework around the Individual
Individual Core
Personal data:
• Name, Address
• Phone, eMail
• Behavioral Preference / permissions
Individual Credentials
• Logins (User credentials)
• Profile
• Expertise
Individual Social Presence
• BLOG
• Comment
• Opinion “Like”s
• Community
Big Data
• External & internal unstructured data linked to individuals
• Social activity
Relationships with the individual
• to Person
• Communities
• to Company: Roles, History
• IBM Linkage
Transactions involving the Individual
• Tech Support Call
• Opportunity & Orders
• Responses to Marketing Campaigns
Interactions with the Individual
• Digital
• Phone
• eMail
• F2F
• Social
• Web traffic
52. Analytics steps
Text Analytics
• Analyze and extract consumer attributes from individual messages
Entity Integration
• Integrate information about a consumer within a single social media source over time
Entity Resolution
• Link social media profiles with customer data
• Link and integrate information about a consumer across multiple social media sources
Social Profiles of Consumers
Master Data on Customers
Intent
All I really want is the Disney Visa card from chase with the castle on it
Life Events
Looks like we'll be moving to New Orleans sooner than I thought.
Personal Attributes
I am a engineer, mom, and wife
Relationships
In fact I'm looking forward to the new month. Both myself and the wife have our graduation ceremonies
53. Person Information across Documents
Who Is James Dimon?
Do these filings refer to the same person ?
variability in the person’s name, lack of a key identifier
supporting attributes vary depending on the context (form type)
All these facts need to be linked and integrated
Signatures
Biographies
Insider
Transactions
Committee
memberships
54. Entity & Relationship Analytics from Big Data
[Figure: pipeline with stages Crawl, Extract / Text Analytics, Entity Resolution, Map/Fuse/Aggregate, leading from unstructured data sources (untrusted view) to entity views (entities & relationships: object-centric view)]
Challenge
Construct and maintain comprehensive profiles of entities and relationships from unstructured data sources
Main Problem: Assemble an entity view, where each entity aggregates data from thousands of different documents
Multiple stages of complex processing:
– Information extraction
  • From each unstructured document, extract relevant structured records
– Entity resolution
  • Link records (possibly across documents) that are about the same real-world “entity”
– Entity population (entity integration): mapping / fusion / aggregation
  • Collect all the facts about the same entity into one rich object with clean values and relationships to other entities
55. The Complete Entity View
Current purchase intentions
expressed by the consumer
Location-based information about a
consumer (where they plan to travel,
events they are going to attend)
Purchase history for a consumer
Life events (relocation, home
purchase, wedding, graduation)
Related people based on social
networking data
Comments/complaints expressed about
various products and services
Customer identity information (e.g.,
name, location) obtained from profiles
and content of posts
Micro-segmentation information about
individual consumers (e.g., gender, age
range, profession)
360-degree profile of a customer
City      State  Age Range  Gender  Marital Status  Number of kids  Employment Status  Occupation
Houston   TX     30-39      Female  ?               ?               Employed           Journalist
San Jose  CA     ?          Male    Married         2               Employed           Software Engineer
…         …      …
Aggregate attributes from multiple sources
Filter to obtain a segmentation
Analyze to obtain “Similar Populations”
Adding more input data gives better predictive power
56. Attribute fusion example: Inferring location from multiple clues
Metadata
Name: Tracy Guida
Screen name: @tracyguida
Location: Tampa
Description: just a Nor-Cal gal trying to fall in love with
Florida
Social Media Profile
Screen name : @tracyguida
Location:
Tampa, FL
Name:
Tracy Guida
Disambiguation, fusion of
partial information
Permanent
location
Fusion libraries:
• Confidence: metadata vs. content
Messages
Gotta love Florida football #hot #humid
http://instagr.am/p/QOHPqhKdYt/
Check out my blog about #food in #TampaBay
http://www.myothercitybythebay.com
Textual clues
Temporary location
I'm at Tracy's Seat At Micah's (Tampa, FL)
http://4sq.com/SZ4yjj
I'm at S.o.G (Tampa, Florida)
http://4sq.com/UDweM5
Check-ins
I'm at Eats American Grill (Tampa, FL)
http://4sq.com/O1a1Jm
Who's watching the #presidential #debate tonight?
(from 27.97989014,-82.54825406)
Fusion libraries:
• Confidence: place mentions vs. geo-codes
• Analysis of location time-series
Geo-located documents
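One simple way to fuse these clues is a confidence-weighted vote. The sketch below is illustrative only: the confidence values per clue type are made up, the clue list is a hand-coded reading of the example above, and the geo-code is assumed to resolve to Tampa.

```python
from collections import defaultdict

# Hypothetical confidence per clue type (not taken from the slides).
CONFIDENCE = {"profile": 0.6, "text": 0.3, "checkin": 0.8, "geocode": 0.9}

# Hand-coded clues from the example profile and messages.
clues = [
    ("profile", "Tampa, FL"),            # profile location field
    ("text", "Tampa, FL"),               # "#TampaBay" mentioned in a message
    ("text", "Northern California"),     # "Nor-Cal gal" in the description
    ("checkin", "Tampa, FL"),
    ("checkin", "Tampa, FL"),
    ("geocode", "Tampa, FL"),            # coordinates assumed to fall in Tampa
]

def fuse_location(clues):
    """Sum confidences per candidate place and normalize to shares."""
    scores = defaultdict(float)
    for kind, place in clues:
        scores[place] += CONFIDENCE[kind]
    total = sum(scores.values())
    return {place: score / total for place, score in scores.items()}

fused = fuse_location(clues)
best = max(fused, key=fused.get)
```

A fuller treatment would also separate permanent from temporary location, for example by analyzing the time series of check-ins, which this sketch does not attempt.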
57. The Reliability (Veracity) Challenge
$\Theta = \{\theta_1, \ldots, \theta_N\}$ - a set of hypotheses (frame of discernment, universe of discourse)
$\{x_n^i\}$ - probability, possibility, or belief in hypothesis $\theta_n$ of source $i$
$\{O_i\}$ - input data (social media, enterprise information)
$F(x_1, \ldots, x_I)$ - fusion operator
[Diagram: the environment supplies input data $\{O_1\}, \ldots, \{O_I\}$ to Source 1 through Source I, each with its own source belief model and source characteristics; the sources output $\{x_1\}, \ldots, \{x_I\}$, which the fusion operator combines into $F(x_1, \ldots, x_I)$]
58. Typical Reliability Settings
It is possible to assign a numerical degree of reliability to each source
A subset of sources is reliable but we do not know which one
Reliabilities of the sources can be ordered but no precise reliability values
are known
Reliability also depends on context
During the Mumbai Mantralaya fire there were only a few tens of tweets about the event on Twitter
On the same day there was a match, and several thousand tweets said “Miami on Fire”
59. Strategies for Utilizing Reliability
Strategies explicitly utilizing reliability of sources
Reliability is used to transform the beliefs of each model before fusion; the transformed beliefs are then fused (separable case)
Strategies for modifying the fusion process to account for the reliability of the sources (non-separable case)
Strategies identifying reliability of data input to fusion processes and eliminating the sources of poor reliability
Combination of strategies mentioned above
$F(x_1, \ldots, x_I) \rightarrow F_R(x_1, \ldots, x_I)$
$F_R$ is a context-dependent operator, which depends on the strategy selected and is defined within the framework used for uncertainty representation
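The separable case can be illustrated as follows. The discounting rule (shrinking a probability vector toward the uniform distribution) and the product-based operator are assumptions chosen for the sketch; the slides do not prescribe a particular transformation or operator.

```python
def discount(belief, reliability):
    """Shrink a probability vector toward uniform as reliability drops.

    reliability = 1 keeps the belief unchanged; reliability = 0 yields
    the uniform vector, so the source no longer influences the result.
    """
    n = len(belief)
    return [reliability * p + (1.0 - reliability) / n for p in belief]

def product_fusion(beliefs):
    """Plain operator F: element-wise product, then renormalize."""
    fused = [1.0] * len(beliefs[0])
    for belief in beliefs:
        fused = [f * p for f, p in zip(fused, belief)]
    total = sum(fused)
    return [f / total for f in fused]

def fuse_with_reliability(beliefs, reliabilities):
    """F_R (separable case): apply F to reliability-transformed beliefs."""
    transformed = [discount(b, r) for b, r in zip(beliefs, reliabilities)]
    return product_fusion(transformed)

x1 = [0.9, 0.1]  # source 1 favours hypothesis 0
x2 = [0.2, 0.8]  # source 2 favours hypothesis 1
trusted_both = fuse_with_reliability([x1, x2], [1.0, 1.0])
distrust_2 = fuse_with_reliability([x1, x2], [1.0, 0.0])
```

When source 2 is given zero reliability, its belief becomes uniform and the fused result reduces to the belief of source 1.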
60. Reliability Coefficients
Reliability coefficients represent trust in each belief model. They introduce a second level of uncertainty and represent a measure of the adequacy of the model used, the reality of the environment, and source characteristics
$R_i = R_i(M_i, \gamma, \Upsilon)$ - reliability of source $i$ (reliability of source $i$ and hypothesis $j$: $R_{ij}$)
$M_i$ - model of source $i$
$\gamma$ - parameters characterizing the external environment (context)
$\Upsilon$ - parameters characterizing the internal environment of source $i$ (tuning parameters)
Relative reliability: $\sum_{i=1}^{I} R_i = 1$
May be replaced with $\max_i R_i = 1$
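The two normalizations can be written directly. The raw reliability values below are hypothetical and only serve to show the effect of each convention.

```python
def normalize_sum(reliabilities):
    """Relative reliability: scale so that the coefficients sum to 1."""
    total = sum(reliabilities)
    return [r / total for r in reliabilities]

def normalize_max(reliabilities):
    """Alternative: scale so that the most reliable source has R = 1."""
    top = max(reliabilities)
    return [r / top for r in reliabilities]

raw = [0.8, 0.4, 0.4]  # hypothetical unnormalized reliabilities
by_sum = normalize_sum(raw)
by_max = normalize_max(raw)
```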
61. Bayesian Fusion
In the Bayesian framework the degrees of belief are represented by a priori, conditional, and a posteriori probabilities.
Usually, decisions are made on a posteriori probabilities $P(\theta_n \mid y_i)$, where $y_i$ is the input coming from source $i$,
$x_i = P(\theta_n \mid y_i)$ represents statistics of each source to be combined (data, outputs of classifiers).
Fusion is performed by the Bayesian rule, which under the condition of source independence is reduced to a product:
$F_n(x_1, \ldots, x_I)\big|_{y} = F_n(P)\big|_{y_i} = P(\theta_n) \prod_{i=1}^{I} \frac{P(\theta_n \mid y_i)}{P(\theta_n)}, \quad n = 1, \ldots, N$
This fusion operator is conjunctive and assumes total reliability of the sources
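A minimal sketch of the product rule for discrete hypotheses follows. The final renormalization is an addition of this sketch so that the output sums to 1; the prior and the per-source posteriors are hypothetical numbers.

```python
def bayesian_fusion(prior, posteriors):
    """Product-rule fusion under source independence.

    prior:      P(theta_n) for each hypothesis n
    posteriors: one list per source i holding P(theta_n | y_i)
    Returns the fused distribution, renormalized to sum to 1.
    """
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]

prior = [0.5, 0.5]
fused = bayesian_fusion(prior, [[0.7, 0.3], [0.6, 0.4]])

# Conjunctive behaviour: one source assigning zero probability vetoes
# the hypothesis no matter what the other source says.
veto = bayesian_fusion(prior, [[0.0, 1.0], [0.9, 0.1]])
```

The veto case shows why this operator assumes total reliability: a single wrong but confident source overrides all others.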
62. Weighted Average
If the sources are not totally reliable, several fusion rules within the framework of probability theory have been proposed in the literature
A majority of the weighted average methods are based on consensus theory, which involves general procedures for combining single-source probability distributions, while decisions are based on Bayesian decision theory
$F_n(x_1, \ldots, x_I, R_1, \ldots, R_I)\big|_{y_i} = F_n(P, R)\big|_{y_i} = \sum_{i=1}^{I} P(\theta_n \mid y_i)\, R_i$
where $R_i$ is the reliability associated with source $i$ in the global membership function, expressing quantitatively the goodness of each source
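The weighted average can be sketched as below, assuming reliabilities that already sum to 1. The posteriors and weights are hypothetical, and the final decision simply picks the hypothesis with the largest fused value.

```python
def weighted_average_fusion(posteriors, reliabilities):
    """Fused value for hypothesis n: sum over sources of R_i * P(theta_n | y_i)."""
    n_hyp = len(posteriors[0])
    return [
        sum(r * post[n] for post, r in zip(posteriors, reliabilities))
        for n in range(n_hyp)
    ]

posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # one row per source
reliabilities = [0.5, 0.25, 0.25]                  # assumed to sum to 1
fused = weighted_average_fusion(posteriors, reliabilities)
decision = max(range(len(fused)), key=fused.__getitem__)
```

Unlike the product rule, a source that assigns zero probability to a hypothesis only lowers the fused value in proportion to its reliability.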
63. Incorporation of Contextual Information
This method integrates contextual information
The method is based on the fact that, in a given context, only a subset J of a set N of all sources to be combined is valid or reliable (i.e. their belief model adequately represents reality)
$F_n(x_1, \ldots, x_I, R_1, \ldots, R_I)\big|_{y} = \sum_{J} P(\theta_n \mid y_1, \ldots, y_I, A_J)\, P(A_J)$
where $P(A_J)$ is the probability of validity of the subset J of inputs. This probability is calculated from the reliabilities $R_i$ of the individual inputs
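The sum over subsets can be sketched for a small number of sources. Two modelling choices below are assumptions of this sketch, not statements from the slides: each source is taken to be valid independently with probability $R_i$, so that $P(A_J) = \prod_{i \in J} R_i \prod_{i \notin J} (1 - R_i)$, and the posterior given $A_J$ is computed with the product rule over the sources in J only.

```python
from itertools import combinations

def product_posterior(prior, posteriors):
    """Product-rule posterior from the given sources (the prior if none)."""
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]

def contextual_fusion(prior, posteriors, reliabilities):
    """Average the subset-conditional posteriors, weighted by P(A_J)."""
    n_src = len(posteriors)
    fused = [0.0] * len(prior)
    for size in range(n_src + 1):
        for subset in combinations(range(n_src), size):
            # Assumed independence: P(A_J) is a product over all sources.
            p_valid = 1.0
            for i in range(n_src):
                r = reliabilities[i]
                p_valid *= r if i in subset else 1.0 - r
            cond = product_posterior(prior, [posteriors[i] for i in subset])
            fused = [f + p_valid * c for f, c in zip(fused, cond)]
    return fused

prior = [0.5, 0.5]
posteriors = [[0.9, 0.1], [0.2, 0.8]]
only_first = contextual_fusion(prior, posteriors, [1.0, 0.0])
mixed = contextual_fusion(prior, posteriors, [0.8, 0.5])
```

The enumeration is exponential in the number of sources, so this form is only practical for a handful of inputs.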
64. Biographical and Biometric fusion for Person Identification
Many modern data repositories record both biographical and biometric
information
Motor Vehicle Licensing Authority, Passport, Identity cards etc.
Unique Identification number (www.uidai.gov.in)
Fusing information from multiple sources brings value in
Data integration: Creating a single view of citizen, person, customer
Identification of the person using biometric information and biographical information
Scaling person identification for a large number of customer records
– Biographical data is abundant, easy to match, scales to millions of records but can be noisy and uncertain.
– Biometric data is noise free and gives high precision for identification but does not scale to a large number of records
– Both streams contain complementary information which can be exploited by fusing them together
Fusion for Person Identification can be done at two levels
– Decision fusion: Each matcher provides a decision; the decisions are then fused to produce the final decision.
– Score fusion: Each matcher provides a score; the scores are combined into a fused score used for decision making.
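The difference between the two levels can be seen on a toy example. The scores, weights, thresholds, and the use of a majority vote and a weighted sum are all hypothetical choices made for this sketch.

```python
def decision_fusion(decisions):
    """Majority vote over per-matcher accept/reject decisions."""
    return sum(decisions) > len(decisions) / 2

def score_fusion(scores, weights, threshold):
    """Weighted sum of per-matcher scores, thresholded once at the end."""
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused, fused >= threshold

# Hypothetical match scores: fingerprint, name, address.
scores = [0.95, 0.45, 0.40]

# Decision level: each matcher thresholds its own score first.
per_matcher = [s >= 0.5 for s in scores]
by_decision = decision_fusion(per_matcher)

# Score level: combine the raw scores, then decide.
fused, by_score = score_fusion(scores, [0.6, 0.2, 0.2], 0.5)
```

In this example the strong biometric score is outvoted at the decision level but carries the match at the score level, because score fusion keeps the margin information that per-matcher thresholding discards.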
65. Score Fusion using Biometric and Biographical matcher
Consider M matchers operating on a database containing
N records which have both biographical and biometric
information.
For query q, if all the records are equally likely for the identifier, then the posterior of the score given the records is given by
There can be multiple biometric as well as biographical matchers
Each query q will generate N x M scores i.e. M dimensional
scores for N records
[Figure: genuine match score density]
We model the scores as being generated from a
probability distribution.
Score is fused using a joint distribution from different sources
The probability distribution under reasonable assumption
is the posterior distribution of scores given a query
The genuine and imposter match scores are assumed to be
identically distributed
The posterior distribution is modeled as a Gaussian mixture
model.
The model is built for both genuine match distribution
and imposter distribution
The query is assigned an identity of n0 only if
Models are learnt from training data.
The algorithm is
Which simplifies to
[Figure: imposter match score density]
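The general idea of density-based score fusion can be sketched as follows. This is not the authors' algorithm: a single diagonal Gaussian per class stands in for the Gaussian mixture model, the decision rule (highest log-likelihood ratio above a threshold) is an assumption, and all scores are made-up numbers.

```python
import math

def fit_gaussian(samples):
    """Per-dimension mean and variance of M-dimensional score vectors."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, 1e-6)
        for d in range(dims)
    ]
    return means, variances

def log_density(score, model):
    """Log density of a score vector under a diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(score, means, variances)
    )

def identify(score_matrix, genuine, imposter, threshold=0.0):
    """Index of the record with the highest log-likelihood ratio,
    or None when even the best ratio does not exceed the threshold."""
    ratios = [
        log_density(s, genuine) - log_density(s, imposter)
        for s in score_matrix
    ]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best if ratios[best] > threshold else None

# Hypothetical training scores for M = 2 matchers.
genuine = fit_gaussian([[0.9, 0.8], [0.8, 0.9], [0.85, 0.7]])
imposter = fit_gaussian([[0.2, 0.3], [0.1, 0.4], [0.3, 0.2]])

# One query scored against N = 3 records (an N x M score matrix).
query_scores = [[0.15, 0.35], [0.88, 0.82], [0.4, 0.3]]
match = identify(query_scores, genuine, imposter)
no_match = identify([[0.2, 0.3]], genuine, imposter)
```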
66. Results
DataSets
Biometrics: NIST Dataset consisting of match scores of right
and left index finger
Biographical : Electoral records of citizens in an emerging
economy
Consists of Names and Address
Total of 6000 people were associated with the biometrics and
the biographical data.
Here M = 4, 2: Biometric, 2: Biographical (Name & Address)
Experimental Setup
Half of the dataset was used for training: the probability densities of both the imposter and genuine match score distributions were estimated from it
The number of Gaussian components was 5
The remaining records were used for testing.
Experimental Results
Score is fused using a joint distribution from these four different sources
The name modality has the lowest accuracy, whereas the biometric modality has high accuracy
The fused accuracy is much higher than that of the individual modalities
The accuracy increases when all the modalities are combined, thus validating the usefulness of fusion
[Figure: Accuracy for different modalities]
[Figure: Identification accuracy for fusion of modalities]
67. Social listening for monitoring the Philippine general elections 2013
• Online and offline analysis of social media messages around election debates and
election chatter for ABS‐CBN TV Channel
• Analysis of English and Filipino chatter to determine buzz and reaction on candidates,
campaigns, parties, topics and events
• Analysis of over 6 million election related Twitter and Facebook posts
• Comparison with Pulse Asia Election Survey
Real time and offline monitoring of social media conversations about parties and candidates
[Chart for POE, GRACE: x-axis Mar 08 to Mar 14, y-axis 0% to 50%, annotated at Mar 13]
Positive and negative sentiments for candidates
Grace Poe released her TV ad which drew flak from viewers. This was also the time that 3 candidates (Legarda, Poe, Escudero) of the Liberal Party who were also "guest" candidates of UNA were dropped by UNA as the President forbade them to attend UNA's soirees. Escudero felt really, really bad about being dropped by UNA (led by former president Estrada). Grace Poe offered to mediate between Escudero and Estrada.
ZUBIRI, MIGZ (UNA)
VILLAR,CYNTHIA HANEPBUHAY (NP)
VILLANUEVA, BRO.EDDIE (BP)
TRILLANES, ANTONIO IV (NP)
SEÑERES, CHRISTIAN (DPP)
POE, GRACE
PENSON, RICARDO
MAGSAYSAY, RAMON JR. (LP)
MAGSAYSAY, MITOS (UNA)
MADRIGAL, JAMBY (LP)
MACEDA, MANONG ERNIE (UNA)
LLASOS, MARWIL (KPTRAN)
LEGARDA, LOREN (NPC)
HONTIVEROS, RISA (AKBAYAN)
HONASAN, GRINGO (UNA)
HAGEDORN, ED
FALCONE, BAL (DPP)
ESCUDERO, CHIZ
ENRILE, JUAN PONCE JR.(NPC)
EJERCITO ESTRADA, JV (UNA)
DELOS REYES,JC (KPTRAN)
DAVID, LITO (KPTRAN)
COJUANGCO, TINGTING (UNA)
CAYETANO, ALAN PETER (NP)
CASIÑO, TEDDY
BINAY, NANCY (UNA)
BELGICA, GRECO (DPP)
AQUINO, BENIGNO BAM (LP)
ANGARA, EDGARDO (LDP)
ALCANTARA, SAMSON (SJS)
[Bar chart: one bar per candidate listed above; x-axis from 0% to 100%]
68. Worldwide leads: Intent to buy, relocation to India
Venu Nair: Male, Atlanta,
USA: Looking for good
investment in Indian real
estate market
Kiran Singh: Female, IT professional, Gurgaon: Any good 2 BR in Sohna Road?
[Figure: lead counts shown on the map: 5, 12, 14, 2]
Data from Dec 3-4, 2013
70. Crowd sensing
• The “power of the crowd” – a lot of information in a timely manner from everywhere
• People already use the social media to share public safety and law enforcement information
• Gain deep situational awareness
• Enable proactive actions by augmenting traditional law enforcement methods
[Diagram labels: Emergencies, call for help; Police Monitoring; Limited coverage; Crowd sensors; Analytics and fusion in near-real-time; Rich events & KPIs]
71. Drinking in the Open
Come to South City 2, in evening, its a regular scene there since last 4
years, people drink in open and food is served by restaurants in their
cars
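khandsa road per sunrise hospital se aage tekho ke pass rehari waale sharab pilaate hai, jinki wajah se waha aane jaane wale log pareshaan ho rahe hai even shaam ko to PCR ka bhi unhe darr ni hai, kirpa karke inhe waha se hataiya Gurgaon Police
(English: On Khandsa Road, past Sunrise Hospital, street-cart vendors near the liquor shops serve alcohol, which troubles people passing by; in the evening they are not even afraid of the PCR. Please remove them from there, Gurgaon Police.)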
I also have a complaint to register. We have an alcohal drinking
menace in front of our commercial complex, anand ganga comlex, at
sohna chowk, on the main road.
Police Harassment
These two Constables (Davinder Singh & his Colleague) were at their
worst behaviour...when they found all documents ok in the Car. I
couldn't understand the reason for harrasment...opp
Wrong Parking
this is the main way from sadar bazar to bhuteshwar mandir. I dnt think y this road exist. It is the best place to park vehicles both the way are used to park vehicles no action have been taken from years. I think HUDA or MCG is not serious abt matter.
73. Conclusions
Noise is an unavoidable fact of real life communication
Communication meant for human consumption can be noisy for computers and vice versa
Due to ubiquitous sensors (GPS, Accelerometer), easy-to-use apps (Facebook, Twitter, YouTube), and higher internet connectivity, the key characteristics of raw data are changing.
This new data can be characterized by the 4Vs: Volume, Velocity, Variety and Veracity
For example, during a Football match, some people will Tweet about Goals, Penalties, etc., while in addition there may be other reports in news channels. The data describes the same event
Fusion should create a single object representation
Different sources may have different reliability, and it is necessary to account for this fact to avoid a decrease in the performance of fusion results
Reliability and context should be taken into account during fusion
74. Conclusions
Noise can be defined as any kind of difference in the surface
form of an electronic text from the intended, correct or original
text
Noise can be in the form of errors arising from uncertainty in language and communication and recognition errors