Big data veracity challenges

Big Data and Veracity Challenges
Text Mining Workshop, ISI Kolkata

L. Venkata Subramaniam
IBM Research India
Jan 8, 2014


The Four Dimensions of Big Data
Volume: Data at Rest
Terabytes to exabytes of existing data to process

Velocity: Data in Motion
Streaming data, milliseconds to seconds to respond

Variety: Data in Many Forms
Structured, unstructured, text, multimedia

Veracity*: Data in Doubt
Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations

* Truthfulness, accuracy or precision, correctness

We’ve Moved into a New Era of
Computing !

In order to realize new
opportunities, you need to think
beyond traditional sources of data

The term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools

12+ terabytes of Tweets created daily.
100’s of different types of data.
5+ million trade events per second.
Only 1 in 3 decision makers trust their information.

Transactional & Application Data
Machine Data
Social Data
Enterprise Content

Volume, Velocity, Variety, Veracity

• Volume
• Velocity
• Variety
• Veracity
• Structured
• Semi-structured
• Highly unstructured
• Ingestion
• Throughput

Volume is growing so are Veracity issues
By 2015, 80% of all available data will be uncertain.
By 2015, the number of networked devices will be double the entire global population. All sensor data has uncertainty.
The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content.
Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.

[Figure: Global Data Volume in Exabytes (0 to 9000) and Aggregate Uncertainty % (0 to 100), plotted for 2005, 2010 and 2015. Multiple sources: IDC, Cisco]

What is Big Data? Big Data applies to information that can’t be
processed or analyzed using traditional processes or tools

Telco Profiles

Data Volume, Velocity, Variety

Call Detail
Records
Market
Trends

Smart Grid
Smarter
Weather
Cities
Sensor Modeling
Data
Smarter
Smarter
Traffic
Water

Portfolio
Risk

Market Feeds
Credit Card
Transactions

Medical
Transcription

Electronic Data
Interchange
CRM
Customer
Records

Traditional Data & Processing
Precise, authoritative,
well formed

Text, Audio,
Video
Contact
Centers

Retail

Fraud

SWIFT
Account
Management

Homeland
Security

Uncertainty
(1/veracity)

Disease
Progression

Patient
Records
Predictive
Modeling
of Outcomes

Social Network
Data
Services

Data Uncertainty at Scale
Inconsistent, imprecise, uncertain, unverified, spontaneous, ambiguous, deceptive

Social media users in India
[Figure: bar chart, India (No. of Users In Million), vertical axis 0 to 50, for the platforms Facebook, Twitter, Linkedin, Youtube and Google Plus. Value labels shown on the chart: 45 Million, 15 Million, 15 Million]

Veracity issues arise due to:
Process Uncertainty
Processes contain “randomness”
Examples: Uncertain travel times, Semiconductor yield

Data Uncertainty
Data input is uncertain
Examples: Text Entry (Intended Spelling vs. Actual Spelling), GPS Uncertainty, Testimony, Rumors, Contaminated?, Ambiguity {Paris Airport}, Conflicting Data {John Smith, Dallas} {John Smith, Kansas}

Model Uncertainty
All modeling is approximate
Examples: Fitting a curve to data, Forecasting a hurricane (www.noaa.gov)

Up to 10,000 times more noisy

Big Data, Fast Data, Noisy Data

Social Media Communication is
meant for Friends
30% world population
on the internet and
increasing fast

Type of Text | WER
SMS (texting) | 50%
Tweets | 35%
ASR | 30%
Web queries | 15%
OCR | 5%
Newswire Text (WSJ, Reuters, NYT) | 0.005%

55 million Tweets per day

Lead Generation, Disaster Tracking
Large Dimensional, uncertain, unverified

I’ll see ya tomo
RIP Jackson
I’m lookie out 4 a car 2 burn rubber on the streets of LA
What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!

Noisy, Informal, Implicit and Contextual Conversations

There are more social networking accounts than people in the world

Social Networking
overtakes Search:
Facebook becomes the
most visited website
ahead of Google

Big Data: “More video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been airing 24/7 since 1948.”

SMS
0 there – there
1 aint – are not
2 no – no
3 doubt – doubt
4 there – there
5 hon – honey
6 im – I am
7 gonna – going
8 be – be
9 takin – taking
10 it – it
11 4 – for
12 life – life
13 u – You
14 wont – wont
15 b – be
16 rida – rid of
17 me – me
18 lol – laugh out loud
19 Ray – (NAME)

Texting Language: Over 50% of the words are
written in non standard ways
Spontaneous Language: Use of
slang, ungrammatical, no punctuations, no case
information
Mixing of Languages: Many SMS contain text in a
mix of two or more languages
Type of Noise | %
Deletion of Characters | 48%
Phonetic Substitution | 33%
Abbreviations | 5%
Dialectical Usage | 4%
Deletion of Words | 1.2%

52% of words were non standard in a sample of 101 SMSes (Contractor et al., 2010)

Speech Recognition
SPEAKER 1: windows thanks for calling and you can learn yes i don't mind it so then i went to
SPEAKER 2: well and ok bring the machine front end loaded with a standard um and that's um it's a desktop machine and i did that everything was working wonderfully um I went ahead connected into my my network um so i i changed my network settings to um to my home network so i i can you know it's showing me for my workroom um and then it is said it had to reboot in order for changes to take effect so i rebooted and now it's asking me for a password which i never i never said anything up
SPEAKER 1: ok just press the escape key i can doesn't do anything can you pull up so that i mean

Recognition Errors: 10-40% Word Error Rates
Spontaneous Language: Use of slang, use of fillers like um and ah, ungrammatical, false starts, no punctuations, no case information
Mixing of Languages: Contain words from two or more languages

Historical Text
Non Standard Spellings: No notion of the importance of
having a single spelling for each word. Letters would be
added or removed to ease line justification.
New words: New words, words that are variants of
present vocabulary words
Different Language Style: Different grammar, language model.
OCR: Character substitution errors, missed punctuations.

Baron et al. 2009

Emails, Blogs, Tweets, Online Chat,……

Chat Logs
[12:51:13 PM] Geetha: alrite
[12:52:01 PM] Richa: id has valid pw not expired
[12:52:49 PM] Geetha: can't get to theh site
[12:53:04 PM] Richa: network connection may be slow
[12:54:39 PM] Geetha: ok Im able to now
[12:54:53 PM] Richa: should I reset the password

What is Noisy Text?
Any kind of difference in the surface form of an electronic text
from the intended, correct or original text (Knoblock et al.,
2007)
Noise can be at the lexical level {b4, before, befour}
Resulting in substitution, insertion, deletion, transposition, run-on, and split.
Noise can be at morphological, syntactic, discourse level {I can hear u, I can hear you, I can here you}
Resulting in substitution, insertion, deletion, transposition of words and the introduction of out of vocabulary words.

Classifying Noise
Lexical Errors (Subramaniam et al., 2009)
Missing characters {before > bef}
Extra characters {raster > raaster}
Phonetic substitution {before > b4, late > l8}
Abbreviations {laugh out loud > lol, United Nations > UN}

Syntactical Errors (Kukich, 1992; Foster et al., 2007)
Missing word {What are the subjects? > What the subjects?}
Extra word {Was that in the summer? > Was that in the summer it?}
Real word spelling errors {She could not comprehend. > She could no comprehend.}
Agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}
Dialectical usage {I’m going to be there > I’m gonna b there}

Techniques for Automatically Detecting Lexical Errors (Kukich 92)
Nonword error detection
Efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon
Two approaches
N-gram
Look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings.
Good for identifying errors made by OCR devices
But unusual/foreign language valid words will be marked, and nice-looking mistakes will be marked valid
Dictionary based
Input string appears in a dictionary? If not, the string is flagged as a misspelled word.
But nearly two-thirds of the words in a dictionary did not appear in an eight million word corpus of New York Times text, and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study)
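The two nonword detection approaches can be sketched side by side. This is a minimal illustration, not the method surveyed by Kukich: the tiny lexicon, the trigram size, the boundary marker "#" and the function names are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_table(word_list, n=3):
    """Precompile an n-gram frequency table from a list of known words."""
    table = Counter()
    for word in word_list:
        table.update(char_ngrams(word, n))
    return table


def suspicious_ngrams(word, table, n=3, min_count=1):
    """N-gram approach: return the n-grams that are absent or rarer than min_count."""
    return [g for g in char_ngrams(word, n) if table[g] < min_count]


def is_nonword(word, lexicon):
    """Dictionary approach: flag any string that does not appear in the lexicon."""
    return word.lower() not in lexicon


# Hypothetical toy lexicon; a real table would be built from a large corpus.
LEXICON = {"before", "form", "from", "late", "customer", "scheme"}
TABLE = build_ngram_table(LEXICON)

print(suspicious_ngrams("bshjore", TABLE))  # includes the unseen trigram 'shj'
print(is_nonword("befour", LEXICON))        # flagged: not in the lexicon
print(is_nonword("form", LEXICON))          # a real-word typo for 'from' is not flagged
```

The last call shows the limitation noted for isolated-word methods: a typo that yields a valid word passes both checks.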

Techniques for automatically Detecting Incorrect (Syntax)
Grammar (Foster et al., 2007)
Efficient methods to detect word sequences that do not form a grammatical sentence
Three Approaches
N-gram
Classifies a sentence as ungrammatical if it contains an unusual part of speech sequence
Precision-grammar
Classifies a sentence using a parser and a broad-coverage hand-written grammar
Probabilistic-parsing
Finds sentences with parsing error

Quantifying Noise (Subramaniam et al., 2009)
Quantifying Lexical Errors {Before, b4, befour, befor, bfore}
Edit Distance
Good for measuring surface level deviation from original
Perplexity
Good for measuring deviation from underlying language structure at character level
Quantifying Semantic Errors {I came to LA yesterday. I am still jet lagged., Came la yester day still jetlagged, Came 2 LA ystrday stil jetl8d}
WER
Good for measuring real word errors (speech recognition errors)
Perplexity
Good for measuring deviation from “proper”
BLEU
Good for comparing a candidate translation against multiple reference translations
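Edit distance and WER are the same computation at two granularities: characters for lexical noise, words for recognition errors. A minimal sketch follows, assuming unit costs for insertion, deletion and substitution; the function names are illustrative.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(previous[j] + 1,          # delete s_item
                               current[j - 1] + 1,       # insert t_item
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


# Character-level deviation of the noisy variants from 'before'.
for variant in ["b4", "befour", "befor", "bfore"]:
    print(variant, edit_distance("before", variant))

# Word-level deviation of a noisy sentence from its clean form.
print(word_error_rate("I came to LA yesterday", "I came 2 LA ystrday"))
```

Note that "b4" is far from "before" by edit distance even though it is an obvious phonetic substitution, which is why surface measures alone do not capture texting noise well.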

Spelling Correction (Kukich, 1992)
Isolated Word Correction
Minimum edit distance techniques
Similarity key techniques
Probabilistic techniques
N-gram-based techniques
Rule-based techniques
Will not catch typos resulting in correctly spelled words {form, from}
Estimates put real word errors at 30% of all word errors
Context-Dependent Word Correction
Parsing
Language models
Can errors be ignored and still meaningful interpretation be done? {I am coming with you, I comes with you}

SMS Text Normalization
dis is n eg 4 txtin lang
This is an example for Texting language

Extreme corruption of words and sentences
Models for SMS language are lacking

Tomorrow never dies!!!
2moro (9)
tomoz (25)
tomoro (12)
tomrw (5)
tom (2)
tomra (2)
tomorrow (24)
tomora (4)
tomm (1)
tomo (3)
tomorow (3)
2mro (2)
morrow (1)
tomor (2)
tmorro (1)
moro (1)
Occurrence in a 1000 sms corpus

Finding Canonical Sets (Acharyya, 2009)
Learn mappings
costmer, castumar, kustamar, coustomber > customer

How can we do it in an unsupervised way ?
Find some invariant, that does not change in spite of corruptions
Buckets of context seem invariant!
<..Back Bucket...> sceam <..Front Bucket...>
sceam : sms(2) new(5) recharge(4) tel-provider(2) about(3)
<..Back Bucket...> scheme <..Front Bucket...>
scheme : sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1)
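One way to test whether two surface forms share an invariant context is to compare their buckets as count vectors. The sketch below uses cosine similarity, which is an assumption chosen for illustration and not necessarily the measure used by Acharyya (2009). The sceam and scheme counts are the ones shown on the slide; the third bucket is invented for contrast.

```python
import math


def cosine_similarity(bucket_a, bucket_b):
    """Cosine similarity between two context buckets (word -> co-occurrence count)."""
    shared = set(bucket_a) & set(bucket_b)
    dot = sum(bucket_a[w] * bucket_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bucket_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bucket_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Context counts copied from the slide.
SCEAM = {"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3}
SCHEME = {"sms": 4, "new": 2, "activate": 3, "tel-provider": 2,
          "about": 1, "recharge": 1}
# Hypothetical unrelated word, invented for this example only.
WEATHER = {"rain": 3, "today": 2, "forecast": 1}

print(round(cosine_similarity(SCEAM, SCHEME), 3))   # high: contexts overlap
print(round(cosine_similarity(SCEAM, WEATHER), 3))  # zero: no shared context
```

A corrupted form would then be mapped to the canonical word whose bucket is most similar, possibly combined with a surface similarity check.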

SMS Based FAQ Retrieval (Kothari et al., 2009)
SMS Question: how 2 actvate romng on me hanset

FAQ Database
Q: How do I activate Roaming
A: Dial *567*2# from your handset
Q: What are the rates for roaming within India
A: Roaming rates on prepaid connections are 60 Paise per minute

SMS Answer: Dial *567*2# from your handset

Goal is to find the Question Q* that best matches the SMS S

• A scoring function Score(Q) assigns a score to each question Q in the FAQ dataset. The score measures how closely the question matches the SMS string S.

FAQ Retrieval Problem Formulation
SMS is treated as a sequence of tokens S=s1,s2,…,sn
Let Θ denote the questions in the FAQ corpus where each
question Q ∈ Θ is treated as a set of tokens
Goal is to find the question Q* that best matches the SMS S

23

Method
For each token si, a list Li consisting of all terms from the dictionary that are variants of si are constructed. Variants are sorted in the descending order of their weight

This space is searched to find the closest matching FAQ question.
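The retrieval idea can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm of Kothari et al. (2009): the variant weight is taken from Python's difflib similarity ratio, the 0.7 threshold is arbitrary, and Score(Q) is assumed to be the sum, over SMS tokens, of the best-weighted variant that occurs in Q. The search is exhaustive rather than pruned.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def variant_weight(sms_token, term):
    """Assumed weight: surface similarity between an SMS token and a dictionary term."""
    return SequenceMatcher(None, sms_token, term).ratio()


def variant_lists(sms_tokens, dictionary, threshold=0.7):
    """For each token s_i build L_i: its variants sorted by descending weight."""
    lists = []
    for token in sms_tokens:
        variants = [(term, variant_weight(token, term)) for term in dictionary]
        variants = [(term, w) for term, w in variants if w >= threshold]
        variants.sort(key=lambda pair: pair[1], reverse=True)
        lists.append(variants)
    return lists


def score(question_tokens, lists):
    """Assumed Score(Q): sum of the best variant weight found in Q for each token."""
    return sum(max((w for term, w in variants if term in question_tokens),
                   default=0.0)
               for variants in lists)


def best_question(sms, questions):
    """Return the FAQ question Q* with the highest score for the SMS."""
    token_sets = {q: set(tokenize(q)) for q in questions}
    dictionary = set().union(*token_sets.values())
    lists = variant_lists(tokenize(sms), dictionary)
    return max(questions, key=lambda q: score(token_sets[q], lists))


FAQ_QUESTIONS = [
    "How do I activate Roaming",
    "What are the rates for roaming within India",
]
print(best_question("how 2 actvate romng on me hanset", FAQ_QUESTIONS))
```

With these two questions, "how", "actvate" and "romng" all find close variants in the first question, while only "romng" finds one in the second.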

Extracting Dialog Models (Negi et al., 2009)
Huge number of repetitive calls at contact centers
Building task oriented dialog systems
Task specific information – concepts, subtasks
Task structure - manual encoding
Using large amounts of human to human conversation data

Extracting dialogue models using human-to-human conversations

Example Conversation: Car Rental Domain

26

Overview
Transcribed Calls > Utterance Normalization > Normalized Calls > Mining of Subtasks > Subtasks > AIML Conversion > Chat-bot

Finding Patterns with Gaps
Need for patterns capturing variations in expressions
Have you rented a car from us before
Have you rented a car before
Have you rented a car from <Rent_Agency> before

Mining regular expression patterns over tokens or entity types
Each pattern represented as a token sequence
[rented car before]
Token sequences mined efficiently using extension of apriori algorithm
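A pattern with gaps is an ordered token sequence that may skip over intervening words. The sketch below checks such a pattern against utterances and counts its support; it covers only the matching step, not the apriori-style mining, and the function names are illustrative.

```python
def matches_with_gaps(pattern, utterance):
    """True if the pattern tokens occur in order in the utterance, gaps allowed."""
    tokens = iter(utterance.lower().split())
    # 'tok in tokens' consumes the iterator up to the first hit, so each
    # pattern token must be found after the previous one.
    return all(tok in tokens for tok in pattern)


def pattern_support(pattern, utterances):
    """Number of utterances that contain the pattern."""
    return sum(matches_with_gaps(pattern, u) for u in utterances)


UTTERANCES = [
    "Have you rented a car from us before",
    "Have you rented a car before",
    "Have you rented a car from <Rent_Agency> before",
]
PATTERN = ["rented", "car", "before"]

print(pattern_support(PATTERN, UTTERANCES))  # all three variants match
print(matches_with_gaps(PATTERN, "before you go have you rented a car"))  # order matters
```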

Association Analysis
Total number of possible itemsets is exponential ($2^N$)
Brute-force technique infeasible

Support filtering is necessary
• To eliminate spurious patterns
• To avoid exponential search

Support has anti-monotone property:
$X \subseteq Y$ implies $\sigma(Y) \le \sigma(X)$

Efficient algorithms have been designed to exhaustively find all itemsets/patterns with sufficiently high support

Given d items, there are $2^d$ possible candidate itemsets
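The anti-monotone property is what makes level-wise search practical: a k-itemset can only be frequent if all of its (k-1)-subsets are. A compact Apriori sketch follows; the market-basket transactions are a hypothetical example and do not come from the talk.

```python
from itertools import combinations


def support(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: support(c, transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune with the anti-monotone property: drop any candidate that has
        # an infrequent (k-1)-subset, without counting its support.
        level = {c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))}
    return frequent


# Hypothetical transactions, invented for this example.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, count in sorted(apriori(BASKETS, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With six distinct items there are 2^6 = 64 possible itemsets, but only a small fraction is ever counted.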

Utterance Normalization
Identify concepts
Named Entity Annotation
Rule based annotator for annotations such as location, date, car model, and amount
“I want to pick it up from <location> on <date>”
Grouping of utterances
Find patterns with gaps and represent each utterance by them along with unigrams and bi-grams
Agent and customer utterances are clustered separately using an off-the-shelf clustering algorithm

Finding Subtasks and ordering
Customer and agents engage in
similar kinds of interactions to
accomplish an objective
Represent each call with agent
utterance and customer utterance
cluster labels
Subtasks
Patterns of cluster labels
(agents) with possible gaps
Lot of variability in customer
utterances
Vertical pattern mining

C1

C1

C2

C3

C3
Cn

31

Subtask Preconditions
Utterance pre-conditions
Customer utterances that indicate start of a subtask
“please make this booking” for “make payment” subtask
Frequent features from customer utterances
Flow pre-conditions
Only logical orders of subtasks are allowed
“make payment” subtask cannot be executed unless “gather pick-up information” subtask has been executed.
Collection of all the subtasks that precede the subtask

Data Fusion
Problem

Given multiple data points about an entity, create a single object representation while resolving conflicting data values

Difficulties

Null values: Subsumption and complementation
Contradictions in data values
Uncertainty & truth: Discover the true value and model uncertainty in this process
Metadata: Preferences, recency, correctness
Lineage: Keep original values and their origin
Implementation in DBMS: SQL, extended SQL, UDFs, etc.

34

360 Context
Analyze social data in the context of enterprise data to build entity and event profiles
and establish linkages between them for online and offline analysis
Entity (people, products, events) Insights
The problem: What are the key product interests of person A?
Solution: Over time learn about the person’s product interests from her social media postings

The problem: What is the location and trajectory of person B?
Solution: Gives the current location and locations in the past

The problem: What life events happened in person A’s life in the past x months?
Solution: List significant events like marriage, birth of a child, relocation, etc.

The problem: What are the events of interest happening in a given location?
Solution: Lists the top events in a given geography

The problem: What is the sentiment on a given product?
Solution: Gives the sentiment on a product

Understand customers wants and needs better
Key Sustained Value Factor:

intent to
purchase for
customers

Social Data

Smarter
Commerce

real-time public
safety events

Enterprise
Databases

User
Domains
What MDM 360 does?

propensities/
sentiment/intent
•
event Detection
•
entity Linkages
•
sentiment

core customer
view/transactions
•
event Profiles
•
entity Profiles

Smarter
Cities

Application
Domains

Builds an entity’s complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.

Extraction Challenges: Stages of intent

Stage | Example
Wishing for an event | “I just want to graduate, get a job, get a car, and live with my boyfriend”
Anticipating an event | “Im getting a car for graduation yay!!!!!”
During an event | “At disneyworld :D”
Post event / continuous state | “Apparently I got a raise at work three months ago and didn't know? Sweeeeeeeeeet”
Hobby | “Loves to fish, travel and frequent concerts. Down to earth, athletic, professional 40 and single. Loves the outdoors, working out, travel and younger fit guys for dating.”

Extraction Challenges: Detecting filtering conditions
Filter | Example
Spam | “Need a New #Credit Card for your #Business or online #Ebay store? Compare and Apply Online. http://retweet.it/r/We0iai”
Sarcasm, jokes | “I thought I was having a stroke this afternoon but it turns out it was too many Starbucks Refreshers plus my leg falling asleep.”
Resolve ambiguous meaning | “In the words of @LNSmooth23 I'm retiring from the nightlife”
Non-personal | “My mom is buying a house, but why in Willingboro”

360-degree Profiles from Social Media
Personal Attributes
Event Detection

• Identifiers: what, where, when…..
• Attributes: severity, urgency…

Social Media based
360-degree
Event and Individual
Profiles

Timely Insights on Events
• Event Detection
• Public Safety Events
• Plans for public disturbances
• Sentiment around events
• Citizen sentiment

• Identifiers: name, address, age, gender,
occupation…
• Interests: sports, pets, cuisine…
• Life Cycle Status: marital, parental
• Relationships: family, friends, co-workers, work
and interest network

Timely Insights on
Individuals

• Intent to participate in public events
• Instigation for causing public damage
• Sentiment on events, govt policies
• Current Location
• Hate messages

Personal Interests

• Personal preferences or political leanings
• Activity History

Intent
We must support the movement, I am going to the rally at Jantar
Mantar tomorrow
Anna Hazare has a point when he says politicians are corrupt and
need to be taught a lesson. The rally starts at 10.

Public Safety Events
Mamta Deedi is also joining Anna, Ramdev, n Kejriwal. She is going to do
Anshan at Delhi Jantar mantar. Ye sab public ki kahke le rahe hain.
So Its Mamata's day out tomorrow at #JantarMantar. #Rally.

Location announcements
I'm at Karir Square http://4sq.com/fYReSj

38

More data: Customer intent extracted from social media provides context
Go for the best, DP2000

Buying a DSLR today!

Thrza gr8 deal on ZX-550 @ the mall

Prior Business
Social
Transactions
Data

Entity
Extraction, Fact
Discovery, Intent &
Sentiment

Influencers

Intent

450M+ tweets/day
Millions of tweets yield one company-specific fact
Customer ready to buy a DSLR camera today, possibly at a nearby mall

Michael’s online friends offer lots of advice

Text Analytics used to extract intent from Social Media
Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe

Wifey’s birthday tomorrow, looking for a killer dslr

Sarcasm,
Wishful Thinking
Potential
Locations and
Activity

Maybe I should buy her that purple
roadster, while I’m at it. ;-) lol

Intent to Purchase,
Gift Type?

In NYC area this w/e, any good malls
nearby?

Region & City Location,
Timeframe, Intent to Shop

Resultant fact base contains billions of facts, and is incrementally updated
Fact segmentation or clustering is rapid enough to drive a business decision


Matching Twitter profiles to Corporate Data
• Linking Social Media profiles with Employee database

• Several extensions are possible, for example, linking with Citizens and Security databases

Social media profiles
(name, address, gender, age, employment, relationship, …)

Employment
filter

Social media profiles
of IBM employees
and their network

Twitter: 45M profiles
Name: first, last
Home location: city, (state), country
Employment: company + role

Employee Directory: 460K entries
Name: (first, middle, last, preferred)
Work location: (city, state, zip, country)
Job description

Name, work location, job description

Current Demo focused on Name and Location matching, as well as EmployeeOf information

Choice of social media profile attributes for linking constrained by availability of IBM BluePage attributes

Resolution

Semantic Name Variations
Bill Chamberlin vs. Chamberlain, William H.
C. Mohan vs. Mohan Chandrasekaran (Mohan)

Geo Proximity
Saratoga, CA vs. San Jose, CA
New Jersey vs. New York

Job Role Disambiguation
“Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”

Example Result

• Semantic name variations: Twitter name is a close variation of the IBM names
• Geo Proximity: Work locations are within 25mi of the Twitter location
• Job Role Disambiguation : description in Twitter profile matches HR role
41

Common Data Problems
• Lack of information standards

Ashok Kumar
A Kumar

• Data misplaced in the database

Four sixteen Street 8, Anand Niketan, Delhi

Mr. Ashok Kr

#416 Anand Niketan, N Delhi, 21

110021

• Different formats & structures across
different systems

Data surprises in individual
fields

416 Anand Niketan, New Delhi, India 110021

Email | Tax ID | Telephone
91,,,, | 228-02-1975 | 6173380300
ranivrgeoi@yahoo.co.in | 025-37-1888 | 415-392-2000
,CYRUS_DASTUR@HOTMAIL.COM | 34-2671434 | 3380321
HP 15 State St. | 508-466-1200 | Orlando
• Special characters in the data

• The redundancy nightmare
• Duplicate records with a lack of
standards

90328574
90328575
01456
90238495
90233479
90233489
90345672

IBM
I.B.M. Inc.

187 N.Pk. Str. Salem NH 01456
187 N.Pk. St. Salem NH

Int. Bus. Machines
International Bus. M.
Inter-Nation Consults
I.B. Manufacturing

187 No. Park St Salem NH 04156
187 Park Ave Salem NH 04156
15 Main Street Andover MA 02341
Park Blvd. Bostno MA 04106

42

Regional variations in Addresses across
India
Addresses in different regions contain words of the local language even when the
addresses are written in English
Ex : The commonly used word to describe a street type is “Gali” in Northern
India whereas “Beedhi/Veedhi” is the commonly used term in Southern India
Street Intersections and Street Information containing multiple Street Type Identifiers
like Cross and Main are extensively found in the Southern Indian regions
Ex : “3rd Main, 4th B Cross”
Sector and Pocket Information are found primarily in North Indian Addresses
Ex : “Sector 5, Pocket 2A 2nd Block”
Regional differences in writing addresses necessitate bifurcation of standardization
rules based on regions.

44

Investigating the Data
Take the Example: 123 St. Virginia St.

Parsing:
Separates multi-valued fields into individual pieces
123 | St. | Virginia | St.

Lexical Analysis:
Determines business significance of individual pieces
123 (Number) | St. (Street Type) | Virginia (Alpha) | St. (Street Type)

Context Sensitive:
Identifies various data structures and content
123 (Number) | St. Virginia (Street Name) | St. (Street Type)

“The instructions for handling the data are inherent within the data itself.”

Sample Standardized Output
Sample Address Input:
“SANT KRUPA BUILDING, 2ND FLOOR, CHHEDA RD, NR S V JOSHI HIGH SCHOOL, DOMBIVALI (E), THANE. INDIA.”
Standardization Output:
DoorNo | 20
Floor Value | 2nd FLOOR
Building Name | SANT KRUPA
Building Type | BUILDING
Street Name | CHHEDA
Street Type | ROAD
Landmark Position | NEAR
Landmark | S V JOSHI HIGH SCHOOL
Area | DOMBIVALLI EAST
City | THANE
District | THANE
State | MAHARASHTRA

Input Addresses vs Standardized Addresses
Sr.No 1
Input address: A38/91 KONIA . . VARANASI INDIA
Standardized address: A38/91 KONIA VARANASI VARANASI UTTARPRADESH INDIA
Highlights: Autopopulation of state

Sr.No 2
Input address: VILL BASUDEVPUR PO KHANJANCHAK DURGACHAK HALDIA TAMLUK INDIA
Standardized address: DURGACHAK ,HALDIA,VILLAGEBASUDEVPUR PO-KHANJANCHAK TAMLUK EAST MIDNAPORE WESTBENGAL INDIA
Highlights: Rural address handling

Sr.No 3
Input address: NEAR RAJGHAR GIRLS SCHOOL LACHIT NAGAR HOUSE NO 5 ULUBARI GUWAHATI ASSAM GUWAHATI INDIA
Standardized address: 5 NEAR RAJGHAR GIRLS SCHOOL ULUBARI LACHIT NAGAR GUWAHATI KAMRUP ASSAM INDIA
Highlights: Maintaining a standard format across addresses (house no precedes Landmark information)

Sr.No 4
Input address: 1/15, PREMJYOTI CO OP HSG SOC., RAMBAUG - 5, KALYAN (W), MAHARASHTRA 421301 BHIWANDI INDIA
Standardized address: 1/15 PREMJYOTI COOPERATIVE HOUSING SOCIETY,RAMBAUG 5 KALYAN WEST BHIWANDI THANE MAHARASHTRA 421301
Highlights: Standardization of tokens

Sr.No 5
Input address: 3/2,FIRINGI DANGA ROAD, P.O.MALLICKPARA SERAMPORE-3 CALCUTTA INDIA
Standardized address: 3/2,FIRINGI DANGA ROAD, SERAMPORE-3 P.O.MALLICKPARA KOLKATA WESTBENGAL INDIA
Highlights: Standardization of tokens

Two Methods to Decide a Match
Are these two records a match?

RHITU K KAZANGIAN 128 MAIN ST 02111 12/8/62
RITU KUMAR KAZANGIAN 128 MAINE RD 02110 12/8/62

Deterministic letter grades: B B A A B D B A = BBAABDBA
Probabilistic weights: +5 +2 +20 +3 +4 -1 +7 +9 = +49
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
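The probabilistic score for the example pair is just the sum of the per-field weights. The sketch below reproduces that arithmetic with the weights shown on the slide; the field labels attached to each weight and the two cutoff values are assumptions made for illustration.

```python
# Per-field weights from the slide, in the order shown. The field labels are
# assumed from the layout of the two example records.
FIELD_WEIGHTS = [
    ("first name", 5),     # RHITU vs RITU
    ("middle name", 2),    # K vs KUMAR
    ("last name", 20),     # KAZANGIAN vs KAZANGIAN
    ("house number", 3),   # 128 vs 128
    ("street name", 4),    # MAIN vs MAINE
    ("street type", -1),   # ST vs RD
    ("postal code", 7),    # 02111 vs 02110
    ("birth date", 9),     # 12/8/62 vs 12/8/62
]


def total_weight(field_weights):
    """Sum the per-field weights into one composite score."""
    return sum(weight for _, weight in field_weights)


def classify(score, match_cutoff, review_cutoff):
    """Compare the composite score with two (hypothetical) cutoffs."""
    if score >= match_cutoff:
        return "match"
    if score >= review_cutoff:
        return "review"
    return "non-match"


score = total_weight(FIELD_WEIGHTS)
print(score)
print(classify(score, match_cutoff=40, review_cutoff=20))
```

Disagreement on a low-information field (street type) subtracts little, while agreement on a rare surname adds a lot, so the pair still scores well above the assumed cutoff.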

A Closer Look at Probabilistic Matching
RHITU K KAZANGIAN 128 MAIN ST 02111 12/8/62
RITU KUMAR KAZANGIAN 128 MAINE RD 02110 12/8/62
+5 +2 +20 +3 +4 -1 +7 +9 = 49

The weighted score is a relative measure of the probability of a match; it expresses the amount of information content for all of the fields compared

The CUTOFF is the score above which good matches are found

[Figure: Histogram of Weights. Vertical axis: # of Pairs (0 to 4000); horizontal axis: weight (-50 to 60); two groups of pairs, UnMatched and Matched]

The Value of Information Content
Information Content is measured both at the field and at the field value level and is
calculated automatically
Discriminating Value represents the significance of one field versus another in
contributing to a match
For example a Gender Code contributes less information than a Tax-Id Number
Frequency represents the significance of one value in a field over another value
For example in a Last-Name Field, “SMITH” contributes less information than “ROUTZAHN”
Probabilistic Matching uses the automatically generated measures of Information Content to achieve the highest match rates possible utilizing a scientifically-justifiable methodology
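The slide does not give a formula for information content. A common choice in probabilistic record linkage is the Fellegi-Sunter log-odds weight, sketched below as an assumption: m is the probability that a field agrees for true matches, u the probability that it agrees by chance, and u for a specific value is approximated by its relative frequency. All counts and probabilities here are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Log-odds weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Log-odds weight (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def chance_agreement(value, value_counts):
    """Approximate u for one value by its relative frequency in the file."""
    return value_counts[value] / sum(value_counts.values())


# Hypothetical surname counts in a file of 100,000 records.
SURNAME_COUNTS = {"SMITH": 5000, "ROUTZAHN": 2, "(all others)": 94998}
M = 0.95  # assumed: true matches agree on the field 95% of the time

# Frequency: a rare value carries more information than a common one.
print(agreement_weight(M, chance_agreement("SMITH", SURNAME_COUNTS)))
print(agreement_weight(M, chance_agreement("ROUTZAHN", SURNAME_COUNTS)))

# Discriminating value: a two-valued gender code vs. a near-unique tax id.
print(agreement_weight(M, 0.5))
print(agreement_weight(M, 1e-6))
```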

Data Framework around the Individual
• Logins (User
credentials)
• Profile
• Expertise
• External & internal
unstructured data
linked to individuals
• Social activity

Big Data

Individual
Social
Presence
•
•
•
•

BLOG
Comment
Opinion “Like”s
Community

Individual
Credentials

• to Person
• Communities
• to Company:
Roles, History
• IBM Linkage

Individual
Core
Personal data:
• Name, Address
• Phone, eMail
• Behavioral
Preference /
permissions

Transactions
involving
the Individual
• Tech Support Call
• Opportunity & Orders
• Responses to Marketing
Campaigns

Relationships
with the
individual
Interactions
with the
Individual
•
•
•
•
•
•

Digital
Phone
eMail
F2F
Social
Web traffic

51

Analytics steps

Text Analytics
• Analyze and extract consumer
attributes from individual
messages

Intent

Entity
Integration
• Integrate information about a consumer within a single social media source over time

Entity
Resolution
• Link social media profiles with customer data

• Link and integrate information
about a consumer across
multiple social media sources

All I really want is the Disney
Visa card from chase with the castle on it

Life Events
Looks like we'll be moving to New Orleans
sooner than I thought.

Personal Attributes
I am a engineer, mom, and wife

Relationships

Social Profiles of
Consumers

Master Data on
Customers

In fact I'm looking forward to the new month. Both myself and the wife have our graduation ceremonies

52

Person Information across Documents
Who Is James Dimon?

Do these filings refer to the same person ?
variability in the person’s name, lack of a key identifier
supporting attributes vary depending on the context (form type)

All these facts need to be linked and integrated


Signatures

Biographies
Insider
Transactions

Committee
memberships

Entity & Relationship Analytics from
Big Data
Entity Views

Crawl

Extract /
Text
Analytics

Entity
Resolution
Map/Fuse
/Aggregate
Entities & Relationships:
Object-centric view

Unstructured
data sources

Untrusted View

Challenge
Construct and maintain comprehensive
profiles of entities and relationships
from unstructured data sources
Main Problem: Assemble an entity view, where each entity aggregates data from thousands of
different documents
Multiple stages of complex processing:
– Information extraction
• From each unstructured document, extract relevant structured records
– Entity resolution
• Link records (possibly across documents) that are about the same real-world “entity”
– Entity population: mapping / fusion / aggregation
• Collect all the facts about the same entity into one rich object with clean values and relationships to other entities

The Complete Entity View
Current purchase intentions
expressed by the consumer

Location-based information about a
consumer (where they plan to travel,
events they are going to attend)

Purchase history for a consumer

Life events (relocation, home
purchase, wedding, graduation)

Related people based on social
networking data

Comments/complaints expressed about
various products and services

Customer identity information (e.g.,
name, location) obtained from profiles
and content of posts

Micro-segmentation information about
individual consumers (e.g., gender, age
range, profession)

360-degree profile of a customer

City | State | Age Range | Gender | Marital Status | Number of kids | Employment Status | Occupation
Houston | TX | 30-39 | Female | ? | ? | Employed | Journalist
San Jose | CA | ? | Male | Married | 2 | Employed | Software Engineer
…

Aggregate attributes from multiple sources
Filter to obtain a segmentation
Analyze to obtain “Similar Populations”
Adding more input data gives better predictive power

Attribute fusion example: Inferring location from multiple clues
Metadata
Name: Tracy Guida
Screen name: @tracyguida
Location: Tampa
Description: just a Nor-Cal gal trying to fall in love with Florida

Social Media Profile

Screen name : @tracyguida
Location:
Tampa, FL
Name:
Tracy Guida

Disambiguation, fusion of
partial information

Permanent
location

Fusion libraries:
• Confidence:
metadata vs. content

Messages
Gotta love Florida football #hot #humid
http://instagr.am/p/QOHPqhKdYt/
Check out my blog about #food in #TampaBay
http://www.myothercitybythebay.com

Textual clues

Temporary location

I'm at Tracy's Seat At Micah's (Tampa, FL)
http://4sq.com/SZ4yjj
I'm at S.o.G (Tampa, Florida)
http://4sq.com/UDweM5

Check-ins

I'm at Eats American Grill (Tampa, FL)
http://4sq.com/O1a1Jm

Who's watching the #presidential #debate tonight?
(from 27.97989014,-82.54825406)

Fusion libraries:
• Confidence: place mentions vs. geo-codes
• Analysis of location time-series

Geo-located documents
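
A minimal sketch of confidence-weighted attribute fusion for the location example above. The confidence weights per evidence type are hypothetical (the talk only says that fusion libraries weigh metadata against content and place mentions against geo-codes), and the clue list is a hand-made reading of the sample profile and messages.

```python
from collections import defaultdict

# Hypothetical confidence per evidence type (assumption, not from the talk).
CONFIDENCE = {
    "profile_metadata": 0.6,
    "text_mention": 0.3,
    "check_in": 0.8,
    "geo_code": 0.9,
}


def fuse_location(clues):
    """Sum confidence per candidate location and return (best, normalized scores)."""
    totals = defaultdict(float)
    for location, evidence_type in clues:
        totals[location] += CONFIDENCE[evidence_type]
    norm = sum(totals.values())
    scores = {loc: weight / norm for loc, weight in totals.items()}
    best = max(scores, key=scores.get)
    return best, scores


clues = [
    ("Tampa, FL", "profile_metadata"),        # Location field of the profile
    ("Northern California", "text_mention"),  # "Nor-Cal gal" in the description
    ("Tampa, FL", "text_mention"),            # "#TampaBay" in a message
    ("Tampa, FL", "check_in"),
    ("Tampa, FL", "check_in"),
    ("Tampa, FL", "check_in"),
    ("Tampa, FL", "geo_code"),                # coordinates assumed to resolve to the Tampa area
]
best, scores = fuse_location(clues)
```

This sketch does not separate permanent from temporary location; that would need the time-series analysis of locations mentioned on the slide.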

56

The Reliability (Veracity) Challenge
$\Theta = \{\theta_1, \ldots, \theta_N\}$ - a set of hypotheses (frame of discernment, universe of discourse)
$\{x_n^i\}$ - probability, possibility, or belief in hypothesis $\theta_n$ according to source $i$
$\{O_i\}$ - input data (social media, enterprise information)
$F(x_1, \ldots, x_I)$ - fusion operator

[Diagram: Environment → inputs $\{O_1\}, \ldots, \{O_I\}$ → Source 1, ..., Source $I$ (each with a source belief model and source characteristics) → beliefs $\{x_1\}, \ldots, \{x_I\}$ → Fusion operator → $F(x_1, \ldots, x_I)$]

57

Typical Reliability Settings
It is possible to assign a numerical degree of reliability to each source
A subset of sources is reliable but we do not know which one
Reliabilities of the sources can be ordered but no precise reliability values
are known

Reliability also depends on context
During the Mumbai Mantralaya fire there were only a few tens of tweets about the event on Twitter
On the same day there was a match, and there were several thousand tweets saying “Miami on Fire”

58

Strategies for Utilizing Reliability
Strategies explicitly utilizing reliability of sources
Reliability is used to modify the beliefs of each model before fusion, and the transformed beliefs are then fused (separable case)
Strategies for modifying the fusion process to account for the reliability of the sources (non-separable case).

Strategies identifying reliability of data input to fusion processes
and eliminating the sources of poor reliability
Combination of strategies mentioned above
$F(x_1, \ldots, x_I) \rightarrow F_R(x_1, \ldots, x_I)$
$F_R$ is a context-dependent operator, which depends on the strategy selected and is defined within the framework used for uncertainty representation
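
The separable case can be sketched as follows. This is one common choice, stated here as an assumption rather than as the method of the talk: each source's distribution is blended with the uniform distribution in proportion to its unreliability, and the discounted beliefs are then fused with an ordinary normalized product.

```python
def discount(belief, reliability):
    """Blend a source's distribution with the uniform one: R * x + (1 - R) / N."""
    n = len(belief)
    return [reliability * p + (1.0 - reliability) / n for p in belief]


def product_fusion(beliefs):
    """Conjunctive fusion: normalized element-wise product of the distributions."""
    fused = [1.0] * len(beliefs[0])
    for belief in beliefs:
        fused = [f * p for f, p in zip(fused, belief)]
    total = sum(fused)
    return [f / total for f in fused]


def separable_fusion(beliefs, reliabilities):
    """Discount every belief first, then apply the unchanged fusion operator."""
    discounted = [discount(x, r) for x, r in zip(beliefs, reliabilities)]
    return product_fusion(discounted)


# Source 1 is fully reliable, source 2 is fully unreliable (hypothetical values).
fused = separable_fusion([[0.9, 0.1], [0.2, 0.8]], [1.0, 0.0])
```

With a reliability of 0 the second source collapses to the uniform distribution and no longer influences the result, so the fused belief equals the first source's belief.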

59

Reliability Coefficients
Reliability coefficients represent trust in each belief model. They introduce a second level of uncertainty and represent a measure of the adequacy of the model used, the reality of the environment, and source characteristics
$R_i = R_i(M_i, \gamma, \Upsilon)$ - reliability of source $i$ (reliability of source $i$ and hypothesis $j$: $R_{ij}$)
$M_i$ - model of source $i$
$\gamma$ - parameters characterizing the external environment (context)
$\Upsilon$ - parameters characterizing the internal environment of source $i$ (tuning parameters)
Relative reliability: $\sum_{i=1}^{I} R_i = 1$
May be replaced with $\max_i R_i = 1$
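
The two normalizations of relative reliability can be illustrated with a short sketch. The raw trust scores are hypothetical values on an arbitrary scale.

```python
def normalize_sum(reliabilities):
    """Scale the coefficients so that they sum to 1."""
    total = sum(reliabilities)
    return [r / total for r in reliabilities]


def normalize_max(reliabilities):
    """Scale the coefficients so that the most reliable source gets 1."""
    top = max(reliabilities)
    return [r / top for r in reliabilities]


raw = [2.0, 6.0, 8.0]  # hypothetical raw trust scores for three sources
by_sum = normalize_sum(raw)
by_max = normalize_max(raw)
```

Sum normalization suits weighted averaging, where the coefficients act as mixture weights; max normalization keeps the best source undiscounted.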
60

Bayesian Fusion
In the Bayesian framework the degrees of belief are represented by a priori, conditional, and a posteriori probabilities.
Usually, decisions are made on the a posteriori probabilities $P(\theta_n \mid y_i)$, where $y_i$ is the input coming from source $i$
$x_i = P(\theta_n \mid y_i)$ represents the statistics of each source to be combined (data, outputs of classifiers).

Fusion is performed by the Bayesian rule, which under the condition of source independence reduces to a product:
$F_n(x_1, \ldots, x_I)\,|\,y = F_n(P)\,|\,y_i = P(\theta_n) \prod_{i=1}^{I} \frac{P(\theta_n \mid y_i)}{P(\theta_n)}, \quad \forall n$

This fusion operator is conjunctive and assumes total reliability of the sources
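
A minimal sketch of the product rule above. The final normalization over hypotheses is an addition (the product is only proportional to a probability), and the prior and posteriors are hypothetical numbers.

```python
def bayesian_fusion(prior, posteriors):
    """Compute P(theta_n) * prod_i [P(theta_n | y_i) / P(theta_n)], normalized over n."""
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]


prior = [0.5, 0.5]                      # P(theta_1), P(theta_2)
posteriors = [[0.8, 0.2], [0.6, 0.4]]   # P(theta_n | y_i) for two sources
fused = bayesian_fusion(prior, posteriors)
```

Two sources that both lean towards the first hypothesis reinforce each other: the fused belief in it (6/7) is higher than either source's own belief, which is the conjunctive behaviour noted on the slide.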
61

Weighted Average
If the sources are not totally reliable, several fusion rules within the framework of probability theory have been proposed in the literature
A majority of the weighted average methods are based on consensus theory, which involves general procedures for combining single-source probability distributions, while decisions are based on Bayesian decision theory
$F_n(x_1, \ldots, x_I, R_1, \ldots, R_I)\,|\,y_i = F_n(P, R)\,|\,y_i = \sum_{i=1}^{I} P(\theta_n \mid y_i)\, R_i$
where $R_i$ is the reliability associated with source $i$ in the global membership function, expressing quantitatively the goodness of each source
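
The weighted average rule can be sketched directly. The numbers are hypothetical, and the reliabilities are assumed to be normalized so that they sum to 1, which keeps the result a probability distribution.

```python
def weighted_average_fusion(posteriors, reliabilities):
    """Compute sum_i R_i * P(theta_n | y_i) for every hypothesis n."""
    n_hyp = len(posteriors[0])
    return [
        sum(r * post[n] for post, r in zip(posteriors, reliabilities))
        for n in range(n_hyp)
    ]


posteriors = [[0.8, 0.2], [0.6, 0.4]]   # P(theta_n | y_i) for two sources
reliabilities = [0.75, 0.25]            # relative reliabilities, summing to 1
fused = weighted_average_fusion(posteriors, reliabilities)
```

Unlike the product rule, the average never leaves the range spanned by the individual sources, so an unreliable source can dilute the result but cannot veto a hypothesis.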

62

Incorporation of Contextual
Information
This method integrates contextual information
The method is based on the fact that, in a given context, only a subset $J$ of the set of all sources to be combined is valid or reliable (i.e. their belief models adequately represent reality)
$F_n(x_1, \ldots, x_I, R_1, \ldots, R_I)\,|\,y = \sum_{J} P(\theta_n \mid y_1, \ldots, y_I, A_J)\, P(A_J)$
where $P(A_J)$ is the probability of validity of the subset $J$ of inputs. This probability is computed from the reliabilities $R_i$ of the individual inputs
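
A sketch of the sum over valid subsets, under two assumptions that are not stated in the talk: sources are valid independently of each other, so $P(A_J) = \prod_{i \in J} R_i \prod_{i \notin J} (1 - R_i)$, and given $A_J$ only the sources in $J$ are fused, using the normalized Bayesian product rule. All numbers are hypothetical.

```python
from itertools import combinations


def product_fusion(prior, posteriors):
    """Normalized Bayesian product rule; returns the prior if no source is given."""
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]


def contextual_fusion(prior, posteriors, reliabilities):
    """Average the subset-wise fusion results, weighted by P(A_J)."""
    n_src = len(posteriors)
    result = [0.0] * len(prior)
    for size in range(n_src + 1):
        for subset in combinations(range(n_src), size):
            # ASSUMPTION: each source i is valid independently with probability R_i.
            p_subset = 1.0
            for i in range(n_src):
                p_subset *= reliabilities[i] if i in subset else 1.0 - reliabilities[i]
            # ASSUMPTION: given A_J, only the sources in J contribute.
            fused = product_fusion(prior, [posteriors[i] for i in subset])
            result = [r + p_subset * f for r, f in zip(result, fused)]
    return result


fused = contextual_fusion([0.5, 0.5], [[0.8, 0.2], [0.6, 0.4]], [0.9, 0.5])
```

The enumeration is exponential in the number of sources, so this form is only practical for a handful of inputs.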

63

Biographical and Biometric fusion for Person Identification
Many modern data repositories record both biographical and biometric information
Motor vehicle licensing authorities, passports, identity cards, etc.
Unique Identification number (www.uidai.gov.in)

Fusing information from multiple sources brings value in
Data integration: creating a single view of a citizen, person, or customer
Identification of the person using biometric and biographical information

Scaling person identification to a large number of customer records
– Biographical data is abundant, easy to match, and scales to millions of records, but can be noisy and uncertain.
– Biometric data is noise free and gives high precision for identification, but does not scale to a large number of records
– Both streams contain complementary information, which can be exploited by fusing them together

Fusion for person identification can be done at two levels
– Decision fusion: each matcher provides a decision; the decisions are then fused to produce the final decision.
– Score fusion: each matcher provides a score; the scores are combined into a single score used for decision making.
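
The difference between the two levels can be shown with a small sketch. Majority voting and the mean score are only illustrative choices of fusion rule, and the scores and threshold are hypothetical.

```python
def decision_fusion(scores, threshold=0.5):
    """Each matcher decides on its own; the final decision is a majority vote."""
    votes = [score >= threshold for score in scores]
    return sum(votes) > len(votes) / 2


def score_fusion(scores, threshold=0.5):
    """Scores are combined first (here: mean); a single decision is made at the end."""
    return sum(scores) / len(scores) >= threshold


scores = [0.9, 0.45, 0.4]  # one strong match, two near misses
by_decision = decision_fusion(scores)
by_score = score_fusion(scores)
```

Here decision fusion rejects the match because two of three matchers fall just below the threshold, while score fusion accepts it because the strong match outweighs the near misses. Thresholding early discards how confident each matcher was.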
64

Score Fusion using Biometric and Biographical matcher
Consider M matchers operating on a database containing N records which have both biographical and biometric information.
There can be multiple biometric as well as biographical matchers
Each query q will generate N x M scores, i.e. M-dimensional scores for N records

We model the scores as being generated from a probability distribution.
The probability distribution, under reasonable assumptions, is the posterior distribution of scores given a query
The posterior distribution is modeled as a Gaussian mixture model.
The model is built for both the genuine match distribution and the imposter distribution
Models are learnt from training data.

Score is fused using a joint distribution from different sources
For query q, if all the records are equally likely for the identifier, then the posterior of the score given the records is given by [equation missing]
The genuine and imposter match scores are assumed to be identically distributed
The query is assigned an identity of n0 only if [equation missing]
The algorithm is [equation missing], which simplifies to [equation missing]

[Figures: genuine match score density; imposter match score density]

65
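
Since the decision rule itself is not preserved on the slide, the following is only a sketch of one standard way to use the two score densities: a likelihood ratio between a genuine and an imposter Gaussian mixture. The diagonal covariances, the two-component mixtures, and all parameter values are hypothetical (the experiments in the talk use 5 components learnt from training data).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_density(score_vec, components):
    """Mixture density; components are (weight, means, variances), diagonal covariance."""
    total = 0.0
    for weight, means, variances in components:
        prob = weight
        for x, m, v in zip(score_vec, means, variances):
            prob *= gaussian_pdf(x, m, v)
        total += prob
    return total


def likelihood_ratio(score_vec, genuine, imposter):
    """Fused score: genuine density divided by imposter density."""
    return gmm_density(score_vec, genuine) / gmm_density(score_vec, imposter)


# Hypothetical models for M = 2 matchers with scores in [0, 1].
GENUINE = [(0.7, [0.9, 0.8], [0.01, 0.02]), (0.3, [0.7, 0.6], [0.02, 0.02])]
IMPOSTER = [(0.6, [0.2, 0.3], [0.02, 0.02]), (0.4, [0.4, 0.4], [0.03, 0.03])]

high = likelihood_ratio([0.85, 0.8], GENUINE, IMPOSTER)
low = likelihood_ratio([0.25, 0.3], GENUINE, IMPOSTER)
```

A record would be accepted as the identity of the query when its ratio exceeds a threshold chosen on held-out data; among the N records, the one with the largest ratio is the natural candidate.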

Results
Datasets
Biometrics: NIST dataset consisting of match scores of right and left index fingers
Biographical: electoral records of citizens in an emerging economy, consisting of names and addresses

A total of 6000 people were associated with the biometric and the biographical data.
Here M = 4: 2 biometric, 2 biographical (name and address)

Experimental Setup
Half of the dataset was used for training, i.e. for estimating the probability densities of both the imposter and genuine match score distributions
The number of Gaussian components was 5
The remaining records were used for testing.

Experimental Results
Score is fused using a joint distribution from these four different sources
The name modality has the lowest accuracy, whereas the biometric modality has high accuracy
The fused accuracy is much higher than that of the individual modalities
The accuracy increases when all the modalities are combined, thus validating the usefulness of fusion

[Figure: Accuracy for different modalities]

66

Identification accuracy for fusion of modalities

Social listening for monitoring the Philippine general elections 2013
• Online and offline analysis of social media messages around election debates and
election chatter for ABS‐CBN TV Channel
• Analysis of English and Filipino chatter to determine buzz and reaction on candidates,
campaigns, parties, topics and events
• Analysis of over 6 million election related Twitter and Facebook posts
• Comparison with Pulse Asia Election Survey
Real-time and offline monitoring of social media conversations about parties and candidates
[Figure: daily trend for POE, GRACE, Mar 08 to Mar 14 (y-axis 0% to 50%), with a callout at Mar 13]
Positive and negative sentiments for candidates

Grace Poe released her TV ad which drew flak from viewers. This was also the time that 3 candidates (Legarda, Poe, Escudero) of the Liberal Party who were also "guest" candidates of UNA were dropped by UNA as the President forbade them to attend UNA's soirees. Escudero felt really, really bad about being dropped by UNA (led by former president Estrada). Grace Poe offered to mediate between Escudero and Estrada.

[Figure: bar chart per candidate, axis 0% to 100%]
ZUBIRI, MIGZ (UNA)
VILLAR, CYNTHIA HANEPBUHAY (NP)
VILLANUEVA, BRO. EDDIE (BP)
TRILLANES, ANTONIO IV (NP)
SEÑERES, CHRISTIAN (DPP)
POE, GRACE
PENSON, RICARDO
MAGSAYSAY, RAMON JR. (LP)
MAGSAYSAY, MITOS (UNA)
MADRIGAL, JAMBY (LP)
MACEDA, MANONG ERNIE (UNA)
LLASOS, MARWIL (KPTRAN)
LEGARDA, LOREN (NPC)
HONTIVEROS, RISA (AKBAYAN)
HONASAN, GRINGO (UNA)
HAGEDORN, ED
FALCONE, BAL (DPP)
ESCUDERO, CHIZ
ENRILE, JUAN PONCE JR. (NPC)
EJERCITO ESTRADA, JV (UNA)
DELOS REYES, JC (KPTRAN)
DAVID, LITO (KPTRAN)
COJUANGCO, TINGTING (UNA)
CAYETANO, ALAN PETER (NP)
CASIÑO, TEDDY
BINAY, NANCY (UNA)
BELGICA, GRECO (DPP)
AQUINO, BENIGNO BAM (LP)
ANGARA, EDGARDO (LDP)
ALCANTARA, SAMSON (SJS)

67

Worldwide leads: Intent to buy, relocation to India
Venu Nair: Male, Atlanta, USA: Looking for good investment in Indian real estate market
Kiran Singh: Female, IT professional, Gurgaon: Any good 2 BR in Sohna Road?
[Figure: worldwide lead counts (5, 12, 14, 2)]

Data from Dec 3-4, 2013

68

Sample leads
Name          | Sex | Location       | Profession  | Interest    | Where
Param_star    | M   | India          | Media       | 2BHK        | India
Kiran Singh   | F   | Gurgaon, India | IT          | 2BHK        | Sohna Road, Gurgaon
Venu Nair     | M   | Atlanta, US    |             | Apartment   | India
Muhammad Faiz | M   | Singapore      | IT          | 2 and 3 BHK | Noida, India
Hooker India  | -   | Bangalore      | Real Estate | Apartment   | Bangalore, India

69

Crowd sensing
• The “power of the crowd” – a lot of information in a timely manner from everywhere
• People already use social media to share public safety and law enforcement information
• Gain deep situational awareness
• Enable proactive actions by augmenting traditional law enforcement methods

[Diagram: Emergencies, call for help; Police monitoring (limited coverage); Crowd sensors → Analytics and fusion in near-real-time → Rich events & KPIs]

70

Drinking in the Open
Come to South City 2, in evening, its a regular scene there since last 4 years, people drink in open and food is served by restaurants in their cars
khandsa road per sunrise hospital se aage tekho ke pass rehari waale sharab pilaate hai, jinki wajah se waha aane jaane wale log pareshaan ho rahe hai, even shaam ko to PCR ka bhi unhe darr ni hai, kirpa karke inhe waha se hataiya Gurgaon Police
[English: On Khandsa Road, past Sunrise Hospital, near the liquor shops, cart vendors serve alcohol, which troubles people passing by; even in the evening they are not afraid of the PCR. Please remove them from there, Gurgaon Police]
I also have a complaint to register. We have an alcohal drinking menace in front of our commercial complex, anand ganga comlex, at sohna chowk, on the main road.
Police Harassment
These two Constables (Davinder Singh & his Colleague) were at their worst behaviour...when they found all documents ok in the Car. I couldn't understand the reason for harrasment...opp
Wrong Parking
this is the main way from sadar bazar to bhuteshwar mandir. I dnt think y this road exist. It is the best place to park vehicles both the way are used to park vehicles no action have been taken from years. I think HUDA or MCG is not serious abt matter.

71

Event detection and mapping

72

Conclusions
Noise is an unavoidable fact of real-life communication
Communication meant for human consumption can be noisy for computers and vice versa

Due to ubiquitous sensors (GPS, accelerometer), easy-to-use apps (Facebook, Twitter, YouTube), and higher internet connectivity, the key characteristics of raw data are changing.
This new data can be characterized by the 4Vs: Volume, Velocity, Variety and Veracity

For example, during a football match, some people will tweet about goals, penalties, etc., while in addition there may be other reports on news channels. The data describes the same event
Fusion should create a single object representation

Different sources may have different reliability, and it is necessary to account for this fact to avoid degrading the performance of fusion results
Reliability and context should be taken into account during fusion

73

Conclusions
Noise can be defined as any kind of difference in the surface form of an electronic text from the intended, correct or original text
Noise can be in the form of errors arising from uncertainty in language and communication, and recognition errors

74

lvsubram@in.ibm.com

THANK YOU! ☺

75
