Big Data and Veracity Challenges
Text Mining Workshop, ISI Kolkata

L. Venkata Subramaniam
IBM Research India
Jan 8, 2014

1
The Four Dimensions of Big Data
• Volume (Data at Rest): Terabytes to exabytes of existing data to process
• Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
• Variety (Data in Many Forms): Structured, unstructured, text, multimedia
• Veracity* (Data in Doubt): Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations
* Truthfulness, accuracy or precision, correctness
2

2
We’ve Moved into a New Era of Computing!
In order to realize new opportunities, you need to think beyond traditional sources of data
The term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools
• Volume: 12+ terabytes of Tweets created daily.
• Variety: 100’s of different types of data.
• Velocity: 5+ million trade events per second.
• Veracity: Only 1 in 3 decision makers trust their information.
Data sources: Transactional & Application Data, Machine Data, Social Data, Enterprise Content
• Dimension by source: Volume, Velocity, Variety, Variety
• Structure by source: Structured, Semi-structured, Highly unstructured, Highly unstructured
• Other concerns listed: Ingestion, Veracity, Volume, Throughput
3
Volume is growing, so are Veracity issues
By 2015, 80% of all available data will be uncertain
By 2015 the number of networked devices will be double the entire global population. All sensor data has uncertainty.
The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content.
Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.
[Chart: Global Data Volume in Exabytes (0 to 9000) and Aggregate Uncertainty % (0 to 100), for 2005, 2010 and 2015]
Multiple sources: IDC, Cisco
4
What is Big Data? Big Data applies to information that can’t be processed or analyzed using traditional processes or tools
[Figure: application areas plotted by Data Volume, Velocity, Variety against Uncertainty (1/veracity)]
Traditional Data & Processing: Precise, authoritative, well formed
Data Uncertainty at Scale: Inconsistent, imprecise, uncertain, unverified, spontaneous, ambiguous, deceptive
Application areas shown: Telco Profiles, Call Detail Records, Market Trends, Smart Grid, Smarter Cities, Weather Modeling, Sensor Data, Smarter Traffic, Smarter Water, Portfolio Risk, Market Feeds, Credit Card Transactions, Medical Transcription, Electronic Data Interchange, CRM, Customer Records, Text, Audio, Video, Contact Centers, Retail, Fraud, SWIFT, Account Management, Homeland Security, Disease Progression, Patient Records, Predictive Modeling of Outcomes, Social Network Data, Services
5
Social media users in India
[Chart: India (No. of Users in Million), axis 0 to 50, by platform: Facebook, Twitter, Linkedin, Youtube, Google Plus]
Labelled values: Facebook 45 Million, Twitter 15 Million, Linkedin 15 Million
6
Veracity issues arise due to:
• Process Uncertainty: Processes contain “randomness”
  Examples: Uncertain travel times, Semiconductor yield
• Data Uncertainty: Data input is uncertain
  Examples: Intended Spelling vs. Actual Spelling in Text Entry, GPS Uncertainty, Testimony, Rumors, Contaminated?, Ambiguity {Paris Airport}, Conflicting Data {John Smith, Dallas} {John Smith, Kansas}
• Model Uncertainty: All modeling is approximate
  Examples: Fitting a curve to data, Forecasting a hurricane (www.noaa.gov)
7
What is Noise

8
Big Data, Fast Data, Noisy Data
Up to 10,000 times more noisy

Type of Text                         WER
SMS (texting)                        50%
Tweets                               35%
ASR                                  30%
Web queries                          15%
OCR                                  5%
Newswire Text (WSJ, Reuters, NYT)    0.005%

Social Media Communication is meant for Friends
• 30% world population on the internet and increasing fast
• 55 million Tweets per day
• Lead Generation, Disaster Tracking
• Large Dimensional, uncertain, unverified

Noisy, Informal, Implicit and Contextual Conversations
• I’ll see ya tomo
• RIP Jackson
• I’m lookie out 4 a car 2 burn rubber on the streets of LA
• What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!

There are more social networking accounts than people in the world
Social Networking overtakes Search: Facebook becomes the most visited website ahead of Google
Big Data: More video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been entering 24/7 since 1948.

9
SMS
0 there – there
1 aint – are not
2 no – no
3 doubt – doubt
4 there – there
5 hon – honey
6 im – I am
7 gonna – going
8 be – be
9 takin – taking
10 it – it
11 4 – for
12 life – life
13 u – You
14 wont – wont
15 b – be
16 rida – rid of
17 me – me
18 lol – laugh out loud
19 Ray – (NAME)

Texting Language: Over 50% of the words are written in non-standard ways
Spontaneous Language: Use of slang, ungrammatical, no punctuation, no case information
Mixing of Languages: Many SMS contain text in a mix of two or more languages

Type of Noise             %
Deletion of Characters    48%
Phonetic Substitution     33%
Abbreviations             5%
Dialectical Usage         4%
Deletion of Words         1.2%

101 SMSes; 52% of words were non-standard (Contractor et al., 2010)

10
Speech Recognition
SPEAKER 1: windows thanks for calling and you can learn yes i don't mind it so then i went to
SPEAKER 2: well and ok bring the machine front end loaded with a standard um and that's um it's a desktop machine and i did that everything was working wonderfully um I went ahead connected into my my network um so i i changed my network settings to um to my home network so i i can you know it's showing me for my workroom um and then it is said it had to reboot in order for changes to take effect so i rebooted and now it's asking me for a password which i never i never said anything up
SPEAKER 1: ok just press the escape key i can doesn't do anything can you pull up so that i mean

Recognition Errors: 10-40% Word Error Rates
Spontaneous Language: Use of slang, use of fillers like um and ah, ungrammatical, false starts, no punctuation, no case information
Mixing of Languages: Contain words from two or more languages
11
Historical Text
Non Standard Spellings: No notion of the importance of
having a single spelling for each word. Letters would be
added or removed to ease line justification.
New words: New words, words that are variants of
present vocabulary words
Different Language Style: Different grammar, language model.
OCR: Character substitution errors, missed punctuation.

Baron et al. 2009

12
Emails, Blogs, Tweets, Online Chat,……

Chat Logs
[12:51:13 PM] Geetha: alrite
[12:52:01 PM] Richa: id has valid pw not expired
[12:52:49 PM] Geetha: can't get to theh site
[12:53:04 PM] Richa: network connection may be slow
[12:54:39 PM] Geetha: ok Im able to now
[12:54:53 PM] Richa: should I reset the password
13
What is Noisy Text?
Any kind of difference in the surface form of an electronic text from the intended, correct or original text (Knoblock et al., 2007)
Noise can be at the lexical level {b4, before, befour}
• Resulting in substitution, insertion, deletion, transposition, run-on, and split.
Noise can be at morphological, syntactic, discourse level {I can hear u, I can hear you, I can here you}
• Resulting in substitution, insertion, deletion, transposition of words and the introduction of out of vocabulary words.
14
Classifying Noise
Lexical Errors (Subramaniam et al., 2009)
• Missing characters {before > bef}
• Extra characters {raster > raaster}
• Phonetic substitution {before > b4, late > l8}
• Abbreviations {laugh out loud > lol, United Nations > UN}
Syntactical Errors (Kukich, 1992; Foster et al., 2007)
• Missing word {What are the subjects? > What the subjects?}
• Extra word {Was that in the summer? > Was that in the summer it?}
• Real word spelling errors {She could not comprehend. > She could no comprehend.}
• Agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}
• Dialectical usage {I’m going to be there > I’m gonna b there}
15
Techniques for Automatically Detecting Lexical Errors (Kukich, 1992)
Nonword error detection: efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon
Two approaches
N-gram
• Look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings.
• Good for identifying errors made by OCR devices
• But unusual/foreign language valid words will be marked, and nice-looking mistakes will be marked valid
Dictionary based
• Input string appears in a dictionary? If not, the string is flagged as a misspelled word.
• But nearly two-thirds of the words in a dictionary did not appear in an eight million word corpus of New York Times text and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study)
16
Techniques for Automatically Detecting Incorrect (Syntax) Grammar (Foster et al., 2007)
Efficient methods to detect word sequences that do not form a grammatical sentence
Three Approaches
N-gram
• Classifies a sentence as ungrammatical if it contains an unusual part of speech sequence
Precision-grammar
• Classifies a sentence using a parser and a broad-coverage hand-written grammar
Probabilistic-parsing
• Finds sentences with parsing error

17
Quantifying Noise (Subramaniam et al., 2009)
Quantifying Lexical Errors {Before, b4, befour, befor, bfore}
• Edit Distance: Good for measuring surface level deviation from original
• Perplexity: Good for measuring deviation from underlying language structure at character level
Quantifying Semantic Errors {I came to LA yesterday. I am still jet lagged., Came la yester day still jetlagged, Came 2 LA ystrday stil jetl8d}
• WER: Good for measuring real word errors (speech recognition errors)
• Perplexity: Good for measuring deviation from “proper”
• BLEU: Good for comparing a candidate translation against multiple reference translations
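
Edit distance and WER can both be computed with the same dynamic program, once over characters and once over words. The sketch below uses plain Levenshtein distance with unit costs, which is an assumption; the slides do not say which cost scheme was used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

For the lexical variants listed above, `edit_distance("before", "befor")` is 1 and `edit_distance("before", "b4")` is 5, so phonetic substitutions look much further from the original than simple character deletions.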
18
Spelling Correction (Kukich, 1992)
Isolated Word Correction
Minimum edit distance techniques
Similarity key techniques
Probabilistic techniques
N-gram-based techniques
Rule-based techniques
Will not catch typos resulting in correctly spelled words {form, from}
Estimates put real word errors at 30% of all word errors
Context-Dependent Word Correction
Parsing
Language models
Can errors be ignored and still meaningful interpretation be done? {I
am coming with you, I comes with you}
19
SMS Text Normalization
dis is n eg 4 txtin lang
This is an example for Texting language

Extreme corruption of words and sentences
Models for SMS language are lacking

Tomorrow never dies!!!
Occurrence in a 1000 SMS corpus:
2moro (9), tomoz (25), tomoro (12), tomrw (5), tom (2), tomra (2), tomorrow (24), tomora (4), tomm (1), tomo (3), tomorow (3), 2mro (2), morrow (1), tomor (2), tmorro (1), moro (1)

20
Finding Canonical Sets (Acharyya, 2009)
Learn mappings
costmer, castumar, kustamar, coustomber → customer

How can we do it in an unsupervised way?
Find some invariant that does not change in spite of corruptions
Buckets of context seem invariant!
<..Back Bucket...> sceam <..Front Bucket...>
sceam : sms(2) new(5) recharge(4) tel-provider(2) about(3)
<..Back Bucket...> scheme <..Front Bucket...>
scheme : sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1)
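
One way to test whether two context buckets look "invariant" is to compare them as count vectors. The sketch below uses cosine similarity, which is an assumption made for illustration; the slide does not state which similarity measure Acharyya (2009) uses. The counts are the ones shown above.

```python
from collections import Counter
from math import sqrt


def cosine(bucket_a, bucket_b):
    """Cosine similarity between two bags of context words."""
    dot = sum(count * bucket_b[word] for word, count in bucket_a.items())
    norm_a = sqrt(sum(c * c for c in bucket_a.values()))
    norm_b = sqrt(sum(c * c for c in bucket_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Context counts copied from the slide.
sceam = Counter({"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3})
scheme = Counter({"sms": 4, "new": 2, "activate": 3, "tel-provider": 2,
                  "about": 1, "recharge": 1})

similarity = cosine(sceam, scheme)
```

A high similarity between the contexts of "sceam" and "scheme" suggests the two surface forms belong to the same canonical set, even though no labelled mapping was provided.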
21
SMS Based FAQ Retrieval (Kothari et al., 2009)
SMS Question: how 2 actvate romng on me hanset
FAQ Database:
• Q: How do I activate Roaming
  A: Dial *567*2# from your handset
• Q: What are the rates for roaming within India
  A: Roaming rates on prepaid connections are 60 Paise per minute
SMS Answer: Dial *567*2# from your handset

Goal is to find the Question Q* that best matches the SMS S
• A scoring function Score(Q) assigns a score to each question Q in the FAQ dataset. The score measures how closely the question matches the SMS string S.

22
FAQ Retrieval Problem Formulation
SMS is treated as a sequence of tokens S=s1,s2,…,sn
Let Θ denote the questions in the FAQ corpus where each
question Q ∈ Θ is treated as a set of tokens
Goal is to find the question Q* that best matches the SMS S

23
Method
For each token si, a list Li consisting of all terms from the dictionary that are variants of si is constructed. Variants are sorted in the descending order of their weight.

This space is searched to find the closest matching FAQ question.
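
The retrieval idea can be sketched as follows. This is a simplified illustration and not the algorithm of Kothari et al. (2009): the variant weight is `difflib.SequenceMatcher.ratio()`, the 0.7 threshold is arbitrary, and the search is an exhaustive scan over the FAQ rather than a pruned search. All function names are hypothetical.

```python
from difflib import SequenceMatcher

# FAQ entries taken from the example slide.
FAQ = {
    "How do I activate Roaming":
        "Dial *567*2# from your handset",
    "What are the rates for roaming within India":
        "Roaming rates on prepaid connections are 60 Paise per minute",
}


def tokens(text):
    return text.lower().split()


def variants(sms_token, dictionary, threshold=0.7):
    """List L_i: dictionary terms similar to the SMS token, heaviest first."""
    scored = [(SequenceMatcher(None, sms_token, term).ratio(), term)
              for term in dictionary]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)


def score(question, variant_lists):
    """Sum, over SMS tokens, of the best variant weight found in the question."""
    q_terms = set(tokens(question))
    return sum(max((w for w, term in lst if term in q_terms), default=0.0)
               for lst in variant_lists)


def best_question(sms, faq):
    """Return Q*, the FAQ question with the highest score for this SMS."""
    dictionary = sorted({term for question in faq for term in tokens(question)})
    lists = [variants(tok, dictionary) for tok in tokens(sms)]
    return max(faq, key=lambda question: score(question, lists))
```

Calling `best_question("how 2 actvate romng on me hanset", FAQ)` selects the roaming activation question, because "how", "actvate" and "romng" all have close variants in it while only "romng" matches the rates question.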

24
Extracting Dialog Models (Negi et al., 2009)
Huge number of repetitive calls at contact centers
Building task oriented dialog systems
• Task specific information – concepts, subtasks
• Task structure - manual encoding
Using large amounts of human to human conversation data
Extracting dialogue models using human-to-human conversations

25
Example Conversation: Car Rental Domain

26
Overview
[Pipeline: Transcribed Calls → Utterance Normalization → Normalized Calls → Mining of Subtasks → Subtasks → AIML Conversion → Chat-bot]

27
Finding Patterns with Gaps
Need for patterns capturing variations in expressions
• Have you rented a car from us before
• Have you rented a car before
• Have you rented a car from <Rent_Agency> before

Mining regular expression patterns over tokens or entity types
Each pattern represented as a token sequence
[rented car before]
Token sequences mined efficiently using extension of apriori algorithm
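
To make "patterns with gaps" concrete, the sketch below checks whether a token sequence such as [rented car before] occurs in an utterance in order, allowing a bounded number of skipped tokens between consecutive pattern tokens. The `max_gap` parameter and the function names are assumptions for illustration; the mining step itself is not shown.

```python
def matches_with_gaps(pattern, utterance_tokens, max_gap=3):
    """True if pattern tokens occur in order with at most max_gap skipped
    tokens between consecutive matches (the first match may start anywhere)."""

    def search(p, start, bounded):
        if p == len(pattern):
            return True
        end = len(utterance_tokens)
        if bounded:
            end = min(end, start + max_gap + 1)
        return any(utterance_tokens[i] == pattern[p] and search(p + 1, i + 1, True)
                   for i in range(start, end))

    return search(0, 0, False)


def support(pattern, utterances, max_gap=3):
    """Number of utterances that contain the gapped pattern."""
    return sum(matches_with_gaps(pattern, u.lower().split(), max_gap)
               for u in utterances)


UTTERANCES = [
    "Have you rented a car from us before",
    "Have you rented a car before",
    "Have you rented a car from <Rent_Agency> before",
]
PATTERN = ["rented", "car", "before"]
```

Here `support(PATTERN, UTTERANCES)` is 3: all three surface variants collapse onto the same gapped pattern.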

28
Association Analysis
Total number of possible itemsets is exponential (2^N)
Brute-force technique infeasible

Support filtering is necessary
• To eliminate spurious patterns
• To avoid exponential search

Support has anti-monotone property:
X ⊆ Y implies σ(Y) ≤ σ(X)

Efficient algorithms have been designed to exhaustively find all itemsets/patterns with sufficiently high support

Given d items, there are 2^d possible candidate itemsets
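
The anti-monotone property is what lets Apriori prune the 2^d candidate space: a k-itemset is only worth counting if all of its (k-1)-subsets are already frequent. Below is a minimal sketch of the classic itemset version; the talk refers to an extension for token sequences, which is not reproduced here, and the transactions are made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        kept = {c: s for c in level if (s := support(c)) >= min_support}
        frequent.update(kept)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the anti-monotone property.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in kept for sub in combinations(c, k - 1))]
    return frequent


# Hypothetical transactions (sets of tokens per utterance).
TRANSACTIONS = [
    {"rented", "car", "before"},
    {"rented", "car"},
    {"car", "before"},
    {"rented", "before", "car"},
]
FREQUENT = apriori(TRANSACTIONS, min_support=3)
```

With `min_support=3`, the pair {rented, before} has support 2 and is dropped, so the triple {rented, car, before} is never counted at all.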
29
Utterance Normalization
Identify concepts
Named Entity Annotation
• Rule based annotator for annotations such as location, date, car model, and amount
• “I want to pick it up from <location> on <date>”
Grouping of utterances
• Find patterns with gaps and represent each utterance by them along with unigrams and bi-grams
• Agent and customer utterances are clustered separately using an off-the-shelf clustering algorithm

30
Finding Subtasks and ordering
Customer and agents engage in
similar kinds of interactions to
accomplish an objective
Represent each call with agent
utterance and customer utterance
cluster labels
Subtasks
Patterns of cluster labels
(agents) with possible gaps
Lot of variability in customer
utterances
Vertical pattern mining

[Figure: calls represented as sequences of cluster labels C1, C2, C3, ..., Cn]

31
Subtask Preconditions
Utterance pre-conditions
• Customer utterances that indicate start of a subtask
• “please make this booking” for “make payment” subtask
• Frequent features from customer utterances
Flow pre-conditions
• Only logical orders of subtasks are allowed
• “make payment” subtask cannot be executed unless “gather pick-up information” subtask has been executed.
• Collection of all the subtasks that precede the subtask

32
Finding Subtasks

33
Data Fusion
Problem

Given multiple data points about an entity, create a single object representation while resolving conflicting data values

Difficulties

• Null values: Subsumption and complementation
• Contradictions in data values
• Uncertainty & truth: Discover the true value and model uncertainty in this process
• Metadata: Preferences, recency, correctness
• Lineage: Keep original values and their origin
• Implementation in DBMS: SQL, extended SQL, UDFs, etc.
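
A small sketch shows how several of these difficulties interact. It resolves contradictions by majority vote, fills nulls from other records (complementation), and keeps lineage. Majority voting is only one possible resolution strategy and is an assumption here, as are the record layout, the `source` field and the phone number.

```python
from collections import Counter


def subsumes(a, b):
    """True if record a carries every non-null value of record b."""
    return all(v is None or a.get(k) == v
               for k, v in b.items() if k != "source")


def fuse(records):
    """Fuse records about one entity; return (fused_record, lineage)."""
    fused, lineage = {}, {}
    keys = sorted({k for r in records for k in r if k != "source"})
    for key in keys:
        observed = [(r[key], r["source"]) for r in records
                    if r.get(key) is not None]
        lineage[key] = observed  # keep original values and their origin
        if observed:
            # Majority vote resolves contradictions (assumed strategy).
            fused[key] = Counter(v for v, _ in observed).most_common(1)[0][0]
        else:
            fused[key] = None
    return fused, lineage


RECORDS = [
    {"source": "crm", "name": "John Smith", "city": "Dallas", "phone": None},
    {"source": "web", "name": "John Smith", "city": "Kansas", "phone": "555-0100"},
    {"source": "billing", "name": "John Smith", "city": "Dallas", "phone": None},
]
FUSED, LINEAGE = fuse(RECORDS)
```

The fused record takes "Dallas" for the city (two sources against one) and the only available phone value, while `LINEAGE["city"]` still lists all three original values with their sources.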

34
360 Context
Analyze social data in the context of enterprise data to build entity and event profiles
and establish linkages between them for online and offline analysis
Entity (people, products, events) Insights
The problem → Solution
• What are the key product interests of person A? → Over time learn about the person’s product interests from her social media postings
• What is the location and trajectory of person B? → Gives the current location and locations in the past
• What life events happened in person A’s life in the past x months? → List significant events like marriage, birth of a child, relocation, etc.
• What are the events of interest happening in a given location? → Lists the top events in a given geography
• What is the sentiment on a given product? → Gives the sentiment on a product

Key Sustained Value Factor: Understand customers wants and needs better

[Diagram]
• User Domains: Social Data (propensities/sentiment/intent), Enterprise Databases (core customer view/transactions)
• Functions: event detection, entity linkages, sentiment, event profiles, entity profiles
• Application Domains: Smarter Commerce (intent to purchase for customers), Smarter Cities (real-time public safety events)

What MDM 360 does?
Builds an entity’s complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.
35

35

IBM Confidential
Extraction Challenges: Stages of intent

Stage

Example

Wishing for an event

“I just want to graduate, get a job, get a car, and
live with my boyfriend”

Anticipating an event

“Im getting a car for graduation yay!!!!!”

During an event

“At disneyworld :D”

Post event / continuous state

“Apparently I got a raise at work three months ago
and didn't know? Sweeeeeeeeeet”

Hobby

“Loves to fish, travel and frequent concerts. Down to earth, athletic, professional 40 and single. Loves the outdoors, working out, travel and younger fit guys for dating.”

36
Extraction Challenges: Detecting filtering conditions
Filter

Example

Spam

“Need a New #Credit Card for your #Business or
online #Ebay store? Compare and Apply Online.
http://retweet.it/r/We0iai”

Sarcasm, jokes

“I thought I was having a stroke this afternoon but it
turns out it was too many Starbucks Refreshers plus
my leg falling asleep.”

Resolve ambiguous meaning

“In the words of @LNSmooth23 I'm retiring from the
nightlife”

Non-personal

“My mom is buying a house, but why in Willingboro”

37
360-degree Profiles from Social Media
Social Media based 360-degree Event and Individual Profiles

Event Detection
• Identifiers: what, where, when…..
• Attributes: severity, urgency…

Timely Insights on Events
• Event Detection
• Public Safety Events
• Plans for public disturbances
• Sentiment around events
• Citizen sentiment

Personal Attributes
• Identifiers: name, address, age, gender, occupation…
• Interests: sports, pets, cuisine…
• Life Cycle Status: marital, parental
• Relationships: family, friends, co-workers, work and interest network

Timely Insights on Individuals
• Intent to participate in public events
• Instigation for causing public damage
• Sentiment on events, govt policies
• Current Location
• Hate messages

Personal Interests
• Personal preferences or political leanings
• Activity History
Intent
We must support the movement, I am going to the rally at Jantar
Mantar tomorrow
Anna Hazare has a point when he says politicians are corrupt and
need to be taught a lesson. The rally starts at 10.

Public Safety Events
Mamta Deedi is also joining Anna, Ramdev, n Kejriwal. She is going to do
Anshan at Delhi Jantar mantar. Ye sab public ki kahke le rahe hain.
So Its Mamata's day out tomorrow at #JantarMantar. #Rally.

Location announcements
I'm at Karir Square http://4sq.com/fYReSj

38
More data: Customer intent extracted from social media provides context
Example posts: “Buying a DSLR today!”, “Go for the best, DP2000”, “Thrza gr8 deal on ZX-550 @ the mall”
Michael’s online friends offer lots of advice
Diagram labels: Prior Business Transactions; Social Data; Entity Extraction, Fact Discovery, Intent & Sentiment; Influencers; Intent
450M+ tweets/day. Millions of tweets yield one company-specific fact: Customer ready to buy a DSLR camera today, possibly at a nearby mall

Text Analytics used to extract intent from Social Media
• “Wifey’s birthday tomorrow, looking for a killer dslr”: Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe
• “Maybe I should buy her that purple roadster, while I’m at it. ;-) lol”: Sarcasm, Wishful Thinking; Intent to Purchase, Gift Type?
• “In NYC area this w/e, any good malls nearby?”: Potential Locations and Activity; Region & City Location, Timeframe, Intent to Shop

Resultant fact base contains billions of facts, and is incrementally updated
Fact segmentation or clustering is rapid enough to drive a business decision

39
Matching Twitter profiles to Corporate Data
• Linking Social Media profiles with Employee database

• Several extensions are possible, for example, linking with Citizens and Security databases

[Pipeline: Social media profiles (name, address, gender, age, employment, relationship, …) → Employment filter → Resolution → Social media profiles of IBM employees and their network]

Twitter: 45M profiles
• Name: first, last
• Home location: city, (state), country
• Employment: company + role

Employee Directory: 460K entries
• Name: (first, middle, last, preferred)
• Work location: (city, state, zip, country)
• Job description

Attributes used for linking: Name, work location, job description

Resolution
• Semantic Name Variations: Bill Chamberlin vs. Chamberlain, William H.; C. Mohan vs. Mohan Chandrasekaran (Mohan)
• Geo Proximity: Saratoga, CA vs. San Jose, CA; New Jersey vs. New York
• Job Role Disambiguation: “Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”

Current Demo focused on Name and Location matching, as well as EmployeeOf information
Choice of social media profile attributes for linking constrained by availability of IBM BluePage attributes

40
Example Result

• Semantic name variations: Twitter name is a close variation of the IBM names
• Geo Proximity: Work locations are within 25mi of the Twitter location
• Job Role Disambiguation : description in Twitter profile matches HR role
41
Common Data Problems
• Lack of information standards
  Ashok Kumar
  A Kumar
  Mr. Ashok Kr
• Data misplaced in the database
• Different formats & structures across different systems
  Four sixteen Street 8, Anand Niketan, Delhi
  #416 Anand Niketan, N Delhi, 21
  416 Anand Niketan, New Delhi, India 110021
  110021
• Data surprises in individual fields

  Email                       Tax ID        Telephone
  91,,,,                      228-02-1975   6173380300
  ranivrgeoi@yahoo.co.in      025-37-1888   415-392-2000
  ,CYRUS_DASTUR@HOTMAIL.COM   34-2671434    3380321
  HP 15 State St.             508-466-1200  Orlando

• Special characters in the data
• The redundancy nightmare
• Duplicate records with a lack of standards

  90328574  IBM                    187 N.Pk. Str. Salem NH 01456
  90328575  I.B.M. Inc.            187 N.Pk. St. Salem NH 01456
  90238495  Int. Bus. Machines     187 No. Park St Salem NH 04156
  90233479  International Bus. M.  187 Park Ave Salem NH 04156
  90233489  Inter-Nation Consults  15 Main Street Andover MA 02341
  90345672  I.B. Manufacturing     Park Blvd. Bostno MA 04106

42
Address Variations…
• Spelling variations, hyphenation, abbreviations
  • I-344 | Sarojini Nagar | N Delhi | 23
  • 344 Block J | Sarojni Ngr | New Delhi | 110023
  • 344 Block I | Sarojni Ngr | New Delhi | 110023
• Multiple Ways of writing the same field
  • 13B | Link Road | Versova | Mumbai
  • 18 Block M | Bandra Versova Link Rd. | Versova | Mumbai
• Missing Address Fields
  • 4, Block C | ISID Campus | V. Kunj | New Delhi | 110070
  • 4C | ISID Campus | Institutional Area | V. Kunj | New Delhi | 110070
• Errors
  • 4C | ISID Campus | Institutional Area | V. Kunj, New Delhi | 110007

43
Regional variations in Addresses across
India
Addresses in different regions contain words of the local language even when the
addresses are written in English
Ex : The commonly used word to describe a street type is “Gali” in Northern
India whereas “Beedhi/Veedhi” is the commonly used term in Southern India
Street Intersections and Street Information containing multiple Street Type Identifiers
like Cross and Main are extensively found in the Southern Indian regions
Ex : “3rd Main, 4th B Cross”
Sector and Pocket Information are found primarily in North Indian Addresses
Ex : “Sector 5, Pocket 2A 2nd Block”
Regional differences in writing addresses necessitate bifurcation of standardization
rules based on regions.

44
Investigating the Data
Take the Example: 123 St. Virginia St.

Parsing: Separates multi-valued fields into individual pieces
  123 | St. | Virginia | St.

Lexical Analysis: Determines business significance of individual pieces
  123 = Number | St. = Street Type | Virginia = Alpha | St. = Street Type

Context Sensitive: Identifies various data structures and content
  123 = Number | St. Virginia = Street Name | St. = Street Type

“The instructions for handling the data are inherent within the data itself.”
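
The three steps on this slide can be mimicked with a toy tagger. This is only a sketch of the idea: the street-type list and the single context rule (leading number, trailing street type, everything in between is the street name) are assumptions, and real standardization rule sets are far richer.

```python
STREET_TYPES = {"st", "st.", "rd", "rd.", "ave", "ave.", "blvd", "blvd."}  # assumed list


def parse(address):
    """Parsing: split a multi-valued field into individual pieces."""
    return address.split()


def classify(token):
    """Lexical analysis: assign a class to one token in isolation."""
    if token.isdigit():
        return "Number"
    if token.lower() in STREET_TYPES:
        return "Street Type"
    return "Alpha"


def lexical_analysis(address):
    return [(tok, classify(tok)) for tok in parse(address)]


def context_sensitive(address):
    """Context rule: Number ... Street Type means the middle is the street name."""
    tagged = lexical_analysis(address)
    if len(tagged) >= 3 and tagged[0][1] == "Number" and tagged[-1][1] == "Street Type":
        middle = " ".join(tok for tok, _ in tagged[1:-1])
        return [tagged[0], (middle, "Street Name"), tagged[-1]]
    return tagged
```

For "123 St. Virginia St." the lexical pass labels both "St." tokens as street types, while the context pass recognises that the first one is part of the street name "St. Virginia".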
45

Sample Standardized Output
Sample Address Input:
“SANT KRUPA BUILDING, 2ND FLOOR, CHHEDA RD, NR S V JOSHI HIGH SCHOOL, DOMBIVALI (E), THANE. INDIA.”
Standardization Output:

DoorNo              20
Floor Value         2nd FLOOR
Building Name       SANT KRUPA
Building Type       BUILDING
Street Name         CHHEDA
Street Type         ROAD
Landmark Position   NEAR
Landmark            S V JOSHI HIGH SCHOOL
Area                DOMBIVALLI EAST
City                THANE
District            THANE
State               MAHARASHTRA

46
Input Addresses vs Standardized Addresses
1. Input address: A38/91 KONIA . . VARANASI INDIA
   Standardized address: A38/91 KONIA VARANASI VARANASI UTTARPRADESH INDIA
   Highlights: Autopopulation of state

2. Input address: VILL BASUDEVPUR PO KHANJANCHAK DURGACHAK HALDIA TAMLUK INDIA
   Standardized address: DURGACHAK ,HALDIA,VILLAGEBASUDEVPUR PO-KHANJANCHAK TAMLUK EAST MIDNAPORE WESTBENGAL INDIA
   Highlights: Rural address handling

3. Input address: NEAR RAJGHAR GIRLS SCHOOL LACHIT NAGAR HOUSE NO 5 ULUBARI GUWAHATI ASSAM GUWAHATI INDIA
   Standardized address: 5 NEAR RAJGHAR GIRLS SCHOOL ULUBARI LACHIT NAGAR GUWAHATI KAMRUP ASSAM INDIA
   Highlights: Maintaining a standard format across addresses (house no precedes landmark information)

4. Input address: 1/15, PREMJYOTI CO OP HSG SOC., RAMBAUG - 5, KALYAN (W), MAHARASHTRA 421301 BHIWANDI INDIA
   Standardized address: 1/15 PREMJYOTI COOPERATIVE HOUSING SOCIETY,RAMBAUG 5 KALYAN WEST BHIWANDI THANE MAHARASHTRA 421301
   Highlights: Standardization of tokens

5. Input address: 3/2,FIRINGI DANGA ROAD, P.O.MALLICKPARA SERAMPORE-3 CALCUTTA INDIA
   Standardized address: 3/2,FIRINGI DANGA ROAD, SERAMPORE-3 P.O.MALLICKPARA KOLKATA WESTBENGAL INDIA
   Highlights: Standardization of tokens

47
Two Methods to Decide a Match
Are these two records a match?

Record 1:  RHITU  K      KAZANGIAN  128  MAIN   ST  02111  12/8/62
Record 2:  RITU   KUMAR  KAZANGIAN  128  MAINE  RD  02110  12/8/62
Grade:     B      B      A          A    B      D   B      A        = BBAABDBA
Weight:    +5     +2     +20        +3   +4     -1  +7     +9       = +49

Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
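
The two decision methods can be contrasted on the grades and weights from this slide. The decision table contents and the cutoff values below are hypothetical placeholders; the slide only gives the grade string BBAABDBA and the total of +49.

```python
# Per-field letter grades and weights as shown for the example record pair.
GRADES = ["B", "B", "A", "A", "B", "D", "B", "A"]
WEIGHTS = [5, 2, 20, 3, 4, -1, 7, 9]

# Hypothetical stand-in for the vendor-delivered decision file.
DECISION_TABLE = {"AAAAAAAA": "Match", "BBAABDBA": "Suspect"}


def deterministic_decision(grades, table):
    """Look the combined grade string up in a fixed table; unknown means Fail."""
    return table.get("".join(grades), "Fail")


def probabilistic_decision(weights, match_cutoff=40, clerical_cutoff=20):
    """Sum field weights and compare with cutoffs (cutoff values are assumptions)."""
    total = sum(weights)
    if total >= match_cutoff:
        return total, "Match"
    if total >= clerical_cutoff:
        return total, "Suspect"
    return total, "Non-match"
```

The deterministic path depends entirely on whether the exact pattern was anticipated in the table, whereas the probabilistic path yields a graded score (+49 here) that can be compared against a tunable cutoff.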
48
A Closer Look at Probabilistic Matching

Record 1:  RHITU  K      KAZANGIAN  128  MAIN   ST  02111  12/8/62
Record 2:  RITU   KUMAR  KAZANGIAN  128  MAINE  RD  02110  12/8/62
Weight:    +5     +2     +20        +3   +4     -1  +7     +9       = 49

The weighted score is a relative measure of the probability of a match; it expresses the amount of information content for all of the fields compared
The CUTOFF is the score above which good matches are found

[Chart: Histogram of Weights; x-axis: weight from -50 to 60; y-axis: # of Pairs from 0 to 4000; UnMatched and Matched distributions]

49
The Value of Information Content
Information Content is measured both at the field and at the field value level and is calculated automatically
Discriminating Value represents the significance of one field versus another in contributing to a match
• For example a Gender Code contributes less information than a Tax-Id Number
Frequency represents the significance of one value in a field over another value
• For example in a Last-Name Field, “SMITH” contributes less information than “ROUTZAHN”
Probabilistic Matching uses the automatically generated measures of Information Content to achieve the highest match rates possible utilizing a scientifically-justifiable methodology
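
One common way to turn "information content" into numbers is the Fellegi-Sunter style log-likelihood weight, where m is the probability that a field agrees for true matches and u the probability that it agrees by chance. Whether the system described in the slides uses exactly this formula is an assumption, and every probability below is invented for illustration.

```python
from math import log2


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return log2(m / u)


def disagreement_weight(m, u):
    """Weight added (negative) when a field disagrees: log2((1 - m) / (1 - u))."""
    return log2((1 - m) / (1 - u))


M = 0.95  # assumed probability that a field agrees on a true match

# Chance-agreement probabilities (all hypothetical).
U_GENDER = 0.5       # two roughly equally likely values
U_TAX_ID = 1e-6      # nearly unique identifier
U_SMITH = 0.01       # common surname
U_ROUTZAHN = 0.0001  # rare surname

weights = {
    "gender": agreement_weight(M, U_GENDER),
    "tax_id": agreement_weight(M, U_TAX_ID),
    "SMITH": agreement_weight(M, U_SMITH),
    "ROUTZAHN": agreement_weight(M, U_ROUTZAHN),
}
```

Under these assumed probabilities, agreeing on gender adds less than one bit of evidence, while agreeing on a tax ID or on a rare surname such as ROUTZAHN adds far more than agreeing on SMITH, which mirrors the two examples above.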

50
Data Framework around the Individual
Individual Core
• Personal data: Name, Address; Phone, eMail
• Behavioral Preference / permissions

Individual Credentials
• Logins (User credentials)
• Profile
• Expertise

Big Data
• External & internal unstructured data linked to individuals
• Social activity

Individual Social Presence
• BLOG
• Comment
• Opinion “Like”s
• Community

Relationships with the individual
• to Person
• Communities
• to Company: Roles, History
• IBM Linkage

Transactions involving the Individual
• Tech Support Call
• Opportunity & Orders
• Responses to Marketing Campaigns

Interactions with the Individual
• Digital
• Phone
• eMail
• F2F
• Social
• Web traffic

51
Analytics steps

Text Analytics
• Analyze and extract consumer attributes from individual messages

Entity Integration
• Integrate information about a consumer within a single social media source over time

Entity Resolution
• Link social media profiles with customer data
• Link and integrate information about a consumer across multiple social media sources

Inputs and outputs: Social Profiles of Consumers, Master Data on Customers

Examples of extracted attributes
• Intent: All I really want is the Disney Visa card from chase with the castle on it
• Life Events: Looks like we'll be moving to New Orleans sooner than I thought.
• Personal Attributes: I am a engineer, mom, and wife
• Relationships: In fact I'm looking forward to the new month. Both myself and the wife have our graduation ceremonies

52
Person Information across Documents
Who Is James Dimon?

Do these filings refer to the same person?
• variability in the person’s name, lack of a key identifier
• supporting attributes vary depending on the context (form type)

All these facts need to be linked and integrated
Document types shown: Signatures, Biographies, Insider Transactions, Committee memberships

53
Entity & Relationship Analytics from
Big Data
[Pipeline: Unstructured data sources → Crawl → Extract / Text Analytics → Entity Resolution → Map/Fuse/Aggregate → Entity Views (Entities & Relationships: Object-centric view); label: Untrusted View]

Challenge
Construct and maintain comprehensive profiles of entities and relationships from unstructured data sources
Main Problem: Assemble an entity view, where each entity aggregates data from thousands of different documents
Multiple stages of complex processing:
– Information extraction
  • From each unstructured document, extract relevant structured records
– Entity resolution
  • Link records (possibly across documents) that are about the same real-world “entity”
– Entity Integration / Entity population: mapping / fusion / aggregation
  • Collect all the facts about the same entity into one rich object with clean values and relationships to other entities

54
The Complete Entity View
Current purchase intentions
expressed by the consumer

Location-based information about a
consumer (where they plan to travel,
events they are going to attend)

Purchase history for a consumer

Life events (relocation, home
purchase, wedding, graduation)

Related people based on social
networking data

Comments/complaints expressed about
various products and services

Customer identity information (e.g.,
name, location) obtained from profiles
and content of posts

Micro-segmentation information about
individual consumers (e.g., gender, age
range, profession)

360-degree profile of a customer

City     | State | Age Range | Gender | Marital Status | Number of kids | Employment Status | Occupation
Houston  | TX    | 30-39     | Female | ?              | ?              | Employed          | Journalist
San Jose | CA    | ?         | Male   | Married        | 2              | Employed          | Software Engineer
…
Aggregate attributes from multiple sources
Filter to obtain a segmentation
Analyze to obtain “Similar Populations”
Adding more input data gives better predictive power
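A small sketch of the aggregate-then-filter idea, using the two example rows of the profile table. The merge rule (first known value wins) and the function names are assumptions for illustration only.

```python
def aggregate(profiles):
    """Merge per-source attribute dicts; the first known (non-None) value wins."""
    merged = {}
    for profile in profiles:
        for key, value in profile.items():
            if value is not None and merged.get(key) is None:
                merged[key] = value
    return merged

# Each customer is described by partial attributes from several hypothetical sources.
customers = {
    "c1": aggregate([
        {"city": "Houston", "gender": "Female"},
        {"age_range": "30-39", "occupation": "Journalist"},
    ]),
    "c2": aggregate([
        {"city": "San Jose", "age_range": None},
        {"gender": "Male", "occupation": "Software Engineer"},
    ]),
}

def segment(customers, **criteria):
    """Filter: return the ids of customers whose attributes match every criterion."""
    return sorted(
        cid for cid, attrs in customers.items()
        if all(attrs.get(k) == v for k, v in criteria.items())
    )

print(segment(customers, gender="Female"))  # ['c1']
```

Unknown attributes stay absent, so a customer with a missing value never matches a criterion on that attribute.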
55
Attribute fusion example: Inferring location from multiple clues
Metadata 
Name: Tracy Guida
Screen name: @tracyguida
Location: Tampa
Description: just a Nor-Cal gal trying to fall in love with Florida

Social Media Profile

Screen name : @tracyguida
Location:
Tampa, FL
Name:
Tracy Guida

Disambiguation, fusion of
partial information

Permanent
location

Fusion libraries:
• Confidence:
metadata vs. content

Messages 
Gotta love Florida football #hot #humid
http://instagr.am/p/QOHPqhKdYt/
Check out my blog about #food in #TampaBay
http://www.myothercitybythebay.com

Textual clues

Temporary location

I'm at Tracy's Seat At Micah's (Tampa, FL)
http://4sq.com/SZ4yjj
I'm at S.o.G (Tampa, Florida)
http://4sq.com/UDweM5

Check-ins

I'm at Eats American Grill (Tampa, FL)
http://4sq.com/O1a1Jm

Who's watching the #presidential #debate tonight?
(from 27.97989014,-82.54825406)

Fusion libraries:
• Confidence: place mentions vs. geo-codes
• Analysis of location time-series

Geo-located documents
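One simple way to fuse such clues is to give each clue type a confidence and add up the evidence per candidate place. The sketch below assumes numeric confidences (the slide gives none) and assumes the latitude/longitude pair has already been resolved to Tampa, FL by some geocoder.

```python
from collections import defaultdict

# Assumed per-clue-type confidences; the slide does not give numeric values.
CONFIDENCE = {"profile": 0.6, "text": 0.3, "checkin": 0.8, "geocode": 0.9}

# (clue type, place) pairs modeled loosely on the slide's example.
clues = [
    ("profile", "Tampa, FL"),
    ("text", "Northern California"),  # "Nor-Cal gal" in the description
    ("text", "Tampa, FL"),            # "#TampaBay" in a message
    ("checkin", "Tampa, FL"),
    ("checkin", "Tampa, FL"),
    ("geocode", "Tampa, FL"),         # assumed result of resolving the coordinates
]

def fuse_location(clues, confidence):
    """Sum confidence per candidate place and normalize the totals to sum to 1."""
    totals = defaultdict(float)
    for kind, place in clues:
        totals[place] += confidence[kind]
    norm = sum(totals.values())
    return {place: score / norm for place, score in totals.items()}

scores = fuse_location(clues, CONFIDENCE)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Separating permanent from temporary location would additionally need the timestamps of the check-ins and geo-coded messages, which this sketch ignores.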

56
The Reliability (Veracity) Challenge
$\Theta = \{\theta_1, \dots, \theta_N\}$ - a set of hypotheses (frame of discernment, universe of discourse)
$\{x_{ni}\}$ - probability, possibility, or belief in hypothesis $\theta_n$ of source $i$
$\{O_i\}$ - input data (social media, enterprise information)
$F(x_1, \dots, x_I)$ - fusion operator

[Figure: the environment supplies inputs $\{O_1\}, \dots, \{O_I\}$ to Source 1 through Source I (each with a source belief model and source characteristics); their outputs $\{x_1\}, \dots, \{x_I\}$ feed the fusion operator $F(x_1, \dots, x_I)$]

57
Typical Reliability Settings
It is possible to assign a numerical degree of reliability to each source
A subset of sources is reliable but we do not know which one
Reliabilities of the sources can be ordered but no precise reliability values
are known

Reliability depends on context too
During the Mumbai Mantralaya fire, only a few tens of tweets about the event appeared on
Twitter
On the same day there was a match, and several thousand tweets said “Miami
on Fire”

58
Strategies for Utilizing Reliability
Strategies explicitly utilizing reliability of sources
Reliability is used to modify the beliefs of each model before fusion, and
the transformed beliefs are then fused (separable case)
Strategies for modifying the fusion process to account for the
reliability of the sources (non-separable case)

Strategies identifying reliability of data input to fusion processes
and eliminating the sources of poor reliability
Combination of strategies mentioned above
$F(x_1, \dots, x_I) \rightarrow F_R(x_1, \dots, x_I)$
$F_R$ is a context-dependent operator, which depends on the
strategy selected and is defined within the framework used for
uncertainty representation
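The separable case can be illustrated with a short sketch: each belief is first transformed by its source's reliability and the ordinary fusion operator is then applied. The linear discounting toward a uniform distribution used here is one common choice and is an assumption, not something the slide prescribes.

```python
def discount(belief, reliability):
    """Shrink a probability vector toward uniform: r=1 keeps it, r=0 yields uniform."""
    n = len(belief)
    return [reliability * p + (1.0 - reliability) / n for p in belief]

def product_fusion(beliefs):
    """Conjunctive fusion F: multiply element-wise, then renormalize."""
    fused = [1.0] * len(beliefs[0])
    for b in beliefs:
        fused = [f * p for f, p in zip(fused, b)]
    total = sum(fused)
    return [f / total for f in fused]

def fuse_with_reliability(beliefs, reliabilities):
    """F_R in the separable case: transform each belief first, then apply F."""
    return product_fusion([discount(b, r) for b, r in zip(beliefs, reliabilities)])

x1 = [0.9, 0.1]  # source 1 strongly favors hypothesis 1
x2 = [0.2, 0.8]  # source 2 favors hypothesis 2
both_trusted = fuse_with_reliability([x1, x2], [1.0, 1.0])
second_ignored = fuse_with_reliability([x1, x2], [1.0, 0.0])
print(both_trusted, second_ignored)
```

With reliability 0 the second source becomes uniform and drops out of the product, so the result equals the first source's belief.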

59
Reliability Coefficients
Reliability coefficients represent trust in each belief model. They
introduce the second level of uncertainty and represent a measure of
the adequacy of the model used, the reality of the environment, and
source characteristics
$R_i = R_i(M_i, \gamma, \Upsilon)$ - reliability of source $i$ (reliability of source $i$ and hypothesis $j$: $R_{ij}$)
$M_i$ - model of source $i$
$\gamma$ - parameters characterizing the external environment (context)
$\Upsilon$ - parameters characterizing the internal environment of source $i$ (tuning parameters)
Relative reliability: $\sum_{i=1}^{I} R_i = 1$
May be replaced with $\max_i R_i = 1$
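The two normalizations differ only in the scaling constant. A minimal sketch, with hypothetical raw trust scores:

```python
def normalize_sum(raw):
    """Relative reliability: scale so that the coefficients sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]

def normalize_max(raw):
    """Alternative: scale so that the most reliable source has coefficient 1."""
    top = max(raw)
    return [r / top for r in raw]

raw = [2.0, 1.0, 1.0]  # hypothetical unnormalized trust scores
print(normalize_sum(raw))  # [0.5, 0.25, 0.25]
print(normalize_max(raw))  # [1.0, 0.5, 0.5]
```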
60
Bayesian Fusion
In the Bayesian framework the degrees of belief are
represented by a priori, conditional, and a posteriori
probabilities.
Usually, decisions are made on a posteriori probabilities $P(\theta_n \mid y_i)$,
where $y_i$ is the input coming from source $i$, and
$x_i = P(\theta_n \mid y_i)$ represents statistics of each source to be combined
(data, outputs of classifiers).

Fusion is performed by the Bayesian rule, which under the
condition of source independence is reduced to a product:
$F_n(x_1, \dots, x_I)|_{y} = F_n(P)|_{y_i} = P(\theta_n) \prod_{i} [P(\theta_n \mid y_i) / P(\theta_n)]$ for each $n$

This fusion operator is conjunctive and assumes total reliability
of the sources
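A minimal sketch of the product rule in Python. It assumes a strictly positive prior and adds an explicit renormalization over the hypotheses, which the slide's formula leaves implicit; the numbers are made up.

```python
def bayes_product_fusion(prior, posteriors):
    """P(theta_n) * prod_i [P(theta_n | y_i) / P(theta_n)], renormalized over n."""
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]

prior = [0.5, 0.5]                     # P(theta_1), P(theta_2)
posteriors = [[0.8, 0.2], [0.7, 0.3]]  # P(theta_n | y_i) for two sources
fused = bayes_product_fusion(prior, posteriors)
print([round(p, 4) for p in fused])
```

Two sources that each lean toward the first hypothesis reinforce one another, so the fused belief in it is higher than either source's own. That is the conjunctive behaviour, and it is also why an unreliable source can do a lot of damage here.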
61
Weighted Average
If the sources are not totally reliable, several fusion rules within
the framework of probability theory have been proposed in
the literature
A majority of the weighted average methods are based on
consensus theory, which involves general procedures for
combining single-source probability distributions, while decisions
are based on Bayesian decision theory
$F_n(x_1, \dots, x_I, R_1, \dots, R_I)|_{y_i} = F_n(P, R)|_{y_i} = \sum_{i} P(\theta_n \mid y_i)\, R_i$
where $R_i$ is the reliability associated with the sources in the global
membership function, expressing quantitatively the goodness of
each source
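The weighted average can be sketched as follows. The reliabilities are normalized to sum to 1 inside the function so that the result is again a probability vector; that normalization and the example numbers are assumptions.

```python
def weighted_average_fusion(posteriors, reliabilities):
    """sum_i R_i * P(theta_n | y_i), with the R_i normalized to sum to 1."""
    total_r = sum(reliabilities)
    weights = [r / total_r for r in reliabilities]
    n_hyp = len(posteriors[0])
    return [
        sum(w * post[n] for w, post in zip(weights, posteriors))
        for n in range(n_hyp)
    ]

posteriors = [[0.8, 0.2], [0.4, 0.6]]  # P(theta_n | y_i) for two sources
fused = weighted_average_fusion(posteriors, [3.0, 1.0])
print([round(p, 4) for p in fused])
```

Unlike the product rule, the average never moves outside the range spanned by the individual sources, so one bad source can shift the result but cannot dominate it.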

62
Incorporation of Contextual
Information
This method integrates contextual information
The method is based on the fact that, in a given context,
only a subset J of a set N of all sources to be combined is
valid or reliable (i.e. their belief model adequately represents
reality)
$F_n(x_1, \dots, x_I, R_1, \dots, R_I)|_{y} = \sum_{J} P(\theta_n \mid y_1, \dots, y_I, A_J)\, P(A_J)$
where $P(A_J)$ is the probability of validity of the subset J of
inputs. This probability is calculated from the reliability
$R_i$ of the individual inputs
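A sketch of the subset-based rule under two stated assumptions that the slide does not spell out: sources are valid independently of each other with probability $R_i$, so $P(A_J) = \prod_{i \in J} R_i \prod_{i \notin J} (1 - R_i)$, and given $A_J$ only the sources in J are fused, using the Bayesian product rule.

```python
from itertools import combinations

def product_fusion(prior, posteriors):
    """Bayesian product rule over the given sources; returns the prior if none."""
    fused = list(prior)
    for post in posteriors:
        fused = [f * p / pr for f, p, pr in zip(fused, post, prior)]
    total = sum(fused)
    return [f / total for f in fused]

def subset_probability(subset, reliabilities):
    """P(A_J), assuming sources are valid independently with probability R_i."""
    prob = 1.0
    for i, r in enumerate(reliabilities):
        prob *= r if i in subset else (1.0 - r)
    return prob

def contextual_fusion(prior, posteriors, reliabilities):
    """Sum over all subsets J of P(theta | sources in J) * P(A_J)."""
    n_src = len(posteriors)
    fused = [0.0] * len(prior)
    for size in range(n_src + 1):
        for subset in combinations(range(n_src), size):
            weight = subset_probability(subset, reliabilities)
            cond = product_fusion(prior, [posteriors[i] for i in subset])
            fused = [f + weight * c for f, c in zip(fused, cond)]
    return fused

prior = [0.5, 0.5]
posteriors = [[0.9, 0.1], [0.2, 0.8]]
print(contextual_fusion(prior, posteriors, [1.0, 0.0]))
print(contextual_fusion(prior, posteriors, [0.8, 0.5]))
```

The loop enumerates all $2^I$ subsets, so this form is only practical for a handful of sources.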

63
Biographical and Biometric fusion for Person Identification
Many modern data repositories record both biographical and biometric
information
Motor Vehicle Licensing Authority, Passport, Identity cards, etc.
Unique Identification number (www.uidai.gov.in)

Fusing information from multiple sources brings value in
Data integration: Creating single view of citizen, person, customer
Identification of the person using Biometric information and biographical
information

Scaling person identification for large number of customer records
– Biographical data is abundant, easy to match, scales to millions of records but can be noisy and uncertain.
– Biometric data is noise free and gives high precision for identification but does not scale to large number of records
– Both streams contain complementary information, which can be exploited by fusing them together

Fusion for Person Identification can be done at two levels
– Decision fusion: Each matcher provides a decision; the decisions are then fused to produce the final decision.
– Score fusion: Each matcher provides a score, which is used to produce a fused score for decision making.
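The difference between the two levels shows up in a small example. The fusion rules used here (majority vote for decisions, weighted mean for scores) and the scores themselves are assumptions for illustration; the slides do not fix a particular rule at this point.

```python
def decision_fusion(decisions):
    """Majority vote over per-matcher accept/reject decisions (ties reject)."""
    return sum(decisions) > len(decisions) / 2

def score_fusion(scores, weights, threshold):
    """Weighted mean of per-matcher scores in [0, 1], thresholded once at the end."""
    fused = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return fused, fused >= threshold

# Hypothetical scores from two biometric and two biographical matchers.
scores = [0.95, 0.90, 0.40, 0.45]

# Decision level: each matcher thresholds its own score before fusion.
decisions = [s >= 0.5 for s in scores]
vote = decision_fusion(decisions)

# Score level: the raw scores are combined first and thresholded afterwards.
fused_score, accepted = score_fusion(scores, [1, 1, 1, 1], 0.5)
print(vote, accepted)
```

Here the vote is a 2-2 tie and rejects, while score fusion accepts because the strong biometric scores outweigh the weak biographical ones. Early thresholding discards exactly that margin information.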
64
Score Fusion using Biometric and Biographical matcher
Consider M matchers operating on a database containing
N records which have both biographical and biometric
information.

For query q, if all the records are equally likely for the identifier, then
the posterior of the score given records is given by

There can be multiple biometric as well as biographical
matchers
Each query q will generate N x M scores i.e. M dimensional
scores for N records

Genuine match
score density

We model the scores as being generated from a
probability distribution.
Score is fused using a joint distribution from different sources

The probability distribution under reasonable assumption
is the posterior distribution of scores given a query

The genuine and imposter match scores are assumed to be
identically distributed

The posterior distribution is modeled as a Gaussian mixture
model.

The model is built for both genuine match distribution
and imposter distribution

The query is assigned an identity of n0 only if

Models are learnt from training data.

The algorithm is
which simplifies to

65

Imposter match
score density
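The decision rule itself did not survive in this transcript, so the following is only a sketch of one standard way to use genuine and imposter mixture densities: compute a likelihood ratio for each of the N records and accept the best record if its ratio exceeds a threshold. The mixture parameters, the diagonal covariances, and the threshold are hypothetical; in the setup described they would be learnt from training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at the score vector x."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)

def gmm_pdf(x, components):
    """Mixture density; components is a list of (weight, mean, var) triples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

# Hypothetical hand-set mixtures for M = 2 matchers.
GENUINE = [(0.7, [0.9, 0.8], [0.01, 0.02]), (0.3, [0.7, 0.6], [0.02, 0.03])]
IMPOSTER = [(0.6, [0.2, 0.3], [0.02, 0.03]), (0.4, [0.4, 0.4], [0.03, 0.03])]

def identify(score_matrix, threshold=1.0):
    """Return the index of the best record, or None if no ratio beats the threshold."""
    ratios = [gmm_pdf(s, GENUINE) / gmm_pdf(s, IMPOSTER) for s in score_matrix]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best if ratios[best] > threshold else None

# One query against N = 3 records, each with an M-dimensional score vector.
scores = [[0.25, 0.30], [0.88, 0.79], [0.35, 0.45]]
print(identify(scores))
```

Only the second record has scores that look genuine on both matchers, so it is the one selected; a query whose scores all resemble imposter scores returns `None`.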
Results
DataSets
Biometrics: NIST Dataset consisting of match scores of right
and left index finger
Biographical : Electoral records of citizens in an emerging
economy
Consists of names and addresses

A total of 6000 people were associated with the biometrics and
the biographical data.
Here M = 4, 2: Biometric, 2: Biographical (Name & Address)

Experimental Setup

Accuracy for different modalities

Half of the dataset was used for training: the probability
densities for both the imposter and genuine match score
distributions were estimated
The number of Gaussian components was 5
The remaining records were used for testing.

Experimental Results
Score is fused using a joint distribution from these four
different sources
The name modality has the lowest accuracy, whereas the
biometric modality has high accuracy
The fused accuracy is much higher than that of the individual
modalities
The accuracy increases when all the modalities are combined,
thus validating the usefulness of fusion

66

Identification accuracy for fusion of modalities
Social listening for monitoring the Philippine general elections 2013
• Online and offline analysis of social media messages around election debates and 
election chatter for ABS‐CBN TV Channel
• Analysis of English and Filipino chatter to determine buzz and reaction on candidates, 
campaigns, parties, topics and events
• Analysis of over 6 million election related Twitter and Facebook posts
• Comparison with Pulse Asia Election Survey
Real time and offline monitoring of social
media conversations about parties and
candidates

Positive and negative sentiments for candidates
[Line chart: POE, GRACE, Mar 08 to Mar 14, y-axis 0% to 50%, Mar 13 highlighted]

Grace Poe released her TV ad which drew flak
from viewers. This was also the time that 3
candidates (Legarda, Poe, Escudero) of the
Liberal Party who were also "guest" candidates
of UNA were dropped by UNA as the President
forbade them to attend UNA's soirees. Escudero
felt really, really bad about being dropped by
UNA (led by former president Estrada). Grace
Poe offered to mediate between Escudero and
Estrada.

ZUBIRI, MIGZ (UNA)
VILLAR,CYNTHIA HANEPBUHAY (NP)
VILLANUEVA, BRO.EDDIE (BP)
TRILLANES, ANTONIO IV (NP)
SEÑERES, CHRISTIAN (DPP)
POE, GRACE
PENSON, RICARDO
MAGSAYSAY, RAMON JR. (LP)
MAGSAYSAY, MITOS (UNA)
MADRIGAL, JAMBY (LP)
MACEDA, MANONG ERNIE (UNA)
LLASOS, MARWIL (KPTRAN)
LEGARDA, LOREN (NPC)
HONTIVEROS, RISA (AKBAYAN)
HONASAN, GRINGO (UNA)
HAGEDORN, ED
FALCONE, BAL (DPP)
ESCUDERO, CHIZ
ENRILE, JUAN PONCE JR.(NPC)
EJERCITO ESTRADA, JV (UNA)
DELOS REYES,JC (KPTRAN)
DAVID, LITO (KPTRAN)
COJUANGCO, TINGTING (UNA)
CAYETANO, ALAN PETER (NP)
CASIÑO, TEDDY
BINAY, NANCY (UNA)
BELGICA, GRECO (DPP)
AQUINO, BENIGNO BAM (LP)
ANGARA, EDGARDO (LDP)
ALCANTARA, SAMSON (SJS)

[Bar chart over the candidates listed above, x-axis 0.00% to 100.00%]

67
Worldwide leads: Intent to buy, relocation to India
Venu Nair: Male, Atlanta,
USA: Looking for good
investment in Indian real
estate market
Kiran Singh: Female, IT
professional, Gurgaon:
Any good 2 BR in Sohna
Road?
[Map: lead counts by region (5, 12, 14, 2)]

Data from Dec 3-4, 2013

68
Sample leads
Name          | Sex | Location       | Profession  | Interest    | Where
Param_star    | M   | India          | Media       | 2BHK        | India
Kiran Singh   | F   | Gurgaon, India | IT          | 2BHK        | Sohna Road, Gurgaon
Venu Nair     | M   | Atlanta, US    |             | Apartment   | India
Muhammad Faiz | M   | Singapore      | IT          | 2 and 3 BHK | Noida, India
Hooker India  | -   | Bangalore      | Real Estate | Apartment   | Bangalore, India

69
Crowd sensing
• The “power of the crowd” – a lot of information in a timely manner from everywhere
• People already use the social media to share public safety and law enforcement information
• Gain deep situational awareness
• Enable proactive actions by augmenting traditional law enforcement methods

[Figure labels: Emergencies, call for help; Police Monitoring; Limited coverage; Crowd sensors; Analytics and fusion in near-real-time; Rich events & KPIs]

70
Drinking in the Open
Come to South City 2, in evening, its a regular scene there since last 4
years, people drink in open and food is served by restaurants in their
cars
khandsa road per sunrise hospital se aage tekho ke pass rehari waale
sharab pilaate hai, jinki wajah se waha aane jaane wale log pareshaan
ho rahe hai even shaam ko to PCR ka bhi unhe darr ni hai kirpa
karke inhe waha se hataiya Gurgaon Police
[English: On Khandsa road, past Sunrise hospital, near the liquor shops, cart vendors serve alcohol, which troubles people passing by; in the evening they are not even afraid of the PCR. Please remove them from there, Gurgaon Police.]
I also have a complaint to register. We have an alcohal drinking
menace in front of our commercial complex anand ganga comlex at
sohna chowk, on the main road.
Police Harassment
These two Constables (Davinder Singh & his Colleague) were at their
worst behaviour...when they found all documents ok in the Car. I
couldn't understand the reason for harrasment...opp
Wrong Parking
this is the main way from sadar bazar to bhuteshwar mandir. I dnt
think y this road exist. It is the best place to park vehicles both the way
are used to park vehicles no action have been taken from years. I
think HUDA or MCG is not serious abt matter.

71
Event detection and mapping

72
Conclusions
Noise is an unavoidable fact of real-life communication
Communication meant for human consumption can be
noisy for computers and vice versa

Due to ubiquitous sensors (GPS, accelerometer), easy-to-use
apps (Facebook, Twitter, YouTube), and higher internet
connectivity, the key characteristics of raw data are changing.
This new data can be characterized by the 4Vs: Volume,
Velocity, Variety and Veracity

For example, during a football match, some people will tweet
about goals, penalties, etc., while in addition there may be other
reports in news channels. The data describes the same event
Fusion should create a single object representation

Different sources may have different reliability, and it is
necessary to account for this fact to avoid degrading the
performance of fusion results

Reliability and context should be taken into account during
fusion
73
Conclusions
Noise can be defined as any kind of difference in the surface
form of an electronic text from the intended, correct or original
text
Noise can be in the form of errors arising from uncertainty in
language and communication and recognition errors

74
lvsubram@in.ibm.com

THANK YOU! ☺

75

 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
 
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort ServiceEluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
 
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabiunwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
 
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and pains
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 

Big data veracity challenges

  • 1. Big Data and Veracity Challenges. Text Mining Workshop, ISI Kolkata. L. Venkata Subramaniam, IBM Research India. Jan 8, 2014.
  • 2. The Four Dimensions of Big Data. Volume (Data at Rest): terabytes to exabytes of existing data to process. Velocity (Data in Motion): streaming data, milliseconds to seconds to respond. Variety (Data in Many Forms): structured, unstructured, text, multimedia. Veracity* (Data in Doubt): uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations. * Truthfulness, accuracy or precision, correctness.
  • 3. We've Moved into a New Era of Computing! In order to realize new opportunities, you need to think beyond traditional sources of data. The term Big Data applies to information that can't be processed or analyzed using traditional processes or tools. Volume, Velocity, Variety, Veracity: 12+ terabytes of Tweets created daily; 100's of different types of data; 5+ million trade events per second; only 1 in 3 decision makers trust their information. Data sources: Transactional & Application Data, Machine Data, Social Data, Enterprise Content. Characteristics listed across these sources: volume, velocity, variety; structured, semi-structured, highly unstructured; ingestion, veracity, throughput.
  • 4. Volume is growing, so are Veracity issues. By 2015, 80% of all available data will be uncertain. By 2015 the number of networked devices will be double the entire global population. All sensor data has uncertainty. The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content. Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data. [Chart: global data volume in exabytes (0 to 9000) and aggregate uncertainty % (0 to 100), 2005 to 2015. Multiple sources: IDC, Cisco.]
  • 5. What is Big Data? Big Data applies to information that can't be processed or analyzed using traditional processes or tools. [Diagram: data volume, velocity and variety plotted against uncertainty (1/veracity), ranging from "Traditional Data & Processing" (precise, authoritative, well formed) to "Data Uncertainty at Scale" (inconsistent, imprecise, uncertain, unverified, spontaneous, ambiguous, deceptive).] Examples placed on the diagram: Telco Profiles; Call Detail Records; Market Trends; Smart Grid; Smarter Cities; Weather Modeling; Sensor Data; Smarter Traffic; Smarter Water; Portfolio Risk; Market Feeds; Credit Card Transactions; Medical Transcription; Electronic Data Interchange; CRM Customer Records; Text, Audio, Video; Contact Centers; Retail Fraud; SWIFT; Account Management; Homeland Security; Disease Progression; Patient Records; Predictive Modeling of Outcomes; Social Network Data Services.
  • 6. Social media users in India. [Bar chart: India, no. of users in million (axis 0 to 50), for Facebook, Twitter, Linkedin, Youtube, Google Plus.] Table of platforms (Facebook, Twitter, Linkedin) with user counts of 45 Million, 15 Million and 15 Million.
  • 7. Veracity issues arise due to: Process Uncertainty (processes contain "randomness"): uncertain travel times; semiconductor yield. Data Uncertainty (data input is uncertain): text entry (intended spelling vs. actual spelling); GPS uncertainty; testimony; ambiguity ({Paris Airport}); rumors; contaminated?; conflicting data ({John Smith, Dallas} vs. {John Smith, Kansas}). Model Uncertainty (all modeling is approximate): fitting a curve to data; forecasting a hurricane (www.noaa.gov).
  • 9. Big Data, Fast Data, Noisy Data. Social media communication is meant for friends: noisy, informal, implicit and contextual conversations; large dimensional, uncertain, unverified. Examples: "I'll see ya tomo"; "RIP Jackson"; "I'm lookie out 4 a car 2 burn rubber on the streets of LA"; "What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!". Uses: lead generation, disaster tracking. Word error rate (WER) by type of text: SMS (texting) 50%; Tweets 35%; ASR 30%; Web queries 15%; OCR 5%; Newswire text (WSJ, Reuters, NYT) 0.005%. Social text is thus up to 10,000 times more noisy than newswire. Scale: 30% of the world population is on the internet and increasing fast; 55 million Tweets per day; there are more social networking accounts than people in the world; social networking overtakes search, with Facebook becoming the most visited website ahead of Google; more video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been airing 24/7 since 1948.
  • 10. SMS. Example message with its normalization, token by token: 0 there → there; 1 aint → are not; 2 no → no; 3 doubt → doubt; 4 there → there; 5 hon → honey; 6 im → I am; 7 gonna → going; 8 be → be; 9 takin → taking; 10 it → it; 11 4 → for; 12 life → life; 13 u → You; 14 wont → wont; 15 b → be; 16 rida → rid of; 17 me → me; 18 lol → laugh out loud; 19 Ray → (NAME). Texting Language: over 50% of the words are written in non-standard ways (52% of words were non-standard in 101 SMSes; Contractor et al., 2010). Spontaneous Language: use of slang, ungrammatical, no punctuations, no case information. Mixing of Languages: many SMS contain text in a mix of two or more languages. Type of noise (%): Deletion of Characters 48%; Phonetic Substitution 33%; Abbreviations 5%; Dialectical Usage 4%; Deletion of Words 1.2%.
  • 11. Speech Recognition. Recognition Errors: 10-40% word error rates. Spontaneous Language: use of slang, use of fillers like um and ah, ungrammatical, false starts, no punctuations, no case information. Mixing of Languages: contain words from two or more languages. Example recognizer output: SPEAKER 1: windows thanks for calling and you can learn yes i don't mind it so then i went to SPEAKER 2: well and ok bring the machine front end loaded with a standard um and that's um it's a desktop machine and i did that everything was working wonderfully um I went ahead connected into my my network um so i i changed my network settings to um to my home network so i i can you know it's showing me for my workroom um and then it is said it had to reboot in order for changes to take effect so i rebooted and now it's asking me for a password which i never i never said anything up SPEAKER 1: ok just press the escape key i can doesn't do anything can you pull up so that i mean
  • 12. Historical Text (Baron et al., 2009). Non Standard Spellings: no notion of the importance of having a single spelling for each word; letters would be added or removed to ease line justification. New words: new words, words that are variants of present vocabulary words. Different Language Style: different grammar, language model. OCR: character substitution errors, missed punctuations.
  • 13. Emails, Blogs, Tweets, Online Chat, … Chat Logs: [12:51:13 PM] Geetha: alrite [12:52:01 PM] Richa: id has valid pw not expired [12:52:49 PM] Geetha: can't get to theh site [12:53:04 PM] Richa: network connection may be slow [12:54:39 PM] Geetha: ok Im able to now [12:54:53 PM] Richa: should I reset the password
  • 14. What is Noisy Text? Any kind of difference in the surface form of an electronic text from the intended, correct or original text (Knoblock et al., 2007). Noise can be at the lexical level {b4, before, befour}, resulting in substitution, insertion, deletion, transposition, run-on, and split. Noise can be at the morphological, syntactic, or discourse level {I can hear u, I can hear you, I can here you}, resulting in substitution, insertion, deletion, transposition of words and the introduction of out-of-vocabulary words.
  • 15. Classifying Noise. Lexical Errors (Subramaniam et al., 2009): missing characters {before > bef}; extra characters {raster > raaster}; phonetic substitution {before > b4, late > l8}; abbreviations {laugh out loud > lol, United Nations > UN}. Syntactical Errors (Kukich, 1992; Foster et al., 2007): missing word {What are the subjects? > What the subjects?}; extra word {Was that in the summer? > Was that in the summer it?}; real word spelling errors {She could not comprehend. > She could no comprehend.}; agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}; dialectical usage {I'm going to be there > I'm gonna be there}.
  • 16. Techniques for Automatically Detecting Lexical Errors (Kukich, 1992). Nonword error detection: efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon. Two approaches. N-gram: look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings. Good for identifying errors made by OCR devices. But unusual/foreign language valid words will be marked and nice-looking mistakes will be marked valid. Dictionary based: does the input string appear in a dictionary? If not, the string is flagged as a misspelled word. But nearly two-thirds of the words in a dictionary did not appear in an eight million word corpus of New York Times text and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study).
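The two nonword-detection approaches on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON` is made up, character trigrams are used as the n-grams, and the function names are hypothetical rather than taken from Kukich (1992).

```python
from collections import Counter

# Toy lexicon (an assumption for illustration; a real system would load a word list).
LEXICON = {"before", "form", "from", "late", "raster", "summer", "the"}

def char_ngrams(word, n=3):
    """Character n-grams of a word, e.g. 'before' -> bef, efo, for, ore."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_table(lexicon, n=3):
    """Precompile a table of n-gram frequencies from the lexicon."""
    table = Counter()
    for word in lexicon:
        table.update(char_ngrams(word.lower(), n))
    return table

def suspicious_ngrams(word, table, n=3, min_count=1):
    """N-gram approach: return the n-grams of `word` that are absent or too rare."""
    return [g for g in char_ngrams(word.lower(), n) if table[g] < min_count]

def is_nonword(word, lexicon):
    """Dictionary approach: flag any string that is not in the lexicon."""
    return word.lower() not in lexicon

TABLE = build_ngram_table(LEXICON)
```

With this toy data, `suspicious_ngrams("befshj", TABLE)` flags the unseen trigrams, while `suspicious_ngrams("fore", TABLE)` returns an empty list even though `is_nonword("fore", LEXICON)` is true: a nice-looking mistake passes the n-gram check. Neither check flags `form` typed for `from`, which is the real-word error case.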
  • 17. Techniques for Automatically Detecting Incorrect (Syntax) Grammar (Foster et al., 2007). Efficient methods to detect word sequences that do not form a grammatical sentence. Three approaches. N-gram: classifies a sentence as ungrammatical if it contains an unusual part-of-speech sequence. Precision-grammar: classifies a sentence using a parser and a broad-coverage hand-written grammar. Probabilistic-parsing: finds sentences with parsing error.
  • 18. Quantifying Noise (Subramaniam et al., 2009). Quantifying Lexical Errors {Before, b4, befour, befor, bfore}. Edit Distance: good for measuring surface-level deviation from the original. Perplexity: good for measuring deviation from underlying language structure at character level. Quantifying Semantic Errors {I came to LA yesterday. I am still jet lagged., Came la yester day still jetlagged, Came 2 LA ystrday stil jetl8d}. WER: good for measuring real word errors (speech recognition errors). Perplexity: good for measuring deviation from “proper”. BLEU: good for comparing a candidate translation against multiple reference translations.
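As a rough illustration of two of the measures named on this slide, the sketch below computes Levenshtein edit distance over characters and word error rate (WER) as the same distance over word tokens divided by the reference length. The function names are illustrative, and lower-casing before computing WER is an assumption made here for simplicity.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return edit_distance(ref, hyp) / len(ref)
```

For the lexical variants on the slide, `edit_distance("before", "befor")` is 1, `edit_distance("before", "befour")` is 2 and `edit_distance("before", "b4")` is 5, so the phonetic substitution sits furthest from the original at the surface level. `word_error_rate("I can hear you", "I can here you")` is 0.25.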
  • 19. Spelling Correction (Kukich, 1992). Isolated Word Correction: minimum edit distance techniques; similarity key techniques; probabilistic techniques; n-gram-based techniques; rule-based techniques. Will not catch typos resulting in correctly spelled words {form, from}. Estimates put real word errors at 30% of all word errors. Context-Dependent Word Correction: parsing; language models. Can errors be ignored and still meaningful interpretation be done? {I am coming with you, I comes with you}
  • 20. SMS Text Normalization. "dis is n eg 4 txtin lang" → "This is an example for Texting language". Extreme corruption of words and sentences. Models for SMS language are lacking. Tomorrow never dies!!! Occurrences in a 1000 SMS corpus: tomoz (25), tomorrow (24), tomoro (12), 2moro (9), tomrw (5), tomora (4), tomo (3), tomorow (3), tom (2), tomra (2), 2mro (2), tomor (2), tomm (1), morrow (1), tmorro (1), moro (1).
  • 21. Finding Canonical Sets (Acharyya, 2009). Learn mappings: costmer, castumar, kustamar, coustomber → customer. How can we do it in an unsupervised way? Find some invariant that does not change in spite of corruptions. Buckets of context seem invariant! <..Back Bucket..> sceam <..Front Bucket..>, sceam: sms(2) new(5) recharge(4) tel-provider(2) about(3). <..Back Bucket..> scheme <..Front Bucket..>, scheme: sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1).
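One way to make the "buckets of context are invariant" idea concrete is to compare the context-word counts of two surface forms. The sketch below uses the counts from the slide for "sceam" and "scheme" and merges back and front buckets into one bag; cosine similarity is a stand-in chosen for illustration and is not claimed to be the measure used in Acharyya (2009). The "bill" bucket is made-up contrast data.

```python
import math

# Context counts from the slide (back and front buckets merged into one bag).
CONTEXT = {
    "sceam":  {"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3},
    "scheme": {"sms": 4, "new": 2, "activate": 3, "tel-provider": 2,
               "about": 1, "recharge": 1},
    # Hypothetical unrelated word, added only for contrast.
    "bill":   {"pay": 4, "due": 3, "sms": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

def closest_form(word, context=CONTEXT):
    """Return the other word whose context bucket is most similar to `word`'s."""
    others = [w for w in context if w != word]
    return max(others, key=lambda w: cosine(context[word], context[w]))
```

Here `closest_form("sceam")` returns `"scheme"`: the noisy form and its canonical spelling share most of their context even though their surface strings differ.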
  • 22. SMS Based FAQ Retrieval (Kothari et al., 2009). SMS question: "how 2 actvate romng on me hanset". FAQ database: Q: How do I activate Roaming? A: Dial *567*2# from your handset. Q: What are the rates for roaming within India? A: Roaming rates on prepaid connections are 60 Paise per minute. SMS answer: Dial *567*2# from your handset. Goal is to find the question Q* that best matches the SMS S. A scoring function Score(Q) assigns a score to each question Q in the FAQ dataset. The score measures how closely the question matches the SMS string S.
  • 23. FAQ Retrieval Problem Formulation. The SMS is treated as a sequence of tokens S = s1, s2, …, sn. Let Θ denote the questions in the FAQ corpus, where each question Q ∈ Θ is treated as a set of tokens. Goal is to find the question Q* that best matches the SMS S.
  • 24. Method. For each token si, a list Li consisting of all terms from the dictionary that are variants of si is constructed. Variants are sorted in descending order of their weight. This space is searched to find the closest matching FAQ question.
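The formulation in slides 22 to 24 can be sketched as follows, using the two FAQ entries from slide 22. Several choices here are assumptions made only for illustration: the weight of a variant is `difflib.SequenceMatcher.ratio()` rather than the weighting of Kothari et al. (2009), the 0.7 threshold is arbitrary, and Score(Q) is taken to be the sum, over SMS tokens, of the best-weighted variant that occurs in Q.

```python
from difflib import SequenceMatcher

# FAQ entries from the slide: question -> answer.
FAQ = {
    "How do I activate Roaming": "Dial *567*2# from your handset",
    "What are the rates for roaming within India":
        "Roaming rates on prepaid connections are 60 Paise per minute",
}

def tokens(text):
    return text.lower().split()

def weight(term, sms_token):
    """Stand-in similarity weight in [0, 1] (an assumption, not the paper's)."""
    return SequenceMatcher(None, term, sms_token).ratio()

def variant_list(sms_token, dictionary, threshold=0.7):
    """List L_i: dictionary terms that are variants of s_i, best weight first."""
    scored = [(weight(term, sms_token), term) for term in dictionary]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)

def score(question, variant_lists):
    """Score(Q): for each SMS token, add the best variant weight found in Q."""
    q_terms = set(tokens(question))
    return sum(max((w for w, term in variants if term in q_terms), default=0.0)
               for variants in variant_lists)

def best_question(sms, faq=FAQ):
    """Return Q*, the FAQ question with the highest score for this SMS."""
    dictionary = {term for question in faq for term in tokens(question)}
    lists = [variant_list(s, dictionary) for s in tokens(sms)]
    return max(faq, key=lambda question: score(question, lists))
```

With the slide's SMS, `best_question("how 2 actvate romng on me hanset")` selects "How do I activate Roaming", and looking that question up in `FAQ` yields the answer shown on the slide.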
  • 25. Extracting Dialog Models (Negi et al., 2009). Huge number of repetitive calls at contact centers. Building task-oriented dialog systems requires task-specific information (concepts, subtasks) and task structure (manual encoding). Using large amounts of human-to-human conversation data: extracting dialogue models using human-to-human conversations.
  • 26. Example Conversation: Car Rental Domain
  • 27. Overview. Pipeline: Transcribed Calls → Utterance Normalization → Normalized Calls → Mining of Subtasks → Subtasks → AIML Conversion → Chat-bot.
  • 28. Finding Patterns with Gaps. Need for patterns capturing variations in expressions: "Have you rented a car from us before"; "Have you rented a car before"; "Have you rented a car from <Rent_Agency> before". Mining regular expression patterns over tokens or entity types. Each pattern is represented as a token sequence, e.g. [rented car before]. Token sequences are mined efficiently using an extension of the apriori algorithm.
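A token-sequence pattern with gaps can be checked against an utterance with a small search like the one below. The `max_gap` parameter and the helper names are assumptions introduced for illustration; the slide only states that patterns are token sequences that tolerate gaps.

```python
def matches_with_gaps(pattern, utterance_tokens, max_gap=3):
    """True if the pattern tokens occur in order, with at most `max_gap`
    other tokens between consecutive pattern tokens."""
    def search(p_idx, start, limit):
        if p_idx == len(pattern):
            return True
        for pos in range(start, min(limit, len(utterance_tokens))):
            if utterance_tokens[pos] == pattern[p_idx]:
                # The next pattern token must appear within max_gap tokens.
                if search(p_idx + 1, pos + 1, pos + 2 + max_gap):
                    return True
        return False
    # The first pattern token may appear anywhere in the utterance.
    return search(0, 0, len(utterance_tokens))

def support(pattern, utterances, max_gap=3):
    """Number of utterances that contain the pattern."""
    return sum(matches_with_gaps(pattern, u.lower().split(), max_gap)
               for u in utterances)

UTTERANCES = [
    "Have you rented a car from us before",
    "Have you rented a car before",
    "Have you rented a car from <Rent_Agency> before",
    "Before you go which car would you like",   # made-up non-matching example
]
PATTERN = ["rented", "car", "before"]
```

`support(PATTERN, UTTERANCES)` counts the three variants from the slide and skips the fourth utterance, which contains "car" and "before" but never "rented".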
  • 29. Association Analysis. The total number of possible itemsets is exponential (2^N): given d items, there are 2^d possible candidate itemsets. A brute-force technique is infeasible. Support filtering is necessary: to eliminate spurious patterns, and to avoid exponential search. Support has the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X). Efficient algorithms have been designed to exhaustively find all itemsets/patterns with sufficiently high support.
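The anti-monotone property is what lets a level-wise search discard candidates early: if an itemset is infrequent, no superset needs to be counted. The following is a minimal apriori sketch over sets of tokens; the transactions are made-up examples and the code covers plain itemsets, not the gapped-sequence extension mentioned on slide 28.

```python
from itertools import combinations

def support(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    while level:
        counted = {c: support(c, transactions) for c in level}
        survivors = {c: s for c, s in counted.items() if s >= min_support}
        frequent.update(survivors)
        keys = list(survivors)
        k = len(keys[0]) + 1 if keys else 0
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Prune step (anti-monotone): every (k-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in survivors
                        for sub in combinations(c, k - 1))]
    return frequent

# Made-up transactions for illustration.
TRANSACTIONS = [
    {"rented", "car", "before"},
    {"rented", "car", "before", "us"},
    {"rented", "car"},
    {"car", "pick"},
]
```

With `min_support=2`, `apriori(TRANSACTIONS, 2)` keeps seven itemsets, including `{rented, car, before}` with support 2, and never counts any superset of the infrequent items "pick" or "us".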
  • 30. Utterance Normalization. Identify concepts: named entity annotation, with a rule-based annotator for annotations such as location, date, car model, and amount, e.g. “I want to pick it up from <location> on <date>”. Grouping of utterances: find patterns with gaps and represent each utterance by them along with unigrams and bi-grams. Agent and customer utterances are clustered separately using an off-the-shelf clustering algorithm.
  • 31. Finding Subtasks and ordering. Customer and agents engage in similar kinds of interactions to accomplish an objective. Represent each call with agent utterance and customer utterance cluster labels. Subtasks: patterns of cluster labels (agents) with possible gaps. Lot of variability in customer utterances. Vertical pattern mining (example label sequence: C1 C1 C2 C3 C3 Cn).
  • 32. Subtask Preconditions. Utterance pre-conditions: customer utterances that indicate the start of a subtask, e.g. “please make this booking” for the “make payment” subtask; frequent features from customer utterances. Flow pre-conditions: only logical orders of subtasks are allowed, e.g. the “make payment” subtask cannot be executed unless the “gather pick-up information” subtask has been executed; collection of all the subtasks that precede the subtask.
  • 34. Data Fusion Problem. Given multiple data points about an entity, create a single object representation while resolving conflicting data values. Difficulties. Null values: subsumption and complementation. Contradictions in data values. Uncertainty & truth: discover the true value and model uncertainty in this process. Metadata: preferences, recency, correctness. Lineage: keep original values and their origin. Implementation in DBMS: SQL, extended SQL, UDFs, etc.
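A minimal sketch of the fusion step, assuming each record carries a `source` name and an `updated` date as metadata: nulls are filled from other records (complementation), contradictions are resolved by recency, and all original values are kept as lineage. The record layout, the recency rule and the sample values are assumptions for illustration, not a method described on the slide.

```python
def fuse(records, fields):
    """Fuse records about one entity into a single representation.

    records: dicts with 'source', 'updated' (ISO date string) and field values,
             where None stands for a null value.
    Returns (fused_record, lineage), where lineage maps each field to the list
    of (source, value) pairs that were observed, most recent first.
    """
    fused, lineage = {}, {}
    for field in fields:
        observed = [(r["updated"], r["source"], r[field])
                    for r in records if r.get(field) is not None]
        observed.sort(reverse=True)            # most recent first (recency metadata)
        fused[field] = observed[0][2] if observed else None
        lineage[field] = [(source, value) for _, source, value in observed]
    return fused, lineage

# Hypothetical records: the conflicting cities echo the {John Smith, Dallas} vs
# {John Smith, Kansas} example from slide 7; dates and phone are made up.
RECORDS = [
    {"source": "crm", "updated": "2013-05-01",
     "name": "John Smith", "city": "Dallas", "phone": None},
    {"source": "social", "updated": "2013-11-20",
     "name": "John Smith", "city": "Kansas", "phone": "555-0100"},
]

FUSED, LINEAGE = fuse(RECORDS, ["name", "city", "phone"])
```

Here `FUSED["city"]` is "Kansas" because the social record is newer, `FUSED["phone"]` is filled from the only record that has one, and `LINEAGE["city"]` still lists both the "Kansas" and "Dallas" values with their origins.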
  • 35. 360 Context. Analyze social data in the context of enterprise data to build entity and event profiles and establish linkages between them for online and offline analysis. Entity (people, products, events) insights, as problem and solution. What are the key product interests of person A? Over time learn about the person's product interests from her social media postings. What is the location and trajectory of person B? Gives the current location and locations in the past. What life events happened in person A's life in the past x months? List significant events like marriage, birth of a child, relocation, etc. What are the events of interest happening in a given location? Lists the top events in a given geography. What is the sentiment on a given product? Gives the sentiment on a product. Key Sustained Value Factor: understand customers' wants and needs better. [Diagram: Social Data and Enterprise Databases feed MDM 360 (event detection, entity linkages, sentiment, event profiles, entity profiles); labels include propensities/sentiment/intent, core customer view/transactions, intent to purchase for customers, real-time public safety events; user and application domains: Smarter Commerce, Smarter Cities.] What MDM 360 does: builds an entity's complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.
  • 36. Extraction Challenges: Stages of intent (stage: example). Wishing for an event: “I just want to graduate, get a job, get a car, and live with my boyfriend”. Anticipating an event: “Im getting a car for graduation yay!!!!!”. During an event: “At disneyworld :D”. Post event / continuous state: “Apparently I got a raise at work three months ago and didn't know? Sweeeeeeeeeet”. Hobby: “Loves to fish, travel and frequent concerts. Down to earth, athletic, professional, 40 and single. Loves the outdoors, working out, travel and younger fit guys for dating.”
  • 37. Extraction Challenges: Detecting filtering conditions (filter: example). Spam: “Need a New #Credit Card for your #Business or online #Ebay store? Compare and Apply Online. http://retweet.it/r/We0iai”. Sarcasm, jokes: “I thought I was having a stroke this afternoon but it turns out it was too many Starbucks Refreshers plus my leg falling asleep.” Resolve ambiguous meaning: “In the words of @LNSmooth23 I'm retiring from the nightlife”. Non-personal: “My mom is buying a house, but why in Willingboro”.
  • 38. 360-degree Profiles from Social Media. Social media based 360-degree event and individual profiles. Event Detection: identifiers (what, where, when, …); attributes (severity, urgency, …). Personal Attributes: identifiers (name, address, age, gender, occupation, …); interests (sports, pets, cuisine, …); life cycle status (marital, parental); relationships (family, friends, co-workers, work and interest network). Personal Interests: personal preferences or political leanings; activity history. Timely Insights on Events: event detection; public safety events; plans for public disturbances; sentiment around events; citizen sentiment. Timely Insights on Individuals: intent to participate in public events; instigation for causing public damage; sentiment on events, govt policies; current location; hate messages. Example posts. Intent: “We must support the movement, I am going to the rally at Jantar Mantar tomorrow”; “Anna Hazare has a point when he says politicians are corrupt and need to be taught a lesson. The rally starts at 10.” Public safety events: “Mamta Deedi is also joining Anna, Ramdev, n Kejriwal. She is going to do Anshan at Delhi Jantar mantar. Ye sab public ki kahke le rahe hain.”; “So Its Mamata's day out tomorrow at #JantarMantar. #Rally.” Location announcements: “I'm at Karir Square http://4sq.com/fYReSj”.
  • 39. More data: Customer intent extracted from social media provides context. Sources: Prior Business Transactions and Social Data, processed by Entity Extraction, Fact Discovery, Intent & Sentiment. 450M+ tweets/day; millions of tweets yield one company-specific fact. Text analytics is used to extract intent from social media. “Wifey's birthday tomorrow, looking for a killer dslr”: Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe. “Maybe I should buy her that purple roadster, while I'm at it. ;-) lol”: Sarcasm, Wishful Thinking; Intent to Purchase, Gift Type? “In NYC area this w/e, any good malls nearby?”: Potential Locations and Activity; Region & City Location, Timeframe, Intent to Shop. Influencers: Michael's online friends offer lots of advice (“Go for the best, DP2000”; “Thrza gr8 deal on ZX-550 @ the mall”). Intent: “Buying a DSLR today!”; customer ready to buy a DSLR camera today, possibly at a nearby mall. The resultant fact base contains billions of facts, and is incrementally updated. Fact segmentation or clustering is rapid enough to drive a business decision.
  • 40. Matching Twitter profiles to Corporate Data. Linking social media profiles with an employee database; several extensions are possible, for example linking with citizens and security databases. Flow: social media profiles (name, address, gender, age, employment, relationship, …) → employment filter → social media profiles of IBM employees and their network (name, work location, job description). Current demo focused on name and location matching, as well as EmployeeOf information. Choice of social media profile attributes for linking is constrained by the availability of IBM BluePage attributes. Twitter: 45M profiles (name: first, last; home location: city, (state), country; employment: company + role). Employee Directory: 460K entries (name: first, middle, last, preferred; work location: city, state, zip, country; job description). Resolution. Semantic Name Variations: Bill Chamberlin vs. Chamberlain, William H.; C. Mohan vs. Mohan Chandrasekaran (Mohan). Geo Proximity: Saratoga, CA vs. San Jose, CA; New Jersey vs. New York. Job Role Disambiguation: “Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”.
  • 41. Example Result. Semantic name variations: Twitter name is a close variation of the IBM names. Geo Proximity: work locations are within 25mi of the Twitter location. Job Role Disambiguation: description in Twitter profile matches HR role.
  • 42. Common Data Problems. Lack of information standards: Ashok Kumar / A Kumar / Mr. Ashok Kr; Four sixteen Street 8, Anand Niketan, Delhi / #416 Anand Niketan, N Delhi, 21 / 416 Anand Niketan, New Delhi, India 110021. Data misplaced in the database. Different formats & structures across different systems. Data surprises in individual fields (Email | Tax ID | Telephone): 91,,,, | 228-02-1975 | 6173380300; ranivrgeoi@yahoo.co.in | 025-37-1888 | 415-392-2000; ,CYRUS_DASTUR@HOTMAIL.COM | 34-2671434 | 3380321; HP 15 State St. | 508-466-1200 | Orlando. Special characters in the data. The redundancy nightmare: duplicate records with a lack of standards. 90328574 | IBM | 187 N.Pk. Str. Salem NH 01456; 90328575 | I.B.M. Inc. | 187 N.Pk. St. Salem NH 01456; 90238495 | Int. Bus. Machines | 187 No. Park St Salem NH 04156; 90233479 | International Bus. M. | 187 Park Ave Salem NH 04156; 90233489 | Inter-Nation Consults | 15 Main Street Andover MA 02341; 90345672 | I.B. Manufacturing | Park Blvd. Bostno MA 04106.
  • 43. Address Variations. Spelling variations, hyphenation, abbreviations: I-344 | Sarojini Nagar | N Delhi | 23; 344 Block J | Sarojni Ngr | New Delhi | 110023; 344 Block I | Sarojni Ngr | New Delhi | 110023. Multiple ways of writing the same field: 13B | Link Road | Versova | Mumbai; 18 Block M | Bandra Versova Link Rd. | Versova | Mumbai. Missing address fields: 4, Block C | ISID Campus | V. Kunj | New Delhi | 110070; 4C | ISID Campus | Institutional Area | V. Kunj | New Delhi | 110070. Errors: 4C | ISID Campus | Institutional Area | V. Kunj, New Delhi | 110007.
  • 44. Regional variations in Addresses across India. Addresses in different regions contain words of the local language even when the addresses are written in English. Ex: the commonly used word to describe a street type is “Gali” in Northern India, whereas “Beedhi/Veedhi” is the commonly used term in Southern India. Street intersections and street information containing multiple street type identifiers like Cross and Main are extensively found in the Southern Indian regions. Ex: “3rd Main, 4th B Cross”. Sector and pocket information are found primarily in North Indian addresses. Ex: “Sector 5, Pocket 2A 2nd Block”. Regional differences in writing addresses necessitate bifurcation of standardization rules based on regions.
  • 45. Investigating the Data. Take the example: 123 St. Virginia St. Parsing separates multi-valued fields into individual pieces: 123 | St. | Virginia | St. Lexical Analysis determines the business significance of individual pieces: 123 (Number), St. (Street Type), Virginia (Alpha), St. (Street Type). Context Sensitive identifies various data structures and content: 123 (Number), St. Virginia (Street Name), St. (Street Type). “The instructions for handling the data are inherent within the data itself.”
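The three stages on this slide can be mimicked on the example "123 St. Virginia St." with the sketch below. The token classes follow the slide; the `STREET_TYPES` set and the context rule (only the last street-type token is treated as the street type, earlier ones are folded into the street name) are assumptions made for illustration, not the rules of any particular standardization product.

```python
# Hypothetical abbreviation list for illustration.
STREET_TYPES = {"st", "st.", "rd", "rd.", "ave", "ave."}

def parse(address):
    """Parsing: separate a multi-valued field into individual pieces."""
    return address.split()

def lexical_classes(tokens):
    """Lexical analysis: classify each piece on its own."""
    classes = []
    for tok in tokens:
        if tok.isdigit():
            classes.append("Number")
        elif tok.lower() in STREET_TYPES:
            classes.append("Street Type")
        else:
            classes.append("Alpha")
    return classes

def contextual_classes(tokens):
    """Context-sensitive pass: only the last street-type token keeps that class;
    alphabetic tokens and earlier street-type tokens become part of the name."""
    classes = lexical_classes(tokens)
    last_type = max((i for i, c in enumerate(classes) if c == "Street Type"),
                    default=None)
    resolved = []
    for i, c in enumerate(classes):
        if c == "Alpha" or (c == "Street Type" and i != last_type):
            resolved.append("Street Name")
        else:
            resolved.append(c)
    return resolved

TOKENS = parse("123 St. Virginia St.")
```

`lexical_classes(TOKENS)` labels both "St." tokens as Street Type, while `contextual_classes(TOKENS)` yields Number, Street Name, Street Name, Street Type, matching the slide's reading of "St. Virginia" as the street name.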
  • 46. Sample Standardized Output. Sample address input: “SANT KRUPA BUILDING, 2ND FLOOR, CHHEDA RD, NR S V JOSHI HIGH SCHOOL, DOMBIVALI (E), THANE. INDIA.” Standardization output: DoorNo: 20; Floor Value: 2nd FLOOR; Building Name: SANT KRUPA; Building Type: BUILDING; Street Name: CHHEDA; Street Type: ROAD; Landmark Position: NEAR; Landmark: S V JOSHI HIGH SCHOOL; Area: DOMBIVALLI EAST; City: THANE; District: THANE; State: MAHARASHTRA.
  • 47. Input Addresses vs Standardized Addresses (Sr. No | Input address | Standardized address | Highlights). 1 | A38/91 KONIA . . VARANASI INDIA | A38/91 KONIA VARANASI VARANASI UTTARPRADESH INDIA | Autopopulation of state. 2 | VILL BASUDEVPUR PO KHANJANCHAK DURGACHAK HALDIA TAMLUK INDIA | DURGACHAK ,HALDIA,VILLAGEBASUDEVPUR PO-KHANJANCHAK TAMLUK EAST MIDNAPORE WESTBENGAL INDIA | Rural address handling. 3 | NEAR RAJGHAR GIRLS SCHOOL LACHIT NAGAR HOUSE NO 5 ULUBARI GUWAHATI ASSAM GUWAHATI INDIA | 5 NEAR RAJGHAR GIRLS SCHOOL ULUBARI LACHIT NAGAR GUWAHATI KAMRUP ASSAM INDIA | Maintaining a standard format across addresses (house no. precedes landmark information). 4 | 1/15, PREMJYOTI CO OP HSG SOC., RAMBAUG - 5, KALYAN (W), MAHARASHTRA 421301 BHIWANDI INDIA | 1/15 PREMJYOTI COOPERATIVE HOUSING SOCIETY,RAMBAUG 5 KALYAN WEST BHIWANDI THANE MAHARASHTRA 421301 | Standardization of tokens. 5 | 3/2,FIRINGI DANGA ROAD, P.O.MALLICKPARA SERAMPORE-3 CALCUTTA INDIA | 3/2,FIRINGI DANGA ROAD, SERAMPORE-3 P.O.MALLICKPARA KOLKATA WESTBENGAL INDIA | Standardization of tokens.
  • 48. Two Methods to Decide a Match. Are these two records a match? Record 1: RHITU K KAZANGIAN, 128 MAIN ST, 02111, 12/8/62. Record 2: RITU KUMAR KAZANGIAN, 128 MAINE RD, 02110, 12/8/62. Deterministic decision tables: fields are compared; a letter grade is assigned (B B A A B D B A = BBAABDBA); combined letter grades are compared to a vendor-delivered file; result: Match, Fail or Suspect. Probabilistic record linkage: fields are evaluated for degree-of-match; a weight is assigned that represents the “information content” by value (+5 +2 +20 +3 +4 -1 +7 +9); weights are summed to derive a total score (= +49); result: statistical probability of a match.
  • 49. A Closer Look at Probabilistic Matching. RHITU K KAZANGIAN, 128 MAIN ST, 02111, 12/8/62 vs. RITU KUMAR KAZANGIAN, 128 MAINE RD, 02110, 12/8/62: +5 +2 +20 +3 +4 -1 +7 +9 = 49. The weighted score is a relative measure of the probability of a match; it expresses the amount of information content for all of the fields compared. The CUTOFF is the score above which good matches are found. [Histogram of weights: number of pairs (0 to 4000) against weight (-50 to 60), with unmatched pairs at low weights and matched pairs at high weights.]
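The arithmetic on this slide is a plain sum of per-field weights compared against a cutoff. In the sketch below the eight weights are the ones shown on the slide; their assignment to named fields, the two cutoff values and the three-way outcome are assumptions for illustration only.

```python
# Per-field weights from the slide; the field names are an assumed mapping.
FIELD_WEIGHTS = {
    "first_name": 5,      # RHITU vs RITU
    "middle_name": 2,     # K vs KUMAR
    "last_name": 20,      # KAZANGIAN vs KAZANGIAN
    "house_number": 3,    # 128 vs 128
    "street_name": 4,     # MAIN vs MAINE
    "street_type": -1,    # ST vs RD
    "zip": 7,             # 02111 vs 02110
    "birth_date": 9,      # 12/8/62 vs 12/8/62
}

def total_weight(field_weights):
    """Sum the per-field weights into one composite score."""
    return sum(field_weights.values())

def decide(score, match_cutoff=40, clerical_cutoff=20):
    """Compare the score with cutoffs (both values are hypothetical)."""
    if score >= match_cutoff:
        return "match"
    if score >= clerical_cutoff:
        return "suspect"
    return "non-match"

SCORE = total_weight(FIELD_WEIGHTS)
```

`SCORE` is 49, as on the slide, and with the assumed cutoffs `decide(SCORE)` returns "match". In practice the cutoff would be read off a histogram of weights like the one the slide describes.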
  • 50. The Value of Information Content
      – Information Content is measured both at the field level and at the field value level, and is calculated automatically
      – Discriminating Value represents the significance of one field versus another in contributing to a match. For example, a Gender Code contributes less information than a Tax-Id Number
      – Frequency represents the significance of one value in a field over another value. For example, in a Last-Name field, “SMITH” contributes less information than “ROUTZAHN”
      – Probabilistic Matching uses the automatically generated measures of Information Content to achieve the highest match rates possible utilizing a scientifically justifiable methodology
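One common way to quantify the frequency effect described here is the self-information of a value, log2(1 / relative frequency). The slides do not state which formula the matching engine uses, so this is an assumption, and the surname counts below are invented for illustration.

```python
import math


def value_information(count, total):
    """Self-information in bits of a value seen `count` times in `total` records.

    Assumption: the slides do not give the formula; log2(total / count) is one
    standard choice, under which rarer values carry more information.
    """
    return math.log2(total / count)


# Invented counts for a hypothetical file of 100,000 last names.
total_records = 100_000
name_counts = {"SMITH": 1000, "ROUTZAHN": 2}

for name, count in name_counts.items():
    print(name, round(value_information(count, total_records), 2))
```

Under this measure agreement on a rare surname adds more to the match score than agreement on a common one, which is the behaviour the slide describes.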
  • 51. Data Framework around the Individual
      – Big Data: external & internal unstructured data linked to individuals; social activity
      – Individual Credentials: logins (user credentials); profile; expertise
      – Individual Social Presence: blog; comment; opinion; “Like”s; community
      – Relationships with the Individual: to person; communities; to company (roles, history); IBM linkage
      – Individual Core Personal Data: name, address; phone, eMail; behavioral preference / permissions
      – Transactions involving the Individual: tech support call; opportunity & orders; responses to marketing campaigns
      – Interactions with the Individual: digital; phone; eMail; F2F; social; web traffic
  • 52. Analytics steps
      – Text Analytics: analyze and extract consumer attributes from individual messages
      – Entity Integration: integrate information about a consumer within a single social media source over time
      – Entity Resolution: link social media profiles with customer data; link and integrate information about a consumer across multiple social media sources
      – Data linked: Social Profiles of Consumers; Master Data on Customers
      Example messages by attribute type:
      – Intent: “All I really want is the Disney Visa card from chase with the castle on it”
      – Life Events: “Looks like we'll be moving to New Orleans sooner than I thought.”
      – Personal Attributes: “I am a engineer, mom, and wife”
      – Relationships: “In fact I'm looking forward to the new month. Both myself and the wife have our graduation ceremonies”
  • 53. Person Information across Documents
      Who is James Dimon? Do these filings refer to the same person?
      – Variability in the person's name, lack of a key identifier
      – Supporting attributes vary depending on the context (form type)
      – All these facts need to be linked and integrated
      Sources shown: Signatures; Biographies; Insider Transactions; Committee memberships
  • 54. Entity & Relationship Analytics from Big Data
      Pipeline: Unstructured data sources (untrusted view) → Crawl → Extract / Text Analytics → Entity Resolution → Map / Fuse / Aggregate → Entity Views (entities & relationships: object-centric view)
      Challenge: Construct and maintain comprehensive profiles of entities and relationships from unstructured data sources
      Main Problem: Assemble an entity view, where each entity aggregates data from thousands of different documents
      Multiple stages of complex processing:
      – Information extraction
        • From each unstructured document, extract relevant structured records
      – Entity resolution
        • Link records (possibly across documents) that are about the same real-world “entity”
      – Entity population (entity integration): mapping / fusion / aggregation
        • Collect all the facts about the same entity into one rich object with clean values and relationships to other entities
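The three stages above (extract, resolve, populate) can be sketched end to end on toy data. Everything in this example is an assumption for illustration: the documents are invented, extraction is a pass-through, and resolution uses a crude first-and-last-name key rather than the probabilistic matching discussed earlier.

```python
from collections import defaultdict


def extract(doc):
    """Stage 1 (toy): pretend extraction already produced a structured record."""
    return {"name": doc["name"], "facts": doc["facts"], "source": doc["id"]}


def resolution_key(record):
    """Stage 2 (toy): lowercase first and last name tokens, ignoring middle parts."""
    tokens = [t.strip(".,").lower() for t in record["name"].split()]
    return (tokens[0], tokens[-1])


def build_entity_views(docs):
    """Stage 3: collect all facts about the same resolved entity into one object."""
    views = defaultdict(lambda: {"names": set(), "facts": set(), "sources": set()})
    for doc in docs:
        record = extract(doc)
        view = views[resolution_key(record)]
        view["names"].add(record["name"])
        view["facts"].update(record["facts"])
        view["sources"].add(record["source"])
    return dict(views)


# Invented documents; the name variants are hypothetical.
docs = [
    {"id": "signature-1", "name": "James Dimon", "facts": ["signed filing"]},
    {"id": "biography-1", "name": "James L. Dimon", "facts": ["chief executive"]},
    {"id": "committee-1", "name": "JAMES DIMON", "facts": ["committee member"]},
]
entity_views = build_entity_views(docs)
print(len(entity_views), sorted(entity_views[("james", "dimon")]["facts"]))
```

A name-only key would wrongly merge different people who share a name, which is why the slide calls for supporting attributes to be linked as well.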
  • 55. The Complete Entity View
      360-degree profile of a customer:
      – Current purchase intentions expressed by the consumer
      – Location-based information about a consumer (where they plan to travel, events they are going to attend)
      – Purchase history for a consumer
      – Life events (relocation, home purchase, wedding, graduation)
      – Related people based on social networking data
      – Comments/complaints expressed about various products and services
      – Customer identity information (e.g., name, location) obtained from profiles and content of posts
      – Micro-segmentation information about individual consumers (e.g., gender, age range, profession)
      Columns: City | State | Age Range | Gender | Marital Status | Number of kids | Employment Status | Occupation
      Houston | TX | 30-39 | Female | ? | ? | Employed | Journalist
      San Jose | CA | ? | Male | Married | 2 | Employed | Software Engineer
      … | … | …
      – Aggregate attributes from multiple sources
      – Filter to obtain a segmentation
      – Analyze to obtain “Similar Populations”
      – Adding more input data gives better predictive power
  • 56. Attribute fusion example: Inferring location from multiple clues
      Social media profile (metadata):
      – Name: Tracy Guida
      – Screen name: @tracyguida
      – Location: Tampa
      – Description: just a Nor-Cal gal trying to fall in love with Florida
      Fused profile (disambiguation, fusion of partial information), giving the permanent location:
      – Screen name: @tracyguida; Name: Tracy Guida; Location: Tampa, FL
      – Fusion libraries: confidence (metadata vs. content)
      Messages, giving the temporary location:
      – Textual clues: “Gotta love Florida football #hot #humid http://instagr.am/p/QOHPqhKdYt/”; “Check out my blog about #food in #TampaBay http://www.myothercitybythebay.com”
      – Check-ins: “I'm at Tracy's Seat At Micah's (Tampa, FL) http://4sq.com/SZ4yjj”; “I'm at S.o.G (Tampa, Florida) http://4sq.com/UDweM5”; “I'm at Eats American Grill (Tampa, FL) http://4sq.com/O1a1Jm”
      – Geo-located documents: “Who's watching the #presidential #debate tonight? (from 27.97989014,-82.54825406)”
      – Fusion libraries: confidence (place mentions vs. geo-codes); analysis of location time-series
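A very simple form of the fusion described on this slide is confidence-weighted voting over the location clues. The sketch below is an assumption, not the fusion library the slide refers to: the confidence values per clue type are invented, and place names are assumed to have been normalized already.

```python
from collections import defaultdict

# Hypothetical confidence per clue type; the slides only say that confidences
# differ (metadata vs. content, place mentions vs. geo-codes), not their values.
CONFIDENCE = {"profile": 0.6, "check_in": 0.8, "text_mention": 0.3}


def fuse_location(clues):
    """Add up the confidence of every clue per location and pick the top one."""
    scores = defaultdict(float)
    for clue_type, location in clues:
        scores[location] += CONFIDENCE[clue_type]
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Clues loosely based on the slide, with place names already normalized.
clues = [
    ("profile", "Tampa, FL"),
    ("check_in", "Tampa, FL"),
    ("check_in", "Tampa, FL"),
    ("check_in", "Tampa, FL"),
    ("text_mention", "Florida"),
    ("text_mention", "Northern California"),
]
best_location, location_scores = fuse_location(clues)
print(best_location, location_scores)
```

Separating permanent from temporary location, as the slide does, would additionally need the timestamps of the clues (the location time-series); that part is not sketched here.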
  • 57. The Reliability (Veracity) Challenge
      – $\Theta = \{\theta_1,\dots,\theta_N\}$: a set of hypotheses (frame of discernment, universe of discourse)
      – $\{x_n^i\}$: probability, possibility, or belief in hypothesis $\theta_n$ of source $i$
      – $\{O_i\}$: input data (social media, enterprise information)
      – $F(x_1,\dots,x_I)$: fusion operator
      [Diagram: Environment → $\{O_1\},\dots,\{O_I\}$ → Source 1 … Source I (each with a source belief model and source characteristics) → $\{x_1\},\dots,\{x_I\}$ → Fusion operator → $F(x_1,\dots,x_I)$]
  • 58. Typical Reliability Settings
      – It is possible to assign a numerical degree of reliability to each source
      – A subset of sources is reliable but we do not know which one
      – Reliabilities of the sources can be ordered but no precise reliability values are known
      – Reliability is dependent on context too: during the Mumbai Mantralaya fire there were a few tens of tweets on the event on Twitter, while on the same day there was a match and there were several thousand tweets saying “Miami on Fire”
  • 59. Strategies for Utilizing Reliability
      – Strategies explicitly utilizing reliability of sources: reliability is used to modify the beliefs of each model before fusion, and the transformed beliefs are then fused (separable case)
      – Strategies for modifying the fusion process to account for the reliability of the sources (non-separable case)
      – Strategies identifying the reliability of data input to fusion processes and eliminating the sources of poor reliability
      – Combination of the strategies mentioned above
      $F(x_1,\dots,x_I) \rightarrow F_R(x_1,\dots,x_I)$
      $F_R$ is a context-dependent operator, which depends on the strategy selected and is defined within the framework used for uncertainty representation
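Two of these strategies are easy to illustrate on probability vectors. The sketch below is an assumption, not a rule taken from the slides: for the separable case it discounts each belief vector towards the uniform distribution in proportion to (1 - reliability), and for the elimination strategy it drops sources whose reliability falls below a threshold chosen arbitrarily here.

```python
def discount(beliefs, reliability):
    """Separable case (assumed rule): shrink beliefs towards uniform before fusion.

    reliability = 1 leaves the beliefs unchanged; reliability = 0 replaces them
    with the uniform distribution, i.e. the source no longer says anything.
    """
    n = len(beliefs)
    return [reliability * b + (1.0 - reliability) / n for b in beliefs]


def drop_unreliable(sources, reliabilities, threshold):
    """Elimination strategy: keep only sources at or above the threshold."""
    return [s for s, r in zip(sources, reliabilities) if r >= threshold]


source_beliefs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
source_reliabilities = [0.9, 0.5, 0.1]

transformed = [discount(b, r) for b, r in zip(source_beliefs, source_reliabilities)]
kept = drop_unreliable(source_beliefs, source_reliabilities, threshold=0.3)
print(transformed)
print(kept)
```

The transformed vectors can then be passed to any fusion operator, which is what makes this case separable.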
  • 60. Reliability Coefficients
      Reliability coefficients represent trust in each belief model. They introduce the second level of uncertainty and represent a measure of the adequacy of the model used, the reality of the environment, and source characteristics
      – $R_i = R_i(M_i, \gamma, \Upsilon)$: reliability of source $i$ (reliability of source $i$ and hypothesis $j$: $R_{ij}$)
      – $M_i$: model of source $i$
      – $\gamma$: parameters characterizing the external environment (context)
      – $\Upsilon$: parameters characterizing the internal environment of source $i$ (tuning parameters)
      – Relative reliability: $\sum_{i=1}^{I} R_i = 1$; may be replaced with $\max_i R_i = 1$
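The two normalizations of relative reliability mentioned above can be sketched as follows. The raw reliability scores are invented for illustration, and the function names are hypothetical.

```python
def normalize_to_sum(reliabilities):
    """Scale the coefficients so that they sum to 1."""
    total = sum(reliabilities)
    return [r / total for r in reliabilities]


def normalize_to_max(reliabilities):
    """Scale the coefficients so that the most reliable source has R = 1."""
    largest = max(reliabilities)
    return [r / largest for r in reliabilities]


raw_reliabilities = [2.0, 1.0, 1.0]  # invented raw scores for three sources
print(normalize_to_sum(raw_reliabilities))
print(normalize_to_max(raw_reliabilities))
```

The sum form suits weighted averages, while the max form keeps the best source at full weight and only discounts the others.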
  • 61. Bayesian Fusion
      – In the Bayesian framework the degrees of belief are represented by a priori, conditional, and a posteriori probabilities
      – Usually, decisions are made on a posteriori probabilities $P(\theta_n \mid y_i)$, where $y_i$ is the input coming from source $i$
      – $x_i = P(\theta_n \mid y_i)$ represents statistics of each source to be combined (data, outputs of classifiers)
      – Fusion is performed by the Bayesian rule, which under the condition of source independence is reduced to a product:
        $F_n(x_1,\dots,x_I)\,|\,y = F_n(P)\,|\,y_i = P(\theta_n)\prod_{i}\left[P(\theta_n \mid y_i)/P(\theta_n)\right]$
      – This fusion operator is conjunctive and assumes total reliability of the sources
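The product rule above can be evaluated directly for a small example. The sketch assumes strictly positive priors and adds a final normalization step so that the fused values sum to 1; the normalization and the example numbers are not in the slide.

```python
def bayes_product_fuse(prior, posteriors):
    """Evaluate P(theta_n) * prod_i [P(theta_n | y_i) / P(theta_n)] per hypothesis.

    prior: list of P(theta_n), assumed strictly positive.
    posteriors: one list of P(theta_n | y_i) per source i.
    The result is normalized to sum to 1 (an added step, not in the slide).
    """
    fused = []
    for n, p_n in enumerate(prior):
        value = p_n
        for source_posterior in posteriors:
            value *= source_posterior[n] / p_n
        fused.append(value)
    total = sum(fused)
    return [v / total for v in fused]


# Two hypotheses, two sources; numbers invented for illustration.
example_prior = [0.5, 0.5]
example_posteriors = [[0.8, 0.2], [0.7, 0.3]]
fused_beliefs = bayes_product_fuse(example_prior, example_posteriors)
print(fused_beliefs)
```

Because the operator is conjunctive, two sources that agree reinforce each other: the fused belief in the first hypothesis is higher than either source reports on its own.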
  • 62. Weighted Average
      – If the sources are not totally reliable, several fusion rules within the framework of probability theory have been proposed in the literature
      – A majority of the weighted average methods are based on consensus theory, which involves general procedures for combining single-source probability distributions, while decisions are based on Bayesian decision theory:
        $F_n(x_1,\dots,x_I,R_1,\dots,R_I)\,|\,y_i = F_n(P,R)\,|\,y_i = \sum_{i} P(\theta_n \mid y_i)\,R_i$
      – Here $R_i$ is the reliability associated with source $i$ in the global membership function, expressing quantitatively the goodness of each source $i$
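The weighted average rule is a reliability-weighted sum of the per-source posteriors. The sketch below assumes the reliabilities have already been normalized to sum to 1; the example numbers are invented.

```python
def weighted_average_fuse(posteriors, reliabilities):
    """Compute sum_i R_i * P(theta_n | y_i) for every hypothesis n."""
    n_hypotheses = len(posteriors[0])
    return [
        sum(r * source_posterior[n] for source_posterior, r in zip(posteriors, reliabilities))
        for n in range(n_hypotheses)
    ]


# Two hypotheses, two sources; the first source is trusted three times as much.
avg_posteriors = [[0.8, 0.2], [0.4, 0.6]]
avg_reliabilities = [0.75, 0.25]
averaged_beliefs = weighted_average_fuse(avg_posteriors, avg_reliabilities)
print(averaged_beliefs)
```

Unlike the product rule, the average is not conjunctive: an unreliable source pulls the result towards its own opinion only in proportion to its weight.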
  • 63. Incorporation of Contextual Information
      – This method integrates contextual information
      – The method is based on the fact that, in a given context, only a subset J of a set N of all sources to be combined is valid or reliable (i.e. their belief model adequately represents reality):
        $F_n(x_1,\dots,x_I,R_1,\dots,R_I)\,|\,y = \sum_{J} P(\theta \mid y_1,\dots,y_n, A_J)\,P(A_J)$
      – Here $P(A_J)$ is the probability of validity of the subset J of inputs. This probability is calculated from the reliabilities $R_i$ of the individual inputs
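The sum over subsets can be sketched for a handful of sources. Two assumptions are made that the slide does not state: sources are valid independently, so P(A_J) is the product of R_i for sources in J and (1 - R_i) for the others, and the belief conditioned on a valid subset is obtained by product-rule fusion of that subset only. Here each R_i is an absolute probability in [0, 1], not a relative weight.

```python
from itertools import combinations


def product_fuse(prior, posteriors):
    """Normalized product-rule fusion; with no sources it returns the prior."""
    fused = []
    for n, p_n in enumerate(prior):
        value = p_n
        for source_posterior in posteriors:
            value *= source_posterior[n] / p_n
        fused.append(value)
    total = sum(fused)
    return [v / total for v in fused]


def subset_probability(subset, reliabilities):
    """P(A_J) under the independence assumption stated in the lead-in."""
    p = 1.0
    for i, r in enumerate(reliabilities):
        p *= r if i in subset else (1.0 - r)
    return p


def contextual_fuse(prior, posteriors, reliabilities):
    """Sum over all subsets J of P(theta | sources in J) * P(A_J)."""
    n_sources = len(posteriors)
    fused = [0.0] * len(prior)
    for size in range(n_sources + 1):
        for subset in combinations(range(n_sources), size):
            p_subset = subset_probability(set(subset), reliabilities)
            conditional = product_fuse(prior, [posteriors[i] for i in subset])
            for n, c in enumerate(conditional):
                fused[n] += p_subset * c
    return fused


ctx_prior = [0.5, 0.5]
ctx_posteriors = [[0.8, 0.2], [0.3, 0.7]]
ctx_beliefs = contextual_fuse(ctx_prior, ctx_posteriors, [0.9, 0.4])
print(ctx_beliefs)
```

The number of subsets grows as 2^I, so this exhaustive form is only practical for a small number of sources.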
  • 64. Biographical and Biometric fusion for Person Identification
      – Many modern data repositories record both biographical and biometric information
        • Motor Vehicle Licensing Authority, passport, identity cards, etc.
        • Unique Identification number (www.uidai.gov.in)
      – Fusing information from multiple sources brings value in
        • Data integration: creating a single view of citizen, person, customer
        • Identification of the person using biometric information and biographical information
      – Scaling person identification for a large number of customer records
        • Biographical data is abundant, easy to match, and scales to millions of records, but can be noisy and uncertain
        • Biometric data is noise free and gives high precision for identification, but does not scale to a large number of records
        • Both streams contain complementary information, which can be exploited by fusing them together
      – Fusion for person identification can be done at two levels
        • Decision fusion: each matcher provides a decision, and the decisions are then fused to produce the final decision
        • Score fusion: each matcher provides a score, and the scores are combined into a single score used for decision making
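The difference between the two fusion levels can be shown with a toy example. The fusion rules used here (majority vote for decisions, mean score against a threshold for scores), the threshold and the matcher scores are all assumptions for illustration; the slides do not specify them at this point.

```python
def decision_fusion(decisions):
    """Fuse per-matcher accept/reject decisions by simple majority vote."""
    return sum(decisions) > len(decisions) / 2


def score_fusion(scores, threshold):
    """Fuse per-matcher scores by averaging, then decide once on the fused score."""
    return sum(scores) / len(scores) >= threshold


# Invented scores from three matchers for one candidate record.
matcher_scores = [0.95, 0.45, 0.40]
THRESHOLD = 0.5  # hypothetical acceptance threshold

per_matcher_decisions = [s >= THRESHOLD for s in matcher_scores]
print(decision_fusion(per_matcher_decisions))
print(score_fusion(matcher_scores, THRESHOLD))
```

In this example decision fusion rejects the candidate because two matchers fall just under the threshold, while score fusion accepts it because one matcher is very confident; score fusion keeps information that per-matcher decisions throw away.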
  • 65. Score Fusion using Biometric and Biographical matcher
      – Consider M matchers operating on a database containing N records which have both biographical and biometric information. There can be multiple biometric as well as biographical matchers
      – Each query q will generate N x M scores, i.e. M-dimensional scores for N records
      – We model the scores as being generated from a probability distribution. The score is fused using a joint distribution from different sources
      – The probability distribution under reasonable assumptions is the posterior distribution of scores given a query
      – For query q, if all the records are equally likely for the identifier, then the posterior of the score given records is given by [equation not captured in the transcript]
      – The genuine and imposter match scores are assumed to be identically distributed
      – The posterior distribution is modeled as a Gaussian mixture model. The model is built for both the genuine match distribution and the imposter distribution
      – Models are learnt from training data
      – The algorithm is: the query is assigned an identity of $n_0$ only if [equation not captured in the transcript], which simplifies to [equation not captured in the transcript]
      [Figures: Genuine match score density; Imposter match score density]
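Since the decision rule itself is not preserved in the transcript, the sketch below substitutes a standard likelihood-ratio rule: each record's M-dimensional score vector is evaluated under a genuine and an imposter Gaussian mixture, and the query is assigned to the record with the highest ratio if that ratio exceeds a threshold. The mixture parameters are hand-picked rather than learnt, covariances are assumed diagonal, and M = 2 is used for brevity.

```python
import math


def gaussian_density_diag(x, mean, variance):
    """Density of a Gaussian with diagonal covariance at the score vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - (xi - mi) ** 2 / (2.0 * vi)
    return math.exp(log_p)


def gmm_density(x, components):
    """Mixture density; components is a list of (weight, mean, variance)."""
    return sum(w * gaussian_density_diag(x, m, v) for w, m, v in components)


def identify(score_vectors, genuine_gmm, imposter_gmm, threshold=1.0):
    """Return the index of the best record, or None if no ratio beats the threshold."""
    best_index, best_ratio = None, 0.0
    for n, scores in enumerate(score_vectors):
        ratio = gmm_density(scores, genuine_gmm) / gmm_density(scores, imposter_gmm)
        if ratio > best_ratio:
            best_index, best_ratio = n, ratio
    return best_index if best_ratio > threshold else None


# Hand-picked single-component mixtures (assumption: real models are learnt).
genuine_model = [(1.0, [0.8, 0.8], [0.01, 0.01])]
imposter_model = [(1.0, [0.2, 0.2], [0.01, 0.01])]

# One query against N = 3 records, M = 2 matcher scores per record.
query_scores = [[0.25, 0.20], [0.75, 0.85], [0.30, 0.10]]
print(identify(query_scores, genuine_model, imposter_model))
```

Returning None when no ratio clears the threshold lets the system reject queries from people who are not in the database at all.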
  • 66. Results
      Data sets:
      – Biometrics: NIST dataset consisting of match scores of right and left index fingers
      – Biographical: electoral records of citizens in an emerging economy, consisting of names and addresses
      – A total of 6000 people were associated with the biometric and the biographical data
      – Here M = 4: 2 biometric, 2 biographical (name and address)
      Experimental setup:
      – Half of the dataset was used for training: the probability densities for both the imposter and genuine match score distributions were estimated
      – The number of Gaussian components was 5
      – The remaining records were used for testing
      Experimental results:
      – The score is fused using a joint distribution from these four different sources
      – The name modality has the lowest accuracy, whereas the biometric modality has high accuracy
      – The fused accuracy is much higher than that of the individual modalities
      – The accuracy increases when all the modalities are combined, thus validating the usefulness of fusion
      [Figures: Accuracy for different modalities; Identification accuracy for fusion of modalities]
  • 67. Social listening for monitoring the Philippine general elections 2013
      – Online and offline analysis of social media messages around election debates and election chatter for ABS-CBN TV Channel
      – Analysis of English and Filipino chatter to determine buzz and reaction on candidates, campaigns, parties, topics and events
      – Analysis of over 6 million election-related Twitter and Facebook posts
      – Comparison with Pulse Asia Election Survey
      – Real-time and offline monitoring of social media conversations about parties and candidates
      [Chart: POE, GRACE; x-axis Mar 08 to Mar 14, y-axis 0% to 50%; Mar 13 highlighted]
      Annotation: Grace Poe released her TV ad which drew flak from viewers. This was also the time that 3 candidates (Legarda, Poe, Escudero) of the Liberal Party who were also "guest" candidates of UNA were dropped by UNA as the President forbade them to attend UNA's soirees. Escudero felt really, really bad about being dropped by UNA (led by former president Estrada). Grace Poe offered to mediate between Escudero and Estrada.
      [Chart: Positive and negative sentiments for candidates; x-axis 0% to 100%; one bar per candidate: ZUBIRI, MIGZ (UNA); VILLAR, CYNTHIA HANEPBUHAY (NP); VILLANUEVA, BRO. EDDIE (BP); TRILLANES, ANTONIO IV (NP); SEÑERES, CHRISTIAN (DPP); POE, GRACE; PENSON, RICARDO; MAGSAYSAY, RAMON JR. (LP); MAGSAYSAY, MITOS (UNA); MADRIGAL, JAMBY (LP); MACEDA, MANONG ERNIE (UNA); LLASOS, MARWIL (KPTRAN); LEGARDA, LOREN (NPC); HONTIVEROS, RISA (AKBAYAN); HONASAN, GRINGO (UNA); HAGEDORN, ED; FALCONE, BAL (DPP); ESCUDERO, CHIZ; ENRILE, JUAN PONCE JR. (NPC); EJERCITO ESTRADA, JV (UNA); DELOS REYES, JC (KPTRAN); DAVID, LITO (KPTRAN); COJUANGCO, TINGTING (UNA); CAYETANO, ALAN PETER (NP); CASIÑO, TEDDY; BINAY, NANCY (UNA); BELGICA, GRECO (DPP); AQUINO, BENIGNO BAM (LP); ANGARA, EDGARDO (LDP); ALCANTARA, SAMSON (SJS)]
  • 68. Worldwide leads: Intent to buy, relocation to India
      – Venu Nair: Male, Atlanta, USA: Looking for good investment in Indian real estate market
      – Kiran Singh: Female, IT professional, Gurgaon: Any good 2 BR in Sohna Road?
      [Map: lead counts by region: 5, 12, 14, 2]
      Data from Dec 3-4, 2013
  • 69. Sample leads
      Columns: Name | Sex | Location | Profession | Interest | Where
      Param_star | M | India | Media | 2BHK | India
      Kiran Singh | F | Gurgaon, India | IT | 2BHK | Sohna Road, Gurgaon
      Venu Nair | M | Atlanta, US | (blank) | Apartment | Bangalore, India
      Muhammad Faiz | M | Singapore | IT | Apartment | India
      Hooker India - Bangalore | (blank) | Bangalore | Real Estate | 2 and 3 BHK | Noida, India
  • 70. Crowd sensing
      – The “power of the crowd”: a lot of information in a timely manner from everywhere
      – People already use social media to share public safety and law enforcement information
      – Gain deep situational awareness
      – Emergencies, calls for help
      – Enable proactive actions by augmenting traditional law enforcement methods
      [Diagram: Police monitoring (limited coverage) and crowd sensors → analytics and fusion in near-real-time → rich events & KPIs]
  • 71. Drinking in the Open
      – “Come to South City 2, in evening, its a regular scene there since last 4 years, people drink in open and food is served by restaurants in their cars”
      – “khandsa road per sunrise hospital se aage tekho ke pass rehari waale sharab pilaate hai, jinki wajah se waha aane jaane wale log pareshaan ho rahe hai even shaam ko to PCR ka bhi unhe darr ni hai kirpa karke inhe waha se hataiya” [English: On Khandsa Road, beyond Sunrise Hospital near the liquor shops, cart vendors serve alcohol, which troubles people passing by; in the evening they are not even afraid of the PCR; please remove them from there.]
      – “Gurgaon Police I also have a complaint to register. We have an alcohal drinking menace in front of our commercial complex, anand ganga comlex, at sohna chowk, on the main road.”
      Police Harassment
      – “These two Constables (Davinder Singh & his Colleague) were at their worst behaviour...when they found all documents ok in the Car. I couldn't understand the reason for harrasment...opp”
      Wrong Parking
      – “this is the main way from sadar bazar to bhuteshwar mandir. I dnt think y this road exist. It is the best place to park vehicles both the way are used to park vehicles no action have been taken from years. I think HUDA or MCG is not serious abt matter.”
  • 72. Event detection and mapping
  • 73. Conclusions
      – Noise is an unavoidable fact of real-life communication
      – Communication meant for human consumption can be noisy for computers and vice versa
      – Due to ubiquitous sensors (GPS, accelerometer), easy-to-use apps (Facebook, Twitter, YouTube), and higher internet connectivity, the key characteristics of raw data are changing
      – This new data can be characterized by the 4Vs: Volume, Velocity, Variety and Veracity
      – For example, during a football match, some people will tweet about goals, penalties, etc., while in addition there may be other reports in news channels. The data describes the same event, so fusion should create a single object representation
      – Different sources may have different reliability, and it is necessary to account for this fact to avoid degrading the performance of fusion results
      – Reliability and context should be taken into account during fusion
  • 74. Conclusions
      – Noise can be defined as any kind of difference in the surface form of an electronic text from the intended, correct or original text
      – Noise can be in the form of errors arising from uncertainty in language and communication, and of recognition errors