VOC real world enterprise needs

Communicating Knowledge
Sentiment Analysis Symposium
Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company
Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium
Nov. 9th 2011

Introduction
• Saltlux Inc. is located in Seoul, Korea, established in 1979 and renovated in 2003.
• Domain of expertise:
Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.
• Main languages supported: Korean, Japanese, English. External solutions are used for other languages.
• 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees)
• Several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland
• Many R&D partnerships (ETRI, KAIST, universities…)

Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future work

Project & Environment Description
• Needs of Customer
– Customer: Korean corporation in telecommunications
– Department of Voice of Customer Analysis
– Mission: analyze (human-typed) memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
– Data: human-typed notes covering any kind of question from customers
• Information about subscriptions
• Inquiries or complaints about devices (phones), services, or dealerships
• Complaints about quality of communication
• etc.
Number of notes: ~200 thousand a day (~5 million a month).
Notes are required to remain searchable for 1 year (~60 million).

• System (Main) Requirements
– Distinguish simple inquiries from complaints
– Classify into categories/departments of services
– Monitor trends of topics in real time, daily, weekly, monthly
– Compare trends between time slices
– Find related topics
– Manage personal vocabulary
– Anonymize personal data (people's names, telephone numbers, social IDs, addresses, etc.)
The project started in October 2010 with a 3-month POC (~10 MM).
After acceptance (success), integration with the real system for another 3 months (~10 MM).
2 phases: ~US$200,000
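
For the simplest cases, the anonymization requirement above could be approached with pattern-based masking. The following is a minimal sketch, assuming mobile numbers written like 010-1234-5678 and social IDs written like 800101-1234567. The two patterns, the placeholder tags and the function name `mask_personal_data` are illustrative assumptions, not the rules used in the project. People's names and addresses would need Named Entities Recognition and are not covered here.

```python
import re

# Hypothetical patterns, for illustration only.
# Mobile numbers such as 010-1234-5678, 01012345678 or 010 1234 5678.
PHONE_RE = re.compile(r"(?<!\d)01[016789][-\s.]?\d{3,4}[-\s.]?\d{4}(?!\d)")
# Social IDs such as 800101-1234567 (6 digits, optional separator, 7 digits).
SOCIAL_ID_RE = re.compile(r"(?<!\d)\d{6}[-\s]?[1-4]\d{6}(?!\d)")


def mask_personal_data(note: str) -> str:
    """Replace phone numbers and social IDs with placeholder tags."""
    # Mask the longer social IDs first, then phone numbers.
    note = SOCIAL_ID_RE.sub("<SOCIAL_ID>", note)
    note = PHONE_RE.sub("<PHONE>", note)
    return note


if __name__ == "__main__":
    print(mask_personal_data("연락처:010-1234-5678"))
```

Digit lookarounds are used instead of `\b` because Hangul characters count as word characters in Python regular expressions, so a number typed directly after Korean text would otherwise be missed.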

VOC Data Sample

• Data often contain some structured information (metadata), but without any standard.
• Most of the time, however, there is no particular mark or metadata, which makes Named Entities Recognition more complex.
• The same information is entered in many different ways (e.g. 연락처: phone number).

VOC Data Analysis
• Data contain lots of named entities (products, services, people, social IDs, phone numbers), often related to privacy
• Data contain lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like) but sometimes very long
• Lots of misspelling/mistyping
• Korean (Asian) problem of segmentation, amplified by the speed constraint
• Lots of (non-standard) abbreviations

System Overview
• Overall Architecture
[Figure: architecture diagram. Components shown:
Analysis Phase: Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization
Indexing, Distributed Indexes
Classifier (Hybrid SVM & Rules), Complaint Detector
Searching/Clustering (TopicRank)
Timelines Dumper, DFS, Timelines (files 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df)
Scheduler, Merger & Ranker, Trend (TopN), DB
Web Server (Web UI)]
In the real system, indexing has been parallelized on 18 Linux machines for speed.
System Overview
• Home page
• Top N Keywords Extraction
• Related Keywords (Word Clustering)
• Trend (Timeline) view
• Tweets view

Korean Linguistic
• Brief introduction
Korean is alphabet based, with consonants and vowels composed into blocks of consonant/vowel or consonant/vowel/consonant.
“나는 학생입니다.” => 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
One unit of consonant/vowel or consonant/vowel/consonant is a syllable (“eumjeol”), and a space-delimited word unit (“eojeol”) is composed of one or more syllables.
Basic grammar:
A word is composed of one root (noun, adjective/verb) followed by an inflection marking the grammatical role (subject/object/location, etc.) for nouns (called “josa”), or the aspect/mood (tense, honorific form, etc.) for verbs/adjectives (called “eomi”).
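
The composition of syllables from consonants and vowels described above is also reflected in Unicode, where every precomposed Hangul syllable (U+AC00 to U+D7A3) is computed from the indices of its initial consonant, vowel and optional final consonant. A small sketch of the decomposition follows. The function name `decompose` is just an illustrative choice.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
# 27 final consonants, plus "" for syllables without a final consonant
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)  # 11172


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


if __name__ == "__main__":
    for ch in "나학":
        print(ch, decompose(ch))
```

Applied to the slide's example, 나 yields ㄴ + ㅏ with no final consonant, and 학 yields ㅎ + ㅏ + ㄱ.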

Korean Linguistic
• Examples:
“나는 학생입니다.” => “나는” = “나” (na: I/me) + “는” (neun: topic marker)
학생입니다 = “학생” + “입니다” = “학생” (hak-saeng: student) + “입니다” (im-ni-da: am) => I'm (a) student.
Lots of (composite) inflectional forms:
학생 + 입니다 = Noun + Be
학생 + 인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: Subject / 에게: To / 한테: From / 을: Object) etc. (josa)
Korean is highly agglutinative (it concatenates prefix/nouns/josa/eomi):
Search engine: 검색엔진
High-performance search engine: 고성능검색엔진
But the usage of spaces is free/arbitrary.
One can write equivalently: 검색엔진 or 검색 엔진
Especially on SNS, on space-limited devices, or under speed constraints (such as real-time transcription of conversations), spaces are more and more omitted or misused.
=> Need Automatic Segmentation Correction.

• Automatic Segmentation Correction Illustration

Korean Linguistic
• Automatic Segmentation Correction Implementation
Binary classification approach: tag each syllable according to whether a space precedes it or not.
Any kind of classifier can be used. Here we use a CRF model (it could be an SVM) with the set of features listed below.
Example input: 프랑스의 세계적인 디자이너 …
CRF results:
Accuracy at character level: 96.25%
Precision at word level: 95.58%
• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains a number
• Evaluation
– Accuracy (character)
– Word precision = # correctly spaced words / # words produced by the system
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
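
The tagging scheme, the features and the two evaluation measures above can be sketched in a few functions. This is a minimal illustration only: it prepares labels and features that a CRF (or any other classifier) could consume, but it trains no model. The feature names and the window layout are assumptions rather than the configuration used in the project.

```python
def to_labels(spaced: str) -> tuple[str, list[int]]:
    """Return (text without spaces, labels); 1 means a space precedes the syllable."""
    chars, labels = [], []
    space_before = False
    for ch in spaced:
        if ch == " ":
            space_before = True
            continue
        chars.append(ch)
        labels.append(1 if space_before and len(chars) > 1 else 0)
        space_before = False
    return "".join(chars), labels


def char_ngram_features(text: str, i: int, max_n: int = 4) -> dict[str, str]:
    """Character n-grams (n = 1..max_n) that cover position i, plus two flags."""
    feats = {}
    for n in range(1, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(text):
                feats[f"{n}g[{start - i}]"] = text[start:start + n]
    feats["is_korean"] = str("\uac00" <= text[i] <= "\ud7a3")
    feats["has_digit"] = str(text[i].isdigit())
    return feats


def word_spans(labels: list[int]) -> set[tuple[int, int]]:
    """Turn space labels into a set of (start, end) word spans."""
    starts = [0] + [i for i, lab in enumerate(labels) if lab == 1 and i > 0]
    ends = starts[1:] + [len(labels)]
    return set(zip(starts, ends))


def char_accuracy(gold: list[int], pred: list[int]) -> float:
    """Share of syllables whose space/no-space tag is correct."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def word_precision(gold: list[int], pred: list[int]) -> float:
    """# correctly spaced words / # words produced by the system."""
    produced = word_spans(pred)
    return len(produced & word_spans(gold)) / len(produced)


text, gold = to_labels("프랑스의 세계적인 디자이너")
pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical output missing one space
```

In this toy case a single missing space costs only one character-level error but makes one of the two produced words wrong, which is why the two measures are reported separately.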

Korean Linguistic
• Transliteration
– Korean uses more and more English-derived words, transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms).
– Transcription is neither unique nor standardized.
Examples:
tablet: 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스

Korean Linguistic
• Automatic transliteration recognition
- Build a rule-based phonetic transliteration that acts similarly to Soundex, adapted for Korean pronunciation.
tablet, 태블릿
T => ㅌ/ㄸ/ㄷ
A => ㅏ/ㅓ/ㅔ/ㅐ
etc.
This method has high recall but low precision and needs post-processing filters (remove known Korean words found in lexicons, remove nouns that are too short, etc.).
The result has to be corrected by humans, so an efficient workbench is needed for productivity.
We gathered a dictionary of 130 thousand entries, mainly IT oriented.
More academic research is still needed to solve this problem.
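
To illustrate the Soundex-like idea, the sketch below reduces each Korean spelling to a key by grouping confusable consonants and ignoring vowels, so that spelling variants of one foreign word collapse to the same key. This is a swapped-in simplification, not the rule set described above: it normalizes Korean variants rather than generating candidates from Latin letters, and the consonant classes, code letters and function name are assumptions. As with Soundex, unrelated words can share a key, which matches the high-recall, low-precision behaviour noted above.

```python
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical classes of consonants treated as interchangeable.
CLASSES = {
    "K": "ㄱㄲㅋ", "N": "ㄴ", "T": "ㄷㄸㅌ", "L": "ㄹ", "M": "ㅁ",
    "P": "ㅂㅃㅍ", "S": "ㅅㅆ", "J": "ㅈㅉㅊ", "H": "ㅎ",
}
JAMO_TO_CODE = {j: code for code, jamos in CLASSES.items() for j in jamos}


def phonetic_key(word: str) -> str:
    """Soundex-like key: consonant classes only, adjacent repeats collapsed."""
    codes: list[str] = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 19 * 21 * 28:
            continue  # skip anything that is not a Hangul syllable
        lead = LEADS[index // (21 * 28)]
        tail = TAILS[index % 28]
        for jamo in (lead, tail):
            # ㅇ, missing finals and compound finals have no code and are ignored.
            code = JAMO_TO_CODE.get(jamo, "")
            if code and (not codes or codes[-1] != code):
                codes.append(code)
    return "".join(codes)


if __name__ == "__main__":
    for variant in ("태블릿", "타블렛", "테블릿", "히타치", "히타찌", "히다찌"):
        print(variant, phonetic_key(variant))
```

Candidates sharing a key would still have to pass the post-processing filters and the manual correction described above.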

Sentiment Analysis
• Complaint Detection
A problem similar to standard subjectivity detection (detecting whether a sentence is sentiment bearing or not).
Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand).
Feature space: n-grams of characters (syllables) + n-grams of words; 2- to 4-grams gave the best results.
Feature selection is important to reduce the feature space; chi-square and information gain gave the best results.
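
The combination of character n-grams with chi-square selection can be sketched as follows. This is only an illustration of the technique: the four English strings are invented stand-ins for Korean call-center notes, the function names are hypothetical, and no SVM is trained. The score is the standard chi-square statistic on the 2x2 table of feature presence versus class.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> set[str]:
    """Set of character n-grams of length n_min..n_max found in text."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """a/b: complaint/inquiry docs with the feature; c/d: docs without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs: list[str], labels: list[int], k: int) -> list[str]:
    """Rank n-grams by chi-square against the binary label and keep the best k."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos_df if label else neg_df).update(char_ngrams(doc))
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        a, b = pos_df[feat], neg_df[feat]
        scores[feat] = chi_square(a, b, n_pos - a, n_neg - b)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Invented toy corpus: 1 = complaint, 0 = simple inquiry.
docs = ["call dropped again", "call dropped twice",
        "how to change plan", "how to add plan"]
labels = [1, 1, 0, 0]
ranked = select_features(docs, labels, k=1000)
```

Only the top-ranked n-grams would then be kept as dimensions of the classifier's feature vectors, which is what reduces the feature space.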

Sentiment Analysis
Problem: no freely available resources such as SentiWordNet for Korean.
Need to build it!
We built our own general-domain dictionary as a baseline:
20,000 verbs/adjectives classified as positive/negative/neutral.
The result is a lexicon of ~5,000 entries (positive/negative only),
enriched with features manually extracted from n-grams.
Precision oriented (92%), but recall is still quite low (75%).
Overall accuracy: 85%
=> Still working on ways to improve recall without sacrificing precision.
Basic ideas:
Bagging / boosting (combining several classifiers)
Hybrid models combining (linguistic: semantic/syntactic) rules and machine learning (statistics)

Lessons Learned
• Lessons Learned
- There is still quite a big gap between customer expectations and reality. We need to explain, and to keep the customer involved in the process of assessment and knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons:
=> Named entities/Synonyms/Stopwords/Sentiment words
- The quality and quantity of these lexicons are a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic knowledge, etc. consumes time and expertise.
- Simple academic methods work quite well (even if they need a lot of tuning).
- Beyond a simple search engine, the quality of NLP components becomes more and more important, especially for Sentiment Analysis.

Lessons Learned
• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
- More and more customers want to get data/opinions from outside their in-house systems (blogs, communities (BBS), tweets, etc.). Typical questions:
How many crawlers are needed to crawl all Korean tweets/blogs?
How about crawling Facebook?
- How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
The solutions required go far beyond Sentiment Analysis.
But customers often can't afford or don't want crawling infrastructure and maintenance fees.
This brings new opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service).
Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are increasingly advanced users. They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require either benchmarks or a POC.

Future Work & Plan
• Future Work (On-going)
- Acquire more entries for the sentiment dictionary
- Build a framework for handling linguistic rules and statistical models (SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- A better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/semi) parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS
A long, long way to go…

Questions?
Thank you.
