1. Communicating KnowledgeSentiment Analysis Symposium
Lessons Learned from a VOC
Analysis System for a big Korean
Telecommunication Company
Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium
Nov. 9th 2011
Introduction
• Saltlux Inc. is located in Seoul, Korea, established in 1979 and renovated in 2003.
• Expertise domain:
Information Retrieval, Text/Data/Web/Graph Mining solutions and services based on
Semantic Web Technology.
• Main supported languages: Korean, Japanese, English. Other languages are handled with external
solutions.
• 70 employees in Seoul, one Development Center in Vietnam (12 employees)
One sales office in Japan (3 employees)
• Have several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland
• Have many partnerships with R&D (ETRI, KAIST, Universities…)
Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistic
• Sentiment Analysis
• Lessons Learned
• Future work
Project & Environment Description
• Needs of Customer
– Customer: Korean Corporation in Telecommunication
– Department of Voice of Customer Analysis
– Mission: analyze (human-typed) memos from all call centers to
identify major problems and produce reports for decision makers, in
order to improve quality of service and increase customer
satisfaction.
– Data: human typed notes covering any kind of questions from
customers
• Information about subscriptions
• Inquiry or complaint about devices (phones) or services, dealership
• Complaints about quality of communication
• etc.
Number of notes: ~200 thousand a day (~5 million a month).
Notes must remain searchable for 1 year (~60 million).
Project & Environment Description
• System (Main) Requirements
• Distinguish between simple inquiries vs. complaints
• Classify into categories/departments of services
• Monitor Trends of Topics in real-time, daily, weekly, monthly
• Compare trends/tendencies between time slices
• Find related Topics
• Manage personal vocabulary
• Anonymize personal data (people's names, telephone numbers, social
IDs, addresses, etc.)
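The anonymization requirement can be illustrated with a minimal sketch. It is not the system's implementation: the two patterns (a Korean mobile number such as 010-1234-5678, and a 13-digit social ID written as 6 digits, optional hyphen, 7 digits) and the placeholder tags are assumptions made for this example. People's names and addresses cannot be caught this way and would need named entity recognition.

```python
import re

# Assumed formats, for illustration only. Digit lookarounds are used instead of
# \b because Hangul letters count as word characters and memos often glue
# the label directly to the number.
SOCIAL_ID = re.compile(r"(?<!\d)\d{6}-?\d{7}(?!\d)")
PHONE = re.compile(r"(?<!\d)01[016789][-\s]?\d{3,4}[-\s]?\d{4}(?!\d)")

def anonymize(memo: str) -> str:
    """Replace phone numbers and social IDs with placeholder tags."""
    memo = SOCIAL_ID.sub("[SOCIAL_ID]", memo)
    memo = PHONE.sub("[PHONE]", memo)
    return memo

print(anonymize("연락처010 1234 5678"))
```

Pattern-based masking like this favors recall on well-formed numbers; free-form variants of the same field still need normalization first.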
The project started in October 2010 with a 3-month POC (~10 MM, i.e. man-months).
After acceptance (success), integration with the real system took
another 3 months (~10 MM).
2 phases: ~$200,000
VOC Data Sample
• Data often contain some structured information (metadata), but without any
standard.
• Most of the time, though, there is no particular mark/metadata,
which makes Named Entity Recognition more complex.
Example: the same information (연락처: phone number) is entered in many different ways.
VOC Data Analysis
• Data contains lots of named entities:
products/services/people/social IDs/phone numbers,
often related to privacy
• Data contains lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like)
but sometimes very long
• Lots of misspellings/mistypings
• Korean (Asian) segmentation problem, amplified by
the speed constraint
• Lots of (non-standard) abbreviations
System Overview
[Figure: architecture diagram. Boxes shown, in order:
Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing (Distributed Indexes), Classifier (Hybrid SVM & Rules), labelled "Analysis Phase";
Searching/Clustering (TopicRank), Timelines Dumper, DFS Timelines (20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df);
Scheduler, Merger & Ranker, Trend (TopN), DB, Web Server (Web UI), Complaint Detector]
• Overall Architecture
In the real system, for fast indexing, the system was parallelized across 18 Linux
machines.
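The Merger & Ranker step that turns timeline files into a Trend (TopN) list can be sketched as follows. This is only an illustration under stated assumptions: the file names are read as date, time slot and shard number (for example 20110713_0700_1.df), the contents of a .df file are stood in for by in-memory topic counts, and the topic names are hypothetical.

```python
from collections import Counter, defaultdict

def slot_of(filename: str) -> str:
    """'20110713_0700_1.df' -> '20110713_0700' (assumed layout: date_time_shard)."""
    return filename.rsplit("_", 1)[0]

def merge_and_rank(shards: dict[str, Counter], top_n: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Sum the topic counts of all shards of a time slot, then keep the top N topics."""
    merged: dict[str, Counter] = defaultdict(Counter)
    for name, counts in shards.items():
        merged[slot_of(name)].update(counts)
    return {slot: counts.most_common(top_n) for slot, counts in merged.items()}

# Hypothetical topic counts standing in for the contents of three timeline files.
shards = {
    "20110713_0700_1.df": Counter({"billing": 3, "roaming": 1}),
    "20110713_0700_2.df": Counter({"billing": 2, "signal": 4}),
    "20110713_0710_1.df": Counter({"signal": 1}),
}
print(merge_and_rank(shards, top_n=2))
```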
Korean Linguistic
• Brief introduction
Korean is written with an alphabet of consonants and vowels, composed into
blocks of consonant + vowel or consonant + vowel + consonant.
“나는 학생입니다.” => 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
Each such block of consonant + vowel or consonant + vowel + consonant is one
syllable (“eumjeol”). Syllables combine into space-delimited units called
“eojeol”, each made of one or more syllables.
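The composition of a syllable block can be checked with Unicode arithmetic. The sketch below relies only on the standard layout of the precomposed Hangul Syllables block (starting at U+AC00, 19 initial consonants × 21 vowels × 28 finals); the function and table names are chosen for this example.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 vowels
# 28 finals: index 0 means "no final consonant".
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed Hangul syllable into (initial, vowel, final)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, 21 * 28)
    vowel, final = divmod(rest, 28)
    return CHOSEONG[initial], JUNGSEONG[vowel], JONGSEONG[final]

print(decompose("나"))  # ('ㄴ', 'ㅏ', '')
print(decompose("학"))  # ('ㅎ', 'ㅏ', 'ㄱ')
```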
Basic grammar:
A word is composed of one root (noun, adjective or verb) followed
by an inflectional ending. For nouns, the ending marks the grammatical
role (subject, object, location, etc.) and is called “josa”.
For verbs and adjectives, it marks aspect or mood (tense, honorific form, etc.)
and is called “eomi”.
Korean Linguistic
• Examples:
“나는 학생입니다.” => “나는” = “나” (na: I/me) + “는” (neun: topic marker)
학생입니다 = “학생” (hak-saeng: student) +
“입니다” (im-ni-da: am) => I'm (a) student.
Lots of (composite) inflectional forms:
학생+입니다 = Noun + Be
학생 +인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
학생 + Syntactic Role (이:Subject/에게:To/한테:From/을:Object) etc. (josa)
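The root + josa decomposition above can be illustrated with a deliberately naive suffix splitter. It is a sketch, not a morphological analyzer: the particle list is a small assumed sample (는, 이, 을, 에게, 한테 come from the examples above; 은, 가, 를 are their common variants), and without a lexicon it will wrongly split any noun that merely ends in one of these syllables.

```python
# Small assumed sample of josa (noun particles), longest first so that
# two-syllable particles win over one-syllable ones.
JOSA = sorted(("에게", "한테", "는", "은", "이", "가", "을", "를"), key=len, reverse=True)

def split_josa(eojeol: str) -> tuple[str, str]:
    """Return (root, josa); josa is '' when no listed particle ends the eojeol."""
    for josa in JOSA:
        if eojeol.endswith(josa) and len(eojeol) > len(josa):
            return eojeol[: -len(josa)], josa
    return eojeol, ""

print(split_josa("나는"))      # ('나', '는')
print(split_josa("학생에게"))  # ('학생', '에게')
```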
Korean is highly agglutinative (it concatenates prefixes/nouns/josa/eomi)
Search Engine: 검색엔진.
High performance search engine: 고성능검색엔진
But the use of spaces is free/arbitrary.
One can write equivalently: 검색엔진 or 검색 엔진
Especially on SNS, on space-limited devices and under speed constraints
(like real-time transcription of conversations), spaces are more and more
often omitted or misused.
=> Need for automatic segmentation correction.
Korean Linguistic
• Automatic Segmentation Correction Implementation
Binary classification approach:
Tag each syllable according to whether a space precedes it or not.
Any kind of classifier can be used.
Here we use a CRF model (an SVM would also work)
with the set of features listed below.
Example input: 프랑스의 세계적인 디자이너 …
Results (CRF):
Accuracy at character level: 96.25%
Precision at word level: 95.58%
• Features
– 1gram, 2gram, 3gram, 4gram of characters (syllables)
– Korean or not, contains number
• Evaluation
– Accuracy (character)
– Word precision:
# correctly spaced words / # words produced by the system
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
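The data preparation and evaluation described on this slide can be sketched without committing to a particular CRF library. The code below is an illustration under assumptions: the feature naming scheme (L1..L4 for the n-grams ending before a position, R1..R4 for those starting at it) is invented for this example, and training the classifier itself is left out. It shows how correctly spaced text yields training labels for free, and how word precision is computed.

```python
def to_labels(spaced: str) -> tuple[str, list[int]]:
    """Strip spaces and label each character 1 if a space preceded it, else 0."""
    chars, labels, space_before = [], [], False
    for ch in spaced:
        if ch == " ":
            space_before = True
        else:
            chars.append(ch)
            labels.append(1 if space_before else 0)
            space_before = False
    return "".join(chars), labels

def features(chars: str, i: int, max_n: int = 4) -> list[str]:
    """Character 1- to 4-grams around position i, plus two character-type flags."""
    ch = chars[i]
    is_hangul = "가" <= ch <= "힣"  # precomposed Hangul syllable range
    feats = [f"hangul={is_hangul}", f"digit={ch.isdigit()}"]
    for n in range(1, max_n + 1):
        feats.append(f"L{n}={chars[max(0, i - n):i]}")  # n-gram ending just before i
        feats.append(f"R{n}={chars[i:i + n]}")          # n-gram starting at i
    return feats

def word_precision(gold: str, predicted: str) -> float:
    """# correctly spaced words / # words produced by the system."""
    def spans(text: str) -> set[tuple[int, int]]:
        result, pos = set(), 0
        for word in text.split():
            result.add((pos, pos + len(word)))  # offsets in the space-free string
            pos += len(word)
        return result
    produced = spans(predicted)
    return len(spans(gold) & produced) / len(produced) if produced else 0.0

chars, labels = to_labels("프랑스의 세계적인 디자이너")
print(chars, labels)
print(word_precision("프랑스의 세계적인 디자이너", "프랑스의 세계 적인 디자이너"))
```

Each (features, label) pair would then be fed to the sequence classifier; a word counts as correct only when both of its boundaries match the reference.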
Korean Linguistic
• Transliteration
- Korean uses more and more English-derived words,
transliterated phonetically into the Korean alphabet
(the reverse of “romanization”),
especially for foreign names (companies, products, people,
technical/domain terms).
– Transcription is neither unique nor standardized.
Examples:
tablet, 태블릿, 태블릿 , 타블렛, 테블릿
Hitachi, 히타치, 히타찌, 히다찌, 히타찌
iPhone 4s, 아이폰 4s, 아이폰포에스, 아이폰 포에스
Korean Linguistic
• Automatic transliteration recognition
- Build a rule-based transliteration matcher on phonetic
correspondences, acting similarly to Soundex but adapted to
Korean pronunciation.
tablet, 태블릿
T => ㅌ/ㄸ/ㄷ
A => ㅏ/ㅓ/ㅔ/ㅐ
etc.
This method has high recall but low precision and needs post-processing filters (remove
known Korean words found in lexicons, remove nouns that are too short, etc.).
The result has to be corrected by a human, so an efficient workbench is needed for productivity.
Gathered a dictionary of 130 thousand entries, mainly IT-oriented.
More academic research is still needed to solve this problem.
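One way to picture the Soundex-like idea is a consonant-skeleton key computed on the Korean side: variants of the same transliteration should collapse to the same key. This is a simplified sketch, not the rule set used in the project. Only the ㅌ/ㄸ/ㄷ grouping comes from the slide; the other consonant classes are assumptions, vowels are dropped entirely (as Soundex does), and silent ㅇ and compound finals are ignored.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Assumed phonetic classes: plain, tense and aspirated consonants share a code.
CLASSES = {"ㄱㄲㅋ": "K", "ㄷㄸㅌ": "T", "ㅂㅃㅍ": "P", "ㅈㅉㅊ": "J",
           "ㅅㅆ": "S", "ㄴ": "N", "ㄹ": "L", "ㅁ": "M", "ㅎ": "H"}
JAMO_CLASS = {jamo: code for group, code in CLASSES.items() for jamo in group}

def phonetic_key(word: str) -> str:
    """Consonant skeleton of a Hangul word; non-Hangul characters are skipped."""
    key = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 11172:
            continue
        initial = CHOSEONG[index // (21 * 28)]
        final = JONGSEONG[index % 28]
        for jamo in (initial, final):
            code = JAMO_CLASS.get(jamo)  # None for ㅇ, no final, compound finals
            if code:
                key.append(code)
    return "".join(key)

for variant in ("태블릿", "타블렛", "테블릿"):
    print(variant, phonetic_key(variant))
```

As the slide notes for the original method, a key this coarse gives high recall and low precision: unrelated native Korean words can share a skeleton, so lexicon-based filtering and manual review remain necessary.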
Sentiment Analysis
• Complaint Detection
A problem similar to standard subjectivity detection
(detecting whether a sentence is sentiment-bearing or not).
Simple approach: binary classification
using an SVM,
with manually tagged training/test corpora
(more than 20 thousand).
Feature space:
N-grams of characters (syllables) + N-grams of words;
using 2- to 4-grams gave the best results.
Feature selection is important to reduce the feature space.
Chi-square/Information Gain gave the best results.
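The chi-square feature selection step can be sketched in plain Python. This is an illustration only: the four memos are made-up English placeholders rather than project data, only character 2- to 4-grams are generated (word n-grams would be handled the same way), and the SVM that consumes the selected features is not shown.

```python
from collections import Counter

def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> set[str]:
    """Set of character n-grams (presence only, not counts)."""
    return {text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def chi_square(a: int, b: int, c: int, d: int) -> float:
    """2x2 chi-square. a/b: complaints/others containing the feature,
    c/d: complaints/others not containing it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def rank_features(docs: list[str], labels: list[int], top_k: int) -> list[tuple[str, float]]:
    """Score every n-gram against the complaint label and keep the top_k."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos_df if label else neg_df).update(char_ngrams(doc))
    scores = {f: chi_square(pos_df[f], neg_df[f],
                            pos_total - pos_df[f], neg_total - neg_df[f])
              for f in set(pos_df) | set(neg_df)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]

# Made-up placeholder memos: 1 = complaint, 0 = simple inquiry.
docs = ["network is bad", "bad call quality", "plan price question", "how to change plan"]
labels = [1, 1, 0, 0]
print(rank_features(docs, labels, top_k=5))
```

Keeping only the highest-scoring n-grams shrinks the feature space before training, which is the purpose the slide gives for this step.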
Sentiment Analysis
Problem: no freely available resource such as SentiWordNet.
Need to build it!
Built our own general-domain dictionary as a baseline:
20 000 verbs/adjectives classified as positive/negative/neutral.
The result is a lexicon of ~5000 entries (positive/negative only).
Enriched with features manually extracted from N-grams.
Precision-oriented (92%), but recall is still quite low (75%).
Overall accuracy: 85%
=> Still working on ways to improve recall without
sacrificing precision.
Basic ideas:
Bagging / boosting (combining several classifiers)
Hybrid models combining linguistic (semantic/syntactic) rules
and machine learning (statistics)
Lessons Learned
• Lessons Learned
- There is still quite a big gap between customer expectations and
reality. Need to explain this and keep the customer involved in the
assessment process and in knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons:
=> Named entities/Synonyms/Stopwords/Senti-Word
- The quality and quantity of these lexicons is a real asset for the
company. Acquiring lexicons requires workbenches for
efficient semi-supervised methods (manually filtering the output of
automatic methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic
knowledge, etc. consumes time and expertise.
- Simple academic methods work quite well (even if they need a lot of
tuning).
- Beyond a simple search engine, the quality of NLP components
is becoming more and more important, especially for sentiment
analysis.
Lessons Learned
• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social
Network/Intelligence”…
- More and more customers want to get data/opinions from outside their in-house systems
(blogs, communities (BBS), tweets, etc.). Typical questions:
How many crawlers are needed to crawl all Korean tweets/blogs?
How about crawling Facebook?
- How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
The solutions required go far beyond sentiment analysis.
But customers often can't afford, or don't want, crawling infrastructure and maintenance fees.
New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS
(Software/Platform/Infrastructure as a Service).
Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are more and more advanced users.
They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require benchmarks or a POC.
Future Work & Plan
• Future Work (On-going)
- Acquire more entries for the sentiment dictionary
- Build a framework for handling linguistic rules together with statistical
models (SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- Better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/semi) parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS
A long long way to go…