- How can you measure players’ perspective on localization quality?
- How do top PC games and publishers compare in terms of localization sentiment?
- Which genres are most sensitive to localization quality?
- What is the gold standard of localization quality in terms of player sentiment?
- What are the methodological challenges of localization sentiment analysis?
To answer these questions, we parsed more than 1 million Steam reviews written by Russian players, extracted and manually categorised those that mention localisation, calculated localisation sentiment scores, and tested a number of hypotheses. We also completed a pilot machine extraction of localisation sentiment with moderately accurate results.
Before we could get to the bottom of these questions, we had to overcome many technical and methodological challenges, and we’d like to share our findings. This research complements our 2015-2016 surveys of Russian and Chinese players and offers a different perspective: this time we observed real-life behaviour instead of surveying respondents.
2. Contents
•Rationale for localisation sentiment analysis
•Research scope, assumptions & limitations
•Parsing
•Marking
•Validating marks
•Calculating scores & benchmarking
•Some findings from the ranking
•Correlation with earlier research
•Automation of loc sentiment analysis
•Key takeaways
•Discussion
3. Rationale for localisation sentiment analysis
•Gather additional data for workflow and
vendor management improvement.
•Identify localisation quality advocates from
other game developers & team up.
•Select benchmark content for localisation
quality evaluation systems.
Better game localisation quality!*
*Not guaranteed. Results may vary.
4. Scope, assumptions & limitations
•PC (Steam) titles; no physical distribution.
•Games with 1M+ global owners as of May 2018 that
have a Russian version – total 267 titles
(non-random sampling).
•Reviews in Russian.
•“Most recent” and “Most helpful (all time)” reviews,
cap ≈ 200K entries.
•All non-specific sentiments are assumed to be about
main game (not the DLCs).
•All DLC-specific sentiment excluded.
•Absence of localisation sentiment is treated as
neutral localisation sentiment.
6. Initial rules
•Used wildcards (locali*)
•Used both Cyrillic and Latin script.
•перевод|perevod|переве|pereve|локализ|lokaliz|
русск|russk|язык|yazik|озвуч|ozvuch|дубляж|dubly|
субтитр|subtitr|опечат|opechat|граммат|grammat|
орфогр|orfogr|пунктуац|punkt|текст|tekst
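As a minimal sketch, the stem rules above can be compiled into a single case-insensitive alternation; the function name is illustrative, not from the original tooling.

```python
import re

# Keyword stems from the initial rules; each stem acts as a wildcard
# (e.g. "локализ" matches "локализация", "локализована", ...).
KEYWORD_STEMS = [
    "перевод", "perevod", "переве", "pereve", "локализ", "lokaliz",
    "русск", "russk", "язык", "yazik", "озвуч", "ozvuch", "дубляж",
    "dubly", "субтитр", "subtitr", "опечат", "opechat", "граммат",
    "grammat", "орфогр", "orfogr", "пунктуац", "punkt", "текст", "tekst",
]

# One case-insensitive pattern covers all stems in both scripts.
KEYWORD_RE = re.compile("|".join(map(re.escape, KEYWORD_STEMS)), re.IGNORECASE)

def mentions_localisation(review: str) -> bool:
    """Return True if the review contains any localisation keyword stem."""
    return KEYWORD_RE.search(review) is not None
```

In practice this pattern still needs the blacklist of exceptions described below (e.g. “texture” triggering the “text” stem).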
7. Parsing
•Steam: steam-scraper data scraper (Python) – the number of
available reviews is limited by Steam; speed ≈1,000 reviews
per minute (depends on review size).
• For mobile platforms we use google-play-scraper (JavaScript) –
4,400 reviews per language per game; standard quotas: 50,000
server requests per day, 10 requests per second, 1-hour
cooldown – and app-store-scraper (JavaScript) – 500 reviews per
territory, ≈5,000 reviews per minute.
•Deleted duplicates (need to ignore “page”, “page order”,
“date” and “username” fields).
•Extracted reviews with keywords (Notepad++ regular
expressions).
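The deduplication step can be sketched as follows, assuming each scraped review is a dict; the field names are illustrative and follow the list above.

```python
def deduplicate(reviews):
    """Remove duplicate reviews, ignoring volatile fields.

    Two entries count as duplicates when they match on every field
    except "page", "page_order", "date" and "username", which vary
    between scraping runs even for the same review.
    """
    IGNORED = {"page", "page_order", "date", "username"}
    seen = set()
    unique = []
    for review in reviews:
        key = tuple(sorted(
            (k, v) for k, v in review.items() if k not in IGNORED
        ))
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique
```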
8. Validating data
•Check for false positives on a small batch of data →
prepare a blacklist of keywords (exceptions).
e.g. “text” but not “texture”
e.g. “Russian” but not “Russian servers”
•“Grammar”, “punctuation”, “typo” – highly noisy
keywords (players refer to their own writing).
10. Marking localisation sentiment
• One review – One mark.
• Separate markers for presence (Y), absence (N) and
quality (- / +), as well as a neutral marker (0).
• Separate markers for VO (V, EV, RV) and Loc (L).
• If both localisation and VO are mentioned – mark localisation.
• If both negative and positive sentiment are present – mark negative.
• Sentiment about marketing assets only – ignore.
• If sarcasm is obvious – mark negative.
• Sentiment about DLC only – ignore.
• Sentiment about non-Steam version – ignore.
• Sentiment about technical problems – ignore.
• Manual marking output: 4-5 reviews per minute.
12. Calculating localisation sentiment scores
•User noted positive quality of loc = +1
•User noted negative quality of loc = -1
•User noted presence of loc = +1
•User noted positive quality of Rus. VO = +1
•User noted negative quality of Rus. VO = -1
•User noted presence of Rus. VO = +1 (only if Steam
shows availability of Russian VO)
•User noted absence of Rus. VO = -1 (only if Steam
shows unavailability of Russian VO)
•All other marks (voiceover sentiment with
unspecified language, unclear sentiment etc) = 0
•Reviews with no localisation sentiment = 0
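The mark-to-score mapping above can be sketched as a single function. The mark codes (L+, RVY, etc.) are illustrative shorthand, not the authors' actual notation.

```python
def mark_value(mark: str, steam_has_ru_vo: bool) -> int:
    """Map one manual mark to its score contribution.

    Illustrative mark codes:
    "L+"/"L-"   - positive / negative localisation quality
    "LY"        - localisation presence noted
    "RV+"/"RV-" - positive / negative Russian VO quality
    "RVY"/"RVN" - Russian VO presence / absence noted
    """
    if mark in ("L+", "LY", "RV+"):
        return +1
    if mark in ("L-", "RV-"):
        return -1
    if mark == "RVY":   # counts only if Steam confirms Russian VO exists
        return +1 if steam_has_ru_vo else 0
    if mark == "RVN":   # counts only if Steam confirms Russian VO is absent
        return -1 if not steam_has_ru_vo else 0
    return 0            # unclear sentiment, unspecified-language VO, no sentiment
```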
13. Calculating localisation sentiment scores
• Sentiment score = (Σ(+1) + Σ(−1)) / Total reviews parsed
•Not an absolute score: 0 does not signify a truly
neutral localisation sentiment, since users might be
X times more likely to express a negative sentiment
than a positive sentiment.
•Useful for comparing games against each other, or
different stages of the same product.
•Validation of the score:
The confidence interval of the “total reviews parsed”
sample (relative to the total estimated number of Russian
players for the title) must be at least 3× smaller than
the share of all localisation sentiments in the
sample! → 104 of the 267 marked titles qualify.
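The score and the validity rule can be sketched as below. The margin-of-error formula (worst-case normal approximation at p = 0.5) is an assumption, since the slides don't specify the exact confidence-interval computation.

```python
import math

def sentiment_score(values, total_parsed):
    """Weighted localisation sentiment score: (Σ(+1) + Σ(−1)) / total parsed."""
    return sum(values) / total_parsed

def score_is_valid(total_parsed, n_sentiments, z=1.96):
    """Validity check sketched from the rule above: the sample's margin
    of error must be at least 3x smaller than the share of localisation
    sentiments in the sample. Uses the worst-case normal-approximation
    margin at p = 0.5 (an assumption; the original formula is unstated).
    """
    share = n_sentiments / total_parsed
    margin = z * math.sqrt(0.25 / total_parsed)
    return margin * 3 <= share
```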
14. Preparing a ranking of titles
•Parameters in order of importance:
1. Localisation sentiment score.
2. Share of positive sentiments in total.
3. “Net promoter score”: Σ(+1) − Σ(−1)
•Parameters 1 and 2 are each split into 4 bands, depending on
where the value lies in the range across all titles in
the ranking.
16. Top 22 titles by % positive sentiment (highest first)*
• Kerbal Space Program
• DOOM
• Stardew Valley
• Tomb Raider
• The Witcher 2: Assassins of
Kings Enhanced Edition
• Titan Quest Anniversary Edition
• Neverwinter
• Trine 2: Complete Story
• Torchlight II
• Left 4 Dead 2
• Hitman: Absolution
• Alien: Isolation
• Game Dev Tycoon
• POSTAL 2
• Far Cry 3
• Everlasting Summer
• Team Fortress 2
• Portal 2
• Mafia II
• Mirror's Edge
• The Elder Scrolls V: Skyrim
• Tom Clancy’s The Division
*benchmark – 80%
17. Bottom 22 titles by % positive sentiment (lowest first)
• Line of Sight
• XCOM 2
• Total War: WARHAMMER II
• Sid Meier’s Civilization VI
• Fallout 4
• Warhammer: Vermintide 2
• HITMAN
• Sleeping Dogs: Definitive
Edition
• L.A. Noire
• Max Payne 3
• Grand Theft Auto V
• Chivalry: Medieval Warfare
• Total War: ATTILA
• Loadout
• SMITE
• Dead Space 2
• Mad Max
• Dying Light
• Batman: Arkham Origins
• Wolfenstein: The New Order
• Alan Wake
• Warhammer: End Times -
Vermintide
18. What is the sentiment benchmark?
Share of positive loc sentiments = 80%
Weighted loc sentiment score = +0.01
• These are the average sentiment scores for titles that ranked
high in our 2016 survey of players (avg 90%) and were
present in both data sets (2016 and 2018).
• The 90% cut-off point for the 2016 survey was chosen to include
all titles by Blizzard, which was selected for its unbeatably
consistent scores (lowest score: Overwatch, 92%).
22. • More sensitive to loc: Strategy, Adventure, RPG
• Less sensitive to loc: MMO, Action, Simulation, Casual
23. How does self-publishing affect localisation sentiment?
                                                   Positive sentiments (mean)   Loc sentiment score
Self-published titles (incl. by internal studios)            44%                     -0.0028
Titles with a dedicated external publisher                   56%                      0.0020
24. Other findings (treat with caution)
• Some correlation was observed between loc sentiment and
share of Russian players in the game’s audience:
Positive loc sentiment > 66% → 12% players were Russian
Positive loc sentiment < 33% → 9% players were Russian.
• No correlation was observed between Russian user score
and availability of Russian VO.
• Players’ localisation sentiment (share of positive
sentiments) is generally independent of whether they
recommend the game (the Russian user score); the same holds
for the user score within the sample of localisation-related reviews.
27. Automation: initial approach
•Divide all manually marked reviews into 2 sets – positive
and negative → Extract specific collocations of 2-6
words → update rules.
•Didn’t work:
1. Attribute word(s) often separated from the keyword.
2. Multiple grammatical forms / affixes.
3. Chains of attributes and keywords, endless variations:
Очень разочаровал русский перевод в игре: ошибки в
текстовых словах (даже в интерфейсе), ошибки в
переводе, в озвучке повторяются слова и звучат
банально, и дословно, в общем получаем мы
нелепую озвучку и перевод игры в целом.
(“The Russian translation in the game was very disappointing: errors in the text (even in the interface), errors in the translation, words repeat in the voice-over and sound trite and literal; all in all, we get a ridiculous voice-over and translation of the game overall.”)
28. Automation: search rules and keywords
•Working approach: Keyword base + Attribute base
(before / after the keyword) separated by 25 characters
(max.)
•≈ Pareto distribution of keyword frequency. The vast
majority of sentiments have any of the 6 keywords:
локализ, перев, русск, озвуч, дубляж, субтитр
•Non-linear correlation between number of dictionary
entries and resulting accuracy:
перевод, перевед, перевел, перевест (4 bases) can be
reduced to перев (1 base) with accuracy loss ≈5%
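A minimal sketch of the working approach: keyword bases combined with attribute bases within a 25-character window on either side. The attribute mini-dictionaries here are illustrative; the real ones are much larger.

```python
import re

# The 6 keyword bases that cover the vast majority of sentiments.
KEYWORD_BASES = ["локализ", "перев", "русск", "озвуч", "дубляж", "субтитр"]
# Illustrative attribute bases ("good", "excellent", "bad", "awful", ...).
POSITIVE_ATTRS = ["хорош", "отличн", "радует"]
NEGATIVE_ATTRS = ["плох", "ужасн", "кошмар"]

def _pattern(attrs):
    kw = "|".join(KEYWORD_BASES)
    at = "|".join(attrs)
    # Attribute may appear up to 25 characters before or after the keyword.
    return re.compile(
        rf"(?:{at})\w*.{{0,25}}(?:{kw})|(?:{kw})\w*.{{0,25}}(?:{at})",
        re.IGNORECASE,
    )

POS_RE = _pattern(POSITIVE_ATTRS)
NEG_RE = _pattern(NEGATIVE_ATTRS)

def classify(review: str) -> str:
    """Mark a review; both templates matching yields "both" (see the
    positive-and-negative rule in the attribute-validation step)."""
    pos, neg = bool(POS_RE.search(review)), bool(NEG_RE.search(review))
    if pos and neg:
        return "both"
    return "positive" if pos else "negative" if neg else "none"
```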
29. Automation: attribute words
•Manually compiled 2 dictionaries of most frequent
attribute bases (negative and positive).
•Validated each attribute to ensure accuracy:
• If the proportion of frequencies in negative : positive data
sets is less than 2:1 (or 1:2) → remove.
• If the frequency in the false positives data set is considerably
higher than in two other data sets combined → remove.
•If a review contains both positive and negative templates
→ mark as both positive and negative sentiment.
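The two validation rules can be sketched as a filter over candidate attribute bases. Interpreting “considerably higher” as “greater than the two other sets combined” is an assumption, as are the parameter names.

```python
def validate_attributes(attrs, neg_freq, pos_freq, fp_freq):
    """Keep only attribute bases that pass both validation rules.

    neg_freq / pos_freq / fp_freq map an attribute base to its frequency
    in the negative, positive and false-positive data sets respectively.
    """
    kept = []
    for a in attrs:
        n, p, f = neg_freq.get(a, 0), pos_freq.get(a, 0), fp_freq.get(a, 0)
        # Rule 1: the negative : positive frequency ratio must be
        # at least 2:1 (or 1:2); otherwise the attribute is ambiguous.
        lo, hi = sorted((n, p))
        if lo > 0 and hi / lo < 2:
            continue
        # Rule 2: drop attributes dominated by false positives
        # (here: more frequent than in the other two sets combined).
        if f > n + p:
            continue
        kept.append(a)
    return kept
```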
30. Automation: tips & tricks
• Complex sentences with contrasted parts and punctuation
signs can lead to a false positive:
"русская локализация радует, но сюжет плохой"
(“the Russian localisation is pleasing, but the plot is bad”).
Blacklisting all templates with punctuation signs marginally
improves accuracy (by 1-2%).
31. Automation: tips & tricks
• When multiple collocations have been detected in a review →
compare if any of the collocations include the others →
remove the mark for the inner one:
Хорошая русская локализация (“Good Russian localisation” –
the inner collocation “русская локализация” is dropped)
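This removal step can be sketched as a filter over the character spans of the detected collocations.

```python
def drop_inner_matches(spans):
    """Given (start, end) character spans of detected collocations in one
    review, keep only spans not contained inside another span, so a
    nested collocation is counted once."""
    return [
        (s, e) for (s, e) in spans
        if not any(
            (s2 <= s and e <= e2) and (s2, e2) != (s, e)
            for (s2, e2) in spans
        )
    ]
```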
32. Automation: tips & tricks
• Delete the space before the attribute and check for a negation
prefix (“non-”) or particle (“not”) → invert the mark.
Игра переведена не полностью (“The game is not fully translated”)
• This also helps to remove redundant terms from the
dictionaries.
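A sketch of the negation normalisation. The regex treats any “не” (“not”/“non-”) before a word as a potential negation, which over-triggers on unrelated words, so this is only an approximation of the rule, not the authors' implementation.

```python
import re

# "не" as a standalone particle or prefix, directly before a word.
NEGATION_RE = re.compile(r"\bне\s*(?=\w)", re.IGNORECASE)

def strip_negation(text: str):
    """Delete the space between the negation particle "не" and the
    following attribute, and report whether the sentiment mark
    should be inverted."""
    inverted = bool(NEGATION_RE.search(text))
    return NEGATION_RE.sub("не", text), inverted
```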
34. Automation: KPIs
•Machine competence = % correct : % inverted (≈ 8:1)
•Compared to human marks:
  60% identified correctly by machine (target = 80%)
  7% inverted
  33% not identified by machine
•Among machine marks: 33% noise (false positives)
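These KPIs can be computed from paired human and machine marks; the dict shapes and key names here are illustrative.

```python
def machine_kpis(human_marks, machine_marks):
    """Compare machine marks against human marks for the same reviews.

    Both arguments map review_id -> "+" or "-"; an absent key means
    no mark. Returns correct / inverted / missed rates (relative to
    human marks), noise (machine marks with no human counterpart)
    and machine competence (correct : inverted ratio).
    """
    correct = inverted = missed = 0
    for rid, mark in human_marks.items():
        m = machine_marks.get(rid)
        if m is None:
            missed += 1
        elif m == mark:
            correct += 1
        else:
            inverted += 1
    noise = sum(1 for rid in machine_marks if rid not in human_marks)
    n_h, n_m = len(human_marks), len(machine_marks)
    return {
        "correct": correct / n_h,
        "inverted": inverted / n_h,
        "missed": missed / n_h,
        "noise": noise / n_m if n_m else 0.0,
        "competence": correct / inverted if inverted else float("inf"),
    }
```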
35. Other challenges
• Keyword ambiguity: the Russian term “озвучка” usually
means “voice over” or “voice acting”, but can also mean
“sound design”.
• Typically the player mentions “voice over” without specifying
whether they mean Russian VO, English voice acting or something
else. How you interpret these cases depends on the purpose of
your analysis.
• Negative sentiment machine is harder to optimise (people
tend to use negative words more and combine them freely).
• Detecting sarcasm is hard.
• Complaints about absence of VO = negative loc sentiment?
36. Key takeaways
• Two meaningful sentiment scores – % positive and weighted.
• Benchmarks for localisation sentiment are 80% (% positive)
and 0.01 (weighted score).
• High % of loc-related reviews (> 1%) and high overall no. of
any reviews (> 2,000) are important for validity.
• Strategy, Adventure and RPG are more sensitive to
localisation than MMO, Action, Simulation and Casual.
• Some AAA developers and publishers are consistently better
than others. Self-published titles generally have worse loc
sentiment.
• The machine identifies at least 67% of loc sentiments compared
to a human, with at least 8:1 accuracy and 33% noise. Accuracy
can be further improved.
37. Our research team
Demid Tishin
founding partner
www.allcorrectgames.com
More research here! www.slideshare.net/dtishin
Need customised analysis? dtishin@gmail.com
Team: Dmitry, Arthur, Denis, Demid
38. • Do you measure players’ localisation sentiment?
• What challenges do you face on the way?
• What actions do you take based on the findings?
• E.g. revise localisation workflow, vendor pool, etc.
• How do you automate it?
• What are your benchmarks?
• How do you factor player sentiment in your
localisation quality evaluation systems?