Combating Abusive Language in Chat with Apache Spark with Wes Kerr


This presentation describes how Riot Games uses Apache Spark and machine learning to combat abusive language in League of Legends chat. Riot trained 256-dimension Word2Vec embeddings on a month of chat logs to explore chat vocabulary, surfacing spelling variants of terms such as "noob" and "rekt". The team then moved from desktop R/Python models to Apache Spark ML on AWS clusters to scale out both training data size and model complexity, with GPUs and TensorFlow planned for neural network models. The stated goal is to shield players from extreme toxicity in games.


Wesley Kerr
Riot Games
Combating Abusive Language in Chat with Apache Spark

Choose: From over 130 champions, each having a unique backstory and abilities.
Compete: With your team to complete objectives and battle the enemy team.
Win: Take down defenses and destroy the enemy nexus.

Teamwork
● 94.5% of players feel that sportsmanship matters

In-Game Toxicity
● 1% of all players are consistently unsportsmanlike
● 2% of all games infected by serious toxicity
● 95% of all serious toxicity comes from players who are otherwise sportsmanlike

Exploration: Word2Vec
[Diagram: Word2Vec illustrated on the chat line "i was out of mana". Labels: context words w1, w2, . . ., wN; Σ g(embeddings); target (wt); predict nearby words (wt).]
● 256-dimension embeddings
● Month of chat logs
● Each line of chat is a document
● Split on spaces and lower case
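
The preprocessing named on this slide (one chat line per document, split on spaces, lower-cased) can be sketched in a few lines of plain Python. This is only an illustration of that rule, not Riot's code; the function name `tokenize` and the sample chat lines are made up for the example.

```python
from typing import List


def tokenize(chat_line: str) -> List[str]:
    """Lower-case a chat line and split it on spaces, dropping empty tokens."""
    return [token for token in chat_line.lower().split(" ") if token]


# Hypothetical chat lines; each line is treated as its own document.
chat_log = ["I was out of mana", "GJ  team", "noob jungler QQ"]
documents = [tokenize(line) for line in chat_log]
```

Each entry of `documents` is then a list of tokens that an embedding model such as Word2Vec could be trained on.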

Exploration: closest words to "gj", "qq", "noob", and "rekt" (word, score)

gj: nj 0.94, goodjob 0.83, gjj 0.82, gjh 0.81, gj 0.79, gfj 0.78, gw 0.77, ty 0.77
qq: t.t 0.74, q.q 0.74, t-t 0.73, q_q 0.72, :'( 0.72, t_t 0.72, ;c 0.72, ;( 0.71
noob: nub 0.89, nooob 0.89, nob 0.83, n00b 0.80, nobb 0.79, noooob 0.79, noobb 0.78, nooooob 0.78
rekt: rekted 0.94, wrecked 0.83, wrekt 0.82, owned 0.81, shrekt 0.79, clapped 0.78, bodied 0.77, roasted 0.77
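
Lists like these are typically produced by ranking every other word by the cosine similarity of its embedding to the query word's embedding. The slides do not say which similarity measure was used, so cosine similarity is an assumption here, and the 3-dimensional vectors below are invented purely for illustration (the talk used 256 dimensions).

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word: str, embeddings: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k words whose vectors are most similar to `word`."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy vectors, not real embeddings.
toy_embeddings = {
    "noob": [0.9, 0.1, 0.0],
    "nub": [0.8, 0.2, 0.0],
    "n00b": [0.7, 0.1, 0.2],
    "gj": [0.0, 0.9, 0.1],
    "nj": [0.1, 0.8, 0.1],
}
neighbors = nearest("noob", toy_embeddings, k=2)
```

With these toy vectors, the spelling variants of "noob" rank ahead of the unrelated "gj" and "nj", which mirrors the pattern in the lists above.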

Enhancements
● Desktop
● R/Python
● AWS Clusters
● Apache Spark ML
● GPUs
● TensorFlow
Pros
Production
Extremely high precision
Cons
Limited data
Low recall
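
For readers less familiar with the two metrics: precision is the share of flagged messages that really are abusive, and recall is the share of abusive messages that get flagged. The sketch below uses invented labels to show how a model can be perfectly precise while still missing most abusive messages; none of these numbers come from the talk.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for binary labels (True = abusive)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented example: 4 abusive messages, but the model flags only one of them.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, False, False, False, False, False, False, False]
precision, recall = precision_recall(predicted, actual)
```

Here every flag is correct (precision 1.0) but three of the four abusive messages slip through (recall 0.25).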

Banko and Brill [2001]. "Scaling to Very Very Large Corpora for Natural Language Disambiguation".
Enhancements

Enhancements
● AWS Clusters
● Apache Spark ML
Pros
Scale out model complexity
Scale out training data size
Spark Machine Learning Library
● Extractors: Word2Vec, TF-IDF, CountVectorizer
● Transformers: n-grams, Tokenizer, Standard Scaler
● Algorithms: Logistic Regression, Random Forests, Gradient Boosted Trees
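
To make two of the listed feature steps concrete, here is a plain-Python sketch of what n-gram generation and TF-IDF weighting compute. It deliberately does not use Spark's API; Spark ML provides these as distributed pipeline stages. The IDF term follows the smoothed form log((m + 1) / (df + 1)) given in the Spark documentation, and the sample documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by raw term frequency times smoothed inverse document frequency."""
    m = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log((m + 1) / (df[term] + 1))
                        for term, count in tf.items()})
    return weights


# Invented tokenized chat lines.
docs = [["gj", "team"], ["noob", "team"], ["gj", "gj", "wp"]]
bigrams = ngrams(["i", "was", "out", "of", "mana"], 2)
weights = tf_idf(docs)
```

Terms that appear in fewer documents, such as "noob" here, receive a higher weight than terms like "team" that appear in several.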

Future
● GPUs
● TensorFlow
Pros
● Global Model
● Easy tokenization
Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.

Future
● GPUs
● TensorFlow
Hyperparameters
● Input: Vocab Size, Window Size, Stride
● Convolutions: Activation Function, Window Size, # of Feature Maps, Depth
● Connected: Activation Function, # of Hidden Nodes, Depth
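
A hyperparameter search over settings like these starts by enumerating candidate configurations. The sketch below builds a full grid with `itertools.product`; the parameter names echo the slide, but every value is a hypothetical placeholder, and the talk does not say which search strategy was used.

```python
from itertools import product
from typing import Any, Dict, List

# Hypothetical search space; names follow the slide, values are invented.
search_space: Dict[str, List[Any]] = {
    "vocab_size": [10000, 50000],
    "conv_window_size": [3, 5],
    "num_feature_maps": [64, 128],
    "conv_activation": ["relu", "tanh"],
    "hidden_nodes": [128],
}


def expand_grid(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one configuration dict per combination of values."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


configs = expand_grid(search_space)
```

Each configuration in `configs` is independent of the others, so the list could be handed out to separate workers and trained in parallel.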

Future
[Diagram: a Spark Driver and three Worker Nodes; each Worker Node has a Read/Write connection to S3.]

Conclusions
Apache Spark helps us to:
● Shield our players from extreme toxicity in games!
● Rapidly explore the space of solutions
● Scale to far larger datasets than we could process before
● Scale hyperparameter searches across neural network architectures

Thank you.
Wesley Kerr
WKERR @ RIOTGAMES.COM
