A talk on EDHREC, a service for Magic: The Gathering deck recommendations. I discuss the algorithms used, my infrastructure, and some lessons learned about building data science applications.
EDHREC @ Data Science MD
1. EDHREC, Magic: TG
Recommendation Engine
(and data science on games)
Donald Miner @donaldpminer dminer@minerkasch.com
September 21st, 2015 - Data Science MD Meetup
Games & Stuff in Glen Burnie, MD
4. Talk agenda
Background
EDHREC Overview
EDHREC Data Analysis
EDHREC Architecture
Data Science Application UX Lessons Learned
Related Work in Magic and Other Domains
Virtues of Data Science on Games
5. Magic: The Gathering
Trading card game
First published in 1993
20 million players in 2015 (World of Warcraft has 7.1 million subscribers)
Organized tournaments
Secondary market
1993
$27,000
6. Elder Dragon Highlander / Commander
One of the Magic “formats”
Started independently from WOTC late 00’s
Officially supported starting 2011
Typically multiplayer
100-card singleton deck
(instead of 60-card, up to 4x copies)
Each deck has a single “commander”
(unique to this format)
7. Data Science
Term coined around 2008
Represents a shift in data
analysis in industry
A mix of computer science,
machine learning, statistics,
programming, visualization,
and domain knowledge
14. EDHREC Algorithm 1.0
User-based Collaborative Filtering
Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
Analogy:
Deck -> User
Card -> Item
Pros:
Better at picking up bigger themes in decks
Easy to implement
Cons:
Had issues discovering subtle deck themes
Had issues pointing out combos
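A minimal sketch of the user-based collaborative filtering analogy above, with decks standing in for users and cards for items. The toy decks, the use of Jaccard similarity as the deck-to-deck measure, the function names, and the `k` / `top_n` parameters are all assumptions for illustration; the slides do not say which similarity measure version 1.0 used.

```python
from collections import Counter

def jaccard(a, b):
    """Similarity of two card sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_user_based(target, decks, k=2, top_n=3):
    """Score cards found in the k decks most similar to `target`."""
    # Rank the other decks ("users") by similarity to the target deck.
    neighbors = sorted(decks, key=lambda d: jaccard(target, d), reverse=True)[:k]
    scores = Counter()
    for deck in neighbors:
        weight = jaccard(target, deck)
        # Only cards ("items") the target deck does not already run are candidates.
        for card in deck - target:
            scores[card] += weight
    return [card for card, _ in scores.most_common(top_n)]

# Hypothetical toy data, not real EDHREC decks.
decks = [
    {"Sol Ring", "Sanguine Bond", "Exquisite Blood", "Vampire Nighthawk"},
    {"Sol Ring", "Sanguine Bond", "Exquisite Blood", "Blood Artist"},
    {"Sol Ring", "Lightning Bolt", "Goblin Guide"},
]
my_deck = {"Sol Ring", "Sanguine Bond"}
recs = recommend_user_based(my_deck, decks)
```

In this toy run the two most similar decks both contain Exquisite Blood, so it ranks first.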
15. Recommendation Engine 2.0 Algorithm
31,000
decks
Decks that contain Sanguine Bond AND Exquisite Blood
÷
Decks that contain Sanguine Bond OR Exquisite Blood
Step 1: Card Affinity Matrix
Jaccard / Tanimoto coefficient
Repeat for every card combination
(15,000 cards)
This is the basis of the Card Analysis page
This matrix is built offline in batch
Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
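The AND ÷ OR ratio above can be sketched in a few lines of Python. This is only an illustration on made-up decks: the function name `build_affinity` and the dictionary layout are assumptions, and pairs that never share a deck are simply left out (their coefficient is 0).

```python
from collections import Counter
from itertools import combinations

def build_affinity(decks):
    """Return {(card_a, card_b): coefficient} for every pair that shares a deck."""
    card_count = Counter()  # number of decks containing each card
    pair_count = Counter()  # number of decks containing both cards of a pair
    for deck in decks:
        cards = sorted(set(deck))
        card_count.update(cards)
        pair_count.update(combinations(cards, 2))
    affinity = {}
    for (a, b), both in pair_count.items():
        # Decks with a OR b = decks with a + decks with b - decks with both.
        either = card_count[a] + card_count[b] - both
        affinity[(a, b)] = affinity[(b, a)] = both / either
    return affinity

# Hypothetical toy decks.
toy_decks = [
    {"Sanguine Bond", "Exquisite Blood", "Sol Ring"},
    {"Sanguine Bond", "Exquisite Blood"},
    {"Sanguine Bond", "Sol Ring"},
    {"Sol Ring"},
]
affinity = build_affinity(toy_decks)
```

Here Sanguine Bond and Exquisite Blood appear together in 2 decks and at least one of them appears in 3, so their coefficient is 2/3.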
16. Recommendation Engine 2.0 Algorithm
31,000
decks
1. Select each row of the Tanimoto matrix corresponding to cards in Deck D
2. Sum the columns
3. Sort by score, display results
Step 2: Calculate Scores
This gives you a sum of the Tanimoto coefficients
I really have no idea what this algorithm is called… I’m not sure if it’s novel or not
This is performed in real time
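The three steps above could look roughly like this. The nested-dict matrix, the coefficient values, and the function name are hypothetical, and skipping cards already in the deck is an assumption the slides do not state.

```python
from collections import defaultdict

def score_candidates(deck, matrix, top_n=10):
    """Sum the matrix rows for the deck's cards and rank the columns."""
    scores = defaultdict(float)
    for card in deck:
        # 1. Select the row for each card in the deck.
        for other, coeff in matrix.get(card, {}).items():
            if other not in deck:  # assumption: do not recommend cards already present
                # 2. Sum the columns.
                scores[other] += coeff
    # 3. Sort by score, highest first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical coefficients; only the rows needed for this deck are shown.
matrix = {
    "Sanguine Bond": {"Exquisite Blood": 0.5, "Sol Ring": 0.25, "Blood Artist": 0.25},
    "Blood Artist": {"Exquisite Blood": 0.25, "Sol Ring": 0.25, "Sanguine Bond": 0.25},
}
deck_d = {"Sanguine Bond", "Blood Artist"}
ranking = score_candidates(deck_d, matrix)
```

Because the matrix is precomputed, the per-request work is only a handful of row lookups and additions, which fits with running it in real time.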
17. Lessons learned:
Taking out the garbage
A lot of garbage gets submitted to EDHREC
Decks with <20 cards
Decks with invalid commanders
Decks with illegal cards
The algorithms handle this well and rarely do problem cards show up
However, pruning “worthless” decks significantly improves
performance due to all the O(N^2) algorithms going on
General advice: Think about which pieces of data are worthless in your data set
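A sketch of what such a pruning pass might look like, applying the three checks listed above. The deck dictionary shape and the `legal_cards` / `legal_commanders` lookup sets are assumptions for illustration only.

```python
def is_usable(deck, legal_cards, legal_commanders, min_cards=20):
    """Return True if a submitted deck is worth keeping for analysis."""
    cards = deck["cards"]
    return (
        len(cards) >= min_cards                          # drop decks with too few cards
        and deck["commander"] in legal_commanders        # drop invalid commanders
        and all(card in legal_cards for card in cards)   # drop decks with illegal cards
    )

# Hypothetical toy data; min_cards is lowered so the example stays short.
legal_cards = {"Sol Ring", "Sanguine Bond", "Exquisite Blood"}
legal_commanders = {"Oloro, Ageless Ascetic"}
submitted = [
    {"commander": "Oloro, Ageless Ascetic", "cards": ["Sol Ring", "Sanguine Bond"]},
    {"commander": "Oloro, Ageless Ascetic", "cards": ["Sol Ring"]},
    {"commander": "Sol Ring", "cards": ["Sanguine Bond", "Exquisite Blood"]},
    {"commander": "Oloro, Ageless Ascetic", "cards": ["Sol Ring", "Not A Real Card"]},
]
kept = [d for d in submitted if is_usable(d, legal_cards, legal_commanders, min_cards=2)]
```

Filtering before the pairwise steps matters because every deck removed shrinks the input to the O(N^2) work.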
18. Lessons learned:
Partitioning (too much or too little)
Partitioning the user/deck space into subgroups is a great way to speed things
up in recommendation engines
The 31,000 EDHREC decks are partitioned into 27 partitions
(one per possible color combination)
Algorithms are typically run on a single partition
(e.g., Red/Blue deck recommendations only come from other Red/Blue decks)
However, themes that span color combinations suffer worse recommendations
Partitioning too deep also causes problems
I tried partitioning by commander, and that was awful:
new commanders and themes that span commanders suffer
General advice: There is no good way to figure out a partition scheme; just try it out
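Partitioning by color combination might be sketched as below. The single-letter color codes and the deck dictionary shape are assumptions; the point is only that each deck lands in exactly one bucket and recommendations are computed per bucket.

```python
from collections import defaultdict

def partition_by_colors(decks):
    """Group decks by their exact color combination."""
    partitions = defaultdict(list)
    for deck in decks:
        # frozenset makes the key independent of the order the colors are listed in.
        partitions[frozenset(deck["colors"])].append(deck)
    return partitions

# Hypothetical toy data using one-letter color codes (U = blue, R = red, B = black).
all_decks = [
    {"name": "deck 1", "colors": "UR"},
    {"name": "deck 2", "colors": "RU"},
    {"name": "deck 3", "colors": "B"},
]
partitions = partition_by_colors(all_decks)
red_blue = partitions[frozenset("UR")]
```

A Red/Blue deck would then be scored only against `red_blue`, which is where the speedup comes from, and also why themes that cross color combinations lose signal.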
21. Batch Processes
(cron)
Reddit Bot
(praw)
Redis
• In-memory key/value data store
• Stores website state
• Utilized as a cache
• Stores all of the decks
• Stores all of the pre-computed stats
• Stores all metadata about Magic cards
• EDHREC serializes most things to common
internal json data formats
• Very fast
• Very easy to use
• Good support with Python
• Getting harder to do “analysis”
• Going to move to Redshift SQL database
for analytical things
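A rough sketch of the "serialize to JSON, store under a key" pattern described above. To stay runnable without a server it uses a tiny in-memory stand-in that exposes only `get` and `set`, the same two method names a redis-py client provides; the `deck:<id>` key scheme and the helper names are hypothetical, not EDHREC's actual layout.

```python
import json

class FakeStore:
    """In-memory stand-in for a key/value client (assumption: only get/set are needed)."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

def save_deck(store, deck_id, deck):
    """Serialize a deck to JSON and store it under a hypothetical 'deck:<id>' key."""
    store.set(f"deck:{deck_id}", json.dumps(deck))

def load_deck(store, deck_id):
    """Fetch and deserialize a deck, or return None if the key is missing."""
    raw = store.get(f"deck:{deck_id}")
    return json.loads(raw) if raw is not None else None

store = FakeStore()
save_deck(store, 42, {"commander": "Oloro, Ageless Ascetic", "cards": ["Sol Ring"]})
loaded = load_deck(store, 42)
```

Keeping everything as JSON strings is simple, but every analytical question becomes a custom script, which fits the note above about analysis getting harder.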
22. CherryPy
• “A Minimalist Python Web Framework”
• Runs the website
• Pulls data from Redis and then renders the
results as HTML
• Most of the data from Redis is cached in
memory objects (IPC to Redis too slow)
• EDHREC runs 6 of these in parallel behind
an NGINX round robin proxy
• Very easy to use, doesn’t get in your way
• Very easy to expose Python data science
• Running into problems with
maintainability due to my own sloppiness
23. Python
• Programming language
• Plenty of good libraries for data analysis:
numpy, pandas in this case
• Can handle the “full stack” well
(from data analysis to web front end)
• PRAW is a great framework for building
Reddit bots
• Most things run every few hours
24. Amazon Web Services
• Infrastructure as a Service
• Easily spin up new servers with
pre-built operating system
• EDHREC runs on one m4.2xlarge
8 CPUs, 32GB RAM, Better network
10 cents per hour ($72/month)
• Great for recovering from failures
• Easy to upgrade machine
• Very good uptime so far
• Easy to back up to S3
26. LOL! Look at the dumb bot!
Lesson learned:
Humans LOVE pointing out when something the AI is doing is strange or wrong,
even if it gets it right 90% of the time.
Therefore, I am very conservative about what I end up publishing, as
I’ve gotten burned a few times. Which can be a shame sometimes.
(just a couple examples)
27. The apocalypse is near
“EDHREC is ruining EDH/Commander”
“EDHREC is taking the fun out of deck construction”
“EDHREC kills conversation”
MapQuest takes the fun out of planning trips!
Mostly these are taken as compliments
AI is going to have resistance from people who liked the manual labor
I don’t think the commentary is entirely off base… but...
28. Sometimes too much is too much
Over-engineering and doing too much is an easy trap
You want to make it better and provide more “intelligence”
Give users the ability to discover and find things
Increases user engagement
Better results
Philosophy: EDHREC is a tool, not a solution
I’m starting to see my other data science projects this way
Lesson learned:
Spend more time on interactive “discovery tools”
than intelligent do-everything algorithms
33. Virtues of this whole thing
Community
Most hobbies are defined by communities
Technology can bring communities together
Self-Development
Data has value and getting data of value is hard
Hobby-based data is relatively easy to acquire (compared to, say, data used by
health care companies)
A great way to do real data science on real data (as opposed to synthetic data on a
more valuable data set)
Profit!
Hobbyists are passionate about their hobby and willing to spend money on it
They will pay for and support services they like
34. EDHREC, Magic: TG
Recommendation Engine
(and data science on games)
Donald Miner @donaldpminer dminer@minerkasch.com
September 21st, 2015 - Data Science MD Meetup
Games & Stuff in Glen Burnie, MD
Editor's notes
Building a Magic: The Gathering card game recommendation engine and using data science on data about hobbies
In this talk, Don will give an overview of edhrec.com, a service that provides recommendations for a specific style of play in the Magic: The Gathering trading card game called Commander. The service takes user-created "decks", saves them in a database, and then provides recommendations on what other cards that user should be using in their deck. The website has been around for about a year and is visited by over 50,000 players a month as of September 2015. The talk is geared towards people that don't know anything about Magic or Commander, however, and most of the time will be spent discussing: the methods and approaches used, specifically recommendation engines and the common problems when using them in practice; lessons learned about the human factor of having a data-driven service that targets a passionate hobbyist population that doesn't know much about data science or even computer science; and the virtues of spending time on analyzing data for seemingly "toy" domains.