Stevens Institute of Technology
                          Web Analytics - Fall 2012
                                     Midterm Project


By Chris Armstrong, Dan Derringer, Jude Ken-Kwofie,
       Hemanth Mahadevaiah, and Sujana Veeraganti
   Which player statistics (goals, assists, etc) are
    most strongly correlated to a player's salary?
    ◦ What statistics determine a player's value?

   Are MLS player salaries correlated to player
    popularity?
Question #1: Which player statistics (goals,
assists, etc) are most strongly correlated to a
player's salary?

Method:
 Extract player statistics from MLS.com
 Extract player salary from MLS.com
 Analyze to determine what correlation(s) exist
Question #2: Are MLS player salaries correlated
to player popularity?

Method:
   Use salary information previously extracted
   Use the number of Google search results for each
    player's name as an indicator of popularity:

       David Beckham = 1,700,000 search results
       Abdul Thompson Conteh = 2,480 search results
       Popularity: David Beckham > Abdul Thompson Conteh
   All-time stats scraped by Chris Armstrong
   2007-2012 players' salaries scraped by Jude
    Ken-Kwofie
   Scripts combined by Daniel Derringer

Issues encountered:
 Few players in the all-time stats received
  salaries in 2007-2012
 Merging the data
Data to be used:
 The all-time stats and 2012 salaries
    ◦ Using the salaries from 2007-2011 eliminated too
      many players for analysis

Merging the data:
 Create a for loop in Python to merge all five
  of the tables
    ◦ However, this took over 45 minutes to run
   Write an R script to merge the tables
    ◦ However, was not very elegant
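
The merge step above can be sketched as a keyed join instead of a nested loop. This is a minimal sketch and not the team's script: the sample rows, the `player` key, and the helper name `merge_tables` are assumptions for illustration, and it uses only the standard library. In pandas, calling `pandas.merge(left, right, on="player", how="outer")` table by table should give a comparable result.

```python
from collections import defaultdict


def merge_tables(tables, key="player"):
    """Outer-join a list of tables (lists of dict rows) on a shared key.

    Each row is visited once and filed under its key in a dict, so the
    cost grows linearly with the total number of rows.
    """
    merged = defaultdict(dict)
    for table in tables:
        for row in table:
            merged[row[key]].update(row)
    return [merged[name] for name in sorted(merged)]


# Invented sample rows; real tables would hold the scraped stat types.
goals = [{"player": "Doe, John", "goals": 12}, {"player": "Roe, Sam", "goals": 3}]
assists = [{"player": "Doe, John", "assists": 7}]
shots = [{"player": "Roe, Sam", "shots": 21}]

master = merge_tables([goals, assists, shots])
for row in master:
    print(row)
```

Players missing from a table simply lack that table's columns, which mirrors an outer join.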
Tools Used:

   Mechanize, urllib2: URL handling

   Regular Expressions, Beautiful Soup:
    Parsing, cleaning

   Pandas: Data manipulation
Process:
1)   Iterate through each stat 'type' (goals, assists, goalkeeping,
     fouls, shots)

2)   Extract all stats using Beautiful Soup/RE

3)   Merge dictionaries into one Pandas DataFrame (dropping
     duplicates)

4)   Save output to CSV file
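
Steps 2 and 4 of the process above might look roughly like the following. This sketch swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra packages. The table markup, column names, and helper names (`TableRows`, `parse_stat_table`, `rows_to_csv`) are hypothetical, since the slides do not show the real page layout.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_stat_table(html):
    """Return a list of dict rows keyed by the table's header cells."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, cells)) for cells in body]


def rows_to_csv(rows):
    """Serialize dict rows to CSV text, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical markup; the real page layout is not given in the slides.
SAMPLE = """
<table>
  <tr><th>Player</th><th>GP</th><th>G</th></tr>
  <tr><td>John Doe</td><td>30</td><td>12</td></tr>
  <tr><td>Sam Roe</td><td>18</td><td>3</td></tr>
</table>
"""

rows = parse_stat_table(SAMPLE)
print(rows_to_csv(rows))
```

Cell values stay as strings here; converting them to numbers would be a separate cleaning step.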
Tools Used:

   urllib2: URL handling

   Regular Expressions: Parsing, cleaning

   PyPDF2: Reading/extracting from PDFs

   Pandas: Data manipulation
   Open URL for 2012 Salaries
   Save resulting PDF to local machine
   Open PDF file and parse with PyPDF2
   Extract player name and salary with regular expressions
   Concatenate Last Name, First Name
   Merge on player name with MLS Stats
    dataframe
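
The extraction and name-concatenation steps above could be sketched as follows, assuming the PDF text has already been pulled out as plain lines. The line layout (last name, first name, position code, dollar amount), the "Last, First" output form, and the helper name `parse_salary_line` are all assumptions; the actual PDF layout may well differ and would need its own pattern.

```python
import re

# Assumed layout of one extracted text line: last name, first name,
# position code, then a dollar amount. The real PDF layout may differ.
LINE = re.compile(
    r"^(?P<last>\S+)\s+(?P<first>\S+)\s+(?P<pos>[A-Z/-]+)\s+"
    r"\$\s*(?P<salary>[\d,]+(?:\.\d+)?)"
)


def parse_salary_line(line):
    """Return (name, salary) for a matching line, or None otherwise."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    name = f"{match['last']}, {match['first']}"
    salary = float(match["salary"].replace(",", ""))
    return name, salary


# Invented lines standing in for text extracted from the PDF.
text = """Doe John F $ 120,000.00
Roe Sam GK $44,000.00
Page 1 of 12"""

# Lines that do not match (page footers, headings) come back as None
# and are dropped by filter().
salaries = dict(filter(None, map(parse_salary_line, text.splitlines())))
print(salaries)
```

Building the name in the same form as the stats table is what makes the later merge on player name possible.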
Tools Used:
   Google Custom Search API
   urllib2: URL handling
   JSON: Data structure
   Use Google Custom Search API to iterate
    through MLS Stats dataframe and search for
    each player name + 'MLS'
                example: “John Doe” MLS


   Extract the search result count from the returned
    JSON object

   Append to MLS Stats dataframe
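
The search steps above can be sketched with Python 3's standard library (the original work used `urllib2` under Python 2). The sketch only builds the request URL and reads the count from a response body, so it runs without network access. The credentials are placeholders, the response body is a trimmed, invented example, and the helper names are hypothetical. It assumes the response carries the hit count as a string under `searchInformation.totalResults`.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(player, api_key, engine_id):
    """Build a Custom Search JSON API request for '"<player>" MLS'."""
    params = {"key": api_key, "cx": engine_id, "q": f'"{player}" MLS'}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def total_results(response_text):
    """Read the estimated hit count from a JSON response body."""
    payload = json.loads(response_text)
    return int(payload["searchInformation"]["totalResults"])


# Placeholder credentials and a trimmed, invented response body.
url = build_search_url("John Doe", api_key="YOUR_KEY", engine_id="YOUR_CX")
sample_response = '{"searchInformation": {"totalResults": "1234"}}'
print(url)
print(total_results(sample_response))
```

In a full script, the URL would be fetched once per player and the resulting count stored alongside that player's stats.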
   926 Players

   29 stat categories plus salaries and search
    results

   280 players with salary figures

   All contained in one DataFrame object

   CSV saved for each scraping process, as well
    as for master table
   Wanted to give users the ability to point and
    click through options

   Plotting on demand

   High level access to script

   Learn something new!
   Tkinter is the de facto standard Python module
    for creating graphical user interfaces

   Can be as simple as dialog boxes or complex
    as games

   Wide range of options and very flexible (menus,
    radio buttons, checkboxes, etc.)
   Used Tkinter “widgets” to create simple dialog
    box interfaces

   Allows user to upload files via dialog box

   Interactive plotting
    ◦ Pandas/Matplotlib
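
A file-upload dialog of the kind described above might be sketched like this, using the Python 3 module names (`tkinter`, `tkinter.filedialog`). The function names, dialog title, and file filter are assumptions for illustration, not the team's interface code.

```python
import csv


def choose_csv_file():
    """Open a native file-picker dialog and return the chosen path.

    tkinter is imported here so the module still loads on machines
    without a display; an empty string means the user cancelled.
    """
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window behind the dialog
    path = filedialog.askopenfilename(
        title="Select the player stats CSV",
        filetypes=[("CSV files", "*.csv"), ("All files", "*.*")],
    )
    root.destroy()
    return path


def list_columns(path):
    """Return the header row of a CSV file, i.e. the plottable columns."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))
```

Calling `choose_csv_file()` and passing a non-empty result to `list_columns` yields the column names a user could then pick from for plotting.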
   Due to the lack of publicly available player
    passing efficiency data, we found it
    challenging to build relationships between
    salary and performance and to determine the
    best players.
   Analyzed player compensation versus player
    goals, assists, and shots, and calculated
    summary statistics for player minutes, goals,
    assists, shots, shots on goal, game-winning
    goals, and game-winning assists
From a data set of 251 MLS players we
determined for the year 2012:

   The average MLS player earns $200,262.58.
   The lowest paid player, Jeb Brovsky, earns
    $33,750.
   The highest paid player is Thierry Henry. He
    earns $5,000,000.
   Out of the 251 players, 55.77% of the players
    make salaries greater than or equal to
    $100,000
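
Figures like the ones above can be computed along these lines. The salaries in the sample are invented for illustration and are not the 2012 data; the function name `salary_summary` and its `threshold` parameter are likewise assumptions.

```python
from statistics import mean


def salary_summary(salaries, threshold=100_000):
    """Summarize a {player: salary} mapping."""
    lowest = min(salaries, key=salaries.get)
    highest = max(salaries, key=salaries.get)
    at_or_above = sum(1 for pay in salaries.values() if pay >= threshold)
    return {
        "mean": mean(salaries.values()),
        "lowest": (lowest, salaries[lowest]),
        "highest": (highest, salaries[highest]),
        "share_at_or_above": 100 * at_or_above / len(salaries),
    }


# Invented figures for illustration; not the 2012 salary data.
sample = {
    "Player A": 40_000,
    "Player B": 100_000,
    "Player C": 250_000,
    "Player D": 60_000,
}
summary = salary_summary(sample)
print(summary)
```

Because a few very high salaries pull the mean upward, reporting the median alongside it may give a fairer picture of a typical player.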
Basic Statistics
The visual representation of the statistics was
generated with R, Matplotlib and Pandas.
Scatter plots and histograms were developed to
show:

   Player compensation versus player goals,
    assists and shots (scatter plots)

   Player minutes, goals, assists, shots, shots on
    goal, game-winning goals and game-winning
    assists (histograms)
The plot shows an exploratory data analysis of Minutes, Goals, Shots,
Assists, Shots on Goal, Game-Winning Goals, Game-Winning Assists, and
Salary, summarizing the main characteristics of each attribute in an
easy-to-understand form
   Remove possible confound of more
    experienced players having higher “counting”
    stats by converting all stats to be per game.

   Soccer is not necessarily a meritocracy: salary is
    more strongly correlated with Google search results
    than with any other metric (cause-and-effect issue?)

   True player value challenging to measure
    based on limited statistical information
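
The per-game conversion and the correlation comparison described above can be sketched as follows. The rows are invented, so the printed coefficients illustrate the method only and do not reproduce the project's findings. The helper names `per_game` and `pearson` are assumptions, and Pearson's coefficient is used here because the slides do not say which correlation measure was applied.

```python
from math import sqrt


def per_game(total, games_played):
    """Convert a counting stat to a per-game rate (0.0 if no games)."""
    return total / games_played if games_played else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented rows: (goals, games played, salary, search results).
players = [
    (10, 20, 150_000, 9_000),
    (4, 25, 60_000, 2_000),
    (15, 30, 400_000, 50_000),
    (1, 10, 45_000, 1_500),
]

goals_per_game = [per_game(g, gp) for g, gp, _, _ in players]
salary = [row[2] for row in players]
searches = [row[3] for row in players]

print("goals/game vs salary:", round(pearson(goals_per_game, salary), 3))
print("search hits vs salary:", round(pearson(searches, salary), 3))
```

A high coefficient in either direction says nothing about causation, which is the cause-and-effect caveat raised above.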
   Varying knowledge of Python and other
    platforms created issues when combining and
    editing code

   Working on a team requires the right system
    for effective collaboration (beware the danger
    of email chains!)

   When you think you've debugged enough,
    debug some more

Major League Soccer Player Analysis
