This document summarizes a midterm project analyzing MLS player statistics and salaries. The project team scraped player stats and salary data from online sources and merged the data into a single dataframe. They analyzed correlations between salaries and various stats like goals and assists. Visualizations were created in R to show relationships between attributes. The team encountered challenges in data merging due to different platforms used and a lack of detailed player stats.
Discovering The Best Free Football Scouting Software
Major League Soccer Player Analysis
1. Stevens Institute of Technology
Web Analytics - Fall 2012
Midterm Project
By Chris Armstrong, Dan Derringer, Jude Ken-Kwofie,
Hemanth Mahadevaiah, and Sujana Veeraganti
2. Which player statistics (goals, assists, etc) are
most strongly correlated to a player‟s salary?
◦ What statistics determine a player‟s value?
Are MLS player salaries correlated to player
popularity?
3. Question #1: Which player statistics (goals,
assists, etc) are most strongly correlated to a
player‟s salary?
Method:
Extract player statistics from MLS.com
Extract player salary from MLS.com
Analyze to determine what correlation(s) exist
4. Question #2: Are MLS player salaries correlated
to player popularity?
Method:
Use salary information previously extracted
Use # of google search results for each player‟s
name to use as indicator for popularity:
David Beckham = 1,700,000 search results
Abdul Thompson Conteh = 2,480 search results
Popularity: David Beckham > Abdul Thompson Conteh
5. All-time stats scraped by Chris Armstrong
2007-2012 players‟ salaries scraped by Jude
Ken-Kwofie
Scripts combined by Daniel Derringer
Issues encountered:
Few players in the all-time stats received
salaries in 2007-2012
Merging the data
6. Data to be used:
The all-time stats and 2012 salaries
◦ Using the salaries from 2007-2011 eliminated too
many players for analysis
Merging the data:
Create a for loop in Python to merge all five
of the tables
◦ However, this took over 45 minutes to run
Write an R script to merge the tables
◦ However, was not very elegant
8. Process:
1) Iterate through each stat „type‟ (goals, assists, goalkeeping,
fouls, shots)
2) Extract all stats using Beautiful Soup/RE
3) Merge dictionaries into one Pandas DataFrame (dropping
duplicates)
4) Save output to CSV file
10. Open URL for 2012 Salaries
Save resulting PDF to local machine
Open PDF file and parse with PyPDF2
Extract player name and salary with Reg Ex
Concatenate Last Name, First Name
Merge on player name with MLS Stats
dataframe
11. Tools Used:
Google Custom Search API
urllib2: URL handling
JSON: Data structure
12. Use Google Custom Search API to iterate
through MLS Stats dataframe and search for
each player name + „MLS‟
example: “John Doe” MLS
Extract search result # from returned JSON
object
Append to MLS Stats dataframe
13. 926 Players
29 stat categories plus salaries and search
results
280 players with salary figures
All contained in one Dataframe object
CSV saved for each scraping process, as well
as for master table
14. Wanted to give users ability to point and click
options
Plotting on demand
High level access to script
Learn something new!
15. Tkinter is the defacto Python module for
creating user interfaces
Can be as simple as dialog boxes or complex
as games
Wide range of options and very flexible (menus,
radio buttons, checkboxes, etc)
16. Used Tkinter “widgets” to create simple dialog
box interfaces
Allows user to upload files via dialog box
Interactive plotting
◦ Pandas/Matplotlib
17. Due to the lack of publically available player
passing efficiency data we found it
challenging to build relationships between
salary and performance and to determine the
best players.
Analyzed player compensation versus player
goals, assists, shot as well as to simply
calculate statistics based on player minutes,
goal, assists, shots, shots on goal, game
winning goals and game winning assists
18. From a data set of 251 MLS players we
determined for the year 2012:
The average MLS player earns $200,262.58.
The lowest paid player, Jeb Brovsky, earns
$33,750.
The highest paid player is Thierry Henry. He
earns $5,000,000.
Out of the 251 players, 55.77% of the players
make salaries greater than or equal to
$100,000
20. The visual representation of the statistics was
generated with R, Matplotlib and Pandas.
Scatter plots and histograms were developed to
show:
Player compensation versus player goals,
assists and shots (scatter plots)
Player minutes, goal, assists, shots, shots on
goal, game winning goals and game winning
assists (histrograms)
21.
22. The plot shows exploratory data analysis of the various attributes
like Minutes, Goals, Shots, Assists, and Shots on Goals, Game
Winning Goals, Game Winning Assists and Salary to summarize the
main characteristics in easy-to-understand form
23.
24.
25. Remove possible confound of more
experienced players having higher “counting”
stats by converting all stats to be per game.
Soccer not necessarily a meritocracy, salary
more correlated to google search results than
any other metric (cause and effect issue?)
True player value challenging to measure
based on limited statistical information
26. Varying knowledge of Python and other
platforms created issues when combining and
editing code
Working on a team requires the right system
for effective collaboration (beware the danger
of email chains!)
When you think you‟ve debugged enough,
debug some more