This document describes a project to create a federated ontology for sports that combines data from soccer, tennis, and cricket. The project team collected data on players, matches, rankings, and tournaments from various websites and databases. They then cleaned the data and created ontologies to model the domains and relationships for each sport. The cleaned data was mapped to the ontologies using a tool called Karma. The modeled data was published to a triplestore and SPARQL queries were run to test the system. The goal of the project was to integrate information from multiple sports sources into a single federated ontology.
Federated Ontology for Sports
CSCI586_Web_Group1_Topic1G: Database Interoperability Project Report
Abhishek Agrawal
agra47@usc.edu
George Sam
gsam@usc.edu
Hari Haran Venugopal
hvenugop@usc.edu
Noopur Joshi
noopurbj@usc.edu
Abstract: Our project aims to provide concise information
on player backgrounds and tournament details (schedule,
location, etc.). We have created a federated ontology that
covers soccer, tennis and cricket. The data sets from each
sport are then modeled against this ontology. We then run
queries on the system to test our results.
Keywords: RDF, Scraping
I. INTRODUCTION
The world of sports is an important application area for
technology. Every sport continuously produces data, so it is
crucial to develop systems that continuously analyze this
data and extract information from it. Many people follow
multiple sports and want information about their favorite
sports at their fingertips. With the growth of mobile
platforms, this information can be accessed easily. Many
systems have been developed that use concepts from Natural
Language Processing, Machine Learning and Artificial
Intelligence to extract information from sources, process it
and display the results. The problem is that the data is
continuous and constantly increasing. Websites are typically
dedicated to a specific sport and rarely provide the user
with information about other sports. The idea of this
project is to create an ontology-based system that can
combine data from multiple sources. This data is then
modeled into a federated ontology for sports, consisting of
data models and ontologies from multiple sources. In our
system, we have started by modelling cricket, tennis and
football (soccer).
II. MOTIVATION
A. Ontology for Sports
There are existing ontologies for sports. These ontologies
cover most of the information required within a single sport.
In cricket, for example, the ontology covers player
information such as matches played, average batting score and
average bowling score, but it is specific to cricket. Similar
ontologies exist for tennis and other sports. Because these
sports are structurally different, their ontologies are also
very different. It is thus necessary to create an ontology
that integrates the information about multiple sports.
In today’s world of technology, we have access to tablets,
mobile devices, laptops, desktops and so on. There is a need
for constant, intelligent, up-to-date, integrated and
detailed information from the Web. Although a lot of
information is available on the Web, it is unstructured and
varies from source to source. Ontologies provide a way of
maintaining this unstructured information in one format that
can be easily shared.
Having a common ontology helps to aggregate data from
various sources. This data can be shared with other users or
applications.
B. Need for a Federated Ontology
Most people follow one or more sports, but they have to
search for information about each sport on a different
website. How long would it take to search for “all sports
players from England”? You would have to search each sport
individually. Consider another situation: your favorite
sportsperson is “Rafael Nadal” and you want the latest
updates about him. You still have to search each source
individually to get up-to-date information about him.
A federated sports ontology can represent different sports
and present a common view. A federated ontology is also
extensible, in the sense that more information, such as new
player details, player statistics, new sports or changes in
the rules of a game, can easily be added to previously
gathered data.
One of the core benefits of a federated ontology, or of the
“Semantic Web” in general, lies in data analysis. Data
analysts spend a lot of time converting unstructured data
into structured data. Storing data based on a common
ontology solves this problem and saves a lot of time. Simple
statistical questions like “How do the players and/or teams
measure up against one another in various categories?” can
easily be answered using a federated ontology.
Ontologies can also be used in applications like news feeds.
Federated ontologies help us to combine editorial coverage
of sports with all data feeds, presented in one place.
C. Our System
Our system for a federated ontology models information
about Cricket, Tennis and Football (soccer). This information
covers players, the sporting events, and the rankings of
teams and players. Since these sports are structurally
different, it is important to identify fields and attributes
of players and rankings that are similar across all the
sports. Information was gathered about the following
domains:
Cricket
o Player Information
o Player Rankings for T20, ODI and for Test Match
types
o Team Information
o Rankings for teams in T20, ODI and Test Match types
Tennis
o Player Information including ranking
o Tournament Information for Wimbledon, US Open,
Australian Open and French Open
Football
o Player information from the English Premier League
and from the Spanish La Liga
o Information about the games played for the 2 leagues
To limit the size of data collected, we collected data for the
duration from 2004 to 2014 (10 years). The information was
collected in multiple files pertaining to each domain and type.
These files were merged into single files.
Steps for the Project:
1. Data Collection and Data Scraping
2. Data Cleaning
3. Ontology Creation
4. Data Modelling
5. Data Publishing
6. Running queries to extract information
The diagram below highlights these steps.
Fig 1. System Development Cycle
III. DATA SCRAPING
The data for the different sports was collected by developing
web scrapers. We developed the scrapers in Java and Python
using existing libraries. The scrapers crawl the web sites
and collect information in the form of JSON or CSV files.
1. Data Scraping for Cricket
Cricket Dataset:
To collect the dataset for cricket we scraped the websites
below:
http://www.icc-cricket.com/player-rankings/overview
http://www.espncricinfo.com/ci/content/player/index.html
http://cricsheet.org/
We collected the player and team information from
espncricinfo.com and the ranking data from icc-cricket.com.
To collect the data on T20, T20I, ODI and Test matches, we
used cricsheet.org.
Scraper used:
Chrome Web Scraper: With this tool we created sitemaps and
scraped data using them. It is a very handy and powerful
tool.
Fig 2. ICC-Cricket Website Screenshot
YAML Java Library:
The match data set was in YAML format. We converted this
data to JSON format, extracting the required fields, using
the SnakeYAML library.
2. Data Scraping for Tennis
Source: www.atpworldtour.com
Scraper Tool: Beautiful Soup
Beautiful Soup is a Python library with the following
features:
Simple methods and Pythonic idioms for navigating,
searching, and modifying a parse tree: a toolkit for
dissecting a document and extracting content.
The library converts incoming documents to Unicode and
outgoing documents to UTF-8.
Beautiful Soup sits on top of popular Python parsers like
lxml and html5lib, so different parsing strategies can be
tried, or speed can be traded for flexibility.
Steps taken to retrieve dataset for tennis domain:
1. We focused our data set on the top 100 ranked
professional tennis players from atpworldtour.
2. For each tennis player we retrieved attributes such as
Rank, Age, Birthplace, Residence, Height, Weight, Plays,
TurnedPro, Coach, Website and Personal History using
Beautiful Soup in a Python script.
3. We also scraped the details of all the Grand Slams, such
as year, winner and score, for each tournament: Australian
Open, French Open, Wimbledon and US Open.
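The attribute extraction in step 2 can be sketched as follows. This is only an illustration, not the project's script: it uses Python's standard-library html.parser in place of Beautiful Soup so that it runs without extra packages, and the HTML snippet, the "label" and "value" class names, the sample values and the ProfileParser class are all hypothetical stand-ins for the real ATP page markup.

```python
import json
from html.parser import HTMLParser

# Hypothetical profile markup with made-up values; the real ATP page
# layout is not described in this report.
SAMPLE_HTML = """
<ul id="playerBio">
  <li><span class="label">Age</span><span class="value">28</span></li>
  <li><span class="label">Birthplace</span><span class="value">Exampleville, Spain</span></li>
  <li><span class="label">Plays</span><span class="value">Left-handed</span></li>
</ul>
"""


class ProfileParser(HTMLParser):
    """Collects label/value pairs from <span class="label"> and <span class="value">."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._current_class = None  # class of the <span> currently open, if any
        self._pending_label = None  # last label seen, waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._current_class == "label":
            self._pending_label = text
        elif self._current_class == "value" and self._pending_label:
            self.attributes[self._pending_label] = text
            self._pending_label = None


def scrape_profile(html):
    """Parse one profile page and return its attributes as a dict."""
    parser = ProfileParser()
    parser.feed(html)
    return parser.attributes


profile = scrape_profile(SAMPLE_HTML)
print(json.dumps(profile, indent=2))
```

Running the same function over each of the top 100 profile pages and collecting the dictionaries in a list would give the kind of JSON data set described above.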
Fig 3. ATP Website Screenshot
3. Data Scraping for Football
Data Source:
http://www.soccerbase.com/
https://github.com/openfootball/en-england
http://www.footballsquads.co.uk/spain/
http://www.footballsquads.co.uk/england/
Scraper Used: JSoup
JSoup is a Java library for scraping websites. It provides
functions to get information in the form of text or in the
form of HTML tags. This information can then be parsed to
extract tag attributes, tag content, etc.
The data is collected in the form of JSON files. These files
are then combined to create one file for players in the
English Premier League, one for the Spanish La Liga and one
file for information about the teams.
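Combining the per-page JSON files into one file per league can be sketched as below. The report does not say how the merge was done, so this is an assumed approach; the file names, the record layout (a JSON list of player objects per file) and the merge_json_files helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_json_files(paths, out_path):
    """Concatenate the record lists from several JSON files into one file."""
    merged = []
    for path in sorted(paths):  # sorted for a reproducible record order
        with open(path, encoding="utf-8") as handle:
            merged.extend(json.load(handle))
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=2)
    return merged


# Demo with two made-up squad files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "club_a.json").write_text(
        json.dumps([{"name": "Player One"}, {"name": "Player Two"}]), encoding="utf-8"
    )
    (tmp_dir / "club_b.json").write_text(
        json.dumps([{"name": "Player Three"}]), encoding="utf-8"
    )
    merged = merge_json_files(tmp_dir.glob("club_*.json"), tmp_dir / "players.json")
    reloaded = json.loads((tmp_dir / "players.json").read_text(encoding="utf-8"))

print(len(merged), "records merged")
```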
Fig 4. SoccerBase Website Screenshot
IV. DATA CLEANING
Data was cleaned using Google OpenRefine and Karma. The
files had attributes like player profile and other information
which was irrelevant for the data sets. These attributes create
problems while mapping to data models.
This step is important because conversion of files from one
format to another generates a lot of discrepancies. Entities like
white spaces and new line characters tend to break data
objects and give errors while modelling. In our project we
performed cleaning of Tennis data in Karma, while cricket and
football data were cleaned using Google OpenRefine.
Stages of Data Cleaning:
• Import Data: We imported the data as JSON, XML and CSV
files, since our data sources were very diverse and we used
different web scraping tools.
• Merge Data Sets: We merged all the content belonging to a
single domain into one file, e.g. the details of all 4 Grand
Slams were combined into a single file.
• Rebuild Missing Data: We filled missing data with empty
values.
• Standardize and Normalize Data: Some fields were merged
using Karma PyTransform scripts, e.g. First Name and Last
Name into Name. Other fields were separated into multiple
columns, e.g. Height, which contained data in inches and cm,
was split into a separate field for each unit. We also
discarded all the irrelevant fields.
• De-Duplicate: All duplicate values, which contributed to
noisy data, were discarded.
• Verify, Enrich and Export Data: Once the dataset was
cleaned as required for the federated ontology mapping, we
exported it in JSON format to be consumed in the next phase,
Data Modeling.
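The stages above can be illustrated with a small, self-contained sketch. The project performed these steps in OpenRefine and Karma; the plain Python below only mirrors the logic, and the field names (FirstName, LastName, Height, Birthplace), the height format and the sample records are assumptions made for the example.

```python
import re


def clean_records(records):
    """Merge name fields, split height by unit, fill gaps and drop duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        row = {}
        # Standardize: merge first and last name into a single Name field.
        first = (rec.get("FirstName") or "").strip()
        last = (rec.get("LastName") or "").strip()
        row["Name"] = " ".join(part for part in (first, last) if part)
        # Normalize: split a combined height string into one field per unit.
        height = rec.get("Height") or ""
        inches = re.search(r"(\d+)\s*in", height)
        cms = re.search(r"(\d+)\s*cm", height)
        row["HeightInches"] = inches.group(1) if inches else ""
        row["HeightCm"] = cms.group(1) if cms else ""
        # Rebuild missing data: absent values become empty strings.
        row["Birthplace"] = (rec.get("Birthplace") or "").strip()
        # De-duplicate on the full cleaned row.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


# Made-up input: the second record duplicates the first once whitespace is trimmed.
raw = [
    {"FirstName": "Alex ", "LastName": "Sample", "Height": "73 in (185 cm)",
     "Birthplace": "London, England"},
    {"FirstName": "Alex", "LastName": "Sample", "Height": "73 in (185 cm)",
     "Birthplace": "London, England"},
    {"FirstName": "Sam", "LastName": "Example", "Height": None},
]
cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```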
Fig 5. Data Cleaning Steps
V. ONTOLOGY CREATION
Protégé is a free and open source ontology editor which was
developed at Stanford. Protégé defines a graphical user
interface to define ontologies.
Our project required the creation of a federated Sports
ontology and we used Protégé for the creation of the ontology.
The ontology of our project is shown below.
Fig 6. Ontology Graph
Ontology can be created either by the top-down approach or
the bottom-up approach. The top-down approach of creating
an ontology involves the definition of the most general
concepts in the domain and then specifying the specialization
of the concepts. In our case the first step would be to define
concepts for all sports in general. The next step would be to
define the specialization of every sport like Football, Cricket
and Tennis used in our project.
The bottom-up approach starts with the definition of the most
specific concepts, the leaves of the hierarchy, and then
groups them into more general concepts. For example, in our
ontology it would mean defining the CricketPlayer,
FootballPlayer and TennisPlayer classes first and then going
all the way up the hierarchy of the ontology.
There also exists a hybrid approach, which combines the
top-down and bottom-up approaches. Here we define the salient
concepts first and then either specialize or generalize
them. We may start with the Sports and Player classes, then
specialize the individual player classes, and continue with
other classes like Match or Tournament.
In the creation of our ontology we have used the Hybrid
approach.
The class hierarchy of our ontology is shown below:
Fig 7. Class Hierarchy
All the subclasses are mapped under the Top-Level classes.
For example the CricketPlayer is mapped under the Players
class.
The object properties of our ontology are shown below:
Fig 8. Object Properties Screenshot
These properties help us to draw inferences among the
different concepts. For example looking at our class hierarchy
and the list of Object Properties we can show the relation that
a Player is an AthleteOf a Sport.
Fig 9. Property Mapping
While defining an object property we can define both the
domain and the range of the property in Protégé. For example,
consider the above relation between the Player and the Sport
class. The object property being defined here is "AthleteOf".
An instance of the Player class is an AthleteOf an instance
of the Sport class. Thus the domain of the property is the
Player class and the range is the Sport class.
The data properties of our ontology are shown below:
Fig 10. Data Properties
Data properties specify the attributes associated with each
class.
In our ontology we have data properties such as MatchTeam1
and MatchTeam2, which specify the teams that played each
match. When creating a data property in Protégé, we can
specify the domain and the range of the property. For
example, consider the Height property. It is a data property
specified for all players in general, so the domain of the
property is the class Player. Since the height of a player
is represented as either an integer or a float, the range of
the property is either integer or float.
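The domain and range declarations described in this section correspond to a few lines of Turtle. The sketch below generates them from Python; it is not an export from the project's Protégé file. The cs586 namespace is taken from the prefixes used in the queries later in this report, the declare_property helper is hypothetical, and xsd:float is chosen here as one of the two ranges the text allows for Height.

```python
PREFIXES = (
    "@prefix cs586: <http://www.semanticweb.org/CS586/> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def declare_property(name, kind, domain, range_):
    """Return a Turtle declaration for one property in the cs586 namespace.

    kind is "ObjectProperty" or "DatatypeProperty"; range_ is a prefixed name.
    """
    return (
        f"cs586:{name} a owl:{kind} ;\n"
        f"    rdfs:domain cs586:{domain} ;\n"
        f"    rdfs:range {range_} .\n"
    )


turtle = "\n".join([
    PREFIXES,
    # Object property: a Player is an AthleteOf a Sport.
    declare_property("AthleteOf", "ObjectProperty", "Player", "cs586:Sport"),
    # Data property: Height applies to every Player (float chosen for the sketch).
    declare_property("Height", "DatatypeProperty", "Player", "xsd:float"),
])
print(turtle)
```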
VI. DATA MODELLING
A. Tools
There are many tools available to create a data model for
this federated ontology. The most commonly used tools are
Protégé and Karma. In our system we used Karma to model the
data sets. The steps used to model the data are:
• Import our sports ontology along with the other OWL
files into the Karma workspace
• Import the data sets one at a time into Karma
• Create Semantic mappings for every attribute in the
data set
• Create class URIs for attributes which can be used as
Keys
• Publish the model
• Load it into the triple store
Each data set was modelled one at a time. The diagram below
shows the data model for tennis players. The player JSON
file had many attributes that needed modelling. The model
shows the classes used and the semantic information for the
properties. We intended to merge this data set with the
information about the tennis tournaments: Wimbledon, French
Open, US Open and Australian Open. For this merge it is
important to create an attribute that acts as a key, in the
form of a URI field. We decided to base the URI on the
player name. The data set did not have a URI field, so we
created one using Python transformations, adding a new data
field called PlayerURI. This field acts as the class URI for
the Player class. When the data is loaded into the triple
store, this field acts as the key for merging the data sets.
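A transformation of this kind can be sketched as follows. The report does not give the actual script, so the base URI, the normalization rules (accent stripping, lower-casing, underscores) and the make_player_uri name are assumptions; in Karma the same expression would live in a Python transformation applied to the player-name column.

```python
import re
import unicodedata

# Hypothetical base; the report does not state the URI pattern that was used.
BASE_URI = "http://localhost:8080/source/player/"


def make_player_uri(name):
    """Build a stable URI from a player name so that data sets can be joined on it."""
    # Strip accents so that "José" and "Jose" produce the same key.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_name).strip("_").lower()
    return BASE_URI + slug


print(make_player_uri("Rafael Nadal"))
print(make_player_uri("  Rafael   Nadal "))
```

Because both calls yield the same URI, a player record and a tournament record that spell the name with different spacing would still merge in the triple store. Using the name alone as a key assumes that no two players share a name.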
We set the PlayerURI field as the class URI; Karma shows
this link as a dashed green line. The other attributes are
class properties. As we built the data model up to
higher-level classes, Karma automatically suggested
properties such as dcterms:type between the Players class
and the TennisPlayer class.
Fig 11. Tennis Player Model- Karma
Once a data model was created, we checked that all semantic
mappings were correct and matched the ontology. Below are
screenshots for cricket and football players.
Fig 12. Cricket Player Model- Karma
VII. DATA PUBLISHING
The next important step after modelling the data is to
publish it to a triple store. Karma creates bindings for
every attribute and class. The data model maps onto the
ontology we have created. We have class mappings for every
football player, cricket player and tennis player.
Similarly, for our games data sets, we created models for
the tournament information.
We used OpenRDF to publish the data. OpenRDF comes
integrated with Karma. It enables the user to link the
triple store repository with the Karma instance, thus
providing an easy way to transfer the models and the RDF
from Karma into the triple store.
Steps:
1. Create a Repository in OpenRDF
2. Publish the RDF files from Karma
3. Load the data into OpenRDF in the form of contexts
4. Verify the triples
Below are a few screenshots of the triple store with the
contexts and the RDF triples. Each context stores the RDF
related to one particular sport.
The contexts are:
http://localhost.com/tennis
http://localhost.com/cricket
http://localhost.com/football
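In this project the data was loaded through Karma's OpenRDF integration. For readers who want to script the same step, the sketch below shows how one RDF file per sport could be sent to a named context through the Sesame HTTP interface. The server location, the repository ID "sports", the RDF/XML content type and the build_upload_request helper are assumptions, and the request is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

SESAME_URL = "http://localhost:8080/openrdf-sesame"  # assumed server location
REPOSITORY_ID = "sports"  # hypothetical repository name

# Context URIs as listed in this report.
CONTEXTS = {
    "tennis": "http://localhost.com/tennis",
    "cricket": "http://localhost.com/cricket",
    "football": "http://localhost.com/football",
}


def build_upload_request(sport, rdf_bytes, content_type="application/rdf+xml"):
    """Build (but do not send) a POST that adds RDF data to one sport's context."""
    # The context is passed as an N-Triples encoded URI, i.e. wrapped in <...>.
    context = quote("<" + CONTEXTS[sport] + ">", safe="")
    url = f"{SESAME_URL}/repositories/{REPOSITORY_ID}/statements?context={context}"
    return Request(
        url, data=rdf_bytes, method="POST", headers={"Content-Type": content_type}
    )


request = build_upload_request("tennis", b"<rdf:RDF/>")  # placeholder payload
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the upload against a running server.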
Fig 13. RDF Triples in OpenRDF
VIII. QUERIES
The following SPARQL queries were run on the data sets to
test the results of the modelling:
1. To extract names of players from all sports living in
England
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cs586: <http://www.semanticweb.org/CS586/>
PREFIX schema: <http://www.schema.org/>
PREFIX local: <http://localhost:8080/source/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?Pname ?type ?loc
WHERE
{
  {
    ?pl cs586:PlayerName ?Pname;
        dcterms:type ?p;
        cs586:MemberOf ?team.
    ?team cs586:TeamName ?loc.
    ?p rdf:type ?type.
    ?p rdf:type <http://www.semanticweb.org/CS586#CricketPlayer>.
    FILTER(regex(?loc,"England"))
  }
  UNION
  {
    ?pl cs586:PlayerName ?Pname;
        dcterms:type ?p;
        cs586:Nationality ?loc;
        cs586:MemberOf ?team.
    ?team cs586:TeamName ?tname.
    ?p rdf:type ?type.
    ?p rdf:type <http://www.semanticweb.org/CS586#FootballPlayer>.
    FILTER(regex(?loc,"ENG"))
  }
  UNION
  {
    ?pl cs586:PlayerName ?Pname;
        dcterms:type ?p.
    ?p rdf:type ?type.
    ?p rdf:type <http://www.semanticweb.org/CS586#TennisPlayer>;
       <http://www.semanticweb.org/CS586/BornIn> ?n.
    ?n cs586:LocationName ?loc.
    FILTER(regex(?loc,"England"))
  }
}
Result:
Fig 14. Query 1 Result
2. Query to extract player names of left-handed players from
all sports
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cs586: <http://www.semanticweb.org/CS586/>
PREFIX schema: <http://www.schema.org/>
PREFIX local: <http://localhost:8080/source/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?Pname
WHERE
{
  {
    ?pl cs586:PlayerName ?Pname;
        dcterms:type ?p.
    ?p cs586:CricketBatStyle ?batstyle.
    ?p rdf:type <http://www.semanticweb.org/CS586#CricketPlayer>.
    FILTER(regex(?batstyle,"Left-hand"))
  }
  UNION
  {
    ?pl dcterms:type ?r;
        cs586:PlayerName ?Pname.
    ?r cs586:TennisPlays ?q.
    FILTER(regex(?q,"Left-hand"))
  }
}
Result:
Fig 15. Query 2 Result
3. Query to count the players above 30 years of age
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cs586: <http://www.semanticweb.org/CS586/>
PREFIX schema: <http://www.schema.org/>
PREFIX local: <http://localhost:8080/source/>
SELECT (COUNT(?age) AS ?AgeValue)
WHERE
{
  ?p cs586:HasRank ?rank;
     cs586:PlayerName ?Pname;
     cs586:BornOn ?date;
     cs586:Age ?age.
  FILTER(?age > 30)
}
Result:
Fig 16. Query 3 Result
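Queries like the ones above can also be submitted from a script instead of the triple store's web interface. The following is a minimal sketch under stated assumptions: the endpoint URL and repository name are hypothetical, the query is a shortened form of Query 3, and the sample response is made up purely to show the shape of the standard SPARQL JSON results format. The request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: a Sesame repository named "sports" on the local machine.
ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/sports"

# Shortened form of Query 3, keeping only the age pattern and the filter.
COUNT_QUERY = """
PREFIX cs586: <http://www.semanticweb.org/CS586/>
SELECT (COUNT(?age) AS ?AgeValue)
WHERE { ?p cs586:Age ?age . FILTER(?age > 30) }
"""


def build_query_request(query):
    """Build (but do not send) a GET request asking for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


# Made-up response, used only to show the result structure.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["AgeValue"]},
    "results": {"bindings": [
        {"AgeValue": {
            "type": "literal",
            "datatype": "http://www.w3.org/2001/XMLSchema#integer",
            "value": "12",
        }}
    ]},
})

query_request = build_query_request(COUNT_QUERY)
rows = rows_from_results(SAMPLE_RESPONSE)
print(rows)
```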
IX. TOOLS USED
A. Data Scraping:
JSoup, JSON, Request, BeautifulSoup, Sesame
OpenRDF, Apache Jena, Scrapy, WebScraper.
B. Data Cleaning:
• Karma Tool: Karma offers a programming-by-example
interface to enable users to define data transformation scripts
that transform data expressed in multiple data formats into a
common format.
• Google Refine : A power tool for working with messy data,
cleaning it up, transforming it from one format into another.
C. Ontology Creation
Protégé: Protégé is a free and open source ontology editor
which was developed at Stanford. Protégé defines a
graphical user interface to define ontologies.
D. Data Modelling and Publishing
Karma
OpenRDF: An open source framework (OpenRDF Sesame) for
creating triple stores and for loading and querying
RDF triples.
X. INDIVIDUAL CONTRIBUTION
a. Abhishek Agrawal:
Scraped the websites (http://www.espncricinfo.com/),
(http://www.icc-cricket.com/) and (http://cricsheet.org/)
using a combination of the Scrapy Python library and
Chrome Web Scraper. Did modelling for cricket files
in Karma. Worked on Ontology Creation.
b. George Sam:
Performed Data Scraping in Python to extract
information about players and league matches for
English Premier League and Spanish La Liga Players.
Did Data cleaning in Google OpenRefine. Worked on
Data Modelling in Karma and Triple Store Creation in
OpenRDF. Developed Sparql queries.
c. Hari Haran Venugopal:
Built a Python script to scrape websites and collected the
relevant dataset in JSON format for the tennis domain,
primarily focusing on player details (Name, Bio-Data,
Ranking, Personal History, Coach, Age, Nationality) and
tournament details for all Grand Slams (year, winner,
scores). Did Data cleaning in Karma. Developed
SPARQL Queries. Performed data modelling for
d. Noopur Joshi:
Responsible for creating the ontologies for each sport and
the federated ontology for all sports combined in
Protégé, using the hybrid approach. Initially an
ontology was created for each individual sport, namely
Football, Tennis and Cricket. These ontologies were
then mapped to the federated ontology for all sports.
She handled Data cleaning for the datasets using
Google OpenRefine. The tutorial referred to for creating
ontologies:
http://130.88.198.11/tutorials/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf
Worked on data modelling for cricket and football
players.
XI. CONCLUSION AND FUTURE WORK
We have thus been able to create a system which implements a
federated ontology to include data from Tennis, Cricket and
Football. This data is modelled in Karma, published as RDF
and loaded into a triple store. We have then run queries to
gather some statistical data about players from all sports.
Further work can be done in this field. Many other sports can
be integrated into the federated ontology. A web application
can also be created to display the information and provide an
interface to search the data sets. This system can also be
integrated into a mobile application as a front end. This will
provide greater coverage of the system onto multiple devices.
References
[1] http://www.isi.edu/integration/karma/
[2] http://phd.jabenitez.com/wp-content/uploads/2014/03/A-Practical-Guide-To-Building-OWL-Ontologies-Using-Protege-4.pdf
[3] http://ict.siit.tu.ac.th/~sun/SW/Protege%20Tutorial.pdf
[4] http://www.crummy.com/software/BeautifulSoup/
[5] https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
[6] https://code.google.com/p/google-refine/
[7] http://www.datacleansing.net.au/Data_Cleansing_Services
[8] www.atpworldtour.com
[9] http://www.icc-cricket.com/player-rankings/overview
[10] http://www.espncricinfo.com/ci/content/player/index.html
[11] http://cricsheet.org/
[12] https://code.google.com/p/snakeyaml/