This document provides an overview of a Hadoop session covering:
1. An introduction to big data, including the history and evolution of Hadoop and how it addresses the challenges of traditional databases.
2. The Hadoop architecture and ecosystem, including components like HDFS, MapReduce, and HBase, and how they address the scalability, flexibility, and cost issues of traditional databases.
3. Hands-on analysis of a soccer dataset using Hadoop, performing tasks such as data classification, prediction, and player analysis.
Hank Roark of H2O gives an overview of data science, machine learning, and H2O.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
GalvanizeU Seattle: Eleven Almost-Truisms About Data (Paco Nathan)
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Detailed presentation on big data Hadoop + Hadoop Project Near Duplicate Detec... (Ashok Royal)
Big Data, Hadoop and its components, and a Hadoop project are described in detail.
Visit http://hadoop-beginners.blogspot.com to see Hadoop Tutorials.
Thanks for the visit. :)
At an event organized in cooperation with Vodafone, Cyberpark, and the Turkish Technology Development Foundation, the concept of big data, the Apache Hadoop ecosystem, and example applications from Turkey and around the world were presented.
-
1 June 2016 - Onur Karadeli, Mustafa Murat Sever
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat... (Alexandru Iosup)
Data are pouring in, and defining and providing data-processing services at massive scale, in short, Big Data services, could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic life utility, for both SMEs and the European people. Although the burgeoning datacenter industry, of which the Netherlands is a top player in Europe, promises to enable Big Data services, the architectures and even the infrastructure for these services are still lagging behind in performance, efficiency, and sophistication, and are built as monoliths reminiscent of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex stacks of middleware that are currently in use for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) to providing a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from GPUs to Clouds. Proc. of SC|12 (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A workload model for MapReduce. MSc thesis at TU Delft. Jun 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotã, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Webinar: Talend, the Non-Programmer's Swiss Knife for Big Data (Edureka!)
Talend Open Studio (TOS) is a wonderful open source Data Integration (DI) tool used to build end-to-end ETL solutions. This course will not only help beginners understand the art of data integration but also equip them with Big Data skills in a smart way. This course also aims to educate you about Big Data through Talend's powerful product "Talend for Big Data" (the first Hadoop-based data integration platform). The topics covered in the presentation are:
1. Why ETL is still essential, and why the arrival of Big Data does not spell the end of the ETL era
2. How and why ETL is done using Talend
3. How Talend complements the Hadoop ecosystem: adapting to the ETL-Big Data industry
4. Learn Big Data not in months but in minutes! Sounds too good?
10 Popular Hadoop Technical Interview Questions (ZaranTech LLC)
Big Data is one of the fastest-growing technologies of this decade, and thus capable of producing a large number of jobs. As enterprises across industries have started building teams, Hadoop technical interview questions can vary from simple definitions to critical case studies. Let's take a quick glimpse at the most common ones.
Deep Water - Bringing TensorFlow, Caffe, MXNet to H2O (Sri Ambati)
Arno Candel introduces Deep Water, which brings TensorFlow, Caffe, and MXNet to H2O. It also brings support for GPUs, image classification, NLP, and much more to H2O.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Human in the loop: a design pattern for managing teams working with ML (Paco Nathan)
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
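That exception-routing loop can be sketched in a few lines. The confidence threshold, the toy dictionary "model", and the `human_review` stand-in below are all illustrative assumptions, not part of the talk:

```python
# Minimal human-in-the-loop / active-learning sketch.
# A model labels items automatically; low-confidence cases are
# routed to a human expert, and those decisions are collected
# to train the next iteration of the model.

def predict_with_confidence(model, item):
    """Return (label, confidence). 'model' is a plain dict here,
    a stand-in for a real classifier."""
    return model.get(item, ("unknown", 0.0))

def human_review(item):
    """Placeholder for routing an item to a human expert."""
    return "label-from-expert"

def hitl_loop(model, stream, threshold=0.8):
    auto_labeled, training_additions = [], []
    for item in stream:
        label, conf = predict_with_confidence(model, item)
        if conf >= threshold:
            auto_labeled.append((item, label))        # machine handles it
        else:
            expert_label = human_review(item)          # exception -> human
            training_additions.append((item, expert_label))
    return auto_labeled, training_additions

model = {"a": ("spam", 0.95), "b": ("ham", 0.4)}
auto, to_train = hitl_loop(model, ["a", "b", "c"])
```

The key design choice is the threshold: lowering it sends more work to machines, raising it sends more exceptions to experts, and the expert decisions feed the next training set.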
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka (Edureka!)
This Edureka Hadoop Tutorial (Hadoop Tutorial Blog Series: https://goo.gl/zndT2V) helps you understand Big Data and Hadoop in detail. This Hadoop Tutorial is ideal for both beginners and professionals who want to learn or brush up on their Hadoop concepts.
This Edureka Hadoop Tutorial provides knowledge on:
1) What are the driving factors of Big Data and what are its challenges?
2) How Hadoop solves Big Data storage and processing challenges, illustrated with a Facebook use case.
3) An overview of the Hadoop YARN architecture and its components.
4) A real-life implementation of a complete end-to-end Hadoop project on a Reddit use case, on a Hadoop cluster.
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
Morten Egan gives a short introduction to Big Data and what it is all about. What is MapReduce, HDFS, Hive, Pig and HCatalog? Also, a short introduction to Hortonworks.
This presentation was made for the Danish Oracle user group.
Under the grid computing paradigm, large sets of heterogeneous resources can be aggregated and shared. Grid development and acceptance hinge on proving that grids reliably support real applications, and on creating adequate benchmarks to quantify this support. However, applications of grids (and clouds) are just beginning to emerge, and traditional benchmarks have yet to prove representative in grid environments. To address this chicken-and-egg problem, we propose a middle-way approach: create and run synthetic grid workloads comprised of applications representative for today's grids (and clouds). For this purpose, we have designed and implemented GrenchMark, a framework for synthetic workload generation and submission. The framework greatly facilitates synthetic workload modeling, comes with over 35 synthetic and real applications, and is extensible and flexible. We show how the framework can be used for grid system analysis, functionality testing in grid environments, and for comparing different grid settings, and present the results obtained with GrenchMark in our multi-cluster grid, the DAS.
Gail Zhou on "Big Data Technology, Strategy, and Applications" (Gail Zhou, MBA, PhD)
Dr. Gail Zhou presented this topic at DevNexus on Feb 25, 2014. Big Data history, opportunities, and applications. Big Data key concepts and a reference architecture with open source technology stacks. Hadoop architecture explained (HDFS, MapReduce, and YARN). Big Data start-up challenges and strategies to overcome them. Technology update: Hadoop- and Cassandra-based technology offerings.
I have collected information for beginners to provide an overview of Big Data and Hadoop, which will help them understand the basics and give them a starting point.
This presentation gives an insight into what big data and data analytics are, the difference between big data and data science, and salary trends in big data analytics.
Big Data may well be the Next Big Thing in the IT world. The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
Big Data brings big promise and also big challenges, the primary and most important one being the ability to deliver Value to business stakeholders who are not data scientists!
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1) (Sascha Dittmann)
In this session we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Fundamentals of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview of the planned use cases for Hadoop, an illustration of some of the common technology components, suggestions on when Hadoop is worth considering, some of the challenges organisations are experiencing, cost considerations, and finally, how an organisation should position itself for a Big Data initiative. Any organisation considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking on a course of action.
BigData Meets the Federal Data Center - an overview of NoSQL solutions to data challenges (e.g. Hadoop, HBase, MongoDB, Cassandra, Redis, etc.). Also includes a vignette on the Google Prediction API.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
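For readers unfamiliar with the monolithic baseline being compared against, a minimal PageRank with the usual dead-end fix (redistributing the rank mass of zero-out-degree vertices uniformly) can be sketched as follows. This is not the Levelwise algorithm from the report, and the graph and parameters are invented for illustration:

```python
# Minimal monolithic PageRank sketch with dead-end handling.
# Rank mass held by vertices with no out-links is spread uniformly
# each iteration -- the situation Levelwise PageRank sidesteps by
# requiring a graph with no dead ends.

def pagerank(graph, damping=0.85, iters=50):
    """graph: {vertex: [out-neighbors]}; returns {vertex: rank}."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        # Mass held by dead ends, redistributed over all vertices.
        dead_mass = sum(rank[v] for v in graph if not graph[v])
        new = {v: (1 - damping) / n + damping * dead_mass / n
               for v in graph}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
        rank = new
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": []}  # "c" is a dead end
r = pagerank(g)
```

Because the dead-end mass is redistributed every iteration, the total rank stays at 1.0, which is the invariant the standard power-iteration formulation relies on.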
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built a robust Data Copilot on these three concepts, one that can help democratize access to company data assets and boost the performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
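The retrieval step at the heart of such a RAG copilot can be illustrated without any LLM at all: embed the documents and the query, pick the most similar document, and prepend it to the prompt. The bag-of-words "embedding" and the sample documents below are toy stand-ins for a real embedding model and data-catalog corpus:

```python
# Toy RAG retrieval: bag-of-words vectors + cosine similarity.
import math
from collections import Counter

def embed(text):
    """Stand-in for a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "sales table lives in the warehouse schema",
    "the hr schema stores employee records",
]

def retrieve(query, docs):
    """Return the document most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

context = retrieve("where are employee records stored", docs)
prompt = f"Context: {context}\nQuestion: where are employee records stored?"
```

In a real system the retrieved context would be passed to the LLM along with the question; the augmentation step itself is just string concatenation.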
The Building Blocks of QuestDB, a Time Series Database (Javier Ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
2. Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015 Contribute: SummerOfTechnologies
3. I.A Big Data Introduction
History. What’s Big Data? What problems does it solve? Use cases.
Comparison with RDBMSs.
4. How did it all start?
• Google: Indexing the web?
Google File System (2003)
MapReduce (2004)
• 2006: Doug Cutting joins Yahoo and
gets a dedicated team to work on his
Hadoop project
• 2008: Hadoop becomes a top level
Apache Project
• 2008: Hadoop breaks the Terasort record:
1TB, 910 nodes, 5 minutes
2009: 1TB in 59s (1400 nodes)
2009: 100TB sorted in 3h (3400 nodes)
5. How does it all evolve?
http://blog.mikiobraun.de/2013/02/big-data-beyond-map-reduce-googles-papers.html
Did Google sit back and relax?
• 2006: BigTable
• 2010: Percolator:
BigTable +individual updates & transactions
• 2010: Pregel:
scalable graph computing
• 2010: Dremel: interactive db (real-time)
• 2011: MegaStore (BigTable + schema)
focus on distributed consistency
• 2012: Spanner (MegaStore + SQL)
Did the Open Source community?
• HBase (Facebook messaging service)
• Apache Giraph, Neo4J
• Cloudera’s Impala
Doug Cutting: “Google is living a few years in the future and sending the rest of us messages”
6. Today’s data challenges
• Data momentum = Volume * Velocity
• Data Variety
Data creation: humans <> machines
• CAP Theorem
• Parallel/Cloud computing
=> not only BIG data
=> also complex data analysis!
• DWH solutions are:
Expensive!
Not horizontally scalable
Inflexible schemas
7. Use cases?
• Internet of Things:
Everything has an IP
• Customer behaviour analysis:
Clickstreams,...
• Social media analysis:
Twitter, Facebook, ...
• Fraud Detection:
Sample -> full datasets, realtime
• Cognitive computing: IBM Watson:
Large scale text mining
• Stack traces complex systems:
Discovering system failure patterns
• Energy:
Centralized production -> distributed Smart Grids
• Keyword: Personalized ...
Medicine
Janssen + Intel + Univ. = Exascience project
Drug prescription
Insurance
Advertising
Travelling
• A different approach:
Pattern Discovery
<> Pattern Matching
8. Pattern matching versus Discovery
• Hadoop is a Data Scientist’s playground:
Explore your (big) data, discover new patterns
• RDBMS works with Data Committee
Which patterns do we want to store (schema)
• Example from my research background:
Where do certain DNA patterns occur?
How to model DNA patterns?
• Cooking analogy:
Follow a recipe OR be the cook
9. Big Data hype or reality?
JOB AD
University of Leuven, ESAT-STADIUS
In the framework of a collaboration with Janssen Pharmaceuticals, we are looking
for a talented postdoctoral researcher to develop kernel methods that link drug
targets, disease phenotypes, and pharmaceutical compounds. Leveraging
large-scale public and in-house data sets, you will develop kernel methods and/or
network-based methods to predict potential links between targets, diseases, or
candidate drugs. This research builds upon the expertise of Janssen Pharma and
previous work of our team on genomic data fusion. The research will be also
carried out with a team of the University of Linz, Austria (Prof. Sepp Hochreiter)
specialized in kernel learning and chemoinformatics.
Project Details:
Exascience project: Janssen, Imec, Intel, Universities,..
NGS Data ($1 billion -> $2000) -> mapping -> SNP dataset -> disease matching
Trend: Replace wet lab experiments by computer simulations
10. Limitations of classical RDBMSs
http://youtu.be/d2xeNpfzsYI?t=3m24s (to 5m00s)
Lecturer: Amr Awadallah (CTO and founder of Cloudera)
• Data streams:
• Data source -> storage only (raw)
• Storage layer => ETL => RDBMS => BI
• 3 problems:
• STORAGE TO ETL: moving data to compute doesn’t scale
• ETL typically runs overnight => not enough time to process all data!
• Too much network overhead moving data from storage to the compute grid
• Solution? Move the code to where the data is!
• STORAGE TO ARCHIVING: archiving data too early = premature data death
• Data is archived too early because storage cost is too high (balance storage cost vs economic value)
• Archiving is cheap, but retrieval is extremely expensive!
• Solution? Storage has to become cheaper! (Return on byte)
• STORAGE TO BI: no ability to explore the original raw data
• You cannot ask NEW questions! Very inflexible!
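The "move the code to where the data is" idea is exactly what MapReduce implements: small map and reduce functions are shipped to the nodes holding each data block, instead of hauling the blocks across the network to a central compute grid. A single-process toy word count (purely illustrative; real Hadoop distributes the blocks and the shuffle):

```python
# Toy MapReduce word count. map_phase and reduce_phase are the
# "code" Hadoop would ship to the nodes storing each data block.
from collections import defaultdict

def map_phase(block):
    """Runs locally on the node holding this block of text."""
    for line in block:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sums the counts gathered by the shuffle, per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

blocks = [["the quick brown fox"], ["the lazy dog"]]
# Map runs per block (per node); the shuffle groups pairs by key.
all_pairs = [p for b in blocks for p in map_phase(b)]
counts = reduce_phase(all_pairs)
```

Only the small (word, 1) pairs cross the network during the shuffle; the raw blocks never move, which is why this model scales where "storage to ETL" does not.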
12. The left hand and the right hand
Hadoop:
Schema on read
Load is fast
Schemas can change
Only batch processing & no indexes
CAP: no transactions!
No atomic updates!
Commodity hardware
Classical RDBMS:
Schema on write
Load is slow (ETL first)
Adapting the schema is very difficult
Read is fast (schema => indexing)
Very good at transactions
CRUD
Expensive (purpose-built) Data Warehouse
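Schema on read can be shown in a few lines: the raw record is stored untouched and a schema is applied only at query time, so a new question needs only a new parser, not a reload through ETL. The log format and field names below are invented for illustration:

```python
# Schema-on-read: store raw lines untouched; apply structure at
# query time. A schema-on-write system would have to define (and
# later migrate) columns before loading anything.
raw_log = [
    "2015-04-15 10:01 user=alice action=login",
    "2015-04-15 10:02 user=bob action=click page=home",
]

def parse(line):
    """The 'schema', applied on read. A field like 'page' that only
    some records carry needs no schema change, just parsing."""
    date, time, *kv = line.split()
    record = {"date": date, "time": time}
    record.update(pair.split("=", 1) for pair in kv)
    return record

records = [parse(line) for line in raw_log]
logins = [r["user"] for r in records if r["action"] == "login"]
```

The trade-off from the slide is visible here: loading is trivially fast (append a string), but every read re-parses, and there is no index to make queries fast.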
13. The end of my presentation?
http://www.businessweek.com/articles/2014-06-27/google-just-made-big-data-expertise-much-tougher-to-fake
For the last five years or so, it’s been pretty easy to pretend you knew something about Big Data. You went to the cocktail party—the one with all the dudes—grabbed a drink and
then said “Hadoop” over and over and over again. People nodded. Absurdly lucrative job offers rolled in the next day. Simple.
Well, Google (GOOG) officially put an end to the good times this week. During some talks at the company’s annual developer conference, Google executives declared that
they’re over Hadoop. It’s yesterday’s buzzword. Anyone who wants to be a true Big Data jockey will now need to be conversant in Flume, MillWheel, Google Cloud Dataflow, and
Spurch. (Okay, I made the last one up.)
Here’s the deal. About a decade ago, Google’s engineers wrote some papers detailing a new way to analyze huge stores of data. They described the method as MapReduce: Data
was spread in smallish chunks across thousands of servers; people asked questions of the information; and they received answers a few minutes or hours later. Yahoo! (YHOO) led
the charge to turn this underlying technology into an open-source product called Hadoop. Hundreds of companies have since helped establish Hadoop as more or less the standard of modern data analysis work. (Much has been written on this topic.) Startups such as Cloudera, Hortonworks, and MapR have their own versions of Hadoop that companies can use, and just about every company that needs to analyze lots of information has its own Hadoop team.
Google probably processes more information than any company on the planet and tends to have to invent tools to cope with the data. As a result, its technology runs a good five to
10 years ahead of the competition. This week, it is revealing that it abandoned the MapReduce/Hadoop approach some time ago in favor of some more flexible data analysis
systems.
One of the big limitations around Hadoop was that you tended to have to do “batch” operations, which means ordering a computer to perform an operation in bulk and then waiting
for the result. You might ask a mainframe to process a company’s payroll as a batch job, or in a more contemporary example, analyze all the search terms that people in Texas
typed into Google last Tuesday.
According to Google, its Cloud Dataflow service can do all this while also running data analysis jobs on information right as it pours into a database. One example Google
demonstrated at its conference was an instantaneous analysis of tweets about World Cup matches. You know, life-and-death stuff.
Google has taken internal tools—those funky-named ones such as Flume and MillWheel—and bundled them into the Cloud Dataflow service, which it plans to start offering to
developers and customers as a cloud service. The promise is that other companies will be able to deal with more information, more easily and faster than ever before.
While Google has historically been a very secretive company, it is opening up its internal technology as a competitive maneuver. Google is proving more willing than,
say, Amazon.com (AMZN) to hand over the clever things built by its engineers to others. It’s an understandable move, given Amazon’s significant lead in the cloud computing arena.
As for the Hadoop clan? You would think that Google flat-out calling it passé would make it hard to keep hawking Hadoop as the hot, hot thing your company can’t live without. And
there’s some truth to this being an issue.
That said, even the biggest Hadoop fans such as Cloudera have been moving past the technology for some time. Cloudera leans on a handful of super-fast data analysis engines
like Spark and Impala, which can grab data from Hadoop-based storage systems and torture it in ways similar to Google’s.
The painful upshot, however, is that faking your way through the Big Data realm will be much harder from now on. Try keeping your Flume and Impala straight after a couple of gin
and tonics.
15. Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
19. Intel Makes Significant Equity
Investment in Cloudera
$740M Cloudera
Investment
PALO ALTO, Calif., and SANTA CLARA, Calif., March 27, 2014 – Intel
Corporation and Cloudera today announced a broad strategic technology
and business collaboration, as well as a significant equity investment
from Intel making it Cloudera’s largest strategic shareholder and a
member of its board of directors. This is Intel’s single largest data center
technology investment in its history. The deal will join Cloudera’s leading
enterprise analytic data management software powered by Apache
Hadoop™ with the leading data center architecture based on Intel®
Xeon® technology. The goal is acceleration of customer adoption of big
data solutions, making it easier for companies of all sizes to obtain
increased business value from data by deploying open source Apache
Hadoop solutions. Both the strategic collaboration and the equity
investment are subject to standard closing conditions, including
customary regulatory approvals.
Cloudera will develop and optimize Cloudera’s Distribution including
Apache Hadoop (CDH) for Intel architecture as its preferred platform and
support a range of next-generation technologies including Intel fabrics,
flash memory and security. In turn, Intel will market and promote CDH
and Cloudera Enterprise to its customers as its preferred Hadoop
platform. Intel will focus its engineering and marketing resources on the
joint roadmap. The optimizations from Intel’s Distribution for Apache
Hadoop/Intel Data Platform (IDH/IDP) will be integrated into CDH and
IDH/IDP and will be transitioned after v3.1 release at the end of March.
To ensure a seamless customer transition to CDH, Intel and Cloudera will
work together on a migration path from IDH/IDP. Cloudera will also
ensure that all enhancements will be contributed to their respective open
source projects and CDH.
...
21. II. Hadoop Architecture
HDFS & MR architecture
Figures: Hadoop in Practice; Data-Intensive Text Processing; Hadoop: The Definitive Guide
22. Hadoop: The operating system for data
clusters
• Hadoop is a scalable fault-tolerant distributed system for data storage and
processing: Failure is the rule not the exception!
23. Hadoop Distributed Filesystem
• HDFS is optimized for streaming reads and writes:
HDFS uses large block sizes (64 or 128 MB) => HD seek time negligible (compared to read/write)
HDFS replicates its data blocks (usually 3 times) to improve availability and fault tolerance (resilient against node failure!)
• HDFS: No updates!
• Master-Slave: Namenode-Datanode
26. MapReduce execution engine
• Master-Slave: JobTracker-TaskTracker
• JT schedules map and reduce tasks on TaskTrackers
• JT tries to schedule the work near the data => Move the algorithm, NOT the data
• TTs send heartbeats to the JT to report their health
• If a task fails 4 times => Job failure
• If a TT fails 4 times => Removed from pool
• A TT has Map & Reduce slots to run M & R tasks
• Anything can be configured!
28. Ser/De in Hadoop
• A Mapper processes an InputSplit = { (k1,v1), (k2,v2), (k3,v3), ... }
• How is an InputSplit defined?
• InputFormat class splits input and RecordReader generates KV pairs from split
(TextInputFormat: split = slice of file, record = line of text)
• InputFormat tries to make splits = file blocks in HDFS
=> DATA LOCALITY
• Custom InputFormats possible : JSON, XML, SequenceFiles, ...
• Hadoop has its own serialization types: WritableComparables (Text,
IntWritable, FloatWritable, BytesWritable,...)
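The TextInputFormat semantics above can be sketched in plain Python (a toy illustration of the idea, not Hadoop's actual implementation): the split is a slice of the file, and the RecordReader turns it into (byte offset, line) key-value pairs.

```python
# Toy sketch of TextInputFormat / RecordReader semantics (not Hadoop code):
# the split is a chunk of text, each record is (byte offset, line of text).
def text_records(split: str):
    offset = 0
    for line in split.split("\n"):
        yield (offset, line)      # key = byte offset, value = line of text
        offset += len(line) + 1   # +1 for the newline delimiter

records = list(text_records("hello world\nfoo bar"))
```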
31. III. Hadoop Ecosystem
& Design patterns
Intro to MapReduce programming; using the ecosystem to encourage code reuse
32. Hadoop
Word Count
Problem Description
• Given a set of documents, calculate
the number of times each word
occurs.
• A MapReduce program consists of:
A Driver
A Mapper (optional)
A Reducer (optional)
DRIVER CODE
33. Hello World Hadoop: Word Count
• WordCountMapper:
The InputFormat partitions the data
in a set of splits.
Each split is fed to a Mapper
A RecordReader generates records
= Text Value
The Mapper takes a record and splits the text into words
The Mapper emits every word with
frequency one.
• WordCountReducer:
Hadoop is responsible for getting all
key-value pairs with the same key
to one reducer (parallel sort)
A reducer gets a collection of
values which go with a single key
(frequencies)
This reducer adds up the
frequencies and emits the sum to a
file specified in setOutputPath(...)
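The mapper/reducer pair described above can be sketched in Python (a toy, single-process stand-in for the Hadoop job; the real version would subclass Mapper and Reducer in Java, with Hadoop performing the shuffle):

```python
from collections import defaultdict

# Mapper: emit (word, 1) for every word in a line of text.
def wc_map(line):
    for word in line.split():
        yield (word, 1)

# Shuffle: group values by key (Hadoop does this between map and reduce).
def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Reducer: add up the frequencies for each word.
def wc_reduce(word, counts):
    return (word, sum(counts))

lines = ["hello world", "hello hadoop"]
mapped = [kv for line in lines for kv in wc_map(line)]
result = dict(wc_reduce(k, vs) for k, vs in shuffle(mapped).items())
```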
34. Optimization patterns in MR (Data-Intensive Text Processing, Lin & Dyer)
Use of Combiners
• A combiner is a mini-reducer which can
be run an arbitrary number of times on
the output of a mapper before it is
streamed to disk
In-Mapper Combiner design pattern
• Create a HashMap in your Mapper
class in which you store all
(word,frequency) pairs, emit when
mapper has finished with a split
• Drawback: Hashmap must fit in
memory!
• Each Context.write() streams data to the local filesystem! (BOTTLENECK!)
• HINT: Is it necessary to emit every single (word, 1) pair from the mapper?
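The in-mapper combining pattern can be sketched like this (a hedged Python illustration of the idea; real Hadoop code would keep the map as a field of the Mapper instance and emit it in cleanup()):

```python
from collections import defaultdict

# In-mapper combiner: aggregate (word, count) locally in a hash map and
# emit once per distinct word, instead of one (word, 1) per occurrence.
def wc_map_combined(split_lines):
    counts = defaultdict(int)       # must fit in memory -- the drawback!
    for line in split_lines:
        for word in line.split():
            counts[word] += 1
    yield from counts.items()       # emitted when the split is processed

emitted = dict(wc_map_combined(["to be or not to be"]))
```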
35. NLP: Word co-occurrence matrices (Data-Intensive Text Processing, Lin & Dyer)
Problem Description
• For each word we want to calculate the
relative frequencies of words co-
occurring with this word (in the same
sentence).
• Example:
(dog,cat) occurs 2 times, (dog,walking)
occurs 3 times then the results should be:
=> (dog,cat) = 40%, (dog,walking) = 60%
• Requirements?
We must count all word co-occurrences &
sum all the (dog,*) combinations to calculate
the relative frequencies
MAPPERv1
36. Relative frequencies!
Problem
• We need to know how many times dog occurs together with any other word in order to calculate the relative frequencies!
• Solution emit (dog,*) pairs as well!
MAPPERv2
37. How to get the data to the reducer?
Problem
• (dog,*), (dog,cat), (dog,walking) are
different keys, they might end up in a
different reducer!!!
• MapReduce possibilities
MapReduce has a Partitioner which decides
where each K,V must go
MapReduce has a GroupingComparator to
decide which K,V pairs end up in one reducer
group
MapReduce has a SortComparator to
decide on how to sort the K,V pairs in the
reducer group
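The pairs-plus-star pattern above can be sketched in Python (a toy simulation of the shuffle; it assumes the SortComparator puts the (word, '*') total first in each reducer group, as described):

```python
from collections import defaultdict

# Emit a (word, neighbor) pair plus a (word, '*') marker per co-occurrence;
# in the reducer group the '*' total arrives first and scales the rest.
pairs = [("dog", "cat")] * 2 + [("dog", "walking")] * 3
counts = defaultdict(int)
for w, n in pairs:
    counts[(w, "*")] += 1     # running total for the left word
    counts[(w, n)] += 1

rel_freq = {}
total = counts[("dog", "*")]  # a custom SortComparator makes '*' sort first
for (w, n), c in sorted(counts.items()):
    if n != "*":
        rel_freq[(w, n)] = c / total
```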
39. Limitations of MapReduce
Problem Description
• MapReduce does NOT encourage code reuse!
Suppose we have a table and we want to
calculate the average, the minimum and
the maximum
This can be done with 1 job but to make your
code reusable you need 3!
A max job, a min job and an avg job
• MapReduce requires a lot of coding!
• Relational operators such as Joins,
Orders,... should only be written once!
Solution?
• On top of MapReduce, 2 scripting languages are built which allow one to use relational logic, which is then translated into a sequence of MapReduce jobs:
PIG and HIVE
Pig is a dataflow language: single data transformations
Hive is an SQL-like language
Bottom line: you can use Hadoop without knowledge of MapReduce!
41. Pig’s philosophy (http://pig.apache.org/philosophy.html)
• Pigs eat anything
relational, nested, unstructured,... data
• Pigs live anywhere
Hadoop is not strictly required
• Pigs are domestic animals
integration with other languages (Python)
extendible with UDFs
• Pigs fly
optimizes its translation to MR jobs
42. Pig Latin (http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html)
• Data types: same as SQL + {tuples, bags, maps}
• LOAD ‘path’ USING PigStorage(delim) AS ... (schema)
• STORE users INTO ‘path’ USING PigStorage(delim)
• DISTINCT operator: only keep unique records
• FILTER operator:
FILTER users BY age == 30;
• SPLIT
SPLIT users INTO adults IF age >= 18,
children OTHERWISE;
• ORDER
ORDER users BY age DESC/ASC;
43. Pig Latin (cont’d)
• FOREACH users GENERATE (projection + operations between columns)
name,
age;
• Nesting data with GROUP BY operator:
gr = GROUP users BY age;
age_counts = FOREACH gr GENERATE
group as age,
COUNT(users) as people_same_age;
• Unnesting data with FLATTEN operator:
FLATTEN(tuple) => each tuple field to separate column
FLATTEN(bag/map) => each bag/map item to separate row
• INNER JOIN: (LEFT/RIGHT/FULL OUTER also possible)
JOIN age_counts by age, users by age;
• UNION users1, users2;
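For comparison, the GROUP BY / COUNT example above, sketched in Python (hypothetical sample data; the field names mirror the Pig script):

```python
from collections import Counter

# users: (name, age) tuples -- hypothetical sample data
users = [("ann", 30), ("bob", 30), ("carl", 42)]

# Pig: gr = GROUP users BY age;
#      age_counts = FOREACH gr GENERATE group as age,
#                   COUNT(users) as people_same_age;
age_counts = Counter(age for _, age in users)
```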
44. Pig Latin (cont’d)
• String Operators:
SUBSTRING
INDEXOF
SIZE
CONCAT
• Mathematical operators:
MIN
MAX
AVG
COUNT
SUM
• Conditional logic with ternary operator: (age > 18 ? ‘adult’ : ‘child’);
46. Hadoop 4 you: environment?
• Possibilities to run your own Hadoop POC:
1. Develop locally using open source Jars:
Eclipse IDE, IntelliJ
Preferably a Linux environment, or Windows + Cygwin
2. Testing/Demo:
Setting up your own one-node cluster:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Use a preconfigured virtual machine by one of the vendors:
Cloudera VM, HortonWorks VM, MapR VM + VMware or VirtualBox
3. Real cluster
Setting up your own cluster
Elastic MapReduce service of Amazon
47. Elastic MapReduce in the cloud
• Hive
• Pig
• Impala
• Hadoop streaming
• Hadoop custom jar
48. Conclusion
• Pig is a high-level scripting language on top of MapReduce
• Pig encourages code reuse
=> It creates a logical plan that runs a script in the lowest number of MR jobs
• Pig is very easy to use
• Most people limit themselves to Pig/Hive (ex.: Yahoo!)
• MapReduce gives you full control and allows you to optimize complex jobs:
Word Co-Occurrence matrices
• Some relational operators can be hard to implement:
How would you implement a JOIN?
50. IV. Soccer Dataset:
Introduction & Metadata
Parsing
What does the data look like? Why use a Big Data approach? Parsing the game metadata. Pig exercise
53. Why choose a Big Data approach
• The current Opta Sports dataset contains data from 2010-2014 with
(http://fivethirtyeight.com/features/lionel-messi-is-impossible/)
16,574 players
24,904 games (both league and international)
• Our sample dataset contains:
90 games from Bundesliga 2 in 2008-2009
• Arguments?
The real dataset IS big!
=>Implement scalable solution to start with!
Processing in parallel is preferable
Schema evolves over time
Data is not relational
Exploratory analysis: not sure what to look for?
Fig.: result of Batch query
61. V. Introduction to Data
Science: Decision trees
What’s data science? Classification with decision trees and random
forests
62. The sexiest job of the 21st century! (Harvard Business Review)
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
• LinkedIn: focus on engineering => keep the
social network up and running!
• Jonathan Goldman: What would happen if you
presented users with names of people they
hadn’t yet connected with but seemed likely
to know? (> where you went to school, same
company,...)
• Result: Very high click-through rate on ‘People You May Know’ ads
• DS = high-ranking professional with the training
and curiosity to make discoveries in the world of
big data
• LinkedIn from ‘empty box’ to 300 Million users
63. Skillset of a data scientist (Google Images)
Traditional Venn Diagram V2.0: data science requires a team
65. No clear definition
A Person who is better at statistics than any software engineer and better at
software engineering than any statistician
Josh Wills
Sr. Director of Data Science Cloudera
67. Algorithms for data science (ML)
• Machine learning algorithms are usually categorized as:
supervised versus unsupervised (is a test set containing the ‘truth’ available?)
input/output are categorical or continuous
• Supervised ex.: Classification (cat.) & Regression (cont.)
Classification: soccer data -> win/draw/loss
Regression: housing prices versus their size
• Unsupervised ex.: Clustering & Collaborative filtering
Can I divide my customers into certain segments based on their behaviour?
Which books does Amazon recommend me based on my previous purchases or based on
what similar customers bought?
69. Decision trees
• “A decision tree is a flowchart-like structure
in which an internal node represents a test
on an attribute, each branch represents the
outcome of the test and each leaf node
represents a class label”. (Wikipedia)
• Toy example: how could we build an optimal tree splitting people into Male/Female based on their length, hip perimeter and the size of their nose?
• => Put the question (rule) which
makes the best split first!
ID Length Hip perimeter Nose length Gender
1 155 70 2 F
2 160 80 2 F
3 165 80 3 F
4 170 75 3 F
5 180 90 2 F
6 190 65 3 M
7 200 60 2 M
8 195 55 3 M
9 185 50 3 M
10 175 50 10 M
70. Shannon entropy: the best split?
• To measure the best split we need an impurity measure: Shannon’s
information entropy (high entropy = high impurity)
• We have two classes: Male (M) & Female (F), 5 each (total T = 10)
• Entropy formula: S = - M/T * log2 M/T - F/T * log2 F/T
• Min(- X log X) = 0 when X=0 or X=1 => perfect split S = 0
• Initial set 5/10 males, 5/10 females: S = 0,34
• Suppose we split on nose length:
• Safter = Sleft + Sright = 0 + 0,31 = 0,31
• Information Gain = Sbefore – Safter = +0,03
• Note: the Pinocchio branch has S = 0 (completely pure)
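Shannon entropy and information gain can be computed directly as sketched below (base-2 logs, weighting each branch by its share of the records; note that with this common convention a perfectly mixed 5/5 split has S = 1):

```python
from math import log2

def entropy(males, females):
    # Shannon entropy: S = -M/T*log2(M/T) - F/T*log2(F/T)
    total = males + females
    s = 0.0
    for count in (males, females):
        p = count / total
        if p > 0:                  # convention: 0 * log2(0) = 0
            s -= p * log2(p)
    return s

def info_gain(before, splits):
    # splits: list of (males, females) per branch, weighted by branch size
    total = sum(m + f for m, f in splits)
    after = sum((m + f) / total * entropy(m, f) for m, f in splits)
    return entropy(*before) - after

# perfect split (e.g. on hip perimeter): one completely pure branch each
gain = info_gain((5, 5), [(0, 5), (5, 0)])
```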
71. Can we do better?
Split on length?
• Safter = Sleft + Sright = 0 + 0,21 = 0,21
• Information Gain = Sbefore – Safter
= +0,13
• Split on hip perimeter would be perfect
for this training set
• NOTE: training set is only a sample
of the universe!
Scatterplot
72. Democracy versus Totalitarianism
• Decision trees are rather sensitive to the sample (overfitting!)
• An alternative for a single Decision trees is Random Forest
• A random forest classifier is an ensemble of decision trees BUT:
Each tree receives only a subset of the training data
Each tree receives only a subset of the features
• The classification is done by adding the class probabilities together
• How can this work?
Indecisive trees with bad feature sets have probabilities close to 0,5 => have no
influence
Example: Tree with only nose length (noselength <0,3): 44% male, 56% female
Example: Tree with length (length > 1m70): 83% male, 17% female
• Side effect of Random Forest training: weights help select the dominant features
(feature selection = which features sit uppermost in the best trees?)
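The probability-averaging vote described above, using the two example trees from this slide:

```python
# Ensemble vote by averaging class probabilities (numbers from the two
# example trees above: weak nose-length tree, stronger length tree).
trees = [
    {"M": 0.44, "F": 0.56},   # indecisive tree: probabilities near 0.5
    {"M": 0.83, "F": 0.17},   # decisive tree: dominates the average
]
avg = {c: sum(t[c] for t in trees) / len(trees) for c in ("M", "F")}
prediction = max(avg, key=avg.get)
```

The indecisive tree barely moves the average, which is exactly why bad feature subsets have little influence.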
73. Python: SciKit library
• SciKit library contains machine learning algorithms
• Accuracy of forest on testset is 100%
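A minimal scikit-learn version on the toy gender table from slide 69 (a hypothetical reconstruction of the demo; feature order assumed to be length, hip perimeter, nose length):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy table from the decision-tree slide: length, hip perimeter, nose length.
X = [[155, 70, 2], [160, 80, 2], [165, 80, 3], [170, 75, 3], [180, 90, 2],
     [190, 65, 3], [200, 60, 2], [195, 55, 3], [185, 50, 3], [175, 50, 10]]
y = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
score = clf.score(X, y)   # accuracy on this tiny, separable set
```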
75. VI.A Soccer Data Analysis:
Classification & Prediction
76. Classifying soccer games
• Outcome = Win/Draw/Loss => Classification
• Use 80 games as a training set, 10 games to evaluate classifier
• Feature vectors:
1 FV = 1 game as (Home vs Away)
Content of FV are soccer statistics: #shots on goal, #passes, #successful passes,
#offensive passes, ...
NOTE: eliminate features which have perfect correlation with result: #goals, #assists
Every feature has 2 values: #home passes, #away passes
(#home - #away) / (#home + #away) -> value in [-1,+1]
• NOTE: classification can only be done AFTER the game!
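The per-feature normalization above, as a small Python helper (hypothetical function name; the zero-total guard is an added assumption for stats that never occurred):

```python
# Normalize a home/away stat pair into [-1, +1]:
# +1 = home dominates the stat, -1 = away dominates, 0 = even.
def normalize(home, away):
    total = home + away
    if total == 0:          # assumption: treat a 0/0 stat as even
        return 0.0
    return (home - away) / total

features = [normalize(h, a) for h, a in [(12, 4), (300, 300), (0, 5)]]
```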
77. MR Job to extract feature vectors
• 55 features per game
• Mapper parses F24_*.xml and creates feature vector, No Reducer required
• Events are regular: contain a set of attributes & qualifiers
Create an Event class with an attributes map and a qualifier map
Create Filter classes to filter events:
AreaFilter, OutcomeFilter, EventIDFilter, QualifierFilter, DirectionFilter
A function that splits a set of events into Home and Away
• Live demo
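The Event/Filter structure described above could look like this (a hedged Python sketch; the session's actual code is Java, and the class and field names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    # an F24 event is regular: a map of attributes plus a map of qualifiers
    attributes: dict = field(default_factory=dict)
    qualifiers: dict = field(default_factory=dict)

class OutcomeFilter:
    """Keep only events with a given outcome attribute (illustrative)."""
    def __init__(self, outcome):
        self.outcome = outcome

    def apply(self, events):
        return [e for e in events
                if e.attributes.get("outcome") == self.outcome]

events = [Event({"outcome": "1", "team": "home"}),
          Event({"outcome": "0", "team": "home"})]
kept = OutcomeFilter("1").apply(events)
```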
78. Accuracy 5/10 + 2 close calls
• Slightly modified MR job: emit feature vectors for both teams
• Can we predict the outcome of a game based on their history?
• Averaging previous games + calculate feature vector values ((x - y) / (x + y))
Using an RF classifier for prognosis?
• Use history (9 games) to generate
average feature vector
• Use weighted history to generate
average feature vector:
• F = (1*F1 + 2*F2 +3*F3 + ...) / (1+2+3+...)
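The weighted history average F = (1*F1 + 2*F2 + 3*F3 + ...) / (1 + 2 + 3 + ...) in Python (toy values; game i gets weight i, so more recent games count more):

```python
# Weighted average of a feature over past games: game i gets weight i,
# so later (more recent) games count more toward the prediction.
def weighted_history(values):
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

f = weighted_history([0.2, 0.4, 0.6])   # F = (1*0.2 + 2*0.4 + 3*0.6) / 6
```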
83. VI.B Soccer Data Analysis:
Individual Player Analysis
Extract and visualize stats, rank players, does our rank reveal talented
players?
84. Extracting player stats with MR
• Select a number of features of interest: shots on target, passes, ...
• Mapper: extract these stats per game and emit (PlayerID, stats)
• Reducer: aggregate stats per player and emit (PlayerID, (agg_stats, #games))
• Pig: create a player score & player ranking
• Python: visualize player stats in scatter plots
• Live Demo
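The mapper/reducer for player stats, sketched in Python (toy per-game records standing in for the parsed XML; stats here are shots and passes):

```python
from collections import defaultdict

# Mapper output: (player_id, per-game stats) -- here stats = (shots, passes)
mapped = [("p1", (3, 40)), ("p2", (1, 25)), ("p1", (2, 35))]

# Reducer: aggregate stats per player and count games played.
agg = defaultdict(lambda: [0, 0, 0])   # [shots, passes, games]
for player, (shots, passes) in mapped:
    agg[player][0] += shots
    agg[player][1] += passes
    agg[player][2] += 1

p1 = tuple(agg["p1"])   # (agg_stats..., #games) for player p1
```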
90. Outliers in scatterplots
• What about the players excelling in the scatterplots?
• Two categories: Players > 23 and Players <= 23
• Players > 23: No remarkable career: Bundesliga 2 is their level
• Players <= 23: Currently all have a number of caps!
Carlos E. Marquez, Brazil, Kazan
Marko Marin, Germany, Chelsea -> Sevilla
Chinedu Obasi, Nigeria, Schalke 04
Patrick Helmes, Germany, Wolfsburg
Nando Rafael, Angola, Düsseldorf
91. Player ranking in Pig
• OptaSports has the Castrol Index to rank player performance
• Demo Pig: create player ranking based on 2 scores:
Attacker_Score : shots_on_target / avg_sot
+ successful_dribbles / avg_sd
+ touches_in_square / avg_tis
Allround_Score: Attacker_Score
+ successful_offensive_passes / avg_sop
+ successful_passes / avg_sp
Suggestions?
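The two scores above, sketched in Python (hypothetical stat names matching the slide; each stat is divided by the league average, so 1.0 per component means an average player):

```python
# Rank players by normalizing each stat against the league average.
def attacker_score(p, avg):
    return (p["shots_on_target"] / avg["shots_on_target"]
            + p["successful_dribbles"] / avg["successful_dribbles"]
            + p["touches_in_square"] / avg["touches_in_square"])

def allround_score(p, avg):
    return (attacker_score(p, avg)
            + p["successful_offensive_passes"] / avg["successful_offensive_passes"]
            + p["successful_passes"] / avg["successful_passes"])

# Toy league averages and one striker's aggregated stats (illustrative)
avg = {"shots_on_target": 2.0, "successful_dribbles": 4.0,
       "touches_in_square": 10.0, "successful_offensive_passes": 20.0,
       "successful_passes": 200.0}
striker = {"shots_on_target": 4.0, "successful_dribbles": 4.0,
           "touches_in_square": 20.0, "successful_offensive_passes": 20.0,
           "successful_passes": 200.0}
a = attacker_score(striker, avg)   # 2 + 1 + 2
```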
96. Conclusion Soccer analysis
• A random forest classifier can be used both for classification and prediction
• Feature selection tells you which features determine the result
=> improve your classifier by removing features or use in a different classification algorithm
• Classification accuracy is 50%, while 33% would be expected from random guessing
• The classification probabilities are very interesting:
After removing the close calls, the classification accuracy is 5/8 (62,5%)
Removing the close calls improves the prognosis to 4/8 (50%) and 4/7 (57%)
• Scatterplots are an easy tool to select promising players
• Scoring functions based on domain knowledge allow you to rank the players
97. General Conclusion
• Big Data is for real!
• The Hadoop ecosystem (PIG) makes Big Data accessible for a broader audience
• Big Data goes hand in hand with Data Science
• A data scientist requires a very broad skillset
• Number crunching is Hadoop’s task, while postprocessing is Python’s
• We introduced Decision trees and Random Forests
• Soccer games are hard to predict but promising players are easy to find
• The speaker likes ents and Pinocchio!?