"Hadoop 2015: What we’ve learned in 5 years", Martin Oberhuber, Senior Data Scientist at Think Big
YouTube Link: https://www.youtube.com/watch?v=odOTsGgfzm8
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
2. 1st professional services provider with 100% focus on
the open source and Big Data Hadoop ecosystem
• Founded 2010
• 100+ Successful Programs
• 80+ Clients
• Global Delivery Capabilities
3. 3 Core Analytic Solution Domains
Device Analytics, Customer Analytics, Risk Analytics

• eCommerce: 2 of Global Top 5
• Retail: 2 of Global Top 5
• Social Networking: Global #1
• Telecommunications: 2 of Global Top 5
• Media & Advertising: 2 of Global Top 5

• Internet Transaction Security: Global #1
• Semiconductor: 2 of Global Top 5
• Data Storage Device: 3 of Global Top 5
• Disk Manufacturing: Global #1
• Telecommunications: 2 of Global Top 5

• Brokerage & Mutual Funds: 2 of Global Top 5
• Asset Management: Global #1
• Credit Issuer: 2 of Global Top 5
• Banking: 4 of Global Top 10
• Financial Data Services: 2 of Global Top 5
4. Think Big VELOCITY Methodology
• Big Data Strategy
• Think Big Academy
• Big Data Program Mgt
• Business Analytics
• Managed Services
• Data Engineering
• Big Data Lab
Think Big engages with its clients’ business, technical, analyst and support teams in an
agile-inspired VELOCITY Methodology to continuously develop Big Data solutions
5. Think Big Enterprise Data Lake
[Diagram: Information Sources, Enterprise Data Lake, Downstream Applications]
• Evaluate Source Data
• Prepare Source Metadata
• Prepare Data for Ingest
• Ingest
• Sequence
• Automate
• Apply Structure
• Compress
• Protect
• Dashboard Engine
• Collect & Manage Metadata
• Perimeter-Authentication-Authorisation
9. Data Science Approaches
Single Workstation
- Small data sets
- No distributed analytics across multiple nodes
- Powerful tools are R or Python
- Data Scientist can focus on the business problem
Mixed (Single Workstation + Cluster)
- Small or large data sets
- Data wrangling and feature engineering are performed on the cluster
- Predictive analysis and modeling can be performed on a single workstation
- Powerful tools are Hadoop Streaming and Spark combined with R and Python
- Data Scientist has to parallelize some data mining tasks
Cluster
- Large data sets
- Both data wrangling and modeling are performed on the cluster
- Spark is one of the few tools that support efficient parallel machine learning
- Parallelizing machine learning algorithms is challenging
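The "mixed" approach above can be illustrated with a minimal sketch in the Hadoop Streaming style, where a mapper and a reducer exchange tab-separated key/value lines. This is an illustration under stated assumptions, not the speaker's implementation: the input layout (customer_id, timestamp, amount) and the function names are hypothetical, and the shuffle/sort step that the cluster would perform is simulated locally with sorted(). The small per-customer feature table it returns is the kind of output that could then be modeled on a single workstation.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit 'customer_id<TAB>amount' for each well-formed raw record.

    Assumed (hypothetical) input layout: customer_id,timestamp,amount
    """
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed rows instead of failing the whole job
        customer_id, _timestamp, amount = parts
        yield f"{customer_id}\t{amount}"


def reducer(sorted_lines):
    """Aggregate per-customer features from key-sorted mapper output."""
    pairs = (line.split("\t") for line in sorted_lines)
    for customer_id, group in groupby(pairs, key=itemgetter(0)):
        amounts = [float(amount) for _, amount in group]
        # Features: number of transactions and mean transaction amount
        yield customer_id, len(amounts), sum(amounts) / len(amounts)


def run_locally(raw_lines):
    """Simulate the sort that happens between the map and reduce phases."""
    return list(reducer(sorted(mapper(raw_lines))))


if __name__ == "__main__":
    raw = [
        "c1,2015-01-01,10.0",
        "c2,2015-01-01,5.0",
        "c1,2015-01-02,30.0",
        "bad row",
    ]
    for row in run_locally(raw):
        print(row)
```

On a real cluster the mapper and reducer would read from standard input and write to standard output as separate scripts; keeping them as generator functions here only makes the sketch easy to run on one machine.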
10. Productionizing Analytics
[Diagram labels]
Data Sources
Ingestion
Data Lake (HDFS)
Core Data Science
Production
• Model scoring
• Dashboards
Plug & play model deployment
Real-time Optimization with Multi-armed Bandit
Data
Real-time Data
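The slide names real-time optimization with a multi-armed bandit but does not say which bandit strategy is used. As a minimal sketch, assuming an epsilon-greedy policy (one common choice, picked here only for illustration), the idea looks roughly like this. The class name, the simulate helper, and the reward rates are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy policy (an assumed strategy, not taken from the slides)."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            # Explore: pick any arm uniformly at random
            return self.rng.randrange(len(self.counts))
        # Exploit: pick the arm with the highest estimated reward
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward for this arm
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=42):
    """Hypothetical environment: arm i pays 1.0 with probability true_rates[i]."""
    bandit = EpsilonGreedyBandit(len(true_rates), epsilon, seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    bandit = simulate([0.1, 0.5, 0.9])
    print("pulls per arm:", bandit.counts)
    print("estimated rates:", [round(v, 2) for v in bandit.values])
```

In a production setting the reward would come from real-time data (for example a click or a conversion) rather than from a simulated environment, and the arms would be the variants being optimized.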
Editor's notes
----- Meeting Notes (8/13/15 15:33) -----
My name is Matt McDevitt. I have been with Think Big for nearly 4 years, having started as one of the first employees. Over that time I helped grow the business, starting out at our HQ in Mountain View, CA. I then moved to Salt Lake City to build our first Solution. After that I moved to New York to accelerate our growth in the Northeast. My most recent move has been to London to grow our new International practice, making it my fourth city in less than 4 years with the company. Throughout my time at Think Big I have worn many different hats, most recently delivering one of our largest projects, which I will touch upon later in this presentation.