Big Data Berlin
Peter Wang
Continuum Analytics
pwang@continuum.io
@pwang
Agenda
• Big Data - An honest perspective
• Architecting for Data
• Continuum’s Tools
• DARPA & Data Science
About Peter
• Co-founder & President at Continuum
• Author of several Python libraries & tools
• Scientific, financial, engineering HPC using
Python, C, C++, etc.
• Interactive Visualization
• Organizer of Austin Python
• Background in Physics (BA Cornell ’99)
Big Data - An Honest Perspective
Origin of “Big Data” Movement
• Storage disruption: plummeting HDD costs,
cloud-based storage
• (I/O evolution: 10GbE SANs, flash drives)
• ETL disruption: Hadoop / Hive / HBase
• Basic analytics & statistics: “counting things”
Big Data (circa 2012)
http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/
The Players
• Data Processing & Low-level infrastructure
• Traditional BI vendors
• New BI startups
• Data-oriented startups
• Analytics-as-a-service
• “Big data” infrastructure platforms (DB &
analytical compute as a service)
Another perspective (2011)
• Diversification away from SQL & relational DBs
• “Messy” data, agile data processing
• Dynamic schema management
• Acknowledgement of heterogeneous data environment
• Focus on high performance
• Richer simulations, processing more data
• Modern hardware revolution (SSDs, GPUs, etc.)
• Advanced visualization
• Interactive, novel plots
• Beyond simple reports and dashboards
• Advanced analytics
• Richer statistical models, Bayesian approaches
• Machine learning
• Predictive databases
Observed Trends
Data Revolution
“Internet Revolution” True Believer, 1996:
Businesses that build network-oriented capability
into their core will fundamentally outcompete and
destroy their competition.
“Data Revolution” True Believer, 2010:
Businesses that build data comprehension into
their core will destroy their competition over the
next 5-10 years
Opportunities
• Advanced ML & Predictive DBs will provide
transformative insights to nearly every business.
• Mobile & hi-speed connectivity means more dimensions
of customer life are being digitized.
• Every bit of new data makes old data more valuable
• Analyzing historical data becomes more important
• Developing internal data analysis capability means you
can more easily build data products to sell downstream.
• This is becoming an industry unto itself.
Technical Challenges
• Hardware & software do not yet make data analysis
easy at terabyte scales
• Current analytics are mostly I/O bound. Next gen
“advanced” analytics will be compute bound
(simulations, distributed LinAlg). Efficiency matters.
• Reproducible analytical environment
• Library & language choices can add “air gaps” between
domain expert and analytical infrastructure.
Business Challenges
• Data exploration is a new discipline for most businesses
• Balancing agility & process for data-oriented processes
and analytical libraries.
• Bad data architecture will generally not cause
catastrophic failures
• Instead, will erode your ability to compete.
It’s hard to know when you are sucking.
Data Matters
• Data has mass.
• Scalability requires minimizing data-movement (only
as necessary).
• Deep/Advanced Analytics needs full computing
stack, as accessible as SQL and Excel
• Data should only move when it has to (to
communicate results, to replicate, to back-up) not
because the technology doesn’t allow access.
Algorithms Matter
...a Mac Mini running GraphChi can
analyze Twitter’s social graph from 2010,
which contains 40 million users and 1.2
billion connections, in 59 minutes. “The
previous published result on this problem
took 400 minutes using a cluster of about
1,000 computers,” Guestrin says.
-- MIT Tech Review
“...Spark, running on a cluster of 50
machines (100 CPUs) runs five iterations
of Pagerank on the twitter-2010 in 486.6
seconds. GraphChi solves the same
problem in less than double of the time
(790 seconds), with only 2 CPUs.”
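The two quotes above can be put on a common footing by multiplying machines (or CPUs) by elapsed time. The sketch below does only that arithmetic with the figures as quoted; it assumes the machine and CPU counts are comparable at face value, which ignores differences in hardware, and the variable names are illustrative.

```python
# Figures as quoted on the slide; counts are taken at face value.
prior_machine_minutes = 400 * 1000     # about 1,000 computers for 400 minutes
graphchi_machine_minutes = 59 * 1      # one Mac Mini for 59 minutes
machine_ratio = prior_machine_minutes / graphchi_machine_minutes

spark_cpu_seconds = 100 * 486.6        # 100 CPUs for 486.6 seconds
graphchi_cpu_seconds = 2 * 790         # 2 CPUs for 790 seconds
cpu_ratio = spark_cpu_seconds / graphchi_cpu_seconds

print(f"machine-minutes ratio: {machine_ratio:.0f}x")
print(f"CPU-seconds ratio:     {cpu_ratio:.1f}x")
```

On these numbers the single machine uses roughly 6,780 times fewer machine-minutes in the first comparison and roughly 31 times fewer CPU-seconds in the second.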
Berkeley Data Stack (BDAS)
Memory Matters
[Figure 1. Evolution of the hierarchical memory model. (a) 1980s: the primordial (and simplest) model: CPU, main memory, mechanical disk; (b) 90s-00s: the most common current implementation, which includes additional cache levels (Level 1 and Level 2 cache); and (c) 2010s: a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks. Axes: speed, capacity.]
Excerpt visible alongside the figure: “... implemented several memory layers with different capabilities: lower-level caches (that is, those closer to ... of memory hierarchy (for an example in progress, see the Sequoia project at www.stanford.edu/group/sequoia) ... Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible.”
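A small sketch of what "spatial locality" means for an array in memory, assuming NumPy and a C-ordered (row-major) array. The function names are illustrative. Both traversals give the same answer; they differ only in how far apart consecutive accesses land in memory, which is what the cache hierarchy rewards or punishes on large arrays.

```python
import numpy as np

# A C-ordered (row-major) array: the elements of one row are adjacent in memory.
a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Bytes to step to the next row and to the next column.
row_step, col_step = a.strides


def sum_row_major(arr):
    """Visit elements in memory order: consecutive accesses are col_step bytes apart."""
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total


def sum_column_major(arr):
    """Visit elements column by column: consecutive accesses are row_step bytes apart."""
    total = 0.0
    for j in range(arr.shape[1]):
        for i in range(arr.shape[0]):
            total += arr[i, j]
    return total


print(row_step, col_step)
print(sum_row_major(a), sum_column_major(a))
```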
Speed Matters
Jeff Hammerbacher’s Advice
• Instrument everything
• Put all your data in one place
• Data first, questions later
• Store first, structure later (often the data model is
dependent on the analysis you'd like to perform)
• Keep raw data forever
• Let everyone party on the data
• Introduce tools to support the whole research cycle
(think of the scope of the product as the entire cycle, not
just the container)
• Modular and composable infrastructure
Architecting for Data
Data exploration as the central task.
Data visualization as a first-class citizen.
Enable agility.
Python for Data Science
• Easy for domain experts to learn
• Powerful enough for software devs to
build backend infrastructure
• Mature and broad ecosystem of libraries
enables rich applications and scripts
(over 28,500 packages on PyPI).
• Very large community of users
• Syntax matters
Why Python?
Rich enough syntax & features to do
powerful, high-level things;
Easily extensible via C/C++/Fortran to
optimize low-level things;
Connects to existing infrastructure with
extremely large, capable third-party library
support.
Key Strengths
• Machine learning, statistical processing
• Text analytics
• Graph analysis
• Integration with Hadoop + additional map-
reduce and distributed data paradigms
• Over a decade of use in scientific computing
• Very popular among data scientists and
“regular” scientists
Python in Big Data
• Python for data analysis, big data, BI
• Uses much of SciPy, but adds libraries for
machine learning & advanced analytics
• Developer & user community
• Next conferences:
• Boston, July 27-28, 2013
• NYC & London, October/Nov 2013
• Looking for sponsors, local chapters, etc.
PyData
Python in Enterprise
• “Up & coming” technology; advocates are generally
early adopters and thought leaders
• Classic languages like Java, C# are safe bets;
frequently used for low-risk projects
• Python & others are used for innovation or
disruptive “skunk works” projects
• Potential impedance mismatch with some
organizations & dev groups
• No silver bullets
Domains
• Finance
• Geophysics
• Defense
• Advertising metrics & data analysis
• Scientific computing
Technologies
• Array/Columnar data processing
• Distributed computing, HPC
• GPU and new vector hardware
• Machine learning, predictive analytics
• Interactive Visualization
[Diagram labels: Enterprise Python; Scientific Computing; Data Processing; Data Analysis; Visualisation; Scalable Computing]
Continuum Analytics
• Out-of-core, distributed data computation
• Interactive visualization of massive datasets
• Advanced, powerful analytics, accessible to
domain experts and business users via a
simplified programming model
• Collaborative, shareable analysis
To revolutionize analysis and visualization by moving
high-level code and domain expertise to data
Mission
Big Picture
Empower domain experts with
high-level tools that exploit modern
hardware
Array Oriented Computing
Projects
Blaze: High-performance Python library for modern
vector computing, distributed and streaming data
Numba: Vectorizing Python compiler for multicore
and GPU, using LLVM
Bokeh: Interactive, grammar-based visualization
system for large datasets
Common theme: High-level, expressive language for
domain experts; innovative compilers & runtimes
for efficient, powerful data transformation
Objectives - Blaze
• Flexible descriptor for tabular and semi-structured data
• Seamless handling of:
• On-disk / Out of core
• Streaming data
• Distributed data
• Uniform treatment of:
• “arrays of structures” and
“structures of arrays”
• missing values
• “ragged” shapes
• categorical types
• computed columns
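To make the "arrays of structures" versus "structures of arrays" distinction concrete, here is a minimal sketch in plain NumPy. It is not Blaze's API; the field names are borrowed from the Kiva example later in the deck and the values are made up for illustration.

```python
import numpy as np

# "Array of structures": one record after another, fields interleaved in memory.
loan_dtype = np.dtype([("id", np.int64), ("funded_amount", np.float64)])
aos = np.array([(1, 925.0), (2, 300.0), (3, 1200.0)], dtype=loan_dtype)

# "Structure of arrays": one contiguous array per field (a columnar layout).
soa = {
    "id": np.array([1, 2, 3], dtype=np.int64),
    "funded_amount": np.array([925.0, 300.0, 1200.0], dtype=np.float64),
}

# The same column-wise question can be asked of both layouts.
total_aos = aos["funded_amount"].sum()
total_soa = soa["funded_amount"].sum()

# In the AoS layout a column is a strided view over whole records;
# in the SoA layout the column's values sit next to each other.
aos_stride = aos["funded_amount"].strides[0]
soa_stride = soa["funded_amount"].strides[0]

print(total_aos, total_soa, aos_stride, soa_stride)
```

A uniform treatment means user code such as the `.sum()` above should not have to care which of the two layouts sits underneath.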
Blaze Status
• DataShape type grammar
• NumPy-compatible C++ calculation engine (DyND)
• Synthesis of array function kernels (via LLVM)
• Fast timeseries routines (dynamic time warping for
pattern matching)
• Array Server prototype
• BLZ columnar storage format
• 0.1 Released at beginning of summer, working on 0.2
Schematic
[Diagram. Components: Database, GPU Node, NFS, three Array Servers, Blaze Client (synthesized array/table view). Connection schemes: array+sql://, array://, file://. Clients: Python REPL and scripts; Viz Data Server; C, C++, FORTRAN; JVM languages.]
Blaze Demos & Benchmarks
Kiva: Array Server
DataShape + Raw JSON = Web Service
type KivaLoan = {
    id: int64;
    name: string;
    description: {
        languages: var, string(2);
        texts: json # map<string(2), string>;
    };
    status: string; # LoanStatusType;
    funded_amount: float64;
    basket_amount: json; # Option(float64);
    paid_amount: json; # Option(float64);
    image: {
        id: int64;
        template_id: int64;
    };
    video: json;
    activity: string;
    sector: string;
    use: string;
    delinquent: bool;
    location: {
        country_code: string(2);
        country: string;
        town: json; # Option(string);
        geo: {
            level: string; # GeoLevelType
            pairs: string; # latlong
            type: string; # GeoTypeType
        }
    };
    ....
{"id":200533,"name":"Miawand Group","description":{"languages":
["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the
16th district of Kabul, Afghanistan. He lives in a family of eight members. He
is single, but is a responsible boy who works hard and supports the whole
family. He is a carpenter and is busy working in his shop seven days a week.
He needs the loan to purchase wood and needed carpentry tools such as tape
measures, rulers and so on.\r\n \r\nHe hopes to make progress through the
loan and he is confident that will make his repayments on time and will join
for another loan cycle as well. \r\n\r\n"}},"status":"paid","funded_amount":
925,"basket_amount":null,"paid_amount":925,"image":{"id":
539726,"template_id":
1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants
to buy tools for his carpentry shop","delinquent":null,"location":
{"country_code":"AF","country":"Afghanistan","town":"Kabul
Afghanistan","geo":{"level":"country","pairs":"33
65","type":"point"}},"partner_id":
34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,
"loan_amount":925,"currency_exchange_loss_amount":null,"borrowers":
[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},
{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},
{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms":
{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN",
"disbursal_amount":42000,"loan_amount":925,"local_payments":
[{"due_date":"2010-06-13T07:00:00Z","amount":4200},
{"due_date":"2010-07-13T07:00:00Z","amount":4200},
{"due_date":"2010-08-13T07:00:00Z","amount":4200},
{"due_date":"2010-09-13T07:00:00Z","amount":4200},
{"due_date":"2010-10-13T07:00:00Z","amount":4200},
{"due_date":"2010-11-13T08:00:00Z","amount":4200},
{"due_date":"2010-12-13T08:00:00Z","amount":4200},
{"due_date":"2011-01-13T08:00:00Z","amount":4200},
{"due_date":"2011-02-13T08:00:00Z","amount":4200},
{"due_date":"2011-03-13T08:00:00Z","amount":
4200}],"scheduled_payments": ...
2.9 GB of JSON => network-queryable array: ~5 minutes
http://192.34.58.57:8080/kiva/loans
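The idea of pairing a type description with raw JSON can be sketched with the Python standard library alone. This is not Blaze or DataShape code. `LoanSummary`, `parse_loan`, and `_opt_float` are hypothetical names, only a handful of the KivaLoan fields are covered, and the sample record is a trimmed copy of the one shown above. Fields annotated `Option(float64)` in the type are mapped to `Optional[float]`.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoanSummary:
    """Hypothetical typed view over a small subset of the KivaLoan fields."""
    id: int
    name: str
    status: str
    funded_amount: float
    basket_amount: Optional[float]   # Option(float64): null in the JSON becomes None
    paid_amount: Optional[float]     # Option(float64)
    country_code: str


def _opt_float(value):
    """Convert a JSON number to float, passing null through as None."""
    return None if value is None else float(value)


def parse_loan(raw: str) -> LoanSummary:
    doc = json.loads(raw)
    return LoanSummary(
        id=int(doc["id"]),
        name=doc["name"],
        status=doc["status"],
        funded_amount=float(doc["funded_amount"]),
        basket_amount=_opt_float(doc["basket_amount"]),
        paid_amount=_opt_float(doc["paid_amount"]),
        country_code=doc["location"]["country_code"],
    )


# Trimmed copy of the record shown on the slide.
raw = (
    '{"id":200533,"name":"Miawand Group","status":"paid","funded_amount":925,'
    '"basket_amount":null,"paid_amount":925,"location":{"country_code":"AF"}}'
)
loan = parse_loan(raw)
print(loan)
```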
Akamai Dataset ETL
                   Hive                     Python script
Hardware           8x 16 core, 2 GHz        1x 8 core, 2.2 GHz
                   (128 cores)
Memory             RAM: 8x 382 GB           RAM: 144 GB
                   HDD: 8x 15k rpm          HDD: 2x 7200 rpm
Time (traceroute)  5 hrs, 635M routes       11 hrs, 113M routes
Routes/hr/GHz      496k                     584k

• Python performs ~18% better with almost no optimization
• Resulting IPMap can be used for realtime, online query and
aggregation
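The Routes/hr/GHz figures can be reproduced from the other numbers on the slide. The sketch assumes the metric divides throughput by total core-GHz (cores times clock speed), which is the reading that matches the quoted values. The function name is illustrative.

```python
def routes_per_hour_per_ghz(routes, hours, cores, ghz_per_core):
    """Throughput normalized by aggregate clock capacity (cores x GHz)."""
    return routes / hours / (cores * ghz_per_core)


hive = routes_per_hour_per_ghz(635e6, 5, 128, 2.0)
python_script = routes_per_hour_per_ghz(113e6, 11, 8, 2.2)
advantage = python_script / hive - 1

print(f"Hive:   {hive / 1000:.0f}k routes/hr/GHz")
print(f"Python: {python_script / 1000:.0f}k routes/hr/GHz")
print(f"Python advantage: {advantage:.1%}")
```

This gives about 496k and 584k, a difference of a little under 18%.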
Querying Traceroute in BLZ format
               1k Random           Full Scan
               Time     RAM        Time     RAM
BLZ (disk)     3.5s     0.04 MB    2.9s     8 MB
BLZ (mem)      2.37s    210 MB     2.4s     210 MB
NPY (memmap)   0.24s    0.2 MB     0.23s    602 MB
NumPy (mem)    0.13s    603 MB     0.23s    603 MB
Meant for dealing with Big Data
(RAM consumption is extremely low)
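The two access patterns in the table, 1k random lookups versus a full scan, can be tried on a memory-mapped NPY file with plain NumPy. This sketch corresponds to the "NPY (memmap)" row only; it does not use the BLZ format, and the file name and array contents are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical file standing in for the traceroute data.
path = os.path.join(tempfile.mkdtemp(), "routes.npy")

# Write a test array to disk in NPY format.
data = np.arange(1_000_000, dtype=np.int64)
np.save(path, data)

# Memory-map the file: pages are read from disk only when they are touched.
mapped = np.load(path, mmap_mode="r")

# "1k Random": a thousand lookups touch only a small part of the file.
rng = np.random.default_rng(0)
idx = np.sort(rng.integers(0, mapped.shape[0], size=1000))
sample = np.asarray(mapped[idx])

# "Full Scan": an aggregate has to stream through the whole file.
total = int(mapped.sum())

print(sample[:5], total)
```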
Numba
• Just-in-time, dynamic compiler for Python
• Optimize data-parallel computations at call time,
to take advantage of local hardware configuration
• Compatible with NumPy, Blaze
• Leverage LLVM ecosystem:
• Optimization passes
• Inter-op with other languages
• Variety of backends (e.g. CUDA for GPU support)
[Diagram: LLVM-based architecture. Source languages Python (via Numba), C, C++, and Fortran compile to LLVM IR, which targets x86, ARM, and PTX.]
Numba turns Python into a “compiled language”
Example
Numba
LLVM-based architecture
Image Processing
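from numba import jit

# Explicit signature: three 2-D float64 arrays, no return value.
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    # Visit every pixel where the m x n kernel fits entirely inside the image.
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            # Weighted sum of the neighborhood centered on (i, j).
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2,j+l-n//2]*filt[k, l]
            output[i,j] = result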
~1500x speed-up
Example: Mandelbrot Vectorized
from numbapro import vectorize

# Return type uint8; arguments in the order of the function definition below.
sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width,
           height, iters):
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    # Recover the pixel's column and row from its flat index.
    x = tid % width
    y = tid // width  # integer division, so the row index stays an integer
    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y
    c = complex(real, imag)
    z = 0.0j
    # Iterate z -> z*z + c and report the step at which |z|^2 reaches 4.
    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255
Kind     Time     Speed-up
Python   263.6    1.0x
CPU      2.639    100x
GPU      0.1676   1573x
(GPU: Tesla S2050)
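For readers without NumbaPro or a GPU, the same per-pixel kernel can be run as ordinary Python, which corresponds to the "Python" row of the table. The sketch below is a plain-Python restatement, not the vectorized version. The 4x4 grid, the coordinate bounds, and the iteration count are arbitrary choices for illustration.

```python
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    """Escape-time value for the pixel with flat index tid (plain Python, no JIT)."""
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    x = tid % width       # column
    y = tid // width      # row
    c = complex(min_x + x * pixel_size_x, min_y + y * pixel_size_y)
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255


# Evaluate a tiny 4x4 image over [-2, 2) x [-2, 2), one call per pixel.
width, height, iters = 4, 4, 20
image = [
    mandel(tid, -2.0, 2.0, -2.0, 2.0, width, height, iters)
    for tid in range(width * height)
]

for row in range(height):
    print(image[row * width:(row + 1) * width])
```

The list comprehension stands in for what the vectorized version does across threads: every pixel is computed independently from its own index.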
Bokeh
• Language-based (instead of GUI) visualization system
• High-level expressions of data binding, statistical transforms,
interactivity and linked data
• Easy to learn, but expressive depth for power users
• Interactive
• Data space configuration as well as data selection
• Specified from high-level language constructs
• Web as first class interface target
• Support for large datasets via intelligent downsampling
(“abstract rendering”)
Bokeh
Inspirations:
• Chaco: interactive, viz pipeline for large data
• Protovis & Stencil: binding visual glyphs to data and expressions
• ggplot2: faceting, statistical overlays
Design goal:
Accessible, extensible, interactive plotting for the web...
... for non-Javascript programmers
Bokeh & BokehJS Demos
• BokehJS demos
• Audio Spectrogram
• Bokeh Examples
- Low-level Python interface
- IPython Notebook
integration
- ggplot example
Abstract Rendering
Pixels are bins… and always have been
[Diagram: geometry (shapes A, B, C, D) seen through a Z-view is binned into pixels; per-pixel counts: 1 2 2 3 4 4 3 2 2 1]
Hi-def Alpha
Kiva: Abstract Rendering
Basic AR can identify trouble spots in standard plots, and also
offer automatic tone mapping, taking perception into account.
37 million elements, showing adjacency between entities in Kiva dataset
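The "pixels are bins" idea and a simple tone-mapping step can be sketched with the standard library. This is an illustration of the concept, not Bokeh's abstract rendering code. The function names are hypothetical, and the log scaling is one assumed choice of tone map, since the slides do not say which mapping is used.

```python
import math


def bin_points(points, width, height, x_range, y_range):
    """Count how many points fall into each pixel-sized bin."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # outside the view
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        counts[row][col] += 1
    return counts


def tone_map(counts):
    """Map counts to 0..255 on a log scale so sparse bins stay visible next to dense ones."""
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[0] * len(row) for row in counts]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(c) * scale) for c in row] for row in counts]


# Heavy overplotting in one corner, a few points elsewhere.
points = [(0.1, 0.1)] * 100 + [(0.9, 0.9)] * 3 + [(0.6, 0.2)]
counts = bin_points(points, width=2, height=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
shades = tone_map(counts)

print(counts)
print(shades)
```

Working on the counts rather than on drawn pixels is what lets the renderer notice that one bin holds 100 points while another holds one, which ordinary overplotting would hide.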
Abstract Rendering
[Diagram: geometry (A, B, C, D) is reduced to aggregates (“abstract” pixels), which are then transferred to pixels; aggregate values shown as ? ? ? ? ? ? ? ? ? ?]
Kiva: Abstract Rendering of Sparsity
“Drawing the Dark”
Akin to mapping the ocean trenches; typical viz starts at sea level & goes up.
Spatial example
http://Wakari.io
• Cloud-hosted Python analytics environment
• Full Linux sandbox for every user
• IPython notebook
• Interactive Javascript plotting
• Easy to share notebooks & code with other users
• Free plan: 512 MB memory, 10 GB disk
• Premium plans include: more powerful machines,
more memory/disk, SSH access, cluster support
Data Summary Explorer
Continuum Data Explorer (CDX)
White House Big Data Initiative
• $200 million for NIH, NSF,
DOE, DOD, USGS
• DoD investing $60 mil
annually on new programs
• $25 mil for XDATA
DARPA XDATA (BAA-12-38)
“A large and critical part of DoD data can be characterized as semi-
structured, heterogeneous, and scientifically collected data with
varied amounts of completeness and standardization. Therefore a
one-size-fits-all end-to-end system is unlikely to meet all
analytical goals...”
“DoD collected data are particularly difficult to deal with, including
missing data, missing connections between data, incomplete data,
corrupted data, data of variable size and type, etc.”
XDATA Needs
“MapReduce ... results in selection bias for certain types of
problems, which may prevent ... a comprehensive understanding
of the data.”
• Develop analytical principles which scale across data volume and
distributed architecture
• Minimize design-to-execution time
• Leverage problem structure to create new algorithms that trade-
off time/space/stream complexity
• Distributed sampling & estimation techniques
• Distributed dimensionality reduction, matrix factorization
• Determining optimal cloud configuration & resource allocation
with asymmetric components (GPU, big-mem nodes, etc.)
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Big data berlin

  • 1. Big Data Berlin Peter Wang Continuum Analytics pwang@continuum.io @pwang
  • 2. Agenda • Big Data - An honest perspective • Architecting for Data • Continuum’s Tools • DARPA & Data Science
  • 3. About Peter • Co-founder & President at Continuum • Author of several Python libraries & tools • Scientific, financial, engineering HPC using Python, C, C++, etc. • Interactive Visualization • Organizer of Austin Python • Background in Physics (BA Cornell ’99)
  • 4. Big Data - An Honest Perspective
  • 5. Origin of “Big Data” Movement • Storage disruption: plummeting HDD costs, cloud-based storage • (I/O evolution: 10gE SANs, Flash drives) • ETL disruption: Hadoop / Hive / HBase • Basic analytics & statistics: “counting things”
  • 8. The Players • Data Processing & Low-level infrastructure • Traditional BI vendors • New BI startups • Data-oriented startups • Analytics-as-a-service • “Big data” infrastructure platforms (DB & analytical compute as a service)
  • 10. Observed Trends • Diversification away from SQL & relational DBs • “Messy” data, agile data processing • Dynamic schema management • Acknowledgement of heterogeneous data environment • Focus on high performance • Richer simulations, processing more data • Modern hardware revolution (SSDs, GPUs, etc.) • Advanced visualization • Interactive, novel plots • Beyond simple reports and dashboards • Advanced analytics • Richer statistical models, Bayesian approaches • Machine learning • Predictive databases
  • 11. Data Revolution “Internet Revolution” True Believer, 1996: Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition. “Data Revolution” True Believer, 2010: Businesses that build data comprehension into their core will destroy their competition over the next 5-10 years
  • 12. Opportunities • Advanced ML & Predictive DBs will provide transformative insights to nearly every business. • Mobile & hi-speed connectivity means more dimensions of customer life are being digitized. • Every bit of new data makes old data more valuable • Analyzing historical data becomes more important • Developing internal data analysis capability means you can more easily build data products to sell downstream. • This is becoming an industry unto itself.
  • 13. Technical Challenges • Hardware & software do not yet make data analysis easy at terabyte scales • Current analytics are mostly I/O bound. Next gen “advanced” analytics will be compute bound (simulations, distributed LinAlg). Efficiency matters. • Reproducible analytical environment • Library & language choices can add “air gaps” between domain expert and analytical infrastructure.
  • 14. Business Challenges • Data exploration is a new discipline for most businesses • Balancing agility & process for data-oriented processes and analytical libraries. • Bad data architecture will generally not cause catastrophic failures • Instead, it will erode your ability to compete. It’s hard to know when you are sucking.
  • 15. Data Matters • Data has mass. • Scalability requires minimizing data-movement (only as necessary). • Deep/Advanced Analytics needs full computing stack, as accessible as SQL and Excel • Data should only move when it has to (to communicate results, to replicate, to back-up) not because the technology doesn’t allow access.
  • 16. Algorithms Matter “...a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010, which contains 40 million users and 1.2 billion connections, in 59 minutes. ‘The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers,’ Guestrin says.” -- MIT Tech Review “...Spark, running on a cluster of 50 machines (100 CPUs) runs five iterations of Pagerank on the twitter-2010 in 486.6 seconds. GraphChi solves the same problem in less than double of the time (790 seconds), with only 2 CPUs.”
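To put the two quotes on a common footing, the figures can be turned into total worker-time (wall-clock time multiplied by the number of machines or CPUs). This is only a rough sketch: worker-time is a crude metric chosen here for illustration, it ignores hardware differences between the systems, and the helper name `resource_time` is made up.

```python
# Figures quoted on the slide.
cluster_minutes = 400        # previous published result
cluster_machines = 1000      # "about 1,000 computers"
mac_mini_minutes = 59        # GraphChi on a single Mac Mini

spark_seconds, spark_cpus = 486.6, 100
graphchi_seconds, graphchi_cpus = 790.0, 2


def resource_time(wall_time, workers):
    """Total worker-time consumed: wall-clock time times worker count."""
    return wall_time * workers


# How many times more machine-minutes the cluster run consumed.
machine_minutes_ratio = (
    resource_time(cluster_minutes, cluster_machines)
    / resource_time(mac_mini_minutes, 1)
)

# How many times more CPU-seconds the Spark run consumed.
cpu_seconds_ratio = (
    resource_time(spark_seconds, spark_cpus)
    / resource_time(graphchi_seconds, graphchi_cpus)
)

print(f"cluster: ~{machine_minutes_ratio:.0f}x the machine-minutes")
print(f"Spark:   ~{cpu_seconds_ratio:.1f}x the CPU-seconds")
```

Under this metric the single-machine runs are slower in wall-clock terms for the Spark comparison but far cheaper in total resources, which is the point the slide is making.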
  • 18. Memory Matters [Figure 1. Evolution of the hierarchical memory model (panels labelled 1980s, 90s-00s, 2010s; axes: speed, capacity). (a) The primordial (and simplest) model: CPU, main memory, mechanical disk; (b) the most common current implementation, which includes additional cache levels: CPU, level 1 cache, level 2 cache, main memory, mechanical disk; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.] “Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible.”
  • 20. Jeff Hammerbacher’s Advice • Instrument everything • Put all your data in one place • Data first, questions later • Store first, structure later (often the data model is dependent on the analysis you'd like to perform) • Keep raw data forever • Let everyone party on the data • Introduce tools to support the whole research cycle (think of the scope of the product as the entire cycle, not just the container) • Modular and composable infrastructure
  • 21. Architecting for Data Data exploration as the central task. Data visualization as a first-class citizen. Enable agility.
  • 22. Python for Data Science
  • 23.
  • 24. Why Python? • Easy for domain experts to learn • Powerful enough for software devs to build backend infrastructure • Mature and broad ecosystem of libraries enables rich applications and scripts (over 28,500 packages on PyPI). • Very large community of users • Syntax matters
  • 25. Key Strengths • Rich enough syntax & features to do powerful, high-level things • Easily extensible via C/C++/Fortran to optimize low-level things • Connects to existing infrastructure with extremely large, capable third-party library support
  • 26. Python in Big Data • Machine learning, statistical processing • Text analytics • Graph analysis • Integration with Hadoop + additional map-reduce and distributed data paradigms • Over a decade of use in scientific computing • Very popular among data scientists and “regular” scientists
  • 27. PyData • Python for data analysis, big data, BI • Uses much of SciPy, but adds libraries for machine learning & advanced analytics • Developer & user community • Next conferences: Boston, July 27-28, 2013; NYC & London, October/Nov 2013 • Looking for sponsors, local chapters, etc.
  • 28. Python in Enterprise • “Up & coming” technology; advocates are generally early adopters and thought leaders • Classic languages like Java, C# are safe bets; frequently used for low-risk projects • Python & others are used for innovation or disruptive “skunk works” projects • Potential impedance mismatch with some organizations & dev groups • No silver bullets
  • 29. Continuum Analytics • Domains: Finance; Geophysics; Defense; Advertising metrics & data analysis; Scientific computing • Technologies: Array/Columnar data processing; Distributed computing, HPC; GPU and new vector hardware; Machine learning, predictive analytics; Interactive Visualization [Diagram labels: Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualisation, Scalable Computing]
  • 30. Mission: To revolutionize analysis and visualization by moving high-level code and domain expertise to data • Out-of-core, distributed data computation • Interactive visualization of massive datasets • Advanced, powerful analytics, accessible to domain experts and business users via a simplified programming model • Collaborative, shareable analysis
  • 31. Big Picture: Empower domain experts with high-level tools that exploit modern hardware. Array Oriented Computing.
  • 32. Projects • Blaze: High-performance Python library for modern vector computing, distributed and streaming data • Numba: Vectorizing Python compiler for multicore and GPU, using LLVM • Bokeh: Interactive, grammar-based visualization system for large datasets • Common theme: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation
  • 33. Objectives - Blaze • Flexible descriptor for tabular and semi-structured data • Seamless handling of: • On-disk / Out of core • Streaming data • Distributed data • Uniform treatment of: • “arrays of structures” and “structures of arrays” • missing values • “ragged” shapes • categorical types • computed columns
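The distinction between “arrays of structures” and “structures of arrays” can be illustrated without Blaze at all. The sketch below uses only the Python standard library and is not Blaze's API; the second and third records, the field selection, and the use of NaN to mark a missing value are assumptions made for the illustration.

```python
import math
from array import array

# "Array of structures": one record (tuple) per loan.
# Only the first id and amount come from the Kiva sample; the rest are made up.
loans_aos = [
    (200533, 925.0, "paid"),
    (200534, math.nan, "funded"),   # NaN marks a missing amount (assumption)
    (200535, 450.0, "paid"),
]


def to_soa(records):
    """Convert row records into a "structure of arrays": one column per field."""
    ids, amounts, statuses = zip(*records)
    return {
        "id": array("q", ids),          # contiguous 64-bit integer column
        "amount": array("d", amounts),  # contiguous float64 column
        "status": list(statuses),       # strings stay in a plain list
    }


def mean_skipping_missing(column):
    """Average a numeric column, ignoring NaN placeholders."""
    present = [v for v in column if not math.isnan(v)]
    return sum(present) / len(present)


loans_soa = to_soa(loans_aos)

# The same query gives the same answer on either layout; a uniform
# descriptor would let the caller ignore which layout is underneath.
mean_from_soa = mean_skipping_missing(loans_soa["amount"])
mean_from_aos = mean_skipping_missing([row[1] for row in loans_aos])
```

The columnar layout keeps each field contiguous, which suits scans over one field; the row layout keeps each record together, which suits whole-record access.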
  • 34. Blaze Status • DataShape type grammar • NumPy-compatible C++ calculation engine (DyND) • Synthesis of array function kernels (via LLVM) • Fast timeseries routines (dynamic time warping for pattern matching) • Array Server prototype • BLZ columnar storage format • 0.1 Released at beginning of summer, working on 0.2
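For readers unfamiliar with dynamic time warping, a minimal textbook version is sketched below. It is not the Blaze routine mentioned on the slide (whose implementation is not shown in the deck); it is the classic quadratic-time dynamic program with absolute difference as the local cost, no window constraint, and example sequences invented for the illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Classic O(len(a) * len(b)) dynamic program; the local cost is the
    absolute difference and there is no warping-window constraint.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays on the same sample
                cost[i][j - 1],      # b advances, a stays on the same sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


pattern = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same shape at half speed
shifted = [1, 2, 3, 2, 1]                    # same shape, offset by +1

same_shape = dtw_distance(pattern, stretched)
offset_shape = dtw_distance(pattern, shifted)
```

Because warping absorbs differences in speed, the stretched copy matches the pattern exactly, while a vertical offset still shows up as a non-zero distance.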
  • 35. Schematic [Architecture diagram. Labels: Blaze Client; Synthesized Array/Table view; Array Server (x3); Database; GPU Node; NFS; Data Server; URL schemes array+sql://, array://, file://; clients: Python REPL, scripts, viz, C, C++, FORTRAN, JVM languages]
  • 36. Blaze Demos & Benchmarks
  • 37. Kiva: Array Server. DataShape + Raw JSON = Web Service
      type KivaLoan = {
        id: int64;
        name: string;
        description: {
          languages: var, string(2);
          texts: json # map<string(2), string>;
        };
        status: string; # LoanStatusType;
        funded_amount: float64;
        basket_amount: json; # Option(float64);
        paid_amount: json; # Option(float64);
        image: { id: int64; template_id: int64; };
        video: json;
        activity: string;
        sector: string;
        use: string;
        delinquent: bool;
        location: {
          country_code: string(2);
          country: string;
          town: json; # Option(string);
          geo: {
            level: string; # GeoLevelType
            pairs: string; # latlong
            type: string; # GeoTypeType
          }
        };
        ....
      {"id":200533,"name":"Miawand Group","description":{"languages": ["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the 16th district of Kabul, Afghanistan. He lives in a family of eight members. He is single, but is a responsible boy who works hard and supports the whole family. He is a carpenter and is busy working in his shop seven days a week. He needs the loan to purchase wood and needed carpentry tools such as tape measures, rulers and so on.\r\n \r\nHe hopes to make progress through the loan and he is confident that will make his repayments on time and will join for another loan cycle as well. \r\n\r\n"}},"status":"paid","funded_amount": 925,"basket_amount":null,"paid_amount":925,"image":{"id": 539726,"template_id": 1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants to buy tools for his carpentry shop","delinquent":null,"location": {"country_code":"AF","country":"Afghanistan","town":"Kabul Afghanistan","geo":{"level":"country","pairs":"33 65","type":"point"}},"partner_id": 34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loan_amount":925,"currency_exchange_loss_amount":null,"borrowers": [{"first_name":"Ozer","last_name":"","gender":"M","pictured":true}, {"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true}, {"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms": {"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbursal_amount":42000,"loan_amount":925,"local_payments": [{"due_date":"2010-06-13T07:00:00Z","amount":4200}, {"due_date":"2010-07-13T07:00:00Z","amount":4200}, {"due_date":"2010-08-13T07:00:00Z","amount":4200}, {"due_date":"2010-09-13T07:00:00Z","amount":4200}, {"due_date":"2010-10-13T07:00:00Z","amount":4200}, {"due_date":"2010-11-13T08:00:00Z","amount":4200}, {"due_date":"2010-12-13T08:00:00Z","amount":4200}, {"due_date":"2011-01-13T08:00:00Z","amount":4200}, {"due_date":"2011-02-13T08:00:00Z","amount":4200}, {"due_date":"2011-03-13T08:00:00Z","amount": 4200}],"scheduled_payments": ...
      2.9 GB of JSON => network-queryable array: ~5 minutes
      http://192.34.58.57:8080/kiva/loans
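The idea of pairing a type description with raw JSON can be approximated in plain Python. The sketch below is not the Blaze Array Server; it only shows raw loan records being parsed into typed, per-field columns, with a nullable field kept as `None` in the spirit of the `Option(float64)` comments in the schema. The first record reuses values from the sample, the second record is invented, and `loans_to_columns` is a hypothetical helper name.

```python
import json

# Two abbreviated records, one JSON document per line.
# The second record is made up to show a missing (null) field.
raw_lines = [
    '{"id": 200533, "status": "paid", "funded_amount": 925,'
    ' "paid_amount": 925, "location": {"country_code": "AF"}}',
    '{"id": 1, "status": "fundraising", "funded_amount": 100,'
    ' "paid_amount": null, "location": {"country_code": "KE"}}',
]


def loans_to_columns(lines):
    """Parse one JSON document per line into per-field columns."""
    columns = {
        "id": [],
        "status": [],
        "funded_amount": [],
        "paid_amount": [],
        "country_code": [],
    }
    for line in lines:
        loan = json.loads(line)
        columns["id"].append(int(loan["id"]))
        columns["status"].append(loan["status"])
        columns["funded_amount"].append(float(loan["funded_amount"]))
        # Nullable field: keep None when the JSON value is null.
        paid = loan.get("paid_amount")
        columns["paid_amount"].append(None if paid is None else float(paid))
        # Nested field flattened into its own column.
        columns["country_code"].append(loan["location"]["country_code"])
    return columns


columns = loans_to_columns(raw_lines)
```

Once the records are in columns, a query such as "total funded amount per country" touches two lists instead of re-parsing every document.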
  • 38. Akamai Dataset ETL: Hive vs. Python script
      Hive: hardware 8x 16 core, 2 GHz (128 cores); memory RAM 8x 382 GB, HDD 8x 15k rpm; time (traceroute) 5 hrs, 635M routes; 496k routes/hr/GHz
      Python script: hardware 1x 8 core, 2.2 GHz; memory RAM 144 GB, HDD 2x 7200 rpm; time (traceroute) 11 hrs, 113M routes; 584k routes/hr/GHz
      • Python performs ~18% better with almost no optimization
      • Resulting IPMap can be used for realtime, online query and aggregation
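The routes/hr/GHz figures can be reproduced from the hardware and timing numbers on the slide. The sketch below assumes the normalisation is routes per hour divided by aggregate clock rate (cores times GHz per core); that formula is not stated in the deck, but it matches both quoted values.

```python
def routes_per_hour_per_ghz(routes, hours, cores, ghz_per_core):
    """Throughput normalised by aggregate clock rate (cores * GHz)."""
    return routes / hours / (cores * ghz_per_core)


# Hive: 8 machines x 16 cores at 2 GHz, 635M routes in 5 hours.
hive = routes_per_hour_per_ghz(635e6, 5, 8 * 16, 2.0)

# Python script: 1 machine x 8 cores at 2.2 GHz, 113M routes in 11 hours.
python_script = routes_per_hour_per_ghz(113e6, 11, 8, 2.2)

# Relative advantage of the Python script on this normalised metric.
advantage = python_script / hive - 1

print(f"Hive:   {hive / 1e3:.0f}k routes/hr/GHz")
print(f"Python: {python_script / 1e3:.0f}k routes/hr/GHz")
print(f"Python advantage: {advantage:.1%}")
```

Note that the metric rewards per-cycle efficiency only; in wall-clock terms the Hive cluster still processed far more routes per hour.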
  • 39. Querying Traceroute in BLZ format
      BLZ (disk): 1k random 3.5s, 0.04mb RAM; full scan 2.9s, 8mb RAM
      BLZ (mem): 1k random 2.37s, 210mb RAM; full scan 2.4s, 210mb RAM
      NPY (memmap): 1k random 0.24s, 0.2mb RAM; full scan 0.23s, 602mb RAM
      NumPy (mem): 1k random .13s, 603mb RAM; full scan 0.23s, 603mb RAM
      Meant for dealing with Big Data (RAM consumption is extremely low)
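The low-RAM random-access pattern in the NPY (memmap) row can be sketched with the standard library's `mmap` module. This is an analogy rather than the BLZ format: there is no compression or chunking here, and the file layout (a flat run of little-endian float64 values), the file name, and the helper names are assumptions for the illustration.

```python
import mmap
import os
import random
import struct
import tempfile

RECORD = struct.Struct("<d")   # one little-endian float64 per record


def write_values(path, count):
    """Write the values 0.0, 1.0, ..., count - 1 as a flat binary file."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(float(i)))


def read_random(path, indices):
    """Fetch selected records through a read-only memory map.

    Only the pages holding the requested records need to be resident,
    so memory use stays small however large the file is.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [RECORD.unpack_from(mm, i * RECORD.size)[0] for i in indices]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    write_values(path, 100_000)
    # A "1k random" query: 1,000 distinct record positions.
    picks = random.Random(0).sample(range(100_000), 1000)
    values = read_random(path, picks)
```

A full scan through the same map would touch every page, which is consistent with the memmap row showing low RAM for random reads and high RAM for the full scan.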
  • 40. Numba • Just-in-time, dynamic compiler for Python • Optimize data-parallel computations at call time, to take advantage of local hardware configuration • Compatible with NumPy, Blaze • Leverage LLVM ecosystem: • Optimization passes • Inter-op with other languages • Variety of backends (e.g. CUDA for GPU support)
  • 41. Numba turns Python into a “compiled language” [Diagram labels: Python, C, C++, Fortran; Numba; LLVM IR; x86, ARM, PTX]
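  • 44. Image Processing (~1500x speed-up)
      from numba import jit

      # 2-D filter over the interior of `image`; the result is written into `output`.
      @jit('void(f8[:,:],f8[:,:],f8[:,:])')
      def filter(image, filt, output):
          M, N = image.shape
          m, n = filt.shape
          # Skip a border of half the filter size on each side.
          for i in range(m//2, M-m//2):
              for j in range(n//2, N-n//2):
                  result = 0.0
                  # Weighted sum over the filter window centred on (i, j).
                  for k in range(m):
                      for l in range(n):
                          result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                  output[i,j] = result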
  • 45. Example: Mandelbrot Vectorized
      from numbapro import vectorize

      sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

      @vectorize([sig], target='gpu')
      def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
          pixel_size_x = (max_x - min_x) / width
          pixel_size_y = (max_y - min_y) / height
          # Recover pixel coordinates from the flat thread index.
          x = tid % width
          y = tid // width   # integer division (the slide writes tid / width)
          real = min_x + x * pixel_size_x
          imag = min_y + y * pixel_size_y
          c = complex(real, imag)
          z = 0.0j
          for i in range(iters):
              z = z * z + c
              if (z.real * z.real + z.imag * z.imag) >= 4:
                  return i
          return 255

      Kind: Time, Speed-up
      Python: 263.6, 1.0x
      CPU: 2.639, 100x
      GPU (Tesla S2050): 0.1676, 1573x
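To see what the vectorized kernel computes per element, the same function can be run as ordinary Python over every flat index. This sketch drops the NumbaPro decorator entirely (so it says nothing about GPU behaviour) and uses explicit integer division; the 8 by 4 grid, the view bounds, and the iteration count are arbitrary values chosen for the illustration.

```python
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    """Escape-time count for the pixel with flat index `tid`."""
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    x = tid % width            # column of this pixel
    y = tid // width           # row of this pixel (integer division)
    c = complex(min_x + x * pixel_size_x, min_y + y * pixel_size_y)
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return i           # escaped after i iterations
    return 255                 # treated as inside the set


width, height, iters = 8, 4, 20

# The vectorized version applies the scalar function to each element of
# an index array; a list comprehension plays that role here.
image = [
    mandel(tid, -2.0, 1.0, -1.0, 1.0, width, height, iters)
    for tid in range(width * height)
]

# Reshape the flat result into rows for display.
rows = [image[r * width:(r + 1) * width] for r in range(height)]
for row in rows:
    print(row)
```

Each pixel is independent of the others, which is why the per-element form maps naturally onto data-parallel hardware.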
  • 46. Bokeh • Language-based (instead of GUI) visualization system • High-level expressions of data binding, statistical transforms, interactivity and linked data • Easy to learn, but expressive depth for power users • Interactive • Data space configuration as well as data selection • Specified from high-level language constructs • Web as first class interface target • Support for large datasets via intelligent downsampling (“abstract rendering”)
  • 47. Bokeh Inspirations: • Chaco: interactive, viz pipeline for large data • Protovis & Stencil: Binding visual glyphs to data and expressions • ggplot2: faceting, statistical overlays Design goal: Accessible, extensible, interactive plotting for the web... ... for non-JavaScript programmers
  • 48. Bokeh & BokehJS Demos • BokehJS demos • Audio Spectrogram • Bokeh Examples - Low-level Python interface - IPython Notebook integration - ggplot example
  • 49. Abstract Rendering: Pixels are Bins… and always have been [Diagram labels: Geometry (A, B, C, D); Z-View; Counts (1 2 2 3 4 4 3 2 2 1); Pixels]
  • 51. Kiva: Abstract Rendering Basic AR can identify trouble spots in standard plots, and also offer automatic tone mapping, taking perception into account. 37 mil elements, showing adjacency between entities in Kiva dataset
  • 52. Abstract Rendering [Diagram labels: Geometry (A, B, C, D); Reduce; Aggregates (“Abstract” Pixels); Transfer; Pixels]
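The Reduce and Transfer stages named on the slide can be sketched in a few lines. This is an illustration of the general idea, not Bokeh's implementation: the reduction here is a simple per-bin count, the transfer function is a logarithmic gray-level mapping chosen as one plausible form of tone mapping, and all names and sample points are made up.

```python
import math


def reduce_to_bins(points, width, height, x_range, y_range):
    """Reduce step: count how many points land in each pixel-sized bin."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue                      # outside the current view
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        counts[row][col] += 1
    return counts


def transfer_log(counts, max_level=255):
    """Transfer step: map aggregate counts to gray levels on a log scale."""
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[0] * len(row) for row in counts]
    scale = max_level / math.log1p(peak)
    return [[round(math.log1p(c) * scale) for c in row] for row in counts]


# A heavily overplotted spot, a light cluster, and a single stray point.
points = [(0.1, 0.1)] * 1000 + [(0.9, 0.9)] * 10 + [(0.5, 0.1)]

counts = reduce_to_bins(points, 2, 2, (0.0, 1.0), (0.0, 1.0))
pixels = transfer_log(counts)
```

With a linear mapping the single stray point would be invisible next to the bin holding 1,000 points; deciding the mapping on aggregates rather than on drawn pixels is what makes that choice possible.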
  • 53. Kiva: Abstract Rendering of Sparsity “Drawing the Dark” Akin to mapping the ocean trenches; typical viz starts at sea level & goes up.
  • 55. http://Wakari.io • Cloud-hosted Python analytics environment • Full Linux sandbox for every user • IPython notebook • Interactive JavaScript plotting • Easy to share notebooks & code with other users • Free plan: 512 MB memory, 10 GB disk • Premium plans include: More powerful machines, more memory/disk, SSH access, cluster support
  • 58. White House Big Data Initiative • $200 million for NIH, NSF, DOE, DOD, USGS • DoD investing $60 mil annually on new programs • $25 mil for XDATA
  • 59. DARPA XDATA (BAA-12-38) “A large and critical part of DoD data can be characterized as semi-structured, heterogeneous, and scientifically collected data with varied amounts of completeness and standardization. Therefore a one-size-fits-all end-to-end system is unlikely to meet all analytical goals...” “DoD collected data are particularly difficult to deal with, including missing data, missing connections between data, incomplete data, corrupted data, data of variable size and type, etc.”
  • 60. XDATA Needs “MapReduce ... results in selection bias for certain types of problems, which may prevent ... a comprehensive understanding of the data.” • Develop analytical principles which scale across data volume and distributed architecture • Minimize design-to-execution time • Leverage problem structure to create new algorithms that trade off time/space/stream complexity • Distributed sampling & estimation techniques • Distributed dimensionality reduction, matrix factorization • Determining optimal cloud configuration & resource allocation with asymmetric components (GPU, big-mem nodes, etc.)
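One small, concrete instance of an estimation technique that scales across partitions is a mergeable summary: each node reduces its own data to a few numbers, and the summaries are combined without moving the raw data. The sketch below is not taken from the XDATA program; it uses Welford's single-pass update with the standard pairwise merge formula for mean and variance, and the class name, helper, and sample partitions are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Count, mean and sum of squared deviations (m2) for one partition."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x):
        # Welford's single-pass update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partition summaries.
        if other.n == 0:
            return Summary(self.n, self.mean, self.m2)
        if self.n == 0:
            return Summary(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self):
        # Population variance of everything summarised so far.
        return self.m2 / self.n if self.n else float("nan")


def summarize(values):
    """Build the summary for one partition in a single pass."""
    s = Summary()
    for v in values:
        s.add(v)
    return s


# Each inner list stands in for data held on a separate node.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

merged = Summary()
for part in partitions:
    merged = merged.merge(summarize(part))
```

Only three numbers per partition cross the network, which is in keeping with the earlier point that data should move only when it has to.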