SlideShare ist ein Scribd-Unternehmen logo
1 von 27
www.twosigma.com
Memory Interoperability for
Analytics and Machine Learning
March 26, 2017All Rights Reserved
Wes McKinney @wesmckinn
ScaledML @ Stanford
March 25, 2017
Me
March 26, 2017
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Author of Python for Data Analysis
• Other Python projects: Ibis, Feather, statsmodels
All Rights Reserved 2
Important Legal Information
March 26, 2017
The information presented here is offered for informational purposes only and should not be used for
any other purpose (including, without limitation, the making of investment decisions). Examples
provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing
herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest;
tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments,
LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any
time.
Some of the images, logos or other material used herein may be protected by copyright and/or
trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright
and/or trademark laws. Use of such image, copyright or trademark does not imply any association with
such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
All Rights Reserved 3
This talk
4March 26, 2017
• Benefits of interoperable data and metadata
• Challenges to sharing memory between runtime environments
• Apache Arrow: Purpose and C++ architecture
• Opportunities for collaboration
• Example application: pandas 2.0
All Rights Reserved
Changing hardware landscape
March 26, 2017
• Intel has released first production 3D Xpoint SSD
• Reported 1000x faster than NAND, less expensive than RAM
• Convergence between RAM vs. shared memory / mmap performance
All Rights Reserved 5
Changing software landscape
March 26, 2017
• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
• DIY open source architectures for machine learning in production
• Streaming / batch data processing pipelines
• Data cleaning and feature engineering
• Model fitting / scoring / serving
All Rights Reserved 6
“Zero-copy” memory interfaces
March 26, 2017
• Enables computational tools to process a dataset without any additional
serialization, or transfer to a different memory space
• Can do random access on a dataset that does not fit in RAM
• Another interpretation: reading a dataset is a metadata-only conversion
All Rights Reserved 7
Challenges to zero-copy memory sharing
March 26, 2017
• Cross-language issues
• Type metadata + logical types
• Byte/bit-level memory layout
• Language-specific issues
• In-memory data structures
• Memory allocation and sharing constructs
All Rights Reserved 8
What is pandas?
March 26, 2017
• Popular in-memory data manipulation tool for Python
• Focused on tabular datasets (“data frames”)
• Sprawling codebase spanning multiple areas
• IO for many data formats
• Array manipulations / data preparation
• OLAP-style analytics
• Internals implemented using NumPy array objects
All Rights Reserved 9
NumPy
March 26, 2017
• Tensor memory model ("ndarray") for numeric data
• Strided, homogeneously-typed, byte-addressable memory
• APL-inspired semantics
• Zero-copy construction from compatible memory layouts
• Computational tools support both strided and contiguous memory access
All Rights Reserved 10
pandas: Technical debt + Architectural issues
March 26, 2017
• Tensor library like NumPy awkward fit for pandas use cases
• Multidimensionality + strided memory access complicated algorithms
• Lack of built-in missing value support
• Weak on native string, variable length, or nested types
• pandas at core a “in-memory columnar” problem, similar to analytical SQL
engines
All Rights Reserved 11
Thesis: Tensors and Tables
March 26, 2017
• 2 data structures best suited for zero-copy sharing
• Tensors: N-dimensional, homogeneously-typed arrays
• Tables: Column-oriented, heterogeneously typed
• These data structures can be defined using common memory and metadata
primitives
All Rights Reserved 12
Observations
March 26, 2017
• A Tensor is semantically a multidimensional view of a 1D block of memory
• Writing computational code targeting arbitrary tensors is much more difficult
than 1D contiguous arrays
• Tensors of non-fixed size types (e.g. strings) occur less frequently
All Rights Reserved 13
Apache Arrow
March 26, 2017
• github.com/apache/arrow
• Collaboration amongst broad set of OSS projects around language-agnostic
shared data structures
• Initial focus
• In-memory columnar tables
• Canonical metadata
• Interoperability between JVM and native code (C/C++) ecosystem
All Rights Reserved 14
High performance data interchange
March 26, 2017All Rights Reserved
Today With Arrow
Source: Apache Arrow
15
What does Apache Arrow give you?
March 26, 2017
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD /
parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based
and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, Javascript in various stages
of development
All Rights Reserved 16
Arrow in C++
March 26, 2017
• Reusable memory management and IO subsystem for native code applications
• Layered in multiple components
• Memory management
• Type metadata / schemas
• Array / Table containers
• IO interfaces
• Zero-copy IPC / messaging
All Rights Reserved 17
Arrow C++: Memory management
March 26, 2017
• arrow::Buffer
• RAII-based memory lifetime with std::shared_ptr<Buffer>
• arrow::MemoryMappedBuffer: for memory maps
• arrow::MemoryPool
• Abstract memory allocator for tracking all allocations
All Rights Reserved 18
Arrow C++: Type metadata
March 26, 2017
• arrow::DataType
• Base class for fixed size, variable size, and nested datatypes
• arrow::Field
• Type + name + additional metadata
• arrow::Schema
• Collection of fields
All Rights Reserved 19
Arrow C++: Array / Table containers
March 26, 2017
• arrow::Array
• 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
• Support for dictionary-encoded arrays
• arrow::RecordBatch
• Collection of equal-length arrays
• arrow::Column
• Logical table “column” as chunked array
• arrow::Table
• Collection of columns
All Rights Reserved 20
Arrow C++: IO interfaces
March 26, 2017
• arrow::{InputStream, OutputStream}
• arrow::RandomAccessFile
• Abstract file interface
• arrow::MemoryMappedFile
• Zero-copy reads to arrow::Buffer
• Specific implementations for OS files, HDFS, etc.
All Rights Reserved 21
Arrow C++: Messaging / IPC
March 26, 2017
• Metadata read/write using Google’s Flatbuffers library
• Encapsulated Message type
• Write record batches, read with zero-copy
• arrow::{FileWriter, FileReader}
• Random access / “batch” binary format
• arrow::{StreamWriter, StreamReader}
• Streaming binary format
All Rights Reserved 22
In development: arrow::Tensor
March 26, 2017
• Targeting interoperability with memory layouts as used in NumPy,
TensorFlow, Torch, or other standard tensor-based frameworks
• data: arrow::Buffer
• shape: dimension sizes
• strides: memory ordering
• Zero-copy reads using Arrow’s shared memory tools
• Support Tensor math libraries for C++ like xtensor
All Rights Reserved 23
Example use: Ray ML framework from Berkeley RISELab
March 26, 2017All Rights Reserved 24
Source: https://arxiv.org/abs/1703.03924
• Shared memory-based object
store
• Zero-copy tensor reads using
Arrow libraries
Example use: pandas 2.0
March 26, 2017
• In-planning rearchitecture of pandas’s internals
• libpandas — largely Python-agnostic C++11 library
• Decoupling pandas data structures from NumPy tensors
• Support analytics targeting native Arrow memory
• Multicore / parallel algorithms
• Leverage latest SIMD intrinsics
• Lazy-loading DataFrames from primary input formats
• CSV, JSON, HDF5, Apache Parquet
All Rights Reserved 25
Other examples
March 26, 2017
• Spark integration (SPARK-13534)
• Weld integration (ARROW-649)
All Rights Reserved 26
Thank you
March 26, 2017
• Building code and community around
• IO subsystems
• Metadata
• Data structures and in-memory formats
All Rights Reserved 27

Weitere ähnliche Inhalte

Was ist angesagt?

Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
Wes McKinney
 

Was ist angesagt? (20)

Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Ibis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data Experience
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 

Andere mochten auch

Andere mochten auch (20)

Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
ドローン農業最前線
ドローン農業最前線ドローン農業最前線
ドローン農業最前線
 
Functional go
Functional goFunctional go
Functional go
 
Goをカンストさせる話
Goをカンストさせる話Goをカンストさせる話
Goをカンストさせる話
 
Startup Pitch Decks
Startup Pitch DecksStartup Pitch Decks
Startup Pitch Decks
 
Angular of things: angular2 + web bluetooth
Angular of things: angular2 + web bluetoothAngular of things: angular2 + web bluetooth
Angular of things: angular2 + web bluetooth
 
マイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
 
Dolor en rn
Dolor en rnDolor en rn
Dolor en rn
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017
 
Mapping Experiences - Workshop Presentation
Mapping Experiences - Workshop PresentationMapping Experiences - Workshop Presentation
Mapping Experiences - Workshop Presentation
 
Chainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみたChainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみた
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
HoloLens x Graphics 入門
HoloLens x Graphics 入門HoloLens x Graphics 入門
HoloLens x Graphics 入門
 
Unreal engine4を使ったVRコンテンツ製作で 120%役に立つtips集+GDC情報をご紹介
Unreal engine4を使ったVRコンテンツ製作で 120%役に立つtips集+GDC情報をご紹介Unreal engine4を使ったVRコンテンツ製作で 120%役に立つtips集+GDC情報をご紹介
Unreal engine4を使ったVRコンテンツ製作で 120%役に立つtips集+GDC情報をご紹介
 
Cartilla de bienvenida a la comunidad educativa para el reinicio de clases, a...
Cartilla de bienvenida a la comunidad educativa para el reinicio de clases, a...Cartilla de bienvenida a la comunidad educativa para el reinicio de clases, a...
Cartilla de bienvenida a la comunidad educativa para el reinicio de clases, a...
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Surrounded by flowers (Michael and Inessa Garmash )
Surrounded by flowers (Michael and Inessa Garmash )Surrounded by flowers (Michael and Inessa Garmash )
Surrounded by flowers (Michael and Inessa Garmash )
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
フォントの選び方・使い方
フォントの選び方・使い方フォントの選び方・使い方
フォントの選び方・使い方
 
Krenn algorithmic democracy_ab_jan_2016
Krenn algorithmic democracy_ab_jan_2016Krenn algorithmic democracy_ab_jan_2016
Krenn algorithmic democracy_ab_jan_2016
 

Ähnlich wie Memory Interoperability in Analytics and Machine Learning

Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
MLconf
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 

Ähnlich wie Memory Interoperability in Analytics and Machine Learning (20)

Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
 
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
Threat hunting using notebook technologies
Threat hunting using notebook technologiesThreat hunting using notebook technologies
Threat hunting using notebook technologies
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
 
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهمعرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
 
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
 

Mehr von Wes McKinney

Mehr von Wes McKinney (10)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Memory Interoperability in Analytics and Machine Learning

  • 1. www.twosigma.com Memory Interoperability for Analytics and Machine Learning March 26, 2017All Rights Reserved Wes McKinney @wesmckinn ScaledML @ Stanford March 25, 2017
  • 2. Me March 26, 2017 • Currently: Software Architect at Two Sigma Investments • Creator of Python pandas project • PMC member for Apache Arrow and Apache Parquet • Author of Python for Data Analysis • Other Python projects: Ibis, Feather, statsmodels All Rights Reserved 2
  • 3. Important Legal Information March 26, 2017 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved All Rights Reserved 3
  • 4. This talk 4March 26, 2017 • Benefits of interoperable data and metadata • Challenges to sharing memory between runtime environments • Apache Arrow: Purpose and C++ architecture • Opportunities for collaboration • Example application: pandas 2.0 All Rights Reserved
  • 5. Changing hardware landscape March 26, 2017 • Intel has released first production 3D Xpoint SSD • Reported 1000x faster than NAND, less expensive than RAM • Convergence between RAM vs. shared memory / mmap performance All Rights Reserved 5
  • 6. Changing software landscape March 26, 2017 • Next-gen ML / AI frameworks (TensorFlow, Torch, etc.) • DIY open source architectures for machine learning in production • Streaming / batch data processing pipelines • Data cleaning and feature engineering • Model fitting / scoring / serving All Rights Reserved 6
  • 7. “Zero-copy” memory interfaces March 26, 2017 • Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space • Can do random access on a dataset that does not fit in RAM • Another interpretation: reading a dataset is a metadata-only conversion All Rights Reserved 7
  • 8. Challenges to zero-copy memory sharing March 26, 2017 • Cross-language issues • Type metadata + logical types • Byte/bit-level memory layout • Language-specific issues • In-memory data structures • Memory allocation and sharing constructs All Rights Reserved 8
  • 9. What is pandas? March 26, 2017 • Popular in-memory data manipulation tool for Python • Focused on tabular datasets (“data frames”) • Sprawling codebase spanning multiple areas • IO for many data formats • Array manipulations / data preparation • OLAP-style analytics • Internals implemented using NumPy array objects All Rights Reserved 9
  • 10. NumPy March 26, 2017 • Tensor memory model ("ndarray") for numeric data • Strided, homogeneously-typed, byte-addressable memory • APL-inspired semantics • Zero-copy construction from compatible memory layouts • Computational tools support both strided and contiguous memory access All Rights Reserved 10
  • 11. pandas: Technical debt + Architectural issues March 26, 2017 • Tensor library like NumPy awkward fit for pandas use cases • Multidimensionality + strided memory access complicated algorithms • Lack of built-in missing value support • Weak on native string, variable length, or nested types • pandas at core a “in-memory columnar” problem, similar to analytical SQL engines All Rights Reserved 11
  • 12. Thesis: Tensors and Tables March 26, 2017 • 2 data structures best suited for zero-copy sharing • Tensors: N-dimensional, homogeneously-typed arrays • Tables: Column-oriented, heterogeneously typed • These data structures can be defined using common memory and metadata primitives All Rights Reserved 12
  • 13. Observations March 26, 2017 • A Tensor is semantically a multidimensional view of a 1D block of memory • Writing computational code targeting arbitrary tensors is much more difficult than 1D contiguous arrays • Tensors of non-fixed size types (e.g. strings) occur less frequently All Rights Reserved 13
  • 14. Apache Arrow March 26, 2017 • github.com/apache/arrow • Collaboration amongst broad set of OSS projects around language-agnostic shared data structures • Initial focus • In-memory columnar tables • Canonical metadata • Interoperability between JVM and native code (C/C++) ecosystem All Rights Reserved 14
  • 15. High performance data interchange March 26, 2017All Rights Reserved Today With Arrow Source: Apache Arrow 15
  • 16. What does Apache Arrow give you? March 26, 2017 • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, Javascript in various stages of development All Rights Reserved 16
  • 17. Arrow in C++ March 26, 2017 • Reusable memory management and IO subsystem for native code applications • Layered in multiple components • Memory management • Type metadata / schemas • Array / Table containers • IO interfaces • Zero-copy IPC / messaging All Rights Reserved 17
  • 18. Arrow C++: Memory management March 26, 2017 • arrow::Buffer • RAII-based memory lifetime with std::shared_ptr<Buffer> • arrow::MemoryMappedBuffer: for memory maps • arrow::MemoryPool • Abstract memory allocator for tracking all allocations All Rights Reserved 18
  • 19. Arrow C++: Type metadata March 26, 2017 • arrow::DataType • Base class for fixed size, variable size, and nested datatypes • arrow::Field • Type + name + additional metadata • arrow::Schema • Collection of fields All Rights Reserved 19
  • 20. Arrow C++: Array / Table containers March 26, 2017 • arrow::Array • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc. • Support for dictionary-encoded arrays • arrow::RecordBatch • Collection of equal-length arrays • arrow::Column • Logical table “column” as chunked array • arrow::Table • Collection of columns All Rights Reserved 20
  • 21. Arrow C++: IO interfaces March 26, 2017 • arrow::{InputStream, OutputStream} • arrow::RandomAccessFile • Abstract file interface • arrow::MemoryMappedFile • Zero-copy reads to arrow::Buffer • Specific implementations for OS files, HDFS, etc. All Rights Reserved 21
  • 22. Arrow C++: Messaging / IPC March 26, 2017 • Metadata read/write using Google’s Flatbuffers library • Encapsulated Message type • Write record batches, read with zero-copy • arrow::{FileWriter, FileReader} • Random access / “batch” binary format • arrow::{StreamWriter, StreamReader} • Streaming binary format All Rights Reserved 22
  • 23. In development: arrow::Tensor March 26, 2017 • Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks • data: arrow::Buffer • shape: dimension sizes • strides: memory ordering • Zero-copy reads using Arrow’s shared memory tools • Support Tensor math libraries for C++ like xtensor All Rights Reserved 23
  • 24. Example use: Ray ML framework from Berkeley RISELab March 26, 2017All Rights Reserved 24 Source: https://arxiv.org/abs/1703.03924 • Shared memory-based object store • Zero-copy tensor reads using Arrow libraries
  • 25. Example use: pandas 2.0 March 26, 2017 • In-planning rearchitecture of pandas’s internals • libpandas — largely Python-agnostic C++11 library • Decoupling pandas data structures from NumPy tensors • Support analytics targeting native Arrow memory • Multicore / parallel algorithms • Leverage latest SIMD intrinsics • Lazy-loading DataFrames from primary input formats • CSV, JSON, HDF5, Apache Parquet All Rights Reserved 25
  • 26. Other examples March 26, 2017 • Spark integration (SPARK-13534) • Weld integration (ARROW-649) All Rights Reserved 26
  • 27. Thank you March 26, 2017 • Building code and community around • IO subsystems • Metadata • Data structures and in-memory formats All Rights Reserved 27