How Apache Arrow and Apache Parquet help connect the Python data ecosystem to other landscapes, such as the Java/Scala-based Big Data ecosystem.
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet
1.
Connecting PyData to other Big Data
Landscapes using Arrow and Parquet
Uwe L. Korn, PyCon.DE 2017
2.
• Data Scientist & Architect at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
About me
xhochy
uwe@apache.org
3.
Python is a good companion for a Data Scientist
…but there are other ecosystems out there.
4. • Large set of files on distributed filesystem
• Non-uniform schema
• Execute query
• Only a subset is interesting
Why do I care?
not in Python
5.
All are amazing but…
How to get my data out of Python and back in again?
Use Parquet!
…but there was no fast Parquet access 2 years ago.
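A minimal sketch of that round-trip, assuming pyarrow with its Parquet support is installed (the file name is illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"product": ["a", "b"], "price": [1.0, 2.5]})
    # Out of Python: write the DataFrame as a Parquet file ...
    pq.write_table(pa.Table.from_pandas(df), "example.parquet")
    # ... and back in again: any Parquet reader (Spark, Hive, ...) can
    # consume the same file.
    df_roundtrip = pq.read_table("example.parquet").to_pandas()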
6.
A general problem
• Great interoperability inside ecosystems
• Often based on a common backend (e.g. NumPy)
• Poor integration with other systems
• CSV is your only resort
• "We need to talk!"
• Memory copy runs at about 10 GiB/s (so even a pure copy of 1 GiB costs ~0.1 s)
• (De-)serialisation comes on top
9.
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. 2015: top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
10.
Why use Parquet?
1. Columnar format
—> vectorized operations
2. Efficient encodings and compressions
—> small size without the need for a fat CPU
3. Predicate push-down
—> bring computation to the I/O layer
4. Language independent format
—> libs in Java / Scala / C++ / Python /…
11. Compression
1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than
compression alone, at lower CPU cost
4. LZO, Snappy, GZIP, Brotli
—> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
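A small sketch of choosing a codec when writing Parquet from Python (assuming pyarrow; table contents and file names are illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(pd.DataFrame({"amount": [9.5, 3.2, 1.1]}))
    # Snappy: fast to (de)compress, reasonable size -- a good default.
    pq.write_table(table, "data_snappy.parquet", compression="snappy")
    # GZIP: smaller files at the cost of more CPU time.
    pq.write_table(table, "data_gzip.parquet", compression="gzip")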
12. Predicate pushdown
1. Only load used data
• skip columns that are not needed
• skip (chunks of) rows that are not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
Which products are sold in $?
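A hedged sketch of predicate push-down from Python: in recent pyarrow versions, columns= skips unneeded columns and filters= prunes row groups/partitions using their statistics (column and file names are illustrative):

    import pyarrow.parquet as pq

    # Read only the 'product' column, and skip any chunk of rows whose
    # statistics prove it contains no rows with currency == 'USD'.
    table = pq.read_table(
        "sales.parquet",
        columns=["product"],
        filters=[("currency", "=", "USD")],
    )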
19.
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Exchange data without conversion between Python, C++, C (GLib),
Ruby, Lua, R, JavaScript and the JVM
• This brought Parquet to Pandas without any Python code in
parquet-cpp
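A minimal illustration of the shared in-memory layout via the Python bindings (assuming pyarrow):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3]})
    # One columnar representation; other Arrow implementations (JVM,
    # C++, ...) can consume the same buffers without conversion.
    table = pa.Table.from_pandas(df)
    df_again = table.to_pandas()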
20.
Dissecting Arrow C++
• General zero-copy memory management
• jemalloc as the base allocator
• Columnar memory format & metadata
• Schema & DataType
• Columns & Table
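The same building blocks are exposed through the Python bindings; a small sketch (values are illustrative):

    import pyarrow as pa

    # Schema & DataType
    schema = pa.schema([("product", pa.string()), ("price", pa.float64())])
    # Columns & Table
    table = pa.Table.from_arrays(
        [pa.array(["a", "b"]), pa.array([1.0, 2.5])], schema=schema
    )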
21.
Dissecting Arrow C++
• Structured data IPC (inter-process communication)
• used in Spark for JVM<->Python
• future extensions include: gRPC backend, shared memory
communication, …
• Columnar in-memory analytics
• be the backbone of Pandas 2.0
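A sketch of the structured-data IPC using today's pyarrow API (the 2017 API differed slightly; the batch contents are illustrative):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    # The resulting bytes can be handed to any Arrow implementation
    # (e.g. on the JVM) and read back without re-serialisation.
    reader = pa.ipc.open_stream(sink.getvalue())
    roundtrip = reader.read_all()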
22. Converting 1 million longs
from Spark to PySpark with Arrow: 0.05s
https://github.com/apache/spark/pull/15821#issuecomment-282175163
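A sketch of the Spark side as it landed in Spark 2.3 (the config key was later renamed; the DataFrame is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Enable Arrow-based columnar transfer for toPandas()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df = spark.range(1000 * 1000)  # 1 million longs
    pdf = df.toPandas()            # shipped as Arrow record batches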
23.
Apache Arrow – Real life improvement
Real life example!
Retrieve a dataset from an MPP database and analyze it in Pandas
1. Run a query in the DB
2. Pass it in columnar form to the DB driver
3. The ODBC layer transforms it into row-wise form
4. Pandas makes it columnar again
Ugly real-life solution: export as CSV, bypass ODBC
24.
Better solution: Turbodbc with Arrow support
1. Retrieve columnar results
2. Pass them in a columnar fashion to Pandas
More systems in the future (without the ODBC overhead)
See also Michael’s talk tomorrow: Turbodbc: Turbocharged database
access for data scientists
Apache Arrow – Real life improvement
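A hedged sketch using turbodbc's Arrow result support (requires turbodbc built with Arrow; the DSN and query are illustrative):

    from turbodbc import connect

    connection = connect(dsn="MyDB")
    cursor = connection.cursor()
    cursor.execute("SELECT product, price FROM sales")
    # Columnar result straight into Arrow, no row-wise detour
    table = cursor.fetchallarrow()
    df = table.to_pandas()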
28. Blue Yonder
Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA