SlideShare ist ein Scribd-Unternehmen logo
1 von 43
Data Processing at the Speed of
100 Gbps@Apache Crail (Incubating)
http://crail.incubator.apache.org/
Animesh Trivedi
IBM Research, Zurich
The Performance Landscape
Trend 1: The I/O performance is increasing dramatically
The Performance Landscape
Trend 2: The I/O diversity is increasing
dramatically
The Key Challenge
Given the current hardware performance and diversity
landscape, how can we orchestrate data movement at the
speed of hardware
in other words:
" Can we feed modern data processing stacks at 100+ Gbps
data speeds with 1-10 usec access latencies"
Apache Crail (Incubating)
 A distributed data “store” platform designed from scratch to “share” data at the
speed of hardware
 “store”: in DRAM/Flash/3DXP, with multiple front-end APIs to storage
 “share”: intermediate data from intra-job (shuffle, broadcast) & inter-job
 Targets to speed-up workloads by accelerating data sharing
 An effort to unify isolated and incompatible efforts to integrate high-performance
device in big data frameworks
 Written in Java8 with multiple client-language support
 Started at IBM Research Lab, Zurich
 Apache Incubator project since November, 2017
System Overview – Crail Store
client
Datanode Datanode Datanode
 logically centralized
 keeps track of metadata
 access metadata from the namenode
 access data from datanodes
 donate storage and network resources to Crail
 mostly dumb/stateless servers
Control RPCs
Data operations
Namenode
Apache Crail: Datanode
 DataNode is responsible for exposing
(storage + networking) resources to the system
 DRAM + RDMA [1], TCP/Sockets
 Flash + NVMeF [2]
 Local DRAM + memcpy
 Local flash + SPDK
 Accepts client connections and serves data
 Periodic heartbeat pings with the namenode
Datanode
RDMA, TCP, DPDK,
...
Apache Crail: Namenode
 Responsible for keeping track of storage resources in the
cluster
 Done by managing them in blocks (e.g., 1MB block size)
 Maintains a hierarchical node tree
 Directory, data files, stream files, multifiles, tables, KV
 Clients connect to the namenode to create new nodes, lookup,
read, and write to nodes
 Allocation policy (different media and node types)
 A node can contains some blocks from DRAM and flash
Apache Crail: Clients
 Clients read and write data
 Standalone or a part of a compute framework
 Single writer, without holes files
 Distributed clients often implicitly index and synchronize on the file/directory
name
 Multiple-client side storage abstractions/interfaces
 Simple hierarchical file system
 Key-Value (KV) store
 HDFS
 Streaming (WiP)
So where does
the performance come from?
Performance Principles – I
NIC
Ethernet
IP/TCP
sockets
OSkernel
100 Gbps
1 usec latency
Performance Principles – I
NIC
Ethernet
IP/TCP
sockets
container
OSkernel
HDFS
Alluxio
Spark
TeraSort
100 Gbps
1 usec latency
JV
M
?
Performance Principles – I
NIC
Ethernet
IP/TCP
sockets
container
OSkernel
JVM
HDFS
Alluxio
Spark
TeraSort
100 Gbps
1 usec latency
1. Use high-performance user-level I/O
RDMA[1], NVMeF[2], SPDK/DPDK in Java
- one-sided RDMA read/write operations
Performance Principles – II
network
object
Performance Principles – II
buffering
copies
allocation
(de)serialization
network
object
Performance Principles – II
buffering
copies
allocation
(de)serialization
network
object
2. Careful software design for the μsec-Era
software time budgeting, pre-allocation
of buffers, buffer caching, and reusing,
ownership, etc.
Performance Principles – III
client
Datanode Datanode Datanode Datanode
Performance Principles – III
client
Datanode Datanode Datanode Datanode
client
Performance Principles – III
client
Datanode Datanode Datanode Datanode
client
3. Careful data orchestration in the cluster
randomization, async requests, compute-I/O overlap,
smart buffering, data allocation policies, etc.
Performance Principle - Recap
1. Use high-performance user-level I/O
3. Careful data orchestration in the cluster
2. Careful software design for the μsec-Era
But what If I don’t have … RDMA
1. Use high-performance user-level I/O
3. Careful data orchestration in the cluster
2. Careful software design for the μsec-Era
The Full Apache Crail (Incubating)
Stack
Apache Crail Store
NVMf DPDKRDMATCP
Hierarchical Namespace API
Streaming KV HDFS
Fast Network (100Gb/s RoCE, etc.)
Data Processing Framework
(Spark, TensorFlow, λ Compute, …)
3DXP GPUNVMeDRAM
…
10th of GB/s
10 µsec
FS
Use in Apache Spark – Broadcast
Crail
Stor
e
spark
driver
/job_id/bc/1/
spark
executor
0
spark
executor
1
spark
executor
2
RDMA/NVMeF
write
RDMA/NVMeF
read
Use in Apache Spark - Shuffle
Spark Shuffle plugin [4] that write and reads data in
Crail
Crail
Stor
e
spark
executor0
spark
executor1
spark
executor2
/job_id/shuffle/0/
/job_id/shuffle/1/
/job_id/shuffle/2/
spark
executor0
spark
executor1
spark
executor2
Performance Numbers
 Baseline performance for Apache Crail Store [6]:
 Latency, bandwidth, and IOPS numbers in a distributed
setting in the JVM
 Crail + Spark integration, available at [4]
 Micro-operations: Broadcast and GroupBy
 Workloads: TeraSort and SQL JOIN
 Mixed DRAM and Flash setting
 A mix of x86 and POWER8 machines, 100 GbE
network, 256GB DRAM DDR3, and NVMe flash
Crail Store – DRAM
”File Read Bandwidth”
Crail delivers full hardware bandwidth from DRAM
Crail Store – DRAM
”File Read Latency”
Crail delivers ultra-low file/data access latencies in JVM
Crail Store – NVMeF
”File Read Bandwidth”
Crail delivers full hardware bandwidth from remote NVMe flash
smart buffering and I/O
scheduling
Crail Store – NVMeF
”File Read Latency”
Crail delivers native NVMe access performance in JVM
Crail Store – Metadata IOPS
”getFile operation”
Crail offers (almost) linear scalability with Namenode ops
8-9M IOPS
How does data processing-level performance
look like?
Apache Crail Store (Incubating)
Apache Spark – TeraSort & SQL JOIN
Apache Spark
Shuffle Module
Apache Spark
Broadcast Module
Crail Workload - Spark - Broadcast
Crail offers 10-100x performance gains for Spark Broadcasts
Crail Workload - Spark - GroupBy
Crail offers 2-5x performance gains for Spark Shuffle
2x 5x
Crail Workload - Spark - TeraSort
Crail offers 6x performance gains for Spark TeraSort [7]
Crail Workload - Spark - TeraSort
Crail Workload - Spark - EquiJoin
Crail offers 2x performance gains for SQL JOIN
Crail Store – Flash Disaggregation
Crail offers the possibility for storage disaggregation
Current Status and Future Plans
 Apache Incubator project since November, 2017
 First source release (tag: db64fd), a few weeks ago
 Plenty of opportunities
 Multiple languages support (C++ (WiP), Python, Rust,
_your_fav_language_)
 Multiple frameworks integration (Flink, Hadoop, λ compute,
TensorFlow, SnapML[5], _your_fav_framework_here_)
 Multiple datanode (network and storage) integration
 Automated testing and packaging framework
 Deploying in the cloud/containerization
 JVM/runtime optimizations
 And all the fun stuff that you know and love about Apache projects !
Thanks to …
Patrick Stuedi, Jonas Pfefferle, Michael Kaufmann,
Adrian Schuepbach, Bernard Metzler
Thanks you !
 Website: http://crail.incubator.apache.org/
 Blog: http://crail.incubator.apache.org/blog/
 Mailings list: dev@crail.incubator.apache.org
 JIRA: https://issues.apache.org/jira/browse/CRAIL
 Slack: https://the-asf.slack.com/messages/C8VDLDWMV
See you all at the mailing list :)
References
[1] DiSNI: Direct Storage and Networking Interface, https://github.com/zrlio/disni
[2] jNVMf: A NVMe over Fabrics library for Java, https://github.com/zrlio/jNVMf
[3] “Crail: A High-Performance I/O Architecture for Distributed Data Processing”, in the IEEE Bulletin
of the
Technical Committee on Data Engineering, Special Issue on Distributed Data
Management with RDMA,
Volume 40, pages 40-52, March, 2017. http://sites.computer.org/debull/A17mar/p38.pdf
[4] Crail I/O accleration plugins for Apache Spark, https://github.com/zrlio/crail-spark-io
[5] Snap machine learning (SnapML), https://www.zurich.ibm.com/snapml/
[6] Apache Crail (Incubating) performance blogs
 Part I: DRAM, http://crail.incubator.apache.org/blog/2017/08/crail-memory.html
 Part II: NVMeF, http://crail.incubator.apache.org/blog/2017/08/crail-nvme-fabrics-v1.html
 Part III: Metadata, http://crail.incubator.apache.org/blog/2017/11/crail-metadata.html
[7] Sorting on a 100Gbit/s Cluster using Spark/Crail,
http://crail.incubator.apache.org/blog/2017/01/sorting.html
Notice
IBM is a trademark of International Business Machines Corporation,
registered in many jurisdictions worldwide. Intel and Intel Xeon are
trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries. Linux is a
registered trademark of Linus Torvalds in the United States, other
countries, or both. Java and all Java- based trade-marks and logos
are trademarks or registered trademarks of Oracle and/or its affiliates
Other products and service names might be trademarks of IBM or
other companies.
Backup

Weitere ähnliche Inhalte

Was ist angesagt?

Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...DataWorks Summit
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Saving the elephant—now, not later
Saving the elephant—now, not laterSaving the elephant—now, not later
Saving the elephant—now, not laterDataWorks Summit
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentDataWorks Summit/Hadoop Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive DataWorks Summit
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryDataWorks Summit
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesDataWorks Summit
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 
Big data at United Airlines
Big data at United AirlinesBig data at United Airlines
Big data at United AirlinesDataWorks Summit
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...DataWorks Summit
 

Was ist angesagt? (20)

Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Saving the elephant—now, not later
Saving the elephant—now, not laterSaving the elephant—now, not later
Saving the elephant—now, not later
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
Big data at United Airlines
Big data at United AirlinesBig data at United Airlines
Big data at United Airlines
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
How to Achieve a Self-Service and Secure Multitenant Data Lake in a Large Com...
 

Ähnlich wie Data processing at the speed of 100 Gbps@Apache Crail (Incubating)

Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Community
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Introducing Cloud Development with Project Shipped and Mantl: a deep dive
Introducing Cloud Development with Project Shipped and Mantl: a deep diveIntroducing Cloud Development with Project Shipped and Mantl: a deep dive
Introducing Cloud Development with Project Shipped and Mantl: a deep diveCisco DevNet
 
Introducing Cloud Development with Mantl
Introducing Cloud Development with MantlIntroducing Cloud Development with Mantl
Introducing Cloud Development with MantlCisco DevNet
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Scalable POSIX File Systems in the Cloud
Scalable POSIX File Systems in the CloudScalable POSIX File Systems in the Cloud
Scalable POSIX File Systems in the CloudRed_Hat_Storage
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency DatabaseScyllaDB
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesBigdata Meetup Kochi
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewMarcel Hergaarden
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 

Ähnlich wie Data processing at the speed of 100 Gbps@Apache Crail (Incubating) (20)

Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Introducing Cloud Development with Project Shipped and Mantl: a deep dive
Introducing Cloud Development with Project Shipped and Mantl: a deep diveIntroducing Cloud Development with Project Shipped and Mantl: a deep dive
Introducing Cloud Development with Project Shipped and Mantl: a deep dive
 
Introducing Cloud Development with Mantl
Introducing Cloud Development with MantlIntroducing Cloud Development with Mantl
Introducing Cloud Development with Mantl
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Scalable POSIX File Systems in the Cloud
Scalable POSIX File Systems in the CloudScalable POSIX File Systems in the Cloud
Scalable POSIX File Systems in the Cloud
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologies
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) Overview
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Kürzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Data processing at the speed of 100 Gbps@Apache Crail (Incubating)

  • 1. Data Processing at the Speed of 100 Gbps@Apache Crail (Incubating) http://crail.incubator.apache.org/ Animesh Trivedi IBM Research, Zurich
  • 2. The Performance Landscape Trend 1: The I/O performance is increasing dramatically
  • 3. The Performance Landscape Trend 2: The I/O diversity is increasing dramatically
  • 4. The Key Challenge Given the current hardware performance and diversity landscape, how can we orchestrate data movement at the speed of hardware in other words: " Can we feed modern data processing stacks at 100+ Gbps data speeds with 1-10 usec access latencies"
  • 5. Apache Crail (Incubating)  A distributed data “store” platform designed from scratch to “share” data at the speed of hardware  “store”: in DRAM/Flash/3DXP, with multiple front-end APIs to storage  “share”: intermediate data from intra-job (shuffle, broadcast) & inter-job  Targets to speed-up workloads by accelerating data sharing  An effort to unify isolated and incompatible efforts to integrate high-performance device in big data frameworks  Written in Java8 with multiple client-language support  Started at IBM Research Lab, Zurich  Apache Incubator project since November, 2017
  • 6. System Overview – Crail Store client Datanode Datanode Datanode  logically centralized  keeps track of metadata  access metadata from the namenode  access data from datanodes  donate storage and network resources to Crail  mostly dumb/stateless servers Control RPCs Data operations Namenode
  • 7. Apache Crail: Datanode  DataNode is responsible for exposing (storage + networking) resources to the system  DRAM + RDMA [1], TCP/Sockets  Flash + NVMeF [2]  Local DRAM + memcpy  Local flash + SPDK  Accepts client connections and serves data  Periodic heartbeat pings with the namenode Datanode RDMA, TCP, DPDK, ...
  • 8. Apache Crail: Namenode  Responsible for keeping track of storage resources in the cluster  Done by managing them in blocks (e.g., 1MB block size)  Maintains a hierarchical node tree  Directory, data files, stream files, multifiles, tables, KV  Clients connect to the namenode to create new nodes, lookup, read, and write to nodes  Allocation policy (different media and node types)  A node can contains some blocks from DRAM and flash
  • 9. Apache Crail: Clients  Clients read and write data  Standalone or a part of a compute framework  Single writer, without holes files  Distributed clients often implicitly index and synchronize on the file/directory name  Multiple-client side storage abstractions/interfaces  Simple hierarchical file system  Key-Value (KV) store  HDFS  Streaming (WiP)
  • 10. So where does the performance come from?
  • 11. Performance Principles – I NIC Ethernet IP/TCP sockets OSkernel 100 Gbps 1 usec latency
  • 12. Performance Principles – I NIC Ethernet IP/TCP sockets container OSkernel HDFS Alluxio Spark TeraSort 100 Gbps 1 usec latency JV M ?
  • 13. Performance Principles – I NIC Ethernet IP/TCP sockets container OSkernel JVM HDFS Alluxio Spark TeraSort 100 Gbps 1 usec latency 1. Use high-performance user-level I/O RDMA[1], NVMeF[2], SPDK/DPDK in Java - one-sided RDMA read/write operations
  • 14. Performance Principles – II network object
  • 15. Performance Principles – II buffering copies allocation (de)serialization network object
  • 16. Performance Principles – II buffering copies allocation (de)serialization network object 2. Careful software design for the μsec-Era software time budgeting, pre-allocation of buffers, buffer caching, and reusing, ownership, etc.
  • 17. Performance Principles – III client Datanode Datanode Datanode Datanode
  • 18. Performance Principles – III client Datanode Datanode Datanode Datanode client
  • 19. Performance Principles – III client Datanode Datanode Datanode Datanode client 3. Careful data orchestration in the cluster randomization, async requests, compute-I/O overlap, smart buffering, data allocation policies, etc.
  • 20. Performance Principle - Recap 1. Use high-performance user-level I/O 3. Careful data orchestration in the cluster 2. Careful software design for the μsec-Era
  • 21. But what If I don’t have … RDMA 1. Use high-performance user-level I/O 3. Careful data orchestration in the cluster 2. Careful software design for the μsec-Era
  • 22. The Full Apache Crail (Incubating) Stack Apache Crail Store NVMf DPDKRDMATCP Hierarchical Namespace API Streaming KV HDFS Fast Network (100Gb/s RoCE, etc.) Data Processing Framework (Spark, TensorFlow, λ Compute, …) 3DXP GPUNVMeDRAM … 10th of GB/s 10 µsec FS
  • 23. Use in Apache Spark – Broadcast Crail Stor e spark driver /job_id/bc/1/ spark executor 0 spark executor 1 spark executor 2 RDMA/NVMeF write RDMA/NVMeF read
  • 24. Use in Apache Spark - Shuffle Spark Shuffle plugin [4] that write and reads data in Crail Crail Stor e spark executor0 spark executor1 spark executor2 /job_id/shuffle/0/ /job_id/shuffle/1/ /job_id/shuffle/2/ spark executor0 spark executor1 spark executor2
  • 25. Performance Numbers  Baseline performance for Apache Crail Store [6]:  Latency, bandwidth, and IOPS numbers in a distributed setting in the JVM  Crail + Spark integration, available at [4]  Micro-operations: Broadcast and GroupBy  Workloads: TeraSort and SQL JOIN  Mixed DRAM and Flash setting  A mix of x86 and POWER8 machines, 100 GbE network, 256GB DRAM DDR3, and NVMe flash
  • 26. Crail Store – DRAM ”File Read Bandwidth” Crail delivers full hardware bandwidth from DRAM
  • 27. Crail Store – DRAM ”File Read Latency” Crail delivers ultra-low file/data access latencies in JVM
  • 28. Crail Store – NVMeF ”File Read Bandwidth” Crail delivers full hardware bandwidth from remote NVMe flash smart buffering and I/O scheduling
  • 29. Crail Store – NVMeF ”File Read Latency” Crail delivers native NVMe access performance in JVM
  • 30. Crail Store – Metadata IOPS ”getFile operation” Crail offers (almost) linear scalability with Namenode ops 8-9M IOPS
  • 31. How does data processing-level performance look like? Apache Crail Store (Incubating) Apache Spark – TeraSort & SQL JOIN Apache Spark Shuffle Module Apache Spark Broadcast Module
  • 32. Crail Workload - Spark - Broadcast Crail offers 10-100x performance gains for Spark Broadcasts
  • 33. Crail Workload - Spark - GroupBy Crail offers 2-5x performance gains for Spark Shuffle 2x 5x
  • 34. Crail Workload - Spark - TeraSort Crail offers 6x performance gains for Spark TeraSort [7]
  • 35. Crail Workload - Spark - TeraSort
  • 36. Crail Workload - Spark - EquiJoin Crail offers 2x performance gains for SQL JOIN
  • 37. Crail Store – Flash Disaggregation Crail offers the possibility for storage disaggregation
  • 38. Current Status and Future Plans  Apache Incubator project since November, 2017  First source release (tag: db64fd), a few weeks ago  Plenty of opportunities  Multiple languages support (C++ (WiP), Python, Rust, _your_fav_language_)  Multiple frameworks integration (Flink, Hadoop, λ compute, TensorFlow, SnapML[5], _your_fav_framework_here_)  Multiple datanode (network and storage) integration  Automated testing and packaging framework  Deploying in the cloud/containerization  JVM/runtime optimizations  And all the fun stuff that you know and love about Apache projects !
  • 39. Thanks to … Patrick Stuedi, Jonas Pfefferle, Michael Kaufmann, Adrian Schuepbach, Bernard Metzler
  • 40. Thanks you !  Website: http://crail.incubator.apache.org/  Blog: http://crail.incubator.apache.org/blog/  Mailings list: dev@crail.incubator.apache.org  JIRA: https://issues.apache.org/jira/browse/CRAIL  Slack: https://the-asf.slack.com/messages/C8VDLDWMV See you all at the mailing list :)
  • 41. References [1] DiSNI: Direct Storage and Networking Interface, https://github.com/zrlio/disni [2] jNVMf: A NVMe over Fabrics library for Java, https://github.com/zrlio/jNVMf [3] “Crail: A High-Performance I/O Architecture for Distributed Data Processing”, in the IEEE Bulletin of the Technical Committee on Data Engineering, Special Issue on Distributed Data Management with RDMA, Volume 40, pages 40-52, March, 2017. http://sites.computer.org/debull/A17mar/p38.pdf [4] Crail I/O accleration plugins for Apache Spark, https://github.com/zrlio/crail-spark-io [5] Snap machine learning (SnapML), https://www.zurich.ibm.com/snapml/ [6] Apache Crail (Incubating) performance blogs  Part I: DRAM, http://crail.incubator.apache.org/blog/2017/08/crail-memory.html  Part II: NVMeF, http://crail.incubator.apache.org/blog/2017/08/crail-nvme-fabrics-v1.html  Part III: Metadata, http://crail.incubator.apache.org/blog/2017/11/crail-metadata.html [7] Sorting on a 100Gbit/s Cluster using Spark/Crail, http://crail.incubator.apache.org/blog/2017/01/sorting.html
  • 42. Notice IBM is a trademark of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java- based trade-marks and logos are trademarks or registered trademarks of Oracle and/or its affiliates Other products and service names might be trademarks of IBM or other companies.