SlideShare ist ein Scribd-Unternehmen logo
1 von 6
Downloaden Sie, um offline zu lesen
1
Summary
Apache Spark is gaining adoption both for its features and its
performance. Apache Spark enables the organizations to realize the
real-time analytics and gain faster insights in today’s competitive
business environment. It provides significant performance
advantages to analyze massive datasets with its in-memory
processing engine and support for running standard SQL natively on
Apache Spark platform. This configuration brief describes the
performance and scalability benefits of deploying Apache Spark
cluster on the Lenovo server platform using Intel NVMe drives and
Mellanox network interface cards. Some of the key advantages of
this solution stack are listed below:
 Serves as a reference for optimized Apache Spark Cluster setup
 Illustrates performance benefits using Hadoop-DS
1
benchmark
 Perfectly suited for high data ingest rates
 Delivers extreme performance in a small data center footprint
WWW.LENOVO.COM
Industry First 100TB Spark SQL Benchmark
Lenovo Big Data Configuration for
Apache Spark
CONFIGURATION BRIEF
Big Data
HIGHLIGHTS
 Unleash extreme Apache Spark cluster performance
 Faster time to analytics with sub-millisecond I/O latency
 Leverage scalability from 1 to 100 TB dataset and beyond
 Optimized solution driven by reliable infrastructure stack
 Standard SQL support with Apache Spark 2.1
 Targeted workloads include SQL & Analytics
2
Lenovo Big Data Configuration for Apache
Spark
The system is architected with high performance servers featuring
high performance storage and networking components. This
architecture offers a balanced configuration to manage the compute,
storage and networking demands of enterprise scale Apache Spark
clusters. . The aggregate capability is provided by 1008 cores, 42TB
DRAM, 448TB Flash storage and a 100Gbps Ethernet network. The
cluster is running Linux, IBM Open Platform with Apache Hadoop and
Apache Spark 2.1 (pre-release).
As a challenging SQL workload we selected the Hadoop-DS workload
(a TPC-DS derivative) at a scale of 100TB on a cluster of 28 Lenovo
x3650 data nodes along with a pair of Lenovo x3550 management
nodes to provide HDFS services. All servers were connected with a
100GbE Mellanox network. Each data node had a pair of Intel E5-
2697 v4 (18-core) processors, 1.5TB of DRAM and 8 x 2.0TB Intel
NVMe SSD drives.
The goal is to determine how much the very latest CPUs, NVMe solid
state storage and high bandwidth networking would improve Data
Center Space and energy efficiency while handling a demanding SQL
workload with multiple concurrent users. To ensure the latest Spark
SQL was used, we took a snapshot of the Apache Spark 2.1 trunk and
added select Spark SQL enhancements from the IBM Spark
Technology Center.
Performance Overview
Early results are extremely encouraging. With the major uplift in
Apache Spark 2.0, Spark SQL was able to parse and obtain plans for
all 99 SQL queries using TPC-DS compliant syntax. This is an
achievement no other company has demonstrated with open source
SQL on Hadoop. 91 of 99 queries were successfully run against 100TB
of data in a single user test. During the execution of this test, the
cluster achieved peak CPU utilization of 100% as well as I/O
throughput of 358 GB/s.
Intel® Solid State Drive DC P3600 Series AIC
(Add-in Card) and 2.5” form factors both
include:
 Consistently high IOPS and throughput
 Sustained low latency
 Variable Sector Size and End-to-End
data path protection
 Power loss protection capacitor self-
test
 Out of band management
 Thermal throttling and monitoring
Mellanox Ethernet Solution delivers:
 World’s fastest interconnect with
lowest latency and highest bandwidth
 Advanced hardware offloads for IP
packet processing
 Uncompromised data integrity with
zero-packet loss switches and cables
with lowest BER rate of 1e
-15
CONFIGURATION BRIEF
Lenovo Big Data Configuration for Apache Spark
“The rapid evolution of Spark and its
readiness for the next generation of Big
Data Spark clusters is demonstrated by
these 100TB performance results”
Berni Schiefer, IBM
3
CONFIGURATION BRIEF
Lenovo Big Data Configuration for Apache Spark
Apache Spark
There is an important new cluster computing framework called Apache Spark in the Big Data world. Over the past year it has become
a white-hot Apache project based on a number of measures
2, 3
. Many view Spark as a natural successor to Hadoop MapReduce that
preserves the elasticity and resiliency of MapReduce while being easier to program. It also has a richer and better set of programmer
APIs and is much better at exploiting abundant dynamic memory.
In addition to the Apache Spark core there are a set of libraries including SQL and DataFrames, MLlib for machine learning, GraphX,
and Spark Streaming. Users can combine these libraries seamlessly in the same application. Of these one of the most significant is
Spark SQL since SQL remains one of the world’s most popular database access technologies
4
.
So far, most performance studies of Spark SQL have used more traditional Hadoop cluster topologies, with relatively small servers,
modest amounts of memory and large capacity SAS/SATA drives and 10GbE networks. IBM, Intel, Lenovo and Mellanox set out to
investigate a topology that leverages advances made in CPU, memory, storage and networking to assess the readiness of Spark SQL to
harness these new capabilities.
Key Technologies
Advantages of using Intel® NVMe™ storage – The Intel® Solid State
Drive (SSD) Data Center Family for PCIe® brings extreme data
throughput directly to Intel® Xeon® processors with up to six times
faster data transfer speed than 6 Gbps SAS/SATA SSDs. The
performance of a single drive from the Intel SSD Data Center Family for
PCIe®, specifically the Intel® SSD DC P3600 Series (450K IOPS), can
replace the performance of 7 SATA SSDs aggregated through an HBA
(~500K IOPS). The P3600 Series is a PCIe® Gen3 SSD architected with
the new high performance controller interface – NVMe™ (Non-Volatile
Memory Express™) delivering leading performance, low latency and
Quality of Service.
Advantages of using Mellanox 100GbE network – With an increasing
need to use faster storage (SSDs, NVMe) for IOPS intensive big data
workloads like Spark SQL over HDFS, a faster higher bandwidth network
becomes a paramount. While it takes ~10 HDDs per data node to
exceed what a 10GbE network can do, it takes just 1 NVMe drive to
cross that limit and just 4 to demand a 100GbE network. Network
latency also becomes crucial for such transactional-centric workload to
deliver consistently low query latency. With 100GbE based network
including ConnectX®-4 adapters, Spectrum
TM
switches and LinkX®
cables, Mellanox provides the highest bandwidth and lowest latency to
deliver the fastest network needed for faster storage.
4
Performance Test Methodology
The Transaction Processing Performance Council (TPC) has served as the pre-
eminent industry-standard benchmarking organization since 1988.
The TPC-DS benchmark models the decision support functions of a retail product
supplier. The supporting schema contains vital business information, such as
customer, order, and product data. TPC-DS models industries that must manage,
sell and distribute products. It utilizes the business model of a large retail company
having multiple stores located nation-wide. The benchmark currently supports
database sizes at 1TB, 3TB, 10TB, 30TB and 100TB. A workload consisting of 99
standard ANSI/ISO SQL queries are used to exercise the system.
In 2014, IBM defined Hadoop-DS, a SQL on Hadoop-oriented benchmark inspired by
and derived from TPC-DS. Hadoop-DS executes a workload that is largely consistent
with the requirements of TPC-DS. It uses the same schema, the same queries and
the same execution rules but it omits the data maintenance functions, the second
throughput run and does not include any pricing. It also does not go through a
formal TPC audit process and as a result it is not a formal TPC result. To date, no
company has officially audited and published a TPC-DS result.
Test Results
When running the 4-stream multi-user (throughput) test, a total of 360 queries
completed requiring a max of 87% CPU utilization across the cluster and with an
average CPU utilization of 49% The 100GbE switch moved a remarkable 13.5 GB/s
of traffic, while the flash storage handled 12.8 GB/s of I/O bandwidth per data node
(358 GB/s cluster wide). Overall we were able to largely saturate the entire cluster.
These results demonstrate that Spark SQL is well on its way to being able to master
challenging SQL workloads such as Hadoop-DS. We can also see that using new
advanced servers allows our cluster to handle a large, complex workload rapidly
while saving energy and space compared with traditional Hadoop Clusters. As
memory and flash storage improve and drop in price, we expect the Spark F1
Cluster design to become pervasively deployed.
In summary, advances in hardware and software are paving the way for a new class
of Spark cluster that offers the resiliency and elasticity found in larger Hadoop
clusters, the powerful ability to integrate advanced analytic processing, including
machine learning into elegant Spark applications, and at the same time able to
process demanding SQL-based business intelligence queries.
CONFIGURATION BRIEF
Lenovo Big Data Configuration for Apache Spark
5
CONFIGURATION BRIEF
Lenovo Big Data Configuration for Apache Spark
Solution Configuration
Data Nodes – Lenovo x3650 M5 Servers
The Lenovo System x3650 M5 server is an enterprise class 2U two-socket
versatile server that incorporates outstanding reliability, availability, and
serviceability (RAS), security, and high-efficiency for business-critical
applications. It offers a flexible, scalable design and simple upgrade path to
various storage configurations and up to 1.5 TB of TruDDR4 Memory. This
reference configuration uses 8 NVMe PCIe SSDs and a dual-port 100GbE network
adapter.
With standards-based Intel® infrastructure that allows for innovation across a
wide choice of leading analytics platforms, now you can confidently begin to
move your business forward with actionable, real-time insights from advanced
analytics. The powerful new Intel® Xeon® processor E5-2600 v4 product family
offers versatility across diverse data center workloads including a wide range of
scale-out data analytics frameworks.
Management Nodes – Lenovo x3550 M5 Servers
The Lenovo System x3550 M5 server is a cost- and density-balanced 1U two-
socket rack server. The x3550M5 features a new, innovative, energy-smart
design with up to two Intel Xeon processors of the high-performance E5-2600 v4
product family processors a large capacity of faster, energy-efficient TruDDR4
Memory up to twelve 12Gb/s SAS drives, and up to three PCI Express (PCIe) 3.0
I/O expansion.
Storage - Intel® P3600 NVMe SSD
The Intel® SSD DC P3600 Series is a PCIe* Gen3 SSD architected with the new
high performance controller interface, Non-Volatile Memory Express* (NVMe*),
delivering leading performance, low latency, and quality of service. Matching the
performance with world-class reliability and endurance, Intel SSD DC P3600
Series offers a range of capacity—400 GB, 800 GB, 1.2 TB, 1.6 TB and 2 TB in
both add-In card and 2.5-inch form factor.
Networking – Mellanox Spectrum SN2700 switch
Mellanox Spectrum-based SN2700 (32x 100GbE ports) switch provides the
highest performance fabric solution in a 1U form factor delivering non-blocking
throughput for big data workloads, with predictable low-latency and zero-packet
loss (ZPL). Due to the bursty network traffic in big data workloads, non-blocking
switches play a crucial role in delivering predictable SQL query completion time.
In addition, LinkX DAC cables offers reliable connections at speed from 10 to
100Gb/s with highest quality, featuring error rates up to 100x lower than
industry standards.
Network Adapters - Mellanox ConnectX-4 100GbE adapters
Mellanox ConnectX-4 Ethernet adapter provides the highest performing and
most flexible interconnect solution for Big Data, Cloud and HPC applications at
various speed of 10/25 and 40/50/100Gbps. Big Data applications utilizing TCP or
UDP over IP transport can achieve the highest efficiency and application density
with the hardware-based stateless offload and flow steering engines. These
advanced offloads reduce CPU overhead in packet processing and improving
application query latency.
Spark F1 Cluster: 28 Data Nodes each provisioned with
Intel E5-2697 v4 processors, 16TB of NVMe storage and 1.5
TB DDR4 memory and interconnected with a 100 GbE
network yields one incredibly fast Apache Spark platform
The Lenovo x3650 M5 server Data Node and Lenovo
x3550 M5 server Management Node bring high
performance and high reliability in a small footprint
6
© 2016 Lenovo. All rights reserved.
Availability: Offers, prices, specifications and availability may change without notice. Lenovo is not responsible for photographic or typographical errors. Warranty: For a copy of applicable
warranties, write to: Lenovo Warranty Information, 1009 Think Place, Morrisville, NC, 27560, Lenovo makes no representation or warranty regarding third-party products or services.
Trademarks: Lenovo, the Lenovo logo, System x, ThinkServer are trademarks or registered trademarks of Lenovo. Microsoft and Windows are registered trademarks of Microsoft
Corporation. Intel, the Intel logo, Xeon and Xeon Inside are registered trademarks of Intel Corporation in the U.S. and other countries. Other company, product, and service names may be
trademarks or service marks of others. CRN: BDAAPHEXX64
10/2016
CONFIGURATION BRIEF
Lenovo Big Data Configuration for Apache Spark
Why Lenovo
Lenovo is a leading provider of x86 servers for the data center. Featuring rack,
tower, blade, dense and converged systems, and the Lenovo server portfolio
provides excellent performance, reliability and security. Lenovo also offers a
full range of networking, storage, software, solutions, and comprehensive
services supporting business needs throughout the IT lifecycle. With options for
planning, deployment, and support, Lenovo offers expertise and services
needed to deliver better service-level agreements and generate greater end-
user satisfaction.
For More Information
To learn more about the Lenovo Big Data Configuration for Apache Spark
Platform, contact your Lenovo Business Partner or visit:
lenovo.com/systems/solutions.
1
Hadoop-DS is based on TPC-DS. While results have been reviewed for correctness, they have not
been audited or published
2
https://upside.tdwi.org/articles/2016/08/22/spark-is-hot-hot-hot.aspx
3
https://adtmag.com/blogs/dev-watch/2016/05/asf-big-data-projects.aspx
4
http://stackoverflow.com/research/developer-survey-2016

Weitere ähnliche Inhalte

Was ist angesagt?

Oracle Solaris Software Integration
Oracle Solaris Software IntegrationOracle Solaris Software Integration
Oracle Solaris Software IntegrationOTN Systems Hub
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016Łukasz Grala
 
Úvod do Oracle Cloud infrastruktury
Úvod do Oracle Cloud infrastrukturyÚvod do Oracle Cloud infrastruktury
Úvod do Oracle Cloud infrastrukturyMarketingArrowECS_CZ
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data avanttic Consultoría Tecnológica
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017Luciano Resende
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made EasyBlueData, Inc.
 
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...Lucas Jellema
 
Oracle Database Appliance, ODA, X7-2 portfolio.
Oracle Database Appliance, ODA, X7-2 portfolio.Oracle Database Appliance, ODA, X7-2 portfolio.
Oracle Database Appliance, ODA, X7-2 portfolio.Daryll Whyte
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperBlueData, Inc.
 
Bootcamp 2017 - SQL Server on Linux
Bootcamp 2017 - SQL Server on LinuxBootcamp 2017 - SQL Server on Linux
Bootcamp 2017 - SQL Server on LinuxMaximiliano Accotto
 
CloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppCloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppNetApp
 
Oracle Database Appliance (ODA) X6-2 Portfolio Overview
Oracle Database Appliance (ODA) X6-2 Portfolio OverviewOracle Database Appliance (ODA) X6-2 Portfolio Overview
Oracle Database Appliance (ODA) X6-2 Portfolio OverviewDaryll Whyte
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsHugo Blanco
 
MOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMonica Li
 
Exploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeExploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeAlex Thissen
 
Oracle Cloud Infrastructure – Compute
Oracle Cloud Infrastructure – ComputeOracle Cloud Infrastructure – Compute
Oracle Cloud Infrastructure – ComputeMarketingArrowECS_CZ
 
OpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackOpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackMatt Lord
 
NetApp Management Pack for VMware vRealize Operations | Blue Medora
NetApp Management Pack for VMware vRealize Operations | Blue MedoraNetApp Management Pack for VMware vRealize Operations | Blue Medora
NetApp Management Pack for VMware vRealize Operations | Blue MedoraBlue Medora
 

Was ist angesagt? (20)

Oracle Solaris Software Integration
Oracle Solaris Software IntegrationOracle Solaris Software Integration
Oracle Solaris Software Integration
 
Expert summit SQL Server 2016
Expert summit   SQL Server 2016Expert summit   SQL Server 2016
Expert summit SQL Server 2016
 
Úvod do Oracle Cloud infrastruktury
Úvod do Oracle Cloud infrastrukturyÚvod do Oracle Cloud infrastruktury
Úvod do Oracle Cloud infrastruktury
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
The Analytic Platform behind IBM’s Watson Data Platform - Big Data Spain 2017
 
Spark Infrastructure Made Easy
Spark Infrastructure Made EasySpark Infrastructure Made Easy
Spark Infrastructure Made Easy
 
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
Triple C - Centralize, Cloudify and Consolidate Dozens of Oracle Databases (O...
 
Oracle Database Appliance, ODA, X7-2 portfolio.
Oracle Database Appliance, ODA, X7-2 portfolio.Oracle Database Appliance, ODA, X7-2 portfolio.
Oracle Database Appliance, ODA, X7-2 portfolio.
 
AltaVault
AltaVaultAltaVault
AltaVault
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White Paper
 
Bootcamp 2017 - SQL Server on Linux
Bootcamp 2017 - SQL Server on LinuxBootcamp 2017 - SQL Server on Linux
Bootcamp 2017 - SQL Server on Linux
 
CloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer AppCloudBridge and NetApp Storage Solutions - The Killer App
CloudBridge and NetApp Storage Solutions - The Killer App
 
Oracle Database Appliance (ODA) X6-2 Portfolio Overview
Oracle Database Appliance (ODA) X6-2 Portfolio OverviewOracle Database Appliance (ODA) X6-2 Portfolio Overview
Oracle Database Appliance (ODA) X6-2 Portfolio Overview
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive Systems
 
MOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your DataMOUG17: DB Security; Secure your Data
MOUG17: DB Security; Secure your Data
 
Exploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeExploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscape
 
Oracle Cloud Infrastructure – Compute
Oracle Cloud Infrastructure – ComputeOracle Cloud Infrastructure – Compute
Oracle Cloud Infrastructure – Compute
 
OpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStackOpenStack Days East -- MySQL Options in OpenStack
OpenStack Days East -- MySQL Options in OpenStack
 
NetApp Management Pack for VMware vRealize Operations | Blue Medora
NetApp Management Pack for VMware vRealize Operations | Blue MedoraNetApp Management Pack for VMware vRealize Operations | Blue Medora
NetApp Management Pack for VMware vRealize Operations | Blue Medora
 

Ähnlich wie The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark

Application Report: Big Data - Big Cluster Interconnects
Application Report: Big Data - Big Cluster InterconnectsApplication Report: Big Data - Big Cluster Interconnects
Application Report: Big Data - Big Cluster InterconnectsIT Brand Pulse
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Community
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedis Labs
 
Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Principled Technologies
 
Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Principled Technologies
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed_Hat_Storage
 
Data proliferation and machine learning: The case for upgrading your servers ...
Data proliferation and machine learning: The case for upgrading your servers ...Data proliferation and machine learning: The case for upgrading your servers ...
Data proliferation and machine learning: The case for upgrading your servers ...Principled Technologies
 
IBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsIBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsthinkASG
 
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...Principled Technologies
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistParis Open Source Summit
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...Principled Technologies
 
Deploying Apache Spark and testing big data applications on servers powered b...
Deploying Apache Spark and testing big data applications on servers powered b...Deploying Apache Spark and testing big data applications on servers powered b...
Deploying Apache Spark and testing big data applications on servers powered b...Principled Technologies
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Redis Labs
 

Ähnlich wie The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark (20)

Application Report: Big Data - Big Cluster Interconnects
Application Report: Big Data - Big Cluster InterconnectsApplication Report: Big Data - Big Cluster Interconnects
Application Report: Big Data - Big Cluster Interconnects
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power Systems
 
Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...
 
Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...Get insight from document-based distributed MongoDB databases sooner and have...
Get insight from document-based distributed MongoDB databases sooner and have...
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
Data proliferation and machine learning: The case for upgrading your servers ...
Data proliferation and machine learning: The case for upgrading your servers ...Data proliferation and machine learning: The case for upgrading your servers ...
Data proliferation and machine learning: The case for upgrading your servers ...
 
IBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsIBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deployments
 
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...
Apache Cassandra performance advantages of the new Dell PowerEdge C6620 with ...
 
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems SpecialistOWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
OWF14 - Plenary Session : Thibaud Besson, IBM POWER Systems Specialist
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...
Move your private cloud to Dell EMC PowerEdge C6420 server nodes and boost Ap...
 
Deploying Apache Spark and testing big data applications on servers powered b...
Deploying Apache Spark and testing big data applications on servers powered b...Deploying Apache Spark and testing big data applications on servers powered b...
Deploying Apache Spark and testing big data applications on servers powered b...
 
Sgi hadoop
Sgi hadoopSgi hadoop
Sgi hadoop
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
NetApp All Flash storage
NetApp All Flash storageNetApp All Flash storage
NetApp All Flash storage
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
 
Exadata
ExadataExadata
Exadata
 

Mehr von Lenovo Data Center

SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries Lenovo Data Center
 
Umeå University -- Supercharging research to enable ground-breaking innovation
Umeå University -- Supercharging research to enable ground-breaking innovationUmeå University -- Supercharging research to enable ground-breaking innovation
Umeå University -- Supercharging research to enable ground-breaking innovationLenovo Data Center
 
China Mobile -- Ringing in a new age of mobile technology
China Mobile -- Ringing in a new age of mobile technologyChina Mobile -- Ringing in a new age of mobile technology
China Mobile -- Ringing in a new age of mobile technologyLenovo Data Center
 
Lenovo -- Geared up for global growth with streamlined supply chain operations
Lenovo -- Geared up for global growth with streamlined supply chain operationsLenovo -- Geared up for global growth with streamlined supply chain operations
Lenovo -- Geared up for global growth with streamlined supply chain operationsLenovo Data Center
 
North Carolina State University -- Harnessing Artificial Intelligence and big...
North Carolina State University -- Harnessing Artificial Intelligence and big...North Carolina State University -- Harnessing Artificial Intelligence and big...
North Carolina State University -- Harnessing Artificial Intelligence and big...Lenovo Data Center
 
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...Lenovo Data Center
 
Queen’s University -- Powerful research with ultra-efficient supercomputer
Queen’s University -- Powerful research with ultra-efficient supercomputerQueen’s University -- Powerful research with ultra-efficient supercomputer
Queen’s University -- Powerful research with ultra-efficient supercomputerLenovo Data Center
 
Oblakoteka -- Capitalizing on soaring demand for cloud computing
Oblakoteka -- Capitalizing on soaring demand for cloud computingOblakoteka -- Capitalizing on soaring demand for cloud computing
Oblakoteka -- Capitalizing on soaring demand for cloud computingLenovo Data Center
 
CINECA -- Providing the intelligence to solve our biggest challenges
CINECA -- Providing the intelligence to solve our biggest challengesCINECA -- Providing the intelligence to solve our biggest challenges
CINECA -- Providing the intelligence to solve our biggest challengesLenovo Data Center
 
Callaway Golf -- Keeping global operations swinging
Callaway Golf -- Keeping global operations swingingCallaway Golf -- Keeping global operations swinging
Callaway Golf -- Keeping global operations swingingLenovo Data Center
 
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledge
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledgeBarcelona Supercomputing Center -- Pushing the boundaries of human knowledge
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledgeLenovo Data Center
 
Lenovo Preserves the past with futuristic solutions – Six Nations Polytechnic
Lenovo Preserves the past with futuristic solutions – Six Nations PolytechnicLenovo Preserves the past with futuristic solutions – Six Nations Polytechnic
Lenovo Preserves the past with futuristic solutions – Six Nations PolytechnicLenovo Data Center
 
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...Lenovo Data Center
 
Industrial Bank unleashes innovation with Lenovo and Nutanix
Industrial Bank unleashes innovation with Lenovo and NutanixIndustrial Bank unleashes innovation with Lenovo and Nutanix
Industrial Bank unleashes innovation with Lenovo and NutanixLenovo Data Center
 
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...KEK helps scientists uncover the mysteries of the universe with Lenovo superc...
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...Lenovo Data Center
 
Europe Airpost Shrunk its IT Footprint 66%
Europe Airpost Shrunk its IT Footprint 66%Europe Airpost Shrunk its IT Footprint 66%
Europe Airpost Shrunk its IT Footprint 66%Lenovo Data Center
 
Next Level Supercomputing at STFC Hartree Centre
Next Level Supercomputing at STFC Hartree CentreNext Level Supercomputing at STFC Hartree Centre
Next Level Supercomputing at STFC Hartree CentreLenovo Data Center
 
A New Supercomputer Rises at the University of Adelaide
A New Supercomputer Rises at the University of AdelaideA New Supercomputer Rises at the University of Adelaide
A New Supercomputer Rises at the University of AdelaideLenovo Data Center
 
Virtualization, VMware and… Vegetation
Virtualization, VMware and… VegetationVirtualization, VMware and… Vegetation
Virtualization, VMware and… VegetationLenovo Data Center
 

Mehr von Lenovo Data Center (20)

SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries
 
Umeå University -- Supercharging research to enable ground-breaking innovation
Umeå University -- Supercharging research to enable ground-breaking innovationUmeå University -- Supercharging research to enable ground-breaking innovation
Umeå University -- Supercharging research to enable ground-breaking innovation
 
China Mobile -- Ringing in a new age of mobile technology
China Mobile -- Ringing in a new age of mobile technologyChina Mobile -- Ringing in a new age of mobile technology
China Mobile -- Ringing in a new age of mobile technology
 
Lenovo -- Geared up for global growth with streamlined supply chain operations
Lenovo -- Geared up for global growth with streamlined supply chain operationsLenovo -- Geared up for global growth with streamlined supply chain operations
Lenovo -- Geared up for global growth with streamlined supply chain operations
 
North Carolina State University -- Harnessing Artificial Intelligence and big...
North Carolina State University -- Harnessing Artificial Intelligence and big...North Carolina State University -- Harnessing Artificial Intelligence and big...
North Carolina State University -- Harnessing Artificial Intelligence and big...
 
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...
HiQ -- Finland Supporting sophisticated, user-friendly smartphone app develop...
 
Queen’s University -- Powerful research with ultra-efficient supercomputer
Queen’s University -- Powerful research with ultra-efficient supercomputerQueen’s University -- Powerful research with ultra-efficient supercomputer
Queen’s University -- Powerful research with ultra-efficient supercomputer
 
Oblakoteka -- Capitalizing on soaring demand for cloud computing
Oblakoteka -- Capitalizing on soaring demand for cloud computingOblakoteka -- Capitalizing on soaring demand for cloud computing
Oblakoteka -- Capitalizing on soaring demand for cloud computing
 
CINECA -- Providing the intelligence to solve our biggest challenges
CINECA -- Providing the intelligence to solve our biggest challengesCINECA -- Providing the intelligence to solve our biggest challenges
CINECA -- Providing the intelligence to solve our biggest challenges
 
Callaway Golf -- Keeping global operations swinging
Callaway Golf -- Keeping global operations swingingCallaway Golf -- Keeping global operations swinging
Callaway Golf -- Keeping global operations swinging
 
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledge
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledgeBarcelona Supercomputing Center -- Pushing the boundaries of human knowledge
Barcelona Supercomputing Center -- Pushing the boundaries of human knowledge
 
Lenovo Preserves the past with futuristic solutions – Six Nations Polytechnic
Lenovo Preserves the past with futuristic solutions – Six Nations PolytechnicLenovo Preserves the past with futuristic solutions – Six Nations Polytechnic
Lenovo Preserves the past with futuristic solutions – Six Nations Polytechnic
 
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...
Taiyuan Wusu Comprehensive Bonded Zone supports business growth and the boomi...
 
Industrial Bank unleashes innovation with Lenovo and Nutanix
Industrial Bank unleashes innovation with Lenovo and NutanixIndustrial Bank unleashes innovation with Lenovo and Nutanix
Industrial Bank unleashes innovation with Lenovo and Nutanix
 
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...KEK helps scientists uncover the mysteries of the universe with Lenovo superc...
KEK helps scientists uncover the mysteries of the universe with Lenovo superc...
 
Europe Airpost Shrunk its IT Footprint 66%
Europe Airpost Shrunk its IT Footprint 66%Europe Airpost Shrunk its IT Footprint 66%
Europe Airpost Shrunk its IT Footprint 66%
 
Next Level Supercomputing at STFC Hartree Centre
Next Level Supercomputing at STFC Hartree CentreNext Level Supercomputing at STFC Hartree Centre
Next Level Supercomputing at STFC Hartree Centre
 
A Supercomputer for CERFACS
A Supercomputer for CERFACSA Supercomputer for CERFACS
A Supercomputer for CERFACS
 
A New Supercomputer Rises at the University of Adelaide
A New Supercomputer Rises at the University of AdelaideA New Supercomputer Rises at the University of Adelaide
A New Supercomputer Rises at the University of Adelaide
 
Virtualization, VMware and… Vegetation
Virtualization, VMware and… VegetationVirtualization, VMware and… Vegetation
Virtualization, VMware and… Vegetation
 

Kürzlich hochgeladen

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark

  • 1. 1 Summary Apache Spark is gaining adoption both for its features and its performance. Apache Spark enables the organizations to realize the real-time analytics and gain faster insights in today’s competitive business environment. It provides significant performance advantages to analyze massive datasets with its in-memory processing engine and support for running standard SQL natively on Apache Spark platform. This configuration brief describes the performance and scalability benefits of deploying Apache Spark cluster on the Lenovo server platform using Intel NVMe drives and Mellanox network interface cards. Some of the key advantages of this solution stack are listed below:  Serves as a reference for optimized Apache Spark Cluster setup  Illustrates performance benefits using Hadoop-DS 1 benchmark  Perfectly suited for high data ingest rates  Delivers extreme performance in a small data center footprint WWW.LENOVO.COM Industry First 100TB Spark SQL Benchmark Lenovo Big Data Configuration for Apache Spark CONFIGURATION BRIEF Big Data HIGHLIGHTS  Unleash extreme Apache Spark cluster performance  Faster time to analytics with sub-millisecond I/O latency  Leverage scalability from 1 to 100 TB dataset and beyond  Optimized solution driven by reliable infrastructure stack  Standard SQL support with Apache Spark 2.1  Targeted workloads include SQL & Analytics
  • 2. 2 Lenovo Big Data Configuration for Apache Spark The system is architected with high performance servers featuring high performance storage and networking components. This architecture offers a balanced configuration to manage the compute, storage and networking demands of enterprise scale Apache Spark clusters. . The aggregate capability is provided by 1008 cores, 42TB DRAM, 448TB Flash storage and a 100Gbps Ethernet network. The cluster is running Linux, IBM Open Platform with Apache Hadoop and Apache Spark 2.1 (pre-release). As a challenging SQL workload we selected the Hadoop-DS workload (a TPC-DS derivative) at a scale of 100TB on a cluster of 28 Lenovo x3650 data nodes along with a pair of Lenovo x3550 management nodes to provide HDFS services. All servers were connected with a 100GbE Mellanox network. Each data node had a pair of Intel E5- 2697 v4 (18-core) processors, 1.5TB of DRAM and 8 x 2.0TB Intel NVMe SSD drives. The goal is to determine how much the very latest CPUs, NVMe solid state storage and high bandwidth networking would improve Data Center Space and energy efficiency while handling a demanding SQL workload with multiple concurrent users. To ensure the latest Spark SQL was used, we took a snapshot of the Apache Spark 2.1 trunk and added select Spark SQL enhancements from the IBM Spark Technology Center. Performance Overview Early results are extremely encouraging. With the major uplift in Apache Spark 2.0, Spark SQL was able to parse and obtain plans for all 99 SQL queries using TPC-DS compliant syntax. This is an achievement no other company has demonstrated with open source SQL on Hadoop. 91 of 99 queries were successfully run against 100TB of data in a single user test. During the execution of this test, the cluster achieved peak CPU utilization of 100% as well as I/O throughput of 358 GB/s. Intel® Solid State Drive DC P3600 Series AIC (Add-in Card) and 2.5” form factors both include:  Consistently high IOPS and throughput  Sustained low latency  Variable Sector Size and End-to-End data path protection  Power loss protection capacitor self- test  Out of band management  Thermal throttling and monitoring Mellanox Ethernet Solution delivers:  World’s fastest interconnect with lowest latency and highest bandwidth  Advanced hardware offloads for IP packet processing  Uncompromised data integrity with zero-packet loss switches and cables with lowest BER rate of 1e -15 CONFIGURATION BRIEF Lenovo Big Data Configuration for Apache Spark “The rapid evolution of Spark and its readiness for the next generation of Big Data Spark clusters is demonstrated by these 100TB performance results” Berni Schiefer, IBM
  • 3. 3 CONFIGURATION BRIEF Lenovo Big Data Configuration for Apache Spark Apache Spark There is an important new cluster computing framework called Apache Spark in the Big Data world. Over the past year it has become a white-hot Apache project based on a number of measures 2, 3 . Many view Spark as a natural successor to Hadoop MapReduce that preserves the elasticity and resiliency of MapReduce while being easier to program. It also has a richer and better set of programmer APIs and is much better at exploiting abundant dynamic memory. In addition to the Apache Spark core there are a set of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Users can combine these libraries seamlessly in the same application. Of these one of the most significant is Spark SQL since SQL remains one of the world’s most popular database access technologies 4 . So far, most performance studies of Spark SQL have used more traditional Hadoop cluster topologies, with relatively small servers, modest amounts of memory and large capacity SAS/SATA drives and 10GbE networks. IBM, Intel, Lenovo and Mellanox set out to investigate a topology that leverages advances made in CPU, memory, storage and networking to assess the readiness of Spark SQL to harness these new capabilities. Key Technologies Advantages of using Intel® NVMe™ storage – The Intel® Solid State Drive (SSD) Data Center Family for PCIe® brings extreme data throughput directly to Intel® Xeon® processors with up to six times faster data transfer speed than 6 Gbps SAS/SATA SSDs. The performance of a single drive from the Intel SSD Data Center Family for PCIe®, specifically the Intel® SSD DC P3600 Series (450K IOPS), can replace the performance of 7 SATA SSDs aggregated through an HBA (~500K IOPS). The P3600 Series is a PCIe® Gen3 SSD architected with the new high performance controller interface – NVMe™ (Non-Volatile Memory Express™) delivering leading performance, low latency and Quality of Service. Advantages of using Mellanox 100GbE network – With an increasing need to use faster storage (SSDs, NVMe) for IOPS intensive big data workloads like Spark SQL over HDFS, a faster higher bandwidth network becomes a paramount. While it takes ~10 HDDs per data node to exceed what a 10GbE network can do, it takes just 1 NVMe drive to cross that limit and just 4 to demand a 100GbE network. Network latency also becomes crucial for such transactional-centric workload to deliver consistently low query latency. With 100GbE based network including ConnectX®-4 adapters, Spectrum TM switches and LinkX® cables, Mellanox provides the highest bandwidth and lowest latency to deliver the fastest network needed for faster storage.
  • 4. 4 Performance Test Methodology The Transaction Processing Performance Council (TPC) has served as the pre- eminent industry-standard benchmarking organization since 1988. The TPC-DS benchmark models the decision support functions of a retail product supplier. The supporting schema contains vital business information, such as customer, order, and product data. TPC-DS models industries that must manage, sell and distribute products. It utilizes the business model of a large retail company having multiple stores located nation-wide. The benchmark currently supports database sizes at 1TB, 3TB, 10TB, 30TB and 100TB. A workload consisting of 99 standard ANSI/ISO SQL queries are used to exercise the system. In 2014, IBM defined Hadoop-DS, a SQL on Hadoop-oriented benchmark inspired by and derived from TPC-DS. Hadoop-DS executes a workload that is largely consistent with the requirements of TPC-DS. It uses the same schema, the same queries and the same execution rules but it omits the data maintenance functions, the second throughput run and does not include any pricing. It also does not go through a formal TPC audit process and as a result it is not a formal TPC result. To date, no company has officially audited and published a TPC-DS result. Test Results When running the 4-stream multi-user (throughput) test, a total of 360 queries completed requiring a max of 87% CPU utilization across the cluster and with an average CPU utilization of 49% The 100GbE switch moved a remarkable 13.5 GB/s of traffic, while the flash storage handled 12.8 GB/s of I/O bandwidth per data node (358 GB/s cluster wide). Overall we were able to largely saturate the entire cluster. These results demonstrate that Spark SQL is well on its way to being able to master challenging SQL workloads such as Hadoop-DS. We can also see that using new advanced servers allows our cluster to handle a large, complex workload rapidly while saving energy and space compared with traditional Hadoop Clusters. As memory and flash storage improve and drop in price, we expect the Spark F1 Cluster design to become pervasively deployed. In summary, advances in hardware and software are paving the way for a new class of Spark cluster that offers the resiliency and elasticity found in larger Hadoop clusters, the powerful ability to integrate advanced analytic processing, including machine learning into elegant Spark applications, and at the same time able to process demanding SQL-based business intelligence queries. CONFIGURATION BRIEF Lenovo Big Data Configuration for Apache Spark
  • 5. 5 CONFIGURATION BRIEF Lenovo Big Data Configuration for Apache Spark Solution Configuration Data Nodes – Lenovo x3650 M5 Servers The Lenovo System x3650 M5 server is an enterprise class 2U two-socket versatile server that incorporates outstanding reliability, availability, and serviceability (RAS), security, and high-efficiency for business-critical applications. It offers a flexible, scalable design and simple upgrade path to various storage configurations and up to 1.5 TB of TruDDR4 Memory. This reference configuration uses 8 NVMe PCIe SSDs and a dual-port 100GbE network adapter. With standards-based Intel® infrastructure that allows for innovation across a wide choice of leading analytics platforms, now you can confidently begin to move your business forward with actionable, real-time insights from advanced analytics. The powerful new Intel® Xeon® processor E5-2600 v4 product family offers versatility across diverse data center workloads including a wide range of scale-out data analytics frameworks. Management Nodes – Lenovo x3550 M5 Servers The Lenovo System x3550 M5 server is a cost- and density-balanced 1U two- socket rack server. The x3550M5 features a new, innovative, energy-smart design with up to two Intel Xeon processors of the high-performance E5-2600 v4 product family processors a large capacity of faster, energy-efficient TruDDR4 Memory up to twelve 12Gb/s SAS drives, and up to three PCI Express (PCIe) 3.0 I/O expansion. Storage - Intel® P3600 NVMe SSD The Intel® SSD DC P3600 Series is a PCIe* Gen3 SSD architected with the new high performance controller interface, Non-Volatile Memory Express* (NVMe*), delivering leading performance, low latency, and quality of service. Matching the performance with world-class reliability and endurance, Intel SSD DC P3600 Series offers a range of capacity—400 GB, 800 GB, 1.2 TB, 1.6 TB and 2 TB in both add-In card and 2.5-inch form factor. Networking – Mellanox Spectrum SN2700 switch Mellanox Spectrum-based SN2700 (32x 100GbE ports) switch provides the highest performance fabric solution in a 1U form factor delivering non-blocking throughput for big data workloads, with predictable low-latency and zero-packet loss (ZPL). Due to the bursty network traffic in big data workloads, non-blocking switches play a crucial role in delivering predictable SQL query completion time. In addition, LinkX DAC cables offers reliable connections at speed from 10 to 100Gb/s with highest quality, featuring error rates up to 100x lower than industry standards. Network Adapters - Mellanox ConnectX-4 100GbE adapters Mellanox ConnectX-4 Ethernet adapter provides the highest performing and most flexible interconnect solution for Big Data, Cloud and HPC applications at various speed of 10/25 and 40/50/100Gbps. Big Data applications utilizing TCP or UDP over IP transport can achieve the highest efficiency and application density with the hardware-based stateless offload and flow steering engines. These advanced offloads reduce CPU overhead in packet processing and improving application query latency. Spark F1 Cluster: 28 Data Nodes each provisioned with Intel E5-2697 v4 processors, 16TB of NVMe storage and 1.5 TB DDR4 memory and interconnected with a 100 GbE network yields one incredibly fast Apache Spark platform The Lenovo x3650 M5 server Data Node and Lenovo x3550 M5 server Management Node bring high performance and high reliability in a small footprint
  • 6. 6 © 2016 Lenovo. All rights reserved. Availability: Offers, prices, specifications and availability may change without notice. Lenovo is not responsible for photographic or typographical errors. Warranty: For a copy of applicable warranties, write to: Lenovo Warranty Information, 1009 Think Place, Morrisville, NC, 27560, Lenovo makes no representation or warranty regarding third-party products or services. Trademarks: Lenovo, the Lenovo logo, System x, ThinkServer are trademarks or registered trademarks of Lenovo. Microsoft and Windows are registered trademarks of Microsoft Corporation. Intel, the Intel logo, Xeon and Xeon Inside are registered trademarks of Intel Corporation in the U.S. and other countries. Other company, product, and service names may be trademarks or service marks of others. CRN: BDAAPHEXX64 10/2016 CONFIGURATION BRIEF Lenovo Big Data Configuration for Apache Spark Why Lenovo Lenovo is a leading provider of x86 servers for the data center. Featuring rack, tower, blade, dense and converged systems, and the Lenovo server portfolio provides excellent performance, reliability and security. Lenovo also offers a full range of networking, storage, software, solutions, and comprehensive services supporting business needs throughout the IT lifecycle. With options for planning, deployment, and support, Lenovo offers expertise and services needed to deliver better service-level agreements and generate greater end- user satisfaction. For More Information To learn more about the Lenovo Big Data Configuration for Apache Spark Platform, contact your Lenovo Business Partner or visit: lenovo.com/systems/solutions. 1 Hadoop-DS is based on TPC-DS. While results have been reviewed for correctness, they have not been audited or published 2 https://upside.tdwi.org/articles/2016/08/22/spark-is-hot-hot-hot.aspx 3 https://adtmag.com/blogs/dev-watch/2016/05/asf-big-data-projects.aspx 4 http://stackoverflow.com/research/developer-survey-2016