WHITE PAPER

QLogic TrueScale™ DDR IB Adapter Provides Scalable, Best-In-Class Performance

QLogic’s DDR Adapters Outperform Mellanox®
QLogic’s Message Rate 340% Better and Scalable Latency Up to 33% Superior

Executive Summary

Solving today’s most challenging computational problems requires more powerful, cost-effective, and power-efficient systems. As clusters and the number of processors per cluster grow to address problems of increasing complexity, the communication needs of the applications also increase. Consequently, interconnect performance is crucial for application scaling. Satisfying the high performance requirements of Inter-Processor Communications (IPC) requires an interconnect that:

• Efficiently processes a variety of message patterns
• Leverages the benefits of multi-core processors
• Scales with the size of the fabric
• Minimizes power requirements

QLogic Host Channel Adapters (HCAs) have been architected with these design goals in mind to provide significantly better scaling performance than any other InfiniBand™ (IB) architecture. As a result, a measurable and sustainable difference in application performance can be realized when deploying the TrueScale IB architecture.

QLogic has performed a series of head-to-head performance benchmarks showing the I/O performance and scalability advantages of its 7200 Series of Dual Data Rate (DDR) IB adapters over Mellanox ConnectX™ adapters. The findings in this paper demonstrate that QLogic TrueScale adapters are the best choice for High Performance Computing (HPC) applications.

Key Findings

The QLogic 7200 Series DDR InfiniBand adapters offer better message rate and scalable latency than Mellanox’s ConnectX adapters. The test results described in this paper suggest that:

• Message rate performance is over 340 percent better than ConnectX
• Scalable latency is up to 33 percent superior to ConnectX
• TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively
• HPC customers can reap the benefits of TrueScale adapters, which significantly outperform Mellanox DDR adapters as the size of the cluster increases
Results

The most accurate way to establish the best interconnect option for a given application is to install and run the application on a variety of fabrics to determine the best performing option. However, given the costs associated with this approach, the use of industry-standard benchmarks is a more pragmatic means of evaluating an interconnect.

For applications with heavy messaging requirements, message rate performance is a good indicator of how well an interconnect will be able to support the needs of an application. Another factor to consider is how well the interconnect maintains its performance as the system is scaled. The High Performance Computing Challenge (HPCC) scalable latency and scalable message rate benchmarks are strong indicators of how well the interconnect will support an application at scale.

Architecturally, ConnectX is designed to offload more of the burden of communication processing from the CPU to the adapter. This design can provide benefits in CPU utilization, especially when using single- or dual-core compute nodes. However, given the availability of multiple cores in today’s compute nodes, this approach is no longer optimal. As more cores are added to a node, the communications burden on a single adapter increases significantly. This results in an increased dependency on the adapter’s capabilities for scalable “system” performance. Consequently, scalability anomalies can begin to appear when the number of cores in a compute node increases to four or five.

Primarily due to the offload capability of ConnectX, Mellanox’s adapters require significantly more power to operate, as much as 50 percent more than TrueScale adapters. The additional wattage required to power the compute nodes is also reflected in the associated higher cooling costs required to bring down the ambient temperature in the data center.

The TrueScale architecture is designed to support highly scaled applications with high message rate and ultra-low scalable latency performance. In both “scale-up” (multi-core environments) and “scale-out” (large node count) clusters, the efficient message processing capabilities of the adapter enable more effective use of the available compute resources, resulting in application performance benefits as the number of cores per node and the number of nodes in a cluster increase.

Microbenchmarks

Table 1 summarizes QLogic’s findings in scalable benchmark performance between ConnectX and TrueScale IB adapters.

Message Rate

As seen in Table 1, at eight processes per node (ppn), TrueScale message rate performance is over three times that of ConnectX.

OSU’s Multiple Bandwidth/Message Rate benchmark (osu_mbw_mr) was run on two servers connected by a 1 m cable (no switch), each server with two 3.0 GHz Intel® Harpertown E5472 quad-core CPUs, 16 GB RAM, and RHEL 5. ConnectX runs used OFED 1.3 and MVAPICH-1.0.0 (default options). TrueScale runs used InfiniPath® 2.2/OFED 1.3 and QLogic MPI (default options).

As multi-core systems become increasingly prevalent, the cluster interconnect must be able to accommodate more processes per compute node. The TrueScale architecture was designed with this trend in mind, enabling users to take maximum advantage of all the cores in their compute nodes. This is accomplished through a high message rate and superior inter- and intra-node communication capabilities.
Table 1. Summary of QLogic’s Message Rate and Scalable Latency Advantage Over Mellanox

Comparison                      Benchmark                               Mellanox® MHGH28 | MHGH29      QLogic QLE7240 | QLE7280     QLogic Advantage
Message Rate (non-coalesced)    OSU Message Rate @ 8 ppn                4.5 | 5.5 million messages/s   19 | 26 million messages/s   Over 340%
Scalable Latency                HPCC Random Ring Latency @ 128 cores    4.4 | 8.9 µs                   1.3 | 1.1 µs                 Up to 33%
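For readers who want a concrete picture of what the message-rate numbers in Table 1 measure, the listing below is a minimal sketch, in C with MPI, of the multi-pair pattern used by benchmarks such as osu_mbw_mr: half of the ranks stream windows of small non-blocking sends to partner ranks in the other half, and the aggregate number of messages delivered per second is reported. It is not the OSU benchmark source; the message size, window size, and iteration count are illustrative assumptions.

/*
 * A minimal multi-pair message-rate sketch in the spirit of osu_mbw_mr.
 * This is not the OSU benchmark source; the message size, window size,
 * and iteration count below are illustrative assumptions only.
 *
 * Build:  mpicc -O2 msgrate.c -o msgrate
 * Run:    mpirun -np 16 ./msgrate    (requires an even number of ranks)
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   1      /* 1-byte messages, as in the runs cited above */
#define WINDOW     64     /* non-blocking sends kept in flight at once   */
#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, size;
    char sbuf[MSG_SIZE] = {0}, rbuf[MSG_SIZE] = {0}, ack = 0;
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2 != 0) {
        if (rank == 0) fprintf(stderr, "Run with an even number of ranks.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Rank i in the first half streams messages to rank i + size/2. */
    int pairs = size / 2;
    int is_sender = rank < pairs;
    int peer = is_sender ? rank + pairs : rank - pairs;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int iter = 0; iter < ITERATIONS; iter++) {
        if (is_sender) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            /* A short acknowledgement keeps the windows in lockstep. */
            MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        double total_msgs = (double)pairs * ITERATIONS * WINDOW;
        printf("Aggregate message rate: %.2f million messages/s\n",
               total_msgs / elapsed / 1e6);
    }

    MPI_Finalize();
    return 0;
}

Because every sending rank keeps a full window of small messages in flight, this style of measurement exposes how well the adapter sustains traffic from many cores at once, which is the behavior the processes-per-node scaling results discussed above exercise.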
Figure 1 illustrates the ability of TrueScale to make effective use of multi-core nodes.[1] Note that ConnectX does not scale as the processes per node increase. With TrueScale, more application work is accomplished as the node size increases.

Figure 1. TrueScale Multi-core Advantage in Message Rate Performance

Scalable Latency

In terms of scalable latency performance, at 128 cores, QLogic’s MPI latency is 13 to 33 percent of Mellanox ConnectX’s latency.

All scalable latency results are from the HPC Challenge web site (http://icl.cs.utk.edu/hpcc/hpcc_results_all.cgi) and use the Random Ring Latency benchmark. ConnectX Gen1 results are from the 2008-05-15 submission by Intel using 128 cores of the Intel Endeavour cluster with Xeon® E5462 CPUs (2.8 GHz); ConnectX Gen2 results are from the 2008-05-09 submission by TU Dresden using 128 cores of the SGI® Altix® ICE 8200EX cluster with Xeon X5472 CPUs (3.0 GHz). QLogic QLE7240 results are from QLogic’s 2008-08-05 submission using 128 cores of the Darwin Cluster with Xeon 5160 CPUs (3.0 GHz); QLogic QLE7280 results are from QLogic’s 2008-08-01 submission using 128 cores of the QLogic Benchmark Cluster with Xeon E5472 CPUs (3.0 GHz).

Figure 2 shows that TrueScale adapters maintain consistent latency performance as more cores are added to a node.[2] Consequently, more of the compute power can be used for application workload rather than waiting for the adapter to process messages.

Figure 2. TrueScale Multi-core Advantage in Latency Performance

When measuring latency with a realistic 128-byte message size, the latency performance of ConnectX drops off at about four to five cores per node. Under the same conditions, TrueScale provides consistent and predictable levels of performance.
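To make it easier to see what a scalable-latency test exercises, the listing below is a simplified ring-latency sketch in C with MPI: every rank simultaneously sends a small message to one neighbour and receives from the other, so the measured latency reflects all adapters being active at once rather than a single idle-system ping-pong. It is not the HPCC source; the HPCC Random Ring Latency test additionally averages over randomly permuted ring orderings, and the message size and iteration count here are illustrative assumptions.

/*
 * A simplified ring-latency sketch, loosely following the idea behind the
 * HPCC Random Ring Latency test. This is not the HPCC source: HPCC also
 * averages over randomly permuted ring orderings, whereas this sketch uses
 * the natural ring (rank i talks to ranks i-1 and i+1), and the message
 * size and iteration count are illustrative assumptions.
 *
 * Build:  mpicc -O2 ringlat.c -o ringlat
 * Run:    mpirun -np 128 ./ringlat
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   8
#define ITERATIONS 10000

int main(int argc, char **argv)
{
    int rank, size;
    char sbuf[MSG_SIZE] = {0}, rbuf[MSG_SIZE] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* successor in the ring   */
    int left  = (rank - 1 + size) % size;   /* predecessor in the ring */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Every rank sends to its right neighbour while receiving from its
     * left neighbour, so all adapters are active on every iteration.   */
    for (int i = 0; i < ITERATIONS; i++)
        MPI_Sendrecv(sbuf, MSG_SIZE, MPI_CHAR, right, 0,
                     rbuf, MSG_SIZE, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double local = (MPI_Wtime() - t0) / ITERATIONS;
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Ring latency per iteration (slowest rank): %.2f us\n",
               worst * 1e6);

    MPI_Finalize();
    return 0;
}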
Application Performance

SPEC MPI2007

There are more sophisticated benchmarks, such as SPEC MPI2007, which measure performance at a system level over a variety of different applications. This benchmark suite includes 13 different codes and emphasizes areas of performance that are most relevant to MPI applications running on large-scale systems. The quantity and performance of the microprocessors, memory architecture, interconnect, compiler, and shared file system are all evaluated.

In August 2008, QLogic ran the SPECmpiM_base2007 benchmark on a TrueScale-enabled cluster that yielded the best overall performance at 96 and 128 cores.[3] This result represents third-party validation of the scalable performance capabilities of the architecture over a variety of application types. This result compared favorably not only to other commodity x86-based compute clusters, but also against platforms from large system vendors.

Halo Test

The halo test from Argonne National Laboratory’s mpptest benchmark suite simulates communication patterns in layered ocean models.
[1] These are the results of the OSU Multiple Bandwidth/Message Rate (osu_mbw_mr) test. The test used a 1-byte message size when run on two nodes, each with two 3.0 GHz Intel Xeon E5472 quad-core CPUs. The test used QLogic MPI 2.2 for QLE7280 adapters and MVAPICH-1.0.0 with OFED 1.3 on Gen2 ConnectX DDR adapters.

[2] These are the results of the OSU Multiple Latency (osu_multi_lat) test of QLE7240 and Gen1 ConnectX HCAs at a 128-byte message size when run on two nodes, each with two 2.33 GHz Intel Xeon E5410 quad-core CPUs.

[3] Details of the submission and results can be found at: http://www.spec.org/mpi2007/results/res2008q3/
Unlike many of the point-to-point microbenchmarks that measure peak bandwidth, this benchmark measures throughput performance over a variety of message sizes. As seen in Figure 3, TrueScale outperforms Mellanox across the entire range of message sizes.[4]

Figure 3. TrueScale Bandwidth Performance on Halo Benchmark

Application requirements vary in terms of message sizes and patterns, so performance over a variety of message sizes is a better predictor of performance than peak measurements. At four processes per node, TrueScale bandwidth performance is anywhere from 120 to 70 percent better at 128- and 1024-byte message sizes, respectively.
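As an illustration of the communication pattern behind these results, the listing below is a minimal 2D halo-exchange sketch in C with MPI: the ranks are placed on a periodic process grid and each rank swaps a fixed-size halo with its four neighbours on every iteration, much as a layered ocean model exchanges boundary data. It is not the mpptest source, and it uses plain MPI_Sendrecv calls rather than the exact psendrecv variant cited in the footnote; the grid shape, halo size, and iteration count are illustrative assumptions.

/*
 * A minimal 2D halo-exchange sketch illustrating the neighbour-exchange
 * pattern the mpptest halo test exercises. This is not the mpptest source;
 * it uses plain MPI_Sendrecv calls rather than the exact psendrecv variant,
 * and the halo size, grid shape, and iteration count are illustrative.
 *
 * Build:  mpicc -O2 halo.c -o halo
 * Run:    mpirun -np 32 ./halo
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define HALO_BYTES 1024   /* one of the message sizes discussed above */
#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Arrange the ranks in a periodic 2D process grid. */
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(size, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int left, right, down, up;
    MPI_Cart_shift(cart, 0, 1, &left, &right);  /* neighbours along dim 0 */
    MPI_Cart_shift(cart, 1, 1, &down, &up);     /* neighbours along dim 1 */

    char *sbuf = calloc(HALO_BYTES, 1);
    char *rbuf = calloc(HALO_BYTES, 1);

    MPI_Barrier(cart);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        /* Swap a fixed-size halo with each of the four neighbours. */
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, right, 0,
                     rbuf, HALO_BYTES, MPI_CHAR, left,  0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, left,  1,
                     rbuf, HALO_BYTES, MPI_CHAR, right, 1, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, up,    2,
                     rbuf, HALO_BYTES, MPI_CHAR, down,  2, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sbuf, HALO_BYTES, MPI_CHAR, down,  3,
                     rbuf, HALO_BYTES, MPI_CHAR, up,    3, cart, MPI_STATUS_IGNORE);
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)
        printf("Average time per 2D halo exchange: %.2f us\n",
               elapsed / ITERATIONS * 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

Running a sketch like this across a range of halo sizes yields the kind of per-message-size throughput profile plotted in Figure 3, rather than a single peak-bandwidth number.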
Summary and Conclusion

TrueScale is architecturally designed to take advantage of two significant trends in high performance computing clusters: the prevalence of multi-core processors in compute nodes and the need to deploy increasingly larger clusters to tackle more complex computational problems.

The benefits of the TrueScale architecture can be demonstrated in a variety of industry-standard benchmarks that measure the scalable performance characteristics of the interconnect. More importantly, the advantages can be realized through improved application performance and a reduced time-to-solution at about half the power of ConnectX.

[4] The benchmark is the Halo test from Argonne National Laboratory’s mpptest suite, specifically the 2D halo psendrecv test at 4 processes per node on 8 nodes, each with two 2.6 GHz AMD® Opteron™ 2218 CPUs, 8 GB of DDR2-667 memory, and an NVIDIA® MCP55 PCIe chipset, for a total of 32 MPI ranks. QLogic MPI 2.2 was used for the TrueScale adapters and MVAPICH 0.9.9 for ConnectX.
HSG-WP08014 IB0030901-00 A