Remote storage separates compute from storage, which makes storage highly scalable and cost-effective. Cloud remote storage built to the HDFS standard is well suited to storing and analyzing petabytes of data at a time. Customers can have unlimited storage capacity, with no limit on the number or size of files. At that scale, I/O performance becomes an increasingly important consideration when analyzing the data. For all workloads, remote storage in the cloud can deliver strong performance when the relevant knobs are tuned correctly...
Speaker
Stephen Wu, Senior Program Manager, Microsoft
4. Cloud elasticity
Scale compute and storage independently
No expensive data centers
Less management
Global presence
Availability SLAs
Spark™ and Hadoop® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
26. Default - Four apps
Total resources: 64 cores, 192 GB
Your app: 16 cores, 48 GB
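The numbers on this slide are consistent with an even split of the cluster across four concurrent apps. A minimal sketch of that arithmetic, assuming an equal share per app (the function name `default_app_share` is hypothetical and not part of Spark):

```python
def default_app_share(total_cores, total_mem_gb, num_apps):
    """Split cluster resources evenly across concurrently running apps.

    Assumption: every app gets an equal share, as the slide's numbers suggest.
    """
    if num_apps < 1:
        raise ValueError("num_apps must be at least 1")
    return total_cores // num_apps, total_mem_gb / num_apps


cores, mem_gb = default_app_share(total_cores=64, total_mem_gb=192, num_apps=4)
print(f"Your app: {cores} cores, {mem_gb:g} GB")
```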
30. App resources: 32 cores, 96 GB
Executor options: 8 executors, 16 executors, 32 executors
[Figure labels: "More executors", "Fewer executors", "Out of memory"]
Best practice: set each executor to no more than 64 GB
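The trade-off on this slide can be worked through numerically. The sketch below divides the app's 32 cores and 96 GB evenly across 8, 16, or 32 executors and checks each result against the 64 GB guideline. It is a simplification that ignores per-executor memory overhead and driver resources, and `executor_layout` is a hypothetical helper, not a Spark API.

```python
def executor_layout(app_cores, app_mem_gb, num_executors, max_executor_gb=64):
    """Divide an app's cores and memory evenly across its executors.

    Simplification: ignores per-executor memory overhead and driver resources.
    """
    cores_each = app_cores // num_executors
    mem_each_gb = app_mem_gb / num_executors
    # Check the slide's guideline of at most 64 GB per executor.
    within_cap = mem_each_gb <= max_executor_gb
    return cores_each, mem_each_gb, within_cap


for n in (8, 16, 32):
    cores_each, mem_each_gb, within_cap = executor_layout(32, 96, n)
    print(f"{n} executors: {cores_each} cores and {mem_each_gb:g} GB each, "
          f"within 64 GB guideline: {within_cap}")
```

More executors means less memory for each one, which is where an out-of-memory risk would come from in this model.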
35. Input Data → Map Stage (map tasks) → Reducer Stage (reduce tasks) → Output
Hive™ is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries.
36. YARN resources: Memory 96 GB, CPU 32 cores
12 GB per container: 8 containers
6 GB per container: 16 containers
3 GB per container: 32 containers
[Figure labels: "More containers", "Fewer containers", "Out of memory"]
Best practice: set to minimum YARN container size
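The container counts on this slide follow from dividing the YARN memory by the container size. A rough sketch, assuming one core per container (an assumption not stated on the slide) and using a hypothetical helper name `containers_that_fit`:

```python
def containers_that_fit(yarn_mem_gb, yarn_cores, container_gb, cores_per_container=1):
    """Return how many containers fit, limited by both memory and cores.

    Assumption: each container uses one core unless stated otherwise.
    """
    by_memory = yarn_mem_gb // container_gb
    by_cores = yarn_cores // cores_per_container
    return min(by_memory, by_cores)


for size_gb in (12, 6, 3):
    count = containers_that_fit(yarn_mem_gb=96, yarn_cores=32, container_gb=size_gb)
    print(f"{size_gb} GB containers: {count}")
```

Under these assumptions, going below 3 GB would not add containers, because the 32 cores become the limit.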
38. Input Data: 80MB
tez.grouping.min-size = 20MB
# of map tasks: 1.6 waves × # of containers (best effort)
10MB per map task
39. Input Data: 80MB
tez.grouping.min-size = 20MB
# of map tasks: 1.6 waves × # of containers (best effort)
20MB per map task
Apache Tez® is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries.
40. Input Data: 80MB
Set tez.grouping.min-size = 5MB
# of map tasks: 1.6 waves × # of containers (best effort)
20MB per map task
41. Input Data: 80MB
Set tez.grouping.min-size = 5MB
# of map tasks: 1.6 waves × # of containers (best effort)
10MB per map task
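The grouping walk-through in slides 38 to 41 can be sketched as a small model: aim for 1.6 waves of map tasks across the available containers, then clamp the resulting group size to the configured minimum. This is a simplified illustration, not Tez's actual implementation, which also weighs factors such as data locality. The container count of 5 and the 1024 MB upper bound are assumptions chosen so the numbers match the slides, and `grouped_map_tasks` is a hypothetical helper.

```python
import math


def grouped_map_tasks(input_mb, num_containers, min_size_mb,
                      max_size_mb=1024, waves=1.6):
    """Simplified model of split grouping: returns (group size in MB, task count).

    Assumptions: the desired task count is waves * containers, and the
    group size is clamped between min_size_mb and max_size_mb.
    """
    desired_tasks = max(1, round(waves * num_containers))
    desired_size_mb = input_mb / desired_tasks
    # The minimum group size overrides the best-effort wave target.
    group_size_mb = max(min_size_mb, min(desired_size_mb, max_size_mb))
    num_tasks = math.ceil(input_mb / group_size_mb)
    return group_size_mb, num_tasks


NUM_CONTAINERS = 5  # hypothetical: chosen so that 1.6 waves gives 8 tasks

for min_size_mb in (20, 5):
    size_mb, tasks = grouped_map_tasks(80, NUM_CONTAINERS, min_size_mb)
    print(f"min-size {min_size_mb}MB: {tasks} map tasks of {size_mb:g}MB each")
```

With a 20MB minimum the model yields 4 tasks of 20MB; lowering the minimum to 5MB lets the 1.6-wave target take effect, giving 8 tasks of 10MB.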