Design for Large-Scale Automation
12/30/2015
Ongoing...
Design for large-scale, high-performance, distributed software systems for
complex algorithms such as graph, optimization, prediction, and machine
learning.
Corrections/improvements are very welcome at hohaxu@gmail.com (Hao Xu)
Topics
● Large-scale Automation: Why Challenging?
● Design Principles: Coping with Complexity and Physicality
● Computation Paradigms: HPC, Spark, Tensorflow
● Designs: Logical, Physical, System levels
● Distributed and Iterative Algorithms: Partition, Sync, Iteration Trade-offs
● Smart QA: Protection, Auditing, Debug codes
Design Objectives for Large-scale Automation
● Scalability (growing)
● Extensibility (evolving)
● Performance (fast)
● Maintenance (controllable)
Scalability: Name of the Game
● Electronics simulation: mandatory for simulation software to scale with
Moore’s law
● Internet Applications: systems need to be ready for next 10x user growth and
feature evolution
● Knowledge Base: bigger system improves cross referencing and hence quality
of learning new knowledge
● Deep learning: capacity of system affects quality of latent features learned
and hence the prediction capability
● Internet of Things: as the name suggests...
What makes it difficult? #1 Complexity
● Complexity is the TOP challenge for software engineering
● Usually grows with the scale of the system
○ exhibits different patterns at different scale
○ explodes with the number of software features
● The only way to handle complexity
○ “Divide and Conquer”
○ realized by various Design Principles
What makes it difficult? #2 Physicality
● Software is physical, just like humans
○ Results are stored in physical memory (RAM/ROM/Disk)
○ Computation is done in physical processing units (CPU/GPU/FPGA)
● Not feasible to build one gigantic machine that solves everything
○ System should live on machine farms
○ Data / Computation should be distributed
● Physicality complicates the design of systems
○ Data partition
○ Computation partition
Design Principles
Abstraction and Decoupling
Design Principles: The Philosophy
Design Principles for Coping with Complexity
● Abstraction (Vertical Divide & Conquer)
○ Core Abstractions
○ Hierarchization
● Decoupling (Horizontal Divide & Conquer)
○ Encapsulation
○ Layerization
Decoupling
Centerpiece of large-scale system design
Abstraction
Abstraction: Vertical Divide and Conquer
● Core Abstractions
○ the soul of large-scale systems
○ the root of abstraction hierarchy
○ higher level abstraction = better extensibility
● Hierarchization
○ simplification of system functionality graph
○ ideally mapped into tree structures (no loop)
○ the template for Object Oriented Design
○ need a balance b/w delegation & check
Decoupling: Horizontal Divide and Conquer
● Encapsulation
○ components encapsulate complex logic
○ API design for minimal interface
● Layerization
○ algorithms divided into layers
○ each layer handles a feature/algorithm
■ layer 1: Graph partition and communication
■ layer 2: Graph node property analysis
■ layer 3: User operation on Graph nodes
■ ...
The Priority of Abstractions for Project Management
● Core abstractions (1st Priority)
○ Determines functionality/scalability
● Library abstractions (2nd Priority)
○ Determines performance
● Logic abstractions (low priority)
○ Flows
○ Apps
○ Business logics
Computation Paradigms
Language level, Flow level, System level
Computation Paradigms: The Framework
Computation Paradigms
● What is Computation Paradigm?
○ Computation abstraction at different levels
○ Offers encapsulation and parallelism at different levels
○ Crucial to choose the right computation paradigm
● Computation Paradigm at different levels
○ Language level: Python, C, Scala
○ Flow level: Imperative, Symbolic, Functional programming
○ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)
Flow level: Imperative Programming
● Imperative Programming: No native abstraction
○ C++ / Python / Java
○ Computation at instruction level
○ Task-level parallelism
Flow level: Functional Programming
● Functional Programming: Data abstraction
○ Scala / MapReduce
○ Immutable, Stateless function
● Pros
○ Offers data-level parallelism
● Cons
○ Data read only, need to make another copy if update.
○ More memory consumption. Potential performance overhead.
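A minimal sketch of the data-level parallelism that stateless functions allow. The `normalize` function, the sample records, and the use of Python's standard thread pool are illustrative assumptions, not taken from the slides.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def normalize(record):
    """Pure function: returns a new tuple and never mutates its input."""
    name, value = record
    return (name.lower(), value * 2)

def run(records, workers=4):
    # normalize is stateless, so records can be processed on any worker
    # in any order; pool.map still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(normalize, records))
    total = reduce(lambda acc, rec: acc + rec[1], mapped, 0)
    return mapped, total

records = (("A", 1), ("B", 2), ("C", 3))   # immutable input
mapped, total = run(records)
```

The input tuple is left untouched and a second copy is produced, which is the memory cost listed under Cons.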
Flow level: Symbolic Programming
● Symbolic Programming: Operator abstraction
○ Theano / TensorFlow
○ Operator-level parallelism
○ Graph model as base engine
● Pros
○ Offers high operator parallelism through graph propagation
● Cons
○ Not flexible for all programming tasks
○ May incur overhead when handling fine-grained operators
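The operator abstraction can be sketched with a toy graph engine. This is not how Theano or TensorFlow are implemented; `Node`, `placeholder`, and `evaluate` are hypothetical names used only to show that building the graph and executing it are separate steps.

```python
class Node:
    """One operator in a symbolic graph; nothing is computed at build time."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

def placeholder(name):
    return Node("placeholder", value=name)

def evaluate(node, feed, cache=None):
    """Evaluate the graph bottom-up; a shared subgraph is computed once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "placeholder":
        result = feed[node.value]
    else:
        left, right = (evaluate(n, feed, cache) for n in node.inputs)
        result = left + right if node.op == "add" else left * right
    cache[id(node)] = result
    return result

x, y = placeholder("x"), placeholder("y")
s = x + y          # builds a graph node, does not add numbers
z = s * s          # the "add" node is shared by both inputs of "mul"
value = evaluate(z, {"x": 2, "y": 3})
```

Because the whole graph is known before execution, an engine is free to run independent operators in parallel; the per-node bookkeeping is also where the fine-grained overhead comes from.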
System level: Computation-Centric System (typical HPC 1)
● What is HPC
○ HPC is extreme parallel computing
○ Computation Partition
■ Communication delay aware
● Intra-node: L1/L2/L3 caches
● Inter-node: interconnect, 100 Gb/s
● Inter-cluster: Ethernet, 1 Gb/s, plus RAM-to-disk time
■ Physical architecture aware
● Register size etc
System level: Computation-Centric System (typical HPC 2)
● Parallelism at different levels
○ Multi-threading
○ Multi-process
○ Distributed cluster
○ Mainstream communication: MPI
● Partition based on needs of communication
○ Minimize communication
○ Algorithm partition
○ Data partition
System level: Computation-Centric System (typical HPC 3)
● Exploit Heterogeneous Components
○ GPU acceleration (many small cores)
■ Model is too small; too much overhead; stays on CPU
■ Model is too large; exceeds GPU memory; do partial acceleration
■ Exchange memory with CPU through memory copy
○ FPGA (millions of gates)
○ SSD, RAID 0/1/5/10
● Disk IO
○ HDF5 parallel read/write
System level: Data-Centric System (Spark-like)
● Data partition: Physically distributed central DB
○ Serialization: boost::serialization (C++), pickle (Python)
● Scalable computation
○ Usually has a scheduler
○ Explicit scheduling: user defines computation graph nodes
○ Implicit scheduling: engine analyzes the computation graph
● Stateless
○ Good for debugging, easy to recover from failure
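A small sketch of why serialization plus stateless tasks makes recovery easy. The `word_count` task and the `ship` helper are hypothetical; `ship` only simulates sending a partition to another machine by round-tripping it through `pickle`.

```python
import pickle

def word_count(lines):
    """Stateless task: the output depends only on the input partition."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def ship(obj):
    """Round-trip through bytes, as if sending the object to another machine."""
    return pickle.loads(pickle.dumps(obj))

partition = ["a b a", "b c"]
first_try = word_count(ship(partition))
# Recovery after a failure is just a re-run on the same serialized input.
retry = word_count(ship(partition))
```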
System level: Hybrid Architecture
● Hybrid Architecture Example: TensorFlow
○ Stochastic algorithms → use Data-centric model
■ E.g. Back propagation: Parameter Server
○ Deterministic algorithms → use Computation-centric (HPC) model
■ E.g. Common data sync among model partitions: Bulk Synchronous
Parallel
Designs: The Quality
Logical Design
Objectify, Modularize, Standardize
Logical Design
● Objectify everything
○ an object can have multiple copies for parallel computing
○ avoid singleton / global / static variables
○ top level should fall through, should not execute anything
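One way to read these three rules, sketched with a hypothetical `Solver` class: all state lives on instances, so several copies can coexist, and the top level only wires objects together.

```python
class Solver:
    """All state lives on the instance, so copies can run side by side."""
    def __init__(self, step):
        self.step = step
        self.total = 0

    def run(self, n):
        for _ in range(n):
            self.total += self.step
        return self.total

def main():
    # No globals, singletons, or statics: each solver owns its own state.
    solvers = [Solver(step) for step in (1, 10)]
    return [s.run(3) for s in solvers]

if __name__ == "__main__":
    # Importing this module executes nothing; work starts only here.
    print(main())
```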
Logical Design
● Standardize everything
○ Base Class for any task = function(data, parameters, executor_id)
○ schema (base class) for task
○ schema for any data
○ schema for any function
○ schema for any parameter
● Benefits
○ higher level automation
○ potentially more intelligent system
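A minimal sketch of the `task = function(data, parameters, executor_id)` schema. The slides give only that signature; the `Parameters` dataclass, `ScaleTask`, and `execute_all` are assumptions added for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(frozen=True)
class Parameters:
    """Hypothetical parameter schema: a bag of named values."""
    values: Dict[str, Any] = field(default_factory=dict)

class Task(ABC):
    """Schema for any task: task = function(data, parameters, executor_id)."""
    @abstractmethod
    def run(self, data, parameters: Parameters, executor_id: int):
        ...

class ScaleTask(Task):
    def run(self, data, parameters, executor_id):
        factor = parameters.values["factor"]
        return {"executor": executor_id, "result": [x * factor for x in data]}

def execute_all(task: Task, partitions, parameters):
    # Every task shares one signature, so one generic driver can run any of them.
    return [task.run(part, parameters, i) for i, part in enumerate(partitions)]

out = execute_all(ScaleTask(), [[1, 2], [3]], Parameters({"factor": 10}))
```

The uniform signature is what enables the higher-level automation named under Benefits: schedulers, loggers, and retry logic can be written once against `Task`.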
Logical Design
● Modularize everything
○ encapsulate data by using setter / getter
○ encapsulate atomic or repeated functionality
○ #define any hard-coded number
○ factorize long function or class
○ build shared libraries from bottom-up
■ communication lib
■ parallel computing lib
■ debug / reporting lib
Physical Design
Code, Memory, Performance
Physical Design: Code
● Source Code
○ component level decouple by folder
○ module level decouple by file
○ variable space decouple by namespace
● Code change
○ physical change (files/folders touched) should reflect logical change
○ change scope should narrow down as development goes
○ diff management
Physical Design: Memory 1)
● Memory is the #1 factor for performance
○ Code runs in memory, not in the air
● OS Memory Handling
○ Memory allocation, fragmentation, release etc
○ TCMalloc vs. jemalloc
■ Improves allocation/fragmentation
■ Still has issue on release
Physical Design: Memory 2)
● Interpreter Memory Handling
○ Garbage Collection
● Manual Memory Management
○ memory pooling is mandatory
○ memory lifecycle management for any large usage
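The pooling idea, sketched in Python for readability. In practice this matters most in C/C++ engines; the `BufferPool` class and its sizes are hypothetical.

```python
class BufferPool:
    """Reuses fixed-size bytearrays instead of allocating one per request."""
    def __init__(self, buffer_size, capacity):
        self.buffer_size = buffer_size
        self._free = [bytearray(buffer_size) for _ in range(capacity)]
        self.allocations = capacity   # counts real allocations, for profiling

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.allocations += 1         # pool exhausted: fall back to a new buffer
        return bytearray(self.buffer_size)

    def release(self, buf):
        buf[:] = bytes(self.buffer_size)   # clear contents before reuse
        self._free.append(buf)

pool = BufferPool(buffer_size=1024, capacity=2)
for _ in range(100):
    buf = pool.acquire()
    buf[0] = 1
    pool.release(buf)
```

After 100 acquire/release cycles the pool has still performed only its two initial allocations, which is the lifecycle control the slide asks for.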
Physical Design: Memory 3)
● Trade-offs
○ Depends on application
■ Memory critical: TCMalloc / jemalloc
■ Memory and Performance critical: MMU
○ HPC is memory and performance critical
■ Parallelism does not solve every problem. Single-machine performance is
still the dominant factor
■ You should know the code very well to design manual MMU
○ Spark replacing JVM memory management with Tungsten project
Physical Design: Performance
● Performance Tuning
○ profiling, profiling, profiling...
○ lazy initialization / write / read
○ cache-aware design
■ cache-friendly data structure
● linked structure locality
■ cache-friendly algorithm
● read / write locality
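Lazy initialization can be sketched with the standard `functools.cached_property` (Python 3.8+). The `Graph` class and its `builds` counter are hypothetical and exist only to make the laziness observable.

```python
from functools import cached_property

class Graph:
    def __init__(self, edges):
        self.edges = list(edges)
        self.builds = 0               # instrumentation for this example only

    @cached_property
    def adjacency(self):
        """Built on first access only, then cached on the instance."""
        self.builds += 1
        adj = {}
        for u, v in self.edges:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        return adj

g = Graph([(0, 1), (1, 2)])
builds_before = g.builds              # nothing has been built yet
neighbors = g.adjacency[1]            # first access triggers the build
_ = g.adjacency[0]                    # second access reuses the cached dict
```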
System Design
Distributed, Parallel, Resilient
System Design
● Scalable Distributed System
○ DB Service: Data and Computation decouple
○ Task/Scheduler: Computation and Execution decouple
○ Query/Queue: Producer and Consumer decouple
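The producer/consumer decoupling can be sketched with a bounded queue from the Python standard library. The squaring workload and the `None` sentinel are illustrative choices.

```python
import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)          # blocks when the queue is full
    q.put(None)              # sentinel: no more work

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)

q = queue.Queue(maxsize=2)   # bounded queue applies back-pressure
results = []
threads = [
    threading.Thread(target=producer, args=(q, range(5))),
    threading.Thread(target=consumer, args=(q, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Neither side knows about the other; each only knows the queue, so either can be scaled or replaced independently.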
System Design
● DB Service
○ Logically Centralized
■ Parameter Server
○ Physically distributed
■ Only routing / bookkeeping service on Master
■ Master capacity is not an issue
■ Computation locality on Slaves
System Design
● Parallel Computing
○ multi-threading
■ light overhead
■ shared memory, data exchange OK
○ multi-process
■ heavy overhead
■ separated memory space, more difficult data exchange
○ distributed multiple machine
■ balance between computation VS. communication
System Design
● TensorFlow Example
○ Multi-threading: Graph Execution Engine
■ BFS
■ DFS
○ Multi-machine: Graph partition
■ Edge-cut?
■ Vertex-cut?
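The two partitioning styles can be compared by their communication cost. This sketch uses common textbook definitions (crossing edges for edge-cut, extra vertex replicas for vertex-cut); the function names and the star graph are assumptions, and nothing here reflects TensorFlow's actual placement logic.

```python
def edge_cut_cost(edges, vertex_part):
    """Edge-cut: vertices are assigned to machines; count edges that cross."""
    return sum(1 for u, v in edges if vertex_part[u] != vertex_part[v])

def vertex_cut_cost(edges, edge_part):
    """Vertex-cut: edges are assigned to machines; count extra vertex replicas."""
    machines = {}
    for edge, part in zip(edges, edge_part):
        for vertex in edge:
            machines.setdefault(vertex, set()).add(part)
    return sum(len(parts) - 1 for parts in machines.values())

# A star graph: hub 0 connected to four leaves, split over two machines.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
ec = edge_cut_cost(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1})
vc = vertex_cut_cost(edges, [0, 0, 1, 1])
```

For this hub-heavy graph the edge-cut crosses two edges while the vertex-cut replicates only the hub once, which is why the choice depends on the degree distribution.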
System Design
● Fault Tolerance
○ Monitor granularity
■ system level: module behavior
■ flow level: major steps
■ algorithm level: major checkpoints
○ Persistence granularity
■ recovery depth
■ recovery contents
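A minimal sketch of flow-level persistence: state is checkpointed after each major step, and a restarted run resumes from the last checkpoint. The function name, the `fail_at` crash simulation, and the pickle-file format are hypothetical.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(steps, path, fail_at=None):
    """Runs steps in order, persisting state after each one."""
    state = {"next_step": 0, "value": 0}
    if os.path.exists(path):
        with open(path, "rb") as fh:
            state = pickle.load(fh)        # resume from the last checkpoint
    for i in range(state["next_step"], len(steps)):
        if i == fail_at:
            raise RuntimeError(f"simulated crash at step {i}")
        state["value"] = steps[i](state["value"])
        state["next_step"] = i + 1
        with open(path, "wb") as fh:
            pickle.dump(state, fh)
    return state["value"]

steps = [lambda v: v + 1, lambda v: v * 10, lambda v: v - 3]
path = os.path.join(tempfile.mkdtemp(), "flow.ckpt")
try:
    run_with_checkpoints(steps, path, fail_at=2)   # crashes before the last step
except RuntimeError:
    pass
recovered = run_with_checkpoints(steps, path)      # redoes only step 2
```

Here the recovery depth is one step and the recovery contents are a step index plus one value; real systems trade checkpoint size against how much work a restart repeats.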
Distributed and Iterative
Algorithms
Partition, Sync, Iterate, Global/Local Optimum
Distributed and Iterative Algorithms:
The Lifeblood
Key Issues of Distributed Algorithms
● Data / Model partition
○ inference data partition; graph partition; datastore sharding
● Communication paradigm
○ Spark RDD; MPI; RPC
● Computation locality
○ locality-aware job scheduling; Yarn; Drill
● Parallel algorithm paradigm
○ Map/Reduce; Spark
● Multi-stage distributed flow
Distributed Deterministic Algorithms 1)
● What to sync?
○ what is the key information to stitch the pieces together
○ sync data to resemble single machine algorithm (rare but can be useful)
○ keep data local, sync results (map/reduce)
● When to sync?
○ lazy sync (e.g. Bulk Synchronous Parallel)
○ async (e.g. Parameter Server)
● Where to sync?
○ refactor algorithm by optimal sync point
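The "keep data local, sync results" option can be sketched with a distributed mean, where only two numbers per partition cross the network. The partitions are simulated as in-memory lists and the function names are illustrative.

```python
def local_stats(partition):
    """Runs on each worker: only two numbers leave the machine."""
    return sum(partition), len(partition)

def global_mean(partitions):
    # Sync point: combine the small per-partition results, not the raw data.
    totals = [local_stats(p) for p in partitions]
    total_sum = sum(s for s, _ in totals)
    total_count = sum(c for _, c in totals)
    return total_sum / total_count

partitions = [[1, 2, 3], [4, 5], [6]]
mean = global_mean(partitions)
# Single-machine reference over the concatenated data.
flat = [x for p in partitions for x in p]
golden = sum(flat) / len(flat)
```

Syncing per-partition means instead of (sum, count) pairs would give a wrong answer for unequal partition sizes, which is what choosing the key information to sync amounts to.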
Distributed Deterministic Algorithms 2)
● Trade-offs
○ performance
■ computation VS. communication
○ scalability
■ need scalable communication pattern
■ avoid point-to-point communication
Distributed Approximate Algorithms 1)
● QoR loss in distributed computing
○ for many algorithms, lack of global sync leads to QoR loss
○ full global sync is very expensive in communication cost
○ carefully choose sync points to maximize Performance / QoR Loss
● Self-healing Algorithms
○ some algorithms have less dependency on global sync
○ e.g. in Stochastic Optimization
■ global sync may be postponed to allow local optima to be explored
■ however this nice feature is data / model dependent
Distributed Approximate Algorithms 2)
● Major challenges 1)
○ Trade-off on QoR?
■ approximation is inevitable, so what can be approximated?
■ not just an engineering problem
■ usually needs assessment on business impact
○ Solutions
■ for each approximation candidate, profile in detail QoR loss
VS. performance gain VS. business impact
Distributed Approximate Algorithms 3)
● Major challenges 2)
○ Hard to maintain?
■ Stochastic Algorithms: hard to find deterministic behavior in probabilistic values
■ Graph algorithms: hard to trace in large-scale graph
○ Solutions
■ develop single machine algorithm first as golden
■ detailed testing and correlation for each parallelization step
■ detailed testing to understand result/error pattern on small data
Distributed Iterative Algorithms 1)
● Many algorithms for large-scale problem are iterative
○ Simulated Annealing; Genetic Algorithm; Graph Partition; PageRank;
Expectation Maximization; Loopy Belief Propagation etc
● Two Common approaches
○ Local computation + lazy Sync
○ Global computation with graph propagation
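The "local computation + lazy sync" approach can be sketched with PageRank in a bulk-synchronous style: each partition accumulates contributions locally, then all partial sums are merged once per iteration. Partitions are simulated in one process; `pagerank_bsp` and the tiny graph are assumptions, and the sketch assumes every vertex has at least one outgoing link.

```python
def pagerank_bsp(out_links, parts, damping=0.85, iters=50):
    """PageRank with one global sync per superstep."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iters):
        # Local phase: each partition handles only its own vertices.
        messages = []
        for part in parts:
            local = {}
            for u in part:
                share = rank[u] / len(out_links[u])
                for v in out_links[u]:
                    local[v] = local.get(v, 0.0) + share
            messages.append(local)
        # Barrier and sync phase: merge partial sums into new global ranks.
        incoming = {v: 0.0 for v in out_links}
        for local in messages:
            for v, contrib in local.items():
                incoming[v] += contrib
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in out_links}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph, parts=[["a"], ["b", "c"]])
single = pagerank_bsp(graph, parts=[["a", "b", "c"]])   # single-machine run
```

Comparing `ranks` with `single` is the kind of correlation check against a single-machine run that the earlier slides recommend.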
Distributed Iterative Algorithms 2)
● Distributed environment adds another layer of complexity
○ iterations need to be tuned, or completely re-designed
○ may become harder to converge
● Tuning iterations
○ Again, where to iterate?
■ spend runtime on key gainer
■ profiling of iterations VS. QoR gain
○ Tuning knobs for convergence
■ iteration knobs have very high impact on convergence
■ profiling of convergence parameters VS runtime VS QoR
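Profiling iterations against quality gain can be sketched as a loop that records the cost after every iteration and stops once the gain per iteration falls below a threshold. The toy objective, the step size, and the `min_gain` knob are illustrative assumptions.

```python
def profile_iterations(step, x0, cost, max_iters=100, min_gain=1e-6):
    """Records cost per iteration; stops when the last gain is too small."""
    x, trace = x0, [cost(x0)]
    for _ in range(max_iters):
        x = step(x)
        trace.append(cost(x))
        if trace[-2] - trace[-1] < min_gain:
            break
    return x, trace

# Toy problem: gradient descent on f(x) = (x - 3)^2.
def cost(x):
    return (x - 3.0) ** 2

def step(x, lr=0.25):
    return x - lr * 2 * (x - 3.0)

x_final, trace = profile_iterations(step, x0=0.0, cost=cost)
gains = [a - b for a, b in zip(trace, trace[1:])]   # QoR gain per iteration
```

Plotting `gains` against iteration count (and against wall-clock time in a real system) shows where additional iterations stop paying for themselves.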
Multi-stage Distributed Flow
● Data re-partition problem (“Shuffle” in Spark Language)
“In these distributed computation engines, the shuffle refers to the
repartitioning and aggregation of data during an all-to-all operation.
Understandably, most performance, scalability, and reliability issues that we
observe in production Spark deployments occur within the shuffle.”
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double
Multi-stage Distributed Flow 1)
● Data re-partition problem (“Shuffle” in Spark Language)
○ unified partition VS. per-stage partition
■ per-stage partition fits algorithm better, but requires data
migration
○ global partition VS. stream partition
■ global partition fits algorithm better, but requires single machine to
hold all data for partition
■ stream partition + post-partition adjustment
Multi-stage Distributed Flow 2)
● Data re-partition problem (“Shuffle” in Spark Language)
○ QoR numerical dependence on the number of partitions
■ direct partitioning has numerical stability problem
■ fine-grained partition + post-partition coarsening is better
● Solutions
○ Hard to use standard library for high performance system
○ Best performance system is customized on:
■ Data volume
■ Computation intensity
■ (Multiple-stage) Algorithm parallelism
○ Always, keep a golden of single machine run, even for small input data!
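A small illustration of why a single-machine golden run is worth keeping: even a plain sum can change with the number of partitions, because floating-point addition is not associative. This is only one source of partition-dependent results and may not be the one the slides have in mind; `partitioned_sum` and the data are hypothetical.

```python
import math

def partitioned_sum(values, num_parts):
    """Sums each partition locally, then sums the partial results."""
    size = math.ceil(len(values) / num_parts)
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)

# Small values followed by two huge values that cancel each other.
values = [0.1] * 10 + [1e16, -1e16]
golden = math.fsum(values)     # correctly rounded single-machine reference
results = {k: partitioned_sum(values, k) for k in (1, 2, 3, 4)}
errors = {k: abs(r - golden) for k, r in results.items()}
```

Inspecting `errors` per partition count is a cheap regression check that can run on small inputs for every change to the partitioning scheme.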
Smart QA
cannot fix a bug unless you can reproduce it
cannot build a system unless you can test it
…...
Smart QA: The Guardian
Smart QA: Why
● Successful software must have good QA
○ A high level model of the system
○ Save time in debug
○ Save business in crisis
● Throughout Software Lifecycle
○ Development: test-driven development
○ Deployment: handles discrepancy b/w user env and dev env
○ Maintenance: predicts error, learns from failures, improves system
Protection Code
● Assert / Try, Except / Raise…
● Good to have:
○ Cases run through
○ Information on internal data, sometimes
● Too much of it?
○ hurts performance
● Need a balance
○ Input of external data → sanity check
○ Internal data → no check on high performance engine. System design and code
should ensure that
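The balance described above can be sketched as a check at the boundary and none in the hot loop. The record layout (`id`, `weight`) and both function names are hypothetical.

```python
def validate_external(records):
    """Sanity check at the system boundary; fails early with a clear message."""
    if not isinstance(records, list):
        raise TypeError("records must be a list")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or "id" not in rec or "weight" not in rec:
            raise ValueError(f"record {i} is missing 'id' or 'weight'")
        if not (rec["weight"] >= 0):       # this form also rejects NaN
            raise ValueError(f"record {i} has invalid weight {rec['weight']!r}")
    return records

def hot_loop(records):
    # Internal engine: no per-item checks; the boundary guarantees the shape.
    return sum(rec["weight"] for rec in records)

total = hot_loop(validate_external([{"id": 1, "weight": 2.5},
                                    {"id": 2, "weight": 0.5}]))
try:
    validate_external([{"id": 3, "weight": float("nan")}])
    rejected = False
except ValueError:
    rejected = True
```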
Auditing Code
● Check correctness from another angle
○ Rule based
■ Simply add up the numbers to see if they match
■ Use another, simpler algorithm that does a rough check
○ Data driven
■ Samples intermediate data from normal runs, issues alert when
runtime data distribution is different
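Both auditing styles can be sketched in a few lines. The conservation rule, the three-sigma threshold, and the sample numbers are assumptions chosen for illustration; a production audit would use whatever invariant and drift test fit the algorithm.

```python
import statistics

def audit_conservation(inputs, outputs, tol=1e-9):
    """Rule based: a redistribution step must not create or lose total weight."""
    return abs(sum(inputs) - sum(outputs)) <= tol

def audit_distribution(baseline, sample, max_sigma=3.0):
    """Data driven: flag a run whose sample mean drifts far from the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(sample) - mu) <= max_sigma * sigma

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]    # sampled from normal runs
ok_run = audit_distribution(baseline, [10.2, 9.8, 10.1])
bad_run = audit_distribution(baseline, [25.0, 26.0, 24.0])
balanced = audit_conservation([1.0, 2.0, 3.0], [3.0, 3.0])
```

A `False` result would raise an alert rather than stop the run, since the audit checks plausibility from another angle and does not prove correctness.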
Debug Code
● As important as functional code! (if not more)
● Essentially a high level abstraction on code OUTPUT
○ Not just debug
○ A reversed tree structure, with samples on key nodes
○ Grows intelligently with field practice
● Maintenance effort should decrease over time
○ Error handling/messaging system should mature through time
○ Bugs should be fixed in the right direction, not just workaround

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 

Kürzlich hochgeladen (20)

%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 

Software Design Practices for Large-Scale Automation

  • 1. Design for Large-Scale Automation 12/30/2015
  • 2. Ongoing... Design for large-scale, high-performance, distributed software systems for complex algorithms such as graph, optimization, prediction, and machine learning. Corrections/improvements are very welcome at hohaxu@gmail.com (Hao Xu)
  • 3. Topics ● Large-scale Automation: Why Challenging? ● Design Principles: Coping with Complexity and Physicality ● Computation Paradigms: HPC, Spark, Tensorflow ● Designs: Logical, Physical, System levels ● Distributed and Iterative Algorithms: Partition, Sync, Iteration Trade-offs ● Smart QA: Protection, Auditing, Debug codes
  • 4. Design Objectives for Large-scale Automation ● Scalability (growing) ● Extensibility (evolving) ● Performance (fast) ● Maintenance (controllable)
  • 5. Scalability: Name of the Game ● Electronics simulation: mandatory for simulation software to scale with Moore’s law ● Internet Applications: systems need to be ready for next 10x user growth and feature evolution ● Knowledge Base: bigger system improves cross referencing and hence quality of learning new knowledge ● Deep learning: capacity of system affects quality of latent features learned and hence the prediction capability ● Internet of Things: as the name suggests...
  • 6. What makes it difficult? #1 Complexity ● Complexity is the TOP challenge for software engineering ● Usually grows with the scale of the system ○ exhibits different patterns at different scales ○ explodes with the number of software features ● The only way to handle complexity ○ “Divide and Conquer” ○ realized by various Design Principles
  • 7. What makes it difficult? #2 Physicality ● Software is physical, just like humans ○ Results are stored in physical memory (RAM/ROM/Disk) ○ Computation is done in physical processing units (CPU/GPU/FPGA) ● Not feasible to build one gigantic machine that solves everything ○ System should live on machine farms ○ Data / Computation should be distributed ● Physicality complicates the design of systems ○ Data partition ○ Computation partition
  • 10. Design Principles for Coping with Complexity ● Abstraction (Vertical Divide & Conquer) ○ Core Abstractions ○ Hierarchization ● Decoupling (Horizontal Divide & Conquer) ○ Encapsulation ○ Layerization (Diagram: Abstraction and Decoupling as the centerpiece of large-scale system design)
  • 11. Abstraction: Vertical Divide and Conquer ● Core Abstractions ○ the soul of large-scale systems ○ the root of the abstraction hierarchy ○ higher-level abstraction = better extensibility ● Hierarchization ○ simplification of the system functionality graph ○ ideally mapped into tree structures (no loops) ○ the template for Object Oriented Design ○ needs a balance between delegation and checks
  • 12. Decoupling: Horizontal Divide and Conquer ● Encapsulation ○ components encapsulate complex logic ○ API design for minimal interface ● Layerization ○ algorithms divided into layers ○ each layer handles a feature/algorithm ■ layer 1: Graph partition and communication ■ layer 2: Graph node property analysis ■ layer 3: User operation on Graph nodes ■ ...
  • 13. The Priority of Abstractions for Project Management ● Core abstractions (1st priority) ○ Determine functionality/scalability ● Library abstractions (2nd priority) ○ Determine performance ● Logic abstractions (low priority) ○ Flows ○ Apps ○ Business logic
  • 14. Computation Paradigms Language level, Flow level, System level
  • 16. Computation Paradigms ● What is Computation Paradigm? ○ Computation abstraction at different levels ○ Offers encapsulation and parallelism at different levels ○ Crucial to choose the right computation paradigm ● Computation Paradigm at different levels ○ Language level: Python, C, Scala ○ Flow level: Imperative, Symbolic, Functional programming ○ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)
  • 17. Flow level: Imperative Programming ● Imperative Programming: No native abstraction ○ C++ / Python / Java ○ Computation at instruction level ○ Task-level parallelism
  • 18. Flow level: Functional Programming ● Functional Programming: Data abstraction ○ Scala / MapReduce ○ Immutable data, stateless functions ● Pros ○ Offers data-level parallelism ● Cons ○ Data is read-only; an update requires making another copy ○ More memory consumption. Potential performance overhead.
  • 19. Flow level: Symbolic Programming ● Symbolic Programming: Operator abstraction ○ Theano / TensorFlow ○ Operator-level parallelism ○ Graph model as base engine ● Pros ○ Offers high operator parallelism through graph propagation ● Cons ○ Not flexible enough for all programming tasks ○ May incur overhead when handling fine-grained operators
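The pros and cons on this slide can be sketched in a few lines. The example uses Python rather than Scala (a choice made here for illustration), and the names `square_all` and `partitions` are hypothetical. A stateless function is mapped over immutable partitions, and an update produces a new copy instead of modifying the input.

```python
from concurrent.futures import ThreadPoolExecutor

def square_all(partition: tuple) -> tuple:
    """Stateless function: the output depends only on the input partition."""
    return tuple(x * x for x in partition)

# Immutable input data, split into partitions (tuples cannot be modified in place).
partitions = ((1, 2, 3), (4, 5), (6,))

# Data-level parallelism: each partition is processed independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    squared = tuple(pool.map(square_all, partitions))

# An "update" builds a new copy; the original stays untouched (extra memory is the cost).
updated = partitions[:1] + ((40, 50),) + partitions[2:]
```

Because `square_all` holds no state, the partitions could equally be sent to separate processes or machines without changing the result.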
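To make "operator abstraction" concrete, here is a toy graph builder. It is a hypothetical sketch and not the Theano or TensorFlow API: the `Node` class and `evaluate` function are invented for illustration. Expressions first build a graph of operators, and computation happens only when the graph is evaluated.

```python
import operator

class Node:
    """A symbolic graph node: either a named input or an operator applied to inputs."""
    def __init__(self, op=None, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __add__(self, other):
        return Node(operator.add, (self, other))

    def __mul__(self, other):
        return Node(operator.mul, (self, other))

def evaluate(node, feed, cache=None):
    """Evaluate the graph bottom-up, computing each node at most once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op is None:
        value = feed[node.name]          # leaf: look up the fed input value
    else:
        value = node.op(*(evaluate(i, feed, cache) for i in node.inputs))
    cache[id(node)] = value
    return value

x, y = Node(name="x"), Node(name="y")
z = (x + y) * (x * y)                    # builds a graph; nothing is computed yet
result = evaluate(z, {"x": 2, "y": 3})
```

The sub-expressions `(x + y)` and `(x * y)` do not depend on each other, so an engine that sees the whole graph is free to run them in parallel. That is the operator-level parallelism the slide refers to.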
  • 20. System level: Computation-Centric System (typical HPC 1) ● What is HPC ○ HPC is extreme parallel computing ○ Computation Partition ■ Communication-delay aware ● Intra-node: L3/L2/L1 caches ● Inter-node interconnect: 100 Gb/s ● Inter-cluster Ethernet: 1 Gb/s, plus RAM-to-disk time ■ Physical-architecture aware ● Register size etc.
  • 21. System level: Computation-Centric System (typical HPC 2) ● Parallel at different levels ○ Multi-threading ○ Multi-process ○ Distributed cluster ○ Mainstream communication: MPI ● Partition based on needs of communication ○ Minimize communication ○ Algorithm partition ○ Data partition
  • 22. System level: Computation-Centric System (typical HPC 3) ● Exploit Heterogeneous Components ○ GPU acceleration (many small cores) ■ Model too small: too much overhead, so it stays on the CPU ■ Model too large: exceeds GPU memory, so accelerate only part of it ■ Data is exchanged with the CPU through memory copies ○ FPGA (millions of gates) ○ SSD, RAID 0/1/5/10 ● Disk IO ○ HDF5 parallel read/write
  • 23. System level: Data-Centric System (Spark-like) ● Data partition: Physically distributed central DB ○ Serialization: boost::serialization (C++), pickling (Python) ● Scalable computation ○ Usually has a scheduler ○ Explicit scheduling: user defines computation graph nodes ○ Implicit scheduling: engine analyzes the computation graph ● Stateless ○ Good for debugging; easy to recover from failure
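Implicit scheduling can be sketched with the standard-library `graphlib` module (Python 3.9+). The task names below are hypothetical, and a real engine would dispatch each wave to workers rather than just record it. The scheduler derives from the dependency graph alone which tasks may run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key depends on the tasks in its value set.
deps = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "stats": {"clean"},
    "report": {"features", "stats"},
}

sorter = TopologicalSorter(deps)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)                 # a real engine would run these in parallel
    sorter.done(*ready)
```

Here `features` and `stats` land in the same wave because neither depends on the other; the user never had to say so explicitly.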
  • 24. System level: Hybrid Architecture ● Hybrid Architecture Example: TensorFlow ○ Stochastic algorithms → use Data-centric model ■ E.g. Back propagation: Parameter Server ○ Deterministic algorithms → use Computation-centric (HPC) model ■ E.g. Common data sync among model partitions: Bulk Synchronous Parallel
  • 27. Logical Design ● Objectify everything ○ an object can have multiple copies for parallel computing ○ avoid singleton / global / static variables ○ top level should fall through, should not execute anything
  • 28. Logical Design ● Standardize everything ○ Base Class for any task = function(data, parameters, executor_id) ○ schema (base class) for any task ○ schema for any data ○ schema for any function ○ schema for any parameter ● Benefits ○ higher level automation ○ potentially more intelligent system
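One possible reading of "task = function(data, parameters, executor_id)" is sketched below. The class names `Task`, `ScaleTask`, `SumTask` and the driver `run_pipeline` are hypothetical, not taken from the deck. The point is that once every task follows one schema, a generic driver can chain tasks without knowing what they do.

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping

class Task(ABC):
    """Hypothetical base class: every task is function(data, parameters, executor_id)."""
    @abstractmethod
    def run(self, data: Any, parameters: Mapping[str, Any], executor_id: int) -> Any:
        ...

class ScaleTask(Task):
    def run(self, data, parameters, executor_id):
        return [x * parameters["factor"] for x in data]

class SumTask(Task):
    def run(self, data, parameters, executor_id):
        return sum(data)

def run_pipeline(tasks, data, parameters, executor_id=0):
    """Because all tasks share one schema, a generic driver can chain them."""
    for task in tasks:
        data = task.run(data, parameters, executor_id)
    return data

total = run_pipeline([ScaleTask(), SumTask()], [1, 2, 3], {"factor": 10})
```

The same uniform signature is what lets a scheduler, logger, or retry wrapper be written once and applied to every task.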
  • 29. Logical Design ● Modularize everything ○ encapsulate data by using setter / getter ○ encapsulate atomic or repeated functionality ○ #define any hard number ○ factorize long function or class ○ build shared libraries from bottom-up ■ communication lib ■ parallel computing lib ■ debug / reporting lib
  • 31. Physical Design: Code ● Source Code ○ component-level decoupling by folder ○ module-level decoupling by file ○ variable-space decoupling by namespace ● Code change ○ physical change (files/folders touched) should reflect logical change ○ change scope should narrow down as development goes on ○ diff management
  • 32. Physical Design: Memory 1) ● Memory is the #1 factor for performance ○ Code runs in memory, not in the air ● OS Memory Handling ○ Memory allocation, fragmentation, release etc. ○ TCMalloc vs. jemalloc ■ Both improve allocation/fragmentation ■ Both still have issues with releasing memory
  • 33. Physical Design: Memory 2) ● Interpreter Memory Handling ○ Garbage Collection ● Manual Memory Management ○ memory pooling is mandatory ○ memory lifecycle management for any large usage
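A memory pool can be sketched as follows. The `BufferPool` class is hypothetical, and Python is used only to show the idea; in practice this technique matters most in C or C++ where allocation is manual. Buffers are allocated once up front and recycled, so the hot path does not hit the allocator.

```python
class BufferPool:
    """Hypothetical fixed-size buffer pool: reuse buffers instead of reallocating."""
    def __init__(self, buffer_size: int, count: int):
        self._size = buffer_size
        self._free = [bytearray(buffer_size) for _ in range(count)]
        self.allocations = count

    def acquire(self) -> bytearray:
        if self._free:
            return self._free.pop()
        self.allocations += 1          # pool exhausted: fall back to a fresh allocation
        return bytearray(self._size)

    def release(self, buf: bytearray) -> None:
        buf[:] = bytes(self._size)     # zero the contents in place before reuse
        self._free.append(buf)

pool = BufferPool(buffer_size=1024, count=2)
a = pool.acquire()
a[0] = 255
pool.release(a)
b = pool.acquire()                     # the same underlying buffer, now zeroed
```

The `allocations` counter is included so that pool exhaustion is visible during profiling.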
  • 34. Physical Design: Memory 3) ● Trade-offs ○ Depends on application ■ Memory critical: TCMalloc/jemalloc ■ Memory and performance critical: manual memory management ○ HPC is memory and performance critical ■ Parallelism does not solve every problem. Single-machine performance is still the dominant factor ■ You need to know the code very well to design manual memory management ○ Spark is replacing JVM memory management through the Tungsten project
  • 35. Physical Design: Performance ● Performance Tuning ○ profiling, profiling, profiling... ○ lazy initialization / write / read ○ cache-aware design ■ cache-friendly data structure ● linked structure locality ■ cache-friendly algorithm ● read / write locality
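Lazy initialization, one of the tuning techniques listed above, can be sketched with the standard-library `functools.cached_property`. The `Graph` class and its `builds` counter are hypothetical. The expensive adjacency structure is built only on first access and reused afterwards.

```python
from functools import cached_property

class Graph:
    def __init__(self, edges):
        self.edges = edges
        self.builds = 0                # counts how often the expensive build runs

    @cached_property
    def adjacency(self):
        """Built on first access only, then cached on the instance."""
        self.builds += 1
        adj = {}
        for u, v in self.edges:
            adj.setdefault(u, []).append(v)
        return adj

g = Graph([(1, 2), (1, 3), (2, 3)])
before = g.builds                      # nothing built yet
first = g.adjacency                    # triggers the build
second = g.adjacency                   # served from the cache
```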
  • 37. System Design ● Scalable Distributed System ○ DB Service: Data and Computation decouple ○ Task/Scheduler: Computation and Execution decouple ○ Query/Queue: Producer and Consumer decouple
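The producer/consumer decoupling through a queue can be sketched on a single machine with the standard library. The names `tasks`, `results` and `SENTINEL` are hypothetical, and a distributed system would use a message broker in place of `queue.Queue`. The producer only knows the queue, never the consumer.

```python
import queue
import threading

tasks = queue.Queue()
results = []
SENTINEL = None                        # tells the consumer to stop

def consumer():
    while True:
        item = tasks.get()
        if item is SENTINEL:
            break
        results.append(item * item)

worker = threading.Thread(target=consumer)
worker.start()
for n in range(5):                     # the producer pushes work without waiting
    tasks.put(n)
tasks.put(SENTINEL)
worker.join()
```

Because the two sides share only the queue, consumers can be added, removed, or restarted without touching producer code.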
  • 38. System Design ● DB Service ○ Logically Centralized ■ Parameter Server ○ Physically distributed ■ Only routing / bookkeeping service on Master ■ Master capacity is not an issue ■ Computation locality on Slaves
  • 39. System Design ● Parallel Computing ○ multi-threading ■ light overhead ■ shared memory, data exchange OK ○ multi-process ■ heavy overhead ■ separated memory space, more difficult data exchange ○ distributed multiple machine ■ balance between computation VS. communication
  • 40. System Design ● TensorFlow Example ○ Multi-threading: Graph Execution Engine ■ BFS ■ DFS ○ Multi-machine: Graph partition ■ Edge-cut? ■ Vertex-cut?
  • 41. System Design ● Fault Tolerance ○ Monitor granularity ■ system level: module behavior ■ flow level: major steps ■ algorithm level: major checkpoints ○ Persistence granularity ■ recovery depth ■ recovery contents
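Persistence granularity can be illustrated with a step-level checkpoint. The function `run_with_checkpoints`, the JSON layout, and the `fail_at` test hook are assumptions made for this sketch. State is persisted after every step, so a restart resumes from the last completed step instead of from the beginning.

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, path, fail_at=None):
    """Run steps in order, persisting state after each; resume from the last checkpoint."""
    state = {"next_step": 0, "value": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    for i in range(state["next_step"], len(steps)):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at step {i}")
        state = {"next_step": i + 1, "value": steps[i](state["value"])}
        with open(path, "w") as f:
            json.dump(state, f)        # recovery contents: step index and value
    return state["value"]

steps = [lambda v: v + 1, lambda v: v * 10, lambda v: v - 3]
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    run_with_checkpoints(steps, ckpt, fail_at=2)
except RuntimeError:
    pass
final = run_with_checkpoints(steps, ckpt)   # resumes at step 2, not step 0
```

Checkpointing after every step gives the shallowest recovery depth at the highest IO cost; a real system would tune how often and how much it persists.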
  • 42. Distributed and Iterative Algorithms Partition, Sync, Iterate, Global/Local Optimum
  • 43. Distributed and Iterative Algorithms: The Lifeblood
  • 44. Key Issues of Distributed Algorithms ● Data / Model partition ○ inference data partition; graph partition; datastore sharding ● Communication paradigm ○ Spark RDD; MPI; RPC ● Computation locality ○ locality-aware job scheduling; Yarn; Drill ● Parallel algorithm paradigm ○ Map/Reduce; Spark ● Multi-stage distributed flow
  • 45. Distributed Deterministic Algorithms 1) ● What to sync? ○ what is the key information needed to stitch the pieces together ○ sync data to resemble the single-machine algorithm (rare but can be useful) ○ keep data local, sync results (map/reduce) ● When to sync? ○ lazy sync (e.g. Bulk Synchronous Parallel) ○ async (e.g. Parameter Server) ● Where to sync? ○ refactor the algorithm around the optimal sync point
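"Keep data local, sync results" can be sketched with a mean and variance computation. The function names `local_stats` and `merge` are hypothetical, and the partitions are plain lists standing in for data on separate machines. Each partition reduces its data to three numbers, and only those small summaries are merged.

```python
from functools import reduce

def local_stats(partition):
    """Map step: runs where the data lives and returns a small summary."""
    return len(partition), sum(partition), sum(x * x for x in partition)

def merge(a, b):
    """Reduce step: the only data that would cross the network."""
    return tuple(x + y for x, y in zip(a, b))

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
n, s, ss = reduce(merge, map(local_stats, partitions))
mean = s / n
variance = ss / n - mean * mean        # population variance from the merged sums
```

The sync volume is three numbers per partition regardless of partition size, which is why choosing what to sync matters more than how fast the network is.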
  • 46. Distributed Deterministic Algorithms 2) ● Trade-offs ○ performance ■ computation VS. communication ○ scalability ■ need scalable communication pattern ■ avoid point-to-point communication
  • 47. Distributed Approximate Algorithms 1) ● QoR loss in distributed computing ○ for many algorithms, lack of global sync leads to QoR loss ○ full global sync is very expensive in communication cost ○ carefully choose sync points to maximize Performance / QoR Loss ● Self-healing Algorithms ○ some algorithms have less dependency on global sync ○ e.g. in Stochastic Optimization ■ global sync may be postponed to allow local optima to be explored ■ however, this nice feature is data / model dependent
  • 48. Distributed Approximate Algorithms 2) ● Major challenges 1) ○ Trade-off on QoR? ■ approximation is inevitable, so what can be approximated? ■ not just an engineering problem ■ usually needs an assessment of business impact ○ Solutions ■ for each approximation candidate, detailed profiling of QoR loss vs. performance gain vs. business impact
  • 49. Distributed Approximate Algorithms 3) ● Major challenges 2) ○ Hard to maintain? ■ Stochastic algorithms: hard to find deterministic behavior in probabilistic values ■ Graph algorithms: hard to trace in a large-scale graph ○ Solutions ■ develop the single-machine algorithm first as the golden reference ■ detailed testing and correlation for each parallelization step ■ detailed testing to understand the result/error pattern on small data
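The golden-reference workflow can be sketched as follows. The functions `golden_norm` and `partitioned_norm` are hypothetical stand-ins for a real algorithm and its parallelized variant. The single-machine version is written first, and the partitioned version is correlated against it for several partition counts.

```python
import math

def golden_norm(values):
    """Single-machine reference ("golden") implementation."""
    return math.sqrt(math.fsum(v * v for v in values))

def partitioned_norm(values, num_partitions):
    """Parallelized variant under test: per-partition partial sums, then one merge."""
    parts = [values[i::num_partitions] for i in range(num_partitions)]
    partial = [sum(v * v for v in p) for p in parts]
    return math.sqrt(sum(partial))

data = [0.1 * i for i in range(1000)]
golden = golden_norm(data)
# Relative error against the golden result for each partition count.
errors = {k: abs(partitioned_norm(data, k) - golden) / golden for k in (1, 2, 8, 64)}
```

Recording the error per partition count, rather than a single pass/fail, helps reveal how the result drifts as the parallelization changes.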
  • 50. Distributed Iterative Algorithms 1) ● Many algorithms for large-scale problem are iterative ○ Simulated Annealing; Genetic Algorithm; Graph Partition; PageRank; Expectation Maximization; Loopy Belief Propagation etc ● Two Common approaches ○ Local computation + lazy Sync ○ Global computation with graph propagation
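The "local computation + lazy sync" approach can be sketched with PageRank, one of the algorithms named above. The function `pagerank_bsp` is a simplified illustration written for this example: it assumes every node has at least one out-link and simulates partitions inside one process. Each iteration has a local phase per partition followed by one sync point.

```python
def pagerank_bsp(out_links, partitions, damping=0.85, iterations=50):
    nodes = sorted(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Local computation: each partition emits contributions from its own nodes.
        partials = []
        for part in partitions:
            contrib = {}
            for u in part:
                for v in out_links[u]:
                    contrib[v] = contrib.get(v, 0.0) + rank[u] / len(out_links[u])
            partials.append(contrib)
        # Sync point (barrier): merge partial results, then update all ranks at once.
        total = {n: 0.0 for n in nodes}
        for contrib in partials:
            for v, c in contrib.items():
                total[v] += c
        rank = {n: (1 - damping) / len(nodes) + damping * total[n] for n in nodes}
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(links, partitions=[["a"], ["b", "c"]])
single = pagerank_bsp(links, partitions=[["a", "b", "c"]])   # single-partition reference
```

Since all partitions sync at every iteration, the partitioned run matches the single-partition run up to floating-point rounding. Relaxing the sync frequency would trade that agreement for less communication.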
  • 51. Distributed Iterative Algorithms 2) ● Distributed environment adds another layer of complexity ○ iterations need to be tuned, or completely re-designed ○ may become harder to converge ● Tuning iterations ○ Again, where to iterate? ■ spend runtime on the key gainers ■ profiling of iterations vs. QoR gain ○ Tuning knobs for convergence ■ iteration knobs have a very high impact on convergence ■ profiling of convergence parameters vs. runtime vs. QoR
  • 52. Multi-stage Distributed Flow ● Data re-partition problem (“Shuffle” in Spark Language) “In these distributed computation engines, the shuffle refers to the repartitioning and aggregation of data during an all-to-all operation. Understandably, most performance, scalability, and reliability issues that we observe in production Spark deployments occur within the shuffle.” http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double
  • 53. Multi-stage Distributed Flow 1) ● Data re-partition problem (“Shuffle” in Spark Language) ○ unified partition VS. per-stage partition ■ per-stage partition fits algorithm better, but requires data migration ○ global partition VS. stream partition ■ global partition fits algorithm better, but requires single machine to hold all data for partition ■ stream partition + post-partition adjustment
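The cost of per-stage partitioning can be made visible with a toy shuffle. The function `repartition` and the `(user_id, item_id)` records are hypothetical, and real engines use more elaborate partitioners. When the second stage needs a different key than the first, a count of moved records shows how much data has to migrate.

```python
def repartition(partitions, key_fn, num_partitions):
    """All-to-all shuffle: every record may move to a different partition."""
    new_parts = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for record in part:
            dst = key_fn(record) % num_partitions
            new_parts[dst].append(record)
            moved += dst != src        # count records that change partition
    return new_parts, moved

# Stage 1 partitions records by user id; stage 2 needs them grouped by item id.
records = [(u, i) for u in range(4) for i in range(3)]      # (user_id, item_id)
by_user, _ = repartition([records], lambda r: r[0], 2)
by_item, moved = repartition(by_user, lambda r: r[1], 2)
```

In this small case half of the records change partition between stages, which is the data migration cost the slide attributes to per-stage partitioning.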
  • 54. Multi-stage Distributed Flow 2) ● Data re-partition problem (“Shuffle” in Spark Language) ○ QoR numerical dependence on the number of partitions ■ direct partitioning has a numerical stability problem ■ fine-grained partition + post-partition coarsening is better ● Solutions ○ Hard to use a standard library for a high-performance system ○ The best-performing system is customized for: ■ Data volume ■ Computation intensity ■ (Multiple-stage) Algorithm parallelism ○ Always keep a golden single-machine run, even for small input data!
  • 55. Smart QA: You cannot fix a bug unless you can reproduce it; you cannot build a system unless you can test it ...
  • 56. Smart QA: The Guardian
  • 57. Smart QA: Why ● Successful software must have good QA ○ A high-level model of the system ○ Saves time in debugging ○ Saves the business in a crisis ● Throughout the Software Lifecycle ○ Development: test-driven development ○ Deployment: handles discrepancies between user env and dev env ○ Maintenance: predicts errors, learns from failures, improves the system
  • 58. Protection Code ● Assert / Try, Except / Raise… ● Good to have: ○ Cases run through ○ Information on internal data, sometimes ● Too much of it? ○ hurts performance ● Need a balance ○ Input of external data → sanity check ○ Internal data → no checks in the high-performance engine; system design and code should guarantee validity instead
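The balance described on this slide can be sketched as a checked boundary and an unchecked hot path. The functions `load_edges` and `out_degrees` and the whitespace-separated edge format are hypothetical. External input is validated once with informative errors, and the inner loop trusts what it receives.

```python
def load_edges(rows, num_nodes):
    """Sanity-check external input once, at the boundary."""
    edges = []
    for line_no, row in enumerate(rows, start=1):
        try:
            u, v = (int(x) for x in row.split())
        except ValueError as exc:
            raise ValueError(f"line {line_no}: expected two integers, got {row!r}") from exc
        if not (0 <= u < num_nodes and 0 <= v < num_nodes):
            raise ValueError(f"line {line_no}: node id out of range in {row!r}")
        edges.append((u, v))
    return edges

def out_degrees(edges, num_nodes):
    """Hot path: trusts validated input, so no per-element checks."""
    deg = [0] * num_nodes
    for u, _ in edges:
        deg[u] += 1
    return deg

edges = load_edges(["0 1", "1 2"], num_nodes=3)
degrees = out_degrees(edges, num_nodes=3)
```

Including the line number and the offending row in the error message covers the "information on internal data" point at little cost, since it is only built when a check fails.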
  • 59. Auditing Code ● Check correctness from another angle ○ Rule based ■ Simply adds up the numbers to see if match ■ Use another algorithm, simpler, but does rough check ○ Data driven ■ Samples intermediate data from normal runs, issues alert when runtime data distribution is different
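Both audit styles can be sketched in a few lines. The functions `audit_conservation` and `audit_distribution`, the tolerance, and the three-sigma threshold are assumptions for illustration, and the distribution check is deliberately crude. The first verifies an invariant by adding up the numbers; the second compares runtime data with samples from normal runs.

```python
import statistics

def audit_conservation(inputs, outputs, tol=1e-9):
    """Rule-based audit: totals must match even if individual values were moved around."""
    return abs(sum(inputs) - sum(outputs)) <= tol

def audit_distribution(baseline, sample, max_sigma=3.0):
    """Data-driven audit: flag a sample whose mean drifts far from the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(sample) - mu) <= max_sigma * sigma

baseline = [10, 11, 9, 10, 10, 11, 9]   # intermediate values sampled from normal runs
ok_rule = audit_conservation([1, 2, 3], [3, 3])
bad_rule = audit_conservation([1, 2, 3], [3, 2])
ok_data = audit_distribution(baseline, [10, 10, 11])
bad_data = audit_distribution(baseline, [50, 60])
```

Neither check proves the result correct; each offers an independent angle that is cheap enough to run in production.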
  • 60. Debug Code ● As important as functional code! (if not more) ● Essentially a high level abstraction on code OUTPUT ○ Not just debug ○ A reversed tree structure, with samples on key nodes ○ Grows intelligently with field practice ● Maintenance effort should decrease over time ○ Error handling/messaging system should mature through time ○ Bugs should be fixed in the right direction, not just workaround