2. Lance Co Ting Keh
Machine Learning @ Box
Distributed ML Infrastructure
Go Blue Devils!
Shivnath Babu
Associate Professor @ Duke
Chief Scientist at Unravel Data Systems
R&D in Management of Data Systems
13. What can go wrong?
• Failures
  • My query failed after 6 hours!
  • What does this exception mean?
• Wrong results
  • Result of my job looks wrong
• Bad performance
  • My app is very slow
  • Pipeline is not meeting the 4hr SLA
• Poor scalability
  • Oh, but it worked on the dev cluster!
• Bad App(le)s
  • Tom's query brought the cluster down
• Application Problems
  • Poor choice of transformations
  • Ineffective caching
  • Bloated data structures
• Data/Storage Problems
  • Skewed data, load imbalance
  • Small files, poor data partitioning
• Spark Problems
  • Shuffle
  • Lazy evaluation causes confusion
• Resource Problems
  • Resource contention
  • Performance degradation
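To make "skewed data, load imbalance" concrete, here is a minimal sketch in plain Python (not Spark code). It assumes integer keys and a simple modulo partitioner, which is an illustrative stand-in and not Spark's actual partitioner. The names `partition_sizes` and `skew_ratio` and the sample keys are hypothetical.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per partition under modulo partitioning of integer keys.

    Assumption: key % num_partitions stands in for a real hash partitioner.
    """
    counts = Counter(k % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size.

    1.0 means perfectly balanced; larger values mean one task does
    disproportionately more work than the average task.
    """
    mean_size = sum(sizes) / len(sizes)
    return max(sizes) / mean_size if mean_size else 0.0


# Hypothetical data: one hot key (7) accounts for 90 of 100 records.
keys = [7] * 90 + list(range(10))
sizes = partition_sizes(keys, 4)
print(sizes, skew_ratio(sizes))
```

With this sample, one partition holds almost all of the records, so the stage would take roughly as long as that one task no matter how many executors are available.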
And Why?