Talk at Floor 28, Tel Aviv.
Infrastructure, tips to speed up training, hyperparameter optimization, model compilation, Amazon SageMaker Neo, cost optimization, Amazon Elastic Inference
1/ The new Amazon EC2 P3dn instance
2/ With four times the networking bandwidth and twice the GPU memory of the largest P3 instance, P3dn is ideal for large-scale distributed training. No one else has anything close.
3/ P3dn.24xlarge instances offer 96 vCPUs of Intel Skylake processors, reducing the time needed to preprocess data for machine learning training.
4/ The enhanced networking of the P3dn instance allows GPUs to be used more efficiently in multi-node configurations, so training jobs complete faster.
5/ Finally, the extra GPU memory allows developers to easily handle more advanced machine learning models, such as holding and processing multiple batches of 4K images for image classification and object detection systems.
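A minimal sketch of what a multi-node training job targeting these instances might look like with the SageMaker Python SDK. The script name, S3 bucket, and role ARN are illustrative placeholders, not details from the talk:

```python
# Hypothetical distributed training job configuration targeting P3dn.
# All names, paths, and the role ARN below are placeholders.
training_job = {
    "entry_point": "train.py",  # your training script
    "role": "arn:aws:iam::123456789012:role/SageMakerRole",
    "train_instance_count": 4,  # multi-node: 4 nodes x 8 V100 GPUs each
    "train_instance_type": "ml.p3dn.24xlarge",
}

# With the SageMaker SDK installed and credentials configured, this dict
# maps onto a framework estimator, e.g.:
#   from sagemaker.tensorflow import TensorFlow
#   estimator = TensorFlow(**training_job, framework_version="1.12", py_version="py3")
#   estimator.fit("s3://my-bucket/training-data/")
```

The 100 Gbps networking matters precisely in the multi-node case above, where gradient exchange between nodes is often the bottleneck.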
These performance and accuracy trade-offs are felt most acutely at the edge.
1/ IoT applications usually run on devices out in the real world, which means the accuracy of models is felt quickly and directly. Consumer IoT applications carry a high expectation of accuracy - such as Alexa detecting the wake word reliably - and the accuracy of that model really matters to the overall experience. In industrial IoT, devices are often responsible for monitoring and maintaining core manufacturing processes, or for safety. The accuracy of a model here is critical.
2/ Applications running on IoT devices at the edge are commonly very sensitive to latency; it's part of the reason why customers run the workload there in the first place - they can't afford the round trip to the cloud and back. So any increase in that latency can have a meaningful impact on the success of the device itself.
3/ IoT applications are often incredibly resource constrained, in a way which is much more acute than in the cloud. The devices are smaller, and have less memory and processing power, which is a real problem for machine learning models.
4/ In many cases, IoT applications need to run on very diverse hardware platforms, with a dizzying myriad of processor architectures. To get any sort of performance, developers have to optimize their models by hand for each target architecture.
5/ Finally, one of the key benefits of machine learning can get lost: the ability to continually improve the model. IoT applications are great data generators, and once that data is "ground-truthed", it can be used to build more sophisticated models. However, if the effort to optimize those improved models for the constraints and diverse hardware at the edge is high, then it's less likely to happen, and developers are leaving money on the table - a real missed opportunity.
6/ We don’t think that customers should have to choose between accuracy and performance. It’s a false choice, with a high cost.
So I’m excited to announce a new feature of SageMaker…
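To make the announcement concrete, here is a sketch of what compiling a model for an edge target with SageMaker Neo could look like via the `create_compilation_job` API. The job name, bucket paths, role ARN, and target device are illustrative assumptions:

```python
# Hypothetical SageMaker Neo compilation request. The job name, S3 paths,
# and role ARN are placeholders; the target device is just one example.
compilation_job = {
    "CompilationJobName": "resnet50-neo-demo",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "InputConfig": {
        "S3Uri": "s3://my-bucket/model/model.tar.gz",
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',  # input tensor shape
        "Framework": "MXNET",
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "jetson_tx2",  # compile once per edge target
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 900},
}

# With AWS credentials configured, this would be submitted via boto3, e.g.:
#   import boto3
#   boto3.client("sagemaker").create_compilation_job(**compilation_job)
```

The point of the sketch is the shape of the workflow: train once, then compile the same model artifact for each target device, instead of hand-optimizing per architecture.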