21. FLIP-O-RAMA with Python and HDFS/Hops
Left Hand Here
Right Thumb Here
[Slide diagram: ML pipeline stages: Data Collection, Feature Extraction, Data Transformation & Verification, Experimentation, Training, Test, Serving]
22. import pandas as pd
import hops.hdfs as hdfs

features = ["Age", "Occupation", "Sex", …, "Country"]
with open(hdfs.project_path() +
          "/TestJob/data/census/adult.data", "r") as trainFile:
    train_data = pd.read_csv(trainFile, names=features,
                             sep=r'\s*,\s*', engine='python',
                             na_values="?")
Data Ingestion with Pandas
23. import pandas as pd
import hops.hdfs as hdfs

features = ["Age", "Occupation", "Sex", …, "Country"]
h = hdfs.get_fs()
with h.open_file(hdfs.project_path() +
                 "/TestJob/data/census/adult.data", "r") as trainFile:
    train_data = pd.read_csv(trainFile, names=features,
                             sep=r'\s*,\s*', engine='python',
                             na_values="?")
Data Ingestion with Pandas on Hops
36. Summary
•The future of Deep Learning is Distributed
https://www.oreilly.com/ideas/distributed-tensorflow
•Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
*https://twitter.com/karpathy/status/972701240017633281
“It is starting to look like deep learning workflows of the future
feature autotuned architectures running with autotuned
compute schedules across arbitrary backends.”
Andrej Karpathy - Head of AI @ Tesla
AI is going distributed.
These words come from Rich Sutton, and that is significant because the pre-DeepMind reinforcement learning that he and others developed did not scale with the available compute. It was single-agent, single-threaded AI. But that has changed in the last few years, as we can see on this chart.
If we take AlexNet as the starting point of the deep learning revolution on hardware accelerators such as GPUs, then in the five years since we have seen a 300,000-fold increase in the compute required to train state-of-the-art deep learning models, most recently Google's Neural Machine Translation and DeepMind's AlphaGo Zero.
That is a doubling time of three and a half months. Compare that with Moore's Law, which states that the number of transistors on a chip doubles every 18 months. When Moore's Law hit the wall in the mid 2000s, the industry crossed that chasm by going multi-core.
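As a quick sanity check (my own arithmetic, not from the talk), the 300,000-fold figure and the 3.5-month doubling time are consistent with a roughly five-year span:

import math

# 300,000x compute growth expressed as doublings of compute:
doublings = math.log2(300_000)   # about 18.2 doublings
# At one doubling every 3.5 months:
months = doublings * 3.5         # about 64 months
print(doublings, months / 12)    # ~18.2 doublings over ~5.3 years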
This increase in compute over the last five years did not happen because Nvidia doubled the number of cores on its GPUs every 3.5 months. Deep learning also hit a wall, around 2016, and we crossed that chasm by going distributed. And yes, AlphaGo would not have beaten the world's best Go player if it had not been a distributed system.
DL years are like dog years, so the last five years is half a lifetime. Let's look at the last 12 months.
anthropomorphized by Yann LeCun and Jeff Dean
1 billion images with 1,500 hashtags from Instagram, trained over several weeks on 336 GPUs.
Better optimization algorithms: large-batch second-order methods like K-FAC.
Better regularization methods, like Dropout (see the sketch below).
The other three, which used to be the domain of ML experts, can now be the domain of AI itself. Throw compute and storage at them and they will do better than us.
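As a quick illustration (a generic Keras sketch, not code from the talk), dropout regularizes a network by randomly zeroing activations during training:

import tensorflow as tf

# A Dropout layer zeroes 50% of activations at training time
# (and rescales the rest), discouraging co-adaptation of units.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])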
The first phase of the battle begins with who can train DNNs fastest.
Facebook initiated the hostilities….
A friend tried to explain Paxos to me – it was painful for both of us.
How can it be slower this time? I haven’t changed anything!
What do you mean, service X has moved?
A WHAT caused my system to crash?!?
What do you mean the time isn’t the same at all servers?
Hey Spark, I just want to collect TBs at the Driver….what’s your problem with that?
Kubeflow – Data Scientists write infrastructure (Dockerfiles, Kubernetes YAML files) as well as ML/DL code
Code over Visual Programming
Languages: Python/R
ML Frameworks: scikit-learn, SparkML
Notebooks: Jupyter
DL Frameworks: Keras-TensorFlow/PyTorch
We have this dataset: the census data from the US.
The hdfs module we import will just give us a path to a project – not an HDFS path, just a "/Project/dotai" path.
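For example (a minimal sketch; the printed value is illustrative and depends on the Hopsworks project you run in):

import hops.hdfs as hdfs

# project_path() returns the root path of the current project,
# e.g. "/Project/dotai" (illustrative value, not a full HDFS URI).
print(hdfs.project_path())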
Flip-o-rama
Higher-Level APIs in TensorFlow
Estimator, Experiment, and Dataset
tensorflow.contrib.learn.python.learn.estimators
Tuner.run_experiment(..)
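For instance, the Estimator/Dataset pattern looks roughly like this (a hedged sketch with toy data; these contrib APIs later moved to tf.estimator and tf.data):

import tensorflow as tf

def input_fn():
    # A tf.data Dataset feeding (features, labels) batches to the Estimator.
    ds = tf.data.Dataset.from_tensor_slices(
        ({"x": [[1.0], [2.0], [3.0], [4.0]]}, [2.0, 4.0, 6.0, 8.0]))
    return ds.batch(2).repeat()

# A canned Estimator: the training loop and checkpointing are handled for you.
estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column("x")])
estimator.train(input_fn=input_fn, steps=100)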
Spark can launch many TensorFlow jobs in parallel for Hyperparameter Tuning
One experiment per Mapper
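Conceptually, that looks like the following (a plain PySpark sketch of the idea, not the exact Hops experiment API; the hyperparameters and model are toy examples):

from pyspark import SparkContext

sc = SparkContext(appName="hparam-search")

# One hyperparameter combination per Spark task: "one experiment per mapper".
search_space = [(lr, dropout) for lr in (0.001, 0.01, 0.1)
                              for dropout in (0.3, 0.5)]

def run_experiment(hparams):
    lr, dropout = hparams
    import tensorflow as tf  # each executor runs its own TensorFlow job
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    # model.fit(...) on real data here; return the metric to compare.
    return (lr, dropout)

results = sc.parallelize(search_space, len(search_space)) \
            .map(run_experiment).collect()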
In other words, the future of deep learning is “distributed”.
HopsFS is a drop-in replacement for HDFS that can scale the Hadoop Filesystem (HDFS) to over 1 million ops/s by migrating the NameNode metadata to an external scale-out in-memory database. This talk will introduce recent improvements in HopsFS: storing small files in the database (in both in-memory and on-SSD tables), a new scalable block-reporting protocol, support for erasure coding with data locality, and work on multi-data-center replication. For small files (under 64-128 KB), HopsFS can reduce read latency to under 10ms, while also improving read throughput by 3-4X and write throughput by more than 15X. Our new block-reporting protocol reduces block-reporting traffic by up to 99% for large clusters, at the cost of a small increase in metadata. Our solution for erasure coding is implemented at the block level, preserving data locality. Finally, our ongoing work on geographic replication points a way forward for HDFS in the cloud, providing data-center-level high availability without any performance hit.
Like Maslow's hierarchy of needs, the AI hierarchy of needs describes the stepwise requirements that organizations must satisfy to become data-driven and intelligent in their use of data. First, organizations must reliably ingest and manage their data assets with pipelines, Hadoop being the natural data platform for this first level. Only then can organizations progress to understanding their data in hindsight with analytics tools, scale-out databases, and visualization. From there, organizations transition to predictive analytics: labelling data, building traditional machine learning and deep learning models, and managing features and experiments. To reach the level of the hyperscale AI companies, organizations need to scale out their neural network training to tens or hundreds of GPUs and automate their machine learning processes. In this talk, we will traverse this AI hierarchy of needs for Hadoop and TensorFlow. We will start with support for GPUs in YARN, work our way up to distributing TensorFlow across many servers using parameter server (TensorFlowOnSpark) and AllReduce (Horovod) frameworks, and finish with a unified feature store built on HDFS, Hive, and Kafka.
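For reference, a minimal AllReduce training sketch with Horovod (my own example, assuming Horovod and TensorFlow are installed and the script is launched with something like horovodrun -np 4 python train.py):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to one local GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across workers with AllReduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(x_shard, y_shard, callbacks=callbacks)  # train on sharded data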