Architecting & Productionising Data Science Applications at Scale
Contents
1. Introduction & Motivation
2. Parallel Processing
3. Spark
4. Kafka
5. Scalable Streaming Machine Learning Approaches
6. Architecture
7. Summary & Conclusions
1. Introduction & Motivation
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#42ed5ede6f63
Constitutes 4% of actual job:
● Machine Learning Algorithms
● Machine Learning Libraries
● More Machine Learning Libraries
● Linear Algebra
● Statistics
● GPUs, HPCs
● Information Theory
● Complexity Theory
Interpretation + Wiring Together of:
● Data - created by: App developers, 3rd parties, Business Analysts, Data Engineers
● Algorithms - created by: Academics, Mathematicians
● Libraries - created by: Academics, big tech companies (Google, FB), open source community
● Hardware / Environments - created by: System Administrators, DevOps
● Production Quality Implementation - created by: Software Engineers, Data Engineers
Root Cause
What it really means today: a person who is worse at coding than any software engineer and worse at maths than any mathematician.
What it should mean: a software engineer who's good at maths, or a mathematician who's good at engineering.
Solution - How to put creativity into Data Science
Covered by this talk
● Be at the boundary where Data Science meets the real world:
○ Be part of generating the data with production quality applications
○ Be part of productionising the algorithms with production quality code
● Automate Everything
● Avoid the SQL & DBs mindset that creates the 79% of boring, painful work
Not covered
● Minimalist algorithmic design:
○ KISS - Keep It Simple, Stupid; YAGNI - You Aren't Gonna Need It
○ Probability Theory & Complexity Theory (NOT stats & linalg) as the principal foundations of all algorithms
[Architecture diagram] Real-world data flows through Kafka Connect (or similar) into Kafka; a Kafka (Streams) application performs ETL, feature extraction and ML prediction, and its output drives actions; Kafka also archives data to S3 (or Google Cloud Storage) in Parquet format. On the Labs side (EMR, Dataproc), transient / ephemeral clusters run Spark analytics with Zeppelin (or Jupyter) and ML training, and the resulting models are deployed back into the production path.
Automation - Turn the 96% into 0%
Simplicity--the art of maximizing the amount of work not done--is essential. - Agile Manifesto
Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away. -
Antoine de Saint-Exupery
Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of
courage to move in the opposite direction. - E. F. Schumacher
Everything should be made as simple as possible, but not simpler. - Einstein
Data science is about solving problems, not models or algorithms. - Data Science Manifesto
Aim to completely remove manual intervention in numerical processing. - Data Science Manifesto
2. Parallel Processing
Definition - Big Data
The input is too big to fit into memory on a single machine. - A Model of Computation for
MapReduce, http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf
[Image-only slides: the formal MapReduce model in terms of multisets. Following the cited paper, a mapper takes a single < Key, Value > pair and outputs a finite multiset of < Key, Value > pairs; a reducer takes a key together with the multiset of all values for that key and outputs a multiset of < Key, Value > pairs.]
● Different implementations of MapReduce differ in the following
○ How mappers are started and run in parallel (multithreading or multi-process)
○ How big the records are that are fed to mappers
○ The shuffle algorithm implementation
○ How reducers are started and run in parallel
○ How the shuffling feeds records to the reducers
○ How data is cached or stored in between stages
● This abstraction is necessary to understand how to write efficient code in a Big
Data context
● No real life implementation will exactly match the formal abstraction
● In Batch implementations like Spark & Hadoop the key algorithm design
considerations are
○ Cannot assume the data has any order (and best to assume no order within partitions)
○ Maximise usage of map-side reduce via monoids (aka map-side partial-aggregation); a minimal sketch follows this list
● In a Streaming context, like Kafka, the key algorithm design considerations are
○ Cannot assume the data has any global order (but can within partitions)
○ Maximise usage of partial-aggregations
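To make partial aggregation via monoids concrete, here is a minimal, framework-free sketch (the names are illustrative, not from any particular library): each partition is folded locally, and only the small per-partition partials need to cross the shuffle.

// A monoid: an associative combine operation with an identity element.
trait Monoid[A] {
  def zero: A
  def combine(x: A, y: A): A
}

object SumMonoid extends Monoid[Long] {
  def zero: Long = 0L
  def combine(x: Long, y: Long): Long = x + y
}

// Map-side partial aggregation: each partition is reduced locally,
// then only the per-partition partials are shuffled and combined.
def aggregate[A](partitions: Seq[Seq[A]], m: Monoid[A]): A = {
  val partials = partitions.map(_.foldLeft(m.zero)(m.combine)) // "map side"
  partials.foldLeft(m.zero)(m.combine)                         // "reduce side"
}

// aggregate(Seq(Seq(1L, 2L), Seq(3L, 4L)), SumMonoid) == 10L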
3. Spark
3.1 History of Hadoop MapReduce
Function | Hadoop Implementation Details

Mapper: An entire JVM that by default processes one block of HDFS or one file of S3. This means the number of mappers (and thus the resulting parallelism) is heavily determined by how you store your data. You can increase/decrease the amount of data processed by a mapper, but not always. Sometimes there are not enough CPUs available on a node to process all its data; Hadoop is clever and will use "data locality" to process data near (e.g. same rack) where it is stored.

Reducer: Similarly, an entire JVM will process many reducers (although the literature just calls this JVM a reducer (singular)). The number of reducers is chosen explicitly.

Shuffle: Hash Shuffle: for a given < Key, Multiset[Value] > from the shuffle phase, the implementation selects a reducer based on an integer hash modulo num-reducers of the Key, then feeds the Values to the reducer as a stream. The JVM has a HashMap to keep track of each reducer. This is memory intensive.

Sort Shuffle: (often the modern default) here the shuffle algorithm sorts the data in transit to the reducers (this can be done efficiently thanks to repeated application of Merge Sort). Now the reducers can run in sequence in the reducer JVM, with no need for a HashMap. This algorithm uses less memory, but can be slow when many distinct keys exist.
Function | Hadoop Implementation Details

Map Reduce Job: A key feature of Hadoop is that all the phases, Map, Shuffle and Reduce, can execute simultaneously. So as mappers output data, that output is simultaneously shuffled and fed into reducers, which write that data out.

Map Reduce Program: A key feature of Hadoop is that mappers (nearly) always read from a filesystem, and reducers (nearly) always write to a filesystem.
3.2 Drawbacks of Hadoop MapReduce
● Since each mapper/reducer uses an entire JVM, this can result in inefficient use of
memory. Each JVM cannot share memory, so any memory it is not using cannot be used
by other JVMs.
● Furthermore, JVMs cannot share common data, so if mappers use, say, a big Dictionary to perform their function, that Dictionary must be duplicated across every JVM. Historical workarounds for this involve in-memory databases.
● Chaining MapReduce Stages together to form a MapReduce Program results in many
unnecessary reads/writes from disk.
● The original API was a low-level Java API. Consequently code was quite verbose and difficult to unit test. High-level frameworks were built on top, the best being Scalding (which sat on Cascading), the worst being Hive or Pig.
● Since every mapper/reducer starts a JVM, and managing these JVMs is complicated,
Hadoop has a high latency (typically 10+ seconds).
3.2.1 Hive - The Worst Invention used in Data Science
● Hive is an SQL-like DSL for generating Hadoop MapReduce jobs
● It is batch oriented
● SQL written for PostGres, Oracle or Teradata rarely executes in the same way
on a Hadoop cluster. Consequently it is very slow and horrible to debug.
● Big Data’s central premises are
○ Unstructured / semi structured data; Key-Value Stores
○ Schema-on-read
○ NoSQL
● Hive is exactly opposite to the central premises of Big Data
○ Structured data, Tabular
○ Schema-on-write (hive metastore)
○ SQL
● Hive, and its associated SQL-world mindset, is the main reason 79% of the work of a Data Scientist is boring, painful and unnecessary
3.3 Spark MapReduce Implementation
Function | Spark Implementation Details

Mapper: A Spark Task processes a mapper; each task has a single thread, and multiple tasks (threads) can execute in a single JVM, called an Executor. Similarly to Hadoop, the default number of tasks is determined heavily by the format of the input data (number of files, type of compression, etc). This is because Spark reuses Hadoop's underlying filesystem APIs.

Reducer: Similarly to a mapper, reducers are tasks running in an executor. Again the number of reducers must be chosen wisely. Output of the shuffle phase is fed to the reducers as in the Hadoop implementation (although the API differs greatly, as keys are often implicit in Spark).

Map Reduce Job: In Spark, only the Shuffle and Reduce phases can execute simultaneously, so they must wait for the Map phase to complete before they start.

Map Reduce Program: Spark can chain multiple MapReduce stages together without writing to disk by keeping datasets in memory.
3.4 Spark Benefits
● More efficient allocation of memory thanks to tasks sharing a single JVM
● JVM management simpler and faster, so latency only a couple of seconds
● Using a SparkContext we can keep the executor JVMs running, this means in
some situations the latency can be less than 1 second
● Shared JVM means we need not copy large data structures, we can keep a
single copy per node - this is called a BroadcastVariable
● Since the overhead of chaining jobs together is drastically cut by keeping
datasets in memory, many algorithms run orders of magnitude faster than
Hadoop (e.g. Logistic Regression)
● The RDD API for Spark is very easy to use and very concise.
● The Dataset API when combined with Parquet compression can result in very
efficient applications
3.5 Spark Drawbacks
● API - exceptions rarely correspond to your code, often the only debugging
approach is binary chop.
● Only the lower level RDD API is approximately functional, the Dataframe and
Dataset APIs are really bad from a design and functional point of view.
● Spark needs to keep its entire Map Phase output somewhere (in memory, or serialised to disk), as it will not start the Shuffle Phase until the Map Phase is complete. This means there are some circumstances (extremely huge data) where Hadoop can execute a job while Spark cannot.
● Spark is generally less stable than Hadoop
● (Currently) the number of Map tasks is not as flexible as in Hadoop. In particular there is no efficient way to get more tasks than there are input blocks, or fewer tasks than there are files.
3.6 Spark Basics
● Spark is best run on a cluster one job in, one job out, i.e. one Spark job at a time that uses all the resources. For example, 5 jobs run in sequence will finish faster than 5 jobs run in parallel.
● So the number of executors should be 1 per node. E.g. 100 nodes means 100 executors. There are very rare exceptions to this.
● The number of tasks should be set so that all CPUs are being used, therefore 2 - 4 tasks per CPU. E.g. 16 cores per node and 100 nodes means at least 2 * 16 * 100 = 3,200 tasks (see the configuration sketch below).
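As a rough illustration only (exact memory and overhead settings depend on the instance type and cluster manager), the rule of thumb above might translate into Spark configuration along these lines for a hypothetical 100-node, 16-core-per-node cluster:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sizing: one executor per node, all cores per executor,
// and ~2 tasks per CPU of default parallelism (100 * 16 * 2 = 3,200).
val conf = new SparkConf()
  .set("spark.executor.instances", "100")
  .set("spark.executor.cores", "16")
  .set("spark.default.parallelism", "3200")
  .set("spark.sql.shuffle.partitions", "3200")

val spark = SparkSession.builder().appName("sizing-example").config(conf).getOrCreate()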
3.7 Example Spark Code - Averages
Can be done with Datasets / Dataframes. (The code itself appears as an image on the slide; a minimal RDD sketch follows.)
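The slide's code is not reproduced here, so the following is only a sketch of the usual RDD pattern for per-key averages: carry a (sum, count) pair, which combines associatively and therefore aggregates map-side via reduceByKey.

import org.apache.spark.rdd.RDD

// Per-key average via a (sum, count) pair; reduceByKey combines partials map-side.
def averages(data: RDD[(String, Double)]): RDD[(String, Double)] =
  data
    .mapValues(v => (v, 1L))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }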
3.8 Example Spark Code - Ranges
// Assumes Scalaz's Monoid type class and its |+| append syntax.
case class MaxMin(max: Int, min: Int)

object MaxMin {
  def apply(pair: (Int, Int)): MaxMin = MaxMin(pair._1, pair._2)
  def apply(i: Int): MaxMin = MaxMin(i, i)

  implicit object MaxMinMonoid extends Monoid[MaxMin] {
    // Identity element: combining with it leaves the other value unchanged
    def zero: MaxMin = MaxMin(Int.MinValue, Int.MaxValue)
    // Associative combine: element-wise max of maxes, min of mins
    def append(mm1: MaxMin, mm2: => MaxMin): MaxMin = MaxMin(
      if (mm1.max > mm2.max) mm1.max else mm2.max,
      if (mm1.min < mm2.min) mm1.min else mm2.min
    )
  }
}

// Naive version: groupByKey materialises all values per key before aggregating
.groupByKey().mapValues(values => values.max - values.min)

// Monoid version: partial aggregates are combined map-side via reduceByKey
.mapValues(MaxMin.apply).reduceByKey(_ |+| _)
.mapValues(mm => mm.max - mm.min)

Can NOT be done with Datasets / Dataframes
3.9 Joins in Spark & Hadoop
Note: Joining streams of data is the bread and butter of an IOT ingestion platform.
● Joins effectively work by treating both input datasets A and B as a single
dataset A ++ B. The shuffle algorithm (Sort or Hash) effectively performs the
bulk logic of the join. Differences between left, right & outer are very subtle.
● In Spark we can perform a broadcast join, which means copying one entire table into memory for every executor (see the sketch after this list).
● A natural implementation of a scheduled (e.g. daily) join will shuffle all the data.
Therefore
○ Using Spark or Hadoop to join data as part of a scheduled pipeline is computationally expensive
○ One often must engineer an alternative solution, like using a database, e.g. Cassandra, or as we
will see Kafka.
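A minimal sketch of a broadcast (map-side) join over RDDs, assuming the small side fits in memory as a Map; the large RDD is never shuffled.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Broadcast join: copy the small table once per executor and join map-side.
def broadcastJoin(sc: SparkContext,
                  large: RDD[(String, String)],
                  small: Map[String, String]): RDD[(String, (String, String))] = {
  val smallBc = sc.broadcast(small)
  large.flatMap { case (k, v) =>
    smallBc.value.get(k).map(w => (k, (v, w))) // inner-join semantics
  }
}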
3.10 Joins & Timeseries in Cassandra and Databases
● Cassandra is the ideal database for storing timeseries data
● The operational cost of a Big Data Database, like Cassandra, is huge. These
databases often require 2 - 3 DevOps engineers to maintain them.
● Modifying Cassandra Schemas, or Data Models, requires significant
engineering effort
● Schemas and Data Models require significant meta-data or data-dictionaries to
make sense of the data
● Joins must be materialised, even when they are only used for downstream
aggregations
3.11 Top 10 Spark Mistakes
1. Using SparkSQL (consider RedShift, Snowflake, PostGres, Oracle, Teradata, Google Big Query)
2. Using SparkStreaming (a hammer looking for a nail; consider Kafka, Samza, Akka, Flink, Storm)
3. Using MLLib (when a vertically scaled lib will do)
4. Using way too many executors
5. Running many jobs at the same time, rather than simply running one job at a time with all the resources
6. Writing out a single file (or too few)
7. Writing out too many files
8. Using anything other than Parquet for file format
9. Using pseudo-SQL (i.e. Dataframes or the weird Datasets Expr syntax). Should instead use aggregate
with a custom aggregator, or RDD. The nasty Expr syntax is horribly unfunctional, and thus impossible
to unit test.
10. Not using all resources
11. Bonus: Not considering rolling out one's own serialisation
4. Kafka
4.1 Definition - Streaming Platform
1. It lets you publish and subscribe to streams of records. In this respect it is
similar to a message queue or enterprise messaging system.
2. It lets you store streams of records in a fault-tolerant way.
3. It lets you process streams of records as they occur.
Kafka Streams satisfies 3, while SparkStreaming does not - SparkStreaming is a
misnomer.
4.2 Definition - Kafka Topic
● A Kafka Topic is a partitioned sequence; records are assigned to partitions by key, or round-robin
● A Topic Partition supports the following logical operations (sketched after this list):
○ Append(block: List[(K, V)]) - O(BlockSize + C_a)
○ Read(offset: Long): List[(K, V)] - O(BlockSize + C_r)
● 1 to many consumers, 1 to many producers
● Unlike a queue many consumers can read from a single topic such that every
record is read by every consumer
● Kafka R/W performance is O(1) in the size of the topic. So storing infinitely is
not a problem (though it is obviously O(N) in space).
● Kafka Topics technically support random read, but reading sequential blocks is
faster
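Purely as an illustration of the logical operations above (this is not Kafka's actual client API), a topic partition can be modelled as:

// Illustrative model only - not the real Kafka client API.
// Both operations cost O(BlockSize) plus a constant, independent of topic size.
trait LogicalTopicPartition[K, V] {
  def append(block: List[(K, V)]): Unit   // O(BlockSize + C_a)
  def read(offset: Long): List[(K, V)]    // O(BlockSize + C_r)
}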
4.3 Producer API
val props = new Properties();
props.put("bootstrap.servers", "localhost:9092");  // illustrative; key/value serializers also required
// Similar to Nagle's Algorithm:
// the producer will send either by batch size or by time (whichever comes first)
props.put("batch.size", 16384);
props.put("linger.ms", 1);
val producer: Producer[KeyType, ValueType] = new KafkaProducer(props);

In a nutshell, it has a single async method:
send(record: ProducerRecord[KeyType, ValueType]): Future[RecordMetadata]
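A usage sketch, assuming String keys and values and a hypothetical topic name:

import org.apache.kafka.clients.producer.ProducerRecord

// send() is asynchronous: records are buffered and sent by background threads.
val record = new ProducerRecord[String, String]("sensor-readings", "device-42", """{"temp":21.3}""")
producer.send(record)           // fire-and-forget
// producer.send(record).get()  // or block on the Future for the broker acknowledgement
producer.flush()                // force out anything still buffered, e.g. before shutdown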
4.4 Producer Notes
● There are other methods for atomic transactions & idempotency (i.e. sending
messages to multiple topics atomically).
● Sends happen implicitly by background threads controlled by the library
● Multithreading of preprocessing is controlled by the application
● The buffer.memory controls the total amount of memory available to the
producer for buffering. When the buffer space is exhausted additional send
calls will block.
4.5 Consumer API (Auto Commit)
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
val consumer: KafkaConsumer[KeyType, ValueType] = new KafkaConsumer(props);
consumer.subscribe(List("topic-A", "topic-B"))
In a nutshell has a single async method
poll(timeout: Long): ConsumerRecords[KeyType, ValueType]
4.6 Consumer API (Manual Commit)
props.put("enable.auto.commit", "false");
// … your logic will include
consumer.commitSync();
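A minimal sketch of the usual consume-process-commit loop under manual commits (the process step is a hypothetical placeholder for your own logic):

import scala.collection.JavaConverters._

while (true) {
  // poll blocks for up to the timeout and returns a (possibly empty) batch of records
  val records = consumer.poll(100)
  for (record <- records.asScala) {
    process(record.key(), record.value())  // hypothetical application logic
  }
  // commit only after the whole batch has been processed => at-least-once semantics
  consumer.commitSync()
}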
4.7 Consumer Notes
● Offset Commits store consumer offsets in Kafka, essentially marking where the
consumer is.
● Offsets can be reset, so topics can be replayed easily (see the sketch after this list)
● If an application consumes records but fails just before it manages to commit its offset to Kafka, then when the application is restarted it will re-consume those records. This is called an "at-least-once" delivery guarantee.
● Consumers label themselves with consumer group name, and each record
published to a topic is delivered to one consumer instance within each
subscribing consumer group. This allows for load balancing/parallelism within a
"logical subscriber".
● So if there is a topic with four partitions, and a consumer group with two
processes, each process would consume from two partitions.
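A sketch of replaying a topic from the beginning by resetting offsets (recent client versions; the topic name and partition count are illustrative, and assign is used instead of subscribe so the offsets are controlled explicitly):

import scala.collection.JavaConverters._
import org.apache.kafka.common.TopicPartition

// Rewind all four partitions of "events" to the start and re-consume from there.
val partitions = (0 until 4).map(p => new TopicPartition("events", p)).asJava
consumer.assign(partitions)
consumer.seekToBeginning(partitions)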
4.8 Guarantees
● Messages sent by a producer to a particular topic partition will be appended in
the order they are sent. That is, if a record M1 is sent by the same producer as a
record M2, and M1 is sent first, then M1 will have a lower offset than M2 and
appear earlier in the log.
● A consumer instance sees records in the order they are stored in the log.
● For a topic with replication factor N, we will tolerate up to N-1 server failures
without losing any records committed to the log.
4.9 Kafka Streams
● A high level API very similar to Spark for processing (streaming) data in parallel
● Each stream partition is a totally ordered sequence of data records and maps to
a Kafka topic partition.
● A data record in the stream maps to a Kafka message from that topic.
● The keys of data records determine the partitioning of data in both Kafka and
Kafka Streams, i.e., how data is routed to specific partitions within topics.
● An application instance sets the number of stream threads
● Each stream thread can process multiple stream tasks, where a stream task is a
logical unit of parallelism - it is assigned a collection of partitions corresponding
to the source topics.
● A stream task may execute a complex processor topology that can have
multiple source topics and multiple sink topics
4.10 Kafka Streams API
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // required (illustrative value)
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // required (illustrative value)
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
// default key/value serdes (e.g. String serdes) must also be configured
StreamsConfig config = new StreamsConfig(props);

StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("my-input-topic")
       .mapValues(value -> String.valueOf(value.length()))
       .to("my-output-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
4.11 Kafka Streams “MapReduce” Implementation
Kafka Streams is not usually described as a MapReduce framework, and it is certainly more general, but at a high level it still helps to view it through the MapReduce abstraction, at least so we can compare it to Spark / Hadoop.

Function | Kafka Streams Implementation Details

Mapper: A consumer plus processing logic. You cannot have more consumers than partitions of the input topic. Processing logic is entirely controlled by the user, and thus threading here can be more complicated and more finely tuned than in the Hadoop and Spark worlds.

Shuffle: Effectively there is no Reduce Phase, or rather the Reducer always just flattens the Key -> Multiset[Value] back out into Multiset[Key -> Value]. This is emulated by a producer producing to a partitioned Kafka topic. A Shuffle-Reduce phase can output to multiple output datasets (topics), unlike Spark or Hadoop, which naturally only output a single dataset.

Map Reduce Job & Program: A Map Reduce program in Kafka Streams constitutes a lot of Map-Shuffle-Map-Shuffle phases. In Kafka Streams, all of the phases in all of the jobs can run at the same time. Furthermore they share the same processes and threads. This is possible because it is entirely event driven. Sometimes you may have a disconnected topology that makes sense to run on separate clusters.
4.12 Joins in Kafka Streams
● Kafka has much greater flexibility in how a join is performed. In a nutshell the two options for full (non-windowed) joins consist of:
○ (Shuffle-like) Co-partitioning input topics A and B
○ (Broadcast-like) Using a GlobalKTable
● In the Shuffle-like option, Kafka will keep local key-value caches within stream
tasks corresponding to the partitions for that stream task. Kafka will use either
RocksDB (SSD based DB with in memory caches) or a HashMap.
● In the Broadcast-like option, one of the entire input topics is kept as a key-value
cache within every stream task.
● Note that in a Kafka join it is always possible to engineer it such that each record is shuffled only once (although multiple state lookups will occur); a sketch of the GlobalKTable option follows this list
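A minimal sketch of the broadcast-like GlobalKTable option, assuming illustrative topic names, String keys/values, and default String serdes:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{GlobalKTable, KStream, KeyValueMapper, ValueJoiner}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app")     // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // illustrative
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
val events: KStream[String, String] = builder.stream[String, String]("events")
val users: GlobalKTable[String, String] = builder.globalTable[String, String]("users")

// The GlobalKTable is fully replicated to every instance, so the event stream is
// never re-partitioned - each record is enriched via a local lookup.
val enriched: KStream[String, String] = events.join(
  users,
  new KeyValueMapper[String, String, String] {
    def apply(key: String, value: String): String = key  // look up the user by the event key
  },
  new ValueJoiner[String, String, String] {
    def apply(event: String, user: String): String = s"$event|$user"
  }
)
enriched.to("enriched-events")

new KafkaStreams(builder.build(), props).start()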
4.13 Benefits of Kafka
● Perfect for storing timeseries and event driven data
● Offers a stream-first, key-value-store-first philosophy which is highly conducive to application and algorithm development
● Can handle real time data
● Considerably more flexible parallelism model than Spark & Hadoop
● Shuffling operations (e.g.) Joins and GroupBys need not be performed over and
over again in order to get the latest view of information
● Errors are easier to debug since Kafka is essentially a library, not a framework.
The logic executes inside your application, not within someone else's
application. The logs and stack traces make sense.
4.14 Drawbacks of Kafka
● (or Benefit?) Usually requires the user has a more detailed understanding of the
fundamental building blocks (e.g. consumers, producers) when compared to just
getting a Spark application to “work” (e.g. tasks, executors)
● (or Benefit?) Requires user to become more intimate with the DevOps of the
application, since the parallelism model is far more explicit
● Currently the only easy(ish) way to get up and running quickly is with Confluent Cloud (whereas Spark has EMR, Google Dataproc, etc)
● Kafka does not yet have some high level DSLs that could allow for easier
DevOps
● (or Benefit? Here lies fun) Kafka does not yet have a (openly available) Machine
Learning library
5. Scalable Streaming Machine Learning Approaches
Premature Questions
5.1 Critical Questions to Ask First
● Do you have high quality training data?
● What latency between train and predict do you really need? I.e. do you really
need real time training? What is the business case, what analysis has been done
to prove that real time will really earn more profit?
● How much data do you need to process after ETL/Cleaning? I.e. do you really
need to use parallel processing?
5.2 Streaming ML Approaches
Incremental Algorithms: There are incremental versions of Support Vector
Machines, Bayesian Networks and Neural networks. Bayesian Networks can easily
be designed to run in parallel too.
Periodic Re-training with a batch algorithm: We simply buffer the relevant data
and retrain our model “every so often”.
https://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
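A minimal, framework-free sketch of the periodic re-training approach; Example, Model and the train function are hypothetical placeholders for your own batch algorithm:

import scala.collection.mutable.ArrayBuffer

case class Example(features: Vector[Double], label: Int)
trait Model { def predict(features: Vector[Double]): Int }

// Buffer incoming labelled data and retrain a batch model "every so often".
class PeriodicTrainer(train: Seq[Example] => Model, retrainEvery: Int) {
  private val buffer = ArrayBuffer.empty[Example]
  @volatile private var model: Option[Model] = None

  def observe(example: Example): Unit = {
    buffer += example
    if (buffer.size % retrainEvery == 0)
      model = Some(train(buffer.toList))  // retrain on everything buffered so far
  }

  def predict(features: Vector[Double]): Option[Int] =
    model.map(_.predict(features))        // None until the first model has been trained
}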
5.3 Bayesian Networks are Awesome
● Work well for most business problems, fintech, adtech, retail, marketing. Only
use cases not really covered are rare in most enterprises (e.g. image, sound
recognition)
● Machine Learning rarely ever needs regression, since automated actions are
binary or categorical (e.g. send marketing email, or not)
● Consider discretizing continuous variables via Information Theory (Kullback-Leibler divergence, Shannon Entropy); see the entropy sketch after this list
● Bayesian Networks are transparent since they are just a direct application of
Probability Theory and Information Theory. Therefore they are easy to
understand, maintain, and cannot overfit.
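As one concrete building block for entropy-based discretisation, here is a Shannon entropy function over bin counts (a sketch only; cut points would then be chosen to minimise the weighted entropy of the resulting bins):

// Shannon entropy (in bits) of a discrete distribution given as counts per bin.
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * (math.log(p) / math.log(2))
  }.sum
}

// entropy(Seq(50, 50)) == 1.0 bit; entropy(Seq(100, 0)) == 0.0 bits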
6. Streaming Machine Learning Architecture
[Architecture diagram, as in the introduction] Real-world data flows through Kafka Connect (or similar) into Kafka; a Kafka (Streams) application performs ETL, feature extraction and ML prediction, and its output drives actions; Kafka also archives data to S3 (or Google Cloud Storage) in Parquet format. On the Labs side (EMR, Dataproc), transient / ephemeral clusters run Spark analytics with Zeppelin (or Jupyter) and ML training, and the resulting models are deployed back into the production path.

[Extended architecture diagram] The same pipeline, additionally exposing an API to other applications and feeding other applications & analytics.
7. Summary & Conclusions
● A badly engineered, badly designed and badly written ingestion & ETL
framework will result in low quality data and metadata.
● Most Data Science and Big Data applications do not need a database, nor even a
Hive cluster.
● If you as a Data Scientist do not get involved in the ingestion, ETL, engineering,
software development and architecture of a system, you will inevitably find
yourself spending your time on “99% preparation, 1% misinterpretation”
● In upcoming years Kafka and Kafka Streams will become the accepted industry
standard for building real time (and even batch) data driven applications
● In upcoming years significant development will be seen in writing Machine
Learning libraries that integrate easily with Kafka Streams
Exercises
These exercises are intentionally open ended to allow potentially hours of fun for
each.
1. (1 - 2 hours) Write down in (reasonably) formal mathematical notation the
mappers and reducers for a sorting algorithm.
2. (1 - 4 hours) For the Spark Code Examples 3.7 & 3.8, try to derive complexity formulae for the approaches; you can assume your favourite distributions on the keys and values.
3. (1 - 10 days) Similarly, try randomly generating some data according to your favourite distributions, producing a fully working Spark app and executing the code on an EMR cluster. Compare the times, and plot how the times differ as the input data sizes grow (or even use 3D plots to see how the parameters of your distribution affect the times too).
Weitere ähnliche Inhalte

Was ist angesagt?

Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Daniel Abadi
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...Daniel Abadi
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesDaniel Abadi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 

Was ist angesagt? (20)

Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Hadoop technology doc
Hadoop technology docHadoop technology doc
Hadoop technology doc
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 

Ähnlich wie Architecting and productionising data science applications at scale

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 

Ähnlich wie Architecting and productionising data science applications at scale (20)

Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Spark
SparkSpark
Spark
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 

Kürzlich hochgeladen

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Kürzlich hochgeladen (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Architecting and productionising data science applications at scale

  • 2. Contents 1. Introduction & Motivation 2. Parallel Processing 3. Spark 4. Kafka 5. Scalable Streaming Machine Learning Approaches 6. Architecture 7. Summary & Conclusions
  • 3. 1. Introduction & Motivation
  • 4.
  • 6. Constitutes 4% of actual job: ● Machine Learning Algorithms ● Machine Learning Libraries ● More Machine Learning Libraries ● Linear Algebra ● Statistics ● GPUs, HPCs ● Information Theory ● Complexity Theory
  • 7. Interpretation + Wiring Together of: Created By Data App developers, 3rd parties, Business Analysts, Data Engineers Algorithms Academics, Mathematicians Libraries Academics, big tech companies (Google, FB), open source community Hardware / Environments System Administrators, DevOps Production Quality Implementation Software Engineers, Data Engineers
  • 8. Root Cause This really means? Person who is worse at coding than any software engineer and worse at maths than any mathematician Should mean: Software engineer who’s good at maths, or a mathematician who’s good at engineering
  • 9. Solution - How to put creativity into Data Science Covered by this talk ● Be at the boundary where Data Science meets the real world: ○ Be part of generating the data with production quality applications ○ Be part of productionising the algorithms with production quality code ● Automate Everything ● Avoid SQL & DBs mindset that creates the 79% of boring painful work Not covered ● Minimalist algorithmic design: ○ KISS - Keep it Simple Stupid, YAGNI - You Aren’t Gona Need It ○ Probability Theory & Complexity Theory (NOT stats & linalg) as the principle foundations of all algorithms
  • 10. Real World Production Kafka Connect, Or similar Kafka (Streams) Application ETL Feature Extraction ML Prediction Kafka S3, (Google Cloud Storage). Parquet Format Actions Labs (EMR, Dataproc) Spark Analytics Cluster. Transient / Ephemeral Zeppelin (or jupyter) ML Training Cluster. Transient / Ephemeral Deployments
  • 11. Automation - Turn the 96% into 0% Simplicity--the art of maximizing the amount of work not done--is essential. - Agile Manifesto Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupery Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. - E. F. Schumacher Everything should be made as simple as possible, but not simpler. - Einstein Data science is about solving problems, not models or algorithms. - Data Science Manifesto Aim to completely remove manual intervention in numerical processing. - Data Science Manifesto
  • 12. 2. Parallel Processing Definition - Big Data The input is too big to fit into memory on a single machine. - A Model of Computation for MapReduce, http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 19. ( )
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29. ● Different implementations of MapReduce differ in the following ○ How mappers are started and run in parallel (multithreading or multi-process) ○ How big the records are that are fed to mappers ○ The shuffle algorithm implementation ○ How reducers are started and run in parallel ○ How the shuffling feeds records to the reducers ○ How data is cached or stored in between stages ● This abstraction is necessary to understand how to write efficient code in a Big Data context ● No real life implementation will exactly match the formal abstraction ● In Batch implementations like Spark & Hadoop the key algorithm design considerations are ○ Cannot assume the data has any order (and best to assume no order within partitions) ○ Maximise usage of map-side reduce via monoids (aka map-side partial-aggregation) ● In a Streaming context, like Kafka, the key algorithm design considerations are ○ Cannot assume the data has any global order (but can within partitions) ○ Maximise usage of partial-aggregations
  • 30.
  • 31.
  • 32.
  • 34. 3.1 History of Hadoop MapReduce Function Hadoop Implementation Details Mapper An entire JVM that by default processes one block of HDFS or one file of s3. This means the number of mappers (and thus the resulting parallelism) is heavily determined by how you store your data. You can increase/decrease the amount of data processed by a mapper, but not always. Sometimes there are not enough CPUs available on a node to process all it’s data, Hadoop is clever and will use “data locality” to process data near (e.g. same rack) where it is stored Reducer Similarly an entire JVM will process many reducers (although the literature just calls this JVM a reducer (singular)). The number of reducers is chosen explicitly. Shuffle Hash Shuffle: for a given < Key, Multiset[Value] > from the shuffle phase, the implementation selects a reducer based on an integer hash modulo num-reducers of the Key, then feeds the Values to the reducer as a stream. The JVM has a HashMap to keep track of each reducer. This is memory intensive. Sort Shuffle: (often the modern default) here the shuffle algorithm sorts the data in transit to the reducers (this can be done efficiently thanks to repeated application of Merge Sort). Now the reducers can run in sequence in the reducer JVM, no need for a HashMap. This algorithm uses less memory, but when many distinct keys exist can be slow.
  • 35. Function Hadoop Implementation Details Map Reduce Job A key feature of Hadoop is that all the phases, Map, Shuffle and Reduce, can execute simultaneously. So as mappers output data, that output is simultaneously shuffled, and fed into reducers, which write that data out. Map Reduce Program A key feature of Hadoop is that mappers (nearly) always read from a filesystem, and reducers (nearly) always write to a filesystem.
  • 36.
  • 37. 3.2 Drawbacks of Hadoop MapReduce ● Since each mapper/reducer uses an entire JVM, this can result in inefficient use of memory. Each JVM cannot share memory, so any memory it is not using cannot be used by other JVMs. ● Furthermore JVMs cannot share common data, so if mappers use, say a big Dictionary to perform its function, that Dictionary must be duplicated across every JVM. Historical work around to this involve in memory databases. ● Chaining MapReduce Stages together to form a MapReduce Program results in many unnecessary reads/writes from disk. ● The original API was a low level Java API. Consequently code was quite verbose and difficult to write unit tests. High level frameworks were built on top, the best being Scalding (which sat on Cascading), the worst being Hive or Pig. ● Since every mapper/reducer starts a JVM, and managing these JVMs is complicated, Hadoop has a high latency (typically 10+ seconds).
  • 38. 3.2.1 Hive - The Worst Invention used in Data Science ● Hive is an SQL-like DSL for generating Hadoop MapReduce jobs ● It is batch oriented ● SQL written for PostGres, Oracle or Teradata rarely executes in the same way on a Hadoop cluster. Consequently it is very slow and horrible to debug. ● Big Data’s central premises are ○ Unstructured / semi structured data; Key-Value Stores ○ Schema-on-read ○ NoSQL ● Hive is exactly opposite to the central premises of Big Data ○ Structured data, Tabular ○ Schema-on-write (hive metastore) ○ SQL ● Hive, and it’s associated SQL-world mindset is the main reason 79% of the work of a Data Scientist is boring, painful and unnecessary
  • 39. 3.3 Spark MapReduce Implementation Function Spark Implementation Details Mapper A Spark Task processes a mapper, each task has a single thread, and multiple tasks (threads) can execute in a single JVM, called an Executor. Similarly to Hadoop, the default number of tasks is determined heavily by the format of the input data (number of files, type of compression, etc). This is because Spark reuses Hadoops underlying filesystem APIs. Reducer Similarly to a mapper reducers are tasks running in an executor. Again the number of reducers must be chosen wisely. Output of the shuffle phase is fed to the reducers like in the Hadoop implementation (although the API differs greatly, as keys are often implicit in Spark). Map Reduce Job In Spark, only the Shuffle and Reduce phases can execute simultaneously, so they must wait for the Map phase to complete before they start. Map Reduce Program Spark can chain multiple MapReduce stages together without writing to disk by keeping datasets in memory.
  • 40.
  • 41. 3.4 Spark Benefits ● More efficient allocation of memory thanks to tasks sharing a single JVM ● JVM management simpler and faster, so latency only a couple of seconds ● Using a SparkContext we can keep the executor JVMs running, this means in some situations the latency can be less than 1 second ● Shared JVM means we need not copy large data structures, we can keep a single copy per node - this is called a BroadcastVariable ● Since the overhead of chaining jobs together is drastically cut by keeping datasets in memory, many algorithms run orders of magnitude faster than Hadoop (e.g. Logistic Regression) ● The RDD API for Spark is very easy to use and very concise. ● The Dataset API when combined with Parquet compression can result in very efficient applications
  • 42. 3.5 Spark Drawbacks ● API - exceptions rarely correspond to your code, often the only debugging approach is binary chop. ● Only the lower level RDD API is approximately functional, the Dataframe and Dataset APIs are really bad from a design and functional point of view. ● Spark needs to keep its entire Map Phase somewhere (in memory, or serialised to disk) as it will not start the Shuffle Phase until it’s complete. This means there are some circumstances (extremely huge data) where Hadoop can execute a job while Spark cannot. ● Spark is generally less stable than Hadoop ● (Currently) the number of Map tasks is not as flexible as Hadoop. In particular there is no efficient way to get more tasks than there are input blocks, or less tasks than there are files.
  • 43. 3.6 Spark Basics ● Spark is best run on a cluster "1 in, 1 out", i.e. one Spark job at a time that uses all the resources. For example, 5 jobs run in sequence will finish faster than 5 jobs run in parallel. ● So the number of Executors should be 1 per node, e.g. 100 nodes means 100 executors. There are very rare exceptions to this. ● The number of Tasks should be such that all CPUs are being used, therefore 2 - 4 tasks per CPU. E.g. 16 cores per node and 100 nodes means at least 2 * 16 * 100 = 3,200 tasks (sketched below).
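A sketch of how the sizing rule above might be expressed in code, using the 100-node / 16-core example; the configuration keys are standard Spark properties, but the application name and the numbers are illustrative only.

import org.apache.spark.sql.SparkSession

// Illustrative numbers from the example above: 100 nodes, 16 cores each,
// 1 executor per node, 2 tasks per core as the lower bound.
val spark = SparkSession.builder()
  .appName("sizing-example")
  .config("spark.executor.instances", "100")   // 1 executor per node
  .config("spark.executor.cores", "16")        // all cores of the node
  .config("spark.default.parallelism", "3200") // 2 * 16 * 100 tasks
  .getOrCreate()

// For an existing RDD whose partitioning came from the input format,
// repartitioning to the same target keeps all CPUs busy:
// val repartitioned = someRdd.repartition(3200)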
  • 44. 3.7 Example Spark Code - Averages (can be done with Datasets / Dataframes)
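The code for this slide is not present in the extracted text; below is a minimal sketch of per-key averages, assuming an RDD of (key, value) pairs, plus the Dataset/Dataframe equivalent the slide title refers to. The function names are placeholders.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// RDD version: carry (sum, count) so only one shuffle is needed and no groupByKey.
def averages(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
  pairs
    .mapValues(v => (v, 1L))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }

// Dataset / Dataframe version ("can be done with Datasets / Dataframes").
def averagesDS(spark: SparkSession, ds: Dataset[(String, Double)]): Dataset[(String, Double)] = {
  import spark.implicits._
  ds.groupBy($"_1").agg(avg($"_2").as("_2")).as[(String, Double)]
}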
  • 45. 3.8 Example Spark Code - Ranges

// scalaz's Monoid (zero + append); the |+| operator on the next slide comes
// from scalaz.syntax.semigroup._
import scalaz.Monoid

case class MaxMin(max: Int, min: Int)

object MaxMin {
  def apply(pair: (Int, Int)): MaxMin = MaxMin(pair._1, pair._2)
  def apply(i: Int): MaxMin = MaxMin(i, i)

  implicit object MaxMinMonoid extends Monoid[MaxMin] {
    def zero: MaxMin = MaxMin(Int.MinValue, Int.MaxValue)
    def append(mm1: MaxMin, mm2: => MaxMin): MaxMin = MaxMin(
      if (mm1.max > mm2.max) mm1.max else mm2.max,
      if (mm1.min < mm2.min) mm1.min else mm2.min
    )
  }
}
  • 46.

// Both pipelines assume an RDD[(K, Int)] called `rdd`
// and import scalaz.syntax.semigroup._ for |+|.

// Naive version: groupByKey materialises every value for a key in memory.
rdd.groupByKey().mapValues(values => values.max - values.min)

// Monoid version: values are combined map-side, so only one MaxMin per key is shuffled.
rdd.mapValues(v => MaxMin(v)).reduceByKey(_ |+| _).mapValues(mm => mm.max - mm.min)

Can NOT be done with Datasets / Dataframes
  • 47. 3.9 Joins in Spark & Hadoop Note: joining streams of data is the bread and butter of an IoT ingestion platform. ● Joins effectively work by treating both input datasets A and B as a single dataset A ++ B. The shuffle algorithm (Sort or Hash) effectively performs the bulk of the join logic. The differences between left, right & outer are very subtle. ● In Spark we can perform a broadcast join, which means copying one entire table into memory on every executor (see the sketch below). ● A natural implementation of a scheduled (e.g. daily) join will shuffle all the data. Therefore ○ Using Spark or Hadoop to join data as part of a scheduled pipeline is computationally expensive ○ One often must engineer an alternative solution, like using a database, e.g. Cassandra, or, as we will see, Kafka.
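A minimal sketch of the broadcast join mentioned above, using the Dataframe API with hypothetical `events` and `devices` tables (the paths and column name are placeholders): the small side is shipped whole to every executor, so the large side is not shuffled.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical tables: `events` is large, `devices` is small enough to fit
// in every executor's memory.
def enrichEvents(spark: SparkSession): DataFrame = {
  val events  = spark.read.parquet("s3://bucket/events")   // assumed path
  val devices = spark.read.parquet("s3://bucket/devices")  // assumed path

  // broadcast() hints Spark to copy `devices` to each executor instead of
  // shuffling both sides on the join key.
  events.join(broadcast(devices), Seq("device_id"), "left")
}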
  • 48. 3.10 Joins & Timeseries in Cassandra and Databases ● Cassandra is the ideal database for storing timeseries data ● The operational cost of a Big Data Database, like Cassandra, is huge. These databases often require 2 - 3 DevOps engineers to maintain them. ● Modifying Cassandra Schemas, or Data Models, requires significant engineering effort ● Schemas and Data Models require significant meta-data or data-dictionaries to make sense of the data ● Joins must be materialised, even when they are only used for downstream aggregations
  • 49. 3.11 Top 10 Spark Mistakes 1. Using SparkSQL (consider RedShift, Snowflake, PostGres, Oracle, Teradata, Google BigQuery) 2. Using SparkStreaming (a hammer looking for a nail; consider Kafka, Samza, Akka, Flink, Storm) 3. Using MLLib (when a vertically scaled lib will do) 4. Using way too many executors 5. Running many jobs at the same time, rather than simply running one job at a time with all the resources 6. Writing out a single file (or too few) 7. Writing out too many files 8. Using anything other than Parquet for the file format 9. Using pseudo-SQL (i.e. Dataframes or the weird Dataset Expr syntax). Should instead use aggregate with a custom aggregator, or the RDD API (see the sketch below). The nasty Expr syntax is horribly unfunctional, and thus impossible to unit test. 10. Not using all resources 11. Bonus: Not considering rolling out one's own serialisation
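For mistake 9, a minimal sketch of what a typed, unit-testable aggregator can look like, assuming a hypothetical Dataset of (key, value) pairs; it reuses the max-min "range" idea from slides 45-46, not any code from the original deck.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Typed aggregator computing (max - min) per group; plain Scala, so it can be
// unit tested without a SparkSession by calling zero/reduce/merge/finish directly.
object RangeAgg extends Aggregator[Int, (Int, Int), Int] {
  def zero: (Int, Int) = (Int.MinValue, Int.MaxValue)
  def reduce(b: (Int, Int), v: Int): (Int, Int) = (math.max(b._1, v), math.min(b._2, v))
  def merge(b1: (Int, Int), b2: (Int, Int)): (Int, Int) =
    (math.max(b1._1, b2._1), math.min(b1._2, b2._2))
  def finish(b: (Int, Int)): Int = b._1 - b._2
  def bufferEncoder: Encoder[(Int, Int)] = Encoders.tuple(Encoders.scalaInt, Encoders.scalaInt)
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// Usage sketch on a hypothetical ds: Dataset[(String, Int)]:
// ds.groupByKey(_._1).mapValues(_._2).agg(RangeAgg.toColumn.name("range"))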
  • 51. 4.1 Definition - Streaming Platform 1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system. 2. It lets you store streams of records in a fault-tolerant way. 3. It lets you process streams of records as they occur. Kafka Streams satisfies (3), while Spark Streaming does not - it micro-batches, so "SparkStreaming" is a misnomer.
  • 52. 4.2 Definition - Kafka Topic ● A Kafka Topic is a partitioned sequence; records are assigned to partitions by key, or round-robin ● A Topic Partition supports the following logical operations: ○ Append(block: List[(K, V)]) - O(BlockSize + C_a) ○ Read(offset: Long): List[(K, V)] - O(BlockSize + C_r) ● 1 to many consumers, 1 to many producers ● Unlike a queue, many consumers can read from a single topic such that every record is read by every consumer ● Kafka R/W performance is O(1) in the size of the topic, so storing data indefinitely is not a problem (though it is obviously O(N) in space) ● Kafka Topics technically support random reads, but reading sequential blocks is faster
  • 53.
  • 54. 4.3 Producer API

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, Producer}

val props = new Properties()
// Required connection / serialisation settings (values here are placeholders;
// match the serializers to KeyType / ValueType)
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// Similar to Nagle's Algorithm:
// the producer will send either by batch or by time (whichever comes first)
props.put("batch.size", 16384)
props.put("linger.ms", 1)

val producer: Producer[KeyType, ValueType] = new KafkaProducer(props)

In a nutshell it has a single async method: send(record: ProducerRecord[KeyType, ValueType]): Future[RecordMetadata]
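A short usage sketch of send, assuming a Producer[String, String] configured as above and a hypothetical topic name; it also shows the callback variant of send, which is part of the same Producer API.

import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}

// Hypothetical topic / key / value
val record = new ProducerRecord[String, String]("my-topic", "some-key", "some-value")

// Fire and forget: send() buffers the record and returns a Future immediately.
producer.send(record)

// Or attach a callback to learn the partition/offset (or the error) once the
// batch actually reaches the broker.
producer.send(record, new Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
    if (exception != null) exception.printStackTrace()
    else println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")
})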
  • 55. 4.4 Producer Notes ● There are other methods for atomic transactions & idempotency (i.e. sending messages to multiple topics atomically). ● Sends happen implicitly by background threads controlled by the library ● Multithreading of preprocessing is controlled by the application ● The buffer.memory controls the total amount of memory available to the producer for buffering. When the buffer space is exhausted additional send calls will block.
  • 56. 4.5 Consumer API (Auto Commit)

import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// plus bootstrap.servers, group.id and key/value deserializers (omitted on the slide)
props.put("enable.auto.commit", "true")
props.put("auto.commit.interval.ms", "1000")

val consumer: KafkaConsumer[KeyType, ValueType] = new KafkaConsumer(props)
consumer.subscribe(List("topic-A", "topic-B").asJava)  // subscribe takes a Java collection

In a nutshell it has a single polling method, poll(timeout: Long): ConsumerRecords[KeyType, ValueType], which blocks for at most the timeout and returns whatever records have arrived.
  • 57. 4.6 Consumer API (Manual Commit)

props.put("enable.auto.commit", "false")

// … and somewhere after processing each batch, your logic will include:
consumer.commitSync()

(A full poll / process / commit loop is sketched below.)
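Putting the two consumer slides together, a minimal sketch of the typical poll / process / commit loop with manual commits, assuming a KafkaConsumer[String, String] configured with enable.auto.commit=false and subscribed as on slide 56 (the print statement stands in for real processing logic).

import java.time.Duration
import scala.collection.JavaConverters._

while (true) {
  // Newer clients take a Duration; older clients expose poll(timeout: Long) as on slide 56.
  val records = consumer.poll(Duration.ofMillis(500))
  for (record <- records.asScala) {
    // ... your processing logic ...
    println(s"${record.topic()} / ${record.partition()} @ ${record.offset()}: ${record.key()} -> ${record.value()}")
  }
  // Commit only after the batch has been processed: a crash before this line
  // means the records are re-consumed on restart ("at-least-once").
  consumer.commitSync()
}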
  • 58. 4.7 Consumer Notes ● Offset Commits store consumer offsets in Kafka, essentially marking where the consumer is. ● Offsets can be reset - so topics can be replayed easily ● If an application consumes records but fails just before it manages to commit its offset to Kafka, when the application is restarted it will re-consume those records. This is called the “at-least-once” delivery guarantee. ● Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. This allows for load balancing/parallelism within a "logical subscriber". ● So if there is a topic with four partitions, and a consumer group with two processes, each process would consume from two partitions.
  • 59. 4.8 Guarantees ● Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log. ● A consumer instance sees records in the order they are stored in the log. ● For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
  • 60. 4.9 Kafka Stream ● A high level API very similar to Spark for processing (streaming) data in parallel ● Each stream partition is a totally ordered sequence of data records and maps to a Kafka topic partition. ● A data record in the stream maps to a Kafka message from that topic. ● The keys of data records determine the partitioning of data in both Kafka and Kafka Streams, i.e., how data is routed to specific partitions within topics. ● An application instance sets the number of stream threads ● Each stream thread can process multiple stream tasks, where a stream task is a logical unit of parallelism - it is assigned a collection of partitions corresponding to the source topics. ● A stream task may execute a complex processor topology that can have multiple source topics and multiple sink topics
  • 61.
  • 62. 4.10 Kafka Streams API

Properties props = new Properties();
// application.id and bootstrap.servers are mandatory (values here are placeholders)
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
StreamsConfig config = new StreamsConfig(props);

StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("my-input-topic")
       .mapValues(value -> String.valueOf(value.length()))
       .to("my-output-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
  • 63. 4.11 Kafka Streams “MapReduce” Implementation People will not call Kafka Streams a MapReduce framework, and it is certainly more general, but at a high level it still helps to consider it under the MapReduce abstraction, at least so we can compare it to Spark / Hadoop. (Function → Kafka Streams Implementation Details)
  Mapper: Is a consumer + processing logic. Cannot have more consumers than partitions of the input topic. Processing logic is entirely controlled by the user, and thus threading here can be more complicated and fine-tuned than in the Hadoop and Spark worlds.
  Shuffle / Reduce: Effectively there is no Reduce Phase, or rather the Reducer always just flattens the Key -> Multiset[Value] back out into Multiset[Key -> Value]. The shuffle is emulated by a producer producing to a partitioned Kafka topic. A Shuffle-Reduce phase can output to multiple output datasets (topics), unlike Spark or Hadoop, which naturally only output a single dataset.
  Map Reduce Job & Program: So a Map Reduce program in Kafka Streams constitutes a lot of Map-Shuffle-Map-Shuffle phases. In Kafka Streams, all of the phases in all of the jobs can run at the same time. Furthermore they share the same processes and threads. This is possible because it is entirely event driven. Sometimes you may have a disconnected topology that makes sense to run on separate clusters.
  • 64. 4.12 Joins in Kafka Streams ● Kafka has a lot more flexibility in how a join is performed. In a nutshell the two options for full (non-windowed) joins consist of: ○ (Shuffle-like) Co-partitioning input topics A and B ○ (Broadcast-like) Using a GlobalKTable (see the sketch below) ● In the Shuffle-like option, Kafka will keep local key-value caches within the stream tasks corresponding to the partitions for that stream task. Kafka will use either RocksDB (an SSD-based DB with in-memory caches) or a HashMap. ● In the Broadcast-like option, one of the entire input topics is kept as a key-value cache within every stream task. ● Note that in a Kafka join it’s always possible to engineer it such that each record is shuffled only once (although multiple state lookups will occur)
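A minimal sketch of the Broadcast-like option, using the Java Streams DSL from Scala with hypothetical `clicks` and `users` topics and String keys/values (default String serdes assumed, as in the config on slide 62): every application instance materialises the whole users topic locally, so the clicks stream does not need to be re-partitioned.

import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{GlobalKTable, KStream, KeyValueMapper, ValueJoiner}

val builder = new StreamsBuilder()

// Hypothetical topics: a high-volume click stream keyed by userId,
// and a compacted users topic keyed by userId.
val clicks: KStream[String, String] = builder.stream[String, String]("clicks")
val users: GlobalKTable[String, String] = builder.globalTable[String, String]("users")

// Broadcast-like join: look up each click's user in the locally cached table.
val enriched: KStream[String, String] = clicks.join(
  users,
  new KeyValueMapper[String, String, String] {
    // map each stream record to the table key (here the record key itself)
    override def apply(clickKey: String, clickValue: String): String = clickKey
  },
  new ValueJoiner[String, String, String] {
    // combine the click value with the matching user value
    override def apply(clickValue: String, userValue: String): String = s"$clickValue|$userValue"
  }
)
enriched.to("enriched-clicks")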
  • 65. 4.13 Benefits of Kafka ● Perfect for storing timeseries and event-driven data ● Offers a stream-first, key-value-store-first philosophy which is highly conducive to application and algorithm development ● Can handle real-time data ● A considerably more flexible parallelism model than Spark & Hadoop ● Shuffling operations (e.g. Joins and GroupBys) need not be performed over and over again in order to get the latest view of the information ● Errors are easier to debug since Kafka is essentially a library, not a framework. The logic executes inside your application, not within someone else's application. The logs and stack traces make sense.
  • 66. 4.14 Drawbacks of Kafka ● (or Benefit?) Usually requires the user to have a more detailed understanding of the fundamental building blocks (e.g. consumers, producers) when compared to just getting a Spark application to “work” (e.g. tasks, executors) ● (or Benefit?) Requires the user to become more intimate with the DevOps of the application, since the parallelism model is far more explicit ● Currently the only easy(ish) way to get up and running quickly is with Confluent Cloud (whereas Spark has EMR, Google Dataproc, etc.) ● Kafka does not yet have some high-level DSLs that could allow for easier DevOps ● (or Benefit? Here lies fun) Kafka does not yet have an (openly available) Machine Learning library
  • 67. 5. Scalable Streaming Machine Learning Approaches
  • 69. 5.1 Critical Questions to Ask First ● Do you have high-quality training data? ● What latency between train and predict do you really need? I.e. do you really need real-time training? What is the business case; what analysis has been done to prove that real time will really earn more profit? ● How much data do you need to process after ETL/Cleaning? I.e. do you really need to use parallel processing?
  • 70. 5.2 Streaming ML Approaches Incremental Algorithms: there are incremental versions of Support Vector Machines, Bayesian Networks and Neural Networks. Bayesian Networks can easily be designed to run in parallel too. Periodic Re-training with a batch algorithm: we simply buffer the relevant data and retrain our model “every so often” (sketched below). https://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
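A minimal sketch, in plain Scala, of the second approach: buffer incoming labelled records and hand them to some batch trainer every N records. The Model, LabelledPoint and PeriodicTrainer types are hypothetical placeholders, not a particular library.

import scala.collection.mutable.ArrayBuffer

// Hypothetical types: a labelled feature vector and an opaque trained model.
final case class LabelledPoint(features: Array[Double], label: Int)
trait Model { def predict(features: Array[Double]): Int }

// Periodic re-training: predict with the current model, buffer every record,
// and rebuild the model with a batch algorithm every `retrainEvery` records.
final class PeriodicTrainer(trainBatch: Seq[LabelledPoint] => Model,
                            initial: Model,
                            retrainEvery: Int = 10000) {
  private val buffer = ArrayBuffer.empty[LabelledPoint]
  @volatile private var model: Model = initial

  def predict(features: Array[Double]): Int = model.predict(features)

  def observe(point: LabelledPoint): Unit = synchronized {
    buffer += point
    if (buffer.size % retrainEvery == 0)
      model = trainBatch(buffer.toSeq)   // "every so often": here, every N records
  }
}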
  • 71. 5.3 Bayesian Networks are Awesome ● Work well for most business problems: fintech, adtech, retail, marketing. The only use cases not really covered are rare in most enterprises (e.g. image or sound recognition) ● Machine Learning rarely ever needs regression, since automated actions are binary or categorical (e.g. send a marketing email, or not) ● Consider discretizing continuous variables via Information Theory (Kullback-Leibler divergence, Shannon Entropy) - see the sketch below ● Bayesian Networks are transparent since they are just a direct application of Probability Theory and Information Theory. Therefore they are easy to understand, maintain, and cannot overfit.
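A small sketch of the information-theoretic discretisation idea: Shannon entropy over class labels, and the information gain of splitting a continuous feature at a candidate threshold. Pure Scala, no library assumed; the function names are illustrative.

// Shannon entropy (in bits) of a sequence of class labels.
def entropy[A](labels: Seq[A]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values.map { group =>
    val p = group.size / n
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of discretising a continuous feature into
// {<= threshold, > threshold} with respect to the labels.
def informationGain(values: Seq[Double], labels: Seq[Int], threshold: Double): Double = {
  val (left, right) = values.zip(labels).partition { case (v, _) => v <= threshold }
  val n = values.size.toDouble
  entropy(labels) -
    (left.size / n) * entropy(left.map(_._2)) -
    (right.size / n) * entropy(right.map(_._2))
}

// Pick the cut point that keeps the most information about the label.
def bestThreshold(values: Seq[Double], labels: Seq[Int]): Double =
  values.distinct.maxBy(t => informationGain(values, labels, t))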
  • 72. 6. Streaming Machine Learning Architecture
  • 73. [Architecture diagram: Real World; Production - Kafka Connect (or similar), Kafka, Kafka (Streams) Application (ETL, Feature Extraction, ML Prediction), Actions, S3 / Google Cloud Storage in Parquet format; Labs (EMR, Dataproc) - Spark Analytics Cluster (transient / ephemeral), Zeppelin (or Jupyter), ML Training Cluster (transient / ephemeral); Deployments]
  • 74. [Architecture diagram, as on the previous slide, with additional elements: API, Other Applications, Other Applications & Analytics]
  • 75. 7. Conclusions ● A badly engineered, badly designed and badly written ingestion & ETL framework will result in low-quality data and metadata. ● Most Data Science and Big Data applications do not need a database, nor even a Hive cluster. ● If you as a Data Scientist do not get involved in the ingestion, ETL, engineering, software development and architecture of a system, you will inevitably find yourself spending your time on “99% preparation, 1% misinterpretation” ● In the upcoming years Kafka and Kafka Streams will become the accepted industry standard for building real-time (and even batch) data-driven applications ● In the upcoming years significant development will be seen in Machine Learning libraries that integrate easily with Kafka Streams
  • 76. Exercises These exercises are intentionally open-ended to allow potentially hours of fun for each. 1. (1 - 2 hours) Write down in (reasonably) formal mathematical notation the mappers and reducers for a sorting algorithm. 2. (1 - 4 hours) For the Spark Code Examples 3.7 & 3.8, try to derive complexity formulae for the approaches, where you can assume your favourite distributions on the keys and values. 3. (1 - 10 days) Similarly, try randomly generating some data according to your favourite distributions, producing a fully working Spark app and executing the code on an EMR cluster. Compare the times, and plot how the times differ as the input data sizes grow (or even use 3D plots to see how the parameters of your distribution affect the times too).