6. Constitutes 4% of actual job:
● Machine Learning Algorithms
● Machine Learning Libraries
● More Machine Learning Libraries
● Linear Algebra
● Statistics
● GPUs, HPCs
● Information Theory
● Complexity Theory
7. Interpretation + Wiring Together of:
Created By:
● Data: App developers, 3rd parties, Business Analysts, Data Engineers
● Algorithms: Academics, Mathematicians
● Libraries: Academics, big tech companies (Google, FB), open source community
● Hardware / Environments: System Administrators, DevOps
● Production Quality Implementation: Software Engineers, Data Engineers
8. Root Cause
This really means?
Person who is worse at coding than any software engineer and worse at maths than any
mathematician
Should mean:
Software engineer who’s good at maths, or a mathematician who’s good at engineering
9. Solution - How to put creativity into Data Science
Covered by this talk
● Be at the boundary where Data Science meets the real world:
○ Be part of generating the data with production quality applications
○ Be part of productionising the algorithms with production quality code
● Automate Everything
● Avoid the SQL & DB mindset that creates the 79% of the work that is boring and painful
Not covered
● Minimalist algorithmic design:
○ KISS - Keep It Simple, Stupid; YAGNI - You Aren’t Gonna Need It
○ Probability Theory & Complexity Theory (NOT stats & linalg) as the principal foundations of all
algorithms
11. Automation - Turn the 96% into 0%
Simplicity--the art of maximizing the amount of work not done--is essential. - Agile Manifesto
Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away. -
Antoine de Saint-Exupery
Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of
courage to move in the opposite direction. - E. F. Schumacher
Everything should be made as simple as possible, but not simpler. - Einstein
Data science is about solving problems, not models or algorithms. - Data Science Manifesto
Aim to completely remove manual intervention in numerical processing. - Data Science Manifesto
12. 2. Parallel Processing
Definition - Big Data
The input is too big to fit into memory on a single machine. - A Model of Computation for
MapReduce, http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf
29. ● Different implementations of MapReduce differ in the following
○ How mappers are started and run in parallel (multithreading or multi-process)
○ How big the records are that are fed to mappers
○ The shuffle algorithm implementation
○ How reducers are started and run in parallel
○ How the shuffling feeds records to the reducers
○ How data is cached or stored in between stages
● This abstraction is necessary to understand how to write efficient code in a Big
Data context
● No real life implementation will exactly match the formal abstraction
● In Batch implementations like Spark & Hadoop the key algorithm design
considerations are
○ Cannot assume the data has any order (and best to assume no order within partitions)
○ Maximise usage of map-side reduce via monoids (aka map-side partial-aggregation)
● In a Streaming context, like Kafka, the key algorithm design considerations are
○ Cannot assume the data has any global order (but can within partitions)
○ Maximise usage of partial-aggregations
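The "map-side reduce via monoids" idea can be sketched in plain Scala, with no Spark or Hadoop involved. This is an illustration under assumptions: `Monoid`, `combinePartition` and `mergePartials` are hypothetical names, and each inner `Seq` stands in for one partition. Every partition is collapsed to at most one value per key before anything would be shuffled, and the partial results are then merged.

```scala
object MapSideReduce {
  // A monoid: an associative combine operation with an identity element.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // "Map side": collapse one partition to at most one value per key.
  def combinePartition[K, V](partition: Seq[(K, V)], m: Monoid[V]): Map[K, V] =
    partition.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
    }

  // "Reduce side": merge the per-partition partial results.
  def mergePartials[K, V](partials: Seq[Map[K, V]], m: Monoid[V]): Map[K, V] =
    partials.foldLeft(Map.empty[K, V]) { (acc, partial) =>
      partial.foldLeft(acc) { case (a, (k, v)) =>
        a.updated(k, m.plus(a.getOrElse(k, m.zero), v))
      }
    }
}
```

Because the combine is associative, the result does not depend on how records are split across partitions or on their order within a partition, which is why the "no order" assumption above costs nothing here.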
34. 3.1 History of Hadoop MapReduce
Function: Hadoop Implementation Details
Mapper: An entire JVM that by default processes one block of HDFS or one file of S3. This means the
number of mappers (and thus the resulting parallelism) is heavily determined by how you store
your data. You can increase/decrease the amount of data processed by a mapper, but not always.
Sometimes there are not enough CPUs available on a node to process all its data; Hadoop is
clever and will use “data locality” to process data near (e.g. same rack) where it is stored.
Reducer: Similarly an entire JVM will process many reducers (although the literature just calls this JVM a
reducer (singular)). The number of reducers is chosen explicitly.
Shuffle:
○ Hash Shuffle: for a given < Key, Multiset[Value] > from the shuffle phase, the implementation
selects a reducer based on an integer hash of the Key modulo num-reducers, then feeds the
Values to the reducer as a stream. The JVM has a HashMap to keep track of each reducer. This
is memory intensive.
○ Sort Shuffle: (often the modern default) here the shuffle algorithm sorts the data in transit to the
reducers (this can be done efficiently thanks to repeated application of Merge Sort). Now the
reducers can run in sequence in the reducer JVM, with no need for a HashMap. This algorithm uses
less memory, but can be slow when many distinct keys exist.
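The two shuffle strategies can be sketched in plain Scala. This is an illustration of the idea only, not Hadoop's implementation: `reducerFor` and `groupSorted` are hypothetical names. The first picks a reducer from a key's hash modulo the number of reducers; the second shows why sorted input lets a reducer emit one group at a time without a HashMap.

```scala
object ShuffleSketch {
  // Hash shuffle: choose a reducer from the key's hash, modulo the number of reducers.
  // hashCode can be negative, so fold the remainder back into [0, numReducers).
  def reducerFor[K](key: K, numReducers: Int): Int = {
    val mod = key.hashCode % numReducers
    if (mod < 0) mod + numReducers else mod
  }

  // Sort shuffle: once records are sorted by key, equal keys are adjacent, so the
  // groups can be built by comparing each record only with the current group's key.
  def groupSorted[K, V](records: Seq[(K, V)])(implicit ord: Ordering[K]): List[(K, List[V])] =
    records.sortBy(_._1).foldRight(List.empty[(K, List[V])]) {
      case ((k, v), (headKey, values) :: rest) if ord.equiv(k, headKey) =>
        (headKey, v :: values) :: rest
      case ((k, v), groups) =>
        (k, List(v)) :: groups
    }
}
```

In this sketch the whole input is sorted in one call; a real implementation would merge pre-sorted runs arriving from many mappers.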
35. Function: Hadoop Implementation Details
Map Reduce Job: A key feature of Hadoop is that all the phases, Map, Shuffle and Reduce, can execute
simultaneously. So as mappers output data, that output is simultaneously shuffled, and fed into
reducers, which write that data out.
Map Reduce Program: A key feature of Hadoop is that mappers (nearly) always read from a filesystem, and reducers
(nearly) always write to a filesystem.
37. 3.2 Drawbacks of Hadoop MapReduce
● Since each mapper/reducer uses an entire JVM, this can result in inefficient use of
memory. Each JVM cannot share memory, so any memory it is not using cannot be used
by other JVMs.
● Furthermore JVMs cannot share common data, so if mappers use, say, a big Dictionary to
perform their function, that Dictionary must be duplicated across every JVM. Historical
workarounds for this involve in-memory databases.
● Chaining MapReduce Stages together to form a MapReduce Program results in many
unnecessary reads/writes from disk.
● The original API was a low level Java API. Consequently code was quite verbose and
difficult to unit test. High level frameworks were built on top, the best being
Scalding (which sat on Cascading), the worst being Hive or Pig.
● Since every mapper/reducer starts a JVM, and managing these JVMs is complicated,
Hadoop has a high latency (typically 10+ seconds).
38. 3.2.1 Hive - The Worst Invention used in Data Science
● Hive is an SQL-like DSL for generating Hadoop MapReduce jobs
● It is batch oriented
● SQL written for Postgres, Oracle or Teradata rarely executes in the same way
on a Hadoop cluster. Consequently Hive is very slow and horrible to debug.
● Big Data’s central premises are
○ Unstructured / semi structured data; Key-Value Stores
○ Schema-on-read
○ NoSQL
● Hive is the exact opposite of the central premises of Big Data
○ Structured data, Tabular
○ Schema-on-write (hive metastore)
○ SQL
● Hive, and its associated SQL-world mindset, is the main reason 79% of the work
of a Data Scientist is boring, painful and unnecessary
39. 3.3 Spark MapReduce Implementation
Function: Spark Implementation Details
Mapper: A Spark Task processes a mapper. Each task has a single thread, and multiple tasks (threads)
can execute in a single JVM, called an Executor.
Similarly to Hadoop, the default number of tasks is determined heavily by the format of the input
data (number of files, type of compression, etc). This is because Spark reuses Hadoop's
underlying filesystem APIs.
Reducer: Similarly to a mapper, reducers are tasks running in an executor. Again the number of reducers
must be chosen wisely. Output of the shuffle phase is fed to the reducers like in the Hadoop
implementation (although the API differs greatly, as keys are often implicit in Spark).
Map Reduce Job: In Spark, only the Shuffle and Reduce phases can execute simultaneously; they must wait for
the Map phase to complete before they start.
Map Reduce Program: Spark can chain multiple MapReduce stages together without writing to disk by keeping datasets
in memory.
41. 3.4 Spark Benefits
● More efficient allocation of memory thanks to tasks sharing a single JVM
● JVM management is simpler and faster, so latency is only a couple of seconds
● Using a SparkContext we can keep the executor JVMs running, which means in
some situations the latency can be less than 1 second
● A shared JVM means we need not copy large data structures for every task; we can keep a
single copy per node - this is called a broadcast variable
● Since the overhead of chaining jobs together is drastically cut by keeping
datasets in memory, many algorithms run orders of magnitude faster than
Hadoop (e.g. Logistic Regression)
● The RDD API for Spark is very easy to use and very concise.
● The Dataset API when combined with Parquet compression can result in very
efficient applications
42. 3.5 Spark Drawbacks
● API - exceptions rarely correspond to your code, often the only debugging
approach is binary chop.
● Only the lower level RDD API is approximately functional; the DataFrame and
Dataset APIs are really bad from a design and functional point of view.
● Spark needs to keep its entire Map Phase somewhere (in memory, or serialised
to disk) as it will not start the Shuffle Phase until it’s complete. This means
there are some circumstances (extremely huge data) where Hadoop can
execute a job while Spark cannot.
● Spark is generally less stable than Hadoop
● (Currently) the number of Map tasks is not as flexible as in Hadoop. In particular
there is no efficient way to get more tasks than there are input blocks, or fewer
tasks than there are files.
43. 3.6 Spark Basics
● Spark is best run on a cluster 1 in 1 out, i.e. one Spark job at a time that uses all
the resources. E.g. 5 jobs run in sequence will finish faster than 5 jobs run in
parallel.
● So the number of Executors should be 1 per node. E.g. 100 nodes means 100
executors. There are very rare exceptions to this.
● The number of Tasks should be high enough that all CPUs are being used, which means 2 - 4
tasks per CPU. E.g. 16 cores per node, 100 nodes, means at least 2 * 16 * 100 =
3,200 tasks.
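The sizing rule above is simple arithmetic, sketched here as two helper functions. The names `executors` and `minTasks` are hypothetical, and how the resulting numbers are passed to a Spark job is left out.

```scala
object SparkSizing {
  // One executor per node, following the rule of thumb in the text.
  def executors(nodes: Int): Int = nodes

  // Minimum task count: tasksPerCore tasks for every core in the cluster.
  // The text suggests 2 - 4 tasks per core; 2 gives the lower bound.
  def minTasks(nodes: Int, coresPerNode: Int, tasksPerCore: Int = 2): Int =
    nodes * coresPerNode * tasksPerCore
}
```

For the example in the text, `SparkSizing.minTasks(100, 16)` gives the 3,200 lower bound and `SparkSizing.minTasks(100, 16, 4)` gives the upper end of the suggested range.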
44. 3.7 Example Spark Code - Averages
Can be done with Datasets / DataFrames
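The code from the original slide is not available here, so the following is a stand-in sketch in plain Scala collections rather than Spark code. It assumes the usual approach to distributed averages: carry a (sum, count) pair per key, because those pairs combine associatively while averages do not. `SumCount`, `partial` and `averages` are hypothetical names, and each inner `Seq` stands in for a partition.

```scala
object Averages {
  // (sum, count) pairs combine associatively; averages themselves do not.
  final case class SumCount(sum: Double, count: Long) {
    def +(that: SumCount): SumCount = SumCount(sum + that.sum, count + that.count)
    def mean: Double = sum / count
  }

  private val empty = SumCount(0.0, 0L)

  // Partial aggregation within a single partition.
  def partial(partition: Seq[(String, Double)]): Map[String, SumCount] =
    partition.foldLeft(Map.empty[String, SumCount]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, empty) + SumCount(v, 1L))
    }

  // Merge the partial aggregates and only divide at the very end.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    partitions.map(partial)
      .foldLeft(Map.empty[String, SumCount]) { (acc, p) =>
        p.foldLeft(acc) { case (a, (k, sc)) => a.updated(k, a.getOrElse(k, empty) + sc) }
      }
      .map { case (k, sc) => k -> sc.mean }
}
```

With partitions `Seq("a" -> 1.0, "a" -> 2.0, "a" -> 3.0)` and `Seq("a" -> 10.0)`, averaging the per-partition averages would give 6.0, whereas merging the (sum, count) pairs gives the correct 4.0.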
47. 3.9 Joins in Spark & Hadoop
Note: Joining streams of data is the bread and butter of an IoT ingestion platform.
● Joins effectively work by treating both input datasets A and B as a single
dataset A ++ B. The shuffle algorithm (Sort or Hash) effectively performs the
bulk logic of the join. Differences between left, right & outer are very subtle.
● In Spark we can perform a broadcast join, which means copying one entire table
into memory for every executor.
● A natural implementation of a scheduled (e.g. daily) join will shuffle all the data.
Therefore
○ Using Spark or Hadoop to join data as part of a scheduled pipeline is computationally expensive
○ One often must engineer an alternative solution, like using a database, e.g. Cassandra, or as we
will see Kafka.
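The "A ++ B" view of a join can be sketched in plain Scala. This is an illustration under assumptions, not Spark's or Hadoop's implementation: `innerJoin` is a hypothetical name, records are tagged with `Left` / `Right` to remember their source, and `groupBy` stands in for the shuffle.

```scala
object ShuffleJoin {
  def innerJoin[K, A, B](as: Seq[(K, A)], bs: Seq[(K, B)]): Map[K, Seq[(A, B)]] = {
    // A ++ B: one combined dataset, each record tagged with the input it came from.
    val combined: Seq[(K, Either[A, B])] =
      as.map { case (k, a) => (k, Left(a): Either[A, B]) } ++
        bs.map { case (k, b) => (k, Right(b): Either[A, B]) }

    // groupBy stands in for the shuffle: it brings all records with the same key together.
    combined.groupBy(_._1).map { case (k, records) =>
      val left = records.collect { case (_, Left(a)) => a }
      val right = records.collect { case (_, Right(b)) => b }
      // Inner join: pairs exist only where both sides have the key.
      k -> (for (a <- left; b <- right) yield (a, b))
    }.filter(_._2.nonEmpty)
  }
}
```

In this sketch a left, right or outer join would differ only in what is emitted when `left` or `right` is empty, which is one way to read the remark that the differences are subtle.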
48. 3.10 Joins & Timeseries in Cassandra and Databases
● Cassandra is the ideal database for storing timeseries data
● The operational cost of a Big Data Database, like Cassandra, is huge. These
databases often require 2 - 3 DevOps engineers to maintain them.
● Modifying Cassandra Schemas, or Data Models, requires significant
engineering effort
● Schemas and Data Models require significant meta-data or data-dictionaries to
make sense of the data
● Joins must be materialised, even when they are only used for downstream
aggregations
49. 3.11 Top 10 Spark Mistakes
1. Using SparkSQL (consider Redshift, Snowflake, Postgres, Oracle, Teradata, Google BigQuery)
2. Using SparkStreaming (a hammer looking for a nail; consider Kafka, Samza, Akka, Flink, Storm)
3. Using MLlib (when a vertically scaled lib will do)
4. Using way too many executors
5. Running many jobs at the same time, rather than simply running one job at a time with all the resources
6. Writing out a single file (or too few)
7. Writing out too many files
8. Using anything other than Parquet for file format
9. Using pseudo-SQL (i.e. DataFrames or the weird Datasets Expr syntax). Instead use aggregate
with a custom aggregator, or RDDs. The nasty Expr syntax is horribly unfunctional, and thus impossible
to unit test.
10. Not using all resources
11. Bonus: Not considering rolling one's own serialisation
51. 4.1 Definition - Streaming Platform
1. It lets you publish and subscribe to streams of records. In this respect it is
similar to a message queue or enterprise messaging system.
2. It lets you store streams of records in a fault-tolerant way.
3. It lets you process streams of records as they occur.
Kafka Streams satisfies 3, while SparkStreaming does not - SparkStreaming is a
misnomer.
52. 4.2 Definition - Kafka Topic
● A Kafka Topic is a partitioned sequence. Records are assigned to partitions by key, or
round-robin
● A Topic Partition supports the following logical operations:
○ Append(block: List[(K, V)]) - O(BlockSize + C_a)
○ Read(offset: Long): List[(K, V)] - O(BlockSize + C_r)
● 1 to many consumers, 1 to many producers
● Unlike a queue many consumers can read from a single topic such that every
record is read by every consumer
● Kafka R/W performance is O(1) in the size of the topic. So storing infinitely is
not a problem (though it is obviously O(N) in space).
● Kafka Topics technically support random read, but reading sequential blocks is
faster
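The two logical operations on a topic partition can be sketched as an in-memory append-only log. This is a toy model, not Kafka's storage format: `Partition`, `append`, `read` and `partitionFor` are hypothetical names, `read` takes an assumed `maxRecords` argument in place of a block size, and `partitionFor` only stands in for Kafka's partitioner, which hashes the serialized key differently.

```scala
object TopicPartitionSketch {
  // An append-only log: records are addressed by offset and never modified in place.
  final class Partition[K, V] {
    private val log = scala.collection.mutable.ArrayBuffer.empty[(K, V)]

    // Append a block of records; returns the offset of the first record written.
    def append(block: List[(K, V)]): Long = {
      val firstOffset = log.length.toLong
      log ++= block
      firstOffset
    }

    // Read up to maxRecords starting at offset. The cost depends on the number of
    // records returned, not on how long the log has grown.
    def read(offset: Long, maxRecords: Int): List[(K, V)] =
      log.slice(offset.toInt, offset.toInt + maxRecords).toList
  }

  // Keyed records go to a partition chosen from the key's hash.
  def partitionFor[K](key: K, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)
}
```

Since consumers only track an offset, many of them can read the same partition independently, which is the difference from a queue noted above.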
54. 4.3 Producer API
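import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, Producer}
val props = new Properties()
// Assumed broker address and String serializers; adjust for your cluster and KeyType / ValueType
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// Similar to Nagle's Algorithm:
// the producer sends when a batch fills or linger.ms elapses (whichever comes first)
props.put("batch.size", "16384")
props.put("linger.ms", "1")
val producer: Producer[KeyType, ValueType] = new KafkaProducer[KeyType, ValueType](props)
In a nutshell, the producer has a single async method:
send(record: ProducerRecord[KeyType, ValueType]): Future[RecordMetadata]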
55. 4.4 Producer Notes
● There are other methods for atomic transactions & idempotency (i.e. sending
messages to multiple topics atomically).
● Sends happen implicitly by background threads controlled by the library
● Multithreading of preprocessing is controlled by the application
● The buffer.memory setting controls the total amount of memory available to the
producer for buffering. When the buffer space is exhausted, additional send
calls will block.
56. 4.5 Consumer API (Auto Commit)
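import java.util.Arrays
import org.apache.kafka.clients.consumer.KafkaConsumer
// props also needs bootstrap.servers, group.id and key/value deserializers (omitted here)
props.put("enable.auto.commit", "true")
props.put("auto.commit.interval.ms", "1000")
val consumer: KafkaConsumer[KeyType, ValueType] = new KafkaConsumer[KeyType, ValueType](props)
// subscribe takes a java.util.Collection, so a Scala List will not compile
consumer.subscribe(Arrays.asList("topic-A", "topic-B"))
In a nutshell, the consumer has a single method for fetching records, which blocks for at most the
timeout (newer clients take a java.time.Duration instead of a Long):
poll(timeout: Long): ConsumerRecords[KeyType, ValueType]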
57. 4.6 Consumer API (Manual Commit)
props.put("enable.auto.commit", "false");
// … your logic will include
consumer.commitSync();
58. 4.7 Consumer Notes
● Offset Commits store consumer offsets in Kafka, essentially marking where the
consumer is.
● Offsets can be reset - so topics can be replayed easily
● If an application consumes records but fails just before it manages to commit
its offset to Kafka, when the application is restarted it will re-consume those
records. This is called an “at-least-once” delivery guarantee.
● Consumers label themselves with a consumer group name, and each record
published to a topic is delivered to one consumer instance within each
subscribing consumer group. This allows for load balancing/parallelism within a
"logical subscriber".
● So if there is a topic with four partitions, and a consumer group with two
processes, each process would consume from two partitions.
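The four-partitions, two-processes example can be sketched as a simple round-robin deal. This is an illustration only and not one of Kafka's actual assignment strategies: `assign` is a hypothetical name, and partitions and consumers are represented by plain integers and strings.

```scala
object GroupAssignment {
  // Deal partitions out to consumers in turn, so that every partition has
  // exactly one owner within the consumer group.
  def assign(partitions: Seq[Int], consumers: Seq[String]): Map[String, Seq[Int]] = {
    require(consumers.nonEmpty, "a consumer group needs at least one member")
    val dealt = partitions.zipWithIndex.groupBy { case (_, i) => consumers(i % consumers.size) }
    consumers.map(c => c -> dealt.getOrElse(c, Seq.empty).map(_._1)).toMap
  }
}
```

With more consumers than partitions, some consumers receive an empty assignment, so partition count caps the useful parallelism of a group.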
59. 4.8 Guarantees
● Messages sent by a producer to a particular topic partition will be appended in
the order they are sent. That is, if a record M1 is sent by the same producer as a
record M2, and M1 is sent first, then M1 will have a lower offset than M2 and
appear earlier in the log.
● A consumer instance sees records in the order they are stored in the log.
● For a topic with replication factor N, we will tolerate up to N-1 server failures
without losing any records committed to the log.
60. 4.9 Kafka Streams
● A high level API very similar to Spark for processing (streaming) data in parallel
● Each stream partition is a totally ordered sequence of data records and maps to
a Kafka topic partition.
● A data record in the stream maps to a Kafka message from that topic.
● The keys of data records determine the partitioning of data in both Kafka and
Kafka Streams, i.e., how data is routed to specific partitions within topics.
● An application instance sets the number of stream threads
● Each stream thread can process multiple stream tasks, where a stream task is a
logical unit of parallelism - it is assigned a collection of partitions corresponding
to the source topics.
● A stream task may execute a complex processor topology that can have
multiple source topics and multiple sink topics
62. 4.10 Kafka Streams API
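import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
Properties props = new Properties();
// Assumed application id, broker address and String serdes; adjust for your cluster
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
StreamsBuilder builder = new StreamsBuilder();
// length() returns a primitive int, so convert with String.valueOf rather than .toString()
builder.<String, String>stream("my-input-topic")
    .mapValues(value -> String.valueOf(value.length())).to("my-output-topic");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();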
63. 4.11 Kafka Streams “MapReduce” Implementation
People will not call Kafka Streams a MapReduce framework, it is certainly more general, but at a high level it still helps to
consider it under the MapReduce abstraction, at least so we can compare to Spark / Hadoop
Function: Kafka Streams Implementation Details
Mapper: A consumer plus processing logic. Cannot usefully have more consumers than partitions of the input topic.
Processing logic is entirely controlled by the user, and thus threading here can be more complicated
and fine tuned than in the Hadoop and Spark worlds.
Shuffle: Effectively there is no Reduce Phase, or the Reducer always just flattens the Key -> Multiset[Value] back
out into Multiset[Key -> Value]. This is emulated by a producer producing to a partitioned Kafka
topic. A Shuffle-Reduce phase can output to multiple output datasets (topics), unlike Spark or
Hadoop, which naturally only output a single dataset.
Map Reduce Job & Program: A Map Reduce program in Kafka Streams consists of a series of Map-Shuffle-Map-Shuffle phases.
In Kafka Streams, all of the phases in all of the jobs can run at the same time. Furthermore they
share the same processes and threads. This is possible because it is entirely event driven.
Sometimes you may have a disconnected topology that makes sense to run on separate clusters.
64. 4.12 Joins in Kafka Streams
● Kafka Streams has much greater flexibility in how a join is performed. In a nutshell the
two options for full (non-windowed) joins consist of:
○ (Shuffle-like) Co-partitioning input topics A and B
○ (Broadcast-like) Using a GlobalKTable
● In the Shuffle-like option, Kafka will keep local key-value caches within stream
tasks corresponding to the partitions for that stream task. Kafka will use either
RocksDB (SSD based DB with in memory caches) or a HashMap.
● In the Broadcast-like option, one of the entire input topics is kept as a key-value
cache within every stream task.
● Note that in a Kafka join it’s always possible to engineer it such that each record
is shuffled only once (although multiple state lookups will occur)
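The Shuffle-like option can be sketched in plain Scala as a fold over events that keeps a local key-value cache. This is a toy model and does not use the Kafka Streams API: `TableUpdate`, `StreamRecord` and `join` are hypothetical names, and the single event sequence stands in for what one stream task would see from its co-partitioned input partitions.

```scala
object StreamTableJoin {
  sealed trait Event
  final case class TableUpdate(key: String, value: String) extends Event
  final case class StreamRecord(key: String, value: String) extends Event

  // Process events in arrival order, keeping the latest table value per key in a
  // local cache (a plain Map here, in place of RocksDB or a HashMap).
  def join(events: Seq[Event]): List[(String, String, String)] = {
    val start = (Map.empty[String, String], List.empty[(String, String, String)])
    val (_, joined) = events.foldLeft(start) {
      case ((table, out), TableUpdate(k, v)) => (table.updated(k, v), out)
      case ((table, out), StreamRecord(k, v)) =>
        table.get(k) match {
          case Some(tableValue) => (table, (k, v, tableValue) :: out)
          case None => (table, out) // inner join: no table entry yet, so nothing is emitted
        }
    }
    joined.reverse
  }
}
```

Each record is handled once as it arrives and joined against state that is already local, so nothing needs to be re-shuffled to get the latest view.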
65. 4.13 Benefits of Kafka
● Perfect for storing timeseries and event driven data
● Offers a stream first key-value-store first philosophy which is highly conducive
to application and algorithm development
● Can handle real time data
● Considerably more flexible parallelism model than Spark & Hadoop
● Shuffling operations (e.g. Joins and GroupBys) need not be performed over and
over again in order to get the latest view of information
● Errors are easier to debug since Kafka is essentially a library, not a framework.
The logic executes inside your application, not within someone else's
application. The logs and stack traces make sense.
66. 4.14 Drawbacks of Kafka
● (or Benefit?) Usually requires the user has a more detailed understanding of the
fundamental building blocks (e.g. consumers, producers) when compared to just
getting a Spark application to “work” (e.g. tasks, executors)
● (or Benefit?) Requires user to become more intimate with the DevOps of the
application, since the parallelism model is far more explicit
● Currently the only easy(ish) way to get up and running quickly is with Confluent
Cloud (whereas Spark has EMR, Google Dataproc, etc)
● Kafka does not yet have some high level DSLs that could allow for easier
DevOps
● (or Benefit? Here lies fun) Kafka does not yet have an (openly available) Machine
Learning library
69. 5.1 Critical Questions to Ask First
● Do you have high quality training data?
● What latency between train and predict do you really need? I.e. do you really
need real time training? What is the business case, what analysis has been done
to prove that real time will really earn more profit?
● How much data do you need to process after ETL/Cleaning? I.e. do you really
need to use parallel processing?
70. 5.2 Streaming ML Approaches
Incremental Algorithms: There are incremental versions of Support Vector
Machines, Bayesian Networks and Neural networks. Bayesian Networks can easily
be designed to run in parallel too.
Periodic Re-training with a batch algorithm: We simply buffer the relevant data
and retrain our model “every so often”.
https://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
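An incremental algorithm can be sketched with a categorical naive Bayes classifier, used here as the simplest Bayesian network (a swapped-in example, not a model named in the text). All names are hypothetical, and the add-one smoothing assumes each feature takes two possible values. The model is just a set of counts: one record updates them in place, and two models trained on different partitions merge by adding counts.

```scala
object IncrementalNaiveBayes {
  // Sufficient statistics: counts per label, and per (label, feature index, feature value).
  final case class Model(
      labelCounts: Map[String, Long] = Map.empty,
      featureCounts: Map[(String, Int, String), Long] = Map.empty
  ) {
    // Incremental training: one record updates the counts; nothing is retrained from scratch.
    def update(features: Seq[String], label: String): Model = {
      val withLabel = labelCounts.updated(label, labelCounts.getOrElse(label, 0L) + 1L)
      val withFeatures = features.zipWithIndex.foldLeft(featureCounts) { case (acc, (value, i)) =>
        val key = (label, i, value)
        acc.updated(key, acc.getOrElse(key, 0L) + 1L)
      }
      Model(withLabel, withFeatures)
    }

    // Counts add, so models trained on different partitions can be merged.
    def merge(that: Model): Model = Model(
      that.labelCounts.foldLeft(labelCounts) { case (acc, (k, n)) =>
        acc.updated(k, acc.getOrElse(k, 0L) + n)
      },
      that.featureCounts.foldLeft(featureCounts) { case (acc, (k, n)) =>
        acc.updated(k, acc.getOrElse(k, 0L) + n)
      }
    )

    // Log score with add-one smoothing (assumes two possible values per feature).
    def logScore(features: Seq[String], label: String): Double = {
      val n = labelCounts.getOrElse(label, 0L).toDouble
      val total = labelCounts.values.sum.toDouble
      val prior = math.log((n + 1.0) / (total + labelCounts.size))
      features.zipWithIndex.foldLeft(prior) { case (score, (value, i)) =>
        score + math.log((featureCounts.getOrElse((label, i, value), 0L) + 1.0) / (n + 2.0))
      }
    }

    def predict(features: Seq[String]): Option[String] =
      if (labelCounts.isEmpty) None
      else Some(labelCounts.keys.reduce { (a, b) =>
        if (logScore(features, a) >= logScore(features, b)) a else b
      })
  }
}
```

Periodic re-training, by contrast, would buffer the records and rebuild such a model from an empty `Model()` every so often.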
71. 5.3 Bayesian Networks are Awesome
● Work well for most business problems, fintech, adtech, retail, marketing. Only
use cases not really covered are rare in most enterprises (e.g. image, sound
recognition)
● Machine Learning rarely ever needs regression, since automated actions are
binary or categorical (e.g. send marketing email, or not)
● Consider discretizing continuous variables via Information Theory
(Kullback-Leibler, Shannon Entropy)
● Bayesian Networks are transparent since they are just a direct application of
Probability Theory and Information Theory. Therefore they are easy to
understand and maintain, and with a simple structure they are far less prone to overfitting.
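The two information-theoretic quantities named above can be computed in a few lines of Scala. This sketch only defines the measures; how they would drive the choice of bin boundaries when discretizing is not specified in the text and is left open. `InfoTheory`, `entropy`, `klDivergence` and `normalise` are hypothetical names.

```scala
object InfoTheory {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // Shannon entropy H(p) = -sum p_i log2 p_i, with 0 log 0 taken as 0.
  def entropy(p: Seq[Double]): Double =
    p.filter(_ > 0.0).map(x => -x * log2(x)).sum

  // Kullback-Leibler divergence D(p || q) = sum p_i log2(p_i / q_i).
  // Assumes q_i > 0 wherever p_i > 0.
  def klDivergence(p: Seq[Double], q: Seq[Double]): Double =
    p.zip(q).collect { case (pi, qi) if pi > 0.0 => pi * log2(pi / qi) }.sum

  // Turn bin counts into a probability distribution.
  def normalise(counts: Seq[Long]): Seq[Double] = {
    val total = counts.sum.toDouble
    counts.map(_ / total)
  }
}
```

One possible use, stated as an assumption: compare the label distribution inside each candidate bin against the overall label distribution with `klDivergence`, and prefer binnings whose bins differ most from the overall distribution.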
75. 6. Conclusions
● A badly engineered, badly designed and badly written ingestion & ETL
framework will result in low quality data and metadata.
● Most Data Science and Big Data applications do not need a database, nor even a
Hive cluster.
● If you as a Data Scientist do not get involved in the ingestion, ETL, engineering,
software development and architecture of a system, you will inevitably find
yourself spending your time on “99% preparation, 1% misinterpretation”
● In upcoming years Kafka and Kafka Streams will become the accepted industry
standard for building real time (and even batch) data driven applications
● In upcoming years significant development will be seen in writing Machine
Learning libraries that integrate easily with Kafka Streams
76. Exercises
These exercises are intentionally open ended to allow potentially hours of fun for
each.
1. (1 - 2 hours) Write down in (reasonably) formal mathematical notation the
mappers and reducers for a sorting algorithm.
2. (1 - 4 hours) For the Spark Code Examples 3.7 & 3.8, try to derive complexity
formulae for the approaches, where you can assume your favourite distributions
on the keys and values.
3. (1 - 10 days) Similarly, try randomly generating some data according to your
favourite distributions, producing a fully working Spark app and executing the
code on an EMR cluster. Compare the times, and plot how the times differ as the
input data sizes grow (or even use 3D plots to see how parameters of your
distribution affect the times too).