11. Associative + Commutative
Operations
• Associative: 1 + (2 + 3) = (1 + 2) + 3
• Commutative: 1 + 2 = 2 + 1
• Allows us to parallelize our reduce (for
instance locally in combiners)
• Applies to many operations, not just
integer addition.
• Spoiler: Key to incremental aggregations
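A minimal sketch of why associativity lets a reduce be parallelized: each chunk is reduced on its own (much like a local combiner), and the partial results are then combined. The names `chunked` and `parallel_reduce` and the chunk size are illustrative assumptions, not taken from any framework mentioned in these slides.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def chunked(values, size):
    """Split values into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_reduce(op, values, identity, chunk_size=3):
    """Reduce each chunk independently, then combine the partial results.

    Hypothetical helper: correct only if `op` is associative and
    `identity` is its identity element.
    """
    chunks = chunked(values, chunk_size)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in input order, so only associativity
        # is needed here.
        partials = list(pool.map(lambda c: reduce(op, c, identity), chunks))
    return reduce(op, partials, identity)


numbers = list(range(1, 11))
total = parallel_reduce(operator.add, numbers, 0)
largest = parallel_reduce(max, numbers, float("-inf"))
```

Because `pool.map` preserves chunk order, associativity alone is enough in this sketch. Commutativity is what would additionally allow partial results to be combined in whatever order they happen to finish.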
12. [Diagram: set union applied pairwise as a reduction tree]
{a, b} + {b, c} = {a, b, c}
{a, c} + {a} = {a, c}
{a, b, c} + {a, c} = {a, b, c}
We can also parallelize the “addition” of other types, like Sets, as
Set Union is associative
13. Monoid Interface
• Abstract Algebra provides a formal foundation for
what we can casually observe.
• Don’t be thrown off by the name; just think of it as
another trait/interface.
• Monoids provide a critical abstraction to treat
aggregations of different types in the same way
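As a rough illustration of treating different aggregations the same way, a monoid can be sketched as an identity value plus an associative combine function. The `Monoid` class, its field names `zero` and `plus`, and the `aggregate` helper are assumptions made for this example and are not the interface of any particular library.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """Hypothetical monoid interface: an identity and an associative op."""
    zero: T
    plus: Callable[[T, T], T]


def aggregate(monoid: Monoid[T], values: Iterable[T]) -> T:
    """Fold any values with any monoid; the caller never sees the type."""
    return reduce(monoid.plus, values, monoid.zero)


# Three different aggregations behind the same interface.
int_sum = Monoid(0, lambda x, y: x + y)
set_union = Monoid(frozenset(), lambda x, y: x | y)
int_max = Monoid(float("-inf"), max)

sum_result = aggregate(int_sum, [1, 2, 3])
union_result = aggregate(
    set_union,
    [frozenset("ab"), frozenset("bc"), frozenset("ac"), frozenset("a")],
)
max_result = aggregate(int_max, [4, 9, 2])
```

The `aggregate` function is written once and works unchanged for sums, set unions, and maxima, which is the abstraction the slide describes.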
20. Requirements and Tradeoffs
Query Latency: milliseconds | seconds | minutes
Toward milliseconds:
• Results are pre-computed
• Requires compute and storage resources
• Supported queries must be known in advance
Toward minutes:
• Results are computed at query time
• No resources used except for executed queries
• Ad-hoc queries
21. Requirements and Tradeoffs
Number of Users: large | many | few
Toward large:
• Resources required per query must be small
• Requires scalable query handling/storage
Toward few:
• Queries can be expensive
22. Requirements and Tradeoffs
Freshness of Results: seconds | minutes | hours
Toward seconds:
• May require streaming platform in addition to batch
• Smaller, more frequent updates are more work
Toward hours:
• Single batch platform
• Less frequent computation
23. Requirements and Tradeoffs
Amount of Data: billions | millions | thousands
Toward billions:
• Requires parallelized computation and storage
Toward thousands:
• Single server is sufficient
24. SQL on Hadoop
• Impala, Hive, SparkSQL
[Chart: position on the four requirement axes: Query Latency (milliseconds, seconds, minutes), # of Users (large, many, few), Freshness (seconds, minutes, hours), Data Size (billions, millions, thousands)]
25. Batch Jobs
• Spark, Hadoop MapReduce
[Chart: position on the four requirement axes: Query Latency (milliseconds, seconds, minutes), # of Users (large, many, few), Freshness (seconds, minutes, hours), Data Size (billions, millions, thousands)]
Chart annotation: Dependent on where you put the job’s output
26. Online Incremental Systems
• Twitter’s Summingbird [PA1, C4], Google’s Mesa [PA2],
Koverse’s Aggregation Framework
[Chart: position on the four requirement axes: Query Latency (milliseconds, seconds, minutes), # of Users (large, many, few), Freshness (seconds, minutes, hours), Data Size (billions, millions, thousands); markers labeled S, M, and K]
27. Online Incremental Systems:
Common Components
• Aggregations are computed/reduced
incrementally via associative operations
• Results are mostly pre-computed, so
queries are inexpensive
• Aggregations, keyed by dimensions, are
stored in a low-latency, scalable key-value
store
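The three components above can be sketched together in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for the key-value store, the dimensions (country, day) and the class name `IncrementalAggregator` are hypothetical, and it does not show how Summingbird, Mesa, or Koverse are actually implemented.

```python
import operator


class IncrementalAggregator:
    """Hypothetical sketch: merge batches into stored aggregates."""

    def __init__(self, plus, zero):
        self.plus = plus    # associative combine function
        self.zero = zero    # identity element for `plus`
        self.store = {}     # stand-in for a scalable key-value store

    def ingest(self, batch):
        """Reduce a batch locally, then merge it into the stored results."""
        partial = {}
        for key, value in batch:
            partial[key] = self.plus(partial.get(key, self.zero), value)
        # Associativity makes it safe to add the batch's partial result to
        # the previously stored aggregate instead of recomputing everything.
        for key, value in partial.items():
            self.store[key] = self.plus(self.store.get(key, self.zero), value)

    def query(self, key):
        """A query is a single lookup because results are pre-computed."""
        return self.store.get(key, self.zero)


# Event counts keyed by the (country, day) dimensions.
counts = IncrementalAggregator(operator.add, 0)
counts.ingest([
    (("US", "2015-01-01"), 1),
    (("US", "2015-01-01"), 1),
    (("DE", "2015-01-01"), 1),
])
counts.ingest([(("US", "2015-01-01"), 1)])

us_count = counts.query(("US", "2015-01-01"))
de_count = counts.query(("DE", "2015-01-01"))
missing_count = counts.query(("FR", "2015-01-01"))
```

Each call to `ingest` touches only the keys present in the new batch, while `query` never scans raw events, which is why query cost stays low as data grows.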