Presentation of the paper Compressed Linear Algebra for Large Scale Machine Learning. The authors propose a scheme for matrix compression specifically designed for machine learning data matrices and define matrix operations on the compressed representation.
1. Compressed Linear Algebra for Large Scale Machine Learning
Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald
IBM Research - Almaden; San Jose, CA, USA
University of Maryland; College Park, MD, USA
Presented by: Issa Memari
25/1/2018
7. Solution: Fit more data into memory
Data → Compression algorithm → Compressed data blocks
Compressed data block → Decompression algorithm → Data block → Machine learning algorithm
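The block-wise pipeline above can be sketched with a general-purpose block compressor (zlib here, purely for illustration; CLA itself uses value-based column encodings rather than byte-level compression):

```python
import pickle
import zlib

# Toy data matrix stored as row blocks (lists of rows).
blocks = [[[1.0, 2.0], [1.0, 2.0]], [[3.0, 4.0], [3.0, 4.0]]]

# Compression algorithm: each block is compressed independently,
# so the full data set never has to sit in memory uncompressed.
compressed_blocks = [zlib.compress(pickle.dumps(b)) for b in blocks]

# Machine learning algorithm: decompress one block at a time and
# feed it to the (here trivial) learner, which sums all entries.
total = 0.0
for cb in compressed_blocks:
    block = pickle.loads(zlib.decompress(cb))
    total += sum(sum(row) for row in block)

print(total)  # 20.0
```

Only one decompressed block is live at any point, which is the memory benefit the slide illustrates.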
34. Compression planning
Compression planning involves three tasks:
1. Estimating column compression ratios
2. Partitioning columns into groups
3. Choosing the encoding format for each group
35. Estimating column compression ratios
Instead of scanning the full data matrix, estimate the compression parameters (e.g., the number of distinct values and runs per column) from a small random sample of the rows
36. Partitioning columns into groups
1. Enumerating all possible partitions is infeasible: the number of partitions of n columns is the Bell number, e.g. Bell(13) = 4,213,597.
2. Greedy brute force over candidate group merges.
3. Bin packing followed by greedy brute force within each bin.
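Option 2 above can be sketched as follows; this is a toy illustration, and `est_size` is a hypothetical size-estimate callback, not the paper's estimator:

```python
from itertools import combinations

def greedy_group(columns, est_size):
    """Greedily partition column indices into groups.

    est_size(group) -> estimated compressed size of a tuple of columns.
    Start from singleton groups and repeatedly apply the merge with the
    largest estimated size reduction, until no merge helps.
    """
    groups = [(c,) for c in columns]
    while True:
        best, best_gain = None, 0.0
        for a, b in combinations(groups, 2):
            gain = est_size(a) + est_size(b) - est_size(a + b)
            if gain > best_gain:
                best, best_gain = (a, b), gain
        if best is None:
            return groups
        a, b = best
        groups = [g for g in groups if g not in (a, b)] + [a + b]

# Toy estimate: correlated columns 0 and 1 compress better together.
def est_size(group):
    if set(group) == {0, 1}:
        return 12.0   # cheaper together than 10 + 10 apart
    return 10.0 * len(group)

groups = greedy_group([0, 1, 2], est_size)
```

Each merge step is O(g²) in the number of current groups, which is why the paper prunes the search space first (option 3).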
37. Choosing the encoding format for each group
1. Scan the data matrix and compute the actual compressed sizes for the chosen groups.
2. For each group, take the compressed size as the minimum of the OLE and RLE sizes.
3. If a group is incompressible, repeatedly remove the column with the largest estimated compressed size
until the group becomes compressible or empty.
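The correction loop above (steps 2 and 3) might look like this in outline; `plan_group` and its size callbacks are hypothetical names for illustration, not the paper's API:

```python
def plan_group(group, ole_size, rle_size, uncompressed_size, est_size):
    """Pick an encoding for a column group, dropping columns if needed.

    ole_size / rle_size / uncompressed_size map a tuple of column indices
    to a size in bytes; est_size gives a per-column size estimate.
    """
    group = list(group)
    while group:
        candidates = {"OLE": ole_size(tuple(group)), "RLE": rle_size(tuple(group))}
        fmt = min(candidates, key=candidates.get)
        if candidates[fmt] < uncompressed_size(tuple(group)):
            return fmt, tuple(group)
        # Incompressible: drop the column with the largest estimated size.
        group.remove(max(group, key=est_size))
    return None, ()

# Toy sizes: the group (0, 1) is RLE-compressible.
fmt, g = plan_group(
    (0, 1),
    ole_size=lambda g: 100.0 * len(g),
    rle_size=lambda g: 30.0 * len(g),
    uncompressed_size=lambda g: 80.0 * len(g),
    est_size=lambda c: 10.0 * (c + 1),
)
```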
In fact, if we encode these two columns individually using OLE, we end up with an encoding that contains 20 values; the first column is even incompressible (its encoded size exceeds its uncompressed size).
Let us now look at how the encoded data is actually stored in memory; this will be useful for understanding how to compute the size of a compressed column group.
B_ij is the number of segments of tuple t_ij
Z_i is the total number of offsets in the column group
D_i is the number of distinct tuples
|G_i| is the number of columns in the group
Size in bytes for an OLE-encoded group of columns and for an RLE-encoded group of columns
R_ij is the number of runs for tuple t_ij
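With the quantities defined above, the two group sizes can be written in the following form. This is a sketch: the structure (per-group dictionary of distinct tuples plus offset data) follows from the definitions, but the byte constants — 8-byte floating-point values, 2-byte offsets and segment headers, runs stored as 2-byte start/length pairs — are assumptions here, not taken from the paper:

```latex
% Assumed layout: 8 bytes per value, 2-byte offsets/headers, 4 bytes per run.
S_{\mathrm{OLE}}(G_i) \approx 8\, D_i\, |G_i| \;+\; 2 \Big( Z_i + \sum_{j=1}^{D_i} B_{ij} \Big)
\qquad
S_{\mathrm{RLE}}(G_i) \approx 8\, D_i\, |G_i| \;+\; 4 \sum_{j=1}^{D_i} R_{ij}
```

Whichever of the two is smaller for a group determines the encoding chosen in the planning step above.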