13. File Format
• Easy to grep / read from the command line.
• Server is easy to implement & maintain.
• Very fast thanks to the index. Very sparse though.
• Disk space is not really an issue here. We can always get rid of old indexes.
• Problem?
• It takes more time to generate the index than to create the Data File in Hadoop.
• Like... 6 times more.
[Figure: an Index File next to a Data File. The Index File has slots labelled "Key1 x Size", "Key2 x Size", ..., "KeyN x Size", with runs of 0-filled slots between them. The Data File lists one "Key Value" pair per line, grouped by key: "Key1 Value 1" to "Key1 Value 5", "Key2 Value 1" to "Key2 Value 6", ..., "KeyN Value 1" to "KeyN Value 7".]
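The index/data file pair pictured above can be sketched roughly as follows. This is an illustration only, not the original server code: it assumes that "Key x Size" means the slot for an integer key sits at byte offset key × slot size, that each slot holds one 8-byte offset into the Data File, and that 0 marks an unused slot (so offsets are stored plus one). The class and method names (`SparseIndexSketch`, `write`, `lookup`, `demo`) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SparseIndexSketch {
    // Assumption: one 8-byte slot per possible key; a slot value of 0 means "no such key".
    static final int SLOT_SIZE = Long.BYTES;

    /** Writes sorted "key value" lines plus a fixed-slot index. Keys must be non-negative, rows non-empty. */
    static void write(Path dataFile, Path indexFile, TreeMap<Integer, List<String>> rows) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        // allocate() zero-fills the buffer, so slots of missing keys stay 0 (the sparse part).
        ByteBuffer index = ByteBuffer.allocate((rows.lastKey() + 1) * SLOT_SIZE);
        for (Map.Entry<Integer, List<String>> e : rows.entrySet()) {
            // Store (offset of the key's first line) + 1, so that offset 0 is not confused with "absent".
            index.putLong(e.getKey() * SLOT_SIZE, data.size() + 1L);
            for (String value : e.getValue()) {
                byte[] line = (e.getKey() + " " + value + "\n").getBytes(StandardCharsets.UTF_8);
                data.write(line, 0, line.length);
            }
        }
        try {
            Files.write(dataFile, data.toByteArray());
            Files.write(indexFile, index.array());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** One seek in the index, one seek in the data file, then a short sequential read. */
    static List<String> lookup(Path dataFile, Path indexFile, int key) {
        List<String> values = new ArrayList<>();
        try (RandomAccessFile index = new RandomAccessFile(indexFile.toFile(), "r");
             RandomAccessFile data = new RandomAccessFile(dataFile.toFile(), "r")) {
            long slot = (long) key * SLOT_SIZE;
            if (key < 0 || slot + SLOT_SIZE > index.length()) {
                return values; // key lies outside the index
            }
            index.seek(slot);
            long stored = index.readLong();
            if (stored == 0) {
                return values; // empty slot
            }
            data.seek(stored - 1);
            String prefix = key + " ";
            String line;
            // readLine() decodes bytes one-to-one, which is fine for the ASCII used here.
            while ((line = data.readLine()) != null && line.startsWith(prefix)) {
                values.add(line.substring(prefix.length()));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return values;
    }

    /** Builds a tiny two-key data set in temporary files and looks up one key. */
    static List<String> demo(int key) {
        try {
            Path data = Files.createTempFile("data", ".txt");
            Path index = Files.createTempFile("index", ".bin");
            TreeMap<Integer, List<String>> rows = new TreeMap<>();
            rows.put(1, List.of("a", "b"));
            rows.put(4, List.of("c"));
            write(data, index, rows);
            List<String> found = lookup(data, index, key);
            Files.delete(data);
            Files.delete(index);
            return found;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(1));
        System.out.println(demo(2));
    }
}
```

The data file stays greppable text, and a lookup costs two seeks. The price is an index with one slot per possible key, used or not, which is the sparseness the slide mentions.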
17. Requirements for the new file format:
• Binary:
– So it is smaller.
– Store Thrift serialized data.
• Compression friendly.
• Self indexed:
– We do not want an index file anymore.
• Hadoop friendly:
– Generated in Hadoop; we don’t want to preprocess it before serving.
• Java/C++/Python friendly:
– These are the languages used in the Data and M.I.R. teams.
– Yeah, we still use C++.
18. HFile:
[Figure: HFile layout, by Schubert Zhang, http://cloudepr.blogspot.com. From the start of the file to the end:
  Data Block 0, Data Block 1, Data Block 2, ...: DATA BLOCK MAGIC (8B), then Key-Value (First) ... Key-Value (Last). Each Key-Value is KeyLen (int), ValLen (int), Key (byte[]), Value (byte[]).
  Meta Block 0, Meta Block 1, ... (optional): user defined metadata, start with METABLOCKMAGIC.
  File Info: Size or ItemsNum (int), then entries of KeyLen (vint), Key (byte[]), id (1B), ValLen (vint), Val (byte[]). Entries shown: LASTKEY (byte[]), AVG_KEY_LEN (int), AVG_VALUE_LEN (int), COMPARATOR (className), user defined.
  Data Index: INDEX BLOCK MAGIC (8B), then Index of Data Block 0, ... Each entry is Offset (long), DataSize (int), KeyLen (vint), Key (byte[]).
  Meta Index (optional): INDEX BLOCK MAGIC (8B), then Index of Meta Block 0, ... Each entry is Offset (long), MetaSize (int), MetaNameLen (vint), MetaName (byte[]).
  Trailer: Fixed File Trailer (shown in a separate picture).]
• Based on Google’s SSTable (From Bigtable)
• Keys and Values are byte strings.
• Keys are ordered.
• Sequence of blocks.
• Block index loaded into memory.
• Can be queried with: hbase org.apache.hadoop.hbase.io.hfile.HFile
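The bullets above (ordered keys, a sequence of blocks, a block index held in memory) can be illustrated with a small in-memory sketch. It is not HBase code and does not reproduce the HFile byte layout. It only shows the lookup idea: find the one block whose first key is the greatest key not larger than the target, then scan that block. `BlockIndexSketch`, `Block`, `get` and `blockCount` are hypothetical names, and string keys stand in for byte strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BlockIndexSketch {
    /** One block: a run of key/value pairs, sorted by key. */
    static final class Block {
        final List<String[]> entries = new ArrayList<>();
    }

    // Block index kept in memory: first key of each block -> block.
    // In a real file the index would point to an offset and size instead of an object.
    private final TreeMap<String, Block> index = new TreeMap<>();

    /** Splits already-sorted entries into blocks of at most blockSize pairs. */
    BlockIndexSketch(TreeMap<String, String> sorted, int blockSize) {
        Block current = null;
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (current == null || current.entries.size() == blockSize) {
                current = new Block();
                index.put(e.getKey(), current);
            }
            current.entries.add(new String[] {e.getKey(), e.getValue()});
        }
    }

    /** Returns the value stored for key, or null if the key is absent. */
    String get(String key) {
        // The only block that can contain the key starts at the greatest first key <= key.
        Map.Entry<String, Block> candidate = index.floorEntry(key);
        if (candidate == null) {
            return null; // key sorts before the first block
        }
        for (String[] kv : candidate.getValue().entries) {
            if (kv[0].equals(key)) {
                return kv[1];
            }
        }
        return null;
    }

    int blockCount() {
        return index.size();
    }

    public static void main(String[] args) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e", "f", "g"}) {
            sorted.put(k, k.toUpperCase());
        }
        BlockIndexSketch file = new BlockIndexSketch(sorted, 3);
        System.out.println(file.blockCount() + " blocks, e -> " + file.get("e"));
    }
}
```

Because only the first key of each block is indexed, the index stays small enough to load into memory, and no separate index file is needed.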
19. HFile:
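import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.io.hfile.SimpleBlockCache;
import org.apache.hadoop.hbase.util.Bytes;

// Create an HFile reader from a file.
HFile.Reader reader = new HFile.Reader(fs,
    filePath, new SimpleBlockCache(), true);
// Load its info into memory.
reader.loadFileInfo();
// Get a scanner.
HFileScanner scan = reader.getScanner(true, true);
// Create the key we are interested in.
KeyValue kvKey = new KeyValue(Bytes.toBytes(key),
    Bytes.toBytes("f"), ...);
// Check if the key is in the file: seekTo() returns 0 only on an exact match.
if (0 != scan.seekTo(kvKey.getKey())) {
    log.error("Couldn't find the key");
} else {
    // getValue() returns a byte[]; convert it before logging.
    log.info("Value: " +
        Bytes.toStringBinary(scan.getKeyValue().getValue()));
}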
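The `0 != scan.seekTo(...)` check above relies on the scanner's three-way return value. As far as the HFileScanner documentation describes it, seekTo returns -1 when the key sorts before the first key, 0 on an exact match, and 1 when the scanner is left on the closest smaller key. The following stand-in mimics that contract over a sorted array. It is a sketch under that assumption, not HBase code, and `SeekToSketch` and `current` are hypothetical names.

```java
import java.util.Arrays;

class SeekToSketch {
    private final String[] sortedKeys;
    private int position = -1; // -1 means "not positioned"

    SeekToSketch(String... sortedKeys) {
        this.sortedKeys = sortedKeys.clone();
    }

    /**
     * Returns -1 if key sorts before the first key (no position),
     * 0 on an exact match, and 1 if positioned on the greatest key smaller than key.
     */
    int seekTo(String key) {
        int i = Arrays.binarySearch(sortedKeys, key);
        if (i >= 0) {
            position = i;
            return 0;
        }
        int insertionPoint = -i - 1; // index of the first key greater than key
        if (insertionPoint == 0) {
            position = -1;
            return -1;
        }
        position = insertionPoint - 1;
        return 1;
    }

    /** Key under the cursor, or null if the scanner is not positioned. */
    String current() {
        return position < 0 ? null : sortedKeys[position];
    }

    public static void main(String[] args) {
        SeekToSketch scan = new SeekToSketch("b", "d", "f");
        System.out.println(scan.seekTo("d") + " at " + scan.current());
        System.out.println(scan.seekTo("e") + " at " + scan.current());
        System.out.println(scan.seekTo("a") + " at " + scan.current());
    }
}
```

Under this contract, treating any non-zero result as "not found" is the right test for a point lookup, since 1 only means the scanner landed near the key.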
28. We are hiring! (http://www.last.fm/about/jobs)
Data Scientist
Purpose & Background of Role
We're seeking two top notch data scientists with strong programming skills to join the
small and very enthusiastic data and recommendations team at Last.fm. These two
positions are full-time and based in London.
Are you a superb data analyst as well as a hands-on implementer that understands the
trade-offs of the memory hierarchy and is able to work around constraints in disk speed,
memory size and CPU cycles? Are you familiar with all common data structures and their
complexity? Do you take pride in being clever and solving difficult problems creatively?
Are you full of ideas and always looking for new ways of making use out of data? Are you
an advocate for data-driven development and fully capable of conducting a proper A/B
test? Do you love music?
Requirements:
• Solid background in statistics and computer science
• Highly fluent in Python and either C++ or Java (or both)
• Comfortable with the Unix CLI and shell scripting
• Passion for machine learning and data visualisation
• Proficient with databases, both relational and non-relational
• Experience with Hadoop and analysing terabyte-scale datasets
• Familiar with data-driven development and split testing
• Basic understanding of common web technologies
• Track record in music information retrieval research is a plus