4. Data set
• Stanford SNAP Amazon reviews: 35GB, 35M reviews
• University of Illinois Amazon member info: 142MB, member location information
Example member record: joeme 92 5/26 Cleveland, OH United States Joseph M. Kotow B00006HAXW OH
7. Pipeline
ImportTsv
• SNAP REVIEWS in 10 rows per review
• UIC MEMBER LOCATION
Sample row: B00006HAXW Rock Rhythm & Doo Wop Greatest Early Rock unknown A1RSDHE9a-ppyBase N6RSZF Joseph M Kotow 9/9 5.0 1042502400 Pittsburgh – Home of the OLDIES I have all of the doo wop DVD’s and this one is as good or better than the 1st ones. Rem…
8. Pipeline
PIG to CLEAN, JOIN and AGGREGATE rating reviews and totals
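The clean, join, and aggregate step can be illustrated outside Pig. The following is a minimal Python sketch, not the Pig script used in the project: the tuple layout `(product_id, user_id, score)`, the `member_state` lookup, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_by_product_state(reviews, member_state):
    """Join reviews to member state, then count reviews and average ratings.

    reviews: iterable of (product_id, user_id, score) tuples (hypothetical layout).
    member_state: dict mapping user_id -> state code (hypothetical layout).
    """
    # (product_id, state) -> [review count, sum of scores]
    totals = defaultdict(lambda: [0, 0.0])
    for product_id, user_id, score in reviews:
        state = member_state.get(user_id)
        if state is None:
            continue  # clean: drop reviews whose author has no known location
        bucket = totals[(product_id, state)]
        bucket[0] += 1
        bucket[1] += float(score)
    # Turn the running sums into (total reviews, average rating) pairs.
    return {key: (count, total / count) for key, (count, total) in totals.items()}


# Hypothetical sample data, not taken from the data sets.
reviews = [
    ("P1", "u1", 5.0),
    ("P1", "u2", 3.0),
    ("P1", "u3", 4.0),
    ("P2", "u1", 2.0),
    ("P1", "u9", 1.0),  # u9 has no location and is dropped
]
member_state = {"u1": "OH", "u2": "OH", "u3": "CA"}
result = aggregate_by_product_state(reviews, member_state)
```

Here `result[("P1", "OH")]` holds the total review count and average rating for product P1 among Ohio members.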
10. HBase Schema
Table Schemas:
• PRODUCTID_STATE, TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYYEAR_EPOCH, TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYMONTH_EPOCH, TOTAL REVIEWS, AVG RATING
• PRODUCTID_STATE_BYDAY_EPOCH, TOTAL REVIEWS, AVG RATING
Example: B00003CWT6_CA_BYMONTH_1008115200000
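A minimal Python sketch of how keys in this format could be composed and taken apart, assuming the underscore-joined strings above serve as HBase row keys and that product IDs and state codes contain no underscores. The function names `make_row_key` and `parse_row_key` are hypothetical, and the sketch does not cover how the epoch value for each bucket is derived.

```python
BUCKETS = ("BYYEAR", "BYMONTH", "BYDAY")


def make_row_key(product_id, state, bucket=None, epoch_ms=None):
    """Build PRODUCTID_STATE or PRODUCTID_STATE_<BUCKET>_EPOCH."""
    if bucket is None:
        return f"{product_id}_{state}"
    if bucket not in BUCKETS:
        raise ValueError(f"unknown bucket: {bucket}")
    if epoch_ms is None:
        raise ValueError("a time bucket needs an epoch value")
    return f"{product_id}_{state}_{bucket}_{epoch_ms}"


def parse_row_key(key):
    """Split a key into its parts (assumes no underscores inside the parts)."""
    parts = key.split("_")
    if len(parts) == 2:
        return {"product_id": parts[0], "state": parts[1]}
    if len(parts) == 4 and parts[2] in BUCKETS:
        return {
            "product_id": parts[0],
            "state": parts[1],
            "bucket": parts[2],
            "epoch_ms": int(parts[3]),
        }
    raise ValueError(f"unrecognized key: {key}")


key = make_row_key("B00003CWT6", "CA", "BYMONTH", 1008115200000)
parsed = parse_row_key(key)
```

Because HBase keeps rows sorted by row key, keys that share a `PRODUCTID_STATE_BYMONTH_` prefix sit next to each other, which is what makes a range scan over one product and state practical.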
11. Retrospective
Design Considerations
• HBase was chosen because it is optimized for reads
and range scans, and because it scales
• Data was bucketed by state and by different time
intervals to improve query performance: aggregates
are precomputed instead of recalculated per query,
at the expense of storage
• Java MR was used to convert multi-row
reviews to tabular format
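The multi-row to tabular conversion can be sketched in a few lines. This is plain Python rather than the Java MapReduce job the project used, and it assumes each review arrives as a fixed number of consecutive `name: value` lines with blank lines between reviews; the field names in the sample and the function name `records_to_tsv` are hypothetical.

```python
def records_to_tsv(lines, fields_per_record=10):
    """Collapse groups of "name: value" lines into tab-separated rows."""
    rows, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines only separate records
        # Keep the text after the first ": "; the field name is dropped.
        _, _, value = line.partition(": ")
        # Tabs inside a value would break the column layout, so replace them.
        current.append(value.replace("\t", " "))
        if len(current) == fields_per_record:
            rows.append("\t".join(current))
            current = []
    return rows


# Hypothetical three-field records, shortened to keep the example small.
sample = [
    "product/productId: P1",
    "review/score: 5.0",
    "review/summary: Great",
    "",
    "product/productId: P2",
    "review/score: 2.0",
    "review/summary: Not for me",
]
rows = records_to_tsv(sample, fields_per_record=3)
```

Each element of `rows` is one review on a single tab-separated line, the shape a TSV bulk loader such as ImportTsv expects.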
Future
• Scrape Amazon for new reviews
• Filter and display reviews
12. About me – Andy Lai
UC Berkeley (B.S. Electrical Engineering
& Computer Science)
SJSU (M.S. Engineering)
Software Engineer (DB2, Relational
database)
Interests: