6. Step0:
• AWS
• Elastic MapReduce 1
• S3
o Ruby s3sync
o http://s3sync.net/wiki
• elastic-mapreduce
o Amazon Ruby
o http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264
7. Step1:
• Wikipedia
o wget "http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
o bunzip2 jawiki-latest-pages-articles.xml.bz2
• Split the dump into parts
o 20000 <page> elements per part
o Hadoop Streaming worker
• S3
o ohkura-wikipedia:jawiki/articles/part-00000, 00001, ...
o EC2
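The splitting step can be sketched as follows. This is a minimal sketch, not the script used for the slides: it assumes the dump is cut into parts of 20000 <page> elements each, and that <page> and </page> appear on their own lines, as they do in the Wikipedia XML dumps. The names split_pages and write_parts and the "part-" output prefix are hypothetical.

```python
import io


def split_pages(lines, pages_per_part=20000):
    """Yield lists of lines, each holding up to pages_per_part <page> elements.

    Lines outside any <page>...</page> element (XML header, <siteinfo>,
    closing </mediawiki>) are dropped, since the mapper only needs page text.
    """
    part, count = [], 0
    in_page = False
    for line in lines:
        if "<page>" in line:
            in_page = True
        if in_page:
            part.append(line)
        if "</page>" in line:
            in_page = False
            count += 1
            if count == pages_per_part:
                yield part
                part, count = [], 0
    if part:
        # Last, possibly shorter, part
        yield part


def write_parts(dump_path, out_prefix="part-", pages_per_part=20000):
    """Write part-00000, part-00001, ... next to the current directory."""
    with io.open(dump_path, encoding="utf-8") as dump:
        for i, part in enumerate(split_pages(dump, pages_per_part)):
            with io.open("%s%05d" % (out_prefix, i), "w", encoding="utf-8") as out:
                out.writelines(part)
```

Calling write_parts("jawiki-latest-pages-articles.xml") would produce files named like the part-00000, 00001, ... objects listed above, ready to be uploaded to S3.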
9. Step2:
Mapper
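import re
import sys

# Match the target of a wiki link: [[Target]], [[Target|label]] or [[Target#section]]
link_pat = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
for line in sys.stdin:
    for link in link_pat.findall(line):
        if ":" not in link:  # skip namespaced links such as Category: or File:
            # The "LongValueSum:" prefix asks the aggregate reducer to sum the values per key
            print("LongValueSum:%s\t1" % link)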
Reducer
aggregate (Hadoop's built-in Reducer)
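To check the mapper before launching a job flow, the map and reduce steps can be simulated locally. The sketch below is only an approximation of what the aggregate reducer does with LongValueSum records (it sums the values per key and drops the prefix); it is not Hadoop's implementation. The function names map_line and aggregate_long_value_sum are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as the mapper: capture the link target up to "]", "|" or "#"
LINK_PAT = re.compile(r"\[\[([^\]|#]*?)[\]|#]")
PREFIX = "LongValueSum:"


def map_line(line):
    """Return the records the mapper would print for one input line."""
    return [
        "%s%s\t1" % (PREFIX, link)
        for link in LINK_PAT.findall(line)
        if ":" not in link  # skip namespaced links such as Category:
    ]


def aggregate_long_value_sum(records):
    """Sum the integer values per key, mimicking LongValueSum locally."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.split("\t", 1)
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)
```

Feeding a few lines of wiki text through map_line and then aggregate_long_value_sum gives a dictionary from link target to the number of times it is linked, which is the count the job computes over the whole dump.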