An Introduction to Data Intensive Computing

Appendix A: Amazon's Elastic MapReduce

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
Section A1
Hadoop Streaming

See http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
Basic Idea

•  With Hadoop Streaming you can run any program as the Mapper and the Reducer.
•  For example, you can run Python and Perl code.
•  You can also run standard Unix utilities.
•  With streaming, Mappers and Reducers use standard input and standard output.
Mappers for Streams

•  As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
•  The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
•  By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
•  This default can be changed.
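The default split described above can be sketched as follows (a hypothetical helper for illustration, not Hadoop code):

```python
def split_keyvalue(line):
    # Default Hadoop Streaming convention: the key is the text up to
    # the first tab; the value is the rest of the line (tab excluded).
    # A line with no tab yields the whole line as the key, empty value.
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
    else:
        key, value = line, ""
    return key, value
```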
Reducers for Streams

•  As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
•  The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
•  By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
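Because a streaming reducer sees its input sorted by key, records with the same key arrive consecutively. A sketch of a summing reducer built on that guarantee (our own illustration; run_reducer is not from the slides):

```python
from itertools import groupby

def run_reducer(lines):
    # Input lines are key-sorted, so equal keys are consecutive:
    # group them and sum their counts, WordCount-style.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    return [(key, sum(int(value) for _, value in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]
```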
Example

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

•  Here the Unix utilities cat and wc are the Mapper and Reducer.
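Because streaming Mappers and Reducers are just line filters, a job like the one above can be smoke-tested without a cluster. A rough Python stand-in for the map, sort (shuffle), reduce flow, with all names being our own illustration:

```python
def simulate_streaming(records, mapper, reducer):
    # Locally mimic a streaming job: map every input line, sort the
    # intermediate lines by key (as the shuffle phase does), reduce.
    intermediate = []
    for line in records:
        intermediate.extend(mapper(line))
    intermediate.sort()
    return reducer(intermediate)

def wc_mapper(line):
    return [word.lower() + "\t1" for word in line.split()]

def wc_reducer(lines):
    counts = {}
    for line in lines:
        key, value = line.split("\t", 1)
        counts[key] = counts.get(key, 0) + int(value)
    return counts
```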
Section A2
S3 Buckets
S3 Buckets

•  S3 bucket names must be unique across AWS.
•  A good practice is to use a pattern like
       tutorial.osdc.org/dataset1.txt
   for a domain you own.
•  The file is then referenced as
       tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
•  If you own osdc.org you can create a DNS CNAME entry to access the file as
       tutorial.osdc.org/dataset1.txt
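The bucket-as-hostname referencing scheme can be captured in a small helper (hypothetical; note that newer S3 endpoints may also include a region in the hostname):

```python
def s3_virtual_hosted_url(bucket, key):
    # Build the virtual-hosted-style S3 URL shown above:
    # <bucket>.s3.amazonaws.com/<key>
    return "http://%s.s3.amazonaws.com/%s" % (bucket, key)
```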
S3 Security

•  AWS Access Key (functions as your user name). This is an alphanumeric text string that uniquely identifies users.
•  AWS Secret Key (functions as your password).
[Screenshot: AWS Account Information, Access Keys page. The Access Key ID is the user name; the Secret Access Key is the password.]
Section A3
Using AWS Elastic MapReduce
Overview

1.  Upload input data to S3
2.  Create a job flow by defining the Map and Reduce steps
3.  Download output data from S3
Create New Elastic MR Job Flow
Custom Jobs

•  Amazon Elastic MR custom jobs can be written as a:
   – Custom JAR file
   – Streaming file
   – Pig program
   – Hive program
Step 1. Load Your Data Into an S3 Bucket

•  Amazon's Elastic MapReduce reads data from S3 and writes data to S3.
Step 1a. Create & Name the S3 Bucket
Step 1b. Upload Data Into the S3 Bucket

•  This can be done from the AWS Console.
•  This can also be done using command-line tools.
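For illustration, the upload step can also be scripted. The sketch below assumes a client exposing upload_file(filename, bucket, key), the signature provided by boto3's S3 client (an assumption on our part; the slides use the AWS Console or command-line tools for this step, and the bucket/key names are hypothetical):

```python
def upload_input(s3_client, local_path, bucket, key):
    # s3_client is assumed to behave like boto3.client("s3"):
    # upload_file(filename, bucket, key) copies the local file to S3.
    s3_client.upload_file(local_path, bucket, key)
```

Passing the client in as an argument keeps the helper testable without AWS credentials.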
Step 2a. Write a Mapper

#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":  # never raised; readline() returns "" at EOF
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 2b. Upload the Mapper to S3

•  This Mapper is already in S3 at this location:

   s3://elasticmapreduce/samples/wordcount/wordSplitter.py

   so we don't need to upload it.
Step 3a. Write a Reducer

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]
            fields = line.split("\t")
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except "end of file":
        return None
Step 3a. Write a Reducer (cont'd)

#!/usr/bin/python

import sys

def generateLongCountToken(id):
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]
            fields = line.split("\t")
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 3b. Upload the Reducer to S3

myAggregatorForKeyCount.py

•  This is a standard Reducer, part of the standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
Hadoop Library Aggregate

To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myAggregatorForKeyCount.py \
    -reducer aggregate \
    -file myAggregatorForKeyCount.py \
    -jobconf mapred.reduce.tasks=12
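What the built-in aggregate reducer does with the "LongValueSum:<key>\t<value>" records emitted by the mapper can be approximated in a few lines of Python (a simplified sketch of the Aggregate semantics, not the library's actual code):

```python
def aggregate_reduce(lines):
    # Strip the "LongValueSum:" aggregator prefix from each key and
    # sum the long values per key, as "-reducer aggregate" would.
    prefix = "LongValueSum:"
    totals = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key.startswith(prefix):
            key = key[len(prefix):]
        totals[key] = totals.get(key, 0) + int(value)
    return totals
```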
Step 4. Define the Job Flow
Step 4a. Specify Parameters
Step 4b. Configure EC2 Parameters

•  The default parameters work for this example.
Step 4c. Configure Bootstrap Actions

•  These include parameters for Hadoop, etc.
•  Here are the choices:
Step 4d. Review Configuration
Step 5. Launch Job Flow & Wait

… and wait …
Wait for Job

•  This job took 3 minutes.
Step 6. The Output Data is in S3

•  The output is in files labeled part-00000, part-00001, etc.
•  Recall we specified the bucket plus folders:

   tutorial.osdc.org/wordcount/output/2011-06-26
  
Step 6. Download the Data From S3

•  You can leave the data in S3 and work with it there.
•  You can download it with command-line tools:

   aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000

•  You can also download it with the AWS S3 Console.
Step 7. Remove Any Unnecessary Files

•  You will be charged for all files that remain in S3, so remove any unnecessary ones.
Questions?

For the most current version of these notes, see rgrossman.com

A Gen3 Perspective of Disparate DataRobert Grossman
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchRobert Grossman
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Robert Grossman
 

More from Robert Grossman (11)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Amazon EMR Streaming Guide

  • 1. An Introduction to Data Intensive Computing
    Appendix A: Amazon's Elastic MapReduce
    Robert Grossman, University of Chicago, Open Data Group
    Collin Bennett, Open Data Group
    November 14, 2011
  • 2. Section A1: Hadoop Streaming
    See http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
  • 3. Basic Idea
    • With Hadoop Streaming you can run any program as the Mapper and the Reducer.
    • For example, you can run Python and Perl code.
    • You can also run standard Unix utilities.
    • With streaming, Mappers and Reducers use standard input and standard output.
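The stdin/stdout contract described on this slide can be sketched as a short Python filter. This is a hypothetical example, not part of the deck: any program that reads lines from standard input and writes "key<TAB>value" lines to standard output can serve as a streaming Mapper or Reducer.

```python
import io

def map_lines(stream, out):
    # Emit each non-empty input line as "line<TAB>1",
    # the shape Hadoop Streaming expects on stdout.
    for line in stream:
        word = line.strip()
        if word:
            out.write(word + "\t1\n")

# In a real job this would be map_lines(sys.stdin, sys.stdout);
# here we demonstrate with an in-memory stream.
demo = io.StringIO()
map_lines(io.StringIO("apple\nbanana\n"), demo)
print(demo.getvalue())
```

The same skeleton works for a reducer; only the per-line logic changes.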
  • 4. Mappers for Streams
    • As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
    • The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
    • By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
    • This default can be changed.
  • 5. Reducers for Streams
    • As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
    • The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
    • By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
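The default key/value convention above can be illustrated with a small helper (a hypothetical sketch, not from the deck): split on the first tab only, so any later tabs stay in the value.

```python
def split_streaming_line(line):
    """Split one streaming line into (key, value) the default Hadoop way:
    the prefix up to the first tab is the key, the rest is the value."""
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)
    else:
        # A line with no tab becomes a key with an empty value.
        key, value = line, ""
    return key, value

print(split_streaming_line("apple\t3"))   # ('apple', '3')
print(split_streaming_line("a\tb\tc"))    # later tabs stay in the value: ('a', 'b\tc')
```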
  • 6. Example
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
    • Here the Unix utilities cat and wc are the Mapper and Reducer.
  • 7. Section A2: S3 Buckets
  • 8. S3 Buckets
    • S3 bucket names must be unique across AWS.
    • A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
    • The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
    • If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
  • 9. S3 Security
    • AWS access key (user name): this is your S3 user name, an alphanumeric text string that uniquely identifies users.
    • AWS secret key (functions as a password).
  • 11. Access Keys (screenshot labels: User Name, Password)
  • 12. Section A3: Using AWS Elastic MapReduce
  • 13. Overview
    1. Upload input data to S3.
    2. Create a job flow by defining Map and Reduce.
    3. Download output data from S3.
  • 14. Create New Elastic MR Job Flow
  • 15. Custom Jobs
    • Amazon Elastic MR custom jobs can be written as a:
      – Custom Jar File
      – Streaming File
      – Pig Program
      – Hive Program
  • 16. Step 1. Load Your Data Into an S3 Bucket
    • Amazon's Elastic MapReduce reads data from S3 and writes data to S3.
  • 17. Step 1a. Create & Name the S3 Bucket
  • 18. Step 1b. Upload Data Into the S3 Bucket
    • This can be done from the AWS Console.
    • It can also be done using command line tools.
  • 19. Step 2a. Write a Mapper
    #!/usr/bin/python
    import sys
    import re

    def main(argv):
        line = sys.stdin.readline()
        pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
        try:
            while line:
                for word in pattern.findall(line):
                    print "LongValueSum:" + word.lower() + "\t" + "1"
                line = sys.stdin.readline()
        except "end of file":
            return None

    if __name__ == "__main__":
        main(sys.argv)
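The slide's sample targets Python 2; here is a hedged Python 3 re-sketch of the same word-splitting mapper. The regex and the "LongValueSum:" record format come from the slide; the function name map_line is my own.

```python
import re

# Same tokenizer as the slide's mapper: a letter followed by alphanumerics.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def map_line(line):
    """Return the streaming output records for one input line:
    one 'LongValueSum:<word>\\t1' record per token, lowercased."""
    return ["LongValueSum:%s\t1" % w.lower() for w in WORD.findall(line)]

# In a real job these records would be written for each line of sys.stdin.
for record in map_line("The cat sat"):
    print(record)
```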
  • 20. Step 2b. Upload the Mapper to S3
    • This Mapper is already in S3 at this location:
      s3://elasticmapreduce/samples/wordcount/wordSplitter.py
    • So we don't need to upload it.
  • 21. Step 3a. Write a Reducer
    def main(argv):
        line = sys.stdin.readline()
        try:
            while line:
                line = line[:-1]
                fields = line.split("\t")
                print generateLongCountToken(fields[0])
                line = sys.stdin.readline()
        except "end of file":
            return None
  • 22. Step 3a. Write a Reducer (cont'd)
    #!/usr/bin/python
    import sys

    def generateLongCountToken(id):
        return "LongValueSum:" + id + "\t" + "1"

    def main(argv):
        line = sys.stdin.readline()
        try:
            while line:
                line = line[:-1]
                fields = line.split("\t")
                print generateLongCountToken(fields[0])
                line = sys.stdin.readline()
        except "end of file":
            return None

    if __name__ == "__main__":
        main(sys.argv)
  • 23. Step 3b. Upload Reducer to S3
    myAggregatorForKeyCount.py
    • This is a standard Reducer and part of a standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
  • 24. Hadoop Library Aggregate
    To use Aggregate, simply specify "-reducer aggregate":
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myAggregatorForKeyCount.py \
        -reducer aggregate \
        -file myAggregatorForKeyCount.py \
        -jobconf mapred.reduce.tasks=12
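What the built-in aggregate reducer does with these records can be approximated locally. This is a simplified sketch, not the actual Hadoop implementation: for each "LongValueSum:<key>" it sums the tab-separated values, which for the word-count job yields the per-word totals.

```python
from collections import defaultdict

def aggregate_long_value_sum(lines):
    """Sum tab-separated counts for records keyed 'LongValueSum:<word>'
    (a local approximation of '-reducer aggregate')."""
    totals = defaultdict(int)
    prefix = "LongValueSum:"
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key.startswith(prefix):
            totals[key[len(prefix):]] += int(value)
    return dict(totals)

sample = ["LongValueSum:the\t1", "LongValueSum:cat\t1", "LongValueSum:the\t1"]
print(aggregate_long_value_sum(sample))  # {'the': 2, 'cat': 1}
```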
  • 25. Step 4. Define the Job Flow
  • 26. Step 4a. Specify Parameters
  • 27. Step 4b. Configure EC2 Parameters
    • The default parameters work for this example.
  • 28. Step 4c. Configure Bootstrap Actions
    • These include parameters for Hadoop, etc. Here are the choices:
  • 29. Step 4d. Review the Configuration
  • 30. Step 5. Launch the Job Flow & Wait
    … and wait …
  • 31. Wait for the Job
    • This job took 3 minutes.
  • 32. Step 6. The Output Data is in S3
    • The output is in files labeled part-00000, part-00001, etc.
    • Recall we specified the bucket plus folders: tutorial.osdc.org/wordcount/output/2011-06-26
  • 33. Step 6. Download the Data From S3
    • You can leave the data in S3 and work with it there.
    • You can download it with command line tools:
      aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000
    • You can also download it with the S3 AWS Console.
  • 34. Step 7. Remove Any Unnecessary Files
    • You will be charged for all files that remain in S3, so remove any unnecessary ones.
  • 35. Questions?
    For the most current version of these notes, see rgrossman.com