Intro To Cascading

1. Introduction to Cascading

4. What is Cascading?
• abstraction over MapReduce
• API for data-processing workflows

8. Why use Cascading?
• MapReduce can be:
  • the wrong level of granularity
  • cumbersome to chain together

13. Why use Cascading?
• Cascading helps create
  • higher-level data processing abstractions
  • sophisticated data pipelines
  • reusable components

14. Credits
• Cascading written by Chris Wensel
• based on his user guide: http://bit.ly/cascading

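15. Wordcount example in Cascading

    package cascadingtutorial.wordcount;

    // Imports follow the Cascading 1.x package layout.
    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.Scheme;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Lfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;
    import java.util.Properties;

    /**
     * Wordcount example in Cascading
     */
    public class Main {
        public static void main(String[] args) {
            String inputPath = args[0];
            String outputPath = args[1];

            // Each input line becomes a tuple with the fields "offset" and "line".
            Scheme inputScheme = new TextLine(new Fields("offset", "line"));
            Scheme outputScheme = new TextLine();

            // Paths with a URI scheme (e.g. hdfs://) use Hfs; plain paths use Lfs.
            Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
            Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

            // Split each line on whitespace into one "word" tuple per token.
            Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

            // Group identical words, then count the tuples in each group.
            wcPipe = new GroupBy(wcPipe, new Fields("word"));
            wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

            Properties properties = new Properties();
            FlowConnector.setApplicationJarClass(properties, Main.class);

            // Bind the source, sink and pipe assembly into a Flow and run it.
            Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
            parsedLogFlow.start();
            parsedLogFlow.complete();
        }
    }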

16. Data Model: Pipes and Filters
[Diagram: three files and Filters]

25. Tuple

27. Pipe Assembly
[Diagram label: Tuple Stream]

33. Taps: sources & sinks
[Diagram labels: Taps, Source, Sink]

35. Pipe + Taps = Flow

39. n Flows = Cascade
[Diagram: five Flows; label: complex]

40. Example

46. Example: Schemes

    import cascading.tap.Tap;
    import cascading.tuple.Fields;
    import java.util.Properties;

    /**
     * Wordcount example in Cascading
     */
    public class Main {
        public static void main(String[] args) {
            Scheme inputScheme = new TextLine(new Fields("offset", "line"));
            Scheme outputScheme = new TextLine();
            // ... continues as on slide 15

Callouts on the Scheme lines: TextLine(), SequenceFile()

54. Example: Taps

    String inputPath = args[0];
    String outputPath = args[1];

    Tap sourceTap = inputPath.matches("^[^:]+://.*")
        ? new Hfs(inputScheme, inputPath)
        : new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*")
        ? new Hfs(outputScheme, outputPath)
        : new Lfs(outputScheme, outputPath);

Callouts:
• hdfs://master0:54310/user/nmurray/data.txt (new Hfs())
• data/sources/obama-inaugural-address.txt (new Lfs())
• S3fs()
• GlobHfs()

57. Example: Pipe Assembly

    Pipe wcPipe = new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

64. Pipe Assemblies
• Define work against a Tuple Stream
• May have multiple sources and sinks
• Splits
• Merges
• Joins

76. Pipe Assemblies
• Pipe
• Each: applies a Function or Filter Operation to each Tuple
• GroupBy: group (& merge)
• CoGroup: joins (inner, outer, left, right)
• Every: applies an Aggregator (count, sum) to every group of Tuples
• SubAssembly: nesting

79. Each vs. Every
• Each() is for individual Tuples
• Every() is for groups of Tuples

88. Each
    new Each(previousPipe, argumentSelector, operation, outputSelector)

    Pipe wcPipe = new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

Callout: Fields()

92. Each
Input Tuple:
    offset  line
    0       the lazy brown fox
(X marks "offset", which is dropped)

    Pipe wcPipe = new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

Output Tuples:
    word
    the
    lazy
    brown
    fox
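The split step on this slide can be tried without a cluster. The sketch below is plain Java, not Cascading: `WordSplitSketch` and `splitWords` are hypothetical names, and it assumes the split pattern is whitespace (`\\s+`), which is what the four output tuples on the slide suggest.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for what Each + RegexSplitGenerator does to one tuple.
// WordSplitSketch and splitWords are hypothetical names, not Cascading API.
class WordSplitSketch {

    // Takes the value of the selected "line" field and returns one "word" value per token.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        // "\\s+" is the Java string literal for the regex \s+ (one or more whitespace characters).
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The "offset" value (0) is never passed in: only "line" was selected as the argument.
        System.out.println(splitWords("the lazy brown fox"));
    }
}
```

Each element of the returned list corresponds to one output tuple with the single field "word".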

99. Operations: Functions
• Insert()
• RegexParser() / RegexGenerator() / RegexReplace()
• DateParser() / DateFormatter()
• XPathGenerator()
• Identity()
• etc...

105. Operations: Filters
• RegexFilter()
• FilterNull()
• And() / Or() / Not() / Xor()
• ExpressionFilter()
• etc...

107. Each
    Pipe wcPipe = new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

113. Every
    new Every(previousPipe, argumentSelector, operation, outputSelector)

    new Every(wcPipe, new Count(), new Fields("count", "word"));

120. Operations: Aggregator
• Average()
• Count()
• First() / Last()
• Min() / Max()
• Sum()

123. Every
    new Every(previousPipe, argumentSelector, operation, outputSelector)

    new Every(wcPipe, new Count(), new Fields("count", "word"));
    new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));

124. Field Selection

126. Predefined Field Sets
    Fields.ALL      all available fields
    Fields.GROUP    fields used for last grouping
    Fields.VALUES   fields not used for last grouping
    Fields.ARGS     fields of argument Tuple (for Operations)
    Fields.RESULTS  replaces input with Operation result (for Pipes)

131. Field Selection
    new Each(previousPipe, argumentSelector, operation, outputSelector)

    pipe = new Each(pipe,
        new Fields("timestamp"),
        new DateParser("yyyy-MM-dd HH:mm:ss"),
        new Fields("timestamp"));

132. Field Selection
    new Each(previousPipe, argumentSelector, operation, outputSelector)

Operation Input Tuple = Original Tuple, narrowed by the Argument Selector

137. Field Selection
Original Tuple:     id, visitor_id, search_term, timestamp, page_number
Argument Selector:  timestamp

    pipe = new Each(pipe,
        new Fields("timestamp"),
        new DateParser("yyyy-MM-dd HH:mm:ss"),
        new Fields("ts", "search_term", "visitor_id"));

Input to DateParser will be: timestamp

138. Field Selection
    new Each(previousPipe, argumentSelector, operation, outputSelector)

Output Tuple = (Original Tuple ⊕ Operation Tuple), narrowed by the Output Selector

150. Field Selection
Original Tuple:     id, visitor_id, search_term, timestamp, page_number
DateParser Output:  ts
Output Selector:    ts, search_term, visitor_id
Dropped (X):        id, timestamp, page_number

    pipe = new Each(pipe,
        new Fields("timestamp"),
        new DateParser("yyyy-MM-dd HH:mm:ss"),
        new Fields("ts", "search_term", "visitor_id"));

Output of Each will be: ts, search_term, visitor_id
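The two selectors can be modeled in a few lines of plain Java. This is only a sketch of the selection rule shown on the slides, with tuples modeled as ordered maps; `FieldSelectionSketch`, `select`, `append` and `outputOfEach` are hypothetical names rather than Cascading API. The sample values are the ones from the Tuple slide, and the "ts" value is a placeholder instead of a really parsed date.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of Each's argument and output selectors.
// All names here are hypothetical; no Cascading classes are used.
class FieldSelectionSketch {

    // Keeps only the named fields, in selector order.
    static Map<String, Object> select(Map<String, Object> tuple, List<String> selector) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : selector) {
            if (!tuple.containsKey(field)) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            out.put(field, tuple.get(field));
        }
        return out;
    }

    // Original Tuple combined with the Operation Tuple: result fields follow the original ones.
    static Map<String, Object> append(Map<String, Object> original, Map<String, Object> result) {
        Map<String, Object> out = new LinkedHashMap<>(original);
        out.putAll(result);
        return out;
    }

    static Map<String, Object> outputOfEach() {
        Map<String, Object> original = new LinkedHashMap<>();
        original.put("id", 123);
        original.put("visitor_id", "vis-abc");
        original.put("search_term", "tacos");
        original.put("timestamp", "2009-10-08 03:21:23");
        original.put("page_number", 1);

        // Argument selector: the operation only sees "timestamp".
        Map<String, Object> argument = select(original, Arrays.asList("timestamp"));

        // Stand-in for DateParser: emits one field, "ts". The value is a placeholder, not a parsed date.
        Map<String, Object> operationResult = new LinkedHashMap<>();
        operationResult.put("ts", "<parsed " + argument.get("timestamp") + ">");

        // Output selector: pick from the combined fields.
        Map<String, Object> combined = append(original, operationResult);
        return select(combined, Arrays.asList("ts", "search_term", "visitor_id"));
    }

    public static void main(String[] args) {
        System.out.println(outputOfEach().keySet());
    }
}
```

The fields id, timestamp and page_number never reach the output because the output selector does not name them.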

152. Input Tuples:
    word
    hello
    hello
    world

    new Every(wcPipe, new Count(), new Fields("count", "word"));

Output Tuples:
    count  word
    2      hello
    1      world

154. Example: Flow

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();

158. Run

160. input:
    mary had a little lamb little lamb little lamb mary had a little lamb whose fleece was white as snow
command:
    hadoop jar ./target/cascading-tutorial-1.0.0.jar data/sources/misc/mary.txt data/output

161. output:
    2  a
    1  as
    1  fleece
    2  had
    4  lamb
    4  little
    2  mary
    1  snow
    1  was
    1  white
    1  whose
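For comparison, the same counts can be reproduced in memory. The following is a minimal plain-Java sketch of the split, group and count steps, not the Cascading job itself; `WordCountSketch` and `countWords` are hypothetical names, and a TreeMap is used only so the printout comes out in the same alphabetical order as the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the wordcount pipe assembly (split, group, count).
// WordCountSketch and countWords are hypothetical names.
class WordCountSketch {

    static Map<String, Integer> countWords(String text) {
        // TreeMap keeps the words sorted, mirroring the ordering seen in the job output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Start at 1 for a new word, otherwise add 1 to the running count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "mary had a little lamb little lamb little lamb "
            + "mary had a little lamb whose fleece was white as snow";
        for (Map.Entry<String, Integer> entry : countWords(input).entrySet()) {
            // Same column order as the slide: count first, then word.
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```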

162. What’s next?

164. apply operations to tuple streams
repeat.

165. Cookbook

166. Cookbook
    // remove all tuples where search_term is '-'
    pipe = new Each(pipe,
        new Fields("search_term"),
        new RegexFilter("^(-)$", true));
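The effect of this recipe on a list of search terms can be sketched in plain Java. `DashFilterSketch` and `removeDashTerms` are hypothetical names; the sketch assumes the boolean `true` means "remove the tuples that match", which is what the recipe's comment describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java stand-in for the RegexFilter recipe; names are hypothetical.
class DashFilterSketch {

    // Same pattern as the recipe: the whole value is a single dash.
    private static final Pattern DASH_ONLY = Pattern.compile("^(-)$");

    static List<String> removeDashTerms(List<String> searchTerms) {
        List<String> kept = new ArrayList<>();
        for (String term : searchTerms) {
            // Drop values that match the pattern, keep everything else.
            if (!DASH_ONLY.matcher(term).find()) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeDashTerms(java.util.Arrays.asList("tacos", "-", "taco-truck")));
    }
}
```

Values that merely contain a dash, such as "taco-truck", are kept because the pattern is anchored at both ends.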

167. Cookbook
    // convert "timestamp" String field to "ts" long field
    pipe = new Each(pipe,
        new Fields("timestamp"),
        new DateParser("yyyy-MM-dd HH:mm:ss"),
        Fields.ALL);

168. Cookbook
    // td - timestamp of day-level granularity
    pipe = new Each(pipe,
        new Fields("ts"),
        new ExpressionFunction(
            new Fields("td"),
            "ts - (ts % (24 * 60 * 60 * 1000))",
            long.class),
        Fields.ALL);
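The expression in this recipe is ordinary integer arithmetic and can be checked on its own. The sketch below assumes "ts" holds non-negative milliseconds since the Unix epoch, so the result is midnight UTC of the same day; `DayTruncationSketch` and `truncateToDay` are hypothetical names.

```java
import java.time.Instant;

// Worked version of the "td" expression; names are hypothetical.
class DayTruncationSketch {

    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Same arithmetic as the recipe: subtract the milliseconds elapsed since midnight (UTC).
    static long truncateToDay(long ts) {
        return ts - (ts % MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        // Assumes the sample timestamp from the Tuple slide is read as UTC.
        long ts = Instant.parse("2009-10-08T03:21:23Z").toEpochMilli();
        long td = truncateToDay(ts);
        System.out.println(ts + " -> " + td + " (" + Instant.ofEpochMilli(td) + ")");
    }
}
```

Every timestamp from the same UTC day maps to the same "td" value, which makes "td" usable as a grouping key.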

169. Cookbook
    // SELECT DISTINCT visitor_id
    // GROUP BY visitor_id ORDER BY ts
    pipe = new GroupBy(pipe,
        new Fields("visitor_id"),
        new Fields("ts"));

    // take the First() tuple of every grouping
    pipe = new Every(pipe,
        Fields.ALL,
        new First(),
        Fields.RESULTS);
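In memory, "group by visitor_id, sort by ts, take the first tuple" comes down to keeping the earliest record per visitor. The sketch below is plain Java with hypothetical names (`FirstPerGroupSketch`, `Visit`, `firstPerVisitor`) and made-up sample records; it illustrates the idea, not how Cascading executes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for GroupBy(visitor_id, sort on ts) followed by First().
// All names and sample values are hypothetical.
class FirstPerGroupSketch {

    static final class Visit {
        final String visitorId;
        final long ts;
        final String searchTerm;

        Visit(String visitorId, long ts, String searchTerm) {
            this.visitorId = visitorId;
            this.ts = ts;
            this.searchTerm = searchTerm;
        }
    }

    // Returns one Visit per visitor: the one with the smallest ts, ordered by visitor id.
    static List<Visit> firstPerVisitor(List<Visit> visits) {
        Map<String, Visit> first = new TreeMap<>();
        for (Visit visit : visits) {
            Visit current = first.get(visit.visitorId);
            if (current == null || visit.ts < current.ts) {
                first.put(visit.visitorId, visit);
            }
        }
        return new ArrayList<>(first.values());
    }

    public static void main(String[] args) {
        List<Visit> visits = Arrays.asList(
            new Visit("vis-b", 30L, "tacos"),
            new Visit("vis-a", 20L, "pizza"),
            new Visit("vis-a", 10L, "burritos"));
        for (Visit visit : firstPerVisitor(visits)) {
            System.out.println(visit.visitorId + "\t" + visit.ts + "\t" + visit.searchTerm);
        }
    }
}
```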

170. Custom Operations

176. Desired Operation
Input Tuple:
    line
    mary had a little lamb

    Pipe pipe = new Each("ngram pipe",
        new Fields("line"),
        new NGramTokenizer(2),
        new Fields("ngram"));

Output Tuples:
    ngram
    mary had
    had a
    a little
    little lamb

180. Custom Operations
• extend BaseOperation
• implement Function (or Filter)
• implement operate

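181. NGramTokenizer

    // Imports follow the Cascading 1.x package layout.
    import cascading.flow.FlowProcess;
    import cascading.operation.BaseOperation;
    import cascading.operation.Function;
    import cascading.operation.FunctionCall;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;
    import cascading.tuple.TupleEntry;

    public class NGramTokenizer extends BaseOperation implements Function {
        private int size;

        public NGramTokenizer(int size) {
            // Declare the single output field of this Function.
            super(new Fields("ngram"));
            this.size = size;
        }

        public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
            String token;
            TupleEntry arguments = functionCall.getArguments();

            // Delegate the actual tokenizing to a helper class of the same name.
            com.attinteractive.util.NGramTokenizer t =
                new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

            // Emit one output tuple per n-gram until the helper is exhausted.
            while ((token = t.next()) != null) {
                TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token));
                functionCall.getOutputCollector().add(result);
            }
        }
    }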
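The helper class `com.attinteractive.util.NGramTokenizer` is not shown in the deck. The sketch below is a guess at a compatible helper, assuming only the contract visible in `operate`: a constructor taking the text and the n-gram size, and a `next()` method that returns null once the n-grams run out. `SimpleNGramTokenizer` and `allNGrams` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the helper tokenizer used inside operate().
class SimpleNGramTokenizer {
    private final String[] words;
    private final int size;
    private int position = 0;

    SimpleNGramTokenizer(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        this.size = size;
    }

    // Returns the next n-gram of `size` consecutive words, or null when none are left.
    String next() {
        if (position + size > words.length) {
            return null;
        }
        String ngram = String.join(" ", Arrays.copyOfRange(words, position, position + size));
        position++;
        return ngram;
    }

    // Convenience wrapper that drains the tokenizer into a list.
    static List<String> allNGrams(String text, int size) {
        SimpleNGramTokenizer tokenizer = new SimpleNGramTokenizer(text, size);
        List<String> ngrams = new ArrayList<>();
        String token;
        while ((token = tokenizer.next()) != null) {
            ngrams.add(token);
        }
        return ngrams;
    }

    public static void main(String[] args) {
        System.out.println(allNGrams("mary had a little lamb", 2));
    }
}
```

With size 2, the line "mary had a little lamb" yields the four bigrams listed on the Desired Operation slide.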

191. Tuple
    123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1
zero-indexed, ordered values

Fields
    id | visitor_id | search_term | timestamp | page_number
list of strings representing field names

193. TupleEntry
    id   visitor_id  search_term  timestamp            page_number
    123  vis-abc     tacos        2009-10-08 03:21:23  1

197. Success
Input Tuple:
    line
    mary had a little lamb

    Pipe pipe = new Each("ngram pipe",
        new Fields("line"),
        new NGramTokenizer(2),
        new Fields("ngram"));

Output Tuples:
    ngram
    mary had
    had a
    a little
    little lamb

198. code examples

200. dot
    Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
    importFlow.writeDOT("import-flow.dot");

207. What next?
• http://www.cascading.org/
• http://bit.ly/cascading

208. Questions? <nmurray@att.com>
