
Cascalog internal dsl_preso



  1. Cascalog: An example for a complex workflow
     Marc Limotte, Metamarkets Group, October 27, 2010
  2. What is Cascalog?
       • An internal DSL (domain-specific language) for map/reduce
       • Implemented in Clojure (a functional language that runs on the Java VM)
       • Several layers of abstraction up: built on Cascading (an API for building Hadoop map/reduce jobs)
  3. The Three Bears: Choosing a Solution
       • Start with a problem or business requirement
       • Of interest here is the class of problems that require programmatic query construction:
           o Java program on top of the Hadoop API (too much control)
           o External DSLs (too little control)
           o An internal DSL (just right)
  4. Business Requirement: The original data is an N-dimensional cube.
  5. Business Requirement: Generate all possible aggregations.
  6. Business Requirement: There are 2^N - 1 possible ways to roll up the data.
  7. Tree showing the ways to aggregate
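The count on slide 6 can be checked with a short Clojure sketch. It assumes the six dimension names used later in the talk and enumerates the subsets of dimensions by bit mask. `proper-subsets` is a hypothetical helper, not part of the presentation's code. The sketch assumes that the full set is excluded because it is the original cube; excluding the empty set instead would give the same count.

```clojure
;; Illustrative sketch only: enumerate the rollups of an N-dimensional cube.
(def dims ["dept" "country" "city" "jobtitle" "manager" "function"])

(defn proper-subsets
  "All subsets of ds except ds itself (assumption: the full set is the
   original data, so it is not counted as a rollup)."
  [ds]
  (let [n (count ds)]
    ;; Masks 0 .. 2^n - 2; the all-ones mask (the full set) is left out.
    (for [mask (range (dec (bit-shift-left 1 n)))]
      (vec (keep-indexed (fn [i d] (when (bit-test mask i) d)) ds)))))

(count (proper-subsets dims))   ; 63, i.e. 2^6 - 1
(first (proper-subsets dims))   ; [], the grand total over all dimensions
```

With six dimensions this gives 63 rollups, which matches the "63 jobs" mentioned for the Java solution.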
  8. Alternative Problem Formulation
     There is another way to solve this problem.
       • For each input record:
           - The map task outputs a key for each possible aggregation
           - Use map-side aggregation (a combiner)
       • Simpler
       • In our tests, much slower
       • Memory contention? It aggregates on a large number of keys.
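A rough in-memory simulation of this single-pass alternative, in plain Clojure rather than Hadoop, shows where the large number of keys comes from. The record layout `[key avg cnt]`, the function names, and the sample records are all assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

(defn all-keys
  "Every key obtainable from key-str by replacing any subset of its
   comma-separated fields with *. A key with n fields yields 2^n keys."
  [key-str]
  (->> (str/split key-str #",")
       (reduce (fn [prefixes field]
                 (for [p prefixes, f [field "*"]] (conj p f)))
               [[]])
       (map #(str/join "," %))))

(defn map-record
  "Map step: emit [rolled-up-key [weighted-sum cnt]] for every possible key."
  [[k avg cnt]]
  (for [k2 (all-keys k)] [k2 [(* avg cnt) cnt]]))

(defn combine
  "Combiner/reducer step: add up weighted sums and counts per key."
  [pairs]
  (reduce (fn [acc [k v]]
            (update acc k (fn [old] (if old (mapv + old v) v))))
          {}
          pairs))

(defn rollup-all
  "Map of every rolled-up key to [avg cnt]."
  [records]
  (into {}
        (for [[k [total cnt]] (combine (mapcat map-record records))]
          [k [(/ total cnt) cnt]])))

;; Two made-up records with two dimensions each.
(get (rollup-all [["a,x" 10 2] ["a,y" 20 2]]) "a,*")   ; [15 4]
(count (all-keys "a,b,c,d,e,f"))                       ; 64 keys per input record
```

Every six-dimension record fans out to 64 keys before aggregation, which fits the slide's guess that the number of keys is what slows this approach down.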
  9. Solution 1: External Query Language
         CREATE TABLE emp_salary_dept_jt...
         INSERT INTO emp_salary_dept_jt (...)
           SELECT department, job_title, avg(salary), sum(cnt)
           FROM emp_salary
           GROUP BY department, job_title

         CREATE TABLE emp_salary_jt...
         INSERT INTO emp_salary_jt (...)
           SELECT job_title, group_avg(avg_salary, cnt)
           FROM emp_salary_dept_jt
           GROUP BY job_title
     And 61 additional queries... or extend the language to make it Turing complete, or use string manipulation to construct the queries...
  10. Not terrible, but...
      Write code in another language to manipulate query strings. You have to work with two different languages.
        • Different naming conventions
        • Different semantics for escaping special characters, etc.
        • Your IDE will probably only help you with the outer language (syntax highlighting, syntax verification, formatting, etc.)
        • Limits composability (UDFs are composable, but the control flow is not)
        • Complicates abstraction
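For a sense of what the string-manipulation route involves, here is a minimal sketch with Clojure as the outer language. The table-naming scheme and the helper names `table-name` and `rollup-sql` are hypothetical; `group_avg`, `avg_salary`, and `cnt` are taken from the queries on the previous slide. It only handles a non-empty list of dimensions.

```clojure
(require '[clojure.string :as str])

;; Hypothetical naming scheme: emp_salary_ followed by the dims joined with _.
(defn table-name [dims]
  (str "emp_salary_" (str/join "_" dims)))

(defn rollup-sql
  "Build the INSERT ... SELECT that rolls parent-dims up to dims."
  [parent-dims dims]
  (let [cols (str/join ", " dims)]
    (str "INSERT INTO " (table-name dims)
         " SELECT " cols ", group_avg(avg_salary, cnt)"
         " FROM " (table-name parent-dims)
         " GROUP BY " cols)))

(rollup-sql ["department" "job_title"] ["job_title"])
;; "INSERT INTO emp_salary_job_title SELECT job_title, group_avg(avg_salary, cnt) FROM emp_salary_department_job_title GROUP BY job_title"
```

The query is just an opaque string to the outer language, so quoting, naming, and validation all have to be handled by hand.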
  11. Solution 2: Java Map/Reduce
        • Control logic to launch each of the 63 jobs
        • Map
            o Parameterizable for the data source (a previous aggregation) and for which fields to collapse; these could be passed in the JobConf
        • Reduce
            o Compute the average using the previous group average and count
  12. Solution 3: An Internal DSL
      To my knowledge, Cascalog is the only option for an “internal” DSL. A high-level code walkthrough follows:
        • Helper functions
        • Custom function (UDF)
        • Core process
        • Unit test
        • Execution
  13. Helper Functions
          (def DIMS ["?dept" "?country" "?city" "?jobtitle" "?manager" "?function"])

          ;All sub-lists generated by removing one member.
          (defn sublists
            ([s] (sublists [] s))
            ([left right]
              (let [left2 (conj (vec left) (first right))
                    right2 (rest right)]
                (cons (concat left right2)
                      (if (seq right2) (sublists left2 right2) [])))))

          ;Create key with only the requested dims. Other dims are replaced with *.
          ;str-join is presumably clojure.contrib.str-utils/str-join; the slide does not show the import.
          (defn key-for-dims
            ([dims key-str]
              (str-join "," (key-for-dims dims (.split key-str ",") 0)))
            ([dims key-split idx]
              (if (>= idx (count key-split))
                []
                (cons (if (some #{(nth DIMS idx)} dims) (nth key-split idx) "*")
                      (key-for-dims dims key-split (+ idx 1))))))
        • Pure functions
        • Immutable data types
        • Recursion
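To see what the two helpers return, the snippet below restates them for a current Clojure REPL and evaluates them on small inputs. `clojure.string/join` is substituted for `str-join`, which is an assumption about the original import, and the sample key is made up.

```clojure
(require '[clojure.string :as str])

(def DIMS ["?dept" "?country" "?city" "?jobtitle" "?manager" "?function"])

;; All sub-lists generated by removing one member.
(defn sublists
  ([s] (sublists [] s))
  ([left right]
   (let [left2  (conj (vec left) (first right))
         right2 (rest right)]
     (cons (concat left right2)
           (if (seq right2) (sublists left2 right2) [])))))

;; Keep the fields whose dimension is in dims; replace the others with *.
(defn key-for-dims
  ([dims key-str]
   (str/join "," (key-for-dims dims (.split key-str ",") 0)))
  ([dims key-split idx]
   (if (>= idx (count key-split))
     []
     (cons (if (some #{(nth DIMS idx)} dims) (nth key-split idx) "*")
           (key-for-dims dims key-split (+ idx 1))))))

(sublists ["?dept" "?country" "?city"])
;; (("?country" "?city") ("?dept" "?city") ("?dept" "?country"))

(key-for-dims ["?dept" "?city"] "eng,us,sf,dev,ann,backend")
;; "eng,*,sf,*,*,*"
```

Each call to `sublists` yields the direct children of a node in the aggregation tree, and `key-for-dims` produces the grouping key for one such node.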
  14. Cascalog's Analog to the UDF
          ;Computes an average from a set of other averages
          (def grpavg
            (<- [!avg !cnt :> !newavg !newcnt]
                (* !avg !cnt :> !s)
                (sum !s :> !total)
                (sum !cnt :> !newcnt)
                (div !total !newcnt :> !newavg)))
        • Same syntax
        • No interface to implement
        • Same source file (unless you want to share it)
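The arithmetic behind `grpavg` is a count-weighted average. The plain-Clojure sketch below illustrates it outside Cascalog; `group-avg` is a hypothetical name and the input numbers are made up.

```clojure
;; Plain-Clojure illustration of the arithmetic; this is not a Cascalog operation.
(defn group-avg
  "Combine [avg cnt] pairs into one [avg cnt] pair, weighting each
   average by its count."
  [pairs]
  (let [total (reduce + (map (fn [[avg cnt]] (* avg cnt)) pairs))
        cnt   (reduce + (map second pairs))]
    [(/ total cnt) cnt]))

(group-avg [[10 1] [30 3]])   ; [25 4]
```

Averaging the two averages directly would give 20. Weighting by count gives 25, which is why each rollup level has to carry the count along with the average.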
  17. Core Process
          ;Creates a query, where parent is the source, and the dims list is used
          ;to construct the output key. The other metrics are rolled up.
          (defn make-qry [dims parent]
            (<- [?key ?avg ?cnt]
                (parent ?pkey ?pavg ?pcnt)
                (key-for-dims dims ?pkey :> ?key)
                (grpavg ?pavg ?pcnt :> ?avg ?cnt)))

          ;Given an initial src and a full list of dimensions; return a map of each
          ;subset of dims to a query that implements a rollup along that set of dims.
          (defn generate-query-tree [src dims]
            (let [query (if (= dims DIMS) (basic-qry src) (make-qry dims src))
                  gqt-with-src (partial generate-query-tree query)]
              (if (empty? dims)
                {dims query}
                (assoc (apply merge (map gqt-with-src (sublists dims)))
                       dims query))))
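The shape of the recursion can be tried without Hadoop by swapping the Cascalog queries for plain data. In this sketch `rollup-tree` is a hypothetical stand-in for `generate-query-tree`: each node is a map that records its dimension list and the dimension list of its parent, instead of holding a query.

```clojure
(def DIMS ["?dept" "?country" "?city" "?jobtitle" "?manager" "?function"])

;; Same helper as on the Helper Functions slide.
(defn sublists
  ([s] (sublists [] s))
  ([left right]
   (let [left2  (conj (vec left) (first right))
         right2 (rest right)]
     (cons (concat left right2)
           (if (seq right2) (sublists left2 right2) [])))))

;; Stand-in for generate-query-tree: a node is plain data, not a query.
(defn rollup-tree [parent dims]
  (let [dims (vec dims)                       ; normalize so keys compare as vectors
        node {:dims dims :from parent}
        sub  (partial rollup-tree dims)]
    (if (empty? dims)
      {dims node}
      (assoc (apply merge (map sub (sublists dims))) dims node))))

(def t (rollup-tree :source DIMS))

(count t)              ; 64: every subset of the 6 dims, including the full and empty sets
(:from (get t DIMS))   ; :source
```

Leaving out the entry for the full dimension list, which corresponds to `basic-qry` on the original data, gives the 63 rollups. Every other entry is derived from a parent with exactly one more dimension.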
  18. Unit Testing Cascalog
          ;Rows are [key avg cnt], the field order that make-qry declares.
          (deftest test-make-qry
            (with-tmp-sources [src [["a,b,c,d,e1,f1" 50000 200]
                                    ["a,b,c,d,e1,f2" 20000 100]
                                    ["a,b,c,d,e3,f3" 40000 300]]]
              (test?- [["a,b,c,d,e1,*" 40000 300]
                       ["a,b,c,d,e3,*" 40000 300]]
                      (make-qry (choose DIMS [0 1 2 3 4]) src))
              (test?- [["a,*,*,*,*,*" 40000 600]]
                      (make-qry (choose DIMS [0]) src))))
        • Sample input (green on the slide): the rows bound to src
        • Expected result (orange on the slide): the first argument to each test?- call
  19. Executing the Queries
      For completeness, here is the code that executes all the queries.
          (defn -main [input-dir output-dir]
            (let [source  (get-data (hfs-hadoop-seqfile input-dir))
                  queries (vals (generate-query-tree source DIMS))]
              (?- (hfs-textline-replace output-dir) queries)))
  20. Reference
        • Cascalog Project (Nathan Marz): http://github.com/nathanmarz/cascalog
        • Cascading Project (Chris Wensel): http://www.cascading.org/
        • Google group: http://groups.google.com/group/cascalog-user
        • IM: Come chat in the #cascading room on freenode
        • Book: Practical Clojure by VanderHart and Sierra