Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
It‘s For Data Transformation
M. Sc. Johannes Schildgen
2015-07-08
schildgen@cs.uni-kl.de
… Is Not A Query Language!
on Wid...
2
"A DBA walks into a
NoSQL bar, but turns
and leaves because he
couldn't find a table"
3
Column Families
RowId info children
4
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
Pet...
5
HBase API
put ‘pers‘, ‘Carl‘,
‘info:born‘, ‘1982‘
put ‘pers‘, ‘Carl‘,
‘info:school‘, ‘BSIT‘
put ‘pers‘, ‘Carl‘,
‘info:sc...
6
Jaspersoft HBase QL
{
"tableName": "pers",
"deserializerClass": "com.jaspersoft…DefaultDeserializer",
"filter": {
"Singl...
7
Phoenix
SELECT * FROM pers WHERE school = ‘BSIT‘
𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬
„Parent of
each person?“
https://github.com/force...
8
9
Input Table Output Table
10
Column
RowID Value
Input Cell Output Cell
Column
RowID Value
11
_c
_r _v
_c
_r _v
Input Cell Output Cell
12
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.born <- IN.born;
𝝅 𝒃𝒐𝒓𝒏
13
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.born <- IN.born,
OUT.s...
14
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
𝛔 𝐬...
15
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
IN-FILTER: school=‘BSIT‘,
OUT._r <- IN._r,
OUT.$...
16
That was:
Selection and Projection
17
Now:
Grouping
18
_c
cmpny
_r
Peter
_v
IBM
_c
salsum
_r
IBM
_v
645k
Input Cell Output Cell
Salary sum of
each company.
OUT._r <- IN.cmpny...
19
RowId info
Eve
Carl
Julia
Lisa
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary):
born cmpny salary
1965 IBM 70k
born cm...
20
Advanced Transformations:
More Filters
21
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
Pe...
22
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
OU...
23
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN...
24
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN...
25
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN...
26
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN...
27
NotaQL Transformation Platform:
MapReduce
28
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched c...
29
reduce((r,c), {v})
put(r, c, aggregateAll(v))
Stop
((IBM, salsum), {70k, 80k, 10k})
((IBM, salsum), 160k)
30
Advanced Transformations:
Graph Algorithms
31
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
Pe...
32
_c
Lisa
_r
Peter
_c
€5
_c
Peter
_r
Lisa
_v
€5
Input Cell Output Cell
„Parent of each
person?“
OUT._r <- IN.children._c,...
33
RowId info links
Wikipedia
Twitter
Google
crawled pr
17:35 0.333
Twitter Google
- -
Google
-
Wikipedia
-
crawled pr
17:...
34
RowId info links incoming
Wikipedia
Twitter
Google
crawled pr
17:35 0.333
Twitter Google
- -
Google
-
Wikipedia
-
crawl...
35
RowId info links incoming
Wikipedia
Twitter
Google
crawled pr
17:35 0.333
Twitter Google
- -
Google
-
Wikipedia
-
crawl...
36
RowId info links incoming
Wikipedia
Twitter
Google
crawled pr
17:35 0.333
Twitter Google
- -
Google
-
Wikipedia
-
crawl...
37
Advanced Transformations:
Text Processing
38
RowId info
Wikipedia
Twitter
crawled pr body
17:35 0.333 all information can be found here
OUT._r <- IN._r,
OUT.words
<...
39
RowId info
Wikipedia
Twitter
crawled pr body words
17:35 0.333 all information can be found here 6
OUT._r <- IN._r,
OUT...
40
RowId info
Wikipedia
Twitter
crawled pr body words
17:35 0.333 all information can be found here 6
OUT._r <- IN.body.sp...
41
Conclusion
• Selection, Projection
• Grouping, Aggregation
• Schema-Flexible
• Horizontal Aggregation
• Metadata↔Data
•...
42
Thank you!
Nächste SlideShare
Wird geladen in …5
×

NotaQL Is Not a Query Language! It's for Data Transformation on Wide-Column Stores

459 Aufrufe

Veröffentlicht am

30th British International Conference on Databases (BICOD 2015)

Veröffentlicht in: Ingenieurwesen
  • Als Erste(r) kommentieren

  • Gehören Sie zu den Ersten, denen das gefällt!

NotaQL Is Not a Query Language! It's for Data Transformation on Wide-Column Stores

  1. 1. It‘s For Data Transformation M. Sc. Johannes Schildgen 2015-07-08 schildgen@cs.uni-kl.de … Is Not A Query Language! on Wide-Column Stores
  2. 2. 2 "A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"
  3. 3. 3 Column Families RowId info children
  4. 4. 4 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT Peter, 1965, IBM, 70k Lisa, 1997, BSIT Column Families €10 €5 €0€7
  5. 5. 5 HBase API put ‘pers‘, ‘Carl‘, ‘info:born‘, ‘1982‘ put ‘pers‘, ‘Carl‘, ‘info:school‘, ‘BSIT‘ put ‘pers‘, ‘Carl‘, ‘info:school‘, ‘BUIT‘ get ‘pers‘, ‘Carl‘
  6. 6. 6 Jaspersoft HBase QL { "tableName": "pers", "deserializerClass": "com.jaspersoft…DefaultDeserializer", "filter": { "SingleColumnValueFilter": { "family": „info", "qualifier": „school", "compareOp": "EQUAL", "comparator": { "SubstringComparator": { "substr": „BSIT" } } } } } 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬 http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language
  7. 7. 7 Phoenix SELECT * FROM pers WHERE school = ‘BSIT‘ 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬 „Parent of each person?“ https://github.com/forcedotcom/phoenix
  8. 8. 8
  9. 9. 9 Input Table Output Table
  10. 10. 10 Column RowID Value Input Cell Output Cell Column RowID Value
  11. 11. 11 _c _r _v _c _r _v Input Cell Output Cell
  12. 12. 12 _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.born <- IN.born; 𝝅 𝒃𝒐𝒓𝒏
  13. 13. 13 _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.born <- IN.born, OUT.school <- IN.school; 𝝅 𝒃𝒐𝒓𝒏,𝒔𝒄𝒉𝒐𝒐𝒍
  14. 14. 14 _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.$(IN._c) <- IN._v; 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′
  15. 15. 15 _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell IN-FILTER: school=‘BSIT‘, OUT._r <- IN._r, OUT.$(IN._c) <- IN._v; row predicate 𝐩𝐞𝐫𝐬𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′
  16. 16. 16 That was: Selection and Projection
  17. 17. 17 Now: Grouping
  18. 18. 18 _c cmpny _r Peter _v IBM _c salsum _r IBM _v 645k Input Cell Output Cell Salary sum of each company. OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary):
  19. 19. 19 RowId info Eve Carl Julia Lisa OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary): born cmpny salary 1965 IBM 70k born cmpny job 1966 IBM intern born cmpny salary 1967 IBM 80k born school salary 1997 BSIT 1k salsum IBM 70k salsum IBM 80k salsum IBM 150k
  20. 20. 20 Advanced Transformations: More Filters
  21. 21. 21 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT Peter, 1965, IBM, 70k Lisa, 1997, BSIT €10 €5 €0€7
  22. 22. 22 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT OUT._r <- IN._r, OUT.$(IN._c) <- IN._v;
  23. 23. 23 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN._c) <- IN._v;
  24. 24. 24 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c) <- IN._v;
  25. 25. 25 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c?(@>5)) <- IN._v; cell predicate
  26. 26. 26 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c?(!Carl)) <- IN._v; cell predicate
  27. 27. 27 NotaQL Transformation Platform: MapReduce
  28. 28. 28 map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes RowId info Peter born cmpny salary 1965 IBM 70k salsum IBM 70k ((IBM, salsum), 70k)
  29. 29. 29 reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop ((IBM, salsum), {70k, 80k, 10k}) ((IBM, salsum), 160k)
  30. 30. 30 Advanced Transformations: Graph Algorithms
  31. 31. 31 RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT Peter, 1965, IBM, 70k Lisa, 1997, BSIT €10 €5 €0€7
  32. 32. 32 _c Lisa _r Peter _c €5 _c Peter _r Lisa _v €5 Input Cell Output Cell „Parent of each person?“ OUT._r <- IN.children._c, OUT.$(IN._r) <- IN._v;
  33. 33. 33 RowId info links Wikipedia Twitter Google crawled pr 17:35 0.333 Twitter Google - - Google - Wikipedia - crawled pr 17:36 0.333 crawled pr 17:36 0.333 OUT._r <- IN.links._c, OUT.incoming.$(IN._r) <- IN._v;
  34. 34. 34 RowId info links incoming Wikipedia Twitter Google crawled pr 17:35 0.333 Twitter Google - - Google - Wikipedia - crawled pr 17:36 0.333 crawled pr 17:36 0.333 OUT._r <- IN.links._c, OUT.incoming.$(IN._r) <- IN._v; Twitter Wikipedia - - Google - Wikipedia - Reverting the graph
  35. 35. 35 RowId info links incoming Wikipedia Twitter Google crawled pr 17:35 0.333 Twitter Google - - Google - Wikipedia - crawled pr 17:36 0.333 crawled pr 17:36 0.333 OUT._r <- IN.links._c, OUT.info.pr <- SUM(IN.pr/COL_COUNT(links)); Twitter Wikipedia - - Google - Wikipedia -
  36. 36. 36 RowId info links incoming Wikipedia Twitter Google crawled pr 17:35 0.333 Twitter Google - - Google - Wikipedia - crawled pr 17:36 0.167 crawled pr 17:36 0.5 OUT._r <- IN.links._c, OUT.info.pr <- SUM(IN.pr/COL_COUNT(links)); Twitter Wikipedia - - Google - Wikipedia - PageRank
  37. 37. 37 Advanced Transformations: Text Processing
  38. 38. 38 RowId info Wikipedia Twitter crawled pr body 17:35 0.333 all information can be found here OUT._r <- IN._r, OUT.words <- COUNT(IN.body.split(‘ ‘)); crawled pr body 17:36 0.167 click here for more information
  39. 39. 39 RowId info Wikipedia Twitter crawled pr body words 17:35 0.333 all information can be found here 6 OUT._r <- IN._r, OUT.words <- COUNT(IN.body.split(‘ ‘)); crawled pr body 17:36 0.167 click here for more information 5 Word Count
  40. 40. 40 RowId info Wikipedia Twitter crawled pr body words 17:35 0.333 all information can be found here 6 OUT._r <- IN.body.split(‘ ‘), OUT.$(IN._r) <- COUNT(*); crawled pr body 17:36 0.167 click here for more information 5 RowId info here … Wikipedia Twitter 1 1 Term Index
  41. 41. 41 Conclusion • Selection, Projection • Grouping, Aggregation • Schema-Flexible • Horizontal Aggregation • Metadata↔Data • Graph Processing • Text Processing ≠SQL
  42. 42. 42 Thank you!

×