Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
on Wide-Column Stores
M. Sc. Johannes Schildgen
2015-06-29
schildgen@cs.uni-kl.de
Incremental Data Transformations
with
"A DBA walks into a
NoSQL bar, but turns
and leaves because he
couldn't find a table"
Column Families
RowId info children
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
Peter...
HBase API
put ‘pers‘, ‘Carl‘,
‘info:born‘, ‘1982‘
put ‘pers‘, ‘Carl‘,
‘info:school‘, ‘BSIT‘
put ‘pers‘, ‘Carl‘,
‘info:scho...
Jaspersoft HBase QL
{
"tableName": "pers",
"deserializerClass": "com.jaspersoft…DefaultDeserializer",
"filter": {
"SingleC...
Phoenix
SELECT * FROM pers WHERE school = ‘BSIT‘
𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬
https://github.com/forcedotcom/phoenix
„Parent of
e...
Input Table Output Table
Column
RowID Value
Input Cell Output Cell
Column
RowID Value
_c
_r _v
_c
_r _v
Input Cell Output Cell
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.born <- IN.born;
𝝅 𝒃𝒐𝒓𝒏
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.born <- IN.born,
OUT.scho...
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
𝐩𝐞𝐫𝐬
OUT._r <- IN._r,
OUT.$(IN._c) <- IN._v;
𝛔 𝐬𝐜𝐡𝐨...
_c
born
_r
Lisa
_v
1997
_c
born
_r
Lisa
_v
1997
Input Cell Output Cell
IN-FILTER: school=‘BSIT‘,
OUT._r <- IN._r,
OUT.$(IN...
That was:
Selection and Projection
Now:
Grouping
_c
cmpny
_r
Peter
_v
IBM
_c
salsum
_r
IBM
_v
645k
Input Cell Output Cell
Salary sum of
each company.
OUT._r <- IN.cmpny,
O...
RowId info
Eve
Carl
Julia
Lisa
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.salary):
born cmpny salary
1965 IBM 70k
born cmpny...
Advanced Transformations:
More Filters
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
Peter...
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
OUT._...
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN-FI...
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN-FI...
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN-FI...
RowId info children
Peter
Lisa
born cmpny salary
1965 IBM 70k
Lisa Carl Susi Toni
€5 €0 €10 €7
born school
1997 BSIT
IN-FI...
NotaQL Transformation Platform:
MapReduce
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched colu...
reduce((r,c), {v})
put(r, c, aggregateAll(v))
Stop
((IBM, salsum), {70k, 80k, 10k})
((IBM, salsum), 160k)
Incremental Transformations:
Self-Maintainability
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
Salary sum of all people born
before 1980 per company.
IN-FILTER: born<1980,
OUT._r <- IN.cmpny,
OUT.salsum <- SUM(IN.sala...
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched colu...
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched colu...
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched colu...
map(rowId, row)
row violates
row pred.?
has more
columns?
no
cell violates
cell pred.?
yes
map IN.{_r,_c,_v}, fetched colu...
Evaluation:
Performance
0
10
20
30
40
50
60
70
80
90
100
TPCH 1 (selection, projection, sum
of prices per status code)
TPCH 6 (selection, projecti...
Conclusion
• Selection, Projection
• Grouping, Aggregation
• Schema-Flexible
• Horizontal Aggregation
• Metadata↔Data
• Gr...
Thank you!
Incremental Data Transformations on Wide-Column Stores with NotaQL
Nächste SlideShare
Wird geladen in …5
×

Incremental Data Transformations on Wide-Column Stores with NotaQL

440 Aufrufe

Veröffentlicht am

SummerSOC 2015

Veröffentlicht in: Ingenieurwesen
  • Als Erste(r) kommentieren

  • Gehören Sie zu den Ersten, denen das gefällt!

Incremental Data Transformations on Wide-Column Stores with NotaQL

  1. 1. on Wide-Column Stores M. Sc. Johannes Schildgen 2015-06-29 schildgen@cs.uni-kl.de Incremental Data Transformations with
  2. 2. "A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"
  3. 3. Column Families RowId info children
  4. 4. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT Peter, 1965, IBM, 70k Lisa, 1997, BSIT Column Families €10 €5 €0€7
  5. 5. HBase API put ‘pers‘, ‘Carl‘, ‘info:born‘, ‘1982‘ put ‘pers‘, ‘Carl‘, ‘info:school‘, ‘BSIT‘ put ‘pers‘, ‘Carl‘, ‘info:school‘, ‘BUIT‘ get ‘pers‘, ‘Carl‘
  6. 6. Jaspersoft HBase QL { "tableName": "pers", "deserializerClass": "com.jaspersoft…DefaultDeserializer", "filter": { "SingleColumnValueFilter": { "family": „info", "qualifier": „school", "compareOp": "EQUAL", "comparator": { "SubstringComparator": { "substr": „BSIT" } } } } } 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬 http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language
  7. 7. Phoenix SELECT * FROM pers WHERE school = ‘BSIT‘ 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′ 𝐩𝐞𝐫𝐬 https://github.com/forcedotcom/phoenix „Parent of each person?“
  8. 8. Input Table Output Table
  9. 9. Column RowID Value Input Cell Output Cell Column RowID Value
  10. 10. _c _r _v _c _r _v Input Cell Output Cell
  11. 11. _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.born <- IN.born; 𝝅 𝒃𝒐𝒓𝒏
  12. 12. _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.born <- IN.born, OUT.school <- IN.school; 𝝅 𝒃𝒐𝒓𝒏,𝒔𝒄𝒉𝒐𝒐𝒍
  13. 13. _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell 𝐩𝐞𝐫𝐬 OUT._r <- IN._r, OUT.$(IN._c) <- IN._v; 𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′
  14. 14. _c born _r Lisa _v 1997 _c born _r Lisa _v 1997 Input Cell Output Cell IN-FILTER: school=‘BSIT‘, OUT._r <- IN._r, OUT.$(IN._c) <- IN._v; row predicate 𝐩𝐞𝐫𝐬𝛔 𝐬𝐜𝐡𝐨𝐨𝐥=′ 𝐁𝐒𝐈𝐓′
  15. 15. That was: Selection and Projection
  16. 16. Now: Grouping
  17. 17. _c cmpny _r Peter _v IBM _c salsum _r IBM _v 645k Input Cell Output Cell Salary sum of each company. OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary);
  18. 18. RowId info Eve Carl Julia Lisa OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary): born cmpny salary 1965 IBM 70k born cmpny job 1966 IBM intern born cmpny salary 1967 IBM 80k born school salary 1997 BSIT 1k salsum IBM 70k salsum IBM 80k salsum IBM 150k
  19. 19. Advanced Transformations: More Filters
  20. 20. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT Peter, 1965, IBM, 70k Lisa, 1997, BSIT €10 €5 €0€7
  21. 21. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT OUT._r <- IN._r, OUT.$(IN._c) <- IN._v;
  22. 22. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN._c) <- IN._v;
  23. 23. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c) <- IN._v;
  24. 24. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c?(@>5)) <- IN._v; cell predicate
  25. 25. RowId info children Peter Lisa born cmpny salary 1965 IBM 70k Lisa Carl Susi Toni €5 €0 €10 €7 born school 1997 BSIT IN-FILTER: COL_COUNT(children)>0 OUT._r <- IN._r, OUT.$(IN.children._c?(!Carl)) <- IN._v; cell predicate
  26. 26. NotaQL Transformation Platform: MapReduce
  27. 27. map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes RowId info Peter born cmpny salary 1965 IBM 70k salsum IBM 70k ((IBM, salsum), 70k)
  28. 28. reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop ((IBM, salsum), {70k, 80k, 10k}) ((IBM, salsum), 160k)
  29. 29. Incremental Transformations: Self-Maintainability
  30. 30. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:45 17:47 17:50 Execute job New people are added Execute job again Δ+ =
  31. 31. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:45 17:47 17:50 Execute job New people are added Execute job again salsum IBM 150k born cmpny salary Melissa 1989 IBM 50k Nora 1977 IBM 80k salsum IBM 230k+
  32. 32. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:50 17:52 17:55 Execute job People are deleted Execute job again salsum IBM 230k born cmpny salary Melissa 1989 IBM 50k Nora 1977 IBM 80k salsum IBM 150k -
  33. 33. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:50 17:52 17:55 Execute job People are updated Execute job again born cmpny salary Peter 1965 IBM 70k 1990 -= 70k
  34. 34. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:50 17:52 17:55 Execute job People are updated Execute job again born cmpny salary Peter 1965 IBM 70k 1990 += 70k 1965
  35. 35. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:50 17:52 17:55 Execute job People are updated Execute job again born cmpny salary Peter 1965 IBM 70k 75k += 75k-70k
  36. 36. Salary sum of all people born before 1980 per company. IN-FILTER: born<1980, OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary); 17:50 17:52 17:55 Execute job People are updated Execute job again born cmpny salary Peter 1965 IBM 70k SAP -= 70k
  37. 37. map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop Just read Delta (ts>ts‘)
  38. 38. map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop …and the former Result
  39. 39. map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop Was it satisfied on previous execution? Yes? Invert v *
  40. 40. map(rowId, row) row violates row pred.? has more columns? no cell violates cell pred.? yes map IN.{_r,_c,_v}, fetched columns and constants to r,c and v no emit((r, c), v) no yes Stop yes reduce((r,c), {v}) put(r, c, aggregateAll(v)) Stop Unchanged?
  41. 41. Evaluation: Performance
  42. 42. 0 10 20 30 40 50 60 70 80 90 100 TPCH 1 (selection, projection, sum of prices per status code) TPCH 6 (selection, projection, sum of all prices) Reverse Web-Link Graph Native Hadoop NotaQL (non-inc.) NotaQL (0.1% changes, timestamp- based CDC) NotaQL (10% changes, manual CDC)
  43. 43. Conclusion • Selection, Projection • Grouping, Aggregation • Schema-Flexible • Horizontal Aggregation • Metadata↔Data • Graph Processing • Text Processing ≠SQL Incremental!
  44. 44. Thank you!

×