Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014


For data to be the fuel of the 21st century, and for data science to live up to its promise as a driver of innovation, their application should not be confined to dashboards and static analyses. Instead, they should drive real applications that support the organisations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for secure, real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, our considerations and the technical choices we made in order to perform data analyses but also drive a web-based, real-time application. We considered and used Impala, HBase, and MongoDB, but also conventional SQL databases such as MySQL and PostgreSQL. Important aspects in our journey were, among others, the handling of geographical data, the access to hundreds of millions of records, as well as the real-time analysis of millions of data points.


  1. Real time data driven applications and SQL vs NoSQL databases. GoDataDriven, proudly part of the Xebia Group. Giovanni Lanzani, Data Whisperer
  2. Who am I? 2008-2012: PhD Theoretical Physics; 2012-2013: KPMG; 2013-Now: GoDataDriven
  3. Feedback: @gglanzani
  4. Real-time, data driven app?
     • Not just store and retrieve;
     • Store, {transform, enrich, analyse} and retrieve;
     • Real-time: retrieval is not a batch process;
     • App: something your mother could use:
       SELECT attendees FROM NoSQLMatters WHERE password = '1234';
  5. Get insight about event impact
  6. Get insight about event impact
  7. Get insight about event impact
  8. Get insight about event impact
  9. Challenges: 1. Big Data; 2. Privacy; 3. Some real-time analysis; 4. Real-time retrieval.
  10. Is it Big Data? “Everybody talks about it. Nobody knows how to do it. Everyone thinks everyone else is doing it, so everyone claims they’re doing it…” (Dan Ariely)
  11. Is it Big Data?
      • Raw logs are in the order of 40TB;
      • We use Hadoop for storing, enriching and pre-processing.
  12. Challenge 2: Privacy
  13. Challenge 3: (Some) real-time analysis
  14. Challenge 4: Real-Time Retrieval
      • Harder than it looks;
      • Large data;
      • Retrieval is by giving date, center location + radius.
  15. Architecture: Front-end (AngularJS) <-> REST / JSON <-> Back-end (python app)
  16. JS-1
  17. JS-2
  18. Data Example

      date        hour  id_activity  postcode  hits  delta  sbi
      2013-01-01  12    1234         1234AB    35    22     1
      2013-01-08  12    1234         1234AB    45    35     1
      2013-01-01  11    2345         5555ZB    2     1      2
      2013-01-08  11    2345         5555ZB    55    2      2

  19. Data Example (same table as on the previous slide)
  20. helper.py example

          def get_statistics(data, sbi):
              sbi_df = data[data.sbi == sbi]      # select * from data where sbi = sbi
              hits = sbi_df.hits.sum()            # select sum(hits) from …
              delta_hits = sbi_df.delta.sum()     # select sum(delta) from …
              if delta_hits:
                  percentage = (hits - delta_hits) / delta_hits
              else:
                  # avoid dividing by zero when there is no delta
                  percentage = 0
              return {"sbi": sbi, "total": hits, "percentage": percentage}
  21. helper.py example

          def get_timeline(data, sbi):
              # select sum(hits), sum(delta) from data group by date, hour, sbi
              # Note: as on the slide, the sbi argument is not used as a filter.
              df_sbi = data.groupby(["date", "hour", "sbi"])[["hits", "delta"]].sum()
              return df_sbi
  22. Who has my data?
      • First iteration was a (pre-)POC with less data (3GB vs 500GB);
      • Time constraints;
      • Oops: everything is a pandas df!
  23. Advantage of “everything is a df”
      Pro:
      • Fast!!
      • Use what you know
      • No DBAs!
      • We all love CSVs!
      Contra:
      • Doesn’t scale;
      • Huge startup time;
      • No DBAs!
      • We all hate CSVs!
  24. If you want to go down this path
      • Set the dataframe index wisely;
      • Align the data to the index:
        source_data.sort_index(inplace=True)
      • Beware of modifications of the original dataframe!
  25. If you want to go down this path: “The reason pandas is faster is because I came up with a better algorithm”
  26. If you don’t: Front-end (AngularJS) <-> REST / JSON <-> Back-end (python app) <-> ? <-> Database
  27. A word about (traditional) databases…
  28. Db: programming language dict
  29. Postgres for data driven apps?
  30. Postgres for data driven apps?
  31. Issues?!
      • With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
          SELECT * FROM datapoints
          WHERE date IN date_array
            AND postcode IN postcode_array;
      • There is an index on date and postcode, but single queries still run for more than 20 minutes.
  32. Postgres + PostGIS (2.x)
      PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:
          -- every point within 1.5km from (lat, lon), on imaginary dates
          -- (schematic: the real ST_DWithin takes two geometries or
          -- geographies plus a distance)
          SELECT * FROM datapoints
          WHERE ST_DWithin(lon, lat, 1500)
            AND dates IN ('2013-02-30', '2013-02-31');
  33. Other DBs?
  34. How we solved it
      1. Align data on disk by date;
      2. Use the temporary table trick:
          CREATE TEMPORARY TABLE tmp
              (postcodes STRING NOT NULL PRIMARY KEY);
          INSERT INTO tmp (postcodes) VALUES postcode_array;
          SELECT * FROM tmp
          JOIN datapoints d ON d.postcode = tmp.postcodes
          WHERE d.dt IN dates_array;
      3. Lose precision: 1234AB→1234
  35. Take home messages
      1. Geospatial problems are “hard” and can kill your queries;
      2. Not everybody has infinite resources: be smart and KISS!
      3. SQL or NoSQL? (Size, schema)
  36. GoDataDriven. We’re hiring / Questions? / Thank you! @gglanzani, giovannilanzani@godatadriven.com. Giovanni Lanzani, Data Whisperer
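
Slide 14 says retrieval is "by giving date, center location + radius", and slide 31 notes that a 10km radius in Amsterdam yields about 10k postcodes. The step from a circle to a list of postcodes is not shown in the slides. One way it might be done is a plain haversine (great-circle) distance filter, sketched below. The coordinates are rough illustrative values rather than real postcode centroids, and `haversine_km`, `postcodes_within` and `truncate_postcode` are hypothetical helper names. The last helper mirrors the "lose precision" step from slide 34 (1234AB→1234).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, lat, lon, radius_km):
    """Return the sorted postcodes whose centroid lies within radius_km."""
    return sorted(
        postcode
        for postcode, (plat, plon) in centroids.items()
        if haversine_km(lat, lon, plat, plon) <= radius_km
    )


def truncate_postcode(postcode):
    """Drop the letter suffix of a postcode, keeping the four digits."""
    return postcode[:4]


# Assumption: made-up centroid coordinates, for illustration only.
centroids = {
    "1234AB": (52.370, 4.895),
    "1234CD": (52.372, 4.900),
    "5555ZB": (51.440, 5.470),
}

nearby = postcodes_within(centroids, 52.37, 4.89, radius_km=10)
coarse = sorted({truncate_postcode(p) for p in nearby})
print(nearby, coarse)
```

Truncating before querying shrinks the list of keys the database has to match, at the cost of a coarser circle.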
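
Slide 24 advises setting the dataframe index wisely, sorting the data to match it, and being careful about modifying the original dataframe. A minimal sketch of how that could look with the rows from the "Data Example" slide follows. Indexing on `date` and the helper name `select_points` are assumptions made for illustration; the slides do not say which index was used.

```python
import pandas as pd

# Rows copied from the "Data Example" slide.
source_data = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
        "hour": [12, 12, 11, 11],
        "id_activity": [1234, 1234, 2345, 2345],
        "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
        "hits": [35, 45, 2, 55],
        "delta": [22, 35, 1, 2],
        "sbi": [1, 1, 2, 2],
    }
)

# Assumption: lookups are mostly by date, so use it as the index.
# ISO date strings sort chronologically, so string labels are enough here.
indexed = source_data.set_index("date")
indexed.sort_index(inplace=True)  # align the data to the index


def select_points(df, first_date, last_date, postcodes):
    """Return rows in [first_date, last_date] for the given postcodes."""
    # Label slicing on a sorted index includes both end points.
    in_range = df.loc[first_date:last_date]
    # .copy() keeps callers from modifying the shared dataframe by accident.
    return in_range[in_range["postcode"].isin(postcodes)].copy()


week = select_points(indexed, "2013-01-01", "2013-01-07", ["1234AB"])
print(week)
```

The slice in `select_points` relies on the index being sorted; on an unsorted index with repeated dates, a range lookup for a date that is not present would raise an error instead.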
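
The temporary table trick from slide 34 can be tried out with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: SQLite stands in for whichever database was used in the talk, the rows come from the "Data Example" slide, and `query_points` is a made-up function name. It shows the shape of the approach (load the postcodes into a keyed temporary table, then join) and says nothing about the timings reported in the slides.

```python
import sqlite3

# Rows copied from the "Data Example" slide.
ROWS = [
    ("2013-01-01", 12, 1234, "1234AB", 35, 22, 1),
    ("2013-01-08", 12, 1234, "1234AB", 45, 35, 1),
    ("2013-01-01", 11, 2345, "5555ZB", 2, 1, 2),
    ("2013-01-08", 11, 2345, "5555ZB", 55, 2, 2),
]


def query_points(conn, postcodes, dates):
    """Join datapoints against a temporary table holding the postcodes."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS temp.tmp")
    cur.execute(
        "CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)"
    )
    # set() because the primary key rejects duplicate postcodes.
    cur.executemany(
        "INSERT INTO tmp (postcodes) VALUES (?)",
        [(p,) for p in set(postcodes)],
    )
    placeholders = ", ".join("?" for _ in dates)
    cur.execute(
        "SELECT d.dt, d.hour, d.postcode, d.hits, d.delta "
        "FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes "
        f"WHERE d.dt IN ({placeholders}) "
        "ORDER BY d.dt, d.postcode",
        list(dates),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datapoints (dt TEXT, hour INTEGER, id_activity INTEGER, "
    "postcode TEXT, hits INTEGER, delta INTEGER, sbi INTEGER)"
)
conn.executemany("INSERT INTO datapoints VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)
conn.execute("CREATE INDEX idx_dt_postcode ON datapoints (dt, postcode)")

points = query_points(conn, ["1234AB"], ["2013-01-01", "2013-01-08"])
print(points)
```

The join replaces the long `postcode IN postcode_array` list from slide 31 with a lookup against a small keyed table.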
