Big Data analysis is commonly associated with batch processing. Users aiming to combine batch and stream processing have to rely on tailor-made architectures. Users buy Big Data platforms, but ask: How do I start? What is my entry point to the platform? #CassandraSummit 2014 San Francisco
Stratio CrossData: an efficient distributed datahub with batch and streaming query capabilities
1. Stratio Meta
An efficient distributed datahub with batch and
streaming query capabilities
Daniel Higuero
Alvaro Agea
dhiguero@stratio.com
alvaro@stratio.com
#CassandraSummit 2014
2. Stratio Crossdata
An efficient distributed datahub with batch and
streaming query capabilities
Daniel Higuero
Alvaro Agea
dhiguero@stratio.com
alvaro@stratio.com
3. Who are we?
STRATIO
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
7. What our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
8. What our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
12. Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
• Different datastores
• Different processing engines
• Different versions
o Each connector defines its capabilities
Our planner will choose the best connector for each query
13. Query execution
Parsing → Validation → Planning → Execution
Connectors shown: C*, Connector1, Connector2, Connector3
Our planner will choose the best connector for each query
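The idea that each connector declares its capabilities and the planner picks one per query could be sketched as follows. This is only an illustration: the `Connector` class, the capability names, and the `cost` field are hypothetical, and the slides do not say how Crossdata actually ranks connectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector and its declared capabilities."""
    name: str
    capabilities: frozenset
    cost: int  # assumed relative cost; lower is preferred


def choose_connector(required, connectors):
    """Return the cheapest connector that supports every required operation."""
    candidates = [c for c in connectors if set(required) <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Two made-up connectors: a cheap native one and a more capable, costlier one.
connectors = [
    Connector("native-cassandra", frozenset({"SELECT", "FILTER_PK"}), cost=1),
    Connector("spark", frozenset({"SELECT", "FILTER_PK", "JOIN", "GROUP_BY"}), cost=10),
]

simple = choose_connector({"SELECT", "FILTER_PK"}, connectors)
complex_ = choose_connector({"SELECT", "JOIN"}, connectors)
print(simple.name)    # native-cassandra
print(complex_.name)  # spark
```

Under these assumptions a simple lookup stays on the native connector, while a query needing a join falls through to the connector that declares `JOIN`.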
14. Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
• Multiple clusters can coexist to optimize platform performance
  - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
• A table is saved in a unique datastore
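A single catalog over several clusters, with each table stored in exactly one of them, might look roughly like this. The `Catalog` class and the cluster and table names are hypothetical and are not taken from the Crossdata code base.

```python
class Catalog:
    """Hypothetical single catalog spanning several clusters."""

    def __init__(self):
        self._location = {}  # qualified table name -> cluster name

    def register(self, table, cluster):
        # A table lives in exactly one datastore, so registering it
        # under a second cluster is rejected.
        current = self._location.get(table)
        if current is not None and current != cluster:
            raise ValueError(f"{table} is already stored in {current}")
        self._location[table] = cluster

    def cluster_for(self, table):
        """Resolve which cluster a query on this table must be sent to."""
        return self._location[table]


catalog = Catalog()
catalog.register("demo.users", "production")
catalog.register("demo.events", "write_optimized")
print(catalog.cluster_for("demo.users"))  # production
```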
22. Streaming queries: windows syntax
SELECT fieldGroup,avg(Field2)
FROM eph_table
WITH WINDOW 5 minutes
WHERE field1=100 AND field2>100
GROUP BY fieldGroup;
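To make the semantics of the windowed query concrete, here is a small sketch that evaluates it over in-memory events. It assumes tumbling (non-overlapping) 5-minute windows and a `ts` timestamp in seconds on each event; the slide does not state which window type Crossdata uses, so both are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # "WITH WINDOW 5 minutes"


def windowed_avg(events):
    """Evaluate the query once per tumbling window.

    Each event is a dict with keys: ts (seconds), fieldGroup, field1, field2.
    Returns {window_start: {fieldGroup: avg(field2)}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for e in events:
        # WHERE field1 = 100 AND field2 > 100
        if e["field1"] == 100 and e["field2"] > 100:
            window_start = e["ts"] - e["ts"] % WINDOW_SECONDS
            acc = sums[window_start][e["fieldGroup"]]
            acc[0] += e["field2"]  # running sum
            acc[1] += 1            # running count
    # GROUP BY fieldGroup, computing avg(field2) per group and window
    return {
        w: {g: total / n for g, (total, n) in groups.items()}
        for w, groups in sums.items()
    }


events = [
    {"ts": 10, "fieldGroup": "a", "field1": 100, "field2": 200},
    {"ts": 20, "fieldGroup": "a", "field1": 100, "field2": 400},
    {"ts": 30, "fieldGroup": "b", "field1": 100, "field2": 50},    # filtered out
    {"ts": 310, "fieldGroup": "a", "field1": 100, "field2": 150},  # next window
]
result = windowed_avg(events)
print(result)  # {0: {'a': 300.0}, 300: {'a': 150.0}}
```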
23. Joining batch and streaming
SELECT * FROM demo.temporal
WITH WINDOW 10 secs
INNER JOIN demo.users
ON users.name = temporal.name;
[Diagram: the query is decomposed into three parts]
• SELECT * FROM demo.temporal WITH WINDOW 10 secs
• SELECT * FROM demo.users
• INNER JOIN ON users.name = temporal.name
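One way to picture the join of a streaming window with a batch table is a hash join repeated for every window. The sketch below assumes the rows of one 10-second window have already been collected into a list; the function name and the sample rows are made up for illustration and do not reflect how Crossdata executes the join.

```python
def join_window(window_rows, users):
    """Inner-join one window of streamed rows with a batch table on `name`."""
    # Build a hash index over the batch side.
    by_name = {}
    for user in users:
        by_name.setdefault(user["name"], []).append(user)
    # Probe the index with each streamed row; unmatched rows are dropped.
    joined = []
    for row in window_rows:
        for user in by_name.get(row["name"], []):
            joined.append({"temporal": row, "users": user})
    return joined


users = [{"name": "ana", "city": "Madrid"}, {"name": "bob", "city": "SF"}]
window = [{"name": "ana", "value": 1}, {"name": "zoe", "value": 2}]
rows = join_window(window, users)
print(len(rows))  # 1
```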
25. Full text search with
o Clients request the ability to perform full text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
• Full text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
29. Why Spark?
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
o In-memory processing
o RDD abstraction
o Simpler API
o Increased flexibility (e.g., no need for identity mapping)
30. What about Spark SQL?
o Different approach to query execution
• We only use Spark when it speeds up queries
  - Native drivers are faster for simple queries
  - Spark SQL has limited RDD sources
• Avoid some Spark limitations
• Several batch and streaming contexts in a single JVM (SPARK-2243)
36. Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it using Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, Qlikview and MS Excel
One ODBC for all datastores!
38. The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
39. We are looking for an Apache Champion
Can you help us?
40. A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
• E.g., aggregations like count(*)
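The last wish can be illustrated with a partial-aggregation sketch: each node counts its own rows and the coordinator only adds up one integer per node. This is a generic illustration of the idea, not a description of how Cassandra works or would implement it; the function names and data are hypothetical.

```python
def local_count(rows, predicate=lambda row: True):
    """Runs on each node: count matching rows without shipping them."""
    return sum(1 for row in rows if predicate(row))


def coordinator_count(partial_counts):
    """Runs on the coordinator: combine one integer per node."""
    return sum(partial_counts)


node_data = [[1, 2, 3], [4, 5], []]  # rows held by three hypothetical nodes
total = coordinator_count(local_count(rows) for rows in node_data)
print(total)  # 5
```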
41. Stratio Crossdata
An efficient distributed datahub with batch and
streaming query capabilities
Daniel Higuero
Alvaro Agea
dhiguero@stratio.com
alvaro@stratio.com