Advanced search and
Top-K queries in Cassandra
Andrés de la Peña
andres@stratio.com
@a_de_la_pena
Apache Cassandra Meetup 2015
Who are we?
•  Stratio is a Big Data Company
•  Founded in 2013
•  Commercially launched in 2014
•  70+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution
CONTENTS
1. Introduction to Cassandra
2. Cassandra query methods
3. Stratio Lucene based 2i implementation
4. Integrating Lucene 2i with Apache Spark
Apache Cassandra overview

Tunable consistency
Tradeoffs between consistency and latency are tunable. C* favors availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.

Incremental scalability
Nodes added to a cluster increase throughput in a predictable, linear fashion.

The best of Dynamo & Bigtable
Combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.

Decentralized
P2P architecture with no master node and no single point of failure.
Apache Cassandra operators
Cassandra query methods
[Figure: primary key, secondary indexes, and token ranges compared by throughput and expressiveness]
Primary key queries
•  O(1) node lookup for partition key
•  Range slices for clustering key
•  Usually requires denormalization
[Figure: a CLIENT querying Node 1, Node 2, and Node 3 by partition key and clustering key range; the example partitions apena, aagea, and dhiguero hold date-keyed cells such as 2014-04-06:body and 2014-04-10:body]
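The node lookup for a partition key can be sketched as a toy token ring. This is only an illustration under stated assumptions: the `TokenRing` class, the node tokens, and the use of MD5 as a hash are all hypothetical stand-ins, not Cassandra's actual partitioner. The point it shows is that the owner of a key is computed locally from the key itself, without searching the cluster.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to the (hypothetical) token it owns.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    @staticmethod
    def token_for(partition_key):
        # Assumption: MD5 is only a stand-in for the cluster's real partitioner.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for_token(self, token):
        # First node whose token is >= the key's token; wrap past the last one.
        idx = bisect.bisect_left(self._tokens, token)
        return self._entries[idx % len(self._entries)][1]

    def node_for(self, partition_key):
        return self.node_for_token(self.token_for(partition_key))


ring = TokenRing({"Node 1": 2**62, "Node 2": 2**63, "Node 3": 2**64 - 1})
owner = ring.node_for("apena")  # the same key always maps to the same node
```

Once the owning node is known, a clustering key range is read from that single partition, which is why this query method sits at the high-throughput end.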
Secondary index queries
•  Inverted index
•  Mitigates denormalization
•  Queries may involve all C* nodes
•  Queries limited to a single column
[Figure: a CLIENT querying three C* nodes, each holding its own 2i local column family]
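A minimal sketch of the idea, assuming a hypothetical `IndexedNode` class: each node keeps an inverted index over its own rows for one column, and a query on that column has to ask every node, because matching rows can live anywhere. It is not how Cassandra stores its index internally, only an in-memory analogy.

```python
from collections import defaultdict


class IndexedNode:
    """One node with a local inverted index on a single column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                    # primary key -> row
        self.inverted = defaultdict(set)  # column value -> primary keys

    def insert(self, key, row):
        # Sketch only: overwriting a key does not clean up its old index entry.
        self.rows[key] = row
        self.inverted[row[self.indexed_column]].add(key)

    def lookup(self, value):
        keys = self.inverted.get(value, set())
        return [self.rows[key] for key in sorted(keys)]


def query_by_index(nodes, value):
    # The coordinator cannot know which nodes hold matches, so it asks all.
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results
```

For example, inserting tweets into three `IndexedNode("username")` instances and calling `query_by_index(nodes, "apena")` returns the matching rows from every node that holds one.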
Token range queries
•  Used by MapReduce frameworks such as Hadoop or Spark
•  All kinds of queries are possible
•  Low throughput
•  Ad-hoc queries
•  Batch processing
•  Materialized views
[Figure: a CLIENT and a Spark master reading from three C* nodes; query = function(all data)]
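The "query = function(all data)" idea can be sketched as follows. The sketch assumes a signed 64-bit token space, like the one used by the Murmur3 partitioner, and the helper names (`split_token_ring`, `scan_range`, `full_scan`) are hypothetical. Every row is visited once, which is what makes this method expressive but slow.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Cut the ring into contiguous (start, end] ranges, start exclusive."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_range(rows, start, end, function):
    # rows is a list of (token, row) pairs standing in for one node's data.
    return [function(row) for token, row in rows if start < token <= end]


def full_scan(rows, num_splits, function):
    # Each range could be handed to a separate worker; here they run in turn.
    results = []
    for start, end in split_token_ring(num_splits):
        results.extend(scan_range(rows, start, end, function))
    return results
```

Because the ranges are half-open, a row whose token equals `MIN_TOKEN` would be skipped; the sketch assumes that value is never assigned to a key.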
Combining 2i with MapReduce
•  Expressiveness while avoiding full scans
•  Still limited to one indexed column per query
[Figure: a CLIENT and a Spark master querying three C* nodes, each with its own secondary index]
What do we miss from 2i indexes?
MORE EXPRESSIVENESS
•  Range queries
•  Multivariable search
•  Full text search
•  Sorting by fields
•  Top-k queries
What do we like about the existing 2i?
ITS ARCHITECTURE
•  Each node indexes its own data
•  The index implementation does not need to be distributed
•  Can be created after design and ingestion
•  Natural extension point
Thinking about a custom secondary index implementation…
WHY NOT USE LUCENE?
Why we like Lucene
•  Proven stable and fast indexing solution
•  Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
- Solr, ElasticSearch
•  Can be fully embedded in application code
•  Published under the Apache License
HOW IT WORKS

Create index
•  Can be built in the background at any moment
•  Real-time updates
•  Mapping eases ETL
•  Language aware

-- Base table
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);

-- Dummy column that the index and the search queries are attached to
ALTER TABLE tweets ADD lucene TEXT;

CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "EnglishAnalyzer"},
            userid : {type : "string"},
            username : {type : "string"}
        }
    }'
};
Filtering query

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match",
              field : "text",
              value : "cassandra"}
}' LIMIT 10;

[Figure: the CLIENT searches for 10 rows on three C* nodes, each with its own Lucene index; one node finds 6 rows, another finds 4, and the search is done]
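The behaviour in the figure can be sketched as a loop that stops as soon as the limit is met. This is an illustrative assumption about the control flow, not the coordinator's actual code; `filtered_search` and its arguments are hypothetical names. Since a filter does not rank rows, any rows that match are acceptable, so later nodes need not be contacted.

```python
def filtered_search(nodes, matches, limit):
    """Collect up to `limit` matching rows, contacting nodes only as needed.

    nodes is a list of row lists (one per node); matches is a predicate.
    Returns the rows found and the number of nodes that were contacted.
    """
    results = []
    contacted = 0
    for node_rows in nodes:
        remaining = limit - len(results)
        if remaining <= 0:
            break  # "We are done!": the limit is already satisfied
        contacted += 1
        hits = [row for row in node_rows if matches(row)]
        results.extend(hits[:remaining])
    return results, contacted
```

With three nodes holding 6, 4, and 3 matching rows and a limit of 10, the third node is never asked.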
Top-k query

SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match",
             field : "text",
             value : "cassandra"}
}' LIMIT 5;

[Figure: the CLIENT asks each of three C* nodes, each with its own Lucene index, for its top 5; the nodes find 5, 4, and 5 rows, and the 14 candidates are merged into the best 5]
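The merge step can be sketched with Python's `heapq`. The assumption here is that each row carries a relevance score and that a higher score is better; the function names are hypothetical. Unlike a filter, a top-k query has to consult every node, because the globally best rows may sit on any of them.

```python
import heapq


def node_top_k(scored_rows, k):
    """Best k (score, row_id) pairs held by a single node."""
    return heapq.nlargest(k, scored_rows, key=lambda pair: pair[0])


def merge_top_k(per_node_results, k):
    """Merge every node's local top-k into the global top-k."""
    candidates = [pair for result in per_node_results for pair in result]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])
```

Each node returns at most k rows, so the coordinator merges at most k rows per node, regardless of how much data the nodes store.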
Queries can be as complex as you want

SELECT * FROM tweets WHERE lucene = '{
    filter :
    {
        type : "boolean", must :
        [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should :
                [
                    {type : "prefix", field : "user", value : "a"},
                    {type : "wildcard", field : "user", value : "*b*"},
                    {type : "match", field : "user", value : "fast"}
                ]
            }
        ]
    },
    sort :
    {
        fields : [ {field : "time", reverse : true},
                   {field : "user", reverse : false} ]
    }
}' LIMIT 10000;
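To make the meaning of such a nested query concrete, here is a small in-memory evaluator. It is a sketch under several assumptions that the slides do not confirm: `must` clauses all have to match, a boolean with only `should` clauses needs at least one of them to match, the `lower` bound is treated as inclusive, and `match` is plain equality. The functions `evaluate` and `sort_rows` are hypothetical and are not part of the index's API.

```python
from fnmatch import fnmatchcase


def evaluate(condition, row):
    """Return True if the row satisfies the (nested) condition dict."""
    kind = condition["type"]
    if kind == "boolean":
        must = condition.get("must", [])
        should = condition.get("should", [])
        if must:
            # Assumption: with must clauses present, should is optional.
            return all(evaluate(c, row) for c in must)
        return any(evaluate(c, row) for c in should)
    value = row[condition["field"]]
    if kind == "range":
        return value >= condition["lower"]  # inclusive bound is an assumption
    if kind == "prefix":
        return value.startswith(condition["value"])
    if kind == "wildcard":
        return fnmatchcase(value, condition["value"])
    if kind == "match":
        return value == condition["value"]
    raise ValueError("unsupported condition type: " + kind)


def sort_rows(rows, sort_fields):
    """Sort by several fields; earlier fields take precedence."""
    ordered = list(rows)
    # Stable sorts applied from the last field to the first.
    for spec in reversed(sort_fields):
        ordered.sort(key=lambda row, f=spec["field"]: row[f],
                     reverse=spec["reverse"])
    return ordered
```

Applied to the query above, a row passes when its time is on or after 2014/04/25 and its user starts with "a", contains "b", or equals "fast"; the survivors are then ordered by time descending and user ascending.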
Some implementation details
NO MAINTENANCE REQUIRED
•  One Lucene document per CQL row, and one Lucene field per indexed column
•  SortingMergePolicy keeps the index sorted the same way C* does
•  Index commits are synchronized with column family flushes
•  Segment merges are synchronized with column family compactions
LUCENE AND SPARK
Integrating Lucene & Spark
Split friendly: it supports searches within a token range

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
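A Spark job could use this by issuing one such statement per split. The helper below is a hypothetical sketch that only builds the CQL strings; it assumes the `tweets` table shown earlier, applies `TOKEN` to `userid` because that is the table's partition key, and assumes the filter text contains no single quotes that would need escaping.

```python
def split_queries(lucene_filter, token_ranges, limit):
    """Build one index query per (start, end] token range."""
    statements = []
    for start, end in token_ranges:
        statements.append(
            "SELECT * FROM tweets WHERE lucene = '{}' "
            "AND TOKEN(userid) > {} AND TOKEN(userid) <= {} "
            "LIMIT {};".format(lucene_filter, start, end, limit)
        )
    return statements


search = '{filter : {type : "match", field : "text", value : "cassandra"}}'
queries = split_queries(search, [(-100, 0), (0, 100)], 10000)
```

Each statement can then run on a different worker, so the index narrows the rows inside every split instead of the job scanning the split in full.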
Integrating Lucene & Spark
Paging friendly: it supports starting queries at a certain point

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
Integrating Lucene & Spark
•  Compute large amounts of data
•  Avoid systematic full scan
•  Reduces the amount of data to be processed
•  Filtering push-down
[Figure: a CLIENT and a Spark master querying three C* nodes, each with its own Lucene index]
WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
Index performance in Spark
[Figure: time versus records returned, for full scan and for Lucene 2i]
DEMO: Lucene indexes in C*
Conclusions
•  Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
•  Top-k query support
- Lucene scoring formula
- Sort by field values
•  Compatible with MapReduce frameworks
•  Preserves Cassandra’s functionality
It's open source
github.com/stratio/stratio-cassandra
•  Published as a fork of Apache Cassandra
•  Apache License Version 2.0
stratio.github.io/crossdata
•  Apache License Version 2.0
github.com/stratio/deep-spark
•  Apache License Version 2.0

Advanced search and Top-K queries in Cassandra

  • 1. Advanced search and Top-K queries in Cassandra 1 Andrés de la Peña andres@stratio.com @a_de_la_pena Apache Cassandra Meetup 2015
  • 2. •  Stratio is a Big Data Company •  Founded in 2013 •  Commercially launched in 2014 •  70+ employees in Madrid •  Office in San Francisco •  Certified Spark distribution Apache Cassandra Meetup 2015 Who are we?
  • 3. Introduction to Cassandra Cassandra query methods Stratio Lucene based 2i implementation Integrating Lucene 2i with Apache Spark 1 2 3 CONTENTS Apache Cassandra Meetup 2015 4
  • 4. Tunable  consistency   Tradeoffs between consistency and latency are tunable. C* values a high availability and partitioning against consistency; strong consistency can be achieved but there is no row locking. Incremental  scalability   Nodes added to a cluster increase throughput in a predictable & linear fashion. The  best  of  Dynamo  &  Big  Table   Combines the partitioning and replication of Amazon’s Dynamo with the log- structured data model of Google’s Bigtable. Decentralized   P2P architecture without master node or single point of failure. Apache Cassandra overview 4Apache Cassandra Meetup 2015
  • 5. Apache Cassandra operators 5Apache Cassandra Meetup 2015
  • 6. primary key secondary indexes token ranges Throughput Expressiveness Cassandra query methods 6Apache Cassandra Meetup 2015
  • 7. •  O(1) node lookup for partition key •  Range slices for clustering key •  Usually requires denormalization Primary key queries Node 3 Node 1 Node 2 Partition key Clustering key range CLIENT apena 2014-04-10:body When you.. aagea dhiguero apena 2014-04-06:body 2014-04-07:body 2014-04-08:body To study and… To think and... If you see what.. 2014-04-06:body The cautious… 2014-04-10:body When you.. 2014-04-11:body When you do… 7Apache Cassandra Meetup 2015
  • 8. primary key secondary indexes token ranges Throughput Expressiveness Cassandra query methods 8Apache Cassandra Meetup 2015
  • 9. CLIENT C* node C* node 2i local column family C* node 2i local column family 2i local column family Secondary indexes queries •  Inverted index •  Mitigates denormalization •  Queries may involve all C* nodes •  Queries limited to a single column 9Apache Cassandra Meetup 2015
  • 10. primary key secondary indexes token ranges Throughput Expressiveness Cassandra query methods 10Apache Cassandra Meetup 2015
  • 11. C*   node   C*   node   C*   node   Spark master Token range queries •  Used by MapReduce frameworks as Hadoop or Spark •  All kinds of queries are possible •  Low throughput •  Ad-hoc queries •  Batch processing •  Materialized views CLIENT query= function (all data) 11Apache Cassandra Meetup 2015
  • 12. C*   node   C*   node   C*   node   Combining 2i with MapReduce •  Expressiveness avoiding full scans •  Still limited by one indexed column per query Spark masterCLIENT Secondary index Secondary index Secondary index 12Apache Cassandra Meetup 2015
  • 13. MORE EXPRESIVENESS What do we miss from 2i indexes? •  Range queries •  Multivariable search •  Full text search •  Sorting by fields •  Top-k queries 13Apache Cassandra Meetup 2015
  • 14. IT’S ARCHITECTURE What do we like from the existing 2i? •  Each node indexes its own data •  The index implementations do not need to be distributed •  Can be created after design and ingestion •  Natural extension point 14Apache Cassandra Meetup 2015
  • 15. Thinking in a custom secondary index implementation… WHY NOT USE ? 15Apache Cassandra Meetup 2015
  • 16. Why we like Lucene •  Proven stable and fast indexing solution •  Expressive queries - Multivariable, ranges, full text, sorting, top-k, etc. •  Mature distributed search solutions built on top of it - Solr, ElasticSearch •  Can be fully embedded in application code •  Published under the Apache License 16Apache Cassandra Meetup 2015
  • 17. HOW IT WORKS
  • 18. Create index
      • Built in the background at any moment
      • Real-time updates
      • Mapping eases ETL
      • Language aware

      CREATE TABLE tweets (
          id bigint,
          createdAt timestamp,
          message text,
          userid bigint,
          username text,
          PRIMARY KEY (userid, createdAt, id)
      );

      ALTER TABLE tweets ADD lucene TEXT;

      CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
      USING 'com.stratio.index.RowIndex'
      WITH OPTIONS = {
          'refresh_seconds' : '60',
          'schema' : '{
              default_analyzer : "EnglishAnalyzer",
              fields : {
                  createdat : {type : "date", pattern : "yyyy-MM-dd"},
                  message   : {type : "text", analyzer : "EnglishAnalyzer"},
                  userid    : {type : "string"},
                  username  : {type : "string"}
              }
          }'
      };
  • 19. Filtering query

      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }' LIMIT 10;

      (Diagram: the client's search for 10 rows goes to three C* nodes, each with its own Lucene index. One node finds 6 rows, the next finds 4, and the query is done.)
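
The "search 10, found 6, found 4, done" flow of the filtering query can be sketched as follows. This is an illustrative model only: `filtered_query` is a hypothetical helper, each node is a plain list of strings, and nodes are visited strictly one after another, which a real coordinator need not do. The point is that an unscored filter with a LIMIT can stop as soon as enough rows have been collected.

```python
def filtered_query(nodes, predicate, limit):
    """Collect up to `limit` matching rows, visiting nodes one at a time."""
    results = []
    visited = 0
    for rows in nodes:
        visited += 1
        remaining = limit - len(results)
        # Ask this node only for as many rows as are still missing.
        results.extend([r for r in rows if predicate(r)][:remaining])
        if len(results) >= limit:
            break  # enough rows: the remaining nodes are never contacted
    return results, visited


nodes = [
    [f"cassandra tweet {i}" for i in range(6)] + ["unrelated tweet"],
    [f"cassandra note {i}" for i in range(4)],
    ["cassandra row that is never read"],
]

results, visited = filtered_query(nodes, lambda row: "cassandra" in row, 10)
```

With this data the first node contributes 6 rows and the second contributes 4, so `visited` is 2 and the third node is skipped.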
  • 20. Top-k query

      SELECT * FROM tweets WHERE lucene = '{
          query : {type : "match", field : "text", value : "cassandra"}
      }' LIMIT 5;

      (Diagram: each of the three C* nodes searches its own Lucene index for its top 5 and finds 5, 4, and 5 rows; the coordinator merges the 14 candidates into the best 5.)
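
The merge step of the top-k query can be sketched in a few lines. This is a simplified model under stated assumptions: hits are `(score, row_id)` tuples with made-up scores, and `local_top_k` and `merge_top_k` are hypothetical names, not part of the Stratio plugin. Because every node returns its own best k, the global best k is guaranteed to be among the candidates the coordinator receives.

```python
import heapq


def local_top_k(scored_rows, k):
    # Each node returns only its k best hits, highest score first.
    return heapq.nlargest(k, scored_rows, key=lambda hit: hit[0])


def merge_top_k(per_node_hits, k):
    # The coordinator pools the partial results and keeps the global best k.
    all_hits = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


node_a = [(0.9, "a1"), (0.8, "a2"), (0.7, "a3"), (0.6, "a4"),
          (0.5, "a5"), (0.4, "a6"), (0.3, "a7")]
node_b = [(0.95, "b1"), (0.85, "b2"), (0.2, "b3"), (0.1, "b4")]
node_c = [(0.75, "c1"), (0.65, "c2"), (0.55, "c3"), (0.45, "c4"), (0.35, "c5")]

partials = [local_top_k(node, 5) for node in (node_a, node_b, node_c)]
candidates = sum(len(p) for p in partials)   # 5 + 4 + 5 = 14
best = merge_top_k(partials, 5)
```

Unlike the filtering case, no node can be skipped here: a node that has not been asked might hold the highest-scoring row.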
  • 21. Queries can be as complex as you want

      SELECT * FROM tweets WHERE lucene = '{
          filter : {
              type : "boolean",
              must : [
                  {type : "range", field : "time", lower : "2014/04/25"},
                  {type : "boolean", should : [
                      {type : "prefix", field : "user", value : "a"},
                      {type : "wildcard", field : "user", value : "*b*"},
                      {type : "match", field : "user", value : "fast"}
                  ]}
              ]
          },
          sort : {
              fields : [
                  {field : "time", reverse : true},
                  {field : "user", reverse : false}
              ]
          }
      }' LIMIT 10000;
  • 22. Some implementation details: NO MAINTENANCE REQUIRED
      • A Lucene document per CQL row, and a Lucene field per indexed column
      • SortingMergePolicy keeps the index sorted in the same way that C* does
      • Index commits synchronized with column family flushes
      • Segment merges synchronized with column family compactions
  • 24. Integrating Lucene & Spark
      Split friendly: it supports searches within a token range

      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }'
      AND TOKEN(userid) > 253653456456
      AND TOKEN(userid) <= 3456467456756
      LIMIT 10000;
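
How a MapReduce-style job might turn the token ring into independent splits can be sketched as below. This is a sketch under assumptions, not the deep-spark implementation: it assumes the Murmur3 partitioner (tokens from -2^63 to 2^63 - 1), assumes `userid` is the partition key as in the table definition on the "Create index" slide, and uses the hypothetical helpers `split_token_ring` and `split_query`. Each split produces one query that a worker can run on its own.

```python
MIN_TOKEN = -(2 ** 63)      # assumed Murmur3 partitioner bounds
MAX_TOKEN = 2 ** 63 - 1


def split_token_ring(num_splits):
    """Return contiguous (start, end] token ranges covering the ring."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(lucene_filter, start, end, limit=10000):
    # Build the CQL text for one split; TOKEN() takes the partition key only.
    return (
        "SELECT * FROM tweets "
        f"WHERE lucene = '{lucene_filter}' "
        f"AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} "
        f"LIMIT {limit};"
    )


lucene_filter = '{filter : {type : "match", field : "text", value : "cassandra"}}'
ranges = split_token_ring(4)
queries = [split_query(lucene_filter, start, end) for start, end in ranges]
```

Each range is open at its start and closed at its end, so adjacent splits never return the same row twice.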
  • 25. Integrating Lucene & Spark
      Paging friendly: it supports starting queries at a certain point

      SELECT * FROM tweets WHERE lucene = '{
          filter : {type : "match", field : "text", value : "cassandra"}
      }'
      AND userid = 3543534
      AND createdAt > '2011-02-03 04:05+0000'
      LIMIT 5000;
  • 26. Integrating Lucene & Spark
      • Computes over large amounts of data
      • Avoids systematic full scans
      • Reduces the amount of data to be processed
      • Filtering push-down
      (Diagram: the client talks to the Spark master, which reads from three C* nodes, each with its own Lucene index.)
  • 27. WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
  • 28. Index performance in Spark (chart: time vs. records returned, comparing full scan with Lucene 2i)
  • 29. DEMO: Lucene indexes in C*
  • 30. Conclusions
      • Added new query methods
          - Multivariable queries (AND, OR, NOT)
          - Range queries (>, >=, <, <=) and regular expressions
          - Full text queries (match, phrase, fuzzy...)
      • Top-k query support
          - Lucene scoring formula
          - Sort by field values
      • Compatible with MapReduce frameworks
      • Preserves Cassandra’s functionality
  • 31. It's open source
      github.com/stratio/stratio-cassandra
      • Published as a fork of Apache Cassandra
      • Apache License Version 2.0
      stratio.github.io/crossdata
      • Apache License Version 2.0
      github.com/stratio/deep-spark
      • Apache License Version 2.0
  • 32. Advanced search and Top-K queries in Cassandra. Andrés de la Peña, andres@stratio.com, @a_de_la_pena