An Introduction to DSE Search
Caleb Rackliffe
Software Engineer
caleb.rackliffe@datastax.com
@calebrackliffe
What problem were we trying to solve?
3
Application
DataStax Driver
4
SELECT * FROM customers WHERE country LIKE '%land%';
5
What about secondary indexes?
Why not just create your own secondary index
implementation that supports wildcard queries?
7
I need full-text search!
Why did we build something new?
10
Application
DataStax Driver Solr Client
Polyglot Persistence!
12
Application
DataStax Driver Solr Client
Consistency
Cost
Complexity
14
partitioning
multi-DC
replication
geospatial
wildcards
monitoring
C* field type support (UDT, Tuple, collections)
security
live indexing
sorting
faceting
fault-tolerant distributed search
caching
text analysis
grouping
automatic index updates
JVM
CQL
repair
15
Application
DataStax Driver Solr Client
Consistency
Complexity
Cost
How about some examples?
Creating a Solr Core
Start a node…
bash$ dse cassandra -s
Create a table…
cqlsh> CREATE KEYSPACE test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'Solr': 1};
cqlsh:test> CREATE TABLE test.user(username text PRIMARY KEY,
    fullname text,
    address_ map<text, text>);
Create the core…
bash$ dsetool create_core test.user generateResources=true
bash$ dsetool get_core_schema test.user
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
<fieldType class="org.apache.solr.schema.TextField" name="text">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType class="org.apache.solr.schema.StrField" name="string"/>
</types>
<fields>
<field indexed="true" name="username" stored="true" type="string"/>
<field indexed="true" name="fullname" stored="true" type="text"/>
<dynamicField indexed="true" name="address_*" stored="true" type="string"/>
</fields>
<uniqueKey>username</uniqueKey>
</schema>
The Schema
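If you do tweak the generated schema, a quick sanity check before re-uploading it can catch simple mistakes. The following is a minimal sketch, not part of DSE: `summarize_schema` is a hypothetical helper, the XML is an abbreviated copy of the schema above, and the unique key is assumed to be `username`, the table's primary key.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the auto-generated schema (field types omitted).
SCHEMA = """<schema name="autoSolrSchema" version="1.5">
  <fields>
    <field indexed="true" name="username" stored="true" type="string"/>
    <field indexed="true" name="fullname" stored="true" type="text"/>
    <dynamicField indexed="true" name="address_*" stored="true" type="string"/>
  </fields>
  <uniqueKey>username</uniqueKey>
</schema>"""


def summarize_schema(xml_text):
    """Return (static fields by type, dynamic field patterns, unique key)."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    dynamic = [f.get("name") for f in root.iter("dynamicField")]
    unique_key = root.findtext("uniqueKey")
    return fields, dynamic, unique_key


fields, dynamic, unique_key = summarize_schema(SCHEMA)
# The unique key should name one of the declared static fields.
print(unique_key in fields)
```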
Insert Rows (…and Index Documents)
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('sbtourist', 'Sergio Bossa', {'address_home' : 'UK', 'address_work' : 'UK'});
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('bereng', 'Berenguer Blasi', {'address_home' : 'ES', 'address_work' : 'ES'});
cqlsh:test> INSERT INTO user(username, fullname, address_)
VALUES('thegrinch', 'Sven Delmas', {'address_home' : 'US', 'address_work' : 'HQ'});
…and that’s it. No ETL. No writing to a second datastore.
Wildcards
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"address_home:U*"}';
username | address_
-----------+----------------------------------------------------
sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
(2 rows)
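To make the matching rule concrete, here is a minimal sketch in plain Python of what the `U*` pattern selects from the three sample rows. It only illustrates the semantics; it is not how Solr evaluates the query (Solr consults its index rather than scanning rows), and `match_wildcard` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Sample rows copied from the INSERT statements on the earlier slide.
USERS = {
    "sbtourist": {"address_home": "UK", "address_work": "UK"},
    "bereng": {"address_home": "ES", "address_work": "ES"},
    "thegrinch": {"address_home": "US", "address_work": "HQ"},
}


def match_wildcard(rows, field, pattern):
    """Return the usernames whose `field` value matches a shell-style pattern."""
    return sorted(
        username
        for username, address in rows.items()
        if field in address and fnmatchcase(address[field], pattern)
    )


print(match_wildcard(USERS, "address_home", "U*"))
```

The match is case-sensitive here, which mirrors the untokenized `string` type used for the dynamic field in the schema.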
Sorting and Limits
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}';
username | address_
-----------+----------------------------------------------------
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
sbtourist | {'address_home': 'UK', 'address_work': 'UK'}
bereng | {'address_home': 'ES', 'address_work': 'ES'}
(3 rows)
cqlsh:test> SELECT username, address_
FROM user
WHERE solr_query='{"q":"*:*", "sort":"address_home desc"}'
LIMIT 1;
username | address_
-----------+----------------------------------------------------
thegrinch | {'address_home': 'US', 'address_work': 'HQ'}
(1 rows)
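The JSON inside `solr_query` is easy to mis-quote by hand. As a minimal sketch, the payload can be generated with `json.dumps` and embedded in the CQL string literal. The helper name `solr_query_statement` is hypothetical, and building statements by string concatenation is shown only for illustration; in an application, a driver's bound parameters would usually be the safer choice.

```python
import json


def solr_query_statement(table, columns, query, limit=None):
    """Build a CQL SELECT string with a JSON solr_query restriction."""
    payload = json.dumps(query)           # valid JSON with straight double quotes
    payload = payload.replace("'", "''")  # escape single quotes for the CQL literal
    cql = f"SELECT {', '.join(columns)} FROM {table} WHERE solr_query='{payload}'"
    if limit is not None:
        cql += f" LIMIT {int(limit)}"
    return cql + ";"


print(solr_query_statement(
    "test.user",
    ["username"],
    {"q": "*:*", "sort": "address_home desc"},
    limit=1,
))
```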
Faceting
cqlsh:test> SELECT *
FROM user
WHERE solr_query='{"q":"*:*", "facet":{"field" : "address_work"}}';
facet_fields
--------------------------------------------
{"address_work" : {"ES" : 1 , "HQ" : 1 , "UK" : 1}}
(1 rows)
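Conceptually, a field facet groups the matching documents by a field's value and counts each group. A minimal sketch in plain Python over the three sample rows follows; `facet_counts` is a hypothetical helper and says nothing about how Solr computes facets internally.

```python
from collections import Counter

# address_work values copied from the INSERT statements on the earlier slide.
WORK_ADDRESSES = {"sbtourist": "UK", "bereng": "ES", "thegrinch": "HQ"}


def facet_counts(values):
    """Count how many documents carry each distinct value, sorted by value."""
    return dict(sorted(Counter(values).items()))


print({"address_work": facet_counts(WORK_ADDRESSES.values())})
```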
Partition Restrictions
cqlsh:test> CREATE TABLE event(sensor_id bigint,
recording_time timestamp,
description text,
PRIMARY KEY(sensor_id, recording_time));
…
cqlsh:test> SELECT recording_time, description
FROM test.event
WHERE sensor_id = 2314234432
AND solr_query='description:unremarkable';
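The effect of combining a partition key restriction with a search term can be sketched as two steps: go to the one partition, then filter inside it. The code below is a simplified model in plain Python with hypothetical sample events; the names `build_partitions` and `search_partition` are illustrative and do not correspond to any DSE API.

```python
from collections import defaultdict

# Hypothetical sample events: (sensor_id, recording_time, description).
EVENTS = [
    (2314234432, "2015-09-22T10:00:00", "unremarkable reading"),
    (2314234432, "2015-09-22T10:05:00", "voltage spike"),
    (9999999999, "2015-09-22T10:00:00", "unremarkable reading"),
]


def build_partitions(events):
    """Group rows by partition key (sensor_id)."""
    partitions = defaultdict(list)
    for sensor_id, recording_time, description in events:
        partitions[sensor_id].append((recording_time, description))
    return partitions


def search_partition(partitions, sensor_id, term):
    """Visit a single partition, then keep rows whose description contains the term."""
    return [
        (recording_time, description)
        for recording_time, description in partitions.get(sensor_id, [])
        if term in description.lower().split()
    ]


partitions = build_partitions(EVENTS)
print(search_partition(partitions, 2314234432, "unremarkable"))
```

Rows from other partitions are never examined, even when they would match the search term.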
What do the internals look like?
Indexing
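26
[Diagram: documents move through three stages, Buffered, Searchable, and Durable, laid out across Memory and Disk.]
27
[Diagram: the same Buffered, Searchable, and Durable stages across Memory and Disk.]
28
[Diagram: the Solr equivalents of the stages. A RAMBuffer and Segments are laid out across Memory and Disk under the Buffered, Searchable, and Durable stages, with transitions labelled Soft Commit and Hard Commit.]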
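The three stages can be modelled in a few lines. This is a deliberately simplified sketch, not Solr's or DSE's implementation: the class `NearRealTimeIndex` is hypothetical, documents are plain strings, and a hard commit is assumed to also make buffered documents visible.

```python
class NearRealTimeIndex:
    """Toy model of the buffered, searchable, and durable stages."""

    def __init__(self):
        self.buffered = []    # accepted, but not yet visible to searches
        self.searchable = []  # visible to searches, not yet durable
        self.durable = []     # persisted; would survive a restart

    def add(self, doc):
        self.buffered.append(doc)

    def soft_commit(self):
        """Make buffered documents visible to searches."""
        self.searchable.extend(self.buffered)
        self.buffered.clear()

    def hard_commit(self):
        """Persist every document accepted so far (simplifying assumption)."""
        self.soft_commit()
        self.durable.extend(self.searchable)
        self.searchable.clear()

    def search(self, term):
        """Search only the documents that are currently visible."""
        visible = self.durable + self.searchable
        return [doc for doc in visible if term in doc]


index = NearRealTimeIndex()
index.add("dog food")
print(index.search("dog"))  # still buffered, so nothing is visible yet
index.soft_commit()
print(index.search("dog"))  # visible after the soft commit
```

Committing softly after every write minimizes the visibility delay but, as the speaker notes explain, costs indexing throughput; committing rarely does the opposite.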
Querying
Replica Selection
[Diagram: a coordinator and nodes 1-5, RF=2, shards A-E with two replicas of each shard; nodes are marked Healthy or Unhealthy.]
What happens if a shard query fails?
Failover: Phase 1
[Diagram: 4 nodes (1-4), RF = 2, shards A-D, no vnodes]
Failover: Phase 2
[Diagram: same cluster as Phase 1]
Failover: Phase 3
[Diagram: same cluster as Phase 1]
Platform Integrations
Search + Analytics: Explicit Predicate Pushdown
bash$ dse spark
scala> val table = sc.cassandraTable("wiki","solr")
scala> val result = table.select("id","title")
.where("solr_query='body:dog'")
.collect
http://docs.datastax.com

DataStax: An Introduction to DataStax Enterprise Search

Editor's Notes

  1. “Hello! My name is Caleb Rackliffe, and I’m a member of the search team at DataStax. Today I’d like to walk you through a brief (but action-packed) introduction to DataStax Enterprise Search. I’ll start with a question…”
  2. “Before we talk about what DSE Search is, let’s make sure we know why we built it.”
  3. “Here we have a small Cassandra cluster and an application sitting on top of it, using the DataStax driver. We can go a long way with CQL and proper denormalization, but what happens when we find ourselves wanting to do something as seemingly simple as…”
  4. “…this. You’ll recognize the SQL-style wildcard query, which Cassandra does not support out of the box.”
  5. Cassandra’s built-in secondary indexes might seem like a solution, but they don’t support wildcard queries, can perform poorly unless limited to a single partition, can perform poorly for very high or very low cardinality fields, and may fail for a frequently updated or deleted column.
  6. “You could, but then you’d be saddled with the cost of building and maintaining that, and you’ll still end up with something that is designed for a fairly specific use-case.”
  7. “So when our search problem lacks the structure to make denormalization effective, and is beyond the capabilities of C* secondary indexes, we need to think a bit more broadly.”
  8. “Fortunately, there are technologies out there that handle full-text and other more advanced kinds of search well, and most of them, like Solr, are built on the foundation of the Apache Lucene project.”
  9. “Well, let’s see what it would look like to use a separate, Lucene-based search cluster alongside our Cassandra cluster…”
  10. “…here we are. Our application is now sitting on top of both a Cassandra cluster and a separate search cluster. Notice that we’ve added a new client to our application, specifically for search. So we’ve got Cassandra doing key-value lookups and probably some range queries…we’ve got our search cluster handling the more advanced ad-hoc queries for us.”
  11. “This is polyglot persistence at its best…right?”
  12. “Well, maybe not…and we can talk about this along 3 axes.” Complexity - The persistence layer of our application is now more complex. We have to configure two clients, write to two data stores, and, if we write to one of them asynchronously, manage a queueing solution. Consistency - Since the two data stores have no explicit knowledge of each other, we have to manage questions of consistency between them in our application. Cost - Aside from the implicit cost of complexity, we’ll also need to deal with the explicit cost of infrastructure and hardware for a separate cluster.
  13. “So if you need to avoid data loss, scale your writes, and replicate your index over multiple DCs, your architecture might start to look like this lovely Rube Goldberg machine. We wanted to provide all of this in an operationally simple package…”
  14. “DSE Search is designed to address those problems. We’ve built a coherent search platform that integrates Cassandra’s distributed persistence, Lucene’s core search and indexing functionality, and the advanced features of Solr in the same JVM…and then we’ve made a number of our own enhancements, which we’ll see in the coming slides.”
  15. “So back to our architecture diagram. First, with DSE Search, we can eliminate the cost associated with running a separate search cluster. We can eliminate much of the complexity at the application layer, since we don’t have to deal with two clients, and we only have to manage one write path…and with all of our data stored in Cassandra alone and collocated with the relevant shards of our search index, we’ve eliminated many of the potential issues of consistency between the two.”
  16. “We’ll go into more details on the indexing and query paths, but before we do that, let’s run through some basic examples and get a feel for the ergonomics of our solution.”
  17. “First, we’ll start up a single node. (The -s switch here tells the node it’s going to handle a search workload.) Second, we create a table from the CQL prompt. Third, we create a Solr core over that table from dsetool…and that’s it. We’re ready to index documents. Note that we don’t have to create the Solr schema explicitly, because DSE Search creates it for us, using the CQL schema to determine its type mappings.”
  18. “Under the hood, the schema actually looks something like this, but you shouldn’t need to trouble yourself with it, unless our default type mappings aren’t quite right for you. In that case, you can just tweak the auto-generated schema and re-upload it.”
  19. “Next we insert a few rows, which will be indexed automatically for search. There is no ETL involved and no explicit writing to a second data store. We’re ready to make some queries…”
  20. “…so let’s start with a simple wildcard query. Here, we want to find everyone whose home address starts with a U, and of course we find users in the United States and the UK.”
  21. “Sorting and Limits! In the first query, we just find all our users and sort them descending by home address. In the second query, we do the same thing except we also use the CQL LIMIT keyword to narrow our results down to just the top result by home address.”
  22. “Faceting allows us to take the results of a query, in this case a query for all documents, group them, and count the members in each group. In this example, faceting on our users’ work addresses tells us that we have one working in Spain, one at corporate headquarters, and one in the UK. This is very common in the context of a product search, where a user wants to drill into results by brand.”
  23. “What if we want to restrict our search to a specific partition? Here I have another table, one that records series of sensor events. Using a CQL partition key restriction in our WHERE clause, we can ensure that our query visits only the node that contains that partition and then filters on it once we get there. Much like our earlier usage of LIMIT, this is a case where we’re translating CQL instructions to search-specific instructions under the hood.”
  24. “Now that we have an idea of what basic usage looks like, let’s take a high-level look at what’s going on in the indexing and query internals…”
  25. “The indexing process starts with a Cassandra write. It arrives at the coordinator, is distributed to the proper replicas, and is written to the commit log and Memtable, as you would expect. At this point, we create an updated Lucene document and queue it up for indexing, then we return to the coordinator and the client. Then, asynchronously, we update the index. Finally, also in the background, when a C* Memtable is flushed to disk, we also flush the corresponding index updates to disk, ensuring their durability.”
  26. “In near-real-time search systems, updated documents, once indexed, progress through 3 stages: a buffered stage, where they are just accumulated in memory; a searchable stage, where they move to disk and become visible to ongoing queries; and a durable stage, where they are permanently added to the index and will survive restart.” “Because moving from the “buffered” layer to the “searchable” layer is expensive, we are forced to make a tradeoff between the visibility of our data and indexing throughput. That is, we can make our writes visible to ongoing searches more quickly at the cost of slower indexing throughput, or we can maximize indexing throughput with longer delays before writes are visible to searches.”
  27. “In DSE 4.7, we released a feature called “Live Indexing”. Essentially, we’ve made indexed documents buffered in memory searchable, eliminating the need to build a separate “searchable” representation of the index and the need to make a hard decision between update availability and throughput. This might remind you of the Cassandra write path, where we have “searchable” Memtables buffered in memory that are periodically flushed to “durable” SSTables.”
  28. “This is what it would look like if we mapped these stages to their equivalents in Solr. Notice that the soft commit process creates searchable segments, which must later be merged by Lucene in the background. Since live indexing bypasses this second level, we can accumulate larger segments before flushing to disk, and this reduces the cost of the segment merges that occur in the background.”
  29. “On the query side, we’ve implemented our own distributed search, informed by the topology of the cluster that Cassandra makes available to us. Here we have a 4-node cluster with a replication factor of 2. Our first step is to determine the set of nodes that optimally covers the ring, in this case, the tokens from 0 -> 1000. We then scatter the query to those nodes, find the IDs for matching documents, and read the documents themselves, which are stored only in Cassandra. Notice here that, to minimize fan-out, we only contact node 3, not nodes 4 and 2, to cover ranges 0 -> 250 and 250 -> 500.”
  30. “When we need to choose between replicas of a particular token, we do our best to minimize fan-out, to cover the entire dataset optimally. When multiple nodes could be optimal selections, we look more closely at the health and activity of those nodes. In this example we have a 5-node cluster with a replication factor of 2 and index shards A-E. We’ll denote health here by color, with green being healthy, red being unhealthy, and yellow in the middle. If we need to cover shard B, we can query either node 2 or node 3, but we’ll pick node 2, because it’s healthier.”
  31. “However, node health is not the only criterion we use for selection. If node 2 is healthy, but is also in the middle of an expensive operation, let’s say, rebuilding its search index, we’ll want to choose node 3, since node 2 is potentially both out of date and unable to devote as many resources to handling incoming queries.”
  32. “Here we have a healthy 4-node cluster with a replication factor of 2 and 4 index shards. If node 1 coordinates our request, it only needs to contact itself and node 3 to cover all 4 of the shards A-D…”
  33. “…but then node 3 fails. It could have been a disk failure or a network issue…”
  34. “…but it was probably because you let this guy near it.”
  35. “In any case, we still need to cover shards B and C, but node 3 was the only node that contained both of them, so we’ll need to contact nodes 2 and 4.”
  36. “To this point, I’ve talked about search in a fairly isolated way, but in the context of a larger platform, there are opportunities to step outside that.”
  37. “One example is the integration we released in DSE 4.7 with Spark - a component of DSE Analytics. There are cases where pushing a search query through a Spark job can meaningfully cut down on the size of the RDD Spark presents for analysis. In this example, we’re using search to select only the Wikipedia articles that contain the word ‘dog’, avoiding some unnecessary filtering after we build the RDD.”
  38. “Well that wraps it up for me. If you’d like to dig deeper into any of the topics I covered here, or you’d like to try DSE out for yourself, please visit docs.datastax.com. Thank you all so much for coming, and enjoy the rest of your Summit!”
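
The shard-coverage walkthrough in notes 32 to 35 can be approximated with a small greedy sketch. This is not DSE Search's actual selection code. The shard layout in `PLACEMENT` is an assumption chosen to be consistent with the notes (node 1 holds A and D, node 3 holds B and C), and the function name `choose_nodes`, the "coordinator first" rule, and the lowest-node-id tie-break are all hypothetical. The health and activity signals from notes 30 and 31 are not modeled here.

```python
from typing import Dict, Iterable, List, Set

# Assumed layout: 4 nodes, replication factor 2, shards A-D placed ring-style.
PLACEMENT: Dict[int, Set[str]] = {
    1: {"D", "A"},
    2: {"A", "B"},
    3: {"B", "C"},
    4: {"C", "D"},
}


def choose_nodes(
    placement: Dict[int, Set[str]],
    coordinator: int,
    down: Iterable[int] = (),
) -> List[int]:
    """Greedily pick live nodes until every shard is covered.

    Hypothetical sketch: use the coordinator's local shards first, then
    repeatedly take the live node that covers the most uncovered shards,
    breaking ties by lowest node id.
    """
    down_set = set(down)
    live = {n: set(s) for n, s in placement.items() if n not in down_set}
    uncovered: Set[str] = set().union(*placement.values())
    chosen: List[int] = []

    # Start with the coordinator so its shards are served locally.
    if coordinator in live:
        chosen.append(coordinator)
        uncovered -= live.pop(coordinator)

    while uncovered:
        # max() returns the first maximal item, so sorting gives the
        # lowest-id tie-break.
        best = max(
            sorted(live),
            key=lambda n: len(live[n] & uncovered),
            default=None,
        )
        if best is None or not (live[best] & uncovered):
            raise RuntimeError(f"no live replica for shards {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= live.pop(best)

    return chosen
```

Under these assumptions, `choose_nodes(PLACEMENT, coordinator=1)` contacts nodes 1 and 3, and `choose_nodes(PLACEMENT, coordinator=1, down={3})` falls back to nodes 1, 2, and 4, matching the fan-out described in the notes.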