Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco

October 13-16, 2016 • Austin, TX

Parallel SQL
Joel Bernstein
Search Engineer, Alfresco
jbernste@apache.org

Introduction
•  Joel Bernstein
•  Lucene/Solr Committer
•  Search Engineer at Alfresco
•  Live and work in NYC

Alfresco
•  Open source ECM (Enterprise Content Management)
•  Alfresco is a system of record for documents
•  Uses Solr for search
•  1800+ customers
•  11 million active user accounts
•  Alfresco Solr: Document level access control,
eventually consistent, transactional,
multi-master, distributed search and faceting (coming
in Alfresco 5.1)

Agenda
1.  SQL Unleashed (What can it do?)
2. SQL Under the Hood (How does it work?)

SQL Unleashed
(In Solr 6.0)

Why SQL?
•  Solr has many awesome features.
•  But all of these features create complexity.
•  Which faceting API to use? When to
Stream? Which parameters to use for
optimal performance?
•  The complexity level increases
dramatically when distributed joins come
into play
•  With SQL we can provide an optimizer to
choose the best query plan.

The SQL Interface at a Glance
•  SQL over Map/Reduce: supports high
cardinality aggregations and
distributed joins.
•  SQL over Facets: high performance on
moderate cardinality aggregations.
•  SQL with Solr Search Predicates
•  SQL is fully integrated with SolrCloud

SQL Syntax: Limited and Unlimited SELECT
•  select colA, colB from tableB
•  select colA, colB from tableB limit 100
•  Unlimited selects return the entire result
set. Return fields must be DocValues.
•  Limited selects can sort by score and
retrieve any stored field.

SQL Syntax: ORDER BY
•  select a, b from tableB order by a desc,
b desc
•  Unlimited selects sort the entire result
set

The Predicate: Phrase Searching
•  select a, b from tableB where c = 'hello world'
•  Searches for the phrase 'hello world' in
field c.

The Predicate: Boolean searching
•  select a, b from tableB where c = '(hello world)'
•  Adding parens searches for (hello OR world).
•  Supports Solr query syntax inside the parens.

The Predicate: Range query
•  select a, b from tableB where c = '[0 TO 100]'

The Predicate: Arbitrary Boolean clauses
•  select a, b from tableB where (c = 'hello world'
AND d = '[0 TO 100]')
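
The predicate slides above describe three value shapes: a plain string (phrase search), a parenthesised string (Solr query syntax), and a bracketed string (range). A minimal Python sketch of that mapping follows. It illustrates the rules as stated on the slides and is not Solr's actual translation code; the function names `predicate_to_solr` and `and_clauses` are hypothetical, and the choice to quote only multi-word values is an assumption.

```python
def predicate_to_solr(field, value):
    """Map a SQL `field = 'value'` predicate to a Solr query clause.

    Illustrative only: mirrors the rules on the slides, not Solr's parser.
    """
    value = value.strip()
    if value.startswith("(") or value.startswith("["):
        # Parenthesised or bracketed values are passed through as Solr syntax.
        return f"{field}:{value}"
    if " " in value:
        # Assumption: a multi-word value becomes a quoted phrase query.
        return f'{field}:"{value}"'
    return f"{field}:{value}"


def and_clauses(*clauses):
    """Combine already-translated clauses with AND, wrapped in parentheses."""
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    print(predicate_to_solr("c", "hello world"))
    print(predicate_to_solr("c", "(hello world)"))
    print(and_clauses(predicate_to_solr("c", "hello world"),
                      predicate_to_solr("d", "[0 TO 100]")))
```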

SQL Syntax: Select Distinct
•  select distinct a, b from tableB
•  Map/Reduce Implementation: Tuples
are shuffled to worker nodes where the
distinct operation is performed.
•  JSON Facet Implementation: distinct
operation is pushed down into the
search engine
•  Map/Reduce for high cardinality
•  Facet for high QPS

Shuffle vs Push Down
•  Shuffling: high cardinality and
parallel relational algebra
(Distributed Joins)
•  Pushdown (Facet): blazing fast, high
QPS, moderate cardinality
•  aggregationMode flag is available with
the JDBC driver and http interface
[map_reduce or facet]
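
As a rough illustration of the HTTP interface, the Python sketch below builds (but does not send) a request that selects the aggregation mode. It assumes the handler is mounted at `/sql` under the collection and takes the statement in a `stmt` parameter, as in the Solr 6 documentation; the host, port, and the helper name `build_sql_request` are placeholders.

```python
from urllib.parse import urlencode

ALLOWED_MODES = ("map_reduce", "facet")


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Return (url, form_body) for an HTTP POST to a collection's /sql handler.

    Assumption: the statement travels in a `stmt` form parameter and the mode
    in an `aggregationMode` query parameter.
    """
    if aggregation_mode not in ALLOWED_MODES:
        raise ValueError("aggregationMode must be map_reduce or facet")
    query = urlencode({"aggregationMode": aggregation_mode})
    url = f"{base_url.rstrip('/')}/{collection}/sql?{query}"
    body = urlencode({"stmt": stmt})
    return url, body


if __name__ == "__main__":
    # Placeholder host and collection; nothing is sent over the network.
    print(build_sql_request("http://localhost:8983/solr", "tableB",
                            "select distinct a, b from tableB", "facet"))
```

Choosing `facet` favours high QPS on moderate cardinality, while `map_reduce` favours high cardinality, as the bullets above describe.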

Aggregations: Stats
•  select count(*), sum(a) from tableA
•  Uses the StatsComponent under the covers
•  Initial release supports count, sum, avg, min,
max
•  Aggregation logic is always
pushed down into the search engine.

Aggregations: GROUP BY
•  select a, b, count(*), sum(c) from tableB group by
a, b having count(*) > 50 order by sum(c) desc
•  Supports complex having clause: having (count(*)
> 50 AND sum(c) < 1000)
•  Has Map/Reduce implementation (shuffle)
•  And JSON Facet implementation (push down)
•  Map/Reduce can handle high cardinality multi-
dimension aggregations.

JDBC Driver
•  Ships with Solrj
•  Poolable Connection and Statement
•  SolrCloud Aware Load Balancing
•  Connection has aggregationMode
switch [map_reduce or facet]

SQL Parsing
•  Presto SQL Parser handles the parsing
•  SQL Statements are compiled to
TupleStream objects
•  The TupleStream is the base interface of the
Streaming API
•  The Streaming API is a general purpose
parallel computing API for SolrCloud

Parallel Computing Framework
•  Shuffling
•  Worker Collections
•  Streaming API
•  Streaming Expressions
•  Parallel SQL

Shuffling (sorting & partitioning)
•  Shuffling is pushed down into the search engine
•  Sorting: /export handler “stream sorts”
entire result sets.
•  Partitioning: HashQParserPlugin, hash
partitioning filter. Partitions results on
arbitrary fields.
•  Tuples (search results) begin streaming
instantly to worker nodes. Shuffling
never requires a spill to disk.
•  All replicas shuffle in parallel for the same
query. Allows for massive throughput.
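
The sort-and-partition idea can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions: `zlib.crc32` stands in for whatever hash function Solr uses, the function name `shuffle` is hypothetical, and real shuffling happens inside the shards (the /export handler sorts, the hash partitioning filter selects each worker's share) rather than in one loop.

```python
import zlib
from operator import itemgetter


def shuffle(tuples, key, num_workers):
    """Partition tuples by a hash of `key`, then sort each partition on `key`.

    Every tuple with the same key value lands in the same partition, so a
    worker can aggregate or join its share without talking to other workers.
    """
    partitions = [[] for _ in range(num_workers)]
    for t in tuples:
        # crc32 is a deterministic stand-in for Solr's partitioning hash.
        h = zlib.crc32(str(t[key]).encode("utf-8"))
        partitions[h % num_workers].append(t)
    for p in partitions:
        p.sort(key=itemgetter(key))
    return partitions


if __name__ == "__main__":
    docs = [{"a": "x", "n": 1}, {"a": "y", "n": 2}, {"a": "x", "n": 3}]
    for i, part in enumerate(shuffle(docs, "a", 2)):
        print("worker", i + 1, part)
```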

Shuffling (sorting & partitioning)
[Diagram: a Client, Worker 1 and Worker 2, and two replicas each of Shard 1 and Shard 2. Each worker is shuffled ½ the result set. Tuples are sorted and partitioned on keys.]

Worker Collections
•  Are Generic SolrCloud Collections
•  Can hold data, or just perform work
•  Search results are shuffled to the
workers
•  Configured with the /stream handler

Streaming API
•  Java Programming API for the parallel
computing framework
•  Real-time Map/Reduce and Parallel
Relational Algebra
•  Abstracts search results as Streams of
tuples (TupleStream)
•  Streams are transformed in parallel by
pluggable Decorator streams.
•  Parallel transformations include: group by, roll
up, union, intersect, complement and join
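
The decorator idea above (streams of tuples wrapped by streams that transform them) can be imitated with Python generators. The sketch below is an analogy, not the Java Streaming API: `search`, `unique`, and `inner_join` are hypothetical names, tuples are plain dicts, and the join assumes both inputs are already sorted on the join key, which is what shuffling provides.

```python
from itertools import groupby
from operator import itemgetter


def search(docs, sort):
    """Source stream: yields tuples (dicts) sorted on `sort`."""
    yield from sorted(docs, key=itemgetter(sort))


def unique(stream, over):
    """Decorator: emit the first tuple for each distinct value of `over`."""
    for _, group in groupby(stream, key=itemgetter(over)):
        yield next(group)


def inner_join(left, right, on):
    """Decorator: merge join of two streams that are both sorted on `on`."""
    lgroups = groupby(left, key=itemgetter(on))
    rgroups = groupby(right, key=itemgetter(on))
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(lgroups, (None, None))
        elif lk > rk:
            rk, rg = next(rgroups, (None, None))
        else:
            # Keys match: pair every left tuple with every right tuple.
            right_rows = list(rg)
            for l in lg:
                for r in right_rows:
                    yield {**l, **r}
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))


if __name__ == "__main__":
    people = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
    orders = [{"id": 1, "amt": 10}, {"id": 1, "amt": 7}, {"id": 3, "amt": 5}]
    for t in inner_join(search(people, "id"), search(orders, "id"), "id"):
        print(t)
```

Because each decorator only pulls the next tuple from the stream it wraps, nothing needs to hold the full result set in memory apart from the tuples sharing the current join key.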

Streaming Expressions
•  Contributed by Dennis Gove (Bloomberg)
•  String Query Language and Serialization
format for the Streaming API
•  Streaming Expressions compile to
TupleStreams
•  TupleStreams serialize to
Streaming Expressions

Parallel SQL
•  Compiles SQL to a TupleStream
•  The TupleStream is serialized to a
Streaming Expression and sent to
worker nodes.
•  Worker nodes translate the Streaming
Expression back into TupleStream
•  Worker nodes open() and read() the
TupleStream in parallel. Tuples are
returned from each worker
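
The last step above can be pictured with a small Python sketch: each worker processes its own partition in parallel, and the caller recombines the results. The slide only says tuples are returned from each worker; merging the sorted worker outputs with `heapq.merge` is an assumption made for this illustration, and `worker` and `parallel` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter


def worker(partition, key):
    """Stand-in for one worker: count tuples per key, output sorted on key."""
    counts = {}
    for t in partition:
        counts[t[key]] = counts.get(t[key], 0) + 1
    return [{key: k, "count(*)": c} for k, c in sorted(counts.items())]


def parallel(partitions, key):
    """Run one worker per partition, then merge their sorted outputs."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(lambda p: worker(p, key), partitions))
    # Assumption: the caller merges worker results on the sort key.
    return list(heapq.merge(*results, key=itemgetter(key)))


if __name__ == "__main__":
    # Partitions as a shuffle would produce them: each key in one partition.
    parts = [[{"a": "x"}, {"a": "x"}, {"a": "z"}], [{"a": "y"}]]
    print(parallel(parts, "a"))
```

Since partitioning sends every tuple for a given key to exactly one worker, each worker's per-key result is already complete and the merge never has to combine partial aggregates.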

From SQL to Streaming Expression
SQL:
    select str_s, count(*), sum(field_i), min(field_i), max(field_i),
           avg(field_i)
    from collection1 where text='XXXX' group by str_s

Streaming Expression:
    rollup(
        search(collection1,
               q="(text:XXXX)",
               qt="/export",
               fl="str_s, field_i",
               partitionKeys=str_s,
               sort="str_s asc",
               zkHost="127.0.0.1:64149/solr"),
        over=str_s,
        count(*),
        sum(field_i),
        min(field_i),
        max(field_i),
        avg(field_i))
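
To make the rollup step concrete, here is a minimal Python sketch of what a rollup over a stream sorted on `str_s` computes. It is an analogy for the expression above, not the Java implementation; the function name `rollup` and its signature are chosen for this example, and the input is assumed to arrive already sorted on the `over` field, as the `sort="str_s asc"` parameter arranges.

```python
from itertools import groupby
from operator import itemgetter


def rollup(stream, over, field):
    """Emit count, sum, min, max and avg of `field` for each run of `over`.

    Requires `stream` to be sorted on `over`; only running totals are kept,
    so a group never has to be buffered in memory.
    """
    for key, group in groupby(stream, key=itemgetter(over)):
        count, total, lo, hi = 0, 0, None, None
        for t in group:
            v = t[field]
            count += 1
            total += v
            lo = v if lo is None else min(lo, v)
            hi = v if hi is None else max(hi, v)
        yield {
            over: key,
            "count(*)": count,
            f"sum({field})": total,
            f"min({field})": lo,
            f"max({field})": hi,
            f"avg({field})": total / count,
        }


if __name__ == "__main__":
    tuples = [
        {"str_s": "a", "field_i": 1},
        {"str_s": "a", "field_i": 3},
        {"str_s": "b", "field_i": 10},
    ]
    for row in rollup(tuples, "str_s", "field_i"):
        print(row)
```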

Parallel SQL Shuffle (5 workers, 5 shards, aggregationMode=map_reduce)
[Diagram: a Client, the /SQL handler, Workers 1 to 5, and Shards 1 to 5 with three replicas each.]

Jira Tickets
•  SOLR-7560: Parallel SQL Support
•  SOLR-7377: Solr Streaming Expressions
•  SOLR-7082: Streaming Aggregation for SolrCloud
•  SOLR-7441: Improve overall robustness of the
Streaming stack: Streaming API,
Streaming Expressions, Parallel SQL

Getting Involved
• SQL is in Trunk
• Releasing with Solr 6
• Streaming API and Streaming Expressions
are located in the Solrj libraries
(solrj.io)
• Patches welcome
• Testers and feedback needed
