Using Amazon CloudSearch With Databases - CloudSearch Meetup 061913

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
Searching for Success
Amazon CloudSearch and Relational Databases

Agenda
Finding things
• Types of Databases
Making Choices
What is CloudSearch?
Combining CloudSearch with Relational
Sample Code

Finding Things
So Many Databases

Finding Your Information
Your users need to find things
• What do you use?
A Database!
• What Kind?

It's a Big World Out There!
"Database" != "Relational Database"
Tons of relational databases
• Amazon RDS
• MySQL
• MSSQL
• Oracle
but…

Many Other Types
NoSQL databases
• Dynamo, Cassandra, CouchDB…
Graph databases
• Neo4j, Titan, …
Column oriented databases
• Redshift, Bigtable…
Text Search Engine
• CloudSearch, Lucene, Autonomy...

Text Search Engine
Good at text queries
• "Harry Potter and the Philosopher's Stone"
Harry Potter and the Philosopher's Stone
harry potter and the philosopher's stone
harry potter and the philosopher stone
harry potter philosopher stone
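
A minimal sketch of why the query variants above can all find the same title: the text is lowercased, punctuation and possessives are dropped, and common words are ignored before matching. The class name SimpleAnalyzer, the stopword list, and the possessive rule are assumptions for illustration, not CloudSearch's actual text processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SimpleAnalyzer {
    // Hypothetical stopword list; a real engine lets you configure this.
    private static final Set<String> STOPWORDS = Set.of("a", "an", "and", "of", "the");

    /** Reduces free text to the list of terms that would be matched. */
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter, digit, or apostrophe.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            // Drop a trailing possessive ("philosopher's" becomes "philosopher").
            String term = raw.replaceAll("'s$", "").replace("'", "");
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

All five variants on the slide reduce to the same four terms, which is what lets a text engine treat them as the same query.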

Text Search Engine
Basic element is the document
Documents are made of fields
"title" => "star wars"
Fields can be
• Missing
• Multi-valued
• Variable length

Text Search Engine
Documents are not "normalized"
• In a relational database
• A movie table
• A director table
• An actor table
• In CloudSearch
• One document per movie

Relational
ID  Title
1   Star Wars
2   Star Trek
3   Dark Star
ID  Actor
1   Zachary Quinto
2   Chris Pine
3   Zoë Saldana
ID  Director
1   J.J. Abrams
2   George Lucas
3   John Carpenter
Text Search Engine
ID  Document
1   title: star trek
    actor: chris pine zachary quinto zoe saldana
    director: j j abrams
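
The contrast between the relational tables and the single movie document can be sketched as a small denormalizing step. This is an illustration only: the class MovieDenormalizer and the two join tables (movieActors, movieDirectors) are hypothetical, since the slide does not show how movies are linked to actors and directors. It assumes the movie id exists in the title table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MovieDenormalizer {
    /** Builds one search document for a movie from normalized lookup tables. */
    public static Map<String, List<String>> toDocument(
            int movieId,
            Map<Integer, String> titles,
            Map<Integer, String> actors,
            Map<Integer, String> directors,
            Map<Integer, List<Integer>> movieActors,      // hypothetical join table
            Map<Integer, List<Integer>> movieDirectors) { // hypothetical join table
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title", List.of(titles.get(movieId)));
        // Multi-valued fields: every linked name ends up in the same document.
        doc.put("actor", resolve(movieActors.get(movieId), actors));
        doc.put("director", resolve(movieDirectors.get(movieId), directors));
        // Fields may be missing: drop empty ones instead of storing placeholders.
        doc.values().removeIf(List::isEmpty);
        return doc;
    }

    private static List<String> resolve(List<Integer> ids, Map<Integer, String> names) {
        List<String> out = new ArrayList<>();
        if (ids != null) {
            for (int id : ids) {
                out.add(names.get(id));
            }
        }
        return out;
    }
}
```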

Relevance
Key differentiator for text search
Not "does this match?"
• "how WELL does this match?"
Includes multiple factors
• Term Frequency, Document Frequency, Proximity
Users can customize this
• Distance
• Popularity
• Field Weighting
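
To make term frequency and document frequency concrete, here is a textbook tf-idf sketch. It is not CloudSearch's ranking formula, and it leaves out proximity and the custom factors listed above; the class name TfIdfScorer is an assumption.

```java
import java.util.Collections;
import java.util.List;

public class TfIdfScorer {
    /** Scores each document (a list of terms) against the query terms. */
    public static double[] score(List<List<String>> docs, List<String> query) {
        double[] scores = new double[docs.size()];
        for (String term : query) {
            // Document frequency: how many documents contain the term at all.
            long df = docs.stream().filter(d -> d.contains(term)).count();
            if (df == 0) {
                continue;
            }
            // Rare terms get a larger weight than terms found in every document.
            double idf = Math.log(1.0 + (double) docs.size() / df);
            for (int i = 0; i < docs.size(); i++) {
                // Term frequency: occurrences of the term in this document.
                int tf = Collections.frequency(docs.get(i), term);
                scores[i] += tf * idf;
            }
        }
        return scores;
    }
}
```

With the query "star wars", a document containing both terms outranks one that only contains "star", which is the "how well" part of the answer.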

Text is more than "War & Peace"
It's not just books & blog posts
Meta-data
• Author, Title, Category, Tags
• Can include numbers: counts, dates, latitude,…

Making Choices
Relational? CloudSearch?

Relational Database
Good at
• Exact matches
• Joins
• Atomic Transactions
Not so good at
• Relevance
• How well does this match?
• Handling words

Text Search Engines
Good at finding
• Words, Phrases
• Relevance
Not so good at
• Joins
• Transactions

Options for Search
Can I just use a relational database?
• Yes.
Do I want to just use a relational database?
• Probably not

Simple Approach
Widely supported, easy
SELECT id, title FROM books WHERE title LIKE '%amazon%'
Does not perform well
Doesn't deal with multiple words
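
A rough sketch of what handling multiple words with LIKE tends to look like: one wildcard clause per word. The class LikeQueryBuilder is hypothetical; the table and column names come from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class LikeQueryBuilder {
    /** Returns SQL with one LIKE placeholder per word in the user's query. */
    public static String buildSql(String userQuery) {
        String[] words = userQuery.trim().split("\\s+");
        StringJoiner where = new StringJoiner(" AND ");
        for (int i = 0; i < words.length; i++) {
            where.add("title LIKE ?");
        }
        return "SELECT id, title FROM books WHERE " + where;
    }

    /** Returns the values to bind to the placeholders, in order. */
    public static List<String> buildParams(String userQuery) {
        List<String> params = new ArrayList<>();
        for (String word : userQuery.trim().split("\\s+")) {
            params.add("%" + word + "%");
        }
        return params;
    }
}
```

Every added word is another leading-wildcard pattern that an ordinary index cannot serve, and the result still carries no notion of which row matches best.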

Text Extensions for Relational Databases
Vendor specific
SELECT id, title FROM books WHERE MATCH(title)
AGAINST('Harry Potter' IN NATURAL LANGUAGE MODE)
• Use different index structures
• Typically MUCH less mature than relational code
• More manual processes
• Scaling (if possible)
• Managing
• Minimal relevance, no control

Appropriate Tools
VS

Options
Relational database
• Weak relevance
• Scaling & performance limits
Text Search Engine
• No transactions & locking
• No Joins
Both
• Some extra effort, then best of both worlds

What is Amazon CloudSearch?

CloudSearch
Fully-managed text search engine
High Performance
Automatically Scaling
Reliable, Resilient
Based on Amazon Product Search

Search Features
Faceting
Complex queries
• (and 'potter harry' (not author:'rowling'))
Configurable synonyms, stemming & stopwords
Custom Sorting/Ranking
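
A small sketch of sending the complex query above over HTTP: the expression has to be URL-encoded before it goes into the query string. The parameter name bq is my understanding of the 2011-02-01 search API and should be checked against the documentation; the class SearchUrlBuilder and the endpoint passed in are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlBuilder {
    /**
     * Appends a URL-encoded boolean query to a search endpoint.
     * searchEndpoint is whatever search URL your domain exposes (assumed, not real).
     */
    public static String booleanQueryUrl(String searchEndpoint, String booleanQuery) {
        // Parentheses, quotes, colons, and spaces are not safe in a query string.
        String encoded = URLEncoder.encode(booleanQuery, StandardCharsets.UTF_8);
        return searchEndpoint + "?bq=" + encoded;
    }
}
```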

Scaling
CloudSearch scales automatically
• Handle your spikes
• Plan for success, but don't spend until you need it
• Handle more data
• Scaling is seamless – no downtime

Automatic Scaling
[Diagram: a grid of search instances, each holding one index partition (1, 2, … n) and one copy (1, 2, … n) of it]
DATA: document quantity and size drive the number of index partitions
TRAFFIC: search request volume and complexity drive the number of copies
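
The partition-by-copy grid can be read as simple arithmetic: data size decides how many partitions exist, traffic decides how many copies of each partition exist, and the instance count is the product. The sketch below only illustrates that relationship. The class ScalingEstimate and the capacity parameters are assumptions supplied by the caller, not published CloudSearch figures, and the service makes these decisions itself.

```java
public class ScalingEstimate {
    /** Partitions needed to hold the index, rounding up, never fewer than one. */
    public static int partitions(long indexBytes, long bytesPerPartition) {
        return (int) Math.max(1, (indexBytes + bytesPerPartition - 1) / bytesPerPartition);
    }

    /** Copies of each partition needed to serve the query load. */
    public static int copies(double queriesPerSecond, double qpsPerInstance) {
        return (int) Math.max(1, Math.ceil(queriesPerSecond / qpsPerInstance));
    }

    /** Every copy of every partition lives on its own search instance. */
    public static int instances(int partitions, int copies) {
        return partitions * copies;
    }
}
```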

Easy to Use
REST API
Simple to add
• HTTP POST
Simple to query
• q=star trek
Simple to integrate
• JSON
[Diagram: Documents → HTTP → CloudSearch ← HTTP ← Queries]
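
A sketch of the JSON that gets POSTed: a batch is an array of add and delete operations, each carrying a document id and a version. The field names below (type, id, version, lang, fields) follow my understanding of the 2011-02-01 document batch format and should be verified against the service documentation; the class SdfBatch is hypothetical and builds JSON by hand only to stay dependency-free.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class SdfBatch {
    /** One "add" operation: creates or replaces the document with this id. */
    public static String add(String id, int version, Map<String, String> fields) {
        StringJoiner f = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            f.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return "{\"type\":\"add\",\"id\":" + quote(id) + ",\"version\":" + version
                + ",\"lang\":\"en\",\"fields\":" + f + "}";
    }

    /** One "delete" operation: removes the document with this id. */
    public static String delete(String id, int version) {
        return "{\"type\":\"delete\",\"id\":" + quote(id) + ",\"version\":" + version + "}";
    }

    /** Wraps operations in the JSON array that forms the request body. */
    public static String batch(List<String> operations) {
        return "[" + String.join(",", operations) + "]";
    }

    // Minimal JSON string escaping: quotes, backslashes, control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```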

Amazon CloudSearch Architecture
[Diagram: DNS / load balancing in front of a search domain with three services]
• SEARCH SERVICE: Search API (API, command line tools, console)
• DOCUMENT SERVICE: Doc Svc API (API, command line tools, console)
• CONFIG SERVICE: Config API via AWS Query (API, command line tools, console)

What Can You Search For With CloudSearch?
Wine
Your college buddies
Curly hair products
Downton Abbey episodes
News in Bermuda
Playoff tickets
Online courses
Cat memes
Furniture
Doctor reviews
Take out food
Vacation rentals
Trademarks
African safaris
Kids arts & crafts
French dating/marriage
Online videos
Recipes
Weather insurance
Fashion news
Bollywood music
Stock art
And more!

Combining CloudSearch
+
Relational Database

Combining the Two
Best of both worlds
• Relational queries run on relational database
• Text queries run on CloudSearch
Downside: Complexity
• More moving parts
• Synchronization

Synchronization
Which one is the master?
• Usually the relational database
Updates
• All at once
• At regular intervals
• When data is available
Deletes
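
One way to picture an interval-based sync with the relational database as master: rows modified since the last run become adds, and ids that are indexed but no longer in the database become deletes. This is a sketch under those assumptions. The class SyncPlanner is hypothetical, and it presumes each row carries a last-modified timestamp.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SyncPlanner {
    /** Ids to add or update: rows modified after the last sync. */
    public static Set<String> toUpsert(Map<String, Long> rowModifiedAt, long lastSyncEpochSeconds) {
        Set<String> ids = new TreeSet<>();
        for (Map.Entry<String, Long> row : rowModifiedAt.entrySet()) {
            if (row.getValue() > lastSyncEpochSeconds) {
                ids.add(row.getKey());
            }
        }
        return ids;
    }

    /** Ids to delete: still in the search index but gone from the database. */
    public static Set<String> toDelete(Set<String> databaseIds, Set<String> indexedIds) {
        Set<String> ids = new TreeSet<>(indexedIds);
        ids.removeAll(databaseIds);
        return ids;
    }
}
```

Deletes are the part a timestamp query cannot see, since a deleted row has no timestamp left to compare, which is why they get their own step.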

Dataflow
One source
Simultaneous updates
[Diagram: Source → Loader → RDBMS and CloudSearch]

Dataflow
One source
Two loaders
[Diagram: Source → Loader → RDBMS; Source → Loader → CloudSearch]

Dataflow
One source
Log updates
Two loaders
[Diagram: Source, Log, Loader]
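
The log-based flow can be sketched as an append-only list of updates that each loader reads at its own pace. The class UpdateLog and the loader names are hypothetical, and a real log would be durable rather than in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class UpdateLog {
    private final List<String> entries = new ArrayList<>();
    // Each loader remembers how far into the log it has read.
    private final Map<String, Integer> offsets = new HashMap<>();

    /** The source only ever appends; it does not know who reads the log. */
    public void append(String update) {
        entries.add(update);
    }

    /** Delivers every entry the named loader has not yet seen, then advances its offset. */
    public int replay(String loaderName, Consumer<String> loader) {
        int start = offsets.getOrDefault(loaderName, 0);
        for (int i = start; i < entries.size(); i++) {
            loader.accept(entries.get(i));
        }
        offsets.put(loaderName, entries.size());
        return entries.size() - start;
    }
}
```

Because offsets are tracked per loader, the database loader and the CloudSearch loader can fall behind or catch up independently without losing updates.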

Dataflow
[Diagram: multiple Sources, Log, Loader]

Sample Code

Java Example
Read from MySQL
• JDBC – Nothing special
Post to CloudSearch
• Apache HTTP Client

Libraries
Apache
• HTTP Client
• HTTP Core
• Commons Logging
AWS Java SDK
MySQL connector

Source Files
CloudSearchRDS
• Just does the setup for the demo
ExtractAndUpload
• Does the main work
Batcher
• Groups documents into batches
PosterHttp
• Posts to CloudSearch
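
A sketch of what a batcher might do: collect document JSON until adding one more would exceed a size limit, then start a new batch. This is not the demo's Batcher class. SizeLimitedBatcher and its methods are hypothetical, and the limit is passed in by the caller, so check the service documentation for the actual maximum batch size.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SizeLimitedBatcher {
    private final int maxBatchBytes;
    private final List<List<String>> batches = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private int currentBytes = 2; // the enclosing "[" and "]"

    public SizeLimitedBatcher(int maxBatchBytes) {
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Adds one document, closing the current batch first if it would overflow. */
    public void addDocument(String documentJson) {
        // +1 accounts for the comma that separates documents in the array.
        int size = documentJson.getBytes(StandardCharsets.UTF_8).length + 1;
        if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
            flush();
        }
        current.add(documentJson);
        currentBytes += size;
    }

    /** Closes the current batch; call once more after the last document. */
    public void flush() {
        if (current.isEmpty()) {
            return;
        }
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 2;
    }

    public List<List<String>> batches() {
        return batches;
    }
}
```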

Main Loop
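// lastModified is assumed to be a java.util.Date set elsewhere in ExtractAndUpload
List<String> names = new ArrayList<String>();
ResultSet rs = stmt.executeQuery("select * from movies");
ResultSetMetaData meta = rs.getMetaData();
// JDBC column indexes are 1-based
for (int col = 1; col <= meta.getColumnCount(); col++)
    names.add(meta.getColumnName(col));
while (rs.next()) {
    // Document version: last-modified time in seconds since the epoch
    int version = (int) (lastModified.getTime() / 1000);
    JSONObject doc = new JSONObject();
    // Copy every column into a document field of the same name
    for (String name : names) {
        doc.put(name, rs.getString(name));
    }
    String id = rs.getString("id");
    if (batcher != null) {
        batcher.addDocument(doc, version, id);
    }
}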

SQL
select * from movies;
select `key` as id, title as name from movies
Denormalizing may require multiple queries

Demo

Search: It's not just for Relational Data
You can pull data from
• S3
• Redshift
• Web
• Internal Documents
• And more…
And make it searchable

Indexing S3
ListObjectsRequest listObjectsRequest =
    new ListObjectsRequest().withBucketName(bucketName);
ObjectListing objectListing;
do {
    objectListing = s3client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        processObject(objectSummary);
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
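
The listing loop leaves processObject undefined. One thing it would likely need is a document id derived from the S3 key. The sketch below assumes ids are limited to lowercase letters, digits, and underscores, start with a letter or digit, and are at most 128 characters long. That is my recollection of the rule for the API of the time, so verify it before relying on it. The class DocumentIds and the "s3_" prefix are hypothetical.

```java
import java.util.Locale;

public class DocumentIds {
    private static final int MAX_LENGTH = 128; // assumed limit, see note above

    /** Maps an S3 object key to a string that fits the assumed id rules. */
    public static String fromS3Key(String key) {
        // Lowercase, then replace every disallowed character with an underscore.
        String cleaned = key.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9_]", "_");
        // The prefix guarantees the id starts with a letter.
        String id = "s3_" + cleaned;
        return id.length() > MAX_LENGTH ? id.substring(0, MAX_LENGTH) : id;
    }
}
```

Different keys can collapse to the same id (for example "a/b" and "a_b"), so a real loader would probably append a hash of the original key.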

Summary
Use the right tool!
• Text Search for Searching Text
CloudSearch is fully managed text search
Easy to get data from relational DB
Easy to load data into CloudSearch

Next Step: Free Trial
One month (750 hours) free.
Set up an account
Give it a try!
Questions?
• TomHill@amazon.com
