Chris Bradford & Matt Overstreet review several Cassandra use cases we’ve encountered in state and federal government. C* solves many big data problems around storing, enriching, and improving access to data.
1. Use Cases For Cassandra in
Federal and State
Government
Chris Bradford and Matt Overstreet
2. Matt Overstreet
● Software Architect
● Search relevancy engineer
● Has worked on systems ranging
from Tractor Trailer weigh stations
to celebrity websites
● Likes Cassandra
GitHub: omnifroodle
3. ● DataStax Cassandra Architect
● Contributor to CQLEngine -
Python C* ORM
● Developed Trireme -
a C* migration engine
● Created the world’s smallest C*
cluster
Chris Bradford
Twitter: @bradfordcp
GitHub: bradfordcp
4. Who we are
● Consulting firm based in Charlottesville
Virginia
● Founded in 2005
● 30 consultants delivering projects
● Focused on Search in 2010, specifically Solr
and Lucene
● Delivering Cassandra Consulting since 2012
● Datastax Gold partner
● Great with Search, Analytics and Discovery
5. Blog & Publications
● Blog: http://o19s.com/blog/
● Twitter: @o19s
● Books
o Relevant Search
(Manning)
o Building a Search
Server with
Elasticsearch (Packt)
o Apache Solr
Enterprise Search
Server (Packt)
6. How we got here
OpenSource Connections started with a deep
expertise in full text search.
As the size and velocity of the data we interact
with grew, so did our toolset for storing,
presenting and processing that data.
8. Some Use Cases
- Analytics Workloads
- Welfare Fraud Detection
- Intrusion Detection
- Distributed Data Warehousing
- Data Warehouse/Sink
- Replication & Recovery
9. Analytics Workloads
Look for patterns of user error, fraud and abuse
in forms submitted to an agency.
Requires the ability to compare submissions to
look for similar identifiers such as name, street
address, etc.
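As a rough illustration of that comparison step, here is a minimal sketch in plain Python using fuzzy string matching. The field names and the similarity threshold are illustrative assumptions, not part of any specific agency system.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] of how similar two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_duplicate(sub_a, sub_b, threshold=0.8):
    """Flag two form submissions whose name and street address are
    nearly identical (typos, abbreviations, deliberate variations)."""
    scores = [
        similarity(sub_a["name"], sub_b["name"]),
        similarity(sub_a["street"], sub_b["street"]),
    ]
    return sum(scores) / len(scores) >= threshold

a = {"name": "John Q. Smith", "street": "12 Main Street"}
b = {"name": "Jon Q Smith",   "street": "12 Main St."}
print(looks_like_duplicate(a, b))
```

At scale you would not compare every pair of submissions; candidate records sharing a blocking key (e.g. ZIP code) would be pulled from C* first, then scored like this.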
10. Welfare Fraud Detection
● Massive amounts of data
● Hard to compare and find patterns
● Difficult to incorporate human analysis
11. Welfare Fraud Detection
● Ingest data into the system or work on data
in place
● Fraud Score Generation
o Automated rules
o Manually
● Employees can now focus on reviewing the
flagged records
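The fraud-score generation step might look something like this sketch (pure Python; the rule set, field names, and weights are invented for illustration — in practice manually flagged records feed back in to tune them):

```python
# Each rule inspects a submitted record and returns a weight if it trips.
# (description, predicate, weight) — all values here are illustrative.
RULES = [
    ("income above program ceiling", lambda r: r["reported_income"] > 40000, 3),
    ("address matches a flagged case", lambda r: r["address"] in {"12 Main St."}, 5),
    ("filed outside business hours", lambda r: r["hour_filed"] < 6, 1),
]

def fraud_score(record):
    """Sum the weights of every rule the record trips; high scores
    are routed to an employee for manual review."""
    return sum(weight for _, predicate, weight in RULES if predicate(record))

record = {"reported_income": 52000, "address": "12 Main St.", "hour_filed": 3}
print(fraud_score(record))  # trips all three rules
```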
13. Intrusion Detection
● Stream log data in to C* from applications
● Surface metrics through a security
dashboard
● Perform analysis on records looking for
anomalies (Optional)

CREATE TABLE ids (
  window TIMESTamp,
  route VARCHAR,
  status_code VARCHAR,
  request_id TIMEUUID,
  PRIMARY KEY ((window, route, status_code), request_id)
);
15. Distributed Data Warehouse
● Cassandra is designed in a peer
to peer architecture. There are no
“masters” or “slaves”.
● True distributed load, write anywhere, read
anywhere.
● Built-in replication between data centers.
18. Data Warehouse
● Cassandra is used to house case data from
disparate systems
● Data is then pushed into a full text search
index
● Cases may now be searched through an
intuitive web interface
20. Operations
● Widely compatible with programming
languages used in enterprise development
● OpsCenter monitoring tool
● Cassandra scales predictably
● Fault-tolerant
21. Use Case Review
● Analytics Workloads
○ Welfare Fraud Detection
○ Intrusion Detection
● Distributed Data Warehousing
○ Data Warehouse/Sink
○ Replication & Recovery
Matt -
We are based in Charlottesville, Virginia. (and big fans of the Amtrak line to DC)
We’ve always been interested in search, (one of our founders wrote the book on it - see next slide). In 2010 we really made search our focus and have been adding related technologies to really help deliver on full text search.
In 2012 we also started delivering Cassandra consulting, and we are currently a Datastax Gold Partner.
Relevant search will be out soon, great book about the art of tuning search results.
Building a Search Server with Elasticsearch -> is a great video introduction to both the Angular JavaScript framework and Elasticsearch.
Apache Solr Enterprise Search Server is the definitive guide for planning, building and maintaining Apache Solr.
OpenSource connections started with a deep expertise in full text search.
As the size and velocity of the data we interact with grew, so did our toolset for storing and processing that data.
The size of the documents we needed to search over grew, as did the demands for better pre-processing of those documents.
As we were storing and searching ever-growing millions of documents we needed a better place to store and process them. Apache Cassandra has been a great tool for that purpose, particularly with Datastax Enterprise. DSE brings along Apache Spark and Apache Solr, both of which we’ll talk about a bit here.
Here is an idea of the breadth of knowledge we have in the “Search, Analytics and Discovery” stack.
This includes multiple search systems (Elasticsearch, Solr), Big Data stores (Cassandra, Spark), and frontend systems (Angular, Ember)
We’ll cover a few cases where Cassandra has been a great solution.
Loosely we can break the examples down into two categories: analytics workloads, like
fraud detection and
intrusion detection,
and distributed data warehousing.
Why is Cassandra a good choice for analytics workloads?
Great for time series data, which is often the core of analytics data.
Cassandra is incredibly fast at writing data, which is often an issue with analytics data.
Cassandra has no single point of failure, which means analytics data isn’t dropped.
It scales linearly.
Also,
Datastax has created an Apache Spark connector.
Apache Spark is a data processing engine. It is capable of running on a cluster of machines, and smartly scheduling work across them. It also supports processing “streaming” data, which is great when dealing with analytics data.
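To give a feel for the windowed, streaming style of aggregation a Spark job would run, here is a toy stand-in in plain Python (no Spark involved; the event stream and window size are made up):

```python
from collections import Counter

def count_by_window(events, window_seconds=60):
    """Group (timestamp, route) events into fixed time windows and
    count requests per (window, route) — the same shape of
    aggregation a Spark Streaming job would perform on a live feed."""
    counts = Counter()
    for ts, route in events:
        window = ts - (ts % window_seconds)  # floor to window start
        counts[(window, route)] += 1
    return counts

events = [(0, "/login"), (5, "/login"), (61, "/login"), (62, "/home")]
print(count_by_window(events))
```

In the real pipeline these (window, route) counts would be written back into a Cassandra table keyed the same way.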
Data may be ingested in batches or streamed in as data is acquired
Automated rules may be run during ingestion or periodic batch jobs
Manually flagged entries may be used to tune and generate automated rules
Look for patterns in new data as well as in existing data
Velocity and data locality are the big stories here
Spark performs some automated rule checks in both streaming and batch configurations
Streaming - good for small window based checks
Batch - ideal for larger jobs against the bigger dataset
Machine learning may be used to develop new classifications and groups of records
Why Cassandra for Intrusion Detection:
Blazing fast write speed.
No single point of failure.
How it works:
data is streamed into Cassandra as wide rows keyed by timeslice/route/status_code
data can then be monitored by timeslice to look for spikes
Warning: make sure someone attends the data modeling talk before trying this at home — you’ll need to understand how Cassandra stores and accesses data to get the most out of this approach
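The “monitor by timeslice, look for spikes” step could be sketched like this (plain Python; the z-score threshold and sample counts are arbitrary choices for illustration):

```python
from statistics import mean, stdev

def find_spikes(window_counts, threshold=3.0):
    """Flag time windows whose request count sits more than
    `threshold` standard deviations above the mean — a crude but
    serviceable anomaly signal over per-timeslice counts."""
    counts = list(window_counts.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [w for w, c in window_counts.items() if (c - mu) / sigma > threshold]

counts = {"09:00": 110, "09:01": 95, "09:02": 102, "09:03": 98, "09:04": 990}
print(find_spikes(counts, threshold=1.5))
```

A real intrusion-detection dashboard would compare against a longer baseline per route and status code, but the shape of the check is the same.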
Why Cassandra for this:
Data replication, both locally within a data center and between data centers
“Tunable” consistency
Cassandra is highly available as soon as you have two nodes. Data is automatically copied between nodes. Other solutions require special configuration for multi-master configurations or are only available as a commercial product. Cassandra gives you true multi-master out of the box.
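Cassandra’s tunable consistency comes down to a simple inequality: with replication factor N, a read is guaranteed to overlap the latest acknowledged write when the write and read replica counts satisfy W + R > N. A quick sketch of the check:

```python
def strongly_consistent(n, w, r):
    """With replication factor n, writes acknowledged by w replicas
    and reads from r replicas, the read and write replica sets must
    overlap — and reads see the latest write — exactly when w + r > n."""
    return w + r > n

# RF=3 with QUORUM (2 replicas) for both reads and writes: 2 + 2 > 3
print(strongly_consistent(3, 2, 2))
# RF=3 with ONE for both: 1 + 1 <= 3, so reads may return stale data
print(strongly_consistent(3, 1, 1))
```

This is what lets you pick, per query, where to sit on the latency/consistency trade-off.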
Netflix Example:
They set up a Cassandra Cluster with nodes in Oregon and Northern Virginia.
Load was simulated to a production level.
To test the speed of replication they wrote 1 million records in one region. 500ms later they read all records from the data center in VA.
Within the scope of a datacenter application developers interact with the cluster as though it’s a local data store.
Should the local cluster go down the driver automatically routes requests to another datacenter if available.
225 years of data spanning tens of millions of documents
Each document has over 250 fields
Note that columns without data do not consume storage space
Compare this to dealing with distributed master-slave replication in MS SQL Server or other relational databases
Source documents come from various systems, each holding information about part of the claim. In this case there were 10 different types of source documents, including metadata about the cases.