Chris Bradford & Matt Overstreet review several Cassandra use cases we’ve encountered in state and federal government. C* solves many big data problems around storing, enriching, and improving access to data.
1. Use Cases For Cassandra in
Federal and State
Government
Chris Bradford and Matt Overstreet
2. Matt Overstreet
● Software Architect
● Search relevancy engineer
● Has worked on systems ranging
from Tractor Trailer weigh stations
to celebrity websites
● Likes Cassandra
GitHub: omnifroodle
3. ● DataStax Cassandra Architect
● Contributor to CQLEngine -
Python C* ORM
● Developed Trireme -
a C* migration engine
● Created the world’s smallest C*
cluster
Chris Bradford
Twitter: @bradfordcp
GitHub: bradfordcp
4. Who we are
● Consulting firm based in Charlottesville
Virginia
● Founded in 2005
● 30 consultants delivering projects
● Focused on Search in 2010, specifically Solr
and Lucene
● Delivering Cassandra Consulting since 2012
● Datastax Gold partner
● Great with Search, Analytics and Discovery
5. Blog & Publications
● Blog: http://o19s.com/blog/
● Twitter: @o19s
● Books
o Relevant Search
(Manning)
o Building a Search
Server with
Elasticsearch (Packt)
o Apache Solr
Enterprise Search
Server (Packt)
6. How we got here
OpenSource Connections started with a deep
expertise in full text search.
As the size and velocity of the data we interact
with grew, so did our toolset for storing,
presenting and processing that data.
8. Some Use Cases
- Analytics Workloads
- Welfare Fraud Detection
- Intrusion Detection
- Distributed Data Warehousing
- Data Warehouse/Sink
- Replication & Recovery
9. Analytics Workloads
Look for patterns of user error, fraud and abuse
in forms submitted to an agency.
Requires the ability to compare submissions to
look for similar identifiers such as name, street
address, etc.
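As a rough illustration of that comparison step, here is a minimal sketch in plain Python using fuzzy string matching. The field names and the similarity threshold are illustrative assumptions, not part of any specific agency system.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] of how similar two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def looks_like_duplicate(sub_a, sub_b, threshold=0.8):
    """Flag two form submissions whose name and street address are
    nearly identical (typos, abbreviations, deliberate variations)."""
    scores = [
        similarity(sub_a["name"], sub_b["name"]),
        similarity(sub_a["street"], sub_b["street"]),
    ]
    return sum(scores) / len(scores) >= threshold

a = {"name": "John Q. Smith", "street": "12 Main Street"}
b = {"name": "Jon Q Smith",   "street": "12 Main St."}
print(looks_like_duplicate(a, b))
```

At scale you would not compare every pair of submissions; candidate records sharing a blocking key (e.g. ZIP code) would be pulled from C* first, then scored like this.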
10. Welfare Fraud Detection
● Massive amounts of data
● Hard to compare and find patterns
● Difficult to incorporate human analysis
11. Welfare Fraud Detection
● Ingest data into the system or work on data
in place
● Fraud Score Generation
o Automated rules
o Manually
● Employees can now focus on reviewing the
flagged records
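The fraud-score generation step might look something like this sketch (pure Python; the rule set, field names, and weights are invented for illustration — in practice manually flagged records feed back in to tune them):

```python
# Each rule inspects a submitted record and returns a weight if it trips.
# (description, predicate, weight) — all values here are illustrative.
RULES = [
    ("income above program ceiling", lambda r: r["reported_income"] > 40000, 3),
    ("address matches a flagged case", lambda r: r["address"] in {"12 Main St."}, 5),
    ("filed outside business hours", lambda r: r["hour_filed"] < 6, 1),
]

def fraud_score(record):
    """Sum the weights of every rule the record trips; high scores
    are routed to an employee for manual review."""
    return sum(weight for _, predicate, weight in RULES if predicate(record))

record = {"reported_income": 52000, "address": "12 Main St.", "hour_filed": 3}
print(fraud_score(record))  # trips all three rules
```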
13. Intrusion Detection
● Stream log data in to C* from applications
● Surface metrics through a security
dashboard
● Perform analysis on records looking for
anomalies (Optional)

CREATE TABLE ids (
  window TIMESTamp,
  route VARCHAR,
  status_code VARCHAR,
  request_id TIMEUUID,
  PRIMARY KEY ((window, route, status_code), request_id)
);
15. Distributed Data Warehouse
● Cassandra is designed in a peer
to peer architecture. There are no
“masters” or “slaves”.
● True distributed load, write anywhere, read
anywhere.
● Built-in replication between data centers.
18. Data Warehouse
● Cassandra is used to house case data from
disparate systems
● Data is then pushed into a full text search
index
● Cases may now be searched through an
intuitive web interface
20. Operations
● Widely compatible with programming
languages used in enterprise development
● OpsCenter monitoring tool
● Cassandra scales predictably
● Fault-tolerant
21. Use Case Review
● Analytics Workloads
○ Welfare Fraud Detection
○ Intrusion Detection
● Distributed Data Warehousing
○ Data Warehouse/Sink
○ Replication & Recovery
Matt -
We are based in Charlottesville, Virginia. (and big fans of the Amtrak line to DC)
We’ve always been interested in search, (one of our founders wrote the book on it - see next slide). In 2010 we really made search our focus and have been adding related technologies to really help deliver on full text search.
In 2012 we also started delivering Cassandra consulting, and we are currently a Datastax Gold Partner.
Relevant search will be out soon, great book about the art of tuning search results.
Building a Search Server with Elasticsearch -> is a great video introduction to both the Angular JavaScript framework and Elasticsearch.
Apache Solr Enterprise Search Server is the definitive guide for planning, building and maintaining Apache Solr.
OpenSource connections started with a deep expertise in full text search.
As the size and velocity of the data we interact with grew, so did our toolset for storing and processing that data.
The size of the documents we needed to search over grew, as did the demands for better pre-processing of those documents.
As we were storing and searching ever-growing millions of documents we needed a better place to store and process them. Apache Cassandra has been a great tool for that purpose, particularly with Datastax Enterprise. DSE brings along Apache Spark and Apache Solr, both of which we’ll talk about a bit here.
Here is an idea of the breadth of knowledge we have in the “Search, Analytics and Discovery” stack.
This includes multiple search systems (Elasticsearch, Solr), Big Data stores (Cassandra, Spark), and frontend systems (Angular, Ember)
We’ll cover a few cases where Cassandra has been a great solution.
Loosely we can break the examples down into two categories: analytics workloads, like
fraud detection and
intrusion detection,
and distributed data warehousing.
Why is Cassandra a good choice for analytics workloads?
Great for time series data, which is often the core of analytics data.
Cassandra is incredibly fast at writing data, which is often an issue with analytics data.
Cassandra has no single point of failure, which means analytics data isn’t dropped.
It scales linearly.
Also,
Datastax has created an Apache Spark connector.
Apache Spark is a data processing engine. It is capable of running on a cluster of machines, and smartly scheduling work across them. It also supports processing “streaming” data, which is great when dealing with analytics data.
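To give a feel for the windowed, streaming style of aggregation a Spark job would run, here is a toy stand-in in plain Python (no Spark involved; the event stream and window size are made up):

```python
from collections import Counter

def count_by_window(events, window_seconds=60):
    """Group (timestamp, route) events into fixed time windows and
    count requests per (window, route) — the same shape of
    aggregation a Spark Streaming job would perform on a live feed."""
    counts = Counter()
    for ts, route in events:
        window = ts - (ts % window_seconds)  # floor to window start
        counts[(window, route)] += 1
    return counts

events = [(0, "/login"), (5, "/login"), (61, "/login"), (62, "/home")]
print(count_by_window(events))
```

In the real pipeline these (window, route) counts would be written back into a Cassandra table keyed the same way.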
Data may be ingested in batches or streamed in as data is acquired
Automated rules may be run during ingestion or periodic batch jobs
Manually flagged entries may be used to tune and generate automated rules
Look for patterns in new data as well as in existing data
Velocity and data locality are the big stories here
Spark performs some automated rule checks in both streaming and batch configurations
Streaming - good for small window based checks
Batch - ideal for larger jobs against the bigger dataset
Machine learning may be used to develop new classifications and groups of records
Why Cassandra for Intrusion Detection:
Blazing fast write speed.
No single point of failure.
How it works:
data is streamed into Cassandra as wide rows keyed by timeslice/route/status_code
data can then be monitored by timeslice to look for spikes
Warning: make sure someone attends the data modeling talk before trying this at home — you’ll need to understand how Cassandra stores and accesses data to get the most out of this approach
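The “monitor by timeslice, look for spikes” step could be sketched like this (plain Python; the z-score threshold and sample counts are arbitrary choices for illustration):

```python
from statistics import mean, stdev

def find_spikes(window_counts, threshold=3.0):
    """Flag time windows whose request count sits more than
    `threshold` standard deviations above the mean — a crude but
    serviceable anomaly signal over per-timeslice counts."""
    counts = list(window_counts.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [w for w, c in window_counts.items() if (c - mu) / sigma > threshold]

counts = {"09:00": 110, "09:01": 95, "09:02": 102, "09:03": 98, "09:04": 990}
print(find_spikes(counts, threshold=1.5))
```

A real intrusion-detection dashboard would compare against a longer baseline per route and status code, but the shape of the check is the same.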
Why Cassandra for this:
Data replication, both locally within a data center and between data centers
“Tunable” consistency
Cassandra is highly available as soon as you have two nodes. Data is automatically copied between nodes. Other solutions require special configuration for multi-master configurations or are only available as a commercial product. Cassandra gives you true multi-master out of the box.
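Cassandra’s tunable consistency comes down to a simple inequality: with replication factor N, a read is guaranteed to overlap the latest acknowledged write when the write and read replica counts satisfy W + R > N. A quick sketch of the check:

```python
def strongly_consistent(n, w, r):
    """With replication factor n, writes acknowledged by w replicas
    and reads from r replicas, the read and write replica sets must
    overlap — and reads see the latest write — exactly when w + r > n."""
    return w + r > n

# RF=3 with QUORUM (2 replicas) for both reads and writes: 2 + 2 > 3
print(strongly_consistent(3, 2, 2))
# RF=3 with ONE for both: 1 + 1 <= 3, so reads may return stale data
print(strongly_consistent(3, 1, 1))
```

This is what lets you pick, per query, where to sit on the latency/consistency trade-off.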
Netflix Example:
They set up a Cassandra Cluster with nodes in Oregon and Northern Virginia.
Load was simulated to a production level.
To test the speed of replication they wrote 1 million records in one region. 500ms later they read all records from the data center in VA.
Within the scope of a datacenter application developers interact with the cluster as though it’s a local data store.
Should the local cluster go down the driver automatically routes requests to another datacenter if available.
225 years of data spanning tens of millions of documents
Each document has over 250 fields
Note that columns without data do not consume storage space
Compare this to dealing with distributed master-slave replication in MS SQL Server or other relational databases
Source documents come from various systems, each holding information about part of the claim. In this case there were 10 different types of source documents, including metadata about the cases.