Apache Cassandra is the leading distributed database, in use at thousands of sites with the world’s most demanding scalability and availability requirements. Apache Spark is a distributed data analytics framework that has gained significant traction for processing large amounts of data efficiently and in a user-friendly manner. Together they provide a powerful combination of real-time data collection and analytics. After a brief overview of Cassandra and Spark, this class dives into various aspects of the integration.
3. 1. KPI is a Silver Level DataStax Partner
2. KPI is a top-tier sponsor of Cassandra Summit
• September 22-24, 2015, Santa Clara, CA
3. KPI and its consultants have implemented DataStax at multiple retail and financial services customers
5. 1. Use Case Requirements for Data Model
2. Security and Encryption Requirements
3. Service Level Agreements
4. Operational Requirements (Monitor and Manage)
5. Search Requirements (DataStax Search)
6. Analytics Requirements (DataStax Analytics)
6. 1. Key to success: “get the data model right”
2. Leverage what is in place:
1. Query logs
2. Define specific Create, Read, Update, and Delete “CRUD” requirements
3. DataStax Security
1. Authentication requirements (e.g. Kerberos, password, SSL, LDAP)
2. Authorization requirements (e.g. access to a schema, table, or other database components)
4. Encryption
1. Client application to DataStax, i.e. client-to-cluster (a hedged driver-side sketch follows this slide)
2. Node-to-node (internode, within the cluster)
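To make the authentication and client-side encryption items concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver), assuming password authentication and client-to-node TLS; the contact point, credentials, and CA file path are hypothetical. Node-to-node encryption is configured on the servers, not in the driver.

```python
# Hedged sketch: password auth + client-to-node TLS with cassandra-driver.
# The contact point, credentials, and CA path below are hypothetical.
import ssl

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Password authentication; Kerberos or LDAP would use a different provider.
auth = PlainTextAuthProvider(username="app_user", password="app_secret")

# TLS for the client-to-cluster hop; node-to-node (internode) encryption
# is configured server-side in cassandra.yaml, not here.
ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS)
ssl_ctx.load_verify_locations("/path/to/cluster_ca.pem")
ssl_ctx.verify_mode = ssl.CERT_REQUIRED

cluster = Cluster(["10.0.0.1"], auth_provider=auth, ssl_context=ssl_ctx)
session = cluster.connect()
```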
7. 5. SLAs
1. A highly recommended “must have”
2. Lack of SLAs leads to project failure.
6. Understand that you are building a mission-critical system
1. Make sure to define operational monitoring and management of the system
7. DataStax Search
1. Define Search Requirements
2. Determine the fields that will be searched on and returned (e.g. multiple search fields vs. a single search field, faceted results vs. ranked-list results, etc.)
8. 7. DataStax Analytics
1. Analytics requirements should be captured at this time.
8. Analytics requirements should incorporate:
1. statistical algorithms,
2. required data sources,
3. data movement/modifications,
4. security/access,
5. other analytical requirements, all captured at a clear enough level to enable a thorough design.
9. 1. Data Model Design
2. Data Access Object Design
3. Data Movement Design
4. Operational Design (Management and Monitoring)
5. Search Design
6. Analytics Design
10. 1. Data Model Design should clearly include:
1. Keyspace design (replication strategy, name)
2. Table design (table names, partition keys, clustering columns (if applicable), and physical table properties as necessary, e.g. encryption, bloom filter settings, etc.)
3. Any relationships between tables. Note that server-side joins are not supported within DataStax Enterprise, so the model must be designed around queries instead. However, relationships between tables are still important, especially for the application developers (a minimal data-model sketch follows this slide).
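As an illustration of the keyspace and table design points above, here is a minimal query-first sketch using the driver session from the earlier example; the keyspace, table, columns, and data center name are hypothetical.

```python
# Hypothetical "readings by device per day" model, designed around the
# query "latest readings for a device on a given day".
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3}
""")

# Composite partition key (device_id, day) bounds partition size; the
# clustering column reading_time keeps each partition sorted so the
# target query is a single-partition, already-ordered read.
session.execute("""
    CREATE TABLE IF NOT EXISTS iot.readings_by_device_day (
        device_id    uuid,
        day          date,
        reading_time timestamp,
        value        double,
        PRIMARY KEY ((device_id, day), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")
```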
11. 2. Projects are more successful when they leverage simple Data Access Objects
1. Simple Data Access Objects are the best way to encapsulate and abstract data-manipulation logic.
2. This is opposed to the current trend in application development, where projects leverage frameworks to encapsulate, abstract, and represent database components as application objects, e.g. Hibernate, LINQ, JPA, and other ORMs.
3. Designing the Data Access Object up front, as much as possible, will help the application development team as they build out higher-level functionality (see the DAO sketch after this slide).
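A minimal sketch of such a Data Access Object over the hypothetical table above, using prepared statements from the Python driver; error handling and paging are omitted for brevity.

```python
# Hedged DAO sketch: one class owns all CQL for the readings table, so
# application code never touches statements directly.
class ReadingsDao:
    def __init__(self, session):
        self.session = session
        self._insert = session.prepare(
            "INSERT INTO iot.readings_by_device_day "
            "(device_id, day, reading_time, value) VALUES (?, ?, ?, ?)")
        self._read_day = session.prepare(
            "SELECT reading_time, value FROM iot.readings_by_device_day "
            "WHERE device_id = ? AND day = ?")

    def save(self, device_id, day, reading_time, value):
        self.session.execute(self._insert, (device_id, day, reading_time, value))

    def readings_for_day(self, device_id, day):
        return list(self.session.execute(self._read_day, (device_id, day)))
```

Keeping the DAO this thin avoids the ORM-style impedance mismatch noted above while still giving the application a single, testable seam for data access.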
12. 3. Data Movement Design is essential to your success
1. Batch and real-time data integration between systems
2. ETL, Change Data Capture, data pipelines, etc.
3. Data types, transformation logic, error handling, look-ups, and data normalization should be clearly documented (a hedged batch example follows this slide).
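As one concrete shape this documentation can take, here is a hedged batch-ETL sketch using Spark and the Spark Cassandra Connector; the file path, connector coordinates, host, and table names are hypothetical, and column types are assumed to line up with the target schema.

```python
# Hedged batch-movement sketch (assumes the Spark Cassandra Connector is
# on the classpath, e.g. via --packages
# com.datastax.spark:spark-cassandra-connector_2.12:3.4.1).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("readings-etl")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .getOrCreate())

# Extract: a raw CSV drop from an upstream system (hypothetical path).
raw = spark.read.option("header", "true").csv("/data/incoming/readings.csv")

# Transform: type coercion and simple error handling (bad rows dropped);
# a real pipeline would also log the rejects.
clean = (raw
         .withColumn("value", F.col("value").cast("double"))
         .dropna(subset=["device_id", "day", "reading_time", "value"]))

# Load: append into the Cassandra table from the data model sketch.
(clean.write.format("org.apache.spark.sql.cassandra")
      .options(keyspace="iot", table="readings_by_device_day")
      .mode("append")
      .save())
```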
13. 4. Operational Design
1. Tooling and the techniques used to:
1. deploy new nodes, configure and upgrade nodes in the cluster, run backup and restore operations, monitor the cluster, use OpsCenter, run repairs, alert, and execute disaster management processes
2. KPI recommends using a “playbook” approach to Operational Design (a minimal playbook-entry sketch follows this slide).
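As an example of what one playbook entry might look like, here is a minimal, hedged health-check sketch that parses nodetool status output; it assumes nodetool is on the PATH of the host running the check.

```python
# Hedged playbook-entry sketch: flag any node that is not Up/Normal.
import subprocess

def nodes_needing_attention():
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True, check=True).stdout
    # Node lines begin with a two-letter code: U/D (up/down) plus
    # N/L/J/M (normal/leaving/joining/moving). Anything but "UN" is
    # worth a look.
    bad_codes = ("DN", "DL", "DJ", "DM", "UL", "UJ", "UM")
    return [line for line in out.splitlines() if line[:2] in bad_codes]

if __name__ == "__main__":
    problems = nodes_needing_attention()
    if problems:
        print("ALERT: nodes needing attention:")
        print("\n".join(problems))
```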
14. 5. Search Design
1. Incorporate items such as:
1. searchable terms, returned terms, tokenizers, filters, multi-document search terms, etc. (see the search sketch after this slide)
6. DataStax Analytics Design
1. Determine which Analytics components will be leveraged in the solution.
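To illustrate the search-design items, here is a hedged sketch of querying DSE Search through CQL's solr_query pseudo-column; it assumes a search index has already been created on a hypothetical iot.devices table with name and site fields.

```python
# Hedged DSE Search sketch: Solr query syntax passed through CQL,
# reusing the session from the driver setup shown earlier.
rows = session.execute(
    "SELECT device_id, name FROM iot.devices WHERE solr_query = %s",
    ["name:sensor* AND site:warehouse"])
for row in rows:
    print(row.device_id, row.name)
```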
15. 1. Infrastructure
2. Deployment and Configuration Management
3. Software Components (Data Model and Application)
4. Unit Testing of Components
16. 1. Application Development – use an Agile or Waterfall methodology, as desired by your organization
2. Deployment and Configuration Management Mechanism
1. The key in a distributed system is to automate as much as possible
2. OpsCenter, Docker, Vagrant, Chef, Puppet, etc. should be leveraged.
3. Unit Testing of Components
1. More complex with distributed systems than with single-node systems.
2. Specific defects, such as race conditions, are only observed “at scale”.
3. Unit testing should be executed over a small cluster that contains more than a single node.
4. Tools such as ccm can be used by developers to automate the process of quickly launching test clusters as part of a unit test (a hedged fixture sketch follows this slide).
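A minimal sketch of such a fixture using ccm's Python library (ccmlib); the Cassandra version string is hypothetical, and the first run downloads that version, so expect some startup latency.

```python
# Hedged test-fixture sketch: launch a 3-node local cluster with ccmlib.
import tempfile

from ccmlib.cluster import Cluster

def make_test_cluster():
    path = tempfile.mkdtemp()
    cluster = Cluster(path, "unit_test", cassandra_version="2.1.8")
    cluster.populate(3).start()  # three local nodes, not just one
    return cluster

# Tests would then connect a driver to the local nodes and call
# cluster.stop() in teardown.
```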
18. 1. Critical to enable the project team to identify actual issues prior to going to production “at scale”
2. A minimum two-week period where the application runs at production scale.
3. It may take several iterations of configuration, code changes, and refactoring to enable full execution.
19. 4. Operational Readiness Checklist
1. Replace a downed node and a dead seed node
2. Configure and execute repair (within gc_grace_seconds; a hedged scheduling sketch follows this checklist)
3. Add a node to a cluster
4. Replace a downed Data Center
5. Add a Data Center to the cluster
6. Decommission a node
7. Restore a backup
8. At the cluster level and per node, report on errors, throughput, latency, resource saturation, bottlenecks, compactions, flushes, and health
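For the repair item in particular, here is a hedged sketch of a scripted, per-keyspace primary-range repair that a scheduler (cron, etc.) could run on each node well within gc_grace_seconds; the keyspace list is hypothetical.

```python
# Hedged repair sketch: primary-range repair per keyspace via nodetool.
import subprocess

KEYSPACES = ["iot"]  # run well within gc_grace_seconds (default 10 days)

def repair(keyspace):
    # -pr repairs only this node's primary ranges; running it on every
    # node in turn covers the whole cluster without duplicate work.
    subprocess.run(["nodetool", "repair", "-pr", keyspace], check=True)

for ks in KEYSPACES:
    repair(ks)
```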
20. Highlight the normal, operational mode of an application built on DataStax Enterprise.
Prepare for all eventualities, and address them by adding nodes to expand the capacity of the system when needed.
Scale with DataStax Enterprise.
The attached presentation is intended for technical audiences. It provides good detail on data modeling as well as pre-production testing. The main takeaway is that, if the PoC is well constructed, you can move directly into the pre-production testing phase of this approach, skipping the requirements-through-implementation phases. This highlights the scaling advantage of Apache Cassandra and DataStax Enterprise.