Real-time Application
Architecture
David Mellor
VP & Chief Architect Curriculum Associates
Building a Real-Time
Feedback Loop for Education
David Mellor
VP & Chief Architect Curriculum Associates
(Title adjusted to match the abstract submission)
• Curriculum Associates has a mission to make classrooms better places for
teachers and students.
• Our founding value drives us to continually innovate, producing exciting new
products that give every student and teacher the chance to succeed.
–Students
–Teachers
–Administrators
• To meet some of our latest expectations and provide the best
teacher/student feedback available, we are enabling our educators with
real-time data.
3
Our Mission
•The Architectural Journey
•Understanding Sharding
•Using Multiple Storage Engines
•Adding Kafka Message Queues
•Integrating the Data Lake
4
In This Talk
The Architectural Journey
5
6
The Architecture End State
[Diagram: iReady lesson events → Event System → HTC Dispatch → Confluent Kafka (Brokers, ZooKeeper) → Debezium Connector (DB to Kafka) and S3 Connector (Kafka to S3) → Data Lake (Raw Store, Read Optimized Store) → Nightly Load Files → paired MemSQL Reporting DBs]
7
Our Architectural Journey
• Where did we start and what fundamental problem do we need to solve
to get real-time values to our users?
[Diagram: iReady lesson events → Scheduled Jobs → ETL to Data Warehouse → ETL to Reporting Data Mart]
8
Start with the Largest Aggregate Report
Our largest aggregate report logically consists of:
–6,000,000 leaf values filtered to 250,000
–600,000,000 leaf values filtered to 10,000,000, used as the intermediate dataset
–Rolled up to produce 300 aggregate totals
–Response target: 1 second
6,000,000+ Students
600,000,000+ Facts
A District Report:
10,000,000 facts rolled up into 300 schools
[Diagram: student table (SID, DESC, ATTR1) joined to fact table (SID, FACT1, FACT2)]
• SQL Compatible – SQL is our developers' basic paradigm
• Fast Calculations – we need to compute large calculations
across large datasets
• Fast Updates – we need to do real-time updates
• Fast Loads – we need to re-load our reporting database
nightly
• Massively Scalable – we need to support large data
volumes and large numbers of concurrent users.
• Cost Effective – we need a practical solution based on cost
9
In-Memory Databases MemSQL
• Columnar and Row storage models provide
very fast aggregations across large
amounts of data
• Very fast load times allow us to update our
reporting db nightly
• Very fast update times for Row storage tables
• Highly scalable thanks to its MPP-based
architecture
• Unique ability to query across Columnar and
Row tables in a single query
• Convert our existing database design to be optimal in MemSQL
• Analyze our usage patterns to determine the best Sharding key
• Create our prototype and run typical queries to determine the optimal
database structure across the spectrum of projected usage
–Use the same Sharding key in all tables
–Push down predicates to as many tables as we can
10
Our MemSQL Journey Begins
Understanding Sharding
11
12
Why is the selection of a Sharding key so
important?
[Diagram: student (SID, DESC, ATTR1) and fact (SID, FACT1, FACT2) tables distributed across NODE1, NODE2, NODE3]
Create the database with 9 partitions
Create the tables in the database using
a sharding key which is advantageous
to query execution
The goal is to spread the execution of a
given query as evenly as possible
over the partitions
[Diagram: partitions PS1–PS9 spread evenly across the three nodes]
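The setup described above might be declared like this; a hypothetical sketch where the database name, table names, and column types are illustrative, not taken from the actual system:

```sql
-- Hypothetical sketch: a 9-partition database with both tables
-- sharded on sid, so joins on sid stay partition-local.
CREATE DATABASE reporting PARTITIONS 9;

USE reporting;

CREATE TABLE student (
  sid   VARCHAR(36) NOT NULL,
  descr VARCHAR(255),
  attr1 INT,
  SHARD KEY (sid)
);

CREATE TABLE fact (
  sid   VARCHAR(36) NOT NULL,
  fact1 INT,
  fact2 INT,
  SHARD KEY (sid)
);
```

Because both tables declare the same shard key, rows with equal `sid` values hash to the same partition, which is what makes the co-located joins on the following slides possible.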
13
How does the Sharding Key affect the “join”?
[Diagram: co-located join executing independently within partitions PS1 and PS2 on Node 1]
SELECT a.sid, b.factid FROM table1 a, table2 b
WHERE a.sid IN {10 ….. } AND b.sid IN {10 ….. }
AND a.sid = b.sid
The basis of the join is the sid column.
When the sharding key is chosen based on the sid
columns for both tables, the join can be done
independently within each partition and the results
merged.
This is an ideal situation for getting the nodes performing
in parallel, which can maximize query performance.
14
How does the Sharding Key affect the “join”?
[Diagram: broadcast join moving rows between partitions on Node 1]
SELECT a.sid, b.factid FROM table1 a, table2 b
WHERE a.sid IN {10 ….. } AND b.sid IN {10 ….. }
AND a.sid = b.sid
When the sharding key is not based on the sid
columns for both tables, the join cannot be done
independently within each partition and will cause
what is called a broadcast.
This is not the ideal situation for getting the nodes
performing in parallel, and we have seen query
performance degradation in these cases.
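One way to catch a broadcast before it hits production is to inspect the query plan; a hedged sketch, since the exact operator names in the output vary by MemSQL/SingleStore version:

```sql
-- Hypothetical check: inspect the distributed plan for the join.
-- A co-located plan executes entirely within each partition; a shard-key
-- mismatch typically shows up as a Broadcast or Repartition step.
EXPLAIN
SELECT a.sid, b.factid
FROM table1 a, table2 b
WHERE a.sid = b.sid;
```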
Using Multiple Storage Engines
15
• Row storage was the most performant for queries, load and updates –
this is also the most expensive solution
• Columnar storage was performant for some queries and load but
degraded with updates – cost effective but not performant enough on too
many of the target queries
• To maximize our use of MemSQL we have combined Row storage and
Columnar storage to create a logical table
–Volatile (changeable) data is kept in Row storage
–Non-Volatile (immutable) data is kept in Columnar storage
–Requests for data are made using “Union All” queries
16
Columnar and Row Storage Models
17
Columnar and Row
[Diagram: a logical table composed of a Row Storage Portion (volatile rows, marked “?”) and a Columnar Storage Portion (immutable rows, marked “n”), each with columns SID, FACT1, FACT2, Active]
SELECT sid, fact1, fact2
FROM fact_row
WHERE sid IN (1 …10)
UNION ALL
SELECT sid, fact1, fact2
FROM fact_columnar
WHERE sid IN (1 …10)
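The two tables behind that Union All might be declared like this; a hedged sketch in which the column types are illustrative and MemSQL-era columnstore syntax is assumed:

```sql
-- Hypothetical sketch: volatile data in rowstore, immutable data in columnstore
CREATE TABLE fact_row (
  sid    VARCHAR(36) NOT NULL,
  fact1  INT,
  fact2  INT,
  active TINYINT,
  SHARD KEY (sid)
);

CREATE TABLE fact_columnar (
  sid    VARCHAR(36) NOT NULL,
  fact1  INT,
  fact2  INT,
  active TINYINT,
  SHARD KEY (sid),
  KEY (sid) USING CLUSTERED COLUMNSTORE
);

-- A view can present the pair as a single logical table
CREATE VIEW fact_logical AS
  SELECT sid, fact1, fact2, active FROM fact_row
  UNION ALL
  SELECT sid, fact1, fact2, active FROM fact_columnar;
```

Sharing the shard key across both halves keeps the union partition-local, so the split does not reintroduce the broadcast problem described earlier.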
Adding Kafka Message Queues
18
19
Dispatching The Human Time Events
[Diagram: iReady lesson events → Event System → HTC Dispatch → Kafka (Brokers, ZooKeeper) → MemSQL Reporting DB]
• We had a database engine that
could perform our queries
• We solved our cost and scaling
needs
• We proved we could load and
update the database on the
desired schedule
• How are we going to get the
real-time data to the Reporting
DB?
20
Dispatching The Human Time Events
[Diagram: JSON event payloads flowing from HTC Dispatch through Kafka into the MemSQL Reporting DB via a MemSQL Pipeline]
• Use MemSQL Pipelines to
ingest data into MemSQL
from Kafka
• Declared MemSQL Objects
• Managed and controlled by
MemSQL
• No significant transforms
• Tables are augmented with a column to contain the event in JSON form
• All other columns are derived from it
21
Kafka and MemSQL Pipelines
CREATE TABLE realtime.fact_table (
  event JSON NOT NULL,
  SID AS event::data::$SID PERSISTED VARCHAR(36),
  FACT1 AS event::data::rawFact1 PERSISTED INT(11),
  FACT2 AS event::data::rawFact2 PERSISTED INT(11),
  KEY (SID));

CREATE PIPELINE fact_pipe AS
LOAD DATA KAFKA '0.0.0.0:0000/fact-event-stream'
INTO TABLE realtime.fact_table
COLUMNS (event);
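Once declared, a pipeline is typically exercised and then started with commands like these; a sketch, since monitoring-table details vary by MemSQL/SingleStore version:

```sql
-- Dry-run: pull one batch from Kafka without committing it
TEST PIPELINE fact_pipe LIMIT 1;

-- Begin continuous background ingest
START PIPELINE fact_pipe;

-- Inspect pipeline state and progress
SELECT * FROM information_schema.PIPELINES;
```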
22
Adding the Nightly Rebuild Process
[Diagram: iReady transactional DB → replication slaves → Debezium Connector (DB to Kafka) → Confluent Kafka (Brokers, ZooKeeper) → MemSQL Reporting DB]
• Get the transactional data from
the database
• Employ database replication to
dedicated slaves
• Introduce the Confluent platform
to unify data movement through
Kafka
• Deploy the Debezium Confluent
Connector to move the replication
log data into Kafka
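A Debezium source-connector registration might look roughly like this; a hedged sketch assuming a MySQL source, where the connector name, hostnames, and credentials are placeholders and exact property names vary by Debezium version:

```json
{
  "name": "reporting-db-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "replica-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "database.server.id": "184054",
    "database.server.name": "iready",
    "database.whitelist": "transactional_db"
  }
}
```

Pointing the connector at the dedicated replication slaves (as the bullets above describe) keeps change-data capture load off the primary transactional database.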
Integrating the Data Lake
23
24
Create and Update the Data Lake
[Diagram: Confluent Kafka → S3 Connector (Kafka to S3) → Data Lake (Raw Store, Read Optimized Store)]
• Build a Data Lake in S3
• Deploy the Confluent S3
Connector to move the
transaction data from Kafka
to the Data Lake
• Split the Data Lake into two
distinct forms – Raw and
Read Optimized
• Deploy Spark to move the
data from the Raw form to
the Read Optimized form
25
Move the Data from the Data Lake to MemSQL
[Diagram: Data Lake (Raw Store, Read Optimized Store) → Nightly Load Files → MemSQL Reporting DB]
• Deploy Spark to transform
the data from the Read
Optimized form to a
Reporting Optimized form
• Save the output to a
managed S3 location
• Deploy MemSQL S3
Pipelines to automatically
ingest the nightly load files
from a specified location
• Deploy MemSQL Pipeline to
Kafka
• Activate the MemSQL
Pipeline when the reload is
complete
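The nightly reload into the dark database can be driven by a MemSQL S3 pipeline; a hedged sketch in which the bucket, prefix, table name, file format, and credentials are all placeholders:

```sql
-- Hypothetical sketch: ingest the nightly load files from a managed S3 location
CREATE PIPELINE nightly_fact_pipe AS
  LOAD DATA S3 'reporting-bucket/nightly-load/'
  CONFIG '{"region": "us-east-1"}'
  CREDENTIALS '{"aws_access_key_id": "…", "aws_secret_access_key": "…"}'
  INTO TABLE fact_columnar
  FIELDS TERMINATED BY ',';

START PIPELINE nightly_fact_pipe;
```

When the reload finishes, the Kafka pipeline is activated against the same database so real-time events resume flowing before the light/dark swap.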
26
Swap the Light/Dark MemSQL DB
[Diagram: the full pipeline feeding paired Light/Dark MemSQL Reporting DBs]
• Open up the Dark DB to
accept connections
• Trigger an iReady
application event to drain
the current connection pool
and replace the connections
with new connections to the
new database
• Close the current Light DB
27
The Architecture End State
[Diagram: the end-state architecture, as shown on slide 6]
• Ensure the system you are considering is up to the challenge of your most
sophisticated queries
• With distributed systems, spend time to pick the right sharding strategy
• Make use of multiple storage engines where available
• Design workflows with message queues for flexibility and updatability
• Incorporate data lakes for long-term retention and context
28
Key Takeaways
Real-time Application Architecture for Education Data Insights

Kürzlich hochgeladen (20)

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

• Fast Updates – we need to do real-time updates
• Fast Loads – we need to reload our reporting database nightly
• Massively Scalable – we need to support large data volumes and large
numbers of concurrent users
• Cost Effective – we need a practical solution based on cost
9
In-Memory Databases MemSQL
• Columnar and Row storage models provide very fast aggregations across
large amounts of data
• Very fast load times allow us to update our reporting DB nightly
• Very fast update times for Row storage tables
• Highly scalable, based on its MPP architecture
• Unique ability to query across Columnar and Row tables in a single query
  • 10. Our MemSQL Journey Begins
    • Convert our existing database design to be optimal in MemSQL
    • Analyze our usage patterns to determine the best sharding key
    • Create our prototype and run typical queries to determine the optimal
      database structure across the spectrum of projected usage
      – Use the same sharding key in all tables
      – Push down predicates to as many tables as we can
  • 12. Why is the selection of a sharding key so important?
    [Diagram: dimension and fact tables sharded by SID across partitions
    PS1–PS9 spread over nodes NODE1–NODE3]
    • Create the database with 9 partitions
    • Create tables in the database using a sharding key which is
      advantageous to query execution
    • The goal is to get the execution of a given query to be as evenly
      distributed over the partitions as possible
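The partitioning scheme described above can be sketched in MemSQL DDL. This is a minimal illustration, not the production schema – the database name, table names, and columns (`reporting`, `dim_student`, `fact_scores`) are placeholders:

```sql
-- Create the database with an explicit partition count; the partitions
-- are spread across the leaf nodes of the cluster.
CREATE DATABASE reporting PARTITIONS 9;

USE reporting;

-- Shard both tables on the same key (sid) so that rows for the same
-- student land in the same partition on the same node.
CREATE TABLE dim_student (
  sid   VARCHAR(36) NOT NULL,
  descr VARCHAR(255),
  attr1 VARCHAR(255),
  SHARD KEY (sid)
);

CREATE TABLE fact_scores (
  sid   VARCHAR(36) NOT NULL,
  fact1 INT,
  fact2 INT,
  SHARD KEY (sid)
);
```

Because both tables declare the same shard key, a join on `sid` can be evaluated partition-by-partition, which is the even distribution the slide describes.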
  • 13. How does the sharding key affect the “join”?
    [Diagram: partitions PS1 and PS2 on Node 1]

    Select a.sid, b.factid from table1 a, table2 b
    Where a.sid in {10 ….. } and b.sid in {10 ….. }
    And a.sid = b.sid

    The basis of the join is the sid column. When the sharding key is chosen
    based on the sid columns of both tables, the join can be done
    independently within each partition and the results merged.

    This is an ideal situation for getting the nodes to perform in parallel,
    which can maximize query performance.
  • 14. How does the sharding key affect the “join”?
    [Diagram: partitions PS1 and PS2 on Node 1]

    Select a.sid, b.factid from table1 a, table2 b
    Where a.sid in {10 ….. } and b.sid in {10 ….. }
    And a.sid = b.sid

    When the sharding key is not based on the sid columns of both tables,
    the join cannot be done independently within each partition and will
    cause what is called a broadcast.

    This is not the ideal situation for getting the nodes to perform in
    parallel, and we have seen query performance degradation in these cases.
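One way to check which of the two situations a query falls into is MemSQL's `EXPLAIN`. This sketch assumes the `table1` and `table2` from the query above exist and that both are sharded on `sid`:

```sql
-- When both tables declare SHARD KEY (sid), the join key matches the
-- shard key: each partition joins its own rows and the aggregator merges
-- the per-partition results.
EXPLAIN
SELECT a.sid, b.factid
FROM table1 a, table2 b
WHERE a.sid IN ('10', '11')
  AND b.sid IN ('10', '11')
  AND a.sid = b.sid;

-- If either table is instead sharded on some other column, the plan gains
-- a broadcast (or repartition) step: rows must be moved between nodes
-- before the join can run, and query performance degrades.
```

Inspecting the plan for broadcast operators during prototyping is a practical way to validate a candidate sharding key before committing to it.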
  • 16. Columnar and Row Storage Models
    • Row storage was the most performant for queries, loads, and updates –
      it is also the most expensive solution
    • Columnar storage was performant for some queries and for loads, but
      degraded with updates – cost effective, but not performant enough on
      too many of the target queries
    • To maximize our use of MemSQL, we have combined Row storage and
      Columnar storage to create a logical table
      – Volatile (changeable) data is kept in Row storage
      – Non-volatile (immutable) data is kept in Columnar storage
      – Requests for data are made using “Union All” queries
  • 17. Columnar and Row
    [Diagram: a logical table composed of a Row storage portion holding the
    volatile rows and a Columnar storage portion holding the immutable rows,
    queried together]

    Select sid, fact1, fact2
    From fact_row
    Where sid in (1 …10)
    Union All
    Select sid, fact1, fact2
    From fact_columnar
    Where sid in (1 …10)
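The logical table above can be sketched as a row-store table for volatile data, a columnstore table for immutable data, and a view that unions them. Table and view names are illustrative; the `USING CLUSTERED COLUMNSTORE` key clause is MemSQL's columnstore syntax:

```sql
-- Volatile (still-changing) facts: row store, fast point updates.
CREATE TABLE fact_row (
  sid    VARCHAR(36) NOT NULL,
  fact1  INT,
  fact2  INT,
  active TINYINT,
  SHARD KEY (sid),
  KEY (sid)
);

-- Immutable facts: columnstore, fast large scans and aggregations.
CREATE TABLE fact_columnar (
  sid    VARCHAR(36) NOT NULL,
  fact1  INT,
  fact2  INT,
  active TINYINT,
  SHARD KEY (sid),
  KEY (sid) USING CLUSTERED COLUMNSTORE
);

-- A single logical table for readers: one query spans both engines.
CREATE VIEW fact_logical AS
  SELECT sid, fact1, fact2 FROM fact_row
  UNION ALL
  SELECT sid, fact1, fact2 FROM fact_columnar;
```

Readers query `fact_logical` as if it were one table, while the write path decides which physical table a row belongs in based on whether it can still change.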
  • 18. Adding Kafka Message Queues
  • 19. Dispatching the Human Time Events
    [Diagram: iReady lesson and event systems dispatching events through
    Kafka (Brokers, ZooKeeper) via HTC Dispatch into the MemSQL Reporting DB]
    • We had a database engine that could perform our queries
    • We solved our cost and scaling needs
    • We proved we could load and update the database on the desired schedule
    • How are we going to get the real-time data to the Reporting DB?
  • 20. Dispatching the Human Time Events
    [Diagram: JSON event payloads flowing from Kafka through a MemSQL
    Pipeline into the MemSQL Reporting DB]
    • Use MemSQL Pipelines to ingest data into MemSQL from Kafka
    • Declared MemSQL objects
    • Managed and controlled by MemSQL
    • No significant transforms
  • 21. Kafka and MemSQL Pipelines
    • Tables are augmented with a column to contain the event in JSON form
    • All other columns are derived

    CREATE TABLE realtime.fact_table (
      event JSON NOT NULL,
      SID   AS event::data::$SID      PERSISTED varchar(36),
      FACT1 AS event::data::rawFact1  PERSISTED int(11),
      FACT2 AS event::data::rawFact2  PERSISTED int(11),
      KEY (SID));

    CREATE PIPELINE fact_pipe AS
    LOAD DATA KAFKA '0.0.0.0:0000/fact-event-stream'
    INTO TABLE realtime.fact_table (event);
  • 22. Adding the Nightly Rebuild Process
    [Diagram: iReady transactional DBs replicated to dedicated slaves, with
    the Debezium connector moving replication log data into Confluent Kafka]
    • Get the transactional data from the database
    • Employ database replication to dedicated slaves
    • Introduce the Confluent platform to unify data movement through Kafka
    • Deploy the Debezium Confluent connector to move the replication log
      data into Kafka
  • 24. Create and Update the Data Lake
    [Diagram: the Confluent S3 connector feeding a Data Lake with Raw and
    Read Optimized stores]
    • Build a Data Lake in S3
    • Deploy the Confluent S3 connector to move the transaction data from
      Kafka to the Data Lake
    • Split the Data Lake into 2 distinct forms – Raw and Read Optimized
    • Deploy Spark to move the data from the Raw form to the Read Optimized
      form
  • 25. Move the Data from the Data Lake to MemSQL
    [Diagram: nightly load files flowing from the Data Lake into the MemSQL
    Reporting DB]
    • Deploy Spark to transform the data from the Read Optimized form to a
      Reporting Optimized form
    • Save the output to a managed S3 location
    • Deploy MemSQL S3 Pipelines to automatically ingest the nightly load
      files from a specified location
    • Deploy the MemSQL Pipeline to Kafka
    • Activate the MemSQL Pipeline when the reload is complete
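An S3 pipeline for the nightly load files might look like the following sketch. The bucket path, region, pipeline name, and target table are placeholders, and in practice the credentials would come from IAM roles or the environment rather than being inlined:

```sql
-- Ingest the nightly load files from a managed S3 location.
CREATE PIPELINE nightly_fact_load AS
LOAD DATA S3 'my-bucket/nightly-load-files'
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "...",
              "aws_secret_access_key": "..."}'
INTO TABLE reporting.fact_nightly
FIELDS TERMINATED BY ',';

-- Activate the pipeline once the nightly rebuild has produced the files.
START PIPELINE nightly_fact_load;
```

The pipeline watches the specified prefix and loads new files as they appear, which is what lets the reload run unattended on the nightly schedule.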
  • 26. Swap the Light/Dark MemSQL DB
    [Diagram: two MemSQL Reporting DBs – the freshly loaded Dark DB and the
    currently serving Light DB]
    • Open up the Dark DB to accept connections
    • Trigger an iReady application event to drain the current connection
      pool and replace the connections with new connections to the new
      database
    • Close the current Light DB
  • 27. The Architecture End State
    [Diagram: the full pipeline – iReady lesson and event systems, Confluent
    Kafka with the Debezium and S3 connectors, HTC Dispatch, the Data Lake
    (Raw and Read Optimized stores), nightly load files, and the light/dark
    MemSQL Reporting DBs]
  • 28. Key Takeaways
    • Ensure the system you are considering is up to the challenge of your
      most sophisticated queries
    • With distributed systems, spend time to pick the right sharding
      strategy
    • Make use of multiple storage engines where available
    • Design workflows with message queues for flexibility and updateability
    • Incorporate data lakes for long-term retention and context

Editor's notes

  1. User growth: 250K – 4.5M in 4 years; 80K concurrent users; 60K/min user diagnostic item responses; 13K/min lesson component starts; 332K/day diagnostics completed; 1.6M/day lessons completed
  2. Create database defines the number of partitions. A partition is created on a specific node. Tables in the database are sharded evenly among the partitions using the defined sharding key.
  3. A good design allows the join to be performed entirely on a single node. If not, MemSQL needs to shuttle the join data to the necessary node to perform the join.
  4. A good design allows the join to be performed entirely on a single node. If not, MemSQL needs to shuttle the join data to the necessary node to perform the join.
  5. Raw Store is Avro or JSON. Read Optimized Store is Parquet.