Real-time Application Architecture for Education Data Insights

Real-time Application
Architecture
David Mellor
VP & Chief Architect Curriculum Associates

Building a Real-Time
Feedback Loop for Education
David Mellor
VP & Chief Architect Curriculum Associates
Adjusted title to match abstract
submission

• Curriculum Associates has a mission to make classrooms better places for
teachers and students.
• Our founding value drives us to continually innovative to produce new
exciting products that give every student and teacher the chance to
succeed.
–Students
–Teachers
–Administrators
• To meet some of our latest expectations, and provide the best
teacher/student feedback available, we are enabling our educators with
real-time data.
3
Our Mission

•The Architectural Journey
•Understanding Sharding
•Using Multiple Storage Engines
•Adding Kafka Message Queues
•Integrating the Data Lake
4
In This Talk

6
The Architecture End State
iReady
Confluent
Kafka
Debezium
Connector
(DB to Kafka)
S3
Connector
(Kafka to S3)
HTC Dispatch
Data Lake
Raw Store
Read
Optimized Store
Nightly
Load
Files
Brokers ZooKeeper
Lesson
iReady
Event Event System
MemSQL
Reporting
DB
MemSQL
Reporting
DB

7
Our Architectural Journey
• Where did we start and what fundamental problem do we need to solve
to get real-time values to our users?
iReady
Lesson
iReady
Event
Scheduled
Jobs
ETL to Data
Warehouse
Data Warehouse
Reporting
Data Mart
ETL to Data Mart

8
Start with the Largest Aggregate Report
Our largest aggregate report
consists of logically:
–6,000,000 leaf values filtered to
250,000
–600,000,000 leaf values filtered to
10,000,000 used as the
intermediate dataset
–Rolled up to produce 300
aggregate totals
–Response target 1 Sec
6,000,000+ Students
600,000,000+ Facts
A District Report:
10,000,000 Facts
rolled up and into
300 schools
SID DESC ATTR1 SID FACT1 FACT2
.
.
.

• SQL Compatible – our developers and basic paradigm is
SQL
• Fast Calculations – we need to compute large calculations
across large datasets
• Fast Updates – we need to do real-time updates
• Fast Loads – we need to re-load our reporting database
nightly
• Massively Scalable – we need to support large data
volumes and large numbers of concurrent users.
• Cost Effective – we need a practical solution based on cost
9
In-Memory Databases MemSQL
• Columnar and Row storage models provides
for very fast aggregations across large
amounts of data
• Very fast load times allows us to update our
reporting db nightly
• Very fast update times for Row storage tables
• Highly scalable based on their MPP base
architecture
• Unique ability to query across Columnar and
Row tables in a single query

• Convert our existing database design to be optimal in MemSQL
• Analyze our usage patterns to determine the best Sharding key
• Create our prototype and run typical queries to determine the optimal
database structure across the spectrum of projected usage
–Use the same Sharding key in all tables
–Push down predicates to as many tables as we can
10
Our MemSQL Journey Begins

12
Why is the selection of a Sharding key so
important?
SID DESC ATTR1 SID FACT1 FACT2
.
.
.
NODE2
NODE3
NODE1
Create database with 9 partitions
Create tables in the database using
a sharding key which is advantageous
to query execution
The goal is to get the execution of a
given query to be as evenly distributed
over the partitions
PS1 PS2 PS3
PS4 PS5 PS6
PS7 PS8 PS9

13
How does the Sharding Key affect the “join”
PS1 PS2
Node 1
Select a.sid, b.factid from table1 a, table2 b
Where a.sid in {10 ….. } and b.sid in {10 ….. }
And a.sid = b.sid
The basis of the join is on the sid column.
When the sharding key is chosen based on the sid
Columns for both tables … the join can be done
Independently within each partition and the result
Merged
This is an ideal situation to get the nodes performing
In parallel which can maximum query performance

14
How does the Sharding Key affect the “join”
PS1 PS2
Node 1
Select a.sid, b.factid from table1 a, table2 b
Where a.sid in {10 ….. } and b.sid in {10 ….. }
And a.sid = b.sid
When the sharding key is not based on the sid
Columns for both tables … the join cannot be done
Independently within each partition and will cause
what
Is called a broadcast
This is not the ideal situation to get the nodes
performing
In parallel and we have seen query performance
degredration
In these cases

Using Multiple Storage Engines
15

• Row storage was the most performant for queries, load and updates –
this is also the most expensive solution
• Columnar storage was performant for some queries and load but
degraded with updates – cost effective but not performant enough on too
many of the target queries
• To maximize our use of MemSQL we have combined Row storage and
Columnar storage to create a logical table
–Volatile (changeable) data is kept in Row storage
–Non-Volatile (immutable) data is kept in Columnar storage
–Requests for data are made using “Union All” queries
16
Columnar and Row Storage Models

17
Columnar and Row
.
.
.
SID FACT1 FACT2 Active
?
?
?
Row Storage Portion Columnar Storage Portion
Logical
Table
n
n
n
n
n
n
n
.
.
.
n
n
n
?
n
n
?
Select sid, fact1, fact2
From fact_row
Where sid in (1 …10)
Union All
Select sid, fact1, fact2
From fact_columnar
Where sid in (1 …10)

Adding Kafka Message Queues
18

19
Dispatching The Human Time Events Events
iReady
Kafka
HTC Dispatch
MemSQL
Reporting
DB
Brokers ZooKeeper
Lesson
iReady
Event
Event System
• We had a database engine that
could perform our queries
• We solved our cost and scaling
needs
• We proved we could load and
update the database on the
desired schedule
• How are we going to get the
real-time data to the Reporting
DB?

20
Dispatching The Human Time Events Events
iReady
Kafka
HTC Dispatch
MemSQL
Reporting
DB
Brokers ZooKeeper
Lesson
iReady
Event
Event System
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
Event:
Payload:
JSON
MemSQL
Pipeline
• Use MemSQL Pipelines to
ingest data into MemSQL
from Kafka
• Declared MemSQL Objects
• Managed and controlled by
MemSQL
• No significant transforms

• Tables are augment with a column to contain the event in JSON form
• All other columns derived
21
Kafka and MemSQL Pipelines
SID FACT1 FACT2 SID FACT1 FACT2 event
CREATE TABLE realtime.fact_table
(event JSON NOT NULL,
SID as event::data::$SID PERSISTED
varchar(36)
FACT1 as event::data::rawFact1 PERSISTED
int(11)
FACT2 as event::data::rawFact2 PERSISTED
int(11)
KEY (SID))
create pipeline fact_pipe asload data
kafka '0.0.0.0:0000/fact-event-
stream'into table realtime.fact_table
columns (event);

22
Adding the Nightly Rebuild Process
iReady
Confluent
Kafka
Debezium
Connector
(DB to Kafka)
HTC Dispatch
MemSQL
Reporting
DB
Brokers ZooKeeper
Lesson
iReady
Event Event System
• Get the transactional data from
the database
• Employ database replication to
dedicated slaves
• Introduce the Confluent platform
to unify data movement through
Kafka
• Deploy the Debezium Confluent
Connector to move the replication
log data into Kafka

24
Create and Update the Data Lake
iReady
Confluent
Kafka
Debezium
Connector
(DB to
Kafka)
S3
Connector
(Kafka to
S3)
HTC
Dispatch
MemSQL
Reporting
DB
Data Lake
Raw Store
Read
Optimized
Store
Brokers ZooKeeper
Lesson
iReady
Event
Event System
• Build a Data Lake in S3
• Deploy the Confluent S3
Connector to move the
transaction data from Kafka
to the Data Lake
• Split the Data Lake into 2
Distinct forms – Raw and
Read Optimized
• Deploy Spark to move the
data from the Raw form to
the Read Optimized form

25
Move the Data from the Data Lake to MemSQL
iReady
Confluent
Kafka
Debezium
Connector
(DB to
Kafka)
S3
Connector
(Kafka to
S3)
HTC
Dispatch
MemSQL
Reporting
DB
Data Lake
Raw Store
Read
Optimized
Store
Brokers ZooKeeper
Lesson
iReady
Event
Event System
• Deploy Spark to transform
the data from the Read
Optimized form to a
Reporting Optimized form
• Save the output to a
managed S3 location
• Deploy MemSQL S3
Pipelines to automatically
ingest the nightly load files
from a specified location
• Deploy MemSQL Pipeline to
Kafka
• Activate the MemSQL
Pipeline when the reload is
complete
Nightly
Load
Files
MemSQL
Reporting
DB

26
Swap the Light/Dark MemSQL DB
iReady
Confluent
Kafka
Debezium
Connector
(DB to
Kafka)
S3
Connector
(Kafka to
S3)
HTC
Dispatch
Data Lake
Raw Store
Read
Optimized
Store
Brokers ZooKeeper
Lesson
iReady
Event
Event System
• Open up the Dark DB to
accept connections
• Trigger an iReady
application event to drain
the current connection pool
and replace the connections
with new connections to the
new database
• Close the current Light DB
Nightly
Load
Files
MemSQL
Reporting
DB
MemSQL
Reporting
DB
MemSQL
Reporting
DB
MemSQL
Reporting
DB

27
The Architecture End State
iReady
Confluent
Kafka
Debezium
Connector
(DB to Kafka)
S3
Connector
(Kafka to S3)
HTC Dispatch
Data Lake
Raw Store
Read
Optimized Store
Nightly
Load
Files
Brokers ZooKeeper
Lesson
iReady
Event Event System
MemSQL
Reporting
DB
MemSQL
Reporting
DB

• Ensure the system you are considering is up to the challenge of your most
sophisticated queries
• With distributed systems, spend time to pick the right sharding strategy
• Make use of multiple storage engines where available
• Design workflows with message queues for flexibilty and update-ability
• Incorporate data lakes for long term retention and context
28
Key Takeaways

Real-time Application Architecture for Education Data Insights

Real-time Application Architecture for Education Data Insights

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Real-time Application Architecture for Education Data Insights

Ähnlich wie Real-time Application Architecture for Education Data Insights (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Real-time Application Architecture for Education Data Insights

Hinweis der Redaktion