Simplicity, accuracy, speed: these are three things everyone wants from their data architecture. Join this webinar presented by Behzad Pirvali, Performance Architect at MaxCDN, and Peter Vescuso, CMO at VoltDB, to learn how MaxCDN used VoltDB, the world’s fastest operational database with a fast data pipeline, to reduce the number of managed environments by two-thirds while using one-tenth of the CPU cycles required by alternative solutions, all while achieving 100% billing accuracy on 32 TB of daily web server data. The full recording of this webinar is also available here: http://learn.voltdb.com/WRMaxCDN.html
3. Data Is Transforming Business
- Big Data
- Fast Data
“Perishable insights can have exponentially more value than after-the-fact traditional historical analytics.”
Mike Gualtieri, Principal Analyst, Forrester Research
9. Biography
- Technical:
- Started programming in 1985
- Developed kernel apps like printer drivers and high-performance networking tools in C
- MS in Electrical Engineering from the Technical University in Graz, Austria, in 1995
- Filed for two patents for improving RDBMS performance, in 2005 (Symantec Corp) and 2008 (Fox News)
- Hobbies:
- Running (Marathons)
- Photography
- RC Airplanes
- Electronics
10. Agenda
- Vision
- Technical requirements
- System Architecture
- Why use VoltDB over HBase or Cassandra
- VoltDB: things to consider when designing solutions
- Conclusion
- Resources
11. Vision
- Building a real-time analytics engine for:
- Real-time diagnosis of our Edge Servers
- MaxCDN-Predict
- Elastic provisioning
- Improving serving performance
- Using this data to bill customers
12. Technical Requirements
- The system should have the following features:
- Horizontally scalable
- Real-time: a 15-second SLA from the time content is served until it shows up in the aggregates
- Zero production support:
- Zero-touch crash recovery
- No data clean-up/recovery required
- Guaranteed no data loss
- SQL interface for mining and drill-down
- Ad hoc queries on the non-aggregated raw data
14. System Architecture
- When Nginx serves the content, it logs this transaction
- These logs are streamed into the aggregation farm from around the world. We get ~32 TB of logs per day. This data gets pushed into 4 RabbitMQ queues.
- A farm of 4 machines cleans up and pre-aggregates this data. The machines create batches of 70K raw-data rows along with the corresponding aggregates and push each batch into a RabbitMQ queue.
- The VoltDB cluster runs with:
- 7 machines with k-factor=0
- Sync logging mode for “no data loss”
- 48 SitesPerHost, so a total of 7 * 48 = 336 partitions
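The sizing figures on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the host count, sites per host, and daily log volume come from the slide, while the use of decimal terabytes and a uniform arrival rate over the day are assumptions made here for illustration.

```python
# Back-of-the-envelope sizing from the figures on this slide.
HOSTS = 7                 # VoltDB machines (from the slide)
SITES_PER_HOST = 48       # sites per host (from the slide)
LOGS_PER_DAY_TB = 32      # ~32 TB of raw logs per day (from the slide)

# Assumption: decimal terabytes (1 TB = 10**12 bytes) and an even
# arrival rate across the day; real traffic will have peaks.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

partitions = HOSTS * SITES_PER_HOST
bytes_per_second = LOGS_PER_DAY_TB * BYTES_PER_TB / SECONDS_PER_DAY
mb_per_second = bytes_per_second / 10**6

print(f"partitions: {partitions}")
print(f"average raw log rate into the aggregation farm: {mb_per_second:.0f} MB/s")
```

Note that this is the raw log rate entering the aggregation farm, not the rate written to VoltDB, since the farm cleans up and pre-aggregates the data first.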
15. System Architecture
- VoltDB clients read these batches from RabbitMQ and push the data into the 7-machine VoltDB cluster. They use VoltDB’s “hashinator” to group rows so that each batch needs only “one procedure call per table per partition”. These clients guarantee batch-level atomic processing across 1680 (= 5 * 336) VoltDB stored procedure calls.
- Tables are maintained in a ring-buffer fashion.
- We can only keep ~30 min of the most recent raw data.
- In terms of its “no data loss” guarantee, the system behaves like a distributed transactional RDBMS.
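The “one procedure call per table per partition” idea can be sketched as a grouping step on the client. This is a minimal illustration, not the authors' client code: `partition_for` is a hypothetical stand-in that uses CRC32 modulo the partition count rather than VoltDB's actual hashinator, and the table names, keys, and payloads are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 336  # 7 hosts * 48 sites per host (from the slides)


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hypothetical stand-in for the hashinator: CRC32 of the key modulo N."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_batch(rows):
    """Group (table, partition_key, payload) rows by (table, partition).

    Each resulting group corresponds to one stored procedure call that
    carries an array of rows for a single table on a single partition.
    """
    calls = defaultdict(list)
    for table, key, payload in rows:
        calls[(table, partition_for(key))].append(payload)
    return dict(calls)


# Illustrative batch: table names and keys are invented for this sketch.
batch = [
    ("rawlogs", f"zone-{i % 50}", {"bytes": 1000 + i})
    for i in range(1000)
] + [
    ("agg_by_zone", f"zone-{i}", {"hits": 20})
    for i in range(50)
]

calls = group_batch(batch)
print(f"{len(batch)} rows -> {len(calls)} procedure calls")
```

With this grouping, the number of calls per batch is bounded by tables times partitions regardless of batch size, which matches the 1680 = 5 * 336 figure on the slide if the batch touches five tables.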
16. System Architecture
- Zero-touch crash recovery:
- When VoltDB crashes:
- Clients go into pause mode
- Supervisord starts up the VoltDB cluster in recovery mode
- When VoltDB clients or other components crash:
- VoltDB clients and all the other critical components run under Supervisord, so they get restarted automatically
- Completely transactional processing by combining:
- VoltDB’s atomic processing at the stored procedure level
- RabbitMQ’s replay guarantee
- Idempotency
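How replay plus idempotency yields effectively-once processing can be sketched with an in-memory model. This is an assumption-laden illustration, not the production code: `PartitionStore` is a hypothetical class standing in for one partition, its `applied_batches` set plays the role of a table of processed batch-ids, and the redelivery is simulated by repeating a batch in a list.

```python
class PartitionStore:
    """In-memory stand-in for one partition: raw rows plus a set of batch-ids."""

    def __init__(self):
        self.rawlogs = []
        self.applied_batches = set()  # plays the role of a processed-batches table

    def apply_batch(self, batch_id, rows):
        """Apply a batch at most once; return True if it was applied."""
        if batch_id in self.applied_batches:
            return False  # replayed delivery: already applied, so skip it
        # In a real system both writes would sit inside one atomic procedure.
        self.rawlogs.extend(rows)
        self.applied_batches.add(batch_id)
        return True


store = PartitionStore()
deliveries = [
    ("batch-1", ["a", "b"]),
    ("batch-2", ["c"]),
    ("batch-1", ["a", "b"]),  # redelivered after a simulated client crash
]
results = [store.apply_batch(bid, rows) for bid, rows in deliveries]
print(results, store.rawlogs)
```

Because applying a batch twice has the same effect as applying it once, the queue is free to redeliver after a crash without producing duplicate rows.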
17. Why use VoltDB over HBase or Cassandra
- Simply because of “multi-row WRITE atomicity”.
- Multi-row WRITE atomicity results in much less CPU and I/O load as well as an easier implementation.
- To make this clear, let us consider our use case of pushing our 70K batches of raw logs into a storage system:
- VoltDB:
- With VoltDB, we get stored-procedure-level atomicity. The current implementation pushes 70,000 rows into 336 partitions, so each stored procedure call writes 70,000/336 = ~208 rows into the rawlogs table. For these 208 rows, we add a single row to the TX table carrying the batch-id of this batch.
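The effect of multi-row write atomicity can be illustrated with any transactional store. The sketch below uses Python's built-in `sqlite3` purely as a stand-in (it is not VoltDB and says nothing about VoltDB's API); the table layouts and the function `write_partition_slice` are assumptions made for this example. It writes the ~208 rows and the single TX row in one transaction, so either all of them persist or none do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rawlogs (batch_row INTEGER, line TEXT NOT NULL)")
conn.execute("CREATE TABLE tx (batch_id TEXT PRIMARY KEY)")
conn.commit()


def write_partition_slice(conn, batch_id, lines):
    """Write all rows plus one TX row atomically; roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO rawlogs (batch_row, line) VALUES (?, ?)",
                list(enumerate(lines)),
            )
            conn.execute("INSERT INTO tx (batch_id) VALUES (?)", (batch_id,))
        return True
    except sqlite3.IntegrityError:
        return False


def counts(conn):
    """Return (number of rawlogs rows, number of tx rows)."""
    raw = conn.execute("SELECT COUNT(*) FROM rawlogs").fetchone()[0]
    tx = conn.execute("SELECT COUNT(*) FROM tx").fetchone()[0]
    return raw, tx


ok = write_partition_slice(conn, "batch-1", [f"line {i}" for i in range(208)])
after_good = counts(conn)

# A slice containing a NULL line violates NOT NULL, so none of its rows persist.
bad = write_partition_slice(conn, "batch-2", ["fine", None, "also fine"])
after_bad = counts(conn)
print(ok, after_good, bad, after_bad)
```

The failed slice leaves no partial rows behind, which is why a single batch-id row is enough to tell whether the whole slice was applied.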
19. Why use VoltDB over HBase or Cassandra
- HBase:
- HBase only offers single-row atomicity. So, let us say we also have 336 partitions; with HBase, we have to include the batch-id in each row, writing it 208 times instead of once. When we apply the batch, we have to go through 208 “IF statements”, one per row, and apply each row only if needed. This means a lot more CPU, I/O, and space requirements.
- If the batch size grows from 70K to 140K, these 208 WRITEs and “IF statements” also grow to ~416.
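The comparison above reduces to simple per-partition arithmetic. The sketch below just restates the slide's cost model in code; the function name and dictionary keys are made up for illustration, and integer division is used to match the rounded figures on the slides.

```python
PARTITIONS = 336  # from the slides


def per_partition_cost(batch_size, partitions=PARTITIONS):
    """Return batch-id writes and per-row checks for one partition's slice."""
    rows = batch_size // partitions  # rows landing on one partition
    return {
        "rows": rows,
        "voltdb_batch_id_writes": 1,    # one TX row per stored procedure call
        "hbase_batch_id_writes": rows,  # batch-id stored in every row
        "hbase_checks": rows,           # one conditional per row on apply
    }


for size in (70_000, 140_000):
    print(size, per_partition_cost(size))
```

Under this model the single-row-atomicity cost scales linearly with batch size, while the stored-procedure-level approach stays at one batch-id write per call.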
20. VoltDB, Things to Consider when Designing Solutions
- Good things:
- SQL interface, unlike Trident or Spark Streaming
- Merges the good things of the old world, like SQL and transactions, with the good things of the new world, like “no locks”, “k-factor” HA, etc.
- Very simple and intuitive API and usage
- k-factor + logs + snapshots eliminate the need to back up the system.
- Fast query performance
- Horizontal scalability
21. VoltDB, Things to Consider when Designing Solutions
- Each partition has only one thread of execution for INSERT/UPDATE.
- Workarounds:
- Get faster CPUs
- Pre-process the data outside VoltDB
- The maximum amount of data coming out of a partition is limited to 50 MB.
- Workarounds:
- Make sure no relevant query has a qualifying result set larger than 50 MB on any partition
- The more partitions, the better
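Why more partitions help with the 50 MB limit can be seen from a rough estimate. This is a sketch under stated assumptions: rows are spread evenly across partitions, the limit is read as 50 MiB, and the row count, row size, and the 112-partition alternative are hypothetical numbers chosen only to show the effect.

```python
LIMIT_BYTES = 50 * 1024 * 1024  # 50 MB limit from the slide (read here as MiB)


def per_partition_result_bytes(total_rows, avg_row_bytes, partitions):
    """Estimate bytes returned per partition, assuming an even spread."""
    rows_per_partition = -(-total_rows // partitions)  # ceiling division
    return rows_per_partition * avg_row_bytes


def fits(total_rows, avg_row_bytes, partitions):
    """True if the estimated per-partition result stays within the limit."""
    estimate = per_partition_result_bytes(total_rows, avg_row_bytes, partitions)
    return estimate <= LIMIT_BYTES


# Hypothetical query: 40 million qualifying rows of ~200 bytes each.
for partitions in (112, 336):
    print(partitions, fits(40_000_000, 200, partitions))
```

A skewed partitioning key would break the even-spread assumption, so the busiest partition is the one to check in practice.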
22. Conclusion
- VoltDB merges the good things of the old world and the new world.
- It provides an easy and scalable solution for real-time streaming aggregation.
- Like any other tool, it has some limitations that need to be taken into account when designing a solution.
23. Resources
- VoltDB Docs: https://docs.voltdb.com/
- Lambda Architecture: https://voltdb.com/blog/simplifying-complex-lambda-architecture
- Lambda Architecture: http://lambda-architecture.net/
- Storm/Trident: http://storm.apache.org/documentation/Trident-tutorial.html
- Spark Streaming: http://spark.apache.org/streaming/
I am available by email: bpirvali@gmail.com