3. S&P Capital IQ
S&P Capital IQ combines two of our strongest brands: S&P, with its long history and experience in the financial markets, and Capital IQ, which is known among professionals globally for its comprehensive company and financial information and powerful analytical tools.
4. Agenda
• Creation of the Excel Plug-in with Global Data, Global Sales, and US-based servers
• High-performance data gets for big historical time series data
• Q&A
5. S&P Capital IQ Excel Plug-in
• The Excel Plug-in provides thousands of data points on demand
• Allows customers anywhere in the world to use our data assets on their desktops on demand
• The user experience needs to be fast everywhere in the world
6. Global Customers, US Data Center
Average response time (milliseconds, rounded):
From: London      To: New Jersey   400
From: New York    To: New Jersey    30
From: Melbourne   To: New Jersey   800
7. Global Customers, Global Data Centers
Average response time (milliseconds, rounded), US data center to nearest global data center:
From: London      To: Ireland      400 to 40
From: New York    To: New Jersey    30 to 30
From: Melbourne   To: Singapore    800 to 60
9. How do we make it even faster?
Smart Cache, pre-sent data, and a Router:
- Move the data each customer uses the most to their desktop.
- Automatically fetch that data for the customer.
- Learn to send the right data to the customer.
10. Smart Cache
[Diagram: numbered data flow between the user's Smart Cache and the Router; steps 1-5 described below.]
1. User opts into Smart Cache
2. The system pre-sends a data package to the customer
3. User makes a request for data (see the sketch after this list)
   a. Smart Cache checks locally first
   b. If not local, grab the data from the cloud
4. Smart Cache sends usage logs back
5. The pre-sent data package is adjusted for the customer
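The local-first lookup in steps 3a and 3b is simple to express in code. The following is a minimal sketch, not the plug-in's actual implementation; LocalCache, CloudClient, and UsageLog are hypothetical interfaces standing in for the pre-sent package on the desktop, the regional cloud endpoint, and the log shipping of step 4.

public class SmartCacheLookup {
    private final LocalCache local;   // pre-sent data package on the customer's desktop
    private final CloudClient cloud;  // regional cloud endpoint behind the Router
    private final UsageLog usageLog;  // usage records shipped back for learning (step 4)

    public SmartCacheLookup(LocalCache local, CloudClient cloud, UsageLog usageLog) {
        this.local = local;
        this.cloud = cloud;
        this.usageLog = usageLog;
    }

    public String getDataPoint(String securityId, String mnemonic) {
        usageLog.record(securityId, mnemonic);           // every request feeds the learner
        String value = local.get(securityId, mnemonic);  // step 3a: check locally first
        if (value == null) {
            value = cloud.get(securityId, mnemonic);     // step 3b: not local, grab from the cloud
            local.put(securityId, mnemonic, value);      // keep it for the next request
        }
        return value;
    }
}

interface LocalCache {
    String get(String securityId, String mnemonic);
    void put(String securityId, String mnemonic, String value);
}

interface CloudClient {
    String get(String securityId, String mnemonic);
}

interface UsageLog {
    void record(String securityId, String mnemonic);
}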
11. Smart Caching Data
[Diagram: numbered data flow between the Smart Cache, the Router, and the learning pipeline; steps 1-7 described below.]
1. Collect logs from the Smart Cache
2. Collect and decrypt cloud and local usage logs
3. Feed the logs to Mahout
4. Apply the customer profile
5. Mahout produces an updated suggestion list (see the sketch after this list)
6. A customer-specific package is created
7. The prepared package is ready for pickup by the Smart Cache
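The deck names Mahout but not a specific algorithm. As one plausible reading of steps 3 to 6, here is a minimal sketch using Mahout's Taste item-based recommender, assuming the decrypted usage logs have been flattened into a customerId,dataSetId,useCount preference file; the file name and IDs are illustrative.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PackageSuggester {
    public static void main(String[] args) throws Exception {
        // Step 3: usage.csv holds customerId,dataSetId,useCount rows built from the decrypted logs.
        DataModel model = new FileDataModel(new File("usage.csv"));

        // Item-based collaborative filtering: data sets used together get packaged together.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // Steps 4 and 5: score data sets for one customer profile and keep the top suggestions.
        long customerId = 12345L;
        List<RecommendedItem> suggestions = recommender.recommend(customerId, 20);

        // Step 6: the suggested data sets would be bundled into the customer-specific package.
        for (RecommendedItem item : suggestions) {
            System.out.println("dataSet=" + item.getItemID() + " score=" + item.getValue());
        }
    }
}

The customer never sees this list; it only changes what the Router stages for pickup by the Smart Cache.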
12. Smart Caching Data
Lessons Learned
• The algorithm works much like a shopping site's recommendation engine.
• The difference is that the customer never sees the recommendations; they just get a faster experience.
• All data sets are used for learning, but only large data sets are custom-packaged for delivery.
• Sometimes it is easier to just send the entire package when the data set is small enough and used by the customer.
• Don't expect success on day 1 or day 30; the longer it learns, the more accurate it should become.
• It is not a replacement for simple logic.
• The algorithm requires constant feeding and attention.
• There are cases where you can't learn about your users, such as when they share IDs.
14. High Performance Data Gets
• Some data assets, due to their size, are still routed back to the US
• Big data sets: ~10 TB of time series data
• As those data assets became more popular, we needed to move the right data to the cloud
• The data cannot be synchronized, so fast loads are required
• Single-millisecond get times are needed
15. High Performance Data Gets
• Use Hadoop to learn which large data assets are used the most (see the sketch after this list)
• Move the subset of data identified as the most used to the cloud
• Fast loading of millions of records
• Allow for single-millisecond data retrieval times
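As a rough illustration of the first bullet (not the production job), here is a sketch of a Hadoop MapReduce job that counts requests per data asset from usage logs; the tab-separated log layout with the asset ID in the third field is an assumption.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts requests per data asset; the most-used assets are candidates to move to the cloud. */
public class AssetUsageCount {

    public static class UsageMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text asset = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed log layout: timestamp<TAB>customerId<TAB>assetId<TAB>...
            String[] fields = value.toString().split("\t");
            if (fields.length > 2) {
                asset.set(fields[2]);
                context.write(asset, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "asset-usage-count");
        job.setJarByClass(AssetUsageCount.class);
        job.setMapperClass(UsageMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}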
16. High Performance Data Gets
Cassandra: http://cassandra.apache.org/
Apache Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In a lot of ways you can think of Cassandra as Dynamo 2.0, or a marriage of Dynamo and BigTable. Cassandra is in production use at Facebook but is still under heavy development.

HBase: http://hbase.apache.org/
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
HBase is similar to an RDBMS in that it has the concept of tables; however, columns in HBase tables are not fixed in number or data type, and the data type can vary from one row to the next.

We tried to do a similar POC using Cassandra with a smaller subset of data because of the hardware restrictions mentioned above. Unlike HBase and an RDBMS, there is no concept of a table; instead there are columns, column families, and keyspaces.
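At the time of the deck, Cassandra was accessed through its Thrift API in terms of keyspaces, column families, and columns. The sketch below shows the same single-point lookup using the later CQL model and the DataStax Java driver; the keyspace, table, and column names are illustrative, not the schema the deck describes.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TimeSeriesGet {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // A keyspace and table stand in for the keyspaces/column families named on the slide.
        session.execute("CREATE KEYSPACE IF NOT EXISTS timeseries WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS timeseries.datapoints ("
                + "security_id text, data_point text, as_of timestamp, value double, "
                + "PRIMARY KEY ((security_id, data_point), as_of))");

        // Single-point get: one security, one data point, newest observation first.
        ResultSet rs = session.execute(
                "SELECT value FROM timeseries.datapoints "
                        + "WHERE security_id = ? AND data_point = ? ORDER BY as_of DESC LIMIT 1",
                "IBM", "price_close");
        Row row = rs.one();
        System.out.println(row == null ? "no data" : row.getDouble("value"));

        cluster.close();
    }
}

Partitioning by (security_id, data_point) keeps each time series on one partition, which is what makes the single-point read a one-partition lookup rather than a scan.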
17. High Performance Data Gets
            Cassandra           HBase            Oracle
Data Get    400 microseconds    1 millisecond    5 seconds
Data Load   10 minutes          10 minutes       10 minutes
• Data Get
  – Time to pull 1 security and 1 data point
• Data Load
  – Time taken to load 6 million securities
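How a single "Data Get" number like the ones above might be measured is sketched below. This is a generic timing loop, not the benchmark the team ran; DataStore is a hypothetical wrapper around whichever store (Cassandra, HBase, or Oracle) is being timed.

public class DataGetBenchmark {
    // Hypothetical wrapper around the store under test.
    interface DataStore {
        double get(String securityId, String dataPoint);
    }

    // Times repeated single-point gets (1 security, 1 data point) and returns the median in nanoseconds.
    static long medianGetNanos(DataStore store, int iterations) {
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            store.get("IBM", "price_close");
            samples[i] = System.nanoTime() - start;
        }
        java.util.Arrays.sort(samples);
        return samples[iterations / 2];   // median is less noisy than a single run
    }

    public static void main(String[] args) {
        // Stand-in store so the sketch runs on its own; replace with a real client to measure.
        DataStore dummy = (securityId, dataPoint) -> 42.0;
        System.out.printf("median get: %.1f microseconds%n",
                medianGetNanos(dummy, 1000) / 1_000.0);
    }
}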
18. High Performance Data Gets
• Virtual Oracle instances did not meet our performance needs.
• The EMR cluster needed for HBase was not cost-effective for data gets.
• HBase is difficult to implement in AWS because of the hardware requirements of Hadoop.
• Cassandra can be segmented logically for Big Data assets with minimal to no performance degradation in AWS.