In the past decade a number of technologies have revolutionized the way we do analytics in banking. In this talk we would like to summarize this journey from classical statistical offline modeling to the latest real-time streaming predictive analytical techniques.
In particular, we will look at Hadoop and how this distributed computing paradigm has evolved with the advent of in-memory computing. We will introduce Spark, an engine for large-scale data processing optimized for in-memory computing.
Finally, we will describe how to make data science actionable and how to overcome some of the limitations of current batch processing with streaming analytics.
4. The origins (40s, 50s, 60s)
Operational Research during World War II
First Predictive Weather Model on ENIAC
6. The origins (40s, 50s, 60s)
● Operational Research
● Collision loss vs Anti-Aircraft loss
● Optimization (Statistical) problems
● Scheduling and resource allocation
9. Analytics goes Mainstream
(70s, 80s)
● The Relational Database is born!
1970: E.F. Codd, relational database model
1972: normalization
(free from insertion, deletion and update anomalies)
1976: Peter Chen, The entity-relationship model
10. Analytics goes Mainstream
(70s, 80s)
● 1982: IBM DB2, Oracle v3, Sybase (SAP)
● 1986: First standardized SQL
● 1987: Commercial use of Decision Support Systems:
Texas Air Traffic Expert system
12. Exploratory Data Analysis
In 1977, Tukey published Exploratory Data Analysis,
arguing that more emphasis needed to be placed on using
data to suggest hypotheses to test and that Exploratory
Data Analysis and Confirmatory Data Analysis “can—and
should—proceed side by side.”
13. The Internet goes Global
(90s)
● 1995: Amazon
● 1995: eBay
● 1996: Hotmail
● 1998: Google
● 1998: PayPal
16. The Internet goes Global
(90s)
● Analytics (OLAP):
Long queries, aggregations, data mining, reporting, models
● Operations (OLTP):
Fast transactions, ACID, consistent, available, fault-tolerant
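The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. The `accounts` and `payments` tables, the owner names and the amounts are hypothetical, invented only for illustration; in practice the two workloads would typically run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER, amount INTEGER);
    INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 50);
""")

def transfer(conn, src, dst, amount):
    """OLTP-style operation: a short, atomic transaction touching a few rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("INSERT INTO payments (src, dst, amount) VALUES (?, ?, ?)", (src, dst, amount))

def total_sent_per_owner(conn):
    """OLAP-style query: scan and aggregate the whole payment history."""
    rows = conn.execute("""
        SELECT a.owner, SUM(p.amount)
        FROM payments p JOIN accounts a ON a.id = p.src
        GROUP BY a.owner ORDER BY a.owner
    """)
    return dict(rows.fetchall())

transfer(conn, 1, 2, 30)
transfer(conn, 2, 1, 10)
try:
    transfer(conn, 2, 1, 500)  # overdraws the account, so the whole transaction is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
report = total_sent_per_owner(conn)
```

The failed transfer leaves no trace in either table, which is the atomicity that OLTP systems are built around; the report query, by contrast, reads every payment row and would grow slower as history accumulates.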
17. Data warehouses and ETLs (90s)
● Building the Data Warehouse by
William Inmon (John Wiley - QED,
1992)
18. The World goes Social
(00s)
Web apps go into hyper-growth
● 2003: LinkedIn
● 2003: Skype
● 2004: Facebook
● 2006: Twitter
20. The advent of MPP OLAPs (Early 00s)
● Massive multi-rack systems
● 100s of Computing Cores
● 100s of Terabytes of Storage
● Distributed computing
● Advanced Query Plans
● Columnar Data Models
● Re-programmable hardware
27. Fast Data, APIs, Mobile and IoT (10s)
● WhatsApp: in a day
● 31 billion messages sent
● 700 million photos sent
28. Fast Data, APIs, Mobile and IoT (10s)
New Problems:
● Hadoop is too slow (File -> File)
● Productivity of Data Science goes down
● SQL is not enough
● Distributed Machine Learning algorithms?
30. The RAM is the new Disk (10s)
Spark is a new framework for in-memory computing
It unifies in one Distributed Computing paradigm:
SQL, Machine Learning, Map-Reduce, Graph Analytics
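The idea of keeping intermediate results in memory and feeding them straight into the next job can be illustrated with a toy, single-process sketch in plain Python. This is not Spark's API: the `map_reduce` helper and the sample `transactions` records are hypothetical, and a real engine would partition the data across a cluster.

```python
from collections import defaultdict
from functools import reduce

def map_reduce(records, mapper, reducer):
    """Toy map-reduce: map records to (key, value) pairs, group by key, reduce per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase
            groups[key].append(value)      # shuffle phase: group values by key
    # reduce phase: fold the values of each key into one result
    return {key: reduce(reducer, values) for key, values in groups.items()}

# Hypothetical sample data: (customer, merchant category, amount)
transactions = [
    ("c1", "grocery", 20), ("c1", "travel", 300),
    ("c2", "grocery", 35), ("c2", "grocery", 15),
    ("c3", "travel", 120),
]

# Job 1: total spend per customer. The result stays in memory as a dict
# instead of being written to a file and read back by the next job.
spend_per_customer = map_reduce(
    transactions,
    mapper=lambda t: [(t[0], t[2])],
    reducer=lambda a, b: a + b,
)

# Job 2: consumes the in-memory result of job 1 directly and counts
# customers per spend segment (the 100 threshold is an arbitrary assumption).
segment_counts = map_reduce(
    spend_per_customer.items(),
    mapper=lambda kv: [("high" if kv[1] >= 100 else "low", 1)],
    reducer=lambda a, b: a + b,
)
```

In a File -> File pipeline, the output of job 1 would be written to disk and re-read before job 2 could start; chaining the two jobs over an in-memory result is the part that in-memory engines aim to make cheap.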
31. Spark
Generality
Combine SQL, streaming, and
complex analytics.
Runs Everywhere
Spark runs on Hadoop, Mesos,
standalone, or in the cloud.
Multiple Data Sources
It can access diverse data
sources including HDFS,
Cassandra, HBase, and S3.
https://spark.apache.org/
32. Popular Analytical Stacks (10s)
Hadoop Hive + MPP
Spark + Cassandra (no Hadoop!)
Spark + HDFS + Elastic(Search)