More than ever, the retail banking market demands that we stay close to our customers and carefully understand which services, products, and wishes are relevant to each customer at any given time.
This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers in order to create and scale distributed analyses on a big data platform.
4. Conversion is the ultimate form of permission marketing
Permission marketing is about the honour of being heard.
How to earn it?
Provide the right suggestions, at the right time.
This is what makes data analysis valuable.
5. When do you really know your customer?
When you know their last unique:
5 songs?
100 songs?
10,000 songs?
6. Old & New stuff.
We evolve slowly: our personality, our habits.
But events and trends can affect us at short notice.
How do you combine old with new?
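One possible way to combine old with new is to blend a slowly changing long-term preference score with a recency-weighted signal from recent events. The sketch below is illustrative only: the function names, the 24-hour half-life, and the blend weight are assumptions, not values from the talk.

```python
import math

def recency_score(event_ages_hours, half_life_hours=24.0):
    """Sum of exponentially decayed weights: an event loses half its weight every half-life."""
    decay = math.log(2) / half_life_hours
    return sum(math.exp(-decay * age) for age in event_ages_hours)

def blended_score(long_term, event_ages_hours, weight_recent=0.3, half_life_hours=24.0):
    """Blend a long-term preference (0..1) with a saturating recency signal (0..1).

    weight_recent is a hypothetical tuning knob: 0 ignores recent events,
    1 ignores the long-term profile.
    """
    recent = recency_score(event_ages_hours, half_life_hours)
    recent_norm = 1.0 - math.exp(-recent)  # squash the unbounded sum into [0, 1)
    return (1.0 - weight_recent) * long_term + weight_recent * recent_norm
```

With no recent events the score is just the down-weighted long-term preference; a burst of fresh events lifts it, and the lift fades as the events age.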
7. The customer’s context
Complex along many dimensions:
Personal history:
number of transactions ever made
Long-term interaction:
how the user's actions correlate with those of others
Real-time events:
trends and recent events
8. The customer’s context
Context is related to time:
slow-changing: the defining characteristics of a person
fast-changing: events that influence our lives, trends
These require very different technology solutions!
9. Challenges
millions of billions of
Not much time to react
window of opportunity sometimes is just a few seconds
Loads of information to process
you want to understand the user's history well
10. Slow and fast
Slow:
ranking and preference analysis
segmentation and clustering
Tens of terabytes of data.
This can take hours...
Fast:
short-term trending topics
rule-based recommendations
Hundreds of events per second.
This must be fast...
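As an illustration of the fast side, short-term trending topics can be tracked with a sliding time window over the event stream. This is a minimal single-process sketch; the class name `TrendingTopics`, the window length, and the topic labels are hypothetical and not taken from the talk.

```python
from collections import Counter, deque

class TrendingTopics:
    """Counts topic mentions inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic) pairs, oldest first
        self.counts = Counter()  # topic -> number of events still in the window

    def add(self, timestamp, topic):
        """Record one event; timestamps are assumed to arrive in non-decreasing order."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=3):
        """Return the n most frequent topics currently in the window."""
        return self.counts.most_common(n)
```

Each update touches only the new event and the expired ones, which is why this kind of computation can keep up with hundreds of events per second, unlike a full recomputation over the history.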
11. Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All of the cluster's CPUs work in parallel to run queries
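The idea of many workers running one query in parallel can be sketched as a map-reduce style aggregation: each partition of the data is summarized independently, then the partial results are merged. The example below does not use any Hadoop API; the partitions are in-memory lists, the function names are hypothetical, and the thread pool only illustrates the structure (on a real cluster each partition would be processed on a different node).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(rows):
    """Map step: partial totals per customer for one data partition."""
    partial = Counter()
    for customer_id, amount in rows:
        partial[customer_id] += amount
    return partial

def merge(left, right):
    """Reduce step: add the partial totals of two partitions together."""
    left.update(right)  # Counter.update adds values instead of replacing them
    return left

def total_per_customer(partitions, workers=4):
    """Run the map step on every partition concurrently, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return dict(reduce(merge, partials, Counter()))
```

Because the map step needs no shared state, adding partitions and workers scales the computation without changing the logic.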
12. Scala / Akka / Spray: a reactive web API framework
[Diagram: Actor A, Actor B, and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
1. processing is non-blocking, threads are re-used
2. can parallelize computing power across many actors
Very fast: thousands of messages per second
Very reliable: auto recovery
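The actor model behind Akka can be illustrated with a very small sketch: each actor owns a mailbox, processes one message at a time, and talks to other actors only by sending messages. This is not Akka and does not reproduce its API; the `Actor` class and `demo` function are hypothetical, and the sketch uses one thread per actor for simplicity, whereas Akka schedules many actors on a shared pool of re-used threads.

```python
import queue
import threading

class Actor:
    """A tiny actor: a mailbox plus a handler that sees one message at a time."""

    _STOP = object()  # sentinel message used to shut the actor down

    def __init__(self, handler):
        self._handler = handler  # function(actor, message)
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send: the caller does not wait for processing."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish its queued messages, then wait for it to exit."""
        self._mailbox.put(Actor._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is Actor._STOP:
                break
            self._handler(self, message)

def demo():
    """Actor 'doubler' forwards doubled numbers to actor 'collector'."""
    received = []
    collector = Actor(lambda actor, msg: received.append(msg))
    doubler = Actor(lambda actor, n: collector.tell(n * 2))
    for n in (1, 2, 3):
        doubler.tell(n)
    doubler.stop()    # all forwards have been sent once this returns
    collector.stop()  # all forwarded messages have been handled
    return received
```

Since an actor handles its messages sequentially, its state needs no locks; parallelism comes from running many actors side by side.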
13. Distributed computing: lambda architecture
[Diagram: lambda architecture (batch + streaming). Components shown: data warehouses, messaging buses, batch computing, streaming computing, in-memory distributed databases, and low-latency web API services (HTTP RESTful API).]
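A minimal sketch of how the serving side of a lambda architecture might combine its two paths: a batch view recomputed periodically from the full history, and a real-time view updated per event. The class and method names are hypothetical, and the sketch makes the simplifying assumption that every batch load covers all events seen so far, so the real-time delta can be reset on each load.

```python
from collections import Counter

class LambdaView:
    """Serving layer: answers queries from a batch view plus a real-time delta."""

    def __init__(self):
        self.batch_view = {}            # rebuilt periodically from the full history
        self.realtime_view = Counter()  # incremented for each incoming event

    def on_event(self, customer_id, amount):
        """Streaming path: cheap incremental update for one new event."""
        self.realtime_view[customer_id] += amount

    def load_batch(self, batch_view):
        """Batch path: swap in a freshly computed view and reset the delta.

        Assumes the new batch view already includes every event seen so far.
        """
        self.batch_view = dict(batch_view)
        self.realtime_view.clear()

    def query(self, customer_id):
        """Low-latency read: batch total plus whatever arrived since the last batch."""
        return self.batch_view.get(customer_id, 0) + self.realtime_view[customer_id]
```

The query path only does two in-memory lookups, which is what keeps the web API fast while the heavy recomputation runs in the background.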
15. All Things Distributed
Distributing computing and storage
more machines = more storage/computing
Open Source software solutions
mature enough for pragmatic adopters
Near real-time + big data technologies
Hadoop, Scala, Akka, Spray, Cassandra
16. Science & Engineering
Statistics, Data Science, Python, R, Visualization
IT Infra, Big Data, Java, Scala, SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big data and fast data require different profiles to achieve the best results.
17. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks!
Any questions?