Now more than ever, the retail banking market requires us to stay close to our customers and to understand which services, products, and wishes are relevant to each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. By using Hadoop and open source statistical languages and tools such as R and Python, we can execute a variety of machine learning algorithms and scale them out on a distributed computing framework.
6. Back to routine.
Grocery, broken washing machine
After-vacation fun
Pancake house.
Traveling back.
Just back home. Pizza.
Shopping in Sicily
Vacation!
The bank statements / How I read the bank bills / What happened those days
7. Data is the fabric of our lives
Let’s give more meaning and context to data.
8. Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist best known for creating Maslow's hierarchy of needs.
9. Very human needs
Physiology: breathing, food, water, sleep
Contractual: security of body, resources, health, employment, property
Love & Caring: friend, family, partner; security of love and belonging
Esteem: self-esteem, confidence, achievements, respect
Self-actualization: spontaneity, creativity, acceptance, freedom, ethics
11. Technology is reaching out
Physiology: Connectivity, Electricity, Hardware / Infra
Contractual: security of basic operations; REST APIs, Encryption, Authentication
Love & Caring: Notification, Alerts, Social bonding, Predictions
Esteem: Set goals, planning, Achievements, Advisory role
Self-actualization: Freedom, Trusted Companion
12. Data science top 3
Dimensionality Reduction
Predictive Analytics
Clustering, Segmentation
13. Data science: what’s working?
- Random Forests
- Artificial Neural Networks
- Clustering Algorithms
- Pattern Recognition
- Time-series analysis
- Regression
Most real-world models are a
combination of these
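As an illustration of one item on this list, here is a minimal clustering sketch in plain Python. It is not taken from the talk: the `kmeans` helper and the `customers` data (monthly spend, transactions per month) are hypothetical, and a real project would more likely use a library implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers: (monthly spend, transactions per month)
customers = [(100, 5), (120, 6), (110, 4), (900, 40), (950, 42), (880, 38)]
centroids, labels = kmeans(customers, k=2)
print(labels)  # the first three and the last three customers share a label
```

In practice the features would be scaled first, since raw spend dominates the distance here.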
14. Data science ^.^/
keep it scientific
cross-validate your models
keep it measurable
play with it
create new features
explore the available data
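To make "cross-validate your models" and "keep it measurable" concrete, here is a small k-fold cross-validation sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic two-class data, the nearest-centroid model, and the helper names do not come from the talk.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )

def cross_validate(X, y, k=5):
    """Train on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return scores

# Synthetic data: two well-separated Gaussian blobs, 20 points per class.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
X += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(20)]
y = [0] * 20 + [1] * 20

scores = cross_validate(X, y, k=5)
print(scores, mean(scores))
```

Reporting the per-fold scores as well as the mean shows how much the estimate varies from one split to another.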
16. Data Science: R
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
● Language for statistics
● Easy to analyze and shape data
● Advanced statistical packages
● Fueled by academia and professionals
● Very clean visualization packages
Packages for machine learning:
time series forecasting, clustering, classification,
decision trees, neural networks
Remote procedure calls (RPC):
from Scala/Java via RProcess and Rserve
17. Data Science: Python
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
● Flexible, concise language
● Quick to code and prototype
● Portable, visualization libraries
Machine learning libraries:
scipy, statsmodels, sklearn,
matplotlib, ipython
Web libraries:
flask, tornado, (no)SQL clients
19. The customer’s context
Personal history:
number of transactions ever made
Long-term interaction:
how the user's actions correlate with those of others
Real time events:
Trends and recent events
20. The customer’s context
context is related to time:
slow changing: the defining characteristic of a person
fast changing: events which influence our lives, trends
These require very different
technology solutions!
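The split between slow-changing and fast-changing context can be sketched as two different computations over the same transaction log. The log below, the field layout, and the function names `slow_context` and `fast_context` are hypothetical and only illustrate the idea.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical transaction log: (customer_id, timestamp, category)
transactions = [
    ("c1", datetime(2014, 6, 1, 9, 0), "grocery"),
    ("c1", datetime(2014, 6, 8, 9, 30), "grocery"),
    ("c1", datetime(2014, 7, 10, 7, 0), "travel"),
    ("c1", datetime(2014, 7, 12, 15, 0), "shopping"),
    ("c1", datetime(2014, 7, 20, 19, 0), "restaurant"),
    ("c2", datetime(2014, 7, 20, 12, 0), "grocery"),
]

def slow_context(transactions, customer_id):
    """Personal history: lifetime transaction count and most frequent category."""
    mine = [t for t in transactions if t[0] == customer_id]
    categories = Counter(t[2] for t in mine)
    top = categories.most_common(1)[0][0] if mine else None
    return {"lifetime_transactions": len(mine), "top_category": top}

def fast_context(transactions, customer_id, now, window=timedelta(hours=24)):
    """Recent events: categories seen for this customer within the last window."""
    return [
        t[2]
        for t in transactions
        if t[0] == customer_id and now - window <= t[1] <= now
    ]

now = datetime(2014, 7, 21, 8, 0)
print(slow_context(transactions, "c1"))
print(fast_context(transactions, "c1", now))
```

The first function scans the full history and can be precomputed in batch; the second only needs the latest events and has to be answered within the latency budget of a request.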
21. Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It's web and mobile apps: the latency budget is limited
Loads of information to process
Understand the user's history well
Access a larger context
22. Big Data and Fast data
ranking and preference
segmentation and clustering
short term trending topics
rule-based recommendations
Tens of terabytes of data.
This can take hours...
Hundreds of events per second.
This must be fast...
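On the fast-data side, "short term trending topics" can be approximated with a sliding time window over the event stream. This is a minimal single-process sketch, assuming events arrive in timestamp order; the `TrendingWindow` class and the sample events are hypothetical.

```python
from collections import Counter, deque

class TrendingWindow:
    """Counts the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        """Record one event, then drop everything older than the window."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Each event is appended once and popped once, so the cost per
        # event stays constant on average.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=3):
        """Most frequent topics inside the current window."""
        return self.counts.most_common(n)

window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "pizza"), (10, "travel"), (20, "pizza"), (70, "grocery")]:
    window.add(ts, topic)
print(window.top())
```

By the time the event at second 70 arrives, the events from seconds 0 and 10 have left the 60-second window, so only one "pizza" and one "grocery" event remain.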
28. Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All the cluster's CPUs work in parallel to
run queries
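The "all CPUs working in parallel" idea follows the map/reduce pattern: each partition of the data is summarized independently, then the partial results are merged. The sketch below imitates that pattern on one machine. It does not use any Hadoop API; the threads stand in for cluster nodes, and the partition data is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(rows):
    """Map step: total spend per customer within one partition."""
    partial = Counter()
    for customer, amount in rows:
        partial[customer] += amount
    return partial

def merge(a, b):
    """Reduce step: combine two partial results into one."""
    merged = Counter(a)
    merged.update(b)
    return merged

# Hypothetical partitions, as if the data were split across three nodes.
partitions = [
    [("c1", 20), ("c2", 5)],
    [("c1", 30)],
    [("c3", 7), ("c2", 5)],
]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce(merge, partials, Counter())
print(dict(totals))
```

Because the map step touches only its own partition, adding nodes adds throughput; only the much smaller partial results need to travel over the network.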
29. Cassandra: A low-latency 2D store
Reliable
Distributed, replicated data store
Low latency
Sub-millisecond read/write operations
Tunable CAP
Define your level of consistency
Data model:
hashed rows, sorted wide columns
Architecture model:
No SPOF, ring of nodes,
homogeneous system
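The data model ("hashed rows, sorted wide columns") and the tunable consistency can be illustrated with a toy in-memory model. This is not the Cassandra API or driver: `ToyWideColumnStore`, its methods, and the modulo-based node placement are simplifications invented for this sketch, while the real system uses a token ring and replicas.

```python
import bisect
import hashlib

class ToyWideColumnStore:
    """Toy model: rows are located by hash, columns in a row stay sorted."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.rows = {}  # row_key -> list of (column_name, value), sorted by name

    def node_for(self, row_key):
        """Hash the row key to pick an owning node (simplified placement)."""
        digest = hashlib.sha256(row_key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, row_key, column, value):
        columns = self.rows.setdefault(row_key, [])
        names = [name for name, _ in columns]  # rebuilt per call: fine for a toy
        i = bisect.bisect_left(names, column)
        if i < len(columns) and columns[i][0] == column:
            columns[i] = (column, value)        # overwrite an existing column
        else:
            columns.insert(i, (column, value))  # insert, keeping name order

    def slice(self, row_key, start, end):
        """Range scan over one row's columns: start <= name < end."""
        columns = self.rows.get(row_key, [])
        names = [name for name, _ in columns]
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_left(names, end)
        return columns[lo:hi]

def reads_see_latest_write(replicas, read_level, write_level):
    """Read and write replica sets must overlap: R + W > N."""
    return read_level + write_level > replicas

store = ToyWideColumnStore(nodes=["node-a", "node-b", "node-c"])
store.put("customer:c1", "2014-07-20T19:00", "pizza")
store.put("customer:c1", "2014-07-12T15:00", "shopping")
store.put("customer:c1", "2014-06-01T09:00", "grocery")
print(store.slice("customer:c1", "2014-07-01", "2014-08-01"))
print(reads_see_latest_write(replicas=3, read_level=2, write_level=2))
```

Using timestamps as column names means one customer's events sit next to each other in time order, so a time range for that customer becomes a single contiguous slice.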
30. Scala / Akka / Spray:
a reactive web API framework
[Diagram: Actor A, Actor B, and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
● processing is non-blocking, threads are re-used
● can parallelize computing power across many actors
Very fast: thousands of messages per second
Very reliable: auto recovery
Lazy: compute only when required
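The actor idea (private state, a mailbox, one message processed at a time, no blocked threads) can be sketched with Python's asyncio. This is an analogy rather than Akka's API: `CounterActor`, its `tell` method, and the `None` stop signal are all hypothetical names chosen for this example.

```python
import asyncio

class CounterActor:
    """Owns its state; the only way to change it is to send a message."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.totals = {}

    async def run(self):
        while True:
            msg = await self.mailbox.get()  # waits without blocking a thread
            if msg is None:                 # hypothetical stop signal
                break
            customer, amount = msg
            # Messages are handled one by one, so no locks are needed here.
            self.totals[customer] = self.totals.get(customer, 0) + amount

    def tell(self, msg):
        """Fire-and-forget send: enqueue the message and return immediately."""
        self.mailbox.put_nowait(msg)

async def main():
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for msg in [("c1", 10), ("c2", 5), ("c1", 7)]:
        actor.tell(msg)
    actor.tell(None)
    await task
    return actor.totals

totals = asyncio.run(main())
print(totals)
```

Senders never wait for the actor to finish, and the actor does work only when a message arrives, which is the non-blocking, compute-on-demand behavior listed on this slide.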
31. Putting it all together
Hadoop
application (actor based)
millions of millions of
λ=
conversions
( lambda )
Data queues
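Assuming the λ on this slide refers to a lambda-style architecture (an interpretation, since the slide text does not spell it out), the query path combines a precomputed batch view with the increments collected since the last batch run. The function name and both views below are hypothetical.

```python
def merge_views(batch_view, realtime_view):
    """Answer a query by adding recent increments to the batch result."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Hypothetical views: transactions per customer.
batch_view = {"c1": 120, "c2": 40}   # recomputed periodically over the full history
realtime_view = {"c1": 3, "c3": 1}   # events that arrived after the last batch run

print(merge_views(batch_view, realtime_view))
```

When the next batch run completes, the real-time view is reset, so errors in the fast path do not accumulate.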
32. Science & Engineering
Statistics,
Data Science
Python
R
Visualization
IT Infra
Big Data
Java
Scala
SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big Data and Fast Data
require different profiles to
achieve the best results
33. Some lessons learned
● Mixing and matching technologies is a good thing
● Fast Data must complement Big Data
● Ease integration among teams
● Hadoop, Cassandra, and Akka
● Data Science takes time to figure out
34. Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks !
Any questions?