Big data solutions for advanced marketing analytics

Big Data Solutions for
Marketing Analytics
Natalino Busa
@natalinobusa

Parallelism Hadoop Cassandra Akka
Machine Learning Statistics Big Data
Algorithms Cloud Computing Scala Spray
www.natalinobusa.com

Back to routine.
Groceries, broken washing machine
After-vacation fun
Pancake house.
Traveling back.
Just back home. Pizza.
Shopping in Sicily
Vacation!
The bank statements. How I read the bank bills. What happened those days.

Data is the fabric of our lives.
Let’s give more meaning and context to data.

Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist
best known for creating Maslow's hierarchy of needs.

Very human needs
Physiology: breathing, food, water, sleep
Contractual: security of body, resources, health, employment, property
Love & Caring: friends, family, partner; security of love and belonging
Esteem: self-esteem, confidence, achievements, respect
Self-actualization: spontaneity, creativity, acceptance, freedom, ethics

How caring can technology be?

Technology is reaching out
Physiology: connectivity, electricity, hardware / infra
Contractual: security of basic operations; REST APIs, encryption, authentication
Love & Caring: notifications, alerts, social bonding, predictions
Esteem: goal setting, planning, achievements, advisory role
Self-actualization: freedom, trusted companion

Data science top 3
- Dimensionality reduction
- Predictive analytics
- Clustering / segmentation

Data science: what’s working?
- Random Forests
- Artificial Neural Networks
- Clustering Algorithms
- Pattern Recognition
- Time-series analysis
- Regression
Most real-world models are a combination of these

Data science ^.^/
keep it scientific
cross-validate your models
keep it measurable
play with it
create new features
explore the available data
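
The advice to cross-validate can be sketched with a small k-fold loop. This is only an illustration in plain Python: the toy data, the single-feature line fit, and the function names are assumptions, not part of the talk. In practice a library such as sklearn would handle the splitting and scoring.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_validate(xs, ys, k=5):
    """Return the mean squared error measured on each held-out fold."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        errors.append(mean((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx))
    return errors

# Hypothetical toy data: y = 2x + 1 exactly, so held-out error should be ~0.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
scores = cross_validate(xs, ys, k=5)
```

Each entry in `scores` is measured on data the model never saw during fitting, which is what keeps the evaluation "scientific" and "measurable".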

# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
● Language for statistics
● Easy to Analyze and shape data
● Advanced statistical package
● Fueled by academia and professionals
● Very clean visualization packages
Packages for machine learning:
time series forecasting, clustering, classification,
decision trees, neural networks
Remote procedure calls (RPC)
From scala/java via RProcess and Rserve
Data Science: R

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
● Flexible, concise language
● Quick to code and prototype
● Portable, visualization libraries
Machine learning libraries:
scipy, statsmodels, sklearn,
matplotlib, ipython
Web libraries
flask, tornado, (no)SQL clients
Data Science: Python

The customer’s context
Personal history:
number of transactions ever made
Long-term interaction:
how a user's actions correlate with those of others
Real time events:
Trends and recent events

The customer’s context
context is related to time:
slow changing: the defining characteristic of a person
fast changing: events which influence our lives, trends
These require very different
technology solutions!

Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It’s web and mobile apps: the latency budget is limited
Loads of information to process
Understand well the user history
Access a larger context

Big Data and Fast Data
Big Data: ranking and preference, segmentation and clustering
Tens of terabytes of data. This can take hours...
Fast Data: short-term trending topics, rule-based recommendations
Hundreds of events per second. This must be fast...
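
As a rough illustration of the fast-data side, short-term trending topics can be tracked with a sliding time window. The class name, the 60-second window, and the sample events below are hypothetical; the sketch also assumes events arrive in timestamp order.

```python
from collections import Counter, deque

class TrendingWindow:
    """Counts topics seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        """Record one event; timestamps are assumed to arrive in order."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Drop events that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most frequent topics currently in the window."""
        return self.counts.most_common(n)

# Hypothetical event stream: (seconds since start, topic).
window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "pizza"), (10, "travel"), (20, "travel"), (70, "grocery")]:
    window.add(ts, topic)
```

Each event costs a constant amount of work on average, which is the property that matters when hundreds of events arrive per second; the heavy ranking and segmentation jobs stay on the batch side.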

A high-level bank schematic
[Diagram: core banking systems; SOAP services and DBs; system bus; customer-facing applications; channels]

Bigger and Faster
Higher separation!
Fewer silos
Interactions with core systems

Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All of the cluster's CPUs work in parallel to
run queries
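
The "all CPUs working in parallel" idea can be sketched as a map step per data partition followed by a reduce step that merges the partial results. This is a toy, single-machine illustration and not Hadoop's API: threads stand in for cluster nodes, and the transaction records and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_partition(records):
    """Map step: total spend per category within one data partition."""
    totals = Counter()
    for category, amount in records:
        totals[category] += amount
    return totals

def reduce_totals(partials):
    """Reduce step: merge the per-partition totals into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)   # Counter.update adds the counts together
    return merged

# Hypothetical transactions, pre-split into partitions the way a
# distributed file system stores blocks on different nodes.
partitions = [
    [("grocery", 40), ("pizza", 15)],
    [("travel", 300), ("grocery", 25)],
    [("pizza", 12), ("shopping", 80)],
]

# Threads stand in for cluster nodes; each processes its own partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce_totals(partials)
```

Because each partition is processed independently, adding nodes adds throughput, which is the cost versus performance trade-off the slide points at.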

Cassandra: A low-latency 2D store
Reliable
Distributed, replicated data store
Low latency
Sub-millisecond read/write operations
Tunable CAP
Define your level of consistency
Data model:
hashed rows, sorted wide columns
Architecture model:
No SPOF, ring of nodes,
homogeneous system
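
The "hashed rows, sorted wide columns" data model can be pictured with a toy in-memory store. This is a sketch under stated assumptions, not Cassandra's implementation or client API: the class, the modulo placement (real clusters use a token ring), and the sample customer data are all hypothetical.

```python
import bisect
import hashlib

class WideRowStore:
    """Toy model: rows located by hashed key, columns kept sorted per row."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, row_key):
        """Hash the row key to pick the node that owns the row."""
        digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, row_key, column, value):
        """Write one column; new column names are inserted in sorted order."""
        row = self._node_for(row_key).setdefault(row_key, {"cols": [], "vals": {}})
        if column not in row["vals"]:
            bisect.insort(row["cols"], column)
        row["vals"][column] = value

    def slice(self, row_key, start, end):
        """Return (column, value) pairs with start <= column < end, in order."""
        row = self._node_for(row_key).get(row_key)
        if row is None:
            return []
        lo = bisect.bisect_left(row["cols"], start)
        hi = bisect.bisect_left(row["cols"], end)
        return [(c, row["vals"][c]) for c in row["cols"][lo:hi]]

# Hypothetical data: one row per customer, one column per event date.
store = WideRowStore(node_count=3)
store.put("customer:42", "2013-08-03", "Pizza")
store.put("customer:42", "2013-08-01", "Shopping in Sicily")
store.put("customer:42", "2013-08-10", "Pancake house")
august_first_week = store.slice("customer:42", "2013-08-01", "2013-08-08")
```

Hashing the row key spreads customers evenly over the nodes, while keeping columns sorted turns a time-range query into a cheap contiguous read within a single row.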

Scala / Akka / Spray:
a reactive web API framework
[Diagram: Actor A, Actor B and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
● processing is non-blocking, threads are re-used
● can parallelize computing power across many actors
Very fast: thousands of messages per second
Very reliable: auto recovery
Lazy: compute only when required
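
The actor idea from the diagram can be sketched as a mailbox drained by a worker. This is a toy Python illustration, not Akka's API: it uses one thread per actor for simplicity, whereas Akka schedules many actors onto a shared thread pool. The `Actor` class and the doubling handler are hypothetical.

```python
import queue
import threading

class Actor:
    """Minimal actor: a private mailbox drained by one worker thread."""

    _STOP = object()   # sentinel message that ends the worker loop

    def __init__(self, handler):
        self.handler = handler            # called once per message
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def tell(self, message):
        """Fire-and-forget send; the caller never waits for processing."""
        self.mailbox.put(message)

    def stop(self):
        """Let the actor finish its queued messages, then wait for it."""
        self.mailbox.put(self._STOP)
        self.thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is self._STOP:
                break
            self.handler(message)

received = []

# Actor B collects results; actor A doubles each number and forwards it to B.
actor_b = Actor(received.append)
actor_a = Actor(lambda n: actor_b.tell(n * 2))

for n in [1, 2, 3, 4]:
    actor_a.tell(n)

actor_a.stop()   # A has forwarded everything by the time it stops
actor_b.stop()   # B has processed everything A forwarded
```

Actors share no state and communicate only through messages, which is what allows the same design to be spread across cores or across machines in a cluster.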

Putting it all together
[Diagram: Hadoop; application (actor based); data queues; λ (lambda) = millions of millions of conversions]

Science & Engineering
Science: Statistics, Data Science, Python, R, Visualization
Engineering: IT Infra, Big Data, Java, Scala, SQL
Hadoop: Big Data infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles
to achieve the best results

Some lessons learned
● Mixing and matching technologies is a good thing
● Fast Data must complement Big Data
● Ease integration among teams
● Hadoop, Cassandra, and Akka
● Data Science takes time to figure out

Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Thanks!
Any questions?
