The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour - Gianmario Spacagna, Pirelli

6,754 views

In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:

• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.

• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.

• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.

• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.

• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).

• How Scala (and functional programming) helped our cause.

Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicle applications. His main expertise is in building production-oriented machine learning systems. Co-author of the Professional Manifesto for Data Science, he loves evangelising his passion for best practices and effective methodologies amongst the community. Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).

Published in: Data & Analytics
  1. The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behavior
     Gianmario Spacagna (@gm_spacagna), Data Science Milan meetup, 13 July 2016
  2. The Barclays Data Science Team
     •  Retail Business Banking division based in the HQ (Canary Wharf, London)
     •  At the time (Dec 2015) it had 6 members: a Head plus a mix of engineering and machine learning specialists
     •  Goal: building data-driven applications such as:
        –  Insights Engine for small businesses
        –  Complaints NLP analytics
        –  Mortgage predictive models
        –  Pricing optimisation
        –  Graph fraud detection
        –  and so on...
  3. Lanzarote off-site
     •  1 week (5-day contest, Monday to Friday)
     •  Building a recommender system of retail merchants for people living in Bristol, UK
     •  Forget about 9-5 working hours
     •  Stimulate creativity and team-working
     •  Brainstorm new ideas and make them happen
     •  Have fun!
  4. The technical challenges
     •  No infrastructure available, only laptops and a 1G WiFi shared Internet connection.
     •  Build, test, and refactor quickly, no time for long end-to-end evaluations.
     •  Work with common structures without constraining individual initiative and innovation.
     •  Design for deployment to production on a multi-tenant cluster.
  5. Code till 3am, wake up early in the morning and go surfing! Enjoy Canarian cuisine… …and local wine
  6. The Professional Data Science Manifesto (work in progress…)
  7. Why Spark? (just to name a few…)
     •  Speed / performance, in-memory solution
     •  Elastic jobs, you can start small and scale up
     •  What works locally works distributed, almost!
     •  Single place for doing everything from source to the endpoint
     •  It cuts development time, being designed according to functional programming principles
     •  Reproducibility via a DAG of declarative transformations rather than procedural side-effect actions
  8. Preparation work (ETL)
     •  Extract, transform and load data into representations matching the business domain rather than the raw database representation
     •  Aggregate to increase generality while preserving the anonymised information needed for training the models
     •  Every business is uniquely represented by the combo (MerchantName, MerchantTown) + optionally a postcode when available
     •  Join each transaction that happened in Bristol with the business and customer details
  9. Anonymised Generalised Data
     •  Bottom-up k-anonymity:
        –  Map all of the categorical attributes of each customer (online active flag, residential area type, gender, marital status, occupation) into a bucket
        –  Group similar customers and replace the single bucket with a group of buckets and count the number of group members
        –  Recursively continue until each user is mapped into a bucket group with at least k members
     •  Masking:
        –  Replace user identifiers with uniquely generated IDs
  10. K-anonymity example
      Raw transactions:
      timestamp    customerId   occupation   gender   amount   business
      2015-03-05   9218324      Engineer     male     58.42    Waitrose
      2015-03-06   324624       Cook         female   118.90   Waitrose
      2015-03-06   324624       Cook         female   5.99     Abokado
      Anonymised records (the categorical bucket group "engineer-male, student-male, cook-female" applies to all three rows):
      Day of week   customerId   amount      business
      Thursday      00003        [50-60]     Waitrose
      Friday        00012        [100-120]   Waitrose
      Friday        00012        [0-10]      Abokado
  11. Data Types
      AnonymizedRecord corresponds to a single transaction where:
      •  Customer confidential information has been masked and attributes generalised into a set of possible buckets
      •  Business information is kept in clear (name, town and optional postcode)
      •  Time is only represented as day of week
      •  Amount was binned to reduce resolution
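The generalisation steps behind an AnonymizedRecord (day of week, binned amount, masked id) can be sketched in plain Python; the team's own code was Scala on Spark, so this is only an illustration. The field names, the `IdMasker` helper, the bin edges and the town "Bristol" in the usage note are assumptions (the edges were picked only so that the three amounts from the k-anonymity example fall into the bins shown there), and the bottom-up grouping that produces the bucket group is taken as already computed.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical bin edges: width 10 below 100, width 20 from 100 to 1000.
AMOUNT_EDGES = list(range(0, 100, 10)) + list(range(100, 1001, 20))


@dataclass(frozen=True)
class AnonymizedRecord:
    customer_id: str         # masked identifier, not the real one
    bucket_group: frozenset  # generalised categorical buckets
    day_of_week: str
    amount_bin: tuple        # (lower, upper) edge of the amount bin
    business: tuple          # (name, town)


class IdMasker:
    """Maps each real customer id to a generated id, stable within one run."""

    def __init__(self):
        self._ids = {}

    def mask(self, customer_id):
        return self._ids.setdefault(customer_id, f"{len(self._ids) + 1:05d}")


def amount_bin(amount):
    if not AMOUNT_EDGES[0] <= amount < AMOUNT_EDGES[-1]:
        raise ValueError(f"amount {amount} is outside the binned range")
    i = bisect_right(AMOUNT_EDGES, amount)
    return (AMOUNT_EDGES[i - 1], AMOUNT_EDGES[i])


def anonymize(day, customer_id, bucket_group, amount, business, masker):
    return AnonymizedRecord(
        customer_id=masker.mask(customer_id),
        bucket_group=frozenset(bucket_group),
        day_of_week=DAYS[day.weekday()],
        amount_bin=amount_bin(amount),
        business=business,
    )
```

For instance, `anonymize(date(2015, 3, 5), 9218324, {"engineer-male", "student-male", "cook-female"}, 58.42, ("Waitrose", "Bristol"), IdMasker())` yields a record for Thursday with amount bin (50, 60).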
  12. Some numbers (Bristol only)
      •  ~ 70 GB of data (Kryo serialized format)
      •  A few million transactions from 2015 (1 year's worth of data)
      •  ~ 100K Barclays retail customers
      •  ~ 50K businesses
  13. Recommender APIs
      •  RecommenderTrainer receives the raw data, performs the feature engineering tailored to the specific implementation and returns a Recommender model instance.
      •  The Recommender instance takes an RDD of customer ids and a positive number N and returns the top N recommendations for each customer.
      •  We used the pair (MerchantName, MerchantTown) to represent the unique business we want to recommend.
  14. Thoughts on Efficient Spark Programming (Vancouver Spark Meetup 03-09-2015)
      http://www.slideshare.net/nielsh1/thoughts-on-efficient-spark-programming-vancouver-spark-meetup-03092015
  15. •  Split data by customer id, NOT by transaction
      •  Down-sample test customers for quick evaluations
      •  Train and get recommendations
      •  Check the model is not cheating
      •  Ground truth for evaluation
      •  Compute MAP
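Splitting by customer id rather than by transaction means that every transaction of a given customer ends up on the same side of the split. A minimal plain-Python sketch of one way to do this deterministically follows; the hash-based bucketing, the salt strings, the 20% default and the function names are all assumptions for illustration, not the hackathon framework's code.

```python
import hashlib


def bucket(customer_id, salt):
    """Deterministic bucket in [0, 100) derived from the customer id."""
    digest = hashlib.md5(f"{salt}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def split_by_customer(transactions, test_percent=20):
    """transactions: iterable of (customer_id, business) pairs.

    All transactions of one customer go to the same side of the split.
    """
    train, test = [], []
    for customer_id, business in transactions:
        side = test if bucket(customer_id, "split") < test_percent else train
        side.append((customer_id, business))
    return train, test


def downsample_customers(transactions, keep_percent):
    """Keep roughly keep_percent of the customers, with all their transactions."""
    return [t for t in transactions if bucket(t[0], "sample") < keep_percent]
```

Because the bucket depends only on the customer id, repeated runs produce the same split, which keeps quick evaluations on a down-sampled test set comparable across models.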
  16. Mean Average Precision (MAP)
      •  Each customer has visited m relevant businesses
      •  Recommendations predict n ranked businesses
      •  For a given customer we compute the average precision as:
         $$\mathrm{ap@}n = \frac{\sum_{k=1}^{n} P(k)}{\min(m, n)}$$
      •  P(k) = precision at cut-off k in the recommendation list, i.e. the number of relevant businesses up to position k divided by k. P(k) = 0 when the k-th business is not relevant.
      •  MAP for N customers at n is the average of the average precision of each customer:
         $$\mathrm{MAP@}n = \frac{\sum_{i=1}^{N} \mathrm{ap@}n_i}{N}$$
  17. MAP example
      •  Recommendations for test user Bob, N = 6
         Precision(k): 1/1, 0, 2/3, 0, 0, 3/6
         Average Precision #Bob = (1 + 2/3 + 3/6) / 3 = 0.722
      •  Recommendations for test user Alice, N = 6
         Precision(k): 0, 1/2, 0, 0, 2/5, 0
         Average Precision #Alice = (1/2 + 2/5) / 2 = 0.45
      •  MAP@6 = (0.722 + 0.45) / 2 = 0.586
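The MAP definition can be checked against the Bob and Alice example with a few lines of plain Python. The business labels below are placeholders; only the positions of the relevant recommendations (1, 3, 6 for Bob and 2, 5 for Alice) come from the slides, and Alice is assumed to have exactly two relevant businesses, as the division by 2 suggests.

```python
def average_precision(recommended, relevant, n):
    """ap@n: sum of P(k) at the relevant positions, divided by min(m, n)."""
    hits, total = 0, 0.0
    for k, business in enumerate(recommended[:n], start=1):
        if business in relevant:
            hits += 1
            total += hits / k  # P(k) only counts when the k-th item is relevant
    denominator = min(len(relevant), n)
    return total / denominator if denominator else 0.0


def mean_average_precision(recommendations, ground_truth, n):
    """MAP@n: average of ap@n over all customers in the ground truth."""
    scores = [
        average_precision(recommendations.get(customer, []), relevant, n)
        for customer, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores)


recommendations = {
    "Bob": ["b1", "x1", "b2", "x2", "x3", "b3"],
    "Alice": ["y1", "a1", "y2", "y3", "a2", "y4"],
}
ground_truth = {"Bob": {"b1", "b2", "b3"}, "Alice": {"a1", "a2"}}

print(round(mean_average_precision(recommendations, ground_truth, 6), 3))  # 0.586
```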
  18. Most Popular Businesses
      •  Learn the most popular businesses during training and broadcast them as a list
      •  Create a recommender that maps every customer id to the same top n businesses
      •  The most popular businesses recommender could be used as a baseline and also as a “padder” for filling missing recommendations of more advanced recommenders.
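A plain-Python sketch of the trainer/recommender pair and of the most-popular baseline used as a padder follows. It mirrors the roles of RecommenderTrainer and Recommender described in the talk, but the method names, the use of lists instead of RDDs, the `pad` helper and the choice of "number of distinct customers" as the popularity measure are assumptions.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Recommender(ABC):
    @abstractmethod
    def recommendations(self, customers, n):
        """Return {customer_id: [business, ...]} with at most n businesses each."""


class RecommenderTrainer(ABC):
    @abstractmethod
    def train(self, transactions):
        """Take (customer_id, business) pairs and return a Recommender."""


class MostPopularRecommender(Recommender):
    def __init__(self, ranked_businesses):
        self.ranked = list(ranked_businesses)

    def recommendations(self, customers, n):
        top = self.ranked[:n]
        return {customer: list(top) for customer in customers}


class MostPopularTrainer(RecommenderTrainer):
    def train(self, transactions):
        # Assumed popularity measure: number of distinct customers per business.
        counts = Counter(business for _, business in set(transactions))
        ranked = sorted(counts, key=lambda b: (-counts[b], b))
        return MostPopularRecommender(ranked)


def pad(recommendations, popular, n):
    """Fill lists shorter than n with popular businesses not already present."""
    padded = {}
    for customer, items in recommendations.items():
        filled = list(items[:n])
        for business in popular:
            if len(filled) >= n:
                break
            if business not in filled:
                filled.append(business)
        padded[customer] = filled
    return padded
```

Any more advanced model can then be wrapped as `pad(model.recommendations(customers, n), baseline.ranked, n)` so that every customer receives exactly n suggestions whenever enough businesses exist.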
  19. CUSTOMER-TO-CUSTOMER SIMILARITY MODELS
      Each customer is represented in a sparse feature space
      Must define a metric space that satisfies the triangle inequality
      Similarity (or distance) based on:
      •  Common behaviour (geographical and temporal shopping journeys)
      •  Common demographic attributes (age, residential area, gender, job position…)
  20. Customer Features
      •  Represent each customer in terms of histograms:
         –  Distribution of spending across different dimensions: week days, postcode sectors, merchant categories, businesses
         –  Probability distributions of their generalised attributes: online activity, gender, marital status, occupation
      •  If we flatten each map and fill all of the missing keys with 0s, we can then compute the cosine distance between two customers
  21. Extracting Customer Features 1/2
      •  There are too many businesses to fit into a Map, so we only take the top ones and assume the tail to be negligible
      •  Wallet histogram: count each (customer, bin) using reduceByKey, followed by a groupBy on customer to merge all of the bin counts into a map
  22. Extracting Customer Features 2/2
      •  Broadcast variables should be destroyed at the end of their scope
      1. Select the distinct customer ids with the associated categorical group
      2. Perform a map-side multi-join: one map over the whole RDD with multiple look-ups into broadcast maps
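The wallet histogram and the cosine distance between two sparse customer maps can be sketched in plain Python. In the talk this was done with Spark's reduceByKey and groupBy; the function names here are assumptions, and a `Counter` stands in for the distributed count.

```python
from collections import Counter
from math import sqrt


def wallet_histograms(transactions, top_businesses):
    """Count visits per (customer, business), keeping only the top businesses,
    then merge the counts of each customer into one sparse map."""
    counts = Counter(
        (customer, business)
        for customer, business in transactions
        if business in top_businesses
    )
    wallets = {}
    for (customer, business), count in counts.items():
        wallets.setdefault(customer, {})[business] = count
    return wallets


def cosine_distance(x, y):
    """1 - cosine similarity of two sparse maps; missing keys count as 0."""
    dot = sum(value * y.get(key, 0.0) for key, value in x.items())
    norm_x = sqrt(sum(v * v for v in x.values()))
    norm_y = sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # an empty histogram is treated as maximally distant
    return 1.0 - dot / (norm_x * norm_y)
```

Keeping the maps sparse avoids materialising the zero entries: only keys present in the first map contribute to the dot product.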
  23. K-Neighbours Recommender
      •  Take the previously computed customer features and build a VPTree
      •  For each customer find the approximate K nearest neighbours (similarity = 1 – distance) and assign a score to each business in the neighbour's wallet, proportional to the relative similarity score
      •  Since the same business may appear multiple times, sum all the scores and take the top-ranked N
  24. Vantage-point (VP) Tree
      •  It’s a heuristic data structure for fast spatial search
      •  Each node of the tree contains one data point + a radius
         –  The left child branch contains points that are closer than the radius, the right branch those farther away
      •  Construction time: O(n log(n))
      •  Search time*: O(log(n))
      *Under certain circumstances
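A compact vantage-point tree with a k-nearest-neighbour search can be written in plain Python as follows. This is a generic textbook-style sketch, not the team's Scala implementation: the class layout, the random choice of vantage point and the median radius are assumptions. The pruning relies on the triangle inequality, so results are exact only when `distance` is a true metric (Euclidean distance is used in the usage note).

```python
import heapq
import random


class VPTree:
    def __init__(self, points, distance, seed=0):
        self.distance = distance
        self._rng = random.Random(seed)
        self.root = self._build(list(points))

    def _build(self, points):
        if not points:
            return None
        vantage = points.pop(self._rng.randrange(len(points)))
        if not points:
            return (vantage, 0.0, None, None)
        dists = [self.distance(vantage, p) for p in points]
        radius = sorted(dists)[len(dists) // 2]  # median distance
        inner = [p for p, d in zip(points, dists) if d < radius]
        outer = [p for p, d in zip(points, dists) if d >= radius]
        return (vantage, radius, self._build(inner), self._build(outer))

    def nearest(self, query, k):
        """Return the k closest points as (distance, point), nearest first."""
        if k <= 0:
            return []
        heap = []   # max-heap on distance, stored as (-distance, tie, point)
        tie = [0]   # unique counter so that points are never compared

        def worst():
            return -heap[0][0] if len(heap) == k else float("inf")

        def visit(node):
            if node is None:
                return
            vantage, radius, inner, outer = node
            d = self.distance(query, vantage)
            tie[0] += 1
            if len(heap) < k:
                heapq.heappush(heap, (-d, tie[0], vantage))
            elif d < worst():
                heapq.heapreplace(heap, (-d, tie[0], vantage))
            if d < radius:
                visit(inner)
                if d + worst() >= radius:  # outer side may still hold a closer point
                    visit(outer)
            else:
                visit(outer)
                if d - worst() < radius:   # inner side may still hold a closer point
                    visit(inner)

        visit(self.root)
        return sorted(((-nd, p) for nd, _, p in heap), key=lambda t: t[0])


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Usage: `VPTree(points, euclidean).nearest((0.5, 0.5), 5)` returns the five points closest to the query, each paired with its distance.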
  25. BUSINESS-TO-BUSINESS SIMILARITY MODELS
      Similarity metric based on the portion of common customers:
      •  Conditional probability
      •  Tanimoto coefficient
  26. Common customers matrix
                          Sum
       -    3   10   12    25
       3    -    8    0    11
      10    8    -    1    19
      12    0    1    -    13
      Sum
      25   11   19   13     -
      Each cell represents the number of distinct common customers
      Business similarities:
      •  Conditional probability
      •  Tanimoto coefficient
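The slides name the two similarity measures without spelling out their formulas. The plain-Python sketch below assumes the usual set-based definitions: the conditional probability P(b | a) as the share of a's customers who also visited b, and the Tanimoto coefficient as intersection over union of the two customer sets. The function names are hypothetical.

```python
from itertools import combinations


def customers_by_business(transactions):
    """Map each business to the set of distinct customers who visited it."""
    visitors = {}
    for customer, business in transactions:
        visitors.setdefault(business, set()).add(customer)
    return visitors


def common_customers(visitors):
    """Number of distinct common customers for every pair of businesses."""
    return {
        (a, b): len(visitors[a] & visitors[b])
        for a, b in combinations(sorted(visitors), 2)
    }


def conditional_probability(visitors, a, b):
    """P(b | a): fraction of a's customers who also visited b (not symmetric)."""
    return len(visitors[a] & visitors[b]) / len(visitors[a])


def tanimoto(visitors, a, b):
    """Common customers divided by customers of either business (symmetric)."""
    union = len(visitors[a] | visitors[b])
    return len(visitors[a] & visitors[b]) / union if union else 0.0
```

The conditional probability favours large chains (almost everyone also shops there), while the Tanimoto coefficient penalises a large difference in customer base size.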
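  27. [Figure: graph linking the visited businesses (layer B1: a, b) to the visited businesses’ neighbours (layer B2: c, d, e)]
      •  Visited businesses: P(B1a) = 0.7, P(B1b) = 0.3
      •  Edge weights from a: c = 0.1, d = 0.5, e = 0.2; from b: c = 0.3, d = 0.1, e = 0.2
      •  Edges pointing to already visited businesses (0.2 and 0.4 in the figure) are set to 0
      •  Weights sum excluding visited: 0.8 (from a), 0.6 (from b)
      •  “Probability” score: P(c) = P(B2c | B1a) * P(B1a) + P(B2c | B1b) * P(B1b)
         –  c: (0.1/0.8)*0.7 + (0.3/0.6)*0.3 = 0.2375
         –  d: (0.5/0.8)*0.7 + (0.1/0.6)*0.3 = 0.4875
         –  e: (0.2/0.8)*0.7 + (0.2/0.6)*0.3 = 0.275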
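The “probability” score of this slide can be reproduced with a short plain-Python function. The visited weights (a = 0.7, b = 0.3) and the edge weights to c, d and e are read off the slide's three calculations; which of the two zeroed edges (0.2 and 0.4) belongs to a and which to b is an assumption, and it does not affect the result because edges to visited businesses are dropped before normalising.

```python
def neighbour_scores(visited, neighbours):
    """visited: {business: weight in the customer's wallet}
    neighbours: {business: {neighbour business: similarity}}

    Each visited business spreads its weight over its not-yet-visited
    neighbours, proportionally to the similarities renormalised without
    the visited ones.
    """
    scores = {}
    for business, weight in visited.items():
        candidates = {
            other: similarity
            for other, similarity in neighbours.get(business, {}).items()
            if other not in visited
        }
        total = sum(candidates.values())
        if total == 0:
            continue
        for other, similarity in candidates.items():
            scores[other] = scores.get(other, 0.0) + (similarity / total) * weight
    return scores


visited = {"a": 0.7, "b": 0.3}
neighbours = {
    "a": {"b": 0.2, "c": 0.1, "d": 0.5, "e": 0.2},
    "b": {"a": 0.4, "c": 0.3, "d": 0.1, "e": 0.2},
}
scores = neighbour_scores(visited, neighbours)
```

Here `scores` holds c ≈ 0.2375, d ≈ 0.4875 and e ≈ 0.275, which sum to 1 because the visited weights do.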
  28. NEIGHBOUR-TO-BUSINESS
      Hybrid approach of K-Neighbours combined with Business-to-Business
      3 levels: customer neighbours -> neighbours’ businesses -> businesses’ neighbours
      We named this model the Botticelli model
  29. Customer’s neighbours -> direct businesses + neighbours’ businesses -> businesses’ neighbours
  30. We know the visited business frequency from our own wallet, and we fill the others with our neighbours’ normalized frequency
  31. MATRIX FACTORIZATION MODELS
      Factorize the transaction matrix of Customer-to-Business into 2 matrices of Customer-to-Topic and Topic-to-Business (e.g. LSA, SVD…)
      Recommendations are done by applying linear algebra
  32. Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
      http://www.slideshare.net/vitomirkovanovic/topic-modeling-for-learning-analytics-researchers-lak15-tutorial
  33. ALS is available in Spark MLlib
      •  Ratings as counts of transactions
      •  Model parameters are the factorized matrices. We had to re-implement the scoring function due to scalability issues
  34. Recommendation scores produced by multiplying vectors
  35. Top N without sorting
      Accumulator is at most N elements
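Scoring by multiplying factor vectors and keeping a bounded accumulator can be sketched in plain Python with a min-heap that never holds more than N entries, so the full list of scores is never sorted. The function name and the dictionary layout of the factors are assumptions; the original ran on Spark with the ALS factor matrices.

```python
import heapq


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def top_n(customer_factors, business_factors, n):
    """Score every business as the dot product of the two factor vectors and
    keep only the n best in a min-heap of at most n elements."""
    if n <= 0:
        return []
    heap = []  # (score, business); heap[0] is the weakest of the current best
    for business, factors in business_factors.items():
        item = (dot(customer_factors, factors), business)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    # Only the n survivors are ordered at the end, best first.
    return [business for _, business in sorted(heap, reverse=True)]
```

With B businesses this takes O(B log N) time and O(N) memory per customer, instead of O(B log B) for sorting all the scores.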
  36. OTHER APPROACHES
      •  Covariance Matrix: build a covariance matrix of each pair of users and then multiply it with the user-to-business matrix
      •  Random Forest: one binary classifier for each business
      •  Ensembling models: aggregating recommendations from different models
  37. SUMMARY AND CONCLUSIONS
  38. Models comparison
      Model                                        MAP@20
      Neighbour-to-Businesses                      16%
      Business-to-Business (Tanimoto)              12%
      ALS                                          11%
      Covariance matrix                            10%
      Business-to-Business (conditional prob)      9%
      K-Neighbours                                 8%
      Most popular                                 3%
      Remember: for every national retail chain where you have a lot of customers, you have a lot of local niche businesses where only a small portion of the customer base ever shops -> Very hard to predict those!
      Simple solutions made of counts and divisions may outperform more advanced ones
  39. Limitations
      •  ML and MLlib are not flexible enough and need some extra development (bloody private fields)
      •  Linear algebra libraries in MLlib are limited; it took us a while to learn how to optimize them
      •  Scala and Spark create confusion about the behaviour of some methods (e.g. fold, collect, mapValues, groupBy)
      •  Many machine learning libraries are based on vectors and don’t easily allow ad-hoc definition of data types based on the business context
  40. Conclusions
      •  Spark and Scala were excellent tools for rapid prototyping during the week, especially for bespoke algorithms.
      •  We used the same production stack together with notebooks for ad-hoc explorations or quick and dirty tests.
      •  At the end of the hackathon the best model is almost a production-ready MVP
  41. •  Automated single-button execution
      •  Built a real-world recommender
      •  Common evaluation APIs
      •  Data validation manually done as a preparation step
      •  Only MAP considered
      •  Notebook analysis immediately followed by knowledge conversion into code requirements
      •  Our MVP was simplistic and did not consider a few edge cases
  42. Off-site
      •  Success of the hackathon was not solely down to technology.
      •  Innovation requires an environment where:
         –  great people can connect
         –  set clear ambitious goals
         –  work together free of distractions
         –  pressure of delivering comes from the group
         –  Fail safely, go to sleep, wake up next day (go surfing) and try again!
  43. Further Reading
      •  Original article on Cloudera Engineering Blog: https://blog.cloudera.com/blog/2016/05/the-barclays-data-science-hackathon-using-apache-spark-and-scala-for-rapid-prototyping/
      •  GitHub code: https://github.com/gm-spacagna/lanzarote-awesomeness
      •  Data Science Vademecum (a lot of references regarding Agile and Spark): http://datasciencevademecum.wordpress.com
      •  The Professional Data Science Manifesto: http://www.datasciencemanifesto.org/
      The Barclays Data Science team at this hackathon was: Panos Malliakas, Victor Paraschiv, Harry Powell, Charis Sfyrakis, Gianmario Spacagna and Raffael Strassnig