Case Study

Value extraction from BBVA credit card transactions
104,000 employees
47 million customers
The idea

Extract value from anonymized credit card transactions data & share it

Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible
Helping

Consumers — informed decisions
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of a shop's customers

Sellers — learning client patterns
✓ Activity & fidelity of a shop's customers
✓ Sex & age & location
✓ Buying patterns
Shop stats

For different periods
✓ All, year, quarter, month, week, day

… and much more
The applications

Internal use
Sellers
Customers
The challenges

Company silos
The costs
The amount of data
Security
Development flexibility/agility
Human failures
The platform

Data storage: S3
Data processing: Elastic MapReduce
Data serving: EC2
The architecture

Hadoop

Distributed filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability
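The batch model above can be sketched in a few lines of plain Python (an illustrative stand-alone sketch, not Hadoop code): input records are mapped to key/value pairs, shuffled by key, and reduced into output records.

```python
from collections import defaultdict

# Minimal MapReduce sketch: map() emits (key, value) pairs, the shuffle
# groups values by key, and reduce() converts each group into one output
# record -- the batch-oriented input-files-to-output-files model.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

input_lines = ["credit card", "card payment"]

shuffle = defaultdict(list)
for line in input_lines:                 # map phase
    for key, value in map_fn(line):
        shuffle[key].append(value)

output = dict(reduce_fn(k, vs) for k, vs in shuffle.items())  # reduce phase
```

On a real cluster the map and reduce phases run distributed over HDFS files; the shuffle here stands in for Hadoop's sort-and-group step.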
  
 	
  
Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance-based configuration
✓ First-class multiple input/output

Tuple MapReduce implementation for Hadoop
Tuple MapReduce

Our evolution of Google's MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:

Tuple MapReduce: Beyond classic MapReduce.

In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining

Brussels, Belgium | December 10–13, 2012
Tuple MapReduce

Example: sales difference between the top-selling offices for each location
Tuple MapReduce

Main constraint
✓ The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
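The constraint can be illustrated with a small Python sketch (the tuple fields `shop`, `card`, `amount` are hypothetical, not the actual Pangool schema): when the group-by clause is a subset of the sort-by clause, each group's tuples arrive at the reducer already ordered by the remaining sort fields.

```python
from itertools import groupby

# Tuples: (shop, card, amount). Sort by (shop, card); group by shop only.
# Because the group-by clause (shop,) is a subset of the sort-by clause
# (shop, card), each group's tuples are already ordered by card -- the
# property Tuple MapReduce relies on.
records = [
    ("shop1", "5678", 10.0),
    ("shop2", "1111", 7.5),
    ("shop1", "1234", 20.0),
    ("shop1", "1234", 5.0),
]

sorted_records = sorted(records, key=lambda t: (t[0], t[1]))  # sort-by clause
groups = {
    shop: [t[1:] for t in group]
    for shop, group in groupby(sorted_records, key=lambda t: t[0])  # group-by
}
```

In Hadoop terms, this is exactly secondary sorting: the sort comparator uses the full sort-by clause, while the grouping comparator uses only its group-by prefix.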
  
Efficiency

Similar efficiency to Hadoop

http://pangool.net/benchmark.html
Voldemort

Distributed key/value store
Voldemort & Hadoop

Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Provides agility/flexibility
    ◦ Big development changes are not a pain
  • Easier recovery from human errors
    ◦ Fix the code and run again
  • Easy to set up new clusters with different topologies
Basic statistics

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics

Count · Average · Min · Max · Stdev

Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period
✓ Add a period identifier field to the tuple and include it in the group-by clause
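A minimal Python sketch of the replication trick (the period identifiers and field names are illustrative, not the actual job's schema): the map phase emits one copy of each transaction per period, and the reduce phase aggregates per (shop, period) group.

```python
from collections import defaultdict
from datetime import date

def map_periods(txn):
    """Map phase: replicate one transaction per aggregation period,
    tagging each copy with a period identifier (illustrative labels)."""
    shop, day, amount = txn
    yield (shop, "all"), amount
    yield (shop, f"year-{day.year}"), amount
    yield (shop, f"month-{day.year}-{day.month:02d}"), amount

transactions = [
    ("shop1", date(2012, 3, 1), 10.0),
    ("shop1", date(2012, 4, 2), 30.0),
]

# Reduce phase: basic statistics per (shop, period) group -- one job covers
# every period at once because the period id is part of the group-by key.
stats = defaultdict(lambda: {"count": 0, "sum": 0.0})
for txn in transactions:
    for key, amount in map_periods(txn):
        stats[key]["count"] += 1
        stats[key]["sum"] += amount

avg_all = stats[("shop1", "all")]["sum"] / stats[("shop1", "all")]["count"]
```

The same pattern extends to min/max/stdev by carrying more accumulator fields in the reducer.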
  	
  
Distinct count

Possible to compute in a single job
✓ Using secondary sorting on the field you want to distinct-count
✓ Detecting changes on that field

Example
✓ Group by shop, sort by shop and card

  Shop     Card
  Shop 1   1234   change -> +1
  Shop 1   1234
  Shop 1   1234
  Shop 1   5678   change -> +1
  Shop 1   5678
                  = 2 distinct buyers for shop 1
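The change-detection idea can be sketched in Python (the rows mirror the example table; in the real job this logic runs inside the reducer over secondarily sorted tuples):

```python
from itertools import groupby

# Rows already sorted by (shop, card) -- the secondary sort. A distinct
# buyer is counted each time the card value changes within a shop group.
rows = [
    ("Shop 1", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
    ("Shop 1", "5678"),
]

distinct = {}
for shop, group in groupby(rows, key=lambda r: r[0]):
    count, previous_card = 0, None
    for _, card in group:
        if card != previous_card:   # change detected -> +1
            count += 1
            previous_card = card
    distinct[shop] = count
```

Because the sort guarantees equal cards are adjacent, no per-shop set of cards needs to be held in memory.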
  
Histograms

Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin

Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
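The slides do not spell out how the bins adapt; one common one-pass scheme (in the spirit of streaming-histogram algorithms such as Ben-Haim & Tom-Tov's, offered here only as a plausible sketch) keeps a fixed budget of (centroid, count) bins and merges the two closest bins whenever the budget is exceeded:

```python
# One-pass histogram with a fixed number of adaptive bins (illustrative
# scheme, not necessarily the one used in the project).
MAX_BINS = 4

def add(bins, value):
    """Insert a value; if over budget, merge the two closest bins."""
    bins.append((value, 1))
    bins.sort()
    while len(bins) > MAX_BINS:
        # adjacent pair with the smallest centroid gap
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)  # weighted mean
        bins[i:i + 2] = [merged]
    return bins

bins = []
for v in [1, 2, 2, 10, 11, 20, 21, 100]:
    add(bins, v)
```

The bin count stays fixed while bin positions follow the data, which is what makes a single pass sufficient.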
  	
  
Optimal histogram

Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width bins -> better visualization
Optimal histogram

Exact algorithm
Petri Kontkanen, Petri Myllymäki

MDL Histogram Density Estimation

http://eprints.pascal-network.org/archive/00002983/

Too slow for production use
Optimal histogram

Alternative: approximate algorithm
Random-restart hill climbing
✓ A solution is just a way of grouping the existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next better possible movement
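The steps above can be sketched in Python (the error function, data, and parameters are illustrative choices, not the production ones): a solution is a set of boundary positions grouping the fine-grained bins into flexible-width bins; close solutions move one boundary by one position; the representation error approximates each fine bin by its group's mean count.

```python
import random

random.seed(0)

counts = [5, 6, 5, 40, 42, 41, 3, 2]  # fine-grained histogram counts
K = 3                                  # flexible-width bins allowed

def error(boundaries):
    """Representation error: approximate each fine bin by its group mean."""
    err, start = 0.0, 0
    for end in list(boundaries) + [len(counts)]:
        group = counts[start:end]
        mean = sum(group) / len(group)
        err += sum((c - mean) ** 2 for c in group)
        start = end
    return err

def neighbors(boundaries):
    """Close solutions: move one boundary left or right by one position."""
    for i, b in enumerate(boundaries):
        for nb in (b - 1, b + 1):
            cand = sorted(boundaries[:i] + [nb] + boundaries[i + 1:])
            if all(0 < x < len(counts) for x in cand) and len(set(cand)) == K - 1:
                yield cand

best = None
for _ in range(10):                       # random restarts, keep the best
    sol = sorted(random.sample(range(1, len(counts)), K - 1))
    while True:                           # hill climb until no improvement
        cand = min(neighbors(sol), key=error, default=None)
        if cand is None or error(cand) >= error(sol):
            break
        sol = cand
    if best is None or error(sol) < error(best):
        best = sol
```

On this toy data the three plateaus (≈5, ≈41, ≈2) make the grouping with boundaries at positions 3 and 6 clearly best, and the restarts make getting stuck in a local optimum unlikely.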
  
Optimal histogram

Alternative: approximate algorithm
Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy
Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations

Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are the recommendations

Improvements
✓ The most popular shops are filtered out, because almost everybody buys in them
✓ Recommendations by category, by location, and by both
✓ Different calculation periods
Shop recommendations

Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
  ✓ Only use the top M shops where the client bought the most

Future
✓ Time-aware co-occurrences: the client bought in A and B within a short period of time
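The co-occurrence counting and the top-M cap can be sketched in Python (shop and card names are illustrative, and the real system runs this as several Pangool jobs rather than in memory):

```python
from collections import Counter, defaultdict
from itertools import permutations

# Card -> set of shops: one co-occurrence per shop pair, no matter how many
# times the buyer purchased in each shop.
purchases = {
    "card1": {"A", "B", "C"},
    "card2": {"A", "B"},
    "card3": {"A", "C"},
}

M = 50  # cap on distinct shops per buyer, bounding the N*(N-1) pair blow-up
cooccurrences = Counter()
for shops in purchases.values():
    top_shops = sorted(shops)[:M]  # stand-in for "top M shops by spend"
    cooccurrences.update(permutations(top_shops, 2))

# Top co-occurring shops per shop are the recommendations.
recommendations = defaultdict(list)
for (a, b), _ in cooccurrences.most_common():
    recommendations[a].append(b)
```

With the cap M in place, each buyer contributes at most M*(M-1) pairs regardless of how many shops they visited.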
  
Some numbers

Estimated resources needed for 1 year of data

270 GB of stats to serve
24 large instances ~ 11 hours of execution
$3,500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
Conclusion

It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile: improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always recomputing everything
Future: Splout

Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
  ✓ Not always possible -> data explosion
  ✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
✓ The idea: replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would make it possible to build a kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!!!
  https://github.com/datasalt/splout-db
