The Artful Business of Data Mining
Distributed Schema-less Document-Based Databases




Wednesday 27 March 13
David Coallier
@davidcoallier

Data Scientist at Engine Yard (.com)




RDBMS

Structure
Restrictions
Safety

id   name    age   address
1    david     1       315
2    divad     3        51
3    foo      41        31
4    bar      42        98
5    john   3315        85
6    jack      4        11
7    jill      8        66
...  ...     ...       ...




What If?


id   name    age   address   phone
1    david    26   IE          353
2    divad    27   US            1
3    foo      42   IE          353
4    bar      31   CA            1
5    john     17   NZ          131
6    jack    128   DK          311
7    jill     21   IE          353
...  ...     ...   ...         ...




Before moving on

JSON

What is JSON?


{
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}




What is HTTP?


What is a Schema?


Alternative

Schema-less


Does NOT mean structure-less

Documents and K-V buckets

CouchDB
Cluster of unreliable commodity hardware




Replication Attachments
Generated “random” ids
Dictionary Revisions?
JSON Objects
HTTP CRUD


Documents

{
    "_id": "131dafsd1vasd",
    "_rev": "12-fva32asdf",
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}




How do you find anything?

Map/Reduce

...

Riak

Dynamo paper

CAP theorem

Key-value buckets

Differences?

                 CouchDB                    Riak
Storage Model    append-only                Bitcask
Access           HTTP                       HTTP, PB
Retrieval        Views (M/R)                M/R, Indexes, Search
Versioning       Eventual Consistency       Vector Clocks
Concurrency      No Locking                 Client Resolution
Replication      master/master/slave replication, clustering
Scaling In/Out   BigCouch                   Built-in
Management       Futon/Fauxton              Riak Control
Source           http://guide.couchdb.org   http://downloads.basho.com/papers/bitcask-intro.pdf



Map/Reduce

Mapper:
Executed on document

Reducer:
Receives output from mappers


{ "_id": "...", "_rev": "...", "age": "26" }
{ "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
{ "_id": "...", "_rev": "...", "age": "42" }
{ "_id": "...", "_rev": "...", "age": "17" }




{
  "age": "32",
  "heads": "3"
}

Map: find-ages
function find_ages(doc) {
  // typeof yields a string, so compare against "undefined".
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}




Map: find-ages
26   32   42   17

Reduce: sum

Reduce: sum

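function sum_ages(keys, values, rereduce) {
  // Renamed so it no longer calls itself; sum() is the helper
  // CouchDB makes available inside reduce functions.
  return sum(values);
}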


Map: find-ages
26   32   42   17

Reduce: sum
117

Mapper:
Executed on document

Reducer:
Receives output from mappers
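
To make the two roles concrete, the find-ages and sum steps can be run outside the database. This is only a sketch in plain JavaScript: `runView` and the explicit `emit` argument are hypothetical stand-ins for what the view engine does internally, and the ages are coerced to numbers because the sample documents store them as strings.

```javascript
// Sample documents from the slides (ids and revisions omitted).
var docs = [
  { age: "26" },
  { age: "32", heads: "3" },
  { age: "42" },
  { age: "17" }
];

// Hypothetical stand-in for the view engine: runs mapFn over every
// document, collects the emitted rows, then hands them to reduceFn.
function runView(docs, mapFn, reduceFn) {
  var rows = [];
  var emit = function (key, value) {
    rows.push({ key: key, value: value });
  };
  docs.forEach(function (doc) { mapFn(doc, emit); });
  var keys = rows.map(function (r) { return r.key; });
  var values = rows.map(function (r) { return r.value; });
  return reduceFn(keys, values, false);
}

// Mapper: executed on each document, emits the age if there is one.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}

// Reducer: receives the mapper output and adds it up.
function sumAges(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

var total = runView(docs, findAges, sumAges);
console.log(total); // 117
```

Inside CouchDB, `emit` is a global and the engine decides when to call the reducer; passing both explicitly here just keeps the example self-contained.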


So what?

The Machines They Lurn.

The problem

Statistics example

Mean, Std. Deviation
Age
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} \]




Mapper:
  Retrieve values, pre-process

Reducer:
 Receive, process further.


[
  [26, 676],
  [32, 1024],
  [42, 1764],
  [17, 289]
]
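
The pairs above hold each age next to its square. The slides do not spell out why, but a plausible reading is the following identity, which lets the standard deviation be computed from two running sums (of the values and of their squares) rather than from a second pass over the data. The symbols are the same \mu, \sigma, x_i and n used in the formulas for the mean and standard deviation.

```latex
\[
\sigma^2
  = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2
    - 2\mu \cdot \frac{1}{n} \sum_{i=1}^{n} x_i
    + \mu^2
  = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \mu^2
\]

% Worked with the four sample ages 26, 32, 42, 17:
\[
\mu = \frac{117}{4} = 29.25, \qquad
\frac{1}{n} \sum_{i=1}^{n} x_i^2 = \frac{3753}{4} = 938.25, \qquad
\sigma^2 = 938.25 - 29.25^2 = 82.6875, \qquad
\sigma \approx 9.09
\]
```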
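/**
 * Our mapper function.
 * The sample ages are stored as strings, so coerce them to numbers.
 */
map: function (doc) {
  var age = Number(doc.age);
  emit(null, [age, age * age]);
},

/**
 * Our reducer. It assumes a single reduce pass: the returned
 * [mean, standard_deviation] pair cannot be combined again,
 * so the rereduce case is not handled.
 */
reduce: function (keys, values, rereduce) {
  var N = 0;
  var summed = 0;
  var summedSquare = 0;

  for (var i = 0; i < values.length; i++) {
    N += 1;
    summed += values[i][0];
    summedSquare += values[i][1];
  }

  var mean = summed / N;
  var standard_deviation = Math.sqrt(
    (summedSquare / N) - (mean * mean)
  );

  return [mean, standard_deviation];
}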




/**
 * Our mapper function.
 * The sample ages are stored as strings, so coerce them to numbers.
 */
map: function (doc) {
  var age = Number(doc.age);
  emit(null, [age, age * age]);
},

/**
 * Our reducer, written with CouchDB's sum() helper. It assumes a
 * single reduce pass and does not handle the rereduce case.
 */
reduce: function (keys, values, rereduce) {
  var N = values.length;
  var summed = sum(values.map(function (v) { return v[0]; }));
  var summedSquares = sum(values.map(function (v) { return v[1]; }));

  var mean = summed / N;
  var standard_deviation = Math.sqrt(
    (summedSquares / N) - (mean * mean)
  );

  return [mean, standard_deviation];
}
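
One caveat with a reducer that returns [mean, standard_deviation]: CouchDB may call a reduce function again on its own earlier results (with `rereduce` set to true), and two such pairs cannot be merged correctly. A possible workaround, sketched here on the assumption that the client performs the final division, is to keep the partial result as [count, sum, sum of squares]. `reduceStats` and `finalize` are illustrative names, not CouchDB functions.

```javascript
// Rereduce-safe variant: partial results stay as [count, sum, sumOfSquares].
function reduceStats(keys, values, rereduce) {
  var n = 0;
  var summed = 0;
  var summedSquares = 0;
  values.forEach(function (v) {
    if (rereduce) {
      // v is a partial result from an earlier reduce call.
      n += v[0];
      summed += v[1];
      summedSquares += v[2];
    } else {
      // v is a mapper row: [age, age * age].
      n += 1;
      summed += v[0];
      summedSquares += v[1];
    }
  });
  return [n, summed, summedSquares];
}

// Final step, assumed to run on the client after querying the view.
function finalize(stats) {
  var mean = stats[1] / stats[0];
  var standardDeviation = Math.sqrt(stats[2] / stats[0] - mean * mean);
  return [mean, standardDeviation];
}

// Simulate two reduce calls on halves of the data, then one rereduce.
var rows = [[26, 676], [32, 1024], [42, 1764], [17, 289]];
var partialA = reduceStats(null, rows.slice(0, 2), false);
var partialB = reduceStats(null, rows.slice(2), false);
var combined = reduceStats(null, [partialA, partialB], true);
var result = finalize(combined);
console.log(result[0]); // 29.25
```

Because counts and sums simply add, the outcome does not depend on how the engine splits the rows between reduce calls.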


Naive Bayes

Real-life fraud

P(x_j = k | y = fraudulent)
P(x_j = k | y = normal)
P(y)

We need to:
Sum x_j = k, for each y, to calculate P(x | y)



We need:
   More than 1 mapper.




We need 4 mappers

Mapper #1:
∑1i P(x_j = k | y = fraudulent)




Mapper #2:
∑1i P(x_j = k | y = normal)




Mapper #3:
   ∑1i P(y = fraudulent)

Mapper #4:
   ∑1i P(y = normal)
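
The four counting mappers, together with a reducer that sums what they emit, might be sketched as follows. This is an illustration in plain JavaScript under several assumptions: the training rows, the feature names (`country`, `card`), the `|`-joined keys and the helper names `featureMapper`, `labelMapper` and `countByKey` are all made up for the example, and the probabilities are plain unsmoothed ratios of counts.

```javascript
// Hypothetical training rows: a discrete feature vector x and a label y.
var trainingRows = [
  { x: { country: "IE", card: "debit" }, y: "fraudulent" },
  { x: { country: "IE", card: "credit" }, y: "normal" },
  { x: { country: "US", card: "credit" }, y: "normal" },
  { x: { country: "IE", card: "credit" }, y: "fraudulent" }
];

// Mappers #1 and #2: emit 1 for every (feature j, value k) seen with a label.
function featureMapper(label) {
  return function (row, emit) {
    if (row.y !== label) { return; }
    Object.keys(row.x).forEach(function (j) {
      emit([label, j, row.x[j]].join("|"), 1);
    });
  };
}

// Mappers #3 and #4: emit 1 for every row that carries the label.
function labelMapper(label) {
  return function (row, emit) {
    if (row.y === label) { emit(label, 1); }
  };
}

// Reducer: sums the emitted ones per key.
function countByKey(rows, mappers) {
  var counts = {};
  var emit = function (key, value) {
    counts[key] = (counts[key] || 0) + value;
  };
  rows.forEach(function (row) {
    mappers.forEach(function (mapper) { mapper(row, emit); });
  });
  return counts;
}

var counts = countByKey(trainingRows, [
  featureMapper("fraudulent"), featureMapper("normal"),
  labelMapper("fraudulent"), labelMapper("normal")
]);

// P(country = IE | y = fraudulent) and P(y = fraudulent) from the counts.
var pIeGivenFraud = counts["fraudulent|country|IE"] / counts["fraudulent"];
var pFraud = counts["fraudulent"] / trainingRows.length;
console.log(pIeGivenFraud, pFraud); // 1 0.5
```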


Reducer
Sums up results for parameters

Cluster analysis

k-means

Mapper:
Divide vectors into subgroups, calculate d(p, q) between vectors, find centroids, sum them up.

Reducer:
Sum up the sums, get new centroids.
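
One iteration of that split could look like the sketch below. It is written in plain JavaScript rather than against any database API, assumes Euclidean distance for d(p, q), and uses made-up names (`distance`, `mapSubgroup`, `reduceCentroids`) and sample vectors purely for illustration.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  var total = 0;
  for (var i = 0; i < p.length; i++) {
    total += (p[i] - q[i]) * (p[i] - q[i]);
  }
  return Math.sqrt(total);
}

// Mapper: for one subgroup, assign each vector to its nearest centroid
// and return per-centroid partial sums and counts.
function mapSubgroup(vectors, centroids) {
  var partial = centroids.map(function (c) {
    return { sum: c.map(function () { return 0; }), count: 0 };
  });
  vectors.forEach(function (v) {
    var best = 0;
    for (var c = 1; c < centroids.length; c++) {
      if (distance(v, centroids[c]) < distance(v, centroids[best])) {
        best = c;
      }
    }
    partial[best].count += 1;
    v.forEach(function (x, i) { partial[best].sum[i] += x; });
  });
  return partial;
}

// Reducer: sum up the partial sums and divide by the counts.
function reduceCentroids(partials, centroids) {
  return centroids.map(function (old, c) {
    var sum = old.map(function () { return 0; });
    var count = 0;
    partials.forEach(function (p) {
      count += p[c].count;
      p[c].sum.forEach(function (x, i) { sum[i] += x; });
    });
    // Keep the old centroid if no vector was assigned to it.
    return count === 0 ? old : sum.map(function (x) { return x / count; });
  });
}

var centroids = [[0, 0], [10, 10]];
var subgroups = [
  [[1, 1], [9, 9]],
  [[1, 3], [11, 11]]
];
var partials = subgroups.map(function (g) {
  return mapSubgroup(g, centroids);
});
var updated = reduceCentroids(partials, centroids);
console.log(JSON.stringify(updated)); // [[1,2],[10,10]]
```

A full k-means run would repeat this map and reduce pair, feeding `updated` back in as `centroids`, until the centroids stop moving.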

 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

  • 1. The Artful Business of Data Mining Distributed Schema-less Document-Based Databases Wednesday 27 March 13
  • 2. David Coallier @davidcoallier
  • 3. Data Scientist at Engine Yard (.com)
  • 5. Structure, Restrictions, Safety
  • 6.
        id    name    age     address
        1     david   1       315
        2     divad   3       51
        3     foo     41      31
        4     bar     42      98
        5     john    3315    85
        6     jack    4       11
        7     jill    8       66
        ...   ...     ...     ...
  • 12.
        id    name    age    address    phone
        1     david   26     IE         353
        2     divad   27     US         1
        3     foo     42     IE         353
        4     bar     31     CA         1
        5     john    17     NZ         131
        6     jack    128    DK         311
        7     jill    21     IE         353
        ...   ...     ...    ...        ...
  • 13. Before Moving on
  • 15. What is JSON?
  • 16.
        {
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
  • 17. What is HTTP?
  • 18. What is a Schema?
  • 21. Does NOT Mean Structure-less
  • 22. Documents and K-V Buckets
  • 23. CouchDB: Cluster of unreliable commodity hardware
  • 24. Replication; Attachments; Generated “random” ids; Dictionary; Revisions?; JSON Objects; HTTP CRUD
  • 27.
        {
          "_id": "131dafsd1vasd",
          "_rev": "12-fva32asdf",
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
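The `_rev` field is what CouchDB uses for optimistic concurrency: an update has to name the revision it was based on, and a stale revision is rejected (over HTTP, with a 409 Conflict). The sketch below only mimics that rule with a hypothetical in-memory `TinyStore` class. It is not CouchDB's API, and its revision strings are simplified counters rather than real CouchDB revisions.

```javascript
// Hypothetical in-memory store illustrating _rev-based optimistic concurrency.
// It only mimics the rule "an update must name the current revision".
class TinyStore {
  constructor() {
    this.docs = new Map();
  }

  // Insert or update a document; returns the new revision string.
  put(doc) {
    const current = this.docs.get(doc._id);
    if (current && current._rev !== doc._rev) {
      // A real CouchDB server would answer 409 Conflict here.
      throw new Error("conflict: stale _rev for " + doc._id);
    }
    const generation = current ? parseInt(current._rev, 10) + 1 : 1;
    const stored = Object.assign({}, doc, { _rev: generation + "-sketch" });
    this.docs.set(doc._id, stored);
    return stored._rev;
  }

  // Return a copy of the stored document (an empty object if it is missing).
  get(id) {
    return Object.assign({}, this.docs.get(id));
  }
}

const store = new TinyStore();
const rev1 = store.put({ _id: "131dafsd1vasd", firstName: "David", age: 26 });
const rev2 = store.put({ _id: "131dafsd1vasd", _rev: rev1, firstName: "David", age: 27 });

let conflict = false;
try {
  // Re-using the first revision after it has been superseded fails.
  store.put({ _id: "131dafsd1vasd", _rev: rev1, age: 99 });
} catch (e) {
  conflict = true;
}
console.log(rev1, rev2, conflict);
```

The third `put` fails because it still carries the first revision, which is the situation a client has to resolve by re-reading the document and retrying.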
  • 28. How do you find Anything?
  • 32. Dynamo Paper
  • 33. CAP Theorem
  • 36.
                         CouchDB                 Riak
        Storage Model    append-only             bitcask
        Access           HTTP                    HTTP, PB
        Retrieval        Views (M/R)             M/R, Indexes, Search
        Versioning       Eventual Consistency    Vector Clocks
        Concurrency      No Locking              Client Resolution
        Replication      master/master/slave     replication, clustering
        Scaling In/Out   Big Couch               Built-in
        Management       Futon/Fauxton           Riak Control

        http://guide.couchdb.org
        http://downloads.basho.com/papers/bitcask-intro.pdf
  • 38. Mapper: Executed on document
        Reducer: Receives output from mappers
  • 39.
        { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  • 41.
        { "age": "32", "heads": "3" }
  • 42. Map: find-ages, run over each of the four documents from slide 39
  • 43. Map: find-ages
        function find_ages(doc) {
          // typeof yields a string, so compare against "undefined".
          if (typeof doc.age !== "undefined") {
            emit(doc._id, doc.age);
          }
        }
  • 45. Map: find-ages, run over the same four documents, emits:
        26 32 42 17
  • 46. Map: find-ages
        26 32 42 17
        Reduce: sum
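  • 47. Reduce: sum
        // sum() is CouchDB's built-in helper; the reducer itself must not be
        // named sum, or the call below would recurse forever.
        function(keys, values, rereduce) {
          return sum(values);
        }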
  • 48. Map: find-ages
        26 32 42 17
        Reduce: sum
        117
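The find-ages and sum steps can be tried outside a database. The sketch below is a local simulation only: `runMapReduce` is a hypothetical helper, `emit` is passed in as an argument (inside CouchDB it is a global), and the ages are converted with `Number()` on the assumption that they are stored as strings, as in the documents on slide 39. The fifth document, with no age, is made up to show the guard in the mapper.

```javascript
// Local simulation of a map/reduce view; runMapReduce is a hypothetical helper,
// not part of CouchDB.
function runMapReduce(docs, mapFn, reduceFn) {
  const rows = [];
  // CouchDB exposes emit() as a global inside map functions; here we pass it in.
  const emit = (key, value) => rows.push({ key, value });
  docs.forEach((doc) => mapFn(doc, emit));
  const keys = rows.map((r) => r.key);
  const values = rows.map((r) => r.value);
  return { values, result: reduceFn(keys, values, false) };
}

const docs = [
  { _id: "a", age: "32", heads: "3" },
  { _id: "b", age: "26" },
  { _id: "c", age: "42" },
  { _id: "d", age: "17" },
  { _id: "e", name: "no age here" },
];

// Map: emit the age of every document that has one, as a number.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, Number(doc.age));
  }
}

// Reduce: add the emitted values together.
function sumAges(keys, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

const out = runMapReduce(docs, findAges, sumAges);
console.log(out.values, out.result);
```

Without the `Number()` conversion the string ages would be concatenated rather than added, which is worth checking before trusting a sum over real documents.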
  • 49. Mapper: Executed on document
        Reducer: Receives output from mappers
  • 50. So What?
  • 51. The Machines They Lurn.
  • 52. The Problem
  • 53. Statistics Example
  • 54. Mean, Std. Deviation Age
  • 55. $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • 56. $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
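A short derivation shows why a reducer only needs the running sums of $x_i$ and $x_i^2$ to obtain the standard deviation. It uses the same symbols as the two formulas above, and the numbers are the four ages from the example documents.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\,\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \mu^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2 \\[6pt]
\text{Ages } 26, 32, 42, 17:\quad
\mu &= \frac{117}{4} = 29.25, \qquad
\sigma^2 = \frac{3753}{4} - 29.25^2 = 82.6875
\end{aligned}
```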
  • 57. Mapper: Executed on document
        Reducer: Receives output from mappers
  • 58. Mapper: Retrieve values, pre-process
        Reducer: Receive, process further.
  • 59.
        { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  • 60.
        [
          [26, 676],
          [32, 1024],
          [42, 1764],
          [17, 289]
        ]
  • 61.
        /**
         * Our mapper function.
         */
        map: function(doc) {
          // Ages are stored as strings, so convert before doing arithmetic.
          var age = Number(doc.age);
          emit(null, [age, age * age]);
        }

        /**
         * Our reducer...
         * Assumes a single reduce pass: rereduce is ignored, so partial
         * [mean, standard_deviation] results are not combined correctly.
         */
        reduce: function(keys, values, rereduce) {
          var N = 0;
          var summed = 0;
          var summedSquare = 0;
          for (var i = 0; i < values.length; i++) {
            N += 1;
            summed += values[i][0];
            summedSquare += values[i][1];
          }

          var mean = summed / N;
          var standard_deviation = Math.sqrt(
            (summedSquare / N) - (mean * mean)
          );

          return [mean, standard_deviation];
        }
  • 62.
        /**
         * Our mapper function.
         */
        map: function(doc) {
          // Ages are stored as strings, so convert before doing arithmetic.
          var age = Number(doc.age);
          emit(null, [age, age * age]);
        }

        /**
         * Our reducer...
         * Same single-pass assumption: rereduce is ignored.
         */
        reduce: function(keys, values, rereduce) {
          var N = values.length;
          // sum() is CouchDB's built-in helper for adding up an array.
          var summed = sum(values.map(function(v) { return v[0]; }));
          var summedSquares = sum(values.map(function(v) { return v[1]; }));

          var mean = summed / N;
          var standard_deviation = Math.sqrt(
            (summedSquares / N) - (mean * mean)
          );

          return [mean, standard_deviation];
        }
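The reducers on slides 61 and 62 return `[mean, standard_deviation]`, a pair that cannot be merged if CouchDB calls the reduce function again on partial results (`rereduce` set to `true`). One possible way around that, sketched here as an assumption rather than as the author's method, is to carry `[count, sum, sumOfSquares]` through the reduce and compute the statistics at the very end. The names `mapAge`, `reduceStats` and `finalize` are hypothetical, and the code runs locally rather than inside a view.

```javascript
// Map step: one [count, sum, sumOfSquares] triple per document.
function mapAge(doc) {
  const age = Number(doc.age);
  return [1, age, age * age];
}

// Reduce step: adding triples component-wise works the same way for mapped
// values and for partial results, so the rereduce flag needs no special case.
function reduceStats(keys, values, rereduce) {
  return values.reduce(
    (acc, v) => [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]],
    [0, 0, 0]
  );
}

// Final step, done by the client once the reduce is complete.
function finalize(stats) {
  const n = stats[0];
  const mean = stats[1] / n;
  const variance = stats[2] / n - mean * mean;
  return { mean, standardDeviation: Math.sqrt(variance) };
}

const ages = [{ age: "32" }, { age: "26" }, { age: "42" }, { age: "17" }];
const mapped = ages.map(mapAge);

// Simulate two partial reductions followed by a rereduce.
const partialA = reduceStats(null, mapped.slice(0, 2), false);
const partialB = reduceStats(null, mapped.slice(2), false);
const combined = reduceStats(null, [partialA, partialB], true);

const stats = finalize(combined);
console.log(combined, stats.mean, stats.standardDeviation.toFixed(4));
```

Because the triples simply add up, the result is the same however the values are split into partial reductions. The trade-off is that the view returns raw sums and the caller does the final division and square root.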
  • 63. Naive Bayes
  • 64. Real Life Fraud
  • 65. $P(x_j = k \mid y = \text{fraudulent})$, $P(x_j = k \mid y = \text{normal})$, $P(y)$
  • 66. We need to: sum $x_j = k$ for each $y$ to calculate $P(x \mid y)$
  • 67. We need: More than 1 mapper.
  • 68. We need 4 mappers
  • 69. Mapper #1: $\sum_i 1$ for $P(x_j = k \mid y = \text{fraudulent})$
  • 70. Mapper #2: $\sum_i 1$ for $P(x_j = k \mid y = \text{normal})$
  • 71. Mapper #3: $\sum_i 1$ for $P(y = \text{fraudulent})$
  • 72. Mapper #4: $\sum_i 1$ for $P(y = \text{normal})$
  • 73. Reducer: Sums up results for parameters
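The four-mapper idea can be sketched with plain counting: each mapper emits 1 for every row that matches its condition, the reducer sums those 1s, and the probabilities fall out as ratios of the counts. Everything specific in this sketch is an assumption made for illustration: the `country` feature, the `label` field, the five training rows and the function names are all made up.

```javascript
// Hypothetical training rows: one categorical feature (country) and a label.
const rows = [
  { country: "IE", label: "fraudulent" },
  { country: "IE", label: "normal" },
  { country: "US", label: "normal" },
  { country: "IE", label: "fraudulent" },
  { country: "CA", label: "normal" },
];

// Each mapper emits 1 for a row that matches its condition, otherwise 0.
function makeMappers(feature, k) {
  return {
    featureAndFraud: (r) => (r[feature] === k && r.label === "fraudulent" ? 1 : 0),
    featureAndNormal: (r) => (r[feature] === k && r.label === "normal" ? 1 : 0),
    fraud: (r) => (r.label === "fraudulent" ? 1 : 0),
    normal: (r) => (r.label === "normal" ? 1 : 0),
  };
}

// Reducer: sum the emitted 1s for one mapper.
function sumCounts(values) {
  return values.reduce((total, v) => total + v, 0);
}

// Run the four mappers, reduce each one, and turn the counts into estimates.
function estimate(rows, feature, k) {
  const mappers = makeMappers(feature, k);
  const counts = {};
  for (const name of Object.keys(mappers)) {
    counts[name] = sumCounts(rows.map(mappers[name]));
  }
  return {
    counts,
    pKGivenFraud: counts.featureAndFraud / counts.fraud,
    pKGivenNormal: counts.featureAndNormal / counts.normal,
    pFraud: counts.fraud / rows.length,
  };
}

const model = estimate(rows, "country", "IE");
console.log(model);
```

The sketch applies no smoothing, so a class with no rows leads to a division by zero. A real classifier would usually add a smoothing term to the counts before dividing.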
  • 76. Mapper: Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up.
        Reducer: Sum up the sums, get new centroids.
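The mapper and reducer on this slide read like one iteration of a centroid-based clustering step. The sketch below is one way that could look, under several assumptions: d(p,q) is taken to be Euclidean distance, the starting centroids are given, and the names `mapSubgroup` and `reduceCentroids` and the sample points are hypothetical.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce((acc, pi, i) => acc + (pi - q[i]) ** 2, 0));
}

// Index of the centroid closest to a point.
function nearest(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (distance(point, centroids[c]) < distance(point, centroids[best])) best = c;
  }
  return best;
}

// Mapper: for one subgroup of points, return per-centroid partial sums and counts.
function mapSubgroup(points, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const p of points) {
    const slot = partial[nearest(p, centroids)];
    p.forEach((v, i) => { slot.sum[i] += v; });
    slot.count += 1;
  }
  return partial;
}

// Reducer: add the partial sums together and divide to get the new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map((old, c) => {
    const sum = old.map(() => 0);
    let count = 0;
    for (const part of partials) {
      part[c].sum.forEach((v, i) => { sum[i] += v; });
      count += part[c].count;
    }
    // Keep the old centroid if no point was assigned to it.
    return count === 0 ? old : sum.map((v) => v / count);
  });
}

const subgroups = [
  [[1, 1], [2, 2]],
  [[9, 9], [10, 10], [0, 0]],
];
const start = [[0, 0], [10, 10]];
const partials = subgroups.map((g) => mapSubgroup(g, start));
const next = reduceCentroids(partials, start);
console.log(next);
```

Each subgroup can be mapped independently because only sums and counts travel to the reducer. Repeating the map and reduce steps with `next` as the new starting centroids gives further iterations.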