MAD Skills: New Analysis Practices for Big Data
Presented By: Christan Grant
cgrant@cise.ufl.edu




Thursday, August 25, 11                              1
MAD Skills: New Analysis
                               Practices for Big Data
                     •    Authors

                          •   Jeff Cohen – Greenplum

                          •   Brian Dolan – Fox Audience Network

                          •   Mark Dunlap – Evergreen Technologies

                          •   Joseph M Hellerstein – UC Berkeley

                          •   Caleb Welton – Greenplum

                     •    Presented at the Very Large Data Bases (VLDB) conference, 2009, in Lyon, France




If you are looking for a career where your services will be
                          in high demand, you should find something where you
                          provide a scarce, complementary service to something
                          that is getting ubiquitous and cheap.

                          So what’s getting ubiquitous and cheap? Data.

                          And what is complementary to data? Analysis.

                          – Prof. Hal Varian, Chief Economist at Google




• Storage is Cheap
                      • The world's largest data warehouse of
                          ~15 years ago can be stored on disks for
                          about $2000.




• Traditionally, a lot of time is spent building
                          well-structured data warehouses.

                          • Contains summaries of data
                          • Main analytics location
                          • Jealously guarded by IT

• Data analysis is common culture
                       • The culture is to collect and analyze data in
                            different business units.

                     • “In God we trust; all others must bring data”
                     • Analytics are crucial – Analysts need more
                          access and more permissions

                          • Hence M.A.D.
• Magnetic – Attract data and practitioners.
                          The Database must be painless and efficient.

                     • Agile – Analysts need to ingest and analyze
                          on the fly.

                     • Deep – Sophisticated analytics at scale.
                          Don’t force analysts to work on samples
                          (where the long tail is lost).

                     • Library of tools
Outline
                     •    Background
                     •    FOX Audience Network
                     •    Being MAD
                     •    In-Database Operations
                     •    MAD DBMS
                     •    Conclusions
                     •    Questions


Background
                     • OLAP and Data Cubes
                          • Data structures that provide descriptive
                             statistics – simple summaries (sum,
                             average, standard deviation)

                          • We want inferential/
                             inductive statistics –
                             Predictions, causality
                             analysis, distributional
                             comparison



Background
                     •    Databases and Statistical Packages
                          •   Many analysts download data to use in Excel/SAS/
                              Matlab/R or their favorite programming language?
                              FORTRAN??


                          •   Use matrix/vector operations
                          •   Most of these stat packages require data to fit in RAM
                              •   Taking samples from the full data to fit into RAM
                                  results in loss of precision
                          •   External toolkits may also lack parallelism



Background
                     • MapReduce and Parallel Programming
                      • Strengths of MR are scalable batch
                            parallelism and fault tolerance.
                          • There is work on putting ML algorithms
                            into the MapReduce framework (e.g. Mahout)
                          • MR is just a programming framework and
                            works well with MAD


Background
                     • Data Mining and Analytics in the DB
                      • Plenty of work in this area
                      • Methods are typically implemented as a
                            black box
                          • Algorithms are not as flexible as in
                            programmable packages



FOX Audience Network
                     • Served MySpace.com, IGN.com, Scout.com
                          etc.
                     • About 150 Million users
                     • Ad network, bought by Rubicon in Nov
                          2010




Fox Audience Network
                     •    42 Nodes (2 Masters, 40 Workers)
                     •    Sun X4500 Thumper
                     •    48 500GB drives
                     •    16GB RAM
                     •    ~ 5TB Daily
                     •    1 Table 1.5 Trillion Rows
                     •    Many different types of user workloads
                     •    Dynamic query ecosystem


Fox Audience Network
                     •    Use Cases
                     •    Queries to the system are vastly different
                     •    Example queries:
                          •   How many female WWF enthusiasts under the age of 30 visited
                              the Toyota community over the last four days and saw a medium
                              rectangle?
                              •   Expensive
                          •   How are these people similar to those that visited Nissan?
                              •   Open-ended, requires some statistics and the analyst to
                                  be in the loop.



Being MAD
                     • Analysts (Data Scientists) trump DBAs
                      • They crave data
                      • They can work with dirty data
                      • They like all data (not samples)
                      • Have a belt of tools


Being MAD

                     • Get data to the warehouse as soon as
                          possible
                     • Cleaning and integration should be done
                          intelligently




Being MAD
                     • Three logical layers for data stores, plus sandboxes:
                      • Reporting – Specialized static aggregates
                      • Production – Aggregates for most users
                      • Staging – Raw tables and logs
                      • Sandboxes – Playground for analysts
                     • The logical layers encourage creativity
Being MAD

                     • Data Scientists need power tools Math +
                          Stats + ML
                     • The MADLib approach is to develop a
                          hierarchy of mathematical concepts in the
                          database.




In-Database Operations
                     • User Defined Functions (UDF)
                       • SQL or PL/pgSQL
                       • Internal (C)
                       • Dynamically loaded (C, C++, Java, JavaScript,
                            Perl, Python, R, Ruby, Scheme, etc.)
                          • SELECT f1(city) FROM states
                            WHERE f2(capital)


In-Database Operations
                     •    Value abstraction levels

                     •    Scalar:    1.0

                     •    Vector:    [0.3, 0.3, 0.4]

                     •    Matrix:    [[0.3, 0.7], [0.6, 0.4]]

                     •    Function:  f( • ) → value




In-Database Operations


                          Scalar Arithmetic



In-Database Operations

                     • Scalar Arithmetic
                      • SELECT 5*4;
                      • SELECT sqrt(64);
                      • SELECT cos(-3.14159 * sqrt(2) / 2 );

In-Database Operations


                          Vector Arithmetic



In-Database Operations

                     • Vector
                      • array objects: float8[]
                        (technically violates 1st normal form)
                      • array operations are implemented as
                          UDFs.



In-Database Operations


                          Matrix Arithmetic



In-Database Operations

                • Matrix Representation #1
                          CREATE TABLE B (
                               row_number integer,
                               vector numeric[]
                          );


   http://www.postgresql.org/docs/current/static/arrays.html

In-Database Operations

                • Matrix addition:
                          SELECT A.row_number, A.vector + B.vector
                          FROM A, B
                          WHERE A.row_number = B.row_number



       This can easily be implemented as a primitive
       operator:
       SELECT A ** B FROM A, B
In-Database Operations
                • Matrix and vector multiplication: Av

                          SELECT 1, array_accum(row_number, vector*v)
                          FROM A


                          array_accum(x,v) is a custom function. It takes the
                          value v and puts it in the row indexed by x.
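
A rough Python sketch of the same row-by-row computation, for checking the logic outside the database. It assumes that `vector*v` in the query denotes a dot product between a stored row and v; the function name `matvec` and the sample data are illustrative and not part of the slides.

```python
def matvec(rows, v):
    """Compute A·v for a matrix stored as (row_number, vector) pairs.

    Row numbers are assumed to be 1-based and contiguous.
    """
    result = [0.0] * len(rows)
    for row_number, vector in rows:
        # One dot product per stored row; rows are independent of each other,
        # which is why the work can be spread across nodes.
        result[row_number - 1] = sum(a * x for a, x in zip(vector, v))
    return result


A_rows = [(1, [1, 2]), (2, [3, 4])]
product = matvec(A_rows, [5, 6])
```

Placing each result at index `row_number - 1` plays the role that array_accum plays in the query.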


In-Database Operations
                •         Matrix Representation #2
                          CREATE TABLE A (
                               row_number INTEGER,
                               column_number INTEGER,
                               value DECIMAL
                          );
                • Great sparse representation!
   http://www.postgresql.org/docs/current/static/arrays.html

In-Database Operations
                • Matrix Multiplication (using rep #2):
                          SELECT A.row_number, B.column_number, SUM(A.value * B.value)
                          FROM A, B
                          WHERE A.column_number = B.row_number
                          GROUP BY A.row_number, B.column_number
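
The join-then-aggregate shape of this query can be mirrored in a few lines of Python. This is only a sketch under the assumption that each matrix is a list of (row_number, column_number, value) triples, as in representation #2; `sparse_matmul` is a hypothetical helper name.

```python
from collections import defaultdict


def sparse_matmul(a, b):
    """Multiply two sparse matrices given as (row, column, value) triples."""
    # Index B by row number so each entry of A can find its join partners
    # (the WHERE A.column_number = B.row_number condition).
    b_by_row = defaultdict(list)
    for row, col, value in b:
        b_by_row[row].append((col, value))

    # Accumulate products per output cell (the GROUP BY ... SUM step).
    cells = defaultdict(float)
    for a_row, a_col, a_value in a:
        for b_col, b_value in b_by_row[a_col]:
            cells[(a_row, b_col)] += a_value * b_value
    return dict(cells)


A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
product = sparse_matmul(A, B)
```

Zero entries never appear in the triples, so they cost nothing in either the join or the sum.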




In-Database Operations
                • These operations require one pass over the
                          data and can be done in parallel.
                • What about Matrix Division, and Matrix
                          Inverse?
                • SQL is awkward because of its lack of loops for
                          iteration.




In-Database Operations
                     • Document similarity for fraud detection
                       • Check to see if several advertisers are
                            linking to pages with the same content.
                          • If several advertisers have outbound links to
                            similar documents, it is possible they are
                            using a stolen credit card.
                          • Especially if the outbound documents are
                            malicious.


In-Database Operations
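                • For document similarity we use tf-idf. This is a
                          score for a term in a given document.
                          For a term t and a document d:
                             $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$
                             $\text{tf}_{t,d}$ = count of term t in document d
                             $\text{idf}_t = \log\left(|\text{docs}| \,/\, |\text{docs containing } t|\right)$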
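
As a quick illustration of the definitions above, here is a small Python sketch. It assumes documents arrive as lists of tokens and uses the natural logarithm (the slide does not fix a base); `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: tf-idf score}} for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw term count within this document
        scores[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores


docs = {
    "d1": ["cheap", "pills", "cheap"],
    "d2": ["cheap", "flights"],
}
scores = tf_idf(docs)
```

A term that occurs in every document (here "cheap") gets an idf of zero, so it contributes nothing to similarity.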



In-Database Operations
                 •        Table docs contains triples (document, term, count)

                 • tf UDF in SQL takes a term t and a document d as parameters
                          -- docs already stores the per-document term count
                          SELECT sum(count)
                          FROM docs
                          WHERE term = t AND document = d



                 • idf UDF in SQL takes a term t as a parameter
                          -- number of documents / number of documents containing t
                          SELECT log(
                                  (SELECT count(DISTINCT document) FROM docs)::float8 /
                                  (SELECT count(DISTINCT document) FROM docs WHERE term = t))




In-Database Operations
                •         Each doc will have a vector of tf-idf scores;
                          we can compare documents using cosine similarity.
                          $\cos\theta = \dfrac{v \cdot u}{\|v\| \, \|u\|}$
                          A small θ (cos θ close to 1) means v and u are similar


                • The SQL is straightforward using the vector
                          operators
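
For reference, the cosine formula can be written directly in Python. This sketch assumes both vectors are dense, have the same length, and are non-zero; `cosine_similarity` is an illustrative name rather than one of the in-database operators.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors that point the same way score close to 1 regardless of their length, which is what makes the measure useful for documents of different sizes.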



In-Database Operations
                •         Ordinary Least Squares is used to fit a line to a collection of points.
                •         This helps to describe seasonal trends - it's a first look at the data.
                          Solve $y = X\beta$ where X is an n × k matrix.
                          Estimate β using $\beta^* = (X^T X)^{-1} X^T y$
                          We can parallelize the computation:
                                Distribute:    $A \leftarrow X^T X$, $b \leftarrow X^T y$ (partial sums on each node)
                                Collect:      A and b.
                                Calculate:    $A^{-1}$
                                Multiply:     $A^{-1} b$
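
The distribute/collect/calculate/multiply steps can be sketched in Python for the simplest case, a line y = intercept + slope·x (so k = 2 and each row of X is (1, x)). The partitioning into lists and the names `partial_sums` and `fit_line` are assumptions made for the illustration.

```python
def partial_sums(points):
    """One partition's contribution to A = X^T X and b = X^T y."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sxx = sum(x * x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    return n, sx, sxx, sy, sxy


def fit_line(partitions):
    """Fit y = intercept + slope * x from per-partition partial sums."""
    # Distribute + collect: each partition is summarized independently,
    # then the small summaries are added together.
    n = sx = sxx = sy = sxy = 0.0
    for part in partitions:
        pn, psx, psxx, psy, psxy = partial_sums(part)
        n, sx, sxx, sy, sxy = n + pn, sx + psx, sxx + psxx, sy + psy, sxy + psxy

    # Calculate + multiply: A = [[n, sx], [sx, sxx]], b = [sy, sxy],
    # using the closed-form inverse of a 2x2 matrix.
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
intercept, slope = fit_line(partitions)
```

Only five numbers per partition travel to the collector, no matter how many rows each partition holds.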




In-Database Operations


                          Functional Arithmetic



In-Database Operations
                     • A/B Testing (Multivariate testing)
                      • Web companies show people different
                            versions of their website to see which
                            one is the best.
                          • Ad companies compare different ad
                            campaigns and may want to find the one
                            with the higher click-through rate.



In-Database Operations
                     •    Mann-Whitney U Test
                          •   This is a popular test for comparing
                              non-parametric data.
                          •   non-parametric def= a data set that does not fit
                              a well-known distribution
                          •   Combine the values from both samples and rank them,
                              then compare the rank sums of the two samples.
                          •   $Z = \dfrac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$
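
A small Python sketch of the z-score above may help when checking numbers by hand. It assumes the standard definition U = rank sum − n(n + 1)/2, ranks values in ascending order, and does not average ranks for ties; `mann_whitney_z` is a hypothetical name.

```python
import math


def mann_whitney_z(sample1, sample2):
    """Normal-approximation z-score of the Mann-Whitney U for sample1."""
    n1, n2 = len(sample1), len(sample2)
    # Pool both samples, remembering where each value came from.
    combined = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    # Rank sum of sample1 (ranks start at 1; ties are not averaged here).
    rank_sum1 = sum(
        rank for rank, (_, label) in enumerate(combined, start=1) if label == 1
    )
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u1 - mean_u) / std_u


z_low = mann_whitney_z([1, 2, 3], [4, 5, 6])
z_high = mann_whitney_z([4, 5, 6], [1, 2, 3])
```

Swapping the two samples flips the sign of the score, so the direction of the ranking only affects which tail is reported.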


In-Database Operations
                •         Mann-Whitney U Test
                           •   Given Table T with (sample_id, value)
                          CREATE VIEW R AS
                          SELECT sample_id, avg(value) AS sample_avg,
                             sum(rown) AS rank_sum, count(*) AS sample_n,
                             -- U = rank sum - n * (n + 1) / 2
                             sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
                           FROM ( SELECT sample_id,
                             row_number() OVER (ORDER BY value DESC) AS
                             rown, value FROM T) AS ordered
                           GROUP BY sample_id



In-Database Operations
                •         Mann-Whitney U Test
                          •   With a large sample count we can get the z_score
                          SELECT r.sample_u, r.sample_avg, r.sample_n,
                              (r.sample_u - a.sum_u / 2) / sqrt(a.sum_u *
                              (a.sum_n + 1) / 12) AS z_score
                          FROM R AS r, (SELECT sum(sample_u) AS
                          sum_u, sum(sample_n) AS sum_n FROM R) AS a
                          GROUP BY r.sample_u, r.sample_avg, r.sample_n,
                          a.sum_n, a.sum_u
 This can easily be implemented as a stored procedure:
 SELECT mann_whitney(value) FROM table
In-Database Operations
                     • Resampling
                       • Sampling several subsets instead of the
                            whole population.
                          • Bootstrapping def= sampling into buckets
                            (w/ replacement), then combining the buckets
                          • Jackknifing def= leaving some data out and
                            sampling, then combining the runs



In-Database Operations
                     •    Bootstrapping
                          •   Pre-select which trials contain which rows
                          •   Suppose we have a population of N = 100 and
                              subsamples of size M = 3
                          CREATE VIEW design AS
                          SELECT a.trial_id, floor(100 * random()) AS row_id
                          FROM generate_series(1, 10000) AS a (trial_id),
                               generate_series(1, 3) AS b (subsample_id)


In-Database Operations
                     •    Bootstrapping
                     •    Calculate the sampling distribution
                          CREATE VIEW trials AS
                          SELECT d.trial_id, AVG(T.value) AS avg_value
                          FROM design d, T
                          WHERE d.row_id = T.row_id
                          GROUP BY d.trial_id



In-Database Operations
                     • Bootstrapping
                     • Final distribution parameters
                          SELECT AVG(avg_value), STDDEV(avg_value)
                          FROM trials
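
The three bootstrapping steps (design, trials, final parameters) can be sketched end to end in Python. This assumes a population of the integers 1 to 100 and the same N = 100, M = 3 and 10000 trials as the slides; the fixed seed and the name `bootstrap_means` are choices made only for the example.

```python
import random
import statistics


def bootstrap_means(values, n_trials, subsample_size, seed=0):
    """Mean of each trial's subsample, drawn with replacement."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    means = []
    for _ in range(n_trials):
        # "design": pick which rows belong to this trial.
        sample = [rng.choice(values) for _ in range(subsample_size)]
        # "trials": aggregate the chosen rows.
        means.append(statistics.fmean(sample))
    return means


population = list(range(1, 101))
means = bootstrap_means(population, n_trials=10000, subsample_size=3)
# Final distribution parameters, as in the last query.
estimate = statistics.fmean(means)
spread = statistics.stdev(means)
```

Each trial is independent of the others, which is what lets the database evaluate them in parallel with a single join and GROUP BY.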




MAD DBMS
                     • Loading/Unloading
                      • Scatter/Gather Streaming – fast!
                      • Extract-Transform-Load vs Extract-Load-
                            Transform
                          • Greenplum has MapReduce support!


MAD DBMS
                     •    Storage
                          •   Tunable table types:
                              •   external tables (e.g. files)
                              •   heap tables (frequent updates)
                              •   append-only tables (rare updates), compressible
                              •   column-stores flexibility
                          •   All tables have a distribution policy


MAD DBMS
                     •    Partitioning
                          •   Partition by range of values or columns (list)
                              •   e.g. partition by timestamp: old stuff goes to a
                                  compressed table, new stuff goes to heap
                                  storage.
                          •   Query optimizer knows the partitioning
                              scheme
                          •   This is useful for staging data


MAD DBMS
                     •    MAD Programming
                          •   Use your favorite programming language extension
                          •   Programmers must think through how the code works w/o
                              shared memory. (data-parallel)
                          •   Map and Reduce functions can be written in Python,
                              Perl, or R.
                          •   They run using the Scatter/Gather technique. (Any
                              table type)
                          •   SQL is not necessary!


MAD DBMS
                     • Future
                      • Need a better repository for the library
                            of functions. (Morpheus SIGMOD06)
                          • Automatically choose data and table
                            layouts
                          • Use online aggregation for faster answers
                            to queries.


Questions

                     • Why would one be opposed to the M.A.D.
                          paradigm?
                     • How do MADLib and Greenplum fit into
                          the NoSQL movement?
                     • Is there a market for an AppStore for
                          functions/utilities for MADLib?



Conclusion
                     • This paper describes the M.A.D. (Magnetic,
                          Agile, Deep) mindset and the M.A.D. library.
                     • It demonstrates how to perform analytics
                          inside the DB using SQL and stored
                          procedures.
                     • It also describes the Greenplum database:
                          a general data-parallel database for simple
                          and complex queries over flexible tables.


Thank You!
MADLib Installation
                              (ubuntu)
                     •    sudo apt-get install postgresql flex bison doxygen
                          libboost1.42-* git graphviz libarmadillo-dev
                          libarmadillo-doc
                     •    sudo apt-get install lapack++ blas++
                     •    git clone https://github.com/madlib/madlib.git
                     •    cd madlib
                     •    ./configure
                     •    cd build
                     •    make
  https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source



Matrix Multiplication Demo
                          -- Create tables in the new matrix format
                          CREATE TABLE A (
                             row_number INTEGER,
                             column_number INTEGER,
                             value DECIMAL
                          );
                          INSERT INTO A VALUES
                             (1, 1, 1),
                             (1, 2, 2),
                             (2, 1, 3),
                             (2, 2, 4);
                          CREATE TABLE B (
                             row_number INTEGER,
                             column_number INTEGER,
                             value DECIMAL
                          );
                          INSERT INTO B VALUES
                             (1, 1, 5),
                             (1, 2, 6),
                             (2, 1, 7),
                             (2, 2, 8);

                          -- Check to see that the values are multiplying correctly
                          SELECT A.row_number, B.column_number,
                             A.value::text || '*' || B.value::text
                          FROM A, B
                          WHERE A.column_number = B.row_number;

                          -- Add the GROUP BY to see the sum work
                          SELECT A.row_number, B.column_number, SUM(A.value * B.value)
                          FROM A, B
                          WHERE A.column_number = B.row_number
                          GROUP BY A.row_number, B.column_number;

                          DROP TABLE A;
                          DROP TABLE B;

Thursday, August 25, 11                                                                                                           57

Weitere ähnliche Inhalte

Was ist angesagt?

Marketing Strategy
Marketing StrategyMarketing Strategy
Marketing StrategyAyesha Ghazi
 
Brand elements of Coca Cola
Brand elements of Coca ColaBrand elements of Coca Cola
Brand elements of Coca ColaRidaAbbass
 
브랜드커뮤니케이션전략 카카오프렌즈 브랜드북
브랜드커뮤니케이션전략 카카오프렌즈 브랜드북브랜드커뮤니케이션전략 카카오프렌즈 브랜드북
브랜드커뮤니케이션전략 카카오프렌즈 브랜드북Songjuhyun
 
Starbuck - Part 1 7s and brand identity
Starbuck - Part 1 7s and brand identityStarbuck - Part 1 7s and brand identity
Starbuck - Part 1 7s and brand identityCharlotte L
 
매직 스트로베리 사운드 브랜드북 : 숙명여자대학교 홍보광고학과 서혜진
매직 스트로베리 사운드 브랜드북 : 숙명여자대학교 홍보광고학과 서혜진매직 스트로베리 사운드 브랜드북 : 숙명여자대학교 홍보광고학과 서혜진
매직 스트로베리 사운드 브랜드북 : 숙명여자대학교 홍보광고학과 서혜진서 혜진
 

Was ist angesagt? (6)

Marketing Strategy
Marketing StrategyMarketing Strategy
Marketing Strategy
 
Brand elements of Coca Cola
Brand elements of Coca ColaBrand elements of Coca Cola
Brand elements of Coca Cola
 
브랜드커뮤니케이션전략 카카오프렌즈 브랜드북
MAD Skills: New Analysis Practices for Big Data

  • 1. MAD Skills: New Analysis Practices for Big Data Presented By: Christan Grant cgrant@cise.ufl.edu Thursday, August 25, 11 1
  • 2. MAD Skills: New Analysis Practices for Big Data • Authors • Jeff Cohen – Greenplum • Brian Dolan – Fox Audience Network • Mark Dunlap – Evergreen Technologies • Joseph M. Hellerstein – UC Berkeley • Caleb Welton – Greenplum • Presented at the Very Large Data Bases (VLDB) Conference 2009 in Lyon, France
  • 3. If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. – Prof. Hal Varian, Chief Economist at Google
  • 4. • Storage is cheap • The world's largest data warehouse of ~15 years ago can be stored on disks for about $2000.
  • 5. • Traditionally, a lot of time is spent building well-structured data warehouses. • Contains summaries of data • Main analytics location • Jealously guarded by IT
  • 8. • Data analysis is common culture • The culture is to collect and analyze data in different business units. • “In God we trust; all others must bring data” • Analytics are crucial – analysts need more access and more permissions • Hence M.A.D.
  • 9. • Magnetic – Attract data and practitioners. The database must be painless and efficient. • Agile – Analysts need to ingest and analyze on the fly. • Deep – Sophisticated analytics at scale. Don’t force analysts to work on samples (where the long tail is lost). • Library of tools
  • 10. Outline • Background • FOX Audience Network • Being MAD • In-Database Operations • MAD DBMS • Conclusions • Questions
  • 11. Background • OLAP and Data Cubes • Data structures that provide descriptive statistics – simple summaries (sum, average, std) • We want inferential/inductive statistics – predictions, causality analysis, distributional comparison
  • 12. Background • Databases and Statistical Packages • Many analysts download data to use in Excel/SAS/Matlab/R or their favorite programming language (FORTRAN??) • Use matrix/vector operations • Most of these stat packages require data to fit in RAM • Taking samples from the full data to fit into RAM results in loss of precision • External toolkits may also lack parallelism
  • 13. Background • MapReduce and Parallel Programming • The strengths of MR are scalable batch parallelism and fault tolerance. • There is work on putting ML algorithms in the MapReduce framework (Mahout) • MR is just a programming framework and works well with MAD
  • 14. Background • Data Mining and Analytics in the DB • Plenty of work in this area • Methods are typically implemented as a black box • Algorithms are not as flexible as in programmed packages
  • 15. FOX Audience Network • Served MySpace.com, IGN.com, Scout.com, etc. • About 150 million users • Ad network, bought by Rubicon in Nov 2010
  • 16. Fox Audience Network • 42 nodes (2 masters, 40 workers) • Sun X4500 Thumper • 48 500GB drives • 16GB RAM • ~5TB daily • 1 table with 1.5 trillion rows • Many different types of user workloads • Dynamic query ecosystem
  • 17. Fox Audience Network • Use Cases • Queries to the system are vastly different • Example queries: • How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle? • Expensive • How are these people similar to those that visited Nissan? • Open-ended, requires some statistics and the analyst to be in the loop.
  • 18. Being MAD • Analysts (Data Scientists) trump DBAs • They crave data • They can work with dirty data • They like all data (not samples) • They have a belt of tools
  • 19. Being MAD • Get data to the warehouse as soon as possible • Cleaning and integration should be done intelligently
  • 20. Being MAD • Four logical layers for data stores: • Reporting – Specialized static aggregates • Production – Aggregates for most users • Staging – Raw tables and logs • Sandboxes – Playground for analysts • The logical levels encourage creativity
  • 21. Being MAD • Data Scientists need power tools: Math + Stats + ML • The MADlib approach is to develop a hierarchy of mathematical concepts in the database.
  • 22. In-Database Operations • User Defined Functions (UDF) • SQL or PL/pgSQL • Internal (C) • Dynamically loaded (C, C++, Java, JavaScript, Perl, Python, R, Ruby, Scheme, etc.) • SELECT f1(city) FROM states WHERE f2(capital)
  • 23. In-Database Operations • Value abstraction levels • Scalar: 1.0 • Vector: [0.3, 0.3, 0.4] • Matrix: [[0.3, 0.7], [0.6, 0.4]] • Function: f(•) → value
  • 24. In-Database Operations • Scalar Arithmetic
  • 25. In-Database Operations • Scalar Arithmetic • SELECT 5*4; • SELECT sqrt(64); • SELECT cos(-3.14159 * sqrt(2) / 2);
  • 26. In-Database Operations • Vector Arithmetic
  • 27. In-Database Operations • Vector • Array objects: float8[] • Array operations are implemented as UDFs. • Storing arrays technically violates 1st normal form
  • 29. In-Database Operations • Matrix Arithmetic
  • 30. In-Database Operations • Matrix Representation #1: CREATE TABLE B ( row_number integer, vector numeric[] ); http://www.postgresql.org/docs/current/static/arrays.html
  • 31. In-Database Operations • Matrix addition: SELECT A.row_number, A.vector + B.vector FROM A, B WHERE A.row_number = B.row_number • This can be easily implemented as a primitive operator: SELECT A ** B FROM A, B
  • 32. In-Database Operations • Matrix and vector multiplication (Av): SELECT 1, array_accum(row_number, vector*v) FROM A • array_accum(x, v) is a custom function. It takes the value v and puts it in the row indexed by x.
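To make representation #1 concrete outside the database, here is a minimal plain-Python sketch of the two row-vector operations above. It assumes a matrix is held as a dictionary from `row_number` to a list of numbers; the names `add_matrices` and `mat_vec` are hypothetical and stand in for the join on `row_number` and for `array_accum` respectively.

```python
def add_matrices(a, b):
    """Row-wise sum of two matrices stored as {row_number: vector}.

    Mirrors the join A.row_number = B.row_number: only rows present
    in both matrices appear in the result.
    """
    return {r: [x + y for x, y in zip(a[r], b[r])] for r in a if r in b}


def mat_vec(a, v):
    """Matrix-vector product Av: one dot product per stored row,
    collected in row_number order (the role array_accum plays in SQL)."""
    return [sum(x * y for x, y in zip(a[r], v)) for r in sorted(a)]


A = {1: [1, 2], 2: [3, 4]}
B = {1: [10, 20], 2: [30, 40]}
print(add_matrices(A, B))   # {1: [11, 22], 2: [33, 44]}
print(mat_vec(A, [1, 1]))   # [3, 7]
```

Each row is processed independently of the others, which is what lets the database evaluate these operations in parallel.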
  • 33. In-Database Operations • Matrix Representation #2: CREATE TABLE A ( row_number INTEGER, column_number INTEGER, value DECIMAL ); • Great sparse representation! http://www.postgresql.org/docs/current/static/arrays.html
  • 34. In-Database Operations • Matrix Multiplication (using rep #2): SELECT A.row_number, B.column_number, SUM(A.value * B.value) FROM A, B WHERE A.column_number = B.row_number GROUP BY A.row_number, B.column_number
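The multiplication query for representation #2 is standard SQL, so it can be tried without a Greenplum or PostgreSQL installation. The sketch below runs it through Python's built-in `sqlite3` module on an in-memory database; the choice of SQLite, the `REAL` column type, the added `ORDER BY`, and the helper name `sparse_matmul` are assumptions made for illustration only.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, column, value) triples."""
    conn = sqlite3.connect(":memory:")
    for name in ("A", "B"):
        conn.execute(
            f"CREATE TABLE {name} "
            "(row_number INTEGER, column_number INTEGER, value REAL)"
        )
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)
    # Same join and aggregate as on the slide; ORDER BY only makes
    # the output order deterministic.
    rows = conn.execute(
        """
        SELECT A.row_number, B.column_number, SUM(A.value * B.value)
        FROM A, B
        WHERE A.column_number = B.row_number
        GROUP BY A.row_number, B.column_number
        ORDER BY A.row_number, B.column_number
        """
    ).fetchall()
    conn.close()
    return rows


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]]; zero entries are not stored.
A = [(1, 1, 1), (1, 2, 2), (2, 2, 3)]
B = [(1, 1, 4), (2, 1, 5), (2, 2, 6)]
print(sparse_matmul(A, B))
# [(1, 1, 14.0), (1, 2, 12.0), (2, 1, 15.0), (2, 2, 18.0)]
```

Because zero entries never appear as rows, the join only touches pairs that can contribute to the product, which is why this layout suits sparse matrices.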
  • 35. In-Database Operations • These operations require one pass over the data and can be done in parallel. • What about matrix division and matrix inverse? • SQL is awkward because of its lack of loops for iteration.
  • 36. In-Database Operations • Document similarity for fraud detection • Check whether several advertisers are linking to pages with the same content. • If several advertisers have outlinks to similar documents, it is possible they are using a stolen credit card. • Especially if the outbound documents are malicious.
  • 37. In-Database Operations • For document similarity we use tf-idf, a score for a term in a given document. For a term t and document d: $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$, where $\text{tf}_{t,d}$ is the count of term t in document d and $\text{idf}_t = \log(|\text{docs}| / |\text{docs with term } t|)$
  • 38. In-Database Operations • Table docs contains triples (document, term, count) • The tf UDF in SQL takes a term t and a document d as parameters: SELECT sum(count) FROM docs WHERE term = t AND document = d • The idf UDF in SQL takes a term t as a parameter: SELECT log((SELECT count(DISTINCT document) FROM docs)::float8 / (SELECT count(DISTINCT document) FROM docs WHERE term = t))
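As a cross-check on the tf and idf definitions, here is a small plain-Python sketch that computes the same quantities from (document, term, count) triples. The function names, the sample data, and the use of the natural logarithm are assumptions for illustration; the slides do not fix a log base.

```python
import math


def tf(docs, t, d):
    """Term frequency: the stored count of term t in document d."""
    return sum(count for doc, term, count in docs if doc == d and term == t)


def idf(docs, t):
    """Inverse document frequency: log(|docs| / |docs containing t|)."""
    all_docs = {doc for doc, _, _ in docs}
    with_term = {doc for doc, term, _ in docs if term == t}
    if not with_term:
        return 0.0  # assumption: a term seen in no document scores zero
    return math.log(len(all_docs) / len(with_term))


def tf_idf(docs, t, d):
    return tf(docs, t, d) * idf(docs, t)


# Hypothetical sample data: (document, term, count)
docs = [("d1", "ad", 2), ("d1", "car", 1), ("d2", "car", 3)]
print(tf_idf(docs, "ad", "d1"))   # 2 * log(2), about 1.386
print(tf_idf(docs, "car", "d2"))  # 0.0, since "car" occurs in every document
```

A term that appears in every document gets an idf of zero, so it contributes nothing to the similarity comparison.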
  • 39. In-Database Operations • Each doc will have a vector of tf-idf scores; we can compare the vectors using cosine similarity: $\cos\theta = \frac{v \cdot u}{\|v\| \, \|u\|}$. A small θ means v and u are similar. • The SQL is straightforward using the vector operators
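The cosine formula can be sketched in a few lines of plain Python. The function name `cosine_similarity` is hypothetical, and the sketch assumes both vectors have the same length and neither is all zeros (otherwise the division fails).

```python
import math


def cosine_similarity(v, u):
    """cos(theta) = (v . u) / (||v|| * ||u||) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, u))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_u = math.sqrt(sum(b * b for b in u))
    return dot / (norm_v * norm_u)


print(cosine_similarity([1, 0], [0, 1]))  # 0.0, orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0, same direction
```

A value near 1 corresponds to a small θ, so two documents whose tf-idf vectors score near 1 are candidates for the "same content" check described earlier.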
  • 40. In-Database Operations • Ordinary Least Squares is used to fit a line to a collection of points. • This helps to describe seasonal trends; it's a first look at the data. • Solve $y = X\beta$, where X is an n × k matrix. Estimate β using $\beta^* = (X^T X)^{-1} X^T y$ • We can parallelize the computation: • Distribute: $A \leftarrow X^T X$, $b \leftarrow X^T y$ • Collect: A and b • Calculate: $A^{-1}$ • Multiply: $A^{-1} b$
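The distribute/collect steps can be sketched in plain Python as follows. Each "segment" is a list of (x_vector, y) rows standing in for one worker's share of the data; all names are hypothetical. One deliberate deviation from the slide: instead of forming $A^{-1}$ explicitly, the sketch solves $A\beta = b$ by Gauss-Jordan elimination, which gives the same β when A is invertible.

```python
def partial_sums(rows):
    """One segment's contribution: (X^T X, X^T y) over its rows."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(segments):
    """Collect the per-segment sums, then solve the k x k system once."""
    k = len(segments[0][0][0])
    a = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for segment in segments:
        xtx, xty = partial_sums(segment)
        for i in range(k):
            b[i] += xty[i]
            for j in range(k):
                a[i][j] += xtx[i][j]
    return solve(a, b)


# Points on the line y = 1 + 2x, split across two segments.
# Each x_vector is [1, x] so that beta = [intercept, slope].
segments = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
print(ols(segments))  # approximately [1.0, 2.0]
```

Only the k × k matrix A and the length-k vector b travel between workers, so the communication cost depends on the number of columns rather than the number of rows.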
  • 41. In-Database Operations Functional Arithmetic Thursday, August 25, 11 39
  • 42. In-Database Operations • A/B Testing (Multivariate testing) • Web companies show people different versions of their website to to see which one is the best. • Ad companies compare different add campaigns and may want to find the one with the higher clicks-through rate. Thursday, August 25, 11 40
  • 43. In-Database Operations • Mann-Whitney U Test • This is a popular test for comparing non-parametric data. • Non-parametric: data that is not assumed to follow a well-known distribution. • Combine the data values from both samples, rank them, and find the average rank of each sample in the combined list. • $Z = \dfrac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$
  • 44. In-Database Operations • Mann-Whitney U Test • Given table T with (sample_id, value):
        CREATE VIEW R AS
        SELECT sample_id,
               avg(value) AS sample_avg,
               sum(rown) AS rank_sum,
               count(*) AS sample_n,
               sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
        FROM (SELECT sample_id,
                     row_number() OVER (ORDER BY value DESC) AS rown,
                     value
              FROM T) AS ordered
        GROUP BY sample_id
  • 46. In-Database Operations • Mann-Whitney U Test • With a large sample count we can compute the z_score:
        SELECT r.sample_u, r.sample_avg, r.sample_n,
               (r.sample_u - a.sum_u / 2) /
                 sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
        FROM R AS r,
             (SELECT sum(sample_u) AS sum_u,
                     sum(sample_n) AS sum_n
              FROM R) AS a
        GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u
    • This can be easily implemented as a stored procedure:
        SELECT man_whitney(value) FROM table
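For readers who want to check the arithmetic, the same test can be sketched in Python. This is an illustrative sketch, not the stored procedure mentioned on the slide: the function name is hypothetical, ranks are assigned in descending order of value to mirror the `ORDER BY value DESC` in the view, and ties simply receive consecutive ranks (as `row_number()` does) without any midrank correction.

```python
import math


def mann_whitney_z(sample_a, sample_b):
    """Return (u_a, u_b, z_a) for two samples using the normal approximation."""
    # Tag each value with its sample index, then rank the combined list.
    combined = [(v, 0) for v in sample_a] + [(v, 1) for v in sample_b]
    combined.sort(key=lambda pair: pair[0], reverse=True)

    rank_sum = [0, 0]
    for rank, (_, sample_index) in enumerate(combined, start=1):
        rank_sum[sample_index] += rank

    n1, n2 = len(sample_a), len(sample_b)
    u_a = rank_sum[0] - n1 * (n1 + 1) / 2
    u_b = rank_sum[1] - n2 * (n2 + 1) / 2

    # u_a + u_b always equals n1 * n2, which is why the SQL can use sum_u.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u_a, u_b, (u_a - mean_u) / sd_u
```

The normal approximation is only reasonable for fairly large samples, which matches the slide's "large sample count" caveat.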
  • 47. In-Database Operations • Resampling • Sample several subsets instead of working with the whole population. • Bootstrapping: sample into buckets (with replacement), then combine the buckets. • Jackknifing: leave some data out and sample, then combine the runs.
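As a concrete instance of jackknifing, the sketch below uses the leave-one-out variant to estimate the mean and its standard error. The choice of leave-one-out and the function name are assumptions made for illustration; the slides do not say which variant is intended.

```python
import math


def jackknife_mean(values):
    """Leave-one-out jackknife estimate of the mean and its standard error."""
    n = len(values)
    if n < 2:
        raise ValueError("the jackknife needs at least two observations")
    total = sum(values)
    # Each run leaves one observation out and recomputes the mean.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    # Combine the runs: average them, then scale their spread.
    estimate = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((m - estimate) ** 2 for m in leave_one_out)
    return estimate, math.sqrt(variance)
```

For the mean, this standard error coincides with the usual sample standard deviation divided by the square root of n.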
  • 48. In-Database Operations • Bootstrapping • Pre-select which trials contain which rows. • Suppose we have a population of N = 100 and subsamples of size M = 3:
        CREATE VIEW design AS
        SELECT a.trial_id,
               floor(100 * random()) AS row_id
        FROM generate_series(1, 10000) AS a (trial_id),
             generate_series(1, 3) AS b (subsample_id)
  • 49. In-Database Operations • Bootstrapping • Calculate the sampling distribution:
        CREATE VIEW trials AS
        SELECT d.trial_id,
               AVG(T.value) AS avg_value
        FROM design d, T
        WHERE d.row_id = T.row_id
        GROUP BY d.trial_id
  • 50. In-Database Operations • Bootstrapping • Final distribution parameters:
        SELECT AVG(avg_value), STDDEV(avg_value)
        FROM trials
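The three bootstrapping steps (design, trials, final parameters) can be mirrored in a short Python sketch. The function name, its parameters, and the fixed seed are assumptions for illustration; the defaults follow the slide's 10,000 trials and subsamples of size 3.

```python
import random
import statistics


def bootstrap_mean(values, trials=10000, subsample=3, seed=0):
    """Return (mean, standard deviation) of the per-trial averages."""
    rng = random.Random(seed)  # seeded so that repeated runs are reproducible
    n = len(values)
    trial_means = []
    for _ in range(trials):
        # "design": pick row ids with replacement for this trial.
        picks = [values[rng.randrange(n)] for _ in range(subsample)]
        # "trials": average the selected rows.
        trial_means.append(sum(picks) / subsample)
    # Final distribution parameters, like AVG and STDDEV over the trials view.
    return statistics.mean(trial_means), statistics.stdev(trial_means)
```

Unlike the SQL view, which calls `random()` each time it is evaluated, the seeded generator here returns the same design on every run.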
  • 51. MAD DBMS • Loading/Unloading • Scatter/Gather Streaming: fast! • Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT) • Greenplum has MapReduce support!
  • 52. MAD DBMS • Storage • Tunable table types, for flexibility: • external tables (e.g. files) • heap tables (frequent updates) • append-only tables (rare updates), compressible • column stores • All tables have a distribution policy
  • 53. MAD DBMS • Partitioning • Partition by a range of values or by a list of column values. • For example, partition by timestamp: older data goes to compressed tables, newer data goes to heap storage. • The query optimizer knows the partitioning scheme. • This is useful for staging data.
  • 54. MAD DBMS • MAD Programming • Use your favorite programming language extension. • Programmers must think through how the code works without shared memory (data-parallel). • Map and Reduce functions can be written in Python, Perl, or R. • They run using the Scatter/Gather technique (on any table type). • SQL is not necessary!
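To make the "no shared memory" point concrete, here is a plain-Python sketch of a map function and a reduce function. It does not use Greenplum's MapReduce interface; the function names and the idea of passing partitions as lists of `(document, term, count)` rows are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(rows):
    """Map step: runs on one partition and reads only that partition's rows."""
    counts = Counter()
    for _document, term, count in rows:
        counts[term] += count
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results without mutating either input."""
    merged = Counter(left)
    merged.update(right)
    return merged


def term_totals(partitions):
    """Total count per term across all partitions."""
    return reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because each map call sees only its own rows and the reduce step is associative, the partitions could be processed in any order or on different machines.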
  • 55. MAD DBMS • Future • Need a better repository for the library of functions (Morpheus, SIGMOD06). • Automatically choose data and table layouts. • Use online aggregation for faster answers to queries.
  • 59. Questions • Why would one be opposed to the M.A.D. paradigm? • How do MADlib and Greenplum fit into the NoSQL movement? • Is there a market for an app store of functions/utilities for MADlib?
  • 60. Conclusion • This paper describes the M.A.D. (Magnetic, Agile, Deep) mindset and the M.A.D. library. • It demonstrates how to perform analytics inside the database using SQL and stored procedures. • It also describes the Greenplum database, a general data-parallel database for simple and complex queries over flexible tables.
  • 62. MADlib Installation (Ubuntu) • sudo apt-get install postgresql flex bison doxygen libboost1.42-* git graphviz libarmadillo-dev libarmadillo-doc • sudo apt-get install lapack++ blas++ • git clone https://github.com/madlib/madlib.git • cd madlib • ./configure • cd build • make • See https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source
  • 63. Matrix Multiplication Demo
        -- Create tables in the new matrix format
        CREATE TABLE A (
            row_number INTEGER,
            column_number INTEGER,
            value DECIMAL
        );
        INSERT INTO A VALUES
            (1, 1, 1),
            (1, 2, 2),
            (2, 1, 3),
            (2, 2, 4);

        CREATE TABLE B (
            row_number INTEGER,
            column_number INTEGER,
            value DECIMAL
        );
        INSERT INTO B VALUES
            (1, 1, 5),
            (1, 2, 6),
            (2, 1, 7),
            (2, 2, 8);

        -- Check to see that the values are multiplying correctly
        SELECT A.row_number, B.column_number,
               A.value::text || '*' || B.value::text
        FROM A, B
        WHERE A.column_number = B.row_number;

        -- Add the GROUP BY to see the sum work
        SELECT A.row_number, B.column_number, SUM(A.value * B.value)
        FROM A, B
        WHERE A.column_number = B.row_number
        GROUP BY A.row_number, B.column_number;

        DROP TABLE A;
        DROP TABLE B;
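The join-and-aggregate pattern in the demo can be checked with a small Python sketch over the same (row_number, column_number, value) triples. The function name `matmul` and the use of plain integers instead of DECIMAL are choices made for this illustration.

```python
from collections import defaultdict

# The same matrices as in the demo, as (row_number, column_number, value).
A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]


def matmul(a, b):
    """Multiply two matrices stored as (row, column, value) triples."""
    # Index B by row so that each A entry finds its join partners,
    # the equivalent of WHERE A.column_number = B.row_number.
    b_by_row = defaultdict(list)
    for row, col, value in b:
        b_by_row[row].append((col, value))

    # Equivalent of GROUP BY A.row_number, B.column_number with SUM.
    product = defaultdict(int)
    for a_row, a_col, a_value in a:
        for b_col, b_value in b_by_row[a_col]:
            product[(a_row, b_col)] += a_value * b_value

    return sorted((r, c, v) for (r, c), v in product.items())
```

Calling `matmul(A, B)` yields the triples (1, 1, 19), (1, 2, 22), (2, 1, 43), (2, 2, 50), which should match the rows returned by the grouped query.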