HiPIC




           Big Data Fundamentals in the
            Emerging New Data World
                   PIT (Product Innovation Team)
                   Samsung Electronics America
                                 San Jose, CA
                                 Aug 17th 2012



                             Jongwook Woo (PhD)
                High-Performance Internet Computing Center (HiPIC)
        Educational Partner with Cloudera and Grants Awardee of Amazon AWS
                       Computer Information Systems Department
                         California State University, Los Angeles

 Jongwook Woo
                                                                             CSULA
HiPIC             Contents
 Fundamentals of Big Data

 NoSQL DB: HBase, MongoDB

 Data-Intensive Computing: Hadoop

 Big Data Supporters and Use Cases




HiPIC                  Experience in Big Data
 Several publications regarding Hadoop and NoSQL
     “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA
      2012, Las Vegas (July 16-19, 2012)
     “Market Basket Analysis Algorithm with no-SQL DB HBase and
      Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon
      Ho Kim, EDB 2011, Incheon, Aug. 25-27, 2011
     “Market Basket Analysis Algorithm with Map/Reduce of Cloud
      Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011,
      Las Vegas (July 18-21, 2011)
     Jongwook Woo, “Introduction to Cloud Computing”, in the
      10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009

 Talks in Korean Universities and companies
     Yonsei, Sookmyung, KAIST, Korean Polytech Univ
           – Winter 2011
     VanillaBreeze
           – Winter 2011
HiPIC            Experience in Big Data (Cont’d)
 Grants
     Received Amazon AWS in Education Research Grant (July
      2012 - July 2014)
     Received Amazon AWS in Education Coursework Grants (July
      2012 - July 2013, Jan 2011 - Dec 2011)

 Partnership
     Received Academic Education Partnership with Cloudera since
      June 2012

 Certificate
     Certificate of Achievement in the Big Data University Training
      Course, “Hadoop Fundamentals I”, July 8 2012

 Cloud Computing Blog
     http://dal-cloudcomputing.blogspot.com/
HiPIC   What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing

[Figure: diagram with the labels NoSQL DB, AWS, Cloudera, and Hortonworks]

HiPIC                      Big Data

Too much data
    Terabyte (10^12 bytes), Petabyte (10^15 bytes)
           – Because of web
           – Sensor Data, Bioinformatics, Social
             Computing, smart phone, online game…

Cannot be handled with the legacy
 approach
    Too big
    Un-/Semi-structured data


HiPIC               Two Issues in Big Data

How to store Big Data
    NoSQL DB

How to compute Big Data
    Parallel Computing with multiple cheap
     computers
           – No need for supercomputers




HiPIC             Contents
 Fundamentals of Big Data

 NoSQL DB: HBase, MongoDB

 Data-Intensive Computing: Hadoop

 Big Data Supporters and Use Cases




HiPIC                         New Data Trend
 Sparsity
    Schema free data with sparse attributes
      – Document Term vector
      – User-Item matrix
      – Semantic or social relations
    No relational property
      – nor complex join queries
                 • Log data




HiPIC                  New Data Trend (Cont’d)
Immutable
    No need to update and delete data
           – Only insert with versions
                 • Tracking history
                 • Lock-free (key-based atomicity)
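
The insert-only, versioned model described above can be sketched as a small append-only store. This is a minimal illustration under stated assumptions, not how any particular NoSQL DB implements it: the class name `VersionedStore`, its methods, and the sample keys are all hypothetical.

```python
import itertools


class VersionedStore:
    """Append-only key/value store: a put never overwrites, it adds a version."""

    def __init__(self):
        self._versions = {}               # key -> list of (version, value)
        self._clock = itertools.count(1)  # hypothetical global version counter

    def put(self, key, value):
        """Insert a new version for the key and return its version number."""
        version = next(self._clock)
        self._versions.setdefault(key, []).append((version, value))
        return version

    def get(self, key, version=None):
        """Return the latest value, or the latest value at or before `version`."""
        history = self._versions.get(key, [])
        if version is not None:
            history = [(v, val) for v, val in history if v <= version]
        return history[-1][1] if history else None

    def history(self, key):
        """Return every stored version of the key, oldest first."""
        return list(self._versions.get(key, []))


store = VersionedStore()
v1 = store.put("user:1", {"email": "smith@hi.com"})
v2 = store.put("user:1", {"email": "joe@hi.com"})

print(store.get("user:1"))              # latest version
print(store.get("user:1", version=v1))  # value as of the first insert
print(store.history("user:1"))          # full history is still available
```

Because old versions are never modified, readers never see a half-updated value, which is what makes the history trackable without locks in this sketch.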




HiPIC                       Big Data for RDBMS
 Issues in RDBMS
    Hard to scale
      – Relation gets broken
                 • Partitioning for scalability
                 • Replication for availability
    Speed
      – The seek times of physical storage
                 • Slower than N/W speed
                 • 1TB disk: 10MB/s transfer rate
                     – 100K sec => 27.8 hrs
           – With multiple data sources at different places
              • 100 10GB disks: each 10MB/s transfer rate
                     – 1K sec => 16.7 min
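
The read times above follow from time = size / transfer rate. The sketch below reproduces them, assuming the transfer rate is 10 megabytes per second (the reading under which the 100K sec and 1K sec figures hold) and decimal units (1 TB = 10^12 bytes). The function and variable names are illustrative.

```python
TB = 10**12  # bytes, decimal units assumed
GB = 10**9
MB = 10**6


def read_seconds(size_bytes, rate_bytes_per_sec):
    """Time to read a given amount of data sequentially at a fixed rate."""
    return size_bytes / rate_bytes_per_sec


# One 1 TB disk read end to end.
single_disk = read_seconds(1 * TB, 10 * MB)

# 100 disks of 10 GB each, all read at the same time:
# the elapsed time is the time to read one 10 GB disk.
parallel = read_seconds(10 * GB, 10 * MB)

print(f"single disk: {single_disk:.0f} s = {single_disk / 3600:.1f} h")
print(f"100 disks:   {parallel:.0f} s = {parallel / 60:.1f} min")
```

Splitting the same terabyte over 100 disks cuts the elapsed read time by a factor of 100, which is the motivation for distributing data across many cheap machines.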

HiPIC              Big Data for RDBMS (Cont’d)

Issues in RDBMS (Cont’d)
    Data Integration
           – Not good for un-/semi-structured data
                 • Many unstructured data
                   – Web or log data etc
    RDB is not good at parallelization
           – Cannot split 1,000 tasks across 1,000 cheap PCs
             efficiently




HiPIC                      RDBMS Issues

Solution
    Big Data
      ⇒Data Cleansing by Hadoop
                 ⇒ Data Computation (MapReduce, Pig)
                 ⇒ Data Repositories (NoSQL DB: HBase,
                  Cassandra, MongoDB)
           ⇒Business Intelligence (Data Mining,
            OLAP, Data Visualization, Reporting):
            Hive, Mahout




HiPIC                              NoSQL DBs
 not primarily built on tables,
     generally do not use SQL for data manipulation
     non-relational, distributed data stores
       – often do not provide ACID (atomicity, consistency, isolation,
         durability)
                 • which are the key attributes of classic RDB

 Fast Index on large amount of data
     Lookup by keys (key/value)

 NoSQL normally supports MapReduce
     Parallel computation




HiPIC            Use Cases for NoSQL DB [1]

RDBMS replacement
   for high-traffic web applications

Semi-structured content management

Real-time analytics & high-speed logging

Web Infrastructure
   Web 2.0, Media, SaaS, Gaming,
   Finance, Telecom, Healthcare, Government

Three NoSQL DB Approaches
   Key/Value, Column, Document
HiPIC                 Data Store of NoSQL DB
 Key/Value store
    (Key, Value)
    Functions
           – Index, versioning, sorting, locking, transaction,
             replication
    Apache Cassandra, Memcached




HiPIC             Data Store of NoSQL DB (Cont’d)
 Column-Oriented Stores (Extensible Record
  Stores)
    stores data tables as sections of columns of data
           – rather than as rows of data, like most RDBMS
                 • Sparse fields in RDBMS
           – well-suited for OLAP-like workloads (e.g., data
             warehouses)
    Extensible record horizontally and vertically
     partitioned across nodes
           – Rows and Columns are distributed over multiple
             nodes
    BigTable, HBase, Cassandra, Hypertable


HiPIC              Data Store of NoSQL DB (Cont’d)

     StudentId   Lastname   Firstname   email
     1           Smith      Joe         smith@hi.com
     2           Jones      Mary        mary@hi.com
     3           Johnson    Cathy       cathy@hi.com

       Row Oriented
        – 1,Smith, Joe, smith@hi.com;
        – 2,Jones, Mary, mary@hi.com;
        – 3,Johnson, Cathy, cathy@hi.com;
      Column Oriented
        – 1,2,3;
        – Smith, Jones, Johnson;
        – Joe, Mary, Cathy;
        – smith@hi.com, mary@hi.com, cathy@hi.com;
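
The two layouts above can be sketched with the same three student records. This is only an in-memory illustration of the idea; real column stores add compression, partitioning, and on-disk formats that are not shown here, and the variable names are illustrative.

```python
fields = ("StudentId", "Lastname", "Firstname", "email")
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]

# Row oriented: the values of one record are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column oriented: the values of one column are stored next to each other.
columns = {name: [row[i] for row in rows] for i, name in enumerate(fields)}

# A query that needs only one attribute touches only that column.
lastnames = columns["Lastname"]

print(row_layout)
print(columns)
print(lastnames)
```

Reading `lastnames` here scans three values instead of all twelve, which is why the column layout suits OLAP-like workloads that aggregate a few attributes over many records.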
HiPIC
                    HBase Schema Example (Student/Course)

  RDBMS
       Students: (id, name, sex, age)
       Courses: (id, title, desc, teacher_id)
       S_C: (s_id, c_id, type)

  HBase
       Student table
            Row key:                 <student_id>
            Column family Info:      Info:name, Info:sex, Info:age
            Column family Course:    Course:<course_id> = type

       Course table
            Row key:                 <course_id>
            Column family Info:      Info:title, Info:desc, Info:teacher_id
            Column family student:   student:<student_id> = type
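
One way to picture the HBase layout is as nested maps: row key, then column family, then column qualifier, then value. The sketch below models the two tables with plain Python dictionaries rather than the HBase client API; the sample rows, values, and helper names are hypothetical.

```python
# Hypothetical sample rows; identifiers and values are illustrative only.
student_table = {
    "s1": {
        "Info": {"name": "Joe Smith", "sex": "M", "age": "20"},
        # Qualifier is the course id, value is the enrollment type.
        "Course": {"c1": "required", "c2": "elective"},
    },
    "s2": {
        "Info": {"name": "Mary Jones", "sex": "F", "age": "21"},
        "Course": {"c1": "elective"},
    },
}

course_table = {
    "c1": {
        "Info": {"title": "Databases", "desc": "Intro course", "teacher_id": "t9"},
        # Qualifier is the student id, value is the enrollment type.
        "student": {"s1": "required", "s2": "elective"},
    },
    "c2": {
        "Info": {"title": "Networks", "desc": "Intro course", "teacher_id": "t7"},
        "student": {"s1": "elective"},
    },
}


def courses_of(student_id):
    """Courses of a student: a single row lookup, no join table needed."""
    return sorted(student_table[student_id]["Course"])


def students_in(course_id):
    """Students of a course: again a single row lookup."""
    return sorted(course_table[course_id]["student"])


print(courses_of("s1"))
print(students_in("c1"))
```

The relational S_C join table disappears: the relationship is stored twice, once in each table, so either direction is answered by reading one row.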

HiPIC            Data Store of NoSQL DB (Cont’d)
 Document Store
    Collections and Documents
           – vs Tables and Records of RDB
    Used in Search Engine/Repository
    Multiple index to store indexed document
           – no fixed fields
    Not simple key-value lookup
           – Use API
    Functions
           – No locking, Replication, Transaction
    MongoDB, CouchDB, ThruDB, SimpleDB



HiPIC            The Great Divide [1]

                                    MongoDB
                    HBase




  MongoDB sweet spot: Easy, Flexible, Scalable

HiPIC            Understanding the Document Model [1]

    {
        _id: "A4304",
        author: "nosh",
        date: "22/6/2010",
        title: "Intro to MongoDB",
        text: "MongoDB is an open source..",
        tags: ["webinar", "opensource"],
        comments: [{author: "mike",
                    date: "11/18/2010",
                    txt: "Did you see the…",
                    votes: 7}, …]
    }



 Documents->Collections->Databases
HiPIC            Document Model Makes Queries Simple [1]


  Operators:
  $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit,
  skip, group


  Example:
  db.posts.find({author: "nosh",
                 tags: "webinar"})
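
To show what the query above selects, here is a minimal pure-Python sketch of equality matching in which a list-valued field matches when it contains the requested value. It is not MongoDB's implementation and does not use a MongoDB driver; the helpers `matches` and `find` and the second sample post are hypothetical.

```python
def matches(doc, query):
    """True if every query field equals the document's field,
    or is contained in it when the document's field is a list."""
    for field, wanted in query.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query):
    """Return the documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    # Hypothetical second document, added only to show a non-match.
    {"_id": "B0001", "author": "mike", "title": "Another post",
     "tags": ["webinar"]},
]

result = find(posts, {"author": "nosh", "tags": "webinar"})
print([doc["_id"] for doc in result])
```

Only the first post is returned: both posts carry the tag, but only one has the requested author.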




HiPIC            Selected Users [1]




HiPIC             Contents
 Fundamentals of Big Data

 NoSQL DB: HBase, MongoDB

 Data-Intensive Computing: Hadoop

 Big Data Supporters and Use Cases




HiPIC                          Data nowadays

• Data Issues
    o    data grows to 10TB, and then 100TB.
    o    Unstructured data coming from sources
                like Facebook, Twitter, RFID readers, sensors,
                 and so on.
                Need to derive information from both the
                 relational data and the unstructured data
                   • as soon as possible.

• Solution to efficiently compute Big
   Data
    o    Hadoop Map/Reduce
HiPIC            Solutions in Big Data Computation
 Map/Reduce by Google
     (Key, Value) parallel computing
 Apache Hadoop
     Big Data
       ⇒Data Computation (MapReduce, Pig)

 Integrating MapReduce and RDB
     Oracle + Hadoop
     Sybase IQ
     Vertica + Hadoop
     Hadoop DB
     Greenplum
     Aster Data
 Integrating MapReduce and NoSQL DB
     MongoDB MapReduce
     HBase
HiPIC                             Apache Hadoop
 Motivated by Google Map/Reduce and GFS
     open source project of the Apache Foundation.
     framework written in Java
        – originally developed by Doug Cutting
                 • who named it after his son's toy elephant.

 Two core Components
     Storage: HDFS
       – High Bandwidth Clustered storage
     Processing: Map/Reduce
       – Fault Tolerant Distributed Processing

 Hadoop scales linearly with
     data size
     Analysis complexity

HiPIC                     Hadoop issues
 Map/Reduce is not DB
     Algorithm in Restricted Parallel Computing

 HDFS and HBase
     Cannot compete with the functions in RDBMS

 But, useful for
     Semi-structured data model and high-level dataflow query
      language on top of MapReduce
        – Pig, Hive, Jaql, Cascading, Cloudbase
     Useful for huge (peta- or terabytes) but non-complicated data
        – Web crawling
        – log analysis
            • Log file for web companies
        – New York Times case


HiPIC            MapReduce Pros & Cons Summary
 Good when
    Huge data for input, intermediate, output
    Little synchronization required
    Read once; batch oriented datasets (ETL)

 Bad for
    Fast response time
    Large amount of shared data
    Fine-grained synch needed
    CPU-intensive rather than data-intensive work
    Continuous input stream


HiPIC                 MapReduce in Detail
Functions borrowed from functional
 programming languages (e.g., Lisp)

Provides Restricted parallel programming
 model on Hadoop
        User implements Map() and Reduce()
        Libraries (Hadoop) take care of
         EVERYTHING else
             – Parallelization
             – Fault Tolerance
             – Data Distribution
             – Load Balancing

HiPIC                      Map
Converts input data to (key, value) pairs
map() functions run in parallel,
     creating different intermediate (key, value)
     pairs from different input data sets




HiPIC                       Reduce
 reduce() combines those intermediate values
  into one or more final values for that same
  key
 reduce() functions also run in parallel,
    each working on a different output key
 Bottleneck:
    reduce phase can’t start until map phase is
     completely finished.




HiPIC            Example: Sort URLs by hit count, largest first
Compute the most-hit URLs
      Stored in log files

Map()
      Input: <logFilename, file text>
      Output: Parses file and emits <url, hit counts> pairs
                 – e.g., <http://hello.com, 1>

Reduce()
      Input: <url, list of hit counts> from multiple map
       nodes
      Output: Sums all values for the same key and emits
       <url, TotalCount>
                 – e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
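
The Map() and Reduce() steps above can be simulated in a single Python process. This is a sketch of the data flow only, not Hadoop code: the log format (one URL per line), the file names, and the `shuffle` helper that stands in for the framework's grouping step are assumptions.

```python
from collections import defaultdict


def map_log(log_filename, file_text):
    """Map: emit (url, 1) for every non-empty line of a log file."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield url, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(url, counts):
    """Reduce: sum all hit counts for one URL."""
    return url, sum(counts)


# Hypothetical input: two small log files, one URL per line.
logs = {
    "log1.txt": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "log2.txt": "http://hello.com\nhttp://halo.com\n",
}

pairs = [pair for name, text in logs.items() for pair in map_log(name, text)]
totals = dict(reduce_hits(url, counts) for url, counts in shuffle(pairs).items())

# Final step from the slide title: order URLs by total hits, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

On a cluster each `map_log` call would run on a different node and each `reduce_hits` call would handle a different URL; only the grouping step needs data to move between nodes.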
HiPIC            Map/Reduce for URL visits

                              Input Log Data


              Map1()          Map2()       …      Mapm()
 (http://hi.com, 1)         (http://halo.com, 1)
 (http://hello.com, 3)      (http://hello.com, 5)
 …                          …
                         Data Aggregation/Combine
   (http://hi.com, <1, 1, …, 1>)       (http://halo.com, <1, 5>)
                       (http://hello.com, <3, 5, 2, 7>)
                 Reduce1 ()    Reduce2()    …      Reducel()

  (http://hi.com, 32)                       (http://halo.com, 6)
                          (http://hello.com, 17)
HiPIC                             Legacy Example

In late 2007, the New York Times
 wanted to make available over the web
 its entire archive of articles,
    11 million in all, dating back to 1851.
    four-terabyte pile of images in TIFF format.
    needed to translate that four-terabyte pile of TIFFs
     into more web-friendly PDF files.
           – not a particularly complicated but large computing chore,
                 • requiring a whole lot of computer processing time.




HiPIC                    Legacy Example (Cont’d)

In late 2007, the New York Times
 wanted to make available over the web
 its entire archive of articles,
    a software programmer at the Times, Derek Gottfrid,
       – playing around with Amazon Web Services, Elastic
         Compute Cloud (EC2),
           • uploaded the four terabytes of TIFF data into Amazon's
             Simple Storage Service (S3)
           • In less than 24 hours, 11 million PDFs, all stored
             neatly in S3 and ready to be served up to visitors to the
             Times site.
     The total cost for the computing job? $240
           – 10 cents per computer-hour times 100 computers times 24 hours
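
The $240 figure is the product of the three numbers on the line above. A short check, using integer cents to avoid floating-point rounding (variable names are illustrative):

```python
cents_per_computer_hour = 10
computers = 100
hours = 24

# Total cost = price per computer-hour x number of computers x hours.
total_cents = cents_per_computer_hour * computers * hours
print(f"${total_cents / 100:.2f}")
```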




HiPIC             Contents
 Fundamentals of Big Data

 NoSQL DB: HBase, MongoDB

 Data-Intensive Computing: Hadoop

 Big Data Supporters and Use Cases




HiPIC
                             Supporters of Big Data

 Apache Hadoop Supporters
     Cloudera
           – Like Linux and Redhat
           – HiPIC is an Academic Partner
     Hortonworks
           – Pig
     Facebook
           – Hive
     IBM
           – Jaql

 NoSQL DB supporters
     MongoDB
           – HiPIC tries to collaborate
     HBase, CouchDB, Apache Cassandra (originally by FB) etc

HiPIC          Similarities in Pig, Hive, and Jaql

•    translate high-level languages into MapReduce jobs
     o the programmer can work at a higher level
         than writing MapReduce jobs in Java or other
           lower-level languages
•    programs are much smaller than Java code.

•    option to extend these languages,
      o often by writing user-defined functions in Java.
•    Interoperability
      o programs written in these high-level languages can
        be embedded inside other languages as well.
•    the same limitations as Hadoop
      o no support for random reads and writes
      o or for low-latency queries.
HiPIC                                Pig

•    developed at Yahoo Research around 2006
      o moved into the Apache Software Foundation in
        2007.
•    PigLatin,
      o Pig's language
      o a data flow language
      o well suited to processing unstructured data
         Unlike SQL, does not require that the data have a
           schema
                    However, can still leverage the value of a schema




HiPIC                                  Hive
•    developed at Facebook
      o turns Hadoop into a data warehouse
            o complete with a dialect of SQL for querying.
•    HiveQL
     o a declarative language (SQL dialect)
•    Difference from PigLatin,
      o    you do not specify the data flow,
            but instead describe the result you want
                    Hive figures out how to build a data flow to
                     achieve it.
      o a schema is required,
         but not limited to one schema.
            o data can have many schemas

HiPIC                        Hive (Cont'd)

•    Similarity with PigLatin and SQL,
      o HiveQL on its own is a relationally complete
        language
          but not a Turing-complete language
                    (one that can express any computation)
      o can be extended through UDFs (User Defined
        Functions) written in Java
          just like Pig, to become Turing complete




HiPIC                      Jaql

• developed at IBM.
• a data flow language
    o its native data structure format is JSON (JavaScript
      Object Notation).

• Schemas are optional
• Turing complete on its own
    o without the need for extension through UDFs.




HiPIC
                   Use Cases

 Amazon AWS

 Facebook

 Twitter

 Craigslist

 HuffPost | AOL




HiPIC                                Amazon AWS

 amazon.com
     Consumer and seller business

 aws.amazon.com
     IT infrastructure business
           – Focus on your business not IT management
     Pay as you go
           – Pay for servers by the hour
           – Pay for storage per gigabyte per month
           – Pay for data transfer per gigabyte
     Services with many APIs
           – S3: Simple Storage Service
           – EC2: Elastic Compute Cloud
                 • Provide many virtual Linux servers
                 • Can run on multiple nodes
                     – Hadoop and HBase
                     – MongoDB
HiPIC                   Amazon AWS (Cont’d)

 Customers on aws.amazon.com
 Samsung
     – Smart TV hub sites: TV applications are on AWS
 Netflix
           – ~25% of US internet traffic
           – ~100% on AWS
 NASA JPL
           – Analyze more than 200,000 images
 NASDAQ
           – Using AWS S3

 HiPIC received research and teaching grants
  from AWS
HiPIC                            Facebook [7]

 Using Apache HBase
 For Titan and Puma
 HBase for FB
           – Provide excellent write performance and good reads
           – Nice features
                 • Scalable
                 • Fault Tolerance
                 • MapReduce




HiPIC                         Titan: Facebook

 Message services in FB
    Hundreds of millions of active users
    15+ billion messages a month
    50K instant messages a second

 Challenges
    High write throughput
           – Every message, instant message, SMS, email
    Massive Clusters
           – Must be easily scalable

 Solution
    Clustered HBase
HiPIC                              Puma: Facebook

 ETL
     Extract, Transform, Load
       – Integrating data from many sources into a data warehouse
     Data analytics
       – Domain owners’ web analytics for Ad and apps
                 • clicks, likes, shares, comments etc

 ETL before Puma
     8 – 24 hours
        – Procedures: Scribe, HDFS, Hive, MySQL

 ETL after Puma
     Puma
        – Real time MapReduce framework
     2 – 30 secs
        – Procedures: Scribe, HDFS, Puma, HBase


HiPIC                              Twitter [8]

 Three Challenges
    Collecting Data
           – Scribe, as at Facebook
    Large Scale Storage and analysis
           – Cassandra: ColumnFamily key-value store
           – Hadoop
    Rapid Learning over Big Data
           – Pig
                 • ~5% of the code of the equivalent Java
                 • ~5% of the dev time
                 • Within 20% of the running time




HiPIC                 Craigslist in MongoDB [9]

 Craigslist
    ~700 cities, worldwide
    ~1 billion hits/day
    ~1.5 million posts/day
    Servers
           – ~500 servers
           – ~100 MySQL servers

 Migrate to MongoDB
    Scalable, Fast, Proven, Friendly




HiPIC
                               HuffPost | AOL [10]



Two Machine Learning Use Cases
  Comment Moderation
        – Evaluate All New HuffPost User Comments
          Every Day
                 • Identify Abusive / Aggressive Comments
                 • Auto Delete / Publish ~25% Comments Every
                   Day
  Article Classification
        – Tag Articles for Advertising
                 • E.g.: scary, salacious, …



HiPIC               HuffPost | AOL [10]

 Parallelize on Hadoop
    Good news:
      – Mahout, a parallel machine learning tool, is
        already available.
      – There are Mallet, libsvm, Weka, … that support
        necessary algorithms.
    Bad news:
      – Mahout doesn’t support necessary algorithms
        yet.
      – Other algorithms do not run natively on Hadoop.
 build a flexible ML platform running on
  Hadoop
    Pig for Hadoop implementation.

HiPIC
                                   MapReduce Example

 Word Count in the previous slide

 Shortest Path in the graph
     Graph algorithm is very suitable for M/R, especially BFS
           – Spreading activation type of processing
     Map:
           – Input: a node n as a key, and (D, points-to) as its value
                 • D is the distance to the node from the start
                 • points-to is a list of nodes reachable from n
           – Output: ∀p ∈ points-to, emit (p, D+1)
     Reduce:
           – Input: possible distances to a given p
           – Output: selects the minimum one
                 • Perform multiple iterations
     Iterative process for matrix, graph, network
           – Apache HAMA needed?
                 • Iterative Process on Hadoop
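
The shortest-path Map and Reduce steps above can be simulated in one Python process. This is a sketch under stated assumptions, not Hadoop code: every edge has weight 1, unreached nodes start at infinite distance, and the map also re-emits each node's own record so that its points-to list survives to the next iteration (a detail the slide does not spell out). All function names and the sample graph are hypothetical.

```python
import math
from collections import defaultdict


def bfs_map(node, value):
    """Map: pass the node through, and offer D+1 to every node it points to."""
    distance, points_to = value
    yield node, (distance, points_to)      # keep the graph structure
    if distance != math.inf:
        for p in points_to:
            yield p, (distance + 1, None)  # candidate distance for p


def bfs_reduce(node, values):
    """Reduce: keep the minimum candidate distance and the points-to list."""
    best = math.inf
    points_to = []
    for distance, adjacency in values:
        best = min(best, distance)
        if adjacency is not None:
            points_to = adjacency
    return node, (best, points_to)


def bfs_iteration(graph):
    """One full map, shuffle, and reduce pass over the graph."""
    grouped = defaultdict(list)
    for node, value in graph.items():
        for key, emitted in bfs_map(node, value):
            grouped[key].append(emitted)
    return dict(bfs_reduce(node, values) for node, values in grouped.items())


def shortest_paths(graph):
    """Repeat iterations until no distance changes."""
    while True:
        updated = bfs_iteration(graph)
        if updated == graph:
            return {node: distance for node, (distance, _) in graph.items()}
        graph = updated


# Hypothetical graph: node -> (distance from start, points-to list); A is the start.
graph = {
    "A": (0, ["B", "C"]),
    "B": (math.inf, ["D"]),
    "C": (math.inf, ["D"]),
    "D": (math.inf, []),
}

distances = shortest_paths(graph)
print(distances)
```

Each pass extends the known frontier by one hop, so the number of iterations grows with the graph's diameter; that repeated job launch is the cost the slide alludes to when it asks whether an iterative framework is needed.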

HiPIC
                               MapReduce Example (Cont’d)

 Social N/W analysis
     Recommend new friends (friend of a friend: FOAF)
     Map
           – In: (x, <friends_x>)
           – Out: if (u, x) are friends
                 • (u, <friends_x \ friends_u>)
                         – <friends_x \ friends_u>: friends of x but not friends of u
           – Otherwise
                 • nil
     Reduce
           – In: (u, <<friends_a \ friends_u>, <friends_b \ friends_u>, …>)
                 • Friends list of all users a, b, … who are friends of u
           – Out: (u, < (X1 , N1 ), (X2 , N2 ), …>)
                 • Xm : FOAF of u
                 • Nm : Total number of occurrences in all FOAF lists
                         – To sort or rank the results
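
The friend-of-a-friend job above can be sketched in a single Python process. One deliberate difference from the slide, stated as an assumption: a mapper working on x only knows x's friend list, so this sketch sends the full list to each friend u and removes u's existing friends in the reducer, where they can be recovered as the set of senders. Friendships are assumed to be symmetric, and all names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def foaf_map(x, friends_x):
    """Map: send x's friend list to every friend u of x."""
    for u in friends_x:
        yield u, (x, friends_x)


def foaf_reduce(u, values):
    """Reduce: count how often each non-friend appears in the lists sent to u."""
    friends_u = {x for x, _ in values}  # everyone who sent a list is a friend of u
    counts = Counter(
        candidate
        for _, friends_x in values
        for candidate in friends_x
        if candidate != u and candidate not in friends_u
    )
    # (X_m, N_m) pairs, most frequently suggested first.
    return u, counts.most_common()


# Hypothetical symmetric friendship lists.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "cat", "dan"],
    "cat": ["ann", "bob", "dan"],
    "dan": ["bob", "cat"],
}

grouped = defaultdict(list)
for x, friends_x in friends.items():
    for u, value in foaf_map(x, friends_x):
        grouped[u].append(value)

recommendations = dict(foaf_reduce(u, values) for u, values in grouped.items())
print(recommendations)
```

The count N_m is the number of mutual friends, so sorting by it ranks the strongest suggestions first.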

HiPIC
                      MapReduce Example (Cont’d)

 Inverted Indexing (Full Text Search)
     Map (3 nodes):
        – Input:
             • Doc1: “Columbus’s egg”
             • Doc 2: “The chicken and egg problem”
             • Doc 3: “Easter Egg”
        – Output:
             • Map1: (“columbus’s”, (doc1, 1)), (“egg”, (doc1, 2))
             • Map2: (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”,
               (doc2, 3)), (“egg”, (doc2, 4)), (“problem”, (doc2, 5))
             • Map3: (“easter”, (doc3, 1)), (“egg”, (doc3, 2))
     Intermediate Shuffle
        – (“columbus’s”, (doc1, 1)), (“egg”, <(doc1, 2), (doc2, 4), (doc3,
          2)>), (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2,
          3)), (“problem”, (doc2, 5)), (“easter”, (doc3, 1)))


HiPIC
                      MapReduce Example (Cont’d)

Inverted Indexing (Full Text Search)
 (Cont’d)
    Reduce
           – Input: (“columbus’s”, (doc1, 1)), (“egg”,
             <(doc1, 2), (doc2, 4), (doc3, 2)>), (“the”,
             (doc2, 1)), (“chicken”, (doc2, 2)), (“and”,
             (doc2, 3)), (“problem”, (doc2, 5)), (“easter”,
             (doc3, 1)))
           – Output: same as above
               • Assuming (“egg”, <(doc1, 2), (doc1, 4),
                 (doc3, 2)>), output is:
                  – (“egg”, <(doc1, <2, 4>), (doc3, 2)>),
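
The inverted-index job above can be sketched in a single Python process using the same three documents. It is a simulation of the data flow, not Hadoop code: words are split on whitespace and lowercased, positions are 1-based, and the reducer always returns positions as a list (the slide writes a single position without brackets). Function names are hypothetical.

```python
from collections import defaultdict


def index_map(doc_id, text):
    """Map: emit (word, (doc_id, position)) for every word in the document."""
    for position, word in enumerate(text.lower().split(), start=1):
        yield word, (doc_id, position)


def index_reduce(word, postings):
    """Reduce: merge the positions of a word per document."""
    by_doc = defaultdict(list)
    for doc_id, position in postings:
        by_doc[doc_id].append(position)
    return word, sorted(by_doc.items())


docs = {
    "doc1": "Columbus's egg",
    "doc2": "The chicken and egg problem",
    "doc3": "Easter Egg",
}

# Shuffle: group the postings emitted by all mappers by word.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, posting in index_map(doc_id, text):
        grouped[word].append(posting)

index = dict(index_reduce(word, postings) for word, postings in grouped.items())
print(index["egg"])
```

A full-text search for a word is then a single lookup in `index`, which returns every document and position where the word occurs.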

HiPIC               Conclusion
 Era of Big Data

 Need to store and compute Big Data

 Storage: NoSQL DB

 Computation: Hadoop MapReduce

 Need to analyze Big Data in mobile computing, SNS
  for Ad, User Behavior, Patterns …




HiPIC                     References
1) Introduction to MongoDB, Nosh Petigara, Jan 11, 2011

2) Hadoop Fundamental I, Big Data University

3) “Large Scale Data Analysis with Map/Reduce”, Marin
   Dimitrov, Feb 2010

4) “BFS & MapReduce”, Edward J Yoon
   http://blog.udanax.org/2009/02/breadth-first-search-
   mapreduce.html, Feb 26 2009

5) “Market Basket Analysis Algorithm with no-SQL DB HBase
   and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang
   Xu, Seon Ho Kim, The Third International Conference on
   Emerging Databases (EDB 2011), Songdo Park Hotel,
   Incheon, Korea, Aug. 25-27, 2011




HiPIC                    References
6) “Market Basket Analysis Algorithm with Map/Reduce of
   Cloud Computing”, Jongwook Woo and Yuhang Xu, The
   2011 International Conference on Parallel and Distributed
   Processing Techniques and Applications (PDPTA 2011), Las
   Vegas (July 18-21, 2011)

7) Building Realtime Big Data Services at Facebook with
   Hadoop and HBase, Jonathan Gray, Facebook, Nov 11, 2011,
   Hadoop World NYC

8) Analyzing Big Data at Twitter, Kevin Weil, Web 2.0 Expo,
   NYC, Sep 2010

9) Lessons Learned from Migrating 2+ Billion Documents at
   Craigslist, Jeremy Zawodny, 2011

10) Machine Learning on Hadoop at Huffington Post | AOL, Thu
    Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011


HiPIC                   References
11) “MapReduce Debates and Schema-Free”, Woohyun Kim,
    www.coordguru.com, http://blog.naver.com/wisereign, March
    3 2010

12) “Large Scale Data Analysis with Map/Reduce”, Marin
    Dimitrov, Feb 2010

13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13
    2009




                                                          CSULA
  Jongwook Woo

Big Data Fundamentals in the Emerging New Data World

  • 1. Big Data Fundamentals in the Emerging New Data World. PIT (Product Innovation Team), Samsung Electronics America, San Jose, CA, Aug 17th 2012. Jongwook Woo (PhD), High-Performance Internet Computing Center (HiPIC), Educational Partner with Cloudera and Grants Awardee of Amazon AWS, Computer Information Systems Department, California State University, Los Angeles (CSULA).
  • 2. Contents: Fundamentals of Big Data; NoSQL DB: HBase, MongoDB; Data-Intensive Computing: Hadoop; Big Data Supporters and Use Cases.
  • 3. HiPIC Experience in Big Data  Several publications regarding Hadoop and NoSQL  “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)  “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”,Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011  “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)  Jongwook Woo, “Introduction to Cloud Computing”, in the 10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009  Talks in Korean Universities and companies  Yonsei, Sookmyung, KAIST, Korean Polytech Univ – Winter 2011  VanillaBreeze – Winter 2011 CSULA Jongwook Woo
  • 4. Experience in Big Data (Cont’d). Grants: received the Amazon AWS in Education Research Grant (July 2012 - July 2014) and Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011). Partnership: Academic Education Partnership with Cloudera since June 2012. Certificate: Certificate of Achievement in the Big Data University training course “Hadoop Fundamentals I”, July 8 2012. Cloud Computing Blog: http://dal-cloudcomputing.blogspot.com/
  • 5. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing [Figure labels: Cloudera, AWS, Hortonworks, NoSQL DB]
  • 6. Big Data. Too much data: terabytes (10^12 bytes) to petabytes (10^15 bytes), generated by the web, sensor data, bioinformatics, social computing, smartphones, online games, and so on. The legacy approach cannot handle it: the data is too big, and it is unstructured or semi-structured.
  • 7. Two Issues in Big Data. How to store Big Data: NoSQL DB. How to compute Big Data: parallel computing with multiple cheap computers; no supercomputer is needed.
  • 8. Contents: Fundamentals of Big Data; NoSQL DB: HBase, MongoDB; Data-Intensive Computing: Hadoop; Big Data Supporters and Use Cases
  • 9. New Data Trend. Sparsity: schema-free data with sparse attributes (document term vectors, user-item matrices, semantic or social relations). No relational properties and no complex join queries (for example, log data).
  • 10. New Data Trend (Cont’d). Immutable: no need to update or delete data; records are only inserted, with versions. This allows history tracking and lock-free operation (key-based atomicity).
  • 11. Big Data for RDBMS. Issues in RDBMS. Hard to scale: relations get broken by partitioning (for scalability) and replication (for availability). Speed: seek times of physical storage are slower than network speed. Reading one 1 TB disk at a 10 MB/s transfer rate takes 100K sec, about 27.8 hours. With the data spread over multiple sources at different places, for example 100 disks of 10 GB each read in parallel at 10 MB/s, it takes 1K sec, about 16.7 minutes.
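The transfer times on slide 11 follow from dividing capacity by sustained throughput. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a rate of 10 MB/s per disk, which is the rate that reproduces the slide's 100K sec and 1K sec figures; the function name is illustrative only:

```python
def transfer_seconds(size_bytes: int, rate_bytes_per_sec: int) -> float:
    """Time to read size_bytes sequentially at a sustained rate."""
    return size_bytes / rate_bytes_per_sec

TB = 10**12
GB = 10**9
MB = 10**6

# One 1 TB disk read at 10 MB/s.
single_disk = transfer_seconds(1 * TB, 10 * MB)

# The same 1 TB spread over 100 disks of 10 GB each, read in parallel:
# the elapsed time is the time needed to read one 10 GB disk.
parallel_disks = transfer_seconds(10 * GB, 10 * MB)

print(f"single disk: {single_disk:.0f} s = {single_disk / 3600:.1f} h")
print(f"100 disks:   {parallel_disks:.0f} s = {parallel_disks / 60:.1f} min")
```

The speedup comes entirely from reading the disks at the same time, which is the motivation for spreading data over many cheap machines.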
  • 12. Big Data for RDBMS (Cont’d). Issues in RDBMS (Cont’d). Data integration: an RDBMS is not good for unstructured or semi-structured data, and much data is unstructured (web data, log data, etc.). Parallelization: an RDB cannot efficiently split 1000 tasks across 1000 cheap PCs.
  • 13. RDBMS Issues: Solution. Big Data ⇒ Data Cleansing by Hadoop ⇒ Data Computation (MapReduce, Pig) ⇒ Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB) ⇒ Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout
  • 14. NoSQL DBs. Not primarily built on tables, and generally do not use SQL for data manipulation. Non-relational, distributed data stores that often do not provide ACID (atomicity, consistency, isolation, durability), the key attributes of a classic RDB. Fast indexing over large amounts of data, with lookup by key (key/value). NoSQL DBs normally support MapReduce for parallel computation.
  • 15. Use Cases for NoSQL DB [1]. RDBMS replacement for high-traffic web applications; semi-structured content management; real-time analytics and high-speed logging; web infrastructure. Web 2.0, Media, SaaS, Gaming, Finance, Telecom, Healthcare, Government. Three NoSQL DB approaches: Key/Value, Column, Document.
  • 16. Data Store of NoSQL DB. Key/Value store: (Key, Value) pairs. Functions: index, versioning, sorting, locking, transaction, replication. Examples: Apache Cassandra, Memcached.
  • 17. Data Store of NoSQL DB (Cont’d). Column-Oriented Stores (Extensible Record Stores): store data tables as sections of columns of data rather than as rows of data, as most RDBMSs do. This suits fields that would be sparse in an RDBMS and OLAP-like workloads (e.g., data warehouses). Extensible records are partitioned horizontally and vertically across nodes, so rows and columns are distributed over multiple nodes. Examples: BigTable, HBase, Cassandra, Hypertable.
  • 18. Data Store of NoSQL DB (Cont’d). Example table:
      StudentId | Lastname | Firstname | email
      1         | Smith    | Joe       | smith@hi.com
      2         | Jones    | Mary      | mary@hi.com
      3         | Johnson  | Cathy     | cathy@hi.com
    Row oriented: 1, Smith, Joe, smith@hi.com; 2, Jones, Mary, mary@hi.com; 3, Johnson, Cathy, cathy@hi.com;
    Column oriented: 1, 2, 3; Smith, Jones, Johnson; Joe, Mary, Cathy; smith@hi.com, mary@hi.com, cathy@hi.com;
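The two layouts on slide 18 can be sketched in a few lines of Python. This is only an in-memory illustration of the storage order, not how any particular column store is implemented; the function names are hypothetical.

```python
students = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]

def row_oriented(rows):
    """Store each record contiguously: all fields of row 1, then row 2, ..."""
    return [list(row) for row in rows]

def column_oriented(rows):
    """Store each attribute contiguously: all ids, then all last names, ..."""
    return [list(column) for column in zip(*rows)]

row_layout = row_oriented(students)
col_layout = column_oriented(students)

# Reading one attribute for every record touches a single contiguous list
# in the column layout, but every record in the row layout.
emails = col_layout[3]
```

This is why column orientation tends to suit analytic scans over a few attributes, while row orientation suits reading or writing whole records.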
  • 19. HBase Schema Example (Student/Course). RDBMS: Students (id, name, sex, age); Courses (id, title, desc, teacher_id); S_C (s_id, c_id, type). HBase:
      Student table: row id <student_id>; column family Info (Info:name, Info:sex, Info:age); column family Course (Course:<course_id> = type)
      Course table: row id <course_id>; column family Info (Info:title, Info:desc, Info:teacher_id); column family student (student:<student_id> = type)
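The schema on slide 19 replaces the S_C join table with one cell per enrollment in each of the two rows involved. A rough sketch of that idea using plain Python dictionaries in place of HBase tables; the helper names (`put`, `enroll`, `family`) and the sample values are hypothetical and this is not the HBase client API:

```python
students = {}  # row key <student_id> -> {"family:qualifier": value}
courses = {}   # row key <course_id>  -> {"family:qualifier": value}

def put(table, row_key, column, value):
    """Write one cell, creating the row if it does not exist yet."""
    table.setdefault(row_key, {})[column] = value

def enroll(student_id, course_id, enroll_type):
    """The S_C relation becomes one cell in the student row and one in the course row."""
    put(students, student_id, f"Course:{course_id}", enroll_type)
    put(courses, course_id, f"student:{student_id}", enroll_type)

def family(table, row_key, family_name):
    """Return {qualifier: value} for every cell of one column family in a row."""
    prefix = family_name + ":"
    return {
        column[len(prefix):]: value
        for column, value in table.get(row_key, {}).items()
        if column.startswith(prefix)
    }

# Made-up sample data.
put(students, "s1", "Info:name", "Joe")
put(courses, "c1", "Info:title", "Databases")
enroll("s1", "c1", "elective")
enroll("s1", "c2", "required")
```

Listing a student's courses, or a course's students, is then a single-row read with no join, at the cost of writing each enrollment twice.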
  • 20. Data Store of NoSQL DB (Cont’d). Document Store: collections and documents, versus the tables and records of an RDB. Used in search engines and repositories. Multiple indexes store indexed documents, which have no fixed fields. Not a simple key-value lookup: access goes through an API. Functions: No locking, Replication, Transaction. Examples: MongoDB, CouchDB, ThruDB, SimpleDB.
  • 21. The Great Divide [1]. [Figure labels: MongoDB, HBase.] MongoDB sweet spot: Easy, Flexible, Scalable
  • 22. Understanding the Document Model [1]. { _id: "A4304", author: "nosh", date: 22/6/2010, title: "Intro to MongoDB", text: "MongoDB is an open source..", tags: ["webinar", "opensource"], comments: [{author: "mike", date: 11/18/2010, txt: "Did you see the…", votes: 7}, …] } Documents -> Collections -> Databases
  • 23. Document Model Makes Queries Simple [1]. Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group. Example: db.posts.find({author: "nosh", tags: "webinar"})
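To show what the query on slide 23 selects, here is a small pure-Python imitation of the matching rules. It is a sketch, not MongoDB or its driver: `find` and `field_matches` are hypothetical helpers, only a few operators are handled, operators are applied to scalar fields only, and the `views` field and the second and third posts are made-up sample data. The one behaviour it relies on is that a plain value matched against an array field matches when the array contains that value.

```python
import operator

COMPARISONS = {
    "$gt": operator.gt, "$lt": operator.lt,
    "$gte": operator.ge, "$lte": operator.le, "$ne": operator.ne,
}

def field_matches(value, condition):
    """Check one field against a plain value or a dict of operators."""
    if isinstance(condition, dict):
        for op, operand in condition.items():
            if op == "$in":
                ok = value in operand
            elif op == "$nin":
                ok = value not in operand
            else:
                ok = COMPARISONS[op](value, operand)
            if not ok:
                return False
        return True
    if isinstance(value, list):
        # A plain value matches an array field that contains it.
        return condition in value
    return value == condition

def find(collection, query):
    """Return the documents for which every queried field matches."""
    return [
        doc for doc in collection
        if all(key in doc and field_matches(doc[key], cond)
               for key, cond in query.items())
    ]

posts = [
    {"_id": "A4304", "author": "nosh", "tags": ["webinar", "opensource"], "views": 120},
    {"_id": "B0001", "author": "nosh", "tags": ["opensource"], "views": 40},
    {"_id": "C0001", "author": "mike", "tags": ["webinar"], "views": 75},
]

by_nosh_webinar = find(posts, {"author": "nosh", "tags": "webinar"})
popular_not_nosh = find(posts, {"views": {"$gt": 50}, "author": {"$ne": "nosh"}})
```

Because tags live inside the document as an array, the query needs no join against a separate tag table.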
  • 24. Selected Users [1]
  • 25. Contents: Fundamentals of Big Data; NoSQL DB: HBase, MongoDB; Data-Intensive Computing: Hadoop; Big Data Supporters and Use Cases
  • 26. Data nowadays. Data issues: data grows to 10 TB, and then 100 TB. Unstructured data comes from sources like Facebook, Twitter, RFID readers, sensors, and so on. Information must be derived from both the relational data and the unstructured data as soon as possible. Solution to efficiently compute Big Data: Hadoop Map/Reduce.
  • 27. Solutions in Big Data Computation. Map/Reduce by Google: (Key, Value) parallel computing. Apache Hadoop: Big Data ⇒ Data Computation (MapReduce, Pig). Integrating MapReduce and RDB: Oracle + Hadoop, Sybase IQ, Vertica + Hadoop, Hadoop DB, Greenplum, Aster Data. Integrating MapReduce and NoSQL DB: MongoDB MapReduce, HBase.
  • 28. Apache Hadoop. Motivated by Google Map/Reduce and GFS. An open source project of the Apache Foundation. A framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Two core components: Storage (HDFS, high-bandwidth clustered storage) and Processing (Map/Reduce, fault-tolerant distributed processing). Hadoop scales linearly with data size and analysis complexity.
  • 29. Hadoop issues. Map/Reduce is not a DB: it is an algorithm in restricted parallel computing. HDFS and HBase cannot compete with the functions of an RDBMS. But Hadoop is useful for semi-structured data models and high-level dataflow query languages on top of MapReduce (Pig, Hive, Jaql, Cascading, Cloudbase). It is also useful for huge (peta- or terabyte) but uncomplicated data: web crawling, log analysis (log files for web companies), and the New York Times case.
  • 30. MapReduce Pros & Cons Summary. Good when: input, intermediate, and output data are huge; little synchronization is required; data is read once in batch-oriented datasets (ETL). Bad for: fast response time; large amounts of shared data; fine-grained synchronization; CPU-intensive rather than data-intensive work; continuous input streams.
  • 31. MapReduce in Detail. Functions borrowed from functional programming languages (e.g., Lisp). Provides a restricted parallel programming model on Hadoop. The user implements Map() and Reduce(); the libraries (Hadoop) take care of EVERYTHING else: parallelization, fault tolerance, data distribution, load balancing.
  • 32. Map. Converts input data to (key, value) pairs. map() functions run in parallel, creating different intermediate (key, value) pairs from different input data sets.
  • 33. Reduce. reduce() combines those intermediate values into one or more final values for that same key. reduce() functions also run in parallel, each working on a different output key. Bottleneck: the reduce phase can’t start until the map phase is completely finished.
  • 34. Example: sort URLs by hit count, largest first. Compute the most-hit URLs, which are stored in log files. Map(): input <logFilename, file text>; output: parses the file and emits <url, hit count> pairs, e.g. <http://hello.com, 1>. Reduce(): input <url, list of hit counts> from multiple map nodes; output: sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>.
  • 35. Map/Reduce for URL visits. Input log data goes to Map1(), Map2(), …, Mapm(), which emit pairs such as (http://hi.com, 1), (http://halo.com, 1), (http://hello.com, 3), (http://hello.com, 5), … Data aggregation/combine groups them by key: (http://hi.com, <1, 1, …, 1>), (http://halo.com, <1, 5>), (http://hello.com, <3, 5, 2, 7>). Reduce1(), Reduce2(), …, Reducel() emit the totals: (http://hi.com, 32), (http://halo.com, 6), (http://hello.com, 17).
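The URL hit-count flow of slides 34 and 35 can be simulated in a single Python process. This is a sketch of the map, shuffle, and reduce steps, not Hadoop code; the log format (one requested URL per line), the file names, and the function names are assumptions made for the example.

```python
from collections import defaultdict

def map_log(filename, text):
    """Map: parse one log file and emit a (url, 1) pair per request line."""
    for line in text.splitlines():
        url = line.strip()
        if url:
            yield url, 1

def shuffle(pairs):
    """Group the intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_hits(url, counts):
    """Reduce: sum all hit counts emitted for the same URL."""
    return url, sum(counts)

# Hypothetical input: two small log files.
logs = {
    "log1.txt": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "log2.txt": "http://hello.com\nhttp://halo.com\n",
}

intermediate = [pair for name, text in logs.items() for pair in map_log(name, text)]
totals = dict(reduce_hits(url, counts) for url, counts in shuffle(intermediate).items())

# Final step of the example: order the URLs by hit count, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

On a cluster each `map_log` call would run on the node holding its file and each `reduce_hits` call on the node responsible for its key; the final sort is a separate step over the much smaller reduce output.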
  • 36. Legacy Example. In late 2007, the New York Times wanted to make its entire archive of articles available over the web: 11 million in all, dating back to 1851, held as a four-terabyte pile of images in TIFF format. The Times needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files. This was not a particularly complicated computing chore, but a large one, requiring a whole lot of computer processing time.
  • 37. Legacy Example (Cont’d). In late 2007, the New York Times wanted to make its entire archive of articles available over the web. A software programmer at the Times, Derek Gottfrid, who had been playing around with Amazon Web Services' Elastic Compute Cloud (EC2), uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3). In less than 24 hours, 11 million PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site. The total cost for the computing job? $240: 10 cents per computer-hour times 100 computers times 24 hours.
  • 38. Contents: Fundamentals of Big Data; NoSQL DB: HBase, MongoDB; Data-Intensive Computing: Hadoop; Big Data Supporters and Use Cases
  • 39. Supporters of Big Data. Apache Hadoop supporters: Cloudera (to Hadoop what Redhat is to Linux; HiPIC is an Academic Partner), Hortonworks (Pig), Facebook (Hive), IBM (Jaql). NoSQL DB supporters: MongoDB (HiPIC tries to collaborate); HBase, CouchDB, Apache Cassandra (originally by FB), etc.
  • 40. Similarities in Pig, Hive, and Jaql. All translate high-level languages into MapReduce jobs, so the programmer can work at a higher level than writing MapReduce jobs in Java or other lower-level languages. Programs are much smaller than the equivalent Java code. The languages can be extended, often by writing user-defined functions in Java. Interoperability: programs written in these high-level languages can be embedded inside other languages as well. They share Hadoop's limitations: no support for random reads and writes or for low-latency queries.
  • 41. Pig. Developed at Yahoo Research around 2006 and moved into the Apache Software Foundation in 2007. PigLatin, Pig's language, is a data flow language well suited to processing unstructured data. Unlike SQL, it does not require that the data have a schema; however, it can still leverage the value of a schema.
  • 42. Hive. Developed at Facebook; turns Hadoop into a data warehouse, complete with a dialect of SQL for querying. HiveQL is a declarative language (a SQL dialect). Difference from PigLatin: you do not specify the data flow but instead describe the result you want, and Hive figures out how to build a data flow to achieve it. A schema is required, but data is not limited to one schema; it can have many schemas.
  • 43. Hive (Cont'd). Similarity with PigLatin and SQL: HiveQL on its own is a relationally complete language but not a Turing-complete language, that is, one that can express any computation. Like Pig, it can be extended through UDFs (User Defined Functions) written in Java to become Turing complete.
  • 44. Jaql. Developed at IBM. A data flow language whose native data structure format is JSON (JavaScript Object Notation). Schemas are optional. Turing complete on its own, without the need for extension through UDFs.
  • 45. Use Cases: Amazon AWS, Facebook, Twitter, Craigslist, HuffPost | AOL
  • 46. Amazon AWS. amazon.com: consumer and seller business. aws.amazon.com: IT infrastructure business (focus on your business, not IT management). Pay as you go: pay for servers by the hour, for storage per gigabyte per month, and for data transfer per gigabyte. Services with many APIs: S3 (Simple Storage Service) and EC2 (Elastic Compute Cloud). EC2 provides many virtual Linux servers, on which Hadoop, HBase, and MongoDB can run across multiple nodes.
  • 47. Amazon AWS (Cont’d). Customers on aws.amazon.com: Samsung (Smart TV hub sites: TV applications are on AWS); Netflix (~25% of US internet traffic, ~100% on AWS); NASA JPL (analyzes more than 200,000 images); NASDAQ (uses AWS S3). HiPIC received research and teaching grants from AWS.
  • 48. Facebook [7]. Uses Apache HBase for Titan and Puma. HBase for FB provides excellent write performance and good reads, plus nice features: scalability, fault tolerance, MapReduce.
  • 49. Titan: Facebook. Message services in FB: hundreds of millions of active users, 15+ billion messages a month, 50K instant messages a second. Challenges: high write throughput (every message, instant message, SMS, email) and massive clusters that must be easily scalable. Solution: clustered HBase.
  • 50. Puma: Facebook. ETL (Extract, Transform, Load): integrating data from many data sources into a data warehouse. Data analytics: domain owners’ web analytics for ads and apps (clicks, likes, shares, comments, etc.). ETL before Puma: 8 - 24 hours; procedures: Scribe, HDFS, Hive, MySQL. ETL after Puma, a real-time MapReduce framework: 2 - 30 secs; procedures: Scribe, HDFS, Puma, HBase.
  • 51. Twitter [8]. Three challenges. Collecting data: Scribe, as at FB. Large-scale storage and analysis: Cassandra (a ColumnFamily key-value store) and Hadoop. Rapid learning over Big Data: Pig, with 5% of the Java code, 5% of the dev time, and within 20% of the running time.
  • 52. Craigslist in MongoDB [9]. Craigslist: ~700 cities worldwide, ~1 billion hits/day, ~1.5 million posts/day. Servers: ~500 servers, ~100 MySQL servers. Migrating to MongoDB: Scalable, Fast, Proven, Friendly.
  • 53. HuffPost | AOL [10]. Two machine learning use cases. Comment moderation: evaluate all new HuffPost user comments every day, identify abusive or aggressive comments, and auto delete or publish ~25% of comments every day. Article classification: tag articles for advertising (e.g., scary, salacious, …).
  • 54. HuffPost | AOL [10]. Parallelize on Hadoop. Good news: Mahout, a parallel machine learning tool, is already available, and Mallet, libsvm, Weka, … support the necessary algorithms. Bad news: Mahout doesn’t support the necessary algorithms yet, and the other tools do not run natively on Hadoop. Approach: build a flexible ML platform running on Hadoop, with Pig for the Hadoop implementation.
  • 55. MapReduce Example. Word count (in the previous slide). Shortest path in a graph: graph algorithms are very suitable for M/R, especially BFS (spreading-activation type of processing). Map: input is a node n as the key and (D, points-to) as its value, where D is the distance to the node from the start and points-to is a list of nodes reachable from n; output: ∀p ∈ points-to, emit (p, D+1). Reduce: input is the possible distances to a given p; output selects the minimum one. Multiple iterations are performed. Iterative processes for matrices, graphs, and networks: is Apache HAMA needed for iterative processing on Hadoop?
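The BFS formulation on slide 55 can be simulated in plain Python. This is a single-process sketch of the iteration, not Hadoop code. Two details go beyond the slide and are assumptions of this sketch: the mapper also passes each node's adjacency list and current distance through to the reducer so the graph survives between iterations, and the loop stops when an iteration changes nothing. The graph and all function names are hypothetical.

```python
import math
from collections import defaultdict

def bfs_map(node, value):
    """Map: for node n with (D, points_to), emit (p, D+1) for each reachable p."""
    dist, points_to = value
    yield node, ("graph", points_to)  # pass the adjacency list through
    yield node, ("dist", dist)        # keep the best distance known so far
    if dist != math.inf:
        for p in points_to:
            yield p, ("dist", dist + 1)

def bfs_reduce(node, values):
    """Reduce: select the minimum of the candidate distances for this node."""
    points_to, best = [], math.inf
    for tag, payload in values:
        if tag == "graph":
            points_to = payload
        else:
            best = min(best, payload)
    return node, (best, points_to)

def bfs_iteration(state):
    """One map, shuffle, reduce round over the whole graph."""
    groups = defaultdict(list)
    for node, value in state.items():
        for key, out in bfs_map(node, value):
            groups[key].append(out)
    return dict(bfs_reduce(node, values) for node, values in groups.items())

def shortest_paths(graph, start):
    """Repeat rounds until no distance changes; unreachable nodes stay at infinity."""
    state = {n: (0 if n == start else math.inf, adj) for n, adj in graph.items()}
    while True:
        new_state = bfs_iteration(state)
        if new_state == state:
            return {n: d for n, (d, _) in new_state.items()}
        state = new_state

# Hypothetical directed graph; every node appears as a key.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": [], "F": ["A"]}
distances = shortest_paths(graph, "A")
```

Each round extends the frontier by one hop, so the number of MapReduce jobs grows with the graph's diameter, which is the cost that motivates the slide's question about a dedicated iterative framework.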
  • 56. MapReduce Example (Cont’d). Social N/W analysis: recommend new friends (friend of a friend: FOAF). Map: in (x, <friends_x>); out: if (u, x) are friends, emit (u, <friends_x / friends_u>), where <friends_x / friends_u> is the friends of x who are not friends of u; otherwise nil. Reduce: in (u, <<friends_a / friends_u>, <friends_b / friends_u>, …>), the friends lists of all users a, b, … who are friends of u; out: (u, <(X1, N1), (X2, N2), …>), where Xm is a FOAF of u and Nm is the total number of occurrences in all FOAF lists, used to sort or rank the results.
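The friend-of-a-friend recommendation on slide 56 can be sketched the same way. This is a single-process illustration, not Hadoop code. It assumes friendships are symmetric and that the mapper can look up the friends of u, which a real job would have to arrange (for example through a side table); the names and the sample network are made up.

```python
from collections import Counter, defaultdict

# Hypothetical symmetric friendship lists.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}

def foaf_map(x, friends_x, lookup):
    """Map: for each friend u of x, emit the friends of x who are not friends of u."""
    for u in friends_x:
        candidates = friends_x - lookup[u] - {u}
        if candidates:
            yield u, candidates

def foaf_reduce(u, candidate_sets):
    """Reduce: count how often each candidate occurs and rank by that count."""
    counts = Counter()
    for candidates in candidate_sets:
        counts.update(candidates)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return u, ranked

groups = defaultdict(list)
for x, friends_x in friends.items():
    for u, candidates in foaf_map(x, friends_x, friends):
        groups[u].append(candidates)

recommendations = dict(foaf_reduce(u, sets) for u, sets in groups.items())
```

The count Nm is the number of mutual friends between u and the candidate, so ranking by it puts the most strongly connected non-friends first.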
  • 57. MapReduce Example (Cont’d). Inverted indexing (full text search). Map (3 nodes). Input: Doc1: “Columbus’s egg”; Doc2: “The chicken and egg problem”; Doc3: “Easter Egg”. Output: Map1: (“columbus’s”, (doc1, 1)), (“egg”, (doc1, 2)); Map2: (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“egg”, (doc2, 4)), (“problem”, (doc2, 5)); Map3: (“easter”, (doc3, 1)), (“egg”, (doc3, 2)). Intermediate shuffle: (“columbus’s”, (doc1, 1)), (“egg”, <(doc1, 2), (doc2, 4), (doc3, 2)>), (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“problem”, (doc2, 5)), (“easter”, (doc3, 1))
  • 58. MapReduce Example (Cont’d). Inverted indexing (full text search) (Cont’d). Reduce. Input: (“columbus’s”, (doc1, 1)), (“egg”, <(doc1, 2), (doc2, 4), (doc3, 2)>), (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“problem”, (doc2, 5)), (“easter”, (doc3, 1)). Output: same as above. If instead the input were (“egg”, <(doc1, 2), (doc1, 4), (doc3, 2)>), the output would be (“egg”, <(doc1, <2, 4>), (doc3, 2)>).
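The inverted-index example of slides 57 and 58 maps each word to the documents and positions where it occurs. A minimal single-process sketch, assuming whitespace tokenization and lowercasing; the function names are illustrative and this is not Hadoop code:

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Map: emit (word, (doc_id, position)) for every word, positions from 1."""
    for position, word in enumerate(text.lower().split(), start=1):
        yield word, (doc_id, position)

def index_reduce(word, postings):
    """Reduce: group the positions of one word by document."""
    by_doc = defaultdict(list)
    for doc_id, position in postings:
        by_doc[doc_id].append(position)
    return word, {doc_id: sorted(positions) for doc_id, positions in by_doc.items()}

docs = {
    "doc1": "Columbus's egg",
    "doc2": "The chicken and egg problem",
    "doc3": "Easter Egg",
}

# Shuffle: collect all postings emitted for the same word.
shuffled = defaultdict(list)
for doc_id, text in docs.items():
    for word, posting in index_map(doc_id, text):
        shuffled[word].append(posting)

index = dict(index_reduce(word, postings) for word, postings in shuffled.items())
```

A full text search for a word is then a single lookup in `index`, and the stored positions allow phrase queries to check adjacency.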
  • 59. Conclusion. Era of Big Data. Need to store and compute Big Data: storage with NoSQL DB, computation with Hadoop MapReduce. Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
  • 61. References 1) “Introduction to MongoDB”, Nosh Petigara, Jan 11, 2011 2) “Hadoop Fundamentals I”, Big Data University 3) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010 4) “BFS & MapReduce”, Edward J Yoon, http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html, Feb 26 2009 5) “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, The Third International Conference on Emerging Databases (EDB 2011), Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011
  • 62. References 6) “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011) 7) “Building Realtime Big Data Services at Facebook with Hadoop and HBase”, Jonathan Gray, Facebook, Hadoop World NYC, Nov 11, 2011 8) “Analyzing Big Data at Twitter”, Kevin Weil, Web 2.0 Expo, NYC, Sep 2010 9) “Lessons Learned from Migrating 2+ Billion Documents at Craigslist”, Jeremy Zawodny, 2011 10) “Machine Learning on Hadoop at Huffington Post | AOL”, Thu Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011
  • 63. References 11) “MapReduce Debates and Schema-Free”, Woohyun Kim, www.coordguru.com, http://blog.naver.com/wisereign, March 3 2010 12) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010 13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13 2009

Editor's notes

  1. If you choose to supply one. Like SQL, PigLatin is relationally complete, which means it is at least as powerful as relational algebra. Turing completeness requires looping constructs, an infinite memory model, and conditional constructs. PigLatin is not Turing complete on its own, but is Turing complete when extended with User-Defined Functions in Java.