Paradigm Shifts:

                 Big Data


                                         Pini Cohen
                                    VP and Senior Analyst




Tell me and I’ll forget
Show me and I may remember
Involve me and I’ll understand

STKI Summit 2012
The “Magic” of internet companies




                                                                                Source: http://venturebeat.com/2011/10/24/next-hot-internet-companies-not-in-us/internet-company-growth/
              Pini Cohen’s work Copyright STKI@2012
                                                                            2
              Do not remove source or attribution from any slide or graph
Pinterest




Pinterest Architecture Update – 18 Million Visitors, 10x Growth, 12 Employees, 410 TB of Data


       • 80 million objects stored in S3 with 410 terabytes of user
         data, 10x what they had in August. EC2 instances have
         grown by 3x. Around $39K for S3 and $30K for EC2 a month.
       • Pay for what you use saves money. Most traffic happens in
         the afternoons and evenings, so they reduce the number of
         instances at night by 40%.
       • 12 employees as of last December. Using the cloud a site can
         grow dramatically while maintaining a very small team.
         Looks like 31 employees as of now.




         Source: http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html




Instagram

     • The Instagram philosophy:
        • Simplicity
        • Optimized for minimal operational burden
        • Instrument everything




Scaling Instagram

     • Instagram went to 30+ million users in less than two years
       and then rocketed to 40 million users 10 days after the
       launch of its Android application.
     • After the release of the Android app, they had 1 million new
       users in 12 hours.

     • 2 engineers in 2010.
     • 3 engineers in 2011
     • 5 engineers in 2012, 2.5 on the backend. This includes
       iPhone and Android development.


      Source: http://highscalability.com/blog/2012/4/16/instagram-architecture-update-whats-new-with-instagram.html




Tumblr – Microblogging & social networking platform


    •   500 million page views a day
    •   15B+ page views a month
    •   Peak rate of ~40k requests per second
    •   1+ TB/day into Hadoop cluster
    •   Many TB/day into MySQL/HBase/Redis/Memcache
    •   Growing at 30% a month
    •   ~1000 hardware nodes in production (not cloud)
    •   ~20 engineers (total 106 employees)



                                                                               Source: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html STKI modifications




Technology listing

     • Hadoop & MapReduce
     • NoSQL DBMS (Cassandra, Mongo, HBase)
     • Sharding
     • In-memory DBMS
     • Memcached
     • MemSQL
     • Solr
     • Redis
     • Django
     • Python
     • ELB – Amazon Elastic Load Balancing

Paradigm shifts agenda

     • Big Data:
        • Big Data definition and background
        • Big Data value
        • Big Data technology




                                                                                 Source: http://www.b2binbound.com/blog/?Tag=paradigm%20shift




Big Data Definition – 4 V’s (or more…)

     • Volume – tens of TBs and more (15-20TB+)
     • Velocity – the speed at which data is added – 10M items
       per hour and more – and the speed at which the data needs
       to be processed
     • Variety – different types of data – structured &
       unstructured. In many cases deals with internet of things and
       social media, but also with voice, video, etc.
     • Variability – able to cope with new attributes and changing
       data types – without interrupting the analytical process
       (without “import-export”)
     • Other optional V’s – validity, volatility, viscosity (resistance
       to flow), etc.    source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html




The origins of the 3V’s:

      • 2001 research by Doug Laney from META Group (now
        Gartner):




Main current usage of the “Big Data” theme:

     • “Big Data” is just marketing jargon. – Doug Laney,
       Gartner source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html




                                                                                             Source: http://winnbadisa.com/wp-content/uploads/2011/12/marketing-career-cloud.jpg
     • STKI: doing something significantly different from
       what you’ve done until now

Big Data at work:

     • Orbitz Worldwide has collected 750 terabytes of
       unstructured data on their consumers’ behavior – detailed
       information from customer online visits and browsing
        sessions. Using Hadoop, models have been developed that are
        intended to improve search results and tailor the user
        experience based on everything from location and interest in
        family travel versus solo travel to the kind of device
        being used to explore travel options.
     • The result? To date, a 7% increase in interaction rate, 37%
       growth in stickiness of sessions, and a net 2.6% increase in
       booking path engagement.


             Source: http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_cons_techtrends2012_013112.pdf




DW appliances will be discussed later




(Appliance logos shown: Teradata, EMC Greenplum, Oracle Exadata, Microsoft Parallel Data Warehouse)

       Source: http://www.asugnews.com/2011/09/06/inside-saps-product-naming-strategies/
What is the business value of big data analytics?

     • Big data is now a technology looking for a business need
     • It can mean doing the same thing but better / faster
       (better segmentation, more accurate analysis model)
     • Or it can mean doing completely new things (telematics,
       sentiment analysis, recommendation engine, matching
       competition’s pricing in real time, being able to analyze
       data we haven’t been able to analyze in the past)




Decision making – old school vs. new school (big data)

     • Old School:
         • Phase 1: Analyze existing data and prepare a general model
         • Phase 2: Apply the general model to a specific client
         • This means applying the same model to many clients when they
           arrive
     • Issues with Old School decision making:
        • Time gap between preparing and applying the model
         • # of combinations might be too big for a general model (example:
           recommendations based on interests)
        • The general model generated is biased towards “main stream”
          population
     • New School (Big Data):
         • Phase 1: Prepare a specific model for the client and apply the model
           – instantly

Big data use cases

     • Recommendation engines – match users to one another
       and provide recommendations based on similar users
       (examples: LinkedIn – people you may know; Amazon)
     • Sentiment analysis (macro or individual user)
     • Fraud detection – customer behavior, historical and
       transactional data combined. The same analysis, but more affordable
     • Customer Churn
     • Social graph analysis – influencers
     • Customer experience analysis – combine data from call
       center, web, social media etc.
     • Improved segmentation – more data (clickstream, call
       records) for more accurate analysis
     • Improved customer retention

Technology: Elements & Concepts


      • Storing data for analytics (mainly):
          • HDFS – Hadoop Distributed File System
          • MapReduce – programming model, mainly for analytics
          • Other add-ons: Pig, Hive, JAQL (IBM)
      • Storing and retrieving data - DBMS:
         • NoSQL – DBMS (not only SQL):
             •   Cassandra
             •   MongoDB
             •   CouchDB
              •   HBase




Who Uses Hadoop?

      •   Amazon/A9
      •   AOL
      •   Facebook
      •   Fox Interactive Media
      •   Netflix
      •   New York Times
      •   Quantcast
      •   Rackspace/Mailtrust
      •   Veoh
      •   Yahoo!
      •   PowerSet (now Microsoft)

  More at http://wiki.apache.org/hadoop/PoweredBy
Who Uses Cassandra?

      •   Facebook
      •   Digg
      •   Despegar
      •   Ooyala
      •   Imagini
      •   SimpleGeo
      •   Rackspace
      •   Shazam
      •   SoftwareProjects
Big Data technologies (Hadoop etc.) vs. traditional IT


  Traditional IT                               | Big Data
  Centralized storage                          | Local storage
  Brand redundant servers                      | Cheap HW & white boxes
  Standard infrastructure and virtual servers  | Is standardization needed?! (at the HW level). No server virtualization.
  Well-established backup and DRP procedures   | Why do I need backup? How do I tackle DRP (compute clusters stretched over locations)?
  Traditional vendors                          | Open source solutions
  Mature products and procedures               | New patches for specific issues sometimes state “not implemented yet”
  Traditional programming, SQL                 | Different kind of programming (map-reduce), no joins

       Will Big Data infrastructure be part of existing infrastructure or will it be
                              developed as a new domain?
New type of scale:

     • Hadoop:
        • Up to 4,000 machines in a cluster
        • Up to 20 PB in a cluster
      • Currently, traditional IT technologies cannot handle this
        kind of scale.
     • This scale comes with a cost!




                                                                               Source: http://www.techsangam.com/wp-content/uploads/2012/01/i_love_scalability_mug.jpg




Brewer's (CAP) Theorem

     • It is impossible for a distributed computer system to
       simultaneously provide all three of the following
       guarantees:
        • Consistency (all nodes see the same data at the same time)
        • Availability (node failures do not prevent survivors from
          continuing to operate)
         • Partition tolerance (the system continues to operate despite
           network partitions and arbitrary message loss)




                  Source: Scalebase STKI modifications

                                                                               Professor Eric A. Brewer
Dealing With CAP

     • Drop Consistency
        • Welcome to the “Eventually Consistent” term.
             • In the end – everything will work out just fine. And hey,
               sometimes this is a good enough solution
        • When no updates occur for a long period of time, eventually all
          updates will propagate through the system and all the nodes will
          be consistent
        • For a given accepted update and a given node, eventually either
          the update reaches the node or the node is removed from service
        • Known as BASE (Basically Available, Soft state, Eventual
          consistency), as opposed to ACID




                                                           Source: Scalebase
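The eventually-consistent behavior above can be sketched in a few lines of Python. This is a minimal, hypothetical model (the `Replica` class and its last-write-wins merge are illustrative, not any product’s API): a write lands on one node, reads are briefly inconsistent, and once updates stop, anti-entropy rounds propagate the newest value to every node.

```python
import time

class Replica:
    """One node holding a (value, timestamp) pair; merges are last-write-wins."""
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value):
        self.value, self.ts = value, time.time()

    def merge(self, peer):
        # Anti-entropy: adopt the peer's value if it carries a newer timestamp.
        if peer.ts > self.ts:
            self.value, self.ts = peer.value, peer.ts

nodes = [Replica() for _ in range(3)]
nodes[0].write("v1")                # the update reaches one node only
print({n.value for n in nodes})     # contains both "v1" and None – inconsistent

for n in nodes:                     # updates stop; gossip rounds run
    for peer in nodes:
        n.merge(peer)

print(all(n.value == "v1" for n in nodes))  # True – all nodes converged
```

This is exactly the BASE trade: the read right after the write may return stale data, but given enough gossip rounds without new updates, every node ends up with the same value.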
Hadoop

    • Apache Hadoop is a software framework that supports
      data-intensive distributed applications
    • It enables applications to work with thousands of nodes
      and petabytes of data.
    • Hadoop was inspired by Google's MapReduce and Google
      File System (GFS) papers
    • Contains (basically):
          • HDFS – Hadoop Distributed File System
         • MapReduce programming model




HDFS – Hadoop Distributed File System

        • Parallel
        • Distributed on commodity elements
        • Throughput over latency
        • Reliable and self-healing
        • For large scale – typical file is gigabytes to terabytes (for
          one file!)
        • Applications need a write-once-read-many access
          model (mainly analytics)




HDFS motivation

     • What if you needed to write a program that distributes
       data on commodity HW (PCs or servers)? You would need
       to take care of:
        •   Where the data is located
        •   How to distribute data between the nodes
        •   How many times you want to replicate the data
        •   How to insert, select and update data
        •   What to do if one or more nodes fail
        •   How to add or remove a node
        •   Managing and monitoring the environment
     • The Hadoop Distributed File System does all this for you!



HDFS: Hadoop Distributed File System

               • Data nodes and name node
               • Client requests metadata about a file from the namenode
               • Data is served directly from the datanode



(Diagram: the HDFS client asks the namenode for metadata – file name → block id and block location – then reads block data, by block id and byte range, directly from a datanode; each datanode stores blocks on its Linux local file system, sends state to the namenode and receives instructions from it)

                                     source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
Datanode Blockreports


File “part-0” will be replicated twice and saved in blocks 1
and 3 (the file is big, so it has to be divided into 2 blocks).

Block 1 is on data nodes A and C.

                                                          source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
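The block arithmetic on this slide can be sketched in Python. This is a hypothetical helper (`plan_blocks` and its round-robin placement are an illustrative simplification, not HDFS’s actual placement policy): a file larger than the block size is split into blocks, and each block’s replicas are spread across the data nodes.

```python
import math

def plan_blocks(file_size, block_size, replication, datanodes):
    """Split a file into blocks and place each block's replicas
    round-robin over the data nodes (a simplification of HDFS)."""
    n_blocks = math.ceil(file_size / block_size)
    return {b: [datanodes[(b + r) % len(datanodes)]
                for r in range(replication)]
            for b in range(n_blocks)}

# A 100 MB file with 64 MB blocks and replication factor 2,
# as on the slide, yields 2 blocks with 2 replicas each:
plan = plan_blocks(100 * 2**20, 64 * 2**20, replication=2,
                   datanodes=["A", "B", "C"])
print(plan)  # {0: ['A', 'B'], 1: ['B', 'C']}
```

The namenode’s blockreport is essentially the inverse of this mapping: each datanode periodically reports which blocks it holds.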
HDFS basic limitations

     • Namenode is a single point of failure
     • Write-once model (with plans to support appending writes)
     • A namespace with an extremely large number of files
       exceeds the namenode’s capacity to maintain
     • Cannot be mounted by an existing OS
     • Getting data in and out is tedious
     • HDFS does not implement / support user quotas / access
       permissions
     • No automatic data balancing schemes
     • No periodic checkpoints

Map Reduce programming model

    • At its most basic – brings the program to the data
    • Contains two elements:
        • Map: this part of the job is performed in parallel &
          asynchronously by each node
        • Reduce: gathers the results from the relevant nodes
    • In more detail:
        • Map: returns (writes to a temp file) a list containing zero or more
          (k, v) pairs
            • Output can have a different key from the input
            • Output can have the same key
        • Reduce: returns a new list of reduced output from the input




MapReduce motivation

    • What if you needed to write a program that processes data
      that’s on distributed computers?
    • You would need to write a distributed program that handles:
       • Finding where the data is located
       • Working on each node and then combining the results from all
         nodes
       • Where (on the local node) and how (in what format) to write the
         intermediate results
       • Finding out when the jobs of all participating nodes have concluded,
         and then starting the “aggregation” part
       • What to do if a job is stuck (restart the job or turn to another node
         to perform the same job)
    • Hadoop MapReduce is the framework for you!

MapReduce example:

    # Word count, rewritten as Python; emit_intermediate and emit stand in
    # for the framework callbacks that collect (key, value) pairs.
    def map(key, value):
        # key: document name
        # value: document contents
        for w in value.split():
            emit_intermediate(w, "1")

    def reduce(key, values):
        # key: a word
        # values: a list of counts
        result = 0
        for v in values:
            result += int(v)
        emit(str(result))

Dataflow in Hadoop



(Diagram: a client submits a Word Count job to the master, which schedules map and reduce tasks across the cluster – all elements run on standard HW)

                                                                    Source: Haifa Labs IBM
Dataflow in Hadoop




(Diagram: the input file is read from HDFS in blocks; Block 1, “Hello World Bye World”, is mapped to Hello 1, World 2, Bye 1; Block 2, “Hello Hadoop Goodbye Hadoop”, is mapped to Hello 1, Hadoop 2, Goodbye 1; the map outputs feed the reduce tasks)

                                                                             Source: Haifa Labs IBM
Dataflow in Hadoop




(Diagram: each map task writes its intermediate results to the local FS, then reports “Finished” to the master, which passes “Finished + Location” on to the reduce tasks)

                                                                  Source: Haifa Labs IBM
Dataflow in Hadoop




(Diagram: the reduce tasks fetch the intermediate map outputs from each mapper’s local FS via HTTP GET)

                                                                  Source: Haifa Labs IBM
Dataflow in Hadoop




(Diagram: each reduce task writes its part of the final answer to HDFS – Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2)

                                                                  Source: Haifa Labs IBM
Components of Cluster Node


(Diagram: software stack of a cluster node)
             • Flow file input processor
             • Flow analysis map / reduce
             • Flow-tools
             • Hadoop
                • HDFS
                • MapReduce (MapReduce library)
             • Java VM
             • OS: Linux
             • Hardware (CPU, HDD, Memory, NIC)

                Source: www.caida.org/workshops/.../wide-casfi1004_wkang.ppt
Hive: MapReduce helper:

     • Code Example:
        • hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
        • hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
        • hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
        • hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
        • hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
        • hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
        • hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
NoSQL DBMS: storing and retrieving data

     • Key/Value
         • A big hash table
         • Examples: Voldemort, Amazon’s Dynamo
     • Big Table
         • Big table, column families
         • Examples: Hbase, Cassandra
     • Document based
         • Collections of collections
         • Examples: CouchDB, MongoDB
     • Graph databases
         • Based on graph theory
         • Examples: Neo4J
     • Each solves a different problem


                                                             Source: Scalebase

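The four NoSQL data models above can be illustrated with plain Python literals – a rough sketch (all record contents are invented examples) just to show how the same kind of data takes a different shape in each model:

```python
# Key/Value: a big hash table – one opaque value per key
kv_store = {"user:42": b"...serialized blob..."}

# Big Table / column families: row key -> column family -> columns
column_store = {"user:42": {"profile":  {"name": "Dana", "city": "TLV"},
                            "activity": {"last_login": "2012-05-01"}}}

# Document: a collection of self-describing, nestable documents
doc_store = [{"_id": 42, "name": "Dana",
              "follows": [17, 99], "prefs": {"lang": "he"}}]

# Graph: nodes plus typed, directed edges
graph_store = {"nodes": {42: "Dana", 17: "Avi"},
               "edges": [(42, "follows", 17)]}

# Each model answers a different question efficiently; e.g. the graph
# model makes "who does user 42 follow?" a simple edge scan:
follows = [dst for src, rel, dst in graph_store["edges"]
           if src == 42 and rel == "follows"]
print(follows)  # [17]
```

Which shape fits depends on the query: point lookups favor key/value, sparse wide rows favor column families, nested records favor documents, and relationship traversals favor graphs – each solves a different problem.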
Pros/Cons

     • Pros:
         • Performance
          • Big Data
         • Most solutions are open source
         • Data is replicated to nodes and is therefore fault-tolerant
           (partitioning)
         • Don't require a schema
         • Can scale up and down
      • Cons:
          •   Code change
          •   No framework support
          •   Not ACID
          •   Ecosystem (BI, backup)
          •   There is always a database at the backend
          •   Some APIs are just too simple
                                                               Source: Scalebase

Apache Cassandra

     • Cassandra is a highly scalable, eventually
       consistent, distributed, structured key-value
       store
     • A child of Google’s BigTable and Amazon’s
       Dynamo
     • Peer-to-peer architecture: all nodes are equal                               Source: ids.snu.ac.kr/w/images/1/18/2011SS-03.ppt

     • Cassandra’s replication factor (RF) is the
       number of nodes onto which each piece of data
       is placed. An RF of at least 2 is highly
       recommended, keeping in mind that your
       effective number of nodes is (N total nodes / RF).
     • CQL (Cassandra Query Language) command line
     • A timestamp is stored with each value written
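The RF arithmetic above can be written down directly. A minimal sketch (illustrative, not Cassandra code): each row is stored on RF nodes, so the cluster's effective capacity is roughly N / RF:

```python
def effective_capacity(total_nodes: int, rf: int) -> float:
    """Approximate number of 'logical' nodes worth of unique data:
    every row is replicated rf times, so capacity divides by rf."""
    return total_nodes / rf

# A 12-node cluster with RF=3 stores each row three times,
# so it holds only 4 nodes' worth of unique data:
assert effective_capacity(12, 3) == 4.0
```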


Consistent Hashing

• Partitioning uses consistent hashing (to pick the
  first node on which data is placed), based on an MD5
  distributed-hash-table algorithm
• Keys hash to a point on a fixed circular space
• The ring is partitioned into a set of ordered
  slots, and servers and keys are hashed over
  these slots
• Nodes take positions on the circle
• Example: nodes A, B, and D exist
  •   B is responsible for the AB range (for replication
      factor=2 – default)
  •   D is responsible for the BD range
  •   A is responsible for the DA range
• When C joins, B and D split their ranges,
  and C takes over the BC range from D
                                  Source: http://www.intertech.com/resource/usergroup/NoSQL.ppt
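The ring mechanics above can be sketched in a few lines of Python. This is an illustrative toy ring, not Cassandra's actual partitioner: keys and nodes are hashed with MD5 onto a circular space, and each key belongs to the first node clockwise from its hash. Note that when C joins, only the keys C takes over move:

```python
import bisect
import hashlib

def md5_point(value: str) -> int:
    """Hash a string to a point on the (very large) circular space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Nodes take positions on the circle, kept sorted for lookup.
        self.points = sorted((md5_point(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's hash (wrap around at the end).
        h = md5_point(key)
        idx = bisect.bisect(self.points, (h, "")) % len(self.points)
        return self.points[idx][1]

before = Ring(["A", "B", "D"])
after = Ring(["A", "B", "C", "D"])  # C joins the ring

keys = [f"user{i}" for i in range(100)]
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Only keys in the range C took over are reassigned; every moved key
# now belongs to C, and keys on other nodes stay put.
assert all(after.node_for(k) == "C" for k in moved)
```

This minimal-movement property is the whole point of consistent hashing: adding or removing one node reshuffles only a fraction of the keys.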



Cassandra’s tunable consistency (write)

Level: Behavior
• ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff
  recipients.
• ONE: Ensure that the write has been written to at least 1 replica's commit log and
  memory table before responding to the client.
• TWO: Ensure that the write has been written to at least 2 replicas before responding to
  the client.
• THREE: Ensure that the write has been written to at least 3 replicas before responding to
  the client.
• QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the
  client.
• LOCAL_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within
  the local datacenter (requires NetworkTopologyStrategy).
• EACH_QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each
  datacenter (requires NetworkTopologyStrategy).
• ALL: Ensure that the write is written to all N replicas before responding to the client. Any
  unresponsive replica will fail the operation.
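The quorum arithmetic in the table reduces to a one-liner. An illustrative sketch (not driver code; ANY and the topology-aware levels are omitted because they involve hints and datacenter layout):

```python
def quorum(rf: int) -> int:
    """QUORUM requires a majority of the rf replicas: rf // 2 + 1."""
    return rf // 2 + 1

def write_succeeds(level: str, rf: int, live_replicas: int) -> bool:
    """Does a write at the given consistency level get enough
    acknowledgements when only live_replicas of rf respond?"""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(rf),
        "ALL": rf,
    }[level]
    return live_replicas >= required

assert quorum(3) == 2                       # majority of 3 is 2
assert write_succeeds("QUORUM", 3, 2)       # survives one dead replica
assert not write_succeeds("ALL", 3, 2)      # ALL needs every replica
```

This is the tunable part: the same cluster can trade latency for consistency per request simply by changing the required count.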

Source: wiki
Cassandra’s data model structure

                 Think of Cassandra as row-oriented:
      • keyspace: settings (e.g., partitioner)
          • column family: settings (e.g., comparator, type [Std])
              • column: a (name, value, clock) triple
                                                                    Source: http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
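The hierarchy can be modeled with nested dicts. A rough sketch of the storage shape only (names like "Users" are hypothetical; real Cassandra storage is far more involved), with each column carried as the (name, value, clock) triple from the slide:

```python
import time

clock = int(time.time() * 1000)  # Cassandra timestamps every value written

keyspace = {
    "Users": {                                           # column family
        "row-1": {                                       # row key
            "email": ("email", "ada@example.com", clock) # (name, value, clock)
        }
    }
}

name, value, ts = keyspace["Users"]["row-1"]["email"]
assert name == "email" and value == "ada@example.com"
```

The clock is what lets replicas reconcile: when two copies of a column disagree, the one with the newer timestamp wins.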

Data Model – “flexible” scheme!

 ColumnFamily: Rockets

 Key 1:  name = Rocket-Powered Roller Skates; toon = Ready, Set, Zoom;
         inventoryQty = 5; brakes = false

 Key 2:  name = Little Giant Do-It-Yourself Rocket-Sled Kit; toon = Beep Prepared;
         inventoryQty = 4; brakes = false

 Key 3:  name = Acme Jet Propelled Unicycle; toon = Hot Rod and Reel;
         inventoryQty = 1; wheels = 1
                                   Source: http://wenku.baidu.com/view/6e254321482fb4daa58d4b87.html
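The "flexible scheme" point is that rows in one column family need not share columns: row 3 carries "wheels" where the others carry "brakes". As plain dicts (illustrative only):

```python
rockets = {
    1: {"name": "Rocket-Powered Roller Skates", "toon": "Ready, Set, Zoom",
        "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared", "inventoryQty": 4, "brakes": False},
    # Row 3 has its own column set: "wheels" instead of "brakes".
    3: {"name": "Acme Jet Propelled Unicycle", "toon": "Hot Rod and Reel",
        "inventoryQty": 1, "wheels": 1},
}

assert "brakes" in rockets[1] and "brakes" not in rockets[3]
assert "wheels" in rockets[3]
```

No schema migration is needed to add a column to one row, which is what "flexible scheme" buys you over a fixed relational table.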




Cassandra’s CQL – Cassandra SQL Language

     • SQL like. Example:
        • CREATE KEYSPACE test with strategy_class = 'SimpleStrategy' and
          strategy_options:replication_factor=1;
        • CREATE INDEX ON users (birth_date);
        • SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
     • However:
        • No JOINs
        • UPDATE/DELETE semantics differ from SQL (an UPDATE is an upsert;
          a DELETE writes a tombstone rather than removing data immediately)




NoSQL benchmark – for scale!




            Source: research.yahoo.com/files/ycsb-v4.pdf




Can we live with NoSQL limitations?

     • Facebook has dropped Cassandra
     • “..we found Cassandra's eventual consistency model to be a
       difficult pattern to reconcile for our new Messages
       infrastructure”
     • Facebook has selected HBase (a columnar DBMS).
       http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919




What about other NoSQL DBMS?

    • MongoDB
    • Hbase
    • CouchDB
    • Maybe next session….




Big Data potential implications on IT

     • Will traditional RDBMSs become obsolete? Surely not!
     • Several areas are Big Data zones by definition: Internet
       marketing, cyber, DW, etc.
     • How well can we live with "Eventually Consistent," which in
       most cases means a 1-2 minute delay?!
     • Can we assume that all batch data can live well on Big Data
       technologies?
     • Will we see in the end (10 years from now) that only a small
       portion of data still resides on RDBMSs, while most of the data
       resides on Big Data technologies?!




Big data challenges

     • NLP in Hebrew (entity recognition is more difficult)
     • Adapting analytical algorithms to the big data world
       (anomaly detection needs to be redefined)
     • Some problems with consistency
     • Skills problem: BI staff need to program in Java and need
       Hadoop and NoSQL knowledge




Example of big data technology: SPLUNK

     • Splunk is a vendor serving traditional IT whose product has been
       based on MapReduce since 2009




Thanks for your patience, and I hope you enjoyed it




     You can find the latest version of this presentation at http://www.slideshare.net/pini


Big data 2012 v1

  • 5. Instagram • The Instagram philosophy: • Simplicity • Optimized for minimal operational burden • Instrument everything Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 5
  • 6. Scaling Instagram • Instagram went to 30+ million users in less than two years and then rocketed to 40 million users 10 days after the launch of its Android application. • After the release of the Android they had 1 million new users in 12 hours. • 2 engineers in 2010. • 3 engineers in 2011 • 5 engineers 2012, 2.5 on the backend. This includes iPhone and Android development. Source: http://highscalability.com/blog/2012/4/16/instagram-architecture-update-whats-new-with-instagram.html Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 6
  • 7. Tumblr – Microblogging social networking platform • 500 million page views a day • 15B+ page views month • Peak rate of ~40k requests per second • 1+ TB/day into Hadoop cluster • Many TB/day into MySQL/HBase/Redis/Memcache • Growing at 30% a month • ~1000 hardware nodes in production (not cloud) • ~20 engineers (total 106 employees) Source: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html STKI modifications Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 7
  • 8. Technology listing • Hadoop MapReduce • NoSQL DBMS (Cassandra, Mongo, HBase) • Sharding • In-Memory DBMS • Memcached • MemSQL • Solr • Redis • Django • Python • ELB - Amazon Elastic Load Balancing Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph
  • 9. Paradigm shifts agenda • Big Data: • Big Data definition and background • Big Data value • Big Data technology Source: http://www.b2binbound.com/blog/?Tag=paradigm%20shift Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 9
  • 10. Big Data Definition – 4 V’s (or more…) • Volume – tens of TBs and more (15-20TB+) • Velocity – the speed in which data is added – 10M items per hour and more. And the speed in which the data needs to be processed • Variety – different types of data – structured & unstructured. In many cases deals with internet of things, social media, but also with voice, video, etc. • Variability - able to cope with new attributes and changing data types – without interrupting the analytical process (without “import-export”) • Other optional V’s - validity, volatility, viscosity (resistance to flow), etc. source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 10
  • 11. The origins of the 3V’s: • 2002 research by Doug Laney from META Group (now Gartner): Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 11
  • 12. “Big Data” theme main current usage: • “Big Data" is just marketing jargon. -Doug Laney, Gartner source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html Source: http://winnbadisa.com/wp-content/uploads/2011/12/marketing-career-cloud.jpg • STKI : doing something significantly different from what you’ve done until now Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 12
  • 13. Big Data at work: • Orbitz Worldwide has collected 750 terabytes of unstructured data on their consumers’ behavior – detailed information from customer online visits and browsing sessions. Using Hadoop, models have been developed intended to improve search results and tailor the user experience based on everything from location, interest in family travel versus solo travel, and even the kind of device being used to explore travel options. • The result? To date, a 7% increase in interaction rate, 37% growth in stickiness of sessions and a net 2.6% in booking path engagement. Source: http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_cons_techtrends2012_013112.pdf Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 13
  • 14. DW appliances will be discussed later Teradata EMC Greenplum Oracle Exadata Source: http://www.asugnews.com/2011/09/06/inside-saps-product-naming-strategies/ Pini Cohen’s work Copyright STKI@2012 14 Microsoft Parallel Data Warehouse Do not remove source or attribution from any slide or graph
  • 15. What is the business value of big data analytics? • Big data is now a technology looking for a business need • It can mean doing the same thing but better / faster (better segmentation, more accurate analysis model) • Or it can mean doing completely new things (telematics, sentiment analysis, recommendation engine, matching competition’s pricing in real time, being able to analyze data we haven’t been able to analyze in the past) Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph
  • 16. Decision making – old school vs. new school (big data) • Old School: • Phase 1 : Analyze existing data and prepare general model • Phase 2: Apply the general model to specific client • This means applying the same model for many clients when they arrive • Issues with Old School decision making: • Time gap between preparing and applying the model • # of combinations might be too big for general model (example: recommendation based in interest) • The general model generated is biased towards “main stream” population • New School (Big Data): • Phase 1: Prepare specific model for the client and apply the model – instantly Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 16
  • 17. Big data use cases • Recommendation engines – match users to one another and provide recommendation based on similar users (Examples: Linkedin – people you may know; Amazon) • Sentiment Analysis (Macro or individual user) • Fraud Detection - customer behavior, historical and transactional data combined. Same but more affordable • Customer Churn • Social graph analysis – influencers • Customer experience analysis – combine data from call center, web, social media etc. • Improved segmentation – more data (clickstream, call records) for more accurate analysis • Improved customer retention Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph
  • 18. Technology: Elements Concepts • Storing data for analytics (mainly): • HDFS – Hadoop File System • Map Reduce- Programming method mainly for analytics • Other “Add-on”: Pig, , Hive, JAQL (IBM) • Storing and retrieving data - DBMS: • NoSQL – DBMS (not only SQL): • Cassandra • MongoDB • CouchDB • Hbase Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 18
  • 19. Who Uses Hadoop? • Amazon/A9  Quantcast • AOL  Rackspace/Mailtrust • Facebook • Fox interactive media  Veoh • Netflix  Yahoo! • New York Times  PowerSet (now Microsoft) More at http://wiki.apache.org/hadoop/PoweredBy Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 19
  • 20. Who Uses Cassandra? • Facebook  SimpleGeo • Digg  Rackspace • Despegar  Shazam • Ooyala  SoftwareProjects • Imagini Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 20
  • 21. Big Data technologies (Hadoop etc.) vs. traditional IT Traditional IT Big Data Centralized Storage Local storage Brand redundant Servers Cheap HW White Boxes Standard Infrastructure and virtual Is standardization needed?! (in the HW servers. level). No server virtualization. Well established backup and DRP Why do I need backup? How do I tackle procedures DRP (compute clusters that are stretched over locations) Traditional vendors Open Source solutions Mature products and procedures In a new patch for specific issues sometimes it is written “not implemented yet” Traditional programming, SQL Different kind of programming (map- reduce) , no Joins Will Big Data infrastructure be part of existing infrastructure or will be developed as new domain? Pini Cohen’s work Copyright STKI@2012 Do not remove source or attribution from any slide or graph 21
22. A new type of scale
• Hadoop:
  • Up to 4,000 machines in a cluster
  • Up to 20 PB in a cluster
• Traditional IT technologies currently cannot handle this kind of scale.
• This scale comes with a cost!
Source: http://www.techsangam.com/wp-content/uploads/2012/01/i_love_scalability_mug.jpg
23. Brewer's (CAP) theorem
• It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency – all nodes see the same data at the same time
  • Availability – node failures do not prevent survivors from continuing to operate
  • Partition tolerance – the system continues to operate across many partitions and despite arbitrary message loss
Source: Scalebase, STKI modifications; Professor Eric A. Brewer
24. Dealing with CAP
• Drop consistency:
  • Welcome to the term "eventually consistent".
  • In the end, everything will work out just fine – and sometimes this is a good-enough solution.
  • When no updates occur for a long period of time, eventually all updates propagate through the system and all nodes become consistent.
  • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service.
• Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
Source: Scalebase
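The "eventually all nodes become consistent" behavior can be made concrete with a deliberately naive sketch (hypothetical code, not any real product's protocol): once updates stop, repeated synchronization rounds drive every replica to the same latest value, with the newest timestamp winning.

```python
# Toy model of eventual consistency: each replica holds a
# (timestamp, value) pair; a sync round spreads the newest one.

def sync_round(replicas):
    # Every replica adopts the newest (timestamp, value) seen anywhere.
    latest = max(replicas)
    return [latest] * len(replicas)

def converge(replicas, max_rounds=10):
    """Run sync rounds until all replicas agree (or we give up)."""
    rounds = 0
    while len(set(replicas)) > 1 and rounds < max_rounds:
        replicas = sync_round(replicas)
        rounds += 1
    return replicas, rounds
```

Real systems gossip pairwise rather than globally, so convergence takes longer, but the end state is the same: absent new writes, all replicas settle on one value.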
25. Hadoop
• Apache Hadoop is a software framework that supports data-intensive distributed applications.
• It enables applications to work with thousands of nodes and petabytes of data.
• Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
• Contains (basically):
  • HDFS – the Hadoop Distributed File System
  • The MapReduce programming model
26. HDFS – Hadoop Distributed File System
• Parallel
• Distributed over commodity elements
• Throughput over latency
• Reliable and self-healing
• Built for large scale – a typical file is gigabytes to terabytes (a single file!)
• Applications need a write-once-read-many access model (mainly analytics)
27. HDFS motivation
• What if you needed to write a program that distributes data over commodity HW (PCs or servers)? You would need to take care of:
  • Where the data is located
  • How to distribute data between the nodes
  • How many times to replicate the data
  • How to insert, select and update data
  • What to do if one or more nodes fail
  • How to add a node or take a node out
  • Managing and monitoring the environment
• The Hadoop Distributed File System does it for you!
28. HDFS: Hadoop Distributed File System
• Data nodes and a name node
• A client requests metadata about a file from the namenode
• Data is served directly from the datanodes
(Diagram: the HDFS client sends a file name to the namenode and receives block IDs and block locations; it then requests block data, by block ID and byte range, directly from the datanodes, each backed by the Linux local file system.)
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
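The namenode/datanode split can be sketched in a few lines of Python (a hypothetical toy, not the real HDFS API): the namenode answers metadata queries only, while block bytes are fetched directly from the datanodes.

```python
# Toy sketch of the HDFS read path: metadata from the namenode,
# block data straight from datanodes.

class NameNode:
    def __init__(self):
        self.files = {}   # file name -> ordered list of block ids
        self.blocks = {}  # block id -> list of datanode names holding it

    def add_file(self, name, block_ids, locations):
        self.files[name] = block_ids
        for bid, nodes in zip(block_ids, locations):
            self.blocks[bid] = nodes

    def lookup(self, name):
        """Return (block id, datanode list) pairs for a file."""
        return [(bid, self.blocks[bid]) for bid in self.files[name]]

class DataNode:
    def __init__(self):
        self.store = {}  # block id -> raw bytes

def read_file(namenode, datanodes, name):
    # Ask the namenode where each block lives, then read each block
    # from the first datanode that holds a replica.
    chunks = []
    for bid, nodes in namenode.lookup(name):
        chunks.append(datanodes[nodes[0]].store[bid])
    return b"".join(chunks)
```

Note the design point the slide makes: bulk data never flows through the namenode, which is why one (single-point-of-failure) metadata server can front a very large cluster.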
29. Datanode block reports
• File "part-0" is replicated twice and saved in blocks 1 and 3 (the file is big, so it has to be divided into two blocks).
• Block 1 is on datanodes A and C.
source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
30. HDFS basic limitations
• The namenode is a single point of failure
• Write-once model (there are plans to support appending writes)
• A namespace with an extremely large number of files exceeds the namenode's capacity to maintain
• Cannot be mounted by an existing OS; getting data in and out is tedious
• HDFS does not implement or support user quotas / access permissions, data-balancing schemes, or periodic checkpoints
31. The MapReduce programming model
• Very basically: it brings the program to the data.
• Contains two elements:
  • Map: this part of the job is performed in parallel, asynchronously, by each node
  • Reduce: gathers the results from the relevant nodes
• In more detail:
  • Map: returns (writes to a temp file) a list containing zero or more (k, v) pairs
    • The output key can differ from the input key
    • Outputs can share the same key
  • Reduce: returns a new list of reduced output from the input
32. MapReduce motivation
• What if you needed to write a program that processes data spread across distributed computers? You would need to write a distributed program that:
  • Finds where the data is located
  • Works on each node and then combines the results from all nodes
  • Decides where (on the local node) and how (in what format) to write intermediate results
  • Detects when the jobs of all participating nodes have concluded, and then starts the "aggregation" part
  • Decides what to do if a job is stuck (restart the job, or turn to another node to perform the same job)
• Hadoop MapReduce is the framework for you!
33. MapReduce example:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
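The same word-count job can be written as runnable Python, with the shuffle step that the framework normally performs between map and reduce made explicit (a single-process simulation, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(doc_name, contents):
    # Emit (word, 1) for every word, mirroring EmitIntermediate above.
    return [(w, 1) for w in contents.split()]

def shuffle(pairs):
    # The framework groups intermediate pairs by key between map and reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(word, counts):
    # Sum the per-document counts for one word.
    return word, sum(counts)

def word_count(docs):
    """docs: {document name: contents} -> {word: total count}."""
    intermediate = []
    for name, text in docs.items():
        intermediate.extend(map_phase(name, text))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
```

Running it on the two blocks used in the dataflow slides ("Hello World Bye World" and "Hello Hadoop Goodbye Hadoop") reproduces the final answer shown there: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.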
34–38. Dataflow in Hadoop (diagram sequence, word-count example; all elements are standard HW)
• Slide 34: the client submits a "word count" job to the master, which schedules map and reduce tasks across the cluster.
• Slide 35: HDFS block 1 ("Hello World Bye World") and block 2 ("Hello Hadoop Goodbye Hadoop") are each read by a map task, producing intermediate counts such as Hello 1, World 2, Bye 1 and Hello 1, Hadoop 2, Goodbye 1.
• Slide 36: map tasks finish, write their output to the local file system, and report "finished + location" to the master.
• Slide 37: reduce tasks fetch the intermediate map output from the mappers' local file systems over HTTP GET.
• Slide 38: reduce tasks write the final answer to HDFS: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.
Source: IBM Haifa Labs
39. Components of a cluster node (flow-analysis example), top to bottom:
• Flow file input processor (flow-tools)
• Flow analysis: map/reduce tools over the cluster file system
• Hadoop: HDFS and the MapReduce library
• Java Virtual Machine
• Operating system: Linux
• Hardware (CPU, HDD, memory, NIC)
Source: www.caida.org/workshops/.../wide-casfi1004_wkang.ppt
40. Hive: a MapReduce helper
• Code examples:

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
41. NoSQL DBMS: storing and retrieving data
• Key/value
  • A big hash table
  • Examples: Voldemort, Amazon's Dynamo
• Big table
  • Big table, column families
  • Examples: HBase, Cassandra
• Document based
  • Collections of collections
  • Examples: CouchDB, MongoDB
• Graph databases
  • Based on graph theory
  • Examples: Neo4j
• Each solves a different problem.
Source: Scalebase
42. Pros/cons
• Pros:
  • Performance
  • Big Data
  • Most solutions are open source
  • Data is replicated to nodes and is therefore fault-tolerant (partitioning)
  • Doesn't require a schema
  • Can scale up and down
• Cons:
  • Code changes
  • No framework support
  • Not ACID
  • Ecosystem (BI, backup)
  • There is always a database at the back end
  • Some APIs are just too simple
Source: Scalebase
43. Apache Cassandra
• Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store.
• A child of Google's BigTable and Amazon's Dynamo.
• Peer-to-peer architecture: all nodes are equal.
• Cassandra's replication factor (RF) is the total number of nodes onto which each piece of data is placed. An RF of at least 2 is highly recommended, keeping in mind that your effective number of nodes is (N total nodes / RF).
• CQL (Cassandra Query Language) command line.
• A timestamp is kept for each value written.
Source: ids.snu.ac.kr/w/images/1/18/2011SS-03.ppt
44. Consistent hashing
• Partitioning (choosing the first node data is placed on) uses consistent hashing, based on MD5 – a distributed-hash-table algorithm.
• Keys hash to a point on a fixed circular space.
• The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots.
• Nodes take positions on the circle.
• Example (ring diagram): nodes A, B and D exist.
  • B is responsible for the A–B range (for replication factor = 2, the default).
  • D is responsible for the B–D range; A is responsible for the D–A range.
  • When C joins, B and D split their ranges, and C takes over the B–C portion from D.
Source: http://www.intertech.com/resource/usergroup/NoSQL.ppt
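The ring behavior above can be sketched as follows (a toy, assuming MD5 positions and no virtual nodes, so it is simpler than production Cassandra): each key belongs to the first node clockwise from its hash position.

```python
import bisect
import hashlib

def ring_position(key):
    # Hash a name to a point on the fixed circular space (MD5, as per the slide).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self.points = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        # The key's owner is the first node clockwise from its position,
        # wrapping around the circle at the end.
        h = ring_position(key)
        idx = bisect.bisect(self.points, (h,)) % len(self.points)
        return self.points[idx][1]
```

The payoff is exactly what the slide illustrates with C joining: adding a node moves only the keys in one arc of the circle (they move to the new node), while every other key keeps its old owner.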
45. Cassandra's tunable consistency (write)
• ANY – ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
• ONE – ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
• TWO – ensure that the write has been written to at least 2 replicas before responding to the client.
• THREE – ensure that the write has been written to at least 3 replicas before responding to the client.
• QUORUM – ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
• LOCAL_QUORUM – ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
• EACH_QUORUM – ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
• ALL – ensure that the write is written to all N replicas before responding to the client. Any unresponsive replica will fail the operation.
Source: wiki
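The quorum arithmetic in the table reduces to integer division, RF // 2 + 1. A small helper (hypothetical function names, not Cassandra's API) makes that rule, and the classic R + W > RF strong-consistency condition, concrete:

```python
def replicas_required(level, rf):
    """Replica acknowledgements a write needs at a given consistency level."""
    table = {
        "ANY": 1,        # any node, including hinted-handoff recipients
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    return table[level]

def read_write_consistent(rf, write_acks, read_acks):
    # A read is guaranteed to overlap the latest write when R + W > RF,
    # e.g. QUORUM reads + QUORUM writes with RF = 3 (2 + 2 > 3).
    return read_acks + write_acks > rf
```

This is why QUORUM/QUORUM is the common compromise: it gives read-your-writes behavior while tolerating one unavailable replica out of three.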
46. Cassandra's data model structure
• Think of Cassandra as row-oriented.
• Structure (diagram): a keyspace (with settings, e.g. the partitioner) contains column families (with settings, e.g. comparator, type [Std]); each column has a name, a value, and a clock.
Source: http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
47. Data model – a "flexible" schema!
ColumnFamily: Rockets
• Key 1: name = Rocket-Powered Roller Skates; toon = Ready, Set, Zoom; inventoryQty = 5; brakes = false
• Key 2: name = Little Giant Do-It-Yourself Rocket-Sled Kit; toon = Beep Prepared; inventoryQty = 4; brakes = false
• Key 3: name = Acme Jet Propelled Unicycle; toon = Hot Rod and Reel; inventoryQty = 1; wheels = 1
• Note that row 3 carries a "wheels" column instead of "brakes" – rows in the same column family need not share the same columns.
Source: http://wenku.baidu.com/view/6e254321482fb4daa58d4b87.html
48. Cassandra's CQL – Cassandra Query Language
• SQL-like. Examples:

CREATE KEYSPACE test with strategy_class = 'SimpleStrategy' and strategy_options:replication_factor=1;
CREATE INDEX ON users (birth_date);
SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

• However:
  • No joins
  • No UPDATEs/DELETEs
49. NoSQL benchmark – for scale!
Source: research.yahoo.com/files/ycsb-v4.pdf
50. Can we live with NoSQL limitations?
• Facebook has dropped Cassandra:
  • "...we found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure."
• Facebook selected HBase (a columnar DBMS) instead.
http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
51. What about other NoSQL DBMS?
• MongoDB
• HBase
• CouchDB
• Maybe next session...
52. Big Data: potential implications for IT
• Will traditional RDBMS become obsolete? Surely not!
• Several areas are in the Big Data zone by definition – internet marketing, cyber, DW, etc.
• How well can we live with "eventually consistent", which in most cases means a 1–2 minute delay?
• Can we decide that all batch data can live well on Big Data technologies?
• Will we see in the end (10 years from now) that only a small portion of data still resides on RDBMS, and most of the data resides on Big Data technologies?!
53. Big Data challenges
• NLP in Hebrew (entity recognition is more difficult)
• Adapting analytical algorithms to the Big Data world (anomaly detection needs to be redefined)
• Some problems with consistency
• A skills problem – BI staff need to program in Java and need Hadoop and NoSQL knowledge
54. Example of Big Data technology: Splunk
• Splunk is a traditional IT vendor whose product has been based on MapReduce since 2009.
55. Thanks for your patience – hope you enjoyed it!
You can find the latest version of this presentation at http://www.slideshare.net/pini