© Adam Perer
COSI

COSI: Cloud Oriented Subgraph Identification in Massive Social Networks
Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian

SNA Challenge: Scalability
© solofotones/flickr, © Felix Heinen

Huge Social Networks
  • 500 million users
  • 50M tweets / day
© Ludwig Gatzke

COSI
  • Cloud-based storage
  • Asynchronous
  • Answers complex queries in ~1 sec on a 778 million edge network

Outline
Motivation
Subgraph Identification
Graph Partitioning
Query Answering
Experiments
Conclusion

[Figure: example social network graph. Vertices include professors (Jones, Baneri, Calero, Dooley, Roma, Smith, Olsen, Lund, Larsen), other people (Jamie Lock, Karl Oede, John Doe), papers (“ABC”, “HIJ”, “UVW”, “XYZ”), events (ASONAM 10, KPLLC 09), departments (UMD CS, UMD Physics, UC CS, Social Science, Odense Physics), universities (University MD, Universita Calabria, SDU Odense), and countries (USA, Italy, Denmark). Edge labels include author, comment, faculty, friends, member, dean, department, in, attended, presented, submitted, accepted, organized, visited, student of, collaborate, colleagues.]
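
Example Query
  [Figure: query graph with variable vertices ?p, ?v1, ?v2, ?v3 and constant vertices University MD and Italy. Edges: ?v1 -author- ?p, ?v3 -comment- ?p, ?v1 -friends- ?v3, ?v1 -faculty- University MD, ?v3 -faculty- ?v2, ?v2 -in- Italy.]
  Simple query, yet already difficult to answer by hand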
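
Answering such a query means finding every assignment of the variables that maps each query edge onto an edge of the data graph. The following is a minimal single-machine sketch of that idea using plain backtracking. The toy triples, the edge directions, and the names `TRIPLES`, `QUERY`, `unify`, and `match` are assumptions made for illustration; this is not COSI's implementation.

```python
# Toy data graph as (subject, label, object) triples. The triples and their
# directions are assumptions made for this illustration only.
TRIPLES = {
    ("Jones", "faculty", "University MD"),
    ("Jones", "author", "Paper ABC"),
    ("Jones", "friends", "Calero"),
    ("Calero", "comment", "Paper ABC"),
    ("Calero", "faculty", "Universita Calabria"),
    ("Universita Calabria", "in", "Italy"),
    ("Smith", "faculty", "University MD"),
    ("Smith", "author", "Paper XYZ"),
}

# The example query; terms starting with "?" are variables.
QUERY = [
    ("?v1", "faculty", "University MD"),
    ("?v1", "author", "?p"),
    ("?v1", "friends", "?v3"),
    ("?v3", "comment", "?p"),
    ("?v3", "faculty", "?v2"),
    ("?v2", "in", "Italy"),
]


def unify(term, value, binding):
    """Bind a query term to a data value; return the new binding or None."""
    if not term.startswith("?"):
        return binding if term == value else None
    if term in binding:
        return binding if binding[term] == value else None
    extended = dict(binding)
    extended[term] = value
    return extended


def match(query, triples, binding=None):
    """Yield every binding that maps all query edges onto data triples.

    Distinct variables are allowed to map to the same vertex.
    """
    binding = {} if binding is None else binding
    if not query:
        yield binding
        return
    (qs, ql, qo), rest = query[0], query[1:]
    for s, label, o in triples:
        if label != ql:
            continue
        b = unify(qs, s, binding)
        if b is not None:
            b = unify(qo, o, b)
        if b is not None:
            yield from match(rest, triples, b)


answers = list(match(QUERY, TRIPLES))
```

On this toy graph only Jones satisfies all six edges, so `answers` holds a single binding. Scanning every triple per query edge is only workable at this scale.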

Fraud Detection Example
  [Figure: query graph with constant vertices Bank1 and Suspicious, variable vertices ?v1, ?v2, ?v3, and edge labels wired (twice), friends, labeled.]

COSI Architecture
  [Figure: graph data is loaded into the system and a client submits a query graph (constants A, B, C; variables ?X, ?Y, ?Z). Labels: load; Partition Graph; Distribute data / Dispatch query; Exchange data / Forward query; Answer Queries; Query answer; Receive query - Return results.]

COSI Graph Partitioning
  • How should we partition the graph?
  • Goal: find a way to partition the graph DB into “blocks” across the k storage nodes so that the expected time to answer queries is small.

Example Query & Naive Approach
  [Figure: the example query graph, with the candidates Jones, Dooley, and Smith listed next to ?v1.]

Co-Retrieval
  [Figure: the example query graph with Jones substituted for ?v1 and Paper “ABC” listed next to ?p.]
  Co-retrieval: Jones and Paper “ABC”

Cost Model
  • Query trace: a query trace w.r.t. a query plan x for query Q consists of
    - All vertices in the DB whose neighborhood is retrieved during execution of x
    - All pairs (u,v) of vertices where x retrieves v’s neighborhood immediately after retrieving u’s neighborhood.
      • Intuition: try to put u and v on the same storage node.
      • Assumption: retrieved neighborhoods are cached in memory.
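
The query trace definition can be made concrete with a small sketch that derives a trace from the order in which a plan fetches neighborhoods. The function name `query_trace` and the handling of repeated vertices (a cached neighborhood is not counted as a new retrieval) are assumptions made for illustration.

```python
def query_trace(retrieval_order):
    """Build a query trace from the order in which a plan retrieves neighborhoods.

    Returns (vertices, pairs): the set of retrieved vertices and the set of
    ordered pairs (u, v) where v's neighborhood is fetched right after u's.
    A vertex that shows up again is skipped, reflecting the caching
    assumption (a cached neighborhood is not fetched a second time).
    """
    vertices = set()
    pairs = set()
    previous = None
    for v in retrieval_order:
        if v in vertices:
            continue  # already cached in memory
        vertices.add(v)
        if previous is not None:
            pairs.add((previous, v))
        previous = v
    return vertices, pairs


# Hypothetical retrieval order for one plan of the example query.
vertices, pairs = query_trace(
    ["University MD", "Jones", "Paper ABC", "Jones", "Calero"]
)
```

Each pair in the trace is a hint that the two vertices would ideally live on the same storage node.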

Cost Model (continued)
  • Assume a fixed but arbitrary probability distribution $\mathbb{P}$ over the set $\mathcal{Q}$ of all queries.
  • This induces a probability distribution over the set of all feasible query plans $qp(\mathcal{Q})$, where $qp(Q)$ is the plan for query $Q$.
    - $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\ qp(Q) = x} \mathbb{P}(Q)$
    - The probability of query plan x is the sum of the probabilities of the queries requiring query plan x.
  • Let $E(v)$ be the event that v is retrieved by the query trace of a random query plan.

Cost Model (continued)
  • The probability that vertex v occurs in the trace of a randomly chosen query plan is
    $\mathbb{P}(E(v)) = \sum_{x \in qp(\mathcal{Q}) \,\wedge\, v \in qt(x, DB)} \mathbb{P}(x)$
  • The probability that (u,v) occurs in the trace of a randomly chosen query plan is
    $\mathbb{P}(E(u,v)) = \sum_{x \in qp(\mathcal{Q}) \,\wedge\, (u,v) \in qt(x, DB)} \mathbb{P}(x)$
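
Both probabilities are sums of plan probabilities over the plans whose trace contains the vertex or the pair. A minimal sketch, assuming the plan distribution and the traces are already available as dictionaries; the plan identifiers, probabilities, and the name `trace_probabilities` are made up for illustration.

```python
from collections import defaultdict


def trace_probabilities(plan_probs, traces):
    """Compute P(E(v)) and P(E(u, v)) from a distribution over query plans.

    plan_probs: dict plan_id -> probability of that plan.
    traces:     dict plan_id -> (vertices, pairs), the query trace of the plan.
    """
    p_vertex = defaultdict(float)
    p_pair = defaultdict(float)
    for plan, prob in plan_probs.items():
        vertices, pairs = traces[plan]
        for v in vertices:
            p_vertex[v] += prob  # plan's trace contains v
        for pair in pairs:
            p_pair[pair] += prob  # plan's trace contains (u, v)
    return dict(p_vertex), dict(p_pair)


# Two hypothetical plans with made-up probabilities.
plan_probs = {"x1": 0.75, "x2": 0.25}
traces = {
    "x1": ({"a", "b", "c"}, {("a", "b"), ("b", "c")}),
    "x2": ({"a", "c"}, {("c", "a")}),
}
p_vertex, p_pair = trace_probabilities(plan_probs, traces)
```

In practice the plan distribution is not known exactly, so any real system would have to estimate these quantities; the sketch only shows the bookkeeping.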

Cost Model (continued)
  Key Theorem
    Suppose vertex retrieval and inter-node communication costs are uniform across storage nodes. The partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge-cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$.
  • So min edge-cut in complete graphs is closely related to minimizing query execution time.
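
The quantity the theorem refers to can be computed directly once the pair probabilities are known. A minimal sketch, assuming pair probabilities are given as a dictionary and that pairs missing from it have probability 0; the names `edge_weight` and `edge_cut_cost` and the numbers are illustrative.

```python
def edge_weight(p_pair, u, v):
    """w(u, v) = P(E(u, v)) + P(E(v, u)); absent pairs have probability 0."""
    return p_pair.get((u, v), 0.0) + p_pair.get((v, u), 0.0)


def edge_cut_cost(p_pair, block_of):
    """Total weight of vertex pairs whose endpoints lie in different blocks.

    p_pair:   dict (u, v) -> P(E(u, v)).
    block_of: dict vertex -> block index.
    Only pairs with a recorded probability can contribute, so the complete
    graph over V never has to be materialized.
    """
    seen = set()
    cost = 0.0
    for (u, v) in p_pair:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue  # count each unordered pair once
        seen.add(key)
        if block_of[u] != block_of[v]:
            cost += edge_weight(p_pair, u, v)
    return cost


# Made-up pair probabilities and two candidate partitions into 2 blocks.
p_pair = {("a", "b"): 0.75, ("b", "c"): 0.75, ("c", "a"): 0.25, ("a", "c"): 0.25}
cost_1 = edge_cut_cost(p_pair, {"a": 0, "b": 0, "c": 1})
cost_2 = edge_cut_cost(p_pair, {"a": 0, "b": 1, "c": 0})
```

Here the first partition cuts less weight than the second, so under the theorem's assumptions it would be the better placement.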

Partitioning Algorithm
  • Challenges
    - Finding MIN EDGE-CUT is NP-complete.
    - We want to process graphs containing 100s of millions of edges.
  • So we want an algorithm that is
    - Very fast
    - Produces good edge cuts
      • but maybe not optimal
  • To achieve speed, we focus on partition strategies that permanently assign vertices to blocks.

Individual edge insertion
  • Suppose we have a partition $P = \{P_1, \ldots, P_k\}$.
  • We are inserting the edge (v,p,o).
  • Vertex force vectors: measure how strongly each $P_i$ “pulls” a vertex.
    - $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$
    - $f_P$ maps positive reals to reals and is an “affinity” measure.
    - $|v|[i]$ sums up the weights of edges from v to each neighbor in $P_i$. Insert v into the block $P_i$ with the highest $|v|[i]$.
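
The force vector and the greedy assignment can be sketched as follows. The choice of `math.log1p` as the affinity function, the unit edge weights, and the names `force_vector` and `assign` are placeholders chosen for illustration, not the functions used by COSI.

```python
import math


def force_vector(v, neighbors, weight, block_of, k, f=math.log1p):
    """Return |v|[i] = f(sum of w(v, y) over neighbors y of v in block P_i).

    neighbors: dict vertex -> iterable of adjacent vertices.
    weight:    function (v, y) -> non-negative edge weight w(v, y).
    block_of:  dict vertex -> block index for already assigned vertices.
    f:         affinity function f_P; log1p is only a placeholder choice.
    """
    pull = [0.0] * k
    for y in neighbors.get(v, ()):
        if y in block_of:  # unassigned neighbors exert no pull yet
            pull[block_of[y]] += weight(v, y)
    return [f(total) for total in pull]


def assign(v, neighbors, weight, block_of, k, f=math.log1p):
    """Permanently place v in the block with the highest force-vector entry."""
    forces = force_vector(v, neighbors, weight, block_of, k, f)
    block_of[v] = max(range(k), key=forces.__getitem__)
    return block_of[v]


# Vertex "v" has one neighbor in block 0 and two neighbors in block 1.
neighbors = {"v": ["a", "b", "c"]}
block_of = {"a": 0, "b": 1, "c": 1}
chosen = assign("v", neighbors, lambda v, y: 1.0, block_of, k=3)
```

With unit weights the vertex lands in block 1, where most of its neighbors already are. A monotone `f` alone ignores block sizes, so a realistic affinity function would also have to account for balance.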

Affinity Measures
  • Must take 3 properties into account
    - Connectedness of a vertex to a partition block. This helps minimize edge cut.
    - Imbalance of block sizes.
      • E.g. standard deviation of block sizes, normalized by expected DB size.
    - Excessive size should be punished.
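
One way to combine the three properties is sketched below. Only the imbalance measure (standard deviation of block sizes, normalized by expected DB size) comes from the slide; the linear combination, the hard size cap, and the parameters `balance_weight` and `max_block_size` are hypothetical choices made for illustration.

```python
import statistics


def imbalance(block_sizes, expected_db_size):
    """Standard deviation of block sizes, normalized by the expected DB size."""
    return statistics.pstdev(block_sizes) / expected_db_size


def affinity(connectedness, i, block_sizes, expected_db_size,
             max_block_size, balance_weight=1.0):
    """Hypothetical affinity of a vertex to block i.

    Rewards connectedness, subtracts the imbalance that would result from
    adding the vertex to block i, and rules out blocks that would grow
    beyond max_block_size.
    """
    sizes_after = list(block_sizes)
    sizes_after[i] += 1
    if sizes_after[i] > max_block_size:
        return float("-inf")  # excessive size is punished outright
    return connectedness - balance_weight * imbalance(sizes_after, expected_db_size)


score_full = affinity(1.0, 0, [5, 2], expected_db_size=8, max_block_size=5)
score_small = affinity(1.0, 1, [5, 2], expected_db_size=8, max_block_size=5)
```

In this example block 0 is already at its cap and is excluded, while block 1 receives the connectedness score minus a small imbalance penalty.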

Batch insertion
  • Adding a set of edges at once.
  • Idea: find strongly connected components using modularity maximization and assign those to the partition block with highest affinity.

Batch Partitioning Algorithm
  [Figure: stack of stages labeled, from top to bottom: Force Vector Affinity; Contract; Maximize Modularity; Contract; Maximize Modularity.]

Graph modularity
  • $Mod(P) = \sum_{P_i \in P} \left( \frac{W(P_i, P_i)}{2|E|} - \frac{deg_W(P_i)^2}{(2|E|)^2} \right)$
  • Where
    - $W(X,Y)$ is the sum of the weights of edges (x,y) with x in X, y in Y,
    - $deg_W(v)$ is the sum of the weights of edges (v,-), and
    - $deg_W(P_i)$ is the sum of the $deg_W(v)$’s for v in $P_i$.
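
The modularity formula can be evaluated in one pass over the edge list. A minimal sketch for an undirected graph; it assumes that $2|E|$ stands for the total weighted degree (which equals twice the edge count when all weights are 1), and the function name `modularity` and the toy graph are illustrative.

```python
from collections import defaultdict


def modularity(edges, block_of):
    """Mod(P) for an undirected weighted graph.

    edges:    iterable of (x, y, weight), each undirected edge listed once.
    block_of: dict vertex -> block index.
    Assumption: 2|E| is taken to be the total weighted degree, which equals
    twice the number of edges when all weights are 1.
    """
    internal = defaultdict(float)  # W(P_i, P_i), counting both directions
    degree = defaultdict(float)    # deg_W(P_i)
    total = 0.0                    # 2|E|
    for x, y, w in edges:
        degree[block_of[x]] += w
        degree[block_of[y]] += w
        total += 2 * w
        if block_of[x] == block_of[y]:
            internal[block_of[x]] += 2 * w
    return sum(internal[b] / total - (degree[b] / total) ** 2 for b in degree)


# Two triangles joined by a single edge, all weights 1.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
split = modularity(edges, {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1})
merged = modularity(edges, dict.fromkeys("abcdef", 0))
```

Splitting the graph into its two triangles gives a modularity of 5/14, while putting everything in one block gives 0, which matches the intuition that modularity rewards densely connected groups.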
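
Query Answering
  [Figure: the COSI architecture during query answering. Graph data is loaded and a client submits a query graph (constants A, B, C; variables ?X, ?Y, ?Z). Labels: load; Receive query - Return results; Dispatch query; Forward (partially answered) query; Query answer.]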

Example Query
  [Figure: the example query graph, annotated with partition block P1.]

Example Query
  [Figure: the example query graph with candidates and their hosting blocks listed next to ?v1: Jones: P2, Dooley: P2, Smith: P3.]

Example Query
  [Figure: the example query graph at block P2 with Dooley substituted for ?v1. Listed next to ?p: Paper “ABC”: P2, Paper “HIJ”: P3. Listed next to ?v3: Calero: P2.]
  Where to send query next?

Query answering
  • Basic: the next substitution is chosen arbitrarily.
  • COSI_Heur is a heuristic version that makes intelligent choices about the next variable to be substituted.
    - Branching factor: number of possible substitutions
    - Communication cost: number of messages to be sent
    - Workload distribution: partitions hosting vertices
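
A heuristic of this kind can be sketched as a scoring function over the open variables. Everything about the scoring below is a hypothetical illustration: the linear combination, the three weights, the way workload skew is measured, and the name `choose_next_variable` are assumptions, not COSI_Heur itself.

```python
def choose_next_variable(candidates, block_of, current_block,
                         w_branch=1.0, w_comm=1.0, w_load=1.0):
    """Pick the variable to substitute next (lower score is better).

    candidates:    dict variable -> list of candidate vertices.
    block_of:      dict vertex -> partition block hosting it.
    current_block: block currently processing the partial answer.
    The three weights and the linear scoring are hypothetical.
    """
    def score(var):
        vertices = candidates[var]
        blocks = [block_of[v] for v in vertices]
        branching = len(vertices)                      # possible substitutions
        messages = len(set(blocks) - {current_block})  # remote blocks to contact
        # Largest share of candidates on one block: 1.0 means all work lands
        # on a single block, lower values mean the work is spread out.
        skew = max(blocks.count(b) for b in set(blocks)) / len(blocks)
        return w_branch * branching + w_comm * messages + w_load * skew

    open_vars = [v for v in candidates if candidates[v]]
    return min(open_vars, key=score)


# Candidates and hosting blocks taken from the example query slides.
candidates = {"?p": ["Paper ABC", "Paper HIJ"], "?v3": ["Calero"]}
block_of = {"Paper ABC": "P2", "Paper HIJ": "P3", "Calero": "P2"}
next_var = choose_next_variable(candidates, block_of, current_block="P2")
```

With these weights ?v3 is chosen: it has a single candidate and that candidate is hosted on the current block, so no message has to be sent.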

COSI implementation
  • Implementation is in Java (approx. 10,000 LOC)
  • Social network DB with 778M edges
    - Flickr, Orkut, LiveJournal, YouTube
    - [Mislove ’07]
  • 16-node compute cluster
    - 8 GB of RAM
    - 30 GB HDs
    - 8-core Intel CPU

Partitioning quality
  [Figure: bar chart “Comparison of Partitioning Methods” for Single Greedy, Batch Greedy, and Batch Partition; y-axis from 0% to 40%; series: Edge Cut Improvement, Imbalance.]
  COSI_Partition achieves a 36% improvement in edge-cut with only slightly higher imbalance.
  Loading took 7.5 h with individual triple insertion and 10.5 h with batch insertion.

Query answering time
  [Figure: “Query Times by Cost Model (in ms)”, logarithmic scale from 100 to 10,000,000 ms. Query sizes on the x-axis (edges / vars): 6/3, 7/4, 8/3, 9/3, 10/3, 11/4, 11/5, 14/5, 16/7, 17/5, 23/6. Series: Cost Model A (2.0/0.5), Cost Model B (1.2/0.1), Cost Model C (8.0/5.0), No Cost Model.]
  COSI_heur does very well, answering pretty complex queries in under a second.
  The x-axis shows the number of edges and variable vertices.

Partitioning Effect
  [Figure: query time in ms, logarithmic scale from 100 to 100,000, by size of the query (# edges / # vertices): 6E/3V, 7E/4V, 8E/3V, 9E/3V, 10E/3V, 11E/4V, 11E/5V, 14E/5V, 16E/7V, 17E/5V, 23E/6V. Series: COSI Batch Partition, Individual Edge Insertion.]
  COSI_heur does very well, answering pretty complex queries in under a second.

Related Work
Approach                      Systems                                                     Pros                                                            Cons
Single machine                Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc.   Latency, speed                                                  Limited size, limited throughput
Orchestrated distribution     YARS 2, system extensions                                   Size scalability                                                Latency, limited throughput
Asynchronous, cloud oriented  COSI                                                        Size scalability, throughput scalability, resource elasticity   Latency

Conclusion
  • COSI is a general, scalable and fast graph database framework for social network analysis.
  • Demonstrated scalability and speed on the problem of subgraph identification.

Questions? Comments?
dogma.umiacs.umd.edu

Weitere Àhnliche Inhalte

Mehr von Matthias Broecheler

Adding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and FaunusAdding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and FaunusMatthias Broecheler
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraMatthias Broecheler
 
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social NetworksPMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social NetworksMatthias Broecheler
 
Budget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large NetworksBudget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large NetworksMatthias Broecheler
 
Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010Matthias Broecheler
 
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social NetworksA Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social NetworksMatthias Broecheler
 

Mehr von Matthias Broecheler (8)

Adding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and FaunusAdding Value through graph analysis using Titan and Faunus
Adding Value through graph analysis using Titan and Faunus
 
Big Graph Data
Big Graph DataBig Graph Data
Big Graph Data
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with Cassandra
 
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social NetworksPMatch: Probabilistic Subgraph Matching on Huge Social Networks
PMatch: Probabilistic Subgraph Matching on Huge Social Networks
 
Budget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large NetworksBudget-Match: Cost Effective Subgraph Matching on Large Networks
Budget-Match: Cost Effective Subgraph Matching on Large Networks
 
Probabilistic Soft Logic
Probabilistic Soft LogicProbabilistic Soft Logic
Probabilistic Soft Logic
 
Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010Computing Marginal in CCMRFs - NIPS 2010
Computing Marginal in CCMRFs - NIPS 2010
 
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social NetworksA Scalable Framework for Modeling Competitive Diffusion in Social Networks
A Scalable Framework for Modeling Competitive Diffusion in Social Networks
 

KĂŒrzlich hochgeladen

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

KĂŒrzlich hochgeladen (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

COSI: Cloud Oriented Subgraph Identification in Massive Social Networks

  • 1. © Adam Perer COSI COSI: Cloud Oriented Subgraph Identification in Massive Social Networks Matthias Bröcheler, Andrea Pugliese & V.S. Subrahmanian
  • 2. © solofotones/flickr COSI © Felix Heinen
  • 3. © solofotones/flickr COSI © Felix Heinen SNA Challenge: Scalability
  • 4. COSI 500 million users 50M tweets / day Huge Social Networks © Ludwig Gatzke
  • 5. COSI: cloud-based storage, asynchronous. COSI answers complex queries in ~1 sec on a 778 million edge network
  • 6. COSI Outline Motivation Subgraph Identification Graph Partitioning Query Answering Experiments Conclusion
  • 7. COSI [Figure: example social network drawn as a labeled graph, with vertices such as Prof Jones, Prof Dooley, Paper “ABC”, University MD and Italy, and edges labeled author, comment, faculty, friends, member, dean, department, attended, submitted, visited and in]
  • 8. COSI Example Query [Figure: query graph over the variable vertices ?p, ?v1, ?v2, ?v3 and the constants University MD and Italy, with edges labeled author, comment, faculty, friends and in] Simple query, yet already difficult to answer by hand
  • 9. COSI Fraud Detection Example [Figure: query graph over Bank1, Suspicious and the variable vertices ?v1, ?v2, ?v3, with edges labeled wired, friends and labeled]
  • 10. COSI COSI Architecture [Diagram: graph data is loaded into the system and a client submits a query graph (A, B, C, ?X, ?Y, ?Z); arrows labeled “Receive query - Return results”, “Distribute data / Dispatch query”, “Query answer” and “Exchange data / Forward query”]
  • 11. COSI COSI Architecture [Diagram: same architecture as the previous slide, with the two tasks “Partition Graph” and “Answer Queries” highlighted]
  • 12. COSI Outline Motivation Subgraph Identification Graph Partitioning Query Answering Experiments Conclusion
  • 13. COSI COSI Graph Partitioning ▪ How should we partition the graph? ▪ GOAL: Find a way to partition the graph DB into “blocks” across the k storage nodes so that expected time to answer queries is small.
  • 14. COSI Example Query & Naive Approach [Figure: the example query graph annotated with the candidate vertices Jones, Dooley and Smith]
  • 15. COSI Co-Retrieval [Figure: the example query graph with Jones and Paper “ABC” filled in] Co-retrieval: Jones – Paper “ABC”
  • 16. COSI Cost Model ▪ Query trace: A query trace w.r.t. a query plan x for query Q consists of - all vertices in the DB whose neighborhood (nbhd) is retrieved during execution of x - all pairs (u,v) of vertices where x retrieves v’s nbhd immediately after retrieving u’s nbhd. ◦ Intuition: Try to put u,v on the same storage node. ◦ Assumption: Retrieved nbhds are cached in memory.
  • 17. COSI Cost Model (continued) ▪ Assume a fixed but arbitrary probability distribution $\mathbb{P}$ over the set $\mathcal{Q}$ of all queries. ▪ This induces a pdf over the set of all feasible query plans qp(Q) for query Q. - $\mathbb{P}(x) = \sum_{Q \in \mathcal{Q},\ qp(Q) = x} \mathbb{P}(Q)$. - The probability of query plan x is the sum of the probabilities of the queries requiring query plan x. ▪ Let E(v) be the event that v is retrieved by a query trace of a random query plan for Q.
  • 18. COSI Cost Model (continued) ▪ The probability that vertex v occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(v)) = \sum_{x \in qp(Q) \,\wedge\, v \in qt(x,DB)} \mathbb{P}(x)$. ▪ The probability that (u,v) occurs in the trace of a randomly chosen query plan is $\mathbb{P}(E(u,v)) = \sum_{x \in qp(Q) \,\wedge\, (u,v) \in qt(x,DB)} \mathbb{P}(x)$.
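The two probabilities on the cost model slides can be tabulated directly once a plan distribution and the per-plan traces are known. The following is a minimal sketch under stated assumptions: the plan identifiers, probabilities and traces are invented for illustration, and a trace is represented simply as the list of vertices in retrieval order (so consecutive entries form the (u,v) pairs). The function name `trace_probabilities` is hypothetical and not part of COSI.

```python
from collections import defaultdict


def trace_probabilities(plan_probs, traces):
    """Return (P(E(v)) per vertex, P(E(u,v)) per ordered pair).

    plan_probs: {plan_id: probability of that query plan}
    traces:     {plan_id: vertices in the order their nbhds are retrieved}
    """
    p_vertex = defaultdict(float)
    p_pair = defaultdict(float)
    for plan, prob in plan_probs.items():
        order = traces[plan]
        # A vertex counts once per plan, however often it is retrieved.
        for v in set(order):
            p_vertex[v] += prob
        # (u, v) is in the trace if v's nbhd is fetched right after u's.
        for pair in set(zip(order, order[1:])):
            p_pair[pair] += prob
    return dict(p_vertex), dict(p_pair)


# Illustrative (made-up) plan distribution and traces.
plan_probs = {"x1": 0.5, "x2": 0.25, "x3": 0.25}
traces = {
    "x1": ["Jones", "PaperABC", "Dooley"],
    "x2": ["Jones", "PaperABC"],
    "x3": ["Smith", "PaperHIJ"],
}
p_vertex, p_pair = trace_probabilities(plan_probs, traces)
```

In this toy input, Jones and Paper “ABC” are retrieved back to back by two of the three plans, which is the kind of co-retrieval the partitioner tries to keep on one storage node.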
  • 19. COSI Cost Model (continued) Key Theorem: Suppose vertex retrieval and inter-node communication costs are uniform across storage nodes. The partition of the DB graph that minimizes query execution time coincides with the partition that minimizes edge cut cost in the graph $(V, V \times V)$ with weight function $w(u,v) = \mathbb{P}(E(u,v)) + \mathbb{P}(E(v,u))$. ▪ So min edge-cut in complete graphs is closely related to minimizing query execution time.
  • 20. COSI Partitioning Algorithm ▪ Challenges - Finding MIN EDGE-CUT is NP-complete. - We want to process graphs containing 100s of millions of edges. ▪ So we want an algorithm that is - very fast - produces good edge cuts ◦ but maybe not optimal ▪ To achieve speed, we focus on partition strategies that permanently assign vertices to blocks.
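To make the theorem's objective concrete, here is a small sketch that evaluates the edge cut cost of a given assignment of vertices to storage nodes, using the weight w(u,v) = P(E(u,v)) + P(E(v,u)) from the slide. The pair probabilities and the two candidate assignments are invented for illustration, and the function names are hypothetical.

```python
def edge_weight(p_pair, u, v):
    """w(u,v) = P(E(u,v)) + P(E(v,u)); missing pairs have probability 0."""
    return p_pair.get((u, v), 0.0) + p_pair.get((v, u), 0.0)


def edge_cut_cost(p_pair, assignment):
    """Sum of w(u,v) over unordered pairs placed on different storage nodes."""
    seen = set()
    cost = 0.0
    for (u, v) in p_pair:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue  # count each unordered pair once
        seen.add(key)
        if assignment[u] != assignment[v]:
            cost += edge_weight(p_pair, u, v)
    return cost


# Illustrative (made-up) co-retrieval probabilities.
p_pair = {
    ("Jones", "PaperABC"): 0.5,
    ("PaperABC", "Jones"): 0.25,
    ("PaperABC", "Dooley"): 0.25,
}
keep_together = {"Jones": 0, "PaperABC": 0, "Dooley": 1}
split_apart = {"Jones": 0, "PaperABC": 1, "Dooley": 1}
cost_together = edge_cut_cost(p_pair, keep_together)
cost_apart = edge_cut_cost(p_pair, split_apart)
```

Keeping the frequently co-retrieved pair on one node gives the lower cut cost, which under the theorem's uniformity assumption corresponds to the lower expected query time.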
  • 21. COSI Individual edge insertion ▪ Suppose we have a partition P = {P1,..,Pk}. ▪ We are inserting the edge (v,p,o). ▪ Vertex force vectors: measure how strongly each Pi “pulls” a vertex. - $|v|[i] = f_P\left(\sum_{y \in nbhd(v) \cap P_i} w(v,y)\right)$ - fP maps positive reals to reals and is an “affinity” measure. - |v|[i] sums up the weights of edges from v to each neighbor in Pi. ▪ Insert v into block Pi with highest |v|[i].
  • 22. COSI Affinity Measures ▪ Must satisfy 3 properties - Connectedness of a vertex to a partition block. This helps minimize edge cut. - Imbalance of block sizes. ◦ E.g. standard deviation of block sizes, normalized by expected DB size. - Excessive size should be punished.
  • 23. COSI Batch insertion ▪ Adding a set of edges at once. ▪ Idea: Find strongly connected components using modularity maximization and assign those to the partition block with highest affinity.
  • 24. COSI Batch Partitioning Algorithm [Diagram: pipeline whose stages are labeled Maximize Modularity and Contract (each shown twice) and Force Vector Affinity]
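The force vector and affinity slides can be combined into a short greedy placement routine. This is only a sketch under assumptions: the affinity measure used here (connectedness minus a linear penalty on block size) is a hypothetical stand-in, since the slides list the properties an affinity measure should have but not its formula. The function names, the `size_penalty` parameter and the example data are likewise invented.

```python
def force_vector(v, neighbors, weight, assignment, block_sizes, size_penalty=0.1):
    """Return |v|[i] for every block i.

    totals[i] is the summed weight of edges from v to neighbors already in
    block i; the hypothetical affinity subtracts a penalty for large blocks.
    """
    totals = [0.0] * len(block_sizes)
    for y in neighbors.get(v, ()):
        block = assignment.get(y)
        if block is not None:  # only already-placed neighbors pull on v
            totals[block] += weight(v, y)
    return [t - size_penalty * s for t, s in zip(totals, block_sizes)]


def insert_vertex(v, neighbors, weight, assignment, block_sizes):
    """Place v in the block with the highest force; placement is permanent."""
    if v in assignment:
        return assignment[v]
    force = force_vector(v, neighbors, weight, assignment, block_sizes)
    best = max(range(len(force)), key=force.__getitem__)
    assignment[v] = best
    block_sizes[best] += 1
    return best


# Illustrative (made-up) state: two blocks, three vertices already placed.
neighbors = {"Jones": ["PaperABC", "UMD", "Dooley"]}
assignment = {"PaperABC": 0, "UMD": 0, "Dooley": 1}
block_sizes = [2, 1]
chosen = insert_vertex("Jones", neighbors, lambda u, v: 1.0, assignment, block_sizes)
```

Inserting an edge (v,p,o) would call `insert_vertex` for each endpoint that has not been placed yet. A larger `size_penalty` trades edge cut for balance.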
  • 25. COSI Graph modularity ▪ $Mod(P) = \sum_{P_i \in P} \left( \frac{W(P_i,P_i)}{2|E|} - \frac{deg_W(P_i)^2}{(2|E|)^2} \right)$ ▪ Where - W(X,Y) is the sum of the weights of edges (x,y) with x in X, y in Y, - degW(v) is the sum of the weights of edges (v,-), and - degW(Pi) is the sum of the degW(v)’s for v in Pi.
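As a worked instance of the modularity formula, the sketch below evaluates Mod(P) for a small undirected weighted graph. It rests on two assumptions that the slide does not spell out: each undirected edge is counted in both directions when computing W(Pi,Pi), and |E| is taken as the total edge weight. The graph (two triangles joined by one edge) is invented for illustration.

```python
from collections import defaultdict


def modularity(edges, assignment):
    """Mod(P) = sum over blocks of W(Pi,Pi)/(2|E|) - (degW(Pi)/(2|E|))^2.

    edges:      list of (u, v, weight), each undirected edge listed once
    assignment: {vertex: block id}
    """
    two_m = 2.0 * sum(w for _, _, w in edges)
    internal = defaultdict(float)  # W(Pi, Pi), both directions counted
    degree = defaultdict(float)    # degW(Pi)
    for u, v, w in edges:
        degree[assignment[u]] += w
        degree[assignment[v]] += w
        if assignment[u] == assignment[v]:
            internal[assignment[u]] += 2.0 * w
    return sum(internal[b] / two_m - (degree[b] / two_m) ** 2 for b in degree)


# Two triangles joined by a single edge, all weights 1.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_block = {v: 0 for v in by_triangle}
mod_split = modularity(edges, by_triangle)
mod_single = modularity(edges, one_block)
```

Splitting along the bridge edge gives a modularity of 5/14, while putting everything in one block gives 0, so a modularity-maximizing step would prefer the split.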
  • 26. COSI Outline Motivation Subgraph Identification Graph Partitioning Query Answering Experiments Conclusion
  • 27. COSI Query Answering [Diagram: the COSI architecture with graph data loaded and a client query graph (A, B, C, ?X, ?Y, ?Z); arrows labeled “Receive query - Return results”, “Dispatch query”, “Query answer” and “Forward (partially answered) query”]
  • 28. COSI Example Query [Figure: the example query graph, annotated with partition block P1]
  • 29. COSI Example Query [Figure: the example query graph annotated with candidate vertices and their partition blocks, Jones: P2, Dooley: P2, Smith: P3]
  • 30. COSI Example Query [Figure: the example query graph showing Dooley and block P2, annotated with the candidates Paper “ABC”: P2, Paper “HIJ”: P3 and Calero: P2] Where to send query next?
  • 31. COSI Query answering ▪ Basic: the next substitution is chosen arbitrarily. ▪ COSI_Heur is a heuristic version that makes intelligent choices about the next variable to be substituted, considering - Branching factor: number of possible substitutions - Communication cost: number of messages to be sent - Workload distribution: partitions hosting vertices
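The slides name the criteria COSI_Heur weighs but not how they are combined, so the following is only an illustrative sketch of one possible scoring rule, not the authors' heuristic. It picks the variable with the lowest weighted sum of branching factor (number of candidate substitutions) and a rough communication proxy (number of remote partition blocks hosting candidates). Workload distribution is not modeled. The function name, the weights `w_branch` and `w_comm`, and the candidate sets are all assumptions.

```python
def choose_next_variable(candidates, home, current, w_branch=1.0, w_comm=1.0):
    """Pick the variable that is cheapest to substitute next.

    candidates: {variable: list of candidate vertices}
    home:       {vertex: partition block hosting it}
    current:    block where the partially answered query currently sits
    """
    def score(var):
        vertices = candidates[var]
        branching = len(vertices)
        # Each remote block hosting a candidate needs at least one message.
        remote_blocks = {home[v] for v in vertices} - {current}
        return w_branch * branching + w_comm * len(remote_blocks)

    # Sorting first makes ties break deterministically by variable name.
    return min(sorted(candidates), key=score)


# Illustrative (made-up) state of a partially answered query at block P2.
candidates = {"?p": ["PaperABC", "PaperHIJ"], "?v3": ["Calero"]}
home = {"PaperABC": "P2", "PaperHIJ": "P3", "Calero": "P2"}
next_var = choose_next_variable(candidates, home, current="P2")
```

With these inputs `?v3` scores 1 (one candidate, no remote block) against 3 for `?p` (two candidates, one remote block), so the sketch substitutes `?v3` first and stays on the current node.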
  • 32. COSI Outline Motivation Subgraph Identification Graph Partitioning Query Answering Experiments Conclusion
  • 33. COSI COSI implementation ▪ Implementation is in Java (approx. 10,000 LOC) ▪ 778M-edge social network DB - Flickr, Orkut, Livejournal, Youtube - [Mislove ’07] ▪ 16-node compute cluster - 8 GB of RAM - 30 GB HDs - 8 core Intel CPU
  • 34. COSI Partitioning quality [Chart: Comparison of Partitioning Methods, showing edge cut improvement and imbalance on a 0% to 40% scale for Single Greedy, Batch Greedy and Batch Partition] COSI_Partition achieves a 36% improvement in edge-cut with only slightly higher imbalance. Took 7.5 h to load with individual triple insertion, 10.5 h with batch.
  • 35. COSI Query answering time [Chart: Query Times by Cost Model (in ms), logarithmic scale from 100 to 10,000,000 ms, for query sizes from 6 edges / 3 vars to 23 edges / 6 vars; legend: Cost Model A / 2.0/0.5, Cost Model B / 1.2/0.1, Cost Model C / 8.0/5.0, No Cost Model] COSI_heur does very well, answering pretty complex queries in under a second. X-axis shows number of edges and variable vertices.
  • 36. COSI Partitioning Effect [Chart: time (ms), logarithmic scale from 100 to 100,000, against size of the query (# edges / # vertices) from 6E/3V to 23E/6V; series: COSI Batch Partition and Individual Edge Insertion] COSI_heur does very well, answering pretty complex queries in under a second.
  • 37. COSI Outline Motivation Subgraph Identification Graph Partitioning Query Answering Experiments Conclusion
  • 38. COSI Related Work (approach | systems | pros | cons): Single machine | Neo4j, DEO, Hypergraph, RDF-3X, OWLIM, AllegroGraph, etc. | latency, speed | limited size, limited throughput. Orchestrated distribution | YARS 2, system extensions | size scalability | latency, limited throughput. Asynchronous, cloud oriented | COSI | size scalability, latency, throughput scalability, resource elasticity | none listed.
  • 39. COSI Conclusion ▪ COSI is a general, scalable and fast graph database framework for social network analysis ▪ Demonstrated scalability and speed on the problem of subgraph identification
  • 41. COSI Questions? Comments?