musweet.com
Handling Humongous Data Sets
from the Social Web

Grischa Andreew & Nader Cserny, compuccino
Agenda

• Introduction
• Technology
• Data in detail
• Queries
• Tools & debugging
• Questions
Introduction
What is musweet?
Media Stream
Topics
Analytics
Profile
Statistics

7,002       ARTISTS
20,848      SOCIAL MEDIA PROFILES
3,914,259   STREAM ITEMS
309,855     MEDIA (AUDIO, VIDEO, PHOTO)
~3,450      READ QUERIES / SEC
~15,000     INSERTS / DAY
3.75 GB     DATA SIZE
Technology
What do we need?

ARTIST              Name, city, genre, image
SOCIAL PROFILE      Platform (e.g. twitter), link
MEDIA POSTS         Images, videos, audio, status updates
PROFILE INFO        Friends, followers, date, websites, profile image, biography, label, etc.
MySQL Schema

Artist
  Fields:  id, name
  Indexes: name

Genres
  Fields:  id, name
  Indexes: name

artist_genres
  Fields:  artist, genre
  Indexes: artist_genre

Socialprofile
  Fields:  id, artist, url, service
  Indexes: artists_service

Numbers
  Fields:  artist, socialprofile, outgoing, incoming, feedback, push
  Indexes: artist_outgoing, artist_incoming, artist_feedback, artist_push, artist

Stream
  Fields:  artist, socialprofile, message, created_at
  Indexes: message, artist_created_at, created_at

Service Informations Twitter
  Fields:  artist, lang, verified, location, id, url, created_at, description, time_zone, profile_image_url, screen_name
  Indexes: artist

Service Informations Facebook
  Fields:  artist, category, name, fan_count, bio, url, username, record_label, location, profile_image_url, band_members, website, link, pinnwand_posts, genre, friends, id
  Indexes: artist

Service Informations Myspace
  Fields:  artist, website, genre, location, art_des_labels, headline, created_at, id, profile_image_url, label
  Indexes: artist

Stream Informations Facebook
  Fields:  stream, name, caption, link, likes, type, icon
  Indexes: stream

Stream Informations Twitter
  Fields:  stream, source, in_reply_to_status_id, in_reply_to_user_id, truncated, deleted
  Indexes: stream

Stream Informations Myspace
  Fields:  stream, category, image, link, source
  Indexes: stream
MongoDB Schema

Artist
  id (Object Id)
  name (str)
  genres (strict array)
  socialprofiles (strict array)
    service (dbref)
    url (str)
    numbers (strict array)
      incoming
      outgoing
      push
      feedback
      date
    meta (array)
      (different fields, depending on the platform)
  Indexes
    name,
    genres,
    socialprofiles.service,
    socialprofiles.numbers

Stream
  id (hash of the facebook / myspace / twitter id)
  socialprofile (dbref)
  genres (strict array) (redundant copy of the artist's genres, so that the stream can be queried directly by genre)
  data (array)
    (data from the platforms; the field "message" is required)
  created_at (datetime)
  Indexes
    socialprofile,
    genres,
    data.message
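
To make the schema more concrete, here is a minimal sketch of how a stream item might be built from an artist document. The function name makeStreamItem, the plain string standing in for each DBRef, and the "service:id" key are assumptions for illustration only; the slides state that the real id is a hash of the platform id but do not say which hash.

```javascript
// Hypothetical sample document shaped like the Artist schema above.
// "numbers" is modelled as an array of dated snapshots (an assumption).
const artist = {
  _id: "artist-1",                       // stands in for an ObjectId
  name: "Snoop Dogg",
  genres: ["hiphop"],
  socialprofiles: [
    {
      service: "twitter",                // stands in for a DBRef to the service
      url: "http://twitter.com/snoopdogg",
      numbers: [{ incoming: 2030350, outgoing: 1204, push: 3145, feedback: 22750, date: "2010-09-29" }],
      meta: { screen_name: "SnoopDogg", verified: true }
    }
  ]
};

// Build a stream item for one post. The artist's genres are copied onto the
// item so the stream can be filtered by genre without loading the artist.
function makeStreamItem(artist, profile, platformId, data) {
  if (typeof data.message !== "string") {
    throw new Error('stream data needs a "message" field');
  }
  return {
    _id: profile.service + ":" + platformId, // placeholder; the slides use a hash here
    socialprofile: profile.url,              // stands in for a DBRef to the profile
    genres: artist.genres.slice(),           // redundant copy of the artist's genres
    data: data,
    created_at: new Date()
  };
}

const item = makeStreamItem(artist, artist.socialprofiles[0], "12345", { message: "hello" });
console.log(item._id, item.genres);
```

Copying the genres costs some storage and means genre changes on an artist have to be propagated, but it lets a genre filter on the stream run against a single collection.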
How do we get the data? (Simplified)

 Crawler                                 musweet

 • Processes links            Links      • Displays the content
 • Extracts media                        • Maps content to artist / service
 • Prepares the content

                              Data
Data in detail
Artist profile on MySpace

"numbers" : {
     "outgoing" : 221665,
     "incoming" : 770355,
     "feedback" : 36862603,
     "push" : 0,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "website" : "http://www.snoopdogg.com",
     "genre" : "Hip Hop / Rap / R&B",
     "location" : "Long Beach, California Vereinigte Staaten von Amerika",
     "art_des_labels" : "Major",
     "headline" : "",
     "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)",
     "id" : 6344278,
     "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/m_9857dcca155247b69e1260e6e34cce3c.jpg",
     "label" : "Doggystyle / Priority"
}
Artist profile on twitter

"numbers" : {
     "outgoing" : 1204,
     "incoming" : 2030350,
     "feedback" : 22750,
     "push" : 3145,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "lang" : "en",
     "verified" : true,
     "location" : "LBC",
     "id" : 3004231,
     "url" : "http://www.snoopdogg.com",
     "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)",
     "description" : "More Malice CD + DVD IN STORES NOW",
     "time_zone" : "Pacific Time (US & Canada)",
     "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg",
     "screen_name" : "SnoopDogg"
}
Artist profile on facebook

"numbers" : {
     "outgoing" : 0,
     "incoming" : 2930860,
     "feedback" : 0,
     "push" : 0
},
"meta" : {
     "category" : "Musicians",
     "name" : "Snoop Dogg",
     "fan_count" : 2930860,
     "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....",
     "url" : "http://www.facebook.com/snoopdogg?v=info",
     "username" : "snoopdogg",
     "record_label" : "Priority/Doggystyle ",
     "location" : "Long Beach, CA",
     "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs622.snc3/27524_11455644806_1192_s.jpg",
     "band_members" : "Snoop Dogg",
     "website" : [
          "http://www.snoopdogg.com",
          "http://www.myspace.com/snoopdogg",
          "http://twitter.com/snoopdogg"
     ],
     "link" : "http://www.facebook.com/snoopdogg",
     "pinnwand_posts" : 0,
     "genre" : "Hip Hop / Rap / R&B",
     "friends" : 0,
     "id" : "11455644806"
}
Queries
MySQL vs. MongoDB (1)

All social media profiles of one artist, with follower counts

MySQL
  SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
  FROM
    numbers AS n
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
  WHERE
    a.name = "Snoop Dogg"
  ORDER BY n.incoming DESC

  Duration: 0.0288 s

MongoDB
  db.artist.find( { "name": "Snoop Dogg" } )

  Duration: 0.0001 s
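
The MongoDB query returns one artist document with all profiles embedded, so the per-profile rows that the SQL join produces have to be derived on the client. A minimal sketch of that step follows; the function name profileRows and the assumption that the newest snapshot is the last entry of "numbers" are illustrative and not taken from the slides. The follower counts are the ones shown on the profile slides.

```javascript
// Hypothetical artist document, reduced to the fields used here.
const artist = {
  name: "Snoop Dogg",
  socialprofiles: [
    { service: "myspace",  url: "http://www.myspace.com/snoopdogg",  numbers: [{ incoming: 770355 }] },
    { service: "twitter",  url: "http://twitter.com/snoopdogg",      numbers: [{ incoming: 2030350 }] },
    { service: "facebook", url: "http://www.facebook.com/snoopdogg", numbers: [{ incoming: 2930860 }] }
  ]
};

// Turn the embedded profiles into rows like the ones the SQL query returns,
// ordered by follower count (incoming), highest first.
function profileRows(artist) {
  return artist.socialprofiles
    .map(function (p) {
      const latest = p.numbers[p.numbers.length - 1]; // assumes the newest snapshot is last
      return { artist_name: artist.name, socialprofile_url: p.url, incoming: latest.incoming };
    })
    .sort(function (a, b) { return b.incoming - a.incoming; });
}

const rows = profileRows(artist);
console.log(rows.map(function (r) { return r.socialprofile_url + " " + r.incoming; }));
```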
MySQL vs. MongoDB (2)

The 10 hip-hop musicians with the most followers

MySQL
  SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
  FROM
    numbers AS n
    JOIN artist_genres AS ag ON ag.artist = n.artist
    JOIN genres AS g ON g.id = ag.genre
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
  WHERE
    g.name = "Hip/Hop"
  ORDER BY
    n.incoming DESC
  LIMIT 10

  Duration: 0.8741 s

MongoDB
  db.artist.find( {
    "genres": DBRef("genre", "hiphop")
  } ).sort( {
    "socialprofiles.numbers.incoming": -1
  } ).limit(10)

  Duration: 0.0230 s
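
What the MongoDB query computes can be sketched in plain JavaScript over an in-memory array. This is only an illustration of the semantics, not of how the server executes the query. It assumes that a descending sort on an array-valued path compares the largest element of each document, which is how MongoDB documents sorting on arrays; the names maxIncoming and topByFollowers and the sample data are hypothetical.

```javascript
// Hypothetical in-memory stand-in for the artist collection.
const artists = [
  { name: "A", genres: ["hiphop"], socialprofiles: [{ numbers: [{ incoming: 50 }] }, { numbers: [{ incoming: 900 }] }] },
  { name: "B", genres: ["rock"],   socialprofiles: [{ numbers: [{ incoming: 5000 }] }] },
  { name: "C", genres: ["hiphop"], socialprofiles: [{ numbers: [{ incoming: 300 }] }] }
];

// Largest follower count found on any profile snapshot of an artist.
function maxIncoming(artist) {
  let max = 0;
  artist.socialprofiles.forEach(function (p) {
    p.numbers.forEach(function (n) {
      if (n.incoming > max) { max = n.incoming; }
    });
  });
  return max;
}

// Filter by genre, sort by the largest follower count (descending), keep the first n.
function topByFollowers(artists, genre, n) {
  return artists
    .filter(function (a) { return a.genres.indexOf(genre) !== -1; })
    .sort(function (a, b) { return maxIncoming(b) - maxIncoming(a); })
    .slice(0, n);
}

const top = topByFollowers(artists, "hiphop", 10);
console.log(top.map(function (a) { return a.name; }));
```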
MySQL Index

• The index is read from left to right
  The column order matters
  Columns: "artist", "incoming", "push", "date"

  SELECT * FROM numbers WHERE artist = 1                                  works
  SELECT * FROM numbers WHERE incoming = 1                                does not work (the leading column "artist" is missing)
  SELECT * FROM numbers WHERE artist = 1 AND push < 10                    does not work for "push" ("incoming" is skipped, so only the "artist" prefix is usable)
  SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0   works ("artist" and "incoming" form a prefix)

• Index debugging
  EXPLAIN SELECT * FROM numbers WHERE artist = 1
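
The left-to-right rule can be expressed as a small function. This is a simplified model under stated assumptions, not MySQL's actual optimizer: it only distinguishes equality ("eq") from range ("range") conditions and ignores features such as index condition pushdown. The function name usableIndexColumns is hypothetical.

```javascript
// Simplified model of the leftmost-prefix rule for a composite B-tree index.
// conditions maps a column name to "eq" (equality) or "range" (<, >, BETWEEN, ...).
function usableIndexColumns(indexColumns, conditions) {
  const used = [];
  for (const column of indexColumns) {
    const kind = conditions[column];
    if (kind === undefined) { break; } // gap in the prefix: later columns cannot be used
    used.push(column);
    if (kind === "range") { break; }   // a range condition ends the usable prefix
  }
  return used;
}

const index = ["artist", "incoming", "push", "date"];

// The four queries from the slide, in the same order.
console.log(usableIndexColumns(index, { artist: "eq" }));
console.log(usableIndexColumns(index, { incoming: "eq" }));
console.log(usableIndexColumns(index, { artist: "eq", push: "range" }));
console.log(usableIndexColumns(index, { artist: "eq", push: "range", incoming: "range" }));
```

An empty result means the index cannot narrow the search at all; a non-empty result lists the columns that can bound the index scan, while any remaining conditions are checked row by row.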
MongoDB Index

• Index order is less critical
  a field in the middle of the index can be used, although queries on the leading fields (an index prefix) are served most efficiently

  db.artist.ensureIndex( { "name": 1, "numbers": -1 } );

  db.artist.find( { "name": "Snoop Dogg" } )   works
  db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } )   works
  db.artist.find( {
    "name": "Snoop Dogg",
    "socialprofiles.numbers.incoming": { "$gte": 0 }
  } )   works

• Index debugging
  db.artist.find( { "name": "Snoop Dogg" } ).explain()
Tools & Debugging
MongoDB error messages

• Sorted query without a limit:
  Error: "too much data for sort() with no index. add an index or specify a smaller limit"
  Fix: add the sort field to an index

• Duplicate key error:
  Problem: in older versions (< 1.6.0) the database crashes when there are too many duplicate key errors
  Fix: use an upsert
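
Why an upsert avoids duplicate key errors can be shown with an in-memory stand-in for a collection. This is a sketch of the semantics only; the Map-based collection and the functions insert and upsert are hypothetical. In the mongo shell of that era the same effect is requested by passing true as the third argument of db.collection.update(query, doc, true).

```javascript
// In-memory stand-in for a collection with a unique key on _id (hypothetical).
function insert(collection, doc) {
  if (collection.has(doc._id)) {
    throw new Error("duplicate key: " + doc._id);
  }
  collection.set(doc._id, doc);
}

// Upsert: update the document if the key exists, insert it otherwise.
// Fields are merged here, comparable to an update that uses $set.
function upsert(collection, doc) {
  const existing = collection.get(doc._id) || {};
  collection.set(doc._id, Object.assign({}, existing, doc));
}

const stream = new Map();
upsert(stream, { _id: "twitter:1", message: "first crawl" });
upsert(stream, { _id: "twitter:1", message: "second crawl", likes: 3 }); // same key, no error
console.log(stream.size, stream.get("twitter:1").message);
```

A crawler that sees the same post on every run would hit the error branch of insert repeatedly, whereas upsert simply refreshes the stored item.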
db.serverStatus()

How much memory is used, how many connections, ...

globalLock          How long collections were locked, ...
connections         How many connections are open / available, ...
backgroundFlushing  When the last flush to disk happened, ...
...                 More information in the documentation:
                    http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
Profiling

db.setProfilingLevel(0)   off
db.setProfilingLevel(1)   log slow operations (> 100 ms); the "slow" threshold can optionally be set with db.setProfilingLevel(1, 10)
db.setProfilingLevel(2)   log all operations

system.profile
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
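
Profile entries in this format pack the interesting values into the "info" string, so a small parser helps when scanning many of them. The sketch below works on plain objects shaped like the entries above; the function names parseProfileEntry and slowOperations are hypothetical, and reading "moved" as a flag that the document was relocated during the update is an assumption.

```javascript
// Two entries in the shape shown on the slide (reduced to the fields used here).
const entries = [
  { info: "update musweet.media query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", millis: 0 },
  { info: "update musweet.media query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", millis: 0 }
];

// Extract operation, namespace, nscanned and the "moved" flag from an info string.
function parseProfileEntry(entry) {
  const parts = entry.info.trim().split(/\s+/);
  const scanned = /nscanned:(\d+)/.exec(entry.info);
  return {
    op: parts[0],
    ns: parts[1],
    nscanned: scanned ? parseInt(scanned[1], 10) : null,
    moved: /\bmoved\b/.test(entry.info),
    millis: entry.millis
  };
}

// Keep only operations at or above a threshold in milliseconds.
function slowOperations(entries, thresholdMs) {
  return entries.map(parseProfileEntry).filter(function (e) { return e.millis >= thresholdMs; });
}

const parsed = entries.map(parseProfileEntry);
console.log(parsed);
```

A high nscanned relative to the number of documents returned is the usual hint that a query is missing an index; here every update scans exactly one document because it looks up by _id.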
Analyzing collection objects

Download: http://github.com/compuccino/mongodb-ac
In closing...

• Questions?

• More about us:
  http://compuccino.com
  http://facebook.com/compuccino

• People:
  Grischa Andreew, @grischaandreew
  Nader Cserny, @nadr

Weitere ähnliche Inhalte

Kürzlich hochgeladen

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Kürzlich hochgeladen (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf

Handling Humongous Data Sets

  • 1. musweet.com Handling Humongous Data Sets from the Social Web Grischa Andreew & Nader Cserny, compuccino
  • 2. Agenda • Introduction • Technology • Data in Detail • Queries • Tools & Debugging • Questions
  • 3.
  • 10. Statistics: 7,002 artists; 20,848 social media profiles; 3,914,259 stream items; ~3,450 read queries / sec; 309,855 media items (audio, video, photo); ~15,000 inserts / day; 3.75 GB data size
  • 11.
  • 13. What do we need?
      ARTIST: name, city, genre, picture
      SOCIAL PROFILE: platform (e.g. twitter), link
      MEDIA POSTS: pictures, videos, audio, status messages
      PROFILE INFO: friends, followers, date, websites, profile picture, biography, label, etc.
  • 14. MySQL Schema (diagram)
      Artist: id, name (index: name)
      Numbers: artist, socialprofile, outgoing, incoming, feedback, push
      Socialprofile: id, artist, url
      artist_genres: artist, genre
      Genres: id, name
      Stream: artist, socialprofile
      Service Informations Twitter / Facebook / Myspace: one table per platform with the platform's profile fields (e.g. lang, verified, screen_name; category, fan_count, bio; headline, art_des_labels, label)
      Stream Informations Twitter / Facebook / Myspace: one table per platform with the platform's stream fields (e.g. in_reply_to_status_id, truncated, source; caption, likes, type, icon)
      Every table carries its own indexes.
  • 15. MongoDB Schema
      Artist
        id (Object Id)
        name (str)
        genres (strict array)
        socialprofiles (strict array)
          service (dbref)
          url (str)
          numbers (strict array): incoming, outgoing, push, feedback, date
          meta (array) (different fields depending on the platform)
        Indexes: name, genres, socialprofiles.service, socialprofiles.numbers
      Stream
        id (hash of the facebook / myspace / twitter id)
        socialprofile (dbref)
        genres (strict array) (redundant copy of the artist's genres, so the stream can be queried directly by genre)
        data (array) (data from the platforms; the field message is mandatory)
        created_at (datetime)
        Indexes: socialprofile, genres, data.message
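The Stream document on slide 15 can be illustrated with a small sketch. The helper name `buildStreamItem`, the plain string key used in place of the hash, and the sample values are assumptions for illustration; only the field names come from the slide.

```javascript
// Hypothetical helper: builds a Stream document from an artist and one platform post.
function buildStreamItem(artist, socialprofileRef, platformPost) {
  // The slide marks "message" as mandatory inside the data field.
  if (typeof platformPost.message !== "string" || platformPost.message.length === 0) {
    throw new Error("stream data needs a 'message' field");
  }
  return {
    // Assumption: a readable "platform:id" key stands in for the hash of the platform id.
    _id: platformPost.platform + ":" + platformPost.id,
    socialprofile: socialprofileRef,
    // Redundant copy of the artist's genres so the stream can be filtered by genre directly.
    genres: artist.genres.slice(),
    data: platformPost,
    created_at: new Date(platformPost.created_at),
  };
}

const artist = { name: "Snoop Dogg", genres: ["hiphop"], socialprofiles: [] };
const ref = { $ref: "socialprofile", $id: "sp1" }; // hypothetical reference value
const item = buildStreamItem(artist, ref, {
  platform: "twitter",
  id: 42,
  message: "hello",
  created_at: "2010-09-29T00:00:00Z",
});
console.log(item._id, item.genres);
```

Copying the genres trades some storage for the ability to query the stream by genre without first resolving the artist.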
  • 16. How do we get the data? (Simplified)
      Crawler: processing of links, extraction of media, preparation of the content
      musweet: display of the content, mapping of artist / service
      Flow: musweet sends links to the crawler; the crawler sends data back to musweet
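According to the editor notes, musweet registers social profiles with the crawler and is notified when a profile changes or a new message appears. A minimal in-memory sketch of that register-and-push pattern follows; the class `CrawlerRegistry`, its methods and the payload shape are hypothetical and say nothing about how the real crawler is built.

```javascript
// Hypothetical registry: client apps register profile URLs and get a callback on changes.
class CrawlerRegistry {
  constructor() {
    this.subscribers = new Map(); // profile URL -> list of callbacks
  }

  register(profileUrl, onChange) {
    const callbacks = this.subscribers.get(profileUrl) || [];
    callbacks.push(onChange);
    this.subscribers.set(profileUrl, callbacks);
  }

  // Called by the crawler when it detects a change; returns the number of notified clients.
  notify(profileUrl, payload) {
    const callbacks = this.subscribers.get(profileUrl) || [];
    callbacks.forEach((callback) => callback(payload));
    return callbacks.length;
  }
}

const crawler = new CrawlerRegistry();
const received = [];
crawler.register("http://twitter.com/snoopdogg", (payload) => received.push(payload));
const delivered = crawler.notify("http://twitter.com/snoopdogg", { type: "new_message" });
console.log(delivered, received.length);
```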
  • 17.
  • 19. Artist profile on MySpace
      "numbers" : {
        "outgoing" : 221665,
        "incoming" : 770355,
        "feedback" : 36862603,
        "push" : 0,
        "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
      },
      "meta" : {
        "website" : "http://www.snoopdogg.com",
        "genre" : "Hip Hop / Rap / R&B",
        "location" : "Long Beach, California Vereinigte Staaten von Amerika",
        "art_des_labels" : "Major",
        "headline" : "",
        "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)",
        "id" : 6344278,
        "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/m_9857dcca155247b69e1260e6e34cce3c.jpg",
        "label" : "Doggystyle / Priority"
      }
  • 20. Artist profile on twitter
      "numbers" : {
        "outgoing" : 1204,
        "incoming" : 2030350,
        "feedback" : 22750,
        "push" : 3145,
        "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
      },
      "meta" : {
        "lang" : "en",
        "verified" : true,
        "location" : "LBC",
        "id" : 3004231,
        "url" : "http://www.snoopdogg.com",
        "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)",
        "description" : "More Malice CD + DVD IN STORES NOW",
        "time_zone" : "Pacific Time (US & Canada)",
        "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg",
        "screen_name" : "SnoopDogg"
      }
  • 21. Artist profile on facebook
      "numbers" : {
        "outgoing" : 0,
        "incoming" : 2930860,
        "feedback" : 0,
        "push" : 0
      },
      "meta" : {
        "category" : "Musicians",
        "name" : "Snoop Dogg",
        "fan_count" : 2930860,
        "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....",
        "url" : "http://www.facebook.com/snoopdogg?v=info",
        "username" : "snoopdogg",
        "record_label" : "Priority/Doggystyle ",
        "location" : "Long Beach, CA",
        "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs622.snc3/27524_11455644806_1192_s.jpg",
        "band_members" : "Snoop Dogg",
        "website" : [ "http://www.snoopdogg.com", "http://www.myspace.com/snoopdogg", "http://twitter.com/snoopdogg" ],
        "link" : "http://www.facebook.com/snoopdogg",
        "pinnwand_posts" : 0,
        "genre" : "Hip Hop / Rap / R&B",
        "friends" : 0,
        "id" : "11455644806"
      }
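Because the numbers object has the same shape on every platform while meta varies, code that aggregates follower counts never needs to look at meta. The sketch below uses the numbers from slides 19 to 21; the `service` strings (the schema stores a dbref there) and the function names are assumptions.

```javascript
// Three social profiles of one artist; "numbers" is uniform, "meta" is platform-specific.
const socialprofiles = [
  { service: "myspace", numbers: { outgoing: 221665, incoming: 770355, feedback: 36862603, push: 0 },
    meta: { label: "Doggystyle / Priority" } },
  { service: "twitter", numbers: { outgoing: 1204, incoming: 2030350, feedback: 22750, push: 3145 },
    meta: { screen_name: "SnoopDogg" } },
  { service: "facebook", numbers: { outgoing: 0, incoming: 2930860, feedback: 0, push: 0 },
    meta: { fan_count: 2930860 } },
];

// Sum of followers / fans / friends across all platforms.
function totalIncoming(profiles) {
  return profiles.reduce((sum, profile) => sum + profile.numbers.incoming, 0);
}

// The profile with the largest audience.
function topProfile(profiles) {
  return profiles.reduce((best, profile) =>
    profile.numbers.incoming > best.numbers.incoming ? profile : best);
}

console.log(totalIncoming(socialprofiles));      // 5731565
console.log(topProfile(socialprofiles).service); // facebook
```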
  • 22.
  • 24. MySQL vs. MongoDB (1): all social media profiles of one artist, with follower counts
      MySQL (duration: 0.0288 s)
        SELECT n.incoming, a.id AS artist, a.name AS artist_name, s.id AS socialprofile, s.url AS socialprofile_url
        FROM numbers AS n
        JOIN socialprofile AS s ON s.id = n.socialprofile
        JOIN artist AS a ON a.id = n.artist
        WHERE a.name = "Snoop Dogg"
        ORDER BY n.incoming DESC
      MongoDB (duration: 0.0001 s)
        db.artist.find( { "name": "Snoop Dogg" } )
  • 25. MySQL vs. MongoDB (2): the 10 hip-hop musicians with the most followers
      MySQL (duration: 0.8741 s)
        SELECT n.incoming, a.id AS artist, a.name AS artist_name, s.id AS socialprofile, s.url AS socialprofile_url
        FROM numbers AS n
        JOIN artist_genres AS ag ON ag.artist = n.artist
        JOIN genres AS g ON g.id = ag.genre
        JOIN socialprofile AS s ON s.id = n.socialprofile
        JOIN artist AS a ON a.id = n.artist
        WHERE g.name = "Hip/Hop"
        ORDER BY n.incoming DESC
        LIMIT 10
      MongoDB (duration: 0.0230 s)
        db.artist.find( { "genre": DBRef("genre","hiphop") } ).sort( { "socialprofiles.numbers.incoming": -1 } ).limit(10)
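What the MongoDB query on slide 25 computes can be mimicked in plain JavaScript on an in-memory array. This is only an illustration of filter, sort and limit on embedded documents; the sample artists are invented, genres are plain strings instead of DBRefs, and the sketch assumes that a descending sort on an array field compares the largest element.

```javascript
// Invented sample data in the shape of the Artist documents.
const artists = [
  { name: "Artist A", genres: ["hiphop"], socialprofiles: [{ numbers: { incoming: 100 } }, { numbers: { incoming: 500 } }] },
  { name: "Artist B", genres: ["hiphop"], socialprofiles: [{ numbers: { incoming: 900 } }] },
  { name: "Artist C", genres: ["rock"],   socialprofiles: [{ numbers: { incoming: 5000 } }] },
  { name: "Artist D", genres: ["hiphop"], socialprofiles: [{ numbers: { incoming: 50 } }] },
];

// Largest follower count over all profiles of an artist (0 if there are no profiles).
function maxIncoming(artist) {
  return Math.max(0, ...artist.socialprofiles.map((profile) => profile.numbers.incoming));
}

// In-memory counterpart of find({genre}).sort({incoming: -1}).limit(n).
function topByFollowers(allArtists, genre, limit) {
  return allArtists
    .filter((artist) => artist.genres.includes(genre))
    .sort((x, y) => maxIncoming(y) - maxIncoming(x))
    .slice(0, limit)
    .map((artist) => artist.name);
}

console.log(topByFollowers(artists, "hiphop", 2)); // [ 'Artist B', 'Artist A' ]
```

No join is needed because the follower counts live inside the artist document.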
  • 26. MySQL Index
      • The index is read from left to right, so the order of the columns matters. Index columns: "artist", "incoming", "push", "date"
          SELECT * FROM numbers WHERE artist = 1                                  -- works
          SELECT * FROM numbers WHERE incoming = 1                                -- does not work
          SELECT * FROM numbers WHERE artist = 1 AND push < 10                    -- does not work for push (incoming is skipped, so only the artist prefix can be used)
          SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0   -- works
      • Index debugging
          EXPLAIN SELECT * FROM numbers WHERE artist = 1
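The left-to-right rule can be written down as a tiny checker. The function below is a simplified model, not MySQL's optimizer: it walks the index columns in order, keeps consuming columns that have an equality condition, and stops after the first range condition or the first column without a condition. The function name and the predicate format are assumptions.

```javascript
// Returns the leading index columns a query can use under the leftmost-prefix rule.
// predicates maps column name -> operator, e.g. { artist: "=", push: "<" }.
function usedIndexColumns(indexColumns, predicates) {
  const used = [];
  for (const column of indexColumns) {
    const operator = predicates[column];
    if (operator === undefined) break; // gap in the prefix: later columns are unusable
    used.push(column);
    if (operator !== "=") break;       // a range condition ends the usable prefix
  }
  return used;
}

const index = ["artist", "incoming", "push", "date"];

console.log(usedIndexColumns(index, { artist: "=" }));                          // [ 'artist' ]
console.log(usedIndexColumns(index, { incoming: "=" }));                        // []
console.log(usedIndexColumns(index, { artist: "=", push: "<" }));               // [ 'artist' ]
console.log(usedIndexColumns(index, { artist: "=", push: "<", incoming: ">" })); // [ 'artist', 'incoming' ]
```

An empty result corresponds to a query that cannot use the index at all; a query that skips incoming can use at most the artist prefix.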
  • 27. MongoDB Index
      • The order within the index does not matter; a field from the middle of the index can be used
          db.artist.ensureIndex( { "name": 1, "numbers": -1 } );
          db.artist.find( { "name": "Snoop Dogg" } )                                                                  // works
          db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } )                                     // works
          db.artist.find( { "name": "Snoop Dogg", "socialprofiles.numbers.incoming": { "$gte": 0 } } )                // works
      • Index debugging
          db.artist.find( { "name": "Snoop Dogg" } ).explain()
  • 28.
  • 30. MongoDB error messages
      • Sorted query without a limit. Error: "too much data for sort() with no index. add an index or specify a smaller limit". Fix: add the sort field to the index.
      • Duplicate key error. Problem: in older versions (< 1.6.0) the DB crashes when too many duplicate key errors occur. Fix: use upserts.
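Why an upsert sidesteps duplicate key errors can be shown with an in-memory stand-in for a collection. `TinyCollection` is a hypothetical class for illustration, not a MongoDB API; in the mongo shell the corresponding call is an update with the upsert option set.

```javascript
// Hypothetical in-memory collection keyed by _id.
class TinyCollection {
  constructor() {
    this.docs = new Map();
  }

  // Plain insert: a second document with the same _id is rejected.
  insert(doc) {
    if (this.docs.has(doc._id)) {
      throw new Error("duplicate key error: " + doc._id);
    }
    this.docs.set(doc._id, doc);
  }

  // Upsert: update the document if it exists, insert it otherwise. Never raises a duplicate key error.
  upsert(doc) {
    const existed = this.docs.has(doc._id);
    this.docs.set(doc._id, { ...this.docs.get(doc._id), ...doc });
    return existed ? "updated" : "inserted";
  }
}

const stream = new TinyCollection();
console.log(stream.upsert({ _id: "twitter:42", message: "first crawl" }));  // inserted
console.log(stream.upsert({ _id: "twitter:42", message: "second crawl" })); // updated
```

A crawler that sees the same post on every run can therefore write it repeatedly without checking for existence first.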
  • 31. db.serverStatus(): how much memory is used, how many connections, ...
      globalLock: how long collections were locked, ...
      connections: how many connections are open / available, ...
      backgroundFlushing: when the last flush to disk took place, ...
      ...
      More info in the documentation: http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
  • 32. Profiling
      db.setProfilingLevel(0): off
      db.setProfilingLevel(1): log slow operations (> 100 ms); "slow" can optionally be defined with db.setProfilingLevel(1, 10)
      db.setProfilingLevel(2): log all operations
      system.profile
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
        { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
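The three profiling levels boil down to a simple decision per operation. The sketch below models that decision and filters entries shaped like the system.profile documents; the function names and the sample entries are assumptions, and the strict "greater than" comparison is a reading of "slow operations (> 100 ms)".

```javascript
// Would an operation that took `millis` be logged at the given profiling level?
function isProfiled(level, slowMs, millis) {
  if (level === 0) return false; // profiling off
  if (level === 2) return true;  // log everything
  return millis > slowMs;        // level 1: only slow operations
}

// Keep only the profile entries above a threshold, slowest first.
function slowOperations(entries, thresholdMs) {
  return entries
    .filter((entry) => entry.millis > thresholdMs)
    .sort((a, b) => b.millis - a.millis);
}

// Invented entries in the shape of system.profile documents.
const entries = [
  { info: "update musweet.media", millis: 0 },
  { info: "query musweet.stream", millis: 240 },
  { info: "query musweet.artist", millis: 12 },
];

console.log(isProfiled(1, 100, 240));                       // true
console.log(slowOperations(entries, 10).map((e) => e.info)); // [ 'query musweet.stream', 'query musweet.artist' ]
```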
  • 33. Analyzing collection objects. Download: http://github.com/compuccino/mongodb-ac
  • 34.
  • 36. In closing... • Questions? • More about us: http://compuccino.com http://facebook.com/compuccino • People: Grischa Andreew, @grischaandreew Nader Cserny, @nadr

Editor's notes

  1. Extensibility and handling of large data volumes in the context of our project musweet.com
  2. What is musweet? Why we chose MongoDB. Comparison between the old system with MySQL and MongoDB. Interesting queries. Which tools & debugging methods we use.
  3. A website about music and its players on the social web: it measures and rates online activity in real time, analyzes data sources and presents them, and shows photos, music and videos of bands and musicians. Experience gained at wahl.de with MySQL, now implemented with MongoDB at musweet.com.
  4. Media stream with link expander (= media contained in a post are displayed directly on the page). We currently crawl myspace, facebook and twitter -> to be extended to blogs and youtube later. The stream can be filtered by genre.
  5. Most discussed topics of the last 7 days
  6. Who gained the most friends (Big Mover), who wrote the most messages (Big Shaker). Filterable by genre. Updated daily.
  7. Core information about an artist. Social media profiles => where the musician is active on the web. The musician's media stream. Upcoming concerts. Related artists: similar ones in the same genre and with similar contact numbers.
  8. Growing data base. Activity from the social web demands high insert performance. Started with well-known artists first, to be extended later.
  9. We have artists with various social profiles, each of which has different profile / stream information. The stream and the profile information should be sortable by the artist's attributes (genres, ...).
  10. For every additional service we need two more tables (profile information, stream); for every additional attribute of the artist / social profile that should be multi-valued we need a join table and a data table (artist -> artists_genres -> genres). Because of the many tables it is not easy to query the data, and every change has to be implemented in both the backend and the frontend.
  11. A drastically reduced schema is possible. A new attribute only requires a new entry in the object (without touching the DB). The changes are implemented in the backend; the frontend does not have to be changed.
  12. The crawler is a standalone application and manages the crawls for several client apps such as musweet.com. musweet.com registers the social profiles with the crawler and receives a push notification when a profile changes or a new message is posted.
  13. The numbers object is fixed and always has the same structure; the meta object is filled with platform-specific data.
  14. For Twitter we have different information than for Myspace. "profile_image_url" denotes the artist's profile picture on the platform.
  15. For Facebook we usually have more information than for the other platforms, depending on the Facebook account type (fan page / user profile).
  16. MySQL: either a JOIN or 3 SELECTs. MongoDB queries are much simpler and faster.
  17. MySQL: even more JOINs or SELECT statements. MongoDB: DBRef to the genre.
  18. Many different indexes are necessary => many GB of data
  19. Indexes are more space-efficient and easier to use. A MongoDB index can contain only one multikey field (a field holding array data).
  20. Error messages we ran into during development. The "too much data for sort()" error only shows up later, once there is a lot of data in the DB.
  21. globalLock: how long the DB was locked. mem: how much memory is used. indexCounters: how many hits, how many misses. connections: how many open, how many available. opcounters: how many inserts, updates, deletes. backgroundFlushing: when the last flush took place.
  22. Slow database queries or all queries. Profiling at the database level.
  23. A useful tool for finding out how many different object structures a collection contains and for viewing their layout.
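The idea behind such a structure analysis can be sketched in a few lines: reduce every document to a signature made of its sorted top-level keys and count how often each signature occurs. This is an illustration of the idea only, not the implementation of the linked tool; the function names and sample documents are assumptions.

```javascript
// A document's structure signature: its top-level keys, sorted and joined.
function structureSignature(doc) {
  return Object.keys(doc).sort().join(",");
}

// Count how many documents share each structure.
function countStructures(docs) {
  const counts = new Map();
  for (const doc of docs) {
    const signature = structureSignature(doc);
    counts.set(signature, (counts.get(signature) || 0) + 1);
  }
  return counts;
}

// Invented meta objects in the style of the platform-specific profile data.
const metaDocs = [
  { lang: "en", verified: true, screen_name: "SnoopDogg" },
  { screen_name: "someone", verified: false, lang: "de" },
  { category: "Musicians", fan_count: 2930860 },
];

for (const [signature, count] of countStructures(metaDocs)) {
  console.log(count + " x { " + signature + " }");
}
```

Sorting the keys makes two documents with the same fields in a different order count as one structure.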