“If everything seems under control, you're not going fast enough.”
real-time analysis of the #debate hashtag

Davide Palmisano @dpalmisano
when size matters: the 4Vs of big data
Volume, Velocity, Variety, and Veracity

let’s focus on Velocity
during peak time, ~35 people per second top up their Oyster card*
http://www.tfl.gov.uk/corporate/modesoftransport/londonunderground/1608.aspx
every second, ~58 new pictures are uploaded to Instagram*
http://www.digitalbuzzblog.com/infographic-instagram-stats/
on the night of the first #debate, 2615 tweets per second were recorded*
http://www.nbcnews.com/technology/technolog/presidential-debate-sets-twitter-record-6281796
What were the most influential URLs?

What were the implicit concepts underlying the conversation?

How did these concepts evolve during the discussion?
every single tweet potentially contains some hidden information

extracting that information, making it explicit, analysing it,
and doing all of it at a rate of ~2000 tweets/sec?
real-time analytics

Storm, a free and open source distributed realtime computation system.
Storm makes it easy to reliably process unbounded streams of data,
doing for realtime processing what Hadoop did for batch processing.
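
As a rough illustration of the source-and-processing-stage style of stream computation (Storm calls these stages spouts and bolts), here is a minimal Python sketch built from chained generators. It does not use Storm's API; the names `tweet_spout`, `extract_urls_bolt`, and `count_bolt` and the sample tweets are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Simplistic URL pattern, good enough for the illustration only.
URL_RE = re.compile(r"https?://\S+")


def tweet_spout(tweets: Iterable[str]) -> Iterator[str]:
    """Source stage: emit tweets one at a time (stand-in for a live stream)."""
    for text in tweets:
        yield text


def extract_urls_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Processing stage: emit every URL found in each incoming tweet."""
    for text in stream:
        yield from URL_RE.findall(text)


def count_bolt(stream: Iterable[str]) -> Counter:
    """Sink stage: keep a running count per URL."""
    counts: Counter = Counter()
    for url in stream:
        counts[url] += 1
    return counts


# Hypothetical sample input; a real deployment would read from a stream.
sample = [
    "watching #debate http://example.com/a",
    "#debate fact check http://example.com/a http://example.com/b",
    "no links here #debate",
]
url_counts = count_bolt(extract_urls_bolt(tweet_spout(sample)))
```

Each stage only sees one item at a time, which is what lets this style of pipeline keep up with an unbounded input.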
batch analyses

The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.

+ hdfs, a distributed FS
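
The "simple programming model" most often associated with Hadoop is map/reduce. The sketch below imitates its three steps (map, shuffle, reduce) in a single Python process to count concept mentions. It is an illustration of the idea only, not Hadoop code; `map_phase`, `shuffle`, `reduce_phase`, `run_job`, and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: List[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (concept, 1) pair for every concept in one record."""
    for concept in record:
        yield concept, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single total."""
    return key, sum(values)


def run_job(records: Iterable[List[str]]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical input: one list of extracted concepts per tweet.
records = [["Iran", "Israel"], ["Iran"], ["Russia", "Middle East"]]
totals = run_job(records)
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data between them; the logic per step stays this small.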
data gathering from the Social Web

crunching the Social Web, in real-time.

formerly known as Beancounter
beancounter.io is a SaaS platform to profile your users
from their activities on the Social Web

now powering part of the Italian public broadcaster's
#socialtv environment
(a quick parenthesis)

or ...

“how a butterfly flapping its wings in Asia might
cause a hurricane in the Atlantic”*
http://www.amazon.com/Strategic-Thinking-New-Science-Complexity/dp/0684842688
beancounter.io uses Twitter OAuth authorisation
to perform check-ins at social TV events

while beancounter.io was handling more than ~100 check-ins per minute,
at 13:32 UTC-8 Twitter had an outage*
https://status.io.watchmouse.com/7617/125017//statuses/home_timeline-(OAuth-1.0a)
[Figure: Facebook and Twitter check-in rate, 2012-11-06 20:45 to 23:30; y-axis 0 to 200; Twitter service disruption marked at Nov 6, 2012 13:32 UTC-8]
[Figure: Facebook and Twitter overall comments, 2012-11-06 20:45 to 23:00; y-axis 0 to 1500; series: Facebook, Twitter; Twitter service disruption marked at Nov 6, 2012 13:32 UTC-8]
lesson learnt: the real-time Web is a hyper-connected graph
of a myriad of different live systems

always mind the butterflies,
even if you can’t see them
back to #debate
<timestamp, <c0...cn>>

concepts are extracted from each tweet using NLP technologies
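
One way to picture the `<timestamp, <c0...cn>>` pair, and how it could answer the earlier question about concepts evolving over time, is sketched below. This is an assumption about the shape of the data, not the actual beancounter.io model: the `Snapshot` class, the `concepts_per_minute` helper, the one-minute bucket size, and the timestamps are all made up for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-tweet record: a timestamp plus extracted concepts."""
    timestamp: datetime
    concepts: Tuple[str, ...]


def concepts_per_minute(snapshots: Iterable[Snapshot]) -> Dict[datetime, Counter]:
    """Count concept mentions inside one-minute buckets."""
    buckets: Dict[datetime, Counter] = defaultdict(Counter)
    for snap in snapshots:
        # Truncate the timestamp to the start of its minute.
        minute = snap.timestamp.replace(second=0, microsecond=0)
        buckets[minute].update(snap.concepts)
    return dict(buckets)


# Made-up sample data, not taken from the talk.
snaps = [
    Snapshot(datetime(2012, 10, 23, 1, 0, 5), ("Iran", "Israel")),
    Snapshot(datetime(2012, 10, 23, 1, 0, 40), ("Iran",)),
    Snapshot(datetime(2012, 10, 23, 1, 1, 2), ("Russia",)),
]
timeline = concepts_per_minute(snaps)
```

Reading the buckets in time order gives a per-concept series that can be plotted against the discussion.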
we’ve tied together beancounter.io, Storm and Hadoop

please note, this was only 10% of the firehose

[Diagram: Storm, real-time analytics; hdfs (distributed FS), batch analytics]
more than 500k tweets processed in 2h,
for an average rate of ~70 t/sec

each tweet produced a snapshot (~10k each),
for an overall size of 4.6GB of data
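
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes exactly 500,000 tweets, exactly two hours, and decimal gigabytes (1 GB = 10^9 bytes); the slide only gives rounded values.

```python
# Inputs taken from the slide, rounded as stated there.
tweets = 500_000
duration_s = 2 * 60 * 60          # 2 hours in seconds

# Average throughput over the whole run.
avg_rate = tweets / duration_s    # tweets per second

# Average snapshot size, assuming 4.6 GB means 4.6 * 10**9 bytes.
total_bytes = 4.6e9
avg_snapshot_bytes = total_bytes / tweets

print(f"average rate: {avg_rate:.1f} tweets/sec")
print(f"average snapshot: {avg_snapshot_bytes:.0f} bytes")
```

Both results agree with the rounded numbers quoted: roughly 70 tweets per second and a little under 10k per snapshot.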
more than ~18k different URLs shared

highest peak: 253 tweets/sec

5 Amazon EC2 x-large instances
+ 2 mid-sized for HDFS
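
A peak figure such as "253 tweets/sec" can be derived by bucketing tweet timestamps into whole seconds and taking the fullest bucket. The sketch below shows one way to do that; the `peak_rate` helper and the sample timestamps are hypothetical, and the talk does not say how its peak was measured.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def peak_rate(timestamps: Iterable[datetime]) -> Tuple[datetime, int]:
    """Return the busiest one-second bucket and its tweet count.

    Raises ValueError if no timestamps are given.
    """
    # Drop the sub-second part so tweets in the same second share a key.
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return max(per_second.items(), key=lambda item: item[1])


# Made-up sample: three tweets in one second, one in the next.
stamps = [
    datetime(2012, 10, 23, 1, 0, 0, 100000),
    datetime(2012, 10, 23, 1, 0, 0, 500000),
    datetime(2012, 10, 23, 1, 0, 0, 900000),
    datetime(2012, 10, 23, 1, 0, 1, 200000),
]
busiest_second, busiest_count = peak_rate(stamps)
```

Fixed one-second buckets are an approximation: a burst that straddles a second boundary is split across two buckets, so a sliding window would report a slightly higher peak.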
[Figure: recurring concepts, bar chart; y-axis 0 to 70000; concepts: Osama Bin Laden, Iran, Israel, Middle East, Pakistan, Iraq, Afghanistan, Russia]
most co-occurring concepts

Iran - Israel 35.356%
Russia - Middle East 24.7%
...
Wikileaks - Richard Nixon 93.5%
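
The slide does not say how the co-occurrence percentages were computed. As one plausible reading, the sketch below scores each pair of concepts as the share of tweets mentioning either concept that mention both (a Jaccard-style ratio). The formula, the `cooccurrence_percent` helper, and the sample data are assumptions for illustration and will not reproduce the numbers above.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_percent(
    concept_sets: Iterable[Set[str]],
) -> Dict[Tuple[str, str], float]:
    """Percentage of tweets mentioning either concept that mention both."""
    single: Counter = Counter()   # tweets mentioning each concept
    pair: Counter = Counter()     # tweets mentioning both concepts of a pair
    for concepts in concept_sets:
        unique = sorted(set(concepts))        # sorted so pair keys are stable
        single.update(unique)
        pair.update(combinations(unique, 2))
    scores: Dict[Tuple[str, str], float] = {}
    for (a, b), both in pair.items():
        either = single[a] + single[b] - both  # inclusion-exclusion
        scores[(a, b)] = 100.0 * both / either
    return scores


# Made-up sample: one set of concepts per tweet.
sample = [
    {"Iran", "Israel"},
    {"Iran"},
    {"Iran", "Israel"},
    {"Israel", "Russia"},
]
scores = cooccurrence_percent(sample)
```

A ratio of this kind explains how a rare pair can score very high: two concepts that appear only a handful of times, but almost always together, outrank frequent concepts that appear together only some of the time.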
[Figure: three unlabelled values: 5321, 17284, 6960]
facts

data viz is a completely different job

mining data requires science skills: it’s not
just about technology, it’s about math

forget about controlling everything when data
flows at that speed: make reasoned approximations
?
Davide Palmisano
@dpalmisano
http://davidepalmisano.com
