Quality to Quantity to Quality
         on the Web




     Andraž Tori, CTO at Zemanta
              @andraz
Topics
- a bit about Zemanta
- how advanced “data tools” and spammers
  interact
We are all trying to organize the web
Making it right,

making it useful

  and linked
Not so long ago, in a city not far away...
some other people
are trying to do the opposite
trying to disorganize it,

  make it confusing,

and to profit from that
using the tools we have built!
Their motives are not sinister
          (mostly)
it is about profit
Profit
- publish as much content as possible
- quality is not (that) important
- get traffic or high page ranking for certain terms
- sell clicks, links or whole “fully built” sites to the
  highest bidder


- users and search engines are a necessary evil, to
  be tricked as cheaply as possible
So, why do I care?
Job opening

You will get a spreadsheet with 180 blog url’s and
logins. You will log into each blog and schedule 2
posts per week ...


You will spice up every post with images and/or
related links within the content, using a Wordpress
plugin called Zemanta


https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
And why might you care?
- organized information is a great tool for those
   who try to disorganize it
- they are poisoning “our web”, including Twitter and
   Facebook
- and it's hard to see through the fog they are causing
- it is just a matter of time before they start poisoning
   linked data too
What do we do at Zemanta?
- Zemanta is a “personal writing assistant”
- suggesting content while you write (your blog)
- analyzing your text
- connecting it with background knowledge, other
  stories on the web, images
- you choose what suggestions to include
- to make your writing more informative, vivid and
   useful
Opening up the hood
the reality
How it works
- Plain text (article) → Analysis → Semantic search → Content suggestions
- Linked data sits under Analysis
- RSS feeds sit under Semantic search
Main design goals
- Input is a meaningful chunk of text (not a keyword
   or a phrase)
- Input is in (semi-)English
- Has to work across all domains in the open
  world
  - music, celebrities, finance, entertainment, politics,
    gardening, parenting, …
Analysis pipeline
- Inputs: Named Entity Extraction, Known phrases extraction (Aho-Corasick)
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics
- Background data: Triple store, etc.
- Output: Disambiguated entities
Analysis pipeline
- Inputs: Categorization to Dmoz, Named Entity Extraction, Known phrases
  extraction (Aho-Corasick)
- Surface form features evaluation
- Statistical comparison to background knowledge
- Semantic coherence and hand-tuned heuristics
- Background data: Triple store, etc.
- Outputs: Categories, Ambiguous named entities, Disambiguated entities
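The slides name Aho-Corasick for the known-phrase step but do not show it. Here is a minimal sketch of that algorithm: it builds a trie of phrases, adds failure links, and reports every phrase occurrence in one pass over the text. The function names and the sample phrases are illustrative assumptions, not Zemanta's code.

```python
from collections import deque


def build_automaton(phrases):
    """Build an Aho-Corasick automaton (hypothetical helper, for illustration)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # phrases that end at each state
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states inherit.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports every phrase its failure target reports.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_phrases(text, automaton):
    """Return (start_index, phrase) for every known phrase found in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches
```

With 30M aliases a production system would likely use a compact, precompiled automaton; the point here is only that scan time grows with the text length plus the number of matches, not with the size of the phrase list.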
Background knowledge
- Data from Wikipedia, MusicBrainz, Freebase…
  and world wild web
- Includes linguistic and semantic properties
   + unstructured data
- Present in two forms:
  - in “original” custom built triple store on top of MySQL
     (150 GB)
  - processed into 7 GB optimized “memory mapped
    dump”
Background knowledge
- 7M mined and linked-up entities and
  concepts
- 30M aliases
- Refreshed about once a month
  - want to make it real-time
- Input data quality is really important
After analysis
- Text → SOLR articles → Related articles
- Text → SOLR images → Images
Example SOLR query
boost(((                                  wiki_entities:Health insurance
wiki_entities:Medical underwriting        wiki_entities:United States
wiki_entities:Affordable Care Act         wiki_entities:Barack Obama
wiki_entities:Lifetime (TV network)       wiki_entities:Insurance
wiki_entities:Preventive medicine         wiki_entities:Child
wiki_entities:Patient Protection and Affordable Care Act )         ^3.0)

(text:zemhealthinsurq^0.68    text:health^0.62        text:premium^0.36
text:zeminsurcompaniq^0.56    text:increas^0.29       text:rate^0.27
text:zemhealthinsurcompaniq^0.35                      text:zempreventcareq^0.26
text:medic^0.26               text:compani^0.23       text:obamacar^0.21
text:todai^0.21               text:polici^0.21        text:care^0.19 ) ^105.0

((dmoz_categories:
Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health
dmoz_categories:
Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States
dmoz_categories:
Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States
/California) ^0.1),

(1 - 0.2) * sqrt(1.0/(1.15E-8*float(1285185600000 - date(published_datetime)
ms)+1.0)) + 0.2)
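The last clause of the query is a recency multiplier. As a worked sketch, the same arithmetic in Python, using the constants from the query (0.2 floor, 1.15E-8 per millisecond); the function and parameter names are assumptions made for illustration.

```python
import math


def recency_boost(now_ms, published_ms, floor=0.2, rate=1.15e-8):
    """Recency multiplier as written in the example query.

    Equals 1.0 for a brand-new article and decays towards `floor`
    as the article ages. Times are in milliseconds since the epoch.
    """
    age_ms = float(now_ms - published_ms)
    return (1 - floor) * math.sqrt(1.0 / (rate * age_ms + 1.0)) + floor


NOW_MS = 1285185600000        # the timestamp that appears in the query
DAY_MS = 24 * 60 * 60 * 1000

fresh = recency_boost(NOW_MS, NOW_MS)
one_day_old = recency_boost(NOW_MS, NOW_MS - DAY_MS)
one_year_old = recency_boost(NOW_MS, NOW_MS - 365 * DAY_MS)
```

A one-day-old article keeps roughly 0.77 of its score, and very old articles never drop below the 0.2 floor, so relevance can still win over freshness.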
Solr
- We adapted Solr for “query by document”
- 52% precision (at 10) on internal evaluations
  - plain Lucene MLT comes to 44%
  - the difference comes from using a “bag of terms” instead of a “bag
    of words” (the terms come from the analysis step)
- Our live index is 5M articles
- Solr is really not optimized to handle 50 terms in
  a single query
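As a rough sketch of “query by document”, the weighted terms produced by the analysis step can be turned into boosted clauses in standard Lucene query syntax (`field:term^boost`). The function name, the 50-term cap as a parameter, and the sample weights are assumptions; escaping of special query characters is left out.

```python
def build_term_query(weighted_terms, field="text", max_terms=50):
    """Turn {term: weight} into a boosted Lucene-style query string.

    Hypothetical helper: keeps the highest-weighted terms and quotes
    multi-word terms so they are matched as phrases.
    """
    top = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    clauses = []
    for term, weight in top[:max_terms]:
        value = f'"{term}"' if " " in term else term
        clauses.append(f"{field}:{value}^{weight:.2f}")
    return " ".join(clauses)


query = build_term_query({"health": 0.62, "premium": 0.36, "rate": 0.27}, max_terms=2)
```

Capping the number of terms is one simple way to trade recall for query speed when the engine struggles with very long queries.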
Lucene plain “More Like This”
Metrics & tests
- Every part of the system is constantly
  evaluated
- Precision/recall at 5 different points in the
  system
- Mostly bi-weekly releases of new datasets and
  the engine
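For reference, “precision at 10” is the share of the top 10 results that a judge marked relevant. A minimal sketch, with made-up document ids and judgments:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are in the relevant set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]  # system output
relevant = {"a", "c", "e", "g", "x"}                         # human judgments
score = precision_at_k(ranked, relevant, k=10)
```

Here four of the ten returned documents are relevant, so the score is 0.4; the 52% and 44% figures above are averages of this kind of number over many test articles.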
Overview
- We do pretty deep processing to deliver the simple
  user experience of a “personal authoring
  assistant”
- And everything is available over the web API
  - tagging
  - named entity recognition and disambiguation to
    Linked Open Data URIs
What does the API offer?

• Tags
                               Most used
• Categories
• Concepts and entities   Most interesting
• Related articles
• Related images
So mash-ups happen...
Some API users
We are just one of the many people offering
 services based on large amounts of web data

each spending man-years trying to organize their
    data, trying to offer best possible service
now back to the bad guys
Job opening

You will get a spreadsheet with 180 blog url’s and
logins. You will log into each blog and schedule 2
posts per week ...


You will spice up every post with images and/or
related links within the content, using a Wordpress
plugin called Zemanta


https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
There's more than meets the eye
Pipeline steps, in order:
- Gather search terms (extensions, logs, guess)
- Analyze what people search for
- Find / create such content
- Cover your tracks
- Use Zemanta or OpenCalais to add tags, images, links
- Pull additional content from Freebase
- Publish
- Use Zemanta to find similar blogs
- Amazon Mechanical Turk to post comments and links back to your site
- Profit?
Warnings
- I've seen no single system using the whole
   pipeline as described, however all parts were
   found in the wild
- Examples used are from all kinds of sites –
  good, bad and ugly
- I am not trying to imply that all of the steps in the
   diagram are bad, but they can be used by bad
   guys efficiently
Finding their keywords, niches
- Domain expertise
- Users like to install extensions and say “yes”
- You observe referrers on sites you control
- You buy the data on the black market
The sophisticated part of the market
“Demand Media relies on a proprietary algorithm
  to help editors best determine what subjects
  their writers should tackle.”
Factors:
  - Keyword competition
  - Revenue
  - Driving traffic to/from existing content



 http://emediavitals.com/article/16/demand-media-s-content-assembly-line
Find / create content
- Steal
- Take from “open article directories”
- Have your own “content assembly line” like
  Demand Media
Open article directories
Tһiѕ iѕ nοt the text you аre lookinɡ for.
Tһiѕ iѕ nοt the text you аre lookinɡ for.
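The sentence above swaps some Latin letters for look-alike characters from other alphabets, so it reads normally but no longer matches the original text. One simple way to flag this, sketched under the assumption that the first word of a character's Unicode name is a good enough script label, is to look for words that mix scripts. The function names are illustrative.

```python
import unicodedata


def scripts_used(word):
    """Return script labels (e.g. LATIN, CYRILLIC) for the letters in word.

    Heuristic: uses the first word of each character's Unicode name.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_words(text):
    """List the words that combine letters from more than one script."""
    return [w for w in text.split() if len(scripts_used(w)) > 1]


# Escapes stand in for the look-alike letters: Cyrillic shha and dze, Greek omicron.
sample = "T\u04bbi\u0455 i\u0455 n\u03bft the text"
suspicious = mixed_script_words(sample)
```

This catches Cyrillic and Greek substitutions but not look-alikes that are themselves Latin letters, so it is a first filter rather than a complete check.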
Translate it to random language and back to English

Übersetzen sie zufällig Sprache und wieder auf Englisch
  Language and translate it happen again in English

 Μεταφράστε αυτό σε δειγματοληπτικούς γλώσσα και
             πίσω στην αγγλική γλώσσα
  Translate this random language back to English

  Traduisez au langage aléatoire et revenir à l'anglais
  Translate to random language to English and back

              它翻译成随机的语言和回英文
Translate it back into the English language and random
Covering their tracks
- Trying to fool search engines or people?
- Search engines are catching up
- Google Translate API is being closed due to
  “abuse”?
- The trend is “rewriting” by human editors,
  procured on the global market
Spammers say the darndest things
Remixing linked data and spam
- Currently mostly the good guys are using Linked
  Data
- However, it's just too tempting to be left alone
- Fully synthetic articles using factual information
  from linked data?
     – Using advanced tools to form proper natural
        language sentences and maybe even storyline?
Publish
- On hosted third party platforms
  - eating their resources
- Platforms have a hard time killing spammers
- Smaller ones don't necessarily have the incentive
- If they remove a spammer too fast, it is easier for
   the spammer to probe the limits
  - Platforms use “kill with delay”
- Spam detection is resource intensive
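One possible reading of “kill with delay”, as a sketch: when an account is flagged, its removal is scheduled for a random later time, so the spammer cannot tell which action triggered detection. The function name, the delay bounds and the uniform random delay are all assumptions; the slides do not say how platforms implement this.

```python
import random


def schedule_removal(flagged_at, min_delay_s=3600.0, max_delay_s=72 * 3600.0, rng=None):
    """Pick a removal time at a random delay after the account was flagged.

    Hypothetical policy: times are seconds since the epoch, and the
    delay is drawn uniformly between the two bounds.
    """
    rng = rng or random.Random()
    return flagged_at + rng.uniform(min_delay_s, max_delay_s)


removal_time = schedule_removal(1000.0, rng=random.Random(0))
```

The cost of this choice is that spam stays online longer, which is part of why the slide notes that spammers eat platform resources.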
Valuable comments

As I write this post, Zemanta is showing me
 5 different articles that are related to my
 post. I could visit each one of these sites
 and reach out to the owner to see if they
 would be interested in linking to my post,
 or I could leave a valuable comment on the
 page and include a link back to my post.

 http://www.mainelyseo.com/zemanta-review-seo-link-building-with-the-zemanta-plugin/
- The guy in the previous slide is honest and
  well-meaning
- But what if you automate that via Amazon
  Mechanical Turk or oDesk?
Profit?
- sell ads
- sell links
- sell “fully developed site”
- to the highest bidder
Search engines to the rescue?

- Mahalo cut 10% of the staff the day after
  Google announced ranking changes
- Demand Media's stock isn't doing that well
  anymore
- However, this is a never-ending story; we'll have
  co-evolution for the foreseeable future
Ecosystem
- Very sophisticated, large players
  - moving to more high quality content, video?
- Small time operations
  - using more and more sophisticated tools available
    on the market cheaply (modern asymmetric
    warfare?)
- Dark industry specifically building tools to poison
  the web and sell them to small time operators
Food for thought
Can we make spammers (and others) work for us,
         making linked data better?


             (think reCAPTCHA)
Could article directories be fruitfully used?

eZineArticles.com, GoArticles.com, etc...
Find rewritten articles and use them as parallel
                     corpus?
Could we use the global workforce market more
    efficiently to get more linked data?
Thesis, antithesis, synthesis?




               http://xkcd.com/810/
Thank you!

Questions?
Image sources

    http://www.flickr.com/photos/dzingeek/4587871752/

    http://www.flickr.com/photos/25101572@N02/4393474025/

    http://www.flickr.com/photos/billward/4740384434/

    http://www.flickr.com/photos/jurvetson/542500748

    http://www.flickr.com/photos/legofenris/4288913574

    http://www.flickr.com/photos/ekilby/3733627940

    http://www.flickr.com/photos/ekilby/3732799269/

    http://www.flickr.com/photos/cipherswarm/38354452

    http://xkcd.com/810/

Making IA Real: Planning an Information Architecture Strategy
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
CSCI 340 Final Group ProjectNatalie Warden, Arturo Gonzalez, R.docx
CSCI 340 Final Group ProjectNatalie Warden, Arturo Gonzalez, R.docxCSCI 340 Final Group ProjectNatalie Warden, Arturo Gonzalez, R.docx
CSCI 340 Final Group ProjectNatalie Warden, Arturo Gonzalez, R.docx
 
A Multifaceted Look At Faceting - Ted Sullivan, Lucidworks
A Multifaceted Look At Faceting - Ted Sullivan, LucidworksA Multifaceted Look At Faceting - Ted Sullivan, Lucidworks
A Multifaceted Look At Faceting - Ted Sullivan, Lucidworks
 

Mehr von Zemanta

Effective content strategy guided by the SAVE framework
Effective content strategy guided by the SAVE frameworkEffective content strategy guided by the SAVE framework
Effective content strategy guided by the SAVE frameworkZemanta
 
7 tips for creating user friendly mobile content
7 tips for creating user friendly mobile content7 tips for creating user friendly mobile content
7 tips for creating user friendly mobile contentZemanta
 
25 Content Marketing Facts That Will Make You Seriously Consider Mobile
25 Content Marketing Facts That Will Make You Seriously Consider Mobile25 Content Marketing Facts That Will Make You Seriously Consider Mobile
25 Content Marketing Facts That Will Make You Seriously Consider MobileZemanta
 
40 killer content marketing and blogging tools
40 killer content marketing and blogging tools40 killer content marketing and blogging tools
40 killer content marketing and blogging toolsZemanta
 
Blogger outreach-campaign-a-guide
Blogger outreach-campaign-a-guideBlogger outreach-campaign-a-guide
Blogger outreach-campaign-a-guideZemanta
 
Triple Your Post Frequency
Triple Your Post FrequencyTriple Your Post Frequency
Triple Your Post FrequencyZemanta
 
7 Steps to Creating Engaging Content
7 Steps to Creating Engaging Content7 Steps to Creating Engaging Content
7 Steps to Creating Engaging ContentZemanta
 
Business Blogging on Fire! - Effective Strategies for Corporate Blogging
Business Blogging on Fire! - Effective Strategies for Corporate BloggingBusiness Blogging on Fire! - Effective Strategies for Corporate Blogging
Business Blogging on Fire! - Effective Strategies for Corporate BloggingZemanta
 
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1Zemanta
 
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2Zemanta
 

Mehr von Zemanta (10)

Effective content strategy guided by the SAVE framework
Effective content strategy guided by the SAVE frameworkEffective content strategy guided by the SAVE framework
Effective content strategy guided by the SAVE framework
 
7 tips for creating user friendly mobile content
7 tips for creating user friendly mobile content7 tips for creating user friendly mobile content
7 tips for creating user friendly mobile content
 
25 Content Marketing Facts That Will Make You Seriously Consider Mobile
25 Content Marketing Facts That Will Make You Seriously Consider Mobile25 Content Marketing Facts That Will Make You Seriously Consider Mobile
25 Content Marketing Facts That Will Make You Seriously Consider Mobile
 
40 killer content marketing and blogging tools
40 killer content marketing and blogging tools40 killer content marketing and blogging tools
40 killer content marketing and blogging tools
 
Blogger outreach-campaign-a-guide
Blogger outreach-campaign-a-guideBlogger outreach-campaign-a-guide
Blogger outreach-campaign-a-guide
 
Triple Your Post Frequency
Triple Your Post FrequencyTriple Your Post Frequency
Triple Your Post Frequency
 
7 Steps to Creating Engaging Content
7 Steps to Creating Engaging Content7 Steps to Creating Engaging Content
7 Steps to Creating Engaging Content
 
Business Blogging on Fire! - Effective Strategies for Corporate Blogging
Business Blogging on Fire! - Effective Strategies for Corporate BloggingBusiness Blogging on Fire! - Effective Strategies for Corporate Blogging
Business Blogging on Fire! - Effective Strategies for Corporate Blogging
 
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 1
 
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2
Zemanta - SocialCrush Columbia - Better Business Blogging - Part 2
 

Kürzlich hochgeladen

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Quality, Quantity, Web and Semantics

  • 1. Quality to Quantity to Quality on the Web Andraž Tori, CTO at Zemanta @andraz
  • 2. Topics - a bit about Zemanta - how advanced “data tools” and spammers interact
  • 3. We are all trying to organize the web
  • 4. Making it right, making it useful and linked
  • 7. Not so long ago, in a city not far away...
  • 9. are trying to do the opposite
  • 10. trying to disorganize it, make it confusing, and to profit from that
  • 12. using the tools we have built!
  • 14. Their motives are not sinister (mostly)
  • 15. it is about profit
  • 16. Profit - publish as much content as possible - quality is not (that) important - get traffic or high page ranking for certain terms - sell clicks, links or whole “fully built” sites to the highest bidder - users and search engines are a necessary evil, to be tricked as cheaply as possible
  • 18. So, why do I care?
  • 19. Job opening You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ... You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
  • 20. And why might you care? - organized information is a great tool for those who try to disorganize it - they are poisoning “our web”, including Twitter and Facebook - and it's hard to see in the fog they are causing - it is just a matter of time before they start poisoning linked data too
  • 22. What do we do at Zemanta
  • 23. - is a “personal writing assistant” - suggesting content while you write (your blog) - analyzing your text - connecting it with background knowledge, other stories on the web, images - you choose what suggestions to include - to make your writing more informative, vivid and useful
  • 32. How it works Content suggestions Plain text Semantic Analysis (article) search Linked data RSS feeds
  • 33. Main design goals - Input is a meaningful chunk of text (not a keyword or a phrase) - Input is (semi) English language - Has to work across all domains in the open world - music, celebrities, finance, entertainment, politics, gardening, parenting, …
  • 34. Analysis pipeline Known phrases extraction (Aho-Corasick) Named Entity Extraction Triple store Surface form features evaluation Statistical comparison to background knowledge Semantic coherence and hand-tuned heuristics etc. Disambiguated entities
  • 35. Analysis pipeline Known phrases extraction (Aho-Corasick) Named Entity Extraction Categorization to Dmoz Triple store Surface form features evaluation Statistical comparison to background knowledge Semantic coherence and hand-tuned heuristics etc. Categories Ambiguous named entities Disambiguated entities
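
The known-phrase step names Aho-Corasick, which finds every occurrence of many dictionary phrases in one pass over the text. The following is a minimal character-level sketch of that technique, not Zemanta's implementation; the function names and the sample phrases are illustrative assumptions.

```python
from collections import deque

def build_automaton(phrases):
    """Build a trie with failure links (Aho-Corasick) over the given phrases."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link per node
    out = [[]]    # phrases that end at each node
    for phrase in phrases:
        node = 0
        for ch in phrase:
            nxt = goto[node].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[node][ch] = nxt
            node = nxt
        out[node].append(phrase)
    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes inherit.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A match at the failure target is also a match here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_phrases(text, automaton):
    """Return (start_index, phrase) for every dictionary phrase found in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches

# Hypothetical alias dictionary; a real one would hold millions of aliases.
automaton = build_automaton(["health insurance", "insurance", "Barack Obama"])
matches = find_phrases("Barack Obama on health insurance", automaton)
```

Overlapping matches such as "health insurance" and "insurance" are both reported, which is what leaves ambiguous candidates for the later disambiguation stages.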
  • 36. Background knowledge - Data from Wikipedia, MusicBrainz, Freebase… and world wild web - Includes linguistic and semantic properties + unstructured data - Present in two forms: - in “original” custom built triple store on top of MySQL (150 GB) - processed into 7 GB optimized “memory mapped dump”
  • 37. Background knowledge - 7M mined and linked up entities and concepts Triple store - 30M aliases - Refreshed about once a month - want to make it real-time - Input data quality is really important etc.
  • 38. After analysis Text SOLR Related articles articles SOLR Images images
  • 40. boost((( wiki_entities:Health insurance wiki_entities:Medical underwriting wiki_entities:United States wiki_entities:Affordable Care Act wiki_entities:Barack Obama wiki_entities:Lifetime (TV network) wiki_entities:Insurance wiki_entities:Preventive medicine wiki_entities:Child wiki_entities:Patient Protection and Affordable Care Act ) ^3.0) (text:zemhealthinsurq^0.68 text:health^0.62 text:premium^0.36 text:zeminsurcompaniq^0.56 text:increas^0.29 text:rate^0.27 text:zemhealthinsurcompaniq^0.35 text:zempreventcareq^0.26 text:medic^0.26 text:compani^0.23 text:obamacar^0.21 text:todai^0.21 text:polici^0.21 text:care^0.19 ) ^105.0 ((dmoz_categories: Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health dmoz_categories: Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States dmoz_categories: Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States /California) ^0.1), (1 - 0.2) * sqrt(1.0/(1.15E-8*float(1285185600000 - date(published_datetime) ms)+1.0)) + 0.2)
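
The last argument of the query on slide 40 is a recency factor: it equals 1.0 for an article published at the reference timestamp and decays toward a floor of 0.2 as the article ages. A minimal sketch of that arithmetic follows, using the constants from the slide. The function and parameter names are assumptions, and the sketch reads the outer boost(...) as multiplying the relevance score by this factor.

```python
import math

def recency_factor(age_ms, rate=1.15e-8, floor=0.2):
    """Recency multiplier from slide 40.

    age_ms is the reference timestamp minus the publication time, in
    milliseconds. The result is 1.0 at age 0 and approaches `floor`
    for very old articles.
    """
    return (1.0 - floor) * math.sqrt(1.0 / (rate * age_ms + 1.0)) + floor

ONE_DAY_MS = 24 * 60 * 60 * 1000

fresh = recency_factor(0)
day_old = recency_factor(ONE_DAY_MS)
year_old = recency_factor(365 * ONE_DAY_MS)
```

With these constants a one-day-old article keeps roughly three quarters of its score, and the floor ensures that old but highly relevant articles are never suppressed entirely.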
  • 41. Solr - We adapted Solr for “query by document” - 52% precision (at 10) on internal evaluations - plain Lucene MLT comes to 44% - difference is from “bag of terms” approach over “bag of words” (terms coming from analysis step) - Our live index is 5M articles - Solr is really not optimized to handle 50 terms in a single query
  • 42. Lucene plain “More Like This”
  • 43. Metrics & tests - Every part of the system is being constantly evaluated - Precision/recall at 5 different points in the system - Mostly bi-weekly releases of new datasets and the engine
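
The "precision (at 10)" figures on slide 41 measure the share of the top ten suggestions that were judged relevant. A minimal sketch of that metric, assuming relevance judgments are available as a set of item IDs; the names and sample data are hypothetical and not taken from Zemanta's evaluation setup.

```python
def precision_at_k(suggested, relevant, k=10):
    """Fraction of the top-k suggestions that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = suggested[:k]
    hits = sum(1 for item in top if item in relevant)
    # Dividing by k (not len(top)) penalizes returning fewer than k results.
    return hits / k

# Hypothetical judgments for one query document.
suggested = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}
score = precision_at_k(suggested, relevant, k=4)
```

Averaging this score over a fixed set of evaluation documents gives a single number that can be compared between releases.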
  • 44. Overview - We do pretty deep processing to deliver a simple user experience of “personal authoring assistant” - And everything is available over the web API - tagging - named entity recognition and disambiguation to Linked Open Data URIs
  • 45. What does the API offer? • Tags Most used • Categories • Concepts and entities Most interesting • Related articles • Related images
  • 49. We are just one of the many people offering services based on large amounts of web data each spending man-years trying to organize their data, trying to offer best possible service
  • 50. now back to the bad guys
  • 52. Job opening You will get a spreadsheet with 180 blog url’s and logins. You will log into each blog and schedule 2 posts per week ... You will spice up every post with images and/or related links within the content, using a Wordpress plugin called Zemanta https://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
  • 53. There's more than meets the eye
  • 54. Gather search terms (extensions, logs, guess) → Analyze what people search for → Find / create such content → Cover your tracks → Use Zemanta or OpenCalais to add tags, images, links → Pull additional content from Freebase → Publish → Use Zemanta to find similar blogs → Amazon Mechanical Turk to post comments and links back to your site → Profit?
  • 55. Warnings - I've seen no single system using the whole pipeline as described, however all parts were found in the wild - Examples used are from all kinds of sites – good, bad and ugly - I am not trying to imply that all of the steps in the diagram are bad, but they can be used by bad guys efficiently
  • 57. Finding their keywords, niches - Domain expertise - Users like to install extensions and say “yes” - You observe referrers on sites you control - You buy the data on the black market
  • 59. The sophisticated part of the market “Demand Media relies on a proprietary algorithm to help editors best determine what subjects their writers should tackle.” Factors: - Keyword competition - Revenue - Driving traffic to/from existing content http://emediavitals.com/article/16/demand-media-s-content-assembly-line
  • 61. Find / create content - Steal - Take from “open article directories” - Have your own “content assembly line” like Demand Media
  • 64. Tһiѕ iѕ nοt the text you аre lookinɡ for.
  • 65. Tһiѕ iѕ nοt the text you аre lookinɡ for.
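
The sentence on slides 64 and 65 looks like plain English but swaps some Latin letters for look-alike characters from other scripts, so an exact-match search for the original text fails. One simple way to flag such text is to look for words that mix scripts. The sketch below assumes the slide uses the code points written out as escapes in the sample, and it uses the first word of each Unicode character name as a rough script label; both are illustrative assumptions.

```python
import unicodedata

def scripts_in_word(word):
    """Return rough script labels (e.g. LATIN, CYRILLIC, GREEK) for the letters in word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
                scripts.add(name.split()[0])
    return scripts

def mixed_script_words(text):
    """Return the whitespace-separated words that contain letters from more than one script."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]

# Assumed code points: U+04BB, U+0455, U+03BF, U+0430, U+0261.
SAMPLE = "T\u04bbi\u0455 i\u0455 n\u03bft the text you \u0430re lookin\u0261 for."
flagged = mixed_script_words(SAMPLE)
```

The check is deliberately coarse: a look-alike that is itself classified as Latin, such as the script g in "lookinɡ", passes unnoticed, so a real detector would also need a table of confusable characters.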
  • 66. Translate it to random language and back to English Übersetzen sie zufällig Sprache und wieder auf Englisch Language and translate it happen again in English Μεταφράστε αυτό σε δειγματοληπτικούς γλώσσα και πίσω στην αγγλική γλώσσα Translate this random language back to English Traduisez au langage aléatoire et revenir à l'anglais Translate to random language to English and back 它翻译成随机的语言和回英文 Translate it back into the English language and random
  • 67. Covering their tracks - Trying to fool search engines or people? - Search engines are catching up - Google Translate API is being closed due to “abuse”? - The trend is “rewriting” by human editors, procured on the global market
  • 74. Remixing linked data and spam - Currently mostly the good guys are using Linked Data - However, it's just too tempting to be left alone - Fully synthetic articles using factual information from linked data? – Using advanced tools to form proper natural language sentences and maybe even a storyline?
  • 76. Publish - On hosted third party platforms - eating their resources - Platforms have a hard time killing spammers - Smaller ones don't necessarily have the incentive - If they remove a spammer too fast, it is easier for the spammer to probe the limits - Platforms use “kill with delay” - Spam detection is resource intensive
  • 78. Valuable comments As I write this post, Zemanta is showing me 5 different articles that are related to my post. I could visit each one of these sites and reach out to the owner to see if they would be interested in linking to my post, or I could leave a valuable comment on the page and include a link back to my post. http://www.mainelyseo.com/zemanta-review-seo-link-building-with-the-zemanta-plugin/
  • 79. - The guy in the previous slide is honest and well-meaning - But what if you automate that via Amazon Mechanical Turk or oDesk?
  • 81. Profit? - sell ads - sell links - sell “fully developed site” - to the highest bidder
  • 82. Search engines to the rescue? - Mahalo cut 10% of the staff the day after Google announced ranking changes - Demand Media's stock isn't doing that well anymore - However this is a never-ending story, we'll have co-evolution for the foreseeable future
  • 83. Ecosystem - Very sophisticated, large players - moving to more high quality content, video? - Small time operations - using more and more sophisticated tools available on the market cheaply (modern asymmetric warfare?) - Dark industry specifically building tools to poison the web and sell them to small time operators
  • 85. Can we make spammers (and others) work for us, making linked data better? (think reCAPTCHA)
  • 86. Could article directories be fruitfully used? eZineArticles.com, GoArticles.com, etc...
  • 87. Find rewritten articles and use them as parallel corpus?
  • 88. Could we use the global workforce market more efficiently to get more linked data?
  • 89. Thesis, antithesis, synthesis? http://xkcd.com/810/
  • 91. Image sources: http://www.flickr.com/photos/dzingeek/4587871752/, http://www.flickr.com/photos/25101572@N02/4393474025/, http://www.flickr.com/photos/billward/4740384434/, http://www.flickr.com/photos/jurvetson/542500748, http://www.flickr.com/photos/legofenris/4288913574, http://www.flickr.com/photos/ekilby/3733627940, http://www.flickr.com/photos/ekilby/3732799269/, http://www.flickr.com/photos/cipherswarm/38354452, http://xkcd.com/810/