Building a super database from linked data

    Stephen Wang 王傳仁
    me@stephenwang.com
    March 3, 2011
Who is this NOT for?

Who IS this for?

    Building a large database from a tiny team

    Organizing the world's information

    Information innovation
About

    Co-founder, CTO

    Popular movie reviews web site

    Aggregated reviews, comprehensive film database
The Stone Age

    Static HTML templates

    Editors read articles and pull quotations

    Only cover the newest movies

    ~1000 films
Modern Times

    Shift to LAMP

    License long-tail database

    Automated spiders, early UGC via critics

    Use homegrown CMS for additional content

(How I felt maintaining Rotten Tomatoes' overloaded database servers)
The Result

    8 million unique visitors / month

    Lean startup: 25x traffic with 7 staff

    Great site for film lovers (including Steve Jobs)
About

    Co-founder, CTO

    SNS for artists started with Daniel Wu 吴彦祖

    Started with six artists, now 1,600 artists, 600K registered users

    Also powers official web sites:
        李连杰: JetLi.com
        成龙: JackieChan.com
        莫文蔚: KarenMok.com
Our LAMP stack: Not the best setup for...

    Newsfeeds...

    Viral loop analysis...

    Multivariate testing...

The Problem?!?

    Scalability issues with real-time data, but without traffic from public, long-tail content
About


    A better entertainment database

    Providing the long-tail content

    Still a part of alivenotdead.com

    Still in alpha
Features

    Comprehensive info for celebrities, films, music, and TV

    Searchable, structured data

    Multilingual: English, Chinese, Japanese

    Aggregated social media from inside/outside China
Why use mongoDB?

Flexible schema for different data sources




              Dozens of other sources...
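
A minimal sketch of what a flexible schema buys when merging sources, assuming hypothetical record shapes (the source names, field names, and sample values below are illustrative, not taken from the actual feeds). Each source contributes whatever fields it has, and the merged topic is a plain nested dictionary of the kind a document store such as mongoDB can hold without a fixed table layout.

```python
from typing import Any, Dict

def merge_topic(topic: Dict[str, Any], source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Fold one source record into a topic document without a fixed schema."""
    merged = dict(topic)
    sources = dict(merged.get("sources", {}))
    sources[source] = record  # keep the raw record under its source name
    merged["sources"] = sources
    # Promote fields the topic does not have yet; earlier sources win on conflicts.
    for key, value in record.items():
        merged.setdefault(key, value)
    return merged

# Hypothetical records: two sources describing the same film with different fields.
wiki_record = {"name": "Example Film", "year": 2010, "languages": ["en", "zh"]}
review_record = {"name": "Example Film", "critic_score": 87}

topic: Dict[str, Any] = {}
topic = merge_topic(topic, "wiki", wiki_record)
topic = merge_topic(topic, "reviews", review_record)
print(sorted(topic))
```

Adding another source in this sketch means adding another record, not altering a table definition, which is the property the slide points at.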
Why use mongoDB?

Scalable big data

    2 million+ topics covered

    500,000 translations

Next challenge: Aggregating and storing the social media firehose
Why use mongoDB?

Crossing the border...

    Alivenotdead.com in Hong Kong

    alive.tom.com in Tianjin

Use replica sets/eventual consistency to overcome frequent cross-border network issues
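
As a rough illustration of the replica set idea, the sketch below builds a connection URI that names one member per site and lets reads fall back to a secondary. The host names, replica set name, and database name are hypothetical, and `replicaSet` and `readPreference` are options from current MongoDB connection strings; the 2011 deployment may well have been configured differently.

```python
from urllib.parse import urlencode

def replica_set_uri(hosts, replica_set, database, read_preference="secondaryPreferred"):
    """Build a MongoDB connection URI that lists every replica set member."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return "mongodb://{}/{}?{}".format(",".join(hosts), database, options)

# Hypothetical hosts: one member per site, not the real deployment.
HOSTS = ["db-hk.example.com:27017", "db-tj.example.com:27017"]
uri = replica_set_uri(HOSTS, "rs0", "topics")
print(uri)
```

With `secondaryPreferred`, a reader can be served by a secondary that may lag behind the primary, which is the eventual-consistency trade-off the slide accepts in exchange for tolerating an unreliable cross-border link.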
Using Linked Open Data: Freebase

    Wikipedia as structured data

    Creative Commons license

    Multiple CC sources

    Organized taxonomy

    Acquired by Google

    No Chinese/Japanese yet!
Using Linked Open Data: DBpedia

    Wikipedia as structured data

    Creative Commons license

    Only Wikipedia

    Messy taxonomy

    Chinese/Japanese topic translations, but requires English topic link
Using Linked Open Data





    Use Freebase organized taxonomy, broad data

    Expand DBpedia to Chinese-only topics

    Same methodology across Chinese wiki sources
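
The combination above can be sketched as follows, under loud assumptions: the input shapes are hypothetical stand-ins (a list of English titles for the organized taxonomy, and `(zh_title, en_link)` pairs for Chinese wiki pages), and the `zh:` key prefix and the sample page 示例电影 ("example film") are invented for illustration. Pages with an English topic link attach a Chinese label to the existing topic; pages without one become Chinese-only topics instead of being dropped.

```python
def build_topics(english_topics, zh_pages):
    """Attach Chinese labels to English topics; keep Chinese-only pages as new topics.

    english_topics: iterable of English titles (stand-in for the organized taxonomy).
    zh_pages: list of (zh_title, en_link or None) pairs (stand-in for Chinese wiki pages).
    """
    topics = {title: {"labels": {"en": title}} for title in english_topics}
    for zh_title, en_link in zh_pages:
        if en_link in topics:
            # English topic link present: record the translation on that topic.
            topics[en_link]["labels"]["zh"] = zh_title
        else:
            # No usable English link: create a Chinese-only topic.
            topics["zh:" + zh_title] = {"labels": {"zh": zh_title}}
    return topics

topics = build_topics(
    ["Jet Li", "Jackie Chan"],
    [("李连杰", "Jet Li"), ("成龙", "Jackie Chan"), ("示例电影", None)],
)
print(len(topics))
```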
The Future

    Developer API

    Topic extraction

    Real-time trends across languages

    Other verticals

Already 10x more data than Rotten Tomatoes...
The complete sum of information from across the web...
Information not constrained by language...
We're hiring PHP engineers! Send your CV to me@stephenwang.com
My blog: http://stephenwang.com
