

Statistical entity extraction from web

  2. The need for collecting and understanding Web information about a real-world entity (such as a person or a product) is currently fulfilled manually through search engines. Information about a single entity might appear in thousands of Web pages. Even if a search engine could find all the relevant Web pages about an entity, the user would need to sift through all these pages to get a complete view of the entity.
  3. EntityCube: An automatically generated entity relationship graph based on knowledge extracted from billions of web pages.
  4. Entity Retrieval: Entity search engines can return a ranked list of entities most relevant for a user query. Entity Relationship/Fact Mining and Navigation: Entity search engines enable users to explore highly relevant information during searches to discover interesting relationships/facts about the entities associated with their queries. Prominence Ranking: Entity search engines detect the popularity of an entity and enable users to browse entities in different categories ranked by their prominence during a given time period.
  5. A. Web Entities: We define the concept of Web Entity as the principal data units about which Web information is to be collected, indexed and ranked. Web entities are usually recognizable concepts such as people, organizations, locations, products, papers, conferences, or journals, which have relevance to the application domain. Different types of entities are used to represent the information for different concepts.
  6. B. Entity Search Engine: First, a crawler fetches web data related to the targeted entities, and the crawled data is classified into different entity types. The information is then put into the web entity store, and entity search engines can be constructed based on the structured information in the entity store.
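The flow on this slide (crawl, classify by entity type, store, search) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the keyword rules, the function names `classify`, `build_entity_store` and `search`, and the example pages are all hypothetical and stand in for the trained classifiers and large-scale store that a real system would use.

```python
from collections import defaultdict

# Hypothetical keyword rules standing in for a trained entity-type classifier.
TYPE_KEYWORDS = {
    "person": ("professor", "born", "biography"),
    "product": ("price", "model", "warranty"),
}

def classify(text):
    """Return the first entity type whose keywords appear in the text, or None."""
    lowered = text.lower()
    for entity_type, keywords in TYPE_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return entity_type
    return None

def build_entity_store(crawled_pages):
    """Group crawled (url, text) pairs by predicted entity type."""
    store = defaultdict(list)
    for url, text in crawled_pages:
        entity_type = classify(text)
        if entity_type is not None:
            store[entity_type].append({"url": url, "text": text})
    return store

def search(store, entity_type, query):
    """Return stored records of the given type whose text contains the query."""
    q = query.lower()
    return [r for r in store.get(entity_type, []) if q in r["text"].lower()]

# Toy crawl output (made-up pages).
pages = [
    ("http://example.org/a", "Jane Doe is a professor of computer science."),
    ("http://example.org/b", "Camera X100: price and warranty details."),
    ("http://example.org/c", "Weather report for Tuesday."),
]
store = build_entity_store(pages)
results = search(store, "person", "jane doe")
```

Pages that match no entity type are simply dropped here; the search step only ever touches the structured store, not the raw crawl.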
  7. Visual Layout Features: Web pages usually contain many explicit or implicit visual separators such as lines, blank areas, images, font size and color, and element size and position.
  8. Text Features: Text content is the most natural feature to use for entity extraction. In web pages, there are a lot of HTML elements which only contain very short text fragments. We do not further segment these short text fragments into individual words. Instead, we consider them as the atomic labeling units for web entity extraction. For long text sentences/paragraphs within web pages, however, we further segment them into text fragments using algorithms like Semi-CRF.
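The distinction between short fragments (kept whole) and long text (segmented further) can be illustrated as follows. This is a rough sketch, not the Semi-CRF segmentation the slide refers to: a punctuation-based splitter is swapped in as a stand-in, and the word-count threshold `MAX_ATOMIC_WORDS` and the function name `labeling_units` are assumptions made for the example.

```python
import re

MAX_ATOMIC_WORDS = 5  # hypothetical threshold for a "very short" fragment

def labeling_units(element_texts):
    """Turn the text of HTML elements into labeling units."""
    units = []
    for text in element_texts:
        text = text.strip()
        if not text:
            continue
        if len(text.split()) <= MAX_ATOMIC_WORDS:
            units.append(text)  # short fragment kept whole as one atomic unit
        else:
            # Stand-in for a learned segmenter such as Semi-CRF:
            # split long text after commas, semicolons, and periods.
            parts = re.split(r"(?<=[.;,])\s+", text)
            units.extend(p.strip(" .;,") for p in parts if p.strip(" .;,"))
    return units

units = labeling_units([
    "Jane Doe",
    "Senior Researcher",
    "Jane Doe joined the lab in 2005, where she works on data mining.",
])
```

The two short element texts come through unchanged, while the long sentence is broken into two fragments at the comma.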
  9. Because the web information about a single entity may be distributed across diverse web sources, the web entity extraction task should integrate all the knowledge pieces extracted from different web pages. The most challenging problem in entity information integration is name disambiguation. To solve the name disambiguation problem together with users, we have introduced a novel entity framework, iKnoweb.
  10. iKnoweb: iKnoweb interactively involves human intelligence for entity knowledge mining problems. It is a crowdsourcing approach which combines the power of knowledge mining algorithms with user contributions. iKnoweb Overview: One important concept we propose in iKnoweb is Maximum Recognition Units (MRU), which serve as atomic units in the interactive name disambiguation process. MRU: It is a group of knowledge pieces which are fully automatically assigned to the same entity identifier with 100% confidence that they refer to the same entity.
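One way to picture MRU detection is as grouping knowledge pieces that can be merged without any doubt, leaving the uncertain merges for users. The sketch below is an assumption-laden illustration, not the iKnoweb algorithm: it treats "same name plus a shared unique key such as an email or homepage" as the hypothetical 100%-confidence rule, and the `detect_mrus` function and the record layout are invented for the example.

```python
def detect_mrus(pieces):
    """Group knowledge pieces that share a name and at least one unique key."""
    parent = list(range(len(pieces)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}  # (name, key) -> index of the first piece carrying it
    for i, piece in enumerate(pieces):
        for key in piece["keys"]:
            token = (piece["name"], key)
            if token in seen:
                parent[find(i)] = find(seen[token])  # certain match: merge
            else:
                seen[token] = i

    groups = {}
    for i in range(len(pieces)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy knowledge pieces (made-up data) that all carry the same ambiguous name.
pieces = [
    {"name": "Wei Wang", "keys": {"wei@example.edu"}},
    {"name": "Wei Wang", "keys": {"wei@example.edu", "example.edu/~wei"}},
    {"name": "Wei Wang", "keys": {"example.edu/~wei"}},
    {"name": "Wei Wang", "keys": {"wwang@example.com"}},
]
mrus = detect_mrus(pieces)
```

The first three pieces are chained together through shared keys and form one unit; the fourth shares only the name, so it stays separate and whether it belongs with the others would be a question for a user.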
  11. Detecting Maximum Recognition Units; Question Generation; MRU and Question Re-Ranking; Network Effects; Interaction Optimization
  12. Solves the name disambiguation problems together with users in both Microsoft Academic Search and EntityCube.
  13. The statistical snowball work, which automatically discovers text patterns from billions of web pages by leveraging the information redundancy property of the Web, is introduced. iKnoweb, an interactive knowledge mining framework, collaborates with end users to connect the knowledge pieces mined from the Web and builds an accurate entity knowledge web.
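The idea of discovering text patterns from redundant mentions can be sketched as a single bootstrapping round: start from a few known pairs, collect the text that connects them, keep contexts that recur, and use those contexts to find new pairs. This is a heavily simplified illustration in the spirit of Snowball-style bootstrapping, not the statistical model described in the talk; the sentences, seed pairs, `min_support` threshold and function names are all made up for the example.

```python
import re
from collections import Counter

def learn_patterns(sentences, seeds, min_support=2):
    """Collect the text between known pairs; keep contexts seen often enough."""
    counts = Counter()
    for s in sentences:
        for a, b in seeds:
            m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), s)
            if m:
                counts[m.group(1)] += 1
    # Redundancy check: a context must be supported by several mentions.
    return {ctx for ctx, n in counts.items() if n >= min_support}

def apply_patterns(sentences, patterns):
    """Extract (left, right) pairs joined by a learned context."""
    name = r"([A-Z][a-z]+)"  # toy matcher: one capitalized word
    found = set()
    for s in sentences:
        for ctx in patterns:
            for m in re.finditer(name + re.escape(ctx) + name, s):
                found.add((m.group(1), m.group(2)))
    return found

# Toy corpus with fictional organizations and places.
sentences = [
    "Acme is headquartered in Springfield.",
    "Globex is headquartered in Shelbyville.",
    "Initech is headquartered in Austin.",
    "Hooli, based in Paloalto, makes phones.",
]
seeds = {("Acme", "Springfield"), ("Globex", "Shelbyville")}
patterns = learn_patterns(sentences, seeds)
pairs = apply_patterns(sentences, patterns) - seeds
```

The context " is headquartered in " is supported by both seeds, so it is kept and then yields one new pair; the "based in" phrasing has no seed support and is never learned in this round.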
  14. Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 85-94, June 2-7, 2000, San Antonio, Texas, United States. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, Vol. 20, No. 3, pp. 273-297, 1995.