Content Management, Metadata and Semantic Web
1. Content Management, Metadata & Semantic Web. Keynote Address, Net.ObjectDAYS 2001, Erfurt, Germany, September 11, 2001. Amit Sheth, CTO/SrVP, Voquette (www.voquette.com) [formerly Founder/CEO, Taalee, www.taalee.com]; Director, Large Scale Distributed Information Systems Lab, University of Georgia (lsdis.cs.uga.edu) [email_address]. Metadata Extraction is a patent-pending technology of Taalee, Inc. Semantic Engine and WorldModel are trademarks of Taalee, Inc.
10. Creating and Serving Metadata to Power the Life-cycle of Content Applications
"A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
"Metadata increases content value in each step of content value chain." - Amit Sheth
Back End
Produce, Aggregate: Where is the content? Whose is it?
Catalog/Index: What is this content about?
Integrate, Syndicate: What other content is it related to?
Personalize: What is the right content for this user?
Interactive Marketing: What is the best way to monetize this interaction?
Broadcast, Wireline, Wireless, Interactive TV
Semantic Metadata
11. A Metadata Classification
Data (Heterogeneous Types/Media)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Content Dependent Metadata (size, max colors, rows, columns...)
Direct Content Based Metadata (inverted lists, document vectors, LSI)
Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Domain Specific Metadata: area, population (Census); land-cover, relief (GIS); metadata concept descriptions from ontologies
Ontologies, Classifications, Domain Models
User
More semantics for relevance, to tackle information overload!
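The classification on this slide can be sketched as a small data structure. This is only an illustration: the names `MetadataKind`, `EXAMPLE_FIELDS`, and `most_semantic` are hypothetical, and treating the classes as a numeric ranking (higher means more semantics) is an assumption based on the slide's "more semantics" direction, not something the talk defines.

```python
from enum import IntEnum


class MetadataKind(IntEnum):
    """Metadata classes from the slide; the numeric order is an assumed ranking."""
    CONTENT_INDEPENDENT = 1   # creation-date, location, type-of-sensor
    CONTENT_DEPENDENT = 2     # size, max colors, rows, columns
    DIRECT_CONTENT_BASED = 3  # inverted lists, document vectors, LSI
    DOMAIN_INDEPENDENT = 4    # structural: DTDs, class hierarchies
    DOMAIN_SPECIFIC = 5       # census population, GIS land-cover, ontology concepts


# Example field names taken from the slide, each tagged with its class.
EXAMPLE_FIELDS = {
    "creation-date": MetadataKind.CONTENT_INDEPENDENT,
    "max colors": MetadataKind.CONTENT_DEPENDENT,
    "document vectors": MetadataKind.DIRECT_CONTENT_BASED,
    "HTML/SGML DTD": MetadataKind.DOMAIN_INDEPENDENT,
    "population": MetadataKind.DOMAIN_SPECIFIC,
}


def most_semantic(field_names):
    """Return the field whose class carries the most semantics under the assumed ranking."""
    return max(field_names, key=lambda name: EXAMPLE_FIELDS[name])
```

Under this ranking, a relevance-oriented application would prefer domain-specific fields such as "population" over physical fields such as "creation-date" when both are available.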
13. Semantics: The Next Step in the Web's Evolution
"The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what the data means to process it . . . . Imagine what computers can understand when there is a vast tangle of interconnected terms and data that can automatically be followed." (Tim Berners-Lee, Weaving the Web, 1999)
A Content Management centric definition of Semantic Web: The concept that Web-accessible content can be organized and utilized semantically, rather than through syntactic and structural methods.
16. Traditional Text Categorization (Statistical/AI Techniques)
Diagram elements: feed, Customer Training Set, Classify, Place in a taxonomy, Routing/Distribution
Customer Article Feed: Article 4715, Classification of Article 4715
Most traditional Content Management products support categorization of unstructured content.
Standard Metadata: Feed Source: iSyndicate; Posted Date: 11/20/2000
17. Voquette/Taalee's Categorization & Automatic Metadata Creation (Knowledge-base & Statistical/AI Techniques)
Diagram elements: feed, Customer Training Set & KB, Taalee Training Set & KB, Classify, Place in a taxonomy, Map to another taxonomy, Metadata Catalog, Routing/Distribution
Article Feed: Article 4715, Classification of Article 4715
Semantic Engine™: Precise Personalization/Syndication/Filtering
Taxonomy: FTE (Company Analysis, Conference Calls, Earnings, Stock Analysis); ENT (Company Analysis, Conference Calls, Earnings, Stock Analysis); NYSE (Member Companies, Market News, IPOs)
Automated Content Enrichment (ACE): Article 4715 Metadata
Standard metadata: Feed Source: iSyndicate; Posted Date: 11/20/2000
Semantic metadata: Company Name: France Telecom, Equant; Ticker Symbol: FTE, ENT; Exchange: NYSE; Topic: Company News
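A minimal sketch of the enrichment idea on this slide, assuming a tiny in-memory knowledge base. The company, ticker, and exchange values come from the slide's Article 4715 metadata; the names `KNOWLEDGE_BASE` and `enrich` and the substring matching are hypothetical simplifications, not Taalee's actual Semantic Engine.

```python
# Hypothetical knowledge base; entries mirror the slide's semantic metadata.
KNOWLEDGE_BASE = {
    "France Telecom": {"ticker": "FTE", "exchange": "NYSE"},
    "Equant": {"ticker": "ENT", "exchange": "NYSE"},
}


def enrich(text, standard_metadata, kb=KNOWLEDGE_BASE):
    """Add semantic metadata for every known company mentioned in the text."""
    # Naive substring matching stands in for real entity extraction.
    found = [name for name in kb if name in text]
    semantic_metadata = {
        "Company Name": found,
        "Ticker Symbol": [kb[name]["ticker"] for name in found],
        "Exchange": sorted({kb[name]["exchange"] for name in found}),
    }
    # Standard (feed-supplied) metadata is kept and extended, not replaced.
    return {**standard_metadata, **semantic_metadata}


standard = {"Feed Source": "iSyndicate", "Posted Date": "11/20/2000"}
# Placeholder article body; the real text of Article 4715 is not in the slides.
article = "Placeholder article mentioning France Telecom and Equant."
enriched = enrich(article, standard)
```

The point of the sketch is that ticker and exchange never appear in the article text; they are looked up from the knowledge base once the company names are recognized.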
19. Multiple competing standards! Multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain
Kansas State FGDC Metadata Model
Theme keywords: digital line graph, hydrography, transportation...
Title: Dakota Aquifer
Online linkage: http://gisdasc.kgs.ukans.edu/dasc/
Direct Spatial Reference Method: Vector
Horizontal Coordinate System Definition: Universal Transverse Mercator
...
UDK Metadata Model
Search terms: digital line graph, hydrography, transportation...
Topic: Dakota Aquifer
Adress Id: http://gisdasc.kgs.ukans.edu/dasc/
Measuring Techniques: Vector
Co-ordinate System: Universal Transverse Mercator
...
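The mismatch between the two models can be illustrated with a tag-mapping sketch. The FGDC and UDK tag names and values are the ones shown on the slide; the common field names (`keywords`, `title`, and so on) and the `normalize` helper are hypothetical choices made for this example.

```python
# Map each model's tag names onto one assumed common vocabulary.
FGDC_TO_COMMON = {
    "Theme keywords": "keywords",
    "Title": "title",
    "Online linkage": "url",
    "Direct Spatial Reference Method": "spatial_method",
    "Horizontal Coordinate System Definition": "coordinate_system",
}
UDK_TO_COMMON = {
    "Search terms": "keywords",
    "Topic": "title",
    "Adress Id": "url",  # tag spelled as on the slide
    "Measuring Techniques": "spatial_method",
    "Co-ordinate System": "coordinate_system",
}


def normalize(record, mapping):
    """Rename the tags of one record to the common vocabulary, dropping unmapped tags."""
    return {mapping[tag]: value for tag, value in record.items() if tag in mapping}


fgdc_record = {
    "Title": "Dakota Aquifer",
    "Online linkage": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = {
    "Topic": "Dakota Aquifer",
    "Adress Id": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Measuring Techniques": "Vector",
    "Co-ordinate System": "Universal Transverse Mercator",
}
```

After normalization the two records compare equal, which is what lets a single search or integration layer treat them as descriptions of the same dataset.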
25. Metadata Specifications (MetaModels)
Metadata
Domain Independent: Dublin Core, RDF, DAML+OIL
Frameworks/Infrastructures: XCM, XMI
Function Specific: ICE (Syndication)
Domain (Application) Specific: MARC (Library), FGDC and UDK (Geographic), PRISM (Publishing), FXML (Financial Transactions), RIXML (Buy-Sell Research/Financial Services), IMS Learning Resource (Distance Learning), ...
Media Specific: MPEGx, VoiceXML; NewsML (News exchange)
29. NewsML (Source: http://www.mediabricks.com) The content provider supplies NewsML packaged media content to the operator. The content can be categorized as current events, finance, sport, etc. (but no standard is specified) and updated hourly. The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers. Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
35. Information Extraction for Metadata Creation
Sources: WWW, Enterprise Repositories, Digital Maps, Nexis, UPI, AP, Feeds/Documents, Digital Audios, Data Stores, Digital Videos, Digital Images, ...
METADATA EXTRACTORS
Key challenge: create/extract as much semantic metadata as possible automatically
36. Extracting a Text Document: Syntactic approach
INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. All crews and overhead will mop up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend, depending on the results of infrared scanning.
LAYOUT Date => day month int ',' int
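The layout rule `Date => day month int ',' int` can be approximated with a regular expression. This is a sketch of the syntactic approach, not the extractor used in the talk; `DATE_RULE` and `extract_date` are hypothetical names, and the rule is read here as weekday name, month name, day of month, comma, four-digit year.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# day month int ',' int  ->  weekday, month name, day of month, ',', year
DAY = r"(?P<day>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = r"(?P<month>" + "|".join(MONTHS) + r")"
DATE_RULE = re.compile(DAY + r"\s+" + MONTH + r"\s+(?P<dom>\d{1,2})\s*,\s*(?P<year>\d{4})")


def extract_date(text):
    """Return the first date matching the layout rule, or None if there is none."""
    match = DATE_RULE.search(text)
    if match is None:
        return None
    month_number = MONTHS.index(match["month"]) + 1
    return date(int(match["year"]), month_number, int(match["dom"]))


header = "INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT"
report_date = extract_date(header)  # date(1997, 8, 1)
```

A purely syntactic rule like this yields content-independent metadata (the report date) but says nothing about which fires, districts, or agencies the report is about; that gap is what the semantic extraction slides address.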
37. Extraction Agent Web Page Enhanced Metadata Asset Taalee Extraction and Knowledgebase Enhancement
38. Automatic Categorization & Metadata Tagging (unstructured text/transcript of A/V)
Video Segment with Associated Text Segment Description, Semantic Metadata, Auto Categorization
ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN.
39. Video with Editorialized Text on the Web Automatic Categorization & Metadata Tagging (Web page) Auto Categorization Semantic Metadata
40. Automatic Categorization & Metadata Tagging (Feed) Text From Bloomberg Auto Categorization Semantic Metadata
41. Crawler provided text for indexing vs Agent provided semantic metadata
Rich Media Reference Page (http://www.nfl.com): Baltimore 31, Pit 24. Quandry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens' lead to 31-24 in the third quarter.
Taalee Metadata on Football Assets
League: Professional
Teams: Ravens, Steelers
Score: Bal 31, Pit 24
Players: Quandry Ismail, Tony Banks
Event: Touchdown
Produced by: NFL.com
Posted date: 2/02/2000
Metadata from Typical Cataloging of Football Assets (Virage Search on football touchdown)
Jimmy Smith Interview Part Seven. Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline...
Brian Griese Interview Part Four. Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline...
42. One Approach to Extending Traditional CM: Voquette's Semantic Engine Technology
Content sources: Feeds (proprietary formats, standards-based, NewsML), Corporate Repositories, Web Sites
Aggregation & Metadata Extraction: Information Extraction Agents (Push, Pull)
Knowledge Management (Knowledge Base, Domain Model, Metadata): Dynamic KB, Custom WorldModel, Relevant Metadata Enhancement
Traditional Content Management: Content, Metadata
Front End: Portal, Voquette Semantic Applications (Search, Personalization, Alerts, Notifications, Custom "research" applications)
44. Semantic Content (End-User)
Content which does contain the words the user asked for
+ Value-added Metadata (Extractor Agents): Content which does not contain the words the user asked for, but is about what he asked for.
+ Semantic Associations: Content the user did not think to ask for, but which he needs to know.
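The three kinds of content on this slide can be sketched as three result tiers. Everything here is illustrative: the function `tiered_search`, the document shape, and the sample texts are hypothetical, and the Cisco/Lucent relationship is borrowed from the architecture example later in the talk.

```python
def tiered_search(query, docs, associations):
    """Split matching documents into keyword, semantic, and associated tiers."""
    tiers = {"keyword": [], "semantic": [], "associated": []}
    related = associations.get(query, set())
    for doc in docs:
        entities = set(doc["metadata"].get("companies", []))
        if query.lower() in doc["text"].lower():
            # Tier 1: the text contains the words the user asked for.
            tiers["keyword"].append(doc["id"])
        elif query in entities:
            # Tier 2: the words are absent, but value-added metadata says it is about the query.
            tiers["semantic"].append(doc["id"])
        elif related & entities:
            # Tier 3: not asked for, but linked to the query by a semantic association.
            tiers["associated"].append(doc["id"])
    return tiers


# Placeholder documents; texts are invented for the example.
docs = [
    {"id": 1, "text": "Cisco reports earnings", "metadata": {"companies": ["Cisco"]}},
    {"id": 2, "text": "CSCO analyst call", "metadata": {"companies": ["Cisco"]}},
    {"id": 3, "text": "Lucent announces a product", "metadata": {"companies": ["Lucent"]}},
    {"id": 4, "text": "Weather update", "metadata": {}},
]
results = tiered_search("Cisco", docs, {"Cisco": {"Lucent"}})
```

A keyword-only engine would return just the first tier; the second and third tiers exist only because extracted metadata and a knowledge base are available.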
46. Taalee’s Semantic Search Highly customizable, precise and freshest A/V search Context and Domain Specific Attributes Uniform Metadata for Content from Multiple Sources, Can be sorted by any field Delightful, relevant information, exceptional targeting opportunity
47. Creating a Web of related information What can a context do?
48. Example (test on http://directory.mediaanywhere.com ) Search for company ‘Commerce One’ Links to news on companies that compete against Commerce One Links to news on companies Commerce One competes against (To view news on Ariba, click on the link for Ariba) Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically
49. What else can a context do? (a commercial perspective) Semantic Enrichment Semantic Targeting
50. Semantic/Interactive Targeting Precisely targeted through the use of Structured Metadata and integration from multiple sources Buy Al Pacino Videos Buy Russell Crowe Videos Buy Christopher Plummer Videos Buy Diane Venora Videos Buy Philip Baker Hall Videos Buy The Insider Video
51. Example 1: Snapshots ("Jamal Anderson") Search for 'Jamal Anderson' in 'Football'. Click on the first result for Jamal Anderson. View metadata: note that Team name and League name are also included in the metadata. View the original source HTML page: verify that the source page contains no mention of Team name and League name. They were Taalee's value-additions to the metadata to facilitate easier search.
52. Example 2: Snapshots ("Gary Sheffield") Search for 'Gary Sheffield' in 'Baseball'. Click on the first result for Gary Sheffield. View metadata: note that Team name and League name are also included in the metadata. View the original source HTML page: verify that the source page contains no mention of Team name and League name. They were Taalee's value-additions to the metadata to facilitate easier search.
53. Semantic Web – Intelligent Content (supported by Taalee Semantic Engine) Related Stock News Industry News Technology Products COMPANY EPA Regulations Competition COMPANIES in Same or Related INDUSTRY COMPANIES in INDUSTRY with Competing PRODUCTS Impacting INDUSTRY or Filed By COMPANY Important to INDUSTRY or COMPANY SEC Intelligent Content = What You Asked for + What you need to know!
54. Semantic Application: Equity Dashboard. Focused relevant content organized by topic (semantic categorization). Automatic Content Aggregation from multiple content providers and feeds. Related news not specifically asked for (Semantic Associations). Competitive research inferred automatically. Automatic third-party content integration.
55. Metadata centric Content Management Architecture
Sources: Internal Source 1, Research, Internal Source 2, External feeds/Web (e.g. Reuters)
Components: Extractor Agent 1, Extractor Agent 2, Extractor Agent 3, Voquette Metabase, World Model, Semantic Engine, Third-party Content Mgmt and Syndication, Semantic Application (ASP/Enterprise hosted)
Output: XCM-compliant metadata, XML or other format
Stories: Story on Cisco, Story on Lucent
(1) Cisco story from Source 1 passed on to add semantic associations
(2) Consults Knowledge Base for Cisco's competition
(3) Returns result: Lucent is a competitor of Cisco
(4) Lucent story from external feeds picked for publishing as "semantically related" to Cisco story, passed on to Dashboard
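The four numbered steps on this slide can be sketched as one function. This is a simplified illustration under stated assumptions: `COMPETITORS` stands in for the knowledge base, `add_semantic_associations` for the Semantic Engine, and the story dictionaries for metadata records; none of these names come from Voquette's product.

```python
# Hypothetical knowledge base holding the one fact used on the slide.
COMPETITORS = {"Cisco": {"Lucent"}}


def add_semantic_associations(story, external_stories, kb=COMPETITORS):
    """Return external stories that are semantically related to the given story."""
    # Steps 2 and 3: consult the knowledge base for each company's competition.
    rivals = set()
    for company in story["companies"]:
        rivals |= kb.get(company, set())
    # Step 4: pick external stories about those competitors for the dashboard.
    return [s for s in external_stories if rivals & set(s["companies"])]


# Step 1: a Cisco story arrives from an internal source.
cisco_story = {"title": "Story on Cisco", "companies": ["Cisco"]}
external_feed = [
    {"title": "Story on Lucent", "companies": ["Lucent"]},
    {"title": "Unrelated story", "companies": ["IBM"]},
]
related = add_semantic_associations(cisco_story, external_feed)
```

The Lucent story is selected even though it never mentions Cisco, because the association is made through metadata and the knowledge base rather than through shared words.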
56. Wireless Application of Semantic Metadata and Automatic Content Enrichment
Clicking on the link for Cisco Analyst Calls displays a listing sorted by date. Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company. The icons denote additional metadata, such as "Strong Buy" by H&Q Analyst.
Menu: MyStocks, News, Sports, Music, MyMedia
My Stocks: CSCO, NT, IBM, Market
CSCO: Analyst Call, Conf Call, Earnings
CSCO Analysis: 11/08 ON24 Payne; 11/07 ON24 H&Q; 11/06 CBS Langlesis
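Semantic filtering for a small screen can be sketched as choosing a short field list per content type. The field choice for Analyst Call (date, source, analyst) follows the slide's listing; the names `DISPLAY_FIELDS` and `render_row`, the fallback field list, and the character limit are assumptions made for this example.

```python
# Which metadata fields to show for each content type on a small screen.
DISPLAY_FIELDS = {
    "Analyst Call": ["date", "source", "analyst"],
}


def render_row(item, content_type, max_chars=24):
    """Build one listing row from only the fields that suit this content type."""
    fields = DISPLAY_FIELDS.get(content_type, ["date", "source"])
    row = " ".join(str(item[name]) for name in fields if name in item)
    # Truncate to respect the assumed screen width.
    return row[:max_chars]


call = {
    "date": "11/08",
    "source": "ON24",
    "analyst": "Payne",
    "url": "placeholder-not-shown",  # extra metadata that is filtered out
}
row = render_row(call, "Analyst Call")  # "11/08 ON24 Payne"
```

The full metadata record stays available on the server; only the subset that fits the device and the content type is sent to the screen.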
58. Metadata for Automatic Content Enrichment: Interactive Television. This segment has embedded or referenced metadata that is used by a personalization application to show only the stocks that the user is interested in. This screen is customizable with an interactivity feature using metadata such as whether there is a new Conference Call video on CSCO. Part of the screen can be automatically customized to show conference-call-specific information, including transcript, participation, etc., all of which are relevant metadata. The Conference Call itself can have embedded metadata to support personalization and interactivity.
60. Along with the evolution of metadata and semantic technologies enabling the next generation of the Web, Content Management has entered the next generation of Enhanced Content Management.
Editor's notes
Companies in categorization field: Autonomy, Metacode (bought by Interwoven), Semio, Inxight, etc.
Typical strategies employed by competition: Statistical/AI/Parsing/NLP/Rules-based/Collaborative Filtering
Result: Partial success in categorization
Placement of a document in a node is based solely on the above strategies (it has nothing to do with the metadata describing it, which is the basis behind semantics)
Resulting classification: rigid/static/ambiguous/fuzzy
Captures only standard physical metadata (source, date, length, etc.), which is often useless for categorization purposes
Taalee performs categorization by placing importance on the semantic metadata extracted from any document
Strategies employed by Taalee: Knowledge-based/Statistical/Rules-based/AI techniques
Result: Complete success in categorization!
Precise category/categories identified for classifying the document
Resulting classification: flexible/dynamic/unambiguous/crisp
Value-added metadata generated to capture the context/gist of the document
Metadata => Great potential for Automated Content Enrichment (ACE)
Classifying into or mapping to other taxonomies is possible
Promises to greatly enhance the current functioning of Content Manager and Syndication Software/Service
Why? What is its use?