Cross-lingual ontology lexicalisation, translation and information extraction Net2 workshop, University of South Africa (UNISA)
2. Context and Motivation. Monnet use case in the financial domain: query financial information cross-vocabulary and cross-lingually; get results in your own language. Research challenges: localization and translation of vocabularies; cross-lingual ontology-based information extraction.
8. XBRL – Semantic Analysis “Enhance semantics to facilitate translation and information extraction.”
9. XBRL – Terminological Analysis. Example concepts: ifrs:MinimumFinanceLeasePaymentsReceivableAtPresentValue, ifrs:MinimumFinanceLeasePaymentsReceivable. Label: “Minimum finance lease payments receivable, at present value”. Subterms linked to external terminologies: sapTerm:payments, googleDefine:leasePayments, sapTerm:financeLease, googleDefine:Finance_lease. [Diagram: each subterm of the label is classified as domain independent, domain related or domain specific.]
10. XBRL – Linguistic Analysis. XBRL term: minimum finance lease payments receivable. Variants in financial text: “… received minimum finance lease payments …”, “… lease payment …”, “… lease payments …”. [Diagram labels for the variants: verb, adverb, complex, simple, singular, plural.]
11. Outline 1. Research challenge and motivation 2. Ontology Translation 3. Lexicalization (lemon) 4. CLOBIE (CL Ontology-based Inf. Extraction)
12. Translation using STL Models developed in Monnet: English / German / Spanish / Dutch … Net2: Afrikaans? Zulu? Xhosa? … Example terms: ifrs:MinimumFinanceLeasePaymentsPayable, ifrs:ProfitLossBeforeTax, ifrs:Revenue
13. Application in Machine Translation: Dutch. Source term: available-for-sale financial assets. Step 1, term analysis using IFRS, SAPTerm, GoogleDefine: [available-for-sale] [financial] [assets]. Step 2, translate subterms using domain TM (IFRS), Linked Open Data (DBPedia), translation services (GoogleTranslate): [voor verkoop beschikbare] [financiële] [activa]. Step 3, term synthesis using grammars (rules, statistical models): voor verkoop beschikbare financiële activa.
14. Application in Machine Translation: Afrikaans. Source term: available-for-sale financial assets. Step 1, term analysis using IFRS, SAPTerm, GoogleDefine: [available-for-sale] [financial] [assets]. Step 2, translate subterms using domain TM (IFRS), Linked Open Data (DBPedia), translation services (GoogleTranslate): [beskikbaar vir verkoop] [finansiële] [bates]. Step 3, term synthesis using grammars (rules, statistical models): finansiële bates beskikbaar vir verkoop.
15. Application in Machine Translation: Spanish. Source term: available-for-sale financial assets. Step 1, term analysis using IFRS, SAPTerm, GoogleDefine: [available-for-sale] [financial] [assets]. Step 2, translate subterms using domain TM (IFRS), Linked Open Data (DBPedia), translation services (GoogleTranslate): [disponibles para la venta] [financia] [activos]. Step 3, term synthesis using grammars (rules, statistical models): activos financieros disponibles para la venta.
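The three steps on slides 13 to 15 (term analysis, subterm translation, term synthesis) can be sketched in a few lines of Python. This is only an illustration, not the Monnet implementation: the lexicon, the role names (modifier, adjective, head) and the ordering rules are hypothetical stand-ins for the domain TM, Linked Open Data and grammar resources named on the slides, and the analysis step is a toy whitespace split.

```python
# Hypothetical subterm lexicon; on the slides these translations come from a
# domain TM (IFRS), Linked Open Data (DBPedia) or a translation service.
SUBTERM_LEXICON = {
    "nl": {"available-for-sale": "voor verkoop beschikbare",
           "financial": "financiële",
           "assets": "activa"},
    "af": {"available-for-sale": "beskikbaar vir verkoop",
           "financial": "finansiële",
           "assets": "bates"},
}

# Hypothetical synthesis rules: the order of the subterm roles in the target term.
SYNTHESIS_ORDER = {
    "nl": ("modifier", "adjective", "head"),
    "af": ("adjective", "head", "modifier"),
}


def analyse(term):
    """Step 1 (toy version): split a three-part term into labelled subterms."""
    modifier, adjective, head = term.split()
    return {"modifier": modifier, "adjective": adjective, "head": head}


def translate_subterms(subterms, lang):
    """Step 2: look up each subterm in the lexicon of the target language."""
    lexicon = SUBTERM_LEXICON[lang]
    return {role: lexicon[text] for role, text in subterms.items()}


def synthesise(translated, lang):
    """Step 3: reorder the translated subterms according to the target rules."""
    return " ".join(translated[role] for role in SYNTHESIS_ORDER[lang])


def translate_term(term, lang):
    return synthesise(translate_subterms(analyse(term), lang), lang)


for lang in ("nl", "af"):
    print(lang, translate_term("available-for-sale financial assets", lang))
```

Keeping the ordering rules separate from the lexicon mirrors the slides' split between subterm translation and term synthesis: the same subterm translations yield differently ordered terms in Dutch and Afrikaans.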
16. Outline 1. Research challenge and motivation 2. Ontology Translation 3. Lexicalization (lemon) 4. CLOBIE (CL Ontology-based Inf. Extraction)
17. Why do we need a lexicon? “Loads of unlinked domain-specific terminology on the web!” Examples: http://en.wikipedia.org/wiki/Finance_lease, http://www.investopedia.com/terms/l/lease-payments.asp. An interoperable web for … ? Re-use; enabling multilinguality; cross-lingual search; cross-lingual fact extraction.
18. Lexicon standards overview. ISO (XML): TEI (Text Encoding Initiative), LMF (Lexical Markup Framework). W3C & Semantic Web (RDF / OWL): built-in rdfs:label; lightweight linguistic representations (SKOS, SKOS-XL); rich linguistic representations (GOLD, LexInfo).
20. SKOS – Multilingual Information. Not much uptake yet? (Example from http://data.nytimes.com/)
21. Ontology-Text Mismatch ‘Edificio-historico’ vs. ‘…edificio, declarado Monumento Histórico…’ >> goes beyond SKOS (monolingual & multilingual term variants) >> requires representation of lexical information to compute linguistic variants, e.g. ‘edificio historico[apposVP[NP[Adj]]]’
22. A Lexicon Model for Ontologies. Requirements for an ‘ontology-lexicon’ model: (1) Represent linguistic information relative to the ontology: avoid unnecessary ambiguities by representing only lexical features relevant to the semantics of the underlying application. (2) Keep semantics separate from linguistic information: clearly separate ‘world’ knowledge (properties of objects referred to by words) from ‘word’ knowledge (properties of words). (3) Modular, minimal design: provide a simple core model that can easily be extended when needed.
23. Was there a solution already? - SKOS. Simple Knowledge Organization System (SKOS): general model for formalizing thesauri, terminologies and related semantic and knowledge resources. Focus: formalization of terminology (terminology, classification and Semantic Web communities). Does not address linguistic aspects of terminology, and therefore not the lexicon-ontology interface. http://www.w3.org/2004/02/skos/
24. Was there a solution already? - GOLD. General Ontology for Linguistic Description (GOLD): community-based ontology of linguistics. Focus: linguistic study (linguistics community). A formal model of linguistics as an ontology, but not about connecting lexical features to ontological semantics. Other issues: very big; modularity? http://linguistics-ontology.org/gold/2010
25. Was there a solution already? - OWN. OntoWordNet (OWN): formal specification of WordNet through extension and axiomatization of its conceptual relations. Focus: formal knowledge representation (logic, knowledge representation and Semantic Web communities). Turns WordNet into an ontology, but is not about connecting lexical features to ontological semantics. http://wiki.loa-cnr.it/index.php/LoaWiki:OWN
26. Was there a solution already? - LMF. Lexical Markup Framework (LMF): general model for formalizing and sharing machine-readable dictionaries. Focus: lexical knowledge representation (lexicography and NLP communities). Very close to the ontology-lexicon requirements, but with no view on how lexical features link to ontological semantics; semantics is limited to a notion of sense based on synsets. Other issues: incomplete formal model; focus on classes, less on properties/relations. http://www.lexicalmarkupframework.org/
27. lemon. Lexicon model for ontologies (‘lemon’): general model for formalizing lexical features relative to independently defined ontological semantics. http://www.monnet-project.eu/lemon Two-level modelling: abstract level (meta-model): lemon; instantiation level (lexicon model): e.g. ‘LexInfo2’, http://lexinfo.net/
30. lemon: Lexicon. [Diagram: the Lexicon ‘wild animals’ links via ‘entry’ to its LexicalEntries (LE), e.g. LE: Kudu and LE: shaped like a Kudu.] A LexicalEntry can be a Word, a Phrase, or a Part (such as an Affix).
31. lemon: Form. [Diagram: LexicalEntries (LE) of the ‘wild animals’ lexicon link to Forms (F) via canonicalForm, otherForm and abstractForm; example forms: “kudu”, “greater”, “great”.]
32. lemon: Structure. [Diagram: LE: shaped like a Kudu is split into LE: shaped, LE: like, LE: a, LE: Kudu.] A LexicalEntry can be decomposed into one or more Components, and its compositional structure can be represented.
33. lemon: Structure - Example. [Diagram: the LexicalEntry ‘shaped like a kudu’ has a decomposition into four Components, each linked via ‘element’ to its own LexicalEntry: shaped (lemma=“shape”), like (lemma=“like”), a, Kudu. A tree of nodes connected by ‘edge’ links, with the Components as ‘leaf’ values, carries the constituent labels VP, VBN (shaped), PP, IN (like), NP, DT (a), NNP (Kudu). Further link label: lexeme.]
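The decomposition on slide 33 can be illustrated with a small data structure. This is a minimal sketch in plain Python rather than the lemon RDF vocabulary: the class and field names follow the labels on the slide (Component, element, node, edge, leaf, constituent), while the exact tree shape is an assumption inferred from the constituent labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalEntry:
    written_form: str
    lemma: Optional[str] = None


@dataclass
class Component:
    element: LexicalEntry  # the 'element' link from the slide


@dataclass
class Node:
    constituent: str  # e.g. "VP", "NNP"
    edges: List["Node"] = field(default_factory=list)  # child nodes
    leaf: Optional[Component] = None  # set only on terminal nodes


def leaves(node):
    """Collect the written forms of all leaf components, left to right."""
    if node.leaf is not None:
        return [node.leaf.element.written_form]
    words = []
    for child in node.edges:
        words.extend(leaves(child))
    return words


shaped = Component(LexicalEntry("shaped", lemma="shape"))
like = Component(LexicalEntry("like", lemma="like"))
a = Component(LexicalEntry("a"))
kudu = Component(LexicalEntry("Kudu"))

# Assumed tree: VP[VBN shaped, PP[IN like, NP[DT a, NNP Kudu]]]
tree = Node("VP", edges=[
    Node("VBN", leaf=shaped),
    Node("PP", edges=[
        Node("IN", leaf=like),
        Node("NP", edges=[Node("DT", leaf=a), Node("NNP", leaf=kudu)]),
    ]),
])

print(" ".join(leaves(tree)))
```

Because every leaf points to a Component and every Component to a LexicalEntry, the phrase entry can reuse the entries (and lemmas) of its parts instead of duplicating them.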
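34. lemon: Meaning & Reference. [Diagram: LE: kudu links via ‘sense’ to a LexicalSense (LS), which links via ‘reference’ to an ontology entity. Layer labels: lexeme, sememe, reference.]
35. lemon: Meaning & Reference. [Diagram: LE: kudu and LE: greater kudu each have a sense (LS) with a reference; the two senses are connected by ‘narrower’. Further label: preSem.]
36. lemon: Meaning & Reference. Lexical incompatibility. [Diagram: the senses (LS) of LE: greater kudu and LE: lesser kudu are marked ‘incompatible’; both have the reference dbpedia:Kudu.]
37. lemon: Meaning & Reference. Ontological incompatibility. [Diagram: LE: kudu and LE: goat each have a sense (LS) with a reference; the two references are connected by owl:disjointWith.]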
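The difference between lexical and ontological incompatibility (slides 36 and 37) can be sketched as follows. The classes are a plain-Python approximation of LexicalEntry and LexicalSense, not the lemon RDF model; the identifier dbpedia:Goat and the DISJOINT set are hypothetical, since the slides do not name the references of ‘kudu’ and ‘goat’.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare senses by identity, not by field values
class LexicalSense:
    reference: str  # identifier of the ontology entity
    incompatible: List["LexicalSense"] = field(default_factory=list)


@dataclass(eq=False)
class LexicalEntry:
    written_form: str
    senses: List[LexicalSense] = field(default_factory=list)


def lexically_incompatible(a, b):
    """True if some sense of a is marked incompatible with some sense of b."""
    return any(sb in sa.incompatible or sa in sb.incompatible
               for sa in a.senses for sb in b.senses)


def ontologically_incompatible(a, b, disjoint):
    """True if the references of some pair of senses are declared disjoint."""
    return any(frozenset((sa.reference, sb.reference)) in disjoint
               for sa in a.senses for sb in b.senses)


# Slide 36: two senses, same reference, incompatible at the lexical level.
greater_sense = LexicalSense("dbpedia:Kudu")
lesser_sense = LexicalSense("dbpedia:Kudu")
greater_sense.incompatible.append(lesser_sense)
greater = LexicalEntry("greater kudu", [greater_sense])
lesser = LexicalEntry("lesser kudu", [lesser_sense])

# Slide 37: references declared disjoint in the ontology (hypothetical IDs).
kudu = LexicalEntry("kudu", [LexicalSense("dbpedia:Kudu")])
goat = LexicalEntry("goat", [LexicalSense("dbpedia:Goat")])
DISJOINT = {frozenset(("dbpedia:Kudu", "dbpedia:Goat"))}

print(lexically_incompatible(greater, lesser))            # lexicon-level check
print(ontologically_incompatible(kudu, goat, DISJOINT))   # ontology-level check
```

The first check only inspects the lexicon, the second only the ontology, which reflects the requirement to keep ‘word’ knowledge separate from ‘world’ knowledge.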
38. lemon: Lexical Projection. A LexicalEntry can introduce a syntactic frame whose arguments are mapped to a LexicalSense and, indirectly, to ontological semantic objects/properties.
44. Outline 1. Research challenge and motivation 2. Ontology Translation & Inform. Extraction 3. Lexicalization (lemon) 4. CLOBIE (Cross-lingual Ontology-based Information Extraction)
45. What is CLOBIE? Information Extraction: monolingual, no semantics. Cross-lingual Information Extraction: multilingual. Ontology-based Information Extraction: semantics in the background.
46. What is CLOBIE? Example sentence: “SAP sold risk securities at a value of 12b EUR.” Information extraction (monolingual): PATTERN: .*SAP.*(sells|sold|issues).*risk securities.*[0-9]+b (EUR|USD).* Information extraction (multilingual): PATTERN_DE: .*SAP.*verkaufte.*RisikoWertpapiere.*[0-9]+b (EUR|USD).* Information extraction with semantics: .*[COMPANY] sell [ASSETS] .* PATTERN: .*$COMPANY.*(sells|sold|issues).*$ASSETS.*$MONETARY_VALUE.* Background ontology: financial assets (e.g. risk securities); non-financial assets (e.g. Property, Plant & Equipment).
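The step from a hard-coded pattern to a pattern with ontology-backed slots (slide 46) can be sketched with Python's `re` module. This is an illustration under assumptions, not the CLOBIE system: the small dictionaries stand in for the ontology and lexicon, and the names `ASSET_CLASSES`, `build_pattern` and `extract` are invented for the example.

```python
import re

# Hypothetical background knowledge standing in for the ontology on the slide.
ASSET_CLASSES = {
    "financial assets": ["risk securities"],
    "non-financial assets": ["property, plant & equipment"],
}
COMPANIES = ["SAP"]
VERBS = ["sells", "sold", "issues"]


def alternation(options):
    """Build a non-capturing alternation, longest option first."""
    ordered = sorted(options, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(o) for o in ordered) + ")"


def build_pattern(companies, verbs, assets):
    """Fill the $COMPANY / $ASSETS / $MONETARY_VALUE slots from the lists."""
    return re.compile(
        r"(?P<company>" + alternation(companies) + r")\b.*?\b"
        + alternation(verbs) + r"\b.*?"
        + r"(?P<asset>" + alternation(assets) + r").*?"
        + r"(?P<amount>[0-9]+)b (?P<currency>EUR|USD)",
        re.IGNORECASE,
    )


def extract(sentence):
    """Return the extracted fact with its ontology class, or None."""
    assets = [a for members in ASSET_CLASSES.values() for a in members]
    match = build_pattern(COMPANIES, VERBS, assets).search(sentence)
    if match is None:
        return None
    fact = match.groupdict()
    # Lift the matched asset string to its class in the background ontology.
    fact["asset_class"] = next(
        cls for cls, members in ASSET_CLASSES.items()
        if fact["asset"].lower() in members
    )
    return fact


print(extract("SAP sold risk securities at a value of 12b EUR."))
```

Note that alternatives are written as groups, `(sells|sold|issues)`; in standard regular-expression syntax square brackets would define a character class, not a choice between words.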
47. Application in Information Extraction (IE). Semantically lifted XBRL concept: :MinimumFinanceLeasePaymentsReceivable rdfs:subClassOf xbrli:monetaryItemType ; rdfs:label “Minimum finance lease payments receivable”@en . Term analysis and linguistic analysis of the label “Minimum finance lease payments receivable” give variants such as “receivables” and “payments received”. Matches in text: Tesco’s Annual Report 2009: “…The fair value of the Group’s finance lease receivables at 23 February 2008 was £5m…”; SAP Annual Report 2008: “..As at December 31, 2008, the future minimum lease payments expected to be received was €16 million…”
48. CLOBIE: Interdisciplinary. Machine Translation: statistical MT, rule-based MT, localization. Information Extraction: term extraction, relation extraction, extraction grammars. NLP: corpus query, term analysis, POS tagging, morphological analysis. Information Retrieval: TF-IDF, web query, ranking algorithms, CLIR (ESA, MT-based). Semantic Web: ontologies, SKOS, lemon, SPARQL queries.
49. Why CLOBIE? Many unstructured resources (news, financial reports). Knowledge in the Semantic Web is often not dynamic (no regular, only manual updates), and knowledge across languages/countries is not integrated.
53. CLOBIE Data set (Wind Energy). 10 companies in the wind energy domain. Financial reports in German / Spanish / English / Dutch (IFRS / DE-GAAP). Semantics defined by the IFRS vocabulary and the xEBR vocabulary.
54. Next steps… Benchmark development and evaluation on the basis of a data set in the finance domain: financial reports and news from different companies in the wind energy domain; multilingual (German, Dutch, Spanish, English); multi-vocabulary (IFRS, European local GAAPs, DBPedia). Cross-lingual ontology-based information retrieval system. Generate ontology-based information extraction grammars from lemon ontology-lexicons.
Editor's notes
Frame: VerbNet, … LinguisticOntology: GOLD, LexInfo2. Form: SKOS. LexicalSense-Ontology: SKOS-XL. Node/Edge: parse structures; rare formats such as NEGRA Corpus / TIGER tag set by IMS Stuttgart, or StanfordParser (proprietary).
Also phrasal lexicon
Lemon distinguishes among different types of lexical forms
LexicalSense: underspecified sense that points to a language-external reference. Reference: unique ontological semantic object (depending on conditions and context). A LexicalSense can have subsense and senseRelation links to other LexicalSenses. Sememe: the relation between a LexicalSense and an ontological semantic object can be either pref / alt / hiddenSem.