Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked Datasets

  1. TR DISCOVER. DeZhao Song, Frank Schilder, Charese Smiley… TR Corporate Research and Development, Eagan, Minnesota. Chris Brew, TR Corporate Research & Development, London. ML Prague, April 23rd 2016. Thomson Reuters © 2014. Confidential. All Rights Reserved. No part of this document may be disclosed, reproduced or used in any form without the prior permission of Thomson Reuters.
  2. Outline • TR Discover: NLP as part of the solution to a business problem. – Problem – Technologies used – Demonstration – Reflections • What is it like to be a scientist working in a business setting?
  3. About me • B.Sc Chemistry, Bristol • Search Examiner, European Patent Office, Berlin, Germany • M.Sc and D.Phil, Sussex, with Steve Isard in EP • Postdoc at Edinburgh, Scotland • Sharp Laboratories of Europe, Oxford • Research (and faculty-ish) positions at Edinburgh • Core faculty in Linguistics and CSE, OSU, Columbus OH, USA • Educational Testing Service, Princeton, NJ, USA • Nuance Communications, Sunnyvale, CA, USA • Thomson Reuters Corporate Research, London, England
  4. Disclaimer: All opinions are my own, and do not reflect official positions of The Thomson Reuters Corporation
  5. Thomson Reuters’ Business • Offer people information that they value enough to pay for. • Professional users • Many products, each catering for its own market segment.
  6. Thomson Reuters’ Business • Not an internet company with tens of millions of users • Company is set up to build long-term trust relationships with clients. – Many product managers and marketers, who are highly expert in maintaining these relationships – Privacy and data security are crucial. • Cannot be “one size fits all”. • Role of technology is to support and improve products. Primary responsibility for daily support is the Platform Group.
  7. Who are Corporate Research & Development? • Mission: To support Thomson Reuters by carrying out applied research relevant to our businesses. • Team of approx. 40 researchers, developers, managers, administrators, architects • Distributed group: • Rochester, NY, USA • Eagan, MN, USA • New York City, NY, USA (3 Times Square, 12th floor) • London, UK (1 Mark Square), opened in August 2013
  8. The business context: Financial and Risk, Legal, News, IP & Science
  9. The technology context • Databases (mainly SQL, mainly Oracle) • Search (mainly Elasticsearch, built on top of Lucene) • Virtualized servers in data centers • Front ends mostly in JavaScript with AngularJS + components to aid branding via common look and feel. • Back ends mostly Java. • Products often consist of a bundle of related capabilities, packaged together to help potential users understand.
  10. BOLD • Big • Open • Linked • Data. The Knowledge Graph
  11. Experts and non-experts • There are also expert and non-expert professional users – Cortellis (product for drug companies) • First time user, asks broad questions. No idea what is available. Needs whatever guidance we can give. • Expert user. Knows roughly what is available, but may need help locating what they want. – Common thread: users are trying to do something specific, such as a market overview, a comparison, or verification of a hunch about a trend. Give them a data visualization, not just raw data.
  12. Expert user
  13. Natural Language Query: TR-Discover • Keyword based search is not enough to express user intent. • What if the user could type queries, and be guided towards things that our system can answer? – Experts and first timers alike can access through NL – Enables discovery of data – Capture of user intent allows well-targeted analytics • This is not new: there have been NL database query systems since the 60s, but these tend to be hard wired to specific databases and their schemas. We want a reusable tool.
  14. Placeholder for demo. Available to you at http://cortellislabs.com/. You do have to register, but anyone can. NB. Beta version. Works, but has rough edges.
  15. First-time user
  16. Market technology trends
  17. NER Sentiment Analytics
  18. Comparing top 10 indications for companies for Drugs having a primary indication of pain
  19. How we did it • Feature-based context-free grammar, using the formalism of NLTK. • Real logic-based formal semantics. • Autosuggest based on grammar, logical form and heuristics derived from our databases. • Query via translation from logic to SPARQL or SQL – SQL is just for now, for efficiency. – But we plan to keep logic as a separate level, not translate directly to query language.
  20. The Grammar • Feature-based context-free grammar, using the formalism of NLTK. – Grammar captures selectional restrictions relevant to the drug domain. – Adding a new domain should (mainly) be a matter of adding new lexical entries.
  21. Grammar • The word “drugs” is plural, and has λx.drug(x) as its semantics • For now, prepositional phrases have features that enforce very tight attachment preferences. This is going to break, but OK for now. • The type of a verb specifies both the potential subject-type and object-type, which can be used to filter out nonsensical questions like “drugs headquartered in the U.S”.
      G1: Nom → N
      G2: NP → Nom
      G3: NPbar → NP
      G4: NPbar → NPbar VPbar
      G5: VPbar → TV NPbar
      Lex1: N[type=drug, num=pl, sem=<λx.drug(x)>] → 'drugs'
      Lex2: TV[TYPE=[drug,org,dev], sem=<λX x.X(λy.dev_org_drug(y,x))>, tns=past, NUM=?n] → 'developed by'
      Lex3: TV[TYPE=[org,country,hq], NUM=?n] → 'headquartered in'
  22. Query translation input: Drugs developed by Merck
  23. Query translation output: Drugs developed by Merck • This was SPARQL. It works, but is much the slowest part of the system • Similar translation for SQL. – In our demo, we can use the fact that we know all the words of the grammar to make the database small. – This lets us replace the big costly knowledge graph with a single-file SQLite database. • Yes, we know it won’t scale
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX example: <http://www.example.com#>
      select ?x where {
        ?id042 rdfs:label 'Merck' .
        ?id042 rdf:type example:Company .
        ?x rdf:type example:Drug .
        ?id042 example:develops ?x .
      }
  24. Autosuggest • Use the grammar to calculate possible continuations from what we have so far. – Currently this process does not use a full-fledged parser, and relies on the fact that the grammar is carefully engineered to minimize local and eradicate global ambiguity. – I want to achieve a tighter integration with the parser, and generate predictions based on elements present in the parser’s chart, allowing more ambiguity • Rank suggestions by preferring concepts that correspond to nodes in the RDF graph that are involved in many relationships. – When we have large enough query logs we hope to add in an additional preference component based on a domain specific n-gram language model.
  25. What is it like being a scientist in the business world? • It varies with the DNA of the organization… – ETS – Nuance – Thomson Reuters
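
The pipeline on the "How we did it", "Grammar" and "Query translation" slides (type-checked analysis, a separate logic level, then translation to SPARQL) can be illustrated with a minimal Python sketch. It is not the TR Discover implementation: the lexicon tables, the `LogicalForm` class, the `analyse` and `to_sparql` helpers, and the `example:Country` and `example:headquarteredIn` terms are all assumptions made for illustration, and the input is assumed to be already split into noun, verb and entity.

```python
from dataclasses import dataclass

# Hypothetical lexicon mirroring Lex1-Lex3 on the Grammar slide.
# Each verb carries TYPE=[subject type, object type, relation name].
NOUNS = {"drugs": "drug", "companies": "org"}
VERBS = {
    "developed by": ("drug", "org", "dev"),
    "headquartered in": ("org", "country", "hq"),
}
# Hypothetical entity table; a real system would consult the knowledge graph.
ENTITIES = {"Merck": "org", "the U.S": "country"}

# Assumed mapping from grammar types and relations to RDF terms, modelled on
# the example: namespace of the query translation slide.
RDF_CLASS = {
    "drug": "example:Drug",
    "org": "example:Company",
    "country": "example:Country",
}
# relation -> (predicate, True if the named entity is the triple's subject)
RDF_PREDICATE = {
    "dev": ("example:develops", True),
    "hq": ("example:headquarteredIn", False),
}


@dataclass(frozen=True)
class LogicalForm:
    """Intermediate logic level: noun_type(x) AND relation(x, entity)."""
    noun_type: str
    relation: str
    entity: str
    entity_type: str


def analyse(noun: str, verb: str, entity: str) -> LogicalForm:
    """Build a logical form, rejecting questions whose types clash."""
    subj_type, obj_type, relation = VERBS[verb]
    noun_type = NOUNS[noun]
    entity_type = ENTITIES[entity]
    if noun_type != subj_type or entity_type != obj_type:
        raise ValueError(f"type clash in '{noun} {verb} {entity}'")
    return LogicalForm(noun_type, relation, entity, entity_type)


def to_sparql(lf: LogicalForm) -> str:
    """Translate the logical form to a SPARQL string (labels are not escaped)."""
    predicate, entity_is_subject = RDF_PREDICATE[lf.relation]
    triple = f"?e {predicate} ?x ." if entity_is_subject else f"?x {predicate} ?e ."
    return "\n".join([
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX example: <http://www.example.com#>",
        "SELECT ?x WHERE {",
        f"  ?e rdfs:label '{lf.entity}' .",
        f"  ?e rdf:type {RDF_CLASS[lf.entity_type]} .",
        f"  ?x rdf:type {RDF_CLASS[lf.noun_type]} .",
        f"  {triple}",
        "}",
    ])


if __name__ == "__main__":
    print(to_sparql(analyse("drugs", "developed by", "Merck")))
```

With these tables, `analyse("drugs", "headquartered in", "the U.S")` raises `ValueError`, which is the kind of filtering of nonsensical questions that the verb types on the Grammar slide are meant to provide. Keeping `LogicalForm` separate from `to_sparql` reflects the stated plan to keep logic as its own level, so that an SQL back end could be added without touching the analysis step.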

Editor's notes

  • Make money by providing information services that professional users want to use. By definition, professional users are in the business of making money by doing something. Lawyers seek legal information because they are trying to give legal advice; traders seek financial and economic information in order to trade; banks want to know who their customers are because the government requires anti-money-laundering (AML) checks before they give out accounts; drug researchers and marketers want to know what their competitors are doing and might do. But we are not a mass market provider, like Google, although some of our markets are large. Also, our users are paying customers, so they aren’t as free with data release as Google’s customers or the people who use Ashley Madison.
    Our business is to help our customers get the information that they need. A major challenge is that our customers don’t always know what information to ask for. Part of the answer to this challenge is to shift perspective from the task of providing information on explicit request to the larger task of discovering and communicating meaningful patterns in data.  This is the essence of analytics. Thomson Reuters, as a customer-focused organization, must focus on the patterns that our customers find meaningful. We have to step into their shoes and give them what they need.
  • We have a lot of different information products, all of which need to be tailored to the specific needs of the customer
    A company advantage is that our professional users trust us. Trust is hard to gain, but easy to lose, as shown by the recent episode in the car industry.
    A second advantage is that we generate good content, either by originating it, or by carefully collating it from openly available sources, then checking and curating it to ensure quality.

    Some of the markets are extremely careful about privacy and data sharing, especially in areas such as law, where you are directly competing with other users; banking, where the databases may contain sensitive commercial data; healthcare, where there is sensitive personal information. This is a good reason for having product managers and marketers who specialize in each market. A challenge is to offer them ideas that allow them to make appropriate promises to clients. The lawyers have been told “machine learning improves the system by working with data in the aggregate, not by directly sharing information about cases with your competitors”. If the product is good enough, they can accept this.

    These factors again make the company work differently from (say) Google, Amazon or eBay, which basically do just one thing, and can focus engineering and marketing resources on that.

    It also means that the Platform Group, of which research is a part, has the task of supporting a broad spectrum of needs, and there is a big payoff for being able to understand what others need. In CS or Cognitive Science terms, a lot of the general thinking done by product managers would fall under “User Modelling”, if it were science.
  • Research is part of the Platform Group, and aims to be part of T-R’s sensory system. It looks at Deep Learning, Cognitive Computing, Blockchain and Machine Learning months or years before the main businesses do. Its products include proof-of-concept prototypes and associated academic publications.
  • Make money by providing information services that professional users want to use. Business is organized into so-called “verticals”, each of which provides information for a specific market. T-R’s main verticals are shown on the slide: Financial & Risk (split into Trading on the one side and Compliance and Risk on the other), IP & Science, News and Legal. Each vertical has numerous products, each catering for a different subgroup of users. Most services are delivered via the user’s desktop browser. This model is under pressure, because of the demand for services that can be accessed on mobile phones and other small devices.
  • Behind each product is the necessary infrastructure. Most products rely heavily on databases (SQL, using Oracle’s tooling, by default) and on search (Elasticsearch by default).
    There is a tendency, which must be resisted, for each vertical or even each product to develop its own way of filling out the detail. Part of the answer is to standardize, which we do, using standard enterprise methods. Front ends are usually JavaScript, using AngularJS. Back ends are Java by default, with other technologies used as needed. Development groups are strongly encouraged to expose well-designed, capable APIs, usually via REST.

    From the technology perspective, a T-R product often looks like a bunch of capabilities, drawn together so that they can be offered to users as a package. This organization is great for innovation, because you can add something new (and perhaps weird) as a free extra, and see how users go with it. No need to build markets from scratch.
  • We have a LOT of APIs, and face major challenges in disseminating knowledge about them to all the places that it is needed. Our application developers need to learn these APIs, and our backend people need to maintain and document these APIs. Part of the difficulty is that there are too many different database designs, too many schemas, too many arbitrary choices about what goes where. The BOLD vision is to support the knowledge graph by exposing all data in the form of RDF triples, then using SPARQL as a universally understood query language. Many of the social and technical obstacles to interoperability will remain, but the idea is to clear away the fixable problems, making it easier to spend less time on issues of technology and data access, more on delivering the RIGHT experience to users.

    It is impossible for all our technologists to be expert in all the APIs. The bet is that SPARQL will help us to get up to speed quicker and become effective more rapidly.

  • Cortellis is a data integration and search platform developed for professional users in the Pharmaceutical industry. It relies on a linked dataset that covers a variety of domains, including Life Sciences, Intellectual Property, Legal and Finance. A keyword-based query interface allows users to obtain information relevant to a wide range of tasks, such as drug repurposing, target finding, legal research about a specific drug, or search for patents owned by a particular company.

    Our first use case targets first-time users of TR Discover. We assume that these users will not have detailed knowledge of the underlying data, and that they will often be asking broad exploratory questions.

    The second use case targets expert professional users (e.g., medical professionals, financial analysts, or patent officers). These users understand the domain, and have specific questions in mind. They know which data they need, and their questions may require material from multiple slices of the data. They may not know or care how the data is partitioned across database tables. Ideally, they should be sheltered from this level of implementation detail. Such a user might work for a pharmaceutical company and be specifically interested in searching for patents relevant to a particular line of current or intended drug development. Guided by the structured autosuggest built into our system, the user could pose questions such as “patents filed by companies developing drugs targeting PDE 4 inhibitor using Small molecule therapeutic that have already been launched”. There is no need for the user to know exactly where this data is, only that it is potentially available somewhere in the system and can presumably be queried for. Our system returns 12 patents for this question and from the generated analytics (the figure for this second use case), the user can immediately see a general outline of the competitive situation. The user can then dig further into the patents to collect more information, and begin to design a development strategy that navigates around potential infringements of the competitors' protected rights, for example.

    Neither class of user wants just raw data. The experts might load it into R or a spreadsheet, and play, but both classes will be happier if we directly give them what they need. The business term for data visualization is analytics.
  • Visualizes which companies are patent active in Japan.
  • We want this to be portable across domains.
  • It depends on the culture of the company and the nature of its business. The Educational Testing Service is technically a non-profit, with a mission to give back to education, and a few very well-established products. There was a mix of blue-sky research and very practical efforts to improve products, measure their performance, and move into new markets. It has moments where it acts like a charitable foundation, moments when it feels very like a university and some moments when it acts a bit like Mr. Burns from the Simpsons.
    Nuance is an established speech technology provider building state-of-the-art products, at scale.
    Nuance has a lot of people using Ph.D.-level skills to build complicated, difficult-to-understand products fast. With dodgy APIs. Focussed on a few high-value markets.
    Mix of blue sky and near term. Odds are that if you pick a random employee, they have an engineering background.

    Thomson Reuters is a broad-spectrum information company. It has relatively few people with advanced technical degrees, but puts a lot of effort into understanding customer needs. Challenges include dealing with the sheer diversity of products. The company is very diverse: many different information products, and many people with different skill sets and backgrounds. An engineering background is unlikely, but other skills, especially people skills and gifts at understanding what a client needs, are very likely. To fit in, you have to play well with such people and, if you aren’t already, learn to be a bit like them.
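
The structured autosuggest mentioned in the notes, and described on the Autosuggest slide, combines two ideas: use the grammar to list possible continuations, and rank entity suggestions by how many relationships their graph node takes part in. The following is a minimal Python sketch of that combination, not the system's actual code. The `NEXT` continuation table, the toy rows (including the invented ExampleCo and OtherCo), and the helpers `build_graph`, `degree_ranked` and `suggest` are all hypothetical, and an in-memory SQLite table of triples stands in for the RDF graph, in the spirit of the single-file SQLite database used in the demo.

```python
import sqlite3

# Hypothetical continuation table: which items may follow the last token.
# A real system would derive this from the feature grammar. Items written
# as <type> are placeholders to be filled with entities of that type.
NEXT = {
    "<start>": ["drugs", "companies"],
    "drugs": ["developed by"],
    "companies": ["headquartered in"],
    "developed by": ["<org>"],
    "headquartered in": ["<country>"],
}

# Made-up toy data, for illustration only.
ENTITY_ROWS = [
    ("Merck", "org"), ("ExampleCo", "org"), ("OtherCo", "org"),
    ("US", "country"),
    ("DrugA", "drug"), ("DrugB", "drug"), ("DrugC", "drug"),
]
TRIPLE_ROWS = [
    ("Merck", "develops", "DrugA"),
    ("Merck", "develops", "DrugB"),
    ("Merck", "headquarteredIn", "US"),
    ("ExampleCo", "develops", "DrugC"),
]


def build_graph() -> sqlite3.Connection:
    """Load the toy graph into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (name TEXT PRIMARY KEY, type TEXT)")
    conn.execute("CREATE TABLE triple (subj TEXT, pred TEXT, obj TEXT)")
    conn.executemany("INSERT INTO entity VALUES (?, ?)", ENTITY_ROWS)
    conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", TRIPLE_ROWS)
    return conn


def degree_ranked(conn: sqlite3.Connection, entity_type: str, limit: int = 5) -> list:
    """Entities of one type, ordered by the number of triples they appear in."""
    rows = conn.execute(
        """
        SELECT e.name, COUNT(t.subj) AS degree
        FROM entity AS e
        LEFT JOIN triple AS t ON e.name = t.subj OR e.name = t.obj
        WHERE e.type = ?
        GROUP BY e.name
        ORDER BY degree DESC, e.name ASC
        LIMIT ?
        """,
        (entity_type, limit),
    ).fetchall()
    return [name for name, _degree in rows]


def suggest(conn: sqlite3.Connection, tokens: list) -> list:
    """Continuations for a partial query given as a list of grammar tokens."""
    last = tokens[-1] if tokens else "<start>"
    suggestions = []
    for item in NEXT.get(last, []):
        if item.startswith("<") and item.endswith(">"):
            # Placeholder: expand to concrete entities, best connected first.
            suggestions.extend(degree_ranked(conn, item[1:-1]))
        else:
            suggestions.append(item)
    return suggestions


if __name__ == "__main__":
    graph = build_graph()
    print(suggest(graph, ["drugs", "developed by"]))
```

Because the table looks only at the last token, the sketch relies on an unambiguous grammar, much as the slide says the current process does. The n-gram preference component mentioned on the slide is not modelled here.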