
David Baehrens: Large-Scale Patent Classification at the European Patent Office


http://2015.semantics.cc/david-baehrens

Published in: Data & Analytics


  1. David Baehrens: Large-Scale Patent Classification at the European Patent Office
  2. ABOUT AVERBIS: Founded: 2007; Location: Freiburg im Breisgau; Team: Domain & IT experts; Focus: Leverage structured & unstructured information; Current sectors: Pharma, Health, Automotive, Publishers & Libraries
  3. PORTFOLIO: Solutions (Libraries, Pharma, Patents, Healthcare, Social Media); Terminology Management; Text Mining; Search & Analytics; NoSQL; Categorization & Clustering; Automotive
  4. TERMINOLOGY MANAGEMENT: Terminology management software; Provision of terminologies; Mappings between terminologies; Building terminology-based applications
  5. TEXT MINING: Term Detection (synonyms: dimethyl sulfoxide, dimethylsulfoxide, Domoso, Infiltrina; hierarchies: cancer, carcinoma, melanoma, lymphoma, glioblastoma…); Regular Expressions (patterns: dates, citations, mail addresses…); Rule Engine (rule-based extraction of all different kinds of complex information); Named Entities (persons, locations, genes, …); Relations (co-occurrences, typed relations, e.g. genes / diseases / modification type); Syntax Detection (sentences, tokens, POS tags, chunks, paragraphs, sections, stemming, decompounding…)
  6. RULE ENGINE (Text, Rule, Result): Input text: "1. NAME OF THE MEDICINAL PRODUCT Desloratadine ratiopharm 5 mg film-coated tablets". Result (Primary Field Name: Secondary Field Name = Field Value): MedicalProductName: coveredText = Desloratadine ratiopharm 5 mg film-coated tablets; inventedPartName = DESLORATADINE; strengthPart = 5 mg; pharmaceuticalDoseFormPart = FILM-COATED TABLET
  7. SEARCH & NOSQL: Free text + concept-based search; Text mining integration; Guided navigation / facets; NoSQL functionalities; Multi- & cross-lingual search; Related documents. Based on Apache Solr: Extended Query Syntax, JSON-API, Scalability, …
  8. DOCUMENT CLASSIFICATION: Hotel Reviews; Patents
  9. SEARCH & NOSQL
  10. INFORMATION DISCOVERY: Terminology Management; Text Mining; Search & Analytics; NoSQL; Categorization & Clustering; Delivery / Deployment / Runtime Environment; Integration Tests / Continuous Integration; Extensive Documentation; Common Architecture / Application Design; User & Role Management, Security; Communication Bus; Project Management
  11. PATENT CLASSIFICATION AT EPO, Tender No. 1585: 1) Pre-Classification of unpublished patents into departments; 2) Re-Classification of published patents, if the category system changes
  12. ABOUT EPO: The European Patent Office (EPO) grants European patents for the Contracting States to the European Patent Convention; Second-largest intergovernmental institution in Europe; Not an EU institution; Self-financing, i.e. revenue from fees covers operating and capital expenditure
  13. NUMBER OF STAFF (status: December 2008)
  14. PATENT APPLICATIONS
  15. http://www.epo.org/about-us/annual-reports-statistics/annual-report/2014.html
  16. COOPERATIVE PATENT CLASSIFICATION: Patent classification system based on ECLA / IPC; Jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO); Used by both the EPO and USPTO since 1 January 2013; Currently contains about 250,000 classes
  17. EXAMPLE CPC CLASS
  18. GRANTED PATENT
  19. EARLY PATENT
  20. EARLY PATENT
  21. EARLY PATENT
  22. PATENT CLASSIFICATION AT EPO, Tender No. 1585: 1) Pre-Classification of unpublished patents into departments. Our motivation: a great classification use case (big data: 80 million patents available; large-scale category system: >250,000 CPC codes; tough classification quality and response time constraints); a text mining success story
  23. OLD CLASSIFICATION PROCESS: Patents; Classification; Departments
  24. CLASSIFICATION COMPLEXITY: ~250,000 CPC codes; ~1,500 ranges; 250 departments
  25. CLASSIFICATION PROCESS: Patents; Classification; Departments
  26. NEW CLASSIFICATION PROCESS: Patents; Classification; Departments
  27. SOME FACTS: About 650k training documents from 2005-2013; Supervised learning: light-weight and fast linear support vector machine; Training time (16 cores, 128 GB RAM): feature extraction ~1 hour, training of classifiers ~1 hour, 90/10 tests with a look-ahead of 3 levels and reporting the 3 best candidates ~1 hour; Prediction: 5 docs in 5 sec
  28. HIERARCHICAL CLASSIFICATION
  29. STATUS & OUTLOOK: ✓ Range-specific quality evaluation; ✓ Going live with best ranges; • Continuous optimization
  30. PATENT CLASSIFICATION AT EPO, Tender No. 1585: 2) Re-Classification of published patents, if the category system changes. Challenges and facts: 250,000 CPC codes with regular changes/refinements; Several re-classification projects at any one time, with great variation in size (a class is split into 5-20(?) subclasses); No training material available
  31. NEW RE-CLASSIFICATION PROCESS: Training data: a human annotator starts by labeling about 20% of the documents with the new subclasses. Statistical models: are generated on the fly, and cross-validation tests are carried out. Threshold: if cross-validation reaches a certain threshold (e.g. 90%), the remaining documents are classified fully automatically without further review; otherwise, more training data is generated
  32. STATUS & OUTLOOK: ✓ Currently in evaluation phase; • Going live in the next weeks
  33. …NOT ONLY PATENTS: Solutions (Libraries, Pharma, Patents, Healthcare, Social Media); Terminology Management; Text Mining; Search & Analytics; NoSQL; Categorization & Clustering; Automotive
  34. For further questions, please contact: David Baehrens; Phone: +49 (0)761 203 97690; Email: info@averbis.com
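
The re-classification loop on slide 31 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used at the EPO: the token-count classifier is a toy stand-in for the statistical models, and the names `reclassify`, `annotate`, `start` and `step`, as well as the 10% growth step, are hypothetical. Only the 20% starting share and the 90% example threshold come from the slide.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Toy stand-in model: token counts per class (not the slides' SVM)."""
    model = defaultdict(Counter)
    for text, label in labeled:
        model[label].update(tokenize(text))
    return model


def predict(model, text):
    """Pick the class whose training tokens overlap most with the text."""
    if not model:
        return None
    tokens = tokenize(text)
    return max(sorted(model), key=lambda lab: sum(model[lab][t] for t in tokens))


def cross_validate(labeled, folds=5):
    """Return the overall accuracy of a simple k-fold cross-validation."""
    if not labeled:
        return 0.0
    correct = 0
    for i in range(folds):
        test = labeled[i::folds]
        rest = [d for j, d in enumerate(labeled) if j % folds != i]
        model = train(rest)
        correct += sum(predict(model, text) == label for text, label in test)
    return correct / len(labeled)


def reclassify(docs, annotate, threshold=0.9, start=0.2, step=0.1):
    """Label a share of docs by hand; grow it until CV accuracy is good enough."""
    share = start
    while True:
        n = min(len(docs), max(1, round(share * len(docs))))
        labeled = [(d, annotate(d)) for d in docs[:n]]  # human annotation step
        remaining = docs[n:]
        accuracy = cross_validate(labeled)
        if accuracy >= threshold or not remaining:
            model = train(labeled)
            auto = [(d, predict(model, d)) for d in remaining]
            return labeled, auto, accuracy
        share += step  # otherwise request more training data


# Hypothetical toy data; `annotate` plays the role of the human annotator.
docs = ["engine piston fuel", "protein gene cell"] * 10
annotate = lambda d: "mechanics" if "engine" in d else "biotech"
labeled, auto, accuracy = reclassify(docs, annotate)
print(len(labeled), "labeled by hand,", len(auto), "classified automatically")
```

In this sketch the annotator is asked for more labels only when cross-validation falls short of the threshold, which mirrors the "otherwise, more training data is generated" branch on the slide.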
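
Slides 24, 27 and 28 mention a hierarchy of departments, ranges and CPC codes, a look-ahead of 3 levels, and reporting the 3 best candidates, but they do not spell out the search procedure. One possible reading is a beam search down the hierarchy, sketched below. Everything here is an assumption made for illustration: the tiny `TREE`, the `KEYWORDS` scorer standing in for per-node classifiers, and the function name `top_candidates` are hypothetical and not taken from the talk.

```python
import heapq

# Hypothetical three-level hierarchy: department -> range -> CPC code.
TREE = {
    "root": ["dept_mech", "dept_bio"],
    "dept_mech": ["range_F02"],
    "dept_bio": ["range_C12"],
    "range_F02": ["F02B", "F02M"],
    "range_C12": ["C12N", "C12Q"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS = {
    "dept_mech": {"engine", "piston"},
    "dept_bio": {"gene", "cell"},
    "range_F02": {"engine"},
    "range_C12": {"gene"},
    "F02B": {"piston"},
    "F02M": {"fuel"},
    "C12N": {"cell"},
    "C12Q": {"assay"},
}


def score(node, tokens):
    """Score a node by how many of its keywords occur in the document."""
    return len(KEYWORDS[node] & tokens)


def top_candidates(text, depth=3, beam=3):
    """Descend `depth` levels, keeping the `beam` best-scoring paths per level."""
    tokens = set(text.lower().split())
    paths = [(0, ["root"])]
    for _ in range(depth):
        expanded = []
        for total, path in paths:
            children = TREE.get(path[-1], [])
            if not children:
                expanded.append((total, path))  # leaf reached early, keep it
            for child in children:
                expanded.append((total + score(child, tokens), path + [child]))
        paths = heapq.nlargest(beam, expanded, key=lambda p: p[0])
    return [(path[-1], total) for total, path in paths]


print(top_candidates("piston engine"))
```

Keeping several paths alive at each level means an uncertain decision near the top of the hierarchy does not immediately rule out a code further down, which is one plausible motivation for combining a look-ahead with a top-3 report.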
