
IC-SDV 2018: Aleksandar Kapisoda (Boehringer) Using Machine Learning for Automatic Classification of Companies


Focusing on the significance of targets is one of the key drivers for quality of web search.

Filtering targeted companies based on the significance of their business model for the expected search results was one of our “nice to haves” last year.

Evaluating a number of artificial intelligence approaches based on neural networks, classical machine learning and semantic technologies led us to a working hybrid approach.


  1. Using Machine Learning for Automatic Classification of Companies
     IC-SDV 2018, Nice, France, April 23rd, 2018
     Aleksandar Kapisoda
  2. Content
     • General Information
     • Project Update
     • Deepening the SEARCHCORPUS
     • How to improve?
     • What is What?
     • Approaches, Learnings & Results
     • Next steps
     • Conclusions
  3. General Information: Boehringer Ingelheim (status: 31.12.2016)
     • Family-owned global corporation
     • Founded 1885 in Ingelheim, Germany
     • Focus on human pharmaceuticals, animal health and biopharmaceutical contract manufacturing
     • Around 45,700 employees worldwide
     • Four R&D sites worldwide
     • R&D expenditure of EUR 3.1 billion
     • 17 production facilities (human pharmaceuticals) in 11 countries
     • Net sales of around EUR 15.9 billion
     • 143 affiliated companies worldwide
     • Investment in tangible assets: EUR 645 million
     Image caption: Boehringer Ingelheim Center, our headquarters in Ingelheim
  4. General Information: Scientific Information Center
     • 1960 Central Library
     • 1990 Scientific Information Services
     • 2003 Scientific Library
     • 2006 Scientific Information Center
     Scope: Access to Knowledge
     Main tasks:
     • Global acquisition of external data (scientific databases, literature)
     • Scientific Information Sources & Consultancy
  5. Project Update & Status Quo: Biotech SEARCHCORPUS®
     • 2015: Collecting company URLs. We quickly collected 80,000+ companies, but also "unwanted" targets such as beauty farms and pharmacies.
     • 2016 & 2017: Drastic improvement of data quality through relevance filtering (tagging with domain ontologies & taxonomies).
     • 2018: Focusing the search space through classification of the search space.
     • Today: more than 45,000 biotech companies in 140 countries with more than 8.5 million web pages.
     Usage: As the word about the Biotech Company SEARCHCORPUS® spreads within BI, we get new, diverse research targets:
     • Competitive intelligence
     • Business development & Licensing
  6. Deepening the SEARCHCORPUS
     • A data lake is a storage repository that holds a vast amount of data until it is needed.
     • Marketers with sparse data often do not have enough data to create measurable outcomes in audience targeting through modeling.
     Source: Chris O'Hara, https://adexchanger.com/managing-the-data/deepening-data-lake-second-party-data-increases-ai-enterprises/
  7. Deepening the SEARCHCORPUS (2015 - 2017)
     • SEARCHCORPUS is our storage repository that holds a vast amount of crawled data until it is needed.
     • In our SEARCHCORPUS we have too much data to create measurable outcomes in targeting the right company.
     Source: https://adexchanger.com/managing-the-data/deepening-data-lake-second-party-data-increases-ai-enterprises/
  8. Deepening the SEARCHCORPUS: Data Lake
     • Second-party data is simply someone else's first-party data.
     • When relevant insights are added to a data lake, the result is a more robust environment for deeper data-led insights for both targeting and analytics.
     Source: Chris O'Hara, https://adexchanger.com/managing-the-data/deepening-data-lake-second-party-data-increases-ai-enterprises/
  9. Deepening the SEARCHCORPUS (2018)
     • In our case, second-party data comes from an Information Scientist with domain expertise; she owns the relevant insights.
     • This environment is more robust for deeper data-led insights for targeting the relevant company.
     Source: https://adexchanger.com/managing-the-data/deepening-data-lake-second-party-data-increases-ai-enterprises/
  10. Using the domain expertise: deepening the SEARCHCORPUS with the relevant insights of the Information Scientist
      • Question: What is the question?
      • Content: Which information does our internal customer want?
      • Search: What are the right keywords? Where to search?
      • Results
  11. Deepening the SEARCHCORPUS: Status Quo & Challenge
      Status quo:
      • Currently we have too much data in our SEARCHCORPUS to create measurable outcomes in targeting the right company.
      • Covered areas: Big Pharma, Biotech Companies, CRO, Digital Health, Life Science & University
      Challenge:
      • Look for specific types of businesses, e.g. distinguish CROs from R&D startups.
      • Deeper data-led insights for targeting the relevant company.
  12. Deepening the SEARCHCORPUS: Focusing the Search Space
      Search space:
      • We are looking for a company that licenses its process.
      • We are not looking for a service provider who could produce a drug for you according to your procedure (which you do not have).
      • In this case, 40% of the hits are CROs and therefore false positives.
      Goal:
      • We exclude all unsuitable business models and thereby eliminate 100% of these false positives.
      • This reduces the search time or improves the quality of the results.
      • Deeper data-led insights for targeting the relevant company.
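The effect of excluding unsuitable business models on search precision can be illustrated with a small sketch. Everything here is hypothetical: the `Hit` record, the labels and the toy result list are assumptions chosen to mirror the 40% CRO share from the slide, and the sketch further assumes that every non-CRO hit is relevant, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    company: str
    business_model: str  # hypothetical label assigned by a classifier
    relevant: bool       # hypothetical ground truth for this query


def precision(hits):
    """Fraction of returned hits that are relevant (0.0 for an empty list)."""
    if not hits:
        return 0.0
    return sum(h.relevant for h in hits) / len(hits)


def focus_search_space(hits, excluded_models):
    """Drop every hit whose business model is unsuitable for the query."""
    return [h for h in hits if h.business_model not in excluded_models]


# Toy result list: 6 licensing biotechs (relevant) and 4 CROs (false positives).
hits = [Hit(f"biotech-{i}", "Biotech", True) for i in range(6)]
hits += [Hit(f"cro-{i}", "CRO", False) for i in range(4)]

before = precision(hits)
focused = focus_search_space(hits, excluded_models={"CRO"})
after = precision(focused)
print(f"precision before: {before:.2f}, after: {after:.2f}")
```

In this toy setting precision rises from 0.60 to 1.00 because all false positives share one excluded label. With a real classifier the gain depends on how accurately business models are recognized.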
  13. Deepening the SEARCHCORPUS: Goal
      • Significantly improve the search precision in the Biotech Company SEARCHCORPUS®
      • An abstract classification of the sources, e.g. by business model (CRO)
      • Deeper data-led insights for targeting the relevant company
  14. How to improve the Search Precision?
      Source: https://sipmm.edu.sg/how-artificial-intelligence-revolutionize-procurement/
  15. What is What?
      • Artificial Intelligence (AI): Machines in-built with Human Intelligence
      • Machine Learning (ML): Approach to achieve Machines with Human Brain (Support Vector Machines)
      • Deep Learning (DL): Techniques to train the Machine's Brain (Neural Networks are the foundation for DL)
      Source: https://blogs.systweak.com/2016/12/artificial-learning-machine-learning-and-deep-learning-know-the-difference/
  16. Approaches, Learnings & Results: Choosing the training algorithms
      • Classification based on website structure
      • Deep Learning: Feed Forward Neural Networks, Recurrent Neural Networks
      • Machine Learning: Support Vector Machine
      Source: https://blogs.systweak.com/2016/12/artificial-learning-machine-learning-and-deep-learning-know-the-difference/
  17. Approaches, Learnings & Results: Preparing the ground for learning
      Creating the learning set: Our Information Specialist categorized 50 company websites into small training sets.
      Focus areas:
      • Big Pharma
      • Biotech
      • CROs
      • Digital Health
      • Life Science
      • University
  18. Artificial Neural Networks: A Pathway to Deep Learning
      Artificial Neural Networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" tasks by considering examples, generally without task-specific programming.
      Figure: Artificial Neural Network as a black box
      Source: http://adventuresinmachinelearning.com/neural-networks-tutorial/
  19. Support Vector Machine: Automated Classification & Clustering
      • In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
      • Word2vec is an efficient algorithm which uses a simple neural network for unsupervised learning of word embeddings on a large, unlabeled corpus.
      How does automated classification & clustering work?
      • It consists of dividing the items that make up a collection into categories or classes.
      • The goal is to accurately predict the target class for each record in new data.
      Source: https://groups.google.com/forum/#!topic/gensim/EwK-6JgkWVI
  20. Approaches, Learnings & Results: Approach 1, classification based on website structure
      Approach:
      • Our crawlers know the structure of the company websites.
      • Does the website structure reflect the type of business? (Then we could, e.g., use image classification algorithms.)
      Result:
      • Businesses of the same type may have very small (1 page) or very large websites (1000s of pages).
      • Looking at the link structure as an image does not reveal anything about the type of business.
  21. Approaches, Learnings & Results: Approach 2, Feed Forward Neural Networks (Deep Learning)
      Approach: Feed Forward Neural Networks and term frequency
      • For the training of the neural network we used two-layered backpropagation networks, which are classic FFNN classifiers for non-linear problems.
      • For this approach we converted the input data into a vector using a TF-IDF preprocessor trained on our large corpus.
      • TF-IDF (term frequency-inverse document frequency) identifies important terms by setting the number of occurrences of a term in a document in relation to the number of documents in the corpus that contain this term.
      Result:
      • Recognition rates not satisfactory
      • Training too expensive (lots of nodes due to the large input vector)
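The TF-IDF weighting described on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the slides do not disclose the actual preprocessor, so the whitespace tokenizer, the unsmoothed formula tf(t, d) · log(N / df(t)), the function names and the three toy documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(documents):
    """Compute idf(t) = log(N / df(t)) for every term seen in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to {term: tf * idf}; terms unseen in the corpus are skipped."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical toy corpus standing in for crawled website text.
corpus = [
    "we offer contract research services for clinical trials",
    "we develop antibody therapeutics and license our platform",
    "we offer contract manufacturing services",
]
idf = fit_idf(corpus)
vec = tfidf_vector(corpus[1], idf)
```

A term such as "we" that occurs in every document gets weight zero, while a term that occurs in only one document, such as "antibody", gets the highest weight. The vector has one dimension per distinct corpus term, which is consistent with the slide's remark that the large input vector made training expensive.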
  22. Approaches, Learnings & Results: Approach 3, Recurrent Neural Networks for text classification (Deep Learning)
      Approach:
      • Recurrent Neural Networks are good at learning sequences.
      • RNNs can make decisions based on text that is converted into an input vector (Word2Vec / Doc2Vec) and take the sequence of the words into account. This allows them to find patterns in the content, memorize them and bind them to specific classes.
      Result:
      • No convergence, because of the small training sets (50 samples only)
      • Recognition rate not satisfactory
  23. Approaches, Learnings & Results: Final approach, Support Vector Machine with normalized input (Machine Learning)
      Approach:
      • SVMs can create non-linear classifiers by transforming the input space into a high-dimensional one. They therefore offer a good compromise between complexity and performance.
      • To obtain a reasonably sized input vector (remember, we classify a whole website, which may have several 100 MB of content), we use the preprocessor from our FFNN approach with "some magic".
      Result:
      • For all 6 real-world samples we got an average recognition rate of more than 96%.
      • Preparation and training are easy enough for data scientists to create new classes on the fly without programming effort.
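To make the idea of an SVM on normalized term vectors concrete, here is a minimal sketch. It is not the presented system: it trains a linear, two-class SVM without a bias term by sub-gradient descent on the hinge loss, whereas the slide refers to non-linear classifiers over six classes. The L2 normalization, the hyperparameters `lam`, `eta` and `epochs`, the function names and the toy term counts are all assumptions.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def l2_normalize(x):
    """Scale a sparse vector to unit length so large and small sites are comparable."""
    norm = sum(v * v for v in x.values()) ** 0.5
    return {k: v / norm for k, v in x.items()} if norm else dict(x)


def train_linear_svm(samples, labels, lam=0.01, eta=0.1, epochs=100):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss; labels are +1 or -1."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * dot(w, x)
            for k in w:                      # gradient step of the regularizer
                w[k] *= 1.0 - eta * lam
            if margin < 1.0:                 # hinge loss is active for this sample
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


def predict(w, x):
    """Return +1 or -1 for a raw (unnormalized) feature dict."""
    return 1 if dot(w, l2_normalize(x)) >= 0 else -1


# Hypothetical toy term counts: +1 = CRO, -1 = biotech.
raw = [
    {"contract": 2, "services": 1, "trials": 1},
    {"contract": 1, "services": 2, "clients": 1},
    {"pipeline": 2, "antibody": 1, "license": 1},
    {"pipeline": 1, "license": 2, "platform": 1},
]
labels = [1, 1, -1, -1]
w = train_linear_svm([l2_normalize(x) for x in raw], labels)
print(predict(w, {"contract": 1, "trials": 1}), predict(w, {"antibody": 1, "platform": 1}))
```

Normalizing each vector before training keeps a site with thousands of pages from dominating a one-page site purely through larger counts. For more than two classes, a common option is to train one such classifier per class (one-vs-rest).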
  24. Next Steps
      • Integration: The verified approach for website classification based on Support Vector Machines will be integrated into the Deep SEARCH 9® development environment.
      • Optimization: The Biotech SEARCHCORPUS® will be further optimized by dynamically dividing the search space into clusters that reflect the focus of all current research targets.
      • More sources: Additional sources will be crawled, as new websites can be classified before they are added to the SEARCHCORPUS®.
      • Further development: The classification approach will be further developed with the goal of recognizing companies that develop technologies not monitored so far.
  25. Conclusions
  26. Acknowledgements
      • Klaus Kater, Developing Partner, Deep Search 9
      • Dr. Gabriele Becher, Boehringer Ingelheim Pharma GmbH & Co. KG
  27. Contact Information
      Aleksandar Kapisoda
      aleksandar.kapisoda@boehringer-ingelheim.com
      Discovery Research Coordination – Scientific Information Center
  28. Thank You! Questions?
      For more information have a look at:
      • www.boehringer-ingelheim.com
      • www.opnme.com
      © Boehringer Ingelheim International GmbH 2017
