WWW.LEDS-PROJEKT.DE
STREAMING-BASED TEXT MINING USING
DEEP LEARNING AND SEMANTICS
MARTIN VOIGT / ONTOS
AGENDA
• Use Cases: What is required from project partners and customers?
• Market Situation: What are the others doing?
• MINER: How does Ontos' approach to flexible text mining with Deep Learning look?
• WILDHORN: How to analyze texts from various sources and interlink them with existing knowledge graphs?
• Lessons Learned: What is good, what is not, and what are the next steps?
12 September 2016
USE CASES
What is required from project partners and customers?
USE CASES FROM LEDS
• Content Augmentation in E-Commerce
• Product descriptions are imported from various sources, usually as unstructured text
• The manual work involved is resource-consuming and expensive
http://www.walmart.com/ip/The-Revenant-Blu-ray-Digital-HD/50129277
USE CASES FROM CUSTOMERS
• Brand or competitor monitoring
• What are people or journalists writing about my brand?
• What are my competitors doing? How is the market changing?
• Monitoring of web sites, news feeds, social media channels, …
• Link with reports, CRM, open data, …
USE CASES FROM CUSTOMERS
• Assist criminal investigations on the (dark) web
• More and more organized crime on the web
• Detect entities with their relations and additional facts
Example: “But Neumann and McKelvey eventually sold the business to their landlord Joshua Guttman.”
[Figure: entity graph with the nodes Neumann, McKelvey, Joshua Guttman, and Business, linked by the relations sold_what and sold_to]
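A minimal sketch of how the entities and relations from the example sentence could be held as data. The `Entity` and `Relation` types are hypothetical, the entity types are illustrative, and the way `sold_what` and `sold_to` are attached to subjects and objects is an assumption read off the sentence, not the data model shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # illustrative type names, not taken from the slide


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str
    obj: Entity


neumann = Entity("Neumann", "Person")
mckelvey = Entity("McKelvey", "Person")
guttman = Entity("Joshua Guttman", "Person")
business = Entity("Business", "Thing")

# Assumption: both sellers carry both relations.
relations = [
    Relation(seller, predicate, target)
    for seller in (neumann, mckelvey)
    for predicate, target in (("sold_what", business), ("sold_to", guttman))
]


def facts_about(entity, rels):
    """Return (predicate, object text) pairs where the entity is the subject."""
    return [(r.predicate, r.obj.text) for r in rels if r.subject == entity]


print(facts_about(neumann, relations))
```

Keeping relations as subject, predicate, object records makes a later export to an RDF store straightforward, since each record already has the shape of a triple.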
MAIN REQUIREMENTS
• Detection and classification of entities, relations and facts with good F1
• Long, high-quality texts vs. short texts with bad / missing grammar, in multiple languages
• Training not by linguists but by domain experts
• Flexible adaptation to new domains and contexts
• Many different data sources
MARKET SITUATION
What are the others doing?
MARKET
• MarketsAndMarkets: NLP (all) for 2020
• Market: $13.4 bn
• CAGR: ~18.4%
• ResearchMOZ: Text Analytics 2015–2019
• CAGR: ~16.1%
• Transparency Market Research: Text Analytics 2016–2024
• CAGR: ~17.6%
• Driving element: NLP-as-a-Service with a CAGR of ~20.2%
http://www.marketsandmarkets.com/PressReleases/natural-language-processing-nlp.asp
http://www.researchmoz.us/global-natural-language-processing-market-2015-2019-report.html
http://www.transparencymarketresearch.com/pressrelease/global-text-analytics-market.htm
MARKET
• ResearchAndMarkets: 2016 Top 10 Information and Communication Technologies
• 7. Natural Language Processing (NLP)
• Increasing Demand for Automation Drives Growth
• Funding on the Rise - Start-ups Driving Key Innovations for NLP Applications
• High Acceptance of Technology enables the US to Remain on the Top
• Gartner's Top 10 Strategic Technology Trends for 2016
• 4 - Information of Everything:
• Access to heterogeneous data sources
• (Semantic) linking of data items
• 5 - Advanced Machine Learning:
• Deep Learning and Neural Networks for NLP
http://www.researchandmarkets.com/research/vd8fr9/2016_top_10
GETTING AN OVERVIEW
• Understood the domain by reviewing 30+ tools and frameworks
• Checked scientific prototypes as well as open source tools
• Investigated what other enterprises, companies, and startups are doing
• Defined a set of 15+ criteria to compare and understand
GETTING AN OVERVIEW
• Understood the domain by reviewing 30+ tools and frameworks
• Findings
• Startups and huge enterprises usually stick to Deep Learning (DL) → high flexibility
• Some existing companies are losing market share, e.g., Attensity
• Market saturation in the U.S., only a few companies in Europe (5)
• More and more NLP-as-a-Service
• Only a few use RDF data, e.g., for output or disambiguation (8)
• Only 8 tools could extract relations and facts
• Only 9 open source tools with available benchmarks
GETTING AN OVERVIEW
• Rough comparison of the main concepts
• Conclusions for Ontos
• Use DL as the foundation for text mining tasks
• Combine it with semantic technologies, e.g., for disambiguation
• External view: NLP-as-a-Service
• Internal view: NLP pipelines in order to combine different tools / technologies
Figure based on http://www.deeplearningbook.org/contents/intro.html, Fig. 1.5
MINER
How does Ontos' approach to flexible text mining with Deep Learning look?
MINER - OVERVIEW
Large text corpora
• Large collection of texts in a given language or “dialect”, e.g. social media / Twitter
Unsupervised Model
• Language model generated from corpora
• Model serves as input for the supervised step
Supervised Model
• Domain- or task-specific model for sequence labeling
• Trained on specific texts by domain experts
MINER – CORPUS CREATION
• 1 corpus per language / dialect required
• The larger and more heterogeneous the better → get and model more contexts for words
• Sources: Common Crawl, Wikipedia, news feeds, domain-specific texts, Twitter, …
MINER – UNSUPERVISED MODEL
• Language model that can be reused in multiple domains / contexts
• Reuses the concept of word2vec to create word embeddings
[Figure: word2vec workflow; labels: Language Dictionary, Numeric Corpus, Word2Vec Dictionary, Word2Vec Model]
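The preparation steps named in the figure can be sketched roughly as follows: build a dictionary from the corpus, turn the text into a numeric corpus, and derive (center, context) pairs of the kind word2vec's skip-gram variant trains on. This is a simplified illustration with hypothetical function names, not the MINER code; real word2vec implementations add subsampling, dynamic windows, and the actual embedding training.

```python
from collections import Counter


def build_dictionary(sentences, min_count=1):
    """Map each word that occurs at least min_count times to an integer id."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    return {word: idx for idx, word in enumerate(vocab)}


def to_numeric_corpus(sentences, dictionary):
    """Replace words by their ids; words missing from the dictionary are dropped."""
    return [[dictionary[w] for w in s if w in dictionary] for s in sentences]


def skipgram_pairs(numeric_sentence, window=2):
    """Yield (center id, context id) pairs for all neighbors inside the window."""
    for i, center in enumerate(numeric_sentence):
        lo = max(0, i - window)
        hi = min(len(numeric_sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, numeric_sentence[j]


corpus = [["leds", "is", "a", "research", "project"]]
dictionary = build_dictionary(corpus)
numeric = to_numeric_corpus(corpus, dictionary)
pairs = list(skipgram_pairs(numeric[0], window=1))
print(dictionary, numeric, pairs)
```

Dropping unknown words in `to_numeric_corpus` is one simple policy; it also shows why a word-level dictionary has an out-of-dictionary problem for unseen words.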
MINER – SUPERVISED MODEL
• Sequence labeling task
• Uses a bi-directional LSTM (BLSTM)
• Gated Recurrent Unit (GRU) as a simplified variant of the LSTM cell
• Output layer
• Join and normalize
• Classification
[Figure: network for the textual input “LEDS is a research project”: word embeddings feed a forward GRU and a backward GRU (one cell per token); output layers combine both directions and emit one output per token]
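A rough, untrained sketch of the forward pass described above: one GRU reads the token embeddings left to right, a second one right to left, and an output layer joins both states per token and normalizes them with a softmax. Everything here is an assumption for illustration (pure Python, random weights, no bias terms, tiny dimensions, two output classes); the actual MINER model is implemented in Torch and is not shown in the slides.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU update; bias terms are omitted for brevity."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        self.w_update = rand_matrix(hidden_size, width, rng)
        self.w_reset = rand_matrix(hidden_size, width, rng)
        self.w_cand = rand_matrix(hidden_size, width, rng)

    def step(self, x, h):
        z = [sigmoid(a) for a in matvec(self.w_update, x + h)]  # update gate
        r = [sigmoid(a) for a in matvec(self.w_reset, x + h)]   # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.w_cand, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run(cell, inputs):
    """Feed the inputs through the cell in order and collect every hidden state."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_sequence(embeddings, forward, backward, w_out):
    fwd = run(forward, embeddings)
    bwd = run(backward, embeddings[::-1])[::-1]  # re-align with token order
    # Join both directions per token, then normalize into class probabilities.
    return [softmax(matvec(w_out, f + b)) for f, b in zip(fwd, bwd)]


rng = random.Random(0)
tokens = "LEDS is a research project".split()
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in tokens]
forward, backward = GRUCell(4, 3, rng), GRUCell(4, 3, rng)
w_out = rand_matrix(2, 6, rng)
probabilities = label_sequence(embeddings, forward, backward, w_out)
```

Because the weights are random, the resulting probabilities carry no meaning; the sketch only shows how each token's label can depend on both its left and its right context.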
MINER – IMPLEMENTATION
• Torch framework
• Mature, efficient GPU support, good packages for neural networks, …
• Scripting in Lua
• Service implementation with Go because of great integration with Lua
• Integrated in Ontos Eiger workbench
MINER – IMPLEMENTATION
• Create new “projects” ad hoc
• Reuse pre-labeled corpora
• Free definition of entity types as required for the domain
MINER – IMPLEMENTATION
• List and manage documents of project corpora
MINER – IMPLEMENTATION
• Double-check and annotate new labels
MINER – IMPLEMENTATION
• English News
• ~400 English texts from CoNLL 2003 and news feeds
• 5 entity types defined and manually annotated by 2 experts
• Each cell: 1st value = correct start of entity, 2nd value = correct end of entity
Class       Person          Organization    Product         Location        Event
F1          0.976 | 0.990   0.932 | 0.962   0.873 | 0.942   0.958 | 0.976   0.891 | 0.928
Recall      0.974 | 0.988   0.936 | 0.953   0.845 | 0.926   0.965 | 0.980   0.870 | 0.913
Precision   0.978 | 0.992   0.928 | 0.971   0.902 | 0.959   0.951 | 0.973   0.914 | 0.944
http://www.cnts.ua.ac.be/conll2003/ner/
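The F1 values in the table are the harmonic mean of the corresponding precision and recall. A small sketch (the function name is ours) that reproduces two of the reported cells from their precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Person, correct start of entity: precision 0.978, recall 0.974
person_start = round(f1_score(0.978, 0.974), 3)
# Event, correct end of entity: precision 0.944, recall 0.913
event_end = round(f1_score(0.944, 0.913), 3)
print(person_start, event_end)  # 0.976 0.928
```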
MINER – IMPLEMENTATION
• English Twitter
• ~2400 tweets from Twitter NLP tools / A. Ritter 2011
• 5 entity types defined, some combined from the original source
Class       Person          Organization    Place           Product         Thing
F1          0.797 | 0.883   0.548 | 0.141   0.725 | 0.669   0.195 | 0.207   0.409 | 0.551
Recall      0.769 | 0.852   0.487 | 0.085   0.696 | 0.647   0.120 | 0.129   0.337 | 0.529
Precision   0.828 | 0.918   0.627 | 0.414   0.756 | 0.693   0.516 | 0.514   0.520 | 0.574
https://github.com/aritter/twitter_nlp
WILDHORN
How to analyze texts from various sources and interlink them with existing knowledge graphs?
NLP PIPELINES
• Problem: How to efficiently connect various tools in the data stream?
• Sources
• RSS feed reader, crawler, Twitter, FTP servers, …
• Analytics
• 1 MINER instance per language
• Disambiguate and link to knowledge graphs or taxonomies
• Sinks
• RDF stores, Apache Solr, HDFS, Apache Cassandra, …
http://www.computernewsme.com/wp-content/uploads/2011/06/Cable-clutter.jpg
NLP PIPELINES
• Best practices – by LinkedIn
• http://www.confluent.io/blog/stream-data-platform-1/
• Apache Kafka!
WILDHORN NLP PIPELINE
• Technical foundation: Apache Spark, Apache Kafka / Confluent
• Platform components: Message Broker, Schema Registry, REST Proxy, REST API
• Sources: Twitter Extractor, RSS Feed Crawler, Log File Extractor, Web / Darknet Crawler
• Analytics: Named Entity Extraction & Disambiguation (Taxo Extractor, MINER, AGDISTIS, ENS), Document Classification, Sentiment Analysis
• Sinks: RDF Serializer (QUAD), NoSQL Serializer (Apache Cassandra), Solr Serializer (Apache Solr)
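The source, analytics, and sink stages above communicate only through broker topics. The following toy sketch illustrates that decoupling with an in-memory stand-in; it deliberately does not use any Kafka or Confluent API, and the class, the topic names, and the capitalization-based "entity extraction" are placeholders invented for illustration.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        """Yield and remove messages from the topic until it is empty."""
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


def extract_entities(text):
    """Placeholder analytics step: treats capitalized tokens as entities."""
    return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]


def run_pipeline(broker, documents):
    # Source stage: push raw documents into the first topic.
    for doc in documents:
        broker.publish("raw-text", doc)
    # Analytics stage: read raw text, write annotated messages to a second topic.
    for doc in broker.consume("raw-text"):
        broker.publish("annotated", {"text": doc, "entities": extract_entities(doc)})
    # Sink stage: drain the annotated topic (a serializer would store these).
    return list(broker.consume("annotated"))


result = run_pipeline(InMemoryBroker(), ["Ontos is based in Leipzig."])
print(result)
```

Since every stage sees only topics, an analytics step can be swapped or added without touching the sources or sinks, which is the property the pipeline design relies on.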
WILDHORN NLP PIPELINE
• Implementation
• Apache Spark 1.6 & Confluent 3.0
• Use of Spark Jobserver to manage Apache Spark applications
• Ontos Eiger backend ready
• First tests
• Simple Mesos cluster with 3 servers
• ~1000 messages/sec with 1 broker, without MINER analytics: data throughput is not a problem
• Problem: 1 MINER instance currently scales up to 100 messages/sec
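A back-of-the-envelope reading of the two test figures: if MINER instances scaled linearly and messages could be spread evenly across them (both are assumptions, not measured results), matching the broker's rate would take about ten instances.

```python
import math


def instances_needed(target_rate, per_instance_rate):
    """Smallest number of instances whose combined rate reaches the target."""
    return math.ceil(target_rate / per_instance_rate)


# ~1000 messages/sec through one broker vs. ~100 messages/sec per MINER instance
print(instances_needed(1000, 100))  # 10
```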
[Figure: a Kafka pipeline made of parts, each with producers, consumers, a topic, and a schema]
https://github.com/spark-jobserver/spark-jobserver
LESSONS LEARNED
What is good, what is not, and what are the next steps?
CURRENT STATUS
• Detection and classification of entities, relations and facts with good F1
• Long, high-quality texts vs. short texts with bad / missing grammar, in multiple languages
• Training not by linguists but by domain experts
• Flexible adaptation to new domains and contexts
• Many different data sources
LESSONS LEARNED
• Neural networks / deep learning provide great concepts & frameworks for flexible, high-quality NLP tasks
• Apache Kafka / Confluent Platform in combination with Apache Spark is a good foundation for data streaming and processing
• Disambiguation and linking of entities to taxonomies and knowledge graphs via Semantic Web technologies is a core contribution to data integration
• Hard to find employees with Deep Learning skills
NEXT STEPS
• Try letter-trigram word hashing to overcome the out-of-dictionary problem of the word2vec algorithm
• Relation and fact extraction in MINER
Example: “Adam Neumann is the CEO of super-hot office rental company WeWork, the most valuable startup in New York City.”
[Figure: annotated example; entities: Adam Neumann (Person), WeWork (Organization); relation firstName for “Adam”; positive sentiment for “super-hot”; further labels: Who, What, How, Thing, has, A1 to A4; legend: Entity, Entity Type, Relation, Relation Type, Annotation]
https://arxiv.org/abs/1608.06757
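Letter-trigram word hashing, named in the first bullet above, represents a word by the three-letter windows of its boundary-marked form, so an unseen word still shares features with known ones. A minimal sketch; the function names and the sparse dictionary layout are our own choices for illustration:

```python
def letter_trigrams(word, boundary="#"):
    """Split a word into overlapping letter trigrams, with boundary markers."""
    marked = f"{boundary}{word.lower()}{boundary}"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def hash_word(word, trigram_index):
    """Sparse bag-of-trigrams vector {trigram id: count}; new trigrams get new ids."""
    vector = {}
    for tri in letter_trigrams(word):
        idx = trigram_index.setdefault(tri, len(trigram_index))
        vector[idx] = vector.get(idx, 0) + 1
    return vector


index = {}
good = hash_word("good", index)
goods = hash_word("goods", index)
print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

Here "goods" shares the trigrams `#go`, `goo`, and `ood` with "good", which is what lets a model relate an out-of-dictionary word to one it has already seen.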
NEXT STEPS
• Define NLP pipelines in the frontend
• Make MINER scalable
• Usable search interface
• Benchmark with GERBIL
Q & A
Dr. Martin Voigt
Managing Director
Ontos GmbH
D-04319 Leipzig
M: +49 178 40 222 58
E: martin.voigt@ontos.com
Twitter: m_a_r_t_i_n
12. September 201640

Weitere ähnliche Inhalte

Was ist angesagt?

II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 

Was ist angesagt? (20)

Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINEFelix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
 
The Evolution of Search and Big Data
The Evolution of Search and Big DataThe Evolution of Search and Big Data
The Evolution of Search and Big Data
 
Semantic E-Commerce - Use Cases in Enterprise Web Applications
Semantic E-Commerce - Use Cases in Enterprise Web ApplicationsSemantic E-Commerce - Use Cases in Enterprise Web Applications
Semantic E-Commerce - Use Cases in Enterprise Web Applications
 
Enterprise search
Enterprise searchEnterprise search
Enterprise search
 
Enterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchEnterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for Search
 
KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources
KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data SourcesKESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources
KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
II-SDV 2016 - QWAM Content Intelligence
II-SDV 2016 - QWAM Content IntelligenceII-SDV 2016 - QWAM Content Intelligence
II-SDV 2016 - QWAM Content Intelligence
 
Charles Ivie
Charles Ivie Charles Ivie
Charles Ivie
 
eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes
eccenca CorporateMemory - Semantically integrated Enterprise Data Lakeseccenca CorporateMemory - Semantically integrated Enterprise Data Lakes
eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes
 
Connected data meetup group - introduction & scope
Connected data meetup group - introduction & scopeConnected data meetup group - introduction & scope
Connected data meetup group - introduction & scope
 
Semantically integrated Enterprise Data Lakes and Co-Evolution of Public / Pr...
Semantically integrated Enterprise Data Lakes and Co-Evolution of Public / Pr...Semantically integrated Enterprise Data Lakes and Co-Evolution of Public / Pr...
Semantically integrated Enterprise Data Lakes and Co-Evolution of Public / Pr...
 
Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013Managed Metadata and Taxonomies in SharePoint 2013
Managed Metadata and Taxonomies in SharePoint 2013
 
Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016
 
Sebastian Hellmann
Sebastian HellmannSebastian Hellmann
Sebastian Hellmann
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
Lju Lazarevic
Lju LazarevicLju Lazarevic
Lju Lazarevic
 
Optimising Content Spending with Analytics
Optimising Content Spending with AnalyticsOptimising Content Spending with Analytics
Optimising Content Spending with Analytics
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and Reuse
 

Andere mochten auch

MongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous DataMongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous Data
MongoDB
 
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Cataldo Musto
 
Kostas Kastrantas | Business Opportunities with Linked Open Data
Kostas Kastrantas  | Business Opportunities with Linked Open DataKostas Kastrantas  | Business Opportunities with Linked Open Data
Kostas Kastrantas | Business Opportunities with Linked Open Data
semanticsconference
 

Andere mochten auch (20)

MongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous DataMongoDB Hadoop and Humongous Data
MongoDB Hadoop and Humongous Data
 
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...Combining Distributional Semantics and Entity Linking for Context-aware Conte...
Combining Distributional Semantics and Entity Linking for Context-aware Conte...
 
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
Entity linking meets Word Sense Disambiguation: a unified approach(TACL 2014)の紹介
 
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Michael Fuchs | How to compute semantic relationships between entities and fa...
Michael Fuchs | How to compute semantic relationships between entities and fa...Michael Fuchs | How to compute semantic relationships between entities and fa...
Michael Fuchs | How to compute semantic relationships between entities and fa...
 
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
 
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
 
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
 
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
 
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
 
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
 
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
 
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
 
Kostas Kastrantas | Business Opportunities with Linked Open Data
Kostas Kastrantas  | Business Opportunities with Linked Open DataKostas Kastrantas  | Business Opportunities with Linked Open Data
Kostas Kastrantas | Business Opportunities with Linked Open Data
 
Victor Charpenay | Standardized Semantics for an Open Web of Things
Victor Charpenay | Standardized Semantics for an Open Web of ThingsVictor Charpenay | Standardized Semantics for an Open Web of Things
Victor Charpenay | Standardized Semantics for an Open Web of Things
 
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
 
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
 
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
 

Ähnlich wie Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics

A Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at PearsonA Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at Pearson
MongoDB
 
Geeks bearing gifts: Unwrapping New Technologies, Version April12
Geeks bearing gifts: Unwrapping New Technologies, Version April12Geeks bearing gifts: Unwrapping New Technologies, Version April12
Geeks bearing gifts: Unwrapping New Technologies, Version April12
ayoungkin
 
MongoDB Partner Program Update - November 2013
MongoDB Partner Program Update - November 2013MongoDB Partner Program Update - November 2013
MongoDB Partner Program Update - November 2013
MongoDB
 

Ähnlich wie Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics (20)

Semantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceSemantic Technology in Publishing & Finance
Semantic Technology in Publishing & Finance
 
A Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at PearsonA Mobile-First, Cloud-First Stack at Pearson
A Mobile-First, Cloud-First Stack at Pearson
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
[CAS4687] going mobile with a hybrid cloud and on premises architecture rrs
[CAS4687] going mobile with a hybrid cloud and on premises architecture rrs[CAS4687] going mobile with a hybrid cloud and on premises architecture rrs
[CAS4687] going mobile with a hybrid cloud and on premises architecture rrs
 
Getting to grips with a Service Level Agreement and how SLA-Ready can help
Getting to grips with a Service Level Agreement and how SLA-Ready can helpGetting to grips with a Service Level Agreement and how SLA-Ready can help
Getting to grips with a Service Level Agreement and how SLA-Ready can help
 
OpenAIRE Open Innovation call: Next Generation Repositories
OpenAIRE Open Innovation call: Next Generation RepositoriesOpenAIRE Open Innovation call: Next Generation Repositories
OpenAIRE Open Innovation call: Next Generation Repositories
 
Are you SLA-Ready?
Are you SLA-Ready?Are you SLA-Ready?
Are you SLA-Ready?
 
ITAC 2016 Where Open Source Meets Audit Analytics
ITAC 2016 Where Open Source Meets Audit AnalyticsITAC 2016 Where Open Source Meets Audit Analytics
ITAC 2016 Where Open Source Meets Audit Analytics
 
Christian Opitz | Semantic E-Commerce - Use Cases in Enterprise Web Applications
Christian Opitz | Semantic E-Commerce - Use Cases in Enterprise Web ApplicationsChristian Opitz | Semantic E-Commerce - Use Cases in Enterprise Web Applications
Christian Opitz | Semantic E-Commerce - Use Cases in Enterprise Web Applications
 
OpenStack 2015 Marketing Plan
OpenStack 2015 Marketing PlanOpenStack 2015 Marketing Plan
OpenStack 2015 Marketing Plan
 
[CAS4687] Going Mobile with a Hybrid Cloud and On-Premises architecture
[CAS4687] Going Mobile with a Hybrid Cloud and On-Premises architecture[CAS4687] Going Mobile with a Hybrid Cloud and On-Premises architecture
[CAS4687] Going Mobile with a Hybrid Cloud and On-Premises architecture
 
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
 
Digital libraries with ict and innovation
Digital libraries with ict and innovationDigital libraries with ict and innovation
Digital libraries with ict and innovation
 
Geeks bearing gifts: Unwrapping New Technologies, Version April12
Geeks bearing gifts: Unwrapping New Technologies, Version April12Geeks bearing gifts: Unwrapping New Technologies, Version April12
Geeks bearing gifts: Unwrapping New Technologies, Version April12
 
Applying semantic web technologies at NXP Semiconductors
Applying semantic web technologies at NXP SemiconductorsApplying semantic web technologies at NXP Semiconductors
Applying semantic web technologies at NXP Semiconductors
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Session 4.2 unleash the triple: leveraging a corporate discovery interface....
Session 4.2   unleash the triple: leveraging a corporate discovery interface....Session 4.2   unleash the triple: leveraging a corporate discovery interface....
Session 4.2 unleash the triple: leveraging a corporate discovery interface....
 
MongoDB Partner Program Update - November 2013
MongoDB Partner Program Update - November 2013MongoDB Partner Program Update - November 2013
MongoDB Partner Program Update - November 2013
 
SESAM4 - A guide to semantics in the Linked Open Data cloud, Robert HP Engels...
SESAM4 - A guide to semantics in the Linked Open Data cloud, Robert HP Engels...SESAM4 - A guide to semantics in the Linked Open Data cloud, Robert HP Engels...
SESAM4 - A guide to semantics in the Linked Open Data cloud, Robert HP Engels...
 

Mehr von semanticsconference

Mehr von semanticsconference (20)

Linear books to open world adventure
Linear books to open world adventureLinear books to open world adventure
Linear books to open world adventure
 
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
Session 1.2   high-precision, context-free entity linking exploiting unambigu...Session 1.2   high-precision, context-free entity linking exploiting unambigu...
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
 
Session 4.3 semantic annotation for enhancing collaborative ideation
Session 4.3   semantic annotation for enhancing collaborative ideationSession 4.3   semantic annotation for enhancing collaborative ideation
Session 4.3 semantic annotation for enhancing collaborative ideation
 
Session 1.1 dalicc - data licenses clearance center
Session 1.1   dalicc - data licenses clearance centerSession 1.1   dalicc - data licenses clearance center
Session 1.1 dalicc - data licenses clearance center
 
Session 1.3 context information management across smart city knowledge domains
Session 1.3   context information management across smart city knowledge domainsSession 1.3   context information management across smart city knowledge domains
Session 1.3 context information management across smart city knowledge domains
 
Session 0.0 aussenac semanticsnl-pwebsem2017-v4
Session 0.0   aussenac semanticsnl-pwebsem2017-v4Session 0.0   aussenac semanticsnl-pwebsem2017-v4
Session 0.0 aussenac semanticsnl-pwebsem2017-v4
 
Session 0.0 keynote sandeep sacheti - final hi res
Session 0.0   keynote sandeep sacheti - final hi resSession 0.0   keynote sandeep sacheti - final hi res
Session 0.0 keynote sandeep sacheti - final hi res
 
Session 1.1 linked data applied: a field report from the netherlands
Session 1.1   linked data applied: a field report from the netherlandsSession 1.1   linked data applied: a field report from the netherlands
Session 1.1 linked data applied: a field report from the netherlands
 
Session 1.2 enrich your knowledge graphs: linked data integration with pool...
Session 1.2   enrich your knowledge graphs: linked data integration with pool...Session 1.2   enrich your knowledge graphs: linked data integration with pool...
Session 1.2 enrich your knowledge graphs: linked data integration with pool...
 
Session 1.4 connecting information from legislation and datasets using a ca...
Session 1.4   connecting information from legislation and datasets using a ca...Session 1.4   connecting information from legislation and datasets using a ca...
Session 1.4 connecting information from legislation and datasets using a ca...
 
Session 1.4 a distributed network of heritage information
Session 1.4   a distributed network of heritage informationSession 1.4   a distributed network of heritage information
Session 1.4 a distributed network of heritage information
 
Martin Voigt | Streaming-based Text Mining using Deep Learning and Semantics

  • 1. WWW.LEDS-PROJEKT.DE STREAMING-BASED TEXT MINING USING DEEP LEARNING AND SEMANTICS MARTIN VOIGT / ONTOS
  • 2. AGENDA (12 September 2016)
      • Use Cases: What is required from project partners and customers?
      • Market Situation: What are the others doing?
      • MINER: What does Ontos' approach to flexible text mining with Deep Learning look like?
      • WILDHORN: How to analyze texts from various sources and interlink them with existing knowledge graphs?
      • Lessons Learned: What is good, what is not, and what are the next steps?
  • 3. USE CASES: What is required from project partners and customers?
  • 4. USE CASES FROM LEDS
      • Content augmentation in e-commerce
      • Product descriptions are imported from various sources, usually as unstructured text
      • Much manual work, which is resource-consuming and expensive
      • Example: http://www.walmart.com/ip/The-Revenant-Blu-ray-Digital-HD/50129277
  • 5. USE CASES FROM CUSTOMERS
      • Brand or competitor monitoring
      • What are people or journalists writing about my brand?
      • What are my competitors doing? How is the market changing?
      • Monitoring of web sites, news feeds, social media channels, …
      • Link with reports, CRM, open data, …
  • 6. USE CASES FROM CUSTOMERS
      • Assist criminal investigations in the (dark) web
      • More and more organized crime on the web
      • Detect entities with their relations and additional facts
      • Example: "But Neumann and McKelvey eventually sold the business to their landlord Joshua Guttman."
      • [Diagram: the entities Neumann, McKelvey, Joshua Guttman and Business, linked by the relations sold_what and sold_to]
  • 7. MAIN REQUIREMENTS
      • Detection and classification of entities, relations and facts with good F1
      • Long, high-quality texts vs. short texts with bad / missing grammar, in multiple languages
      • Training not by linguists but by domain experts
      • Flexible adaptation to new domains and contexts
      • Many different data sources
  • 8. MARKET SITUATION: What are the others doing?
  • 9. MARKET
      • MarketsAndMarkets: NLP (all) for 2020
          • Market: $13.4 bn
          • CAGR: ~18.4%
      • ResearchMOZ: Text Analytics 2015–2019
          • CAGR: ~16.1%
      • Transparency Market Research: Text Analytics 2016–2024
          • CAGR: ~17.6%
      • Driving element: NLP-as-a-Service with a CAGR of ~20.2%
      • Sources:
          • http://www.marketsandmarkets.com/PressReleases/natural-language-processing-nlp.asp
          • http://www.researchmoz.us/global-natural-language-processing-market-2015-2019-report.html
          • http://www.transparencymarketresearch.com/pressrelease/global-text-analytics-market.htm
  • 10. MARKET
      • ResearchAndMarkets: 2016 Top 10 Information and Communication Technologies
          • No. 7: Natural Language Processing (NLP)
          • Increasing demand for automation drives growth
          • Funding on the rise: start-ups driving key innovations for NLP applications
          • High acceptance of technology enables the US to remain on top
      • Gartner's Top 10 Strategic Technology Trends for 2016
          • No. 4, Information of Everything: access to heterogeneous data sources; (semantic) linking of data items
          • No. 5, Advanced Machine Learning: Deep Learning and neural networks for NLP
      • Source: http://www.researchandmarkets.com/research/vd8fr9/2016_top_10
  • 11. GETTING AN OVERVIEW
      • Understood the domain by reviewing 30+ tools and frameworks
      • Checked scientific prototypes as well as open source tools
      • Investigated what other enterprises, companies, and startups are doing
      • Defined a set of 15+ criteria to compare and understand them
  • 12. GETTING AN OVERVIEW: Findings
      • Startups and huge enterprises usually stick to Deep Learning (DL) → high flexibility
      • Some existing companies are losing market share, e.g., Attensity
      • Market saturation in the U.S., only a few companies in Europe (5)
      • More and more NLP-as-a-Service
      • Only a few use RDF data, e.g., for output or disambiguation (8)
      • Only 8 tools could extract relations and facts
      • Only 9 open source tools with available benchmarks
  • 13. GETTING AN OVERVIEW
      • Rough comparison of the main concepts (based on http://www.deeplearningbook.org/contents/intro.html, Fig. 1.5)
      • Conclusions for Ontos
          • Use DL as the foundation for text mining tasks
          • Combine it with semantic technologies, e.g. for disambiguation
          • External view: NLP-as-a-Service
          • Internal view: NLP pipelines in order to combine different tools / technologies
  • 14. MINER: What does Ontos' approach to flexible text mining with Deep Learning look like?
  • 15. MINER – OVERVIEW
      • Large text corpora: large collection of texts in a given language or "dialect", e.g. social media / Twitter
      • Unsupervised model: language model generated from the corpora; serves as input for the supervised step
      • Supervised model: domain- or task-specific model for sequence labeling; trained on specific texts by domain experts
  • 16. MINER – CORPUS CREATION
      • 1 corpus per language / dialect required
      • The larger and more heterogeneous the better → get and model more contexts for words
      • Sources: Common Crawl, Wikipedia, news feeds, domain-specific texts, Twitter, …
  • 17. MINER – UNSUPERVISED MODEL
      • Language model that can be reused in multiple domains / contexts
      • Reuses the concept of word2vec to create word embeddings
      • [Diagram labels: Language Dictionary, Numeric Corpus, Word2Vec Model, Word2Vec Dictionary]
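The diagram labels on this slide suggest that the text corpus is mapped to integer ids via a dictionary before the word2vec model is trained. A minimal Python sketch of such a preprocessing step follows; the function names, the `<unk>` token and the `min_count` threshold are assumptions made for illustration and are not taken from MINER.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-dictionary words


def build_dictionary(sentences, min_count=1):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(word for sentence in sentences for word in sentence)
    dictionary = {UNK: 0}
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    for word, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_count:
            dictionary[word] = len(dictionary)
    return dictionary


def to_numeric_corpus(sentences, dictionary):
    """Replace every word by its id; unknown words map to the UNK id."""
    unk_id = dictionary[UNK]
    return [[dictionary.get(word, unk_id) for word in sentence] for sentence in sentences]


corpus = [
    ["leds", "is", "a", "research", "project"],
    ["miner", "is", "a", "text", "mining", "service"],
]
dictionary = build_dictionary(corpus)
numeric_corpus = to_numeric_corpus(corpus, dictionary)
print(numeric_corpus)
```

Words missing from the dictionary all collapse onto the single `<unk>` id in this sketch, which is one way to picture the out-of-dictionary limitation of a fixed word-level vocabulary.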
  • 18. MINER – SUPERVISED MODEL
      • Sequence labeling task
      • Uses a bi-directional LSTM (BLSTM)
      • Gated Recurrent Unit (GRU) cells as a simplified LSTM variant
      • Output layer: join and normalize, then classification
      • [Diagram: the textual input "LEDS is a research project" is mapped to word embeddings, passed through a forward GRU and a backward GRU, and fed to one output layer per token]
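The output layer is described only as "join and normalize" followed by classification. One plausible reading, sketched below in plain Python, is that the forward and backward states of a token are concatenated, scored per label with a linear layer, normalized with a softmax, and the most probable label is chosen. The labels, weights and state values are toy assumptions, not the trained MINER model.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_token(forward_state, backward_state, weights, labels):
    """Join both directions, score each label, normalize, pick the best."""
    joined = forward_state + backward_state  # list concatenation
    scores = [sum(w * x for w, x in zip(row, joined)) for row in weights]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities


# Toy values: 2-dimensional hidden states, hypothetical labels and weights.
labels = ["O", "ORG"]
weights = [
    [1.0, 0.0, 0.0, 0.0],  # row scoring label "O"
    [0.0, 1.0, 1.0, 0.0],  # row scoring label "ORG"
]
label, probabilities = label_token([0.1, 0.9], [0.8, 0.2], weights, labels)
print(label, [round(p, 3) for p in probabilities])
```

In a real network the two state vectors would come from the forward and backward GRU passes over the whole sentence, so each token's label can depend on context to its left and to its right.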
  • 19. MINER – IMPLEMENTATION
      • Torch framework: mature, efficient GPU support, good packages for neural networks, …
      • Scripting with the Lua language
      • Service implementation in Go because of its great integration with Lua
      • Integrated in the Ontos Eiger workbench
  • 20. MINER – IMPLEMENTATION
      • Create new "projects" ad hoc
      • Reuse pre-labeled corpora
      • Free definition of entity types as required for the domain
  • 21. MINER – IMPLEMENTATION
      • List and manage documents of project corpora
  • 22. MINER – IMPLEMENTATION
      • Double-check and annotate new labels
  • 23. MINER – IMPLEMENTATION: English News
      • ~400 English texts from CoNLL 2003 (http://www.cnts.ua.ac.be/conll2003/ner/) and news feeds
      • 5 entity types defined and manually annotated by 2 experts
      • 1st value: correct start of entity; 2nd value: correct end of entity

      Class       Person           Organization     Product          Location         Event
      F1          0.976 | 0.990    0.932 | 0.962    0.873 | 0.942    0.958 | 0.976    0.891 | 0.928
      Recall      0.974 | 0.988    0.936 | 0.953    0.845 | 0.926    0.965 | 0.980    0.870 | 0.913
      Precision   0.978 | 0.992    0.928 | 0.971    0.902 | 0.959    0.951 | 0.973    0.914 | 0.944

  • 24. MINER – IMPLEMENTATION: English Twitter
      • ~2400 tweets from Twitter NLP tools / A. Ritter 2011 (https://github.com/aritter/twitter_nlp)
      • 5 entity types defined, some are combined from the original source

      Class       Person           Organisation     Place            Product          Thing
      F1          0.797 | 0.883    0.548 | 0.141    0.725 | 0.669    0.195 | 0.207    0.409 | 0.551
      Recall      0.769 | 0.852    0.487 | 0.085    0.696 | 0.647    0.120 | 0.129    0.337 | 0.529
      Precision   0.828 | 0.918    0.627 | 0.414    0.756 | 0.693    0.516 | 0.514    0.520 | 0.574
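F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short Python check below recomputes two of the reported F1 values from the precision and recall figures on the two result slides (Person in English News and Organisation in English Twitter, first value each); the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the result slides, first value per class.
person_news_start = f1_score(0.978, 0.974)
organisation_twitter_start = f1_score(0.627, 0.487)

print(round(person_news_start, 3))
print(round(organisation_twitter_start, 3))
```

Rounded to three decimals, both results agree with the F1 values reported on the slides (0.976 and 0.548).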
  • 25. WILDHORN: How to analyze texts from various sources and interlink them with existing knowledge graphs?
  • 26. NLP PIPELINES
      • Problem: How to efficiently connect various tools in the data stream?
      • Sources: RSS feed reader, crawler, Twitter, FTP servers, …
      • Analytics: 1 MINER instance per language; disambiguate and link to knowledge graphs or taxonomies
      • Sinks: RDF stores, Apache Solr, HDFS, Apache Cassandra, …
      • Image: http://www.computernewsme.com/wp-content/uploads/2011/06/Cable-clutter.jpg
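The source → analytics → sink structure with one analyzer per language can be pictured with a small in-memory Python sketch. Everything here is a stand-in chosen for illustration: plain lists and dictionaries replace the message broker and the stores, and the "analysis" is a trivial capitalization heuristic rather than MINER.

```python
from collections import defaultdict


def run_pipeline(documents, analyzers, sinks):
    """Route each document to the analyzer for its language, then to every sink."""
    skipped = []
    for document in documents:
        analyze = analyzers.get(document["lang"])
        if analyze is None:
            skipped.append(document)  # no analyzer instance for this language
            continue
        result = analyze(document)
        for sink in sinks:
            sink(result)
    return skipped


def make_analyzer(language):
    """Stand-in for one text mining instance per language (hypothetical)."""
    def analyze(document):
        # Placeholder "analysis": treat capitalized tokens as entity candidates.
        entities = [t for t in document["text"].split() if t[:1].isupper()]
        return {"lang": language, "text": document["text"], "entities": entities}
    return analyze


store = defaultdict(list)  # in-memory stand-in for an RDF store or search index
analyzers = {"en": make_analyzer("en"), "de": make_analyzer("de")}
sinks = [lambda result: store[result["lang"]].append(result)]

documents = [
    {"lang": "en", "text": "Ontos builds text mining tools in Leipzig"},
    {"lang": "fr", "text": "Bonjour Paris"},
]
skipped = run_pipeline(documents, analyzers, sinks)
print(store["en"][0]["entities"], len(skipped))
```

Keeping sources, analyzers and sinks as separate, interchangeable pieces is the property the slide asks for; a message broker adds buffering and decoupling between those stages.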
  • 27. NLP PIPELINES
      • Best practices by LinkedIn: http://www.confluent.io/blog/stream-data-platform-1/
      • Apache Kafka!
  • 28. WILDHORN NLP PIPELINE
      • [Architecture diagram]
      • Technical foundation: Apache Kafka / Confluent (message broker, schema registry), Apache Spark, REST API
      • Inputs: Twitter Extractor, RSS Feed Crawler, REST Proxy, Log File Extractor, Web / Darknet Crawler
      • Analytics: Taxo Extractor, MINER and AGDISTIS (named entity extraction & disambiguation), Document Classification, Sentiment Analysis
      • Outputs: ENS RDF Serializer (QUAD), NoSQL Serializer (Apache Cassandra), Solr Serializer (Apache Solr)
  • 29. WILDHORN NLP PIPELINE
      • Implementation
          • Apache Spark 1.6 & Confluent 3.0
          • Use of Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) to manage Apache Spark applications
          • Ontos Eiger backend ready
      • First tests
          • Simple Mesos cluster with 3 servers
          • ~1000 messages / sec with 1 broker without MINER analytics: no data problem
          • Problem: 1 MINER instance currently scales only up to 100 messages / sec
      • [Diagram: a Kafka pipeline consists of parts, each with producers, consumers, a topic and a schema]
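The two throughput figures give a rough sense of the scaling gap. The Python sketch below estimates how many parallel MINER instances would be needed to keep up with one broker, under the strong assumption (not stated on the slide) that throughput scales linearly with the number of instances.

```python
import math


def instances_needed(incoming_rate, per_instance_rate):
    """Smallest number of parallel instances that keeps up with the stream."""
    if per_instance_rate <= 0:
        raise ValueError("per_instance_rate must be positive")
    return math.ceil(incoming_rate / per_instance_rate)


broker_rate = 1000  # messages / sec handled by one broker (from the slide)
miner_rate = 100    # messages / sec handled by one MINER instance (from the slide)
print(instances_needed(broker_rate, miner_rate))
```

With the reported numbers this comes to ten instances per broker, which is the gap that making MINER scalable would need to close.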
  • 31. LESSONS LEARNED: What is good, what is not, and what are the next steps?
  • 32. CURRENT STATUS
      • Detection and classification of entities, relations and facts with good F1
      • Long, high-quality texts vs. short texts with bad / missing grammar, in multiple languages
      • Training not by linguists but by domain experts
      • Flexible adaptation to new domains and contexts
      • Many different data sources
  • 33. LESSONS LEARNED
      • Neural networks / deep learning provide great concepts & frameworks for flexible, high-quality NLP tasks
      • Apache Kafka / Confluent Platform in combination with Apache Spark is a good foundation for data streaming and processing
      • Disambiguation and linking of entities to taxonomies and knowledge graphs via Semantic Web technologies is a core contribution for data integration
      • Hard to find employees with Deep Learning skills
  • 34. NEXT STEPS
      • Try letter-trigram word hashing to overcome the out-of-dictionary problem of the word2vec algorithm
      • Relation and fact extraction in MINER
      • Example: "Adam Neumann is the CEO of super-hot office rental company WeWork, the most valuable startup in New York City."
      • [Annotation diagram with a legend for entities, entity types, relations and relation types; labels include Adam Neumann (Person), Adam (First Name, firstName), super-hot (Sentiment, positiveSentiment) and WeWork (Organization)]
      • Reference: https://arxiv.org/abs/1608.06757
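Letter-trigram word hashing represents a word by its overlapping three-letter substrings, so a word that never appeared in the training dictionary still shares features with known words. A minimal Python sketch follows; the `#` boundary marker and the function names are assumptions for illustration.

```python
def letter_trigrams(word, boundary="#"):
    """Split a word into overlapping letter trigrams, with boundary markers."""
    marked = boundary + word.lower() + boundary
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def shared_trigrams(known_word, unseen_word):
    """Trigrams that an unseen word has in common with a known word."""
    return sorted(set(letter_trigrams(known_word)) & set(letter_trigrams(unseen_word)))


print(letter_trigrams("WeWork"))
print(shared_trigrams("streaming", "streamed"))
```

Because the input features are trigrams rather than whole words, the vocabulary stays small and fixed, and misspellings or new word forms map onto mostly familiar features.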
  • 35. NEXT STEPS
      • Define NLP pipelines in the frontend
      • Make MINER scalable
      • Usable search interface
      • Benchmark with GERBIL
  • 36. Q & A
      • Dr. Martin Voigt, Managing Director, Ontos GmbH, D-04319 Leipzig
      • M: +49 178 40 222 58
      • E: martin.voigt@ontos.com
      • Twitter: m_a_r_t_i_n