SALES RELAUNCH F&Q SESSION
Multi-lingual data processing
The CIS and Georgia
Olga Rink, director general
Content
Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
• Business environment
• Main stages of processing multi-lingual business data
o Naming convention
o Transliteration
o Matching
• Seeding and verifying objects in a media coverage
Official languages, population (mn) and Russian as a second language (est.)
Multi-lingual environment
Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population, est.
Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100%
Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%
Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%
Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100%
Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%
Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100%
Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100%
Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90%
Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%
Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%
Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100%
Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%
• The Constitution of Dagestan defines "Russian and the languages of the peoples of Dagestan" as the state languages
• The bulk of newly registered businesses is available in Cyrillic or Latin script
• For Slavic languages we use the ISO 9:1995 standard with one exception: Latin characters with diacritics are replaced by combinations of plain Latin characters. Example: Ч becomes Ch (without diacritic) instead of Č (with diacritic)
• ISO 9985 is used for Armenian
• ISO 9984 is used for Georgian
• ООО «Ъ» (trade style: OOO TVERDY ZNAK; the transliterated name is OOO “”, so there is no way to find the company by its original name)
• Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates
• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the charter is used as the primary name, and the indication of the legal form in the name (required by law) is placed at the end, after a comma.
• The second name is the transliterated full legal name.
• The trade style contains the official name in English/Latin script or trade marks
• We use rule-based and machine learning approaches in collecting data, identifying objects, developing credit scores and digesting media coverage
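The transliteration and naming rules on this slide can be illustrated with a minimal sketch. It is not the Interfax - Dun & Bradstreet implementation. Only the Ч to Ch replacement is taken from the slide; the other letter combinations, the dropping of the hard and soft signs (inferred from the 3DNYUS and «Ъ» examples), the list of legal forms and the function names are assumptions made for illustration.

```python
# Letters whose ISO 9:1995 equivalents are plain Latin letters.
PLAIN = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Е": "E", "З": "Z",
    "И": "I", "Й": "J", "К": "K", "Л": "L", "М": "M", "Н": "N", "О": "O",
    "П": "P", "Р": "R", "С": "S", "Т": "T", "У": "U", "Ф": "F", "Х": "H",
    "Ц": "C", "Ы": "Y",
}
# ISO 9:1995 gives these letters diacritics (for example Ч -> Č). Here they
# are replaced by plain combinations. Only "Ch" comes from the slide; the
# other spellings are assumptions made for this sketch.
COMBINATIONS = {
    "Ё": "Yo", "Ж": "Zh", "Ч": "Ch", "Ш": "Sh", "Щ": "Shch",
    "Э": "E", "Ю": "Yu", "Я": "Ya",
}
# Assumption: hard and soft signs are dropped, as the slide examples suggest.
SIGNS = {"Ъ": "", "Ь": ""}
TABLE = {**PLAIN, **COMBINATIONS, **SIGNS}

# Hypothetical short list of legal forms, for illustration only.
LEGAL_FORMS = ("ООО", "ПАО", "ЗАО", "ОАО", "АО")
QUOTES = "«»\"“”"


def transliterate(text: str) -> str:
    """Transliterate Cyrillic letters; other characters pass through."""
    out = []
    for ch in text:
        latin = TABLE.get(ch.upper())
        if latin is None:
            out.append(ch)  # digits, spaces, Latin letters, punctuation
        elif ch.islower():
            out.append(latin.lower())
        else:
            out.append(latin)
    return "".join(out)


def normalized_name(charter_name: str) -> str:
    """Build the primary name: transliterated brief name, then legal form."""
    text = charter_name.strip()
    for form in LEGAL_FORMS:
        if text.upper().startswith(form + " "):
            body = text[len(form):].strip().strip(QUOTES)
            return f"{transliterate(body).upper()}, {transliterate(form)}"
    # No known legal form found: transliterate the whole name.
    return transliterate(text.strip(QUOTES)).upper()


print(transliterate("Чехов"))           # Chehov
print(normalized_name("ООО «3ДНЬЮС»"))  # 3DNYUS, OOO
```

A name made up only of dropped characters, such as ООО «Ъ», comes out with an empty brief name under these assumptions, which is the search problem the slide describes.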
Natural Language Processing and Machine Learning
The SCAN engine is leveraging vast amounts of text data to enable the next generation of Interfax data products
Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train,
and deploy credit and reputation risk models with minimal effort
• Tagging documents
• Classifying documents by text type (media release, forecast, feature, etc.)
Detecting and Disambiguating Named Entities
A Support Vector Machine (SVM) or a Bayes classifier is used, depending on configuration
• SVM represents a text as a vector to compare with a pattern (prototype); the closeness defines the type
• Bayes' rule is applicable when you rely on pre-determined assumptions (a range of known “symptoms”) while calculating probabilities
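The Bayes option can be made concrete with a small naive Bayes text-type classifier. This is a sketch under assumptions, not the SCAN engine's code: the training sentences are invented, the function names are hypothetical, and add-one smoothing is one common choice rather than something the slide specifies. The words seen in training play the role of the known “symptoms”.

```python
import math
from collections import Counter

# Tiny hypothetical training set: (text type, tokenised document).
TRAINING = [
    ("media-release", "company announces new product launch".split()),
    ("media-release", "company announces partnership agreement".split()),
    ("forecast", "analysts expect revenue growth next year".split()),
    ("forecast", "analysts forecast lower growth next quarter".split()),
]


def train(samples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for label, tokens in samples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("analysts expect growth".split(), model))  # forecast
```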
Rule-based fact extraction and sentiment analysis
At the initial phase, for seeding named persons
• Mostly a rule-based approach
• Context analysis and statistics for entity disambiguation
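As an illustration of context analysis for entity disambiguation, the sketch below picks the candidate whose known context words overlap most with the words around a mention. The entity names, the context vocabularies and the `disambiguate` function are hypothetical; the slides do not describe the statistics actually used.

```python
# Hypothetical candidates sharing one surface form, each with words that
# typically surround it in already processed texts.
CANDIDATES = {
    "Apollo (bank)": {"loan", "deposit", "branch", "credit"},
    "Apollo (retailer)": {"store", "discount", "shopping", "brand"},
}


def disambiguate(context_tokens, candidates):
    """Pick the candidate whose known context overlaps most with the mention.

    Returns None when no candidate shares any word with the context, so the
    mention can be left for manual verification.
    """
    context = {token.lower() for token in context_tokens}
    scores = {name: len(words & context) for name, words in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


mention = "Apollo opened a new branch and cut its loan rates".split()
print(disambiguate(mention, CANDIDATES))  # Apollo (bank)
```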
Refining Named Entity Detection by learning from a semi-automatically labelled corpus
• Support Vector Machine (SVM)
• A neural network built on the existing rule-based structure is being considered for the future
An intellectual WOW effect, or what only SCAN can do: a step towards “verifying” media coverage
Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn are identified with Spark
2 mn persons were generated (seeded), of whom 75 thousand have been verified
300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
13 thousand trade marks (“Trade style”)
24 thousand sources in Russian
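The counts above translate into the following shares. The snippet only restates the slide's figures as percentages; the variable names and the `share` helper are illustrative.

```python
# Figures quoted on the slide (absolute counts).
companies_generated = 3_000_000
companies_verified = 22_000
companies_in_spark = 500_000
persons_generated = 2_000_000
persons_verified = 75_000


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


print(share(companies_verified, companies_generated))  # 0.73
print(share(companies_in_spark, companies_generated))  # 16.67
print(share(persons_verified, persons_generated))      # 3.75
```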
Thank You
Interfax – Dun & Bradstreet
www.dnb.ru

Processing multi-lingual business data
