WHAT'S HAPPENING
TO MY CLIENTS*?
Extracting value from news articles
* or partners, competitors, suppliers, etc…
Ugo Scaiella @ Speck&Tech
7 Apr 2022
ABOUT ME
★ Master + 3 yrs research on IR/ML @ UniPI
★ 2 years @ ION Trading as SWE
★ From 2013 @ spaziodati.eu
★ Led DandelionAPI dev, now SWE Manager
★ Strongly-typed languages lover
★ In troubled relationship with Guido
Wagyu addicted
Wannabe grill master
Father of 4
Me, working on Neural Alcoholic Networks
ATOKA.IO Info about
6M companies in Italy
GOOGLE
NEWS
WHAT IF MY
PORTFOLIO IS
MADE OF
THOUSANDS OF
SMEs?
COMPANYTXT
Entity linking vs Atoka 6M
companies DB
SEDANO
Our news processing
pipeline
ATOKA NEWS
Search engine and news
monitoring for end users
LET'S TRY TO SOLVE THIS
THIS IS COMPANYTXT
… Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka …
(English: "… Michele Barbera presents SpazioDati's new product, Atoka …")
MENTION IDENTIFICATION → CANDIDATE EXTRACTION → DISAMBIGUATION
SPAZIODATI: 2 entities
★ SpazioDati (Trento)
★ Spazio Dati (Sassuolo)
MICHELE BARBERA: 95 entities
★ …
★ CEO @SpazioDati
★ …
MENTION IDENTIFICATION
CANNOT BE SYNTAX BASED ONLY
CAVIT CANTINA VITICOLTORI CONSORZIO CANTINE SOCIALI DEL TRENTINO SOCIETA'
COOPERATIVA PIU' BREVEMENTE CAVIT S.C. PER FINALITA' PRODUTTIVE POTRA' OPERARE
ANCHE COME CANTINA PRODUTTORI, VITICOLTORI TRENTINI, VINTRENTO, TRENTINA VINI,
VILLALTA, ACCADEMIA DELLO SPUMANTE TRENTINO, CAVIT, C.V., C.C.S.T., RA.VIN
CANNOT BE REGEX ONLY
PANINI
GRUPPO DISTRIBUZIONE
DIMENSION
AZIENDA TRASPORTI
CANDIDATE SELECTION
FUZZY MATCHING
yeah, but "authoritative sources" ⇏ "good data":
"V.N.P. - VALSA NUOVA PERLINO S.P.A.,"SIGLABILE:"V.N.P. S.P.A" "
VALSA S.P.A.","PERLINO S.P.A.","PERLINO OPTIMA S.P.A.","V.A.T. S.P.A.",
"P.A.T. S.P.A.", "P.O. S.P.A.","SCANAVINO S.P.A.","FILIPETTI S.P.A.","CA' V
ERGANA S.P.A.","TERRE DEI SESI S.P.A.","SANDILIANO S.P.A.","TERRE
DEI SOLARI S.P
DEAL WITH ACRONYMS
SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD S.N.A.M. SOCIET A' A
RESPONSABILITA LIMITATA
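The candidate selection step above can be sketched in Python as follows. This is a minimal illustration, not the CompanyTXT implementation: the legal-form and stopword lists, the 0.85 threshold, the acronym-prefix rule and the function names `normalize`, `acronym` and `candidates` are all assumptions made for the example. It uses only `difflib` from the standard library.

```python
import re
from difflib import SequenceMatcher

# Illustrative, non-exhaustive lists (assumptions, not the registry's rules).
LEGAL_FORMS = {"spa", "srl", "snc", "sas", "sc"}
STOPWORDS = {"di", "del", "della", "dei", "e"}


def normalize(name: str) -> str:
    """Lowercase, join dotted acronyms ("S.P.A." -> "spa"), drop punctuation and legal forms."""
    text = re.sub(r"[^\w\s]", " ", name.replace(".", "").lower())
    return " ".join(t for t in text.split() if t not in LEGAL_FORMS)


def acronym(name: str) -> str:
    """Initials of the normalized name, skipping stopwords."""
    return "".join(t[0] for t in normalize(name).split() if t not in STOPWORDS)


def candidates(mention: str, names: list[str], threshold: float = 0.85):
    """Return (name, score) pairs whose similarity to the mention reaches the threshold."""
    m = normalize(mention)
    found = []
    for name in names:
        score = SequenceMatcher(None, m, normalize(name)).ratio()
        # Assumption: a mention that is a prefix of the name's initials counts as a match.
        if len(m) >= 3 and acronym(name).startswith(m):
            score = max(score, threshold)
        if score >= threshold:
            found.append((name, round(score, 2)))
    return sorted(found, key=lambda pair: pair[1], reverse=True)
```

Calling `candidates("SNAM", names)` on a list containing the long "SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD" name would return that company through the acronym rule, even though the plain string similarity is low. A linear scan like this would not scale to 6M names; it only shows the matching logic.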
NOT JOKING
GE.A. S.R.L. (LA LETTERA E DELLA PAROLA
GE.A. DEVE INTENDERSI SCRITTA CON
CARATTERI MINUSCOLI)
U GARIXAN DI ZUNINO ZULEIKA E CAMILLO
LORENZO - S.N.C. ***(LA LETTERA "A" DELLA
PA ROLA GARIXAN E' DA INTENDERSI
ACCENTATA CON ACCENTO ACUTO)***
SOCIETA' COOPERATIVA DI CONSUMO DI
GNOCCA
DISAMBIGUATION
Ideally, let's exploit
everything we know about
the company:
★ Locations
★ Sectors
★ Related companies
★ Key people
Hard part is to mix
everything together
[Figure: graph linking SpazioDati to related entities: Pisa, Trento, Gabriele Antonelli, Michele Barbera, Cerved Group, Big Data, Business Intelligence, Lead generation]
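One simple way to "mix everything together" is a weighted overlap between what is known about each candidate and the entities found in the article. The sketch below is only a stand-in for illustration: the `Company` fields, the hand-picked `WEIGHTS` and the sample data are assumptions, and a trained model would normally replace the fixed weights.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    name: str
    locations: set = field(default_factory=set)
    sectors: set = field(default_factory=set)
    people: set = field(default_factory=set)
    related: set = field(default_factory=set)


# Hand-picked weights (assumption): a key person is stronger evidence than a city.
WEIGHTS = {"locations": 1.0, "sectors": 1.0, "people": 2.0, "related": 1.5}


def score(company: Company, context: set) -> float:
    """Weighted count of company attributes that also appear in the article context."""
    context = {c.lower() for c in context}
    total = 0.0
    for attr, weight in WEIGHTS.items():
        values = {v.lower() for v in getattr(company, attr)}
        total += weight * len(values & context)
    return total


def disambiguate(candidates: list, context: set):
    """Pick the best-scoring candidate, or None when nothing in the context supports any."""
    scored = [(score(c, context), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return best if best_score > 0 else None
```

With a context such as `{"Michele Barbera", "Atoka"}`, a candidate that lists Michele Barbera among its key people wins over a homonymous company that only has a location. Returning `None` when no evidence matches avoids forcing a link on weak mentions.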
IMPLEMENTATION
AREA | CURRENT | WORKING ON
MENTIONS | Pattern matching on pre-computed mentions + NER | Fine-tuned NER
CANDIDATES | Pre-computed | Fuzzy matching on names
DISAMBIGUATION | Only structured links (people, companies) | Also add contextual information about locations and activity
NLP PIPELINE | Separated steps | Fully integrated pipeline
LANGUAGES | Only Italian | Major EU languages
RESULTS
★ Only limited companies (Società di Capitali)
★ Not bad, but not WOW!
★ Huge room for improvement
TECH STACK
★ Current: mainly Java
★ NER and Disambiguator:
simple and fast random
forests
★ Ad hoc, optimized
data structures
★ It's getting old 🥺
★ Now working with: BERT,
TensorFlow and NNs
★ Still not taking into account
timing 😬
TAKEAWAYS
★ Never use Apple M1 for these jobs… maybe in a couple of years
★ Language models are REALLY effective, it's not just the hype
… but, if you really want to reach a real-world level, you have to adapt them
★ NLP building blocks (POS tagger, encoders, etc…) are now a commodity,
… but availability of good training data for e2e task is still THE problem
★ You need GPUs for those models only when you have a lot of data
… and in that case, GPUs really make the difference
a sweating g3.16xlarge
THIS IS SEDANO
MAIN PIPELINE
★ Cleansing: dirty work
★ Deduplication: remove duplicate articles
★ Annotation: company annotations
★ Classifier A: business events
★ Classifier B: locations
CLEANSING
Il titolo ? arrivato a perdere
oltre il 3 per cento. n n
Tronchetti: <Apertura in
calo? Vediamo prossimi
mesi,l'azienda ? solida>
di Andrea Fontana
seguici su Twitter
★ Data cleansing is like cleaning
sewer pipes: someone has
to do it
★ Web News is Web Data, so
HORRIBLE DATA
★ NLP tools are significantly
affected by bad input texts
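A few of the problems visible in the sample above can be handled with simple rules. The sketch below is a hypothetical illustration, not Sedano's cleansing code: the footer pattern, the handling of literal `\n` residue and the comma fix are assumptions chosen to match that sample. Characters already lost at the source (the `?` in place of accented letters) cannot be recovered this way.

```python
import re
import unicodedata

# Site footers to drop (assumption: one illustrative pattern).
FOOTER_PATTERNS = [re.compile(r"^\s*seguici su\b.*$", re.IGNORECASE)]


def cleanse(raw: str) -> str:
    """Normalize Unicode, drop footer lines and repair spacing in a scraped article."""
    text = unicodedata.normalize("NFC", raw)
    # Literal backslash-n sequences left by a faulty scraper become real newlines.
    text = text.replace("\\n", "\n")
    lines = [line.strip() for line in text.splitlines()]
    lines = [ln for ln in lines if ln and not any(p.match(ln) for p in FOOTER_PATTERNS)]
    text = re.sub(r"\s+", " ", " ".join(lines))
    # Add the missing space after a comma followed by a letter, leaving decimals like "3,5" alone.
    text = re.sub(r",(?=[^\W\d])", ", ", text)
    return text.strip()
```

Rules like these are cheap to run before any NLP step, but each news source tends to need its own patterns.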
DEDUPLICATION
★ Same article, different newspapers
★ Stopword removal + stemming +
discarding shortest phrases +
locality-sensitive hashing
★ Streaming approach, using Redis for
caching
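The deduplication recipe above can be sketched with MinHash signatures and banded locality-sensitive hashing. This is a simplified illustration under stated assumptions: stemming and short-phrase filtering are left out, the stopword list is a tiny sample, an in-memory dictionary stands in for Redis, and the names `shingles`, `minhash` and `Deduplicator` are invented for the example.

```python
import hashlib
import re
from collections import defaultdict

STOPWORDS = {"il", "la", "di", "e", "a", "un", "the", "of", "and"}  # tiny illustrative list


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word windows over the lowercased text with stopwords removed."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items: set, num_perm: int = 64) -> list:
    """One minimum hash value per seed; similar sets share many positions."""
    def h(seed: int, item: str) -> int:
        digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return [min(h(seed, item) for item in items) for seed in range(num_perm)]


class Deduplicator:
    """Streaming near-duplicate detector: articles are checked one at a time."""

    def __init__(self, bands: int = 16, rows: int = 4):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(list)  # stand-in for a shared cache such as Redis

    def seen_before(self, doc_id: str, text: str):
        """Return the id of an earlier near-duplicate, or None, then index this article."""
        sig = minhash(shingles(text), self.bands * self.rows)
        keys = [(b, tuple(sig[b * self.rows:(b + 1) * self.rows])) for b in range(self.bands)]
        duplicate_of = next((self.buckets[k][0] for k in keys if self.buckets[k]), None)
        for key in keys:
            self.buckets[key].append(doc_id)
        return duplicate_of
```

Two articles are flagged as duplicates when at least one band of their signatures is identical, so more bands with fewer rows makes the check more permissive. Each article is compared only against the buckets it falls into, which is what makes a streaming setup practical.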
BUSINESS EVENTS
★ Management changes
★ Economic results
★ Launch of new products
★ Failures
★ Accidents
★ Strikes
★ …
AND MANY OTHERS
★ E-S-G themes
✴ Environmental
✴ Social
✴ Governance
★ Sentiment
★ Locations: provinces
TECH STACK
★ Mainly Python
★ Celery for distributed tasks
★ Django + Postgres for API
★ gRPC + Go for core
clustering algorithm
★ Classifiers: scikit-learn and TensorFlow
★ S3 for raw data
★ ES for news articles storage
★ Redis for caching
★ K8s cluster on AWS
★ Sentry, ELK, Prometheus,
Grafana
CHALLENGES
★ Scalability
★ Streaming
★ Operational burden
★ Idempotency
THIS IS ATOKA NEWS
LAST BUT NOT LEAST…
SPAZIODATI = Startup culture (team of 40, several in this room, talk to us!)
within one of the largest fintech companies in the world: ION Group (>15k
employees)
https://spaziodati.eu/en/jobs/
But if you’re smart we welcome any role, even if not listed :-)
Yes, vegetarians too!
CREDITS: This presentation template was created by Slidesgo,
including icons by Flaticon and infographics & images by Freepik
THANKS!
QUESTIONS?

What's happening to my clients? Extracting value from news articles

  • 1. WHAT'S HAPPENING TO MY CLIENTS*? Extracting value from news articles (* or partners, competitors, suppliers, etc…). Ugo Scaiella @ Speck&Tech, 7 Apr 2022
  • 2. ABOUT ME ★ Master + 3 yrs research on IR/ML @ UniPI ★ 2 years @ ION Trading as SWE ★ From 2013 @ spaziodati.eu ★ Led DandelionAPI dev, now SWE Manager ★ Strongly-typed languages lover ★ In troubled relationship with Guido ★ Wagyu addicted ★ Wannabe grill master ★ Father of 4 (image caption: Me, working on Neural Alcoholic Networks)
  • 3. ATOKA.IO Info about 6M companies in Italy
  • 9. WHAT IF MY PORTFOLIO IS MADE OF THOUSANDS OF SMEs?
  • 10. LET'S TRY TO SOLVE THIS ★ COMPANYTXT: entity linking vs Atoka 6M companies DB ★ SEDANO: our news processing pipeline ★ ATOKA NEWS: search engine and news monitoring for end users
  • 12. THIS IS COMPANYTXT ★ Three steps: MENTION IDENTIFICATION, CANDIDATE EXTRACTION, DISAMBIGUATION ★ Example text: "… Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka …" ("… Michele Barbera presents SpazioDati's new product, Atoka …") ★ SPAZIODATI: 2 entities: SpazioDati (Trento), Spazio Dati (Sassuolo) ★ MICHELE BARBERA: 95 entities: …, CEO @SpazioDati, …
  • 13. MENTION IDENTIFICATION ★ CANNOT BE SYNTAX BASED ONLY: "CAVIT CANTINA VITICOLTORI CONSORZIO CANTINE SOCIALI DEL TRENTINO SOCIETA' COOPERATIVA PIU' BREVEMENTE CAVIT S.C. PER FINALITA' PRODUTTIVE POTRA' OPERARE ANCHE COME CANTINA PRODUTTORI, VITICOLTORI TRENTINI, VINTRENTO, TRENTINA VINI, VILLALTA, ACCADEMIA DELLO SPUMANTE TRENTINO, CAVIT, C.V., C.C.S.T., RA.VIN" ★ CANNOT BE REGEX ONLY: PANINI; GRUPPO DISTRIBUZIONE; DIMENSION; AZIENDA TRASPORTI
  • 14. CANDIDATE SELECTION ★ FUZZY MATCHING: yeah, but "authoritative sources" ⇏ "good data": "V.N.P. - VALSA NUOVA PERLINO S.P.A.,"SIGLABILE:"V.N.P. S.P.A" " VALSA S.P.A.","PERLINO S.P.A.","PERLINO OPTIMA S.P.A.","V.A.T. S.P.A.", "P.A.T. S.P.A.", "P.O. S.P.A.","SCANAVINO S.P.A.","FILIPETTI S.P.A.","CA' VERGANA S.P.A.","TERRE DEI SESI S.P.A.","SANDILIANO S.P.A.","TERRE DEI SOLARI S.P ★ DEAL WITH ACRONYMS: SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD S.N.A.M. SOCIETA' A RESPONSABILITA LIMITATA
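The candidate selection idea on this slide (fuzzy matching on names, plus acronym handling) can be sketched as follows. This is only an illustration under stated assumptions, not CompanyTxt's actual code: the legal-form list, the `0.85` threshold, the prefix-based acronym rule, and the helper names `normalize`, `initials`, `is_acronym_of` and `candidates` are all hypothetical, and Python's standard `difflib` stands in for whatever matcher the real system uses.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny illustrative list of legal forms, not the real one.
LEGAL_FORMS = re.compile(
    r"SOCIETA' A RESPONSABILITA LIMITATA|SOCIETA' COOPERATIVA"
    r"|S\.P\.A\.?|S\.R\.L\.?|S\.N\.C\.?"
)

# Sample registry names; two of them are taken from the slide above.
NAMES = [
    "SPAZIODATI S.R.L.",
    "PERLINO S.P.A.",
    "SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD S.N.A.M. "
    "SOCIETA' A RESPONSABILITA LIMITATA",
]


def normalize(name):
    """Uppercase, drop legal forms and punctuation, collapse whitespace."""
    text = LEGAL_FORMS.sub(" ", name.upper())
    text = re.sub(r"[^A-Z0-9 ]", "", text)
    return " ".join(text.split())


def initials(name):
    """First letter of every word that survives normalization."""
    return "".join(word[0] for word in normalize(name).split())


def is_acronym_of(mention, name):
    """True when the mention, stripped of dots, is a prefix of the initials."""
    compact = re.sub(r"[^A-Z0-9]", "", mention.upper())
    return len(compact) >= 2 and initials(name).startswith(compact)


def candidates(mention, names, threshold=0.85):
    """Return (score, name) pairs that pass the threshold, best first."""
    target = normalize(mention)
    scored = []
    for name in names:
        score = SequenceMatcher(None, target, normalize(name)).ratio()
        if is_acronym_of(mention, name):
            score = max(score, threshold)  # acronym hits always qualify
        if score >= threshold:
            scored.append((round(score, 3), name))
    return sorted(scored, reverse=True)


spazio = candidates("Spazio Dati", NAMES)
snam = candidates("S.N.A.M.", NAMES)
```

Here `spazio` contains only the SpazioDati entry and `snam` only the long S.N.A.M. name. A linear scan like this would not scale to 6M companies; it is meant to show the normalization and scoring steps, not the indexing.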
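  • 15. NOT JOKING (real registered names) ★ GE.A. S.R.L. (LA LETTERA E DELLA PAROLA GE.A. DEVE INTENDERSI SCRITTA CON CARATTERI MINUSCOLI) [the parenthesis reads: "the letter E of the word GE.A. must be understood as written in lowercase"] ★ U GARIXAN DI ZUNINO ZULEIKA E CAMILLO LORENZO - S.N.C. ***(LA LETTERA "A" DELLA PAROLA GARIXAN E' DA INTENDERSI ACCENTATA CON ACCENTO ACUTO)*** [the parenthesis reads: "the letter A of the word GARIXAN is to be understood as carrying an acute accent"] ★ SOCIETA' COOPERATIVA DI CONSUMO DI GNOCCA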
  • 16. DISAMBIGUATION ★ Ideally, let's exploit everything we know about the company: Locations, Sectors, Related companies, Key people ★ The hard part is to mix everything together ★ Example: Pisa, Trento (locations); Gabriele Antonelli, Michele Barbera (key people); SpazioDati, Cerved Group (related companies); Big Data, Business Intelligence, Lead generation (sectors)
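One simple way to "mix everything together" is a weighted count of how many known attributes of each candidate appear in the article. The sketch below is an assumption-laden illustration: the `Company` record, the `WEIGHTS` values and the substring matching are hypothetical, and the deck itself states that the production disambiguator is a random forest, which this hand-weighted score does not reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    name: str
    locations: set = field(default_factory=set)
    sectors: set = field(default_factory=set)
    related: set = field(default_factory=set)
    people: set = field(default_factory=set)


# Assumption: hand-picked weights; a trained model would learn these instead.
WEIGHTS = {"locations": 1.0, "sectors": 1.0, "related": 2.0, "people": 3.0}


def score(company, text):
    """Weighted count of the company's known attributes found in the text."""
    lowered = text.lower()
    return sum(
        weight * sum(1 for value in getattr(company, attr) if value.lower() in lowered)
        for attr, weight in WEIGHTS.items()
    )


def disambiguate(candidates, text):
    """Pick the candidate with the highest context score."""
    return max(candidates, key=lambda company: score(company, text))


# Illustrative data based on the two SpazioDati candidates named in the deck.
trento = Company(
    name="SpazioDati (Trento)",
    locations={"Trento", "Pisa"},
    sectors={"Big Data", "Business Intelligence", "Lead generation"},
    related={"Cerved Group"},
    people={"Michele Barbera", "Gabriele Antonelli"},
)
sassuolo = Company(name="Spazio Dati (Sassuolo)", locations={"Sassuolo"})

TEXT = "Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka"
best = disambiguate([sassuolo, trento], TEXT)
```

With this input the Trento entity wins because a key person appears in the sentence. Plain substring matching is a deliberate simplification and would produce false hits on short attribute values.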
  • 17. IMPLEMENTATION (CURRENT vs WORKING ON) ★ MENTIONS: current: pattern matching on pre-computed mentions + NER; working on: fine-tuned NER ★ CANDIDATES: current: pre-computed; working on: fuzzy matching on names ★ DISAMBIGUATION: current: only structured links (people, companies); working on: also adding contextual information about locations and activity ★ NLP PIPELINE: current: separated steps; working on: fully integrated pipeline ★ LANGUAGES: current: only Italian; working on: major EU languages
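"Pattern matching on pre-computed mentions" can be pictured as a dictionary lookup over token n-grams. The sketch below is a guess at the general technique, not the talk's implementation: the `MENTIONS` table, its `company:N` ids and the greedy longest-match strategy are all hypothetical.

```python
import re

# Hypothetical pre-computed surface forms mapped to illustrative company ids.
MENTIONS = {
    ("spaziodati",): ["company:1"],
    ("spazio", "dati"): ["company:1", "company:2"],
    ("cerved", "group"): ["company:3"],
    ("panini",): ["company:4"],
}
MAX_LEN = max(len(key) for key in MENTIONS)


def find_mentions(text):
    """Greedy longest-match lookup of token n-grams in the mention table."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(token[0] for token in tokens[i:i + n])
            if key in MENTIONS:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                found.append((text[start:end], MENTIONS[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return found


news = find_mentions("Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka")
lunch = find_mentions("Due panini al bar")
```

The second call returns a "panini" mention even though the sentence is about sandwiches, which is why a lookup like this needs an NER model or a disambiguation step on top of it.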
  • 18. RESULTS ★ Only limited companies (società di capitali) ★ Not bad, but not WOW! ★ Huge room for improvement
  • 19. TECH STACK ★ Current: mainly Java ★ NER and Disambiguator: simple and fast random forests ★ Ad hoc and optimized data structure ★ It's getting old 🥺 ★ Now working with: BERT, TensorFlow and NN ★ Still not taking into account timing 😬
  • 20. TAKEAWAYS ★ Never use Apple M1 for these jobs… maybe in a couple of years ★ Language models are REALLY effective, it's not just the hype… but if you really want to reach a real-world level, you have to adapt them ★ NLP building blocks (POS tagger, encoders, etc…) are now a commodity… but availability of good training data for the e2e task is still THE problem ★ You need GPUs for those models only when you have a lot of data… and in that case, GPUs really make the difference (image caption: a sweating g3.16xlarge)
  • 21. LET'S TRY TO SOLVE THIS ★ COMPANYTXT: entity linking vs Atoka 6M companies DB ★ SEDANO: our news processing pipeline ★ ATOKA NEWS: search engine and news monitoring for end users
  • 23. MAIN PIPELINE ★ Cleansing: dirty work ★ Deduplication: remove same articles ★ Annotation: company annotations ★ Classifier A: business event ★ Classifier B: locations
  • 24. CLEANSING ★ Raw input example: "Il titolo ? arrivato a perdere oltre il 3 per cento. n n Tronchetti: <Apertura in calo? Vediamo prossimi mesi,l'azienda ? solida> di Andrea Fontana seguici su Twitter" (note the "?" where an accented "è" was lost, the stray "n n", and the trailing byline and "follow us on Twitter" boilerplate) ★ Data cleansing is like sewer pipe cleaning, someone has to do it ★ Web News is Web Data, so HORRIBLE DATA ★ NLP tools are significantly affected by bad input texts
  • 25. DEDUPLICATION ★ Same article, different newspapers ★ Stopword removal + stemming + discarding shortest phrases + locality-sensitive hashing ★ Streaming approach, use Redis for caching
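The deduplication recipe above can be sketched with MinHash signatures and LSH banding. Everything concrete here is an assumption made for illustration: the English stopword list, the crude suffix-stripping `stem`, the 3-token shingles, the 32 hashes in 16 bands, and the in-memory dictionary that stands in for the Redis cache. The deck does not say which LSH variant the pipeline uses.

```python
import hashlib
import re
from collections import defaultdict

# Assumptions: tiny stopword list, crude stemmer, hand-picked LSH parameters.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "its", "by", "than", "more"}
NUM_HASHES, BANDS = 32, 16  # 16 bands of 2 rows each
MIN_PHRASE_TOKENS = 3


def stem(token):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def shingles(text):
    """Token 3-grams from phrases that survive stopword removal and stemming."""
    result = set()
    for phrase in re.split(r"[.!?\n]+", text.lower()):
        tokens = [stem(t) for t in re.findall(r"\w+", phrase) if t not in STOPWORDS]
        if len(tokens) < MIN_PHRASE_TOKENS:
            continue  # discard the shortest phrases (bylines, "follow us", ...)
        result.update(" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2))
    return result


def hash64(seed, value):
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(shingle_set):
    """Signature made of the minimum hash value under each seeded hash."""
    return [min(hash64(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


class Deduplicator:
    def __init__(self):
        self.buckets = defaultdict(list)  # in-memory stand-in for a Redis cache

    def seen_before(self, article_id, text):
        """Return the id of an earlier near-duplicate (or None), then index the article."""
        shingle_set = shingles(text)
        if not shingle_set:
            return None
        signature = minhash(shingle_set)
        rows = NUM_HASHES // BANDS
        keys = [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        match = next((self.buckets[k][0] for k in keys if self.buckets[k]), None)
        for key in keys:
            self.buckets[key].append(article_id)
        return match


BODY = ("Shares fell more than 3 percent after the announcement. "
        "The company confirmed its targets for next year.")
dedup = Deduplicator()
first = dedup.seen_before("a1", BODY + " By Mario Rossi.")
second = dedup.seen_before("a2", BODY + " Follow us.")
third = dedup.seen_before("a3", "The factory will launch a new product line in Trento next month.")
```

The second article is reported as a duplicate of `a1` because the byline and the "Follow us" phrase are too short to produce shingles, so both texts reduce to the same shingle set. Processing one article at a time against a shared bucket store is what makes the approach fit a streaming setting.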
  • 26. BUSINESS EVENTS ★ Management changes ★ Economic results ★ Launch of new products ★ Failures ★ Accidents ★ Strikes ★ …
  • 27. AND MANY OTHERS ★ E-S-G themes ✴ Environmental ✴ Social ✴ Governance ★ Sentiment ★ Locations: provinces
  • 28. TECH STACK ★ Mainly Python ★ Celery for distributed tasks ★ Django + Postgres for API ★ gRPC + Go for core clustering algorithm ★ Classifiers: scikit-learn and TensorFlow ★ S3 for raw data ★ Elasticsearch for news articles storage ★ Redis for caching ★ K8s cluster on AWS ★ Sentry, ELK, Prometheus, Grafana
  • 29. CHALLENGES ★ Scalability ★ Streaming ★ Operational burden ★ Idempotency
  • 30. LET'S TRY TO SOLVE THIS ★ COMPANYTXT: entity linking vs Atoka 6M companies DB ★ SEDANO: our news processing pipeline ★ ATOKA NEWS: search engine and news monitoring for end users
  • 36. LAST BUT NOT LEAST… SPAZIODATI = Startup culture (team of 40, several in this room, talk to us!) within one of the largest Fintechs in the world: ION Group (>15k employees). Jobs: https://spaziodati.eu/en/jobs/ But if you’re smart we welcome any role, even if not listed :-) Yes, vegetarians too!
  • 37. THANKS! QUESTIONS? CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon and infographics & images by Freepik.