First workshop of the Advisory Committee on Statistical Methods
Impact of the advisory committee on the Istat projects
First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018
Text mining and machine learning techniques for text classification
Francesco Scalfati ISTAT (scalfati@istat.it)
Fabiana Bianchi ISTAT (fabianchi@istat.it)
Outline
Evolution path
Strategy Overview
Process model
IT environment
Case studies
Lessons learnt and future developments
About this work
• This work began as a research project with DIAG (Prof. R. Bruni, University of Rome “Sapienza”):
  - generalized algorithms based on natural language processing (NLP) and machine learning techniques for the automatic detection of enterprise characteristics in websites
• First work discussed with the Advisory Committee (November 2017):
  - text mining and machine learning techniques for text classification, with application to the automatic categorization of websites
• At present: a full web mining strategy that uses an enhancement of the techniques described in the previous point, applied to two case studies:
  1. ICT usage in Enterprises
  2. Statistical Business Register
Strategy Overview
1 – Web address acquisition
• URLs from administrative sources
• URLs from thematic directory sites
• URLs from batch queries on search engines (URL retrieval techniques when no URL is available)
2 – Enterprise identification
• URL validation: checks each URL’s validity (recurring errors and domain extraction)
• Detection of identification variables on the website and comparison with the same information available in the register
3 – Data analytics
• Web scraping techniques for web data acquisition
• Text mining techniques for extracting the requested information
• Machine learning techniques: algorithms that simulate a learning process to build predictive models
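The enterprise identification step (URL validation, domain extraction, and comparison with the register) could look roughly like the sketch below. It is only an illustration: the function names, the particular "recurring errors" being repaired, and the use of the 11-digit Italian VAT number as the identification variable are assumptions, not details taken from the slides.

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers: the slides do not describe the actual implementation.

def normalize_url(raw):
    """Repair a few assumed recurring errors and return a usable URL, or None."""
    url = raw.strip().lower()
    if not url:
        return None
    url = url.replace("\\", "/")                      # backslashes typed as slashes
    url = re.sub(r"^(https?)[:;]/+", r"\1://", url)   # "http;//" or "http:/" -> "http://"
    if not re.match(r"^https?://", url):
        url = "http://" + url                         # missing scheme
    host = urlparse(url).hostname
    if not host or "." not in host or " " in host:
        return None                                   # no plausible host name
    return url

def extract_domain(url):
    """Return the host name without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

# Italian VAT numbers consist of 11 digits; require that no digit precedes or follows.
VAT_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")

def matches_register(page_text, register_vat):
    """True if the VAT number stored in the register appears in the page text."""
    return register_vat in VAT_RE.findall(page_text)
```

For example, `normalize_url("www.example.it/shop")` yields a URL with an explicit scheme, and `matches_register` can then confirm that the scraped page belongs to the enterprise recorded in the register.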
Process model: data analytics phase
IT environment
[Diagram] Layers: operational layer, analysis layer
Stages: data capturing, data preparation, machine learning, data integration
Data: business register, Internet data, text documents, DWH
Data acquisition steps: URLs retrieval, URLs validation, web scraping
NLP (text mining) steps: tokenization, lemmatization, POS tagging, summarization
Machine learning steps: supervised classification, fitting learning models, performance evaluation, predictions
Technological components:
• Java – Jsoup – Selenium HQ – PhantomJS
• NLTK (Python) – scikit-learn (Python) – TreeTagger – R
• NoSQL DBMS – Oracle DBMS – filesystem
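As a rough illustration of the data preparation stage, the sketch below tokenizes scraped text and builds a document-term count matrix that a supervised classifier could consume. It uses only the Python standard library to stay self-contained; the slides name NLTK, TreeTagger and scikit-learn for this work, and lemmatization and POS tagging are not covered here. The stop-word list and function names are assumptions.

```python
import re
from collections import Counter

# Toy stop-word list, assumed for illustration only.
STOPWORDS = {"the", "and", "to", "our", "a", "of"}

def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def document_term_matrix(docs):
    """Return (vocabulary, rows) where each row holds term counts for one document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

docs = ["Add to cart and pay online", "Contact our sales office"]
vocab, rows = document_term_matrix(docs)
```

Each row of `rows` is a bag-of-words vector over `vocab`, which is the kind of input a supervised classification step would be fitted on.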
Case study 1: ICT2017
• “ICT usage in Enterprises” (2017): experimental statistics to produce estimates on (1):
  - web ordering functions (e-commerce component);
  - information on job vacancies;
  - links to social media (Facebook, Twitter, Instagram, etc.).
(1) Work team:
  - Istat: G. Barcaroli, G. Bianchi, F. Bianchi, N. Golini, A. Nurra, P. Righi, S. Salamone, F. Scalfati, D. Summa
  - Università di Roma Sapienza: R. Bruni
1) Reference population: 184,000 (≥10 employees)
2) Scraped websites: 85,000
3) ICT survey sample / answers: 32,000 / 19,000
4) Respondents with scraped websites: 11,700
• Dataset (4) is used as the training set and dataset (2) as the test set
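The counts above imply some simple coverage ratios, which help put the size of the training set in context. The short sketch below only does this arithmetic on the figures from the slide; the variable names and the choice of ratios are illustrative assumptions, not quantities reported by the authors.

```python
# Counts taken from the slide (rounded figures as reported there).
population = 184_000        # reference population, >= 10 employees
scraped = 85_000            # scraped websites
sample = 32_000             # ICT survey sample
answers = 19_000            # ICT survey answers
training = 11_700           # respondents with scraped websites

def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

scraped_share = share(scraped, population)     # websites scraped vs. population
response_rate = share(answers, sample)         # survey answers vs. sample
training_vs_answers = share(training, answers) # respondents with a scraped site
training_vs_scraped = share(training, scraped) # labelled part of the scraped set
```

Under these figures the labelled training set covers only a small part of the scraped websites, which is why the representativeness of the training set matters.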
Performance evaluation on web ordering
The performance of the procedure was analyzed by introducing controlled errors into the class labels of a corrected training set.

Total perturbation | RF Accuracy % | RF F1 Score % | SVM Accuracy % | SVM F1 Score % | Logistic Accuracy % | Logistic F1 Score %
0% | 90.45 | 73.51 | 88.81 | 70.31 | 87.76 | 65.86
3% | 90.66 | 73.25 | 87.94 | 68.80 | 87.35 | 63.90
6% | 90.15 | 70.97 | 86.63 | 64.68 | 87.30 | 62.37
9% | 90.20 | 70.39 | 84.63 | 59.68 | 86.96 | 60.46
12% | 89.91 | 68.50 | 82.93 | 56.21 | 87.59 | 60.32
15% | 86.92 | 57.92 | 76.79 | 53.77 | 86.50 | 56.44
18% | 84.57 | 48.96 | 71.86 | 49.15 | 85.87 | 52.94

• performance degrades as the perturbation level increases
• the degradation is less marked than might be expected
• up to 12% perturbation, classification performance decreases only slightly
• the best performance was obtained with RF
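A perturbation experiment of this kind can be sketched as follows: a chosen fraction of the training labels is flipped, the classifier is refitted on the noisy labels, and accuracy and F1 are computed on clean labels. The sketch covers only the label flipping and the two metrics, in plain Python. Uniform random flipping of binary labels is an assumption, since the slides do not say how the controlled errors were drawn, and the function names are hypothetical.

```python
import random

def perturb_labels(labels, rate, seed=0):
    """Flip a fraction `rate` of binary labels (0/1), chosen uniformly at random."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]

def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

clean = [0, 1] * 50                      # toy training labels
noisy = perturb_labels(clean, 0.12)      # 12% perturbation level
n_changed = sum(a != b for a, b in zip(clean, noisy))
```

Reporting F1 alongside accuracy is useful when the positive class (here, websites offering web ordering) is the minority, because accuracy alone can stay high even when many positives are missed.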
Case study 2: Statistical Business Register
• Experimental project in the Istat Laboratory for Innovation: first evaluations of the feasibility of using Big Data in the SBR production process (2):
  - structural characteristics: registry (anagraphic) and identification data, business data, size, etc.
  - qualitative characteristics: short description of business activity, categorization of .pdf files, categorization by social media account, etc.
(2) Work team:
  - M. Amarone, D. Aprile, G. Bianchi, M. Consalvi, B. Gentili, F. Pancella, F. Scalfati, D. Summa, C. Viviano
SBR: characteristics extracted from web (1/2)
• Sample of 100,000 enterprises
  - 64% of URLs from administrative sources
  - 14% of URLs from certified e-mail domains
  - 5% of URLs from web portals
  - 17% of URLs from search engines

Information scraped from enterprise websites, by URL source:

Characteristic | Total | URLs from admin source | URLs from certified mails | URLs from web portals | URLs from search engine
Company name | 100,000 | 64% | 14% | 5% | 17%
Tax code/VAT number | 80,440 | 60% | 9% | 5% | 26%
Enterprise street address | 103,000 | 70% | 8% | 7% | 15%
E-mail address | 198,000 | 60% | 10% | 3% | 27%
Telephone number | 230,000 | 63% | 8% | 7% | 22%
Company capital | 21,580 | 40% | 4% | 35% | 21%
Social media account | 131,830 | 65% | 9% | 6% | 20%
Business activity | 100,000 | 64% | 14% | 5% | 17%
Job application facilities | 15,000 | 52% | 16% | 2% | 30%
SBR: characteristics extracted from web (2/2)

Addresses by type of local unit
Type of local unit | %
Addresses related to administrative headquarters | 70
Addresses related to other local units | 30

E-mail addresses by type of address
Type of address | %
E-mail address | 92
Certified e-mail address | 8

Phone numbers by type
Type of phone number | %
Phone | 58
Mobile phone | 12
Fax | 30

Enterprises by presence on social media (chart)

Online job facilities | %
Enterprises WITH online job application facilities | 15
Enterprises WITHOUT online job application facilities | 85
Lessons learnt and future developments
Opportunities:
• Enrichment of the quantity and quality of the information produced
• Timeliness: the web is an independent source of data
• Lower statistical burden on respondents
Problems:
• High usage of computing and storage resources
• Difficulty in extracting information delivered through unusual techniques (for instance Flash animations) or protected by anti-scraping mechanisms
• Certifying the quality of the information and the reliability of the data
Future developments:
• Address the “representativeness” of the training set (subsample with a valid URL)
• Move from experimentation to production
Thank you for your attention

Weitere ähnliche Inhalte

Ähnlich wie Session I - Big Data F. Bianchi, F. Scalfati, Text mining and machine learning techniques for text classification

Success Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsSuccess Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsDataBench
 
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...Performance Measurement and Improvement of Lean Manufacturing Operations: A L...
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...Leandro Silvério
 
C. Santoro, Labour Market Areas Web application based on Istat SDP framework
C. Santoro, Labour Market Areas Web application based on Istat SDP frameworkC. Santoro, Labour Market Areas Web application based on Istat SDP framework
C. Santoro, Labour Market Areas Web application based on Istat SDP frameworkIstituto nazionale di statistica
 
Simpda 2014 - A living story: measuring quality of developments in a large in...
Simpda 2014 - A living story: measuring quality of developments in a large in...Simpda 2014 - A living story: measuring quality of developments in a large in...
Simpda 2014 - A living story: measuring quality of developments in a large in...SpagoWorld
 
Process Analysis with Process Mining
Process Analysis with Process MiningProcess Analysis with Process Mining
Process Analysis with Process MiningMichael Groeschel
 
Applying deep learning tools to data available at the banking industry level....
Applying deep learning tools to data available at the banking industry level....Applying deep learning tools to data available at the banking industry level....
Applying deep learning tools to data available at the banking industry level....Data Driven Innovation
 
Demystifying Machine Learning for Manufacturing: Data Science for all
Demystifying Machine Learning for Manufacturing: Data Science for allDemystifying Machine Learning for Manufacturing: Data Science for all
Demystifying Machine Learning for Manufacturing: Data Science for allInfosys
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...DataBench
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...DataBench
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...Piet J.H. Daas
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...IDC4EU
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...DataBench
 
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDY
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDYA LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDY
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDYLinda Garcia
 
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...Big Data Value Association
 
Analytics Service Framework
Analytics Service Framework Analytics Service Framework
Analytics Service Framework Vishwanath Ramdas
 
[Publication] System Dynamics Applications to Design User Centred Decentralis...
[Publication] System Dynamics Applications to Design User Centred Decentralis...[Publication] System Dynamics Applications to Design User Centred Decentralis...
[Publication] System Dynamics Applications to Design User Centred Decentralis...Junie Kwon
 
2014 287fo brochure01 inglese
2014 287fo brochure01 inglese2014 287fo brochure01 inglese
2014 287fo brochure01 ingleseformatresearch
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...DataBench
 

Ähnlich wie Session I - Big Data F. Bianchi, F. Scalfati, Text mining and machine learning techniques for text classification (20)

Success Stories on Big Data & Analytics
Success Stories on Big Data & AnalyticsSuccess Stories on Big Data & Analytics
Success Stories on Big Data & Analytics
 
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...Performance Measurement and Improvement of Lean Manufacturing Operations: A L...
Performance Measurement and Improvement of Lean Manufacturing Operations: A L...
 
C. Santoro, Labour Market Areas Web application based on Istat SDP framework
C. Santoro, Labour Market Areas Web application based on Istat SDP frameworkC. Santoro, Labour Market Areas Web application based on Istat SDP framework
C. Santoro, Labour Market Areas Web application based on Istat SDP framework
 
Simpda 2014 - A living story: measuring quality of developments in a large in...
Simpda 2014 - A living story: measuring quality of developments in a large in...Simpda 2014 - A living story: measuring quality of developments in a large in...
Simpda 2014 - A living story: measuring quality of developments in a large in...
 
Process Analysis with Process Mining
Process Analysis with Process MiningProcess Analysis with Process Mining
Process Analysis with Process Mining
 
Certified Business Analytics Specialist (CBAS)
Certified Business Analytics Specialist (CBAS) Certified Business Analytics Specialist (CBAS)
Certified Business Analytics Specialist (CBAS)
 
Applying deep learning tools to data available at the banking industry level....
Applying deep learning tools to data available at the banking industry level....Applying deep learning tools to data available at the banking industry level....
Applying deep learning tools to data available at the banking industry level....
 
Demystifying Machine Learning for Manufacturing: Data Science for all
Demystifying Machine Learning for Manufacturing: Data Science for allDemystifying Machine Learning for Manufacturing: Data Science for all
Demystifying Machine Learning for Manufacturing: Data Science for all
 
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
Relating Big Data Business and Technical Performance Indicators, Barbara Pern...
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
 
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
Building a Bridge between Technical and Business Benchmarking, Gabriella Catt...
 
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDY
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDYA LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDY
A LEAN SIX-SIGMA MANUFACTURING PROCESS CASE STUDY
 
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T...
 
Analytics Service Framework
Analytics Service Framework Analytics Service Framework
Analytics Service Framework
 
[Publication] System Dynamics Applications to Design User Centred Decentralis...
[Publication] System Dynamics Applications to Design User Centred Decentralis...[Publication] System Dynamics Applications to Design User Centred Decentralis...
[Publication] System Dynamics Applications to Design User Centred Decentralis...
 
2014 287fo brochure01 inglese
2014 287fo brochure01 inglese2014 287fo brochure01 inglese
2014 287fo brochure01 inglese
 
Itais 2013
Itais 2013Itais 2013
Itais 2013
 
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, B...
 

Mehr von Istituto nazionale di statistica

Mehr von Istituto nazionale di statistica (20)

Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
14a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica1414a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica14
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 

Kürzlich hochgeladen

BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Kürzlich hochgeladen (20)

BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

Session I - Big Data F. Bianchi, F. Scalfati, Text mining and machine learning techniques for text classification

  • 1. First workshop of the Advisory Committee on Statistical Methods Impact of the advisory committee on the Istat projects First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018 Text mining and machine learning techniques for text classification Francesco Scalfati ISTAT (scalfati@istat.it ) Fabiana Bianchi ISTAT (fabianchi@istat.it)
  • 2. Outline Evolution path Strategy Overview Process model IT environment Case studies Lesson learnt and future developments First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018
  • 3. About this work  This work starts as a research work with DIAG (Prof R. Bruni, University of Rome “Sapienza”):  generalized algorithms based on natural language processing (NLP) and machine learning techniques to solve the automatic detection of enterprise characteristics in websites  First work discussed with the Advisory Committee (Nov, 2017):  Text mining and machine learning techniques for text classification, with application to the automatic categorization of websites  At present: Full web mining strategy that uses an enhancement of techniques described in previous point, applied to case studies: 1. ICT usage in Enterprises 2. Statistical Business Register First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018
  • 4. 1 – Web address acquisition URL from administrative sources URL from thematic directory sites URL from batch queries on search engines (URL Retrieval techniques in case of non existing URL) 2 – Enterprise identification URL validation, checks URL’s validity (recurring errors and domain extraction) Detection of identification variables from the website and comparison with the same information available in the register 3 – Data analytics Web Scraping techniques for web data acquisition Text Mining techniques for extracting the requested information Machine Learning techniques for the use of algorithms that simulate a learning process for the construction of predictive models Strategy Overview First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018
  • 5. Process model: data analytics phase
  • 6. IT environment (diagram)
      • Phases: Data capturing; Data preparation; Machine learning; Data integration
      • Layers: Operational layer; Analysis layer
      • Diagram labels: Internet data; URLs retrieval; URLs validation; Web scraping; Text documents; NLP: text mining (tokenization, lemmatization, POS tagging, summarization); Supervised classification (fitting learning models, performance evaluation); Predictions; Business register; DWH
      • Technological components:
          • Java – Jsoup – Selenium HQ – PhantomJS
          • NLTK (Python) – scikit-learn (Python) – TreeTagger – R
          • DBMS noSQL – DBMS Oracle – Filesystem
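As a rough illustration of the data preparation step named in the diagram (tokenization, followed by turning text into features for a classifier), the sketch below uses only the Python standard library. The stopword list, the regular expression, and the function names are assumptions made for the example; the slides indicate that the project itself relied on NLTK and TreeTagger, and lemmatization and POS tagging are not reproduced here.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption); a real pipeline would use a
# language-specific list, such as the ones distributed with NLTK.
STOPWORDS = {"the", "a", "and", "to", "of", "in", "your"}

# Lowercase alphabetic runs, including common Italian accented vowels.
TOKEN_RE = re.compile(r"[a-zàèéìòù]+")


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def bag_of_words(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def vectorize(text, vocabulary):
    """Map a text to a fixed-length count vector over a given vocabulary."""
    counts = bag_of_words(text)
    return [counts[term] for term in vocabulary]


vocabulary = ["cart", "jobs", "checkout"]
print(tokenize("Add to cart and pay online"))
print(vectorize("Cart, cart: checkout!", vocabulary))
```

Count vectors of this kind are one common input to supervised classifiers; the feature representation actually used in the project is not stated in the slides.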
  • 7. Case study 1: ICT2017
      • “ICT usage in Enterprises” (2017): experimental statistics to produce estimates on (1):
          • web ordering functions (e-commerce component);
          • information on job vacancies;
          • links to social media (Facebook, Twitter, Instagram, etc.).
      • Data sets:
          1) Reference population: 184,000 (≥10 employees)
          2) Scraped websites: 85,000
          3) ICT survey sample / answers: 32,000 / 19,000
          4) Respondents with scraped websites: 11,700
      • Dataset (4) is used as the training set and dataset (2) as the test set.
      (1) Work team: Istat: G. Barcaroli, G. Bianchi, F. Bianchi, N. Golini, A. Nurra, P. Righi, S. Salamone, F. Scalfati, D. Summa; Università di Roma Sapienza: R. Bruni
  • 8. Performance evaluation on web ordering
      The performance of the procedure was analyzed by introducing controlled errors in the class labels of a corrected training set.

      | Total perturbation | RF Accuracy % | RF F1 Score % | SVM Accuracy % | SVM F1 Score % | Logistic Accuracy % | Logistic F1 Score % |
      |---|---|---|---|---|---|---|
      | 0% | 90.45 | 73.51 | 88.81 | 70.31 | 87.76 | 65.86 |
      | 3% | 90.66 | 73.25 | 87.94 | 68.80 | 87.35 | 63.90 |
      | 6% | 90.15 | 70.97 | 86.63 | 64.68 | 87.30 | 62.37 |
      | 9% | 90.20 | 70.39 | 84.63 | 59.68 | 86.96 | 60.46 |
      | 12% | 89.91 | 68.50 | 82.93 | 56.21 | 87.59 | 60.32 |
      | 15% | 86.92 | 57.92 | 76.79 | 53.77 | 86.50 | 56.44 |
      | 18% | 84.57 | 48.96 | 71.86 | 49.15 | 85.87 | 52.94 |

      • Performance degrades as the perturbation level increases.
      • The degradation is less marked than might be expected.
      • Up to 12% perturbation, classification performance decreases only slightly.
      • The best performance was obtained with RF.
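The perturbation experiment described on this slide can be sketched as follows. The slides do not say how the controlled errors were introduced, so this example assumes binary labels (web ordering present or absent) and flips a fixed fraction of them uniformly at random; `perturb_labels`, `accuracy`, and `f1_score` are hypothetical helper names written for the illustration, not the project's code.

```python
import random


def perturb_labels(labels, rate, seed=0):
    """Flip a fraction `rate` of binary (0/1) labels, chosen uniformly at random.

    Assumption: uniform random flips; the original perturbation scheme
    is not described in the slides.
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * rate)
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


clean = [0, 1] * 50                      # toy training labels
noisy = perturb_labels(clean, 0.12)      # 12% of labels flipped
print(sum(a != b for a, b in zip(clean, noisy)))
print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]), f1_score([1, 0, 1, 0], [1, 0, 0, 0]))
```

In an experiment of this kind, a classifier would be trained on the perturbed labels at each rate and then scored with both metrics. Because F1 depends only on the positive class, it can move differently from accuracy, as the two columns per classifier in the table show.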
  • 9. Case study 2: Statistical Business Register
      • Experimental project in the Istat Laboratory for Innovation: first evaluations of the feasibility of using Big Data in the SBR production process (2):
          • structural characteristics: registry details and personal data, business data, size, etc.
          • qualitative characteristics: short description of business activity, categorization of .pdf files, categorization by social media account, etc.
      • Sample of 100,000 enterprises:
          • 64% URLs from administrative sources
          • 14% URLs from certified e-mail domains
          • 5% URLs from web portals
          • 17% URLs from search engines
      (2) Work team: M. Amarone, D. Aprile, G. Bianchi, M. Consalvi, B. Gentili, F. Pancella, F. Scalfati, D. Summa, C. Viviano
  • 10. SBR: characteristics extracted from web (1/2)
      Information scraped from enterprise websites, broken down by URL source.

      | Characteristic | Total | URLs from admin source | URLs from certified mails | URLs from web portals | URLs from search engine |
      |---|---|---|---|---|---|
      | Company name | 100,000 | 64 % | 14 % | 5 % | 17 % |
      | Tax code/VAT number | 80,440 | 60 % | 9 % | 5 % | 26 % |
      | Enterprise street address | 103,000 | 70 % | 8 % | 7 % | 15 % |
      | E-mail address | 198,000 | 60 % | 10 % | 3 % | 27 % |
      | Telephone number | 230,000 | 63 % | 8 % | 7 % | 22 % |
      | Company capital | 21,580 | 40 % | 4 % | 35 % | 21 % |
      | Social media account | 131,830 | 65 % | 9 % | 6 % | 20 % |
      | Business activity | 100,000 | 64 % | 14 % | 5 % | 17 % |
      | Job application facilities | 15,000 | 52 % | 16 % | 2 % | 30 % |
  • 11. SBR: characteristics extracted from web (2/2)
      • Addresses by type of local unit:
          • Addresses related to administrative headquarters: 70%
          • Addresses related to other local units: 30%
      • E-mail addresses by type of address:
          • E-mail address: 92%
          • Certified e-mail address: 8%
      • Phone numbers by type:
          • Phone: 58%
          • Mobile phone: 12%
          • Fax: 30%
      • Enterprises by presence on social media (chart)
      • Online job facilities:
          • Enterprises with online job application facilities: 15%
          • Enterprises without online job application facilities: 85%
  • 12. Lessons learnt and future developments
      • Opportunities:
          • Enrichment of the quantity and quality of the information produced
          • Timeliness: the web is an independent source of data
          • Lower statistical burden on respondents
      • Problems:
          • High usage of computing and storage resources
          • Difficulty extracting information that is published with unusual techniques (for instance, Flash animations) or protected by anti-scraping mechanisms
          • Certifying the quality of the information and the reliability of the data
      • Future developments:
          • Address the “representativeness” of the training set (subsample with a valid URL)
          • Move from experimentation to production
  • 13. Thank you for your attention