Session I - Big Data F. Bianchi, F. Scalfati, Text mining and machine learning techniques for text classification
1. First workshop of the Advisory Committee on Statistical Methods
Impact of the advisory committee on the Istat projects
First workshop of the Advisory Committee on Statistical Methods – Rome, November 19, 2018
Text mining and machine learning techniques for text classification
Francesco Scalfati, ISTAT (scalfati@istat.it)
Fabiana Bianchi, ISTAT (fabianchi@istat.it)
2. Outline
Evolution path
Strategy Overview
Process model
IT environment
Case studies
Lessons learnt and future developments
3. About this work
This work began as a research collaboration with DIAG (Prof. R. Bruni, University of Rome “Sapienza”):
generalized algorithms based on natural language processing (NLP) and machine learning techniques for the automatic detection of enterprise characteristics in websites
First work discussed with the Advisory Committee (Nov. 2017):
Text mining and machine learning techniques for text classification, with application to the automatic categorization of websites
At present: a full web mining strategy that uses an enhanced version of the techniques described in the previous point, applied to two case studies:
1. ICT usage in Enterprises
2. Statistical Business Register
4. Strategy Overview
1 – Web address acquisition
URLs from administrative sources
URLs from thematic directory sites
URLs from batch queries on search engines (URL retrieval techniques when no URL is available)
2 – Enterprise identification
URL validation: checks of the URL’s validity (recurring errors and domain extraction)
Detection of identification variables on the website and comparison with the same information available in the register
3 – Data analytics
Web scraping techniques for web data acquisition
Text mining techniques for extracting the requested information
Machine learning techniques: algorithms that simulate a learning process to build predictive models
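The URL validation step (recurring errors and domain extraction) can be illustrated with a minimal sketch. The slides do not say which errors are corrected or how; the two corrections below (a comma typed instead of a dot, a missing scheme) and the function name `normalize_url` are assumptions made only for illustration.

```python
from urllib.parse import urlsplit


def normalize_url(raw):
    """Return (cleaned_url, domain), or None if the string is not a usable web address.

    The "recurring errors" handled here are hypothetical examples,
    not the list actually used in the project.
    """
    candidate = raw.strip().lower()
    if not candidate:
        return None
    # Assumed recurring errors: commas instead of dots, stray spaces.
    candidate = candidate.replace(",", ".").replace(" ", "")
    # Assumed recurring error: scheme omitted in the source register.
    if "://" not in candidate:
        candidate = "http://" + candidate
    try:
        parts = urlsplit(candidate)
        host = parts.hostname
    except ValueError:
        return None
    # Keep only web addresses whose host looks like a domain name.
    if parts.scheme not in ("http", "https") or not host or "." not in host:
        return None
    # Domain extraction: drop a leading "www." (the port, if any, is discarded).
    domain = host[4:] if host.startswith("www.") else host
    return f"{parts.scheme}://{host}{parts.path}", domain
```

For example, `normalize_url(" WWW.Example,it ")` yields the cleaned address `http://www.example.it` with domain `example.it`, while a string without a dotted host is rejected.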
5. Process model: data analytics phase
6. IT environment
[Diagram: operational layer and analysis layer. Stages: data capturing, data preparation, machine learning, data integration. Elements shown: web scraping, URLs retrieval, URLs validation; NLP text mining (tokenization, lemmatization, POS tagging, summarization); supervised classification, fitting learning models, performance evaluation, predictions; Internet data, text documents, business register, DWH.]
Technological components:
Java, Jsoup, Selenium HQ, PhantomJS
NLTK (Python), scikit-learn (Python), TreeTagger, R
NoSQL DBMS, Oracle DBMS, Filesystem
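To make the tokenization step of the data preparation stage concrete, here is a dependency-free sketch that turns scraped text into tokens and term counts. It is not the NLTK or TreeTagger pipeline named on the slide; the stopword list, the regular expression, and the function names are assumptions chosen for illustration, and lemmatization and POS tagging are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (a real pipeline would use a full one).
STOPWORDS = {"the", "and", "to", "of", "a", "in", "our", "your"}

# Runs of letters (any alphabet), optionally joined by hyphens, e.g. "e-commerce".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in STOPWORDS]


def term_frequencies(text):
    """Bag-of-words representation: token -> number of occurrences."""
    return Counter(tokenize(text))
```

Term counts of this kind are one common input for supervised classifiers; the slides do not specify which text representation was actually used.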
7. Case study 1: ICT2017
“ICT usage in Enterprises” (2017): experimental statistics
to produce estimates on (1):
web ordering functions (e-commerce component);
information on job vacancies;
links to social media (Facebook, Twitter, Instagram etc.).
(1) Work team:
Istat: G. Barcaroli, G. Bianchi, F. Bianchi, N. Golini, A. Nurra,
P. Righi, S. Salamone, F. Scalfati, D. Summa
Università di Roma Sapienza: R. Bruni
1) Reference population: 184,000 (≥10 employees)
2) Scraped websites: 85,000
3) ICT survey sample / answers: 32,000 / 19,000
4) Respondents with scraped websites: 11,700
Dataset (4) is used as the training set and dataset (2) as the test set
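One of the targets of this case study, links to social media, lends itself to a simple rule-based check on the scraped HTML. The sketch below is only an illustration under assumptions: the domain table, the class `LinkCollector`, and the function `social_media_links` are hypothetical, and the slides do not state whether this indicator was derived by rules or by a learned model.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed mapping from domain to platform, limited to the platforms named on the slide.
SOCIAL_DOMAINS = {
    "facebook.com": "Facebook",
    "twitter.com": "Twitter",
    "instagram.com": "Instagram",
}


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def social_media_links(html):
    """Return the sorted list of platforms that the page links to."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        try:
            host = (urlsplit(href).hostname or "").lower()
        except ValueError:
            continue  # malformed link: skip it
        for domain, label in SOCIAL_DOMAINS.items():
            # Match the domain itself or any of its subdomains (e.g. www.).
            if host == domain or host.endswith("." + domain):
                found.add(label)
    return sorted(found)
```

Relative links and links to look-alike domains are ignored because the match is made on the parsed host name, not on the raw string.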
8. Performance evaluation on web ordering
The performance of the procedure was analyzed by introducing controlled errors in the class labels of a corrected training set.
Total          RF                       SVM                      Logistic
perturbation   Accuracy %  F1 score %   Accuracy %  F1 score %   Accuracy %  F1 score %
0%             90.45       73.51        88.81       70.31        87.76       65.86
3%             90.66       73.25        87.94       68.80        87.35       63.90
6%             90.15       70.97        86.63       64.68        87.30       62.37
9%             90.20       70.39        84.63       59.68        86.96       60.46
12%            89.91       68.50        82.93       56.21        87.59       60.32
15%            86.92       57.92        76.79       53.77        86.50       56.44
18%            84.57       48.96        71.86       49.15        85.87       52.94
Performance degrades as the perturbation level increases
The degradation is less marked than might be expected
Up to 12% perturbation, classification performance decreases only slightly
The best performance was obtained with RF
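The perturbation experiment can be sketched as follows. This is a minimal illustration under assumptions the slides do not confirm: the labels are binary (1 = web ordering present), errors are introduced by flipping a randomly chosen fraction of the training labels, and F1 is computed for the positive class. The function names are hypothetical.

```python
import random


def perturb_labels(labels, fraction, seed=0):
    """Flip a given fraction of binary labels (0/1), chosen at random.

    Random flipping is an assumed perturbation scheme.
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * fraction)
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score of the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

In such an experiment a classifier would be refitted on `perturb_labels(train_labels, level)` for each perturbation level and scored with `accuracy_and_f1` against clean labels. F1 is more sensitive than accuracy when the positive class is the minority, which is one possible reading of why the F1 columns fall faster in the table.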
9. Case study 2: Statistical Business Register
Experimental project in the Istat Laboratory for Innovation:
first evaluations of the feasibility of using Big Data in the SBR production process (2):
structural characteristics: registry (identification) characteristics and personal data, business data, size, etc.
qualitative characteristics: short description of business activity, categorization of PDF documents, categorization by social media account, etc.
(2) Work team:
M. Amarone, D. Aprile, G. Bianchi, M. Consalvi, B. Gentili, F. Pancella, F. Scalfati, D. Summa, C. Viviano
Sample of 100,000 enterprises
64% URLs from administrative sources
14% URLs from certified email domain
5% URLs from web portals
17% URLs from search engine
10. SBR: characteristics extracted from web (1/2)
Characteristic              Total     URLs from      URLs from          URLs from     URLs from
                                      admin source   certified e-mail   web portals   search engine
Company name                100,000   64%            14%                5%            17%
Tax code/VAT number          80,440   60%             9%                5%            26%
Enterprise street address   103,000   70%             8%                7%            15%
E-mail address              198,000   60%            10%                3%            27%
Telephone number            230,000   63%             8%                7%            22%
Company capital              21,580   40%             4%               35%            21%
Social media account        131,830   65%             9%                6%            20%
Business activity           100,000   64%            14%                5%            17%
Job application facilities   15,000   52%            16%                2%            30%
Information scraped from enterprise websites, by URL source.
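Identification variables such as the tax code/VAT number can be detected in scraped text by pattern matching and then compared with the value held in the register. The sketch below relies on the fact that Italian VAT numbers consist of 11 digits; everything else (the function names, the sample text, the choice not to verify the check digit) is an assumption for illustration and not the extraction procedure used in the project.

```python
import re

# Exactly 11 consecutive digits, not embedded in a longer digit run.
VAT_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")


def extract_vat_candidates(page_text):
    """Return the distinct 11-digit strings found in the page text."""
    return sorted(set(VAT_RE.findall(page_text)))


def matches_register(page_text, register_vat):
    """True if the VAT number held in the register appears on the page."""
    return register_vat in extract_vat_candidates(page_text)


# Hypothetical footer text of an enterprise website.
sample = "ACME S.r.l. - P.IVA 01234567890 - Tel. 06 1234567"
```

With this sample, `extract_vat_candidates(sample)` returns the single candidate `01234567890`, and `matches_register` confirms or rejects the link between the website and a register unit. A production version would also validate the check digit and handle formatting variants such as an `IT` prefix.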
11. SBR: characteristics extracted from web (2/2)
Type of local unit                                  %
Addresses related to administrative headquarters   70
Addresses related to other local units             30
Addresses by type of local unit

Type of address            %
E-mail address            92
Certified e-mail address   8
E-mail addresses by type of address

Type of phone number    %
Phone                  58
Mobile phone           12
Fax                    30
Phone numbers by type

[Figure: Enterprises by presence on social media]

Online job facilities                                    %
Enterprises WITH online job application facilities      15
Enterprises WITHOUT online job application facilities   85
12. Lessons learnt and future developments
Opportunities:
Enrichment of the quantity and quality of the information produced
Timeliness: the web is an independent source of data
Lower statistical burden on respondents
Problems:
High usage of computing and storage resources
Difficulty in extracting information delivered through unusual techniques (for instance Flash animations) or protected by anti-scraping mechanisms
Certifying the quality of the information and the reliability of the data
Future developments:
Address the “representativeness” of the training set (subsample with a valid URL)
Move from experimentation to production
13. Thank you for your attention