Session I - Big Data F. Bianchi, F. Scalfati, Text mining and machine learning techniques for text classification

First workshop of the Advisory Committee on Statistical Methods
Impact of the advisory committee on the Istat projects
Rome, November 19, 2018
Text mining and machine learning techniques for text classification
Francesco Scalfati, ISTAT (scalfati@istat.it)
Fabiana Bianchi, ISTAT (fabianchi@istat.it)

Outline
Evolution path
Strategy Overview
Process model
IT environment
Case studies
Lessons learnt and future developments

About this work
• This work began as a research collaboration with DIAG (Prof. R. Bruni, University of Rome “Sapienza”):
  • generalized algorithms based on natural language processing (NLP) and machine learning techniques for the automatic detection of enterprise characteristics in websites
• First work discussed with the Advisory Committee (November 2017):
  • Text mining and machine learning techniques for text classification, with application to the automatic categorization of websites
• At present: a full web mining strategy that uses an enhancement of the techniques described in the previous point, applied to two case studies:
  1. ICT usage in Enterprises
  2. Statistical Business Register

Strategy Overview
1 – Web address acquisition
• URLs from administrative sources
• URLs from thematic directory sites
• URLs from batch queries on search engines (URL retrieval techniques when no URL is available)
2 – Enterprise identification
• URL validation: checks on each URL's validity (recurring errors and domain extraction)
• Detection of identification variables on the website and comparison with the same information available in the register
3 – Data analytics
• Web scraping techniques for web data acquisition
• Text mining techniques for extracting the requested information
• Machine learning techniques: algorithms that simulate a learning process to build predictive models
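The URL validation step (recurring errors and domain extraction) could be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used at Istat; the function name `normalize_url` and the specific rules (adding a missing scheme, dropping a leading `www.`) are assumptions made for the example.

```python
from urllib.parse import urlparse


def normalize_url(raw):
    """Return (normalized_url, domain), or None if the URL looks unusable.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    candidate = raw.strip()
    if not candidate:
        return None
    # Recurring error: the scheme is missing, e.g. "www.example.it"
    if "://" not in candidate:
        candidate = "http://" + candidate
    parsed = urlparse(candidate)  # scheme and hostname come back lowercased
    if parsed.scheme not in ("http", "https"):
        return None
    host = parsed.hostname
    if not host or "." not in host:
        return None
    # Drop a leading "www." so that domains can be compared across sources
    domain = host[4:] if host.startswith("www.") else host
    # Keep only scheme, host, and path (port and query string are discarded)
    return f"{parsed.scheme}://{host}{parsed.path}", domain


print(normalize_url(" WWW.Example.it/Contatti "))
print(normalize_url("ftp://example.it"))
```

The extracted domain can then serve as a key when comparing URLs collected from different sources for the same enterprise.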

Process model: data analytics phase

[Diagram: process model of the data analytics phase. Recoverable labels:]
• Layers: operational layer, analysis layer
• Data: business register, Internet data, text documents, DWH
• Web data acquisition: URLs retrieval, URLs validation, web scraping
• NLP (text mining): tokenization, lemmatization, POS tagging, summarization
• Supervised classification: fitting learning models, performance evaluation, predictions

IT environment
• Areas covered: data capturing, data preparation, machine learning, data integration
• Technological components:
  • Java, Jsoup, Selenium HQ, PhantomJS
  • NLTK (Python), scikit-learn (Python), TreeTagger, R
  • NoSQL DBMS, Oracle DBMS, filesystem
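The analysis layer (text mining followed by supervised classification) can be illustrated with a short scikit-learn sketch. It is an assumption-laden toy example, not the authors' pipeline: the corpus and labels are invented, TF-IDF stands in for the full NLP preparation, and lemmatization and POS tagging (done with TreeTagger in the slides) are omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: text scraped from enterprise websites.
# Label 1 = the site offers web ordering, 0 = it does not.
texts = [
    "add to cart checkout secure payment shipping",
    "online shop buy now cart payment",
    "order online free shipping add to basket",
    "company history mission and values",
    "contact us office address opening hours",
    "our team news and press releases",
]
labels = [1, 1, 1, 0, 0, 0]

# Tokenization and term weighting, then a random forest classifier
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Predict the class of two unseen (also invented) website texts
predictions = model.predict(
    ["buy online with secure checkout", "about our company"]
)
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary learned on the training texts consistent with the one applied at prediction time.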

Case study 1: ICT2017
• “ICT usage in Enterprises” (2017): experimental statistics to produce estimates on (1):
  • web ordering functions (e-commerce component);
  • information on job vacancies;
  • links to social media (Facebook, Twitter, Instagram, etc.).
(1) Work team:
  • Istat: G. Barcaroli, G. Bianchi, F. Bianchi, N. Golini, A. Nurra, P. Righi, S. Salamone, F. Scalfati, D. Summa
  • Università di Roma Sapienza: R. Bruni
1) Reference population: 184,000 enterprises (≥10 employees)
2) Scraped websites: 85,000
3) ICT survey sample / answers: 32,000 / 19,000
4) Respondents with scraped websites: 11,700
• Dataset (4) is used as the training set and dataset (2) as the test set

Performance evaluation on web ordering
The performance of the procedure was analyzed by introducing controlled errors into the class labels of a corrected training set.

Total          RF                     SVM                    Logistic
perturbation   Accuracy %  F1 score % Accuracy %  F1 score % Accuracy %  F1 score %
0%             90.45       73.51      88.81       70.31      87.76       65.86
3%             90.66       73.25      87.94       68.80      87.35       63.90
6%             90.15       70.97      86.63       64.68      87.30       62.37
9%             90.20       70.39      84.63       59.68      86.96       60.46
12%            89.91       68.50      82.93       56.21      87.59       60.32
15%            86.92       57.92      76.79       53.77      86.50       56.44
18%            84.57       48.96      71.86       49.15      85.87       52.94

• performance degrades as the perturbation level increases
• the degradation is less marked than might be expected
• up to 12% perturbation, classification performance decreases only slightly
• the best performance was obtained with RF
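The perturbation experiment rests on two ingredients: flipping a controlled share of training labels, and scoring predictions by accuracy and F1. The sketch below shows both in plain Python. How the errors were actually injected is not specified in the slides, so the uniform random flipping in `perturb_labels` is an assumption, and both function names are hypothetical.

```python
import random


def perturb_labels(labels, rate, seed=0):
    """Flip a share `rate` of binary labels (0/1), chosen uniformly at random.

    Assumption: errors are injected uniformly; the slides do not say how.
    """
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


clean = [0, 1] * 50                  # invented labels, 100 items
noisy = perturb_labels(clean, 0.12)  # 12% of the labels flipped
print(sum(a != b for a, b in zip(clean, noisy)))
print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In an experiment of this kind, a classifier would be trained on the perturbed labels and scored against clean labels at each perturbation level.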

Case study 2: Statistical Business Register
• Experimental project in the Istat Laboratory for Innovation: first evaluations of the feasibility of using Big Data in the SBR production process (2):
  • structural characteristics: registry (identification) details and personal data, business data, size, etc.
  • qualitative characteristics: short description of business activity, categorization of PDF documents, categorization by social media account, etc.
(2) Work team:
  • M. Amarone, D. Aprile, G. Bianchi, M. Consalvi, B. Gentili, F. Pancella, F. Scalfati, D. Summa, C. Viviano
• Sample of 100,000 enterprises:
  • 64% of URLs from administrative sources
  • 14% of URLs from certified e-mail domains
  • 5% of URLs from web portals
  • 17% of URLs from search engines

SBR: characteristics extracted from web (1/2)

                                      URLs from     URLs from        URLs from    URLs from
Characteristic              Total     admin source  certified mails  web portals  search engine
Company name                100,000   64%           14%              5%           17%
Tax code/VAT number         80,440    60%           9%               5%           26%
Enterprise street address   103,000   70%           8%               7%           15%
E-mail address              198,000   60%           10%              3%           27%
Telephone number            230,000   63%           8%               7%           22%
Company capital             21,580    40%           4%               35%          21%
Social media account        131,830   65%           9%               6%           20%
Business activity           100,000   64%           14%              5%           17%
Job application facilities  15,000    52%           16%              2%           30%

Information scraped from enterprise websites, divided by URL source.
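Characteristics such as e-mail addresses and VAT numbers are typically located in scraped page text with pattern matching and then compared with the register. The following is a rough sketch under stated assumptions, not the extraction rules used in the project: the regular expressions are simplified, the VAT pattern only assumes the 11-digit format of Italian VAT numbers (no checksum validation), and the function names and sample text are invented.

```python
import re

# Simplified e-mail pattern (illustrative, not RFC-complete)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: a VAT number appears as exactly 11 consecutive digits
VAT_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")


def extract_identifiers(text):
    """Collect distinct e-mail addresses and VAT-like numbers from page text."""
    emails = sorted({match.lower() for match in EMAIL_RE.findall(text)})
    vat_numbers = sorted(set(VAT_RE.findall(text)))
    return {"emails": emails, "vat_numbers": vat_numbers}


def matches_register(extracted, register_vat):
    """True if the VAT number held in the register was found on the page."""
    return register_vat in extracted["vat_numbers"]


# Invented page text for illustration
page = "Example S.r.l. - P.IVA 01234567890 - Info@Example.it, info@example.it. Tel 06 1234567"
found = extract_identifiers(page)
print(found)
print(matches_register(found, "01234567890"))
```

Matching an identifier found on the site against the register value is one way to confirm that a URL belongs to the intended enterprise.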

SBR: characteristics extracted from web (2/2)

Addresses by type of local unit
Type of local unit                                 %
Addresses related to administrative headquarters   70
Addresses related to other local units             30

E-mail addresses by type of address
Type of address            %
E-mail address             92
Certified e-mail address   8

Phone numbers by type
Type of phone number   %
Phone                  58
Mobile phone           12
Fax                    30

Enterprises by presence on social media
[Chart: values not recoverable from the text]

Online job facilities                                   %
Enterprises WITH online job application facilities      15
Enterprises WITHOUT online job application facilities   85

Lessons learnt and future developments
Opportunities:
• Enrichment of the quantity and quality of the information produced
• Timeliness: the web is an independent source of data
• Lower statistical burden on respondents
Problems:
• High usage of computing and storage resources
• Difficulty extracting information delivered through unusual techniques (for instance Flash animations) or protected by anti-scraping mechanisms
• Certifying the quality and reliability of the information
Future developments:
• Address the “representativeness” of the training set (a subsample of enterprises with a valid URL)
• Move from experimentation to production

Thank you for your attention
