EmbNum: Semantic Labeling for Numerical Values with Deep Metric Learning

Phuc NGUYEN*, Khai NGUYEN, Ryutaro ICHISE, Hideaki TAKEDA
JIST 2018: The 8th Joint International Semantic Technology Conference
November 27, 2018
* phucnt@nii.ac.jp

Agenda
1. Background
2. Problem & Challenges
3. Approach
4. Evaluation
5. Summary
6. Backup

Open Data is a Global Trend
Vision [1]: Make data available
• For everyone
• Transparency
• Potential for innovation
Issues in Open Data usability [2]
(82 portals, 160k datasets)
- Missing metadata and context descriptions
- Data are not interlinked
- Most of the data is structured and tabular
How can we improve the usability of tabular data?
[1] https://opendatabarometer.org/4thedition/report/#open_data_trends
[2] Umbrich et al. Quality assessment & evolution of open data portals, OBD 2015

Tabular Data Integration
Tables with rich text descriptions [1]
Tables with poor text descriptions [2] (the focus of this paper)
- No headers
- No entity labels
- Large amounts of numerical values (e.g., a column of capacity values)
[1] Dominique et al. Matching HTML Tables to DBpedia, WIMS 2015
[2] Neumaier et al. Multi-level semantic labelling of numerical values. ISWC 2016

Semantic Labeling for Numerical Values
Input: an unknown column of numerical values
Labeled database: columns of numerical values with semantic labels (e.g., length, capacity, height)
All values in a column are
- numeric
- of the same meaning
- on the same scale
Numerical similarity search: similarity(unknown, item) for each item in the labeled database
Output: semantic labels ranked by numerical distance
Ranking | Numerical Distance | Semantic Label
1 | 0.01 | height
2 | 0.5 | length
3 | 1.2 | capacity
4 | … | …
How to calculate numerical similarity?
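The search step can be sketched as follows. This is a minimal illustration, not the EmbNum implementation: `rank_labels` and `mean_gap` are hypothetical names, and `mean_gap` is only a placeholder distance chosen so the example runs. How to define a good distance is the question the rest of the talk addresses.

```python
from typing import Callable, List, Sequence, Tuple

Column = Sequence[float]


def rank_labels(
    unknown: Column,
    database: List[Tuple[str, Column]],
    distance: Callable[[Column, Column], float],
) -> List[Tuple[str, float]]:
    """Score every labeled column against the unknown one, closest first."""
    scored = [(label, distance(unknown, values)) for label, values in database]
    return sorted(scored, key=lambda pair: pair[1])


def mean_gap(a: Column, b: Column) -> float:
    # Placeholder distance (an assumption for illustration only).
    return abs(sum(a) / len(a) - sum(b) / len(b))


# The database is a list of pairs because several columns may share a label.
database = [
    ("length", [10.0, 12.0, 14.0]),
    ("capacity", [500.0, 800.0]),
    ("height", [1.6, 1.7, 1.8]),
]
ranking = rank_labels([1.65, 1.75], database, mean_gap)
ranked_labels = [label for label, _ in ranking]
```

Any function with the same signature can be passed as `distance`, which is what allows hypothesis-test scores or learned embeddings to be swapped in later.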

Numerical Similarity
A and B are two lists of numerical values
Figure: similarity between a labeled list (height) and an unknown list
Challenges:
• A and B rarely have the same set of values
• The sizes of A and B might vary
• No knowledge about the data
+ Type: continuous, discrete
+ Distribution: normal, uniform

Related works
Existing approaches use hypothesis tests, which make assumptions about the type or distribution of the data
• Stonebraker et al. [1]:
Welch's t-test (normal distributions)
• SemanticTyper [2], Neumaier et al. [3]:
Kolmogorov-Smirnov test (continuous)
• DSL [4]:
New similarity metric with a logistic model, combining
- Kolmogorov-Smirnov test (continuous)
- Mann-Whitney test (continuous)
- Numeric Jaccard (value range)
But we have no knowledge of the data type or distribution
→ Learn a similarity metric with no assumptions about the data
[1] Stonebraker et al., Data curation at scale: The data tamer system. CIDR 2013
[2] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015
[3] Neumaier et al., Multi-level semantic labelling of numerical values. ISWC 2016
[4] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016
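For intuition, two of the measures named above can be sketched in plain Python. This is an illustrative sketch, not the baselines' code: `ks_statistic` computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), and `numeric_jaccard` assumes one common reading of "value range" overlap, which may differ in detail from the definition used in DSL.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest vertical gap between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)

    def ecdf(sorted_values, x):
        # Fraction of values that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in sa + sb)


def numeric_jaccard(a: Sequence[float], b: Sequence[float]) -> float:
    """Overlap of the two value ranges divided by their combined range.

    Assumed definition for illustration only.
    """
    overlap = max(0.0, min(max(a), max(b)) - max(min(a), min(b)))
    union = max(max(a), max(b)) - min(min(a), min(b))
    return overlap / union if union > 0 else 1.0


same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
overlap = numeric_jaccard([0, 10], [5, 15])
```

In practice, a statistics library such as SciPy (`scipy.stats.ks_2samp`) would be used to obtain a p-value as well as the statistic.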

Framework
Figure: EmbNum framework, with two phases
Representation Learning (offline): labeled columns (e.g., height, height, capacity) → Preprocessing → CDF → Triplets → Embedding Model, trained with triplet loss
Semantic Labeling (online): Unknown column → Preprocessing → CDF → Learned Embedding Model → Vector → Similarity Search over the Database → Ranking List
Ranking List: 1. height, 2. height, 3. length, 4. capacity, 5. …
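The triplet loss that drives representation learning can be sketched as below. This is the standard formulation, written in plain Python for clarity. The margin value and the use of Euclidean (rather than squared) distance are assumptions, since the slides do not state them.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, for illustration only
) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Embeddings of three columns: anchor and positive share a label, negative does not.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # 1 - 5 + 1 < 0
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # 5 - 1 + 1
```

Minimizing this loss over many triplets pulls columns with the same label together in the embedding space and pushes columns with different labels apart.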

Preprocessing with inverse transform sampling
Why?
- Standardize the input size
- Reduce the computational time
- Retain the cumulative distribution of the original data
- Keep the data in order to leverage the capability of a CNN
How?
- Input: list of numerical values $X = \{x_1, x_2, \dots, x_n\}$
- CDF of $x \in X$:
$F_X(x) = P(X \le x), \quad x \in X, \quad F_X: \mathbb{R} \to [0, 1]$
- Inverse CDF:
$F_X^{-1}(p) = \min\{x : F_X(x) \ge p\}, \quad p \in [0, 1]$
Inverse transform sampling:
Select $h$ numerical values from $X$ with probabilities
$p \in P = \{\, i/h \mid i \in \{1, 2, \dots, h\} \,\}$
Example (figure: decRainDays):
$h = 100$, $P = \{0.01, 0.02, \dots, 1\}$
$\mathrm{sampling}(X) = \{F_X^{-1}(0.01), F_X^{-1}(0.02), \dots, F_X^{-1}(1)\}$
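The sampling step can be sketched in a few lines of plain Python. The function name `inverse_transform_sample` is hypothetical. The sketch relies on the fact that, for an empirical CDF over n values, the smallest x with F(x) ≥ p is the ⌈p·n⌉-th smallest value.

```python
from typing import List, Sequence


def inverse_transform_sample(values: Sequence[float], h: int = 100) -> List[float]:
    """Return h values: the inverse empirical CDF at p = 1/h, 2/h, ..., 1."""
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    n = len(xs)
    samples = []
    for i in range(1, h + 1):
        # ceil(n * i / h) in integer arithmetic, so p = i/h is handled exactly.
        k = -(-n * i // h)
        samples.append(xs[k - 1])
    return samples


short = inverse_transform_sample([10, 20], h=4)      # upsamples a short column
exact = inverse_transform_sample([3, 1, 2, 4], h=4)  # one sample per value
long_column = inverse_transform_sample(list(range(1000)), h=100)
```

The output always has length h and is sorted, regardless of the size of the input column, which gives the fixed-size, ordered input mentioned in the motivation above.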

Evaluation (1)
Datasets:
1. Standard data: City Data [1] [2]
- 300 numerical columns extracted from the City class in DBpedia (normalized data)
2. Real-world data: Open Data
- 500 numerical columns extracted from five Open Data portals:
1) Ireland (data.gov.ie), 2) the UK (data.gov.uk), 3) the EU (data.europa.eu), 4) Canada (open.canada.ca), and 5) Australia (data.gov.au)
Table 1: Description of City Data and Open Data
Data | # Sources | # Labels | # Columns | Rows per column: Min | Max | Median | Average
City | 10 | 30 | 300 | 4 | 2,251 | 113 | 642.73
Open | 10 | 50 | 500 | 4 | 186,082 | 467 | 14,659.63
[1] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016

Evaluation (2)
Metrics:
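- Effectiveness: Mean Reciprocal Rank (MRR); higher means the correct label is ranked nearer the top
- Efficiency: run time in seconds
Baselines:
1. SemanticTyper [1]:
• Kolmogorov-Smirnov test
2. DSL [2]: the metric is a combination of
• Kolmogorov-Smirnov test
• Mann-Whitney test
• Numeric Jaccard
[1] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016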
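As a reminder of how the effectiveness metric is computed, here is a minimal sketch of MRR. The function name and the toy rankings are illustrative and not taken from the experiments.

```python
from typing import List, Sequence


def mean_reciprocal_rank(rankings: Sequence[List[str]], truths: Sequence[str]) -> float:
    """Average of 1/rank of the correct label; a missing label contributes 0."""
    total = 0.0
    for ranked_labels, truth in zip(rankings, truths):
        if truth in ranked_labels:
            total += 1.0 / (ranked_labels.index(truth) + 1)
    return total / len(truths)


# Three queries whose correct label is "height", found at ranks 1, 2, and 3.
rankings = [
    ["height", "length"],
    ["length", "height"],
    ["capacity", "length", "height"],
]
mrr = mean_reciprocal_rank(rankings, ["height", "height", "height"])
```

An MRR of 1.0 means the correct label is always ranked first. Here the value is (1 + 1/2 + 1/3) / 3.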

Evaluation (3)
Experiment setting:
- Representation Learning:
Training: City Data (50%, 5 sources)
- Semantic Labeling:
• City Data (50%, 5 sources*) and Open Data (100%, 10 sources)
• Query: columns of one source
• Database: columns of the other sources
Simulate data growth over time to test effectiveness and efficiency
• City Data (5 sources): 75 experiments
• Open Data (10 sources): 5110 experiments
* These sources are different from the ones used for Representation Learning
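The slides do not spell out how the experiments are enumerated. One reading that reproduces both reported counts is: each source serves as the query in turn, and the database is every non-empty subset of the remaining sources. The sketch below assumes that reading, and `enumerate_experiments` is a hypothetical helper.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def enumerate_experiments(sources: Sequence[str]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Pair each query source with every non-empty subset of the other sources."""
    experiments = []
    for query in sources:
        others = [s for s in sources if s != query]
        for size in range(1, len(others) + 1):  # database grows from 1 source to all
            for database in combinations(others, size):
                experiments.append((query, database))
    return experiments


city = enumerate_experiments([f"s{i}" for i in range(5)])
open_data = enumerate_experiments([f"s{i}" for i in range(10)])
```

Under this assumption the count for n sources is n · (2^(n−1) − 1), which gives 5 · 15 = 75 and 10 · 511 = 5110.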

Semantic Labeling
Figure: results on City Data and Open Data
Efficiency:
- More data, more time
- EmbNum runs faster than SemanticTyper (25 times) and DSL (92 times)
Effectiveness:
- More data, more accurate
- EmbNum significantly outperforms SemanticTyper and DSL (paired t-test)

Summary
• EmbNum:
o Learns representations for numerical values with deep metric learning
o Calculates similarity on these representations
• Advantages:
o Accurate, with no need to make assumptions about data type or distribution
o Fast
o Domain-independent
• Future work:
o Extend the similarity metric
- Multiple scales (meter, centimeter, feet, inch)
- Hierarchical context: height of people who live in Tokyo and were born in Yokohama
o Recognize new semantic types
Figure: an unknown column is mapped into the embedding space and matched to ranked results (1. height, 2. height, 3. length, 4. capacity, 5. …)

Questions & Comments
* phucnt@nii.ac.jp

Representation Learning
Figure: t-SNE visualization of the embedding features of City Data
After learning, similar patterns lie close together
→ The learning method is effective

CDF or PDF
Both the CDF and the PDF represent a probability distribution [1]
PDF: represents probability as areas
CDF: represents probability as vertical distances
In terms of graphical intuition, the CDF is preferable to the PDF, since judging vertical distance is faster and more accurate than judging area.
[1] W.S. Cleveland, The Elements of Graphing Data. NJ, USA: Hobart Press, 1994

EmbNum: Learning and Labeling
Figure: EmbNum architecture
Representation Learning: Labeled Attributes (input) → Preprocessing → Feature Vectors → Triplet Sampling (anchor, positive, negative) → three Embedding Models with shared weights → Triplet Loss → Embedding Model (output)
Semantic Labeling: Unknown Attribute (input) and Labeled Attributes Database → Preprocessing → Feature Extracting with the Embedding Model → Feature Vectors → Similarity Searching → Ranked Results (output)

City Data (Describe)
areaTotal populationDensity areaLand areaWater aprRainInch augRainDays populationTotal aprHighF aprLowF aprMeanF
count 21690 18130 19250 13700 40 320 17830 1130 1130 160
mean 4.09E+08 644.75 3.70E+07 2.64E+09 2.94 9.79 557595.59 66.98 43.59 49.24
std 6.44E+09 3343.07 4.79E+08 2.75E+11 2.8 6.21 5.63E+07 11.13 11.46 9.63
min 0 0 0 0 0.22 0 0 8.5 -4.9 9.7
25% 2.56E+06 189.3 2.18E+06 0 0.99 4.8 1459.25 59 35.92 43.38
50% 8.00E+06 365.8 6.09E+06 50000 2.68 10.6 8407.5 66.55 42.2 47.75
75% 3.02E+07 618.61 1.89E+07 490000 3.52 13.83 30575.25 74.4 49.88 54.3
max 4.63E+11 309709 3.94E+10 3.20E+13 15.42 26.1 7.51E+09 104 87 80
aprRainDays percentageOfAreaWater areaLandKm areaLandSqMi areaTotalKm areaTotalSqMi areaWaterKm decMeanF augHighF augLowF
count 320 860 9840 9400 12160 9470 9560 160 1130 1130
mean 7.94 5.98 52.07 8.94 708.17 11.79 15.66 3.04E+01 85.34 61.4
std 4.72 13.19 678.09 51.29 8594.54 93.82 857.33 1.34E+01 8.28 8.85
min 0 0 0 0 0 0 0 -1.7 41 32.9
25% 4.3 0.1 2.25 0.81 2.8 0.83 0 21.23 81 54.92
50% 8 0.98 6.35 2.24 10.1 2.34 0 29 86.4 61.55
75% 11.22 4.57 20.05 6.9 45.97 7.4 0.2 3.75E+01 91 68
max 25 84.58 39400 2874 462705 4811.5 80000 7.80E+01 110 86
augMeanF populationMetro augRainInch decHighF decLowF areaWaterSqMi decRainDays decRainInch elevationFt elevationM
count 160 970 40 1130 1130 9390 320 40 9720 22510
mean 70.49 2.53E+07 2.49 48.87 29.49 1318.01 6.48 3.73 1113.48 365.82
std 7.22 7.37E+08 3.06 15.75 15.45 127541.37 5.02 4.15 2092.3 3008.53
min 47 2 0.02 -3 -23 0 0 0.28 -177 -314
25% 65.77 172641 0.11 37 20.5 0 2.18 1.83 440 128.02
50% 69.45 608235.5 2.15 47 28.2 0 5.8 2.58 863 261
75% 76.25 2.05E+06 3.5 58.75 37.4 0.09 9.83 3.53 1250.75 391
max 90 2.30E+10 14.83 95.7 81 1.24E+07 27 24.15 171549 445001

Open Data (Describe)
stat | point y | x | y | bng n | bng e | grade | eastitm | quantity | long | clients
count | 16787 | 542158 | 542158 | 37814 | 37814 | 817 | 189 | 251260 | 152366 | 120
mean | 6.90E+06 | 149.14 | -35.29 | 293708.8 | 421808.2 | 6.16 | 530008.61 | 515104.01 | 107.1 | 384.19
std | 11045.1 | 0.15 | 0.24 | 156839.67 | 113008.45 | 3.5 | 2430.79 | 7.52E+06 | 63.83 | 619.37
min | 6.88E+06 | 145.77 | -37.55 | 7962 | 10133 | 0 | 524219.87 | 0 | -10.86 | 1
25% | 6.89E+06 | 149.08 | -35.37 | 177609.75 | 349393 | 4 | 528451.41 | 106 | 1.26 | 34.25
50% | 6.90E+06 | 149.15 | -35.32 | 261787 | 430200 | 7 | 529876.17 | 6310 | 143.87 | 156.5
75% | 6.91E+06 | 149.17 | -35.25 | 387048 | 512040 | 8 | 531961.53 | 107000 | 144.79 | 384.5
max | 6.94E+06 | 151.24 | -32.82 | 1.19E+06 | 654853 | 17 | 535458.49 | 6.68E+08 | 145.28 | 2908

stat | point x | retail trade | trade debtors | lon | eastig | gross interest | cost of sales | cost centre | gross rent | lat
count | 16787 | 9996 | 889 | 4531 | 189 | 15219 | 1214 | 7025 | 15219 | 152979
mean | 539426.47 | 9.75 | 1.30E+09 | 142.01 | 130042.73 | 8.74E+06 | 1.26E+10 | 1508.81 | 2.61E+07 | -16.69
std | 6095.78 | 147.17 | 7.15E+09 | 0.99 | 2431.31 | 1.31E+08 | 4.68E+10 | 395.81 | 3.47E+08 | 38.12
min | 517645.59 | 0 | 0 | 141.55 | 124252.75 | 0 | -581581 | 990 | 0 | -38.37
25% | 535993.55 | 0 | 5.77E+06 | 141.6 | 128485.2 | 162419.5 | 4.44E+07 | 1103 | 465264 | -37.73
50% | 540100.63 | 0 | 4.37E+07 | 141.61 | 129910.27 | 781001 | 4.11E+08 | 1833 | 2.48E+06 | -37.58
75% | 542517.03 | 2 | 3.56E+08 | 141.62 | 131996.07 | 3.85E+06 | 3.64E+09 | 1836 | 1.29E+07 | -37.52
max | 554135.38 | 11325 | 1.14E+11 | 144.99 | 135493.8 | 1.13E+10 | 4.44E+11 | 13304 | 3.58E+10 | 60.61

stat | trade creditors | rent expenses | gifts or donations | quantity quantit | temperature deg c | station altitude | depreciation expenses | superannuation expenses | pm10 teom ug m3 | help assessment debt
count | 889 | 925 | 33878 | 1560203 | 22176 | 1599 | 925 | 925 | 22176 | 15016
mean | 1.07E+09 | 2.61E+08 | 1.19E+06 | 262000.22 | 12.17 | 154.35 | 4.30E+08 | 1.75E+08 | 11.5 | 1.14E+06
std | 5.61E+09 | 1.12E+09 | 1.71E+07 | 3.31E+06 | 4.09 | 260.79 | 2.58E+09 | 7.44E+08 | 7.3 | 1.71E+07
min | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0 | -4.9 | 0
25% | 5.01E+06 | 2.36E+06 | 10450.25 | 73 | 9.4 | 20 | 2.74E+06 | 1.28E+06 | 6.8 | 0
50% | 3.91E+07 | 1.62E+07 | 70994 | 1415 | 11.8 | 59 | 2.11E+07 | 1.02E+07 | 10.1 | 17093
75% | 3.26E+08 | 1.21E+08 | 416793.75 | 21760 | 14.6 | 160 | 1.23E+08 | 6.95E+07 | 14.4 | 305799
max | 8.51E+10 | 1.57E+10 | 2.81E+09 | 4.56E+08 | 35.7 | 1950 | 3.85E+10 | 9.94E+09 | 117.1 | 1.59E+09
time from
payscale
minimum
repairs and
maintenance year inc time period time to statistics year
salary cost of
reports year ann e

Tabular Data Integration with Knowledge Base
Matching tabular data to Knowledge Bases
- Ontology
- Knowledge graph (DBpedia, Wikidata)
- Labeled data