Semantic labeling for numerical values is the task of assigning semantic labels to unknown numerical attributes. The semantic labels can be numerical properties in ontologies, instances in knowledge bases, or labeled data manually annotated by domain experts. In this paper, we treat semantic labeling as a retrieval problem: an unknown attribute receives the label of the most relevant attribute in the labeled data. One of the greatest challenges is that an unknown attribute rarely has the same set of values as its closest match in the labeled data. To overcome this issue, existing studies take a statistical interpretation of the value distribution into account. However, they assume a specific form of distribution. This assumption is inappropriate for open data in particular, where nothing is known about the data in advance. To address these problems, we propose a neural numerical embedding model (EmbNum) that learns useful representation vectors for numerical attributes without prior assumptions on the distribution of the data. The "semantic similarities" between attributes are then measured on these representation vectors by the Euclidean distance. Our experiments on City Data and Open Data show that EmbNum significantly outperforms state-of-the-art methods for numerical attribute semantic labeling in both effectiveness and efficiency.
4. Open Data is a Global Trend
Vision [1]: Make data available
• For everyone
• Transparency
• Potential for Innovation
[1] https://opendatabarometer.org/4thedition/report/#open_data_trends
[2] Umbrich et al. Quality assessment & evolution of open data portals, OBD 2015
Issues in Open Data usability [2]
(82 portals, 160k datasets)
- Missing metadata, context description
- Data are not interlinked
- Most data are structured and tabular
How can we improve the usability of tabular data?
5. Tabular Data Integration
Rich text description [1] vs. poor text description [2]
[1] Ritze et al. Matching HTML Tables to DBpedia, WIMS 2015
[2] Neumaier et al. Multi-level semantic labelling of numerical values. ISWC 2016
- No headers
- No entity label
- Large amount of numerical values
This paper: tables with poor text description (example column label in the figure: capacity)
7. Semantic Labeling for Numerical Values
Input: numerical values of an unknown attribute
All values are:
- numeric
- same meaning
- same scale
Labeled database: numerical values with semantic labels (length, capacity, height)
Numerical similarity search: similarity(unknown, item)
Output: semantic labels ranked by numerical distance
Rank | Numerical distance | Semantic label
1 | 0.01 | height
2 | 0.5 | length
3 | 1.2 | capacity
4 | … | …
How to calculate numerical similarity?
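The retrieval step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each attribute has already been turned into a representation vector; the vectors, the `rank_labels` helper, and the toy database below are hypothetical and not taken from the slides.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_labels(query_vec, labeled_db):
    """Rank labeled attributes by distance to the unknown attribute.

    labeled_db is a list of (label, vector) pairs (hypothetical layout).
    Returns (rank, distance, label) tuples, closest first.
    """
    scored = sorted(
        ((euclidean(query_vec, vec), label) for label, vec in labeled_db),
        key=lambda pair: pair[0],
    )
    return [(rank, dist, label) for rank, (dist, label) in enumerate(scored, start=1)]


# Toy vectors chosen only for illustration.
labeled_db = [
    ("capacity", [3.0, 4.0]),
    ("height", [0.1, 0.0]),
    ("length", [0.0, 0.5]),
]
ranking = rank_labels([0.0, 0.0], labeled_db)
for rank, dist, label in ranking:
    print(rank, round(dist, 2), label)
```

The top-ranked label is the one assigned to the unknown attribute; everything hinges on how the vectors, and hence the distances, are computed.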
8. Numerical Similarity
A and B are two lists of numerical values
Challenges:
• A and B rarely have the same set of values
• The sizes of A and B may vary
• No prior knowledge about the data
+ Type: continuous or discrete
+ Distribution: normal, uniform
[Figure: similarity between two lists of values, A and B; labels shown: height, unknown]
9. Related works
Hypothesis tests that assume a data type or distribution:
• Stonebraker et al. [1]:
Welch's t-test (normal distributions)
• SemanticTyper [2], Neumaier et al. [3]:
Kolmogorov-Smirnov test (continuous)
• DSL [4]:
New similarity metric built with a logistic model over:
- Kolmogorov-Smirnov test (continuous)
- Mann-Whitney test (continuous)
- Numeric Jaccard (value range)
[1] Stonebraker et al., Data curation at scale: The data tamer system. In: CIDR 2013
[2] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015
[3] Neumaier et al., Multi-level semantic labelling of numerical values, ISWC 2016
[4] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016
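To make the baseline measures concrete, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic and of a range-overlap score. The function names are hypothetical, and `range_jaccard` is only one plausible reading of "Numeric Jaccard (value range)"; the exact definition used in DSL may differ.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest vertical gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The supremum of the gap between two step functions is attained at a data point.
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def range_jaccard(a, b):
    """Overlap of the value ranges [min, max] divided by their union (assumed definition)."""
    overlap = min(max(a), max(b)) - max(min(a), min(b))
    union = max(max(a), max(b)) - min(min(a), min(b))
    if union == 0:
        return 1.0  # both lists collapse to the same single value
    return max(0.0, overlap) / union


print(ks_statistic([1, 2, 3], [10, 11, 12]))
print(range_jaccard([0, 10], [5, 15]))
```

Both measures compare raw value lists directly, which is why the type and distribution assumptions of the underlying tests matter.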
But there is no prior knowledge of the data type or distribution.
So: learn a similarity metric that makes no assumptions about the data.
14. Evaluation (1)
Datasets:
1. Standard data: City Data [1] [2]
- 300 numerical columns extracted from the City class in DBpedia (normalized data)
2. Real-world data: Open Data
- 500 numerical columns extracted from five Open Data portals:
1) Ireland (data.gov.ie), 2) the UK (data.gov.uk), 3) the EU (data.europa.eu), 4) Canada (open.canada.ca), and 5) Australia (data.gov.au)
Table 1: Description of City Data and Open Data
Data | # Sources | # Labels | # Columns | Rows per column: Min | Max | Median | Average
City | 10 | 30 | 300 | 4 | 2,251 | 113 | 642.73
Open | 10 | 50 | 500 | 4 | 186,082 | 467 | 14,659.63
[1] Ramnandan et al.: Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al.: Semantic labeling: a domain-independent approach. ISWC 2016
15. Evaluation (2)
Metrics:
- Effectiveness: MRR (mean reciprocal rank) score, which is higher when the correct label is ranked nearer the top
- Efficiency: Run-Time in seconds
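As a reminder of how the effectiveness metric works, the following sketch computes the mean reciprocal rank over a set of queries. The helper name and the toy rankings are hypothetical; a query whose correct label never appears contributes 0.

```python
def mean_reciprocal_rank(ranked_lists, truths):
    """Average of 1 / (rank of the correct label) over all queries."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # ranks start at 1
    return total / len(truths)


# Three toy queries whose correct label is "height",
# found at ranks 1, 2, and 3 respectively.
rankings = [
    ["height", "length"],
    ["length", "height", "capacity"],
    ["capacity", "length", "height"],
]
mrr = mean_reciprocal_rank(rankings, ["height", "height", "height"])
print(round(mrr, 4))  # (1 + 1/2 + 1/3) / 3
```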
Baselines:
1. SemanticTyper [1]:
• Kolmogorov-Smirnov test
2. DSL [2]: the metric is a combination of
• Kolmogorov-Smirnov test
• Mann-Whitney test
• Numeric Jaccard
[1] Ramnandan et al.: Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al.: Semantic labeling: a domain-independent approach. ISWC 2016
16. Evaluation (3)
Experiment settings:
- Representation learning:
Training: City Data (50% - 5 sources)
- Semantic labeling:
• City Data (50% - 5 sources*) and Open Data (100% - 10 sources)
• Query: columns of one source
• Database: columns of the other sources
Simulate data growth over time to test effectiveness and efficiency:
• City Data (5 sources): 75 experiments
• Open Data (10 sources): 5110 experiments
* This data is not the same as the data used in representation learning
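The slide does not spell out how the experiment counts are obtained. One counting that reproduces both figures, under the assumption that every source serves as the query once for each non-empty subset of the remaining sources used as the database, is n × (2^(n−1) − 1). The sketch below enumerates those pairs; the function name is hypothetical.

```python
from itertools import combinations


def enumerate_experiments(sources):
    """Pair each query source with every non-empty subset of the other sources.

    Assumed protocol: the growing database is simulated by using
    1, 2, ..., n-1 of the remaining sources.
    """
    experiments = []
    for query in sources:
        others = [s for s in sources if s != query]
        for size in range(1, len(others) + 1):
            for database in combinations(others, size):
                experiments.append((query, database))
    return experiments


city_count = len(enumerate_experiments(list(range(5))))    # 5 * (2**4 - 1)
open_count = len(enumerate_experiments(list(range(10))))   # 10 * (2**9 - 1)
print(city_count, open_count)
```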
17. Semantic Labeling
[Figure panels: City Data, Open Data]
Efficiency:
- More data, more time
- EmbNum runs faster than SemanticTyper (25 times) and DSL (92 times)
Effectiveness:
- More data, more accurate
- EmbNum significantly outperforms SemanticTyper and DSL (paired t-test)
19. Summary
• EmbNum:
o Learns representations for numerical values with deep metric learning
o Calculates similarity on these representations
• Advantages:
o Accurate, with no need to make assumptions about data type or distribution
o Fast
o Domain independent
• Future work:
o Extend the similarity metric
- Multiple scales (meter, centimeter, feet, inch)
- Hierarchical context: height of people who live in Tokyo and were born in Yokohama
o Recognize new semantic types
[Figure: an unknown attribute is placed in the embedding space; ranked results: 1. height, 2. height, 3. length, 4. capacity, 5. …]
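The summary names deep metric learning but these slides do not state the training objective. As an illustration only, the sketch below shows a triplet margin loss, a common choice in metric learning; treating it as the objective here is an assumption, and the function names and margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the same-label pair is closer than the different-label pair by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated triplet: the positive is near the anchor, the negative is far.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Violating triplet: the positive is farther away than the negative.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

Minimizing such a loss pulls attributes with the same label together in the embedding space, which is what makes a plain Euclidean ranking meaningful afterwards.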
20. EmbNum: Semantic Labeling for Numerical Values with Deep Metric Learning
Phuc NGUYEN*, Khai NGUYEN, Ryutaro ICHISE, Hideaki TAKEDA
JIST 2018: The 8th Joint International Semantic Technology Conference
November 27, 2018
Questions & Comments
* phucnt@nii.ac.jp
23. CDF or PDF
CDF and PDF both represent a probability distribution [1]
PDF: represents probability as areas
CDF: represents probability as vertical distances
In terms of graphical intuition, CDF is better than PDF, since vertical distance can be calculated faster and more accurately than area.
[1] W.S. Cleveland, The Elements of Graphing Data. NJ, USA: Hobart Press 1994
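A small sketch of the empirical CDF shows why it is convenient for comparing columns: reading it off at a fixed grid of points yields a profile of the same length whatever the column size. The helper names and the grid are hypothetical, and this is not claimed to be the exact input representation used by EmbNum.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function x -> fraction of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def cdf_profile(values, grid):
    """Evaluate the empirical CDF at each grid point (fixed-length summary)."""
    cdf = ecdf(values)
    return [cdf(x) for x in grid]


grid = [140, 160, 175, 200]
short_column = [150, 160, 160, 170, 180]
long_column = [150, 152, 155, 160, 161, 165, 170, 171, 180, 190]
profile_short = cdf_profile(short_column, grid)
profile_long = cdf_profile(long_column, grid)
print(profile_short)
print(profile_long)
```

Two profiles can then be compared point by point as vertical distances, with no area computation.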
26. Open Data (Describe)
statistic | point y | x | y | bng n | bng e | grade | eastitm | quantity | long | clients
count | 16787 | 542158 | 542158 | 37814 | 37814 | 817 | 189 | 251260 | 152366 | 120
mean | 6.90E+06 | 149.14 | -35.29 | 293708.8 | 421808.2 | 6.16 | 530008.61 | 515104.01 | 107.1 | 384.19
std | 11045.1 | 0.15 | 0.24 | 156839.67 | 113008.45 | 3.5 | 2430.79 | 7.52E+06 | 63.83 | 619.37
min | 6.88E+06 | 145.77 | -37.55 | 7962 | 10133 | 0 | 524219.87 | 0 | -10.86 | 1
25% | 6.89E+06 | 149.08 | -35.37 | 177609.75 | 349393 | 4 | 528451.41 | 106 | 1.26 | 34.25
50% | 6.90E+06 | 149.15 | -35.32 | 261787 | 430200 | 7 | 529876.17 | 6310 | 143.87 | 156.5
75% | 6.91E+06 | 149.17 | -35.25 | 387048 | 512040 | 8 | 531961.53 | 107000 | 144.79 | 384.5
max | 6.94E+06 | 151.24 | -32.82 | 1.19E+06 | 654853 | 17 | 535458.49 | 6.68E+08 | 145.28 | 2908

statistic | point x | retail trade | trade debtors | lon | eastig | gross interest | cost of sales | cost centre | gross rent | lat
count | 16787 | 9996 | 889 | 4531 | 189 | 15219 | 1214 | 7025 | 15219 | 152979
mean | 539426.47 | 9.75 | 1.30E+09 | 142.01 | 130042.73 | 8.74E+06 | 1.26E+10 | 1508.81 | 2.61E+07 | -16.69
std | 6095.78 | 147.17 | 7.15E+09 | 0.99 | 2431.31 | 1.31E+08 | 4.68E+10 | 395.81 | 3.47E+08 | 38.12
min | 517645.59 | 0 | 0 | 141.55 | 124252.75 | 0 | -581581 | 990 | 0 | -38.37
25% | 535993.55 | 0 | 5.77E+06 | 141.6 | 128485.2 | 162419.5 | 4.44E+07 | 1103 | 465264 | -37.73
50% | 540100.63 | 0 | 4.37E+07 | 141.61 | 129910.27 | 781001 | 4.11E+08 | 1833 | 2.48E+06 | -37.58
75% | 542517.03 | 2 | 3.56E+08 | 141.62 | 131996.07 | 3.85E+06 | 3.64E+09 | 1836 | 1.29E+07 | -37.52
max | 554135.38 | 11325 | 1.14E+11 | 144.99 | 135493.8 | 1.13E+10 | 4.44E+11 | 13304 | 3.58E+10 | 60.61

statistic | trade creditors | rent expenses | gifts or donations | quantity quantit | temperature deg c | station altitude | depreciation expenses | superannuation expenses | pm10 teom ug m3 | help assessment debt
count | 889 | 925 | 33878 | 1560203 | 22176 | 1599 | 925 | 925 | 22176 | 15016
mean | 1.07E+09 | 2.61E+08 | 1.19E+06 | 262000.22 | 12.17 | 154.35 | 4.30E+08 | 1.75E+08 | 11.5 | 1.14E+06
std | 5.61E+09 | 1.12E+09 | 1.71E+07 | 3.31E+06 | 4.09 | 260.79 | 2.58E+09 | 7.44E+08 | 7.3 | 1.71E+07
min | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0 | -4.9 | 0
25% | 5.01E+06 | 2.36E+06 | 10450.25 | 73 | 9.4 | 20 | 2.74E+06 | 1.28E+06 | 6.8 | 0
50% | 3.91E+07 | 1.62E+07 | 70994 | 1415 | 11.8 | 59 | 2.11E+07 | 1.02E+07 | 10.1 | 17093
75% | 3.26E+08 | 1.21E+08 | 416793.75 | 21760 | 14.6 | 160 | 1.23E+08 | 6.95E+07 | 14.4 | 305799
max | 8.51E+10 | 1.57E+10 | 2.81E+09 | 4.56E+08 | 35.7 | 1950 | 3.85E+10 | 9.94E+09 | 117.1 | 1.59E+09
[Header of a further table, values not recovered: time from, payscale minimum, repairs and maintenance, year inc, time period, time to, statistics year, salary cost of reports, year ann e]
27. Tabular Data Integration with Knowledge Base Matching
Matching tabular data to knowledge bases:
- Ontology
- Knowledge graph (DBpedia, Wikidata)
- Labeled data