EmbNum:
Semantic Labeling for Numerical
Values with Deep Metric Learning
Phuc NGUYEN*, Khai NGUYEN, Ryutaro ICHISE, Hideaki TAKEDA
JIST 2018: The 8th Joint International Semantic Technology Conference
November 27, 2018
* phucnt@nii.ac.jp
Agenda
1. Background
2. Problem & Challenges
3. Approach
4. Evaluation
5. Summary
6. Backup
1. Background
Open Data is a Global Trend
Vision [1]: Make data available
• For everyone
• Transparency
• Potential for Innovation
[1] https://opendatabarometer.org/4thedition/report/#open_data_trends
[2] Umbrich et al. Quality assessment & evolution of open data portals, OBD 2015
Issues in Open Data Usability [2]
(82 portals, 160k datasets)
- Missing metadata and context description
- Data is not interlinked
- Most of the data is structured and tabular
How can we improve the usability of tabular data?
Tabular Data Integration
Tables with good text descriptions [1] vs. tables with poor text descriptions [2]
[1] Ritze et al. Matching HTML Tables to DBpedia, WIMS 2015
[2] Neumaier et al. Multi-level semantic labelling of numerical values. ISWC 2016
Tables with poor text descriptions (the focus of this paper):
- No headers
- No entity labels
- Large amount of numerical values
2. Problem & Challenges
Semantic Labeling for Numerical Values
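Input: an unknown list of numerical values
Labeled database: lists of numerical values, each with a semantic label (length, capacity, height)
Numerical similarity searching: similarity(unknown, item) for each item in the labeled database
Output: semantic labels ranked by numerical distance
Ranking | Numerical Distance | Semantic Label
1 | 0.01 | height
2 | 0.5 | length
3 | 1.2 | capacity
4 | … | …
How to calculate numerical similarity?
All values in a list are
- numeric
- of the same meaning
- on the same scale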
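The search-and-rank step on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the toy database, and the placeholder distance (absolute difference of means) are assumptions made here, not part of the slides, which leave the similarity function open at this point.

```python
def mean_gap(a, b):
    # Placeholder distance (an assumption for illustration only):
    # absolute difference between the means of the two lists.
    return abs(sum(a) / len(a) - sum(b) / len(b))


def rank_labels(unknown, database, distance=mean_gap):
    """Rank the labeled items by their distance to the unknown list.

    database is a list of (label, values) pairs.
    Returns a list of (rank, distance, label), closest first.
    """
    scored = [(distance(unknown, values), label) for label, values in database]
    scored.sort(key=lambda pair: pair[0])
    return [(rank, dist, label) for rank, (dist, label) in enumerate(scored, start=1)]


# Hypothetical labeled database and query.
database = [
    ("capacity", [1000.0, 5000.0]),
    ("height", [1.6, 1.7, 1.8]),
    ("length", [10.0, 20.0, 30.0]),
]
ranking = rank_labels([1.65, 1.75], database)
for rank, dist, label in ranking:
    print(rank, round(dist, 2), label)
```

Any other distance with the same two-argument signature can be passed in through the distance parameter.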
Numerical Similarity
A and B are two lists of numerical values (for example, A is a labeled "height" column and B is an unknown column)
Challenges:
• A and B rarely have the same set of values
• The sizes of A and B might vary
• No knowledge about the data
+ Type: continuous, discrete
+ Distribution: normal, uniform
Related works
Use hypothesis tests, which make assumptions about the type or distribution of the data
• Stonebraker et al. [1]:
Welch's t-test (normal distributions)
• SemanticTyper [2], Neumaier et al. [3]:
Kolmogorov-Smirnov test (continuous)
• DSL [4]:
New similarity with a logistic model.
- Kolmogorov-Smirnov test (continuous)
- Mann-Whitney test (continuous)
- Numeric Jaccard (value range)
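As a concrete reference point for the tests listed above, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic: the largest vertical gap between the empirical CDFs of two lists. It is written in plain Python for illustration; the function names are chosen here, and it computes only the statistic, not the p-value that the cited systems may also use.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of a sample as a function."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Fraction of values in the sample that are <= x.
        return bisect_right(ordered, x) / n

    return cdf


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    # The maximum gap is always attained at one of the observed values.
    points = sorted(set(a) | set(b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the two empirical CDFs coincide, and 1 means the value ranges do not overlap at all.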
[1] Stonebraker et al., Data curation at scale: The data tamer system. In: CIDR 2013
[2] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015
[3] Neumaier et al., Multi-level semantic labelling of numerical values, ISWC 2016
[4] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016
But we have no knowledge of the data type or distribution.
Idea: learn a similarity metric that makes no assumption about the data.
3. Approach
Framework
[Figure: EmbNum framework]
Representation learning (offline): labeled columns (height, height, capacity) → preprocessing → CDF vectors → triplets → embedding model trained with triplet loss
Semantic labeling (online): database columns (height, height, capacity) and the unknown column → preprocessing → CDF vectors → learned embedding model → similarity search → ranking list (1. height, 2. height, 3. length, 4. capacity, 5. …)
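The framework names a triplet loss for training the embedding model. A minimal sketch of the standard margin-based form is shown below; the squared Euclidean distance and the margin value of 1.0 are assumptions made for illustration, since the slides do not state which variant or margin EmbNum uses.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin (margin=1.0 is an assumed value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-dimensional embeddings.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))  # satisfied triplet
print(triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]))  # violated triplet
```

Here the anchor and the positive would be embeddings of two columns with the same label (for example, two height columns), and the negative an embedding of a column with a different label (for example, capacity).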
Preprocessing with inverse transform sampling
Why?
- Standardize the input size
- Reduce the computational time
- Retain the cumulative distribution of the original data
- Put the data in order to leverage the capability of the CNN
How? (example attribute: decRainDays)
- Input: a list of numerical values $V = \{v_1, v_2, \dots, v_n\}$
- CDF of $v \in V$: $F_V(v) = P(V \le v),\ v \in V,\ F_V: \mathbb{R} \to [0, 1]$
- Inverse CDF: $F_V^{-1}(p) = \min\{v : F_V(v) \ge p\},\ p \in [0, 1]$
Inverse transform sampling:
Select $h$ numerical values from $V$ with probabilities $p \in P = \{\, i/h \mid i \in \{1, 2, \dots, h\} \,\}$
Example:
$h = 100$, $P = \{0.01, 0.02, \dots, 1\}$
$\mathrm{sampling}(V) = \{F_V^{-1}(0.01), F_V^{-1}(0.02), \dots, F_V^{-1}(1)\}$
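The preprocessing step can be sketched directly from the definitions on this slide: sort the values, then for each probability i/h take the smallest value whose empirical CDF reaches it. The function name and the integer-ceiling trick below are choices made here for illustration, not taken from the authors' implementation.

```python
def inverse_transform_sample(values, h=100):
    """Return h values read off the inverse empirical CDF at i/h, i = 1..h.

    The output always has length h and is sorted in ascending order,
    regardless of how many values the input list contains.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    sample = []
    for i in range(1, h + 1):
        # Smallest v with CDF(v) >= i/h is the k-th smallest value,
        # where k = ceil(n * i / h). Integer arithmetic avoids
        # floating-point rounding at the boundaries.
        k = -(-n * i // h)
        sample.append(ordered[k - 1])
    return sample


print(inverse_transform_sample([10, 20], h=4))   # short list is stretched
print(inverse_transform_sample([3, 1, 2], h=3))  # same size: sorted copy
```

Because the output length is fixed at h, columns with 4 rows and columns with more than 100,000 rows end up as vectors of the same size.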
4. Evaluation
Evaluation (1)
Datasets:
1. Standard data: City Data [1] [2]
- 300 numerical columns extracted from the City class in DBpedia (normalized data)
2. Real-world data: Open Data
- 500 numerical columns extracted from five Open Data portals:
1) Ireland (data.gov.ie), 2) the UK (data.gov.uk), 3) the EU (data.europa.eu), 4) Canada (open.canada.ca), and 5) Australia (data.gov.au)
Table 1: Description of City Data and Open Data
Data | # Sources | # Labels | # Columns | Rows per column: Min | Max | Median | Average
City | 10 | 30 | 300 | 4 | 2,251 | 113 | 642.73
Open | 10 | 50 | 500 | 4 | 186,082 | 467 | 14,659.63
[1] Ramnandan et al.: Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al.: Semantic labeling: a domain-independent approach. ISWC 2016
Evaluation (2)
Metrics:
- Effectiveness: Mean Reciprocal Rank (MRR) score (probability of correctness)
- Efficiency: Run-Time in seconds
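For reference, the mean reciprocal rank can be computed as below: each query contributes 1/rank of its correct label in the returned ranking, and the contributions are averaged. Counting a query as 0 when the correct label is absent from the ranking is an assumption of this sketch, as are the function name and the toy data.

```python
def mean_reciprocal_rank(rankings, true_labels):
    """Average of 1/rank of the correct label over all queries.

    rankings[i] is the ranked list of labels returned for query i;
    true_labels[i] is its correct label. A query whose correct label
    is missing from the ranking contributes 0 (an assumption).
    """
    total = 0.0
    for ranking, truth in zip(rankings, true_labels):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_labels)


# Hypothetical rankings for three queries.
rankings = [
    ["height", "length"],    # correct label at rank 1
    ["length", "height"],    # correct label at rank 2
    ["capacity", "length"],  # correct label not returned
]
truths = ["height", "height", "population"]
print(mean_reciprocal_rank(rankings, truths))
```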
Baselines:
1. SemanticTyper [1]:
• Kolmogorov-Smirnov test
2. DSL [2]: the metric is a combination of
• Kolmogorov-Smirnov test
• Mann-Whitney test
• Numeric Jaccard
[1] Ramnandan et al.: Assigning semantic labels to data sources. ESWC 2015
[2] Pham et al.: Semantic labeling: a domain-independent approach. ISWC 2016
Evaluation (3)
Experiment Setting:
- Representation Learning:
Training: City Data (50% - 5 sources)
- Semantic Labeling:
• City Data (50% - 5 sources*) and Open Data (100% - 10 sources)
• Query: column of one source
• Database: columns of the other sources.
Simulate data increasing over time to test effectiveness and efficiency:
• City Data (5 sources): 75 experiments
• Open Data (10 sources): 5110 experiments
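The experiment counts above are consistent with one particular reading of the setup, offered here as an assumption rather than as the authors' stated procedure: each source takes a turn as the query, and the database is every non-empty subset of the remaining sources. That gives n × (2^(n−1) − 1) experiments for n sources, which is 5 × 15 = 75 and 10 × 511 = 5110.

```python
from itertools import combinations


def enumerate_experiments(sources):
    """List (query_source, database_sources) pairs.

    Assumed reading: every source is used once as the query, paired
    with every non-empty subset of the other sources as the database.
    """
    experiments = []
    for query in sources:
        others = [s for s in sources if s != query]
        for size in range(1, len(others) + 1):
            for database in combinations(others, size):
                experiments.append((query, database))
    return experiments


print(len(enumerate_experiments(list(range(5)))))   # 5 sources
print(len(enumerate_experiments(list(range(10)))))  # 10 sources
```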
* This data is not the same as the data used in representation learning
Semantic Labeling
[Figures: results on City Data and Open Data]
Efficiency:
- More data, more time
- EmbNum runs faster than SemanticTyper (25 times) and DSL (92 times)
Effectiveness:
- More data, more accurate
- EmbNum significantly outperforms SemanticTyper and DSL (paired t-test)
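The paired t-test mentioned above compares two methods on the same set of experiments by testing whether the mean of the per-experiment score differences is zero. The sketch below computes only the t statistic; the function name is chosen here, and the scores in the usage lines are arbitrary toy numbers, not results from the evaluation.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on two equally long score lists.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise
    differences. Raises ZeroDivisionError if all differences are equal.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Toy numbers only, chosen so the differences are 1, 2, 3.
t_value = paired_t_statistic([5, 7, 9], [4, 5, 6])
print(round(t_value, 4))
```

The statistic is then compared against a t distribution with n − 1 degrees of freedom to obtain a p-value.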
5. Summary
Summary
• EmbNum:
o Learns representations for numerical values with deep metric learning
o Calculates similarity on these representations
• Advantages:
o Accurate, with no need to make assumptions about data type or distribution
o Fast
o Domain-independent
• Future work:
o Extend the similarity metric
- Multiple scales (meter, centimeter, feet, inch)
- Hierarchical context: the height of people who live in Tokyo and were born in Yokohama
o Recognize new semantic types
[Figure: an unknown column is mapped into the embedding space; ranked results: 1. height, 2. height, 3. length, 4. capacity, 5. …]
Questions & Comments
6. Backup
Representation Learning
[Figure: t-SNE visualization of the embedding features of City Data]
After learning, similar patterns lie close together, which suggests that the learning method is effective.
CDF or PDF
The CDF and the PDF both represent a probability distribution [1]
PDF: represents probability with areas
CDF: represents probability with vertical distances
In terms of graphical intuition, the CDF is better than the PDF, since the calculation of vertical distance is faster and more accurate than that of area.
[1] W.S. Cleveland, The Elements of Graphing Data. NJ, USA: Hobart Press 1994
EmbNum: Learning and Labeling
[Figure: two pipelines]
Representation learning: labeled attributes (input) → preprocessing → feature vectors → triplet sampling (anchor, positive, negative) → three embedding models with shared weights → triplet loss → embedding model (output)
Semantic labeling: unknown attribute (input) → preprocessing → feature extracting with the embedding model → feature vector → similarity searching against the feature vectors of the labeled attributes in the database → ranked results (output)
City Data (Describe)
areaTotal populationDensity areaLand areaWater aprRainInch augRainDays populationTotal aprHighF aprLowF aprMeanF
count 21690 18130 19250 13700 40 320 17830 1130 1130 160
mean 4.09E+08 644.75 3.70E+07 2.64E+09 2.94 9.79 557595.59 66.98 43.59 49.24
std 6.44E+09 3343.07 4.79E+08 2.75E+11 2.8 6.21 5.63E+07 11.13 11.46 9.63
min 0 0 0 0 0.22 0 0 8.5 -4.9 9.7
25% 2.56E+06 189.3 2.18E+06 0 0.99 4.8 1459.25 59 35.92 43.38
50% 8.00E+06 365.8 6.09E+06 50000 2.68 10.6 8407.5 66.55 42.2 47.75
75% 3.02E+07 618.61 1.89E+07 490000 3.52 13.83 30575.25 74.4 49.88 54.3
max 4.63E+11 309709 3.94E+10 3.20E+13 15.42 26.1 7.51E+09 104 87 80
aprRainDays percentageOfAreaWater areaLandKm areaLandSqMi areaTotalKm areaTotalSqMi areaWaterKm decMeanF augHighF augLowF
count 320 860 9840 9400 12160 9470 9560 160 1130 1130
mean 7.94 5.98 52.07 8.94 708.17 11.79 15.66 3.04E+01 85.34 61.4
std 4.72 13.19 678.09 51.29 8594.54 93.82 857.33 1.34E+01 8.28 8.85
min 0 0 0 0 0 0 0 -1.7 41 32.9
25% 4.3 0.1 2.25 0.81 2.8 0.83 0 21.23 81 54.92
50% 8 0.98 6.35 2.24 10.1 2.34 0 29 86.4 61.55
75% 11.22 4.57 20.05 6.9 45.97 7.4 0.2 3.75E+01 91 68
max 25 84.58 39400 2874 462705 4811.5 80000 7.80E+01 110 86
augMeanF populationMetro augRainInch decHighF decLowF areaWaterSqMi decRainDays decRainInch elevationFt elevationM
count 160 970 40 1130 1130 9390 320 40 9720 22510
mean 70.49 2.53E+07 2.49 48.87 29.49 1318.01 6.48 3.73 1113.48 365.82
std 7.22 7.37E+08 3.06 15.75 15.45 127541.37 5.02 4.15 2092.3 3008.53
min 47 2 0.02 -3 -23 0 0 0.28 -177 -314
25% 65.77 172641 0.11 37 20.5 0 2.18 1.83 440 128.02
50% 69.45 608235.5 2.15 47 28.2 0 5.8 2.58 863 261
75% 76.25 2.05E+06 3.5 58.75 37.4 0.09 9.83 3.53 1250.75 391
max 90 2.30E+10 14.83 95.7 81 1.24E+07 27 24.15 171549 445001
Open Data (Describe)
stat | point y | x | y | bng n | bng e | grade | eastitm | quantity | long | clients
count | 16787 | 542158 | 542158 | 37814 | 37814 | 817 | 189 | 251260 | 152366 | 120
mean | 6.90E+06 | 149.14 | -35.29 | 293708.8 | 421808.2 | 6.16 | 530008.61 | 515104.01 | 107.1 | 384.19
std | 11045.1 | 0.15 | 0.24 | 156839.67 | 113008.45 | 3.5 | 2430.79 | 7.52E+06 | 63.83 | 619.37
min | 6.88E+06 | 145.77 | -37.55 | 7962 | 10133 | 0 | 524219.87 | 0 | -10.86 | 1
25% | 6.89E+06 | 149.08 | -35.37 | 177609.75 | 349393 | 4 | 528451.41 | 106 | 1.26 | 34.25
50% | 6.90E+06 | 149.15 | -35.32 | 261787 | 430200 | 7 | 529876.17 | 6310 | 143.87 | 156.5
75% | 6.91E+06 | 149.17 | -35.25 | 387048 | 512040 | 8 | 531961.53 | 107000 | 144.79 | 384.5
max | 6.94E+06 | 151.24 | -32.82 | 1.19E+06 | 654853 | 17 | 535458.49 | 6.68E+08 | 145.28 | 2908

stat | point x | retail trade | trade debtors | lon | eastig | gross interest | cost of sales | cost centre | gross rent | lat
count | 16787 | 9996 | 889 | 4531 | 189 | 15219 | 1214 | 7025 | 15219 | 152979
mean | 539426.47 | 9.75 | 1.30E+09 | 142.01 | 130042.73 | 8.74E+06 | 1.26E+10 | 1508.81 | 2.61E+07 | -16.69
std | 6095.78 | 147.17 | 7.15E+09 | 0.99 | 2431.31 | 1.31E+08 | 4.68E+10 | 395.81 | 3.47E+08 | 38.12
min | 517645.59 | 0 | 0 | 141.55 | 124252.75 | 0 | -581581 | 990 | 0 | -38.37
25% | 535993.55 | 0 | 5.77E+06 | 141.6 | 128485.2 | 162419.5 | 4.44E+07 | 1103 | 465264 | -37.73
50% | 540100.63 | 0 | 4.37E+07 | 141.61 | 129910.27 | 781001 | 4.11E+08 | 1833 | 2.48E+06 | -37.58
75% | 542517.03 | 2 | 3.56E+08 | 141.62 | 131996.07 | 3.85E+06 | 3.64E+09 | 1836 | 1.29E+07 | -37.52
max | 554135.38 | 11325 | 1.14E+11 | 144.99 | 135493.8 | 1.13E+10 | 4.44E+11 | 13304 | 3.58E+10 | 60.61

stat | trade creditors | rent expenses | gifts or donations | quantity quantit | temperature deg c | station altitude | depreciation expenses | superannuation expenses | pm10 teom ug m3 | help assessment debt
count | 889 | 925 | 33878 | 1560203 | 22176 | 1599 | 925 | 925 | 22176 | 15016
mean | 1.07E+09 | 2.61E+08 | 1.19E+06 | 262000.22 | 12.17 | 154.35 | 4.30E+08 | 1.75E+08 | 11.5 | 1.14E+06
std | 5.61E+09 | 1.12E+09 | 1.71E+07 | 3.31E+06 | 4.09 | 260.79 | 2.58E+09 | 7.44E+08 | 7.3 | 1.71E+07
min | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0 | -4.9 | 0
25% | 5.01E+06 | 2.36E+06 | 10450.25 | 73 | 9.4 | 20 | 2.74E+06 | 1.28E+06 | 6.8 | 0
50% | 3.91E+07 | 1.62E+07 | 70994 | 1415 | 11.8 | 59 | 2.11E+07 | 1.02E+07 | 10.1 | 17093
75% | 3.26E+08 | 1.21E+08 | 416793.75 | 21760 | 14.6 | 160 | 1.23E+08 | 6.95E+07 | 14.4 | 305799
max | 8.51E+10 | 1.57E+10 | 2.81E+09 | 4.56E+08 | 35.7 | 1950 | 3.85E+10 | 9.94E+09 | 117.1 | 1.59E+09
time from
payscale
minimum
repairs and
maintenance year inc time period time to statistics year
salary cost of
reports year ann e
Tabular Data Integration with Knowledge Base Matching
Matching tabular data to knowledge bases:
- Ontology
- Knowledge graph (DBpedia, Wikidata)
- Label data

EmbNum: Semantic Labeling for Numerical Values with Deep Metric Learning

  • 1. EmbNum: Semantic Labeling for Numerical Values with Deep Metric Learning Phuc NGUYEN*, Khai NGUYEN, Ryutaro ICHISE, Hideaki TAKEDA JIST 2018: The 8th Join International Semantic Technology Conference November 27, 2018 1* phucnt@nii.ac.jp
  • 2. Agenda 1. Background 2. Problem & Challenges 3. Approach 4. Evaluation 5. Summary 6. Backup
  • 4. Open Data is a Global Trend 4 Vision [1]: Make data available • For everyone • Transparency • Potential for Innovation [1] https://opendatabarometer.org/4thedition/report/#open_data_trends [2] Umbrich et al. Quality assessment & evolution of open data portals, OBD 2015 Issue in Open Data Usability [2] (82 portals, 160k datasets) - Missing metadata, context description - Data no interlink - Most of data structured and tabular How can we improve usability of tabular data? ?
  • 5. Tabular Data Integration 5 well text description [1] poor text description [2] [1] Dominique et al. Matching HTML Tables to DBpedia, WIMS 2015 [2] Neumaier et al. Multi-level semantic labelling of numerical values. ISWC 2016 - No headers - No entity label - Large amount of numerical values capacity This paper
  • 7. Semantic Labeling for Numerical Values 7 Labeled Database lengthcapacity height Semantic Label Numerical Values Numerical Similarity Searching Output Ranking Numerical Distance Semantic Labels 1 0.01 height 2 0.5 length 3 1.2 capacity 4 … … How to calculate numerical similarity? similarity(unknown, item) unknown Input All values measure - numeric - same meaning - same scale
  • 8. Numerical Similarity A, B is two lists of numerical values Challenges: • A & B rarely have the same set of values • A, B size might vary • No knowledge about data + Type: continuous, discrete + Distribution: normal, uniform 8 height Unknown A B Similarity
  • 9. Related works Use Hypothesis Test with assumption on type, distribution of data • Stonebraker et al [1]: Welch’s t-test (normal distributions) • SemanticTyper [2], Neumaier et al. [3]: Kolmogorov-Smirnov test (continuous) • DSL [4]: New similarity with a logistic model. - Kolmogorov-Smirnov test (continuous) - Mann-Whitney test (continuous) - Numeric Jaccard (value range) 9 [1] Stonebraker et al., Data curation at scale: The data tamer system. In: CIDR 2013 [2] Ramnandan et al., Assigning semantic labels to data sources. ESWC 2015 [3] Neumaier et al., Multi-level semantic labelling of numerical values, ISWC 2016 [4] Pham et al., Semantic labeling: a domain-independent approach. ISWC 2016 But, no knowledge of data type or distribution Learn a similarity metric with no data assumption
  • 11. Framework 11 Representation Learning Semantic Labeling Offline Online Unknown Similarity Search 1. height 2. height 3. length 4. capacity 5. … Ranking List height height capacity Preprocessing Embedding Model Triplet loss CDF Vector Triplets height height capacity Triplets Preprocessing CDF Database height height capacity Learned Embedding Model
  • 12. Preprocessing with inverse transform sampling
      Why?
      - Standardize the input size
      - Reduce the computation time
      - Retain the cumulative distribution of the original data
      - Put the values in sorted order to leverage the capability of a CNN
      How? (figure example: decRainDays)
      - Input: a list of numerical values $a = \{x_1, x_2, \ldots, x_n\}$
      - CDF: $F_a(x) = P(a \le x)$, $x \in a$, $F_a : \mathbb{R} \to [0, 1]$
      - Inverse CDF: $F_a^{-1}(p) = \min\{x : F_a(x) \ge p\}$, $p \in [0, 1]$
      - Inverse transform sampling: select $h$ numerical values from $a$ with probabilities $p \in P = \{ i/h \mid i \in \{1, 2, \ldots, h\} \}$
      - Example: $h = 100$, $P = \{0.01, 0.02, \ldots, 1\}$, $\mathrm{sampling}(a) = \{F_a^{-1}(0.01), F_a^{-1}(0.02), \ldots, F_a^{-1}(1)\}$
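The preprocessing step can be sketched in a few lines of plain Python using the empirical CDF of the column. The function name `sample_fixed_size` is hypothetical, and the integer arithmetic is an implementation choice made here to avoid floating-point rounding when evaluating the inverse CDF at i/h.

```python
def sample_fixed_size(values, h=100):
    """Return the h values F^{-1}(i/h), i = 1..h, from the empirical CDF.

    F^{-1}(p) is the smallest x such that at least a fraction p of the
    values is <= x, i.e. the k-th smallest value with k = ceil(p * n).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("values must not be empty")
    sampled = []
    for i in range(1, h + 1):
        # ceil(i * n / h) in integer arithmetic (no float rounding issues).
        k = (i * n + h - 1) // h
        sampled.append(ordered[k - 1])
    return sampled
```

Whatever the length of the input column, the output has exactly `h` values in non-decreasing order, which gives every column the same input size for the embedding model.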
  • 14. Evaluation (1)
      Datasets:
      1. Standard Data: City Data [1] [2]. 300 numerical columns extracted from the City class in DBpedia (normalized data).
      2. Real World Data: Open Data. 500 numerical columns extracted from five Open Data portals: 1) Ireland (data.gov.ie), 2) the UK (data.gov.uk), 3) the EU (data.europa.eu), 4) Canada (open.canada.ca), and 5) Australia (data.gov.au).
      Table 1: Description of City Data and Open Data
      Data | # Sources | # Labels | # Columns | Rows per column: Min | Max | Median | Average
      City | 10 | 30 | 300 | 4 | 2,251 | 113 | 642.73
      Open | 10 | 50 | 500 | 4 | 186,082 | 467 | 14,659.63
      [1] Ramnandan et al. Assigning semantic labels to data sources. ESWC 2015
      [2] Pham et al. Semantic labeling: a domain-independent approach. ISWC 2016
  • 15. Evaluation (2)
      Metrics:
      - Effectiveness: Mean Reciprocal Rank (MRR) score, where a higher score means the correct label is ranked nearer the top
      - Efficiency: run time in seconds
      Baselines:
      1. SemanticTyper [1]:
         • Kolmogorov-Smirnov test
      2. DSL [2]: the metric is a combination of
         • Kolmogorov-Smirnov test
         • Mann-Whitney test
         • Numeric Jaccard
      [1] Ramnandan et al. Assigning semantic labels to data sources. ESWC 2015
      [2] Pham et al. Semantic labeling: a domain-independent approach. ISWC 2016
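As a reminder of how the effectiveness metric is computed, the sketch below implements the standard definition of MRR: the average over queries of 1/rank of the correct label. The function name and the toy rankings are hypothetical, and scoring a query as 0 when the correct label is missing from the ranking is an assumption.

```python
def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the true label over all queries.

    rankings: list of ranked label lists, best match first.
    truths:   list of the correct label for each query.
    A query whose true label is absent contributes 0 (assumption).
    """
    total = 0.0
    for ranked_labels, truth in zip(rankings, truths):
        if truth in ranked_labels:
            total += 1.0 / (ranked_labels.index(truth) + 1)
    return total / len(truths)

# Toy example: correct label "height" found at ranks 1, 2 and 3.
toy_rankings = [
    ["height", "length"],
    ["length", "height", "capacity"],
    ["capacity", "length", "height"],
]
toy_mrr = mean_reciprocal_rank(toy_rankings, ["height"] * 3)
```

An MRR of 1.0 means the correct label is always ranked first.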
  • 16. Evaluation (3)
      Experiment Setting:
      - Representation Learning: trained on City Data (50%, 5 sources)
      - Semantic Labeling:
        • Data: City Data (50%, 5 sources*) and Open Data (100%, 10 sources)
        • Query: the columns of one source
        • Database: the columns of the other sources, added step by step to simulate data growing over time and to test effectiveness and efficiency
        • City Data (5 sources): 75 experiments
        • Open Data (10 sources): 5110 experiments
      * This data is not the same as the data used in Representation Learning.
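The experiment counts are consistent with one reading of this setup: each source is used in turn as the query, and the database is every non-empty subset of the remaining sources. This reading is an assumption, since the slide only states the totals, and the function name `count_experiments` is hypothetical.

```python
from itertools import combinations

def count_experiments(num_sources):
    """Count (query source, database subset) pairs under the assumed protocol."""
    sources = range(num_sources)
    count = 0
    for query in sources:
        others = [s for s in sources if s != query]
        # Every non-empty subset of the other sources is one database.
        for size in range(1, len(others) + 1):
            count += sum(1 for _ in combinations(others, size))
    return count
```

Under this reading the count is n × (2^(n−1) − 1), which gives 5 × 15 = 75 for five sources and 10 × 511 = 5110 for ten sources.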
  • 17. Semantic Labeling
      (Figure: results on City Data and Open Data)
      Efficiency:
      - More data takes more time
      - EmbNum runs 25 times faster than SemanticTyper and 92 times faster than DSL
      Effectiveness:
      - More data gives higher accuracy
      - EmbNum significantly outperforms SemanticTyper and DSL (paired t-test)
  • 19. Summary
      • EmbNum:
        o Learns representations for numerical values with deep metric learning
        o Calculates similarity on these representations
      • Advantages:
        o Accurate, with no need to make assumptions about data type or distribution
        o Fast
        o Domain independent
      • Future work:
        o Extend the similarity metric
          - Multiple scales (meter, centimeter, feet, inch)
          - Hierarchical context: e.g. the height of people who live in Tokyo and were born in Yokohama
        o Recognize new semantic types
      (Figure: an unknown column mapped into the embedding space, with ranked results 1. height, 2. height, 3. length, 4. capacity, 5. …)
  • 20. Questions & Comments
  • 22. Representation Learning
      (Figure: t-SNE visualization of the embedding features of City Data)
      After learning, columns with similar patterns lie close together, which indicates that the learning method is effective.
  • 23. CDF or PDF
      Both the CDF and the PDF can represent a probability distribution [1].
      - PDF: probability is represented by areas
      - CDF: probability is represented by vertical distances
      In terms of graphical intuition, the CDF is preferable to the PDF, since vertical distances can be judged faster and more accurately than areas.
      [1] W.S. Cleveland. The Elements of Graphing Data. NJ, USA: Hobart Press, 1994
  • 24. EmbNum: Learning and Labeling
      (Figure: detailed pipeline)
      - Representation Learning. Input: labeled attributes. Steps: Preprocessing → Feature Vectors → Triplet Sampling (anchor, positive, negative) → three copies of the embedding model with shared weights → Triplet Loss on the three embeddings. Output: embedding model.
      - Semantic Labeling. Input: labeled attributes database and an unknown attribute. Steps: Preprocessing → Feature Extracting with the embedding model → Feature Vectors → Similarity Searching. Output: ranked results.
  • 25. City Data (Describe)
      stat | areaTotal | populationDensity | areaLand | areaWater | aprRainInch | augRainDays | populationTotal | aprHighF | aprLowF | aprMeanF
      count | 21690 | 18130 | 19250 | 13700 | 40 | 320 | 17830 | 1130 | 1130 | 160
      mean | 4.09E+08 | 644.75 | 3.70E+07 | 2.64E+09 | 2.94 | 9.79 | 557595.59 | 66.98 | 43.59 | 49.24
      std | 6.44E+09 | 3343.07 | 4.79E+08 | 2.75E+11 | 2.8 | 6.21 | 5.63E+07 | 11.13 | 11.46 | 9.63
      min | 0 | 0 | 0 | 0 | 0.22 | 0 | 0 | 8.5 | -4.9 | 9.7
      25% | 2.56E+06 | 189.3 | 2.18E+06 | 0 | 0.99 | 4.8 | 1459.25 | 59 | 35.92 | 43.38
      50% | 8.00E+06 | 365.8 | 6.09E+06 | 50000 | 2.68 | 10.6 | 8407.5 | 66.55 | 42.2 | 47.75
      75% | 3.02E+07 | 618.61 | 1.89E+07 | 490000 | 3.52 | 13.83 | 30575.25 | 74.4 | 49.88 | 54.3
      max | 4.63E+11 | 309709 | 3.94E+10 | 3.20E+13 | 15.42 | 26.1 | 7.51E+09 | 104 | 87 | 80

      stat | aprRainDays | percentageOfAreaWater | areaLandKm | areaLandSqMi | areaTotalKm | areaTotalSqMi | areaWaterKm | decMeanF | augHighF | augLowF
      count | 320 | 860 | 9840 | 9400 | 12160 | 9470 | 9560 | 160 | 1130 | 1130
      mean | 7.94 | 5.98 | 52.07 | 8.94 | 708.17 | 11.79 | 15.66 | 3.04E+01 | 85.34 | 61.4
      std | 4.72 | 13.19 | 678.09 | 51.29 | 8594.54 | 93.82 | 857.33 | 1.34E+01 | 8.28 | 8.85
      min | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.7 | 41 | 32.9
      25% | 4.3 | 0.1 | 2.25 | 0.81 | 2.8 | 0.83 | 0 | 21.23 | 81 | 54.92
      50% | 8 | 0.98 | 6.35 | 2.24 | 10.1 | 2.34 | 0 | 29 | 86.4 | 61.55
      75% | 11.22 | 4.57 | 20.05 | 6.9 | 45.97 | 7.4 | 0.2 | 3.75E+01 | 91 | 68
      max | 25 | 84.58 | 39400 | 2874 | 462705 | 4811.5 | 80000 | 7.80E+01 | 110 | 86

      stat | augMeanF | populationMetro | augRainInch | decHighF | decLowF | areaWaterSqMi | decRainDays | decRainInch | elevationFt | elevationM
      count | 160 | 970 | 40 | 1130 | 1130 | 9390 | 320 | 40 | 9720 | 22510
      mean | 70.49 | 2.53E+07 | 2.49 | 48.87 | 29.49 | 1318.01 | 6.48 | 3.73 | 1113.48 | 365.82
      std | 7.22 | 7.37E+08 | 3.06 | 15.75 | 15.45 | 127541.37 | 5.02 | 4.15 | 2092.3 | 3008.53
      min | 47 | 2 | 0.02 | -3 | -23 | 0 | 0 | 0.28 | -177 | -314
      25% | 65.77 | 172641 | 0.11 | 37 | 20.5 | 0 | 2.18 | 1.83 | 440 | 128.02
      50% | 69.45 | 608235.5 | 2.15 | 47 | 28.2 | 0 | 5.8 | 2.58 | 863 | 261
      75% | 76.25 | 2.05E+06 | 3.5 | 58.75 | 37.4 | 0.09 | 9.83 | 3.53 | 1250.75 | 391
      max | 90 | 2.30E+10 | 14.83 | 95.7 | 81 | 1.24E+07 | 27 | 24.15 | 171549 | 445001
  • 26. Open Data (Describe)
      stat | point y | x | y | bng n | bng e | grade | eastitm | quantity | long | clients
      count | 16787 | 542158 | 542158 | 37814 | 37814 | 817 | 189 | 251260 | 152366 | 120
      mean | 6.90E+06 | 149.14 | -35.29 | 293708.8 | 421808.2 | 6.16 | 530008.61 | 515104.01 | 107.1 | 384.19
      std | 11045.1 | 0.15 | 0.24 | 156839.67 | 113008.45 | 3.5 | 2430.79 | 7.52E+06 | 63.83 | 619.37
      min | 6.88E+06 | 145.77 | -37.55 | 7962 | 10133 | 0 | 524219.87 | 0 | -10.86 | 1
      25% | 6.89E+06 | 149.08 | -35.37 | 177609.75 | 349393 | 4 | 528451.41 | 106 | 1.26 | 34.25
      50% | 6.90E+06 | 149.15 | -35.32 | 261787 | 430200 | 7 | 529876.17 | 6310 | 143.87 | 156.5
      75% | 6.91E+06 | 149.17 | -35.25 | 387048 | 512040 | 8 | 531961.53 | 107000 | 144.79 | 384.5
      max | 6.94E+06 | 151.24 | -32.82 | 1.19E+06 | 654853 | 17 | 535458.49 | 6.68E+08 | 145.28 | 2908

      stat | point x | retail trade | trade debtors | lon | eastig | gross interest | cost of sales | cost centre | gross rent | lat
      count | 16787 | 9996 | 889 | 4531 | 189 | 15219 | 1214 | 7025 | 15219 | 152979
      mean | 539426.47 | 9.75 | 1.30E+09 | 142.01 | 130042.73 | 8.74E+06 | 1.26E+10 | 1508.81 | 2.61E+07 | -16.69
      std | 6095.78 | 147.17 | 7.15E+09 | 0.99 | 2431.31 | 1.31E+08 | 4.68E+10 | 395.81 | 3.47E+08 | 38.12
      min | 517645.59 | 0 | 0 | 141.55 | 124252.75 | 0 | -581581 | 990 | 0 | -38.37
      25% | 535993.55 | 0 | 5.77E+06 | 141.6 | 128485.2 | 162419.5 | 4.44E+07 | 1103 | 465264 | -37.73
      50% | 540100.63 | 0 | 4.37E+07 | 141.61 | 129910.27 | 781001 | 4.11E+08 | 1833 | 2.48E+06 | -37.58
      75% | 542517.03 | 2 | 3.56E+08 | 141.62 | 131996.07 | 3.85E+06 | 3.64E+09 | 1836 | 1.29E+07 | -37.52
      max | 554135.38 | 11325 | 1.14E+11 | 144.99 | 135493.8 | 1.13E+10 | 4.44E+11 | 13304 | 3.58E+10 | 60.61

      stat | trade creditors | rent expenses | gifts or donations | quantity quantit | temperature deg c | station altitude | depreciation expenses | superannuation expenses | pm10 teom ug m3 | help assessment debt
      count | 889 | 925 | 33878 | 1560203 | 22176 | 1599 | 925 | 925 | 22176 | 15016
      mean | 1.07E+09 | 2.61E+08 | 1.19E+06 | 262000.22 | 12.17 | 154.35 | 4.30E+08 | 1.75E+08 | 11.5 | 1.14E+06
      std | 5.61E+09 | 1.12E+09 | 1.71E+07 | 3.31E+06 | 4.09 | 260.79 | 2.58E+09 | 7.44E+08 | 7.3 | 1.71E+07
      min | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0 | -4.9 | 0
      25% | 5.01E+06 | 2.36E+06 | 10450.25 | 73 | 9.4 | 20 | 2.74E+06 | 1.28E+06 | 6.8 | 0
      50% | 3.91E+07 | 1.62E+07 | 70994 | 1415 | 11.8 | 59 | 2.11E+07 | 1.02E+07 | 10.1 | 17093
      75% | 3.26E+08 | 1.21E+08 | 416793.75 | 21760 | 14.6 | 160 | 1.23E+08 | 6.95E+07 | 14.4 | 305799
      max | 8.51E+10 | 1.57E+10 | 2.81E+09 | 4.56E+08 | 35.7 | 1950 | 3.85E+10 | 9.94E+09 | 117.1 | 1.59E+09

      time from payscale minimum repairs and maintenance year inc time period time to statistics year salary cost of reports year ann e
  • 27. Tabular Data Integration with Knowledge Base Matching
      Matching tabular data to Knowledge Bases:
      - Ontology
      - Knowledge graph (DBpedia, Wikidata)
      - Label data