Building Structured Data from Product Descriptions
Keiji Shinzato
Product information extraction

An Italian product. This is a fruity
red wine that mainly consists of
sangiovese grapes of Tuscany.

Type            Red
Grape variety   Sangiovese
Region          Italy, Tuscany
Background

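• Structured data play a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.

ベリンダ・コーリー キアンティ 2011 750ml
One of the red wines that represent Italy, made mainly from Sangiovese grapes of the Chianti district in Tuscany.
(Original: トスカーナ州キャンティ地区のサンジョベーゼ種を主体につくられる、イタリアを代表する赤ワインの一つ。)

Attribute   Value
Type        赤 (Red)
Region      イタリア, トスカーナ州キャンティ地区 (Italy; Chianti district, Tuscany)
Grape       サンジョベーゼ (Sangiovese)
Vintage     2011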
Faceted navigation

Reference: http://www.amazon.com/
Background

• Structured data play a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.

• An unsupervised methodology is required.
– Scale: 100 million products across 40,000 categories.
A table is a useful clue, but…

WINE > CHILE
Montes Alpha M 2009

Type     Red
Region   Chile
Grape    Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Year     2009

38%

Product page including a table

WINE > CHILE
Montes Alpha M 2009

Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …

Product page consisting of sentences
Product information extraction
WINE > CHILE
Montes Alpha M 2009

Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …

Product page (unstructured)

Attribute   Value
Type        Red
Region      Chile
Grape       Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Vintage     2009
Company     Montes

Structured data

• Issue 1: How do we know the attributes for a category?
• Issue 2: How do we extract attribute values from full texts?
Attribute name collection
Analyze a large amount of table data to collect the attributes of an object.

[Figure: a product-page table, labeled with the attribute names of wine and the attribute values]

Reference: http://item.rakuten.co.jp/redbox/odm3000728/
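As a rough illustration of collecting attribute names from table data, the sketch below counts how many tables in one category use each header and keeps the frequent ones. The input format (one dict per table), the function name `collect_attribute_names`, and the `min_ratio` threshold are assumptions made for this example, not details given in the talk.

```python
from collections import Counter

def collect_attribute_names(tables, min_ratio=0.5):
    """Return header names that appear in at least min_ratio of the tables.

    tables: list of dicts, one per product-page table, mapping
            header text -> cell text (hypothetical input format).
    """
    if not tables:
        return []
    counts = Counter()
    for table in tables:
        # Count each header once per table, ignoring surrounding whitespace.
        counts.update({name.strip() for name in table})
    threshold = min_ratio * len(tables)
    # most_common() orders the headers by descending frequency.
    return [name for name, n in counts.most_common() if n >= threshold]

# Toy tables, made up for illustration.
wine_tables = [
    {"Type": "Red", "Region": "Chile", "Grape": "Merlot"},
    {"Type": "White", "Region": "France", "Shipping": "Free"},
    {"Type": "Red", "Region": "Italy", "Grape": "Sangiovese"},
]
attribute_names = collect_attribute_names(wine_tables)
```

Here "Shipping" appears in only one of three tables and is filtered out, while "Type", "Region", and "Grape" are kept as attribute names for the category.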
Attribute value database (wine)
Columns: ぶどう品種 (Grape variety), 内容量 (Volume), 産地 (Region), 生産者 (Winery), 味わい (Taste)

Grape variety     Volume   Region      Winery            Taste
Chardonnay        750ML    France      Farnese           Dry
Chardonnay 100%   720ML    Italy       Mas de Monistrol  Full body
Merlot            375ML    Spain       Leroy             Medium body
Riesling          500ML    Chile       M. Chapoutier     Slightly sweet
Syrah             1500ML   German      Mastroberardino   Sweet
Grenache          360ML    Australia   Santero           Medium dry
Merlot            200ML    America     Saltarelli        Extremely sweet
Tempranillo       3000ML   Bordeaux    Cavicchioli       Medium dry
Sangiovese        1800ML   Champagne   Fontodi           Red Full body
Syrah100%         1000ML   Argentina   Ca'Rugate         Middle sweet

Precision is high, but coverage is low.
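The precision/coverage trade-off can be seen with a plain dictionary lookup. The sketch below is only an illustration: the function name `lookup_values` and the tiny `value_db` are assumptions, with values borrowed from the table above.

```python
def lookup_values(text, value_db):
    """Return (attribute, value) pairs whose value occurs verbatim in text."""
    lowered = text.lower()
    found = []
    for attribute, values in value_db.items():
        for value in values:
            # Case-insensitive substring match against the description.
            if value.lower() in lowered:
                found.append((attribute, value))
    return found

# A small slice of an attribute value database (illustrative only).
value_db = {
    "Grape variety": ["Chardonnay", "Merlot", "Sangiovese"],
    "Region": ["France", "Italy", "Chile"],
}
text = "A blend of Merlot and Petit Verdot from Chile."
matches = lookup_values(text, value_db)
```

The lookup finds "Merlot" and "Chile", which are in the database, but misses "Petit Verdot", which is not. That is the kind of gap a learned extractor is meant to close.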
Product information extraction
• Issue 1: How do we know the attributes for each category?
• Issue 2: How do we extract attribute values from product descriptions?
Unsupervised attribute value extraction - distant supervision approach

• Construction: build a database of attribute values from semi-structured data.
– Database: <Region, Margaux>, <Color, White>, ...
• Annotation: annotate product pages that include entries in the database.
– Chateau d’Issan 1994: "This is a wine from Margaux. ..."
• Generation: generate extraction rules from the annotated pages.
– Rule: wine from x ⇒ x is a Region
– Rules are generated through a machine learning algorithm.
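The annotation step can be sketched as simple string matching against the database. This is a minimal sketch under assumptions: the function name `annotate`, the list-of-pairs database format, and the tag syntax are chosen for illustration and are not taken from the talk.

```python
import re

def annotate(text, database):
    """Wrap every database value found in text with its attribute tag.

    database: list of (attribute, value) pairs, e.g. ("Region", "Margaux").
    """
    if not database:
        return text
    # Try longer values first so a short value cannot split a longer one.
    entries = sorted(database, key=lambda pair: len(pair[1]), reverse=True)
    pattern = re.compile("|".join(re.escape(value) for _, value in entries))
    tag_of = {value: attribute for attribute, value in entries}

    def wrap(match):
        tag = tag_of[match.group(0)]
        return f"<{tag}>{match.group(0)}</{tag}>"

    # A single pass over the text avoids tagging inside already-added tags.
    return pattern.sub(wrap, text)

database = [("Region", "Margaux"), ("Color", "White")]
annotated = annotate("This is a wine from Margaux.", database)
```

Because the matching is purely string-based, some automatically produced tags will be wrong, so whatever learner consumes the annotated corpus has to tolerate that noise.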
Corpus with attribute-value annotations (wine)
• J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
  E: A spicy and gorgeous wine that is said to be the most aromatic in <production_area> Alsace </production_area>.
• J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
  E: This is a very nice wine because we can easily enjoy the taste of <winery> Domaine Pegau </winery> at the best price.
• J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
  E: A wine in which the characteristics of <grape_variety> Sauvignon Blanc </grape_variety> are well expressed.
• J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
  E: Goes with salt-grilled <type> white </type>-fleshed fish, simply seasoned sautés, grilled oysters, ginger pork, vongole bianco, and the like.
Extraction rule generation
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-Speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: Double character prefix of the token.
– Suffix: Double character suffix of the token.
– The above features of ±3 tokens surrounding the token.

These features are frequently employed in Japanese named entity recognition.
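To make the tagging scheme and the feature window concrete, here is a small sketch in plain Python. It is illustrative only: the helper names are hypothetical, the character-type inventory is an assumption, and the Base and PoS features are left out because they would require a morphological analyzer. The resulting feature dictionaries are the kind of input a CRF toolkit typically expects, but no particular toolkit is implied.

```python
def iobes_tags(length, label):
    """IOBES tags for one value span of `length` tokens."""
    if length == 1:
        return [f"S-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (length - 2) + [f"E-{label}"]

def char_type(token):
    """Very coarse character-type feature (an assumed inventory)."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "alpha"
    return "other"

def token_features(tokens, i, window=3):
    """Features of token i plus the same features of its +/-window neighbours."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            features[f"{offset}:token"] = tok
            features[f"{offset}:char_type"] = char_type(tok)
            features[f"{offset}:prefix"] = tok[:2]   # double-character prefix
            features[f"{offset}:suffix"] = tok[-2:]  # double-character suffix
    return features

tokens = ["wine", "from", "Cabernet", "Sauvignon", "grapes"]
# "Cabernet Sauvignon" is a two-token Grape value; other tokens are Outside.
tags = ["O", "O"] + iobes_tags(2, "Grape") + ["O"]
features = token_features(tokens, 2)
```

In this scheme a multi-token value gets B (begin), I (inside), and E (end) tags, a single-token value gets S, and everything else gets O.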
Unsupervised attribute value extraction - distant supervision approach

Terre di matraja Bianco 2012
This is a wine from Tuscany. ...

Apply the rules:
– Rule: wine from x ⇒ x is a Region
– Rule: 1800 < x <= 2013 ⇒ x is a Vintage

Attribute   Value
Region      Tuscany
Vintage     2012
Grape       Chardonnay
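The two example rules can be mimicked with regular expressions. In the talk the rules come from a learned model, so the hand-written patterns below are only stand-ins; the names `REGION_RULE`, `YEAR`, and `extract` are hypothetical, and the Grape attribute is not covered by this sketch.

```python
import re

# Hand-written stand-in for a learned rule: "wine from x => x is a Region".
REGION_RULE = re.compile(r"wine\s+from\s+([A-Z][\w'-]*)")
# Any 4-digit number is a Vintage candidate; the range check follows the slide.
YEAR = re.compile(r"\b(\d{4})\b")

def extract(title, description):
    """Apply the two example rules to a product title and description."""
    result = {}
    region = REGION_RULE.search(description)
    if region:
        result["Region"] = region.group(1)
    for year in YEAR.findall(f"{title} {description}"):
        if 1800 < int(year) <= 2013:
            result["Vintage"] = year
            break
    return result

extracted = extract("Terre di matraja Bianco 2012", "This is a wine from Tuscany.")
```

For this product the sketch fills Region with "Tuscany" and Vintage with "2012".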
Performance (F-score)

Category   Without ML   With ML
Wine       43.8 pt.     60.1 pt.
Shampoo    24.1 pt.     71.5 pt.
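For reference, an F-score combines precision and recall into one number. The sketch below assumes the balanced F-score (F1), which the slides do not state explicitly, and uses made-up counts that are not the talk's data.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1), returned in points (0-100)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 100 * 2 * precision * recall / (precision + recall)

# Made-up counts for illustration only; they are not the talk's data.
score = f_score(true_positives=6, false_positives=2, false_negatives=4)
```

With these counts, precision is 0.75 and recall is 0.6, so the score is about 66.7 points.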
Wine / Japanese

An Italian product. This is a fruity
red wine that mainly consists of
sangiovese grapes of Tuscany.

Type            Red
Grape variety   Sangiovese
Region          Italy, Tuscany
Shampoo / Japanese

"MCH Natural shampoo 1000ml" is a shampoo consisting of cypress oil and charcoal.

Category       Shampoo
Product name   MCH Natural shampoo 1000ml
Ingredient     Cypress oil, Charcoal
Video game / French

Product type   Nintendo 64, Nintendo DS
Saga           Mario
Conclusion
• We are developing a technique for extracting product information from unstructured data.
– Independent of any category and language.

• Useful services can be built on structured product data.
• Our paper is available on the web.
– ACL anthology: http://aclweb.org/anthology//I/I13/
Thank you for listening!

[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
