Attention, Learn to Solve
Routing Problems!
ICLR 2019
University of Amsterdam
Wouter Kool, Herke van Hoof and Max Welling
Abstract
• Learning heuristics for combinatorial optimization problems can save
costly development.
• Propose a model based on attention layers and train this model using
REINFORCE with a baseline based on deterministic greedy rollout.
• Outperform recent learned heuristics for TSP.
Introduction
• Approaches to solving combinatorial optimization problems can be
divided into:
• Exact methods: guarantee finding optimal solutions.
• Heuristics: trade off optimality for computational cost, and are usually
expressed as rules (i.e., a policy for making decisions).
• Train a model that parameterizes a policy, to obtain new and stronger
algorithms for routing problems.
Introduction (cont’d)
• Propose a model based on attention and train it using REINFORCE
with greedy rollout baseline.
• Show the flexibility of the proposed approach on multiple routing
problems.
Background
Attention mechanism
• In an encoder-decoder model, attention is used to obtain a new context vector.
• $h_j$ denotes an encoder hidden state; $s_i$ denotes a decoder hidden state.
• The alignment model (compatibility) measures the relationship between the
current decoding state and every encoding state:
• $e_{ij} = a(s_{i-1}, h_j)$
• Attention weight: $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$
• Context vector: $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
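As a concrete illustration, a minimal PyTorch sketch of one such decoding step; the additive (Bahdanau-style) alignment scorer and all weights and sizes are illustrative stand-ins, not the model from the slides.

    import torch

    # One decoding step of encoder-decoder attention (sizes illustrative).
    T, d = 5, 16                          # source length, hidden size
    h = torch.randn(T, d)                 # encoder hidden states h_j
    s_prev = torch.randn(d)               # previous decoder state s_{i-1}

    # Alignment model a(s_{i-1}, h_j): here an additive (Bahdanau-style) scorer.
    W_s, W_h, v = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
    e = torch.tanh(s_prev @ W_s + h @ W_h) @ v   # compatibilities e_{ij}, shape (T,)

    alpha = torch.softmax(e, dim=0)       # attention weights alpha_{ij}, sum to 1
    c = alpha @ h                         # context vector c_i = sum_j alpha_{ij} h_j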
Transformer
• Multi-head attention: projects the input encodings into several different
subspaces, one per head
• Self-attention: no additional decoding state, just encoding states
themselves
• Each head has its own attention mechanism
Attention model
Problem definition
• Define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \ldots, n\}$
is represented by features $x_i$.
• For TSP, $x_i$ is the coordinate of node $i$ (in 2D space).
• Define a solution $\pi = (\pi_1, \ldots, \pi_n)$ as a permutation of the nodes.
• Given a problem $s$, the model outputs a policy $p(\pi \mid s)$ for selecting a
solution $\pi$.
Encoder-decoder model
• The encoder-decoder model defines a stochastic policy $p(\pi \mid s)$ for selecting a solution $\pi$
given a problem instance $s$:
$p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1})$
• The encoder produces embeddings of all input nodes.
• The decoder produces the sequence $\pi$, one node at a time, based on the node
embeddings, a mask, and a context.
• For TSP,
• node embeddings: produced by the encoder
• mask: the nodes not yet visited during decoding
• context: the embeddings of the first and last node of the partial tour during decoding (a decoding sketch follows)
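A minimal sketch of this factorized, masked decoding loop; the per-step scorer is a placeholder for the real attention decoder, so only the masking and chain-rule structure match the slides.

    import torch

    # Factorized decoding: pick one node per step, masking visited nodes.
    # scores_fn is a stand-in for the real attention decoder.
    def decode(scores_fn, n):
        visited = torch.zeros(n, dtype=torch.bool)
        pi, log_p = [], torch.tensor(0.0)
        for t in range(n):
            scores = scores_fn(pi)                            # scores over all nodes
            scores = scores.masked_fill(visited, float('-inf'))
            probs = torch.softmax(scores, dim=0)              # p(pi_t | s, pi_{1:t-1})
            node = int(torch.multinomial(probs, 1))
            log_p = log_p + torch.log(probs[node])            # chain rule: sum of logs
            visited[node] = True
            pi.append(node)
        return pi, log_p                                      # tour and log p(pi | s)

    pi, log_p = decode(lambda pi: torch.randn(10), 10)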
Encoder
• $d_x$-dimensional input features $x_i$. For TSP, $d_x = 2$.
• $d_h$-dimensional node embeddings. Let $d_h = 128$.
• Initial embedding: $h_i^{(0)} = W^x x_i + b^x$
• The embeddings $h_i^{(l)}$ are updated using $N$ attention layers:
$\hat{h}_i = \mathrm{BN}^l\left(h_i^{(l-1)} + \mathrm{MHA}_i^l\left(h_1^{(l-1)}, \ldots, h_n^{(l-1)}\right)\right)$
$h_i^{(l)} = \mathrm{BN}^l\left(\hat{h}_i + \mathrm{FF}^l\left(\hat{h}_i\right)\right)$
• Graph embedding: $\bar{h}^{(N)} = \frac{1}{n} \sum_{i=1}^{n} h_i^{(N)}$
$i$ denotes the node index
$l$ indexes the attention layer; $h^{(l)}$ is the output of the $l$-th layer
FF: node-wise feed-forward
MHA: multi-head attention
BN: batch normalization
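A sketch of one encoder layer under these equations; torch's built-in multi-head attention (concat plus one projection) is a stand-in for the paper's sum-of-projections MHA, and the feed-forward width of 512 is an assumed typical value.

    import torch
    import torch.nn as nn

    # One encoder layer per the equations above (sizes per the slides).
    d_h, M, n = 128, 8, 20
    mha = nn.MultiheadAttention(embed_dim=d_h, num_heads=M, batch_first=True)
    ff = nn.Sequential(nn.Linear(d_h, 512), nn.ReLU(), nn.Linear(512, d_h))
    bn1, bn2 = nn.BatchNorm1d(d_h), nn.BatchNorm1d(d_h)

    h = torch.randn(1, n, d_h)                      # node embeddings h_i^(l-1), batch of 1
    h_hat = bn1((h + mha(h, h, h)[0]).squeeze(0))   # BN(h_i + MHA_i(h_1, ..., h_n))
    h_new = bn2(h_hat + ff(h_hat))                  # h_i^(l) = BN(h_hat_i + FF(h_hat_i))
    graph_emb = h_new.mean(dim=0)                   # graph embedding: mean over nodes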
Multi-head attention
• $\mathrm{MHA}_i^l\left(h_1^{(l-1)}, \ldots, h_n^{(l-1)}\right)$
• Let number of heads 𝑀 = 8, embedding dimension 𝑑ℎ = 128.
• Each head has its own attention mechanism.
Result vector of each head
• Each node has its own query 𝑞𝑖, key 𝑘𝑖 and value 𝑣𝑖.
• $q_i = W^Q h_i$, $k_i = W^K h_i$, $v_i = W^V h_i$
• $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices; $W^V$ is a $(d_v \times d_h)$ matrix.
• Given node $i$ and another node $j$:
• $q_i$ and $k_j$ determine the importance of $v_j$
• Compatibility: $u_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}}$ if node $i$ is adjacent to node $j$, else $-\infty$.
• Attention weight: $a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}} \in [0, 1]$
• Result vector: $h_i' = \sum_j a_{ij} v_j$ (size $d_v$)
1. Compute the compatibility
2. Compute the attention weight
3. Take a linear combination of the values $v_j$ weighted by $a_{ij}$
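The same three steps as a short sketch for a single head; all matrices and sizes are illustrative, and the graph is treated as fully connected so the $-\infty$ mask never fires.

    import torch

    # One attention head over n node embeddings (illustrative sizes).
    n, d_h, d_k = 5, 128, 16                         # here d_v = d_k
    h = torch.randn(n, d_h)                          # node embeddings
    W_Q, W_K, W_V = torch.randn(d_k, d_h), torch.randn(d_k, d_h), torch.randn(d_k, d_h)

    q, k, v = h @ W_Q.T, h @ W_K.T, h @ W_V.T        # per-node queries, keys, values
    u = (q @ k.T) / d_k ** 0.5                       # 1. compatibilities u_ij
    a = torch.softmax(u, dim=-1)                     # 2. attention weights a_ij
    h_prime = a @ v                                  # 3. result vectors h'_i, (n, d_v)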
Final result vector
• Let $h_{im}'$ denote the result vector of node $i$ in head $m$ (size $d_v$).
• In the Transformer, the result vectors are concatenated first and then transformed:
• $\mathrm{MHA}_i(h_1, \ldots, h_n) = W^O \, \mathrm{concat}(h_{i1}', \ldots, h_{iM}')$, where $W^O$ is a $d_h \times (M \cdot d_v)$ matrix.
• In the proposed method, each result vector is transformed and the results are summed:
• $\mathrm{MHA}_i(h_1, \ldots, h_n) = \sum_{m=1}^{M} W_m^O h_{im}'$, where each $W_m^O$ is a $d_h \times d_v$ matrix.
• Both methods output a $d_h$-dimensional vector for each node (verified in the sketch below).
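The two forms compute the same thing: slicing $W^O$ head-by-head turns concat-then-project into a sum of per-head projections, as this small numerical check illustrates (sizes illustrative).

    import torch

    # Equivalence of the two multi-head combinations for one node i.
    M, d_v, d_h = 8, 16, 128
    h_prime = [torch.randn(d_v) for _ in range(M)]             # h'_im for node i
    W_O = torch.randn(d_h, M * d_v)                            # Transformer projection
    W_Om = [W_O[:, m * d_v:(m + 1) * d_v] for m in range(M)]   # per-head blocks W_m^O

    out_concat = W_O @ torch.cat(h_prime)                  # W^O concat(h'_i1, ..., h'_iM)
    out_sum = sum(W_Om[m] @ h_prime[m] for m in range(M))  # sum_m W_m^O h'_im
    assert torch.allclose(out_concat, out_sum, atol=1e-4)  # identical up to rounding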
Decoder
• At decoding time, the decoding context consists of the embeddings of the
graph, the last node, and the first node:
• $h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)},\, h_{\pi_{t-1}}^{(N)},\, h_{\pi_1}^{(N)}\right] & \text{if } t > 1 \\ \left[\bar{h}^{(N)},\, v^l,\, v^f\right] & \text{otherwise.} \end{cases}$
• The $(3 \cdot d_h)$-dimensional result vector $h_{(c)}^{(N)}$ is the embedding of the special
context node $(c)$.
$[\cdot,\cdot,\cdot]$ is the horizontal concatenation operator
$v^l$ and $v^f$ are learnable $d_h$-dimensional parameters
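A sketch of assembling this context embedding, with random stand-ins for the final node embeddings and the learned placeholders $v^l$ and $v^f$:

    import torch

    # Assembling the (3 * d_h)-dimensional context embedding (stand-in tensors).
    d_h, n = 128, 20
    h = torch.randn(n, d_h)                          # final node embeddings h_i^(N)
    graph_emb = h.mean(dim=0)                        # h_bar^(N)
    v_l, v_f = torch.randn(d_h), torch.randn(d_h)    # learned placeholders for t = 1

    def context(pi):                                 # pi: the partial tour so far
        if pi:                                       # t > 1: [h_bar, h_last, h_first]
            return torch.cat([graph_emb, h[pi[-1]], h[pi[0]]])
        return torch.cat([graph_emb, v_l, v_f])      # t = 1: learned parameters

    h_c = context([4, 7])                            # shape (3 * d_h,)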
Update context node embedding
• Obtain the new context node embedding $h_{(c)}^{(N+1)}$ using $M$-head attention.
• The keys and values come from the node embeddings $h_i^{(N)}$; the query comes
from the context node.
• $q_{(c)} = W^Q h_{(c)}$, $k_i = W^K h_i$, $v_i = W^V h_i$
• Compatibility: $u_{(c)j} = \frac{q_{(c)}^T k_j}{\sqrt{d_k}}$ (with $d_k = \frac{d_h}{M}$) if node $j$ has not been visited yet,
else $-\infty$.
• Apply the usual $\mathrm{MHA}$ computation to get $h_{(c)}^{(N+1)}$ (size $d_h$).
Final output probability
• Compute $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ using a single attention head ($M = 1$, $d_k = d_h$),
computing only compatibilities (no values $v_i$ needed).
• $u_{(c)j} = C \cdot \tanh\left(\frac{q_{(c)}^T k_j}{\sqrt{d_k}}\right) \in [-C, C]$ if node $j$ has not been visited yet, else $-\infty$
(with $C = 10$).
• Compute the final output probability vector 𝑝 using softmax
$p_i = p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}}$
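The last decoding step end to end, as a sketch with stand-in context query and node keys:

    import torch

    # Final per-step distribution: tanh-clipped single-head compatibilities,
    # visited nodes masked out, then softmax.
    n, d_h, C = 20, 128, 10
    q_c, K = torch.randn(d_h), torch.randn(n, d_h)   # context query, node keys
    visited = torch.zeros(n, dtype=torch.bool)
    visited[[4, 7]] = True                           # nodes already in the tour

    u = C * torch.tanh(K @ q_c / d_h ** 0.5)         # u_(c)j in [-C, C]
    u = u.masked_fill(visited, float('-inf'))        # visited nodes get probability 0
    p = torch.softmax(u, dim=0)                      # p_theta(pi_t = i | s, pi_{1:t-1})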
REINFORCE with greedy rollout baseline
REINFORCE with baseline
• Define the loss $\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$.
• Optimize $\mathcal{L}$ by gradient descent using REINFORCE.
• Introducing a baseline reduces gradient variance and thus speeds up
learning:
$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\left(L(\pi) - b(s)\right) \nabla \log p_\theta(\pi \mid s)\right]$
• Common baselines:
• Exponential moving average $b(s) = M$ with decay $\beta$:
$M_0 = L(\pi)$, $M_{t+1} = \beta M_t + (1 - \beta) L(\pi)$
• Learned value function (critic) $\hat{v}(s, \omega)$
• $\omega$ is learned from pairs $(s, L(\pi))$
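A minimal sketch of one such update; the log-probability is a stand-in expression of the parameters (so the gradient is visible), and the tour length and baseline are made-up values:

    import torch

    # One REINFORCE-with-baseline update via the surrogate loss.
    theta = torch.randn(8, requires_grad=True)       # stand-in policy parameters
    log_prob = -(theta ** 2).sum()                   # stands in for log p_theta(pi | s)
    tour_len, b = 5.3, 5.0                           # L(pi) and b(s)

    loss = (tour_len - b) * log_prob                 # advantage * log-likelihood
    loss.backward()                                  # theta.grad = (L - b) * grad log p
    # an optimizer step (e.g. Adam) would follow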
Proposed baseline
• Two models: one for training, another for the baseline.
• Sample a solution $\pi_i$ from $p_\theta$.
• Greedily pick a baseline solution $\pi_i^{BL}$ from $p_{\theta^{BL}}$.
• Calculate the gradient of the loss with REINFORCE, using the length of $\pi_i^{BL}$
as the baseline.
• Copy the training parameters to the baseline, replacing the baseline
parameters only if the improvement is significant.
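A sketch of the whole scheme with stub rollout functions standing in for the actual model; only the structure (sampled vs. greedy rollout, advantage, frozen baseline parameters) follows the slides.

    import torch

    # Two parameter sets: one trained, one frozen as the greedy-rollout baseline.
    def sample_rollout(params, s):                   # -> (tour length, log-probability)
        return (params * s).sum().abs(), -(params * s).pow(2).sum()

    def greedy_rollout(params, s):                   # deterministic baseline tour length
        return (params * s).sum().abs().detach()

    params = torch.randn(8, requires_grad=True)
    bl_params = params.detach().clone()              # frozen copy: the baseline policy
    opt = torch.optim.Adam([params], lr=1e-3)

    for step in range(100):
        s = torch.randn(8)                           # a random problem instance
        L, log_p = sample_rollout(params, s)         # sample pi_i from p_theta
        L_bl = greedy_rollout(bl_params, s)          # greedy pi_i^BL from p_theta_BL
        loss = (L.detach() - L_bl) * log_p           # REINFORCE with rollout baseline
        opt.zero_grad(); loss.backward(); opt.step()
    # At each epoch's end, copy params into bl_params if the improvement is
    # significant (the paper uses a paired t-test).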
Experiments
• Compare against heuristic solvers, non-learned baselines, and learned
heuristics (structure2vec, pointer network (PN), PN+RL).
• TSP20 results compared to the pointer network over 10,000 instances.
PN: pointer network; AM: attention model (proposed method).
Generalization ability
Discussion
• Introduced a model and a training method that both contribute to
significantly improved results over learned heuristics for TSP.
• Using attention instead of recurrence introduces invariance to the
input order of the nodes, increasing learning efficiency.
• The multi-head attention mechanism allows nodes to communicate
relevant information over different channels.