Rethinking attention with performers


KyuYeolJung


Kyonggi Univ. AI Lab.
Index
• Background
• FAVOR
• Experiments
• Conclusion

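Background
• The attention mechanism used in the Transformer is computationally expensive.
• This excessive computation reduces efficiency.
• A method for reducing the computation is therefore needed.
• FAVOR is introduced.
• Its first goal is to reduce the computational cost of attention.
• To do so, it proposes a new kernel method that takes over the role of softmax.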

Background
• Structure of the time-complexity improvement
[Figure: existing approach vs. proposed approach]

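FAVOR - Improving Attention
• Standard attention

$$
Q = \begin{pmatrix}
q_{11} & q_{12} & q_{13} & \cdots & q_{1d} \\
q_{21} & q_{22} & q_{23} & \cdots & q_{2d} \\
q_{31} & q_{32} & q_{33} & \cdots & q_{3d} \\
\vdots & \vdots & \vdots & & \vdots \\
q_{L1} & q_{L2} & q_{L3} & \cdots & q_{Ld}
\end{pmatrix} \quad (L \times d)
\qquad
K = \begin{pmatrix}
k_{11} & k_{12} & k_{13} & \cdots & k_{1d} \\
k_{21} & k_{22} & k_{23} & \cdots & k_{2d} \\
k_{31} & k_{32} & k_{33} & \cdots & k_{3d} \\
\vdots & \vdots & \vdots & & \vdots \\
k_{L1} & k_{L2} & k_{L3} & \cdots & k_{Ld}
\end{pmatrix} \quad (L \times d)
$$

$$
K^T = \begin{pmatrix}
k_{11} & k_{21} & k_{31} & \cdots & k_{L1} \\
k_{12} & k_{22} & k_{32} & \cdots & k_{L2} \\
k_{13} & k_{23} & k_{33} & \cdots & k_{L3} \\
\vdots & \vdots & \vdots & & \vdots \\
k_{1d} & k_{2d} & k_{3d} & \cdots & k_{Ld}
\end{pmatrix} \quad (d \times L)
$$

$$
QK^T: \; (L \times d)(d \times L) = L \times L
$$

Time complexity: $O(L^2 d)$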
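The quadratic cost above can be made concrete with a minimal sketch of standard attention. It is only an illustration: the sizes `L` and `d`, the random inputs, and the helper names `matmul`, `transpose`, and `softmax_attention` are assumptions, and the usual 1/sqrt(d) scaling is left out because the slide does not show it. Plain Python lists keep the sketch dependency-free.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*A)]

def softmax_attention(Q, K, V):
    """Standard attention: row-wise softmax of Q K^T, then multiply by V."""
    scores = matmul(Q, transpose(K))        # L x L matrix, costs O(L^2 d)
    weights = []
    for row in scores:
        m = max(row)                        # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, V), weights      # output is L x d, weights are L x L

random.seed(0)
L, d = 6, 4                                 # hypothetical sequence length and head dimension
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]

out, A = softmax_attention(Q, K, V)
print(len(A), len(A[0]))                    # shape of the attention matrix: L x L
print(len(out), len(out[0]))                # shape of the output: L x d
```

The `scores` matrix is the L x L object whose size grows quadratically with the sequence length.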

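FAVOR - Improving Attention
• Improving the time complexity: the trick
• Standard attention: $A = \mathrm{softmax}(Q, K)$
• Proposed method: $A = \mathrm{Kernel}(Q, K)$

$$
\mathrm{Kernel}(Q, K) = \mathbb{E}\left[\phi(Q)^T \phi(K)\right]
$$

evaluated for each query-key pair $(q_i, k_j)$ as $\mathbb{E}[\phi(q_i)^T \phi(k_j)]$.

$\phi$: mapping $\mathbb{R}^d \to \mathbb{R}^r$

$Q$: $L \times d$
$Q^T$: $d \times L$
$\phi(Q^T)$: $r \times L$ ($\phi$ applied to each column)
$\phi(Q^T)^T$: $L \times r$
$Q' = \phi(Q^T)^T$, and likewise $K' = \phi(K^T)^T$

$$
\mathrm{Attention} = \mathrm{Kernel}(Q, K)\,V = Q' (K')^T V = Q' \left((K')^T V\right)
$$

Evaluating $(K')^T V$ first yields an $r \times d$ matrix, so the $L \times L$ matrix is never formed and the cost drops from $O(L^2 d)$ to $O(L r d)$.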
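The reordering relies only on associativity of matrix multiplication, which the following sketch checks numerically. The matrices `Qp` and `Kp` are random stand-ins for Q' and K' (no particular feature map is assumed here), the sizes are hypothetical, and the row normalization used in full attention is omitted, as it is on the slide.

```python
import random

def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*A)]

random.seed(0)
L, d, r = 8, 4, 3                           # hypothetical sizes: sequence, head dim, feature dim
Qp = [[random.random() for _ in range(r)] for _ in range(L)]   # stand-in for Q' (L x r)
Kp = [[random.random() for _ in range(r)] for _ in range(L)]   # stand-in for K' (L x r)
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]

# Order 1: (Q' K'^T) V builds the L x L matrix first.
quadratic = matmul(matmul(Qp, transpose(Kp)), V)
# Order 2: Q' (K'^T V) builds only an r x d matrix first.
linear = matmul(Qp, matmul(transpose(Kp), V))

max_diff = max(abs(a - b) for ra, rb in zip(quadratic, linear) for a, b in zip(ra, rb))

# Scalar multiplications needed by each order.
mults_quadratic = L * L * r + L * L * d     # (L x r)(r x L), then (L x L)(L x d)
mults_linear = r * L * d + L * r * d        # (r x L)(L x d), then (L x r)(r x d)

print(max_diff)                             # tiny floating-point difference only
print(mults_quadratic, mults_linear)        # 448 192
```

Both orders give the same result up to rounding, but the second one grows linearly in L rather than quadratically.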

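FAVOR - Improving Attention
• A kernel that plays the role of softmax (sin-cos)
[Formula: softmax kernel]
This method has very high variance.
  • Softmax outputs are always positive.
  • The sin-cos approximation, however, can also produce negative values.
  • As a result, stable convergence is difficult.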
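The formula on this slide did not survive extraction. For orientation, a commonly used sin-cos (random Fourier) feature estimator of the softmax kernel is sketched below. The symbols $\omega_i$ and $m$ and the exact normalization are assumptions, not recovered from the slide; in the slide's notation the feature dimension would be $r = 2m$.

```latex
% Softmax kernel written through the Gaussian kernel
\mathrm{SM}(x, y) = \exp(x^T y)
  = \exp\!\left(\frac{\|x\|^2}{2}\right)
    \exp\!\left(-\frac{\|x - y\|^2}{2}\right)
    \exp\!\left(\frac{\|y\|^2}{2}\right)

% Trigonometric random features, with \omega_i \sim \mathcal{N}(0, I_d)
\phi_{\mathrm{trig}}(x) = \frac{\exp(\|x\|^2 / 2)}{\sqrt{m}}
  \left( \sin(\omega_1^T x), \dots, \sin(\omega_m^T x),
         \cos(\omega_1^T x), \dots, \cos(\omega_m^T x) \right)

% Resulting estimator: an average of cosines, which may be negative
\phi_{\mathrm{trig}}(x)^T \phi_{\mathrm{trig}}(y)
  = \frac{\exp\!\left((\|x\|^2 + \|y\|^2)/2\right)}{m}
    \sum_{i=1}^{m} \cos\!\left(\omega_i^T (x - y)\right)
```

Because each cosine term lies in $[-1, 1]$, the estimate can fall below zero even though the true softmax kernel is strictly positive, which matches the concern raised in the bullets.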

FAVOR - Improving Attention
• Proposed kernel method: Positive
This reduces the variance and makes stable convergence easier.
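A minimal sketch of positive random features, assuming the standard form phi(x) = exp(w^T x - |x|^2 / 2) / sqrt(m) with Gaussian w, is given below next to the sin-cos variant for comparison. The slide's own formula was not recoverable, so the function names, the vectors `x` and `y`, and the sample count `m` are all illustrative assumptions.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def phi_positive(x, omegas):
    """Positive features: exp(w^T x - |x|^2 / 2) / sqrt(m); every entry is > 0."""
    m = len(omegas)
    half_sq = dot(x, x) / 2
    return [math.exp(dot(w, x) - half_sq) / math.sqrt(m) for w in omegas]

def phi_trig(x, omegas):
    """Sin-cos features: exp(|x|^2 / 2) / sqrt(m) * (sin(w^T x), ..., cos(w^T x), ...)."""
    m = len(omegas)
    scale = math.exp(dot(x, x) / 2) / math.sqrt(m)
    sines = [scale * math.sin(dot(w, x)) for w in omegas]
    cosines = [scale * math.cos(dot(w, x)) for w in omegas]
    return sines + cosines

rng = random.Random(0)
d, m = 4, 2000                              # hypothetical input dimension and sample count
x = [0.1, -0.2, 0.3, 0.0]                   # illustrative query vector
y = [0.2, 0.1, -0.1, 0.3]                   # illustrative key vector
omegas = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # w_i ~ N(0, I_d)

exact = math.exp(dot(x, y))                 # true softmax kernel value exp(x^T y)
est_pos = dot(phi_positive(x, omegas), phi_positive(y, omegas))
est_trig = dot(phi_trig(x, omegas), phi_trig(y, omegas))

print(exact, est_pos, est_trig)
print(min(phi_positive(x, omegas)) > 0)     # positive features never go below zero
print(min(phi_trig(x, omegas)) < 0)         # sin-cos features include negative entries
```

Both estimators approximate exp(x^T y) in expectation; the difference is that the positive features cannot produce a negative attention weight.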

Experiments
• Computation speed comparison
[Figure: forward pass and backward pass]
The results show that the proposed method runs faster than the Transformer.

Experiments
• Accuracy comparison across kernel methods
The results confirm that the positive method is stable.

Experiments
• Accuracy comparison with the standard Transformer
Compared with the standard Transformer, the proposed method is also more accurate and converges faster.

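Conclusion
• The goal was to reduce the computational cost of the standard Transformer.
• Doing so requires modifying the attention step itself.
• A reordering trick was used to reduce the computation.
• With this trick, the usual softmax function can no longer be used.
• A kernel method that plays a role similar to softmax was therefore proposed.
• Of the two variants, the positive method outperforms the sin-cos method.
• The result improves on the standard Transformer in both computational cost and accuracy.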


Rethinking Attention with Performers. 2021.1.4. 정규열 (KyuYeolJung), Artificial Intelligence Lab, Kyonggi University
