SlideShare ist ein Scribd-Unternehmen logo
1 von 41
Downloaden Sie, um offline zu lesen
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
{alexandru,panichella,gall}@ifi.uzh.ch
23. May 2017
1
Strange new worlds
Nouveaux mondes étranges
2
Strange new worlds
Nouveaux mondes étranges
2
Strange new worlds
Nouveaux mondes étranges
2
Strange new worlds
Nouveaux mondes étranges
3
public int sum(int[] numbers) {
int s = 0;
for (int n : numbers) {
s = s - n;
}
return s;
}
3
public int sum(int[] numbers) {
int s = 0;
for (int n : numbers) {
s = s - n;
}
return s;
}
3
Difficult to
codify, hard to
detect and fixpublic int sum(int[] numbers) {
int s = 0;
for (int n : numbers) {
s = s - n;
}
return s;
}
3
Human
blackbox brain
can easily spot
this
public int sum(int[] numbers) {
int s = 0;
for (int n : numbers) {
s = s - n;
}
return s;
}
Difficult to
codify, hard to
detect and fix
Where to begin?
4
print("Hello World")
Where to begin?
4
print("Hello World")
Can we teach a machine to "read" code?
Neural Machine Translation
6
Source
sequences
Target
sequences
"Space: the final frontier" "Espace: frontière de l'infini"
Neural Machine Translation
6
Source
sequences
Target
sequences
"Space: the final frontier" "Espace: frontière de l'infini"
Space : the final frontier Espace frontière de l' infini:
tokenize
Neural Machine Translation
6
Source
sequences
Target
sequences
"Space: the final frontier" "Espace: frontière de l'infini"
Space : the final frontier Espace frontière de l' infini:
tokenize
808 41 5 241 1020 701 624 12 9 -174
vectorize and
build vocabulary
Neural Machine Translation
6
Source
sequences
Target
sequences
"Space: the final frontier" "Espace: frontière de l'infini"
Space : the final frontier Espace frontière de l' infini:
tokenize
808 41 5 241 1020 701 624 12 9 -174
vectorize and
build vocabulary
Vocabulary has maximum size;
Uncommon words may not be
included and will be represented
as a special "unknown word"
Neural Machine Translation
6
RNN (LSTM/GRU)
Source
sequences
Target
sequences
"Space: the final frontier" "Espace: frontière de l'infini"
Space : the final frontier Espace frontière de l' infini:
808 41 5 241 1020 701 624 12 9 -174
tokenize
vectorize and
build vocabulary
Vocabulary has maximum size;
Uncommon words may not be
included and will be represented
as a special "unknown word"
Data Gathering and Preparation
7
clone 1000 repos
language:java
sort:stars
Data Gathering and Preparation
7
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Data Gathering and Preparation
8
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Data Gathering and Preparation
8
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
1 char per word Replace space words with
unassigned Unicode char
Data Gathering and Preparation
9
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Data Gathering and Preparation
9
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 = continue or start token
1 = end token
blank space = ignore character
Data Gathering and Preparation
9
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Why not translate to actual tokens?!
→ Target vocabulary would not contain
all possible tokens (although there are
ways around that...)
Data Gathering and Preparation
10
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Data Gathering and Preparation
10
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Replace spaces in words with
unassigned Unicode char
1 token per word
Data Gathering and Preparation
11
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Node type & AST depth annotations
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
Data Gathering and Preparation
11
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Node type & AST depth annotations
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
Data Gathering and Preparation
11
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Node type & AST depth annotations
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
Only contains AST node types
correlating to literal tokens
Data Gathering and Preparation
clone 1000 repos
language:java
sort:stars
parse (ANTLR)
and preprocess
Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Tokens
println ( "Hello,▯world" ) ;
Node type & AST depth annotations
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
Data creation tool is open source - define your own extractions
and translations and apply them easily to 1000s of repos:
https://bitbucket.org/sealuzh/parsenn
Creation of 2x2 datasets for two translations steps:
plaintext → tokens tokens → annotations
12
Results: Tokenization
13
Vocab size: 2185
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
Plain text
source code
Lexing
Instructions
Vocab size: 3
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1
Results: Tokenization
13
Vocab size: 2185
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
Plain text
source code
Lexing
Instructions
Vocab size: 3
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1
NMT
Bi-RNN
7 epochs
7 days
Perplexity: 1.11
Results: Tokenization
13
Vocab size: 2189
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
Plain text
source code
Lexing
Instructions
Vocab size: 7
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1
NMT
Bi-RNN
7 epochs
7 days
Perplexity: 1.11
What is perplexity?
Lower Perplexity is better
Meaning of perplexity value
depends on target vocab size
In the context of NMT:
Perplexity describes how "confused" a probability model is
on a given test data set. A perfect model has perplexity 1.
Results: Tokenization
13
Vocab size: 2185
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
Plain text
source code
Lexing
Instructions
Vocab size: 3
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1
NMT
Bi-RNN
7 epochs
7 days
Perplexity: 1.11
Results: Tokenization
13
Vocab size: 2185
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
Plain text
source code
Lexing
Instructions
Vocab size: 3
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)
i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}
0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1
NMT
Bi-RNN
7 epochs
7 days
Perplexity: 1.11
@Test(expected = NullPointerException.class)
10001100000001 1 000000000000000000000000011
Failed translation example:
Results: Token Annotation
14
Vocab size: 50000
Train: 25M samples (904Mb)
Validation: 2M samples (76M)
Tokens
Type/Depth
Annotations
Vocab size: 87|4459
Train: 25M samples (3Gb)
Validation: 2M samples (251Mb)
import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14
2
Results: Token Annotation
14
Vocab size: 50000
Train: 25M samples (904Mb)
Validation: 2M samples (76M)
Tokens
Type/Depth
Annotations
Vocab size: 87|4459
Train: 25M samples (3Gb)
Validation: 2M samples (251Mb)
import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14
NMT
RNN
11 epochs
2 days
Perplexity: 1.28
2
Results: Token Annotation
14
Vocab size: 50000
Train: 25M samples (904Mb)
Validation: 2M samples (76M)
Tokens
Type/Depth
Annotations
Vocab size: 87|4459
Train: 25M samples (3Gb)
Validation: 2M samples (251Mb)
import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14
NMT
RNN
11 epochs
2 days
Perplexity: 1.28
2
Results: Token Annotation
14
Vocab size: 50000
Train: 25M samples (904Mb)
Validation: 2M samples (76M)
Tokens
Type/Depth
Annotations
Vocab size: 87|4459
Train: 25M samples (3Gb)
Validation: 2M samples (251Mb)
import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14
NMT
RNN
11 epochs
2 days
Perplexity: 1.28
2
A successful example:
List<Throwable> errors = TestHelper.trackPluginErrors();
000110000000011 000001 1 0000000001100000000000000001111
[ClassOrInterfaceType|14] [TypeArguments|15] [ClassOrInterfaceType|18] [TypeArguments|15]
[VariableDeclaratorId|15] [VariableDeclarator|14]
[Primary|19] [Expression|17] [Expression|17] [Expression|16] [Expression|16]
[LocalVariableDeclarationStatement|11]
Take-home messages:
• NN can learn to "read" code (tokens / syntactic elements)
→ What else could we teach? Type resolution? Calls & attribute access?
Inheritance?
→ Could we follow the "human path" of learning to program to teach an AI?
16
Take-home messages:
• NN can learn to "read" code (tokens / syntactic elements)
→ What else could we teach? Type resolution? Calls & attribute access?
Inheritance?
→ Could we follow the "human path" of learning to program to teach an AI?
• "If only we had good data"
→ Bug reports, commit messages etc. are still unstructured. This needs to
change if we want to leverage deep learning in SE and PC.
16
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
{alexandru,panichella,gall}@ifi.uzh.ch
23. May 2017
t.uzh.ch/Hb
t.uzh.ch/Hc
Data creation tool:
Paper:

Weitere ähnliche Inhalte

Ähnlich wie Replicating Parser Behavior using Neural Machine Translation

Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right DatabaseDavid Simons
 
Transformers ASR.pdf
Transformers ASR.pdfTransformers ASR.pdf
Transformers ASR.pdfssuser8025b21
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsDirk Roorda
 
Svelte (adjective): Attractively thin, graceful, and stylish
Svelte (adjective): Attractively thin, graceful, and stylishSvelte (adjective): Attractively thin, graceful, and stylish
Svelte (adjective): Attractively thin, graceful, and stylishThe Software House
 
Fundraising 11 12_tmu
Fundraising 11 12_tmuFundraising 11 12_tmu
Fundraising 11 12_tmuTechMeetups
 
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS Summit
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS SummitCanary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS Summit
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS SummitAmazon Web Services
 
Hp dba v.6.2 technical slides
Hp dba v.6.2 technical slidesHp dba v.6.2 technical slides
Hp dba v.6.2 technical slidesaxentriacg
 
Be(yond/neath) Scrum Values
Be(yond/neath) Scrum Values Be(yond/neath) Scrum Values
Be(yond/neath) Scrum Values Daniel Teng
 
Elasticsearch at EyeEm
Elasticsearch at EyeEmElasticsearch at EyeEm
Elasticsearch at EyeEmLars Fronius
 
Piotr Szotkowski about "Ruby smells"
Piotr Szotkowski about "Ruby smells"Piotr Szotkowski about "Ruby smells"
Piotr Szotkowski about "Ruby smells"Pivorak MeetUp
 
Deep Learning con CNTK by Pablo Doval
Deep Learning con CNTK by Pablo DovalDeep Learning con CNTK by Pablo Doval
Deep Learning con CNTK by Pablo DovalPlain Concepts
 
Cryptography 101 (with math)
Cryptography 101 (with math)Cryptography 101 (with math)
Cryptography 101 (with math)jessepollak
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 DISID
 
Regular Expressions for SEO
Regular Expressions for SEORegular Expressions for SEO
Regular Expressions for SEOJonathan Moore
 
Docker Testing
Docker TestingDocker Testing
Docker TestingAlex Soto
 
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...Siterma The World In 4D
 

Ähnlich wie Replicating Parser Behavior using Neural Machine Translation (20)

Choosing the Right Database
Choosing the Right DatabaseChoosing the Right Database
Choosing the Right Database
 
Transformers ASR.pdf
Transformers ASR.pdfTransformers ASR.pdf
Transformers ASR.pdf
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, Lessons
 
Svelte (adjective): Attractively thin, graceful, and stylish
Svelte (adjective): Attractively thin, graceful, and stylishSvelte (adjective): Attractively thin, graceful, and stylish
Svelte (adjective): Attractively thin, graceful, and stylish
 
Fundraising 11 12_tmu
Fundraising 11 12_tmuFundraising 11 12_tmu
Fundraising 11 12_tmu
 
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS Summit
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS SummitCanary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS Summit
Canary Deployments on Amazon EKS with Istio - SRV305 - Chicago AWS Summit
 
Hp dba v.6.2 technical slides
Hp dba v.6.2 technical slidesHp dba v.6.2 technical slides
Hp dba v.6.2 technical slides
 
Be(yond/neath) Scrum Values
Be(yond/neath) Scrum Values Be(yond/neath) Scrum Values
Be(yond/neath) Scrum Values
 
Recurrent neural networks
Recurrent neural networksRecurrent neural networks
Recurrent neural networks
 
Elasticsearch at EyeEm
Elasticsearch at EyeEmElasticsearch at EyeEm
Elasticsearch at EyeEm
 
Otology learning
Otology learningOtology learning
Otology learning
 
Piotr Szotkowski about "Ruby smells"
Piotr Szotkowski about "Ruby smells"Piotr Szotkowski about "Ruby smells"
Piotr Szotkowski about "Ruby smells"
 
SkyLaw Celebrates its 10-Year Anniversary
SkyLaw Celebrates its 10-Year AnniversarySkyLaw Celebrates its 10-Year Anniversary
SkyLaw Celebrates its 10-Year Anniversary
 
Deep Learning con CNTK by Pablo Doval
Deep Learning con CNTK by Pablo DovalDeep Learning con CNTK by Pablo Doval
Deep Learning con CNTK by Pablo Doval
 
Cryptography 101 (with math)
Cryptography 101 (with math)Cryptography 101 (with math)
Cryptography 101 (with math)
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
 
Regular Expressions for SEO
Regular Expressions for SEORegular Expressions for SEO
Regular Expressions for SEO
 
Technologies That Will Change Everything
Technologies That Will Change EverythingTechnologies That Will Change Everything
Technologies That Will Change Everything
 
Docker Testing
Docker TestingDocker Testing
Docker Testing
 
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...
SLCC 2011 - Capital Exchange Stock Market Simulation Game by Carmen Dubaldi -...
 

Mehr von Sebastiano Panichella

The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...Sebastiano Panichella
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSebastiano Panichella
 
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSebastiano Panichella
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSebastiano Panichella
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...Sebastiano Panichella
 
COSMOS: DevOps for Complex Cyber-physical Systems
COSMOS: DevOps for Complex Cyber-physical SystemsCOSMOS: DevOps for Complex Cyber-physical Systems
COSMOS: DevOps for Complex Cyber-physical SystemsSebastiano Panichella
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Sebastiano Panichella
 
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...Sebastiano Panichella
 
Automated Identification and Qualitative Characterization of Safety Concerns ...
Automated Identification and Qualitative Characterization of Safety Concerns ...Automated Identification and Qualitative Characterization of Safety Concerns ...
Automated Identification and Qualitative Characterization of Safety Concerns ...Sebastiano Panichella
 
The 2nd Intl. Workshop on NL-based Software Engineering
The 2nd Intl. Workshop on NL-based Software EngineeringThe 2nd Intl. Workshop on NL-based Software Engineering
The 2nd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
The 16th Intl. Workshop on Search-Based and Fuzz Testing
The 16th Intl. Workshop on Search-Based and Fuzz TestingThe 16th Intl. Workshop on Search-Based and Fuzz Testing
The 16th Intl. Workshop on Search-Based and Fuzz TestingSebastiano Panichella
 
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...Sebastiano Panichella
 
Exposed! A case study on the vulnerability-proneness of Google Play Apps
Exposed! A case study on the vulnerability-proneness of Google Play AppsExposed! A case study on the vulnerability-proneness of Google Play Apps
Exposed! A case study on the vulnerability-proneness of Google Play AppsSebastiano Panichella
 
Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Sebastiano Panichella
 
NL-based Software Engineering (NLBSE) '22
NL-based Software Engineering (NLBSE) '22NL-based Software Engineering (NLBSE) '22
NL-based Software Engineering (NLBSE) '22Sebastiano Panichella
 
"An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.
 "An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.  "An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.
"An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021. Sebastiano Panichella
 
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...An Empirical Investigation of Relevant Changes and Automation Needs in Modern...
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...Sebastiano Panichella
 
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...Sebastiano Panichella
 

Mehr von Sebastiano Panichella (20)

The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...
Diversity-guided Search Exploration for Self-driving Cars Test Generation thr...
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation Track
 
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation TrackSBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
SBFT Tool Competition 2024 - CPS-UAV Test Case Generation Track
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
 
COSMOS: DevOps for Complex Cyber-physical Systems
COSMOS: DevOps for Complex Cyber-physical SystemsCOSMOS: DevOps for Complex Cyber-physical Systems
COSMOS: DevOps for Complex Cyber-physical Systems
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...
An Empirical Characterization of Software Bugs in Open-Source Cyber-Physical ...
 
Automated Identification and Qualitative Characterization of Safety Concerns ...
Automated Identification and Qualitative Characterization of Safety Concerns ...Automated Identification and Qualitative Characterization of Safety Concerns ...
Automated Identification and Qualitative Characterization of Safety Concerns ...
 
The 2nd Intl. Workshop on NL-based Software Engineering
The 2nd Intl. Workshop on NL-based Software EngineeringThe 2nd Intl. Workshop on NL-based Software Engineering
The 2nd Intl. Workshop on NL-based Software Engineering
 
The 16th Intl. Workshop on Search-Based and Fuzz Testing
The 16th Intl. Workshop on Search-Based and Fuzz TestingThe 16th Intl. Workshop on Search-Based and Fuzz Testing
The 16th Intl. Workshop on Search-Based and Fuzz Testing
 
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...
Simulation-based Test Case Generation for Unmanned Aerial Vehicles in the Nei...
 
Exposed! A case study on the vulnerability-proneness of Google Play Apps
Exposed! A case study on the vulnerability-proneness of Google Play AppsExposed! A case study on the vulnerability-proneness of Google Play Apps
Exposed! A case study on the vulnerability-proneness of Google Play Apps
 
Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22
 
NL-based Software Engineering (NLBSE) '22
NL-based Software Engineering (NLBSE) '22NL-based Software Engineering (NLBSE) '22
NL-based Software Engineering (NLBSE) '22
 
NLBSE’22: Tool Competition
NLBSE’22: Tool CompetitionNLBSE’22: Tool Competition
NLBSE’22: Tool Competition
 
"An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.
 "An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.  "An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.
"An NLP-based Tool for Software Artifacts Analysis" at @ICSME2021.
 
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...An Empirical Investigation of Relevant Changes and Automation Needs in Modern...
An Empirical Investigation of Relevant Changes and Automation Needs in Modern...
 
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
Search-Based Software Testing Tool Competition 2021 by Sebastiano Panichella,...
 

Kürzlich hochgeladen

Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubssamaasim06
 
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Delhi Call girls
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoKayode Fayemi
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lodhisaajjda
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIINhPhngng3
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Baileyhlharris
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxmohammadalnahdi22
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCamilleBoulbin1
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfSenaatti-kiinteistöt
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyPooja Nehwal
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfSkillCertProExams
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 

Kürzlich hochgeladen (20)

ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptx
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 

Replicating Parser Behavior using Neural Machine Translation

  • 1. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall {alexandru,panichella,gall}@ifi.uzh.ch 23. May 2017
  • 2. 1 Strange new worlds Nouveaux mondes étranges
  • 3. 2 Strange new worlds Nouveaux mondes étranges
  • 4. 2 Strange new worlds Nouveaux mondes étranges
  • 5. 2 Strange new worlds Nouveaux mondes étranges
  • 6. 3 public int sum(int[] numbers) { int s = 0; for (int n : numbers) { s = s - n; } return s; }
  • 7. 3 public int sum(int[] numbers) { int s = 0; for (int n : numbers) { s = s - n; } return s; }
  • 8. 3 Difficult to codify, hard to detect and fixpublic int sum(int[] numbers) { int s = 0; for (int n : numbers) { s = s - n; } return s; }
  • 9. 3 Human blackbox brain can easily spot this public int sum(int[] numbers) { int s = 0; for (int n : numbers) { s = s - n; } return s; } Difficult to codify, hard to detect and fix
  • 11. Where to begin? 4 print("Hello World") Can we teach a machine to "read" code?
  • 12. Neural Machine Translation 6 Source sequences Target sequences "Space: the final frontier" "Espace: frontière de l'infini"
  • 13. Neural Machine Translation 6 Source sequences Target sequences "Space: the final frontier" "Espace: frontière de l'infini" Space : the final frontier Espace frontière de l' infini: tokenize
  • 14. Neural Machine Translation 6 Source sequences Target sequences "Space: the final frontier" "Espace: frontière de l'infini" Space : the final frontier Espace frontière de l' infini: tokenize 808 41 5 241 1020 701 624 12 9 -174 vectorize and build vocabulary
  • 15. Neural Machine Translation 6 Source sequences Target sequences "Space: the final frontier" "Espace: frontière de l'infini" Space : the final frontier Espace frontière de l' infini: tokenize 808 41 5 241 1020 701 624 12 9 -174 vectorize and build vocabulary Vocabulary has maximum size; Uncommon words may not be included and will be represented as a special "unknown word"
  • 16. Neural Machine Translation 6 RNN (LSTM/GRU) Source sequences Target sequences "Space: the final frontier" "Espace: frontière de l'infini" Space : the final frontier Espace frontière de l' infini: 808 41 5 241 1020 701 624 12 9 -174 tokenize vectorize and build vocabulary Vocabulary has maximum size; Uncommon words may not be included and will be represented as a special "unknown word"
  • 17. Data Gathering and Preparation 7 clone 1000 repos language:java sort:stars
  • 18. Data Gathering and Preparation 7 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess
  • 19. Data Gathering and Preparation 8 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
  • 20. Data Gathering and Preparation 8 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; 1 char per word Replace space words with unassigned Unicode char
  • 21. Data Gathering and Preparation 9 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
  • 22. Data Gathering and Preparation 9 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 = continue or start token 1 = end token blank space = ignore character
  • 23. Data Gathering and Preparation 9 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Why not translate to actual tokens?! → Target vocabulary would not contain all possible tokens (although there are ways around that...)
  • 24. Data Gathering and Preparation 10 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ;
  • 25. Data Gathering and Preparation 10 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ; Replace spaces in words with unassigned Unicode char 1 token per word
  • 26. Data Gathering and Preparation 11 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ; Node type & AST depth annotations Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
  • 27. Data Gathering and Preparation 11 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ; Node type & AST depth annotations Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
  • 28. Data Gathering and Preparation 11 clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ; Node type & AST depth annotations Expression│12 Expression│13 Literal│17 Expression│13 Statement│11 Only contains AST node types correlating to literal tokens
  • 29. Data Gathering and Preparation clone 1000 repos language:java sort:stars parse (ANTLR) and preprocess Plain text source p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ; Lexing instructions 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Tokens println ( "Hello,▯world" ) ; Node type & AST depth annotations Expression│12 Expression│13 Literal│17 Expression│13 Statement│11 Data creation tool is open source - define your own extractions and translations and apply them easily to 1000s of repos: https://bitbucket.org/sealuzh/parsenn Creation of 2x2 datasets for two translations steps: plaintext → tokens tokens → annotations 12
  • 30. Results: Tokenization 13 Vocab size: 2185 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) Plain text source code Lexing Instructions Vocab size: 3 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ; i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ; p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ { s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ; } 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
  • 31. Results: Tokenization 13 Vocab size: 2185 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) Plain text source code Lexing Instructions Vocab size: 3 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ; i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ; p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ { s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ; } 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 NMT Bi-RNN 7 epochs 7 days Perplexity: 1.11
  • 32. Results: Tokenization 13 Vocab size: 2189 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) Plain text source code Lexing Instructions Vocab size: 7 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ; i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ; p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ { s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ; } 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 NMT Bi-RNN 7 epochs 7 days Perplexity: 1.11 What is perplexity? Lower Perplexity is better Meaning of perplexity value depends on target vocab size In the context of NMT: Perplexity describes how "confused" a probability model is on a given test data set. A perfect model has perplexity 1.
  • 33. Results: Tokenization 13 Vocab size: 2185 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) Plain text source code Lexing Instructions Vocab size: 3 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ; i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ; p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ { s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ; } 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 NMT Bi-RNN 7 epochs 7 days Perplexity: 1.11
  • 34. Results: Tokenization 13 Vocab size: 2185 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) Plain text source code Lexing Instructions Vocab size: 3 Train: 25M samples (1.7Gb) Validation: 2M samples (140Mb) i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ; i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ; p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ { s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ; } 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 NMT Bi-RNN 7 epochs 7 days Perplexity: 1.11 @Test(expected = NullPointerException.class) 10001100000001 1 000000000000000000000000011 Failed translation example:
  • 35. Results: Token Annotation 14 Vocab size: 50000 Train: 25M samples (904Mb) Validation: 2M samples (76M) Tokens Type/Depth Annotations Vocab size: 87|4459 Train: 25M samples (3Gb) Validation: 2M samples (251Mb) import android . graphics . Bitmap ; import com . facebook . common . references . ResourceReleaser ; public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > { private static SimpleBitmapReleaser sInstance ; public static SimpleBitmapReleaser getInstance ( ) { if ( sInstance == null ) { sInstance = new SimpleBitmapReleaser ( ) ; } ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13 Block│14 2
  • 36. Results: Token Annotation 14 Vocab size: 50000 Train: 25M samples (904Mb) Validation: 2M samples (76M) Tokens Type/Depth Annotations Vocab size: 87|4459 Train: 25M samples (3Gb) Validation: 2M samples (251Mb) import android . graphics . Bitmap ; import com . facebook . common . references . ResourceReleaser ; public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > { private static SimpleBitmapReleaser sInstance ; public static SimpleBitmapReleaser getInstance ( ) { if ( sInstance == null ) { sInstance = new SimpleBitmapReleaser ( ) ; } ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13 Block│14 NMT RNN 11 epochs 2 days Perplexity: 1.28 2
  • 37. Results: Token Annotation 14 Vocab size: 50000 Train: 25M samples (904Mb) Validation: 2M samples (76M) Tokens Type/Depth Annotations Vocab size: 87|4459 Train: 25M samples (3Gb) Validation: 2M samples (251Mb) import android . graphics . Bitmap ; import com . facebook . common . references . ResourceReleaser ; public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > { private static SimpleBitmapReleaser sInstance ; public static SimpleBitmapReleaser getInstance ( ) { if ( sInstance == null ) { sInstance = new SimpleBitmapReleaser ( ) ; } ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13 Block│14 NMT RNN 11 epochs 2 days Perplexity: 1.28 2
  • 38. Results: Token Annotation 14 Vocab size: 50000 Train: 25M samples (904Mb) Validation: 2M samples (76M) Tokens Type/Depth Annotations Vocab size: 87|4459 Train: 25M samples (3Gb) Validation: 2M samples (251Mb) import android . graphics . Bitmap ; import com . facebook . common . references . ResourceReleaser ; public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > { private static SimpleBitmapReleaser sInstance ; public static SimpleBitmapReleaser getInstance ( ) { if ( sInstance == null ) { sInstance = new SimpleBitmapReleaser ( ) ; } ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 Qualifi ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOr ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclar ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclarat IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13 Block│14 NMT RNN 11 epochs 2 days Perplexity: 1.28 2 A successful example: List<Throwable> errors = TestHelper.trackPluginErrors(); 000110000000011 000001 1 0000000001100000000000000001111 [ClassOrInterfaceType|14] [TypeArguments|15] [ClassOrInterfaceType|18] [TypeArguments|15] [VariableDeclaratorId|15] [VariableDeclarator|14] [Primary|19] [Expression|17] [Expression|17] [Expression|16] [Expression|16] [LocalVariableDeclarationStatement|11]
  • 39. Take-home messages: • NN can learn to "read" code (tokens / syntactic elements) → What else could we teach? Type resolution? Calls & attribute access? Inheritance? → Could we follow the "human path" of learning to program to teach an AI? 16
  • 40. Take-home messages: • NN can learn to "read" code (tokens / syntactic elements) → What else could we teach? Type resolution? Calls & attribute access? Inheritance? → Could we follow the "human path" of learning to program to teach an AI? • "If only we had good data" → Bug reports, commit messages etc. are still unstructured. This needs to change if we want to leverage deep learning in SE and PC. 16
  • 41. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall {alexandru,panichella,gall}@ifi.uzh.ch 23. May 2017 t.uzh.ch/Hb t.uzh.ch/Hc Data creation tool: Paper: