The Application of Grammar Inference to Software Language Engineering
1. SLATE 2015, Madrid, Spain June 18-19, 2015
1/67
The Application of Grammar
Inference to Software
Language Engineering
Marjan Mernik
University of Maribor, Slovenia
2. SLATE 2015, Madrid, Spain June 18-19, 2015
2/67
Outline of the Presentation
• What is a Grammar Inference (GI)?
• A short introduction to Software Language
Engineering (SLE)
• Some applications of GI to SLE and SE (CFG
inference, Metamodel inference, Graph
grammar inference, semantic inference)
• Inside GI
• Conclusion
3. SLATE 2015, Madrid, Spain June 18-19, 2015
3/67
What is a Grammar Inference (GI)?
• Grammatical inference is a process of
learning the grammar from positive (and
negative) language samples (sentences).
• Grammatical inference attracts researchers
from different fields such as pattern
recognition, computational linguistic,
natural language acquisition, software
engineering, ...
4. SLATE 2015, Madrid, Spain June 18-19, 2015
4/67
What is a Grammar Inference (GI)?
• Given a sentence ps and a grammar G we can
tell whether ps belongs to L(G) (ps ∈ L(G)).
Such sentence is called positive sample.
• A set of positive samples is denoted with S+. In
similar manner we can defined set of negative
samples S-. Those samples do not belong to
L(G) and can no be derived from starting
symbol S.
5. SLATE 2015, Madrid, Spain June 18-19, 2015
5/67
What is a Grammar Inference (GI)?
• Given a set S+ and S-, which might be also
empty, the task of grammar inference is to
find at least one grammar G such that
S+⊆L(G) and S-⊆L (G).
• A set of positive samples S+ of a L(G) is
structurally complete if each grammar
production is used in the generation of at
least one sentence in S+.
6. SLATE 2015, Madrid, Spain June 18-19, 2015
6/67
What is a Grammar Inference (GI)?
print 5
print a where a=10
print b+1 where b=1
print a+b+2 where a=1, b=2
What
computer
language
she used?
Try out our newly
developed grammar
inference algorithm!
What is a
grammar of
this
language?
7. SLATE 2015, Madrid, Spain June 18-19, 2015
7/67
A short introduction to SLE
• Software Language Engineering (SLE) is a
young engineering discipline with the aim of
establishing a systematic and rigorous
approach to the development, use, and
maintenance of computer languages, which
comprises specification, modeling and
programming languages.
• SLE is equally focused on General-Purpose
Langauges (GPLs) and Domain-Specific
Languages (DSLs)
8. SLATE 2015, Madrid, Spain June 18-19, 2015
8/67
A short introduction to SLE
Technicalliterature
Existing
implementations
Customsurveys
Expertadvice
Currentandfuture
requirements
Terminology
Concepts
Commonalities
Variations
Syntax
Semantics
DSLDSL
Possible
existing
implementations
Requirements
DSL
DSL development phases
9. SLATE 2015, Madrid, Spain June 18-19, 2015
9/67
Some applications of GI to SLE
• A. Stevenson, J. R. Cordy. A Survey of grammatical inference
in software engineering. Science of Computer Programming
96 (2014) 444-459.
– Inference of GPLs
– Inference of DSLs (grammar and metamodel
based)
– Inference of Visual Languages (based on graph
grammars)
– Inference of execution traces
10. SLATE 2015, Madrid, Spain June 18-19, 2015
10/67
Some applications of GI to SLE
• Inference of GPLs
– Problem is still too complex to infer a GPL with several
hundreds of production rules.
– Several successful attempts have been developed to infer a
dialect GPL grammar G’ from existing GPL grammar G.
M. Di Penta, P. Lombardi, K. Taneja, L. Troiano. Search-
Based Inference of Dialect Grammars. Soft Computing 12
(2008) 51-66.
A. Dubej, P. Jalote, S. Aggarval. Learning context free
grammar rules from a set of programs. IET Software 2
(2008) 223-240.
11. SLATE 2015, Madrid, Spain June 18-19, 2015
11/67
Some applications of GI to SLE
• Inference of GPLs
– Restrictive assumption have been declared in these works:
1. G and G’ share the same set of non-terminals
2. T ⊆ T’
3. (T’-T) consists only of keywords and operators
4. Production rules can be only added and not removed
5. Missing rules always begin with a keyword
12. SLATE 2015, Madrid, Spain June 18-19, 2015
12/67
Some applications of GI to SLE
• HRNČIČ, Dejan, MERNIK, Marjan, BRYANT, Barrett Richard.
Embedding DSLS into GPLS: A Grammatical Inference
Approach. Information Technology and Control , 2011, vol.
40, no. 4, pp. 307-315.
• Our approach can be used also for syntax
extensions and for DSL embedding
– To embed domain-specific language (e.g, SQL)
into another programming language (GPL or DSL)
13. SLATE 2015, Madrid, Spain June 18-19, 2015
13/67
Some applications of GI to SLE
• Initial grammar (ANSI C):
1. translation unit ::= external decl
2. translation unit ::= translation unit external decl
3. external decl ::= function denition
4. external decl ::= decl
6. function denition ::= declarator decl list compound stat
9. decl ::= decl specs init declarator list ;
10. decl ::= decl specs ;
11. decl list ::= decl
12. decl list ::= decl list decl
15. decl specs ::= type spec decl specs
27. type spec ::= int | long | ...
45. init declarator list ::= init declarator
46. init declarator list ::= init declarator list , init declarator
47. init declarator ::= declarator
64. enumerator ::= id
65. enumerator ::= id = const exp
67. declarator ::= direct declarator
68. direct declarator ::= id
69. direct declarator ::= ( declarator )
70. direct declarator ::= direct declarator [ const exp ]
71. direct declarator ::= direct declarator [ ]
72. direct declarator ::= direct declarator ( param type list )
73. direct declarator ::= direct declarator ( id list )
74. direct declarator ::= direct declarator ( )
88. id list ::= id
89. id list ::= id list , id
90. initializer ::= assignment exp
91. initializer ::= initializer list
93. initializer list ::= initializer
94. initializer list ::= initializer list , initializer
110. stat ::= labeled stat | exp stat | compound stat | selection stat
114. stat ::= iteration stat | jump stat
116. labeled stat ::= id : stat
117. labeled stat ::= case const exp : stat
118. labeled stat ::= default : stat
119. exp stat ::= exp ;
120. exp stat ::= ;
121. compound stat ::= decl list stat list
125. stat list ::= stat
126. stat list ::= stat list stat
127. selection stat ::= if ( exp ) stat
129. selection stat ::= switch ( exp ) stat
130. iteration stat ::= while ( exp ) stat
131. iteration stat ::= do stat while ( exp ) ;
132. iteration stat ::= for ( exp ; exp ; exp ) stat
140. jump stat ::= goto id ; | continue ; | break ; | return exp ;
145. exp ::= assignment exp
146. exp ::= exp , assignment exp
147. assignment exp ::= conditional exp
148. assignment exp ::= conditional exp assignment operator
assignment exp
205. const ::= int const | char const | oat const
14. SLATE 2015, Madrid, Spain June 18-19, 2015
14/67
Some applications of GI to SLE
• Initial grammar (ANSI C):
int main() {
char str[][];
int i;
printf("Students:");
for(i = 0; i < str.length; i++) {
printf(str[i]);
}
return 0;
}
int main() {
char str[][] = { SELECT Name FROM
Students };
int i;
printf("Students:");
for(i = 0; i < str.length; i++) {
printf(str[i]);
}
return 0;
}
true positive sample false negative samples:
int main() {
char str[][] = { SELECT Name, Surname
FROM Students, Professors };
int i;
printf("Students and Professors:");
for(i = 0; i < str.length; i++) {
printf(str[i]);
}
return 0;
}
15. SLATE 2015, Madrid, Spain June 18-19, 2015
15/67
Some applications of GI to SLE
• Inferred Grammar:
1. translation unit ::= external decl
2. translation unit ::= translation unit external decl
3. external decl ::= function denition
4. external decl ::= decl
6. function denition ::= declarator decl list compound stat
9. decl ::= decl specs init declarator list ;
10. decl ::= decl specs ;
11. decl list ::= decl
12. decl list ::= decl list decl
15. decl specs ::= type spec decl specs
27. type spec ::= int | long | ...
45. init declarator list ::= init declarator
46. init declarator list ::= init declarator list , init declarator
47. init declarator ::= declarator
64. enumerator ::= id
65. enumerator ::= id = const exp
67. declarator ::= direct declarator NT1
68. direct declarator ::= id
69. direct declarator ::= ( declarator )
70. direct declarator ::= direct declarator [ const exp ]
71. direct declarator ::= direct declarator [ ]
72. direct declarator ::= direct declarator ( param type list )
73. direct declarator ::= direct declarator ( id list )
74. direct declarator ::= direct declarator ( )
88. id list ::= id
89. id list ::= id list , id
90. initializer ::= assignment exp
91. initializer ::= initializer list
93. initializer list ::= initializer
94. initializer list ::= initializer list , initializer
110. stat ::= labeled stat | exp stat | compound stat | selection stat
114. stat ::= iteration stat | jump stat
116. labeled stat ::= id : stat
117. labeled stat ::= case const exp : stat
118. labeled stat ::= default : stat
119. exp stat ::= exp ;
120. exp stat ::= ;
121. compound stat ::= decl list stat list
125. stat list ::= stat
126. stat list ::= stat list stat
127. selection stat ::= if ( exp ) stat
129. selection stat ::= switch ( exp ) stat
130. iteration stat ::= while ( exp ) stat
131. iteration stat ::= do stat while ( exp ) ;
132. iteration stat ::= for ( exp ; exp ; exp ) stat
140. jump stat ::= goto id ; | continue ; | break ; | return exp ;
145. exp ::= assignment exp
146. exp ::= exp , assignment exp
147. assignment exp ::= conditional exp
148. assignment exp ::= conditional exp assignment operator
assignment exp
205. const ::= int const | char const | oat const
208. NT1 ::= = SELECT id NT2 FROM id NT2 | ϵ
210. NT2 ::= , id NT2 | ϵ
16. SLATE 2015, Madrid, Spain June 18-19, 2015
16/67
Some applications of GI to SLE
• Inference of DSLs
– A DSL grammar is usually small enough that the inference
process is feasible.
– Inference of grammar-based DSLs.
– Inference of metamodel-base DSLs.
HRNČIČ, Dejan, MERNIK, Marjan, BRYANT, Barrett Richard,
JAVED, Faizan. A memetic grammar inference algorithm for
language learning. Applied Soft Computing, 2012, vol. 12, iss. 3,
pp. 1006-1020.
HRNČIČ, Dejan, MERNIK, Marjan, BRYANT, Barrett Richard.
Improving grammar inference by a memetic algorithm. IEEE
Transactions on Systems, Man, and Cybernetics - Part C, 2012,
vol. 42, no. 5, pp. 692-703.
17. SLATE 2015, Madrid, Spain June 18-19, 2015
17/67
Some applications of GI to SLE
• 12 input samples of DESK language on which the
algorithm was tested:
1. print a
2. print 3
3. print b + 14
4. print a + b + c
5. print a where b = 14
6. print 10 where d = 15
7. print 9 + b where b = 16
8. print 1 + 2 where id = 1
9. print a where b = 5, c = 4
10. print 21 where a = 6, b = 5
11. print 5 + 6 where a = 3, c = 14
12. print a + b + c where a = 4, b = 3, c = 2
18. SLATE 2015, Madrid, Spain June 18-19, 2015
18/67
Some applications of GI to SLE
Original grammar:
1. DESK ::= print E C
2. E ::= E + F
3. E ::= F
4. F ::= id
5. F ::= num
6. C ::= where Ds
7. C ::= ε
8. Ds ::= D
9. Ds ::= Ds , D
10. D ::= id = num
Inferred grammar:
1: NT1 -> print NT3 NT5
2: NT2 -> + NT3
3: NT2 -> ε
4: NT3 -> num NT2
5: NT3 -> id NT2
6: NT4 -> , id = num NT4
7: NT4 -> ε
8: NT5 -> where id = num NT4
9: NT5 -> ε
19. SLATE 2015, Madrid, Spain June 18-19, 2015
19/67
Some applications of GI to SLE
DSL for hypertree description
20. SLATE 2015, Madrid, Spain June 18-19, 2015
20/67
Some applications of GI to SLE
Inferred grammar for hypertree description DSL
21. SLATE 2015, Madrid, Spain June 18-19, 2015
21/67
Some applications of GI to SLE
• As a model conforms to a metamodel in a
similar manner to how a program conforms to a
grammar, the metamodel inference can be
defined as follows.
• The set of all models that conform to a given
metamodel MM will be called the language of
the metamodel and denoted L(MM). Given a
model instance m and a metamodel MM we can
tell whether m conforms to MM (m ∈ L(MM)).
22. SLATE 2015, Madrid, Spain June 18-19, 2015
22/67
Some applications of GI to SLE
• A set of positive samples is denoted with S+.
Conversely, a negative sample belongs to
L(MM), which denotes a set of all models that
do not conform to metamodel MM. A set of
negative samples is denoted with S-.
• A set of positive samples S+ of a metamodel
MM is structurally complete if each
metamodel element appears in at least one
model in S+.
23. SLATE 2015, Madrid, Spain June 18-19, 2015
23/67
Some applications of GI to SLE
• Given a set of positive samples S+ and set of
negative samples S-, which might be also
empty, the task of metamodel inference is to
find at least one metamodel MM such that
S+⊆L(MM) and S-⊆L(MM).
24. SLATE 2015, Madrid, Spain June 18-19, 2015
24/67
Some applications of GI to SLE
• JAVED, Faizan, MERNIK, Marjan, GRAY, Jeffrey G., BRYANT,
Barrett Richard. MARS: A Metamodel Recovery System Using
Grammar Inference. Information and Software Technology,
2008, vol. 50, iss. 9-10, pp. 948-968.
• Qichao Liu, Jeff Gray, Marjan Mernik, Barrett R. Bryant.
Application of Metamodel Inference with Large-Scale
Metamodels. Int. J. Software and Informatics 6(2): 201-231
(2012)
25. SLATE 2015, Madrid, Spain June 18-19, 2015
25/67
Some applications of GI to SLE
• ESML (Embedded System Modeling Language)
26. SLATE 2015, Madrid, Spain June 18-19, 2015
26/67
Some applications of GI to SLE
• Original ESML metamodel - Configuration viewpoint
27. SLATE 2015, Madrid, Spain June 18-19, 2015
27/67
Some applications of GI to SLE
• Inferred ESML metamodel - Configuration viewpoint
28. SLATE 2015, Madrid, Spain June 18-19, 2015
28/67
Some applications of GI to SLE
• Inference of Visual Languages (based on graph
grammars)
FÜRST, Luka, MERNIK, Marjan, MAHNIČ, Viljan. Graph grammar
induction as a parser-controlled heuristic search process.
AGTIVE’12, pp. 121-136.
FÜRST, Luka, MERNIK, Marjan, MAHNIČ, Viljan. Converting
metamodels to graph grammars: doing without advanced graph
grammar features. Software and System Modeling (SoSym),
2013, Article in Press.
29. SLATE 2015, Madrid, Spain June 18-19, 2015
29/67
Graph grammar inference
Positive and negative samples for hydrocarbons
with single and double bonds
33. SLATE 2015, Madrid, Spain June 18-19, 2015
33/67
Some applications of GI to SLE
• Inference of execution traces
– GI is well suited to analyses involving program
execution because the events in execution traces
are temporally ordered in the same way that
tokens are spatially ordered.
– GI can be used in the dynamic analysis of
software.
N. Walkinshaw, K. Bogdanov, M. Holcombe, S. Salahuddin.
Improving dynamic software analysis by applying grammar
inference principles. J. Softw. Maint. Evol. 20 (2008) 269-
290.
34. SLATE 2015, Madrid, Spain June 18-19, 2015
34/67
Some applications of GI to SLE
• Javier Canovas, Jordi Cabot, Jesus Lopez-Fernandez, Jesus
Sanchez, Cuadrado, Esther Guerra, Juan De Lara. Engaging
End-Users in the Collaborative Development of Domain-
Specific Modelling Languages, CDVE 2013: 101-110.
35. SLATE 2015, Madrid, Spain June 18-19, 2015
35/67
Some applications of GI to SLE
• Javier Canovas, Jordi Cabot, Jesus Lopez-Fernandez, Jesus
Sanchez, Cuadrado, Esther Guerra, Juan De Lara. Engaging
End-Users in the Collaborative Development of Domain-
Specific Modelling Languages. CDVE 2013: 101-110.
36. SLATE 2015, Madrid, Spain June 18-19, 2015
36/67
Some applications of GI to SLE
• Model evolution using metamodel inference
37. SLATE 2015, Madrid, Spain June 18-19, 2015
37/67
Some applications of GI to SE
• Applications to Software Engineering (SE)
• Grammar-based systems
• Marjan Mernik, Matej Crepinsek, Tomaz Kosar,
Damijan Rebernak, Viljem Zumer. Grammar-Based
Systems: Definition and Examples. Informatica
28(3): 245-255 (2004)
38. SLATE 2015, Madrid, Spain June 18-19, 2015
38/67
Some applications of GI to SE
• A grammar-based system is any system that uses a
grammar and/or sentences produced by this
grammar to solve various problems outside the
domain of programming language definition and its
implementation. The vital component of such a
system is well structured and expressed with a
grammar or with sentences produced by this
grammar in an explicit or implicit manners.
39. SLATE 2015, Madrid, Spain June 18-19, 2015
39/67
Some applications of GI to SE
• Applications to Software Engineering (SE)
• Wil van der Aalst. Process Mining: Making Sense of Processes
Hidden in Big Event Data. Invited talk at FedCSIS 2013
• Sentence = Trace in event log
• Grammar = Process model
40. SLATE 2015, Madrid, Spain June 18-19, 2015
40/67
Inside GI
• Several theoretical learning models.
– Identification in the limit (allows the inference
algorithm to converge on the target grammar
given a sufficiently large quantity of sentences).
– Teacher and queries (introduces a teacher who
knows the target language and can answer
particular types of queries from the learner)
– Probably Approximately Correct (PAC) learning
(doesn’t guarantee exact identification).
41. SLATE 2015, Madrid, Spain June 18-19, 2015
41/67
Inside GI
• Gold Theorem (1967) - it is impossible to
identify any of the four classes of languages
in the Chomsky hierarchy in the limit using
only positive samples.
• Using both negative and positive samples,
the Chomsky hierarchy languages can be
identified in the limit.
42. SLATE 2015, Madrid, Spain June 18-19, 2015
42/67
Inside GI
• Intuitively, Gold's theorem can be explained
by recognizing the fact that the final
generalization of positive samples would be
an automation that accept all strings.
• Singular use of positive samples results in an
uncertainty as to when the generalization
steps should be stopped. This implies the
need for some restrictions or background
knowledge on the generalization process.
43. SLATE 2015, Madrid, Spain June 18-19, 2015
43/67
Inside GI
• A lot of research has been done on
extraction of context-free grammars, but
the problem is still not solved sufficiently
mainly due to immense search space.
49. SLATE 2015, Madrid, Spain June 18-19, 2015
49/67
Inside GI
• Memetic algorithms are evolutionary
algorithms with local search operator
– use of evolutionary concepts (population,
evolutionary operators)
– improves the search for solutions with local
search.
50. SLATE 2015, Madrid, Spain June 18-19, 2015
50/67
Inside GI
example n
example 1
regular
definitions
...
initiali-
zation
local
search
mutation
generali-
zation
selection
evolutionary
cycle
found
grammars
parse positive
examples
MAGIc
evaluate
(LISA
parser)
- simple
- Sequitur
diff
• Memetic Algorithm for Grammatical Inference
51. SLATE 2015, Madrid, Spain June 18-19, 2015
51/67
Inside GI
• Sequitur: http://sequitur.info/
• abcabdabcabd
0 → 1 1
1 → 2 c 2 d
2 → a b
• p i w i=n, i=n // print id where id=n, id=n
0 → p 1 w 2, 2
1 → i
2 → 1 = n
52. SLATE 2015, Madrid, Spain June 18-19, 2015
52/67
Inside GI
print a where c=2
print 5+b where b = 10print id where id=num
print num+id where id=num
53. SLATE 2015, Madrid, Spain June 18-19, 2015
53/67
Inside GI
Apply diff command!
1a2,3
> num
> +
print id where id=num
print num+id where id=num
What is the difference
among two samples?
print id where id=num
print num+id where id=num
But where to change the
grammar?
54. SLATE 2015, Madrid, Spain June 18-19, 2015
54/67
Inside GI
Start with the grammar
that parses first sample:
print a where c=2
N1 ::= print N2 where id = num
N2 ::= id
Use information from
LR(1) parsing on 2nd
sample.
Configurations returned from the
LR(1) parser:
Nx → α1 • α2
Ny → β •
Nz → • γ
56. SLATE 2015, Madrid, Spain June 18-19, 2015
56/67
Inside GI
• Nx → α1 • α2
– if
Nx ::= α1 N1 α2
N1 ::= ai+1 ... am
N1 ::= ε
– if
Nx ::= α1 N1
N1 ::= α2
N1 ::= ai+1 ... am
– if
change in this configuration can’t be made
)FIRST(αs 21k ∈+
FOLLOW(Nx)s)FIRST(αs 1k21k ∈∧∉ ++
FOLLOW(Nx)s)FIRST(αs 1k21k ∉∧∉ ++
57. SLATE 2015, Madrid, Spain June 18-19, 2015
57/67
Inside GI
print a where c=2
print 5+b where b = 10
N1 → print • N2 where id = num
N1 ::= print N2 where id = num
N2 ::= id
N1 ::= print N3 N2 where id = num
N2 ::= id
N3 ::= num +
N3 ::= ε
58. SLATE 2015, Madrid, Spain June 18-19, 2015
58/67
Inside GI
Production: Nx ::= α1 Ny α2
Option
Nx ::= α1 Nz α2
Nz ::= Ny
Nz ::= ε
But, how mutation is
done?
59. SLATE 2015, Madrid, Spain June 18-19, 2015
59/67
Inside GI
Nx ::= α Ny Nx ::= Ny Ny
Ny ::= α Ny ::= α
Ny ::= β Ny ::= βWhat about
generalization step?
Nx ::= Ny Ny Nx ::= Ny
Ny ::= α Ny ::= α Ny
Ny ::= β Ny ::= β Ny
Ny ::= ε
60. SLATE 2015, Madrid, Spain June 18-19, 2015
60/67
Semantic inference
a) Grammar G:
E ::= T EE
EE ::= + T EE
EE ::= ε
T ::= int
b) Attributes and their types:
int synthesized vals
int inherited inVal
c) Set of terminals T = {‘nonterminal.attribute’,
‘lexValue’}
d) Set of functions F = {sum}
Set of positive programs with
associated meanings:
(5, 5)
(2+5, 7)
(10+5+8, 23)
61. SLATE 2015, Madrid, Spain June 18-19, 2015
61/67
Semantic inference
// infered attribute grammars
E ::= T EE
{ E[0].vals = EE[0].vals;
EE[0].inVal = T[0].vals;}
EE ::= + T EE
{ EE[0].vals = EE[1].vals;
EE[1].inVal = sum(T[0].vals, EE[0].inVal);}
EE ::= ε
{EE[0].vals = EE[0].inVal;}
T ::= int
{T[0].vals = Integer.parseInt(#int.value());}
63. SLATE 2015, Madrid, Spain June 18-19, 2015
63/67
Semantic inference
Actually, the prototypical implementation found two additional
attribute grammars, which differ in the semantics for the 2nd
production EE ::= + T EE, but both are valid. The following two
semantic rules for the 2nd production were found:
a) EE ::= + T EE
{ EE[0].vals = sum(EE[1].vals, T[0].vals);
EE[1].inVal = EE[0].inVal;}
b) EE ::= + T EE
{ EE[0].vals = sum(EE[1].vals, EE[0].inVal);
EE[1].inVal = T[0].vals;}
66. SLATE 2015, Madrid, Spain June 18-19, 2015
66/67
Conclusion
Hope that I convinced you
that grammatical inference is
interesting and useful.
Yes, I will used in my
current project on
business process
mining.
67. SLATE 2015, Madrid, Spain June 18-19, 2015
67/67
Conclusion
Collaborators:
M. Črepinšek, D. Hrnčič1,
B. Bryant2, A. Sprague2, J. Gray2, J. Faizan2, Q. Liu2
L. Fürst3, V. Mahnič3
1University of Maribor, Slovenia
2The University of Alabama at Birmingham, USA
3University of Ljubljana, Slovenia
Sent comments/questions to:
marjan.mernik@um.si