Compilers allow us to analyze and manipulate textual data. They have numerous uses. Hoa\Compiler offers several compilers, to be chosen based on needs.
Introduction
What
is conceived well is expressed clearly, and the right words come
easily.
A language is a way to express or to formulate a solution to a problem. And there exist a lot of problems. We read and write in several languages daily, and some of these languages are understood by machines. This operation is possible thanks to compilers.
Formal language theory covers, among other things, the automatic analysis of those languages through tools like automata or grammars. A detailed explanation would be necessary to understand all these concepts well. However, we are going to simplify as much as needed to allow a correct usage of Hoa\Compiler.
Language and grammar
A language is a set of words. Each word is a sequence of symbols belonging to an alphabet. A symbol represents the smallest lexical unit of a language; it is atomic and we will call it a lexeme (also often referred to as a token). These sequences of lexemes representing words are built with rules. From a word and a root rule, we will try to derive its sub-rules. If a derivation exists, then the word is considered valid, else it is considered invalid. We also speak about matching words. For example, consider the following rules:
exp ::= exp + exp
| number
number ::= digit number
| digit
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The word that we are going to match is 7 + 35. The root rule is exp. If we derive (from left to right and from top to bottom), we can have exp + exp or number (the disjunction, i.e. the “or”, is represented by the “|” symbol):
exp + exp | number
→ exp + exp
→ ( exp + exp | number ) + exp
→ number + exp
→ ( digit number | digit ) + exp
We continue to derive until we have eliminated every rule and only lexemes remain:
…
→ ( digit number | digit ) + exp
→ digit + exp
→ ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) + exp
→ 7 + exp
→ 7 + ( exp + exp | number )
→ 7 + number
→ 7 + ( digit number | digit )
→ 7 + digit number
→ 7 + ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) number
→ 7 + 3 number
→ 7 + 3 ( digit number | digit )
→ 7 + 3 digit
→ 7 + 3 ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
→ 7 + 3 5
A derivation exists to match the word 7 + 35, thus it is a valid word for those rules.

A set of rules is called a grammar. And so, a grammar represents a language!
However, there are different kinds of grammars. In 1956, the Chomsky hierarchy was formulated, classifying grammars into four levels:
- unrestricted grammars, matching languages known as Turing languages, whose rules have no restriction,
- context-sensitive grammars, matching contextual languages,
- context-free grammars, matching algebraic languages, based on pushdown automata,
- regular grammars, matching regular languages.
Each level includes the next level. Hoa\Compiler only treats languages defined by grammars of levels 3 and 4. To give a quick idea, regular grammars can be related to regular expressions (like PCRE), well-known by developers. But regular grammars are not able to match a pair of symbols (like parentheses, braces or quotes), while algebraic languages allow that (thanks to the concept of a stack of lexemes).
Matching words
In general, the compilation process begins with two analyses: lexical and syntactic. A lexical analysis splits a word into a sequence of lexemes. This sequence will thereafter be used by the syntactic analyzer in order to verify that the word belongs to the language. According to the grammar, the matching will not be done in the same way, but the principle stays the same: use the lexemes one after the other in the sequence and verify that they allow going forward in the derivation of the grammar's rules.
Syntactic analyzers are also classified into categories: LL, LR, LALR etc. Hoa\Compiler only offers LL syntactic analyzers, where LL stands for Left-to-right Leftmost derivation, i.e. from the topmost rule to the deepest, with rules derived from left to right. Here again, there are sub-categories, two of which are supported by Hoa\Compiler: LL(1) and LL(*). Globally, we speak about LL(k) syntactic analyzers: if a lexeme does not allow deriving a rule as expected, then the analyzer is allowed to go k steps backward; we also speak about backtracking. Put another way, rules can be ambiguous: each time we derive a rule of the grammar, we may have several possible choices and the analyzer can be wrong; this is why it sometimes needs to backtrack. The k variable defines the ambiguity level. If a grammar can be analyzed by an LL(1) syntactic analyzer, it is known as unambiguous: each time a lexeme is used to derive our rules, there is only one choice. And if we have an LL(*) syntactic analyzer, it means that the k variable is undefined. The following example illustrates an unambiguous grammar:
rule ::= a b c | d e f
And this example illustrates an ambiguous grammar:
rule1 ::= a rule2
rule2 ::= b rule3 | b rule4
rule3 ::= c d
rule4 ::= e f
Let's see what happens when trying to find a derivation for the word abef from the root rule rule1:
rule1
→ a rule2                  a bef   ✔
→ a (b rule3 | b rule4)    a bef
→ a b rule3                ab ef   ✔
→ a b c d                  abe f   ✖
← a b rule3                ab ef   ✖
← a (b rule3 | b rule4)    a bef
→ a b rule4                ab ef   ✔
→ a b e f                  abef    ✔
The rule2 rule is ambiguous, which can lead to a bad derivation and consequently a backtrack.
LL(k) compiler-compiler
Writing compilers is a laborious task. It is not always difficult but, most of the time, it is repetitive and it takes time. This is why there exist compiler-compilers, or in other words, compiler generators. Most of the time, these compiler-compilers use an intermediate language to write a grammar. The Hoa\Compiler library provides the Hoa\Compiler\Llk\Llk class that allows the writing of compiler-compilers through a dedicated language.
PP language
The PP language, standing for PHP Parser, allows expressing algebraic grammars. It is written in files having the .pp extension (please see the hoa://Library/Compiler/.Mime file).

A grammar is composed of lexemes and rules. The declaration of a lexeme has the following syntax: %token namespace_in:name value -> namespace_out, where name represents the name of the lexeme, value represents its value, expressed in the PCRE format (take care not to match an empty value, in which case an exception will be thrown), and namespace_in and namespace_out represent the names of namespaces and are optional (the default name is default). For example, number, describing a number composed of digits from 0 to 9:
%token number \d+
Namespaces represent disjoint subsets of lexemes, used to ease analyses. A %skip declaration is similar to %token except that it represents a lexeme to skip, i.e. not to consider. A common example of %skip lexemes are spaces:
%skip space \s
To explain rules, we will use the Json.pp grammar as an example, which is a slightly simplified version of the JSON language (please see RFC4627). The complete grammar is located in the hoa://Library/Json/Grammar.pp file. Thus:
%skip space \s
// Scalars.
%token true true
%token false false
%token null null
// Strings.
%token quote_ " -> string
%token string:string [^"]+
%token string:_quote " -> default
// Objects.
%token brace_ {
%token _brace }
// Arrays.
%token bracket_ \[
%token _bracket \]
// Rest.
%token colon :
%token comma ,
%token number \d+
value:
    <true> | <false> | <null> | string() | object() | array() | number()

string:
    ::quote_:: <string> ::_quote::

number:
    <number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::
Notice that we have two namespaces of lexemes: default and string (this allows avoiding confusion between the quote_ and string:_quote lexemes, which have the same representation). We also notice the value rule, which is a disjunction of several lexemes and rules. The constructions of the PP language are the following:
- rule() to call a rule,
- <token> and ::token:: to declare a lexeme (namespaces do not appear here),
- | for a disjunction (a choice),
- (…) for a group,
- e? to say that e is optional,
- e+ to say that e can be present 1 or more times,
- e* to say that e can be present 0 or more times,
- e{x,y} to say that e can be present between x and y times,
- #node to create a node named node in the resulting tree,
- token[i] to unify lexemes among each other.
Few constructions but largely enough.
Finally, the grammar of the PP language is written… with the PP language! You can find it in the hoa://Library/Compiler/Llk/Llk.pp file.
Compilation process
The compilation process used by Hoa\Compiler\Llk\Llk is classical: it starts by lexically analyzing the textual data, the word, i.e. transforming our data into a sequence of lexemes. The declaration order of the lexemes is essential because the lexical analyzer will try them one after the other. Next, the syntactic analyzer comes into play in order to match our data.

If the syntactic analysis is successful, we obtain a trace. This trace can be transformed into an AST, which stands for Abstract Syntax Tree. This tree represents our textual data after the analysis. One advantage is that it can be visited (we will detail this part later), which allows us to add new constraints which cannot be expressed in the grammar, such as type verification.
Let's manipulate Hoa\Compiler\Llk\Llk a little bit. This class is a helper to easily read a grammar expressed in the PP format. Its load static method takes only one argument, an input stream that points to the grammar, and returns a Hoa\Compiler\Llk\Parser compiler ready to use. On this compiler, we will call the Hoa\Compiler\Llk\Parser::parse method to analyze JSON data. If the data is correct, we will get an AST, otherwise an exception will be thrown. Finally, we will use the Hoa\Compiler\Visitor\Dump visitor to print our AST:
// 1. Load grammar.
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 2. Parse data.
$ast = $compiler->parse('{"foo": true, "bar": [null, 42]}');

// 3. Dump the AST.
$dump = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);
/**
* Will output:
* > #object
* > > #pair
* > > > token(string:string, foo)
* > > > token(true, true)
* > > #pair
* > > > token(string:string, bar)
* > > > #array
* > > > > token(null, null)
* > > > > token(number, 42)
*/
When we write and test a grammar, we will repeat these three tasks very often. This is why the hoa script provides the compiler:pp command. This command proposes to analyze data based on a grammar and, if needed, to apply a visitor on the resulting AST. Notice that the data can be read through a pipe:
> #array
> > token(number, 1)
> > #array
> > > token(number, 1)
> > > #array
> > > > token(number, 2)
> > > > token(number, 3)
> > > token(number, 5)
> > token(number, 8)
This is a useful way to quickly test data against a grammar. Do not hesitate to read the help of the compiler:pp command to get more details!
The analyses start with the root rule, but we can use another rule thanks to the second argument of the Hoa\Compiler\Llk\Parser::parse method:
$compiler->parse('{"foo": true, "bar": [null, 42]}', 'object');
To use the root rule, we have to use null.
And finally, to not generate the AST but only know whether the data is valid or not, we can set the last argument of this method to false:
$valid = $compiler->parse('{"foo": true, "bar": [null, 42]}', null, false);
var_dump($valid);
/**
* Will output:
* bool(true)
*/
Errors
The compilation errors are represented by exceptions, thus:
$ echo '{"foo" true}' | hoa compiler:pp Json.pp 0 --visitor dump
Uncaught exception (Hoa\Compiler\Exception\UnexpectedToken):
Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "true" (true) at line 1
and column 8:
{"foo" true}
↑
in hoa://Library/Compiler/Llk/Parser.php at line 1
Several exceptions can be thrown according to the context:
- during the lexical analysis, Hoa\Compiler\Exception\UnrecognizedToken when a lexeme is not recognized, i.e. when the textual data cannot be split into a sequence of lexemes, and Hoa\Compiler\Exception\Lexer when other more general errors happen, for example if a lexeme matches an empty value,
- during the syntactic analysis, Hoa\Compiler\Exception\UnexpectedToken when a lexeme is not expected, i.e. it does not allow deriving the rules of the grammar.
The parent exception is Hoa\Compiler\Exception\Exception.
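As a sketch of how these exceptions can be handled programmatically, reusing the Json.pp grammar from above, we can wrap the call to parse in a try/catch block and catch either a precise exception or the parent one:

$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

try {
    // Invalid data: the colon between the key and the value is missing.
    $compiler->parse('{"foo" true}');
} catch (Hoa\Compiler\Exception\UnexpectedToken $e) {
    // Syntactic error.
    echo 'Syntax error: ', $e->getMessage(), "\n";
} catch (Hoa\Compiler\Exception\Exception $e) {
    // Any other compilation error, lexical errors included.
    echo 'Compilation error: ', $e->getMessage(), "\n";
}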
Nodes
The compilation process very often ends with the production of an AST. It is important to control its form, its size, the data it holds etc. This is why it is necessary to understand the #node notation, because it allows creating nodes in the AST. First, lexemes declared with the <token> syntax will appear in the tree, while the other lexemes, declared with the ::token:: syntax, will not appear. Indeed, in our last example, the quote_, brace_, colon, comma etc. lexemes do not appear. Next, it is important to notice that the declaration of nodes can be overwritten within the same rule. Finally, a rule name can be preceded by #, just like the #array rule, which allows defining a default node; it can nevertheless be overwritten. For example, if we replace the array rule by:
#array:
    ::bracket_:: value() ( ::comma:: value() #bigarray )* ::_bracket::
If the array contains only one value, the node will be named #array, else it will be named #bigarray. Let's see:
$ echo '[42]' | hoa compiler:pp Json.pp 0 --visitor dump
> #array
> > token(number, 42)
$ echo '[4, 2]' | hoa compiler:pp Json.pp 0 --visitor dump
> #bigarray
> > token(number, 4)
> > token(number, 2)
Of course, a node can be created or not based on the derivation of a rule. The mechanism is normally quite intuitive.
Namespaces
Let's detail a little bit the behavior of the lexical analyzer regarding namespaces.
Lexemes are put in namespaces, which means that only the lexemes of the current namespace are used by the lexical analyzer. The default namespace is default. To declare the namespace of a lexeme, we use the : operator. When a lexeme is consumed, it can change the current namespace for the rest of the lexical analysis, thanks to the -> operator. Thus, we will declare five lexemes: foo and bar in the default namespace, baz in ns1, and qux and foo in ns2. The fact that foo is declared twice is not a problem because the declarations live in different namespaces:
%token foo fo+ -> ns1
%token bar ba?r+ -> ns2
%token ns1:baz ba?z+ -> default
%token ns2:qux qux+
%token ns2:foo FOO -> ns1
Writing default:bar is strictly equivalent to bar. The foo lexeme leads into ns1, bar into ns2, ns1:baz into default, ns2:qux stays in ns2 and ns2:foo leads into ns1. Let's observe the sequence of lexemes produced by the lexical analyzer on the following data: fooooobzzbarrrquxFOObaz:
$pp = '%token foo fo+ -> ns1' . "\n" .
      '%token bar ba?r+ -> ns2' . "\n" .
      '%token ns1:baz ba?z+ -> default' . "\n" .
      '%token ns2:qux qux+' . "\n" .
      '%token ns2:foo FOO -> ns1';

// 1. Parse PP.
Hoa\Compiler\Llk\Llk::parsePP($pp, $tokens, $rawRules);

// 2. Run the lexical analyzer.
$lexer    = new Hoa\Compiler\Llk\Lexer();
$sequence = $lexer->lexMe('fooooobzzbarrrquxFOObaz', $tokens);

// 3. Pretty-print the result.
$format = '%' . (strlen((string) count($sequence)) + 1) . 's ' .
          '%-13s %-20s %-20s %6s' . "\n";
$header = sprintf($format, '#', 'namespace', 'token name', 'token value', 'offset');
echo $header, str_repeat('-', strlen($header)), "\n";

foreach ($sequence as $i => $token) {
    printf(
        $format,
        $i,
        $token['namespace'],
        $token['token'],
        $token['value'],
        $token['offset']
    );
}
/**
* Will output:
* # namespace token name token value offset
* ---------------------------------------------------------------------
* 0 default foo fooooo 0
* 1 ns1 baz bzz 6
* 2 default bar barrr 9
* 3 ns2 qux qux 14
* 4 ns2 foo FOO 17
* 5 ns1 baz baz 20
* 6 default EOF EOF 23
*/
This table shows the lexemes, their namespaces and their values. The data has been successfully split, and we have jumped from one namespace to the other. If this time we try with the foqux data, we will have an error: fo is matched by the foo lexeme in the default namespace, we then jump to the ns1 namespace, and there, no lexeme in this namespace is able to recognize at least q. An error is thrown.
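Here is a minimal sketch to observe this error, reusing the $tokens array built by Hoa\Compiler\Llk\Llk::parsePP in the previous example:

$lexer = new Hoa\Compiler\Llk\Lexer();

try {
    $lexer->lexMe('foqux', $tokens);
} catch (Hoa\Compiler\Exception\UnrecognizedToken $e) {
    echo $e->getMessage(), "\n";
}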
So far, we have seen how to jump from one namespace to another with the -> operator. No history of the used namespaces is stored. However, in some rare cases, a lexeme may be accessible through several namespaces and we would like it to jump back to the previous namespace. Put another way, we would like a history of the used namespaces and to be able to navigate (backward only) in it. This is the role of the __shift__ * n keyword, in conjunction with the -> operator as usual. __shift__ means: jump back to the previous namespace. __shift__ is equivalent to __shift__ * 1, and __shift__ * n means: jump back n times to the previous namespace.

When the __shift__ keyword appears in the grammar, namespaces are managed like a stack, hence the shift vocabulary. We have to take care to shift namespaces regularly in order to avoid a huge stack.
Let's take the example of the following grammar:
%token foo1 a -> ns1
%token foo2 x -> ns2
%token end e
%token ns1:bar b
%token ns1:baz c -> ns3
%token ns1:tada t -> __shift__
%token ns2:bar y
%token ns2:baz z -> ns3
%token ns2:tada T -> __shift__
%token ns3:qux = -> __shift__
#test:
    ( <foo1> | <foo2> ) <bar> <baz> <qux> <tada> <end>
This grammar recognizes two words: abc=te and xyz=Te. If the first lexeme foo1 is matched, the lexical analyzer will jump to the ns1 namespace. When it matches the ns1:baz lexeme, it will jump into ns3. Then, when it matches ns3:qux, it will jump back to the previous namespace, namely ns1. And so on. Now, if it does not match foo1 but foo2, it will jump to the ns2 namespace, then into ns3 and then into ns2 again. The __shift__ keywords in ns1:tada and ns2:tada are only used to shift namespaces, but no ambiguity exists: these namespaces are accessible from only one namespace, namely default.
Now, let's save this grammar in the NamespaceStack.pp file and use the -s/--token-sequence option of the hoa compiler:pp command:
$ echo -n 'abc=te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
# namespace token name token value offset
-------------------------------------------------------------------------------
0 default foo1 a 0
1 ns1 bar b 1
2 ns1 baz c 2
3 ns3 qux = 3
4 ns1 tada t 4
5 default end e 5
6 default EOF EOF 6
$ echo -n 'xyz=Te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
# namespace token name token value offset
-------------------------------------------------------------------------------
0 default foo2 x 0
1 ns2 bar y 1
2 ns2 baz z 2
3 ns3 qux = 3
4 ns2 tada T 4
5 default end e 5
6 default EOF EOF 6
We see that the lexical analyzer has successfully jumped between all namespaces, as expected. We had two ways to access the ns3 namespace: either from ns1 or from ns2. The analyzer has succeeded in creating a history of namespaces and navigating in it.
Unification
A new feature brought by the PP language compared to other grammar languages is the capability to express a unification of lexemes. Let's imagine the following grammar:
%token quote '|"
%token string \w+
rule:
    ::quote:: <string> ::quote::
The quotes surrounding the string can be of two kinds: single, with the “'” symbol, or double, with the “"” symbol. Thus, the "foo" and 'foo' data are valid, but so are "foo' and 'foo"!
The unification of lexemes allows adding an additional constraint on the value of lexemes at runtime. The syntax is the following: token[i]. The value of i indicates which lexemes will have the same value. And finally, the unification is local to an instance of a rule, meaning that there is no unification between lexemes of different rules and that it is only applied within the current rule. Thus, the example becomes:
rule:
    ::quote[0]:: <string> ::quote[0]::
This invalidates the "foo' and 'foo" data. Unification finds numerous applications, such as the names of opening and closing tags in the XML language.
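To convince ourselves, here is a minimal sketch, assuming the tokens and the unified rule above are saved in a hypothetical Quote.pp file:

$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Quote.pp'));

// Matching quotes: valid, no exception.
$compiler->parse('"foo"', null, false);

// Mismatched quotes: the unification fails and an exception is thrown.
try {
    $compiler->parse('"foo\'', null, false);
} catch (Hoa\Compiler\Exception\Exception $e) {
    echo 'Invalid data: mismatched quotes.', "\n";
}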
Abstract Syntax Tree
An abstract syntax tree represents textual data in a structural form. Each node of this tree is represented by the Hoa\Compiler\Llk\TreeNode class. Among the most useful methods, we find:
- getId to get the identifier of the node,
- getValueToken to get the name of the lexeme,
- getValueValue to get the value of the lexeme,
- isToken to know whether the node represents a lexeme,
- getChild($i) to get the $i-th child of the node,
- getChildren to get all the children,
- getChildrenNumber to count the number of children,
- getParent to get the parent node,
- getData to get a reference to a data array,
- accept to apply a visitor.
Visitors are the simplest way to traverse an AST. As an example, let's write the PrettyPrinter visitor, which will rewrite JSON data with our own conventions (spacing, line breaking etc.). A visitor must implement the Hoa\Visitor\Visit interface and will have only one method to implement: visit, with three arguments: the element and two optional arguments (by reference and by copy). Let's see:
class PrettyPrinter implements Hoa\Visitor\Visit
{
    public function visit (
        Hoa\Visitor\Element $element,
        &$handle = null,
        $eldnah  = null
    ) {
        static $i      = 0;
        static $indent = '    ';

        // One behaviour per node in the AST.
        switch ($element->getId()) {
            // Object: { … }.
            case '#object':
                echo '{', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), '}';

                break;

            // Array: [ … ].
            case '#array':
                echo '[', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), ']';

                break;

            // Pair: "…": ….
            case '#pair':
                echo
                    $element->getChild(0)->accept($this, $handle, $eldnah),
                    ': ',
                    $element->getChild(1)->accept($this, $handle, $eldnah);

                break;

            // Many tokens.
            case 'token':
                switch ($element->getValueToken()) {
                    // String: "…".
                    case 'string':
                        echo '"', $element->getValueValue(), '"';

                        break;

                    // Booleans.
                    case 'true':
                    case 'false':
                    // Null.
                    case 'null':
                    // Number.
                    case 'number':
                        echo $element->getValueValue();

                        break;
                }

                break;
        }
    }
}
We are going to see an example:
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$prettyPrint = new PrettyPrinter();
echo $prettyPrint->visit($ast);
/**
 * Will output:
 * {
 *     "foo": true,
 *     "bar": [
 *         null,
 *         42
 *     ]
 * }
 */
The getData method is very useful to store data that is likely to be reused, for example from one visitor to another. This method returns a reference to an array; thus:
$data = $element->getData();

if (!isset($data['previousComputing'])) {
    throw new Exception('Need a previous computing.', 0);
}

$previous = $data['previousComputing'];
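The write side is symmetric. Here is a sketch of how a previous visitor might have stored that data; note the =&, which keeps the reference returned by getData ($result is a placeholder for whatever that visitor computed):

// Sketch: an earlier visitor stores its result for the next ones.
$data = &$element->getData();
$data['previousComputing'] = $result;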
It is common to use one visitor per constraint: verification of symbols, verification of types etc. Some of them can set data for the next ones. The getData method does not require any specific data structure; it only provides access to an array. Feel free to do what you want with it.
Using Hoa\Compiler\Llk\TreeNode directly is trivial, and we will use it most of the time through a visitor.
Traces
The LL(k) compiler provided by Hoa is clearly composed of layers. Each layer can be used with maximal modularity and extensibility. When the grammar has been translated into “machine rules” and the lexical and syntactic analyzers have validated the data, a trace is computed. This trace is a flat array containing only three kinds of classes:
- Hoa\Compiler\Llk\Rule\Entry when the syntactic analyzer has opened a rule,
- Hoa\Compiler\Llk\Rule\Ekzit when the syntactic analyzer has closed a rule,
- Hoa\Compiler\Llk\Rule\Token when the syntactic analyzer has encountered a lexeme.
We can get the trace with the Hoa\Compiler\Llk\Parser::getTrace method. To understand this trace, we will start with an example:
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$i        = 0;

foreach ($compiler->getTrace() as $element) {
    if ($element instanceof Hoa\Compiler\Llk\Rule\Entry) {
        echo str_repeat('> ', ++$i), 'enter ', $element->getRule(), "\n";
    } elseif ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
        echo
            str_repeat(' ', $i + 1), 'token ', $element->getTokenName(),
            ', consumed ', $element->getValue(), "\n";
    } else {
        echo str_repeat('< ', $i--), 'ekzit ', $element->getRule(), "\n";
    }
}
/**
* Will output:
* > enter value
* > > enter object
* token brace_, consumed {
* > > > enter pair
* > > > > enter string
* token quote_, consumed "
* token string, consumed foo
* token _quote, consumed "
* < < < < ekzit string
* token colon, consumed :
* > > > > enter value
* token true, consumed true
* < < < < ekzit value
* < < < ekzit pair
* > > > enter 13
* > > > > enter 12
* token comma, consumed ,
* > > > > > enter pair
* > > > > > > enter string
* token quote_, consumed "
* token string, consumed bar
* token _quote, consumed "
* < < < < < < ekzit string
* token colon, consumed :
* > > > > > > enter value
* > > > > > > > enter array
* token bracket_, consumed [
* > > > > > > > > enter value
* token null, consumed null
* < < < < < < < < ekzit value
* > > > > > > > > enter 21
* > > > > > > > > > enter 20
* token comma, consumed ,
* > > > > > > > > > > enter value
* token number, consumed 42
* < < < < < < < < < < ekzit value
* < < < < < < < < < ekzit 20
* < < < < < < < < ekzit 21
* token _bracket, consumed ]
* < < < < < < < ekzit array
* < < < < < < ekzit value
* < < < < < ekzit pair
* < < < < ekzit 12
* < < < ekzit 13
* token _brace, consumed }
* < < ekzit object
* < ekzit value
*/
This example reveals several things. First, the information given by the trace can be useful: if we are on a rule, we have its name (with the getRule method), and if we are on a lexeme, we have its name (with the getTokenName method), its representation (expressed in the PCRE format, with the getRepresentation method), its value (with the getValue method), whether it is a node to keep in the AST (with the isKept method) and its unification index if it exists (with the getUnificationIndex method). Of course, all this information can be edited, which can influence the construction of the AST, which is the (optional) next step.
Next, we notice that sometimes we enter a rule that exists in the grammar, like object, pair, value etc., and sometimes we enter a numerical rule, like 13, 12, 21, 20 etc. When the grammar is interpreted in order to be translated into “machine rules”, each one of these rules is linearized with regard to the PP operators:
- Hoa\Compiler\Llk\Rule\Choice for a disjunction,
- Hoa\Compiler\Llk\Rule\Concatenation for a concatenation,
- Hoa\Compiler\Llk\Rule\Repetition for a repetition,
- Hoa\Compiler\Llk\Rule\Token for a token (as seen above).
All these rules in this format are accessible through the Hoa\Compiler\Llk\Parser::getRules method, represented by an array. We will print all the rules accessible from the root for a better understanding:
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 1. Get all rules.
$rules = $compiler->getRules();

// 2. Start with the root rule.
$stack = [$compiler->getRootRule() => true];

while (false !== current($stack)) {
    $rule = key($stack);
    next($stack);
    echo
        "\n", '"', $rule, '" is a ',
        strtolower(substr(get_class($rules[$rule]), 22));

    $subrules = $rules[$rule]->getContent();

    // 3a. Token.
    if (null === $subrules) {
        continue;
    }

    echo ' of rules: ';

    // 3b. Other rules.
    foreach ((array) $rules[$rule]->getContent() as $subrule) {
        if (!array_key_exists($subrule, $stack)) {
            // 4. Unshift new rules to print.
            $stack[$subrule] = true;
        }

        echo $subrule, ' ';
    }
}
/**
* Will output:
* "value" is a choice of rules: 1 2 3 string object array number
* "1" is a token
* "2" is a token
* "3" is a token
* "string" is a concatenation of rules: 5 6 7
* "object" is a concatenation of rules: 10 pair 13 14
* "array" is a concatenation of rules: 18 value 21 22
* "number" is a token
* "5" is a token
* "6" is a token
* "7" is a token
* "10" is a token
* "pair" is a concatenation of rules: string 16 value
* "13" is a repetition of rules: 12
* "14" is a token
* "18" is a token
* "21" is a repetition of rules: 20
* "22" is a token
* "16" is a token
* "12" is a concatenation of rules: 11 pair
* "20" is a concatenation of rules: 19 value
* "11" is a token
* "19" is a token
*/
If we read the object rule, we know that it is a concatenation of the 10, pair, 13 and 14 rules. 10 is a lexeme, pair is a concatenation of the string, 16 and value rules, and so on. The initial grammar is translated in order to be in its smallest form. This allows reasoning about the rules in an easier and faster way. Indeed, computations on the grammar are not restricted to the AST: in the previous step, with the trace, we are already able to apply computations.
Generation
A grammar can be used to satisfy two goals: to validate data (if we see the grammar as an automaton) or to generate data. So far, we have seen how to validate data through several steps: initialization of our compiler, running of the lexical and syntactic analyzers, production of a trace, transformation of this trace into an AST in order to visit that tree. But we can stop at the first step, the initialization of our compiler, and exploit the rules in order to generate data that will be valid with regard to our grammar.

Hoa\Compiler\Llk\Sampler provides three data generation algorithms:
- random uniform generation,
- bounded exhaustive generation,
- grammar coverage-based generation.
Why provide three algorithms? Because it is illusory to think that one algorithm can satisfy numerous usage contexts. Each one fits different needs; we will explain them later.

Generating data based on a grammar requires two steps:
- generating values for all lexemes in order to get raw data,
- generating a path in the rules of the grammar.
A path is equivalent to a derivation; the vocabulary differs according to our goal: validation or generation.
Isotropic generation of lexemes
In order to generate values for lexemes, no matter what algorithm is used, it has to be fast. We are going to use a process called isotropic. We start with a rule and we move forward only by using random and locally uniform (only for the current choice) choices. For instance, if we have a disjunction between three sub-rules, we will randomly and uniformly choose between 1 and 3. If we have a concatenation, we will just concatenate each part. And finally, a repetition is nothing more than a disjunction of concatenations: indeed, e{1,3} is strictly equivalent to e | ee | eee. Let's see:
([ae]+|[x-z]!){1,3}              repeat [ae]+|[x-z]! 2 times
→ ([ae]+|[x-z]!)([ae]+|[x-z]!)   choose between [ae]+ and [x-z]!
→ ([ae]+)([ae]+|[x-z]!)          repeat [ae] 2 times
→ [ae][ae]([ae]+|[x-z]!)         choose between a and e
→ e[ae]([ae]+|[x-z]!)            again
→ ea([ae]+|[x-z]!)               choose between [ae]+ and [x-z]!
→ ea([x-z]!)                     choose between x, y and z
→ eay!
This generation is naive but this is not important. What is important is
the path generation in rules, or put another way, the generation of
sequences of lexemes.
Uniform random generation
The first algorithm is the uniform random generation. This algorithm is useful if we would like to generate sequences of lexemes of a given size n with uniformity, not local (like in the isotropic generation), but over the set of all the possible sequences. Succinctly, the algorithm works in two steps: a pre-computation (once per size) followed by a generation. The pre-computation is an automatic step that counts the number of all the possible sequences and sub-sequences of size n. This step helps to compute cumulative distribution functions in order to guide the final sequences of lexemes.
We will generate 10 random data of size 7, i.e. composed of 7 lexemes. That's why we will use the Hoa\Compiler\Llk\Sampler\Uniform class, which takes our grammar as its first argument, the lexeme value generator as its second argument (here Hoa\Regex\Visitor\Isotropic) and finally a size:
$sampler = new Hoa\Compiler\Llk\Sampler\Uniform(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

for ($i = 0; $i < 10; ++$i) {
    echo $i, ' => ', $sampler->uniform(), "\n";
}
/**
* Will output:
* 0 => [ false , null , null ]
* 1 => [ " l " , null ]
* 2 => [ [ true ] , true ]
* 3 => [ [ [ 4435 ] ] ]
* 4 => [ [ [ 9366 ] ] ]
* 5 => [ true , false , null ]
* 6 => { " |h<# " : false }
* 7 => [ [ [ false ] ] ]
* 8 => [ false , true , 7 ]
* 9 => [ false , 5 , 79 ]
*/
We can redefine the size with the Hoa\Compiler\Llk\Sampler\Uniform::setLength method. We have noticed that some repetition operators have no upper bound, like + or *; in this case, we set it to n.
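For instance, a brief sketch reusing the $sampler instance from above:

// Generate data composed of 12 lexemes from now on.
$sampler->setLength(12);

echo $sampler->uniform(), "\n";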
Bounded exhaustive generation
The second algorithm is the bounded exhaustive generation. This algorithm generates all the sequences of lexemes (for the exhaustive side) from size 1 to n (for the bounded side). An interesting aspect of this algorithm is that the exhaustiveness is a kind of proof! The algorithm behaves like an iterator and is very simple to use:
$sampler = new Hoa\Compiler\Llk\Sampler\BoundedExhaustive(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}
/**
* Will output:
* 0 => true
* 1 => false
* 2 => null
* 3 => " 8u2 "
* 4 => { " ,M@aj{ " : true }
* 5 => { " x`|V " : false }
* 6 => { " NttB " : null }
* 7 => { " eJWwA " : 0 }
* 8 => [ true ]
* 9 => [ true , true ]
* 10 => [ true , true , true ]
* 11 => [ true , true , false ]
* 12 => [ true , true , null ]
* 13 => [ true , true , 32 ]
* 14 => [ true , false ]
* 15 => [ true , false , true ]
* 16 => [ true , false , false ]
* 17 => [ true , false , null ]
* 18 => [ true , false , 729 ]
* 19 => [ true , null ]
* 20 => [ true , null , true ]
* 21 => [ true , null , false ]
* …
* 157 => [ 780 , 01559 , 32 ]
* 158 => 344
*/
Like the previous algorithm, we can redefine the upper bound with the Hoa\Compiler\Llk\Sampler\BoundedExhaustive::setLength method. And the repetition operators with no upper bound are bounded to n.
Grammar coverage-based generation
The last algorithm is the grammar coverage-based generation. This algorithm reduces the combinatorial explosion encountered with the previous algorithm, but the goal is different: it will generate sequences of lexemes that “activate” all the branches of the grammar rules. A rule is said to be covered if and only if all its sub-rules are covered, and a lexeme is said to be covered if it has been used in a sequence. To ensure diversity in the produced sequences, choice points between uncovered sub-rules are resolved randomly. The algorithm also behaves like an iterator:
$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}
/**
* Will output:
* 0 => true
* 1 => { " )o?bz " : null , " %3W) " : [ false , 130 , " 6 " ] }
* 2 => [ { " ny " : true } ]
* 3 => { " Ne;[3 " : [ true , true ] , " th: " : true , " C[8} " : true }
*/
To avoid the combinatorial explosion and to ensure the termination of the algorithm, we will use the following heuristic on the repetition operators: * will repeat 0, 1 and 2 times, + will repeat 1 and 2 times, and finally {x,y} will repeat x, x + 1, y - 1 and y times. This heuristic finds its origin in boundary testing.
We notice in our example that 4 data are generated and are enough to entirely cover our grammar!
Comparison between algorithms
Here are some clues to know when to use one algorithm rather than another.
- Random uniform generation:
  - fast for small data, high data diversity and fixed size,
  - the pre-computation step is exponential and therefore long, but afterwards the generation is very fast.
- Bounded exhaustive generation:
  - fast for small data, and the exhaustiveness is efficient,
  - exponential number of data.
- Grammar coverage-based generation:
  - fast for average or big data, and diversity of data,
  - does not consider size.
LL(1) compiler-compiler
The documentation of the LL(1) compiler, through the Hoa\Compiler\Ll1 class, is not written yet. Its goal is different from Hoa\Compiler\Llk: there is no grammar description language, automata are hard-coded with arrays and this type of compiler is oriented towards high performance. That's why its behavior is monolithic.
Conclusion
Hoa\Compiler provides two compiler-compilers: Hoa\Compiler\Llk and Hoa\Compiler\Ll1, in order to validate data. The PP language allows writing algebraic grammars in a simple and natural way. The LL(k) compiler-compiler is split into distinct processes, which encourages hackability. Finally, several algorithms allow generating data based on a grammar according to specific usages.