SPARK Documentation for Release 0.6
John Aycock
Computer Science
University of Victoria
aycock@csc.uvic.ca
This document is a random scattering of thoughts, derived from
my own experience and from the valuable feedback I've gotten from:
- Jan Decaluwe
- Paul Dubois
- Mike Fletcher
- Darrell Gallion
- Nick Mathewson
- Gordon McMillan
- Amit J. Patel
- Tim Peters
- Dickon Reed
- Nicky Sandhu
- Richard White
The ``real'' documentation is still the IPC7 paper. Please send
me any suggestions you have about the framework, or things to add
to this document!
This is new in version 0.6.
(The module name has been changed to spark.py; new code should
import spark instead of import generic.)
1 The Token Class
The version of Token that's supplied is as minimal as possible. The
only thing the framework assumes is a __cmp__ method, so as long as
you adhere to this you can expand the class or drop in your own.
GenericParser's default error method attempts to print out
the offending token in a readable form, so you may find it useful to have
a __str__ or __repr__ in your Token class too.
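To make this concrete, a minimal token class satisfying the interface might look like the following sketch. The attr slot and the type-only comparison are illustrative choices, not framework requirements:

```python
class Token:
    def __init__(self, type, attr=None):
        self.type = type    # e.g. 'number'
        self.attr = attr    # e.g. the matched text, '42'

    def __cmp__(self, o):
        # Compare on type alone, so a token compares equal to the
        # corresponding grammar-symbol string.
        return (self.type > o) - (self.type < o)

    def __repr__(self):
        return self.type
```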
2 The AST Class
Like the Token class, the AST class is very minimal. The
framework assumes that each AST node will have a string-valued
type attribute, and that the children of a node are obtainable
using the __getitem__ method. Again, you can roll your own
class with this type of interface too.
I prefer n-ary trees for my ASTs rather than binary trees, so the
example one in examples/paper is quite a bit different than my own.
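Such a minimal n-ary node might be sketched as follows; the _kids list is an illustrative representation, since the framework only assumes the type attribute and indexing:

```python
class AST:
    def __init__(self, type):
        self.type = type        # string-valued, as the framework assumes
        self._kids = []         # n-ary: any number of children

    def __getitem__(self, i):
        return self._kids[i]    # children are reachable by indexing
```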
3 The GenericScanner Class
3.1 The error Method
If at some point the input is not matched by any of the regular expressions
you've supplied, then the error method will get called (in previous
versions an assertion was raised instead). This method gets passed
the input string and the position within the string where the error
occurred.
GenericScanner provides a default implementation which prints out
an error message and raises a SystemExit exception.
I'm deliberating as to whether this should be a straight exception, or
if I should modify the interface to permit some form of error recovery...
3.2 Bugs/Limitations/Caveats
- Speed. I've addressed this in this latest release, thanks to
  some code from Tim Peters. However, GenericScanner is
  ultimately only as good as Python's RE engine - a peril of
  code re-use.
- Regular expressions. GenericScanner interprets regular
  expressions with the re.VERBOSE option enabled. This
  means you have to escape the # character if you want
  to literally match it.
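This caveat is easy to demonstrate with Python's re module directly:

```python
import re

# Under re.VERBOSE, whitespace is ignored and '#' starts a comment,
# so a literal '#' must be escaped (or put in a character class).
bad  = re.compile(r'#\d+',  re.VERBOSE)   # everything after '#' is a comment
good = re.compile(r'\#\d+', re.VERBOSE)   # escaped: really matches '#42'

assert bad.match('#42').group() == ''     # pattern collapsed to empty
assert good.match('#42').group() == '#42'
```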
4 The GenericParser Class
4.1 Input Grammars
GenericParser should work with any Backus-Naur form (BNF) grammar,
including grammars
with empty right-hand sides and ambiguous grammars.
There are apparently a few rather obscure cases where Earley's parsing algorithm
fails, but they're not likely to crop up in any reasonable application.
4.2 The typestring Method
This is new in version 0.6.
GenericParser can often run faster if it knows the type of its
tokens - more specifically, if it knows a token's type as a string. You
can tell the parser this by supplying a typestring method, which is
passed a token, and must either return that token's type as a string, or
None if it's unable to (in the latter case, the parser will simply
fall back on a less-efficient method of operation).
For example, if a token's type is always stored under the name type,
you might have the following code:
class MyParser(GenericParser):
    ...
    def typestring(self, token):
        return token.type
Note that GenericParser may or may not use typestring; your
tokens must still be comparable objects, as before. Supplying typestring
is purely optional.
4.3 Bugs/Limitations/Caveats
- Speed. You're using a very general parsing algorithm
  implemented in Python; it's not going to be blazingly fast,
  sorry.
- Action code. The entire input has been recognized by
  GenericParser before any of your action code is
  executed. This may restrict the types of things you can do on-the-fly
  in your parser actions. It's one argument for building an AST
  that you can traverse any way you want.
- Watch your method names. I've been bitten by this a few times.
  Python won't warn you if you inadvertently redefine a method,
  which I've done when cutting and pasting a bit too freely. It's
  a bugger to track down too.
4.4 Ambiguity Resolution
After the IPC7 paper, I added some rudimentary ambiguity-resolution code.
It was undocumented and far from satisfactory, and it has been supplanted by
the following interface.
Ambiguities are not resolved until the parse tree is being traversed. In
other words, the input is already known to be syntactically correct; it's
just a matter of deciding which parse tree to build when there are multiple
choices.
When an ambiguity is reached, a choice will need to be made between two
or more rules. These rules must reside in different p_ methods.
The method names which correspond to the conflicting rules, minus the
initial ``p_,'' are gathered together in a list. This list is
sorted by the length of the rules' right-hand side - shorter rules appear
earlier in the list - and passed to the resolve method. The
resolve method must choose an element of the list and return its choice.
GenericParser supplies a default implementation of resolve
which always selects the rule with the shortest right-hand side (assuming
the conflicting rules reside in different p_ methods, of course).
This is enough to handle the ``dangling-else'' conflict.
Now some examples. The first one always picks the rule with the shortest
right-hand side (this is the default as supplied):
class MyParser(GenericParser):
    ...
    def resolve(self, list):
        return list[0]
The second example always picks the rule with the longest
right-hand side:
class MyParser(GenericParser):
    ...
    def resolve(self, list):
        return list[-1]
Here's one which exercises fine-grained control:
class MyParser(GenericParser):
    ...
    def p_if_stmt(self, args):
        '''
            stmt ::= IF expr THEN stmt
        '''

    def p_ifelse_stmt(self, args):
        '''
            stmt ::= IF expr THEN stmt ELSE stmt
        '''
    ...
    def resolve(self, list):
        choice = {
            ('if_stmt', 'ifelse_stmt'): 'if_stmt',
        }
        return choice[tuple(list)]
If you have an ambiguity, but aren't sure where, you may want to override
the default resolve with one that simply prints out the list it's
passed. This allows you to find the guilty parties, and to ensure that there
are no duplicates in the list.
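Such a diagnostic resolve might be as simple as the sketch below. It's written as a plain function here so it stands alone; in practice it would be a method of your GenericParser subclass, and the printed format is arbitrary:

```python
def resolve(self, list):
    # Print the conflicting rule names, then fall back on the
    # default behaviour: pick the rule with the shortest
    # right-hand side, which sorts first.
    print('conflict between:', list)
    return list[0]
```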
5 The GenericASTBuilder Class
This class will construct syntax trees for you automatically (or at
least with minimal intervention). Instead of having your parser be a
subclass of GenericParser, you make it a subclass of
GenericASTBuilder. GenericASTBuilder is a subclass of
GenericParser, meaning the parsing interface is unchanged.
The constructor for GenericASTBuilder takes an extra argument,
which is the class that you want AST nodes to be.
It's actually simpler than it sounds. The rest of this section gives
a ``cookbook'' explanation. The class of AST nodes is called ``AST''
in the examples below.
5.1 Heterogeneous Parse Tree
Sometimes these are called ``concrete syntax trees.'' By heterogeneous
I mean that the leaves of the tree are instances of tokens rather than
instances of the AST class. (See the GenericASTTraversal section
for more on this.)
class AutoExprParser(GenericASTBuilder):
    def __init__(self, AST, start='expr'):
        GenericASTBuilder.__init__(self, AST, start)

    def p_expr(self, args):
        '''
            expr ::= expr + term
            expr ::= term
            term ::= term * factor
            term ::= factor
            factor ::= number
            factor ::= float
        '''
That's it. The code that uses it would look like:
parser = AutoExprParser(AST)
parseTree = parser.parse(tokens)
Except for the extra class argument to the constructor, there are no
changes. In AutoExprParser, all the rules are lumped together
since there's no further need for actions.
The AST class must support the __setslice__ and __len__
operations in order to use GenericASTBuilder.
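A node class meeting both requirements might be sketched like this. __setslice__ is the old-style slice-assignment hook GenericASTBuilder relies on to attach children; the _kids list is an illustrative representation:

```python
class AST:
    def __init__(self, type):
        self.type = type
        self._kids = []

    def __getitem__(self, i):
        return self._kids[i]

    def __len__(self):
        return len(self._kids)

    def __setslice__(self, low, high, seq):
        # Replace the slice [low:high] of the children with seq;
        # GenericASTBuilder uses this to install a node's children.
        self._kids[low:high] = seq
```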
5.2 Homogeneous Parse Tree
To make a homogeneous parse tree, the leaves need to be changed from
token instances into AST instances. When GenericASTBuilder
encounters a leaf, it calls GenericASTBuilder.terminal, so you
simply need to supply your own version of it:
class AutoExprParser(GenericASTBuilder):
    ...
    def terminal(self, token):
        return AST(token.type)
In practice, you'll probably want to copy some attribute values from the
token to the AST node too.
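For example, assuming your tokens carry the matched text in an attr attribute (an illustrative convention, not something the framework requires), terminal might copy it across - sketched here as a standalone function with a stand-in AST class:

```python
class AST:
    # Stand-in node class; only 'type' is assumed by the framework.
    def __init__(self, type):
        self.type = type
        self.attr = None

def terminal(self, token):
    # Build an AST leaf, carrying the token's attr along with its type.
    node = AST(token.type)
    node.attr = token.attr
    return node
```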
5.3 Any-geneous Abstract Syntax Tree
To use GenericASTBuilder for building an abstract syntax tree,
there should be a fairly straightforward mapping between the parse tree
and the AST you want. Just like GenericASTBuilder.terminal was
supplied in the last section, you'll now supply a
GenericASTBuilder.nonterminal method. The arguments to this method
are the nonterminal it's trying to build a node for, and the node's children.
For example, if I wanted to flatten the parse tree out a bit, I could
skip allocating new nodes if there's only one child:
class AutoExprParser(GenericASTBuilder):
    ...
    def nonterminal(self, type, args):
        if len(args) == 1:
            return args[0]
        return GenericASTBuilder.nonterminal(self, type, args)
args is just a list, so you could also delete elements from it,
or perform any other transformation you can imagine.
5.4 Bugs/Limitations/Caveats
- Ignorance. Any parser actions you supply in your p_ functions are
  silently ignored by GenericASTBuilder in the current version.
  This may change in the future.
6 The GenericASTTraversal Class
This was called the ASTTraversal class in the IPC7 paper. For
consistency, I've renamed it and placed it in spark.py along
with the other generic classes. For backward compatibility,
ASTTraversal is still present in its old spot as a wrapper; new
code should use GenericASTTraversal.
I got a great suggestion about heterogeneous ASTs: use the already-present
token instances as leaves of the AST. I was all ready to add support
into GenericASTTraversal so that it only traversed a node's children
if the node had a __getitem__ method present. Then it
suddenly occurred to me that GenericASTTraversal already supports
heterogeneous ASTs: if you want to use tokens as leaves, just add a
method to your token class:
class MyToken:
    ...
    def __getitem__(self, i): raise IndexError
    ...
This way the interface to GenericASTTraversal is kept simple.
If you want to prevent further traversal into a subtree during a
preorder traversal, calling the prune method will do the trick. I
found I needed this when generating code from ASTs.
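The idea behind prune is easy to sketch. The following is an illustration of the mechanism, not SPARK's actual implementation: the visiting code raises a private exception, which the traversal catches before descending into the node's children.

```python
class Prune(Exception):
    # Raised by visiting code to stop descent into a subtree.
    pass

def preorder(node, visit):
    # Visit the node first; if visit() pruned, skip its children.
    try:
        visit(node)
    except Prune:
        return
    for kid in node:        # any node supporting __getitem__ will do
        preorder(kid, visit)
```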
6.1 The typestring Method
This is new in version 0.6.
To decouple GenericASTTraversal from the node implementation,
GenericASTTraversal now calls the typestring method to
get the node's type as a string. Like its counterpart in GenericParser,
it takes a node as an argument, and must return a string (it may not return
None, unlike the one in GenericParser).
The default implementation simply returns the node's type field; this
behaviour is backwards-compatible.
7 The GenericASTMatcher Class
GenericASTMatcher is a class that is designed for generating code
from ASTs. You supply a set of patterns (in docstrings, of course) and
actions; GenericASTMatcher will find a way to cover the AST with
your patterns, then invoke the corresponding actions. Actions are called
in left-to-right, bottom-up order.
For example:
class AutoInterpret(GenericASTMatcher):
    def __init__(self, ast):
        GenericASTMatcher.__init__(self, 'V', ast)

    def p_n(self, tree):
        '''
            V ::= number
        '''
        tree.value = int(tree.attr)

    def p_add(self, tree):
        '''
            V ::= expr ( V + V )
        '''
        tree.value = tree[0].value + tree[2].value

    def p_multiply(self, tree):
        '''
            V ::= term ( V * V )
        '''
        tree.value = tree[0].value * tree[2].value

    def p_addmul(self, tree):
        '''
            V ::= expr ( V + term ( V * V ) )
        '''
        tree.value = tree[0].value + tree[2][0].value * tree[2][2].value
As in GenericParser, the methods you supply are prefixed with
p_, which in this context stands for ``pattern.'' The argument
to the p_ methods is the AST node where the pattern is rooted.
Patterns are just trees linearized into prefix
form, using parentheses to denote
subtrees. On the left-hand side is a symbol which the pattern is
effectively ``rewritten'' to if the pattern is matched. For example,
the pattern V ::= term ( V * V ) would correspond to:
term
/|\ => V
/ | \
V * V
You'd use the above class as follows:
interpreter = AutoInterpret(ast)
interpreter.match()
print ast.value
You may optionally supply an AST to the match method. This is so you can
create one instance of your matcher class, then have it match different
ASTs. For instance, you could use a matcher for implementing a
language's arithmetic expressions, and use a GenericASTTraversal
for the rest.
AST nodes must support __getitem__ and __cmp__ methods
to use GenericASTMatcher.
7.1 Bugs/Limitations/Caveats
- Ambiguity. If there's a conflict between two patterns, then
  GenericASTMatcher will choose the longest one. Ideally,
  the entire matching engine will eventually be replaced by a more
  sophisticated one that'll find an ``optimal'' covering.
- Parentheses considered harmful. You may end up with some strange
  things happening if you have a terminal/nonterminal named ``(''
  or ``)'' as they're delimiting the pattern's tree structure.
- Patterns. GenericASTMatcher's engine will accept patterns
  that are more general than those described above. These restrictions
  may be enforced in a later release, however.
8 Inheritance and Generic* Classes
You can override t_ and p_ methods now in the expected fashion;
in other words, defining p_foo in a subclass hides p_foo in
its superclass. (Previously duplicate method names were not removed when
the Generic* classes did their reflection, giving strange results.)
9 Miscellaneous
- Memory leaks! Several of the Generic* classes keep references
  to the various t_ and p_ methods. Unfortunately this
  results in a circular reference, as Darrell Gallion pointed out, which
  Python's current garbage collector won't collect.
  This will only be an issue if you create and destroy Generic*
  classes repeatedly; simply using a single instance of
  GenericScanner repeatedly will not consume extra memory.
  I probably won't fix this, since full GC is likely coming soon
  to a Python near you, and it won't affect many users. If you
  do need to handle this problem, contact me and I can advise you
  how to change the SPARK code.
- How do I keep track of the line number? There's currently no
  automatic way to do this. What I do is keep a line
  number attribute, say lineno, for each token. Then
  in my GenericScanner subclass I'll have a method like:

      def t_nl(self, s):
          r'\n'
          self.lineno = self.lineno + 1

  (I set self.lineno to 1 in my subclass' tokenize method.)
- What about inherited attributes? For inherited attributes - ones
  that propagate from the root to the leaves of a tree - I just set
  variables in my subclass rather than pass them around explicitly.
  You may want to avoid relying on this while parsing unless you
  know exactly how and when your parsing actions are being invoked.
- How do I distribute a project that uses SPARK? As far as I'm
  concerned, you need only include spark.py. I'd appreciate it
  if you mentioned SPARK in your documentation and/or web pages. And
  if you send me the URL of your project's home page, I can add it
  to the SPARK web page.