Parsing the USP search path BNF using pyparsing

# Parsing the USP search path BNF using pyparsing

Let's parse the search path grammar! The USP (User Services Platform) committee has given us a BNF in their pdf that describes the object path model (including search path) using a context-free language.

This is how the BNF of Path Objects looks like:

idmpath ::= objpath | parampath | cmdpath | evntpath | searchpath
objpath ::= name '.' (name (('.' inst)|(reffollow '.' name) )? '.')*
parampath ::= objpath name
cmdpath ::= objpath name '()'
evntpath ::= objpath name '!'
inst ::= posnum | expr | '*'
expr ::= '[' (exprcomp ( '&&' exprcomp )*) ']'
exprcomp ::= relpath oper value
relpath ::= name (reffollow? '.' name )*
reffollow ::= ( '#' (posnum | '*') '+' )| '+'
oper ::= '==' | '!=' | '<' | '>' | '<=' | '>='
value ::= literal | number
name ::= [A-Za-z_] [A-Za-z_0-9]*
literal ::= '"' [^"]* '"'
posnum ::= [1-9] [0-9]*
number ::= '0' | ( '-'? posnum )

I've interpreted a searchpath based on the HTML specification, which unfortunately is not defined above, as follows:

searchpath ::= objpath name?

The pyparsing PEG language looks like this:

# Helpers.
alphas = pyparsing.alphas.upper() + pyparsing.alphas.lower()
dot = Suppress(Char('.'))

# Basic stuff such as names and operators.
name = Word(alphas + '_', alphas + pyparsing.nums + '_')
literal = Char('"') + CharsNotIn('"') + Char('"')
posnum = Word(pyparsing.nums[1:], pyparsing.nums)
number = Char('0') | (Optional(Char('-')) + posnum)
value = literal | number
oper = Literal('==') | Literal('!=') | Literal('<') | Literal('>') | Literal('<=') | Literal('>=')

# The real stuff - references, expressions, paths.
reffollow = (Char('#') + (posnum | Char('*')) + Char('+')) | Char('+')
relpath = name + ZeroOrMore(Optional(reffollow) + dot + name)
exprcomp = relpath + oper + value
expr = Char('[') + exprcomp + ZeroOrMore(Literal('&&') + exprcomp) + Char(']')
inst = posnum | expr | Char('*')
objpath = name + dot + ZeroOrMore(name + Optional((dot + inst) | (reffollow + dot + name)) + dot)
parampath = objpath + name

searchpath = objpath + Optional(name)

Note: PEG languages are not equivalent to context-free languages. In fact they are less powerful, since they are more strict in terms of ambiguity. E.g. the following rule cannot be parsed by a PEG:

S ::= 'x' S 'x' | 'x'

So why not simply do a Regex? BNFs describe context-free languages and often cannot be implemented using a regular language. Specifically, context-free languages should strictly contain context-free rules only (i.e. the left side of each rule in the BNF further above has only one variable) and they can have recursion on both sides. Regular languages do not allow that and are even more strict - they can only contain a rule that ends in a terminal (e.g. a character) or a terminal combined with a variable (given terminal a and variable B: aB or Ba. In above example it could be '*' posnum or posnum '*'). For instance, the rule for exprcomp as it stands is not allowed in regular languages, since it contains three variables. Now, these rules can still possibly be converted into a regular language by enumerating all possible combinations of a variable and a terminal. E.g. oper consists of 6 possible values, so why not add 6 rules a → '==' value, a → '>=' value, and so on. However, this is a lot of work. Let's just stay context-free and use pyparsing!

There are lemmas to prove that a language is not regular such as the Pumping Lemma. However, I haven't been able to find a contradiction (i.e. any word in the above language can be pumped). This does not mean it is regular, so it is not a proof, but I have a gut feeling the above language is in fact regular. One way to actually prove this would be to create a finite automaton, i.e. a DFA or NFA from the BNF. Another way could be to try to convert every single rule into a left-/right-regular rule. If this is successful, it is indeed a regular language.

Published on 2021-01-29