42 the string matches the empty string $\mts$, then @{term r} matches @{term s} |
42 the string matches the empty string $\mts$, then @{term r} matches @{term s} |
43 (and {\em vice versa}). The derivative has the property (which may be |
43 (and {\em vice versa}). The derivative has the property (which may be |
44 regarded as its specification) that, for every string @{term s} and regular |
44 regarded as its specification) that, for every string @{term s} and regular |
45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
45 expression @{term r} and character @{term c}, one has @{term "cs \<in> L(r)"} if |
46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
46 and only if \mbox{@{term "s \<in> L(der c r)"}}. The beauty of Brzozowski's |
47 derivatives is that they are neatly expressible in any functional language, and |
47 derivatives is that they are neatly expressible in any functional language, |
48 very easy to be defined and reasoned about in a theorem prover---the |
48 and easily definable and reasoned about in theorem provers---the definitions |
49 definitions consist just of inductive datatypes and recursive functions. A |
49 just consist of inductive datatypes and simple recursive functions. A |
50 completely formalised proof of this matcher has for example been given in |
50 completely formalised correctness proof of this matcher in for example HOL4 |
51 \cite{Owens2008}. |
51 has been given in~\cite{Owens2008}. |
52 |
52 |
53 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
53 One limitation of Brzozowski's matcher is that it only generates a YES/NO |
54 answer for a string being matched by a regular expression. Sulzmann and Lu |
54 answer for whether a string is being matched by a regular expression. |
55 \cite{Sulzmann2014} extended this matcher to allow generation not just of a |
55 Sulzmann and Lu \cite{Sulzmann2014} extended this matcher to allow |
56 YES/NO answer but of an actual matching, called a [lexical] {\em value}. |
56 generation not just of a YES/NO answer but of an actual matching, called a |
57 They give a still very simple algorithm to calculate a value that appears to |
57 [lexical] {\em value}. They give a simple algorithm to calculate a value |
58 be the value associated with POSIX lexing (posix) %%\cite{posix}. The |
58 that appears to be the value associated with POSIX lexing |
59 challenge then is to specify that value, in an algorithm-independent |
59 \cite{Kuklewicz,Vansummeren2006}. The challenge then is to specify that |
60 fashion, and to show that Sulzamman and Lu's derivative-based algorithm does |
60 value, in an algorithm-independent fashion, and to show that Sulzamann and |
61 indeed calculate a value that is correct according to the specification. |
61 Lu's derivative-based algorithm does indeed calculate a value that is |
62 |
62 correct according to the specification. |
63 Inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY |
63 |
64 regular expression matching algorithm, the answer given in |
64 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
65 \cite{Sulzmann2014} is to define a relation (called an ``Order Relation'') |
65 relation (called an ``Order Relation'') on the set of values of @{term r}, |
66 on the set of values of @{term r}, and to show that (once a string to be |
66 and to show that (once a string to be matched is chosen) there is a maximum |
67 matched is chosen) there is a maximum element and that it is computed by the |
67 element and that it is computed by their derivative-based algorithm. This |
68 derivative-based algorithm. Beginning with our observations that, without |
68 proof idea is inspired by work of Frisch and Cardelli \cite{Frisch2004} on a |
69 evidence that it is transitive, it cannot be called an ``order relation'', |
69 GREEDY regular expression matching algorithm. Beginning with our |
70 and that the relation is called a ``total order'' despite being evidently |
70 observations that, without evidence that it is transitive, it cannot be |
71 not total\footnote{\textcolor{green}{Why is it not total?}}, we identify |
71 called an ``order relation'', and that the relation is called a ``total |
72 problems with this approach (of which some of the proofs are not published |
72 order'' despite being evidently not total\footnote{\textcolor{green}{We |
73 in \cite{Sulzmann2014}); perhaps more importantly, we give a simple |
73 should give an argument as footnote}}, we identify problems with this |
74 inductive (and algorithm-independent) definition of what we call being a |
74 approach (of which some of the proofs are not published in |
75 {\em POSIX value} for a regular expression @{term r} and a string @{term s}; |
75 \cite{Sulzmann2014}); perhaps more importantly, we give a simple inductive |
76 we show that the algorithm computes such a value and that such a value is |
76 (and algorithm-independent) definition of what we call being a {\em POSIX |
77 unique. Proofs are both done by hand and checked in {\em Isabelle} |
77 value} for a regular expression @{term r} and a string @{term s}; we show |
78 %\cite{isabelle}. Our experience of doing our proofs has been that this |
78 that the algorithm computes such a value and that such a value is unique. |
79 mechanical checking was absolutely essential: this subject area has hidden |
79 Proofs are both done by hand and checked in {\em Isabelle/HOL}. The |
80 snares. This was also noted by Kuklewitz \cite{Kuklewicz} who found that |
80 experience of doing our proofs has been that this mechanical checking was |
81 nearly all POSIX matching implementations are ``buggy'' \cite[Page |
81 absolutely essential: this subject area has hidden snares. This was also |
82 203]{Sulzmann2014}. |
82 noted by Kuklewitz \cite{Kuklewicz} who found that nearly all POSIX matching |
83 |
83 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
84 \textcolor{green}{Say something about POSIX lexing} |
84 |
|
85 If a regular expression matches a string, then in general there are more |
|
86 than one possibility of how the string is matched. There are two commonly |
|
87 used disambiguation strategies to generate a unique answer: one is called |
|
88 GREEDY matching \cite{Frisch2004} and the other is POSIX |
|
89 matching~\cite{Kuklewicz,Sulzmann2014}. For example consider the string |
|
90 @{term xy} and the regular expression @{term "STAR (ALT (ALT x y) xy)"}. |
|
91 Either the string can be matched in two `iterations' by the single |
|
92 letter-regular expressions @{term x} and @{term y}, or directly in one |
|
93 iteration by @{term xy}. The first case corresponds to GREEDY matching, |
|
94 which first matches with the left-most symbol and only matches the next |
|
95 symbol in case of a mismatch. The second case is POSIX matching, which |
|
96 prefers the longest match. |
|
97 |
|
98 In the context of lexing, where an input string is separated into a sequence |
|
99 of tokens, POSIX is the more natural disambiguation strategy for what |
|
100 programmers consider basic syntactic building blocks in their programs. |
|
101 There are two underlying rules behind tokenising using POSIX matching: |
|
102 |
|
103 \begin{itemize} |
|
104 \item[$\bullet$] \underline{The Longest Match Rule (or ``maximal munch rule''):} |
|
105 |
|
106 The longest initial substring matched by any regular expression is taken as |
|
107 next token.\smallskip |
|
108 |
|
109 \item[$\bullet$] \underline{Rule Priority:} |
|
110 |
|
111 For a particular longest initial substring, the first regular expression |
|
112 that can match determines the token. |
|
113 \end{itemize} |
|
114 |
|
115 \noindent Consider for example a regular expression @{text |
|
116 "r\<^bsub>key\<^esub>"} that can recognise keywords, like @{text "if"}, |
|
117 @{text "then"} and so on; and another regular expression @{text |
|
118 "r\<^bsub>id\<^esub>"} that can recognise identifiers (such as single |
|
119 characters followed by characters or numbers). Then we can form the regular |
|
120 expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} and |
|
121 use POSIX matching to, for example, tokenise the strings @{text "iffoo"} and |
|
122 @{text "if"}. In the first case we obtain by the longest match rule a |
|
123 single identifier token, not a keyword and identifier. In the second case we |
|
124 obtain by rule priority a keyword token, not an identifier token. |
|
125 \bigskip |
|
126 |
|
127 \noindent\textcolor{green}{Not Done Yet} |
|
128 |
|
129 |
|
130 \medskip\noindent |
|
131 {\bf Contributions:} |
|
132 |
|
133 Derivatives as calculated by Brzozowski's method are usually more complex |
|
134 regular expressions than the initial one; various optimisations are |
|
135 possible, such as the simplifications of @{term "ALT ZERO r"}, @{term "ALT r |
|
136 ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to @{term r}. One of the |
|
137 advantages of having a simple specification and correctness proof is that |
|
138 the latter can be refined to allow for such optimisations and simple |
|
139 correctness proof. |
85 |
140 |
86 An extended version of \cite{Sulzmann2014} is available at the website |
141 An extended version of \cite{Sulzmann2014} is available at the website |
87 of its first author; this includes some ``proofs'', claimed in |
142 of its first author; this includes some ``proofs'', claimed in |
88 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
143 \cite{Sulzmann2014} to be ``rigorous''. Since these are evidently not in |
89 final form, we make no comment thereon, preferring to give general |
144 final form, we make no comment thereon, preferring to give general |
90 reasons for our belief that the approach of \cite{Sulzmann2014} is |
145 reasons for our belief that the approach of \cite{Sulzmann2014} is |
91 problematic rather than to discuss details of unpublished work. |
146 problematic rather than to discuss details of unpublished work. |
92 |
147 |
93 Derivatives as calculated by Brzozowski's method are usually more |
|
94 complex regular expressions than the initial one; various optimisations |
|
95 are possible, such as the simplifications of @{term "ALT ZERO r"}, |
|
96 @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to |
|
97 @{term r}. One of the advantages of having a simple specification and |
|
98 correctness proof is that the latter can be refined to allow for such |
|
99 optimisations and simple correctness proof. |
|
100 |
|
101 |
|
102 Sulzmann and Lu \cite{Sulzmann2014} |
|
103 \cite{Brzozowski1964} |
|
104 |
|
105 there are two commonly used |
|
106 disambiguation strategies to create a unique matching tree: |
|
107 one is called \emph{greedy} matching \cite{Frisch2004} and the |
|
108 other is \emph{POSIX} matching~\cite{Kuklewicz,Sulzmann2014}. |
|
109 For the latter there are two rough rules: |
|
110 |
|
111 \begin{itemize} |
|
112 \item The Longest Match Rule (or ``maximal munch rule''):\smallskip\\ The |
|
113 longest initial substring matched by any regular |
|
114 expression is taken as next token. |
|
115 |
|
116 \item Rule Priority:\smallskip\\ For a particular longest initial |
|
117 substring, the first regular expression that can match |
|
118 determines the token. |
|
119 \end{itemize} |
|
120 |
|
121 \noindent In the context of lexing, POSIX is the more |
|
122 interesting disambiguation strategy as it produces longest |
|
123 matches, which is necessary for tokenising programs. For |
|
124 example the string \textit{iffoo} should not match the keyword |
|
125 \textit{if} and the rest, but as one string \textit{iffoo}, |
|
126 which might be a variable name in a program. As another |
|
127 example consider the string $xy$ and the regular expression |
|
128 \mbox{$(x + y + xy)^*$}. Either the input string can be |
|
129 matched in two `iterations' by the single letter-regular |
|
130 expressions $x$ and $y$, or directly in one iteration by $xy$. |
|
131 The first case corresponds to greedy matching, which first |
|
132 matches with the left-most symbol and only matches the next |
|
133 symbol in case of a mismatch. The second case is POSIX |
|
134 matching, which prefers the longest match. In case more than |
|
135 one (longest) matches exist, only then it prefers the |
|
136 left-most match. While POSIX matching seems natural, it turns |
|
137 out to be much more subtle than greedy matching in terms of |
|
138 implementations and in terms of proving properties about it. |
|
139 If POSIX matching is implemented using automata, then one has |
|
140 to follow transitions (according to the input string) until |
|
141 one finds an accepting state, record this state and look for |
|
142 further transition which might lead to another accepting state |
|
143 that represents a longer input initial substring to be |
|
144 matched. Only if none can be found, the last accepting state |
|
145 is returned. |
|
146 |
|
147 |
|
148 Sulzmann and Lu's paper \cite{Sulzmann2014} targets POSIX |
|
149 regular expression matching. They write that it is known to be |
|
150 to be a non-trivial problem and nearly all POSIX matching |
|
151 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014}. |
|
152 For this they cite a study by Kuklewicz \cite{Kuklewicz}. My |
|
153 current work is about formalising the proofs in the paper by |
|
154 Sulzmann and Lu. Specifically, they propose in this paper a |
|
155 POSIX matching algorithm and give some details of a |
|
156 correctness proof for this algorithm inside the paper and some |
|
157 more details in an appendix. This correctness proof is |
|
158 unformalised, meaning it is just a ``pencil-and-paper'' proof, |
|
159 not done in a theorem prover. Though, the paper and presumably |
|
160 the proof have been peer-reviewed. Unfortunately their proof |
|
161 does not give enough details such that it can be |
|
162 straightforwardly implemented in a theorem prover, say |
|
163 Isabelle. In fact, the purported proof they outline does not |
|
164 work in central places. We discovered this when filling in |
|
165 many gaps and attempting to formalise the proof in Isabelle. |
|
166 |
|
167 |
|
168 \medskip\noindent |
|
169 {\bf Contributions:} |
|
170 |
148 |
171 *} |
149 *} |
172 |
150 |
173 section {* Preliminaries *} |
151 section {* Preliminaries *} |
174 |
152 |