16 |
16 |
17 |
17 |
18 abbreviation |
18 abbreviation |
19 "der_syn r c \<equiv> der c r" |
19 "der_syn r c \<equiv> der c r" |
20 abbreviation |
20 abbreviation |
|
21 "ders_syn r s \<equiv> ders s r" |
|
22 abbreviation |
21 "bder_syn r c \<equiv> bder c r" |
23 "bder_syn r c \<equiv> bder c r" |
22 |
24 |
23 notation (latex output) |
25 notation (latex output) |
24 der_syn ("_\\_" [79, 1000] 76) and |
26 der_syn ("_\\_" [79, 1000] 76) and |
|
27 ders_syn ("_\\_" [79, 1000] 76) and |
25 bder_syn ("_\\_" [79, 1000] 76) and |
28 bder_syn ("_\\_" [79, 1000] 76) and |
26 bders ("_\\_" [79, 1000] 76) and |
29 bders ("_\\_" [79, 1000] 76) and |
27 bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and |
30 bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and |
28 |
31 |
29 ZERO ("\<^bold>0" 81) and |
32 ZERO ("\<^bold>0" 81) and |
30 ONE ("\<^bold>1" 81) and |
33 ONE ("\<^bold>1" 81) and |
31 CH ("_" [1000] 80) and |
34 CH ("_" [1000] 80) and |
32 ALT ("_ + _" [77,77] 78) and |
35 ALT ("_ + _" [77,77] 78) and |
33 SEQ ("_ \<cdot> _" [77,77] 78) and |
36 SEQ ("_ \<cdot> _" [77,77] 78) and |
34 STAR ("_\<^sup>\<star>" [79] 78) and |
37 STAR ("_\<^sup>*" [79] 78) and |
35 |
38 |
36 val.Void ("Empty" 78) and |
39 val.Void ("Empty" 78) and |
37 val.Char ("Char _" [1000] 78) and |
40 val.Char ("Char _" [1000] 78) and |
38 val.Left ("Left _" [79] 78) and |
41 val.Left ("Left _" [79] 78) and |
39 val.Right ("Right _" [1000] 78) and |
42 val.Right ("Right _" [1000] 78) and |
90 regular expression, written @{term "der c r"}, give a simple solution |
93 regular expression, written @{term "der c r"}, give a simple solution |
91 to the problem of matching a string @{term s} with a regular |
94 to the problem of matching a string @{term s} with a regular |
92 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in |
95 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in |
93 succession) all the characters of the string matches the empty string, |
96 succession) all the characters of the string matches the empty string, |
94 then @{term r} matches @{term s} (and {\em vice versa}). We are aware |
97 then @{term r} matches @{term s} (and {\em vice versa}). We are aware |
95 of a mechanised correctness proof of Brzozowski's matcher in HOL4 by |
98 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
96 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
99 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
97 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one |
100 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one |
98 in Coq is given by Coquand and Siles \cite{Coquand2012}. |
101 in Coq is given by Coquand and Siles \cite{Coquand2012}. |
99 |
102 Also Ribeiro and Du Bois give one in Agda \cite{RibeiroAgda2017}. |
100 There are two difficulties with derivative-based matchers and also |
103 |
101 lexers: First, Brzozowski's original matcher only generates a yes/no |
104 |
102 answer for whether a regular expression matches a string or not. |
105 However, there are two difficulties with derivative-based matchers: |
103 Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by |
106 First, Brzozowski's original matcher only generates a yes/no answer |
104 cleverly extending Brzozowski's matching algorithm to POSIX |
107 for whether a regular expression matches a string or not. This is too |
105 lexing. This extended version generates additional information on |
108 little information in the context of lexing where separate tokens must |
106 \emph{how} a regular expression matches a string. They achieve this by |
109 be identified and also classified (for example as keywords |
107 |
110 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
108 |
111 difficulty by cleverly extending Brzozowski's matching |
109 |
112 algorithm. Their extended version generates additional information on |
110 |
113 \emph{how} a regular expression matches a string following the POSIX |
111 The second problem is that Brzozowski's derivatives can |
114 rules for regular expression matching. They achieve this by adding a |
|
115 second ``phase'' to Brzozowski's algorithm involving an injection |
|
116 function. In our own earlier work we provided the formal |
|
117 specification of what POSIX matching means and proved in Isabelle/HOL |
|
118 the correctness |
|
119 of Sulzmann and Lu's extended algorithm accordingly |
|
120 \cite{AusafDyckhoffUrban2016}. |
|
121 |
|
122 The second difficulty is that Brzozowski's derivatives can |
112 grow to arbitrarily big sizes. For example if we start with the |
123 grow to arbitrarily big sizes. For example if we start with the |
113 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take |
124 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take |
114 successive derivatives according to the character $a$, we end up with |
125 successive derivatives according to the character $a$, we end up with |
115 a sequence of ever-growing derivatives like |
126 a sequence of ever-growing derivatives like |
116 |
127 |
119 \begin{tabular}{rll} |
130 \begin{tabular}{rll} |
120 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
131 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
121 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
132 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
122 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
133 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
123 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
134 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
124 & $\ll$ & \ldots |
135 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
125 \end{tabular} |
136 \end{tabular} |
126 \end{center} |
137 \end{center} |
127 |
138 |
128 \noindent where after around 35 steps we run out of memory on a |
139 \noindent where after around 35 steps we run out of memory on a |
129 typical computer (we define the precise details of the derivative |
140 typical computer (we shall define shortly the precise details of our |
130 operation later). Clearly, the notation involving $\ZERO$s and |
141 regular expressions and the derivative operation). Clearly, the |
131 $\ONE$s already suggests simplification rules that can be applied to |
142 notation involving $\ZERO$s and $\ONE$s already suggests |
132 regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$, |
143 simplification rules that can be applied to regular regular |
133 $\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r |
144 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
134 \Rightarrow r$. While such simple-minded reductions have been proved |
145 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
135 in our earlier work to preserve the correctness of Sulzmann and Lu's |
146 r$. While such simple-minded simplifications have been proved in our |
136 algorithm, they unfortunately do \emph{not} help with limiting the |
147 earlier work to preserve the correctness of Sulzmann and Lu's |
137 gowth of the derivatives shown above: yes, the growth is slowed, but the |
148 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
138 derivatives can still grow beyond any finite bound. |
149 \emph{not} help with limiting the growth of the derivatives shown |
139 |
150 above: the growth is slowed, but the derivatives can still grow quickly |
140 |
151 beyond any finite bound. |
141 |
152 |
142 Sulzmann and Lu introduce a |
153 |
143 bitcoded version of their lexing algorithm. They make some claims |
154 |
144 about the correctness and speed of this version, but do |
155 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
145 not provide any supporting proof arguments, not even |
156 \cite{Sulzmann2014} where they introduce bitcoded |
146 ``pencil-and-paper'' arguments. They wrote about their bitcoded |
157 regular expressions. In this version, POSIX values are |
147 ``incremental parsing method'' (that is the algorithm to be studied in this |
158 represented as bitsequences and such sequences are incrementally generated |
148 section): |
159 when derivatives are calculated. The compact representation |
|
160 of bitsequences and regular expressions allows them to define a more |
|
161 ``aggressive'' simplification method that keeps the size of the |
|
162 derivatives finite no matter what the length of the string is. |
|
163 They make some informal claims about the correctness and linear behaviour |
|
164 of this version, but do not provide any supporting proof arguments, not |
|
165 even ``pencil-and-paper'' arguments. They write about their bitcoded |
|
166 \emph{incremental parsing method} (that is the algorithm to be formalised |
|
167 in this paper): |
149 |
168 |
150 \begin{quote}\it |
169 \begin{quote}\it |
151 ``Correctness Claim: We further claim that the incremental parsing |
170 ``Correctness Claim: We further claim that the incremental parsing |
152 method [..] in combination with the simplification steps [..] |
171 method [..] in combination with the simplification steps [..] |
153 yields POSIX parse trees. We have tested this claim |
172 yields POSIX parse trees. We have tested this claim |
154 extensively [..] but yet |
173 extensively [..] but yet |
155 have to work out all proof details.'' |
174 have to work out all proof details.'' |
156 \end{quote} |
175 \end{quote} |
157 |
176 |
158 |
177 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL |
159 |
178 the derivative-based lexing algorithm of Sulzmann and Lu |
160 |
179 \cite{Sulzmann2014} where regular expressions are annotated with |
161 |
180 bitsequences. We define the crucial simplification function as a |
162 If a regular expression matches a string, then in general there is more |
181 recursive function, instead of a fix-point operation. One objective of |
163 than one way of how the string is matched. There are two commonly used |
182 the simplification function is to remove duplicates of regular |
164 disambiguation strategies to generate a unique answer: one is called |
183 expressions. For this Sulzmann and Lu use in their paper the standard |
165 GREEDY matching \cite{Frisch2004} and the other is POSIX |
184 @{text nub} function from Haskell's list library, but this function |
166 matching~\cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}. |
185 does not achieve the intended objective with bitcoded regular expressions. The |
167 For example consider the string @{term xy} and the regular expression |
186 reason is that in the bitcoded setting, each copy generally has a |
168 \mbox{@{term "STAR (ALT (ALT x y) xy)"}}. Either the string can be |
187 different bitcode annotation---so @{text nub} would never ``fire''. |
169 matched in two `iterations' by the single letter-regular expressions |
188 Inspired by Scala's library for lists, we shall instead use a @{text |
170 @{term x} and @{term y}, or directly in one iteration by @{term xy}. The |
189 distinctBy} function that finds duplicates under an erasing function |
171 first case corresponds to GREEDY matching, which first matches with the |
190 that deletes bitcodes. |
172 left-most symbol and only matches the next symbol in case of a mismatch |
191 We shall also introduce our own argument and definitions for |
173 (this is greedy in the sense of preferring instant gratification to |
192 establishing the correctness of the bitcoded algorithm when |
174 delayed repletion). The second case is POSIX matching, which prefers the |
193 simplifications are included.\medskip |
175 longest match. |
194 |
176 |
195 \noindent In this paper, we shall first briefly introduce the basic notions |
177 |
196 of regular expressions and describe the basic definitions |
178 The derivative has the property (which may almost be |
197 of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves |
179 regarded as its specification) that, for every string @{term s} and |
198 as a reference point for what correctness means in our Isabelle/HOL proofs. We shall then prove |
180 regular expression @{term r} and character @{term c}, one has @{term |
199 the correctness for the bitcoded algorithm without simplification, and |
181 "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}. |
200 after that extend the proof to include simplification. |
182 |
201 |
183 |
202 *} |
184 |
203 |
|
204 section {* Background *} |
|
205 |
|
206 text {* |
|
207 In our Isabelle/HOL formalisation strings are lists of characters with |
|
208 the empty string being represented by the empty list, written $[]$, |
|
209 and list-cons being written as $\_\!::\!\_\,$; string |
|
210 concatenation is $\_ \,@\, \_\,$. We often use the usual |
|
211 bracket notation for lists also for strings; for example a string |
|
212 consisting of just a single character $c$ is written $[c]$. |
|
213 Our regular expressions are defined as usual as the elements of the following inductive |
|
214 datatype: |
|
215 |
|
216 \begin{center} |
|
217 @{text "r :="} |
|
218 @{const "ZERO"} $\mid$ |
|
219 @{const "ONE"} $\mid$ |
|
220 @{term "CH c"} $\mid$ |
|
221 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
|
222 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
|
223 @{term "STAR r"} |
|
224 \end{center} |
|
225 |
|
226 \noindent where @{const ZERO} stands for the regular expression that does |
|
227 not match any string, @{const ONE} for the regular expression that matches |
|
228 only the empty string and @{term c} for matching a character literal. |
|
229 The constructors $+$ and $\cdot$ represent alternatives and sequences, respectively. |
|
230 The |
|
231 \emph{language} of a regular expression, written $L$, is defined as usual |
|
232 and we omit giving the definition here (see for example \cite{AusafDyckhoffUrban2016}). |
|
233 |
|
234 Central to Brzozowski's regular expression matcher are two functions |
|
235 called @{text nullable} and \emph{derivative}. The latter is written |
|
236 $r\backslash c$ for the derivative of the regular expression $r$ |
|
237 w.r.t.~the character $c$. Both functions are defined by recursion over |
|
238 regular expressions. |
185 |
239 |
186 \begin{center} |
240 \begin{center} |
187 \begin{tabular}{cc} |
241 \begin{tabular}{cc} |
188 \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
242 \begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
189 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
243 @{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\ |
206 @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\ |
260 @{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\medskip\\ |
207 \end{tabular} |
261 \end{tabular} |
208 \end{tabular} |
262 \end{tabular} |
209 \end{center} |
263 \end{center} |
210 |
264 |
|
265 \noindent |
|
266 We can extend this definition to give derivatives w.r.t.~strings: |
|
267 |
|
268 \begin{center} |
|
269 \begin{tabular}{cc} |
|
270 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
271 @{thm (lhs) ders.simps(1)} & $\dn$ & @{thm (rhs) ders.simps(1)} |
|
272 \end{tabular} |
|
273 & |
|
274 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
275 @{thm (lhs) ders.simps(2)} & $\dn$ & @{thm (rhs) ders.simps(2)} |
|
276 \end{tabular} |
|
277 \end{tabular} |
|
278 \end{center} |
|
279 |
|
280 \noindent |
|
281 Using @{text nullable} and the derivative operation, we can |
|
282 define the following simple regular expression matcher: |
|
283 % |
|
284 \[ |
|
285 @{text "match s r"} \;\dn\; @{term nullable}(r\backslash s) |
|
286 \] |
|
287 |
|
288 \noindent This is essentially Brzozowski's algorithm from 1964. Its |
|
289 main virtue is that the algorithm can be easily implemented as a |
|
290 functional program (either in a functional programming language or in |
|
291 a theorem prover). The correctness proof for @{text match} amounts to |
|
292 establishing the property |
|
293 % |
|
294 \begin{proposition}\label{matchcorr} |
|
295 @{text "match s r"} \;\;\text{if and only if}\;\; $s \in L(r)$ |
|
296 \end{proposition} |
|
297 |
|
298 \noindent |
|
299 It is a fun exercise to formaly prove this property in a theorem prover. |
|
300 |
|
301 The novel idea of Sulzmann and Lu is to extend this algorithm for |
|
302 lexing, where it is important to find out which part of the string |
|
303 is matched by which part of the regular expression. |
|
304 For this Sulzmann and Lu presented two lexing algorithms in their paper |
|
305 \cite{Sulzmann2014}. This first algorithm consists of two phases: first a |
|
306 matching phase (which is Brzozowski's algorithm) and then a value |
|
307 construction phase. The values encode \emph{how} a regular expression |
|
308 matches a string. \emph{Values} are defined as the inductive datatype |
|
309 |
|
310 \begin{center} |
|
311 @{text "v :="} |
|
312 @{const "Void"} $\mid$ |
|
313 @{term "val.Char c"} $\mid$ |
|
314 @{term "Left v"} $\mid$ |
|
315 @{term "Right v"} $\mid$ |
|
316 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
|
317 @{term "Stars vs"} |
|
318 \end{center} |
|
319 |
|
320 \noindent where we use @{term vs} to stand for a list of values. The |
|
321 string underlying a value can be calculated by a @{const flat} |
|
322 function, written @{term "flat DUMMY"}. It traverses a value and |
|
323 collects the characters contained in it. Sulzmann and Lu also define inductively an |
|
324 inhabitation relation that associates values to regular expressions: |
|
325 |
|
326 \begin{center} |
|
327 \begin{tabular}{c} |
|
328 \\[-8mm] |
|
329 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
|
330 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] |
|
331 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
|
332 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] |
|
333 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]} \qquad |
|
334 @{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]} |
|
335 \end{tabular} |
|
336 \end{center} |
|
337 |
|
338 \noindent Note that no values are associated with the regular expression |
|
339 @{term ZERO}, since it cannot match any string. |
|
340 It is routine to establish how values ``inhabiting'' a regular |
|
341 expression correspond to the language of a regular expression, namely |
|
342 |
|
343 \begin{proposition} |
|
344 @{thm L_flat_Prf} |
|
345 \end{proposition} |
|
346 |
|
347 In general there is more than one value inhabited by a regular |
|
348 expression (meaning regular expressions can typically match more |
|
349 than one string). But even when fixing a string from the language of the |
|
350 regular expression, there are generally more than one way of how the |
|
351 regular expression can match this string. POSIX lexing is about |
|
352 identifying the unique value for a given regular expression and a |
|
353 string that satisfies the informal POSIX rules (see |
|
354 \cite{POSIX,Kuklewicz,OkuiSuzuki2010,Sulzmann2014,Vansummeren2006}).\footnote{POSIX |
|
355 lexing acquired its name from the fact that the corresponding |
|
356 rules were described as part of the POSIX specification for |
|
357 Unix-like operating systems \cite{POSIX}.} Sometimes these |
|
358 informal rules are called \emph{maximal much rule} and \emph{rule priority}. |
|
359 One contribution of our earlier paper is to give a convenient |
|
360 specification for what a POSIX value is (the inductive rules are shown in |
|
361 Figure~\ref{POSIXrules}). |
211 |
362 |
212 \begin{figure}[t] |
363 \begin{figure}[t] |
|
364 \begin{center} |
|
365 \begin{tabular}{c} |
|
366 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad |
|
367 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\ |
|
368 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad |
|
369 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
|
370 $\mprset{flushleft} |
|
371 \inferrule |
|
372 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
|
373 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
|
374 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
|
375 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\medskip\smallskip\\ |
|
376 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\qquad |
|
377 $\mprset{flushleft} |
|
378 \inferrule |
|
379 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
380 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
381 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
|
382 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
|
383 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\\[-4mm] |
|
384 \end{tabular} |
|
385 \end{center} |
|
386 \caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}. The ternary relation, written $(s, r) \rightarrow v$, formalises the notion |
|
387 of given a string $s$ and a regular |
|
388 expression $r$ what is the unique value $v$ that satisfies the informal POSIX constraints for |
|
389 regular expression matching.}\label{POSIXrules} |
|
390 \end{figure} |
|
391 |
|
392 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define |
|
393 an injection function on values that mirrors (but inverts) the |
|
394 construction of the derivative on regular expressions. Essentially it |
|
395 injects back a character into a value. |
|
396 For this they define two functions called @{text mkeps} and @{text inj}: |
|
397 |
|
398 \begin{center} |
|
399 \begin{tabular}{l} |
|
400 \begin{tabular}{lcl} |
|
401 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
|
402 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
403 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
404 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
|
405 \end{tabular}\smallskip\\ |
|
406 |
|
407 \begin{tabular}{lcl} |
|
408 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
|
409 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
|
410 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
|
411 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
|
412 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
|
413 @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
|
414 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
|
415 @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
|
416 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
|
417 @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ |
|
418 & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
|
419 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
|
420 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]} |
|
421 \end{tabular} |
|
422 \end{tabular} |
|
423 \end{center} |
|
424 |
|
425 \noindent |
|
426 The function @{text mkeps} is called when the last derivative is nullable, that is |
|
427 the string to be matched is in the language of the regular expression. It generates |
|
428 a value for how the last derivative can match the empty string. The injection function |
|
429 then calculates the corresponding value for each intermediate derivative until |
|
430 a value for the original regular expression is generated. |
|
431 Graphically the algorithm by |
|
432 Sulzmann and Lu can be illustrated by the picture in Figure~\ref{Sulz} |
|
433 where the path from the left to the right involving @{term derivatives}/@{const |
|
434 nullable} is the first phase of the algorithm (calculating successive |
|
435 \Brz's derivatives) and @{const mkeps}/@{text inj}, the path from right to |
|
436 left, the second phase. This picture shows the steps required when a |
|
437 regular expression, say @{text "r\<^sub>1"}, matches the string @{term |
|
438 "[a,b,c]"}. The lexing algorithm by Sulzmann and Lu can be defined as: |
|
439 |
|
440 \begin{figure}[t] |
213 \begin{center} |
441 \begin{center} |
214 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
442 \begin{tikzpicture}[scale=2,node distance=1.3cm, |
215 every node/.style={minimum size=6mm}] |
443 every node/.style={minimum size=6mm}] |
216 \node (r1) {@{term "r\<^sub>1"}}; |
444 \node (r1) {@{term "r\<^sub>1"}}; |
217 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
445 \node (r2) [right=of r1]{@{term "r\<^sub>2"}}; |
232 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}}; |
460 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{@{term "mkeps"}}}; |
233 \end{tikzpicture} |
461 \end{tikzpicture} |
234 \end{center} |
462 \end{center} |
235 \mbox{}\\[-13mm] |
463 \mbox{}\\[-13mm] |
236 |
464 |
237 \caption{The two phases of the algorithm by Sulzmann \& Lu \cite{Sulzmann2014}, |
465 \caption{The two phases of the first algorithm by Sulzmann \& Lu \cite{Sulzmann2014}, |
238 matching the string @{term "[a,b,c]"}. The first phase (the arrows from |
466 matching the string @{term "[a,b,c]"}. The first phase (the arrows from |
239 left to right) is \Brz's matcher building successive derivatives. If the |
467 left to right) is \Brz's matcher building successive derivatives. If the |
240 last regular expression is @{term nullable}, then the functions of the |
468 last regular expression is @{term nullable}, then the functions of the |
241 second phase are called (the top-down and right-to-left arrows): first |
469 second phase are called (the top-down and right-to-left arrows): first |
242 @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing |
470 @{term mkeps} calculates a value @{term "v\<^sub>4"} witnessing |
243 how the empty string has been recognised by @{term "r\<^sub>4"}. After |
471 how the empty string has been recognised by @{term "r\<^sub>4"}. After |
244 that the function @{term inj} ``injects back'' the characters of the string into |
472 that the function @{term inj} ``injects back'' the characters of the string into |
245 the values. |
473 the values. The value @{term "v\<^sub>1"} is the result of the algorithm representing |
|
474 the POSIX value for this string and |
|
475 regular expression. |
246 \label{Sulz}} |
476 \label{Sulz}} |
247 \end{figure} |
477 \end{figure} |
248 |
478 |
249 |
479 |
|
480 |
|
481 \begin{center} |
|
482 \begin{tabular}{lcl} |
|
483 @{thm (lhs) lexer.simps(1)} & $\dn$ & @{thm (rhs) lexer.simps(1)}\\ |
|
484 @{thm (lhs) lexer.simps(2)} & $\dn$ & @{text "case"} @{term "lexer (der c r) s"} @{text of}\\ |
|
485 & & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\ |
|
486 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{term "Some (injval r c v)"} |
|
487 \end{tabular} |
|
488 \end{center} |
|
489 |
|
490 |
|
491 We have shown in our earlier paper \cite{AusafDyckhoffUrban2016} that the algorithm by Sulzmann and Lu |
|
492 is correct. The cenral property we established relates the derivative operation to the injection function. |
|
493 |
|
494 \begin{proposition}\label{Posix2} |
|
495 \textit{If} $(s,\; r\backslash c) \rightarrow v$ \textit{then} $(c :: s,\; r) \rightarrow$ \textit{inj} $r\; c\; v$. |
|
496 \end{proposition} |
|
497 |
|
498 \noindent |
|
499 With this in place we were able to prove: |
|
500 |
|
501 |
|
502 \begin{proposition}\mbox{}\smallskip\\\label{lexercorrect} |
|
503 \begin{tabular}{ll} |
|
504 (1) & @{thm (lhs) lexer_correct_None} if and only if @{thm (rhs) lexer_correct_None}\\ |
|
505 (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\ |
|
506 \end{tabular} |
|
507 \end{proposition} |
|
508 |
|
509 \noindent |
|
510 In fact we have shown that in the success case the generated POSIX value $v$ is |
|
511 unique and in the failure case that there is no POSIX value $v$ that satisfies |
|
512 $(s, r) \rightarrow v$. While the algorithm is correct, it is excrutiatingly |
|
513 slow in examples where the derivatives grow arbitrarily (see example from the |
|
514 Introduction). However it can be used as a conveninet reference point for the correctness |
|
515 proof of the second algorithm by Sulzmann and Lu, which we shall describe next. |
|
516 |
250 *} |
517 *} |
251 |
518 |
252 section {* Background *} |
519 section {* Bitcoded Regular Expressions and Derivatives *} |
253 |
520 |
254 text {* |
521 text {* |
255 In our Isabelle/HOL formalisation strings are lists of characters with |
|
256 the empty string being represented by the empty list, written $[]$, |
|
257 and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string |
|
258 concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual |
|
259 bracket notation for lists also for strings; for example a string |
|
260 consisting of just a single character $c$ is written $[c]$. |
|
261 Our egular expressions are defined as usual as the elements of the following inductive |
|
262 datatype: |
|
263 |
|
264 \begin{center} |
|
265 @{text "r :="} |
|
266 @{const "ZERO"} $\mid$ |
|
267 @{const "ONE"} $\mid$ |
|
268 @{term "CH c"} $\mid$ |
|
269 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
|
270 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
|
271 @{term "STAR r"} |
|
272 \end{center} |
|
273 |
|
274 \noindent where @{const ZERO} stands for the regular expression that does |
|
275 not match any string, @{const ONE} for the regular expression that matches |
|
276 only the empty string and @{term c} for matching a character literal. The |
|
277 language of a regular expression, written $L$, is defined as usual |
|
278 (see for example \cite{AusafDyckhoffUrban2016}). |
|
279 |
|
280 Central to Brzozowski's regular expression matcher are two functions |
|
281 called $\nullable$ and \emph{derivative}. The latter is written |
|
282 $r\backslash c$ for the derivative of the regular expression $r$ |
|
283 w.r.t.~the character $c$. Both functions are defined by recursion over |
|
284 regular expressions. |
|
285 |
|
286 |
|
287 \begin{center} |
|
288 \begin{tabular}{lcl} |
|
289 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
290 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
291 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
292 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
293 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
294 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
295 \end{tabular} |
|
296 \end{center} |
|
297 |
|
298 \noindent |
|
299 The derivative function takes a regular expression, say $r$ and a |
|
300 character, say $c$, as input and returns the derivative regular |
|
301 expression. |
|
302 |
|
303 \begin{center} |
|
304 \begin{tabular}{lcl} |
|
305 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
306 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
307 $d \backslash c$ & $\dn$ & |
|
308 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
309 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
310 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\ |
|
311 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
312 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
313 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
314 \end{tabular} |
|
315 \end{center} |
|
316 |
|
317 |
|
318 Sulzmann and Lu presented two lexing algorithms in their paper from 2014 |
|
319 \cite{Sulzmann2014}. This first algorithm consists of two phases: first a |
|
320 matching phase (which is Brzozowski's algorithm) and then a value |
|
321 construction phase. The values encode \emph{how} a regular expression |
|
322 matches a string. \emph{Values} are defined as the inductive datatype |
|
323 |
|
324 \begin{center} |
|
325 @{text "v :="} |
|
326 @{const "Void"} $\mid$ |
|
327 @{term "val.Char c"} $\mid$ |
|
328 @{term "Left v"} $\mid$ |
|
329 @{term "Right v"} $\mid$ |
|
330 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
|
331 @{term "Stars vs"} |
|
332 \end{center} |
|
333 |
|
334 \noindent where we use @{term vs} to stand for a list of |
|
335 values. |
|
336 |
|
337 |
|
338 |
|
339 |
|
340 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
|
341 that associates values to regular expressions: |
|
342 |
|
343 \begin{center} |
|
344 \begin{tabular}{c} |
|
345 \\[-8mm] |
|
346 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
|
347 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] |
|
348 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
|
349 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] |
|
350 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm] |
|
351 %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
|
352 @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]} |
|
353 \end{tabular} |
|
354 \end{center} |
|
355 |
|
356 \noindent Note that no values are associated with the regular expression |
|
357 @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular |
|
358 expression correspond to the language of a regular expression, namely |
|
359 |
|
360 \begin{proposition} |
|
361 @{thm L_flat_Prf} |
|
362 \end{proposition} |
|
363 |
|
364 Sulzmann-Lu algorithm with inj. State that POSIX rules. |
|
365 metion slg is correct. |
|
366 |
|
367 |
|
368 \begin{figure}[t] |
|
369 \begin{center} |
|
370 \begin{tabular}{c} |
|
371 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad |
|
372 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\ |
|
373 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad |
|
374 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
|
375 $\mprset{flushleft} |
|
376 \inferrule |
|
377 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
|
378 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
|
379 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
|
380 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\\ |
|
381 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\medskip\\ |
|
382 $\mprset{flushleft} |
|
383 \inferrule |
|
384 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
385 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
|
386 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
|
387 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
|
388 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close> |
|
389 \end{tabular} |
|
390 \end{center} |
|
391 \caption{Our inductive definition of POSIX values.}\label{POSIXrules} |
|
392 \end{figure} |
|
393 |
|
394 |
|
395 \begin{center} |
|
396 \begin{tabular}{lcl} |
|
397 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
|
398 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
399 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
|
400 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
|
401 \end{tabular} |
|
402 \end{center} |
|
403 |
|
404 \begin{center} |
|
405 \begin{tabular}{l@ {\hspace{5mm}}lcl} |
|
406 \textit{(1)} & @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
|
407 \textit{(2)} & @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
|
408 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
|
409 \textit{(3)} & @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
|
410 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
|
411 \textit{(4)} & @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
|
412 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
|
413 \textit{(5)} & @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
|
414 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
|
415 \textit{(6)} & @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ |
|
416 & @{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
|
417 \textit{(7)} & @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
|
418 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ |
|
419 \end{tabular} |
|
420 \end{center} |
|
421 |
|
422 *} |
|
423 |
|
424 section {* Bitcoded Regular Expressions and Derivatives *} |
|
425 |
|
426 text {* |
|
427 |
522 |
428 In the second part of their paper \cite{Sulzmann2014}, |
523 In the second part of their paper \cite{Sulzmann2014}, |
429 Sulzmann and Lu describe another algorithm that generates POSIX |
524 Sulzmann and Lu describe another algorithm that also generates POSIX |
430 values but dispences with the second phase where characters are |
525 values but dispences with the second phase where characters are |
431 injected ``back'' into values. For this they annotate bitcodes to |
526 injected ``back'' into values. For this they annotate bitcodes to |
432 regular expressions, which we define in Isabelle/HOL as the datatype |
527 regular expressions, which we define in Isabelle/HOL as the datatype |
433 |
528 |
434 \begin{center} |
529 \begin{center} |
577 expression by just erasing the annotated bitsequences. We omit the |
672 expression by just erasing the annotated bitsequences. We omit the |
578 straightforward definition. For defining the algorithm, we also need |
673 straightforward definition. For defining the algorithm, we also need |
579 the functions \textit{bnullable} and \textit{bmkeps}, which are the |
674 the functions \textit{bnullable} and \textit{bmkeps}, which are the |
580 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
675 ``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on |
581 bitcoded regular expressions, instead of regular expressions. |
676 bitcoded regular expressions, instead of regular expressions. |
582 |
677 % |
583 \begin{center} |
678 \begin{center} |
584 \begin{tabular}{@ {}c@ {}c@ {}} |
679 \begin{tabular}{@ {}c@ {}c@ {}} |
585 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
680 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
586 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\ |
681 $\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{false}$\\ |
587 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\ |
682 $\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{true}$\\ |
588 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\ |
683 $\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{false}$\\ |
589 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ & |
684 $\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ & |
590 $\exists\, r \in \rs. \,\textit{bnullable}\,r$\\ |
685 $\exists\, r \in \rs. \,\textit{bnullable}\,r$\\ |
709 All properties are by induction on annotated regular expressions. There are no |
804 All properties are by induction on annotated regular expressions. There are no |
710 interesting cases. |
805 interesting cases. |
711 \end{proof} |
806 \end{proof} |
712 |
807 |
713 \noindent |
808 \noindent |
|
809 The only problem left for the correctness proof is that the bitcoded algorithm |
|
810 has only a ``forward phase'' where POSIX values are generated incrementally. |
|
811 We can achive the same effect with @{text lexer} by stacking up injection |
|
812 functions in the during forward phase. An auxiliary function, called $\textit{flex}$, |
|
813 allows us to recast the rules of $\lexer$ (with its two phases) in terms of a single |
|
814 phase. |
|
815 |
|
816 \begin{center} |
|
817 \begin{tabular}{lcl} |
|
818 $\textit{flex}\;r\,f\,[]$ & $\dn$ & $f$\\ |
|
819 $\textit{flex}\;r\,f\,(c\!::\!s)$ & $\dn$ & |
|
820 $\textit{flex}\,(r\backslash c)\,(\lambda v.\,f\,(\inj\,r\,c\,v))\,s$\\ |
|
821 \end{tabular} |
|
822 \end{center} |
|
823 |
|
824 \noindent |
|
825 The point of this function is that when |
|
826 reaching the end of the string, we just need to apply the stacked |
|
827 injection functions to the value generated by $\mkeps$. |
|
828 Using this function we can recast the success case in @{text lexer} |
|
829 as follows: |
|
830 |
|
831 \begin{proposition}\label{flex} |
|
832 If @{text "lexer r s = Some v"} \;then\; @{text "v = "}$\,\textit{flex}\,r\,id\,s\, |
|
833 (\mkeps (r\backslash s))$. |
|
834 \end{proposition} |
|
835 |
|
836 \noindent |
|
837 Note we did not redefine \textit{lexer}, we just established that the |
|
838 value generated by \textit{lexer} can also be obtained by a different |
|
839 method. While this different method is not efficient (we essentially |
|
840 need to traverse the string $s$ twice, once for building the |
|
841 derivative $r\backslash s$ and another time for stacking up injection |
|
842 functions using \textit{flex}), it helped us with proving |
|
843 that incrementally building up values. |
|
844 |
714 This brings us to our main lemma in this section: if we build a |
845 This brings us to our main lemma in this section: if we build a |
715 derivative, say $r\backslash s$ and have a value, say $v$, inhabited |
846 derivative, say $r\backslash s$ and have a value, say $v$, inhabited |
716 by this derivative, then we can produce the result $\lexer$ generates |
847 by this derivative, then we can produce the result $\lexer$ generates |
717 by applying this value to the stacked-up injection functions |
848 by applying this value to the stacked-up injection functions |
718 $\textit{flex}$ assembles. The lemma establishes that this is the same |
849 $\textit{flex}$ assembles. The lemma establishes that this is the same |
735 for this way of performing the induction.} |
866 for this way of performing the induction.} |
736 |
867 |
737 The case for $[]$ is routine using Lemmas~\ref{codedecode} |
868 The case for $[]$ is routine using Lemmas~\ref{codedecode} |
738 and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from |
869 and~\ref{retrievecode}. In the case $s\,@\,[c]$, we can infer from |
739 the assumption that $\vdash v : (r\backslash s)\backslash c$ |
870 the assumption that $\vdash v : (r\backslash s)\backslash c$ |
740 holds. Hence by Lemma~\ref{Posix2} we know that |
871 holds. Hence by Prop.~\ref{Posix2} we know that |
741 (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too. |
872 (*) $\vdash \inj\,(r\backslash s)\,c\,v : r\backslash s$ holds too. |
742 By definition of $\textit{flex}$ we can unfold the left-hand side |
873 By definition of $\textit{flex}$ we can unfold the left-hand side |
743 to be |
874 to be |
744 \[ |
875 \[ |
745 \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,(s\,@\,[c])\,v) = |
876 \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,(s\,@\,[c])\,v) = |
746 \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,s\,(\inj\,(r\backslash s)\,c\,v)) |
877 \textit{Some}\,(\textit{flex}\;r\,\textit{id}\,s\,(\inj\,(r\backslash s)\,c\,v)) |
747 \] |
878 \] |
748 |
879 |
749 \noindent |
880 \noindent |
750 By induction hypothesis and (*) we can rewrite the right-hand side to |
881 By induction hypothesis and (*) we can rewrite the right-hand side to |
751 |
882 % |
752 \[ |
883 \[ |
753 \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\; |
884 \textit{decode}\,(\textit{retrieve}\,(r^\uparrow\backslash s)\; |
754 (\inj\,(r\backslash s)\,c\,\,v))\,r |
885 (\inj\,(r\backslash s)\,c\,\,v))\,r |
755 \] |
886 \] |
756 |
887 |