21 "bder_syn r c \<equiv> bder c r" |
21 "bder_syn r c \<equiv> bder c r" |
22 |
22 |
23 notation (latex output) |
23 notation (latex output) |
24 der_syn ("_\\_" [79, 1000] 76) and |
24 der_syn ("_\\_" [79, 1000] 76) and |
25 bder_syn ("_\\_" [79, 1000] 76) and |
25 bder_syn ("_\\_" [79, 1000] 76) and |
|
26 bders ("_\\_" [79, 1000] 76) and |
|
27 bders_simp ("_\\\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and |
26 |
28 |
27 ZERO ("\<^bold>0" 81) and |
29 ZERO ("\<^bold>0" 81) and |
28 ONE ("\<^bold>1" 81) and |
30 ONE ("\<^bold>1" 81) and |
29 CH ("_" [1000] 80) and |
31 CH ("_" [1000] 80) and |
30 ALT ("_ + _" [77,77] 78) and |
32 ALT ("_ + _" [77,77] 78) and |
36 val.Left ("Left _" [79] 78) and |
38 val.Left ("Left _" [79] 78) and |
37 val.Right ("Right _" [1000] 78) and |
39 val.Right ("Right _" [1000] 78) and |
38 val.Seq ("Seq _ _" [79,79] 78) and |
40 val.Seq ("Seq _ _" [79,79] 78) and |
39 val.Stars ("Stars _" [79] 78) and |
41 val.Stars ("Stars _" [79] 78) and |
40 |
42 |
|
43 Prf ("\<turnstile> _ : _" [75,75] 75) and |
41 Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and |
44 Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and |
42 |
45 |
43 flat ("|_|" [75] 74) and |
46 flat ("|_|" [75] 74) and |
44 flats ("|_|" [72] 74) and |
47 flats ("|_|" [72] 74) and |
45 injval ("inj _ _ _" [79,77,79] 76) and |
48 injval ("inj _ _ _" [79,77,79] 76) and |
81 expressions have sparked quite a bit of interest in the functional |
84 expressions have sparked quite a bit of interest in the functional |
82 programming and theorem prover communities. The beauty of |
85 programming and theorem prover communities. The beauty of |
83 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly |
86 Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly |
84 expressible in any functional language, and easily definable and |
87 expressible in any functional language, and easily definable and |
85 reasoned about in theorem provers---the definitions just consist of |
88 reasoned about in theorem provers---the definitions just consist of |
86 inductive datatypes and simple recursive functions. A mechanised |
89 inductive datatypes and simple recursive functions. Derivatives of a |
87 correctness proof of Brzozowski's matcher in for example HOL4 has been |
90 regular expression, written @{term "der c r"}, give a simple solution |
88 mentioned by Owens and Slind~\cite{Owens2008}. Another one in |
91 to the problem of matching a string @{term s} with a regular |
89 Isabelle/HOL is part of the work by Krauss and Nipkow |
92 expression @{term r}: if the derivative of @{term r} w.r.t.\ (in |
90 \cite{Krauss2011}. And another one in Coq is given by Coquand and |
93 succession) all the characters of the string matches the empty string, |
91 Siles \cite{Coquand2012}. |
94 then @{term r} matches @{term s} (and {\em vice versa}). We are aware |
92 |
95 of a mechanised correctness proof of Brzozowski's matcher in HOL4 by |
93 |
96 Owens and Slind~\cite{Owens2008}. Another one in Isabelle/HOL is part |
94 The notion of derivatives |
97 of the work by Krauss and Nipkow \cite{Krauss2011}. And another one |
95 \cite{Brzozowski1964}, written @{term "der c r"}, of a regular |
98 in Coq is given by Coquand and Siles \cite{Coquand2012}. |
96 expression give a simple solution to the problem of matching a string |
99 |
97 @{term s} with a regular expression @{term r}: if the derivative of |
100 There are two difficulties with derivative-based matchers and also |
98 @{term r} w.r.t.\ (in succession) all the characters of the string |
101 lexers: First, Brzozowski's original matcher only generates a yes/no |
99 matches the empty string, then @{term r} matches @{term s} (and {\em |
102 answer for whether a regular expression matches a string or not. |
100 vice versa}). The derivative has the property (which may almost be |
103 Sulzmann and Lu~\cite{Sulzmann2014} overcome this difficulty by |
101 regarded as its specification) that, for every string @{term s} and |
104 cleverly extending Brzozowski's matching algorithm to POSIX |
102 regular expression @{term r} and character @{term c}, one has @{term |
105 lexing. This extended version generates additional information on |
103 "cs \<in> L(r)"} if and only if \mbox{@{term "s \<in> L(der c r)"}}. |
106 \emph{how} a regular expression matches a string. They achieve this by |
|
107 |
|
108 |
|
109 |
|
110 |
|
111 The second problem is that Brzozowski's derivatives can |
|
112 grow to arbitrarily big sizes. For example if we start with the |
|
113 regular expression \mbox{@{text "(a + aa)\<^sup>*"}} and take |
|
114 successive derivatives according to the character $a$, we end up with |
|
115 a sequence of ever-growing derivatives like |
|
116 |
|
117 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
118 \begin{center} |
|
119 \begin{tabular}{rll} |
|
120 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
121 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
122 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
123 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
124 & $\ll$ & \ldots |
|
125 \end{tabular} |
|
126 \end{center} |
|
127 |
|
128 \noindent where after around 35 steps we run out of memory on a |
|
129 typical computer (we define the precise details of the derivative |
|
130 operation later). Clearly, the notation involving $\ZERO$s and |
|
131 $\ONE$s already suggests simplification rules that can be applied to |
|
132 regular regular expressions, for example $\ZERO{}r \Rightarrow \ZERO$, |
|
133 $\ONE{}r \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r |
|
134 \Rightarrow r$. While such simple-minded reductions have been proved |
|
135 in our earlier work to preserve the correctness of Sulzmann and Lu's |
|
136 algorithm, they unfortunately do \emph{not} help with limiting the |
|
137 gowth of the derivatives shown above: yes, the growth is slowed, but the |
|
138 derivatives can still grow beyond any finite bound. |
|
139 |
|
140 |
|
141 |
|
142 Sulzmann and Lu introduce a |
|
143 bitcoded version of their lexing algorithm. They make some claims |
|
144 about the correctness and speed of this version, but do |
|
145 not provide any supporting proof arguments, not even |
|
146 ``pencil-and-paper'' arguments. They wrote about their bitcoded |
|
147 ``incremental parsing method'' (that is the algorithm to be studied in this |
|
148 section): |
|
149 |
|
150 \begin{quote}\it |
|
151 ``Correctness Claim: We further claim that the incremental parsing |
|
152 method [..] in combination with the simplification steps [..] |
|
153 yields POSIX parse trees. We have tested this claim |
|
154 extensively [..] but yet |
|
155 have to work out all proof details.'' |
|
156 \end{quote} |
|
157 |
|
158 |
|
159 |
104 |
160 |
105 |
161 |
106 If a regular expression matches a string, then in general there is more |
162 If a regular expression matches a string, then in general there is more |
107 than one way of how the string is matched. There are two commonly used |
163 than one way of how the string is matched. There are two commonly used |
108 disambiguation strategies to generate a unique answer: one is called |
164 disambiguation strategies to generate a unique answer: one is called |
186 *} |
250 *} |
187 |
251 |
188 section {* Background *} |
252 section {* Background *} |
189 |
253 |
190 text {* |
254 text {* |
|
255 In our Isabelle/HOL formalisation strings are lists of characters with |
|
256 the empty string being represented by the empty list, written $[]$, |
|
257 and list-cons being written as $\_\!\_\!::\!\_\!\_\,$; string |
|
258 concatenation is $\_\!\_ \,@\, \_\!\_\,$. Often we use the usual |
|
259 bracket notation for lists also for strings; for example a string |
|
260 consisting of just a single character $c$ is written $[c]$. |
|
261 Our egular expressions are defined as usual as the elements of the following inductive |
|
262 datatype: |
|
263 |
|
264 \begin{center} |
|
265 @{text "r :="} |
|
266 @{const "ZERO"} $\mid$ |
|
267 @{const "ONE"} $\mid$ |
|
268 @{term "CH c"} $\mid$ |
|
269 @{term "ALT r\<^sub>1 r\<^sub>2"} $\mid$ |
|
270 @{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$ |
|
271 @{term "STAR r"} |
|
272 \end{center} |
|
273 |
|
274 \noindent where @{const ZERO} stands for the regular expression that does |
|
275 not match any string, @{const ONE} for the regular expression that matches |
|
276 only the empty string and @{term c} for matching a character literal. The |
|
277 language of a regular expression, written $L$, is defined as usual |
|
278 (see for example \cite{AusafDyckhoffUrban2016}). |
|
279 |
|
280 Central to Brzozowski's regular expression matcher are two functions |
|
281 called $\nullable$ and \emph{derivative}. The latter is written |
|
282 $r\backslash c$ for the derivative of the regular expression $r$ |
|
283 w.r.t.~the character $c$. Both functions are defined by recursion over |
|
284 regular expressions. |
|
285 |
|
286 |
|
287 \begin{center} |
|
288 \begin{tabular}{lcl} |
|
289 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
290 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
291 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
292 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
293 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
294 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
295 \end{tabular} |
|
296 \end{center} |
|
297 |
|
298 \noindent |
|
299 The derivative function takes a regular expression, say $r$ and a |
|
300 character, say $c$, as input and returns the derivative regular |
|
301 expression. |
|
302 |
|
303 \begin{center} |
|
304 \begin{tabular}{lcl} |
|
305 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
306 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
307 $d \backslash c$ & $\dn$ & |
|
308 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
309 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
310 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\ |
|
311 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
312 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
313 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
314 \end{tabular} |
|
315 \end{center} |
|
316 |
|
317 |
|
318 Sulzmann and Lu presented two lexing algorithms in their paper from 2014 |
|
319 \cite{Sulzmann2014}. This first algorithm consists of two phases: first a |
|
320 matching phase (which is Brzozowski's algorithm) and then a value |
|
321 construction phase. The values encode \emph{how} a regular expression |
|
322 matches a string. \emph{Values} are defined as the inductive datatype |
|
323 |
|
324 \begin{center} |
|
325 @{text "v :="} |
|
326 @{const "Void"} $\mid$ |
|
327 @{term "val.Char c"} $\mid$ |
|
328 @{term "Left v"} $\mid$ |
|
329 @{term "Right v"} $\mid$ |
|
330 @{term "Seq v\<^sub>1 v\<^sub>2"} $\mid$ |
|
331 @{term "Stars vs"} |
|
332 \end{center} |
|
333 |
|
334 \noindent where we use @{term vs} to stand for a list of |
|
335 values. |
|
336 |
|
337 |
|
338 |
|
339 |
|
340 \noindent Sulzmann and Lu also define inductively an inhabitation relation |
|
341 that associates values to regular expressions: |
|
342 |
|
343 \begin{center} |
|
344 \begin{tabular}{c} |
|
345 \\[-8mm] |
|
346 @{thm[mode=Axiom] Prf.intros(4)} \qquad |
|
347 @{thm[mode=Axiom] Prf.intros(5)[of "c"]}\\[4mm] |
|
348 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]} \qquad |
|
349 @{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>1" "r\<^sub>2"]}\\[4mm] |
|
350 @{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\\[4mm] |
|
351 %%@ { t h m[mode=Axiom] Prf.intros(6)[of "r"]} \qquad |
|
352 @{thm[mode=Rule] Prf.intros(6)[of "r" "vs"]} |
|
353 \end{tabular} |
|
354 \end{center} |
|
355 |
|
356 \noindent Note that no values are associated with the regular expression |
|
357 @{term ZERO}. It is routine to establish how values ``inhabiting'' a regular |
|
358 expression correspond to the language of a regular expression, namely |
|
359 |
|
360 \begin{proposition} |
|
361 @{thm L_flat_Prf} |
|
362 \end{proposition} |
|
363 |
191 Sulzmann-Lu algorithm with inj. State that POSIX rules. |
364 Sulzmann-Lu algorithm with inj. State that POSIX rules. |
192 metion slg is correct. |
365 metion slg is correct. |
193 |
366 |
194 |
367 |
195 \begin{figure}[t] |
368 \begin{figure}[t] |