  bders_simp ("_\\\<^sub>b\<^sub>s\<^sub>i\<^sub>m\<^sub>p _" [79, 1000] 76) and

  ZERO ("\<^bold>0" 81) and
  ONE ("\<^bold>1" 81) and
  CH ("_" [1000] 80) and
  ALT ("_ + _" [76,76] 77) and
  SEQ ("_ \<cdot> _" [78,78] 78) and
  STAR ("_\<^sup>*" [79] 78) and
  NTIMES ("_\<^sup>{\<^bsup>_\<^esup>\<^sup>}" [79] 78) and

  val.Void ("Empty" 78) and
  val.Char ("Char _" [1000] 78) and
  val.Left ("Left _" [79] 78) and
  val.Right ("Right _" [1000] 78) and
  val.Seq ("Seq _ _" [79,79] 78) and
  val.Stars ("Stars _" [1000] 78) and

  Prf ("\<turnstile> _ : _" [75,75] 75) and
  Posix ("'(_, _') \<rightarrow> _" [63,75,75] 75) and

  flat ("|_|" [75] 74) and
  and "rs \<noteq> Nil ==> retrieve (AALTs bs (r#rs)) (Right v) = bs @ retrieve (AALTs [] rs) v"
  apply (metis list.exhaust retrieve.simps(4))
  by (metis list.exhaust retrieve.simps(5))


lemma better_retrieve2:
  shows "retrieve (ANTIMES bs r (n + 1)) (Stars (v#vs)) =
           bs @ [Z] @ retrieve r v @ retrieve (ANTIMES [] r n) (Stars vs)"
  by auto
(*>*)

section {* Introduction *}

text {*

In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
Derivatives of a
regular expression, written @{term "der c r"}, give a simple solution
to the problem of matching a string @{term s} with a regular
expression @{term r}: if the derivative of @{term r} w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then @{term r} matches @{term s} (and {\em vice versa}).
The beauty of
Brzozowski's derivatives \cite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions. Another attractive
feature of derivatives is that they can be easily extended to \emph{bounded}
regular expressions, such as @{term "r"}$^{\{@{text n}\}}$ or @{term r}$^{\{..@{text n}\}}$, where numbers or
intervals of numbers specify how many times a regular expression should be used
during matching.


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too

\noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL
the derivative-based lexing algorithm of Sulzmann and Lu
\cite{Sulzmann2014} where regular expressions are annotated with
bitsequences. We define the crucial simplification function as a
recursive function, without the need of a fixpoint operation. One objective of
the simplification function is to remove duplicates of regular
expressions. For this Sulzmann and Lu use in their paper the standard
@{text nub} function from Haskell's list library, but this function
does not achieve the intended objective with bitcoded regular expressions. The
reason is that in the bitcoded setting, each copy generally has a
different bitcode annotation---so @{text nub} would never ``fire''.
Inspired by Scala's library for lists, we shall instead use a @{text
distinctWith} function that finds duplicates under an ``erasing'' function
which deletes bitcodes before comparing regular expressions.
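To convey the idea, a possible @{text distinctWith} can be sketched in
Scala as follows; this is only an illustration with generic types (the
@{text erase} argument stands for a function that deletes bitcodes) and
is not the definition used in our formalisation:

\begin{verbatim}
// keep the first copy of each element; two elements count as equal
// whenever the erasing function maps them to the same result
def distinctWith[A, B](xs: List[A], erase: A => B,
                       seen: Set[B] = Set[B]()): List[A] =
  xs match {
    case Nil => Nil
    case x :: rest =>
      val e = erase(x)
      if (seen.contains(e)) distinctWith(rest, erase, seen)
      else x :: distinctWith(rest, erase, seen + e)
  }
\end{verbatim}
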
We shall also introduce our \emph{own} argument and definitions for
establishing the correctness of the bitcoded algorithm when
simplifications are included. Finally we
establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
%of regular expressions and describe the definition
%of POSIX lexing from our earlier work \cite{AusafDyckhoffUrban2016}. This serves
@{term "SEQ r\<^sub>1 r\<^sub>2"} $\mid$
@{term "STAR r"} $\mid$
@{term "NTIMES r n"}
\end{center}

\noindent where @{const ZERO} stands for the regular expression that
does not match any string, @{const ONE} for the regular expression
that matches only the empty string and @{term c} for matching a
character literal. The constructors $+$ and $\cdot$ represent
alternatives and sequences, respectively. We sometimes omit the
$\cdot$ in a sequence regular expression for brevity. The
\emph{language} of a regular expression, written $L(r)$, is defined
as usual and we omit giving the definition here (see for example
\cite{AusafDyckhoffUrban2016}).
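
For readers who prefer code, these regular expressions can be
represented, for example, by the following Scala datatype (only an
illustrative sketch; the constructor names mirror the grammar above,
including the bounded constructor discussed next):

\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CH(c: Char) extends Rexp              // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative  r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence     r1 . r2
case class STAR(r: Rexp) extends Rexp            // star         r*
case class NTIMES(r: Rexp, n: Int) extends Rexp  // bounded      r{n}
\end{verbatim}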

In our work here we also add to the usual ``basic'' regular
expressions the \emph{bounded} regular expression @{term "NTIMES r
n"} where the @{term n} specifies that @{term r} should match
exactly @{term n}-times. Again for brevity we omit the other bounded
regular expressions @{text "r"}$^{\{..n\}}$, @{text "r"}$^{\{n..\}}$
and @{text "r"}$^{\{n..m\}}$ which specify intervals for how many
times @{text r} should match. The results presented in this paper
extend straightforwardly to them too. The importance of the bounded
regular expressions is that they are often used in practical
applications, such as Snort (a system for detecting network
intrusion) and also in XML Schema definitions. According to Bj\"{o}rklund et
al.~\cite{BjorklundMartensTimm2015}, bounded regular expressions
occur frequently in the latter and can have counters of up to
ten million. The problem is that tools based on the classic notion
of automata need to expand @{text "r"}$^{\{n\}}$ into @{text n}
connected copies of the automaton for @{text r}. This leads to very
inefficient matching algorithms or algorithms that consume large
amounts of memory. A classic example is the regular expression
\mbox{@{term "SEQ (SEQ (STAR (ALT a b)) a) (NTIMES (ALT a b) n)"}}
where the minimal DFA requires at least $2^{n + 1}$ states (see
\cite{CountingSet2020}). Therefore regular expression matching
libraries that rely on the classic notion of DFAs often impose
ad hoc limits for bounded regular expressions: For example, in the
regular expression matching library in the Go language, the regular expression
@{term "NTIMES a 1001"} is not permitted, because no counter can be
above 1000, and in the built-in Rust regular expression library
expressions such as @{text "a\<^bsup>{1000}{100}{5}\<^esup>"} give an error
message for being too big. These problems can of course be solved in matching
algorithms where automata go beyond the classic notion and for
instance include explicit counters (see~\cite{CountingSet2020}).
The point here is that Brzozowski derivatives and the algorithms by
Sulzmann and Lu can be straightforwardly extended to deal with
bounded regular expressions and moreover the resulting code
still consists of only simple recursive functions and inductive
datatypes. Finally, bounded regular expressions
do not destroy our finite boundedness property, which we shall
prove later on.%, because during the lexing process counters will only be
%decremented.


Central to Brzozowski's regular expression matcher are two functions
called @{text nullable} and \emph{derivative}. The latter is written
$r\backslash c$ for the derivative of the regular expression $r$
w.r.t.~the character $c$. Both functions are defined by recursion over
regular expressions.
%
\begin{center}
\begin{tabular}{c @ {\hspace{-1mm}}c}
\begin{tabular}{r@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
@{thm (lhs) der.simps(1)} & $\dn$ & @{thm (rhs) der.simps(1)}\\
@{thm (lhs) der.simps(2)} & $\dn$ & @{thm (rhs) der.simps(2)}\\
@{thm (lhs) der.simps(3)} & $\dn$ & @{thm (rhs) der.simps(3)}\\
@{thm (lhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) der.simps(4)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{text "if"} @{term "nullable(r\<^sub>1)"}\\
& & @{text "then"} @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c r\<^sub>2)"}\\
& & @{text "else"} @{term "SEQ (der c r\<^sub>1) r\<^sub>2"}\\
% & & @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) der.simps(6)} & $\dn$ & @{thm (rhs) der.simps(6)}\\
@{thm (lhs) der.simps(7)} & $\dn$ & @{thm (rhs) der.simps(7)}
\end{tabular}
&
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@{thm (lhs) nullable.simps(1)} & $\dn$ & @{thm (rhs) nullable.simps(1)}\\
@{thm (lhs) nullable.simps(2)} & $\dn$ & @{thm (rhs) nullable.simps(2)}\\
@{thm (lhs) nullable.simps(3)} & $\dn$ & @{thm (rhs) nullable.simps(3)}\\
@{thm (lhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(4)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) nullable.simps(5)[of "r\<^sub>1" "r\<^sub>2"]}\\
@{thm (lhs) nullable.simps(6)} & $\dn$ & @{thm (rhs) nullable.simps(6)}\\
@{thm (lhs) nullable.simps(7)} & $\dn$ & @{text "if"} @{text "n = 0"}\\
& & @{text "then"} @{const "True"}\\
& & @{text "else"} @{term "nullable r"}
\end{tabular}
\end{tabular}
\end{center}
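
\noindent
Both functions carry over almost verbatim into a functional
programming language. The following Scala code is only an
illustrative sketch (it is neither our Isabelle definition nor our
Scala implementation); it reuses the @{text Rexp} datatype from the
earlier sketch, and its clause for the bounded constructor
decrements the counter, just like the bitcoded derivative shown
later:

\begin{verbatim}
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CH(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => if (n == 0) true else nullable(r1)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CH(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// the matcher: successive derivatives followed by a nullability test
def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((r1, c) => der(c, r1))
def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}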

\noindent
function, written @{term "flat DUMMY"}. It traverses a value and
collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
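
A Scala rendering of the values and of the flatten function might
look as follows (again only an illustrative sketch; the constructors
@{text Char} and @{text Seq} are called @{text Chr} and @{text Sequ}
here merely to avoid clashes with Scala built-ins):

\begin{verbatim}
sealed trait Val
case object Empty extends Val                  // value for ONE
case class Chr(c: Char) extends Val            // value for CH(c)
case class Left(v: Val) extends Val            // left alternative
case class Right(v: Val) extends Val           // right alternative
case class Sequ(v1: Val, v2: Val) extends Val  // value for SEQ
case class Stars(vs: List[Val]) extends Val    // value for STAR and NTIMES

// the flatten function |_|: collect the characters in a value
def flat(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Left(v1) => flat(v1)
  case Right(v1) => flat(v1)
  case Sequ(v1, v2) => flat(v1) + flat(v2)
  case Stars(vs) => vs.map(flat).mkString
}
\end{verbatim}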


Sulzmann and Lu also define inductively an
inhabitation relation that associates values to regular expressions. Our
version of this relation is defined by the following seven rules for the values:
%
\begin{center}
\begin{tabular}{@ {}c@ {}}
@{thm[mode=Axiom] Prf.intros(4)}\qquad
@{thm[mode=Axiom] Prf.intros(5)[of "c"]}\qquad
@{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\qquad
@{thm[mode=Rule] Prf.intros(3)[of "v\<^sub>2" "r\<^sub>2" "r\<^sub>1"]}\qquad
@{thm[mode=Rule] Prf.intros(1)[of "v\<^sub>1" "r\<^sub>1" "v\<^sub>2" "r\<^sub>2"]}\medskip\\
@{thm[mode=Rule] Prf.intros(6)[of "vs" "r"]}\qquad
$\mprset{flushleft}\inferrule{
@{thm (prem 1) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\\\\
@{thm (prem 2) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}\quad
@{thm (prem 3) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}
}
{@{thm (concl) Prf.intros(7)[of "vs\<^sub>1" "r" "vs\<^sub>2" "n"]}}
$
\end{tabular}
\end{center}

\noindent Note that no values are associated with the regular expression
@{term ZERO}, since it cannot match any string. Of interest is our version
of the rule for @{term "STAR r"}, where we require that each value
in @{term vs} flattens to a non-empty string. This means that if @{term "STAR r"} ``fires''
one or more times, then each copy needs to match a non-empty string.
Similarly, in the rule
for @{term "NTIMES r n"} we require that the length of the list
@{term "vs\<^sub>1 @ vs\<^sub>2"} equals @{term n} (meaning the regular expression
@{text r} matches @{text n}-times) and that the first segment of this list
contains values that flatten to non-empty strings followed by a segment that
only contains values that flatten to the empty string.
It is routine to establish how values ``inhabiting'' a regular
expression correspond to the language of a regular expression, namely
@{thm L_flat_Prf}.

In general there is more than one value inhabited by a regular
\inferrule
{@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad
@{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\
@{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}
{@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close>\medskip\smallskip\\
\mprset{sep=4mm}
@{thm[mode=Rule] Posix.intros(9)}\<open>Pn[]\<close>\quad
$\mprset{flushleft}
\inferrule
{@{thm (prem 1) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 2) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \qquad
@{thm (prem 3) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]} \\\\
@{thm (prem 4) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}
{@{thm (concl) Posix.intros(8)[of "s\<^sub>1" "r" "v" "s\<^sub>2" n "vs"]}}$\<open>Pn+\<close>
\\[-4mm]
\end{tabular}
\end{center}
\caption{The inductive definition of POSIX values taken from our earlier paper \cite{AusafDyckhoffUrban2016}.
The ternary relation, written $(s, r) \rightarrow v$, formalises, for a given string $s$ and
regular expression $r$, which value $v$ uniquely satisfies the informal POSIX constraints for
regular expression matching.\smallskip}\label{POSIXrules}
\end{figure}
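For example, for the bounded regular expression @{text "a"}$^{\{2\}}$ and
the string @{text "aa"}, the rule \<open>Pn+\<close> applies twice (once for each
copy of @{text a}) and the rule \<open>Pn[]\<close> closes off the empty remainder,
so the POSIX value in this case is @{text "Stars [Char a, Char a]"}.
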
408 |
476 |
409 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define |
477 The clever idea by Sulzmann and Lu \cite{Sulzmann2014} in their first algorithm is to define |
410 an injection function on values that mirrors (but inverts) the |
478 an injection function on values that mirrors (but inverts) the |
411 construction of the derivative on regular expressions. Essentially it |
479 construction of the derivative on regular expressions. Essentially it |
417 \begin{tabular}{@ {}lcl@ {}} |
485 \begin{tabular}{@ {}lcl@ {}} |
418 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
486 @{thm (lhs) mkeps.simps(1)} & $\dn$ & @{thm (rhs) mkeps.simps(1)}\\ |
419 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
487 @{thm (lhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(2)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
420 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
488 @{thm (lhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]} & $\dn$ & @{thm (rhs) mkeps.simps(3)[of "r\<^sub>1" "r\<^sub>2"]}\\ |
421 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
489 @{thm (lhs) mkeps.simps(4)} & $\dn$ & @{thm (rhs) mkeps.simps(4)}\\ |
422 \end{tabular}\smallskip\\ |
490 @{thm (lhs) mkeps.simps(5)} & $\dn$ & @{thm (rhs) mkeps.simps(5)}\\ |
|
491 \end{tabular}\medskip\\ |
423 |
492 |
424 \begin{tabular}{@ {}cc@ {}} |
493 \begin{tabular}{@ {}cc@ {}} |
425 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
494 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
426 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
495 @{thm (lhs) injval.simps(1)} & $\dn$ & @{thm (rhs) injval.simps(1)}\\ |
427 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
496 @{thm (lhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]} & $\dn$ & |
428 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
497 @{thm (rhs) injval.simps(2)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1"]}\\ |
429 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
498 @{thm (lhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$ & |
430 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
499 @{thm (rhs) injval.simps(3)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}\\ |
431 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
500 @{thm (lhs) injval.simps(7)[of "r" "c" "v" "vs"]} & $\dn$ |
432 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]} |
501 & @{thm (rhs) injval.simps(7)[of "r" "c" "v" "vs"]}\\ |
|
502 @{thm (lhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} & $\dn$ |
|
503 & @{thm (rhs) injval.simps(8)[of "r" "n" "c" "v" "vs"]} |
433 \end{tabular} |
504 \end{tabular} |
434 & |
505 & |
435 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
506 \begin{tabular}{@ {\hspace{-3mm}}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}@ {}} |
436 @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
507 @{thm (lhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
437 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
508 & @{thm (rhs) injval.simps(4)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
438 @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
509 @{thm (lhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]} & $\dn$ |
439 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
510 & @{thm (rhs) injval.simps(5)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>1" "v\<^sub>2"]}\\ |
440 @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ |
511 @{thm (lhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]} & $\dn$\\ |
441 \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}} |
512 \multicolumn{3}{@ {}r@ {}}{@{thm (rhs) injval.simps(6)[of "r\<^sub>1" "r\<^sub>2" "c" "v\<^sub>2"]}}\\ \mbox{}\\ \mbox{} |
442 \end{tabular} |
513 \end{tabular} |
443 \end{tabular} |
514 \end{tabular} |
444 \end{tabular}\smallskip |
515 \end{tabular}\smallskip |
445 \end{center} |
516 \end{center} |
446 |
517 |
447 \noindent |
518 \noindent |
448 The function @{text mkeps} is run when the last derivative is nullable, that is |
519 The function @{text mkeps} is run when the last derivative is nullable, that is |
449 the string to be matched is in the language of the regular expression. It generates |
520 the string to be matched is in the language of the regular expression. It generates |
450 a value for how the last derivative can match the empty string. The injection function |
521 a value for how the last derivative can match the empty string. In case |
|
522 of @{term "NTIMES r n"} we use the function @{term replicate} in order to generate |
|
523 a list of exactly @{term n} copies, which is the length of the list we expect in this |
|
524 case. The injection function |
451 then calculates the corresponding value for each intermediate derivative until |
525 then calculates the corresponding value for each intermediate derivative until |
452 a value for the original regular expression is generated. |
526 a value for the original regular expression is generated. |
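
In Scala, @{text mkeps} might be sketched as follows (illustrative
only; it reuses the @{text Rexp}, value and @{text nullable} sketches
from earlier, with @{text "List.fill"} playing the role of @{term replicate}):

\begin{verbatim}
// mkeps is only applied to nullable regular expressions
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  // exactly n copies; for n = 0 this is the empty list
  case NTIMES(r1, n) => Stars(List.fill(n)(mkeps(r1)))
}
\end{verbatim}
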
Graphically the algorithm by
Sulzmann and Lu can be illustrated by the following picture %in Figure~\ref{Sulz}
where the path from the left to the right involving @{term derivatives}/@{const
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$
\hspace{2mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$
\hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\\
$\textit{decode}'\,(\S\!::\!bs)\,(r^{\{n\}})$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\,(r^{\{n\}})$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^{\{n - 1\}}$
\hspace{2mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\medskip\\
\multicolumn{3}{@ {}l}{$\textit{decode}\,bs\,r$ $\dn$
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$
$\;\,\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$}\\[-4mm]
\end{tabular}
\end{center}
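
For illustration, the decoding function can also be sketched in
Scala. The sketch below uses the @{text Rexp} and value datatypes
from the earlier sketches and writes @{text decodeAux} for the primed
decoding function above; the clauses for @{text ONE}, @{text CH} and
the alternatives are not shown in the table above and are filled in
here in the standard way, which should be read as an assumption of
this sketch:

\begin{verbatim}
sealed trait Bit
case object Z extends Bit
case object S extends Bit

// rebuild a value from a bitsequence, returning the value together
// with the bits that have not been consumed yet
def decodeAux(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
  case (_, ONE) => (Empty, bs)
  case (_, CH(d)) => (Chr(d), bs)
  case (Z :: rest, ALT(r1, _)) =>
    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
  case (S :: rest, ALT(_, r2)) =>
    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
  case (_, SEQ(r1, r2)) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (S :: rest, STAR(_)) => (Stars(Nil), rest)
  case (Z :: rest, STAR(r1)) =>
    val (v, bs1) = decodeAux(rest, r1)
    val (Stars(vs), bs2) = decodeAux(bs1, STAR(r1))  // always a Stars value
    (Stars(v :: vs), bs2)
  case (S :: rest, NTIMES(_, _)) => (Stars(Nil), rest)
  case (Z :: rest, NTIMES(r1, n)) =>
    val (v, bs1) = decodeAux(rest, r1)
    val (Stars(vs), bs2) = decodeAux(bs1, NTIMES(r1, n - 1))
    (Stars(v :: vs), bs2)
}

def decode(bs: List[Bit], r: Rexp): Option[Val] = {
  val (v, rest) = decodeAux(bs, r)
  if (rest.isEmpty) Some(v) else None
}
\end{verbatim}
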
$\textit{fuse}\,bs\,(\textit{ALTs}\,bs'\,rs)$ & $\dn$ &
$\textit{ALTs}\,(bs\,@\,bs')\,rs$\\
$\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,r_1\,r_2)$ & $\dn$ &
$\textit{SEQ}\,(bs\,@\,bs')\,r_1\,r_2$\\
$\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,r)$ & $\dn$ &
$\textit{STAR}\,(bs\,@\,bs')\,r$\\
$\textit{fuse}\,bs\,(\textit{NT}\,bs'\,r\,n)$ & $\dn$ &
$\textit{NT}\,(bs\,@\,bs')\,r\,n$
\end{tabular}
\end{tabular}
\end{center}
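
As an illustration, the bitcoded regular expressions and @{text fuse}
can be rendered in Scala as below. The constructor names are prefixed
with @{text A} only to keep them apart from the unannotated sketch
given earlier, the @{text Bit} type is the one from the decoding
sketch, and the clauses of @{text fuse} for @{text ZERO}, @{text ONE}
and @{text CHAR}, which are not part of the table above, are filled
in in the obvious way (an assumption of this sketch):

\begin{verbatim}
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
case class ANT(bs: List[Bit], r: ARexp, n: Int) extends ARexp  // bounded r{n}

// fuse attaches a bitsequence to the top-most annotation
def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, rs) => AALTS(bs ++ bs1, rs)
  case ASEQ(bs1, r1, r2) => ASEQ(bs ++ bs1, r1, r2)
  case ASTAR(bs1, r1) => ASTAR(bs ++ bs1, r1)
  case ANT(bs1, r1, n) => ANT(bs ++ bs1, r1, n)
}
\end{verbatim}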

\noindent
the functions \textit{bnullable} and \textit{bmkeps}(\textit{s}), which are the
``lifted'' versions of \textit{nullable} and \textit{mkeps} acting on
bitcoded regular expressions.
%
\begin{center}
\begin{tabular}{@ {\hspace{-1mm}}c@ {\hspace{0mm}}c@ {}}
\begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l}
$\textit{bnullable}\,(\textit{ZERO})$ & $\dn$ & $\textit{False}$\\
$\textit{bnullable}\,(\textit{ONE}\,bs)$ & $\dn$ & $\textit{True}$\\
$\textit{bnullable}\,(\textit{CHAR}\,bs\,c)$ & $\dn$ & $\textit{False}$\\
$\textit{bnullable}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
$\exists\, r \in \rs. \,\textit{bnullable}\,r$\\
$\textit{bnullable}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &
$\textit{bnullable}\,r_1\wedge \textit{bnullable}\,r_2$\\
$\textit{bnullable}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
$\textit{True}$\\
$\textit{bnullable}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
\multicolumn{3}{l}{$\textit{if}\;n = 0\;\textit{then}\;\textit{True}\;
\textit{else}\;\textit{bnullable}\,r$}\\
\end{tabular}
&
\begin{tabular}{@ {}l@ {\hspace{0.5mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(\textit{ALTs}\,bs\,\rs)$ & $\dn$ &
$bs\,@\,\textit{bmkepss}\,\rs$\\
$\textit{bmkeps}\,(\textit{SEQ}\,bs\,r_1\,r_2)$ & $\dn$ &\\
\multicolumn{3}{l}{$bs \,@\,\textit{bmkeps}\,r_1\,@\, \textit{bmkeps}\,r_2$}\\
$\textit{bmkeps}\,(\textit{STAR}\,bs\,r)$ & $\dn$ &
$bs \,@\, [\S]$\\
$\textit{bmkeps}\,(\textit{NT}\,bs\,r\,n)$ & $\dn$ &\\
\multicolumn{3}{l@ {}}{$\textit{if}\;n=0\;\textit{then}\;bs \,@\, [\S]$}\\
\multicolumn{3}{l@ {}}{$\textit{else}\,bs \,@\, [\Z] \,@\, \textit{bmkeps}\,r\,@\,
\textit{bmkeps}\,(@{term "ANTIMES [] r (n - 1)"})$}\\
$\textit{bmkepss}\,(r\!::\!\rs)$ & $\dn$ &\\
\multicolumn{3}{l}{$\textit{if}\;\textit{bnullable}\,r\,\textit{then}\;\textit{bmkeps}\,r\;
\textit{else}\;\textit{bmkepss}\,\rs$}
\end{tabular}
\end{tabular}
\end{center}
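
Transcribed into Scala on the bitcoded datatype from the sketch
above, these functions read as follows (again only an illustration;
the @{text Nil} clause of @{text bmkepss} is an assumption added to
make the sketch total):

\begin{verbatim}
def bnullable(r: ARexp): Boolean = r match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, rs) => rs.exists(bnullable)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _) => true
  case ANT(_, r1, n) => if (n == 0) true else bnullable(r1)
}

// bmkeps is only applied to bnullable regular expressions
def bmkeps(r: ARexp): List[Bit] = r match {
  case AONE(bs) => bs
  case AALTS(bs, rs) => bs ++ bmkepss(rs)
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _) => bs ++ List(S)
  case ANT(bs, r1, n) =>
    if (n == 0) bs ++ List(S)
    else bs ++ List(Z) ++ bmkeps(r1) ++ bmkeps(ANT(Nil, r1, n - 1))
}

def bmkepss(rs: List[ARexp]): List[Bit] = rs match {
  case Nil => Nil  // not reached when the enclosing ALTs is bnullable
  case r :: rest => if (bnullable(r)) bmkeps(r) else bmkepss(rest)
}
\end{verbatim}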

$\textit{if}\;\textit{bnullable}\,r_1$\\
& &$\textit{then}\;\textit{ALT}\,bs\;(\textit{SEQ}\,[]\,(r_1\backslash c)\,r_2)$
$\;(\textit{fuse}\,(\textit{bmkeps}\,r_1)\,(r_2\backslash c))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(r_1\backslash c)\,r_2$\\
$(\textit{STAR}\,bs\,r)\backslash c$ & $\dn$ &
$\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
(\textit{STAR}\,[]\,r)$\\
$(\textit{NT}\,bs\,r\,n)\backslash c$ & $\dn$ &
$\textit{if}\;n = 0\; \textit{then}\;\textit{ZERO}\;\textit{else}\;
\textit{SEQ}\;(bs\,@\,[\Z])\,(r\backslash c)\,
(\textit{NT}\,[]\,r\,(n - 1))$
\end{tabular}
\end{center}
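
A Scala rendering of the bitcoded derivative is sketched below
(illustrative only; the clauses for @{text ZERO}, @{text ONE},
@{text CHAR} and @{text ALTs} are not part of the excerpt above and
are the standard ones, so they should be read as assumptions of this
sketch; the binary alternative is represented with the list-based
@{text ALTs} constructor):

\begin{verbatim}
def bder(c: Char, r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, rs) => AALTS(bs, rs.map(bder(c, _)))
  case ASEQ(bs, r1, r2) =>
    if (bnullable(r1))
      AALTS(bs, List(ASEQ(Nil, bder(c, r1), r2),
                     fuse(bmkeps(r1), bder(c, r2))))
    else ASEQ(bs, bder(c, r1), r2)
  case ASTAR(bs, r1) =>
    // a Z bit records the unfolded copy
    ASEQ(bs ++ List(Z), bder(c, r1), ASTAR(Nil, r1))
  case ANT(bs, r1, n) =>
    // the counter is decremented; n = 0 cannot make any further progress
    if (n == 0) AZERO
    else ASEQ(bs ++ List(Z), bder(c, r1), ANT(Nil, r1, n - 1))
}
\end{verbatim}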


\noindent
obscure, examples.
%We found that from an implementation
%point-of-view it is really important to have the formal proofs of
%the corresponding properties at hand.

We can of course only make a claim about the correctness and the sizes of the
derivatives, not about the efficiency or runtime of our version of
Sulzmann and Lu's algorithm. But we found the size is an important
first indicator about efficiency: clearly if the derivatives can
grow to arbitrarily big sizes and the algorithm needs to traverse
the derivatives possibly several times, then the algorithm will be
slow---excruciatingly slow, that is. Other works seem to make
stronger claims, but during our work we have developed a healthy
suspicion when, for example, experimental data is used to back up
efficiency claims. For example, Sulzmann and Lu write about their
equivalent of @{term blexer_simp} \textit{``...we can incrementally
compute bitcoded parse trees in linear time in the size of the
input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the
derivatives in some cases even after aggressive simplification,
this is a hard-to-believe claim. A similar claim about a
theoretical runtime of @{text "O(n\<^sup>2)"} is made for the
Verbatim lexer, which calculates tokens according to POSIX
rules~\cite{verbatim}. For this, Verbatim uses Brzozowski's
derivatives like in our work. The authors write: \textit{``The
results of our empirical tests [..] confirm that Verbatim has
@{text "O(n\<^sup>2)"} time complexity.''}
\cite[Section~VII]{verbatim}. While their correctness proof for
Verbatim is formalised in Coq, the claim about the runtime
complexity is only supported by some empirical evidence obtained by
using the code extraction facilities of Coq. Given our observation
with the ``growth problem'' of derivatives, we tried out their
extracted OCaml code with the example \mbox{@{text "(a +
aa)\<^sup>*"}} as a single lexing rule, and it took us around 5
minutes to tokenise a string of 40 $a$'s and that increased to
approximately 19 minutes when the string is 50 $a$'s long. Taking
into account that derivatives are not simplified in the Verbatim
lexer, such numbers are not surprising. Clearly our result of
having finite derivatives might sound rather weak in this context
but we think such efficiency claims really require further
scrutiny.

The contribution of this paper is to make sure derivatives do not
grow arbitrarily big (universally). In the example \mbox{@{text "(a
+ aa)\<^sup>*"}}, \emph{all} derivatives have a size of 17 or
less. The result is that lexing a string of, say, 50\,000 a's with
the regular expression \mbox{@{text "(a + aa)\<^sup>*"}} takes
approximately 10 seconds with our Scala implementation of the
presented algorithm.

Finally, let us come back to the point about bounded regular
expressions. We have in this paper only shown that @{term "NTIMES r
n"} can be included, but all our results extend straightforwardly
also to the other bounded regular expressions. We find bounded
regular expressions fit naturally into the setting of Brzozowski
derivatives and the bitcoded regular expressions by Sulzmann and Lu.
In contrast, bounded regular expressions are often the Achilles'
heel in regular expression matchers that use the traditional
automata-based approach to lexing, primarily because they need to expand the
counters of bounded regular expressions into $n$ connected copies
of an automaton. This is not needed in Sulzmann and Lu's algorithm.
To see the difference, consider for example the regular expression @{term "SEQ (NTIMES a
1001) (STAR a)"}, which is not permitted in the Go language because
the counter is too big. In contrast we have no problem with
matching this regular expression with, say, 50\,000 a's, because the
counters can be kept compact. In fact, the overall size of the
derivatives is never greater than 5 in this example. Even in the
example from Section 2, where Rust raises an error message, namely
\mbox{@{text "a\<^bsup>{1000}{100}{5}\<^esup>"}}, the maximum size for
our derivatives is a moderate 14.

Let us also compare our work to the verified Verbatim++ lexer where
the authors of the Verbatim lexer introduced a number of
improvements and optimisations, for example memoisation
\cite{verbatimpp}. However, unlike Verbatim, which works with
derivatives like in our work, Verbatim++ first compiles a regular
expression into a DFA. While this makes lexing fast in many cases,
with examples of bounded regular expressions like
\mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}}
one needs to represent them as
sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
their extracted code with such a regular expression as a
single lexing rule and a string of 50\,000 a's---lexing in this case
takes approximately 5~minutes. We are not aware of any better
translation using the traditional notion of DFAs. Therefore we
prefer to stick with calculating derivatives, but attempt to make
this calculation (in the future) as fast as possible. What we can guarantee
with the presented work is that the maximum size of the derivatives
for this example is not bigger than 9. This means our Scala
implementation only needs a few seconds for this example.
%
%
%Possible ideas are
%zippers which have been used in the context of parsing of
%context-free grammars \cite{zipperparser}.
\medskip

\noindent
Our Isabelle code including the results from Sec.~5 is available from \url{https://github.com/urbanchr/posix}.
%\\[-10mm]


%%\bibliographystyle{plain}
\bibliography{root}

\newpage
\appendix

\section{Some Proofs}

While we have formalised \emph{all} results in Isabelle/HOL, below we shall give some
rough details of our reasoning in ``English'' if this is helpful for reviewing.

\begin{proof}[Proof of Lemma~\ref{codedecode}]
This follows from the property that
$\textit{decode}'\,((\textit{code}\,v) \,@\, bs)\,r = (v, bs)$ holds
for any bit-sequence $bs$ and $\vdash v : r$. This property can be