%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.

In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means,
followed by a lexing algorithm by Sulzmann and Lu\parencite{Sulzmann2014}
that produces output conforming
to the $\POSIX$ standard.
\footnote{In what follows
we choose to use the Isabelle-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}

\section{Basic Concepts}
Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
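For example, the string $abc$ stands for the list $a\!::\!b\!::\!c\!::\![\,]$.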
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string
from all strings within a set $A$,
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$
by the singleton language $L = \{s\}$
in formal language theory.
However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
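To make these operations concrete, the following small Scala sketch
(an illustration with assumed names, not the formalised definitions)
implements $\textit{Der}$ and $\textit{Ders}$ directly, representing
languages as sets of character lists:
\begin{verbatim}
// a sketch: languages as sets of character lists
type Lang = Set[List[Char]]

// Der c A = { s | c::s is in A }
def Der(c: Char, A: Lang): Lang =
  for (s <- A if s.startsWith(List(c))) yield s.tail

// Ders s A = { s' | s @ s' is in A }
def Ders(s: List[Char], A: Lang): Lang =
  for (w <- A if w.startsWith(s)) yield w.drop(s.length)

// e.g. Der('a', Set("abc".toList, "bcd".toList))
//      evaluates to Set("bc".toList)
\end{verbatim}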

The reason for defining derivatives
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The language derivative of $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c \;A) @ (A*)$\\
\end{lemma}
\begin{proof}
There are two inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set
\[ \{s \mid c :: s \in A*\} \]
is included in the set
and the rest of the string will
still be in $A*$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Also the following holds:
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ can be rewritten
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
which of course contains $(\Der \; c \; A) \,@\, A*$.
\end{itemize}
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ (L \; r_1) @ (L \; r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
Now with language derivatives and the language interpretation of
regular expressions in place, we are ready to define derivatives on regular expressions.
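For the code sketches in the rest of this chapter we assume the
following Scala representation of regular expressions; it simply
mirrors the six constructors used above (the constructor names are
ours, not the formalisation's):
\begin{verbatim}
// a sketch of the regular expression datatype
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only []
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}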
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
Recall that the language derivative acts on a
language (set of strings).
One can decide whether a string $s$ belongs
to a language $S$ by taking the derivative of $S$ with respect to
that string and then checking whether the empty
string is in the derivative:
\end{property}
\noindent
Now we give the recursive definition of the derivative on
regular expressions, so that it satisfies the properties above.
The derivative function, written $r\backslash c$,
takes a regular expression $r$ and a character $c$, and
returns a new regular expression whose language is
the language derivative of $L \; r$ with respect to $c$:
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
contains the empty string, then the second component of the sequence
needs to be considered, as its derivative will contribute to the
result of this derivative:
\begin{center}
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
$\textit{if}\;\,([] \in L(r_1))\;
\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}
\noindent
Notice how this closely resembles
\Der \; c\; B$\\
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
\end{tabular}
\end{center}
\noindent
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that
we can further unfold it as many times as needed:
\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
Again,
the structure is the same as the language derivative of the Kleene star:
\[
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c \;A) @ (A*)
\]
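As a small worked example using the sequence and star clauses just given,
the derivative of $(a\cdot b)^*$ with respect to $a$ is
\[
((a\cdot b)^*) \backslash a \;=\; ((a\cdot b)\backslash a)\cdot (a\cdot b)^*
\;=\; (\ONE \cdot b)\cdot (a\cdot b)^*,
\]
where only the $\textit{else}$ branch of the sequence clause applies
because $[] \notin L(a)$.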
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matching algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

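A compact Scala sketch of this matcher (an illustration, not the
Isabelle/HOL definitions) is given below; it reuses the regular
expression datatype sketched earlier and fills in the character and
alternative clauses of $\nullable$ and $\backslash$ in the standard way:
\begin{verbatim}
// nullable r  tests whether [] is in L r
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the derivative r \ c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// the string derivative r \_s s
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// match s r = nullable (r \ s)
def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// e.g. matcher("ab", STAR(SEQ(CHAR('a'), CHAR('b')))) == true
\end{verbatim}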
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
\;\textrm{true}/\textrm{false}
\end{tikzcd}
\end{equation}

\noindent
It can be
relatively easily shown that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
By induction on $s$, using the property of derivatives
stated in lemma \ref{derDer}.
\end{proof}
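\noindent
To see the matcher at work on a small example, consider matching the
string $ab$ against $(a\cdot b)^*$. The two derivative steps yield
\[
((a\cdot b)^*) \backslash a = (\ONE\cdot b)\cdot (a\cdot b)^*
\qquad\textrm{and}\qquad
((\ONE\cdot b)\cdot (a\cdot b)^*) \backslash b = (\ZERO\cdot b + \ONE)\cdot (a\cdot b)^*,
\]
and the final regular expression is nullable, so the matcher returns
$\textit{true}$. Note how even this tiny example accumulates ``junk''
such as $\ZERO\cdot b$, which already hints at the performance problem
discussed next.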
\noindent
\begin{center}
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time in secs},
ymode = log,
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$
using Brzozowski's original algorithm}\label{NaiveMatcher}
\end{figure}
\end{center}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
Figure \ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozen characters
are already enough to make the matcher blow up on the regular expression
$(a^*)^*b$.
To improve this situation, we need to introduce simplification
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r \quad\quad (\textit{otherwise})$
\end{tabular}
\end{center}
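Below is a Scala sketch of this simplification together with
$\textit{ders}\_\textit{simp}$, which interleaves simplification with
the derivative steps (it builds on the datatype and $\textit{der}$ from
the earlier sketches; the sequence rules and the $r + r \rightarrow r$
rule are filled in in the obvious, language-preserving way):
\begin{verbatim}
// a sketch of the simple-minded simplification
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO                 // 0 . r  ~>  0
    case (_, ZERO) => ZERO                 // r . 0  ~>  0
    case (ONE, r2s) => r2s                 // 1 . r  ~>  r
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s                // 0 + r  ~>  r
    case (r1s, ZERO) => r1s                // r + 0  ~>  r
    case (r1s, r2s) =>
      if (r1s == r2s) r1s else ALT(r1s, r2s)   // r + r  ~>  r
  }
  case _ => r                              // otherwise unchanged
}

// derivative w.r.t. a string, simplifying after each step
def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def matcher_simp(s: String, r: Rexp): Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}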
If we repeatedly apply this simplification
function during the matching algorithm,
against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example as in Figure \ref{NaiveMatcher}
is now ``tame'' in terms of the length of inputs,
as shown in Figure \ref{BetterMatcher}.

So far the story is: use Brzozowski derivatives,
simplify where possible, and at the end test
whether the empty string is recognised by the final derivative.
But what if we want to
do lexing instead of just getting a true/false answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
As we did earlier in this chapter on the matcher, one can
introduce simplification on the regular expression.
However, now one needs to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.
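To illustrate the general idea only (a rough, hypothetical sketch with
invented names, not the definitions used in this thesis), such a
simplification can return, besides the simplified regular expression, a
rectification function that maps values for the simplified regular
expression back to values for the original one; for alternatives this
could look as follows:
\begin{verbatim}
// hypothetical value constructors for alternatives
sealed trait Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val

// simplification returning the simplified regex together with a
// rectification function on values
def simp_rect(r: Rexp): (Rexp, Val => Val) = r match {
  case ALT(r1, r2) =>
    val (r1s, f1) = simp_rect(r1)
    val (r2s, f2) = simp_rect(r2)
    (r1s, r2s) match {
      case (ZERO, _) => (r2s, v => Right(f2(v)))  // value stems from r2
      case (_, ZERO) => (r1s, v => Left(f1(v)))   // value stems from r1
      case _ => (ALT(r1s, r2s), {
        case Left(v) => Left(f1(v))
        case Right(v) => Right(f2(v))
      })
    }
  case _ => (r, v => v)    // remaining cases omitted in this sketch
}
\end{verbatim}
\noindent
The rectification functions recorded in this way can then be applied in
each run of the injection phase, as described above.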
With extra care