11 In this chapter, we define the basic notions |
11 In this chapter, we define the basic notions |
12 for regular languages and regular expressions. |
12 for regular languages and regular expressions. |
13 This is essentially a description in ``English'' |
13 This is essentially a description in ``English'' |
14 the functions and datatypes of our formalisation in Isabelle/HOL. |
14 the functions and datatypes of our formalisation in Isabelle/HOL. |
15 We also define what $\POSIX$ lexing means, |
15 We also define what $\POSIX$ lexing means, |
16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
16 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
17 that produces the output conforming |
17 that produces the output conforming |
18 to the $\POSIX$ standard\footnote{In what follows |
18 to the $\POSIX$ standard\footnote{In what follows |
19 we choose to use the Isabelle-style notation |
19 we choose to use the Isabelle-style notation |
20 for function applications, where |
20 for function applications, where |
21 the parameters of a function are not enclosed |
21 the parameters of a function are not enclosed |
107 The reason for defining derivatives |
107 The reason for defining derivatives |
108 is that they provide another approach |
108 is that they provide another approach |
109 to test membership of a string in |
109 to test membership of a string in |
110 a set of strings. |
110 a set of strings. |
111 For example, to test whether the string |
111 For example, to test whether the string |
112 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with |
112 $bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with |
113 respect to the string $bar$: |
113 respect to the string $bar$: |
114 \begin{center} |
114 \begin{center} |
115 \begin{tabular}{lll} |
115 \begin{tabular}{lll} |
116 $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & |
116 $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & |
117 $\{ar, rak\}$ \\ |
117 $\{ar, rak\}$ \\ |
120 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
120 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
121 \end{tabular} |
121 \end{tabular} |
122 \end{center} |
122 \end{center} |
123 \noindent |
123 \noindent |
124 and in the end, test whether the set |
124 and in the end, test whether the set |
125 has the empty string.\footnote{We use the infix notation $A\backslash c$ |
125 contains the empty string.\footnote{We use the infix notation $A\backslash c$ |
126 instead of $\Der \; c \; A$ for brevity, as it is clear we are operating |
126 instead of $\Der \; c \; A$ for brevity, as it will always be |
|
127 clear from the context that we are operating |
127 on languages rather than regular expressions.} |
128 on languages rather than regular expressions.} |
128 |
129 |
129 In general, if we have a language $S$, |
130 In general, if we have a language $S$, |
130 then we can test whether $s$ is in $S$ |
131 then we can test whether $s$ is in $S$ |
131 by testing whether $[] \in S \backslash s$. |
132 by testing whether $[] \in S \backslash s$. |
202 We use $\ZERO$ for the regular expression that |
203 We use $\ZERO$ for the regular expression that |
203 matches no string, and $\ONE$ for the regular |
204 matches no string, and $\ONE$ for the regular |
204 expression that matches only the empty string.\footnote{ |
205 expression that matches only the empty string.\footnote{ |
205 Some authors |
206 Some authors |
206 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$ |
207 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$ |
207 but we prefer our notation.} |
208 but we prefer this notation.} |
208 The sequence regular expression is written $r_1\cdot r_2$ |
209 The sequence regular expression is written $r_1\cdot r_2$ |
209 and sometimes we omit the dot if it is clear which |
210 and sometimes we omit the dot if it is clear which |
210 regular expression is meant; the alternative |
211 regular expression is meant; the alternative |
211 is written $r_1 + r_2$. |
212 is written $r_1 + r_2$. |
212 The \emph{language} or meaning of |
213 The \emph{language} or meaning of |
225 \end{center} |
226 \end{center} |
226 \noindent |
227 \noindent |
227 %Now with language derivatives of a language and regular expressions and |
228 %Now with language derivatives of a language and regular expressions and |
228 %their language interpretations in place, we are ready to define derivatives on regular expressions. |
229 %their language interpretations in place, we are ready to define derivatives on regular expressions. |
229 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions. |
230 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions. |
230 We do so by first introducing what properties it should satisfy. |
231 We do so by first introducing what properties they should satisfy. |
231 |
232 |
232 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
233 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
233 %Recall, the language derivative acts on a set of strings |
234 %Recall, the language derivative acts on a set of strings |
234 %and essentially chops off a particular character from |
235 %and essentially chops off a particular character from |
235 %all strings in that set, Brzozowski defined a derivative operation on regular expressions |
236 %all strings in that set, Brzozowski defined a derivative operation on regular expressions |
289 Brzozowski noticed that $\Der$ |
290 Brzozowski noticed that $\Der$ |
290 can be ``mirrored'' on regular expressions which |
291 can be ``mirrored'' on regular expressions which |
291 he calls the derivative of a regular expression $r$ |
292 he calls the derivative of a regular expression $r$ |
292 with respect to a character $c$, written |
293 with respect to a character $c$, written |
293 $r \backslash c$. This infix operator |
294 $r \backslash c$. This infix operator |
294 takes an original regular expression $r$ as input |
295 takes regular expression $r$ as input |
295 and a character as a right operand and |
296 and a character as a right operand. |
296 outputs a result, which is a new regular expression. |
|
297 The derivative operation on regular expression |
297 The derivative operation on regular expression |
298 is defined such that the language of the derivative result |
298 is defined such that the language of the derivative result |
299 coincides with the language of the original |
299 coincides with the language of the original |
300 regular expression being taken |
300 regular expression being taken |
301 derivative with respect to the same character: |
301 derivative with respect to the same characters, namely |
302 \begin{property} |
302 \begin{property} |
303 |
303 |
304 \[ |
304 \[ |
305 L \; (r \backslash c) = \Der \; c \; (L \; r) |
305 L \; (r \backslash c) = \Der \; c \; (L \; r) |
306 \] |
306 \] |
333 \{\ldots,\;s_1,\;\ldots\}$} |
333 \{\ldots,\;s_1,\;\ldots\}$} |
334 }\\ |
334 }\\ |
335 \vspace{ 3mm } |
335 \vspace{ 3mm } |
336 |
336 |
337 The derivatives on regular expression can again be |
337 The derivatives on regular expression can again be |
338 generalised to a string. |
338 generalised to strings. |
339 One could compute $r_{start} \backslash s$ and test membership of $s$ |
339 One could compute $r_{start} \backslash s$ and test membership of $s$ |
340 in $L \; r_{start}$ by checking |
340 in $L \; r_{start}$ by checking |
341 whether the empty string is in the language of |
341 whether the empty string is in the language of |
342 $r_{end}$ ($r_{start}\backslash s$).\\ |
342 $r_{end}$ (that is $r_{start}\backslash s$).\\ |
343 |
343 |
344 \vspace{2mm} |
344 \vspace{2mm} |
345 \Longstack{ |
345 \Longstack{ |
346 \notate{$r_{start}$}{4}{ |
346 \notate{$r_{start}$}{4}{ |
347 \Longstack{$L \; r_{start} = \{\ldots, \;$ |
347 \Longstack{$L \; r_{start} = \{\ldots, \;$ |
376 \notate{$r_{end}$}{1}{ |
376 \notate{$r_{end}$}{1}{ |
377 $L \; r_{end} = \{\ldots, \; [], \ldots\}$} |
377 $L \; r_{end} = \{\ldots, \; [], \ldots\}$} |
378 } |
378 } |
379 |
379 |
380 |
380 |
381 |
381 We have the property that |
382 \begin{property} |
382 \begin{property} |
383 $s \in L \; r_{start} \iff [] \in L \; r_{end}$ |
383 $s \in L \; r_{start} \iff [] \in L \; r_{end}$ |
384 \end{property} |
384 \end{property} |
385 \noindent |
385 \noindent |
386 Next, we give the recursive definition of derivative on |
386 Next, we give the recursive definition of derivative on |
387 regular expressions so that it satisfies the properties above. |
387 regular expressions so that it satisfies the properties above. |
388 The derivative function, written $r\backslash c$, |
388 %The derivative function, written $r\backslash c$, |
389 takes a regular expression $r$ and character $c$, and |
389 %takes a regular expression $r$ and character $c$, and |
390 returns a new regular expression representing |
390 %returns a new regular expression representing |
391 the original regular expression's language $L \; r$ |
391 %the original regular expression's language $L \; r$ |
392 being taken the language derivative with respect to $c$. |
392 %being taken the language derivative with respect to $c$. |
393 \begin{table} |
393 \begin{table} |
394 \begin{center} |
394 \begin{center} |
395 \begin{tabular}{lcl} |
395 \begin{tabular}{lcl} |
396 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
396 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
397 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
397 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
416 result of this derivative: |
416 result of this derivative: |
417 \begin{center} |
417 \begin{center} |
418 \begin{tabular}{lcl} |
418 \begin{tabular}{lcl} |
419 $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & |
419 $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & |
420 $\textit{if}\;\,([] \in L(r_1))\; |
420 $\textit{if}\;\,([] \in L(r_1))\; |
421 \textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ |
421 \textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\ |
422 & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
422 & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
423 \end{tabular} |
423 \end{tabular} |
424 \end{center} |
424 \end{center} |
425 \noindent |
425 \noindent |
426 Notice how this closely resembles |
426 Notice how this closely resembles |
481 would require both children to have the empty string |
481 would require both children to have the empty string |
482 to compose an empty string, and the Kleene star |
482 to compose an empty string, and the Kleene star |
483 is always nullable because it naturally |
483 is always nullable because it naturally |
484 contains the empty string. |
484 contains the empty string. |
485 \noindent |
485 \noindent |
486 We have the following correspondence between |
486 We have the following two correspondences between |
487 derivatives on regular expressions and |
487 derivatives on regular expressions and |
488 derivatives on a set of strings: |
488 derivatives on a set of strings: |
489 \begin{lemma}\label{derDer} |
489 \begin{lemma}\label{derDer} |
490 \mbox{} |
490 \mbox{} |
491 \begin{itemize} |
491 \begin{itemize} |
568 If we implement the above algorithm naively, however, |
568 If we implement the above algorithm naively, however, |
569 the algorithm can be excruciatingly slow, as shown in |
569 the algorithm can be excruciatingly slow, as shown in |
570 \ref{NaiveMatcher}. |
570 \ref{NaiveMatcher}. |
571 Note that both axes are in logarithmic scale. |
571 Note that both axes are in logarithmic scale. |
572 Around two dozen characters |
572 Around two dozen characters |
573 would already ``explode'' the matcher with the regular expression |
573 this algorithm already ``explodes'' with the regular expression |
574 $(a^*)^*b$. |
574 $(a^*)^*b$. |
575 To improve this situation, we need to introduce simplification |
575 To improve this situation, we need to introduce simplification |
576 rules for the intermediate results, |
576 rules for the intermediate results, |
577 such as $r + r \rightarrow r$, |
577 such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$, |
578 and make sure those rules do not change the |
578 and make sure those rules do not change the |
579 language of the regular expression. |
579 language of the regular expression. |
580 One simpled-minded simplification function |
580 One simple-minded simplification function |
581 that achieves these requirements |
581 that achieves these requirements |
582 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}): |
582 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}): |
583 \begin{center} |
583 \begin{center} |
584 \begin{tabular}{lcl} |
584 \begin{tabular}{lcl} |
585 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
585 $\simp \; r_1 \cdot r_2 $ & $ \dn$ & |
731 and the values both flatten to $abc$. |
731 and the values both flatten to $abc$. |
732 Lexers therefore have to disambiguate and choose only |
732 Lexers therefore have to disambiguate and choose only |
733 one of the values to be generated. $\POSIX$ is one of the |
733 one of the values to be generated. $\POSIX$ is one of the |
734 disambiguation strategies that is widely adopted. |
734 disambiguation strategies that is widely adopted. |
735 |
735 |
736 Ausaf et al.\parencite{AusafDyckhoffUrban2016} |
736 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
737 formalised the property |
737 formalised the property |
738 as a ternary relation. |
738 as a ternary relation. |
739 The $\POSIX$ value $v$ for a regular expression |
739 The $\POSIX$ value $v$ for a regular expression |
740 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
740 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
741 in the following rules\footnote{The names of the rules are used |
741 in the following rules\footnote{The names of the rules are used |
841 where identifiers are defined as usual (letters |
841 where identifiers are defined as usual (letters |
842 followed by letters, numbers or underscores), |
842 followed by letters, numbers or underscores), |
843 then a match with a keyword (if) |
843 then a match with a keyword (if) |
844 followed by |
844 followed by |
845 an identifier (foo) would be incorrect. |
845 an identifier (foo) would be incorrect. |
846 POSIX lexing would generate what we want. |
846 POSIX lexing generates what is included by lexing. |
847 |
847 |
848 \noindent |
848 \noindent |
849 We know that a POSIX |
849 We know that a POSIX |
850 value for regular expression $r$ is inhabited by $r$. |
850 value for regular expression $r$ is inhabited by $r$. |
851 \begin{lemma} |
851 \begin{lemma} |
913 derivative-based matching |
913 derivative-based matching |
914 to a lexing algorithm by a second phase |
914 to a lexing algorithm by a second phase |
915 after the initial phase of successive derivatives. |
915 after the initial phase of successive derivatives. |
916 This second phase generates a POSIX value |
916 This second phase generates a POSIX value |
917 if the regular expression matches the string. |
917 if the regular expression matches the string. |
918 Two functions are involved: $\inj$ and $\mkeps$. |
918 The algorithm uses two functions called $\inj$ and $\mkeps$. |
919 The function $\mkeps$ constructs a POSIX value from the last |
919 The function $\mkeps$ constructs a POSIX value from the last |
920 derivative $r_n$: |
920 derivative $r_n$: |
921 \begin{ceqn} |
921 \begin{ceqn} |
922 \begin{equation}\label{graph:mkeps} |
922 \begin{equation}\label{graph:mkeps} |
923 \begin{tikzcd} |
923 \begin{tikzcd} |
956 we give the $\Stars$ constructor an empty list, meaning |
956 we give the $\Stars$ constructor an empty list, meaning |
957 no iteration is taken. |
957 no iteration is taken. |
958 The result of $\mkeps$ on a $\nullable$ $r$ |
958 The result of $\mkeps$ on a $\nullable$ $r$ |
959 is a POSIX value for $r$ and the empty string: |
959 is a POSIX value for $r$ and the empty string: |
960 \begin{lemma}\label{mePosix} |
960 \begin{lemma}\label{mePosix} |
961 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ |
961 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$ |
962 \end{lemma} |
962 \end{lemma} |
963 \begin{proof} |
963 \begin{proof} |
964 By induction on $r$. |
964 By induction on $r$. |
965 \end{proof} |
965 \end{proof} |
966 \noindent |
966 \noindent |
1320 \Stars \; [ |
1320 \Stars \; [ |
1321 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
1321 \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), |
1322 \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) |
1322 \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) |
1323 ] |
1323 ] |
1324 \] |
1324 \] |
1325 And $\derssimp \; aa \; (a^*a^*)^*$ would be |
1325 And $\derssimp \; aa \; (a^*a^*)^*$ is |
1326 \[ |
1326 \[ |
1327 ((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + |
1327 ((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + |
1328 (a^*a^* + a^*)\cdot(a^*a^*)^*. |
1328 (a^*a^* + a^*)\cdot(a^*a^*)^*. |
1329 \] |
1329 \] |
1330 which removes two out of the seven terms corresponding to the |
1330 which removes two out of the seven terms corresponding to the |
1350 [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]. |
1350 [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]. |
1351 \] |
1351 \] |
1352 At any moment, the subterms in a regular expression |
1352 At any moment, the subterms in a regular expression |
1353 that will potentially result in a POSIX value is only |
1353 that will potentially result in a POSIX value is only |
1354 a minority among the many other terms, |
1354 a minority among the many other terms, |
1355 and one can remove ones that are not possible to |
1355 and one can remove the ones that are not possible to |
1356 be POSIX. |
1356 be POSIX. |
1357 In the above example, |
1357 In the above example, |
1358 \begin{equation}\label{eqn:growth2} |
1358 \begin{equation}\label{eqn:growth2} |
1359 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1359 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1360 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1360 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1377 \end{center} |
1377 \end{center} |
1378 Other terms with an underlying value, such as |
1378 Other terms with an underlying value, such as |
1379 \[ |
1379 \[ |
1380 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1380 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1381 \] |
1381 \] |
1382 is too hopeless to contribute a POSIX lexical value, |
1382 do not to contribute a POSIX lexical value, |
1383 and is therefore thrown away. |
1383 and therefore can be thrown away. |
1384 |
1384 |
1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1386 have come up with some simplification steps, however those steps |
1386 have come up with some simplification steps, however those steps |
1387 are not yet sufficiently strong, so they achieve the above effects. |
1387 are not yet sufficiently strong, to achieve the above effects. |
1388 And even with these relatively mild simplifications, the proof |
1388 And even with these relatively mild simplifications, the proof |
1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}. |
1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}. |
1390 One would prove something like this: |
1390 One would need to prove something like this: |
1391 \[ |
1391 \[ |
1392 \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; |
1392 \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; |
1393 \textit{then}\;\; (r, c::s) \rightarrow |
1393 \textit{then}\;\; (r, c::s) \rightarrow |
1394 \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v). |
1394 \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v). |
1395 \] |
1395 \] |
1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ |
1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ |
1397 not only has to return a simplified regular expression, |
1397 not only has to return a simplified regular expression, |
1398 but also what specific simplifications |
1398 but also what specific simplifications |
1399 has been done as a function on values |
1399 have been done as a function on values |
1400 showing how one can transform the value |
1400 showing how one can transform the value |
1401 underlying the simplified regular expression |
1401 underlying the simplified regular expression |
1402 to the unsimplified one. |
1402 to the unsimplified one. |
1403 |
1403 |
1404 We therefore choose a slightly different approach |
1404 We therefore choose a slightly different approach |