ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 638 dd9dde2d902b
parent 637 e3752aac8ec2
child 639 80cc6dc4c98b
equal deleted inserted replaced
637:e3752aac8ec2 638:dd9dde2d902b
    11 In this chapter, we define the basic notions 
    11 In this chapter, we define the basic notions 
    12 for regular languages and regular expressions.
    12 for regular languages and regular expressions.
    13 This is essentially a description in ``English''
    13 This is essentially a description in ``English''
    14 the functions and datatypes of our formalisation in Isabelle/HOL.
    14 the functions and datatypes of our formalisation in Isabelle/HOL.
    15 We also define what $\POSIX$ lexing means, 
    15 We also define what $\POSIX$ lexing means, 
    16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
    16 followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
    17 that produces the output conforming
    17 that produces the output conforming
    18 to the $\POSIX$ standard\footnote{In what follows 
    18 to the $\POSIX$ standard\footnote{In what follows 
    19 we choose to use the Isabelle-style notation
    19 we choose to use the Isabelle-style notation
    20 for function applications, where
    20 for function applications, where
    21 the parameters of a function are not enclosed
    21 the parameters of a function are not enclosed
   107 The reason for defining derivatives
   107 The reason for defining derivatives
   108 is that they provide another approach
   108 is that they provide another approach
   109 to test membership of a string in 
   109 to test membership of a string in 
   110 a set of strings. 
   110 a set of strings. 
   111 For example, to test whether the string
   111 For example, to test whether the string
   112 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
   112 $bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
   113 respect to the string $bar$:
   113 respect to the string $bar$:
   114 \begin{center}
   114 \begin{center}
   115 \begin{tabular}{lll}
   115 \begin{tabular}{lll}
   116 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
   116 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
   117 	$\{ar, rak\}$ \\
   117 	$\{ar, rak\}$ \\
   120 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
   120 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
   121 \end{tabular}	
   121 \end{tabular}	
   122 \end{center}
   122 \end{center}
   123 \noindent
   123 \noindent
   124 and in the end, test whether the set
   124 and in the end, test whether the set
   125 has the empty string.\footnote{We use the infix notation $A\backslash c$
   125 contains the empty string.\footnote{We use the infix notation $A\backslash c$
   126 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
   126 	instead of $\Der \; c \; A$ for brevity, as it will always be
       
   127 	clear from the context that we are operating
   127 on languages rather than regular expressions.}
   128 on languages rather than regular expressions.}
   128 
   129 
   129 In general, if we have a language $S$,
   130 In general, if we have a language $S$,
   130 then we can test whether $s$ is in $S$
   131 then we can test whether $s$ is in $S$
   131 by testing whether $[] \in S \backslash s$.
   132 by testing whether $[] \in S \backslash s$.
   202 We use $\ZERO$ for the regular expression that
   203 We use $\ZERO$ for the regular expression that
   203 matches no string, and $\ONE$ for the regular
   204 matches no string, and $\ONE$ for the regular
   204 expression that matches only the empty string.\footnote{
   205 expression that matches only the empty string.\footnote{
   205 Some authors
   206 Some authors
   206 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
   207 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
   207 but we prefer our notation.} 
   208 but we prefer this notation.} 
   208 The sequence regular expression is written $r_1\cdot r_2$
   209 The sequence regular expression is written $r_1\cdot r_2$
   209 and sometimes we omit the dot if it is clear which
   210 and sometimes we omit the dot if it is clear which
   210 regular expression is meant; the alternative
   211 regular expression is meant; the alternative
   211 is written $r_1 + r_2$.
   212 is written $r_1 + r_2$.
   212 The \emph{language} or meaning of 
   213 The \emph{language} or meaning of 
   225 \end{center}
   226 \end{center}
   226 \noindent
   227 \noindent
   227 %Now with language derivatives of a language and regular expressions and
   228 %Now with language derivatives of a language and regular expressions and
   228 %their language interpretations in place, we are ready to define derivatives on regular expressions.
   229 %their language interpretations in place, we are ready to define derivatives on regular expressions.
   229 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
   230 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
   230 We do so by first introducing what properties it should satisfy.
   231 We do so by first introducing what properties they should satisfy.
   231 
   232 
   232 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   233 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   233 %Recall, the language derivative acts on a set of strings
   234 %Recall, the language derivative acts on a set of strings
   234 %and essentially chops off a particular character from
   235 %and essentially chops off a particular character from
   235 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
   236 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
   289 Brzozowski noticed that $\Der$
   290 Brzozowski noticed that $\Der$
   290 can be ``mirrored'' on regular expressions which
   291 can be ``mirrored'' on regular expressions which
   291 he calls the derivative of a regular expression $r$
   292 he calls the derivative of a regular expression $r$
   292 with respect to a character $c$, written
   293 with respect to a character $c$, written
   293 $r \backslash c$. This infix operator
   294 $r \backslash c$. This infix operator
   294 takes an original regular expression $r$ as input
   295 takes regular expression $r$ as input
   295 and a character as a right operand and
   296 and a character as a right operand.
   296 outputs a result, which is a new regular expression.
       
   297 The derivative operation on regular expression
   297 The derivative operation on regular expression
   298 is defined such that the language of the derivative result 
   298 is defined such that the language of the derivative result 
   299 coincides with the language of the original 
   299 coincides with the language of the original 
   300 regular expression being taken 
   300 regular expression being taken 
   301 derivative with respect to the same character:
   301 derivative with respect to the same characters, namely
   302 \begin{property}
   302 \begin{property}
   303 
   303 
   304 \[
   304 \[
   305 	L \; (r \backslash c) = \Der \; c \; (L \; r)
   305 	L \; (r \backslash c) = \Der \; c \; (L \; r)
   306 \]
   306 \]
   333 	\{\ldots,\;s_1,\;\ldots\}$}
   333 	\{\ldots,\;s_1,\;\ldots\}$}
   334 }\\ 
   334 }\\ 
   335 \vspace{ 3mm }
   335 \vspace{ 3mm }
   336 
   336 
   337 The derivatives on regular expression can again be 
   337 The derivatives on regular expression can again be 
   338 generalised to a string.
   338 generalised to strings.
   339 One could compute $r_{start} \backslash s$  and test membership of $s$
   339 One could compute $r_{start} \backslash s$  and test membership of $s$
   340 in $L \; r_{start}$ by checking
   340 in $L \; r_{start}$ by checking
   341 whether the empty string is in the language of 
   341 whether the empty string is in the language of 
   342 $r_{end}$ ($r_{start}\backslash s$).\\
   342 $r_{end}$ (that is $r_{start}\backslash s$).\\
   343 
   343 
   344 \vspace{2mm}
   344 \vspace{2mm}
   345 \Longstack{
   345 \Longstack{
   346 	\notate{$r_{start}$}{4}{
   346 	\notate{$r_{start}$}{4}{
   347 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
   347 		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
   376 	\notate{$r_{end}$}{1}{
   376 	\notate{$r_{end}$}{1}{
   377 	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
   377 	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
   378 }
   378 }
   379 
   379 
   380 
   380 
   381 
   381 We have the property that
   382 \begin{property}
   382 \begin{property}
   383 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   383 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   384 \end{property}
   384 \end{property}
   385 \noindent
   385 \noindent
   386 Next, we give the recursive definition of derivative on
   386 Next, we give the recursive definition of derivative on
   387 regular expressions so that it satisfies the properties above.
   387 regular expressions so that it satisfies the properties above.
   388 The derivative function, written $r\backslash c$, 
   388 %The derivative function, written $r\backslash c$, 
   389 takes a regular expression $r$ and character $c$, and
   389 %takes a regular expression $r$ and character $c$, and
   390 returns a new regular expression representing
   390 %returns a new regular expression representing
   391 the original regular expression's language $L \; r$
   391 %the original regular expression's language $L \; r$
   392 being taken the language derivative with respect to $c$.
   392 %being taken the language derivative with respect to $c$.
   393 \begin{table}
   393 \begin{table}
   394 	\begin{center}
   394 	\begin{center}
   395 \begin{tabular}{lcl}
   395 \begin{tabular}{lcl}
   396 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   396 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   397 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   397 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   416 result of this derivative:
   416 result of this derivative:
   417 \begin{center}
   417 \begin{center}
   418 	\begin{tabular}{lcl}
   418 	\begin{tabular}{lcl}
   419 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
   419 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
   420 		$\textit{if}\;\,([] \in L(r_1))\; 
   420 		$\textit{if}\;\,([] \in L(r_1))\; 
   421 		\textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
   421 		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
   422 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
   422 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
   423 	\end{tabular}
   423 	\end{tabular}
   424 \end{center}
   424 \end{center}
   425 \noindent
   425 \noindent
   426 Notice how this closely resembles
   426 Notice how this closely resembles
   481 would require both children to have the empty string
   481 would require both children to have the empty string
   482 to compose an empty string, and the Kleene star
   482 to compose an empty string, and the Kleene star
   483 is always nullable because it naturally
   483 is always nullable because it naturally
   484 contains the empty string.  
   484 contains the empty string.  
   485 \noindent
   485 \noindent
   486 We have the following correspondence between 
   486 We have the following two correspondences between 
   487 derivatives on regular expressions and
   487 derivatives on regular expressions and
   488 derivatives on a set of strings:
   488 derivatives on a set of strings:
   489 \begin{lemma}\label{derDer}
   489 \begin{lemma}\label{derDer}
   490 	\mbox{}
   490 	\mbox{}
   491 	\begin{itemize}
   491 	\begin{itemize}
   497 \end{lemma}
   497 \end{lemma}
   498 \begin{proof}
   498 \begin{proof}
   499 	By induction on $r$.
   499 	By induction on $r$.
   500 \end{proof}
   500 \end{proof}
   501 \noindent
   501 \noindent
   502 which is the main property of derivatives
   502 which are the main properties of derivatives
   503 that enables us to reason about the correctness of
   503 that enables us later to reason about the correctness of
   504 derivative-based matching.
   504 derivative-based matching.
   505 We can generalise the derivative operation shown above for single characters
   505 We can generalise the derivative operation shown above for single characters
   506 to strings as follows:
   506 to strings as follows:
   507 
   507 
   508 \begin{center}
   508 \begin{center}
   568 If we implement the above algorithm naively, however,
   568 If we implement the above algorithm naively, however,
   569 the algorithm can be excruciatingly slow, as shown in 
   569 the algorithm can be excruciatingly slow, as shown in 
   570 \ref{NaiveMatcher}.
   570 \ref{NaiveMatcher}.
   571 Note that both axes are in logarithmic scale.
   571 Note that both axes are in logarithmic scale.
   572 Around two dozen characters
   572 Around two dozen characters
   573 would already ``explode'' the matcher with the regular expression 
   573 this algorithm already ``explodes'' with the regular expression 
   574 $(a^*)^*b$.
   574 $(a^*)^*b$.
   575 To improve this situation, we need to introduce simplification
   575 To improve this situation, we need to introduce simplification
   576 rules for the intermediate results,
   576 rules for the intermediate results,
   577 such as $r + r \rightarrow r$,
   577 such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
   578 and make sure those rules do not change the 
   578 and make sure those rules do not change the 
   579 language of the regular expression.
   579 language of the regular expression.
   580 One simpled-minded simplification function
   580 One simple-minded simplification function
   581 that achieves these requirements 
   581 that achieves these requirements 
   582 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
   582 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
   583 \begin{center}
   583 \begin{center}
   584 	\begin{tabular}{lcl}
   584 	\begin{tabular}{lcl}
   585 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
   585 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
   731 and the values both flatten to $abc$.
   731 and the values both flatten to $abc$.
   732 Lexers therefore have to disambiguate and choose only
   732 Lexers therefore have to disambiguate and choose only
   733 one of the values to be generated. $\POSIX$ is one of the
   733 one of the values to be generated. $\POSIX$ is one of the
   734 disambiguation strategies that is widely adopted.
   734 disambiguation strategies that is widely adopted.
   735 
   735 
   736 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
   736 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
   737 formalised the property 
   737 formalised the property 
   738 as a ternary relation.
   738 as a ternary relation.
   739 The $\POSIX$ value $v$ for a regular expression
   739 The $\POSIX$ value $v$ for a regular expression
   740 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   740 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   741 in the following rules\footnote{The names of the rules are used
   741 in the following rules\footnote{The names of the rules are used
   841 where identifiers are defined as usual (letters
   841 where identifiers are defined as usual (letters
   842 followed by letters, numbers or underscores),
   842 followed by letters, numbers or underscores),
   843 then a match with a keyword (if)
   843 then a match with a keyword (if)
   844 followed by
   844 followed by
   845 an identifier (foo) would be incorrect.
   845 an identifier (foo) would be incorrect.
   846 POSIX lexing would generate what we want.
   846 POSIX lexing generates what is included by lexing.
   847 
   847 
   848 \noindent
   848 \noindent
   849 We know that a POSIX 
   849 We know that a POSIX 
   850 value for regular expression $r$ is inhabited by $r$.
   850 value for regular expression $r$ is inhabited by $r$.
   851 \begin{lemma}
   851 \begin{lemma}
   913 derivative-based matching
   913 derivative-based matching
   914 to a lexing algorithm by a second phase 
   914 to a lexing algorithm by a second phase 
   915 after the initial phase of successive derivatives.
   915 after the initial phase of successive derivatives.
   916 This second phase generates a POSIX value 
   916 This second phase generates a POSIX value 
   917 if the regular expression matches the string.
   917 if the regular expression matches the string.
   918 Two functions are involved: $\inj$ and $\mkeps$.
   918 The algorithm uses two functions called $\inj$ and $\mkeps$.
   919 The function $\mkeps$ constructs a POSIX value from the last
   919 The function $\mkeps$ constructs a POSIX value from the last
   920 derivative $r_n$:
   920 derivative $r_n$:
   921 \begin{ceqn}
   921 \begin{ceqn}
   922 \begin{equation}\label{graph:mkeps}
   922 \begin{equation}\label{graph:mkeps}
   923 \begin{tikzcd}
   923 \begin{tikzcd}
   956 we give the $\Stars$ constructor an empty list, meaning
   956 we give the $\Stars$ constructor an empty list, meaning
   957 no iteration is taken.
   957 no iteration is taken.
   958 The result of $\mkeps$ on a $\nullable$ $r$ 
   958 The result of $\mkeps$ on a $\nullable$ $r$ 
   959 is a POSIX value for $r$ and the empty string:
   959 is a POSIX value for $r$ and the empty string:
   960 \begin{lemma}\label{mePosix}
   960 \begin{lemma}\label{mePosix}
   961 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
   961 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
   962 \end{lemma}
   962 \end{lemma}
   963 \begin{proof}
   963 \begin{proof}
   964 	By induction on $r$.
   964 	By induction on $r$.
   965 \end{proof}
   965 \end{proof}
   966 \noindent
   966 \noindent
  1320 	\Stars \; [
  1320 	\Stars \; [
  1321 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
  1321 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
  1322 		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
  1322 		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
  1323 		  ] 
  1323 		  ] 
  1324 \]
  1324 \]
  1325 And $\derssimp \; aa \; (a^*a^*)^*$ would be
  1325 And $\derssimp \; aa \; (a^*a^*)^*$ is
  1326 \[
  1326 \[
  1327 	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
  1327 	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
  1328 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
  1328 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
  1329 \]
  1329 \]
  1330 which removes two out of the seven terms corresponding to the
  1330 which removes two out of the seven terms corresponding to the
  1350 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
  1350 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
  1351 \]
  1351 \]
  1352 At any moment, the  subterms in a regular expression
  1352 At any moment, the  subterms in a regular expression
  1353 that will potentially result in a POSIX value is only
  1353 that will potentially result in a POSIX value is only
  1354 a minority among the many other terms,
  1354 a minority among the many other terms,
  1355 and one can remove ones that are not possible to 
  1355 and one can remove the ones that are not possible to 
  1356 be POSIX.
  1356 be POSIX.
  1357 In the above example,
  1357 In the above example,
  1358 \begin{equation}\label{eqn:growth2}
  1358 \begin{equation}\label{eqn:growth2}
  1359 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1359 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1360 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1360 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1377 \end{center}
  1377 \end{center}
  1378 Other terms with an underlying value, such as
  1378 Other terms with an underlying value, such as
  1379 \[
  1379 \[
  1380 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1380 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1381 \]
  1381 \]
  1382 is too hopeless to contribute a POSIX lexical value,
  1382 do not to contribute a POSIX lexical value,
  1383 and is therefore thrown away.
  1383 and therefore can be thrown away.
  1384 
  1384 
  1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
  1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
  1386 have come up with some simplification steps, however those steps
  1386 have come up with some simplification steps, however those steps
  1387 are not yet sufficiently strong, so they achieve the above effects.
  1387 are not yet sufficiently strong, to achieve the above effects.
  1388 And even with these relatively mild simplifications, the proof
  1388 And even with these relatively mild simplifications, the proof
  1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
  1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
  1390 One would prove something like this: 
  1390 One would need to prove something like this: 
  1391 \[
  1391 \[
  1392 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
  1392 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
  1393 	\textit{then}\;\; (r, c::s) \rightarrow 
  1393 	\textit{then}\;\; (r, c::s) \rightarrow 
  1394 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
  1394 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
  1395 \]
  1395 \]
  1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
  1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
  1397 not only has to return a simplified regular expression,
  1397 not only has to return a simplified regular expression,
  1398 but also what specific simplifications 
  1398 but also what specific simplifications 
  1399 has been done as a function on values
  1399 have been done as a function on values
  1400 showing how one can transform the value
  1400 showing how one can transform the value
  1401 underlying the simplified regular expression
  1401 underlying the simplified regular expression
  1402 to the unsimplified one.
  1402 to the unsimplified one.
  1403 
  1403 
  1404 We therefore choose a slightly different approach
  1404 We therefore choose a slightly different approach