ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 637 e3752aac8ec2
parent 628 7af4e2420a8c
child 638 dd9dde2d902b
equal deleted inserted replaced
636:0bcb4a7cb40c 637:e3752aac8ec2
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 
    11 In this chapter, we define the basic notions 
    11 In this chapter, we define the basic notions 
    12 for regular languages and regular expressions.
    12 for regular languages and regular expressions.
    13 This is essentially a description in ``English''
    13 This is essentially a description in ``English''
    14 of our formalisation in Isabelle/HOL.
    14 the functions and datatypes of our formalisation in Isabelle/HOL.
    15 We also give the definition of what $\POSIX$ lexing means, 
    15 We also define what $\POSIX$ lexing means, 
    16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
    16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
    17 that produces the output conforming
    17 that produces the output conforming
    18 to the $\POSIX$ standard\footnote{In what follows 
    18 to the $\POSIX$ standard\footnote{In what follows 
    19 we choose to use the Isabelle-style notation
    19 we choose to use the Isabelle-style notation
    20 for function applications, where
    20 for function applications, where
    24 to make the text visually more concise.}.
    24 to make the text visually more concise.}.
    25 
    25 
    26 \section{Basic Concepts}
    26 \section{Basic Concepts}
    27 Formal language theory usually starts with an alphabet 
    27 Formal language theory usually starts with an alphabet 
    28 denoting a set of characters.
    28 denoting a set of characters.
    29 Here we just use the datatype of characters from Isabelle,
    29 Here we use the datatype of characters from Isabelle,
    30 which roughly corresponds to the ASCII characters.
    30 which roughly corresponds to the ASCII characters.
    31 In what follows, we shall leave the information about the alphabet
    31 In what follows, we shall leave the information about the alphabet
    32 implicit.
    32 implicit.
    33 Then using the usual bracket notation for lists,
    33 Then using the usual bracket notation for lists,
    34 we can define strings made up of characters as: 
    34 we can define strings made up of characters as: 
    38 \end{tabular}
    38 \end{tabular}
    39 \end{center}
    39 \end{center}
    40 where $c$ is a variable ranging over characters.
    40 where $c$ is a variable ranging over characters.
    41 The $::$ stands for list cons and $[]$ for the empty
    41 The $::$ stands for list cons and $[]$ for the empty
    42 list.
    42 list.
    43 A singleton list is sometimes written as $[c]$ for brevity.
    43 For brevity, a singleton list is sometimes written as $[c]$.
    44 Strings can be concatenated to form longer strings in the same
    44 Strings can be concatenated to form longer strings in the same
    45 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
    45 way we concatenate two lists, which we shall write as $s_1 @ s_2$.
    46 We omit the precise 
    46 We omit the precise 
    47 recursive definition here.
    47 recursive definition here.
    48 We overload this concatenation operator for two sets of strings:
    48 We overload this concatenation operator for two sets of strings:
    49 \begin{center}
    49 \begin{center}
    50 \begin{tabular}{lcl}
    50 \begin{tabular}{lcl}
   119 				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
   119 				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
   120 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
   120 				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
   121 \end{tabular}	
   121 \end{tabular}	
   122 \end{center}
   122 \end{center}
   123 \noindent
   123 \noindent
   124 and in the end test whether the set
   124 and in the end, test whether the set
   125 has the empty string.\footnote{We use the infix notation $A\backslash c$
   125 has the empty string.\footnote{We use the infix notation $A\backslash c$
   126 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
   126 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
   127 on languages rather than regular expressions.}
   127 on languages rather than regular expressions.}
   128 
   128 
   129 In general, if we have a language $S$,
   129 In general, if we have a language $S$,
   154 \begin{proof}
   154 \begin{proof}
   155 There are two inclusions to prove:
   155 There are two inclusions to prove:
   156 \begin{itemize}
   156 \begin{itemize}
   157 \item{$\subseteq$}:\\
   157 \item{$\subseteq$}:\\
   158 The set 
   158 The set 
   159 \[ \{s \mid c :: s \in A*\} \]
   159 \[ S_1 = \{s \mid c :: s \in A*\} \]
   160 is enclosed in the set
   160 is enclosed in the set
   161 \[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
   161 \[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
   162 because whenever you have a string starting with a character 
   162 This is because for any string $c::s$ satisfying $c::s \in A*$,
   163 in the language of a Kleene star $A*$, 
   163 %whenever you have a string starting with a character 
   164 then that character together with some sub-string
   164 %in the language of a Kleene star $A*$, 
   165 immediately after it will form the first iteration, 
   165 %then that
   166 and the rest of the string will 
   166 the character $c$, together with a prefix of $s$
   167 be still in $A*$.
   167 %immediately after $c$ 
       
   168 forms the first iteration of $A*$, 
       
   169 and the rest of the $s$ is also $A*$.
       
   170 This coincides with the definition of $S_2$.
   168 \item{$\supseteq$}:\\
   171 \item{$\supseteq$}:\\
   169 Note that
   172 Note that
   170 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   173 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   171 holds.
   174 holds.
   172 Also the following holds:
   175 Also the following holds:
   192 			 \mid  r_1 + r_2   
   195 			 \mid  r_1 + r_2   
   193 			 \mid r^*         
   196 			 \mid r^*         
   194 \]
   197 \]
   195 \noindent
   198 \noindent
   196 We call them basic because we will introduce
   199 We call them basic because we will introduce
   197 additional constructors in later chapters such as negation
   200 additional constructors in later chapters, such as negation
   198 and bounded repetitions.
   201 and bounded repetitions.
   199 We use $\ZERO$ for the regular expression that
   202 We use $\ZERO$ for the regular expression that
   200 matches no string, and $\ONE$ for the regular
   203 matches no string, and $\ONE$ for the regular
   201 expression that matches only the empty string.\footnote{
   204 expression that matches only the empty string.\footnote{
   202 Some authors
   205 Some authors
   221 \end{tabular}
   224 \end{tabular}
   222 \end{center}
   225 \end{center}
   223 \noindent
   226 \noindent
   224 %Now with language derivatives of a language and regular expressions and
   227 %Now with language derivatives of a language and regular expressions and
   225 %their language interpretations in place, we are ready to define derivatives on regular expressions.
   228 %their language interpretations in place, we are ready to define derivatives on regular expressions.
   226 With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
   229 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
   227 We do so by first introducing what properties it should satisfy.
   230 We do so by first introducing what properties it should satisfy.
   228 
   231 
   229 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   232 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   230 %Recall, the language derivative acts on a set of strings
   233 %Recall, the language derivative acts on a set of strings
   231 %and essentially chops off a particular character from
   234 %and essentially chops off a particular character from
   378 
   381 
   379 \begin{property}
   382 \begin{property}
   380 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   383 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
   381 \end{property}
   384 \end{property}
   382 \noindent
   385 \noindent
   383 Next we give the recursive definition of derivative on
   386 Next, we give the recursive definition of derivative on
   384 regular expressions, so that it satisfies the properties above.
   387 regular expressions so that it satisfies the properties above.
   385 The derivative function, written $r\backslash c$, 
   388 The derivative function, written $r\backslash c$, 
   386 takes a regular expression $r$ and character $c$, and
   389 takes a regular expression $r$ and character $c$, and
   387 returns a new regular expression representing
   390 returns a new regular expression representing
   388 the original regular expression's language $L \; r$
   391 the original regular expression's language $L \; r$
   389 being taken the language derivative with respect to $c$.
   392 being taken the language derivative with respect to $c$.
   439 we can further unfold it as many times as needed:
   442 we can further unfold it as many times as needed:
   440 \[
   443 \[
   441 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
   444 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
   442 \]
   445 \]
   443 Again,
   446 Again,
   444 the structure is the same as the language derivative of Kleene star: 
   447 the structure is the same as the language derivative of the Kleene star: 
   445 \[
   448 \[
   446 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
   449 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
   447 \]
   450 \]
   448 In the above definition of $(r_1\cdot r_2) \backslash c$,
   451 In the above definition of $(r_1\cdot r_2) \backslash c$,
   449 the $\textit{if}$ clause's
   452 the $\textit{if}$ clause's
   520 \begin{definition}
   523 \begin{definition}
   521 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
   524 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
   522 \end{definition}
   525 \end{definition}
   523 
   526 
   524 \noindent
   527 \noindent
   525 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   528 Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
   526 this algorithm presented graphically is as follows:
   529 this algorithm, presented graphically, is as follows:
   527 
   530 
   528 \begin{equation}\label{matcher}
   531 \begin{equation}\label{matcher}
   529 \begin{tikzcd}
   532 \begin{tikzcd}
   530 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
   533 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
   531 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
   534 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
   538 easy to show that this matcher is correct, namely
   541 easy to show that this matcher is correct, namely
   539 \begin{lemma}
   542 \begin{lemma}
   540 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   543 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   541 \end{lemma}
   544 \end{lemma}
   542 \begin{proof}
   545 \begin{proof}
   543 	By induction on $s$ using property of derivatives:
   546 	By induction on $s$ using the property of derivatives:
   544 	lemma \ref{derDer}.
   547 	lemma \ref{derDer}.
   545 \end{proof}
   548 \end{proof}
   546 \begin{figure}
   549 \begin{figure}
   547 \begin{center}
   550 \begin{center}
   548 \begin{tikzpicture}
   551 \begin{tikzpicture}
   564 \noindent
   567 \noindent
   565 If we implement the above algorithm naively, however,
   568 If we implement the above algorithm naively, however,
   566 the algorithm can be excruciatingly slow, as shown in 
   569 the algorithm can be excruciatingly slow, as shown in 
   567 \ref{NaiveMatcher}.
   570 \ref{NaiveMatcher}.
   568 Note that both axes are in logarithmic scale.
   571 Note that both axes are in logarithmic scale.
   569 Around two dozens characters
   572 Around two dozen characters
   570 would already ``explode'' the matcher with the regular expression 
   573 would already ``explode'' the matcher with the regular expression 
   571 $(a^*)^*b$.
   574 $(a^*)^*b$.
   572 To improve this situation, we need to introduce simplification
   575 To improve this situation, we need to introduce simplification
   573 rules for the intermediate results,
   576 rules for the intermediate results,
   574 such as $r + r \rightarrow r$,
   577 such as $r + r \rightarrow r$,
   626 The running time of $\textit{ders}\_\textit{simp}$
   629 The running time of $\textit{ders}\_\textit{simp}$
   627 on the same example of Figure \ref{NaiveMatcher}
   630 on the same example of Figure \ref{NaiveMatcher}
   628 is now ``tame''  in terms of the length of inputs,
   631 is now ``tame''  in terms of the length of inputs,
   629 as shown in Figure \ref{BetterMatcher}.
   632 as shown in Figure \ref{BetterMatcher}.
   630 
   633 
   631 So far the story is use Brzozowski derivatives and
   634 So far, the story is use Brzozowski derivatives and
   632 simplify as much as possible and at the end test
   635 simplify as much as possible, and at the end test
   633 whether the empty string is recognised 
   636 whether the empty string is recognised 
   634 by the final derivative.
   637 by the final derivative.
   635 But what if we want to 
   638 But what if we want to 
   636 do lexing instead of just getting a true/false answer?
   639 do lexing instead of just getting a true/false answer?
   637 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
   640 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
   725 multiple values for it. For example, both
   728 multiple values for it. For example, both
   726 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   729 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   727 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   730 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   728 and the values both flatten to $abc$.
   731 and the values both flatten to $abc$.
   729 Lexers therefore have to disambiguate and choose only
   732 Lexers therefore have to disambiguate and choose only
   730 one of the values be generated. $\POSIX$ is one of the
   733 one of the values to be generated. $\POSIX$ is one of the
   731 disambiguation strategies that is widely adopted.
   734 disambiguation strategies that is widely adopted.
   732 
   735 
   733 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
   736 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
   734 formalised the property 
   737 formalised the property 
   735 as a ternary relation.
   738 as a ternary relation.
   794 %			
   797 %			
   795 %
   798 %
   796 %\end{tikzpicture}
   799 %\end{tikzpicture}
   797 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
   800 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
   798 %\end{figure}
   801 %\end{figure}
   799 The above $\POSIX$ rules follows the intuition described below: 
   802 The above $\POSIX$ rules follow the intuition described below: 
   800 \begin{itemize}
   803 \begin{itemize}
   801 	\item (Left Priority)\\
   804 	\item (Left Priority)\\
   802 		Match the leftmost regular expression when multiple options of matching
   805 		Match the leftmost regular expression when multiple options for matching
   803 		are available. See P+L and P+R where in P+R $s$ cannot
   806 		are available. See P+L and P+R where in P+R $s$ cannot
   804 		be in the language of $L \; r_1$.
   807 		be in the language of $L \; r_1$.
   805 	\item (Maximum munch)\\
   808 	\item (Maximum munch)\\
   806 		Always match a subpart as much as possible before proceeding
   809 		Always match a subpart as much as possible before proceeding
   807 		to the next part of the string.
   810 		to the next part of the string.
   808 		For example, when the string $s$ matches 
   811 		For example, when the string $s$ matches 
   809 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
   812 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
   810 		Then the split that matches a longer string for the first part
   813 		Then the split that matches a longer string for the first part
   811 		$r_{part1}$ is preferred by this maximum munch rule.
   814 		$r_{part1}$ is preferred by this maximum munch rule.
   812 		This is caused by the side-condition 
   815 		The side-condition 
   813 		\begin{center}
   816 		\begin{center}
   814 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   817 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   815 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
   818 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
   816 		\end{center}
   819 		\end{center}
   817 		in PS.
   820 		in PS causes this.
   818 		%(See
   821 		%(See
   819 		%\ref{munch} for an illustration).
   822 		%\ref{munch} for an illustration).
   820 \end{itemize}
   823 \end{itemize}
   821 \noindent
   824 \noindent
   822 These disambiguation strategies can be 
   825 These disambiguation strategies can be 
   881 \[
   884 \[
   882 (s_1'', r_1) \rightarrow v_1''
   885 (s_1'', r_1) \rightarrow v_1''
   883 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
   886 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
   884 \]
   887 \]
   885 cannot possibly form a $\POSIX$ value either, because
   888 cannot possibly form a $\POSIX$ value either, because
   886 by definition there is a candidate
   889 by definition, there is a candidate
   887 with longer initial string
   890 with a longer initial string
   888 $s_1$. Therefore, we know that the POSIX
   891 $s_1$. Therefore, we know that the POSIX
   889 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
   892 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
   890 $s$ must have the 
   893 $s$ must have the 
   891 property that 
   894 property that 
   892 \[
   895 \[
   898 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
   901 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
   899 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
   902 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
   900 is the same as $\Seq(v_1, v_2)$. 
   903 is the same as $\Seq(v_1, v_2)$. 
   901 \end{proof}
   904 \end{proof}
   902 \noindent
   905 \noindent
   903 We have now defined what a POSIX value is and shown that it is unique,
   906 We have now defined what a POSIX value is and shown that it is unique.
   904 the problem is to generate
   907 The problem is to generate
   905 such a value in a lexing algorithm using derivatives.
   908 such a value in a lexing algorithm using derivatives.
   906 
   909 
   907 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   910 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   908 
   911 
   909 Sulzmann and Lu extended Brzozowski's 
   912 Sulzmann and Lu extended Brzozowski's 
   961 	By induction on $r$.
   964 	By induction on $r$.
   962 \end{proof}
   965 \end{proof}
   963 \noindent
   966 \noindent
   964 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
   967 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
   965 in reverse order as they were chopped off in the derivative phase.
   968 in reverse order as they were chopped off in the derivative phase.
   966 The fucntion for this is called $\inj$. This function 
   969 The function for this is called $\inj$. This function 
   967 operates on values, unlike $\backslash$ which operates on regular expressions.
   970 operates on values, unlike $\backslash$ which operates on regular expressions.
   968 In the diagram below, $v_i$ stands for the (POSIX) value 
   971 In the diagram below, $v_i$ stands for the (POSIX) value 
   969 for how the regular expression 
   972 for how the regular expression 
   970 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   973 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   971 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   974 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   983 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
   986 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
   984 	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
   987 	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
   985 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
   988 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
   986 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
   989 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
   987 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
   990 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
   988 	then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
   991 	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
   989 	the empty string , by always selecting the leftmost 
   992 	the empty string, by always selecting the leftmost 
   990 	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
   993 	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
   991 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
   994 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
   992 \end{figure}
   995 \end{figure}
   993 \noindent
   996 \noindent
   994 The function $\textit{inj}$ as defined by Sulzmann and Lu
   997 The function $\textit{inj}$ as defined by Sulzmann and Lu
   995 takes three arguments: a regular
   998 takes three arguments: a regular
  1018 \end{center}
  1021 \end{center}
  1019 
  1022 
  1020 \noindent 
  1023 \noindent 
  1021 The function recurses on 
  1024 The function recurses on 
  1022 the shape of regular
  1025 the shape of regular
  1023 expressionsw  and values.
  1026 expressions and values.
  1024 Intuitively, each clause analyses 
  1027 Intuitively, each clause analyses 
  1025 how $r_i$ could have transformed when being 
  1028 how $r_i$ could have transformed when being 
  1026 derived by $c$, identifying which subpart
  1029 derived by $c$, identifying which subpart
  1027 of $v_{i+1}$ has the ``hole'' 
  1030 of $v_{i+1}$ has the ``hole'' 
  1028 to inject the character back into.
  1031 to inject the character back into.
  1029 Once the character is
  1032 Once the character is
  1030 injected back to that sub-value; 
  1033 injected back to that sub-value; 
  1031 $\inj$ assembles all parts together
  1034 $\inj$ assembles all parts
  1032 to form a new value.
  1035 to form a new value.
  1033 
  1036 
  1034 For instance, the last clause is an
  1037 For instance, the last clause is an
  1035 injection into a sequence value $v_{i+1}$
  1038 injection into a sequence value $v_{i+1}$
  1036 whose second child
  1039 whose second child
  1037 value is a star, and the shape of the 
  1040 value is a star and the shape of the 
  1038 regular expression $r_i$ before injection 
  1041 regular expression $r_i$ before injection 
  1039 is a star.
  1042 is a star.
  1040 We therefore know 
  1043 We therefore know 
  1041 the derivative 
  1044 the derivative 
  1042 starts on a star and ends as a sequence:
  1045 starts on a star and ends as a sequence:
  1055 \[
  1058 \[
  1056 	\vdash v: r\backslash c.
  1059 	\vdash v: r\backslash c.
  1057 \]
  1060 \]
  1058 Finally, 
  1061 Finally, 
  1059 $\inj \; r \;c \; v$ is prepended
  1062 $\inj \; r \;c \; v$ is prepended
  1060 to the previous list of iterations, and then
  1063 to the previous list of iterations and then
  1061 wrapped under the $\Stars$
  1064 wrapped under the $\Stars$
  1062 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
  1065 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
  1063 
  1066 
  1064 Recall that lemma 
  1067 Recall that lemma 
  1065 \ref{mePosix} tells us that
  1068 \ref{mePosix} tells us that
  1086 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
  1089 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
  1087 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
  1090 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
  1088 				\end{tabular}
  1091 				\end{tabular}
  1089 			\end{center}
  1092 			\end{center}
  1090 			We know that there exists a unique pair of
  1093 			We know that there exists a unique pair of
  1091 			$s_a$ and $s_b$ satisfaying	
  1094 			$s_a$ and $s_b$ satisfying	
  1092 				$(a \backslash c, s_a) \rightarrow v_a$,
  1095 				$(a \backslash c, s_a) \rightarrow v_a$,
  1093 				$(b , s_b) \rightarrow v_b$, and
  1096 				$(b , s_b) \rightarrow v_b$, and
  1094 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
  1097 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
  1095 				L \; (a\backslash c) \land
  1098 				L \; (a\backslash c) \land
  1096 				s_4 \in L \; b$.
  1099 				s_4 \in L \; b$.
  1111 				\begin{tabular}{lcl}
  1114 				\begin{tabular}{lcl}
  1112 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
  1115 					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
  1113 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
  1116 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
  1114 				\end{tabular}
  1117 				\end{tabular}
  1115 			\end{center}
  1118 			\end{center}
  1116 			With a similar reasoning, 
  1119 			With similar reasoning, 
  1117 
  1120 
  1118 			\[
  1121 			\[
  1119 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
  1122 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
  1120 			\]
  1123 			\]
  1121 			again holds.
  1124 			again holds.
  1131 			We know that $a$ must be nullable,
  1134 			We know that $a$ must be nullable,
  1132 			allowing us to call $\mkeps$ and get
  1135 			allowing us to call $\mkeps$ and get
  1133 			\[
  1136 			\[
  1134 				(a, []) \rightarrow \mkeps \; a.
  1137 				(a, []) \rightarrow \mkeps \; a.
  1135 			\]
  1138 			\]
  1136 			Also by inductive hypothesis
  1139 			Also, by inductive hypothesis
  1137 			\[
  1140 			\[
  1138 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
  1141 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
  1139 			\]
  1142 			\]
  1140 			holds.
  1143 			holds.
  1141 			In addition, as
  1144 			In addition, as
  1152 			\[
  1155 			\[
  1153 				\nexists s_3 \; s_4. \; s_3 \neq [] \land
  1156 				\nexists s_3 \; s_4. \; s_3 \neq [] \land
  1154 				s_3 @s_4 = c::s  \land s_3 \in L \; a 
  1157 				s_3 @s_4 = c::s  \land s_3 \in L \; a 
  1155 				\land s_4 \in L \; b.
  1158 				\land s_4 \in L \; b.
  1156 			\]
  1159 			\]
  1157 			(Which basically says there cannot be a longer 
  1160 			(Which says there cannot be a longer 
  1158 			initial split for $s$ other than the empty string.)
  1161 			initial split for $s$ other than the empty string.)
  1159 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
  1162 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
  1160 			as the POSIX value for $a\cdot b$.
  1163 			as the POSIX value for $a\cdot b$.
  1161 	\end{itemize}
  1164 	\end{itemize}
  1162 	The star case can be proven similarly.
  1165 	The star case can be proven similarly.
  1185 		\end{tabular}
  1188 		\end{tabular}
  1186 	\end{center}
  1189 	\end{center}
  1187 \end{theorem} 
  1190 \end{theorem} 
  1188 \begin{proof}
  1191 \begin{proof}
  1189 By induction on $s$. $r$ generalising over an arbitrary regular expression.
  1192 By induction on $s$. $r$ generalising over an arbitrary regular expression.
  1190 The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
  1193 The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
  1191 by lemma \ref{injPosix}.
  1194 by lemma \ref{injPosix}.
  1192 \end{proof}
  1195 \end{proof}
  1193 \noindent
  1196 \noindent
  1194 As we did earlier in this chapter with the matcher, one can 
  1197 As we did earlier in this chapter with the matcher, one can 
  1195 introduce simplification on the regular expression in each derivative step.
  1198 introduce simplification on the regular expression in each derivative step.
  1196 However, now one needs to do a backward phase and make sure
  1199 However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
  1197 the values align with the regular expression.
  1200 and ensure that
       
  1201 the values align with the regular expression at each step.
  1198 Therefore one has to
  1202 Therefore one has to
  1199 be careful not to break the correctness, as the injection 
  1203 be careful not to break the correctness, as the injection 
  1200 function heavily relies on the structure of 
  1204 function heavily relies on the structure of 
  1201 the regular expressions and values being aligned.
  1205 the regular expressions and values being aligned.
  1202 This can be achieved by recording some extra rectification functions
  1206 This can be achieved by recording some extra rectification functions
  1203 during the derivatives step, and applying these rectifications in 
  1207 during the derivatives step and applying these rectifications in 
  1204 each run during the injection phase.
  1208 each run during the injection phase.
  1205 With extra care
  1209 With extra care
  1206 one can show that POSIXness will not be affected
  1210 one can show that POSIXness will not be affected
  1207 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
  1211 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
  1208 \begin{center}
  1212 \begin{center}
  1221 		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
  1225 		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
  1222 				   
  1226 				   
  1223 	\end{tabular}
  1227 	\end{tabular}
  1224 \end{center}
  1228 \end{center}
  1225 
  1229 
  1226 However, with the simple-minded simplification rules allowed
  1230 However, one can still end up
  1227 in an injection-based lexer, one can still end up
  1231 with exploding derivatives,
  1228 with exploding derivatives.
  1232 even with the simple-minded simplification rules allowed
  1229 \section{A Case Requring More Aggressive Simplifications}
  1233 in an injection-based lexer.
       
  1234 \section{A Case Requiring More Aggressive Simplifications}
  1230 For example, when starting with the regular
  1235 For example, when starting with the regular
  1231 expression $(a^* \cdot a^*)^*$ and building just over
  1236 expression $(a^* \cdot a^*)^*$ and building just over
  1232 a dozen successive derivatives 
  1237 a dozen successive derivatives 
  1233 w.r.t.~the character $a$, one obtains a derivative regular expression
  1238 w.r.t.~the character $a$, one obtains a derivative regular expression
  1234 with millions of nodes (when viewed as a tree)
  1239 with millions of nodes (when viewed as a tree)
  1235 even with the simple-minded simplification.
  1240 even with the mentioned simplifications.
  1236 \begin{figure}[H]
  1241 \begin{figure}[H]
  1237 \begin{center}
  1242 \begin{center}
  1238 \begin{tikzpicture}
  1243 \begin{tikzpicture}
  1239 \begin{axis}[
  1244 \begin{axis}[
  1240     xlabel={$n$},
  1245     xlabel={$n$},
  1260 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
  1265 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
  1261 \]
  1266 \]
  1262 as an example.
  1267 as an example.
  1263 This is a highly ambiguous regular expression, with
  1268 This is a highly ambiguous regular expression, with
  1264 many ways to split up the string into multiple segments for
  1269 many ways to split up the string into multiple segments for
  1265 different star iteratioins,
  1270 different star iterations,
  1266 and for each segment 
  1271 and for each segment 
  1267 multiple ways of splitting between 
  1272 multiple ways of splitting between 
  1268 the two $a^*$ sub-expressions.
  1273 the two $a^*$ sub-expressions.
  1269 When $n$ is equal to $1$, there are two lexical values for
  1274 When $n$ is equal to $1$, there are two lexical values for
  1270 the match:
  1275 the match:
  1329 distinct lexical values that cannot be eliminated by 
  1334 distinct lexical values that cannot be eliminated by 
  1330 the simple-minded simplification of $\derssimp$. 
  1335 the simple-minded simplification of $\derssimp$. 
  1331 A lexer without a good enough strategy to 
  1336 A lexer without a good enough strategy to 
  1332 deduplicate will naturally
  1337 deduplicate will naturally
  1333 have an exponential runtime on highly
  1338 have an exponential runtime on highly
  1334 ambiguous regular expressions, because there
  1339 ambiguous regular expressions because there
  1335 are exponentially many matches.
  1340 are exponentially many matches.
  1336 For this particular example, it seems
  1341 For this particular example, it seems
  1337 that the number of distinct matches growth
  1342 that the number of distinct matches growth
  1338 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
  1343 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
  1339 
  1344 
  1340 On the other hand, the
  1345 On the other hand, the
  1341  $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
  1346  $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
  1342 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
  1347 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
  1343 \[
  1348 \[
  1344 	\Stars\,
  1349 	\Stars\,
  1345 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
  1350 	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
  1346 \]
  1351 \]
  1347 and at any moment the  subterms in a regular expression
  1352 At any moment, the  subterms in a regular expression
  1348 that will result in a POSIX value is only
  1353 that will potentially result in a POSIX value is only
  1349 a minority among the many other terms,
  1354 a minority among the many other terms,
  1350 and one can remove ones that are absolutely not possible to 
  1355 and one can remove ones that are not possible to 
  1351 be POSIX.
  1356 be POSIX.
  1352 In the above example,
  1357 In the above example,
  1353 \begin{equation}\label{eqn:growth2}
  1358 \begin{equation}\label{eqn:growth2}
  1354 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1359 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
  1355 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1360 	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
  1368 	\begin{tabular}{lr}
  1373 	\begin{tabular}{lr}
  1369 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
  1374 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
  1370 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
  1375 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
  1371 	\end{tabular}
  1376 	\end{tabular}
  1372 \end{center}
  1377 \end{center}
  1373 Other terms with an underlying value such as
  1378 Other terms with an underlying value, such as
  1374 \[
  1379 \[
  1375 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1380 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
  1376 \]
  1381 \]
  1377 is simply too hopeless to contribute a POSIX lexical value,
  1382 is too hopeless to contribute a POSIX lexical value,
  1378 and is therefore thrown away.
  1383 and is therefore thrown away.
  1379 
  1384 
  1380 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
  1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
  1381 have come up with some simplification steps, however those steps
  1386 have come up with some simplification steps, however those steps
  1382 are not yet sufficiently strong so that they achieve the above effects.
  1387 are not yet sufficiently strong, so they achieve the above effects.
  1383 And even with these relatively mild simplifications the proof
  1388 And even with these relatively mild simplifications, the proof
  1384 is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
  1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
  1385 One would prove something like: 
  1390 One would prove something like this: 
  1386 \[
  1391 \[
  1387 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
  1392 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
  1388 	\textit{then}\;\; (r, c::s) \rightarrow 
  1393 	\textit{then}\;\; (r, c::s) \rightarrow 
  1389 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) 
  1394 	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
  1390 \]
  1395 \]
  1391 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
  1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
  1392 not only has to return a simplified regular expression,
  1397 not only has to return a simplified regular expression,
  1393 but also what specific simplifications 
  1398 but also what specific simplifications 
  1394 has been done as a function on values
  1399 has been done as a function on values