ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 564 3cbcd7cda0a9
parent 543 b2bea5968b89
child 567 28cb8089ec36
     2 
     2 
     3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
     3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
     4 
     4 
     5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
     5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
     6 %and notations we 
     6 %and notations we 
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
     7 % used for describing the lexing algorithm by Sulzmann and Lu,
     8 %and then give the algorithm and its variant, and discuss
     8 %and then give the algorithm and its variant and discuss
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 
    11 In this chapter, we define the basic notions 
    11 In this chapter, we define the basic notions 
    12 for regular languages and regular expressions.
    12 for regular languages and regular expressions.
    13 This is essentially a description in "English"
     13 This is essentially a description in ``English''
    14 of your formalisation in Isabelle/HOL.
    14 of our formalisation in Isabelle/HOL.
    15 We also give the definition of what $\POSIX$ lexing means.
    15 We also give the definition of what $\POSIX$ lexing means, 
       
     16 followed by an algorithm by Sulzmann and Lu~\parencite{Sulzmann2014} 
       
     17 that produces output conforming
       
    18 to the $\POSIX$ standard.
       
    19 It is also worth mentioning that
       
    20 we choose to use the ML-style notation
       
    21 for function applications, where
       
     22 the parameters of a function are not enclosed
       
    23 inside a pair of parentheses (e.g. $f \;x \;y$
       
    24 instead of $f(x,\;y)$). This is mainly
       
    25 to make the text visually more concise.
    16 
    26 
    17 \section{Basic Concepts}
    27 \section{Basic Concepts}
    18 Usually formal language theory starts with an alphabet 
    28 Usually, formal language theory starts with an alphabet 
    19 denoting a set of characters.
    29 denoting a set of characters.
    20 Here we just use the datatype of characters from Isabelle,
    30 Here we just use the datatype of characters from Isabelle,
    21 which roughly corresponds to the ASCII characters.
    31 which roughly corresponds to the ASCII characters.
    22 In what follows we shall leave the information about the alphabet
    32 In what follows, we shall leave the information about the alphabet
    23 implicit.
    33 implicit.
    24 Then using the usual bracket notation for lists,
    34 Then using the usual bracket notation for lists,
    25 we can define strings made up of characters: 
    35 we can define strings made up of characters: 
    26 \begin{center}
    36 \begin{center}
    27 \begin{tabular}{lcl}
    37 \begin{tabular}{lcl}
    28 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
    38 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
    29 \end{tabular}
    39 \end{tabular}
    30 \end{center}
    40 \end{center}
    31 Where $c$ is a variable ranging over characters.
     41 Here $c$ is a variable ranging over characters.
    32 Strings can be concatenated to form longer strings in the same
    42 Strings can be concatenated to form longer strings in the same
    33 way as we concatenate two lists, which we write as @.
    43 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
    34 We omit the precise 
    44 We omit the precise 
    35 recursive definition here.
    45 recursive definition here.
    36 We overload this concatenation operator for two sets of strings:
    46 We overload this concatenation operator for two sets of strings:
    37 \begin{center}
    47 \begin{center}
    38 \begin{tabular}{lcl}
    48 \begin{tabular}{lcl}
    46 \begin{tabular}{lcl}
    56 \begin{tabular}{lcl}
    47 $A^0 $ & $\dn$ & $\{ [] \}$\\
    57 $A^0 $ & $\dn$ & $\{ [] \}$\\
    48 $A^{n+1}$ & $\dn$ & $A @ A^n$
    58 $A^{n+1}$ & $\dn$ & $A @ A^n$
    49 \end{tabular}
    59 \end{tabular}
    50 \end{center}
    60 \end{center}
    51 The union of all the natural number powers of a language   
    61 The union of all powers of a language   
    52 is usually defined as the Kleene star operator:
    62 can be used to define the Kleene star operator:
    53 \begin{center}
    63 \begin{center}
    54 \begin{tabular}{lcl}
    64 \begin{tabular}{lcl}
    55  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    65  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    56 \end{tabular}
    66 \end{tabular}
    57 \end{center}
    67 \end{center}
    58 
    68 
    59 \noindent
    69 \noindent
    60 However, to obtain a convenient induction principle 
    70 However, to obtain a more convenient induction principle 
    61 in Isabelle/HOL, 
    71 in Isabelle/HOL, 
    62 we instead define the Kleene star
    72 we instead define the Kleene star
    63 as an inductive set: 
    73 as an inductive set: 
    64 
    74 
    65 \begin{center}
    75 \begin{center}
    66 \begin{mathpar}
    76 \begin{mathpar}
    67 \inferrule{}{[] \in A*\\}
     77 	\inferrule{\mbox{}}{[] \in A*}
    68 
    78 
    69 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
    79 \inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
    70 \end{mathpar}
    80 \end{mathpar}
    71 \end{center}
    81 \end{center}
    72 \ChristianComment{Yes, used the inferrule command in mathpar}
    82 \noindent
    73 We also define an operation of "chopping off" a character from
     83 We also define an operation of ``chopping off'' a character from
    74 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
    84 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
    75 \begin{center}
    85 \begin{center}
    76 \begin{tabular}{lcl}
    86 \begin{tabular}{lcl}
    77 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    87 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    87 \end{center}
    97 \end{center}
    88 \noindent
    98 \noindent
    89 which is essentially the left quotient $A \backslash L$ of $A$ against 
    99 which is essentially the left quotient $A \backslash L$ of $A$ against 
    90 the singleton language with $L = \{w\}$
    100 the singleton language $L = \{w\}$
    91 in formal language theory.
   101 in formal language theory.
    92 However for the purposes here, the $\textit{Ders}$ definition with 
   102 However, for the purposes here, the $\textit{Ders}$ definition with 
    93 a single string is sufficient.
   103 a single string is sufficient.
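
To get a feel for this operation, the following Scala snippet is a sketch of $\Der$ restricted to finite sets of strings (strings being lists of characters); it only illustrates the set comprehension above and is not part of the formalisation:
\begin{verbatim}
// derivative of a finite language: keep the strings that start
// with c and chop that first character off
def Der(c: Char, A: Set[List[Char]]): Set[List[Char]] =
  A.collect { case `c` :: s => s }

// for example Der 'a' {abc, ade, bcd} = {bc, de}
val A = Set("abc", "ade", "bcd").map(_.toList)
assert(Der('a', A) == Set("bc", "de").map(_.toList))
\end{verbatim}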
    94 
   104 
    95 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
   105 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
    96 we have a  few properties of how the language derivative can be defined using 
    106 we can state a few properties describing how the derivative of a language can be expressed in terms of its
    97 sub-languages.
   107 sub-languages.
    98 \begin{lemma}
   108 \begin{lemma}
    99 \[
   109 \[
   100 	\Der \; c \; (A @ B) =
   110 	\Der \; c \; (A @ B) =
   111 of $A$:
   121 of $A$:
   112 \begin{lemma}
   122 \begin{lemma}
   113 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
   123 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
   114 \end{lemma}
   124 \end{lemma}
   115 \begin{proof}
   125 \begin{proof}
       
    126 There are two inclusions to prove:
   116 \begin{itemize}
   127 \begin{itemize}
   117 \item{$\subseteq$}
   128 \item{$\subseteq$}:\\
   118 \noindent
       
   119 The set 
   129 The set 
   120 \[ \{s \mid c :: s \in A*\} \]
   130 \[ \{s \mid c :: s \in A*\} \]
   121 is enclosed in the set
    131 is included in the set
   122 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
   132 \[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
   123 because whenever you have a string starting with a character 
   133 because whenever you have a string starting with a character 
   124 in the language of a Kleene star $A*$, 
   134 in the language of a Kleene star $A*$, 
   125 then that character together with some sub-string
   135 then that character together with some sub-string
   126 immediately after it will form the first iteration, 
   136 immediately after it will form the first iteration, 
   127 and the rest of the string will 
   137 and the rest of the string will 
   128 be still in $A*$.
    138 still be in $A*$.
   129 \item{$\supseteq$}
   139 \item{$\supseteq$}:\\
   130 Note that
   140 Note that
   131 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   141 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   132 and 
    142 holds.
       
   143 Also this holds:
   133 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
   144 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
   134 where the $\textit{RHS}$ of the above equatioin can be rewritten
   145 where the $\textit{RHS}$ can be rewritten
   135 as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
   146 as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
       
    147 which of course contains $(\Der \; c \; A) \,@\, A*$.
   136 \end{itemize}
   148 \end{itemize}
   137 \end{proof}
   149 \end{proof}
   138 
   150 
   139 \noindent
   151 \noindent
   140 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
   152 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
   141 for regular languages, we need to first give definitions for regular expressions.
   153 for regular languages, we need to first give definitions for regular expressions.
   142 
   154 
   143 \subsection{Regular Expressions and Their Meaning}
   155 \subsection{Regular Expressions and Their Meaning}
   144 The basic regular expressions  are defined inductively
   156 The \emph{basic regular expressions} are defined inductively
   145  by the following grammar:
   157  by the following grammar:
   146 \[			r ::=   \ZERO \mid  \ONE
   158 \[			r ::=   \ZERO \mid  \ONE
   147 			 \mid  c  
   159 			 \mid  c  
   148 			 \mid  r_1 \cdot r_2
   160 			 \mid  r_1 \cdot r_2
   149 			 \mid  r_1 + r_2   
   161 			 \mid  r_1 + r_2   
   150 			 \mid r^*         
   162 			 \mid r^*         
   151 \]
   163 \]
   152 \noindent
   164 \noindent
   153 We call them basic because we might introduce
   165 We call them basic because we will introduce
   154 more constructs later such as negation
   166 additional constructors in later chapters such as negation
   155 and bounded repetitions.
   167 and bounded repetitions.
   156 We defined the regular expression containing
   168 We use $\ZERO$ for the regular expression that
   157 nothing as $\ZERO$, note that some authors
   169 matches no string, and $\ONE$ for the regular
   158 also use $\phi$ for that.
   170 expression that matches only the empty string\footnote{
   159 Similarly, the regular expression denoting the 
    171 Some authors
   160 singleton set with only $[]$ is sometimes 
    172 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$,
   161 denoted by $\epsilon$, but we use $\ONE$ here.
    173 but we prefer our notation.}. 
   162 
   174 The sequence regular expression is written $r_1\cdot r_2$
   163 The language or set of strings denoted 
   175 and sometimes we omit the dot if it is clear which
   164 by regular expressions are defined as
   176 regular expression is meant; the alternative
       
   177 is written $r_1 + r_2$.
       
   178 The \emph{language} or meaning of 
       
   179 a regular expression is defined recursively as
       
   180 a set of strings:
   165 %TODO: FILL in the other defs
   181 %TODO: FILL in the other defs
   166 \begin{center}
   182 \begin{center}
   167 \begin{tabular}{lcl}
   183 \begin{tabular}{lcl}
   168 $L \; (\ZERO)$ & $\dn$ & $\phi$\\
   184 $L \; \ZERO$ & $\dn$ & $\phi$\\
   169 $L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
   185 $L \; \ONE$ & $\dn$ & $\{[]\}$\\
   170 $L \; (c)$ & $\dn$ & $\{[c]\}$\\
   186 $L \; c$ & $\dn$ & $\{[c]\}$\\
   171 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
    187 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
   172 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
    188 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
   173 $L \; (r^*)$ & $\dn$ & $ (L(r))^*$
    189 $L \; (r^*)$ & $\dn$ & $ (L\;r)*$
   174 \end{tabular}
   190 \end{tabular}
   175 \end{center}
   191 \end{center}
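
To fix notation for a few illustrative code snippets in this chapter, here is a sketch of the basic regular expressions as a Scala datatype; the constructor names are our choice and are not claimed to be the exact ones used in the formalisation:
\begin{verbatim}
// basic regular expressions as an algebraic datatype (a sketch)
abstract class Rexp
case object ZERO extends Rexp                     // matches nothing
case object ONE extends Rexp                      // matches only the empty string
case class CHAR(c: Char) extends Rexp             // matches the character c
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class STAR(r: Rexp) extends Rexp             // r*
\end{verbatim}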
   176 \noindent
   192 \noindent
   177 Which is also called the "language interpretation" of
       
   178 a regular expression.
       
   179 
       
   180 Now with semantic derivatives of a language and regular expressions and
    193 Now, with the semantic derivative of a language and regular expressions with
   181 their language interpretations in place, we are ready to define derivatives on regexes.
    194 their language interpretation in place, we are ready to define derivatives of regular expressions.
   182 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   195 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   183 
   196 %Recall, the language derivative acts on a set of strings
   184 \ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the 
   197 %and essentially chops off a particular character from
   185 readers engaged with a story how we got to the definition of $\backslash$, rather 
   198 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
   186 than first "overwhelming" them with the definition of $\nullable$.}
   199 %so that after derivative $L(r\backslash c)$ 
   187 
   200 %will look as if it was obtained by doing a language derivative on $L(r)$:
   188 The language derivative acts on a string set and chops off a character from
   201 Recall that the semantic derivative acts on a set of 
   189 all strings in that set, we want to define a derivative operation on regular expressions
   202 strings. Brzozowski noticed that this operation
   190 so that after derivative $L(r\backslash c)$ 
    203 can be ``mirrored'' by an operation on regular expressions, which
   191 will look as if it was obtained by doing a language derivative on $L(r)$:
   204 he calls the derivative of a regular expression $r$
   192 \begin{center}
   205 with respect to a character $c$, written
       
   206 $r \backslash c$.
       
   207 He defined this operation such that the following property holds:
       
   208 \begin{center}
       
   209 
   193 \[
   210 \[
   194 r\backslash c \dn ?
   211 L(r \backslash c) = \Der \; c \; L(r)
   195 \]
   212 \]
   196 so that
   213 \end{center}
       
   214 \noindent
       
   215 For example in the sequence case we have 
       
   216 \begin{center}
       
   217 	\begin{tabular}{lcl}
       
   218 		$\Der \; c \; (A @ B)$ & $\dn$ & 
       
   219 		$ \textit{if} \;\,  [] \in A \; 
       
   220 		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
       
   221 		\Der \; c\; B$\\
       
   222 		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
       
   223 	\end{tabular}
       
   224 \end{center}
       
   225 \noindent
       
   226 This can be translated to  
       
   227 regular expressions in the following 
       
   228 manner:
       
   229 \begin{center}
       
   230 	\begin{tabular}{lcl}
       
    231 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,[] \in L(r_1)\;\,\textit{then}\;\, (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
       
   232 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
       
   233 	\end{tabular}
       
   234 \end{center}
       
   235 
       
   236 \noindent
       
   237 And similarly, the Kleene star's semantic derivative
       
   238 can be expressed as
   197 \[
   239 \[
   198 L(r \backslash c) = \Der \; c \; L(r) ?
    240 	\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) \,@\, (A*)
   199 \]
   241 \]
   200 \end{center}
   242 which translates to
   201 So we mimic the equalities we have for $\Der$ on language concatenation
       
   202 
       
   203 \[
   243 \[
   204 \Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
   244 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
   205 \]
   245 \]
   206 to get the derivative for sequence regular expressions:
   246 In the above definition of $(r_1\cdot r_2) \backslash c$,
   207 \[
   247 the $\textit{if}$ clause's
   208 (r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
   248 boolean condition 
   209 \]
   249 $[] \in L(r_1)$ needs to be 
   210 
   250 somehow recursively computed.
   211 \noindent
   251 We call such a function that checks
   212 and language Kleene star:
   252 whether the empty string $[]$ is 
   213 \[
   253 in the language of a regular expression $\nullable$:
   214 \textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)
   254 \begin{center}
   215 \]
   255 		\begin{tabular}{lcl}
   216 to get derivative of the Kleene star regular expression:
   256 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   217 \[
   257 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   218 r^* \backslash c = (r \backslash c)\cdot r^*
   258 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   219 \]
   259 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
   220 Note that although we can formalise the boolean predicate
   260 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
   221 $[] \in L(r_1)$ without problems, if we want a function that works
   261 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
   222 computationally, then we would have to define a function that tests
   262 		\end{tabular}
   223 whether an empty string is in the language of a regular expression.
   263 \end{center}
   224 We call such a function $\nullable$:
   264 \noindent
   225 
   265 The $\ZERO$ regular expression
   226 
   266 does not contain any string and
   227 
   267 therefore is not \emph{nullable}.
       
   268 $\ONE$ is \emph{nullable} 
       
   269 by definition. 
       
   270 The character regular expression $c$
       
    271 corresponds to the singleton set $\{[c]\}$, 
       
   272 and therefore does not contain the empty string.
       
   273 The alternative regular expression is nullable
       
   274 if at least one of its children is nullable.
       
   275 The sequence regular expression
       
    276 is nullable only if both of its children are nullable,
       
    277 and the Kleene star
       
   278 is always nullable because it naturally
       
   279 contains the empty string. 
       
   280   
       
   281 The derivative function, written $r\backslash c$, 
       
   282 defines how a regular expression evolves into
       
    283 a new one when each string in its language is acted on: 
       
    284 if the string starts with $c$, then that character is chopped off;
       
    285 if not, the string is removed.
   228 \begin{center}
   286 \begin{center}
   229 \begin{tabular}{lcl}
   287 \begin{tabular}{lcl}
   230 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   288 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   231 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   289 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   232 		$d \backslash c$     & $\dn$ & 
   290 		$d \backslash c$     & $\dn$ & 
   237 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   295 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   238 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   296 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   239 \end{tabular}
   297 \end{tabular}
   240 \end{center}
   298 \end{center}
   241 \noindent
   299 \noindent
   242 The function derivative, written $r\backslash c$, 
   300 The most involved cases are the sequence case
   243 defines how a regular expression evolves into
   301 and the star case.
   244 a new regular expression after all the string it contains
       
   245 is chopped off a certain head character $c$.
       
   246 The most involved cases are the sequence 
       
   247 and star case.
       
   248 The sequence case says that if the first regular expression
   302 The sequence case says that if the first regular expression
   249 contains an empty string then the second component of the sequence
    303 can match the empty string, then the second component of the sequence
   250 might be chosen as the target regular expression to be chopped
   304 needs to be considered, as its derivative will contribute to the
   251 off its head character.
   305 result of this derivative.
   252 The star regular expression's derivative unwraps the iteration of
   306 The star regular expression $r^*$'s derivative 
   253 regular expression and attaches the star regular expression
   307 unwraps one iteration of $r$, turns it into $r\backslash c$,
   254 to the sequence's second element to make sure a copy is retained
   308 and attaches the original $r^*$
   255 for possible more iterations in later phases of lexing.
   309 after $r\backslash c$, so that 
   256 
   310 we can further unfold it as many times as needed.
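
Transcribing the two definitions above onto the Scala datatype sketched earlier gives the following; this is a sketch following the standard Brzozowski clauses and is not claimed to be the formalisation's exact code:
\begin{verbatim}
// whether r can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}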
   257 
   311 We have the following correspondence between 
   258 To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
   312 derivatives on regular expressions and
   259 which tests whether the empty string $""$ 
   313 derivatives on a set of strings:
   260 is in the language of $r$:
   314 \begin{lemma}\label{derDer}
   261 
       
   262 
       
   263 \begin{center}
       
   264 		\begin{tabular}{lcl}
       
   265 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   266 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   267 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   268 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   269 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   270 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   271 		\end{tabular}
       
   272 \end{center}
       
   273 \noindent
       
   274 The empty set does not contain any string and
       
   275 therefore not the empty string, the empty string 
       
   276 regular expression contains the empty string
       
   277 by definition, the character regular expression
       
   278 is the singleton that contains character only,
       
   279 and therefore does not contain the empty string,
       
   280 the alternative regular expression (or "or" expression)
       
   281 might have one of its children regular expressions
       
   282 being nullable and any one of its children being nullable
       
   283 would suffice. The sequence regular expression
       
   284 would require both children to have the empty string
       
   285 to compose an empty string and the Kleene star
       
   286 operation naturally introduced the empty string. 
       
   287   
       
   288 We have the following property where the derivative on regular 
       
   289 expressions coincides with the derivative on a set of strings:
       
   290 
       
   291 \begin{lemma}
       
   292 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
   315 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
   293 \end{lemma}
   316 \end{lemma}
   294 
   317 
   295 \noindent
   318 \noindent
   296 The main property of the derivative operation
   319 The main property of the derivative operation
   297 that enables us to reason about the correctness of
   320 (that enables us to reason about the correctness of
   298 an algorithm using derivatives is 
   321 derivative-based matching)
       
   322 is 
   299 
   323 
   300 \begin{lemma}\label{derStepwise}
   324 \begin{lemma}\label{derStepwise}
   301 $c\!::\!s \in L(r)$ holds
   325 	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
   302 if and only if $s \in L(r\backslash c)$.
       
   303 \end{lemma}
   326 \end{lemma}
   304 
   327 
   305 \noindent
   328 \noindent
   306 We can generalise the derivative operation shown above for single characters
   329 We can generalise the derivative operation shown above for single characters
   307 to strings as follows:
   330 to strings as follows:
   312 $r \backslash [\,] $ & $\dn$ & $r$
   335 $r \backslash [\,] $ & $\dn$ & $r$
   313 \end{tabular}
   336 \end{tabular}
   314 \end{center}
   337 \end{center}
   315 
   338 
   316 \noindent
   339 \noindent
   317 When there is no ambiguity we will use  $\backslash$ to denote
   340 When there is no ambiguity, we will 
       
   341 omit the subscript and use $\backslash$ instead
       
   342 of $\backslash_r$ to denote
   318 string derivatives for brevity.
   343 string derivatives for brevity.
   319 Brzozowski's  regular-expression matcher algorithm can then be described as:
   344 Brzozowski's  regular-expression matcher algorithm can then be described as:
   320 
   345 
   321 \begin{definition}
   346 \begin{definition}
   322 $\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
   347 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
   323 \end{definition}
   348 \end{definition}
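
In the Scala sketch, the string derivative and the matcher are two short recursive functions (continuing the hypothetical definitions from before):
\begin{verbatim}
// derivative with respect to a whole string, character by character
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// Brzozowski's matcher: derive, then test for nullability
def matcher(s: List[Char], r: Rexp): Boolean = nullable(ders(s, r))

// e.g. matcher("ab".toList, SEQ(CHAR('a'), CHAR('b'))) == true
\end{verbatim}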
   324 
   349 
   325 \noindent
   350 \noindent
   326 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
    351 Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_n$, 
   327 this algorithm presented graphically is as follows:
   352 this algorithm presented graphically is as follows:
   334 
   359 
   335 \noindent
   360 \noindent
   336  It can  be
   361  It can  be
   337 relatively  easily shown that this matcher is correct:
   362 relatively  easily shown that this matcher is correct:
   338 \begin{lemma}
   363 \begin{lemma}
   339 $\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
   364 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
   340 \end{lemma}
   365 \end{lemma}
   341 \begin{proof}
   366 \begin{proof}
   342 By the stepwise property of $\backslash$ (\ref{derStepwise})
   367 	By the stepwise property of derivatives (lemma \ref{derStepwise})
       
   368 	and lemma \ref{derDer}. 
   343 \end{proof}
   369 \end{proof}
   344 \noindent
   370 \noindent
   345 If we implement the above algorithm naively, however,
   371 \begin{center}
   346 the algorithm can be excruciatingly slow.
   372 	\begin{figure}
   347 
       
   348 
       
   349 \begin{figure}
       
   350 \begin{tikzpicture}
   373 \begin{tikzpicture}
   351 \begin{axis}[
   374 \begin{axis}[
   352     xlabel={$n$},
   375     xlabel={$n$},
   353     ylabel={time in secs},
   376     ylabel={time in secs},
   354     ymode = log,
   377     ymode = log,
   358 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
   381 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
   359 \end{axis}
   382 \end{axis}
   360 \end{tikzpicture} 
   383 \end{tikzpicture} 
   361 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
   384 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
   362 \end{figure}
   385 \end{figure}
   363    
   386 \end{center} 
   364 \noindent
   387 \noindent
   365 For this we need to introduce certain 
   388 If we implement the above algorithm naively, however,
       
   389 the algorithm can be excruciatingly slow, as shown in 
       
    390 Figure~\ref{NaiveMatcher}.
       
   391 Note that both axes are in logarithmic scale.
       
    392 Around two dozen characters
       
    393 are already enough to blow up the matcher on the regular expression 
       
   394 $(a^*)^*b$.
       
   395 For this, we need to introduce certain 
   366 rewrite rules for the intermediate results,
   396 rewrite rules for the intermediate results,
   367 such as $r + r \rightarrow r$,
   397 such as $r + r \rightarrow r$,
   368 and make sure those rules do not change the 
   398 and make sure those rules do not change the 
   369 language of the regular expression.
   399 language of the regular expression.
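
For instance, a single derivative step on $(a^*)^*\cdot b$ with respect to $a$, calculated with the clauses given earlier, already produces the kind of clutter these rules are meant to remove:
\[
	((a^*)^*\cdot b)\backslash a \;=\; ((\ONE \cdot a^*)\cdot (a^*)^*)\cdot b \,+\, \ZERO
\]
\noindent
Rules such as $\ONE \cdot r \rightarrow r$ and $r + \ZERO \rightarrow r$ shrink this to $(a^* \cdot (a^*)^*)\cdot b$; without them, every further derivative step keeps copying these useless subterms.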
   370 We have a simplification function (that is as simple as possible
    400 One simple-minded simplification function
   371 while having much power on making a regex simpler):
   401 that achieves these requirements is given below:
   372 \begin{verbatim}
   402 \begin{center}
   373 def simp(r: Rexp) : Rexp = r match {
   403 	\begin{tabular}{lcl}
   374   case SEQ(r1, r2) => 
    404 		$\simp \; (r_1 \cdot r_2) $ & $ \dn$ & 
   375     (simp(r1), simp(r2)) match {
   405 		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
   376       case (ZERO, _) => ZERO
   406 					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
   377       case (_, ZERO) => ZERO
   407 					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
   378       case (ONE, r2s) => r2s
   408 					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
   379       case (r1s, ONE) => r1s
   409 					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
   380       case (r1s, r2s) => SEQ(r1s, r2s)
   410 					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
   381     }
    411 		$\simp \; (r_1 + r_2)$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
   382   case ALTS(r1, r2) => {
   412 				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
   383     (simp(r1), simp(r2)) match {
   413 				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
   384       case (ZERO, r2s) => r2s
    414 				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow \textit{if} \; r_1' = r_2' \; \textit{then} \; r_1' \; \textit{else} \; r_1' + r_2'$\\
   385       case (r1s, ZERO) => r1s
   415 		$\simp \; r$ & $\dn$ & $r$
   386       case (r1s, r2s) =>
   416 				   
   387         if(r1s == r2s) r1s else ALTS(r1s, r2s)
   417 	\end{tabular}
   388     }
   418 \end{center}
   389   }
   419 If we repeatedly apply this simplification  
   390   case r => r
   420 function during the matching algorithm, 
   391 }
   421 we have a matcher with simplification:
   392 \end{verbatim}
   422 \begin{center}
   393 If we repeatedly incorporate these 
   423 	\begin{tabular}{lcl}
   394 rules during the matching algorithm, 
   424 		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
   395 we have a lexer with simplification:
    425 		$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
   396 \begin{verbatim}
   426 		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
   397 def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
   427 	\end{tabular}
   398   case Nil => simp(r)
   428 \end{center}
   399   case c :: cs => ders_simp(cs, simp(der(c, r)))
   429 \begin{figure}
   400 }
       
   401 
       
   402 def simp_matcher(s: String, r: Rexp) : Boolean = 
       
   403   nullable(ders_simp(s.toList, r))
       
   404 
       
   405 \end{verbatim}
       
   406 \noindent
       
   407 After putting in those rules, the example of \ref{NaiveMatcher}
       
   408 is now very tame in the length of inputs:
       
   409 
       
   410 
       
   411 \begin{tikzpicture}
   430 \begin{tikzpicture}
   412 \begin{axis}[
   431 \begin{axis}[
   413     xlabel={$n$},
   432     xlabel={$n$},
   414     ylabel={time in secs},
   433     ylabel={time in secs},
   415     ymode = log,
   434     ymode = log,
   416     xmode = log,
   435     xmode = log,
       
   436     grid = both,
   417     legend entries={Matcher With Simp},  
   437     legend entries={Matcher With Simp},  
   418     legend pos=north west,
   438     legend pos=north west,
   419     legend cell align=left]
   439     legend cell align=left]
   420 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
   440 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
   421 \end{axis}
   441 \end{axis}
   422 \end{tikzpicture} \label{fig:BetterMatcher}
   442 \end{tikzpicture} 
   423 
   443 \caption{$(a^*)^*b$ 
   424 
   444 against 
   425 \noindent
   445 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
   426 Note how the x-axis is in logarithmic scale.
   446 \end{figure}
       
   447 \noindent
       
    448 The running time of $\derssimp$
       
    449 on the same example as in Figure~\ref{NaiveMatcher}
       
   450 is now very tame in terms of the length of inputs,
       
    451 as shown in Figure~\ref{BetterMatcher}.
       
   452 
   427 Building derivatives and then testing the existence
    453 We thus build derivatives and then test the existence
   428 of empty string in the resulting regular expression's language,
    454 of the empty string in the resulting regular expression's language,
   429 and add simplification rules when necessary.
   455 adding simplifications when necessary.
   430 So far, so good. But what if we want to 
   456 So far, so good. But what if we want to 
   431 do lexing instead of just getting a YES/NO answer?
   457 do lexing instead of just getting a YES/NO answer?
   432 \citeauthor{Sulzmann2014} first came up with a nice and 
    458 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
   433 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
   459 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
   434 
   460 
   435 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
   461 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
   436 Here we present the hybrid phases of a regular expression lexing 
   462 In this section, we present a two-phase regular expression lexing 
   437 algorithm using the function $\inj$, as given by Sulzmann and Lu.
   463 algorithm.
   438 They first defined the datatypes for storing the 
   464 The first phase takes successive derivatives with 
   439 lexing information called a \emph{value} or
   465 respect to the input string,
   440 sometimes also \emph{lexical value}.  These values and regular
   466 and the second phase does the reverse, \emph{injecting} back
       
    467 characters, constructing a lexing result along the way.
       
   468 We will introduce the injection phase in detail slightly
       
   469 later, but as a preliminary we have to first define 
       
   470 the datatype for lexing results, 
       
   471 called \emph{value} or
       
   472 sometimes also \emph{lexical value}.  Values and regular
   441 expressions correspond to each other as illustrated in the following
   473 expressions correspond to each other as illustrated in the following
   442 table:
   474 table:
   443 
   475 
   444 \begin{center}
   476 \begin{center}
   445 	\begin{tabular}{c@{\hspace{20mm}}c}
   477 	\begin{tabular}{c@{\hspace{20mm}}c}
   464 			& $\mid$ & $\Right(v)$  \\
   496 			& $\mid$ & $\Right(v)$  \\
   465 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
   497 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
   466 		\end{tabular}
   498 		\end{tabular}
   467 	\end{tabular}
   499 	\end{tabular}
   468 \end{center}
   500 \end{center}
   469 
   501 \noindent
   470 \noindent
   502 A value has an underlying string, which 
   471 We have a formal binary relation for telling whether the structure
    503 can be calculated by the ``flatten'' function $|\_|$:
   472 of a regular expression agrees with the value.
   504 \begin{center}
       
   505 	\begin{tabular}{lcl}
       
   506 		$|\Empty|$ & $\dn$ &  $[]$\\
       
   507 		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
       
    508 		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ |v_1| @ |v_2|$\\
       
   509 		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
       
   510 		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
       
   511 		$|\Stars([])|$ & $\dn$ & $[]$\\
       
   512 		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
       
   513 	\end{tabular}
       
   514 \end{center}
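
In the same executable style as before, the values and the flatten function can be sketched in Scala as follows (Char and Seq are renamed to Chr and Sequ to avoid clashes with Scala's built-in names; the constructor names are again only our choice):
\begin{verbatim}
// lexical values, mirroring the table above
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// the underlying string of a value (the flatten function |_|)
def flatten(v: Val): List[Char] = v match {
  case Empty => Nil
  case Chr(c) => List(c)
  case Sequ(v1, v2) => flatten(v1) ::: flatten(v2)
  case Left(v1) => flatten(v1)
  case Right(v1) => flatten(v1)
  case Stars(vs) => vs.flatMap(flatten)
}
\end{verbatim}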
       
   515 Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
       
   516 to indicate that a value $v$ could be generated from a lexing algorithm
       
   517 with input $r$. They call it the value inhabitation relation. 
   473 \begin{mathpar}
   518 \begin{mathpar}
   474 \inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
   519 	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
   475 \inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
   520 
   476 \inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
   521 	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
       
   522 
       
   523 \inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
       
   524 
       
   525 \inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
       
   526 
       
   527 \inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
       
   528 
       
   529 \inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
   477 \end{mathpar}
   530 \end{mathpar}
   478 
   531 \noindent
   479 Building on top of Sulzmann and Lu's attempt to formalise the 
   532 The condition $|v| \neq []$ in the premise of star's rule
   480 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
   533 is to make sure that for a given pair of regular 
   481 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
   534 expression $r$ and string $s$, the number of values 
   482 POSIX matching as a ternary relation recursively defined in a
   535 satisfying $|v| = s$ and $\vdash v:r$ is finite.
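For example, without this condition the values $\Stars\,[]$, $\Stars\,[\Empty]$, $\Stars\,[\Empty, \Empty]$, $\ldots$ would all inhabit $\ONE^*$ and flatten to the empty string; with the condition, only $\Stars\,[]$ remains.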
   483 natural deduction style.
   536 Given the same string and regular expression, there can be
   484 The formal definition of a $\POSIX$ value $v$ for a regular expression
   537 multiple values for it. For example, both
       
   538 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
       
   539 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
       
   540 and the values both flatten to $abc$.
       
   541 Lexers therefore have to disambiguate and choose only
       
   542 one of the values to output. $\POSIX$ is one of the
       
    543 disambiguation strategies that are widely adopted.
       
   544 
       
    545 Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} 
       
   546 formalised the property 
       
   547 as a ternary relation.
       
   548 The $\POSIX$ value $v$ for a regular expression
   485 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   549 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   486 in the following set of rules:
   550 in the following set of rules\footnote{The names of the rules are used
   487 \ChristianComment{Will complete later}
    551 as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
   488 \newcommand*{\inference}[3][t]{%
   552 \noindent
   489    \begingroup
   553 \begin{figure}
   490    \def\and{\\}%
   554 \begin{mathpar}
   491    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
   555 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
   492    #2 \\
   556 		
   493    \hline
   557 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
   494    #3
   558 
   495    \end{tabular}%
   559 	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
   496    \endgroup
   560 
   497 }
   561 	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
   498 \begin{center}
   562 
   499 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
    563 	\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\
   500 \end{center}
   564 		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   501 \noindent
   565 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
   502 The above $\POSIX$ rules could be explained intuitionally as
   566 	\Seq \; v_1 \; v_2}
       
   567 
       
   568 	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
       
   569 
       
    570 	\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
       
   571 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
       
   572 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
       
   573 	(v::vs)}
       
   574 \end{mathpar}
       
   575 \caption{POSIX Lexing Rules}
       
   576 \end{figure}
       
   577 \noindent
       
    578 The above $\POSIX$ rules follow the intuition described below: 
   503 \begin{itemize}
   579 \begin{itemize}
   504 \item
   580 	\item (Left Priority)\\
   505 match the leftmost regular expression when multiple options of matching
   581 		Match the leftmost regular expression when multiple options of matching
   506 are available  
   582 		are available.
   507 \item 
   583 	\item (Maximum munch)\\
   508 always match a subpart as much as possible before proceeding
   584 		Always match a subpart as much as possible before proceeding
   509 to the next token.
   585 		to the next token.
   510 \end{itemize}
   586 \end{itemize}
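As an illustration, consider again the two values for the regular expression $(ab+a)(bc+c)$ and the string $abc$ mentioned earlier: the maximum munch rule forces the first component $ab+a$ to match $ab$ rather than just $a$ (even though the split into $a$ and $bc$ would also succeed), so the $\POSIX$ value is $\Seq(\Left \; ab)(\Right \; c)$ and not $\Seq(\Right\; a)(\Left \; bc)$.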
   511 
   587 \noindent
   512 The reason why we are interested in $\POSIX$ values is that they can
   588 These disambiguation strategies can be 
   513 be practically used in the lexing phase of a compiler front end.
   589 quite practical.
   514 For instance, when lexing a code snippet 
   590 For instance, when lexing a code snippet 
   515 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
   591 \[ 
   516 as an identifier rather than a keyword.
   592 	\textit{iffoo} = 3
   517 \ChristianComment{Do I also introduce lexical values $LV$ here?}
   593 \]
   518 We know that $\POSIX$ values are also part of the normal values:
   594 using the regular expression (with 
       
   595 keyword and identifier having their 
       
    596 usual definitions as in any formal
       
   597 language textbook, for instance
       
    598 an identifier is a nonempty string starting with a letter, 
       
   599 followed by alphanumeric characters or underscores):
       
   600 \[
       
   601 	\textit{keyword} + \textit{identifier},
       
   602 \]
       
   603 we want $\textit{iffoo}$ to be recognized
       
   604 as an identifier rather than a keyword (if)
       
   605 followed by
       
   606 an identifier (foo).
       
   607 POSIX lexing achieves this.
       
   608 
       
    609 We know that a $\POSIX$ value is always a well-formed
       
    610 value in the sense of the inhabitation relation:
   519 \begin{lemma}
   611 \begin{lemma}
   520 $(r, s) \rightarrow v \implies \vdash v: r$
    612 $(s, r) \rightarrow v \implies \vdash v: r$
   521 \end{lemma}
   613 \end{lemma}
   522 \noindent
   614 \noindent
   523 The good property about a $\POSIX$ value is that 
   615 The good property about a $\POSIX$ value is that 
   525 one can always uniquely determine the $\POSIX$ value for it:
   617 one can always uniquely determine the $\POSIX$ value for it:
   526 \begin{lemma}
   618 \begin{lemma}
   527 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   619 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   528 \end{lemma}
   620 \end{lemma}
   529 \begin{proof}
   621 \begin{proof}
   530 By induction on $s$, $r$ and $v_1$. The induction principle is
   622 By induction on $s$, $r$ and $v_1$. The inductive cases
   531 the \POSIX rules. Each case is proven by a combination of
   623 are all the POSIX rules. 
   532 the induction rules for $\POSIX$ values and the inductive hypothesis.
   624 Each case is proven by a combination of
   533 Probably the most cumbersome cases are the sequence and star with non-empty iterations.
   625 the induction rules for $\POSIX$ 
       
   626 values and the inductive hypothesis.
       
   627 Probably the most cumbersome cases are 
       
   628 the sequence and star with non-empty iterations.
   534 
   629 
   535 We give the reasoning about the sequence case as follows:
   630 We give the reasoning about the sequence case as follows:
   536 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
   631 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
   537 we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
    632 we know that there could not be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
   538 and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
    633 and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @s_2' = s$ all hold.
   542 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
   637 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
   543 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, 
   638 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, 
   544 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
    639 which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
   545 is the same as $\Seq(v_1, v_2)$. 
   640 is the same as $\Seq(v_1, v_2)$. 
   546 \end{proof}
   641 \end{proof}
   547 Now we know what a $\POSIX$ value is, the problem is how do we achieve 
   642 Now we know what a $\POSIX$ value is; the problem is how do we achieve 
   548 such a value in a lexing algorithm, using derivatives?
   643 such a value in a lexing algorithm, using derivatives?
   549 
   644 
   550 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   645 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   551 
   646 
   552 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   647 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   553 algorithm by a second phase (the first phase being building successive
   648 algorithm by a second phase (the first phase being building successive
   554 derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value 
   649 derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
   555 is generated if the regular expression matches the string.
       
   556 Two functions are involved: $\inj$ and $\mkeps$.
   650 Two functions are involved: $\inj$ and $\mkeps$.
   557 The function $\mkeps$ constructs a value from the last
   651 The function $\mkeps$ constructs a value from the last
   558 one of all the successive derivatives:
   652 one of all the successive derivatives:
   559 \begin{ceqn}
   653 \begin{ceqn}
   560 \begin{equation}\label{graph:mkeps}
   654 \begin{equation}\label{graph:mkeps}
   568 It tells us how can an empty string be matched by a 
    662 It tells us how an empty string can be matched by a 
   569 regular expression, in a $\POSIX$ way:
   663 regular expression, in a $\POSIX$ way:
   570 
   664 
   571 	\begin{center}
   665 	\begin{center}
   572 		\begin{tabular}{lcl}
   666 		\begin{tabular}{lcl}
   573 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
   667 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
   574 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
    668 			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
   575 			& \textit{if} $\nullable(r_{1})$\\ 
   669 						& \textit{if} ($\nullable \; r_{1})$\\ 
   576 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
   670 			& & \textit{then} $\Left (\mkeps \; r_{1})$\\ 
   577 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
   671 			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
   578 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
    672 			$\mkeps \; (r_1\cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
   579 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
   673 			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
   580 		\end{tabular}
   674 		\end{tabular}
   581 	\end{center}
   675 	\end{center}
   582 
   676 
   583 
   677 
   584 \noindent 
   678 \noindent 
   667 \end{center}
   761 \end{center}
   668  \noindent
   762  \noindent
   669  The central property of the $\lexer$ is that it gives the correct result by
   763  The central property of the $\lexer$ is that it gives the correct result by
   670  $\POSIX$ standards:
   764  $\POSIX$ standards:
   671  \begin{theorem}
   765  \begin{theorem}
   672  \begin{tabular}{l}
   766 	 The $\lexer$ based on derivatives and injections is both sound and complete:
   673  $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
   767  \begin{tabular}{lcl}
   674  $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
   768 	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
       
   769 	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
   675  \end{tabular}
   770  \end{tabular}
   676  \end{theorem}
   771  \end{theorem}
   677  
   772  
   678  
   773  
   679  \begin{proof}
   774  \begin{proof}
   774 have an exponential runtime on ambiguous regular expressions.
   869 have an exponential runtime on ambiguous regular expressions.
   775 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
   870 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
   776 of a match. This means Sulzmann and Lu's injection-based algorithm 
    871 of a match. This means Sulzmann and Lu's injection-based algorithm is
   777  exponential by nature.
   872  exponential by nature.
   778 Somehow one has to make sure which
    873 Somehow one has to keep track of which
   779  lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
   874  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
   780 
   875 
   781 
   876 
   782  For example, the above $r= (a^*\cdot a^*)^*$  and 
   877  For example, the above $r= (a^*\cdot a^*)^*$  and 
   783 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
   878 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
   784 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   879 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   786 not corresponding to this value during lexing.
   881 not corresponding to this value during lexing.
   787 To do this, a two-phase algorithm with rectification is a bit too fragile.
   882 To do this, a two-phase algorithm with rectification is a bit too fragile.
   788 Can we not create those intermediate values $v_1,\ldots v_n$,
   883 Can we not create those intermediate values $v_1,\ldots v_n$,
   789 and get the lexing information that should be already there while
   884 and get the lexing information that should be already there while
   790 doing derivatives in one pass, without a second injection phase?
   885 doing derivatives in one pass, without a second injection phase?
   791 In the meantime, can we make sure that simplifications
    886 At the same time, can we ensure that simplifications
   792 are easily handled without breaking the correctness of the algorithm?
   887 are easily handled without breaking the correctness of the algorithm?
   793 
   888 
   794 Sulzmann and Lu solved this problem by
   889 Sulzmann and Lu solved this problem by
   795 introducing additional information to the 
   890 introducing additional information to the 
   796 regular expressions called \emph{bitcodes}.
   891 regular expressions called \emph{bitcodes}.