ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 538 8016a2480704
parent 536 aff7bf93b9c7
child 539 7cf9f17aa179
equal deleted inserted replaced
537:50e590823220 538:8016a2480704
     6 %and notations we 
     6 %and notations we 
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
     8 %and then give the algorithm and its variant, and discuss
     8 %and then give the algorithm and its variant, and discuss
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 
    11 
    11 In this chapter, we define the basic notions 
    12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
    12 for regular languages and regular expressions.
    13 We have a primitive datatype char, denoting characters.
    13 We also give the definition of what $\POSIX$ lexing means.
    14 \[			char ::=  a
    14 
    15 			 \mid b
    15 \section{Basic Concepts}
    16 			 \mid c
    16 Usually in formal language theory there is an alphabet 
    17 			 \mid  \ldots
    17 denoting a set of characters.
    18 			 \mid z       
    18 Here we only use the datatype of characters from Isabelle,
    19 \]
    19 which roughly corresponds to the ASCII character.
    20 (which one is better?)
    20 Then using the usual $[]$ notation for lists,
    21 \begin{center}
    21 we can define strings using chars:
    22 \begin{tabular}{lcl}
       
    23 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
       
    24 \end{tabular}
       
    25 \end{center}
       
    26 They can form strings by lists:
       
    27 \begin{center}
    22 \begin{center}
    28 \begin{tabular}{lcl}
    23 \begin{tabular}{lcl}
    29 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
    24 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
    30 & & $(c\; \text{has char type})$
    25 & & $(c\; \text{has char type})$
    31 \end{tabular}
    26 \end{tabular}
    32 \end{center}
    27 \end{center}
    33 And strings can be concatenated to form longer strings:
    28 And strings can be concatenated to form longer strings,
    34 \begin{center}
    29 in the same way as we concatenate two lists,
    35 \begin{tabular}{lcl}
    30 which we denote as $@$. We omit the precise 
    36 $[] @ s_2$ & $\dn$ & $s_2$\\
    31 recursive definition here.
    37 $(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
    32 We overload this concatenation operator for two sets of strings:
    38 \end{tabular}
       
    39 \end{center}
       
    40 
       
    41 A set of strings can operate with another set of strings:
       
    42 \begin{center}
    33 \begin{center}
    43 \begin{tabular}{lcl}
    34 \begin{tabular}{lcl}
    44 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
    35 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
    45 \end{tabular}
    36 \end{tabular}
    46 \end{center}
    37 \end{center}
    47 We also call the above "language concatenation".
    38 We also call the above \emph{language concatenation}.
    48 The power of a language is defined recursively, using the 
    39 The power of a language is defined recursively, using the 
    49 concatenation operator $@$:
    40 concatenation operator $@$:
    50 \begin{center}
    41 \begin{center}
    51 \begin{tabular}{lcl}
    42 \begin{tabular}{lcl}
    52 $A^0 $ & $\dn$ & $\{ [] \}$\\
    43 $A^0 $ & $\dn$ & $\{ [] \}$\\
    53 $A^{n+1}$ & $\dn$ & $A^n @ A$
    44 $A^{n+1}$ & $\dn$ & $A^n @ A$
    54 \end{tabular}
    45 \end{tabular}
    55 \end{center}
    46 \end{center}
    56 The union of all the natural number powers of a language   
    47 The union of all the natural number powers of a language   
    57 is denoted by the Kleene star operator:
    48 is defined as the Kleene star operator:
    58 \begin{center}
    49 \begin{center}
    59 \begin{tabular}{lcl}
    50 \begin{tabular}{lcl}
    60  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    51  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    61 \end{tabular}
    52 \end{tabular}
    62 \end{center}
    53 \end{center}
    63 
    54 
    64 To get an induction principle in Isabelle/HOL, 
    55 \noindent
       
    56 However, to obtain a convenient induction principle 
       
    57 in Isabelle/HOL, 
    65 we instead define the Kleene star
    58 we instead define the Kleene star
    66 as an inductive set: 
    59 as an inductive set: 
    67 \begin{center}
    60 
    68 \begin{tabular}{lcl}
    61 \begin{center}
    69 $[] \in A*$  & &\\
    62 \begin{mathpar}
    70 $s_1 \in A \land \; s_2 \in A* $ & $\implies$ & $s_1 @ s_2 \in A*$\\
    63 \inferrule{}{[] \in A*\\}
    71 \end{tabular}
    64 
    72 \end{center}
    65 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
    73 \subsection{Language Derivatives}
    66 \end{mathpar}
    74 We also define an operation of chopping off a character from
    67 \end{center}
    75 a language:
    68 
       
    69 We also define an operation of "chopping of" a character from
       
    70 a language, which we call $\Der$, meaning "Derivative for a language":
    76 \begin{center}
    71 \begin{center}
    77 \begin{tabular}{lcl}
    72 \begin{tabular}{lcl}
    78 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    73 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    79 \end{tabular}
    74 \end{tabular}
    80 \end{center}
    75 \end{center}
    81 
    76 \noindent
    82 This can be generalised to chopping off a string from all strings within set $A$:
    77 This can be generalised to "chopping off" a string from all strings within set $A$,
       
    78 with the help of the concatenation operator:
    83 \begin{center}
    79 \begin{center}
    84 \begin{tabular}{lcl}
    80 \begin{tabular}{lcl}
    85 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
    81 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
    86 \end{tabular}
    82 \end{tabular}
    87 \end{center}
    83 \end{center}
    88 
    84 \noindent
    89 which is essentially the left quotient $A \backslash L'$ of $A$ against 
    85 which is essentially the left quotient $A \backslash L'$ of $A$ against 
    90 the singleton language $L' = \{w\}$
    86 the singleton language $L' = \{w\}$
    91 in formal language theory.
    87 in formal language theory.
    92 For this dissertation the $\textit{Ders}$ definition with 
    88 For this dissertation the $\textit{Ders}$ definition with 
    93 a single string suffices.
    89 a single string suffices.
    97 sub-languages.
    93 sub-languages.
    98 \begin{lemma}
    94 \begin{lemma}
    99 \[
    95 \[
   100 	\Der \; c \; (A @ B) =
    96 	\Der \; c \; (A @ B) =
   101 	\begin{cases}
    97 	\begin{cases}
   102 	((\Der \; c \; A) @ B ) \cup \Der \; c\; B, &  \text{if} \;  [] \in A  \\
    98 	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
   103 	 (\Der \; c \; A) @ B, & \text{otherwise}
    99 	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
   104 	 \end{cases}	
   100 	 \end{cases}	
   105 \]
   101 \]
   106 \end{lemma}
   102 \end{lemma}
   107 \noindent
   103 \noindent
   108 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
   104 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
   109 and get to $B$.
   105 and get to $B$.
   110 
       
   111 The language $A*$'s derivative can be described using the language derivative
   106 The language $A*$'s derivative can be described using the language derivative
   112 of $A$:
   107 of $A$:
   113 \begin{lemma}
   108 \begin{lemma}
   114 $\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)$\\
   109 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
   115 \end{lemma}
   110 \end{lemma}
   116 \begin{proof}
   111 \begin{proof}
   117 \begin{itemize}
   112 \begin{itemize}
   118 \item{$\subseteq$}
   113 \item{$\subseteq$}
       
   114 \noindent
   119 The set 
   115 The set 
   120 \[ \{s \mid c :: s \in A*\} \]
   116 \[ \{s \mid c :: s \in A*\} \]
   121 is enclosed in the set
   117 is enclosed in the set
   122 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
   118 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
   123 because whenever you have a string starting with a character 
   119 because whenever you have a string starting with a character 
   124 in the language of a Kleene star $A*$, then that character together with some sub-string
   120 in the language of a Kleene star $A*$, 
   125 immediately after it will form the first iteration, and the rest of the string will 
   121 then that character together with some sub-string
       
   122 immediately after it will form the first iteration, 
       
   123 and the rest of the string will 
   126 be still in $A*$.
   124 be still in $A*$.
   127 \item{$\supseteq$}
   125 \item{$\supseteq$}
   128 Note that
   126 Note that
   129 \[ \Der \; c \; A* = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   127 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
   130 and 
   128 and 
   131 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
   129 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
   132 where the $\textit{RHS}$ of the above equatioin can be rewritten
   130 where the $\textit{RHS}$ of the above equatioin can be rewritten
   133 as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
   131 as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
   134 \end{itemize}
   132 \end{itemize}
   135 \end{proof}
   133 \end{proof}
       
   134 
       
   135 \noindent
   136 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
   136 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
   137 for regular languages, we need to first give definitions for regular expressions.
   137 for regular languages, we need to first give definitions for regular expressions.
   138 
   138 
   139 \subsection{Regular Expressions and Their Meaning}
   139 \subsection{Regular Expressions and Their Meaning}
   140 Suppose we have an alphabet $\Sigma$, the strings  whose characters
       
   141 are from $\Sigma$
       
   142 can be expressed as $\Sigma^*$.
       
   143 
       
   144 We use patterns to define a set of strings concisely. Regular expressions
       
   145 are one of such patterns systems:
       
   146 The basic regular expressions  are defined inductively
   140 The basic regular expressions  are defined inductively
   147  by the following grammar:
   141  by the following grammar:
   148 \[			r ::=   \ZERO \mid  \ONE
   142 \[			r ::=   \ZERO \mid  \ONE
   149 			 \mid  c  
   143 			 \mid  c  
   150 			 \mid  r_1 \cdot r_2
   144 			 \mid  r_1 \cdot r_2
   151 			 \mid  r_1 + r_2   
   145 			 \mid  r_1 + r_2   
   152 			 \mid r^*         
   146 			 \mid r^*         
   153 \]
   147 \]
       
   148 \noindent
       
   149 We call them basic because we might introduce
       
   150 more constructs later such as negation
       
   151 and bounded repetitions.
       
   152 We defined the regular expression containing
       
   153 nothing as $\ZERO$, note that some authors
       
   154 also use $\phi$ for that.
       
   155 Similarly, the regular expression denoting the 
       
   156 singleton set with only $[]$ is sometimes 
       
   157 denoted by $\epsilon$, but we use $\ONE$ here.
   154 
   158 
   155 The language or set of strings denoted 
   159 The language or set of strings denoted 
   156 by regular expressions are defined as
   160 by regular expressions are defined as
   157 %TODO: FILL in the other defs
   161 %TODO: FILL in the other defs
   158 \begin{center}
   162 \begin{center}
   170 a regular expression.
   174 a regular expression.
   171 
   175 
   172 Now with semantic derivatives of a language and regular expressions and
   176 Now with semantic derivatives of a language and regular expressions and
   173 their language interpretations in place, we are ready to define derivatives on regexes.
   177 their language interpretations in place, we are ready to define derivatives on regexes.
   174 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
   178 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
       
   179 
       
   180 \ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the 
       
   181 readers engaged with a story how we got to the definition of $\backslash$, rather 
       
   182 than first "overwhelming" them with the definition of $\nullable$.}
       
   183 
   175 The language derivative acts on a string set and chops off a character from
   184 The language derivative acts on a string set and chops off a character from
   176 all strings in that set, we want to define a derivative operation on regular expressions
   185 all strings in that set, we want to define a derivative operation on regular expressions
   177 so that after derivative $L(r\backslash c)$ 
   186 so that after derivative $L(r\backslash c)$ 
   178 will look as if it was obtained by doing a language derivative on $L(r)$:
   187 will look as if it was obtained by doing a language derivative on $L(r)$:
       
   188 \begin{center}
   179 \[
   189 \[
   180 L(r \backslash c) = \Der \; c \; L(r)
   190 r\backslash c \dn ?
   181 \]
   191 \]
       
   192 so that
       
   193 \[
       
   194 L(r \backslash c) = \Der \; c \; L(r) ?
       
   195 \]
       
   196 \end{center}
   182 So we mimic the equalities we have for $\Der$ on language concatenation
   197 So we mimic the equalities we have for $\Der$ on language concatenation
   183 
   198 
   184 \[
   199 \[
   185 \Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
   200 \Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
   186 \]
   201 \]
   211 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   226 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   212 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   227 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   213 		$d \backslash c$     & $\dn$ & 
   228 		$d \backslash c$     & $\dn$ & 
   214 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
   229 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
   215 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
   230 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
   216 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
   231 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
   217 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
   232 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
   218 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   233 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   219 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   234 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   220 \end{tabular}
   235 \end{tabular}
   221 \end{center}
   236 \end{center}
   234 regular expression and attaches the star regular expression
   249 regular expression and attaches the star regular expression
   235 to the sequence's second element to make sure a copy is retained
   250 to the sequence's second element to make sure a copy is retained
   236 for possible more iterations in later phases of lexing.
   251 for possible more iterations in later phases of lexing.
   237 
   252 
   238 
   253 
   239 The $\nullable$ function tests whether the empty string $""$ 
   254 To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
       
   255 which tests whether the empty string $""$ 
   240 is in the language of $r$:
   256 is in the language of $r$:
   241 
   257 
   242 
   258 
   243 \begin{center}
   259 \begin{center}
   244 		\begin{tabular}{lcl}
   260 		\begin{tabular}{lcl}
   300 string derivatives for brevity.
   316 string derivatives for brevity.
   301 
   317 
   302 and then define Brzozowski's  regular-expression matching algorithm as:
   318 and then define Brzozowski's  regular-expression matching algorithm as:
   303 
   319 
   304 \begin{definition}
   320 \begin{definition}
   305 $match\;s\;r \;\dn\; nullable(r\backslash s)$
   321 $\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
   306 \end{definition}
   322 \end{definition}
   307 
   323 
   308 \noindent
   324 \noindent
   309 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   325 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   310 this algorithm presented graphically is as follows:
   326 this algorithm presented graphically is as follows:
   314 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   330 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   315 \end{tikzcd}
   331 \end{tikzcd}
   316 \end{equation}
   332 \end{equation}
   317 
   333 
   318 \noindent
   334 \noindent
       
   335 
       
   336 
       
   337 Building derivatives and then testing the existence
       
   338 of empty string in the resulting regular expression's language.
       
   339 So far, so good. But what if we want to 
       
   340 do lexing instead of just getting a YES/NO answer?
       
   341 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   342 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
       
   343 
       
   344 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   345 Here we present the hybrid phases of a regular expression lexing 
       
   346 algorithm using the function $\inj$, as given by Sulzmann and Lu.
       
   347 They first defined the datatypes for storing the 
       
   348 lexing information called a \emph{value} or
       
   349 sometimes also \emph{lexical value}.  These values and regular
       
   350 expressions correspond to each other as illustrated in the following
       
   351 table:
       
   352 
       
   353 \begin{center}
       
   354 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   355 		\begin{tabular}{@{}rrl@{}}
       
   356 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   357 			$r$ & $::=$  & $\ZERO$\\
       
   358 			& $\mid$ & $\ONE$   \\
       
   359 			& $\mid$ & $c$          \\
       
   360 			& $\mid$ & $r_1 \cdot r_2$\\
       
   361 			& $\mid$ & $r_1 + r_2$   \\
       
   362 			\\
       
   363 			& $\mid$ & $r^*$         \\
       
   364 		\end{tabular}
       
   365 		&
       
   366 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   367 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   368 			$v$ & $::=$  & \\
       
   369 			&        & $\Empty$   \\
       
   370 			& $\mid$ & $\Char(c)$          \\
       
   371 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   372 			& $\mid$ & $\Left(v)$   \\
       
   373 			& $\mid$ & $\Right(v)$  \\
       
   374 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   375 		\end{tabular}
       
   376 	\end{tabular}
       
   377 \end{center}
       
   378 
       
   379 \noindent
       
   380 We have a formal binary relation for telling whether the structure
       
   381 of a regular expression agrees with the value.
       
   382 \begin{mathpar}
       
   383 \inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
       
   384 \inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
       
   385 \inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
       
   386 \end{mathpar}
       
   387 
       
   388 Building on top of Sulzmann and Lu's attempt to formalise the 
       
   389 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
       
   390 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
       
   391 POSIX matching as a ternary relation recursively defined in a
       
   392 natural deduction style.
       
   393 The formal definition of a $\POSIX$ value $v$ for a regular expression
       
   394 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
       
   395 in the following set of rules:
       
   396 \ChristianComment{Will complete later}
       
   397 \newcommand*{\inference}[3][t]{%
       
   398    \begingroup
       
   399    \def\and{\\}%
       
   400    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
       
   401    #2 \\
       
   402    \hline
       
   403    #3
       
   404    \end{tabular}%
       
   405    \endgroup
       
   406 }
       
   407 \begin{center}
       
   408 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
       
   409 \end{center}
       
   410 \noindent
       
   411 The above $\POSIX$ rules could be explained intuitionally as
       
   412 \begin{itemize}
       
   413 \item
       
   414 match the leftmost regular expression when multiple options of matching
       
   415 are available  
       
   416 \item 
       
   417 always match a subpart as much as possible before proceeding
       
   418 to the next token.
       
   419 \end{itemize}
       
   420 
       
   421 The reason why we are interested in $\POSIX$ values is that they can
       
   422 be practically used in the lexing phase of a compiler front end.
       
   423 For instance, when lexing a code snippet 
       
   424 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   425 as an identifier rather than a keyword.
       
   426 
       
   427 The good property about a $\POSIX$ value is that 
       
   428 given the same regular expression $r$ and string $s$,
       
   429 one can always uniquely determine the $\POSIX$ value for it:
       
   430 \begin{lemma}
       
   431 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
       
   432 \end{lemma}
       
   433 Now we know what a $\POSIX$ value is, the problem is how do we achieve 
       
   434 such a value in a lexing algorithm, using derivatives?
       
   435 
       
   436 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
       
   437 
       
   438 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   439 algorithm by a second phase (the first phase being building successive
       
   440 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
       
   441 is generated if the regular expression matches the string.
       
   442 Two functions are involved: $\inj$ and $\mkeps$.
       
   443 The function $\mkeps$ constructs a value from the last
       
   444 one of all the successive derivatives:
       
   445 \begin{ceqn}
       
   446 \begin{equation}\label{graph:mkeps}
       
   447 \begin{tikzcd}
       
   448 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
       
   449 	        & 	              & 	            & v_n       
       
   450 \end{tikzcd}
       
   451 \end{equation}
       
   452 \end{ceqn}
       
   453 
       
   454 It tells us how can an empty string be matched by a 
       
   455 regular expression, in a $\POSIX$ way:
       
   456 
       
   457 	\begin{center}
       
   458 		\begin{tabular}{lcl}
       
   459 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   460 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   461 			& \textit{if} $\nullable(r_{1})$\\ 
       
   462 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   463 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   464 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   465 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   466 		\end{tabular}
       
   467 	\end{center}
       
   468 
       
   469 
       
   470 \noindent 
       
   471 We favour the left to match an empty string if there is a choice.
       
   472 When there is a star for us to match the empty string,
       
   473 we give the $\Stars$ constructor an empty list, meaning
       
   474 no iterations are taken.
       
   475 The result of a call to $\mkeps$ on a $\nullable$ $r$ would
       
   476 be a $\POSIX$ value corresponding to $r$:
       
   477 \begin{lemma}
       
   478 $\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$
       
   479 \end{lemma}\label{mePosix}
       
   480 
       
   481 
       
   482 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   483 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   484 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   485 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   486 matches $s$. 
       
   487 To do this, Sulzmann and Lu defined a function that reverses
       
   488 the ``chopping off'' of characters during the derivative phase. The
       
   489 corresponding function is called \emph{injection}, written
       
   490 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   491 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   492 is a character ${c_{i-1}}$, the character we want to inject and the
       
   493 third argument is the value ${v_i}$, into which one wants to inject the
       
   494 character (it corresponds to the regular expression after the character
       
   495 has been chopped off). The result of this function is a new value. 
       
   496 \begin{ceqn}
       
   497 \begin{equation}\label{graph:inj}
       
   498 \begin{tikzcd}
       
   499 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
       
   500 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
       
   501 \end{tikzcd}
       
   502 \end{equation}
       
   503 \end{ceqn}
       
   504 
       
   505 
       
   506 \noindent
       
   507 The
       
   508 definition of $\textit{inj}$ is as follows: 
       
   509 
       
   510 \begin{center}
       
   511 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   512   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   513   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   514   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   515   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   516   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   517   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   518   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   519 \end{tabular}
       
   520 \end{center}
       
   521 
       
   522 \noindent 
       
   523 This definition is by recursion on the ``shape'' of regular
       
   524 expressions and values. 
       
   525 The clauses do one thing--identifying the ``hole'' on a
       
   526 value to inject the character back into.
       
   527 For instance, in the last clause for injecting back to a value
       
   528 that would turn into a new star value that corresponds to a star,
       
   529 we know it must be a sequence value. And we know that the first 
       
   530 value of that sequence corresponds to the child regex of the star
       
   531 with the first character being chopped off--an iteration of the star
       
   532 that had just been unfolded. This value is followed by the already
       
   533 matched star iterations we collected before. So we inject the character 
       
   534 back to the first value and form a new value with this latest iteration
       
   535 being added to the previous list of iterations, all under the $\Stars$
       
   536 top level.
       
   537 The POSIX value is maintained throughout the process.
       
   538 \begin{lemma}
       
   539 $(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
       
   540 \end{lemma}\label{injPosix}
       
   541 
       
   542 
       
   543 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
       
   544 and taking into consideration the possibility of a non-match,
       
   545 we have a lexer with the following recursive definition:
       
   546 \begin{center}
       
   547 \begin{tabular}{lcr}
       
   548 $\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
       
   549 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
       
   550 & & $\None \implies \None$\\
       
   551 & & $\mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
       
   552 \end{tabular}
       
   553 \end{center}
       
   554  \noindent
       
   555  The central property of the $\lexer$ is that it gives the correct result by
       
   556  $\POSIX$ standards:
       
   557  \begin{lemma}
       
   558  \begin{tabular}{l}
       
   559  $s \in L(r) \Longleftrightarrow  (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
       
   560  $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
       
   561  \end{tabular}
       
   562  \end{lemma}
       
   563  
       
   564  
       
   565  \begin{proof}
       
   566  By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
       
   567  The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
       
   568  by lemma \ref{injPosix}.
       
   569  \end{proof}
       
   570 
       
   571 For convenience, we shall employ the following notations: the regular
       
   572 expression we start with is $r_0$, and the given string $s$ is composed
       
   573 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   574 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   575 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   576 the derivative $r_n$. We test whether this derivative is
       
   577 $\textit{nullable}$ or not. If not, we know the string does not match
       
   578 $r$, and no value needs to be generated. If yes, we start building the
       
   579 values incrementally by \emph{injecting} back the characters into the
       
   580 earlier values $v_n, \ldots, v_0$.
       
   581 Pictorially, the algorithm is as follows:
       
   582 
       
   583 \begin{ceqn}
       
   584 \begin{equation}\label{graph:2}
       
   585 \begin{tikzcd}
       
   586 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   587 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   588 \end{tikzcd}
       
   589 \end{equation}
       
   590 \end{ceqn}
       
   591 
       
   592 
       
   593 \noindent
       
   594  This is the second phase of the
       
   595 algorithm from right to left. For the first value $v_n$, we call the
       
   596 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   597 for how the empty string has been matched by the (nullable) regular
       
   598 expression $r_n$. This function is defined as
       
   599 
       
   600 
       
   601 
       
   602 We have mentioned before that derivatives without simplification 
       
   603 can get clumsy, and this is true for values as well--they reflect
       
   604 the size of the regular expression by definition.
       
   605 
       
   606 One can introduce simplification on the regex and values but have to
       
   607 be careful not to break the correctness, as the injection 
       
   608 function heavily relies on the structure of the regexes and values
       
   609 being correct and matching each other.
       
   610 It can be achieved by recording some extra rectification functions
       
   611 during the derivatives step, and applying these rectifications in 
       
   612 each run during the injection phase.
       
   613 And we can prove that the POSIX value of how
       
   614 regular expressions match strings will not be affected---although it is much harder
       
   615 to establish. 
       
   616 Some initial results in this regard have been
       
   617 obtained in \cite{AusafDyckhoffUrban2016}. 
       
   618 
       
   619 
       
   620 
       
   621 %Brzozowski, after giving the derivatives and simplification,
       
   622 %did not explore lexing with simplification, or he may well be 
       
   623 %stuck on an efficient simplification with proof.
       
   624 %He went on to examine the use of derivatives together with 
       
   625 %automaton, and did not try lexing using products.
       
   626 
       
   627 We want to get rid of the complex and fragile rectification of values.
       
   628 Can we not create those intermediate values $v_1,\ldots v_n$,
       
   629 and get the lexing information that should be already there while
       
   630 doing derivatives in one pass, without a second injection phase?
       
   631 In the meantime, can we make sure that simplifications
       
   632 are easily handled without breaking the correctness of the algorithm?
       
   633 
       
   634 Sulzmann and Lu solved this problem by
       
   635 introducing additional information to the 
       
   636 regular expressions called \emph{bitcodes}.
       
   637 
       
   638 
       
   639 
       
   640 
       
   641 
       
   642 With the formally-specified rules for what a POSIX matching is,
       
   643 they proved in Isabelle/HOL that the algorithm gives correct results.
       
   644 But having a correct result is still not enough, 
       
   645 we want at least some degree of $\mathbf{efficiency}$.
       
   646 
       
   647 
       
   648 A pair of regular expression and string can have multiple lexical values. 
       
   649 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   650 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   651 If we do not allow any empty iterations in its lexical values,
       
   652 there will be $n - 1$ "splitting points" on $s$ we can choose to 
       
   653 split or not so that each sub-string
       
   654 segmented by those chosen splitting points will form different iterations:
       
   655 \begin{center}
       
   656 \begin{tabular}{lcr}
       
   657 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
       
   658 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
       
   659 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
       
   660  & $\textit{etc}.$ &
       
   661  \end{tabular}
       
   662 \end{center}
       
   663 \noindent
       
   664 And for each iteration, there are still multiple ways to split
       
   665 between the two $a^*$s.
       
   666 It is not surprising there are exponentially many lexical values
       
   667 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
       
   668 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   669 A lexer to keep all the possible values will naturally 
       
   670 have an exponential runtime on ambiguous regular expressions.
       
   671 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
       
   672 of a match. This means Sulzmann and Lu's injection-based algorithm 
       
   673 will be exponential by nature.
       
   674 Somehow one has to decide which lexical value to keep and
       
   675 output in a lexing algorithm.
       
   676 
       
   677 
       
   678  For example, the above $r= (a^*\cdot a^*)^*$  and 
       
   679 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
       
   680 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   681 The output of an algorithm we want would be a POSIX matching
       
   682 encoded as a value.
       
   683 
       
   684 
       
   685 
       
   686 
       
   687 
       
   688 %kind of redundant material
       
   689 
       
   690 
   319 where we start with  a regular expression  $r_0$, build successive
   691 where we start with  a regular expression  $r_0$, build successive
   320 derivatives until we exhaust the string and then use \textit{nullable}
   692 derivatives until we exhaust the string and then use \textit{nullable}
   321 to test whether the result can match the empty string. It can  be
   693 to test whether the result can match the empty string. It can  be
   322 relatively  easily shown that this matcher is correct  (that is given
   694 relatively  easily shown that this matcher is correct  (that is given
   323 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   695 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   403 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   775 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   404 $a^*+a+\ONE$.  These more aggressive simplification rules are for
   776 $a^*+a+\ONE$.  These more aggressive simplification rules are for
   405  a very tight size bound, possibly as low
   777  a very tight size bound, possibly as low
   406   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
   778   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
   407 
   779 
   408 Building derivatives and then simplifying them.
       
   409 So far, so good. But what if we want to 
       
   410 do lexing instead of just getting a YES/NO answer?
       
   411 This requires us to go back again to the world 
       
   412 without simplification first for a moment.
       
   413 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   414 elegant(arguably as beautiful as the original
       
   415 derivatives definition) solution for this.
       
   416 
       
   417 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   418 
       
   419 
       
   420 They first defined the datatypes for storing the 
       
   421 lexing information called a \emph{value} or
       
   422 sometimes also \emph{lexical value}.  These values and regular
       
   423 expressions correspond to each other as illustrated in the following
       
   424 table:
       
   425 
       
   426 \begin{center}
       
   427 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   428 		\begin{tabular}{@{}rrl@{}}
       
   429 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   430 			$r$ & $::=$  & $\ZERO$\\
       
   431 			& $\mid$ & $\ONE$   \\
       
   432 			& $\mid$ & $c$          \\
       
   433 			& $\mid$ & $r_1 \cdot r_2$\\
       
   434 			& $\mid$ & $r_1 + r_2$   \\
       
   435 			\\
       
   436 			& $\mid$ & $r^*$         \\
       
   437 		\end{tabular}
       
   438 		&
       
   439 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   440 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   441 			$v$ & $::=$  & \\
       
   442 			&        & $\Empty$   \\
       
   443 			& $\mid$ & $\Char(c)$          \\
       
   444 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   445 			& $\mid$ & $\Left(v)$   \\
       
   446 			& $\mid$ & $\Right(v)$  \\
       
   447 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   448 		\end{tabular}
       
   449 	\end{tabular}
       
   450 \end{center}
       
   451 
       
   452 \noindent
       
   453 
       
   454 Building on top of Sulzmann and Lu's attempt to formalise the 
       
   455 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
       
   456 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
       
   457 POSIX matching as a ternary relation recursively defined in a
       
   458 natural deduction style.
       
   459 With the formally-specified rules for what a POSIX matching is,
       
   460 they proved in Isabelle/HOL that the algorithm gives correct results.
       
   461 
       
   462 But having a correct result is still not enough, 
       
   463 we want at least some degree of $\mathbf{efficiency}$.
       
   464 
       
   465 
       
   466 
       
   467 One regular expression can have multiple lexical values. For example
       
   468 for the regular expression $(a+b)^*$, it has a infinite list of
       
   469 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   470 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   471 $\ldots$, and vice versa.
       
   472 Even for the regular expression matching a particular string, there could 
       
   473 be more than one value corresponding to it.
       
   474 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   475 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   476 If we do not allow any empty iterations in its lexical values,
       
   477 there will be $n - 1$ "splitting points" on $s$ we can choose to 
       
   478 split or not so that each sub-string
       
   479 segmented by those chosen splitting points will form different iterations:
       
   480 \begin{center}
       
   481 \begin{tabular}{lcr}
       
   482 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
       
   483 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
       
   484 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
       
   485  & $\textit{etc}.$ &
       
   486  \end{tabular}
       
   487 \end{center}
       
   488 
       
   489 And for each iteration, there are still multiple ways to split
       
   490 between the two $a^*$s.
       
   491 It is not surprising there are exponentially many lexical values
       
   492 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
       
   493 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   494 
       
   495 A lexer to keep all the possible values will naturally 
       
   496 have an exponential runtime on ambiguous regular expressions.
       
   497 Somehow one has to decide which lexical value to keep and
       
   498 output in a lexing algorithm.
       
   499 In practice, we are usually 
       
   500 interested in POSIX values, which by intuition always
       
   501 \begin{itemize}
       
   502 \item
       
   503 match the leftmost regular expression when multiple options of matching
       
   504 are available  
       
   505 \item 
       
   506 always match a subpart as much as possible before proceeding
       
   507 to the next token.
       
   508 \end{itemize}
       
   509 The formal definition of a $\POSIX$ value $v$ for a regular expression
       
   510 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
       
   511 in the following set of rules:
       
   512 (TODO: write the entire set of inference rules for POSIX )
       
   513 \newcommand*{\inference}[3][t]{%
       
   514    \begingroup
       
   515    \def\and{\\}%
       
   516    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
       
   517    #2 \\
       
   518    \hline
       
   519    #3
       
   520    \end{tabular}%
       
   521    \endgroup
       
   522 }
       
   523 \begin{center}
       
   524 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
       
   525 \end{center}
       
   526 
       
   527 The reason why we are interested in $\POSIX$ values is that they can
       
   528 be practically used in the lexing phase of a compiler front end.
       
   529 For instance, when lexing a code snippet 
       
   530 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   531 as an identifier rather than a keyword.
       
   532 
       
   533  For example, the above $r= (a^*\cdot a^*)^*$  and 
       
   534 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
       
   535 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   536 The output of an algorithm we want would be a POSIX matching
       
   537 encoded as a value.
       
   538 
       
   539 
       
   540 
       
   541 
       
   542 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   543 algorithm by a second phase (the first phase being building successive
       
   544 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
       
   545 is generated if the regular expression matches the string.
       
   546 How can we construct a value out of regular expressions and character
       
   547 sequences only?
       
   548 Two functions are involved: $\inj$ and $\mkeps$.
       
   549 The function $\mkeps$ constructs a value from the last
       
   550 one of all the successive derivatives:
       
   551 \begin{ceqn}
       
   552 \begin{equation}\label{graph:mkeps}
       
   553 \begin{tikzcd}
       
   554 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
       
   555 	        & 	              & 	            & v_n       
       
   556 \end{tikzcd}
       
   557 \end{equation}
       
   558 \end{ceqn}
       
   559 
       
   560 It tells us how can an empty string be matched by a 
       
   561 regular expression, in a $\POSIX$ way:
       
   562 
       
   563 	\begin{center}
       
   564 		\begin{tabular}{lcl}
       
   565 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   566 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   567 			& \textit{if} $\nullable(r_{1})$\\ 
       
   568 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   569 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   570 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   571 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   572 		\end{tabular}
       
   573 	\end{center}
       
   574 
       
   575 
       
   576 \noindent 
       
   577 We favour the left to match an empty string if there is a choice.
       
   578 When there is a star for us to match the empty string,
       
   579 we give the $\Stars$ constructor an empty list, meaning
       
   580 no iterations are taken.
       
   581 
       
   582 
       
   583 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   584 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   585 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   586 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   587 matches $s$. The POSIX value is maintained throughout the process.
       
   588 For this Sulzmann and Lu defined a function that reverses
       
   589 the ``chopping off'' of characters during the derivative phase. The
       
   590 corresponding function is called \emph{injection}, written
       
   591 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   592 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   593 is a character ${c_{i-1}}$, the character we want to inject and the
       
   594 third argument is the value ${v_i}$, into which one wants to inject the
       
   595 character (it corresponds to the regular expression after the character
       
   596 has been chopped off). The result of this function is a new value. 
       
   597 \begin{ceqn}
       
   598 \begin{equation}\label{graph:inj}
       
   599 \begin{tikzcd}
       
   600 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
       
   601 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
       
   602 \end{tikzcd}
       
   603 \end{equation}
       
   604 \end{ceqn}
       
   605 
       
   606 
       
   607 \noindent
       
   608 The
       
   609 definition of $\textit{inj}$ is as follows: 
       
   610 
       
   611 \begin{center}
       
   612 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   613   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   614   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   615   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   616   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   617   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   618   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   619   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   620 \end{tabular}
       
   621 \end{center}
       
   622 
       
   623 \noindent This definition is by recursion on the ``shape'' of regular
       
   624 expressions and values. 
       
   625 The clauses do one thing--identifying the ``hole'' on a
       
   626 value to inject the character back into.
       
   627 For instance, in the last clause for injecting back to a value
       
   628 that would turn into a new star value that corresponds to a star,
       
   629 we know it must be a sequence value. And we know that the first 
       
   630 value of that sequence corresponds to the child regex of the star
       
   631 with the first character being chopped off--an iteration of the star
       
   632 that had just been unfolded. This value is followed by the already
       
   633 matched star iterations we collected before. So we inject the character 
       
   634 back to the first value and form a new value with this latest iteration
       
   635 being added to the previous list of iterations, all under the $\Stars$
       
   636 top level.
       
   637 
       
   638 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
       
   639 we have a lexer with the following recursive definition:
       
   640 \begin{center}
       
   641 \begin{tabular}{lcr}
       
   642 $\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
       
   643 $\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
       
   644 \end{tabular}
       
   645 \end{center}
       
   646  
       
   647 Pictorially, the algorithm is as follows:
       
   648 
       
   649 \begin{ceqn}
       
   650 \begin{equation}\label{graph:2}
       
   651 \begin{tikzcd}
       
   652 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   653 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   654 \end{tikzcd}
       
   655 \end{equation}
       
   656 \end{ceqn}
       
   657 
       
   658 
       
   659 \noindent
       
   660 For convenience, we shall employ the following notations: the regular
       
   661 expression we start with is $r_0$, and the given string $s$ is composed
       
   662 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   663 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   664 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   665 the derivative $r_n$. We test whether this derivative is
       
   666 $\textit{nullable}$ or not. If not, we know the string does not match
       
   667 $r$, and no value needs to be generated. If yes, we start building the
       
   668 values incrementally by \emph{injecting} back the characters into the
       
   669 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   670 algorithm from right to left. For the first value $v_n$, we call the
       
   671 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   672 for how the empty string has been matched by the (nullable) regular
       
   673 expression $r_n$. This function is defined as
       
   674 
       
   675 
       
   676 
       
   677 We have mentioned before that derivatives without simplification 
       
   678 can get clumsy, and this is true for values as well--they reflect
       
   679 the size of the regular expression by definition.
       
   680 
       
   681 One can introduce simplification on the regex and values but have to
       
   682 be careful not to break the correctness, as the injection 
       
   683 function heavily relies on the structure of the regexes and values
       
   684 being correct and matching each other.
       
   685 It can be achieved by recording some extra rectification functions
       
   686 during the derivatives step, and applying these rectifications in 
       
   687 each run during the injection phase.
       
   688 And we can prove that the POSIX value of how
       
   689 regular expressions match strings will not be affected---although it is much harder
       
   690 to establish. 
       
   691 Some initial results in this regard have been
       
   692 obtained in \cite{AusafDyckhoffUrban2016}. 
       
   693 
       
   694 
       
   695 
       
   696 %Brzozowski, after giving the derivatives and simplification,
       
   697 %did not explore lexing with simplification, or he may well be 
       
   698 %stuck on an efficient simplification with proof.
       
   699 %He went on to examine the use of derivatives together with 
       
   700 %automaton, and did not try lexing using products.
       
   701 
       
   702 We want to get rid of the complex and fragile rectification of values.
       
   703 Can we not create those intermediate values $v_1,\ldots v_n$,
       
   704 and get the lexing information that should be already there while
       
   705 doing derivatives in one pass, without a second injection phase?
       
   706 In the meantime, can we make sure that simplifications
       
   707 are easily handled without breaking the correctness of the algorithm?
       
   708 
       
   709 Sulzmann and Lu solved this problem by
       
   710 introducing additional information to the 
       
   711 regular expressions called \emph{bitcodes}.
       
   712 
       
   713 
       
   714 
       
   715 
       
   716 
       
   717 
   780 
   718 	
   781