ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 532 cc54ce075db5
child 536 aff7bf93b9c7
equal deleted inserted replaced
531:c334f0b3ef52 532:cc54ce075db5
       
     1 % Chapter Template
       
     2 
       
     3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
       
     4 
       
     5 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
       
     6 %and notations we 
       
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
       
     8 %and then give the algorithm and its variant, and discuss
       
     9 %why more aggressive simplifications are needed. 
       
    10 
       
    11 
       
    12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
       
    13 We have a primitive datatype char, denoting characters.
       
    14 \[			char ::=  a
       
    15 			 \mid b
       
    16 			 \mid c
       
    17 			 \mid  \ldots
       
    18 			 \mid z       
       
    19 \]
       
    20 (which one is better?)
       
    21 \begin{center}
       
    22 \begin{tabular}{lcl}
       
    23 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
       
    24 \end{tabular}
       
    25 \end{center}
       
    26 They can form strings by lists:
       
    27 \begin{center}
       
    28 \begin{tabular}{lcl}
       
    29 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
       
    30 & & $(c\; \text{has char type})$
       
    31 \end{tabular}
       
    32 \end{center}
       
    33 And strings can be concatenated to form longer strings:
       
    34 \begin{center}
       
    35 \begin{tabular}{lcl}
       
    36 $[] @ s_2$ & $\dn$ & $s_2$\\
       
    37 $(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
       
    38 \end{tabular}
       
    39 \end{center}
       
    40 
       
    41 A set of strings can operate with another set of strings:
       
    42 \begin{center}
       
    43 \begin{tabular}{lcl}
       
    44 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
       
    45 \end{tabular}
       
    46 \end{center}
       
    47 We also call the above "language concatenation".
       
    48 The power of a language is defined recursively, using the 
       
    49 concatenation operator $@$:
       
    50 \begin{center}
       
    51 \begin{tabular}{lcl}
       
    52 $A^0 $ & $\dn$ & $\{ [] \}$\\
       
    53 $A^{n+1}$ & $\dn$ & $A^n @ A$
       
    54 \end{tabular}
       
    55 \end{center}
       
    56 The union of all the natural number powers of a language   
       
    57 is denoted by the Kleene star operator:
       
    58 \begin{center}
       
    59 \begin{tabular}{lcl}
       
    60 $\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
       
    61 \end{tabular}
       
    62 \end{center}
       
    63 
       
    64 In Isabelle of course we cannot easily get a counterpart of
       
    65 the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
       
    66 as an inductive set: 
       
    67 \begin{center}
       
    68 \begin{tabular}{lcl}
       
    69 $[] \in A^*$  & &\\
       
    70 $s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
       
    71 \end{tabular}
       
    72 \end{center}
       
    73 \subsection{Language Derivatives}
       
    74 We also define an operation of chopping off a character from
       
    75 a language:
       
    76 \begin{center}
       
    77 \begin{tabular}{lcl}
       
    78 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
       
    79 \end{tabular}
       
    80 \end{center}
       
    81 
       
    82 This can be generalised to chopping off a string from all strings within set $A$:
       
    83 \begin{center}
       
    84 \begin{tabular}{lcl}
       
    85 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
       
    86 \end{tabular}
       
    87 \end{center}
       
    88 
       
    89 which is essentially the left quotient $A \backslash L'$ of $A$ against 
       
    90 the singleton language $L' = \{w\}$
       
    91 in formal language theory.
       
    92 For this dissertation the $\textit{Ders}$ notation would suffice, there is
       
    93 no need for a more general derivative definition.
       
    94 
       
    95 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
       
    96 we have a  few properties of how the language derivative can be defined using 
       
    97 sub-languages.
       
    98 \begin{lemma}
       
    99 $\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
       
   100 \end{lemma}
       
   101 \noindent
       
   102 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
       
   103 and get to $B$.
       
   104 
       
   105 The language $A^*$'s derivative can be described using the language derivative
       
   106 of $A$:
       
   107 \begin{lemma}
       
   108 $\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
       
   109 \end{lemma}
       
   110 \begin{proof}
       
   111 \begin{itemize}
       
   112 \item{$\subseteq$}
       
   113 The set 
       
   114 \[ \{s \mid c :: s \in A^*\} \]
       
   115 is enclosed in the set
       
   116 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
       
   117 because whenever you have a string starting with a character 
       
   118 in the language of a Kleene star $A^*$, then that character together with some sub-string
       
   119 immediately after it will form the first iteration, and the rest of the string will 
       
   120 be still in $A^*$.
       
   121 \item{$\supseteq$}
       
   122 Note that
       
   123 \[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
       
   124 and 
       
   125 \[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
       
   126 where the $\textit{RHS}$ of the above equatioin can be rewritten
       
   127 as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
       
   128 \end{itemize}
       
   129 \end{proof}
       
   130 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
       
   131 for regular languages, we need to first give definitions for regular expressions.
       
   132 
       
   133 \subsection{Regular Expressions and Their Language Interpretation}
       
   134 Suppose we have an alphabet $\Sigma$, the strings  whose characters
       
   135 are from $\Sigma$
       
   136 can be expressed as $\Sigma^*$.
       
   137 
       
   138 We use patterns to define a set of strings concisely. Regular expressions
       
   139 are one of such patterns systems:
       
   140 The basic regular expressions  are defined inductively
       
   141  by the following grammar:
       
   142 \[			r ::=   \ZERO \mid  \ONE
       
   143 			 \mid  c  
       
   144 			 \mid  r_1 \cdot r_2
       
   145 			 \mid  r_1 + r_2   
       
   146 			 \mid r^*         
       
   147 \]
       
   148 
       
   149 The language or set of strings denoted 
       
   150 by regular expressions are defined as
       
   151 %TODO: FILL in the other defs
       
   152 \begin{center}
       
   153 \begin{tabular}{lcl}
       
   154 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
       
   155 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
       
   156 \end{tabular}
       
   157 \end{center}
       
   158 Which is also called the "language interpretation" of
       
   159 a regular expression.
       
   160 
       
   161 Now with semantic derivatives of a language and regular expressions and
       
   162 their language interpretations in place, we are ready to define derivatives on regexes.
       
   163 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
       
   164 The language derivatives acts on a string set and chops off a character from
       
   165 all strings in that set, we want to define a derivative operation on regular expressions
       
   166 so that after derivative $L(r\backslash c)$ 
       
   167 will look as if it was obtained by doing a language derivative on $L(r)$:
       
   168 \[
       
   169 L(r \backslash c) = \Der \; c \; L(r)
       
   170 \]
       
   171 So we mimic the equalities we have for $\Der$ on language concatenation
       
   172 
       
   173 \[
       
   174 \Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
       
   175 \]
       
   176 to get the derivative for sequence regular expressions:
       
   177 \[
       
   178 (r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
       
   179 \]
       
   180 
       
   181 \noindent
       
   182 and language Kleene star:
       
   183 \[
       
   184 \textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)
       
   185 \]
       
   186 to get derivative of the Kleene star regular expression:
       
   187 \[
       
   188 r^* \backslash c = (r \backslash c)\cdot r^*
       
   189 \]
       
   190 Note that although we can formalise the boolean predicate
       
   191 $[] \in L(r_1)$ without problems, if we want a function that works
       
   192 computationally, then we would have to define a function that tests
       
   193 whether an empty string is in the language of a regular expression.
       
   194 We call such a function $\nullable$:
       
   195 
       
   196 
       
   197 
       
   198 \begin{center}
       
   199 \begin{tabular}{lcl}
       
   200 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   201 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   202 		$d \backslash c$     & $\dn$ & 
       
   203 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   204 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   205 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   206 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   207 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   208 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   209 \end{tabular}
       
   210 \end{center}
       
   211 \noindent
       
   212 The function derivative, written $r\backslash c$, 
       
   213 defines how a regular expression evolves into
       
   214 a new regular expression after all the string it contains
       
   215 is chopped off a certain head character $c$.
       
   216 The most involved cases are the sequence 
       
   217 and star case.
       
   218 The sequence case says that if the first regular expression
       
   219 contains an empty string then the second component of the sequence
       
   220 might be chosen as the target regular expression to be chopped
       
   221 off its head character.
       
   222 The star regular expression's derivative unwraps the iteration of
       
   223 regular expression and attaches the star regular expression
       
   224 to the sequence's second element to make sure a copy is retained
       
   225 for possible more iterations in later phases of lexing.
       
   226 
       
   227 
       
   228 The $\nullable$ function tests whether the empty string $""$ 
       
   229 is in the language of $r$:
       
   230 
       
   231 
       
   232 \begin{center}
       
   233 		\begin{tabular}{lcl}
       
   234 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   235 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   236 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   237 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   238 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   239 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   240 		\end{tabular}
       
   241 \end{center}
       
   242 \noindent
       
   243 The empty set does not contain any string and
       
   244 therefore not the empty string, the empty string 
       
   245 regular expression contains the empty string
       
   246 by definition, the character regular expression
       
   247 is the singleton that contains character only,
       
   248 and therefore does not contain the empty string,
       
   249 the alternative regular expression (or "or" expression)
       
   250 might have one of its children regular expressions
       
   251 being nullable and any one of its children being nullable
       
   252 would suffice. The sequence regular expression
       
   253 would require both children to have the empty string
       
   254 to compose an empty string and the Kleene star
       
   255 operation naturally introduced the empty string. 
       
   256   
       
   257 We have the following property where the derivative on regular 
       
   258 expressions coincides with the derivative on a set of strings:
       
   259 
       
   260 \begin{lemma}
       
   261 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
       
   262 \end{lemma}
       
   263 
       
   264 \noindent
       
   265 
       
   266 
       
   267 The main property of the derivative operation
       
   268 that enables us to reason about the correctness of
       
   269 an algorithm using derivatives is 
       
   270 
       
   271 \begin{center}
       
   272 $c\!::\!s \in L(r)$ holds
       
   273 if and only if $s \in L(r\backslash c)$.
       
   274 \end{center}
       
   275 
       
   276 \noindent
       
   277 We can generalise the derivative operation shown above for single characters
       
   278 to strings as follows:
       
   279 
       
   280 \begin{center}
       
   281 \begin{tabular}{lcl}
       
   282 $r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
       
   283 $r \backslash [\,] $ & $\dn$ & $r$
       
   284 \end{tabular}
       
   285 \end{center}
       
   286 
       
   287 \noindent
       
   288 When there is no ambiguity we will use  $\backslash$ to denote
       
   289 string derivatives for brevity.
       
   290 
       
   291 and then define Brzozowski's  regular-expression matching algorithm as:
       
   292 
       
   293 \begin{definition}
       
   294 $match\;s\;r \;\dn\; nullable(r\backslash s)$
       
   295 \end{definition}
       
   296 
       
   297 \noindent
       
   298 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
       
   299 this algorithm presented graphically is as follows:
       
   300 
       
   301 \begin{equation}\label{graph:successive_ders}
       
   302 \begin{tikzcd}
       
   303 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   304 \end{tikzcd}
       
   305 \end{equation}
       
   306 
       
   307 \noindent
       
   308 where we start with  a regular expression  $r_0$, build successive
       
   309 derivatives until we exhaust the string and then use \textit{nullable}
       
   310 to test whether the result can match the empty string. It can  be
       
   311 relatively  easily shown that this matcher is correct  (that is given
       
   312 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   313 
       
   314 Beautiful and simple definition.
       
   315 
       
   316 If we implement the above algorithm naively, however,
       
   317 the algorithm can be excruciatingly slow. 
       
   318 
       
   319 
       
   320 \begin{figure}
       
   321 \centering
       
   322 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   323 \begin{tikzpicture}
       
   324 \begin{axis}[
       
   325     xlabel={$n$},
       
   326     x label style={at={(1.05,-0.05)}},
       
   327     ylabel={time in secs},
       
   328     enlargelimits=false,
       
   329     xtick={0,5,...,30},
       
   330     xmax=33,
       
   331     ymax=10000,
       
   332     ytick={0,1000,...,10000},
       
   333     scaled ticks=false,
       
   334     axis lines=left,
       
   335     width=5cm,
       
   336     height=4cm, 
       
   337     legend entries={JavaScript},  
       
   338     legend pos=north west,
       
   339     legend cell align=left]
       
   340 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
       
   341 \end{axis}
       
   342 \end{tikzpicture}\\
       
   343 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   344            of the form $\underbrace{aa..a}_{n}$.}
       
   345 \end{tabular}    
       
   346 \caption{EightThousandNodes} \label{fig:EightThousandNodes}
       
   347 \end{figure}
       
   348 
       
   349 
       
   350 (8000 node data to be added here)
       
   351 For example, when starting with the regular
       
   352 expression $(a + aa)^*$ and building a few successive derivatives (around 10)
       
   353 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   354 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
       
   355 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   356 pruning, the algorithm will keep records of all possible ways of matching:
       
   357 \begin{center}
       
   358 $(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   359 \end{center}
       
   360 
       
   361 \noindent
       
   362 Each of the above alternative branches correspond to the match 
       
   363 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
       
   364 These different ways of matching will grow exponentially with the string length,
       
   365 and without simplifications that throw away some of these very similar matchings,
       
   366 it is no surprise that these expressions grow so quickly.
       
   367 Operations like
       
   368 $\backslash$ and $\nullable$ need to traverse such trees and
       
   369 consequently the bigger the size of the derivative the slower the
       
   370 algorithm. 
       
   371 
       
   372 Brzozowski was quick in finding that during this process a lot useless
       
   373 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
       
   374 He also introduced some "similarity rules", such
       
   375 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
       
   376 different but language-equivalent sub-regexes to further decrease the size
       
   377 of the intermediate regexes. 
       
   378 
       
   379 More simplifications are possible, such as deleting duplicates
       
   380 and opening up nested alternatives to trigger even more simplifications.
       
   381 And suppose we apply simplification after each derivative step, and compose
       
   382 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
       
   383 \textit{simp}(a \backslash c)$. Then we can build
       
   384 a matcher with simpler regular expressions.
       
   385 
       
   386 If we want the size of derivatives in the algorithm to
       
   387 stay even lower, we would need more aggressive simplifications.
       
   388 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   389 delete duplicates whenever possible. For example, the parentheses in
       
   390 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
       
   391 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   392 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
   393 $a^*+a+\ONE$.  These more aggressive simplification rules are for
       
   394  a very tight size bound, possibly as low
       
   395   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
       
   396 
       
   397 Building derivatives and then simplifying them.
       
   398 So far, so good. But what if we want to 
       
   399 do lexing instead of just getting a YES/NO answer?
       
   400 This requires us to go back again to the world 
       
   401 without simplification first for a moment.
       
   402 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   403 elegant(arguably as beautiful as the original
       
   404 derivatives definition) solution for this.
       
   405 
       
   406 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   407 
       
   408 
       
   409 They first defined the datatypes for storing the 
       
   410 lexing information called a \emph{value} or
       
   411 sometimes also \emph{lexical value}.  These values and regular
       
   412 expressions correspond to each other as illustrated in the following
       
   413 table:
       
   414 
       
   415 \begin{center}
       
   416 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   417 		\begin{tabular}{@{}rrl@{}}
       
   418 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   419 			$r$ & $::=$  & $\ZERO$\\
       
   420 			& $\mid$ & $\ONE$   \\
       
   421 			& $\mid$ & $c$          \\
       
   422 			& $\mid$ & $r_1 \cdot r_2$\\
       
   423 			& $\mid$ & $r_1 + r_2$   \\
       
   424 			\\
       
   425 			& $\mid$ & $r^*$         \\
       
   426 		\end{tabular}
       
   427 		&
       
   428 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   429 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   430 			$v$ & $::=$  & \\
       
   431 			&        & $\Empty$   \\
       
   432 			& $\mid$ & $\Char(c)$          \\
       
   433 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   434 			& $\mid$ & $\Left(v)$   \\
       
   435 			& $\mid$ & $\Right(v)$  \\
       
   436 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   437 		\end{tabular}
       
   438 	\end{tabular}
       
   439 \end{center}
       
   440 
       
   441 \noindent
       
   442 
       
   443 Building on top of Sulzmann and Lu's attempt to formalise the 
       
   444 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
       
   445 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
       
   446 POSIX matching as a ternary relation recursively defined in a
       
   447 natural deduction style.
       
   448 With the formally-specified rules for what a POSIX matching is,
       
   449 they proved in Isabelle/HOL that the algorithm gives correct results.
       
   450 
       
   451 But having a correct result is still not enough, 
       
   452 we want at least some degree of $\mathbf{efficiency}$.
       
   453 
       
   454 
       
   455 
       
   456 One regular expression can have multiple lexical values. For example
       
   457 for the regular expression $(a+b)^*$, it has a infinite list of
       
   458 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   459 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   460 $\ldots$, and vice versa.
       
   461 Even for the regular expression matching a particular string, there could 
       
   462 be more than one value corresponding to it.
       
   463 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   465 If we do not allow any empty iterations in its lexical values,
       
   466 there will be $n - 1$ "splitting points" on $s$ we can choose to 
       
   467 split or not so that each sub-string
       
   468 segmented by those chosen splitting points will form different iterations:
       
   469 \begin{center}
       
   470 \begin{tabular}{lcr}
       
   471 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
       
   472 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
       
   473 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
       
   474  & $\textit{etc}.$ &
       
   475  \end{tabular}
       
   476 \end{center}
       
   477 
       
   478 And for each iteration, there are still multiple ways to split
       
   479 between the two $a^*$s.
       
   480 It is not surprising there are exponentially many lexical values
       
   481 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
       
   482 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   483 
       
   484 A lexer to keep all the possible values will naturally 
       
   485 have an exponential runtime on ambiguous regular expressions.
       
   486 Somehow one has to decide which lexical value to keep and
       
   487 output in a lexing algorithm.
       
   488 In practice, we are usually 
       
   489 interested in POSIX values, which by intuition always
       
   490 \begin{itemize}
       
   491 \item
       
   492 match the leftmost regular expression when multiple options of matching
       
   493 are available  
       
   494 \item 
       
   495 always match a subpart as much as possible before proceeding
       
   496 to the next token.
       
   497 \end{itemize}
       
   498 The formal definition of a $\POSIX$ value $v$ for a regular expression
       
   499 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
       
   500 in the following set of rules:
       
   501 (TODO: write the entire set of inference rules for POSIX )
       
   502 \newcommand*{\inference}[3][t]{%
       
   503    \begingroup
       
   504    \def\and{\\}%
       
   505    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
       
   506    #2 \\
       
   507    \hline
       
   508    #3
       
   509    \end{tabular}%
       
   510    \endgroup
       
   511 }
       
   512 \begin{center}
       
   513 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
       
   514 \end{center}
       
   515 
       
   516 The reason why we are interested in $\POSIX$ values is that they can
       
   517 be practically used in the lexing phase of a compiler front end.
       
   518 For instance, when lexing a code snippet 
       
   519 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   520 as an identifier rather than a keyword.
       
   521 
       
   522  For example, the above $r= (a^*\cdot a^*)^*$  and 
       
   523 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
       
   524 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   525 The output of an algorithm we want would be a POSIX matching
       
   526 encoded as a value.
       
   527 
       
   528 
       
   529 
       
   530 
       
   531 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   532 algorithm by a second phase (the first phase being building successive
       
   533 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
       
   534 is generated if the regular expression matches the string.
       
   535 How can we construct a value out of regular expressions and character
       
   536 sequences only?
       
   537 Two functions are involved: $\inj$ and $\mkeps$.
       
   538 The function $\mkeps$ constructs a value from the last
       
   539 one of all the successive derivatives:
       
   540 \begin{ceqn}
       
   541 \begin{equation}\label{graph:mkeps}
       
   542 \begin{tikzcd}
       
   543 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
       
   544 	        & 	              & 	            & v_n       
       
   545 \end{tikzcd}
       
   546 \end{equation}
       
   547 \end{ceqn}
       
   548 
       
   549 It tells us how can an empty string be matched by a 
       
   550 regular expression, in a $\POSIX$ way:
       
   551 
       
   552 	\begin{center}
       
   553 		\begin{tabular}{lcl}
       
   554 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   555 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   556 			& \textit{if} $\nullable(r_{1})$\\ 
       
   557 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   558 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   559 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   560 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   561 		\end{tabular}
       
   562 	\end{center}
       
   563 
       
   564 
       
   565 \noindent 
       
   566 We favour the left to match an empty string if there is a choice.
       
   567 When there is a star for us to match the empty string,
       
   568 we give the $\Stars$ constructor an empty list, meaning
       
   569 no iterations are taken.
       
   570 
       
   571 
       
   572 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   573 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   574 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   575 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   576 matches $s$. The POSIX value is maintained throughout the process.
       
   577 For this Sulzmann and Lu defined a function that reverses
       
   578 the ``chopping off'' of characters during the derivative phase. The
       
   579 corresponding function is called \emph{injection}, written
       
   580 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   581 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   582 is a character ${c_{i-1}}$, the character we want to inject and the
       
   583 third argument is the value ${v_i}$, into which one wants to inject the
       
   584 character (it corresponds to the regular expression after the character
       
   585 has been chopped off). The result of this function is a new value. 
       
   586 \begin{ceqn}
       
   587 \begin{equation}\label{graph:inj}
       
   588 \begin{tikzcd}
       
   589 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
       
   590 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
       
   591 \end{tikzcd}
       
   592 \end{equation}
       
   593 \end{ceqn}
       
   594 
       
   595 
       
   596 \noindent
       
   597 The
       
   598 definition of $\textit{inj}$ is as follows: 
       
   599 
       
   600 \begin{center}
       
   601 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   602   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   603   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   604   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   605   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   606   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   607   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   608   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   609 \end{tabular}
       
   610 \end{center}
       
   611 
       
   612 \noindent This definition is by recursion on the ``shape'' of regular
       
   613 expressions and values. 
       
   614 The clauses do one thing--identifying the ``hole'' on a
       
   615 value to inject the character back into.
       
   616 For instance, in the last clause for injecting back to a value
       
   617 that would turn into a new star value that corresponds to a star,
       
   618 we know it must be a sequence value. And we know that the first 
       
   619 value of that sequence corresponds to the child regex of the star
       
   620 with the first character being chopped off--an iteration of the star
       
   621 that had just been unfolded. This value is followed by the already
       
   622 matched star iterations we collected before. So we inject the character 
       
   623 back to the first value and form a new value with this latest iteration
       
   624 being added to the previous list of iterations, all under the $\Stars$
       
   625 top level.
       
   626 
       
   627 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
       
   628 we have a lexer with the following recursive definition:
       
   629 \begin{center}
       
   630 \begin{tabular}{lcr}
       
   631 $\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
       
   632 $\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
       
   633 \end{tabular}
       
   634 \end{center}
       
   635  
       
   636 Pictorially, the algorithm is as follows:
       
   637 
       
   638 \begin{ceqn}
       
   639 \begin{equation}\label{graph:2}
       
   640 \begin{tikzcd}
       
   641 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   642 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   643 \end{tikzcd}
       
   644 \end{equation}
       
   645 \end{ceqn}
       
   646 
       
   647 
       
   648 \noindent
       
   649 For convenience, we shall employ the following notations: the regular
       
   650 expression we start with is $r_0$, and the given string $s$ is composed
       
   651 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   652 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   653 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   654 the derivative $r_n$. We test whether this derivative is
       
   655 $\textit{nullable}$ or not. If not, we know the string does not match
       
   656 $r$, and no value needs to be generated. If yes, we start building the
       
   657 values incrementally by \emph{injecting} back the characters into the
       
   658 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   659 algorithm from right to left. For the first value $v_n$, we call the
       
   660 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   661 for how the empty string has been matched by the (nullable) regular
       
   662 expression $r_n$. This function is defined as
       
   663 
       
   664 
       
   665 
       
   666 We have mentioned before that derivatives without simplification 
       
   667 can get clumsy, and this is true for values as well--they reflect
       
   668 the size of the regular expression by definition.
       
   669 
       
   670 One can introduce simplification on the regex and values but have to
       
   671 be careful not to break the correctness, as the injection 
       
   672 function heavily relies on the structure of the regexes and values
       
   673 being correct and matching each other.
       
   674 It can be achieved by recording some extra rectification functions
       
   675 during the derivatives step, and applying these rectifications in 
       
   676 each run during the injection phase.
       
   677 And we can prove that the POSIX value of how
       
   678 regular expressions match strings will not be affected---although it is much harder
       
   679 to establish. 
       
   680 Some initial results in this regard have been
       
   681 obtained in \cite{AusafDyckhoffUrban2016}. 
       
   682 
       
   683 
       
   684 
       
   685 %Brzozowski, after giving the derivatives and simplification,
       
   686 %did not explore lexing with simplification, or he may well be 
       
   687 %stuck on an efficient simplification with proof.
       
   688 %He went on to examine the use of derivatives together with 
       
   689 %automaton, and did not try lexing using products.
       
   690 
       
   691 We want to get rid of the complex and fragile rectification of values.
       
   692 Can we not create those intermediate values $v_1,\ldots v_n$,
       
   693 and get the lexing information that should be already there while
       
   694 doing derivatives in one pass, without a second injection phase?
       
   695 In the meantime, can we make sure that simplifications
       
   696 are easily handled without breaking the correctness of the algorithm?
       
   697 
       
   698 Sulzmann and Lu solved this problem by
       
   699 introducing additional information to the 
       
   700 regular expressions called \emph{bitcodes}.
       
   701 
       
   702 
       
   703 
       
   704 
       
   705 
       
   706 
       
   707