ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 565 0497408a3598
parent 564 3cbcd7cda0a9
child 566 94604a5fd271
equal deleted inserted replaced
564:3cbcd7cda0a9 565:0497408a3598
     1 % Chapter Template
       
     2 
       
     3 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
       
     4 
       
     5 \label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
       
     6 %and notations we 
       
     7 %use for describing the lexing algorithm by Sulzmann and Lu,
       
     8 %and then give the algorithm and its variant, and discuss
       
     9 %why more aggressive simplifications are needed. 
       
    10 
       
    11 
       
    12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
       
    13 We have a primitive datatype char, denoting characters.
       
    14 \[			char ::=  a
       
    15 			 \mid b
       
    16 			 \mid c
       
    17 			 \mid  \ldots
       
    18 			 \mid z       
       
    19 \]
       
    20 (which one is better?)
       
    21 \begin{center}
       
    22 \begin{tabular}{lcl}
       
    23 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
       
    24 \end{tabular}
       
    25 \end{center}
       
    26 They can form strings by lists:
       
    27 \begin{center}
       
    28 \begin{tabular}{lcl}
       
    29 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
       
    30 & & $(c\; \text{has char type})$
       
    31 \end{tabular}
       
    32 \end{center}
       
    33 And strings can be concatenated to form longer strings:
       
    34 \begin{center}
       
    35 \begin{tabular}{lcl}
       
    36 $[] @ s_2$ & $\dn$ & $s_2$\\
       
    37 $(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
       
    38 \end{tabular}
       
    39 \end{center}
       
    40 
       
    41 A set of strings can operate with another set of strings:
       
    42 \begin{center}
       
    43 \begin{tabular}{lcl}
       
    44 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
       
    45 \end{tabular}
       
    46 \end{center}
       
    47 We also call the above "language concatenation".
       
    48 The power of a language is defined recursively, using the 
       
    49 concatenation operator $@$:
       
    50 \begin{center}
       
    51 \begin{tabular}{lcl}
       
    52 $A^0 $ & $\dn$ & $\{ [] \}$\\
       
    53 $A^{n+1}$ & $\dn$ & $A^n @ A$
       
    54 \end{tabular}
       
    55 \end{center}
       
    56 The union of all the natural number powers of a language   
       
    57 is denoted by the Kleene star operator:
       
    58 \begin{center}
       
    59 \begin{tabular}{lcl}
       
    60 $\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
       
    61 \end{tabular}
       
    62 \end{center}
       
    63 
       
    64 In Isabelle of course we cannot easily get a counterpart of
       
    65 the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
       
    66 as an inductive set: 
       
    67 \begin{center}
       
    68 \begin{tabular}{lcl}
       
    69 $[] \in A^*$  & &\\
       
    70 $s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
       
    71 \end{tabular}
       
    72 \end{center}
       
    73 
       
    74 We also define an operation of chopping off a character from
       
    75 a language:
       
    76 \begin{center}
       
    77 \begin{tabular}{lcl}
       
    78 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
       
    79 \end{tabular}
       
    80 \end{center}
       
    81 
       
    82 This can be generalised to chopping off a string from all strings within set $A$:
       
    83 \begin{center}
       
    84 \begin{tabular}{lcl}
       
    85 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
       
    86 \end{tabular}
       
    87 \end{center}
       
    88 
       
    89 which is essentially the left quotient $A \backslash L'$ of $A$ against 
       
    90 the singleton language $L' = \{w\}$
       
    91 in formal language theory.
       
    92 For this dissertation the $\textit{Ders}$ notation would suffice, there is
       
    93 no need for a more general derivative definition.
       
    94 
       
    95 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
       
    96 we have a  few properties of how the language derivative can be defined using 
       
    97 sub-languages.
       
    98 \begin{lemma}
       
    99 $\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
       
   100 \end{lemma}
       
   101 \noindent
       
   102 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
       
   103 and get to $B$.
       
   104 
       
   105 The language $A^*$'s derivative can be described using the language derivative
       
   106 of $A$:
       
   107 \begin{lemma}
       
   108 $\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
       
   109 \end{lemma}
       
   110 \begin{proof}
       
   111 \begin{itemize}
       
   112 \item{$\subseteq$}
       
   113 The set 
       
   114 \[ \{s \mid c :: s \in A^*\} \]
       
   115 is enclosed in the set
       
   116 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
       
   117 because whenever you have a string starting with a character 
       
   118 in the language of a Kleene star $A^*$, then that character together with some sub-string
       
   119 immediately after it will form the first iteration, and the rest of the string will 
       
   120 be still in $A^*$.
       
   121 \item{$\supseteq$}
       
   122 Note that
       
   123 \[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
       
   124 and 
       
   125 \[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
       
   126 where the $\textit{RHS}$ of the above equatioin can be rewritten
       
   127 as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
       
   128 \end{itemize}
       
   129 \end{proof}
       
   130 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
       
   131 for regular languages, we need to first give definitions for regular expressions.
       
   132 
       
   133 \section{Regular Expressions and Their Language Interpretation}
       
   134 Suppose we have an alphabet $\Sigma$, the strings  whose characters
       
   135 are from $\Sigma$
       
   136 can be expressed as $\Sigma^*$.
       
   137 
       
   138 We use patterns to define a set of strings concisely. Regular expressions
       
   139 are one of such patterns systems:
       
   140 The basic regular expressions  are defined inductively
       
   141  by the following grammar:
       
   142 \[			r ::=   \ZERO \mid  \ONE
       
   143 			 \mid  c  
       
   144 			 \mid  r_1 \cdot r_2
       
   145 			 \mid  r_1 + r_2   
       
   146 			 \mid r^*         
       
   147 \]
       
   148 
       
   149 The language or set of strings defined by regular expressions are defined as
       
   150 %TODO: FILL in the other defs
       
   151 \begin{center}
       
   152 \begin{tabular}{lcl}
       
   153 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
       
   154 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
       
   155 \end{tabular}
       
   156 \end{center}
       
   157 Which are also called the "language interpretation".
       
   158 
       
   159 
       
   160 
       
   161 
       
   162 % Derivatives of a
       
   163 %regular expression, written $r \backslash c$, give a simple solution
       
   164 %to the problem of matching a string $s$ with a regular
       
   165 %expression $r$: if the derivative of $r$ w.r.t.\ (in
       
   166 %succession) all the characters of the string matches the empty string,
       
   167 %then $r$ matches $s$ (and {\em vice versa}).  
       
   168 
       
   169 
       
   170 \section{Brzozowski Derivatives of Regular Expressions}
       
   171 Now with semantic derivatives of a language and regular expressions and
       
   172 their language interpretations, we are ready to define derivatives on regexes.
       
   173 
       
   174 The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
       
   175 where the operation transforms the regex to a new one containing
       
   176 strings without the head character $c$.
       
   177 
       
   178 The  derivative of regular expression, denoted as
       
   179 $r \backslash c$, is a function that takes parameters
       
   180 $r$ and $c$, and returns another regular expression $r'$,
       
   181 which is computed by the following recursive function:
       
   182 
       
   183 \begin{center}
       
   184 \begin{tabular}{lcl}
       
   185 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   186 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   187 		$d \backslash c$     & $\dn$ & 
       
   188 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   189 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   190 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   191 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   192 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   193 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   194 \end{tabular}
       
   195 \end{center}
       
   196 \noindent
       
   197 The function derivative, written $r\backslash c$, 
       
   198 defines how a regular expression evolves into
       
   199 a new regular expression after all the string it contains
       
   200 is chopped off a certain head character $c$.
       
   201 The most involved cases are the sequence 
       
   202 and star case.
       
   203 The sequence case says that if the first regular expression
       
   204 contains an empty string then the second component of the sequence
       
   205 might be chosen as the target regular expression to be chopped
       
   206 off its head character.
       
   207 The star regular expression's derivative unwraps the iteration of
       
   208 regular expression and attaches the star regular expression
       
   209 to the sequence's second element to make sure a copy is retained
       
   210 for possible more iterations in later phases of lexing.
       
   211 
       
   212 
       
   213 The $\nullable$ function tests whether the empty string $""$ 
       
   214 is in the language of $r$:
       
   215 
       
   216 
       
   217 \begin{center}
       
   218 		\begin{tabular}{lcl}
       
   219 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   220 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   221 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   222 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   223 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   224 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   225 		\end{tabular}
       
   226 \end{center}
       
   227 \noindent
       
   228 The empty set does not contain any string and
       
   229 therefore not the empty string, the empty string 
       
   230 regular expression contains the empty string
       
   231 by definition, the character regular expression
       
   232 is the singleton that contains character only,
       
   233 and therefore does not contain the empty string,
       
   234 the alternative regular expression (or "or" expression)
       
   235 might have one of its children regular expressions
       
   236 being nullable and any one of its children being nullable
       
   237 would suffice. The sequence regular expression
       
   238 would require both children to have the empty string
       
   239 to compose an empty string and the Kleene star
       
   240 operation naturally introduced the empty string. 
       
   241   
       
   242 We have the following property where the derivative on regular 
       
   243 expressions coincides with the derivative on a set of strings:
       
   244 
       
   245 \begin{lemma}
       
   246 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
       
   247 \end{lemma}
       
   248 
       
   249 \noindent
       
   250 
       
   251 
       
   252 The main property of the derivative operation
       
   253 that enables us to reason about the correctness of
       
   254 an algorithm using derivatives is 
       
   255 
       
   256 \begin{center}
       
   257 $c\!::\!s \in L(r)$ holds
       
   258 if and only if $s \in L(r\backslash c)$.
       
   259 \end{center}
       
   260 
       
   261 \noindent
       
   262 We can generalise the derivative operation shown above for single characters
       
   263 to strings as follows:
       
   264 
       
   265 \begin{center}
       
   266 \begin{tabular}{lcl}
       
   267 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   268 $r \backslash [\,] $ & $\dn$ & $r$
       
   269 \end{tabular}
       
   270 \end{center}
       
   271 
       
   272 \noindent
       
   273 and then define Brzozowski's  regular-expression matching algorithm as:
       
   274 
       
   275 \begin{definition}
       
   276 $match\;s\;r \;\dn\; nullable(r\backslash s)$
       
   277 \end{definition}
       
   278 
       
   279 \noindent
       
   280 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
       
   281 this algorithm presented graphically is as follows:
       
   282 
       
   283 \begin{equation}\label{graph:successive_ders}
       
   284 \begin{tikzcd}
       
   285 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   286 \end{tikzcd}
       
   287 \end{equation}
       
   288 
       
   289 \noindent
       
   290 where we start with  a regular expression  $r_0$, build successive
       
   291 derivatives until we exhaust the string and then use \textit{nullable}
       
   292 to test whether the result can match the empty string. It can  be
       
   293 relatively  easily shown that this matcher is correct  (that is given
       
   294 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   295 
       
   296 Beautiful and simple definition.
       
   297 
       
   298 If we implement the above algorithm naively, however,
       
   299 the algorithm can be excruciatingly slow. 
       
   300 
       
   301 
       
   302 \begin{figure}
       
   303 \centering
       
   304 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   305 \begin{tikzpicture}
       
   306 \begin{axis}[
       
   307     xlabel={$n$},
       
   308     x label style={at={(1.05,-0.05)}},
       
   309     ylabel={time in secs},
       
   310     enlargelimits=false,
       
   311     xtick={0,5,...,30},
       
   312     xmax=33,
       
   313     ymax=10000,
       
   314     ytick={0,1000,...,10000},
       
   315     scaled ticks=false,
       
   316     axis lines=left,
       
   317     width=5cm,
       
   318     height=4cm, 
       
   319     legend entries={JavaScript},  
       
   320     legend pos=north west,
       
   321     legend cell align=left]
       
   322 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
       
   323 \end{axis}
       
   324 \end{tikzpicture}\\
       
   325 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   326            of the form $\underbrace{aa..a}_{n}$.}
       
   327 \end{tabular}    
       
   328 \caption{EightThousandNodes} \label{fig:EightThousandNodes}
       
   329 \end{figure}
       
   330 
       
   331 
       
   332 (8000 node data to be added here)
       
   333 For example, when starting with the regular
       
   334 expression $(a + aa)^*$ and building a few successive derivatives (around 10)
       
   335 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   336 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
       
   337 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   338 pruning, the algorithm will keep records of all possible ways of matching:
       
   339 \begin{center}
       
   340 $(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   341 \end{center}
       
   342 
       
   343 \noindent
       
   344 Each of the above alternative branches correspond to the match 
       
   345 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
       
   346 These different ways of matching will grow exponentially with the string length,
       
   347 and without simplifications that throw away some of these very similar matchings,
       
   348 it is no surprise that these expressions grow so quickly.
       
   349 Operations like
       
   350 $\backslash$ and $\nullable$ need to traverse such trees and
       
   351 consequently the bigger the size of the derivative the slower the
       
   352 algorithm. 
       
   353 
       
   354 Brzozowski was quick in finding that during this process a lot useless
       
   355 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
       
   356 He also introduced some "similarity rules", such
       
   357 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
       
   358 different but language-equivalent sub-regexes to further decrease the size
       
   359 of the intermediate regexes. 
       
   360 
       
   361 More simplifications are possible, such as deleting duplicates
       
   362 and opening up nested alternatives to trigger even more simplifications.
       
   363 And suppose we apply simplification after each derivative step, and compose
       
   364 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
       
   365 \textit{simp}(a \backslash c)$. Then we can build
       
   366 a matcher with simpler regular expressions.
       
   367 
       
   368 If we want the size of derivatives in the algorithm to
       
   369 stay even lower, we would need more aggressive simplifications.
       
   370 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   371 delete duplicates whenever possible. For example, the parentheses in
       
   372 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
       
   373 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   374 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
   375 $a^*+a+\ONE$.  These more aggressive simplification rules are for
       
   376  a very tight size bound, possibly as low
       
   377   as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 
       
   378 
       
   379 Building derivatives and then simplifying them.
       
   380 So far, so good. But what if we want to 
       
   381 do lexing instead of just getting a YES/NO answer?
       
   382 This requires us to go back again to the world 
       
   383 without simplification first for a moment.
       
   384 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   385 elegant(arguably as beautiful as the original
       
   386 derivatives definition) solution for this.
       
   387 
       
   388 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   389 
       
   390 
       
   391 They first defined the datatypes for storing the 
       
   392 lexing information called a \emph{value} or
       
   393 sometimes also \emph{lexical value}.  These values and regular
       
   394 expressions correspond to each other as illustrated in the following
       
   395 table:
       
   396 
       
   397 \begin{center}
       
   398 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   399 		\begin{tabular}{@{}rrl@{}}
       
   400 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   401 			$r$ & $::=$  & $\ZERO$\\
       
   402 			& $\mid$ & $\ONE$   \\
       
   403 			& $\mid$ & $c$          \\
       
   404 			& $\mid$ & $r_1 \cdot r_2$\\
       
   405 			& $\mid$ & $r_1 + r_2$   \\
       
   406 			\\
       
   407 			& $\mid$ & $r^*$         \\
       
   408 		\end{tabular}
       
   409 		&
       
   410 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   411 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   412 			$v$ & $::=$  & \\
       
   413 			&        & $\Empty$   \\
       
   414 			& $\mid$ & $\Char(c)$          \\
       
   415 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   416 			& $\mid$ & $\Left(v)$   \\
       
   417 			& $\mid$ & $\Right(v)$  \\
       
   418 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   419 		\end{tabular}
       
   420 	\end{tabular}
       
   421 \end{center}
       
   422 
       
   423 \noindent
       
   424 
       
   425 Building on top of Sulzmann and Lu's attempt to formalise the 
       
   426 notion of POSIX lexing rules \parencite{Sulzmann2014}, 
       
   427 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
       
   428 POSIX matching as a ternary relation recursively defined in a
       
   429 natural deduction style.
       
   430 With the formally-specified rules for what a POSIX matching is,
       
   431 they proved in Isabelle/HOL that the algorithm gives correct results.
       
   432 
       
   433 But having a correct result is still not enough, 
       
   434 we want at least some degree of $\mathbf{efficiency}$.
       
   435 
       
   436 
       
   437 
       
   438 One regular expression can have multiple lexical values. For example
       
   439 for the regular expression $(a+b)^*$, it has a infinite list of
       
   440 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   441 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   442 $\ldots$, and vice versa.
       
   443 Even for the regular expression matching a particular string, there could 
       
   444 be more than one value corresponding to it.
       
   445 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   446 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   447 If we do not allow any empty iterations in its lexical values,
       
   448 there will be $n - 1$ "splitting points" on $s$ we can choose to 
       
   449 split or not so that each sub-string
       
   450 segmented by those chosen splitting points will form different iterations:
       
   451 \begin{center}
       
   452 \begin{tabular}{lcr}
       
   453 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
       
   454 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
       
   455 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
       
   456  & $\textit{etc}.$ &
       
   457  \end{tabular}
       
   458 \end{center}
       
   459 
       
   460 And for each iteration, there are still multiple ways to split
       
   461 between the two $a^*$s.
       
   462 It is not surprising there are exponentially many lexical values
       
   463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
       
   464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   465 
       
   466 A lexer to keep all the possible values will naturally 
       
   467 have an exponential runtime on ambiguous regular expressions.
       
   468 Somehow one has to decide which lexical value to keep and
       
   469 output in a lexing algorithm.
       
   470 In practice, we are usually 
       
   471 interested in POSIX values, which by intuition always
       
   472 \begin{itemize}
       
   473 \item
       
   474 match the leftmost regular expression when multiple options of matching
       
   475 are available  
       
   476 \item 
       
   477 always match a subpart as much as possible before proceeding
       
   478 to the next token.
       
   479 \end{itemize}
       
   480 The formal definition of a $\POSIX$ value $v$ for a regular expression
       
   481 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
       
   482 in the following set of rules:
       
   483 (TODO: write the entire set of inference rules for POSIX )
       
   484 \newcommand*{\inference}[3][t]{%
       
   485    \begingroup
       
   486    \def\and{\\}%
       
   487    \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
       
   488    #2 \\
       
   489    \hline
       
   490    #3
       
   491    \end{tabular}%
       
   492    \endgroup
       
   493 }
       
   494 \begin{center}
       
   495 \inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
       
   496 \end{center}
       
   497 
       
   498 The reason why we are interested in $\POSIX$ values is that they can
       
   499 be practically used in the lexing phase of a compiler front end.
       
   500 For instance, when lexing a code snippet 
       
   501 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   502 as an identifier rather than a keyword.
       
   503 
       
   504  For example, the above $r= (a^*\cdot a^*)^*$  and 
       
   505 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
       
   506 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   507 The output of an algorithm we want would be a POSIX matching
       
   508 encoded as a value.
       
   509 
       
   510 
       
   511 
       
   512 
       
   513 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   514 algorithm by a second phase (the first phase being building successive
       
   515 derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
       
   516 is generated if the regular expression matches the string.
       
   517 How can we construct a value out of regular expressions and character
       
   518 sequences only?
       
   519 Two functions are involved: $\inj$ and $\mkeps$.
       
   520 The function $\mkeps$ constructs a value from the last
       
   521 one of all the successive derivatives:
       
   522 \begin{ceqn}
       
   523 \begin{equation}\label{graph:mkeps}
       
   524 \begin{tikzcd}
       
   525 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
       
   526 	        & 	              & 	            & v_n       
       
   527 \end{tikzcd}
       
   528 \end{equation}
       
   529 \end{ceqn}
       
   530 
       
   531 It tells us how can an empty string be matched by a 
       
   532 regular expression, in a $\POSIX$ way:
       
   533 
       
   534 	\begin{center}
       
   535 		\begin{tabular}{lcl}
       
   536 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   537 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   538 			& \textit{if} $\nullable(r_{1})$\\ 
       
   539 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   540 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   541 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
   542 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   543 		\end{tabular}
       
   544 	\end{center}
       
   545 
       
   546 
       
   547 \noindent 
       
   548 We favour the left to match an empty string if there is a choice.
       
   549 When there is a star for us to match the empty string,
       
   550 we give the $\Stars$ constructor an empty list, meaning
       
   551 no iterations are taken.
       
   552 
       
   553 
       
   554 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   557 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   558 matches $s$. The POSIX value is maintained throughout the process.
       
   559 For this Sulzmann and Lu defined a function that reverses
       
   560 the ``chopping off'' of characters during the derivative phase. The
       
   561 corresponding function is called \emph{injection}, written
       
   562 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   563 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   564 is a character ${c_{i-1}}$, the character we want to inject and the
       
   565 third argument is the value ${v_i}$, into which one wants to inject the
       
   566 character (it corresponds to the regular expression after the character
       
   567 has been chopped off). The result of this function is a new value. 
       
   568 \begin{ceqn}
       
   569 \begin{equation}\label{graph:inj}
       
   570 \begin{tikzcd}
       
   571 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
       
   572 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
       
   573 \end{tikzcd}
       
   574 \end{equation}
       
   575 \end{ceqn}
       
   576 
       
   577 
       
   578 \noindent
       
   579 The
       
   580 definition of $\textit{inj}$ is as follows: 
       
   581 
       
   582 \begin{center}
       
   583 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   584   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   585   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   586   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   587   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   588   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   589   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   590   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   591 \end{tabular}
       
   592 \end{center}
       
   593 
       
   594 \noindent This definition is by recursion on the ``shape'' of regular
       
   595 expressions and values. 
       
   596 The clauses do one thing--identifying the ``hole'' on a
       
   597 value to inject the character back into.
       
   598 For instance, in the last clause for injecting back to a value
       
   599 that would turn into a new star value that corresponds to a star,
       
   600 we know it must be a sequence value. And we know that the first 
       
   601 value of that sequence corresponds to the child regex of the star
       
   602 with the first character being chopped off--an iteration of the star
       
   603 that had just been unfolded. This value is followed by the already
       
   604 matched star iterations we collected before. So we inject the character 
       
   605 back to the first value and form a new value with this latest iteration
       
   606 being added to the previous list of iterations, all under the $\Stars$
       
   607 top level.
       
   608 
       
   609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
       
   610 we have a lexer with the following recursive definition:
       
   611 \begin{center}
       
   612 \begin{tabular}{lcr}
       
   613 $\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\
       
   614 $\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$
       
   615 \end{tabular}
       
   616 \end{center}
       
   617  
       
   618 Pictorially, the algorithm is as follows:
       
   619 
       
   620 \begin{ceqn}
       
   621 \begin{equation}\label{graph:2}
       
   622 \begin{tikzcd}
       
   623 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   624 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   625 \end{tikzcd}
       
   626 \end{equation}
       
   627 \end{ceqn}
       
   628 
       
   629 
       
   630 \noindent
       
   631 For convenience, we shall employ the following notations: the regular
       
   632 expression we start with is $r_0$, and the given string $s$ is composed
       
   633 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   634 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   635 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   636 the derivative $r_n$. We test whether this derivative is
       
   637 $\textit{nullable}$ or not. If not, we know the string does not match
       
   638 $r$, and no value needs to be generated. If yes, we start building the
       
   639 values incrementally by \emph{injecting} back the characters into the
       
   640 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   641 algorithm from right to left. For the first value $v_n$, we call the
       
   642 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   643 for how the empty string has been matched by the (nullable) regular
       
   644 expression $r_n$. This function is defined as
       
   645 
       
   646 
       
   647 
       
   648 We have mentioned before that derivatives without simplification 
       
   649 can get clumsy, and this is true for values as well--they reflect
       
   650 the size of the regular expression by definition.
       
   651 
       
   652 One can introduce simplification on the regex and values but have to
       
   653 be careful not to break the correctness, as the injection 
       
   654 function heavily relies on the structure of the regexes and values
       
   655 being correct and matching each other.
       
   656 It can be achieved by recording some extra rectification functions
       
   657 during the derivatives step, and applying these rectifications in 
       
   658 each run during the injection phase.
       
   659 And we can prove that the POSIX value of how
       
   660 regular expressions match strings will not be affected---although it is much harder
       
   661 to establish. 
       
   662 Some initial results in this regard have been
       
   663 obtained in \cite{AusafDyckhoffUrban2016}. 
       
   664 
       
   665 
       
   666 
       
   667 %Brzozowski, after giving the derivatives and simplification,
       
   668 %did not explore lexing with simplification, or he may well be 
       
   669 %stuck on an efficient simplification with proof.
       
   670 %He went on to examine the use of derivatives together with 
       
   671 %automaton, and did not try lexing using products.
       
   672 
       
   673 We want to get rid of the complex and fragile rectification of values.
       
   674 Can we not create those intermediate values $v_1,\ldots v_n$,
       
   675 and get the lexing information that should be already there while
       
   676 doing derivatives in one pass, without a second injection phase?
       
   677 In the meantime, can we make sure that simplifications
       
   678 are easily handled without breaking the correctness of the algorithm?
       
   679 
       
   680 Sulzmann and Lu solved this problem by
       
   681 introducing additional information to the 
       
   682 regular expressions called \emph{bitcodes}.
       
   683 
       
   684 
       
   685 
       
   686 
       
   687 
       
   688 
       
   689