diff -r 3cbcd7cda0a9 -r 0497408a3598 ChengsongTanPhdThesis/Chapters/Chapter2.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex Wed Jul 13 08:27:28 2022 +0100 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,689 +0,0 @@ -% Chapter Template - -\chapter{Regular Expressions and POSIX Lexing} % Main chapter title - -\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts -%and notations we -%use for describing the lexing algorithm by Sulzmann and Lu, -%and then give the algorithm and its variant, and discuss -%why more aggressive simplifications are needed. - - -\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions} -We have a primitive datatype char, denoting characters. -\[ char ::= a - \mid b - \mid c - \mid \ldots - \mid z -\] -(which one is better?) -\begin{center} -\begin{tabular}{lcl} -$\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\ -\end{tabular} -\end{center} -They can form strings by lists: -\begin{center} -\begin{tabular}{lcl} -$\textit{string}$ & $\dn$ & $[] | c :: cs$\\ -& & $(c\; \text{has char type})$ -\end{tabular} -\end{center} -And strings can be concatenated to form longer strings: -\begin{center} -\begin{tabular}{lcl} -$[] @ s_2$ & $\dn$ & $s_2$\\ -$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$ -\end{tabular} -\end{center} - -A set of strings can operate with another set of strings: -\begin{center} -\begin{tabular}{lcl} -$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\ -\end{tabular} -\end{center} -We also call the above "language concatenation". -The power of a language is defined recursively, using the -concatenation operator $@$: -\begin{center} -\begin{tabular}{lcl} -$A^0 $ & $\dn$ & $\{ [] \}$\\ -$A^{n+1}$ & $\dn$ & $A^n @ A$ -\end{tabular} -\end{center} -The union of all the natural number powers of a language -is denoted by the Kleene star operator: -\begin{center} -\begin{tabular}{lcl} -$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\ -\end{tabular} -\end{center} - -In Isabelle of course we cannot easily get a counterpart of -the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star -as an inductive set: -\begin{center} -\begin{tabular}{lcl} -$[] \in A^*$ & &\\ -$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\ -\end{tabular} -\end{center} - -We also define an operation of chopping off a character from -a language: -\begin{center} -\begin{tabular}{lcl} -$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ -\end{tabular} -\end{center} - -This can be generalised to chopping off a string from all strings within set $A$: -\begin{center} -\begin{tabular}{lcl} -$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\ -\end{tabular} -\end{center} - -which is essentially the left quotient $A \backslash L'$ of $A$ against -the singleton language $L' = \{w\}$ -in formal language theory. -For this dissertation the $\textit{Ders}$ notation would suffice, there is -no need for a more general derivative definition. - -With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, -we have a few properties of how the language derivative can be defined using -sub-languages. -\begin{lemma} -$\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$ -\end{lemma} -\noindent -This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it -and get to $B$. - -The language $A^*$'s derivative can be described using the language derivative -of $A$: -\begin{lemma} -$\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\ -\end{lemma} -\begin{proof} -\begin{itemize} -\item{$\subseteq$} -The set -\[ \{s \mid c :: s \in A^*\} \] -is enclosed in the set -\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \] -because whenever you have a string starting with a character -in the language of a Kleene star $A^*$, then that character together with some sub-string -immediately after it will form the first iteration, and the rest of the string will -be still in $A^*$. -\item{$\supseteq$} -Note that -\[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \] -and -\[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \] -where the $\textit{RHS}$ of the above equatioin can be rewritten -as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set. -\end{itemize} -\end{proof} -Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart -for regular languages, we need to first give definitions for regular expressions. - -\section{Regular Expressions and Their Language Interpretation} -Suppose we have an alphabet $\Sigma$, the strings whose characters -are from $\Sigma$ -can be expressed as $\Sigma^*$. - -We use patterns to define a set of strings concisely. Regular expressions -are one of such patterns systems: -The basic regular expressions are defined inductively - by the following grammar: -\[ r ::= \ZERO \mid \ONE - \mid c - \mid r_1 \cdot r_2 - \mid r_1 + r_2 - \mid r^* -\] - -The language or set of strings defined by regular expressions are defined as -%TODO: FILL in the other defs -\begin{center} -\begin{tabular}{lcl} -$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\ -$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\ -\end{tabular} -\end{center} -Which are also called the "language interpretation". - - - - -% Derivatives of a -%regular expression, written $r \backslash c$, give a simple solution -%to the problem of matching a string $s$ with a regular -%expression $r$: if the derivative of $r$ w.r.t.\ (in -%succession) all the characters of the string matches the empty string, -%then $r$ matches $s$ (and {\em vice versa}). - - -\section{Brzozowski Derivatives of Regular Expressions} -Now with semantic derivatives of a language and regular expressions and -their language interpretations, we are ready to define derivatives on regexes. - -The Brzozowski derivative w.r.t character $c$ is an operation on the regex, -where the operation transforms the regex to a new one containing -strings without the head character $c$. - -The derivative of regular expression, denoted as -$r \backslash c$, is a function that takes parameters -$r$ and $c$, and returns another regular expression $r'$, -which is computed by the following recursive function: - -\begin{center} -\begin{tabular}{lcl} - $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ - $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ - $d \backslash c$ & $\dn$ & - $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ -$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ -$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ - & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ - & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ - $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ -\end{tabular} -\end{center} -\noindent -The function derivative, written $r\backslash c$, -defines how a regular expression evolves into -a new regular expression after all the string it contains -is chopped off a certain head character $c$. -The most involved cases are the sequence -and star case. -The sequence case says that if the first regular expression -contains an empty string then the second component of the sequence -might be chosen as the target regular expression to be chopped -off its head character. -The star regular expression's derivative unwraps the iteration of -regular expression and attaches the star regular expression -to the sequence's second element to make sure a copy is retained -for possible more iterations in later phases of lexing. - - -The $\nullable$ function tests whether the empty string $""$ -is in the language of $r$: - - -\begin{center} - \begin{tabular}{lcl} - $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ - $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ - $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ - $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ - $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ - $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ - \end{tabular} -\end{center} -\noindent -The empty set does not contain any string and -therefore not the empty string, the empty string -regular expression contains the empty string -by definition, the character regular expression -is the singleton that contains character only, -and therefore does not contain the empty string, -the alternative regular expression (or "or" expression) -might have one of its children regular expressions -being nullable and any one of its children being nullable -would suffice. The sequence regular expression -would require both children to have the empty string -to compose an empty string and the Kleene star -operation naturally introduced the empty string. - -We have the following property where the derivative on regular -expressions coincides with the derivative on a set of strings: - -\begin{lemma} -$\textit{Der} \; c \; L(r) = L (r\backslash c)$ -\end{lemma} - -\noindent - - -The main property of the derivative operation -that enables us to reason about the correctness of -an algorithm using derivatives is - -\begin{center} -$c\!::\!s \in L(r)$ holds -if and only if $s \in L(r\backslash c)$. -\end{center} - -\noindent -We can generalise the derivative operation shown above for single characters -to strings as follows: - -\begin{center} -\begin{tabular}{lcl} -$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ -$r \backslash [\,] $ & $\dn$ & $r$ -\end{tabular} -\end{center} - -\noindent -and then define Brzozowski's regular-expression matching algorithm as: - -\begin{definition} -$match\;s\;r \;\dn\; nullable(r\backslash s)$ -\end{definition} - -\noindent -Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, -this algorithm presented graphically is as follows: - -\begin{equation}\label{graph:successive_ders} -\begin{tikzcd} -r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} -\end{tikzcd} -\end{equation} - -\noindent -where we start with a regular expression $r_0$, build successive -derivatives until we exhaust the string and then use \textit{nullable} -to test whether the result can match the empty string. It can be -relatively easily shown that this matcher is correct (that is given -an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). - -Beautiful and simple definition. - -If we implement the above algorithm naively, however, -the algorithm can be excruciatingly slow. - - -\begin{figure} -\centering -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=10000, - ytick={0,1000,...,10000}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={JavaScript}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; -\end{axis} -\end{tikzpicture}\\ -\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings - of the form $\underbrace{aa..a}_{n}$.} -\end{tabular} -\caption{EightThousandNodes} \label{fig:EightThousandNodes} -\end{figure} - - -(8000 node data to be added here) -For example, when starting with the regular -expression $(a + aa)^*$ and building a few successive derivatives (around 10) -w.r.t.~the character $a$, one obtains a derivative regular expression -with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}. -The reason why $(a + aa) ^*$ explodes so drastically is that without -pruning, the algorithm will keep records of all possible ways of matching: -\begin{center} -$(a + aa) ^* \backslash [aa] = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ -\end{center} - -\noindent -Each of the above alternative branches correspond to the match -$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete). -These different ways of matching will grow exponentially with the string length, -and without simplifications that throw away some of these very similar matchings, -it is no surprise that these expressions grow so quickly. -Operations like -$\backslash$ and $\nullable$ need to traverse such trees and -consequently the bigger the size of the derivative the slower the -algorithm. - -Brzozowski was quick in finding that during this process a lot useless -$\ONE$s and $\ZERO$s are generated and therefore not optimal. -He also introduced some "similarity rules", such -as $P+(Q+R) = (P+Q)+R$ to merge syntactically -different but language-equivalent sub-regexes to further decrease the size -of the intermediate regexes. - -More simplifications are possible, such as deleting duplicates -and opening up nested alternatives to trigger even more simplifications. -And suppose we apply simplification after each derivative step, and compose -these two operations together as an atomic one: $a \backslash_{simp}\,c \dn -\textit{simp}(a \backslash c)$. Then we can build -a matcher with simpler regular expressions. - -If we want the size of derivatives in the algorithm to -stay even lower, we would need more aggressive simplifications. -Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as -delete duplicates whenever possible. For example, the parentheses in -$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b -\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another -example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just -$a^*+a+\ONE$. These more aggressive simplification rules are for - a very tight size bound, possibly as low - as that of the \emph{partial derivatives}\parencite{Antimirov1995}. - -Building derivatives and then simplifying them. -So far, so good. But what if we want to -do lexing instead of just getting a YES/NO answer? -This requires us to go back again to the world -without simplification first for a moment. -Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and -elegant(arguably as beautiful as the original -derivatives definition) solution for this. - -\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} - - -They first defined the datatypes for storing the -lexing information called a \emph{value} or -sometimes also \emph{lexical value}. These values and regular -expressions correspond to each other as illustrated in the following -table: - -\begin{center} - \begin{tabular}{c@{\hspace{20mm}}c} - \begin{tabular}{@{}rrl@{}} - \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ - $r$ & $::=$ & $\ZERO$\\ - & $\mid$ & $\ONE$ \\ - & $\mid$ & $c$ \\ - & $\mid$ & $r_1 \cdot r_2$\\ - & $\mid$ & $r_1 + r_2$ \\ - \\ - & $\mid$ & $r^*$ \\ - \end{tabular} - & - \begin{tabular}{@{\hspace{0mm}}rrl@{}} - \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ - $v$ & $::=$ & \\ - & & $\Empty$ \\ - & $\mid$ & $\Char(c)$ \\ - & $\mid$ & $\Seq\,v_1\, v_2$\\ - & $\mid$ & $\Left(v)$ \\ - & $\mid$ & $\Right(v)$ \\ - & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ - \end{tabular} - \end{tabular} -\end{center} - -\noindent - -Building on top of Sulzmann and Lu's attempt to formalise the -notion of POSIX lexing rules \parencite{Sulzmann2014}, -Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled -POSIX matching as a ternary relation recursively defined in a -natural deduction style. -With the formally-specified rules for what a POSIX matching is, -they proved in Isabelle/HOL that the algorithm gives correct results. - -But having a correct result is still not enough, -we want at least some degree of $\mathbf{efficiency}$. - - - -One regular expression can have multiple lexical values. For example -for the regular expression $(a+b)^*$, it has a infinite list of -values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, -$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$, -$\ldots$, and vice versa. -Even for the regular expression matching a particular string, there could -be more than one value corresponding to it. -Take the example where $r= (a^*\cdot a^*)^*$ and the string -$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. -If we do not allow any empty iterations in its lexical values, -there will be $n - 1$ "splitting points" on $s$ we can choose to -split or not so that each sub-string -segmented by those chosen splitting points will form different iterations: -\begin{center} -\begin{tabular}{lcr} -$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$\\ -$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$\\ -$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$\\ - & $\textit{etc}.$ & - \end{tabular} -\end{center} - -And for each iteration, there are still multiple ways to split -between the two $a^*$s. -It is not surprising there are exponentially many lexical values -that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and -$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. - -A lexer to keep all the possible values will naturally -have an exponential runtime on ambiguous regular expressions. -Somehow one has to decide which lexical value to keep and -output in a lexing algorithm. -In practice, we are usually -interested in POSIX values, which by intuition always -\begin{itemize} -\item -match the leftmost regular expression when multiple options of matching -are available -\item -always match a subpart as much as possible before proceeding -to the next token. -\end{itemize} -The formal definition of a $\POSIX$ value $v$ for a regular expression -$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified -in the following set of rules: -(TODO: write the entire set of inference rules for POSIX ) -\newcommand*{\inference}[3][t]{% - \begingroup - \def\and{\\}% - \begin{tabular}[#1]{@{\enspace}c@{\enspace}} - #2 \\ - \hline - #3 - \end{tabular}% - \endgroup -} -\begin{center} -\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ } -\end{center} - -The reason why we are interested in $\POSIX$ values is that they can -be practically used in the lexing phase of a compiler front end. -For instance, when lexing a code snippet -$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized -as an identifier rather than a keyword. - - For example, the above $r= (a^*\cdot a^*)^*$ and -$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value -$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. -The output of an algorithm we want would be a POSIX matching -encoded as a value. - - - - -The contribution of Sulzmann and Lu is an extension of Brzozowski's -algorithm by a second phase (the first phase being building successive -derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value -is generated if the regular expression matches the string. -How can we construct a value out of regular expressions and character -sequences only? -Two functions are involved: $\inj$ and $\mkeps$. -The function $\mkeps$ constructs a value from the last -one of all the successive derivatives: -\begin{ceqn} -\begin{equation}\label{graph:mkeps} -\begin{tikzcd} -r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\ - & & & v_n -\end{tikzcd} -\end{equation} -\end{ceqn} - -It tells us how can an empty string be matched by a -regular expression, in a $\POSIX$ way: - - \begin{center} - \begin{tabular}{lcl} - $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ - $\mkeps(r_{1}+r_{2})$ & $\dn$ - & \textit{if} $\nullable(r_{1})$\\ - & & \textit{then} $\Left(\mkeps(r_{1}))$\\ - & & \textit{else} $\Right(\mkeps(r_{2}))$\\ - $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ - $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ - \end{tabular} - \end{center} - - -\noindent -We favour the left to match an empty string if there is a choice. -When there is a star for us to match the empty string, -we give the $\Stars$ constructor an empty list, meaning -no iterations are taken. - - -After the $\mkeps$-call, we inject back the characters one by one in order to build -the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ -($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. -After injecting back $n$ characters, we get the lexical value for how $r_0$ -matches $s$. The POSIX value is maintained throughout the process. -For this Sulzmann and Lu defined a function that reverses -the ``chopping off'' of characters during the derivative phase. The -corresponding function is called \emph{injection}, written -$\textit{inj}$; it takes three arguments: the first one is a regular -expression ${r_{i-1}}$, before the character is chopped off, the second -is a character ${c_{i-1}}$, the character we want to inject and the -third argument is the value ${v_i}$, into which one wants to inject the -character (it corresponds to the regular expression after the character -has been chopped off). The result of this function is a new value. -\begin{ceqn} -\begin{equation}\label{graph:inj} -\begin{tikzcd} -r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ -v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] -\end{tikzcd} -\end{equation} -\end{ceqn} - - -\noindent -The -definition of $\textit{inj}$ is as follows: - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} - $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ - $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ - $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ - $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ -\end{tabular} -\end{center} - -\noindent This definition is by recursion on the ``shape'' of regular -expressions and values. -The clauses do one thing--identifying the ``hole'' on a -value to inject the character back into. -For instance, in the last clause for injecting back to a value -that would turn into a new star value that corresponds to a star, -we know it must be a sequence value. And we know that the first -value of that sequence corresponds to the child regex of the star -with the first character being chopped off--an iteration of the star -that had just been unfolded. This value is followed by the already -matched star iterations we collected before. So we inject the character -back to the first value and form a new value with this latest iteration -being added to the previous list of iterations, all under the $\Stars$ -top level. - -Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, -we have a lexer with the following recursive definition: -\begin{center} -\begin{tabular}{lcr} -$\lexer \; r \; [] $ & $=$ & $\mkeps \; r$\\ -$\lexer \; r \;c::s$ & $=$ & $\inj \; r \; c (\lexer (r\backslash c) s)$ -\end{tabular} -\end{center} - -Pictorially, the algorithm is as follows: - -\begin{ceqn} -\begin{equation}\label{graph:2} -\begin{tikzcd} -r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ -v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] -\end{tikzcd} -\end{equation} -\end{ceqn} - - -\noindent -For convenience, we shall employ the following notations: the regular -expression we start with is $r_0$, and the given string $s$ is composed -of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the -left to right, we build the derivatives $r_1$, $r_2$, \ldots according -to the characters $c_0$, $c_1$ until we exhaust the string and obtain -the derivative $r_n$. We test whether this derivative is -$\textit{nullable}$ or not. If not, we know the string does not match -$r$, and no value needs to be generated. If yes, we start building the -values incrementally by \emph{injecting} back the characters into the -earlier values $v_n, \ldots, v_0$. This is the second phase of the -algorithm from right to left. For the first value $v_n$, we call the -function $\textit{mkeps}$, which builds a POSIX lexical value -for how the empty string has been matched by the (nullable) regular -expression $r_n$. This function is defined as - - - -We have mentioned before that derivatives without simplification -can get clumsy, and this is true for values as well--they reflect -the size of the regular expression by definition. - -One can introduce simplification on the regex and values but have to -be careful not to break the correctness, as the injection -function heavily relies on the structure of the regexes and values -being correct and matching each other. -It can be achieved by recording some extra rectification functions -during the derivatives step, and applying these rectifications in -each run during the injection phase. -And we can prove that the POSIX value of how -regular expressions match strings will not be affected---although it is much harder -to establish. -Some initial results in this regard have been -obtained in \cite{AusafDyckhoffUrban2016}. - - - -%Brzozowski, after giving the derivatives and simplification, -%did not explore lexing with simplification, or he may well be -%stuck on an efficient simplification with proof. -%He went on to examine the use of derivatives together with -%automaton, and did not try lexing using products. - -We want to get rid of the complex and fragile rectification of values. -Can we not create those intermediate values $v_1,\ldots v_n$, -and get the lexing information that should be already there while -doing derivatives in one pass, without a second injection phase? -In the meantime, can we make sure that simplifications -are easily handled without breaking the correctness of the algorithm? - -Sulzmann and Lu solved this problem by -introducing additional information to the -regular expressions called \emph{bitcodes}. - - - - - - -