r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be shown relatively easily that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \Longleftrightarrow s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of $\backslash$ (\ref{derStepwise}).
\end{proof}
|
\noindent
If we implement the above algorithm naively, however,
it can be excruciatingly slow.
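For reference, this naive matcher corresponds to the following Scala sketch
(a direct transcription; \texttt{der} and \texttt{nullable} are the functions
from earlier in this chapter, and the names \texttt{ders} and \texttt{matcher}
are only chosen for this sketch):
\begin{verbatim}
// successive derivatives w.r.t. all characters of the string
def ders(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: take derivatives, then test for nullability
def matcher(s: String, r: Rexp) : Boolean = nullable(ders(s.toList, r))
\end{verbatim}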
|
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
|
\noindent
For this we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
We use a simplification function that is kept as simple as possible
while still having considerable power in making a regular expression smaller:
|
\begin{verbatim}
def simp(r: Rexp) : Rexp = r match {
  case SEQ(r1, r2) =>
    (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, r2s) => r2s
      case (r1s, ONE) => r1s
      case (r1s, r2s) => SEQ(r1s, r2s)
    }
  case ALTS(r1, r2) => {
    (simp(r1), simp(r2)) match {
      case (ZERO, r2s) => r2s
      case (r1s, ZERO) => r1s
      case (r1s, r2s) =>
        if (r1s == r2s) r1s else ALTS(r1s, r2s)
    }
  }
  case r => r
}
\end{verbatim}
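\noindent
As a small example of what this function does (assuming the constructor
\texttt{CHAR} for character regular expressions from earlier),
\texttt{simp(ALTS(ZERO, SEQ(ONE, CHAR('a'))))} evaluates to \texttt{CHAR('a')}:
the unmatchable left alternative is discarded and the superfluous \texttt{ONE}
in the sequence is removed.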
|
If we repeatedly incorporate these
rules during the matching algorithm,
we obtain a matcher with simplification:
\begin{verbatim}
def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => simp(r)
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def simp_matcher(s: String, r: Rexp) : Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}
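As a quick sanity check, this matcher can be exercised as follows (a
hypothetical driver; \texttt{CHAR}, \texttt{SEQ} and \texttt{STAR} are the
\texttt{Rexp} constructors assumed from earlier):
\begin{verbatim}
// (a*)* . b -- the regular expression from the experiment above
val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))

// with simplification this returns false almost instantly
simp_matcher("a" * 100, EVIL)
\end{verbatim}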
|
\noindent
After putting in those rules, the example from Figure~\ref{NaiveMatcher}
is now very tame with respect to the length of the input:


\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    legend entries={Matcher With Simp},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture} \label{fig:BetterMatcher}


\noindent
Note how the $x$-axis is in logarithmic scale.
To summarise: we build derivatives and then test whether the empty string is
in the language of the resulting regular expression, adding simplification
rules where necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
\citeauthor{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the hybrid phases of a regular expression lexing
algorithm using the function $\inj$, as given by Sulzmann and Lu.
They first defined the datatypes for storing the
lexing information called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
|
\begin{proof}
By induction on $s$, $r$ and $v_1$. The induction principle is given by
the $\POSIX$ rules. Each case is proven by a combination of
the induction rules for $\POSIX$ values and the inductive hypothesis.
Probably the most cumbersome cases are the sequence and star with non-empty iterations.

We give the reasoning for the sequence case as follows.
When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there cannot be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
possibly form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2' = v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is in fact the same as $\Seq(v_1, v_2)$.
\end{proof}
Now that we know what a $\POSIX$ value is, the problem is how to obtain
such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this latest iteration
added to the previous list of iterations, all under the $\Stars$
top level.
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj \; r \; c\; v)$
\end{lemma}

Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
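This definition can be transcribed into Scala almost verbatim. The sketch
below assumes a datatype \texttt{Val} for values and the functions
\texttt{mkeps} and \texttt{inj} described above (the name \texttt{lexer} and
the use of \texttt{Option} for the possibility of a non-match are choices
made for this sketch):
\begin{verbatim}
def lexer(r: Rexp, s: List[Char]) : Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}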
\noindent
The central property of $\lexer$ is that it gives the correct result
according to the $\POSIX$ standard:
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}

Pictorially, the algorithm is as follows (for convenience, we employ the
following notations: the regular expression we start with is $r_0$ and the
given string $s$ is composed of the characters $c_0 c_1 \ldots c_{n-1}$; the
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$; corresponding values and regular
expressions always carry the same subscript, i.e.~$\vdash v_i : r_i$):

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, we now also have a backward phase and need to make sure that
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function relies heavily on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run of the injection phase.
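To illustrate the idea, the sketch below simplifies only alternatives and
pairs each simplification with a rectification function that maps a value
for the simplified regular expression back to a value for the original one
(the function name \texttt{simp\_rect} and the value constructors
\texttt{Left} and \texttt{Right} for alternatives are assumptions of this
sketch, not definitions taken from the text):
\begin{verbatim}
def simp_rect(r: Rexp) : (Rexp, Val => Val) = r match {
  case ALTS(r1, r2) =>
    val (r1s, f1) = simp_rect(r1)
    val (r2s, f2) = simp_rect(r2)
    (r1s, r2s) match {
      // left branch unmatchable: values of r2s come from the right branch
      case (ZERO, _) => (r2s, (v: Val) => Right(f2(v)))
      // right branch unmatchable: values of r1s come from the left branch
      case (_, ZERO) => (r1s, (v: Val) => Left(f1(v)))
      // duplicate branches collapsed: rectify towards the left branch
      case _ if r1s == r2s => (r1s, (v: Val) => Left(f1(v)))
      case _ => (ALTS(r1s, r2s), (v: Val) => v match {
        case Left(v1)  => Left(f1(v1))
        case Right(v2) => Right(f2(v2))
      })
    }
  // all other regular expressions are left alone in this sketch
  case _ => (r, (v: Val) => v)
}
\end{verbatim}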
|

\ChristianComment{Do I introduce the lexer with rectification here?}
One can prove that the POSIX value describing how a regular expression
matches a string is not affected by such rectified
simplifications---although this is much harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we can still end up in
trouble when we encounter cases that require more involved and aggressive
simplifications.

\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree),
even with simplification.
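Such numbers can be reproduced by counting the nodes of a regular expression
with a straightforward \texttt{size} function; the sketch below (the
constructors \texttt{CHAR} and \texttt{STAR} are assumed from earlier, and
the function itself is ours, not taken from the text) illustrates one way to
do this:
\begin{verbatim}
def size(r: Rexp) : Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALTS(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)  => 1 + size(r1) + size(r2)
  case STAR(r1)     => 1 + size(r1)
}

val r = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))  // (a* . a*)*
for (n <- 1 to 10)
  println(size(ders_simp(("a" * n).toList, r)))
\end{verbatim}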
628 Can we not create those intermediate values $v_1,\ldots v_n$, |
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Derivative Size},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\caption{Size of the derivatives of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that will never be used:
the different ways in which a regular expression can match a string
grow exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ ``splitting points'' in $s$ which we can independently choose to
split at or not, so that each sub-string
segmented by the chosen splitting points forms a different iteration.
Since each splitting point can be chosen independently, this already gives at
least $2^{n-1}$ different lexical values.
For example, when $n=4$:
\begin{center}
\begin{tabular}{lcl}
$aaaa$ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (one iteration; it is further divided by the inner sequence $a^*\cdot a^*$)\\
$a \mid aaa$ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa$ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa \mid a$ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a \mid a$ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
& $\textit{etc}.$ &
\end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all the possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all the different values
of a match. This means Sulzmann and Lu's injection-based algorithm
will be exponential by nature.
Somehow one has to determine which
lexical values are $\POSIX$ and need to be kept in a lexing algorithm.

For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep only this value, and remove during lexing all the parts of the
regular expression that do not correspond to it.
For this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots, v_n$,
and instead get the lexing information that is already there while
doing the derivatives, in one pass and without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.
|
|
|
|