lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex


-:fd068f39ac23
+:ce4e5151a836
 \noindent
 Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
 the algorithm can be presented graphically as follows:
-\begin{equation}\label{graph:successive_ders}
+\begin{equation}\label{matcher}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] &
 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] &
 \;\textrm{true}/\textrm{false}
 \end{tikzcd}
 \begin{proof}
 	By induction on $s$, using the property of derivatives stated in
 	Lemma~\ref{derDer}.
 \end{proof}
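 \noindent
 As a small illustration of the successive-derivative scheme (a sketch that assumes the
 standard Brzozowski derivative clauses and writes $\mathbf{0}$ for the regular expression
 matching no string), take $r_0 = a \cdot b$ and the string $ab$:
 \[
 \begin{array}{lcl}
 r_1 & = & (a \cdot b) \backslash a \;=\; \ONE \cdot b\\
 r_2 & = & (\ONE \cdot b) \backslash b \;=\; (\ONE \backslash b) \cdot b + b \backslash b
           \;=\; \mathbf{0} \cdot b + \ONE
 \end{array}
 \]
 Since the right alternative of $r_2$ is $\ONE$, $r_2$ is nullable and the matcher answers true.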
 \noindent
-\begin{center}
 \begin{figure}
+\begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 ylabel={time in secs},
-ymode = log,
+%ymode = log,
 legend entries={Naive Matcher},
 legend pos=north west,
 legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
 \end{axis}
 \end{tikzpicture}
 \caption{Matching the regular expression $(a^*)^*b$ against strings of the form
 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
+\end{center}
 \end{figure}
-\end{center}
 \noindent
 If we implement the above algorithm naively, however,
 it can be excruciatingly slow, as shown in
 Figure~\ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 \begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 ylabel={time in secs},
-ymode = log,
+%ymode = log,
-xmode = log,
+%xmode = log,
 grid = both,
 legend entries={Matcher With Simp},
 legend pos=north west,
 legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
 \noindent
 The condition $|v| \neq []$ in the premise of the star rule
 ensures that, for a given regular
 expression $r$ and string $s$, the number of values
 satisfying $|v| = s$ and $\vdash v:r$ is finite.
+This additional condition was
+imposed by Ausaf and Urban to make their proofs easier.
 Given the same string and regular expression, there can be
 multiple values for them. For example, both
 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
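 To spell out the flattening (a sketch, assuming the usual clauses
 $|\Seq \; v_1 \; v_2| = |v_1| @ |v_2|$ and $|\Left \; v| = |\Right \; v| = |v|$,
 and reading $ab$ and $bc$ as shorthand for values that flatten to these strings):
 \[
 \begin{array}{lcl}
 |\Seq(\Left \; ab)(\Right \; c)| & = & |\Left \; ab| \,@\, |\Right \; c| \;=\; ab \,@\, c \;=\; abc\\
 |\Seq(\Right \; a)(\Left \; bc)| & = & |\Right \; a| \,@\, |\Left \; bc| \;=\; a \,@\, bc \;=\; abc
 \end{array}
 \]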
 formalised the property
 as a ternary relation.
 The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
 in the following set of rules\footnote{The names of the rules are used
-as they were originally given in \cite{AusafDyckhoffUrban2016}}:
+as they were originally given in \cite{AusafDyckhoffUrban2016} }:
-\noindent
+\begin{figure}[H]
-\begin{figure}
 \begin{mathpar}
 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
 	\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
 	(v::vs)}
 \end{mathpar}
-\caption{POSIX Lexing Rules}
+\caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
+The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
+value $v$ given a string $s$ and
+regular expression $r$.
+For example, this specification says that a match for an alternative
+must always prefer a left value to a right one.
+}
 \end{figure}
 \noindent
 \begin{figure}
 \begin{tikzpicture}[]
 We shall give the details for proving the sequence case here.
 When we have
 \[
 	(s_1, r_1) \rightarrow v_1 \;\, \text{and} \;\,
-	(s_2, r_2) \rightarrow v_2  \;\, and \;\,\\
+	(s_2, r_2) \rightarrow v_2  \;\, \text{and} \;\,
 	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
 \]
 we know that the last condition
 excludes the possibility of a
 for how the regular expression
 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
 of $s$ (i.e.\ $s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
 matches $s$.
+\begin{figure}[H]
+\begin{center}
 \begin{ceqn}
-\begin{equation}\label{graph:inj}
 \begin{tikzcd}
 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
 v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]
 \end{tikzcd}
-\end{equation}
 \end{ceqn}
+\end{center}
+\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
+	matching the regular expression $r_0$ against a string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
+	The first phase takes successive derivatives w.r.t.~the characters $c_0$,
+	$c_1$, and so on. These are the same operations as in the matcher
+	(\ref{matcher}). If the final derivative regular expression is nullable (contains the empty string),
+	the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
+	the empty string, by always selecting the leftmost
+	nullable regular expression. After that, $\inj$ ``injects'' the characters back in the reverse order
+	in which they appeared in the string, always preserving POSIXness.}\label{graph:inj}
+\end{figure}
 \noindent
 $\textit{inj}$ takes three arguments: a regular
 expression ${r_{i}}$, as it was before the character was chopped off;
 a character ${c_{i}}$, which is the character we want to inject back; and
 a value $v_{i+1}$, which is the value we want to inject the character into.
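 For instance (a sketch assuming the standard clauses of $\textit{mkeps}$ and $\textit{inj}$
 due to Sulzmann and Lu, and writing $\mathbf{0}$ for the regular expression matching no string),
 for $r_0 = a \cdot b$ and the string $ab$ the derivatives are
 $r_1 = \ONE \cdot b$ and $r_2 = \mathbf{0} \cdot b + \ONE$, and the second phase computes
 \[
 \begin{array}{lcl}
 v_2 & = & \textit{mkeps} \; r_2 \;=\; \Right \; \Empty\\
 v_1 & = & \textit{inj} \; r_1 \; b \; v_2
           \;=\; \Seq \; (\textit{mkeps} \; \ONE) \; (\textit{inj} \; b \; b \; \Empty)
           \;=\; \Seq \; \Empty \; (\Char \; b)\\
 v_0 & = & \textit{inj} \; r_0 \; a \; v_1
           \;=\; \Seq \; (\textit{inj} \; a \; a \; \Empty) \; (\Char \; b)
           \;=\; \Seq \; (\Char \; a) \; (\Char \; b)
 \end{array}
 \]
 so that $|v_0| = ab$, as expected.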
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with millions of nodes (when viewed as a tree)
 even with simplification, which is not much better compared
 with the naive version without any simplifications:
 \begin{figure}[H]
-	\centering
+\begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 ylabel={size},
 legend entries={Simple-Minded Simp, Naive Matcher},
 legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
 \addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
 \end{axis}
 \end{tikzpicture}
+\end{center}
 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
 \end{figure}
 That is because Sulzmann and Lu's
 injection-based lexing algorithm keeps a lot of

changeset 601	ce4e5151a836
parent 591	b2d0de6aee18
child 608	37b6fd310a16