--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Dec 23 13:27:56 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Dec 24 11:55:04 2022 +0000
@@ -11,8 +11,8 @@
In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
-of our formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means,
+the functions and datatypes of our formalisation in Isabelle/HOL.
+We also define what $\POSIX$ lexing means,
followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard\footnote{In what follows
@@ -26,7 +26,7 @@
\section{Basic Concepts}
Formal language theory usually starts with an alphabet
denoting a set of characters.
-Here we just use the datatype of characters from Isabelle,
+Here we use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
@@ -40,9 +40,9 @@
where $c$ is a variable ranging over characters.
The $::$ stands for list cons and $[]$ for the empty
list.
-A singleton list is sometimes written as $[c]$ for brevity.
+For brevity, a singleton list is sometimes written as $[c]$.
Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
+way we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
@@ -121,7 +121,7 @@
\end{tabular}
\end{center}
\noindent
-and in the end test whether the set
+and in the end, test whether the set
has the empty string.\footnote{We use the infix notation $A\backslash c$
instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
on languages rather than regular expressions.}
@@ -156,15 +156,18 @@
\begin{itemize}
\item{$\subseteq$}:\\
The set
-\[ \{s \mid c :: s \in A*\} \]
+\[ S_1 = \{s \mid c :: s \in A*\} \]
is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
-because whenever you have a string starting with a character
-in the language of a Kleene star $A*$,
-then that character together with some sub-string
-immediately after it will form the first iteration,
-and the rest of the string will
-be still in $A*$.
+\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
+This is because for any string $c::s$ satisfying $c::s \in A*$,
+%whenever you have a string starting with a character
+%in the language of a Kleene star $A*$,
+%then that
+the character $c$, together with a prefix of $s$
+%immediately after $c$
+forms the first iteration of $A*$,
+and the rest of the $s$ is also $A*$.
+This coincides with the definition of $S_2$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
@@ -194,7 +197,7 @@
\]
\noindent
We call them basic because we will introduce
-additional constructors in later chapters such as negation
+additional constructors in later chapters, such as negation
and bounded repetitions.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
@@ -223,7 +226,7 @@
\noindent
%Now with language derivatives of a language and regular expressions and
%their language interpretations in place, we are ready to define derivatives on regular expressions.
-With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
We do so by first introducing what properties it should satisfy.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
@@ -380,8 +383,8 @@
$s \in L \; r_{start} \iff [] \in L \; r_{end}$
\end{property}
\noindent
-Next we give the recursive definition of derivative on
-regular expressions, so that it satisfies the properties above.
+Next, we give the recursive definition of derivative on
+regular expressions so that it satisfies the properties above.
The derivative function, written $r\backslash c$,
takes a regular expression $r$ and character $c$, and
returns a new regular expression representing
@@ -441,7 +444,7 @@
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
Again,
-the structure is the same as the language derivative of Kleene star:
+the structure is the same as the language derivative of the Kleene star:
\[
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
\]
@@ -522,8 +525,8 @@
\end{definition}
\noindent
-Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$,
-this algorithm presented graphically is as follows:
+Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
+this algorithm, presented graphically, is as follows:
\begin{equation}\label{matcher}
\begin{tikzcd}
@@ -540,7 +543,7 @@
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
- By induction on $s$ using property of derivatives:
+ By induction on $s$ using the property of derivatives:
lemma \ref{derDer}.
\end{proof}
\begin{figure}
@@ -566,7 +569,7 @@
the algorithm can be excruciatingly slow, as shown in
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
-Around two dozens characters
+Around two dozen characters
would already ``explode'' the matcher with the regular expression
$(a^*)^*b$.
To improve this situation, we need to introduce simplification
@@ -628,8 +631,8 @@
is now ``tame'' in terms of the length of inputs,
as shown in Figure \ref{BetterMatcher}.
-So far the story is use Brzozowski derivatives and
-simplify as much as possible and at the end test
+So far, the story is use Brzozowski derivatives and
+simplify as much as possible, and at the end test
whether the empty string is recognised
by the final derivative.
But what if we want to
@@ -727,7 +730,7 @@
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
-one of the values be generated. $\POSIX$ is one of the
+one of the values to be generated. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.
Ausaf et al.\parencite{AusafDyckhoffUrban2016}
@@ -796,10 +799,10 @@
%\end{tikzpicture}
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
%\end{figure}
-The above $\POSIX$ rules follows the intuition described below:
+The above $\POSIX$ rules follow the intuition described below:
\begin{itemize}
\item (Left Priority)\\
- Match the leftmost regular expression when multiple options of matching
+ Match the leftmost regular expression when multiple options for matching
are available. See P+L and P+R where in P+R $s$ cannot
be in the language of $L \; r_1$.
\item (Maximum munch)\\
@@ -809,12 +812,12 @@
$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
Then the split that matches a longer string for the first part
$r_{part1}$ is preferred by this maximum munch rule.
- This is caused by the side-condition
+ The side-condition
\begin{center}
$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
\end{center}
- in PS.
+ in PS causes this.
%(See
%\ref{munch} for an illustration).
\end{itemize}
@@ -883,8 +886,8 @@
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
\]
cannot possibly form a $\POSIX$ value either, because
-by definition there is a candidate
-with longer initial string
+by definition, there is a candidate
+with a longer initial string
$s_1$. Therefore, we know that the POSIX
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
$s$ must have the
@@ -900,8 +903,8 @@
is the same as $\Seq(v_1, v_2)$.
\end{proof}
\noindent
-We have now defined what a POSIX value is and shown that it is unique,
-the problem is to generate
+We have now defined what a POSIX value is and shown that it is unique.
+The problem is to generate
such a value in a lexing algorithm using derivatives.
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
@@ -963,7 +966,7 @@
\noindent
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. This function
+The function for this is called $\inj$. This function
operates on values, unlike $\backslash$ which operates on regular expressions.
In the diagram below, $v_i$ stands for the (POSIX) value
for how the regular expression
@@ -985,9 +988,9 @@
The first phase involves taking successive derivatives w.r.t the characters $c_0$,
$c_1$, and so on. These are the same operations as they have appeared in the matcher
\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
- then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
- the empty string , by always selecting the leftmost
- nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they
+ then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
+ the empty string, by always selecting the leftmost
+ nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they
appeared in the string, always preserving POSIXness.}\label{graph:inj}
\end{figure}
\noindent
@@ -1020,7 +1023,7 @@
\noindent
The function recurses on
the shape of regular
-expressionsw and values.
+expressions and values.
Intuitively, each clause analyses
how $r_i$ could have transformed when being
derived by $c$, identifying which subpart
@@ -1028,13 +1031,13 @@
to inject the character back into.
Once the character is
injected back to that sub-value;
-$\inj$ assembles all parts together
+$\inj$ assembles all parts
to form a new value.
For instance, the last clause is an
injection into a sequence value $v_{i+1}$
whose second child
-value is a star, and the shape of the
+value is a star and the shape of the
regular expression $r_i$ before injection
is a star.
We therefore know
@@ -1057,7 +1060,7 @@
\]
Finally,
$\inj \; r \;c \; v$ is prepended
-to the previous list of iterations, and then
+to the previous list of iterations and then
wrapped under the $\Stars$
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
@@ -1088,7 +1091,7 @@
\end{tabular}
\end{center}
We know that there exists a unique pair of
- $s_a$ and $s_b$ satisfaying
+ $s_a$ and $s_b$ satisfying
$(a \backslash c, s_a) \rightarrow v_a$,
$(b , s_b) \rightarrow v_b$, and
$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
@@ -1113,7 +1116,7 @@
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
\end{tabular}
\end{center}
- With a similar reasoning,
+ With similar reasoning,
\[
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
@@ -1133,7 +1136,7 @@
\[
(a, []) \rightarrow \mkeps \; a.
\]
- Also by inductive hypothesis
+ Also, by inductive hypothesis
\[
(b, c::s) \rightarrow \inj\; b \; c \; v_b
\]
@@ -1154,7 +1157,7 @@
s_3 @s_4 = c::s \land s_3 \in L \; a
\land s_4 \in L \; b.
\]
- (Which basically says there cannot be a longer
+ (Which says there cannot be a longer
initial split for $s$ other than the empty string.)
Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
as the POSIX value for $a\cdot b$.
@@ -1187,20 +1190,21 @@
\end{theorem}
\begin{proof}
By induction on $s$. $r$ generalising over an arbitrary regular expression.
-The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
As we did earlier in this chapter with the matcher, one can
introduce simplification on the regular expression in each derivative step.
-However, now one needs to do a backward phase and make sure
-the values align with the regular expression.
+However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
+and ensure that
+the values align with the regular expression at each step.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of
the regular expressions and values being aligned.
This can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in
+during the derivatives step and applying these rectifications in
each run during the injection phase.
With extra care
one can show that POSIXness will not be affected
@@ -1223,16 +1227,17 @@
\end{tabular}
\end{center}
-However, with the simple-minded simplification rules allowed
-in an injection-based lexer, one can still end up
-with exploding derivatives.
-\section{A Case Requring More Aggressive Simplifications}
+However, one can still end up
+with exploding derivatives,
+even with the simple-minded simplification rules allowed
+in an injection-based lexer.
+\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just over
a dozen successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with millions of nodes (when viewed as a tree)
-even with the simple-minded simplification.
+even with the mentioned simplifications.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}
@@ -1262,7 +1267,7 @@
as an example.
This is a highly ambiguous regular expression, with
many ways to split up the string into multiple segments for
-different star iteratioins,
+different star iterations,
and for each segment
multiple ways of splitting between
the two $a^*$ sub-expressions.
@@ -1331,7 +1336,7 @@
A lexer without a good enough strategy to
deduplicate will naturally
have an exponential runtime on highly
-ambiguous regular expressions, because there
+ambiguous regular expressions because there
are exponentially many matches.
For this particular example, it seems
that the number of distinct matches growth
@@ -1342,12 +1347,12 @@
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
\[
\Stars\,
- [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
+ [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
\]
-and at any moment the subterms in a regular expression
-that will result in a POSIX value is only
+At any moment, the subterms in a regular expression
+that will potentially result in a POSIX value is only
a minority among the many other terms,
-and one can remove ones that are absolutely not possible to
+and one can remove ones that are not possible to
be POSIX.
In the above example,
\begin{equation}\label{eqn:growth2}
@@ -1370,23 +1375,23 @@
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
\end{tabular}
\end{center}
-Other terms with an underlying value such as
+Other terms with an underlying value, such as
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
-is simply too hopeless to contribute a POSIX lexical value,
+is too hopeless to contribute a POSIX lexical value,
and is therefore thrown away.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps, however those steps
-are not yet sufficiently strong so that they achieve the above effects.
-And even with these relatively mild simplifications the proof
-is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
-One would prove something like:
+are not yet sufficiently strong, so they achieve the above effects.
+And even with these relatively mild simplifications, the proof
+is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
+One would prove something like this:
\[
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
\textit{then}\;\; (r, c::s) \rightarrow
- \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v)
+ \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
\]
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
not only has to return a simplified regular expression,