diff -r e3752aac8ec2 -r dd9dde2d902b ChengsongTanPhdThesis/Chapters/Inj.tex --- a/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Dec 24 11:55:04 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Dec 30 01:52:32 2022 +0000 @@ -13,7 +13,7 @@ This is essentially a description in ``English'' the functions and datatypes of our formalisation in Isabelle/HOL. We also define what $\POSIX$ lexing means, -followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} +followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} that produces the output conforming to the $\POSIX$ standard\footnote{In what follows we choose to use the Isabelle-style notation @@ -109,7 +109,7 @@ to test membership of a string in a set of strings. For example, to test whether the string -$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with +$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with respect to the string $bar$: \begin{center} \begin{tabular}{lll} @@ -122,8 +122,9 @@ \end{center} \noindent and in the end, test whether the set -has the empty string.\footnote{We use the infix notation $A\backslash c$ - instead of $\Der \; c \; A$ for brevity, as it is clear we are operating +contains the empty string.\footnote{We use the infix notation $A\backslash c$ + instead of $\Der \; c \; A$ for brevity, as it will always be + clear from the context that we are operating on languages rather than regular expressions.} In general, if we have a language $S$, @@ -204,7 +205,7 @@ expression that matches only the empty string.\footnote{ Some authors also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$ -but we prefer our notation.} +but we prefer this notation.} The sequence regular expression is written $r_1\cdot r_2$ and sometimes we omit the dot if it is clear which regular expression is meant; the alternative @@ -227,7 +228,7 @@ %Now with language derivatives of a language and regular expressions and %their language interpretations in place, we are ready to define derivatives on regular expressions. With $L$, we are ready to introduce Brzozowski derivatives on regular expressions. -We do so by first introducing what properties it should satisfy. +We do so by first introducing what properties they should satisfy. \subsection{Brzozowski Derivatives and a Regular Expression Matcher} %Recall, the language derivative acts on a set of strings @@ -291,14 +292,13 @@ he calls the derivative of a regular expression $r$ with respect to a character $c$, written $r \backslash c$. This infix operator -takes an original regular expression $r$ as input -and a character as a right operand and -outputs a result, which is a new regular expression. +takes regular expression $r$ as input +and a character as a right operand. The derivative operation on regular expression is defined such that the language of the derivative result coincides with the language of the original regular expression being taken -derivative with respect to the same character: +derivative with respect to the same characters, namely \begin{property} \[ @@ -335,11 +335,11 @@ \vspace{ 3mm } The derivatives on regular expression can again be -generalised to a string. +generalised to strings. One could compute $r_{start} \backslash s$ and test membership of $s$ in $L \; r_{start}$ by checking whether the empty string is in the language of -$r_{end}$ ($r_{start}\backslash s$).\\ +$r_{end}$ (that is $r_{start}\backslash s$).\\ \vspace{2mm} \Longstack{ @@ -378,18 +378,18 @@ } - +We have the property that \begin{property} $s \in L \; r_{start} \iff [] \in L \; r_{end}$ \end{property} \noindent Next, we give the recursive definition of derivative on regular expressions so that it satisfies the properties above. -The derivative function, written $r\backslash c$, -takes a regular expression $r$ and character $c$, and -returns a new regular expression representing -the original regular expression's language $L \; r$ -being taken the language derivative with respect to $c$. +%The derivative function, written $r\backslash c$, +%takes a regular expression $r$ and character $c$, and +%returns a new regular expression representing +%the original regular expression's language $L \; r$ +%being taken the language derivative with respect to $c$. \begin{table} \begin{center} \begin{tabular}{lcl} @@ -418,7 +418,7 @@ \begin{tabular}{lcl} $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; - \textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ + \textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\ & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ \end{tabular} \end{center} @@ -483,7 +483,7 @@ is always nullable because it naturally contains the empty string. \noindent -We have the following correspondence between +We have the following two correspondences between derivatives on regular expressions and derivatives on a set of strings: \begin{lemma}\label{derDer} @@ -499,8 +499,8 @@ By induction on $r$. \end{proof} \noindent -which is the main property of derivatives -that enables us to reason about the correctness of +which are the main properties of derivatives +that enables us later to reason about the correctness of derivative-based matching. We can generalise the derivative operation shown above for single characters to strings as follows: @@ -570,14 +570,14 @@ \ref{NaiveMatcher}. Note that both axes are in logarithmic scale. Around two dozen characters -would already ``explode'' the matcher with the regular expression +this algorithm already ``explodes'' with the regular expression $(a^*)^*b$. To improve this situation, we need to introduce simplification rules for the intermediate results, -such as $r + r \rightarrow r$, +such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$, and make sure those rules do not change the language of the regular expression. -One simpled-minded simplification function +One simple-minded simplification function that achieves these requirements is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}): \begin{center} @@ -733,7 +733,7 @@ one of the values to be generated. $\POSIX$ is one of the disambiguation strategies that is widely adopted. -Ausaf et al.\parencite{AusafDyckhoffUrban2016} +Ausaf et al. \cite{AusafDyckhoffUrban2016} formalised the property as a ternary relation. The $\POSIX$ value $v$ for a regular expression @@ -843,7 +843,7 @@ then a match with a keyword (if) followed by an identifier (foo) would be incorrect. -POSIX lexing would generate what we want. +POSIX lexing generates what is included by lexing. \noindent We know that a POSIX @@ -915,7 +915,7 @@ after the initial phase of successive derivatives. This second phase generates a POSIX value if the regular expression matches the string. -Two functions are involved: $\inj$ and $\mkeps$. +The algorithm uses two functions called $\inj$ and $\mkeps$. The function $\mkeps$ constructs a POSIX value from the last derivative $r_n$: \begin{ceqn} @@ -958,7 +958,7 @@ The result of $\mkeps$ on a $\nullable$ $r$ is a POSIX value for $r$ and the empty string: \begin{lemma}\label{mePosix} -$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ +$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$ \end{lemma} \begin{proof} By induction on $r$. @@ -1322,7 +1322,7 @@ \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) ] \] -And $\derssimp \; aa \; (a^*a^*)^*$ would be +And $\derssimp \; aa \; (a^*a^*)^*$ is \[ ((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*. @@ -1352,7 +1352,7 @@ At any moment, the subterms in a regular expression that will potentially result in a POSIX value is only a minority among the many other terms, -and one can remove ones that are not possible to +and one can remove the ones that are not possible to be POSIX. In the above example, \begin{equation}\label{eqn:growth2} @@ -1379,15 +1379,15 @@ \[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] \] -is too hopeless to contribute a POSIX lexical value, -and is therefore thrown away. +do not to contribute a POSIX lexical value, +and therefore can be thrown away. Ausaf et al. \cite{AusafDyckhoffUrban2016} have come up with some simplification steps, however those steps -are not yet sufficiently strong, so they achieve the above effects. +are not yet sufficiently strong, to achieve the above effects. And even with these relatively mild simplifications, the proof is already quite a bit more complicated than the theorem \ref{lexerCorrectness}. -One would prove something like this: +One would need to prove something like this: \[ \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; \textit{then}\;\; (r, c::s) \rightarrow @@ -1396,7 +1396,7 @@ instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ not only has to return a simplified regular expression, but also what specific simplifications -has been done as a function on values +have been done as a function on values showing how one can transform the value underlying the simplified regular expression to the unsimplified one.