chap4
authorChengsong
Tue, 23 Aug 2022 14:21:13 +0100
changeset 583 4aabb0629e4b
parent 582 3e19073e91f4
child 584 1734bd5975a3
chap4
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Inj.tex
thys3/Blexer.thy
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Mon Aug 22 17:58:29 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Tue Aug 23 14:21:13 2022 +0100
@@ -11,8 +11,20 @@
 
 
 
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
+In this chapter we introduce the simplifications
+on annotated regular expressions that can be applied to 
+each intermediate derivative result. This allows
+us to make $\blexer$ much more efficient.
+We contrast this simplification function 
+with Sulzmann and Lu's original
+simplifications, highlighting the simplicity of our algorithm
+and the improvements we made, and demonstrating
+the usefulness and reliability of formal proofs about algorithms.
+These ``aggressive'' simplifications would not be possible in the injection-based 
+lexing we introduced in chapter \ref{Inj}.
+We then go on to prove the correctness of the improved version of 
+$\blexer$, called $\blexersimp$, by establishing 
+$\blexer \; r \; s = \blexersimp \; r \; s$ using a term rewriting system.
 
 \section{Simplification for Annotated Regular Expressions}
 The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
@@ -21,14 +33,15 @@
 de-duplication at different levels:
 \begin{center}
 	$(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^* \rightarrow $\\
-	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
+	$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \rightarrow \ldots$
 \end{center}
 \noindent
 As we have already mentioned in \ref{eqn:growth2},
 a simple-minded simplification function cannot simplify
+\begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
-any further.
-we would expect a better simplification function to work in the 
+\end{center}
+any further; one would expect a better simplification function to work in the 
 following way:
 \begin{gather*}
 	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
@@ -40,7 +53,10 @@
 	\bigg\downarrow \\
 	(a^*a^* + a^* 
 	)\cdot(a^*a^*)^*  
-	\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*
+	\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
+	\bigg\downarrow \\
+	(a^*a^* + a^* 
+	)\cdot(a^*a^*)^*  
 \end{gather*}
 \noindent
 This motivating example came from testing Sulzmann and Lu's 
@@ -82,16 +98,6 @@
 		$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
 	\end{tabular}
 \end{center}
-%\[
-%\begin{aligned}
-%&\textit{isPhi} \left(b s @ r i^{*}\right)=\text { False } \\
-%&\textit{isPhi}\left(b s @ r i_{1} r i_{2}\right)=\textit{isPhi}  r i_{1} \vee \textit{isPhi}  r i_{2} \\
-%&\textit{isPhi} \left(b s @ r i_{1} \oplus r i_{2}\right)=\textit{isPhi}  r i_{1} \wedge \textit{isPhi}  r i_{2} \\
-%&\textit{isPhi}(b s @ l)= False \\
-%&\textit{isPhi}(b s @ \epsilon)= False \\
-%&\textit{isPhi} \; \ZERO = True
-%\end{aligned}
-%\]
 \noindent
 Despite that we have already implemented the simple-minded simplifications 
 such as throwing away useless $\ONE$s and $\ZERO$s.
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Mon Aug 22 17:58:29 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Tue Aug 23 14:21:13 2022 +0100
@@ -10,19 +10,19 @@
 
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
-This is essentially a description in ``English"
+This is essentially a description in ``English''
 of our formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means, 
-followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
+followed by a lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014} 
 that produces the output conforming
 to the $\POSIX$ standard.
-It is also worth mentioning that
-we choose to use the ML-style notation
+\footnote{In what follows 
+we choose to use the Isabelle-style notation
 for function applications, where
-the parameters of a function is not enclosed
+the parameters of a function are not enclosed
 inside a pair of parentheses (e.g. $f \;x \;y$
 instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.
+to make the text visually more concise.}
 
 \section{Basic Concepts}
 Usually, formal language theory starts with an alphabet 
@@ -38,7 +38,7 @@
 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
 \end{tabular}
 \end{center}
-Where $c$ is a variable ranging over characters.
+where $c$ is a variable ranging over characters.
 Strings can be concatenated to form longer strings in the same
 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
@@ -88,7 +88,8 @@
 \end{tabular}
 \end{center}
 \noindent
-This can be generalised to "chopping off" a string from all strings within set $A$, 
+This can be generalised to ``chopping off'' a string 
+from all strings within a set $A$, 
 namely:
 \begin{center}
 \begin{tabular}{lcl}
@@ -97,7 +98,7 @@
 \end{center}
 \noindent
 which is essentially the left quotient $A \backslash L$ of $A$ against 
-the singleton language with $L = \{w\}$
+the singleton language with $L = \{s\}$
 in formal language theory.
 However, for the purposes here, the $\textit{Ders}$ definition with 
 a single string is sufficient.
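+For instance, taking the small set $A = \{[a,b,c],\; [a,b,d],\; [b,c]\}$
+as an illustration, we have
+\[ \textit{Ders} \; [a,b] \; A = \{[c],\, [d]\}, \]
+since only the first two strings in $A$ start with the prefix $[a,b]$.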
@@ -145,13 +146,13 @@
 \noindent
 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
 and get to $B$.
-The language $A*$'s derivative can be described using the language derivative
+The language derivative for $A*$ can be described using the language derivative
 of $A$:
 \begin{lemma}
 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
 \end{lemma}
 \begin{proof}
-There are too inclusions to prove:
+There are two inclusions to prove:
 \begin{itemize}
 \item{$\subseteq$}:\\
 The set 
@@ -167,8 +168,8 @@
 \item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
-hold.
-Also this holds:
+holds.
+Also the following holds:
 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
 where the $\textit{RHS}$ can be rewritten
 as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
@@ -218,15 +219,15 @@
 \end{tabular}
 \end{center}
 \noindent
-Now with semantic derivatives of a language and regular expressions and
-their language interpretations in place, we are ready to define derivatives on regexes.
+Now with language derivatives and the language interpretation of
+regular expressions in place, we are ready to define derivatives on regular expressions.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
 %and essentially chops off a particular character from
 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
 %so that after derivative $L(r\backslash c)$ 
 %will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the semantic derivative acts on a 
+Recall that the language derivative acts on a 
 language (set of strings).
 One can decide whether a string $s$ belongs
 to a language $S$ by taking derivative with respect to
@@ -375,10 +376,10 @@
 Now we give the recursive definition of derivative on
 regular expressions, so that it satisfies the properties above.
 The derivative function, written $r\backslash c$, 
-defines how a regular expression evolves into
-a new one after all the string it contains is acted on: 
-if it starts with $c$, then the character is chopped of,
-if not, that string is removed.
+takes a regular expression $r$ and a character $c$, and
+returns a new regular expression whose language is
+the language derivative of the original expression's language $L \; r$
+with respect to $c$. 
 \begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
@@ -401,7 +402,9 @@
 result of this derivative:
 \begin{center}
 	\begin{tabular}{lcl}
-		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
+		$\textit{if}\;\,([] \in L(r_1))\; 
+		\textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
 	\end{tabular}
 \end{center}
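+As a small example of this sequence case, consider the derivative of
+$(a + \ONE)\cdot b$ with respect to $a$: since $[] \in L(a + \ONE)$,
+the first branch of the conditional applies and we obtain
+\[ ((a + \ONE)\cdot b) \backslash a \;=\;
+   ((a \backslash a) + (\ONE \backslash a))\cdot b \,+\, b \backslash a \;=\;
+   (\ONE + \ZERO)\cdot b \,+\, \ZERO. \]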
@@ -418,7 +421,7 @@
 	\end{tabular}
 \end{center}
 \noindent
-The star regular expression $r^*$'s derivative 
+The derivative of the star regular expression $r^*$ 
 unwraps one iteration of $r$, turns it into $r\backslash c$,
 and attaches the original $r^*$
 after $r\backslash c$, so that 
@@ -427,7 +430,7 @@
 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
 Again,
-the structure is the same as the semantic derivative of Kleene star: 
+the structure is the same as the language derivative of Kleene star: 
 \[
 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
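+For example, $(a^*) \backslash a = (a \backslash a)\cdot a^* = \ONE \cdot a^*$,
+whose language $\{[]\} \,@\, L(a^*)$ is exactly
+$\textit{Der} \; a \; (L(a^*))$.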
@@ -494,16 +497,16 @@
 \begin{center}
 \begin{tabular}{lcl}
 $r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
-$r \backslash [\,] $ & $\dn$ & $r$
+$r \backslash_s [\,] $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
 
 \noindent
 When there is no ambiguity, we will 
 omit the subscript and use $\backslash$ instead
-of $\backslash_r$ to denote
+of $\backslash_s$ to denote
 string derivatives for brevity.
-Brzozowski's  regular-expression matcher algorithm can then be described as:
+Brzozowski's  regular-expression matching algorithm can then be described as:
 
 \begin{definition}
 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
@@ -515,7 +518,9 @@
 
 \begin{equation}\label{graph:successive_ders}
 \begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
+r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
+\;\textrm{true}/\textrm{false}
 \end{tikzcd}
 \end{equation}
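+As a small worked example, matching the string $[a, b]$ against the
+regular expression $a \cdot b$ proceeds as
+$(a \cdot b) \backslash a = \ONE \cdot b$ (because $[] \notin L(a)$), followed by
+$(\ONE \cdot b) \backslash b = \ZERO \cdot b + \ONE$ (because $[] \in L(\ONE)$).
+The final derivative is nullable, so
+$\textit{match} \; [a, b] \; (a \cdot b)$ returns $\textit{true}$.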
 
@@ -526,12 +531,12 @@
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-	By the stepwise property of derivatives 
-	(lemma \ref{derDer}).
+	By induction on $s$, using the property of derivatives
+	in lemma \ref{derDer}.
 \end{proof}
 \noindent
 \begin{center}
-	\begin{figure}
+\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
@@ -543,7 +548,9 @@
 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
 \end{axis}
 \end{tikzpicture} 
-\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
+\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
+$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
+$ using Brzozowski's original algorithm}\label{NaiveMatcher}
 \end{figure}
 \end{center} 
 \noindent
@@ -552,9 +559,9 @@
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozens characters
-would already explode the matcher on regular expression 
+would already explode the matcher on the regular expression 
 $(a^*)^*b$.
-For this, we need to introduce certain 
+To improve this situation, we need to introduce simplification
 rewrite rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the 
@@ -574,7 +581,7 @@
 				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
 				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
 				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
-		$\simp \; r$ & $\dn$ & $r$
+		$\simp \; r$ & $\dn$ & $r \qquad (\textit{otherwise})$
 				   
 	\end{tabular}
 \end{center}
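+For instance, with this definition $\simp \; (\ZERO + (\ZERO + a))$
+evaluates to $a$: the inner alternative simplifies to $a$ via the
+$(\ZERO, r_2')$ case, after which the outer alternative collapses in
+the same way.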
@@ -608,15 +615,15 @@
 \end{figure}
 \noindent
 The running time of $\textit{ders}\_\textit{simp}$
-on the same example of \ref{NaiveMatcher}
-is now very tame in terms of the length of inputs,
-as shown in \ref{BetterMatcher}.
+on the same example as in Figure \ref{NaiveMatcher}
+is now ``tame'' in terms of the length of the input,
+as shown in Figure \ref{BetterMatcher}.
 
-Building derivatives and then testing the existence
-of empty string in the resulting regular expression's language,
-adding simplifications when necessary.
-So far, so good. But what if we want to 
-do lexing instead of just getting a YES/NO answer?
+So far the story is: build Brzozowski derivatives,
+simplify where possible, and at the end test
+whether the empty string is recognised by the final derivative.
+But what if we want to 
+do lexing instead of just getting a true/false answer?
 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
 
@@ -1152,12 +1159,12 @@
 \end{proof}
 \noindent
 As we did earlier in this chapter on the matcher, one can 
-introduce simplification on the regex.
+introduce simplification on the regular expression.
 However, now one needs to do a backward phase and make sure
 the values align with the regular expressions.
 Therefore one has to
 be careful not to break the correctness, as the injection 
-function heavily relies on the structure of the regexes and values
+function heavily relies on the structure of the regular expressions and values
 being correct and matching each other.
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in 
--- a/thys3/Blexer.thy	Mon Aug 22 17:58:29 2022 +0100
+++ b/thys3/Blexer.thy	Tue Aug 23 14:21:13 2022 +0100
@@ -264,6 +264,13 @@
   apply(simp_all)
   done
 
+lemma retrieve_codestars2:
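+  (* retrieve on an annotated star applied to an empty Stars value
+     yields just the bits bs followed by S *)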
+  assumes "\<forall>v \<in> set vs. \<Turnstile> v : r \<and> code v = retrieve (intern r) v"
+  shows "retrieve (ASTAR bs (intern r)) (Stars []) = bs @ [S]"
+  apply simp
+  done
+
+
 lemma retrieve_fuse2:
   assumes "\<Turnstile> v : (erase r)"
   shows "retrieve (fuse bs r) v = bs @ retrieve r v"