lexing: comparison thys3/Paper.thy


-:8881a09a06fd
+:8907d4b6316d
 does not achieve the intended objective with bitcoded regular expressions.  The
 reason is that in the bitcoded setting, each copy generally has a
 different bitcode annotation---so @{text nub} would never ``fire''.
 Inspired by Scala's library for lists, we shall instead use a @{text
 distinctWith} function that finds duplicates under an ``erasing'' function
-which deletes bitcodes before comparing regular expressions.
+that deletes bitcodes before comparing regular expressions.
 We shall also introduce our \emph{own} arguments and definitions for
 establishing the correctness of the bitcoded algorithm when
 simplifications are included. Finally we
 establish that the size of derivatives can be finitely bounded.\medskip%\footnote{ In this paper, we shall first briefly introduce the basic notions
 %of regular expressions and describe the definition
 \end{center}
 \noindent where we use @{term vs} to stand for a list of values. The
 string underlying a value can be calculated by a @{const flat}
 function, written @{term "flat DUMMY"}. It traverses a value and
-collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
+collects the characters contained in it (see \cite{AusafDyckhoffUrban2016}).
 Sulzmann and Lu also define inductively an
 inhabitation relation that associates values to regular expressions. Our
-version of this relation is defined the following six rules for the values:
+version of this relation is defined by the following six rules:
 %
 \begin{center}
 \begin{tabular}{@ {}l@ {}}
 @{thm[mode=Axiom] Prf.intros(4)}\quad
 @{thm[mode=Rule] Prf.intros(2)[of "v\<^sub>1" "r\<^sub>1" "r\<^sub>2"]}\quad
 only contains values that flatten to the empty string.
 It is routine to establish how values ``inhabiting'' a regular
 expression correspond to the language of a regular expression, namely
 @{thm L_flat_Prf}.
-In general there is more than one value inhabited by a regular
+In general there is more than one value inhabiting a regular
 expression (meaning regular expressions can typically match more
 than one string). But even when fixing a string from the language of the
 regular expression, there is generally more than one way in which the
 regular expression can match this string. POSIX lexing is about
 identifying the unique value for a given regular expression and a
 %	Unix-like operating systems \cite{POSIX}.}
 Sometimes these
 informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
 One contribution of our earlier paper is to give a convenient
 specification for what POSIX values are (the inductive rules are shown in
-Figure~\ref{POSIXrules}).
+Fig.~\ref{POSIXrules}).
 \begin{figure}[t]
 \begin{center}\small%
 \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
-\\[-9mm]
+\\[-4.5mm]
 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\
 $\mprset{flushleft}
 regular expression \emph{and} a bitsequence of a corresponding value,
 then we can always decode the value accurately~(see Fig.~\ref{decode}).
 The function \textit{decode} checks whether all of the bitsequence is
 consumed and returns the corresponding value as @{term "Some v"}; otherwise
 it fails with @{text "None"}. We can establish that for a value $v$
-inhabited by a regular expression $r$, the decoding of its
+inhabiting a regular expression $r$, the decoding of its
 bitsequence never fails (see also \cite{NielsenHenglein2011}).
 %The decoding can be
 %defined by using two functions called $\textit{decode}'$ and
 %\textit{decode}:
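
The round trip between values and bitsequences described above can be sketched in Python. This is a minimal illustration, not the Isabelle definition: the tuple encodings of regular expressions and values, the helper names `code`, `decode_prime` and `decode`, and the bit convention (`Z` for a left alternative or another star iteration, `S` for a right alternative or the end of a star) are assumptions made for this example. A result of `None` plays the role of the failing case.

```python
# Regular expressions (assumed encoding): ("ZERO",), ("ONE",), ("CHAR", c),
#   ("ALT", r1, r2), ("SEQ", r1, r2), ("STAR", r)
# Values (assumed encoding): ("Empty",), ("Char", c), ("Left", v), ("Right", v),
#   ("Seq", v1, v2), ("Stars", [v, ...])
# Bitsequences are strings over the letters "Z" and "S".

def code(v):
    """Encode a value as a bitsequence (assumed convention, see lead-in)."""
    tag = v[0]
    if tag in ("Empty", "Char"):
        return ""
    if tag == "Left":
        return "Z" + code(v[1])
    if tag == "Right":
        return "S" + code(v[1])
    if tag == "Seq":
        return code(v[1]) + code(v[2])
    if tag == "Stars":
        # one "Z" per iteration, a final "S" marks the end of the star
        return "".join("Z" + code(x) for x in v[1]) + "S"
    raise ValueError(f"unknown value: {v!r}")

def decode_prime(bs, r):
    """Return (value, remaining bits), or None if bs does not fit r."""
    tag = r[0]
    if tag == "ONE":
        return ("Empty",), bs
    if tag == "CHAR":
        return ("Char", r[1]), bs
    if tag == "ALT":
        if not bs:
            return None
        side, sub = ("Left", r[1]) if bs[0] == "Z" else ("Right", r[2])
        res = decode_prime(bs[1:], sub)
        return None if res is None else ((side, res[0]), res[1])
    if tag == "SEQ":
        res1 = decode_prime(bs, r[1])
        if res1 is None:
            return None
        res2 = decode_prime(res1[1], r[2])
        if res2 is None:
            return None
        return ("Seq", res1[0], res2[0]), res2[1]
    if tag == "STAR":
        vs = []
        while bs and bs[0] == "Z":
            res = decode_prime(bs[1:], r[1])
            if res is None:
                return None
            vs.append(res[0])
            bs = res[1]
        if not bs:
            return None  # the terminating "S" is missing
        return ("Stars", vs), bs[1:]
    return None  # ZERO is not inhabited by any value

def decode(bs, r):
    """Succeed only if the whole bitsequence is consumed."""
    res = decode_prime(bs, r)
    if res is None or res[1] != "":
        return None
    return res[0]

# (a + b) . a*  with the value for the string "baa"
r = ("SEQ", ("ALT", ("CHAR", "a"), ("CHAR", "b")), ("STAR", ("CHAR", "a")))
v = ("Seq", ("Right", ("Char", "b")), ("Stars", [("Char", "a"), ("Char", "a")]))
```

With these definitions `decode(code(v), r)` gives back `v`, while a bitsequence with leftover or missing bits is rejected.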
 \end{tabular}
 \end{minipage}\medskip
 %!\end{center}
 \noindent
+This function ``fuses'' a bitsequence to the topmost constructor of a bitcoded regular expression.
 A regular expression can then be \emph{internalised} into a bitcoded
 regular expression as follows:
 %
 %!\begin{center}
 %!\begin{tabular}{@ {}c@ {\hspace{2mm}}c@ {\hspace{2mm}}c@ {}}
 derivative $r\backslash s$ and another time for stacking up injection
 functions), it helps us with proving
 that incrementally building up values in @{text blexer} generates the same result.
 This brings us to our main lemma in this section: if we calculate a
-derivative, say $r\backslash s$, and have a value, say $v$, inhabited
+derivative, say $r\backslash s$, and have a value, say $v$, inhabiting
-by this derivative, then we can produce the result @{text lexer} generates
+this derivative, then we can produce the result @{text lexer} generates
 by applying this value to the stacked-up injection functions
 that $\textit{flex}$ assembles. The lemma establishes that this is the same
 value as if we build the annotated derivative $r^\uparrow\backslash s$
-and then retrieve the corresponding bitcoded version, followed by a
+and then retrieve the bitcoded version of @{text v}, followed by a
 decoding step.
 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
 If $\vdash v : r\backslash s$ then
 $\textit{Some}\,(\textit{flex}\,r\,\textit{id}\,s\,v) =
 de-nest, or spill out, @{text ALTs} as follows
 %
 \[
 @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
 \quad\xrightarrow{bsimp}\quad
-@{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
+@{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
 \]
 \noindent
 where we just need to fuse the bitsequence that has accumulated in @{text "bs\<^sub>2"}
 to the alternatives in @{text "rs\<^sub>2"}. As we shall show below this will
 \end{tabular}
 \end{center}
 \noindent where we scan the list from left to right (because we
 have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
-equivalence relation for annotated regular expressions and @{text acc} is an accumulator for annotated regular
+equivalence relation for bitcoded regular expressions and @{text acc} is an accumulator for bitcoded regular
 expressions---essentially a set of regular expressions that we have already seen
 while scanning the list. Therefore we delete an element, say @{text x},
 from the list provided the accumulator already contains a @{text "y"} that is equivalent to @{text x};
 otherwise we keep @{text x} and scan the rest of the list but
 add @{text "x"} as another ``seen'' element to @{text acc}. We will use
 @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
 before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
-annotated regular expressions to @{text bool}:
+bitcoded regular expressions to @{text bool}:
 %
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
 @{thm (lhs) eq1.simps(1)} & $\dn$ & @{thm (rhs) eq1.simps(1)}\\
 @{thm (lhs) eq1.simps(2)[of DUMMY DUMMY]} & $\dn$ & @{thm (rhs) eq1.simps(2)[of DUMMY DUMMY]}\\
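
The behaviour of @{text distinctWith} described above can be sketched in Python. This is only an illustration under stated assumptions: the tuple encoding of bitcoded regular expressions, the names `distinct_with`, `erase` and `eq_erased`, and the choice to compare erased copies (rather than a direct recursive comparison such as the one in the table) are hypothetical and not taken from the Isabelle theory.

```python
# Bitcoded regular expressions (assumed encoding), bs is a string over "Z"/"S":
#   ("ZERO",), ("ONE", bs), ("CHAR", bs, c), ("ALTS", bs, [r, ...]),
#   ("SEQ", bs, r1, r2), ("STAR", bs, r)

def erase(r):
    """Drop all bitcode annotations, keeping only the regular-expression shape."""
    tag = r[0]
    if tag == "ZERO":
        return ("ZERO",)
    if tag == "ONE":
        return ("ONE",)
    if tag == "CHAR":
        return ("CHAR", r[2])
    if tag == "ALTS":
        return ("ALTS", tuple(erase(x) for x in r[2]))
    if tag == "SEQ":
        return ("SEQ", erase(r[2]), erase(r[3]))
    if tag == "STAR":
        return ("STAR", erase(r[2]))
    raise ValueError(f"unknown regular expression: {r!r}")

def eq_erased(r1, r2):
    """Equivalence that ignores bitcodes."""
    return erase(r1) == erase(r2)

def distinct_with(xs, eq, acc=()):
    """Scan xs left to right and drop every element that is eq-equivalent
    to something already seen; acc holds the elements seen so far."""
    seen = list(acc)
    kept = []
    for x in xs:
        if any(eq(x, y) for y in seen):
            continue          # a later copy: delete it
        kept.append(x)        # first occurrence: keep it ...
        seen.append(x)        # ... and remember it as "seen"
    return kept

rs = [("CHAR", "Z", "a"), ("CHAR", "S", "a"), ("CHAR", "", "b"), ("CHAR", "SS", "b")]
```

Here `distinct_with(rs, eq_erased)` keeps only the first `a` and the first `b`, whereas using plain equality as `eq` removes nothing, because every copy carries a different bitcode.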
 Next we can include simplification after each derivative step leading to the
 following notion of bitcoded derivatives:
 \begin{center}
-\begin{tabular}{cc}
+\begin{tabular}{c@ {\hspace{10mm}}c}
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
 \end{tabular}
 &
 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
 $\textit{blexer}^+\;r\,s$ & $\dn$ &
 $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
-& & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
+& & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
-& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
+& & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
-& & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
-\noindent The remaining task is to show that @{term blexer} and
+\noindent
+Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
+using the simplifying derivative $\_\,\backslash_{bsimp}\,\_$.
+The remaining task is to show that @{term blexer} and
 @{term "blexer_simp"} generate the same answers.
 When we first
 attempted this proof we encountered a problem with the idea
 in Sulzmann and Lu's paper where the argument seems to be to appeal
 synchronise the change in the simplified regular expressions with
 the original POSIX value, there is no hope of appealing to @{text retrieve} in the
 correctness argument for @{term blexer_simp}.
 We found it more helpful to introduce the rewriting systems shown in
-Figure~\ref{SimpRewrites}. The idea is to generate
+Fig.~\ref{SimpRewrites}. The idea is to generate
 simplified regular expressions in small steps (unlike the @{text bsimp}-function which
 does the same in a big step), and show that each of
-the small steps preserves the bitcodes that lead to the final POSIX value.
+the small steps preserves the bitcodes that lead to the POSIX value.
 The rewrite system is organised such that $\leadsto$ is for bitcoded regular
 expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
 expressions. The former essentially implements the simplifications of
 @{text "bsimpSEQ"} and @{text flts}; while the latter implements the
 simplifications in @{text "bsimpALTs"}. We can show that any bitcoded
 %!\end{center}
 % tell Chengsong about Indian paper of closed forms of derivatives
 \noindent
-where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
+where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ consists of all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
 We reason similarly for @{text STAR}.\smallskip
 transfer to our setting where we are interested in the \emph{size} of the
 derivatives. For example it is not true for our derivatives that the
 set of derivatives $r \backslash s$ for a given $r$ and all strings
 $s$ is finite. This is because for our annotated regular expressions
 the bitcode annotation is dependent on the number of iterations that
-are needed for STAR-regular expressions. Since we want to do lexing
+are needed for STAR-regular expressions. This is not a problem for us: since we intend to do lexing
 by calculating (as fast as possible) derivatives, the bound on the size
 of the derivatives is important, not the number of derivatives.
 *}
 \cite{Sulzmann2014}. This follows earlier work where we established
 the correctness of the first algorithm
 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
 introduce our own specification about what POSIX values are,
 because the informal definition given by Sulzmann and Lu did not
-stand up to a formal proof. Also for the second algorithm we needed
+stand up to formal proof. Also for the second algorithm we needed
 to introduce our own definitions and proof ideas in order to establish the
 correctness.  Our interest in the second algorithm
 lies in the fact that by using bitcoded regular expressions and an aggressive
 simplification method there is a chance that the derivatives
 can be kept universally small  (we established in this paper that
-they can be kept finitely bounded for any string).
+for a given $r$ they can be kept finitely bounded for all strings).
 %This is important if one is after
 %an efficient POSIX lexing algorithm based on derivatives.
 Having proved the correctness of the POSIX lexing algorithm, which
 lessons have we learned? Well, we feel this is a very good example
 grow to arbitrarily big sizes and the algorithm needs to traverse
 the derivatives possibly several times, then the algorithm will be
 slow---excruciatingly slow that is. Other works seem to make
 stronger claims, but during our work we have developed a healthy
 suspicion when for example experimental data is used to back up
-efficiency claims. For example Sulzmann and Lu write about their
+efficiency claims. For instance Sulzmann and Lu write about their
 equivalent of @{term blexer_simp} \textit{``...we can incrementally
 compute bitcoded parse trees in linear time in the size of the
 input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
 derivatives in some cases even after aggressive simplification,
 this claim is hard to believe. A similar claim about a
 one needs to represent them as
 sequences of $a \cdot a \cdot \ldots \cdot a$ (500 a's in sequence). We have run
 their extracted code with such a regular expression as a
 single lexing rule and a string of 50\,000 a's---lexing in this case
 takes approximately 5~minutes. We are not aware of any better
-translation using the traditional notion of DFAs. Therefore we
+translation using the traditional notion of DFAs so that we can improve on this. Therefore we
 prefer to stick with calculating derivatives, but attempt to make
 this calculation (in the future) as fast as possible. What we can guarantee
 with the presented work is that the maximum size of the derivatives
-for this example is not bigger than 9. This means our Scala
+for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
-implementation only needs a few seconds for this example.
+implementation only needs a few seconds for this example and matching 50\,000 a's.
 %
 %
 %Possible ideas are
 %zippers which have been used in the context of parsing of
 %context-free grammars \cite{zipperparser}.

changeset 616	8907d4b6316d
parent 615	8881a09a06fd
child 617	3ea6a38c33b5