author Christian Urban <christian.urban@kcl.ac.uk>
date Thu, 13 Oct 2022 00:19:50 +0100
changeset 616 8907d4b6316d
parent 615 8881a09a06fd
child 617 3ea6a38c33b5
updated paper
thys3/Paper.thy
thys3/document/root.bib
--- a/thys3/Paper.thy	Wed Oct 12 15:23:42 2022 +0100
+++ b/thys3/Paper.thy	Thu Oct 13 00:19:50 2022 +0100
@@ -199,7 +199,7 @@
 different bitcode annotation---so @{text nub} would never ``fire''.
 Inspired by Scala's library for lists, we shall instead use a @{text
 distinctWith} function that finds duplicates under an ``erasing'' function
-which deletes bitcodes before comparing regular expressions.
+that deletes bitcodes before comparing regular expressions.
 We shall also introduce our \emph{own} arguments and definitions for
 establishing the correctness of the bitcoded algorithm when 
 simplifications are included. Finally we
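
For a concrete picture of the ``erasing'' function mentioned above, here is a minimal Scala sketch (all datatype and function names are hypothetical illustrations, not taken from the formalisation):

    // bits used in bitcode annotations
    enum Bit { case Z, S }

    // bitcoded regular expressions: every constructor carries a bitsequence
    enum ARexp {
      case AZERO
      case AONE(bs: List[Bit])
      case ACHAR(bs: List[Bit], c: Char)
      case AALTS(bs: List[Bit], rs: List[ARexp])
      case ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp)
      case ASTAR(bs: List[Bit], r: ARexp)
    }

    // plain regular expressions, the target of erasing
    enum Rexp {
      case ZERO
      case ONE
      case CHAR(c: Char)
      case ALTS(rs: List[Rexp])
      case SEQ(r1: Rexp, r2: Rexp)
      case STAR(r: Rexp)
    }

    import ARexp._, Rexp._

    // delete all bitcodes, so that two bitcoded regular expressions
    // can be compared modulo their annotations
    def erase(r: ARexp): Rexp = r match {
      case AZERO           => ZERO
      case AONE(_)         => ONE
      case ACHAR(_, c)     => CHAR(c)
      case AALTS(_, rs)    => ALTS(rs.map(erase))
      case ASEQ(_, r1, r2) => SEQ(erase(r1), erase(r2))
      case ASTAR(_, r1)    => STAR(erase(r1))
    }
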
@@ -376,12 +376,12 @@
   \noindent where we use @{term vs} to stand for a list of values. The
   string underlying a value can be calculated by a @{const flat}
   function, written @{term "flat DUMMY"}. It traverses a value and
-  collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
+  collects the characters contained in it (see \cite{AusafDyckhoffUrban2016}).
 
 
   Sulzmann and Lu also define inductively an
   inhabitation relation that associates values to regular expressions. Our
-  version of this relation is defined the following six rules for the values:
+  version of this relation is defined by the following six rules:
   %
   \begin{center}
   \begin{tabular}{@ {}l@ {}}
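
As a companion sketch in the same hypothetical Scala rendering (the Isabelle definitions are what count), values and the @{const flat} function could look as follows:

    // values, mirroring the six constructors used in the paper
    enum Val {
      case Empty
      case Chr(c: Char)
      case Sequ(v1: Val, v2: Val)
      case Left(v: Val)
      case Right(v: Val)
      case Stars(vs: List[Val])
    }

    // flat traverses a value and collects the characters contained in it
    def flat(v: Val): String = v match {
      case Val.Empty        => ""
      case Val.Chr(c)       => c.toString
      case Val.Sequ(v1, v2) => flat(v1) ++ flat(v2)
      case Val.Left(v1)     => flat(v1)
      case Val.Right(v1)    => flat(v1)
      case Val.Stars(vs)    => vs.map(flat).mkString
    }
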
@@ -416,7 +416,7 @@
   expression correspond to the language of a regular expression, namely
   @{thm L_flat_Prf}.
 
-  In general there is more than one value inhabited by a regular
+  In general there is more than one value inhabiting a regular
   expression (meaning regular expressions can typically match more
   than one string). But even when fixing a string from the language of the
 regular expression, there is generally more than one way of how the
@@ -432,12 +432,12 @@
   informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
   One contribution of our earlier paper is to give a convenient
  specification for what POSIX values are (the inductive rules are shown in
-  Figure~\ref{POSIXrules}).
+  Fig.~\ref{POSIXrules}).
 
 \begin{figure}[t]
   \begin{center}\small%
   \begin{tabular}{@ {\hspace{-2mm}}c@ {}}
-  \\[-9mm]
+  \\[-4.5mm]
   @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
   @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
   @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
@@ -715,7 +715,7 @@
   The function \textit{decode} checks whether all of the bitsequence is
   consumed and returns the corresponding value as @{term "Some v"}; otherwise
   it fails with @{text "None"}. We can establish that for a value $v$
-  inhabited by a regular expression $r$, the decoding of its
+  inhabiting a regular expression $r$, the decoding of its
   bitsequence never fails (see also \cite{NielsenHenglein2011}).
 
   %The decoding can be
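
One plausible Scala reading of this decoding function (reusing Bit, Rexp and Val from the sketches above; a Z bit picks the left alternative and S the right, while for stars Z signals another iteration and S terminates the list — a sketch of the definition, not verified code):

    import Bit._, Rexp._

    // decodeAux consumes bits guided by the shape of r and returns the
    // decoded value together with the leftover bits
    def decodeAux(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
      case (bs1, ONE)     => (Val.Empty, bs1)
      case (bs1, CHAR(c)) => (Val.Chr(c), bs1)
      case (Z :: bs1, ALTS(List(r1, _))) =>
        val (v, rest) = decodeAux(bs1, r1); (Val.Left(v), rest)
      case (S :: bs1, ALTS(List(_, r2))) =>
        val (v, rest) = decodeAux(bs1, r2); (Val.Right(v), rest)
      case (bs1, SEQ(r1, r2)) =>
        val (v1, rest1) = decodeAux(bs1, r1)
        val (v2, rest2) = decodeAux(rest1, r2)
        (Val.Sequ(v1, v2), rest2)
      case (Z :: bs1, STAR(r1)) =>
        val (v, rest1) = decodeAux(bs1, r1)
        decodeAux(rest1, STAR(r1)) match {
          case (Val.Stars(vs), rest2) => (Val.Stars(v :: vs), rest2)
          case _ => sys.error("decoding failed")
        }
      case (S :: bs1, STAR(_)) => (Val.Stars(Nil), bs1)
      case _ => sys.error("bits do not fit the regular expression")
    }

    // decode succeeds only if the whole bitsequence is consumed
    def decode(bs: List[Bit], r: Rexp): Option[Val] =
      decodeAux(bs, r) match {
        case (v, Nil) => Some(v)
        case _        => None
      }
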
@@ -799,6 +799,7 @@
   %!\end{center}    
 
   \noindent
+  This function ``fuses'' a bitsequence to the topmost constructor of a bitcoded regular expression.
   A regular expression can then be \emph{internalised} into a bitcoded
   regular expression as follows:
   %
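
In the same hypothetical Scala rendering as before, fuse and internalisation might be sketched like this (binary alternatives only; the Z/S bits record which branch of an alternative was taken):

    import Bit._, ARexp._, Rexp._

    // fuse prepends a bitsequence to the topmost constructor only
    def fuse(bs: List[Bit], r: ARexp): ARexp = r match {
      case AZERO            => AZERO
      case AONE(cs)         => AONE(bs ++ cs)
      case ACHAR(cs, c)     => ACHAR(bs ++ cs, c)
      case AALTS(cs, rs)    => AALTS(bs ++ cs, rs)
      case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
      case ASTAR(cs, r1)    => ASTAR(bs ++ cs, r1)
    }

    // internalise decorates a plain regex with initially empty bitcodes;
    // alternatives get a Z (left) or S (right) fused onto their branches
    def internalise(r: Rexp): ARexp = r match {
      case ZERO    => AZERO
      case ONE     => AONE(Nil)
      case CHAR(c) => ACHAR(Nil, c)
      case ALTS(List(r1, r2)) =>
        AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                        fuse(List(S), internalise(r2))))
      case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
      case STAR(r1)    => ASTAR(Nil, internalise(r1))
      case ALTS(_)     => sys.error("sketch handles binary alternatives only")
    }
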
@@ -1042,12 +1043,12 @@
 that incrementally building up values in @{text blexer} generates the same result.
 
 This brings us to our main lemma in this section: if we calculate a
-derivative, say $r\backslash s$, and have a value, say $v$, inhabited
-by this derivative, then we can produce the result @{text lexer} generates
+derivative, say $r\backslash s$, and have a value, say $v$, inhabiting
+this derivative, then we can produce the result @{text lexer} generates
 by applying this value to the stacked-up injection functions
 that $\textit{flex}$ assembles. The lemma establishes that this is the same
 value as if we build the annotated derivative $r^\uparrow\backslash s$
-and then retrieve the corresponding bitcoded version, followed by a
+and then retrieve the bitcoded version of @{text v}, followed by a
 decoding step.
 
 \begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
@@ -1131,7 +1132,7 @@
      \[
      @{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
      \quad\xrightarrow{bsimp}\quad
-     @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
+     @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
      \]
 
      \noindent
@@ -1178,7 +1179,7 @@
 
      \noindent where we scan the list from left to right (because we
      have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
-     equivalence relation for annotated regular expressions and @{text acc} is an accumulator for annotated regular
+     equivalence relation for bitcoded regular expressions and @{text acc} is an accumulator for bitcoded regular
      expressions---essentially a set of regular expressions that we have already seen
      while scanning the list. Therefore we delete an element, say @{text x},
     from the list provided some @{text "y"} equivalent to @{text x} is already in the accumulator;
@@ -1186,7 +1187,7 @@
      add @{text "x"} as another ``seen'' element to @{text acc}. We will use
      @{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
      before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
-     annotated regular expressions to @{text bool}:
+     bitcoded regular expressions to @{text bool}:
      %
      \begin{center}
      \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
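
A small Scala sketch of @{text distinctWith} (mirroring the Scala library idea mentioned in the introduction); for illustration, eq is instantiated with comparison after erasing, whereas the paper defines the equivalence directly as the recursive function shown in the hunk above:

    // keep the first occurrence of each element modulo eq; acc collects
    // the elements already seen while scanning from left to right
    def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
                        acc: List[A]): List[A] = xs match {
      case Nil => Nil
      case x :: rest =>
        if (acc.exists(y => eq(x, y))) distinctWith(rest, eq, acc)
        else x :: distinctWith(rest, eq, x :: acc)
    }

    // delete later duplicates among bitcoded regexes, ignoring bitcodes
    def distinctErase(rs: List[ARexp]): List[ARexp] =
      distinctWith(rs, (r1: ARexp, r2: ARexp) => erase(r1) == erase(r2), Nil)
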
@@ -1285,7 +1286,7 @@
      following notion of bitcoded derivatives:
      
      \begin{center}
-      \begin{tabular}{cc}
+      \begin{tabular}{c@ {\hspace{10mm}}c}
       \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
       @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
       \end{tabular}
@@ -1303,13 +1304,15 @@
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
   $\textit{blexer}^+\;r\,s$ & $\dn$ &
       $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\                
-      & & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
-      & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
-      & & $\;\;\textit{else}\;\textit{None}$
+      & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
+      & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 
-       \noindent The remaining task is to show that @{term blexer} and
+       \noindent
+       Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
+       using the simplifying derivative $\_\,\backslash_{bsimp}\,\_$.
+       The remaining task is to show that @{term blexer} and
        @{term "blexer_simp"} generate the same answers.
 
        When we first
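
Putting the pieces together, the top-level structure of $\textit{blexer}^+$ might be sketched in Scala as follows; bder, bsimp, bnullable and bmkeps are only stubbed here (their signatures are assumptions of this sketch, since the definitions appear elsewhere in the paper):

    // stubs for functions defined elsewhere in the paper
    def bder(c: Char, r: ARexp): ARexp = ???   // bitcoded derivative
    def bsimp(r: ARexp): ARexp         = ???   // simplification step
    def bnullable(r: ARexp): Boolean   = ???
    def bmkeps(r: ARexp): List[Bit]    = ???

    // the simplifying derivative: simplify after each character
    def bdersSimp(r: ARexp, s: List[Char]): ARexp = s match {
      case Nil     => r
      case c :: cs => bdersSimp(bsimp(bder(c, r)), cs)
    }

    def blexerSimp(r: Rexp, s: String): Option[Val] = {
      val rder = bdersSimp(internalise(r), s.toList)
      if (bnullable(rder)) decode(bmkeps(rder), r) else None
    }
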
@@ -1341,10 +1344,10 @@
      correctness argument for @{term blexer_simp}.
 
      We found it more helpful to introduce the rewriting systems shown in
-     Figure~\ref{SimpRewrites}. The idea is to generate 
+     Fig.~\ref{SimpRewrites}. The idea is to generate 
      simplified regular expressions in small steps (unlike the @{text bsimp}-function which
      does the same in a big step), and show that each of
-     the small steps preserves the bitcodes that lead to the final POSIX value.
+     the small steps preserves the bitcodes that lead to the POSIX value.
      The rewrite system is organised such that $\leadsto$ is for bitcoded regular
      expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
      expressions. The former essentially implements the simplifications of
@@ -1503,7 +1506,7 @@
 % tell Chengsong about Indian paper of closed forms of derivatives
 
 \noindent
-where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
+where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ contains all the suffixes $s'$ of $s$ for which @{term "bders_simp r\<^sub>1 s'"} is nullable.
 In (3) we know that  $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is 
 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
 than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
@@ -1554,7 +1557,7 @@
 set of derivatives $r \backslash s$ for a given $r$ and all strings
 $s$ is finite. This is because for our annotated regular expressions
 the bitcode annotation is dependent on the number of iterations that
-are needed for STAR-regular expressions. Since we want to do lexing
+are needed for STAR-regular expressions. This is not a problem for us: since we intend to do lexing
 by calculating (as fast as possible) derivatives, the bound on the size
 of the derivatives is important, not the number of derivatives.
 
@@ -1573,13 +1576,13 @@
    \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
    introduce our own specification about what POSIX values are,
    because the informal definition given by Sulzmann and Lu did not
-   stand up to a formal proof. Also for the second algorithm we needed
+   stand up to formal proof. Also for the second algorithm we needed
    to introduce our own definitions and proof ideas in order to establish the
    correctness.  Our interest in the second algorithm 
    lies in the fact that by using bitcoded regular expressions and an aggressive
    simplification method there is a chance that the derivatives
    can be kept universally small  (we established in this paper that
-   they can be kept finitely bounded for any string).
+   for a given $r$ they can be kept finitely bounded for all strings).
    %This is important if one is after
    %an efficient POSIX lexing algorithm based on derivatives.
 
@@ -1604,7 +1607,7 @@
   slow---excruciatingly slow, that is. Other works seem to make
    stronger claims, but during our work we have developed a healthy
    suspicion when for example experimental data is used to back up
-   efficiency claims. For example Sulzmann and Lu write about their
+   efficiency claims. For instance Sulzmann and Lu write about their
    equivalent of @{term blexer_simp} \textit{``...we can incrementally
    compute bitcoded parse trees in linear time in the size of the
    input''} \cite[Page 14]{Sulzmann2014}.  Given the growth of the
@@ -1673,12 +1676,12 @@
    their extracted code with such a regular expression as a
    single lexing rule and a string of 50\,000 a's---lexing in this case
    takes approximately 5~minutes. We are not aware of any better
-   translation using the traditional notion of DFAs. Therefore we
+   translation using the traditional notion of DFAs with which we could improve on this. Therefore we
    prefer to stick with calculating derivatives, but attempt to make
   this calculation (in the future) as fast as possible. What we can guarantee
    with the presented work is that the maximum size of the derivatives
-   for this example is not bigger than 9. This means our Scala
-   implementation only needs a few seconds for this example.
+   for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
+   implementation needs only a few seconds to match the 50\,000 a's in this example.
    %
    %
    %Possible ideas are
--- a/thys3/document/root.bib	Wed Oct 12 15:23:42 2022 +0100
+++ b/thys3/document/root.bib	Thu Oct 13 00:19:50 2022 +0100
@@ -372,15 +372,14 @@
   year =         {2013}
 }
 
+
+
 @inproceedings{RibeiroAgda2017,
 author = {R.~Ribeiro and A.~Du Bois},
 title = {{C}ertified {B}it-{C}oded {R}egular {E}xpression {P}arsing},
 year = {2017},
-publisher = {Association for Computing Machinery},
-address = {New York, NY, USA},
 booktitle = {Proc.~of the 21st Brazilian Symposium on Programming Languages},
-articleno = {4},
-numpages = {8}
+pages     = {4:1--4:8}
 }