--- a/thys3/Paper.thy Wed Oct 12 15:23:42 2022 +0100
+++ b/thys3/Paper.thy Thu Oct 13 00:19:50 2022 +0100
@@ -199,7 +199,7 @@
different bitcode annotation---so @{text nub} would never ``fire''.
Inspired by Scala's library for lists, we shall instead use a @{text
distinctWith} function that finds duplicates under an ``erasing'' function
-which deletes bitcodes before comparing regular expressions.
+that deletes bitcodes before comparing regular expressions.
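+To convey the idea, such a function might be sketched in Scala as
+follows (an illustration only, not our verified code; @{text eq} is
+meant to compare two regular expressions after their bitcodes have
+been erased):
+%
+\begin{verbatim}
+// sketch: keep the first occurrence of each element, where eq
+// compares regular expressions after their bitcodes are deleted
+def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
+                    acc: List[A] = Nil): List[A] = xs match {
+  case Nil => Nil
+  case x :: rest =>
+    if (acc.exists(y => eq(x, y))) distinctWith(rest, eq, acc)
+    else x :: distinctWith(rest, eq, x :: acc)
+}
+\end{verbatim}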
We shall also introduce our \emph{own} arguments and definitions for
establishing the correctness of the bitcoded algorithm when
simplifications are included. Finally we
@@ -376,12 +376,12 @@
\noindent where we use @{term vs} to stand for a list of values. The
string underlying a value can be calculated by a @{const flat}
function, written @{term "flat DUMMY"}. It traverses a value and
- collects the characters contained in it \cite{AusafDyckhoffUrban2016}.
+ collects the characters contained in it (see \cite{AusafDyckhoffUrban2016}).
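+ For illustration, the values and @{const flat} might be rendered in
+ Scala as follows (a sketch only; the constructor names mirror our
+ formalisation, but this is not our actual code):
+%
+\begin{verbatim}
+// sketch: values and flat; Left/Right shadow Scala's built-ins here
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+def flat(v: Val): String = v match {
+  case Empty        => ""
+  case Chr(c)       => c.toString
+  case Left(v1)     => flat(v1)
+  case Right(v1)    => flat(v1)
+  case Sequ(v1, v2) => flat(v1) ++ flat(v2)
+  case Stars(vs)    => vs.map(flat).mkString
+}
+\end{verbatim}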
Sulzmann and Lu also define inductively an
inhabitation relation that associates values to regular expressions. Our
- version of this relation is defined the following six rules for the values:
+ version of this relation is defined by the following six rules:
%
\begin{center}
\begin{tabular}{@ {}l@ {}}
@@ -416,7 +416,7 @@
expression correspond to the language of a regular expression, namely
@{thm L_flat_Prf}.
- In general there is more than one value inhabited by a regular
+ In general there is more than one value inhabiting a regular
expression (meaning regular expressions can typically match more
than one string). But even when fixing a string from the language of the
regular expression, there is generally more than one way in which the
@@ -432,12 +432,12 @@
informal rules are called \emph{maximal munch rule} and \emph{rule priority}.
One contribution of our earlier paper is to give a convenient
specification for what POSIX values are (the inductive rules are shown in
- Figure~\ref{POSIXrules}).
+ Fig.~\ref{POSIXrules}).
\begin{figure}[t]
\begin{center}\small%
\begin{tabular}{@ {\hspace{-2mm}}c@ {}}
- \\[-9mm]
+ \\[-4.5mm]
@{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \quad
@{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\quad
@{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\quad
@@ -715,7 +715,7 @@
The function \textit{decode} checks whether all of the bitsequence is
consumed and returns the corresponding value as @{term "Some v"}; otherwise
it fails with @{text "None"}. We can establish that for a value $v$
- inhabited by a regular expression $r$, the decoding of its
+ inhabiting a regular expression $r$, the decoding of its
bitsequence never fails (see also \cite{NielsenHenglein2011}).
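+ For illustration, the decoding might be sketched in Scala as follows
+ (a sketch only; the constructors of @{text Rexp} and the two bits
+ @{text Z} and @{text S} are assumed as before, and ill-formed
+ bitsequences are not handled):
+%
+\begin{verbatim}
+// sketch: decode a bitsequence against a regular expression,
+// returning the value and the bits that remain to be consumed
+def decode2(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
+  case (bs1, ONE)   => (Empty, bs1)
+  case (bs1, CH(c)) => (Chr(c), bs1)
+  case (Z :: rest, ALT(r1, _)) =>
+    val (v, rem) = decode2(rest, r1); (Left(v), rem)
+  case (S :: rest, ALT(_, r2)) =>
+    val (v, rem) = decode2(rest, r2); (Right(v), rem)
+  case (bs1, SEQ(r1, r2)) =>
+    val (v1, rem1) = decode2(bs1, r1)
+    val (v2, rem2) = decode2(rem1, r2)
+    (Sequ(v1, v2), rem2)
+  case (Z :: rest, STAR(r1)) =>        // Z: one more iteration
+    val (v, rem1) = decode2(rest, r1)
+    val (Stars(vs), rem2) = decode2(rem1, STAR(r1))
+    (Stars(v :: vs), rem2)
+  case (S :: rest, STAR(_)) => (Stars(Nil), rest)  // S: stop
+}
+
+def decode(bs: List[Bit], r: Rexp): Option[Val] =
+  decode2(bs, r) match {
+    case (v, Nil) => Some(v)   // all bits consumed
+    case _        => None      // leftover bits mean failure
+  }
+\end{verbatim}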
%The decoding can be
@@ -799,6 +799,7 @@
%!\end{center}
\noindent
+ This function ``fuses'' a bitsequence onto the topmost constructor of a bitcoded regular expression.
A regular expression can then be \emph{internalised} into a bitcoded
regular expression as follows:
%
@@ -1042,12 +1043,12 @@
that incrementally building up values in @{text blexer} generates the same result.
This brings us to our main lemma in this section: if we calculate a
-derivative, say $r\backslash s$, and have a value, say $v$, inhabited
-by this derivative, then we can produce the result @{text lexer} generates
+derivative, say $r\backslash s$, and have a value, say $v$, inhabiting
+this derivative, then we can produce the result @{text lexer} generates
by applying this value to the stacked-up injection functions
that $\textit{flex}$ assembles. The lemma establishes that this is the same
value as if we build the annotated derivative $r^\uparrow\backslash s$
-and then retrieve the corresponding bitcoded version, followed by a
+and then retrieve the bitcoded version of @{text v}, followed by a
decoding step.
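+For reference, $\textit{flex}$ can be sketched in Scala as follows
+(assuming @{text der} and @{text inj} from the unsimplified lexer):
+%
+\begin{verbatim}
+// sketch: flex stacks up the injection functions along the string;
+// roughly, lexer r s applies flex(r, identity, s) to the mkeps-value
+def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val =
+  s match {
+    case Nil     => f
+    case c :: cs => flex(der(c, r), v => f(inj(r, c, v)), cs)
+  }
+\end{verbatim}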
\begin{lemma}[Main Lemma]\label{mainlemma}\mbox{}\\\it
@@ -1131,7 +1132,7 @@
\[
@{term "ALTs bs\<^sub>1 (((ALTs bs\<^sub>2 rs\<^sub>2)) # rs\<^sub>1)"}
\quad\xrightarrow{bsimp}\quad
- @{term "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
+ @{text "ALTs bs\<^sub>1 ((map (fuse bs\<^sub>2) rs\<^sub>2) @ rs\<^sub>1)"}
\]
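+A Scala sketch of this flattening step on lists of bitcoded regular
+expressions might be (an illustration; @{text fuse} as defined earlier):
+%
+\begin{verbatim}
+// sketch: flatten nested alternatives, fusing the inner bitcodes
+def flts(rs: List[ARexp]): List[ARexp] = rs match {
+  case Nil => Nil
+  case AALTS(bs2, rs2) :: rest => rs2.map(fuse(bs2, _)) ++ flts(rest)
+  case r :: rest => r :: flts(rest)
+}
+\end{verbatim}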
\noindent
@@ -1178,7 +1179,7 @@
\noindent where we scan the list from left to right (because we
have to remove later copies). In @{text distinctWith}, @{text eq} is intended to be an
- equivalence relation for annotated regular expressions and @{text acc} is an accumulator for annotated regular
+ equivalence relation for bitcoded regular expressions and @{text acc} is an accumulator for bitcoded regular
expressions---essentially a set of regular expressions that we have already seen
while scanning the list. Therefore we delete an element, say @{text x},
from the list provided some @{text "y"} equivalent to @{text x} is already in the accumulator;
@@ -1186,7 +1187,7 @@
add @{text "x"} as another ``seen'' element to @{text acc}. We will use
@{term distinctWith} where @{text eq} is an equivalence that deletes bitsequences from bitcoded regular expressions
before comparing the components. One way to define this in Isabelle/HOL is by the following recursive function from
- annotated regular expressions to @{text bool}:
+ bitcoded regular expressions to @{text bool}:
%
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {\hspace{1mm}}l@ {}}
@@ -1285,7 +1286,7 @@
following notion of bitcoded derivatives:
\begin{center}
- \begin{tabular}{cc}
+ \begin{tabular}{c@ {\hspace{10mm}}c}
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)}
\end{tabular}
@@ -1303,13 +1304,15 @@
\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1mm}}l@ {}}
$\textit{blexer}^+\;r\,s$ & $\dn$ &
$\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
- & & $\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
- & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$\\
- & & $\;\;\textit{else}\;\textit{None}$
+ & & $\;\;\;\textit{if}\; \textit{bnullable}(r_{der})$\\
+ & & $\;\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r$ $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
- \noindent The remaining task is to show that @{term blexer} and
+ \noindent
+ Note that in $\textit{blexer}^+$ the derivative $r_{der}$ is calculated
+ using the simplifying derivative $\_\,\backslash_{bsimp}\,\_$.
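+ In Scala, $\textit{blexer}^+$ can be sketched as follows (an
+ illustration; @{text bder}, @{text bsimp}, @{text bnullable},
+ @{text bmkeps} and @{text internalise} stand for the obvious
+ counterparts of the Isabelle functions):
+%
+\begin{verbatim}
+// sketch: derivatives interleaved with simplification, then the
+// bits of the simplified derivative are decoded against r
+def bdersSimp(r: ARexp, s: List[Char]): ARexp =
+  s.foldLeft(r)((r1, c) => bsimp(bder(c, r1)))
+
+def blexerSimp(r: Rexp, s: String): Option[Val] = {
+  val rder = bdersSimp(internalise(r), s.toList)
+  if (bnullable(rder)) decode(bmkeps(rder), r) else None
+}
+\end{verbatim}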
+ The remaining task is to show that @{term blexer} and
@{term "blexer_simp"} generate the same answers.
When we first
@@ -1341,10 +1344,10 @@
correctness argument for @{term blexer_simp}.
We found it more helpful to introduce the rewriting systems shown in
- Figure~\ref{SimpRewrites}. The idea is to generate
+ Fig.~\ref{SimpRewrites}. The idea is to generate
simplified regular expressions in small steps (unlike the @{text bsimp}-function which
does the same in a big step), and show that each of
- the small steps preserves the bitcodes that lead to the final POSIX value.
+ the small steps preserves the bitcodes that lead to the POSIX value.
The rewrite system is organised such that $\leadsto$ is for bitcoded regular
expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular
expressions. The former essentially implements the simplifications of
@@ -1503,7 +1506,7 @@
% tell Chengsong about Indian paper of closed forms of derivatives
\noindent
-where in (1) the $\textit{Suf}(@{text "r"}_1, s)$ are all the suffixes of $s$ where @{term "bders_simp r\<^sub>1 s'"} is nullable ($s'$ being a suffix of $s$).
+where in (1) the set $\textit{Suf}(@{text "r"}_1, s)$ contains all the suffixes $s'$ of $s$ for which @{term "bders_simp r\<^sub>1 s'"} is nullable.
In (3) we know that $\llbracket@{term "ASEQ [] (bders_simp r\<^sub>1 s) r\<^sub>2"}\rrbracket$ is
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
than $N_2$. The list length after @{text distinctWith} is bounded by a number, which we call $l_{N_2}$. It stands
@@ -1554,7 +1557,7 @@
set of derivatives $r \backslash s$ for a given $r$ and all strings
$s$ is finite. This is because for our annotated regular expressions
the bitcode annotation is dependent on the number of iterations that
-are needed for STAR-regular expressions. Since we want to do lexing
+are needed for STAR-regular expressions. This is not a problem for us: since we intend to do lexing
by calculating (as fast as possible) derivatives, the bound on the size
of the derivatives is important, not the number of derivatives.
@@ -1573,13 +1576,13 @@
\cite{AusafDyckhoffUrban2016}. In the earlier work we needed to
introduce our own specification about what POSIX values are,
because the informal definition given by Sulzmann and Lu did not
- stand up to a formal proof. Also for the second algorithm we needed
+ stand up to formal proof. Also for the second algorithm we needed
to introduce our own definitions and proof ideas in order to establish the
correctness. Our interest in the second algorithm
lies in the fact that by using bitcoded regular expressions and an aggressive
simplification method there is a chance that the derivatives
can be kept universally small (we established in this paper that
- they can be kept finitely bounded for any string).
+ for a given $r$ they can be kept finitely bounded for all strings).
%This is important if one is after
%an efficient POSIX lexing algorithm based on derivatives.
@@ -1604,7 +1607,7 @@
slow---excruciatingly slow that is. Other works seem to make
stronger claims, but during our work we have developed a healthy
suspicion when for example experimental data is used to back up
- efficiency claims. For example Sulzmann and Lu write about their
+ efficiency claims. For instance Sulzmann and Lu write about their
equivalent of @{term blexer_simp} \textit{``...we can incrementally
compute bitcoded parse trees in linear time in the size of the
input''} \cite[Page 14]{Sulzmann2014}. Given the growth of the
@@ -1673,12 +1676,12 @@
their extracted code with such a regular expression as a
single lexing rule and a string of 50\,000 a's---lexing in this case
takes approximately 5~minutes. We are not aware of any better
- translation using the traditional notion of DFAs. Therefore we
+ translation using the traditional notion of DFAs that would improve on this. Therefore we
prefer to stick with calculating derivatives, but attempt to make
this calculation (in the future) as fast as possible. What we can guarantee
with the presented work is that the maximum size of the derivatives
- for this example is not bigger than 9. This means our Scala
- implementation only needs a few seconds for this example.
+ for \mbox{@{text "a\<^bsup>{100}{5}\<^esup>"}$\,\cdot\,$@{term "STAR a"}} is never bigger than 9. This means our Scala
+ implementation needs only a few seconds for this example when matching 50\,000 a's.
%
%
%Possible ideas are
--- a/thys3/document/root.bib Wed Oct 12 15:23:42 2022 +0100
+++ b/thys3/document/root.bib Thu Oct 13 00:19:50 2022 +0100
@@ -372,15 +372,14 @@
year = {2013}
}
+
+
@inproceedings{RibeiroAgda2017,
author = {R.~Ribeiro and A.~Du Bois},
title = {{C}ertified {B}it-{C}oded {R}egular {E}xpression {P}arsing},
year = {2017},
-publisher = {Association for Computing Machinery},
-address = {New York, NY, USA},
booktitle = {Proc.~of the 21st Brazilian Symposium on Programming Languages},
-articleno = {4},
-numpages = {8}
+pages = {4:1--4:8}
}