diff -r d73f19be3ce6 -r 46e5566ad4ba thys2/Paper/document/root.tex --- a/thys2/Paper/document/root.tex Sat Jan 29 16:43:51 2022 +0000 +++ b/thys2/Paper/document/root.tex Sat Jan 29 23:53:21 2022 +0000 @@ -59,37 +59,45 @@ \begin{abstract} Sulzmann and Lu described a lexing algorithm that calculates - Brzozowski derivatives using bit-sequences annotated to regular + Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of \emph{how} a regular expression matches a string---that is, which part of the string is matched by which part - of the regular expression. The purpose of the bit-sequences in - Sulzmann and Lu's algorithm is to keep the size of derivatives small - which is achieved by `aggressively' simplifying regular expressions. - In this paper we describe a slight variant of Sulzmann and Lu's - algorithm and \textit{(i)} prove that this algorithm generates - unique POSIX values; \textit{(ii)} we also establish a cubic bound - for the size of the derivatives---in earlier works, derivatives can - grow exponentially even after simplification. + of the regular expression. The purpose of the bitcodes in Sulzmann + and Lu's algorithm is to generate POSIX values incrementally while + derivatives are calculated. However they also help with designing + `aggressive' simplification methods that keep the size of + derivatives small. Without simplification derivatives can grow + exponentially resulting in an extremely slow lexing algorithm. In this + paper we describe a variant of Sulzmann and Lu's algorithm: Our + algorithm is a small, recursive functional program, whereas Sulzmann + and Lu's version involves a fixpoint construction. We \textit{(i)} + prove in Isabelle/HOL that our program is correct and generates + unique POSIX values; we also \textit{(ii)} establish a polynomial + bound for the size of the derivatives. The size can be seen as a + proxy measure for the effeciency of the lexing algorithm---that means + our algorithm does not suffer from the exponential blowup. -%Brzozowski introduced the notion of derivatives for regular -%expressions. They can be used for a very simple regular expression -%matching algorithm. Sulzmann and Lu cleverly extended this algorithm -%in order to deal with POSIX matching, which is the underlying -%disambiguation strategy for regular expressions needed in lexers. -%Their algorithm generates POSIX values which encode the information of -%\emph{how} a regular expression matches a string---that is, which part -%of the string is matched by which part of the regular expression. In -%this paper we give our inductive definition of what a POSIX value is -%and show $(i)$ that such a value is unique (for given regular -%expression and string being matched) and $(ii)$ that Sulzmann and Lu's -%algorithm always generates such a value (provided that the regular -%expression matches the string). We show that $(iii)$ our inductive -%definition of a POSIX value is equivalent to an alternative definition -%by Okui and Suzuki which identifies POSIX values as least elements -%according to an ordering of values. We also prove the correctness of -%Sulzmann's bitcoded version of the POSIX matching algorithm and extend the -%results to additional constructors for regular expressions. \smallskip + % Brzozowski introduced the notion of derivatives for regular + % expressions. They can be used for a very simple regular expression + % matching algorithm. Sulzmann and Lu cleverly extended this + % algorithm in order to deal with POSIX matching, which is the + % underlying disambiguation strategy for regular expressions needed + % in lexers. Their algorithm generates POSIX values which encode + % the information of \emph{how} a regular expression matches a + % string---that is, which part of the string is matched by which + % part of the regular expression. In this paper we give our + % inductive definition of what a POSIX value is and show $(i)$ that + % such a value is unique (for given regular expression and string + % being matched) and $(ii)$ that Sulzmann and Lu's algorithm always + % generates such a value (provided that the regular expression + % matches the string). We show that $(iii)$ our inductive definition + % of a POSIX value is equivalent to an alternative definition by + % Okui and Suzuki which identifies POSIX values as least elements + % according to an ordering of values. We also prove the correctness + % of Sulzmann's bitcoded version of the POSIX matching algorithm and + % extend the results to additional constructors for regular + % expressions. \smallskip \end{abstract}