# HG changeset patch # User Chengsong # Date 1668288880 0 # Node ID 8ffa28fce2719bc217dd1db5dcb22b678fe80c03 # Parent c0c1ebe09c7daff514cb91ba3083a716f21fb8a6 all comments incorporated!!+related work diff -r c0c1ebe09c7d -r 8ffa28fce271 ChengsongTanPhdThesis/Chapters/Bitcoded1.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sat Nov 12 00:37:23 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sat Nov 12 21:34:40 2022 +0000 @@ -10,14 +10,14 @@ %Chapter 3\ref{Chapter3}. In this chapter, we are going to describe the bit-coded algorithm introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of +derivatives of regular expressions. -We have implemented their algorithm in Scala, and found out inefficiencies +We have implemented their algorithm in Scala and Isabelle, +and found problems in their algorithm such as de-duplication not working properly and redundant -fixpoint construction. Their algorithm is improved and verified with the help of -formal proofs. +fixpoint construction. \section{The Motivation Behind Using Bitcodes} -We first do a recap of what was going on -in the lexer algorithm in Chapter \ref{Inj}, +Let us give again the definition of $\lexer$ from Chapter \ref{Inj}: \begin{center} \begin{tabular}{lcl} $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ @@ -27,7 +27,22 @@ \end{tabular} \end{center} \noindent -The algorithm recursively calls $\lexer$ on +The first problem with this algorithm is that +for the function $\inj$ to work properly +we cannot destroy the structure of a regular expression, +and therefore canno simplify aggressively. +For example, +\[ + r + (r + a) \rightarrow r + a +\] +cannot be done because that would require +breaking up the inner alternative. +The $\lexer$ function therefore only enables +same-level de-duplications like +\[ + r + r \rightarrow r. +\] +Secondly, the algorithm recursively calls $\lexer$ on each new character input, and before starting a child call it stores information of previous lexing steps @@ -296,13 +311,13 @@ \noindent Storing all $r_i, c_{i+1}$ pairs recursively allows the algorithm to work in an elegant way, at the expense of -storing quite a bit of verbose information. -The stack seems to grow at least quadratically fast with respect +storing quite a bit of verbose information on the stack. +The stack seems to grow at least quadratically with respect to the input (not taking into account the size bloat of $r_i$), -which can be inefficient and prone to stack overflow. +which can be inefficient and prone to stack overflows. \section{Bitcoded Algorithm} To address this, -Sulzmann and Lu chose to define a new datatype +Sulzmann and Lu defined a new datatype called \emph{annotated regular expression}, which condenses all the partial lexing information (that was originally stored in $r_i, c_{i+1}$ pairs) @@ -333,9 +348,9 @@ where $bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular expressions and $as$ for lists of annotated regular expressions. The alternative constructor, written, $\sum$, has been generalised to -accept a list of annotated regular expressions rather than just two. -Why is it generalised? This is because when we open up nested -alternatives, there could be more than two elements at the same level +take a list of annotated regular expressions rather than just two. +Why is it generalised? This is because when we analyse nested +alternatives, there can be more than two elements at the same level after de-duplication, which can no longer be stored in a binary constructor. Bits and bitcodes (lists of bits) are defined as: @@ -347,21 +362,20 @@ \noindent We use $S$ and $Z$ rather than $1$ and $0$ is to avoid confusion with the regular expressions $\ZERO$ and $\ONE$. -The idea is to use the attached bitcodes +The idea is to use the bitcodes to indicate which choice was made at a certain point during lexing. For example, $(_{Z}a+_{S}b) \backslash a$ gives us -$_{Z}\ONE + \ZERO$, this $Z$ bitcode will +$_{Z}\ONE + \ZERO$, where the $Z$ bitcode will later be used to decode that a left branch was -selected at the time when the part $a+b$ is being taken -derivative of. +selected at the time when the part $a+b$ was anallysed by +derivative. \subsection{A Bird's Eye View of the Bit-coded Lexer} -Before we give out the rest of the functions and definitions +Before we give the details of the functions and definitions related to our $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level -view of the algorithm, so the reader does not get lost in -the details. +view of the algorithm. \begin{figure}[H] \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] %\draw (-6,-6) grid (6,6); @@ -436,14 +450,17 @@ \end{tikzpicture} -\caption{Bird's Eye View of $\blexer$} +\caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand + for the annotated regular expressions in each derivative step. + The characters used in each derivative is written as $c_i$. +The relative size increase reflect the size increase in each derivative step.} \end{figure} \noindent The plain regular expressions are first ``lifted'' to an annotated regular expression, -with the function called \emph{internalise}. +with the function called \emph{internalise} ($r \rightarrow _{bs}a$). Then the annotated regular expression $_{bs}a$ will -go through successive derivatives with respect +be transformed by successive derivatives with respect to the input stream of characters $c_1, c_2$ etc. Each time a derivative is taken, the bitcodes are moved around, discarded or augmented together @@ -451,9 +468,9 @@ After all input has been consumed, the bitcodes are collected by $\bmkeps$, which traverses the nullable part of the regular expression -and collect the bitcodes along the way. +and collects the bitcodes along the way. The collected bitcodes are then $\decode$d with the guidance -of the input regular expression $r$. +of the input regular expression $r$ ( $_{bs}a \rightarrow r$). The most notable improvements of $\blexer$ over $\lexer$ are \begin{itemize} @@ -461,43 +478,43 @@ An absence of the second injection phase. \item - One need not store each pair of the + One does not need to store each pair of the intermediate regular expressions $_{bs_i}a_i$ and $c_{i+1}$. The previous annotated regular expression $_{bs_i}a_i$'s information is passed on without loss to its successor $_{bs_{i+1}}a_{i+1}$, - and $c_{i+1}$'s already contained in the strings in $L\;r$ \footnote{ - which can be easily recovered if we decode in the correct order}. + and $c_{i+1}$'s information is stored in $L\;r$.\footnote{ + which will be used during the decode phase, where we use $r$ as + a source of information.} \item - The simplification works much better--one can maintain correctness - while applying quite strong simplifications, as we shall wee. + Finally, simplification works much better--one can maintain correctness + while applying quite aggressive simplifications. \end{itemize} -\noindent -In the next section we will -give the operations needed in $\blexer$, -with some remarks on the idea behind their definitions. +%In the next section we will +%give the operations needed in $\blexer$, +%with some remarks on the idea behind their definitions. \subsection{Operations in $\textit{Blexer}$} The first operation we define related to bit-coded regular expressions is how we move bits to the inside of regular expressions. -Called $\fuse$, this operation attaches bit-codes -to the front of an annotated regular expression: +This operation is called $\fuse$: \begin{center} \begin{tabular}{lcl} - $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ - $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & + $\textit{fuse}\;bs \; (\ZERO)$ & $\dn$ & $\ZERO$\\ + $\textit{fuse}\;bs\; (_{bs'})\ONE$ & $\dn$ & $_{bs @ bs'}\ONE$\\ - $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & + $\textit{fuse}\;bs\;(_{bs'}{\bf c})$ & $\dn$ & $_{bs@bs'}{\bf c}$\\ - $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & + $\textit{fuse}\;bs\;(_{bs'}\sum\textit{as})$ & $\dn$ & $_{bs@bs'}\sum\textit{as}$\\ - $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & + $\textit{fuse}\;bs\; (_{bs'}a_1\cdot a_2)$ & $\dn$ & $_{bs@bs'}a_1 \cdot a_2$\\ - $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & + $\textit{fuse}\;bs\;(_{bs'}a^*)$ & $\dn$ & $_{bs @ bs'}a^*$ \end{tabular} \end{center} \noindent -With \emph{fuse} we are able to define the $\internalise$ function +With \emph{fuse} we are able to define the $\internalise$ function, +written $(\_)^\uparrow$, that translates a ``standard'' regular expression into an annotated regular expression. This function will be applied before we start @@ -518,15 +535,12 @@ \end{tabular} \end{center} \noindent -We use an up arrow with postfix notation -to denote this operation -for convenience. The opposite of $\textit{internalise}$ is -$\erase$, where all the bit-codes are removed, +$\erase$, where all bit-codes are removed, and the alternative operator $\sum$ for annotated regular expressions is transformed to the binary version -in plain regular expressions. -\begin{center} +in (plain) regular expressions. This can be defined as follows: +\begin{center}\label{eraseDef} \begin{tabular}{lcl} $\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\ $( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\ @@ -535,7 +549,7 @@ $ (a_1) _\downarrow \cdot (a_2) _\downarrow$\\ $( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\ $( _{bs} [a] )_\downarrow$ & $\dn$ & $a_\downarrow$\\ - $_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\ + $(_{bs} \sum [a_1, \; a_2])_\downarrow$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\ $(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\ $( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$ \end{tabular} @@ -544,9 +558,11 @@ We also abbreviate the $\erase\; a$ operation as $a_\downarrow$, for conciseness. -Testing whether an annotated regular expression +Determining whether an annotated regular expression contains the empty string in its lauguage requires a dedicated function $\bnullable$. +In our formalisation this function is defined by simply calling $\erase$ +before $\nullable$. $\bnullable$ simply calls $\erase$ before $\nullable$. \begin{definition} $\bnullable \; a \dn \nullable \; (a_\downarrow)$ @@ -554,10 +570,10 @@ The function for collecting bitcodes from a (b)nullable regular expression is called $\bmkeps$. -$\bmkeps$ is a variation of the $\textit{mkeps}$ function, +This function is a lifted version of the $\textit{mkeps}$ function, which follows the path $\mkeps$ takes to traverse the $\nullable$ branches when visiting a regular expression, -but gives back bitcodes instead of a value. +but collects bitcodes instead of generating a value. \begin{center} \begin{tabular}{lcl} $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ @@ -572,10 +588,9 @@ \end{tabular} \end{center} \noindent -$\bmkeps$, just like $\mkeps$, +This function, just like $\mkeps$, visits a regular expression tree respecting -the POSIX rules. The difference, however, is that -it does not create values, but only bitcodes. +the POSIX rules. It traverses each child of the sequence regular expression from left to right and creates a bitcode by stitching together bitcodes obtained from the children expressions. @@ -588,7 +603,8 @@ %child recursive calls return. The only time when $\bmkeps$ creates new bitcodes is when it completes a star's iterations by attaching a $S$ to the end of the bitcode -list it returns.\\ +list it returns. + The bitcodes extracted by $\bmkeps$ need to be $\decode$d (with the guidance of a plain regular expression): %\begin{definition}[Bitdecoding of Values]\mbox{} @@ -620,23 +636,24 @@ \end{center} \noindent The function $\decode'$ returns a pair consisting of -a partially decoded value and some leftover bit list that cannot -be decide yet. +a partially decoded value and some leftover bit-list. The function $\decode'$ succeeds if the left-over bit-sequence is empty. -$\decode$ is terminating as $\decode'$ is terminating. -$\decode'$ is terminating -because at least one of $\decode'$'s parameters will go down in terms -of size.\\ + +%$\decode$ is terminating as $\decode'$ is terminating. +%$\decode'$ is terminating +%because at least one of $\decode'$'s parameters will go down in terms +%of size.\\ The reverse operation of $\decode$ is $\code$. -$\textit{code}$ encodes a value into a bitcode by converting +This function encodes a value into a bitcode by converting $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty star iteration by $S$. The border where a star iteration terminates is marked by $Z$. This coding is lossy, as it throws away the information about characters, and does not encode the ``boundary'' between two sequence values. Moreover, with only the bitcode we cannot even tell -whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. +whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$, +but this swill not be necessary in our proof. \begin{center} \begin{tabular}{lcl} $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ @@ -650,7 +667,7 @@ \end{tabular} \end{center} \noindent -Assume we have a value $v$ and regular expression $r$ +We can prove that given a value $v$ and regular expression $r$ with $\vdash v:r$, then we have the property that $\decode$ and $\code$ are reverse operations of one another: @@ -664,8 +681,8 @@ we obtain the property. \end{proof} \noindent -Now we give out the central part of this lexing algorithm, -the $\bder$ function (stands for \emph{b}itcoded-derivative): +Now we define the central part of Sulzmann and Lu's lexing algorithm, +the $\bder$ function (which stands for \emph{b}itcoded-derivative): \begin{center} \begin{tabular}{@{}lcl@{}} $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ @@ -686,23 +703,17 @@ \end{tabular} \end{center} \noindent -For most time we use the infix notation $(\_\backslash\_)$ -to mean $\bder$ for brevity when -there is no danger of confusion with derivatives on plain regular expressions. -For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$, -as the bitcodes at the front of $r^*$ indicates that it is -a bit-coded regular expression, not a plain one.\\ -$\bder$ tells us how regular expressions can be recursively traversed, +The $\bder$ function tells us how regular expressions can be recursively traversed, where the bitcodes are augmented and carried around when a derivative is taken. We give the intuition behind some of the more involved cases in $\bder$. \\ For example, in the \emph{star} case, -a derivative on $_{bs}a^*$ means -that one more star iteratoin needs to be taken. -we need to unfold it into a sequence, -and attach an additional bit $Z$ to the front of $r \backslash c$ +a derivative of $_{bs}a^*$ means +that one more star iteration needs to be taken. +We therefore need to unfold it into a sequence, +and attach an additional bit $Z$ to the front of $a \backslash c$ as a record to indicate one new star iteration is unfolded. \noindent @@ -716,9 +727,9 @@ \noindent This information will be recovered later by the $\decode$ function. -The intuition is that the bit $Z$ will be decoded at the right location, -because we accumulate bits from left to right (a rigorous proof will be given -later). +%The intuition is that the bit $Z$ will be decoded at the right location, +%because we accumulate bits from left to right (a rigorous proof will be given +%later). %\begin{tikzpicture}[ > = stealth, % arrow head style % shorten > = 1pt, % don't touch arrow head to node @@ -761,7 +772,7 @@ \noindent Another place the $\bder$ function differs -from derivatives on plain regular expressions +from derivatives on regular expressions is the sequence case: \begin{center} \begin{tabular}{@{}lcl@{}} @@ -774,36 +785,42 @@ \end{tabular} \end{center} \noindent -The difference is that (when $a_1$ is $\bnullable$) +%We assume that $\bmkeps$ +%correctly extracts the bitcode for how $a_1$ +%matches the string prior to $c$ (more on this later). +%The bitsequence $\textit{bs}$ which was initially +%attached to the first element of the sequence +%$a_1 \cdot a_2$, has now been +%elevated to the top level of the $\sum$ constructor. +%This is because this piece of information will be needed +%whichever way the sequence is matched, +%regardless of whether $c$ belongs to $a_1$ or $a_2$. +The difference mainly lies in the $\textit{if}$ branch (when $a_1$ is $\bnullable$): we use $\bmkeps$ to store the lexing information in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$), and attach the collected bit-codes to the front of $a_2$ -before throwing away $a_1$. We assume that $\bmkeps$ -correctly extracts the bitcode for how $a_1$ -matches the string prior to $c$ (more on this later). -The bitsequence $\textit{bs}$ which was initially -attached to the first element of the sequence -$a_1 \cdot a_2$, has now been -elevated to the top level of the $\sum$ constructor. -This is because this piece of information will be needed -whichever way the sequence is matched, -regardless of whether $c$ belongs to $a_1$ or $a_2$. - +before throwing away $a_1$. In the injection-based lexer, $r_1$ is immediately thrown away in -subsequent derivatives on the right branch (when $r_1$ is $\nullable$), -depite $r_1$ potentially storing information of previous matches: +in the $\textit{if}$ branch, +the information $r_1$ stores is therefore lost: \begin{center} - $(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ + \begin{tabular}{lcl} +$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ + & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ + & & $\mathit{else}\; \ldots$\\ + \end{tabular} \end{center} \noindent -this loss of information makes it necessary -to store on stack all the regular expressions' -``snapshots'' before they were -taken derivative of. -So that the related information will be available -once the child recursive -call finishes.\\ + +%\noindent +%this loss of information makes it necessary +%to store on stack all the regular expressions' +%``snapshots'' before they were +%taken derivative of. +%So that the related information will be available +%once the child recursive +%call finishes.\\ The rest of the clauses of $\bder$ is rather similar to $\der$. \\ Generalising the derivative operation with bitcodes to strings, we have @@ -829,16 +846,15 @@ & & $\;\;\textit{else}\;\textit{None}$ \end{tabular} \end{center} - \noindent -$\blexer$ first attaches bitcodes to a -plain regular expression, and then do successive derivatives +This function first attaches bitcodes to a +plain regular expression $r$, and then builds successive derivatives with respect to the input string $s$, and -then test whether the result is nullable. -If yes, then extract the bitcodes out of the nullable expression, +then test whether the result is (b)nullable. +If yes, then extract the bitcodes from the nullable expression, and decodes the bitcodes into a lexical value. If there does not exists a match between $r$ and $s$ the lexer -outputs $\None$ indicating a failed lex.\\ +outputs $\None$ indicating a mismatch.\\ Ausaf and Urban formally proved the correctness of the $\blexer$, namely \begin{property} $\blexer \;r \; s = \lexer \; r \; s$. @@ -848,7 +864,7 @@ We introduce the proof later, after we give all the needed auxiliary functions and definitions. \subsection{An Example $\blexer$ Run} -Before introducing the proof we shall first walk the reader +Before introducing the proof we shall first walk through a concrete example of our $\blexer$ calculating the right lexical information through bit-coded regular expressions.\\ Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$. @@ -919,7 +935,7 @@ is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$ before processing $_Zb+_Sc$.\\ The annotated regular expressions -would look too cumbersome if we explicitly indicate all the +would look overwhelming if we explicitly indicate all the locations where bitcodes are attached. For example, $(aa)^*\cdot (b+c)$ would @@ -927,7 +943,7 @@ after internalise. Therefore for readability we omit bitcodes if they are empty. -This applies to all example annotated +This applies to all annotated regular expressions in this thesis.\\ %and assume we have just read the first character $a$: %\begin{center} @@ -1038,7 +1054,8 @@ allowing the possibility of pulling out the right lexical value from an annotated regular expression at - any stage of the algorithm. + any stage, using $\bmkeps$ or $\retrieve$ (details in + lemmas \ref{bmkepsRetrieve} and \ref{blexer_retrieve}). \end{itemize} We will elaborate on this, with the help of some helper functions such as $\retrieve$ and @@ -1051,9 +1068,9 @@ The first function we shall introduce is $\retrieve$. Sulzmann and Lu gave its definition, and Ausaf and Urban found -its usage in mechanised proofs. +useful in their mechanised proofs. Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$, -after all the derivatives has been taken: +after all the derivatives have been taken: \begin{center} \begin{tabular}{lcl} & & $\ldots$\\ @@ -1063,21 +1080,13 @@ \end{tabular} \end{center} \noindent -$\bmkeps$ retrieves the value $v$'s +The function $\bmkeps$ retrieves the value $v$'s information in the format of bitcodes, by travelling along the path of the regular expression that corresponds to a POSIX match, collecting all the bitcodes. We know that this "retrieved" bitcode leads to the correct value after decoding, -which is $v_0$ in the injection-based lexing diagram. -As an observation we pointed at the beginning of this section, -the annotated regular expressions generated in successive derivative steps -in $\blexer$ after $\erase$ has the same structure -as those appeared in $\lexer$. -We redraw the diagram below to visualise this fact. -We pretend that all the values are -ready despite they are only calculated in $\lexer$. -In general we have $\vdash v_i:(a_i)_\downarrow$. +which is $v_0$ in the injection-based lexing diagram: \vspace{20mm} \begin{figure}[H]%\label{graph:injLexer} \begin{center} @@ -1099,11 +1108,17 @@ \node[E = (p-1-2) (p-2-2)] {}; \node[E = (p-1-3) (p-2-3)] {}; \end{tikzpicture} - +\caption{Injection-based lexing diagram, +with matching value and regular expression pairs +encircled}\label{Inj2} \end{figure} \vspace{20mm} \noindent -We encircle in the diagram all the pairs $v_i, a_i$ to show that these values +We have that $\vdash v_i:(a_i)_\downarrow$, +where $v_i$ are the intermediate values generated step by step in $\lexer$. +These values are not explicitly created in $\blexer$. + +We encircled in the diagram all the pairs $v_i, a_i$ to show that these values and regular expressions share the same structure. These pairs all contain adequate information to obtain the final lexing result. @@ -1111,7 +1126,7 @@ lexical information is condensed in $v_0$, whereas for the rightmost pair, the lexing result hides is in the bitcodes of $a_n$.\\ -$\bmkeps$ is able to extract from $a_n$ the result +The function $\bmkeps$ is able to extract from $a_n$ the result by looking for nullable parts of the regular expression. However for regular expressions $a_i$ in general, they might not necessarily be nullable. @@ -1129,9 +1144,10 @@ $f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$ \end{itemize} \noindent -Sulzmann and Lu came up with a function satisfying +Sulzmann and Lu came up with the following definition for $f$ satisfying the above -requirements, named \emph{retrieve}: +requirements, and named it \emph{retrieve}: +\vspace{-7mm} \begin{center} \begin{tabular}{llcl} $\retrieve \; \, _{bs} \ONE$ & $ \Empty$ & $\dn$ & $\textit{bs}$\\ @@ -1155,8 +1171,9 @@ \end{tabular} \end{center} \noindent -As promised, $\retrieve$ collects the right bit-codes from the -final derivative $a_n$, guided by $v_n$: +As stated, $\retrieve$ collects the right bit-codes from the +final derivative $a_n$, guided by $v_n$. This can be proved +as follows: \begin{lemma}\label{bmkepsRetrieve} $\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$ \end{lemma} @@ -1164,17 +1181,17 @@ By a routine induction on $a$. \end{proof} \noindent -The design of $\retrieve$ enables us to extract bitcodes -from both annotated regular expressions and values. -$\retrieve$ always ``sucks up'' all the information. +%The design of $\retrieve$ enables us to extract bitcodes +%from both annotated regular expressions and values. +%$\retrieve$ always ``sucks up'' all the information. When more information is stored in the value, we would be able to extract more from the value, and vice versa. For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$ exactly the same bitcodes as $\code \; (\Stars \; vs)$: -\begin{lemma} +\begin{lemma}\label{retrieveEncodeSTARS} If $\forall v \in vs. \vdash v : r$, and $\code \; v = \retrieve \; (\internalise\; r) \; v$\\ then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \; r)^*) \; (\Stars \; vs)$ -\end{lemma}\label{retrieveEncodeSTARS} +\end{lemma} \begin{proof} By induction on the value list $vs$. \end{proof} @@ -1184,8 +1201,8 @@ \begin{center} $\retrieve \; _{bs}((\internalise \; r)^*) \; (\Stars \; [] ) = bs @ [S]$. \end{center} -In general, if we have a regular expression just internalised -and the lexing result value, we could $\retrieve$ the bitcdoes +In general, if we have a regular expression that is just internalised +and the final lexing result value, we can $\retrieve$ the bitcodes that look as if we have en$\code$-ed the value as bitcodes: \begin{lemma} $\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$ @@ -1195,7 +1212,7 @@ The star case uses lemma \ref{retrieveEncodeSTARS}. \end{proof} \noindent -The following property is more interesting, as +The following property of $\retrieve$ is crucial, as it provides a "bridge" between $a_0$ and $a_n $ in the lexing diagram by linking adjacent regular expressions $a_i$ and $a_{i+1}$. @@ -1203,13 +1220,14 @@ can retrieve the same bits from a derivative regular expression as those from before the derivative took place, -provided that the right values are used respectively: +provided that the corresponding values are used respectively: \begin{lemma}\label{retrieveStepwise} $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ \end{lemma} \begin{proof} - By induction on $r$, where $v$ is allowed to be arbitrary. - The induction principle is function $\erase$'s cases. + By induction on $r$, generalising over $v$. + The induction principle in our formalisation + is function $\erase$'s cases. \end{proof} \noindent Before we move on to the next @@ -1222,13 +1240,13 @@ Using lemma \ref{bmkepsRetrieve}. \end{proof} \noindent -$\retrieve$ allows connecting +The function $\retrieve$ allows connecting between the intermediate results $a_i$ and $a_{i+1}$ in $\blexer$. For plain regular expressions something similar is required. \subsection{$\flex$} -Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$, +Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$, defined as \begin{center} \begin{tabular}{lcl} @@ -1238,6 +1256,7 @@ \end{center} which accumulates the characters that need to be injected back, and does the injection in a stack-like manner (last taken derivative first injected). +The function $\flex$ can calculate what $\lexer$ calculates, given the input regular expression $r$, the identity function $id$, the input string $s$ and the value $v_n= \mkeps \; (r\backslash s)$: @@ -1249,13 +1268,13 @@ \end{proof} \noindent The main advantage of using $\flex$ -in addition to $\lexer$ is that +in $\lexer$ is that $\flex$ makes the value $v$ and function $f$ that manipulates the value explicit parameters, which ``exposes'' $\lexer$'s recursive call arguments and allows us to ``set breakpoints'' and ``resume'' -at any point during $\lexer$'s recursive calls.\\ -The following stepwise property holds. +at any point during $\lexer$'s recursive calls. Therefore +the following stepwise property holds. \begin{lemma}\label{flexStepwise} $\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $ \end{lemma} @@ -1265,13 +1284,14 @@ \noindent With $\flex$ and $\retrieve$, we are ready to connect $\lexer$ and $\blexer$, -giving the correctness proof. +giving the correctness proof for $\blexer$. %---------------------------------------------------------------------------------------- % SECTION correctness proof %---------------------------------------------------------------------------------------- -\section{Correctness of Bit-coded Algorithm (Without Simplification)} -$\flex$ and $\blexer$ essentially calculates the same thing. +\section{Correctness of the Bit-coded Algorithm (Without Simplification)} +We can first show that +$\flex$ and $\blexer$ calculates the same thing. \begin{lemma}\label{flex_retrieve} If $\vdash v: (r\backslash s)$, then\\ $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ @@ -1328,14 +1348,18 @@ result of the bitcoded lexer more explicitly, we use the fact from the previous chapter that \[ - (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v + (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s =\Some \; v + \quad \quad + \nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \lexer \;r\;s = None \] to obtain this corollary: \begin{corollary}\label{blexerPOSIX} - $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$ + The $\blexer$ function correctly implements a POSIX lexer, namely\\ + $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = \Some \; v$\\ + $\nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \blexer \;r\;s = None$ \end{corollary} -Our main reason for wanting a bit-coded algorithm over -the injection-based one is for its capabilities of allowing +Our main reason for analysing the bit-coded algorithm over +the injection-based one is that it allows us to define more aggressive simplifications. We will elaborate on this in the next chapter. diff -r c0c1ebe09c7d -r 8ffa28fce271 ChengsongTanPhdThesis/Chapters/Bitcoded2.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Nov 12 00:37:23 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Nov 12 21:34:40 2022 +0000 @@ -12,32 +12,37 @@ In this chapter we introduce simplifications -on annotated regular expressions that can be applied to +for annotated regular expressions that can be applied to each intermediate derivative result. This allows us to make $\blexer$ much more efficient. -Sulzmann and Lu already had some bit-coded simplifications, -but their simplification functions were inefficient. -We contrast our simplification function -with Sulzmann and Lu's, indicating the simplicity of our algorithm. -This is another case for the usefulness -and reliability of formal proofs on algorithms. -These ``aggressive'' simplifications would not be possible in the injection-based -lexing we introduced in chapter \ref{Inj}. -We then prove the correctness with the improved version of -$\blexer$, called $\blexersimp$, by establishing -$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. - +Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, +but their simplification functions were inefficient and in some cases needs fixing. +%We contrast our simplification function +%with Sulzmann and Lu's, indicating the simplicity of our algorithm. +%This is another case for the usefulness +%and reliability of formal proofs on algorithms. +%These ``aggressive'' simplifications would not be possible in the injection-based +%lexing we introduced in chapter \ref{Inj}. +%We then prove the correctness with the improved version of +%$\blexer$, called $\blexersimp$, by establishing +%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. +% \section{Simplifications by Sulzmann and Lu} -Consider the derivatives of examples such as $(a^*a^*)^*$ -and $(a^* + (aa)^*)^*$: +Consider the derivatives of the following example $(a^*a^*)^*$: +%and $(a^* + (aa)^*)^*$: \begin{center} - $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\ - $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$ + \begin{tabular}{lcl} + $(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & + $ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ + & + $ \stackrel{\backslash a}{\longrightarrow} $ & + $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ + & $\stackrel{\backslash a}{ + \longrightarrow} $ & $\ldots$\\ + \end{tabular} \end{center} \noindent -As can be seen, there is a lot of duplication -in the example we have already mentioned in -\ref{eqn:growth2}. +As can be seen, there are serveral duplications. A simple-minded simplification function cannot simplify the third regular expression in the above chain of derivative regular expressions, namely @@ -52,52 +57,51 @@ \begin{gather*} ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ - \bigg\downarrow \\ + \bigg\downarrow (1) \\ (a^*a^* + a^* \color{gray} + a^* \color{black})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\ - \bigg\downarrow \\ + \bigg\downarrow (2) \\ (a^*a^* + a^* )\cdot(a^*a^*)^* \color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\ - \bigg\downarrow \\ + \bigg\downarrow (3) \\ (a^*a^* + a^* )\cdot(a^*a^*)^* \end{gather*} \noindent In the first step, the nested alternative regular expression $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$. -Now the third term $a^*$ is clearly identified as a duplicate -and therefore removed in the second step. This causes the two +Now the third term $a^*$ can clearly be identified as a duplicate +and therefore removed in the second step. +This causes the two top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ -removed in the final step.\\ -This motivating example is from testing Sulzmann and Lu's -algorithm: their simplification does -not work! -Consider their simplification (using our notations): +removed in the final step. +Sulzmann and Lu's simplification function (using our notations) can achieve this +simplification: \begin{center} \begin{tabular}{lcl} - $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & + $\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ - $\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} + $\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} \; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\; \textit{then} \;\; \ZERO$\\ - & & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot - (\simpsulz \; r_2))$\\ - $\simpsulz \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ - $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & + & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot + (\textit{simp}\_{SL} \; r_2))$\\ + $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ + $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ - $\simpsulz \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz \; r)$\\ - $\simpsulz \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum - (\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\ + $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ + $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum + (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ \end{tabular} \end{center} \noindent -where the $\textit{zeroable}$ predicate +The $\textit{zeroable}$ predicate tests whether the regular expression -is equivalent to $\ZERO$, +is equivalent to $\ZERO$, and can be defined as: \begin{center} \begin{tabular}{lcl} @@ -111,25 +115,36 @@ \end{tabular} \end{center} \noindent -They suggested that the $\simpsulz $ function should be -applied repeatedly until a fixpoint is reached. -We call this construction $\textit{sulzSimp}$: +The \begin{center} \begin{tabular}{lcl} - $\textit{sulzSimp} \; r$ & $\dn$ & - $\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\ - & & $\quad r := \simpsulz \; r$\\ + $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & + $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ + \end{tabular} +\end{center} +\noindent +clause does flatten the alternative as required in step (1), +but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3), +as these ``identical'' terms have different bit-annotations. +They also suggested that the $\textit{simp}\_{SL} $ function should be +applied repeatedly until a fixpoint is reached. +We call this construction $\textit{SLSimp}$: +\begin{center} + \begin{tabular}{lcl} + $\textit{SLSimp} \; r$ & $\dn$ & + $\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\ + & & $\quad r := \textit{simp}\_{SL} \; r$\\ & & $\textit{return} \; r$ \end{tabular} \end{center} We call the operation of alternatingly applying derivatives and simplifications (until the string is exhausted) Sulz-simp-derivative, -written $\backslash_{sulzSimp}$: +written $\backslash_{SLSimp}$: \begin{center} \begin{tabular}{lcl} - $r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\ -$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$ + $r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\ +$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$ \end{tabular} \end{center} \noindent @@ -138,8 +153,8 @@ as $\blexer$: \begin{center} \begin{tabular}{lcl} - $\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\ + $\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ & + $\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$ @@ -148,7 +163,7 @@ \noindent We implemented this lexing algorithm in Scala, and found that the final derivative regular expression -size grows exponentially fast: +size grows exponentially fast (note the logarithmic scale): \begin{figure}[H] \centering \begin{tikzpicture} @@ -169,7 +184,7 @@ \noindent At $n= 20$ we already get an out of memory error with Scala's normal JVM heap size settings. -In fact their simplification does not improve over +In fact their simplification does not improve much over the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}. The time required also grows exponentially: \begin{figure}[H] @@ -199,46 +214,47 @@ \noindent The assumption that the size of the regular expressions in the algorithm -would stay below a finite constant is not ture. -The main reason behind this is that (i) The $\textit{nub}$ +would stay below a finite constant is not true, not in the +examples we considered. +The main reason behind this is that (i) the $\textit{nub}$ function requires identical annotations between two annotated regular expressions to qualify as duplicates, -and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$ -even if both $a^*$ denote the same language. -(ii) The ``flattening'' only applies to the head of the list +and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ +even if both $a^*$ denote the same language, and +(ii) the ``flattening'' only applies to the head of the list in the \begin{center} \begin{tabular}{lcl} - $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & + $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ \end{tabular} \end{center} \noindent -clause, and therefore is not thorough enough to simplify all -needed parts of the regular expression.\\ -In addition to that, even if the regular expressions size +clause, and therefore is not strong enough to simplify all +needed parts of the regular expression. Moreover, even if the regular expressions size do stay finite, one has to take into account that -the $\simpsulz$ function is applied many times +the $\textit{simp}\_{SL}$ function is applied many times in each derivative step, and that number is not necessarily a constant with respect to the size of the regular expression. -To not get ``caught off guard'' by -these counterexamples, -one needs to be more careful when designing the -simplification function and making claims about them. +%To not get ``caught off guard'' by +%these counterexamples, +%one needs to be more careful when designing the +%simplification function and making claims about them. \section{Our $\textit{Simp}$ Function} -We will now introduce our simplification function, -by making a contrast with $\simpsulz$. -We describe -the ideas behind components in their algorithm -and why they fail to achieve the desired effect, followed -by our solution. These solutions come with correctness -statements that are backed up by formal proofs. +We will now introduce our simplification function. +%by making a contrast with $\textit{simp}\_{SL}$. +We also describe +the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ +algorithm +and why it fails to achieve the desired effect. +Our simplification function comes with a formal +correctness proof. \subsection{Flattening Nested Alternatives} The idea behind the clause \begin{center} -$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad + $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ \end{center} is that it allows @@ -250,25 +266,27 @@ \begin{center} $(a+r)+r \longrightarrow a+r$ \end{center} -The problem here is that only the head element -is ``spilled out'', -whereas we would want to flatten -an entire list to open up possibilities for further simplifications. +The problem is that only the head element +is ``spilled out''. +It is more desirable +to flatten +an entire list to open up possibilities for further simplifications +with later regular expressions. Not flattening the rest of the elements also means that the later de-duplication processs does not fully remove further duplicates. For example, -using $\simpsulz$ we could not +using $\textit{simp}\_{SL}$ we cannot simplify \begin{center} $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ \end{center} -due to the underlined part not in the first element -of the alternative.\\ -We define a flatten operation that flattens not only -the first regular expression of an alternative, -but the entire list: +due to the underlined part not being in the head +of the alternative. + +We define our flatten operation so that it flattens +the entire list: \begin{center} \begin{tabular}{@{}lcl@{}} $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; @@ -282,10 +300,10 @@ also throws away $\ZERO$s as they do not contribute to a lexing result. \subsection{Duplicate Removal} -After flattening is done, we are ready to deduplicate. +After flattening is done, we can deduplicate. The de-duplicate function is called $\distinctBy$, and that is where we make our second improvement over -Sulzmann and Lu's. +Sulzmann and Lu's simplification method. The process goes as follows: \begin{center} $rs \stackrel{\textit{flts}}{\longrightarrow} @@ -306,7 +324,7 @@ \noindent The reason we define a distinct function under a mapping $f$ is because we want to eliminate regular expressions that are syntactically the same, -but with different bit-codes. +but have different bit-codes. For example, we can remove the second $a^*a^*$ from $_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it represents a match with shorter initial sub-match @@ -327,21 +345,23 @@ even if they are attached with different bitcodes. These duplicates therefore need to be removed. To achieve this, we call $\rerases$ as the function $f$ during the distinction -operation.\\ +operation. The function $\rerases$ is very similar to $\erase$, except that it preserves the structure when erasing an alternative regular expression. The reason why we use $\rerases$ instead of $\erase$ is that it keeps the structures of alternative annotated regular expressions -whereas $\erase$ would turn it back into a binary structure. +whereas $\erase$ would turn it back into a binary tree structure. Not having to mess with the structure greatly simplifies the finiteness proof in chapter -\ref{Finite} (we will follow up with more details there). +\ref{Finite}. We give the definitions of $\rerases$ here together with the new datatype used by $\rerases$ (as our plain regular expression datatype does not allow non-binary alternatives). -For the moment the reader can just think of -$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions. +For now we can think of +$\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1} +and $\rrexp$ as plain regular expressions, but having a general list constructor +for alternatives: \begin{figure}[H] \begin{center} $\rrexp ::= \RZERO \mid \RONE @@ -353,9 +373,7 @@ \caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative constructor}\label{rrexpDef} \end{figure} -The notation of $\rerases$ also follows that of $\erase$, -which is a postfix operator written as a subscript, -except that it has an \emph{r} attached to it to distinguish against $\erase$: +The function $\rerases$ we define as follows: \begin{center} \begin{tabular}{lcl} $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ @@ -368,8 +386,7 @@ \end{center} \subsection{Putting Things Together} -A recursive definition of our simplification function -is given below: +We can now give the definition of our simplification function: %that looks somewhat similar to our Scala code is \begin{center} \begin{tabular}{@{}lcl@{}} @@ -404,21 +421,20 @@ and then call $\distinctBy$ on that list, the predicate determining whether two elements are the same is $\rerases \; r_1 = \rerases\; r_2$. Finally, depending on whether the regular expression list $as'$ has turned into a -singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$ +singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$ decides whether to keep the current level constructor $\sum$ as it is, and removes it when there are less than two elements: \begin{center} \begin{tabular}{lcl} - $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ + $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ \end{tabular} \end{center} -Having defined the $\bsimp$ function, -we add it as a phase after a derivative is taken, -so it stays small: +Having defined the $\textit{bsimp}$ function, +we add it as a phase after a derivative is taken. \begin{center} \begin{tabular}{lcl} $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ @@ -428,7 +444,7 @@ %when extending from derivatives w.r.t.~character to derivative %w.r.t.~string, we define the derivative that nests simplifications %with derivatives:%\comment{simp in the [] case?} -We extend this from character to string: +We extend this from characters to strings: \begin{center} \begin{tabular}{lcl} $r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ @@ -458,7 +474,7 @@ After Simplification} Recall the previous $(a^*a^*)^*$ example -where $\simpsulz$ could not +where $\textit{simp}\_{SL}$ could not prevent the fast growth (over 3 million nodes just below $20$ input length) will be reduced to just 15 and stays constant no matter how long the @@ -486,7 +502,7 @@ ylabel={derivative size}, width = 7cm, height = 4cm, - legend entries={Lexer with $\simpsulz$}, + legend entries={Lexer with $\textit{simp}\_{SL}$}, legend pos= north west, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; @@ -499,36 +515,38 @@ \noindent Given the size difference, it is not surprising that our $\blexersimp$ significantly outperforms -$\textit{blexer\_sulzSimp}$. -In the next section we are going to establish the -first important property of our lexer--the correctness. +$\textit{blexer\_SLSimp}$. +In the next section we are going to establish that our +simplification preserves the correctness of the algorithm. %---------------------------------------------------------------------------------------- % SECTION rewrite relation %---------------------------------------------------------------------------------------- \section{Correctness of $\blexersimp$} -In this section we give details -of the correctness proof of $\blexersimp$, -one of the contributions of this thesis.\\ We first introduce the rewriting relation \emph{rrewrite} ($\rrewrite$) between two regular expressions, -which expresses an atomic +which stands for an atomic simplification. We then prove properties about this rewriting relation and its reflexive transitive closure. Finally we leverage these properties to show -an equivalence between the internal data structures of +an equivalence between the results generated by $\blexer$ and $\blexersimp$. \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} In the $\blexer$'s correctness proof, we did not directly derive the fact that $\blexer$ generates the POSIX value, -but first proved that $\blexer$ is linked with $\lexer$. +but first proved that $\blexer$ generates the same result as $\lexer$. Then we re-use the correctness of $\lexer$ -to obtain +to obtain \begin{center} - $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$. + $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\ + $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; + r\;s = \None$. \end{center} +%\begin{center} +% $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$. +%\end{center} Here we apply this modularised technique again by first proving that @@ -536,22 +554,25 @@ produces the same output as $\blexer \; r\; s$, and then piecing it together with $\blexer$'s correctness to achieve our main -theorem:\footnote{ The case when -$s$ is not in $L \; r$, is routine to establish.} +theorem: \begin{center} - $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$ + $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$ + \\ + $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; + r\;s = \None$ \end{center} \noindent The overall idea for the proof of $\blexer \;r \;s = \blexersimp \; r \;s$ is that the transition from $r$ to $\textit{bsimp}\; r$ can be -broken down into finitely many rewrite steps: +broken down into smaller rewrite steps of the form: \begin{center} $r \rightsquigarrow^* \textit{bsimp} \; r$ \end{center} where each rewrite step, written $\rightsquigarrow$, is an ``atomic'' simplification that -is similar to a small-step reduction in operational semantics: +is similar to a small-step reduction in operational semantics ( +see figure \ref{rrewriteRules} for the rules): \begin{figure}[H] \begin{mathpar} \inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} @@ -596,11 +617,11 @@ }\label{rrewriteRules} \end{figure} \noindent -The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists +The rules $LT$ and $LH$ are for rewriting two regular expression lists such that one regular expression in the left-hand-side list is rewritable in one step to the right-hand-side's regular expression at the same position. -This helps with defining the ``context rules'' such as $AL$.\\ +This helps with defining the ``context rule'' $AL$.\\ The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ are defined in the usual way: \begin{figure}[H] @@ -619,20 +640,21 @@ \end{figure} %Two rewritable terms will remain rewritable to each other %even after a derivative is taken: -Rewriting is preserved under derivatives, +The main point of our rewriting relation +is that it is preserved under derivatives, namely \begin{center} $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ \end{center} -And finally, if two terms are rewritable to each other, +And also, if two terms are rewritable to each other, then they produce the same bitcodes: \begin{center} $r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$ \end{center} The decoding phase of both $\blexer$ and $\blexersimp$ -are the same, which means that if they get the same +are the same, which means that if they receive the same bitcodes before the decoding phase, -they get the same value after decoding is done. +they generate the same value after decoding is done. We will prove the three properties we mentioned above in the next sub-section. \subsection{Important Properties of $\rightsquigarrow$} @@ -685,7 +707,7 @@ \end{proof} \noindent The inference rules of $\stackrel{s}{\rightsquigarrow}$ -are defined in terms of list cons operation, here +are defined in terms of the list cons operation, where we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is also preserved w.r.t appending and prepending of a list. @@ -717,9 +739,9 @@ $rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$ \item - If we could rewrite a regular expression + If we can rewrite a regular expression in many steps to $\ZERO$, then - we could also rewrite any sequence containing it to $\ZERO$:\\ + we can also rewrite any sequence containing it to $\ZERO$:\\ $r_1 \rightsquigarrow^* \ZERO \implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$ \end{itemize} @@ -736,18 +758,18 @@ The last part is proven by rule induction again on $\rightsquigarrow^*$. \end{proof} \noindent -Now we are ready to give the proofs of the below properties: +Now we are ready to give the proofs of the following properties: \begin{itemize} \item - $(r \rightsquigarrow^* r'\land \bnullable \; r_1) + $r \rightsquigarrow^* r'\land \bnullable \; r_1 \implies \bmkeps \; r = \bmkeps \; r'$. \\ \item $r \rightsquigarrow^* \textit{bsimp} \;r$.\\ \item $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\ \end{itemize} -These properties would work together towards the correctness theorem. -\subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1) + +\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 \implies \bmkeps \; r = \bmkeps \; r'$} Intuitively, this property says we can extract the same bitcodes using $\bmkeps$ from the nullable @@ -761,7 +783,7 @@ $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$ \end{center} \noindent -The rewriting relation $\rightsquigarrow$ preserves nullability: +The rewriting relation $\rightsquigarrow$ preserves (b)nullability: \begin{lemma}\label{rewritesBnullable} \hspace{0em} \begin{itemize} @@ -812,23 +834,23 @@ By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$. \end{proof} \noindent -With lemma \ref{rewriteBmkepsAux} we are ready to prove its +With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its many-step version: \begin{lemma} $\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$ \end{lemma} \begin{proof} - By rule induction of $\stackrel{*}{\rightsquigarrow} $. - $\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable. - \ref{rewriteBmkepsAux} solves the inductive case. + By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma + $\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable. + The lemma \ref{rewriteBmkepsAux} solves the inductive case. \end{proof} -\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$} -Now we get to the ``meaty'' part of the proof, +\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$} +Now we get to the key part of the proof, which says that our simplification's helper functions -such as $\distinctBy$ and $\flts$ conform to -the $\stackrel{s*}{\rightsquigarrow}$ and -$\rightsquigarrow^* $ rewriting relations.\\ +such as $\distinctBy$ and $\flts$ describe +reducts of $\stackrel{s*}{\rightsquigarrow}$ and +$\rightsquigarrow^* $.\\ The first lemma to prove is a more general version of $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: \begin{lemma} @@ -849,7 +871,7 @@ $rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$. \end{corollary} \noindent -The flatten function $\flts$ conforms to +Similarly the flatten function $\flts$ describes a reduct of $\stackrel{s*}{\rightsquigarrow}$ as well: \begin{lemma}\label{fltsPreserves} @@ -865,14 +887,14 @@ \end{lemma} \noindent The simplification function -$\textit{bsimp}$ only transforms the regex $r$ using steps specified by -$\rightsquigarrow^*$ and nothing else. +$\textit{bsimp}$ only transforms the regular expression using steps specified by +$\rightsquigarrow^*$ and nothing else: \begin{lemma} - $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ + $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ \end{lemma} \begin{proof} By an induction on $r$. - The most involved case would be the alternative, + The most involved case is the alternative, where we use lemmas \ref{bsimpaltsPreserves}, \ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\ \begin{center} @@ -883,7 +905,7 @@ (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\ \end{tabular} \end{center} - Using this we derive the following rewrite relation:\\ + Using this we can derive the following rewrite sequence:\\ \begin{center} \begin{tabular}{lcl} $r$ & $=$ & $_{bs}\sum rs$\\[1.5ex] @@ -900,10 +922,9 @@ \end{center} \end{proof} \subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$} -The rewritability relation -$\rightsquigarrow$ is preserved under derivatives-- -it is just that we might need multiple steps -where originally only one step was needed: +The rewrite relation +$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$ +after derivatives are taken on both sides: \begin{lemma}\label{rewriteBder} \hspace{0em} \begin{itemize} @@ -927,11 +948,11 @@ r_2 \backslash c$ \end{corollary} \begin{proof} - By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}. + By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}. \end{proof} \noindent This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$ -to obtain the rewritability between +to obtain the correspondence between $\blexer$ and $\blexersimp$'s intermediate derivative regular expressions \begin{lemma}\label{bderBderssimp} @@ -941,52 +962,56 @@ By an induction on $s$. \end{proof} \subsection{Main Theorem} -Now with \ref{bderBderssimp} we are ready for the main theorem. +Now with \ref{bderBderssimp} in place we are ready for the main theorem. \begin{theorem} $\blexer \; r \; s = \blexersimp{r}{s}$ \end{theorem} \noindent \begin{proof} - One can rewrite in many steps from the original lexer's + We can rewrite in many steps from the original lexer's derivative regular expressions to the lexer with simplification applied (by lemma \ref{bderBderssimp}): \begin{center} $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $. \end{center} - we know that they give out the same bits, if the lexing result is a match: + We know that they generate the same bits, if the lexing result is a match: \begin{center} $\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ \end{center} - Now that they give out the same bits, we know that they give the same value after decoding. + Now that they generate the same bits, we know that they give the same value after decoding. \begin{center} $\bnullable \; (a \backslash s) \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ \end{center} - Which is equivalent to our proof goal: + Which is required by our proof goal: \begin{center} $\blexer \; r \; s = \blexersimp \; r \; s$. \end{center} \end{proof} \noindent As a corollary, -we link this result with the lemma we proved earlier that +we can link this result with the lemma we proved earlier that \begin{center} - $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$ + $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ + $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; + r\;s = \None$. \end{center} and obtain the corollary that the bit-coded lexer with simplification is indeed correctly outputting POSIX lexing result, if such a result exists. \begin{corollary} - $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $ + $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ + $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; + r\;s = \None$. \end{corollary} \subsection{Comments on the Proof Techniques Used} Straightforward and simple as the proof may seem, -the efforts we spent obtaining it were far from trivial.\\ +the efforts we spent obtaining it were far from trivial. We initially attempted to re-use the argument in \cref{flex_retrieve}. -The problem was that both functions $\inj$ and $\retrieve$ require +The problem is that both functions $\inj$ and $\retrieve$ require that the annotated regular expressions stay unsimplified, so that one can correctly compare $v_{i+1}$ and $r_i$ and $v_i$ @@ -1041,16 +1066,11 @@ such as $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow \SEQs [r_1, r_2, r_3]$. -This does not fit with the proof technique +However this does not fit with the proof technique of our main theorem, but seem to not violate the POSIX -property.\\ -Having correctness property is good. -But we would also a guarantee that the lexer is not slow in -some sense, for exampe, not grinding to a halt regardless of the input. -As we have already seen, Sulzmann and Lu's simplification function -$\simpsulz$ cannot achieve this, because their claim that -the regular expression size does not grow arbitrary large -was not true. -In the next chapter we shall prove that with our $\simp$, -for a given $r$, the internal derivative size is always +property. + +Having established the correctness of our +$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function, +for a given $r$, the derivative size is always finitely bounded by a constant. diff -r c0c1ebe09c7d -r 8ffa28fce271 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Nov 12 00:37:23 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Nov 12 21:34:40 2022 +0000 @@ -8,10 +8,11 @@ %regex's derivatives. -In this chapter we give a guarantee in terms of size of derivatives: +In this chapter we give a bound in terms of size of +the calculated derivatives: given an annotated regular expression $a$, for any string $s$ -our algorithm $\blexersimp$'s internal annotated regular expression -size is finitely bounded +our algorithm $\blexersimp$'s derivatives +are finitely bounded by a constant that only depends on $a$. Formally we show that there exists an $N_a$ such that \begin{center} @@ -32,8 +33,8 @@ grows beyond a finite bound, then clearly the algorithm (which traverses these structures) will be slow. - The next step would be to refine the bound $N_a$ so that it - is polynomial on $\llbracket a\rrbracket$. + The next step is to refine the bound $N_a$ so that it + is not just finite but polynomial in $\llbracket a\rrbracket$. \item Having the finite bound formalised gives us a higher confidence that @@ -41,7 +42,8 @@ like $\simpsulz$ does. The bound is universal for a given regular expression, which is an advantage over work which - only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}). + only gives empirical evidence on + some test cases (see Verbatim++ work\cite{Verbatimpp}). \end{itemize} In the next section we describe in more detail what the finite bound means in our algorithm diff -r c0c1ebe09c7d -r 8ffa28fce271 ChengsongTanPhdThesis/Chapters/RelatedWork.tex --- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Sat Nov 12 00:37:23 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Sat Nov 12 21:34:40 2022 +0000 @@ -15,13 +15,40 @@ with theoretical work on them being fairly recent. -Campaneu gave +Campaneu et al. studied regexes with back-references +in the context of formal languages theory in +their 2003 work \cite{campeanu2003formal}. +They devised a pumping lemma for Perl-like regexes, +proving that the langugages denoted by them +is properly contained in context-sensitive +languages. +More interesting questions such as +whether the language denoted by Perl-like regexes +can express palindromes ($\{w \mid w = w^R\}$) +are discussed in \cite{campeanu2009patterns} +and \cite{CAMPEANU2009Intersect}. +Freydenberger \cite{Frey2013} investigated the +expressive power of back-references. He showed several +undecidability and decriptional complexity results +of back-references, concluding that they add +great power to certain programming tasks but are difficult to harness. +An interesting question would then be +whether we can add restrictions to them, +so that they become expressive enough, but not too powerful. +Freydenberger and Schmid \cite{FREYDENBERGER20191} +introduced the notion of deterministic +regular expressions with back-references to achieve +a better balance between expressiveness and tractability. -See \cite{AHO1990255} for a survey +Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012} +investigated the time complexity of different variants +of back-references. + +See \cite{BERGLUND2022} for a survey of these works and comparison between different -flavours (e.g. whether references can be circular, -can labels be repeated etc.) of back-references syntax. +flavours of back-references syntax (e.g. whether references can be circular, +can labels be repeated etc.). \subsection{Matchers and Lexers with Mechanised Proofs} diff -r c0c1ebe09c7d -r 8ffa28fce271 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Sat Nov 12 00:37:23 2022 +0000 +++ b/ChengsongTanPhdThesis/example.bib Sat Nov 12 21:34:40 2022 +0000 @@ -112,25 +112,129 @@ +%% ------------------------------------- %% back-references-------------------- -@incollection{AHO1990255, -title = {CHAPTER 5 - Algorithms for Finding Patterns in Strings}, -editor = {JAN {VAN LEEUWEN}}, -booktitle = {Algorithms and Complexity}, -publisher = {Elsevier}, -address = {Amsterdam}, -pages = {255-300}, -year = {1990}, -series = {Handbook of Theoretical Computer Science}, -isbn = {978-0-444-88071-0}, -doi = {https://doi.org/10.1016/B978-0-444-88071-0.50010-2}, -url = {https://www.sciencedirect.com/science/article/pii/B9780444880710500102}, -author = {Alfred V. AHO}, -abstract = {Publisher Summary -This chapter discusses the algorithms for solving string-matching problems that have proven useful for text-editing and text-processing applications. String pattern matching is an important problem that occurs in many areas of science and information processing. In computing, it occurs naturally as part of data processing, text editing, term rewriting, lexical analysis, and information retrieval. Many text editors and programming languages have facilities for matching strings. In biology, string-matching problems arise in the analysis of nucleic acids and protein sequences, and in the investigation of molecular phylogeny. String matching is also one of the central and most widely studied problems in theoretical computer science. The simplest form of the problem is to locate an occurrence of a keyword as a substring in a sequence of characters, which is called the input string. For example, the input string queueing contains the keyword ueuei as a substring. Even for this problem, several innovative, theoretically interesting algorithms have been devised that run significantly faster than the obvious brute-force method.} +@article{FERNAU2015287, +title = {Pattern matching with variables: A multivariate complexity analysis}, +journal = {Information and Computation}, +volume = {242}, +pages = {287-305}, +year = {2015}, +issn = {0890-5401}, +doi = {https://doi.org/10.1016/j.ic.2015.03.006}, +url = {https://www.sciencedirect.com/science/article/pii/S0890540115000218}, +author = {H.~Fernau and M.L.~Schmid}, +keywords = {Parameterised pattern matching, Function matching, NP-completeness, Membership problem for pattern languages, Morphisms}, +abstract = {A pattern α, i.e., a string that contains variables and terminals, matches a terminal word w if w can be obtained by uniformly substituting the variables of α by terminal words. Deciding whether a given terminal word matches a given pattern is NP-complete and this holds for several natural variants of the problem that result from whether or not variables can be erased, whether or not the patterns are required to be terminal-free or whether or not the mapping of variables to terminal words must be injective. We consider numerous parameters of this problem (i.e., number of variables, length of w, length of the words substituted for variables, number of occurrences per variable, cardinality of the terminal alphabet) and for all possible combinations of the parameters (and variants described above), we answer the question whether or not the problem is still NP-complete if these parameters are bounded by constants.} +} + +@inproceedings{Schmid2012, +author = {M.L.~Schmid}, +title = {Inside the Class of REGEX Languages}, +year = {2012}, +isbn = {9783642316524}, +publisher = {Springer-Verlag}, +address = {Berlin, Heidelberg}, +url = {https://doi.org/10.1007/978-3-642-31653-1_8}, +doi = {10.1007/978-3-642-31653-1_8}, +abstract = {We study different possibilities of combining the concept of homomorphic replacement with regular expressions in order to investigate the class of languages given by extended regular expressions with backreferences (REGEX). It is shown in which regard existing and natural ways to do this fail to reach the expressive power of REGEX. Furthermore, the complexity of the membership problem for REGEX with a bounded number of backreferences is considered.}, +booktitle = {Proceedings of the 16th International Conference on Developments in Language Theory}, +pages = {73–84}, +numpages = {12}, +keywords = {extended regular expressions, pattern languages, REGEX, pattern expressions, homomorphic replacement}, +location = {Taipei, Taiwan}, +series = {DLT'12} +} + + + +@article{BERGLUND2022, +title = {Re-examining regular expressions with backreferences}, +journal = {Theoretical Computer Science}, +year = {2022}, +issn = {0304-3975}, +doi = {https://doi.org/10.1016/j.tcs.2022.10.041}, +url = {https://www.sciencedirect.com/science/article/pii/S0304397522006570}, +author = {Martin Berglund and Brink {van der Merwe}}, +keywords = {Regular expressions, Backreferences}, +abstract = {Most modern regular expression matching libraries (one of the rare exceptions being Google's RE2) allow backreferences, operations which bind a substring to a variable, allowing it to be matched again verbatim. However, both real-world implementations and definitions in the literature use different syntactic restrictions and have differences in the semantics of the matching of backreferences. Our aim is to compare these various flavors by considering the classes of formal languages that each can describe, establishing, as a result, a hierarchy of language classes. Beyond the hierarchy itself, some complexity results are given, and as part of the effort on comparing language classes new pumping lemmas are established, old classes are extended to new ones, and several incidental results on the nature of these language classes are given.} } +@article{FREYDENBERGER20191, +title = {Deterministic regular expressions with back-references}, +journal = {Journal of Computer and System Sciences}, +volume = {105}, +pages = {1-39}, +year = {2019}, +issn = {0022-0000}, +doi = {https://doi.org/10.1016/j.jcss.2019.04.001}, +url = {https://www.sciencedirect.com/science/article/pii/S0022000018301818}, +author = {Dominik D. Freydenberger and Markus L. Schmid}, +keywords = {Deterministic regular expression, Regex, Glushkov automaton}, +abstract = {Most modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction. We demonstrate that, compared to their non-deterministic superclass, these deterministic regular expressions with back-references have desirable algorithmic properties (i.e., efficiently solvable membership problem and some decidable problems in static analysis), while, at the same time, their expressive power exceeds that of deterministic regular expressions without back-references.} +} +@InProceedings{Frey2013, + author = {Dominik D. Freydenberger}, + title = {{Extended Regular Expressions: Succinctness and Decidability}}, + booktitle = {28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011) }, + pages = {507--518}, + series = {Leibniz International Proceedings in Informatics (LIPIcs)}, + ISBN = {978-3-939897-25-5}, + ISSN = {1868-8969}, + year = {2011}, + volume = {9}, + editor = {Thomas Schwentick and Christoph D{\"u}rr}, + publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik}, + address = {Dagstuhl, Germany}, + URL = {http://drops.dagstuhl.de/opus/volltexte/2011/3039}, + URN = {urn:nbn:de:0030-drops-30396}, + doi = {10.4230/LIPIcs.STACS.2011.507}, + annote = {Keywords: extended regular expressions, regex, decidability, non-recursive tradeoffs} +} + + + +%% -------------------------- campeanu related +@article{campeanu2003formal, + author = {C.~C{\^a}mpeanu and K.~Salomaa and S.~Yu}, + journal = {International Journal of Foundations of Computer Science}, + number = {06}, + pages = {1007--1018}, + publisher = {World Scientific}, + title = {A formal study of practical regular expressions}, + volume = {14}, + year = {2003}} + +@article{campeanu2009patterns, +author = {C.~C{\^a}mpeanu and N.~Santean}, +year = {2009}, +month = {05}, +pages = {193-207}, +title = {On the closure of pattern expressions languages under intersection with regular languages}, +volume = {46}, +journal = {Acta Inf.}, +doi = {10.1007/s00236-009-0090-y} +} + +@article{CAMPEANU2009Intersect, + abstract = {In this paper we revisit the semantics of extended regular expressions (regex), defined succinctly in the 90s [A.V. Aho, Algorithms for finding patterns in strings, in: Jan van Leeuwen (Ed.), Handbook of Theoretical Computer Science, in: Algorithms and Complexity, vol. A, Elsevier and MIT Press, 1990, pp. 255--300] and rigorously in 2003 by C{\^a}mpeanu, Salomaa and Yu [C. C{\^a}mpeanu, K. Salomaa, S. Yu, A formal study of practical regular expressions, IJFCS 14 (6) (2003) 1007--1018], when the authors reported an open problem, namely whether regex languages are closed under the intersection with regular languages. We give a positive answer; and for doing so, we propose a new class of machines --- regex automata systems (RAS) --- which are equivalent to regex. Among others, these machines provide a consistent and convenient method of implementing regex in practice. We also prove, as a consequence of this closure property, that several languages, such as the mirror language, the language of palindromes, and the language of balanced words are not regex languages.}, + author = {Cezar C{\^a}mpeanu and Nicolae Santean}, + doi = {https://doi.org/10.1016/j.tcs.2009.02.022}, + issn = {0304-3975}, + journal = {Theoretical Computer Science}, + keywords = {Extended regular expression, Regex automata system, Regex}, + note = {Formal Languages and Applications: A Collection of Papers in Honor of Sheng Yu}, + number = {24}, + pages = {2336-2344}, + title = {On the intersection of regex languages with regular languages}, + url = {https://www.sciencedirect.com/science/article/pii/S0304397509001789}, + volume = {410}, + year = {2009}, + bdsk-url-1 = {https://www.sciencedirect.com/science/article/pii/S0304397509001789}, + bdsk-url-2 = {https://doi.org/10.1016/j.tcs.2009.02.022}} + %% back-references-------------------- +%% ------------------------------------- @article{fowler2003, title={An interpretation of the POSIX regex standard}, @@ -276,15 +380,6 @@ title = {Static Analysis for Regular Expression Exponential Runtime via Substructural Logics}, year = {2017}} -@article{campeanu2003formal, - author = {C{\^a}mpeanu, Cezar and Salomaa, Kai and Yu, Sheng}, - journal = {International Journal of Foundations of Computer Science}, - number = {06}, - pages = {1007--1018}, - publisher = {World Scientific}, - title = {A formal study of practical regular expressions}, - volume = {14}, - year = {2003}} @article{alfred2014algorithms, author = {Alfred, V}, @@ -295,23 +390,7 @@ volume = {1}, year = {2014}} -@article{CAMPEANU2009Intersect, - abstract = {In this paper we revisit the semantics of extended regular expressions (regex), defined succinctly in the 90s [A.V. Aho, Algorithms for finding patterns in strings, in: Jan van Leeuwen (Ed.), Handbook of Theoretical Computer Science, in: Algorithms and Complexity, vol. A, Elsevier and MIT Press, 1990, pp. 255--300] and rigorously in 2003 by C{\^a}mpeanu, Salomaa and Yu [C. C{\^a}mpeanu, K. Salomaa, S. Yu, A formal study of practical regular expressions, IJFCS 14 (6) (2003) 1007--1018], when the authors reported an open problem, namely whether regex languages are closed under the intersection with regular languages. We give a positive answer; and for doing so, we propose a new class of machines --- regex automata systems (RAS) --- which are equivalent to regex. Among others, these machines provide a consistent and convenient method of implementing regex in practice. We also prove, as a consequence of this closure property, that several languages, such as the mirror language, the language of palindromes, and the language of balanced words are not regex languages.}, - author = {Cezar C{\^a}mpeanu and Nicolae Santean}, - doi = {https://doi.org/10.1016/j.tcs.2009.02.022}, - issn = {0304-3975}, - journal = {Theoretical Computer Science}, - keywords = {Extended regular expression, Regex automata system, Regex}, - note = {Formal Languages and Applications: A Collection of Papers in Honor of Sheng Yu}, - number = {24}, - pages = {2336-2344}, - title = {On the intersection of regex languages with regular languages}, - url = {https://www.sciencedirect.com/science/article/pii/S0304397509001789}, - volume = {410}, - year = {2009}, - bdsk-url-1 = {https://www.sciencedirect.com/science/article/pii/S0304397509001789}, - bdsk-url-2 = {https://doi.org/10.1016/j.tcs.2009.02.022}} - + @article{nielson11bcre, author = {Lasse Nielsen, Fritz Henglein}, date-added = {2019-07-03 21:09:39 +0000},