# HG changeset patch # User Chengsong # Date 1664759329 -3600 # Node ID 61139fdddae0d0b957a7b76aefde4362c3a3f504 # Parent 37b6fd310a168af27e78dbe45b321123f984cd7e chap1 totally done diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Sat Oct 01 12:06:46 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Oct 03 02:08:49 2022 +0100 @@ -402,7 +402,8 @@ \end{center} \noindent We do not define an r-regular expression version of $\blexersimp$, -as our proof does not involve its use. +as our proof does not involve its use +(and there is no bitcode to decode into a lexical value). Everything about the size of annotated regular expressions can be calculated via the size of r-regular expressions: \begin{lemma}\label{sizeRelations} @@ -433,7 +434,7 @@ \end{center} Unless stated otherwise in the rest of this chapter all regular expressions without -bitcodes are seen as $\rrexp$s. +bitcodes are seen as r-regular expressions ($\rrexp$s). For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$, we use the notation $r_1 + r_2$ for brevity. @@ -476,10 +477,10 @@ %\end{quote} %\noindent %We explain in detail how we reached those claims. -\subsection{Basic Properties needed for Closed Forms} +\subsection{The Idea Behind Closed Forms} If we attempt to prove \begin{center} - $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$ + $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$ \end{center} using a naive induction on the structure of $r$, then we are stuck at the inductive cases such as @@ -487,51 +488,89 @@ The inductive hypotheses are: \begin{center} 1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. - \;\;\forall s. \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\ + \;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\ 2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. - \;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $ + \;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $ \end{center} The inductive step to prove would be \begin{center} $\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. - \llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ + \llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ \end{center} The problem is that it is not clear what -$(r_1\cdot r_2) \backslash_{bsimps} s$ looks like, +$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, and therefore $N_{r_1}$ and $N_{r_2}$ in the inductive hypotheses cannot be directly used. +We have already seen that $(r_1 \cdot r_2)\backslash s$ +and $(r^*)\backslash s$ can grow in a wild way. +The point is that they will be equivalent to a list of +terms $\sum rs$, where each term in $rs$ will +be made of $r_1 \backslash s' $, $r_2\backslash s'$, +and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. +The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ +in the simplification which saves $rs$ from growing indefinitely. + +Based on this idea, we sketch a proof by first showing the equality (where +$f$ and $g$ are functions that do not increase the size of the input) +\begin{center} +$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, +\end{center} +and then show the right-hand-side can be finitely bounded. +We call the right-hand-side the +\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. +We will flesh out the proof sketch in the next section. + +\section{Details of Closed Forms and Bounds} +In this section we introduce in detail +how the closed forms are obtained for regular expressions' +derivatives and how they are bounded. +We start by proving some basic identities +involving the simplification functions for r-regular expressions. +After that we use these identities to establish the +closed forms we need. +Finally, we prove the functions such as $\flts$ +will keep the size non-increasing. +Putting this together with a general bound +on the finiteness of distinct regular expressions +smaller than a certain size, we obtain a bound on +the closed forms. +%$r_1\cdot r_2$, $r^*$ and $\sum rs$. -The next steps involves getting a closed form for -$\rderssimp{r}{s}$ and then obtaining -an estimate for the closed form. + +\subsection{Some Basic Identities} + + \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} The $\textit{rdistinct}$ function, as its name suggests, will -remove duplicates in an \emph{r}$\textit{rexp}$ list, -according to the accumulator -and leave only one of each different element in a list: +remove duplicates in an r-regular expression list. +It will also correctly exclude any elements that +is intially in the accumulator set. \begin{lemma}\label{rdistinctDoesTheJob} - The function $\textit{rdistinct}$ satisfies the following - properties: + %The function $\textit{rdistinct}$ satisfies the following + %properties: + Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its + recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} + readily defined + for testing + whether a list's elements are all unique. Then the following + properties about $\textit{rdistinct}$ hold: \begin{itemize} \item If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. \item - If list $rs'$ is the result of $\rdistinct{rs}{acc}$, - then $\textit{isDistinct} \; rs'$. + %If list $rs'$ is the result of $\rdistinct{rs}{acc}$, + $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$. \item - $\rdistinct{rs}{acc} = rs - acc$ + $\textit{set} \; (\rdistinct{rs}{acc}) + = (textit{set} \; rs) - acc$ \end{itemize} \end{lemma} \noindent -The predicate $\textit{isDistinct}$ is for testing -whether a list's elements are all unique. It is defined -recursively on the structure of a regular expression, -and we omit the precise definition here. \begin{proof} The first part is by an induction on $rs$. The second and third part can be proven by using the - induction rules of $\rdistinct{\_}{\_}$. + inductive cases of $\textit{rdistinct}$. \end{proof} @@ -541,7 +580,7 @@ list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$ on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$ without the prepending of $rs$: -\begin{lemma} +\begin{lemma}\label{rdistinctConcat} The elements appearing in the accumulator will always be removed. More precisely, \begin{itemize} @@ -549,13 +588,13 @@ If $rs \subseteq rset$, then $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. \item - Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, + More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ \end{itemize} \end{lemma} \begin{proof} - By induction on $rs$. + By induction on $rs$ and using \ref{rdistinctDoesTheJob}. \end{proof} \noindent On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, @@ -964,9 +1003,10 @@ By induction on the lists involved. \end{proof} \noindent -Similarly, -we introduce the equality for $\sum$ when certain child regular expressions -are $\sum$ themselves: +The above allows us to prove +two similar equalities (which are a bit more involved). +It says that we could flatten out the elements +before simplification and still get the same result. \begin{lemma}\label{simpFlatten3} One can flatten the inside $\sum$ of a $\sum$ if it is being simplified. Concretely, @@ -1279,7 +1319,11 @@ Before we go on to obtain them, some preliminary definitions are needed to make proof statements concise. -\section{"Closed Forms" of Sequence Regular Expressions} + + + +\subsection{Closed Forms} +\subsubsection{Closed Form for Sequence Regular Expressions} The problem of obataining a closed-form for sequence regular expression is constructing $(r_1 \cdot r_2) \backslash_r s$ if we are only allowed to use a combination of $r_1 \backslash s''$ @@ -1466,7 +1510,7 @@ \end{center} \end{corollary} \noindent -\subsection{Closed Forms for Star Regular Expressions} +\subsubsection{Closed Forms for Star Regular Expressions} We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then the property of the $\distinct$ function. @@ -1683,7 +1727,13 @@ The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} are used. \end{proof} -\section{Estimating the Closed Forms' sizes} + + + + + + +\subsection{Estimating the Closed Forms' sizes} We now summarize the closed forms below: \begin{itemize} \item @@ -1758,15 +1808,94 @@ is bounded by $N$. We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. -We show how $\rdistinct$ and $\rflts$ -in the simplification function together is at least as -good as $\rdistinct{}{}$ alone. -\begin{lemma}\label{interactionFltsDB} +We show that $\rdistinct$ and $\rflts$ +working together is at least as +good as $\rdistinct{}{}$ alone, which can be written as +\begin{center} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{center} +We need this so that we know the outcome of our real +simplification is better than or equal to a rough estimate, +and therefore can be bounded by that estimate. +This is a bit harder to establish compared with proving +$\textit{flts}$ does not make a list larger (which can +be proven using routine induction): +\begin{center} + $\llbracket \textit{rflts}\; rs \rrbracket_r \leq + \llbracket \textit{rs} \rrbracket_r$ +\end{center} +We cannot simply prove how each helper function +reduces the size and then put them together: +From +\begin{center} +$\llbracket \textit{rflts}\; rs \rrbracket_r \leq + \llbracket \; \textit{rs} \rrbracket_r$ +\end{center} +and +\begin{center} + $\llbracket \textit{rdistinct} \; rs \; \varnothing \leq + \llbracket rs \rrbracket_r$ +\end{center} +one cannot imply +\begin{center} $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{center} +What we can imply is that +\begin{center} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket rs \rrbracket_r$ +\end{center} +but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. +The way we +get through this is by first proving a more general lemma +(so that the inductive case goes through): +\begin{lemma}\label{fltsSizeReductionAlts} + If we have three accumulator sets: + $noalts\_set$, $alts\_set$ and $corr\_set$, + satisfying: + \begin{itemize} + \item + $\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$ + \item + $\forall r \in alts\_set. \; \exists xs. \; r = \sum xs + \; \textit{and} \; set \; xs \subseteq corr\_set$ + \end{itemize} + then we have that + \begin{center} + \begin{tabular}{lcl} + $\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; + (noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\ + $\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup + \{ \ZERO \} )) \rrbracket_r$ & & \\ + \end{tabular} + \end{center} + holds. \end{lemma} \noindent +We need to split the accumulator into two parts: the part +which contains alternative regular expressions ($alts\_set$), and +the part without any of them($noalts\_set$). +The set $corr\_set$ is the corresponding set +of $alts\_set$ with all elements under the $\sum$ constructor +spilled out. +\begin{proof} + By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. +\end{proof} +By setting all three sets to the empty set, one gets the desired size estimate: +\begin{corollary}\label{interactionFltsDB} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. +\end{corollary} +\begin{proof} + By using the lemma \ref{fltsSizeReductionAlts}. +\end{proof} +\noindent The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. @@ -1778,13 +1907,10 @@ By using \ref{interactionFltsDB}. \end{proof} \noindent -which says that the size of regular expression -is always smaller if we apply the full simplification -rather than just one component ($\rdistinct{}{}$). - - -Now we are ready to control the sizes of -$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. +This is a key lemma in establishing the bounds on all the +closed forms. +With this we are now ready to control the sizes of +$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$. \begin{theorem} For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ \end{theorem} @@ -1893,6 +2019,10 @@ \end{proof} \noindent + + + + %----------------------------------- % SECTION 2 %----------------------------------- @@ -2062,53 +2192,54 @@ %---------------------------------------------------------------------------------------- % SECTION ALTS CLOSED FORM %---------------------------------------------------------------------------------------- -\section{A Closed Form for \textit{ALTS}} -Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. - - -There are a few key steps, one of these steps is - - - -One might want to prove this by something a simple statement like: - -For this to hold we want the $\textit{distinct}$ function to pick up -the elements before and after derivatives correctly: -$r \in rset \equiv (rder x r) \in (rder x rset)$. -which essentially requires that the function $\backslash$ is an injective mapping. - -Unfortunately the function $\backslash c$ is not an injective mapping. +%\section{A Closed Form for \textit{ALTS}} +%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. +% +% +%There are a few key steps, one of these steps is +% +% +% +%One might want to prove this by something a simple statement like: +% +%For this to hold we want the $\textit{distinct}$ function to pick up +%the elements before and after derivatives correctly: +%$r \in rset \equiv (rder x r) \in (rder x rset)$. +%which essentially requires that the function $\backslash$ is an injective mapping. +% +%Unfortunately the function $\backslash c$ is not an injective mapping. +% +%\subsection{function $\backslash c$ is not injective (1-to-1)} +%\begin{center} +% The derivative $w.r.t$ character $c$ is not one-to-one. +% Formally, +% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ +%\end{center} +%This property is trivially true for the +%character regex example: +%\begin{center} +% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ +%\end{center} +%But apart from the cases where the derivative +%output is $\ZERO$, are there non-trivial results +%of derivatives which contain strings? +%The answer is yes. +%For example, +%\begin{center} +% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ +% where $a$ is not nullable.\\ +% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ +% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ +%\end{center} +%We start with two syntactically different regular expressions, +%and end up with the same derivative result. +%This is not surprising as we have such +%equality as below in the style of Arden's lemma:\\ +%\begin{center} +% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ +%\end{center} -\subsection{function $\backslash c$ is not injective (1-to-1)} -\begin{center} - The derivative $w.r.t$ character $c$ is not one-to-one. - Formally, - $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -\end{center} -This property is trivially true for the -character regex example: -\begin{center} - $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -\end{center} -But apart from the cases where the derivative -output is $\ZERO$, are there non-trivial results -of derivatives which contain strings? -The answer is yes. -For example, -\begin{center} - Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ - where $a$ is not nullable.\\ - $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ - $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -\end{center} -We start with two syntactically different regular expressions, -and end up with the same derivative result. -This is not surprising as we have such -equality as below in the style of Arden's lemma:\\ -\begin{center} - $L(A^*B) = L(A\cdot A^* \cdot B + B)$ -\end{center} - +\section{Further Improvements to the Bound} There are two problems with this finiteness result, though. \begin{itemize} \item diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Oct 01 12:06:46 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Mon Oct 03 02:08:49 2022 +0100 @@ -819,15 +819,16 @@ Such a language is contained in the context-sensitive hierarchy of formal languages. Solving the back-reference expressions matching problem -is NP-complete\parencite{alfred2014algorithms}. +is known to be NP-complete \parencite{alfred2014algorithms}. A non-bactracking, efficient solution is not known to exist. Regex libraries supporting back-references such as PCRE \cite{pcre} therefore have to -revert to a depth-first search algorithm that backtracks. -The unreasonable part with them, is that even in the case of -regexes not involving back-references, there is still -a (non-negligible) chance they might backtrack super-linearly. +revert to a depth-first search algorithm which backtracks. +What is unexpected is that even in the cases +not involving back-references, there is still +a (non-negligible) chance they might backtrack super-linearly, +as shown in the graphs in \ref{fig:aStarStarb}. \subsection{Summary of the Catastrophic Backtracking Problem} Summing these up, we can categorise existing @@ -944,84 +945,172 @@ is trying to capture something very precise, and is crying out for formalising. Ausaf et al. \cite{AusafDyckhoffUrban2016} -are the first to describe such a formalised POSIX -specification in Isabelle/HOL. +are the first to fill the gap +by not just describing such a formalised POSIX +specification in Isabelle/HOL, but also proving +that their specification coincides with the +POSIX specification given by Okui and Suzuki \cite{Okui10} +which is a completely +different characterisation. They then formally proved the correctness of a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} based on that specification. -In this thesis, -we propose a solution to catastrophic -backtracking and error-prone matchers--a formally verified -regular expression lexing algorithm -that is both fast -and correct by extending Ausaf et al.'s work. -The end result is a regular expression lexing algorithm that comes with -\begin{itemize} -\item -a proven correctness theorem according to POSIX specification -given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, -\item -a proven complexity-related property saying that the -algorithm's internal data structure will -remain finite, -\item -and extension to -the bounded repetitions construct with the correctness and finiteness property -maintained. - \end{itemize} In the next section we will very briefly -introduce Brzozowski derivatives. -We will give a taste to what they +introduce Brzozowski derivatives and Sulzmann +and Lu's algorithm, which this thesis builds on. +We give a taste of what they are like and why they are suitable for regular expression matching and lexing. -\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} -We propose Brzozowski derivatives on regular expressions as - a solution to this. -In the last fifteen or so years, Brzozowski's derivatives of regular -expressions have sparked quite a bit of interest in the functional -programming and theorem prover communities. +\section{Our Solution--Formal Specification of POSIX Matching +and Brzozowski Derivatives} +Now we start with the central topic of the thesis: Brzozowski derivatives. +Brzozowski \cite{Brzozowski1964} first introduced the +concept of the \emph{derivative} in the 1960s. +The derivative of a regular expression $r$ +with respect to a character $c$, is written as $r \backslash c$.\footnote{ + Despite having the same name, regular expression + derivatives bear little similarity with the mathematical definition + of derivatives on functions. +} +It tells us what $r$ would transform into +if we chop off the first character $c$ +from all strings in the language of $r$ ($L \; r$). +To give a flavour of Brzozowski derivatives, we present +two straightforward clauses from it: +\begin{center} + \begin{tabular}{lcl} + $d \backslash c$ & $\dn$ & + $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ +$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ + \end{tabular} +\end{center} +\noindent +The first clause says that for the regular expression +denoting a singleton set consisting of a sinlge-character string $\{ d \}$, +we check the derivative character $c$ against $d$, +returning a set containing only the empty string $\{ [] \}$ +if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. +The second clause states that to obtain the regular expression +representing all strings' head character $c$ being chopped off +from $r_1 + r_2$, one simply needs to recursively take derivative +of $r_1$ and $r_2$ and then put them together. -\subsection{Motivation} - +Thanks to the definition, derivatives have the nice property +that $s \in L \; (r\backslash c)$ if and only if +$c::s \in L \; r$. +%This property can be used on regular expressions +%matching and lexing--to test whether a string $s$ is in $L \; r$, +%one simply takes derivatives of $r$ successively with +%respect to the characters (in the correct order) in $s$, +%and then test whether the empty string is in the last regular expression. Derivatives give a simple solution -to the problem of matching a string $s$ with a regular +to the problem of matching and lexing a string $s$ with a regular expression $r$: if the derivative of $r$ w.r.t.\ (in succession) all the characters of the string matches the empty string, then $r$ matches $s$ (and {\em vice versa}). -The beauty of -Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly -expressible in any functional language, and easily definable and -reasoned about in theorem provers---the definitions just consist of -inductive datatypes and simple recursive functions. -And an algorithms based on it by -Suzmann and Lu \parencite{Sulzmann2014} allows easy extension -to include extended regular expressions and - simplification of internal data structures - eliminating the exponential behaviours. +This makes formally reasoning about these properties such +as correctness and complexity smooth and intuitive. +In fact, there has already been several mechanised proofs about them, +for example the one by Owens and Slind \cite{Owens2008} in HOL4, +another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and +yet another in Coq by Coquand and Siles \cite{Coquand2012}. + +In addition, one can extend the clauses to bounded repetitions +``for free'': +\begin{center} + \begin{tabular}{lcl} + $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot + r^{\{n-1\}}$\\ + \end{tabular} +\end{center} +\noindent +And experimental results suggest that unlike DFA-based solutions, +this derivatives can support +bounded regular expressions with large counters +quite well. -However, two difficulties with derivative-based matchers exist: -\subsubsection{Problems with Current Brzozowski Matchers} +There has also been +extensions to other constructs. +For example, Owens et al include the derivatives +for \emph{NOT} regular expressions, which is +able to concisely express C-style comments of the form +$/* \ldots */$. +Another extension for derivatives would be +regular expressions with look-aheads, done by +by Miyazaki and Minamide +\cite{Takayuki2019}. +%We therefore use Brzozowski derivatives on regular expressions +%lexing + + + +Given the above definitions and properties of +Brzozowski derivatives, one quickly realises their potential +in generating a formally verified algorithm for lexing--the clauses and property +can be easily expressed in a functional programming language +or converted to theorem prover +code, with great extensibility. +Perhaps this is the reason why it has sparked quite a bit of interest +in the functional programming and theorem prover communities in the last +fifteen or so years ( +\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, +\cite{Chen12} and \cite{Coquand2012} +to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} +after they were first published. + + +However, there are two difficulties with derivative-based matchers: First, Brzozowski's original matcher only generates a yes/no answer for whether a regular expression matches a string or not. This is too little information in the context of lexing where separate tokens must be identified and also classified (for example as keywords -or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this +or identifiers). +Second, derivative-based matchers need to be more efficient. +Elegant and beautiful +as many implementations are, +they can be excruciatingly slow. +For example, Sulzmann and Lu +claim a linear running time of their proposed algorithm, +but that was falsified by our experiments. The running time +is actually $\Omega(2^n)$ in the worst case. +A similar claim about a theoretical runtime of $O(n^2)$ +is made for the Verbatim \cite{Verbatim} +%TODO: give references +lexer, which calculates POSIX matches and is based on derivatives. +They formalized the correctness of the lexer, but not the complexity. +In the performance evaluation section, they simply analyzed the run time +of matching $a$ with the string +\begin{center} + $\underbrace{a \ldots a}_{\text{n a's}}$ +\end{center} +and concluded that the algorithm is quadratic in terms of input length. +When we tried out their extracted OCaml code with our example $(a+aa)^*$, +the time it took to lex only 40 $a$'s was 5 minutes. + + +\subsection{Sulzmann and Lu's Algorithm} +Sulzmann and Lu~\cite{Sulzmann2014} overcame the first difficulty by cleverly extending Brzozowski's matching algorithm. Their extended version generates additional information on \emph{how} a regular expression matches a string following the POSIX rules for regular expression matching. They achieve this by adding a second ``phase'' to Brzozowski's algorithm involving an injection -function. In our own earlier work, we provided the formal +function simplification of internal data structures +eliminating the exponential behaviours. +In an earlier work, Ausaf et al provided the formal specification of what POSIX matching means and proved in Isabelle/HOL the correctness of Sulzmann and Lu's extended algorithm accordingly \cite{AusafDyckhoffUrban2016}. -The second difficulty is that Brzozowski's derivatives can -grow to arbitrarily big sizes. For example if we start with the +The version of the algorithm proven correct +suffers from the +second difficulty though, where the internal derivatives can +grow to arbitrarily big sizes. +For example if we start with the regular expression $(a+aa)^*$ and take successive derivatives according to the character $a$, we end up with a sequence of ever-growing derivatives like @@ -1038,7 +1127,8 @@ \end{center} \noindent where after around 35 steps we run out of memory on a -typical computer (we shall define shortly the precise details of our +typical computer (we shall define in the next chapter +the precise details of our regular expressions and the derivative operation). Clearly, the notation involving $\ZERO$s and $\ONE$s already suggests simplification rules that can be applied to regular regular @@ -1051,7 +1141,6 @@ above: the growth is slowed, but the derivatives can still grow rather quickly beyond any finite bound. - Sulzmann and Lu overcome this ``growth problem'' in a second algorithm \cite{Sulzmann2014} where they introduce bit-coded regular expressions. In this version, POSIX values are @@ -1075,94 +1164,76 @@ extensively [..] but yet have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} \end{quote} - Ausaf and Urban were able to back this correctness claim with a formal proof. -But as they stated, +However a faster formally verified +lexing program with the optimisations +mentioned by Sulzmann and Lu's second algorithm +is still missing. +As they stated, \begin{quote}\it -The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here. +``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.'' \end{quote} - This thesis implements the aggressive simplifications envisioned by Ausaf and Urban, -and gives a formal proof of the correctness with those simplifications. +together with a formal proof of the correctness with those simplifications. %---------------------------------------------------------------------------------------- \section{Contribution} - - -This work addresses the vulnerability of super-linear and -buggy regex implementations by the combination -of Brzozowski's derivatives and interactive theorem proving. -We give an -improved version of Sulzmann and Lu's bit-coded algorithm using -derivatives, which come with a formal guarantee in terms of correctness and -running time as an Isabelle/HOL proof. -Further improvements to the algorithm with an even stronger version of -simplification is made. -We have not yet come up with one, but believe that it leads to a -formalised proof with a time bound linear to input and -cubic to regular expression size using a technique by -Antimirov\cite{Antimirov}. - - -The main contribution of this thesis is +In this thesis, +we propose a solution to catastrophic +backtracking and error-prone matchers: a formally verified +regular expression lexing algorithm +that is both fast +and correct by extending Ausaf et al.'s work. +The end result is %a regular expression lexing algorithm that comes with \begin{itemize} \item -a proven correct lexing algorithm +an improved version of Sulzmann and Lu's bit-coded algorithm using +derivatives with simplifications, +accompanied by +a proven correctness theorem according to POSIX specification +given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, +\item +a complexity-related property for that algorithm saying that the +internal data structure will +remain finite, \item -with formalized finite bounds on internal data structures' sizes. -\end{itemize} - -To our best knowledge, no lexing libraries using Brzozowski derivatives -have a provable time guarantee, -and claims about running time are usually speculative and backed by thin empirical -evidence. -%TODO: give references -For example, Sulzmann and Lu had proposed an algorithm in which they -claim a linear running time. -But that was falsified by our experiments and the running time -is actually $\Omega(2^n)$ in the worst case. -A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim -%TODO: give references -lexer, which calculates POSIX matches and is based on derivatives. -They formalized the correctness of the lexer, but not the complexity. -In the performance evaluation section, they simply analyzed the run time -of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$ -and concluded that the algorithm is quadratic in terms of input length. -When we tried out their extracted OCaml code with our example $(a+aa)^*$, -the time it took to lex only 40 $a$'s was 5 minutes. - +and extension to +the bounded repetitions construct with the correctness and finiteness property +maintained. + \end{itemize} -\subsection{Related Work} -We are aware -of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by -Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part -of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one -in Coq is given by Coquand and Siles \parencite{Coquand2012}. -Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. - +With a formal finiteness bound in place, +we can greatly reduce the attack surface of servers in terms of ReDoS attacks. +Further improvements to the algorithm with an even stronger version of +simplification is made. +Thanks to our theorem-prover-friendly approach, +we believe that +this finiteness bound can be improved to a bound +linear to input and +cubic to the regular expression size using a technique by +Antimirov\cite{Antimirov95}. +Once formalised, this would be a guarantee for the absence of all super-linear behavious. +We are working out the +details. + - When a regular expression does not behave as intended, -people usually try to rewrite the regex to some equivalent form -or they try to avoid the possibly problematic patterns completely, -for which many false positives exist\parencite{Davis18}. -Animated tools to "debug" regular expressions such as - \parencite{regexploit2021} \parencite{regex101} are also popular. -We are also aware of static analysis work on regular expressions that -aims to detect potentially expoential regex patterns. Rathnayake and Thielecke -\parencite{Rathnayake2014StaticAF} proposed an algorithm -that detects regular expressions triggering exponential -behavious on backtracking matchers. -Weideman \parencite{Weideman2017Static} came up with -non-linear polynomial worst-time estimates -for regexes, attack string that exploit the worst-time -scenario, and "attack automata" that generates -attack strings. +To our best knowledge, no lexing libraries using Brzozowski derivatives +have similar complexity-related bounds, +and claims about running time are usually speculative and backed by empirical +evidence on a few test cases. +If a matching or lexing algorithm +does not come with certain basic complexity related +guarantees (for examaple the internal data structure size +does not grow indefinitely), +then they cannot claim with confidence having solved the problem +of catastrophic backtracking. + diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/Chapters/RelatedWork.tex --- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Sat Oct 01 12:06:46 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Mon Oct 03 02:08:49 2022 +0100 @@ -23,3 +23,29 @@ flavours (e.g. whether references can be circular, can labels be repeated etc.) of back-references syntax. + +\subsection{Matchers and Lexers with Mechanised Proofs} +We are aware +of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by +Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part +of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one +in Coq is given by Coquand and Siles \parencite{Coquand2012}. +Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. + +\subsection{Static Analysis of Evil Regex Patterns} + When a regular expression does not behave as intended, +people usually try to rewrite the regex to some equivalent form +or they try to avoid the possibly problematic patterns completely, +for which many false positives exist\parencite{Davis18}. +Animated tools to "debug" regular expressions such as + \parencite{regexploit2021} \parencite{regex101} are also popular. +We are also aware of static analysis work on regular expressions that +aims to detect potentially expoential regex patterns. Rathnayake and Thielecke +\parencite{Rathnayake2014StaticAF} proposed an algorithm +that detects regular expressions triggering exponential +behavious on backtracking matchers. +Weideman \parencite{Weideman2017Static} came up with +non-linear polynomial worst-time estimates +for regexes, attack string that exploit the worst-time +scenario, and "attack automata" that generates +attack strings. diff -r 37b6fd310a16 -r 61139fdddae0 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Sat Oct 01 12:06:46 2022 +0100 +++ b/ChengsongTanPhdThesis/example.bib Mon Oct 03 02:08:49 2022 +0100 @@ -8,6 +8,83 @@ +%% POSIX specification------------------------ +@InProceedings{Okui10, +author="Okui, Satoshi +and Suzuki, Taro", +editor="Domaratzki, Michael +and Salomaa, Kai", +title="Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions", +booktitle="Implementation and Application of Automata", +year="2011", +publisher="Springer Berlin Heidelberg", +address="Berlin, Heidelberg", +pages="231--240", +abstract="This paper offers a new efficient regular expression matching algorithm which follows the POSIX-type leftmost-longest rule. The algorithm basically emulates the subset construction without backtracking, so that its computational cost even in the worst case does not explode exponentially; the time complexity of the algorithm is O(mn(n{\thinspace}+{\thinspace}c)), where m is the length of a given input string, n the number of occurrences of the most frequently used letter in a given regular expression and c the number of subexpressions to be used for capturing substrings. A formalization of the leftmost-longest semantics by using parse trees is also discussed.", +isbn="978-3-642-18098-9" +} + +%% POSIX specification------------------------ + +%% Brzozowski ders------------------------ +@article{Berglund14, +author = {Berglund, Martin and Drewes, Frank and Van Der Merwe, Brink}, +year = {2014}, +month = {05}, +pages = {}, +title = {Analyzing Catastrophic Backtracking Behavior in Practical Regular Expression Matching}, +volume = {151}, +journal = {Electronic Proceedings in Theoretical Computer Science}, +doi = {10.4204/EPTCS.151.7} +} + +@InProceedings{Berglund18, +author="Berglund, Martin +and Bester, Willem +and van der Merwe, Brink", +editor="Fischer, Bernd +and Uustalu, Tarmo", +title="Formalising Boost POSIX Regular Expression Matching", +booktitle="Theoretical Aspects of Computing -- ICTAC 2018", +year="2018", +publisher="Springer International Publishing", +address="Cham", +pages="99--115", +abstract="Whereas Perl-compatible regular expression matchers typically exhibit some variation of leftmost-greedy semantics, those conforming to the posix standard are prescribed leftmost-longest semantics. However, the posix standard leaves some room for interpretation, and Fowler and Kuklewicz have done experimental work to confirm differences between various posix matchers. The Boost library has an interesting take on the posix standard, where it maximises the leftmost match not with respect to subexpressions of the regular expression pattern, but rather, with respect to capturing groups. In our work, we provide the first formalisation of Boost semantics, and we analyse the complexity of regular expression matching when using Boost semantics.", +isbn="978-3-030-02508-3" +} + + +@inproceedings{Chen12, +author = {Chen, Haiming and Yu, Sheng}, +year = {2012}, +month = {01}, +pages = {343-356}, +title = {Derivatives of Regular Expressions and an Application}, +volume = {7160}, +doi = {10.1007/978-3-642-27654-5_27} +} + + + +%% Brzozowski ders------------------------ + + +%% look-aheads------------------------ +@article{Takayuki2019, + title={Derivatives of Regular Expressions with Lookahead}, + author={Takayuki Miyazaki and Yasuhiko Minamide}, + journal={Journal of Information Processing}, + volume={27}, + number={ }, + pages={422-430}, + year={2019}, + doi={10.2197/ipsjjip.27.422} +} + +%% look-aheads------------------------ + + %% back-references-------------------- @incollection{AHO1990255, @@ -537,6 +614,8 @@ volume = {7086}, year = {2011}} + + @inproceedings{Almeidaetal10, author = {J.~B.~Almeida and N.~Moriera and D.~Pereira and S.~M.~de Sousa}, booktitle = {Proc.~of the 15th International Conference on Implementation and Application of Automata (CIAA)}, @@ -575,20 +654,22 @@ url = {http://www.pcre.org}, bdsk-url-1 = {http://www.pcre.org}} -@inproceedings{OkuiSuzuki2010, - author = {S.~Okui and T.~Suzuki}, - booktitle = {Proc.~of the 15th International Conference on Implementation and Application of Automata (CIAA)}, - pages = {231--240}, - series = {LNCS}, - title = {{D}isambiguation in {R}egular {E}xpression {M}atching via {P}osition {A}utomata with {A}ugmented {T}ransitions}, - volume = {6482}, - year = {2010}} +%@inproceedings{OkuiSuzuki2010, +% author = {S.~Okui and T.~Suzuki}, +% booktitle = {Proc.~of the 15th International Conference on Implementation and Application of Automata (CIAA)}, +% pages = {231--240}, +% series = {LNCS}, +% title = {{D}isambiguation in {R}egular {E}xpression {M}atching via {P}osition {A}utomata with {A}ugmented {T}ransitions}, +% volume = {6482}, +% year = {2010}} +% -@techreport{OkuiSuzukiTech, - author = {S.~Okui and T.~Suzuki}, - institution = {University of Aizu}, - title = {{D}isambiguation in {R}egular {E}xpression {M}atching via {P}osition {A}utomata with {A}ugmented {T}ransitions}, - year = {2013}} + +%@techreport{OkuiSuzukiTech, +% author = {S.~Okui and T.~Suzuki}, +% institution = {University of Aizu}, +% title = {{D}isambiguation in {R}egular {E}xpression {M}atching via {P}osition {A}utomata with {A}ugmented {T}ransitions}, +% year = {2013}} @inproceedings{Davis18, author = {J.~C.~Davis and C.~.A.~Coghlan and F.~Servant and D.~Lee},