# HG changeset patch # User Chengsong # Date 1667775970 0 # Node ID 2072a8d54e3e5c09008dcdc147e20fb17f3029ac # Parent 233cf2b97d1ab10d6f8a9db28a994f6962f5b328# Parent 3ea6a38c33b516b47f000d3068c58aedc1e20649 chap 5 finished diff -r 3ea6a38c33b5 -r 2072a8d54e3e ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Oct 13 23:55:25 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Nov 06 23:06:10 2022 +0000 @@ -7,53 +7,68 @@ %of our bitcoded algorithm, that is a finite bound on the size of any %regex's derivatives. -\begin{figure} -\begin{center} - \begin{tabular}{ccc} - $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ - $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ - $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ - $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ - $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ - $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. - \end{tabular} -\end{center} -\caption{The size function of bitcoded regular expressions}\label{brexpSize} -\end{figure} -In this chapter we give a guarantee in terms of size: + +In this chapter we give a guarantee in terms of size of derivatives: given an annotated regular expression $a$, for any string $s$ our algorithm $\blexersimp$'s internal annotated regular expression size is finitely bounded -by a constant $N_a$ that only depends on $a$: +by a constant that only depends on $a$. +Formally we show that there exists an $N_a$ such that \begin{center} $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ \end{center} \noindent -where the size of an annotated regular expression is defined -in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}). +where the size ($\llbracket \_ \rrbracket$) of +an annotated regular expression is defined +in terms of the number of nodes in its +tree structure (its recursive definition given in the next page). We believe this size bound is important in the context of POSIX lexing, because \begin{itemize} \item - It is a stepping stone towards an ``absence of catastrophic-backtracking'' - guarantee. - If the internal data structures used by our algorithm cannot grow very large, - then our algorithm (which traverses those structures) cannot be too slow. + It is a stepping stone towards the goal + of eliminating ``catastrophic backtracking''. + If the internal data structures used by our algorithm + grows beyond a finite bound, then clearly + the algorithm (which traverses these structures) will + be slow. The next step would be to refine the bound $N_a$ so that it is polynomial on $\llbracket a\rrbracket$. \item - Having it formalised gives us a higher confidence that + Having the finite bound formalised + gives us a higher confidence that our simplification algorithm $\simp$ does not ``mis-behave'' like $\simpsulz$ does. - The bound is universal, which is an advantage over work which - only gives empirical evidence on some test cases. + The bound is universal for a given regular expression, + which is an advantage over work which + only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}). \end{itemize} +In the next section we describe in more detail +what the finite bound means in our algorithm +and why the size of the internal data structures of +a typical derivative-based lexer such as +Sulzmann and Lu's need formal treatment. + \section{Formalising About Size} \noindent In our lexer ($\blexersimp$), we take an annotated regular expression as input, -and repeately take derivative of and simplify it: -\begin{figure}[H] +and repeately take derivative of and simplify it. +\begin{figure} + \begin{center} + \begin{tabular}{lcl} + $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ + $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ + $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ + $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. + \end{tabular} + \end{center} + \caption{The size function of bitcoded regular expressions}\label{brexpSize} +\end{figure} + +\begin{figure} \begin{tikzpicture}[scale=2, every node/.style={minimum size=11mm}, ->,>=stealth',shorten >=1pt,auto,thick @@ -79,15 +94,16 @@ \end{tikzpicture} \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks} \end{figure} + \noindent Each time a derivative is taken, the regular expression might grow. However, the simplification that is immediately afterwards will always shrink it so that -it stays small. +it stays relatively small. This intuition is depicted by the relative size change between the black and blue nodes: After $\simp$ the node always shrinks. -Our proof says that all the blue nodes +Our proof states that all the blue nodes stay below a size bound $N_a$ determined by the input $a$. \noindent @@ -126,7 +142,7 @@ \noindent The picture means that on certain cases their lexer (where they use $\simpsulz$ as the simplification function) -will have an indefinite size explosion, causing the running time +will have a size explosion, causing the running time of each derivative step to grow continuously (for example in \ref{SulzmannLuLexerTime}). They tested out the run time of their @@ -137,13 +153,13 @@ Before delving in to the details of the formalisation, we are going to provide an overview of it. -In the next subsection, we draw a picture of the bird's eye view +In the next subsection, we give a high-level view of the proof. \subsection{Overview of the Proof} -Here is a bird's eye view of the main components of the finiteness proof, -which involves three steps: +A high-level overview of the main components of the finiteness proof +is as follows: \begin{figure}[H] \begin{tikzpicture}[scale=1,font=\bf, node/.style={ @@ -179,21 +195,23 @@ \item We first introduce the operations such as derivatives, simplification, size calculation, etc. - associated with $\rrexp$s, which we have given - a very brief introduction to in chapter \ref{Bitcoded2}. + associated with $\rrexp$s, which we have introduced + in chapter \ref{Bitcoded2}. The operations on $\rrexp$s are identical to those on - annotated regular expressions except that they are unaware - of bitcodes. This means that all proofs about size of $\rrexp$s will apply to - annotated regular expressions. + annotated regular expressions except that they dispense with + bitcodes. This means that all proofs about size of $\rrexp$s will apply to + annotated regular expressions, because the size of a regular + expression is independent of the bitcodes. \item We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, where $\textit{ClosedForm}(r, s)$ is entirely - written in the derivatives of their children regular + given as the derivatives of their children regular expressions. We call the right-hand-side the \emph{Closed Form} of the derivative $\rderssimp{r}{s}$. \item - We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. + Formally we give an estimate of + $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. The key observation is that $\distinctBy$'s output is a list with a constant length bound. \end{itemize} @@ -202,14 +220,14 @@ \section{The $\textit{Rrexp}$ Datatype} The first step is to define $\textit{rrexp}$s. -They are without bitcodes, +They are annotated regular expressions without bitcodes, allowing a much simpler size bound proof. -Of course, the bits which encode the lexing information -would grow linearly with respect -to the input, which should be taken into account when we wish to tackle the runtime comlexity. -But for the sake of the structural size -we can safely ignore them.\\ -To recapitulate, the datatype +%Of course, the bits which encode the lexing information +%would grow linearly with respect +%to the input, which should be taken into account when we wish to tackle the runtime comlexity. +%But for the sake of the structural size +%we can safely ignore them.\\ +The datatype definition of the $\rrexp$, called \emph{r-regular expressions}, was initially defined in \ref{rrexpDef}. @@ -226,7 +244,7 @@ written $\llbracket r\rrbracket_r$, whose definition mirrors that of an annotated regular expression. \begin{center} - \begin{tabular}{ccc} + \begin{tabular}{lcl} $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ @@ -254,27 +272,31 @@ \end{center} \subsection{Why a New Datatype?} -The reason we take all the trouble -defining a new datatype is that $\erase$ makes things harder. +The reason we +define a new datatype is that +the $\erase$ function +does not preserve the structure of annotated +regular expressions. We initially started by using plain regular expressions and tried to prove -the lemma \ref{rsizeAsize}, -however the $\erase$ function unavoidbly messes with the structure of the +lemma \ref{rsizeAsize}, +however the $\erase$ function messes with the structure of the annotated regular expression. The $+$ constructor -of basic regular expressions is binary whereas $\sum$ -takes a list, and one has to convert between them: +of basic regular expressions is only binary, whereas $\sum$ +takes a list. Therefore we need to convert between +annotated and normal regular expressions as follows: \begin{center} - \begin{tabular}{ccc} + \begin{tabular}{lcl} $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ \end{tabular} \end{center} \noindent -An alternative regular expression with an empty argument list +As can be seen alternative regular expression with an empty argument list will be turned into a $\ZERO$. -The singleton alternative $\sum [r]$ would have $r$ during the +The singleton alternative $\sum [r]$ becomes $r$ during the $\erase$ function. The annotated regular expression $\sum[a, b, c]$ would turn into $(a+(b+c))$. @@ -284,7 +306,7 @@ For example, if we define the size of a basic plain regular expression in the usual way, \begin{center} - \begin{tabular}{ccc} + \begin{tabular}{lcl} $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ @@ -303,7 +325,7 @@ only the bitcodes are thrown away. Everything about the structure remains intact. Therefore it does not change the size -of an annotated regular expression: +of an annotated regular expression and we have: \begin{lemma}\label{rsizeAsize} $\rsize{\rerase a} = \asize a$ \end{lemma} @@ -317,17 +339,17 @@ but we found our approach more straightforward.\\ \subsection{Functions for R-regular Expressions} -We shall define the r-regular expression version -of $\blexer$ and $\blexersimp$ related functions. +In this section we shall define the r-regular expression version +of $\blexer$, and $\blexersimp$ related functions. We use $r$ as the prefix or subscript to differentiate with the bitcoded version. -For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ -as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. -As promised, they are much simpler than their bitcoded counterparts. +%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ +%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. +%As promised, they are much simpler than their bitcoded counterparts. %The operations on r-regular expressions are %almost identical to those of the annotated regular expressions, %except that no bitcodes are used. For example, -The derivative operation becomes simpler:\\ +The derivative operation for an r-regular expression is\\ \begin{center} \begin{tabular}{@{}lcl@{}} $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ @@ -347,10 +369,12 @@ \end{tabular} \end{center} \noindent -Similarly, $\distinctBy$ does not need +where we omit the definition of $\textit{rnullable}$. + +The function $\distinctBy$ for r-regular expressions does not need a function checking equivalence because -there are no bit annotations causing superficial differences -between syntactically equal terms. +there are no bit annotations. +Therefore we have \begin{center} \begin{tabular}{lcl} $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ @@ -362,14 +386,14 @@ \end{center} %TODO: definition of rsimp (maybe only the alternative clause) \noindent -We would like to make clear -a difference between our $\rdistincts$ and -the Isabelle $\textit {distinct}$ predicate. -In Isabelle $\textit{distinct}$ is a function that returns a boolean -rather than a list. -It tests if all the elements of a list are unique.\\ -With $\textit{rdistinct}$, -and the flatten function for $\rrexp$s: +%We would like to make clear +%a difference between our $\rdistincts$ and +%the Isabelle $\textit {distinct}$ predicate. +%In Isabelle $\textit{distinct}$ is a function that returns a boolean +%rather than a list. +%It tests if all the elements of a list are unique.\\ +With $\textit{rdistinct}$ in place, +the flatten function for $\rrexp$ is as follows: \begin{center} \begin{tabular}{@{}lcl@{}} $\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\ @@ -378,8 +402,8 @@ \end{tabular} \end{center} \noindent -one can chain together all the other modules -such as $\rsimpalts$: +The function +$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$: \begin{center} \begin{tabular}{@{}lcl@{}} $\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\ @@ -388,7 +412,7 @@ \end{tabular} \end{center} \noindent -and $\rsimpseq$: +Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$: \begin{center} \begin{tabular}{@{}lcl@{}} $\rsimpseq \;\; \RZERO \; \_ $ & $=$ & $\RZERO$\\ @@ -420,8 +444,7 @@ \end{center} \noindent We do not define an r-regular expression version of $\blexersimp$, -as our proof does not involve its use -(and there is no bitcode to decode into a lexical value). +as our proof does not depend on it. Now we are ready to introduce how r-regular expressions allow us to prove the size bound on bitcoded regular expressions. @@ -431,7 +454,7 @@ can be calculated via the size of r-regular expressions after the application of $\rsimp$ and $\backslash_{rsimps}$: \begin{lemma}\label{sizeRelations} - The following equalities hold: + The following two equalities hold: \begin{itemize} \item $\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$ @@ -456,13 +479,13 @@ implies $\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$. \end{center} -From now on we -Unless stated otherwise in the rest of this -chapter all regular expressions without -bitcodes are seen as r-regular expressions ($\rrexp$s). -For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$, -we use the notation $r_1 + r_2$ -for brevity. +%From now on we +%Unless stated otherwise in the rest of this +%chapter all regular expressions without +%bitcodes are seen as r-regular expressions ($\rrexp$s). +%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$, +%we use the notation $r_1 + r_2$ +%for brevity. %----------------------------------- @@ -521,15 +544,16 @@ $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, and therefore $N_{r_1}$ and $N_{r_2}$ in the inductive hypotheses cannot be directly used. -We have already seen that $(r_1 \cdot r_2)\backslash s$ -and $(r^*)\backslash s$ can grow in a wild way. +%We have already seen that $(r_1 \cdot r_2)\backslash s$ +%and $(r^*)\backslash s$ can grow in a wild way. -The point is that they will be equivalent to a list of +The point however, is that they will be equivalent to a list of terms $\sum rs$, where each term in $rs$ will be made of $r_1 \backslash s' $, $r_2\backslash s'$, -and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. +and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands +for the set of substrings of $s$). The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ -in the simplification which saves $rs$ from growing indefinitely. +in the simplification, which prevents the $rs$ from growing indefinitely. Based on this idea, we develop a proof in two steps. First, we show the equality (where @@ -547,23 +571,45 @@ \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. Second, we will bound the closed form of r-regular expressions using some estimation techniques -and then piece it together -with lemma \ref{sizeRelations} to show the bitcoded regular expressions +and then apply +lemma \ref{sizeRelations} to show that the bitcoded regular expressions in our $\blexersimp$ are finitely bounded. -We will flesh out the first step of the proof we -sketched just now in the next section. +We will describe in detail the first step of the proof +in the next section. \section{Closed Forms} In this section we introduce in detail -how the closed forms are obtained for regular expressions' -derivatives. +how we express the string derivatives +of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string +rather than a single character) in a different way than +our previous definition. +In previous chapters, the derivative of a +regular expression $r$ w.r.t a string $s$ +was recursively defined on the string: +\begin{center} + $r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$ +\end{center} +The problem is that +this definition does not provide much information +on what $r \backslash_s s$ looks like. +If we are interested in the size of a derivative like +$(r_1 \cdot r_2)\backslash s$, +we have to somehow get a more concrete form to begin. +We call such more concrete representations the ``closed forms'' of +string derivatives as opposed to their original definitions. +The terminology ``closed form'' is borrowed from mathematics, +which usually describe expressions that are solely comprised of +well-known and easy-to-compute operations such as +additions, multiplications, exponential functions. + We start by proving some basic identities involving the simplification functions for r-regular expressions. After that we introduce the rewrite relations $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ $\rightsquigarrow_f$ and $\rightsquigarrow_g$. -These relations involves similar techniques in chapter \ref{Bitcoded2}. +These relations involves similar techniques as in chapter \ref{Bitcoded2} +for annotated regular expressions. Finally, we use these identities to establish the closed forms of the alternative regular expression, the sequence regular expression, and the star regular expression. @@ -573,11 +619,9 @@ \subsection{Some Basic Identities} -We now introduce lemmas -that are repeatedly used in later proofs. -Note that for the $\textit{rdistinct}$ function there -will be a lot of conversion from lists to sets. -We use $set$ to refere to the +In what follows we will often convert between lists +and sets. +We use Isabelle's $set$ to refere to the function that converts a list $rs$ to the set containing all the elements in $rs$. \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} @@ -589,8 +633,7 @@ %The function $\textit{rdistinct}$ satisfies the following %properties: Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its - recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} - readily defined + recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} for testing whether a list's elements are unique. Then the following properties about $\textit{rdistinct}$ hold: @@ -675,12 +718,13 @@ \begin{proof} By induction on $rs$. All other variables are allowed to be arbitrary. The second part of the lemma requires the first. - Note that for each part, the two sub-propositions need to be proven concurrently, + Note that for each part, the two sub-propositions need to be proven + at the same time, so that the induction goes through. \end{proof} \noindent This allows us to prove a few more equivalence relations involving -$\textit{rdistinct}$ (it will be useful later): +$\textit{rdistinct}$ (they will be useful later): \begin{lemma}\label{rdistinctConcatGeneral} \mbox{} \begin{itemize} @@ -723,7 +767,7 @@ \end{proof} \noindent $\textit{rdistinct}$ needs to be applied only once, and -applying it multiple times does not cause any difference: +applying it multiple times does not make any difference: \begin{corollary}\label{distinctOnceEnough} $\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$ @@ -748,7 +792,7 @@ %functions on two lists separately, and then concatenating them together. $\textit{Rflts}$ is composable in terms of concatenation: \begin{lemma}\label{rfltsProps} - The function $\rflts$ has the below properties:\\ + The function $\rflts$ has the properties below:\\ \begin{itemize} \item $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$ @@ -779,15 +823,18 @@ \noindent Now we introduce the property that the operations derivative and $\rsimpalts$ -commute, this will be used later in deriving the closed form for +commute, this will be used later on, when deriving the closed form for the alternative regular expression: \begin{lemma}\label{rderRsimpAltsCommute} $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ \end{lemma} +\begin{proof} + By induction on $rs$. +\end{proof} \noindent \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} -Much like the definition of $L$ on plain regular expressions, one could also +Much like the definition of $L$ on plain regular expressions, one can also define the language interpretation of $\rrexp$s. \begin{center} \begin{tabular}{lcl} @@ -820,10 +867,12 @@ \subsubsection{Simplified $\textit{Rrexp}$s are Good} We formalise the notion of ``good" regular expressions, which means regular expressions that -are not fully simplified. For alternative regular expressions that means they -do not contain any nested alternatives like -\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] -or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: +are fully simplified in terms of our $\textit{rsimp}$ function. +For alternative regular expressions that means they +do not contain any nested alternatives, un-eliminated $\RZERO$s +or duplicate elements (for example, +$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$). +The clauses for $\good$ are: \begin{center} \begin{tabular}{@{}lcl@{}} $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ @@ -833,7 +882,7 @@ $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ - & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ + & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\ $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ @@ -842,7 +891,8 @@ \end{tabular} \end{center} \noindent -The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an +We omit the recursive definition of the predicate $\textit{nonAlt}$, +which evaluates to true when the regular expression is not an alternative, and false otherwise. The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, @@ -879,10 +929,13 @@ which enables $\rsimp$ to be non-nested: \begin{lemma}\label{nonnestedRsimp} - $\nonnested \; (\rsimp{r})$ + It is always the case that + \begin{center} + $\nonnested \; (\rsimp{r})$ + \end{center} \end{lemma} \begin{proof} - By an induction on $r$. + By induction on $r$. \end{proof} \noindent With this we could prove that a regular expressions @@ -896,13 +949,13 @@ By \ref{nonnestedRsimp}. \end{proof} \noindent -The other thing we know is that once $\rsimp{}$ had finished +The other fact we know is that once $\rsimp{}$ has finished processing an alternative regular expression, it will not -contain any $\RZERO$s, this is because all the recursive +contain any $\RZERO$s. This is because all the recursive calls to the simplification on the children regular expressions -make the children good, and $\rflts$ will not take out +make the children good, and $\rflts$ will not delete any $\RZERO$s out of a good regular expression list, -and $\rdistinct{}$ will not mess with the result. +and $\rdistinct{}$ will not ``mess'' with the result. \begin{lemma}\label{flts3Obv} The following are true: \begin{itemize} @@ -931,7 +984,7 @@ \end{lemma} \begin{proof} By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. - Lemma \ref{rsimpSize} says that + Lemma \ref{rsimpMono} says that $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to $\llbracket r \rrbracket_r$. Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, @@ -953,7 +1006,7 @@ \end{proof} \noindent Now we are ready to prove that good regular expressions are invariant -of $\rsimp{}$ application: +with respect to $\rsimp{}$: \begin{lemma}\label{test} If $\good \;r$ then $\rsimp{r} = r$. \end{lemma} @@ -964,16 +1017,19 @@ case where 2 or more elements are present in the list. \end{proof} \noindent -Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$, -which requires $\ref{good1}$ to go through smoothly. -It says that an application of $\rsimp_{ALTS}$ can be "absorbed", -if it its output is concatenated with a list and then applied to $\rflts$. +Given below is a property involving $\rflts$, $\textit{rdistinct}$, +$\rsimp{}$ and $\rsimp_{ALTS}$, +which requires $\ref{good1}$ to go through smoothly: \begin{lemma}\label{flattenRsimpalts} +An application of $\rsimp_{ALTS}$ can be ``absorbed'', +if its output is concatenated with a list and then applied to $\rflts$. +\begin{center} $\rflts \; ( (\rsimp_{ALTS} \; (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: \map \; \rsimp{} \; rs' ) = \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( \map \; \rsimp{rs'}))$ +\end{center} \end{lemma} @@ -993,7 +1049,7 @@ forms so that key steps allowing further rewriting to closed forms are possible. \begin{lemma}\label{rsimpIdem} - $\rsimp{r} = \rsimp{\rsimp{r}}$ + $\rsimp{r} = \rsimp{(\rsimp{r})}$ \end{lemma} \begin{proof} @@ -1005,7 +1061,7 @@ our definition of $\blexersimp$. -On the other hand, we could repeat the same $\rsimp{}$ applications +On the other hand, we can repeat the same $\rsimp{}$ applications on regular expressions as many times as we want, if we have at least one simplification applied to it, and apply it wherever we would like to: \begin{corollary}\label{headOneMoreSimp} @@ -1019,8 +1075,8 @@ \end{itemize} \end{corollary} \noindent -This will be useful in later closed form proof's rewriting steps. -Similarly, we point out the following useful facts below: +This will be useful in the later closed form proof's rewriting steps. +Similarly, we state the following useful facts below: \begin{lemma} The following equalities hold if $r = \rsimp{r'}$ for some $r'$: \begin{itemize} @@ -1041,10 +1097,18 @@ we can start proving some key equalities leading to the closed forms. Now presented are a few equivalent terms under $\rsimp{}$. -We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$. +To make the notation more concise +We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$: +\begin{center} +\begin{tabular}{lcl} + $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$ +\end{tabular} +\end{center} +\noindent +%\vspace{0em} \begin{lemma} + The following equivalence hold: \begin{itemize} - The following equivalence hold: \item $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ \item @@ -1063,7 +1127,7 @@ \noindent The above allows us to prove two similar equalities (which are a bit more involved). -It says that we could flatten out the elements +It says that we could flatten the elements before simplification and still get the same result. \begin{lemma}\label{simpFlatten3} One can flatten the inside $\sum$ of a $\sum$ if it is being @@ -1099,7 +1163,7 @@ \noindent -We need more equalities like the above to enable a closed form, +We need more equalities like the above to enable a closed form lemma, for which we need to introduce a few rewrite relations to help us obtain them. @@ -1211,11 +1275,11 @@ operations}\label{gRewrite} \end{figure} \noindent -We defined +We define two separate list rewriting relations $\frewrite$ and $\grewrite$. The rewriting steps that take place during flattening is characterised by $\frewrite$. -$\grewrite$ characterises both flattening and de-duplicating. +The rewrite relation $\grewrite$ characterises both flattening and de-duplicating. Sometimes $\grewrites$ is slightly too powerful so we would rather use $\frewrites$ to prove %because we only @@ -1236,19 +1300,24 @@ \end{lemma} \noindent and then the equivalence between two terms -that can reduce in many steps to each other. +that can reduce in many steps to each other: \begin{lemma}\label{frewritesSimpeq} If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal \sum (\rDistinct \; rs_2 \; \varnothing)$. \end{lemma} \noindent +These two lemmas can both be proven using a straightforward induction (and +the proofs for them are therefore omitted). +Now the above equalities can be derived like a breeze: \begin{corollary} $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) $ \end{corollary} - +\begin{proof} + By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}. +\end{proof} But this trick will not work for $\grewrites$. For example, a rewriting step in proving closed forms is: @@ -1258,7 +1327,7 @@ $\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ \noindent \end{center} -For this one would hope to have a rewriting relation between the two lists involved, +For this, one would hope to have a rewriting relation between the two lists involved, similar to \ref{earlyLaterDerFrewrites}. However, it turns out that \begin{center} $\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; @@ -1267,7 +1336,7 @@ \noindent does $\mathbf{not}$ hold in general. For this rewriting step we will introduce some slightly more cumbersome -proof technique in later sections. +proof technique later. The point is that $\frewrite$ allows us to prove equivalence in a straightforward way that is not possible for $\grewrite$. @@ -1275,12 +1344,12 @@ \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} In this part, we present lemmas stating -pairs of r-regular expressions or r-regular expression lists +pairs of r-regular expressions and r-regular expression lists where one could rewrite from one in many steps to the other. Most of the proofs to these lemmas are straightforward, using -an induction on the inductive cases of the corresponding rewriting relations. +an induction on the corresponding rewriting relations. These proofs will therefore be omitted when this is the case. -We present in the below lemma a few pairs of terms that are rewritable via +We present in the following lemma a few pairs of terms that are rewritable via $\grewrites$: \begin{lemma}\label{gstarRdistinctGeneral} \mbox{} @@ -1302,6 +1371,7 @@ If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other, then they are equivalent under $\rsimp{}$: \begin{lemma}\label{grewritesSimpalts} + \mbox{} If $rs_1 \grewrites rs_2$, then we have the following equivalence: \begin{itemize} @@ -1330,7 +1400,7 @@ \end{itemize} \end{lemma} \noindent -Here comes the meat of the proof, +Now comes the core of the proof, which says that once two lists are rewritable to each other, then they are equivalent under $\rsimp{}$: \begin{lemma} @@ -1338,15 +1408,17 @@ \end{lemma} \noindent -And similar to \ref{Bitcoded2} one can preserve rewritability after taking derivative -of two regular expressions on both sides: +Similar to what we did in \ref{Bitcoded2}, +we prove that if one can rewrite from one r-regular expression ($r$) +to the other ($r'$), after taking derivatives one could still rewrite +the first ($r\backslash c$) to the other ($r'\backslash c$). \begin{lemma}\label{interleave} If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ \end{lemma} \noindent -This allows proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$ now. +This allows us proving more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$. \begin{lemma}\label{insideSimpRemoval} - $\rsimp{\rder{c}{\rsimp{r}}} = \rsimp{\rder{c}{r}} $ + $\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $ \end{lemma} \noindent \begin{proof} @@ -1373,15 +1445,15 @@ \noindent \subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$} -\subsubsection{Closed Form for Alternative Regular Expression} -Lemma \ref{Simpders} leads to the first closed form-- -\begin{lemma}\label{altsClosedForm} +Lemma \ref{Simpders} leads to our first closed form, +which is for the alternative regular expression: +\begin{theorem}\label{altsClosedForm} + \mbox{} \begin{center} $\rderssimp{(\sum rs)}{s} \sequal \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ \end{center} -\end{lemma} - +\end{theorem} \noindent \begin{proof} By a reverse induction on the string $s$. @@ -1402,7 +1474,7 @@ \end{proof} \noindent This closed form has a variant which can be more convenient in later proofs: -\begin{corollary}{altsClosedForm1} +\begin{corollary}\label{altsClosedForm1} If $s \neq []$ then $\rderssimp \; (\sum \; rs) \; s = \rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$. @@ -1414,15 +1486,23 @@ \subsubsection{Closed Form for Sequence Regular Expressions} -First let's look at a series of derivatives steps on a sequence -regular expression, assuming that each time the first -component of the sequence is always nullable): +For the sequence regular expression, +let's first look at a series of derivative steps on it +(assuming that each time when a derivative is taken, +the head of the sequence is always nullable): \begin{center} - - $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ - $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad - \ldots$ - + \begin{tabular}{llll} + $r_1 \cdot r_2$ & + $\longrightarrow_{\backslash c}$ & + $r_1\backslash c \cdot r_2 + r_2 \backslash c$ & + $ \longrightarrow_{\backslash c'} $ \\ + \\ + $(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & + $\longrightarrow_{\backslash c''} $ & + $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + + r_2 \backslash cc'c''$ & + $ \longrightarrow_{\backslash c''} \quad \ldots$\\ + \end{tabular} \end{center} Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as a giant alternative taking a list of terms @@ -1430,32 +1510,50 @@ where the head of the list is always the term representing a match involving only $r_1$, and the tail of the list consisting of terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. -This intuition is also echoed by IndianPaper, where they gave +This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, +where they gave a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: \begin{center} - \begin{tabular}{c} - $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ - \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) - \myequiv$\\ - \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2)) - + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n) - $ + \begin{tabular}{lc} + $L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\ + \\ + \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + + (\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r + (c_2 :: \ldots c_n) ]$ & + $=$\\ + \\ + \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + + (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) + $ & \\ + \\ + $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) + \backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\ \end{tabular} \end{center} \noindent -The equality in above should be interpretated -as language equivalence. -The $\delta$ function works similarly to that of -a Kronecker delta function: -\[ \delta \; b\; r\] -will produce $r$ -if $b$ evaluates to true, -and $\RZERO$ otherwise. -Note that their formulation -\[ - ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) - + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) -\] +The $\delta$ function +returns $r$ when the boolean condition +$b$ evaluates to true and +$\ZERO$ otherwise: +\begin{center} + \begin{tabular}{lcl} + $\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\ + & $\dn$ & $\ZERO \quad otherwise$ + \end{tabular} +\end{center} +\noindent +Note that the term +\begin{center} + \begin{tabular}{lc} + \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + + (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) + $ & \\ + \\ + $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) + \backslash_r (c_3 \ldots c_n)$ &\\ + \end{tabular} +\end{center} +\noindent does not faithfully represent what the intermediate derivatives would actually look like when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not @@ -1463,59 +1561,88 @@ For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, the regular expression would not look like \[ - (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO, + r_1 \backslash_r c_1c_2 \] -but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the +instead of +\[ + (r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO. +\] +The redundant $\ZERO$s will not be created in the first place. -In a closed-form one would want to take into account this -and generate the list of -regular expressions $r_2 \backslash_r s''$ with -string pairs $(s', s'')$ where $s'@s'' = s$ and -$r_1 \backslash s'$ nullable. -We denote the list consisting of such -strings $s''$ as $\vsuf{s}{r_1}$. - -The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: +In a closed-form one needs to take into account this (because +closed forms require exact equality rather than language equivalence) +and only generate the +$r_2 \backslash_r s''$ terms satisfying the property +\begin{center} +$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\; +r_1 \backslash s' \; is \; nullable$. +\end{center} +Given the arguments $s$ and $r_1$, we denote the list of strings +$s''$ satisfying the above property as $\vsuf{s}{r_1}$. +The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ + Perhaps a better name of it would be ``NullablePrefixSuffix'' + to differentiate with the list of \emph{all} prefixes of $s$, but + that is a bit too long for a function name and we are yet to find +a more concise and easy-to-understand name.} \begin{center} \begin{tabular}{lcl} $\vsuf{[]}{\_} $ & $=$ & $[]$\\ - $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ + $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ \end{tabular} \end{center} \noindent -The list is sorted in the order $r_2\backslash s''$ -appears in $(r_1\cdot r_2)\backslash s$. +The list starts with shorter suffixes +and ends with longer ones (in other words, the string elements $s''$ +in the list $\vsuf{s}{r_1}$ are sorted +in the same order as that of the terms $r_2\backslash s''$ +appearing in $(r_1\cdot r_2)\backslash s$). In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing the entire result $(r_1 \cdot r_2) \backslash s$, -it only stores all the strings $s''$ such that $r_2 \backslash s''$ -are occurring terms in $(r_1\cdot r_2)\backslash s$. +it only stores strings, +with each string $s''$ representing a term such that $r_2 \backslash s''$ +is occurring in $(r_1\cdot r_2)\backslash s$. -To make the closed form representation -more straightforward, -the flattetning function $\sflat{\_}$ is used to enable the transformation from +With $\textit{Suffix}$ we are ready to express the +sequence regular expression's closed form, +but before doing so +more devices are needed. +The first thing is the flattening function $\sflat{\_}$, +which takes an alternative regular expression and produces a flattened version +of that alternative regular expression. +It is needed to convert a left-associative nested sequence of alternatives into a flattened list: \[ - \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} - (\ldots ((r_1 + r_2) + r_3) + \ldots) + \sum(\ldots ((r_1 + r_2) + r_3) + \ldots) + \stackrel{\sflat{\_}}{\rightarrow} + \sum[r_1, r_2, r_3, \ldots] \] \noindent -The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below. +The definitions of $\sflat{\_}$ and helper functions +$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below. \begin{center} - \begin{tabular}{ccc} - $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ - $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ - $\sflataux r$ & $=$ & $ [r]$ + \begin{tabular}{lcl} + $\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\ + $\sflataux{\sum []}$ & $ \dn $ & $ []$\\ + $\sflataux r$ & $\dn$ & $ [r]$ \end{tabular} \end{center} \begin{center} - \begin{tabular}{ccc} - $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\ - $\sflat{\sum []}$ & $ = $ & $ \sum []$\\ - $\sflat r$ & $=$ & $ r$ + \begin{tabular}{lcl} + $\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\ + $\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\ + $\sflat r$ & $\dn$ & $ r$ + \end{tabular} +\end{center} + +\begin{center} + \begin{tabular}{lcl} + $\sflataux{[]}'$ & $ \dn $ & $ []$\\ + $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ + $\sflataux{r :: rs}$ & $\dn$ & $ r::rs$ \end{tabular} \end{center} \noindent @@ -1527,19 +1654,63 @@ the output type a regular expression, not a list. $\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the first element of the list. +$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs +a list of regular expressions. +The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have +$\textit{createdBySequence}$ defined: +\begin{center} + \begin{mathpar} + \inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)} -With $\sflataux{}$ a preliminary to the closed form can be stated, -where the derivative of $r_1 \cdot r_2 \backslash s$ can be -flattened into a list whose head and tail meet the description -we gave earlier. + \inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \; + (r_1 + r_2)} + \end{mathpar} +\end{center} +\noindent +The predicate $\textit{createdBySequence}$ is used to describe the shape of +the derivative regular expressions $(r_1\cdot r_2) \backslash s$: +\begin{lemma}\label{recursivelyDerseq} + It is always the case that + \begin{center} + $\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $ + \end{center} + holds. +\end{lemma} +\begin{proof} + By a reverse induction on the string $s$, where the inductive cases are $[]$ + and $xs @ [x]$. +\end{proof} +\noindent +If we have a regular expression $r$ whose shpae +fits into those described by $\textit{createdBySequence}$, +then we can convert between +$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with +$\sflataux{\_}'$: +\begin{lemma}\label{sfauIdemDer} + If $\textit{createdBySequence} \; r$, then + \begin{center} + $\sflataux{ r \backslash_r c} = + \llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$ + \end{center} + holds. +\end{lemma} +\begin{proof} + By a simple induction on the inductive cases of $\textit{createdBySequence}. + $ +\end{proof} + +Now we are ready to express +the shape of $r_1 \cdot r_2 \backslash s$ \begin{lemma}\label{seqSfau0} - $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 + $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ \end{lemma} \begin{proof} - By an induction on the string $s$, where the inductive cases - are split as $[]$ and $xs @ [x]$. - Note the key identify holds: + By a reverse induction on the string $s$, where the inductive cases + are $[]$ and $xs @ [x]$. + For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) + \backslash_r xs)$ holds from lemma \ref{recursivelyDerseq}, + which can be used to prove \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) @@ -1548,96 +1719,126 @@ \[ \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) \] - This enables the inductive case to go through. + using lemma \ref{sfauIdemDer}. + This equality enables the inductive case to go through. \end{proof} \noindent -Note that this lemma does $\mathbf{not}$ depend on any -specific definitions we used, -allowing people investigating derivatives to get an alternative -view of what $r_1 \cdot r_2$ is. +This lemma says that $(r_1\cdot r_2)\backslash s$ +can be flattened into a list whose head and tail meet the description +we gave earlier. +%Note that this lemma does $\mathbf{not}$ depend on any +%specific definitions we used, +%allowing people investigating derivatives to get an alternative +%view of what $r_1 \cdot r_2$ is. -Now we are able to use this for the intuition that -the different ways in which regular expressions are -nested do not matter under $\rsimp{}$: -\begin{center} - $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ - and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ -\end{center} -Simply wrap with $\sum$ constructor and add -simplifications to both sides of \ref{seqSfau0} -and one gets -\begin{corollary}\label{seqClosedFormGeneral} +We now use $\textit{createdBySequence}$ and +$\sflataux{\_}$ to describe an intuition +behind the closed form of the sequence closed form. +If two regular expressions only differ in the way their +alternatives are nested, then we should be able to get the same result +once we apply simplification to both of them: +\begin{lemma}\label{sflatRsimpeq} + If $r$ is created from a sequence through + a series of derivatives + (i.e. if $\textit{createdBySequence} \; r$ holds), + and that $\sflataux{r} = rs$, + then we have + that + \begin{center} + $\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$ + \end{center} + holds. +\end{lemma} +\begin{proof} + By an induction on the inductive cases of $\textit{createdBySequence}$. +\end{proof} + +Now we are ready for the closed form +for the sequence regular expressions (without the inner applications +of simplifications): +\begin{lemma}\label{seqClosedFormGeneral} $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ -\end{corollary} -Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), -it is possible to convert the above lemma to obtain a "closed form" -for derivatives nested with simplification: -\begin{lemma}\label{seqClosedForm} - $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) - :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ \end{lemma} \begin{proof} - By a case analysis of string $s$. - When $s$ is empty list, the rewrite is straightforward. - When $s$ is a list, one could use the corollary \ref{seqSfau0}, - and lemma \ref{Simpders} to rewrite the left-hand-side. + We know that + $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 + :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ + holds + by lemma \ref{seqSfau0}. + This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. \end{proof} -As a corollary for this closed form, one can estimate the size -of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using -an easier-to-handle expression: -\begin{corollary}\label{seqEstimate1} - \begin{center} - - $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) - :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$ - - \end{center} -\end{corollary} -\noindent +Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), +it is possible to convert the above lemma to obtain the +proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$: +for derivatives nested with simplification: +\begin{theorem}\label{seqClosedForm} + $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) + :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ +\end{theorem} +\begin{proof} + By a case analysis of the string $s$. + When $s$ is an empty list, the rewrite is straightforward. + When $s$ is a non-empty list, the + lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply, + making the proof go through. +\end{proof} \subsubsection{Closed Forms for Star Regular Expressions} -We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using -the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then -the property of the $\distinct$ function. -Now we try to get a bound on $r^* \backslash s$ as well. -Again, we first look at how a star's derivatives evolve, if they grow maximally: +The closed form for the star regular expression involves similar tricks +for the sequence regular expression. +The $\textit{Suffix}$ function is now replaced by something +slightly more complex, because the growth pattern of star +regular expressions' derivatives is a bit different: \begin{center} - - $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad - r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad - (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} - \quad \ldots$ - + \begin{tabular}{lclc} + $r^* $ & $\longrightarrow_{\backslash c}$ & + $(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\ + \\ + $r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ & + $\longrightarrow_{\backslash c''}$ & + $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & + $\longrightarrow_{\backslash c'''}$ \\ + \\ + $\ldots$\\ + \end{tabular} \end{center} When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, -the number of terms in $r^* \backslash s$ will grow exponentially, causing the size -of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not -count the possible size explosions of $r \backslash c$ themselves. +the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly +in the sequence case. +The good news is that the function $\textit{rsimp}$ will again +ignore the difference between differently nesting patterns of alternatives, +and the exponentially growing star derivative like +\begin{center} + $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ +\end{center} +can be treated as +\begin{center} + $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', + r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ +\end{center} +which can be de-duplicated by $\rDistinct$ +and therefore bounded finitely. -Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like -$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + -(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ -into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', -r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ -and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). +and then de-duplicate terms of the form ($s'$ being a substring of $s$). This allows us to use a similar technique as $r_1 \cdot r_2$ case, -where the crux is to get an equivalent form of -$\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$. -This requires generating -all possible sub-strings $s'$ of $s$ -such that $r\backslash s' \cdot r^*$ will appear -as a term in $(r^*) \backslash s$. -The first function we define is a single-step -updating function $\starupdate$, which takes three arguments as input: -the new character $c$ to take derivative with, -the regular expression -$r$ directly under the star $r^*$, and the -list of strings $sSet$ for the derivative $r^* \backslash s$ -up til this point -such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ -(the equality is not exact, more on this later). + +Now the crux of this section is finding a suitable description +for $rs$ where +\begin{center} + $\rderssimp{r^*}{s} = \rsimp{\sum rs}$. +\end{center} +holds. +In addition, the list $rs$ +shall be in the form of +$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$. +The $Ss$ is a list of strings, and for example in the sequence +closed form it is specified as $\textit{Suffix} \; s \; r_1$. +To get $Ss$ for the star regular expression, +we need to introduce $\starupdate$ and $\starupdates$: \begin{center} \begin{tabular}{lcl} $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ @@ -1650,11 +1851,6 @@ \starupdate \; c \; r \; Ss)$ \end{tabular} \end{center} -\noindent -As a generalisation from characters to strings, -$\starupdates$ takes a string instead of a character -as the first input argument, and is otherwise the same -as $\starupdate$. \begin{center} \begin{tabular}{lcl} $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ @@ -1663,12 +1859,33 @@ \end{tabular} \end{center} \noindent -For the star regular expression, -its derivatives can be seen as a nested gigantic -alternative similar to that of sequence regular expression's derivatives, -and therefore need -to be ``straightened out" as well. -The function for this would be $\hflat{}$ and $\hflataux{}$. +Assuming we have that +\begin{center} + $\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$. +\end{center} +holds. +The idea of $\starupdate$ and $\starupdates$ +is to update $Ss$ when another +derivative is taken on $\rderssimp{r^*}{s}$ +w.r.t a character $c$ and a string $s'$ +respectively. +Both $\starupdate$ and $\starupdates$ take three arguments as input: +the new character $c$ or string $s$ to take derivative with, +the regular expression +$r$ under the star $r^*$, and the +list of strings $Ss$ for the derivative $r^* \backslash s$ +up until this point +such that +\begin{center} +$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ +\end{center} +is satisfied. + +Functions $\starupdate$ and $\starupdates$ characterise what the +star derivatives will look like once ``straightened out'' into lists. +The helper functions for such operations will be similar to +$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence. +We use similar symbols to denote them, with a $*$ subscript to mark the difference. \begin{center} \begin{tabular}{lcl} $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ @@ -1683,122 +1900,230 @@ \end{tabular} \end{center} \noindent -%MAYBE TODO: introduce createdByStar -Again these definitions are tailor-made for dealing with alternatives that have -originated from a star's derivatives, so we do not attempt to open up all possible -regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 -elements. -We give a predicate for such "star-created" regular expressions: +These definitions are tailor-made for dealing with alternatives that have +originated from a star's derivatives. +A typical star derivative always has the structure of a balanced binary tree: \begin{center} - \begin{tabular}{lcr} - & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ - $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ - \end{tabular} + $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ \end{center} +All of the nested structures of alternatives +generated from derivatives are binary, and therefore +$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. +$\hflat{\_}$ ``untangles'' like the following: +\[ + \sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; + \stackrel{\hflat{\_}}{\longrightarrow} \; + \RALTS{[r_1, r_2, \ldots, r_n]} +\] +Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, +with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists + and merges each of the element lists to form a flattened list.}: +\begin{lemma}\label{stupdateInduct1} + \mbox + For a list of strings $Ss$, the following hold. + \begin{itemize} + \item + If we do a derivative on the terms + $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), + the result will be the same as if we apply $\starupdate$ to $Ss$. + \begin{center} + \begin{tabular}{c} + $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) + \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; + $\\ + \\ + $=$ \\ + \\ + $\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; + (\starupdate \; x \; r \; Ss)$ + \end{tabular} + \end{center} + \item + $\starupdates$ is ``composable'' w.r.t a derivative. + It piggybacks the character $x$ to the tail of the string + $xs$. + \begin{center} + \begin{tabular}{c} + $\textit{concat} \; (\map \; \hflataux{\_} \; + (\map \; (\_\backslash_r x) \; + (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot + (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ + \\ + $=$\\ + \\ + $\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; + (\starupdates \; (xs @ [x]) \; r \; Ss)$ + \end{tabular} + \end{center} + \end{itemize} +\end{lemma} + +\begin{proof} + Part 1 is by induction on $Ss$. + Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. +\end{proof} + -These definitions allows us the flexibility to talk about -regular expressions in their most convenient format, -for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ -instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. -These definitions help express that certain classes of syntatically -distinct regular expressions are actually the same under simplification. -This is not entirely true for annotated regular expressions: -%TODO: bsimp bders \neq bderssimp +Like $\textit{createdBySequence}$, we need +a predicate for ``star-created'' regular expressions: \begin{center} - $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ -\end{center} -For bit-codes, the order in which simplification is applied -might cause a difference in the location they are placed. -If we want something like -\begin{center} - $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ + \begin{mathpar} + \inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } + + \inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } + \end{mathpar} \end{center} -Some "canonicalization" procedure is required, -which either pushes all the common bitcodes to nodes -as senior as possible: -\begin{center} - $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ -\end{center} -or does the reverse. However bitcodes are not of interest if we are talking about -the $\llbracket r \rrbracket$ size of a regex. -Therefore for the ease and simplicity of producing a -proof for a size bound, we are happy to restrict ourselves to -unannotated regular expressions, and obtain such equalities as -\begin{lemma} - $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ +\noindent +All regular expressions created by taking derivatives of +$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: +\begin{lemma}\label{starDersCbs} + $\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. +\end{lemma} +\begin{proof} + By a reverse induction on $s$. +\end{proof} +If a regular expression conforms to the shape of a star's derivative, +then we can push an application of $\hflataux{\_}$ inside a derivative of it: +\begin{lemma}\label{hfauPushin} + If $\textit{createdByStar} \; r$ holds, then + \begin{center} + $\hflataux{r \backslash_r c} = \textit{concat} \; ( + \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ + \end{center} + holds. +\end{lemma} +\begin{proof} + By an induction on the inductivev cases of $\textit{createdByStar}$. +\end{proof} +%This is not entirely true for annotated regular expressions: +%%TODO: bsimp bders \neq bderssimp +%\begin{center} +% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ +%\end{center} +%For bit-codes, the order in which simplification is applied +%might cause a difference in the location they are placed. +%If we want something like +%\begin{center} +% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ +%\end{center} +%Some "canonicalization" procedure is required, +%which either pushes all the common bitcodes to nodes +%as senior as possible: +%\begin{center} +% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ +%\end{center} +%or does the reverse. However bitcodes are not of interest if we are talking about +%the $\llbracket r \rrbracket$ size of a regex. +%Therefore for the ease and simplicity of producing a +%proof for a size bound, we are happy to restrict ourselves to +%unannotated regular expressions, and obtain such equalities as +%TODO: rsimp sflat +% The simplification of a flattened out regular expression, provided it comes +%from the derivative of a star, is the same as the one nested. + + + +Now we introduce an inductive property +for $\starupdate$ and $\hflataux{\_}$. +\begin{lemma}\label{starHfauInduct} + If we do derivatives of $r^*$ + with a string that starts with $c$, + then flatten it out, + we obtain a list + of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, + where $sS = \starupdates \; s \; r \; [[c]]$. Namely, + \begin{center} + $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = + \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; + (\starupdates \; s \; r_0 \; [[c]])$ + \end{center} +holds. +\end{lemma} +\begin{proof} + By an induction on $s$, the inductive cases + being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. +\end{proof} +\noindent + +$\hflataux{\_}$ has a similar effect as $\textit{flatten}$: +\begin{lemma}\label{hflatauxGrewrites} + $a :: rs \grewrites \hflataux{a} @ rs$ +\end{lemma} +\begin{proof} + By induction on $a$. $rs$ is set to take arbitrary values. +\end{proof} +It is also not surprising that $\textit{rsimp}$ will cover +the effect of $\hflataux{\_}$: +\begin{lemma}\label{cbsHfauRsimpeq1} + $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ \end{lemma} \begin{proof} By using the rewriting relation $\rightsquigarrow$ \end{proof} -%TODO: rsimp sflat And from this we obtain a proof that a star's derivative will be the same as if it had all its nested alternatives created during deriving being flattened out: For example, -\begin{lemma} +\begin{lemma}\label{hfauRsimpeq2} $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ \end{lemma} \begin{proof} - By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. + By structural induction on $r$, where the induction rules + are these of $\createdByStar{\_}$. + Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. \end{proof} -% The simplification of a flattened out regular expression, provided it comes -%from the derivative of a star, is the same as the one nested. - -We first introduce an inductive property -for $\starupdate$ and $\hflataux{\_}$, -it says if we do derivatives of $r^*$ -with a string that starts with $c$, -then flatten it out, -we obtain a list -of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$, -where $sSet = \starupdates \; s \; r \; [[c]]$. -\begin{lemma}\label{starHfauInduct} - $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = - \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; - (\starupdates \; s \; r_0 \; [[c]])$ +%Here is a corollary that states the lemma in +%a more intuitive way: +%\begin{corollary} +% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot +% (r^*))\; (\starupdates \; c\; r\; [[c]])$ +%\end{corollary} +%\noindent +%Note that this is also agnostic of the simplification +%function we defined, and is therefore of more general interest. + +Together with the rewriting relation +\begin{lemma}\label{starClosedForm6Hrewrites} + We have the following set of rewriting relations or equalities: + \begin{itemize} + \item + $\textit{rsimp} \; (r^* \backslash_r (c::s)) + \sequal + \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( + \starupdates \; s \; r \; [ c::[]] ) ) )$ + \item + $r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( ( + \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; + (\starupdates \;s \; r \; [ c::[] ])))))$ + \item + $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) + \sequal + \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; + r^*) \; Ss) )$ + \item + $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss + \scfrewrites + \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ + \item + $( ( \sum ( ( \map \ (\lambda s. \;\; + (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; + s \; r \; [ c::[] ])))))$\\ + $\sequal$\\ + $( ( \sum ( ( \map \ (\lambda s. \;\; + ( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \; + s \; r \; [ c::[] ]))))$\\ + \end{itemize} \end{lemma} \begin{proof} - By an induction on $s$, the inductive cases - being $[]$ and $s@[c]$. + Part 1 leads to part 2. + The rest of them are routine. \end{proof} \noindent -Here is a corollary that states the lemma in -a more intuitive way: -\begin{corollary} - $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot - (r^*))\; (\starupdates \; c\; r\; [[c]])$ -\end{corollary} -\noindent -Note that this is also agnostic of the simplification -function we defined, and is therefore of more general interest. - -Now adding the $\rsimp{}$ bit for closed forms, -we have -\begin{lemma} - $a :: rs \grewrites \hflataux{a} @ rs$ -\end{lemma} -\noindent -giving us -\begin{lemma}\label{cbsHfauRsimpeq1} - $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$. -\end{lemma} -\noindent -This yields -\begin{lemma}\label{hfauRsimpeq2} - $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$ -\end{lemma} -\noindent -Together with the rewriting relation -\begin{lemma}\label{starClosedForm6Hrewrites} - $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss - \scfrewrites - \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ -\end{lemma} -\noindent -We obtain the closed form for star regular expression: -\begin{lemma}\label{starClosedForm} +Next the closed form for star regular expressions can be derived: +\begin{theorem}\label{starClosedForm} $\rderssimp{r^*}{c::s} = \rsimp{ (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; @@ -1807,11 +2132,14 @@ ) } $ -\end{lemma} +\end{theorem} \begin{proof} By an induction on $s$. - The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} + The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} + and \ref{hfauRsimpeq2} are used. + In \ref{starClosedForm6Hrewrites}, the equalities are + used to link the LHS and RHS. \end{proof} @@ -1881,13 +2209,56 @@ In this section, we introduce how we formalised the bound on closed forms. -We first prove that functions such as $\rflts$ +We first show that in general regular expressions up to a certain +size are finite. +Then we prove that functions such as $\rflts$ will not cause the size of r-regular expressions to grow. Putting this together with a general bound on the finiteness of distinct regular expressions -smaller than a certain size, we obtain a bound on +up to a certain size, we obtain a bound on the closed forms. +\subsection{Finiteness of Distinct Regular Expressions} +We define the set of regular expressions whose size are no more than +a certain size $N$ as $\textit{sizeNregex} \; N$: +\[ + \textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \} +\] +We have that $\textit{sizeNregex} \; N$ is always a finite set: +\begin{lemma}\label{finiteSizeN} + $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. +\end{lemma} +\begin{proof} + By splitting the set $\textit{sizeNregex} \; (N + 1)$ into + subsets by their categories: + $\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$, + and so on. Each of these subsets are finitely bounded. +\end{proof} +\noindent +From this we get a corollary that +if forall $r \in rs$, $\rsize{r} \leq N$, then the output of +$\rdistinct{rs}{\varnothing}$ is a list of regular +expressions of finite size depending on $N$ only. +\begin{corollary}\label{finiteSizeNCorollary} + $\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds, + where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \; + N)$. +\end{corollary} +\begin{proof} + For all $r$ in + $\textit{set} \; (\rdistinct{rs}{\varnothing})$, + it is always the case that $\rsize{r} \leq N$. + In addition, the list length is bounded by + $c_N$, yielding the desired bound. +\end{proof} +\noindent +This fact will be handy in estimating the closed form sizes. +%We have proven that the size of the +%output of $\textit{rdistinct} \; rs' \; \varnothing$ +%is bounded by a constant $N * c_N$ depending only on $N$, +%provided that each of $rs'$'s element +%is bounded by $N$. + \subsection{$\textit{rsimp}$ Does Not Increment the Size} Although it seems evident, we need a series of non-trivial lemmas to establish that functions such as $\rflts$ @@ -1944,7 +2315,8 @@ \end{proof} \subsection{Estimating the Closed Forms' sizes} -We now summarize the closed forms below: +We recap the closed forms we obtained +earlier by putting them together down below: \begin{itemize} \item $\rderssimp{(\sum rs)}{s} \sequal @@ -1978,44 +2350,10 @@ We elaborate the above reasoning by a series of lemmas below, where straightforward proofs are omitted. -\begin{lemma} - If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$, - and $\textit{length} \; rs$ is less than or equal to $l$, - then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$. -\end{lemma} -\noindent -If we define all regular expressions with size no -more than $N$ as $\sizeNregex \; N$: -\[ - \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \} -\] -Then such set is finite: -\begin{lemma}\label{finiteSizeN} - $\textit{isFinite}\; (\sizeNregex \; N)$ -\end{lemma} -\begin{proof} - By overestimating the set $\sizeNregex \; N + 1$ - using union of sets like - $\{r_1 \cdot r_2 \mid r_1 \in A - \text{and} - r_2 \in A\} - $ where $A = \sizeNregex \; N$. -\end{proof} -\noindent -From this we get a corollary that -if forall $r \in rs$, $\rsize{r} \leq N$, then the output of -$\rdistinct{rs}{\varnothing}$ is a list of regular -expressions of finite size depending on $N$ only. -\begin{corollary}\label{finiteSizeNCorollary} - Assumes that for all $r \in rs. \rsize{r} \leq N$, - and the cardinality of $\sizeNregex \; N$ is $c_N$ - then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$. -\end{corollary} -\noindent -We have proven that the output of $\rdistinct{rs'}{\varnothing}$ -is bounded by a constant $c_N$ depending only on $N$, -provided that each of $rs'$'s element -is bounded by $N$. + + + + We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. We show that $\rdistinct$ and $\rflts$ @@ -2041,20 +2379,20 @@ From \begin{center} $\llbracket \textit{rflts}\; rs \rrbracket_r \leq - \llbracket \; \textit{rs} \rrbracket_r$ + \llbracket \textit{rs} \rrbracket_r$ \end{center} and \begin{center} $\llbracket \textit{rdistinct} \; rs \; \varnothing \leq \llbracket rs \rrbracket_r$ \end{center} -one cannot imply +one cannot infer \begin{center} $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. \end{center} -What we can imply is that +What we can infer is that \begin{center} $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r \leq @@ -2062,7 +2400,7 @@ \end{center} but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. The way we -get through this is by first proving a more general lemma +get around this is by first proving a more general lemma (so that the inductive case goes through): \begin{lemma}\label{fltsSizeReductionAlts} If we have three accumulator sets: @@ -2087,11 +2425,17 @@ holds. \end{lemma} \noindent -We need to split the accumulator into two parts: the part +We split the accumulator into two parts: the part which contains alternative regular expressions ($alts\_set$), and the part without any of them($noalts\_set$). +This is because $\rflts$ opens up the alternatives in $as$, +causing the accumulators on both sides of the inequality +to diverge slightly. +If we want to compare the accumulators that are not +perfectly in sync, we need to consider the alternatives and non-alternatives +separately. The set $corr\_set$ is the corresponding set -of $alts\_set$ with all elements under the $\sum$ constructor +of $alts\_set$ with all elements under the alternative constructor spilled out. \begin{proof} By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. @@ -2106,7 +2450,8 @@ By using the lemma \ref{fltsSizeReductionAlts}. \end{proof} \noindent -The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of +The intuition for why this is true +is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: @@ -2114,14 +2459,14 @@ $\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$ \end{lemma} \begin{proof} - By using \ref{interactionFltsDB}. + By using corollary \ref{interactionFltsDB}. \end{proof} \noindent This is a key lemma in establishing the bounds on all the closed forms. With this we are now ready to control the sizes of -$(r_1 \cdot r_2 )\backslash s$, $r^* \backslash s$. -\begin{theorem} +$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$. +\begin{theorem}\label{rBound} For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ \end{theorem} \noindent @@ -2146,7 +2491,7 @@ \end{tabular} \end{center} \noindent -(1) is by the corollary \ref{seqEstimate1}. +(1) is by theorem \ref{seqClosedForm}. (2) is by \ref{altsSimpControl}. (3) is by \ref{finiteSizeNCorollary}. @@ -2183,9 +2528,9 @@ \end{tabular} \end{center} \noindent -(5) is by the lemma \ref{starClosedForm}. +(5) is by theorem \ref{starClosedForm}. (6) is by \ref{altsSimpControl}. -(7) is by \ref{finiteSizeNCorollary}. +(7) is by corollary \ref{finiteSizeNCorollary}. Combining with the case when $s = []$, one gets \begin{center} \begin{tabular}{lcll} @@ -2208,7 +2553,7 @@ \end{tabular} \end{center} \noindent -(9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis. +(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis. Combining with the case when $s = []$, one gets \begin{center} @@ -2217,15 +2562,15 @@ & (12)\\ \end{tabular} \end{center} -(4), (8), and (12) are all the inductive cases proven. +We have all the inductive cases proven. \end{proof} - +This leads to our main result on the size bound: \begin{corollary} - For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$ + For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$ \end{corollary} \begin{proof} - By \ref{sizeRelations}. + By lemma \ref{sizeRelations} and theorem \ref{rBound}. \end{proof} \noindent @@ -2243,22 +2588,9 @@ %---------------------------------------------------------------------------------------- -\subsection{A Closed Form for the Sequence Regular Expression} -\noindent - -Before we get to the proof that says the intermediate result of our lexer will -remain finitely bounded, which is an important efficiency/liveness guarantee, -we shall first develop a few preparatory properties and definitions to -make the process of proving that a breeze. - -We define rewriting relations for $\rrexp$s, which allows us to do the -same trick as we did for the correctness proof, -but this time we will have stronger equalities established. - - - +\section{Comments and Future Improvements} +\subsection{Some Experimental Results} What guarantee does this bound give us? - Whatever the regex is, it will not grow indefinitely. Take our previous example $(a + aa)^*$ as an example: \begin{center} @@ -2278,7 +2610,7 @@ width=5cm, height=4cm, legend entries={$(a + aa)^*$}, - legend pos=north west, + legend pos=south east, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; \end{axis} @@ -2297,22 +2629,16 @@ $f(x) = x * 2^x$. This means the bound we have will surge up at least tower-exponentially with a linear increase of the depth. -For a regex of depth $n$, the bound -would be approximately $4^n$. -Test data in the graphs from randomly generated regular expressions -shows that the giant bounds are far from being hit. -%a few sample regular experessions' derivatives -%size change -%TODO: giving regex1_size_change.data showing a few regular expressions' size changes -%w;r;t the input characters number, where the size is usually cubic in terms of original size -%a*, aa*, aaa*, ..... -%randomly generated regular expressions -\begin{figure}{H} - \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +One might be quite skeptical about what this non-elementary +bound can bring us. +It turns out that the giant bounds are far from being hit. +Here we have some test data from randomly generated regular expressions: +\begin{figure}[H] + \begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ - xlabel={number of characters}, + xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={regex size}, enlargelimits=false, @@ -2322,15 +2648,15 @@ %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, - width=5cm, - height=4cm, + width=4.75cm, + height=3.8cm, legend entries={regex1}, - legend pos=north west, + legend pos=north east, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; \end{axis} \end{tikzpicture} - & + & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, @@ -2343,15 +2669,15 @@ %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, - width=5cm, - height=4cm, + width=4.75cm, + height=3.8cm, legend entries={regex2}, - legend pos=north west, + legend pos=south east, legend cell align=left] \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; \end{axis} \end{tikzpicture} - & + & \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, @@ -2364,16 +2690,19 @@ %ytick={0,100,...,1000}, scaled ticks=false, axis lines=left, - width=5cm, - height=4cm, + width=4.75cm, + height=3.8cm, legend entries={regex3}, - legend pos=north west, + legend pos=south east, legend cell align=left] \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; \end{axis} \end{tikzpicture}\\ - \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} + \multicolumn{3}{c}{} \end{tabular} + \caption{Graphs: size change of 3 randomly generated + regular expressions $w.r.t.$ input string length. + The x axis represents the length of input.} \end{figure} \noindent Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the @@ -2382,24 +2711,22 @@ -\section{Possible Further Improvements} -There are two problems with this finiteness result, though. -\begin{itemize} - \item - First, It is not yet a direct formalisation of our lexer's complexity, +\subsection{Possible Further Improvements} +There are two problems with this finiteness result, though.\\ +(i) + First, it is not yet a direct formalisation of our lexer's complexity, as a complexity proof would require looking into the time it takes to execute {\bf all} the operations - involved in the lexer (simp, collect, decode), not just the derivative. - \item + involved in the lexer (simp, collect, decode), not just the derivative.\\ +(ii) Second, the bound is not yet tight, and we seek to improve $N_a$ so that - it is polynomial on $\llbracket a \rrbracket$. -\end{itemize} -Still, we believe this contribution is fruitful, + it is polynomial on $\llbracket a \rrbracket$.\\ +Still, we believe this contribution is useful, because \begin{itemize} \item - The size proof can serve as a cornerstone for a complexity + The size proof can serve as a starting point for a complexity formalisation. Derivatives are the most important phases of our lexer algorithm. Size properties about derivatives covers the majority of the algorithm @@ -2435,8 +2762,11 @@ would still be slow. And unfortunately, we have concrete examples where such regular expressions grew exponentially large before levelling off: +\begin{center} $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + -(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum +(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ +\end{center} +will already have a maximum size that is exponential on the number $n$ under our current simplification rules: %TODO: graph of a regex whose size increases exponentially. @@ -2456,38 +2786,48 @@ \end{tikzpicture} \end{center} -For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ +For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. The exponential size is triggered by that the regex -$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ +$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ inside the $(\ldots) ^*$ having exponentially many different derivatives, despite those difference being minor. -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ +$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ will therefore contain the following terms (after flattening out all nested alternatives): \begin{center} - $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ +$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ +[1mm] $(1 \leq m' \leq m )$ \end{center} -These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). +There at at least exponentially +many such terms.\footnote{To be exact, these terms are +distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted, +but the point is that the number is exponential. +} With each new input character taking the derivative against the intermediate result, more and more such distinct -terms will accumulate, -until the length reaches $L.C.M.(1, \ldots, n)$. -$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ - -$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ -where $m' \neq m''$ \\ +terms will accumulate. +The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms +\begin{center} +$(\sum_{i = 1}^{n} +(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot +(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot +(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ +$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot +(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot +(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ +\end{center} +\noindent +where $m' \neq m''$ as they are slightly different. This means that with our current simplification methods, we will not be able to control the derivative so that -$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ -as there are already exponentially many terms. +$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$ These terms are similar in the sense that the head of those terms are all consisted of sub-terms of the form: $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. -For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most +For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most $n * (n + 1) / 2$ such terms. For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives can be described by 6 terms: @@ -2495,15 +2835,29 @@ $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. The total number of different "head terms", $n * (n + 1) / 2$, is proportional to the number of characters in the regex -$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. -This suggests a slightly different notion of size, which we call the +$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. +If we can improve our deduplication process so that it becomes smarter +and only keep track of these $n * (n+1) /2$ terms, then we can keep +the size growth polynomial again. +This example also suggests a slightly different notion of size, which we call the alphabetic width: -%TODO: -(TODO: Alphabetic width def.) +\begin{center} + \begin{tabular}{lcl} + $\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\ + $\textit{awidth} \; \ONE$ & $\dn$ & $0$\\ + $\textit{awidth} \; c$ & $\dn$ & $1$\\ + $\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; + r_1 + \textit{awidth} \; r_2$\\ + $\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \; + r_1 + \textit{awidth} \; r_2$\\ + $\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\ + \end{tabular} +\end{center} + Antimirov\parencite{Antimirov95} has proven that -$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. +$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$, where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms created by doing derivatives of $r$ against all possible strings. If we can make sure that at any moment in our lexing algorithm our @@ -2523,12 +2877,12 @@ %----------------------------------- % SUBSECTION 1 %----------------------------------- -\subsection{Syntactic Equivalence Under $\simp$} -We prove that minor differences can be annhilated -by $\simp$. -For example, -\begin{center} - $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = - \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ -\end{center} +%\subsection{Syntactic Equivalence Under $\simp$} +%We prove that minor differences can be annhilated +%by $\simp$. +%For example, +%\begin{center} +% $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = +% \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ +%\end{center} diff -r 3ea6a38c33b5 -r 2072a8d54e3e ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Oct 13 23:55:25 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Nov 06 23:06:10 2022 +0000 @@ -780,13 +780,13 @@ % \end{tabular} %\end{center} A concrete example -for back-references would be +for back-references is \begin{center} $(.^*)\backslash 1$, \end{center} -which would match +which matches strings that can be split into two identical halves, -for example $\mathit{foofoo}$, $\mathit{ww}$ and etc. +for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. Note that this is different from repeating the sub-expression verbatim like \begin{center} @@ -795,32 +795,32 @@ which does not impose any restrictions on what strings the second sub-expression $.^*$ might match. -Another example of back-references would be +Another example of back-references is \begin{center} $(.)(.)\backslash 2\backslash 1$ \end{center} -which expresses four-character palindromes -like $abba$, $x??x$ etc. +which matches four-character palindromes +like $abba$, $x??x$ and so on. Back-references is a regex construct -that programmers found quite useful. +that programmers find quite useful. According to Becchi and Crawley\cite{Becchi08}, -6\% of Snort rules (up until 2008) include the use of them. +6\% of Snort rules (up until 2008) use them. The most common use of back-references -would be expressing well-formed html files, -where back-references would be handy in expressing -a pair of opening and closing tags like +is to express well-formed html files, +where back-references are convenient for matching +opening and closing tags like \begin{center} $\langle html \rangle \ldots \langle / html \rangle$ \end{center} A regex describing such a format -could be +is \begin{center} $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ \end{center} -Despite being useful, the syntax and expressive power of regexes +Despite being useful, the expressive power of regexes go beyond the regular language hierarchy -with back-references. +once back-references are included. In fact, they allow the regex construct to express languages that cannot be contained in context-free languages either. @@ -829,17 +829,15 @@ which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. Such a language is contained in the context-sensitive hierarchy of formal languages. -Solving the back-reference expressions matching problem +Also solving the matching problem involving back-references is known to be NP-complete \parencite{alfred2014algorithms}. -A non-bactracking, -efficient solution is not known to exist. Regex libraries supporting back-references such as PCRE \cite{pcre} therefore have to revert to a depth-first search algorithm which backtracks. What is unexpected is that even in the cases not involving back-references, there is still a (non-negligible) chance they might backtrack super-linearly, -as shown in the graphs in \ref{fig:aStarStarb}. +as shown in the graphs in figure\ref{fig:aStarStarb}. \subsection{Summary of the Catastrophic Backtracking Problem} Summing these up, we can categorise existing diff -r 3ea6a38c33b5 -r 2072a8d54e3e ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Thu Oct 13 23:55:25 2022 +0100 +++ b/ChengsongTanPhdThesis/example.bib Sun Nov 06 23:06:10 2022 +0000 @@ -76,7 +76,16 @@ %% Brzozowski ders------------------------ - +@article{Murugesan2014, + author = {N.~Murugesan and O.~V.~Shanmuga Sundaram}, + title = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions}, + journal = {International Journal of Computer Trends and Technology}, + volume = {13}, + number = {1}, + year = {2014}, + url = {http://arxiv.org/abs/1407.5902}, + pages = {29--33} +} %% look-aheads------------------------ @article{Takayuki2019, diff -r 3ea6a38c33b5 -r 2072a8d54e3e thys4/posix/FBound.thy --- a/thys4/posix/FBound.thy Thu Oct 13 23:55:25 2022 +0100 +++ b/thys4/posix/FBound.thy Sun Nov 06 23:06:10 2022 +0000 @@ -640,7 +640,7 @@ sorry -lemma compl_rrewrite_down1: +lemma compl_rrewrite_down1_1: shows "\r1 \1 r2; s_complexity r1 = s_complexity r2 \ \ r1 = r2" using compl_rrewrite_down nat_less_le by auto