# HG changeset patch # User Chengsong # Date 1667949894 0 # Node ID 17c7611fb0a9bd863abbf4bc856e7dd27c519375 # Parent ae6010c14e49b658bd6943545c8f1857cf38459e chap6 diff -r ae6010c14e49 -r 17c7611fb0a9 ChengsongTanPhdThesis/Chapters/Cubic.tex --- a/ChengsongTanPhdThesis/Chapters/Cubic.tex Mon Nov 07 21:31:07 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex Tue Nov 08 23:24:54 2022 +0000 @@ -39,17 +39,17 @@ to support bounded repetitions ($r^{\{n\}}$). We update our formalisation of the correctness and finiteness properties to -include this new construct. With bounded repetitions -we are able to out-compete other verified lexers such as -Verbatim++ on regular expressions which involve a lot of -repetitions. %We also present the idempotency property proof +include this new construct. +we can out-compete other verified lexers such as +Verbatim++ on bounded regular expressions. +%We also present the idempotency property proof %of $\bsimp$, which leverages the idempotency proof of $\rsimp$. %This reinforces our claim that the fixpoint construction %originally required by Sulzmann and Lu can be removed in $\blexersimp$. -\\ -Last but not least, we present our efforts and challenges we met -in further improving the algorithm by data -structures such as zippers. + +%Last but not least, we present our efforts and challenges we met +%in further improving the algorithm by data +%structures such as zippers. @@ -90,7 +90,11 @@ \noindent A simplification like this actually cannot be omitted, -as without it the size could blow up even with our $\simp$ function: +as without it the size could blow up even with our $\textit{bsimp}$ +function: for the chapter \ref{Finite} example +$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$, +by just setting n to a small number, +we get exponential growth that does not stop before it becomes huge: \begin{figure}[H] \centering \begin{tikzpicture} @@ -119,9 +123,19 @@ \noindent in our $\simp$ function, so that it makes the simplification in \ref{partialDedup} possible. -For example, -can a function like the below one be used? -%TODO: simp' that spills things +Translating the rule into our $\textit{bsimp}$ function simply +involves adding a new clause to the $\textit{bsimp}_{ASEQ}$ function: +\begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ + && $\ldots$\\ + &&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum ( + \map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\ + &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ + \end{tabular} +\end{center} + + Unfortunately, if we introduce them in our @@ -171,7 +185,7 @@ but this time we allow a more general list rewrite: \begin{mathpar}\label{cubicRule} - \inferrule{\vspace{0mm} }{rs_a@[a]@rs_c + \inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c \stackrel{s}{\rightsquigarrow } rs_a@[\textit{prune} \; a \; rs_a]@rs_c } \end{mathpar} @@ -199,41 +213,8 @@ by replacing the $\simp$ function with a stronger version called $\bsimpStrong$ that prunes regular expressions. -We call this lexer $\blexerStrong$. -$\blexerStrong$ is able to drastically reduce the -internal data structure size which could -trigger exponential behaviours in -$\blexersimp$. \begin{figure}[H] -\centering -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - %xlabel={$n$}, - myplotstyle, - xlabel={input length}, - ylabel={size}, - ] -\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - %xlabel={$n$}, - myplotstyle, - xlabel={input length}, - ylabel={size}, - ] -\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; -\end{axis} -\end{tikzpicture}\\ -\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings - of the form $\underbrace{aa..a}_{n}$.} -\end{tabular} -\caption{aaaaaStarStar} \label{fig:aaaaaStarStar} -\end{figure} -\begin{figure}[H] + \begin{lstlisting} def atMostEmpty(r: Rexp) : Boolean = r match { case ZERO => true @@ -397,18 +378,58 @@ \end{lstlisting} \end{figure} \noindent +We call this lexer $\blexerStrong$. +$\blexerStrong$ is able to drastically reduce the +internal data structure size which could +trigger exponential behaviours in +$\blexersimp$. +\begin{figure}[H] +\centering +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + %xlabel={$n$}, + myplotstyle, + xlabel={input length}, + ylabel={size}, + width = 7cm, + height = 5cm, + ] +\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + %xlabel={$n$}, + myplotstyle, + xlabel={input length}, + ylabel={size}, + width = 7cm, + height = 5cm, + ] +\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{2}{l}{} +\end{tabular} +\caption{Runtime for matching + $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings + of the form $\protect\underbrace{aa..a}_{n}$.}\label{fig:aaaaaStarStar} +\end{figure} +\noindent We would like to preserve the correctness like the one we had for $\blexersimp$: \begin{conjecture}\label{cubicConjecture} $\blexerStrong \;r \; s = \blexer\; r\;s$ \end{conjecture} \noindent -We introduce the new rule \ref{cubicRule} -and augment our proofs in chapter \ref{Bitcoded2}. -The idea is to maintain the properties like -$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ etc. +The idea is to maintain key lemmas in +chapter \ref{Bitcoded2} like +$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ +with the new rewriting rule \ref{cubicRule} . -In the next section, +In the next sub-section, we will describe why we believe a cubic bound can be achieved. We give an introduction to the @@ -417,82 +438,155 @@ and then link it with the result of the function $\bdersStrongs$. -\section{Antimirov's partial derivatives} -This suggests a link towrads "partial derivatives" - introduced The idea behind Antimirov's partial derivatives -is to do derivatives in a similar way as suggested by Brzozowski, -but maintain a set of regular expressions instead of a single one: - -%TODO: antimirov proposition 3.1, needs completion - \begin{center} - \begin{tabular}{ccc} - $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ -$\partial_x(\ONE)$ & $=$ & $\phi$ -\end{tabular} +\subsection{Antimirov's partial derivatives} +Partial derivatives were first introduced by +Antimirov \cite{Antimirov95}. +It does derivatives in a similar way as suggested by Brzozowski, +but splits children of alternative regular expressions into +multiple independent terms, causing the output to become a +set of regular expressions: +\begin{center} + \begin{tabular}{lcl} + $\partial_x \; (a \cdot b)$ & + $\dn$ & $\partial_x \; a\cdot b \cup + \partial_x \; b \; \textit{if} \; \; \nullable\; a$\\ + & & $\partial_x \; a\cdot b \quad\quad + \textit{otherwise}$\\ + $\partial_x \; r^*$ & $\dn$ & $\partial_x \; r \cdot r^*$\\ + $\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; + \textit{then} \; + \{ \ONE\} \;\;\textit{else} \; \varnothing$\\ + $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ + $\partial_x(\ONE)$ & $=$ & $\varnothing$\\ + $\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\ + \end{tabular} \end{center} - +\noindent +The $\cdot$ between for example +$\partial_x \; a\cdot b $ +is a shorthand notation for the cartesian product +$\partial_x \; a \times \{ b\}$. +%Each element in the set generated by a partial derivative +%corresponds to a (potentially partial) match +%TODO: define derivatives w.r.t string s Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together -using the alternatives constructor, Antimirov cleverly chose to put them into -a set instead. This breaks the terms in a derivative regular expression up, +using the $\sum$ constructor, Antimirov put them into +a set. This causes maximum de-duplication to happen, allowing us to understand what are the "atomic" components of it. For example, To compute what regular expression $x^*(xx + y)^*$'s derivative against $x$ is made of, one can do a partial derivative of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. -To get all the "atomic" components of a regular expression's possible derivatives, -there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes -whatever character is available at the head of the string inside the language of a -regular expression, and gives back the character and the derivative regular expression -as a pair (which he called "monomial"): - \begin{center} - \begin{tabular}{ccc} - $\lf(\ONE)$ & $=$ & $\phi$\\ -$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ - $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ - $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ -\end{tabular} + +The set of all possible partial derivatives is defined +as the union of derivatives w.r.t all the strings in the universe: +\begin{center} + \begin{tabular}{lcl} + $\textit{PDER}_{UNIV} \; r $ & $\dn $ & $\bigcup_{w \in A^*}\partial_w \; r$ + \end{tabular} \end{center} -%TODO: completion +\noindent -There is a slight difference in the last three clauses compared -with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular -expression $r$ with every element inside $\textit{rset}$ to create a set of -sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates -on a set of monomials (which Antimirov called "linear form") and a regular -expression, and returns a linear form: - \begin{center} - \begin{tabular}{ccc} - $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ - $l \bigodot (\ONE)$ & $=$ & $l$\\ - $\phi \bigodot t$ & $=$ & $\phi$\\ - $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ - $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ - $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ - $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ - $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ -\end{tabular} +Back to our +\begin{center} + $((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$ +\end{center} +example, if we denote this regular expression as $A$, +we have that +\begin{center} +$\textit{PDER}_{UNIV} \; A = +\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ + (\underbrace{a \ldots a}_{\text{j a's}}\cdot +(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot A \}$, \end{center} -%TODO: completion +with exactly $n * (n + 1) / 2$ terms. +This is in line with our speculation that only $n*(n+1)/2$ terms are +needed. We conjecture that $\bsimpStrong$ is also able to achieve this +upper limit in general +\begin{conjecture}\label{bsimpStrongInclusionPder} + Using a suitable transformation $f$, we have + \begin{center} + $\forall s.\; f \; (r \bdersStrong \; s) \subseteq + \textit{PDER}_{UNIV} \; r$ + \end{center} +\end{conjecture} +\noindent +because our \ref{cubicRule} will keep only one copy of each term, +where the function $\textit{prune}$ takes care of maintaining +a set like structure similar to partial derivatives. +It is anticipated we might need to adjust $\textit{prune}$ +slightly to make sure all duplicate terms are eliminated, +which should be doable. - Some degree of simplification is applied when doing $\bigodot$, for example, - $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, - and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and - $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, - and so on. - - With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with - an iterative procedure: - \begin{center} - \begin{tabular}{llll} -$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ - & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ - & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ -$\partial_{UNIV}(r)$ & $=$ & $\PD$ & -\end{tabular} -\end{center} +Antimirov had proven that the sum of all the partial derivative +terms' sizes is bounded by the cubic of the size of that regular +expression: +\begin{property}\label{pderBound} + $\llbracket \textit{PDER}_{UNIV} \; r \rrbracket \leq O((\llbracket r \rrbracket)^3)$ +\end{property} +This property was formalised by Urban, and the details are in the PDERIVS.thy file +in our repository. +Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound} +would yield us a cubic bound for our $\blexerStrong$ algorithm: +\begin{conjecture}\label{strongCubic} + $\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$ +\end{conjecture} - $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, +%To get all the "atomic" components of a regular expression's possible derivatives, +%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes +%whatever character is available at the head of the string inside the language of a +%regular expression, and gives back the character and the derivative regular expression +%as a pair (which he called "monomial"): +% \begin{center} +% \begin{tabular}{ccc} +% $\lf(\ONE)$ & $=$ & $\phi$\\ +%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\ +% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ +% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\ +%\end{tabular} +%\end{center} +%%TODO: completion +% +%There is a slight difference in the last three clauses compared +%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular +%expression $r$ with every element inside $\textit{rset}$ to create a set of +%sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates +%on a set of monomials (which Antimirov called "linear form") and a regular +%expression, and returns a linear form: +% \begin{center} +% \begin{tabular}{ccc} +% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\ +% $l \bigodot (\ONE)$ & $=$ & $l$\\ +% $\phi \bigodot t$ & $=$ & $\phi$\\ +% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\ +% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\ +% $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\ +% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\ +% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\ +%\end{tabular} +%\end{center} +%%TODO: completion +% +% Some degree of simplification is applied when doing $\bigodot$, for example, +% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$, +% and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and +% $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$, +% and so on. +% +% With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with +% an iterative procedure: +% \begin{center} +% \begin{tabular}{llll} +%$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\ +% & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\ +% & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\ +%$\partial_{UNIV}(r)$ & $=$ & $\PD$ & +%\end{tabular} +%\end{center} +% +% +% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, @@ -512,11 +606,63 @@ introduced in chapter \ref{Bitcoded2}. \subsection{Augmented Definitions} -There are a number of definitions that needs to be augmented. +There are a number of definitions that need to be augmented. The most notable one would be the POSIX rules for $r^{\{n\}}$: -\begin{mathpar} - \inferrule{v \in vs}{\textit{Stars} place holder} -\end{mathpar} +\begin{center} + \begin{mathpar} + \inferrule{\forall v \in vs_1. \vdash v:r \land + |v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\ + \textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \; + (vs_1 @ vs_2) : r^{\{n\}} } + \end{mathpar} +\end{center} +As Ausaf had pointed out \cite{Ausaf}, +sometimes empty iterations have to be taken to get +a match with exactly $n$ repetitions, +and hence the $vs_2$ part. + +Another important definition would be the size: +\begin{center} + \begin{tabular}{lcl} + $\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ & + $\llbracket r \rrbracket_r + n$\\ + \end{tabular} +\end{center} +\noindent +Arguably we should use $\log \; n$ for the size because +the number of digits increase logarithmically w.r.t $n$. +For simplicity we choose to add the counter directly to the size. + +The derivative w.r.t a bounded regular expression +is given as +\begin{center} + \begin{tabular}{lcl} + $r^{\{n\}} \backslash_r c$ & $\dn$ & + $r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\ + & & $\RZERO \;\quad \quad\quad \quad + \textit{otherwise}$\\ + \end{tabular} +\end{center} +\noindent +For brevity, we sometimes use NTIMES to refer to bounded +regular expressions. +The $\mkeps$ function clause for NTIMES would be +\begin{center} + \begin{tabular}{lcl} + $\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; + (\textit{replicate} \; n\; (\mkeps \; r))$ + \end{tabular} +\end{center} +\noindent +The injection looks like +\begin{center} + \begin{tabular}{lcl} + $\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & + $\dn$ & $\Stars \; + ((\inj \; r \;c \;v ) :: vs)$ + \end{tabular} +\end{center} +\noindent \subsection{Proofs for the Augmented Lexing Algorithm} @@ -758,6 +904,7 @@ For the inductive case, note that if $\cbn \; r$ holds, then $\cbn \; (r\backslash_r c)$ holds. \end{proof} +\noindent In addition, for $\cbn$-shaped regular expressioins, one can flatten them: \begin{lemma}\label{ntimesHfauPushin} @@ -768,37 +915,47 @@ \begin{proof} By an induction on the inductive cases of $\cbn$. \end{proof} - - - +\noindent This time we do not need to define the flattening functions for NTIMES only, -because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already: +because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. \begin{lemma}\label{ntimesHfauInduct} $\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = \map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ \end{lemma} \begin{proof} By a reverse induction on $s$. - The lemma \ref{ntimesDersCbn} is used. + The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. \end{proof} \noindent We have a recursive property for NTIMES with $\nupdate$ similar to that for STAR, and one for $\nupdates $ as well: \begin{lemma}\label{nupdateInduct1} - (i) $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( + \mbox{} + \begin{itemize} + \item + \begin{center} + $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( \opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; c \; r \; Ss)$\\ - (ii) $\textit{concat} \; (\map \; \hflataux{\_}\; + \end{center} + holds. +\item + \begin{center} + $\textit{concat} \; (\map \; \hflataux{\_}\; \map \; (\_\backslash_r x) \; - (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))) = - \map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ holds. + (\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\ + $=$\\ + $\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ + \end{center} + holds. + \end{itemize} \end{lemma} \begin{proof} (i) is by an induction on $Ss$. (ii) is by an induction on $xs$. \end{proof} - +\noindent The $\nString$ predicate is defined for conveniently expressing that there are no empty strings in the $\Some \;(s, n)$ elements generated by $\nupdate$: @@ -810,10 +967,10 @@ \end{tabular} \end{center} \begin{lemma}\label{nupdatesNonempty} - If for all elements $opt$ in $\textit{set} \; Ss$, - $\nString \; opt$ holds, the we have that - for all elements $opt$ in $\textit{set} \; (\nupdates \; s \; r \; Ss)$, - $\nString \; opt$ holds. + If for all elements $o \in \textit{set} \; Ss$, + $\nString \; o$ holds, the we have that + for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, + $\nString \; o'$ holds. \end{lemma} \begin{proof} By an induction on $s$, where $Ss$ is set to vary over all possible values. @@ -866,8 +1023,10 @@ nonempty, $\optermsimp \; r \;o$, $\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed to be equal. + (v) uses \ref{nupdateInduct1}. \end{proof} - +\noindent +Now we are ready to present the closed form for NTIMES: \begin{theorem}\label{ntimesClosedForm} The derivative of $r^{\{n+1\}}$ can be described as an alternative containing a list @@ -879,14 +1038,22 @@ \begin{proof} By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}. \end{proof} - +\noindent +The key observation for bounding this closed form +is that the counter on $r^{\{n\}}$ will +only decrement during derivatives: \begin{lemma}\label{nupdatesNLeqN} For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$, either $o = \None$, or $o = \Some \; (s', m)$ for some string $s'$ and number $m \leq n$. \end{lemma} - - +\noindent +The proof is routine and therefore omitted. +This allows us to say what kind of terms +are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; ( +\nupdates \; s \; r \; [\Some \; ([c], n)]))$: +only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$ +with a small $m$: \begin{lemma}\label{ntimesClosedFormListElemShape} For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( \nupdates \; s \; r \; [\Some \; ([c], n)]))$, @@ -918,353 +1085,98 @@ \nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most $c_N = \textit{card} \; (\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. -This gives us $\llbracket \llbracket r \backslash_{rsimps} \;s \llbracket_r +This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r \leq N * c_N$. \end{proof} +We aim to formalise the correctness and size bound +for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ +and so on, which is still work in progress. +They should more or less follow the same recipe described in this section. +Once we know about how to deal with them recursively using suitable auxiliary +definitions, we are able to routinely establish the proofs. -Assuming we have that -\begin{center} - $\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$. -\end{center} -holds. -The idea of $\starupdate$ and $\starupdates$ -is to update $Ss$ when another -derivative is taken on $\rderssimp{r^*}{s}$ -w.r.t a character $c$ and a string $s'$ -respectively. -Both $\starupdate$ and $\starupdates$ take three arguments as input: -the new character $c$ or string $s$ to take derivative with, -the regular expression -$r$ under the star $r^*$, and the -list of strings $Ss$ for the derivative $r^* \backslash s$ -up until this point -such that -\begin{center} -$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ -\end{center} -is satisfied. - -Functions $\starupdate$ and $\starupdates$ characterise what the -star derivatives will look like once ``straightened out'' into lists. -The helper functions for such operations will be similar to -$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence. -We use similar symbols to denote them, with a $*$ subscript to mark the difference. -\begin{center} - \begin{tabular}{lcl} - $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ - $\hflataux{r}$ & $\dn$ & $[r]$ - \end{tabular} -\end{center} -\begin{center} - \begin{tabular}{lcl} - $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ - $\hflat{r}$ & $\dn$ & $r$ - \end{tabular} -\end{center} -\noindent -These definitions are tailor-made for dealing with alternatives that have -originated from a star's derivatives. -A typical star derivative always has the structure of a balanced binary tree: -\begin{center} - $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + - (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ -\end{center} -All of the nested structures of alternatives -generated from derivatives are binary, and therefore -$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. -$\hflat{\_}$ ``untangles'' like the following: -\[ - \sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; - \stackrel{\hflat{\_}}{\longrightarrow} \; - \RALTS{[r_1, r_2, \ldots, r_n]} -\] -Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, -with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists - and merges each of the element lists to form a flattened list.}: -\begin{lemma}\label{stupdateInduct1} - \mbox - For a list of strings $Ss$, the following hold. - \begin{itemize} - \item - If we do a derivative on the terms - $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), - the result will be the same as if we apply $\starupdate$ to $Ss$. - \begin{center} - \begin{tabular}{c} - $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) - \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; - $\\ - \\ - $=$ \\ - \\ - $\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; - (\starupdate \; x \; r \; Ss)$ - \end{tabular} - \end{center} - \item - $\starupdates$ is ``composable'' w.r.t a derivative. - It piggybacks the character $x$ to the tail of the string - $xs$. - \begin{center} - \begin{tabular}{c} - $\textit{concat} \; (\map \; \hflataux{\_} \; - (\map \; (\_\backslash_r x) \; - (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot - (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ - \\ - $=$\\ - \\ - $\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; - (\starupdates \; (xs @ [x]) \; r \; Ss)$ - \end{tabular} - \end{center} - \end{itemize} -\end{lemma} - -\begin{proof} - Part 1 is by induction on $Ss$. - Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. -\end{proof} - - -Like $\textit{createdBySequence}$, we need -a predicate for ``star-created'' regular expressions: -\begin{center} - \begin{mathpar} - \inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } - - \inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } - \end{mathpar} -\end{center} -\noindent -All regular expressions created by taking derivatives of -$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: -\begin{lemma}\label{starDersCbs} - $\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. -\end{lemma} -\begin{proof} - By a reverse induction on $s$. -\end{proof} -If a regular expression conforms to the shape of a star's derivative, -then we can push an application of $\hflataux{\_}$ inside a derivative of it: -\begin{lemma}\label{hfauPushin} - If $\textit{createdByStar} \; r$ holds, then - \begin{center} - $\hflataux{r \backslash_r c} = \textit{concat} \; ( - \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ - \end{center} - holds. -\end{lemma} -\begin{proof} - By an induction on the inductivev cases of $\textit{createdByStar}$. -\end{proof} -%This is not entirely true for annotated regular expressions: -%%TODO: bsimp bders \neq bderssimp +%The closed form for them looks like: +%%\begin{center} +%% \begin{tabular}{llrclll} +%% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\ +%% $\textit{rsimp}$ & $($ & $ +%% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ +%% & & & & $\textit{nupdates} \;$ & +%% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ +%% & & & & $)$ &\\ +%% & & $)$ & & &\\ +%% & $)$ & & & &\\ +%% \end{tabular} +%%\end{center} %\begin{center} -% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ -%\end{center} -%For bit-codes, the order in which simplification is applied -%might cause a difference in the location they are placed. -%If we want something like -%\begin{center} -% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ +% \begin{tabular}{llrcllrllll} +% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\ +% &&&&$\textit{rsimp}$ & $($ & $ +% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ +% &&&& & & & & $\;\; \textit{nupdates} \;$ & +% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ +% &&&& & & & & $)$ &\\ +% &&&& & & $)$ & & &\\ +% &&&& & $)$ & & & &\\ +% \end{tabular} %\end{center} -%Some "canonicalization" procedure is required, -%which either pushes all the common bitcodes to nodes -%as senior as possible: +%The $\textit{optermsimp}$ function with the argument $r$ +%chooses from two options: $\ZERO$ or +%We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$ +%and $\starupdates$: %\begin{center} -% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ +% \begin{tabular}{lcl} +% $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ +% $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ +% & & $\textit{if} \; +% (\rnullable \; (\rders \; r \; s))$ \\ +% & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( +% \starupdate \; c \; r \; Ss)$ \\ +% & & $\textit{else} \;\; (s @ [c]) :: ( +% \starupdate \; c \; r \; Ss)$ +% \end{tabular} %\end{center} -%or does the reverse. However bitcodes are not of interest if we are talking about -%the $\llbracket r \rrbracket$ size of a regex. -%Therefore for the ease and simplicity of producing a -%proof for a size bound, we are happy to restrict ourselves to -%unannotated regular expressions, and obtain such equalities as -%TODO: rsimp sflat -% The simplification of a flattened out regular expression, provided it comes -%from the derivative of a star, is the same as the one nested. +%\noindent +%As a generalisation from characters to strings, +%$\starupdates$ takes a string instead of a character +%as the first input argument, and is otherwise the same +%as $\starupdate$. +%\begin{center} +% \begin{tabular}{lcl} +% $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ +% $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( +% \starupdate \; c \; r \; Ss)$ +% \end{tabular} +%\end{center} +%\noindent -Now we introduce an inductive property -for $\starupdate$ and $\hflataux{\_}$. -\begin{lemma}\label{starHfauInduct} - If we do derivatives of $r^*$ - with a string that starts with $c$, - then flatten it out, - we obtain a list - of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, - where $sS = \starupdates \; s \; r \; [[c]]$. Namely, - \begin{center} - $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = - \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; - (\starupdates \; s \; r_0 \; [[c]])$ - \end{center} -holds. -\end{lemma} -\begin{proof} - By an induction on $s$, the inductive cases - being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. -\end{proof} -\noindent - -$\hflataux{\_}$ has a similar effect as $\textit{flatten}$: -\begin{lemma}\label{hflatauxGrewrites} - $a :: rs \grewrites \hflataux{a} @ rs$ -\end{lemma} -\begin{proof} - By induction on $a$. $rs$ is set to take arbitrary values. -\end{proof} -It is also not surprising that $\textit{rsimp}$ will cover -the effect of $\hflataux{\_}$: -\begin{lemma}\label{cbsHfauRsimpeq1} - $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ -\end{lemma} - -\begin{proof} - By using the rewriting relation $\rightsquigarrow$ -\end{proof} -And from this we obtain a proof that a star's derivative will be the same -as if it had all its nested alternatives created during deriving being flattened out: -For example, -\begin{lemma}\label{hfauRsimpeq2} - $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ -\end{lemma} -\begin{proof} - By structural induction on $r$, where the induction rules - are these of $\createdByStar{\_}$. - Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. -\end{proof} - - -%Here is a corollary that states the lemma in -%a more intuitive way: -%\begin{corollary} -% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot -% (r^*))\; (\starupdates \; c\; r\; [[c]])$ -%\end{corollary} -%\noindent -%Note that this is also agnostic of the simplification -%function we defined, and is therefore of more general interest. - -Together with the rewriting relation -\begin{lemma}\label{starClosedForm6Hrewrites} - We have the following set of rewriting relations or equalities: - \begin{itemize} - \item - $\textit{rsimp} \; (r^* \backslash_r (c::s)) - \sequal - \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( - \starupdates \; s \; r \; [ c::[]] ) ) )$ - \item - $r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( ( - \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; - (\starupdates \;s \; r \; [ c::[] ])))))$ - \item - $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) - \sequal - \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; - r^*) \; Ss) )$ - \item - $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss - \scfrewrites - \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ - \item - $( ( \sum ( ( \map \ (\lambda s. \;\; - (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; - s \; r \; [ c::[] ])))))$\\ - $\sequal$\\ - $( ( \sum ( ( \map \ (\lambda s. \;\; - ( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \; - s \; r \; [ c::[] ]))))$\\ - \end{itemize} -\end{lemma} -\begin{proof} - Part 1 leads to part 2. - The rest of them are routine. -\end{proof} -\noindent -Next the closed form for star regular expressions can be derived: -\begin{theorem}\label{starClosedForm} - $\rderssimp{r^*}{c::s} = - \rsimp{ - (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; - (\starupdates \; s\; r \; [[c]]) - ) - ) - } - $ -\end{theorem} -\begin{proof} - By an induction on $s$. - The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} - and \ref{hfauRsimpeq2} - are used. - In \ref{starClosedForm6Hrewrites}, the equalities are - used to link the LHS and RHS. -\end{proof} - - - - -The closed form for them looks like: -%\begin{center} -% \begin{tabular}{llrclll} -% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\ -% $\textit{rsimp}$ & $($ & $ -% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ -% & & & & $\textit{nupdates} \;$ & -% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ -% & & & & $)$ &\\ -% & & $)$ & & &\\ -% & $)$ & & & &\\ -% \end{tabular} -%\end{center} -\begin{center} - \begin{tabular}{llrcllrllll} - $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\ - &&&&$\textit{rsimp}$ & $($ & $ - \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\ - &&&& & & & & $\;\; \textit{nupdates} \;$ & - $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\ - &&&& & & & & $)$ &\\ - &&&& & & $)$ & & &\\ - &&&& & $)$ & & & &\\ - \end{tabular} -\end{center} -The $\textit{optermsimp}$ function with the argument $r$ -chooses from two options: $\ZERO$ or -We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$ -and $\starupdates$: -\begin{center} - \begin{tabular}{lcl} - $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ - $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ - & & $\textit{if} \; - (\rnullable \; (\rders \; r \; s))$ \\ - & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( - \starupdate \; c \; r \; Ss)$ \\ - & & $\textit{else} \;\; (s @ [c]) :: ( - \starupdate \; c \; r \; Ss)$ - \end{tabular} -\end{center} -\noindent -As a generalisation from characters to strings, -$\starupdates$ takes a string instead of a character -as the first input argument, and is otherwise the same -as $\starupdate$. -\begin{center} - \begin{tabular}{lcl} - $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ - $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( - \starupdate \; c \; r \; Ss)$ - \end{tabular} -\end{center} -\noindent - - +%\section{Zippers} +%Zipper is a data structure designed to operate on +%and navigate between local parts of a tree. +%It was first formally described by Huet \cite{HuetZipper}. +%Typical applications of zippers involve text editor buffers +%and proof system databases. +%In our setting, the idea is to compactify the representation +%of derivatives with zippers, thereby making our algorithm faster. +%Some initial results +%We first give a brief introduction to what zippers are, +%and other works +%that apply zippers to derivatives +%When dealing with large trees, it would be a waste to +%traverse the entire tree if +%the operation only +%involves a small fraction of it. +%The idea is to put the focus on that subtree, turning other parts +%of the tree into a context +% +% +%One observation about our derivative-based lexing algorithm is that +%the derivative operation sometimes traverses the entire regular expression +%unnecessarily: %---------------------------------------------------------------------------------------- diff -r ae6010c14e49 -r 17c7611fb0a9 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Mon Nov 07 21:31:07 2022 +0000 +++ b/ChengsongTanPhdThesis/example.bib Tue Nov 08 23:24:54 2022 +0000 @@ -15,6 +15,23 @@ % year = 1956, % month = jun %} +@article{Murugesan2014, + author = {N.~Murugesan and O.~V.~Shanmuga Sundaram}, + title = {{S}ome {P}roperties of {B}rzozowski {D}erivatives of {R}egular {E}xpressions}, + journal = {International Journal of Computer Trends and Technology}, + volume = {13}, + number = {1}, + year = {2014}, + url = {http://arxiv.org/abs/1407.5902}, + pages = {29--33} +} + +@PhdThesis{Ausaf, + author = {F.~Ausaf}, + title = {{V}erified {L}exing and {P}arsing}, + school = {King's College London}, + year = {2018} +} %% POSIX specification------------------------ @InProceedings{Okui10,