--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Thu Jun 23 16:09:40 2022 +0100
@@ -528,7 +528,7 @@
the lexing process, which might not be possible
using $\lexer$ or $\blexer$ themselves.
The first function we shall look at is $\retrieve$.
-\subsection{$\retrieve$}
+\subsection{$\textit{Retrieve}$}
Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
after we finished doing all the derivatives:
\begin{center}
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Thu Jun 23 16:09:40 2022 +0100
@@ -14,92 +14,124 @@
Now we introduce the simplifications, which is why we introduce the
bitcodes in the first place.
-\subsection*{Simplification Rules}
+\section{Simplification for Annotated Regular Expressions}
+The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
+and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
+are scattered around different levels:
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings.
+\begin{center}
+ $(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
+ $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
+\end{center}
+\noindent
+Despite that we have already implemented the simple-minded simplifications
+such as throwing away useless $\ONE$s and $\ZERO$s.
+The simplification rule $r + r \rightarrow $ cannot make a difference either
+since it only removes duplicates on the same level, not something like
+$(r+a)+r \rightarrow r+a$.
+This requires us to break up nested alternatives into lists, for example
+using the flatten operation similar to the one defined for any function language's
+list datatype:
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our simplification function
+ \begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+ (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
+ $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
+ $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
+\end{tabular}
+\end{center}
+
+\noindent
+There is a minor difference though, in that our $\flts$ operation defined by us
+also throws away $\ZERO$s.
+For a flattened list of regular expressions, a de-duplication can be done efficiently:
+
+
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\
+ $\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\
+ & & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\
+ & & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
+ \end{tabular}
+\end{center}
+\noindent
+The reason we define a distinct function under a mapping $f$ is because
+we want to eliminate regular expressions that are the same syntactically,
+but having different bit-codes (more on the reason why we can do this later).
+To achieve this, we call $\erase$ as the function $f$ during the distinction
+operation.
+A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on.
-%Is it $ALTS$ or $ALTS$?}\\
\begin{center}
\begin{tabular}{@{}lcl@{}}
- $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
- &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
- &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
- &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
- &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
- &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
-
- $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\
- &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
- &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
- &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
-
- $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+ $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
+ $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\
+ $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
\noindent
-The simplification does a pattern matching on the regular expression.
+The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression.
When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
+sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$ keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
- \begin{tabular}{@{}lcl@{}}
- $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
- (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
- $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
- $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
-\end{tabular}
-\end{center}
-
+Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
+using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
+ &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\
+ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$
+ \end{tabular}
+\end{center}
\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of natural
+The most involved part is the $\sum$ clause, where we first call $\flts$ on
+the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.
+and then call $\distinctBy$ on that list, the predicate determining whether two
+elements are the same is $\erase \; r_1 = \erase\; r_2$.
+Finally, depending on whether the regular expression list $as'$ has turned into a
+singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+decides whether to keep the current level constructor $\sum$ as it is, and
+removes it when there are less than two elements:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
+ \end{tabular}
+
+\end{center}
+Having defined the $\bsimp$ function,
+we add it as a phase after a derivative is taken,
+so it stays small:
+\begin{center}
+ \begin{tabular}{lcl}
+ $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ \end{tabular}
+\end{center}
+Following previous notation of natural
extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in the [] case?}
+w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in the [] case?}
\begin{center}
\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
+$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
-to obtain an optimised version of the algorithm:
-
+Extracting bit-codes from the derivatives that had been simplified repeatedly after
+each derivative run, the simplified $\blexer$, called $\blexersimp$,
+is defined as
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
@@ -112,16 +144,17 @@
\noindent
This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
+with this simplification our previous $(a + aa)^*$ example's fast growth (over
+$10^5$ nodes at around $20$ input length)
will be reduced to just 17 and stays constant, no matter how long the
input string is.
+We show some graphs to better demonstrate this imrpovement.
-
-\section{ $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
-We
-Unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
-simplification rules (we put the graphs together to show the contrast)
+\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
+For $(a+aa)^*$, it used to grow to over $9000$ nodes with simple-minded simplification
+at only around $15$ input characters:
+\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
@@ -129,7 +162,38 @@
ylabel={derivative size},
width=7cm,
height=4cm,
- legend entries={Lexer with $\bsimp$},
+ legend entries={Lexer with $\textit{bsimp}$},
+ legend pos= south east,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data};
+\end{axis}
+\end{tikzpicture} %\label{fig:BitcodedLexer}
+&
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ ylabel={derivative size},
+ width = 7cm,
+ height = 4cm,
+ legend entries={Lexer without $\textit{bsimp}$},
+ legend pos= north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+And for $(a^*\cdot a^*)^*$, unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current
+simplification rules (we put the graphs together to show the contrast)
+\begin{center}
+\begin{tabular}{ll}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ ylabel={derivative size},
+ width=7cm,
+ height=4cm,
+ legend entries={Lexer with $\textit{bsimp}$},
legend pos= south east,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};
@@ -142,65 +206,349 @@
ylabel={derivative size},
width = 7cm,
height = 4cm,
- legend entries={Lexer without $\bsimp$},
+ legend entries={Lexer without $\textit{bsimp}$},
legend pos= north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
+\end{center}
+%----------------------------------------------------------------------------------------
+% SECTION rewrite relation
+%----------------------------------------------------------------------------------------
+\section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
+The overall idea for the correctness
+\begin{conjecture}
+ $\blexersimp \; r \; s = \blexer \; r\; s$
+\end{conjecture}
+is that the $\textit{bsimp}$ will not change the regular expressions so much that
+it becomes impossible to extract the $\POSIX$ values.
+To capture this "similarity" between unsimplified regular expressions and simplified ones,
+we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$:
+\begin{center}
+\begin{mathpar}
+ \inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
+
+ \inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}
+
+ \inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\
+
+
+
+ \inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}
+
+ \inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\
+
+ \inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\}
+
+ \inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\}
+
+ \inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\}
+
+ \inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []}
+
+ \inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow}
+
+ \inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\}
+
+ \inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}
+
+ \inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) }
+
+ \inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
+
+\end{mathpar}
+\end{center}
+These "rewrite" steps define the atomic simplifications we could impose on regular expressions
+under our simplification algorithm.
+For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting
+a list of regular exression to another list.
+The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter,
+which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$.
+
+Usually more than one steps are taking place during the simplification of a regular expression,
+therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
+relations:
+
+\begin{center}
+\begin{mathpar}
+ \inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\}
+ \inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\}
+ \inferrule{\\r_1 \stackrel{*}{\rightsquigarrow} r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\}
+ \inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3}
+\end{mathpar}
+\end{center}
+Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly
+three properties about how these relations connect to $\blexersimp$:
+\begin{itemize}
+ \item
+ $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by
+ $\stackrel{*}{\rightsquigarrow}$ and nothing else.
+ \item
+ $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$.
+ The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative.
+ \item
+ $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another
+ expression in finitely many atomic simplification steps, then these two regular expressions
+ will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$.
+
+\end{itemize}
+\section{Three Important properties}
+These properties would work together towards the correctness theorem.
+We start proving each of these lemmas below.
+\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
+The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$
+and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and
+$\stackrel{s*}{\rightsquigarrow}$.
+\begin{lemma}
+ $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
+\end{proof}
+\begin{lemma}
+ $r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow} _{bs} \sum r' :: rs$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+\end{proof}
+\noindent
+Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a list:
+\begin{lemma}
+ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
+\end{lemma}
+\begin{proof}
+ By induction on the list $rs$.
+\end{proof}
+
+\begin{lemma}
+ $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{s*}{\rightsquigarrow}$.
+\end{proof}
+
+\noindent
+The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$:
+\begin{lemma}\label{ssgqTossgs}
+ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{s}{\rightsquigarrow}$.
+\end{proof}
+\begin{lemma}
+ $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}.
+\end{proof}
+Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$:
+\begin{lemma}\label{singleton}
+ $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$
+\end{lemma}
+\begin{proof}
+ By rule induction of $ \stackrel{*}{\rightsquigarrow} $.
+\end{proof}
+\begin{lemma}
+ $rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies
+ r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
+\end{lemma}
+\begin{proof}
+ By using \ref{singleton}.
+\end{proof}
+Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and
+$\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$.
+The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate
+elements in $rs_2$, as well as those that have appeared in $rs_1$:
+\begin{lemma}
+ $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\; \; (\map\;\; \rerase{\_}\; \; rs_1)))$
+\end{lemma}
+\begin{proof}
+ By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.
+\end{proof}
+The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$:
+\begin{lemma}\label{dBPreserves}
+ $rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$
+\end{lemma}
-%----------------------------------------------------------------------------------------
-% SECTION common identities
-%----------------------------------------------------------------------------------------
-\section{Common Identities In Simplification-Related Functions}
-Need to prove these before starting on the big correctness proof.
+The flatten function $\flts$ works within the $\rightsquigarrow$ relation:
+\begin{lemma}\label{fltsPreserves}
+ $rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$
+\end{lemma}
+
+The rewriting in many steps property is composible in terms of regular expression constructors:
+\begin{lemma}
+ $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_2 \cdot r_3 \quad $ and
+$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$
+\end{lemma}
-%-----------------------------------
-% SUBSECTION
-%-----------------------------------
-\subsection{Idempotency of $\simp$}
-
-\begin{equation}
- \simp \;r = \simp\; \simp \; r
-\end{equation}
-This property means we do not have to repeatedly
-apply simplification in each step, which justifies
-our definition of $\blexersimp$.
-It will also be useful in future proofs where properties such as
-closed forms are needed.
-The proof is by structural induction on $r$.
+The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$:
+\begin{lemma}
+ $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and
+ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$
+\end{lemma}
+\begin{proof}
+ By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and
+ $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$.
+\end{proof}
+\noindent
+If we could rewrite a regular expression in many steps to $\ZERO$, then
+we could also rewrite any sequence containing it to $\ZERO$:
+\begin{lemma}
+ $r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$
+\end{lemma}
+\begin{lemma}
+ $\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$
+\end{lemma}
+\noindent
+The function $\bsimpalts$ preserves rewritability:
+\begin{lemma}\label{bsimpaltsPreserves}
+ $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$
+\end{lemma}
+\noindent
+Before we give out the next lemmas, we define a predicate for a list of regular expressions
+having at least one nullable regular expressions:
+\begin{definition}
+ $\textit{bnullables} \; rs \dn \exists r \in rs. \bnullable r$
+\end{definition}
+The rewriting relation $\rightsquigarrow$ preserves nullability:
+\begin{lemma}
+ $r_1 \rightsquigarrow r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ and
+ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$.
+\end{proof}
+So does the many steps rewriting:
+\begin{lemma}\label{rewritesBnullable}
+ $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $.
+\end{proof}
+\noindent
+And if both regular expressions in a rewriting relation are nullable, then they
+produce the same bit-codes:
-%-----------------------------------
-% SUBSECTION
-%-----------------------------------
-\subsection{Syntactic Equivalence Under $\simp$}
-We prove that minor differences can be annhilated
-by $\simp$.
-For example,
+\begin{lemma}\label{rewriteBmkepsAux}
+ $r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 =
+ \bmkeps \; r_2)$ and
+ $rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies
+ \bmkepss \; rs_1 = \bmkepss \; rs2)$
+\end{lemma}
+\noindent
+The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that
+is $bnullable$:
\begin{center}
-$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
- \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
+ \begin{tabular}{lcl}
+ $\bmkepss \; [] $ & $\dn$ & $[]$\\
+ $\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$
+ \end{tabular}
\end{center}
-
-
-
-
-
+\noindent
+And now we are ready to prove the key property that if you
+have two regular expressions, one rewritable in many steps to the other,
+and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$:
+\begin{lemma}
+ $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$.
+\end{proof}
+\noindent
+the other property is also ready:
+\begin{lemma}
+ $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+\end{lemma}
+\begin{proof}
+ By an induction on $r$.
+The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\
+ $rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\
+ $\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\
+ $\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\
+ Then we could do the following regular expresssion many steps rewrite:\\
+ $ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$
+ \\
+
+\end{proof}
+\section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
+The first thing we prove is that if we could rewrite in one step, then after derivative
+we could rewrite in many steps:
+\begin{lemma}\label{rewriteBder}
+ $r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ and
+ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$
+\end{lemma}
+\begin{proof}
+ By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.
+\end{proof}
+Now we can prove that once we could rewrite from one expression to another in many steps,
+then after a derivative on both sides we could still rewrite one to another in many steps:
+\begin{lemma}
+ $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$
+\end{lemma}
+\begin{proof}
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}.
+\end{proof}
+\noindent
+This can be extended and combined with the previous two important properties
+so that a regular expression's successivve derivatives can be rewritten in many steps
+to its simplified counterpart:
+\begin{lemma}\label{bderBderssimp}
+ $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $
+\end{lemma}
+\subsection{Main Theorem}
+Now with \ref{bdersBderssimp} we are ready for the main theorem.
+To link $\blexersimp$ and $\blexer$,
+we first say that they give out the same bits, if the lexing result is a match:
+\begin{lemma}
+ $\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
+\end{lemma}
+\noindent
+Now that they give out the same bits, we know that they give the same value after decoding,
+which we know is correct value as $\blexer$ is correct:
+\begin{theorem}
+ $\blexer \; r \; s = \blexersimp{r}{s}$
+\end{theorem}
+\noindent
-
-%----------------------------------------------------------------------------------------
-% SECTION corretness proof
-%----------------------------------------------------------------------------------------
-\section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification}
+\subsection{Comments on the Proof Techniques Used}
The non-trivial part of proving the correctness of the algorithm with simplification
compared with not having simplification is that we can no longer use the argument
in \cref{flex_retrieve}.
-The function \retrieve needs the structure of the annotated regular expression to
+The function \retrieve needs the cumbersome structure of the (umsimplified)
+annotated regular expression to
agree with the structure of the value, but simplification will always mess with the
-structure:
-%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a))
\ No newline at end of file
+structure.
+
+We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$,
+but this turns out to be not true, A counterexample of this being
+\[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa
+\]
+
+Then we would have $\bsimp{a \backslash s}$ being
+$_{[]}(_{ZZ}\ONE + _{ZS}c ) $
+whereas $\bsimp{\bderssimp{a}{s}}$ would be
+$_{Z}(_{Z} \ONE + _{S} c)$.
+Unfortunately if we apply $\textit{bsimp}$ at different
+stages we will always have this discrepancy, due to
+whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$
+is taken at some points will be entirely dependant on when the simplification
+take place whether there is a larger alternative structure surrounding the
+alternative being simplified.
+The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows
+us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$
+are taken, but simply say that they can be taken to make two similar
+regular expressions equal, and can be done after interleaving derivatives
+and simplifications.
+
+
+Having correctness property is good. But we would also like the lexer to be efficient in
+some sense, for exampe, not grinding to a halt at certain cases.
+In the next chapter we shall prove that for a given $r$, the internal derivative size is always
+finitely bounded by a constant.
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Jun 23 16:09:40 2022 +0100
@@ -10,6 +10,17 @@
In this chapter we give a guarantee in terms of time complexity:
given a regular expression $r$, for any string $s$
our algorithm's internal data structure is finitely bounded.
+Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
+data structure used in our $\blexer$)
+is bounded by a constant $N_r$, where $N$ only depends on the regular expression
+$r$, not the string $s$.
+When doing a time complexity analysis of any
+lexer/parser based on Brzozowski derivatives, one needs to take into account that
+not only the "derivative steps".
+
+%TODO: get a grpah for internal data structure growing arbitrary large
+
+
To obtain such a proof, we need to
\begin{itemize}
\item
@@ -34,32 +45,64 @@
\begin{center}
\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+ $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
+
+\noindent
+Similarly there is a size function for plain regular expressions:
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
+ $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
+$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
\end{tabular}
\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression
-in their un-annotated form:
+\noindent
+The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
+is to get an equivalent form
+of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$
+is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
+We notice that while it is not so clear how to obtain
+a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
+not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$
+in the order as our lexer will result in the bit-codes dispensed differently),
+it is possible to get an slightly different representation of the unlifted versions:
+$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
+This suggest setting the bounding function $f(a, s)$ as
+$\llbracket (a \backslash s)_\downarrow \rrbracket_p$, the plain size
+of the erased annotated regular expression.
+This requires the the regular expression accompanied by bitcodes
+to have the same size as its plain counterpart after erasure:
\begin{center}
-$\asize(a) = \size(\erase(a))$.
+ $\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$.
\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
+\noindent
+But there is a minor nuisance:
+the erase function unavoidbly messes with the structure of the regular expression,
+due to the discrepancy between annotated regular expression's $\sum$ constructor
+and plain regular expression's $+$ constructor having different arity.
\begin{center}
\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
+$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
\end{tabular}
\end{center}
-(TODO: complete this definition with singleton r in alts)
-
+\noindent
An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
+ is turned into a $\ZERO$ during the
$\erase$ function, thereby changing the size and structure of the regex.
+Therefore the equality in question does not hold.
These will likely be fixable if we really want to use plain $\rexp$s for dealing
with size, but we choose a more straightforward (or stupid) method by
defining a new datatype that is similar to plain $\rexp$s but can take
@@ -75,56 +118,105 @@
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
but keep everything else intact.
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
We denote the operation of erasing the bits and turning an annotated regular expression
into an $\rrexp{}$ as $\rerase{}$.
\begin{center}
\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+ $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
\end{tabular}
\end{center}
-%TODO: FINISH definition of rerase
+\noindent
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This does not hold for plain $\rexp$s.
+
Similarly we could define the derivative and simplification on
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
except that now they can operate on alternatives taking multiple arguments.
\begin{center}
\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+ $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
(other clauses omitted)
+With the new $\rrexp$ datatype in place, one can define its size function,
+which precisely mirrors that of the annotated regular expressions:
+\end{tabular}
+\end{center}
+\noindent
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
\end{tabular}
\end{center}
+\noindent
-Now that $\rrexp$s do not have bitcodes on them, we can do the
-duplicate removal without an equivalence relation:
+\subsection{Lexing Related Functions for $\rrexp$}
+Everything else for $\rrexp$ will be precisely the same for annotated expressions,
+except that they do not involve rectifying and augmenting bit-encoded tokenization information.
+As expected, most functions are simpler, such as the derivative:
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
+ $(\ONE)\,\backslash_r c$ & $\dn$ &
+ $\textit{if}\;c=d\; \;\textit{then}\;
+ \ONE\;\textit{else}\;\ZERO$\\
+ $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
+ $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
+ $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
+ $\textit{if}\;\textit{rnullable}\,r_1$\\
+ & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
+ & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
+ & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
+ $(r^*)\,\backslash_r c$ & $\dn$ &
+ $( r\,\backslash_r c)\cdot
+ (_{[]}r^*))$
+\end{tabular}
+\end{center}
+\noindent
+The simplification function is simplified without annotation causing superficial differences.
+Duplicate removal without an equivalence relation:
\begin{center}
\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+ $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
+$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
& & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
+\noindent
+With $\textit{rdistinct}$ one can chain together all the other modules
+of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
+and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
+We omit these functions, as they are routine. Please refer to the formalisation
+(in file BasicIdentities.thy) for the exact definition.
+With $\rrexp$ the size caclulation of annotated regular expressions'
+simplification and derivatives can be done by the size of their unlifted
+counterpart with the unlifted version of simplification and derivatives applied.
\begin{lemma}
-$\rsize{\rerase a} = \asize a$
+ \begin{itemize}
+ \item
+$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
+\item
+$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
+\end{itemize}
\end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
+\noindent
+In the following content, we will focus on $\rrexp$'s size bound.
+We will piece together this bound and show the same bound for annotated regular
+expressions in the end.
Unless stated otherwise in this chapter all $\textit{rexp}$s without
bitcodes are seen as $\rrexp$s.
We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
@@ -136,7 +228,135 @@
%-----------------------------------
% SECTION ?
%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
+ \section{Roadmap to a Finiteness Proof}
+ Now that we have defined the $\rrexp$ datatype, and proven that its size changes
+ w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,
+ we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$.
+The way we do it is by two steps:
+\begin{itemize}
+ \item
+ First, we rewrite $r\backslash s$ into something else that is easier
+ to bound. This step is especially important for the inductive case
+ $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
+ but after simplification they will always be equal or smaller to a form consisting of an alternative
+ list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
+ The functions $f$ and $g$ are usually from $\flts$ $\distinctBy$ and other appearing in $\bsimp{_}$.
+ This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
+ right hand side the "closed form" of $r\backslash s$.
+ \item
+ Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
+ by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
+ reduce the size of a regular expression, not adding to it. The closed form $f\; (g\; (\sum rs)) $ is controlled
+ by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
+\end{itemize}
+
+\section{Closed Forms}
+
+\subsection{Basic Properties needed for Closed Forms}
+
+\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
+The $\textit{rdistinct}$ function, as its name suggests, will
+remove duplicates according to the accumulator
+and leave only one of each different element in a list:
+\begin{lemma}
+ The function $\textit{rdistinct}$ satisfies the following
+ properties:
+ \begin{itemize}
+ \item
+ If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
+ \item
+ If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+ All the elements in $rs'$ are unique.
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ The first part is by an induction on $rs$.
+ The second part can be proven by using the
+ induction rules of $\rdistinct{\_}{\_}$.
+
+\end{proof}
+
+\noindent
+$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
+that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
+list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
+on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
+without the prepending of $rs$:
+\begin{lemma}
+ If $rs \subseteq rset$, then
+ $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
+\end{lemma}
+\begin{proof}
+ By induction on $rs$.
+\end{proof}
+\noindent
+On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
+then expanding the accumulator to include that element will not cause the output list to change:
+\begin{lemma}
+ The accumulator can be augmented to include elements not appearing in the input list,
+ and the output will not change.
+ \begin{itemize}
+ \item
+ If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
+ \item
+ Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
+ \[ \rdistinct{rs}{\varnothing} = rs \]
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ The first half is by induction on $rs$. The second half is a corollary of the first.
+\end{proof}
+\noindent
+The next property gives the condition for
+when $\rdistinct{\_}{\_}$ becomes an identical mapping
+for any prefix of an input list, in other words, when can
+we ``push out" the arguments of $\rdistinct{\_}{\_}$:
+\begin{lemma}
+ If $rs_1$'s elements are all unique, and not appearing in the accumulator $acc$,
+ then it can be taken out and left untouched in the output:
+ \[\textit{rdistinct}\; rs_1 @ rsa\;\, acc
+ = rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1)\]
+\end{lemma}
+
+\begin{proof}
+By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
+\end{proof}
+\subsubsection{The properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$}
+We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator.
+These will be helpful in later closed form proofs, when
+we want to transform the ways in which multiple functions involving
+those are composed together
+in interleaving derivative and simplification steps.
+
+When the function $\textit{Rflts}$
+is applied to the concatenation of two lists, the output can be calculated by first applying the
+functions on two lists separately, and then concatenating them together.
+\begin{lemma}
+ The function $\rflts$ has the below properties:\\
+ \begin{itemize}
+ \item
+ $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
+\item
+ If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
+ \end{itemize}
+\end{lemma}
+\noindent
+\begin{proof}
+ By induction on $rs_1$.
+\end{proof}
+
+
+
+\subsection{A Closed Form for the Sequence Regular Expression}
+\begin{quote}\it
+ Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+ \begin{center}
+ $ \rderssimp{r_1 \cdot r_2}{s} =
+ \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+ \end{center}
+\end{quote}
+\noindent
+
Before we get to the proof that says the intermediate result of our lexer will
remain finitely bounded, which is an important efficiency/liveness guarantee,
we shall first develop a few preparatory properties and definitions to
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Thu Jun 23 16:09:40 2022 +0100
@@ -169,7 +169,7 @@
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
\end{tabular}
\end{center}
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Jun 23 16:09:40 2022 +0100
@@ -19,7 +19,7 @@
%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
-\newcommand{\bderssimp}[2]{#1 \backslash_{bsimp} #2}
+\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
@@ -30,7 +30,7 @@
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
-\newcommand{\rdistinct}[2]{\textit{distinct} \; \textit{#1} \; #2}
+\newcommand{\rdistinct}[2]{\textit{rdistinct} \; \textit{#1} \; #2}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
@@ -38,6 +38,7 @@
\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
\def\bnullable{\textit{bnullable}}
+\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
@@ -60,6 +61,7 @@
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
+\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
@@ -81,7 +83,7 @@
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
-
+\def\distinctBy{\textit{distinctBy}}
\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
@@ -97,9 +99,15 @@
\newcommand\rnullable[1]{\textit{rnullable}(#1)}
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
-\newcommand\rerase[1]{ (#1)\downarrow_r}
+\newcommand\rerase[1]{ (#1)_{\downarrow_r}}
+
\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}
+
+\def\rflts{\textit{rflts}}
+\def\rrewrite{\textit{rrewrite}}
+\def\bsimpalts{\textit{bsimp}_{ALTS}}
+
\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}
@@ -330,7 +338,7 @@
But there are problems with current practical implementations.
First thing is that the running time might blow up.
The second problem is that they might be error-prone on certain
-cases that are very easily spotted by a human.
+very simple cases.
In the next part of the chapter, we will look into reasons why
certain regex engines are running horribly slow on the "catastrophic"
cases and propose a solution that addresses both of these problems
--- a/ChengsongTanPhdThesis/a_aa_star.data Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/a_aa_star.data Thu Jun 23 16:09:40 2022 +0100
@@ -2,15 +2,24 @@
%% generated by the `filecontents' environment
%% from source `main' on 2022/03/16.
%%
-1 20
-5 30
-10 35
-15 35
-20 35
-23 35
-25 35
-26 35
-28 35
-30 35
-31 35
-32 35
+0 20
+1 20
+2 20
+3 28
+4 35
+5 35
+6 35
+7 35
+8 35
+9 35
+10 35
+11 35
+12 35
+13 35
+14 35
+15 35
+16 35
+17 35
+18 35
+19 35
+20 35
--- a/thys2/SizeBound3.thy Tue Jun 14 18:06:33 2022 +0100
+++ b/thys2/SizeBound3.thy Thu Jun 23 16:09:40 2022 +0100
@@ -703,28 +703,6 @@
shows "(\<Union> (RL ` rerase ` (set (distinctBy rs rerase {})))) = \<Union> (RL ` rerase ` (set rs))"
by (metis RL_rerase_dB_acc Un_commute Union_image_empty)
-(*
-lemma L_bsimp_erase:
- shows "L (erase r) = L (erase (bsimp r))"
- apply(induct r)
- apply(simp)
- apply(simp)
- apply(simp)
- apply(auto simp add: Sequ_def)[1]
- apply(subst L_bsimp_ASEQ[symmetric])
- apply(auto simp add: Sequ_def)[1]
- apply(subst (asm) L_bsimp_ASEQ[symmetric])
- apply(auto simp add: Sequ_def)[1]
- apply(simp)
- apply(subst L_bsimp_AALTs[symmetric])
- defer
- apply(simp)
- apply(subst (2)L_erase_AALTs)
- apply(subst L_erase_dB)
- apply(subst L_erase_flts)
- apply (simp add: L_erase_AALTs)
- done
-*)
lemma RL_bsimp_rerase:
shows "RL (rerase r) = RL (rerase (bsimp r))"
@@ -964,6 +942,8 @@
apply(simp)
done
+
+
lemma ss6_stronger:
shows "rs1 s\<leadsto>* distinctBy rs1 rerase {}"
by (metis append_Nil list.set(1) list.simps(8) ss6_stronger_aux)
@@ -1523,7 +1503,6 @@
prefer 2
apply(subgoal_tac "distinct (map rerase (v # vb # vc)) = True")
prefer 2
- sledgehammer
apply blast
prefer 2
apply force
@@ -1876,7 +1855,7 @@
prefer 2
apply (metis Cons_eq_map_conv SizeBound3.nn1b SizeBound3.nn1c ex_map_conv)
using dB_keeps_property apply presburger
- sledgehammer
+
using dB_does_more_job apply presburger
apply(subgoal_tac " Ball (set ( (flts (bsimp a # map bsimp list)) )) agood")
@@ -2036,203 +2015,6 @@
apply(auto)[1]
done
-lemma test:
- assumes "agood r"
- shows "bsimp r = r"
- using assms
- apply(induct r taking: "asize" rule: measure_induct)
- apply(erule agood.elims)
- apply(simp_all)
- apply(subst k0)
- apply(subst (2) k0)
- apply(subst flts_qq)
- apply(auto)[1]
- apply(auto)[1]
- oops
-
-
-
-lemma bsimp_idem:
- shows "bsimp (bsimp r) = bsimp r"
- using test good1
- oops
-
-(*
-lemma erase_preimage1:
- assumes "anonalt r"
- shows "erase r = ONE \<Longrightarrow> \<exists>bs. r = AONE bs"
- apply(case_tac r)
- apply simp+
- using anonalt.simps(1) assms apply presburger
- by fastforce
-
-lemma erase_preimage_char:
- assumes "anonalt r"
- shows "erase r = CH c \<Longrightarrow> \<exists>bs. r = ACHAR bs c"
- apply(case_tac r)
- apply simp+
- using assms apply fastforce
- by simp
-
-lemma erase_preimage_seq:
- assumes "anonalt r"
- shows "erase r = SEQ r1 r2 \<Longrightarrow> \<exists>bs a1 a2. r = ASEQ bs a1 a2 \<and> erase a1 = r1 \<and> erase a2 = r2"
- apply(case_tac r)
- apply simp+
- using assms apply fastforce
- by simp
-
-lemma rerase_preimage_seq:
- assumes "anonalt r"
- shows "rerase r = RSEQ r1 r2 \<Longrightarrow> \<exists>bs a1 a2. r = ASEQ bs a1 a2 \<and> rerase a1 = r1 \<and> rerase a2 = r2 "
- using rerase_preimage4 by presburger
-
-lemma seq_recursive_equality:
- shows "\<lbrakk>r1 = a1; r2 = a2\<rbrakk> \<Longrightarrow> SEQ r1 r2 = SEQ a1 a2"
- by meson
-
-lemma almost_identical_image:
- assumes "agood r" "\<forall>r \<in> rset. agood r"
- shows "erase r \<in> erase ` rset \<Longrightarrow> \<exists>r' \<in> rset. erase r' = erase r"
- by auto
-
-lemma goodalts_never_change:
- assumes "r = AALTs bs rs" "agood r"
- shows "\<exists>r1 r2. erase r = ALT r1 r2"
- by (metis agood.simps(4) agood.simps(5) assms(1) assms(2) bmkepss.cases erase.simps(6))
-
-
-fun shape_preserving :: "arexp \<Rightarrow> bool" where
- "shape_preserving AZERO = True"
-| "shape_preserving (AONE bs) = True"
-| "shape_preserving (ACHAR bs c) = True"
-| "shape_preserving (AALTs bs []) = False"
-| "shape_preserving (AALTs bs [a]) = False"
-| "shape_preserving (AALTs bs (a1 # a2 # rs)) = (\<forall>r \<in> set (a1 # a2 # rs). shape_preserving r)"
-| "shape_preserving (ASEQ bs r1 r2) = (shape_preserving r1 \<and> shape_preserving r2)"
-| "shape_preserving (ASTAR bs r) = shape_preserving r"
-
-
-lemma good_shape_preserving:
- assumes "\<nexists>bs r0. r = ASTAR bs r0"
- shows "agood r \<Longrightarrow> shape_preserving r"
- using assms
-
- apply(induct r)
- prefer 6
-
- apply blast
- apply simp
-
- oops
-
-
-
-
-
-lemma shadow_arexp_rerase_erase:
- shows "\<lbrakk>agood r; agood r'; erase r = erase r'\<rbrakk> \<Longrightarrow> rerase r = rerase r'"
- apply(induct r )
- apply simp
- apply(induct r')
- apply simp+
- apply (metis goodalts_never_change rexp.distinct(15))
- apply simp+
- apply (metis anonalt.elims(3) erase_preimage_char goodalts_never_change rerase.simps(3) rexp.distinct(21))
- apply(induct r')
- apply simp
- apply simp
- apply simp
- apply(subgoal_tac "agood r1")
- prefer 2
- apply (metis SizeBound3.good_SEQ agood.simps(10) agood.simps(11) agood.simps(2) agood.simps(3) agood.simps(33) agood.simps(7) bsimp.cases)
- apply(subgoal_tac "agood r2")
- apply(subgoal_tac "agood r'1")
- apply(subgoal_tac "agood r'2")
- apply(subgoal_tac "rerase r'1 = rerase r1")
- apply(subgoal_tac "rerase r'2 = rerase r2")
-
- using rerase.simps(5) apply presburger
- oops
-
-
-
-lemma rerase_erase_good:
- assumes "agood r" "\<forall>r \<in> rset. agood r"
- shows "erase r \<in> erase ` rset \<Longrightarrow> rerase r \<in> rerase ` rset"
- using assms
- apply(case_tac r)
- apply simp+
- oops
-
-lemma rerase_erase_both:
- assumes "\<forall>r \<in> rset. agood r" "agood r"
- shows "(rerase r \<in> (rerase ` rset)) \<longleftrightarrow> (erase r \<in> (erase ` rset))"
- using assms
- apply(induct r )
- apply force
- apply simp
- apply(case_tac "RONE \<in> rerase ` rset")
- apply(subgoal_tac "\<exists>bs. (AONE bs) \<in> rset")
- apply (metis erase.simps(2) rev_image_eqI)
- apply (metis image_iff rerase_preimage2)
- apply(subgoal_tac "\<nexists>bs. (AONE bs) \<in> rset")
- apply(subgoal_tac "ONE \<notin> erase ` rset")
-
- apply blast
- sledgehammer
- apply (metis erase_preimage1 image_iff)
- apply (metis rerase.simps(2) rev_image_eqI)
- apply (smt (verit, best) erase.simps(3) erase_preimage_char image_iff rerase.simps(3) rerase_preimage3)
-(* apply simp
- apply(subgoal_tac "(RSEQ (rerase r1) (rerase r2) \<in> rerase ` rset) \<Longrightarrow> (SEQ (erase r1) (erase r2) \<in> erase ` rset)")
- prefer 2
- apply(subgoal_tac "\<exists>bs a1 a2. (ASEQ bs a1 a2) \<in> rset \<and> rerase a1 = rerase r1 \<and> rerase a2 = rerase r2")
- prefer 2
- apply (metis (full_types) image_iff rerase_preimage4)
- apply(erule exE)+
- apply(subgoal_tac "erase (ASEQ bs a1 a2) \<in> (erase ` rset) ")
- apply(subgoal_tac "SEQ (erase a1) (erase a2) \<in> (erase ` rset)")
- apply(subgoal_tac "SEQ (erase a1) (erase a2) = SEQ (erase r1) (erase r2)")
- apply metis
- apply(erule conjE)+*)
- apply(drule_tac x = "rset" in meta_spec)+
-
-
-
-
- oops
-
-
-lemma rerase_pushin1_general:
- assumes "\<forall>r \<in> set rs. agood r"
- shows "map rerase (distinctBy rs erase (erase ` rset)) = rdistinct (map rerase rs) (rerase ` rset)"
- apply(induct rs arbitrary: rset)
- apply simp+
- apply(case_tac "erase a \<in> erase ` rset")
- apply simp
-
-
-
- oops
-
-lemma rerase_erase:
- assumes "\<forall>r \<in> set as. agood r \<and> anonalt r"
- shows "rdistinct (map rerase as) (rerase ` rset) = map rerase (distinctBy as erase (erase ` rset))"
- using assms
- apply(induct as)
- apply simp+
-
- sorry
-
-
-lemma rerase_pushin1:
- assumes "\<forall>r \<in> set rs. anonalt r \<and> agood r"
- shows "map rerase (distinctBy rs erase {}) = rdistinct (map rerase rs) {}"
- apply(insert rerase_erase)
- by (metis assms image_empty)
-
-*)
@@ -2337,4 +2119,7 @@
apply(insert aders_simp_finiteness)
by (simp add: rders_simp_bounded)
+unused_thms
+
+
end
--- a/thys2/blexer2.sc Tue Jun 14 18:06:33 2022 +0100
+++ b/thys2/blexer2.sc Thu Jun 23 16:09:40 2022 +0100
@@ -617,7 +617,7 @@
case rs => AALTS(bs1, rs)
}
}
- case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
+ //(erase(r0))) => AONE(bs)
case r => r
}
}
@@ -781,7 +781,7 @@
// fun pAKC_aux :: "arexp list ⇒ arexp ⇒ rexp ⇒ arexp"
// where
-// "pAKC_aux rsa r ctx = (if (L (SEQ (erase r) ( ctx) )) ⊆ (L (erase (AALTs [] rsa))) then AZERO else
+// "pAKC_aux rsa r ctx = (if (regConcat (SEQ (erase r) ( ctx) )) ⊆ (regConcat (erase (AALTs [] rsa))) then AZERO else
// case r of (ASEQ bs r1 r2) ⇒
// bsimp_ASEQ bs (pAKC_aux rsa r1 (SEQ (erase r2) ( ctx) )) r2 |
// (AALTs bs rs) ⇒
@@ -796,13 +796,13 @@
//the "fake" Language interpretation: just concatenates!
-def L(acc: Rexp, rs: List[Rexp]) : Rexp = rs match {
+def regConcat(acc: Rexp, rs: List[Rexp]) : Rexp = rs match {
case Nil => acc
case r :: rs1 =>
// if(acc == ONE)
- // L(r, rs1)
+ // regConcat(r, rs1)
// else
- L(SEQ(acc, r), rs1)
+ regConcat(SEQ(acc, r), rs1)
}
def rprint(r: Rexp) : Unit = println(shortRexpOutput(r))
@@ -819,9 +819,9 @@
// println("ctx---------")
// rsprint(ctx)
// println("ctx---------end")
- // rsprint(breakIntoTerms(L(erase(r), ctx)).map(oneSimp))
+ // rsprint(breakIntoTerms(regConcat(erase(r), ctx)).map(oneSimp))
- if (breakIntoTerms(L(erase(r), ctx)).map(oneSimp).forall(acc.contains)) {//acc.flatMap(breakIntoTerms
+ if (breakIntoTerms(regConcat(erase(r), ctx)).map(oneSimp).forall(acc.contains)) {//acc.flatMap(breakIntoTerms
AZERO
}
else{
@@ -900,47 +900,70 @@
}
+def isOne1(r: Rexp) : Boolean = r match {
+ case ONE => true
+ case SEQ(r1, r2) => false//isOne1(r1) && isOne1(r2)
+ case ALTS(r1, r2) => false//(isOne1(r1) || isOne1(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne)
+ case STAR(r0) => false//atMostEmpty(r0)
+ case CHAR(c) => false
+ case ZERO => false
+
+}
+
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
- case SEQ(r1, r2) => if(isOne(r1))
+ case SEQ(r1, r2) => if(isOne1(r1))
turnIntoTerms(r2)
else
- turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
+ turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2))
case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
case ZERO => Nil
case _ => r :: Nil
}
-def attachCtx(r: ARexp, ctx: List[Rexp]) : Set[Rexp] = {
- val res = turnIntoTerms((L(erase(r), ctx))).map(oneSimp)
- res.toSet
+def furtherSEQ(r11: Rexp, r2: Rexp) : List[Rexp] = {
+ if(r11 == ONE)
+ turnIntoTerms(r2)
+ else
+ SEQ(r11, r2) :: Nil
}
-def attachCtxcc(r: Rexp, ctx: List[Rexp]) : Set[Rexp] = {
- val res = turnIntoTerms((L(r, ctx))).map(oneSimp)
- res.toSet
+def attachCtx(r: ARexp, ctx: List[Rexp]) : Set[Rexp] = {
+ turnIntoTerms((regConcat(erase(r), ctx)))
+ .toSet
}
-def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C, subseteqPred: (C, C) => Boolean) : Boolean = {
+def attachCtxcc(r: Rexp, ctx: List[Rexp]) : Set[Rexp] =
+ turnIntoTerms(regConcat(r, ctx)).toSet
+
+def ABIncludedByC[A, B, C](a: A, b: B, c: C, f: (A, B) => C,
+subseteqPred: (C, C) => Boolean) : Boolean = {
subseteqPred(f(a, b), c)
}
-def rs1_subseteq_rs2(rs1: Set[Rexp], rs2: Set[Rexp]) : Boolean = {
- val res = rs1.forall(rs2.contains)
- res
-}
-def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp]) : ARexp = {
+def rs1_subseteq_rs2(rs1: Set[Rexp], rs2: Set[Rexp]) : Boolean =
+ //rs1 \subseteq rs2
+ rs1.forall(rs2.contains)
+
+def prune6(acc: Set[Rexp], r: ARexp, ctx: List[Rexp] = Nil) : ARexp = {
+ // if (erase(r) == ONE && acc != Set())
+ // {
+ // println("ctx")
+ // rsprint(ctx)
+ // println("acc")
+ // rsprint(acc)
+ // println("acc-end")
+ // }
if (ABIncludedByC(a = r, b = ctx, c = acc,
f = attachCtx, subseteqPred = rs1_subseteq_rs2))
- {//acc.flatMap(breakIntoTerms
+ {
AZERO
}
else{
r match {
- case ASEQ(bs, r1, r2) =>
- (prune6(acc, r1, erase(r2) :: ctx)) match{
+ case ASEQ(bs, r1, r2) => (prune6(acc, r1, erase(r2) :: ctx)) match{
case AZERO =>
AZERO
- case AONE(bs1) =>
- fuse(bs1, r2)
+ case AONE(bs1) => //when r1 becomes 1, chances to further prune r2
+ prune6(acc, fuse(bs1, r2), ctx)
case r1p =>
ASEQ(bs, r1p, r2)
}
@@ -949,7 +972,7 @@
case Nil =>
AZERO
case r :: Nil =>
- r
+ fuse(bs, r)
case rs0p =>
AALTS(bs, rs0p)
}
@@ -989,7 +1012,8 @@
}
}
-def distinctBy6(xs: List[ARexp], acc: Set[Rexp] = Set()) : List[ARexp] = xs match {
+def distinctBy6(xs: List[ARexp], acc: Set[Rexp] = Set()) :
+List[ARexp] = xs match {
case Nil =>
Nil
case x :: xs => {
@@ -998,13 +1022,13 @@
distinctBy6(xs, acc)
}
else{
- val pruned = prune6(acc, x, Nil)
+ val pruned = prune6(acc, x)
val newTerms = turnIntoTerms(erase(pruned))
pruned match {
case AZERO =>
distinctBy6(xs, acc)
case xPrime =>
- xPrime :: distinctBy6(xs, newTerms.map(oneSimp) ++: acc)//distinctBy5(xs, addToAcc.map(oneSimp(_)) ::: acc)
+ xPrime :: distinctBy6(xs, newTerms ++: acc)//distinctBy5(xs, addToAcc.map(oneSimp(_)) ::: acc)
}
}
}
@@ -1037,6 +1061,12 @@
case _ => r::Nil
}
+def flatsIntoTerms(r: Rexp) : List[Rexp] = r match {
+ //case SEQ(r1, r2) => flatsIntoTerms(r1).map(r11 => SEQ(r11, r2))
+ case ALTS(r1, r2) => flatsIntoTerms(r1) ::: flatsIntoTerms(r2)
+ case ZERO => Nil
+ case _ => r::Nil
+}
def decode_aux(r: Rexp, bs: Bits) : (Val, Bits) = (r, bs) match {
@@ -1245,7 +1275,7 @@
}
def cir_prod(l: Lin, t: Rexp): Lin = t match {//remember this Lin is different from the Lin in Antimirov's paper. Here it does not mean the set of all subsets of linear forms that does not contain zero, but rather the type a set of linear forms
case ZERO => Set()
- case ONE => l
+ // case ONE => l
case t => l.map( m => m._2 match
{
case ZERO => m
@@ -1255,6 +1285,39 @@
)
}
+ def cir_prodList(l : List[Mon], t: Rexp): List[Mon] = t match {
+ case ZERO => Nil
+ case ONE => l
+ case t => l.map(m => m._2 match
+ {
+ case ZERO => m
+ case ONE => (m._1, t)
+ case p => (m._1, SEQ(p, t))
+ }
+ )
+
+ }
+ def lfList(r: Rexp) : List[Mon] = r match {
+ case ZERO => Nil
+ case ONE => Nil
+ case CHAR(f) => {
+ (f, ONE) :: Nil
+ }
+ case ALTS(r1, r2) => {
+ lfList(r1) ++ lfList(r2)
+ }
+ case STAR(r1) => cir_prodList(lfList(r1), STAR(r1))
+ case SEQ(r1, r2) => {
+ if(nullable(r1))
+ cir_prodList(lfList(r1), r2) ++ lfList(r2)
+ else
+ cir_prodList(lfList(r1), r2)
+ }
+ }
+ def lfprint(ls: Lin) = ls.foreach(m =>{
+ println(m._1)
+ rprint(m._2)
+ })
def lf(r: Rexp): Lin = r match {
case ZERO => Set()
case ONE => Set()
@@ -1403,9 +1466,11 @@
def aaa_star() = {
val r = STAR(("a" | "aa"))
- for(i <- 0 to 100) {
+ for(i <- 0 to 20) {
val prog = "a" * i
+ print(i+" ")
println(asize(bders_simp(prog.toList, internalise(r))))
+ //println(size(ders_simp(prog.toList, r)))
}
}
// aaa_star()
@@ -1438,29 +1503,35 @@
}
// naive_matcher()
+//single(SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
+// SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE)))
+
+
+//STAR(SEQ(ALTS(CHAR(b),STAR(CHAR(b))),ALTS(ALTS(ONE,CHAR(b)),SEQ(CHAR(c),ONE))))
def generator_test() {
- test(single(SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
- SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))), 1) { (r: Rexp) =>
+ test(single(STAR(SEQ(ALTS(STAR(CHAR('b')),ALTS(CHAR('b'),CHAR('a'))),ALTS(ALTS(CHAR('b'),CHAR('c')),STAR(ZERO))))), 1) { (r: Rexp) =>
// ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) =>
- val ss = stringsFromRexp(r)
+ val ss = Set("ab")//stringsFromRexp(r)
val boolList = ss.map(s => {
//val bdStrong = bdersStrong(s.toList, internalise(r))
val bdStrong6 = bdersStrong6(s.toList, internalise(r))
- val bdStrong6Set = turnIntoTerms(erase(bdStrong6))
+ val bdStrong6Set = breakIntoTerms(erase(bdStrong6))
val pdersSet = pderUNIV(r)//.flatMap(r => turnIntoTerms(r))
val pdersSetBroken = pdersSet.flatMap(r => turnIntoTerms(r))
- rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pdersSet.toList))
- //bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
- //rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken)//|| bdStrong6Set.size <= pdersSetBroken.size
+ (rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pdersSet.toList)) ||
+ bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
+ rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken))//|| bdStrong6Set.size <= pdersSetBroken.size
+
})
//!boolList.exists(b => b == false)
!boolList.exists(b => b == false)
}
}
// small()
-// generator_test()
-
+// generator_test()
+
+
//STAR(STAR(STAR(SEQ(SEQ(ALTS(STAR(ALTS(ONE,CHAR('c'))),
// CHAR('c')),ALTS(CHAR('a'),ALTS(STAR(CHAR('b')),ALTS(CHAR('a'),ZERO)))),
@@ -1471,13 +1542,14 @@
//new ce1 : STAR(SEQ(ALTS(ALTS(ONE,CHAR(a)),SEQ(ONE,CHAR(b))),ALTS(CHAR(a),ALTS(CHAR(b),CHAR(a)))))
//new ce2 : ALTS(CHAR(b),SEQ(ALTS(ZERO,ALTS(CHAR(b),CHAR(b))),ALTS(ALTS(CHAR(a),CHAR(b)),SEQ(CHAR(c),ONE))))
//new ce3 : SEQ(CHAR(b),ALTS(ALTS(ALTS(ONE,CHAR(a)),SEQ(CHAR(c),ONE)),SEQ(STAR(ZERO),SEQ(ONE,CHAR(b)))))
+//circular double-nullable STAR(SEQ((STAR("b") | "a" | "b"), ("b" | ONE)))
def counterexample_check() {
- val r = SEQ(SEQ(STAR(CHAR('b')),STAR(STAR(SEQ(CHAR('a'),CHAR('b'))))),
- SEQ(SEQ(CHAR('b'),STAR(ALTS(CHAR('a'),ONE))),ONE))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
+ val r = STAR(SEQ(ALTS(STAR(CHAR('b')),ALTS(CHAR('b'),CHAR('a'))),
+ ALTS(ALTS(CHAR('b'),CHAR('c')),STAR(ZERO))))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
//ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
- val s = "b"
- val bdStrong5 = bdersStrong7(s.toList, internalise(r))
- val bdStrong5Set = turnIntoTerms(erase(bdStrong5))
+ val s = "ab"
+ val bdStrong5 = bdersStrong6(s.toList, internalise(r))
+ val bdStrong5Set = breakIntoTerms(erase(bdStrong5))
val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r))
val apdersSet = pdersSet.map(internalise)
println("original regex ")
@@ -1490,7 +1562,7 @@
rsprint(pdersSet.toList)
println("pderUNIV distinctBy6")
//asprint(distinctBy6(apdersSet.toList))
- rsprint(distinctByacc(pdersSet.toList))
+ rsprint((pdersSet.toList).flatMap(turnIntoTerms))
// rsprint(turnIntoTerms(pdersSet.toList(3)))
// println("NO 3 not into terms")
// rprint((pdersSet.toList()))
@@ -1498,7 +1570,23 @@
// rsprint(pdersSet.flatMap(r => turnIntoTerms(r)).toList)
}
-counterexample_check()
+
+// counterexample_check()
+
+def lf_display() {
+ val r = STAR(SEQ(ALTS(STAR(CHAR('b')),ALTS(CHAR('b'),CHAR('a'))),
+ ALTS(ALTS(CHAR('b'),CHAR('c')),STAR(ZERO))))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))),
+ //ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b')))
+ val s = "ab"
+ val bdStrong5 = bdersStrong6(s.toList, internalise(r))
+ val bdStrong5Set = breakIntoTerms(erase(bdStrong5))
+ val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r))
+ val apdersSet = pdersSet.map(internalise)
+ lfprint(lf(r))
+
+}
+
+lf_display()
def breakable(r: Rexp) : Boolean = r match {
case SEQ(ALTS(_, _), _) => true
@@ -1535,4 +1623,37 @@
// Sequ(Sequ(Sequ(Empty,Chr(b)),Stars(List(Chr(b), Chr(b)))),Sequ(Right(Stars(List(Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty)), Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty))))),Empty))
-// Sequ(Sequ(Sequ(Empty,Chr(b)),Stars(List(Chr(b), Chr(b)))),Sequ(Right(Stars(List(Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty)), Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty))))),Empty))
\ No newline at end of file
+// Sequ(Sequ(Sequ(Empty,Chr(b)),Stars(List(Chr(b), Chr(b)))),Sequ(Right(Stars(List(Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty)), Right(Sequ(Sequ(Stars(List(Right(Chr(a)), Right(Chr(a)))),Sequ(Chr(a),Sequ(Left(Chr(b)),Sequ(Empty,Chr(b))))),Empty))))),Empty))
+
+
+
+
+
+def bders_bderssimp() {
+
+ test(single(SEQ(ALTS(ONE,CHAR('c')),
+ SEQ(SEQ(CHAR('a'),CHAR('a')),ALTS(ONE,CHAR('c'))))), 1) { (r: Rexp) =>
+ // ALTS(SEQ(SEQ(ONE,CHAR('a')),STAR(CHAR('a'))),SEQ(ALTS(CHAR('c'),ONE),STAR(ZERO))))))), 1) { (r: Rexp) =>
+ val ss = Set("aa")//stringsFromRexp(r)
+ val boolList = ss.map(s => {
+ //val bdStrong = bdersStrong(s.toList, internalise(r))
+ val bds = bsimp(bders(s.toList, internalise(r)))
+ val bdssimp = bsimp(bders_simp(s.toList, internalise(r)))
+ //val pdersSet = pderUNIV(r)//.flatMap(r => turnIntoTerms(r))
+ //val pdersSetBroken = pdersSet.flatMap(r => turnIntoTerms(r))
+ //rs1_subseteq_rs2(bdStrong6Set.toSet, distinctByacc(pdersSet.toList))
+ //bdStrong6Set.size <= pdersSet.size || bdStrong6Set.size <= pdersSetBroken.size ||
+ //rs1_subseteq_rs2(bdStrong6Set.toSet, pdersSet union pdersSetBroken)//|| bdStrong6Set.size <= pdersSetBroken.size
+ rprint(r)
+ println(bds)
+ println(bdssimp)
+ bds == bdssimp
+ })
+ //!boolList.exists(b => b == false)
+ !boolList.exists(b => b == false)
+ }
+}
+// bders_bderssimp()
+
+
+
--- a/thys3/ClosedForms.thy Tue Jun 14 18:06:33 2022 +0100
+++ b/thys3/ClosedForms.thy Thu Jun 23 16:09:40 2022 +0100
@@ -1491,7 +1491,8 @@
lemma seq_closed_form_variant:
assumes "s \<noteq> []"
shows "rders_simp (RSEQ r1 r2) s =
- rsimp (RALTS (RSEQ (rders_simp r1 s) r2 # (map (rders_simp r2) (vsuf s r1))))"
+ rsimp (RALTS (RSEQ (rders_simp r1 s) r2 #
+ (map (rders_simp r2) (vsuf s r1))))"
using assms q seq_closed_form by force