--- a/etnms/etnms.tex Fri Jan 10 17:02:16 2020 +0000
+++ b/etnms/etnms.tex Fri Jan 10 22:35:08 2020 +0000
@@ -87,27 +87,54 @@
answers with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
- breaking its correctness. We have already developed some
- simplification rules for this, but have not yet proved that they
- preserve the correctness of the algorithm. We also have not yet
- looked at extended regular expressions, such as bounded repetitions,
+ breaking its correctness. Being fast depends on a complete set of
+ simplification rules, some of which
+ have been put forward by Sulzmann and Lu. We have extended their
+ rules in order to obtain a tight bound on size of regular expressions.
+ We have tested the correctness of these extended rules, but have not
+ formally established their correctness. We also have not yet looked
+ at extended regular expressions, such as bounded repetitions,
negation and back-references.
\end{abstract}
\section{Introduction}
-
+While we believe derivatives of regular expressions is a beautiful
+concept (interms of ease to implementing them in functional programming
+language and ease to reason about them formally), they have one major
+drawback: every derivative step can make regular expressions grow
+drastically in size. This in turn has negative effects on the runtime of
+the corresponding lexing algorithms. Consider for example the regular
+expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The size of
+the corresponding derivative is already 8668 node assuming the derivatives
+is seen as a tree. The reason for the poor runtime of the lexing algorithms is
+that they need to traverse such trees over and over again. The solution is to
+find a complete set of simplification rules that keep the sizes of derivatives
+uniformly small.
-\noindent\rule[0.5ex]{\linewidth}{1pt}
-Between the 2 bars are the new materials.\\
-In the past 6 months I was trying to prove that the bit-coded algorithm is correct.
-\begin{center}
-$\blexers \;r \; s = \blexer \; r \; s$
-\end{center}
+For reasons beyond this report, it turns out that a complete set of
+simplification rules depend on values being encoded as bitsequences.
+(Vlue are the results of the lexing algorithms generate; they encode how
+a regular expression matched a string.) We already know that the lexing
+algorithm \emph{without} simplification is correct. Therefore in the
+past 6 months we were trying to prove that the algorithm using bitsequences plus
+our simplification rules is correct. Formally this amounts to show that
+
+\begin{equation}\label{mainthm}
+\blexers \; r \; s = \blexer \;r\;s
+\end{equation}
+
\noindent
-To prove this, we need to prove these two functions produce the same output
-whether or not $r \in L(r)$.
-Given the definition of $\blexer$ and $\blexers$:
+whereby $\blexers$ simplifies (makes derivatives smaller) in each step,
+whereas with $\blexer$ the size can grow exponentially. This would be an
+important milestone, because we already have a very good idea how to
+establish that our set our simplification rules keeps the size below a
+relatively tight bound.
+
+In order to prove the main theorem \eqref{mainthm}, we need to prove the
+two functions produce the same output. The definition of these functions
+is shown below.
+
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
@@ -118,41 +145,63 @@
\end{tabular}
\end{center}
- \begin{center}
+\begin{center}
\begin{tabular}{lcl}
- $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
- $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
- & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
- & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
- & & $\;\;\textit{else}\;\textit{None}$
+ $\blexers \; r \, s$ &$\dn$ &
+ $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
+ & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
+ & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\; \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
-it boils down to proving the following two propositions(depending on which
-branch in the if-else clause is taken):
+In these definitions $(r^\uparrow)$ is a kind of coding function that is the
+same in each case, similarly the decode and the \textit{bmkeps}
+functions. Our main theorem \eqref{mainthm} therefore boils down to
+proving the following two propositions (depending on which branch the
+if-else clause takes). It establishes how the derivatives \emph{with}
+simplification do not change the computed result:
\begin{itemize}
-
-\item{}
-When s is a string in the language L(r), \\
-$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\, s = \textit{bmkeps} (r^\uparrow)\backslash s$, \\
-\item{}
-when s is not a string of the language L(ar)
-ders\_simp(ar, s) is not nullable
+\item{} If a string $s$ is in the language of $L(r)$, then \\
+$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\
+\item{} If a string $s$ is in the language $L(r)$, then
+$\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}
-The second one is relatively straightforward using isabelle to prove.
-The first part requires more effort.
-It builds on the result that the bit-coded algorithm without simplification
-produces the correct result:
-\begin{center}
-$\blexer \;r^\uparrow s = \lexer \; r\; s$
-\end{center}
+
\noindent
-the definition of lexer and its correctness is
-omitted(see \cite{AusafDyckhoffUrban2016}).
-if we can prove that the bit-coded algorithm with simplification produces
-the same result as the original bit-coded algorithm,
-then we are done.
+We have already proved in Isabelle the second part. This is actually not
+too difficult because we can show that simplification does not change
+the language of regular expressions. If we can prove the first case,
+that is the bitsequence algorithm with simplification produces the same
+result as the one without simplification, then we are done.
+Unfortunately that part requires more effort, because simplification does not
+only.need to \emph{not} change the language, but also not change
+the value (computed result).
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+Do you want to keep this? You essentially want to say that the old
+method used retrieve, which unfortunately cannot be adopted to
+the simplification rules. You could just say that and give an example.
+However you have to think about how you give the example....nobody knows
+about AZERO etc yet. Maybe it might be better to use normal regexes
+like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+REPLY:\\
+Yes, I am essentially saying that the old method
+cannot be adopted without adjustments.
+But this does not mean we should skip
+the proof of the bit-coded algorithm
+as it is still the main direction we are looking into
+to prove things. We are trying to modify
+the old proof to suit our needs, but not give
+up it totally, that is why i believe the old
+proof is fundamental in understanding
+what we are doing in the past 6 months.
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+
The correctness proof of
\begin{center}
$\blexer \; r^\uparrow s = \lexer \;r \;s$
@@ -247,8 +296,8 @@
equalities,
$\retrieve \; a \; v$ is not always defined.
for example,
-$\retrieve \; \AALTS(Z, \AONE(S), \AONE(S)) \; \Left(\Empty)$
-is defined, but not $\retrieve \; \AONE(\Z\S) \; \Left(\Empty)$,
+$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
+is defined, but not $\retrieve \; _{01}a \;\Left(\Empty)$,
though we can extract the same POSIX
bits from the two annotated regular expressions.
The latter might occur when we try to retrieve from