etnms/etnms.tex
changeset 100 397b31867ea6
parent 95 c969a973fcae
child 101 4a327e70d538
--- a/etnms/etnms.tex	Fri Jan 10 17:02:16 2020 +0000
+++ b/etnms/etnms.tex	Fri Jan 10 22:35:08 2020 +0000
@@ -87,27 +87,54 @@
   answers with \emph{how} it matches the string.  This is important for
   applications such as lexing (tokenising a string). The problem is to
   make the algorithm by Sulzmann and Lu fast on all inputs without
-  breaking its correctness. We have already developed some
-  simplification rules for this, but have not yet proved that they
-  preserve the correctness of the algorithm. We also have not yet
-  looked at extended regular expressions, such as bounded repetitions,
+  breaking its correctness. Being fast depends on a complete set of 
+  simplification rules, some of which 
+  have been put forward by Sulzmann and Lu. We have extended their
+  rules in order to obtain a tight bound on size of regular expressions.
+  We have tested the correctness of these extended rules, but have not 
+  formally established their correctness. We also have not yet looked 
+  at extended regular expressions, such as bounded repetitions,
   negation and back-references.
 \end{abstract}
 
 \section{Introduction}
 
-
+While we believe derivatives of regular expressions is a beautiful 
+concept (interms of ease to implementing them in functional programming
+language and ease to reason about them formally), they have one major 
+drawback: every derivative step can make regular expressions grow 
+drastically in size. This in turn has negative effects on the runtime of 
+the corresponding lexing algorithms. Consider for example the regular 
+expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The size of
+the corresponding derivative is already 8668 node assuming the derivatives
+is seen as a tree. The reason for the poor runtime of the lexing algorithms is
+that they need to traverse such trees over and over again. The solution is to
+find a complete set of simplification rules that keep the sizes of derivatives
+uniformly small.
 
-\noindent\rule[0.5ex]{\linewidth}{1pt}
-Between the 2 bars are the new materials.\\
-In the past 6 months I was trying to prove that the bit-coded algorithm is correct.
-\begin{center}
-$\blexers \;r \; s = \blexer \; r \; s$
-\end{center}
+For reasons beyond this report, it turns out that a complete set of 
+simplification rules depend on values being encoded as bitsequences.
+(Vlue are the results of the lexing algorithms generate; they encode how
+a regular expression matched a string.) We already know that the lexing 
+algorithm \emph{without} simplification is correct. Therefore in the
+past 6 months we were trying to prove  that the algorithm using bitsequences plus
+our simplification rules is correct. Formally this amounts to show that
+
+\begin{equation}\label{mainthm}
+\blexers \; r \; s = \blexer \;r\;s
+\end{equation}
+
 \noindent
-To prove this, we need to prove these two functions produce the same output 
-whether or not $r \in L(r)$.
-Given the definition of $\blexer$ and $\blexers$:
+whereby $\blexers$ simplifies (makes derivatives smaller) in each step,
+whereas with $\blexer$ the size can grow exponentially. This would be an
+important milestone, because we already have a very good idea how to 
+establish that our set our simplification rules keeps the size below a
+relatively tight bound.
+
+In order to prove the main theorem \eqref{mainthm}, we need to prove the
+two functions produce the same output. The definition of these functions 
+is shown below.
+
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{blexer}\;r\,s$ & $\dn$ &
@@ -118,41 +145,63 @@
 \end{tabular}
 \end{center}
 
- \begin{center}
+\begin{center}
 \begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
+  $\blexers \; r \, s$ &$\dn$ &
+    $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
+  & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
+  & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;   \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
-it boils down to proving the following two propositions(depending on which
-branch in the if-else clause is taken):
+In these definitions $(r^\uparrow)$ is a kind of coding function that is the
+same in each case, similarly the decode and the \textit{bmkeps}
+functions. Our main theorem \eqref{mainthm} therefore boils down to
+proving the following two  propositions (depending on which branch the 
+if-else clause takes). It establishes how the derivatives  \emph{with}
+simplification do not change the computed result:
 
 \begin{itemize}
-
-\item{}
-When s is a string in the language L(r), \\
-$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\, s  = \textit{bmkeps} (r^\uparrow)\backslash s$, \\
-\item{}
-when s is not a string of the language L(ar)
-ders\_simp(ar, s) is not nullable
+\item{} If a string $s$ is in the language of $L(r)$, then \\
+$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\
+\item{} If a string $s$ is in the language $L(r)$, then 
+$\rup \backslash_{simp} \,s$ is not nullable.
 \end{itemize}
-The second one is relatively straightforward using isabelle to prove.
-The first part requires more effort. 
-It builds on the result that the bit-coded algorithm without simplification
-produces the correct result:
-\begin{center}
-$\blexer \;r^\uparrow  s = \lexer \; r\; s$
-\end{center}
+
 \noindent
-the definition of lexer and its correctness is 
-omitted(see \cite{AusafDyckhoffUrban2016}).
-if we can prove that the bit-coded algorithm with simplification produces 
-the same result as the original bit-coded algorithm, 
-then we are done.
+We have already proved in Isabelle the second part. This is actually not 
+too difficult because we can show that simplification does not change
+the language of regular expressions. If we can prove the first case,
+that is the bitsequence algorithm with simplification produces the same 
+result as the one without simplification, then we are done.
+Unfortunately that part requires more effort, because simplification does not 
+only.need to \emph{not} change the language, but also not change
+the value (computed result).
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+Do you want to keep this? You essentially want to say that the old
+method used retrieve, which unfortunately cannot be adopted to 
+the simplification rules. You could just say that and give an example.
+However you have to think about how you give the example....nobody knows
+about AZERO etc yet. Maybe it might be better to use normal regexes
+like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+REPLY:\\
+Yes, I am essentially saying that the old method
+cannot be adopted without adjustments.
+But this does not mean we should skip
+the proof of the bit-coded algorithm
+as it is still the main direction we are looking into
+to prove things. We are trying to modify
+the old proof to suit our needs, but not give 
+up it totally, that is why i believe the old 
+proof is fundamental in understanding
+what we are doing in the past 6 months.
+
+\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+
 The correctness proof of
 \begin{center}
 $\blexer \; r^\uparrow  s = \lexer \;r \;s$
@@ -247,8 +296,8 @@
 equalities,
 $\retrieve \; a \; v$ is not always defined.
 for example,
-$\retrieve \; \AALTS(Z, \AONE(S), \AONE(S)) \; \Left(\Empty)$
-is defined, but not $\retrieve \; \AONE(\Z\S) \; \Left(\Empty)$,
+$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
+is defined, but not $\retrieve \; _{01}a \;\Left(\Empty)$,
 though we can extract the same POSIX
 bits from the two annotated regular expressions.
 The latter might occur when we try to retrieve from