algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it does, also to
answer with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which have been put forward by Sulzmann
and Lu. We have extended their rules in order to obtain a tight bound
on the size of regular expressions. We have tested these extended
rules, but have not yet formally established their correctness. We
also have not yet looked at extended regular expressions, such as
bounded repetitions, negation and back-references.
\end{abstract}

\section{Introduction}

While we believe derivatives of regular expressions are a beautiful
concept (in terms of the ease of implementing them in a functional
programming language and the ease of reasoning about them formally),
they have one major drawback: every derivative step can make regular
expressions grow drastically in size. This in turn has negative
effects on the runtime of the corresponding lexing algorithms.
Consider for example the regular expression $(a+aa)^*$ and the short
string $aaaaaaaaaaaa$. The size of the corresponding derivative is
already 8668 nodes when the derivative is seen as a tree. The reason
for the poor runtime of the lexing algorithms is that they need to
traverse such trees over and over again. The solution is to find a
complete set of simplification rules that keep the sizes of
derivatives uniformly small.

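To make the growth tangible, the following small Scala sketch (our
illustration, not the report's implementation; the names
\texttt{Rexp}, \texttt{der} and \texttt{size} are chosen ad hoc)
computes plain Brzozowski derivatives without any simplification and
measures their tree size. Iterating it over $(a+aa)^*$ and a string
of twelve $a$s already yields a derivative with several thousand
nodes; the exact count, like the 8668 quoted above, depends on the
precise derivative and size definitions used.

\begin{verbatim}
// Illustrative sketch only: plain derivatives without simplification,
// used to observe how quickly derivatives grow in size.
object DerivativeGrowth {
  sealed trait Rexp
  case object ZERO extends Rexp                  // matches no string
  case object ONE extends Rexp                   // matches the empty string
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  def nullable(r: Rexp): Boolean = r match {
    case ZERO => false
    case ONE => true
    case CHAR(_) => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_) => true
  }

  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO => ZERO
    case ONE => ZERO
    case CHAR(d) => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  }

  // size of a regular expression seen as a tree
  def size(r: Rexp): Int = r match {
    case ALT(r1, r2) => 1 + size(r1) + size(r2)
    case SEQ(r1, r2) => 1 + size(r1) + size(r2)
    case STAR(r1)    => 1 + size(r1)
    case _           => 1
  }

  def main(args: Array[String]): Unit = {
    val r: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))  // (a+aa)*
    val d = "aaaaaaaaaaaa".foldLeft(r)((d, c) => der(c, d))
    println(size(d))  // several thousand nodes after only twelve characters
  }
}
\end{verbatim}
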
For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as bitsequences.
(Values are the results the lexing algorithms generate; they encode
how a regular expression matched a string. For instance, for the
regular expression $a + aa$ and the string $aa$, a value records that
the right alternative matched the two characters.) We already know
that the lexing algorithm \emph{without} simplification is correct.
Therefore in the past six months we were trying to prove that the
algorithm using bitsequences plus our simplification rules is
correct. Formally this amounts to showing that

\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each
step, whereas with $\blexer$ the size can grow exponentially. This
would be an important milestone, because we already have a very good
idea how to establish that our set of simplification rules keeps the
size of derivatives below a relatively tight bound.

In order to prove the main theorem \eqref{mainthm}, we need to prove
that the two functions produce the same output. The definition of
these functions is shown below.

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
  $\blexers \; r \, s$ & $\dn$ &
      $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
  & & $\;\;\textit{if} \; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that
is the same in both cases; similarly the $\textit{decode}$ and
$\textit{bmkeps}$ functions. Our main theorem \eqref{mainthm}
therefore boils down to proving the following two propositions
(depending on which branch the if-else clause takes). They establish
that the derivatives \emph{with} simplification do not change the
computed result:

\begin{itemize}
\item{} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\,((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\,((r^\uparrow)\backslash s)$;
\item{} If a string $s$ is \emph{not} in the language $L(r)$, then
$\rup \backslash_{simp}\,s$ is not nullable.
\end{itemize}

\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult, because we can show that simplification does not
change the language of regular expressions. If we can also prove the
first part, that is, that the bitsequence algorithm with
simplification produces the same result as the one without
simplification, then we are done. Unfortunately that part requires
more effort, because simplification does not only need to \emph{not}
change the language, but also not change the value (the computed
result).

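To make this concrete, here is a small illustration of ours (not
taken from the report), writing $\mathbf{1}$ for the regular
expression that matches only the empty string and using the value
constructors $\textit{Right}$, $\textit{Seq}$, $\textit{Empty}$ and
$\textit{Char}$, which may differ slightly from the report's
notation. The derivative of $a + aa$ with respect to the character
$a$ is

\begin{center}
$(a + aa)\backslash a \;=\; \mathbf{1} + (\mathbf{1} \cdot a)$
\end{center}

\noindent
A natural simplification rewrites $\mathbf{1} \cdot a$ to just $a$,
giving $\mathbf{1} + a$. Both expressions accept exactly the same
strings, so the language is unchanged. However, matching the
remaining input $a$ produces the value
$\textit{Right}(\textit{Seq}\,\textit{Empty}\,(\textit{Char}\,a))$
for the unsimplified derivative, but $\textit{Right}(\textit{Char}\,a)$
for the simplified one. The bitsequence annotations are there to
compensate for exactly this kind of difference, so that
$\textit{decode}$ can still reconstruct the value expected for the
original regular expression.
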
The existing correctness proof for the bit-coded algorithm
\emph{without} simplification relies on the function
$\textit{retrieve}$, which unfortunately cannot be adopted to our
simplification rules without adjustments (a sketch of the problem is
given below). We are not abandoning this proof, but modifying it to
suit our needs; understanding it is therefore fundamental for what we
have been doing in the past six months. The definition of $\lexer$
and its correctness proof can be found in
\cite{AusafDyckhoffUrban2016}.

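As a hedged sketch of the problem (our illustration, borrowing a
subscript notation $_{bs}r$ for a regular expression $r$ annotated
with the bitsequence $bs$; the exact definitions may differ from the
report's): the function $\textit{retrieve}$ traverses an annotated
regular expression and a value in lockstep in order to collect the
annotated bits. For an alternative such as $_{[]}(_{0}a + _{1}aa)$
and a value of the form $\textit{Right}(\ldots)$ this works, because
the shape of the value mirrors the shape of the expression. After
simplification, however, the alternative constructor may have
disappeared, while the value produced by the unsimplified derivative
still has the shape $\textit{Right}(\ldots)$; the two no longer fit
together and $\textit{retrieve}$ as originally defined cannot be
applied unchanged.
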
The correctness proof of
\begin{center}
$\blexer \; r^\uparrow s = \lexer \;r \;s$
\end{center}
\noindent