diff -r 596bcdd7aaf9 -r 397b31867ea6 etnms/etnms.tex --- a/etnms/etnms.tex Fri Jan 10 17:02:16 2020 +0000 +++ b/etnms/etnms.tex Fri Jan 10 22:35:08 2020 +0000 @@ -87,27 +87,54 @@ answers with \emph{how} it matches the string. This is important for applications such as lexing (tokenising a string). The problem is to make the algorithm by Sulzmann and Lu fast on all inputs without - breaking its correctness. We have already developed some - simplification rules for this, but have not yet proved that they - preserve the correctness of the algorithm. We also have not yet - looked at extended regular expressions, such as bounded repetitions, + breaking its correctness. Being fast depends on a complete set of + simplification rules, some of which + have been put forward by Sulzmann and Lu. We have extended their + rules in order to obtain a tight bound on size of regular expressions. + We have tested the correctness of these extended rules, but have not + formally established their correctness. We also have not yet looked + at extended regular expressions, such as bounded repetitions, negation and back-references. \end{abstract} \section{Introduction} - +While we believe derivatives of regular expressions is a beautiful +concept (interms of ease to implementing them in functional programming +language and ease to reason about them formally), they have one major +drawback: every derivative step can make regular expressions grow +drastically in size. This in turn has negative effects on the runtime of +the corresponding lexing algorithms. Consider for example the regular +expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The size of +the corresponding derivative is already 8668 node assuming the derivatives +is seen as a tree. The reason for the poor runtime of the lexing algorithms is +that they need to traverse such trees over and over again. The solution is to +find a complete set of simplification rules that keep the sizes of derivatives +uniformly small. -\noindent\rule[0.5ex]{\linewidth}{1pt} -Between the 2 bars are the new materials.\\ -In the past 6 months I was trying to prove that the bit-coded algorithm is correct. -\begin{center} -$\blexers \;r \; s = \blexer \; r \; s$ -\end{center} +For reasons beyond this report, it turns out that a complete set of +simplification rules depend on values being encoded as bitsequences. +(Vlue are the results of the lexing algorithms generate; they encode how +a regular expression matched a string.) We already know that the lexing +algorithm \emph{without} simplification is correct. Therefore in the +past 6 months we were trying to prove that the algorithm using bitsequences plus +our simplification rules is correct. Formally this amounts to show that + +\begin{equation}\label{mainthm} +\blexers \; r \; s = \blexer \;r\;s +\end{equation} + \noindent -To prove this, we need to prove these two functions produce the same output -whether or not $r \in L(r)$. -Given the definition of $\blexer$ and $\blexers$: +whereby $\blexers$ simplifies (makes derivatives smaller) in each step, +whereas with $\blexer$ the size can grow exponentially. This would be an +important milestone, because we already have a very good idea how to +establish that our set our simplification rules keeps the size below a +relatively tight bound. + +In order to prove the main theorem \eqref{mainthm}, we need to prove the +two functions produce the same output. The definition of these functions +is shown below. + \begin{center} \begin{tabular}{lcl} $\textit{blexer}\;r\,s$ & $\dn$ & @@ -118,41 +145,63 @@ \end{tabular} \end{center} - \begin{center} +\begin{center} \begin{tabular}{lcl} - $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ - & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ - & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ - & & $\;\;\textit{else}\;\textit{None}$ + $\blexers \; r \, s$ &$\dn$ & + $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\ + & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\ + & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ + & & $\;\; \textit{else}\;\textit{None}$ \end{tabular} \end{center} \noindent -it boils down to proving the following two propositions(depending on which -branch in the if-else clause is taken): +In these definitions $(r^\uparrow)$ is a kind of coding function that is the +same in each case, similarly the decode and the \textit{bmkeps} +functions. Our main theorem \eqref{mainthm} therefore boils down to +proving the following two propositions (depending on which branch the +if-else clause takes). It establishes how the derivatives \emph{with} +simplification do not change the computed result: \begin{itemize} - -\item{} -When s is a string in the language L(r), \\ -$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\, s = \textit{bmkeps} (r^\uparrow)\backslash s$, \\ -\item{} -when s is not a string of the language L(ar) -ders\_simp(ar, s) is not nullable +\item{} If a string $s$ is in the language of $L(r)$, then \\ +$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\ +\item{} If a string $s$ is in the language $L(r)$, then +$\rup \backslash_{simp} \,s$ is not nullable. \end{itemize} -The second one is relatively straightforward using isabelle to prove. -The first part requires more effort. -It builds on the result that the bit-coded algorithm without simplification -produces the correct result: -\begin{center} -$\blexer \;r^\uparrow s = \lexer \; r\; s$ -\end{center} + \noindent -the definition of lexer and its correctness is -omitted(see \cite{AusafDyckhoffUrban2016}). -if we can prove that the bit-coded algorithm with simplification produces -the same result as the original bit-coded algorithm, -then we are done. +We have already proved in Isabelle the second part. This is actually not +too difficult because we can show that simplification does not change +the language of regular expressions. If we can prove the first case, +that is the bitsequence algorithm with simplification produces the same +result as the one without simplification, then we are done. +Unfortunately that part requires more effort, because simplification does not +only.need to \emph{not} change the language, but also not change +the value (computed result). + +\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt} +Do you want to keep this? You essentially want to say that the old +method used retrieve, which unfortunately cannot be adopted to +the simplification rules. You could just say that and give an example. +However you have to think about how you give the example....nobody knows +about AZERO etc yet. Maybe it might be better to use normal regexes +like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$. + +\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt} +REPLY:\\ +Yes, I am essentially saying that the old method +cannot be adopted without adjustments. +But this does not mean we should skip +the proof of the bit-coded algorithm +as it is still the main direction we are looking into +to prove things. We are trying to modify +the old proof to suit our needs, but not give +up it totally, that is why i believe the old +proof is fundamental in understanding +what we are doing in the past 6 months. + +\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt} + The correctness proof of \begin{center} $\blexer \; r^\uparrow s = \lexer \;r \;s$ @@ -247,8 +296,8 @@ equalities, $\retrieve \; a \; v$ is not always defined. for example, -$\retrieve \; \AALTS(Z, \AONE(S), \AONE(S)) \; \Left(\Empty)$ -is defined, but not $\retrieve \; \AONE(\Z\S) \; \Left(\Empty)$, +$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$ +is defined, but not $\retrieve \; _{01}a \;\Left(\Empty)$, though we can extract the same POSIX bits from the two annotated regular expressions. The latter might occur when we try to retrieve from