diff -r aeb0bc2d1812 -r 317a7d54ebcc etnms/etnms.tex
--- a/etnms/etnms.tex	Sun Jan 12 22:49:23 2020 +0000
+++ b/etnms/etnms.tex	Mon Jan 13 14:13:47 2020 +0000
@@ -91,48 +91,51 @@
   breaking its correctness. Being fast depends on a complete set of
   simplification rules, some of which
   have been put forward by Sulzmann and Lu. We have extended their
-  rules in order to obtain a tight bound on size of regular expressions.
-  We have tested the correctness of these extended rules, but have not
-  formally established their correctness. We also have not yet looked
+  rules in order to obtain a tight bound on the size of regular expressions.
+  We have tested these extended rules, but have not
+  formally established their correctness. We have also not yet looked
   at extended regular expressions, such as bounded repetitions, negation
   and back-references.
 \end{abstract}

 \section{Introduction}

-While we believe derivatives of regular expressions is a beautiful
-concept (interms of ease to implementing them in functional programming
-language and ease to reason about them formally), they have one major
-drawback: every derivative step can make regular expressions grow
-drastically in size. This in turn has negative effects on the runtime of
-the corresponding lexing algorithms. Consider for example the regular
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The size of
-the corresponding derivative is already 8668 node assuming the derivatives
-is seen as a tree. The reason for the poor runtime of the lexing algorithms is
-that they need to traverse such trees over and over again. The solution is to
-find a complete set of simplification rules that keep the sizes of derivatives
-uniformly small.
+While we believe derivatives of regular expressions are a beautiful
+concept (in terms of ease of implementing them in functional
+programming languages and in terms of reasoning about them formally),
+they have one major drawback: every derivative step can make regular
+expressions grow drastically in size. This in turn has a negative effect
+on the runtime of the corresponding lexing algorithms. Consider for
+example the regular expression $(a+aa)^*$ and the short string
+$aaaaaaaaaaaa$. The corresponding derivative already contains 8668
+nodes when the derivative is viewed as a tree. The reason for the poor
+runtime of the lexing algorithms is that they need to traverse such
+trees over and over again. The solution is to find a complete set of
+simplification rules that keep the sizes of derivatives uniformly
+small.

-For reasons beyond this report, it turns out that a complete set of
-simplification rules depend on values being encoded as bitsequences.
-(Vlue are the results of the lexing algorithms generate; they encode how
-a regular expression matched a string.) We already know that the lexing
-algorithm \emph{without} simplification is correct. Therefore in the
-past 6 months we were trying to prove that the algorithm using bitsequences plus
-our simplification rules is correct. Formally this amounts to show that
+For reasons beyond this report, it turns out that a complete set of
+simplification rules depends on values being encoded as
+bitsequences.\footnote{Values are the results the lexing algorithms
+  generate; they encode how a regular expression matched a string.}
+We already know that the lexing algorithm using bitsequences but
+\emph{without} simplification is correct, albeit horribly
+slow.
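+
+The reason for the slowness is the size blow-up just described, and it
+is easy to observe experimentally. The following Haskell sketch (ours,
+purely for illustration and not part of the Isabelle development)
+computes the sizes of the iterated derivatives of $(a+aa)^*$; the
+exact node count depends on which definitions of derivative and size
+one fixes, but the growth pattern is plainly visible:
+
+\begin{verbatim}
+-- Plain regular expressions, Brzozowski derivatives and a size
+-- measure; a minimal sketch only meant to exhibit the growth.
+data Rex = Zero | One | Chr Char | Alt Rex Rex
+         | Seq Rex Rex | Star Rex deriving (Eq, Show)
+
+nullable :: Rex -> Bool
+nullable Zero      = False
+nullable One       = True
+nullable (Chr _)   = False
+nullable (Alt r s) = nullable r || nullable s
+nullable (Seq r s) = nullable r && nullable s
+nullable (Star _)  = True
+
+deriv :: Char -> Rex -> Rex
+deriv _ Zero      = Zero
+deriv _ One       = Zero
+deriv c (Chr d)   = if c == d then One else Zero
+deriv c (Alt r s) = Alt (deriv c r) (deriv c s)
+deriv c (Seq r s)
+  | nullable r    = Alt (Seq (deriv c r) s) (deriv c s)
+  | otherwise     = Seq (deriv c r) s
+deriv c (Star r)  = Seq (deriv c r) (Star r)
+
+size :: Rex -> Int
+size (Alt r s) = 1 + size r + size s
+size (Seq r s) = 1 + size r + size s
+size (Star r)  = 1 + size r
+size _         = 1
+
+-- print the tree size after n derivative steps, for n = 0..12
+main :: IO ()
+main = mapM_ step [0 .. 12]
+  where r      = Star (Alt (Chr 'a') (Seq (Chr 'a') (Chr 'a')))
+        step n = print (n, size (foldl (flip deriv) r
+                                       (replicate n 'a')))
+\end{verbatim}
+
+\noindent
+The sizes grow rapidly with every additional character, even though
+each derivative is equivalent to a much smaller regular expression.
+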
+Therefore in the past six months we have been trying to prove that the
+algorithm using bitsequences plus our simplification rules is
+correct. Formally this amounts to showing that

 \begin{equation}\label{mainthm}
 \blexers \; r \; s = \blexer \;r\;s
 \end{equation}

 \noindent
-whereby $\blexers$ simplifies (makes derivatives smaller) in each step,
-whereas with $\blexer$ the size can grow exponentially. This would be an
-important milestone, because we already have a very good idea how to
-establish that our set our simplification rules keeps the size below a
-relatively tight bound.
+whereby $\blexers$ simplifies (makes derivatives smaller) in each
+step, whereas with $\blexer$ the size can grow exponentially. This
+would be an important milestone for the thesis, because we already
+have a very good idea how to establish that our set of simplification
+rules keeps the size of derivatives below a relatively tight bound.

-In order to prove the main theorem \eqref{mainthm}, we need to prove the
+In order to prove the main theorem in \eqref{mainthm}, we need to prove that the
 two functions produce the same output. The definition of these functions
 is shown below.

@@ -156,82 +159,94 @@
 \end{tabular}
 \end{center}

 \noindent
-In these definitions $(r^\uparrow)$ is a kind of coding function that is the
-same in each case, similarly the decode and the \textit{bmkeps}
-functions. Our main theorem \eqref{mainthm} therefore boils down to
-proving the following two propositions (depending on which branch the
-if-else clause takes). It establishes how the derivatives \emph{with}
-simplification do not change the computed result:
+In these definitions $(r^\uparrow)$ is a kind of coding function that
+is the same in each case; similarly the \textit{decode} and the
+\textit{bmkeps} functions are the same in each case. Our main theorem
+\eqref{mainthm} therefore boils down to proving the following two
+propositions (depending on which branch the if-else clause takes).
+They establish that the derivatives \emph{with} simplification do not
+change the computed result:

 \begin{itemize}
-\item{} If a string $s$ is in the language of $L(r)$, then \\
+\item[(a)] If a string $s$ is in the language $L(r)$, then \\
 $\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$,\\
-\item{} If a string $s$ is in the language $L(r)$, then 
+\item[(b)] If a string $s$ is \emph{not} in the language $L(r)$, then
 $\rup \backslash_{simp} \,s$ is not nullable.
 \end{itemize}

 \noindent
-We have already proved in Isabelle the second part. This is actually not
-too difficult because we can show that simplification does not change
-the language of regular expressions. If we can prove the first case,
-that is the bitsequence algorithm with simplification produces the same
-result as the one without simplification, then we are done.
-Unfortunately that part requires more effort, because simplification does not
-only.need to \emph{not} change the language, but also not change
-the value (computed result).
+We have already proved the second part in Isabelle. This is actually
+not too difficult because we can show that simplification does not
+change the language of a regular expression.
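+
+Proposition (b) can at least be tested before it is proved. The
+following sketch is ours and only an analogy: it checks the
+corresponding property for plain regular expressions and a
+simple-minded simplification function, not for the bitcoded regular
+expressions and our full set of rules. It reuses the definitions from
+the sketch in the introduction:
+
+\begin{verbatim}
+-- simple-minded simplification: remove Zero and One under
+-- Alt/Seq and collapse duplicate alternatives
+simp :: Rex -> Rex
+simp (Alt r s) = case (simp r, simp s) of
+                   (Zero, s')           -> s'
+                   (r', Zero)           -> r'
+                   (r', s') | r' == s'  -> r'
+                            | otherwise -> Alt r' s'
+simp (Seq r s) = case (simp r, simp s) of
+                   (Zero, _) -> Zero
+                   (_, Zero) -> Zero
+                   (One, s') -> s'
+                   (r', One) -> r'
+                   (r', s')  -> Seq r' s'
+simp r = r
+
+ders, dersSimp :: Rex -> String -> Rex
+ders     = foldl (flip deriv)
+dersSimp = foldl (\r c -> simp (deriv c r))
+
+-- nullability (and hence language membership) must not be
+-- affected by simplifying after each derivative step; checked
+-- here for all strings over {a,b} up to length n
+propB :: Rex -> Int -> Bool
+propB r n = and [ nullable (ders r s) == nullable (dersSimp r s)
+                | k <- [0 .. n], s <- sequence (replicate k "ab") ]
+\end{verbatim}
+
+\noindent
+The corresponding test for proposition (a) would in addition compare
+the bitsequences extracted by \textit{bmkeps} with and without
+simplification.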

-\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
-Do you want to keep this?  You essentially want to say that the old
-method used retrieve, which unfortunately cannot be adopted to
-the simplification rules. You could just say that and give an example.
-However you have to think about how you give the example....nobody knows
-about AZERO etc yet. Maybe it might be better to use normal regexes
-like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.
+If we can prove the first part, that is that the bitsequence algorithm
+with simplification produces the same result as the one without
+simplification, then we are done. Unfortunately that part requires
+more effort, because simplification must not only preserve the
+language, but also the value (that is, the computed result).
+
+%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+%Do you want to keep this?  You essentially want to say that the old
+%method used retrieve, which unfortunately cannot be adopted to
+%the simplification rules. You could just say that and give an example.
+%However you have to think about how you give the example....nobody knows
+%about AZERO etc yet. Maybe it might be better to use normal regexes
+%like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.

-\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
-REPLY:\\
-Yes, I am essentially saying that the old method
-cannot be adopted without adjustments.
-But this does not mean we should skip
-the proof of the bit-coded algorithm
-as it is still the main direction we are looking into
-to prove things. We are trying to modify
-the old proof to suit our needs, but not give
-up it totally, that is why i believe the old
-proof is fundamental in understanding
-what we are doing in the past 6 months.
-
-\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
+%REPLY:\\
+%Yes, I am essentially saying that the old method
+%cannot be adopted without adjustments.
+%But this does not mean we should skip
+%the proof of the bit-coded algorithm
+%as it is still the main direction we are looking into
+%to prove things. We are trying to modify
+%the old proof to suit our needs, but not give
+%up it totally, that is why i believe the old
+%proof is fundamental in understanding
+%what we are doing in the past 6 months.
+%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}

-The correctness proof of
-\begin{center}
-$\blexer \; r^\uparrow  s = \lexer \;r \;s$
-\end{center}
-\noindent
-might provide us insight into proving
-\begin{center}
-$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
-\end{center}
+For this we have started by looking at the original proof that
+established that the bitsequence algorithm produces the same result
+as the algorithm not using bitsequences. Formally this proof
+established
+
+\begin{equation}\label{lexer}
+\blexer \; (r^\uparrow)\; s = \lexer \;r \;s
+\end{equation}
+
+%\noindent
+%might provide us insight into proving
+%\begin{center}
+%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
+%\end{center}
+
 \noindent
-(that is also
-why we say the new proof builds on the older one).
-The proof defined the function $\flex$ as another way of
-expressing the $\lexer$ function:
-\begin{center}
-$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; r\backslash s)$
-\end{center}.
-\noindent
-(proof for the above equality will be explained later)
-The definition of $flex$ is as follows:
+This proof used two ``tricks''. One is that it defined a
+$\flex$-function
+
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
 $\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
 \end{tabular}
 \end{center}
+
 \noindent
-here $\flex$ essentially does lexing by
-stacking up injection functions while doing derivatives,
+and then proved that the right-hand side of \eqref{lexer} can be
+written as
+
+\begin{center}
+$\lexer \;r\; s = \flex \;r\;\textit{id}\; s \;(\mkeps \; (r\backslash s))$.
+\end{center}
+
+\noindent
+The $\flex$-function essentially does lexing by
+stacking up injection functions while doing derivatives,
 explicitly showing the order of characters being injected back in each step.
 With $\flex$ we can write $\lexer$ this way: 
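+
+\begin{verbatim}
+-- Our transcription into Haskell; a sketch only, reusing Rex,
+-- deriv and nullable from the first example. flex stacks up one
+-- injection (as a closure) per derivative step.
+flex :: Rex -> (Val -> Val) -> String -> (Val -> Val)
+flex r f (c:s) = flex (deriv c r) (f . inj r c) s
+flex _ f []    = f
+
+lexer :: Rex -> String -> Maybe Val
+lexer r s
+  | nullable r' = Just (flex r id s (mkeps r'))
+  | otherwise   = Nothing
+  where r' = foldl (flip deriv) r s
+
+-- Values, mkeps and inj follow the standard definitions for POSIX
+-- lexing with derivatives (both functions are partial, as usual).
+data Val = Empty | Chd Char | SeqV Val Val
+         | LeftV Val | RightV Val | Stars [Val]
+
+mkeps :: Rex -> Val
+mkeps One       = Empty
+mkeps (Alt r s) = if nullable r then LeftV (mkeps r)
+                                else RightV (mkeps s)
+mkeps (Seq r s) = SeqV (mkeps r) (mkeps s)
+mkeps (Star _)  = Stars []
+
+inj :: Rex -> Char -> Val -> Val
+inj (Chr d)   _ Empty               = Chd d
+inj (Alt r _) c (LeftV v)           = LeftV (inj r c v)
+inj (Alt _ s) c (RightV v)          = RightV (inj s c v)
+inj (Seq r _) c (SeqV v w)          = SeqV (inj r c v) w
+inj (Seq r _) c (LeftV (SeqV v w))  = SeqV (inj r c v) w
+inj (Seq r s) c (RightV v)          = SeqV (mkeps r) (inj s c v)
+inj (Star r)  c (SeqV v (Stars vs)) = Stars (inj r c v : vs)
+\end{verbatim}
+
+\noindent
+The stacked-up closure records exactly the order in which the
+characters have to be injected back, which is the point made above.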