# HG changeset patch # User Christian Urban # Date 1579093270 0 # Node ID e0db3242d8b5d71114dffb00f43a056a4cfbc858 # Parent 317a7d54ebcc5b63d549ee89fe7ff31855272c84# Parent e7ba4da53930e75f5d85de5c90661431068b0911 added my comments diff -r 317a7d54ebcc -r e0db3242d8b5 etnms/etnms.tex --- a/etnms/etnms.tex Mon Jan 13 14:13:47 2020 +0000 +++ b/etnms/etnms.tex Wed Jan 15 13:01:10 2020 +0000 @@ -100,18 +100,19 @@ \section{Introduction} -While we believe derivatives of regular expressions is a beautiful -concept (in terms of ease of implementing them in functional -programming languages and in terms of reasoning about them formally), -they have one major drawback: every derivative step can make regular -expressions grow drastically in size. This in turn has negative effect -on the runtime of the corresponding lexing algorithms. Consider for -example the regular expression $(a+aa)^*$ and the short string -$aaaaaaaaaaaa$. The corresponding derivative is already 8668 nodes -assuming the derivative is seen as a tree. The reason for the poor -runtime of the lexing algorithms is that they need to traverse such -trees over and over again. The solution is to find a complete set of -simplification rules that keep the sizes of derivatives uniformly +While we believe derivatives of regular expressions, written +$r\backslash s$, are a beautiful concept (in terms of ease of +implementing them in functional programming languages and in terms of +reasoning about them formally), they have one major drawback: every +derivative step can make regular expressions grow drastically in +size. This in turn has negative effect on the runtime of the +corresponding lexing algorithms. Consider for example the regular +expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The +corresponding derivative contains already 8668 nodes where we assume +the derivative is given as a tree. The reason for the poor runtime of +the derivative-based lexing algorithms is that they need to traverse +such trees over and over again. The solution is to find a complete set +of simplification rules that keep the sizes of derivatives uniformly small. For reasons beyond this report, it turns out that a complete set of @@ -120,9 +121,9 @@ generate; they encode how a regular expression matched a string.} We already know that the lexing algorithm using bitsequences but \emph{without} simplification is correct, albeilt horribly -slow. Therefore in the past 6 months we were trying to prove that the +slow. Therefore in the past 6 months I was trying to prove that the algorithm using bitsequences plus our simplification rules is -correct. Formally this amounts to show that +also correct. Formally this amounts to show that \begin{equation}\label{mainthm} \blexers \; r \; s = \blexer \;r\;s @@ -131,12 +132,12 @@ \noindent whereby $\blexers$ simplifies (makes derivatives smaller) in each step, whereas with $\blexer$ the size can grow exponentially. This -would be an important milestone for the thesis, because we already +would be an important milestone for my thesis, because we already have a very good idea how to establish that our set our simplification rules keeps the size of derivativs below a relatively tight bound. -In order to prove the main theorem in \eqref{mainthm}, we need to prove the -two functions produce the same output. The definition of these functions +In order to prove the main theorem in \eqref{mainthm}, we need to prove that the +two functions produce the same output. The definition of these two functions is shown below. \begin{center} @@ -161,11 +162,11 @@ \noindent In these definitions $(r^\uparrow)$ is a kind of coding function that is the same in each case, similarly the decode and the \textit{bmkeps} -are functions that are the same in each case. Our main theorem -\eqref{mainthm} therefore boils down to proving the following two -propositions (depending on which branch the if-else clause takes). It -establishes how the derivatives \emph{with} simplification do not -change the computed result: +are functions that are the same in each case. Our main +theorem~\eqref{mainthm} therefore boils down to proving the following +two propositions (depending on which branch the if-else clause +takes). They establish how the derivatives \emph{with} simplification +do not change the computed result: \begin{itemize} \item{(a)} If a string $s$ is in the language of $L(r)$, then \\ @@ -175,7 +176,7 @@ \end{itemize} \noindent -We have already proved in Isabelle the second part. This is actually +We have already proved the second part in Isabelle. This is actually not too difficult because we can show that simplification does not change the language of simplified regular expressions. @@ -208,6 +209,8 @@ %what we are doing in the past 6 months. %\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt} +\subsubsection*{Existing Proof} + For this we have started with looking at the original proof that established that the bitsequence algorrithm produces the same result as the algorithm not using bitsequences. Formally this proof @@ -224,7 +227,7 @@ %\end{center} \noindent -This proof used two ``tricks''. One is that it defined a \flex-function +The proof uses two ``tricks''. One is that it uses a \flex-function \begin{center} \begin{tabular}{lcl} @@ -234,14 +237,14 @@ \end{center} \noindent -and then proved for the right-hand side in \eqref{lexer} +and then proves for the right-hand side in \eqref{lexer} \begin{center} -$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; r\backslash s)$ +$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$ \end{center}. -\noindent\rule[1.5ex]{\linewidth}{1pt} + \noindent The $\flex$-function essentially does lexing by @@ -250,45 +253,60 @@ explicitly showing the order of characters being injected back in each step. With $\flex$ we can write $\lexer$ this way: + \begin{center} $\lexer \;r\; s = \flex \;id \; r\;s \;(\mkeps r\backslash s)$ \end{center} + +%\noindent +%$\flex$ focuses on the injections instead of the derivatives , +%compared to the original definition of $\lexer$, which puts equal +%amount of emphasis on injection and derivative with respect to each +%character: + +%\begin{center} +%\begin{tabular}{lcl} +%$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; %\textit{of}$ \\ +% & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\ +% & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\ +%$\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps% (r) \; \textit{else} \;None$ +%\end{tabular} +%\end{center} + \noindent -$\flex$ focuses on - the injections instead -of the derivatives , -compared -to the original definition of $\lexer$, -which puts equal amount of emphasis on -injection and derivative with respect to each character: -\begin{center} -\begin{tabular}{lcl} -$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; \textit{of}$ \\ - & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\ - & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\ -$\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps (r) \; \textit{else} \;None$ -\end{tabular} -\end{center} -\noindent -Using this feature of $\flex$ we can rewrite the lexing -$w.r.t \; s @ [c]$ in term of lexing -$w.r.t \; s$: +The crux in the existing proof is how $\flex$ relates to injection, namely + \begin{center} $\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (inj \; (r\backslash s) \; c\; v)$. \end{center} + \noindent -this allows us to use -the inductive hypothesis to get +This property allows one to rewrite an induction hypothesis like + \begin{center} $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$ \end{center} + +\noindent\rule[1.5ex]{\linewidth}{4pt} +There is no mention of retrieve yet .... this is the second trick in the +existing proof. I am not sure whether you need to explain annotated regular +expressions much earlier - maybe before the ``existing proof'' section, or +evan earlier. + +\noindent\rule[1.5ex]{\linewidth}{4pt} + \noindent By using a property of retrieve we have the $\textit{RHS}$ of the above equality is $decode (retrieve (r^\uparrow \backslash(s @ [c])) v) r$, and this gives the main lemma result: + \begin{center} $ \flex \;r\; id \; (s@[c]) \; v =\textit{decode}(\textit{retrieve} (\rup \backslash (s@[c])) \;v) r$ \end{center} + + + + \noindent To use this lemma result for our correctness proof, simply replace the $v$ in the