applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which have been put forward by Sulzmann
and Lu. We have extended their rules in order to obtain a tight bound
on the size of regular expressions. We have tested these extended
rules, but have not formally established their correctness. We have
also not yet looked at extended regular expressions, such as bounded
repetitions, negation and back-references.
\end{abstract}

\section{Introduction}

While we believe derivatives of regular expressions are a beautiful
concept (in terms of ease of implementing them in functional
programming languages and in terms of reasoning about them formally),
they have one major drawback: every derivative step can make regular
expressions grow drastically in size. This in turn has a negative
effect on the runtime of the corresponding lexing algorithms. Consider
for example the regular expression $(a+aa)^*$ and the short string
$aaaaaaaaaaaa$. The corresponding derivative already contains 8668
nodes when the derivative is seen as a tree. The reason for the poor
runtime of the lexing algorithms is that they need to traverse such
trees over and over again. The solution is to find a complete set of
simplification rules that keep the sizes of derivatives uniformly
small.
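
To see the blow-up concretely, consider the following minimal Scala
sketch of Brzozowski derivatives (our illustration for this report,
not the project's actual code). The exact node count depends on the
representation, but without simplification the derivative explodes
after only a dozen characters:

\begin{verbatim}
// a sketch of regular expressions and Brzozowski derivatives
abstract class Rexp
case object ZERO extends Rexp                   // matches nothing
case object ONE extends Rexp                    // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// number of nodes when the regular expression is viewed as a tree
def size(r: Rexp): Int = r match {
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1)    => 1 + size(r1)
  case _           => 1
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))  // (a + aa)*
val d = "aaaaaaaaaaaa".foldLeft(r: Rexp)((r1, c) => der(c, r1))
println(size(d))  // exponential growth: thousands of nodes already
\end{verbatim}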

For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as
bitsequences.\footnote{Values are the results the lexing algorithms
  generate; they encode how a regular expression matched a string.}
We already know that the lexing algorithm using bitsequences but
\emph{without} simplification is correct, albeit horribly slow.
Therefore in the past 6 months we have been trying to prove that the
algorithm using bitsequences plus our simplification rules is
correct. Formally this amounts to showing that

\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each
step, whereas with $\blexer$ the size can grow exponentially. This
would be an important milestone for the thesis, because we already
have a very good idea how to establish that our set of simplification
rules keeps the size of derivatives below a relatively tight bound.
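
The effect of interleaving simplification with derivatives can be
illustrated with the sketch from the introduction. The rules below are
only some standard ones on plain regular expressions (the names
\texttt{simp} and \texttt{dersSimp} are ours); $\blexers$ itself works
on bit-annotated regular expressions with a richer rule set, which we
elide here:

\begin{verbatim}
// a few standard simplification rules (not yet a complete set)
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                    // 0 + r  =>  r
    case (s1, ZERO) => s1                    // r + 0  =>  r
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO                  // 0 . r  =>  0
    case (_, ZERO)  => ZERO                  // r . 0  =>  0
    case (ONE, s2)  => s2                    // 1 . r  =>  r
    case (s1, ONE)  => s1                    // r . 1  =>  r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case _ => r
}

// derivative w.r.t. a whole string, simplifying after each step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

// orders of magnitude smaller than the unsimplified derivative
println(size(dersSimp("aaaaaaaaaaaa".toList, r)))
\end{verbatim}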

In order to prove the main theorem in \eqref{mainthm}, we need to
prove that the two functions produce the same output. The definition
of these functions is shown below.

\begin{center}
\begin{tabular}{lcl}
  & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;   \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that
is the same in each case; similarly the \textit{decode} and the
\textit{bmkeps} functions are the same in each case. Our main theorem
\eqref{mainthm} therefore boils down to proving the following two
propositions (depending on which branch the if-else clause takes).
They establish that the derivatives \emph{with} simplification do not
change the computed result:

\begin{itemize}
\item[(a)] If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$,\\
\item[(b)] If a string $s$ is \emph{not} in the language $L(r)$, then 
$\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}
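
\noindent
As a quick sanity check of part (b), its analogue for plain regular
expressions can be tested with the sketches above (the proposition
proper is of course about the bit-coded functions):

\begin{verbatim}
// a string not in L(r) leads to a non-nullable simplified derivative
println(nullable(dersSimp("aaa".toList, r)))  // true:  "aaa" in L((a+aa)*)
println(nullable(dersSimp("aab".toList, r)))  // false: "aab" not in L((a+aa)*)
\end{verbatim}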

\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult because we can show that simplification does not
change the language of a regular expression.

If we can prove the first part, that is, that the bitsequence
algorithm with simplification produces the same result as the one
without simplification, then we are done. Unfortunately that part
requires more effort, because simplification must not only \emph{not}
change the language, but also not change the value (that is, the
computed result). For example, rewriting an alternative $r + r$ to
just $r$ preserves the language, but turns a value of the form
$\textit{Right}(v)$ into plain $v$.

%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
%Do you want to keep this? You essentially want to say that the old
%method used retrieve, which unfortunately cannot be adopted to
%the simplification rules. You could just say that and give an example.
%However you have to think about how you give the example....nobody knows
%about AZERO etc yet. Maybe it might be better to use normal regexes
%like $a + aa$, but annotate bitsequences as subscript like $_1(_0a + _1aa)$.

%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
%REPLY:\\
%Yes, I am essentially saying that the old method
%cannot be adopted without adjustments.
%But this does not mean we should skip
%the proof of the bit-coded algorithm
%as it is still the main direction we are looking into
%to prove things. We are trying to modify
%the old proof to suit our needs, but not give
%it up totally, which is why I believe the old
%proof is fundamental in understanding
%what we are doing in the past 6 months.

%\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}

For this we have started by looking at the original proof that
established that the bitsequence algorithm produces the same result
as the algorithm not using bitsequences. Formally this proof
established

\begin{equation}\label{lexer}
\blexer \; (r^\uparrow)\; s = \lexer \;r \;s
\end{equation}

%\noindent
%might provide us insight into proving 
%\begin{center}
%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
%\end{center}

\noindent
This proof used two ``tricks''. One is that it defined a
$\flex$-function

\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}

\noindent
and then proved for the right-hand side in \eqref{lexer} that

\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id} \; s \;(\mkeps \; (r\backslash s))$.
\end{center}

\noindent
The $\flex$-function essentially does lexing by stacking up injection
functions while doing derivatives, explicitly showing the order in
which characters are injected back in each step.
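
To make the stacking of injection functions concrete, the following
hedged Scala sketch extends the earlier one with values, a
\textit{mkeps}- and an \textit{inj}-function (it assumes the input
string is in $L(r)$; the bit-coded versions are again elided):

\begin{verbatim}
// values record how a regular expression matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// how a nullable regular expression matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}

// inject the character c back into a value for the derivative of r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (CHAR(_), Empty)                 => Chr(c)
}

// flex r f (c::s) = flex (r\c) (fun v => f (inj r c v)) s
// flex r f []     = f
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil     => f
  case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}

// lexer r s = flex r id s (mkeps (r\s)); assumes s is in L(r)
def lexer(r: Rexp, s: List[Char]): Val =
  flex(r, (v: Val) => v, s)(mkeps(s.foldLeft(r)((r1, c) => der(c, r1))))
\end{verbatim}

\noindent
For example, \texttt{lexer(r, "aaa".toList)} returns a
\textit{Stars}-value that records, for each iteration of $(a+aa)^*$,
which alternative was taken.
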
With $\flex$ we can write $\lexer$ this way: 
\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id} \; s \;(\mkeps \; (r\backslash s))$