etnms/etnms.tex
changeset 100 397b31867ea6
parent 95 c969a973fcae
child 101 4a327e70d538
algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it does also
answers with \emph{how} it matches the string.  This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which
have been put forward by Sulzmann and Lu. We have extended their
rules in order to obtain a tight bound on the size of regular expressions.
We have tested these extended rules, but have not yet
formally established their correctness. We also have not yet looked
at extended regular expressions, such as bounded repetitions,
negation and back-references.
\end{abstract}

\section{Introduction}

While we believe derivatives of regular expressions are a beautiful
concept (in terms of the ease of implementing them in a functional
programming language and the ease of reasoning about them formally),
they have one major drawback: every derivative step can make regular
expressions grow drastically in size. This in turn has negative effects
on the runtime of the corresponding lexing algorithms. Consider for
example the regular expression $(a+aa)^*$ and the short string
$aaaaaaaaaaaa$. The size of the corresponding derivative is already
8668 nodes, assuming the derivative is seen as a tree. The reason for
the poor runtime of the lexing algorithms is that they need to traverse
such trees over and over again. The solution is to find a complete set
of simplification rules that keep the sizes of derivatives uniformly
small.
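To make the blow-up concrete, the following small Scala sketch (our own
illustration, not the Isabelle formalisation of this report; all names
are ad hoc) implements plain derivatives together with a tree-size
function. Iterating the derivative of $(a+aa)^*$ over a string of $a$s
shows how quickly the trees grow when no simplification is performed:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// plain derivative, no simplification
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// number of nodes when the regular expression is seen as a tree
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))  // (a + aa)*
val d = "aaaaaaaaaaaa".foldLeft(r: Rexp)((r1, c) => der(c, r1))
println(size(d))  // already runs into the thousands of nodes
\end{verbatim}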

For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as bitsequences.
(Values are the results that the lexing algorithms generate; they
encode how a regular expression matched a string; a small illustration
is given below.) We already know that the lexing algorithm
\emph{without} simplification is correct. Therefore in the past 6
months we have been trying to prove that the algorithm using
bitsequences plus our simplification rules is correct. Formally this
amounts to showing that

\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each step,
whereas with $\blexer$ the size can grow exponentially. This would be an
important milestone, because we already have a very good idea how to
establish that our set of simplification rules keeps the size below a
relatively tight bound.
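To give a rough idea of what values and their bitsequence encodings
look like, here is a small Scala sketch. The constructor names and the
concrete bit convention (0 for a left alternative or a further star
iteration, 1 for a right alternative or the end of a star) are chosen
only for illustration and are not necessarily the ones of our
formalisation:

\begin{verbatim}
// values record HOW a regular expression matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class LeftV(v: Val) extends Val     // Left in the report
case class RightV(v: Val) extends Val    // Right in the report
case class Stars(vs: List[Val]) extends Val

// one possible bit-encoding of values: the bits record which
// alternative was taken and how many iterations a star performed
def code(v: Val): List[Int] = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case LeftV(v1) => 0 :: code(v1)
  case RightV(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(1)
  case Stars(v1 :: vs) => 0 :: (code(v1) ::: code(Stars(vs)))
}

// how (a + aa)* matches "aaa": one iteration with a, one with aa
val v = Stars(List(LeftV(Chr('a')), RightV(Sequ(Chr('a'), Chr('a')))))
println(code(v))  // List(0, 0, 0, 1, 1)
\end{verbatim}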

In order to prove the main theorem \eqref{mainthm}, we need to prove
that the two functions produce the same output. The definitions of
these functions are shown below.

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
  $\blexers \; r\, s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that is
the same in each case; similarly the \textit{decode} and the
\textit{bmkeps} functions. Our main theorem \eqref{mainthm} therefore
boils down to proving the following two propositions (depending on
which branch the if-else clause takes). They establish that the
derivatives \emph{with} simplification do not change the computed
result:

\begin{itemize}
\item{} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\, (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps}\, (r^\uparrow)\backslash s$;\\
\item{} If a string $s$ is \emph{not} in the language $L(r)$, then
$\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}

\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult because we can show that simplification does not
change the language of regular expressions. If we can prove the first
case, that is, that the bitsequence algorithm with simplification
produces the same result as the one without simplification, then we are
done. Unfortunately that part requires more effort, because
simplification does not only need to \emph{not} change the language,
but also not change the value (the computed result).
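As a simple illustration of the difference: the regular expressions
$a + a$ and $a$ accept the same language, but for the string $a$ the
lexer computes the value $\Left(\textit{Char}\;a)$ for the former and
$\textit{Char}\;a$ for the latter. A simplification that rewrites
$a + a$ to $a$ therefore preserves the language, but without the
information recorded in the bitsequences it would lose which branch of
the original regular expression was used in the match.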

\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
Do you want to keep this? You essentially want to say that the old
method used retrieve, which unfortunately cannot be adopted to
the simplification rules. You could just say that and give an example.
However, you have to think about how you give the example\ldots{}nobody
knows about AZERO etc.\ yet. Maybe it might be better to use normal
regexes like $a + aa$, but annotate bitsequences as subscripts like
$_1(_0a + _1aa)$.

\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}
REPLY:\\
Yes, I am essentially saying that the old method
cannot be adopted without adjustments.
But this does not mean we should skip
the proof of the bit-coded algorithm,
as it is still the main direction we are looking into
to prove things. We are trying to modify
the old proof to suit our needs, but not give
it up totally; that is why I believe the old
proof is fundamental to understanding
what we have been doing in the past 6 months.

\bigskip\noindent\rule[1.5ex]{\linewidth}{5pt}

The correctness proof of
\begin{center}
$\blexer \; r^\uparrow  s = \lexer \;r \;s$
\end{center}
\noindent
simplified version of the algorithm,
we face the problem that in the above
equalities,
$\retrieve \; a \; v$ is not always defined.
For example,
$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
is defined, but not $\retrieve \; _{01}a \;\Left(\Empty)$,
though we can extract the same POSIX
bits from the two annotated regular expressions.
The latter might occur when we try to retrieve from
a simplified regular expression using the same value
as the unsimplified one.
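The following Scala sketch mirrors this problem on a tiny fragment of
annotated regular expressions; the constructor names and the bit
representation are chosen only for illustration and are not the
definitions of our formalisation:

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class LeftV(v: Val) extends Val
case class RightV(v: Val) extends Val

// annotated regular expressions carrying bitsequences
abstract class ARexp
case class AONE(bs: List[Int]) extends ARexp
case class ACHAR(bs: List[Int], c: Char) extends ARexp
case class AALT(bs: List[Int], r1: ARexp, r2: ARexp) extends ARexp

// retrieve follows the structure of the value inside the annotated
// regular expression; it is undefined when the two do not fit together
def retrieve(r: ARexp, v: Val): List[Int] = (r, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALT(bs, r1, _), LeftV(v1)) => bs ::: retrieve(r1, v1)
  case (AALT(bs, _, r2), RightV(v1)) => bs ::: retrieve(r2, v1)
}

// defined: the alternative still has the Left branch the value follows
retrieve(AALT(List(0), ACHAR(List(1), 'a'), ACHAR(List(0), 'a')),
         LeftV(Chr('a')))

// fails with a MatchError: after simplification the alternative is
// gone, but the value still refers to a Left branch
// retrieve(ACHAR(List(0, 1), 'a'), LeftV(Chr('a')))
\end{verbatim}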