etnms/etnms.tex
changeset 108 0a0c551bb368
parent 107 b1e365afa29c
child 109 79f347cb8b4d
equal deleted inserted replaced
107:b1e365afa29c 108:0a0c551bb368
   100   at extended regular expressions, such as bounded repetitions,
   100   at extended regular expressions, such as bounded repetitions,
   101   negation and back-references.
   101   negation and back-references.
   102 \end{abstract}
   102 \end{abstract}
   103 
   103 
   104 \section{Recapitulation of Concepts From the Last Report}
   104 \section{Recapitulation of Concepts From the Last Report}
   105 \subsection{The Algorithm by Brzozowski based on Derivatives of Regular
   105 
   106 Expressions}
   106 \subsection*{Regular Expressions and Derivatives}
   107 
   107 
   108 Suppose (basic) regular expressions are given by the following grammar:
   108 Suppose (basic) regular expressions are given by the following grammar:
       
   109 
   109 \[			r ::=   \ZERO \mid  \ONE
   110 \[			r ::=   \ZERO \mid  \ONE
   110 			 \mid  c  
   111 			 \mid  c  
   111 			 \mid  r_1 \cdot r_2
   112 			 \mid  r_1 \cdot r_2
   112 			 \mid  r_1 + r_2   
   113 			 \mid  r_1 + r_2   
   113 			 \mid r^*         
   114 			 \mid r^*         
   114 \]
   115 \]
   115 
   116 
   116 \noindent
   117 \noindent
   117 
   118 The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
   118 The ingenious contribution by Brzozowski is the notion of
   119 regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
   119 \emph{derivatives} of regular expressions.  
   120 $\nullable$ defined below.
       
   121 
   120 \begin{center}
   122 \begin{center}
   121 		\begin{tabular}{lcl}
   123 		\begin{tabular}{lcl}
   122 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   124 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   123 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   125 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   124 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   126 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   152 $c\!::\!s \in L(r)$ holds
   154 $c\!::\!s \in L(r)$ holds
   153 if and only if $s \in L(r\backslash c)$.
   155 if and only if $s \in L(r\backslash c)$.
   154 \end{center}
   156 \end{center}
   155 
   157 
   156 \noindent
   158 \noindent
   157 
   159 We can generalise the derivative operation shown above for single characters
   158 
   160 to strings as follows:
   159 Now we can generalise the derivative operation to strings like this:
       
   160 
   161 
   161 \begin{center}
   162 \begin{center}
   162 \begin{tabular}{lcl}
   163 \begin{tabular}{lcl}
   163 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   164 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
   164 $r \backslash [\,] $ & $\dn$ & $r$
   165 $r \backslash [\,] $ & $\dn$ & $r$
   165 \end{tabular}
   166 \end{tabular}
   166 \end{center}
   167 \end{center}
   167 
   168 
   168 \noindent
   169 \noindent
   169 and then define as  regular-expression matching algorithm: 
   170 and then define Brzozowski's  regular-expression matching algorithm as:
       
   171 
   170 \[
   172 \[
   171 match\;s\;r \;\dn\; nullable(r\backslash s)
   173 match\;s\;r \;\dn\; nullable(r\backslash s)
   172 \]
   174 \]
   173 
   175 
   174 \noindent
   176 \noindent
   175 This algorithm looks graphically as follows:
   177 Assuming the a string is givane as a sequence of characters, say $c_0c_1..c_n$, 
       
   178 this algorithm presented graphically is as follows:
       
   179 
   176 \begin{equation}\label{graph:*}
   180 \begin{equation}\label{graph:*}
   177 \begin{tikzcd}
   181 \begin{tikzcd}
   178 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   182 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
   179 \end{tikzcd}
   183 \end{tikzcd}
   180 \end{equation}
   184 \end{equation}
   185 to test whether the result can match the empty string. It can  be
   189 to test whether the result can match the empty string. It can  be
   186 relatively  easily shown that this matcher is correct  (that is given
   190 relatively  easily shown that this matcher is correct  (that is given
   187 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   191 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   188 
   192 
   189  
   193  
   190 \subsection{Values and the Algorithm by Sulzmann and Lu}
   194 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
   191 
   195 
   192 One limitation of Brzozowski's algorithm is that it only produces a
   196 One limitation of Brzozowski's algorithm is that it only produces a
   193 YES/NO answer for whether a string is being matched by a regular
   197 YES/NO answer for whether a string is being matched by a regular
   194 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
   198 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
   195 to allow generation of an actual matching, called a \emph{value} or
   199 to allow generation of an actual matching, called a \emph{value} or
   223 		\end{tabular}
   227 		\end{tabular}
   224 	\end{tabular}
   228 	\end{tabular}
   225 \end{center}
   229 \end{center}
   226 
   230 
   227 \noindent
   231 \noindent
   228 
       
   229 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   232 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   230 algorithm by a second phase (the first phase being building successive
   233 algorithm by a second phase (the first phase being building successive
   231 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   234 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   232 is generated in case the regular expression matches  the string. 
   235 is generated in case the regular expression matches  the string. 
   233 Pictorially, the Sulzmann and Lu algorithm is as follows:
   236 Pictorially, the Sulzmann and Lu algorithm is as follows:
   300 
   303 
   301 \noindent This definition is by recursion on the ``shape'' of regular
   304 \noindent This definition is by recursion on the ``shape'' of regular
   302 expressions and values. 
   305 expressions and values. 
   303 
   306 
   304 
   307 
   305 \subsection{Simplification of Regular Expressions}
   308 \subsection*{Simplification of Regular Expressions}
   306 
   309 
   307 The main drawback of building successive derivatives according
   310 The main drawback of building successive derivatives according
   308 to Brzozowski's definition is that they can grow very quickly in size.
   311 to Brzozowski's definition is that they can grow very quickly in size.
   309 This is mainly due to the fact that the derivative operation generates
   312 This is mainly due to the fact that the derivative operation generates
   310 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
   313 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
   329 
   332 
   330 If we want the size of derivatives in Sulzmann and Lu's algorithm to
   333 If we want the size of derivatives in Sulzmann and Lu's algorithm to
   331 stay below this bound, we would need more aggressive simplifications.
   334 stay below this bound, we would need more aggressive simplifications.
   332 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   335 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   333 deleting duplicates whenever possible. For example, the parentheses in
   336 deleting duplicates whenever possible. For example, the parentheses in
   334 $(a+b) \cdot c + bc$ can be opened up to get $a\cdot c +  b \cdot c + b
   337 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   335 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   338 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   336 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   339 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   337 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
   340 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
   338 to achieve the same size bound as that of the partial derivatives. 
   341 to achieve the same size bound as that of the partial derivatives. 
   339 
   342 
   340 In order to implement the idea of ``spilling out alternatives'' and to
   343 In order to implement the idea of ``spilling out alternatives'' and to
   341 make them compatible with the $\text{inj}$-mechanism, we use
   344 make them compatible with the $\textit{inj}$-mechanism, we use
   342 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
   345 \emph{bitcodes}. They were first introduced by Sulzmann and Lu.
   343 
   346 Here bits and bitcodes (lists of bits) are defined as:
   344 
   347 
   345 \begin{center}
   348 \begin{center}
   346 		$b ::=   S \mid  Z \qquad
   349 		$b ::=   S \mid  Z \qquad
   347 bs ::= [] \mid b:bs    
   350 bs ::= [] \mid b:bs    
   348 $
   351 $
   571 
   574 
   572 \noindent
   575 \noindent
   573 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   576 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   574 operation from characters to strings (just like the derivatives for un-annotated
   577 operation from characters to strings (just like the derivatives for un-annotated
   575 regular expressions).
   578 regular expressions).
       
   579 
       
   580 
       
   581 \subsection*{Our Simplification Rules}
   576 
   582 
   577 The main point of the bitcodes and annotated regular expressions is that
   583 The main point of the bitcodes and annotated regular expressions is that
   578 we can apply rather aggressive (in terms of size) simplification rules
   584 we can apply rather aggressive (in terms of size) simplification rules
   579 in order to keep derivatives small. We have developed such
   585 in order to keep derivatives small. We have developed such
   580 ``aggressive'' simplification rules and generated test data that show
   586 ``aggressive'' simplification rules and generated test data that show