etnms/etnms.tex
changeset 148 c8ef391dd6f7
parent 145 a7c063981fa5
equal deleted inserted replaced
147:dfcf3fa58d7f 148:c8ef391dd6f7
    89 \begin{document}
    89 \begin{document}
    90 
    90 
    91 \maketitle
    91 \maketitle
    92 
    92 
    93 \begin{abstract}
    93 \begin{abstract}
    94   Brzozowski introduced in 1964 a beautifully simple algorithm for
    94 Regular expressions, a useful concept in computer science
    95   regular expression matching based on the notion of derivatives of
    95 that was initially designed to match string patterns,
    96   regular expressions. In 2014, Sulzmann and Lu extended this
    96 need fast matching algorithms.
    97   algorithm to not just give a YES/NO answer for whether or not a
    97 If we do matching by naively converting
    98   regular expression matches a string, but in case it does also
    98 the regular expression to a
    99   answers with \emph{how} it matches the string.  This is important for
    99 Nondeterministic-Finite-Automaton(NFA) and simulating it,
   100   applications such as lexing (tokenising a string). The problem is to
   100 they can run into trouble on certain evil regular expression and string 
   101   make the algorithm by Sulzmann and Lu fast on all inputs without
   101 pairs.
   102   breaking its correctness. Being fast depends on a complete set of 
   102 This necessitates the introudction of a 
   103   simplification rules, some of which 
   103 method called derivatives of regular expressions\cite{Brzozowski1964}.
   104   have been put forward by Sulzmann and Lu. We have extended their
   104 With some variation introduced by Sulzmann and Lu\cite{Sulzmann2014},
   105   rules in order to obtain a tight bound on the size of regular expressions.
   105 it circumvents the problem of 
   106   We have tested these extended rules, but have not 
   106 catastrphoic backtracking elegantly when
   107   formally established their correctness. We have also not yet looked 
   107 compared to backtracking NFA regex engines.
   108   at extended regular expressions, such as bounded repetitions,
   108 We believe that such a method can help to address 
   109   negation and back-references.
   109 the status-quo where badly written code for 
       
   110 regular expression matching can cause a software grief.
       
   111 We have come up with a set of simplification rules
       
   112 that makes the space and time requirements below a tight bound.
       
   113 We have proved certain properties of the algorithm 
       
   114 and simplification.
       
   115 We have not yet worked out a proof for the correctness
       
   116 although test data suggested this.
   110 \end{abstract}
   117 \end{abstract}
   111 
   118 
   112 
   119 
       
   120 
   113 \section{Introduction}
   121 \section{Introduction}
   114 %Regular expressions' derivatives, which have received 
   122 The following brief introduction gives the background for 
   115 %renewed interest in the new millenium, is a beautiful....
   123 derivatives--how they are originally defined and the 
   116 While we believe derivatives of regular expressions, written
   124 variety\cite{Sulzmann2014} that is more suitable for lexing.
   117 $r\backslash s$, are a beautiful concept (in terms of ease of
   125 Regular expressions can be formalized as follows:\\
   118 implementing them in functional programming languages and in terms of
   126 \begin{center}
   119 reasoning about them formally), they have one major drawback: every
   127 	$r ::= \ZERO | \ONE | c | r_1 +r_2 | r_1 \cdot r_2 | r^*$
   120 derivative step can make regular expressions grow drastically in
   128 \end{center}
   121 size. This in turn has negative effect on the runtime of the
   129 Derivatives of regular expression can be 
   122 corresponding lexing algorithms. Consider for example the regular
   130 seen as the regular expression that corresponds to
   123 expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
   131 the language derived by computing
   124 corresponding derivative contains already 8668 nodes where we assume
   132 the left quotient of the original regular language
   125 the derivative is given as a tree. The reason for the poor runtime of
   133 w.r.t a character.
   126 the derivative-based lexing algorithms is that they need to traverse
   134 Derivative of regular expression is also a 
   127 such trees over and over again. The solution is to find a complete set
   135 regular expression because 
   128 of simplification rules that keep the sizes of derivatives uniformly
   136 the left quotient of a regular language is also regular.
   129 small.
   137 With this we know such a function exist without having
   130 
   138 to worry that it might not be defined.
   131 This has been partially addressed by the function $\blexer_{simp}$,
   139 In fact, regular expressions derivatives
   132 which after  the simplification the $(a+aa)^*$ example's 8000 nodes will be
   140 are not just defined but also simple and elegant.
   133 reduced to just 6 and stays constant in each derivative step.
   141 They are composed of two recursive functions,
   134 The part that still needs more work is the correctness proof of this 
   142 and can be easily explained by intuition.
   135 function, namely,
   143 The function
   136 \begin{equation}\label{mainthm}
   144 $\nullable$ defined below,
   137 \blexers \; r \; s = \blexer \;r\;s
   145 tests whether the empty string belongs
   138 \end{equation}
   146 to the regular expression being checked.
   139 
       
   140 \noindent
       
   141 and this is what this report is mainly about. A condensed 
       
   142 version of the last report will be provided in the next section
       
   143 to help the reader understand the report, and the attempts 
       
   144 on the problem will follow.
       
   145 
       
   146 
       
   147 \section{Recapitulation of Concepts From the Last Report}
       
   148 
       
   149 \subsection*{Regular Expressions and Derivatives}
       
   150 Suppose (basic) regular expressions are given by the following grammar:
       
   151 
       
   152 \[			r ::=   \ZERO \mid  \ONE
       
   153 			 \mid  c  
       
   154 			 \mid  r_1 \cdot r_2
       
   155 			 \mid  r_1 + r_2   
       
   156 			 \mid r^*         
       
   157 \]
       
   158 
       
   159 \noindent
       
   160 The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
       
   161 regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
       
   162 $\nullable$ defined below.
       
   163 
   147 
   164 \begin{center}
   148 \begin{center}
   165 		\begin{tabular}{lcl}
   149 		\begin{tabular}{lcl}
   166 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   150 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
   167 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   151 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
   168 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   152 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
   169 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
   153 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
   170 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
   154 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
   171 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
   155 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
   172 		\end{tabular}
   156 		\end{tabular}
   173 	\end{center}
   157 \end{center}
       
   158 \noindent
       
   159 The empty set does not contain any string and
       
   160 therefore not the empty string, the empty string 
       
   161 regular expression contains the empty string
       
   162 by definition, the character regular expression
       
   163 is the singleton that contains character only,
       
   164 and therefore does not contain the empty string,
       
   165 the alternative regular expression(or "or" expression)
       
   166 might have one of its children regular expressions
       
   167 being nullable and any one of its children being nullable
       
   168 would suffice. The sequence regular expression
       
   169 would require both children to have the empty string
       
   170 to compose an empty string and the Kleene star
       
   171 operation naturally introduced the empty string.
       
   172 All nice and easy.
       
   173 This definition is used in the derivative operation
       
   174 as a condition:
   174 
   175 
   175 \begin{center}
   176 \begin{center}
   176 \begin{tabular}{lcl}
   177 \begin{tabular}{lcl}
   177 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   178 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
   178 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   179 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
   184 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   185 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
   185 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   186 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
   186 \end{tabular}
   187 \end{tabular}
   187 \end{center}
   188 \end{center}
   188 \noindent
   189 \noindent
   189 And defines how a regular expression evolves into
   190 \noindent
       
   191 The function derivative, written $\backslash c$, 
       
   192 defines how a regular expression evolves into
   190 a new regular expression after all the string it contains
   193 a new regular expression after all the string it contains
   191 is chopped off a certain head character $c$.
   194 is chopped off a certain head character $c$.
   192 
   195 The most involved cases are the sequence 
   193 The main property of the derivative operation is that
   196 and star case.
       
   197 The sequence case says that if the first regular expression
       
   198 contains an empty string then second component of the sequence
       
   199 might be chosen as the target regular expression to be chopped
       
   200 off its head character.
       
   201 The star regular expression unwraps the iteration of
       
   202 regular expression and attack the star regular expression
       
   203 to its back again to make sure there are 0 or more iterations
       
   204 following this unfolded iteration.
       
   205 
       
   206 
       
   207 The main property of the derivative operation
       
   208 that enables us to reason about the correctness of
       
   209 an algorithm using derivatives is 
   194 
   210 
   195 \begin{center}
   211 \begin{center}
   196 $c\!::\!s \in L(r)$ holds
   212 $c\!::\!s \in L(r)$ holds
   197 if and only if $s \in L(r\backslash c)$.
   213 if and only if $s \in L(r\backslash c)$.
   198 \end{center}
   214 \end{center}
   230 derivatives until we exhaust the string and then use \textit{nullable}
   246 derivatives until we exhaust the string and then use \textit{nullable}
   231 to test whether the result can match the empty string. It can  be
   247 to test whether the result can match the empty string. It can  be
   232 relatively  easily shown that this matcher is correct  (that is given
   248 relatively  easily shown that this matcher is correct  (that is given
   233 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   249 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   234 
   250 
   235  
   251 Beautiful and simple definition.
       
   252 
       
   253 If we implement the above algorithm naively, however,
       
   254 the algorithm can be excruciatingly slow. For example, when starting with the regular
       
   255 expression $(a + aa)^*$ and building 12 successive derivatives
       
   256 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   257 with more than 8000 nodes (when viewed as a tree). Operations like
       
   258 $\backslash$ and $\nullable$ need to traverse such trees and
       
   259 consequently the bigger the size of the derivative the slower the
       
   260 algorithm. 
       
   261 
       
   262 Brzozowski was quick in finding that during this process a lot useless
       
   263 $\ONE$s and $\ZERO$s are generated and therefore not optimal.
       
   264 He also introduced some "similarity rules" such
       
   265 as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
       
   266 different but language-equivalent sub-regexes to further decrease the size
       
   267 of the intermediate regexes. 
       
   268 
       
   269 More simplifications are possible, such as deleting duplicates
       
   270 and opening up nested alternatives to trigger even more simplifications.
       
   271 And suppose we apply simplification after each derivative step, and compose
       
   272 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
       
   273 \textit{simp}(a \backslash c)$. Then we can build
       
   274 a matcher without having  cumbersome regular expressions.
       
   275 
       
   276 
       
   277 If we want the size of derivatives in the algorithm to
       
   278 stay even lower, we would need more aggressive simplifications.
       
   279 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   280 deleting duplicates whenever possible. For example, the parentheses in
       
   281 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
       
   282 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   283 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
   284 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
       
   285 to achieve a very tight size bound, namely,
       
   286  the same size bound as that of the \emph{partial derivatives}. 
       
   287 
       
   288 Building derivatives and then simplify them.
       
   289 So far so good. But what if we want to 
       
   290 do lexing instead of just a YES/NO answer?
       
   291 This requires us to go back again to the world 
       
   292 without simplification first for a moment.
       
   293 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
       
   294 elegant(arguably as beautiful as the original
       
   295 derivatives definition) solution for this.
       
   296 
   236 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
   297 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
   237 
   298 
   238 One limitation of Brzozowski's algorithm is that it only produces a
   299 
   239 YES/NO answer for whether a string is being matched by a regular
   300 They first defined the datatypes for storing the 
   240 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
   301 lexing information called a \emph{value} or
   241 to allow generation of an actual matching, called a \emph{value} or
       
   242 sometimes also \emph{lexical value}.  These values and regular
   302 sometimes also \emph{lexical value}.  These values and regular
   243 expressions correspond to each other as illustrated in the following
   303 expressions correspond to each other as illustrated in the following
   244 table:
   304 table:
   245 
       
   246 
   305 
   247 \begin{center}
   306 \begin{center}
   248 	\begin{tabular}{c@{\hspace{20mm}}c}
   307 	\begin{tabular}{c@{\hspace{20mm}}c}
   249 		\begin{tabular}{@{}rrl@{}}
   308 		\begin{tabular}{@{}rrl@{}}
   250 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   309 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
   269 		\end{tabular}
   328 		\end{tabular}
   270 	\end{tabular}
   329 	\end{tabular}
   271 \end{center}
   330 \end{center}
   272 
   331 
   273 \noindent
   332 \noindent
       
   333 One regular expression can have multiple lexical values. For example
       
   334 for the regular expression $(a+b)^*$, it has a infinite list of
       
   335 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   336 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   337 $\ldots$, and vice versa.
       
   338 Even for the regular expression matching a certain string, there could 
       
   339 still be more than one value corresponding to it.
       
   340 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   341 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   342 The number of different ways of matching 
       
   343 without allowing any value under a star to be flattened
       
   344 to an empty string can be given by the following formula:
       
   345 \begin{center}
       
   346 	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
       
   347 \end{center}
       
   348 and a closed form formula can be calculated to be
       
   349 \begin{equation}
       
   350 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
       
   351 \end{equation}
       
   352 which is clearly in exponential order.
       
   353 A lexer aimed at getting all the possible values has an exponential
       
   354 worst case runtime. Therefore it is impractical to try to generate
       
   355 all possible matches in a run. In practice, we are usually 
       
   356 interested about POSIX values, which by intuition always
       
   357 match the leftmost regular expression when there is a choice
       
   358 and always match a sub part as much as possible before proceeding
       
   359 to the next token. For example, the above example has the POSIX value
       
   360 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   361 The output of an algorithm we want would be a POSIX matching
       
   362 encoded as a value.
   274 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   363 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   275 algorithm by a second phase (the first phase being building successive
   364 algorithm by a second phase (the first phase being building successive
   276 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   365 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   277 is generated in case the regular expression matches  the string. 
   366 is generated in case the regular expression matches  the string. 
   278 Pictorially, the Sulzmann and Lu algorithm is as follows:
   367 Pictorially, the Sulzmann and Lu algorithm is as follows:
   296 $\textit{nullable}$ or not. If not, we know the string does not match
   385 $\textit{nullable}$ or not. If not, we know the string does not match
   297 $r$ and no value needs to be generated. If yes, we start building the
   386 $r$ and no value needs to be generated. If yes, we start building the
   298 values incrementally by \emph{injecting} back the characters into the
   387 values incrementally by \emph{injecting} back the characters into the
   299 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   388 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   300 algorithm from the right to left. For the first value $v_n$, we call the
   389 algorithm from the right to left. For the first value $v_n$, we call the
   301 function $\textit{mkeps}$, which builds the lexical value
   390 function $\textit{mkeps}$, which builds a POSIX lexical value
   302 for how the empty string has been matched by the (nullable) regular
   391 for how the empty string has been matched by the (nullable) regular
   303 expression $r_n$. This function is defined as
   392 expression $r_n$. This function is defined as
   304 
   393 
   305 	\begin{center}
   394 	\begin{center}
   306 		\begin{tabular}{lcl}
   395 		\begin{tabular}{lcl}
   318 \noindent 
   407 \noindent 
   319 After the $\mkeps$-call, we inject back the characters one by one in order to build
   408 After the $\mkeps$-call, we inject back the characters one by one in order to build
   320 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   409 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   321 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   410 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   322 After injecting back $n$ characters, we get the lexical value for how $r_0$
   411 After injecting back $n$ characters, we get the lexical value for how $r_0$
   323 matches $s$. For this Sulzmann and Lu defined a function that reverses
   412 matches $s$. The POSIX value is maintained throught out the process.
       
   413 For this Sulzmann and Lu defined a function that reverses
   324 the ``chopping off'' of characters during the derivative phase. The
   414 the ``chopping off'' of characters during the derivative phase. The
   325 corresponding function is called \emph{injection}, written
   415 corresponding function is called \emph{injection}, written
   326 $\textit{inj}$; it takes three arguments: the first one is a regular
   416 $\textit{inj}$; it takes three arguments: the first one is a regular
   327 expression ${r_{i-1}}$, before the character is chopped off, the second
   417 expression ${r_{i-1}}$, before the character is chopped off, the second
   328 is a character ${c_{i-1}}$, the character we want to inject and the
   418 is a character ${c_{i-1}}$, the character we want to inject and the
   343 \end{tabular}
   433 \end{tabular}
   344 \end{center}
   434 \end{center}
   345 
   435 
   346 \noindent This definition is by recursion on the ``shape'' of regular
   436 \noindent This definition is by recursion on the ``shape'' of regular
   347 expressions and values. 
   437 expressions and values. 
   348 
   438 The clauses basically do one thing--identifying the ``holes'' on 
   349 
   439 value to inject the character back into.
   350 \subsection*{Simplification of Regular Expressions}
   440 For instance, in the last clause for injecting back to a value
   351 
   441 that would turn into a new star value that corresponds to a star,
   352 The main drawback of building successive derivatives according
   442 we know it must be a sequence value. And we know that the first 
   353 to Brzozowski's definition is that they can grow very quickly in size.
   443 value of that sequence corresponds to the child regex of the star
   354 This is mainly due to the fact that the derivative operation generates
   444 with the first character being chopped off--an iteration of the star
   355 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
   445 that had just been unfolded. This value is followed by the already
   356 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
   446 matched star iterations we collected before. So we inject the character 
   357 are excruciatingly slow. For example when starting with the regular
   447 back to the first value and form a new value with this new iteration
   358 expression $(a + aa)^*$ and building 12 successive derivatives
   448 being added to the previous list of iterations, all under the $Stars$
   359 w.r.t.~the character $a$, one obtains a derivative regular expression
   449 top level.
   360 with more than 8000 nodes (when viewed as a tree). Operations like
   450 
   361 $\textit{der}$ and $\nullable$ need to traverse such trees and
   451 We have mentioned before that derivatives without simplification 
   362 consequently the bigger the size of the derivative the slower the
   452 can get clumsy, and this is true for values as well--they reflect
   363 algorithm. 
   453 the regular expressions size by definition.
   364 
   454 
   365 Fortunately, one can simplify regular expressions after each derivative
   455 One can introduce simplification on the regex and values, but have to
   366 step. 
   456 be careful in not breaking the correctness as the injection 
   367 Various simplifications of regular expressions are possible, such
   457 function heavily relies on the structure of the regexes and values
   368 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   458 being correct and match each other.
   369 \cdot \ONE$, and $r + r$ to just $r$.
   459 It can be achieved by recording some extra rectification functions
   370 Suppose we apply simplification after each derivative step, and compose
   460 during the derivatives step, and applying these rectifications in 
   371 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
   461 each run during the injection phase.
   372 \textit{simp}(a \backslash c)$. Then we can build values without having
   462 And we can prove that the POSIX value of how
   373 a cumbersome regular expression, and  fortunately if we are careful
       
   374 enough in making some extra rectifications, the POSIX value of how
       
   375 regular expressions match strings will not be affected---although is much harder
   463 regular expressions match strings will not be affected---although is much harder
   376 to establish. Some initial results in this regard have been
   464 to establish. Some initial results in this regard have been
   377 obtained in \cite{AusafDyckhoffUrban2016}. 
   465 obtained in \cite{AusafDyckhoffUrban2016}. 
   378 
   466 
   379 
   467 %Brzozowski, after giving the derivatives and simplification,
   380 If we want the size of derivatives in Sulzmann and Lu's algorithm to
   468 %did not explore lexing with simplification or he may well be 
   381 stay even lower, we would need more aggressive simplifications.
   469 %stuck on an efficient simplificaiton with a proof.
   382 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   470 %He went on to explore the use of derivatives together with 
   383 deleting duplicates whenever possible. For example, the parentheses in
   471 %automaton, and did not try lexing using derivatives.
   384 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   472 
   385 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   473 We want to get rid of complex and fragile rectification of values.
   386 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   474 Can we not create those intermediate values $v_1,\ldots v_n$,
   387 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
   475 and get the lexing information that should be already there while
   388 to achieve a very tight size bound, namely,
   476 doing derivatives in one pass, without a second phase of injection?
   389  the same size bound as that of the \emph{partial derivatives}. 
   477 In the meantime, can we make sure that simplifications
   390 And we want to get rid of complex and fragile rectification of values.
   478 are easily handled without breaking the correctness of the algorithm?
   391 
   479 
   392 
   480 Sulzmann and Lu solved this problem by
   393 In order to implement the idea of ``spilling out alternatives'' and to
   481 introducing additional informtaion to the 
   394 make them compatible with the $\textit{inj}$-mechanism, we use
   482 regular expressions called \emph{bitcodes}.
   395 \emph{bitcodes}. They were first introduced by Sulzmann and Lu.
   483 
   396 Here bits and bitcodes (lists of bits) are defined as:
   484 \subsection*{Bit-coded Algorithm}
       
   485 Bits and bitcodes (lists of bits) are defined as:
   397 
   486 
   398 \begin{center}
   487 \begin{center}
   399 		$b ::=   1 \mid  0 \qquad
   488 		$b ::=   1 \mid  0 \qquad
   400 bs ::= [] \mid b::bs    
   489 bs ::= [] \mid b::bs    
   401 $
   490 $
   672 \noindent
   761 \noindent
   673 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   762 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   674 operation from characters to strings (just like the derivatives for un-annotated
   763 operation from characters to strings (just like the derivatives for un-annotated
   675 regular expressions).
   764 regular expressions).
   676 
   765 
       
   766 Remember tha one of the important reasons we introduced bitcodes
       
   767 is that they can make simplification more structured and therefore guaranteeing
       
   768 the correctness.
   677 
   769 
   678 \subsection*{Our Simplification Rules}
   770 \subsection*{Our Simplification Rules}
   679 
   771 
   680 The main point of the bitcodes and annotated regular expressions is that
   772 In this section we introduce aggressive (in terms of size) simplification rules
   681 we can apply rather aggressive (in terms of size) simplification rules
   773 on annotated regular expressions
   682 in order to keep derivatives small. We have developed such
   774 in order to keep derivatives small. Such simplifications are promising
   683 ``aggressive'' simplification rules and generated test data that show
   775 as we have
   684 that the expected bound can be achieved. Obviously we could only
   776 generated test data that show
       
   777 that a good tight bound can be achieved. Obviously we could only
   685 partially cover  the search space as there are infinitely many regular
   778 partially cover  the search space as there are infinitely many regular
   686 expressions and strings. 
   779 expressions and strings. 
   687 
   780 
   688 One modification we introduced is to allow a list of annotated regular
   781 One modification we introduced is to allow a list of annotated regular
   689 expressions in the $\sum$ constructor. This allows us to not just
   782 expressions in the $\sum$ constructor. This allows us to not just
  1754 regular expression that is simplified at the final moment.
  1847 regular expression that is simplified at the final moment.
  1755 We are almost there, but a last step is needed to make the proof work.
  1848 We are almost there, but a last step is needed to make the proof work.
  1756 Hopefully in the next few weeks we will be able to find one.
  1849 Hopefully in the next few weeks we will be able to find one.
  1757 This would be an important milestone for my dissertation.
  1850 This would be an important milestone for my dissertation.
  1758 
  1851 
       
  1852 
       
  1853 \section{Future Plans}
       
  1854 
       
  1855 Before the proof comes out, a plan is needed to make 
       
  1856 sure some progress is happening.
       
  1857 We plan to get
       
  1858 a practical implementation of our current
       
  1859 method, which would involve  extending the set of allowed syntax
       
  1860 to include common regular expression constructs
       
  1861 such as bounded repitions, negation(complement) 
       
  1862 and character range or even Boolean functions.
       
  1863 Adding these concise way of expressing regular expressions
       
  1864 while keeping the correctness of the simplification
       
  1865 makes the work more practical.
       
  1866 For example, using the added constructs we can 
       
  1867 demonstrate that our implementation can actually 
       
  1868 help the cloudflare web-application firewall
       
  1869 to run smoothly even if it had been fed with 
       
  1870 evil regular expression 
       
  1871 $.*(?:.*=.*)))$
       
  1872 and strings like $x=xxxxxxxxxx....$.
       
  1873 This could help prove that our algorithm is not just 
       
  1874 some nice theory work but is actually competent in reality
       
  1875 as this regular expression and string pair made the cloudflare
       
  1876 servers to grind
       
  1877 to a halt on July 2nd, 2019.
       
  1878 After all, simply the "basic" regular expressions
       
  1879 we gave in the introduction is really so basic, that
       
  1880 even Brzozowski's 1964 paper has included more 
       
  1881 varieties than that(Boolean functions instead of 
       
  1882 just alternatives)\cite{Brzozowski1964}.
       
  1883 A function only implementing the basic ones
       
  1884 would lack practical interests.
       
  1885 Making the algorithm practical also means to make it fast.
       
  1886 Right now the Scala implementation
       
  1887 is slower than the naive algorithm without bitcodes, even with those
       
  1888 drastic simplification measures, which is not suggested by theory.
       
  1889 We might need to implement it in other languages such as C
       
  1890 to see if the problem is language specific.
       
  1891 This is about 3 months' work.
       
  1892 
       
  1893 We believe that there
       
  1894 is still potential for more simplification as there are a lot
       
  1895 of rules for regular expression similarity, however,
       
  1896 this needs to be explored and checked.
       
  1897 Looking for maximal simplification and 
       
  1898 best size bound and proving them would take around 3 months to complete.
       
  1899 
       
  1900 Context-free languages are a bit harder to deal with by derivatives
       
  1901 when compared with regular expressions.
       
  1902 The main reason is that the child node of a context grammar
       
  1903 can be the grammar itself--this left-recursion makes 
       
  1904 the computation procedure an impossibility when trying to solve directly.
       
  1905 An example would be the following\cite{might2011cfgder}:
       
  1906 \begin{center}
       
  1907 $L = L \cdot x | \ONE$
       
  1908 \end{center}
       
  1909 then a derivative would give us the following:
       
  1910 \begin{center}
       
  1911 $L\backslash x = (L \backslash x)\cdot x | \ONE$
       
  1912 \end{center}
       
  1913 which is essentially what we have previously.
       
  1914 This recursion can go on forever if we do not use some 
       
  1915 other methods such as introducing methods
       
  1916 or manually decide that the two equations 
       
  1917 are actually equivalent and $L$ and $L\backslash x$
       
  1918 denote the same language. And then give a mathematical proof
       
  1919 that this is actually the case.
       
  1920 This seems well beyond any mechanical algorithm one can 
       
  1921 expect to be able to do.....
       
  1922 People have tried various ways to deal with the problems
       
  1923 of looping when computing derivatives of context-free grammars
       
  1924 such as memoization and laziness, with some degree of success\cite{might2011cfgder}.
       
  1925 However this field seems largely unexplored and further 
       
  1926 optimizations seem possible. It would be great
       
  1927 if we could find a simple augmentation to the current 
       
  1928 derivative method so that the subset of context free language 
       
  1929 of interest can be handled.
       
  1930 This may take longer than the two previous tasks, but we still
       
  1931 give it 3 months time and see how it goes.
       
  1932 
       
  1933 
       
  1934 
       
  1935  
       
  1936 
       
  1937 
  1759 \bibliographystyle{plain}
  1938 \bibliographystyle{plain}
  1760 \bibliography{root}
  1939 \bibliography{root}
  1761 
  1940 
  1762 
  1941 
  1763 \end{document}
  1942 \end{document}