fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et
al.\ report that they have found thousands of such evil regular
expressions in the JavaScript and Python ecosystems \cite{Davis18}.

\comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
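This catastrophic behaviour is easy to reproduce. A minimal Python
sketch using the built-in \texttt{re} module (kept to a deliberately
short string so that it terminates promptly):

```python
import re

# A backtracking matcher explores exponentially many ways of splitting
# the a's between the two stars before it concludes that no b follows.
evil = re.compile(r'(a*)*b')

# 15 a's already forces tens of thousands of backtracking steps;
# with around 28 a's the search takes tens of seconds.
print(evil.match('a' * 15))   # None -- the expected b is missing
```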
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to treat bounded repetitions, negation and
back-references efficiently).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}.  Consider for
example a regular expression $r_{key}$ for recognising keywords such as
\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
for recognising identifiers (say, a single character followed by
characters or numbers). One can then form the compound regular
expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
then how should the string \textit{iffoo} be tokenised?  It could be
tokenised as a keyword followed by an identifier, or the entire string
as a single identifier.  Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword?  While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, precise definitions of what POSIX matching actually means have
been formalised only relatively recently
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
definition has also been given by Sulzmann and Lu \cite{Sulzmann2014},
but the corresponding correctness proof turned out to be faulty
\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
the longest initial substring. In the case of a tie, the initial
sub-match is chosen according to some priorities attached to the regular
expressions (e.g.~keywords have a higher priority than identifiers).
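To make these two rules concrete, the following Python sketch implements
a POSIX-style tokeniser for the $r_{key}$/$r_{id}$ example; the concrete
regular expressions, names and token labels are our illustrative
choices. At each position it takes the longest initial match, and on a
tie the higher-priority keyword wins.

```python
import re

# Illustrative instances of r_key and r_id from the text.
R_KEY = re.compile(r'if|then|else')
R_ID  = re.compile(r'[a-z][a-z0-9]*')

def tokenise(s):
    """POSIX-style lexing: longest initial match; on a tie the
    keyword has priority over the identifier."""
    tokens, i = [], 0
    while i < len(s):
        mk = R_KEY.match(s, i)
        mi = R_ID.match(s, i)
        lk = len(mk.group()) if mk else -1
        li = len(mi.group()) if mi else -1
        if lk == -1 and li == -1:
            raise ValueError(f'lexing error at position {i}')
        if lk >= li:                     # tie broken towards the keyword
            tokens.append(('key', mk.group())); i += lk
        else:
            tokens.append(('id', mi.group())); i += li
    return tokens

print(tokenise('iffoo'))  # [('id', 'iffoo')] -- one long identifier
print(tokenise('if'))     # [('key', 'if')]   -- keyword wins the tie
```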
This sounds rather simple, but according to Grathwohl et al \cite[Page
36]{CrashCourse2014} this is not the case. They wrote:

\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of 
the dependence on information about the length of matched strings in the 
various subexpressions.''
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.


Now if we want to find out whether a string $s$ matches a regular
expression $r$, we can build the derivatives of $r$ w.r.t.\ (in
succession) all the characters of the string $s$. Finally, we test
whether the resulting regular expression can match the empty string. If
yes, then $r$ matches $s$; if no, it does not. To implement this idea we
can generalise the derivative operation to strings like this:

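As an illustration, the derivative operation and the matcher it yields
can be transcribed into Python roughly as follows (a sketch over a
simple tuple encoding of our own, not the implementation discussed in
the text):

```python
# Regular expressions as tagged tuples (an encoding of our own):
# ('0',) ZERO, ('1',) ONE, ('c', ch), ('+', r1, r2), ('.', r1, r2), ('*', r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        if nullable(r[1]):
            return ('+', ('.', der(c, r[1]), r[2]), der(c, r[2]))
        return ('.', der(c, r[1]), r[2])
    if tag == '*': return ('.', der(c, r[1]), r)

def ders(s, r):
    """Derivative w.r.t. a whole string, one character at a time."""
    for c in s:
        r = der(c, r)
    return r

def matches(r, s):
    return nullable(ders(s, r))

# (a*)* . b decides short non-matching strings without backtracking
# (without simplification the derivatives themselves can grow quickly,
# which is addressed later in the text).
r = ('.', ('*', ('*', ('c', 'a'))), ('c', 'b'))
print(matches(r, 'aaa'))    # False
print(matches(r, 'aaab'))   # True
```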

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:


\begin{center}
		\end{tabular}
	\end{tabular}
\end{center}

\noindent
No value corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
$\Char$ to the character regular expression; $\Seq$ to the sequence
regular expression and so on. The idea of values is to encode a kind of
lexical value for how the sub-parts of a regular expression match the
sub-parts of a string. To see this, suppose a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
flatten, we can describe how values encode \comment{Avoid the notion
parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree
with two child nodes that tells how the string $|v_1| @ |v_2|$ matches
the regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$
and, respectively, $r_2$ matches the substring $|v_2|$. Exactly how
these two are matched is contained in the child nodes $v_1$ and $v_2$
of the parent $\textit{Seq}$. 
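These values and the flatten operation can be rendered as a small Python
sketch (the constructor names follow the table above; the encoding
itself is our own):

```python
from dataclasses import dataclass

@dataclass
class Empty: pass                       # matches the empty string
@dataclass
class Char:  c: str                     # matches a single character
@dataclass
class Seq:   v1: object; v2: object     # how r1 . r2 matched
@dataclass
class Left:  v: object                  # the left alternative was taken
@dataclass
class Right: v: object                  # the right alternative was taken
@dataclass
class Stars: vs: list                   # one value per star iteration

def flatten(v):
    """The |v| operation: extract the underlying string of a value."""
    if isinstance(v, Empty): return ''
    if isinstance(v, Char):  return v.c
    if isinstance(v, Seq):   return flatten(v.v1) + flatten(v.v2)
    if isinstance(v, (Left, Right)): return flatten(v.v)
    if isinstance(v, Stars): return ''.join(flatten(w) for w in v.vs)

# |Seq (Char x) (Char y)| is the string xy:
print(flatten(Seq(Char('x'), Char('y'))))   # xy
```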

To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, working from right to left. For the first value $v_n$, we
call the function $\textit{mkeps}$, which builds the
\comment{Avoid}parse tree for how the empty string has been matched by
the (nullable) regular expression $r_n$. This function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 

\noindent There are no cases for $\ZERO$ and $c$, since these regular
expressions cannot match the empty string. Note also that in case of
alternatives we give preference to the regular expression on the
left-hand side. This will become important later on for the value that
is calculated.
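As a sketch, $\textit{mkeps}$ can be rendered in Python over a simple
tuple encoding of regular expressions and values (the encoding is our
own; the clauses follow the definition above):

```python
# Regexes: ('0',) ZERO, ('1',) ONE, ('c', ch), ('+', r1, r2),
#          ('.', r1, r2), ('*', r); values are tagged tuples too.

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def mkeps(r):
    """A value showing how a nullable r matches the empty string.
    There are no clauses for ZERO and characters: not nullable."""
    tag = r[0]
    if tag == '1':
        return ('Empty',)
    if tag == '+':                      # preference for the left side
        if nullable(r[1]):
            return ('Left', mkeps(r[1]))
        return ('Right', mkeps(r[2]))
    if tag == '.':
        return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == '*':
        return ('Stars', [])            # zero iterations

# The left alternative is preferred whenever it is nullable:
print(mkeps(('+', ('1',), ('*', ('c', 'a')))))   # ('Left', ('Empty',))
```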

After the $\mkeps$-call, we inject back the characters one by one in
order to build the parse tree $v_i$ for how the regex $r_i$ matches the
string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the previous parse tree
$v_{i+1}$. After injecting back $n$ characters, we get the parse tree
for how $r_0$ matches $s$. For this Sulzmann and Lu defined a function
that reverses the ``chopping off'' of characters during the derivative
phase. The
r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.
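Putting derivatives, $\textit{mkeps}$ and $\textit{inj}$ together yields
the whole lexing algorithm. The following self-contained Python sketch
(again over an illustrative tuple encoding of our own) can be checked
against the worked example that follows: lexing the string $abc$ with
$((((a+b)+ab)+c)+abc)^*$ produces the value
$\Stars\,[\Right(\Seq(a, \Seq(b, c)))]$.

```python
# Regexes: ('0',), ('1',), ('c', ch), ('+', r1, r2), ('.', r1, r2), ('*', r)
# Values:  ('Empty',), ('Char', c), ('Left', v), ('Right', v),
#          ('Seq', v1, v2), ('Stars', [v1, ..., vn])

def nullable(r):
    tag = r[0]
    return (tag in ('1', '*')
            or (tag == '+' and (nullable(r[1]) or nullable(r[2])))
            or (tag == '.' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        if nullable(r[1]):               # the splitting clause
            return ('+', ('.', der(c, r[1]), r[2]), der(c, r[2]))
        return ('.', der(c, r[1]), r[2])
    if tag == '*': return ('.', der(c, r[1]), r)

def mkeps(r):
    tag = r[0]
    if tag == '1': return ('Empty',)
    if tag == '+':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == '.': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == '*': return ('Stars', [])

def inj(r, c, v):
    """Reverse one derivative step: turn a value for der(c, r) into a
    value for r itself, putting the character c back in."""
    tag = r[0]
    if tag == 'c':                       # v is Empty
        return ('Char', c)
    if tag == '+':
        if v[0] == 'Left': return ('Left', inj(r[1], c, v[1]))
        return ('Right', inj(r[2], c, v[1]))
    if tag == '.':
        if v[0] == 'Seq':                # head was not nullable
            return ('Seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'Left':               # v = Left (Seq v1 v2)
            return ('Seq', inj(r[1], c, v[1][1]), v[1][2])
        # v = Right v2: the head matched the empty string
        return ('Seq', mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == '*':                       # v = Seq v' (Stars vs)
        return ('Stars', [inj(r[1], c, v[1])] + v[2][1])

def lex(r, s):
    """Derivatives forward, injections backward (assumes s matches r)."""
    if not s:
        return mkeps(r)
    return inj(r, s[0], lex(der(s[0], r), s[1:]))

# The example from the text: ((((a+b)+ab)+c)+abc)* matched against abc.
a, b, c = ('c', 'a'), ('c', 'b'), ('c', 'c')
ab, abc = ('.', a, b), ('.', a, ('.', b, c))
r = ('*', ('+', ('+', ('+', ('+', a, b), ab), c), abc))
print(lex(r, 'abc'))
# ('Stars', [('Right', ('Seq', ('Char', 'a'),
#                       ('Seq', ('Char', 'b'), ('Char', 'c'))))])
```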

%\comment{Other word: insight?}
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we
shall use $r$ to denote this sub-expression for conciseness):

\[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 

\noindent
Before $\textit{inj}$ is called, our lexer first builds derivatives
using the string $abc$ (we simplified some regular expressions like
$\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if
they are clear from the context):

%Similarly, we allow
%$\textit{ALT}$ to take a list of regular expressions as an argument
%instead of just 2 operands to reduce the nested depth of
%$\textit{ALT}$

\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
      & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
      & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
\end{center}

\noindent
In case $r_3$ is nullable, we can call $\textit{mkeps}$ to construct a
parse tree for how $r_3$ matched the string $abc$. This function gives
the following value $v_3$: 

\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of
$r_3$ (underlined):

\begin{center}\comment{better layout}
   $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
  \cdot r^*) +((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
 \end{center}

\noindent
Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO
+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*$ (which corresponds to the
initial sub-match $abc$) allows $\textit{mkeps}$ to pick it up because
$\textit{mkeps}$ is defined to always choose the left one when it is
nullable. In the case of this example, $abc$ is preferred over $a$ or
$ab$. This $\Left(\Left(\ldots))$ location is generated by two
applications of the splitting clause

\begin{center}
     $(r_1 \cdot r_2)\backslash c \;\; (\textit{when} \; r_1 \;
     \textit{nullable}) = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
\end{center}
       
\noindent
By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
$\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}

\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO +
\ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a
sequence of two nullable regular expressions: the first one is an
alternative, in which we take the rightmost branch---whose language
contains the empty string. The second nullable regular expression is a
Kleene star. $\Stars$ tells us how it generates the nullable regular
expression: by 0 iterations to form $\ONE$. Now $\textit{inj}$ injects
characters back and incrementally
  $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
  \end{center}
   for how $r$ matched $abc$. This completes the algorithm.
   
%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
Readers might have noticed that the parse tree information is actually
already available when doing derivatives. For example, immediately after
the operation $\backslash a$ we know that if we want to match a string
that starts with $a$, we can either take the initial match to be

 \begin{center}
\begin{enumerate}
    \item[1)] just $a$ or
    \item[2)] string $ab$ or 
    \item[3)] string $abc$.
\end{enumerate}
\end{center}

\noindent
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen later
does not affect their relative position because the algorithm does not
change this order. If this parsing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation---if we
store the information for parse trees inside the regular expression,
update it when we take derivatives of them, and collect the information

\section{Simplification of Regular Expressions}

Using bitcodes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression \comment{?}parsing using
DFAs~\cite{nielson11bcre}. Sulzmann and Lu took this idea of bitcodes a
step further by integrating bitcodes into derivatives. The reason why we
want to use bitcodes in this project is that we want to introduce more
aggressive simplification rules in order to keep the size of derivatives
small throughout. This is because the main drawback of building
successive
   761 A partial derivative of a regular expression $r$ is essentially a set of
   771 A partial derivative of a regular expression $r$ is essentially a set of
   762 regular expressions that are either $r$'s children expressions or a
   772 regular expressions that are either $r$'s children expressions or a
   763 concatenation of them. Antimirov has proved a tight bound of the size of
   773 concatenation of them. Antimirov has proved a tight bound of the size of
   764 \emph{all} partial derivatives no matter what the string looks like.
   774 \emph{all} partial derivatives no matter what the string looks like.
   765 Roughly speaking the size will be quadruple in the size of the regular
   775 Roughly speaking the size will be quadruple in the size of the regular
expression.\comment{Are you sure? I have just proved that the sum of
sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
not the best bound.} If we want the size of derivatives in Sulzmann and
Lu's algorithm to stay equal or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + b \cdot c$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a
\cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
\ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive
simplification rules helps us to achieve the same size bound as that of
the partial derivatives. 
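The two simplification examples above can be replayed in a small sketch. This is our own illustration in Python, not the paper's Scala/Isabelle development; the tuple encoding of regular expressions and the helper names (\textit{flatten\_alts}, \textit{distinct}, \textit{distribute}) are assumptions made for the example.

```python
# A sketch (ours, in Python) of the two simplification ideas just
# described: opening up alternatives and removing duplicates.
# Regexes as nested tuples: ("alt", [r1, ...]), ("seq", r1, r2),
# ("star", r), the constants "ZERO"/"ONE", or a plain character.

def flatten_alts(rs):
    """Splice nested alternatives into one flat list, dropping ZEROs."""
    out = []
    for r in rs:
        if isinstance(r, tuple) and r[0] == "alt":
            out.extend(flatten_alts(r[1]))
        elif r != "ZERO":
            out.append(r)
    return out

def distinct(rs):
    """Keep only the first occurrence of each alternative."""
    seen, out = set(), []
    for r in rs:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def distribute(r):
    """Open up (r1+r2)*r3 into r1*r3 + r2*r3 (one level only)."""
    if isinstance(r, tuple) and r[0] == "seq" and \
       isinstance(r[1], tuple) and r[1][0] == "alt":
        return ("alt", [("seq", ri, r[2]) for ri in r[1][1]])
    return r

# (a+b)c + bc  ~>  ac + bc + bc  ~>  ac + bc
r1 = ("alt", [("seq", ("alt", ["a", "b"]), "c"), ("seq", "b", "c")])
opened = flatten_alts([distribute(x) for x in r1[1]])
assert distinct(opened) == [("seq", "a", "c"), ("seq", "b", "c")]

# (a*+a) + (a*+1) + (a+1)  ~>  a* + a + 1
r2 = ("alt", [("alt", [("star", "a"), "a"]),
              ("alt", [("star", "a"), "ONE"]),
              ("alt", ["a", "ONE"])])
assert distinct(flatten_alts(r2[1])) == [("star", "a"), "a", "ONE"]
```

Note that \textit{distinct} keeps the \emph{first} copy of each alternative, which matters for preserving the POSIX-preferred branch.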

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\text{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

bs ::= [] \mid b:bs    
$
\end{center}

\noindent
The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 

     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}    
\end{center}  

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:\comment{You need to be consistent with ALTS and ALT; ALT
is just an abbreviation; derivations and so on are defined for
ALTS}\comment{no, this is not the case: ALT is for two regexes, ALTS for a
list...\textcolor{blue}{This does not make sense to me. CU}}

 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
then the right branch of $ALTS$, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1) \; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $bs$, which was initially attached to the head of $SEQ$, has
now been elevated to the top-level of $ALT$, as this information will be
needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:

  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
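As a concrete illustration of this generalisation, here is a minimal sketch. It is our own Python rendering for plain, un-annotated regular expressions under an ad-hoc tuple encoding; the annotated version folds $\textit{bder}$ over the string in exactly the same way.

```python
# A sketch (ours, in Python) of lifting the character derivative to
# strings for plain regexes: r \ s is a fold of der over s.
from functools import reduce

# Regexes: "ZERO", "ONE", a single character, ("alt", r1, r2),
# ("seq", r1, r2) or ("star", r).

def nullable(r):
    if r == "ONE":
        return True
    if r == "ZERO" or isinstance(r, str):    # a single character
        return False
    tag = r[0]
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return True                              # star

def der(c, r):
    """Derivative w.r.t. a single character c."""
    if r in ("ZERO", "ONE"):
        return "ZERO"
    if isinstance(r, str):
        return "ONE" if r == c else "ZERO"
    tag = r[0]
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    return ("seq", der(c, r[1]), r)          # star

def ders(s, r):
    """Derivative w.r.t. a string: fold the character derivative."""
    return reduce(lambda acc, c: der(c, acc), s, r)

r = ("seq", "a", ("star", "b"))              # a . b*
assert nullable(ders("abb", r))              # "abb" matches a . b*
assert not nullable(ders("ba", r))           # "ba" does not
```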

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
When it detects that the regular expression is an alternative or a
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
Its recursive definition is given below:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
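The clauses of $\textit{flatten}$ can be sketched as follows. This is our own Python rendering under an assumed tuple encoding $(\textit{ALTS}, bs, as)$ of annotated alternatives; $\textit{fuse}$ here is a simplified stand-in for the bit-attaching function, not the formal development's code.

```python
# A sketch (ours, in Python) of flatten: splice nested ALTS, push their
# bits inwards with fuse, and drop ZEROs.

def fuse(bs, a):
    """Attach the bits bs to the front of the annotated regex a."""
    if a == "ZERO":
        return "ZERO"
    tag, bs2, *rest = a
    return (tag, bs + bs2, *rest)

def flatten(as_):
    if not as_:
        return []
    a, rest = as_[0], as_[1:]
    if a == "ZERO":                          # ZEROs are removed
        return flatten(rest)
    if isinstance(a, tuple) and a[0] == "ALTS":
        # splice the nested alternative, pushing its bits inwards
        return [fuse(a[1], c) for c in a[2]] + flatten(rest)
    return [a] + flatten(rest)

# a + (b + c) with empty bitcodes opens up to a + b + c
CHAR = lambda c: ("CHAR", [], c)
nested = [CHAR("a"), ("ALTS", [], [CHAR("b"), CHAR("c")]), "ZERO"]
assert flatten(nested) == [CHAR("a"), CHAR("b"), CHAR("c")]
```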

Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.~character to derivative
  Figure 6 yields POSIX parse trees. We have tested this claim
  extensively by using the method in Figure~3 as a reference but yet
  have to work out all proof details.''
\end{quote}  

\noindent We would like to settle this correctness claim. It is relatively
straightforward to establish that after one simplification step, the part of a
nullable derivative that corresponds to a POSIX value remains intact and can
still be collected; in other words, we can show that\comment{Double-check....I
think this is not the case}
%\comment{If i remember correctly, you have proved this lemma.
%I feel this is indeed not true because you might place arbitrary 
%bits on the regex r, however if this is the case, did i remember wrongly that
%you proved something like simplification does not affect $\textit{bmkeps}$ results?
%Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
%to a regex. Maybe it works now.}

\begin{center}
$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a)$ \quad (provided $a$ is $\textit{bnullable}$ and $\textit{decode}(r, \textit{bmkeps}(a))$ is a $\textit{POSIX}$ value)
\end{center}

\comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is 
$\textit{bnullable}$}}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
	$\textit{bmkeps} \; \textit{blexer}_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
\end{center}\comment{This is not true either...look at the definion blexer/blexer-simp}

\noindent That is, if we build the derivative of the regular expression $r$ and then
simplify it, and repeat this process until we exhaust the string, we get a
regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
information, which is exactly the same as the result $r'$ ($\textit{RHS}$ of the
$\textit{POSIX}$ information of a regular expression and how it is modified,
augmented and propagated 
during simplification in parallel with the regular expression that
has not been simplified in the subsequent derivative operations. To aid this,
we use the helper function \textit{retrieve} described by Sulzmann and Lu: \\definition
of retrieve TODO\comment{Did not read further}\\
This function assembles the bitcode that corresponds to a parse tree for how
the current derivative matches the suffix of the string (the characters that
have not yet appeared, but will appear as the successive derivatives go on).
How do we get this ``future'' information? By the value $v$, which is
computed by a pass of the algorithm that uses