cst_tests: changeset 77:058133a9ffe0

--- a/ninems/ninems.tex	Tue Jul 16 22:20:11 2019 +0100
+++ b/ninems/ninems.tex	Thu Jul 18 15:06:50 2019 +0100
@@ -178,10 +178,10 @@
 \noindent These are clearly abysmal and possibly surprising results. One
 would expect these systems to do  much better than that---after all,
 given a DFA and a string, deciding whether a string is matched by this
-DFA should be linear.
+DFA should be linear?
 
 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this exponential behaviour.  Unfortunately, such regular
+exhibit this exponential behaviour.  But unfortunately, such regular
 expressions are not just a few outliers. They are actually 
 frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
@@ -204,32 +204,27 @@
 In this example, the time needed to process the string is not 
 exactly the classical exponential case, but rather $O(n^2)$
 with respect to the string length. But this is enough for the 
-home page of stackexchnge to respond not fast enough to
+home page of Stack Exchange to respond not fast enough to
 the load balancer, which thought that there must be some
 attack and therefore stopped the servers from responding to 
-requests. This makes the whole site become unavailable.
-Another fresh example that just came out of the oven
-is the cloudfare outage incident. A poorly written
-regular expression exhibited exponential behaviour
-and exhausted CPUs that serve HTTP traffic on 2 July 2019,
- resulting in a 27-minute outage. The regular expression contained 
-adjacent Kleene stars, which is the source of the exponential number
-of backtracking possibilities. Although this outage has a series of
-causes that came from different vulnerabilities within the system 
-of cloudfare, the heart of the problem lies within regular expression.
-Such outages could be excluded from happening if the matching 
-algorithm is guaranteed to be fast.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
- The underlying problem is
-that many ``real life'' regular expression matching engines do not use
-DFAs for matching. 
-This is because they support regular expressions that
-are not covered by the classical automata theory, and in this more
-general setting there are quite a few research questions still
-unanswered and fast algorithms still need to be developed (for example
-how to treat bounded repetitions, negation and  back-references
-efficiently).
+requests. This made the whole site become unavailable.
+Another very recent example is a global outage of all Cloudflare servers
+on 2 July 2019. A poorly written regular expression exhibited
+exponential behaviour and exhausted CPUs that serve HTTP traffic.
+Although the outage had several causes, at the heart was a regular
+expression that was used to monitor network
+traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
+
+The underlying problem is that many ``real life'' regular expression
+matching engines do not use DFAs for matching. This is because they
+support regular expressions that are not covered by the classical
+automata theory, and in this more general setting there are quite a few
+research questions still unanswered and fast algorithms still need to be
+developed (for example how to treat bounded repetitions, negation and
+back-references efficiently).
 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
 %how do they avoid dfas exponential states if they use them for fast matching?
+
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
@@ -268,12 +263,12 @@
 \cite{Kuklewicz} who noticed that a number of POSIX regular expression
 matchers calculate incorrect results.
 
-Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
-regular expression matching according to the POSIX strategy
-\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
-Brzozowski from 1964 where he introduced the notion of derivatives of
-regular expressions~\cite{Brzozowski1964}. We shall briefly explain
-this algorithm next.
+Our focus in this project is on an algorithm introduced by Sulzmann and
+Lu in 2014 for regular expression matching according to the POSIX
+strategy \cite{Sulzmann2014}. Their algorithm is based on an older
+algorithm by Brzozowski from 1964 where he introduced the notion of
+derivatives of regular expressions~\cite{Brzozowski1964}. We shall
+briefly explain this algorithm next.
 
 \section{The Algorithm by Brzozowski based on Derivatives of Regular
 Expressions}
@@ -390,12 +385,13 @@
  
 \section{Values and the Algorithm by Sulzmann and Lu}
 
-One limitation, however, of Brzozowski's algorithm is that it only
-produces a YES/NO answer for whether a string is being matched by a
-regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
-algorithm to allow generation of an actual matching, called a
-\emph{value} or sometimes lexical values.  These values and regular expressions correspond to each 
-other as illustrated in the following table:
+One limitation of Brzozowski's algorithm is that it only produces a
+YES/NO answer for whether a string is being matched by a regular
+expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
+to allow generation of an actual matching, called a \emph{value} or
+sometimes also \emph{lexical values}.  These values and regular
+expressions correspond to each other as illustrated in the following
+table:
 
 
 \begin{center}
@@ -440,7 +436,7 @@
 $|v_2|$. Exactly how these two are matched is contained in the
 children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . 
 
- To give a concrete example of how values work, consider the string $xy$
+To give a concrete example of how values work, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative in
@@ -483,18 +479,20 @@
 \end{ceqn}
 
 \noindent
-For convenience, we shall employ the following notations: the regular expression we
-start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
-the characters $c_0$, $c_1$  until we exhaust the string and
-obtain the derivative $r_n$. We test whether this derivative is
+For convenience, we shall employ the following notations: the regular
+expression we start with is $r_0$, and the given string $s$ is composed
+of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
+left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
+to the characters $c_0$, $c_1$  until we exhaust the string and obtain
+the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
 $r$ and no value needs to be generated. If yes, we start building the
-values incrementally by \emph{injecting} back the characters into
-the earlier values $v_n, \ldots, v_0$. For the first value $v_0$, we call the function
-$\textit{mkeps}$, which builds the parse tree for how the empty string
-has been matched by the (nullable) regular expression $r_n$. This function is defined
-as
+values incrementally by \emph{injecting} back the characters into the
+earlier values $v_n, \ldots, v_0$. This is the second phase of the
+algorithm from the right to left. For the first value $v_n$, we call the
+function $\textit{mkeps}$, which builds the parse tree for how the empty
+string has been matched by the (nullable) regular expression $r_n$. This
+function is defined as
 
 	\begin{center}
 		\begin{tabular}{lcl}
@@ -517,18 +515,18 @@
 
 After this, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After
-injecting back $n$ characters, we get the parse tree for how $r_0$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$.
+After injecting back $n$ characters, we get the parse tree for how $r_0$
 matches $s$. For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called $\textit{inj}$; it takes three
-arguments: the first one is a regular expression ${r_{i-1}}$, before the
-character is chopped off, the second is a character ${c_{i-1}}$, the
-character we want to inject and the third argument is the value
-${v_i}$, into which one wants to inject the character (it
-corresponds to the regular expression after the character has been
-chopped off). The result of this function is a new value. The definition
-of $\textit{inj}$ is as follows: 
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value. The
+definition of $\textit{inj}$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
@@ -713,7 +711,7 @@
 Sulzmann and Lu took this idea of bitcodes a step further by integrating
 bitcodes into derivatives. The reason why we want to use bitcodes in
 this project is that we want to introduce more aggressive
-simplifications in order to keep the size of derivatives small
+simplification rules in order to keep the size of derivatives small
 throughout. This is because the main drawback of building successive
 derivatives according to Brzozowski's definition is that they can grow
 very quickly in size. This is mainly due to the fact that the derivative
@@ -723,13 +721,13 @@
 when starting with the regular expression $(a + aa)^*$ and building 12
 successive derivatives w.r.t.~the character $a$, one obtains a
 derivative regular expression with more than 8000 nodes (when viewed as
-a tree). Operations like derivative and $\nullable$ need to traverse
+a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
 such trees and consequently the bigger the size of the derivative the
 slower the algorithm. 
 
 Fortunately, one can simplify regular expressions after each derivative
 step. Various simplifications of regular expressions are possible, such
-as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
+as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
 affect the answer for whether a regular expression matches a string or
 not, but fortunately also do not affect the POSIX strategy of how
@@ -738,9 +736,9 @@
 obtained in \cite{AusafDyckhoffUrban2016}. 
 
 Unfortunately, the simplification rules outlined above  are not
-sufficient to prevent an explosion for all regular expression. We
+sufficient to prevent a size explosion in all cases. We
 believe a tighter bound can be achieved that prevents an explosion in
-all cases. Such a tighter bound is suggested by work of Antimirov who
+\emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
 proved that (partial) derivatives can be bound by the number of
 characters contained in the initial regular expression
 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
@@ -763,22 +761,20 @@
 A partial derivative of a regular expression $r$ is essentially a set of
 regular expressions that are either $r$'s children expressions or a
 concatenation of them. Antimirov has proved a tight bound of the size of
-partial derivatives. Roughly
-speaking the size will be quadruple in the size of the regular expression.
- If we want the size of derivatives 
-to stay below this bound, we would need more aggressive simplifications
-such as opening up alternatives to achieve the maximum level of duplicates
-cancellation.
+\emph{all} partial derivatives no matter what the string looks like.
+Roughly speaking the size will be quadruple in the size of the regular
+expression. If we want the size of derivatives in Sulzmann and Lu's
+algorithm to stay equal or below this bound, we would need more
+aggressive simplifications. Essentially we need to delete useless
+$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
-get $a\cdot c +b \cdot c
-+ b \cdot c$, and then simplified to $a \cdot c+b \cdot c$. Another example is from
-$(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to $a^*+a+\ONE$.
-Adding these more aggressive simplification rules
- helped us to achieve the same size bound as that of the partial derivatives.
-To introduce these "spilling out alternatives" simplifications
- and make the correctness proof easier,
-we used bitcodes. 
-Bitcodes look like this:
+get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a \cdot
+c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a
++\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification
+rules helps us to achieve the same size bound as that of the partial
+derivatives. In order to implement the idea of ``spilling out alternatives''
+and to make them compatible with the $\text{inj}$-mechanism, we use \emph{bitcodes}.
+Bits and bitcodes (lists of bits) are just:
 %This allows us to prove a tight
 %bound on the size of regular expression during the running time of the
 %algorithm if we can establish the connection between our simplification
@@ -790,16 +786,18 @@
 
 
 \begin{center}
-		$b ::=   S \mid  Z \; \;\;
+		$b ::=   S \mid  Z \qquad
 bs ::= [] \mid b:bs    
 $
 \end{center}
-They are just a string of bits, 
-the names $S$ and $Z$  here are quite arbitrary, we can use 0 and 1 
-or any other set of binary symbols to substitute them. 
-Bitcodes(or bit-sequences) are a compact form of parse trees.
-Bitcodes are essentially incomplete values.
-This can be straightforwardly seen in the following transformation: 
+
+\noindent
+The names $S$ and $Z$  here are quite arbitrary in order to avoid 
+confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+bit-lists) can be used to encode values (or incomplete values) in a
+compact form. This can be straightforwardly seen in the following
+coding function from values to bitcodes: 
+
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -813,18 +811,19 @@
 \end{tabular}    
 \end{center} 
 
-Here code encodes a value into a bit-sequence by converting Left into
-$\Z$, Right into $\S$, the start point of a non-empty star iteration
-into $\S$, and the border where a local star terminates into $\Z$. This
-conversion is apparently lossy, as it throws away the character
-information, and does not decode the boundary between the two operands
-of the sequence constructor. Moreover, with only the bitcode we cannot
-even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or
-$\Stars$. The reason for choosing this compact way of storing
-information is that the relatively small size of bits can be easily
-moved around. In order to recover the bitcode back into values, we will
-need the regular expression as the extra information and decode it back
-into value:\\
+\noindent
+Here $\textit{code}$ encodes a value into a bitcodes by converting
+$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
+star iteration into $\S$, and the border where a local star terminates
+into $\Z$. This coding is lossy, as it throws away the information about
+characters, and also does not encode the ``boundary'' between two
+sequence values. Moreover, with only the bitcode we cannot even tell
+whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
+reason for choosing this compact way of storing information is that the
+relatively small size of bits can be easily manipulated and ``moved
+around'' in a regular expression. In order to recover values, we will 
+need the corresponding regular expression as an extra information. This
+means the decoding function is defined as:
 
 
 %\begin{definition}[Bitdecoding of Values]\mbox{}
@@ -856,11 +855,9 @@
 \end{center}    
 %\end{definition}
 
-Sulzmann and Lu's integrated the bitcodes into regular
-expressions to create annotated regular expressions.
-It is by attaching them to the head of every substructure of a
-regular expression\cite{Sulzmann2014}. Annotated regular expressions
- are defined by the following
+Sulzmann and Lu's integrated the bitcodes into regular expressions to
+create annotated regular expressions \cite{Sulzmann2014}.
+\emph{Annotated regular expressions} are defined by the following
 grammar:
 
 \begin{center}
@@ -874,10 +871,13 @@
 \end{tabular}    
 \end{center}  
 %(in \textit{ALT})
+
 \noindent
-where $bs$ stands for bit-sequences, and $a$  for $\bold{a}$nnotated regular expressions. These bit-sequences encode
-information about the (POSIX) value that should be generated by the
-Sulzmann and Lu algorithm. 
+where $bs$ stands for bitcodes, and $a$  for $\bold{a}$nnotated regular
+expressions. We will show that these bitcodes encode information about
+the (POSIX) value that should be generated by the Sulzmann and Lu
+algorithm.
+
 
 To do lexing using annotated regular expressions, we shall first
 transform the usual (un-annotated) regular expressions into annotated
@@ -902,96 +902,103 @@
 %\end{definition}
 
 \noindent
-We use up arrows here to imply that the basic un-annotated regular expressions
-are "lifted up" into something slightly more complex.
-In the fourth clause, $\textit{fuse}$ is an auxiliary function that helps to attach bits to the
-front of an annotated regular expression. Its definition is as follows:
+We use up arrows here to indicate that the basic un-annotated regular
+expressions are ``lifted up'' into something slightly more complex. In the
+fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
+attach bits to the front of an annotated regular expression. Its
+definition is as follows:
 
 \begin{center}
 \begin{tabular}{lcl}
-  $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
-  $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
+  $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
+  $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
      $\textit{ONE}\,(bs\,@\,bs')$\\
-  $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
+  $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
-  $\textit{fuse}\,bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
+  $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
      $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
-  $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
+  $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
-  $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
+  $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
      $\textit{STAR}\,(bs\,@\,bs')\,a$
 \end{tabular}    
 \end{center}  
 
 \noindent
 After internalise we do successive derivative operations on the
- annotated regular expression. This derivative operation is the same as
- what we previously have for the simple regular expressions, except that
- we take special care of the bits :\\
+annotated regular expressions. This derivative operation is the same as
+what we previously have for the simple regular expressions, except that
+we beed to take special care of the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT is just 
+an abbreviation; derivations and so on are defined for ALTS}
 
  %\begin{definition}{bder}
 \begin{center}
   \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
+  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
         $\textit{if}\;c=d\; \;\textit{then}\;
          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
-        $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
+  $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+        $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
+  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
      $\textit{if}\;\textit{bnullable}\,a_1$\\
-  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\
-  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
+  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
+  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
+  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
+  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
+      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
        (\textit{STAR}\,[]\,r)$
 \end{tabular}    
 \end{center}    
 %\end{definition}
 
-For instance, when we unfold $STAR \; bs \; a$ into a sequence, we
-attach an additional bit Z to the front of $r \backslash c$ to indicate
-that there is one more star iteration. The other example, the $SEQ$
-clause is more subtle-- when $a_1$ is $bnullable$(here bnullable is
-exactly the same as nullable, except that it is for annotated regular
-expressions, therefore we omit the definition). Assume that $bmkeps$
-correctly extracts the bitcode for how $a_1$ matches the string prior to
-character c(more on this later), then the right branch of $ALTS$, which
-is $fuse \; bmkeps \;  a_1 (a_2 \backslash c)$ will collapse the regular
-expression $a_1$(as it has already been fully matched) and store the
-parsing information at the head of the regular expression $a_2
-\backslash c$ by fusing to it. The bitsequence $bs$, which was initially
-attached to the head of $SEQ$, has now been elevated to the top-level of
-ALT, as this information will be needed whichever way the $SEQ$ is
-matched--no matter whether c belongs to $a_1$ or $ a_2$. After carefully
-doing these derivatives and maintaining all the parsing information, we
-complete the parsing by collecting the bits using a special $mkeps$
-function for annotated regular expressions--$bmkeps$:
+\noindent
+For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
+we need to attach an additional bit $Z$ to the front of $r \backslash c$
+to indicate that there is one more star iteration. Also the $SEQ$ clause
+is more subtle---when $a_1$ is $\textit{bnullable}$ (here
+\textit{bnullable} is exactly the same as $\textit{nullable}$, except
+that it is for annotated regular expressions, therefore we omit the
+definition). Assume that $bmkeps$ correctly extracts the bitcode for how
+$a_1$ matches the string prior to character $c$ (more on this later),
+then the right branch of $ALTS$, which is $fuse \; bmkeps \;  a_1 (a_2
+\backslash c)$ will collapse the regular expression $a_1$(as it has
+already been fully matched) and store the parsing information at the
+head of the regular expression $a_2 \backslash c$ by fusing to it. The
+bitsequence $bs$, which was initially attached to the head of $SEQ$, has
+now been elevated to the top-level of ALT, as this information will be
+needed whichever way the $SEQ$ is matched--no matter whether $c$ belongs
+to $a_1$ or $ a_2$. After building these derivatives and maintaining all
+the lexing information, we complete the lexing by collecting the
+bitcodes using a generalised version of the $\textit{mkeps}$ function
+for annotated regular expressions, called $\textit{bmkeps}$:
 
 
 %\begin{definition}[\textit{bmkeps}]\mbox{}
 \begin{center}
 \begin{tabular}{lcl}
-  $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
-  $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
+  $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
+  $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
      $\textit{if}\;\textit{bnullable}\,a_1$\\
   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
-  $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
+  $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
-  $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
+  $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
      $bs \,@\, [\S]$
 \end{tabular}    
 \end{center}    
 %\end{definition}
 
 \noindent
-This function completes the parse tree information by 
-travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and
-using S to indicate the end of star iterations. If we take the bits produced by $bmkeps$ and decode it, 
-we get the parse tree we need, the working flow looks like this:\\
+This function completes the value information by travelling along the
+path of the regular expression that corresponds to a POSIX value and
+collecting all the bitcodes, and using $S$ to indicate the end of star
+iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
+decode them, we get the value we expect. The corresponding lexing
+algorithm looks as follows:
+
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{blexer}\;r\,s$ & $\dn$ &
@@ -1001,62 +1008,65 @@
   & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
-Here $(r^\uparrow)\backslash s$ is similar to what we have previously defined for 
-$r\backslash s$.
+
+\noindent
+In this definition $_\backslash s$ is the  generalisation  of the derivative
+operation from characters to strings (just like the derivatives for un-annotated
+regular expressions).
 
-The main point of the bit-sequences and annotated regular expressions
-is that we can apply rather aggressive (in terms of size)
-simplification rules in order to keep derivatives small.  
+The main point of the bitcodes and annotated regular expressions is that
+we can apply rather aggressive (in terms of size) simplification rules
+in order to keep derivatives small. We have developed such
+``aggressive'' simplification rules and generated test data that show
+that the expected bound can be achieved. Obviously we could only
+partially cover  the search space as there are infinitely many regular
+expressions and strings. 
 
-We have
-developed such ``aggressive'' simplification rules and generated test
-data that show that the expected bound can be achieved. Obviously we
-could only partially cover  the search space as there are infinitely
-many regular expressions and strings. One modification we introduced
-is to allow a list of annotated regular expressions in the
-\textit{ALTS} constructor. This allows us to not just delete
-unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
-unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general
-setting). 
-Another modification is that we use simplification rules
-inspired by Antimirov's work on partial derivatives. They maintain the
-idea that only the first ``copy'' of a regular expression in an
-alternative contributes to the calculation of a POSIX value. All
-subsequent copies can be pruned from the regular expression.
+One modification we introduced is to allow a list of annotated regular
+expressions in the \textit{ALTS} constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our  simplification function 
+that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on. 
+Is it $ALT$ or $ALTS$?}\\
 
-A recursive definition of simplification function that looks similar to scala code is given below:\\
 \begin{center}
   \begin{tabular}{@{}lcl@{}}
-  $\textit{simp} \; a$ & $\dn$ & $\textit{a} \; \textit{if} \; a  =  (\textit{ONE} \; bs) \; or\; (\textit{CHAR} \, bs \; c) \; or\; (\textit{STAR}\; bs\; a_1)$\\  
-  
-  $\textit{simp} \; \textit{SEQ}\;bs\,a_1\,a_2$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-  &&$\textit{case} \; (0, \_) \Rightarrow  0$ \\
-   &&$ \textit{case} \; (\_, 0) \Rightarrow  0$ \\
-   &&$ \textit{case} \;  (1, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$ \textit{case} \; (a_1', 1) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$ \textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
+   
+  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
+   &&$\quad\textit{case} \; (0, \_) \Rightarrow  0$ \\
+   &&$\quad\textit{case} \; (\_, 0) \Rightarrow  0$ \\
+   &&$\quad\textit{case} \;  (1, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+   &&$\quad\textit{case} \; (a_1', 1) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
 
-  $\textit{simp} \; \textit{ALTS}\;bs\,as$ & $\dn$ & $\textit{ distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
-  &&$\textit{case} \; [] \Rightarrow  0$ \\
-   &&$ \textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$ \textit{case} \;  as' \Rightarrow  \textit{ALT bs as'}$ 
+  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
+  &&$\quad\textit{case} \; [] \Rightarrow  0$ \\
+   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
+   &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALT}\;bs\;as'$\\ 
+
+   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
 \end{tabular}    
 \end{center}    
 
-The simplification does a pattern matching on the regular expression. When it detected that
-the regular expression is an alternative or sequence, 
-it will try to simplify its children regular expressions
-recursively and then see if one of the children turn 
-into $\ZERO$ or $\ONE$, which might trigger further simplification
- at the current level. The most involved part is the $\textit{ALTS}$
-  clause, where we use two auxiliary functions 
- flatten and distinct to open up nested $\textit{ALTS}$ and 
- reduce as many duplicates as possible.
- Function distinct  keeps the first occurring copy only and 
- remove all later ones when detected duplicates.
- Function flatten opens up nested \textit{ALT}. Its recursive
-  definition is given below:
+\noindent
+The simplification does a pattern matching on the regular expression.
+When it detected that the regular expression is an alternative or
+sequence, it will try to simplify its children regular expressions
+recursively and then see if one of the children turn into $\ZERO$ or
+$\ONE$, which might trigger further simplification at the current level.
+The most involved part is the $\textit{ALTS}$ clause, where we use two
+auxiliary functions flatten and distinct to open up nested
+$\textit{ALTS}$ and reduce as many duplicates as possible. Function
+distinct  keeps the first occurring copy only and remove all later ones
+when detected duplicates. Function flatten opens up nested \textit{ALT}.
+Its recursive definition is given below:
+
  \begin{center}
   \begin{tabular}{@{}lcl@{}}
   $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
@@ -1067,54 +1077,58 @@
 \end{center}  
 
 \noindent
- Here flatten behaves like the traditional functional programming flatten function, except that it also removes $\ZERO$s.
- What it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
+Here flatten behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. What it does is
+basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
 
-Suppose we apply simplification after each derivative step,
-and view these two operations as an atomic one: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$.
-Then we can use the previous natural extension from derivative w.r.t   character to derivative w.r.t string:
+Suppose we apply simplification after each derivative step, and view
+these two operations as an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can use the previous natural
+extension from derivative w.r.t.~character to derivative
+w.r.t.~string:\comment{simp in  the [] case?}
 
 \begin{center}
 \begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp} c) \backslash_{simp} s$ \\
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
 $r \backslash [\,] $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
 
- we get an optimized version of the algorithm:
-\begin{center}
+\noindent
+we obtain an optimised version of the algorithm:
+
+ \begin{center}
 \begin{tabular}{lcl}
   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp} s\;\textit{in}$\\                
+      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
   & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 
-This algorithm effectively keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
+\noindent
+This algorithm keeps the regular expression size small, for example,
+with this simplification our previous $(a + aa)^*$ example's 8000 nodes
+will be reduced to just 6 and stays constant, no matter how long the
+input string is.
 
 
 
 \section{Current Work}
 
-We are currently engaged in two tasks related to this algorithm. 
-
-
-The first one is proving that our simplification rules
-actually do not affect the POSIX value that should be generated by the
-algorithm according to the specification of a POSIX value
- and furthermore obtain a much
-tighter bound on the sizes of derivatives. The result is that our
+We are currently engaged in two tasks related to this algorithm. The
+first task is proving that our simplification rules actually do not
+affect the POSIX value that should be generated by the algorithm
+according to the specification of a POSIX value and furthermore obtain a
+much tighter bound on the sizes of derivatives. The result is that our
 algorithm should be correct and faster on all inputs.  The original
 blow-up, as observed in JavaScript, Python and Java, would be excluded
-from happening in our algorithm.For
-this proof we use the theorem prover Isabelle. Once completed, this
-result will advance the state-of-the-art: Sulzmann and Lu wrote in
-their paper \cite{Sulzmann2014} about the bitcoded ``incremental
-parsing method'' (that is the matching algorithm outlined in this
-section):
+from happening in our algorithm. For this proof we use the theorem prover
+Isabelle. Once completed, this result will advance the state-of-the-art:
+Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
+bitcoded ``incremental parsing method'' (that is the lexing algorithm
+outlined in this section):
 
 \begin{quote}\it
   ``Correctness Claim: We further claim that the incremental parsing
@@ -1125,10 +1139,10 @@
 \end{quote}  
 
 \noindent
-We would settle the correctness claim. It is relatively straightforward
+We would settle this correctness claim. It is relatively straightforward
 to establish that after one simplification step, the part of a nullable
 derivative that corresponds to a POSIX value remains intact and can
-still be collected, in other words,
+still be collected, in other words, we can show that\comment{Double-check....I think this  is not the case}
 
 \begin{center}
 $\textit{bmkeps} \; r = \textit{bmkeps} \; \textit{simp} \; r\;( r\; \textit{nullable})$
@@ -1137,13 +1151,13 @@
 \noindent
 as this basically comes down to proving actions like removing the
 additional $r$ in $r+r$  does not delete important POSIX information in
-a regular expression. The hardcore of this problem is to prove that
+a regular expression. The hard part of this proof is to establish that
 
 \begin{center}
 $\textit{bmkeps} \; \textit{blexer}\_{simp} \; r = \textit{bmkeps} \; \textit{blexer} \; \textit{simp} \; r$
 \end{center}
 
-\noindent
+\noindent\comment{OK from here on you still need to work. Did not read.}
 That is, if we do derivative on regular expression r and the simplified version, 
 they can still provide the same POSIX value if there is one . 
 This is not as straightforward as the previous proposition, as the two regular expressions $r$ and $\textit{simp}\; r$
@@ -1153,9 +1167,13 @@
 as good as a regular expression that has not been simplified in the subsequent derivative operations.
 To aid this, we use the helping function retrieve described by Sulzmann and Lu:
 \\definition of retrieve\\
- This function assembled the bitcode that corresponds to a parse tree for how the current 
- derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value).
- Sulzmann and Lu used this to connect the bitcoded algorithm to the older algorithm by the following equation:
+ 
+This function assembled the bitcode that corresponds to a parse tree for
+how the current derivative matches the suffix of the string(the
+characters that have not yet appeared, but is stored in the value).
+Sulzmann and Lu used this to connect the bitcoded algorithm to the older
+algorithm by the following equation:
+ 
  \begin{center}
  $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$
  \end{center}
author	Christian Urban <urbanc@in.tum.de>
	Thu, 18 Jul 2019 15:06:50 +0100
changeset 77	058133a9ffe0
parent 76	f575cf219377
child 78	a67aff8fb06a