changeset 77:058133a9ffe0 (parent 76:f575cf219377, child 79:481c8000de6d)
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this exponential behaviour. But unfortunately, such regular
expressions are not just a few outliers. They are actually frequent
enough to have a separate name coined for them---\emph{evil regular
expressions}. In empirical work, Davis et al.~report that they have
found thousands of such evil regular expressions in the JavaScript and
Python ecosystems \cite{Davis18}.
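To make the problem tangible, a small experiment along the following
lines reproduces the blow-up with the backtracking matcher from Java's
\texttt{java.util.regex} library (which Scala can call directly). This
is only a sketch: the object and variable names are our own and the
exact timings depend on the JVM and the version of the regex engine.

\begin{verbatim}
// Sketch: timing a backtracking matcher on the "evil" regex (a*)*b.
// The inputs consist of a's only, so no match exists and the engine
// has to backtrack through a huge number of ways of splitting the a's.
import java.util.regex.Pattern

object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 5 to 30 by 5) {
      val input  = "a" * n
      val start  = System.nanoTime()
      val result = evil.matcher(input).matches()
      val secs   = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   matches = $result%-5s   time = $secs%.3f s")
    }
  }
}
\end{verbatim}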
As a result, the regular expression matching engine needed to backtrack
over many choices. In this example, the time needed to process the
string is not exactly the classical exponential case, but rather
$O(n^2)$ with respect to the string length. But this was enough for the
home page of Stack Exchange to respond too slowly to the load balancer,
which concluded that there must be some attack and therefore stopped
the servers from responding to requests. This made the whole site
unavailable. Another very recent example is a global outage of all
Cloudflare servers on 2 July 2019. A poorly written regular expression
exhibited exponential behaviour and exhausted the CPUs that serve HTTP
traffic. Although the outage had several causes, at the heart was a
regular expression that was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by classical automata
theory, and in this more general setting there are quite a few research
questions still unanswered and fast algorithms still need to be
developed (for example, how to treat bounded repetitions, negation and
back-references efficiently).

%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}. Consider for example a regular expression
\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

Our focus in this project is on an algorithm introduced by Sulzmann and
Lu in 2014 for regular expression matching according to the POSIX
strategy \cite{Sulzmann2014}. Their algorithm is based on an older
algorithm by Brzozowski from 1964 where he introduced the notion of
derivatives of regular expressions~\cite{Brzozowski1964}. We shall
briefly explain this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

Suppose (basic) regular expressions are given by the following grammar:
an $s = c_0 \ldots c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

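Although the derivative operation and $\textit{nullable}$ are given above
in mathematical notation, the entire matcher fits into a few lines of
Scala. The following is only a sketch and the constructor names are our
own choice for this illustration:

\begin{verbatim}
// Brzozowski's matcher as a Scala sketch.
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}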
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also a \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the
children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$.

To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative nested inside the right
alternative. This suggests recording this matching as
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from left to
right, we build the derivatives $r_1$, $r_2$, \ldots{} according to the
characters $c_0$, $c_1$, \ldots{} until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the parse tree for how the empty
string has been matched by the (nullable) regular expression $r_n$. This
function is defined as

\begin{center}
\begin{tabular}{lcl}
  $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
  $\mkeps(r_{1}+r_{2})$ & $\dn$

regular expression on the left-hand side. This will become
important later on.

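In implementations, the $\textit{mkeps}$ function can be written directly
over a datatype of values. The following Scala sketch reuses the
\texttt{Rexp} datatype from the earlier sketch; the constructor names
\texttt{Lf} and \texttt{Rg} stand for the $\Left$ and $\Right$ values of
the table above and are renamed only to avoid a clash with Scala's
built-in \texttt{Left}/\texttt{Right}:

\begin{verbatim}
// Values and mkeps as a Scala sketch; mkeps prefers the left alternative
// whenever it is nullable, which is the property that becomes important later.
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lf(v: Val) extends Val
case class Rg(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Lf(mkeps(r1)) else Rg(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}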
After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous parse tree $v_{i+1}$.
After injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
context free grammars and then adapted by Henglein and Nielson for
efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
Sulzmann and Lu took this idea of bitcodes a step further by integrating
bitcodes into derivatives. The reason why we want to use bitcodes in
this project is that we want to introduce more aggressive
simplification rules in order to keep the size of derivatives small
throughout. This is because the main drawback of building successive
derivatives according to Brzozowski's definition is that they can grow
very quickly in size. This is mainly due to the fact that the derivative
operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives. As a result, if implemented naively both algorithms by
Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example,
when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative the
slower the algorithm.

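This blow-up is easy to observe experimentally. The following sketch
reuses the \texttt{Rexp} datatype and the functions \texttt{der} and
\texttt{ders} from the earlier sketch; the node-counting function
\texttt{size} is our own addition for this illustration:

\begin{verbatim}
// Counting the nodes of successive derivatives of (a + aa)* w.r.t. a.
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val evil = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
for (n <- 0 to 12)
  println(s"$n derivatives: size = ${size(ders(List.fill(n)('a'), evil))}")
\end{verbatim}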
Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

Unfortunately, the simplification rules outlined above are not
sufficient to prevent a size explosion in all cases. We
believe a tighter bound can be achieved that prevents an explosion in
\emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
proved that (partial) derivatives can be bounded by the number of
characters contained in the initial regular expression
\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
expressions as follows:

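In Scala, Antimirov's partial derivative operation can be sketched as a
function returning a \emph{set} of regular expressions (again reusing
the \texttt{Rexp} datatype and \texttt{nullable} from the earlier
sketch; this rendering is our own, not the original notation):

\begin{verbatim}
// Antimirov's partial derivatives as a Scala sketch.
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    if (nullable(r1)) pder(c, r1).map(SEQ(_, r2)) ++ pder(c, r2)
    else pder(c, r1).map(SEQ(_, r2))
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}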
\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
concatenation of them. Antimirov proved a tight bound on the size of
\emph{all} partial derivatives, no matter what the string looks like.
Roughly speaking, the size will be at most quadruple the size of the regular
expression. If we want the size of derivatives in Sulzmann and Lu's
algorithm to stay equal or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as delete duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a \cdot
c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a
+\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification
rules helps us to achieve the same size bound as that of the partial
derivatives. In order to implement the idea of ``spilling out alternatives''
and to make them compatible with the $\textit{inj}$-mechanism, we use \emph{bitcodes}.
Bits and bitcodes (lists of bits) are just:
%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.


\begin{center}
$b ::= S \mid Z \qquad
bs ::= [] \mid b:bs
$
\end{center}

\noindent
The names $S$ and $Z$ here are quite arbitrary in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
     code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\S$, and the border where a local star terminates
into $\Z$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:

%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$ & $\textit{ZERO}$\\
  & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
  & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}
%(in \textit{ALT})

\noindent
where $bs$ stands for bitcodes, and $a$ for $\bold{a}$nnotated regular
expressions. We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.

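In Scala, annotated regular expressions can be represented by a datatype
along the following lines. This is only a sketch with names of our own
choosing; anticipating the list-based $\textit{ALTS}$ constructor used
later for the simplification, we give the alternative constructor a list
of children right away:

\begin{verbatim}
// Bits and annotated regular expressions as a Scala sketch.
abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// An alternative with exactly two children is the special case ALT.
def AALT(bs: List[Bit], a1: ARexp, a2: ARexp): ARexp = AALTS(bs, List(a1, a2))
\end{verbatim}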
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
     $\textit{ONE}\,(bs\,@\,bs')$\\
  $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
     $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
  $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}

\noindent
After internalisation we do successive derivative operations on the
annotated regular expressions. This derivative operation is the same as
what we previously have for the simple regular expressions, except that
we need to take special care of the bitcodes:\comment{You need to be consistent with ALTS and ALT; ALT is just
an abbreviation; derivations and so on are defined for ALTS}

%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
        $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\,[\Z]\,(a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
\end{tabular}
\end{center}
%\end{definition}

\noindent
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. Also the $\textit{SEQ}$ clause
is more subtle: when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition), the derivative splits into two alternatives. Assume that
$\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
then the right branch of the $\textit{ALT}$, which is $\textit{fuse} \;
(\textit{bmkeps} \; a_1)\,(a_2 \backslash c)$, will collapse the regular
expression $a_1$ (as it has already been fully matched) and store the
parsing information at the head of the regular expression
$a_2 \backslash c$ by fusing it there. The bitcode $bs$, which was
initially attached to the head of $\textit{SEQ}$, has now been elevated
to the top-level of the $\textit{ALT}$, as this information will be
needed whichever way the $\textit{SEQ}$ is matched---no matter whether
$c$ belongs to $a_1$ or $a_2$. After building these derivatives and
maintaining all the lexing information, we complete the lexing by
collecting the bitcodes using a generalised version of the
$\textit{mkeps}$ function for annotated regular expressions, called
$\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
     $bs \,@\, [\S]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and using $S$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for
un-annotated regular expressions).

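Putting the definitions of this section together, the bitcoded part of
the lexer can be sketched in Scala as follows. This reuses the annotated
datatype from the earlier sketch; the function names are our own, and
$\textit{decode}$ and $\textit{internalise}$ are omitted since they were
already given above:

\begin{verbatim}
// fuse, bnullable, bmkeps and the bitcoded derivative as a Scala sketch.
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

def bmkeps(a: ARexp): List[Bit] = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) => bs ++ bmkeps(as.find(bnullable).get) // first nullable child
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(S)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALT(bs, ASEQ(Nil, bder(c, a1), a2), fuse(bmkeps(a1), bder(c, a2)))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) => ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}

// Derivative w.r.t. a string, as used in blexer above.
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}
\end{verbatim}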

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
in order to keep derivatives small. We have developed such
``aggressive'' simplification rules and generated test data that show
that the expected bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the \textit{ALTS} constructor. This allows us not just to
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on.
Is it $ALT$ or $ALTS$?}\\

\begin{center}
\begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (0, \_) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (\_, 0) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
   &&$\quad\textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; \textit{simp} \; as)) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow 0$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse} \; bs \; a$ \\
   &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
  $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification performs a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested $\textit{ALTS}$.
Its recursive definition is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\
  $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. What it does is
basically remove parentheses, for example changing $a+(b+c)$ into $a+b+c$.

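In Scala, the simplification and $\textit{flatten}$ are roughly along the
following lines (a sketch only, reusing the annotated datatype and
$\textit{fuse}$ from above; note that the built-in \texttt{distinct} on
Scala lists keeps exactly the first occurrence of every element, and that
the sequence case where the second component becomes $\textit{ONE}$ is
omitted here, since its bits cannot simply be fused to the front):

\begin{verbatim}
// flatten opens up nested alternatives and removes ZEROs;
// simp applies the simplification rules after the children are simplified.
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // keep the bits of the ONE
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}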
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
we obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small, for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
are reduced to just 6 and stay constant, no matter how long the
input string is.

\section{Current Work}

We are currently engaged in two tasks related to this algorithm. The
first task is to prove that our simplification rules do not affect the
POSIX value that should be generated by the algorithm according to the
specification of a POSIX value, and furthermore to obtain a much tighter
bound on the sizes of derivatives. The result is that our algorithm
should be correct and faster on all inputs. The original blow-up, as
observed in JavaScript, Python and Java, would be excluded from
happening in our algorithm. For this proof we use the theorem prover
Isabelle. Once completed, this result will advance the state-of-the-art:
Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
bitcoded ``incremental parsing method'' (that is the lexing algorithm
outlined in this section):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method in Figure~5 in combination with the simplification steps in
Figure 6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}

\noindent
We would settle this correctness claim. It is relatively straightforward
to establish that after one simplification step, the part of a nullable
derivative that corresponds to a POSIX value remains intact and can
still be collected, in other words, we can show that\comment{Double-check....I think this is not the case}

\begin{center}
$\textit{bmkeps} \; r = \textit{bmkeps} \; (\textit{simp} \; r) \qquad (r\; \textit{nullable})$
\end{center}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
$\textit{bmkeps} \; \textit{blexer}\_{simp} \; r = \textit{bmkeps} \; \textit{blexer} \; \textit{simp} \; r$
\end{center}

\noindent\comment{OK from here on you still need to work. Did not read.}
That is, if we take derivatives of a regular expression $r$ and of its
simplified version, they can still provide the same POSIX value if there
is one. This is not as straightforward as the previous proposition, as
the two regular expressions $r$ and $\textit{simp}\; r$ might become very
different regular expressions after repeated application of
$\textit{simp}$ and derivative. The crucial point is to find the
indispensable information of a regular expression and how it is kept
intact during simplification so that it performs as well as a regular
expression that has not been simplified in the subsequent derivative
operations. To aid this, we use the helper function $\textit{retrieve}$
described by Sulzmann and Lu:
\\definition of retrieve\\

This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet appeared, but are stored in the value).
Sulzmann and Lu used this to connect the bitcoded algorithm to the older
algorithm by the following equation:

\begin{center}
$\textit{inj} \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$
\end{center}

A little fact that needs to be stated to help comprehension:
\begin{center}