cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:a0df84875788
+:3153338ec6e4
 \end{center}
 \noindent These are clearly abysmal and possibly surprising results. One
 would expect these systems to do  much better than that---after all,
 given a DFA and a string, deciding whether a string is matched by this
-DFA should be linear?
+DFA should be linear in terms of the size of the regular expression and
+the string?
 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this exponential behaviour.  But unfortunately, such regular
+exhibit this super-linear behaviour.  But unfortunately, such regular
 expressions are not just a few outliers. They are actually
 frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 \comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
-This exponential blowup in matching algorithms sometimes causes
+This superlinear blowup in matching algorithms sometimes causes
 considerable grief in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 grind to a halt. This happened when a post with 20,000 white spaces was
 submitted, but importantly the white spaces were neither at the
 beginning nor at the end.
 As a result, the regular expression matching
 engine needed to backtrack over many choices.
-In this example, the time needed to process the string is not
+In this example, the time needed to process the string is
-exactly the classical exponential case, but rather $O(n^2)$
+$O(n^2)$
-with respect to the string length. But this is enough for the
+with respect to the string length. This quadratic
-home page of Stack Exchange to respond not fast enough to
+overhead is enough for the
-the load balancer, which thought that there must be some
+home page of Stack Exchange to respond so slowly to
+the load balancer that it thought there must be some
 attack and therefore stopped the servers from responding to
 requests. This made the whole site become unavailable.
 Another very recent example is a global outage of all Cloudflare servers
 on 2 July 2019. A poorly written regular expression exhibited
 exponential behaviour and exhausted CPUs that serve HTTP traffic.
 sub-parts of a string. To see this, suppose a \emph{flatten} operation,
 written $|v|$ for values. We can use this function to extract the
 underlying string of a value $v$. For example, $|\mathit{Seq} \,
 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
 flatten, we can describe how values encode \comment{Avoid the notion
-parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree with two
+parse trees! Also later!!}lexical values: $\Seq\,v_1\, v_2$ encodes a tree with two
 children nodes that tells how the string $|v_1| @ |v_2|$ matches the
 regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
 are matched is contained in the children nodes $v_1$ and $v_2$ of parent
 $\textit{Seq}$.
 $\textit{nullable}$ or not. If not, we know the string does not match
 $r$ and no value needs to be generated. If yes, we start building the
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the \comment{Avoid}parse tree
+function $\textit{mkeps}$, which builds the \comment{Avoid}lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 	\begin{center}
 		\begin{tabular}{lcl}
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on about what value is calculated.
 After the $\mkeps$-call, we inject back the characters one by one in order to build
-the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$.
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
-After injecting back $n$ characters, we get the parse tree for how $r_0$
+After injecting back $n$ characters, we get the lexical value for how $r_0$
 matches $s$. For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called \emph{injection}, written
 $\textit{inj}$; it takes three arguments: the first one is a regular
 expression ${r_{i-1}}$, before the character is chopped off, the second
 \end{tabular}
 \end{center}
 \noindent
 In  case $r_3$ is nullable, we can call $\textit{mkeps}$
-to construct a parse tree for how $r_3$ matched the string $abc$.
+to construct a lexical value for how $r_3$ matched the string $abc$.
 This function gives the following value $v_3$:
 \begin{center}
 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
 \end{center}
 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
 \begin{center}\comment{better layout}
-$( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
+	\begin{tabular}{lcl}
-\cdot r^*) +((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
+		$( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*}$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
+		\cdot r^*) +$\\
+		$((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$ & $+$ & $(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
+	\end{tabular}
 \end{center}
 \noindent
 Note that the leftmost location of term $((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
 \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
 expressions. The first one is an alternative, we take the rightmost
 alternative---whose language contains the empty string. The second
 nullable regular expression is a Kleene star. $\Stars$ tells us how it
 generates the nullable regular expression: by 0 iterations to form
 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally
-builds a parse tree based on $v_3$. Using the value $v_3$, the character
+builds a lexical value based on $v_3$. Using the value $v_3$, the character
 c, and the regular expression $r_2$, we can recover how $r_2$ matched
 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us
 \begin{center}
 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
 \end{center}
 \begin{center}
 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r*$
 \end{center}
 matched  the string $bc$ before it split into 2 pieces.
 Finally, after injecting character $a$ back to $v_1$,
-we get  the parse tree
+we get  the lexical value tree
 \begin{center}
 $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
 \end{center}
 for how $r$ matched $abc$. This completes the algorithm.
 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
-Readers might have noticed that the parse tree information is actually
+Readers might have noticed that the lixical value information is actually
 already available when doing derivatives. For example, immediately after
 the operation $\backslash a$ we know that if we want to match a string
 that starts with $a$, we can either take the initial match to be
 \begin{center}
 and $abc$ is on the right. Which of these alternatives is chosen
 later does not affect their relative position because the algorithm does
 not change this order. If this parsing information can be determined and
 does not change because of later derivatives, there is no point in
 traversing this information twice. This leads to an optimisation---if we
-store the information for parse trees inside the regular expression,
+store the information for lexical values inside the regular expression,
 update it when we do derivative on them, and collect the information
 when finished with derivatives and call $\textit{mkeps}$ for deciding which
-branch is POSIX, we can generate the parse tree in one pass, instead of
+branch is POSIX, we can generate the lexical value in one pass, instead of
 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel
 idea of using bitcodes in derivatives.
 In the next section, we shall focus on the bitcoded algorithm and the
 process of simplification of regular expressions. This is needed in
 state-of-the-art.
 \section{Simplification of Regular Expressions}
-Using bitcodes to guide  parsing is not a novel idea. It was applied to
+Using bitcodes to guide  parsing is not a novel idea. It was applied to context
-context free grammars and then adapted by Henglein and Nielson for
+free grammars and then adapted by Henglein and Nielson for efficient regular
-efficient regular expression \comment{?}parsing using DFAs~\cite{nielson11bcre}.
+expression \comment{?}\comment{i have changed parsing into lexing,
-Sulzmann and Lu took this idea of bitcodes a step further by integrating
+however, parsing is the terminology Henglein and Nielson used} lexing using
-bitcodes into derivatives. The reason why we want to use bitcodes in
+DFAs~\cite{nielson11bcre}.  Sulzmann and Lu took this idea of bitcodes a step
-this project is that we want to introduce more aggressive
+further by integrating bitcodes into derivatives. The reason why we want to use
-simplification rules in order to keep the size of derivatives small
+bitcodes in this project is that we want to introduce more aggressive
-throughout. This is because the main drawback of building successive
+simplification rules in order to keep the size of derivatives small throughout.
-derivatives according to Brzozowski's definition is that they can grow
+This is because the main drawback of building successive derivatives according
-very quickly in size. This is mainly due to the fact that the derivative
+to Brzozowski's definition is that they can grow very quickly in size. This is
-operation generates often ``useless'' $\ZERO$s and $\ONE$s in
+mainly due to the fact that the derivative operation generates often
-derivatives.  As a result, if implemented naively both algorithms by
+``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if implemented
-Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
+naively both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly
-when starting with the regular expression $(a + aa)^*$ and building 12
+slow. For example when starting with the regular expression $(a + aa)^*$ and
-successive derivatives w.r.t.~the character $a$, one obtains a
+building 12 successive derivatives w.r.t.~the character $a$, one obtains a
-derivative regular expression with more than 8000 nodes (when viewed as
+derivative regular expression with more than 8000 nodes (when viewed as a
-a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
+tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such
-such trees and consequently the bigger the size of the derivative the
+trees and consequently the bigger the size of the derivative the slower the
-slower the algorithm.
+algorithm.
 Fortunately, one can simplify regular expressions after each derivative
 step. Various simplifications of regular expressions are possible, such
 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
 \end{center}
 \noindent
 A partial derivative of a regular expression $r$ is essentially a set of
 regular expressions that are either $r$'s children expressions or a
-concatenation of them. Antimirov has proved a tight bound of the size of
+concatenation of them. Antimirov has proved a tight bound of the sum of
+the size of
 \emph{all} partial derivatives no matter what the string looks like.
-Roughly speaking the size will be quadruple in the size of the regular
+Roughly speaking the size sum will be at most cubic in the size of the regular
 expression.\comment{Are you sure? I have just proved that the sum of
 sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
-not the best bound.} If we want the size of derivatives in Sulzmann and
+not the best bound.}\comment{this is just a very rough guess i made
-Lu's algorithm to stay equal or below this bound, we would need more
+2 months ago. i will take your suggested bound here.}
+If we want the size of derivatives in Sulzmann and
+Lu's algorithm to stay below this bound, we would need more
 aggressive simplifications. Essentially we need to delete useless
 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
 get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a
 \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
 & $\mid$ & $\textit{ONE}\;\;bs$\\
 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
-& $\mid$ & $\textit{ALT}\;\;bs\,a_1 \, a_2$\\
+& $\mid$ & $\textit{ALTS}\;\;bs\,a_1 \, a_2$\\
 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
 & $\mid$ & $\textit{STAR}\;\;bs\,a$
 \end{tabular}
 \end{center}
-%(in \textit{ALT})
+%(in \textit{ALTS})
 \noindent
 where $bs$ stands for bitcodes, and $a$  for $\bold{a}$nnotated regular
 expressions. We will show that these bitcodes encode information about
 the (POSIX) value that should be generated by the Sulzmann and Lu
 \begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
 $(r_1 + r_2)^\uparrow$ & $\dn$ &
-$\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
+$\textit{ALTS}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
 (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
 $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
 $(r^*)^\uparrow$ & $\dn$ &
 $\textit{STAR}\;[]\,r^\uparrow$\\
 $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
 $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
 $\textit{ONE}\,(bs\,@\,bs')$\\
 $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
 $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
-$\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
+$\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,a_1\,a_2)$ & $\dn$ &
-$\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
+$\textit{ALTS}\,(bs\,@\,bs')\,a_1\,a_2$\\
 $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
 $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
 $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
 $\textit{STAR}\,(bs\,@\,bs')\,a$
 \end{tabular}
 \noindent
 After internalising the regular expression, we perform successive
 derivative operations on the annotated regular expressions. This
 derivative operation is the same as what we had previously for the
 basic regular expressions, except that we beed to take care of
-the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT
+the bitcodes:\comment{You need to be consitent with  ALTS and ALTS; ALTS
 is just an abbreviation; derivations and so on are defined for
 ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
 list...\textcolor{blue}{This does not make sense to me. CU}}
 %\begin{definition}{bder}
 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
 $\textit{if}\;c=d\; \;\textit{then}\;
 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
-$(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+$(\textit{ALTS}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-$\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
+$\textit{ALTS}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
 $\textit{if}\;\textit{bnullable}\,a_1$\\
-& &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
+& &$\textit{then}\;\textit{ALTS}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
-& &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
+& &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
 (\textit{STAR}\,[]\,r)$
 \end{tabular}
 then the right branch of $ALTS$, which is $fuse \; bmkeps \;  a_1 (a_2
 \backslash c)$ will collapse the regular expression $a_1$(as it has
 already been fully matched) and store the parsing information at the
 head of the regular expression $a_2 \backslash c$ by fusing to it. The
 bitsequence $bs$, which was initially attached to the head of $SEQ$, has
-now been elevated to the top-level of $ALT$, as this information will be
+now been elevated to the top-level of $ALTS$, as this information will be
 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
 the lexing information, we complete the lexing by collecting the
 bitcodes using a generalised version of the $\textit{mkeps}$ function
 for annotated regular expressions, called $\textit{bmkeps}$:
 %\begin{definition}[\textit{bmkeps}]\mbox{}
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
-$\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
+$\textit{bmkeps}\,(\textit{ALTS}\;bs\,a_1\,a_2)$ & $\dn$ &
 $\textit{if}\;\textit{bnullable}\,a_1$\\
 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
 $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
 modification is that we use simplification rules inspired by Antimirov's
 work on partial derivatives. They maintain the idea that only the first
 ``copy'' of a regular expression in an alternative contributes to the
 calculation of a POSIX value. All subsequent copies can be pruned away from
 the regular expression. A recursive definition of our  simplification function
-that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on.
+that looks somewhat similar to our Scala code is given below:
-Is it $ALT$ or $ALTS$?}\\
+%\comment{Use $\ZERO$, $\ONE$ and so on.
+%Is it $ALTS$ or $ALTS$?}\\
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
 outlined in this section):
 \begin{quote}\it
 ``Correctness Claim: We further claim that the incremental parsing
 method in Figure~5 in combination with the simplification steps in
-Figure 6 yields POSIX parse trees. We have tested this claim
+Figure 6 yields POSIX parse tree. We have tested this claim
 extensively by using the method in Figure~3 as a reference but yet
 have to work out all proof details.''
 \end{quote}
-\noindent We like to settle this correctness claim. It is relatively
+\noindent (The so-called parse trees in this quote means lexical values.)
+We like to settle this correctness claim. It is relatively
 straightforward to establish that after one simplification step, the part of a
 nullable derivative that corresponds to a POSIX value remains intact and can
 still be collected, in other words, we can show that\comment{Double-check....I
 think this  is not the case}
 %\comment{If i remember correctly, you have proved this lemma.
 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
 %to a regex. Maybe it works now.}
 \begin{center}
-$\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$
+	$\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;($\textit{provided}$ \; a\; is \; \textit{bnullable} )$
 \end{center}
-\comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is
-$\textit{bnullable}$}}
 \noindent
 as this basically comes down to proving actions like removing the
 additional $r$ in $r+r$  does not delete important POSIX information in
 a regular expression. The hard part of this proof is to establish that
 \begin{center}
-	$\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
+	$ \textit{blexer}\_{simp}(r, \; s) =  \textit{blexer}(r, \; s)$
 \end{center}\comment{This is not true either...look at the definion blexer/blexer-simp}
 \noindent That is, if we do derivative on regular expression $r$ and then
 simplify it, and repeat this process until we exhaust the string, we get a
 regular expression $r''$($\textit{LHS}$)  that provides the POSIX matching
 information, which is exactly the same as the result $r'$($\textit{RHS}$ of the
 normal derivative algorithm that only does derivative repeatedly and has no
 simplification at all.  This might seem at first glance very unintuitive, as
-the $r'$ is exponentially larger than $r''$, but can be explained in the
+the $r'$ could be exponentially larger than $r''$, but can be explained in the
 following way: we are pruning away the possible matches that are not POSIX.
-Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it
+Since there could be exponentially many
+non-POSIX matchings and only 1 POSIX matching, it
 is understandable that our $r''$ can be a lot smaller.  we can still provide
 the same POSIX value if there is one.  This is not as straightforward as the
 previous proposition, as the two regular expressions $r'$ and $r''$ might have
-become very different regular expressions.  The crucial point is to find the
+become very different.  The crucial point is to find the
 $\textit{POSIX}$  information of a regular expression and how it is modified,
 augmented and propagated
 during simplification in parallel with the regularr expression that
 has not been simplified in the subsequent derivative operations.  To aid this,
-we use the helping function retrieve described by Sulzmann and Lu: \\definition
+we use the helper function retrieve described by Sulzmann and Lu: \\definition
 of retrieve TODO\comment{Did not read further}\\
-This function assembles the bitcode that corresponds to a parse tree for how
+This function assembles the bitcode
-the current derivative matches the suffix of the string(the characters that
+%that corresponds to a lexical value for how
-have not yet appeared, but will appear as the successive derivatives go on,
+%the current derivative matches the suffix of the string(the characters that
-how do we get this "future" information? By the value $v$, which is
+%have not yet appeared, but will appear as the successive derivatives go on.
-computed by a pass of the algorithm that uses
+%How do we get this "future" information? By the value $v$, which is
-$inj$ as described in the previous section).  Sulzmann and Lu used this
+%computed by a pass of the algorithm that uses
+%$inj$ as described in the previous section).
+using information from both the derivative regular expression and the value.
+Sulzmann and Lu used this
 to connect the bitcoded algorithm to the older algorithm by the following
 equation:
 \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
 	 ((\textit{internalise}\; r)\backslash_{simp} c) v)$
 \end{center}
-A little fact that needs to be stated to help comprehension:
+A little fact that needs to be stated to aid comprehension:
 \begin{center}
 	$r^\uparrow = a$($a$ stands for $\textit{annotated}).$
 \end{center}
 Ausaf and Urban also used this fact to prove  the
 correctness of bitcoded algorithm without simplification.  Our purpose
 \begin{center}
 $ \textit{retrieve} \;
 a \; v \;=\; \textit{retrieve}  \; \textit{simp}(a) \; v'.$
 \end{center}
 The idea
-is that using $v'$, a simplified version of $v$ that possibly had gone
+is that using $v'$, a simplified version of $v$ that had gone
-through the same simplification step as $\textit{simp}(a)$ we are
+through the same simplification step as $\textit{simp}(a)$, we are
 able to extract the bit-sequence that gives the same parsing
 information as the unsimplified one.  After establishing this, we
 might be able to finally bridge the gap of proving
 \begin{center}
-$\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \;
+$\textit{retrieve} \; r^\uparrow   \backslash  s \; v = \;\textit{retrieve} \;
-\textit{simp}(r)  \backslash  s \; v'$
+\textit{bsimp}(r^\uparrow)  \backslash  s \; v'$
 \end{center}
 and subsequently
 \begin{center}
-$\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \;
+$\textit{retrieve} \; r^\uparrow \backslash  s \; v\; = \; \textit{retrieve} \;
-r  \backslash_{simp}   s \; v'$.
+r^\uparrow  \backslash_{simp}   s \; v'$.
 \end{center}
+The $\textit{LHS}$ of the above equation is the bitcode we want.
 This proves that our simplified
 version of regular expression still contains all the bitcodes needed.
+The task here is to find a way to compute the correct $v'$.
 The second task is to speed up the more aggressive simplification.
 Currently it is slower than a naive simplification(the naive version as
 implemented in ADU of course can explode in some cases). So it needs to
 be explored how to make it faster. Our possibility would be to explore

changeset 82	3153338ec6e4
parent 81	a0df84875788
child 83	8c1195dd6136