--- a/ninems/ninems.tex Tue Jul 23 21:54:13 2019 +0100
+++ b/ninems/ninems.tex Wed Jul 24 12:19:46 2019 +0100
@@ -178,10 +178,11 @@
\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether a string is matched by this
-DFA should be linear?
+DFA should be linear in the size of the regular expression and
+the string?
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this exponential behaviour. But unfortunately, such regular
+exhibit this superlinear behaviour. But unfortunately, such regular
expressions are not just a few outliers. They are actually
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et al
@@ -189,7 +190,7 @@
in the JavaScript and Python ecosystems \cite{Davis18}.
\comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in the Davis et al paper}
-This exponential blowup in matching algorithms sometimes causes
+This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
@@ -202,11 +203,12 @@
beginning nor at the end.
As a result, the regular expression matching
engine needed to backtrack over many choices.
-In this example, the time needed to process the string is not
-exactly the classical exponential case, but rather $O(n^2)$
-with respect to the string length. But this is enough for the
-home page of Stack Exchange to respond not fast enough to
-the load balancer, which thought that there must be some
+In this example, the time needed to process the string is
+$O(n^2)$
+with respect to the string length. This quadratic
+overhead is enough for the
+home page of Stack Exchange to respond so slowly to
+the load balancer that it thought there must be some
attack and therefore stopped the servers from responding to
requests. This made the whole site become unavailable.
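+
+To make this superlinear behaviour concrete, one can time a backtracking
+matcher on the evil regular expression from above. The following Scala
+snippet is only an illustrative sketch: it measures Java's backtracking
+\texttt{java.util.regex} engine, not any of the derivative-based
+algorithms discussed in this paper.
+
+\begin{verbatim}
+import java.util.regex.Pattern
+
+// the evil regular expression (a*)*b matched against strings of a's;
+// the match always fails, but only after the engine has explored a
+// huge number of ways of splitting the a's between the two stars
+val evil = Pattern.compile("(a*)*b")
+for (n <- 5 to 30 by 5) {
+  val start = System.nanoTime()
+  evil.matcher("a" * n).matches()
+  println(s"n = $n: ${(System.nanoTime() - start) / 1e6} ms")
+}
+\end{verbatim}
+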
Another very recent example is a global outage of all Cloudflare servers
@@ -432,7 +434,7 @@
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. Using
flatten, we can describe how values encode \comment{Avoid the notion
-parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree with two
+parse trees! Also later!!}lexing information: $\Seq\,v_1\, v_2$ encodes a tree with two
children nodes that tells how the string $|v_1| @ |v_2|$ matches the
regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
@@ -493,7 +495,7 @@
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the \comment{Avoid}parse tree
+function $\textit{mkeps}$, which builds the \comment{Avoid}lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
@@ -517,9 +519,9 @@
important later on about what value is calculated.
After the $\mkeps$-call, we inject back the characters one by one in order to build
-the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$.
-After injecting back $n$ characters, we get the parse tree for how $r_0$
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
@@ -610,7 +612,7 @@
\noindent
In case $r_3$ is nullable, we can call $\textit{mkeps}$
-to construct a parse tree for how $r_3$ matched the string $abc$.
+to construct a lexical value for how $r_3$ matched the string $abc$.
This function gives the following value $v_3$:
@@ -620,8 +622,11 @@
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
\begin{center}\comment{better layout}
- $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
- \cdot r^*) +((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* ).$
+ \begin{tabular}{lcl}
+ $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*}$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
+ \cdot r^*) +$\\
+ $((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* ).$
+ \end{tabular}
\end{center}
\noindent
@@ -655,7 +660,7 @@
nullable regular expression is a Kleene star. $\Stars$ tells us how it
generates the nullable regular expression: by 0 iterations to form
$\ONE$. Now $\textit{inj}$ injects characters back and incrementally
-builds a parse tree based on $v_3$. Using the value $v_3$, the character
+builds a lexical value based on $v_3$. Using the value $v_3$, the character
$c$, and the regular expression $r_2$, we can recover how $r_2$ matched
the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
\begin{center}
@@ -671,14 +676,14 @@
\end{center}
matched the string $bc$ before it split into two pieces.
Finally, after injecting character $a$ back to $v_1$,
- we get the parse tree
+ we get the lexical value
\begin{center}
$v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
\end{center}
for how $r$ matched $abc$. This completes the algorithm.
%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}.
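+
+The algorithm just described can be sketched in Scala as follows. This
+is only a sketch: it assumes the datatypes \texttt{Rexp} and \texttt{Val}
+and the functions $\textit{der}$, $\textit{nullable}$, $\textit{mkeps}$
+and $\textit{inj}$ as defined in this section.
+
+\begin{verbatim}
+// two-phase lexing: the recursive calls build the derivatives from
+// left to right; mkeps is called once at the end of the string and
+// the injections happen while the recursion unwinds, right to left
+def lex(r: Rexp, s: List[Char]): Val = s match {
+  case Nil =>
+    if (nullable(r)) mkeps(r)
+    else throw new Exception("lexing error")
+  case c :: cs => inj(r, c, lex(der(c, r), cs))
+}
+\end{verbatim}
+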
-Readers might have noticed that the parse tree information is actually
+Readers might have noticed that the lexical value information is actually
already available when doing derivatives. For example, immediately after
the operation $\backslash a$ we know that if we want to match a string
that starts with $a$, we can either take the initial match to be
@@ -699,10 +704,10 @@
not change this order. If this parsing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation---if we
-store the information for parse trees inside the regular expression,
+store the information for lexical values inside the regular expression,
update it when we do derivatives on them, and collect the information
when finished with derivatives and call $\textit{mkeps}$ for deciding which
-branch is POSIX, we can generate the parse tree in one pass, instead of
+branch is POSIX, we can generate the lexical value in one pass, instead of
doing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
idea of using bitcodes in derivatives.
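+
+In Scala-like pseudocode the intended one-pass algorithm then has the
+following shape, where $\textit{internalise}$, $\textit{bder}$,
+$\textit{bnullable}$, $\textit{bmkeps}$ and $\textit{decode}$ name the
+bitcoded operations introduced in the next section (this is a sketch
+only, not our final implementation):
+
+\begin{verbatim}
+// one-pass lexing with bitcodes: no injection phase is needed; the
+// bitsequence collected by bmkeps is decoded into a value at the end
+def blexer(r: Rexp, s: List[Char]): Val = {
+  def blex(a: ARexp, s: List[Char]): Bits = s match {
+    case Nil =>
+      if (bnullable(a)) bmkeps(a)
+      else throw new Exception("lexing error")
+    case c :: cs => blex(bder(c, a), cs)
+  }
+  decode(r, blex(internalise(r), s))
+}
+\end{verbatim}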
@@ -715,25 +720,25 @@
\section{Simplification of Regular Expressions}
-Using bitcodes to guide parsing is not a novel idea. It was applied to
-context free grammars and then adapted by Henglein and Nielson for
-efficient regular expression \comment{?}parsing using DFAs~\cite{nielson11bcre}.
-Sulzmann and Lu took this idea of bitcodes a step further by integrating
-bitcodes into derivatives. The reason why we want to use bitcodes in
-this project is that we want to introduce more aggressive
-simplification rules in order to keep the size of derivatives small
-throughout. This is because the main drawback of building successive
-derivatives according to Brzozowski's definition is that they can grow
-very quickly in size. This is mainly due to the fact that the derivative
-operation generates often ``useless'' $\ZERO$s and $\ONE$s in
-derivatives. As a result, if implemented naively both algorithms by
-Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
-when starting with the regular expression $(a + aa)^*$ and building 12
-successive derivatives w.r.t.~the character $a$, one obtains a
-derivative regular expression with more than 8000 nodes (when viewed as
-a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
-such trees and consequently the bigger the size of the derivative the
-slower the algorithm.
+Using bitcodes to guide parsing is not a novel idea. It was applied to
+context-free grammars and then adapted by Henglein and Nielson for efficient regular
+expression \comment{?}\comment{i have changed parsing into lexing,
+however, parsing is the terminology Henglein and Nielson used} lexing using
+DFAs~\cite{nielson11bcre}. Sulzmann and Lu took this idea of bitcodes a step
+further by integrating bitcodes into derivatives. The reason why we want to use
+bitcodes in this project is that we want to introduce more aggressive
+simplification rules in order to keep the size of derivatives small throughout.
+This is because the main drawback of building successive derivatives according
+to Brzozowski's definition is that they can grow very quickly in size. This is
+mainly due to the fact that the derivative operation often generates
+``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if implemented
+naively, both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly
+slow. For example, when starting with the regular expression $(a + aa)^*$ and
+building 12 successive derivatives w.r.t.~the character $a$, one obtains a
+derivative regular expression with more than 8000 nodes (when viewed as a
+tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such
+trees and consequently, the bigger the size of the derivative, the slower the
+algorithm.
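+
+This blowup can be reproduced with a few lines of Scala (a sketch,
+assuming the \texttt{Rexp} datatype and the derivative function
+\texttt{der} from earlier; \texttt{size} simply counts the nodes of a
+regular expression viewed as a tree):
+
+\begin{verbatim}
+// node count of a regular expression
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// (a + aa)* derived 12 times w.r.t. 'a', without any simplification
+val r: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+val ds = List.iterate(r, 13)(der('a', _))
+println(ds.map(size))  // the last sizes exceed 8000 nodes
+\end{verbatim}
+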
Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
@@ -770,13 +775,16 @@
\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
-concatenation of them. Antimirov has proved a tight bound of the size of
+concatenation of them. Antimirov proved a tight bound on the sum of the
+sizes of
\emph{all} partial derivatives no matter what the string looks like.
-Roughly speaking the size will be quadruple in the size of the regular
+Roughly speaking, this sum is at most cubic in the size of the regular
expression.\comment{Are you sure? I have just proved that the sum of
sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
-not the best bound.} If we want the size of derivatives in Sulzmann and
-Lu's algorithm to stay equal or below this bound, we would need more
+not the best bound.}\comment{this is just a very rough guess i made
+2 months ago. i will take your suggested bound here.}
+If we want the size of derivatives in Sulzmann and
+Lu's algorithm to stay below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
@@ -880,12 +888,12 @@
$\textit{a}$ & $::=$ & $\textit{ZERO}$\\
& $\mid$ & $\textit{ONE}\;\;bs$\\
& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
- & $\mid$ & $\textit{ALT}\;\;bs\,a_1 \, a_2$\\
+ & $\mid$ & $\textit{ALTS}\;\;bs\,a_1 \, a_2$\\
& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
& $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}
-%(in \textit{ALT})
+%(in \textit{ALTS})
\noindent
where $bs$ stands for bitcodes, and $a$ for \textbf{a}nnotated regular
@@ -906,7 +914,7 @@
$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
$(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
- $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
+ $\textit{ALTS}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
(\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
@@ -930,8 +938,8 @@
$\textit{ONE}\,(bs\,@\,bs')$\\
$\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
$\textit{CHAR}\,(bs\,@\,bs')\,c$\\
- $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
- $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
+ $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,a_1\,a_2)$ & $\dn$ &
+ $\textit{ALTS}\,(bs\,@\,bs')\,a_1\,a_2$\\
$\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
$\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
$\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
@@ -944,7 +952,7 @@
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
-the bitcodes:\comment{You need to be consitent with ALTS and ALT; ALT
+the bitcodes:\comment{You need to be consistent with ALTS and ALT; ALT
is just an abbreviation; derivations and so on are defined for
ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
list...\textcolor{blue}{This does not make sense to me. CU}}
@@ -957,12 +965,12 @@
$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
- $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
- $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
+ $(\textit{ALTS}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+ $\textit{ALTS}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
- & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
- & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
+ & &$\textit{then}\;\textit{ALTS}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
+ & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\,\backslash c))\,
@@ -985,7 +993,7 @@
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $bs$, which was initially attached to the head of $SEQ$, has
-now been elevated to the top-level of $ALT$, as this information will be
+now been elevated to the top-level of $ALTS$, as this information will be
needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
@@ -997,7 +1005,7 @@
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
- $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
+ $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a_1\,a_2)$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
@@ -1050,8 +1058,9 @@
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
-that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on.
-Is it $ALT$ or $ALTS$?}\\
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on.
+%Is it $ALT$ or $ALTS$?}\\
\begin{center}
\begin{tabular}{@{}lcl@{}}
@@ -1151,12 +1160,13 @@
\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method in Figure~5 in combination with the simplification steps in
- Figure 6 yields POSIX parse trees. We have tested this claim
+ Figure 6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}
-\noindent We like to settle this correctness claim. It is relatively
+\noindent (The ``parse trees'' in this quote refer to what we call lexical values.)
+We would like to settle this correctness claim. It is relatively
straightforward to establish that after one simplification step, the part of a
nullable derivative that corresponds to a POSIX value remains intact and can
still be collected, in other words, we can show that\comment{Double-check....I
@@ -1169,10 +1179,8 @@
%to a regex. Maybe it works now.}
\begin{center}
-$\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$
-\end{center}
-\comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is
-$\textit{bnullable}$}}
+ $\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a) \quad \textit{provided} \; a \; \textit{is} \; \textit{bnullable}$
+\end{center}
\noindent
as this basically comes down to proving actions like removing the
@@ -1180,7 +1188,7 @@
a regular expression. The hard part of this proof is to establish that
\begin{center}
- $\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
+ $\textit{blexer\_simp}(r, \; s) = \textit{blexer}(r, \; s)$
\end{center}\comment{This is not true either...look at the definion blexer/blexer-simp}
\noindent That is, if we do derivatives on the regular expression $r$ and then
@@ -1189,32 +1197,36 @@
information, which is exactly the same as the result $r'$ ($\textit{RHS}$ of the
normal derivative algorithm that only does derivatives repeatedly and has no
simplification at all). This might seem at first glance very unintuitive, as
-the $r'$ is exponentially larger than $r''$, but can be explained in the
+$r'$ could be exponentially larger than $r''$, but this can be explained in the
following way: we are pruning away the possible matches that are not POSIX.
-Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it
+Since there could be exponentially many
+non-POSIX matchings and only one POSIX matching, it
is understandable that our $r''$ can be a lot smaller. We can still provide
the same POSIX value if there is one. This is not as straightforward as the
previous proposition, as the two regular expressions $r'$ and $r''$ might have
-become very different regular expressions. The crucial point is to find the
+become very different. The crucial point is to find the
$\textit{POSIX}$ information of a regular expression and how it is modified,
augmented and propagated
during simplification in parallel with the regular expression that
has not been simplified in the subsequent derivative operations. To aid this,
-we use the helping function retrieve described by Sulzmann and Lu: \\definition
+we use the helper function retrieve described by Sulzmann and Lu: \\definition
of retrieve TODO\comment{Did not read further}\\
-This function assembles the bitcode that corresponds to a parse tree for how
-the current derivative matches the suffix of the string(the characters that
-have not yet appeared, but will appear as the successive derivatives go on,
-how do we get this "future" information? By the value $v$, which is
-computed by a pass of the algorithm that uses
-$inj$ as described in the previous section). Sulzmann and Lu used this
+This function assembles the bitcode
+%that corresponds to a lexical value for how
+%the current derivative matches the suffix of the string(the characters that
+%have not yet appeared, but will appear as the successive derivatives go on.
+%How do we get this "future" information? By the value $v$, which is
+%computed by a pass of the algorithm that uses
+%$inj$ as described in the previous section).
+using information from both the derivative regular expression and the value.
+Sulzmann and Lu used this
to connect the bitcoded algorithm to the older algorithm by the following
equation:
\begin{center} $\textit{inj} \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
((\textit{internalise}\; r)\backslash_{simp} c) v)$
\end{center}
- A little fact that needs to be stated to help comprehension:
+ A little fact that needs to be stated to aid comprehension:
\begin{center}
$r^\uparrow = a$ ($a$ stands for \textit{annotated}).
\end{center}
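+
+While the definition of $\textit{retrieve}$ still needs to be filled in
+above, a Scala sketch of it may aid comprehension as well. This follows
+Sulzmann and Lu's definition as we understand it, with the same bitcode
+conventions as $\textit{internalise}$, the bitcoded derivative and
+$\textit{bmkeps}$. We reuse the constructor names of the annotated
+regular expressions (in real code they would have to be distinguished
+from the unannotated ones, and \texttt{Sequ} stands for the $\Seq$ value
+constructor); the match is partial because the value $v$ must fit the
+annotated regular expression:
+
+\begin{verbatim}
+// assembles the bitcode from an annotated regular expression and a
+// value that describes how its underlying regex matches a string
+def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
+  case (ONE(bs), Empty) => bs
+  case (CHAR(bs, _), Chr(_)) => bs
+  case (ALTS(bs, a1, a2), Left(v1)) => bs ++ retrieve(a1, v1)
+  case (ALTS(bs, a1, a2), Right(v2)) => bs ++ retrieve(a2, v2)
+  case (SEQ(bs, a1, a2), Sequ(v1, v2)) =>
+    bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
+  case (STAR(bs, a1), Stars(Nil)) => bs ++ List(S)
+  case (STAR(bs, a1), Stars(v1 :: vs)) =>
+    bs ++ List(Z) ++ retrieve(a1, v1) ++
+      retrieve(STAR(Nil, a1), Stars(vs))
+}
+\end{verbatim}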
@@ -1226,23 +1238,24 @@
a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$
\end{center}
The idea
-is that using $v'$, a simplified version of $v$ that possibly had gone
-through the same simplification step as $\textit{simp}(a)$ we are
+is that using $v'$, a simplified version of $v$ that had gone
+through the same simplification step as $\textit{simp}(a)$, we are
able to extract the bit-sequence that gives the same parsing
information as the unsimplified one. After establishing this, we
might be able to finally bridge the gap of proving
\begin{center}
-$\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \;
-\textit{simp}(r) \backslash s \; v'$
+$\textit{retrieve} \; r^\uparrow \backslash s \; v = \;\textit{retrieve} \;
+\textit{bsimp}(r^\uparrow) \backslash s \; v'$
\end{center}
and subsequently
\begin{center}
-$\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \;
-r \backslash_{simp} s \; v'$.
+$\textit{retrieve} \; r^\uparrow \backslash s \; v\; = \; \textit{retrieve} \;
+r^\uparrow \backslash_{simp} s \; v'$.
\end{center}
+The $\textit{LHS}$ of the above equation is the bitcode we want.
This proves that our simplified
version of the regular expression still contains all the bitcodes needed.
-
+The task here is to find a way to compute the correct $v'$.
The second task is to speed up the more aggressive simplification.
Currently it is slower than a naive simplification (the naive version as