updated
author:    Christian Urban <urbanc@in.tum.de>
date:      Wed, 24 Jul 2019 20:37:40 +0100
changeset: 83:8c1195dd6136
parent:    82:3153338ec6e4
child:     84:de50a65d1b15
files:     ninems/ninems.tex
--- a/ninems/ninems.tex	Wed Jul 24 12:19:46 2019 +0100
+++ b/ninems/ninems.tex	Wed Jul 24 20:37:40 2019 +0100
@@ -189,7 +189,6 @@
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 
-\comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
 This superlinear blowup in matching algorithms sometimes causes
 considerable grief in real life: for example, on 20 July 2016 one evil
 regular expression brought the webpage
@@ -197,25 +196,21 @@
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources---causing web servers to
-grind to a halt. This happened when a post with 20,000 white spaces was
-submitted, but importantly the white spaces were neither at the
-beginning nor at the end. 
-As a result, the regular expression matching
-engine needed to backtrack over many choices.
-In this example, the time needed to process the string is
-$O(n^2)$
-with respect to the string length. This quadratic
-overhead is enough for the 
-home page of Stack Exchange to respond so slowly to
-the load balancer that it thought there must be some
-attack and therefore stopped the servers from responding to 
-requests. This made the whole site become unavailable.
-Another very recent example is a global outage of all Cloudflare servers
-on 2 July 2019. A poorly written regular expression exhibited
-exponential behaviour and exhausted CPUs that serve HTTP traffic.
-Although the outage had several causes, at the heart was a regular
-expression that was used to monitor network
+massive amounts of CPU resources---causing web servers to grind to a
+halt. This happened when a post with 20,000 white spaces was submitted,
+but importantly the white spaces were neither at the beginning nor at
+the end. As a result, the regular expression matching engine needed to
+backtrack over many choices. In this example, the time needed to process
+the string was $O(n^2)$ with respect to the string length. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed there must be an
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable. Another very recent
+example is the global outage of all Cloudflare servers on 2 July 2019.
+A poorly written regular expression exhibited exponential behaviour and
+exhausted the CPUs that serve HTTP traffic. Although the outage had
+several causes, at its heart was a regular expression that was used to
+monitor network
 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
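+
+The effect behind such outages is easy to reproduce. The following
+small Scala snippet (our illustrative choice of language; the pattern
+and inputs are made up for demonstration purposes) shows how a
+backtracking matcher, here Java's \texttt{java.util.regex} engine
+called from Scala, slows down dramatically on strings that force it to
+explore many choices:
+
+\begin{verbatim}
+import java.util.regex.Pattern
+
+// an "evil" pattern: nested repetition followed by a character
+// that never occurs in the input, forcing full backtracking
+val evil = Pattern.compile("(a+)+b")
+
+for (n <- List(10, 15, 20, 25)) {
+  val input = "a" * n                    // n copies of 'a', no 'b'
+  val start = System.nanoTime()
+  evil.matcher(input).matches()          // always false, but slow
+  val ms = (System.nanoTime() - start) / 1000000
+  println(s"n = $n: $ms ms")             // running time grows exponentially
+}
+\end{verbatim}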
 
 The underlying problem is that many ``real life'' regular expression
@@ -433,13 +428,12 @@
 written $|v|$ for values. We can use this function to extract the
 underlying string of a value $v$. For example, $|\mathit{Seq} \,
 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
-flatten, we can describe how values encode \comment{Avoid the notion
-parse trees! Also later!!}lexical values: $\Seq\,v_1\, v_2$ encodes a tree with two
-children nodes that tells how the string $|v_1| @ |v_2|$ matches the
-regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
-respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
-are matched is contained in the children nodes $v_1$ and $v_2$ of parent
-$\textit{Seq}$. 
+flatten, we can describe how values encode lexical information:
+$\Seq\,v_1\, v_2$ encodes a tree with two child nodes that tells how
+the string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$, where
+$r_1$ matches the substring $|v_1|$ and $r_2$ matches the substring
+$|v_2|$. Exactly how these two are matched is contained in the child
+nodes $v_1$ and $v_2$ of the parent $\textit{Seq}$. 
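+
+For readers who prefer code, the values and the flatten function can be
+rendered in Scala roughly as follows (a sketch: the constructor names
+mirror the grammar above, with $\textit{Seq}$ spelled \texttt{Sequ} to
+avoid a clash with Scala's built-in \texttt{Seq}):
+
+\begin{verbatim}
+// values, mirroring the grammar above
+abstract class Val
+case object Empty extends Val                   // for the ONE regex
+case class Chr(c: Char) extends Val             // for a character regex
+case class Sequ(v1: Val, v2: Val) extends Val   // for sequences r1 . r2
+case class Left(v: Val) extends Val             // left alternative taken
+case class Right(v: Val) extends Val            // right alternative taken
+case class Stars(vs: List[Val]) extends Val     // for star iterations
+// (Left/Right here shadow Scala's built-in Left/Right)
+
+// flatten: extracts the underlying string of a value
+def flatten(v: Val): String = v match {
+  case Empty         => ""
+  case Chr(c)        => c.toString
+  case Left(v1)      => flatten(v1)
+  case Right(v1)     => flatten(v1)
+  case Sequ(v1, v2)  => flatten(v1) ++ flatten(v2)
+  case Stars(vs)     => vs.map(flatten).mkString
+}
+\end{verbatim}
+
+\noindent
+For instance, \texttt{flatten(Sequ(Chr('x'), Chr('y')))} evaluates to
+the string \texttt{xy}, as in the example above.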
 
 To give a concrete example of how values work, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
@@ -495,7 +489,7 @@
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the \comment{Avoid}lexical value
+function $\textit{mkeps}$, which builds the lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 
@@ -621,16 +615,17 @@
 \end{center}
 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
 
-\begin{center}\comment{better layout}
-	\begin{tabular}{lcl}
-		$( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*}$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
-  		\cdot r^*) +$\\
-		$((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$ & $+$ & $(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
+\begin{center}
+	\begin{tabular}{l@{\hspace{2mm}}l}
+    & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} 
+    \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
+    $+$ & $\big((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*
+    \;+\; (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* \big)$
   	\end{tabular}
  \end{center}
 
 \noindent
- Note that the leftmost location of term $((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
+ Note that the leftmost location of term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
  \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
  $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
  left one when it is nullable. In the case of this example, $abc$ is
@@ -674,7 +669,7 @@
  \begin{center}
 $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
  \end{center}
-  matched  the string $bc$ before it split into 2 pieces. 
+  matched the string $bc$ before it was split into two substrings. 
   Finally, after injecting character $a$ back to $v_1$, 
   we get  the lexical value tree 
   \begin{center}
@@ -683,7 +678,7 @@
    for how $r$ matched $abc$. This completes the algorithm.
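+
+To summarise the two phases in code, the following Scala sketch gives
+one possible rendering of $\textit{mkeps}$ and the injection function
+(the definitions follow Sulzmann and Lu; the datatype \texttt{Rexp} for
+plain regular expressions and the Scala rendering are ours, and
+\texttt{Val} is the datatype sketched earlier):
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// mkeps: how the empty string is matched by a nullable regex,
+// always preferring the left alternative
+def mkeps(r: Rexp): Val = r match {
+  case ONE         => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_)     => Stars(Nil)
+}
+
+// inj: injects character c back into value v, undoing one
+// derivative step
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs)))   => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, r2), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2))          => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, r2), Left(v1))           => Left(inj(r1, c, v1))
+  case (ALT(r1, r2), Right(v2))          => Right(inj(r2, c, v2))
+  case (CHAR(_), Empty)                  => Chr(c)
+}
+\end{verbatim}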
    
 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
-Readers might have noticed that the lixical value information is actually
+Readers might have noticed that the lexical value information is actually
 already available when doing derivatives. For example, immediately after
 the operation $\backslash a$ we know that if we want to match a string
 that starts with $a$, we can either take the initial match to be 
@@ -720,24 +715,24 @@
 
 \section{Simplification of Regular Expressions}
 
-Using bitcodes to guide  parsing is not a novel idea. It was applied to context
-free grammars and then adapted by Henglein and Nielson for efficient regular
-expression \comment{?}\comment{i have changed parsing into lexing,
-however, parsing is the terminology Henglein and Nielson used} lexing using
-DFAs~\cite{nielson11bcre}.  Sulzmann and Lu took this idea of bitcodes a step
-further by integrating bitcodes into derivatives. The reason why we want to use
-bitcodes in this project is that we want to introduce more aggressive
-simplification rules in order to keep the size of derivatives small throughout.
-This is because the main drawback of building successive derivatives according
-to Brzozowski's definition is that they can grow very quickly in size. This is
-mainly due to the fact that the derivative operation generates often
-``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if implemented
-naively both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly
-slow. For example when starting with the regular expression $(a + aa)^*$ and
-building 12 successive derivatives w.r.t.~the character $a$, one obtains a
-derivative regular expression with more than 8000 nodes (when viewed as a
-tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such
-trees and consequently the bigger the size of the derivative the slower the
+Using bitcodes to guide parsing is not a novel idea. It was applied to
+context-free grammars and then adapted by Henglein and Nielson for
+efficient regular expression lexing using DFAs~\cite{nielson11bcre}.
+Sulzmann and Lu took this idea of bitcodes a step further by integrating
+bitcodes into derivatives. We want to use bitcodes in this project
+because they allow us to introduce more aggressive simplification rules
+in order to keep the size of derivatives small throughout. This is
+needed because the main drawback of building successive derivatives
+according to Brzozowski's definition is that they can grow very quickly
+in size. This is mainly due to the fact that the derivative operation
+often generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a
+result, if implemented naively, both algorithms by Brzozowski and by
+Sulzmann and Lu are excruciatingly slow. For example, when starting with
+the regular expression $(a + aa)^*$ and building 12 successive
+derivatives w.r.t.~the character $a$, one obtains a derivative regular
+expression with more than 8000 nodes (when viewed as a tree). Operations
+like $\textit{der}$ and $\nullable$ need to traverse such trees and,
+consequently, the bigger the size of the derivative the slower the
 algorithm. 
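+
+This blowup is straightforward to observe: reusing the \texttt{Rexp}
+datatype from the sketch above, the following Scala snippet implements
+the textbook derivative operation together with a node count and prints
+the sizes of successive derivatives of $(a + aa)^*$ (the code is
+illustrative, not the simplified algorithm discussed below):
+
+\begin{verbatim}
+// Brzozowski derivative of r w.r.t. the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+
+// number of nodes of a regular expression viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
+  case STAR(r1)             => 1 + size(r1)
+}
+
+// sizes of the first 12 derivatives of (a + aa)* w.r.t. 'a'
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+val ds = (1 to 12).scanLeft(r)((s, _) => der('a', s))
+println(ds.map(size).mkString(", "))   // the sizes grow rapidly
+\end{verbatim}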
 
 Fortunately, one can simplify regular expressions after each derivative
@@ -776,23 +771,19 @@
 A partial derivative of a regular expression $r$ is essentially a set of
 regular expressions that are either $r$'s child expressions or a
 concatenation of them. Antimirov has proved a tight bound on the sum of
-the size of
-\emph{all} partial derivatives no matter what the string looks like.
-Roughly speaking the size sum will be at most cubic in the size of the regular
-expression.\comment{Are you sure? I have just proved that the sum of
-sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
-not the best bound.}\comment{this is just a very rough guess i made
-2 months ago. i will take your suggested bound here.}
-If we want the size of derivatives in Sulzmann and
-Lu's algorithm to stay below this bound, we would need more
-aggressive simplifications. Essentially we need to delete useless
-$\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
-For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
-get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a
-\cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
-\ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive
-simplification rules helps us to achieve the same size bound as that of
-the partial derivatives. 
+the sizes of \emph{all} partial derivatives, no matter what the string
+looks like. Roughly speaking, this sum will be at most cubic in the
+size of the regular expression.
+
+If we want the size of derivatives in Sulzmann and Lu's algorithm to
+stay below this bound, we need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+delete duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b \cdot c$ can be opened up to get $a\cdot c + b \cdot
+c + b \cdot c$, and then simplified to just $a \cdot c + b \cdot c$.
+Another example is simplifying $(a^* + a) + (a^* + \ONE) + (a + \ONE)$
+to just $a^* + a + \ONE$. Adding these more aggressive simplification
+rules helps us to achieve the same size bound as that of the partial
+derivatives. 
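+
+A sketch of such simplifications in Scala is given below (it covers
+deleting $\ZERO$s and $\ONE$s and removing duplicates by flattening
+nested alternatives into a list; the step of opening up parentheses is
+omitted, and the function names are our own):
+
+\begin{verbatim}
+// flatten nested alternatives into a list of summands,
+// dropping ZEROs along the way
+def flts(r: Rexp): List[Rexp] = r match {
+  case ALT(r1, r2) => flts(r1) ++ flts(r2)
+  case ZERO        => Nil
+  case r1          => List(r1)
+}
+
+// simp: delete useless ZEROs/ONEs and duplicate alternatives
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2)  => SEQ(s1, s2)
+  }
+  case ALT(r1, r2) =>
+    flts(ALT(simp(r1), simp(r2))).distinct match {
+      case Nil     => ZERO
+      case s :: ss => ss.foldLeft(s)(ALT)  // rebuild the alternatives
+    }
+  case r1 => r1
+}
+\end{verbatim}
+
+\noindent
+Interleaving \texttt{simp} with the derivative steps keeps the sizes in
+the $(a + aa)^*$ experiment above small, instead of letting them grow
+into the thousands.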
 
 In order to implement the idea of ``spilling out alternatives'' and to
 make them compatible with the $\text{inj}$-mechanism, we use
@@ -881,7 +872,7 @@
 Sulzmann and Lu integrated the bitcodes into regular expressions to
 create annotated regular expressions \cite{Sulzmann2014}.
 \emph{Annotated regular expressions} are defined by the following
-grammar:
+grammar:\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -952,10 +943,7 @@
 derivative operations on the annotated regular expressions. This
 derivative operation is the same as what we had previously for the
 basic regular expressions, except that we need to take care of
-the bitcodes:\comment{You need to be consitent with  ALTS and ALTS; ALTS
-is just an abbreviation; derivations and so on are defined for
-ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
-list...\textcolor{blue}{This does not make sense to me. CU}}
+the bitcodes:
 
  %\begin{definition}{bder}
 \begin{center}
@@ -1160,12 +1148,12 @@
 \begin{quote}\it
   ``Correctness Claim: We further claim that the incremental parsing
   method in Figure~5 in combination with the simplification steps in
-  Figure 6 yields POSIX parse tree. We have tested this claim
+  Figure 6 yields POSIX parse tree [our lexical values]. We have tested this claim
   extensively by using the method in Figure~3 as a reference but yet
   have to work out all proof details.''
 \end{quote}  
 
-\noindent (The so-called parse trees in this quote means lexical values.)
+\noindent 
 We would like to settle this correctness claim. It is relatively
 straightforward to establish that after one simplification step, the part of a
 nullable derivative that corresponds to a POSIX value remains intact and can
@@ -1207,7 +1195,7 @@
 become very different.  The crucial point is to find the
 $\textit{POSIX}$  information of a regular expression and how it is modified,
 augmented and propagated 
-during simplification in parallel with the regularr expression that
+during simplification, in parallel with the regular expression that
 has not been simplified in the subsequent derivative operations.  To aid this,
 we use the helper function retrieve described by Sulzmann and Lu: \\definition
 of retrieve TODO\comment{Did not read further}\\