# HG changeset patch
# User Christian Urban
# Date 1562366785 -3600
# Node ID f0360e17080e8c407eba26c685dff5525c8a4e17
# Parent 8ea861a19d7045491e9a8f154931ec6069723455
proofread

diff -r 8ea861a19d70 -r f0360e17080e ninems/ninems.tex
--- a/ninems/ninems.tex Fri Jul 05 21:20:47 2019 +0100
+++ b/ninems/ninems.tex Fri Jul 05 23:46:25 2019 +0100
@@ -68,7 +68,7 @@
 regular expressions. In 2014, Sulzmann and Lu extended this algorithm to not just give a YES/NO answer for whether or not a regular expression matches a string, but in case it matches also
- \emph{how} it matches the string. This is important for
+ answers with \emph{how} it matches the string. This is important for
 applications such as lexing (tokenising a string). The problem is to make the algorithm by Sulzmann and Lu fast on all inputs without breaking its correctness. We have already developed some
@@ -87,8 +87,6 @@
 implemented algorithms for regular expression matching are blindingly fast?

-
-
 Unfortunately these preconceptions are not supported by evidence: Take for example the regular expression $(a^*)^*\,b$ and ask whether strings of the form $aa..a$ match this regular
@@ -169,10 +167,10 @@
 \end{tabular}
 \end{center}

-\noindent These are clearly abysmal and possibly surprising results.
-One would expect these systems doing much better than that---after
-all, given a DFA and a string, whether a string is matched by this DFA
-should be linear.
+\noindent These are clearly abysmal and possibly surprising results. One
+would expect these systems to do much better than that---after all,
+given a DFA and a string, deciding whether a string is matched by this
+DFA should be linear.

 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to exhibit this ``exponential behaviour''. Unfortunately, such regular
@@ -182,25 +180,24 @@
 report that they have found thousands of such evil regular expressions in the JavaScript and Python ecosystems \cite{Davis18}.
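The exponential behaviour of such evil regular expressions is easy to reproduce with a mainstream backtracking engine. The following short Python sketch (an illustration added here, not one of the measurements referred to in the text) shows Python's `re` module slowing down drastically on $(a^*)^*\,b$ applied to strings of the form $aa..a$:

```python
import re
import time

# The "evil" regular expression (a*)*b: a backtracking matcher tries
# exponentially many ways of splitting a run of a's between the inner
# and the outer star before concluding that there is no match.
EVIL = re.compile(r'(a*)*b')

def time_match(n):
    """Time Python's backtracking engine on a string of n a's (no b)."""
    s = 'a' * n
    start = time.perf_counter()
    result = EVIL.match(s)          # always fails: the string has no b
    return result, time.perf_counter() - start

# The time to discover the failure roughly doubles with each extra a.
for n in [16, 18, 20]:
    result, elapsed = time_match(n)
    print(n, result, f'{elapsed:.3f}s')
```

Note that the match always fails, so a DFA-based matcher would answer in linear time; only the backtracking strategy makes the running time blow up.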
-This exponential blowup sometimes causes pain in real life:
-for example on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its
+This exponential blowup in matching algorithms sometimes causes
+considerable grief in real life: for example, on 20 July 2016, one evil
+regular expression brought the website
+\href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}

 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 massive amounts of CPU-resources and because of this the web servers
-ground to a halt. This happened when a post with 20,000 white spaces
-was submitted, but importantly the white spaces were neither at the
+ground to a halt. This happened when a post with 20,000 white spaces was
+submitted, but importantly the white spaces were neither at the
 beginning nor at the end. As a result, the regular expression matching
-engine needed to backtrack over many choices.
-
-The underlying problem is that many ``real life'' regular expression
-matching engines do not use DFAs for matching. This is because they
-support regular expressions that are not covered by the classical
-automata theory, and in this more general setting there are quite a
-few research questions still unanswered and fast algorithms still need
-to be developed (for example how to include bounded repetitions, negation
-and back-references).
+engine needed to backtrack over many choices. The underlying problem is
+that many ``real life'' regular expression matching engines do not use
+DFAs for matching. 
This is because they support regular expressions that
+are not covered by the classical automata theory, and in this more
+general setting there are quite a few research questions still
+unanswered and fast algorithms still need to be developed (for example
+how to include bounded repetitions, negation and back-references).

 There is also another under-researched problem to do with regular expressions and lexing, i.e.~the process of breaking up strings into
@@ -246,7 +243,7 @@
 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by Brzozowski from 1964 where he introduced the notion of derivatives of regular expressions \cite{Brzozowski1964}. We shall briefly explain
-this algorithms next.
+this algorithm next.

 \section{The Algorithm by Brzozowski based on Derivatives of Regular Expressions}
@@ -265,7 +262,7 @@
 character regular expression $c$ can match the character $c$, and so on.

-The brilliant contribution by Brzozowski is the notion of
+The ingenious contribution by Brzozowski is the notion of
 \emph{derivatives} of regular expressions. The idea behind this notion is as follows: suppose a regular expression $r$ can match a string of the form $c\!::\! s$ (that is a list of characters starting
@@ -346,7 +343,7 @@
 \]

 \noindent
-This algorithm can be illustrated as follows:
+This algorithm can be illustrated graphically as follows:
 \begin{equation}\label{graph:*}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
@@ -358,7 +355,7 @@
 derivatives until we exhaust the string and then use \textit{nullable} to test whether the result can match the empty string. It can be relatively easily shown that this matcher is correct (that is given
-$s$ and $r$, it generates YES if and only if $s \in L(r)$).
+an $s$ and an $r$, it generates YES if and only if $s \in L(r)$). 
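The derivative-based matcher just described can be written out in a few lines. The following is a minimal Python sketch (our own illustration of the definitions above; the class names are ours, the clauses for `nullable` and `der` follow the standard Brzozowski definitions):

```python
from dataclasses import dataclass

# Regular expressions: 0, 1, character, alternative, sequence, star.
class Rexp: pass

@dataclass
class Zero(Rexp): pass            # matches no string
@dataclass
class One(Rexp): pass             # matches only the empty string
@dataclass
class Char(Rexp): c: str
@dataclass
class Alt(Rexp): r1: Rexp; r2: Rexp
@dataclass
class Seq(Rexp): r1: Rexp; r2: Rexp
@dataclass
class Star(Rexp): r: Rexp

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)
    return False                  # Zero, Char

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, Char): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star): return Seq(der(c, r.r), r)
    return Zero()                 # Zero, One

def matches(r, s):
    """Build derivatives r0, r1, ... along s, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For example, `matches(Seq(Star(Star(Char('a'))), Char('b')), 'aab')` yields `True`. Without simplification, however, the intermediate derivatives can grow quickly, which is exactly the problem discussed later.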
\section{Values and the Algorithm by Sulzmann and Lu}

@@ -398,18 +395,19 @@
 \end{center}

 \noindent
-The idea of values is to express parse trees. Suppose a flatten
+There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
+$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
+values is to encode parse trees. To see this, suppose there is a \emph{flatten}
 operation, written $|v|$, which we can use to extract the underlying
-string of $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
+string of a value $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
 (\textit{Char y})|$ is the string $xy$. We omit the straightforward
-definition of flatten. Using flatten, we can describe how
-values encode parse trees: $\Seq\,v_1\, v_2$ tells us how the
-string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$
-matches $|v_1|$ and, respectively, $r_2$ matches $|v_2|$. Exactly how
-these two are matched is contained in the sub-structure of $v_1$ and
-$v_2$.
+definition of flatten. Using flatten, we can describe how values encode
+parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
+matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
+respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
+is contained in the sub-structure of $v_1$ and $v_2$.

- To give a concrete example of how value works, consider the string $xy$
+ To give a concrete example of how values work, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular expression as a tree and if the string $xy$ is matched by two Star ``iterations'', then the $x$ is matched by the left-most alternative in
@@ -450,18 +448,17 @@
 \end{center}

 \noindent
-We shall briefly explain this algorithm. For the convenience of
-explanation, we have the following notations: the regular expression we
-start with is $r_0$ and the string $s$ is composed characters $c_0 c_1
-\ldots c_n$. 
First, we build the derivatives $r_1$, $r_2$, \ldots, using
+For convenience, we shall use the following notation: the regular expression we
+start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
+\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots according
 to the characters $c_0$, $c_1$,\ldots until we exhaust the string and
-arrive at the derivative $r_n$. We test whether this derivative is
+obtain the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match $r$ and no value needs to be generated. If yes, we start building the parse tree incrementally by \emph{injecting} back the characters into
-the values $v_n, \ldots, v_0$. We first call the function
+the values $v_n, \ldots, v_0$. For this we first call the function
 $\textit{mkeps}$, which builds the parse tree for how the empty string
-has matched the empty regular expression $r_n$. This function is defined
+has matched the (nullable) regular expression $r_n$. This function is defined
 as

 \begin{center}
@@ -482,10 +479,21 @@
 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. A correctness proof using induction can be routinely established.

-It is instructive to see how this algorithm works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to match it against the string $abc$(when $abc$ is written
-as a regular expression, the most standard way of expressing it should be $a \cdot (b \cdot c)$. We omit the parenthesis and dots here for readability).
-By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching. 
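The derivatives-forward, inject-backward algorithm just described can be sketched concretely. The following self-contained Python illustration is our own (value constructor names such as `SeqV`, `LeftV`, `Stars` are ours); the clauses of `mkeps` and `inj` follow Sulzmann and Lu's definitions:

```python
from dataclasses import dataclass
from typing import List

# --- regular expressions -------------------------------------------
class Rexp: pass

@dataclass
class Zero(Rexp): pass
@dataclass
class One(Rexp): pass
@dataclass
class Char(Rexp): c: str
@dataclass
class Alt(Rexp): r1: Rexp; r2: Rexp
@dataclass
class Seq(Rexp): r1: Rexp; r2: Rexp
@dataclass
class Star(Rexp): r: Rexp

def nullable(r):
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)
    return False

def der(c, r):
    if isinstance(r, Char): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1): return Alt(Seq(der(c, r.r1), r.r2), der(c, r.r2))
        return Seq(der(c, r.r1), r.r2)
    if isinstance(r, Star): return Seq(der(c, r.r), r)
    return Zero()

# --- values, encoding parse trees ----------------------------------
class Val: pass

@dataclass
class Empty(Val): pass
@dataclass
class Chr(Val): c: str
@dataclass
class SeqV(Val): v1: Val; v2: Val
@dataclass
class LeftV(Val): v: Val
@dataclass
class RightV(Val): v: Val
@dataclass
class Stars(Val): vs: List[Val]

def flatten(v):
    """Extract the underlying string |v| of a value."""
    if isinstance(v, Chr): return v.c
    if isinstance(v, SeqV): return flatten(v.v1) + flatten(v.v2)
    if isinstance(v, (LeftV, RightV)): return flatten(v.v)
    if isinstance(v, Stars): return ''.join(flatten(x) for x in v.vs)
    return ''                           # Empty

def mkeps(r):
    """Value for how a nullable r matches the empty string."""
    if isinstance(r, One): return Empty()
    if isinstance(r, Alt):
        return LeftV(mkeps(r.r1)) if nullable(r.r1) else RightV(mkeps(r.r2))
    if isinstance(r, Seq): return SeqV(mkeps(r.r1), mkeps(r.r2))
    if isinstance(r, Star): return Stars([])

def inj(r, c, v):
    """Turn a value for der(c, r) into a value for r (put c back in)."""
    if isinstance(r, Char): return Chr(r.c)
    if isinstance(r, Alt):
        if isinstance(v, LeftV): return LeftV(inj(r.r1, c, v.v))
        return RightV(inj(r.r2, c, v.v))
    if isinstance(r, Seq):
        if isinstance(v, SeqV): return SeqV(inj(r.r1, c, v.v1), v.v2)
        if isinstance(v, LeftV): return SeqV(inj(r.r1, c, v.v.v1), v.v.v2)
        return SeqV(mkeps(r.r1), inj(r.r2, c, v.v))   # RightV case
    if isinstance(r, Star):                 # v is SeqV(v1, Stars vs)
        return Stars([inj(r.r, c, v.v1)] + v.v2.vs)

def lexer(r, s):
    """Derivatives forward, then inject the characters back."""
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)
```

For instance, `lexer(Star(Alt(Char('a'), Char('b'))), 'ab')` returns `Stars([LeftV(Chr('a')), RightV(Chr('b'))])`, whose flattening is the string `ab`.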
-First, we build successive derivatives until we exhaust the string, as illustrated here( we simplified some regular expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow $\textit{ALT}$ to take a list of regular expressions as an argument instead of just 2 operands to reduce the nested depth of $\textit{ALT}$):
+It is instructive to see how this algorithm works by a little example.
+Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
+match it against the string $abc$ (when $abc$ is written as a regular
+expression, the standard way of expressing it is $a \cdot (b
+\cdot c)$. We omit the parentheses and dots here for readability). By
+POSIX rules the lexer should go for the longest match, i.e.~it should
+match the string $abc$ in one star iteration, using the longest string
+$abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this
+sub-expression for conciseness). Here is how the lexer achieves a parse
+tree for this matching. First, we build successive derivatives until we
+exhaust the string, as illustrated here (we simplified some regular
+expressions like $0 \cdot b$ to $0$ for conciseness. Similarly, we allow
+$\textit{ALT}$ to take a list of regular expressions as an argument
+instead of just two operands to reduce the nested depth of
+$\textit{ALT}$):
 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\]
@@ -856,30 +864,30 @@
 We would settle the correctness claim. 
It is relatively straightforward to establish that after one simplification step, the part of the derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
bmkeps r = bmkeps simp r
-as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete imporatnt POSIX information in a regular expression.
+as this basically comes down to proving that actions like removing the additional $r$ in $r+r$ do not delete important POSIX information in a regular expression.
The hard core of this problem is to prove that
bmkeps bders r = bmkeps bders simp r
-That is, if we do derivative on regular expression r and the simplificed version fo r , they can still provie the same POSIZ calue if there is one . This is not as strightforward as the previous proposition, as the two regular expression r and simp r might become very different regular epxressions after repeated application ofd simp and derivative.
+That is, if we take derivatives of the regular expression r and of its simplified version, they can still provide the same POSIX value if there is one. This is not as straightforward as the previous proposition, as the two regular expressions r and simp r might become very different after repeated applications of simp and derivative.
The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification. To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
\\definition of retrieve\\
- This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value).
+ This function assembles the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string (the characters that have not yet appeared, but are stored in the value). 
Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
$inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\
A little fact that needs to be stated to help comprehension:\\
$r^\uparrow = a$ ($a$ stands for $annotated$).\\
- Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton.
+ Fahad and Christian also used this fact to prove the correctness of the bit-coded algorithm without simplification.
 Our purpose of using this, however, is to try to establish \\
$ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\
The idea is that using $v'$,
- a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bitsequence that gievs the same parsing information as the unsimplified one.
+ a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$, we are still able to extract the bitsequence that gives the same parsing information as the unsimplified one.
 After establishing this, we might be able to finally bridge the gap of proving\\
$\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\
and subsequently\\
$\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\
- This proves that our simplified version of regular expression still contains all the bitcodes neeeded.
+ This proves that our simplified version of the regular expression still contains all the bitcodes needed.

-The second task is to speed up the more aggressive simplification. 
Currently it is slower than a naive simplification (the naive version, as implemented in ADU, can of course explode in some cases).
 So it needs to be explored how to make it faster. One possibility would be to explore again the connection to DFAs. This is very much work in progress.

 \section{Conclusion}