--- a/lex_blex_Frankensteined.scala Fri Jul 05 23:46:25 2019 +0100
+++ b/lex_blex_Frankensteined.scala Sat Jul 06 19:48:20 2019 +0100
@@ -146,7 +146,7 @@
case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
case (ASTAR(bs, a), Stars(Nil)) => bs ++ List(Z)
case (ASTAR(bs, a), Stars(v::vs)) => bs ++ List(S) ++ retrieve(a, v) ++ retrieve(ASTAR(Nil, a), Stars(vs))
- }
+ }//bug here last clause should not add list(S)
//erase function: extracts the regx from Aregex
def erase(r:ARexp): Rexp = r match{
case AZERO => ZERO
--- a/ninems/ninems.tex Fri Jul 05 23:46:25 2019 +0100
+++ b/ninems/ninems.tex Sat Jul 06 19:48:20 2019 +0100
@@ -72,7 +72,7 @@
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
- simplification rules for this, but have not proved yet that they
+ simplification rules for this, but have not yet proved that they
preserve the correctness of the algorithm. We also have not yet
looked at extended regular expressions, such as bounded repetitions,
negation and back-references.
@@ -181,7 +181,7 @@
in the JavaScript and Python ecosystems \cite{Davis18}.
This exponential blowup in matching algorithms sometimes causes
-considerable grief in real life: for example on 20 July 2016 one evil
+considerable disturbance in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -407,7 +407,7 @@
respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
is contained in the sub-structure of $v_1$ and $v_2$.
- To give a concrete example of how value work, consider the string $xy$
+ To give a concrete example of how value works, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
@@ -440,15 +440,15 @@
is generated assuming the regular expression matches the string.
Pictorially, the algorithm is as follows:
-\begin{center}
+\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
-\end{center}
+\end{equation}
\noindent
-For the convenience, we shall employ the following notations: the regular expression we
+For convenience, we shall employ the following notations: the regular expression we
start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots according to
the characters $c_0$, $c_1$,\ldots until we exhaust the string and
@@ -473,10 +473,71 @@
\end{tabular}
\end{center}
- After this, we inject back the characters one by one in order to build
+
+\noindent There are no cases for $\ZERO$ and $c$, since
+these regular expression cannot match the empty string. Note
+also that in case of alternatives we give preference to the
+regular expression on the left-hand side. This will become
+important later on.
+
+After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string
$s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
+
+We need a function that
+reverses the "chopping off" for values which we did in the
+first phase for derivatives. The corresponding function is
+called $\textit{inj}$ (short for injection). This function takes three
+arguments: the first one is a regular expression $\textit{r_{i-1}}$
+before the character is chopped off, the second is $\textit{c_{i-1}}$,
+the character
+we
+want to inject and the third argument is the value $textit{v_i}$, where we
+will inject the character(it corresponds to the regular expression
+ after the character
+has been chopped off). The result of this function is a
+new value. The definition of $\textit{inj}$ is as follows:
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+ $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\
+ $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+ $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+ $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+ $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+ $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+ $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent This definition is by recursion on the regular
+expressions' and values' shapes.
+To understands this definition in a more vivid way, the reader
+might imagine this: when we do the derivative on regular expression
+$r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
+This leaves a "hole" on $r_i$. In order to calculate a value for
+$r_{i-1}$, we need to locate where that hole is. We can find this location
+informtaion by comparing $r_{i-1}$ and $v_i$.
+For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
+is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
+otherwise if $r_a$ is not nullable,
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
+the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
+$Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
+branch is taken instead of the right one. We have therefore found out
+that the hole will be on $r_a$. So we recursively call $\textit{inj}
+r_a c v_a$ to fill that hole in $v_a$. After injection, the value
+$v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
+r_a c v_a, v_b)$.
+
+To understand what is going on, it might be best to do some
+example calculations and compare them with Figure~\eqref{graph:2}.
+
+
+
A correctness proof using induction can be routinely established.
It is instructive to see how this algorithm works by a little example.