changes1
authorChengsong
Sat, 06 Jul 2019 19:48:20 +0100
changeset 59 8ff7b7508824
parent 58 f0360e17080e
child 60 c737a0259194
changes1
lex_blex_Frankensteined.scala
ninems/ninems.tex
--- a/lex_blex_Frankensteined.scala	Fri Jul 05 23:46:25 2019 +0100
+++ b/lex_blex_Frankensteined.scala	Sat Jul 06 19:48:20 2019 +0100
@@ -146,7 +146,7 @@
     case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
     case (ASTAR(bs, a), Stars(Nil)) => bs ++ List(Z) 
     case (ASTAR(bs, a), Stars(v::vs)) => bs ++ List(S) ++ retrieve(a, v) ++ retrieve(ASTAR(Nil, a), Stars(vs))
-  }
+  }//bug here last clause should not add list(S)
   //erase function: extracts the regx from Aregex
   def erase(r:ARexp): Rexp = r match{
     case AZERO => ZERO
--- a/ninems/ninems.tex	Fri Jul 05 23:46:25 2019 +0100
+++ b/ninems/ninems.tex	Sat Jul 06 19:48:20 2019 +0100
@@ -72,7 +72,7 @@
   applications such as lexing (tokenising a string). The problem is to
   make the algorithm by Sulzmann and Lu fast on all inputs without
   breaking its correctness. We have already developed some
-  simplification rules for this, but have not proved yet that they
+  simplification rules for this, but have not yet proved that they
   preserve the correctness of the algorithm. We also have not yet
   looked at extended regular expressions, such as bounded repetitions,
   negation and back-references.
@@ -181,7 +181,7 @@
 in the JavaScript and Python ecosystems \cite{Davis18}.
 
 This exponential blowup in matching algorithms sometimes causes
-considerable grief in real life: for example on 20 July 2016 one evil
+considerable disturbance in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
@@ -407,7 +407,7 @@
 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
 is contained in the sub-structure of $v_1$ and $v_2$. 
 
- To give a concrete example of how value work, consider the string $xy$
+ To give a concrete example of how value works, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative in
@@ -440,15 +440,15 @@
 is generated assuming the regular expression matches  the string. 
 Pictorially, the algorithm is as follows:
 
-\begin{center}
+\begin{equation}\label{graph:2}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
 \end{tikzcd}
-\end{center}
+\end{equation}
 
 \noindent
-For the convenience, we shall employ the following notations: the regular expression we
+For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
@@ -473,10 +473,71 @@
 		\end{tabular}
 	\end{center}
 
- After this, we inject back the characters one by one in order to build
+
+\noindent There are no cases for $\ZERO$ and $c$, since
+these regular expression cannot match the empty string. Note
+also that in case of alternatives we give preference to the
+regular expression on the left-hand side. This will become
+important later on.
+
+After this, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string
 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
+
+We need a function that
+reverses the "chopping off" for values which we did in the
+first phase for derivatives. The corresponding function is
+called $\textit{inj}$ (short for injection). This function takes three
+arguments: the first one is a regular expression $\textit{r_{i-1}}$ 
+before the character is chopped off, the second is  $\textit{c_{i-1}}$,
+the character
+we
+want to inject and the third argument is the value $textit{v_i}$, where we
+will inject the character(it corresponds to the regular expression
+ after the character
+has been chopped off). The result of this function is a
+new value. The definition of $\textit{inj}$ is as follows: 
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent This definition is by recursion on the regular
+expressions' and values' shapes. 
+To understands this definition in a more vivid way, the reader 
+might imagine this: when we do the derivative on regular expression 
+$r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
+This leaves a "hole" on $r_i$. In order to calculate a value for
+$r_{i-1}$, we need to locate where that hole is. We can find this location
+informtaion by comparing $r_{i-1}$ and $v_i$.
+For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
+is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that 
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
+otherwise if $r_a$ is not nullable,
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
+the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
+$Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
+branch is taken instead of the right one. We have therefore found out 
+that the hole will be on $r_a$. So we recursively call $\textit{inj} 
+r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
+$v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
+r_a c v_a, v_b)$.
+
+To understand what is going on, it might be best to do some
+example calculations and compare them with Figure~\eqref{graph:2}.
+
+
+
  A correctness proof using induction can be routinely established.
 
 It is instructive to see how this algorithm works by a little example.