ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 601 ce4e5151a836
parent 591 b2d0de6aee18
child 608 37b6fd310a16
equal deleted inserted replaced
600:fd068f39ac23 601:ce4e5151a836
   514 
   514 
   515 \noindent
   515 \noindent
   516 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   516 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
   517 this algorithm presented graphically is as follows:
   517 this algorithm presented graphically is as follows:
   518 
   518 
   519 \begin{equation}\label{graph:successive_ders}
   519 \begin{equation}\label{matcher}
   520 \begin{tikzcd}
   520 \begin{tikzcd}
   521 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
   521 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
   522 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
   522 r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
   523 \;\textrm{true}/\textrm{false}
   523 \;\textrm{true}/\textrm{false}
   524 \end{tikzcd}
   524 \end{tikzcd}
   533 \begin{proof}
   533 \begin{proof}
   534 	By induction on $s$ using property of derivatives:
   534 	By induction on $s$ using property of derivatives:
   535 	lemma \ref{derDer}.
   535 	lemma \ref{derDer}.
   536 \end{proof}
   536 \end{proof}
   537 \noindent
   537 \noindent
   538 \begin{center}
       
   539 \begin{figure}
   538 \begin{figure}
       
   539 \begin{center}
   540 \begin{tikzpicture}
   540 \begin{tikzpicture}
   541 \begin{axis}[
   541 \begin{axis}[
   542     xlabel={$n$},
   542     xlabel={$n$},
   543     ylabel={time in secs},
   543     ylabel={time in secs},
   544     ymode = log,
   544     %ymode = log,
   545     legend entries={Naive Matcher},  
   545     legend entries={Naive Matcher},  
   546     legend pos=north west,
   546     legend pos=north west,
   547     legend cell align=left]
   547     legend cell align=left]
   548 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
   548 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
   549 \end{axis}
   549 \end{axis}
   550 \end{tikzpicture} 
   550 \end{tikzpicture} 
   551 \caption{Matching the regular expression $(a^*)^*b$ against strings of the form
   551 \caption{Matching the regular expression $(a^*)^*b$ against strings of the form
   552 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
   552 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
   553 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
   553 $ using Brzozowski's original algorithm}\label{NaiveMatcher}
       
   554 \end{center}
   554 \end{figure}
   555 \end{figure}
   555 \end{center} 
       
   556 \noindent
   556 \noindent
   557 If we implement the above algorithm naively, however,
   557 If we implement the above algorithm naively, however,
   558 the algorithm can be excruciatingly slow, as shown in 
   558 the algorithm can be excruciatingly slow, as shown in 
   559 \ref{NaiveMatcher}.
   559 \ref{NaiveMatcher}.
   560 Note that both axes are in logarithmic scale.
   560 Note that both axes are in logarithmic scale.
   598 \begin{figure}
   598 \begin{figure}
   599 \begin{tikzpicture}
   599 \begin{tikzpicture}
   600 \begin{axis}[
   600 \begin{axis}[
   601     xlabel={$n$},
   601     xlabel={$n$},
   602     ylabel={time in secs},
   602     ylabel={time in secs},
   603     ymode = log,
   603     %ymode = log,
   604     xmode = log,
   604     %xmode = log,
   605     grid = both,
   605     grid = both,
   606     legend entries={Matcher With Simp},  
   606     legend entries={Matcher With Simp},  
   607     legend pos=north west,
   607     legend pos=north west,
   608     legend cell align=left]
   608     legend cell align=left]
   609 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
   609 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
   700 \noindent
   700 \noindent
   701 The condition $|v| \neq []$ in the premise of star's rule
   701 The condition $|v| \neq []$ in the premise of star's rule
   702 is to make sure that for a given pair of regular 
   702 is to make sure that for a given pair of regular 
   703 expression $r$ and string $s$, the number of values 
   703 expression $r$ and string $s$, the number of values 
   704 satisfying $|v| = s$ and $\vdash v:r$ is finite.
   704 satisfying $|v| = s$ and $\vdash v:r$ is finite.
       
   705 This additional condition was
       
   706 imposed by Ausaf and Urban to make their proofs easier.
   705 Given the same string and regular expression, there can be
   707 Given the same string and regular expression, there can be
   706 multiple values for it. For example, both
   708 multiple values for it. For example, both
   707 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   709 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
   708 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   710 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
   709 and the values both flatten to $abc$.
   711 and the values both flatten to $abc$.
   715 formalised the property 
   717 formalised the property 
   716 as a ternary relation.
   718 as a ternary relation.
   717 The $\POSIX$ value $v$ for a regular expression
   719 The $\POSIX$ value $v$ for a regular expression
   718 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   720 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
   719 in the following set of rules\footnote{The names of the rules are used
   721 in the following set of rules\footnote{The names of the rules are used
   720 as they were originally given in \cite{AusafDyckhoffUrban2016}}:
   722 as they were originally given in \cite{AusafDyckhoffUrban2016} }:
   721 \noindent
   723 \begin{figure}[H]
   722 \begin{figure}
       
   723 \begin{mathpar}
   724 \begin{mathpar}
   724 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
   725 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
   725 		
   726 		
   726 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
   727 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
   727 
   728 
   739 	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
   740 	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
   740 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
   741 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
   741 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
   742 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
   742 	(v::vs)}
   743 	(v::vs)}
   743 \end{mathpar}
   744 \end{mathpar}
   744 \caption{POSIX Lexing Rules}
   745 \caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
       
   746 The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
       
   747 value $v$ given a string $s$ and 
       
   748 regular expression $r$.
       
   749 For example, this specification says that all matches for an alternative 
       
   750 must always prefer a left value to a right one.
       
   751 }
   745 \end{figure}
   752 \end{figure}
   746 \noindent
   753 \noindent
   747 
   754 
   748 \begin{figure}
   755 \begin{figure}
   749 \begin{tikzpicture}[]
   756 \begin{tikzpicture}[]
   830 We shall give the details for proving the sequence case here.
   837 We shall give the details for proving the sequence case here.
   831 
   838 
   832 When we have 
   839 When we have 
   833 \[
   840 \[
   834 	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
   841 	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
   835 	(s_2, r_2) \rightarrow v_2  \;\, and \;\,\\ 
   842 	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
   836 	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   843 	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   837 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
   844 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
   838 \]
   845 \]
   839 we know that the last condition 
   846 we know that the last condition 
   840 excludes the possibility of a 
   847 excludes the possibility of a 
   940 for how the regular expression 
   947 for how the regular expression 
   941 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   948 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
   942 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   949 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   943 After injecting back $n$ characters, we get the lexical value for how $r_0$
   950 After injecting back $n$ characters, we get the lexical value for how $r_0$
   944 matches $s$. 
   951 matches $s$. 
       
   952 \begin{figure}[H]
       
   953 \begin{center}	
   945 \begin{ceqn}
   954 \begin{ceqn}
   946 \begin{equation}\label{graph:inj}
       
   947 \begin{tikzcd}
   955 \begin{tikzcd}
   948 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
   956 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
   949 v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
   957 v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
   950 \end{tikzcd}
   958 \end{tikzcd}
   951 \end{equation}
       
   952 \end{ceqn}
   959 \end{ceqn}
       
   960 \end{center}
       
   961 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
       
   962 	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
       
   963 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
       
   964 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
       
   965 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
       
   966 	then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
       
   967 	the empty string , by always selecting the leftmost 
       
   968 	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
       
   969 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
       
   970 \end{figure}
   953 \noindent
   971 \noindent
   954 $\textit{inj}$ takes three arguments: a regular
   972 $\textit{inj}$ takes three arguments: a regular
   955 expression ${r_{i}}$, before the character is chopped off, 
   973 expression ${r_{i}}$, before the character is chopped off, 
   956 a character ${c_{i}}$, the character we want to inject back and 
   974 a character ${c_{i}}$, the character we want to inject back and 
   957 the third argument $v_{i+1}$ the value we want to inject into. 
   975 the third argument $v_{i+1}$ the value we want to inject into. 
  1186 w.r.t.~the character $a$, one obtains a derivative regular expression
  1204 w.r.t.~the character $a$, one obtains a derivative regular expression
  1187 with millions of nodes (when viewed as a tree)
  1205 with millions of nodes (when viewed as a tree)
  1188 even with simplification, which is not much better compared
  1206 even with simplification, which is not much better compared
  1189 with the naive version without any simplifications:
  1207 with the naive version without any simplifications:
  1190 \begin{figure}[H]
  1208 \begin{figure}[H]
  1191 	\centering
  1209 \begin{center}
  1192 \begin{tikzpicture}
  1210 \begin{tikzpicture}
  1193 \begin{axis}[
  1211 \begin{axis}[
  1194     xlabel={$n$},
  1212     xlabel={$n$},
  1195     ylabel={size},
  1213     ylabel={size},
  1196     legend entries={Simple-Minded Simp, Naive Matcher},  
  1214     legend entries={Simple-Minded Simp, Naive Matcher},  
  1198     legend cell align=left]
  1216     legend cell align=left]
  1199 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
  1217 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
  1200 \addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
  1218 \addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
  1201 \end{axis}
  1219 \end{axis}
  1202 \end{tikzpicture} 
  1220 \end{tikzpicture} 
       
  1221 \end{center}
  1203 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
  1222 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
  1204 \end{figure}\label{fig:BetterWaterloo}
  1223 \end{figure}\label{fig:BetterWaterloo}
  1205    
  1224    
  1206 That is because Sulzmann and Lu's 
  1225 That is because Sulzmann and Lu's 
  1207 injection-based lexing algorithm keeps a lot of 
  1226 injection-based lexing algorithm keeps a lot of