# HG changeset patch # User Chengsong # Date 1657807052 -3600 # Node ID 28cb8089ec368fb90d6aa6db827db04b3314157b # Parent 94604a5fd2717eb99cbe71c0320b7f8954fd3a94 more updaates diff -r 94604a5fd271 -r 28cb8089ec36 ChengsongTanPhdThesis/Chapters/Inj.tex --- a/ChengsongTanPhdThesis/Chapters/Inj.tex Wed Jul 13 08:35:45 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Thu Jul 14 14:57:32 2022 +0100 @@ -621,100 +621,138 @@ \begin{proof} By induction on $s$, $r$ and $v_1$. The inductive cases are all the POSIX rules. -Each case is proven by a combination of -the induction rules for $\POSIX$ -values and the inductive hypothesis. Probably the most cumbersome cases are the sequence and star with non-empty iterations. +We shall give the details for proving the sequence case here. -We give the reasoning about the sequence case as follows: -When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, -we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$ -and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold. -For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot -possibly form a $\POSIX$ for $s$. -If we have some other values $v_1'$ and $v_2'$ such that -$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$, -Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, -which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$ +When we have +\[ + (s_1, r_1) \rightarrow v_1 \;\, and \;\, + (s_2, r_2) \rightarrow v_2 \;\, and \;\,\\ + \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land + s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2 +\] +we know that the last condition +excludes the possibility of a +string $s_1'$ longer than $s_1$ such that +\[ +(s_1', r_1) \rightarrow v_1' \;\; +and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s +\] +hold. +A shorter string $s_1''$ with $s_2''$ satisfying +\[ +(s_1'', r_1) \rightarrow v_1'' +\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s +\] +cannot possibly form a $\POSIX$ value either, because +by definition there is a candidate +with longer initial string +$s_1$. Therefore, we know that the POSIX +value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching +$s$ must have the +property that +\[ + |a| = s_1 \;\; and \;\; |b| = s_2. +\] +The goal is to prove that $a = v_1 $ and $b = v_2$. +If we have some other POSIX values $v_{10}$ and $v_{20}$ such that +$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold, +then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, +which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ is the same as $\Seq(v_1, v_2)$. \end{proof} -Now we know what a $\POSIX$ value is; the problem is how do we achieve -such a value in a lexing algorithm, using derivatives? +\noindent +Now we know what a $\POSIX$ value is and why it is unique; +the problem is generating +such a value in a lexing algorithm using derivatives. \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} -The contribution of Sulzmann and Lu is an extension of Brzozowski's -algorithm by a second phase (the first phase being building successive -derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string. +Sulzmann and Lu extended Brzozowski's +derivative-based matching +to a lexing algorithm by a second pass +after the initial phase of successive derivatives. +This second phase generates a POSIX value +if the regular expression matches the string. Two functions are involved: $\inj$ and $\mkeps$. -The function $\mkeps$ constructs a value from the last -one of all the successive derivatives: +The first one used is $\mkeps$, which constructs a POSIX value from the last +derivative $r_n$: \begin{ceqn} \begin{equation}\label{graph:mkeps} \begin{tikzcd} -r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\ +r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ & & & v_n \end{tikzcd} \end{equation} \end{ceqn} - -It tells us how can an empty string be matched by a -regular expression, in a $\POSIX$ way: - +\noindent +In the above diagram, again we assume that +the input string $s$ is made of $n$ characters +$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$ +is given label $0$ and after each character $c_i$ is taken off +by the derivative operation the resulting derivative regular +expressioin is $r_{i+1}$.The last derivative operation +$\backslash c_{n-1}$ gives back $r_n$, which is transformed into +a value $v_n$ by $\mkeps$. +$v_n$ tells us how an empty string is matched by the (nullable) +regular expression $r_n$, in a $\POSIX$ way. +The definition of $\mkeps$ is \begin{center} \begin{tabular}{lcl} $\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\ - $\mkeps \; r_{1}+r_{2}$ & $\dn$ - & \textit{if} ($\nullable \; r_{1})$\\ - & & \textit{then} $\Left (\mkeps \; r_{1})$\\ - & & \textit{else} $\Right(\mkeps \; r_{2})$\\ - $\mkeps \; r_1\cdot r_2$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ + $\mkeps \; (r_{1}+r_{2})$ & $\dn$ + & $\textit{if}\; (\nullable \; r_{1}) \;\, + \textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ + & & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\ + $\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ $\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$ \end{tabular} \end{center} \noindent -We favour the left to match an empty string if there is a choice. +We favour the left child $r_1$ of $r_1 + r_2$ +to match an empty string if there is a choice. When there is a star for us to match the empty string, we give the $\Stars$ constructor an empty list, meaning -no iterations are taken. +no iteration is taken. The result of a call to $\mkeps$ on a $\nullable$ $r$ would be a $\POSIX$ value corresponding to $r$: -\begin{lemma} -$\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$ -\end{lemma}\label{mePosix} - - -After the $\mkeps$-call, we inject back the characters one by one in order to build -the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ -($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. +\begin{lemma}\label{mePosix} +$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ +\end{lemma} +\begin{proof} + By induction on the shape of $r$. +\end{proof} +\noindent +After the $\mkeps$-call, we inject back the characters one by one +in reverse order as they were chopped off in the derivative phase. +The fucntion for this is called $\inj$. $\inj$ and $\backslash$ +are not exactly reverse operations of one another, as $\inj$ +operates on values instead of regular +expressions. +In the diagram below, $v_i$ stands for the (POSIX) value +for how the regular expression +$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters +of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. After injecting back $n$ characters, we get the lexical value for how $r_0$ matches $s$. -To do this, Sulzmann and Lu defined a function that reverses -the ``chopping off'' of characters during the derivative phase. The -corresponding function is called \emph{injection}, written -$\textit{inj}$; it takes three arguments: the first one is a regular -expression ${r_{i-1}}$, before the character is chopped off, the second -is a character ${c_{i-1}}$, the character we want to inject and the -third argument is the value ${v_i}$, into which one wants to inject the -character (it corresponds to the regular expression after the character -has been chopped off). The result of this function is a new value. \begin{ceqn} \begin{equation}\label{graph:inj} \begin{tikzcd} -r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ -v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] +r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ +v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] \end{tikzcd} \end{equation} \end{ceqn} - - \noindent -The -definition of $\textit{inj}$ is as follows: - +$\textit{inj}$ takes three arguments: a regular +expression ${r_{i}}$, before the character is chopped off, +a character ${c_{i}}$, the character we want to inject back and +the third argument $v_{i+1}$ the value we want to inject into. +The result of this function is a new value $v_i$. +The definition of $\textit{inj}$ is as follows: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ diff -r 94604a5fd271 -r 28cb8089ec36 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Wed Jul 13 08:35:45 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Jul 14 14:57:32 2022 +0100 @@ -85,8 +85,8 @@ \def\flex{\textit{flex}} \def\inj{\mathit{inj}} \def\Empty{\textit{Empty}} -\def\Left{\mathit{Left}} -\def\Right{\mathit{Right}} +\def\Left{\textit{Left}} +\def\Right{\textit{Right}} \def\Stars{\mathit{Stars}} \def\Char{\mathit{Char}} \def\Seq{\mathit{Seq}}