--- a/ChengsongPhdThesis/ChengsongPhDThesis.tex Wed Mar 02 23:53:11 2022 +0000
+++ b/ChengsongPhdThesis/ChengsongPhDThesis.tex Sat Mar 05 11:31:59 2022 +0000
@@ -38,6 +38,10 @@
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}
+\def\bmkeps{\textit{bmkeps}}
+\def\retrieve{\textit{retrieve}}
+\def\blexer{\textit{blexer}}
+\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
@@ -49,6 +53,7 @@
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
+\def\rup{r^\uparrow}
\def\awidth{\mathit{awidth}}
@@ -146,7 +151,61 @@
\subsubsection{ NFA's}
+$\bold{Problems With This:}$
+\begin{itemize}
+\item
Can be slow especially when many states are active.
+\item
+Want Lexing Results: Can have Exponential different matching results.
+\end{itemize}
+
+
+One regular expression can have multiple lexical values. For example
+for the regular expression $(a+b)^*$, it has a infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\ldots$, and vice versa.
+Even for the regular expression matching a certain string, there could
+still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{center}
+ $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed form formula can be calculated to be
+\begin{equation}
+ C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which is clearly in exponential order.
+A lexer aimed at getting all the possible values has an exponential
+worst case runtime. Therefore it is impractical to try to generate
+all possible matches in a run. In practice, we are usually
+interested about POSIX values, which by intuition always
+match the leftmost regular expression when there is a choice
+and always match a sub part as much as possible before proceeding
+to the next token. For example, the above example has the POSIX value
+$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
+The output of an algorithm we want would be a POSIX matching
+encoded as a value.\\
+$\mathbf{TODO:}$
+\begin{itemize}
+\item
+Illustrate graphically how you can match $a*a**$ with $aaa$ in different ways.
+\item
+Give a backtracking algorithm, and explain briefly why this can be exponentially slow.
+(When there is a matching, it finds straight away; where there is not one, this fails to
+recognize immediately that a match cannot be possibly found, and tries out all remaining
+possibilities, etc.)
+\item
+From the above point, are there statical analysis tools that single out those malicious
+patterns and tell before a lexer is even run?
+Have a more thorough survey of the Birmingham paper.
+Give out the suitable scenarios for such static analysis algorithms.
+
+\end{itemize}
\subsubsection{DFAs}
The tool JFLEX uses it.
@@ -183,8 +242,137 @@
+bitcodes!
Built on top of derivatives, but with auxiliary bits
\subsection{Correctness Proof}
-unproven by Sulzmann and Lu
-by Ausaf and Urban
+
+Not proven by Sulzmann and Lu
+
+Proven by Ausaf and Urban!!
+
+
+For this we have started with looking at the proof of
+\begin{equation}\label{lexer}
+\blexer \; (r^\uparrow) s = \lexer \;r \;s,
+\end{equation}
+
+%\noindent
+%might provide us insight into proving
+%\begin{center}
+%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
+%\end{center}
+
+\noindent
+which established that the bit-sequence algorithm produces the same
+result as the original algorithm, which does not use
+bit-sequences.
+The proof uses two ``tricks''. One is that it uses a \flex-function
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
+$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
+\end{tabular}
+\end{center}
+
+\noindent
+
+The intuition behind the $\flex$ function is that
+ it accumulates a series of $\inj$ function applications when doing derivatives
+ in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ functions are kept by
+ remembering which character
+ was chopped off and what the regular expression looks like before
+ chopping off that character.
+ The $\mathit{LIFO}$ order was achieved by putting the newest $\inj$ application
+ always before the application of $f$, the previously accumulated function applications.\\
+Therefore, the function $\flex$, when acted on a string $s@[c]$ where the last
+character is $c$, by nature can have its last injection function revealed already:
+\begin{equation}\label{flex}
+\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (inj \; (r\backslash s) \; c\; v).
+\end{equation}
+that the last character can be taken off, and the injection it causes be applied to
+the argument value $v$.
+
+Ausaf and Urban proved that the Sulzmann and Lu's lexers
+can be charactarized by the $\flex$ function:
+\begin{center}
+$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$.
+\end{center}
+
+\noindent
+This property says that the Sulzmann and Lu's $\lexer$ does lexing by
+stacking up injection functions while doing derivatives,
+explicitly showing the order of characters being
+injected back in each step.
+
+\noindent
+The other trick, which is the crux in the existing proof,
+is the use of the $\retrieve$-function:
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
+ $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
+ $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
+ $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
+ $bs \,@\, \textit{retrieve}\,a\,v$\\
+ $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
+ $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
+ $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
+ $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
+ $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
+ $bs \,@\, [0]$\\
+ $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
+ \multicolumn{3}{l}{
+ \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
+ \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
+\end{tabular}
+\end{center}
+
+\noindent
+Sulzmann and Lu proposed this function, but did not prove
+anything about it. Ausaf and Urban made use of the
+fact about $\retrieve$ in their proof:
+ \begin{equation}\label{retrieve_reversible}
+ \retrieve\; \rup \backslash c \; v = \retrieve \; \rup (\inj \;r \;c \; v)
+ \end{equation}
+This says that $\retrieve$ will always pick up
+partial information about a lexing value value and transform it into suitable bitcodes.
+If the information is in the regular expression (stored as bitcodes), it will keep those
+bitcodes with the guidance of the value,
+if the information is in the value, which has been injected back to the value,
+it will "digest" and transform that part of the value to bitcodes.
+
+\noindent
+
+Using this together with ~\eqref{flex}, we can prove that the bitcoded version of
+lexer is the same as Sulzmann and Lu's lexer:
+\begin{center}
+$\lexer \; r \; s = \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) ) r = \blexer \; r \; s$
+\end{center}
+\noindent
+\begin{proof}
+We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
+\begin{center}
+$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
+\end{center}
+\noindent
+We prove the above by reverse induction on string $s$(meaning that the inductive step is on
+$s @ [c]$ rather than $c :: s$).
+$v$ takes arbitrary values.\\
+The base case goes through trivially.\\
+For the inductive step, assuming
+$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
+holds for all values $v$. Now we need to show that
+$ \flex \; r\; id\; s@[c]\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
+~\eqref{flex} allows us to do the following rewrite:
+\begin{center}
+$ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)= \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
+\end{center}
+~\eqref{retrieve_reversible} allows us to further rewrite the $\mathit{RHS}$ of the above to
+\begin{center}
+$\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;) \;r$
+\end{center}
+
+
+\end{proof}
+
+
\section{My Work}