ChengsongPhdThesis/ChengsongPhDThesis.tex
changeset 441 426a93160f4a
parent 440 0856fbf8bda7
child 443 c6351a96bf80
--- a/ChengsongPhdThesis/ChengsongPhDThesis.tex	Wed Mar 02 23:53:11 2022 +0000
+++ b/ChengsongPhdThesis/ChengsongPhDThesis.tex	Sat Mar 05 11:31:59 2022 +0000
@@ -38,6 +38,10 @@
 \def\lexer{\mathit{lexer}}
 \def\mkeps{\mathit{mkeps}}
 
+\def\bmkeps{\textit{bmkeps}}
+\def\retrieve{\textit{retrieve}}
+\def\blexer{\textit{blexer}}
+\def\flex{\textit{flex}}
 \def\inj{\mathit{inj}}
 \def\Empty{\mathit{Empty}}
 \def\Left{\mathit{Left}}
@@ -49,6 +53,7 @@
 \def\nullable{\mathit{nullable}}
 \def\Z{\mathit{Z}}
 \def\S{\mathit{S}}
+\def\rup{r^\uparrow}
 
 
 \def\awidth{\mathit{awidth}}
@@ -146,7 +151,61 @@
 
 
 \subsubsection{ NFA's}
+$\bold{Problems With This:}$
+\begin{itemize}
+\item
 Can be slow especially when many states are active.
+\item
+Want Lexing Results: Can have  Exponential different matching results.
+\end{itemize}
+
+
+One regular expression can have multiple lexical values. For example
+for the regular expression $(a+b)^*$, it has a infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\ldots$, and vice versa.
+Even for the regular expression matching a certain string, there could 
+still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching 
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{center}
+	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed form formula can be calculated to be
+\begin{equation}
+	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which is clearly in exponential order.
+A lexer aimed at getting all the possible values has an exponential
+worst case runtime. Therefore it is impractical to try to generate
+all possible matches in a run. In practice, we are usually 
+interested about POSIX values, which by intuition always
+match the leftmost regular expression when there is a choice
+and always match a sub part as much as possible before proceeding
+to the next token. For example, the above example has the POSIX value
+$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
+The output of an algorithm we want would be a POSIX matching
+encoded as a value.\\
+$\mathbf{TODO:}$
+\begin{itemize}
+\item
+Illustrate graphically how you can match $a*a**$ with $aaa$ in different ways. 
+\item
+Give a backtracking algorithm, and explain briefly why this can be exponentially slow.
+(When there is a matching, it finds straight away; where there is not one, this fails to 
+recognize immediately that a match cannot be possibly found, and tries out all remaining 
+possibilities, etc.)
+\item
+From the above point, are there statical analysis tools that single out those malicious 
+patterns and tell before a lexer is even run?
+Have a more thorough survey of the Birmingham paper.
+Give out the suitable scenarios for such static analysis algorithms.
+
+\end{itemize}
 
 \subsubsection{DFAs}
 The tool JFLEX uses it.
@@ -183,8 +242,137 @@
 +bitcodes!
 Built on top of derivatives, but with auxiliary bits
 \subsection{Correctness Proof}
-unproven by Sulzmann  and Lu
-by Ausaf and Urban
+
+Not proven by Sulzmann  and Lu
+
+Proven by Ausaf and Urban!!
+
+
+For this we have started with looking at the proof of
+\begin{equation}\label{lexer}
+\blexer \; (r^\uparrow)  s = \lexer \;r \;s,
+\end{equation}
+
+%\noindent
+%might provide us insight into proving 
+%\begin{center}
+%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
+%\end{center}
+
+\noindent
+which established that the bit-sequence algorithm produces the same
+result as the original algorithm, which does not use 
+bit-sequences.
+The proof uses two ``tricks''. One is that it uses a \flex-function
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
+$\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
+\end{tabular}
+\end{center}
+
+\noindent
+
+The intuition behind the $\flex$ function is that
+ it  accumulates a series of $\inj$ function applications when doing derivatives
+ in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ functions are kept by
+  remembering which character
+ was chopped off and what the regular expression looks like before
+ chopping off that character.
+ The $\mathit{LIFO}$ order was achieved by putting the newest $\inj$ application
+ always before the application of $f$, the previously accumulated function applications.\\
+Therefore, the function $\flex$, when acted on a string $s@[c]$ where the last
+character is $c$, by nature can have its last injection function revealed already: 
+\begin{equation}\label{flex}
+\flex \; r \; id \; (s@[c]) \; v = \flex \;  r \; id \; s \; (inj \; (r\backslash s) \; c\; v).
+\end{equation}
+that the last character can be taken off, and the injection it causes be applied to 
+the argument value $v$.
+
+Ausaf and Urban proved that the Sulzmann and Lu's lexers
+can be charactarized by the $\flex$ function:
+\begin{center}
+$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$.
+\end{center}
+
+\noindent
+This property says that the Sulzmann and Lu's $\lexer$ does lexing by
+stacking up injection functions while doing derivatives,
+explicitly showing the order of characters being
+injected back in each step.
+
+\noindent
+The other trick, which is the crux in the existing proof, 
+is the use of the $\retrieve$-function:
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
+  $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
+  $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
+  $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
+     $bs \,@\, \textit{retrieve}\,a\,v$\\
+  $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
+  $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
+  $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
+     $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
+  $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
+     $bs \,@\, [0]$\\
+  $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
+  \multicolumn{3}{l}{
+     \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
+                    \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
+\end{tabular}
+\end{center}
+
+\noindent
+Sulzmann and Lu proposed this function, but did not prove
+anything about it. Ausaf and Urban made use of the
+fact about $\retrieve$ in their proof:
+ \begin{equation}\label{retrieve_reversible}
+ \retrieve\; \rup \backslash c \; v = \retrieve \;  \rup (\inj \;r \;c \; v)
+ \end{equation} 
+This says that $\retrieve$ will always pick up 
+partial information about a lexing value value and transform it into suitable bitcodes.
+If the information is in the regular expression (stored as bitcodes), it will keep those
+bitcodes with the guidance of the value,
+if the information is in the value, which has been injected back to the value,
+it will "digest" and transform that part of the value to bitcodes.
+
+\noindent
+
+Using this together with ~\eqref{flex}, we can prove that the bitcoded version of
+lexer is the same as Sulzmann and Lu's lexer:
+\begin{center} 
+$\lexer \; r \; s =  \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) ) r = \blexer \; r \; s$
+\end{center}
+\noindent
+\begin{proof}
+We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
+\begin{center} 
+$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
+\end{center}
+\noindent
+We prove the above by reverse induction on string $s$(meaning that the inductive step is on 
+$s @ [c]$ rather than $c :: s$).
+$v$ takes arbitrary values.\\
+The base case goes through trivially.\\
+For the inductive step, assuming
+$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
+holds for all values $v$. Now we need to show that
+$ \flex \; r\; id\; s@[c]\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
+~\eqref{flex} allows us to do the following rewrite:
+\begin{center} 
+$ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)=  \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
+\end{center}
+~\eqref{retrieve_reversible} allows us  to further rewrite the $\mathit{RHS}$ of the above to
+\begin{center}
+$\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;)  \;r$
+\end{center}
+
+
+\end{proof}
+
+
 
 \section{My Work}