\section{Existing approaches}
\subsection{Shortcomings of different methods}


\subsubsection{NFAs}
|
\textbf{Problems with this approach:}
\begin{itemize}
\item
It can be slow, especially when many states are active at the same time.
\item
When lexing results are wanted, there can be exponentially many different matching results.
\end{itemize}


One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even a regular expression matched against one particular string can
still have more than one value corresponding to that match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{$n$ \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, is given by the following recurrence:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which is clearly exponential.
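To get a feel for this growth, here is a small sketch (in Scala; the function name
is ours) that computes $C_n$ directly from the recurrence above, reading the
ellipsis as the coefficients $n, n-1, \ldots, 2$ applied to $C_1,\ldots,C_{n-1}$:
\begin{verbatim}
import scala.collection.mutable

def countMatchings(n: Int): BigInt = {
  val cs = mutable.Map[Int, BigInt]()     // cs(k) will hold C_k
  for (k <- 1 to n) {
    // C_k = (k + 1) + k*C_1 + (k - 1)*C_2 + ... + 2*C_{k-1}
    val sum = (1 until k).map(j => BigInt(k + 1 - j) * cs(j)).sum
    cs(k) = BigInt(k + 1) + sum
  }
  cs(n)
}

// countMatchings(1) == 2, countMatchings(2) == 7,
// countMatchings(3) == 24, countMatchings(4) == 82, ...
\end{verbatim}
The ratio between successive values quickly approaches $2+\sqrt{2} \approx 3.41$,
the base of the closed form above, so already modest values of $n$ give an
astronomically large number of matchings.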
|
A lexer aimed at producing all possible values therefore has an exponential
worst-case runtime, so it is impractical to try to generate
all possible matches in a single run. In practice, we are usually
interested in POSIX values, which intuitively always
match the leftmost regular expression when there is a choice
and always match as much of a sub-part as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{$n$ iterations}], \Stars\,[])]$.
The desired output of our algorithm is such a POSIX match,
encoded as a value.\\
|
$\mathbf{TODO:}$
\begin{itemize}
\item
Illustrate graphically how $a^*\cdot(a^*)^*$ can match $aaa$ in different ways.
\item
Give a backtracking algorithm, and explain briefly why it can be exponentially slow
(when there is a match it is found straight away, but when there is none, the
algorithm fails to recognise immediately that no match is possible and instead
tries out all remaining possibilities); see the sketch after this list.
\item
Following on from the previous point, are there static analysis tools that single out such malicious
patterns and flag them before a lexer is even run?
Have a more thorough survey of the Birmingham paper.
Describe the scenarios in which such static analysis algorithms are suitable.
\end{itemize}
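As a first step towards the backtracking point above, the following is a minimal
sketch (all names are ours, and this is \emph{not} the algorithm studied in this
document) of a backtracking matcher written in continuation-passing style. Each
star iteration is forced to consume at least one character, matching the counting
convention above. On nested stars such as $(a^*\cdot a^*)^*$ run against strings
of the form $aa\ldots ab$, it tries out a huge number of ways of splitting the
$a$s before it can report that no match exists:
\begin{verbatim}
// Regular expressions (names are ours).
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Match a prefix of s with r and pass the leftover input to the
// continuation k.  The '||'s are where backtracking happens: whenever
// k returns false, the next alternative or the next split is tried.
def bt(r: Rexp, s: List[Char], k: List[Char] => Boolean): Boolean = r match {
  case ZERO        => false
  case ONE         => k(s)
  case CHAR(c)     => s match { case d :: rest if d == c => k(rest)
                                case _                   => false }
  case ALT(r1, r2) => bt(r1, s, k) || bt(r2, s, k)
  case SEQ(r1, r2) => bt(r1, s, rest => bt(r2, rest, k))
  case STAR(r1)    =>
    // either stop iterating, or do one iteration that must make
    // progress (consume at least one character), then repeat
    k(s) || bt(r1, s, rest => rest.length < s.length && bt(STAR(r1), rest, k))
}

def matches(r: Rexp, s: String): Boolean = bt(r, s.toList, _.isEmpty)

// (a* . a*)*  against "aaa...ab": the match must fail, but the matcher
// retries exponentially many splits of the a's before giving up.
val evil = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))
// matches(evil, "a" * 25 + "b")    // takes a very long time
\end{verbatim}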
\subsubsection{DFAs}
DFAs are used, for example, by the lexer generator JFlex.
Advantages: very fast on most regular expressions.\\
TODO: show it being fast on a lot of inputs\\
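To see why a DFA-based approach is fast, here is a minimal sketch of table-driven
matching (the transition table is a small, hand-written DFA for $(a|b)^*abb$, not
one generated by JFlex): once the table exists, matching does one lookup per input
character, so it runs in time linear in the length of the input, no matter how
contrived the original regular expression was.
\begin{verbatim}
// Hypothetical DFA for (a|b)*abb: states 0..3, accepting state 3.
val delta: Map[(Int, Char), Int] = Map(
  (0, 'a') -> 1, (0, 'b') -> 0,
  (1, 'a') -> 1, (1, 'b') -> 2,
  (2, 'a') -> 1, (2, 'b') -> 3,
  (3, 'a') -> 1, (3, 'b') -> 0)
val accepting = Set(3)

// One table lookup per character; a missing entry acts as a dead state.
def dfaAccepts(s: String): Boolean =
  s.foldLeft(Option(0)) {
    case (Some(q), c) => delta.get((q, c))
    case (None, _)    => None
  }.exists(accepting)

// dfaAccepts("ababb") == true,   dfaAccepts("abab") == false
\end{verbatim}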
\subsection{Bit-coded algorithm}
This algorithm adds bitcodes: it is built on top of derivatives, but regular
expressions are annotated with auxiliary bits that record how the lexing value
is to be constructed.
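As a sketch of the data structure involved (the Scala names are ours), annotated
regular expressions carry a bit-sequence \texttt{bs} at every node, corresponding
to the bit-annotations $bs$ that appear in the $\retrieve$ function later on:
\begin{verbatim}
// Bits are written 0 and 1, as in the retrieve function below.
type Bits = List[Int]

// Annotated regular expressions: every node carries auxiliary bits bs.
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: Bits, r: ARexp) extends ARexp
\end{verbatim}
During the derivative steps these bits accumulate the information from which, at
the end, the lexing value can be decoded.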
\subsection{Correctness Proof}

The correctness of the bit-coded algorithm was not proven by Sulzmann and Lu;
it was later proven by Ausaf and Urban.

For this, we started by looking at the proof of
\begin{equation}\label{lexer}
\blexer \; (r^\uparrow)\; s = \lexer \;r \;s,
\end{equation}
|
%\noindent
%might provide us insight into proving
%\begin{center}
%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
%\end{center}

\noindent
which establishes that the bit-sequence algorithm produces the same
result as the original algorithm, which does not use
bit-sequences.
The proof uses two ``tricks''. The first is the use of a $\flex$-function
|
\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (\inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}
|
\noindent
The intuition behind the $\flex$ function is that
it accumulates a series of $\inj$ applications while doing derivatives,
in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ applications are kept by
remembering which character
was chopped off and what the regular expression looked like before
chopping off that character.
The $\mathit{LIFO}$ order is achieved by always placing the newest $\inj$ application
before the application of $f$, the previously accumulated function applications.\\
Therefore, when the function $\flex$ is applied to a string $s@[c]$ whose last
character is $c$, its last injection can already be made explicit:
\begin{equation}\label{flex}
\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (\inj \; (r\backslash s) \; c\; v).
\end{equation}
That is, the last character can be peeled off, and the injection it causes is applied
directly to the argument value $v$.
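Transcribed into Scala the definition reads as follows (a sketch: \texttt{Rexp},
\texttt{Val}, the derivative \texttt{der} and the injection \texttt{inj} are
assumed to be the corresponding definitions from the surrounding development):
\begin{verbatim}
// flex r f s: take the derivatives for s, stacking up the injections.
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil     => f
  case c :: cs =>
    // the newest injection (inj r c) is applied before the previously
    // accumulated f, which gives the LIFO behaviour described above
    flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}
\end{verbatim}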
|
Ausaf and Urban proved that Sulzmann and Lu's lexer
can be characterised by the $\flex$ function:
\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id} \; s \;(\mkeps \; (r\backslash s))$.
\end{center}
|
\noindent
This property says that Sulzmann and Lu's $\lexer$ does lexing by
stacking up injection functions while doing derivatives,
making explicit the order in which the characters are
injected back in each step.
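As code, this characterisation reads as follows (again a sketch: \texttt{mkeps}
and the string derivative \texttt{ders} are assumed from the surrounding
development, and the case where the final derivative is not nullable, i.e.\ where
there is no match at all, is omitted):
\begin{verbatim}
// lexer r s  =  flex r id s (mkeps (r\s))
def lexer(r: Rexp, s: List[Char]): Val =
  flex(r, identity[Val], s)(mkeps(ders(s, r)))
\end{verbatim}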
|
\noindent
The other trick, which is the crux of the existing proof,
is the use of the $\retrieve$-function:
|
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
$\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
$bs \,@\, \textit{retrieve}\,a\,v$\\
$\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
$\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
$\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
$bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
$bs \,@\, [0]$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
\multicolumn{3}{l}{
\hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
\textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
\end{tabular}
\end{center}
|
\noindent
Sulzmann and Lu proposed this function, but did not prove
anything about it. Ausaf and Urban made use of the
following fact about $\retrieve$ in their proof:
\begin{equation}\label{retrieve_reversible}
\retrieve\; (\rup \backslash c) \; v = \retrieve \; \rup \;(\inj \;r \;c \; v)
\end{equation}
This says that $\retrieve$ always picks up the partial information about a lexing
value and transforms it into suitable bitcodes.
If the information is already in the regular expression (stored as bitcodes), it
keeps those bitcodes, using the value as a guide;
if the information is in the value, because it has been injected back into the value,
it ``digests'' that part of the value and transforms it into bitcodes.
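In Scala the function can be sketched as follows, using the annotated regular
expressions from the sketch in the bit-coded section; the value datatype is
spelled out here as well, with \texttt{Chr}, \texttt{Sequ}, \texttt{Lft} and
\texttt{Rgt} standing in for the text's $\Char$, $\Seq$, $\Left$ and $\Right$ to
avoid clashes with Scala's built-in names:
\begin{verbatim}
// Values, mirroring the datatype used in the text.
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lft(v: Val) extends Val              // the text's Left
case class Rgt(v: Val) extends Val              // the text's Right
case class Stars(vs: List[Val]) extends Val

// Walk the annotated regular expression with the value as a guide and
// collect the bitcodes along the way (partial: only defined when the
// value fits the shape of the regular expression).
def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, a1 :: _), Lft(v1))    => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Rgt(v1))    => bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(0)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(1) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}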
|
\noindent
Using this together with~\eqref{flex}, we can prove that the bit-coded version of
the lexer produces the same result as Sulzmann and Lu's lexer:
\begin{center}
$\lexer \; r \; s = \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) )\; r = \blexer \; r \; s$
\end{center}
where $v = \mkeps \; (r\backslash s)$.
|
\noindent
\begin{proof}
We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
\begin{center}
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
\end{center}
\noindent
We prove the above by reverse induction on the string $s$ (meaning that the inductive step is on
$s @ [c]$ rather than $c :: s$), where $v$ ranges over arbitrary values.\\
The base case goes through trivially.\\
For the inductive step, assume that
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
holds for all values $v$. We need to show that
$ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
|
Equation~\eqref{flex}, together with the induction hypothesis, allows us to rewrite the left-hand side:
\begin{center}
$ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)= \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
\end{center}
Equation~\eqref{retrieve_reversible} then allows us to rewrite the right-hand side of the above to
\begin{center}
$\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;) \;r$
\end{center}
which is exactly the right-hand side of the claim, completing the inductive step.
\end{proof}
|
\section{My Work}

\subsection{An improved version of the bit-coded algorithm: with simplification}