ChengsongPhdThesis/ChengsongPhDThesis.tex
changeset 441 426a93160f4a
parent 440 0856fbf8bda7
child 443 c6351a96bf80
equal deleted inserted replaced
440:0856fbf8bda7 441:426a93160f4a
    36 \newcommand{\ZERO}{\mbox{\bf 0}}
    36 \newcommand{\ZERO}{\mbox{\bf 0}}
    37 \newcommand{\ONE}{\mbox{\bf 1}}
    37 \newcommand{\ONE}{\mbox{\bf 1}}
    38 \def\lexer{\mathit{lexer}}
    38 \def\lexer{\mathit{lexer}}
    39 \def\mkeps{\mathit{mkeps}}
    39 \def\mkeps{\mathit{mkeps}}
    40 
    40 
       
    41 \def\bmkeps{\textit{bmkeps}}
       
    42 \def\retrieve{\textit{retrieve}}
       
    43 \def\blexer{\textit{blexer}}
       
    44 \def\flex{\textit{flex}}
    41 \def\inj{\mathit{inj}}
    45 \def\inj{\mathit{inj}}
    42 \def\Empty{\mathit{Empty}}
    46 \def\Empty{\mathit{Empty}}
    43 \def\Left{\mathit{Left}}
    47 \def\Left{\mathit{Left}}
    44 \def\Right{\mathit{Right}}
    48 \def\Right{\mathit{Right}}
    45 \def\Stars{\mathit{Stars}}
    49 \def\Stars{\mathit{Stars}}
    47 \def\Seq{\mathit{Seq}}
    51 \def\Seq{\mathit{Seq}}
    48 \def\Der{\mathit{Der}}
    52 \def\Der{\mathit{Der}}
    49 \def\nullable{\mathit{nullable}}
    53 \def\nullable{\mathit{nullable}}
    50 \def\Z{\mathit{Z}}
    54 \def\Z{\mathit{Z}}
    51 \def\S{\mathit{S}}
    55 \def\S{\mathit{S}}
       
    56 \def\rup{r^\uparrow}
    52 
    57 
    53 
    58 
    54 \def\awidth{\mathit{awidth}}
    59 \def\awidth{\mathit{awidth}}
    55 \def\pder{\mathit{pder}}
    60 \def\pder{\mathit{pder}}
    56 \def\maxterms{\mathit{maxterms}}
    61 \def\maxterms{\mathit{maxterms}}
   144 \section{Existing approaches}
   149 \section{Existing approaches}
   145 \subsection{Shortcomings of different methods}
   150 \subsection{Shortcomings of different methods}
   146 
   151 
   147 
   152 
   148 \subsubsection{ NFA's}
   153 \subsubsection{ NFA's}
       
   154 $\bold{Problems With This:}$
       
   155 \begin{itemize}
       
   156 \item
   149 Can be slow especially when many states are active.
   157 Can be slow especially when many states are active.
       
   158 \item
       
   159 Want Lexing Results: Can have  Exponential different matching results.
       
   160 \end{itemize}
       
   161 
       
   162 
       
   163 One regular expression can have multiple lexical values. For example
       
   164 for the regular expression $(a+b)^*$, it has a infinite list of
       
   165 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
       
   166 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
       
   167 $\ldots$, and vice versa.
       
   168 Even for the regular expression matching a certain string, there could 
       
   169 still be more than one value corresponding to it.
       
   170 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
       
   171 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
       
   172 The number of different ways of matching 
       
   173 without allowing any value under a star to be flattened
       
   174 to an empty string can be given by the following formula:
       
   175 \begin{center}
       
   176 	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
       
   177 \end{center}
       
   178 and a closed form formula can be calculated to be
       
   179 \begin{equation}
       
   180 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
       
   181 \end{equation}
       
   182 which is clearly in exponential order.
       
   183 A lexer aimed at getting all the possible values has an exponential
       
   184 worst case runtime. Therefore it is impractical to try to generate
       
   185 all possible matches in a run. In practice, we are usually 
       
   186 interested about POSIX values, which by intuition always
       
   187 match the leftmost regular expression when there is a choice
       
   188 and always match a sub part as much as possible before proceeding
       
   189 to the next token. For example, the above example has the POSIX value
       
   190 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
       
   191 The output of an algorithm we want would be a POSIX matching
       
   192 encoded as a value.\\
       
   193 $\mathbf{TODO:}$
       
   194 \begin{itemize}
       
   195 \item
       
   196 Illustrate graphically how you can match $a*a**$ with $aaa$ in different ways. 
       
   197 \item
       
   198 Give a backtracking algorithm, and explain briefly why this can be exponentially slow.
       
   199 (When there is a matching, it finds straight away; where there is not one, this fails to 
       
   200 recognize immediately that a match cannot be possibly found, and tries out all remaining 
       
   201 possibilities, etc.)
       
   202 \item
       
   203 From the above point, are there statical analysis tools that single out those malicious 
       
   204 patterns and tell before a lexer is even run?
       
   205 Have a more thorough survey of the Birmingham paper.
       
   206 Give out the suitable scenarios for such static analysis algorithms.
       
   207 
       
   208 \end{itemize}
   150 
   209 
   151 \subsubsection{DFAs}
   210 \subsubsection{DFAs}
   152 The tool JFLEX uses it.
   211 The tool JFLEX uses it.
   153 Advantages: super fast on most regexes \\
   212 Advantages: super fast on most regexes \\
   154 TODO: show it being fast on a lot of inputs\\
   213 TODO: show it being fast on a lot of inputs\\
   181 
   240 
   182 \subsection{Bit-coded algorithm}
   241 \subsection{Bit-coded algorithm}
   183 +bitcodes!
   242 +bitcodes!
   184 Built on top of derivatives, but with auxiliary bits
   243 Built on top of derivatives, but with auxiliary bits
   185 \subsection{Correctness Proof}
   244 \subsection{Correctness Proof}
   186 unproven by Sulzmann  and Lu
   245 
   187 by Ausaf and Urban
   246 Not proven by Sulzmann  and Lu
       
   247 
       
   248 Proven by Ausaf and Urban!!
       
   249 
       
   250 
       
   251 For this we have started with looking at the proof of
       
   252 \begin{equation}\label{lexer}
       
   253 \blexer \; (r^\uparrow)  s = \lexer \;r \;s,
       
   254 \end{equation}
       
   255 
       
   256 %\noindent
       
   257 %might provide us insight into proving 
       
   258 %\begin{center}
       
   259 %$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
       
   260 %\end{center}
       
   261 
       
   262 \noindent
       
   263 which established that the bit-sequence algorithm produces the same
       
   264 result as the original algorithm, which does not use 
       
   265 bit-sequences.
       
   266 The proof uses two ``tricks''. One is that it uses a \flex-function
       
   267 
       
   268 \begin{center}
       
   269 \begin{tabular}{lcl}
       
   270 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
       
   271 $\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
       
   272 \end{tabular}
       
   273 \end{center}
       
   274 
       
   275 \noindent
       
   276 
       
   277 The intuition behind the $\flex$ function is that
       
   278  it  accumulates a series of $\inj$ function applications when doing derivatives
       
   279  in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ functions are kept by
       
   280   remembering which character
       
   281  was chopped off and what the regular expression looks like before
       
   282  chopping off that character.
       
   283  The $\mathit{LIFO}$ order was achieved by putting the newest $\inj$ application
       
   284  always before the application of $f$, the previously accumulated function applications.\\
       
   285 Therefore, the function $\flex$, when acted on a string $s@[c]$ where the last
       
   286 character is $c$, by nature can have its last injection function revealed already: 
       
   287 \begin{equation}\label{flex}
       
   288 \flex \; r \; id \; (s@[c]) \; v = \flex \;  r \; id \; s \; (inj \; (r\backslash s) \; c\; v).
       
   289 \end{equation}
       
   290 that the last character can be taken off, and the injection it causes be applied to 
       
   291 the argument value $v$.
       
   292 
       
   293 Ausaf and Urban proved that the Sulzmann and Lu's lexers
       
   294 can be charactarized by the $\flex$ function:
       
   295 \begin{center}
       
   296 $\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$.
       
   297 \end{center}
       
   298 
       
   299 \noindent
       
   300 This property says that the Sulzmann and Lu's $\lexer$ does lexing by
       
   301 stacking up injection functions while doing derivatives,
       
   302 explicitly showing the order of characters being
       
   303 injected back in each step.
       
   304 
       
   305 \noindent
       
   306 The other trick, which is the crux in the existing proof, 
       
   307 is the use of the $\retrieve$-function:
       
   308 \begin{center}
       
   309 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
   310   $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
       
   311   $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
       
   312   $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
       
   313      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
   314   $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
       
   315   $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
       
   316   $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
   317      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
   318   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
       
   319      $bs \,@\, [0]$\\
       
   320   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
   321   \multicolumn{3}{l}{
       
   322      \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
       
   323                     \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
       
   324 \end{tabular}
       
   325 \end{center}
       
   326 
       
   327 \noindent
       
   328 Sulzmann and Lu proposed this function, but did not prove
       
   329 anything about it. Ausaf and Urban made use of the
       
   330 fact about $\retrieve$ in their proof:
       
   331  \begin{equation}\label{retrieve_reversible}
       
   332  \retrieve\; \rup \backslash c \; v = \retrieve \;  \rup (\inj \;r \;c \; v)
       
   333  \end{equation} 
       
   334 This says that $\retrieve$ will always pick up 
       
   335 partial information about a lexing value value and transform it into suitable bitcodes.
       
   336 If the information is in the regular expression (stored as bitcodes), it will keep those
       
   337 bitcodes with the guidance of the value,
       
   338 if the information is in the value, which has been injected back to the value,
       
   339 it will "digest" and transform that part of the value to bitcodes.
       
   340 
       
   341 \noindent
       
   342 
       
   343 Using this together with ~\eqref{flex}, we can prove that the bitcoded version of
       
   344 lexer is the same as Sulzmann and Lu's lexer:
       
   345 \begin{center} 
       
   346 $\lexer \; r \; s =  \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) ) r = \blexer \; r \; s$
       
   347 \end{center}
       
   348 \noindent
       
   349 \begin{proof}
       
   350 We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
       
   351 \begin{center} 
       
   352 $ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
       
   353 \end{center}
       
   354 \noindent
       
   355 We prove the above by reverse induction on string $s$(meaning that the inductive step is on 
       
   356 $s @ [c]$ rather than $c :: s$).
       
   357 $v$ takes arbitrary values.\\
       
   358 The base case goes through trivially.\\
       
   359 For the inductive step, assuming
       
   360 $ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
       
   361 holds for all values $v$. Now we need to show that
       
   362 $ \flex \; r\; id\; s@[c]\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
       
   363 ~\eqref{flex} allows us to do the following rewrite:
       
   364 \begin{center} 
       
   365 $ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)=  \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
       
   366 \end{center}
       
   367 ~\eqref{retrieve_reversible} allows us  to further rewrite the $\mathit{RHS}$ of the above to
       
   368 \begin{center}
       
   369 $\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;)  \;r$
       
   370 \end{center}
       
   371 
       
   372 
       
   373 \end{proof}
       
   374 
       
   375 
   188 
   376 
   189 \section{My Work}
   377 \section{My Work}
   190 
   378 
   191 \subsection{an improved version of bit-coded algorithm: with simp!}
   379 \subsection{an improved version of bit-coded algorithm: with simp!}
   192 
   380