\section{Existing approaches}
\subsection{Shortcomings of different methods}


\subsubsection{NFAs}
|
\textbf{Problems with this approach:}
\begin{itemize}
\item
It can be slow, especially when many states are active at the same time.
\item
When lexing results are wanted, there can be exponentially many different matching results.
\end{itemize}


One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even a regular expression matched against one particular string can
still have more than one value corresponding to that match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{$n$ \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, is given by the following recurrence:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which is clearly exponential.
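To get a feel for this growth, here is a small sketch (in Scala; the function name
is ours) that computes $C_n$ directly from the recurrence above, reading the
ellipsis as the coefficients $n, n-1, \ldots, 2$ applied to $C_1,\ldots,C_{n-1}$:
\begin{verbatim}
import scala.collection.mutable

def countMatchings(n: Int): BigInt = {
  val cs = mutable.Map[Int, BigInt]()     // cs(k) will hold C_k
  for (k <- 1 to n) {
    // C_k = (k + 1) + k*C_1 + (k - 1)*C_2 + ... + 2*C_{k-1}
    val sum = (1 until k).map(j => BigInt(k + 1 - j) * cs(j)).sum
    cs(k) = BigInt(k + 1) + sum
  }
  cs(n)
}

// countMatchings(1) == 2, countMatchings(2) == 7,
// countMatchings(3) == 24, countMatchings(4) == 82, ...
\end{verbatim}
The ratio between successive values quickly approaches $2+\sqrt{2} \approx 3.41$,
the base of the closed form above, so already modest values of $n$ give an
astronomically large number of matchings.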
|
A lexer aimed at producing all possible values therefore has an exponential
worst-case runtime, so it is impractical to try to generate
all possible matches in a single run. In practice, we are usually
interested in POSIX values, which intuitively always
match the leftmost regular expression when there is a choice
and always match as much of a sub-part as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{$n$ iterations}], \Stars\,[])]$.
The desired output of our algorithm is such a POSIX match,
encoded as a value.\\
|
$\mathbf{TODO:}$
\begin{itemize}
\item
Illustrate graphically how $a^*\cdot(a^*)^*$ can match $aaa$ in different ways.
\item
Give a backtracking algorithm, and explain briefly why it can be exponentially slow
(when there is a match it is found straight away, but when there is none, the
algorithm fails to recognise immediately that no match is possible and instead
tries out all remaining possibilities); see the sketch after this list.
\item
Following on from the previous point, are there static analysis tools that single out such malicious
patterns and flag them before a lexer is even run?
Have a more thorough survey of the Birmingham paper.
Describe the scenarios in which such static analysis algorithms are suitable.
\end{itemize}
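As a first step towards the backtracking point above, the following is a minimal
sketch (all names are ours, and this is \emph{not} the algorithm studied in this
document) of a backtracking matcher written in continuation-passing style. Each
star iteration is forced to consume at least one character, matching the counting
convention above. On nested stars such as $(a^*\cdot a^*)^*$ run against strings
of the form $aa\ldots ab$, it tries out a huge number of ways of splitting the
$a$s before it can report that no match exists:
\begin{verbatim}
// Regular expressions (names are ours).
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Match a prefix of s with r and pass the leftover input to the
// continuation k.  The '||'s are where backtracking happens: whenever
// k returns false, the next alternative or the next split is tried.
def bt(r: Rexp, s: List[Char], k: List[Char] => Boolean): Boolean = r match {
  case ZERO        => false
  case ONE         => k(s)
  case CHAR(c)     => s match { case d :: rest if d == c => k(rest)
                                case _                   => false }
  case ALT(r1, r2) => bt(r1, s, k) || bt(r2, s, k)
  case SEQ(r1, r2) => bt(r1, s, rest => bt(r2, rest, k))
  case STAR(r1)    =>
    // either stop iterating, or do one iteration that must make
    // progress (consume at least one character), then repeat
    k(s) || bt(r1, s, rest => rest.length < s.length && bt(STAR(r1), rest, k))
}

def matches(r: Rexp, s: String): Boolean = bt(r, s.toList, _.isEmpty)

// (a* . a*)*  against "aaa...ab": the match must fail, but the matcher
// retries exponentially many splits of the a's before giving up.
val evil = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))
// matches(evil, "a" * 25 + "b")    // takes a very long time
\end{verbatim}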
\subsubsection{DFAs}
DFAs are used, for example, by the lexer generator JFlex.
Advantages: very fast on most regular expressions.\\
TODO: show it being fast on a lot of inputs\\
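To see why a DFA-based approach is fast, here is a minimal sketch of table-driven
matching (the transition table is a small, hand-written DFA for $(a|b)^*abb$, not
one generated by JFlex): once the table exists, matching does one lookup per input
character, so it runs in time linear in the length of the input, no matter how
contrived the original regular expression was.
\begin{verbatim}
// Hypothetical DFA for (a|b)*abb: states 0..3, accepting state 3.
val delta: Map[(Int, Char), Int] = Map(
  (0, 'a') -> 1, (0, 'b') -> 0,
  (1, 'a') -> 1, (1, 'b') -> 2,
  (2, 'a') -> 1, (2, 'b') -> 3,
  (3, 'a') -> 1, (3, 'b') -> 0)
val accepting = Set(3)

// One table lookup per character; a missing entry acts as a dead state.
def dfaAccepts(s: String): Boolean =
  s.foldLeft(Option(0)) {
    case (Some(q), c) => delta.get((q, c))
    case (None, _)    => None
  }.exists(accepting)

// dfaAccepts("ababb") == true,   dfaAccepts("abab") == false
\end{verbatim}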
\subsection{Bit-coded algorithm}
This algorithm adds bitcodes: it is built on top of derivatives, but regular
expressions are annotated with auxiliary bits that record how the lexing value
is to be constructed.
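As a sketch of the data structure involved (the Scala names are ours), annotated
regular expressions carry a bit-sequence \texttt{bs} at every node, corresponding
to the bit-annotations $bs$ that appear in the $\retrieve$ function later on:
\begin{verbatim}
// Bits are written 0 and 1, as in the retrieve function below.
type Bits = List[Int]

// Annotated regular expressions: every node carries auxiliary bits bs.
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: Bits, r: ARexp) extends ARexp
\end{verbatim}
During the derivative steps these bits accumulate the information from which, at
the end, the lexing value can be decoded.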
\subsection{Correctness Proof}

The correctness of the bit-coded algorithm was not proven by Sulzmann and Lu;
it was later proven by Ausaf and Urban.

For this, we started by looking at the proof of
\begin{equation}\label{lexer}
\blexer \; (r^\uparrow)\; s = \lexer \;r \;s,
\end{equation}
|
%\noindent
%might provide us insight into proving
%\begin{center}
%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
%\end{center}

\noindent
which establishes that the bit-sequence algorithm produces the same
result as the original algorithm, which does not use
bit-sequences.
The proof uses two ``tricks''. The first is the use of a $\flex$-function
|
\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (\inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}
|
\noindent
The intuition behind the $\flex$ function is that
it accumulates a series of $\inj$ applications while doing derivatives,
in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ applications are kept by
remembering which character
was chopped off and what the regular expression looked like before
chopping off that character.
The $\mathit{LIFO}$ order is achieved by always placing the newest $\inj$ application
before the application of $f$, the previously accumulated function applications.\\
Therefore, when the function $\flex$ is applied to a string $s@[c]$ whose last
character is $c$, its last injection can already be made explicit:
\begin{equation}\label{flex}
\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (\inj \; (r\backslash s) \; c\; v).
\end{equation}
That is, the last character can be peeled off, and the injection it causes is applied
directly to the argument value $v$.
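Transcribed into Scala the definition reads as follows (a sketch: \texttt{Rexp},
\texttt{Val}, the derivative \texttt{der} and the injection \texttt{inj} are
assumed to be the corresponding definitions from the surrounding development):
\begin{verbatim}
// flex r f s: take the derivatives for s, stacking up the injections.
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil     => f
  case c :: cs =>
    // the newest injection (inj r c) is applied before the previously
    // accumulated f, which gives the LIFO behaviour described above
    flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}
\end{verbatim}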
|
Ausaf and Urban proved that Sulzmann and Lu's lexer
can be characterised by the $\flex$ function:
\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id} \; s \;(\mkeps \; (r\backslash s))$.
\end{center}
|
\noindent
This property says that Sulzmann and Lu's $\lexer$ does lexing by
stacking up injection functions while doing derivatives,
making explicit the order in which the characters are
injected back in each step.
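As code, this characterisation reads as follows (again a sketch: \texttt{mkeps}
and the string derivative \texttt{ders} are assumed from the surrounding
development, and the case where the final derivative is not nullable, i.e.\ where
there is no match at all, is omitted):
\begin{verbatim}
// lexer r s  =  flex r id s (mkeps (r\s))
def lexer(r: Rexp, s: List[Char]): Val =
  flex(r, identity[Val], s)(mkeps(ders(s, r)))
\end{verbatim}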
|
\noindent
The other trick, which is the crux of the existing proof,
is the use of the $\retrieve$-function:
|
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
$\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
$bs \,@\, \textit{retrieve}\,a\,v$\\
$\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
$\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
$\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
$bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
$bs \,@\, [0]$\\
$\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
\multicolumn{3}{l}{
\hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
\textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
\end{tabular}
\end{center}
|
\noindent
Sulzmann and Lu proposed this function, but did not prove
anything about it. Ausaf and Urban made use of the
following fact about $\retrieve$ in their proof:
\begin{equation}\label{retrieve_reversible}
\retrieve\; (\rup \backslash c) \; v = \retrieve \; \rup \;(\inj \;r \;c \; v)
\end{equation}
This says that $\retrieve$ always picks up the partial information about a lexing
value and transforms it into suitable bitcodes.
If the information is already in the regular expression (stored as bitcodes), it
keeps those bitcodes, using the value as a guide;
if the information is in the value, because it has been injected back into the value,
it ``digests'' that part of the value and transforms it into bitcodes.
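In Scala the function can be sketched as follows, using the annotated regular
expressions from the sketch in the bit-coded section; the value datatype is
spelled out here as well, with \texttt{Chr}, \texttt{Sequ}, \texttt{Lft} and
\texttt{Rgt} standing in for the text's $\Char$, $\Seq$, $\Left$ and $\Right$ to
avoid clashes with Scala's built-in names:
\begin{verbatim}
// Values, mirroring the datatype used in the text.
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lft(v: Val) extends Val              // the text's Left
case class Rgt(v: Val) extends Val              // the text's Right
case class Stars(vs: List[Val]) extends Val

// Walk the annotated regular expression with the value as a guide and
// collect the bitcodes along the way (partial: only defined when the
// value fits the shape of the regular expression).
def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, a1 :: _), Lft(v1))    => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Rgt(v1))    => bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(0)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(1) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}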
|
\noindent
Using this together with~\eqref{flex}, we can prove that the bit-coded version of
the lexer produces the same result as Sulzmann and Lu's lexer:
\begin{center}
$\lexer \; r \; s = \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) )\; r = \blexer \; r \; s$
\end{center}
where $v = \mkeps \; (r\backslash s)$.
|
\noindent
\begin{proof}
We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
\begin{center}
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
\end{center}
\noindent
We prove the above by reverse induction on the string $s$ (meaning that the inductive step is on
$s @ [c]$ rather than $c :: s$), where $v$ ranges over arbitrary values.\\
The base case goes through trivially.\\
For the inductive step, assume that
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
holds for all values $v$. We need to show that
$ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
|
Equation~\eqref{flex}, together with the induction hypothesis, allows us to rewrite the left-hand side:
\begin{center}
$ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)= \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
\end{center}
Equation~\eqref{retrieve_reversible} then allows us to rewrite the right-hand side of the above to
\begin{center}
$\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;) \;r$
\end{center}
which is exactly the right-hand side of the claim, completing the inductive step.
\end{proof}
|
\section{My Work}

\subsection{An improved version of the bit-coded algorithm: with simplification}