276 \begin{center} |
276 \begin{center} |
277 $s \in S_{start} \iff [] \in S_{end}$ |
277 $s \in S_{start} \iff [] \in S_{end}$ |
278 \end{center} |
278 \end{center} |
279 \noindent |
279 \noindent |
280 Brzozowski noticed that this operation |
280 Brzozowski noticed that this operation |
281 can be ``mirrored" on regular expressions which |
281 can be ``mirrored'' on regular expressions which |
282 he calls the derivative of a regular expression $r$ |
282 he calls the derivative of a regular expression $r$ |
283 with respect to a character $c$, written |
283 with respect to a character $c$, written |
284 $r \backslash c$. This infix operator |
284 $r \backslash c$. This infix operator |
285 takes an original regular expression $r$ as input |
285 takes an original regular expression $r$ as input |
286 and a character as a right operand and |
286 and a character as a right operand and |
287 outputs a result, which is a new regular expression. |
287 outputs a result, which is a new regular expression. |
288 The derivative operation on regular expression |
288 The derivative operation on regular expression |
289 is defined such that the language of the derivative result |
289 is defined such that the language of the derivative result |
290 coincides with the language of the original |
290 coincides with the language of the original |
291 regular expression's language being taken the language |
291 regular expression being taken |
292 derivative with respect to the same character: |
292 derivative with respect to the same character: |
293 \begin{center} |
293 \begin{property} |
|
294 |
|
295 \[ |
|
296 L \; (r \backslash c) = \Der \; c \; (L \; r) |
|
297 \] |
|
298 \end{property} |
|
299 \noindent |
|
300 Pictorially, this looks as follows: |
|
301 |
294 \parskip \baselineskip |
302 \parskip \baselineskip |
295 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} |
303 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}} |
296 \def\rlwd{.5pt} |
304 \def\rlwd{.5pt} |
297 \newcommand\notate[3]{% |
305 \newcommand\notate[3]{% |
298 \unskip\def\useanchorwidth{T}% |
306 \unskip\def\useanchorwidth{T}% |
306 \Longstack{ |
314 \Longstack{ |
307 \notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, |
315 \notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, |
308 \;\ldots\}$} |
316 \;\ldots\}$} |
309 } |
317 } |
310 \Longstack{ |
318 \Longstack{ |
311 $\stackrel{\backslash c}{\longrightarrow}$ |
319 $\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$ |
312 } |
320 } |
313 \Longstack{ |
321 \Longstack{ |
314 \notate{$r\backslash c$}{2}{$L \; (r\backslash c)= |
322 \notate{$r\backslash c$}{1}{$L \; (r\backslash c)= |
315 \{\ldots,\;s_1,\;\ldots\}$} |
323 \{\ldots,\;s_1,\;\ldots\}$} |
316 } |
324 } |
317 \end{center} |
325 \\ |
318 \begin{center} |
326 The derivatives on regular expression can again be |
319 |
327 generalised to a string. |
320 \[ |
328 One could compute $r\backslash s$ and test membership of $s$ |
321 L(r \backslash c) = \Der \; c \; L(r) |
329 in $L \; r$ by checking |
322 \] |
|
323 \end{center} |
|
324 \noindent |
|
325 where we do derivatives on the regular expression |
|
326 $r$ and test membership of $s$ by checking |
|
327 whether the empty string is in the language of |
330 whether the empty string is in the language of |
328 $r\backslash s$. |
331 $r\backslash s$. |
329 For example in the sequence case we have |
332 |
|
333 |
|
334 \Longstack{ |
|
335 \notate{$r_{start}$}{4}{ |
|
336 \Longstack{$L \; r_{start} = \{\ldots, \;$ |
|
337 \notate{$s$}{1}{$c_1::s_1$} |
|
338 $, \ldots\} $ |
|
339 } |
|
340 } |
|
341 } |
|
342 \Longstack{ |
|
343 $\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$ |
|
344 } |
|
345 \Longstack{ |
|
346 \notate{$r_1$}{3}{ |
|
347 $r_1 = r_{start}\backslash c_1$, |
|
348 $L \; r_1 = $ |
|
349 \Longstack{ |
|
350 $\{ \ldots,\;$ \notate{$s_1$}{1}{$c_2::s_2$} |
|
351 $,\; \ldots \}$ |
|
352 } |
|
353 } |
|
354 } |
|
355 \Longstack{ |
|
356 $\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$ |
|
357 } |
|
358 \Longstack{ |
|
359 $r_2$ |
|
360 } |
|
361 \Longstack{ |
|
362 $ \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $ |
|
363 } |
|
364 \Longstack{ |
|
365 \notate{$r_{end}$}{1}{ |
|
366 $L \; r_{end} = \{\ldots, \; [], \ldots\}$} |
|
367 } |
|
368 |
|
369 |
|
370 |
|
371 \begin{property} |
|
372 $s \in L \; r_{start} \iff [] \in L \; r_{end}$ |
|
373 \end{property} |
|
374 \noindent |
|
375 Now we give the recursive definition of derivative on |
|
376 regular expressions, so that it satisfies the properties above. |
|
377 The derivative function, written $r\backslash c$, |
|
378 defines how a regular expression evolves into |
|
379 a new one after all the string it contains is acted on: |
|
380 if it starts with $c$, then the character is chopped of, |
|
381 if not, that string is removed. |
|
382 \begin{center} |
|
383 \begin{tabular}{lcl} |
|
384 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
385 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
386 $d \backslash c$ & $\dn$ & |
|
387 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
388 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
389 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
|
390 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
391 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
392 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
393 \end{tabular} |
|
394 \end{center} |
|
395 \noindent |
|
396 The most involved cases are the sequence case |
|
397 and the star case. |
|
398 The sequence case says that if the first regular expression |
|
399 contains an empty string, then the second component of the sequence |
|
400 needs to be considered, as its derivative will contribute to the |
|
401 result of this derivative: |
|
402 \begin{center} |
|
403 \begin{tabular}{lcl} |
|
404 $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ |
|
405 & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
|
406 \end{tabular} |
|
407 \end{center} |
|
408 \noindent |
|
409 Notice how this closely resembles |
|
410 the language derivative operation $\Der$: |
330 \begin{center} |
411 \begin{center} |
331 \begin{tabular}{lcl} |
412 \begin{tabular}{lcl} |
332 $\Der \; c \; (A @ B)$ & $\dn$ & |
413 $\Der \; c \; (A @ B)$ & $\dn$ & |
333 $ \textit{if} \;\, [] \in A \; |
414 $ \textit{if} \;\, [] \in A \; |
334 \textit{then} \;\, ((\Der \; c \; A) @ B ) \cup |
415 \textit{then} \;\, ((\Der \; c \; A) @ B ) \cup |
335 \Der \; c\; B$\\ |
416 \Der \; c\; B$\\ |
336 & & $\textit{else}\; (\Der \; c \; A) @ B$\\ |
417 & & $\textit{else}\; (\Der \; c \; A) @ B$\\ |
337 \end{tabular} |
418 \end{tabular} |
338 \end{center} |
419 \end{center} |
339 \noindent |
420 \noindent |
340 This can be translated to |
421 The star regular expression $r^*$'s derivative |
341 regular expressions in the following |
422 unwraps one iteration of $r$, turns it into $r\backslash c$, |
342 manner: |
423 and attaches the original $r^*$ |
343 \begin{center} |
424 after $r\backslash c$, so that |
344 \begin{tabular}{lcl} |
425 we can further unfold it as many times as needed: |
345 $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ |
426 \[ |
346 & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ |
427 (r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
347 \end{tabular} |
428 \] |
348 \end{center} |
429 Again, |
349 |
430 the structure is the same as the semantic derivative of Kleene star: |
350 \noindent |
|
351 And similarly, the Kleene star's semantic derivative |
|
352 can be expressed as |
|
353 \[ |
431 \[ |
354 \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
432 \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
355 \] |
|
356 which translates to |
|
357 \[ |
|
358 (r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
|
359 \] |
433 \] |
360 In the above definition of $(r_1\cdot r_2) \backslash c$, |
434 In the above definition of $(r_1\cdot r_2) \backslash c$, |
361 the $\textit{if}$ clause's |
435 the $\textit{if}$ clause's |
362 boolean condition |
436 boolean condition |
363 $[] \in L(r_1)$ needs to be |
437 $[] \in L(r_1)$ needs to be |
388 if at least one of its children is nullable. |
462 if at least one of its children is nullable. |
389 The sequence regular expression |
463 The sequence regular expression |
390 would require both children to have the empty string |
464 would require both children to have the empty string |
391 to compose an empty string, and the Kleene star |
465 to compose an empty string, and the Kleene star |
392 is always nullable because it naturally |
466 is always nullable because it naturally |
393 contains the empty string. |
467 contains the empty string. |
394 |
468 \noindent |
395 The derivative function, written $r\backslash c$, |
|
396 defines how a regular expression evolves into |
|
397 a new one after all the string it contains is acted on: |
|
398 if it starts with $c$, then the character is chopped of, |
|
399 if not, that string is removed. |
|
400 \begin{center} |
|
401 \begin{tabular}{lcl} |
|
402 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
403 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
404 $d \backslash c$ & $\dn$ & |
|
405 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
406 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
407 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ |
|
408 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
409 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
410 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
411 \end{tabular} |
|
412 \end{center} |
|
413 \noindent |
|
414 The most involved cases are the sequence case |
|
415 and the star case. |
|
416 The sequence case says that if the first regular expression |
|
417 contains an empty string, then the second component of the sequence |
|
418 needs to be considered, as its derivative will contribute to the |
|
419 result of this derivative. |
|
420 The star regular expression $r^*$'s derivative |
|
421 unwraps one iteration of $r$, turns it into $r\backslash c$, |
|
422 and attaches the original $r^*$ |
|
423 after $r\backslash c$, so that |
|
424 we can further unfold it as many times as needed. |
|
425 We have the following correspondence between |
469 We have the following correspondence between |
426 derivatives on regular expressions and |
470 derivatives on regular expressions and |
427 derivatives on a set of strings: |
471 derivatives on a set of strings: |
428 \begin{lemma}\label{derDer} |
472 \begin{lemma}\label{derDer} |
|
473 \begin{itemize} |
|
474 \item |
429 $\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
475 $\textit{Der} \; c \; L(r) = L (r\backslash c)$ |
|
476 \item |
|
477 $c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$. |
|
478 \end{itemize} |
430 \end{lemma} |
479 \end{lemma} |
431 |
480 \begin{proof} |
|
481 By induction on $r$. |
|
482 \end{proof} |
432 \noindent |
483 \noindent |
433 The main property of the derivative operation |
484 The main property of the derivative operation |
434 (that enables us to reason about the correctness of |
485 (that enables us to reason about the correctness of |
435 derivative-based matching) |
486 derivative-based matching) |
436 is |
487 is |
437 |
488 |
438 \begin{lemma}\label{derStepwise} |
|
439 $c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$. |
|
440 \end{lemma} |
|
441 |
489 |
442 \noindent |
490 \noindent |
443 We can generalise the derivative operation shown above for single characters |
491 We can generalise the derivative operation shown above for single characters |
444 to strings as follows: |
492 to strings as follows: |
445 |
493 |
687 (v::vs)} |
735 (v::vs)} |
688 \end{mathpar} |
736 \end{mathpar} |
689 \caption{POSIX Lexing Rules} |
737 \caption{POSIX Lexing Rules} |
690 \end{figure} |
738 \end{figure} |
691 \noindent |
739 \noindent |
|
740 |
|
741 \begin{figure} |
|
742 \begin{tikzpicture}[] |
|
743 \node [minimum width = 6cm, rectangle split, rectangle split horizontal, |
|
744 rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
|
745 (node1) |
|
746 {$r_{token1}$ |
|
747 \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ }; |
|
748 %\node [left = 6.0cm of node1] (start1) {hi}; |
|
749 \node [left = 0.2cm of node1] (middle) {$v.s.$}; |
|
750 \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, |
|
751 rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}] |
|
752 (node2) |
|
753 {$\quad\;\;\;r_{token1}\quad\;\;\;$ |
|
754 \nodepart{two} $r_{token2}$ }; |
|
755 \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX}; |
|
756 \node [above = 1.5cm of middle, minimum width = 6cm, |
|
757 rectangle, style={draw, rounded corners, inner sep=10pt}] |
|
758 (topNode) {$s$}; |
|
759 \path[->,draw] |
|
760 (topNode) edge node {split $A$} (node2) |
|
761 (topNode) edge node {split $B$} (node1) |
|
762 ; |
|
763 |
|
764 |
|
765 \end{tikzpicture} |
|
766 \caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
|
767 \end{figure} |
692 The above $\POSIX$ rules follows the intuition described below: |
768 The above $\POSIX$ rules follows the intuition described below: |
693 \begin{itemize} |
769 \begin{itemize} |
694 \item (Left Priority)\\ |
770 \item (Left Priority)\\ |
695 Match the leftmost regular expression when multiple options of matching |
771 Match the leftmost regular expression when multiple options of matching |
696 are available. |
772 are available. |
697 \item (Maximum munch)\\ |
773 \item (Maximum munch)\\ |
698 Always match a subpart as much as possible before proceeding |
774 Always match a subpart as much as possible before proceeding |
699 to the next token. |
775 to the next token. |
|
776 For example, when the string $s$ matches |
|
777 $r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split: |
|
778 Then the split that matches a longer string for the first token |
|
779 $r_{token1}$ is preferred by this maximum munch rule (See |
|
780 \ref{munch} for an illustration). |
|
781 \noindent |
|
782 |
700 \end{itemize} |
783 \end{itemize} |
701 \noindent |
784 \noindent |
702 These disambiguation strategies can be |
785 These disambiguation strategies can be |
703 quite practical. |
786 quite practical. |
704 For instance, when lexing a code snippet |
787 For instance, when lexing a code snippet |
1203 that will result in a POSIX value is only |
1286 that will result in a POSIX value is only |
1204 a minority among the many other terms, |
1287 a minority among the many other terms, |
1205 and one can remove ones that are absolutely not possible to |
1288 and one can remove ones that are absolutely not possible to |
1206 be POSIX. |
1289 be POSIX. |
1207 In the above example, |
1290 In the above example, |
1208 \[ |
1291 \begin{equation}\label{eqn:growth2} |
1209 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1292 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1210 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1293 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1211 \] |
1294 \end{equation} |
1212 can be further simplified by |
1295 can be further simplified by |
1213 removing the underlined term first, |
1296 removing the underlined term first, |
1214 which would open up possibilities |
1297 which would open up possibilities |
1215 of further simplification that removes the |
1298 of further simplification that removes the |
1216 underbraced part. |
1299 underbraced part. |