366 a matcher with simpler regular expressions. |
366 a matcher with simpler regular expressions. |
367 |
367 |
368 If we want the size of derivatives in the algorithm to |
368 If we want the size of derivatives in the algorithm to |
369 stay even lower, we would need more aggressive simplifications. |
369 stay even lower, we would need more aggressive simplifications. |
370 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
370 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
371 deleting duplicates whenever possible. For example, the parentheses in |
371 delete duplicates whenever possible. For example, the parentheses in |
372 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
372 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
373 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
373 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
374 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
374 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
375 $a^*+a+\ONE$. These more aggressive simplification rules are for |
375 $a^*+a+\ONE$. These more aggressive simplification rules are for |
376 a very tight size bound, possibly as low |
376 a very tight size bound, possibly as low |
377 as that of the \emph{partial derivatives}\parencite{Antimirov1995}. |
377 as that of the \emph{partial derivatives}\parencite{Antimirov1995}. |
378 |
378 |
379 Building derivatives and then simplify them. |
379 Building derivatives and then simplifying them. |
380 So far so good. But what if we want to |
380 So far, so good. But what if we want to |
381 do lexing instead of just getting a YES/NO answer? |
381 do lexing instead of just getting a YES/NO answer? |
382 This requires us to go back again to the world |
382 This requires us to go back again to the world |
383 without simplification first for a moment. |
383 without simplification first for a moment. |
384 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and |
384 Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and |
385 elegant(arguably as beautiful as the original |
385 elegant(arguably as beautiful as the original |
386 derivatives definition) solution for this. |
386 derivatives definition) solution for this. |
438 One regular expression can have multiple lexical values. For example |
438 One regular expression can have multiple lexical values. For example |
439 for the regular expression $(a+b)^*$, it has a infinite list of |
439 for the regular expression $(a+b)^*$, it has a infinite list of |
440 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, |
440 values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$, |
441 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$, |
441 $\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$, |
442 $\ldots$, and vice versa. |
442 $\ldots$, and vice versa. |
443 Even for the regular expression matching a certain string, there could |
443 Even for the regular expression matching a particular string, there could |
444 still be more than one value corresponding to it. |
444 be more than one value corresponding to it. |
445 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
445 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
446 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
446 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
447 If we do not allow any empty iterations in its lexical values, |
447 If we do not allow any empty iterations in its lexical values, |
448 there will be $n - 1$ "splitting points" on $s$ we can choose to |
448 there will be $n - 1$ "splitting points" on $s$ we can choose to |
449 split or not so that each sub-string |
449 split or not so that each sub-string |
461 between the two $a^*$s. |
461 between the two $a^*$s. |
462 It is not surprising there are exponentially many lexical values |
462 It is not surprising there are exponentially many lexical values |
463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
465 |
465 |
466 A lexer aimed at keeping all the possible values will naturally |
466 A lexer to keep all the possible values will naturally |
467 have an exponential runtime on ambiguous regular expressions. |
467 have an exponential runtime on ambiguous regular expressions. |
468 Somehow one has to decide which lexical value to keep and |
468 Somehow one has to decide which lexical value to keep and |
469 output in a lexing algorithm. |
469 output in a lexing algorithm. |
470 In practice, we are usually |
470 In practice, we are usually |
471 interested about POSIX values, which by intuition always |
471 interested in POSIX values, which by intuition always |
472 \begin{itemize} |
472 \begin{itemize} |
473 \item |
473 \item |
474 match the leftmost regular expression when multiple options of matching |
474 match the leftmost regular expression when multiple options of matching |
475 are available |
475 are available |
476 \item |
476 \item |
543 \end{tabular} |
543 \end{tabular} |
544 \end{center} |
544 \end{center} |
545 |
545 |
546 |
546 |
547 \noindent |
547 \noindent |
548 We favour the left to match an empty string in case there is a choice. |
548 We favour the left to match an empty string if there is a choice. |
549 When there is a star for us to match the empty string, |
549 When there is a star for us to match the empty string, |
550 we simply give the $\Stars$ constructor an empty list, meaning |
550 we give the $\Stars$ constructor an empty list, meaning |
551 no iterations are taken. |
551 no iterations are taken. |
552 |
552 |
553 |
553 |
554 After the $\mkeps$-call, we inject back the characters one by one in order to build |
554 After the $\mkeps$-call, we inject back the characters one by one in order to build |
555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
555 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
556 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
557 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
557 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
558 matches $s$. The POSIX value is maintained throught out the process. |
558 matches $s$. The POSIX value is maintained throughout the process. |
559 For this Sulzmann and Lu defined a function that reverses |
559 For this Sulzmann and Lu defined a function that reverses |
560 the ``chopping off'' of characters during the derivative phase. The |
560 the ``chopping off'' of characters during the derivative phase. The |
561 corresponding function is called \emph{injection}, written |
561 corresponding function is called \emph{injection}, written |
562 $\textit{inj}$; it takes three arguments: the first one is a regular |
562 $\textit{inj}$; it takes three arguments: the first one is a regular |
563 expression ${r_{i-1}}$, before the character is chopped off, the second |
563 expression ${r_{i-1}}$, before the character is chopped off, the second |
591 \end{tabular} |
591 \end{tabular} |
592 \end{center} |
592 \end{center} |
593 |
593 |
594 \noindent This definition is by recursion on the ``shape'' of regular |
594 \noindent This definition is by recursion on the ``shape'' of regular |
595 expressions and values. |
595 expressions and values. |
596 The clauses basically do one thing--identifying the ``holes'' on |
596 The clauses do one thing--identifying the ``hole'' on a |
597 value to inject the character back into. |
597 value to inject the character back into. |
598 For instance, in the last clause for injecting back to a value |
598 For instance, in the last clause for injecting back to a value |
599 that would turn into a new star value that corresponds to a star, |
599 that would turn into a new star value that corresponds to a star, |
600 we know it must be a sequence value. And we know that the first |
600 we know it must be a sequence value. And we know that the first |
601 value of that sequence corresponds to the child regex of the star |
601 value of that sequence corresponds to the child regex of the star |
602 with the first character being chopped off--an iteration of the star |
602 with the first character being chopped off--an iteration of the star |
603 that had just been unfolded. This value is followed by the already |
603 that had just been unfolded. This value is followed by the already |
604 matched star iterations we collected before. So we inject the character |
604 matched star iterations we collected before. So we inject the character |
605 back to the first value and form a new value with this new iteration |
605 back to the first value and form a new value with this latest iteration |
606 being added to the previous list of iterations, all under the $\Stars$ |
606 being added to the previous list of iterations, all under the $\Stars$ |
607 top level. |
607 top level. |
608 |
608 |
609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, |
609 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, |
610 we have a lexer with the following recursive definition: |
610 we have a lexer with the following recursive definition: |
633 of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the |
633 of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the |
634 left to right, we build the derivatives $r_1$, $r_2$, \ldots according |
634 left to right, we build the derivatives $r_1$, $r_2$, \ldots according |
635 to the characters $c_0$, $c_1$ until we exhaust the string and obtain |
635 to the characters $c_0$, $c_1$ until we exhaust the string and obtain |
636 the derivative $r_n$. We test whether this derivative is |
636 the derivative $r_n$. We test whether this derivative is |
637 $\textit{nullable}$ or not. If not, we know the string does not match |
637 $\textit{nullable}$ or not. If not, we know the string does not match |
638 $r$ and no value needs to be generated. If yes, we start building the |
638 $r$, and no value needs to be generated. If yes, we start building the |
639 values incrementally by \emph{injecting} back the characters into the |
639 values incrementally by \emph{injecting} back the characters into the |
640 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
640 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
641 algorithm from the right to left. For the first value $v_n$, we call the |
641 algorithm from right to left. For the first value $v_n$, we call the |
642 function $\textit{mkeps}$, which builds a POSIX lexical value |
642 function $\textit{mkeps}$, which builds a POSIX lexical value |
643 for how the empty string has been matched by the (nullable) regular |
643 for how the empty string has been matched by the (nullable) regular |
644 expression $r_n$. This function is defined as |
644 expression $r_n$. This function is defined as |
645 |
645 |
646 |
646 |
647 |
647 |
648 We have mentioned before that derivatives without simplification |
648 We have mentioned before that derivatives without simplification |
649 can get clumsy, and this is true for values as well--they reflect |
649 can get clumsy, and this is true for values as well--they reflect |
650 the regular expressions size by definition. |
650 the size of the regular expression by definition. |
651 |
651 |
652 One can introduce simplification on the regex and values, but have to |
652 One can introduce simplification on the regex and values but have to |
653 be careful in not breaking the correctness as the injection |
653 be careful not to break the correctness, as the injection |
654 function heavily relies on the structure of the regexes and values |
654 function heavily relies on the structure of the regexes and values |
655 being correct and match each other. |
655 being correct and matching each other. |
656 It can be achieved by recording some extra rectification functions |
656 It can be achieved by recording some extra rectification functions |
657 during the derivatives step, and applying these rectifications in |
657 during the derivatives step, and applying these rectifications in |
658 each run during the injection phase. |
658 each run during the injection phase. |
659 And we can prove that the POSIX value of how |
659 And we can prove that the POSIX value of how |
660 regular expressions match strings will not be affected---although is much harder |
660 regular expressions match strings will not be affected---although it is much harder |
661 to establish. |
661 to establish. |
662 Some initial results in this regard have been |
662 Some initial results in this regard have been |
663 obtained in \cite{AusafDyckhoffUrban2016}. |
663 obtained in \cite{AusafDyckhoffUrban2016}. |
664 |
664 |
665 |
665 |
666 |
666 |
667 %Brzozowski, after giving the derivatives and simplification, |
667 %Brzozowski, after giving the derivatives and simplification, |
668 %did not explore lexing with simplification or he may well be |
668 %did not explore lexing with simplification, or he may well be |
669 %stuck on an efficient simplificaiton with a proof. |
669 %stuck on an efficient simplification with proof. |
670 %He went on to explore the use of derivatives together with |
670 %He went on to examine the use of derivatives together with |
671 %automaton, and did not try lexing using derivatives. |
671 %automaton, and did not try lexing using products. |
672 |
672 |
673 We want to get rid of complex and fragile rectification of values. |
673 We want to get rid of the complex and fragile rectification of values. |
674 Can we not create those intermediate values $v_1,\ldots v_n$, |
674 Can we not create those intermediate values $v_1,\ldots v_n$, |
675 and get the lexing information that should be already there while |
675 and get the lexing information that should be already there while |
676 doing derivatives in one pass, without a second phase of injection? |
676 doing derivatives in one pass, without a second injection phase? |
677 In the meantime, can we make sure that simplifications |
677 In the meantime, can we make sure that simplifications |
678 are easily handled without breaking the correctness of the algorithm? |
678 are easily handled without breaking the correctness of the algorithm? |
679 |
679 |
680 Sulzmann and Lu solved this problem by |
680 Sulzmann and Lu solved this problem by |
681 introducing additional informtaion to the |
681 introducing additional information to the |
682 regular expressions called \emph{bitcodes}. |
682 regular expressions called \emph{bitcodes}. |
683 |
683 |
684 |
684 |
685 |
685 |
686 |
686 |