changeset 77:058133a9ffe0 (parent 76:f575cf219377, child 79:481c8000de6d)
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this exponential behaviour. But unfortunately, such regular
expressions are not just a few outliers. They are actually frequent
enough to have a separate name coined for them---\emph{evil regular
expressions}. In empirical work, Davis et al.~report that they have
found thousands of such evil regular expressions in the JavaScript and
Python ecosystems \cite{Davis18}.
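To make the problem tangible, a small experiment along the following
lines reproduces the blow-up with the backtracking matcher from Java's
\texttt{java.util.regex} library (which Scala can call directly). This
is only a sketch: the object and variable names are our own and the
exact timings depend on the JVM and the version of the regex engine.

\begin{verbatim}
// Sketch: timing a backtracking matcher on the "evil" regex (a*)*b.
// The inputs consist of a's only, so no match exists and the engine
// has to backtrack through a huge number of ways of splitting the a's.
import java.util.regex.Pattern

object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 5 to 30 by 5) {
      val input  = "a" * n
      val start  = System.nanoTime()
      val result = evil.matcher(input).matches()
      val secs   = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   matches = $result%-5s   time = $secs%.3f s")
    }
  }
}
\end{verbatim}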
As a result, the regular expression matching engine needed to backtrack
over many choices. In this example, the time needed to process the
string is not exactly the classical exponential case, but rather
$O(n^2)$ with respect to the string length. But this was enough for the
home page of Stack Exchange to respond too slowly to the load balancer,
which concluded that there must be some attack and therefore stopped
the servers from responding to requests. This made the whole site
unavailable. Another very recent example is a global outage of all
Cloudflare servers on 2 July 2019. A poorly written regular expression
exhibited exponential behaviour and exhausted the CPUs that serve HTTP
traffic. Although the outage had several causes, at the heart was a
regular expression that was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by classical automata
theory, and in this more general setting there are quite a few research
questions still unanswered and fast algorithms still need to be
developed (for example, how to treat bounded repetitions, negation and
back-references efficiently).

%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}. Consider for example a regular expression
\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

Our focus in this project is on an algorithm introduced by Sulzmann and
Lu in 2014 for regular expression matching according to the POSIX
strategy \cite{Sulzmann2014}. Their algorithm is based on an older
algorithm by Brzozowski from 1964 where he introduced the notion of
derivatives of regular expressions~\cite{Brzozowski1964}. We shall
briefly explain this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

Suppose (basic) regular expressions are given by the following grammar:
an $s = c_0 \ldots c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

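Although the derivative operation and $\textit{nullable}$ are given above
in mathematical notation, the entire matcher fits into a few lines of
Scala. The following is only a sketch and the constructor names are our
own choice for this illustration:

\begin{verbatim}
// Brzozowski's matcher as a Scala sketch.
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}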
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also a \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the
children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$.

To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative nested inside the right
alternative. This suggests recording this matching as
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from left to
right, we build the derivatives $r_1$, $r_2$, \ldots{} according to the
characters $c_0$, $c_1$, \ldots{} until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the parse tree for how the empty
string has been matched by the (nullable) regular expression $r_n$. This
function is defined as

\begin{center}
\begin{tabular}{lcl}
  $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
  $\mkeps(r_{1}+r_{2})$ & $\dn$

regular expression on the left-hand side. This will become
important later on.

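In implementations, the $\textit{mkeps}$ function can be written directly
over a datatype of values. The following Scala sketch reuses the
\texttt{Rexp} datatype from the earlier sketch; the constructor names
\texttt{Lf} and \texttt{Rg} stand for the $\Left$ and $\Right$ values of
the table above and are renamed only to avoid a clash with Scala's
built-in \texttt{Left}/\texttt{Right}:

\begin{verbatim}
// Values and mkeps as a Scala sketch; mkeps prefers the left alternative
// whenever it is nullable, which is the property that becomes important later.
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lf(v: Val) extends Val
case class Rg(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Lf(mkeps(r1)) else Rg(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}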
After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous parse tree $v_{i+1}$.
After injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
context free grammars and then adapted by Henglein and Nielson for
efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
Sulzmann and Lu took this idea of bitcodes a step further by integrating
bitcodes into derivatives. The reason why we want to use bitcodes in
this project is that we want to introduce more aggressive
simplification rules in order to keep the size of derivatives small
throughout. This is because the main drawback of building successive
derivatives according to Brzozowski's definition is that they can grow
very quickly in size. This is mainly due to the fact that the derivative
operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives. As a result, if implemented naively both algorithms by
Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example,
when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative the
slower the algorithm.

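This blow-up is easy to observe experimentally. The following sketch
reuses the \texttt{Rexp} datatype and the functions \texttt{der} and
\texttt{ders} from the earlier sketch; the node-counting function
\texttt{size} is our own addition for this illustration:

\begin{verbatim}
// Counting the nodes of successive derivatives of (a + aa)* w.r.t. a.
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val evil = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
for (n <- 0 to 12)
  println(s"$n derivatives: size = ${size(ders(List.fill(n)('a'), evil))}")
\end{verbatim}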
Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

Unfortunately, the simplification rules outlined above are not
sufficient to prevent a size explosion in all cases. We
believe a tighter bound can be achieved that prevents an explosion in
\emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
proved that (partial) derivatives can be bounded by the number of
characters contained in the initial regular expression
\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
expressions as follows:

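In Scala, Antimirov's partial derivative operation can be sketched as a
function returning a \emph{set} of regular expressions (again reusing
the \texttt{Rexp} datatype and \texttt{nullable} from the earlier
sketch; this rendering is our own, not the original notation):

\begin{verbatim}
// Antimirov's partial derivatives as a Scala sketch.
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    if (nullable(r1)) pder(c, r1).map(SEQ(_, r2)) ++ pder(c, r2)
    else pder(c, r1).map(SEQ(_, r2))
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}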
\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
concatenation of them. Antimirov proved a tight bound on the size of
\emph{all} partial derivatives, no matter what the string looks like.
Roughly speaking, the size will be at most quadruple the size of the regular
expression. If we want the size of derivatives in Sulzmann and Lu's
algorithm to stay equal or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as delete duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a \cdot
c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a
+\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification
rules helps us to achieve the same size bound as that of the partial
derivatives. In order to implement the idea of ``spilling out alternatives''
and to make them compatible with the $\textit{inj}$-mechanism, we use \emph{bitcodes}.
Bits and bitcodes (lists of bits) are just:
%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.


\begin{center}
$b ::= S \mid Z \qquad
bs ::= [] \mid b:bs
$
\end{center}

\noindent
The names $S$ and $Z$ here are quite arbitrary in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
     code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\S$, and the border where a local star terminates
into $\Z$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:

%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$ & $\textit{ZERO}$\\
  & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
  & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}
%(in \textit{ALT})

\noindent
where $bs$ stands for bitcodes, and $a$ for $\bold{a}$nnotated regular
expressions. We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.

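In Scala, annotated regular expressions can be represented by a datatype
along the following lines. This is only a sketch with names of our own
choosing; anticipating the list-based $\textit{ALTS}$ constructor used
later for the simplification, we give the alternative constructor a list
of children right away:

\begin{verbatim}
// Bits and annotated regular expressions as a Scala sketch.
abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// An alternative with exactly two children is the special case ALT.
def AALT(bs: List[Bit], a1: ARexp, a2: ARexp): ARexp = AALTS(bs, List(a1, a2))
\end{verbatim}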
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
     $\textit{ONE}\,(bs\,@\,bs')$\\
  $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
     $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
  $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}

\noindent
After internalisation we do successive derivative operations on the
annotated regular expressions. This derivative operation is the same as
what we previously have for the simple regular expressions, except that
we need to take special care of the bitcodes:\comment{You need to be consistent with ALTS and ALT; ALT is just
an abbreviation; derivations and so on are defined for ALTS}

%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
        $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\,[\Z]\,(a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
\end{tabular}
\end{center}
%\end{definition}

\noindent
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. Also the $\textit{SEQ}$ clause
is more subtle: when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition), the derivative splits into two alternatives. Assume that
$\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
then the right branch of the $\textit{ALT}$, which is $\textit{fuse} \;
(\textit{bmkeps} \; a_1)\,(a_2 \backslash c)$, will collapse the regular
expression $a_1$ (as it has already been fully matched) and store the
parsing information at the head of the regular expression
$a_2 \backslash c$ by fusing it there. The bitcode $bs$, which was
initially attached to the head of $\textit{SEQ}$, has now been elevated
to the top-level of the $\textit{ALT}$, as this information will be
needed whichever way the $\textit{SEQ}$ is matched---no matter whether
$c$ belongs to $a_1$ or $a_2$. After building these derivatives and
maintaining all the lexing information, we complete the lexing by
collecting the bitcodes using a generalised version of the
$\textit{mkeps}$ function for annotated regular expressions, called
$\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
     $bs \,@\, [\S]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and using $S$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for
un-annotated regular expressions).

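Putting the definitions of this section together, the bitcoded part of
the lexer can be sketched in Scala as follows. This reuses the annotated
datatype from the earlier sketch; the function names are our own, and
$\textit{decode}$ and $\textit{internalise}$ are omitted since they were
already given above:

\begin{verbatim}
// fuse, bnullable, bmkeps and the bitcoded derivative as a Scala sketch.
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

def bmkeps(a: ARexp): List[Bit] = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) => bs ++ bmkeps(as.find(bnullable).get) // first nullable child
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(S)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALT(bs, ASEQ(Nil, bder(c, a1), a2), fuse(bmkeps(a1), bder(c, a2)))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) => ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}

// Derivative w.r.t. a string, as used in blexer above.
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}
\end{verbatim}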

The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
in order to keep derivatives small. We have developed such
``aggressive'' simplification rules and generated test data that show
that the expected bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the \textit{ALTS} constructor. This allows us not just to
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on.
Is it $ALT$ or $ALTS$?}\\

\begin{center}
\begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (0, \_) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (\_, 0) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
   &&$\quad\textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; \textit{simp} \; as)) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow 0$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse} \; bs \; a$ \\
   &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
  $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification performs a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested $\textit{ALTS}$.
Its recursive definition is given below:

\begin{center}
\begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\
  $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. What it does is
basically remove parentheses, for example changing $a+(b+c)$ into $a+b+c$.

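In Scala, the simplification and $\textit{flatten}$ are roughly along the
following lines (a sketch only, reusing the annotated datatype and
$\textit{fuse}$ from above; note that the built-in \texttt{distinct} on
Scala lists keeps exactly the first occurrence of every element, and that
the sequence case where the second component becomes $\textit{ONE}$ is
omitted here, since its bits cannot simply be fused to the front):

\begin{verbatim}
// flatten opens up nested alternatives and removes ZEROs;
// simp applies the simplification rules after the children are simplified.
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // keep the bits of the ONE
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}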
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
we obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small, for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
are reduced to just 6 and stay constant, no matter how long the
input string is.

\section{Current Work}

We are currently engaged in two tasks related to this algorithm. The
first task is to prove that our simplification rules do not affect the
POSIX value that should be generated by the algorithm according to the
specification of a POSIX value, and furthermore to obtain a much tighter
bound on the sizes of derivatives. The result is that our algorithm
should be correct and faster on all inputs. The original blow-up, as
observed in JavaScript, Python and Java, would be excluded from
happening in our algorithm. For this proof we use the theorem prover
Isabelle. Once completed, this result will advance the state-of-the-art:
Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
bitcoded ``incremental parsing method'' (that is the lexing algorithm
outlined in this section):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method in Figure~5 in combination with the simplification steps in
Figure 6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}

\noindent
We would settle this correctness claim. It is relatively straightforward
to establish that after one simplification step, the part of a nullable
derivative that corresponds to a POSIX value remains intact and can
still be collected, in other words, we can show that\comment{Double-check....I think this is not the case}

\begin{center}
$\textit{bmkeps} \; r = \textit{bmkeps} \; (\textit{simp} \; r) \qquad (r\; \textit{nullable})$
\end{center}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
$\textit{bmkeps} \; \textit{blexer}\_{simp} \; r = \textit{bmkeps} \; \textit{blexer} \; \textit{simp} \; r$
\end{center}

\noindent\comment{OK from here on you still need to work. Did not read.}
That is, if we take derivatives of a regular expression $r$ and of its
simplified version, they can still provide the same POSIX value if there
is one. This is not as straightforward as the previous proposition, as
the two regular expressions $r$ and $\textit{simp}\; r$ might become very
different regular expressions after repeated application of
$\textit{simp}$ and derivative. The crucial point is to find the
indispensable information of a regular expression and how it is kept
intact during simplification so that it performs as well as a regular
expression that has not been simplified in the subsequent derivative
operations. To aid this, we use the helper function $\textit{retrieve}$
described by Sulzmann and Lu:
\\definition of retrieve\\

This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet appeared, but are stored in the value).
Sulzmann and Lu used this to connect the bitcoded algorithm to the older
algorithm by the following equation:

\begin{center}
$\textit{inj} \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$
\end{center}

A little fact that needs to be stated to help comprehension:
\begin{center}