39 |
39 |
40 \def\AZERO{\textit{AZERO}} |
40 \def\AZERO{\textit{AZERO}} |
41 \def\AONE{\textit{AONE}} |
41 \def\AONE{\textit{AONE}} |
42 \def\ACHAR{\textit{ACHAR}} |
42 \def\ACHAR{\textit{ACHAR}} |
43 |
43 |
44 |
44 \def\POSIX{\textit{POSIX}} |
45 \def\ALTS{\textit{ALTS}} |
45 \def\ALTS{\textit{ALTS}} |
46 \def\ASTAR{\textit{ASTAR}} |
46 \def\ASTAR{\textit{ASTAR}} |
47 \def\DFA{\textit{DFA}} |
47 \def\DFA{\textit{DFA}} |
48 \def\bmkeps{\textit{bmkeps}} |
48 \def\bmkeps{\textit{bmkeps}} |
49 \def\retrieve{\textit{retrieve}} |
49 \def\retrieve{\textit{retrieve}} |
50 \def\blexer{\textit{blexer}} |
50 \def\blexer{\textit{blexer}} |
51 \def\flex{\textit{flex}} |
51 \def\flex{\textit{flex}} |
52 \def\inj{\mathit{inj}} |
52 \def\inj{\mathit{inj}} |
53 \def\Empty{\mathit{Empty}} |
53 \def\Empty{\mathit{Empty}} |
54 \def\Left{\mathit{Left}}xc |
54 \def\Left{\mathit{Left}} |
55 \def\Right{\mathit{Right}} |
55 \def\Right{\mathit{Right}} |
56 \def\Stars{\mathit{Stars}} |
56 \def\Stars{\mathit{Stars}} |
57 \def\Char{\mathit{Char}} |
57 \def\Char{\mathit{Char}} |
58 \def\Seq{\mathit{Seq}} |
58 \def\Seq{\mathit{Seq}} |
59 \def\Der{\mathit{Der}} |
59 \def\Der{\mathit{Der}} |
98 |
100 |
99 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
101 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
100 |
102 |
101 |
103 |
102 Regular expressions are widely used in computer science: |
104 Regular expressions are widely used in computer science: |
103 be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, |
105 be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, |
104 command line tools like $\mathit{grep}$ that facilitates easy |
106 command-line tools like $\mathit{grep}$ that facilitate easy |
105 text processing , network intrusion |
107 text processing, network intrusion |
106 detection systems that rejects suspicious traffic, or compiler |
108 detection systems that reject suspicious traffic, or compiler |
107 front ends--the majority of the solutions to these tasks |
109 front ends--the majority of the solutions to these tasks |
108 involve lexing with regular |
110 involve lexing with regular |
109 expressions. |
111 expressions. |
110 |
|
111 Given its usefulness and ubiquity, one would imagine that |
112 Given its usefulness and ubiquity, one would imagine that |
112 modern regular expression matching implementations |
113 modern regular expression matching implementations |
113 are mature and fully-studied. |
114 are mature and fully studied. |
114 |
115 Indeed, in a popular programming language' regex engine, |
115 If you go to a popular programming language's |
116 supplying it with regular expressions and strings, one can |
116 regex engine, |
117 get rich matching information in a very short time. |
117 you can supply it with regex and strings of your own, |
118 Some network intrusion detection systems |
118 and get matching/lexing information such as how a |
119 use regex engines that are able to process |
119 sub-part of the regex matches a sub-part of the string. |
120 megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}. |
120 These lexing libraries are on average quite fast. |
121 Unfortunately, this is not the case for $\mathbf{all}$ inputs. |
121 %TODO: get source for SNORT/BRO's regex matching engine/speed |
122 %TODO: get source for SNORT/BRO's regex matching engine/speed |
122 For example, the regex engines some network intrusion detection |
|
123 systems use are able to process |
|
124 megabytes or even gigabytes of network traffic per second. |
|
125 |
|
126 Why do we need to have our version, if the algorithms are |
|
127 blindingly fast already? Well it turns out it is not always the case. |
|
128 |
123 |
129 |
124 |
130 Take $(a^*)^*\,b$ and ask whether |
125 Take $(a^*)^*\,b$ and ask whether |
131 strings of the form $aa..a$ match this regular |
126 strings of the form $aa..a$ match this regular |
132 expression. Obviously this is not the case---the expected $b$ in the last |
127 expression. Obviously this is not the case---the expected $b$ in the last |
133 position is missing. One would expect that modern regular expression |
128 position is missing. One would expect that modern regular expression |
134 matching engines can find this out very quickly. Alas, if one tries |
129 matching engines can find this out very quickly. Alas, if one tries |
135 this example in JavaScript, Python or Java 8 with strings like 28 |
130 this example in JavaScript, Python or Java 8, even with strings of a small |
136 $a$'s, one discovers that this decision takes around 30 seconds and |
131 length, say around 30 $a$'s, one discovers that |
137 takes considerably longer when adding a few more $a$'s, as the graphs |
132 this decision takes crazy time to finish given the simplicity of the problem. |
|
133 This is clearly exponential behaviour, and |
|
134 is triggered by some relatively simple regex patterns, as the graphs |
138 below show: |
135 below show: |
139 |
136 |
140 \begin{center} |
137 \begin{figure} |
|
138 \centering |
141 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
139 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
142 \begin{tikzpicture} |
140 \begin{tikzpicture} |
143 \begin{axis}[ |
141 \begin{axis}[ |
144 xlabel={$n$}, |
142 xlabel={$n$}, |
145 x label style={at={(1.05,-0.05)}}, |
143 x label style={at={(1.05,-0.05)}}, |
226 the string was $O(n^2)$ with respect to the string length. This |
223 the string was $O(n^2)$ with respect to the string length. This |
227 quadratic overhead was enough for the homepage of Stack Exchange to |
224 quadratic overhead was enough for the homepage of Stack Exchange to |
228 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
225 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
229 attack and therefore stopped the servers from responding to any |
226 attack and therefore stopped the servers from responding to any |
230 requests. This made the whole site become unavailable. |
227 requests. This made the whole site become unavailable. |
231 A more |
228 A more recent example is a global outage of all Cloudflare servers on 2 July |
232 recent example is a global outage of all Cloudflare servers on 2 July |
|
233 2019. A poorly written regular expression exhibited exponential |
229 2019. A poorly written regular expression exhibited exponential |
234 behaviour and exhausted CPUs that serve HTTP traffic. Although the |
230 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage |
235 outage had several causes, at the heart was a regular expression that |
231 had several causes, at the heart was a regular expression that |
236 was used to monitor network |
232 was used to monitor network |
237 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}} |
233 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}} |
238 %TODO: data points for some new versions of languages |
234 %TODO: data points for some new versions of languages |
239 These problems with regular expressions |
235 These problems with regular expressions |
240 are not isolated events that happen |
236 are not isolated events that happen |
241 very occasionally, but actually quite widespread. |
237 very occasionally, but actually widespread. |
242 They occur so often that they get a |
238 They occur so often that they get a |
243 name--Regular-Expression-Denial-Of-Service (ReDoS) |
239 name--Regular-Expression-Denial-Of-Service (ReDoS) |
244 attack. |
240 attack. |
245 Davis et al. \parencite{Davis18} detected more |
241 Davis et al. \parencite{Davis18} detected more |
246 than 1000 super-linear (SL) regular expressions |
242 than 1000 super-linear (SL) regular expressions |
248 They therefore concluded that evil regular expressions |
244 They therefore concluded that evil regular expressions |
249 are problems more than "a parlour trick", but one that |
245 are problems more than "a parlour trick", but one that |
250 requires |
246 requires |
251 more research attention. |
247 more research attention. |
252 |
248 |
253 \section{Why are current algorithm for regexes slow?} |
249 \section{The Problem Behind Slow Cases} |
254 |
250 |
255 %find literature/find out for yourself that REGEX->DFA on basic regexes |
251 %find literature/find out for yourself that REGEX->DFA on basic regexes |
256 %does not blow up the size |
252 %does not blow up the size |
257 Shouldn't regular expression matching be linear? |
253 Shouldn't regular expression matching be linear? |
258 How can one explain the super-linear behaviour of the |
254 How can one explain the super-linear behaviour of the |
305 that can be compiled and run. |
301 that can be compiled and run. |
306 The good things about $\mathit{DFA}$s is that once |
302 The good things about $\mathit{DFA}$s is that once |
307 generated, they are fast and stable, unlike |
303 generated, they are fast and stable, unlike |
308 backtracking algorithms. |
304 backtracking algorithms. |
309 However, they do not scale well with bounded repetitions.\\ |
305 However, they do not scale well with bounded repetitions.\\ |
310 |
|
311 |
|
312 \begin{center} |
|
313 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
314 \begin{tikzpicture} |
|
315 \begin{axis}[ |
|
316 ymode=log, |
|
317 xlabel={$n$}, |
|
318 x label style={at={(1.05,-0.05)}}, |
|
319 ylabel={time in secs}, |
|
320 enlargelimits=false, |
|
321 xtick={1,2,...,8}, |
|
322 xmax=9, |
|
323 ymax=16000000, |
|
324 ytick={0,500,...,3500}, |
|
325 scaled ticks=false, |
|
326 axis lines=left, |
|
327 width=5cm, |
|
328 height=4cm, |
|
329 legend entries={JavaScript}, |
|
330 legend pos=north west, |
|
331 legend cell align=left] |
|
332 \addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data}; |
|
333 \end{axis} |
|
334 \end{tikzpicture} |
|
335 \end{tabular} |
|
336 \end{center} |
|
337 |
|
338 Another graph: |
|
339 \begin{center} |
|
340 \begin{tikzpicture} |
|
341 \begin{axis}[ |
|
342 height=0.5\textwidth, |
|
343 width=\textwidth, |
|
344 xlabel=number of a's, |
|
345 xtick={0,...,9}, |
|
346 ylabel=maximum size, |
|
347 ymode=log, |
|
348 log basis y={2} |
|
349 ] |
|
350 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
351 \end{axis} |
|
352 \end{tikzpicture} |
|
353 \end{center} |
|
354 |
306 |
355 |
307 |
356 |
308 |
357 |
309 |
358 Bounded repetitions, usually written in the form |
310 Bounded repetitions, usually written in the form |
467 \end{center} |
419 \end{center} |
468 The concrete example |
420 The concrete example |
469 $((a|b|c|\ldots|z)^*)\backslash 1$ |
421 $((a|b|c|\ldots|z)^*)\backslash 1$ |
470 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\ |
422 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\ |
471 Back-reference is a construct in the "regex" standard |
423 Back-reference is a construct in the "regex" standard |
472 that programmers found quite useful, but not exactly |
424 that programmers found useful, but not exactly |
473 regular any more. |
425 regular any more. |
474 In fact, that allows the regex construct to express |
426 In fact, that allows the regex construct to express |
475 languages that cannot be contained in context-free |
427 languages that cannot be contained in context-free |
476 languages either. |
428 languages either. |
477 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
429 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
504 %TODO: verify the fact Rust does not allow 1000+ reps |
456 %TODO: verify the fact Rust does not allow 1000+ reps |
505 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs? |
457 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs? |
506 \section{Buggy Regex Engines} |
458 \section{Buggy Regex Engines} |
507 |
459 |
508 |
460 |
509 Another thing about the these libraries is that there |
461 Another thing about these libraries is that there |
510 is no correctness guarantee. |
462 is no correctness guarantee. |
511 In some cases they either fails to generate a lexing result when there is a match, |
463 In some cases, they either fail to generate a lexing result when there exists a match, |
512 or gives the wrong way of matching. |
464 or give the wrong way of matching. |
513 |
465 |
514 |
466 |
515 It turns out that regex libraries not only suffer from |
467 It turns out that regex libraries not only suffer from |
516 exponential backtracking problems, |
468 exponential backtracking problems, |
517 but also undesired (or even buggy) outputs. |
469 but also undesired (or even buggy) outputs. |
555 %\section{How people solve problems with regexes} |
507 %\section{How people solve problems with regexes} |
556 |
508 |
557 |
509 |
558 When a regular expression does not behave as intended, |
510 When a regular expression does not behave as intended, |
559 people usually try to rewrite the regex to some equivalent form |
511 people usually try to rewrite the regex to some equivalent form |
560 or they try to avoid the possibly problematic patterns completely\parencite{Davis18}, |
512 or they try to avoid the possibly problematic patterns completely, |
561 of which there are many false positives. |
513 for which many false positives exist\parencite{Davis18}. |
562 Animated tools to "debug" regular expressions |
514 Animated tools to "debug" regular expressions |
563 are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} |
515 are also popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} |
564 to name a few. |
516 to name a few. |
565 There is also static analysis work on regular expressions that |
517 We are also aware of static analysis work on regular expressions that |
566 aims to detect potentially expoential regex patterns. Rathnayake and Thielecke |
518 aims to detect potentially expoential regex patterns. Rathnayake and Thielecke |
567 \parencite{Rathnayake2014StaticAF} proposed an algorithm |
519 \parencite{Rathnayake2014StaticAF} proposed an algorithm |
568 that detects regular expressions triggering exponential |
520 that detects regular expressions triggering exponential |
569 behavious on backtracking matchers. |
521 behavious on backtracking matchers. |
570 Weideman \parencite{Weideman2017Static} came up with |
522 Weideman \parencite{Weideman2017Static} came up with |
625 Antimirov. |
577 Antimirov. |
626 |
578 |
627 |
579 |
628 The main contribution of this thesis is a proven correct lexing algorithm |
580 The main contribution of this thesis is a proven correct lexing algorithm |
629 with formalized time bounds. |
581 with formalized time bounds. |
630 To our best knowledge, there is no lexing libraries using Brzozowski derivatives |
582 To our best knowledge, no lexing libraries using Brzozowski derivatives |
631 that have a provable time guarantee, |
583 have a provable time guarantee, |
632 and claims about running time are usually speculative and backed by thin empirical |
584 and claims about running time are usually speculative and backed by thin empirical |
633 evidence. |
585 evidence. |
634 %TODO: give references |
586 %TODO: give references |
635 For example, Sulzmann and Lu had proposed an algorithm in which they |
587 For example, Sulzmann and Lu had proposed an algorithm in which they |
636 claim a linear running time. |
588 claim a linear running time. |
694 strings without the head character $c$. |
646 strings without the head character $c$. |
695 |
647 |
696 Formally, we define first such a transformation on any string set, which |
648 Formally, we define first such a transformation on any string set, which |
697 we call semantic derivative: |
649 we call semantic derivative: |
698 \begin{center} |
650 \begin{center} |
699 $\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$ |
651 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ |
700 \end{center} |
652 \end{center} |
701 Mathematically, it can be expressed as the |
653 Mathematically, it can be expressed as the |
702 |
654 |
703 If the $\textit{StringSet}$ happen to have some structure, for example, |
655 If the $\textit{StringSet}$ happen to have some structure, for example, |
704 if it is regular, then we have that it |
656 if it is regular, then we have that it |
786 a new regular expression after all the string it contains |
738 a new regular expression after all the string it contains |
787 is chopped off a certain head character $c$. |
739 is chopped off a certain head character $c$. |
788 The most involved cases are the sequence |
740 The most involved cases are the sequence |
789 and star case. |
741 and star case. |
790 The sequence case says that if the first regular expression |
742 The sequence case says that if the first regular expression |
791 contains an empty string then second component of the sequence |
743 contains an empty string then the second component of the sequence |
792 might be chosen as the target regular expression to be chopped |
744 might be chosen as the target regular expression to be chopped |
793 off its head character. |
745 off its head character. |
794 The star regular expression unwraps the iteration of |
746 The star regular expression's derivative unwraps the iteration of |
795 regular expression and attack the star regular expression |
747 regular expression and attaches the star regular expression |
796 to its back again to make sure there are 0 or more iterations |
748 to the sequence's second element to make sure a copy is retained |
797 following this unfolded iteration. |
749 for possible more iterations in later phases of lexing. |
798 |
750 |
799 |
751 |
800 The main property of the derivative operation |
752 The main property of the derivative operation |
801 that enables us to reason about the correctness of |
753 that enables us to reason about the correctness of |
802 an algorithm using derivatives is |
754 an algorithm using derivatives is |
842 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
794 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
843 |
795 |
844 Beautiful and simple definition. |
796 Beautiful and simple definition. |
845 |
797 |
846 If we implement the above algorithm naively, however, |
798 If we implement the above algorithm naively, however, |
847 the algorithm can be excruciatingly slow. For example, when starting with the regular |
799 the algorithm can be excruciatingly slow. |
848 expression $(a + aa)^*$ and building 12 successive derivatives |
800 |
|
801 |
|
802 \begin{figure} |
|
803 \centering |
|
804 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
805 \begin{tikzpicture} |
|
806 \begin{axis}[ |
|
807 xlabel={$n$}, |
|
808 x label style={at={(1.05,-0.05)}}, |
|
809 ylabel={time in secs}, |
|
810 enlargelimits=false, |
|
811 xtick={0,5,...,30}, |
|
812 xmax=33, |
|
813 ymax=10000, |
|
814 ytick={0,1000,...,10000}, |
|
815 scaled ticks=false, |
|
816 axis lines=left, |
|
817 width=5cm, |
|
818 height=4cm, |
|
819 legend entries={JavaScript}, |
|
820 legend pos=north west, |
|
821 legend cell align=left] |
|
822 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; |
|
823 \end{axis} |
|
824 \end{tikzpicture}\\ |
|
825 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
826 of the form $\underbrace{aa..a}_{n}$.} |
|
827 \end{tabular} |
|
828 \caption{EightThousandNodes} \label{fig:EightThousandNodes} |
|
829 \end{figure} |
|
830 |
|
831 |
|
832 (8000 node data to be added here) |
|
833 For example, when starting with the regular |
|
834 expression $(a + aa)^*$ and building a few successive derivatives (around 10) |
849 w.r.t.~the character $a$, one obtains a derivative regular expression |
835 w.r.t.~the character $a$, one obtains a derivative regular expression |
850 with more than 8000 nodes (when viewed as a tree). Operations like |
836 with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}. |
|
837 The reason why $(a + aa) ^*$ explodes so drastically is that without |
|
838 pruning, the algorithm will keep records of all possible ways of matching: |
|
839 \begin{center} |
|
840 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ |
|
841 \end{center} |
|
842 |
|
843 \noindent |
|
844 Each of the above alternative branches correspond to the match |
|
845 $aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete). |
|
846 These different ways of matching will grow exponentially with the string length, |
|
847 and without simplifications that throw away some of these very similar matchings, |
|
848 it is no surprise that these expressions grow so quickly. |
|
849 Operations like |
851 $\backslash$ and $\nullable$ need to traverse such trees and |
850 $\backslash$ and $\nullable$ need to traverse such trees and |
852 consequently the bigger the size of the derivative the slower the |
851 consequently the bigger the size of the derivative the slower the |
853 algorithm. |
852 algorithm. |
854 |
853 |
855 Brzozowski was quick in finding that during this process a lot useless |
854 Brzozowski was quick in finding that during this process a lot useless |
947 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
946 Take the example where $r= (a^*\cdot a^*)^*$ and the string |
948 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
947 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
949 The number of different ways of matching |
948 The number of different ways of matching |
950 without allowing any value under a star to be flattened |
949 without allowing any value under a star to be flattened |
951 to an empty string can be given by the following formula: |
950 to an empty string can be given by the following formula: |
952 \begin{center} |
951 \begin{equation} |
953 $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$ |
952 C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} |
954 \end{center} |
953 \end{equation} |
955 and a closed form formula can be calculated to be |
954 and a closed form formula can be calculated to be |
956 \begin{equation} |
955 \begin{equation} |
957 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
956 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
958 \end{equation} |
957 \end{equation} |
959 which is clearly in exponential order. |
958 which is clearly in exponential order. |
960 |
959 |
961 A lexer aimed at getting all the possible values has an exponential |
960 A lexer aimed at getting all the possible values has an exponential |
962 worst case runtime. Therefore it is impractical to try to generate |
961 worst case runtime. Therefore it is impractical to try to generate |
963 all possible matches in a run. In practice, we are usually |
962 all possible matches in a run. In practice, we are usually |
964 interested about POSIX values, which by intuition always |
963 interested about POSIX values, which by intuition always |
965 match the leftmost regular expression when there is a choice |
964 \begin{itemize} |
966 and always match a sub part as much as possible before proceeding |
965 \item |
967 to the next token. For example, the above example has the POSIX value |
966 match the leftmost regular expression when multiple options of matching |
|
967 are available |
|
968 \item |
|
969 always match a subpart as much as possible before proceeding |
|
970 to the next token. |
|
971 \end{itemize} |
|
972 |
|
973 |
|
974 For example, the above example has the POSIX value |
968 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
975 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
969 The output of an algorithm we want would be a POSIX matching |
976 The output of an algorithm we want would be a POSIX matching |
970 encoded as a value. |
977 encoded as a value. |
|
978 The reason why we are interested in $\POSIX$ values is that they can |
|
979 be practically used in the lexing phase of a compiler front end. |
|
980 For instance, when lexing a code snippet |
|
981 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
|
982 as an identifier rather than a keyword. |
|
983 |
971 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
984 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
972 algorithm by a second phase (the first phase being building successive |
985 algorithm by a second phase (the first phase being building successive |
973 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
986 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
974 is generated in case the regular expression matches the string. |
987 is generated in case the regular expression matches the string. |
975 Pictorially, the Sulzmann and Lu algorithm is as follows: |
988 Pictorially, the Sulzmann and Lu algorithm is as follows: |
1254 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
1267 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
1255 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
1268 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
1256 $\textit{if}\;c=d\; \;\textit{then}\; |
1269 $\textit{if}\;c=d\; \;\textit{then}\; |
1257 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
1270 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
1258 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
1271 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
1259 $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\ |
1272 $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ |
1260 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
1273 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
1261 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
1274 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
1262 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
1275 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
1263 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
1276 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
1264 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
1277 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
1314 %\end{definition} |
1327 %\end{definition} |
1315 \noindent |
1328 \noindent |
1316 For instance, when we do derivative of $_{bs}a^*$ with respect to c, |
1329 For instance, when we do derivative of $_{bs}a^*$ with respect to c, |
1317 we need to unfold it into a sequence, |
1330 we need to unfold it into a sequence, |
1318 and attach an additional bit $0$ to the front of $r \backslash c$ |
1331 and attach an additional bit $0$ to the front of $r \backslash c$ |
1319 to indicate that there is one more star iteration. Also the sequence clause |
1332 to indicate one more star iteration. Also the sequence clause |
1320 is more subtle---when $a_1$ is $\textit{bnullable}$ (here |
1333 is more subtle---when $a_1$ is $\textit{bnullable}$ (here |
1321 \textit{bnullable} is exactly the same as $\textit{nullable}$, except |
1334 \textit{bnullable} is exactly the same as $\textit{nullable}$, except |
1322 that it is for annotated regular expressions, therefore we omit the |
1335 that it is for annotated regular expressions, therefore we omit the |
1323 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how |
1336 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how |
1324 $a_1$ matches the string prior to character $c$ (more on this later), |
1337 $a_1$ matches the string prior to character $c$ (more on this later), |
1373 \noindent |
1386 \noindent |
1374 In this definition $\_\backslash s$ is the generalisation of the derivative |
1387 In this definition $\_\backslash s$ is the generalisation of the derivative |
1375 operation from characters to strings (just like the derivatives for un-annotated |
1388 operation from characters to strings (just like the derivatives for un-annotated |
1376 regular expressions). |
1389 regular expressions). |
1377 |
1390 |
1378 Remember tha one of the important reasons we introduced bitcodes |
1391 Now we introduce the simplifications, which is why we introduce the |
1379 is that they can make simplification more structured and therefore guaranteeing |
1392 bitcodes in the first place. |
1380 the correctness. |
1393 |
1381 |
1394 \subsection*{Simplification Rules} |
1382 \subsection*{Our Simplification Rules} |
1395 |
1383 |
1396 This section introduces aggressive (in terms of size) simplification rules |
1384 In this section we introduce aggressive (in terms of size) simplification rules |
|
1385 on annotated regular expressions |
1397 on annotated regular expressions |
1386 in order to keep derivatives small. Such simplifications are promising |
1398 to keep derivatives small. Such simplifications are promising |
1387 as we have |
1399 as we have |
1388 generated test data that show |
1400 generated test data that show |
1389 that a good tight bound can be achieved. Obviously we could only |
1401 that a good tight bound can be achieved. We could only |
1390 partially cover the search space as there are infinitely many regular |
1402 partially cover the search space as there are infinitely many regular |
1391 expressions and strings. |
1403 expressions and strings. |
1392 |
1404 |
1393 One modification we introduced is to allow a list of annotated regular |
1405 One modification we introduced is to allow a list of annotated regular |
1394 expressions in the $\sum$ constructor. This allows us to not just |
1406 expressions in the $\sum$ constructor. This allows us to not just |
1395 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
1407 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
1424 \end{center} |
1436 \end{center} |
1425 |
1437 |
1426 \noindent |
1438 \noindent |
1427 The simplification does a pattern matching on the regular expression. |
1439 The simplification does a pattern matching on the regular expression. |
1428 When it detected that the regular expression is an alternative or |
1440 When it detected that the regular expression is an alternative or |
1429 sequence, it will try to simplify its children regular expressions |
1441 sequence, it will try to simplify its child regular expressions |
1430 recursively and then see if one of the children turn into $\ZERO$ or |
1442 recursively and then see if one of the children turns into $\ZERO$ or |
1431 $\ONE$, which might trigger further simplification at the current level. |
1443 $\ONE$, which might trigger further simplification at the current level. |
1432 The most involved part is the $\sum$ clause, where we use two |
1444 The most involved part is the $\sum$ clause, where we use two |
1433 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
1445 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
1434 alternatives and reduce as many duplicates as possible. Function |
1446 alternatives and reduce as many duplicates as possible. Function |
1435 $\textit{distinct}$ keeps the first occurring copy only and remove all later ones |
1447 $\textit{distinct}$ keeps the first occurring copy only and removes all later ones |
1436 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. |
1448 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. |
1437 Its recursive definition is given below: |
1449 Its recursive definition is given below: |
1438 |
1450 |
1439 \begin{center} |
1451 \begin{center} |
1440 \begin{tabular}{@{}lcl@{}} |
1452 \begin{tabular}{@{}lcl@{}} |
1489 succession) all the characters of the string matches the empty string, |
1501 succession) all the characters of the string matches the empty string, |
1490 then $r$ matches $s$ (and {\em vice versa}). |
1502 then $r$ matches $s$ (and {\em vice versa}). |
1491 |
1503 |
1492 |
1504 |
1493 |
1505 |
1494 However, there are two difficulties with derivative-based matchers: |
1506 However, two difficulties with derivative-based matchers exist: |
1495 First, Brzozowski's original matcher only generates a yes/no answer |
1507 First, Brzozowski's original matcher only generates a yes/no answer |
1496 for whether a regular expression matches a string or not. This is too |
1508 for whether a regular expression matches a string or not. This is too |
1497 little information in the context of lexing where separate tokens must |
1509 little information in the context of lexing where separate tokens must |
1498 be identified and also classified (for example as keywords |
1510 be identified and also classified (for example as keywords |
1499 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
1511 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
1552 of this version, but do not provide any supporting proof arguments, not |
1564 of this version, but do not provide any supporting proof arguments, not |
1553 even ``pencil-and-paper'' arguments. They write about their bitcoded |
1565 even ``pencil-and-paper'' arguments. They write about their bitcoded |
1554 \emph{incremental parsing method} (that is the algorithm to be formalised |
1566 \emph{incremental parsing method} (that is the algorithm to be formalised |
1555 in this paper): |
1567 in this paper): |
1556 |
1568 |
1557 |
1569 %motivation part |
1558 \begin{quote}\it |
1570 \begin{quote}\it |
1559 ``Correctness Claim: We further claim that the incremental parsing |
1571 ``Correctness Claim: We further claim that the incremental parsing |
1560 method [..] in combination with the simplification steps [..] |
1572 method [..] in combination with the simplification steps [..] |
1561 yields POSIX parse trees. We have tested this claim |
1573 yields POSIX parse trees. We have tested this claim |
1562 extensively [..] but yet |
1574 extensively [..] but yet |