ChengsongTanPhdThesis/Chapters/Chapter1.tex
changeset 518 ff7945a988a3
parent 516 6fecb7fe8cd0
child 519 856d025dbc15
equal deleted inserted replaced
517:3c5e58d08939 518:ff7945a988a3
    39 
    39 
    40 \def\AZERO{\textit{AZERO}}
    40 \def\AZERO{\textit{AZERO}}
    41 \def\AONE{\textit{AONE}}
    41 \def\AONE{\textit{AONE}}
    42 \def\ACHAR{\textit{ACHAR}}
    42 \def\ACHAR{\textit{ACHAR}}
    43 
    43 
    44 
    44 \def\POSIX{\textit{POSIX}}
    45 \def\ALTS{\textit{ALTS}}
    45 \def\ALTS{\textit{ALTS}}
    46 \def\ASTAR{\textit{ASTAR}}
    46 \def\ASTAR{\textit{ASTAR}}
    47 \def\DFA{\textit{DFA}}
    47 \def\DFA{\textit{DFA}}
    48 \def\bmkeps{\textit{bmkeps}}
    48 \def\bmkeps{\textit{bmkeps}}
    49 \def\retrieve{\textit{retrieve}}
    49 \def\retrieve{\textit{retrieve}}
    50 \def\blexer{\textit{blexer}}
    50 \def\blexer{\textit{blexer}}
    51 \def\flex{\textit{flex}}
    51 \def\flex{\textit{flex}}
    52 \def\inj{\mathit{inj}}
    52 \def\inj{\mathit{inj}}
    53 \def\Empty{\mathit{Empty}}
    53 \def\Empty{\mathit{Empty}}
    54 \def\Left{\mathit{Left}}xc
    54 \def\Left{\mathit{Left}}
    55 \def\Right{\mathit{Right}}
    55 \def\Right{\mathit{Right}}
    56 \def\Stars{\mathit{Stars}}
    56 \def\Stars{\mathit{Stars}}
    57 \def\Char{\mathit{Char}}
    57 \def\Char{\mathit{Char}}
    58 \def\Seq{\mathit{Seq}}
    58 \def\Seq{\mathit{Seq}}
    59 \def\Der{\mathit{Der}}
    59 \def\Der{\mathit{Der}}
    62 \def\S{\mathit{S}}
    62 \def\S{\mathit{S}}
    63 \def\rup{r^\uparrow}
    63 \def\rup{r^\uparrow}
    64 %\def\bderssimp{\mathit{bders}\_\mathit{simp}}
    64 %\def\bderssimp{\mathit{bders}\_\mathit{simp}}
    65 \def\distinctWith{\textit{distinctWith}}
    65 \def\distinctWith{\textit{distinctWith}}
    66 
    66 
       
    67 \def\size{\mathit{size}}
    67 \def\rexp{\mathbf{rexp}}
    68 \def\rexp{\mathbf{rexp}}
    68 \def\simp{\mathit{simp}}
    69 \def\simp{\mathit{simp}}
    69 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
    70 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
    70 \def\map{\mathit{map}}
    71 \def\map{\mathit{map}}
    71 \def\distinct{\mathit{distinct}}
    72 \def\distinct{\mathit{distinct}}
    77 \newcommand\rnullable[1]{\textit{rnullable}(#1)}
    78 \newcommand\rnullable[1]{\textit{rnullable}(#1)}
    78 \newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
    79 \newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
    79 \newcommand\asize[1]{\llbracket #1 \rrbracket}
    80 \newcommand\asize[1]{\llbracket #1 \rrbracket}
    80 \newcommand\rerase[1]{ (#1)\downarrow_r}
    81 \newcommand\rerase[1]{ (#1)\downarrow_r}
    81 
    82 
       
    83 
    82 \def\erase{\textit{erase}}
    84 \def\erase{\textit{erase}}
    83 \def\STAR{\textit{STAR}}
    85 \def\STAR{\textit{STAR}}
    84 \def\flts{\textit{flts}}
    86 \def\flts{\textit{flts}}
    85 
    87 
    86 
    88 
    98 
   100 
    99 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   101 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   100 
   102 
   101 
   103 
   102 Regular expressions are widely used in computer science: 
   104 Regular expressions are widely used in computer science: 
   103 be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, 
   105 be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, 
   104 command line tools like $\mathit{grep}$ that facilitates easy 
   106 command-line tools like $\mathit{grep}$ that facilitate easy 
   105 text processing , network intrusion
   107 text processing, network intrusion
   106 detection systems that rejects suspicious traffic, or compiler
   108 detection systems that reject suspicious traffic, or compiler
   107 front ends--the majority of the solutions to these tasks 
   109 front ends--the majority of the solutions to these tasks 
   108 involve lexing with regular 
   110 involve lexing with regular 
   109 expressions.
   111 expressions.
   110 
       
   111 Given its usefulness and ubiquity, one would imagine that
   112 Given its usefulness and ubiquity, one would imagine that
   112 modern regular expression matching implementations
   113 modern regular expression matching implementations
   113 are mature and fully-studied.
   114 are mature and fully studied.
   114 
   115 Indeed, in a popular programming language' regex engine, 
   115 If you go to a popular programming language's
   116 supplying it with regular expressions and strings, one can
   116 regex engine,
   117 get rich matching information in a very short time.
    117 you can supply it with regexes and strings of your own,
   118 Some network intrusion detection systems
   118 and get matching/lexing  information such as how a 
   119 use regex engines that are able to process 
   119 sub-part of the regex matches a sub-part of the string.
   120 megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
   120 These lexing libraries are on average quite fast.
   121 Unfortunately, this is not the case for $\mathbf{all}$ inputs.
   121 %TODO: get source for SNORT/BRO's regex matching engine/speed
   122 %TODO: get source for SNORT/BRO's regex matching engine/speed
   122 For example, the regex engines some network intrusion detection
       
   123 systems use are able to process
       
   124  megabytes or even gigabytes of network traffic per second.
       
   125  
       
    126 Why do we need our own version, if the existing algorithms are
       
    127 already blindingly fast? It turns out that they are not always fast.
       
   128 
   123 
   129 
   124 
   130 Take $(a^*)^*\,b$ and ask whether
   125 Take $(a^*)^*\,b$ and ask whether
   131 strings of the form $aa..a$ match this regular
   126 strings of the form $aa..a$ match this regular
   132 expression. Obviously this is not the case---the expected $b$ in the last
   127 expression. Obviously this is not the case---the expected $b$ in the last
   133 position is missing. One would expect that modern regular expression
   128 position is missing. One would expect that modern regular expression
   134 matching engines can find this out very quickly. Alas, if one tries
   129 matching engines can find this out very quickly. Alas, if one tries
   135 this example in JavaScript, Python or Java 8 with strings like 28
   130 this example in JavaScript, Python or Java 8, even with strings of a small
   136 $a$'s, one discovers that this decision takes around 30 seconds and
   131 length, say around 30 $a$'s, one discovers that 
   137 takes considerably longer when adding a few more $a$'s, as the graphs
    132 this decision takes an unreasonably long time given the simplicity of the problem.
       
   133 This is clearly exponential behaviour, and 
       
   134 is triggered by some relatively simple regex patterns, as the graphs
   138 below show:
   135 below show:
   139 
   136 
   140 \begin{center}
   137 \begin{figure}
       
   138 \centering
   141 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   139 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   142 \begin{tikzpicture}
   140 \begin{tikzpicture}
   143 \begin{axis}[
   141 \begin{axis}[
   144     xlabel={$n$},
   142     xlabel={$n$},
   145     x label style={at={(1.05,-0.05)}},
   143     x label style={at={(1.05,-0.05)}},
   202 \end{axis}
   200 \end{axis}
   203 \end{tikzpicture}\\
   201 \end{tikzpicture}\\
   204 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   202 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   205            of the form $\underbrace{aa..a}_{n}$.}
   203            of the form $\underbrace{aa..a}_{n}$.}
   206 \end{tabular}    
   204 \end{tabular}    
   207 \end{center}  
   205 \caption{aStarStarb} \label{fig:aStarStarb}
   208 
   206 \end{figure}
   209 
   207 
   210 This is clearly exponential behaviour, and 
   208 
   211 is triggered by some relatively simple regex patterns.
       
   212 
   209 
   213 
   210 
    214 This superlinear blowup in matching algorithms sometimes causes
    211 This superlinear blowup in matching algorithms sometimes causes
   215 considerable grief in real life: for example on 20 July 2016 one evil
   212 considerable grief in real life: for example on 20 July 2016 one evil
   216 regular expression brought the webpage
   213 regular expression brought the webpage
   226 the string was $O(n^2)$ with respect to the string length. This
   223 the string was $O(n^2)$ with respect to the string length. This
   227 quadratic overhead was enough for the homepage of Stack Exchange to
   224 quadratic overhead was enough for the homepage of Stack Exchange to
   228 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   225 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   229 attack and therefore stopped the servers from responding to any
   226 attack and therefore stopped the servers from responding to any
    230 requests. This made the whole site unavailable. 
    227 requests. This made the whole site unavailable. 
   231 A more 
   228 A more recent example is a global outage of all Cloudflare servers on 2 July
   232 recent example is a global outage of all Cloudflare servers on 2 July
       
   233 2019. A poorly written regular expression exhibited exponential
   229 2019. A poorly written regular expression exhibited exponential
   234 behaviour and exhausted CPUs that serve HTTP traffic. Although the
   230 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
   235 outage had several causes, at the heart was a regular expression that
   231 had several causes, at the heart was a regular expression that
   236 was used to monitor network
   232 was used to monitor network
   237 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
   233 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
   238 %TODO: data points for some new versions of languages
   234 %TODO: data points for some new versions of languages
   239 These problems with regular expressions 
   235 These problems with regular expressions 
   240 are not isolated events that happen
   236 are not isolated events that happen
   241 very occasionally, but actually quite widespread.
    237 very occasionally, but are actually widespread.
   242 They occur so often that they get a 
   238 They occur so often that they get a 
   243 name--Regular-Expression-Denial-Of-Service (ReDoS)
   239 name--Regular-Expression-Denial-Of-Service (ReDoS)
   244 attack.
   240 attack.
   245 Davis et al. \parencite{Davis18} detected more
   241 Davis et al. \parencite{Davis18} detected more
   246 than 1000 super-linear (SL) regular expressions
   242 than 1000 super-linear (SL) regular expressions
   248 They therefore concluded that evil regular expressions
   244 They therefore concluded that evil regular expressions
    249 are more than just ``a parlour trick'', but a problem that
    245 are more than just ``a parlour trick'', but a problem that
   250 requires
   246 requires
   251 more research attention.
   247 more research attention.
   252 
   248 
   253  \section{Why are current algorithm for regexes slow?}
   249  \section{The Problem Behind Slow Cases}
   254 
   250 
   255 %find literature/find out for yourself that REGEX->DFA on basic regexes
   251 %find literature/find out for yourself that REGEX->DFA on basic regexes
   256 %does not blow up the size
   252 %does not blow up the size
   257 Shouldn't regular expression matching be linear?
   253 Shouldn't regular expression matching be linear?
   258 How can one explain the super-linear behaviour of the 
   254 How can one explain the super-linear behaviour of the 
   305 that can be compiled and run. 
   301 that can be compiled and run. 
    306 The good thing about $\mathit{DFA}$s is that once
    302 The good thing about $\mathit{DFA}$s is that once
   307 generated, they are fast and stable, unlike
   303 generated, they are fast and stable, unlike
   308 backtracking algorithms. 
   304 backtracking algorithms. 
   309 However, they do not scale well with bounded repetitions.\\
   305 However, they do not scale well with bounded repetitions.\\
   310 
       
   311 
       
   312 \begin{center}
       
   313 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   314 \begin{tikzpicture}
       
   315 \begin{axis}[
       
   316     ymode=log,
       
   317     xlabel={$n$},
       
   318     x label style={at={(1.05,-0.05)}},
       
   319     ylabel={time in secs},
       
   320     enlargelimits=false,
       
   321     xtick={1,2,...,8},
       
   322     xmax=9,
       
   323     ymax=16000000,
       
   324     ytick={0,500,...,3500},
       
   325     scaled ticks=false,
       
   326     axis lines=left,
       
   327     width=5cm,
       
   328     height=4cm, 
       
   329     legend entries={JavaScript},  
       
   330     legend pos=north west,
       
   331     legend cell align=left]
       
   332 \addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data};
       
   333 \end{axis}
       
   334 \end{tikzpicture}
       
   335 \end{tabular}
       
   336 \end{center}
       
   337 
       
    338 Another graph shows how the maximum size grows with the number of $a$'s:
       
   339 \begin{center}
       
   340 \begin{tikzpicture}
       
   341     \begin{axis}[
       
   342         height=0.5\textwidth,
       
   343         width=\textwidth,
       
   344         xlabel=number of a's,
       
   345         xtick={0,...,9},
       
   346         ylabel=maximum size,
       
   347         ymode=log,
       
   348        log basis y={2}
       
   349 ]
       
   350         \addplot[mark=*,blue] table {re-chengsong.data};
       
   351     \end{axis}
       
   352 \end{tikzpicture}
       
   353 \end{center}
       
   354 
   306 
   355 
   307 
   356 
   308 
   357 
   309 
   358 Bounded repetitions, usually written in the form
   310 Bounded repetitions, usually written in the form
   467 \end{center}
   419 \end{center}
   468 The concrete example
   420 The concrete example
   469 $((a|b|c|\ldots|z)^*)\backslash 1$
   421 $((a|b|c|\ldots|z)^*)\backslash 1$
    470 would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
    422 would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
    471 Back-reference is a construct in the ``regex'' standard
    423 Back-reference is a construct in the ``regex'' standard
   472 that programmers found quite useful, but not exactly 
   424 that programmers found useful, but not exactly 
   473 regular any more.
   425 regular any more.
    474 In fact, back-references allow regexes to express
    426 In fact, back-references allow regexes to express
    475 languages that are not contained in the class of context-free
    427 languages that are not contained in the class of context-free
    476 languages either.
    428 languages either.
    477 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
    429 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   504  %TODO: verify the fact Rust does not allow 1000+ reps
   456  %TODO: verify the fact Rust does not allow 1000+ reps
   505  %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
   457  %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
   506 \section{Buggy Regex Engines} 
   458 \section{Buggy Regex Engines} 
   507 
   459 
   508 
   460 
   509  Another thing about the these libraries is that there
   461  Another thing about these libraries is that there
   510  is no correctness guarantee.
   462  is no correctness guarantee.
   511  In some cases they either fails to generate a lexing result when there is a match,
   463  In some cases, they either fail to generate a lexing result when there exists a match,
   512  or gives the wrong way of matching.
    464  or produce a match that is not the intended one.
   513  
   465  
   514 
   466 
   515 It turns out that regex libraries not only suffer from 
   467 It turns out that regex libraries not only suffer from 
   516 exponential backtracking problems, 
   468 exponential backtracking problems, 
    517 but also from undesired (or even buggy) outputs.
    469 but also from undesired (or even buggy) outputs.
   555 %\section{How people solve problems with regexes}
   507 %\section{How people solve problems with regexes}
   556 
   508 
   557 
   509 
   558 When a regular expression does not behave as intended,
   510 When a regular expression does not behave as intended,
   559 people usually try to rewrite the regex to some equivalent form
   511 people usually try to rewrite the regex to some equivalent form
   560 or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
   512 or they try to avoid the possibly problematic patterns completely,
   561 of which there are many false positives.
   513 for which many false positives exist\parencite{Davis18}.
    562 Animated tools to ``debug'' regular expressions
    514 Animated tools to ``debug'' regular expressions
   563 are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
   515 are also popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
   564 to name a few.
   516 to name a few.
   565 There is also static analysis work on regular expressions that
   517 We are also aware of static analysis work on regular expressions that
    566 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
    518 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
   567 \parencite{Rathnayake2014StaticAF} proposed an algorithm
   519 \parencite{Rathnayake2014StaticAF} proposed an algorithm
   568 that detects regular expressions triggering exponential
   520 that detects regular expressions triggering exponential
    569 behaviour on backtracking matchers.
    521 behaviour on backtracking matchers.
   570 Weideman \parencite{Weideman2017Static} came up with 
   522 Weideman \parencite{Weideman2017Static} came up with 
   606 
   558 
   607 
   559 
   608 
   560 
   609 %----------------------------------------------------------------------------------------
   561 %----------------------------------------------------------------------------------------
   610 
   562 
   611 \section{Our Contribution}
   563 \section{Contribution}
   612 
   564 
   613 
   565 
   614 
   566 
   615 This work addresses the vulnerability of super-linear and
   567 This work addresses the vulnerability of super-linear and
   616 buggy regex implementations by the combination
   568 buggy regex implementations by the combination
   625 Antimirov.
   577 Antimirov.
   626 
   578 
   627  
   579  
   628 The main contribution of this thesis is a proven correct lexing algorithm
   580 The main contribution of this thesis is a proven correct lexing algorithm
   629 with formalized time bounds.
   581 with formalized time bounds.
   630 To our best knowledge, there is no lexing libraries using Brzozowski derivatives
    582 To the best of our knowledge, no lexing libraries using Brzozowski derivatives
   631 that have a provable time guarantee, 
   583 have a provable time guarantee, 
   632 and claims about running time are usually speculative and backed by thin empirical
   584 and claims about running time are usually speculative and backed by thin empirical
   633 evidence.
   585 evidence.
   634 %TODO: give references
   586 %TODO: give references
    635 For example, Sulzmann and Lu proposed an algorithm in which they
    587 For example, Sulzmann and Lu proposed an algorithm in which they
    636 claimed a linear running time.
    588 claimed a linear running time.
   679 
   631 
    680 The language, or the set of strings denoted by a regular expression, is defined as
    632 The language, or the set of strings denoted by a regular expression, is defined as
   681 %TODO: FILL in the other defs
   633 %TODO: FILL in the other defs
   682 \begin{center}
   634 \begin{center}
   683 \begin{tabular}{lcl}
   635 \begin{tabular}{lcl}
    684 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
    636 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; (r_2)$\\
    685 $L \; r_1 \cdot r_2$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1 \wedge s_2 \in L \; r_2\}$\\
    637 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
   686 \end{tabular}
   638 \end{tabular}
   687 \end{center}
   639 \end{center}
    688 This is also called the ``language interpretation'' of regular expressions.
    640 This is also called the ``language interpretation'' of regular expressions.
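For instance, under this interpretation the regular expression $(a + b)\cdot c$ denotes a set with exactly two strings:
\begin{center}
$L \; ((a + b)\cdot c) \;=\; \{s_1 @ s_2 \mid s_1 \in \{a, b\} \wedge s_2 \in \{c\}\} \;=\; \{ac,\; bc\}$
\end{center}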
   689 
   641 
   690 
   642 
   694 strings without the head character $c$.
   646 strings without the head character $c$.
   695 
   647 
    696 Formally, we first define such a transformation on any set of strings, which
    648 Formally, we first define such a transformation on any set of strings, which
   697 we call semantic derivative:
   649 we call semantic derivative:
   698 \begin{center}
   650 \begin{center}
   699 $\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
   651 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
   700 \end{center}
   652 \end{center}
    701 Mathematically, it is the set of tails of those strings in the string set that start with the character $c$.
    653 Mathematically, it is the set of tails of those strings in $A$ that start with the character $c$.
   702 
   654 
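For example, applied to a small concrete set of strings (an illustrative example):
\begin{center}
$\Der \; a \; \{ab,\; aa,\; ba\} \;=\; \{b,\; a\}$
\end{center}
The string $ba$ does not start with $a$ and therefore contributes nothing.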
    703 If the $\textit{StringSet}$ happens to have some structure, for example,
    655 If the set $A$ happens to have some structure, for example,
   704 if it is regular, then we have that it
   656 if it is regular, then we have that it
    786 a new regular expression denoting all the strings it contains
    738 a new regular expression denoting all the strings it contains
    787 with a certain head character $c$ chopped off.
    739 with a certain head character $c$ chopped off.
   788 The most involved cases are the sequence 
   740 The most involved cases are the sequence 
   789 and star case.
   741 and star case.
   790 The sequence case says that if the first regular expression
   742 The sequence case says that if the first regular expression
   791 contains an empty string then second component of the sequence
    743 contains the empty string, then the second component of the sequence
   792 might be chosen as the target regular expression to be chopped
   744 might be chosen as the target regular expression to be chopped
   793 off its head character.
   745 off its head character.
   794 The star regular expression unwraps the iteration of
    746 The star regular expression's derivative unwraps one iteration of the
   795 regular expression and attack the star regular expression
   747 regular expression and attaches the star regular expression
   796 to its back again to make sure there are 0 or more iterations
   748 to the sequence's second element to make sure a copy is retained
   797 following this unfolded iteration.
    749 for possibly more iterations in later phases of lexing.
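To make the sequence and star cases concrete, the derivative operation can be sketched in Scala roughly as follows. This is only an illustrative sketch with ad-hoc constructor names, not the formal definition used in the rest of this thesis, but it follows the clauses just described:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches only the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    // if r1 can match the empty string, the head character
    // may also be chopped off from r2
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) =>
    // unwrap one iteration and keep the star for further iterations
    SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}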
   798 
   750 
   799 
   751 
   800 The main property of the derivative operation
   752 The main property of the derivative operation
   801 that enables us to reason about the correctness of
   753 that enables us to reason about the correctness of
   802 an algorithm using derivatives is 
   754 an algorithm using derivatives is 
   842 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   794 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   843 
   795 
    844 This is a beautiful and simple definition.
    796 This is a beautiful and simple definition.
   845 
   797 
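Expressed as code (continuing the illustrative Scala sketch from above), the matcher simply folds the derivative over the string and tests nullability of the final result:
\begin{verbatim}
// derivative with respect to a whole string
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: s is in L(r) iff the final derivative is nullable
def matcher(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}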
   846 If we implement the above algorithm naively, however,
   798 If we implement the above algorithm naively, however,
   847 the algorithm can be excruciatingly slow. For example, when starting with the regular
   799 the algorithm can be excruciatingly slow. 
   848 expression $(a + aa)^*$ and building 12 successive derivatives
   800 
       
   801 
       
   802 \begin{figure}
       
   803 \centering
       
   804 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   805 \begin{tikzpicture}
       
   806 \begin{axis}[
       
   807     xlabel={$n$},
       
   808     x label style={at={(1.05,-0.05)}},
       
   809     ylabel={time in secs},
       
   810     enlargelimits=false,
       
   811     xtick={0,5,...,30},
       
   812     xmax=33,
       
   813     ymax=10000,
       
   814     ytick={0,1000,...,10000},
       
   815     scaled ticks=false,
       
   816     axis lines=left,
       
   817     width=5cm,
       
   818     height=4cm, 
       
   819     legend entries={JavaScript},  
       
   820     legend pos=north west,
       
   821     legend cell align=left]
       
   822 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
       
   823 \end{axis}
       
   824 \end{tikzpicture}\\
       
   825 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   826            of the form $\underbrace{aa..a}_{n}$.}
       
   827 \end{tabular}    
       
   828 \caption{EightThousandNodes} \label{fig:EightThousandNodes}
       
   829 \end{figure}
       
   830 
       
   831 
       
   832 (8000 node data to be added here)
       
   833 For example, when starting with the regular
       
   834 expression $(a + aa)^*$ and building a few successive derivatives (around 10)
   849 w.r.t.~the character $a$, one obtains a derivative regular expression
   835 w.r.t.~the character $a$, one obtains a derivative regular expression
   850 with more than 8000 nodes (when viewed as a tree). Operations like
    836 with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
       
   837 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   838 pruning, the algorithm will keep records of all possible ways of matching:
       
   839 \begin{center}
       
    840 $(a + aa) ^* \backslash (aa) = (\ZERO + (\ZERO a + \ONE))\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   841 \end{center}
       
   842 
       
   843 \noindent
       
    844 Each of the non-trivial alternative branches above corresponds to one way of matching: 
       
    845 $aa$, $a \quad a$ and $a \quad a \cdot (a)$ (the last one being incomplete).
       
   846 These different ways of matching will grow exponentially with the string length,
       
   847 and without simplifications that throw away some of these very similar matchings,
       
   848 it is no surprise that these expressions grow so quickly.
       
   849 Operations like
   851 $\backslash$ and $\nullable$ need to traverse such trees and
   850 $\backslash$ and $\nullable$ need to traverse such trees and
   852 consequently the bigger the size of the derivative the slower the
   851 consequently the bigger the size of the derivative the slower the
   853 algorithm. 
   852 algorithm. 
   854 
   853 
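The growth is easy to observe experimentally. Reusing the illustrative Scala sketch from above, together with an ad-hoc size function that counts tree nodes, one can print the sizes of successive derivatives of $(a + aa)^*$; the exact numbers depend on the chosen representation, but the rapid growth is unmistakable:
\begin{verbatim}
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
// size of the derivative after n characters 'a'
for (n <- 1 to 12)
  println(s"$n: ${size(ders(List.fill(n)('a'), r))}")
\end{verbatim}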
    855 Brzozowski was quick in finding that during this process a lot of useless
    854 Brzozowski was quick in finding that during this process a lot of useless
   947 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   946 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   948 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   947 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   949 The number of different ways of matching 
   948 The number of different ways of matching 
   950 without allowing any value under a star to be flattened
   949 without allowing any value under a star to be flattened
   951 to an empty string can be given by the following formula:
   950 to an empty string can be given by the following formula:
   952 \begin{center}
   951 \begin{equation}
   953 	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
   952 	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
   954 \end{center}
   953 \end{equation}
   955 and a closed form formula can be calculated to be
   954 and a closed form formula can be calculated to be
   956 \begin{equation}
   955 \begin{equation}
   957 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
   956 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
   958 \end{equation}
   957 \end{equation}
    959 which clearly grows exponentially.
    958 which clearly grows exponentially.
   960 
   959 
   961 A lexer aimed at getting all the possible values has an exponential
   960 A lexer aimed at getting all the possible values has an exponential
   962 worst case runtime. Therefore it is impractical to try to generate
   961 worst case runtime. Therefore it is impractical to try to generate
   963 all possible matches in a run. In practice, we are usually 
   962 all possible matches in a run. In practice, we are usually 
    964 interested in POSIX values, which by intuition always
    963 interested in POSIX values, which by intuition always
   965 match the leftmost regular expression when there is a choice
   964 \begin{itemize}
   966 and always match a sub part as much as possible before proceeding
   965 \item
   967 to the next token. For example, the above example has the POSIX value
   966 match the leftmost regular expression when multiple options of matching
       
   967 are available  
       
   968 \item 
       
   969 always match a subpart as much as possible before proceeding
       
   970 to the next token.
       
   971 \end{itemize}
       
   972 
       
   973 
       
   974  For example, the above example has the POSIX value
    968 $ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
    975 $ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
    969 The output of the algorithm we want is a POSIX matching
    976 The output of the algorithm we want is a POSIX matching
   970 encoded as a value.
   977 encoded as a value.
       
   978 The reason why we are interested in $\POSIX$ values is that they can
       
   979 be practically used in the lexing phase of a compiler front end.
       
   980 For instance, when lexing a code snippet 
       
   981 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   982 as an identifier rather than a keyword.
       
   983 
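To give an idea of how such values can be represented, the constructors used above correspond roughly to the following Scala datatype (an illustrative sketch only; $\Char$ and $\Seq$ are renamed to avoid clashes with Scala's built-in types):
\begin{verbatim}
abstract class Val
case object Empty extends Val                   // the value for ONE
case class Chr(c: Char) extends Val             // the value for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val   // one value per SEQ component
case class Left(v: Val) extends Val             // the left ALT branch was taken
case class Right(v: Val) extends Val            // the right ALT branch was taken
case class Stars(vs: List[Val]) extends Val     // one value per star iteration
\end{verbatim}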
   971 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   984 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   972 algorithm by a second phase (the first phase being building successive
   985 algorithm by a second phase (the first phase being building successive
   973 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   986 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   974 is generated in case the regular expression matches  the string. 
   987 is generated in case the regular expression matches  the string. 
   975 Pictorially, the Sulzmann and Lu algorithm is as follows:
   988 Pictorially, the Sulzmann and Lu algorithm is as follows:
  1254   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  1267   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  1255   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
  1268   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
  1256         $\textit{if}\;c=d\; \;\textit{then}\;
  1269         $\textit{if}\;c=d\; \;\textit{then}\;
  1257          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
  1270          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
  1258   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  1271   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  1259   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
  1272   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
  1260   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
  1273   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
  1261      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1274      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1262 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  1275 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  1263 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  1276 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  1264   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  1277   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  1314 %\end{definition}
  1327 %\end{definition}
  1315 \noindent
  1328 \noindent
   1316 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
   1329 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
  1317 we need to unfold it into a sequence,
  1330 we need to unfold it into a sequence,
  1318 and attach an additional bit $0$ to the front of $r \backslash c$
  1331 and attach an additional bit $0$ to the front of $r \backslash c$
  1319 to indicate that there is one more star iteration. Also the sequence clause
  1332 to indicate one more star iteration. Also the sequence clause
  1320 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
  1333 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
  1321 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
  1334 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
  1322 that it is for annotated regular expressions, therefore we omit the
  1335 that it is for annotated regular expressions, therefore we omit the
  1323 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
  1336 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
  1324 $a_1$ matches the string prior to character $c$ (more on this later),
  1337 $a_1$ matches the string prior to character $c$ (more on this later),
  1373 \noindent
  1386 \noindent
   1374 In this definition $\_\backslash s$ is the generalisation of the derivative
   1387 In this definition $\_\backslash s$ is the generalisation of the derivative
  1375 operation from characters to strings (just like the derivatives for un-annotated
  1388 operation from characters to strings (just like the derivatives for un-annotated
  1376 regular expressions).
  1389 regular expressions).
  1377 
  1390 
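Concretely, this generalisation simply folds the character derivative over the string, in the same way as for ordinary regular expressions:
\begin{center}
\begin{tabular}{lcl}
$a \backslash []$ & $\dn$ & $a$\\
$a \backslash (c::s)$ & $\dn$ & $(a \backslash c) \backslash s$\\
\end{tabular}
\end{center}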
  1378 Remember tha one of the important reasons we introduced bitcodes
   1391 Now we introduce the simplifications, which are the reason why we
  1379 is that they can make simplification more structured and therefore guaranteeing
   1392 use bitcodes in the first place.
  1380 the correctness.
  1393 
  1381 
  1394 \subsection*{Simplification Rules}
  1382 \subsection*{Our Simplification Rules}
  1395 
  1383 
  1396 This section introduces aggressive (in terms of size) simplification rules
  1384 In this section we introduce aggressive (in terms of size) simplification rules
       
  1385 on annotated regular expressions
  1397 on annotated regular expressions
  1386 in order to keep derivatives small. Such simplifications are promising
  1398 to keep derivatives small. Such simplifications are promising
  1387 as we have
  1399 as we have
  1388 generated test data that show
  1400 generated test data that show
  1389 that a good tight bound can be achieved. Obviously we could only
  1401 that a good tight bound can be achieved. We could only
  1390 partially cover  the search space as there are infinitely many regular
  1402 partially cover the search space as there are infinitely many regular
  1391 expressions and strings. 
  1403 expressions and strings. 
  1392 
  1404 
  1393 One modification we introduced is to allow a list of annotated regular
  1405 One modification we introduced is to allow a list of annotated regular
  1394 expressions in the $\sum$ constructor. This allows us to not just
  1406 expressions in the $\sum$ constructor. This allows us to not just
  1395 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
  1407 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
  1424 \end{center}    
  1436 \end{center}    
  1425 
  1437 
  1426 \noindent
  1438 \noindent
   1427 The simplification function pattern-matches on the regular expression.
   1439 The simplification function pattern-matches on the regular expression.
   1428 When it detects that the regular expression is an alternative or
   1440 When it detects that the regular expression is an alternative or
  1429 sequence, it will try to simplify its children regular expressions
  1441 sequence, it will try to simplify its child regular expressions
  1430 recursively and then see if one of the children turn into $\ZERO$ or
  1442 recursively and then see if one of the children turns into $\ZERO$ or
  1431 $\ONE$, which might trigger further simplification at the current level.
  1443 $\ONE$, which might trigger further simplification at the current level.
  1432 The most involved part is the $\sum$ clause, where we use two
  1444 The most involved part is the $\sum$ clause, where we use two
  1433 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
  1445 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
  1434 alternatives and reduce as many duplicates as possible. Function
  1446 alternatives and reduce as many duplicates as possible. Function
  1435 $\textit{distinct}$  keeps the first occurring copy only and remove all later ones
  1447 $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
   1436 when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
   1448 when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
  1437 Its recursive definition is given below:
  1449 Its recursive definition is given below:
  1438 
  1450 
  1439  \begin{center}
  1451  \begin{center}
  1440   \begin{tabular}{@{}lcl@{}}
  1452   \begin{tabular}{@{}lcl@{}}
  1489 succession) all the characters of the string matches the empty string,
  1501 succession) all the characters of the string matches the empty string,
  1490 then $r$ matches $s$ (and {\em vice versa}).  
  1502 then $r$ matches $s$ (and {\em vice versa}).  
  1491 
  1503 
  1492 
  1504 
  1493 
  1505 
  1494 However, there are two difficulties with derivative-based matchers:
  1506 However, two difficulties with derivative-based matchers exist:
  1495 First, Brzozowski's original matcher only generates a yes/no answer
  1507 First, Brzozowski's original matcher only generates a yes/no answer
  1496 for whether a regular expression matches a string or not.  This is too
  1508 for whether a regular expression matches a string or not.  This is too
  1497 little information in the context of lexing where separate tokens must
  1509 little information in the context of lexing where separate tokens must
  1498 be identified and also classified (for example as keywords
  1510 be identified and also classified (for example as keywords
  1499 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
  1511 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
  1552 of this version, but do not provide any supporting proof arguments, not
  1564 of this version, but do not provide any supporting proof arguments, not
  1553 even ``pencil-and-paper'' arguments. They write about their bitcoded
  1565 even ``pencil-and-paper'' arguments. They write about their bitcoded
  1554 \emph{incremental parsing method} (that is the algorithm to be formalised
  1566 \emph{incremental parsing method} (that is the algorithm to be formalised
   1555 in this thesis):
   1567 in this thesis):
  1556 
  1568 
  1557 
  1569 %motivation part
  1558 \begin{quote}\it
  1570 \begin{quote}\it
  1559   ``Correctness Claim: We further claim that the incremental parsing
  1571   ``Correctness Claim: We further claim that the incremental parsing
  1560   method [..] in combination with the simplification steps [..]
  1572   method [..] in combination with the simplification steps [..]
  1561   yields POSIX parse trees. We have tested this claim
  1573   yields POSIX parse trees. We have tested this claim
  1562   extensively [..] but yet
  1574   extensively [..] but yet