ecp/ecoop_paper.tex
changeset 24 bffa240d5b7a
parent 22 feffec3af1a1
child 25 5ca7bf724474
equal deleted inserted replaced
23:fc1597145975 24:bffa240d5b7a
     9 % \usepackage{hyperref}
     9 % \usepackage{hyperref}
    10 % \usepackage[margin=0.5in]{geometry}
    10 % \usepackage[margin=0.5in]{geometry}
    11 %\usepackage{pmboxdraw}
    11 %\usepackage{pmboxdraw}
    12  
    12  
    13 \title{POSIX Regular Expression Matching and Lexing}
    13 \title{POSIX Regular Expression Matching and Lexing}
    14 \author[1]{Chengsong Tan \\ King's College London\\chengsong.tan@kcl.ac.uk}
    14 \author[1]{Chengsong Tan}
       
    15 \affil[1]{\\ Department of Informatics, King's College London\\
       
    16 London, UK\\
       
    17 \texttt{chengsong.tan@kcl.ac.uk}}
       
    18 \authorrunning{Chengsong Tan}
       
    19 \Copyright{Chengsong Tan}
    15 
    20 
    16 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
    21 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
    17 \newcommand{\ZERO}{\mbox{\bf 0}}
    22 \newcommand{\ZERO}{\mbox{\bf 0}}
    18 \newcommand{\ONE}{\mbox{\bf 1}}
    23 \newcommand{\ONE}{\mbox{\bf 1}}
    19 \def\lexer{\mathit{lexer}}
    24 \def\lexer{\mathit{lexer}}
    40 
    45 
    41 
    46 
    42 \begin{document}
    47 \begin{document}
    43 
    48 
    44 \maketitle
    49 \maketitle
       
    50 
    45 \begin{abstract}
    51 \begin{abstract}
    46   Brzozowski introduced in 1964 a beautifully simple algorithm for
    52   Brzozowski introduced in 1964 a beautifully simple algorithm for
    47   regular expression matching based on the notion of derivatives of
    53   regular expression matching based on the notion of derivatives of
    48   regular expressions. In 2014, Sulzmann and Lu extended this
    54   regular expressions. In 2014, Sulzmann and Lu extended this
    49   algorithm to not just give a YES/NO answer for whether or not a regular
    55   algorithm to not just give a YES/NO answer for whether or not a regular
   156 frequent enough that a separate name has been created for
   162 frequent enough that a separate name has been created for
   157 them---\emph{evil regular expressions}. In empiric work, Davis et al
   163 them---\emph{evil regular expressions}. In empiric work, Davis et al
   158 report that they have found thousands of such evil regular expressions
   164 report that they have found thousands of such evil regular expressions
   159 in the JavaScript and Python ecosystems \cite{Davis18}.
   165 in the JavaScript and Python ecosystems \cite{Davis18}.
   160 
   166 
   161 This exponential blowup sometimes causes real pain in ``real life'':
   167 This exponential blowup sometimes causes real pain in real life:
   162 for example one evil regular expression brought on 20 July 2016 the
   168 for example on 20 July 2016 one evil regular expression brought the
   163 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees\cite{SE16}.
   169 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees\cite{SE16}.
   164 In this instance, a regular expression intended to just trim white
   170 In this instance, a regular expression intended to just trim white
   165 spaces from the beginning and the end of a line actually consumed
   171 spaces from the beginning and the end of a line actually consumed
   166 massive amounts of CPU-resources and because of this the web servers
   172 massive amounts of CPU-resources and because of this the web servers
   167 ground to a halt. This happened when a post with 20,000 white spaces
   173 ground to a halt. This happened when a post with 20,000 white spaces
   194 ``fire''---so is it an identifier or a keyword?  While in applications
   200 ``fire''---so is it an identifier or a keyword?  While in applications
   195 there is a well-known strategy to decide these questions, called POSIX
   201 there is a well-known strategy to decide these questions, called POSIX
   196 matching, only relatively recently precise definitions of what POSIX
   202 matching, only relatively recently precise definitions of what POSIX
   197 matching actually means have been formalised
   203 matching actually means have been formalised
   198 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
   204 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
   199 POSIX matching means to match the longest initial substring and
   205 POSIX matching means matching the longest initial substring and
   200 possible ties are solved according to some priorities attached to the
   206 in the case of a tie, the initial submatch is chosen according to some priorities attached to the
   201 regular expressions (e.g.~keywords have a higher priority than
   207 regular expressions (e.g.~keywords have a higher priority than
   202 identifiers). This sounds rather simple, but according to Grathwohl et
   208 identifiers). This sounds rather simple, but according to Grathwohl et
   203 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   209 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
   204 
   210 
   205 \begin{quote}
   211 \begin{quote}
   436 The main point of the bitsequences and annotated regular expressions
   442 The main point of the bitsequences and annotated regular expressions
   437 is that we can apply rather aggressive (in terms of size)
   443 is that we can apply rather aggressive (in terms of size)
   438 simplification rules in order to keep derivatives small.  We have
   444 simplification rules in order to keep derivatives small.  We have
   439 developed such ``aggressive'' simplification rules and generated test
   445 developed such ``aggressive'' simplification rules and generated test
   440 data that show that the expected bound can be achieved. Obviously we
   446 data that show that the expected bound can be achieved. Obviously we
   441 could only cover partially the search space as there are infinitely
   447 could only partially cover  the search space as there are infinitely
   442 many regular expressions and strings. One modification we introduced
   448 many regular expressions and strings. One modification we introduced
   443 is to allow a list of annotated regular expressions in the
   449 is to allow a list of annotated regular expressions in the
   444 \textit{ALTS} constructor. This allows us to not just delete
   450 \textit{ALTS} constructor. This allows us to not just delete
   445 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   451 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   446 unnecessary ``copies'' of regular expressions (very similar to
   452 unnecessary ``copies'' of regular expressions (very similar to