diff -r fc1597145975 -r bffa240d5b7a ecp/ecoop_paper.tex
--- a/ecp/ecoop_paper.tex	Wed Jun 26 16:08:49 2019 +0100
+++ b/ecp/ecoop_paper.tex	Wed Jun 26 17:15:48 2019 +0100
@@ -11,7 +11,12 @@
 %\usepackage{pmboxdraw}
 
 \title{POSIX Regular Expression Matching and Lexing}
-\author[1]{Chengsong Tan \\ King's College London\\chengsong.tan@kcl.ac.uk}
+\author[1]{Chengsong Tan}
+\affil[1]{\\ Department of Informatics, King's College London\\
+London, UK\\
+\texttt{chengsong.tan@kcl.ac.uk}}
+\authorrunning{Chengsong Tan}
+\Copyright{Chengsong Tan}
 
 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
 \newcommand{\ZERO}{\mbox{\bf 0}}
@@ -42,6 +47,7 @@
 
 \begin{document}
 \maketitle
+
 \begin{abstract}
 Brzozowski introduced in 1964 a beautifully simple algorithm for
 regular expression matching based on the notion of derivatives of
@@ -158,8 +164,8 @@
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 
-This exponential blowup sometimes causes real pain in ``real life'':
-for example one evil regular expression brought on 20 July 2016 the
+This exponential blowup sometimes causes real pain in real life:
+for example on 20 July 2016 one evil regular expression brought the
 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees\cite{SE16}.
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
@@ -196,8 +202,8 @@
 matching, only relatively recently precise definitions of what POSIX
 matching actually means have been formalised
 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means to match the longest initial substring and
-possible ties are solved according to some priorities attached to the
+POSIX matching means matching the longest initial substring and
+in the case of a tie, the initial submatch is chosen according to some priorities attached to the
 regular expressions (e.g.~keywords have a higher priority than
 identifiers). This sounds rather simple, but according to Grathwohl et
 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
@@ -438,7 +444,7 @@
 simplification rules in order to keep derivatives small. We
 have developed such ``aggressive'' simplification rules and generated test
 data that show that the expected bound can be achieved. Obviously we
-could only cover partially the search space as there are infinitely
+could only partially cover the search space as there are infinitely
 many regular expressions and strings. One modification we introduced
 is to allow a list of annotated regular expressions in the
 \textit{ALTS} constructor. This allows us to not just delete
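
Editorial note: the following Scala sketch is added only to illustrate the two technical points touched by the hunks above, namely matching via Brzozowski derivatives (abstract) and an ALTS constructor that carries a list of alternatives so that simplification can flatten and de-duplicate them (final hunk). It is a minimal sketch under my own naming choices (Rexp, der, simp, matcher) and is not the code accompanying the paper; in particular it only decides matching and carries no POSIX values or annotations.

// A minimal sketch (not the paper's implementation) of Brzozowski-derivative
// matching where the alternative constructor ALTS carries a *list* of regular
// expressions, so simplification can flatten nested alternatives and delete
// duplicates, keeping derivatives small.
object DerivativeSketch {

  sealed trait Rexp
  case object ZERO extends Rexp                    // matches no string
  case object ONE  extends Rexp                    // matches the empty string
  case class CHAR(c: Char) extends Rexp
  case class ALTS(rs: List[Rexp]) extends Rexp     // n-ary alternative
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
  case class STAR(r: Rexp) extends Rexp

  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALTS(rs)    => rs.exists(nullable)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // Brzozowski derivative of r with respect to the character c
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO        => ZERO
    case ONE         => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALTS(rs)    => ALTS(rs.map(der(c, _)))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALTS(List(SEQ(der(c, r1), r2), der(c, r2)))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  // "aggressive" simplification: drop ZEROs and ONEs in sequences,
  // flatten nested ALTS and delete duplicate alternatives
  def simp(r: Rexp): Rexp = r match {
    case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, s2) => s2
      case (s1, ONE) => s1
      case (s1, s2)  => SEQ(s1, s2)
    }
    case ALTS(rs) =>
      val flattened = rs.map(simp).flatMap {
        case ALTS(rs1) => rs1
        case ZERO      => Nil
        case s         => List(s)
      }.distinct
      flattened match {
        case Nil      => ZERO
        case s :: Nil => s
        case ss       => ALTS(ss)
      }
    case _ => r
  }

  // r matches s iff the iterated (and simplified) derivative is nullable
  def matcher(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))

  def main(args: Array[String]): Unit = {
    // (a*)* followed by b: the simplified derivatives stay small
    val evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
    println(matcher(evil, "a" * 1000 + "b"))   // true, in time linear in the input
  }
}

Carrying a list in ALTS rather than a binary alternative is what lets simp flatten and delete duplicate alternatives in one pass; for the evil (a*)* b above the simplified derivatives reach a fixed point after two characters, so matching a thousand a's followed by a b takes time linear in the input.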
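
Editorial note: the exponential blowup mentioned in the third hunk can be reproduced in miniature with any backtracking engine. The sketch below times java.util.regex on the textbook evil pattern (a|aa)*c applied to strings consisting only of a's; this is not the whitespace-trimming regular expression behind the Stack Exchange outage, just an assumed stand-in that shows the same failure mode.

import java.util.regex.Pattern

object BacktrackingDemo {

  // crude wall-clock timing of a block, in milliseconds
  def time[A](block: => A): (A, Double) = {
    val t0 = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - t0) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    // every run of a's can be split in exponentially many ways between the
    // two branches, and the missing 'c' forces the backtracking engine to
    // try them all before reporting failure
    val evil = Pattern.compile("(a|aa)*c")
    for (n <- 20 to 30 by 2) {
      val input = "a" * n
      val (matched, ms) = time(evil.matcher(input).matches())
      println(f"n = $n%2d  matched = $matched  time = $ms%6.1f ms")
    }
    // already around n = 40 the runtime becomes unbearable on a typical JVM
  }
}

The number of ways a backtracking matcher can split a string of n a's between the two branches grows like a Fibonacci sequence, so the reported times grow rapidly with n; a derivative-based matcher is unaffected because it never backtracks.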
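
Editorial note: the POSIX disambiguation rule reworded in the fourth hunk (longest initial substring, ties broken by priorities such as keywords over identifiers) can be pictured with the assumed two-token lexer fragment below. Java's engine is used here only for anchored prefix matching of each individual token regex; the longest-match and priority logic sits on top and is not the paper's algorithm.

import java.util.regex.Pattern

object PosixRuleDemo {

  // token classes in priority order: keywords before identifiers
  val tokens: List[(String, Pattern)] = List(
    "KEYWORD"    -> Pattern.compile("if|then|else"),
    "IDENTIFIER" -> Pattern.compile("[a-z][a-z0-9]*")
  )

  // longest initial match wins; on a tie the earlier (higher-priority) entry wins
  def posixToken(input: String): Option[(String, String)] = {
    val candidates = for {
      (name, pat) <- tokens
      m = pat.matcher(input)
      if m.lookingAt()               // anchored prefix match at the start of the input
    } yield (name, m.group())
    // stable sort by descending length keeps the priority order on ties
    candidates.sortBy { case (_, lexeme) => -lexeme.length }.headOption
  }

  def main(args: Array[String]): Unit = {
    println(posixToken("if x then y"))  // Some((KEYWORD,if))      -- tie, priority decides
    println(posixToken("iffy = 1"))     // Some((IDENTIFIER,iffy)) -- longest match decides
  }
}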