ChengsongTanPhdThesis/Chapters/Overview.tex
author Chengsong
Wed, 23 Aug 2023 03:02:31 +0100
changeset 668 3831621d7b14
permissions -rw-r--r--
added technical Overview section, almost done introduction
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     1
\chapter{Technical Overview}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     2
\label{Overview}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     3
We start with a technical overview of the 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     4
catastrophic backtracking problem,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     5
motivating rigorous approaches to regular expression matching and lexing.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     6
In doing so we also briefly
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     7
introduce common terminology such as
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     8
bounded repetitions and back-references.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
     9
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    10
\section{Motivating Examples}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    11
Consider for example the regular expression $(a^*)^*\,b$ and 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    12
strings of the form $aa..a$. These strings cannot be matched by this regular
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    13
expression: obviously the expected $b$ in the last
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    14
position is missing. One would assume that modern regular expression
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    15
matching engines can find this out very quickly. Surprisingly, if one tries
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    16
this example in JavaScript, Python or Java 8, even with small strings, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    17
say of lenght of around 30 $a$'s,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    18
the decision takes large amounts of time to finish.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    19
This is inproportional
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    20
to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    21
The algorithms clearly show exponential behaviour, and as can be seen
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    22
is triggered by some relatively simple regular expressions.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    23
Java 9 and newer
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    24
versions mitigates this behaviour by several magnitudes, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    25
but are still slow compared 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    26
with the approach we are going to use in this thesis.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    27
 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    28
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    29
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    30
This superlinear blowup in regular expression engines
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    31
has caused grief in ``real life'' where it is 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    32
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    33
A less serious example is a bug in the Atom editor:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    34
a user found out when writing the following code
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    35
\begin{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    36
   vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    37
\end{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    38
\begin{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    39
    && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() == 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    40
\end{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    41
\begin{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    42
    reflect.Slice
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    43
\end{verbatim}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    44
it took the editor half a hour to finish computing.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    45
This was subsequently fixed by Galbraith \cite{fixedAtom},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    46
and the issue was due to the regex engine of Atom.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    47
But evil regular expressions can be more than a nuisance in a text editor:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    48
on 20 July 2016 one evil
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    49
regular expression brought the webpage
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    50
\href{http://stackexchange.com}{Stack Exchange} to its
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    51
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    52
In this instance, a regular expression intended to just trim white
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    53
spaces from the beginning and the end of a line actually consumed
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    54
massive amounts of CPU resources---causing the web servers to grind to a
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    55
halt. In this example, the time needed to process
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    56
the string was 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    57
$O(n^2)$ with respect to the string length $n$. This
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    58
quadratic overhead was enough for the homepage of Stack Exchange to
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    59
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    60
attack and therefore stopped the servers from responding to any
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    61
requests. This made the whole site become unavailable. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    62
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    63
\begin{figure}[p]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    64
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    65
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    66
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    67
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    68
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    69
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    70
    ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    71
    enlargelimits=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    72
    xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    73
    xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    74
    ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    75
    ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    76
    scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    77
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    78
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    79
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    80
    legend entries={JavaScript},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    81
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    82
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    83
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    84
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    85
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    86
  &
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    87
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    88
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    89
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    90
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    91
    %ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    92
    enlargelimits=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    93
    xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    94
    xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    95
    ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    96
    ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    97
    scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    98
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
    99
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   100
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   101
    legend entries={Python},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   102
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   103
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   104
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   105
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   106
\end{tikzpicture}\\ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   107
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   108
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   109
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   110
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   111
    ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   112
    enlargelimits=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   113
    xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   114
    xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   115
    ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   116
    ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   117
    scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   118
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   119
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   120
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   121
    legend entries={Java 8},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   122
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   123
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   124
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   125
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   126
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   127
  &
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   128
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   129
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   130
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   131
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   132
    %ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   133
    enlargelimits=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   134
    xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   135
    xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   136
    ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   137
    ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   138
    scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   139
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   140
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   141
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   142
    legend entries={Dart},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   143
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   144
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   145
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   146
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   147
\end{tikzpicture}\\ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   148
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   149
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   150
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   151
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   152
    ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   153
    enlargelimits=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   154
    xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   155
    xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   156
    ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   157
    ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   158
    scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   159
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   160
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   161
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   162
    legend entries={Swift},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   163
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   164
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   165
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   166
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   167
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   168
  & 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   169
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   170
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   171
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   172
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   173
    %ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   174
    enlargelimits=true,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   175
    %xtick={0,5000,...,40000},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   176
    %xmax=40000,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   177
    %ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   178
    restrict x to domain*=0:40000,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   179
    restrict y to domain*=0:35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   180
    %ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   181
    %scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   182
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   183
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   184
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   185
    legend entries={Java9+},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   186
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   187
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   188
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   189
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   190
\end{tikzpicture}\\ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   191
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   192
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   193
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   194
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   195
    ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   196
    enlargelimits=true,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   197
    %xtick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   198
    %xmax=33,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   199
    %ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   200
    restrict x to domain*=0:60000,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   201
    restrict y to domain*=0:35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   202
    %ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   203
    %scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   204
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   205
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   206
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   207
    legend entries={Scala},  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   208
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   209
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   210
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   211
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   212
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   213
  & 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   214
\begin{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   215
\begin{axis}[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   216
    xlabel={$n$},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   217
    x label style={at={(1.05,-0.05)}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   218
    %ylabel={time in secs},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   219
    enlargelimits=true,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   220
    %xtick={0,5000,...,40000},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   221
    %xmax=40000,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   222
    %ymax=35,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   223
    restrict x to domain*=0:60000,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   224
    restrict y to domain*=0:45,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   225
    %ytick={0,5,...,30},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   226
    %scaled ticks=false,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   227
    axis lines=left,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   228
    width=5cm,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   229
    height=4cm, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   230
    legend style={cells={align=left}},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   231
    legend entries={Isabelle \\ Extracted},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   232
    legend pos=north west,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   233
    legend cell align=left]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   234
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   235
\end{axis}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   236
\end{tikzpicture}\\ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   237
\multicolumn{2}{c}{Graphs}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   238
\end{tabular}    
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   239
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   240
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   241
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   242
   The reason for their fast growth %superlinear behaviour 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   243
   is that they do a depth-first-search
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   244
   using NFAs.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   245
   If the string does not match, the regular expression matching
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   246
   engine starts to explore all possibilities. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   247
   The last two graphs are for our implementation in Scala, one manual
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   248
   and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   249
   Our lexer  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   250
   performs better in this case, and
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   251
   is formally verified.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   252
   Despite being almost identical, the codegen-generated lexer
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   253
   % is generated directly from Isabelle using $\textit{codegen}$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   254
   is slower than the manually written version since the code synthesised by
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   255
   $\textit{codegen}$ does not use native integer or string
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   256
   types of the target language.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   257
   %Note that we are comparing our \emph{lexer} against other languages' matchers.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   258
}\label{fig:aStarStarb}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   259
\end{figure}\afterpage{\clearpage}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   260
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   261
A more recent example is a global outage of all Cloudflare servers on 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   262
2 July 2019. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   263
The traffic Cloudflare services each day is more than
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   264
Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   265
it became completely unavailable for half an hour because of 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   266
a poorly-written regular expression roughly of the form $^*.^*=.^*$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   267
exhausted CPU resources that serve HTTP traffic. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   268
Although the outage
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   269
had several causes, at the heart was a regular expression that
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   270
was used to monitor network
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   271
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   272
These problems with regular expressions 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   273
are not isolated events that happen
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   274
very rarely, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   275
%but actually widespread.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   276
%They occur so often that they have a 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   277
but they occur actually often enough that they have a
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   278
name: Regular-Expression-Denial-Of-Service (ReDoS)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   279
attacks.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   280
Davis et al. \cite{Davis18} detected more
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   281
than 1000 evil regular expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   282
in Node.js, Python core libraries, npm and pypi. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   283
They therefore concluded that evil regular expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   284
are a real problem rather than just "a parlour trick".
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   285
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   286
The work in this thesis aims to address this issue
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   287
with the help of formal proofs.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   288
We describe a lexing algorithm based
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   289
on Brzozowski derivatives with verified correctness 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   290
and a finiteness property for the size of derivatives
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   291
(which are all done in Isabelle/HOL).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   292
Such properties %guarantee the absence of 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   293
are an important step in preventing
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   294
catastrophic backtracking once and for all.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   295
We will give more details in the next sections
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   296
on (i) why the slow cases in graph \ref{fig:aStarStarb}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   297
can occur in traditional regular expression engines
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   298
and (ii) why we choose our 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   299
approach based on Brzozowski derivatives and formal proofs.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   300
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   301
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   302
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   303
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   304
Regular expressions and regular expression matchers 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   305
have been studied for many years.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   306
Theoretical results in automata theory state 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   307
that basic regular expression matching should be linear
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   308
w.r.t the input.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   309
This assumes that the regular expression
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   310
$r$ was pre-processed and turned into a
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   311
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   312
By basic we mean textbook definitions such as the one
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   313
below, involving only regular expressions for characters, alternatives,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   314
sequences, and Kleene stars:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   315
\[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   316
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   317
\]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   318
Modern regular expression matchers used by programmers,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   319
however,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   320
support much richer constructs, such as bounded repetitions,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   321
negations,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   322
and back-references.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   323
To differentiate, we use the word \emph{regex} to refer
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   324
to those expressions with richer constructs while reserving the
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   325
term \emph{regular expression}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   326
for the more traditional meaning in formal languages theory.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   327
We follow this convention 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   328
in this thesis.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   329
In the future, we aim to support all the popular features of regexes, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   330
but for this work we mainly look at basic regular expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   331
and bounded repetitions.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   332
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   333
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   334
%Most modern regex libraries
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   335
%the so-called PCRE standard (Peral Compatible Regular Expressions)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   336
%has the back-references
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   337
Regexes come with a number of constructs
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   338
that make it more convenient for 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   339
programmers to write regular expressions.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   340
Depending on the types of constructs
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   341
the task of matching and lexing with them
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   342
will have different levels of complexity.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   343
Some of those constructs are syntactic sugars that are
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   344
simply short hand notations
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   345
that save the programmers a few keystrokes.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   346
These will not cause problems for regex libraries.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   347
For example the
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   348
non-binary alternative involving three or more choices just means:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   349
\[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   350
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   351
\]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   352
Similarly, the range operator
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   353
%used to express the alternative
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   354
%of all characters between its operands, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   355
is just a concise way
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   356
of expressing an alternative of consecutive characters:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   357
\[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   358
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   359
\]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   360
for an alternative. The
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   361
wildcard character '$.$' is used to refer to any single character,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   362
\[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   363
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   364
\]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   365
except the newline.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   366
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   367
\subsection{Bounded Repetitions}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   368
More interesting are bounded repetitions, which can 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   369
make the regular expressions much
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   370
more compact.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   371
Normally there are four kinds of bounded repetitions:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   372
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   373
(where $n$ and $m$ are constant natural numbers).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   374
Like the star regular expressions, the set of strings or language
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   375
a bounded regular expression can match
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   376
is defined using the power operation on sets:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   377
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   378
	\begin{tabular}{lcl}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   379
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   380
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   381
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   382
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   383
	\end{tabular}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   384
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   385
The attraction of bounded repetitions is that they can be
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   386
used to avoid a size blow up: for example $r^{\{n\}}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   387
is a shorthand for
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   388
the much longer regular expression:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   389
\[
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   390
	\underbrace{r\ldots r}_\text{n copies of r}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   391
\]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   392
%Therefore, a naive algorithm that simply unfolds
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   393
%them into their desugared forms
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   394
%will suffer from at least an exponential runtime increase.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   395
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   396
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   397
The problem with matching 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   398
such bounded repetitions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   399
is that tools based on the classic notion of
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   400
automata need to expand $r^{\{n\}}$ into $n$ connected 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   401
copies of the automaton for $r$. This leads to very inefficient matching
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   402
algorithms  or algorithms that consume large amounts of memory.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   403
Implementations using $\DFA$s will
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   404
in such situations
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   405
either become very slow 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   406
(for example Verbatim++ \cite{Verbatimpp}) or run
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   407
out of memory (for example $\mathit{LEX}$ and 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   408
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   409
in C and JAVA that generate $\mathit{DFA}$-based
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   410
lexers. The user provides a set of regular expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   411
and configurations, and then 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   412
gets an output program encoding a minimized $\mathit{DFA}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   413
that can be compiled and run. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   414
When given the above countdown regular expression,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   415
a small $n$ (say 20) would result in a program representing a
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   416
DFA
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   417
with millions of states.}) for large counters.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   418
A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   419
where the minimal DFA requires at least $2^{n+1}$ states.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   420
For example, when $n$ is equal to 2,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   421
the corresponding $\mathit{NFA}$ looks like:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   422
\vspace{6mm}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   423
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   424
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   425
   \node[state,initial] (q_0)   {$q_0$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   426
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   427
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   428
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   429
    \path[->] 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   430
    (q_0) edge  node {a} (q_1)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   431
    	  edge [loop below] node {a,b} ()
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   432
    (q_1) edge  node  {a,b} (q_2)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   433
    (q_2) edge  node  {a,b} (q_3);
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   434
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   435
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   436
and when turned into a DFA by the subset construction
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   437
requires at least $2^3$ states.\footnote{The 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   438
red states are "countdown states" which count down 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   439
the number of characters needed in addition to the current
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   440
string to make a successful match.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   441
For example, state $q_1$ indicates a match that has
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   442
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   443
and just consumed the "delimiter" $a$ in the middle, and 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   444
needs to match 2 more iterations of $(a|b)$ to complete.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   445
State $q_2$ on the other hand, can be viewed as a state
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   446
after $q_1$ has consumed 1 character, and just waits
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   447
for 1 more character to complete.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   448
The state $q_3$ is the last (accepting) state, requiring 0 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   449
more characters.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   450
Depending on the suffix of the
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   451
input string up to the current read location,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   452
the states $q_1$ and $q_2$, $q_3$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   453
may or may
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   454
not be active.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   455
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   456
contain at least $2^3$ non-equivalent states that cannot be merged, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   457
because the subset construction during determinisation will generate
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   458
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   459
Generalizing this to regular expressions with larger
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   460
bounded repetitions number, we have that
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   461
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   462
would require at least $2^{n+1}$ states, if $r$ itself contains
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   463
more than 1 string.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   464
This is to represent all different 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   465
scenarios in which "countdown" states are active.}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   466
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   467
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   468
Bounded repetitions are important because they
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   469
tend to occur frequently in practical use,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   470
for example in the regex library RegExLib, in
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   471
the rules library of Snort \cite{Snort1999}\footnote{
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   472
Snort is a network intrusion detection (NID) tool
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   473
for monitoring network traffic.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   474
The network security community curates a list
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   475
of malicious patterns written as regexes,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   476
which is used by Snort's detection engine
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   477
to match against network traffic for any hostile
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   478
activities such as buffer overflow attacks.}, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   479
as well as in XML Schema definitions (XSDs).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   480
According to Bj\"{o}rklund et al \cite{xml2015},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   481
more than half of the 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   482
XSDs they found on the Maven.org central repository
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   483
have bounded regular expressions in them.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   484
Often the counters are quite large, with the largest being
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   485
close to ten million. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   486
A smaller sample XSD they gave
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   487
is:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   488
\lstset{
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   489
	basicstyle=\fontsize{8.5}{9}\ttfamily,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   490
  language=XML,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   491
  morekeywords={encoding,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   492
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   493
}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   494
\begin{lstlisting}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   495
<sequence minOccurs="0" maxOccurs="65535">
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   496
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   497
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   498
</sequence>
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   499
\end{lstlisting}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   500
This can be seen as the regex
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   501
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   502
regular expressions 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   503
satisfying certain constraints (such as 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   504
satisfying the floating point number format).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   505
It is therefore quite unsatisfying that 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   506
some regular expressions matching libraries
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   507
impose adhoc limits
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   508
for bounded regular expressions:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   509
For example, in the regular expression matching library in the Go
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   510
language the regular expression $a^{1001}$ is not permitted, because no counter
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   511
can be above 1000, and in the built-in Rust regular expression library
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   512
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   513
for being too big~\footnote{Try it out here: https://rustexp.lpil.uk}. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   514
As Becchi and Crawley \cite{Becchi08}  have pointed out,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   515
the reason for these restrictions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   516
is that they simulate a non-deterministic finite
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   517
automata (NFA) with a breadth-first search.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   518
This way the number of active states could
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   519
be equal to the counter number.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   520
When the counters are large, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   521
the memory requirement could become
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   522
infeasible, and a regex engine
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   523
like in Go will reject this pattern straight away.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   524
\begin{figure}[H]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   525
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   526
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   527
 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   528
    	\node (q0) [state, initial] {$0$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   529
	\node (q1) [state, right = of q0] {$1$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   530
	%\node (q2) [state, right = of q1] {$2$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   531
	\node (qdots) [right = of q1] {$\ldots$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   532
	\node (qn) [state, right = of qdots] {$n$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   533
	\node (qn1) [state, right = of qn] {$n+1$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   534
	\node (qn2) [state, right = of qn1] {$n+2$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   535
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   536
 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   537
\path [-stealth, thick]
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   538
	(q0) edge [loop above] node {a} ()
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   539
    (q0) edge node {a}   (q1) 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   540
    %(q1) edge node {.}   (q2)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   541
    (q1) edge node {.}   (qdots)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   542
    (qdots) edge node {.} (qn)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   543
    (qn) edge node {.} (qn1)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   544
    (qn1) edge node {b} (qn2)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   545
    (qn2) edge node {$c$} (qn3);
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   546
\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   547
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   548
%   \node[state,initial] (q_0)   {$0$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   549
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   550
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   551
%   \node[state,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   552
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   553
%    \path[->] 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   554
%    (q_0) edge  node {a} (q_1)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   555
%    	  edge [loop below] node {a,b} ()
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   556
%    (q_1) edge  node  {a,b} (q_2)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   557
%    (q_2) edge  node  {a,b} (q_3);
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   558
%\end{tikzpicture}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   559
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   560
\caption{The example given by Becchi and Crawley
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   561
	that NFA simulation can consume large
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   562
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   563
	strings of the form $aaa\ldots aaaabc$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   564
	When traversing in a breadth-first manner,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   565
all states from 0 till $n+1$ will become active.}\label{fig:inj}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   566
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   567
\end{figure}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   568
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   569
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   570
%in terms of input string length.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   571
%TODO:try out these lexers
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   572
These problems can of course be solved in matching algorithms where 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   573
automata go beyond the classic notion and for instance include explicit
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   574
counters \cite{Turo_ov__2020}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   575
These solutions can be quite efficient,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   576
with the ability to process
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   577
gigabits of strings input per second
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   578
even with large counters \cite{Becchi08}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   579
These practical solutions do not come with
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   580
formal guarantees, and as pointed out by
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   581
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   582
%But formal reasoning about these automata especially in Isabelle 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   583
%can be challenging
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   584
%and un-intuitive. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   585
%Therefore, we take correctness and runtime claims made about these solutions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   586
%with a grain of salt.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   587
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   588
In the work reported in \cite{ITP2023} and here, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   589
we add better support using derivatives
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   590
for bounded regular expression $r^{\{n\}}$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   591
Our results
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   592
extend straightforwardly to
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   593
repetitions with intervals such as 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   594
$r^{\{n\ldots m\}}$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   595
The merit of Brzozowski derivatives (more on this later)
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   596
on this problem is that
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   597
it can be naturally extended to support bounded repetitions.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   598
Moreover these extensions are still made up of only small
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   599
inductive datatypes and recursive functions,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   600
making it handy to deal with them in theorem provers.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   601
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   602
%straightforwardly extended to deal with bounded regular expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   603
%and moreover the resulting code still consists of only simple
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   604
%recursive functions and inductive datatypes.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   605
Finally, bounded regular expressions do not destroy our finite
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   606
boundedness property, which we shall prove later on.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   607
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   608
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   609
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   610
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   611
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   612
\subsection{Back-References}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   613
The other way to simulate an $\mathit{NFA}$ for matching is choosing  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   614
a single transition each time, keeping all the other options in 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   615
a queue or stack, and backtracking if that choice eventually 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   616
fails. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   617
This method, often called a  "depth-first-search", 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   618
is efficient in many cases, but could end up
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   619
with exponential run time.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   620
The backtracking method is employed in regex libraries
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   621
that support \emph{back-references}, for example
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   622
in Java and Python.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   623
%\section{Back-references and The Terminology Regex}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   624
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   625
%When one constructs an $\NFA$ out of a regular expression
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   626
%there is often very little to be done in the first phase, one simply 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   627
%construct the $\NFA$ states based on the structure of the input regular expression.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   628
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   629
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   630
%one by keeping track of all active states after consuming 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   631
%a character, and update that set of states iteratively.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   632
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   633
%for a path terminating
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   634
%at an accepting state.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   635
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   636
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   637
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   638
Consider the following regular expression where the sequence
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   639
operator is omitted for brevity:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   640
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   641
	$r_1r_2r_3r_4$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   642
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   643
In this example,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   644
one could label sub-expressions of interest 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   645
by parenthesizing them and giving 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   646
them a number in the order in which their opening parentheses appear.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   647
One possible way of parenthesizing and labelling is: 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   648
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   649
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   650
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   651
The sub-expressions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   652
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   653
by 1 to 4, and can be ``referred back'' by their respective numbers. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   654
%These sub-expressions are called "capturing groups".
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   655
To do so, one uses the syntax $\backslash i$ 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   656
to denote that we want the sub-string 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   657
of the input matched by the i-th
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   658
sub-expression to appear again, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   659
exactly the same as it first appeared: 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   660
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   661
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   662
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   663
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   664
Once the sub-string $s_i$ for the sub-expression $r_i$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   665
has been fixed, there is no variability on what the back-reference
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   666
label $\backslash i$ can be---it is tied to $s_i$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   667
The overall string will look like $\ldots s_i \ldots s_i \ldots $
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   668
%The backslash and number $i$ are the
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   669
%so-called "back-references".
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   670
%Let $e$ be an expression made of regular expressions 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   671
%and back-references. $e$ contains the expression $e_i$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   672
%as its $i$-th capturing group.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   673
%The semantics of back-reference can be recursively
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   674
%written as:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   675
%\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   676
%	\begin{tabular}{c}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   677
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   678
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   679
%	\end{tabular}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   680
%\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   681
A concrete example
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   682
for back-references is
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   683
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   684
$(.^*)\backslash 1$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   685
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   686
which matches
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   687
strings that can be split into two identical halves,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   688
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   689
Note that this is different from 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   690
repeating the  sub-expression verbatim like
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   691
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   692
	$(.^*)(.^*)$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   693
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   694
which does not impose any restrictions on what strings the second 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   695
sub-expression $.^*$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   696
might match.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   697
Another example for back-references is
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   698
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   699
$(.)(.)\backslash 2\backslash 1$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   700
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   701
which matches four-character palindromes
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   702
like $abba$, $x??x$ and so on.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   703
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   704
Back-references are a regex construct 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   705
that programmers find quite useful.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   706
According to Becchi and Crawley \cite{Becchi08},
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   707
6\% of Snort rules (up until 2008) use them.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   708
The most common use of back-references
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   709
is to express well-formed html files,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   710
where back-references are convenient for matching
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   711
opening and closing tags like 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   712
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   713
	$\langle html \rangle \ldots \langle / html \rangle$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   714
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   715
A regex describing such a format
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   716
is
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   717
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   718
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   719
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   720
Despite being useful, the expressive power of regexes 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   721
go beyond regular languages 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   722
once back-references are included.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   723
In fact, they allow the regex construct to express 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   724
languages that cannot be contained in context-free
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   725
languages either.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   726
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   727
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   728
which cannot be expressed by 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   729
context-free grammars \cite{campeanu2003formal}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   730
Such a language is contained in the context-sensitive hierarchy
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   731
of formal languages. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   732
Also solving the matching problem involving back-references
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   733
is known to be NP-complete \parencite{alfred2014algorithms}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   734
Regex libraries supporting back-references such as 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   735
PCRE \cite{pcre} therefore have to
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   736
revert to a depth-first search algorithm including backtracking.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   737
What is unexpected is that even in the cases 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   738
not involving back-references, there is still
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   739
a (non-negligible) chance they might backtrack super-linearly,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   740
as shown in the graphs in figure \ref{fig:aStarStarb}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   741
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   742
Summing up, we can categorise existing 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   743
practical regex libraries into two kinds:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   744
(i) The ones  with  linear
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   745
time guarantees like Go and Rust. The downside with them is that
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   746
they impose restrictions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   747
on the regular expressions (not allowing back-references, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   748
bounded repetitions cannot exceed an ad hoc limit etc.).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   749
And (ii) those 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   750
that allow large bounded regular expressions and back-references
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   751
at the expense of using backtracking algorithms.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   752
They can potentially ``grind to a halt''
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   753
on some very simple cases, resulting 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   754
ReDoS attacks if exposed to the internet.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   755
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   756
The problems with both approaches are the motivation for us 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   757
to look again at the regular expression matching problem. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   758
Another motivation is that regular expression matching algorithms
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   759
that follow the POSIX standard often contain errors and bugs 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   760
as we shall explain next.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   761
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   762
%We would like to have regex engines that can 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   763
%deal with the regular part (e.g.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   764
%bounded repetitions) of regexes more
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   765
%efficiently.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   766
%Also we want to make sure that they do it correctly.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   767
%It turns out that such aim is not so easy to achieve.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   768
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   769
% For example, the Rust regex engine claims to be linear, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   770
% but does not support lookarounds and back-references.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   771
% The GoLang regex library does not support over 1000 repetitions.  
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   772
% Java and Python both support back-references, but shows
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   773
%catastrophic backtracking behaviours on inputs without back-references(
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   774
%when the language is still regular).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   775
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   776
 %TODO: verify the fact Rust does not allow 1000+ reps
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   777
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   778
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   779
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   780
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   781
%The time cost of regex matching algorithms in general
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   782
%involve two different phases, and different things can go differently wrong on 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   783
%these phases.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   784
%$\DFA$s usually have problems in the first (construction) phase
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   785
%, whereas $\NFA$s usually run into trouble
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   786
%on the second phase.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   787
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   788
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   789
\section{Error-prone POSIX Implementations}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   790
Very often there are multiple ways of matching a string
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   791
with a regular expression.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   792
In such cases the regular expressions matcher needs to
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   793
disambiguate.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   794
The more widely used strategy is called POSIX,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   795
which roughly speaking always chooses the longest initial match.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   796
The POSIX strategy is widely adopted in many regular expression matchers
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   797
because it is a natural strategy for lexers.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   798
However, many implementations (including the C libraries
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   799
used by Linux and OS X distributions) contain bugs
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   800
or do not meet the specification they claim to adhere to.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   801
Kuklewicz maintains a unit test repository which lists some
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   802
problems with existing regular expression engines \cite{KuklewiczHaskell}.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   803
In some cases, they either fail to generate a
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   804
result when there exists a match,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   805
or give results that are inconsistent with the POSIX standard.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   806
A concrete example is the regex:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   807
\begin{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   808
	$(aba + ab + a)^* \text{and the string} \; ababa$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   809
\end{center}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   810
The correct POSIX match for the above
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   811
involves the entire string $ababa$, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   812
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   813
$[0, 2), [2, 5)$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   814
respectively.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   815
But feeding this example to the different engines
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   816
listed at regex101 \footnote{
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   817
	regex101 is an online regular expression matcher which
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   818
	provides API for trying out regular expression
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   819
	engines of multiple popular programming languages like
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   820
Java, Python, Go, etc.} \parencite{regex101}. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   821
yields
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   822
only two incomplete matches: $[aba]$ at $[0, 3)$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   823
and $a$ at $[4, 5)$.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   824
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   825
commented that most regex libraries are not
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   826
correctly implementing the central POSIX
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   827
rule, called the maximum munch rule.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   828
Grathwohl \parencite{grathwohl2014crash} wrote:
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   829
\begin{quote}\it
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   830
	``The POSIX strategy is more complicated than the 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   831
	greedy because of the dependence on information about 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   832
	the length of matched strings in the various subexpressions.''
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   833
\end{quote}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   834
%\noindent
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   835
People have recognised that the implementation complexity of POSIX rules also come from
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   836
the specification being not very precise.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   837
The developers of the regexp package of Go 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   838
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   839
commented that
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   840
\begin{quote}\it
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   841
``
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   842
The POSIX rule is computationally prohibitive
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   843
and not even well-defined.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   844
``
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   845
\end{quote}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   846
There are many informal summaries of this disambiguation
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   847
strategy, which are often quite long and delicate.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   848
For example Kuklewicz \cite{KuklewiczHaskell} 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   849
described the POSIX rule as (section 1, last paragraph):
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   850
\begin{quote}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   851
	\begin{itemize}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   852
		\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   853
regular expressions (REs) take the leftmost starting match, and the longest match starting there
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   854
earlier subpatterns have leftmost-longest priority over later subpatterns\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   855
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   856
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   857
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   858
REs have right associative concatenation which can be changed with parenthesis\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   859
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   860
parenthesized subexpressions return the match from their last usage\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   861
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   862
text of component subexpressions must be contained in the text of the 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   863
higher-level subexpressions\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   864
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   865
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   866
\item
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   867
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   868
\end{itemize}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   869
\end{quote}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   870
%The text above 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   871
%is trying to capture something very precise,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents:
diff changeset
   872
%and is crying out for formalising.