ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Tue, 25 Jul 2023 17:28:29 +0100
changeset 667 660cf698eb26
parent 666 6da4516ea87d
child 668 3831621d7b14
permissions -rwxr-xr-x
added example of how inj and lexer works
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    10
We start with a technical overview of the catastrophic backtracking problem,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    11
motivating rigorous approaches to regular expression matching and lexing.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    12
Then we introduce basic notations and definitions of our problem.
649
Chengsong
parents: 646
diff changeset
    13
Chengsong
parents: 646
diff changeset
    14
\section{Technical Overview}
Chengsong
parents: 646
diff changeset
    15
Consider for example the regular expression $(a^*)^*\,b$ and 
Chengsong
parents: 646
diff changeset
    16
strings of the form $aa..a$. These strings cannot be matched by this regular
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    17
expression: obviously the expected $b$ in the last
649
Chengsong
parents: 646
diff changeset
    18
position is missing. One would assume that modern regular expression
Chengsong
parents: 646
diff changeset
    19
matching engines can find this out very quickly. Surprisingly, if one tries
Chengsong
parents: 646
diff changeset
    20
this example in JavaScript, Python or Java 8, even with small strings, 
Chengsong
parents: 646
diff changeset
    21
say of lenght of around 30 $a$'s,
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    22
the decision takes large amounts of time to finish.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    23
This is inproportional
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    24
to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
649
Chengsong
parents: 646
diff changeset
    25
The algorithms clearly show exponential behaviour, and as can be seen
Chengsong
parents: 646
diff changeset
    26
is triggered by some relatively simple regular expressions.
Chengsong
parents: 646
diff changeset
    27
Java 9 and newer
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    28
versions mitigates this behaviour by several magnitudes, 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
    29
but are still slow compared 
649
Chengsong
parents: 646
diff changeset
    30
with the approach we are going to use in this thesis.
Chengsong
parents: 646
diff changeset
    31
Chengsong
parents: 646
diff changeset
    32
Chengsong
parents: 646
diff changeset
    33
Chengsong
parents: 646
diff changeset
    34
This superlinear blowup in regular expression engines
Chengsong
parents: 646
diff changeset
    35
has caused grief in ``real life'' where it is 
Chengsong
parents: 646
diff changeset
    36
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
Chengsong
parents: 646
diff changeset
    37
For example, on 20 July 2016 one evil
Chengsong
parents: 646
diff changeset
    38
regular expression brought the webpage
Chengsong
parents: 646
diff changeset
    39
\href{http://stackexchange.com}{Stack Exchange} to its
Chengsong
parents: 646
diff changeset
    40
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
Chengsong
parents: 646
diff changeset
    41
In this instance, a regular expression intended to just trim white
Chengsong
parents: 646
diff changeset
    42
spaces from the beginning and the end of a line actually consumed
Chengsong
parents: 646
diff changeset
    43
massive amounts of CPU resources---causing the web servers to grind to a
Chengsong
parents: 646
diff changeset
    44
halt. In this example, the time needed to process
Chengsong
parents: 646
diff changeset
    45
the string was 
Chengsong
parents: 646
diff changeset
    46
$O(n^2)$ with respect to the string length $n$. This
Chengsong
parents: 646
diff changeset
    47
quadratic overhead was enough for the homepage of Stack Exchange to
Chengsong
parents: 646
diff changeset
    48
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
Chengsong
parents: 646
diff changeset
    49
attack and therefore stopped the servers from responding to any
Chengsong
parents: 646
diff changeset
    50
requests. This made the whole site become unavailable. 
Chengsong
parents: 646
diff changeset
    51
Chengsong
parents: 646
diff changeset
    52
\begin{figure}[p]
Chengsong
parents: 646
diff changeset
    53
\begin{center}
Chengsong
parents: 646
diff changeset
    54
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
Chengsong
parents: 646
diff changeset
    55
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    56
\begin{axis}[
Chengsong
parents: 646
diff changeset
    57
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    58
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    59
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    60
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    61
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    62
    xmax=33,
Chengsong
parents: 646
diff changeset
    63
    ymax=35,
Chengsong
parents: 646
diff changeset
    64
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    65
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    66
    axis lines=left,
Chengsong
parents: 646
diff changeset
    67
    width=5cm,
Chengsong
parents: 646
diff changeset
    68
    height=4cm, 
Chengsong
parents: 646
diff changeset
    69
    legend entries={JavaScript},  
Chengsong
parents: 646
diff changeset
    70
    legend pos=north west,
Chengsong
parents: 646
diff changeset
    71
    legend cell align=left]
Chengsong
parents: 646
diff changeset
    72
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
Chengsong
parents: 646
diff changeset
    73
\end{axis}
Chengsong
parents: 646
diff changeset
    74
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
    75
  &
Chengsong
parents: 646
diff changeset
    76
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    77
\begin{axis}[
Chengsong
parents: 646
diff changeset
    78
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    79
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    80
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    81
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    82
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    83
    xmax=33,
Chengsong
parents: 646
diff changeset
    84
    ymax=35,
Chengsong
parents: 646
diff changeset
    85
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    86
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    87
    axis lines=left,
Chengsong
parents: 646
diff changeset
    88
    width=5cm,
Chengsong
parents: 646
diff changeset
    89
    height=4cm, 
Chengsong
parents: 646
diff changeset
    90
    legend entries={Python},  
Chengsong
parents: 646
diff changeset
    91
    legend pos=north west,
Chengsong
parents: 646
diff changeset
    92
    legend cell align=left]
Chengsong
parents: 646
diff changeset
    93
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
Chengsong
parents: 646
diff changeset
    94
\end{axis}
Chengsong
parents: 646
diff changeset
    95
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
    96
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    97
\begin{axis}[
Chengsong
parents: 646
diff changeset
    98
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    99
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   100
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   101
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   102
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   103
    xmax=33,
Chengsong
parents: 646
diff changeset
   104
    ymax=35,
Chengsong
parents: 646
diff changeset
   105
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   106
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   107
    axis lines=left,
Chengsong
parents: 646
diff changeset
   108
    width=5cm,
Chengsong
parents: 646
diff changeset
   109
    height=4cm, 
Chengsong
parents: 646
diff changeset
   110
    legend entries={Java 8},  
Chengsong
parents: 646
diff changeset
   111
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   112
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   113
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
Chengsong
parents: 646
diff changeset
   114
\end{axis}
Chengsong
parents: 646
diff changeset
   115
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   116
  &
Chengsong
parents: 646
diff changeset
   117
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   118
\begin{axis}[
Chengsong
parents: 646
diff changeset
   119
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   120
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   121
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   122
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   123
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   124
    xmax=33,
Chengsong
parents: 646
diff changeset
   125
    ymax=35,
Chengsong
parents: 646
diff changeset
   126
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   127
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   128
    axis lines=left,
Chengsong
parents: 646
diff changeset
   129
    width=5cm,
Chengsong
parents: 646
diff changeset
   130
    height=4cm, 
Chengsong
parents: 646
diff changeset
   131
    legend entries={Dart},  
Chengsong
parents: 646
diff changeset
   132
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   133
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   134
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
Chengsong
parents: 646
diff changeset
   135
\end{axis}
Chengsong
parents: 646
diff changeset
   136
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   137
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   138
\begin{axis}[
Chengsong
parents: 646
diff changeset
   139
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   140
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   141
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   142
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   143
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   144
    xmax=33,
Chengsong
parents: 646
diff changeset
   145
    ymax=35,
Chengsong
parents: 646
diff changeset
   146
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   147
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   148
    axis lines=left,
Chengsong
parents: 646
diff changeset
   149
    width=5cm,
Chengsong
parents: 646
diff changeset
   150
    height=4cm, 
Chengsong
parents: 646
diff changeset
   151
    legend entries={Swift},  
Chengsong
parents: 646
diff changeset
   152
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   153
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   154
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
Chengsong
parents: 646
diff changeset
   155
\end{axis}
Chengsong
parents: 646
diff changeset
   156
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   157
  & 
Chengsong
parents: 646
diff changeset
   158
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   159
\begin{axis}[
Chengsong
parents: 646
diff changeset
   160
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   161
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   162
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   163
    enlargelimits=true,
Chengsong
parents: 646
diff changeset
   164
    %xtick={0,5000,...,40000},
Chengsong
parents: 646
diff changeset
   165
    %xmax=40000,
Chengsong
parents: 646
diff changeset
   166
    %ymax=35,
Chengsong
parents: 646
diff changeset
   167
    restrict x to domain*=0:40000,
Chengsong
parents: 646
diff changeset
   168
    restrict y to domain*=0:35,
Chengsong
parents: 646
diff changeset
   169
    %ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   170
    %scaled ticks=false,
Chengsong
parents: 646
diff changeset
   171
    axis lines=left,
Chengsong
parents: 646
diff changeset
   172
    width=5cm,
Chengsong
parents: 646
diff changeset
   173
    height=4cm, 
Chengsong
parents: 646
diff changeset
   174
    legend entries={Java9+},  
Chengsong
parents: 646
diff changeset
   175
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   176
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   177
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
Chengsong
parents: 646
diff changeset
   178
\end{axis}
Chengsong
parents: 646
diff changeset
   179
\end{tikzpicture}\\ 
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   180
\begin{tikzpicture}
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   181
\begin{axis}[
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   182
    xlabel={$n$},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   183
    x label style={at={(1.05,-0.05)}},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   184
    ylabel={time in secs},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   185
    enlargelimits=true,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   186
    %xtick={0,5,...,30},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   187
    %xmax=33,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   188
    %ymax=35,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   189
    restrict x to domain*=0:60000,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   190
    restrict y to domain*=0:35,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   191
    %ytick={0,5,...,30},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   192
    %scaled ticks=false,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   193
    axis lines=left,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   194
    width=5cm,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   195
    height=4cm, 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   196
    legend entries={Scala},  
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   197
    legend pos=north west,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   198
    legend cell align=left]
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   199
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   200
\end{axis}
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   201
\end{tikzpicture}
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   202
  & 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   203
\begin{tikzpicture}
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   204
\begin{axis}[
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   205
    xlabel={$n$},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   206
    x label style={at={(1.05,-0.05)}},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   207
    %ylabel={time in secs},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   208
    enlargelimits=true,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   209
    %xtick={0,5000,...,40000},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   210
    %xmax=40000,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   211
    %ymax=35,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   212
    restrict x to domain*=0:60000,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   213
    restrict y to domain*=0:45,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   214
    %ytick={0,5,...,30},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   215
    %scaled ticks=false,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   216
    axis lines=left,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   217
    width=5cm,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   218
    height=4cm, 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   219
    legend style={cells={align=left}},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   220
    legend entries={Isabelle \\ Extracted},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   221
    legend pos=north west,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   222
    legend cell align=left]
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   223
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   224
\end{axis}
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   225
\end{tikzpicture}\\ 
649
Chengsong
parents: 646
diff changeset
   226
\multicolumn{2}{c}{Graphs}
Chengsong
parents: 646
diff changeset
   227
\end{tabular}    
Chengsong
parents: 646
diff changeset
   228
\end{center}
Chengsong
parents: 646
diff changeset
   229
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
Chengsong
parents: 646
diff changeset
   230
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   231
   The reason for their fast growth %superlinear behaviour 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   232
   is that they do a depth-first-search
649
Chengsong
parents: 646
diff changeset
   233
   using NFAs.
Chengsong
parents: 646
diff changeset
   234
   If the string does not match, the regular expression matching
Chengsong
parents: 646
diff changeset
   235
   engine starts to explore all possibilities. 
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   236
   The last two graphs are for our implementation in Scala, one manual
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   237
   and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   238
   Our lexer  
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   239
   performs better in this case, and
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   240
   is formally verified.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   241
   Despite being almost identical, the codegen-generated lexer
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   242
   % is generated directly from Isabelle using $\textit{codegen}$,
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   243
   is slower than the manually written version since the code synthesised by
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   244
   $\textit{codegen}$ does not use native integer or string
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   245
   types of the target language.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   246
   %Note that we are comparing our \emph{lexer} against other languages' matchers.
649
Chengsong
parents: 646
diff changeset
   247
}\label{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   248
\end{figure}\afterpage{\clearpage}
Chengsong
parents: 646
diff changeset
   249
Chengsong
parents: 646
diff changeset
   250
A more recent example is a global outage of all Cloudflare servers on 2 July
Chengsong
parents: 646
diff changeset
   251
2019. A poorly written regular expression exhibited catastrophic backtracking
Chengsong
parents: 646
diff changeset
   252
and exhausted CPUs that serve HTTP traffic. Although the outage
Chengsong
parents: 646
diff changeset
   253
had several causes, at the heart was a regular expression that
Chengsong
parents: 646
diff changeset
   254
was used to monitor network
Chengsong
parents: 646
diff changeset
   255
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
Chengsong
parents: 646
diff changeset
   256
These problems with regular expressions 
Chengsong
parents: 646
diff changeset
   257
are not isolated events that happen
Chengsong
parents: 646
diff changeset
   258
very rarely, 
Chengsong
parents: 646
diff changeset
   259
%but actually widespread.
Chengsong
parents: 646
diff changeset
   260
%They occur so often that they have a 
Chengsong
parents: 646
diff changeset
   261
but they occur actually often enough that they have a
Chengsong
parents: 646
diff changeset
   262
name: Regular-Expression-Denial-Of-Service (ReDoS)
Chengsong
parents: 646
diff changeset
   263
attacks.
Chengsong
parents: 646
diff changeset
   264
Davis et al. \cite{Davis18} detected more
Chengsong
parents: 646
diff changeset
   265
than 1000 evil regular expressions
Chengsong
parents: 646
diff changeset
   266
in Node.js, Python core libraries, npm and pypi. 
Chengsong
parents: 646
diff changeset
   267
They therefore concluded that evil regular expressions
Chengsong
parents: 646
diff changeset
   268
are a real problem rather than just "a parlour trick".
Chengsong
parents: 646
diff changeset
   269
Chengsong
parents: 646
diff changeset
   270
The work in this thesis aims to address this issue
Chengsong
parents: 646
diff changeset
   271
with the help of formal proofs.
Chengsong
parents: 646
diff changeset
   272
We describe a lexing algorithm based
Chengsong
parents: 646
diff changeset
   273
on Brzozowski derivatives with verified correctness 
Chengsong
parents: 646
diff changeset
   274
and a finiteness property for the size of derivatives
Chengsong
parents: 646
diff changeset
   275
(which are all done in Isabelle/HOL).
Chengsong
parents: 646
diff changeset
   276
Such properties %guarantee the absence of 
Chengsong
parents: 646
diff changeset
   277
are an important step in preventing
Chengsong
parents: 646
diff changeset
   278
catastrophic backtracking once and for all.
Chengsong
parents: 646
diff changeset
   279
We will give more details in the next sections
Chengsong
parents: 646
diff changeset
   280
on (i) why the slow cases in graph \ref{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   281
can occur in traditional regular expression engines
Chengsong
parents: 646
diff changeset
   282
and (ii) why we choose our 
Chengsong
parents: 646
diff changeset
   283
approach based on Brzozowski derivatives and formal proofs.
Chengsong
parents: 646
diff changeset
   284
Chengsong
parents: 646
diff changeset
   285
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   286
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   287
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   288
In this chapter, we define the basic notions 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   289
for regular languages and regular expressions.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   290
This is essentially a description in ``English''
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   291
the functions and datatypes of our formalisation in Isabelle/HOL.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   292
We also define what $\POSIX$ lexing means, 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   293
followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   294
that produces the output conforming
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   295
to the $\POSIX$ standard. 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   296
This is a preliminary chapter which describes the results of
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   297
Sulzmann and Lu and Ausaf et al \ref{AusafDyckhoffUrban2016},
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   298
but the proof details and function definitions are needed to motivate our work
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   299
as we show in chapter \ref{Bitcoded2} how the proofs break down when
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   300
simplifications are applied.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   301
%TODO: Actually show this in chapter 4.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   302
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   303
In what follows 
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   304
we choose to use the Isabelle-style notation
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   305
for function and datatype constructor applications, where
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   306
the parameters of a function are not enclosed
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   307
inside a pair of parentheses (e.g. $f \;x \;y$
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   308
instead of $f(x,\;y)$). This is mainly
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   309
to make the text visually more concise.
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   310
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   311
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   312
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   313
649
Chengsong
parents: 646
diff changeset
   314
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Chengsong
parents: 646
diff changeset
   315
Regular expressions and regular expression matchers 
Chengsong
parents: 646
diff changeset
   316
have clearly been studied for many, many years.
Chengsong
parents: 646
diff changeset
   317
Theoretical results in automata theory state 
Chengsong
parents: 646
diff changeset
   318
that basic regular expression matching should be linear
Chengsong
parents: 646
diff changeset
   319
w.r.t the input.
Chengsong
parents: 646
diff changeset
   320
This assumes that the regular expression
Chengsong
parents: 646
diff changeset
   321
$r$ was pre-processed and turned into a
Chengsong
parents: 646
diff changeset
   322
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
Chengsong
parents: 646
diff changeset
   323
By basic we mean textbook definitions such as the one
Chengsong
parents: 646
diff changeset
   324
below, involving only regular expressions for characters, alternatives,
Chengsong
parents: 646
diff changeset
   325
sequences, and Kleene stars:
Chengsong
parents: 646
diff changeset
   326
\[
Chengsong
parents: 646
diff changeset
   327
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
Chengsong
parents: 646
diff changeset
   328
\]
Chengsong
parents: 646
diff changeset
   329
Modern regular expression matchers used by programmers,
Chengsong
parents: 646
diff changeset
   330
however,
Chengsong
parents: 646
diff changeset
   331
support much richer constructs, such as bounded repetitions,
Chengsong
parents: 646
diff changeset
   332
negations,
Chengsong
parents: 646
diff changeset
   333
and back-references.
Chengsong
parents: 646
diff changeset
   334
To differentiate, we use the word \emph{regex} to refer
Chengsong
parents: 646
diff changeset
   335
to those expressions with richer constructs while reserving the
Chengsong
parents: 646
diff changeset
   336
term \emph{regular expression}
Chengsong
parents: 646
diff changeset
   337
for the more traditional meaning in formal languages theory.
Chengsong
parents: 646
diff changeset
   338
We follow this convention 
Chengsong
parents: 646
diff changeset
   339
in this thesis.
Chengsong
parents: 646
diff changeset
   340
In the future, we aim to support all the popular features of regexes, 
Chengsong
parents: 646
diff changeset
   341
but for this work we mainly look at basic regular expressions
Chengsong
parents: 646
diff changeset
   342
and bounded repetitions.
Chengsong
parents: 646
diff changeset
   343
Chengsong
parents: 646
diff changeset
   344
Chengsong
parents: 646
diff changeset
   345
Chengsong
parents: 646
diff changeset
   346
%Most modern regex libraries
Chengsong
parents: 646
diff changeset
   347
%the so-called PCRE standard (Peral Compatible Regular Expressions)
Chengsong
parents: 646
diff changeset
   348
%has the back-references
Chengsong
parents: 646
diff changeset
   349
Regexes come with a number of constructs
Chengsong
parents: 646
diff changeset
   350
that make it more convenient for 
Chengsong
parents: 646
diff changeset
   351
programmers to write regular expressions.
Chengsong
parents: 646
diff changeset
   352
Depending on the types of constructs
Chengsong
parents: 646
diff changeset
   353
the task of matching and lexing with them
Chengsong
parents: 646
diff changeset
   354
will have different levels of complexity.
Chengsong
parents: 646
diff changeset
   355
Some of those constructs are syntactic sugars that are
Chengsong
parents: 646
diff changeset
   356
simply short hand notations
Chengsong
parents: 646
diff changeset
   357
that save the programmers a few keystrokes.
Chengsong
parents: 646
diff changeset
   358
These will not cause problems for regex libraries.
Chengsong
parents: 646
diff changeset
   359
For example the
Chengsong
parents: 646
diff changeset
   360
non-binary alternative involving three or more choices just means:
Chengsong
parents: 646
diff changeset
   361
\[
Chengsong
parents: 646
diff changeset
   362
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
Chengsong
parents: 646
diff changeset
   363
\]
Chengsong
parents: 646
diff changeset
   364
Similarly, the range operator
Chengsong
parents: 646
diff changeset
   365
%used to express the alternative
Chengsong
parents: 646
diff changeset
   366
%of all characters between its operands, 
Chengsong
parents: 646
diff changeset
   367
is just a concise way
Chengsong
parents: 646
diff changeset
   368
of expressing an alternative of consecutive characters:
Chengsong
parents: 646
diff changeset
   369
\[
Chengsong
parents: 646
diff changeset
   370
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
Chengsong
parents: 646
diff changeset
   371
\]
Chengsong
parents: 646
diff changeset
   372
for an alternative. The
Chengsong
parents: 646
diff changeset
   373
wildcard character '$.$' is used to refer to any single character,
Chengsong
parents: 646
diff changeset
   374
\[
Chengsong
parents: 646
diff changeset
   375
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
Chengsong
parents: 646
diff changeset
   376
\]
Chengsong
parents: 646
diff changeset
   377
except the newline.
Chengsong
parents: 646
diff changeset
   378
Chengsong
parents: 646
diff changeset
   379
\subsection{Bounded Repetitions}
Chengsong
parents: 646
diff changeset
   380
More interesting are bounded repetitions, which can 
Chengsong
parents: 646
diff changeset
   381
make the regular expressions much
Chengsong
parents: 646
diff changeset
   382
more compact.
Chengsong
parents: 646
diff changeset
   383
Normally there are four kinds of bounded repetitions:
Chengsong
parents: 646
diff changeset
   384
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
Chengsong
parents: 646
diff changeset
   385
(where $n$ and $m$ are constant natural numbers).
Chengsong
parents: 646
diff changeset
   386
Like the star regular expressions, the set of strings or language
Chengsong
parents: 646
diff changeset
   387
a bounded regular expression can match
Chengsong
parents: 646
diff changeset
   388
is defined using the power operation on sets:
Chengsong
parents: 646
diff changeset
   389
\begin{center}
Chengsong
parents: 646
diff changeset
   390
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   391
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
Chengsong
parents: 646
diff changeset
   392
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   393
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   394
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
Chengsong
parents: 646
diff changeset
   395
	\end{tabular}
Chengsong
parents: 646
diff changeset
   396
\end{center}
Chengsong
parents: 646
diff changeset
   397
The attraction of bounded repetitions is that they can be
Chengsong
parents: 646
diff changeset
   398
used to avoid a size blow up: for example $r^{\{n\}}$
Chengsong
parents: 646
diff changeset
   399
is a shorthand for
Chengsong
parents: 646
diff changeset
   400
the much longer regular expression:
Chengsong
parents: 646
diff changeset
   401
\[
Chengsong
parents: 646
diff changeset
   402
	\underbrace{r\ldots r}_\text{n copies of r}.
Chengsong
parents: 646
diff changeset
   403
\]
Chengsong
parents: 646
diff changeset
   404
%Therefore, a naive algorithm that simply unfolds
Chengsong
parents: 646
diff changeset
   405
%them into their desugared forms
Chengsong
parents: 646
diff changeset
   406
%will suffer from at least an exponential runtime increase.
Chengsong
parents: 646
diff changeset
   407
Chengsong
parents: 646
diff changeset
   408
Chengsong
parents: 646
diff changeset
   409
The problem with matching 
Chengsong
parents: 646
diff changeset
   410
such bounded repetitions
Chengsong
parents: 646
diff changeset
   411
is that tools based on the classic notion of
Chengsong
parents: 646
diff changeset
   412
automata need to expand $r^{\{n\}}$ into $n$ connected 
Chengsong
parents: 646
diff changeset
   413
copies of the automaton for $r$. This leads to very inefficient matching
Chengsong
parents: 646
diff changeset
   414
algorithms  or algorithms that consume large amounts of memory.
Chengsong
parents: 646
diff changeset
   415
Implementations using $\DFA$s will
Chengsong
parents: 646
diff changeset
   416
in such situations
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   417
either become very slow 
649
Chengsong
parents: 646
diff changeset
   418
(for example Verbatim++ \cite{Verbatimpp}) or run
Chengsong
parents: 646
diff changeset
   419
out of memory (for example $\mathit{LEX}$ and 
Chengsong
parents: 646
diff changeset
   420
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
Chengsong
parents: 646
diff changeset
   421
in C and JAVA that generate $\mathit{DFA}$-based
Chengsong
parents: 646
diff changeset
   422
lexers. The user provides a set of regular expressions
Chengsong
parents: 646
diff changeset
   423
and configurations, and then 
Chengsong
parents: 646
diff changeset
   424
gets an output program encoding a minimized $\mathit{DFA}$
Chengsong
parents: 646
diff changeset
   425
that can be compiled and run. 
Chengsong
parents: 646
diff changeset
   426
When given the above countdown regular expression,
Chengsong
parents: 646
diff changeset
   427
a small $n$ (say 20) would result in a program representing a
Chengsong
parents: 646
diff changeset
   428
DFA
Chengsong
parents: 646
diff changeset
   429
with millions of states.}) for large counters.
Chengsong
parents: 646
diff changeset
   430
A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
Chengsong
parents: 646
diff changeset
   431
where the minimal DFA requires at least $2^{n+1}$ states.
Chengsong
parents: 646
diff changeset
   432
For example, when $n$ is equal to 2,
Chengsong
parents: 646
diff changeset
   433
the corresponding $\mathit{NFA}$ looks like:
Chengsong
parents: 646
diff changeset
   434
\vspace{6mm}
Chengsong
parents: 646
diff changeset
   435
\begin{center}
Chengsong
parents: 646
diff changeset
   436
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   437
   \node[state,initial] (q_0)   {$q_0$}; 
Chengsong
parents: 646
diff changeset
   438
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
Chengsong
parents: 646
diff changeset
   439
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
Chengsong
parents: 646
diff changeset
   440
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
Chengsong
parents: 646
diff changeset
   441
    \path[->] 
Chengsong
parents: 646
diff changeset
   442
    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   443
    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   444
    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   445
    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   446
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   447
\end{center}
Chengsong
parents: 646
diff changeset
   448
and when turned into a DFA by the subset construction
Chengsong
parents: 646
diff changeset
   449
requires at least $2^3$ states.\footnote{The 
Chengsong
parents: 646
diff changeset
   450
red states are "countdown states" which count down 
Chengsong
parents: 646
diff changeset
   451
the number of characters needed in addition to the current
Chengsong
parents: 646
diff changeset
   452
string to make a successful match.
Chengsong
parents: 646
diff changeset
   453
For example, state $q_1$ indicates a match that has
Chengsong
parents: 646
diff changeset
   454
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 646
diff changeset
   455
and just consumed the "delimiter" $a$ in the middle, and 
Chengsong
parents: 646
diff changeset
   456
needs to match 2 more iterations of $(a|b)$ to complete.
Chengsong
parents: 646
diff changeset
   457
State $q_2$ on the other hand, can be viewed as a state
Chengsong
parents: 646
diff changeset
   458
after $q_1$ has consumed 1 character, and just waits
Chengsong
parents: 646
diff changeset
   459
for 1 more character to complete.
Chengsong
parents: 646
diff changeset
   460
The state $q_3$ is the last (accepting) state, requiring 0 
Chengsong
parents: 646
diff changeset
   461
more characters.
Chengsong
parents: 646
diff changeset
   462
Depending on the suffix of the
Chengsong
parents: 646
diff changeset
   463
input string up to the current read location,
Chengsong
parents: 646
diff changeset
   464
the states $q_1$ and $q_2$, $q_3$
Chengsong
parents: 646
diff changeset
   465
may or may
Chengsong
parents: 646
diff changeset
   466
not be active.
Chengsong
parents: 646
diff changeset
   467
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
Chengsong
parents: 646
diff changeset
   468
contain at least $2^3$ non-equivalent states that cannot be merged, 
Chengsong
parents: 646
diff changeset
   469
because the subset construction during determinisation will generate
Chengsong
parents: 646
diff changeset
   470
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Chengsong
parents: 646
diff changeset
   471
Generalizing this to regular expressions with larger
Chengsong
parents: 646
diff changeset
   472
bounded repetitions number, we have that
Chengsong
parents: 646
diff changeset
   473
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
Chengsong
parents: 646
diff changeset
   474
would require at least $2^{n+1}$ states, if $r$ itself contains
Chengsong
parents: 646
diff changeset
   475
more than 1 string.
Chengsong
parents: 646
diff changeset
   476
This is to represent all different 
Chengsong
parents: 646
diff changeset
   477
scenarios in which "countdown" states are active.}
Chengsong
parents: 646
diff changeset
   478
Chengsong
parents: 646
diff changeset
   479
Chengsong
parents: 646
diff changeset
   480
Bounded repetitions are important because they
Chengsong
parents: 646
diff changeset
   481
tend to occur frequently in practical use,
Chengsong
parents: 646
diff changeset
   482
for example in the regex library RegExLib, in
Chengsong
parents: 646
diff changeset
   483
the rules library of Snort \cite{Snort1999}\footnote{
Chengsong
parents: 646
diff changeset
   484
Snort is a network intrusion detection (NID) tool
Chengsong
parents: 646
diff changeset
   485
for monitoring network traffic.
Chengsong
parents: 646
diff changeset
   486
The network security community curates a list
Chengsong
parents: 646
diff changeset
   487
of malicious patterns written as regexes,
Chengsong
parents: 646
diff changeset
   488
which is used by Snort's detection engine
Chengsong
parents: 646
diff changeset
   489
to match against network traffic for any hostile
Chengsong
parents: 646
diff changeset
   490
activities such as buffer overflow attacks.}, 
Chengsong
parents: 646
diff changeset
   491
as well as in XML Schema definitions (XSDs).
Chengsong
parents: 646
diff changeset
   492
According to Bj\"{o}rklund et al \cite{xml2015},
Chengsong
parents: 646
diff changeset
   493
more than half of the 
Chengsong
parents: 646
diff changeset
   494
XSDs they found on the Maven.org central repository
Chengsong
parents: 646
diff changeset
   495
have bounded regular expressions in them.
Chengsong
parents: 646
diff changeset
   496
Often the counters are quite large, with the largest being
Chengsong
parents: 646
diff changeset
   497
close to ten million. 
Chengsong
parents: 646
diff changeset
   498
A smaller sample XSD they gave
Chengsong
parents: 646
diff changeset
   499
is:
Chengsong
parents: 646
diff changeset
   500
\lstset{
Chengsong
parents: 646
diff changeset
   501
	basicstyle=\fontsize{8.5}{9}\ttfamily,
Chengsong
parents: 646
diff changeset
   502
  language=XML,
Chengsong
parents: 646
diff changeset
   503
  morekeywords={encoding,
Chengsong
parents: 646
diff changeset
   504
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
Chengsong
parents: 646
diff changeset
   505
}
Chengsong
parents: 646
diff changeset
   506
\begin{lstlisting}
Chengsong
parents: 646
diff changeset
   507
<sequence minOccurs="0" maxOccurs="65535">
Chengsong
parents: 646
diff changeset
   508
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
Chengsong
parents: 646
diff changeset
   509
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
Chengsong
parents: 646
diff changeset
   510
</sequence>
Chengsong
parents: 646
diff changeset
   511
\end{lstlisting}
Chengsong
parents: 646
diff changeset
   512
This can be seen as the regex
Chengsong
parents: 646
diff changeset
   513
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
Chengsong
parents: 646
diff changeset
   514
regular expressions 
Chengsong
parents: 646
diff changeset
   515
satisfying certain constraints (such as 
Chengsong
parents: 646
diff changeset
   516
satisfying the floating point number format).
Chengsong
parents: 646
diff changeset
   517
It is therefore quite unsatisfying that 
Chengsong
parents: 646
diff changeset
   518
some regular expressions matching libraries
Chengsong
parents: 646
diff changeset
   519
impose adhoc limits
Chengsong
parents: 646
diff changeset
   520
for bounded regular expressions:
Chengsong
parents: 646
diff changeset
   521
For example, in the regular expression matching library in the Go
Chengsong
parents: 646
diff changeset
   522
language the regular expression $a^{1001}$ is not permitted, because no counter
Chengsong
parents: 646
diff changeset
   523
can be above 1000, and in the built-in Rust regular expression library
Chengsong
parents: 646
diff changeset
   524
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
Chengsong
parents: 646
diff changeset
   525
for being too big. 
Chengsong
parents: 646
diff changeset
   526
As Becchi and Crawley \cite{Becchi08}  have pointed out,
Chengsong
parents: 646
diff changeset
   527
the reason for these restrictions
Chengsong
parents: 646
diff changeset
   528
is that they simulate a non-deterministic finite
Chengsong
parents: 646
diff changeset
   529
automata (NFA) with a breadth-first search.
Chengsong
parents: 646
diff changeset
   530
This way the number of active states could
Chengsong
parents: 646
diff changeset
   531
be equal to the counter number.
Chengsong
parents: 646
diff changeset
   532
When the counters are large, 
Chengsong
parents: 646
diff changeset
   533
the memory requirement could become
Chengsong
parents: 646
diff changeset
   534
infeasible, and a regex engine
Chengsong
parents: 646
diff changeset
   535
like in Go will reject this pattern straight away.
Chengsong
parents: 646
diff changeset
   536
\begin{figure}[H]
Chengsong
parents: 646
diff changeset
   537
\begin{center}
Chengsong
parents: 646
diff changeset
   538
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
Chengsong
parents: 646
diff changeset
   539
 
Chengsong
parents: 646
diff changeset
   540
    	\node (q0) [state, initial] {$0$};
Chengsong
parents: 646
diff changeset
   541
	\node (q1) [state, right = of q0] {$1$};
Chengsong
parents: 646
diff changeset
   542
	%\node (q2) [state, right = of q1] {$2$};
Chengsong
parents: 646
diff changeset
   543
	\node (qdots) [right = of q1] {$\ldots$};
Chengsong
parents: 646
diff changeset
   544
	\node (qn) [state, right = of qdots] {$n$};
Chengsong
parents: 646
diff changeset
   545
	\node (qn1) [state, right = of qn] {$n+1$};
Chengsong
parents: 646
diff changeset
   546
	\node (qn2) [state, right = of qn1] {$n+2$};
Chengsong
parents: 646
diff changeset
   547
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
Chengsong
parents: 646
diff changeset
   548
 
Chengsong
parents: 646
diff changeset
   549
\path [-stealth, thick]
Chengsong
parents: 646
diff changeset
   550
	(q0) edge [loop above] node {a} ()
Chengsong
parents: 646
diff changeset
   551
    (q0) edge node {a}   (q1) 
Chengsong
parents: 646
diff changeset
   552
    %(q1) edge node {.}   (q2)
Chengsong
parents: 646
diff changeset
   553
    (q1) edge node {.}   (qdots)
Chengsong
parents: 646
diff changeset
   554
    (qdots) edge node {.} (qn)
Chengsong
parents: 646
diff changeset
   555
    (qn) edge node {.} (qn1)
Chengsong
parents: 646
diff changeset
   556
    (qn1) edge node {b} (qn2)
Chengsong
parents: 646
diff changeset
   557
    (qn2) edge node {$c$} (qn3);
Chengsong
parents: 646
diff changeset
   558
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   559
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   560
%   \node[state,initial] (q_0)   {$0$}; 
Chengsong
parents: 646
diff changeset
   561
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
Chengsong
parents: 646
diff changeset
   562
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
Chengsong
parents: 646
diff changeset
   563
%   \node[state,
Chengsong
parents: 646
diff changeset
   564
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
Chengsong
parents: 646
diff changeset
   565
%    \path[->] 
Chengsong
parents: 646
diff changeset
   566
%    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   567
%    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   568
%    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   569
%    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   570
%\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   571
\end{center}
Chengsong
parents: 646
diff changeset
   572
\caption{The example given by Becchi and Crawley
Chengsong
parents: 646
diff changeset
   573
	that NFA simulation can consume large
Chengsong
parents: 646
diff changeset
   574
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
Chengsong
parents: 646
diff changeset
   575
	strings of the form $aaa\ldots aaaabc$.
Chengsong
parents: 646
diff changeset
   576
	When traversing in a breadth-first manner,
657
00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents: 651
diff changeset
   577
all states from 0 till $n+1$ will become active.}\label{fig:inj}
00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents: 651
diff changeset
   578
649
Chengsong
parents: 646
diff changeset
   579
\end{figure}
Chengsong
parents: 646
diff changeset
   580
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
Chengsong
parents: 646
diff changeset
   581
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
Chengsong
parents: 646
diff changeset
   582
%in terms of input string length.
Chengsong
parents: 646
diff changeset
   583
%TODO:try out these lexers
Chengsong
parents: 646
diff changeset
   584
These problems can of course be solved in matching algorithms where 
Chengsong
parents: 646
diff changeset
   585
automata go beyond the classic notion and for instance include explicit
Chengsong
parents: 646
diff changeset
   586
counters \cite{Turo_ov__2020}.
Chengsong
parents: 646
diff changeset
   587
These solutions can be quite efficient,
Chengsong
parents: 646
diff changeset
   588
with the ability to process
Chengsong
parents: 646
diff changeset
   589
gigabits of strings input per second
Chengsong
parents: 646
diff changeset
   590
even with large counters \cite{Becchi08}.
Chengsong
parents: 646
diff changeset
   591
These practical solutions do not come with
Chengsong
parents: 646
diff changeset
   592
formal guarantees, and as pointed out by
Chengsong
parents: 646
diff changeset
   593
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
Chengsong
parents: 646
diff changeset
   594
%But formal reasoning about these automata especially in Isabelle 
Chengsong
parents: 646
diff changeset
   595
%can be challenging
Chengsong
parents: 646
diff changeset
   596
%and un-intuitive. 
Chengsong
parents: 646
diff changeset
   597
%Therefore, we take correctness and runtime claims made about these solutions
Chengsong
parents: 646
diff changeset
   598
%with a grain of salt.
Chengsong
parents: 646
diff changeset
   599
Chengsong
parents: 646
diff changeset
   600
In the work reported in \cite{FoSSaCS2023} and here, 
Chengsong
parents: 646
diff changeset
   601
we add better support using derivatives
Chengsong
parents: 646
diff changeset
   602
for bounded regular expression $r^{\{n\}}$.
Chengsong
parents: 646
diff changeset
   603
Our results
Chengsong
parents: 646
diff changeset
   604
extend straightforwardly to
Chengsong
parents: 646
diff changeset
   605
repetitions with intervals such as 
Chengsong
parents: 646
diff changeset
   606
$r^{\{n\ldots m\}}$.
Chengsong
parents: 646
diff changeset
   607
The merit of Brzozowski derivatives (more on this later)
Chengsong
parents: 646
diff changeset
   608
on this problem is that
Chengsong
parents: 646
diff changeset
   609
it can be naturally extended to support bounded repetitions.
Chengsong
parents: 646
diff changeset
   610
Moreover these extensions are still made up of only small
Chengsong
parents: 646
diff changeset
   611
inductive datatypes and recursive functions,
Chengsong
parents: 646
diff changeset
   612
making it handy to deal with them in theorem provers.
Chengsong
parents: 646
diff changeset
   613
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
Chengsong
parents: 646
diff changeset
   614
%straightforwardly extended to deal with bounded regular expressions
Chengsong
parents: 646
diff changeset
   615
%and moreover the resulting code still consists of only simple
Chengsong
parents: 646
diff changeset
   616
%recursive functions and inductive datatypes.
Chengsong
parents: 646
diff changeset
   617
Finally, bounded regular expressions do not destroy our finite
Chengsong
parents: 646
diff changeset
   618
boundedness property, which we shall prove later on.
Chengsong
parents: 646
diff changeset
   619
Chengsong
parents: 646
diff changeset
   620
Chengsong
parents: 646
diff changeset
   621
Chengsong
parents: 646
diff changeset
   622
Chengsong
parents: 646
diff changeset
   623
Chengsong
parents: 646
diff changeset
   624
\subsection{Back-References}
Chengsong
parents: 646
diff changeset
   625
The other way to simulate an $\mathit{NFA}$ for matching is choosing  
Chengsong
parents: 646
diff changeset
   626
a single transition each time, keeping all the other options in 
Chengsong
parents: 646
diff changeset
   627
a queue or stack, and backtracking if that choice eventually 
Chengsong
parents: 646
diff changeset
   628
fails. 
Chengsong
parents: 646
diff changeset
   629
This method, often called a  "depth-first-search", 
Chengsong
parents: 646
diff changeset
   630
is efficient in many cases, but could end up
Chengsong
parents: 646
diff changeset
   631
with exponential run time.
Chengsong
parents: 646
diff changeset
   632
The backtracking method is employed in regex libraries
Chengsong
parents: 646
diff changeset
   633
that support \emph{back-references}, for example
Chengsong
parents: 646
diff changeset
   634
in Java and Python.
Chengsong
parents: 646
diff changeset
   635
%\section{Back-references and The Terminology Regex}
Chengsong
parents: 646
diff changeset
   636
Chengsong
parents: 646
diff changeset
   637
%When one constructs an $\NFA$ out of a regular expression
Chengsong
parents: 646
diff changeset
   638
%there is often very little to be done in the first phase, one simply 
Chengsong
parents: 646
diff changeset
   639
%construct the $\NFA$ states based on the structure of the input regular expression.
Chengsong
parents: 646
diff changeset
   640
Chengsong
parents: 646
diff changeset
   641
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
Chengsong
parents: 646
diff changeset
   642
%one by keeping track of all active states after consuming 
Chengsong
parents: 646
diff changeset
   643
%a character, and update that set of states iteratively.
Chengsong
parents: 646
diff changeset
   644
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
Chengsong
parents: 646
diff changeset
   645
%for a path terminating
Chengsong
parents: 646
diff changeset
   646
%at an accepting state.
Chengsong
parents: 646
diff changeset
   647
Chengsong
parents: 646
diff changeset
   648
Chengsong
parents: 646
diff changeset
   649
Chengsong
parents: 646
diff changeset
   650
Consider the following regular expression where the sequence
Chengsong
parents: 646
diff changeset
   651
operator is omitted for brevity:
Chengsong
parents: 646
diff changeset
   652
\begin{center}
Chengsong
parents: 646
diff changeset
   653
	$r_1r_2r_3r_4$
Chengsong
parents: 646
diff changeset
   654
\end{center}
Chengsong
parents: 646
diff changeset
   655
In this example,
Chengsong
parents: 646
diff changeset
   656
one could label sub-expressions of interest 
Chengsong
parents: 646
diff changeset
   657
by parenthesizing them and giving 
Chengsong
parents: 646
diff changeset
   658
them a number in the order in which their opening parentheses appear.
Chengsong
parents: 646
diff changeset
   659
One possible way of parenthesizing and labelling is: 
Chengsong
parents: 646
diff changeset
   660
\begin{center}
Chengsong
parents: 646
diff changeset
   661
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
Chengsong
parents: 646
diff changeset
   662
\end{center}
Chengsong
parents: 646
diff changeset
   663
The sub-expressions
Chengsong
parents: 646
diff changeset
   664
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
Chengsong
parents: 646
diff changeset
   665
by 1 to 4, and can be ``referred back'' by their respective numbers. 
Chengsong
parents: 646
diff changeset
   666
%These sub-expressions are called "capturing groups".
Chengsong
parents: 646
diff changeset
   667
To do so, one uses the syntax $\backslash i$ 
Chengsong
parents: 646
diff changeset
   668
to denote that we want the sub-string 
Chengsong
parents: 646
diff changeset
   669
of the input matched by the i-th
Chengsong
parents: 646
diff changeset
   670
sub-expression to appear again, 
Chengsong
parents: 646
diff changeset
   671
exactly the same as it first appeared: 
Chengsong
parents: 646
diff changeset
   672
\begin{center}
Chengsong
parents: 646
diff changeset
   673
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
Chengsong
parents: 646
diff changeset
   674
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
Chengsong
parents: 646
diff changeset
   675
\end{center}
Chengsong
parents: 646
diff changeset
   676
Once the sub-string $s_i$ for the sub-expression $r_i$
Chengsong
parents: 646
diff changeset
   677
has been fixed, there is no variability on what the back-reference
Chengsong
parents: 646
diff changeset
   678
label $\backslash i$ can be---it is tied to $s_i$.
Chengsong
parents: 646
diff changeset
   679
The overall string will look like $\ldots s_i \ldots s_i \ldots $
Chengsong
parents: 646
diff changeset
   680
%The backslash and number $i$ are the
Chengsong
parents: 646
diff changeset
   681
%so-called "back-references".
Chengsong
parents: 646
diff changeset
   682
%Let $e$ be an expression made of regular expressions 
Chengsong
parents: 646
diff changeset
   683
%and back-references. $e$ contains the expression $e_i$
Chengsong
parents: 646
diff changeset
   684
%as its $i$-th capturing group.
Chengsong
parents: 646
diff changeset
   685
%The semantics of back-reference can be recursively
Chengsong
parents: 646
diff changeset
   686
%written as:
Chengsong
parents: 646
diff changeset
   687
%\begin{center}
Chengsong
parents: 646
diff changeset
   688
%	\begin{tabular}{c}
Chengsong
parents: 646
diff changeset
   689
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
Chengsong
parents: 646
diff changeset
   690
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
Chengsong
parents: 646
diff changeset
   691
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   692
%\end{center}
Chengsong
parents: 646
diff changeset
   693
A concrete example
Chengsong
parents: 646
diff changeset
   694
for back-references is
Chengsong
parents: 646
diff changeset
   695
\begin{center}
Chengsong
parents: 646
diff changeset
   696
$(.^*)\backslash 1$,
Chengsong
parents: 646
diff changeset
   697
\end{center}
Chengsong
parents: 646
diff changeset
   698
which matches
Chengsong
parents: 646
diff changeset
   699
strings that can be split into two identical halves,
Chengsong
parents: 646
diff changeset
   700
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Chengsong
parents: 646
diff changeset
   701
Note that this is different from 
Chengsong
parents: 646
diff changeset
   702
repeating the  sub-expression verbatim like
Chengsong
parents: 646
diff changeset
   703
\begin{center}
Chengsong
parents: 646
diff changeset
   704
	$(.^*)(.^*)$,
Chengsong
parents: 646
diff changeset
   705
\end{center}
Chengsong
parents: 646
diff changeset
   706
which does not impose any restrictions on what strings the second 
Chengsong
parents: 646
diff changeset
   707
sub-expression $.^*$
Chengsong
parents: 646
diff changeset
   708
might match.
Chengsong
parents: 646
diff changeset
   709
Another example for back-references is
Chengsong
parents: 646
diff changeset
   710
\begin{center}
Chengsong
parents: 646
diff changeset
   711
$(.)(.)\backslash 2\backslash 1$
Chengsong
parents: 646
diff changeset
   712
\end{center}
Chengsong
parents: 646
diff changeset
   713
which matches four-character palindromes
Chengsong
parents: 646
diff changeset
   714
like $abba$, $x??x$ and so on.
Chengsong
parents: 646
diff changeset
   715
Chengsong
parents: 646
diff changeset
   716
Back-references are a regex construct 
Chengsong
parents: 646
diff changeset
   717
that programmers find quite useful.
Chengsong
parents: 646
diff changeset
   718
According to Becchi and Crawley \cite{Becchi08},
Chengsong
parents: 646
diff changeset
   719
6\% of Snort rules (up until 2008) use them.
Chengsong
parents: 646
diff changeset
   720
The most common use of back-references
Chengsong
parents: 646
diff changeset
   721
is to express well-formed html files,
Chengsong
parents: 646
diff changeset
   722
where back-references are convenient for matching
Chengsong
parents: 646
diff changeset
   723
opening and closing tags like 
Chengsong
parents: 646
diff changeset
   724
\begin{center}
Chengsong
parents: 646
diff changeset
   725
	$\langle html \rangle \ldots \langle / html \rangle$
Chengsong
parents: 646
diff changeset
   726
\end{center}
Chengsong
parents: 646
diff changeset
   727
A regex describing such a format
Chengsong
parents: 646
diff changeset
   728
is
Chengsong
parents: 646
diff changeset
   729
\begin{center}
Chengsong
parents: 646
diff changeset
   730
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
Chengsong
parents: 646
diff changeset
   731
\end{center}
Chengsong
parents: 646
diff changeset
   732
Despite being useful, the expressive power of regexes 
Chengsong
parents: 646
diff changeset
   733
go beyond regular languages 
Chengsong
parents: 646
diff changeset
   734
once back-references are included.
Chengsong
parents: 646
diff changeset
   735
In fact, they allow the regex construct to express 
Chengsong
parents: 646
diff changeset
   736
languages that cannot be contained in context-free
Chengsong
parents: 646
diff changeset
   737
languages either.
Chengsong
parents: 646
diff changeset
   738
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
Chengsong
parents: 646
diff changeset
   739
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
Chengsong
parents: 646
diff changeset
   740
which cannot be expressed by 
Chengsong
parents: 646
diff changeset
   741
context-free grammars \cite{campeanu2003formal}.
Chengsong
parents: 646
diff changeset
   742
Such a language is contained in the context-sensitive hierarchy
Chengsong
parents: 646
diff changeset
   743
of formal languages. 
Chengsong
parents: 646
diff changeset
   744
Also solving the matching problem involving back-references
Chengsong
parents: 646
diff changeset
   745
is known to be NP-complete \parencite{alfred2014algorithms}.
Chengsong
parents: 646
diff changeset
   746
Regex libraries supporting back-references such as 
Chengsong
parents: 646
diff changeset
   747
PCRE \cite{pcre} therefore have to
Chengsong
parents: 646
diff changeset
   748
revert to a depth-first search algorithm including backtracking.
Chengsong
parents: 646
diff changeset
   749
What is unexpected is that even in the cases 
Chengsong
parents: 646
diff changeset
   750
not involving back-references, there is still
Chengsong
parents: 646
diff changeset
   751
a (non-negligible) chance they might backtrack super-linearly,
Chengsong
parents: 646
diff changeset
   752
as shown in the graphs in figure \ref{fig:aStarStarb}.
Chengsong
parents: 646
diff changeset
   753
Chengsong
parents: 646
diff changeset
   754
Summing up, we can categorise existing 
Chengsong
parents: 646
diff changeset
   755
practical regex libraries into two kinds:
Chengsong
parents: 646
diff changeset
   756
(i) The ones  with  linear
Chengsong
parents: 646
diff changeset
   757
time guarantees like Go and Rust. The downside with them is that
Chengsong
parents: 646
diff changeset
   758
they impose restrictions
Chengsong
parents: 646
diff changeset
   759
on the regular expressions (not allowing back-references, 
Chengsong
parents: 646
diff changeset
   760
bounded repetitions cannot exceed an ad hoc limit etc.).
Chengsong
parents: 646
diff changeset
   761
And (ii) those 
Chengsong
parents: 646
diff changeset
   762
that allow large bounded regular expressions and back-references
Chengsong
parents: 646
diff changeset
   763
at the expense of using backtracking algorithms.
Chengsong
parents: 646
diff changeset
   764
They can potentially ``grind to a halt''
Chengsong
parents: 646
diff changeset
   765
on some very simple cases, resulting 
Chengsong
parents: 646
diff changeset
   766
ReDoS attacks if exposed to the internet.
Chengsong
parents: 646
diff changeset
   767
Chengsong
parents: 646
diff changeset
   768
The problems with both approaches are the motivation for us 
Chengsong
parents: 646
diff changeset
   769
to look again at the regular expression matching problem. 
Chengsong
parents: 646
diff changeset
   770
Another motivation is that regular expression matching algorithms
Chengsong
parents: 646
diff changeset
   771
that follow the POSIX standard often contain errors and bugs 
Chengsong
parents: 646
diff changeset
   772
as we shall explain next.
Chengsong
parents: 646
diff changeset
   773
Chengsong
parents: 646
diff changeset
   774
%We would like to have regex engines that can 
Chengsong
parents: 646
diff changeset
   775
%deal with the regular part (e.g.
Chengsong
parents: 646
diff changeset
   776
%bounded repetitions) of regexes more
Chengsong
parents: 646
diff changeset
   777
%efficiently.
Chengsong
parents: 646
diff changeset
   778
%Also we want to make sure that they do it correctly.
Chengsong
parents: 646
diff changeset
   779
%It turns out that such aim is not so easy to achieve.
Chengsong
parents: 646
diff changeset
   780
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
Chengsong
parents: 646
diff changeset
   781
% For example, the Rust regex engine claims to be linear, 
Chengsong
parents: 646
diff changeset
   782
% but does not support lookarounds and back-references.
Chengsong
parents: 646
diff changeset
   783
% The GoLang regex library does not support over 1000 repetitions.  
Chengsong
parents: 646
diff changeset
   784
% Java and Python both support back-references, but shows
Chengsong
parents: 646
diff changeset
   785
%catastrophic backtracking behaviours on inputs without back-references(
Chengsong
parents: 646
diff changeset
   786
%when the language is still regular).
Chengsong
parents: 646
diff changeset
   787
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
Chengsong
parents: 646
diff changeset
   788
 %TODO: verify the fact Rust does not allow 1000+ reps
Chengsong
parents: 646
diff changeset
   789
Chengsong
parents: 646
diff changeset
   790
Chengsong
parents: 646
diff changeset
   791
Chengsong
parents: 646
diff changeset
   792
Chengsong
parents: 646
diff changeset
   793
%The time cost of regex matching algorithms in general
Chengsong
parents: 646
diff changeset
   794
%involve two different phases, and different things can go differently wrong on 
Chengsong
parents: 646
diff changeset
   795
%these phases.
Chengsong
parents: 646
diff changeset
   796
%$\DFA$s usually have problems in the first (construction) phase
Chengsong
parents: 646
diff changeset
   797
%, whereas $\NFA$s usually run into trouble
Chengsong
parents: 646
diff changeset
   798
%on the second phase.
Chengsong
parents: 646
diff changeset
   799
Chengsong
parents: 646
diff changeset
   800
Chengsong
parents: 646
diff changeset
   801
\section{Error-prone POSIX Implementations}
Chengsong
parents: 646
diff changeset
   802
Very often there are multiple ways of matching a string
Chengsong
parents: 646
diff changeset
   803
with a regular expression.
Chengsong
parents: 646
diff changeset
   804
In such cases the regular expressions matcher needs to
Chengsong
parents: 646
diff changeset
   805
disambiguate.
Chengsong
parents: 646
diff changeset
   806
The more widely used strategy is called POSIX,
Chengsong
parents: 646
diff changeset
   807
which roughly speaking always chooses the longest initial match.
Chengsong
parents: 646
diff changeset
   808
The POSIX strategy is widely adopted in many regular expression matchers
Chengsong
parents: 646
diff changeset
   809
because it is a natural strategy for lexers.
Chengsong
parents: 646
diff changeset
   810
However, many implementations (including the C libraries
Chengsong
parents: 646
diff changeset
   811
used by Linux and OS X distributions) contain bugs
Chengsong
parents: 646
diff changeset
   812
or do not meet the specification they claim to adhere to.
Chengsong
parents: 646
diff changeset
   813
Kuklewicz maintains a unit test repository which lists some
Chengsong
parents: 646
diff changeset
   814
problems with existing regular expression engines \cite{KuklewiczHaskell}.
Chengsong
parents: 646
diff changeset
   815
In some cases, they either fail to generate a
Chengsong
parents: 646
diff changeset
   816
result when there exists a match,
Chengsong
parents: 646
diff changeset
   817
or give results that are inconsistent with the POSIX standard.
Chengsong
parents: 646
diff changeset
   818
A concrete example is the regex:
Chengsong
parents: 646
diff changeset
   819
\begin{center}
Chengsong
parents: 646
diff changeset
   820
	$(aba + ab + a)^* \text{and the string} \; ababa$
Chengsong
parents: 646
diff changeset
   821
\end{center}
Chengsong
parents: 646
diff changeset
   822
The correct POSIX match for the above
Chengsong
parents: 646
diff changeset
   823
involves the entire string $ababa$, 
Chengsong
parents: 646
diff changeset
   824
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
Chengsong
parents: 646
diff changeset
   825
$[0, 2), [2, 5)$
Chengsong
parents: 646
diff changeset
   826
respectively.
Chengsong
parents: 646
diff changeset
   827
But feeding this example to the different engines
Chengsong
parents: 646
diff changeset
   828
listed at regex101 \footnote{
Chengsong
parents: 646
diff changeset
   829
	regex101 is an online regular expression matcher which
Chengsong
parents: 646
diff changeset
   830
	provides API for trying out regular expression
Chengsong
parents: 646
diff changeset
   831
	engines of multiple popular programming languages like
Chengsong
parents: 646
diff changeset
   832
Java, Python, Go, etc.} \parencite{regex101}. 
Chengsong
parents: 646
diff changeset
   833
yields
Chengsong
parents: 646
diff changeset
   834
only two incomplete matches: $[aba]$ at $[0, 3)$
Chengsong
parents: 646
diff changeset
   835
and $a$ at $[4, 5)$.
Chengsong
parents: 646
diff changeset
   836
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   837
commented that most regex libraries are not
Chengsong
parents: 646
diff changeset
   838
correctly implementing the central POSIX
Chengsong
parents: 646
diff changeset
   839
rule, called the maximum munch rule.
Chengsong
parents: 646
diff changeset
   840
Grathwohl \parencite{grathwohl2014crash} wrote:
Chengsong
parents: 646
diff changeset
   841
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   842
	``The POSIX strategy is more complicated than the 
Chengsong
parents: 646
diff changeset
   843
	greedy because of the dependence on information about 
Chengsong
parents: 646
diff changeset
   844
	the length of matched strings in the various subexpressions.''
Chengsong
parents: 646
diff changeset
   845
\end{quote}
Chengsong
parents: 646
diff changeset
   846
%\noindent
Chengsong
parents: 646
diff changeset
   847
People have recognised that the implementation complexity of POSIX rules also come from
Chengsong
parents: 646
diff changeset
   848
the specification being not very precise.
Chengsong
parents: 646
diff changeset
   849
The developers of the regexp package of Go 
Chengsong
parents: 646
diff changeset
   850
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
Chengsong
parents: 646
diff changeset
   851
commented that
Chengsong
parents: 646
diff changeset
   852
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   853
``
Chengsong
parents: 646
diff changeset
   854
The POSIX rule is computationally prohibitive
Chengsong
parents: 646
diff changeset
   855
and not even well-defined.
Chengsong
parents: 646
diff changeset
   856
``
Chengsong
parents: 646
diff changeset
   857
\end{quote}
Chengsong
parents: 646
diff changeset
   858
There are many informal summaries of this disambiguation
Chengsong
parents: 646
diff changeset
   859
strategy, which are often quite long and delicate.
Chengsong
parents: 646
diff changeset
   860
For example Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   861
described the POSIX rule as (section 1, last paragraph):
Chengsong
parents: 646
diff changeset
   862
\begin{quote}
Chengsong
parents: 646
diff changeset
   863
	\begin{itemize}
Chengsong
parents: 646
diff changeset
   864
		\item
Chengsong
parents: 646
diff changeset
   865
regular expressions (REs) take the leftmost starting match, and the longest match starting there
Chengsong
parents: 646
diff changeset
   866
earlier subpatterns have leftmost-longest priority over later subpatterns\\
Chengsong
parents: 646
diff changeset
   867
\item
Chengsong
parents: 646
diff changeset
   868
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
Chengsong
parents: 646
diff changeset
   869
\item
Chengsong
parents: 646
diff changeset
   870
REs have right associative concatenation which can be changed with parenthesis\\
Chengsong
parents: 646
diff changeset
   871
\item
Chengsong
parents: 646
diff changeset
   872
parenthesized subexpressions return the match from their last usage\\
Chengsong
parents: 646
diff changeset
   873
\item
Chengsong
parents: 646
diff changeset
   874
text of component subexpressions must be contained in the text of the 
Chengsong
parents: 646
diff changeset
   875
higher-level subexpressions\\
Chengsong
parents: 646
diff changeset
   876
\item
Chengsong
parents: 646
diff changeset
   877
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
Chengsong
parents: 646
diff changeset
   878
\item
Chengsong
parents: 646
diff changeset
   879
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
Chengsong
parents: 646
diff changeset
   880
\end{itemize}
Chengsong
parents: 646
diff changeset
   881
\end{quote}
Chengsong
parents: 646
diff changeset
   882
%The text above 
Chengsong
parents: 646
diff changeset
   883
%is trying to capture something very precise,
Chengsong
parents: 646
diff changeset
   884
%and is crying out for formalising.
Chengsong
parents: 646
diff changeset
   885
Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
Chengsong
parents: 646
diff changeset
   886
formalised the notion of bit-coded regular expressions
Chengsong
parents: 646
diff changeset
   887
and proved their relations with simple regular expressions in
Chengsong
parents: 646
diff changeset
   888
the dependently-typed proof assistant Agda.
Chengsong
parents: 646
diff changeset
   889
They also proved the soundness and completeness of a matching algorithm
Chengsong
parents: 646
diff changeset
   890
based on the bit-coded regular expressions.
Chengsong
parents: 646
diff changeset
   891
Ausaf et al. \cite{AusafDyckhoffUrban2016}
Chengsong
parents: 646
diff changeset
   892
are the first to
Chengsong
parents: 646
diff changeset
   893
give a quite simple formalised POSIX
Chengsong
parents: 646
diff changeset
   894
specification in Isabelle/HOL, and also prove 
Chengsong
parents: 646
diff changeset
   895
that their specification coincides with an earlier (unformalised) 
Chengsong
parents: 646
diff changeset
   896
POSIX specification given by Okui and Suzuki \cite{Okui10}.
Chengsong
parents: 646
diff changeset
   897
They then formally proved the correctness of
Chengsong
parents: 646
diff changeset
   898
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
Chengsong
parents: 646
diff changeset
   899
with regards to that specification.
Chengsong
parents: 646
diff changeset
   900
They also found that the informal POSIX
Chengsong
parents: 646
diff changeset
   901
specification by Sulzmann and Lu needs to be substantially 
Chengsong
parents: 646
diff changeset
   902
revised in order for the correctness proof to go through.
Chengsong
parents: 646
diff changeset
   903
Their original specification and proof were unfixable
Chengsong
parents: 646
diff changeset
   904
according to Ausaf et al.
Chengsong
parents: 646
diff changeset
   905
Chengsong
parents: 646
diff changeset
   906
Chengsong
parents: 646
diff changeset
   907
In the next section, we will briefly
Chengsong
parents: 646
diff changeset
   908
introduce Brzozowski derivatives and Sulzmann
Chengsong
parents: 646
diff changeset
   909
and Lu's algorithm, which the main point of this thesis builds on.
Chengsong
parents: 646
diff changeset
   910
%We give a taste of what they 
Chengsong
parents: 646
diff changeset
   911
%are like and why they are suitable for regular expression
Chengsong
parents: 646
diff changeset
   912
%matching and lexing.
Chengsong
parents: 646
diff changeset
   913
\section{Formal Specification of POSIX Matching 
Chengsong
parents: 646
diff changeset
   914
and Brzozowski Derivatives}
Chengsong
parents: 646
diff changeset
   915
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Chengsong
parents: 646
diff changeset
   916
Brzozowski \cite{Brzozowski1964} first introduced the 
Chengsong
parents: 646
diff changeset
   917
concept of a \emph{derivative} of regular expression in 1964.
Chengsong
parents: 646
diff changeset
   918
The derivative of a regular expression $r$
Chengsong
parents: 646
diff changeset
   919
with respect to a character $c$, is written as $r \backslash c$.
Chengsong
parents: 646
diff changeset
   920
This operation tells us what $r$ transforms into
Chengsong
parents: 646
diff changeset
   921
if we ``chop'' off a particular character $c$ 
Chengsong
parents: 646
diff changeset
   922
from all strings in the language of $r$ (defined
Chengsong
parents: 646
diff changeset
   923
later as $L \; r$).
Chengsong
parents: 646
diff changeset
   924
%To give a flavour of Brzozowski derivatives, we present
Chengsong
parents: 646
diff changeset
   925
%two straightforward clauses from it:
Chengsong
parents: 646
diff changeset
   926
%\begin{center}
Chengsong
parents: 646
diff changeset
   927
%	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   928
%		$d \backslash c$     & $\dn$ & 
Chengsong
parents: 646
diff changeset
   929
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
Chengsong
parents: 646
diff changeset
   930
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
Chengsong
parents: 646
diff changeset
   931
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   932
%\end{center}
Chengsong
parents: 646
diff changeset
   933
%\noindent
Chengsong
parents: 646
diff changeset
   934
%The first clause says that for the regular expression
Chengsong
parents: 646
diff changeset
   935
%denoting a singleton set consisting of a single-character string $\{ d \}$,
Chengsong
parents: 646
diff changeset
   936
%we check the derivative character $c$ against $d$,
Chengsong
parents: 646
diff changeset
   937
%returning a set containing only the empty string $\{ [] \}$
Chengsong
parents: 646
diff changeset
   938
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
Chengsong
parents: 646
diff changeset
   939
%The second clause states that to obtain the regular expression
Chengsong
parents: 646
diff changeset
   940
%representing all strings' head character $c$ being chopped off
Chengsong
parents: 646
diff changeset
   941
%from $r_1 + r_2$, one simply needs to recursively take derivative
Chengsong
parents: 646
diff changeset
   942
%of $r_1$ and $r_2$ and then put them together.
Chengsong
parents: 646
diff changeset
   943
Derivatives have the property
Chengsong
parents: 646
diff changeset
   944
that $s \in L \; (r\backslash c)$ if and only if 
Chengsong
parents: 646
diff changeset
   945
$c::s \in L \; r$ where $::$ stands for list prepending.
Chengsong
parents: 646
diff changeset
   946
%This property can be used on regular expressions
Chengsong
parents: 646
diff changeset
   947
%matching and lexing--to test whether a string $s$ is in $L \; r$,
Chengsong
parents: 646
diff changeset
   948
%one simply takes derivatives of $r$ successively with
Chengsong
parents: 646
diff changeset
   949
%respect to the characters (in the correct order) in $s$,
Chengsong
parents: 646
diff changeset
   950
%and then test whether the empty string is in the last regular expression.
Chengsong
parents: 646
diff changeset
   951
With this property, derivatives can give a simple solution
Chengsong
parents: 646
diff changeset
   952
to the problem of matching a string $s$ with a regular
Chengsong
parents: 646
diff changeset
   953
expression $r$: if the derivative of $r$ w.r.t.\ (in
Chengsong
parents: 646
diff changeset
   954
succession) all the characters of the string matches the empty string,
Chengsong
parents: 646
diff changeset
   955
then $r$ matches $s$ (and {\em vice versa}).  
Chengsong
parents: 646
diff changeset
   956
%This makes formally reasoning about these properties such
Chengsong
parents: 646
diff changeset
   957
%as correctness and complexity smooth and intuitive.
Chengsong
parents: 646
diff changeset
   958
There are several mechanised proofs of this property in various theorem
Chengsong
parents: 646
diff changeset
   959
provers,
Chengsong
parents: 646
diff changeset
   960
for example one by Owens and Slind \cite{Owens2008} in HOL4,
Chengsong
parents: 646
diff changeset
   961
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
Chengsong
parents: 646
diff changeset
   962
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
Chengsong
parents: 646
diff changeset
   963
Chengsong
parents: 646
diff changeset
   964
In addition, one can extend derivatives to bounded repetitions
Chengsong
parents: 646
diff changeset
   965
relatively straightforwardly. For example, the derivative for 
Chengsong
parents: 646
diff changeset
   966
this can be defined as:
Chengsong
parents: 646
diff changeset
   967
\begin{center}
Chengsong
parents: 646
diff changeset
   968
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   969
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
Chengsong
parents: 646
diff changeset
   970
		r^{\{n-1\}} (\text{when} n > 0)$\\
Chengsong
parents: 646
diff changeset
   971
	\end{tabular}
Chengsong
parents: 646
diff changeset
   972
\end{center}
Chengsong
parents: 646
diff changeset
   973
\noindent
Chengsong
parents: 646
diff changeset
   974
Experimental results suggest that  unlike DFA-based solutions
Chengsong
parents: 646
diff changeset
   975
for bounded regular expressions,
Chengsong
parents: 646
diff changeset
   976
derivatives can cope
Chengsong
parents: 646
diff changeset
   977
large counters
Chengsong
parents: 646
diff changeset
   978
quite well.
Chengsong
parents: 646
diff changeset
   979
Chengsong
parents: 646
diff changeset
   980
There have also been 
Chengsong
parents: 646
diff changeset
   981
extensions of derivatives to other regex constructs.
Chengsong
parents: 646
diff changeset
   982
For example, Owens et al include the derivatives
Chengsong
parents: 646
diff changeset
   983
for the \emph{NOT} regular expression, which is
Chengsong
parents: 646
diff changeset
   984
able to concisely express C-style comments of the form
Chengsong
parents: 646
diff changeset
   985
$/* \ldots */$ (see \cite{Owens2008}).
Chengsong
parents: 646
diff changeset
   986
Another extension for derivatives is
Chengsong
parents: 646
diff changeset
   987
regular expressions with look-aheads, done
Chengsong
parents: 646
diff changeset
   988
by Miyazaki and Minamide
Chengsong
parents: 646
diff changeset
   989
\cite{Takayuki2019}.
Chengsong
parents: 646
diff changeset
   990
%We therefore use Brzozowski derivatives on regular expressions 
Chengsong
parents: 646
diff changeset
   991
%lexing 
Chengsong
parents: 646
diff changeset
   992
Chengsong
parents: 646
diff changeset
   993
Chengsong
parents: 646
diff changeset
   994
Chengsong
parents: 646
diff changeset
   995
Given the above definitions and properties of
Chengsong
parents: 646
diff changeset
   996
Brzozowski derivatives, one quickly realises their potential
Chengsong
parents: 646
diff changeset
   997
in generating a formally verified algorithm for lexing: the clauses and property
Chengsong
parents: 646
diff changeset
   998
can be easily expressed in a functional programming language 
Chengsong
parents: 646
diff changeset
   999
or converted to theorem prover
Chengsong
parents: 646
diff changeset
  1000
code, with great ease.
Chengsong
parents: 646
diff changeset
  1001
Perhaps this is the reason why derivatives have sparked quite a bit of interest
Chengsong
parents: 646
diff changeset
  1002
in the functional programming and theorem prover communities in the last
Chengsong
parents: 646
diff changeset
  1003
fifteen or so years (
Chengsong
parents: 646
diff changeset
  1004
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
Chengsong
parents: 646
diff changeset
  1005
\cite{Chen12} and \cite{Coquand2012}
Chengsong
parents: 646
diff changeset
  1006
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
Chengsong
parents: 646
diff changeset
  1007
after they were first published by Brzozowski.
Chengsong
parents: 646
diff changeset
  1008
Chengsong
parents: 646
diff changeset
  1009
Chengsong
parents: 646
diff changeset
  1010
However, there are two difficulties with derivative-based matchers:
Chengsong
parents: 646
diff changeset
  1011
First, Brzozowski's original matcher only generates a yes/no answer
Chengsong
parents: 646
diff changeset
  1012
for whether a regular expression matches a string or not.  This is too
Chengsong
parents: 646
diff changeset
  1013
little information in the context of lexing where separate tokens must
Chengsong
parents: 646
diff changeset
  1014
be identified and also classified (for example as keywords
Chengsong
parents: 646
diff changeset
  1015
or identifiers). 
Chengsong
parents: 646
diff changeset
  1016
Second, derivative-based matchers need to be more efficient in terms
Chengsong
parents: 646
diff changeset
  1017
of the sizes of derivatives.
Chengsong
parents: 646
diff changeset
  1018
Elegant and beautiful
Chengsong
parents: 646
diff changeset
  1019
as many implementations are,
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
  1020
they can be still quite slow. 
649
Chengsong
parents: 646
diff changeset
  1021
For example, Sulzmann and Lu
Chengsong
parents: 646
diff changeset
  1022
claim a linear running time of their proposed algorithm,
Chengsong
parents: 646
diff changeset
  1023
but that was falsified by our experiments. The running time 
Chengsong
parents: 646
diff changeset
  1024
is actually $\Omega(2^n)$ in the worst case.
Chengsong
parents: 646
diff changeset
  1025
A similar claim about a theoretical runtime of $O(n^2)$ 
Chengsong
parents: 646
diff changeset
  1026
is made for the Verbatim \cite{Verbatim}
Chengsong
parents: 646
diff changeset
  1027
%TODO: give references
Chengsong
parents: 646
diff changeset
  1028
lexer, which calculates POSIX matches and is based on derivatives.
Chengsong
parents: 646
diff changeset
  1029
They formalized the correctness of the lexer, but not their complexity result.
Chengsong
parents: 646
diff changeset
  1030
In the performance evaluation section, they analyzed the run time
Chengsong
parents: 646
diff changeset
  1031
of matching $a$ with the string 
Chengsong
parents: 646
diff changeset
  1032
\begin{center}
Chengsong
parents: 646
diff changeset
  1033
	$\underbrace{a \ldots a}_{\text{n a's}}$.
Chengsong
parents: 646
diff changeset
  1034
\end{center}
Chengsong
parents: 646
diff changeset
  1035
\noindent
Chengsong
parents: 646
diff changeset
  1036
They concluded that the algorithm is quadratic in terms of 
Chengsong
parents: 646
diff changeset
  1037
the length of the input string.
Chengsong
parents: 646
diff changeset
  1038
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
Chengsong
parents: 646
diff changeset
  1039
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
Chengsong
parents: 646
diff changeset
  1040
Chengsong
parents: 646
diff changeset
  1041
Chengsong
parents: 646
diff changeset
  1042
\subsection{Sulzmann and Lu's Algorithm}
Chengsong
parents: 646
diff changeset
  1043
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
Chengsong
parents: 646
diff changeset
  1044
problem with the yes/no answer 
Chengsong
parents: 646
diff changeset
  1045
by cleverly extending Brzozowski's matching
Chengsong
parents: 646
diff changeset
  1046
algorithm. Their extended version generates additional information on
Chengsong
parents: 646
diff changeset
  1047
\emph{how} a regular expression matches a string following the POSIX
Chengsong
parents: 646
diff changeset
  1048
rules for regular expression matching. They achieve this by adding a
Chengsong
parents: 646
diff changeset
  1049
second ``phase'' to Brzozowski's algorithm involving an injection
Chengsong
parents: 646
diff changeset
  1050
function. This injection function in a sense undoes the ``damage''
Chengsong
parents: 646
diff changeset
  1051
of the derivatives chopping off characters.
Chengsong
parents: 646
diff changeset
  1052
In earlier work, Ausaf et al provided the formal
Chengsong
parents: 646
diff changeset
  1053
specification of what POSIX matching means and proved in Isabelle/HOL
Chengsong
parents: 646
diff changeset
  1054
the correctness
Chengsong
parents: 646
diff changeset
  1055
of this extended algorithm accordingly
Chengsong
parents: 646
diff changeset
  1056
\cite{AusafDyckhoffUrban2016}.
Chengsong
parents: 646
diff changeset
  1057
Chengsong
parents: 646
diff changeset
  1058
The version of the algorithm proven correct
Chengsong
parents: 646
diff changeset
  1059
suffers however heavily from a 
Chengsong
parents: 646
diff changeset
  1060
second difficulty, where derivatives can
Chengsong
parents: 646
diff changeset
  1061
grow to arbitrarily big sizes. 
Chengsong
parents: 646
diff changeset
  1062
For example if we start with the
Chengsong
parents: 646
diff changeset
  1063
regular expression $(a+aa)^*$ and take
Chengsong
parents: 646
diff changeset
  1064
successive derivatives according to the character $a$, we end up with
Chengsong
parents: 646
diff changeset
  1065
a sequence of ever-growing derivatives like 
Chengsong
parents: 646
diff changeset
  1066
Chengsong
parents: 646
diff changeset
  1067
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
Chengsong
parents: 646
diff changeset
  1068
\begin{center}
Chengsong
parents: 646
diff changeset
  1069
\begin{tabular}{rll}
Chengsong
parents: 646
diff changeset
  1070
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
  1071
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
  1072
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
Chengsong
parents: 646
diff changeset
  1073
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
  1074
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
Chengsong
parents: 646
diff changeset
  1075
\end{tabular}
Chengsong
parents: 646
diff changeset
  1076
\end{center}
Chengsong
parents: 646
diff changeset
  1077
 
Chengsong
parents: 646
diff changeset
  1078
\noindent where after around 35 steps we usually run out of memory on a
Chengsong
parents: 646
diff changeset
  1079
typical computer.  Clearly, the
Chengsong
parents: 646
diff changeset
  1080
notation involving $\ZERO$s and $\ONE$s already suggests
Chengsong
parents: 646
diff changeset
  1081
simplification rules that can be applied to regular regular
Chengsong
parents: 646
diff changeset
  1082
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
Chengsong
parents: 646
diff changeset
  1083
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
Chengsong
parents: 646
diff changeset
  1084
r$. While such simple-minded simplifications have been proved in 
Chengsong
parents: 646
diff changeset
  1085
the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
Chengsong
parents: 646
diff changeset
  1086
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
Chengsong
parents: 646
diff changeset
  1087
\emph{not} help with limiting the growth of the derivatives shown
Chengsong
parents: 646
diff changeset
  1088
above: the growth is slowed, but the derivatives can still grow rather
Chengsong
parents: 646
diff changeset
  1089
quickly beyond any finite bound.
Chengsong
parents: 646
diff changeset
  1090
Chengsong
parents: 646
diff changeset
  1091
Therefore we want to look in this thesis at a second
Chengsong
parents: 646
diff changeset
  1092
algorithm by Sulzmann and Lu where they
Chengsong
parents: 646
diff changeset
  1093
overcame this ``growth problem'' \cite{Sulzmann2014}.
Chengsong
parents: 646
diff changeset
  1094
In this version, POSIX values are 
Chengsong
parents: 646
diff changeset
  1095
represented as bit sequences and such sequences are incrementally generated
Chengsong
parents: 646
diff changeset
  1096
when derivatives are calculated. The compact representation
Chengsong
parents: 646
diff changeset
  1097
of bit sequences and regular expressions allows them to define a more
Chengsong
parents: 646
diff changeset
  1098
``aggressive'' simplification method that keeps the size of the
Chengsong
parents: 646
diff changeset
  1099
derivatives finite no matter what the length of the string is.
Chengsong
parents: 646
diff changeset
  1100
They make some informal claims about the correctness and linear behaviour
Chengsong
parents: 646
diff changeset
  1101
of this version, but do not provide any supporting proof arguments, not
Chengsong
parents: 646
diff changeset
  1102
even ``pencil-and-paper'' arguments. They write about their bit-coded
Chengsong
parents: 646
diff changeset
  1103
\emph{incremental parsing method} (that is the algorithm to be formalised
Chengsong
parents: 646
diff changeset
  1104
in this dissertation)
Chengsong
parents: 646
diff changeset
  1105
Chengsong
parents: 646
diff changeset
  1106
Chengsong
parents: 646
diff changeset
  1107
  
Chengsong
parents: 646
diff changeset
  1108
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1109
  ``Correctness Claim: We further claim that the incremental parsing
Chengsong
parents: 646
diff changeset
  1110
  method [..] in combination with the simplification steps [..]
Chengsong
parents: 646
diff changeset
  1111
  yields POSIX parse trees. We have tested this claim
Chengsong
parents: 646
diff changeset
  1112
  extensively [..] but yet
Chengsong
parents: 646
diff changeset
  1113
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
Chengsong
parents: 646
diff changeset
  1114
\end{quote}  
Chengsong
parents: 646
diff changeset
  1115
Ausaf and Urban made some initial progress towards the 
Chengsong
parents: 646
diff changeset
  1116
full correctness proof but still had to leave out the optimisation
Chengsong
parents: 646
diff changeset
  1117
Sulzmann and Lu proposed.
Chengsong
parents: 646
diff changeset
  1118
Ausaf  wrote \cite{Ausaf},
Chengsong
parents: 646
diff changeset
  1119
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1120
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
Chengsong
parents: 646
diff changeset
  1121
\end{quote}  
Chengsong
parents: 646
diff changeset
  1122
This thesis implements the aggressive simplifications envisioned
Chengsong
parents: 646
diff changeset
  1123
by Ausaf and Urban,
Chengsong
parents: 646
diff changeset
  1124
together with a formal proof of the correctness of those simplifications.
Chengsong
parents: 646
diff changeset
  1125
Chengsong
parents: 646
diff changeset
  1126
Chengsong
parents: 646
diff changeset
  1127
One of the most recent work in the context of lexing
Chengsong
parents: 646
diff changeset
  1128
%with this issue
Chengsong
parents: 646
diff changeset
  1129
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
Chengsong
parents: 646
diff changeset
  1130
This is relevant work for us and we will compare it later with 
Chengsong
parents: 646
diff changeset
  1131
our derivative-based matcher we are going to present.
Chengsong
parents: 646
diff changeset
  1132
There is also some newer work called
Chengsong
parents: 646
diff changeset
  1133
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
Chengsong
parents: 646
diff changeset
  1134
but deterministic finite automaton instead.
Chengsong
parents: 646
diff changeset
  1135
We will also study this work in a later section.
Chengsong
parents: 646
diff changeset
  1136
%An example that gives problem to automaton approaches would be
Chengsong
parents: 646
diff changeset
  1137
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
Chengsong
parents: 646
diff changeset
  1138
%It requires at least $2^{n+1}$ states to represent
Chengsong
parents: 646
diff changeset
  1139
%as a DFA.
Chengsong
parents: 646
diff changeset
  1140
Chengsong
parents: 646
diff changeset
  1141
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1142
\section{Basic Concepts}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1143
Formal language theory usually starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1144
denoting a set of characters.
637
Chengsong
parents: 628
diff changeset
  1145
Here we use the datatype of characters from Isabelle,
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1146
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
  1147
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1148
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1149
Then using the usual bracket notation for lists,
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1150
we can define strings made up of characters as: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1151
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1152
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1153
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1154
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1155
\end{center}
583
Chengsong
parents: 579
diff changeset
  1156
where $c$ is a variable ranging over characters.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1157
The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1158
list.
637
Chengsong
parents: 628
diff changeset
  1159
For brevity, a singleton list is sometimes written as $[c]$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1160
Strings can be concatenated to form longer strings in the same
637
Chengsong
parents: 628
diff changeset
  1161
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1162
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1163
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1164
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1165
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1166
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1167
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1168
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1169
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1170
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1171
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1172
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1173
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1174
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1175
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1176
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1177
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1178
\end{center}
564
Chengsong
parents: 543
diff changeset
  1179
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
  1180
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1181
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1182
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1183
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1184
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1185
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1186
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1187
\noindent
564
Chengsong
parents: 543
diff changeset
  1188
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1189
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1190
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1191
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1192
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1193
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1194
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
  1195
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1196
564
Chengsong
parents: 543
diff changeset
  1197
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1198
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1199
\end{center}
564
Chengsong
parents: 543
diff changeset
  1200
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1201
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1202
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1203
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1204
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1205
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1206
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1207
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1208
\noindent
583
Chengsong
parents: 579
diff changeset
  1209
This can be generalised to ``chopping off'' a string 
Chengsong
parents: 579
diff changeset
  1210
from all strings within a set $A$, 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1211
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1212
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1213
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1214
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1215
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1216
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1217
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1218
which is essentially the left quotient $A \backslash L$ of $A$ against 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1219
the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1220
However, for our purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1221
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1222
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1223
The reason for defining derivatives
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1224
is that they provide another approach
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1225
to test membership of a string in 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1226
a set of strings. 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1227
For example, to test whether the string
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1228
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1229
respect to the string $bar$:
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1230
\begin{center}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1231
\begin{tabular}{lll}
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1232
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1233
	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1234
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1235
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1236
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1237
\end{tabular}	
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1238
\end{center}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1239
\noindent
637
Chengsong
parents: 628
diff changeset
  1240
and in the end, test whether the set
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1241
contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1242
	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1243
	clear from the context that we are operating
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1244
on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1245
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1246
In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1247
then we can test whether $s$ is in $S$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1248
by testing whether $[] \in S \backslash s$.
564
Chengsong
parents: 543
diff changeset
  1249
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1250
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1251
sub-languages.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1252
For example, for the sequence operator, we have
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1253
something similar to a ``chain rule'':
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1254
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1255
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1256
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1257
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1258
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1259
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1260
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1261
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1262
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1263
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1264
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1265
and get to $B$.
583
Chengsong
parents: 579
diff changeset
  1266
The language derivative for $A*$ can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1267
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1268
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1269
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1270
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1271
\begin{proof}
583
Chengsong
parents: 579
diff changeset
  1272
There are two inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1273
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1274
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1275
The set 
637
Chengsong
parents: 628
diff changeset
  1276
\[ S_1 = \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1277
is enclosed in the set
637
Chengsong
parents: 628
diff changeset
  1278
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
Chengsong
parents: 628
diff changeset
  1279
This is because for any string $c::s$ satisfying $c::s \in A*$,
Chengsong
parents: 628
diff changeset
  1280
%whenever you have a string starting with a character 
Chengsong
parents: 628
diff changeset
  1281
%in the language of a Kleene star $A*$, 
Chengsong
parents: 628
diff changeset
  1282
%then that
Chengsong
parents: 628
diff changeset
  1283
the character $c$, together with a prefix of $s$
Chengsong
parents: 628
diff changeset
  1284
%immediately after $c$ 
Chengsong
parents: 628
diff changeset
  1285
forms the first iteration of $A*$, 
Chengsong
parents: 628
diff changeset
  1286
and the rest of the $s$ is also $A*$.
Chengsong
parents: 628
diff changeset
  1287
This coincides with the definition of $S_2$.
564
Chengsong
parents: 543
diff changeset
  1288
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1289
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1290
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
583
Chengsong
parents: 579
diff changeset
  1291
holds.
Chengsong
parents: 579
diff changeset
  1292
Also the following holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1293
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
  1294
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
  1295
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
  1296
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1297
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1298
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1299
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1300
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1301
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1302
for regular expressions.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1303
To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1304
which we shall do next.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1305
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1306
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
  1307
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1308
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1309
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1310
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1311
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1312
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1313
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1314
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1315
\noindent
564
Chengsong
parents: 543
diff changeset
  1316
We call them basic because we will introduce
637
Chengsong
parents: 628
diff changeset
  1317
additional constructors in later chapters, such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1318
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
  1319
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
  1320
matches no string, and $\ONE$ for the regular
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1321
expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1322
Some authors
564
Chengsong
parents: 543
diff changeset
  1323
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1324
but we prefer this notation.} 
564
Chengsong
parents: 543
diff changeset
  1325
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
  1326
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
  1327
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
  1328
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
  1329
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
  1330
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
  1331
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1332
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1333
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1334
\begin{tabular}{lcl}
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1335
$L \; \ZERO$ & $\dn$ & $\varnothing$\\
564
Chengsong
parents: 543
diff changeset
  1336
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
  1337
$L \; c$ & $\dn$ & $\{[c]\}$\\
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1338
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1339
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1340
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1341
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1342
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1343
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1344
%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1345
%their language interpretations in place, we are ready to define derivatives on regular expressions.
637
Chengsong
parents: 628
diff changeset
  1346
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1347
We do so by first introducing what properties they should satisfy.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1348
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1349
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
  1350
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
  1351
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
  1352
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
  1353
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
  1354
%will look as if it was obtained by doing a language derivative on $L(r)$:
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1355
%Recall that the language derivative acts on a 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1356
%language (set of strings).
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1357
%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1358
%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1359
%that string and then checking whether the empty 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1360
%string is in the derivative:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1361
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1362
%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1363
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1364
%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1365
%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1366
%  \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1367
%  \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1368
%  \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1369
%    \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1370
%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1371
%    \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1372
%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1373
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1374
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1375
%\notate{$\{ \ldots ,\;$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1376
%	\notate{s}{1}{$(c_1 :: s_1)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1377
%	$, \; \ldots \}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1378
%}{1}{$S_{start}$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1379
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1380
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1381
%	$\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1382
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1383
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1384
%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1385
%	$,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1386
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1387
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1388
%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1389
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1390
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1391
%	$\{ \ldots,\;  s_2
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1392
%	,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1393
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1394
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1395
%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1396
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1397
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1398
%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1399
%	S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1400
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1401
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1402
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1403
%	$s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1404
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1405
%\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1406
Brzozowski noticed that $\Der$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1407
can be ``mirrored'' on regular expressions which
564
Chengsong
parents: 543
diff changeset
  1408
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
  1409
with respect to a character $c$, written
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1410
$r \backslash c$. This infix operator
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1411
takes regular expression $r$ as input
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1412
and a character as a right operand.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1413
The derivative operation on regular expression
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1414
is defined such that the language of the derivative result 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1415
coincides with the language of the original 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1416
regular expression being taken 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1417
derivative with respect to the same characters, namely
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1418
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1419
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1420
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1421
	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1422
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1423
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1424
\noindent
637
Chengsong
parents: 628
diff changeset
  1425
Next, we give the recursive definition of derivative on
Chengsong
parents: 628
diff changeset
  1426
regular expressions so that it satisfies the properties above.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1427
%The derivative function, written $r\backslash c$, 
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1428
%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1429
%returns a new regular expression representing
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1430
%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1431
%being taken the language derivative with respect to $c$.
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1432
\begin{table}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1433
	\begin{center}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1434
\begin{tabular}{lcl}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1435
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1436
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1437
		$d \backslash c$     & $\dn$ & 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1438
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1439
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1440
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1441
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1442
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1443
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1444
\end{tabular}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1445
	\end{center}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1446
\caption{Derivative on Regular Expressions}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1447
\label{table:der}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1448
\end{table}
564
Chengsong
parents: 543
diff changeset
  1449
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1450
The most involved cases are the sequence case
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1451
and the star case.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1452
The sequence case says that if the first regular expression
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1453
contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1454
needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1455
result of this derivative:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1456
\begin{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1457
	\begin{tabular}{lcl}
583
Chengsong
parents: 579
diff changeset
  1458
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
Chengsong
parents: 579
diff changeset
  1459
		$\textit{if}\;\,([] \in L(r_1))\; 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1460
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1461
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1462
	\end{tabular}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1463
\end{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1464
\noindent
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1465
Notice how this closely resembles
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1466
the language derivative operation $\Der$:
564
Chengsong
parents: 543
diff changeset
  1467
\begin{center}
Chengsong
parents: 543
diff changeset
  1468
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1469
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
  1470
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
  1471
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
  1472
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
  1473
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
  1474
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1475
\end{center}
Chengsong
parents: 543
diff changeset
  1476
\noindent
583
Chengsong
parents: 579
diff changeset
  1477
The derivative of the star regular expression $r^*$ 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1478
unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1479
and attaches the original $r^*$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1480
after $r\backslash c$, so that 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1481
we can further unfold it as many times as needed:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1482
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1483
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1484
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1485
Again,
637
Chengsong
parents: 628
diff changeset
  1486
the structure is the same as the language derivative of the Kleene star: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1487
\[
564
Chengsong
parents: 543
diff changeset
  1488
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1489
\]
564
Chengsong
parents: 543
diff changeset
  1490
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
  1491
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
  1492
boolean condition 
Chengsong
parents: 543
diff changeset
  1493
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
  1494
somehow recursively computed.
Chengsong
parents: 543
diff changeset
  1495
We call such a function that checks
Chengsong
parents: 543
diff changeset
  1496
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
  1497
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
  1498
\begin{center}
Chengsong
parents: 543
diff changeset
  1499
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1500
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
  1501
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1502
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
  1503
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1504
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1505
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1506
		\end{tabular}
Chengsong
parents: 543
diff changeset
  1507
\end{center}
Chengsong
parents: 543
diff changeset
  1508
\noindent
Chengsong
parents: 543
diff changeset
  1509
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
  1510
does not contain any string and
Chengsong
parents: 543
diff changeset
  1511
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
  1512
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
  1513
by definition. 
Chengsong
parents: 543
diff changeset
  1514
The character regular expression $c$
Chengsong
parents: 543
diff changeset
  1515
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
  1516
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
  1517
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
  1518
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
  1519
The sequence regular expression
Chengsong
parents: 543
diff changeset
  1520
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
  1521
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
  1522
is always nullable because it naturally
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1523
contains the empty string.  
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1524
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1525
We have the following two correspondences between 
564
Chengsong
parents: 543
diff changeset
  1526
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
  1527
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
  1528
\begin{lemma}\label{derDer}
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1529
	\mbox{}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1530
	\begin{itemize}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1531
		\item
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1532
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1533
\item
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1534
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1535
	\end{itemize}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1536
\end{lemma}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1537
\begin{proof}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1538
	By induction on $r$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1539
\end{proof}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1540
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1541
which are the main properties of derivatives
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1542
that enables us later to reason about the correctness of
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1543
derivative-based matching.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1544
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1545
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1546
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1547
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1548
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1549
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583
Chengsong
parents: 579
diff changeset
  1550
$r \backslash_s [\,] $ & $\dn$ & $r$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1551
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1552
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1553
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1554
\noindent
564
Chengsong
parents: 543
diff changeset
  1555
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
  1556
omit the subscript and use $\backslash$ instead
583
Chengsong
parents: 579
diff changeset
  1557
of $\backslash_s$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1558
string derivatives for brevity.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1559
Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1560
regular-expression matching algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1561
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1562
\begin{definition}
564
Chengsong
parents: 543
diff changeset
  1563
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1564
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1565
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1566
\noindent
637
Chengsong
parents: 628
diff changeset
  1567
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
Chengsong
parents: 628
diff changeset
  1568
this algorithm, presented graphically, is as follows:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1569
601
Chengsong
parents: 591
diff changeset
  1570
\begin{equation}\label{matcher}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1571
\begin{tikzcd}
583
Chengsong
parents: 579
diff changeset
  1572
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
Chengsong
parents: 579
diff changeset
  1573
r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
Chengsong
parents: 579
diff changeset
  1574
\;\textrm{true}/\textrm{false}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1575
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1576
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1577
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1578
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1579
 It is relatively
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1580
easy to show that this matcher is correct, namely
539
Chengsong
parents: 538
diff changeset
  1581
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
  1582
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
  1583
\end{lemma}
Chengsong
parents: 538
diff changeset
  1584
\begin{proof}
637
Chengsong
parents: 628
diff changeset
  1585
	By induction on $s$ using the property of derivatives:
583
Chengsong
parents: 579
diff changeset
  1586
	lemma \ref{derDer}.
539
Chengsong
parents: 538
diff changeset
  1587
\end{proof}
601
Chengsong
parents: 591
diff changeset
  1588
\begin{figure}
564
Chengsong
parents: 543
diff changeset
  1589
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1590
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1591
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1592
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1593
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1594
    %ymode = log,
539
Chengsong
parents: 538
diff changeset
  1595
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
  1596
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1597
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1598
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
  1599
\end{axis}
Chengsong
parents: 538
diff changeset
  1600
\end{tikzpicture} 
583
Chengsong
parents: 579
diff changeset
  1601
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
Chengsong
parents: 579
diff changeset
  1602
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
Chengsong
parents: 579
diff changeset
  1603
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601
Chengsong
parents: 591
diff changeset
  1604
\end{center}
539
Chengsong
parents: 538
diff changeset
  1605
\end{figure}
Chengsong
parents: 538
diff changeset
  1606
\noindent
564
Chengsong
parents: 543
diff changeset
  1607
If we implement the above algorithm naively, however,
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
  1608
the algorithm can be as slow as backtracking lexers, as shown in 
564
Chengsong
parents: 543
diff changeset
  1609
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
  1610
Note that both axes are in logarithmic scale.
637
Chengsong
parents: 628
diff changeset
  1611
Around two dozen characters
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1612
this algorithm already ``explodes'' with the regular expression 
564
Chengsong
parents: 543
diff changeset
  1613
$(a^*)^*b$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1614
To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1615
rules for the intermediate results,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1616
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539
Chengsong
parents: 538
diff changeset
  1617
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
  1618
language of the regular expression.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1619
One simple-minded simplification function
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1620
that achieves these requirements 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1621
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564
Chengsong
parents: 543
diff changeset
  1622
\begin{center}
Chengsong
parents: 543
diff changeset
  1623
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1624
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
  1625
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1626
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1627
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1628
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1629
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1630
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
  1631
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1632
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1633
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1634
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583
Chengsong
parents: 579
diff changeset
  1635
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564
Chengsong
parents: 543
diff changeset
  1636
				   
Chengsong
parents: 543
diff changeset
  1637
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1638
\end{center}
Chengsong
parents: 543
diff changeset
  1639
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
  1640
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
  1641
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
  1642
\begin{center}
Chengsong
parents: 543
diff changeset
  1643
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1644
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
  1645
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
  1646
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
  1647
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1648
\end{center}
Chengsong
parents: 543
diff changeset
  1649
\begin{figure}
539
Chengsong
parents: 538
diff changeset
  1650
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1651
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1652
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1653
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1654
    %ymode = log,
Chengsong
parents: 591
diff changeset
  1655
    %xmode = log,
564
Chengsong
parents: 543
diff changeset
  1656
    grid = both,
539
Chengsong
parents: 538
diff changeset
  1657
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
  1658
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1659
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1660
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
  1661
\end{axis}
564
Chengsong
parents: 543
diff changeset
  1662
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
  1663
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
  1664
against 
Chengsong
parents: 543
diff changeset
  1665
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
  1666
\end{figure}
Chengsong
parents: 543
diff changeset
  1667
\noindent
Chengsong
parents: 543
diff changeset
  1668
The running time of $\textit{ders}\_\textit{simp}$
583
Chengsong
parents: 579
diff changeset
  1669
on the same example of Figure \ref{NaiveMatcher}
Chengsong
parents: 579
diff changeset
  1670
is now ``tame''  in terms of the length of inputs,
Chengsong
parents: 579
diff changeset
  1671
as shown in Figure \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
  1672
637
Chengsong
parents: 628
diff changeset
  1673
So far, the story is use Brzozowski derivatives and
Chengsong
parents: 628
diff changeset
  1674
simplify as much as possible, and at the end test
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1675
whether the empty string is recognised 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1676
by the final derivative.
583
Chengsong
parents: 579
diff changeset
  1677
But what if we want to 
Chengsong
parents: 579
diff changeset
  1678
do lexing instead of just getting a true/false answer?
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1679
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1680
elegant (arguably as beautiful as the definition of the 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1681
Brzozowski derivative) solution for this.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1682
539
Chengsong
parents: 538
diff changeset
  1683
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
  1684
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
  1685
algorithm.
Chengsong
parents: 543
diff changeset
  1686
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
  1687
respect to the input string,
Chengsong
parents: 543
diff changeset
  1688
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
  1689
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
  1690
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
  1691
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
  1692
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
  1693
called \emph{value} or
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1694
sometimes also \emph{lexical value}.  
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1695
Values and regular
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1696
expressions correspond to each other 
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1697
as illustrated in the following
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1698
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1699
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1700
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1701
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1702
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1703
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1704
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1705
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1706
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1707
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1708
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1709
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1710
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1711
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1712
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1713
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1714
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1715
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1716
			&        & $\Empty$   \\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1717
			& $\mid$ & $\Char \; c$          \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1718
			& $\mid$ & $\Seq\,v_1\, v_2$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1719
			& $\mid$ & $\Left \;v$   \\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1720
			& $\mid$ & $\Right\;v$  \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1721
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1722
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1723
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1724
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1725
\noindent
564
Chengsong
parents: 543
diff changeset
  1726
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
  1727
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
  1728
\begin{center}
Chengsong
parents: 543
diff changeset
  1729
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1730
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
  1731
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1732
		$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1733
		$|\Left \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1734
		$|\Right \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1735
		$|\Stars \; []|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1736
		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
564
Chengsong
parents: 543
diff changeset
  1737
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1738
\end{center}
Chengsong
parents: 543
diff changeset
  1739
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
  1740
to indicate that a value $v$ could be generated from a lexing algorithm
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1741
with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1742
defined by the rules.
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1743
\begin{figure}[H]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1744
\begin{mathpar}
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1745
	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564
Chengsong
parents: 543
diff changeset
  1746
Chengsong
parents: 543
diff changeset
  1747
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
  1748
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1749
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564
Chengsong
parents: 543
diff changeset
  1750
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1751
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564
Chengsong
parents: 543
diff changeset
  1752
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1753
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1754
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1755
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars \; vs : r^*}
564
Chengsong
parents: 543
diff changeset
  1756
\end{mathpar}
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1757
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1758
\end{figure}
564
Chengsong
parents: 543
diff changeset
  1759
\noindent
Chengsong
parents: 543
diff changeset
  1760
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
  1761
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
  1762
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
  1763
satisfying $|v| = s$ and $\vdash v:r$ is finite.
601
Chengsong
parents: 591
diff changeset
  1764
This additional condition was
Chengsong
parents: 591
diff changeset
  1765
imposed by Ausaf and Urban to make their proofs easier.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1766
Given a string and a regular expression, there can be
564
Chengsong
parents: 543
diff changeset
  1767
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
  1768
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
  1769
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
  1770
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
  1771
Lexers therefore have to disambiguate and choose only
637
Chengsong
parents: 628
diff changeset
  1772
one of the values to be generated. $\POSIX$ is one of the
564
Chengsong
parents: 543
diff changeset
  1773
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
  1774
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1775
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
564
Chengsong
parents: 543
diff changeset
  1776
formalised the property 
Chengsong
parents: 543
diff changeset
  1777
as a ternary relation.
Chengsong
parents: 543
diff changeset
  1778
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1779
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1780
in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1781
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1782
\begin{figure}[p]
564
Chengsong
parents: 543
diff changeset
  1783
\begin{mathpar}
Chengsong
parents: 543
diff changeset
  1784
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
  1785
		
Chengsong
parents: 543
diff changeset
  1786
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
  1787
Chengsong
parents: 543
diff changeset
  1788
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
  1789
Chengsong
parents: 543
diff changeset
  1790
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
  1791
Chengsong
parents: 543
diff changeset
  1792
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
  1793
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
  1794
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
  1795
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
  1796
Chengsong
parents: 543
diff changeset
  1797
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
  1798
Chengsong
parents: 543
diff changeset
  1799
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
  1800
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
  1801
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
  1802
	(v::vs)}
Chengsong
parents: 543
diff changeset
  1803
\end{mathpar}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1804
\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1805
	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1806
This ternary relation, written $(s, r) \rightarrow v$, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1807
formalises the POSIX constraints on the
601
Chengsong
parents: 591
diff changeset
  1808
value $v$ given a string $s$ and 
Chengsong
parents: 591
diff changeset
  1809
regular expression $r$.
Chengsong
parents: 591
diff changeset
  1810
}
646
Chengsong
parents: 639
diff changeset
  1811
\label{fig:POSIXDef}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1812
\end{figure}\afterpage{\clearpage}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1813
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1814
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1815
%\begin{figure}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1816
%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1817
%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1818
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1819
%	    (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1820
%	    {$r_{token1}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1821
%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1822
%	    %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1823
%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1824
%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1825
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1826
%	    (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1827
%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1828
%	    \nodepart{two}  $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1829
%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1830
%		\node [above = 1.5cm of middle, minimum width = 6cm, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1831
%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1832
%			(topNode) {$s$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1833
%	    \path[->,draw]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1834
%	        (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1835
%	        (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1836
%		;
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1837
%			
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1838
%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1839
%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1840
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1841
%\end{figure}
637
Chengsong
parents: 628
diff changeset
  1842
The above $\POSIX$ rules follow the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1843
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1844
	\item (Left Priority)\\
637
Chengsong
parents: 628
diff changeset
  1845
		Match the leftmost regular expression when multiple options for matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1846
		are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1847
		be in the language of $L \; r_1$.
564
Chengsong
parents: 543
diff changeset
  1848
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
  1849
		Always match a subpart as much as possible before proceeding
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1850
		to the next part of the string.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1851
		For example, when the string $s$ matches 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1852
		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1853
		Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1854
		$r_{part1}$ is preferred by this maximum munch rule.
637
Chengsong
parents: 628
diff changeset
  1855
		The side-condition 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1856
		\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1857
		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1858
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1859
		\end{center}
637
Chengsong
parents: 628
diff changeset
  1860
		in PS causes this.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1861
		%(See
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1862
		%\ref{munch} for an illustration).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1863
\end{itemize}
564
Chengsong
parents: 543
diff changeset
  1864
\noindent
Chengsong
parents: 543
diff changeset
  1865
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
  1866
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1867
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
  1868
\[ 
Chengsong
parents: 543
diff changeset
  1869
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
  1870
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1871
using a regular expression 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1872
for keywords and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1873
identifiers:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1874
%(for example, keyword is a nonempty string starting with letters 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1875
%followed by alphanumeric characters or underscores):
564
Chengsong
parents: 543
diff changeset
  1876
\[
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1877
	r_{keyword} + r_{identifier}.
564
Chengsong
parents: 543
diff changeset
  1878
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1879
If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1880
as an identifier
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1881
where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1882
followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1883
then a match with a keyword (if)
564
Chengsong
parents: 543
diff changeset
  1884
followed by
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1885
an identifier (foo) would be incorrect.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1886
POSIX lexing generates what is included by lexing.
564
Chengsong
parents: 543
diff changeset
  1887
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1888
\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1889
We know that a POSIX 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1890
value for regular expression $r$ is inhabited by $r$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1891
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1892
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1893
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1894
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1895
The main property about a $\POSIX$ value is that 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1896
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1897
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1898
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1899
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1900
\end{lemma}
539
Chengsong
parents: 538
diff changeset
  1901
\begin{proof}
564
Chengsong
parents: 543
diff changeset
  1902
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
  1903
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
  1904
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
  1905
the sequence and star with non-empty iterations.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1906
We shall give the details for proving the sequence case here.
539
Chengsong
parents: 538
diff changeset
  1907
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1908
When we have 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1909
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1910
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
601
Chengsong
parents: 591
diff changeset
  1911
	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1912
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1913
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1914
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1915
we know that the last condition 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1916
excludes the possibility of a 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1917
string $s_1'$ longer than $s_1$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1918
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1919
(s_1', r_1) \rightarrow v_1'   \;\; 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1920
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1921
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1922
hold.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1923
A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1924
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1925
(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1926
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1927
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1928
cannot possibly form a $\POSIX$ value either, because
637
Chengsong
parents: 628
diff changeset
  1929
by definition, there is a candidate
Chengsong
parents: 628
diff changeset
  1930
with a longer initial string
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1931
$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1932
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1933
$s$ must have the 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1934
property that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1935
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1936
	|a| = s_1 \;\; and \;\; |b| = s_2.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1937
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1938
The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1939
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1940
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1941
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1942
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539
Chengsong
parents: 538
diff changeset
  1943
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
  1944
\end{proof}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1945
\noindent
637
Chengsong
parents: 628
diff changeset
  1946
We have now defined what a POSIX value is and shown that it is unique.
Chengsong
parents: 628
diff changeset
  1947
The problem is to generate
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1948
such a value in a lexing algorithm using derivatives.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1949
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1950
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1951
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1952
Sulzmann and Lu extended Brzozowski's 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1953
derivative-based matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1954
to a lexing algorithm by a second phase 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1955
after the initial phase of successive derivatives.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1956
This second phase generates a POSIX value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1957
if the regular expression matches the string.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1958
The algorithm uses two functions called $\inj$ and $\mkeps$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1959
The function $\mkeps$ constructs a POSIX value from the last
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1960
derivative $r_n$:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1961
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1962
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1963
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1964
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1965
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1966
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1967
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1968
\end{ceqn}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1969
\noindent
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1970
In the above diagram, again we assume that
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1971
the input string $s$ is made of $n$ characters
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1972
$c_0c_1 \ldots c_{n-1}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1973
The last derivative operation 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1974
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1975
$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1976
tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1977
regular expression $r_n$, in a POSIX way.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1978
The definition of $\mkeps$ is
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1979
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1980
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1981
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1982
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1983
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1984
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1985
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1986
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564
Chengsong
parents: 543
diff changeset
  1987
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1988
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1989
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1990
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1991
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1992
\noindent 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1993
The function prefers the left child $r_1$ of $r_1 + r_2$ 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1994
to match an empty string if there is a choice.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1995
When there is a star to match the empty string,
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1996
we give the $\Stars$ constructor an empty list, meaning
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1997
no iteration is taken.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1998
The result of $\mkeps$ on a $\nullable$ $r$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1999
is a POSIX value for $r$ and the empty string:
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2000
\begin{lemma}\label{mePosix}
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2001
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2002
\end{lemma}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2003
\begin{proof}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  2004
	By induction on $r$.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2005
\end{proof}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2006
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  2007
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2008
in reverse order as they were chopped off in the derivative phase.
637
Chengsong
parents: 628
diff changeset
  2009
The function for this is called $\inj$. This function 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  2010
operates on values, unlike $\backslash$ which operates on regular expressions.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2011
In the diagram below, $v_i$ stands for the (POSIX) value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2012
for how the regular expression 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2013
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2014
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2015
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2016
matches $s$. 
601
Chengsong
parents: 591
diff changeset
  2017
\begin{figure}[H]
Chengsong
parents: 591
diff changeset
  2018
\begin{center}	
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2019
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2020
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2021
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2022
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2023
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2024
\end{ceqn}
601
Chengsong
parents: 591
diff changeset
  2025
\end{center}
Chengsong
parents: 591
diff changeset
  2026
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
Chengsong
parents: 591
diff changeset
  2027
	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
Chengsong
parents: 591
diff changeset
  2028
	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
Chengsong
parents: 591
diff changeset
  2029
	$c_1$, and so on. These are the same operations as they have appeared in the matcher
Chengsong
parents: 591
diff changeset
  2030
	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637
Chengsong
parents: 628
diff changeset
  2031
	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
Chengsong
parents: 628
diff changeset
  2032
	the empty string, by always selecting the leftmost 
Chengsong
parents: 628
diff changeset
  2033
	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
601
Chengsong
parents: 591
diff changeset
  2034
	appeared in the string, always preserving POSIXness.}\label{graph:inj}
Chengsong
parents: 591
diff changeset
  2035
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2036
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2037
The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2038
takes three arguments: a regular
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2039
expression ${r_{i}}$, before the character is chopped off, 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2040
a character ${c_{i}}$ (the character we want to inject back) and 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2041
the third argument $v_{i+1}$ the value we want to inject into. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2042
The result of an application 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2043
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2044
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2045
	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2046
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2047
holds.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  2048
The definition of $\textit{inj}$ is as follows: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2049
\begin{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2050
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2051
  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2052
  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2053
  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2054
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2055
  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2056
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2057
  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2058
  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2059
  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2060
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2061
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2062
\noindent 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2063
The function recurses on 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2064
the shape of regular
637
Chengsong
parents: 628
diff changeset
  2065
expressions and values.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2066
Intuitively, each clause analyses 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2067
how $r_i$ could have transformed when being 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2068
derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2069
of $v_{i+1}$ has the ``hole'' 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2070
to inject the character back into.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2071
Once the character is
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2072
injected back to that sub-value; 
637
Chengsong
parents: 628
diff changeset
  2073
$\inj$ assembles all parts
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2074
to form a new value.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2075
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2076
For instance, the last clause is an
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2077
injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2078
whose second child
637
Chengsong
parents: 628
diff changeset
  2079
value is a star and the shape of the 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2080
regular expression $r_i$ before injection 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2081
is a star.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2082
We therefore know 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2083
the derivative 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2084
starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2085
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2086
	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2087
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2088
during which an iteration of the star
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2089
had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2090
value inhabitation relation:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2091
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2092
	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2093
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2094
The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2095
matched star iterations,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2096
and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2097
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2098
	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2099
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2100
Finally, 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2101
$\inj \; r \;c \; v$ is prepended
637
Chengsong
parents: 628
diff changeset
  2102
to the previous list of iterations and then
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2103
wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2104
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2105
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2106
Putting all together, Sulzmann and Lu obtained the following algorithm
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2107
outlined in the injection-based lexing diagram \ref{graph:inj}:
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2108
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2109
\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2110
	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2111
	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2112
	& & $\quad \phantom{\mid}\; \None \implies \None$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2113
	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2114
\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2115
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2116
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2117
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2118
\subsection{Examples on How Injection and Lexer Works}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2119
We will provide a few examples on how $\inj$ and $\lexer$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2120
works by showing their values in each recursive call on some 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2121
concrete examples.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2122
We start with the call $\lexer \; (a+aa)^*\cdot c\; aac$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2123
on the lexer, note that the value's character constructor
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2124
$\Char \;c$ is abbreviated as $c$ for readability.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2125
Similarly the last derivative's sub-expression
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2126
is abbreviated as $r_{=0}$\footnote{which is equal to
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2127
$((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2128
((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*),
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2129
$.}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2130
whose language interpretation is equivalent to that of
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2131
$\ZERO$ and therefore not crucial to be displayed fully expanded,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2132
as they will not be injected into.%our example run.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2133
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2134
	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2135
		$(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2136
				   %& $\stackrel{\rightarrow}{\backslash a}$ & $((\ONE + \ONE a)\cdot (a+aa)^*)\cdot c + \ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2137
				    & $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$  &
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2138
				    $(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2139
				    & $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$  &
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2140
				    $((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2141
				    & $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2142
				   & $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ & 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2143
				   $\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2144
				   & $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ & 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2145
				   $\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) )
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2146
				   \;\Stars \,[]) \; \textcolor{red}{c})$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2147
				   & $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2148
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2149
				   %$\inj \; r \; c\A$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2150
		%$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2151
	\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2152
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2153
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2154
We have assigned different colours for each character,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2155
as well as their corresponding locations in values and regular expressions.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2156
The most recently injected character is marked with a bold font.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2157
%Their corresponding derivative steps have been marked with the same
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2158
%colour. We also mark the specific location where the coloured character
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2159
%was injected into with the same colour, before and after the injection occurred.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2160
%TODO: can be also used to motivate injection->blexer in Bitcoded1
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2161
To show the details of how $\inj$ works,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2162
we zoom in to the second injection above, 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2163
illustrating the recursive calls involved: 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2164
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2165
	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2166
$\inj \;\quad ((\ONE + {\ONE} 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2167
	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2168
	c + \ZERO \; \quad \textcolor{blue}{a} \; $ & &\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2169
$\quad \Left \; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2170
(\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2171
	\textcolor{blue}{\Empty})) \; \Stars \, [])) \; c))$ & \\$=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2172
	$\Left \; (\inj \; ((\ONE + \ONE 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2173
	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2174
	c \quad  \textcolor{blue}{a} \quad (\Left \; (\Seq\; ( \Left \; (\Seq \; (\Right \; (\Right\; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2175
	\textcolor{blue}{\Empty})) \; \Stars \, []) ) \; c ) )\;\;)$ &\\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2176
	$\Left \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^* \quad \textcolor{blue}{a} \quad
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2177
	(\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})))) ) \; c ) $ & \\$=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2178
	$\Left \; (\Seq \; (\Seq \; (\inj \quad (\ONE + \ONE \textcolor{blue}{a}) \quad \textcolor{blue}{a}\quad \Right \;(\Right \; \textcolor{blue}{\Empty}) ) \; \Stars \,[])\; c)$ &\\ $=$ \\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2179
	$\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; \ONE \textcolor{blue}{a} \quad \textcolor{blue}{a}\quad (\Right \;\textcolor{blue}{\Empty})))) \; \Stars \, [])$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2180
																		   & \\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2181
																		   $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty})) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2182
																		   & \\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2183
																		   $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a}) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2184
 	\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2185
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2186
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2187
We will now introduce the proofs related to $\inj$ and
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2188
$\lexer$. These proofs have originally been found by Ausaf et al.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2189
in their 2016 work \cite{AusafDyckhoffUrban2016}.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2190
Proofs to some of these lemmas have also been discussed in more detail in 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2191
Ausaf's thesis \cite{Ausaf}.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2192
Nevertheless, we still introduce these proofs, to make this thesis
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2193
self-contained as we show inductive variants used in these proofs break
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2194
after more simplifications are introduced.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2195
%Also note that lemmas like \ref{}described in this thesis is a more faithful
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2196
%representation of the actual accompanying Isabelle formalisation, and 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2197
%therefore 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2198
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2199
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2200
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2201
Recall that lemma 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2202
\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2203
$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2204
The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2205
the value before injection is POSIX, namely
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2206
\begin{lemma}\label{injPosix}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2207
	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2208
	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2209
\end{lemma}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2210
\begin{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2211
	By induction on $r$.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2212
	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2213
	When $r = a \cdot b$, there can be
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2214
	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2215
	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2216
	case.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2217
	\begin{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2218
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2219
			$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2220
			The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2221
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2222
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2223
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2224
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2225
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2226
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2227
			We know that there exists a unique pair of
637
Chengsong
parents: 628
diff changeset
  2228
			$s_a$ and $s_b$ satisfying	
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2229
				$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2230
				$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2231
				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2232
				L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2233
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2234
			The last condition gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2235
			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2236
				L \; a \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2237
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2238
			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2239
			and this gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2240
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2241
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2242
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2243
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2244
			$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2245
			The argument is almost identical to the above case,	
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2246
			except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2247
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2248
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2249
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2250
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2251
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2252
			\end{center}
637
Chengsong
parents: 628
diff changeset
  2253
			With similar reasoning, 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2254
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2255
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2256
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2257
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2258
			again holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2259
		\item 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2260
			$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2261
			Again the injection result would be 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2262
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2263
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2264
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2265
					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2266
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2267
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2268
			We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2269
			allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2270
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2271
				(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2272
			\]
637
Chengsong
parents: 628
diff changeset
  2273
			Also, by inductive hypothesis
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2274
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2275
				(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2276
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2277
			holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2278
			In addition, as
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2279
			$\Right \;v_b$  instead of $\Left \ldots$ is 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2280
			the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2281
			that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2282
			This tells us that 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2283
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2284
				\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2285
				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2286
				\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2287
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2288
			which translates to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2289
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2290
				\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2291
				s_3 @s_4 = c::s  \land s_3 \in L \; a 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2292
				\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2293
			\]
637
Chengsong
parents: 628
diff changeset
  2294
			(Which says there cannot be a longer 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2295
			initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2296
			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2297
			as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2298
	\end{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2299
	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2300
\end{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2301
\noindent
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2302
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2303
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2304
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2305
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2306
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2307
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2308
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  2309
%TODO: Cut from previous lexer def, need to make coherent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2310
The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2311
according to
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2312
POSIX rules. 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2313
\begin{theorem}\label{lexerCorrectness}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2314
	The $\lexer$ based on derivatives and injections is correct: 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2315
	\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2316
		\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2317
			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2318
			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2319
		\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2320
	\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2321
\end{theorem} 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2322
\begin{proof}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2323
By induction on $s$. $r$ generalising over an arbitrary regular expression.
637
Chengsong
parents: 628
diff changeset
  2324
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2325
by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2326
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2327
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2328
As we did earlier in this chapter with the matcher, one can 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2329
introduce simplification on the regular expression in each derivative step.
637
Chengsong
parents: 628
diff changeset
  2330
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
Chengsong
parents: 628
diff changeset
  2331
and ensure that
Chengsong
parents: 628
diff changeset
  2332
the values align with the regular expression at each step.
539
Chengsong
parents: 538
diff changeset
  2333
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2334
be careful not to break the correctness, as the injection 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2335
function heavily relies on the structure of 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2336
the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2337
This can be achieved by recording some extra rectification functions
637
Chengsong
parents: 628
diff changeset
  2338
during the derivatives step and applying these rectifications in 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2339
each run during the injection phase.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2340
With extra care
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2341
one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2342
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2343
\begin{center}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2344
	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2345
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2346
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2347
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2348
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2349
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2350
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2351
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2352
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2353
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2354
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2355
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2356
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2357
				   
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2358
	\end{tabular}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2359
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2360
637
Chengsong
parents: 628
diff changeset
  2361
However, one can still end up
Chengsong
parents: 628
diff changeset
  2362
with exploding derivatives,
Chengsong
parents: 628
diff changeset
  2363
even with the simple-minded simplification rules allowed
Chengsong
parents: 628
diff changeset
  2364
in an injection-based lexer.
Chengsong
parents: 628
diff changeset
  2365
\section{A Case Requiring More Aggressive Simplifications}
539
Chengsong
parents: 538
diff changeset
  2366
For example, when starting with the regular
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2367
expression $(a^* \cdot a^*)^*$ and building just over
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2368
a dozen successive derivatives 
539
Chengsong
parents: 538
diff changeset
  2369
w.r.t.~the character $a$, one obtains a derivative regular expression
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2370
with millions of nodes (when viewed as a tree)
637
Chengsong
parents: 628
diff changeset
  2371
even with the mentioned simplifications.
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2372
\begin{figure}[H]
601
Chengsong
parents: 591
diff changeset
  2373
\begin{center}
539
Chengsong
parents: 538
diff changeset
  2374
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  2375
\begin{axis}[
Chengsong
parents: 538
diff changeset
  2376
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  2377
    ylabel={size},
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2378
    legend entries={Simple-Minded Simp, Naive Matcher},  
539
Chengsong
parents: 538
diff changeset
  2379
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  2380
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  2381
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2382
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539
Chengsong
parents: 538
diff changeset
  2383
\end{axis}
Chengsong
parents: 538
diff changeset
  2384
\end{tikzpicture} 
601
Chengsong
parents: 591
diff changeset
  2385
\end{center}
539
Chengsong
parents: 538
diff changeset
  2386
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
  2387
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
  2388
   
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2389
That is because Sulzmann and Lu's 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2390
injection-based lexing algorithm keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  2391
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
  2392
These different ways of matching will grow exponentially with the string length.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2393
Consider the case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2394
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2395
	r= (a^*\cdot a^*)^* \quad and \quad
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2396
	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2397
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2398
as an example.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2399
This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2400
many ways to split up the string into multiple segments for
637
Chengsong
parents: 628
diff changeset
  2401
different star iterations,
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2402
and for each segment 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2403
multiple ways of splitting between 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2404
the two $a^*$ sub-expressions.
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2405
When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2406
the match:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2407
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2408
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2409
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2410
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2411
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2412
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2413
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2414
The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2415
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2416
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2417
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2418
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2419
correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2420
When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2421
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2422
	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2423
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2424
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2425
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2426
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2427
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2428
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2429
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2430
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2431
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2432
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2433
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2434
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2435
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2436
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2437
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2438
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2439
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2440
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2441
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2442
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2443
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2444
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2445
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2446
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2447
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2448
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2449
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2450
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2451
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2452
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2453
		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2454
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2455
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2456
And $\derssimp \; aa \; (a^*a^*)^*$ is
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2457
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2458
	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2459
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2460
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2461
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2462
seven distinct lexical values.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2463
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2464
It is not surprising that there are exponentially many 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2465
distinct lexical values that cannot be eliminated by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2466
the simple-minded simplification of $\derssimp$. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2467
A lexer without a good enough strategy to 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2468
deduplicate will naturally
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2469
have an exponential runtime on highly
637
Chengsong
parents: 628
diff changeset
  2470
ambiguous regular expressions because there
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2471
are exponentially many matches.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2472
For this particular example, it seems
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2473
that the number of distinct matches growth
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2474
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2475
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2476
On the other hand, the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2477
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2478
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2479
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2480
	\Stars\,
637
Chengsong
parents: 628
diff changeset
  2481
	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2482
\]
637
Chengsong
parents: 628
diff changeset
  2483
At any moment, the  subterms in a regular expression
Chengsong
parents: 628
diff changeset
  2484
that will potentially result in a POSIX value is only
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2485
a minority among the many other terms,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2486
and one can remove the ones that are not possible to 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2487
be POSIX.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2488
In the above example,
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2489
\begin{equation}\label{eqn:growth2}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2490
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2491
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2492
\end{equation}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2493
can be further simplified by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2494
removing the underlined term first,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2495
which would open up possibilities
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2496
of further simplification that removes the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2497
underbraced part.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2498
The result would be 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2499
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2500
	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2501
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2502
with corresponding values
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2503
\begin{center}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2504
	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2505
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2506
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2507
	\end{tabular}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2508
\end{center}
637
Chengsong
parents: 628
diff changeset
  2509
Other terms with an underlying value, such as
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2510
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2511
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2512
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2513
do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2514
and therefore can be thrown away.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2515
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2516
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
639
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2517
have come up with some simplification steps, 
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2518
and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2519
They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2520
\[
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2521
	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2522
\]
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2523
The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2524
involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2525
which we omit here.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2526
however those steps
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2527
are not yet sufficiently strong, to achieve the above effects.
637
Chengsong
parents: 628
diff changeset
  2528
And even with these relatively mild simplifications, the proof
Chengsong
parents: 628
diff changeset
  2529
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2530
One would need to prove something like this: 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2531
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2532
	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2533
	\textit{then}\;\; (r, c::s) \rightarrow 
637
Chengsong
parents: 628
diff changeset
  2534
	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2535
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2536
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2537
not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2538
but also what specific simplifications 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2539
have been done as a function on values
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2540
showing how one can transform the value
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2541
underlying the simplified regular expression
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2542
to the unsimplified one.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2543
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2544
We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2545
also described by Sulzmann and Lu to
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2546
get better simplifications, which uses
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2547
some augmented data structures compared to 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2548
plain regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2549
We call them \emph{annotated}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2550
regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2551
With annotated regular expressions,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2552
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2553
second phase altogether.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2554
We introduce this new datatype and the 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2555
corresponding algorithm in the next chapter.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2556
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2557
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2558
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2559