ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Sun, 09 Jul 2023 00:29:02 +0100
changeset 657 00171b627b8d
parent 651 deb35fd780fe
child 666 6da4516ea87d
permissions -rwxr-xr-x
Fixed some annotated/unannotated a/r notation inconsistencies.
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    10
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    11
In this chapter, we define the basic notions 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    12
for regular languages and regular expressions.
583
Chengsong
parents: 579
diff changeset
    13
This is essentially a description in ``English''
637
Chengsong
parents: 628
diff changeset
    14
the functions and datatypes of our formalisation in Isabelle/HOL.
Chengsong
parents: 628
diff changeset
    15
We also define what $\POSIX$ lexing means, 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
    16
followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
564
Chengsong
parents: 543
diff changeset
    17
that produces the output conforming
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    18
to the $\POSIX$ standard\footnote{In what follows 
583
Chengsong
parents: 579
diff changeset
    19
we choose to use the Isabelle-style notation
564
Chengsong
parents: 543
diff changeset
    20
for function applications, where
583
Chengsong
parents: 579
diff changeset
    21
the parameters of a function are not enclosed
564
Chengsong
parents: 543
diff changeset
    22
inside a pair of parentheses (e.g. $f \;x \;y$
Chengsong
parents: 543
diff changeset
    23
instead of $f(x,\;y)$). This is mainly
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    24
to make the text visually more concise.}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    25
649
Chengsong
parents: 646
diff changeset
    26
Chengsong
parents: 646
diff changeset
    27
\section{Technical Overview}
Chengsong
parents: 646
diff changeset
    28
Chengsong
parents: 646
diff changeset
    29
Consider for example the regular expression $(a^*)^*\,b$ and 
Chengsong
parents: 646
diff changeset
    30
strings of the form $aa..a$. These strings cannot be matched by this regular
Chengsong
parents: 646
diff changeset
    31
expression: Obviously the expected $b$ in the last
Chengsong
parents: 646
diff changeset
    32
position is missing. One would assume that modern regular expression
Chengsong
parents: 646
diff changeset
    33
matching engines can find this out very quickly. Surprisingly, if one tries
Chengsong
parents: 646
diff changeset
    34
this example in JavaScript, Python or Java 8, even with small strings, 
Chengsong
parents: 646
diff changeset
    35
say of lenght of around 30 $a$'s,
Chengsong
parents: 646
diff changeset
    36
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
Chengsong
parents: 646
diff changeset
    37
The algorithms clearly show exponential behaviour, and as can be seen
Chengsong
parents: 646
diff changeset
    38
is triggered by some relatively simple regular expressions.
Chengsong
parents: 646
diff changeset
    39
Java 9 and newer
Chengsong
parents: 646
diff changeset
    40
versions improve this behaviour somewhat, but are still slow compared 
Chengsong
parents: 646
diff changeset
    41
with the approach we are going to use in this thesis.
Chengsong
parents: 646
diff changeset
    42
Chengsong
parents: 646
diff changeset
    43
Chengsong
parents: 646
diff changeset
    44
Chengsong
parents: 646
diff changeset
    45
This superlinear blowup in regular expression engines
Chengsong
parents: 646
diff changeset
    46
has caused grief in ``real life'' where it is 
Chengsong
parents: 646
diff changeset
    47
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
Chengsong
parents: 646
diff changeset
    48
For example, on 20 July 2016 one evil
Chengsong
parents: 646
diff changeset
    49
regular expression brought the webpage
Chengsong
parents: 646
diff changeset
    50
\href{http://stackexchange.com}{Stack Exchange} to its
Chengsong
parents: 646
diff changeset
    51
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
Chengsong
parents: 646
diff changeset
    52
In this instance, a regular expression intended to just trim white
Chengsong
parents: 646
diff changeset
    53
spaces from the beginning and the end of a line actually consumed
Chengsong
parents: 646
diff changeset
    54
massive amounts of CPU resources---causing the web servers to grind to a
Chengsong
parents: 646
diff changeset
    55
halt. In this example, the time needed to process
Chengsong
parents: 646
diff changeset
    56
the string was 
Chengsong
parents: 646
diff changeset
    57
$O(n^2)$ with respect to the string length $n$. This
Chengsong
parents: 646
diff changeset
    58
quadratic overhead was enough for the homepage of Stack Exchange to
Chengsong
parents: 646
diff changeset
    59
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
Chengsong
parents: 646
diff changeset
    60
attack and therefore stopped the servers from responding to any
Chengsong
parents: 646
diff changeset
    61
requests. This made the whole site become unavailable. 
Chengsong
parents: 646
diff changeset
    62
Chengsong
parents: 646
diff changeset
    63
\begin{figure}[p]
Chengsong
parents: 646
diff changeset
    64
\begin{center}
Chengsong
parents: 646
diff changeset
    65
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
Chengsong
parents: 646
diff changeset
    66
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    67
\begin{axis}[
Chengsong
parents: 646
diff changeset
    68
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    69
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    70
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    71
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    72
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    73
    xmax=33,
Chengsong
parents: 646
diff changeset
    74
    ymax=35,
Chengsong
parents: 646
diff changeset
    75
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    76
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    77
    axis lines=left,
Chengsong
parents: 646
diff changeset
    78
    width=5cm,
Chengsong
parents: 646
diff changeset
    79
    height=4cm, 
Chengsong
parents: 646
diff changeset
    80
    legend entries={JavaScript},  
Chengsong
parents: 646
diff changeset
    81
    legend pos=north west,
Chengsong
parents: 646
diff changeset
    82
    legend cell align=left]
Chengsong
parents: 646
diff changeset
    83
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
Chengsong
parents: 646
diff changeset
    84
\end{axis}
Chengsong
parents: 646
diff changeset
    85
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
    86
  &
Chengsong
parents: 646
diff changeset
    87
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    88
\begin{axis}[
Chengsong
parents: 646
diff changeset
    89
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    90
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    91
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    92
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    93
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    94
    xmax=33,
Chengsong
parents: 646
diff changeset
    95
    ymax=35,
Chengsong
parents: 646
diff changeset
    96
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    97
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    98
    axis lines=left,
Chengsong
parents: 646
diff changeset
    99
    width=5cm,
Chengsong
parents: 646
diff changeset
   100
    height=4cm, 
Chengsong
parents: 646
diff changeset
   101
    legend entries={Python},  
Chengsong
parents: 646
diff changeset
   102
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   103
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   104
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
Chengsong
parents: 646
diff changeset
   105
\end{axis}
Chengsong
parents: 646
diff changeset
   106
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   107
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   108
\begin{axis}[
Chengsong
parents: 646
diff changeset
   109
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   110
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   111
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   112
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   113
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   114
    xmax=33,
Chengsong
parents: 646
diff changeset
   115
    ymax=35,
Chengsong
parents: 646
diff changeset
   116
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   117
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   118
    axis lines=left,
Chengsong
parents: 646
diff changeset
   119
    width=5cm,
Chengsong
parents: 646
diff changeset
   120
    height=4cm, 
Chengsong
parents: 646
diff changeset
   121
    legend entries={Java 8},  
Chengsong
parents: 646
diff changeset
   122
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   123
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   124
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
Chengsong
parents: 646
diff changeset
   125
\end{axis}
Chengsong
parents: 646
diff changeset
   126
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   127
  &
Chengsong
parents: 646
diff changeset
   128
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   129
\begin{axis}[
Chengsong
parents: 646
diff changeset
   130
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   131
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   132
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   133
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   134
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   135
    xmax=33,
Chengsong
parents: 646
diff changeset
   136
    ymax=35,
Chengsong
parents: 646
diff changeset
   137
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   138
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   139
    axis lines=left,
Chengsong
parents: 646
diff changeset
   140
    width=5cm,
Chengsong
parents: 646
diff changeset
   141
    height=4cm, 
Chengsong
parents: 646
diff changeset
   142
    legend entries={Dart},  
Chengsong
parents: 646
diff changeset
   143
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   144
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   145
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
Chengsong
parents: 646
diff changeset
   146
\end{axis}
Chengsong
parents: 646
diff changeset
   147
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   148
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   149
\begin{axis}[
Chengsong
parents: 646
diff changeset
   150
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   151
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   152
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   153
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   154
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   155
    xmax=33,
Chengsong
parents: 646
diff changeset
   156
    ymax=35,
Chengsong
parents: 646
diff changeset
   157
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   158
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   159
    axis lines=left,
Chengsong
parents: 646
diff changeset
   160
    width=5cm,
Chengsong
parents: 646
diff changeset
   161
    height=4cm, 
Chengsong
parents: 646
diff changeset
   162
    legend entries={Swift},  
Chengsong
parents: 646
diff changeset
   163
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   164
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   165
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
Chengsong
parents: 646
diff changeset
   166
\end{axis}
Chengsong
parents: 646
diff changeset
   167
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   168
  & 
Chengsong
parents: 646
diff changeset
   169
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   170
\begin{axis}[
Chengsong
parents: 646
diff changeset
   171
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   172
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   173
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   174
    enlargelimits=true,
Chengsong
parents: 646
diff changeset
   175
    %xtick={0,5000,...,40000},
Chengsong
parents: 646
diff changeset
   176
    %xmax=40000,
Chengsong
parents: 646
diff changeset
   177
    %ymax=35,
Chengsong
parents: 646
diff changeset
   178
    restrict x to domain*=0:40000,
Chengsong
parents: 646
diff changeset
   179
    restrict y to domain*=0:35,
Chengsong
parents: 646
diff changeset
   180
    %ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   181
    %scaled ticks=false,
Chengsong
parents: 646
diff changeset
   182
    axis lines=left,
Chengsong
parents: 646
diff changeset
   183
    width=5cm,
Chengsong
parents: 646
diff changeset
   184
    height=4cm, 
Chengsong
parents: 646
diff changeset
   185
    legend entries={Java9+},  
Chengsong
parents: 646
diff changeset
   186
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   187
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   188
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
Chengsong
parents: 646
diff changeset
   189
\end{axis}
Chengsong
parents: 646
diff changeset
   190
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   191
\multicolumn{2}{c}{Graphs}
Chengsong
parents: 646
diff changeset
   192
\end{tabular}    
Chengsong
parents: 646
diff changeset
   193
\end{center}
Chengsong
parents: 646
diff changeset
   194
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
Chengsong
parents: 646
diff changeset
   195
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
Chengsong
parents: 646
diff changeset
   196
   The reason for their superlinear behaviour is that they do a depth-first-search
Chengsong
parents: 646
diff changeset
   197
   using NFAs.
Chengsong
parents: 646
diff changeset
   198
   If the string does not match, the regular expression matching
Chengsong
parents: 646
diff changeset
   199
   engine starts to explore all possibilities. 
Chengsong
parents: 646
diff changeset
   200
}\label{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   201
\end{figure}\afterpage{\clearpage}
Chengsong
parents: 646
diff changeset
   202
Chengsong
parents: 646
diff changeset
   203
A more recent example is a global outage of all Cloudflare servers on 2 July
Chengsong
parents: 646
diff changeset
   204
2019. A poorly written regular expression exhibited catastrophic backtracking
Chengsong
parents: 646
diff changeset
   205
and exhausted CPUs that serve HTTP traffic. Although the outage
Chengsong
parents: 646
diff changeset
   206
had several causes, at the heart was a regular expression that
Chengsong
parents: 646
diff changeset
   207
was used to monitor network
Chengsong
parents: 646
diff changeset
   208
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
Chengsong
parents: 646
diff changeset
   209
These problems with regular expressions 
Chengsong
parents: 646
diff changeset
   210
are not isolated events that happen
Chengsong
parents: 646
diff changeset
   211
very rarely, 
Chengsong
parents: 646
diff changeset
   212
%but actually widespread.
Chengsong
parents: 646
diff changeset
   213
%They occur so often that they have a 
Chengsong
parents: 646
diff changeset
   214
but they occur actually often enough that they have a
Chengsong
parents: 646
diff changeset
   215
name: Regular-Expression-Denial-Of-Service (ReDoS)
Chengsong
parents: 646
diff changeset
   216
attacks.
Chengsong
parents: 646
diff changeset
   217
Davis et al. \cite{Davis18} detected more
Chengsong
parents: 646
diff changeset
   218
than 1000 evil regular expressions
Chengsong
parents: 646
diff changeset
   219
in Node.js, Python core libraries, npm and pypi. 
Chengsong
parents: 646
diff changeset
   220
They therefore concluded that evil regular expressions
Chengsong
parents: 646
diff changeset
   221
are a real problem rather than just "a parlour trick".
Chengsong
parents: 646
diff changeset
   222
Chengsong
parents: 646
diff changeset
   223
The work in this thesis aims to address this issue
Chengsong
parents: 646
diff changeset
   224
with the help of formal proofs.
Chengsong
parents: 646
diff changeset
   225
We describe a lexing algorithm based
Chengsong
parents: 646
diff changeset
   226
on Brzozowski derivatives with verified correctness 
Chengsong
parents: 646
diff changeset
   227
and a finiteness property for the size of derivatives
Chengsong
parents: 646
diff changeset
   228
(which are all done in Isabelle/HOL).
Chengsong
parents: 646
diff changeset
   229
Such properties %guarantee the absence of 
Chengsong
parents: 646
diff changeset
   230
are an important step in preventing
Chengsong
parents: 646
diff changeset
   231
catastrophic backtracking once and for all.
Chengsong
parents: 646
diff changeset
   232
We will give more details in the next sections
Chengsong
parents: 646
diff changeset
   233
on (i) why the slow cases in graph \ref{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   234
can occur in traditional regular expression engines
Chengsong
parents: 646
diff changeset
   235
and (ii) why we choose our 
Chengsong
parents: 646
diff changeset
   236
approach based on Brzozowski derivatives and formal proofs.
Chengsong
parents: 646
diff changeset
   237
Chengsong
parents: 646
diff changeset
   238
Chengsong
parents: 646
diff changeset
   239
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Chengsong
parents: 646
diff changeset
   240
Regular expressions and regular expression matchers 
Chengsong
parents: 646
diff changeset
   241
have clearly been studied for many, many years.
Chengsong
parents: 646
diff changeset
   242
Theoretical results in automata theory state 
Chengsong
parents: 646
diff changeset
   243
that basic regular expression matching should be linear
Chengsong
parents: 646
diff changeset
   244
w.r.t the input.
Chengsong
parents: 646
diff changeset
   245
This assumes that the regular expression
Chengsong
parents: 646
diff changeset
   246
$r$ was pre-processed and turned into a
Chengsong
parents: 646
diff changeset
   247
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
Chengsong
parents: 646
diff changeset
   248
By basic we mean textbook definitions such as the one
Chengsong
parents: 646
diff changeset
   249
below, involving only regular expressions for characters, alternatives,
Chengsong
parents: 646
diff changeset
   250
sequences, and Kleene stars:
Chengsong
parents: 646
diff changeset
   251
\[
Chengsong
parents: 646
diff changeset
   252
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
Chengsong
parents: 646
diff changeset
   253
\]
Chengsong
parents: 646
diff changeset
   254
Modern regular expression matchers used by programmers,
Chengsong
parents: 646
diff changeset
   255
however,
Chengsong
parents: 646
diff changeset
   256
support much richer constructs, such as bounded repetitions,
Chengsong
parents: 646
diff changeset
   257
negations,
Chengsong
parents: 646
diff changeset
   258
and back-references.
Chengsong
parents: 646
diff changeset
   259
To differentiate, we use the word \emph{regex} to refer
Chengsong
parents: 646
diff changeset
   260
to those expressions with richer constructs while reserving the
Chengsong
parents: 646
diff changeset
   261
term \emph{regular expression}
Chengsong
parents: 646
diff changeset
   262
for the more traditional meaning in formal languages theory.
Chengsong
parents: 646
diff changeset
   263
We follow this convention 
Chengsong
parents: 646
diff changeset
   264
in this thesis.
Chengsong
parents: 646
diff changeset
   265
In the future, we aim to support all the popular features of regexes, 
Chengsong
parents: 646
diff changeset
   266
but for this work we mainly look at basic regular expressions
Chengsong
parents: 646
diff changeset
   267
and bounded repetitions.
Chengsong
parents: 646
diff changeset
   268
Chengsong
parents: 646
diff changeset
   269
Chengsong
parents: 646
diff changeset
   270
Chengsong
parents: 646
diff changeset
   271
%Most modern regex libraries
Chengsong
parents: 646
diff changeset
   272
%the so-called PCRE standard (Peral Compatible Regular Expressions)
Chengsong
parents: 646
diff changeset
   273
%has the back-references
Chengsong
parents: 646
diff changeset
   274
Regexes come with a number of constructs
Chengsong
parents: 646
diff changeset
   275
that make it more convenient for 
Chengsong
parents: 646
diff changeset
   276
programmers to write regular expressions.
Chengsong
parents: 646
diff changeset
   277
Depending on the types of constructs
Chengsong
parents: 646
diff changeset
   278
the task of matching and lexing with them
Chengsong
parents: 646
diff changeset
   279
will have different levels of complexity.
Chengsong
parents: 646
diff changeset
   280
Some of those constructs are syntactic sugars that are
Chengsong
parents: 646
diff changeset
   281
simply short hand notations
Chengsong
parents: 646
diff changeset
   282
that save the programmers a few keystrokes.
Chengsong
parents: 646
diff changeset
   283
These will not cause problems for regex libraries.
Chengsong
parents: 646
diff changeset
   284
For example the
Chengsong
parents: 646
diff changeset
   285
non-binary alternative involving three or more choices just means:
Chengsong
parents: 646
diff changeset
   286
\[
Chengsong
parents: 646
diff changeset
   287
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
Chengsong
parents: 646
diff changeset
   288
\]
Chengsong
parents: 646
diff changeset
   289
Similarly, the range operator
Chengsong
parents: 646
diff changeset
   290
%used to express the alternative
Chengsong
parents: 646
diff changeset
   291
%of all characters between its operands, 
Chengsong
parents: 646
diff changeset
   292
is just a concise way
Chengsong
parents: 646
diff changeset
   293
of expressing an alternative of consecutive characters:
Chengsong
parents: 646
diff changeset
   294
\[
Chengsong
parents: 646
diff changeset
   295
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
Chengsong
parents: 646
diff changeset
   296
\]
Chengsong
parents: 646
diff changeset
   297
for an alternative. The
Chengsong
parents: 646
diff changeset
   298
wildcard character '$.$' is used to refer to any single character,
Chengsong
parents: 646
diff changeset
   299
\[
Chengsong
parents: 646
diff changeset
   300
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
Chengsong
parents: 646
diff changeset
   301
\]
Chengsong
parents: 646
diff changeset
   302
except the newline.
Chengsong
parents: 646
diff changeset
   303
Chengsong
parents: 646
diff changeset
   304
\subsection{Bounded Repetitions}
Chengsong
parents: 646
diff changeset
   305
More interesting are bounded repetitions, which can 
Chengsong
parents: 646
diff changeset
   306
make the regular expressions much
Chengsong
parents: 646
diff changeset
   307
more compact.
Chengsong
parents: 646
diff changeset
   308
Normally there are four kinds of bounded repetitions:
Chengsong
parents: 646
diff changeset
   309
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
Chengsong
parents: 646
diff changeset
   310
(where $n$ and $m$ are constant natural numbers).
Chengsong
parents: 646
diff changeset
   311
Like the star regular expressions, the set of strings or language
Chengsong
parents: 646
diff changeset
   312
a bounded regular expression can match
Chengsong
parents: 646
diff changeset
   313
is defined using the power operation on sets:
Chengsong
parents: 646
diff changeset
   314
\begin{center}
Chengsong
parents: 646
diff changeset
   315
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   316
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
Chengsong
parents: 646
diff changeset
   317
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   318
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   319
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
Chengsong
parents: 646
diff changeset
   320
	\end{tabular}
Chengsong
parents: 646
diff changeset
   321
\end{center}
Chengsong
parents: 646
diff changeset
   322
The attraction of bounded repetitions is that they can be
Chengsong
parents: 646
diff changeset
   323
used to avoid a size blow up: for example $r^{\{n\}}$
Chengsong
parents: 646
diff changeset
   324
is a shorthand for
Chengsong
parents: 646
diff changeset
   325
the much longer regular expression:
Chengsong
parents: 646
diff changeset
   326
\[
Chengsong
parents: 646
diff changeset
   327
	\underbrace{r\ldots r}_\text{n copies of r}.
Chengsong
parents: 646
diff changeset
   328
\]
Chengsong
parents: 646
diff changeset
   329
%Therefore, a naive algorithm that simply unfolds
Chengsong
parents: 646
diff changeset
   330
%them into their desugared forms
Chengsong
parents: 646
diff changeset
   331
%will suffer from at least an exponential runtime increase.
Chengsong
parents: 646
diff changeset
   332
Chengsong
parents: 646
diff changeset
   333
Chengsong
parents: 646
diff changeset
   334
The problem with matching 
Chengsong
parents: 646
diff changeset
   335
such bounded repetitions
Chengsong
parents: 646
diff changeset
   336
is that tools based on the classic notion of
Chengsong
parents: 646
diff changeset
   337
automata need to expand $r^{\{n\}}$ into $n$ connected 
Chengsong
parents: 646
diff changeset
   338
copies of the automaton for $r$. This leads to very inefficient matching
Chengsong
parents: 646
diff changeset
   339
algorithms  or algorithms that consume large amounts of memory.
Chengsong
parents: 646
diff changeset
   340
Implementations using $\DFA$s will
Chengsong
parents: 646
diff changeset
   341
in such situations
Chengsong
parents: 646
diff changeset
   342
either become excruciatingly slow 
Chengsong
parents: 646
diff changeset
   343
(for example Verbatim++ \cite{Verbatimpp}) or run
Chengsong
parents: 646
diff changeset
   344
out of memory (for example $\mathit{LEX}$ and 
Chengsong
parents: 646
diff changeset
   345
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
Chengsong
parents: 646
diff changeset
   346
in C and JAVA that generate $\mathit{DFA}$-based
Chengsong
parents: 646
diff changeset
   347
lexers. The user provides a set of regular expressions
Chengsong
parents: 646
diff changeset
   348
and configurations, and then 
Chengsong
parents: 646
diff changeset
   349
gets an output program encoding a minimized $\mathit{DFA}$
Chengsong
parents: 646
diff changeset
   350
that can be compiled and run. 
Chengsong
parents: 646
diff changeset
   351
When given the above countdown regular expression,
Chengsong
parents: 646
diff changeset
   352
a small $n$ (say 20) would result in a program representing a
Chengsong
parents: 646
diff changeset
   353
DFA
Chengsong
parents: 646
diff changeset
   354
with millions of states.}) for large counters.
Chengsong
parents: 646
diff changeset
   355
A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
Chengsong
parents: 646
diff changeset
   356
where the minimal DFA requires at least $2^{n+1}$ states.
Chengsong
parents: 646
diff changeset
   357
For example, when $n$ is equal to 2,
Chengsong
parents: 646
diff changeset
   358
the corresponding $\mathit{NFA}$ looks like:
Chengsong
parents: 646
diff changeset
   359
\vspace{6mm}
Chengsong
parents: 646
diff changeset
   360
\begin{center}
Chengsong
parents: 646
diff changeset
   361
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   362
   \node[state,initial] (q_0)   {$q_0$}; 
Chengsong
parents: 646
diff changeset
   363
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
Chengsong
parents: 646
diff changeset
   364
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
Chengsong
parents: 646
diff changeset
   365
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
Chengsong
parents: 646
diff changeset
   366
    \path[->] 
Chengsong
parents: 646
diff changeset
   367
    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   368
    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   369
    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   370
    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   371
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   372
\end{center}
Chengsong
parents: 646
diff changeset
   373
and when turned into a DFA by the subset construction
Chengsong
parents: 646
diff changeset
   374
requires at least $2^3$ states.\footnote{The 
Chengsong
parents: 646
diff changeset
   375
red states are "countdown states" which count down 
Chengsong
parents: 646
diff changeset
   376
the number of characters needed in addition to the current
Chengsong
parents: 646
diff changeset
   377
string to make a successful match.
Chengsong
parents: 646
diff changeset
   378
For example, state $q_1$ indicates a match that has
Chengsong
parents: 646
diff changeset
   379
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 646
diff changeset
   380
and just consumed the "delimiter" $a$ in the middle, and 
Chengsong
parents: 646
diff changeset
   381
needs to match 2 more iterations of $(a|b)$ to complete.
Chengsong
parents: 646
diff changeset
   382
State $q_2$ on the other hand, can be viewed as a state
Chengsong
parents: 646
diff changeset
   383
after $q_1$ has consumed 1 character, and just waits
Chengsong
parents: 646
diff changeset
   384
for 1 more character to complete.
Chengsong
parents: 646
diff changeset
   385
The state $q_3$ is the last (accepting) state, requiring 0 
Chengsong
parents: 646
diff changeset
   386
more characters.
Chengsong
parents: 646
diff changeset
   387
Depending on the suffix of the
Chengsong
parents: 646
diff changeset
   388
input string up to the current read location,
Chengsong
parents: 646
diff changeset
   389
the states $q_1$ and $q_2$, $q_3$
Chengsong
parents: 646
diff changeset
   390
may or may
Chengsong
parents: 646
diff changeset
   391
not be active.
Chengsong
parents: 646
diff changeset
   392
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
Chengsong
parents: 646
diff changeset
   393
contain at least $2^3$ non-equivalent states that cannot be merged, 
Chengsong
parents: 646
diff changeset
   394
because the subset construction during determinisation will generate
Chengsong
parents: 646
diff changeset
   395
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Chengsong
parents: 646
diff changeset
   396
Generalizing this to regular expressions with larger
Chengsong
parents: 646
diff changeset
   397
bounded repetitions number, we have that
Chengsong
parents: 646
diff changeset
   398
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
Chengsong
parents: 646
diff changeset
   399
would require at least $2^{n+1}$ states, if $r$ itself contains
Chengsong
parents: 646
diff changeset
   400
more than 1 string.
Chengsong
parents: 646
diff changeset
   401
This is to represent all different 
Chengsong
parents: 646
diff changeset
   402
scenarios in which "countdown" states are active.}
Chengsong
parents: 646
diff changeset
   403
Chengsong
parents: 646
diff changeset
   404
Chengsong
parents: 646
diff changeset
   405
Bounded repetitions are important because they
Chengsong
parents: 646
diff changeset
   406
tend to occur frequently in practical use,
Chengsong
parents: 646
diff changeset
   407
for example in the regex library RegExLib, in
Chengsong
parents: 646
diff changeset
   408
the rules library of Snort \cite{Snort1999}\footnote{
Chengsong
parents: 646
diff changeset
   409
Snort is a network intrusion detection (NID) tool
Chengsong
parents: 646
diff changeset
   410
for monitoring network traffic.
Chengsong
parents: 646
diff changeset
   411
The network security community curates a list
Chengsong
parents: 646
diff changeset
   412
of malicious patterns written as regexes,
Chengsong
parents: 646
diff changeset
   413
which is used by Snort's detection engine
Chengsong
parents: 646
diff changeset
   414
to match against network traffic for any hostile
Chengsong
parents: 646
diff changeset
   415
activities such as buffer overflow attacks.}, 
Chengsong
parents: 646
diff changeset
   416
as well as in XML Schema definitions (XSDs).
Chengsong
parents: 646
diff changeset
   417
According to Bj\"{o}rklund et al \cite{xml2015},
Chengsong
parents: 646
diff changeset
   418
more than half of the 
Chengsong
parents: 646
diff changeset
   419
XSDs they found on the Maven.org central repository
Chengsong
parents: 646
diff changeset
   420
have bounded regular expressions in them.
Chengsong
parents: 646
diff changeset
   421
Often the counters are quite large, with the largest being
Chengsong
parents: 646
diff changeset
   422
close to ten million. 
Chengsong
parents: 646
diff changeset
   423
A smaller sample XSD they gave
Chengsong
parents: 646
diff changeset
   424
is:
Chengsong
parents: 646
diff changeset
   425
\lstset{
Chengsong
parents: 646
diff changeset
   426
	basicstyle=\fontsize{8.5}{9}\ttfamily,
Chengsong
parents: 646
diff changeset
   427
  language=XML,
Chengsong
parents: 646
diff changeset
   428
  morekeywords={encoding,
Chengsong
parents: 646
diff changeset
   429
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
Chengsong
parents: 646
diff changeset
   430
}
Chengsong
parents: 646
diff changeset
   431
\begin{lstlisting}
Chengsong
parents: 646
diff changeset
   432
<sequence minOccurs="0" maxOccurs="65535">
Chengsong
parents: 646
diff changeset
   433
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
Chengsong
parents: 646
diff changeset
   434
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
Chengsong
parents: 646
diff changeset
   435
</sequence>
Chengsong
parents: 646
diff changeset
   436
\end{lstlisting}
Chengsong
parents: 646
diff changeset
   437
This can be seen as the regex
Chengsong
parents: 646
diff changeset
   438
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
Chengsong
parents: 646
diff changeset
   439
regular expressions 
Chengsong
parents: 646
diff changeset
   440
satisfying certain constraints (such as 
Chengsong
parents: 646
diff changeset
   441
satisfying the floating point number format).
Chengsong
parents: 646
diff changeset
   442
It is therefore quite unsatisfying that 
Chengsong
parents: 646
diff changeset
   443
some regular expressions matching libraries
Chengsong
parents: 646
diff changeset
   444
impose adhoc limits
Chengsong
parents: 646
diff changeset
   445
for bounded regular expressions:
Chengsong
parents: 646
diff changeset
   446
For example, in the regular expression matching library in the Go
Chengsong
parents: 646
diff changeset
   447
language the regular expression $a^{1001}$ is not permitted, because no counter
Chengsong
parents: 646
diff changeset
   448
can be above 1000, and in the built-in Rust regular expression library
Chengsong
parents: 646
diff changeset
   449
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
Chengsong
parents: 646
diff changeset
   450
for being too big. 
Chengsong
parents: 646
diff changeset
   451
As Becchi and Crawley \cite{Becchi08}  have pointed out,
Chengsong
parents: 646
diff changeset
   452
the reason for these restrictions
Chengsong
parents: 646
diff changeset
   453
is that they simulate a non-deterministic finite
Chengsong
parents: 646
diff changeset
   454
automata (NFA) with a breadth-first search.
Chengsong
parents: 646
diff changeset
   455
This way the number of active states could
Chengsong
parents: 646
diff changeset
   456
be equal to the counter number.
Chengsong
parents: 646
diff changeset
   457
When the counters are large, 
Chengsong
parents: 646
diff changeset
   458
the memory requirement could become
Chengsong
parents: 646
diff changeset
   459
infeasible, and a regex engine
Chengsong
parents: 646
diff changeset
   460
like in Go will reject this pattern straight away.
Chengsong
parents: 646
diff changeset
   461
\begin{figure}[H]
Chengsong
parents: 646
diff changeset
   462
\begin{center}
Chengsong
parents: 646
diff changeset
   463
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
Chengsong
parents: 646
diff changeset
   464
 
Chengsong
parents: 646
diff changeset
   465
    	\node (q0) [state, initial] {$0$};
Chengsong
parents: 646
diff changeset
   466
	\node (q1) [state, right = of q0] {$1$};
Chengsong
parents: 646
diff changeset
   467
	%\node (q2) [state, right = of q1] {$2$};
Chengsong
parents: 646
diff changeset
   468
	\node (qdots) [right = of q1] {$\ldots$};
Chengsong
parents: 646
diff changeset
   469
	\node (qn) [state, right = of qdots] {$n$};
Chengsong
parents: 646
diff changeset
   470
	\node (qn1) [state, right = of qn] {$n+1$};
Chengsong
parents: 646
diff changeset
   471
	\node (qn2) [state, right = of qn1] {$n+2$};
Chengsong
parents: 646
diff changeset
   472
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
Chengsong
parents: 646
diff changeset
   473
 
Chengsong
parents: 646
diff changeset
   474
\path [-stealth, thick]
Chengsong
parents: 646
diff changeset
   475
	(q0) edge [loop above] node {a} ()
Chengsong
parents: 646
diff changeset
   476
    (q0) edge node {a}   (q1) 
Chengsong
parents: 646
diff changeset
   477
    %(q1) edge node {.}   (q2)
Chengsong
parents: 646
diff changeset
   478
    (q1) edge node {.}   (qdots)
Chengsong
parents: 646
diff changeset
   479
    (qdots) edge node {.} (qn)
Chengsong
parents: 646
diff changeset
   480
    (qn) edge node {.} (qn1)
Chengsong
parents: 646
diff changeset
   481
    (qn1) edge node {b} (qn2)
Chengsong
parents: 646
diff changeset
   482
    (qn2) edge node {$c$} (qn3);
Chengsong
parents: 646
diff changeset
   483
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   484
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   485
%   \node[state,initial] (q_0)   {$0$}; 
Chengsong
parents: 646
diff changeset
   486
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
Chengsong
parents: 646
diff changeset
   487
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
Chengsong
parents: 646
diff changeset
   488
%   \node[state,
Chengsong
parents: 646
diff changeset
   489
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
Chengsong
parents: 646
diff changeset
   490
%    \path[->] 
Chengsong
parents: 646
diff changeset
   491
%    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   492
%    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   493
%    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   494
%    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   495
%\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   496
\end{center}
Chengsong
parents: 646
diff changeset
   497
\caption{The example given by Becchi and Crawley
Chengsong
parents: 646
diff changeset
   498
	that NFA simulation can consume large
Chengsong
parents: 646
diff changeset
   499
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
Chengsong
parents: 646
diff changeset
   500
	strings of the form $aaa\ldots aaaabc$.
Chengsong
parents: 646
diff changeset
   501
	When traversing in a breadth-first manner,
657
00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents: 651
diff changeset
   502
all states from 0 till $n+1$ will become active.}\label{fig:inj}
00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents: 651
diff changeset
   503
649
Chengsong
parents: 646
diff changeset
   504
\end{figure}
Chengsong
parents: 646
diff changeset
   505
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
Chengsong
parents: 646
diff changeset
   506
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
Chengsong
parents: 646
diff changeset
   507
%in terms of input string length.
Chengsong
parents: 646
diff changeset
   508
%TODO:try out these lexers
Chengsong
parents: 646
diff changeset
   509
These problems can of course be solved in matching algorithms where 
Chengsong
parents: 646
diff changeset
   510
automata go beyond the classic notion and for instance include explicit
Chengsong
parents: 646
diff changeset
   511
counters \cite{Turo_ov__2020}.
Chengsong
parents: 646
diff changeset
   512
These solutions can be quite efficient,
Chengsong
parents: 646
diff changeset
   513
with the ability to process
Chengsong
parents: 646
diff changeset
   514
gigabits of strings input per second
Chengsong
parents: 646
diff changeset
   515
even with large counters \cite{Becchi08}.
Chengsong
parents: 646
diff changeset
   516
These practical solutions do not come with
Chengsong
parents: 646
diff changeset
   517
formal guarantees, and as pointed out by
Chengsong
parents: 646
diff changeset
   518
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
Chengsong
parents: 646
diff changeset
   519
%But formal reasoning about these automata especially in Isabelle 
Chengsong
parents: 646
diff changeset
   520
%can be challenging
Chengsong
parents: 646
diff changeset
   521
%and un-intuitive. 
Chengsong
parents: 646
diff changeset
   522
%Therefore, we take correctness and runtime claims made about these solutions
Chengsong
parents: 646
diff changeset
   523
%with a grain of salt.
Chengsong
parents: 646
diff changeset
   524
Chengsong
parents: 646
diff changeset
   525
In the work reported in \cite{FoSSaCS2023} and here, 
Chengsong
parents: 646
diff changeset
   526
we add better support using derivatives
Chengsong
parents: 646
diff changeset
   527
for bounded regular expression $r^{\{n\}}$.
Chengsong
parents: 646
diff changeset
   528
Our results
Chengsong
parents: 646
diff changeset
   529
extend straightforwardly to
Chengsong
parents: 646
diff changeset
   530
repetitions with intervals such as 
Chengsong
parents: 646
diff changeset
   531
$r^{\{n\ldots m\}}$.
Chengsong
parents: 646
diff changeset
   532
The merit of Brzozowski derivatives (more on this later)
Chengsong
parents: 646
diff changeset
   533
on this problem is that
Chengsong
parents: 646
diff changeset
   534
it can be naturally extended to support bounded repetitions.
Chengsong
parents: 646
diff changeset
   535
Moreover these extensions are still made up of only small
Chengsong
parents: 646
diff changeset
   536
inductive datatypes and recursive functions,
Chengsong
parents: 646
diff changeset
   537
making it handy to deal with them in theorem provers.
Chengsong
parents: 646
diff changeset
   538
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
Chengsong
parents: 646
diff changeset
   539
%straightforwardly extended to deal with bounded regular expressions
Chengsong
parents: 646
diff changeset
   540
%and moreover the resulting code still consists of only simple
Chengsong
parents: 646
diff changeset
   541
%recursive functions and inductive datatypes.
Chengsong
parents: 646
diff changeset
   542
Finally, bounded regular expressions do not destroy our finite
Chengsong
parents: 646
diff changeset
   543
boundedness property, which we shall prove later on.
Chengsong
parents: 646
diff changeset
   544
Chengsong
parents: 646
diff changeset
   545
Chengsong
parents: 646
diff changeset
   546
Chengsong
parents: 646
diff changeset
   547
Chengsong
parents: 646
diff changeset
   548
Chengsong
parents: 646
diff changeset
   549
\subsection{Back-References}
Chengsong
parents: 646
diff changeset
   550
The other way to simulate an $\mathit{NFA}$ for matching is choosing  
Chengsong
parents: 646
diff changeset
   551
a single transition each time, keeping all the other options in 
Chengsong
parents: 646
diff changeset
   552
a queue or stack, and backtracking if that choice eventually 
Chengsong
parents: 646
diff changeset
   553
fails. 
Chengsong
parents: 646
diff changeset
   554
This method, often called a  "depth-first-search", 
Chengsong
parents: 646
diff changeset
   555
is efficient in many cases, but could end up
Chengsong
parents: 646
diff changeset
   556
with exponential run time.
Chengsong
parents: 646
diff changeset
   557
The backtracking method is employed in regex libraries
Chengsong
parents: 646
diff changeset
   558
that support \emph{back-references}, for example
Chengsong
parents: 646
diff changeset
   559
in Java and Python.
Chengsong
parents: 646
diff changeset
   560
%\section{Back-references and The Terminology Regex}
Chengsong
parents: 646
diff changeset
   561
Chengsong
parents: 646
diff changeset
   562
%When one constructs an $\NFA$ out of a regular expression
Chengsong
parents: 646
diff changeset
   563
%there is often very little to be done in the first phase, one simply 
Chengsong
parents: 646
diff changeset
   564
%construct the $\NFA$ states based on the structure of the input regular expression.
Chengsong
parents: 646
diff changeset
   565
Chengsong
parents: 646
diff changeset
   566
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
Chengsong
parents: 646
diff changeset
   567
%one by keeping track of all active states after consuming 
Chengsong
parents: 646
diff changeset
   568
%a character, and update that set of states iteratively.
Chengsong
parents: 646
diff changeset
   569
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
Chengsong
parents: 646
diff changeset
   570
%for a path terminating
Chengsong
parents: 646
diff changeset
   571
%at an accepting state.
Chengsong
parents: 646
diff changeset
   572
Chengsong
parents: 646
diff changeset
   573
Chengsong
parents: 646
diff changeset
   574
Chengsong
parents: 646
diff changeset
   575
Consider the following regular expression where the sequence
Chengsong
parents: 646
diff changeset
   576
operator is omitted for brevity:
Chengsong
parents: 646
diff changeset
   577
\begin{center}
Chengsong
parents: 646
diff changeset
   578
	$r_1r_2r_3r_4$
Chengsong
parents: 646
diff changeset
   579
\end{center}
Chengsong
parents: 646
diff changeset
   580
In this example,
Chengsong
parents: 646
diff changeset
   581
one could label sub-expressions of interest 
Chengsong
parents: 646
diff changeset
   582
by parenthesizing them and giving 
Chengsong
parents: 646
diff changeset
   583
them a number in the order in which their opening parentheses appear.
Chengsong
parents: 646
diff changeset
   584
One possible way of parenthesizing and labelling is: 
Chengsong
parents: 646
diff changeset
   585
\begin{center}
Chengsong
parents: 646
diff changeset
   586
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
Chengsong
parents: 646
diff changeset
   587
\end{center}
Chengsong
parents: 646
diff changeset
   588
The sub-expressions
Chengsong
parents: 646
diff changeset
   589
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
Chengsong
parents: 646
diff changeset
   590
by 1 to 4, and can be ``referred back'' by their respective numbers. 
Chengsong
parents: 646
diff changeset
   591
%These sub-expressions are called "capturing groups".
Chengsong
parents: 646
diff changeset
   592
To do so, one uses the syntax $\backslash i$ 
Chengsong
parents: 646
diff changeset
   593
to denote that we want the sub-string 
Chengsong
parents: 646
diff changeset
   594
of the input matched by the i-th
Chengsong
parents: 646
diff changeset
   595
sub-expression to appear again, 
Chengsong
parents: 646
diff changeset
   596
exactly the same as it first appeared: 
Chengsong
parents: 646
diff changeset
   597
\begin{center}
Chengsong
parents: 646
diff changeset
   598
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
Chengsong
parents: 646
diff changeset
   599
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
Chengsong
parents: 646
diff changeset
   600
\end{center}
Chengsong
parents: 646
diff changeset
   601
Once the sub-string $s_i$ for the sub-expression $r_i$
Chengsong
parents: 646
diff changeset
   602
has been fixed, there is no variability on what the back-reference
Chengsong
parents: 646
diff changeset
   603
label $\backslash i$ can be---it is tied to $s_i$.
Chengsong
parents: 646
diff changeset
   604
The overall string will look like $\ldots s_i \ldots s_i \ldots $
Chengsong
parents: 646
diff changeset
   605
%The backslash and number $i$ are the
Chengsong
parents: 646
diff changeset
   606
%so-called "back-references".
Chengsong
parents: 646
diff changeset
   607
%Let $e$ be an expression made of regular expressions 
Chengsong
parents: 646
diff changeset
   608
%and back-references. $e$ contains the expression $e_i$
Chengsong
parents: 646
diff changeset
   609
%as its $i$-th capturing group.
Chengsong
parents: 646
diff changeset
   610
%The semantics of back-reference can be recursively
Chengsong
parents: 646
diff changeset
   611
%written as:
Chengsong
parents: 646
diff changeset
   612
%\begin{center}
Chengsong
parents: 646
diff changeset
   613
%	\begin{tabular}{c}
Chengsong
parents: 646
diff changeset
   614
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
Chengsong
parents: 646
diff changeset
   615
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
Chengsong
parents: 646
diff changeset
   616
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   617
%\end{center}
Chengsong
parents: 646
diff changeset
   618
A concrete example
Chengsong
parents: 646
diff changeset
   619
for back-references is
Chengsong
parents: 646
diff changeset
   620
\begin{center}
Chengsong
parents: 646
diff changeset
   621
$(.^*)\backslash 1$,
Chengsong
parents: 646
diff changeset
   622
\end{center}
Chengsong
parents: 646
diff changeset
   623
which matches
Chengsong
parents: 646
diff changeset
   624
strings that can be split into two identical halves,
Chengsong
parents: 646
diff changeset
   625
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Chengsong
parents: 646
diff changeset
   626
Note that this is different from 
Chengsong
parents: 646
diff changeset
   627
repeating the  sub-expression verbatim like
Chengsong
parents: 646
diff changeset
   628
\begin{center}
Chengsong
parents: 646
diff changeset
   629
	$(.^*)(.^*)$,
Chengsong
parents: 646
diff changeset
   630
\end{center}
Chengsong
parents: 646
diff changeset
   631
which does not impose any restrictions on what strings the second 
Chengsong
parents: 646
diff changeset
   632
sub-expression $.^*$
Chengsong
parents: 646
diff changeset
   633
might match.
Chengsong
parents: 646
diff changeset
   634
Another example for back-references is
Chengsong
parents: 646
diff changeset
   635
\begin{center}
Chengsong
parents: 646
diff changeset
   636
$(.)(.)\backslash 2\backslash 1$
Chengsong
parents: 646
diff changeset
   637
\end{center}
Chengsong
parents: 646
diff changeset
   638
which matches four-character palindromes
Chengsong
parents: 646
diff changeset
   639
like $abba$, $x??x$ and so on.
Chengsong
parents: 646
diff changeset
   640
Chengsong
parents: 646
diff changeset
   641
Back-references are a regex construct 
Chengsong
parents: 646
diff changeset
   642
that programmers find quite useful.
Chengsong
parents: 646
diff changeset
   643
According to Becchi and Crawley \cite{Becchi08},
Chengsong
parents: 646
diff changeset
   644
6\% of Snort rules (up until 2008) use them.
Chengsong
parents: 646
diff changeset
   645
The most common use of back-references
Chengsong
parents: 646
diff changeset
   646
is to express well-formed html files,
Chengsong
parents: 646
diff changeset
   647
where back-references are convenient for matching
Chengsong
parents: 646
diff changeset
   648
opening and closing tags like 
Chengsong
parents: 646
diff changeset
   649
\begin{center}
Chengsong
parents: 646
diff changeset
   650
	$\langle html \rangle \ldots \langle / html \rangle$
Chengsong
parents: 646
diff changeset
   651
\end{center}
Chengsong
parents: 646
diff changeset
   652
A regex describing such a format
Chengsong
parents: 646
diff changeset
   653
is
Chengsong
parents: 646
diff changeset
   654
\begin{center}
Chengsong
parents: 646
diff changeset
   655
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
Chengsong
parents: 646
diff changeset
   656
\end{center}
Chengsong
parents: 646
diff changeset
   657
Despite being useful, the expressive power of regexes 
Chengsong
parents: 646
diff changeset
   658
go beyond regular languages 
Chengsong
parents: 646
diff changeset
   659
once back-references are included.
Chengsong
parents: 646
diff changeset
   660
In fact, they allow the regex construct to express 
Chengsong
parents: 646
diff changeset
   661
languages that cannot be contained in context-free
Chengsong
parents: 646
diff changeset
   662
languages either.
Chengsong
parents: 646
diff changeset
   663
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
Chengsong
parents: 646
diff changeset
   664
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
Chengsong
parents: 646
diff changeset
   665
which cannot be expressed by 
Chengsong
parents: 646
diff changeset
   666
context-free grammars \cite{campeanu2003formal}.
Chengsong
parents: 646
diff changeset
   667
Such a language is contained in the context-sensitive hierarchy
Chengsong
parents: 646
diff changeset
   668
of formal languages. 
Chengsong
parents: 646
diff changeset
   669
Also solving the matching problem involving back-references
Chengsong
parents: 646
diff changeset
   670
is known to be NP-complete \parencite{alfred2014algorithms}.
Chengsong
parents: 646
diff changeset
   671
Regex libraries supporting back-references such as 
Chengsong
parents: 646
diff changeset
   672
PCRE \cite{pcre} therefore have to
Chengsong
parents: 646
diff changeset
   673
revert to a depth-first search algorithm including backtracking.
Chengsong
parents: 646
diff changeset
   674
What is unexpected is that even in the cases 
Chengsong
parents: 646
diff changeset
   675
not involving back-references, there is still
Chengsong
parents: 646
diff changeset
   676
a (non-negligible) chance they might backtrack super-linearly,
Chengsong
parents: 646
diff changeset
   677
as shown in the graphs in figure \ref{fig:aStarStarb}.
Chengsong
parents: 646
diff changeset
   678
Chengsong
parents: 646
diff changeset
   679
Summing up, we can categorise existing 
Chengsong
parents: 646
diff changeset
   680
practical regex libraries into two kinds:
Chengsong
parents: 646
diff changeset
   681
(i) The ones  with  linear
Chengsong
parents: 646
diff changeset
   682
time guarantees like Go and Rust. The downside with them is that
Chengsong
parents: 646
diff changeset
   683
they impose restrictions
Chengsong
parents: 646
diff changeset
   684
on the regular expressions (not allowing back-references, 
Chengsong
parents: 646
diff changeset
   685
bounded repetitions cannot exceed an ad hoc limit etc.).
Chengsong
parents: 646
diff changeset
   686
And (ii) those 
Chengsong
parents: 646
diff changeset
   687
that allow large bounded regular expressions and back-references
Chengsong
parents: 646
diff changeset
   688
at the expense of using backtracking algorithms.
Chengsong
parents: 646
diff changeset
   689
They can potentially ``grind to a halt''
Chengsong
parents: 646
diff changeset
   690
on some very simple cases, resulting 
Chengsong
parents: 646
diff changeset
   691
ReDoS attacks if exposed to the internet.
Chengsong
parents: 646
diff changeset
   692
Chengsong
parents: 646
diff changeset
   693
The problems with both approaches are the motivation for us 
Chengsong
parents: 646
diff changeset
   694
to look again at the regular expression matching problem. 
Chengsong
parents: 646
diff changeset
   695
Another motivation is that regular expression matching algorithms
Chengsong
parents: 646
diff changeset
   696
that follow the POSIX standard often contain errors and bugs 
Chengsong
parents: 646
diff changeset
   697
as we shall explain next.
Chengsong
parents: 646
diff changeset
   698
Chengsong
parents: 646
diff changeset
   699
%We would like to have regex engines that can 
Chengsong
parents: 646
diff changeset
   700
%deal with the regular part (e.g.
Chengsong
parents: 646
diff changeset
   701
%bounded repetitions) of regexes more
Chengsong
parents: 646
diff changeset
   702
%efficiently.
Chengsong
parents: 646
diff changeset
   703
%Also we want to make sure that they do it correctly.
Chengsong
parents: 646
diff changeset
   704
%It turns out that such aim is not so easy to achieve.
Chengsong
parents: 646
diff changeset
   705
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
Chengsong
parents: 646
diff changeset
   706
% For example, the Rust regex engine claims to be linear, 
Chengsong
parents: 646
diff changeset
   707
% but does not support lookarounds and back-references.
Chengsong
parents: 646
diff changeset
   708
% The GoLang regex library does not support over 1000 repetitions.  
Chengsong
parents: 646
diff changeset
   709
% Java and Python both support back-references, but shows
Chengsong
parents: 646
diff changeset
   710
%catastrophic backtracking behaviours on inputs without back-references(
Chengsong
parents: 646
diff changeset
   711
%when the language is still regular).
Chengsong
parents: 646
diff changeset
   712
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
Chengsong
parents: 646
diff changeset
   713
 %TODO: verify the fact Rust does not allow 1000+ reps
Chengsong
parents: 646
diff changeset
   714
Chengsong
parents: 646
diff changeset
   715
Chengsong
parents: 646
diff changeset
   716
Chengsong
parents: 646
diff changeset
   717
Chengsong
parents: 646
diff changeset
   718
%The time cost of regex matching algorithms in general
Chengsong
parents: 646
diff changeset
   719
%involve two different phases, and different things can go differently wrong on 
Chengsong
parents: 646
diff changeset
   720
%these phases.
Chengsong
parents: 646
diff changeset
   721
%$\DFA$s usually have problems in the first (construction) phase
Chengsong
parents: 646
diff changeset
   722
%, whereas $\NFA$s usually run into trouble
Chengsong
parents: 646
diff changeset
   723
%on the second phase.
Chengsong
parents: 646
diff changeset
   724
Chengsong
parents: 646
diff changeset
   725
Chengsong
parents: 646
diff changeset
   726
\section{Error-prone POSIX Implementations}
Chengsong
parents: 646
diff changeset
   727
Very often there are multiple ways of matching a string
Chengsong
parents: 646
diff changeset
   728
with a regular expression.
Chengsong
parents: 646
diff changeset
   729
In such cases the regular expressions matcher needs to
Chengsong
parents: 646
diff changeset
   730
disambiguate.
Chengsong
parents: 646
diff changeset
   731
The more widely used strategy is called POSIX,
Chengsong
parents: 646
diff changeset
   732
which roughly speaking always chooses the longest initial match.
Chengsong
parents: 646
diff changeset
   733
The POSIX strategy is widely adopted in many regular expression matchers
Chengsong
parents: 646
diff changeset
   734
because it is a natural strategy for lexers.
Chengsong
parents: 646
diff changeset
   735
However, many implementations (including the C libraries
Chengsong
parents: 646
diff changeset
   736
used by Linux and OS X distributions) contain bugs
Chengsong
parents: 646
diff changeset
   737
or do not meet the specification they claim to adhere to.
Chengsong
parents: 646
diff changeset
   738
Kuklewicz maintains a unit test repository which lists some
Chengsong
parents: 646
diff changeset
   739
problems with existing regular expression engines \cite{KuklewiczHaskell}.
Chengsong
parents: 646
diff changeset
   740
In some cases, they either fail to generate a
Chengsong
parents: 646
diff changeset
   741
result when there exists a match,
Chengsong
parents: 646
diff changeset
   742
or give results that are inconsistent with the POSIX standard.
Chengsong
parents: 646
diff changeset
   743
A concrete example is the regex:
Chengsong
parents: 646
diff changeset
   744
\begin{center}
Chengsong
parents: 646
diff changeset
   745
	$(aba + ab + a)^* \text{and the string} \; ababa$
Chengsong
parents: 646
diff changeset
   746
\end{center}
Chengsong
parents: 646
diff changeset
   747
The correct POSIX match for the above
Chengsong
parents: 646
diff changeset
   748
involves the entire string $ababa$, 
Chengsong
parents: 646
diff changeset
   749
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
Chengsong
parents: 646
diff changeset
   750
$[0, 2), [2, 5)$
Chengsong
parents: 646
diff changeset
   751
respectively.
Chengsong
parents: 646
diff changeset
   752
But feeding this example to the different engines
Chengsong
parents: 646
diff changeset
   753
listed at regex101 \footnote{
Chengsong
parents: 646
diff changeset
   754
	regex101 is an online regular expression matcher which
Chengsong
parents: 646
diff changeset
   755
	provides API for trying out regular expression
Chengsong
parents: 646
diff changeset
   756
	engines of multiple popular programming languages like
Chengsong
parents: 646
diff changeset
   757
Java, Python, Go, etc.} \parencite{regex101}. 
Chengsong
parents: 646
diff changeset
   758
yields
Chengsong
parents: 646
diff changeset
   759
only two incomplete matches: $[aba]$ at $[0, 3)$
Chengsong
parents: 646
diff changeset
   760
and $a$ at $[4, 5)$.
Chengsong
parents: 646
diff changeset
   761
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   762
commented that most regex libraries are not
Chengsong
parents: 646
diff changeset
   763
correctly implementing the central POSIX
Chengsong
parents: 646
diff changeset
   764
rule, called the maximum munch rule.
Chengsong
parents: 646
diff changeset
   765
Grathwohl \parencite{grathwohl2014crash} wrote:
Chengsong
parents: 646
diff changeset
   766
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   767
	``The POSIX strategy is more complicated than the 
Chengsong
parents: 646
diff changeset
   768
	greedy because of the dependence on information about 
Chengsong
parents: 646
diff changeset
   769
	the length of matched strings in the various subexpressions.''
Chengsong
parents: 646
diff changeset
   770
\end{quote}
Chengsong
parents: 646
diff changeset
   771
%\noindent
Chengsong
parents: 646
diff changeset
   772
People have recognised that the implementation complexity of POSIX rules also come from
Chengsong
parents: 646
diff changeset
   773
the specification being not very precise.
Chengsong
parents: 646
diff changeset
   774
The developers of the regexp package of Go 
Chengsong
parents: 646
diff changeset
   775
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
Chengsong
parents: 646
diff changeset
   776
commented that
Chengsong
parents: 646
diff changeset
   777
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   778
``
Chengsong
parents: 646
diff changeset
   779
The POSIX rule is computationally prohibitive
Chengsong
parents: 646
diff changeset
   780
and not even well-defined.
Chengsong
parents: 646
diff changeset
   781
``
Chengsong
parents: 646
diff changeset
   782
\end{quote}
Chengsong
parents: 646
diff changeset
   783
There are many informal summaries of this disambiguation
Chengsong
parents: 646
diff changeset
   784
strategy, which are often quite long and delicate.
Chengsong
parents: 646
diff changeset
   785
For example Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   786
described the POSIX rule as (section 1, last paragraph):
Chengsong
parents: 646
diff changeset
   787
\begin{quote}
Chengsong
parents: 646
diff changeset
   788
	\begin{itemize}
Chengsong
parents: 646
diff changeset
   789
		\item
Chengsong
parents: 646
diff changeset
   790
regular expressions (REs) take the leftmost starting match, and the longest match starting there
Chengsong
parents: 646
diff changeset
   791
earlier subpatterns have leftmost-longest priority over later subpatterns\\
Chengsong
parents: 646
diff changeset
   792
\item
Chengsong
parents: 646
diff changeset
   793
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
Chengsong
parents: 646
diff changeset
   794
\item
Chengsong
parents: 646
diff changeset
   795
REs have right associative concatenation which can be changed with parenthesis\\
Chengsong
parents: 646
diff changeset
   796
\item
Chengsong
parents: 646
diff changeset
   797
parenthesized subexpressions return the match from their last usage\\
Chengsong
parents: 646
diff changeset
   798
\item
Chengsong
parents: 646
diff changeset
   799
text of component subexpressions must be contained in the text of the 
Chengsong
parents: 646
diff changeset
   800
higher-level subexpressions\\
Chengsong
parents: 646
diff changeset
   801
\item
Chengsong
parents: 646
diff changeset
   802
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
Chengsong
parents: 646
diff changeset
   803
\item
Chengsong
parents: 646
diff changeset
   804
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
Chengsong
parents: 646
diff changeset
   805
\end{itemize}
Chengsong
parents: 646
diff changeset
   806
\end{quote}
Chengsong
parents: 646
diff changeset
   807
%The text above 
Chengsong
parents: 646
diff changeset
   808
%is trying to capture something very precise,
Chengsong
parents: 646
diff changeset
   809
%and is crying out for formalising.
Chengsong
parents: 646
diff changeset
   810
Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
Chengsong
parents: 646
diff changeset
   811
formalised the notion of bit-coded regular expressions
Chengsong
parents: 646
diff changeset
   812
and proved their relations with simple regular expressions in
Chengsong
parents: 646
diff changeset
   813
the dependently-typed proof assistant Agda.
Chengsong
parents: 646
diff changeset
   814
They also proved the soundness and completeness of a matching algorithm
Chengsong
parents: 646
diff changeset
   815
based on the bit-coded regular expressions.
Chengsong
parents: 646
diff changeset
   816
Ausaf et al. \cite{AusafDyckhoffUrban2016}
Chengsong
parents: 646
diff changeset
   817
are the first to
Chengsong
parents: 646
diff changeset
   818
give a quite simple formalised POSIX
Chengsong
parents: 646
diff changeset
   819
specification in Isabelle/HOL, and also prove 
Chengsong
parents: 646
diff changeset
   820
that their specification coincides with an earlier (unformalised) 
Chengsong
parents: 646
diff changeset
   821
POSIX specification given by Okui and Suzuki \cite{Okui10}.
Chengsong
parents: 646
diff changeset
   822
They then formally proved the correctness of
Chengsong
parents: 646
diff changeset
   823
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
Chengsong
parents: 646
diff changeset
   824
with regards to that specification.
Chengsong
parents: 646
diff changeset
   825
They also found that the informal POSIX
Chengsong
parents: 646
diff changeset
   826
specification by Sulzmann and Lu needs to be substantially 
Chengsong
parents: 646
diff changeset
   827
revised in order for the correctness proof to go through.
Chengsong
parents: 646
diff changeset
   828
Their original specification and proof were unfixable
Chengsong
parents: 646
diff changeset
   829
according to Ausaf et al.
Chengsong
parents: 646
diff changeset
   830
Chengsong
parents: 646
diff changeset
   831
Chengsong
parents: 646
diff changeset
   832
In the next section, we will briefly
Chengsong
parents: 646
diff changeset
   833
introduce Brzozowski derivatives and Sulzmann
Chengsong
parents: 646
diff changeset
   834
and Lu's algorithm, which the main point of this thesis builds on.
Chengsong
parents: 646
diff changeset
   835
%We give a taste of what they 
Chengsong
parents: 646
diff changeset
   836
%are like and why they are suitable for regular expression
Chengsong
parents: 646
diff changeset
   837
%matching and lexing.
Chengsong
parents: 646
diff changeset
   838
\section{Formal Specification of POSIX Matching 
Chengsong
parents: 646
diff changeset
   839
and Brzozowski Derivatives}
Chengsong
parents: 646
diff changeset
   840
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Chengsong
parents: 646
diff changeset
   841
Brzozowski \cite{Brzozowski1964} first introduced the 
Chengsong
parents: 646
diff changeset
   842
concept of a \emph{derivative} of regular expression in 1964.
Chengsong
parents: 646
diff changeset
   843
The derivative of a regular expression $r$
Chengsong
parents: 646
diff changeset
   844
with respect to a character $c$, is written as $r \backslash c$.
Chengsong
parents: 646
diff changeset
   845
This operation tells us what $r$ transforms into
Chengsong
parents: 646
diff changeset
   846
if we ``chop'' off a particular character $c$ 
Chengsong
parents: 646
diff changeset
   847
from all strings in the language of $r$ (defined
Chengsong
parents: 646
diff changeset
   848
later as $L \; r$).
Chengsong
parents: 646
diff changeset
   849
%To give a flavour of Brzozowski derivatives, we present
Chengsong
parents: 646
diff changeset
   850
%two straightforward clauses from it:
Chengsong
parents: 646
diff changeset
   851
%\begin{center}
Chengsong
parents: 646
diff changeset
   852
%	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   853
%		$d \backslash c$     & $\dn$ & 
Chengsong
parents: 646
diff changeset
   854
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
Chengsong
parents: 646
diff changeset
   855
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
Chengsong
parents: 646
diff changeset
   856
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   857
%\end{center}
Chengsong
parents: 646
diff changeset
   858
%\noindent
Chengsong
parents: 646
diff changeset
   859
%The first clause says that for the regular expression
Chengsong
parents: 646
diff changeset
   860
%denoting a singleton set consisting of a single-character string $\{ d \}$,
Chengsong
parents: 646
diff changeset
   861
%we check the derivative character $c$ against $d$,
Chengsong
parents: 646
diff changeset
   862
%returning a set containing only the empty string $\{ [] \}$
Chengsong
parents: 646
diff changeset
   863
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
Chengsong
parents: 646
diff changeset
   864
%The second clause states that to obtain the regular expression
Chengsong
parents: 646
diff changeset
   865
%representing all strings' head character $c$ being chopped off
Chengsong
parents: 646
diff changeset
   866
%from $r_1 + r_2$, one simply needs to recursively take derivative
Chengsong
parents: 646
diff changeset
   867
%of $r_1$ and $r_2$ and then put them together.
Chengsong
parents: 646
diff changeset
   868
Derivatives have the property
Chengsong
parents: 646
diff changeset
   869
that $s \in L \; (r\backslash c)$ if and only if 
Chengsong
parents: 646
diff changeset
   870
$c::s \in L \; r$ where $::$ stands for list prepending.
Chengsong
parents: 646
diff changeset
   871
%This property can be used on regular expressions
Chengsong
parents: 646
diff changeset
   872
%matching and lexing--to test whether a string $s$ is in $L \; r$,
Chengsong
parents: 646
diff changeset
   873
%one simply takes derivatives of $r$ successively with
Chengsong
parents: 646
diff changeset
   874
%respect to the characters (in the correct order) in $s$,
Chengsong
parents: 646
diff changeset
   875
%and then test whether the empty string is in the last regular expression.
Chengsong
parents: 646
diff changeset
   876
With this property, derivatives can give a simple solution
Chengsong
parents: 646
diff changeset
   877
to the problem of matching a string $s$ with a regular
Chengsong
parents: 646
diff changeset
   878
expression $r$: if the derivative of $r$ w.r.t.\ (in
Chengsong
parents: 646
diff changeset
   879
succession) all the characters of the string matches the empty string,
Chengsong
parents: 646
diff changeset
   880
then $r$ matches $s$ (and {\em vice versa}).  
Chengsong
parents: 646
diff changeset
   881
%This makes formally reasoning about these properties such
Chengsong
parents: 646
diff changeset
   882
%as correctness and complexity smooth and intuitive.
Chengsong
parents: 646
diff changeset
   883
There are several mechanised proofs of this property in various theorem
Chengsong
parents: 646
diff changeset
   884
provers,
Chengsong
parents: 646
diff changeset
   885
for example one by Owens and Slind \cite{Owens2008} in HOL4,
Chengsong
parents: 646
diff changeset
   886
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
Chengsong
parents: 646
diff changeset
   887
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
Chengsong
parents: 646
diff changeset
   888
Chengsong
parents: 646
diff changeset
   889
In addition, one can extend derivatives to bounded repetitions
Chengsong
parents: 646
diff changeset
   890
relatively straightforwardly. For example, the derivative for 
Chengsong
parents: 646
diff changeset
   891
this can be defined as:
Chengsong
parents: 646
diff changeset
   892
\begin{center}
Chengsong
parents: 646
diff changeset
   893
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   894
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
Chengsong
parents: 646
diff changeset
   895
		r^{\{n-1\}} (\text{when} n > 0)$\\
Chengsong
parents: 646
diff changeset
   896
	\end{tabular}
Chengsong
parents: 646
diff changeset
   897
\end{center}
Chengsong
parents: 646
diff changeset
   898
\noindent
Chengsong
parents: 646
diff changeset
   899
Experimental results suggest that  unlike DFA-based solutions
Chengsong
parents: 646
diff changeset
   900
for bounded regular expressions,
Chengsong
parents: 646
diff changeset
   901
derivatives can cope
Chengsong
parents: 646
diff changeset
   902
large counters
Chengsong
parents: 646
diff changeset
   903
quite well.
Chengsong
parents: 646
diff changeset
   904
Chengsong
parents: 646
diff changeset
   905
There have also been 
Chengsong
parents: 646
diff changeset
   906
extensions of derivatives to other regex constructs.
Chengsong
parents: 646
diff changeset
   907
For example, Owens et al include the derivatives
Chengsong
parents: 646
diff changeset
   908
for the \emph{NOT} regular expression, which is
Chengsong
parents: 646
diff changeset
   909
able to concisely express C-style comments of the form
Chengsong
parents: 646
diff changeset
   910
$/* \ldots */$ (see \cite{Owens2008}).
Chengsong
parents: 646
diff changeset
   911
Another extension for derivatives is
Chengsong
parents: 646
diff changeset
   912
regular expressions with look-aheads, done
Chengsong
parents: 646
diff changeset
   913
by Miyazaki and Minamide
Chengsong
parents: 646
diff changeset
   914
\cite{Takayuki2019}.
Chengsong
parents: 646
diff changeset
   915
%We therefore use Brzozowski derivatives on regular expressions 
Chengsong
parents: 646
diff changeset
   916
%lexing 
Chengsong
parents: 646
diff changeset
   917
Chengsong
parents: 646
diff changeset
   918
Chengsong
parents: 646
diff changeset
   919
Chengsong
parents: 646
diff changeset
   920
Given the above definitions and properties of
Chengsong
parents: 646
diff changeset
   921
Brzozowski derivatives, one quickly realises their potential
Chengsong
parents: 646
diff changeset
   922
in generating a formally verified algorithm for lexing: the clauses and property
Chengsong
parents: 646
diff changeset
   923
can be easily expressed in a functional programming language 
Chengsong
parents: 646
diff changeset
   924
or converted to theorem prover
Chengsong
parents: 646
diff changeset
   925
code, with great ease.
Chengsong
parents: 646
diff changeset
   926
Perhaps this is the reason why derivatives have sparked quite a bit of interest
Chengsong
parents: 646
diff changeset
   927
in the functional programming and theorem prover communities in the last
Chengsong
parents: 646
diff changeset
   928
fifteen or so years (
Chengsong
parents: 646
diff changeset
   929
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
Chengsong
parents: 646
diff changeset
   930
\cite{Chen12} and \cite{Coquand2012}
Chengsong
parents: 646
diff changeset
   931
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
Chengsong
parents: 646
diff changeset
   932
after they were first published by Brzozowski.
Chengsong
parents: 646
diff changeset
   933
Chengsong
parents: 646
diff changeset
   934
Chengsong
parents: 646
diff changeset
   935
However, there are two difficulties with derivative-based matchers:
Chengsong
parents: 646
diff changeset
   936
First, Brzozowski's original matcher only generates a yes/no answer
Chengsong
parents: 646
diff changeset
   937
for whether a regular expression matches a string or not.  This is too
Chengsong
parents: 646
diff changeset
   938
little information in the context of lexing where separate tokens must
Chengsong
parents: 646
diff changeset
   939
be identified and also classified (for example as keywords
Chengsong
parents: 646
diff changeset
   940
or identifiers). 
Chengsong
parents: 646
diff changeset
   941
Second, derivative-based matchers need to be more efficient in terms
Chengsong
parents: 646
diff changeset
   942
of the sizes of derivatives.
Chengsong
parents: 646
diff changeset
   943
Elegant and beautiful
Chengsong
parents: 646
diff changeset
   944
as many implementations are,
Chengsong
parents: 646
diff changeset
   945
they can be excruciatingly slow. 
Chengsong
parents: 646
diff changeset
   946
For example, Sulzmann and Lu
Chengsong
parents: 646
diff changeset
   947
claim a linear running time of their proposed algorithm,
Chengsong
parents: 646
diff changeset
   948
but that was falsified by our experiments. The running time 
Chengsong
parents: 646
diff changeset
   949
is actually $\Omega(2^n)$ in the worst case.
Chengsong
parents: 646
diff changeset
   950
A similar claim about a theoretical runtime of $O(n^2)$ 
Chengsong
parents: 646
diff changeset
   951
is made for the Verbatim \cite{Verbatim}
Chengsong
parents: 646
diff changeset
   952
%TODO: give references
Chengsong
parents: 646
diff changeset
   953
lexer, which calculates POSIX matches and is based on derivatives.
Chengsong
parents: 646
diff changeset
   954
They formalized the correctness of the lexer, but not their complexity result.
Chengsong
parents: 646
diff changeset
   955
In the performance evaluation section, they analyzed the run time
Chengsong
parents: 646
diff changeset
   956
of matching $a$ with the string 
Chengsong
parents: 646
diff changeset
   957
\begin{center}
Chengsong
parents: 646
diff changeset
   958
	$\underbrace{a \ldots a}_{\text{n a's}}$.
Chengsong
parents: 646
diff changeset
   959
\end{center}
Chengsong
parents: 646
diff changeset
   960
\noindent
Chengsong
parents: 646
diff changeset
   961
They concluded that the algorithm is quadratic in terms of 
Chengsong
parents: 646
diff changeset
   962
the length of the input string.
Chengsong
parents: 646
diff changeset
   963
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
Chengsong
parents: 646
diff changeset
   964
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
Chengsong
parents: 646
diff changeset
   965
Chengsong
parents: 646
diff changeset
   966
Chengsong
parents: 646
diff changeset
   967
\subsection{Sulzmann and Lu's Algorithm}
Chengsong
parents: 646
diff changeset
   968
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
Chengsong
parents: 646
diff changeset
   969
problem with the yes/no answer 
Chengsong
parents: 646
diff changeset
   970
by cleverly extending Brzozowski's matching
Chengsong
parents: 646
diff changeset
   971
algorithm. Their extended version generates additional information on
Chengsong
parents: 646
diff changeset
   972
\emph{how} a regular expression matches a string following the POSIX
Chengsong
parents: 646
diff changeset
   973
rules for regular expression matching. They achieve this by adding a
Chengsong
parents: 646
diff changeset
   974
second ``phase'' to Brzozowski's algorithm involving an injection
Chengsong
parents: 646
diff changeset
   975
function. This injection function in a sense undoes the ``damage''
Chengsong
parents: 646
diff changeset
   976
of the derivatives chopping off characters.
Chengsong
parents: 646
diff changeset
   977
In earlier work, Ausaf et al provided the formal
Chengsong
parents: 646
diff changeset
   978
specification of what POSIX matching means and proved in Isabelle/HOL
Chengsong
parents: 646
diff changeset
   979
the correctness
Chengsong
parents: 646
diff changeset
   980
of this extended algorithm accordingly
Chengsong
parents: 646
diff changeset
   981
\cite{AusafDyckhoffUrban2016}.
Chengsong
parents: 646
diff changeset
   982
Chengsong
parents: 646
diff changeset
   983
The version of the algorithm proven correct
Chengsong
parents: 646
diff changeset
   984
suffers however heavily from a 
Chengsong
parents: 646
diff changeset
   985
second difficulty, where derivatives can
Chengsong
parents: 646
diff changeset
   986
grow to arbitrarily big sizes. 
Chengsong
parents: 646
diff changeset
   987
For example if we start with the
Chengsong
parents: 646
diff changeset
   988
regular expression $(a+aa)^*$ and take
Chengsong
parents: 646
diff changeset
   989
successive derivatives according to the character $a$, we end up with
Chengsong
parents: 646
diff changeset
   990
a sequence of ever-growing derivatives like 
Chengsong
parents: 646
diff changeset
   991
Chengsong
parents: 646
diff changeset
   992
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
Chengsong
parents: 646
diff changeset
   993
\begin{center}
Chengsong
parents: 646
diff changeset
   994
\begin{tabular}{rll}
Chengsong
parents: 646
diff changeset
   995
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   996
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   997
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
Chengsong
parents: 646
diff changeset
   998
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   999
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
Chengsong
parents: 646
diff changeset
  1000
\end{tabular}
Chengsong
parents: 646
diff changeset
  1001
\end{center}
Chengsong
parents: 646
diff changeset
  1002
 
Chengsong
parents: 646
diff changeset
  1003
\noindent where after around 35 steps we usually run out of memory on a
Chengsong
parents: 646
diff changeset
  1004
typical computer.  Clearly, the
Chengsong
parents: 646
diff changeset
  1005
notation involving $\ZERO$s and $\ONE$s already suggests
Chengsong
parents: 646
diff changeset
  1006
simplification rules that can be applied to regular regular
Chengsong
parents: 646
diff changeset
  1007
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
Chengsong
parents: 646
diff changeset
  1008
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
Chengsong
parents: 646
diff changeset
  1009
r$. While such simple-minded simplifications have been proved in 
Chengsong
parents: 646
diff changeset
  1010
the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
Chengsong
parents: 646
diff changeset
  1011
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
Chengsong
parents: 646
diff changeset
  1012
\emph{not} help with limiting the growth of the derivatives shown
Chengsong
parents: 646
diff changeset
  1013
above: the growth is slowed, but the derivatives can still grow rather
Chengsong
parents: 646
diff changeset
  1014
quickly beyond any finite bound.
Chengsong
parents: 646
diff changeset
  1015
Chengsong
parents: 646
diff changeset
  1016
Therefore we want to look in this thesis at a second
Chengsong
parents: 646
diff changeset
  1017
algorithm by Sulzmann and Lu where they
Chengsong
parents: 646
diff changeset
  1018
overcame this ``growth problem'' \cite{Sulzmann2014}.
Chengsong
parents: 646
diff changeset
  1019
In this version, POSIX values are 
Chengsong
parents: 646
diff changeset
  1020
represented as bit sequences and such sequences are incrementally generated
Chengsong
parents: 646
diff changeset
  1021
when derivatives are calculated. The compact representation
Chengsong
parents: 646
diff changeset
  1022
of bit sequences and regular expressions allows them to define a more
Chengsong
parents: 646
diff changeset
  1023
``aggressive'' simplification method that keeps the size of the
Chengsong
parents: 646
diff changeset
  1024
derivatives finite no matter what the length of the string is.
Chengsong
parents: 646
diff changeset
  1025
They make some informal claims about the correctness and linear behaviour
Chengsong
parents: 646
diff changeset
  1026
of this version, but do not provide any supporting proof arguments, not
Chengsong
parents: 646
diff changeset
  1027
even ``pencil-and-paper'' arguments. They write about their bit-coded
Chengsong
parents: 646
diff changeset
  1028
\emph{incremental parsing method} (that is the algorithm to be formalised
Chengsong
parents: 646
diff changeset
  1029
in this dissertation)
Chengsong
parents: 646
diff changeset
  1030
Chengsong
parents: 646
diff changeset
  1031
Chengsong
parents: 646
diff changeset
  1032
  
Chengsong
parents: 646
diff changeset
  1033
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1034
  ``Correctness Claim: We further claim that the incremental parsing
Chengsong
parents: 646
diff changeset
  1035
  method [..] in combination with the simplification steps [..]
Chengsong
parents: 646
diff changeset
  1036
  yields POSIX parse trees. We have tested this claim
Chengsong
parents: 646
diff changeset
  1037
  extensively [..] but yet
Chengsong
parents: 646
diff changeset
  1038
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
Chengsong
parents: 646
diff changeset
  1039
\end{quote}  
Chengsong
parents: 646
diff changeset
  1040
Ausaf and Urban made some initial progress towards the 
Chengsong
parents: 646
diff changeset
  1041
full correctness proof but still had to leave out the optimisation
Chengsong
parents: 646
diff changeset
  1042
Sulzmann and Lu proposed.
Chengsong
parents: 646
diff changeset
  1043
Ausaf  wrote \cite{Ausaf},
Chengsong
parents: 646
diff changeset
  1044
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1045
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
Chengsong
parents: 646
diff changeset
  1046
\end{quote}  
Chengsong
parents: 646
diff changeset
  1047
This thesis implements the aggressive simplifications envisioned
Chengsong
parents: 646
diff changeset
  1048
by Ausaf and Urban,
Chengsong
parents: 646
diff changeset
  1049
together with a formal proof of the correctness of those simplifications.
Chengsong
parents: 646
diff changeset
  1050
Chengsong
parents: 646
diff changeset
  1051
Chengsong
parents: 646
diff changeset
  1052
One of the most recent work in the context of lexing
Chengsong
parents: 646
diff changeset
  1053
%with this issue
Chengsong
parents: 646
diff changeset
  1054
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
Chengsong
parents: 646
diff changeset
  1055
This is relevant work for us and we will compare it later with 
Chengsong
parents: 646
diff changeset
  1056
our derivative-based matcher we are going to present.
Chengsong
parents: 646
diff changeset
  1057
There is also some newer work called
Chengsong
parents: 646
diff changeset
  1058
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
Chengsong
parents: 646
diff changeset
  1059
but deterministic finite automaton instead.
Chengsong
parents: 646
diff changeset
  1060
We will also study this work in a later section.
Chengsong
parents: 646
diff changeset
  1061
%An example that gives problem to automaton approaches would be
Chengsong
parents: 646
diff changeset
  1062
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
Chengsong
parents: 646
diff changeset
  1063
%It requires at least $2^{n+1}$ states to represent
Chengsong
parents: 646
diff changeset
  1064
%as a DFA.
Chengsong
parents: 646
diff changeset
  1065
Chengsong
parents: 646
diff changeset
  1066
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1067
\section{Basic Concepts}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1068
Formal language theory usually starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1069
denoting a set of characters.
637
Chengsong
parents: 628
diff changeset
  1070
Here we use the datatype of characters from Isabelle,
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1071
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
  1072
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1073
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1074
Then using the usual bracket notation for lists,
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1075
we can define strings made up of characters as: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1076
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1077
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1078
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1079
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1080
\end{center}
583
Chengsong
parents: 579
diff changeset
  1081
where $c$ is a variable ranging over characters.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1082
The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1083
list.
637
Chengsong
parents: 628
diff changeset
  1084
For brevity, a singleton list is sometimes written as $[c]$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1085
Strings can be concatenated to form longer strings in the same
637
Chengsong
parents: 628
diff changeset
  1086
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1087
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1088
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1089
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1090
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1091
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1092
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1093
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1094
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1095
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1096
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1097
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1098
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1099
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1100
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1101
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1102
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1103
\end{center}
564
Chengsong
parents: 543
diff changeset
  1104
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
  1105
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1106
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1107
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1108
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1109
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1110
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1111
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1112
\noindent
564
Chengsong
parents: 543
diff changeset
  1113
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1114
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1115
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1116
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1117
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1118
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1119
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
  1120
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1121
564
Chengsong
parents: 543
diff changeset
  1122
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1123
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1124
\end{center}
564
Chengsong
parents: 543
diff changeset
  1125
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1126
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1127
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1128
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1129
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1130
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1131
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1132
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1133
\noindent
583
Chengsong
parents: 579
diff changeset
  1134
This can be generalised to ``chopping off'' a string 
Chengsong
parents: 579
diff changeset
  1135
from all strings within a set $A$, 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1136
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1137
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1138
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1139
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1140
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1141
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1142
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1143
which is essentially the left quotient $A \backslash L$ of $A$ against 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1144
the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1145
However, for our purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1146
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1147
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1148
The reason for defining derivatives
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1149
is that they provide another approach
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1150
to test membership of a string in 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1151
a set of strings. 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1152
For example, to test whether the string
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1153
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1154
respect to the string $bar$:
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1155
\begin{center}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1156
\begin{tabular}{lll}
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1157
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1158
	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1159
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1160
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1161
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1162
\end{tabular}	
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1163
\end{center}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1164
\noindent
637
Chengsong
parents: 628
diff changeset
  1165
and in the end, test whether the set
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1166
contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1167
	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1168
	clear from the context that we are operating
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1169
on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1170
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1171
In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1172
then we can test whether $s$ is in $S$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1173
by testing whether $[] \in S \backslash s$.
564
Chengsong
parents: 543
diff changeset
  1174
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1175
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1176
sub-languages.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1177
For example, for the sequence operator, we have
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1178
something similar to a ``chain rule'':
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1179
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1180
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1181
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1182
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1183
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1184
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1185
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1186
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1187
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1188
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1189
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1190
and get to $B$.
583
Chengsong
parents: 579
diff changeset
  1191
The language derivative for $A*$ can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1192
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1193
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1194
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1195
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1196
\begin{proof}
583
Chengsong
parents: 579
diff changeset
  1197
There are two inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1198
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1199
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1200
The set 
637
Chengsong
parents: 628
diff changeset
  1201
\[ S_1 = \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1202
is enclosed in the set
637
Chengsong
parents: 628
diff changeset
  1203
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
Chengsong
parents: 628
diff changeset
  1204
This is because for any string $c::s$ satisfying $c::s \in A*$,
Chengsong
parents: 628
diff changeset
  1205
%whenever you have a string starting with a character 
Chengsong
parents: 628
diff changeset
  1206
%in the language of a Kleene star $A*$, 
Chengsong
parents: 628
diff changeset
  1207
%then that
Chengsong
parents: 628
diff changeset
  1208
the character $c$, together with a prefix of $s$
Chengsong
parents: 628
diff changeset
  1209
%immediately after $c$ 
Chengsong
parents: 628
diff changeset
  1210
forms the first iteration of $A*$, 
Chengsong
parents: 628
diff changeset
  1211
and the rest of the $s$ is also $A*$.
Chengsong
parents: 628
diff changeset
  1212
This coincides with the definition of $S_2$.
564
Chengsong
parents: 543
diff changeset
  1213
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1214
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1215
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
583
Chengsong
parents: 579
diff changeset
  1216
holds.
Chengsong
parents: 579
diff changeset
  1217
Also the following holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1218
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
  1219
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
  1220
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
  1221
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1222
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1223
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1224
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1225
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1226
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1227
for regular expressions.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1228
To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1229
which we shall do next.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1230
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1231
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
  1232
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1233
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1234
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1235
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1236
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1237
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1238
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1239
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1240
\noindent
564
Chengsong
parents: 543
diff changeset
  1241
We call them basic because we will introduce
637
Chengsong
parents: 628
diff changeset
  1242
additional constructors in later chapters, such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1243
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
  1244
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
  1245
matches no string, and $\ONE$ for the regular
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1246
expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1247
Some authors
564
Chengsong
parents: 543
diff changeset
  1248
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1249
but we prefer this notation.} 
564
Chengsong
parents: 543
diff changeset
  1250
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
  1251
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
  1252
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
  1253
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
  1254
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
  1255
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
  1256
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1257
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1258
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1259
\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1260
$L \; \ZERO$ & $\dn$ & $\phi$\\
Chengsong
parents: 543
diff changeset
  1261
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
  1262
$L \; c$ & $\dn$ & $\{[c]\}$\\
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1263
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1264
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1265
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1266
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1267
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1268
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1269
%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1270
%their language interpretations in place, we are ready to define derivatives on regular expressions.
637
Chengsong
parents: 628
diff changeset
  1271
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1272
We do so by first introducing what properties they should satisfy.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1273
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1274
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
  1275
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
  1276
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
  1277
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
  1278
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
  1279
%will look as if it was obtained by doing a language derivative on $L(r)$:
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1280
%Recall that the language derivative acts on a 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1281
%language (set of strings).
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1282
%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1283
%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1284
%that string and then checking whether the empty 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1285
%string is in the derivative:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1286
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1287
%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1288
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1289
%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1290
%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1291
%  \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1292
%  \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1293
%  \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1294
%    \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1295
%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1296
%    \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1297
%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1298
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1299
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1300
%\notate{$\{ \ldots ,\;$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1301
%	\notate{s}{1}{$(c_1 :: s_1)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1302
%	$, \; \ldots \}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1303
%}{1}{$S_{start}$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1304
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1305
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1306
%	$\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1307
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1308
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1309
%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1310
%	$,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1311
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1312
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1313
%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1314
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1315
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1316
%	$\{ \ldots,\;  s_2
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1317
%	,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1318
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1319
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1320
%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1321
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1322
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1323
%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1324
%	S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1325
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1326
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1327
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1328
%	$s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1329
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1330
%\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1331
Brzozowski noticed that $\Der$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1332
can be ``mirrored'' on regular expressions which
564
Chengsong
parents: 543
diff changeset
  1333
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
  1334
with respect to a character $c$, written
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1335
$r \backslash c$. This infix operator
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1336
takes regular expression $r$ as input
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1337
and a character as a right operand.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1338
The derivative operation on regular expression
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1339
is defined such that the language of the derivative result 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1340
coincides with the language of the original 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1341
regular expression being taken 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1342
derivative with respect to the same characters, namely
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1343
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1344
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1345
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1346
	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1347
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1348
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1349
\noindent
637
Chengsong
parents: 628
diff changeset
  1350
Next, we give the recursive definition of derivative on
Chengsong
parents: 628
diff changeset
  1351
regular expressions so that it satisfies the properties above.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1352
%The derivative function, written $r\backslash c$, 
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1353
%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1354
%returns a new regular expression representing
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1355
%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1356
%being taken the language derivative with respect to $c$.
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1357
\begin{table}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1358
	\begin{center}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1359
\begin{tabular}{lcl}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1360
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1361
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1362
		$d \backslash c$     & $\dn$ & 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1363
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1364
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1365
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1366
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1367
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1368
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1369
\end{tabular}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1370
	\end{center}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1371
\caption{Derivative on Regular Expressions}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1372
\label{table:der}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1373
\end{table}
564
Chengsong
parents: 543
diff changeset
  1374
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1375
The most involved cases are the sequence case
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1376
and the star case.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1377
The sequence case says that if the first regular expression
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1378
contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1379
needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1380
result of this derivative:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1381
\begin{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1382
	\begin{tabular}{lcl}
583
Chengsong
parents: 579
diff changeset
  1383
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
Chengsong
parents: 579
diff changeset
  1384
		$\textit{if}\;\,([] \in L(r_1))\; 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1385
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1386
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1387
	\end{tabular}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1388
\end{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1389
\noindent
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1390
Notice how this closely resembles
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1391
the language derivative operation $\Der$:
564
Chengsong
parents: 543
diff changeset
  1392
\begin{center}
Chengsong
parents: 543
diff changeset
  1393
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1394
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
  1395
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
  1396
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
  1397
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
  1398
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
  1399
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1400
\end{center}
Chengsong
parents: 543
diff changeset
  1401
\noindent
583
Chengsong
parents: 579
diff changeset
  1402
The derivative of the star regular expression $r^*$ 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1403
unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1404
and attaches the original $r^*$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1405
after $r\backslash c$, so that 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1406
we can further unfold it as many times as needed:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1407
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1408
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1409
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1410
Again,
637
Chengsong
parents: 628
diff changeset
  1411
the structure is the same as the language derivative of the Kleene star: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1412
\[
564
Chengsong
parents: 543
diff changeset
  1413
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1414
\]
564
Chengsong
parents: 543
diff changeset
  1415
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
  1416
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
  1417
boolean condition 
Chengsong
parents: 543
diff changeset
  1418
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
  1419
somehow recursively computed.
Chengsong
parents: 543
diff changeset
  1420
We call such a function that checks
Chengsong
parents: 543
diff changeset
  1421
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
  1422
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
  1423
\begin{center}
Chengsong
parents: 543
diff changeset
  1424
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1425
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
  1426
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1427
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
  1428
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1429
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1430
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1431
		\end{tabular}
Chengsong
parents: 543
diff changeset
  1432
\end{center}
Chengsong
parents: 543
diff changeset
  1433
\noindent
Chengsong
parents: 543
diff changeset
  1434
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
  1435
does not contain any string and
Chengsong
parents: 543
diff changeset
  1436
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
  1437
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
  1438
by definition. 
Chengsong
parents: 543
diff changeset
  1439
The character regular expression $c$
Chengsong
parents: 543
diff changeset
  1440
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
  1441
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
  1442
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
  1443
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
  1444
The sequence regular expression
Chengsong
parents: 543
diff changeset
  1445
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
  1446
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
  1447
is always nullable because it naturally
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1448
contains the empty string.  
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1449
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1450
We have the following two correspondences between 
564
Chengsong
parents: 543
diff changeset
  1451
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
  1452
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
  1453
\begin{lemma}\label{derDer}
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1454
	\mbox{}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1455
	\begin{itemize}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1456
		\item
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1457
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1458
\item
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1459
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1460
	\end{itemize}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1461
\end{lemma}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1462
\begin{proof}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1463
	By induction on $r$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1464
\end{proof}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1465
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1466
which are the main properties of derivatives
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1467
that enables us later to reason about the correctness of
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1468
derivative-based matching.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1469
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1470
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1471
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1472
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1473
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1474
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583
Chengsong
parents: 579
diff changeset
  1475
$r \backslash_s [\,] $ & $\dn$ & $r$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1476
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1477
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1478
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1479
\noindent
564
Chengsong
parents: 543
diff changeset
  1480
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
  1481
omit the subscript and use $\backslash$ instead
583
Chengsong
parents: 579
diff changeset
  1482
of $\backslash_s$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1483
string derivatives for brevity.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1484
Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1485
regular-expression matching algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1486
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1487
\begin{definition}
564
Chengsong
parents: 543
diff changeset
  1488
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1489
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1490
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1491
\noindent
637
Chengsong
parents: 628
diff changeset
  1492
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
Chengsong
parents: 628
diff changeset
  1493
this algorithm, presented graphically, is as follows:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1494
601
Chengsong
parents: 591
diff changeset
  1495
\begin{equation}\label{matcher}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1496
\begin{tikzcd}
583
Chengsong
parents: 579
diff changeset
  1497
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
Chengsong
parents: 579
diff changeset
  1498
r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
Chengsong
parents: 579
diff changeset
  1499
\;\textrm{true}/\textrm{false}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1500
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1501
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1502
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1503
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1504
 It is relatively
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1505
easy to show that this matcher is correct, namely
539
Chengsong
parents: 538
diff changeset
  1506
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
  1507
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
  1508
\end{lemma}
Chengsong
parents: 538
diff changeset
  1509
\begin{proof}
637
Chengsong
parents: 628
diff changeset
  1510
	By induction on $s$ using the property of derivatives:
583
Chengsong
parents: 579
diff changeset
  1511
	lemma \ref{derDer}.
539
Chengsong
parents: 538
diff changeset
  1512
\end{proof}
601
Chengsong
parents: 591
diff changeset
  1513
\begin{figure}
564
Chengsong
parents: 543
diff changeset
  1514
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1515
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1516
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1517
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1518
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1519
    %ymode = log,
539
Chengsong
parents: 538
diff changeset
  1520
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
  1521
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1522
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1523
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
  1524
\end{axis}
Chengsong
parents: 538
diff changeset
  1525
\end{tikzpicture} 
583
Chengsong
parents: 579
diff changeset
  1526
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
Chengsong
parents: 579
diff changeset
  1527
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
Chengsong
parents: 579
diff changeset
  1528
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601
Chengsong
parents: 591
diff changeset
  1529
\end{center}
539
Chengsong
parents: 538
diff changeset
  1530
\end{figure}
Chengsong
parents: 538
diff changeset
  1531
\noindent
564
Chengsong
parents: 543
diff changeset
  1532
If we implement the above algorithm naively, however,
Chengsong
parents: 543
diff changeset
  1533
the algorithm can be excruciatingly slow, as shown in 
Chengsong
parents: 543
diff changeset
  1534
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
  1535
Note that both axes are in logarithmic scale.
637
Chengsong
parents: 628
diff changeset
  1536
Around two dozen characters
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1537
this algorithm already ``explodes'' with the regular expression 
564
Chengsong
parents: 543
diff changeset
  1538
$(a^*)^*b$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1539
To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1540
rules for the intermediate results,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1541
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539
Chengsong
parents: 538
diff changeset
  1542
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
  1543
language of the regular expression.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1544
One simple-minded simplification function
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1545
that achieves these requirements 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1546
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564
Chengsong
parents: 543
diff changeset
  1547
\begin{center}
Chengsong
parents: 543
diff changeset
  1548
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1549
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
  1550
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1551
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1552
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1553
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1554
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1555
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
  1556
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1557
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1558
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1559
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583
Chengsong
parents: 579
diff changeset
  1560
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564
Chengsong
parents: 543
diff changeset
  1561
				   
Chengsong
parents: 543
diff changeset
  1562
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1563
\end{center}
Chengsong
parents: 543
diff changeset
  1564
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
  1565
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
  1566
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
  1567
\begin{center}
Chengsong
parents: 543
diff changeset
  1568
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1569
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
  1570
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
  1571
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
  1572
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1573
\end{center}
Chengsong
parents: 543
diff changeset
  1574
\begin{figure}
539
Chengsong
parents: 538
diff changeset
  1575
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1576
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1577
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1578
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1579
    %ymode = log,
Chengsong
parents: 591
diff changeset
  1580
    %xmode = log,
564
Chengsong
parents: 543
diff changeset
  1581
    grid = both,
539
Chengsong
parents: 538
diff changeset
  1582
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
  1583
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1584
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1585
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
  1586
\end{axis}
564
Chengsong
parents: 543
diff changeset
  1587
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
  1588
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
  1589
against 
Chengsong
parents: 543
diff changeset
  1590
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
  1591
\end{figure}
Chengsong
parents: 543
diff changeset
  1592
\noindent
Chengsong
parents: 543
diff changeset
  1593
The running time of $\textit{ders}\_\textit{simp}$
583
Chengsong
parents: 579
diff changeset
  1594
on the same example of Figure \ref{NaiveMatcher}
Chengsong
parents: 579
diff changeset
  1595
is now ``tame''  in terms of the length of inputs,
Chengsong
parents: 579
diff changeset
  1596
as shown in Figure \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
  1597
637
Chengsong
parents: 628
diff changeset
  1598
So far, the story is use Brzozowski derivatives and
Chengsong
parents: 628
diff changeset
  1599
simplify as much as possible, and at the end test
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1600
whether the empty string is recognised 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1601
by the final derivative.
583
Chengsong
parents: 579
diff changeset
  1602
But what if we want to 
Chengsong
parents: 579
diff changeset
  1603
do lexing instead of just getting a true/false answer?
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1604
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1605
elegant (arguably as beautiful as the definition of the 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1606
Brzozowski derivative) solution for this.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1607
539
Chengsong
parents: 538
diff changeset
  1608
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
  1609
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
  1610
algorithm.
Chengsong
parents: 543
diff changeset
  1611
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
  1612
respect to the input string,
Chengsong
parents: 543
diff changeset
  1613
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
  1614
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
  1615
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
  1616
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
  1617
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
  1618
called \emph{value} or
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1619
sometimes also \emph{lexical value}.  
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1620
Values and regular
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1621
expressions correspond to each other 
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1622
as illustrated in the following
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1623
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1624
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1625
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1626
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1627
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1628
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1629
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1630
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1631
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1632
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1633
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1634
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1635
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1636
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1637
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1638
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1639
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1640
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1641
			&        & $\Empty$   \\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1642
			& $\mid$ & $\Char \; c$          \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1643
			& $\mid$ & $\Seq\,v_1\, v_2$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1644
			& $\mid$ & $\Left \;v$   \\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1645
			& $\mid$ & $\Right\;v$  \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1646
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1647
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1648
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1649
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1650
\noindent
564
Chengsong
parents: 543
diff changeset
  1651
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
  1652
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
  1653
\begin{center}
Chengsong
parents: 543
diff changeset
  1654
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1655
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
  1656
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1657
		$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1658
		$|\Left \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1659
		$|\Right \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1660
		$|\Stars \; []|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1661
		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
564
Chengsong
parents: 543
diff changeset
  1662
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1663
\end{center}
Chengsong
parents: 543
diff changeset
  1664
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
  1665
to indicate that a value $v$ could be generated from a lexing algorithm
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1666
with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1667
defined by the rules.
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1668
\begin{figure}[H]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1669
\begin{mathpar}
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1670
	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564
Chengsong
parents: 543
diff changeset
  1671
Chengsong
parents: 543
diff changeset
  1672
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
  1673
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1674
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564
Chengsong
parents: 543
diff changeset
  1675
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1676
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564
Chengsong
parents: 543
diff changeset
  1677
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1678
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1679
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1680
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars \; vs : r^*}
564
Chengsong
parents: 543
diff changeset
  1681
\end{mathpar}
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1682
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1683
\end{figure}
564
Chengsong
parents: 543
diff changeset
  1684
\noindent
Chengsong
parents: 543
diff changeset
  1685
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
  1686
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
  1687
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
  1688
satisfying $|v| = s$ and $\vdash v:r$ is finite.
601
Chengsong
parents: 591
diff changeset
  1689
This additional condition was
Chengsong
parents: 591
diff changeset
  1690
imposed by Ausaf and Urban to make their proofs easier.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1691
Given a string and a regular expression, there can be
564
Chengsong
parents: 543
diff changeset
  1692
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
  1693
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
  1694
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
  1695
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
  1696
Lexers therefore have to disambiguate and choose only
637
Chengsong
parents: 628
diff changeset
  1697
one of the values to be generated. $\POSIX$ is one of the
564
Chengsong
parents: 543
diff changeset
  1698
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
  1699
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1700
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
564
Chengsong
parents: 543
diff changeset
  1701
formalised the property 
Chengsong
parents: 543
diff changeset
  1702
as a ternary relation.
Chengsong
parents: 543
diff changeset
  1703
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1704
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1705
in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1706
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1707
\begin{figure}[p]
564
Chengsong
parents: 543
diff changeset
  1708
\begin{mathpar}
Chengsong
parents: 543
diff changeset
  1709
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
  1710
		
Chengsong
parents: 543
diff changeset
  1711
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
  1712
Chengsong
parents: 543
diff changeset
  1713
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
  1714
Chengsong
parents: 543
diff changeset
  1715
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
  1716
Chengsong
parents: 543
diff changeset
  1717
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
  1718
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
  1719
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
  1720
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
  1721
Chengsong
parents: 543
diff changeset
  1722
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
  1723
Chengsong
parents: 543
diff changeset
  1724
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
  1725
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
  1726
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
  1727
	(v::vs)}
Chengsong
parents: 543
diff changeset
  1728
\end{mathpar}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1729
\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1730
	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1731
This ternary relation, written $(s, r) \rightarrow v$, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1732
formalises the POSIX constraints on the
601
Chengsong
parents: 591
diff changeset
  1733
value $v$ given a string $s$ and 
Chengsong
parents: 591
diff changeset
  1734
regular expression $r$.
Chengsong
parents: 591
diff changeset
  1735
}
646
Chengsong
parents: 639
diff changeset
  1736
\label{fig:POSIXDef}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1737
\end{figure}\afterpage{\clearpage}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1738
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1739
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1740
%\begin{figure}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1741
%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1742
%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1743
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1744
%	    (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1745
%	    {$r_{token1}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1746
%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1747
%	    %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1748
%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1749
%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1750
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1751
%	    (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1752
%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1753
%	    \nodepart{two}  $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1754
%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1755
%		\node [above = 1.5cm of middle, minimum width = 6cm, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1756
%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1757
%			(topNode) {$s$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1758
%	    \path[->,draw]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1759
%	        (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1760
%	        (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1761
%		;
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1762
%			
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1763
%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1764
%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1765
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1766
%\end{figure}
637
Chengsong
parents: 628
diff changeset
  1767
The above $\POSIX$ rules follow the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1768
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1769
	\item (Left Priority)\\
637
Chengsong
parents: 628
diff changeset
  1770
		Match the leftmost regular expression when multiple options for matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1771
		are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1772
		be in the language of $L \; r_1$.
564
Chengsong
parents: 543
diff changeset
  1773
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
  1774
		Always match a subpart as much as possible before proceeding
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1775
		to the next part of the string.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1776
		For example, when the string $s$ matches 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1777
		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1778
		Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1779
		$r_{part1}$ is preferred by this maximum munch rule.
637
Chengsong
parents: 628
diff changeset
  1780
		The side-condition 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1781
		\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1782
		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1783
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1784
		\end{center}
637
Chengsong
parents: 628
diff changeset
  1785
		in PS causes this.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1786
		%(See
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1787
		%\ref{munch} for an illustration).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1788
\end{itemize}
564
Chengsong
parents: 543
diff changeset
  1789
\noindent
Chengsong
parents: 543
diff changeset
  1790
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
  1791
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1792
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
  1793
\[ 
Chengsong
parents: 543
diff changeset
  1794
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
  1795
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1796
using a regular expression 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1797
for keywords and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1798
identifiers:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1799
%(for example, keyword is a nonempty string starting with letters 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1800
%followed by alphanumeric characters or underscores):
564
Chengsong
parents: 543
diff changeset
  1801
\[
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1802
	r_{keyword} + r_{identifier}.
564
Chengsong
parents: 543
diff changeset
  1803
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1804
If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1805
as an identifier
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1806
where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1807
followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1808
then a match with a keyword (if)
564
Chengsong
parents: 543
diff changeset
  1809
followed by
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1810
an identifier (foo) would be incorrect.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1811
POSIX lexing generates what is included by lexing.
564
Chengsong
parents: 543
diff changeset
  1812
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1813
\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1814
We know that a POSIX 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1815
value for regular expression $r$ is inhabited by $r$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1816
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1817
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1818
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1819
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1820
The main property about a $\POSIX$ value is that 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1821
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1822
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1823
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1824
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1825
\end{lemma}
539
Chengsong
parents: 538
diff changeset
  1826
\begin{proof}
564
Chengsong
parents: 543
diff changeset
  1827
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
  1828
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
  1829
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
  1830
the sequence and star with non-empty iterations.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1831
We shall give the details for proving the sequence case here.
539
Chengsong
parents: 538
diff changeset
  1832
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1833
When we have 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1834
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1835
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
601
Chengsong
parents: 591
diff changeset
  1836
	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1837
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1838
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1839
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1840
we know that the last condition 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1841
excludes the possibility of a 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1842
string $s_1'$ longer than $s_1$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1843
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1844
(s_1', r_1) \rightarrow v_1'   \;\; 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1845
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1846
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1847
hold.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1848
A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1849
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1850
(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1851
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1852
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1853
cannot possibly form a $\POSIX$ value either, because
637
Chengsong
parents: 628
diff changeset
  1854
by definition, there is a candidate
Chengsong
parents: 628
diff changeset
  1855
with a longer initial string
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1856
$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1857
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1858
$s$ must have the 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1859
property that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1860
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1861
	|a| = s_1 \;\; and \;\; |b| = s_2.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1862
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1863
The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1864
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1865
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1866
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1867
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539
Chengsong
parents: 538
diff changeset
  1868
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
  1869
\end{proof}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1870
\noindent
637
Chengsong
parents: 628
diff changeset
  1871
We have now defined what a POSIX value is and shown that it is unique.
Chengsong
parents: 628
diff changeset
  1872
The problem is to generate
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1873
such a value in a lexing algorithm using derivatives.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1874
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1875
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1876
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1877
Sulzmann and Lu extended Brzozowski's 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1878
derivative-based matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1879
to a lexing algorithm by a second phase 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1880
after the initial phase of successive derivatives.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1881
This second phase generates a POSIX value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1882
if the regular expression matches the string.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1883
The algorithm uses two functions called $\inj$ and $\mkeps$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1884
The function $\mkeps$ constructs a POSIX value from the last
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1885
derivative $r_n$:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1886
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1887
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1888
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1889
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1890
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1891
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1892
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1893
\end{ceqn}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1894
\noindent
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1895
In the above diagram, again we assume that
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1896
the input string $s$ is made of $n$ characters
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1897
$c_0c_1 \ldots c_{n-1}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1898
The last derivative operation 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1899
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1900
$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1901
tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1902
regular expression $r_n$, in a POSIX way.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1903
The definition of $\mkeps$ is
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1904
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1905
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1906
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1907
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1908
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1909
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1910
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1911
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564
Chengsong
parents: 543
diff changeset
  1912
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1913
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1914
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1915
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1916
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1917
\noindent 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1918
The function prefers the left child $r_1$ of $r_1 + r_2$ 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1919
to match an empty string if there is a choice.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1920
When there is a star to match the empty string,
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1921
we give the $\Stars$ constructor an empty list, meaning
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1922
no iteration is taken.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1923
The result of $\mkeps$ on a $\nullable$ $r$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1924
is a POSIX value for $r$ and the empty string:
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1925
\begin{lemma}\label{mePosix}
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1926
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1927
\end{lemma}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1928
\begin{proof}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1929
	By induction on $r$.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1930
\end{proof}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1931
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1932
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1933
in reverse order as they were chopped off in the derivative phase.
637
Chengsong
parents: 628
diff changeset
  1934
The function for this is called $\inj$. This function 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1935
operates on values, unlike $\backslash$ which operates on regular expressions.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1936
In the diagram below, $v_i$ stands for the (POSIX) value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1937
for how the regular expression 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1938
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1939
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1940
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1941
matches $s$. 
601
Chengsong
parents: 591
diff changeset
  1942
\begin{figure}[H]
Chengsong
parents: 591
diff changeset
  1943
\begin{center}	
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1944
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1945
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1946
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1947
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1948
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1949
\end{ceqn}
601
Chengsong
parents: 591
diff changeset
  1950
\end{center}
Chengsong
parents: 591
diff changeset
  1951
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
Chengsong
parents: 591
diff changeset
  1952
	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
Chengsong
parents: 591
diff changeset
  1953
	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
Chengsong
parents: 591
diff changeset
  1954
	$c_1$, and so on. These are the same operations as they have appeared in the matcher
Chengsong
parents: 591
diff changeset
  1955
	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637
Chengsong
parents: 628
diff changeset
  1956
	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
Chengsong
parents: 628
diff changeset
  1957
	the empty string, by always selecting the leftmost 
Chengsong
parents: 628
diff changeset
  1958
	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
601
Chengsong
parents: 591
diff changeset
  1959
	appeared in the string, always preserving POSIXness.}\label{graph:inj}
Chengsong
parents: 591
diff changeset
  1960
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1961
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1962
The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1963
takes three arguments: a regular
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1964
expression ${r_{i}}$, before the character is chopped off, 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1965
a character ${c_{i}}$ (the character we want to inject back) and 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1966
the third argument $v_{i+1}$ the value we want to inject into. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1967
The result of an application 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1968
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1969
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1970
	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1971
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1972
holds.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1973
The definition of $\textit{inj}$ is as follows: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1974
\begin{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1975
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1976
  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1977
  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1978
  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1979
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1980
  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1981
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1982
  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1983
  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1984
  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1985
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1986
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1987
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1988
\noindent 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1989
The function recurses on 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1990
the shape of regular
637
Chengsong
parents: 628
diff changeset
  1991
expressions and values.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1992
Intuitively, each clause analyses 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1993
how $r_i$ could have transformed when being 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1994
derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1995
of $v_{i+1}$ has the ``hole'' 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1996
to inject the character back into.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1997
Once the character is
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1998
injected back to that sub-value; 
637
Chengsong
parents: 628
diff changeset
  1999
$\inj$ assembles all parts
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2000
to form a new value.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2001
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2002
For instance, the last clause is an
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2003
injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2004
whose second child
637
Chengsong
parents: 628
diff changeset
  2005
value is a star and the shape of the 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2006
regular expression $r_i$ before injection 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2007
is a star.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2008
We therefore know 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2009
the derivative 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2010
starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2011
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2012
	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2013
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2014
during which an iteration of the star
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2015
had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2016
value inhabitation relation:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2017
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2018
	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2019
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2020
The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2021
matched star iterations,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2022
and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2023
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2024
	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2025
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2026
Finally, 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2027
$\inj \; r \;c \; v$ is prepended
637
Chengsong
parents: 628
diff changeset
  2028
to the previous list of iterations and then
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2029
wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2030
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2031
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2032
Recall that lemma 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2033
\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2034
$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2035
The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2036
the value before injection is POSIX, namely
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2037
\begin{lemma}\label{injPosix}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2038
	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2039
	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2040
\end{lemma}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2041
\begin{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2042
	By induction on $r$.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2043
	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2044
	When $r = a \cdot b$, there can be
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2045
	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2046
	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2047
	case.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2048
	\begin{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2049
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2050
			$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2051
			The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2052
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2053
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2054
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2055
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2056
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2057
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2058
			We know that there exists a unique pair of
637
Chengsong
parents: 628
diff changeset
  2059
			$s_a$ and $s_b$ satisfying	
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2060
				$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2061
				$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2062
				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2063
				L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2064
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2065
			The last condition gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2066
			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2067
				L \; a \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2068
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2069
			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2070
			and this gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2071
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2072
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2073
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2074
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2075
			$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2076
			The argument is almost identical to the above case,	
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2077
			except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2078
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2079
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2080
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2081
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2082
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2083
			\end{center}
637
Chengsong
parents: 628
diff changeset
  2084
			With similar reasoning, 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2085
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2086
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2087
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2088
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2089
			again holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2090
		\item 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2091
			$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2092
			Again the injection result would be 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2093
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2094
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2095
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2096
					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2097
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2098
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2099
			We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2100
			allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2101
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2102
				(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2103
			\]
637
Chengsong
parents: 628
diff changeset
  2104
			Also, by inductive hypothesis
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2105
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2106
				(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2107
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2108
			holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2109
			In addition, as
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2110
			$\Right \;v_b$  instead of $\Left \ldots$ is 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2111
			the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2112
			that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2113
			This tells us that 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2114
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2115
				\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2116
				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2117
				\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2118
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2119
			which translates to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2120
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2121
				\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2122
				s_3 @s_4 = c::s  \land s_3 \in L \; a 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2123
				\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2124
			\]
637
Chengsong
parents: 628
diff changeset
  2125
			(Which says there cannot be a longer 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2126
			initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2127
			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2128
			as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2129
	\end{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2130
	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2131
\end{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2132
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2133
Putting all together, Sulzmann and Lu obtained the following algorithm
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2134
outlined in the diagram \ref{graph:inj}:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2135
\begin{center}
539
Chengsong
parents: 538
diff changeset
  2136
\begin{tabular}{lcl}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2137
	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2138
	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2139
	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2140
	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2141
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2142
\end{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2143
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2144
The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2145
according to
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2146
POSIX rules. 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2147
\begin{theorem}\label{lexerCorrectness}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2148
	The $\lexer$ based on derivatives and injections is correct: 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2149
	\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2150
		\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2151
			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2152
			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2153
		\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2154
	\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2155
\end{theorem} 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2156
\begin{proof}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2157
By induction on $s$. $r$ generalising over an arbitrary regular expression.
637
Chengsong
parents: 628
diff changeset
  2158
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2159
by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2160
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2161
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2162
As we did earlier in this chapter with the matcher, one can 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2163
introduce simplification on the regular expression in each derivative step.
637
Chengsong
parents: 628
diff changeset
  2164
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
Chengsong
parents: 628
diff changeset
  2165
and ensure that
Chengsong
parents: 628
diff changeset
  2166
the values align with the regular expression at each step.
539
Chengsong
parents: 538
diff changeset
  2167
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2168
be careful not to break the correctness, as the injection 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2169
function heavily relies on the structure of 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2170
the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2171
This can be achieved by recording some extra rectification functions
637
Chengsong
parents: 628
diff changeset
  2172
during the derivatives step and applying these rectifications in 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2173
each run during the injection phase.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2174
With extra care
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2175
one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2176
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2177
\begin{center}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2178
	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2179
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2180
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2181
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2182
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2183
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2184
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2185
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2186
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2187
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2188
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2189
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2190
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2191
				   
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2192
	\end{tabular}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2193
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2194
637
Chengsong
parents: 628
diff changeset
  2195
However, one can still end up
Chengsong
parents: 628
diff changeset
  2196
with exploding derivatives,
Chengsong
parents: 628
diff changeset
  2197
even with the simple-minded simplification rules allowed
Chengsong
parents: 628
diff changeset
  2198
in an injection-based lexer.
Chengsong
parents: 628
diff changeset
  2199
\section{A Case Requiring More Aggressive Simplifications}
539
Chengsong
parents: 538
diff changeset
  2200
For example, when starting with the regular
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2201
expression $(a^* \cdot a^*)^*$ and building just over
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2202
a dozen successive derivatives 
539
Chengsong
parents: 538
diff changeset
  2203
w.r.t.~the character $a$, one obtains a derivative regular expression
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2204
with millions of nodes (when viewed as a tree)
637
Chengsong
parents: 628
diff changeset
  2205
even with the mentioned simplifications.
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2206
\begin{figure}[H]
601
Chengsong
parents: 591
diff changeset
  2207
\begin{center}
539
Chengsong
parents: 538
diff changeset
  2208
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  2209
\begin{axis}[
Chengsong
parents: 538
diff changeset
  2210
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  2211
    ylabel={size},
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2212
    legend entries={Simple-Minded Simp, Naive Matcher},  
539
Chengsong
parents: 538
diff changeset
  2213
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  2214
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  2215
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2216
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539
Chengsong
parents: 538
diff changeset
  2217
\end{axis}
Chengsong
parents: 538
diff changeset
  2218
\end{tikzpicture} 
601
Chengsong
parents: 591
diff changeset
  2219
\end{center}
539
Chengsong
parents: 538
diff changeset
  2220
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
  2221
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
  2222
   
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2223
That is because Sulzmann and Lu's 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2224
injection-based lexing algorithm keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  2225
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
  2226
These different ways of matching will grow exponentially with the string length.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2227
Consider the case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2228
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2229
	r= (a^*\cdot a^*)^* \quad and \quad
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2230
	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2231
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2232
as an example.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2233
This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2234
many ways to split up the string into multiple segments for
637
Chengsong
parents: 628
diff changeset
  2235
different star iterations,
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2236
and for each segment 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2237
multiple ways of splitting between 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2238
the two $a^*$ sub-expressions.
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2239
When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2240
the match:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2241
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2242
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2243
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2244
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2245
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2246
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2247
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2248
The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2249
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2250
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2251
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2252
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2253
correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2254
When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2255
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2256
	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2257
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2258
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2259
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2260
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2261
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2262
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2263
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2264
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2265
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2266
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2267
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2268
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2269
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2270
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2271
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2272
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2273
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2274
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2275
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2276
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2277
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2278
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2279
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2280
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2281
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2282
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2283
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2284
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2285
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2286
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2287
		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2288
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2289
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2290
And $\derssimp \; aa \; (a^*a^*)^*$ is
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2291
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2292
	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2293
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2294
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2295
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2296
seven distinct lexical values.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2297
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2298
It is not surprising that there are exponentially many 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2299
distinct lexical values that cannot be eliminated by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2300
the simple-minded simplification of $\derssimp$. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2301
A lexer without a good enough strategy to 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2302
deduplicate will naturally
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2303
have an exponential runtime on highly
637
Chengsong
parents: 628
diff changeset
  2304
ambiguous regular expressions because there
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2305
are exponentially many matches.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2306
For this particular example, it seems
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2307
that the number of distinct matches growth
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2308
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2309
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2310
On the other hand, the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2311
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2312
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2313
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2314
	\Stars\,
637
Chengsong
parents: 628
diff changeset
  2315
	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2316
\]
637
Chengsong
parents: 628
diff changeset
  2317
At any moment, the  subterms in a regular expression
Chengsong
parents: 628
diff changeset
  2318
that will potentially result in a POSIX value is only
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2319
a minority among the many other terms,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2320
and one can remove the ones that are not possible to 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2321
be POSIX.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2322
In the above example,
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2323
\begin{equation}\label{eqn:growth2}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2324
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2325
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2326
\end{equation}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2327
can be further simplified by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2328
removing the underlined term first,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2329
which would open up possibilities
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2330
of further simplification that removes the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2331
underbraced part.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2332
The result would be 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2333
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2334
	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2335
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2336
with corresponding values
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2337
\begin{center}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2338
	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2339
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2340
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2341
	\end{tabular}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2342
\end{center}
637
Chengsong
parents: 628
diff changeset
  2343
Other terms with an underlying value, such as
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2344
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2345
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2346
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2347
do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2348
and therefore can be thrown away.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2349
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2350
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
639
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2351
have come up with some simplification steps, 
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2352
and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2353
They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2354
\[
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2355
	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2356
\]
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2357
The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2358
involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2359
which we omit here.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2360
however those steps
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2361
are not yet sufficiently strong, to achieve the above effects.
637
Chengsong
parents: 628
diff changeset
  2362
And even with these relatively mild simplifications, the proof
Chengsong
parents: 628
diff changeset
  2363
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2364
One would need to prove something like this: 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2365
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2366
	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2367
	\textit{then}\;\; (r, c::s) \rightarrow 
637
Chengsong
parents: 628
diff changeset
  2368
	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2369
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2370
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2371
not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2372
but also what specific simplifications 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2373
have been done as a function on values
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2374
showing how one can transform the value
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2375
underlying the simplified regular expression
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2376
to the unsimplified one.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2377
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2378
We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2379
also described by Sulzmann and Lu to
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2380
get better simplifications, which uses
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2381
some augmented data structures compared to 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2382
plain regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2383
We call them \emph{annotated}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2384
regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2385
With annotated regular expressions,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2386
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2387
second phase altogether.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2388
We introduce this new datatype and the 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2389
corresponding algorithm in the next chapter.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2390
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2391
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2392
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2393