ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Tue, 27 Jun 2023 01:09:36 +0100
changeset 651 deb35fd780fe
parent 649 ef2b8abcbc55
child 657 00171b627b8d
permissions -rwxr-xr-x
more
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    10
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    11
In this chapter, we define the basic notions 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    12
for regular languages and regular expressions.
583
Chengsong
parents: 579
diff changeset
    13
This is essentially a description in ``English''
637
Chengsong
parents: 628
diff changeset
    14
the functions and datatypes of our formalisation in Isabelle/HOL.
Chengsong
parents: 628
diff changeset
    15
We also define what $\POSIX$ lexing means, 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
    16
followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
564
Chengsong
parents: 543
diff changeset
    17
that produces the output conforming
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    18
to the $\POSIX$ standard\footnote{In what follows 
583
Chengsong
parents: 579
diff changeset
    19
we choose to use the Isabelle-style notation
564
Chengsong
parents: 543
diff changeset
    20
for function applications, where
583
Chengsong
parents: 579
diff changeset
    21
the parameters of a function are not enclosed
564
Chengsong
parents: 543
diff changeset
    22
inside a pair of parentheses (e.g. $f \;x \;y$
Chengsong
parents: 543
diff changeset
    23
instead of $f(x,\;y)$). This is mainly
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    24
to make the text visually more concise.}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    25
649
Chengsong
parents: 646
diff changeset
    26
Chengsong
parents: 646
diff changeset
    27
\section{Technical Overview}
Chengsong
parents: 646
diff changeset
    28
Chengsong
parents: 646
diff changeset
    29
Consider for example the regular expression $(a^*)^*\,b$ and 
Chengsong
parents: 646
diff changeset
    30
strings of the form $aa..a$. These strings cannot be matched by this regular
Chengsong
parents: 646
diff changeset
    31
expression: Obviously the expected $b$ in the last
Chengsong
parents: 646
diff changeset
    32
position is missing. One would assume that modern regular expression
Chengsong
parents: 646
diff changeset
    33
matching engines can find this out very quickly. Surprisingly, if one tries
Chengsong
parents: 646
diff changeset
    34
this example in JavaScript, Python or Java 8, even with small strings, 
Chengsong
parents: 646
diff changeset
    35
say of lenght of around 30 $a$'s,
Chengsong
parents: 646
diff changeset
    36
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
Chengsong
parents: 646
diff changeset
    37
The algorithms clearly show exponential behaviour, and as can be seen
Chengsong
parents: 646
diff changeset
    38
is triggered by some relatively simple regular expressions.
Chengsong
parents: 646
diff changeset
    39
Java 9 and newer
Chengsong
parents: 646
diff changeset
    40
versions improve this behaviour somewhat, but are still slow compared 
Chengsong
parents: 646
diff changeset
    41
with the approach we are going to use in this thesis.
Chengsong
parents: 646
diff changeset
    42
Chengsong
parents: 646
diff changeset
    43
Chengsong
parents: 646
diff changeset
    44
Chengsong
parents: 646
diff changeset
    45
This superlinear blowup in regular expression engines
Chengsong
parents: 646
diff changeset
    46
has caused grief in ``real life'' where it is 
Chengsong
parents: 646
diff changeset
    47
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
Chengsong
parents: 646
diff changeset
    48
For example, on 20 July 2016 one evil
Chengsong
parents: 646
diff changeset
    49
regular expression brought the webpage
Chengsong
parents: 646
diff changeset
    50
\href{http://stackexchange.com}{Stack Exchange} to its
Chengsong
parents: 646
diff changeset
    51
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
Chengsong
parents: 646
diff changeset
    52
In this instance, a regular expression intended to just trim white
Chengsong
parents: 646
diff changeset
    53
spaces from the beginning and the end of a line actually consumed
Chengsong
parents: 646
diff changeset
    54
massive amounts of CPU resources---causing the web servers to grind to a
Chengsong
parents: 646
diff changeset
    55
halt. In this example, the time needed to process
Chengsong
parents: 646
diff changeset
    56
the string was 
Chengsong
parents: 646
diff changeset
    57
$O(n^2)$ with respect to the string length $n$. This
Chengsong
parents: 646
diff changeset
    58
quadratic overhead was enough for the homepage of Stack Exchange to
Chengsong
parents: 646
diff changeset
    59
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
Chengsong
parents: 646
diff changeset
    60
attack and therefore stopped the servers from responding to any
Chengsong
parents: 646
diff changeset
    61
requests. This made the whole site become unavailable. 
Chengsong
parents: 646
diff changeset
    62
Chengsong
parents: 646
diff changeset
    63
\begin{figure}[p]
Chengsong
parents: 646
diff changeset
    64
\begin{center}
Chengsong
parents: 646
diff changeset
    65
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
Chengsong
parents: 646
diff changeset
    66
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    67
\begin{axis}[
Chengsong
parents: 646
diff changeset
    68
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    69
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    70
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    71
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    72
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    73
    xmax=33,
Chengsong
parents: 646
diff changeset
    74
    ymax=35,
Chengsong
parents: 646
diff changeset
    75
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    76
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    77
    axis lines=left,
Chengsong
parents: 646
diff changeset
    78
    width=5cm,
Chengsong
parents: 646
diff changeset
    79
    height=4cm, 
Chengsong
parents: 646
diff changeset
    80
    legend entries={JavaScript},  
Chengsong
parents: 646
diff changeset
    81
    legend pos=north west,
Chengsong
parents: 646
diff changeset
    82
    legend cell align=left]
Chengsong
parents: 646
diff changeset
    83
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
Chengsong
parents: 646
diff changeset
    84
\end{axis}
Chengsong
parents: 646
diff changeset
    85
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
    86
  &
Chengsong
parents: 646
diff changeset
    87
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
    88
\begin{axis}[
Chengsong
parents: 646
diff changeset
    89
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
    90
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
    91
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
    92
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
    93
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    94
    xmax=33,
Chengsong
parents: 646
diff changeset
    95
    ymax=35,
Chengsong
parents: 646
diff changeset
    96
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
    97
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
    98
    axis lines=left,
Chengsong
parents: 646
diff changeset
    99
    width=5cm,
Chengsong
parents: 646
diff changeset
   100
    height=4cm, 
Chengsong
parents: 646
diff changeset
   101
    legend entries={Python},  
Chengsong
parents: 646
diff changeset
   102
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   103
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   104
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
Chengsong
parents: 646
diff changeset
   105
\end{axis}
Chengsong
parents: 646
diff changeset
   106
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   107
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   108
\begin{axis}[
Chengsong
parents: 646
diff changeset
   109
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   110
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   111
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   112
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   113
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   114
    xmax=33,
Chengsong
parents: 646
diff changeset
   115
    ymax=35,
Chengsong
parents: 646
diff changeset
   116
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   117
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   118
    axis lines=left,
Chengsong
parents: 646
diff changeset
   119
    width=5cm,
Chengsong
parents: 646
diff changeset
   120
    height=4cm, 
Chengsong
parents: 646
diff changeset
   121
    legend entries={Java 8},  
Chengsong
parents: 646
diff changeset
   122
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   123
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   124
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
Chengsong
parents: 646
diff changeset
   125
\end{axis}
Chengsong
parents: 646
diff changeset
   126
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   127
  &
Chengsong
parents: 646
diff changeset
   128
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   129
\begin{axis}[
Chengsong
parents: 646
diff changeset
   130
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   131
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   132
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   133
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   134
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   135
    xmax=33,
Chengsong
parents: 646
diff changeset
   136
    ymax=35,
Chengsong
parents: 646
diff changeset
   137
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   138
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   139
    axis lines=left,
Chengsong
parents: 646
diff changeset
   140
    width=5cm,
Chengsong
parents: 646
diff changeset
   141
    height=4cm, 
Chengsong
parents: 646
diff changeset
   142
    legend entries={Dart},  
Chengsong
parents: 646
diff changeset
   143
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   144
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   145
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
Chengsong
parents: 646
diff changeset
   146
\end{axis}
Chengsong
parents: 646
diff changeset
   147
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   148
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   149
\begin{axis}[
Chengsong
parents: 646
diff changeset
   150
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   151
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   152
    ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   153
    enlargelimits=false,
Chengsong
parents: 646
diff changeset
   154
    xtick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   155
    xmax=33,
Chengsong
parents: 646
diff changeset
   156
    ymax=35,
Chengsong
parents: 646
diff changeset
   157
    ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   158
    scaled ticks=false,
Chengsong
parents: 646
diff changeset
   159
    axis lines=left,
Chengsong
parents: 646
diff changeset
   160
    width=5cm,
Chengsong
parents: 646
diff changeset
   161
    height=4cm, 
Chengsong
parents: 646
diff changeset
   162
    legend entries={Swift},  
Chengsong
parents: 646
diff changeset
   163
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   164
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   165
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
Chengsong
parents: 646
diff changeset
   166
\end{axis}
Chengsong
parents: 646
diff changeset
   167
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   168
  & 
Chengsong
parents: 646
diff changeset
   169
\begin{tikzpicture}
Chengsong
parents: 646
diff changeset
   170
\begin{axis}[
Chengsong
parents: 646
diff changeset
   171
    xlabel={$n$},
Chengsong
parents: 646
diff changeset
   172
    x label style={at={(1.05,-0.05)}},
Chengsong
parents: 646
diff changeset
   173
    %ylabel={time in secs},
Chengsong
parents: 646
diff changeset
   174
    enlargelimits=true,
Chengsong
parents: 646
diff changeset
   175
    %xtick={0,5000,...,40000},
Chengsong
parents: 646
diff changeset
   176
    %xmax=40000,
Chengsong
parents: 646
diff changeset
   177
    %ymax=35,
Chengsong
parents: 646
diff changeset
   178
    restrict x to domain*=0:40000,
Chengsong
parents: 646
diff changeset
   179
    restrict y to domain*=0:35,
Chengsong
parents: 646
diff changeset
   180
    %ytick={0,5,...,30},
Chengsong
parents: 646
diff changeset
   181
    %scaled ticks=false,
Chengsong
parents: 646
diff changeset
   182
    axis lines=left,
Chengsong
parents: 646
diff changeset
   183
    width=5cm,
Chengsong
parents: 646
diff changeset
   184
    height=4cm, 
Chengsong
parents: 646
diff changeset
   185
    legend entries={Java9+},  
Chengsong
parents: 646
diff changeset
   186
    legend pos=north west,
Chengsong
parents: 646
diff changeset
   187
    legend cell align=left]
Chengsong
parents: 646
diff changeset
   188
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
Chengsong
parents: 646
diff changeset
   189
\end{axis}
Chengsong
parents: 646
diff changeset
   190
\end{tikzpicture}\\ 
Chengsong
parents: 646
diff changeset
   191
\multicolumn{2}{c}{Graphs}
Chengsong
parents: 646
diff changeset
   192
\end{tabular}    
Chengsong
parents: 646
diff changeset
   193
\end{center}
Chengsong
parents: 646
diff changeset
   194
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
Chengsong
parents: 646
diff changeset
   195
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
Chengsong
parents: 646
diff changeset
   196
   The reason for their superlinear behaviour is that they do a depth-first-search
Chengsong
parents: 646
diff changeset
   197
   using NFAs.
Chengsong
parents: 646
diff changeset
   198
   If the string does not match, the regular expression matching
Chengsong
parents: 646
diff changeset
   199
   engine starts to explore all possibilities. 
Chengsong
parents: 646
diff changeset
   200
}\label{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   201
\end{figure}\afterpage{\clearpage}
Chengsong
parents: 646
diff changeset
   202
Chengsong
parents: 646
diff changeset
   203
A more recent example is a global outage of all Cloudflare servers on 2 July
Chengsong
parents: 646
diff changeset
   204
2019. A poorly written regular expression exhibited catastrophic backtracking
Chengsong
parents: 646
diff changeset
   205
and exhausted CPUs that serve HTTP traffic. Although the outage
Chengsong
parents: 646
diff changeset
   206
had several causes, at the heart was a regular expression that
Chengsong
parents: 646
diff changeset
   207
was used to monitor network
Chengsong
parents: 646
diff changeset
   208
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
Chengsong
parents: 646
diff changeset
   209
These problems with regular expressions 
Chengsong
parents: 646
diff changeset
   210
are not isolated events that happen
Chengsong
parents: 646
diff changeset
   211
very rarely, 
Chengsong
parents: 646
diff changeset
   212
%but actually widespread.
Chengsong
parents: 646
diff changeset
   213
%They occur so often that they have a 
Chengsong
parents: 646
diff changeset
   214
but they occur actually often enough that they have a
Chengsong
parents: 646
diff changeset
   215
name: Regular-Expression-Denial-Of-Service (ReDoS)
Chengsong
parents: 646
diff changeset
   216
attacks.
Chengsong
parents: 646
diff changeset
   217
Davis et al. \cite{Davis18} detected more
Chengsong
parents: 646
diff changeset
   218
than 1000 evil regular expressions
Chengsong
parents: 646
diff changeset
   219
in Node.js, Python core libraries, npm and pypi. 
Chengsong
parents: 646
diff changeset
   220
They therefore concluded that evil regular expressions
Chengsong
parents: 646
diff changeset
   221
are a real problem rather than just "a parlour trick".
Chengsong
parents: 646
diff changeset
   222
Chengsong
parents: 646
diff changeset
   223
The work in this thesis aims to address this issue
Chengsong
parents: 646
diff changeset
   224
with the help of formal proofs.
Chengsong
parents: 646
diff changeset
   225
We describe a lexing algorithm based
Chengsong
parents: 646
diff changeset
   226
on Brzozowski derivatives with verified correctness 
Chengsong
parents: 646
diff changeset
   227
and a finiteness property for the size of derivatives
Chengsong
parents: 646
diff changeset
   228
(which are all done in Isabelle/HOL).
Chengsong
parents: 646
diff changeset
   229
Such properties %guarantee the absence of 
Chengsong
parents: 646
diff changeset
   230
are an important step in preventing
Chengsong
parents: 646
diff changeset
   231
catastrophic backtracking once and for all.
Chengsong
parents: 646
diff changeset
   232
We will give more details in the next sections
Chengsong
parents: 646
diff changeset
   233
on (i) why the slow cases in graph \ref{fig:aStarStarb}
Chengsong
parents: 646
diff changeset
   234
can occur in traditional regular expression engines
Chengsong
parents: 646
diff changeset
   235
and (ii) why we choose our 
Chengsong
parents: 646
diff changeset
   236
approach based on Brzozowski derivatives and formal proofs.
Chengsong
parents: 646
diff changeset
   237
Chengsong
parents: 646
diff changeset
   238
Chengsong
parents: 646
diff changeset
   239
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Chengsong
parents: 646
diff changeset
   240
Regular expressions and regular expression matchers 
Chengsong
parents: 646
diff changeset
   241
have clearly been studied for many, many years.
Chengsong
parents: 646
diff changeset
   242
Theoretical results in automata theory state 
Chengsong
parents: 646
diff changeset
   243
that basic regular expression matching should be linear
Chengsong
parents: 646
diff changeset
   244
w.r.t the input.
Chengsong
parents: 646
diff changeset
   245
This assumes that the regular expression
Chengsong
parents: 646
diff changeset
   246
$r$ was pre-processed and turned into a
Chengsong
parents: 646
diff changeset
   247
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
Chengsong
parents: 646
diff changeset
   248
By basic we mean textbook definitions such as the one
Chengsong
parents: 646
diff changeset
   249
below, involving only regular expressions for characters, alternatives,
Chengsong
parents: 646
diff changeset
   250
sequences, and Kleene stars:
Chengsong
parents: 646
diff changeset
   251
\[
Chengsong
parents: 646
diff changeset
   252
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
Chengsong
parents: 646
diff changeset
   253
\]
Chengsong
parents: 646
diff changeset
   254
Modern regular expression matchers used by programmers,
Chengsong
parents: 646
diff changeset
   255
however,
Chengsong
parents: 646
diff changeset
   256
support much richer constructs, such as bounded repetitions,
Chengsong
parents: 646
diff changeset
   257
negations,
Chengsong
parents: 646
diff changeset
   258
and back-references.
Chengsong
parents: 646
diff changeset
   259
To differentiate, we use the word \emph{regex} to refer
Chengsong
parents: 646
diff changeset
   260
to those expressions with richer constructs while reserving the
Chengsong
parents: 646
diff changeset
   261
term \emph{regular expression}
Chengsong
parents: 646
diff changeset
   262
for the more traditional meaning in formal languages theory.
Chengsong
parents: 646
diff changeset
   263
We follow this convention 
Chengsong
parents: 646
diff changeset
   264
in this thesis.
Chengsong
parents: 646
diff changeset
   265
In the future, we aim to support all the popular features of regexes, 
Chengsong
parents: 646
diff changeset
   266
but for this work we mainly look at basic regular expressions
Chengsong
parents: 646
diff changeset
   267
and bounded repetitions.
Chengsong
parents: 646
diff changeset
   268
Chengsong
parents: 646
diff changeset
   269
Chengsong
parents: 646
diff changeset
   270
Chengsong
parents: 646
diff changeset
   271
%Most modern regex libraries
Chengsong
parents: 646
diff changeset
   272
%the so-called PCRE standard (Peral Compatible Regular Expressions)
Chengsong
parents: 646
diff changeset
   273
%has the back-references
Chengsong
parents: 646
diff changeset
   274
Regexes come with a number of constructs
Chengsong
parents: 646
diff changeset
   275
that make it more convenient for 
Chengsong
parents: 646
diff changeset
   276
programmers to write regular expressions.
Chengsong
parents: 646
diff changeset
   277
Depending on the types of constructs
Chengsong
parents: 646
diff changeset
   278
the task of matching and lexing with them
Chengsong
parents: 646
diff changeset
   279
will have different levels of complexity.
Chengsong
parents: 646
diff changeset
   280
Some of those constructs are syntactic sugars that are
Chengsong
parents: 646
diff changeset
   281
simply short hand notations
Chengsong
parents: 646
diff changeset
   282
that save the programmers a few keystrokes.
Chengsong
parents: 646
diff changeset
   283
These will not cause problems for regex libraries.
Chengsong
parents: 646
diff changeset
   284
For example the
Chengsong
parents: 646
diff changeset
   285
non-binary alternative involving three or more choices just means:
Chengsong
parents: 646
diff changeset
   286
\[
Chengsong
parents: 646
diff changeset
   287
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
Chengsong
parents: 646
diff changeset
   288
\]
Chengsong
parents: 646
diff changeset
   289
Similarly, the range operator
Chengsong
parents: 646
diff changeset
   290
%used to express the alternative
Chengsong
parents: 646
diff changeset
   291
%of all characters between its operands, 
Chengsong
parents: 646
diff changeset
   292
is just a concise way
Chengsong
parents: 646
diff changeset
   293
of expressing an alternative of consecutive characters:
Chengsong
parents: 646
diff changeset
   294
\[
Chengsong
parents: 646
diff changeset
   295
	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
Chengsong
parents: 646
diff changeset
   296
\]
Chengsong
parents: 646
diff changeset
   297
for an alternative. The
Chengsong
parents: 646
diff changeset
   298
wildcard character '$.$' is used to refer to any single character,
Chengsong
parents: 646
diff changeset
   299
\[
Chengsong
parents: 646
diff changeset
   300
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
Chengsong
parents: 646
diff changeset
   301
\]
Chengsong
parents: 646
diff changeset
   302
except the newline.
Chengsong
parents: 646
diff changeset
   303
Chengsong
parents: 646
diff changeset
   304
\subsection{Bounded Repetitions}
Chengsong
parents: 646
diff changeset
   305
More interesting are bounded repetitions, which can 
Chengsong
parents: 646
diff changeset
   306
make the regular expressions much
Chengsong
parents: 646
diff changeset
   307
more compact.
Chengsong
parents: 646
diff changeset
   308
Normally there are four kinds of bounded repetitions:
Chengsong
parents: 646
diff changeset
   309
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
Chengsong
parents: 646
diff changeset
   310
(where $n$ and $m$ are constant natural numbers).
Chengsong
parents: 646
diff changeset
   311
Like the star regular expressions, the set of strings or language
Chengsong
parents: 646
diff changeset
   312
a bounded regular expression can match
Chengsong
parents: 646
diff changeset
   313
is defined using the power operation on sets:
Chengsong
parents: 646
diff changeset
   314
\begin{center}
Chengsong
parents: 646
diff changeset
   315
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   316
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
Chengsong
parents: 646
diff changeset
   317
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   318
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
Chengsong
parents: 646
diff changeset
   319
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
Chengsong
parents: 646
diff changeset
   320
	\end{tabular}
Chengsong
parents: 646
diff changeset
   321
\end{center}
Chengsong
parents: 646
diff changeset
   322
The attraction of bounded repetitions is that they can be
Chengsong
parents: 646
diff changeset
   323
used to avoid a size blow up: for example $r^{\{n\}}$
Chengsong
parents: 646
diff changeset
   324
is a shorthand for
Chengsong
parents: 646
diff changeset
   325
the much longer regular expression:
Chengsong
parents: 646
diff changeset
   326
\[
Chengsong
parents: 646
diff changeset
   327
	\underbrace{r\ldots r}_\text{n copies of r}.
Chengsong
parents: 646
diff changeset
   328
\]
Chengsong
parents: 646
diff changeset
   329
%Therefore, a naive algorithm that simply unfolds
Chengsong
parents: 646
diff changeset
   330
%them into their desugared forms
Chengsong
parents: 646
diff changeset
   331
%will suffer from at least an exponential runtime increase.
Chengsong
parents: 646
diff changeset
   332
Chengsong
parents: 646
diff changeset
   333
Chengsong
parents: 646
diff changeset
   334
The problem with matching 
Chengsong
parents: 646
diff changeset
   335
such bounded repetitions
Chengsong
parents: 646
diff changeset
   336
is that tools based on the classic notion of
Chengsong
parents: 646
diff changeset
   337
automata need to expand $r^{\{n\}}$ into $n$ connected 
Chengsong
parents: 646
diff changeset
   338
copies of the automaton for $r$. This leads to very inefficient matching
Chengsong
parents: 646
diff changeset
   339
algorithms  or algorithms that consume large amounts of memory.
Chengsong
parents: 646
diff changeset
   340
Implementations using $\DFA$s will
Chengsong
parents: 646
diff changeset
   341
in such situations
Chengsong
parents: 646
diff changeset
   342
either become excruciatingly slow 
Chengsong
parents: 646
diff changeset
   343
(for example Verbatim++ \cite{Verbatimpp}) or run
Chengsong
parents: 646
diff changeset
   344
out of memory (for example $\mathit{LEX}$ and 
Chengsong
parents: 646
diff changeset
   345
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
Chengsong
parents: 646
diff changeset
   346
in C and JAVA that generate $\mathit{DFA}$-based
Chengsong
parents: 646
diff changeset
   347
lexers. The user provides a set of regular expressions
Chengsong
parents: 646
diff changeset
   348
and configurations, and then 
Chengsong
parents: 646
diff changeset
   349
gets an output program encoding a minimized $\mathit{DFA}$
Chengsong
parents: 646
diff changeset
   350
that can be compiled and run. 
Chengsong
parents: 646
diff changeset
   351
When given the above countdown regular expression,
Chengsong
parents: 646
diff changeset
   352
a small $n$ (say 20) would result in a program representing a
Chengsong
parents: 646
diff changeset
   353
DFA
Chengsong
parents: 646
diff changeset
   354
with millions of states.}) for large counters.
Chengsong
parents: 646
diff changeset
   355
A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
Chengsong
parents: 646
diff changeset
   356
where the minimal DFA requires at least $2^{n+1}$ states.
Chengsong
parents: 646
diff changeset
   357
For example, when $n$ is equal to 2,
Chengsong
parents: 646
diff changeset
   358
the corresponding $\mathit{NFA}$ looks like:
Chengsong
parents: 646
diff changeset
   359
\vspace{6mm}
Chengsong
parents: 646
diff changeset
   360
\begin{center}
Chengsong
parents: 646
diff changeset
   361
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   362
   \node[state,initial] (q_0)   {$q_0$}; 
Chengsong
parents: 646
diff changeset
   363
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
Chengsong
parents: 646
diff changeset
   364
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
Chengsong
parents: 646
diff changeset
   365
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
Chengsong
parents: 646
diff changeset
   366
    \path[->] 
Chengsong
parents: 646
diff changeset
   367
    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   368
    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   369
    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   370
    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   371
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   372
\end{center}
Chengsong
parents: 646
diff changeset
   373
and when turned into a DFA by the subset construction
Chengsong
parents: 646
diff changeset
   374
requires at least $2^3$ states.\footnote{The 
Chengsong
parents: 646
diff changeset
   375
red states are "countdown states" which count down 
Chengsong
parents: 646
diff changeset
   376
the number of characters needed in addition to the current
Chengsong
parents: 646
diff changeset
   377
string to make a successful match.
Chengsong
parents: 646
diff changeset
   378
For example, state $q_1$ indicates a match that has
Chengsong
parents: 646
diff changeset
   379
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 646
diff changeset
   380
and just consumed the "delimiter" $a$ in the middle, and 
Chengsong
parents: 646
diff changeset
   381
needs to match 2 more iterations of $(a|b)$ to complete.
Chengsong
parents: 646
diff changeset
   382
State $q_2$ on the other hand, can be viewed as a state
Chengsong
parents: 646
diff changeset
   383
after $q_1$ has consumed 1 character, and just waits
Chengsong
parents: 646
diff changeset
   384
for 1 more character to complete.
Chengsong
parents: 646
diff changeset
   385
The state $q_3$ is the last (accepting) state, requiring 0 
Chengsong
parents: 646
diff changeset
   386
more characters.
Chengsong
parents: 646
diff changeset
   387
Depending on the suffix of the
Chengsong
parents: 646
diff changeset
   388
input string up to the current read location,
Chengsong
parents: 646
diff changeset
   389
the states $q_1$ and $q_2$, $q_3$
Chengsong
parents: 646
diff changeset
   390
may or may
Chengsong
parents: 646
diff changeset
   391
not be active.
Chengsong
parents: 646
diff changeset
   392
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
Chengsong
parents: 646
diff changeset
   393
contain at least $2^3$ non-equivalent states that cannot be merged, 
Chengsong
parents: 646
diff changeset
   394
because the subset construction during determinisation will generate
Chengsong
parents: 646
diff changeset
   395
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Chengsong
parents: 646
diff changeset
   396
Generalizing this to regular expressions with larger
Chengsong
parents: 646
diff changeset
   397
bounded repetitions number, we have that
Chengsong
parents: 646
diff changeset
   398
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
Chengsong
parents: 646
diff changeset
   399
would require at least $2^{n+1}$ states, if $r$ itself contains
Chengsong
parents: 646
diff changeset
   400
more than 1 string.
Chengsong
parents: 646
diff changeset
   401
This is to represent all different 
Chengsong
parents: 646
diff changeset
   402
scenarios in which "countdown" states are active.}
Chengsong
parents: 646
diff changeset
   403
Chengsong
parents: 646
diff changeset
   404
Chengsong
parents: 646
diff changeset
   405
Bounded repetitions are important because they
Chengsong
parents: 646
diff changeset
   406
tend to occur frequently in practical use,
Chengsong
parents: 646
diff changeset
   407
for example in the regex library RegExLib, in
Chengsong
parents: 646
diff changeset
   408
the rules library of Snort \cite{Snort1999}\footnote{
Chengsong
parents: 646
diff changeset
   409
Snort is a network intrusion detection (NID) tool
Chengsong
parents: 646
diff changeset
   410
for monitoring network traffic.
Chengsong
parents: 646
diff changeset
   411
The network security community curates a list
Chengsong
parents: 646
diff changeset
   412
of malicious patterns written as regexes,
Chengsong
parents: 646
diff changeset
   413
which is used by Snort's detection engine
Chengsong
parents: 646
diff changeset
   414
to match against network traffic for any hostile
Chengsong
parents: 646
diff changeset
   415
activities such as buffer overflow attacks.}, 
Chengsong
parents: 646
diff changeset
   416
as well as in XML Schema definitions (XSDs).
Chengsong
parents: 646
diff changeset
   417
According to Bj\"{o}rklund et al \cite{xml2015},
Chengsong
parents: 646
diff changeset
   418
more than half of the 
Chengsong
parents: 646
diff changeset
   419
XSDs they found on the Maven.org central repository
Chengsong
parents: 646
diff changeset
   420
have bounded regular expressions in them.
Chengsong
parents: 646
diff changeset
   421
Often the counters are quite large, with the largest being
Chengsong
parents: 646
diff changeset
   422
close to ten million. 
Chengsong
parents: 646
diff changeset
   423
A smaller sample XSD they gave
Chengsong
parents: 646
diff changeset
   424
is:
Chengsong
parents: 646
diff changeset
   425
\lstset{
Chengsong
parents: 646
diff changeset
   426
	basicstyle=\fontsize{8.5}{9}\ttfamily,
Chengsong
parents: 646
diff changeset
   427
  language=XML,
Chengsong
parents: 646
diff changeset
   428
  morekeywords={encoding,
Chengsong
parents: 646
diff changeset
   429
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
Chengsong
parents: 646
diff changeset
   430
}
Chengsong
parents: 646
diff changeset
   431
\begin{lstlisting}
Chengsong
parents: 646
diff changeset
   432
<sequence minOccurs="0" maxOccurs="65535">
Chengsong
parents: 646
diff changeset
   433
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
Chengsong
parents: 646
diff changeset
   434
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
Chengsong
parents: 646
diff changeset
   435
</sequence>
Chengsong
parents: 646
diff changeset
   436
\end{lstlisting}
Chengsong
parents: 646
diff changeset
   437
This can be seen as the regex
Chengsong
parents: 646
diff changeset
   438
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
Chengsong
parents: 646
diff changeset
   439
regular expressions 
Chengsong
parents: 646
diff changeset
   440
satisfying certain constraints (such as 
Chengsong
parents: 646
diff changeset
   441
satisfying the floating point number format).
Chengsong
parents: 646
diff changeset
   442
It is therefore quite unsatisfying that 
Chengsong
parents: 646
diff changeset
   443
some regular expressions matching libraries
Chengsong
parents: 646
diff changeset
   444
impose adhoc limits
Chengsong
parents: 646
diff changeset
   445
for bounded regular expressions:
Chengsong
parents: 646
diff changeset
   446
For example, in the regular expression matching library in the Go
Chengsong
parents: 646
diff changeset
   447
language the regular expression $a^{1001}$ is not permitted, because no counter
Chengsong
parents: 646
diff changeset
   448
can be above 1000, and in the built-in Rust regular expression library
Chengsong
parents: 646
diff changeset
   449
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
Chengsong
parents: 646
diff changeset
   450
for being too big. 
Chengsong
parents: 646
diff changeset
   451
As Becchi and Crawley \cite{Becchi08}  have pointed out,
Chengsong
parents: 646
diff changeset
   452
the reason for these restrictions
Chengsong
parents: 646
diff changeset
   453
is that they simulate a non-deterministic finite
Chengsong
parents: 646
diff changeset
   454
automata (NFA) with a breadth-first search.
Chengsong
parents: 646
diff changeset
   455
This way the number of active states could
Chengsong
parents: 646
diff changeset
   456
be equal to the counter number.
Chengsong
parents: 646
diff changeset
   457
When the counters are large, 
Chengsong
parents: 646
diff changeset
   458
the memory requirement could become
Chengsong
parents: 646
diff changeset
   459
infeasible, and a regex engine
Chengsong
parents: 646
diff changeset
   460
like in Go will reject this pattern straight away.
Chengsong
parents: 646
diff changeset
   461
\begin{figure}[H]
Chengsong
parents: 646
diff changeset
   462
\begin{center}
Chengsong
parents: 646
diff changeset
   463
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
Chengsong
parents: 646
diff changeset
   464
 
Chengsong
parents: 646
diff changeset
   465
    	\node (q0) [state, initial] {$0$};
Chengsong
parents: 646
diff changeset
   466
	\node (q1) [state, right = of q0] {$1$};
Chengsong
parents: 646
diff changeset
   467
	%\node (q2) [state, right = of q1] {$2$};
Chengsong
parents: 646
diff changeset
   468
	\node (qdots) [right = of q1] {$\ldots$};
Chengsong
parents: 646
diff changeset
   469
	\node (qn) [state, right = of qdots] {$n$};
Chengsong
parents: 646
diff changeset
   470
	\node (qn1) [state, right = of qn] {$n+1$};
Chengsong
parents: 646
diff changeset
   471
	\node (qn2) [state, right = of qn1] {$n+2$};
Chengsong
parents: 646
diff changeset
   472
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
Chengsong
parents: 646
diff changeset
   473
 
Chengsong
parents: 646
diff changeset
   474
\path [-stealth, thick]
Chengsong
parents: 646
diff changeset
   475
	(q0) edge [loop above] node {a} ()
Chengsong
parents: 646
diff changeset
   476
    (q0) edge node {a}   (q1) 
Chengsong
parents: 646
diff changeset
   477
    %(q1) edge node {.}   (q2)
Chengsong
parents: 646
diff changeset
   478
    (q1) edge node {.}   (qdots)
Chengsong
parents: 646
diff changeset
   479
    (qdots) edge node {.} (qn)
Chengsong
parents: 646
diff changeset
   480
    (qn) edge node {.} (qn1)
Chengsong
parents: 646
diff changeset
   481
    (qn1) edge node {b} (qn2)
Chengsong
parents: 646
diff changeset
   482
    (qn2) edge node {$c$} (qn3);
Chengsong
parents: 646
diff changeset
   483
\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   484
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 646
diff changeset
   485
%   \node[state,initial] (q_0)   {$0$}; 
Chengsong
parents: 646
diff changeset
   486
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
Chengsong
parents: 646
diff changeset
   487
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
Chengsong
parents: 646
diff changeset
   488
%   \node[state,
Chengsong
parents: 646
diff changeset
   489
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
Chengsong
parents: 646
diff changeset
   490
%    \path[->] 
Chengsong
parents: 646
diff changeset
   491
%    (q_0) edge  node {a} (q_1)
Chengsong
parents: 646
diff changeset
   492
%    	  edge [loop below] node {a,b} ()
Chengsong
parents: 646
diff changeset
   493
%    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 646
diff changeset
   494
%    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 646
diff changeset
   495
%\end{tikzpicture}
Chengsong
parents: 646
diff changeset
   496
\end{center}
Chengsong
parents: 646
diff changeset
   497
\caption{The example given by Becchi and Crawley
Chengsong
parents: 646
diff changeset
   498
	that NFA simulation can consume large
Chengsong
parents: 646
diff changeset
   499
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
Chengsong
parents: 646
diff changeset
   500
	strings of the form $aaa\ldots aaaabc$.
Chengsong
parents: 646
diff changeset
   501
	When traversing in a breadth-first manner,
Chengsong
parents: 646
diff changeset
   502
all states from 0 till $n+1$ will become active.}
Chengsong
parents: 646
diff changeset
   503
\end{figure}
Chengsong
parents: 646
diff changeset
   504
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
Chengsong
parents: 646
diff changeset
   505
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
Chengsong
parents: 646
diff changeset
   506
%in terms of input string length.
Chengsong
parents: 646
diff changeset
   507
%TODO:try out these lexers
Chengsong
parents: 646
diff changeset
   508
These problems can of course be solved in matching algorithms where 
Chengsong
parents: 646
diff changeset
   509
automata go beyond the classic notion and for instance include explicit
Chengsong
parents: 646
diff changeset
   510
counters \cite{Turo_ov__2020}.
Chengsong
parents: 646
diff changeset
   511
These solutions can be quite efficient,
Chengsong
parents: 646
diff changeset
   512
with the ability to process
Chengsong
parents: 646
diff changeset
   513
gigabits of strings input per second
Chengsong
parents: 646
diff changeset
   514
even with large counters \cite{Becchi08}.
Chengsong
parents: 646
diff changeset
   515
These practical solutions do not come with
Chengsong
parents: 646
diff changeset
   516
formal guarantees, and as pointed out by
Chengsong
parents: 646
diff changeset
   517
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
Chengsong
parents: 646
diff changeset
   518
%But formal reasoning about these automata especially in Isabelle 
Chengsong
parents: 646
diff changeset
   519
%can be challenging
Chengsong
parents: 646
diff changeset
   520
%and un-intuitive. 
Chengsong
parents: 646
diff changeset
   521
%Therefore, we take correctness and runtime claims made about these solutions
Chengsong
parents: 646
diff changeset
   522
%with a grain of salt.
Chengsong
parents: 646
diff changeset
   523
Chengsong
parents: 646
diff changeset
   524
In the work reported in \cite{FoSSaCS2023} and here, 
Chengsong
parents: 646
diff changeset
   525
we add better support using derivatives
Chengsong
parents: 646
diff changeset
   526
for bounded regular expression $r^{\{n\}}$.
Chengsong
parents: 646
diff changeset
   527
Our results
Chengsong
parents: 646
diff changeset
   528
extend straightforwardly to
Chengsong
parents: 646
diff changeset
   529
repetitions with intervals such as 
Chengsong
parents: 646
diff changeset
   530
$r^{\{n\ldots m\}}$.
Chengsong
parents: 646
diff changeset
   531
The merit of Brzozowski derivatives (more on this later)
Chengsong
parents: 646
diff changeset
   532
on this problem is that
Chengsong
parents: 646
diff changeset
   533
it can be naturally extended to support bounded repetitions.
Chengsong
parents: 646
diff changeset
   534
Moreover these extensions are still made up of only small
Chengsong
parents: 646
diff changeset
   535
inductive datatypes and recursive functions,
Chengsong
parents: 646
diff changeset
   536
making it handy to deal with them in theorem provers.
Chengsong
parents: 646
diff changeset
   537
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
Chengsong
parents: 646
diff changeset
   538
%straightforwardly extended to deal with bounded regular expressions
Chengsong
parents: 646
diff changeset
   539
%and moreover the resulting code still consists of only simple
Chengsong
parents: 646
diff changeset
   540
%recursive functions and inductive datatypes.
Chengsong
parents: 646
diff changeset
   541
Finally, bounded regular expressions do not destroy our finite
Chengsong
parents: 646
diff changeset
   542
boundedness property, which we shall prove later on.
Chengsong
parents: 646
diff changeset
   543
Chengsong
parents: 646
diff changeset
   544
Chengsong
parents: 646
diff changeset
   545
Chengsong
parents: 646
diff changeset
   546
Chengsong
parents: 646
diff changeset
   547
Chengsong
parents: 646
diff changeset
   548
\subsection{Back-References}
Chengsong
parents: 646
diff changeset
   549
The other way to simulate an $\mathit{NFA}$ for matching is choosing  
Chengsong
parents: 646
diff changeset
   550
a single transition each time, keeping all the other options in 
Chengsong
parents: 646
diff changeset
   551
a queue or stack, and backtracking if that choice eventually 
Chengsong
parents: 646
diff changeset
   552
fails. 
Chengsong
parents: 646
diff changeset
   553
This method, often called a  "depth-first-search", 
Chengsong
parents: 646
diff changeset
   554
is efficient in many cases, but could end up
Chengsong
parents: 646
diff changeset
   555
with exponential run time.
Chengsong
parents: 646
diff changeset
   556
The backtracking method is employed in regex libraries
Chengsong
parents: 646
diff changeset
   557
that support \emph{back-references}, for example
Chengsong
parents: 646
diff changeset
   558
in Java and Python.
Chengsong
parents: 646
diff changeset
   559
%\section{Back-references and The Terminology Regex}
Chengsong
parents: 646
diff changeset
   560
Chengsong
parents: 646
diff changeset
   561
%When one constructs an $\NFA$ out of a regular expression
Chengsong
parents: 646
diff changeset
   562
%there is often very little to be done in the first phase, one simply 
Chengsong
parents: 646
diff changeset
   563
%construct the $\NFA$ states based on the structure of the input regular expression.
Chengsong
parents: 646
diff changeset
   564
Chengsong
parents: 646
diff changeset
   565
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
Chengsong
parents: 646
diff changeset
   566
%one by keeping track of all active states after consuming 
Chengsong
parents: 646
diff changeset
   567
%a character, and update that set of states iteratively.
Chengsong
parents: 646
diff changeset
   568
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
Chengsong
parents: 646
diff changeset
   569
%for a path terminating
Chengsong
parents: 646
diff changeset
   570
%at an accepting state.
Chengsong
parents: 646
diff changeset
   571
Chengsong
parents: 646
diff changeset
   572
Chengsong
parents: 646
diff changeset
   573
Chengsong
parents: 646
diff changeset
   574
Consider the following regular expression where the sequence
Chengsong
parents: 646
diff changeset
   575
operator is omitted for brevity:
Chengsong
parents: 646
diff changeset
   576
\begin{center}
Chengsong
parents: 646
diff changeset
   577
	$r_1r_2r_3r_4$
Chengsong
parents: 646
diff changeset
   578
\end{center}
Chengsong
parents: 646
diff changeset
   579
In this example,
Chengsong
parents: 646
diff changeset
   580
one could label sub-expressions of interest 
Chengsong
parents: 646
diff changeset
   581
by parenthesizing them and giving 
Chengsong
parents: 646
diff changeset
   582
them a number in the order in which their opening parentheses appear.
Chengsong
parents: 646
diff changeset
   583
One possible way of parenthesizing and labelling is: 
Chengsong
parents: 646
diff changeset
   584
\begin{center}
Chengsong
parents: 646
diff changeset
   585
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
Chengsong
parents: 646
diff changeset
   586
\end{center}
Chengsong
parents: 646
diff changeset
   587
The sub-expressions
Chengsong
parents: 646
diff changeset
   588
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
Chengsong
parents: 646
diff changeset
   589
by 1 to 4, and can be ``referred back'' by their respective numbers. 
Chengsong
parents: 646
diff changeset
   590
%These sub-expressions are called "capturing groups".
Chengsong
parents: 646
diff changeset
   591
To do so, one uses the syntax $\backslash i$ 
Chengsong
parents: 646
diff changeset
   592
to denote that we want the sub-string 
Chengsong
parents: 646
diff changeset
   593
of the input matched by the i-th
Chengsong
parents: 646
diff changeset
   594
sub-expression to appear again, 
Chengsong
parents: 646
diff changeset
   595
exactly the same as it first appeared: 
Chengsong
parents: 646
diff changeset
   596
\begin{center}
Chengsong
parents: 646
diff changeset
   597
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
Chengsong
parents: 646
diff changeset
   598
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
Chengsong
parents: 646
diff changeset
   599
\end{center}
Chengsong
parents: 646
diff changeset
   600
Once the sub-string $s_i$ for the sub-expression $r_i$
Chengsong
parents: 646
diff changeset
   601
has been fixed, there is no variability on what the back-reference
Chengsong
parents: 646
diff changeset
   602
label $\backslash i$ can be---it is tied to $s_i$.
Chengsong
parents: 646
diff changeset
   603
The overall string will look like $\ldots s_i \ldots s_i \ldots $
Chengsong
parents: 646
diff changeset
   604
%The backslash and number $i$ are the
Chengsong
parents: 646
diff changeset
   605
%so-called "back-references".
Chengsong
parents: 646
diff changeset
   606
%Let $e$ be an expression made of regular expressions 
Chengsong
parents: 646
diff changeset
   607
%and back-references. $e$ contains the expression $e_i$
Chengsong
parents: 646
diff changeset
   608
%as its $i$-th capturing group.
Chengsong
parents: 646
diff changeset
   609
%The semantics of back-reference can be recursively
Chengsong
parents: 646
diff changeset
   610
%written as:
Chengsong
parents: 646
diff changeset
   611
%\begin{center}
Chengsong
parents: 646
diff changeset
   612
%	\begin{tabular}{c}
Chengsong
parents: 646
diff changeset
   613
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
Chengsong
parents: 646
diff changeset
   614
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
Chengsong
parents: 646
diff changeset
   615
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   616
%\end{center}
Chengsong
parents: 646
diff changeset
   617
A concrete example
Chengsong
parents: 646
diff changeset
   618
for back-references is
Chengsong
parents: 646
diff changeset
   619
\begin{center}
Chengsong
parents: 646
diff changeset
   620
$(.^*)\backslash 1$,
Chengsong
parents: 646
diff changeset
   621
\end{center}
Chengsong
parents: 646
diff changeset
   622
which matches
Chengsong
parents: 646
diff changeset
   623
strings that can be split into two identical halves,
Chengsong
parents: 646
diff changeset
   624
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Chengsong
parents: 646
diff changeset
   625
Note that this is different from 
Chengsong
parents: 646
diff changeset
   626
repeating the  sub-expression verbatim like
Chengsong
parents: 646
diff changeset
   627
\begin{center}
Chengsong
parents: 646
diff changeset
   628
	$(.^*)(.^*)$,
Chengsong
parents: 646
diff changeset
   629
\end{center}
Chengsong
parents: 646
diff changeset
   630
which does not impose any restrictions on what strings the second 
Chengsong
parents: 646
diff changeset
   631
sub-expression $.^*$
Chengsong
parents: 646
diff changeset
   632
might match.
Chengsong
parents: 646
diff changeset
   633
Another example for back-references is
Chengsong
parents: 646
diff changeset
   634
\begin{center}
Chengsong
parents: 646
diff changeset
   635
$(.)(.)\backslash 2\backslash 1$
Chengsong
parents: 646
diff changeset
   636
\end{center}
Chengsong
parents: 646
diff changeset
   637
which matches four-character palindromes
Chengsong
parents: 646
diff changeset
   638
like $abba$, $x??x$ and so on.
Chengsong
parents: 646
diff changeset
   639
Chengsong
parents: 646
diff changeset
   640
Back-references are a regex construct 
Chengsong
parents: 646
diff changeset
   641
that programmers find quite useful.
Chengsong
parents: 646
diff changeset
   642
According to Becchi and Crawley \cite{Becchi08},
Chengsong
parents: 646
diff changeset
   643
6\% of Snort rules (up until 2008) use them.
Chengsong
parents: 646
diff changeset
   644
The most common use of back-references
Chengsong
parents: 646
diff changeset
   645
is to express well-formed html files,
Chengsong
parents: 646
diff changeset
   646
where back-references are convenient for matching
Chengsong
parents: 646
diff changeset
   647
opening and closing tags like 
Chengsong
parents: 646
diff changeset
   648
\begin{center}
Chengsong
parents: 646
diff changeset
   649
	$\langle html \rangle \ldots \langle / html \rangle$
Chengsong
parents: 646
diff changeset
   650
\end{center}
Chengsong
parents: 646
diff changeset
   651
A regex describing such a format
Chengsong
parents: 646
diff changeset
   652
is
Chengsong
parents: 646
diff changeset
   653
\begin{center}
Chengsong
parents: 646
diff changeset
   654
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
Chengsong
parents: 646
diff changeset
   655
\end{center}
Chengsong
parents: 646
diff changeset
   656
Despite being useful, the expressive power of regexes 
Chengsong
parents: 646
diff changeset
   657
go beyond regular languages 
Chengsong
parents: 646
diff changeset
   658
once back-references are included.
Chengsong
parents: 646
diff changeset
   659
In fact, they allow the regex construct to express 
Chengsong
parents: 646
diff changeset
   660
languages that cannot be contained in context-free
Chengsong
parents: 646
diff changeset
   661
languages either.
Chengsong
parents: 646
diff changeset
   662
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
Chengsong
parents: 646
diff changeset
   663
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
Chengsong
parents: 646
diff changeset
   664
which cannot be expressed by 
Chengsong
parents: 646
diff changeset
   665
context-free grammars \cite{campeanu2003formal}.
Chengsong
parents: 646
diff changeset
   666
Such a language is contained in the context-sensitive hierarchy
Chengsong
parents: 646
diff changeset
   667
of formal languages. 
Chengsong
parents: 646
diff changeset
   668
Also solving the matching problem involving back-references
Chengsong
parents: 646
diff changeset
   669
is known to be NP-complete \parencite{alfred2014algorithms}.
Chengsong
parents: 646
diff changeset
   670
Regex libraries supporting back-references such as 
Chengsong
parents: 646
diff changeset
   671
PCRE \cite{pcre} therefore have to
Chengsong
parents: 646
diff changeset
   672
revert to a depth-first search algorithm including backtracking.
Chengsong
parents: 646
diff changeset
   673
What is unexpected is that even in the cases 
Chengsong
parents: 646
diff changeset
   674
not involving back-references, there is still
Chengsong
parents: 646
diff changeset
   675
a (non-negligible) chance they might backtrack super-linearly,
Chengsong
parents: 646
diff changeset
   676
as shown in the graphs in figure \ref{fig:aStarStarb}.
Chengsong
parents: 646
diff changeset
   677
Chengsong
parents: 646
diff changeset
   678
Summing up, we can categorise existing 
Chengsong
parents: 646
diff changeset
   679
practical regex libraries into two kinds:
Chengsong
parents: 646
diff changeset
   680
(i) The ones  with  linear
Chengsong
parents: 646
diff changeset
   681
time guarantees like Go and Rust. The downside with them is that
Chengsong
parents: 646
diff changeset
   682
they impose restrictions
Chengsong
parents: 646
diff changeset
   683
on the regular expressions (not allowing back-references, 
Chengsong
parents: 646
diff changeset
   684
bounded repetitions cannot exceed an ad hoc limit etc.).
Chengsong
parents: 646
diff changeset
   685
And (ii) those 
Chengsong
parents: 646
diff changeset
   686
that allow large bounded regular expressions and back-references
Chengsong
parents: 646
diff changeset
   687
at the expense of using backtracking algorithms.
Chengsong
parents: 646
diff changeset
   688
They can potentially ``grind to a halt''
Chengsong
parents: 646
diff changeset
   689
on some very simple cases, resulting 
Chengsong
parents: 646
diff changeset
   690
ReDoS attacks if exposed to the internet.
Chengsong
parents: 646
diff changeset
   691
Chengsong
parents: 646
diff changeset
   692
The problems with both approaches are the motivation for us 
Chengsong
parents: 646
diff changeset
   693
to look again at the regular expression matching problem. 
Chengsong
parents: 646
diff changeset
   694
Another motivation is that regular expression matching algorithms
Chengsong
parents: 646
diff changeset
   695
that follow the POSIX standard often contain errors and bugs 
Chengsong
parents: 646
diff changeset
   696
as we shall explain next.
Chengsong
parents: 646
diff changeset
   697
Chengsong
parents: 646
diff changeset
   698
%We would like to have regex engines that can 
Chengsong
parents: 646
diff changeset
   699
%deal with the regular part (e.g.
Chengsong
parents: 646
diff changeset
   700
%bounded repetitions) of regexes more
Chengsong
parents: 646
diff changeset
   701
%efficiently.
Chengsong
parents: 646
diff changeset
   702
%Also we want to make sure that they do it correctly.
Chengsong
parents: 646
diff changeset
   703
%It turns out that such aim is not so easy to achieve.
Chengsong
parents: 646
diff changeset
   704
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
Chengsong
parents: 646
diff changeset
   705
% For example, the Rust regex engine claims to be linear, 
Chengsong
parents: 646
diff changeset
   706
% but does not support lookarounds and back-references.
Chengsong
parents: 646
diff changeset
   707
% The GoLang regex library does not support over 1000 repetitions.  
Chengsong
parents: 646
diff changeset
   708
% Java and Python both support back-references, but shows
Chengsong
parents: 646
diff changeset
   709
%catastrophic backtracking behaviours on inputs without back-references(
Chengsong
parents: 646
diff changeset
   710
%when the language is still regular).
Chengsong
parents: 646
diff changeset
   711
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
Chengsong
parents: 646
diff changeset
   712
 %TODO: verify the fact Rust does not allow 1000+ reps
Chengsong
parents: 646
diff changeset
   713
Chengsong
parents: 646
diff changeset
   714
Chengsong
parents: 646
diff changeset
   715
Chengsong
parents: 646
diff changeset
   716
Chengsong
parents: 646
diff changeset
   717
%The time cost of regex matching algorithms in general
Chengsong
parents: 646
diff changeset
   718
%involve two different phases, and different things can go differently wrong on 
Chengsong
parents: 646
diff changeset
   719
%these phases.
Chengsong
parents: 646
diff changeset
   720
%$\DFA$s usually have problems in the first (construction) phase
Chengsong
parents: 646
diff changeset
   721
%, whereas $\NFA$s usually run into trouble
Chengsong
parents: 646
diff changeset
   722
%on the second phase.
Chengsong
parents: 646
diff changeset
   723
Chengsong
parents: 646
diff changeset
   724
Chengsong
parents: 646
diff changeset
   725
\section{Error-prone POSIX Implementations}
Chengsong
parents: 646
diff changeset
   726
Very often there are multiple ways of matching a string
Chengsong
parents: 646
diff changeset
   727
with a regular expression.
Chengsong
parents: 646
diff changeset
   728
In such cases the regular expressions matcher needs to
Chengsong
parents: 646
diff changeset
   729
disambiguate.
Chengsong
parents: 646
diff changeset
   730
The more widely used strategy is called POSIX,
Chengsong
parents: 646
diff changeset
   731
which roughly speaking always chooses the longest initial match.
Chengsong
parents: 646
diff changeset
   732
The POSIX strategy is widely adopted in many regular expression matchers
Chengsong
parents: 646
diff changeset
   733
because it is a natural strategy for lexers.
Chengsong
parents: 646
diff changeset
   734
However, many implementations (including the C libraries
Chengsong
parents: 646
diff changeset
   735
used by Linux and OS X distributions) contain bugs
Chengsong
parents: 646
diff changeset
   736
or do not meet the specification they claim to adhere to.
Chengsong
parents: 646
diff changeset
   737
Kuklewicz maintains a unit test repository which lists some
Chengsong
parents: 646
diff changeset
   738
problems with existing regular expression engines \cite{KuklewiczHaskell}.
Chengsong
parents: 646
diff changeset
   739
In some cases, they either fail to generate a
Chengsong
parents: 646
diff changeset
   740
result when there exists a match,
Chengsong
parents: 646
diff changeset
   741
or give results that are inconsistent with the POSIX standard.
Chengsong
parents: 646
diff changeset
   742
A concrete example is the regex:
Chengsong
parents: 646
diff changeset
   743
\begin{center}
Chengsong
parents: 646
diff changeset
   744
	$(aba + ab + a)^* \text{and the string} \; ababa$
Chengsong
parents: 646
diff changeset
   745
\end{center}
Chengsong
parents: 646
diff changeset
   746
The correct POSIX match for the above
Chengsong
parents: 646
diff changeset
   747
involves the entire string $ababa$, 
Chengsong
parents: 646
diff changeset
   748
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
Chengsong
parents: 646
diff changeset
   749
$[0, 2), [2, 5)$
Chengsong
parents: 646
diff changeset
   750
respectively.
Chengsong
parents: 646
diff changeset
   751
But feeding this example to the different engines
Chengsong
parents: 646
diff changeset
   752
listed at regex101 \footnote{
Chengsong
parents: 646
diff changeset
   753
	regex101 is an online regular expression matcher which
Chengsong
parents: 646
diff changeset
   754
	provides API for trying out regular expression
Chengsong
parents: 646
diff changeset
   755
	engines of multiple popular programming languages like
Chengsong
parents: 646
diff changeset
   756
Java, Python, Go, etc.} \parencite{regex101}. 
Chengsong
parents: 646
diff changeset
   757
yields
Chengsong
parents: 646
diff changeset
   758
only two incomplete matches: $[aba]$ at $[0, 3)$
Chengsong
parents: 646
diff changeset
   759
and $a$ at $[4, 5)$.
Chengsong
parents: 646
diff changeset
   760
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   761
commented that most regex libraries are not
Chengsong
parents: 646
diff changeset
   762
correctly implementing the central POSIX
Chengsong
parents: 646
diff changeset
   763
rule, called the maximum munch rule.
Chengsong
parents: 646
diff changeset
   764
Grathwohl \parencite{grathwohl2014crash} wrote:
Chengsong
parents: 646
diff changeset
   765
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   766
	``The POSIX strategy is more complicated than the 
Chengsong
parents: 646
diff changeset
   767
	greedy because of the dependence on information about 
Chengsong
parents: 646
diff changeset
   768
	the length of matched strings in the various subexpressions.''
Chengsong
parents: 646
diff changeset
   769
\end{quote}
Chengsong
parents: 646
diff changeset
   770
%\noindent
Chengsong
parents: 646
diff changeset
   771
People have recognised that the implementation complexity of POSIX rules also come from
Chengsong
parents: 646
diff changeset
   772
the specification being not very precise.
Chengsong
parents: 646
diff changeset
   773
The developers of the regexp package of Go 
Chengsong
parents: 646
diff changeset
   774
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
Chengsong
parents: 646
diff changeset
   775
commented that
Chengsong
parents: 646
diff changeset
   776
\begin{quote}\it
Chengsong
parents: 646
diff changeset
   777
``
Chengsong
parents: 646
diff changeset
   778
The POSIX rule is computationally prohibitive
Chengsong
parents: 646
diff changeset
   779
and not even well-defined.
Chengsong
parents: 646
diff changeset
   780
``
Chengsong
parents: 646
diff changeset
   781
\end{quote}
Chengsong
parents: 646
diff changeset
   782
There are many informal summaries of this disambiguation
Chengsong
parents: 646
diff changeset
   783
strategy, which are often quite long and delicate.
Chengsong
parents: 646
diff changeset
   784
For example Kuklewicz \cite{KuklewiczHaskell} 
Chengsong
parents: 646
diff changeset
   785
described the POSIX rule as (section 1, last paragraph):
Chengsong
parents: 646
diff changeset
   786
\begin{quote}
Chengsong
parents: 646
diff changeset
   787
	\begin{itemize}
Chengsong
parents: 646
diff changeset
   788
		\item
Chengsong
parents: 646
diff changeset
   789
regular expressions (REs) take the leftmost starting match, and the longest match starting there
Chengsong
parents: 646
diff changeset
   790
earlier subpatterns have leftmost-longest priority over later subpatterns\\
Chengsong
parents: 646
diff changeset
   791
\item
Chengsong
parents: 646
diff changeset
   792
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
Chengsong
parents: 646
diff changeset
   793
\item
Chengsong
parents: 646
diff changeset
   794
REs have right associative concatenation which can be changed with parenthesis\\
Chengsong
parents: 646
diff changeset
   795
\item
Chengsong
parents: 646
diff changeset
   796
parenthesized subexpressions return the match from their last usage\\
Chengsong
parents: 646
diff changeset
   797
\item
Chengsong
parents: 646
diff changeset
   798
text of component subexpressions must be contained in the text of the 
Chengsong
parents: 646
diff changeset
   799
higher-level subexpressions\\
Chengsong
parents: 646
diff changeset
   800
\item
Chengsong
parents: 646
diff changeset
   801
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
Chengsong
parents: 646
diff changeset
   802
\item
Chengsong
parents: 646
diff changeset
   803
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
Chengsong
parents: 646
diff changeset
   804
\end{itemize}
Chengsong
parents: 646
diff changeset
   805
\end{quote}
Chengsong
parents: 646
diff changeset
   806
%The text above 
Chengsong
parents: 646
diff changeset
   807
%is trying to capture something very precise,
Chengsong
parents: 646
diff changeset
   808
%and is crying out for formalising.
Chengsong
parents: 646
diff changeset
   809
Ribeiro and Du Bois \cite{RibeiroAgda2017} have 
Chengsong
parents: 646
diff changeset
   810
formalised the notion of bit-coded regular expressions
Chengsong
parents: 646
diff changeset
   811
and proved their relations with simple regular expressions in
Chengsong
parents: 646
diff changeset
   812
the dependently-typed proof assistant Agda.
Chengsong
parents: 646
diff changeset
   813
They also proved the soundness and completeness of a matching algorithm
Chengsong
parents: 646
diff changeset
   814
based on the bit-coded regular expressions.
Chengsong
parents: 646
diff changeset
   815
Ausaf et al. \cite{AusafDyckhoffUrban2016}
Chengsong
parents: 646
diff changeset
   816
are the first to
Chengsong
parents: 646
diff changeset
   817
give a quite simple formalised POSIX
Chengsong
parents: 646
diff changeset
   818
specification in Isabelle/HOL, and also prove 
Chengsong
parents: 646
diff changeset
   819
that their specification coincides with an earlier (unformalised) 
Chengsong
parents: 646
diff changeset
   820
POSIX specification given by Okui and Suzuki \cite{Okui10}.
Chengsong
parents: 646
diff changeset
   821
They then formally proved the correctness of
Chengsong
parents: 646
diff changeset
   822
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
Chengsong
parents: 646
diff changeset
   823
with regards to that specification.
Chengsong
parents: 646
diff changeset
   824
They also found that the informal POSIX
Chengsong
parents: 646
diff changeset
   825
specification by Sulzmann and Lu needs to be substantially 
Chengsong
parents: 646
diff changeset
   826
revised in order for the correctness proof to go through.
Chengsong
parents: 646
diff changeset
   827
Their original specification and proof were unfixable
Chengsong
parents: 646
diff changeset
   828
according to Ausaf et al.
Chengsong
parents: 646
diff changeset
   829
Chengsong
parents: 646
diff changeset
   830
Chengsong
parents: 646
diff changeset
   831
In the next section, we will briefly
Chengsong
parents: 646
diff changeset
   832
introduce Brzozowski derivatives and Sulzmann
Chengsong
parents: 646
diff changeset
   833
and Lu's algorithm, which the main point of this thesis builds on.
Chengsong
parents: 646
diff changeset
   834
%We give a taste of what they 
Chengsong
parents: 646
diff changeset
   835
%are like and why they are suitable for regular expression
Chengsong
parents: 646
diff changeset
   836
%matching and lexing.
Chengsong
parents: 646
diff changeset
   837
\section{Formal Specification of POSIX Matching 
Chengsong
parents: 646
diff changeset
   838
and Brzozowski Derivatives}
Chengsong
parents: 646
diff changeset
   839
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Chengsong
parents: 646
diff changeset
   840
Brzozowski \cite{Brzozowski1964} first introduced the 
Chengsong
parents: 646
diff changeset
   841
concept of a \emph{derivative} of regular expression in 1964.
Chengsong
parents: 646
diff changeset
   842
The derivative of a regular expression $r$
Chengsong
parents: 646
diff changeset
   843
with respect to a character $c$, is written as $r \backslash c$.
Chengsong
parents: 646
diff changeset
   844
This operation tells us what $r$ transforms into
Chengsong
parents: 646
diff changeset
   845
if we ``chop'' off a particular character $c$ 
Chengsong
parents: 646
diff changeset
   846
from all strings in the language of $r$ (defined
Chengsong
parents: 646
diff changeset
   847
later as $L \; r$).
Chengsong
parents: 646
diff changeset
   848
%To give a flavour of Brzozowski derivatives, we present
Chengsong
parents: 646
diff changeset
   849
%two straightforward clauses from it:
Chengsong
parents: 646
diff changeset
   850
%\begin{center}
Chengsong
parents: 646
diff changeset
   851
%	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   852
%		$d \backslash c$     & $\dn$ & 
Chengsong
parents: 646
diff changeset
   853
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
Chengsong
parents: 646
diff changeset
   854
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
Chengsong
parents: 646
diff changeset
   855
%	\end{tabular}
Chengsong
parents: 646
diff changeset
   856
%\end{center}
Chengsong
parents: 646
diff changeset
   857
%\noindent
Chengsong
parents: 646
diff changeset
   858
%The first clause says that for the regular expression
Chengsong
parents: 646
diff changeset
   859
%denoting a singleton set consisting of a single-character string $\{ d \}$,
Chengsong
parents: 646
diff changeset
   860
%we check the derivative character $c$ against $d$,
Chengsong
parents: 646
diff changeset
   861
%returning a set containing only the empty string $\{ [] \}$
Chengsong
parents: 646
diff changeset
   862
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
Chengsong
parents: 646
diff changeset
   863
%The second clause states that to obtain the regular expression
Chengsong
parents: 646
diff changeset
   864
%representing all strings' head character $c$ being chopped off
Chengsong
parents: 646
diff changeset
   865
%from $r_1 + r_2$, one simply needs to recursively take derivative
Chengsong
parents: 646
diff changeset
   866
%of $r_1$ and $r_2$ and then put them together.
Chengsong
parents: 646
diff changeset
   867
Derivatives have the property
Chengsong
parents: 646
diff changeset
   868
that $s \in L \; (r\backslash c)$ if and only if 
Chengsong
parents: 646
diff changeset
   869
$c::s \in L \; r$ where $::$ stands for list prepending.
Chengsong
parents: 646
diff changeset
   870
%This property can be used on regular expressions
Chengsong
parents: 646
diff changeset
   871
%matching and lexing--to test whether a string $s$ is in $L \; r$,
Chengsong
parents: 646
diff changeset
   872
%one simply takes derivatives of $r$ successively with
Chengsong
parents: 646
diff changeset
   873
%respect to the characters (in the correct order) in $s$,
Chengsong
parents: 646
diff changeset
   874
%and then test whether the empty string is in the last regular expression.
Chengsong
parents: 646
diff changeset
   875
With this property, derivatives can give a simple solution
Chengsong
parents: 646
diff changeset
   876
to the problem of matching a string $s$ with a regular
Chengsong
parents: 646
diff changeset
   877
expression $r$: if the derivative of $r$ w.r.t.\ (in
Chengsong
parents: 646
diff changeset
   878
succession) all the characters of the string matches the empty string,
Chengsong
parents: 646
diff changeset
   879
then $r$ matches $s$ (and {\em vice versa}).  
Chengsong
parents: 646
diff changeset
   880
%This makes formally reasoning about these properties such
Chengsong
parents: 646
diff changeset
   881
%as correctness and complexity smooth and intuitive.
Chengsong
parents: 646
diff changeset
   882
There are several mechanised proofs of this property in various theorem
Chengsong
parents: 646
diff changeset
   883
provers,
Chengsong
parents: 646
diff changeset
   884
for example one by Owens and Slind \cite{Owens2008} in HOL4,
Chengsong
parents: 646
diff changeset
   885
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
Chengsong
parents: 646
diff changeset
   886
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
Chengsong
parents: 646
diff changeset
   887
Chengsong
parents: 646
diff changeset
   888
In addition, one can extend derivatives to bounded repetitions
Chengsong
parents: 646
diff changeset
   889
relatively straightforwardly. For example, the derivative for 
Chengsong
parents: 646
diff changeset
   890
this can be defined as:
Chengsong
parents: 646
diff changeset
   891
\begin{center}
Chengsong
parents: 646
diff changeset
   892
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
   893
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
Chengsong
parents: 646
diff changeset
   894
		r^{\{n-1\}} (\text{when} n > 0)$\\
Chengsong
parents: 646
diff changeset
   895
	\end{tabular}
Chengsong
parents: 646
diff changeset
   896
\end{center}
Chengsong
parents: 646
diff changeset
   897
\noindent
Chengsong
parents: 646
diff changeset
   898
Experimental results suggest that  unlike DFA-based solutions
Chengsong
parents: 646
diff changeset
   899
for bounded regular expressions,
Chengsong
parents: 646
diff changeset
   900
derivatives can cope
Chengsong
parents: 646
diff changeset
   901
large counters
Chengsong
parents: 646
diff changeset
   902
quite well.
Chengsong
parents: 646
diff changeset
   903
Chengsong
parents: 646
diff changeset
   904
There have also been 
Chengsong
parents: 646
diff changeset
   905
extensions of derivatives to other regex constructs.
Chengsong
parents: 646
diff changeset
   906
For example, Owens et al include the derivatives
Chengsong
parents: 646
diff changeset
   907
for the \emph{NOT} regular expression, which is
Chengsong
parents: 646
diff changeset
   908
able to concisely express C-style comments of the form
Chengsong
parents: 646
diff changeset
   909
$/* \ldots */$ (see \cite{Owens2008}).
Chengsong
parents: 646
diff changeset
   910
Another extension for derivatives is
Chengsong
parents: 646
diff changeset
   911
regular expressions with look-aheads, done
Chengsong
parents: 646
diff changeset
   912
by Miyazaki and Minamide
Chengsong
parents: 646
diff changeset
   913
\cite{Takayuki2019}.
Chengsong
parents: 646
diff changeset
   914
%We therefore use Brzozowski derivatives on regular expressions 
Chengsong
parents: 646
diff changeset
   915
%lexing 
Chengsong
parents: 646
diff changeset
   916
Chengsong
parents: 646
diff changeset
   917
Chengsong
parents: 646
diff changeset
   918
Chengsong
parents: 646
diff changeset
   919
Given the above definitions and properties of
Chengsong
parents: 646
diff changeset
   920
Brzozowski derivatives, one quickly realises their potential
Chengsong
parents: 646
diff changeset
   921
in generating a formally verified algorithm for lexing: the clauses and property
Chengsong
parents: 646
diff changeset
   922
can be easily expressed in a functional programming language 
Chengsong
parents: 646
diff changeset
   923
or converted to theorem prover
Chengsong
parents: 646
diff changeset
   924
code, with great ease.
Chengsong
parents: 646
diff changeset
   925
Perhaps this is the reason why derivatives have sparked quite a bit of interest
Chengsong
parents: 646
diff changeset
   926
in the functional programming and theorem prover communities in the last
Chengsong
parents: 646
diff changeset
   927
fifteen or so years (
Chengsong
parents: 646
diff changeset
   928
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
Chengsong
parents: 646
diff changeset
   929
\cite{Chen12} and \cite{Coquand2012}
Chengsong
parents: 646
diff changeset
   930
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
Chengsong
parents: 646
diff changeset
   931
after they were first published by Brzozowski.
Chengsong
parents: 646
diff changeset
   932
Chengsong
parents: 646
diff changeset
   933
Chengsong
parents: 646
diff changeset
   934
However, there are two difficulties with derivative-based matchers:
Chengsong
parents: 646
diff changeset
   935
First, Brzozowski's original matcher only generates a yes/no answer
Chengsong
parents: 646
diff changeset
   936
for whether a regular expression matches a string or not.  This is too
Chengsong
parents: 646
diff changeset
   937
little information in the context of lexing where separate tokens must
Chengsong
parents: 646
diff changeset
   938
be identified and also classified (for example as keywords
Chengsong
parents: 646
diff changeset
   939
or identifiers). 
Chengsong
parents: 646
diff changeset
   940
Second, derivative-based matchers need to be more efficient in terms
Chengsong
parents: 646
diff changeset
   941
of the sizes of derivatives.
Chengsong
parents: 646
diff changeset
   942
Elegant and beautiful
Chengsong
parents: 646
diff changeset
   943
as many implementations are,
Chengsong
parents: 646
diff changeset
   944
they can be excruciatingly slow. 
Chengsong
parents: 646
diff changeset
   945
For example, Sulzmann and Lu
Chengsong
parents: 646
diff changeset
   946
claim a linear running time of their proposed algorithm,
Chengsong
parents: 646
diff changeset
   947
but that was falsified by our experiments. The running time 
Chengsong
parents: 646
diff changeset
   948
is actually $\Omega(2^n)$ in the worst case.
Chengsong
parents: 646
diff changeset
   949
A similar claim about a theoretical runtime of $O(n^2)$ 
Chengsong
parents: 646
diff changeset
   950
is made for the Verbatim \cite{Verbatim}
Chengsong
parents: 646
diff changeset
   951
%TODO: give references
Chengsong
parents: 646
diff changeset
   952
lexer, which calculates POSIX matches and is based on derivatives.
Chengsong
parents: 646
diff changeset
   953
They formalized the correctness of the lexer, but not their complexity result.
Chengsong
parents: 646
diff changeset
   954
In the performance evaluation section, they analyzed the run time
Chengsong
parents: 646
diff changeset
   955
of matching $a$ with the string 
Chengsong
parents: 646
diff changeset
   956
\begin{center}
Chengsong
parents: 646
diff changeset
   957
	$\underbrace{a \ldots a}_{\text{n a's}}$.
Chengsong
parents: 646
diff changeset
   958
\end{center}
Chengsong
parents: 646
diff changeset
   959
\noindent
Chengsong
parents: 646
diff changeset
   960
They concluded that the algorithm is quadratic in terms of 
Chengsong
parents: 646
diff changeset
   961
the length of the input string.
Chengsong
parents: 646
diff changeset
   962
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
Chengsong
parents: 646
diff changeset
   963
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
Chengsong
parents: 646
diff changeset
   964
Chengsong
parents: 646
diff changeset
   965
Chengsong
parents: 646
diff changeset
   966
\subsection{Sulzmann and Lu's Algorithm}
Chengsong
parents: 646
diff changeset
   967
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
Chengsong
parents: 646
diff changeset
   968
problem with the yes/no answer 
Chengsong
parents: 646
diff changeset
   969
by cleverly extending Brzozowski's matching
Chengsong
parents: 646
diff changeset
   970
algorithm. Their extended version generates additional information on
Chengsong
parents: 646
diff changeset
   971
\emph{how} a regular expression matches a string following the POSIX
Chengsong
parents: 646
diff changeset
   972
rules for regular expression matching. They achieve this by adding a
Chengsong
parents: 646
diff changeset
   973
second ``phase'' to Brzozowski's algorithm involving an injection
Chengsong
parents: 646
diff changeset
   974
function. This injection function in a sense undoes the ``damage''
Chengsong
parents: 646
diff changeset
   975
of the derivatives chopping off characters.
Chengsong
parents: 646
diff changeset
   976
In earlier work, Ausaf et al provided the formal
Chengsong
parents: 646
diff changeset
   977
specification of what POSIX matching means and proved in Isabelle/HOL
Chengsong
parents: 646
diff changeset
   978
the correctness
Chengsong
parents: 646
diff changeset
   979
of this extended algorithm accordingly
Chengsong
parents: 646
diff changeset
   980
\cite{AusafDyckhoffUrban2016}.
Chengsong
parents: 646
diff changeset
   981
Chengsong
parents: 646
diff changeset
   982
The version of the algorithm proven correct
Chengsong
parents: 646
diff changeset
   983
suffers however heavily from a 
Chengsong
parents: 646
diff changeset
   984
second difficulty, where derivatives can
Chengsong
parents: 646
diff changeset
   985
grow to arbitrarily big sizes. 
Chengsong
parents: 646
diff changeset
   986
For example if we start with the
Chengsong
parents: 646
diff changeset
   987
regular expression $(a+aa)^*$ and take
Chengsong
parents: 646
diff changeset
   988
successive derivatives according to the character $a$, we end up with
Chengsong
parents: 646
diff changeset
   989
a sequence of ever-growing derivatives like 
Chengsong
parents: 646
diff changeset
   990
Chengsong
parents: 646
diff changeset
   991
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
Chengsong
parents: 646
diff changeset
   992
\begin{center}
Chengsong
parents: 646
diff changeset
   993
\begin{tabular}{rll}
Chengsong
parents: 646
diff changeset
   994
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   995
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   996
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
Chengsong
parents: 646
diff changeset
   997
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   998
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
Chengsong
parents: 646
diff changeset
   999
\end{tabular}
Chengsong
parents: 646
diff changeset
  1000
\end{center}
Chengsong
parents: 646
diff changeset
  1001
 
Chengsong
parents: 646
diff changeset
  1002
\noindent where after around 35 steps we usually run out of memory on a
Chengsong
parents: 646
diff changeset
  1003
typical computer.  Clearly, the
Chengsong
parents: 646
diff changeset
  1004
notation involving $\ZERO$s and $\ONE$s already suggests
Chengsong
parents: 646
diff changeset
  1005
simplification rules that can be applied to regular regular
Chengsong
parents: 646
diff changeset
  1006
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
Chengsong
parents: 646
diff changeset
  1007
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
Chengsong
parents: 646
diff changeset
  1008
r$. While such simple-minded simplifications have been proved in 
Chengsong
parents: 646
diff changeset
  1009
the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
Chengsong
parents: 646
diff changeset
  1010
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
Chengsong
parents: 646
diff changeset
  1011
\emph{not} help with limiting the growth of the derivatives shown
Chengsong
parents: 646
diff changeset
  1012
above: the growth is slowed, but the derivatives can still grow rather
Chengsong
parents: 646
diff changeset
  1013
quickly beyond any finite bound.
Chengsong
parents: 646
diff changeset
  1014
Chengsong
parents: 646
diff changeset
  1015
Therefore we want to look in this thesis at a second
Chengsong
parents: 646
diff changeset
  1016
algorithm by Sulzmann and Lu where they
Chengsong
parents: 646
diff changeset
  1017
overcame this ``growth problem'' \cite{Sulzmann2014}.
Chengsong
parents: 646
diff changeset
  1018
In this version, POSIX values are 
Chengsong
parents: 646
diff changeset
  1019
represented as bit sequences and such sequences are incrementally generated
Chengsong
parents: 646
diff changeset
  1020
when derivatives are calculated. The compact representation
Chengsong
parents: 646
diff changeset
  1021
of bit sequences and regular expressions allows them to define a more
Chengsong
parents: 646
diff changeset
  1022
``aggressive'' simplification method that keeps the size of the
Chengsong
parents: 646
diff changeset
  1023
derivatives finite no matter what the length of the string is.
Chengsong
parents: 646
diff changeset
  1024
They make some informal claims about the correctness and linear behaviour
Chengsong
parents: 646
diff changeset
  1025
of this version, but do not provide any supporting proof arguments, not
Chengsong
parents: 646
diff changeset
  1026
even ``pencil-and-paper'' arguments. They write about their bit-coded
Chengsong
parents: 646
diff changeset
  1027
\emph{incremental parsing method} (that is the algorithm to be formalised
Chengsong
parents: 646
diff changeset
  1028
in this dissertation)
Chengsong
parents: 646
diff changeset
  1029
Chengsong
parents: 646
diff changeset
  1030
Chengsong
parents: 646
diff changeset
  1031
  
Chengsong
parents: 646
diff changeset
  1032
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1033
  ``Correctness Claim: We further claim that the incremental parsing
Chengsong
parents: 646
diff changeset
  1034
  method [..] in combination with the simplification steps [..]
Chengsong
parents: 646
diff changeset
  1035
  yields POSIX parse trees. We have tested this claim
Chengsong
parents: 646
diff changeset
  1036
  extensively [..] but yet
Chengsong
parents: 646
diff changeset
  1037
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
Chengsong
parents: 646
diff changeset
  1038
\end{quote}  
Chengsong
parents: 646
diff changeset
  1039
Ausaf and Urban made some initial progress towards the 
Chengsong
parents: 646
diff changeset
  1040
full correctness proof but still had to leave out the optimisation
Chengsong
parents: 646
diff changeset
  1041
Sulzmann and Lu proposed.
Chengsong
parents: 646
diff changeset
  1042
Ausaf  wrote \cite{Ausaf},
Chengsong
parents: 646
diff changeset
  1043
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
  1044
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
Chengsong
parents: 646
diff changeset
  1045
\end{quote}  
Chengsong
parents: 646
diff changeset
  1046
This thesis implements the aggressive simplifications envisioned
Chengsong
parents: 646
diff changeset
  1047
by Ausaf and Urban,
Chengsong
parents: 646
diff changeset
  1048
together with a formal proof of the correctness of those simplifications.
Chengsong
parents: 646
diff changeset
  1049
Chengsong
parents: 646
diff changeset
  1050
Chengsong
parents: 646
diff changeset
  1051
One of the most recent work in the context of lexing
Chengsong
parents: 646
diff changeset
  1052
%with this issue
Chengsong
parents: 646
diff changeset
  1053
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
Chengsong
parents: 646
diff changeset
  1054
This is relevant work for us and we will compare it later with 
Chengsong
parents: 646
diff changeset
  1055
our derivative-based matcher we are going to present.
Chengsong
parents: 646
diff changeset
  1056
There is also some newer work called
Chengsong
parents: 646
diff changeset
  1057
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
Chengsong
parents: 646
diff changeset
  1058
but deterministic finite automaton instead.
Chengsong
parents: 646
diff changeset
  1059
We will also study this work in a later section.
Chengsong
parents: 646
diff changeset
  1060
%An example that gives problem to automaton approaches would be
Chengsong
parents: 646
diff changeset
  1061
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
Chengsong
parents: 646
diff changeset
  1062
%It requires at least $2^{n+1}$ states to represent
Chengsong
parents: 646
diff changeset
  1063
%as a DFA.
Chengsong
parents: 646
diff changeset
  1064
Chengsong
parents: 646
diff changeset
  1065
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1066
\section{Basic Concepts}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1067
Formal language theory usually starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1068
denoting a set of characters.
637
Chengsong
parents: 628
diff changeset
  1069
Here we use the datatype of characters from Isabelle,
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1070
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
  1071
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1072
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1073
Then using the usual bracket notation for lists,
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1074
we can define strings made up of characters as: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1075
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1076
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1077
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1078
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1079
\end{center}
583
Chengsong
parents: 579
diff changeset
  1080
where $c$ is a variable ranging over characters.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1081
The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1082
list.
637
Chengsong
parents: 628
diff changeset
  1083
For brevity, a singleton list is sometimes written as $[c]$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1084
Strings can be concatenated to form longer strings in the same
637
Chengsong
parents: 628
diff changeset
  1085
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1086
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1087
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1088
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1089
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1090
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1091
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1092
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1093
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1094
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1095
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1096
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1097
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1098
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1099
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1100
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1101
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1102
\end{center}
564
Chengsong
parents: 543
diff changeset
  1103
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
  1104
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1105
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1106
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1107
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1108
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1109
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1110
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1111
\noindent
564
Chengsong
parents: 543
diff changeset
  1112
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1113
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1114
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1115
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1116
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1117
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1118
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
  1119
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1120
564
Chengsong
parents: 543
diff changeset
  1121
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1122
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1123
\end{center}
564
Chengsong
parents: 543
diff changeset
  1124
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1125
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1126
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1127
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1128
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1129
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1130
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1131
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1132
\noindent
583
Chengsong
parents: 579
diff changeset
  1133
This can be generalised to ``chopping off'' a string 
Chengsong
parents: 579
diff changeset
  1134
from all strings within a set $A$, 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1135
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1136
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1137
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1138
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1139
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1140
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1141
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1142
which is essentially the left quotient $A \backslash L$ of $A$ against 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1143
the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1144
However, for our purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1145
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1146
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1147
The reason for defining derivatives
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1148
is that they provide another approach
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1149
to test membership of a string in 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1150
a set of strings. 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1151
For example, to test whether the string
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1152
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1153
respect to the string $bar$:
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1154
\begin{center}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1155
\begin{tabular}{lll}
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1156
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1157
	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1158
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1159
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1160
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1161
\end{tabular}	
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1162
\end{center}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1163
\noindent
637
Chengsong
parents: 628
diff changeset
  1164
and in the end, test whether the set
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1165
contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1166
	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1167
	clear from the context that we are operating
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1168
on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1169
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1170
In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1171
then we can test whether $s$ is in $S$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1172
by testing whether $[] \in S \backslash s$.
564
Chengsong
parents: 543
diff changeset
  1173
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1174
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1175
sub-languages.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1176
For example, for the sequence operator, we have
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1177
something similar to a ``chain rule'':
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1178
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1179
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1180
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1181
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1182
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1183
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1184
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1185
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1186
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1187
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1188
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1189
and get to $B$.
583
Chengsong
parents: 579
diff changeset
  1190
The language derivative for $A*$ can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1191
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1192
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1193
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1194
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1195
\begin{proof}
583
Chengsong
parents: 579
diff changeset
  1196
There are two inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1197
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1198
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1199
The set 
637
Chengsong
parents: 628
diff changeset
  1200
\[ S_1 = \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1201
is enclosed in the set
637
Chengsong
parents: 628
diff changeset
  1202
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
Chengsong
parents: 628
diff changeset
  1203
This is because for any string $c::s$ satisfying $c::s \in A*$,
Chengsong
parents: 628
diff changeset
  1204
%whenever you have a string starting with a character 
Chengsong
parents: 628
diff changeset
  1205
%in the language of a Kleene star $A*$, 
Chengsong
parents: 628
diff changeset
  1206
%then that
Chengsong
parents: 628
diff changeset
  1207
the character $c$, together with a prefix of $s$
Chengsong
parents: 628
diff changeset
  1208
%immediately after $c$ 
Chengsong
parents: 628
diff changeset
  1209
forms the first iteration of $A*$, 
Chengsong
parents: 628
diff changeset
  1210
and the rest of the $s$ is also $A*$.
Chengsong
parents: 628
diff changeset
  1211
This coincides with the definition of $S_2$.
564
Chengsong
parents: 543
diff changeset
  1212
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1213
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1214
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
583
Chengsong
parents: 579
diff changeset
  1215
holds.
Chengsong
parents: 579
diff changeset
  1216
Also the following holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1217
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
  1218
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
  1219
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
  1220
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1221
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1222
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1223
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1224
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1225
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1226
for regular expressions.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1227
To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1228
which we shall do next.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1229
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1230
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
  1231
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1232
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1233
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1234
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1235
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1236
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1237
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1238
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1239
\noindent
564
Chengsong
parents: 543
diff changeset
  1240
We call them basic because we will introduce
637
Chengsong
parents: 628
diff changeset
  1241
additional constructors in later chapters, such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1242
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
  1243
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
  1244
matches no string, and $\ONE$ for the regular
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1245
expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1246
Some authors
564
Chengsong
parents: 543
diff changeset
  1247
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1248
but we prefer this notation.} 
564
Chengsong
parents: 543
diff changeset
  1249
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
  1250
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
  1251
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
  1252
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
  1253
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
  1254
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
  1255
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1256
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1257
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1258
\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1259
$L \; \ZERO$ & $\dn$ & $\phi$\\
Chengsong
parents: 543
diff changeset
  1260
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
  1261
$L \; c$ & $\dn$ & $\{[c]\}$\\
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1262
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1263
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1264
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1265
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1266
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
  1267
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1268
%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1269
%their language interpretations in place, we are ready to define derivatives on regular expressions.
637
Chengsong
parents: 628
diff changeset
  1270
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1271
We do so by first introducing what properties they should satisfy.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1272
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1273
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
  1274
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
  1275
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
  1276
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
  1277
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
  1278
%will look as if it was obtained by doing a language derivative on $L(r)$:
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1279
%Recall that the language derivative acts on a 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1280
%language (set of strings).
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1281
%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1282
%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1283
%that string and then checking whether the empty 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1284
%string is in the derivative:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1285
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1286
%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1287
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1288
%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1289
%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1290
%  \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1291
%  \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1292
%  \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1293
%    \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1294
%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1295
%    \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1296
%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1297
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1298
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1299
%\notate{$\{ \ldots ,\;$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1300
%	\notate{s}{1}{$(c_1 :: s_1)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1301
%	$, \; \ldots \}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1302
%}{1}{$S_{start}$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1303
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1304
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1305
%	$\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1306
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1307
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1308
%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1309
%	$,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1310
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1311
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1312
%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1313
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1314
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1315
%	$\{ \ldots,\;  s_2
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1316
%	,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1317
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1318
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1319
%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1320
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1321
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1322
%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1323
%	S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1324
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1325
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1326
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1327
%	$s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1328
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1329
%\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1330
Brzozowski noticed that $\Der$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1331
can be ``mirrored'' on regular expressions which
564
Chengsong
parents: 543
diff changeset
  1332
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
  1333
with respect to a character $c$, written
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1334
$r \backslash c$. This infix operator
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1335
takes regular expression $r$ as input
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1336
and a character as a right operand.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1337
The derivative operation on regular expression
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1338
is defined such that the language of the derivative result 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
  1339
coincides with the language of the original 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1340
regular expression being taken 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1341
derivative with respect to the same characters, namely
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1342
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1343
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1344
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1345
	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1346
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1347
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1348
\noindent
637
Chengsong
parents: 628
diff changeset
  1349
Next, we give the recursive definition of derivative on
Chengsong
parents: 628
diff changeset
  1350
regular expressions so that it satisfies the properties above.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1351
%The derivative function, written $r\backslash c$, 
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1352
%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1353
%returns a new regular expression representing
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1354
%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1355
%being taken the language derivative with respect to $c$.
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1356
\begin{table}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1357
	\begin{center}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1358
\begin{tabular}{lcl}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1359
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1360
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1361
		$d \backslash c$     & $\dn$ & 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1362
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1363
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1364
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1365
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1366
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1367
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1368
\end{tabular}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1369
	\end{center}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1370
\caption{Derivative on Regular Expressions}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1371
\label{table:der}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1372
\end{table}
564
Chengsong
parents: 543
diff changeset
  1373
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1374
The most involved cases are the sequence case
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1375
and the star case.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1376
The sequence case says that if the first regular expression
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1377
contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1378
needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1379
result of this derivative:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1380
\begin{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1381
	\begin{tabular}{lcl}
583
Chengsong
parents: 579
diff changeset
  1382
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
Chengsong
parents: 579
diff changeset
  1383
		$\textit{if}\;\,([] \in L(r_1))\; 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1384
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1385
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1386
	\end{tabular}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1387
\end{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1388
\noindent
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1389
Notice how this closely resembles
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1390
the language derivative operation $\Der$:
564
Chengsong
parents: 543
diff changeset
  1391
\begin{center}
Chengsong
parents: 543
diff changeset
  1392
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1393
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
  1394
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
  1395
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
  1396
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
  1397
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
  1398
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1399
\end{center}
Chengsong
parents: 543
diff changeset
  1400
\noindent
583
Chengsong
parents: 579
diff changeset
  1401
The derivative of the star regular expression $r^*$ 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1402
unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1403
and attaches the original $r^*$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1404
after $r\backslash c$, so that 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1405
we can further unfold it as many times as needed:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1406
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1407
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1408
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1409
Again,
637
Chengsong
parents: 628
diff changeset
  1410
the structure is the same as the language derivative of the Kleene star: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1411
\[
564
Chengsong
parents: 543
diff changeset
  1412
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1413
\]
564
Chengsong
parents: 543
diff changeset
  1414
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
  1415
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
  1416
boolean condition 
Chengsong
parents: 543
diff changeset
  1417
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
  1418
somehow recursively computed.
Chengsong
parents: 543
diff changeset
  1419
We call such a function that checks
Chengsong
parents: 543
diff changeset
  1420
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
  1421
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
  1422
\begin{center}
Chengsong
parents: 543
diff changeset
  1423
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1424
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
  1425
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1426
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
  1427
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1428
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
  1429
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
  1430
		\end{tabular}
Chengsong
parents: 543
diff changeset
  1431
\end{center}
Chengsong
parents: 543
diff changeset
  1432
\noindent
Chengsong
parents: 543
diff changeset
  1433
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
  1434
does not contain any string and
Chengsong
parents: 543
diff changeset
  1435
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
  1436
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
  1437
by definition. 
Chengsong
parents: 543
diff changeset
  1438
The character regular expression $c$
Chengsong
parents: 543
diff changeset
  1439
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
  1440
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
  1441
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
  1442
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
  1443
The sequence regular expression
Chengsong
parents: 543
diff changeset
  1444
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
  1445
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
  1446
is always nullable because it naturally
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1447
contains the empty string.  
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1448
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1449
We have the following two correspondences between 
564
Chengsong
parents: 543
diff changeset
  1450
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
  1451
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
  1452
\begin{lemma}\label{derDer}
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1453
	\mbox{}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1454
	\begin{itemize}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1455
		\item
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1456
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1457
\item
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1458
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1459
	\end{itemize}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1460
\end{lemma}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1461
\begin{proof}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1462
	By induction on $r$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1463
\end{proof}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1464
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1465
which are the main properties of derivatives
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1466
that enables us later to reason about the correctness of
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1467
derivative-based matching.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1468
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1469
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1470
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1471
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1472
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1473
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583
Chengsong
parents: 579
diff changeset
  1474
$r \backslash_s [\,] $ & $\dn$ & $r$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1475
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1476
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1477
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1478
\noindent
564
Chengsong
parents: 543
diff changeset
  1479
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
  1480
omit the subscript and use $\backslash$ instead
583
Chengsong
parents: 579
diff changeset
  1481
of $\backslash_s$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1482
string derivatives for brevity.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1483
Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1484
regular-expression matching algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1485
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1486
\begin{definition}
564
Chengsong
parents: 543
diff changeset
  1487
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1488
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1489
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1490
\noindent
637
Chengsong
parents: 628
diff changeset
  1491
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
Chengsong
parents: 628
diff changeset
  1492
this algorithm, presented graphically, is as follows:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1493
601
Chengsong
parents: 591
diff changeset
  1494
\begin{equation}\label{matcher}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1495
\begin{tikzcd}
583
Chengsong
parents: 579
diff changeset
  1496
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
Chengsong
parents: 579
diff changeset
  1497
r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
Chengsong
parents: 579
diff changeset
  1498
\;\textrm{true}/\textrm{false}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1499
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1500
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1501
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
  1502
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1503
 It is relatively
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1504
easy to show that this matcher is correct, namely
539
Chengsong
parents: 538
diff changeset
  1505
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
  1506
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
  1507
\end{lemma}
Chengsong
parents: 538
diff changeset
  1508
\begin{proof}
637
Chengsong
parents: 628
diff changeset
  1509
	By induction on $s$ using the property of derivatives:
583
Chengsong
parents: 579
diff changeset
  1510
	lemma \ref{derDer}.
539
Chengsong
parents: 538
diff changeset
  1511
\end{proof}
601
Chengsong
parents: 591
diff changeset
  1512
\begin{figure}
564
Chengsong
parents: 543
diff changeset
  1513
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1514
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1515
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1516
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1517
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1518
    %ymode = log,
539
Chengsong
parents: 538
diff changeset
  1519
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
  1520
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1521
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1522
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
  1523
\end{axis}
Chengsong
parents: 538
diff changeset
  1524
\end{tikzpicture} 
583
Chengsong
parents: 579
diff changeset
  1525
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
Chengsong
parents: 579
diff changeset
  1526
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
Chengsong
parents: 579
diff changeset
  1527
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601
Chengsong
parents: 591
diff changeset
  1528
\end{center}
539
Chengsong
parents: 538
diff changeset
  1529
\end{figure}
Chengsong
parents: 538
diff changeset
  1530
\noindent
564
Chengsong
parents: 543
diff changeset
  1531
If we implement the above algorithm naively, however,
Chengsong
parents: 543
diff changeset
  1532
the algorithm can be excruciatingly slow, as shown in 
Chengsong
parents: 543
diff changeset
  1533
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
  1534
Note that both axes are in logarithmic scale.
637
Chengsong
parents: 628
diff changeset
  1535
Around two dozen characters
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1536
this algorithm already ``explodes'' with the regular expression 
564
Chengsong
parents: 543
diff changeset
  1537
$(a^*)^*b$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1538
To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1539
rules for the intermediate results,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1540
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539
Chengsong
parents: 538
diff changeset
  1541
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
  1542
language of the regular expression.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1543
One simple-minded simplification function
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1544
that achieves these requirements 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1545
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564
Chengsong
parents: 543
diff changeset
  1546
\begin{center}
Chengsong
parents: 543
diff changeset
  1547
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1548
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
  1549
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1550
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1551
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
  1552
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1553
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1554
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
  1555
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
  1556
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
  1557
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
  1558
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583
Chengsong
parents: 579
diff changeset
  1559
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564
Chengsong
parents: 543
diff changeset
  1560
				   
Chengsong
parents: 543
diff changeset
  1561
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1562
\end{center}
Chengsong
parents: 543
diff changeset
  1563
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
  1564
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
  1565
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
  1566
\begin{center}
Chengsong
parents: 543
diff changeset
  1567
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1568
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
  1569
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
  1570
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
  1571
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1572
\end{center}
Chengsong
parents: 543
diff changeset
  1573
\begin{figure}
539
Chengsong
parents: 538
diff changeset
  1574
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1575
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1576
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1577
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
  1578
    %ymode = log,
Chengsong
parents: 591
diff changeset
  1579
    %xmode = log,
564
Chengsong
parents: 543
diff changeset
  1580
    grid = both,
539
Chengsong
parents: 538
diff changeset
  1581
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
  1582
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1583
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1584
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
  1585
\end{axis}
564
Chengsong
parents: 543
diff changeset
  1586
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
  1587
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
  1588
against 
Chengsong
parents: 543
diff changeset
  1589
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
  1590
\end{figure}
Chengsong
parents: 543
diff changeset
  1591
\noindent
Chengsong
parents: 543
diff changeset
  1592
The running time of $\textit{ders}\_\textit{simp}$
583
Chengsong
parents: 579
diff changeset
  1593
on the same example of Figure \ref{NaiveMatcher}
Chengsong
parents: 579
diff changeset
  1594
is now ``tame''  in terms of the length of inputs,
Chengsong
parents: 579
diff changeset
  1595
as shown in Figure \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
  1596
637
Chengsong
parents: 628
diff changeset
  1597
So far, the story is use Brzozowski derivatives and
Chengsong
parents: 628
diff changeset
  1598
simplify as much as possible, and at the end test
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1599
whether the empty string is recognised 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1600
by the final derivative.
583
Chengsong
parents: 579
diff changeset
  1601
But what if we want to 
Chengsong
parents: 579
diff changeset
  1602
do lexing instead of just getting a true/false answer?
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1603
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1604
elegant (arguably as beautiful as the definition of the 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1605
Brzozowski derivative) solution for this.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1606
539
Chengsong
parents: 538
diff changeset
  1607
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
  1608
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
  1609
algorithm.
Chengsong
parents: 543
diff changeset
  1610
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
  1611
respect to the input string,
Chengsong
parents: 543
diff changeset
  1612
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
  1613
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
  1614
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
  1615
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
  1616
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
  1617
called \emph{value} or
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1618
sometimes also \emph{lexical value}.  
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1619
Values and regular
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1620
expressions correspond to each other 
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
  1621
as illustrated in the following
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1622
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1623
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1624
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1625
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1626
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1627
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1628
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1629
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1630
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1631
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1632
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1633
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1634
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1635
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1636
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1637
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1638
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1639
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1640
			&        & $\Empty$   \\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1641
			& $\mid$ & $\Char \; c$          \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1642
			& $\mid$ & $\Seq\,v_1\, v_2$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1643
			& $\mid$ & $\Left \;v$   \\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1644
			& $\mid$ & $\Right\;v$  \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1645
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1646
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1647
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1648
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1649
\noindent
564
Chengsong
parents: 543
diff changeset
  1650
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
  1651
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
  1652
\begin{center}
Chengsong
parents: 543
diff changeset
  1653
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
  1654
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
  1655
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1656
		$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1657
		$|\Left \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1658
		$|\Right \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1659
		$|\Stars \; []|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1660
		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
564
Chengsong
parents: 543
diff changeset
  1661
	\end{tabular}
Chengsong
parents: 543
diff changeset
  1662
\end{center}
Chengsong
parents: 543
diff changeset
  1663
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
  1664
to indicate that a value $v$ could be generated from a lexing algorithm
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1665
with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1666
defined by the rules.
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1667
\begin{figure}[H]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1668
\begin{mathpar}
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1669
	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564
Chengsong
parents: 543
diff changeset
  1670
Chengsong
parents: 543
diff changeset
  1671
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
  1672
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1673
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564
Chengsong
parents: 543
diff changeset
  1674
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1675
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564
Chengsong
parents: 543
diff changeset
  1676
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1677
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1678
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
  1679
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars \; vs : r^*}
564
Chengsong
parents: 543
diff changeset
  1680
\end{mathpar}
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
  1681
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
  1682
\end{figure}
564
Chengsong
parents: 543
diff changeset
  1683
\noindent
Chengsong
parents: 543
diff changeset
  1684
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
  1685
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
  1686
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
  1687
satisfying $|v| = s$ and $\vdash v:r$ is finite.
601
Chengsong
parents: 591
diff changeset
  1688
This additional condition was
Chengsong
parents: 591
diff changeset
  1689
imposed by Ausaf and Urban to make their proofs easier.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1690
Given a string and a regular expression, there can be
564
Chengsong
parents: 543
diff changeset
  1691
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
  1692
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
  1693
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
  1694
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
  1695
Lexers therefore have to disambiguate and choose only
637
Chengsong
parents: 628
diff changeset
  1696
one of the values to be generated. $\POSIX$ is one of the
564
Chengsong
parents: 543
diff changeset
  1697
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
  1698
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1699
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
564
Chengsong
parents: 543
diff changeset
  1700
formalised the property 
Chengsong
parents: 543
diff changeset
  1701
as a ternary relation.
Chengsong
parents: 543
diff changeset
  1702
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1703
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1704
in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1705
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1706
\begin{figure}[p]
564
Chengsong
parents: 543
diff changeset
  1707
\begin{mathpar}
Chengsong
parents: 543
diff changeset
  1708
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
  1709
		
Chengsong
parents: 543
diff changeset
  1710
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
  1711
Chengsong
parents: 543
diff changeset
  1712
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
  1713
Chengsong
parents: 543
diff changeset
  1714
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
  1715
Chengsong
parents: 543
diff changeset
  1716
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
  1717
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
  1718
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
  1719
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
  1720
Chengsong
parents: 543
diff changeset
  1721
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
  1722
Chengsong
parents: 543
diff changeset
  1723
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
  1724
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
  1725
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
  1726
	(v::vs)}
Chengsong
parents: 543
diff changeset
  1727
\end{mathpar}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1728
\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1729
	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1730
This ternary relation, written $(s, r) \rightarrow v$, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1731
formalises the POSIX constraints on the
601
Chengsong
parents: 591
diff changeset
  1732
value $v$ given a string $s$ and 
Chengsong
parents: 591
diff changeset
  1733
regular expression $r$.
Chengsong
parents: 591
diff changeset
  1734
}
646
Chengsong
parents: 639
diff changeset
  1735
\label{fig:POSIXDef}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1736
\end{figure}\afterpage{\clearpage}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1737
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1738
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1739
%\begin{figure}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1740
%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1741
%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1742
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1743
%	    (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1744
%	    {$r_{token1}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1745
%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1746
%	    %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1747
%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1748
%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1749
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1750
%	    (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1751
%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1752
%	    \nodepart{two}  $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1753
%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1754
%		\node [above = 1.5cm of middle, minimum width = 6cm, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1755
%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1756
%			(topNode) {$s$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1757
%	    \path[->,draw]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1758
%	        (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1759
%	        (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1760
%		;
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1761
%			
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1762
%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1763
%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1764
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1765
%\end{figure}
637
Chengsong
parents: 628
diff changeset
  1766
The above $\POSIX$ rules follow the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1767
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
  1768
	\item (Left Priority)\\
637
Chengsong
parents: 628
diff changeset
  1769
		Match the leftmost regular expression when multiple options for matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1770
		are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1771
		be in the language of $L \; r_1$.
564
Chengsong
parents: 543
diff changeset
  1772
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
  1773
		Always match a subpart as much as possible before proceeding
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1774
		to the next part of the string.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1775
		For example, when the string $s$ matches 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1776
		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1777
		Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1778
		$r_{part1}$ is preferred by this maximum munch rule.
637
Chengsong
parents: 628
diff changeset
  1779
		The side-condition 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1780
		\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1781
		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1782
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1783
		\end{center}
637
Chengsong
parents: 628
diff changeset
  1784
		in PS causes this.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1785
		%(See
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1786
		%\ref{munch} for an illustration).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1787
\end{itemize}
564
Chengsong
parents: 543
diff changeset
  1788
\noindent
Chengsong
parents: 543
diff changeset
  1789
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
  1790
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1791
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
  1792
\[ 
Chengsong
parents: 543
diff changeset
  1793
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
  1794
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1795
using a regular expression 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1796
for keywords and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1797
identifiers:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1798
%(for example, keyword is a nonempty string starting with letters 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1799
%followed by alphanumeric characters or underscores):
564
Chengsong
parents: 543
diff changeset
  1800
\[
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1801
	r_{keyword} + r_{identifier}.
564
Chengsong
parents: 543
diff changeset
  1802
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1803
If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1804
as an identifier
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1805
where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1806
followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1807
then a match with a keyword (if)
564
Chengsong
parents: 543
diff changeset
  1808
followed by
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1809
an identifier (foo) would be incorrect.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1810
POSIX lexing generates what is included by lexing.
564
Chengsong
parents: 543
diff changeset
  1811
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1812
\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1813
We know that a POSIX 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1814
value for regular expression $r$ is inhabited by $r$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1815
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1816
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1817
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1818
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1819
The main property about a $\POSIX$ value is that 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1820
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1821
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1822
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1823
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1824
\end{lemma}
539
Chengsong
parents: 538
diff changeset
  1825
\begin{proof}
564
Chengsong
parents: 543
diff changeset
  1826
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
  1827
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
  1828
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
  1829
the sequence and star with non-empty iterations.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1830
We shall give the details for proving the sequence case here.
539
Chengsong
parents: 538
diff changeset
  1831
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1832
When we have 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1833
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1834
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
601
Chengsong
parents: 591
diff changeset
  1835
	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1836
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1837
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1838
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1839
we know that the last condition 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1840
excludes the possibility of a 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1841
string $s_1'$ longer than $s_1$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1842
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1843
(s_1', r_1) \rightarrow v_1'   \;\; 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1844
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1845
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1846
hold.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1847
A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1848
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1849
(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1850
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1851
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1852
cannot possibly form a $\POSIX$ value either, because
637
Chengsong
parents: 628
diff changeset
  1853
by definition, there is a candidate
Chengsong
parents: 628
diff changeset
  1854
with a longer initial string
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1855
$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1856
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1857
$s$ must have the 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1858
property that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1859
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1860
	|a| = s_1 \;\; and \;\; |b| = s_2.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1861
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1862
The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1863
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1864
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1865
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1866
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539
Chengsong
parents: 538
diff changeset
  1867
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
  1868
\end{proof}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1869
\noindent
637
Chengsong
parents: 628
diff changeset
  1870
We have now defined what a POSIX value is and shown that it is unique.
Chengsong
parents: 628
diff changeset
  1871
The problem is to generate
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1872
such a value in a lexing algorithm using derivatives.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1873
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1874
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1875
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1876
Sulzmann and Lu extended Brzozowski's 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1877
derivative-based matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1878
to a lexing algorithm by a second phase 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1879
after the initial phase of successive derivatives.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1880
This second phase generates a POSIX value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1881
if the regular expression matches the string.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1882
The algorithm uses two functions called $\inj$ and $\mkeps$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1883
The function $\mkeps$ constructs a POSIX value from the last
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1884
derivative $r_n$:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1885
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1886
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1887
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1888
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1889
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1890
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1891
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1892
\end{ceqn}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1893
\noindent
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1894
In the above diagram, again we assume that
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1895
the input string $s$ is made of $n$ characters
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1896
$c_0c_1 \ldots c_{n-1}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1897
The last derivative operation 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1898
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1899
$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1900
tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1901
regular expression $r_n$, in a POSIX way.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1902
The definition of $\mkeps$ is
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1903
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1904
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1905
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1906
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1907
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1908
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1909
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1910
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564
Chengsong
parents: 543
diff changeset
  1911
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1912
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1913
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1914
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1915
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1916
\noindent 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1917
The function prefers the left child $r_1$ of $r_1 + r_2$ 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1918
to match an empty string if there is a choice.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1919
When there is a star to match the empty string,
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1920
we give the $\Stars$ constructor an empty list, meaning
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1921
no iteration is taken.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1922
The result of $\mkeps$ on a $\nullable$ $r$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1923
is a POSIX value for $r$ and the empty string:
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1924
\begin{lemma}\label{mePosix}
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1925
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1926
\end{lemma}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1927
\begin{proof}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1928
	By induction on $r$.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1929
\end{proof}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1930
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1931
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1932
in reverse order as they were chopped off in the derivative phase.
637
Chengsong
parents: 628
diff changeset
  1933
The function for this is called $\inj$. This function 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1934
operates on values, unlike $\backslash$ which operates on regular expressions.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1935
In the diagram below, $v_i$ stands for the (POSIX) value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1936
for how the regular expression 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1937
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1938
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1939
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1940
matches $s$. 
601
Chengsong
parents: 591
diff changeset
  1941
\begin{figure}[H]
Chengsong
parents: 591
diff changeset
  1942
\begin{center}	
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1943
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1944
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1945
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1946
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1947
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1948
\end{ceqn}
601
Chengsong
parents: 591
diff changeset
  1949
\end{center}
Chengsong
parents: 591
diff changeset
  1950
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
Chengsong
parents: 591
diff changeset
  1951
	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
Chengsong
parents: 591
diff changeset
  1952
	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
Chengsong
parents: 591
diff changeset
  1953
	$c_1$, and so on. These are the same operations as they have appeared in the matcher
Chengsong
parents: 591
diff changeset
  1954
	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637
Chengsong
parents: 628
diff changeset
  1955
	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
Chengsong
parents: 628
diff changeset
  1956
	the empty string, by always selecting the leftmost 
Chengsong
parents: 628
diff changeset
  1957
	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
601
Chengsong
parents: 591
diff changeset
  1958
	appeared in the string, always preserving POSIXness.}\label{graph:inj}
Chengsong
parents: 591
diff changeset
  1959
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1960
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1961
The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1962
takes three arguments: a regular
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1963
expression ${r_{i}}$, before the character is chopped off, 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1964
a character ${c_{i}}$ (the character we want to inject back) and 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1965
the third argument $v_{i+1}$ the value we want to inject into. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1966
The result of an application 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1967
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1968
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1969
	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1970
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1971
holds.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1972
The definition of $\textit{inj}$ is as follows: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1973
\begin{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1974
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1975
  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1976
  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1977
  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1978
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1979
  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1980
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1981
  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1982
  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1983
  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1984
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1985
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1986
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1987
\noindent 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1988
The function recurses on 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1989
the shape of regular
637
Chengsong
parents: 628
diff changeset
  1990
expressions and values.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1991
Intuitively, each clause analyses 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1992
how $r_i$ could have transformed when being 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1993
derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1994
of $v_{i+1}$ has the ``hole'' 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1995
to inject the character back into.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1996
Once the character is
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1997
injected back to that sub-value; 
637
Chengsong
parents: 628
diff changeset
  1998
$\inj$ assembles all parts
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1999
to form a new value.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2000
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2001
For instance, the last clause is an
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2002
injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2003
whose second child
637
Chengsong
parents: 628
diff changeset
  2004
value is a star and the shape of the 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2005
regular expression $r_i$ before injection 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2006
is a star.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2007
We therefore know 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2008
the derivative 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2009
starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2010
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2011
	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2012
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2013
during which an iteration of the star
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2014
had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2015
value inhabitation relation:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2016
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2017
	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2018
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2019
The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2020
matched star iterations,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2021
and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2022
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2023
	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2024
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2025
Finally, 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2026
$\inj \; r \;c \; v$ is prepended
637
Chengsong
parents: 628
diff changeset
  2027
to the previous list of iterations and then
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2028
wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2029
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2030
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2031
Recall that lemma 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2032
\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2033
$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2034
The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2035
the value before injection is POSIX, namely
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2036
\begin{lemma}\label{injPosix}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2037
	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2038
	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2039
\end{lemma}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2040
\begin{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2041
	By induction on $r$.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2042
	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2043
	When $r = a \cdot b$, there can be
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2044
	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2045
	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2046
	case.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2047
	\begin{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2048
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2049
			$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2050
			The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2051
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2052
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2053
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2054
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2055
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2056
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2057
			We know that there exists a unique pair of
637
Chengsong
parents: 628
diff changeset
  2058
			$s_a$ and $s_b$ satisfying	
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2059
				$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2060
				$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2061
				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2062
				L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2063
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2064
			The last condition gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2065
			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2066
				L \; a \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2067
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2068
			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2069
			and this gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2070
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2071
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2072
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2073
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2074
			$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2075
			The argument is almost identical to the above case,	
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2076
			except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2077
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2078
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2079
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2080
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2081
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2082
			\end{center}
637
Chengsong
parents: 628
diff changeset
  2083
			With similar reasoning, 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2084
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2085
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2086
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2087
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2088
			again holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2089
		\item 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2090
			$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2091
			Again the injection result would be 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2092
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2093
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2094
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2095
					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2096
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2097
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2098
			We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2099
			allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2100
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2101
				(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2102
			\]
637
Chengsong
parents: 628
diff changeset
  2103
			Also, by inductive hypothesis
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2104
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2105
				(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2106
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2107
			holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2108
			In addition, as
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2109
			$\Right \;v_b$  instead of $\Left \ldots$ is 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2110
			the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2111
			that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2112
			This tells us that 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2113
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2114
				\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2115
				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2116
				\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2117
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2118
			which translates to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2119
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2120
				\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2121
				s_3 @s_4 = c::s  \land s_3 \in L \; a 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2122
				\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2123
			\]
637
Chengsong
parents: 628
diff changeset
  2124
			(Which says there cannot be a longer 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2125
			initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2126
			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2127
			as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2128
	\end{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2129
	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2130
\end{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2131
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2132
Putting all together, Sulzmann and Lu obtained the following algorithm
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2133
outlined in the diagram \ref{graph:inj}:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2134
\begin{center}
539
Chengsong
parents: 538
diff changeset
  2135
\begin{tabular}{lcl}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2136
	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2137
	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2138
	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2139
	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2140
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2141
\end{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2142
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2143
The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2144
according to
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2145
POSIX rules. 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2146
\begin{theorem}\label{lexerCorrectness}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2147
	The $\lexer$ based on derivatives and injections is correct: 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2148
	\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2149
		\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2150
			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2151
			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2152
		\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2153
	\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2154
\end{theorem} 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2155
\begin{proof}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2156
By induction on $s$. $r$ generalising over an arbitrary regular expression.
637
Chengsong
parents: 628
diff changeset
  2157
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2158
by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2159
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2160
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2161
As we did earlier in this chapter with the matcher, one can 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2162
introduce simplification on the regular expression in each derivative step.
637
Chengsong
parents: 628
diff changeset
  2163
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
Chengsong
parents: 628
diff changeset
  2164
and ensure that
Chengsong
parents: 628
diff changeset
  2165
the values align with the regular expression at each step.
539
Chengsong
parents: 538
diff changeset
  2166
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2167
be careful not to break the correctness, as the injection 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2168
function heavily relies on the structure of 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2169
the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2170
This can be achieved by recording some extra rectification functions
637
Chengsong
parents: 628
diff changeset
  2171
during the derivatives step and applying these rectifications in 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2172
each run during the injection phase.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2173
With extra care
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2174
one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2175
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2176
\begin{center}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2177
	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2178
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2179
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2180
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2181
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2182
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2183
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2184
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2185
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2186
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2187
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2188
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2189
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2190
				   
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2191
	\end{tabular}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2192
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2193
637
Chengsong
parents: 628
diff changeset
  2194
However, one can still end up
Chengsong
parents: 628
diff changeset
  2195
with exploding derivatives,
Chengsong
parents: 628
diff changeset
  2196
even with the simple-minded simplification rules allowed
Chengsong
parents: 628
diff changeset
  2197
in an injection-based lexer.
Chengsong
parents: 628
diff changeset
  2198
\section{A Case Requiring More Aggressive Simplifications}
539
Chengsong
parents: 538
diff changeset
  2199
For example, when starting with the regular
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2200
expression $(a^* \cdot a^*)^*$ and building just over
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2201
a dozen successive derivatives 
539
Chengsong
parents: 538
diff changeset
  2202
w.r.t.~the character $a$, one obtains a derivative regular expression
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2203
with millions of nodes (when viewed as a tree)
637
Chengsong
parents: 628
diff changeset
  2204
even with the mentioned simplifications.
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2205
\begin{figure}[H]
601
Chengsong
parents: 591
diff changeset
  2206
\begin{center}
539
Chengsong
parents: 538
diff changeset
  2207
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  2208
\begin{axis}[
Chengsong
parents: 538
diff changeset
  2209
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  2210
    ylabel={size},
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2211
    legend entries={Simple-Minded Simp, Naive Matcher},  
539
Chengsong
parents: 538
diff changeset
  2212
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  2213
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  2214
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  2215
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539
Chengsong
parents: 538
diff changeset
  2216
\end{axis}
Chengsong
parents: 538
diff changeset
  2217
\end{tikzpicture} 
601
Chengsong
parents: 591
diff changeset
  2218
\end{center}
539
Chengsong
parents: 538
diff changeset
  2219
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
  2220
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
  2221
   
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2222
That is because Sulzmann and Lu's 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2223
injection-based lexing algorithm keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  2224
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
  2225
These different ways of matching will grow exponentially with the string length.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2226
Consider the case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2227
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2228
	r= (a^*\cdot a^*)^* \quad and \quad
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2229
	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2230
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2231
as an example.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2232
This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2233
many ways to split up the string into multiple segments for
637
Chengsong
parents: 628
diff changeset
  2234
different star iterations,
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2235
and for each segment 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2236
multiple ways of splitting between 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2237
the two $a^*$ sub-expressions.
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2238
When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2239
the match:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2240
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2241
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2242
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2243
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2244
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2245
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2246
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2247
The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2248
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2249
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2250
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2251
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2252
correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2253
When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2254
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2255
	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2256
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2257
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2258
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2259
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2260
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2261
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2262
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2263
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2264
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2265
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2266
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2267
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2268
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2269
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2270
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2271
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2272
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2273
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2274
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2275
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2276
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2277
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2278
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2279
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2280
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2281
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2282
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2283
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2284
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2285
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2286
		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2287
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2288
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2289
And $\derssimp \; aa \; (a^*a^*)^*$ is
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2290
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2291
	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2292
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2293
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2294
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2295
seven distinct lexical values.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2296
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2297
It is not surprising that there are exponentially many 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2298
distinct lexical values that cannot be eliminated by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2299
the simple-minded simplification of $\derssimp$. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2300
A lexer without a good enough strategy to 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  2301
deduplicate will naturally
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2302
have an exponential runtime on highly
637
Chengsong
parents: 628
diff changeset
  2303
ambiguous regular expressions because there
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2304
are exponentially many matches.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2305
For this particular example, it seems
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2306
that the number of distinct matches growth
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2307
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2308
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2309
On the other hand, the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2310
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2311
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2312
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2313
	\Stars\,
637
Chengsong
parents: 628
diff changeset
  2314
	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2315
\]
637
Chengsong
parents: 628
diff changeset
  2316
At any moment, the  subterms in a regular expression
Chengsong
parents: 628
diff changeset
  2317
that will potentially result in a POSIX value is only
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2318
a minority among the many other terms,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2319
and one can remove the ones that are not possible to 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2320
be POSIX.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2321
In the above example,
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2322
\begin{equation}\label{eqn:growth2}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2323
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2324
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  2325
\end{equation}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2326
can be further simplified by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2327
removing the underlined term first,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2328
which would open up possibilities
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2329
of further simplification that removes the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2330
underbraced part.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2331
The result would be 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2332
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2333
	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2334
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2335
with corresponding values
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2336
\begin{center}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2337
	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2338
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2339
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2340
	\end{tabular}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2341
\end{center}
637
Chengsong
parents: 628
diff changeset
  2342
Other terms with an underlying value, such as
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2343
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2344
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2345
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2346
do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2347
and therefore can be thrown away.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2348
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2349
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
639
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2350
have come up with some simplification steps, 
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2351
and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2352
They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2353
\[
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2354
	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2355
\]
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2356
The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2357
involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2358
which we omit here.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  2359
however those steps
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2360
are not yet sufficiently strong, to achieve the above effects.
637
Chengsong
parents: 628
diff changeset
  2361
And even with these relatively mild simplifications, the proof
Chengsong
parents: 628
diff changeset
  2362
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2363
One would need to prove something like this: 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2364
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2365
	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2366
	\textit{then}\;\; (r, c::s) \rightarrow 
637
Chengsong
parents: 628
diff changeset
  2367
	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2368
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2369
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2370
not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2371
but also what specific simplifications 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  2372
have been done as a function on values
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2373
showing how one can transform the value
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2374
underlying the simplified regular expression
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2375
to the unsimplified one.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2376
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2377
We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  2378
also described by Sulzmann and Lu to
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2379
get better simplifications, which uses
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2380
some augmented data structures compared to 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2381
plain regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2382
We call them \emph{annotated}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2383
regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2384
With annotated regular expressions,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2385
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2386
second phase altogether.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2387
We introduce this new datatype and the 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  2388
corresponding algorithm in the next chapter.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2389
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2390
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2391
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  2392