ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Wed, 23 Aug 2023 03:02:31 +0100
changeset 668 3831621d7b14
parent 667 660cf698eb26
permissions -rwxr-xr-x
added technical Overview section, almost done introduction
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
649
Chengsong
parents: 646
diff changeset
    10
Chengsong
parents: 646
diff changeset
    11
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    12
\marginpar{Gerog comment: addressed attributing work correctly problem,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    13
showed clearly these definitions come from Ausaf et al.}
649
Chengsong
parents: 646
diff changeset
    14
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    15
This is a preliminary chapter which describes the results of
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    16
Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    17
Ausaf et al \ref{AusafDyckhoffUrban2016} (mainly the definitions).
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    18
These results are not part of this PhD work,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    19
but the details are included to provide necessary context
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    20
for our work on the correctness proof of $\blexersimp$,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    21
as we show in chapter \ref{Bitcoded2} how the proofs break down when
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    22
simplifications are applied.
657
00171b627b8d Fixed some annotated/unannotated a/r notation inconsistencies.
Chengsong
parents: 651
diff changeset
    23
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    24
In the coming section, the definitions of basic notions 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    25
for regular languages and regular expressions are given.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    26
This is essentially a description in ``English''
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    27
the functions and datatypes used by Ausaf et al. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    28
\cite{AusafDyckhoffUrban2016} \cite{Ausaf}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    29
in their formalisation in Isabelle/HOL.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    30
We include them as we build on their formalisation,
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    31
and therefore inherently use these definitions.
649
Chengsong
parents: 646
diff changeset
    32
Chengsong
parents: 646
diff changeset
    33
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    34
% In the next section, we will briefly
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    35
% introduce Brzozowski derivatives and Sulzmann
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
    36
% and Lu's algorithm, which the main point of this thesis builds on.
649
Chengsong
parents: 646
diff changeset
    37
%We give a taste of what they 
Chengsong
parents: 646
diff changeset
    38
%are like and why they are suitable for regular expression
Chengsong
parents: 646
diff changeset
    39
%matching and lexing.
Chengsong
parents: 646
diff changeset
    40
\section{Formal Specification of POSIX Matching 
Chengsong
parents: 646
diff changeset
    41
and Brzozowski Derivatives}
Chengsong
parents: 646
diff changeset
    42
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Chengsong
parents: 646
diff changeset
    43
Brzozowski \cite{Brzozowski1964} first introduced the 
Chengsong
parents: 646
diff changeset
    44
concept of a \emph{derivative} of regular expression in 1964.
Chengsong
parents: 646
diff changeset
    45
The derivative of a regular expression $r$
Chengsong
parents: 646
diff changeset
    46
with respect to a character $c$, is written as $r \backslash c$.
Chengsong
parents: 646
diff changeset
    47
This operation tells us what $r$ transforms into
Chengsong
parents: 646
diff changeset
    48
if we ``chop'' off a particular character $c$ 
Chengsong
parents: 646
diff changeset
    49
from all strings in the language of $r$ (defined
Chengsong
parents: 646
diff changeset
    50
later as $L \; r$).
Chengsong
parents: 646
diff changeset
    51
%To give a flavour of Brzozowski derivatives, we present
Chengsong
parents: 646
diff changeset
    52
%two straightforward clauses from it:
Chengsong
parents: 646
diff changeset
    53
%\begin{center}
Chengsong
parents: 646
diff changeset
    54
%	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
    55
%		$d \backslash c$     & $\dn$ & 
Chengsong
parents: 646
diff changeset
    56
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
Chengsong
parents: 646
diff changeset
    57
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
Chengsong
parents: 646
diff changeset
    58
%	\end{tabular}
Chengsong
parents: 646
diff changeset
    59
%\end{center}
Chengsong
parents: 646
diff changeset
    60
%\noindent
Chengsong
parents: 646
diff changeset
    61
%The first clause says that for the regular expression
Chengsong
parents: 646
diff changeset
    62
%denoting a singleton set consisting of a single-character string $\{ d \}$,
Chengsong
parents: 646
diff changeset
    63
%we check the derivative character $c$ against $d$,
Chengsong
parents: 646
diff changeset
    64
%returning a set containing only the empty string $\{ [] \}$
Chengsong
parents: 646
diff changeset
    65
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
Chengsong
parents: 646
diff changeset
    66
%The second clause states that to obtain the regular expression
Chengsong
parents: 646
diff changeset
    67
%representing all strings' head character $c$ being chopped off
Chengsong
parents: 646
diff changeset
    68
%from $r_1 + r_2$, one simply needs to recursively take derivative
Chengsong
parents: 646
diff changeset
    69
%of $r_1$ and $r_2$ and then put them together.
Chengsong
parents: 646
diff changeset
    70
Derivatives have the property
Chengsong
parents: 646
diff changeset
    71
that $s \in L \; (r\backslash c)$ if and only if 
Chengsong
parents: 646
diff changeset
    72
$c::s \in L \; r$ where $::$ stands for list prepending.
Chengsong
parents: 646
diff changeset
    73
%This property can be used on regular expressions
Chengsong
parents: 646
diff changeset
    74
%matching and lexing--to test whether a string $s$ is in $L \; r$,
Chengsong
parents: 646
diff changeset
    75
%one simply takes derivatives of $r$ successively with
Chengsong
parents: 646
diff changeset
    76
%respect to the characters (in the correct order) in $s$,
Chengsong
parents: 646
diff changeset
    77
%and then test whether the empty string is in the last regular expression.
Chengsong
parents: 646
diff changeset
    78
With this property, derivatives can give a simple solution
Chengsong
parents: 646
diff changeset
    79
to the problem of matching a string $s$ with a regular
Chengsong
parents: 646
diff changeset
    80
expression $r$: if the derivative of $r$ w.r.t.\ (in
Chengsong
parents: 646
diff changeset
    81
succession) all the characters of the string matches the empty string,
Chengsong
parents: 646
diff changeset
    82
then $r$ matches $s$ (and {\em vice versa}).  
Chengsong
parents: 646
diff changeset
    83
%This makes formally reasoning about these properties such
Chengsong
parents: 646
diff changeset
    84
%as correctness and complexity smooth and intuitive.
Chengsong
parents: 646
diff changeset
    85
There are several mechanised proofs of this property in various theorem
Chengsong
parents: 646
diff changeset
    86
provers,
Chengsong
parents: 646
diff changeset
    87
for example one by Owens and Slind \cite{Owens2008} in HOL4,
Chengsong
parents: 646
diff changeset
    88
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
Chengsong
parents: 646
diff changeset
    89
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
Chengsong
parents: 646
diff changeset
    90
Chengsong
parents: 646
diff changeset
    91
In addition, one can extend derivatives to bounded repetitions
Chengsong
parents: 646
diff changeset
    92
relatively straightforwardly. For example, the derivative for 
Chengsong
parents: 646
diff changeset
    93
this can be defined as:
Chengsong
parents: 646
diff changeset
    94
\begin{center}
Chengsong
parents: 646
diff changeset
    95
	\begin{tabular}{lcl}
Chengsong
parents: 646
diff changeset
    96
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
Chengsong
parents: 646
diff changeset
    97
		r^{\{n-1\}} (\text{when} n > 0)$\\
Chengsong
parents: 646
diff changeset
    98
	\end{tabular}
Chengsong
parents: 646
diff changeset
    99
\end{center}
Chengsong
parents: 646
diff changeset
   100
\noindent
Chengsong
parents: 646
diff changeset
   101
Experimental results suggest that  unlike DFA-based solutions
Chengsong
parents: 646
diff changeset
   102
for bounded regular expressions,
Chengsong
parents: 646
diff changeset
   103
derivatives can cope
Chengsong
parents: 646
diff changeset
   104
large counters
Chengsong
parents: 646
diff changeset
   105
quite well.
Chengsong
parents: 646
diff changeset
   106
Chengsong
parents: 646
diff changeset
   107
There have also been 
Chengsong
parents: 646
diff changeset
   108
extensions of derivatives to other regex constructs.
Chengsong
parents: 646
diff changeset
   109
For example, Owens et al include the derivatives
Chengsong
parents: 646
diff changeset
   110
for the \emph{NOT} regular expression, which is
Chengsong
parents: 646
diff changeset
   111
able to concisely express C-style comments of the form
Chengsong
parents: 646
diff changeset
   112
$/* \ldots */$ (see \cite{Owens2008}).
Chengsong
parents: 646
diff changeset
   113
Another extension for derivatives is
Chengsong
parents: 646
diff changeset
   114
regular expressions with look-aheads, done
Chengsong
parents: 646
diff changeset
   115
by Miyazaki and Minamide
Chengsong
parents: 646
diff changeset
   116
\cite{Takayuki2019}.
Chengsong
parents: 646
diff changeset
   117
%We therefore use Brzozowski derivatives on regular expressions 
Chengsong
parents: 646
diff changeset
   118
%lexing 
Chengsong
parents: 646
diff changeset
   119
Chengsong
parents: 646
diff changeset
   120
Chengsong
parents: 646
diff changeset
   121
Chengsong
parents: 646
diff changeset
   122
Given the above definitions and properties of
Chengsong
parents: 646
diff changeset
   123
Brzozowski derivatives, one quickly realises their potential
Chengsong
parents: 646
diff changeset
   124
in generating a formally verified algorithm for lexing: the clauses and property
Chengsong
parents: 646
diff changeset
   125
can be easily expressed in a functional programming language 
Chengsong
parents: 646
diff changeset
   126
or converted to theorem prover
Chengsong
parents: 646
diff changeset
   127
code, with great ease.
Chengsong
parents: 646
diff changeset
   128
Perhaps this is the reason why derivatives have sparked quite a bit of interest
Chengsong
parents: 646
diff changeset
   129
in the functional programming and theorem prover communities in the last
Chengsong
parents: 646
diff changeset
   130
fifteen or so years (
Chengsong
parents: 646
diff changeset
   131
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
Chengsong
parents: 646
diff changeset
   132
\cite{Chen12} and \cite{Coquand2012}
Chengsong
parents: 646
diff changeset
   133
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
Chengsong
parents: 646
diff changeset
   134
after they were first published by Brzozowski.
Chengsong
parents: 646
diff changeset
   135
Chengsong
parents: 646
diff changeset
   136
Chengsong
parents: 646
diff changeset
   137
However, there are two difficulties with derivative-based matchers:
Chengsong
parents: 646
diff changeset
   138
First, Brzozowski's original matcher only generates a yes/no answer
Chengsong
parents: 646
diff changeset
   139
for whether a regular expression matches a string or not.  This is too
Chengsong
parents: 646
diff changeset
   140
little information in the context of lexing where separate tokens must
Chengsong
parents: 646
diff changeset
   141
be identified and also classified (for example as keywords
Chengsong
parents: 646
diff changeset
   142
or identifiers). 
Chengsong
parents: 646
diff changeset
   143
Second, derivative-based matchers need to be more efficient in terms
Chengsong
parents: 646
diff changeset
   144
of the sizes of derivatives.
Chengsong
parents: 646
diff changeset
   145
Elegant and beautiful
Chengsong
parents: 646
diff changeset
   146
as many implementations are,
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   147
they can be still quite slow. 
649
Chengsong
parents: 646
diff changeset
   148
For example, Sulzmann and Lu
Chengsong
parents: 646
diff changeset
   149
claim a linear running time of their proposed algorithm,
Chengsong
parents: 646
diff changeset
   150
but that was falsified by our experiments. The running time 
Chengsong
parents: 646
diff changeset
   151
is actually $\Omega(2^n)$ in the worst case.
Chengsong
parents: 646
diff changeset
   152
A similar claim about a theoretical runtime of $O(n^2)$ 
Chengsong
parents: 646
diff changeset
   153
is made for the Verbatim \cite{Verbatim}
Chengsong
parents: 646
diff changeset
   154
%TODO: give references
Chengsong
parents: 646
diff changeset
   155
lexer, which calculates POSIX matches and is based on derivatives.
Chengsong
parents: 646
diff changeset
   156
They formalized the correctness of the lexer, but not their complexity result.
Chengsong
parents: 646
diff changeset
   157
In the performance evaluation section, they analyzed the run time
Chengsong
parents: 646
diff changeset
   158
of matching $a$ with the string 
Chengsong
parents: 646
diff changeset
   159
\begin{center}
Chengsong
parents: 646
diff changeset
   160
	$\underbrace{a \ldots a}_{\text{n a's}}$.
Chengsong
parents: 646
diff changeset
   161
\end{center}
Chengsong
parents: 646
diff changeset
   162
\noindent
Chengsong
parents: 646
diff changeset
   163
They concluded that the algorithm is quadratic in terms of 
Chengsong
parents: 646
diff changeset
   164
the length of the input string.
Chengsong
parents: 646
diff changeset
   165
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
Chengsong
parents: 646
diff changeset
   166
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
Chengsong
parents: 646
diff changeset
   167
Chengsong
parents: 646
diff changeset
   168
Chengsong
parents: 646
diff changeset
   169
\subsection{Sulzmann and Lu's Algorithm}
Chengsong
parents: 646
diff changeset
   170
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
Chengsong
parents: 646
diff changeset
   171
problem with the yes/no answer 
Chengsong
parents: 646
diff changeset
   172
by cleverly extending Brzozowski's matching
Chengsong
parents: 646
diff changeset
   173
algorithm. Their extended version generates additional information on
Chengsong
parents: 646
diff changeset
   174
\emph{how} a regular expression matches a string following the POSIX
Chengsong
parents: 646
diff changeset
   175
rules for regular expression matching. They achieve this by adding a
Chengsong
parents: 646
diff changeset
   176
second ``phase'' to Brzozowski's algorithm involving an injection
Chengsong
parents: 646
diff changeset
   177
function. This injection function in a sense undoes the ``damage''
Chengsong
parents: 646
diff changeset
   178
of the derivatives chopping off characters.
Chengsong
parents: 646
diff changeset
   179
In earlier work, Ausaf et al provided the formal
Chengsong
parents: 646
diff changeset
   180
specification of what POSIX matching means and proved in Isabelle/HOL
Chengsong
parents: 646
diff changeset
   181
the correctness
Chengsong
parents: 646
diff changeset
   182
of this extended algorithm accordingly
Chengsong
parents: 646
diff changeset
   183
\cite{AusafDyckhoffUrban2016}.
Chengsong
parents: 646
diff changeset
   184
Chengsong
parents: 646
diff changeset
   185
The version of the algorithm proven correct
Chengsong
parents: 646
diff changeset
   186
suffers however heavily from a 
Chengsong
parents: 646
diff changeset
   187
second difficulty, where derivatives can
Chengsong
parents: 646
diff changeset
   188
grow to arbitrarily big sizes. 
Chengsong
parents: 646
diff changeset
   189
For example if we start with the
Chengsong
parents: 646
diff changeset
   190
regular expression $(a+aa)^*$ and take
Chengsong
parents: 646
diff changeset
   191
successive derivatives according to the character $a$, we end up with
Chengsong
parents: 646
diff changeset
   192
a sequence of ever-growing derivatives like 
Chengsong
parents: 646
diff changeset
   193
Chengsong
parents: 646
diff changeset
   194
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
Chengsong
parents: 646
diff changeset
   195
\begin{center}
Chengsong
parents: 646
diff changeset
   196
\begin{tabular}{rll}
Chengsong
parents: 646
diff changeset
   197
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   198
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   199
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
Chengsong
parents: 646
diff changeset
   200
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
Chengsong
parents: 646
diff changeset
   201
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
Chengsong
parents: 646
diff changeset
   202
\end{tabular}
Chengsong
parents: 646
diff changeset
   203
\end{center}
Chengsong
parents: 646
diff changeset
   204
 
Chengsong
parents: 646
diff changeset
   205
\noindent where after around 35 steps we usually run out of memory on a
Chengsong
parents: 646
diff changeset
   206
typical computer.  Clearly, the
Chengsong
parents: 646
diff changeset
   207
notation involving $\ZERO$s and $\ONE$s already suggests
Chengsong
parents: 646
diff changeset
   208
simplification rules that can be applied to regular regular
Chengsong
parents: 646
diff changeset
   209
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
Chengsong
parents: 646
diff changeset
   210
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
Chengsong
parents: 646
diff changeset
   211
r$. While such simple-minded simplifications have been proved in 
Chengsong
parents: 646
diff changeset
   212
the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
Chengsong
parents: 646
diff changeset
   213
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
Chengsong
parents: 646
diff changeset
   214
\emph{not} help with limiting the growth of the derivatives shown
Chengsong
parents: 646
diff changeset
   215
above: the growth is slowed, but the derivatives can still grow rather
Chengsong
parents: 646
diff changeset
   216
quickly beyond any finite bound.
Chengsong
parents: 646
diff changeset
   217
Chengsong
parents: 646
diff changeset
   218
Therefore we want to look in this thesis at a second
Chengsong
parents: 646
diff changeset
   219
algorithm by Sulzmann and Lu where they
Chengsong
parents: 646
diff changeset
   220
overcame this ``growth problem'' \cite{Sulzmann2014}.
Chengsong
parents: 646
diff changeset
   221
In this version, POSIX values are 
Chengsong
parents: 646
diff changeset
   222
represented as bit sequences and such sequences are incrementally generated
Chengsong
parents: 646
diff changeset
   223
when derivatives are calculated. The compact representation
Chengsong
parents: 646
diff changeset
   224
of bit sequences and regular expressions allows them to define a more
Chengsong
parents: 646
diff changeset
   225
``aggressive'' simplification method that keeps the size of the
Chengsong
parents: 646
diff changeset
   226
derivatives finite no matter what the length of the string is.
Chengsong
parents: 646
diff changeset
   227
They make some informal claims about the correctness and linear behaviour
Chengsong
parents: 646
diff changeset
   228
of this version, but do not provide any supporting proof arguments, not
Chengsong
parents: 646
diff changeset
   229
even ``pencil-and-paper'' arguments. They write about their bit-coded
Chengsong
parents: 646
diff changeset
   230
\emph{incremental parsing method} (that is the algorithm to be formalised
Chengsong
parents: 646
diff changeset
   231
in this dissertation)
Chengsong
parents: 646
diff changeset
   232
Chengsong
parents: 646
diff changeset
   233
Chengsong
parents: 646
diff changeset
   234
  
Chengsong
parents: 646
diff changeset
   235
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
   236
  ``Correctness Claim: We further claim that the incremental parsing
Chengsong
parents: 646
diff changeset
   237
  method [..] in combination with the simplification steps [..]
Chengsong
parents: 646
diff changeset
   238
  yields POSIX parse trees. We have tested this claim
Chengsong
parents: 646
diff changeset
   239
  extensively [..] but yet
Chengsong
parents: 646
diff changeset
   240
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
Chengsong
parents: 646
diff changeset
   241
\end{quote}  
Chengsong
parents: 646
diff changeset
   242
Ausaf and Urban made some initial progress towards the 
Chengsong
parents: 646
diff changeset
   243
full correctness proof but still had to leave out the optimisation
Chengsong
parents: 646
diff changeset
   244
Sulzmann and Lu proposed.
Chengsong
parents: 646
diff changeset
   245
Ausaf  wrote \cite{Ausaf},
Chengsong
parents: 646
diff changeset
   246
  \begin{quote}\it
Chengsong
parents: 646
diff changeset
   247
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
Chengsong
parents: 646
diff changeset
   248
\end{quote}  
Chengsong
parents: 646
diff changeset
   249
This thesis implements the aggressive simplifications envisioned
Chengsong
parents: 646
diff changeset
   250
by Ausaf and Urban,
Chengsong
parents: 646
diff changeset
   251
together with a formal proof of the correctness of those simplifications.
Chengsong
parents: 646
diff changeset
   252
Chengsong
parents: 646
diff changeset
   253
Chengsong
parents: 646
diff changeset
   254
One of the most recent work in the context of lexing
Chengsong
parents: 646
diff changeset
   255
%with this issue
Chengsong
parents: 646
diff changeset
   256
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
Chengsong
parents: 646
diff changeset
   257
This is relevant work for us and we will compare it later with 
Chengsong
parents: 646
diff changeset
   258
our derivative-based matcher we are going to present.
Chengsong
parents: 646
diff changeset
   259
There is also some newer work called
Chengsong
parents: 646
diff changeset
   260
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
Chengsong
parents: 646
diff changeset
   261
but deterministic finite automaton instead.
Chengsong
parents: 646
diff changeset
   262
We will also study this work in a later section.
Chengsong
parents: 646
diff changeset
   263
%An example that gives problem to automaton approaches would be
Chengsong
parents: 646
diff changeset
   264
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
Chengsong
parents: 646
diff changeset
   265
%It requires at least $2^{n+1}$ states to represent
Chengsong
parents: 646
diff changeset
   266
%as a DFA.
Chengsong
parents: 646
diff changeset
   267
Chengsong
parents: 646
diff changeset
   268
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   269
\section{Basic Concepts}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   270
Formal language theory usually starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   271
denoting a set of characters.
637
Chengsong
parents: 628
diff changeset
   272
Here we use the datatype of characters from Isabelle,
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   273
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
   274
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   275
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   276
Then using the usual bracket notation for lists,
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   277
we can define strings made up of characters as: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   278
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   279
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   280
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   281
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   282
\end{center}
583
Chengsong
parents: 579
diff changeset
   283
where $c$ is a variable ranging over characters.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   284
The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   285
list.
637
Chengsong
parents: 628
diff changeset
   286
For brevity, a singleton list is sometimes written as $[c]$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   287
Strings can be concatenated to form longer strings in the same
637
Chengsong
parents: 628
diff changeset
   288
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   289
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   290
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   291
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   292
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   293
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   294
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   295
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   296
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   297
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   298
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   299
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   300
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   301
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   302
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   303
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   304
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   305
\end{center}
564
Chengsong
parents: 543
diff changeset
   306
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
   307
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   308
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   309
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   310
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   311
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   312
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   313
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   314
\noindent
564
Chengsong
parents: 543
diff changeset
   315
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   316
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   317
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   318
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   319
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   320
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   321
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
   322
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   323
564
Chengsong
parents: 543
diff changeset
   324
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   325
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   326
\end{center}
564
Chengsong
parents: 543
diff changeset
   327
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   328
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   329
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   330
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   331
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   332
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   333
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   334
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   335
\noindent
583
Chengsong
parents: 579
diff changeset
   336
This can be generalised to ``chopping off'' a string 
Chengsong
parents: 579
diff changeset
   337
from all strings within a set $A$, 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   338
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   339
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   340
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   341
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   342
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   343
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   344
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   345
which is essentially the left quotient $A \backslash L$ of $A$ against 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   346
the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   347
However, for our purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   348
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   349
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   350
The reason for defining derivatives
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   351
is that they provide another approach
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   352
to test membership of a string in 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   353
a set of strings. 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   354
For example, to test whether the string
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   355
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   356
respect to the string $bar$:
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   357
\begin{center}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   358
\begin{tabular}{lll}
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   359
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   360
	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   361
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   362
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   363
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   364
\end{tabular}	
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   365
\end{center}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   366
\noindent
637
Chengsong
parents: 628
diff changeset
   367
and in the end, test whether the set
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   368
contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   369
	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   370
	clear from the context that we are operating
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   371
on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   372
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   373
In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   374
then we can test whether $s$ is in $S$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   375
by testing whether $[] \in S \backslash s$.
564
Chengsong
parents: 543
diff changeset
   376
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   377
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   378
sub-languages.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   379
For example, for the sequence operator, we have
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   380
something similar to a ``chain rule'':
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   381
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   382
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   383
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   384
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   385
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   386
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   387
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   388
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   389
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   390
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   391
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   392
and get to $B$.
583
Chengsong
parents: 579
diff changeset
   393
The language derivative for $A*$ can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   394
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   395
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   396
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   397
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   398
\begin{proof}
583
Chengsong
parents: 579
diff changeset
   399
There are two inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   400
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   401
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   402
The set 
637
Chengsong
parents: 628
diff changeset
   403
\[ S_1 = \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   404
is enclosed in the set
637
Chengsong
parents: 628
diff changeset
   405
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
Chengsong
parents: 628
diff changeset
   406
This is because for any string $c::s$ satisfying $c::s \in A*$,
Chengsong
parents: 628
diff changeset
   407
%whenever you have a string starting with a character 
Chengsong
parents: 628
diff changeset
   408
%in the language of a Kleene star $A*$, 
Chengsong
parents: 628
diff changeset
   409
%then that
Chengsong
parents: 628
diff changeset
   410
the character $c$, together with a prefix of $s$
Chengsong
parents: 628
diff changeset
   411
%immediately after $c$ 
Chengsong
parents: 628
diff changeset
   412
forms the first iteration of $A*$, 
Chengsong
parents: 628
diff changeset
   413
and the rest of the $s$ is also $A*$.
Chengsong
parents: 628
diff changeset
   414
This coincides with the definition of $S_2$.
564
Chengsong
parents: 543
diff changeset
   415
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   416
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   417
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
583
Chengsong
parents: 579
diff changeset
   418
holds.
Chengsong
parents: 579
diff changeset
   419
Also the following holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   420
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
   421
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
   422
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
   423
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   424
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   425
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   426
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   427
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   428
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   429
for regular expressions.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   430
To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   431
which we shall do next.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   432
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   433
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
   434
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   435
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   436
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   437
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   438
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   439
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   440
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   441
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   442
\noindent
564
Chengsong
parents: 543
diff changeset
   443
We call them basic because we will introduce
637
Chengsong
parents: 628
diff changeset
   444
additional constructors in later chapters, such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   445
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
   446
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
   447
matches no string, and $\ONE$ for the regular
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   448
expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   449
Some authors
564
Chengsong
parents: 543
diff changeset
   450
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   451
but we prefer this notation.} 
564
Chengsong
parents: 543
diff changeset
   452
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
   453
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
   454
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
   455
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
   456
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
   457
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
   458
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   459
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   460
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   461
\begin{tabular}{lcl}
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
   462
$L \; \ZERO$ & $\dn$ & $\varnothing$\\
564
Chengsong
parents: 543
diff changeset
   463
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
   464
$L \; c$ & $\dn$ & $\{[c]\}$\\
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   465
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   466
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   467
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   468
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   469
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   470
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   471
%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   472
%their language interpretations in place, we are ready to define derivatives on regular expressions.
637
Chengsong
parents: 628
diff changeset
   473
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   474
We do so by first introducing what properties they should satisfy.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   475
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   476
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
   477
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
   478
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
   479
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
   480
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
   481
%will look as if it was obtained by doing a language derivative on $L(r)$:
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   482
%Recall that the language derivative acts on a 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   483
%language (set of strings).
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   484
%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   485
%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   486
%that string and then checking whether the empty 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   487
%string is in the derivative:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   488
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   489
%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   490
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   491
%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   492
%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   493
%  \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   494
%  \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   495
%  \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   496
%    \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   497
%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   498
%    \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   499
%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   500
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   501
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   502
%\notate{$\{ \ldots ,\;$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   503
%	\notate{s}{1}{$(c_1 :: s_1)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   504
%	$, \; \ldots \}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   505
%}{1}{$S_{start}$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   506
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   507
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   508
%	$\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   509
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   510
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   511
%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   512
%	$,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   513
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   514
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   515
%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   516
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   517
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   518
%	$\{ \ldots,\;  s_2
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   519
%	,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   520
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   521
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   522
%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   523
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   524
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   525
%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   526
%	S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   527
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   528
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   529
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   530
%	$s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   531
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   532
%\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   533
Brzozowski noticed that $\Der$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   534
can be ``mirrored'' on regular expressions which
564
Chengsong
parents: 543
diff changeset
   535
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
   536
with respect to a character $c$, written
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   537
$r \backslash c$. This infix operator
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   538
takes regular expression $r$ as input
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   539
and a character as a right operand.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   540
The derivative operation on regular expression
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   541
is defined such that the language of the derivative result 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   542
coincides with the language of the original 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   543
regular expression being taken 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   544
derivative with respect to the same characters, namely
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   545
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   546
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   547
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   548
	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   549
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   550
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   551
\noindent
637
Chengsong
parents: 628
diff changeset
   552
Next, we give the recursive definition of derivative on
Chengsong
parents: 628
diff changeset
   553
regular expressions so that it satisfies the properties above.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   554
%The derivative function, written $r\backslash c$, 
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   555
%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   556
%returns a new regular expression representing
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   557
%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   558
%being taken the language derivative with respect to $c$.
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   559
\begin{table}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   560
	\begin{center}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   561
\begin{tabular}{lcl}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   562
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   563
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   564
		$d \backslash c$     & $\dn$ & 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   565
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   566
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   567
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   568
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   569
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   570
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   571
\end{tabular}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   572
	\end{center}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   573
\caption{Derivative on Regular Expressions}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   574
\label{table:der}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   575
\end{table}
564
Chengsong
parents: 543
diff changeset
   576
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   577
The most involved cases are the sequence case
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   578
and the star case.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   579
The sequence case says that if the first regular expression
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   580
contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   581
needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   582
result of this derivative:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   583
\begin{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   584
	\begin{tabular}{lcl}
583
Chengsong
parents: 579
diff changeset
   585
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
Chengsong
parents: 579
diff changeset
   586
		$\textit{if}\;\,([] \in L(r_1))\; 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   587
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   588
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   589
	\end{tabular}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   590
\end{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   591
\noindent
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   592
Notice how this closely resembles
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   593
the language derivative operation $\Der$:
564
Chengsong
parents: 543
diff changeset
   594
\begin{center}
Chengsong
parents: 543
diff changeset
   595
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   596
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
   597
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
   598
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
   599
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
   600
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
   601
	\end{tabular}
Chengsong
parents: 543
diff changeset
   602
\end{center}
Chengsong
parents: 543
diff changeset
   603
\noindent
583
Chengsong
parents: 579
diff changeset
   604
The derivative of the star regular expression $r^*$ 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   605
unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   606
and attaches the original $r^*$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   607
after $r\backslash c$, so that 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   608
we can further unfold it as many times as needed:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   609
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   610
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   611
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   612
Again,
637
Chengsong
parents: 628
diff changeset
   613
the structure is the same as the language derivative of the Kleene star: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   614
\[
564
Chengsong
parents: 543
diff changeset
   615
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   616
\]
564
Chengsong
parents: 543
diff changeset
   617
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
   618
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
   619
boolean condition 
Chengsong
parents: 543
diff changeset
   620
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
   621
somehow recursively computed.
Chengsong
parents: 543
diff changeset
   622
We call such a function that checks
Chengsong
parents: 543
diff changeset
   623
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
   624
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
   625
\begin{center}
Chengsong
parents: 543
diff changeset
   626
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   627
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
   628
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   629
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
   630
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   631
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   632
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   633
		\end{tabular}
Chengsong
parents: 543
diff changeset
   634
\end{center}
Chengsong
parents: 543
diff changeset
   635
\noindent
Chengsong
parents: 543
diff changeset
   636
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
   637
does not contain any string and
Chengsong
parents: 543
diff changeset
   638
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
   639
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
   640
by definition. 
Chengsong
parents: 543
diff changeset
   641
The character regular expression $c$
Chengsong
parents: 543
diff changeset
   642
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
   643
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
   644
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
   645
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
   646
The sequence regular expression
Chengsong
parents: 543
diff changeset
   647
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
   648
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
   649
is always nullable because it naturally
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   650
contains the empty string.  
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   651
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   652
We have the following two correspondences between 
564
Chengsong
parents: 543
diff changeset
   653
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
   654
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
   655
\begin{lemma}\label{derDer}
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   656
	\mbox{}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   657
	\begin{itemize}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   658
		\item
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   659
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   660
\item
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   661
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   662
	\end{itemize}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   663
\end{lemma}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   664
\begin{proof}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   665
	By induction on $r$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   666
\end{proof}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   667
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   668
which are the main properties of derivatives
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   669
that enables us later to reason about the correctness of
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   670
derivative-based matching.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   671
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   672
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   673
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   674
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   675
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   676
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583
Chengsong
parents: 579
diff changeset
   677
$r \backslash_s [\,] $ & $\dn$ & $r$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   678
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   679
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   680
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   681
\noindent
564
Chengsong
parents: 543
diff changeset
   682
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
   683
omit the subscript and use $\backslash$ instead
583
Chengsong
parents: 579
diff changeset
   684
of $\backslash_s$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   685
string derivatives for brevity.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   686
Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   687
regular-expression matching algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   688
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   689
\begin{definition}
564
Chengsong
parents: 543
diff changeset
   690
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   691
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   692
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   693
\noindent
637
Chengsong
parents: 628
diff changeset
   694
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
Chengsong
parents: 628
diff changeset
   695
this algorithm, presented graphically, is as follows:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   696
601
Chengsong
parents: 591
diff changeset
   697
\begin{equation}\label{matcher}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   698
\begin{tikzcd}
583
Chengsong
parents: 579
diff changeset
   699
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
Chengsong
parents: 579
diff changeset
   700
r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
Chengsong
parents: 579
diff changeset
   701
\;\textrm{true}/\textrm{false}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   702
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   703
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   704
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   705
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   706
 It is relatively
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   707
easy to show that this matcher is correct, namely
539
Chengsong
parents: 538
diff changeset
   708
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
   709
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
   710
\end{lemma}
Chengsong
parents: 538
diff changeset
   711
\begin{proof}
637
Chengsong
parents: 628
diff changeset
   712
	By induction on $s$ using the property of derivatives:
583
Chengsong
parents: 579
diff changeset
   713
	lemma \ref{derDer}.
539
Chengsong
parents: 538
diff changeset
   714
\end{proof}
601
Chengsong
parents: 591
diff changeset
   715
\begin{figure}
564
Chengsong
parents: 543
diff changeset
   716
\begin{center}
539
Chengsong
parents: 538
diff changeset
   717
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   718
\begin{axis}[
Chengsong
parents: 538
diff changeset
   719
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   720
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
   721
    %ymode = log,
539
Chengsong
parents: 538
diff changeset
   722
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
   723
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   724
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   725
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
   726
\end{axis}
Chengsong
parents: 538
diff changeset
   727
\end{tikzpicture} 
583
Chengsong
parents: 579
diff changeset
   728
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
Chengsong
parents: 579
diff changeset
   729
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
Chengsong
parents: 579
diff changeset
   730
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601
Chengsong
parents: 591
diff changeset
   731
\end{center}
539
Chengsong
parents: 538
diff changeset
   732
\end{figure}
Chengsong
parents: 538
diff changeset
   733
\noindent
564
Chengsong
parents: 543
diff changeset
   734
If we implement the above algorithm naively, however,
666
6da4516ea87d more changes to figures & benchmarking
Chengsong
parents: 657
diff changeset
   735
the algorithm can be as slow as backtracking lexers, as shown in 
564
Chengsong
parents: 543
diff changeset
   736
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
   737
Note that both axes are in logarithmic scale.
637
Chengsong
parents: 628
diff changeset
   738
Around two dozen characters
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   739
this algorithm already ``explodes'' with the regular expression 
564
Chengsong
parents: 543
diff changeset
   740
$(a^*)^*b$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   741
To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   742
rules for the intermediate results,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   743
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539
Chengsong
parents: 538
diff changeset
   744
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
   745
language of the regular expression.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   746
One simple-minded simplification function
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   747
that achieves these requirements 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   748
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564
Chengsong
parents: 543
diff changeset
   749
\begin{center}
Chengsong
parents: 543
diff changeset
   750
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   751
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
   752
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
   753
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   754
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   755
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   756
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   757
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
   758
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
   759
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   760
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   761
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583
Chengsong
parents: 579
diff changeset
   762
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564
Chengsong
parents: 543
diff changeset
   763
				   
Chengsong
parents: 543
diff changeset
   764
	\end{tabular}
Chengsong
parents: 543
diff changeset
   765
\end{center}
Chengsong
parents: 543
diff changeset
   766
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
   767
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
   768
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
   769
\begin{center}
Chengsong
parents: 543
diff changeset
   770
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   771
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
   772
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
   773
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
   774
	\end{tabular}
Chengsong
parents: 543
diff changeset
   775
\end{center}
Chengsong
parents: 543
diff changeset
   776
\begin{figure}
539
Chengsong
parents: 538
diff changeset
   777
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   778
\begin{axis}[
Chengsong
parents: 538
diff changeset
   779
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   780
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
   781
    %ymode = log,
Chengsong
parents: 591
diff changeset
   782
    %xmode = log,
564
Chengsong
parents: 543
diff changeset
   783
    grid = both,
539
Chengsong
parents: 538
diff changeset
   784
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
   785
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   786
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   787
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
   788
\end{axis}
564
Chengsong
parents: 543
diff changeset
   789
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
   790
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
   791
against 
Chengsong
parents: 543
diff changeset
   792
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
   793
\end{figure}
Chengsong
parents: 543
diff changeset
   794
\noindent
Chengsong
parents: 543
diff changeset
   795
The running time of $\textit{ders}\_\textit{simp}$
583
Chengsong
parents: 579
diff changeset
   796
on the same example of Figure \ref{NaiveMatcher}
Chengsong
parents: 579
diff changeset
   797
is now ``tame''  in terms of the length of inputs,
Chengsong
parents: 579
diff changeset
   798
as shown in Figure \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
   799
637
Chengsong
parents: 628
diff changeset
   800
So far, the story is use Brzozowski derivatives and
Chengsong
parents: 628
diff changeset
   801
simplify as much as possible, and at the end test
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   802
whether the empty string is recognised 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   803
by the final derivative.
583
Chengsong
parents: 579
diff changeset
   804
But what if we want to 
Chengsong
parents: 579
diff changeset
   805
do lexing instead of just getting a true/false answer?
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   806
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   807
elegant (arguably as beautiful as the definition of the 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   808
Brzozowski derivative) solution for this.
668
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   809
\marginpar{explicitly attribute the work}
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   810
For the same reason described
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   811
at the beginning of this chapter, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   812
we introduce the formal
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   813
semantics of $\POSIX$ lexing by 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   814
Ausaf et al.\cite{AusafDyckhoffUrban2016}, 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   815
followed by the first lexing algorithm by 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   816
Sulzmanna and Lu \cite{Sulzmann2014} 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   817
that produces the output conforming
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   818
to the $\POSIX$ standard. 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   819
%TODO: Actually show this in chapter 4.
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   820
In what follows 
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   821
we choose to use the Isabelle-style notation
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   822
for function and datatype constructor applications, where
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   823
the parameters of a function are not enclosed
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   824
inside a pair of parentheses (e.g. $f \;x \;y$
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   825
instead of $f(x,\;y)$). This is mainly
3831621d7b14 added technical Overview section, almost done introduction
Chengsong
parents: 667
diff changeset
   826
to make the text visually more concise.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   827
539
Chengsong
parents: 538
diff changeset
   828
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
   829
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
   830
algorithm.
Chengsong
parents: 543
diff changeset
   831
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
   832
respect to the input string,
Chengsong
parents: 543
diff changeset
   833
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
   834
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
   835
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
   836
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
   837
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
   838
called \emph{value} or
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   839
sometimes also \emph{lexical value}.  
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   840
Values and regular
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   841
expressions correspond to each other 
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   842
as illustrated in the following
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   843
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   844
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   845
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   846
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   847
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   848
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   849
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   850
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   851
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   852
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   853
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   854
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   855
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   856
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   857
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   858
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   859
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   860
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   861
			&        & $\Empty$   \\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   862
			& $\mid$ & $\Char \; c$          \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   863
			& $\mid$ & $\Seq\,v_1\, v_2$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   864
			& $\mid$ & $\Left \;v$   \\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   865
			& $\mid$ & $\Right\;v$  \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   866
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   867
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   868
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   869
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   870
\noindent
564
Chengsong
parents: 543
diff changeset
   871
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
   872
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
   873
\begin{center}
Chengsong
parents: 543
diff changeset
   874
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   875
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
   876
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   877
		$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   878
		$|\Left \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   879
		$|\Right \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   880
		$|\Stars \; []|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   881
		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
564
Chengsong
parents: 543
diff changeset
   882
	\end{tabular}
Chengsong
parents: 543
diff changeset
   883
\end{center}
Chengsong
parents: 543
diff changeset
   884
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
   885
to indicate that a value $v$ could be generated from a lexing algorithm
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   886
with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   887
defined by the rules.
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
   888
\begin{figure}[H]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   889
\begin{mathpar}
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   890
	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564
Chengsong
parents: 543
diff changeset
   891
Chengsong
parents: 543
diff changeset
   892
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
   893
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   894
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564
Chengsong
parents: 543
diff changeset
   895
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   896
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564
Chengsong
parents: 543
diff changeset
   897
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   898
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   899
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   900
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars \; vs : r^*}
564
Chengsong
parents: 543
diff changeset
   901
\end{mathpar}
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
   902
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   903
\end{figure}
564
Chengsong
parents: 543
diff changeset
   904
\noindent
Chengsong
parents: 543
diff changeset
   905
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
   906
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
   907
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
   908
satisfying $|v| = s$ and $\vdash v:r$ is finite.
601
Chengsong
parents: 591
diff changeset
   909
This additional condition was
Chengsong
parents: 591
diff changeset
   910
imposed by Ausaf and Urban to make their proofs easier.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   911
Given a string and a regular expression, there can be
564
Chengsong
parents: 543
diff changeset
   912
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
   913
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
   914
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
   915
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
   916
Lexers therefore have to disambiguate and choose only
637
Chengsong
parents: 628
diff changeset
   917
one of the values to be generated. $\POSIX$ is one of the
564
Chengsong
parents: 543
diff changeset
   918
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
   919
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   920
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
564
Chengsong
parents: 543
diff changeset
   921
formalised the property 
Chengsong
parents: 543
diff changeset
   922
as a ternary relation.
Chengsong
parents: 543
diff changeset
   923
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   924
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   925
in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   926
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   927
\begin{figure}[p]
564
Chengsong
parents: 543
diff changeset
   928
\begin{mathpar}
Chengsong
parents: 543
diff changeset
   929
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
   930
		
Chengsong
parents: 543
diff changeset
   931
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
   932
Chengsong
parents: 543
diff changeset
   933
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
   934
Chengsong
parents: 543
diff changeset
   935
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
   936
Chengsong
parents: 543
diff changeset
   937
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
   938
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
   939
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
   940
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
   941
Chengsong
parents: 543
diff changeset
   942
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
   943
Chengsong
parents: 543
diff changeset
   944
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
   945
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
   946
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
   947
	(v::vs)}
Chengsong
parents: 543
diff changeset
   948
\end{mathpar}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   949
\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   950
	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   951
This ternary relation, written $(s, r) \rightarrow v$, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   952
formalises the POSIX constraints on the
601
Chengsong
parents: 591
diff changeset
   953
value $v$ given a string $s$ and 
Chengsong
parents: 591
diff changeset
   954
regular expression $r$.
Chengsong
parents: 591
diff changeset
   955
}
646
Chengsong
parents: 639
diff changeset
   956
\label{fig:POSIXDef}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   957
\end{figure}\afterpage{\clearpage}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   958
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   959
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   960
%\begin{figure}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   961
%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   962
%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   963
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   964
%	    (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   965
%	    {$r_{token1}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   966
%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   967
%	    %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   968
%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   969
%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   970
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   971
%	    (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   972
%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   973
%	    \nodepart{two}  $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   974
%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   975
%		\node [above = 1.5cm of middle, minimum width = 6cm, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   976
%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   977
%			(topNode) {$s$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   978
%	    \path[->,draw]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   979
%	        (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   980
%	        (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   981
%		;
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   982
%			
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   983
%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   984
%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   985
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   986
%\end{figure}
637
Chengsong
parents: 628
diff changeset
   987
The above $\POSIX$ rules follow the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   988
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   989
	\item (Left Priority)\\
637
Chengsong
parents: 628
diff changeset
   990
		Match the leftmost regular expression when multiple options for matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   991
		are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   992
		be in the language of $L \; r_1$.
564
Chengsong
parents: 543
diff changeset
   993
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
   994
		Always match a subpart as much as possible before proceeding
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   995
		to the next part of the string.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   996
		For example, when the string $s$ matches 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   997
		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   998
		Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   999
		$r_{part1}$ is preferred by this maximum munch rule.
637
Chengsong
parents: 628
diff changeset
  1000
		The side-condition 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1001
		\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1002
		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1003
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1004
		\end{center}
637
Chengsong
parents: 628
diff changeset
  1005
		in PS causes this.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1006
		%(See
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1007
		%\ref{munch} for an illustration).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1008
\end{itemize}
564
Chengsong
parents: 543
diff changeset
  1009
\noindent
Chengsong
parents: 543
diff changeset
  1010
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
  1011
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1012
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
  1013
\[ 
Chengsong
parents: 543
diff changeset
  1014
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
  1015
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1016
using a regular expression 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1017
for keywords and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1018
identifiers:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1019
%(for example, keyword is a nonempty string starting with letters 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1020
%followed by alphanumeric characters or underscores):
564
Chengsong
parents: 543
diff changeset
  1021
\[
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1022
	r_{keyword} + r_{identifier}.
564
Chengsong
parents: 543
diff changeset
  1023
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1024
If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1025
as an identifier
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1026
where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1027
followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1028
then a match with a keyword (if)
564
Chengsong
parents: 543
diff changeset
  1029
followed by
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1030
an identifier (foo) would be incorrect.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1031
POSIX lexing generates what is included by lexing.
564
Chengsong
parents: 543
diff changeset
  1032
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1033
\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1034
We know that a POSIX 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1035
value for regular expression $r$ is inhabited by $r$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1036
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1037
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1038
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1039
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1040
The main property about a $\POSIX$ value is that 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1041
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1042
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1043
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1044
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1045
\end{lemma}
539
Chengsong
parents: 538
diff changeset
  1046
\begin{proof}
564
Chengsong
parents: 543
diff changeset
  1047
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
  1048
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
  1049
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
  1050
the sequence and star with non-empty iterations.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1051
We shall give the details for proving the sequence case here.
539
Chengsong
parents: 538
diff changeset
  1052
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1053
When we have 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1054
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1055
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
601
Chengsong
parents: 591
diff changeset
  1056
	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1057
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1058
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1059
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1060
we know that the last condition 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1061
excludes the possibility of a 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1062
string $s_1'$ longer than $s_1$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1063
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1064
(s_1', r_1) \rightarrow v_1'   \;\; 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1065
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1066
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1067
hold.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1068
A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1069
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1070
(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1071
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1072
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1073
cannot possibly form a $\POSIX$ value either, because
637
Chengsong
parents: 628
diff changeset
  1074
by definition, there is a candidate
Chengsong
parents: 628
diff changeset
  1075
with a longer initial string
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1076
$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1077
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1078
$s$ must have the 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1079
property that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1080
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1081
	|a| = s_1 \;\; and \;\; |b| = s_2.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1082
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1083
The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1084
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1085
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1086
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1087
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539
Chengsong
parents: 538
diff changeset
  1088
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
  1089
\end{proof}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1090
\noindent
637
Chengsong
parents: 628
diff changeset
  1091
We have now defined what a POSIX value is and shown that it is unique.
Chengsong
parents: 628
diff changeset
  1092
The problem is to generate
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1093
such a value in a lexing algorithm using derivatives.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1094
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1095
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1096
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1097
Sulzmann and Lu extended Brzozowski's 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1098
derivative-based matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1099
to a lexing algorithm by a second phase 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1100
after the initial phase of successive derivatives.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1101
This second phase generates a POSIX value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1102
if the regular expression matches the string.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1103
The algorithm uses two functions called $\inj$ and $\mkeps$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1104
The function $\mkeps$ constructs a POSIX value from the last
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1105
derivative $r_n$:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1106
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1107
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1108
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1109
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1110
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1111
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1112
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1113
\end{ceqn}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1114
\noindent
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1115
In the above diagram, again we assume that
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1116
the input string $s$ is made of $n$ characters
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1117
$c_0c_1 \ldots c_{n-1}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1118
The last derivative operation 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1119
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1120
$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1121
tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1122
regular expression $r_n$, in a POSIX way.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1123
The definition of $\mkeps$ is
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1124
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1125
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
  1126
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1127
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1128
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1129
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1130
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1131
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564
Chengsong
parents: 543
diff changeset
  1132
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1133
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1134
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1135
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1136
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1137
\noindent 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1138
The function prefers the left child $r_1$ of $r_1 + r_2$ 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1139
to match an empty string if there is a choice.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1140
When there is a star to match the empty string,
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1141
we give the $\Stars$ constructor an empty list, meaning
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1142
no iteration is taken.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1143
The result of $\mkeps$ on a $\nullable$ $r$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1144
is a POSIX value for $r$ and the empty string:
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1145
\begin{lemma}\label{mePosix}
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1146
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1147
\end{lemma}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1148
\begin{proof}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1149
	By induction on $r$.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1150
\end{proof}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1151
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1152
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1153
in reverse order as they were chopped off in the derivative phase.
637
Chengsong
parents: 628
diff changeset
  1154
The function for this is called $\inj$. This function 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
  1155
operates on values, unlike $\backslash$ which operates on regular expressions.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1156
In the diagram below, $v_i$ stands for the (POSIX) value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1157
for how the regular expression 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1158
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1159
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1160
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1161
matches $s$. 
601
Chengsong
parents: 591
diff changeset
  1162
\begin{figure}[H]
Chengsong
parents: 591
diff changeset
  1163
\begin{center}	
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1164
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1165
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1166
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1167
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1168
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1169
\end{ceqn}
601
Chengsong
parents: 591
diff changeset
  1170
\end{center}
Chengsong
parents: 591
diff changeset
  1171
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
Chengsong
parents: 591
diff changeset
  1172
	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
Chengsong
parents: 591
diff changeset
  1173
	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
Chengsong
parents: 591
diff changeset
  1174
	$c_1$, and so on. These are the same operations as they have appeared in the matcher
Chengsong
parents: 591
diff changeset
  1175
	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637
Chengsong
parents: 628
diff changeset
  1176
	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
Chengsong
parents: 628
diff changeset
  1177
	the empty string, by always selecting the leftmost 
Chengsong
parents: 628
diff changeset
  1178
	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
601
Chengsong
parents: 591
diff changeset
  1179
	appeared in the string, always preserving POSIXness.}\label{graph:inj}
Chengsong
parents: 591
diff changeset
  1180
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1181
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1182
The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1183
takes three arguments: a regular
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1184
expression ${r_{i}}$, before the character is chopped off, 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1185
a character ${c_{i}}$ (the character we want to inject back) and 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1186
the third argument $v_{i+1}$ the value we want to inject into. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1187
The result of an application 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1188
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1189
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1190
	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1191
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1192
holds.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1193
The definition of $\textit{inj}$ is as follows: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1194
\begin{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1195
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1196
  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1197
  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1198
  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1199
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1200
  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1201
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1202
  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1203
  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1204
  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1205
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1206
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1207
\noindent 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1208
The function recurses on 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1209
the shape of regular
637
Chengsong
parents: 628
diff changeset
  1210
expressions and values.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1211
Intuitively, each clause analyses 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1212
how $r_i$ could have transformed when being 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1213
derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1214
of $v_{i+1}$ has the ``hole'' 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1215
to inject the character back into.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1216
Once the character is
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1217
injected back to that sub-value; 
637
Chengsong
parents: 628
diff changeset
  1218
$\inj$ assembles all parts
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1219
to form a new value.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1220
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1221
For instance, the last clause is an
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1222
injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1223
whose second child
637
Chengsong
parents: 628
diff changeset
  1224
value is a star and the shape of the 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1225
regular expression $r_i$ before injection 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1226
is a star.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1227
We therefore know 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1228
the derivative 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1229
starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1230
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1231
	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1232
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1233
during which an iteration of the star
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1234
had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1235
value inhabitation relation:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1236
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1237
	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1238
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1239
The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1240
matched star iterations,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1241
and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1242
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1243
	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1244
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1245
Finally, 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1246
$\inj \; r \;c \; v$ is prepended
637
Chengsong
parents: 628
diff changeset
  1247
to the previous list of iterations and then
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1248
wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1249
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1250
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1251
Putting all together, Sulzmann and Lu obtained the following algorithm
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1252
outlined in the injection-based lexing diagram \ref{graph:inj}:
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1253
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1254
\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1255
	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1256
	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1257
	& & $\quad \phantom{\mid}\; \None \implies \None$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1258
	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1259
\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1260
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1261
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1262
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1263
\subsection{Examples on How Injection and Lexer Works}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1264
We will provide a few examples on how $\inj$ and $\lexer$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1265
works by showing their values in each recursive call on some 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1266
concrete examples.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1267
We start with the call $\lexer \; (a+aa)^*\cdot c\; aac$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1268
on the lexer, note that the value's character constructor
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1269
$\Char \;c$ is abbreviated as $c$ for readability.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1270
Similarly the last derivative's sub-expression
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1271
is abbreviated as $r_{=0}$\footnote{which is equal to
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1272
$((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1273
((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*),
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1274
$.}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1275
whose language interpretation is equivalent to that of
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1276
$\ZERO$ and therefore not crucial to be displayed fully expanded,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1277
as they will not be injected into.%our example run.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1278
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1279
	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1280
		$(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1281
				   %& $\stackrel{\rightarrow}{\backslash a}$ & $((\ONE + \ONE a)\cdot (a+aa)^*)\cdot c + \ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1282
				    & $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$  &
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1283
				    $(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1284
				    & $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$  &
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1285
				    $((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1286
				    & $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1287
				   & $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ & 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1288
				   $\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1289
				   & $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ & 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1290
				   $\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) )
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1291
				   \;\Stars \,[]) \; \textcolor{red}{c})$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1292
				   & $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1293
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1294
				   %$\inj \; r \; c\A$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1295
		%$\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1296
	\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1297
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1298
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1299
We have assigned different colours for each character,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1300
as well as their corresponding locations in values and regular expressions.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1301
The most recently injected character is marked with a bold font.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1302
%Their corresponding derivative steps have been marked with the same
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1303
%colour. We also mark the specific location where the coloured character
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1304
%was injected into with the same colour, before and after the injection occurred.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1305
%TODO: can be also used to motivate injection->blexer in Bitcoded1
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1306
To show the details of how $\inj$ works,
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1307
we zoom in to the second injection above, 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1308
illustrating the recursive calls involved: 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1309
\begin{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1310
	\begin{tabular}{lcl}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1311
$\inj \;\quad ((\ONE + {\ONE} 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1312
	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1313
	c + \ZERO \; \quad \textcolor{blue}{a} \; $ & &\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1314
$\quad \Left \; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1315
(\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1316
	\textcolor{blue}{\Empty})) \; \Stars \, [])) \; c))$ & \\$=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1317
	$\Left \; (\inj \; ((\ONE + \ONE 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1318
	\textcolor{blue}{a})\cdot (a+aa)^*)\cdot 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1319
	c \quad  \textcolor{blue}{a} \quad (\Left \; (\Seq\; ( \Left \; (\Seq \; (\Right \; (\Right\; 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1320
	\textcolor{blue}{\Empty})) \; \Stars \, []) ) \; c ) )\;\;)$ &\\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1321
	$\Left \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^* \quad \textcolor{blue}{a} \quad
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1322
	(\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})))) ) \; c ) $ & \\$=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1323
	$\Left \; (\Seq \; (\Seq \; (\inj \quad (\ONE + \ONE \textcolor{blue}{a}) \quad \textcolor{blue}{a}\quad \Right \;(\Right \; \textcolor{blue}{\Empty}) ) \; \Stars \,[])\; c)$ &\\ $=$ \\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1324
	$\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; \ONE \textcolor{blue}{a} \quad \textcolor{blue}{a}\quad (\Right \;\textcolor{blue}{\Empty})))) \; \Stars \, [])$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1325
																		   & \\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1326
																		   $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty})) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1327
																		   & \\ $=$\\
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1328
																		   $\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a}) ) ) \; \Stars \, [] )$
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1329
 	\end{tabular}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1330
\end{center}
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1331
\noindent
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1332
We will now introduce the proofs related to $\inj$ and
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1333
$\lexer$. These proofs have originally been found by Ausaf et al.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1334
in their 2016 work \cite{AusafDyckhoffUrban2016}.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1335
Proofs to some of these lemmas have also been discussed in more detail in 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1336
Ausaf's thesis \cite{Ausaf}.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1337
Nevertheless, we still introduce these proofs, to make this thesis
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1338
self-contained as we show inductive variants used in these proofs break
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1339
after more simplifications are introduced.
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1340
%Also note that lemmas like \ref{}described in this thesis is a more faithful
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1341
%representation of the actual accompanying Isabelle formalisation, and 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1342
%therefore 
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1343
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1344
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1345
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1346
Recall that lemma 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1347
\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1348
$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1349
The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1350
the value before injection is POSIX, namely
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1351
\begin{lemma}\label{injPosix}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1352
	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1353
	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1354
\end{lemma}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1355
\begin{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1356
	By induction on $r$.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1357
	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1358
	When $r = a \cdot b$, there can be
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1359
	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1360
	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1361
	case.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1362
	\begin{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1363
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1364
			$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1365
			The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1366
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1367
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1368
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1369
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1370
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1371
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1372
			We know that there exists a unique pair of
637
Chengsong
parents: 628
diff changeset
  1373
			$s_a$ and $s_b$ satisfying	
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1374
				$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1375
				$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1376
				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1377
				L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1378
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1379
			The last condition gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1380
			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1381
				L \; a \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1382
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1383
			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1384
			and this gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1385
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1386
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1387
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1388
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1389
			$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1390
			The argument is almost identical to the above case,	
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1391
			except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1392
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1393
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1394
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1395
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1396
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1397
			\end{center}
637
Chengsong
parents: 628
diff changeset
  1398
			With similar reasoning, 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1399
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1400
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1401
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1402
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1403
			again holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1404
		\item 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1405
			$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1406
			Again the injection result would be 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1407
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1408
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1409
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1410
					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1411
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1412
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1413
			We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1414
			allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1415
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1416
				(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1417
			\]
637
Chengsong
parents: 628
diff changeset
  1418
			Also, by inductive hypothesis
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1419
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1420
				(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1421
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1422
			holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1423
			In addition, as
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1424
			$\Right \;v_b$  instead of $\Left \ldots$ is 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1425
			the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1426
			that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1427
			This tells us that 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1428
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1429
				\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1430
				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1431
				\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1432
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1433
			which translates to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1434
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1435
				\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1436
				s_3 @s_4 = c::s  \land s_3 \in L \; a 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1437
				\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1438
			\]
637
Chengsong
parents: 628
diff changeset
  1439
			(Which says there cannot be a longer 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1440
			initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1441
			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1442
			as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1443
	\end{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1444
	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1445
\end{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1446
\noindent
667
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1447
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1448
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1449
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1450
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1451
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1452
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1453
660cf698eb26 added example of how inj and lexer works
Chengsong
parents: 666
diff changeset
  1454
%TODO: Cut from previous lexer def, need to make coherent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1455
The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1456
according to
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1457
POSIX rules. 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1458
\begin{theorem}\label{lexerCorrectness}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1459
	The $\lexer$ based on derivatives and injections is correct: 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1460
	\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1461
		\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1462
			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1463
			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1464
		\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1465
	\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1466
\end{theorem} 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1467
\begin{proof}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1468
By induction on $s$. $r$ generalising over an arbitrary regular expression.
637
Chengsong
parents: 628
diff changeset
  1469
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1470
by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1471
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1472
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1473
As we did earlier in this chapter with the matcher, one can 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1474
introduce simplification on the regular expression in each derivative step.
637
Chengsong
parents: 628
diff changeset
  1475
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
Chengsong
parents: 628
diff changeset
  1476
and ensure that
Chengsong
parents: 628
diff changeset
  1477
the values align with the regular expression at each step.
539
Chengsong
parents: 538
diff changeset
  1478
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1479
be careful not to break the correctness, as the injection 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1480
function heavily relies on the structure of 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1481
the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1482
This can be achieved by recording some extra rectification functions
637
Chengsong
parents: 628
diff changeset
  1483
during the derivatives step and applying these rectifications in 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1484
each run during the injection phase.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1485
With extra care
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1486
one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1487
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1488
\begin{center}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1489
	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1490
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1491
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1492
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1493
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1494
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1495
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1496
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1497
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1498
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1499
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1500
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1501
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1502
				   
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1503
	\end{tabular}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1504
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1505
637
Chengsong
parents: 628
diff changeset
  1506
However, one can still end up
Chengsong
parents: 628
diff changeset
  1507
with exploding derivatives,
Chengsong
parents: 628
diff changeset
  1508
even with the simple-minded simplification rules allowed
Chengsong
parents: 628
diff changeset
  1509
in an injection-based lexer.
Chengsong
parents: 628
diff changeset
  1510
\section{A Case Requiring More Aggressive Simplifications}
539
Chengsong
parents: 538
diff changeset
  1511
For example, when starting with the regular
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1512
expression $(a^* \cdot a^*)^*$ and building just over
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1513
a dozen successive derivatives 
539
Chengsong
parents: 538
diff changeset
  1514
w.r.t.~the character $a$, one obtains a derivative regular expression
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1515
with millions of nodes (when viewed as a tree)
637
Chengsong
parents: 628
diff changeset
  1516
even with the mentioned simplifications.
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1517
\begin{figure}[H]
601
Chengsong
parents: 591
diff changeset
  1518
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1519
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1520
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1521
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1522
    ylabel={size},
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1523
    legend entries={Simple-Minded Simp, Naive Matcher},  
539
Chengsong
parents: 538
diff changeset
  1524
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1525
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1526
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1527
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539
Chengsong
parents: 538
diff changeset
  1528
\end{axis}
Chengsong
parents: 538
diff changeset
  1529
\end{tikzpicture} 
601
Chengsong
parents: 591
diff changeset
  1530
\end{center}
539
Chengsong
parents: 538
diff changeset
  1531
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
  1532
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
  1533
   
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1534
That is because Sulzmann and Lu's 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1535
injection-based lexing algorithm keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1536
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
  1537
These different ways of matching will grow exponentially with the string length.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1538
Consider the case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1539
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1540
	r= (a^*\cdot a^*)^* \quad and \quad
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1541
	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1542
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1543
as an example.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1544
This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1545
many ways to split up the string into multiple segments for
637
Chengsong
parents: 628
diff changeset
  1546
different star iterations,
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1547
and for each segment 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1548
multiple ways of splitting between 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1549
the two $a^*$ sub-expressions.
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1550
When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1551
the match:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1552
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1553
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1554
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1555
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1556
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1557
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1558
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1559
The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1560
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1561
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1562
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1563
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1564
correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1565
When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1566
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1567
	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1568
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1569
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1570
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1571
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1572
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1573
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1574
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1575
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1576
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1577
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1578
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1579
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1580
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1581
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1582
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1583
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1584
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1585
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1586
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1587
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1588
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1589
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1590
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1591
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1592
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1593
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1594
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1595
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1596
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1597
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1598
		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1599
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1600
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1601
And $\derssimp \; aa \; (a^*a^*)^*$ is
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1602
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1603
	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1604
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1605
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1606
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1607
seven distinct lexical values.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1608
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1609
It is not surprising that there are exponentially many 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1610
distinct lexical values that cannot be eliminated by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1611
the simple-minded simplification of $\derssimp$. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1612
A lexer without a good enough strategy to 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1613
deduplicate will naturally
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1614
have an exponential runtime on highly
637
Chengsong
parents: 628
diff changeset
  1615
ambiguous regular expressions because there
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1616
are exponentially many matches.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1617
For this particular example, it seems
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1618
that the number of distinct matches growth
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1619
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1620
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1621
On the other hand, the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1622
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1623
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1624
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1625
	\Stars\,
637
Chengsong
parents: 628
diff changeset
  1626
	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1627
\]
637
Chengsong
parents: 628
diff changeset
  1628
At any moment, the  subterms in a regular expression
Chengsong
parents: 628
diff changeset
  1629
that will potentially result in a POSIX value is only
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1630
a minority among the many other terms,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1631
and one can remove the ones that are not possible to 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1632
be POSIX.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1633
In the above example,
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1634
\begin{equation}\label{eqn:growth2}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1635
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1636
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1637
\end{equation}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1638
can be further simplified by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1639
removing the underlined term first,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1640
which would open up possibilities
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1641
of further simplification that removes the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1642
underbraced part.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1643
The result would be 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1644
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1645
	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1646
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1647
with corresponding values
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1648
\begin{center}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1649
	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1650
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1651
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1652
	\end{tabular}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1653
\end{center}
637
Chengsong
parents: 628
diff changeset
  1654
Other terms with an underlying value, such as
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1655
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1656
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1657
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1658
do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1659
and therefore can be thrown away.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1660
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1661
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
639
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1662
have come up with some simplification steps, 
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1663
and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1664
They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1665
\[
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1666
	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1667
\]
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1668
The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1669
involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1670
which we omit here.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1671
however those steps
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1672
are not yet sufficiently strong, to achieve the above effects.
637
Chengsong
parents: 628
diff changeset
  1673
And even with these relatively mild simplifications, the proof
Chengsong
parents: 628
diff changeset
  1674
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1675
One would need to prove something like this: 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1676
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1677
	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1678
	\textit{then}\;\; (r, c::s) \rightarrow 
637
Chengsong
parents: 628
diff changeset
  1679
	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1680
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1681
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1682
not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1683
but also what specific simplifications 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1684
have been done as a function on values
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1685
showing how one can transform the value
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1686
underlying the simplified regular expression
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1687
to the unsimplified one.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1688
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1689
We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1690
also described by Sulzmann and Lu to
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1691
get better simplifications, which uses
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1692
some augmented data structures compared to 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1693
plain regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1694
We call them \emph{annotated}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1695
regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1696
With annotated regular expressions,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1697
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1698
second phase altogether.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1699
We introduce this new datatype and the 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1700
corresponding algorithm in the next chapter.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1701
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1702
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1703
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1704