ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Fri, 30 Dec 2022 17:37:51 +0000
changeset 639 80cc6dc4c98b
parent 638 dd9dde2d902b
child 646 56057198e4f5
permissions -rwxr-xr-x
until chap 7
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    10
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    11
In this chapter, we define the basic notions 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    12
for regular languages and regular expressions.
583
Chengsong
parents: 579
diff changeset
    13
This is essentially a description in ``English''
637
Chengsong
parents: 628
diff changeset
    14
the functions and datatypes of our formalisation in Isabelle/HOL.
Chengsong
parents: 628
diff changeset
    15
We also define what $\POSIX$ lexing means, 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
    16
followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
564
Chengsong
parents: 543
diff changeset
    17
that produces the output conforming
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    18
to the $\POSIX$ standard\footnote{In what follows 
583
Chengsong
parents: 579
diff changeset
    19
we choose to use the Isabelle-style notation
564
Chengsong
parents: 543
diff changeset
    20
for function applications, where
583
Chengsong
parents: 579
diff changeset
    21
the parameters of a function are not enclosed
564
Chengsong
parents: 543
diff changeset
    22
inside a pair of parentheses (e.g. $f \;x \;y$
Chengsong
parents: 543
diff changeset
    23
instead of $f(x,\;y)$). This is mainly
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    24
to make the text visually more concise.}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    25
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    26
\section{Basic Concepts}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    27
Formal language theory usually starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    28
denoting a set of characters.
637
Chengsong
parents: 628
diff changeset
    29
Here we use the datatype of characters from Isabelle,
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    30
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
    31
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    32
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    33
Then using the usual bracket notation for lists,
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    34
we can define strings made up of characters as: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    35
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    36
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    37
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    38
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    39
\end{center}
583
Chengsong
parents: 579
diff changeset
    40
where $c$ is a variable ranging over characters.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    41
The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
    42
list.
637
Chengsong
parents: 628
diff changeset
    43
For brevity, a singleton list is sometimes written as $[c]$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    44
Strings can be concatenated to form longer strings in the same
637
Chengsong
parents: 628
diff changeset
    45
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    46
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    47
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    48
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    49
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    50
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    51
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    52
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    53
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    54
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    55
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    56
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    57
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    58
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    59
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    60
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    61
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    62
\end{center}
564
Chengsong
parents: 543
diff changeset
    63
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
    64
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    65
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    66
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
    67
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    68
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    69
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    70
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    71
\noindent
564
Chengsong
parents: 543
diff changeset
    72
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    73
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
    74
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    75
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    76
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    77
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    78
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
    79
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    80
564
Chengsong
parents: 543
diff changeset
    81
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    82
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    83
\end{center}
564
Chengsong
parents: 543
diff changeset
    84
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    85
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    86
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    87
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    88
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    89
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    90
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    91
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    92
\noindent
583
Chengsong
parents: 579
diff changeset
    93
This can be generalised to ``chopping off'' a string 
Chengsong
parents: 579
diff changeset
    94
from all strings within a set $A$, 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    95
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    96
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    97
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    98
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    99
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   100
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   101
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   102
which is essentially the left quotient $A \backslash L$ of $A$ against 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   103
the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   104
However, for our purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   105
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   106
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   107
The reason for defining derivatives
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   108
is that they provide another approach
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   109
to test membership of a string in 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   110
a set of strings. 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   111
For example, to test whether the string
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   112
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   113
respect to the string $bar$:
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   114
\begin{center}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   115
\begin{tabular}{lll}
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   116
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   117
	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   118
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   119
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   120
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   121
\end{tabular}	
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   122
\end{center}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   123
\noindent
637
Chengsong
parents: 628
diff changeset
   124
and in the end, test whether the set
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   125
contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   126
	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   127
	clear from the context that we are operating
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   128
on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   129
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   130
In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   131
then we can test whether $s$ is in $S$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   132
by testing whether $[] \in S \backslash s$.
564
Chengsong
parents: 543
diff changeset
   133
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   134
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   135
sub-languages.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   136
For example, for the sequence operator, we have
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   137
something similar to a ``chain rule'':
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   138
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   139
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   140
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   141
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   142
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   143
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   144
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   145
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   146
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   147
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   148
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   149
and get to $B$.
583
Chengsong
parents: 579
diff changeset
   150
The language derivative for $A*$ can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   151
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   152
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   153
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   154
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   155
\begin{proof}
583
Chengsong
parents: 579
diff changeset
   156
There are two inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   157
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   158
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   159
The set 
637
Chengsong
parents: 628
diff changeset
   160
\[ S_1 = \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   161
is enclosed in the set
637
Chengsong
parents: 628
diff changeset
   162
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
Chengsong
parents: 628
diff changeset
   163
This is because for any string $c::s$ satisfying $c::s \in A*$,
Chengsong
parents: 628
diff changeset
   164
%whenever you have a string starting with a character 
Chengsong
parents: 628
diff changeset
   165
%in the language of a Kleene star $A*$, 
Chengsong
parents: 628
diff changeset
   166
%then that
Chengsong
parents: 628
diff changeset
   167
the character $c$, together with a prefix of $s$
Chengsong
parents: 628
diff changeset
   168
%immediately after $c$ 
Chengsong
parents: 628
diff changeset
   169
forms the first iteration of $A*$, 
Chengsong
parents: 628
diff changeset
   170
and the rest of the $s$ is also $A*$.
Chengsong
parents: 628
diff changeset
   171
This coincides with the definition of $S_2$.
564
Chengsong
parents: 543
diff changeset
   172
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   173
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   174
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
583
Chengsong
parents: 579
diff changeset
   175
holds.
Chengsong
parents: 579
diff changeset
   176
Also the following holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   177
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
   178
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
   179
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
   180
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   181
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   182
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   183
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   184
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   185
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   186
for regular expressions.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   187
To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   188
which we shall do next.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   189
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   190
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
   191
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   192
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   193
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   194
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   195
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   196
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   197
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   198
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   199
\noindent
564
Chengsong
parents: 543
diff changeset
   200
We call them basic because we will introduce
637
Chengsong
parents: 628
diff changeset
   201
additional constructors in later chapters, such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   202
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
   203
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
   204
matches no string, and $\ONE$ for the regular
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   205
expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   206
Some authors
564
Chengsong
parents: 543
diff changeset
   207
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   208
but we prefer this notation.} 
564
Chengsong
parents: 543
diff changeset
   209
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
   210
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
   211
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
   212
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
   213
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
   214
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
   215
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   216
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   217
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   218
\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
   219
$L \; \ZERO$ & $\dn$ & $\phi$\\
Chengsong
parents: 543
diff changeset
   220
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
   221
$L \; c$ & $\dn$ & $\{[c]\}$\\
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   222
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   223
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   224
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   225
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   226
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   227
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   228
%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   229
%their language interpretations in place, we are ready to define derivatives on regular expressions.
637
Chengsong
parents: 628
diff changeset
   230
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   231
We do so by first introducing what properties they should satisfy.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   232
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   233
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
   234
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
   235
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
   236
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
   237
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
   238
%will look as if it was obtained by doing a language derivative on $L(r)$:
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   239
%Recall that the language derivative acts on a 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   240
%language (set of strings).
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   241
%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   242
%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   243
%that string and then checking whether the empty 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   244
%string is in the derivative:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   245
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   246
%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   247
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   248
%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   249
%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   250
%  \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   251
%  \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   252
%  \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   253
%    \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   254
%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   255
%    \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   256
%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   257
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   258
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   259
%\notate{$\{ \ldots ,\;$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   260
%	\notate{s}{1}{$(c_1 :: s_1)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   261
%	$, \; \ldots \}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   262
%}{1}{$S_{start}$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   263
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   264
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   265
%	$\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   266
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   267
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   268
%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$} 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   269
%	$,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   270
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   271
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   272
%	$\stackrel{\backslash c_2}{\longrightarrow}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   273
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   274
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   275
%	$\{ \ldots,\;  s_2
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   276
%	,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   277
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   278
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   279
%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $	
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   280
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   281
%\Longstack{
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   282
%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   283
%	S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   284
%}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   285
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   286
%\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   287
%	$s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   288
%\end{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   289
%\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   290
Brzozowski noticed that $\Der$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   291
can be ``mirrored'' on regular expressions which
564
Chengsong
parents: 543
diff changeset
   292
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
   293
with respect to a character $c$, written
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   294
$r \backslash c$. This infix operator
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   295
takes regular expression $r$ as input
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   296
and a character as a right operand.
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   297
The derivative operation on regular expression
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   298
is defined such that the language of the derivative result 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   299
coincides with the language of the original 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   300
regular expression being taken 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   301
derivative with respect to the same characters, namely
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   302
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   303
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   304
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   305
	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   306
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   307
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   308
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   309
Pictorially, this looks as follows:\\
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   310
\vspace{3mm}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   311
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   312
\parskip \baselineskip
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   313
\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   314
\def\rlwd{.5pt}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   315
\newcommand\notate[3]{%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   316
  \unskip\def\useanchorwidth{T}%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   317
  \setbox0=\hbox{#1}%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   318
  \def\stackalignment{c}\stackunder[-6pt]{%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   319
    \def\stackalignment{c}\stackunder[-1.5pt]{%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   320
      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   321
    \rule{\rlwd}{#2\baselineskip}}}{%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   322
  \strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   323
}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   324
\Longstack{
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   325
	\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, 
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   326
\;\ldots\}$}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   327
}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   328
\Longstack{
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   329
	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   330
}
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   331
\Longstack{
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   332
	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
577
f47fc4840579 thesis chap2
Chengsong
parents: 573
diff changeset
   333
	\{\ldots,\;s_1,\;\ldots\}$}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   334
}\\ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   335
\vspace{ 3mm }
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   336
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   337
The derivatives on regular expression can again be 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   338
generalised to strings.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   339
One could compute $r_{start} \backslash s$  and test membership of $s$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   340
in $L \; r_{start}$ by checking
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   341
whether the empty string is in the language of 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   342
$r_{end}$ (that is $r_{start}\backslash s$).\\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   343
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   344
\vspace{2mm}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   345
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   346
	\notate{$r_{start}$}{4}{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   347
		\Longstack{$L \; r_{start} = \{\ldots, \;$ 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   348
			\notate{$s$}{1}{$c_1::s_1$} 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   349
		$, \ldots\} $
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   350
		}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   351
	} 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   352
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   353
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   354
	$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   355
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   356
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   357
	\notate{$r_1$}{3}{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   358
		$r_1 = r_{start}\backslash c_1$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   359
		$L \; r_1 = $
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   360
	\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   361
		$\{ \ldots,\;$  \notate{$s_1$}{1}{$c_2::s_2$} 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   362
		$,\; \ldots \}$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   363
	}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   364
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   365
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   366
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   367
	$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$ 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   368
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   369
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   370
	$r_2$	
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   371
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   372
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   373
	$  \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $	
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   374
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   375
\Longstack{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   376
	\notate{$r_{end}$}{1}{
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   377
	$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   378
}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   379
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   380
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   381
We have the property that
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   382
\begin{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   383
	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   384
\end{property}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   385
\noindent
637
Chengsong
parents: 628
diff changeset
   386
Next, we give the recursive definition of derivative on
Chengsong
parents: 628
diff changeset
   387
regular expressions so that it satisfies the properties above.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   388
%The derivative function, written $r\backslash c$, 
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   389
%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   390
%returns a new regular expression representing
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   391
%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   392
%being taken the language derivative with respect to $c$.
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   393
\begin{table}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   394
	\begin{center}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   395
\begin{tabular}{lcl}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   396
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   397
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   398
		$d \backslash c$     & $\dn$ & 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   399
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   400
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   401
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   402
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   403
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   404
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   405
\end{tabular}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   406
	\end{center}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   407
\caption{Derivative on Regular Expressions}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   408
\label{table:der}
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   409
\end{table}
564
Chengsong
parents: 543
diff changeset
   410
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   411
The most involved cases are the sequence case
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   412
and the star case.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   413
The sequence case says that if the first regular expression
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   414
contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   415
needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   416
result of this derivative:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   417
\begin{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   418
	\begin{tabular}{lcl}
583
Chengsong
parents: 579
diff changeset
   419
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
Chengsong
parents: 579
diff changeset
   420
		$\textit{if}\;\,([] \in L(r_1))\; 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   421
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   422
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   423
	\end{tabular}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   424
\end{center}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   425
\noindent
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   426
Notice how this closely resembles
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   427
the language derivative operation $\Der$:
564
Chengsong
parents: 543
diff changeset
   428
\begin{center}
Chengsong
parents: 543
diff changeset
   429
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   430
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
   431
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
   432
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
   433
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
   434
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
   435
	\end{tabular}
Chengsong
parents: 543
diff changeset
   436
\end{center}
Chengsong
parents: 543
diff changeset
   437
\noindent
583
Chengsong
parents: 579
diff changeset
   438
The derivative of the star regular expression $r^*$ 
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   439
unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   440
and attaches the original $r^*$
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   441
after $r\backslash c$, so that 
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   442
we can further unfold it as many times as needed:
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   443
\[
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   444
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   445
\]
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   446
Again,
637
Chengsong
parents: 628
diff changeset
   447
the structure is the same as the language derivative of the Kleene star: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   448
\[
564
Chengsong
parents: 543
diff changeset
   449
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   450
\]
564
Chengsong
parents: 543
diff changeset
   451
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
   452
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
   453
boolean condition 
Chengsong
parents: 543
diff changeset
   454
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
   455
somehow recursively computed.
Chengsong
parents: 543
diff changeset
   456
We call such a function that checks
Chengsong
parents: 543
diff changeset
   457
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
   458
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
   459
\begin{center}
Chengsong
parents: 543
diff changeset
   460
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   461
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
   462
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   463
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
   464
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   465
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   466
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   467
		\end{tabular}
Chengsong
parents: 543
diff changeset
   468
\end{center}
Chengsong
parents: 543
diff changeset
   469
\noindent
Chengsong
parents: 543
diff changeset
   470
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
   471
does not contain any string and
Chengsong
parents: 543
diff changeset
   472
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
   473
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
   474
by definition. 
Chengsong
parents: 543
diff changeset
   475
The character regular expression $c$
Chengsong
parents: 543
diff changeset
   476
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
   477
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
   478
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
   479
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
   480
The sequence regular expression
Chengsong
parents: 543
diff changeset
   481
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
   482
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
   483
is always nullable because it naturally
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   484
contains the empty string.  
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   485
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   486
We have the following two correspondences between 
564
Chengsong
parents: 543
diff changeset
   487
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
   488
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
   489
\begin{lemma}\label{derDer}
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   490
	\mbox{}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   491
	\begin{itemize}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   492
		\item
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   493
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   494
\item
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   495
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   496
	\end{itemize}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   497
\end{lemma}
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   498
\begin{proof}
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   499
	By induction on $r$.
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   500
\end{proof}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   501
\noindent
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   502
which are the main properties of derivatives
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   503
that enables us later to reason about the correctness of
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   504
derivative-based matching.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   505
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   506
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   507
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   508
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   509
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   510
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583
Chengsong
parents: 579
diff changeset
   511
$r \backslash_s [\,] $ & $\dn$ & $r$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   512
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   513
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   514
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   515
\noindent
564
Chengsong
parents: 543
diff changeset
   516
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
   517
omit the subscript and use $\backslash$ instead
583
Chengsong
parents: 579
diff changeset
   518
of $\backslash_s$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   519
string derivatives for brevity.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   520
Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   521
regular-expression matching algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   522
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   523
\begin{definition}
564
Chengsong
parents: 543
diff changeset
   524
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   525
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   526
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   527
\noindent
637
Chengsong
parents: 628
diff changeset
   528
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
Chengsong
parents: 628
diff changeset
   529
this algorithm, presented graphically, is as follows:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   530
601
Chengsong
parents: 591
diff changeset
   531
\begin{equation}\label{matcher}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   532
\begin{tikzcd}
583
Chengsong
parents: 579
diff changeset
   533
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & 
Chengsong
parents: 579
diff changeset
   534
r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & 
Chengsong
parents: 579
diff changeset
   535
\;\textrm{true}/\textrm{false}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   536
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   537
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   538
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   539
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   540
 It is relatively
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   541
easy to show that this matcher is correct, namely
539
Chengsong
parents: 538
diff changeset
   542
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
   543
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
   544
\end{lemma}
Chengsong
parents: 538
diff changeset
   545
\begin{proof}
637
Chengsong
parents: 628
diff changeset
   546
	By induction on $s$ using the property of derivatives:
583
Chengsong
parents: 579
diff changeset
   547
	lemma \ref{derDer}.
539
Chengsong
parents: 538
diff changeset
   548
\end{proof}
601
Chengsong
parents: 591
diff changeset
   549
\begin{figure}
564
Chengsong
parents: 543
diff changeset
   550
\begin{center}
539
Chengsong
parents: 538
diff changeset
   551
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   552
\begin{axis}[
Chengsong
parents: 538
diff changeset
   553
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   554
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
   555
    %ymode = log,
539
Chengsong
parents: 538
diff changeset
   556
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
   557
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   558
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   559
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
   560
\end{axis}
Chengsong
parents: 538
diff changeset
   561
\end{tikzpicture} 
583
Chengsong
parents: 579
diff changeset
   562
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
Chengsong
parents: 579
diff changeset
   563
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
Chengsong
parents: 579
diff changeset
   564
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601
Chengsong
parents: 591
diff changeset
   565
\end{center}
539
Chengsong
parents: 538
diff changeset
   566
\end{figure}
Chengsong
parents: 538
diff changeset
   567
\noindent
564
Chengsong
parents: 543
diff changeset
   568
If we implement the above algorithm naively, however,
Chengsong
parents: 543
diff changeset
   569
the algorithm can be excruciatingly slow, as shown in 
Chengsong
parents: 543
diff changeset
   570
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
   571
Note that both axes are in logarithmic scale.
637
Chengsong
parents: 628
diff changeset
   572
Around two dozen characters
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   573
this algorithm already ``explodes'' with the regular expression 
564
Chengsong
parents: 543
diff changeset
   574
$(a^*)^*b$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   575
To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   576
rules for the intermediate results,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   577
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539
Chengsong
parents: 538
diff changeset
   578
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
   579
language of the regular expression.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   580
One simple-minded simplification function
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   581
that achieves these requirements 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   582
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564
Chengsong
parents: 543
diff changeset
   583
\begin{center}
Chengsong
parents: 543
diff changeset
   584
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   585
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
   586
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
   587
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   588
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   589
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   590
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   591
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
   592
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
   593
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   594
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   595
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583
Chengsong
parents: 579
diff changeset
   596
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564
Chengsong
parents: 543
diff changeset
   597
				   
Chengsong
parents: 543
diff changeset
   598
	\end{tabular}
Chengsong
parents: 543
diff changeset
   599
\end{center}
Chengsong
parents: 543
diff changeset
   600
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
   601
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
   602
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
   603
\begin{center}
Chengsong
parents: 543
diff changeset
   604
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   605
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
   606
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
   607
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
   608
	\end{tabular}
Chengsong
parents: 543
diff changeset
   609
\end{center}
Chengsong
parents: 543
diff changeset
   610
\begin{figure}
539
Chengsong
parents: 538
diff changeset
   611
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   612
\begin{axis}[
Chengsong
parents: 538
diff changeset
   613
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   614
    ylabel={time in secs},
601
Chengsong
parents: 591
diff changeset
   615
    %ymode = log,
Chengsong
parents: 591
diff changeset
   616
    %xmode = log,
564
Chengsong
parents: 543
diff changeset
   617
    grid = both,
539
Chengsong
parents: 538
diff changeset
   618
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
   619
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   620
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   621
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
   622
\end{axis}
564
Chengsong
parents: 543
diff changeset
   623
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
   624
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
   625
against 
Chengsong
parents: 543
diff changeset
   626
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
   627
\end{figure}
Chengsong
parents: 543
diff changeset
   628
\noindent
Chengsong
parents: 543
diff changeset
   629
The running time of $\textit{ders}\_\textit{simp}$
583
Chengsong
parents: 579
diff changeset
   630
on the same example of Figure \ref{NaiveMatcher}
Chengsong
parents: 579
diff changeset
   631
is now ``tame''  in terms of the length of inputs,
Chengsong
parents: 579
diff changeset
   632
as shown in Figure \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
   633
637
Chengsong
parents: 628
diff changeset
   634
So far, the story is use Brzozowski derivatives and
Chengsong
parents: 628
diff changeset
   635
simplify as much as possible, and at the end test
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   636
whether the empty string is recognised 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   637
by the final derivative.
583
Chengsong
parents: 579
diff changeset
   638
But what if we want to 
Chengsong
parents: 579
diff changeset
   639
do lexing instead of just getting a true/false answer?
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   640
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   641
elegant (arguably as beautiful as the definition of the 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   642
Brzozowski derivative) solution for this.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   643
539
Chengsong
parents: 538
diff changeset
   644
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
   645
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
   646
algorithm.
Chengsong
parents: 543
diff changeset
   647
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
   648
respect to the input string,
Chengsong
parents: 543
diff changeset
   649
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
   650
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
   651
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
   652
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
   653
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
   654
called \emph{value} or
608
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   655
sometimes also \emph{lexical value}.  
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   656
Values and regular
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   657
expressions correspond to each other 
37b6fd310a16 added related work chap
Chengsong
parents: 601
diff changeset
   658
as illustrated in the following
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   659
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   660
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   661
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   662
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   663
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   664
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   665
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   666
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   667
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   668
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   669
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   670
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   671
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   672
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   673
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   674
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   675
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   676
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   677
			&        & $\Empty$   \\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   678
			& $\mid$ & $\Char \; c$          \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   679
			& $\mid$ & $\Seq\,v_1\, v_2$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   680
			& $\mid$ & $\Left \;v$   \\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   681
			& $\mid$ & $\Right\;v$  \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   682
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   683
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   684
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   685
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   686
\noindent
564
Chengsong
parents: 543
diff changeset
   687
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
   688
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
   689
\begin{center}
Chengsong
parents: 543
diff changeset
   690
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   691
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
   692
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   693
		$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   694
		$|\Left \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   695
		$|\Right \; v|$ & $ \dn$ & $ |v|$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   696
		$|\Stars \; []|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   697
		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
564
Chengsong
parents: 543
diff changeset
   698
	\end{tabular}
Chengsong
parents: 543
diff changeset
   699
\end{center}
Chengsong
parents: 543
diff changeset
   700
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
   701
to indicate that a value $v$ could be generated from a lexing algorithm
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   702
with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   703
defined by the rules.
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
   704
\begin{figure}[H]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   705
\begin{mathpar}
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   706
	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564
Chengsong
parents: 543
diff changeset
   707
Chengsong
parents: 543
diff changeset
   708
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
   709
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   710
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564
Chengsong
parents: 543
diff changeset
   711
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   712
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564
Chengsong
parents: 543
diff changeset
   713
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   714
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   715
591
b2d0de6aee18 more polishing integrated comments chap2
Chengsong
parents: 585
diff changeset
   716
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars \; vs : r^*}
564
Chengsong
parents: 543
diff changeset
   717
\end{mathpar}
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 626
diff changeset
   718
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626
1c8525061545 finished!
Chengsong
parents: 623
diff changeset
   719
\end{figure}
564
Chengsong
parents: 543
diff changeset
   720
\noindent
Chengsong
parents: 543
diff changeset
   721
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
   722
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
   723
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
   724
satisfying $|v| = s$ and $\vdash v:r$ is finite.
601
Chengsong
parents: 591
diff changeset
   725
This additional condition was
Chengsong
parents: 591
diff changeset
   726
imposed by Ausaf and Urban to make their proofs easier.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   727
Given a string and a regular expression, there can be
564
Chengsong
parents: 543
diff changeset
   728
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
   729
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
   730
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
   731
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
   732
Lexers therefore have to disambiguate and choose only
637
Chengsong
parents: 628
diff changeset
   733
one of the values to be generated. $\POSIX$ is one of the
564
Chengsong
parents: 543
diff changeset
   734
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
   735
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   736
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
564
Chengsong
parents: 543
diff changeset
   737
formalised the property 
Chengsong
parents: 543
diff changeset
   738
as a ternary relation.
Chengsong
parents: 543
diff changeset
   739
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   740
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   741
in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   742
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   743
\begin{figure}[p]
564
Chengsong
parents: 543
diff changeset
   744
\begin{mathpar}
Chengsong
parents: 543
diff changeset
   745
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
   746
		
Chengsong
parents: 543
diff changeset
   747
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
   748
Chengsong
parents: 543
diff changeset
   749
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
   750
Chengsong
parents: 543
diff changeset
   751
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
   752
Chengsong
parents: 543
diff changeset
   753
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
   754
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
   755
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
   756
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
   757
Chengsong
parents: 543
diff changeset
   758
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
   759
Chengsong
parents: 543
diff changeset
   760
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
   761
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
   762
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
   763
	(v::vs)}
Chengsong
parents: 543
diff changeset
   764
\end{mathpar}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   765
\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   766
	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   767
This ternary relation, written $(s, r) \rightarrow v$, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   768
formalises the POSIX constraints on the
601
Chengsong
parents: 591
diff changeset
   769
value $v$ given a string $s$ and 
Chengsong
parents: 591
diff changeset
   770
regular expression $r$.
Chengsong
parents: 591
diff changeset
   771
}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   772
\end{figure}\afterpage{\clearpage}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   773
\noindent
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   774
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   775
%\begin{figure}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   776
%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   777
%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   778
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   779
%	    (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   780
%	    {$r_{token1}$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   781
%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   782
%	    %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   783
%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   784
%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   785
%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   786
%	    (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   787
%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   788
%	    \nodepart{two}  $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   789
%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   790
%		\node [above = 1.5cm of middle, minimum width = 6cm, 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   791
%			rectangle, style={draw, rounded corners, inner sep=10pt}] 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   792
%			(topNode) {$s$};
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   793
%	    \path[->,draw]
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   794
%	        (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   795
%	        (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   796
%		;
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   797
%			
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   798
%
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   799
%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   800
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   801
%\end{figure}
637
Chengsong
parents: 628
diff changeset
   802
The above $\POSIX$ rules follow the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   803
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   804
	\item (Left Priority)\\
637
Chengsong
parents: 628
diff changeset
   805
		Match the leftmost regular expression when multiple options for matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   806
		are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   807
		be in the language of $L \; r_1$.
564
Chengsong
parents: 543
diff changeset
   808
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
   809
		Always match a subpart as much as possible before proceeding
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   810
		to the next part of the string.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
   811
		For example, when the string $s$ matches 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   812
		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   813
		Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   814
		$r_{part1}$ is preferred by this maximum munch rule.
637
Chengsong
parents: 628
diff changeset
   815
		The side-condition 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   816
		\begin{center}
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   817
		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   818
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   819
		\end{center}
637
Chengsong
parents: 628
diff changeset
   820
		in PS causes this.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   821
		%(See
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   822
		%\ref{munch} for an illustration).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   823
\end{itemize}
564
Chengsong
parents: 543
diff changeset
   824
\noindent
Chengsong
parents: 543
diff changeset
   825
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
   826
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   827
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
   828
\[ 
Chengsong
parents: 543
diff changeset
   829
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
   830
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   831
using a regular expression 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   832
for keywords and 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   833
identifiers:
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   834
%(for example, keyword is a nonempty string starting with letters 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   835
%followed by alphanumeric characters or underscores):
564
Chengsong
parents: 543
diff changeset
   836
\[
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   837
	r_{keyword} + r_{identifier}.
564
Chengsong
parents: 543
diff changeset
   838
\]
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   839
If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   840
as an identifier
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   841
where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   842
followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   843
then a match with a keyword (if)
564
Chengsong
parents: 543
diff changeset
   844
followed by
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   845
an identifier (foo) would be incorrect.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   846
POSIX lexing generates what is included by lexing.
564
Chengsong
parents: 543
diff changeset
   847
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   848
\noindent
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   849
We know that a POSIX 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   850
value for regular expression $r$ is inhabited by $r$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   851
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   852
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   853
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   854
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   855
The main property about a $\POSIX$ value is that 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   856
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   857
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   858
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   859
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   860
\end{lemma}
539
Chengsong
parents: 538
diff changeset
   861
\begin{proof}
564
Chengsong
parents: 543
diff changeset
   862
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
   863
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
   864
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
   865
the sequence and star with non-empty iterations.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   866
We shall give the details for proving the sequence case here.
539
Chengsong
parents: 538
diff changeset
   867
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   868
When we have 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   869
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   870
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
601
Chengsong
parents: 591
diff changeset
   871
	(s_2, r_2) \rightarrow v_2  \;\, and \;\, 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   872
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   873
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   874
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   875
we know that the last condition 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   876
excludes the possibility of a 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   877
string $s_1'$ longer than $s_1$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   878
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   879
(s_1', r_1) \rightarrow v_1'   \;\; 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   880
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   881
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   882
hold.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   883
A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   884
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   885
(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   886
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   887
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   888
cannot possibly form a $\POSIX$ value either, because
637
Chengsong
parents: 628
diff changeset
   889
by definition, there is a candidate
Chengsong
parents: 628
diff changeset
   890
with a longer initial string
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   891
$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   892
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   893
$s$ must have the 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   894
property that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   895
\[
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   896
	|a| = s_1 \;\; and \;\; |b| = s_2.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   897
\]
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   898
The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   899
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   900
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   901
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   902
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539
Chengsong
parents: 538
diff changeset
   903
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
   904
\end{proof}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   905
\noindent
637
Chengsong
parents: 628
diff changeset
   906
We have now defined what a POSIX value is and shown that it is unique.
Chengsong
parents: 628
diff changeset
   907
The problem is to generate
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   908
such a value in a lexing algorithm using derivatives.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   909
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   910
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   911
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   912
Sulzmann and Lu extended Brzozowski's 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   913
derivative-based matching
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   914
to a lexing algorithm by a second phase 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   915
after the initial phase of successive derivatives.
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   916
This second phase generates a POSIX value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   917
if the regular expression matches the string.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   918
The algorithm uses two functions called $\inj$ and $\mkeps$.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   919
The function $\mkeps$ constructs a POSIX value from the last
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   920
derivative $r_n$:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   921
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   922
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   923
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   924
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   925
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   926
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   927
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   928
\end{ceqn}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   929
\noindent
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   930
In the above diagram, again we assume that
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   931
the input string $s$ is made of $n$ characters
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   932
$c_0c_1 \ldots c_{n-1}$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   933
The last derivative operation 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   934
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   935
$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   936
tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   937
regular expression $r_n$, in a POSIX way.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   938
The definition of $\mkeps$ is
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   939
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   940
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
   941
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   942
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   943
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   944
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   945
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   946
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564
Chengsong
parents: 543
diff changeset
   947
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   948
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   949
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   950
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   951
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   952
\noindent 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   953
The function prefers the left child $r_1$ of $r_1 + r_2$ 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   954
to match an empty string if there is a choice.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   955
When there is a star to match the empty string,
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   956
we give the $\Stars$ constructor an empty list, meaning
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   957
no iteration is taken.
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   958
The result of $\mkeps$ on a $\nullable$ $r$ 
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   959
is a POSIX value for $r$ and the empty string:
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   960
\begin{lemma}\label{mePosix}
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
   961
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   962
\end{lemma}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   963
\begin{proof}
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   964
	By induction on $r$.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   965
\end{proof}
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   966
\noindent
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   967
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   968
in reverse order as they were chopped off in the derivative phase.
637
Chengsong
parents: 628
diff changeset
   969
The function for this is called $\inj$. This function 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 608
diff changeset
   970
operates on values, unlike $\backslash$ which operates on regular expressions.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   971
In the diagram below, $v_i$ stands for the (POSIX) value 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   972
for how the regular expression 
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   973
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   974
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   975
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   976
matches $s$. 
601
Chengsong
parents: 591
diff changeset
   977
\begin{figure}[H]
Chengsong
parents: 591
diff changeset
   978
\begin{center}	
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   979
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   980
\begin{tikzcd}
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   981
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   982
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   983
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   984
\end{ceqn}
601
Chengsong
parents: 591
diff changeset
   985
\end{center}
Chengsong
parents: 591
diff changeset
   986
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
Chengsong
parents: 591
diff changeset
   987
	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
Chengsong
parents: 591
diff changeset
   988
	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
Chengsong
parents: 591
diff changeset
   989
	$c_1$, and so on. These are the same operations as they have appeared in the matcher
Chengsong
parents: 591
diff changeset
   990
	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637
Chengsong
parents: 628
diff changeset
   991
	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
Chengsong
parents: 628
diff changeset
   992
	the empty string, by always selecting the leftmost 
Chengsong
parents: 628
diff changeset
   993
	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
601
Chengsong
parents: 591
diff changeset
   994
	appeared in the string, always preserving POSIXness.}\label{graph:inj}
Chengsong
parents: 591
diff changeset
   995
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   996
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
   997
The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
   998
takes three arguments: a regular
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
   999
expression ${r_{i}}$, before the character is chopped off, 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1000
a character ${c_{i}}$ (the character we want to inject back) and 
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1001
the third argument $v_{i+1}$ the value we want to inject into. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1002
The result of an application 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1003
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1004
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1005
	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1006
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1007
holds.
567
28cb8089ec36 more updaates
Chengsong
parents: 564
diff changeset
  1008
The definition of $\textit{inj}$ is as follows: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1009
\begin{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1010
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1011
  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1012
  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1013
  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1014
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1015
  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1016
  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1017
  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1018
  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1019
  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1020
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1021
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1022
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1023
\noindent 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1024
The function recurses on 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1025
the shape of regular
637
Chengsong
parents: 628
diff changeset
  1026
expressions and values.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1027
Intuitively, each clause analyses 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1028
how $r_i$ could have transformed when being 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1029
derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1030
of $v_{i+1}$ has the ``hole'' 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1031
to inject the character back into.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1032
Once the character is
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1033
injected back to that sub-value; 
637
Chengsong
parents: 628
diff changeset
  1034
$\inj$ assembles all parts
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1035
to form a new value.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1036
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1037
For instance, the last clause is an
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1038
injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1039
whose second child
637
Chengsong
parents: 628
diff changeset
  1040
value is a star and the shape of the 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1041
regular expression $r_i$ before injection 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1042
is a star.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1043
We therefore know 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1044
the derivative 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1045
starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1046
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1047
	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1048
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1049
during which an iteration of the star
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1050
had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1051
value inhabitation relation:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1052
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1053
	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1054
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1055
The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1056
matched star iterations,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1057
and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1058
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1059
	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1060
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1061
Finally, 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1062
$\inj \; r \;c \; v$ is prepended
637
Chengsong
parents: 628
diff changeset
  1063
to the previous list of iterations and then
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1064
wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1065
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1066
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1067
Recall that lemma 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1068
\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1069
$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1070
The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1071
the value before injection is POSIX, namely
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1072
\begin{lemma}\label{injPosix}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1073
	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1074
	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1075
\end{lemma}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1076
\begin{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1077
	By induction on $r$.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1078
	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1079
	When $r = a \cdot b$, there can be
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1080
	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1081
	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1082
	case.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1083
	\begin{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1084
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1085
			$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1086
			The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1087
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1088
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1089
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1090
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1091
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1092
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1093
			We know that there exists a unique pair of
637
Chengsong
parents: 628
diff changeset
  1094
			$s_a$ and $s_b$ satisfying	
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1095
				$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1096
				$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1097
				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1098
				L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1099
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1100
			The last condition gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1101
			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1102
				L \; a \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1103
				s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1104
			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1105
			and this gives us
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1106
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1107
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1108
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1109
		\item
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1110
			$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1111
			The argument is almost identical to the above case,	
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1112
			except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1113
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1114
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1115
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1116
					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1117
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1118
			\end{center}
637
Chengsong
parents: 628
diff changeset
  1119
			With similar reasoning, 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1120
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1121
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1122
				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1123
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1124
			again holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1125
		\item 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1126
			$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1127
			Again the injection result would be 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1128
			\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1129
				\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1130
					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1131
					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1132
				\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1133
			\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1134
			We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1135
			allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1136
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1137
				(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1138
			\]
637
Chengsong
parents: 628
diff changeset
  1139
			Also, by inductive hypothesis
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1140
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1141
				(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1142
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1143
			holds.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1144
			In addition, as
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1145
			$\Right \;v_b$  instead of $\Left \ldots$ is 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1146
			the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1147
			that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1148
			This tells us that 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1149
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1150
				\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1151
				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1152
				\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1153
			\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1154
			which translates to
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1155
			\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1156
				\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1157
				s_3 @s_4 = c::s  \land s_3 \in L \; a 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1158
				\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1159
			\]
637
Chengsong
parents: 628
diff changeset
  1160
			(Which says there cannot be a longer 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1161
			initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1162
			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1163
			as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1164
	\end{itemize}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1165
	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1166
\end{proof}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1167
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1168
Putting all together, Sulzmann and Lu obtained the following algorithm
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1169
outlined in the diagram \ref{graph:inj}:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1170
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1171
\begin{tabular}{lcl}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1172
	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1173
	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1174
	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1175
	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1176
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1177
\end{center}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1178
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1179
The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1180
according to
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1181
POSIX rules. 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1182
\begin{theorem}\label{lexerCorrectness}
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1183
	The $\lexer$ based on derivatives and injections is correct: 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1184
	\begin{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1185
		\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1186
			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1187
			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1188
		\end{tabular}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1189
	\end{center}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1190
\end{theorem} 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1191
\begin{proof}
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1192
By induction on $s$. $r$ generalising over an arbitrary regular expression.
637
Chengsong
parents: 628
diff changeset
  1193
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1194
by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1195
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1196
\noindent
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1197
As we did earlier in this chapter with the matcher, one can 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1198
introduce simplification on the regular expression in each derivative step.
637
Chengsong
parents: 628
diff changeset
  1199
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
Chengsong
parents: 628
diff changeset
  1200
and ensure that
Chengsong
parents: 628
diff changeset
  1201
the values align with the regular expression at each step.
539
Chengsong
parents: 538
diff changeset
  1202
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1203
be careful not to break the correctness, as the injection 
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1204
function heavily relies on the structure of 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1205
the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1206
This can be achieved by recording some extra rectification functions
637
Chengsong
parents: 628
diff changeset
  1207
during the derivatives step and applying these rectifications in 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1208
each run during the injection phase.
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1209
With extra care
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1210
one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1211
by the simplifications listed here \cite{AusafDyckhoffUrban2016}. 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1212
\begin{center}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1213
	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1214
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1215
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1216
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1217
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1218
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1219
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1220
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1221
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1222
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1223
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1224
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1225
		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1226
				   
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1227
	\end{tabular}
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1228
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1229
637
Chengsong
parents: 628
diff changeset
  1230
However, one can still end up
Chengsong
parents: 628
diff changeset
  1231
with exploding derivatives,
Chengsong
parents: 628
diff changeset
  1232
even with the simple-minded simplification rules allowed
Chengsong
parents: 628
diff changeset
  1233
in an injection-based lexer.
Chengsong
parents: 628
diff changeset
  1234
\section{A Case Requiring More Aggressive Simplifications}
539
Chengsong
parents: 538
diff changeset
  1235
For example, when starting with the regular
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1236
expression $(a^* \cdot a^*)^*$ and building just over
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1237
a dozen successive derivatives 
539
Chengsong
parents: 538
diff changeset
  1238
w.r.t.~the character $a$, one obtains a derivative regular expression
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1239
with millions of nodes (when viewed as a tree)
637
Chengsong
parents: 628
diff changeset
  1240
even with the mentioned simplifications.
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1241
\begin{figure}[H]
601
Chengsong
parents: 591
diff changeset
  1242
\begin{center}
539
Chengsong
parents: 538
diff changeset
  1243
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
  1244
\begin{axis}[
Chengsong
parents: 538
diff changeset
  1245
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
  1246
    ylabel={size},
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1247
    legend entries={Simple-Minded Simp, Naive Matcher},  
539
Chengsong
parents: 538
diff changeset
  1248
    legend pos=north west,
Chengsong
parents: 538
diff changeset
  1249
    legend cell align=left]
Chengsong
parents: 538
diff changeset
  1250
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585
4969ef817d92 chap4 more
Chengsong
parents: 583
diff changeset
  1251
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539
Chengsong
parents: 538
diff changeset
  1252
\end{axis}
Chengsong
parents: 538
diff changeset
  1253
\end{tikzpicture} 
601
Chengsong
parents: 591
diff changeset
  1254
\end{center}
539
Chengsong
parents: 538
diff changeset
  1255
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
  1256
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
  1257
   
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1258
That is because Sulzmann and Lu's 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1259
injection-based lexing algorithm keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
  1260
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
  1261
These different ways of matching will grow exponentially with the string length.
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1262
Consider the case
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1263
\[
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1264
	r= (a^*\cdot a^*)^* \quad and \quad
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1265
	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1266
\]
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1267
as an example.
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1268
This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1269
many ways to split up the string into multiple segments for
637
Chengsong
parents: 628
diff changeset
  1270
different star iterations,
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1271
and for each segment 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1272
multiple ways of splitting between 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1273
the two $a^*$ sub-expressions.
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1274
When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1275
the match:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1276
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1277
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1278
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1279
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1280
\[
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1281
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1282
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1283
The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1284
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1285
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1286
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1287
The $a^*a^*$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1288
correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1289
When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1290
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1291
	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1292
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1293
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1294
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1295
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1296
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1297
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1298
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1299
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1300
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1301
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1302
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1303
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1304
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1305
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1306
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1307
	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1308
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1309
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1310
\]
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1311
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1312
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1313
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1314
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1315
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1316
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1317
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1318
and
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1319
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1320
	\Stars \; [
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1321
	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1322
		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1323
		  ] 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1324
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1325
And $\derssimp \; aa \; (a^*a^*)^*$ is
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1326
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1327
	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1328
	(a^*a^* + a^*)\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1329
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1330
which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1331
seven distinct lexical values.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1332
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1333
It is not surprising that there are exponentially many 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1334
distinct lexical values that cannot be eliminated by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1335
the simple-minded simplification of $\derssimp$. 
568
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1336
A lexer without a good enough strategy to 
7a579f5533f8 more chapter2 modifications
Chengsong
parents: 567
diff changeset
  1337
deduplicate will naturally
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1338
have an exponential runtime on highly
637
Chengsong
parents: 628
diff changeset
  1339
ambiguous regular expressions because there
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1340
are exponentially many matches.
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1341
For this particular example, it seems
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1342
that the number of distinct matches growth
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1343
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1344
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1345
On the other hand, the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1346
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1347
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1348
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1349
	\Stars\,
637
Chengsong
parents: 628
diff changeset
  1350
	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1351
\]
637
Chengsong
parents: 628
diff changeset
  1352
At any moment, the  subterms in a regular expression
Chengsong
parents: 628
diff changeset
  1353
that will potentially result in a POSIX value is only
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1354
a minority among the many other terms,
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1355
and one can remove the ones that are not possible to 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1356
be POSIX.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1357
In the above example,
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1358
\begin{equation}\label{eqn:growth2}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1359
	((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1360
	\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
579
35df9cdd36ca more chap3
Chengsong
parents: 577
diff changeset
  1361
\end{equation}
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1362
can be further simplified by 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1363
removing the underlined term first,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1364
which would open up possibilities
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1365
of further simplification that removes the
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1366
underbraced part.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1367
The result would be 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1368
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1369
	(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1370
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1371
with corresponding values
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1372
\begin{center}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1373
	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1374
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$  & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1375
		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1376
	\end{tabular}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1377
\end{center}
637
Chengsong
parents: 628
diff changeset
  1378
Other terms with an underlying value, such as
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1379
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1380
	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1381
\]
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1382
do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1383
and therefore can be thrown away.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1384
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1385
Ausaf et al. \cite{AusafDyckhoffUrban2016} 
639
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1386
have come up with some simplification steps, 
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1387
and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1388
They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1389
\[
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1390
	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1391
\]
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1392
The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1393
involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1394
which we omit here.
80cc6dc4c98b until chap 7
Chengsong
parents: 638
diff changeset
  1395
however those steps
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1396
are not yet sufficiently strong, to achieve the above effects.
637
Chengsong
parents: 628
diff changeset
  1397
And even with these relatively mild simplifications, the proof
Chengsong
parents: 628
diff changeset
  1398
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1399
One would need to prove something like this: 
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1400
\[
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1401
	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1402
	\textit{then}\;\; (r, c::s) \rightarrow 
637
Chengsong
parents: 628
diff changeset
  1403
	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1404
\]
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1405
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1406
not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1407
but also what specific simplifications 
638
dd9dde2d902b comments till chap4
Chengsong
parents: 637
diff changeset
  1408
have been done as a function on values
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1409
showing how one can transform the value
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1410
underlying the simplified regular expression
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1411
to the unsimplified one.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1412
623
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1413
We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2
Chengsong
parents: 622
diff changeset
  1414
also described by Sulzmann and Lu to
573
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1415
get better simplifications, which uses
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1416
some augmented data structures compared to 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1417
plain regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1418
We call them \emph{annotated}
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1419
regular expressions.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1420
With annotated regular expressions,
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1421
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1422
second phase altogether.
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1423
We introduce this new datatype and the 
454ced557605 chapter2 finished polishing
Chengsong
parents: 568
diff changeset
  1424
corresponding algorithm in the next chapter.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1425
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1426
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1427
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
  1428