ChengsongTanPhdThesis/Chapters/Inj.tex
author Chengsong
Wed, 13 Jul 2022 08:27:28 +0100
changeset 564 3cbcd7cda0a9
parent 543 b2bea5968b89
child 567 28cb8089ec36
permissions -rwxr-xr-x
more
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     1
% Chapter Template
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     3
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     4
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     5
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     6
%and notations we 
564
Chengsong
parents: 543
diff changeset
     7
% used for describing the lexing algorithm by Sulzmann and Lu,
Chengsong
parents: 543
diff changeset
     8
%and then give the algorithm and its variant and discuss
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
     9
%why more aggressive simplifications are needed. 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    10
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    11
In this chapter, we define the basic notions 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    12
for regular languages and regular expressions.
564
Chengsong
parents: 543
diff changeset
    13
This is essentially a description in ``English"
Chengsong
parents: 543
diff changeset
    14
of our formalisation in Isabelle/HOL.
Chengsong
parents: 543
diff changeset
    15
We also give the definition of what $\POSIX$ lexing means, 
Chengsong
parents: 543
diff changeset
    16
followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
Chengsong
parents: 543
diff changeset
    17
that produces the output conforming
Chengsong
parents: 543
diff changeset
    18
to the $\POSIX$ standard.
Chengsong
parents: 543
diff changeset
    19
It is also worth mentioning that
Chengsong
parents: 543
diff changeset
    20
we choose to use the ML-style notation
Chengsong
parents: 543
diff changeset
    21
for function applications, where
Chengsong
parents: 543
diff changeset
    22
the parameters of a function is not enclosed
Chengsong
parents: 543
diff changeset
    23
inside a pair of parentheses (e.g. $f \;x \;y$
Chengsong
parents: 543
diff changeset
    24
instead of $f(x,\;y)$). This is mainly
Chengsong
parents: 543
diff changeset
    25
to make the text visually more concise.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    26
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    27
\section{Basic Concepts}
564
Chengsong
parents: 543
diff changeset
    28
Usually, formal language theory starts with an alphabet 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    29
denoting a set of characters.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    30
Here we just use the datatype of characters from Isabelle,
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    31
which roughly corresponds to the ASCII characters.
564
Chengsong
parents: 543
diff changeset
    32
In what follows, we shall leave the information about the alphabet
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    33
implicit.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    34
Then using the usual bracket notation for lists,
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    35
we can define strings made up of characters: 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    36
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    37
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    38
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    39
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    40
\end{center}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    41
Where $c$ is a variable ranging over characters.
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    42
Strings can be concatenated to form longer strings in the same
564
Chengsong
parents: 543
diff changeset
    43
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    44
We omit the precise 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    45
recursive definition here.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    46
We overload this concatenation operator for two sets of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    47
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    48
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    49
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    50
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    51
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    52
We also call the above \emph{language concatenation}.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    53
The power of a language is defined recursively, using the 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    54
concatenation operator $@$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    55
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    56
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    57
$A^0 $ & $\dn$ & $\{ [] \}$\\
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    58
$A^{n+1}$ & $\dn$ & $A @ A^n$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    59
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    60
\end{center}
564
Chengsong
parents: 543
diff changeset
    61
The union of all powers of a language   
Chengsong
parents: 543
diff changeset
    62
can be used to define the Kleene star operator:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    63
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    64
\begin{tabular}{lcl}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
    65
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    66
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    67
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    68
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    69
\noindent
564
Chengsong
parents: 543
diff changeset
    70
However, to obtain a more convenient induction principle 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    71
in Isabelle/HOL, 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
    72
we instead define the Kleene star
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    73
as an inductive set: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    74
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    75
\begin{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    76
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
    77
	\inferrule{\mbox{}}{[] \in A*\\}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    78
564
Chengsong
parents: 543
diff changeset
    79
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    80
\end{mathpar}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    81
\end{center}
564
Chengsong
parents: 543
diff changeset
    82
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    83
We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    84
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    85
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    86
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    87
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    88
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    89
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    90
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    91
This can be generalised to "chopping off" a string from all strings within set $A$, 
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    92
namely:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    93
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    94
\begin{tabular}{lcl}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    95
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    96
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
    97
\end{center}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
    98
\noindent
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
    99
which is essentially the left quotient $A \backslash L$ of $A$ against 
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   100
the singleton language with $L = \{w\}$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   101
in formal language theory.
564
Chengsong
parents: 543
diff changeset
   102
However, for the purposes here, the $\textit{Ders}$ definition with 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   103
a single string is sufficient.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   104
564
Chengsong
parents: 543
diff changeset
   105
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   106
we have a  few properties of how the language derivative can be defined using 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   107
sub-languages.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   108
\begin{lemma}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   109
\[
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   110
	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   111
	\begin{cases}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   112
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   113
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   114
	 \end{cases}	
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   115
\]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   116
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   117
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   118
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   119
and get to $B$.
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   120
The language $A*$'s derivative can be described using the language derivative
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   121
of $A$:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   122
\begin{lemma}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   123
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   124
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   125
\begin{proof}
564
Chengsong
parents: 543
diff changeset
   126
There are too inclusions to prove:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   127
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   128
\item{$\subseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   129
The set 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   130
\[ \{s \mid c :: s \in A*\} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   131
is enclosed in the set
564
Chengsong
parents: 543
diff changeset
   132
\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   133
because whenever you have a string starting with a character 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   134
in the language of a Kleene star $A*$, 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   135
then that character together with some sub-string
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   136
immediately after it will form the first iteration, 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   137
and the rest of the string will 
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   138
be still in $A*$.
564
Chengsong
parents: 543
diff changeset
   139
\item{$\supseteq$}:\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   140
Note that
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   141
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
564
Chengsong
parents: 543
diff changeset
   142
hold.
Chengsong
parents: 543
diff changeset
   143
Also this holds:
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   144
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
564
Chengsong
parents: 543
diff changeset
   145
where the $\textit{RHS}$ can be rewritten
Chengsong
parents: 543
diff changeset
   146
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
Chengsong
parents: 543
diff changeset
   147
which of course contains $\Der \; c \; A @ A*$.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   148
\end{itemize}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   149
\end{proof}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   150
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   151
\noindent
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   152
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   153
for regular languages, we need to first give definitions for regular expressions.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   154
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   155
\subsection{Regular Expressions and Their Meaning}
564
Chengsong
parents: 543
diff changeset
   156
The \emph{basic regular expressions} are defined inductively
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   157
 by the following grammar:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   158
\[			r ::=   \ZERO \mid  \ONE
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   159
			 \mid  c  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   160
			 \mid  r_1 \cdot r_2
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   161
			 \mid  r_1 + r_2   
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   162
			 \mid r^*         
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   163
\]
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   164
\noindent
564
Chengsong
parents: 543
diff changeset
   165
We call them basic because we will introduce
Chengsong
parents: 543
diff changeset
   166
additional constructors in later chapters such as negation
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   167
and bounded repetitions.
564
Chengsong
parents: 543
diff changeset
   168
We use $\ZERO$ for the regular expression that
Chengsong
parents: 543
diff changeset
   169
matches no string, and $\ONE$ for the regular
Chengsong
parents: 543
diff changeset
   170
expression that matches only the empty string\footnote{
Chengsong
parents: 543
diff changeset
   171
some authors
Chengsong
parents: 543
diff changeset
   172
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
Chengsong
parents: 543
diff changeset
   173
but we prefer our notation}. 
Chengsong
parents: 543
diff changeset
   174
The sequence regular expression is written $r_1\cdot r_2$
Chengsong
parents: 543
diff changeset
   175
and sometimes we omit the dot if it is clear which
Chengsong
parents: 543
diff changeset
   176
regular expression is meant; the alternative
Chengsong
parents: 543
diff changeset
   177
is written $r_1 + r_2$.
Chengsong
parents: 543
diff changeset
   178
The \emph{language} or meaning of 
Chengsong
parents: 543
diff changeset
   179
a regular expression is defined recursively as
Chengsong
parents: 543
diff changeset
   180
a set of strings:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   181
%TODO: FILL in the other defs
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   182
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   183
\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
   184
$L \; \ZERO$ & $\dn$ & $\phi$\\
Chengsong
parents: 543
diff changeset
   185
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
Chengsong
parents: 543
diff changeset
   186
$L \; c$ & $\dn$ & $\{[c]\}$\\
Chengsong
parents: 543
diff changeset
   187
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
Chengsong
parents: 543
diff changeset
   188
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
Chengsong
parents: 543
diff changeset
   189
$L \; r^*$ & $\dn$ & $ (L\;r)*$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   190
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   191
\end{center}
536
aff7bf93b9c7 comments addressed all
Chengsong
parents: 532
diff changeset
   192
\noindent
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   193
Now with semantic derivatives of a language and regular expressions and
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   194
their language interpretations in place, we are ready to define derivatives on regexes.
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   195
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564
Chengsong
parents: 543
diff changeset
   196
%Recall, the language derivative acts on a set of strings
Chengsong
parents: 543
diff changeset
   197
%and essentially chops off a particular character from
Chengsong
parents: 543
diff changeset
   198
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
Chengsong
parents: 543
diff changeset
   199
%so that after derivative $L(r\backslash c)$ 
Chengsong
parents: 543
diff changeset
   200
%will look as if it was obtained by doing a language derivative on $L(r)$:
Chengsong
parents: 543
diff changeset
   201
Recall that the semantic derivative acts on a set of 
Chengsong
parents: 543
diff changeset
   202
strings. Brzozowski noticed that this operation
Chengsong
parents: 543
diff changeset
   203
can be ``mirrored" on regular expressions which
Chengsong
parents: 543
diff changeset
   204
he calls the derivative of a regular expression $r$
Chengsong
parents: 543
diff changeset
   205
with respect to a character $c$, written
Chengsong
parents: 543
diff changeset
   206
$r \backslash c$.
Chengsong
parents: 543
diff changeset
   207
He defined this operation such that the following property holds:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   208
\begin{center}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   209
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   210
\[
564
Chengsong
parents: 543
diff changeset
   211
L(r \backslash c) = \Der \; c \; L(r)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   212
\]
564
Chengsong
parents: 543
diff changeset
   213
\end{center}
Chengsong
parents: 543
diff changeset
   214
\noindent
Chengsong
parents: 543
diff changeset
   215
For example in the sequence case we have 
Chengsong
parents: 543
diff changeset
   216
\begin{center}
Chengsong
parents: 543
diff changeset
   217
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   218
		$\Der \; c \; (A @ B)$ & $\dn$ & 
Chengsong
parents: 543
diff changeset
   219
		$ \textit{if} \;\,  [] \in A \; 
Chengsong
parents: 543
diff changeset
   220
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
Chengsong
parents: 543
diff changeset
   221
		\Der \; c\; B$\\
Chengsong
parents: 543
diff changeset
   222
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
Chengsong
parents: 543
diff changeset
   223
	\end{tabular}
Chengsong
parents: 543
diff changeset
   224
\end{center}
Chengsong
parents: 543
diff changeset
   225
\noindent
Chengsong
parents: 543
diff changeset
   226
This can be translated to  
Chengsong
parents: 543
diff changeset
   227
regular expressions in the following 
Chengsong
parents: 543
diff changeset
   228
manner:
Chengsong
parents: 543
diff changeset
   229
\begin{center}
Chengsong
parents: 543
diff changeset
   230
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   231
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
Chengsong
parents: 543
diff changeset
   232
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
Chengsong
parents: 543
diff changeset
   233
	\end{tabular}
Chengsong
parents: 543
diff changeset
   234
\end{center}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   235
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   236
\noindent
564
Chengsong
parents: 543
diff changeset
   237
And similarly, the Kleene star's semantic derivative
Chengsong
parents: 543
diff changeset
   238
can be expressed as
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   239
\[
564
Chengsong
parents: 543
diff changeset
   240
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   241
\]
564
Chengsong
parents: 543
diff changeset
   242
which translates to
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   243
\[
564
Chengsong
parents: 543
diff changeset
   244
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   245
\]
564
Chengsong
parents: 543
diff changeset
   246
In the above definition of $(r_1\cdot r_2) \backslash c$,
Chengsong
parents: 543
diff changeset
   247
the $\textit{if}$ clause's
Chengsong
parents: 543
diff changeset
   248
boolean condition 
Chengsong
parents: 543
diff changeset
   249
$[] \in L(r_1)$ needs to be 
Chengsong
parents: 543
diff changeset
   250
somehow recursively computed.
Chengsong
parents: 543
diff changeset
   251
We call such a function that checks
Chengsong
parents: 543
diff changeset
   252
whether the empty string $[]$ is 
Chengsong
parents: 543
diff changeset
   253
in the language of a regular expression $\nullable$:
Chengsong
parents: 543
diff changeset
   254
\begin{center}
Chengsong
parents: 543
diff changeset
   255
		\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   256
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
Chengsong
parents: 543
diff changeset
   257
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   258
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
Chengsong
parents: 543
diff changeset
   259
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   260
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
Chengsong
parents: 543
diff changeset
   261
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
Chengsong
parents: 543
diff changeset
   262
		\end{tabular}
Chengsong
parents: 543
diff changeset
   263
\end{center}
Chengsong
parents: 543
diff changeset
   264
\noindent
Chengsong
parents: 543
diff changeset
   265
The $\ZERO$ regular expression
Chengsong
parents: 543
diff changeset
   266
does not contain any string and
Chengsong
parents: 543
diff changeset
   267
therefore is not \emph{nullable}.
Chengsong
parents: 543
diff changeset
   268
$\ONE$ is \emph{nullable} 
Chengsong
parents: 543
diff changeset
   269
by definition. 
Chengsong
parents: 543
diff changeset
   270
The character regular expression $c$
Chengsong
parents: 543
diff changeset
   271
corresponds to the singleton set $\{c\}$, 
Chengsong
parents: 543
diff changeset
   272
and therefore does not contain the empty string.
Chengsong
parents: 543
diff changeset
   273
The alternative regular expression is nullable
Chengsong
parents: 543
diff changeset
   274
if at least one of its children is nullable.
Chengsong
parents: 543
diff changeset
   275
The sequence regular expression
Chengsong
parents: 543
diff changeset
   276
would require both children to have the empty string
Chengsong
parents: 543
diff changeset
   277
to compose an empty string, and the Kleene star
Chengsong
parents: 543
diff changeset
   278
is always nullable because it naturally
Chengsong
parents: 543
diff changeset
   279
contains the empty string. 
Chengsong
parents: 543
diff changeset
   280
  
Chengsong
parents: 543
diff changeset
   281
The derivative function, written $r\backslash c$, 
Chengsong
parents: 543
diff changeset
   282
defines how a regular expression evolves into
Chengsong
parents: 543
diff changeset
   283
a new one after all the string it contains is acted on: 
Chengsong
parents: 543
diff changeset
   284
if it starts with $c$, then the character is chopped of,
Chengsong
parents: 543
diff changeset
   285
if not, that string is removed.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   286
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   287
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   288
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   289
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   290
		$d \backslash c$     & $\dn$ & 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   291
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   292
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   293
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   294
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   295
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   296
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   297
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   298
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   299
\noindent
564
Chengsong
parents: 543
diff changeset
   300
The most involved cases are the sequence case
Chengsong
parents: 543
diff changeset
   301
and the star case.
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   302
The sequence case says that if the first regular expression
564
Chengsong
parents: 543
diff changeset
   303
contains an empty string, then the second component of the sequence
Chengsong
parents: 543
diff changeset
   304
needs to be considered, as its derivative will contribute to the
Chengsong
parents: 543
diff changeset
   305
result of this derivative.
Chengsong
parents: 543
diff changeset
   306
The star regular expression $r^*$'s derivative 
Chengsong
parents: 543
diff changeset
   307
unwraps one iteration of $r$, turns it into $r\backslash c$,
Chengsong
parents: 543
diff changeset
   308
and attaches the original $r^*$
Chengsong
parents: 543
diff changeset
   309
after $r\backslash c$, so that 
Chengsong
parents: 543
diff changeset
   310
we can further unfold it as many times as needed.
Chengsong
parents: 543
diff changeset
   311
We have the following correspondence between 
Chengsong
parents: 543
diff changeset
   312
derivatives on regular expressions and
Chengsong
parents: 543
diff changeset
   313
derivatives on a set of strings:
Chengsong
parents: 543
diff changeset
   314
\begin{lemma}\label{derDer}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   315
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   316
\end{lemma}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   317
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   318
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   319
The main property of the derivative operation
564
Chengsong
parents: 543
diff changeset
   320
(that enables us to reason about the correctness of
Chengsong
parents: 543
diff changeset
   321
derivative-based matching)
Chengsong
parents: 543
diff changeset
   322
is 
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   323
539
Chengsong
parents: 538
diff changeset
   324
\begin{lemma}\label{derStepwise}
564
Chengsong
parents: 543
diff changeset
   325
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
539
Chengsong
parents: 538
diff changeset
   326
\end{lemma}
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   327
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   328
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   329
We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   330
to strings as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   331
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   332
\begin{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   333
\begin{tabular}{lcl}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   334
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   335
$r \backslash [\,] $ & $\dn$ & $r$
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   336
\end{tabular}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   337
\end{center}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   338
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   339
\noindent
564
Chengsong
parents: 543
diff changeset
   340
When there is no ambiguity, we will 
Chengsong
parents: 543
diff changeset
   341
omit the subscript and use $\backslash$ instead
Chengsong
parents: 543
diff changeset
   342
of $\backslash_r$ to denote
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   343
string derivatives for brevity.
539
Chengsong
parents: 538
diff changeset
   344
Brzozowski's  regular-expression matcher algorithm can then be described as:
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   345
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   346
\begin{definition}
564
Chengsong
parents: 543
diff changeset
   347
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   348
\end{definition}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   349
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   350
\noindent
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   351
Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   352
this algorithm presented graphically is as follows:
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   353
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   354
\begin{equation}\label{graph:successive_ders}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   355
\begin{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   356
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   357
\end{tikzcd}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   358
\end{equation}
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   359
cc54ce075db5 restructured
Chengsong
parents:
diff changeset
   360
\noindent
539
Chengsong
parents: 538
diff changeset
   361
 It can  be
Chengsong
parents: 538
diff changeset
   362
relatively  easily shown that this matcher is correct:
Chengsong
parents: 538
diff changeset
   363
\begin{lemma}
564
Chengsong
parents: 543
diff changeset
   364
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
539
Chengsong
parents: 538
diff changeset
   365
\end{lemma}
Chengsong
parents: 538
diff changeset
   366
\begin{proof}
564
Chengsong
parents: 543
diff changeset
   367
	By the stepwise property of derivatives (lemma \ref{derStepwise})
Chengsong
parents: 543
diff changeset
   368
	and lemma \ref{derDer}. 
539
Chengsong
parents: 538
diff changeset
   369
\end{proof}
Chengsong
parents: 538
diff changeset
   370
\noindent
564
Chengsong
parents: 543
diff changeset
   371
\begin{center}
Chengsong
parents: 543
diff changeset
   372
	\begin{figure}
539
Chengsong
parents: 538
diff changeset
   373
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   374
\begin{axis}[
Chengsong
parents: 538
diff changeset
   375
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   376
    ylabel={time in secs},
Chengsong
parents: 538
diff changeset
   377
    ymode = log,
Chengsong
parents: 538
diff changeset
   378
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
   379
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   380
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   381
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
Chengsong
parents: 538
diff changeset
   382
\end{axis}
Chengsong
parents: 538
diff changeset
   383
\end{tikzpicture} 
Chengsong
parents: 538
diff changeset
   384
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
Chengsong
parents: 538
diff changeset
   385
\end{figure}
564
Chengsong
parents: 543
diff changeset
   386
\end{center} 
539
Chengsong
parents: 538
diff changeset
   387
\noindent
564
Chengsong
parents: 543
diff changeset
   388
If we implement the above algorithm naively, however,
Chengsong
parents: 543
diff changeset
   389
the algorithm can be excruciatingly slow, as shown in 
Chengsong
parents: 543
diff changeset
   390
\ref{NaiveMatcher}.
Chengsong
parents: 543
diff changeset
   391
Note that both axes are in logarithmic scale.
Chengsong
parents: 543
diff changeset
   392
Around two dozens characters
Chengsong
parents: 543
diff changeset
   393
would already explode the matcher on regular expression 
Chengsong
parents: 543
diff changeset
   394
$(a^*)^*b$.
Chengsong
parents: 543
diff changeset
   395
For this, we need to introduce certain 
539
Chengsong
parents: 538
diff changeset
   396
rewrite rules for the intermediate results,
Chengsong
parents: 538
diff changeset
   397
such as $r + r \rightarrow r$,
Chengsong
parents: 538
diff changeset
   398
and make sure those rules do not change the 
Chengsong
parents: 538
diff changeset
   399
language of the regular expression.
564
Chengsong
parents: 543
diff changeset
   400
One simpled-minded simplification function
Chengsong
parents: 543
diff changeset
   401
that achieves these requirements is given below:
Chengsong
parents: 543
diff changeset
   402
\begin{center}
Chengsong
parents: 543
diff changeset
   403
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   404
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
Chengsong
parents: 543
diff changeset
   405
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
Chengsong
parents: 543
diff changeset
   406
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   407
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
Chengsong
parents: 543
diff changeset
   408
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   409
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   410
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
Chengsong
parents: 543
diff changeset
   411
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
Chengsong
parents: 543
diff changeset
   412
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
Chengsong
parents: 543
diff changeset
   413
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
Chengsong
parents: 543
diff changeset
   414
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
Chengsong
parents: 543
diff changeset
   415
		$\simp \; r$ & $\dn$ & $r$
Chengsong
parents: 543
diff changeset
   416
				   
Chengsong
parents: 543
diff changeset
   417
	\end{tabular}
Chengsong
parents: 543
diff changeset
   418
\end{center}
Chengsong
parents: 543
diff changeset
   419
If we repeatedly apply this simplification  
Chengsong
parents: 543
diff changeset
   420
function during the matching algorithm, 
Chengsong
parents: 543
diff changeset
   421
we have a matcher with simplification:
Chengsong
parents: 543
diff changeset
   422
\begin{center}
Chengsong
parents: 543
diff changeset
   423
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   424
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
Chengsong
parents: 543
diff changeset
   425
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
Chengsong
parents: 543
diff changeset
   426
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
Chengsong
parents: 543
diff changeset
   427
	\end{tabular}
Chengsong
parents: 543
diff changeset
   428
\end{center}
Chengsong
parents: 543
diff changeset
   429
\begin{figure}
539
Chengsong
parents: 538
diff changeset
   430
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   431
\begin{axis}[
Chengsong
parents: 538
diff changeset
   432
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   433
    ylabel={time in secs},
Chengsong
parents: 538
diff changeset
   434
    ymode = log,
Chengsong
parents: 538
diff changeset
   435
    xmode = log,
564
Chengsong
parents: 543
diff changeset
   436
    grid = both,
539
Chengsong
parents: 538
diff changeset
   437
    legend entries={Matcher With Simp},  
Chengsong
parents: 538
diff changeset
   438
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   439
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   440
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
Chengsong
parents: 538
diff changeset
   441
\end{axis}
564
Chengsong
parents: 543
diff changeset
   442
\end{tikzpicture} 
Chengsong
parents: 543
diff changeset
   443
\caption{$(a^*)^*b$ 
Chengsong
parents: 543
diff changeset
   444
against 
Chengsong
parents: 543
diff changeset
   445
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
Chengsong
parents: 543
diff changeset
   446
\end{figure}
Chengsong
parents: 543
diff changeset
   447
\noindent
Chengsong
parents: 543
diff changeset
   448
The running time of $\textit{ders}\_\textit{simp}$
Chengsong
parents: 543
diff changeset
   449
on the same example of \ref{NaiveMatcher}
Chengsong
parents: 543
diff changeset
   450
is now very tame in terms of the length of inputs,
Chengsong
parents: 543
diff changeset
   451
as shown in \ref{BetterMatcher}.
539
Chengsong
parents: 538
diff changeset
   452
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   453
Building derivatives and then testing the existence
539
Chengsong
parents: 538
diff changeset
   454
of empty string in the resulting regular expression's language,
564
Chengsong
parents: 543
diff changeset
   455
adding simplifications when necessary.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   456
So far, so good. But what if we want to 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   457
do lexing instead of just getting a YES/NO answer?
564
Chengsong
parents: 543
diff changeset
   458
Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   459
elegant (arguably as beautiful as the definition of the original derivative) solution for this.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   460
539
Chengsong
parents: 538
diff changeset
   461
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564
Chengsong
parents: 543
diff changeset
   462
In this section, we present a two-phase regular expression lexing 
Chengsong
parents: 543
diff changeset
   463
algorithm.
Chengsong
parents: 543
diff changeset
   464
The first phase takes successive derivatives with 
Chengsong
parents: 543
diff changeset
   465
respect to the input string,
Chengsong
parents: 543
diff changeset
   466
and the second phase does the reverse, \emph{injecting} back
Chengsong
parents: 543
diff changeset
   467
characters, in the meantime constructing a lexing result.
Chengsong
parents: 543
diff changeset
   468
We will introduce the injection phase in detail slightly
Chengsong
parents: 543
diff changeset
   469
later, but as a preliminary we have to first define 
Chengsong
parents: 543
diff changeset
   470
the datatype for lexing results, 
Chengsong
parents: 543
diff changeset
   471
called \emph{value} or
Chengsong
parents: 543
diff changeset
   472
sometimes also \emph{lexical value}.  Values and regular
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   473
expressions correspond to each other as illustrated in the following
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   474
table:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   475
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   476
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   477
	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   478
		\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   479
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   480
			$r$ & $::=$  & $\ZERO$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   481
			& $\mid$ & $\ONE$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   482
			& $\mid$ & $c$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   483
			& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   484
			& $\mid$ & $r_1 + r_2$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   485
			\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   486
			& $\mid$ & $r^*$         \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   487
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   488
		&
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   489
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   490
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   491
			$v$ & $::=$  & \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   492
			&        & $\Empty$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   493
			& $\mid$ & $\Char(c)$          \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   494
			& $\mid$ & $\Seq\,v_1\, v_2$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   495
			& $\mid$ & $\Left(v)$   \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   496
			& $\mid$ & $\Right(v)$  \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   497
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   498
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   499
	\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   500
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   501
\noindent
564
Chengsong
parents: 543
diff changeset
   502
A value has an underlying string, which 
Chengsong
parents: 543
diff changeset
   503
can be calculated by the ``flatten" function $|\_|$:
Chengsong
parents: 543
diff changeset
   504
\begin{center}
Chengsong
parents: 543
diff changeset
   505
	\begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   506
		$|\Empty|$ & $\dn$ &  $[]$\\
Chengsong
parents: 543
diff changeset
   507
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
Chengsong
parents: 543
diff changeset
   508
		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
Chengsong
parents: 543
diff changeset
   509
		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
Chengsong
parents: 543
diff changeset
   510
		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
Chengsong
parents: 543
diff changeset
   511
		$|\Stars([])|$ & $\dn$ & $[]$\\
Chengsong
parents: 543
diff changeset
   512
		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
Chengsong
parents: 543
diff changeset
   513
	\end{tabular}
Chengsong
parents: 543
diff changeset
   514
\end{center}
Chengsong
parents: 543
diff changeset
   515
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
Chengsong
parents: 543
diff changeset
   516
to indicate that a value $v$ could be generated from a lexing algorithm
Chengsong
parents: 543
diff changeset
   517
with input $r$. They call it the value inhabitation relation. 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   518
\begin{mathpar}
564
Chengsong
parents: 543
diff changeset
   519
	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
Chengsong
parents: 543
diff changeset
   520
Chengsong
parents: 543
diff changeset
   521
	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
Chengsong
parents: 543
diff changeset
   522
Chengsong
parents: 543
diff changeset
   523
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
Chengsong
parents: 543
diff changeset
   524
Chengsong
parents: 543
diff changeset
   525
\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
Chengsong
parents: 543
diff changeset
   526
Chengsong
parents: 543
diff changeset
   527
\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   528
564
Chengsong
parents: 543
diff changeset
   529
\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
Chengsong
parents: 543
diff changeset
   530
\end{mathpar}
Chengsong
parents: 543
diff changeset
   531
\noindent
Chengsong
parents: 543
diff changeset
   532
The condition $|v| \neq []$ in the premise of star's rule
Chengsong
parents: 543
diff changeset
   533
is to make sure that for a given pair of regular 
Chengsong
parents: 543
diff changeset
   534
expression $r$ and string $s$, the number of values 
Chengsong
parents: 543
diff changeset
   535
satisfying $|v| = s$ and $\vdash v:r$ is finite.
Chengsong
parents: 543
diff changeset
   536
Given the same string and regular expression, there can be
Chengsong
parents: 543
diff changeset
   537
multiple values for it. For example, both
Chengsong
parents: 543
diff changeset
   538
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
Chengsong
parents: 543
diff changeset
   539
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
Chengsong
parents: 543
diff changeset
   540
and the values both flatten to $abc$.
Chengsong
parents: 543
diff changeset
   541
Lexers therefore have to disambiguate and choose only
Chengsong
parents: 543
diff changeset
   542
one of the values to output. $\POSIX$ is one of the
Chengsong
parents: 543
diff changeset
   543
disambiguation strategies that is widely adopted.
Chengsong
parents: 543
diff changeset
   544
Chengsong
parents: 543
diff changeset
   545
Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
Chengsong
parents: 543
diff changeset
   546
formalised the property 
Chengsong
parents: 543
diff changeset
   547
as a ternary relation.
Chengsong
parents: 543
diff changeset
   548
The $\POSIX$ value $v$ for a regular expression
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   549
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
564
Chengsong
parents: 543
diff changeset
   550
in the following set of rules\footnote{The names of the rules are used
Chengsong
parents: 543
diff changeset
   551
as they were originally given in \cite{AusafDyckhoffUrban2016}}:
Chengsong
parents: 543
diff changeset
   552
\noindent
Chengsong
parents: 543
diff changeset
   553
\begin{figure}
Chengsong
parents: 543
diff changeset
   554
\begin{mathpar}
Chengsong
parents: 543
diff changeset
   555
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
Chengsong
parents: 543
diff changeset
   556
		
Chengsong
parents: 543
diff changeset
   557
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
Chengsong
parents: 543
diff changeset
   558
Chengsong
parents: 543
diff changeset
   559
	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
Chengsong
parents: 543
diff changeset
   560
Chengsong
parents: 543
diff changeset
   561
	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
Chengsong
parents: 543
diff changeset
   562
Chengsong
parents: 543
diff changeset
   563
	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
Chengsong
parents: 543
diff changeset
   564
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
Chengsong
parents: 543
diff changeset
   565
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
Chengsong
parents: 543
diff changeset
   566
	\Seq \; v_1 \; v_2}
Chengsong
parents: 543
diff changeset
   567
Chengsong
parents: 543
diff changeset
   568
	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
Chengsong
parents: 543
diff changeset
   569
Chengsong
parents: 543
diff changeset
   570
	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
Chengsong
parents: 543
diff changeset
   571
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
Chengsong
parents: 543
diff changeset
   572
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
Chengsong
parents: 543
diff changeset
   573
	(v::vs)}
Chengsong
parents: 543
diff changeset
   574
\end{mathpar}
Chengsong
parents: 543
diff changeset
   575
\caption{POSIX Lexing Rules}
Chengsong
parents: 543
diff changeset
   576
\end{figure}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   577
\noindent
564
Chengsong
parents: 543
diff changeset
   578
The above $\POSIX$ rules follows the intuition described below: 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   579
\begin{itemize}
564
Chengsong
parents: 543
diff changeset
   580
	\item (Left Priority)\\
Chengsong
parents: 543
diff changeset
   581
		Match the leftmost regular expression when multiple options of matching
Chengsong
parents: 543
diff changeset
   582
		are available.
Chengsong
parents: 543
diff changeset
   583
	\item (Maximum munch)\\
Chengsong
parents: 543
diff changeset
   584
		Always match a subpart as much as possible before proceeding
Chengsong
parents: 543
diff changeset
   585
		to the next token.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   586
\end{itemize}
564
Chengsong
parents: 543
diff changeset
   587
\noindent
Chengsong
parents: 543
diff changeset
   588
These disambiguation strategies can be 
Chengsong
parents: 543
diff changeset
   589
quite practical.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   590
For instance, when lexing a code snippet 
564
Chengsong
parents: 543
diff changeset
   591
\[ 
Chengsong
parents: 543
diff changeset
   592
	\textit{iffoo} = 3
Chengsong
parents: 543
diff changeset
   593
\]
Chengsong
parents: 543
diff changeset
   594
using the regular expression (with 
Chengsong
parents: 543
diff changeset
   595
keyword and identifier having their 
Chengsong
parents: 543
diff changeset
   596
usualy definitions on any formal
Chengsong
parents: 543
diff changeset
   597
language textbook, for instance
Chengsong
parents: 543
diff changeset
   598
keyword is a nonempty string starting with letters 
Chengsong
parents: 543
diff changeset
   599
followed by alphanumeric characters or underscores):
Chengsong
parents: 543
diff changeset
   600
\[
Chengsong
parents: 543
diff changeset
   601
	\textit{keyword} + \textit{identifier},
Chengsong
parents: 543
diff changeset
   602
\]
Chengsong
parents: 543
diff changeset
   603
we want $\textit{iffoo}$ to be recognized
Chengsong
parents: 543
diff changeset
   604
as an identifier rather than a keyword (if)
Chengsong
parents: 543
diff changeset
   605
followed by
Chengsong
parents: 543
diff changeset
   606
an identifier (foo).
Chengsong
parents: 543
diff changeset
   607
POSIX lexing achieves this.
Chengsong
parents: 543
diff changeset
   608
Chengsong
parents: 543
diff changeset
   609
We know that a $\POSIX$ value is also a normal underlying
Chengsong
parents: 543
diff changeset
   610
value:
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   611
\begin{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   612
$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   613
\end{lemma}
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   614
\noindent
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   615
The good property about a $\POSIX$ value is that 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   616
given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   617
one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   618
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   619
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   620
\end{lemma}
539
Chengsong
parents: 538
diff changeset
   621
\begin{proof}
564
Chengsong
parents: 543
diff changeset
   622
By induction on $s$, $r$ and $v_1$. The inductive cases
Chengsong
parents: 543
diff changeset
   623
are all the POSIX rules. 
Chengsong
parents: 543
diff changeset
   624
Each case is proven by a combination of
Chengsong
parents: 543
diff changeset
   625
the induction rules for $\POSIX$ 
Chengsong
parents: 543
diff changeset
   626
values and the inductive hypothesis.
Chengsong
parents: 543
diff changeset
   627
Probably the most cumbersome cases are 
Chengsong
parents: 543
diff changeset
   628
the sequence and star with non-empty iterations.
539
Chengsong
parents: 538
diff changeset
   629
Chengsong
parents: 538
diff changeset
   630
We give the reasoning about the sequence case as follows:
Chengsong
parents: 538
diff changeset
   631
When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
Chengsong
parents: 538
diff changeset
   632
we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
Chengsong
parents: 538
diff changeset
   633
and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
Chengsong
parents: 538
diff changeset
   634
For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
Chengsong
parents: 538
diff changeset
   635
possibly form a $\POSIX$ for $s$.
Chengsong
parents: 538
diff changeset
   636
If we have some other values $v_1'$ and $v_2'$ such that 
Chengsong
parents: 538
diff changeset
   637
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
Chengsong
parents: 538
diff changeset
   638
Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, 
Chengsong
parents: 538
diff changeset
   639
which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
Chengsong
parents: 538
diff changeset
   640
is the same as $\Seq(v_1, v_2)$. 
Chengsong
parents: 538
diff changeset
   641
\end{proof}
564
Chengsong
parents: 543
diff changeset
   642
Now we know what a $\POSIX$ value is; the problem is how do we achieve 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   643
such a value in a lexing algorithm, using derivatives?
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   644
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   645
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   646
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   647
The contribution of Sulzmann and Lu is an extension of Brzozowski's
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   648
algorithm by a second phase (the first phase being building successive
564
Chengsong
parents: 543
diff changeset
   649
derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   650
Two functions are involved: $\inj$ and $\mkeps$.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   651
The function $\mkeps$ constructs a value from the last
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   652
one of all the successive derivatives:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   653
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   654
\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   655
\begin{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   656
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   657
	        & 	              & 	            & v_n       
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   658
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   659
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   660
\end{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   661
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   662
It tells us how can an empty string be matched by a 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   663
regular expression, in a $\POSIX$ way:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   664
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   665
	\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   666
		\begin{tabular}{lcl}
564
Chengsong
parents: 543
diff changeset
   667
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
Chengsong
parents: 543
diff changeset
   668
			$\mkeps \; r_{1}+r_{2}$	& $\dn$ 
Chengsong
parents: 543
diff changeset
   669
						& \textit{if} ($\nullable \; r_{1})$\\ 
Chengsong
parents: 543
diff changeset
   670
			& & \textit{then} $\Left (\mkeps \; r_{1})$\\ 
Chengsong
parents: 543
diff changeset
   671
			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
Chengsong
parents: 543
diff changeset
   672
			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
Chengsong
parents: 543
diff changeset
   673
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   674
		\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   675
	\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   676
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   677
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   678
\noindent 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   679
We favour the left to match an empty string if there is a choice.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   680
When there is a star for us to match the empty string,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   681
we give the $\Stars$ constructor an empty list, meaning
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   682
no iterations are taken.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   683
The result of a call to $\mkeps$ on a $\nullable$ $r$ would
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   684
be a $\POSIX$ value corresponding to $r$:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   685
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   686
$\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   687
\end{lemma}\label{mePosix}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   688
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   689
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   690
After the $\mkeps$-call, we inject back the characters one by one in order to build
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   691
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   692
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   693
After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   694
matches $s$. 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   695
To do this, Sulzmann and Lu defined a function that reverses
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   696
the ``chopping off'' of characters during the derivative phase. The
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   697
corresponding function is called \emph{injection}, written
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   698
$\textit{inj}$; it takes three arguments: the first one is a regular
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   699
expression ${r_{i-1}}$, before the character is chopped off, the second
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   700
is a character ${c_{i-1}}$, the character we want to inject and the
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   701
third argument is the value ${v_i}$, into which one wants to inject the
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   702
character (it corresponds to the regular expression after the character
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   703
has been chopped off). The result of this function is a new value. 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   704
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   705
\begin{equation}\label{graph:inj}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   706
\begin{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   707
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   708
v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   709
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   710
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   711
\end{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   712
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   713
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   714
\noindent
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   715
The
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   716
definition of $\textit{inj}$ is as follows: 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   717
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   718
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   719
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   720
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   721
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   722
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   723
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   724
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   725
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   726
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   727
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   728
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   729
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   730
\noindent 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   731
This definition is by recursion on the ``shape'' of regular
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   732
expressions and values. 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   733
The clauses do one thing--identifying the ``hole'' on a
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   734
value to inject the character back into.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   735
For instance, in the last clause for injecting back to a value
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   736
that would turn into a new star value that corresponds to a star,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   737
we know it must be a sequence value. And we know that the first 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   738
value of that sequence corresponds to the child regex of the star
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   739
with the first character being chopped off--an iteration of the star
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   740
that had just been unfolded. This value is followed by the already
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   741
matched star iterations we collected before. So we inject the character 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   742
back to the first value and form a new value with this latest iteration
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   743
being added to the previous list of iterations, all under the $\Stars$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   744
top level.
539
Chengsong
parents: 538
diff changeset
   745
The POSIX value is maintained throughout the process:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   746
\begin{lemma}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   747
$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   748
\end{lemma}\label{injPosix}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   749
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   750
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   751
Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   752
and taking into consideration the possibility of a non-match,
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   753
we have a lexer with the following recursive definition:
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   754
\begin{center}
539
Chengsong
parents: 538
diff changeset
   755
\begin{tabular}{lcl}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   756
$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   757
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
539
Chengsong
parents: 538
diff changeset
   758
& & $\quad \None \implies \None$\\
Chengsong
parents: 538
diff changeset
   759
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   760
\end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   761
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   762
 \noindent
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   763
 The central property of the $\lexer$ is that it gives the correct result by
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   764
 $\POSIX$ standards:
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   765
 \begin{theorem}
564
Chengsong
parents: 543
diff changeset
   766
	 The $\lexer$ based on derivatives and injections is both sound and complete:
Chengsong
parents: 543
diff changeset
   767
 \begin{tabular}{lcl}
Chengsong
parents: 543
diff changeset
   768
	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
Chengsong
parents: 543
diff changeset
   769
	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   770
 \end{tabular}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   771
 \end{theorem}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   772
 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   773
 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   774
 \begin{proof}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   775
 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   776
 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   777
 by lemma \ref{injPosix}.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   778
 \end{proof}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   779
539
Chengsong
parents: 538
diff changeset
   780
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   781
We now give a pictorial view of the algorithm (
539
Chengsong
parents: 538
diff changeset
   782
For convenience, we employ the following notations: the regular
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   783
expression we start with is $r_0$, and the given string $s$ is composed
539
Chengsong
parents: 538
diff changeset
   784
of characters $c_0 c_1 \ldots c_{n-1}$. The
Chengsong
parents: 538
diff changeset
   785
values built incrementally by \emph{injecting} back the characters into the
Chengsong
parents: 538
diff changeset
   786
earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
Chengsong
parents: 538
diff changeset
   787
are always in the same subscript, i.e. $\vdash v_i : r_i$):
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   788
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   789
\begin{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   790
\begin{equation}\label{graph:2}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   791
\begin{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   792
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   793
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   794
\end{tikzcd}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   795
\end{equation}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   796
\end{ceqn}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   797
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   798
\noindent
539
Chengsong
parents: 538
diff changeset
   799
As we did earlier in this chapter on the matcher, one can 
Chengsong
parents: 538
diff changeset
   800
introduce simplification on the regex.
Chengsong
parents: 538
diff changeset
   801
However, now we need to do a backward phase and make sure
Chengsong
parents: 538
diff changeset
   802
the values align with the regular expressions.
Chengsong
parents: 538
diff changeset
   803
Therefore one has to
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   804
be careful not to break the correctness, as the injection 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   805
function heavily relies on the structure of the regexes and values
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   806
being correct and matching each other.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   807
It can be achieved by recording some extra rectification functions
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   808
during the derivatives step, and applying these rectifications in 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   809
each run during the injection phase.
539
Chengsong
parents: 538
diff changeset
   810
Chengsong
parents: 538
diff changeset
   811
\ChristianComment{Do I introduce the lexer with rectification here?}
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   812
And we can prove that the POSIX value of how
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   813
regular expressions match strings will not be affected---although it is much harder
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   814
to establish. 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   815
Some initial results in this regard have been
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   816
obtained in \cite{AusafDyckhoffUrban2016}. 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   817
539
Chengsong
parents: 538
diff changeset
   818
However, even with these simplification rules, we could still end up in
Chengsong
parents: 538
diff changeset
   819
trouble, when we encounter cases that require more involved and aggressive
Chengsong
parents: 538
diff changeset
   820
simplifications.
Chengsong
parents: 538
diff changeset
   821
\section{A Case Requring More Aggressive Simplification}
Chengsong
parents: 538
diff changeset
   822
For example, when starting with the regular
Chengsong
parents: 538
diff changeset
   823
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
Chengsong
parents: 538
diff changeset
   824
w.r.t.~the character $a$, one obtains a derivative regular expression
Chengsong
parents: 538
diff changeset
   825
with more than 9000 nodes (when viewed as a tree)
Chengsong
parents: 538
diff changeset
   826
even with simplification.
Chengsong
parents: 538
diff changeset
   827
\begin{figure}
Chengsong
parents: 538
diff changeset
   828
\begin{tikzpicture}
Chengsong
parents: 538
diff changeset
   829
\begin{axis}[
Chengsong
parents: 538
diff changeset
   830
    xlabel={$n$},
Chengsong
parents: 538
diff changeset
   831
    ylabel={size},
Chengsong
parents: 538
diff changeset
   832
    legend entries={Naive Matcher},  
Chengsong
parents: 538
diff changeset
   833
    legend pos=north west,
Chengsong
parents: 538
diff changeset
   834
    legend cell align=left]
Chengsong
parents: 538
diff changeset
   835
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
Chengsong
parents: 538
diff changeset
   836
\end{axis}
Chengsong
parents: 538
diff changeset
   837
\end{tikzpicture} 
Chengsong
parents: 538
diff changeset
   838
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
Chengsong
parents: 538
diff changeset
   839
\end{figure}\label{fig:BetterWaterloo}
Chengsong
parents: 538
diff changeset
   840
   
Chengsong
parents: 538
diff changeset
   841
That is because our lexing algorithm currently keeps a lot of 
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   842
"useless" values that will not be used. 
539
Chengsong
parents: 538
diff changeset
   843
These different ways of matching will grow exponentially with the string length.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   844
539
Chengsong
parents: 538
diff changeset
   845
For $r= (a^*\cdot a^*)^*$ and  
Chengsong
parents: 538
diff changeset
   846
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
Chengsong
parents: 538
diff changeset
   847
if we do not allow any empty iterations in its lexical values,
Chengsong
parents: 538
diff changeset
   848
there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   849
split or not so that each sub-string
539
Chengsong
parents: 538
diff changeset
   850
segmented by those chosen splitting points will form different iterations.
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   851
For example when $n=4$, we give out a few of the many possibilities of splitting:
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   852
\begin{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   853
\begin{tabular}{lcr}
541
5bf9f94c02e1 some comments implemented
Chengsong
parents: 539
diff changeset
   854
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
539
Chengsong
parents: 538
diff changeset
   855
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
Chengsong
parents: 538
diff changeset
   856
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
Chengsong
parents: 538
diff changeset
   857
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
Chengsong
parents: 538
diff changeset
   858
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   859
 & $\textit{etc}.$ &
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   860
 \end{tabular}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   861
\end{center}
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   862
\noindent
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   863
And for each iteration, there are still multiple ways to split
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   864
between the two $a^*$s.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   865
It is not surprising there are exponentially many lexical values
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   866
that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   867
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   868
A lexer to keep all the possible values will naturally 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   869
have an exponential runtime on ambiguous regular expressions.
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   870
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   871
of a match. This means Sulzmann and Lu's injection-based algorithm 
539
Chengsong
parents: 538
diff changeset
   872
 exponential by nature.
Chengsong
parents: 538
diff changeset
   873
Somehow one has to make sure which
564
Chengsong
parents: 543
diff changeset
   874
 lexical values are $\POSIX$ and must be kept in a lexing algorithm.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   875
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   876
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   877
 For example, the above $r= (a^*\cdot a^*)^*$  and 
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   878
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   879
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
539
Chengsong
parents: 538
diff changeset
   880
We want to keep this value only, and remove all the regular expression subparts
Chengsong
parents: 538
diff changeset
   881
not corresponding to this value during lexing.
Chengsong
parents: 538
diff changeset
   882
To do this, a two-phase algorithm with rectification is a bit too fragile.
Chengsong
parents: 538
diff changeset
   883
Can we not create those intermediate values $v_1,\ldots v_n$,
Chengsong
parents: 538
diff changeset
   884
and get the lexing information that should be already there while
Chengsong
parents: 538
diff changeset
   885
doing derivatives in one pass, without a second injection phase?
564
Chengsong
parents: 543
diff changeset
   886
In the meantime, can we ensure that simplifications
539
Chengsong
parents: 538
diff changeset
   887
are easily handled without breaking the correctness of the algorithm?
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   888
539
Chengsong
parents: 538
diff changeset
   889
Sulzmann and Lu solved this problem by
Chengsong
parents: 538
diff changeset
   890
introducing additional information to the 
Chengsong
parents: 538
diff changeset
   891
regular expressions called \emph{bitcodes}.
538
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   892
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   893
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   894
8016a2480704 intro and chap2
Chengsong
parents: 536
diff changeset
   895