8 %and then give the algorithm and its variant, and discuss |
8 %and then give the algorithm and its variant, and discuss |
9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 |
11 |
11 |
12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions} |
12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions} |
13 We have a built-in datatype char, made up of characters, which we do not define |
13 We have a primitive datatype char, denoting characters. |
14 on top of anything else. |
14 \[ char ::= a |
|
15 \mid b |
|
16 \mid c |
|
17 \mid \ldots |
|
18 \mid z |
|
19 \] |
|
20 (which one is better?) |
15 \begin{center} |
21 \begin{center} |
16 \begin{tabular}{lcl} |
22 \begin{tabular}{lcl} |
17 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\ |
23 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\ |
18 \end{tabular} |
24 \end{tabular} |
19 \end{center} |
25 \end{center} |
44 \begin{tabular}{lcl} |
51 \begin{tabular}{lcl} |
45 $A^0 $ & $\dn$ & $\{ [] \}$\\ |
52 $A^0 $ & $\dn$ & $\{ [] \}$\\ |
46 $A^{n+1}$ & $\dn$ & $A^n @ A$ |
53 $A^{n+1}$ & $\dn$ & $A^n @ A$ |
47 \end{tabular} |
54 \end{tabular} |
48 \end{center} |
55 \end{center} |
49 The infinite set of all the power of a language unioned together |
56 The union of all the natural number powers of a language |
50 is defined using the power operator, also in recursive function: |
57 is denoted by the Kleene star operator: |
51 \begin{center} |
58 \begin{center} |
52 \begin{tabular}{lcl} |
59 \begin{tabular}{lcl} |
53 $A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\ |
60 $\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\ |
54 \end{tabular} |
61 \end{tabular} |
55 \end{center} |
62 \end{center} |
56 We also define an operation of chopping off a character from all the strings |
63 |
57 in a set: |
64 In Isabelle of course we cannot easily get a counterpart of |
|
65 the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star |
|
66 as an inductive set: |
|
67 \begin{center} |
|
68 \begin{tabular}{lcl} |
|
69 $[] \in A^*$ & &\\ |
|
70 $s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\ |
|
71 \end{tabular} |
|
72 \end{center} |
|
73 |
|
74 We also define an operation of chopping off a character from |
|
75 a language: |
58 \begin{center} |
76 \begin{center} |
59 \begin{tabular}{lcl} |
77 \begin{tabular}{lcl} |
60 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
78 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
61 \end{tabular} |
79 \end{tabular} |
62 \end{center} |
80 \end{center} |
63 With the above definitions, it becomes natural to define regular expressions |
81 |
64 as a concise way for expressing the languages. |
82 This can be generalised to chopping off a string from all strings within set $A$: |
|
83 \begin{center} |
|
84 \begin{tabular}{lcl} |
|
85 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\ |
|
86 \end{tabular} |
|
87 \end{center} |
|
88 |
|
89 which is essentially the left quotient $A \backslash L'$ of $A$ against |
|
90 the singleton language $L' = \{w\}$ |
|
91 in formal language theory. |
|
92 For this dissertation the $\textit{Ders}$ notation would suffice, there is |
|
93 no need for a more general derivative definition. |
|
94 |
|
95 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
|
96 we have a few properties of how the language derivative can be defined using |
|
97 sub-languages. |
|
98 \begin{lemma} |
|
99 $\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$ |
|
100 \end{lemma} |
|
101 \noindent |
|
102 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it |
|
103 and get to $B$. |
|
104 |
|
105 The language $A^*$'s derivative can be described using the language derivative |
|
106 of $A$: |
|
107 \begin{lemma} |
|
108 $\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\ |
|
109 \end{lemma} |
|
110 \begin{proof} |
|
111 \begin{itemize} |
|
112 \item{$\subseteq$} |
|
113 The set |
|
114 \[ \{s \mid c :: s \in A^*\} \] |
|
115 is enclosed in the set |
|
116 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \] |
|
117 because whenever you have a string starting with a character |
|
118 in the language of a Kleene star $A^*$, then that character together with some sub-string |
|
119 immediately after it will form the first iteration, and the rest of the string will |
|
120 be still in $A^*$. |
|
121 \item{$\supseteq$} |
|
122 Note that |
|
123 \[ \Der \; c \; A^* = \Der \; c \; (\{ [] \} \cup (A @ A^*) ) \] |
|
124 and |
|
125 \[ \Der \; c \; (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \] |
|
126 where the $\textit{RHS}$ of the above equatioin can be rewritten |
|
127 as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set. |
|
128 \end{itemize} |
|
129 \end{proof} |
|
130 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart |
|
131 for regular languages, we need to first give definitions for regular expressions. |
|
132 |
65 \section{Regular Expressions and Their Language Interpretation} |
133 \section{Regular Expressions and Their Language Interpretation} |
66 Suppose we have an alphabet $\Sigma$, the strings whose characters |
134 Suppose we have an alphabet $\Sigma$, the strings whose characters |
67 are from $\Sigma$ |
135 are from $\Sigma$ |
68 can be expressed as $\Sigma^*$. |
136 can be expressed as $\Sigma^*$. |
69 |
137 |
88 \end{center} |
156 \end{center} |
89 Which are also called the "language interpretation". |
157 Which are also called the "language interpretation". |
90 |
158 |
91 |
159 |
92 |
160 |
93 The Brzozowski derivative w.r.t character $c$ is an operation on the regex, |
|
94 where the operation transforms the regex to a new one containing |
|
95 strings without the head character $c$. |
|
96 |
|
97 Formally, we define first such a transformation on any string set, which |
|
98 we call semantic derivative: |
|
99 \begin{center} |
|
100 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$ |
|
101 \end{center} |
|
102 Mathematically, it can be expressed as the |
|
103 |
|
104 If the $\textit{A}$ happen to have some structure, for example, |
|
105 if it is regular, then we have that it |
|
106 |
161 |
107 % Derivatives of a |
162 % Derivatives of a |
108 %regular expression, written $r \backslash c$, give a simple solution |
163 %regular expression, written $r \backslash c$, give a simple solution |
109 %to the problem of matching a string $s$ with a regular |
164 %to the problem of matching a string $s$ with a regular |
110 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
165 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
111 %succession) all the characters of the string matches the empty string, |
166 %succession) all the characters of the string matches the empty string, |
112 %then $r$ matches $s$ (and {\em vice versa}). |
167 %then $r$ matches $s$ (and {\em vice versa}). |
113 |
168 |
114 |
169 |
115 \section{Brzozowski Derivatives of Regular Expressions} |
170 \section{Brzozowski Derivatives of Regular Expressions} |
116 The the derivative of regular expression, denoted as |
171 Now with semantic derivatives of a language and regular expressions and |
|
172 their language interpretations, we are ready to define derivatives on regexes. |
|
173 |
|
174 The Brzozowski derivative w.r.t character $c$ is an operation on the regex, |
|
175 where the operation transforms the regex to a new one containing |
|
176 strings without the head character $c$. |
|
177 |
|
178 The derivative of regular expression, denoted as |
117 $r \backslash c$, is a function that takes parameters |
179 $r \backslash c$, is a function that takes parameters |
118 $r$ and $c$, and returns another regular expression $r'$, |
180 $r$ and $c$, and returns another regular expression $r'$, |
119 which is computed by the following recursive function: |
181 which is computed by the following recursive function: |
120 |
182 |
121 \begin{center} |
183 \begin{center} |
399 between the two $a^*$s. |
461 between the two $a^*$s. |
400 It is not surprising there are exponentially many lexical values |
462 It is not surprising there are exponentially many lexical values |
401 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
402 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
403 |
465 |
404 A lexer aimed at getting all the possible values has an exponential |
466 A lexer aimed at keeping all the possible values will naturally |
405 worst case runtime. Therefore it is impractical to try to generate |
467 have an exponential runtime on ambiguous regular expressions. |
406 all possible matches in a run. In practice, we are usually |
468 Somehow one has to decide which lexical value to keep and |
|
469 output in a lexing algorithm. |
|
470 In practice, we are usually |
407 interested about POSIX values, which by intuition always |
471 interested about POSIX values, which by intuition always |
408 \begin{itemize} |
472 \begin{itemize} |
409 \item |
473 \item |
410 match the leftmost regular expression when multiple options of matching |
474 match the leftmost regular expression when multiple options of matching |
411 are available |
475 are available |
412 \item |
476 \item |
413 always match a subpart as much as possible before proceeding |
477 always match a subpart as much as possible before proceeding |
414 to the next token. |
478 to the next token. |
415 \end{itemize} |
479 \end{itemize} |
|
480 The formal definition of a $\POSIX$ value can be described |
|
481 in the following set of rules: |
|
482 \ |
416 |
483 |
417 |
484 |
418 For example, the above example has the POSIX value |
485 For example, the above example has the POSIX value |
419 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
486 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
420 The output of an algorithm we want would be a POSIX matching |
487 The output of an algorithm we want would be a POSIX matching |