8 %and then give the algorithm and its variant, and discuss |
8 %and then give the algorithm and its variant, and discuss |
9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 |
11 In this chapter, we define the basic notions |
11 In this chapter, we define the basic notions |
12 for regular languages and regular expressions. |
12 for regular languages and regular expressions. |
|
13 This is essentially a description in "English" |
|
14 of your formalisation in Isabelle/HOL. |
13 We also give the definition of what $\POSIX$ lexing means. |
15 We also give the definition of what $\POSIX$ lexing means. |
14 |
16 |
15 \section{Basic Concepts} |
17 \section{Basic Concepts} |
16 Usually in formal language theory there is an alphabet |
18 Usually formal language theory starts with an alphabet |
17 denoting a set of characters. |
19 denoting a set of characters. |
18 Here we only use the datatype of characters from Isabelle, |
20 Here we just use the datatype of characters from Isabelle, |
19 which roughly corresponds to the ASCII character. |
21 which roughly corresponds to the ASCII characters. |
20 Then using the usual $[]$ notation for lists, |
22 In what follows we shall leave the information about the alphabet |
21 we can define strings using chars: |
23 implicit. |
22 \begin{center} |
24 Then using the usual bracket notation for lists, |
23 \begin{tabular}{lcl} |
25 we can define strings made up of characters: |
24 $\textit{string}$ & $\dn$ & $[] | c :: cs$\\ |
26 \begin{center} |
25 & & $(c\; \text{has char type})$ |
27 \begin{tabular}{lcl} |
26 \end{tabular} |
28 $\textit{s}$ & $\dn$ & $[] \; |\; c :: s$ |
27 \end{center} |
29 \end{tabular} |
28 And strings can be concatenated to form longer strings, |
30 \end{center} |
29 in the same way as we concatenate two lists, |
31 Where $c$ is a variable ranging over characters. |
30 which we denote as $@$. We omit the precise |
32 Strings can be concatenated to form longer strings in the same |
|
33 way as we concatenate two lists, which we write as @. |
|
34 We omit the precise |
31 recursive definition here. |
35 recursive definition here. |
32 We overload this concatenation operator for two sets of strings: |
36 We overload this concatenation operator for two sets of strings: |
33 \begin{center} |
37 \begin{center} |
34 \begin{tabular}{lcl} |
38 \begin{tabular}{lcl} |
35 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\ |
39 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\ |
36 \end{tabular} |
40 \end{tabular} |
37 \end{center} |
41 \end{center} |
38 We also call the above \emph{language concatenation}. |
42 We also call the above \emph{language concatenation}. |
39 The power of a language is defined recursively, using the |
43 The power of a language is defined recursively, using the |
40 concatenation operator $@$: |
44 concatenation operator $@$: |
41 \begin{center} |
45 \begin{center} |
42 \begin{tabular}{lcl} |
46 \begin{tabular}{lcl} |
43 $A^0 $ & $\dn$ & $\{ [] \}$\\ |
47 $A^0 $ & $\dn$ & $\{ [] \}$\\ |
44 $A^{n+1}$ & $\dn$ & $A^n @ A$ |
48 $A^{n+1}$ & $\dn$ & $A @ A^n$ |
45 \end{tabular} |
49 \end{tabular} |
46 \end{center} |
50 \end{center} |
47 The union of all the natural number powers of a language |
51 The union of all the natural number powers of a language |
48 is defined as the Kleene star operator: |
52 is usually defined as the Kleene star operator: |
49 \begin{center} |
53 \begin{center} |
50 \begin{tabular}{lcl} |
54 \begin{tabular}{lcl} |
51 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\ |
55 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\ |
52 \end{tabular} |
56 \end{tabular} |
53 \end{center} |
57 \end{center} |
63 \inferrule{}{[] \in A*\\} |
67 \inferrule{}{[] \in A*\\} |
64 |
68 |
65 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*} |
69 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*} |
66 \end{mathpar} |
70 \end{mathpar} |
67 \end{center} |
71 \end{center} |
68 |
72 \ChristianComment{Yes, used the inferrule command in mathpar} |
69 We also define an operation of "chopping of" a character from |
73 We also define an operation of "chopping off" a character from |
70 a language, which we call $\Der$, meaning "Derivative for a language": |
74 a language, which we call $\Der$, meaning \emph{Derivative} (for a language): |
71 \begin{center} |
75 \begin{center} |
72 \begin{tabular}{lcl} |
76 \begin{tabular}{lcl} |
73 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
77 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\ |
74 \end{tabular} |
78 \end{tabular} |
75 \end{center} |
79 \end{center} |
76 \noindent |
80 \noindent |
77 This can be generalised to "chopping off" a string from all strings within set $A$, |
81 This can be generalised to "chopping off" a string from all strings within set $A$, |
78 with the help of the concatenation operator: |
82 namely: |
79 \begin{center} |
83 \begin{center} |
80 \begin{tabular}{lcl} |
84 \begin{tabular}{lcl} |
81 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\ |
85 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\ |
82 \end{tabular} |
86 \end{tabular} |
83 \end{center} |
87 \end{center} |
84 \noindent |
88 \noindent |
85 which is essentially the left quotient $A \backslash L'$ of $A$ against |
89 which is essentially the left quotient $A \backslash L$ of $A$ against |
86 the singleton language $L' = \{w\}$ |
90 the singleton language with $L = \{w\}$ |
87 in formal language theory. |
91 in formal language theory. |
88 For this dissertation the $\textit{Ders}$ definition with |
92 However for the purposes here, the $\textit{Ders}$ definition with |
89 a single string suffices. |
93 a single string is sufficient. |
90 |
94 |
91 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
95 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages, |
92 we have a few properties of how the language derivative can be defined using |
96 we have a few properties of how the language derivative can be defined using |
93 sub-languages. |
97 sub-languages. |
94 \begin{lemma} |
98 \begin{lemma} |
508 The reason why we are interested in $\POSIX$ values is that they can |
512 The reason why we are interested in $\POSIX$ values is that they can |
509 be practically used in the lexing phase of a compiler front end. |
513 be practically used in the lexing phase of a compiler front end. |
510 For instance, when lexing a code snippet |
514 For instance, when lexing a code snippet |
511 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
515 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
512 as an identifier rather than a keyword. |
516 as an identifier rather than a keyword. |
513 |
517 \ChristianComment{Do I also introduce lexical values $LV$ here?} |
|
518 We know that $\POSIX$ values are also part of the normal values: |
|
519 \begin{lemma} |
|
520 $(r, s) \rightarrow v \implies \vdash v: r$ |
|
521 \end{lemma} |
|
522 \noindent |
514 The good property about a $\POSIX$ value is that |
523 The good property about a $\POSIX$ value is that |
515 given the same regular expression $r$ and string $s$, |
524 given the same regular expression $r$ and string $s$, |
516 one can always uniquely determine the $\POSIX$ value for it: |
525 one can always uniquely determine the $\POSIX$ value for it: |
517 \begin{lemma} |
526 \begin{lemma} |
518 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
527 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
657 \end{tabular} |
666 \end{tabular} |
658 \end{center} |
667 \end{center} |
659 \noindent |
668 \noindent |
660 The central property of the $\lexer$ is that it gives the correct result by |
669 The central property of the $\lexer$ is that it gives the correct result by |
661 $\POSIX$ standards: |
670 $\POSIX$ standards: |
662 \begin{lemma} |
671 \begin{theorem} |
663 \begin{tabular}{l} |
672 \begin{tabular}{l} |
664 $s \in L(r) \Longleftrightarrow (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\ |
673 $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\ |
665 $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$ |
674 $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$ |
666 \end{tabular} |
675 \end{tabular} |
667 \end{lemma} |
676 \end{theorem} |
668 |
677 |
669 |
678 |
670 \begin{proof} |
679 \begin{proof} |
671 By induction on $s$. $r$ is allowed to be an arbitrary regular expression. |
680 By induction on $s$. $r$ is allowed to be an arbitrary regular expression. |
672 The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case |
681 The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case |
673 by lemma \ref{injPosix}. |
682 by lemma \ref{injPosix}. |
674 \end{proof} |
683 \end{proof} |
675 |
684 |
676 |
685 |
677 Pictorially, the algorithm is as follows ( |
686 We now give a pictorial view of the algorithm ( |
678 For convenience, we employ the following notations: the regular |
687 For convenience, we employ the following notations: the regular |
679 expression we start with is $r_0$, and the given string $s$ is composed |
688 expression we start with is $r_0$, and the given string $s$ is composed |
680 of characters $c_0 c_1 \ldots c_{n-1}$. The |
689 of characters $c_0 c_1 \ldots c_{n-1}$. The |
681 values built incrementally by \emph{injecting} back the characters into the |
690 values built incrementally by \emph{injecting} back the characters into the |
682 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters |
691 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters |
733 \end{tikzpicture} |
742 \end{tikzpicture} |
734 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
743 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
735 \end{figure}\label{fig:BetterWaterloo} |
744 \end{figure}\label{fig:BetterWaterloo} |
736 |
745 |
737 That is because our lexing algorithm currently keeps a lot of |
746 That is because our lexing algorithm currently keeps a lot of |
738 "useless values that will never not be used. |
747 "useless" values that will not be used. |
739 These different ways of matching will grow exponentially with the string length. |
748 These different ways of matching will grow exponentially with the string length. |
740 |
749 |
741 For $r= (a^*\cdot a^*)^*$ and |
750 For $r= (a^*\cdot a^*)^*$ and |
742 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$, |
751 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$, |
743 if we do not allow any empty iterations in its lexical values, |
752 if we do not allow any empty iterations in its lexical values, |
744 there will be $n - 1$ "splitting points" on $s$ we can independently choose to |
753 there will be $n - 1$ "splitting points" on $s$ we can independently choose to |
745 split or not so that each sub-string |
754 split or not so that each sub-string |
746 segmented by those chosen splitting points will form different iterations. |
755 segmented by those chosen splitting points will form different iterations. |
747 For example when $n=4$, |
756 For example when $n=4$, we give out a few of the many possibilities of splitting: |
748 \begin{center} |
757 \begin{center} |
749 \begin{tabular}{lcr} |
758 \begin{tabular}{lcr} |
750 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\ |
759 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\ |
751 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\ |
760 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\ |
752 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\ |
761 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\ |
753 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\ |
762 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\ |
754 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\ |
763 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\ |
755 & $\textit{etc}.$ & |
764 & $\textit{etc}.$ & |