ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 541 5bf9f94c02e1
parent 539 7cf9f17aa179
child 543 b2bea5968b89
equal deleted inserted replaced
540:3a1fd5ea2484 541:5bf9f94c02e1
     8 %and then give the algorithm and its variant, and discuss
     8 %and then give the algorithm and its variant, and discuss
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 
    11 In this chapter, we define the basic notions 
    11 In this chapter, we define the basic notions 
    12 for regular languages and regular expressions.
    12 for regular languages and regular expressions.
       
    13 This is essentially a description in "English"
       
    14 of your formalisation in Isabelle/HOL.
    13 We also give the definition of what $\POSIX$ lexing means.
    15 We also give the definition of what $\POSIX$ lexing means.
    14 
    16 
    15 \section{Basic Concepts}
    17 \section{Basic Concepts}
    16 Usually in formal language theory there is an alphabet 
    18 Usually formal language theory starts with an alphabet 
    17 denoting a set of characters.
    19 denoting a set of characters.
    18 Here we only use the datatype of characters from Isabelle,
    20 Here we just use the datatype of characters from Isabelle,
    19 which roughly corresponds to the ASCII character.
    21 which roughly corresponds to the ASCII characters.
    20 Then using the usual $[]$ notation for lists,
    22 In what follows we shall leave the information about the alphabet
    21 we can define strings using chars:
    23 implicit.
    22 \begin{center}
    24 Then using the usual bracket notation for lists,
    23 \begin{tabular}{lcl}
    25 we can define strings made up of characters: 
    24 $\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
    26 \begin{center}
    25 & & $(c\; \text{has char type})$
    27 \begin{tabular}{lcl}
    26 \end{tabular}
    28 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
    27 \end{center}
    29 \end{tabular}
    28 And strings can be concatenated to form longer strings,
    30 \end{center}
    29 in the same way as we concatenate two lists,
    31 Where $c$ is a variable ranging over characters.
    30 which we denote as $@$. We omit the precise 
    32 Strings can be concatenated to form longer strings in the same
       
    33 way as we concatenate two lists, which we write as @.
       
    34 We omit the precise 
    31 recursive definition here.
    35 recursive definition here.
    32 We overload this concatenation operator for two sets of strings:
    36 We overload this concatenation operator for two sets of strings:
    33 \begin{center}
    37 \begin{center}
    34 \begin{tabular}{lcl}
    38 \begin{tabular}{lcl}
    35 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
    39 $A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
    36 \end{tabular}
    40 \end{tabular}
    37 \end{center}
    41 \end{center}
    38 We also call the above \emph{language concatenation}.
    42 We also call the above \emph{language concatenation}.
    39 The power of a language is defined recursively, using the 
    43 The power of a language is defined recursively, using the 
    40 concatenation operator $@$:
    44 concatenation operator $@$:
    41 \begin{center}
    45 \begin{center}
    42 \begin{tabular}{lcl}
    46 \begin{tabular}{lcl}
    43 $A^0 $ & $\dn$ & $\{ [] \}$\\
    47 $A^0 $ & $\dn$ & $\{ [] \}$\\
    44 $A^{n+1}$ & $\dn$ & $A^n @ A$
    48 $A^{n+1}$ & $\dn$ & $A @ A^n$
    45 \end{tabular}
    49 \end{tabular}
    46 \end{center}
    50 \end{center}
    47 The union of all the natural number powers of a language   
    51 The union of all the natural number powers of a language   
    48 is defined as the Kleene star operator:
    52 is usually defined as the Kleene star operator:
    49 \begin{center}
    53 \begin{center}
    50 \begin{tabular}{lcl}
    54 \begin{tabular}{lcl}
    51  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    55  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
    52 \end{tabular}
    56 \end{tabular}
    53 \end{center}
    57 \end{center}
    63 \inferrule{}{[] \in A*\\}
    67 \inferrule{}{[] \in A*\\}
    64 
    68 
    65 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
    69 \inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
    66 \end{mathpar}
    70 \end{mathpar}
    67 \end{center}
    71 \end{center}
    68 
    72 \ChristianComment{Yes, used the inferrule command in mathpar}
    69 We also define an operation of "chopping of" a character from
    73 We also define an operation of "chopping off" a character from
    70 a language, which we call $\Der$, meaning "Derivative for a language":
    74 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
    71 \begin{center}
    75 \begin{center}
    72 \begin{tabular}{lcl}
    76 \begin{tabular}{lcl}
    73 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    77 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    74 \end{tabular}
    78 \end{tabular}
    75 \end{center}
    79 \end{center}
    76 \noindent
    80 \noindent
    77 This can be generalised to "chopping off" a string from all strings within set $A$,
    81 This can be generalised to "chopping off" a string from all strings within set $A$, 
    78 with the help of the concatenation operator:
    82 namely:
    79 \begin{center}
    83 \begin{center}
    80 \begin{tabular}{lcl}
    84 \begin{tabular}{lcl}
    81 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
    85 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
    82 \end{tabular}
    86 \end{tabular}
    83 \end{center}
    87 \end{center}
    84 \noindent
    88 \noindent
    85 which is essentially the left quotient $A \backslash L'$ of $A$ against 
    89 which is essentially the left quotient $A \backslash L$ of $A$ against 
    86 the singleton language $L' = \{w\}$
    90 the singleton language with $L = \{w\}$
    87 in formal language theory.
    91 in formal language theory.
    88 For this dissertation the $\textit{Ders}$ definition with 
    92 However for the purposes here, the $\textit{Ders}$ definition with 
    89 a single string suffices.
    93 a single string is sufficient.
    90 
    94 
    91 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
    95 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
    92 we have a  few properties of how the language derivative can be defined using 
    96 we have a  few properties of how the language derivative can be defined using 
    93 sub-languages.
    97 sub-languages.
    94 \begin{lemma}
    98 \begin{lemma}
   508 The reason why we are interested in $\POSIX$ values is that they can
   512 The reason why we are interested in $\POSIX$ values is that they can
   509 be practically used in the lexing phase of a compiler front end.
   513 be practically used in the lexing phase of a compiler front end.
   510 For instance, when lexing a code snippet 
   514 For instance, when lexing a code snippet 
   511 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
   515 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
   512 as an identifier rather than a keyword.
   516 as an identifier rather than a keyword.
   513 
   517 \ChristianComment{Do I also introduce lexical values $LV$ here?}
       
   518 We know that $\POSIX$ values are also part of the normal values:
       
   519 \begin{lemma}
       
   520 $(r, s) \rightarrow v \implies \vdash v: r$
       
   521 \end{lemma}
       
   522 \noindent
   514 The good property about a $\POSIX$ value is that 
   523 The good property about a $\POSIX$ value is that 
   515 given the same regular expression $r$ and string $s$,
   524 given the same regular expression $r$ and string $s$,
   516 one can always uniquely determine the $\POSIX$ value for it:
   525 one can always uniquely determine the $\POSIX$ value for it:
   517 \begin{lemma}
   526 \begin{lemma}
   518 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   527 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   657 \end{tabular}
   666 \end{tabular}
   658 \end{center}
   667 \end{center}
   659  \noindent
   668  \noindent
   660  The central property of the $\lexer$ is that it gives the correct result by
   669  The central property of the $\lexer$ is that it gives the correct result by
   661  $\POSIX$ standards:
   670  $\POSIX$ standards:
   662  \begin{lemma}
   671  \begin{theorem}
   663  \begin{tabular}{l}
   672  \begin{tabular}{l}
   664  $s \in L(r) \Longleftrightarrow  (\exists v. \; r \; s = \Some(v) \land (r, \; s) \rightarrow v)$\\
   673  $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
   665  $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
   674  $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
   666  \end{tabular}
   675  \end{tabular}
   667  \end{lemma}
   676  \end{theorem}
   668  
   677  
   669  
   678  
   670  \begin{proof}
   679  \begin{proof}
   671  By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
   680  By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
   672  The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
   681  The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
   673  by lemma \ref{injPosix}.
   682  by lemma \ref{injPosix}.
   674  \end{proof}
   683  \end{proof}
   675 
   684 
   676 
   685 
   677 Pictorially, the algorithm is as follows (
   686 We now give a pictorial view of the algorithm (
   678 For convenience, we employ the following notations: the regular
   687 For convenience, we employ the following notations: the regular
   679 expression we start with is $r_0$, and the given string $s$ is composed
   688 expression we start with is $r_0$, and the given string $s$ is composed
   680 of characters $c_0 c_1 \ldots c_{n-1}$. The
   689 of characters $c_0 c_1 \ldots c_{n-1}$. The
   681 values built incrementally by \emph{injecting} back the characters into the
   690 values built incrementally by \emph{injecting} back the characters into the
   682 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
   691 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
   733 \end{tikzpicture} 
   742 \end{tikzpicture} 
   734 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
   743 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
   735 \end{figure}\label{fig:BetterWaterloo}
   744 \end{figure}\label{fig:BetterWaterloo}
   736    
   745    
   737 That is because our lexing algorithm currently keeps a lot of 
   746 That is because our lexing algorithm currently keeps a lot of 
   738 "useless values that will never not be used. 
   747 "useless" values that will not be used. 
   739 These different ways of matching will grow exponentially with the string length.
   748 These different ways of matching will grow exponentially with the string length.
   740 
   749 
   741 For $r= (a^*\cdot a^*)^*$ and  
   750 For $r= (a^*\cdot a^*)^*$ and  
   742 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
   751 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
   743 if we do not allow any empty iterations in its lexical values,
   752 if we do not allow any empty iterations in its lexical values,
   744 there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
   753 there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
   745 split or not so that each sub-string
   754 split or not so that each sub-string
   746 segmented by those chosen splitting points will form different iterations.
   755 segmented by those chosen splitting points will form different iterations.
   747 For example when $n=4$,
   756 For example when $n=4$, we give out a few of the many possibilities of splitting:
   748 \begin{center}
   757 \begin{center}
   749 \begin{tabular}{lcr}
   758 \begin{tabular}{lcr}
   750 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
   759 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
   751 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
   760 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
   752 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
   761 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
   753 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
   762 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
   754 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
   763 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
   755  & $\textit{etc}.$ &
   764  & $\textit{etc}.$ &