ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 529 96e93df60954
parent 528 28751de4b4ba
child 530 823d9b19d21c
equal deleted inserted replaced
528:28751de4b4ba 529:96e93df60954
     8 %and then give the algorithm and its variant, and discuss
     8 %and then give the algorithm and its variant, and discuss
     9 %why more aggressive simplifications are needed. 
     9 %why more aggressive simplifications are needed. 
    10 
    10 
    11 
    11 
    12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
    12 \section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
    13 We have a built-in datatype char, made up of characters, which we do not define
    13 We have a primitive datatype char, denoting characters.
    14 on top of anything else.
    14 \[			char ::=  a
       
    15 			 \mid b
       
    16 			 \mid c
       
    17 			 \mid  \ldots
       
    18 			 \mid z       
       
    19 \]
       
    20 (which one is better?)
    15 \begin{center}
    21 \begin{center}
    16 \begin{tabular}{lcl}
    22 \begin{tabular}{lcl}
    17 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
    23 $\textit{char}$ & $\dn$ & $a | b | c | \ldots$\\
    18 \end{tabular}
    24 \end{tabular}
    19 \end{center}
    25 \end{center}
    25 \end{tabular}
    31 \end{tabular}
    26 \end{center}
    32 \end{center}
    27 And strings can be concatenated to form longer strings:
    33 And strings can be concatenated to form longer strings:
    28 \begin{center}
    34 \begin{center}
    29 \begin{tabular}{lcl}
    35 \begin{tabular}{lcl}
    30 $s_1 @ s_2$ & $\rightarrow$ & $s'$\\
    36 $[] @ s_2$ & $\dn$ & $s_2$\\
       
    37 $(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
    31 \end{tabular}
    38 \end{tabular}
    32 \end{center}
    39 \end{center}
    33 
    40 
    34 A set of strings can operate with another set of strings:
    41 A set of strings can operate with another set of strings:
    35 \begin{center}
    42 \begin{center}
    44 \begin{tabular}{lcl}
    51 \begin{tabular}{lcl}
    45 $A^0 $ & $\dn$ & $\{ [] \}$\\
    52 $A^0 $ & $\dn$ & $\{ [] \}$\\
    46 $A^{n+1}$ & $\dn$ & $A^n @ A$
    53 $A^{n+1}$ & $\dn$ & $A^n @ A$
    47 \end{tabular}
    54 \end{tabular}
    48 \end{center}
    55 \end{center}
    49 The infinite set of all the power of a language unioned together 
    56 The union of all the natural number powers of a language   
    50 is defined using the power operator, also in recursive function:
    57 is denoted by the Kleene star operator:
    51 \begin{center}
    58 \begin{center}
    52 \begin{tabular}{lcl}
    59 \begin{tabular}{lcl}
    53 $A^*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$\\
    60 $\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
    54 \end{tabular}
    61 \end{tabular}
    55 \end{center}
    62 \end{center}
    56 We also define an operation of chopping off a character from all the strings
    63 
    57 in a set:
    64 In Isabelle of course we cannot easily get a counterpart of
       
    65 the $\bigcup_{i \geq 0}$ operator, so we instead define the Kleene star
       
    66 as an inductive set: 
       
    67 \begin{center}
       
    68 \begin{tabular}{lcl}
       
    69 $[] \in A^*$  & &\\
       
    70 $s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
       
    71 \end{tabular}
       
    72 \end{center}
       
    73 
       
    74 We also define an operation of chopping off a character from
       
    75 a language:
    58 \begin{center}
    76 \begin{center}
    59 \begin{tabular}{lcl}
    77 \begin{tabular}{lcl}
    60 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    78 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
    61 \end{tabular}
    79 \end{tabular}
    62 \end{center}
    80 \end{center}
    63 With the above definitions, it becomes natural to define regular expressions
    81 
    64 as a concise way for expressing the languages.
    82 This can be generalised to chopping off a string from all strings within set $A$:
       
    83 \begin{center}
       
    84 \begin{tabular}{lcl}
       
    85 $\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
       
    86 \end{tabular}
       
    87 \end{center}
       
    88 
       
    89 which is essentially the left quotient $A \backslash L'$ of $A$ against 
       
    90 the singleton language $L' = \{w\}$
       
    91 in formal language theory.
       
    92 For this dissertation the $\textit{Ders}$ notation would suffice, there is
       
    93 no need for a more general derivative definition.
       
    94 
       
    95 With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
       
    96 we have a  few properties of how the language derivative can be defined using 
       
    97 sub-languages.
       
    98 \begin{lemma}
       
    99 $\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
       
   100 \end{lemma}
       
   101 \noindent
       
   102 This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
       
   103 and get to $B$.
       
   104 
       
   105 The language $A^*$'s derivative can be described using the language derivative
       
   106 of $A$:
       
   107 \begin{lemma}
       
   108 $\textit{Der} \;c \;A^* = (\textit{Der}\; c A) @ (A^*)$\\
       
   109 \end{lemma}
       
   110 \begin{proof}
       
   111 \begin{itemize}
       
   112 \item{$\subseteq$}
       
   113 The set 
       
   114 \[ \{s \mid c :: s \in A^*\} \]
       
   115 is enclosed in the set
       
   116 \[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
       
   117 because whenever you have a string starting with a character 
       
   118 in the language of a Kleene star $A^*$, then that character together with some sub-string
       
   119 immediately after it will form the first iteration, and the rest of the string will 
       
   120 be still in $A^*$.
       
   121 \item{$\supseteq$}
       
   122 Note that
       
   123 \[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
       
   124 and 
       
   125 \[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*) \]
       
   126 where the $\textit{RHS}$ of the above equatioin can be rewritten
       
   127 as \[ (\Der \; c\; A) @ A^* \cup A' \], $A'$ being a possibly empty set.
       
   128 \end{itemize}
       
   129 \end{proof}
       
   130 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
       
   131 for regular languages, we need to first give definitions for regular expressions.
       
   132 
    65 \section{Regular Expressions and Their Language Interpretation}
   133 \section{Regular Expressions and Their Language Interpretation}
    66 Suppose we have an alphabet $\Sigma$, the strings  whose characters
   134 Suppose we have an alphabet $\Sigma$, the strings  whose characters
    67 are from $\Sigma$
   135 are from $\Sigma$
    68 can be expressed as $\Sigma^*$.
   136 can be expressed as $\Sigma^*$.
    69 
   137 
    88 \end{center}
   156 \end{center}
    89 Which are also called the "language interpretation".
   157 Which are also called the "language interpretation".
    90 
   158 
    91 
   159 
    92 
   160 
    93 The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
       
    94 where the operation transforms the regex to a new one containing
       
    95 strings without the head character $c$.
       
    96 
       
    97 Formally, we define first such a transformation on any string set, which
       
    98 we call semantic derivative:
       
    99 \begin{center}
       
   100 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
       
   101 \end{center}
       
   102 Mathematically, it can be expressed as the 
       
   103 
       
   104 If the $\textit{A}$ happen to have some structure, for example,
       
   105 if it is regular, then we have that it
       
   106 
   161 
   107 % Derivatives of a
   162 % Derivatives of a
   108 %regular expression, written $r \backslash c$, give a simple solution
   163 %regular expression, written $r \backslash c$, give a simple solution
   109 %to the problem of matching a string $s$ with a regular
   164 %to the problem of matching a string $s$ with a regular
   110 %expression $r$: if the derivative of $r$ w.r.t.\ (in
   165 %expression $r$: if the derivative of $r$ w.r.t.\ (in
   111 %succession) all the characters of the string matches the empty string,
   166 %succession) all the characters of the string matches the empty string,
   112 %then $r$ matches $s$ (and {\em vice versa}).  
   167 %then $r$ matches $s$ (and {\em vice versa}).  
   113 
   168 
   114 
   169 
   115 \section{Brzozowski Derivatives of Regular Expressions}
   170 \section{Brzozowski Derivatives of Regular Expressions}
   116 The the derivative of regular expression, denoted as
   171 Now with semantic derivatives of a language and regular expressions and
       
   172 their language interpretations, we are ready to define derivatives on regexes.
       
   173 
       
   174 The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
       
   175 where the operation transforms the regex to a new one containing
       
   176 strings without the head character $c$.
       
   177 
       
   178 The  derivative of regular expression, denoted as
   117 $r \backslash c$, is a function that takes parameters
   179 $r \backslash c$, is a function that takes parameters
   118 $r$ and $c$, and returns another regular expression $r'$,
   180 $r$ and $c$, and returns another regular expression $r'$,
   119 which is computed by the following recursive function:
   181 which is computed by the following recursive function:
   120 
   182 
   121 \begin{center}
   183 \begin{center}
   399 between the two $a^*$s.
   461 between the two $a^*$s.
   400 It is not surprising there are exponentially many lexical values
   462 It is not surprising there are exponentially many lexical values
   401 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
   463 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
   402 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   464 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   403 
   465 
   404 A lexer aimed at getting all the possible values has an exponential
   466 A lexer aimed at keeping all the possible values will naturally 
   405 worst case runtime. Therefore it is impractical to try to generate
   467 have an exponential runtime on ambiguous regular expressions.
   406 all possible matches in a run. In practice, we are usually 
   468 Somehow one has to decide which lexical value to keep and
       
   469 output in a lexing algorithm.
       
   470 In practice, we are usually 
   407 interested about POSIX values, which by intuition always
   471 interested about POSIX values, which by intuition always
   408 \begin{itemize}
   472 \begin{itemize}
   409 \item
   473 \item
   410 match the leftmost regular expression when multiple options of matching
   474 match the leftmost regular expression when multiple options of matching
   411 are available  
   475 are available  
   412 \item 
   476 \item 
   413 always match a subpart as much as possible before proceeding
   477 always match a subpart as much as possible before proceeding
   414 to the next token.
   478 to the next token.
   415 \end{itemize}
   479 \end{itemize}
       
   480 The formal definition of a $\POSIX$ value can be described 
       
   481 in the following set of rules:
       
   482 \
   416 
   483 
   417 
   484 
   418  For example, the above example has the POSIX value
   485  For example, the above example has the POSIX value
   419 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   486 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
   420 The output of an algorithm we want would be a POSIX matching
   487 The output of an algorithm we want would be a POSIX matching