handouts/ho04.tex
changeset 283 c14e5ebf0c3b
parent 282 3e3b927a85cf
child 284 0afe43616b6a
equal deleted inserted replaced
282:3e3b927a85cf 283:c14e5ebf0c3b
    14 find out \emph{how} a regular expression matched a string?
    14 find out \emph{how} a regular expression matched a string?
    15 Answering this question will help us with the problem we are
    15 Answering this question will help us with the problem we are
    16 after, namely tokenising an input string, that is splitting it
    16 after, namely tokenising an input string, that is splitting it
    17 up into its ``word'' components. The algorithm we will be
    17 up into its ``word'' components. The algorithm we will be
    18 looking at was designed by Sulzmann \& Lu in a rather recent
    18 looking at was designed by Sulzmann \& Lu in a rather recent
    19 paper. A link to it is provided at KEATS, in case you are
    19 paper. A link to it is provided on KEATS, in case you are
    20 interested.\footnote{In my humble opinion this is an
    20 interested.\footnote{In my humble opinion this is an
    21 interesting instance of the research literature: it contains a
    21 interesting instance of the research literature: it contains a
    22 very neat idea, but its presentation is rather sloppy.
    22 very neat idea, but its presentation is rather sloppy. In
    23 Students and I found several rather annoying typos in their
    23 earlier versions of their paper, students and I found several
    24 examples and definitions.}
    24 rather annoying typos in their examples and definitions.}
    25 
    25 
    26 In order to give an answer for how a regular expression matched
    26 In order to give an answer for how a regular expression matched
    27 a string, Sulzmann and Lu introduce \emph{values}. A value 
    27 a string, Sulzmann and Lu introduce \emph{values}. A value 
    28 will be the output of the algorithm whenever the regular 
    28 will be the output of the algorithm whenever the regular 
    29 expression matches the string. If not an error will be raised.
    29 expression matches the string. If not, an error will be raised.
    30 Since the first phase of the algorithm is identical to the
    30 Since the first phase of the algorithm is identical to the
    31 derivative based matcher from the first coursework, the 
    31 derivative based matcher from the first coursework, the 
    32 function $nullable$ will be used to decide whether as string
    32 function $nullable$ will be used to decide whether as string
    33 is matched by a regular expression.
    33 is matched by a regular expression. If $nullable$ says
    34 
    34 yes, then values are constructed that reflect how the regular
    35 
    35 expression matched the string. The definitions for regular 
    36 
    36 expressions $r$ and values $v$ is shown below:
    37 
    37 
       
    38 \begin{center}
       
    39 \begin{tabular}{cc}
       
    40 \begin{tabular}{@{}rrl@{}}
       
    41   $r$ & $::=$  & $\varnothing$\\
       
    42       & $\mid$ & $\epsilon$   \\
       
    43       & $\mid$ & $c$          \\
       
    44       & $\mid$ & $r_1 \cdot r_2$\\
       
    45       & $\mid$ & $r_1 + r_2$   \\
       
    46   \\
       
    47       & $\mid$ & $r^*$         \\
       
    48 \end{tabular}
       
    49 &
       
    50 \begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
    51   $v$ & $::=$  & \\
       
    52       &        & $Empty$   \\
       
    53       & $\mid$ & $Char(c)$          \\
       
    54       & $\mid$ & $Seq(v_1,v_2)$\\
       
    55       & $\mid$ & $Left(v)$   \\
       
    56       & $\mid$ & $Right(v)$  \\
       
    57       & $\mid$ & $[v_1,\ldots\,v_n]$ \\
       
    58 \end{tabular}
       
    59 \end{tabular}
       
    60 \end{center}
       
    61 
       
    62 \noindent As you can see there is a very strong correspondence
       
    63 between regular expressions and values. There is no value for
       
    64 the $\varnothing$ regular expression (since it does not match
       
    65 any string). Then there is exactly one value corresponding to 
       
    66 each regular expression. with the exception of $r_1 + r_2$ 
       
    67 where there are two values $Left(v)$ and $Right(v)$ 
       
    68 corresponding  to the two alternatives. Note that $r^*$
       
    69 is associated with a list of values, one for each copy of $r$ 
       
    70 that was needed to match the string. This mean we might also 
       
    71 return the empty list $[]$, if no copy was needed.
       
    72 
       
    73 Graphically the algorithm can be represneted by the 
       
    74 picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first
       
    75 phase of the algorithm and $mkeps/inj$ the second phase.
       
    76 This picture shows the steps required when a regular
       
    77 expression, say $r_1$, matches the string $abc$. We first
       
    78 build the three derivatives (according to $a$, $b$ and $c$).
       
    79 We then use $nullable$ to find out whether the resulting
       
    80 regular expression can match the empty string. If yes
       
    81 we call the function $mkeps$.
       
    82 
       
    83 \begin{figure}[t]
       
    84 \begin{center}
       
    85 \begin{tikzpicture}[scale=2,node distance=1.2cm,
       
    86                     every node/.style={minimum size=7mm}]
       
    87 \node (r1)  {$r_1$};
       
    88 \node (r2) [right=of r1]{$r_2$};
       
    89 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
       
    90 \node (r3) [right=of r2]{$r_3$};
       
    91 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
       
    92 \node (r4) [right=of r3]{$r_4$};
       
    93 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
       
    94 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
       
    95 \node (v4) [below=of r4]{$v_4$};
       
    96 \draw[->,line width=1mm](r4) -- (v4);
       
    97 \node (v3) [left=of v4] {$v_3$};
       
    98 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
       
    99 \node (v2) [left=of v3]{$v_2$};
       
   100 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
       
   101 \node (v1) [left=of v2] {$v_1$};
       
   102 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
       
   103 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
       
   104 \end{tikzpicture}
       
   105 \end{center}
       
   106 \caption{The two phases of the algorithm by Sulzmann \& Lu.
       
   107 \label{Sulz}}
       
   108 \end{figure}
       
   109 
       
   110 The $mkeps$ function calculates a value for how a regular
       
   111 expression could have matched the empty string. Its definition
       
   112 is as follows:
       
   113 
       
   114 \begin{center}
       
   115 \begin{tabular}{lcl}
       
   116   $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
       
   117   $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
       
   118                           &       & then $Left(mkeps(r_1))$\\
       
   119                           &       & else $Right(mkeps(r_2))$\\
       
   120   $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
       
   121   $mkeps(r^*)$            & $\dn$ & $[]$  \\
       
   122 \end{tabular}
       
   123 \end{center}
       
   124 
       
   125 \noindent There are no cases for $\epsilon$ and $c$, since
       
   126 these regular expression cannot match the empty string. 
       
   127 Note that in case of alternatives we give preference to the
       
   128 regular expression on the left-hand side. This will become
       
   129 important later on.
       
   130 
       
   131 The algorithm is organised recursively such that it will 
       
   132 calculate a value for how the derivative regular expression
       
   133 has matched a string where the first character has been
       
   134 chopped off. Now we need a function that reverses this
       
   135 ``chopping off'' for values. The corresponding function
       
   136 is called $inj$ for injection. This function takes
       
   137 three arguments: the first one is a regular expression
       
   138 for which we want to calculate the value, the second is the 
       
   139 character we want to inject and the third argument is the
       
   140 value where we will inject the character. The result
       
   141 of this function is a new value. The definition of $inj$ 
       
   142 is as follows: 
       
   143 
       
   144 \begin{center}
       
   145 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   146   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
       
   147   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
       
   148   $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
       
   149   $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
       
   150   $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
       
   151   $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
       
   152   $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
       
   153 \end{tabular}
       
   154 \end{center}
       
   155 
       
   156 \noindent This definition is by recursion on the
       
   157 regular expression and by analysing the shape of the values. 
       
   158 Therefore there are, for example, three cases for sequnece
       
   159 regular expressions. The last clause returns a list where 
       
   160 the first element is $inj\,r\,c\,v$ and the other elements 
       
   161 are $vs$.
       
   162 
       
   163 To understand what is going on, it might be best to do some
       
   164 example calculations and compare with Figure~\ref{Sulz}. For
       
   165 this note that we have not yet dealt with the need of
       
   166 simplifying regular expreesions (this will be a topic on its
       
   167 own later). Suppose the regular expression is $a \cdot (b
       
   168 \cdot c)$ and the input string is $abc$. The derivatives are
       
   169 as follows:
       
   170 
       
   171 \begin{center}
       
   172 \begin{tabular}{ll}
       
   173 $r_1$: & $a \cdot (b \cdot c)$\\
       
   174 $r_2$: & $\epsilon \cdot (b \cdot c)$\\
       
   175 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
       
   176 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
       
   177 \end{tabular}
       
   178 \end{center}
       
   179 
       
   180 \noindent According to the simple algorithm. we would test
       
   181 whether $r_4$ is nullable, which it is. This means we can
       
   182 use the function $mkeps$ to calculate a value for how $r_4$
       
   183 was able to match the empty string. Remember that this 
       
   184 function gives preference for alternatives on the left-hand
       
   185 side. However there is only $\epsilon$ on the very right-hand
       
   186 side of $r_4$ that matches the empty string. Therefore
       
   187 $mkeps$ returns the value
       
   188 
       
   189 \begin{center}
       
   190 $v_4:\;Right(Right(Empty))$
       
   191 \end{center}
       
   192 
       
   193 \noindent The point is that from this value we can directly 
       
   194 read off which part of $r_4$ matched the empty string. Next we 
       
   195 have to ``inject'' the last character, that is $c$ in the
       
   196 running example, into this value in order to calculate how
       
   197 $r_3$ could have matched the string $c$. According to the 
       
   198 definition of $inj$ we obtain
       
   199 
       
   200 \begin{center}
       
   201 $v_3:\;Right(Seq(Empty, Char(c)))$
       
   202 \end{center}
       
   203 
       
   204 \noindent This is the correct result, because $r_3$ needs
       
   205 to use the right-hand alternative, and then $\epsilon$ needs
       
   206 to match the empty string and $c$ needs to match $c$.
       
   207 Next we need to inject back the letter $b$ into $v_3$. This
       
   208 gives
       
   209 
       
   210 \begin{center}
       
   211 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
       
   212 \end{center}
       
   213 
       
   214 \noindent Finally we need to inject back the letter $a$ into
       
   215 $v_2$ giving the final result
       
   216 
       
   217 \begin{center}
       
   218 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
       
   219 \end{center}
       
   220 
       
   221 \noindent This now corresponds to how the regular
       
   222 expression $a \cdot (b \cdot c)$ matched the string $abc$.
       
   223 This is the expected result. So at least in this case the
       
   224 algorithm seems to calculate what it is supposed to.
       
   225 
       
   226 There are a few auxiliary function that are of interest
       
   227 in analysing this algorithm. One is called \emph{flatten},
       
   228 written $|\_|$, which extracts the string ``underlying'' a 
       
   229 value. It is defined as
       
   230 
       
   231 \begin{center}
       
   232 \begin{tabular}{lcl}
       
   233   $|Empty|$     & $\dn$ & $[]$\\
       
   234   $|Char(c)|$   & $\dn$ & $[c]$\\
       
   235   $|Left(v)|$   & $\dn$ & $|v|$\\
       
   236   $|Right(v)|$  & $\dn$ & $|v|$\\
       
   237   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
       
   238   $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
       
   239 \end{tabular}
       
   240 \end{center}
       
   241 
       
   242 \noindent Using flatten we can see what is the string behind 
       
   243 the values calculated by $mkeps$ and $inj$ in our running 
       
   244 example:
       
   245 
       
   246 \begin{center}
       
   247 \begin{tabular}{ll}
       
   248 $v_4$: & $[]$\\
       
   249 $v_3$: & $c$\\
       
   250 $v_2$: & $bc$\\
       
   251 $v_1$: & $abc$
       
   252 \end{tabular}
       
   253 \end{center}
       
   254 
       
   255 \noindent 
       
   256 This indicates that $inj$ indeed is injecting, or adding, back
       
   257 a character into the value.
       
   258 
       
   259 
       
   260 \subsubsection*{Simplification}
       
   261 
       
   262 Generally the matching algorithms based on derivatives do
       
   263 poorly unless the regular expressions are simplified after
       
   264 each derivatives step.
       
   265 
       
   266 \newpage
    38 Algorithm by Sulzmann, Lexing
   267 Algorithm by Sulzmann, Lexing
    39 
   268 
    40 \end{document}
   269 \end{document}
    41 
   270 
    42 %%% Local Variables: 
   271 %%% Local Variables: