handouts/ho04.tex
changeset 284 0afe43616b6a
parent 283 c14e5ebf0c3b
child 285 8a222559278f
equal deleted inserted replaced
283:c14e5ebf0c3b 284:0afe43616b6a
    11 So far our algorithm based on derivatives was only able to say
    11 So far our algorithm based on derivatives was only able to say
    12 yes or no depending on whether a string was matched by regular
    12 yes or no depending on whether a string was matched by regular
    13 expression or not. Often a more interesting question is to
    13 expression or not. Often a more interesting question is to
    14 find out \emph{how} a regular expression matched a string?
    14 find out \emph{how} a regular expression matched a string?
    15 Answering this question will help us with the problem we are
    15 Answering this question will help us with the problem we are
    16 after, namely tokenising an input string, that is splitting it
    16 after, namely tokenising an input string. The algorithm we
    17 up into its ``word'' components. The algorithm we will be
    17 will be looking at for this was designed by Sulzmann \& Lu in
    18 looking at was designed by Sulzmann \& Lu in a rather recent
    18 a rather recent paper. A link to it is provided on KEATS, in
    19 paper. A link to it is provided on KEATS, in case you are
    19 case you are interested.\footnote{In my humble opinion this is
    20 interested.\footnote{In my humble opinion this is an
    20 an interesting instance of the research literature: it
    21 interesting instance of the research literature: it contains a
    21 contains a very neat idea, but its presentation is rather
    22 very neat idea, but its presentation is rather sloppy. In
    22 sloppy. In earlier versions of their paper, students and I
    23 earlier versions of their paper, students and I found several
    23 found several rather annoying typos in their examples and
    24 rather annoying typos in their examples and definitions.}
    24 definitions.}
    25 
    25 
    26 In order to give an answer for how a regular expression matched
    26 In order to give an answer for how a regular expression
    27 a string, Sulzmann and Lu introduce \emph{values}. A value 
    27 matched a string, Sulzmann and Lu introduce \emph{values}. A
    28 will be the output of the algorithm whenever the regular 
    28 value will be the output of the algorithm whenever the regular
    29 expression matches the string. If not, an error will be raised.
    29 expression matches the string. If not, an error will be
    30 Since the first phase of the algorithm is identical to the
    30 raised. Since the first phase of the algorithm by Sulzmann \&
    31 derivative based matcher from the first coursework, the 
    31 Lu is identical to the derivative based matcher from the first
    32 function $nullable$ will be used to decide whether as string
    32 coursework, the function $nullable$ will be used to decide
    33 is matched by a regular expression. If $nullable$ says
    33 whether as string is matched by a regular expression. If
    34 yes, then values are constructed that reflect how the regular
    34 $nullable$ says yes, then values are constructed that reflect
    35 expression matched the string. The definitions for regular 
    35 how the regular expression matched the string. The definitions
    36 expressions $r$ and values $v$ is shown below:
    36 for regular expressions $r$ and values $v$ is shown next to
       
    37 each other below:
    37 
    38 
    38 \begin{center}
    39 \begin{center}
    39 \begin{tabular}{cc}
    40 \begin{tabular}{cc}
    40 \begin{tabular}{@{}rrl@{}}
    41 \begin{tabular}{@{}rrl@{}}
       
    42 \multicolumn{3}{c}{regular expressions}\\
    41   $r$ & $::=$  & $\varnothing$\\
    43   $r$ & $::=$  & $\varnothing$\\
    42       & $\mid$ & $\epsilon$   \\
    44       & $\mid$ & $\epsilon$   \\
    43       & $\mid$ & $c$          \\
    45       & $\mid$ & $c$          \\
    44       & $\mid$ & $r_1 \cdot r_2$\\
    46       & $\mid$ & $r_1 \cdot r_2$\\
    45       & $\mid$ & $r_1 + r_2$   \\
    47       & $\mid$ & $r_1 + r_2$   \\
    46   \\
    48   \\
    47       & $\mid$ & $r^*$         \\
    49       & $\mid$ & $r^*$         \\
    48 \end{tabular}
    50 \end{tabular}
    49 &
    51 &
    50 \begin{tabular}{@{\hspace{0mm}}rrl@{}}
    52 \begin{tabular}{@{\hspace{0mm}}rrl@{}}
    51   $v$ & $::=$  & \\
    53 \multicolumn{3}{c}{values}\\
       
    54    $v$ & $::=$  & \\
    52       &        & $Empty$   \\
    55       &        & $Empty$   \\
    53       & $\mid$ & $Char(c)$          \\
    56       & $\mid$ & $Char(c)$          \\
    54       & $\mid$ & $Seq(v_1,v_2)$\\
    57       & $\mid$ & $Seq(v_1,v_2)$\\
    55       & $\mid$ & $Left(v)$   \\
    58       & $\mid$ & $Left(v)$   \\
    56       & $\mid$ & $Right(v)$  \\
    59       & $\mid$ & $Right(v)$  \\
    57       & $\mid$ & $[v_1,\ldots\,v_n]$ \\
    60       & $\mid$ & $[v_1,\ldots\,v_n]$ \\
    58 \end{tabular}
    61 \end{tabular}
    59 \end{tabular}
    62 \end{tabular}
    60 \end{center}
    63 \end{center}
    61 
    64 
    62 \noindent As you can see there is a very strong correspondence
    65 \noindent The point is that there is a very strong
    63 between regular expressions and values. There is no value for
    66 correspondence between them. There is no value for the
    64 the $\varnothing$ regular expression (since it does not match
    67 $\varnothing$ regular expression, since it does not match any
    65 any string). Then there is exactly one value corresponding to 
    68 string. Otherwise there is exactly one value corresponding to
    66 each regular expression. with the exception of $r_1 + r_2$ 
    69 each regular expression with the exception of $r_1 + r_2$
    67 where there are two values $Left(v)$ and $Right(v)$ 
    70 where there are two values, namely $Left(v)$ and $Right(v)$
    68 corresponding  to the two alternatives. Note that $r^*$
    71 corresponding to the two alternatives. Note that $r^*$ is
    69 is associated with a list of values, one for each copy of $r$ 
    72 associated with a list of values, one for each copy of $r$
    70 that was needed to match the string. This mean we might also 
    73 that was needed to match the string. This means we might also
    71 return the empty list $[]$, if no copy was needed.
    74 return the empty list $[]$, if no copy was needed.
    72 
    75 
    73 Graphically the algorithm can be represneted by the 
    76 Graphically the algorithm by Sulzmann \& Lu can be represneted
    74 picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first
    77 by the picture in Figure~\ref{Sulz} where the path from the
    75 phase of the algorithm and $mkeps/inj$ the second phase.
    78 left to the right involving $der/nullable$ is the first phase
    76 This picture shows the steps required when a regular
    79 of the algorithm and $mkeps/inj$, the path from right to left,
    77 expression, say $r_1$, matches the string $abc$. We first
    80 the second phase. This picture shows the steps required when a
    78 build the three derivatives (according to $a$, $b$ and $c$).
    81 regular expression, say $r_1$, matches the string $abc$. We
    79 We then use $nullable$ to find out whether the resulting
    82 first build the three derivatives (according to $a$, $b$ and
    80 regular expression can match the empty string. If yes
    83 $c$). We then use $nullable$ to find out whether the resulting
    81 we call the function $mkeps$.
    84 regular expression can match the empty string. If yes we call
       
    85 the function $mkeps$.
    82 
    86 
    83 \begin{figure}[t]
    87 \begin{figure}[t]
    84 \begin{center}
    88 \begin{center}
    85 \begin{tikzpicture}[scale=2,node distance=1.2cm,
    89 \begin{tikzpicture}[scale=2,node distance=1.2cm,
    86                     every node/.style={minimum size=7mm}]
    90                     every node/.style={minimum size=7mm}]
   106 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   110 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   107 \label{Sulz}}
   111 \label{Sulz}}
   108 \end{figure}
   112 \end{figure}
   109 
   113 
   110 The $mkeps$ function calculates a value for how a regular
   114 The $mkeps$ function calculates a value for how a regular
   111 expression could have matched the empty string. Its definition
   115 expression has matched the empty string. Its definition
   112 is as follows:
   116 is as follows:
   113 
   117 
   114 \begin{center}
   118 \begin{center}
   115 \begin{tabular}{lcl}
   119 \begin{tabular}{lcl}
   116   $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
   120   $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
   121   $mkeps(r^*)$            & $\dn$ & $[]$  \\
   125   $mkeps(r^*)$            & $\dn$ & $[]$  \\
   122 \end{tabular}
   126 \end{tabular}
   123 \end{center}
   127 \end{center}
   124 
   128 
   125 \noindent There are no cases for $\epsilon$ and $c$, since
   129 \noindent There are no cases for $\epsilon$ and $c$, since
   126 these regular expression cannot match the empty string. 
   130 these regular expression cannot match the empty string. Note
   127 Note that in case of alternatives we give preference to the
   131 also that in case of alternatives we give preference to the
   128 regular expression on the left-hand side. This will become
   132 regular expression on the left-hand side. This will become
   129 important later on.
   133 important later on.
   130 
   134 
   131 The algorithm is organised recursively such that it will 
   135 The second phase of the algorithm is organised recursively
   132 calculate a value for how the derivative regular expression
   136 such that it will calculate a value for how the derivative
   133 has matched a string where the first character has been
   137 regular expression has matched a string whose first character
   134 chopped off. Now we need a function that reverses this
   138 has been chopped off. Now we need a function that reverses
   135 ``chopping off'' for values. The corresponding function
   139 this ``chopping off'' for values. The corresponding function
   136 is called $inj$ for injection. This function takes
   140 is called $inj$ for injection. This function takes three
   137 three arguments: the first one is a regular expression
   141 arguments: the first one is a regular expression for which we
   138 for which we want to calculate the value, the second is the 
   142 want to calculate the value, the second is the character we
   139 character we want to inject and the third argument is the
   143 want to inject and the third argument is the value where we
   140 value where we will inject the character. The result
   144 will inject the character. The result of this function is a
   141 of this function is a new value. The definition of $inj$ 
   145 new value. The definition of $inj$ is as follows: 
   142 is as follows: 
       
   143 
   146 
   144 \begin{center}
   147 \begin{center}
   145 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   148 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   146   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   149   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   147   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   150   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   151   $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
   154   $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
   152   $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
   155   $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
   153 \end{tabular}
   156 \end{tabular}
   154 \end{center}
   157 \end{center}
   155 
   158 
   156 \noindent This definition is by recursion on the
   159 \noindent This definition is by recursion on the regular
   157 regular expression and by analysing the shape of the values. 
   160 expression and by analysing the shape of the values. Therefore
   158 Therefore there are, for example, three cases for sequnece
   161 there are, for example, three cases for sequnece regular
   159 regular expressions. The last clause returns a list where 
   162 expressions. The last clause for the star regular expression
   160 the first element is $inj\,r\,c\,v$ and the other elements 
   163 returns a list where the first element is $inj\,r\,c\,v$ and
   161 are $vs$.
   164 the other elements are $vs$. That mean $\_\,::\,\_$ should be 
       
   165 read as list cons.
   162 
   166 
   163 To understand what is going on, it might be best to do some
   167 To understand what is going on, it might be best to do some
   164 example calculations and compare with Figure~\ref{Sulz}. For
   168 example calculations and compare with Figure~\ref{Sulz}. For
   165 this note that we have not yet dealt with the need of
   169 this note that we have not yet dealt with the need of
   166 simplifying regular expreesions (this will be a topic on its
   170 simplifying regular expressions (this will be a topic on its
   167 own later). Suppose the regular expression is $a \cdot (b
   171 own later). Suppose the regular expression is $a \cdot (b
   168 \cdot c)$ and the input string is $abc$. The derivatives are
   172 \cdot c)$ and the input string is $abc$. The derivatives from
   169 as follows:
   173 the first phase are as follows:
   170 
   174 
   171 \begin{center}
   175 \begin{center}
   172 \begin{tabular}{ll}
   176 \begin{tabular}{ll}
   173 $r_1$: & $a \cdot (b \cdot c)$\\
   177 $r_1$: & $a \cdot (b \cdot c)$\\
   174 $r_2$: & $\epsilon \cdot (b \cdot c)$\\
   178 $r_2$: & $\epsilon \cdot (b \cdot c)$\\
   175 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
   179 $r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
   176 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
   180 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
   177 \end{tabular}
   181 \end{tabular}
   178 \end{center}
   182 \end{center}
   179 
   183 
   180 \noindent According to the simple algorithm. we would test
   184 \noindent According to the simple algorithm, we would test
   181 whether $r_4$ is nullable, which it is. This means we can
   185 whether $r_4$ is nullable, which in this case it is. This
   182 use the function $mkeps$ to calculate a value for how $r_4$
   186 means we can use the function $mkeps$ to calculate a value for
   183 was able to match the empty string. Remember that this 
   187 how $r_4$ was able to match the empty string. Remember that
   184 function gives preference for alternatives on the left-hand
   188 this function gives preference for alternatives on the
   185 side. However there is only $\epsilon$ on the very right-hand
   189 left-hand side. However there is only $\epsilon$ on the very
   186 side of $r_4$ that matches the empty string. Therefore
   190 right-hand side of $r_4$ that matches the empty string.
   187 $mkeps$ returns the value
   191 Therefore $mkeps$ returns the value
   188 
   192 
   189 \begin{center}
   193 \begin{center}
   190 $v_4:\;Right(Right(Empty))$
   194 $v_4:\;Right(Right(Empty))$
   191 \end{center}
   195 \end{center}
   192 
   196 
   193 \noindent The point is that from this value we can directly 
   197 \noindent The point is that from this value we can directly
   194 read off which part of $r_4$ matched the empty string. Next we 
   198 read off which part of $r_4$ matched the empty string. Next we
   195 have to ``inject'' the last character, that is $c$ in the
   199 have to ``inject'' the last character, that is $c$ in the
   196 running example, into this value in order to calculate how
   200 running example, into this value $v_4$ in order to calculate
   197 $r_3$ could have matched the string $c$. According to the 
   201 how $r_3$ could have matched the string $c$. According to the
   198 definition of $inj$ we obtain
   202 definition of $inj$ we obtain
   199 
   203 
   200 \begin{center}
   204 \begin{center}
   201 $v_3:\;Right(Seq(Empty, Char(c)))$
   205 $v_3:\;Right(Seq(Empty, Char(c)))$
   202 \end{center}
   206 \end{center}
   218 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   222 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   219 \end{center}
   223 \end{center}
   220 
   224 
   221 \noindent This now corresponds to how the regular
   225 \noindent This now corresponds to how the regular
   222 expression $a \cdot (b \cdot c)$ matched the string $abc$.
   226 expression $a \cdot (b \cdot c)$ matched the string $abc$.
   223 This is the expected result. So at least in this case the
   227 
   224 algorithm seems to calculate what it is supposed to.
   228 There are a few auxiliary functions that are of interest
   225 
       
   226 There are a few auxiliary function that are of interest
       
   227 in analysing this algorithm. One is called \emph{flatten},
   229 in analysing this algorithm. One is called \emph{flatten},
   228 written $|\_|$, which extracts the string ``underlying'' a 
   230 written $|\_|$, which extracts the string ``underlying'' a 
   229 value. It is defined as
   231 value. It is defined recursively as
   230 
   232 
   231 \begin{center}
   233 \begin{center}
   232 \begin{tabular}{lcl}
   234 \begin{tabular}{lcl}
   233   $|Empty|$     & $\dn$ & $[]$\\
   235   $|Empty|$     & $\dn$ & $[]$\\
   234   $|Char(c)|$   & $\dn$ & $[c]$\\
   236   $|Char(c)|$   & $\dn$ & $[c]$\\
   239 \end{tabular}
   241 \end{tabular}
   240 \end{center}
   242 \end{center}
   241 
   243 
   242 \noindent Using flatten we can see what is the string behind 
   244 \noindent Using flatten we can see what is the string behind 
   243 the values calculated by $mkeps$ and $inj$ in our running 
   245 the values calculated by $mkeps$ and $inj$ in our running 
   244 example:
   246 example are:
   245 
   247 
   246 \begin{center}
   248 \begin{center}
   247 \begin{tabular}{ll}
   249 \begin{tabular}{ll}
   248 $v_4$: & $[]$\\
   250 $|v_4|$: & $[]$\\
   249 $v_3$: & $c$\\
   251 $|v_3|$: & $c$\\
   250 $v_2$: & $bc$\\
   252 $|v_2|$: & $bc$\\
   251 $v_1$: & $abc$
   253 $|v_1|$: & $abc$
   252 \end{tabular}
   254 \end{tabular}
   253 \end{center}
   255 \end{center}
   254 
   256 
   255 \noindent 
   257 \noindent 
   256 This indicates that $inj$ indeed is injecting, or adding, back
   258 This indicates that $inj$ indeed is injecting, or adding, back
   259 
   261 
   260 \subsubsection*{Simplification}
   262 \subsubsection*{Simplification}
   261 
   263 
   262 Generally the matching algorithms based on derivatives do
   264 Generally the matching algorithms based on derivatives do
   263 poorly unless the regular expressions are simplified after
   265 poorly unless the regular expressions are simplified after
   264 each derivatives step.
   266 each derivatives step. But this is a bit more involved in 
       
   267 algorithm of Sulzmann \& Lu. Consider the last derivation
       
   268 step in our running example
       
   269 
       
   270 \begin{center}
       
   271 $r_4 = der\,c\,r_3$
       
   272 \end{center}
       
   273 
       
   274 \noindent 
       
   275 Simplifying the result would just give use $\epsilon$.
       
   276 Running $mkeps$ on this regular expression would provide 
       
   277 us with $Empty$ instead of $Right(Right(Empty))$ that
       
   278 was obtained without the simplification. The problem is
       
   279 we need to recreate this value, rather than just $Empty$.
       
   280 
       
   281 This requires what I call \emph{rectification functions}. They
       
   282 need to be calculated whenever a regular expression gets
       
   283 simplified. Rectification functions take a value as argument
       
   284 and return a (rectified) value. Our simplification rules so
       
   285 far are
       
   286 
       
   287 \begin{center}
       
   288 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
       
   289 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
       
   290 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
       
   291 $r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
       
   292 $\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
       
   293 $r + \varnothing$ & $\mapsto$ & $r$\\ 
       
   294 $\varnothing + r$ & $\mapsto$ & $r$\\ 
       
   295 $r + r$ & $\mapsto$ & $r$\\ 
       
   296 \end{tabular}
       
   297 \end{center}
       
   298 
       
   299 \noindent Applying them to $r_4$ will require several nested
       
   300 simplifications in order end up with just $\epsilon$. We can
       
   301 implement this by letting simp return not just a (simplified)
       
   302 regular expression, but also a rectification function. Let us
       
   303 consider the alternative case, say $r_1 + r_2$, first. We
       
   304 would first simplify the component regular expressions $r_1$
       
   305 and $r_2.$ This will return simplified versions (if they can
       
   306 be simplified), say $r_{1s}$ and $r_{2s}$, but also two
       
   307 rectification functions $f_{1s}$ and $f_{2s}$. We need to
       
   308 assemble them in order to obtain a rectified value for 
       
   309 $r_1 + r_2$.
       
   310 
       
   311 In case $r_{1s}$ simplified to $\varnothing$, we would
       
   312 continue the derivative calculation with $r_{2s}$. The
       
   313 Sulzmann \& Lu algorithm would return a corresponding value,
       
   314 say $v_{2s}$.
       
   315 
       
   316 
       
   317 \subsubsection*{Records and Tokenisation}
   265 
   318 
   266 \newpage
   319 \newpage
   267 Algorithm by Sulzmann, Lexing
   320 Algorithm by Sulzmann, Lexing
   268 
   321 
   269 \end{document}
   322 \end{document}