handouts/ho04.tex
changeset 350 c4e7caa06c74
parent 326 94700593a2d5
child 352 1e1b0fe66107
equal deleted inserted replaced
349:434891622131 350:c4e7caa06c74
    13 expression or not. Often a more interesting question is to
    13 expression or not. Often a more interesting question is to
    14 find out \emph{how} a regular expression matched a string?
    14 find out \emph{how} a regular expression matched a string?
    15 Answering this question will also help us with the problem we
    15 Answering this question will also help us with the problem we
    16 are after, namely tokenising an input string. 
    16 are after, namely tokenising an input string. 
    17 
    17 
    18 The algorithm we will be looking at was designed by Sulzmann \& Lu in
    18 The algorithm we will be looking at was designed by Sulzmann
    19 a rather recent paper. A link to it is provided on KEATS, in case you
    19 \& Lu in a rather recent paper (from 2014). A link to it is
    20 are interested.\footnote{In my humble opinion this is an interesting
    20 provided on KEATS, in case you are interested.\footnote{In my
    21   instance of the research literature: it contains a very neat idea,
    21 humble opinion this is an interesting instance of the research
    22   but its presentation is rather sloppy. In earlier versions of their
    22 literature: it contains a very neat idea, but its presentation
    23   paper, students and I found several rather annoying typos in their
    23 is rather sloppy. In earlier versions of their paper, a King's
    24   examples and definitions.}  In order to give an answer for how a
    24 student and I found several rather annoying typos in their
    25 regular expression matched a string, Sulzmann and Lu introduce
    25 examples and definitions.} In order to give an answer for
    26 \emph{values}. A value will be the output of the algorithm whenever
    26 \emph{how} a regular expression matches a string, Sulzmann and
    27 the regular expression matches the string. If the string does not
    27 Lu introduce \emph{values}. A value will be the output of the
    28 match the string, an error will be raised. Since the first phase of
    28 algorithm whenever the regular expression matches the string.
    29 the algorithm by Sulzmann \& Lu is identical to the derivative based
    29 If the string does not match the string, an error will be
    30 matcher from the first coursework, the function $nullable$ will be
    30 raised. Since the first phase of the algorithm by Sulzmann \&
    31 used to decide whether as string is matched by a regular
    31 Lu is identical to the derivative based matcher from the first
    32 expression. If $nullable$ says yes, then values are constructed that
    32 coursework, the function $nullable$ will be used to decide
    33 reflect how the regular expression matched the string. The definitions
    33 whether as string is matched by a regular expression. If
    34 for regular expressions $r$ and values $v$ is shown next to each other
    34 $nullable$ says yes, then values are constructed that reflect
    35 below:
    35 how the regular expression matched the string. 
       
    36 
       
    37 The definitions for values is given below. They are shown 
       
    38 together with the regular expressions $r$ to which
       
    39 they correspond:
    36 
    40 
    37 \begin{center}
    41 \begin{center}
    38 \begin{tabular}{cc}
    42 \begin{tabular}{cc}
    39 \begin{tabular}{@{}rrl@{}}
    43 \begin{tabular}{@{}rrl@{}}
    40 \multicolumn{3}{c}{regular expressions}\medskip\\
    44 \multicolumn{3}{c}{regular expressions}\medskip\\
    58       & $\mid$ & $[v_1,\ldots\,v_n]$ \\
    62       & $\mid$ & $[v_1,\ldots\,v_n]$ \\
    59 \end{tabular}
    63 \end{tabular}
    60 \end{tabular}
    64 \end{tabular}
    61 \end{center}
    65 \end{center}
    62 
    66 
    63 \noindent The point is that there is a very strong
    67 \noindent The reason is that there is a very strong
    64 correspondence between them. There is no value for the
    68 correspondence between them. There is no value for the
    65 $\varnothing$ regular expression, since it does not match any
    69 $\varnothing$ regular expression, since it does not match any
    66 string. Otherwise there is exactly one value corresponding to
    70 string. Otherwise there is exactly one value corresponding to
    67 each regular expression with the exception of $r_1 + r_2$
    71 each regular expression with the exception of $r_1 + r_2$
    68 where there are two values, namely $Left(v)$ and $Right(v)$
    72 where there are two values, namely $Left(v)$ and $Right(v)$
    69 corresponding to the two alternatives. Note that $r^*$ is
    73 corresponding to the two alternatives. Note that $r^*$ is
    70 associated with a list of values, one for each copy of $r$
    74 associated with a list of values, one for each copy of $r$
    71 that was needed to match the string. This means we might also
    75 that was needed to match the string. This means we might also
    72 return the empty list $[]$, if no copy was needed.
    76 return the empty list $[]$, if no copy was needed in case
       
    77 of $r^*$. For sequence, there is exactly one value, composed 
       
    78 of two component values ($v_1$ and $v_2$).
    73 
    79 
    74 To emphasise the connection between regular expressions and
    80 To emphasise the connection between regular expressions and
    75 values, I have in my implementation the convention that
    81 values, I have in my implementation the convention that
       
    82 regular expressions and values have the same name, except that
    76 regular expressions are written entirely with upper-case
    83 regular expressions are written entirely with upper-case
    77 letters, while values just start with a single upper-case
    84 letters, while values just start with a single upper-case
    78 character. My definition of values in Scala is below. I use 
    85 character and the rest are lower-case letters. My definition
    79 this in the REPL of Scala; when I use the Scala compiler
    86 of regular expressions and values in Scala is shown below. I use
    80 I need to rename some constructors, because Scala on Macs
    87 this in the REPL of Scala; when I use the Scala compiler I
    81 does not like classes that are called \pcode{EMPTY} and
    88 need to rename some constructors, because Scala on Macs does
       
    89 not like classes that are called \pcode{EMPTY} and
    82 \pcode{Empty}.
    90 \pcode{Empty}.
       
    91  
       
    92  {\small\lstinputlisting[language=Scala,numbers=none]
       
    93 {../progs/app03.scala}}
       
    94 
    83  
    95  
    84 {\small\lstinputlisting[language=Scala,numbers=none]
    96 {\small\lstinputlisting[language=Scala,numbers=none]
    85 {../progs/app02.scala}}
    97 {../progs/app02.scala}}
    86 
    98 
    87 
    99 
   144 regular expression on the left-hand side. This will become
   156 regular expression on the left-hand side. This will become
   145 important later on.
   157 important later on.
   146 
   158 
   147 The second phase of the algorithm is organised so that it will
   159 The second phase of the algorithm is organised so that it will
   148 calculate a value for how the derivative regular expression
   160 calculate a value for how the derivative regular expression
   149 has matched a string whose first character has been chopped
   161 has matched a string. For this we need a function that
   150 off. Now we need a function that reverses this ``chopping
   162 reverses this ``chopping off'' for values which we did in the
   151 off'' for values. The corresponding function is called $inj$
   163 first phase for derivatives. The corresponding function is
   152 for injection. This function takes three arguments: the first
   164 called $inj$ for injection. This function takes three
   153 one is a regular expression for which we want to calculate the
   165 arguments: the first one is a regular expression for which we
   154 value, the second is the character we want to inject and the
   166 want to calculate the value, the second is the character we
   155 third argument is the value where we will inject the
   167 want to inject and the third argument is the value where we
   156 character. The result of this function is a new value. The
   168 will inject the character into. The result of this function is a
   157 definition of $inj$ is as follows: 
   169 new value. The definition of $inj$ is as follows: 
   158 
   170 
   159 \begin{center}
   171 \begin{center}
   160 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   172 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   161   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   173   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   162   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   174   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   169 \end{center}
   181 \end{center}
   170 
   182 
   171 \noindent This definition is by recursion on the regular
   183 \noindent This definition is by recursion on the regular
   172 expression and by analysing the shape of the values. Therefore
   184 expression and by analysing the shape of the values. Therefore
   173 there are, for example, three cases for sequence regular
   185 there are, for example, three cases for sequence regular
   174 expressions. The last clause for the star regular expression
   186 expressions (for all possible shapes of the value). The last
   175 returns a list where the first element is $inj\,r\,c\,v$ and
   187 clause for the star regular expression returns a list where
   176 the other elements are $vs$. That means $\_\,::\,\_$ should be 
   188 the first element is $inj\,r\,c\,v$ and the other elements are
   177 read as list cons.
   189 $vs$. That means $\_\,::\,\_$ should be read as list cons.
   178 
   190 
   179 To understand what is going on, it might be best to do some
   191 To understand what is going on, it might be best to do some
   180 example calculations and compare them with Figure~\ref{Sulz}.
   192 example calculations and compare them with Figure~\ref{Sulz}.
   181 For this note that we have not yet dealt with the need of
   193 For this note that we have not yet dealt with the need of
   182 simplifying regular expressions (this will be a topic on its
   194 simplifying regular expressions (this will be a topic on its
   192 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
   204 $r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
   193 \end{tabular}
   205 \end{tabular}
   194 \end{center}
   206 \end{center}
   195 
   207 
   196 \noindent According to the simple algorithm, we would test
   208 \noindent According to the simple algorithm, we would test
   197 whether $r_4$ is nullable, which in this case it is. This
   209 whether $r_4$ is nullable, which in this case it indeed is.
   198 means we can use the function $mkeps$ to calculate a value for
   210 This means we can use the function $mkeps$ to calculate a
   199 how $r_4$ was able to match the empty string. Remember that
   211 value for how $r_4$ was able to match the empty string.
   200 this function gives preference for alternatives on the
   212 Remember that this function gives preference for alternatives
   201 left-hand side. However there is only $\epsilon$ on the very
   213 on the left-hand side. However there is only $\epsilon$ on the
   202 right-hand side of $r_4$ that matches the empty string.
   214 very right-hand side of $r_4$ that matches the empty string.
   203 Therefore $mkeps$ returns the value
   215 Therefore $mkeps$ returns the value
   204 
   216 
   205 \begin{center}
   217 \begin{center}
   206 $v_4:\;Right(Right(Empty))$
   218 $v_4:\;Right(Right(Empty))$
   207 \end{center}
   219 \end{center}
   208 
   220 
   209 \noindent The point is that from this value we can directly
   221 \noindent If there had been a $\epsilon$ on the left, then
   210 read off which part of $r_4$ matched the empty string: take
   222 $mkeps$ would have returned something of the form
   211 the right-alternative first, and then the right-alternative
   223 $Left(\ldots)$. The point is that from this value we can
   212 again. 
   224 directly read off which part of $r_4$ matched the empty
       
   225 string: take the right-alternative first, and then the
       
   226 right-alternative again. 
   213 
   227 
   214 Next we have to ``inject'' the last character, that is $c$ in
   228 Next we have to ``inject'' the last character, that is $c$ in
   215 the running example, into this value $v_4$ in order to
   229 the running example, into this value $v_4$ in order to
   216 calculate how $r_3$ could have matched the string $c$.
   230 calculate how $r_3$ could have matched the string $c$.
   217 According to the definition of $inj$ we obtain
   231 According to the definition of $inj$ we obtain
   255   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
   269   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
   256   $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   270   $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   257 \end{tabular}
   271 \end{tabular}
   258 \end{center}
   272 \end{center}
   259 
   273 
   260 \noindent Using flatten we can see what is the string behind 
   274 \noindent Using flatten we can see what are the strings behind 
   261 the values calculated by $mkeps$ and $inj$ in our running 
   275 the values calculated by $mkeps$ and $inj$. In our running 
   262 example are:
   276 example:
   263 
   277 
   264 \begin{center}
   278 \begin{center}
   265 \begin{tabular}{ll}
   279 \begin{tabular}{ll}
   266 $|v_4|$: & $[]$\\
   280 $|v_4|$: & $[]$\\
   267 $|v_3|$: & $c$\\
   281 $|v_3|$: & $c$\\
   269 $|v_1|$: & $abc$
   283 $|v_1|$: & $abc$
   270 \end{tabular}
   284 \end{tabular}
   271 \end{center}
   285 \end{center}
   272 
   286 
   273 \noindent This indicates that $inj$ indeed is injecting, or
   287 \noindent This indicates that $inj$ indeed is injecting, or
   274 adding, back a character into the value. Next we look at how
   288 adding, back a character into the value. If we continue until
   275 simplification can be included into this algorithm.
   289 all characters are injected back, we have a value that can 
       
   290 indeed say how the string $abc$ was matched.
       
   291 
       
   292 There is a problem, however, with the described algorithm
       
   293 so far: it is very slow. We need to include the simplification 
       
   294 from Lecture 2. This is what we shall do next.
   276 
   295 
   277 
   296 
   278 \subsubsection*{Simplification}
   297 \subsubsection*{Simplification}
   279 
   298 
   280 Generally the matching algorithms based on derivatives do
   299 Generally the matching algorithms based on derivatives do