updated
author      Christian Urban <christian.urban@kcl.ac.uk>
date        Mon, 29 Jun 2020 21:14:50 +0100
changeset   727 eb9343126625
parent      726 fba480bbc9f7
child       728 c669b39debe3
files       handouts/ho02.tex
            handouts/ho05.tex
--- a/handouts/ho02.tex	Mon Jun 29 21:13:49 2020 +0100
+++ b/handouts/ho02.tex	Mon Jun 29 21:14:50 2020 +0100
@@ -7,22 +7,40 @@
 
 
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
+\fnote{\copyright{} Christian Urban, King's College London, 
+  2014, 2015, 2016, 2017, 2018, 2019, 2020}
 
 
 \section*{Handout 2 (Regular Expression Matching)}
 
+\noindent
+{\bf Checklist}
+
+\begin{itemize}
+  \item You have understood the derivative-based matching algorithm.
+  \item You know how the derivative is related to the meaning of regular
+  expressions.
+  \item You can extend the algorithm to non-basic regular expressions.
+\end{itemize}\bigskip\bigskip\bigskip
+
+\noindent
 This lecture is about implementing a more efficient regular expression
 matcher (the plots on the right below)---more efficient than the
 matchers from regular expression libraries in Ruby, Python, JavaScript
 and Java (the plots on the left). For this consider some experimental
 data: The first pair of plots shows the running time for the
 regular expression $(a^*)^*\cdot b$ and strings composed of $n$
-\pcode{a}s (meaning this regular expression actually does not match
-the strings). The second pair of plots shows the running time for the
-regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also
-composed of $n$ \pcode{a}s (this time the regular expressions match
-the strings).  To see the substantial differences in the left and
+\pcode{a}s, like
+\[
+\pcode{"}\!\underbrace{\pcode{a}\ldots\pcode{a}}_{n}\!\pcode{"}  
+\]
+
+\noindent
+This means the regular expression actually does not match the strings.
+The second pair of plots shows the running time for the regular
+expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding
+strings composed of $n$ \pcode{a}s (this time the regular expressions
+match the strings).  To see the substantial differences in the left and
 right plots below, note the different scales of the $x$-axes.
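+
+\noindent
+To get a feeling for where such running times come from, below is a small
+Scala sketch of the kind of measurement behind the plots for Java: it uses
+Java's \texttt{java.util.regex} library and the regular expression
+$(a^*)^*\cdot b$ from above. (This is only an illustration of the
+experiment, not the exact timing code used for producing the plots.)
+
+\begin{lstlisting}[language=Scala,numbers=none]
+import java.util.regex.Pattern
+
+// (a*)*b matched against "aa...a" (n copies of a); the string can
+// never match since it contains no b, which makes backtracking
+// regex engines explore a huge number of possibilities
+val evil = Pattern.compile("(a*)*b")
+
+def time_secs(n: Int): Double = {
+  val s = "a" * n
+  val start = System.nanoTime()
+  evil.matcher(s).matches()               // always false
+  (System.nanoTime() - start) / 1.0e9     // elapsed time in seconds
+}
+
+// e.g. time_secs(5), time_secs(10), ...; already for n around 30
+// this can take a very long time on the JVM
+\end{lstlisting}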
 
   
@@ -130,9 +148,8 @@
 running examples. There will be several versions (V1, V2, V3,\ldots)
 of our matcher.\footnote{The corresponding files are
   \texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can
-  find the code on KEATS.}\bigskip
+  find the code on KEATS.}
 
-\noindent
 Having specified in the previous lecture what
 problem our regular expression matcher is supposed to solve,
 namely for any given regular expression $r$ and string $s$
@@ -155,8 +172,8 @@
 \subsection*{Regular Expression Equivalences}
 
 We already defined in Handout 1 what it means for two regular
-expressions to be equivalent, namely if their meaning is the
-same language:
+expressions to be equivalent, namely that their
+\emph{meaning} is the same language:
 
 \[
 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
@@ -220,7 +237,7 @@
 \end{tabular}
 \end{center}
 
-\noindent which always hold no matter what the regular expression $r$
+\noindent They always hold no matter what the regular expression $r$
 looks like. The first two are easy to verify since $L(\ZERO)$ is the
 empty set. The next two are also easy to verify since $L(\ONE) =
 \{[]\}$ and appending the empty string to every string of another set,
@@ -230,7 +247,7 @@
 see this, check the definition of $\_ @ \_$ for sets. The last
 equivalence is again trivial.
 
-What will be important later on is that we can orient these
+What will be critical later on is that we can orient these
 equivalences and read them from left to right. In this way we
 can view them as \emph{simplification rules}. Consider for 
 example the regular expression 
@@ -242,13 +259,13 @@
 
 \noindent If we can find an equivalent regular expression that is
 simpler (that usually means smaller), then this might potentially make
-our matching algorithm run faster. We can look for such a simpler
-regular expression $r'$ because whether a string $s$ is in $L(r)$ or
-in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?
+our matching algorithm run faster. We can look for such a simpler, but
+equivalent, regular expression $r'$ because whether a string $s$ is in
+$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes?
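+
+\noindent
+To make this idea more concrete, below is a minimal Scala sketch of how
+the $\ZERO$/$\ONE$-equivalences can be oriented and applied as
+simplification rules. It assumes that regular expressions are represented
+in Scala by a datatype along the following lines; the sketch is only
+meant as an illustration, not necessarily the simplifier we shall use
+later on.
+
+\begin{lstlisting}[language=Scala,numbers=none]
+abstract class Rexp
+case object ZERO extends Rexp                    // 0
+case object ONE extends Rexp                     // 1
+case class CHAR(c: Char) extends Rexp            // c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
+case class STAR(r: Rexp) extends Rexp            // r*
+
+// applies the equivalences from left to right, bottom-up
+// (regular expressions under STARs are left alone)
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, r2s) => r2s                      // 0 + r  ~>  r
+    case (r1s, ZERO) => r1s                      // r + 0  ~>  r
+    case (r1s, r2s)  =>
+      if (r1s == r2s) r1s else ALT(r1s, r2s)     // r + r  ~>  r
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _)   => ZERO                     // 0 . r  ~>  0
+    case (_, ZERO)   => ZERO                     // r . 0  ~>  0
+    case (ONE, r2s)  => r2s                      // 1 . r  ~>  r
+    case (r1s, ONE)  => r1s                      // r . 1  ~>  r
+    case (r1s, r2s)  => SEQ(r1s, r2s)
+  }
+  case r => r
+}
+\end{lstlisting}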
 
-In the example above you will see that the regular expression is
-equivalent to just $r_1$. You can verify this by iteratively applying
-the simplification rules from above:
+In the example above you will see that the regular expression in
+\eqref{big} is equivalent to just $r_1$. You can verify this by
+iteratively applying the simplification rules from above:
 
 \begin{center}
 \begin{tabular}{ll}
@@ -274,26 +291,26 @@
 
 \begin{center}
 \begin{tabular}{rcl}
-$r^*$  & $\equiv$ & $1 + r\cdot r^*$\\
+$r^*$  & $\equiv$ & $\ONE + r\cdot r^*$\\
 $(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
-$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
+$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
 \end{tabular}
 \end{center}
 
 \noindent
-We will not use them in our algorithm, but feel free to convince yourself
-that they hold. As an aside, there has been a lot of research about
-questions like: Can one always decide when two regular expressions are
-equivalent or not? What does an algorithm look like to decide this
-efficiently? So in general it is not a trivial problem.
+We will not use them in our algorithm, but feel free to convince
+yourself that they actually hold. As an aside, there has been a lot of
+research about questions like: Can one always decide whether two regular
+expressions are equivalent or not? What does an algorithm that decides
+this efficiently look like? So in general this is not a trivial problem.
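+
+Nevertheless, individual equivalences, like the first one above, can be
+verified directly from the meaning of the regular expressions (a sketch
+of the argument, using the definitions from Handout 1):
+
+\begin{center}
+\begin{tabular}{lcl}
+$L(r^*)$ & $=$ & $\bigcup_{0\le n} L(r)^n$\\
+         & $=$ & $\{[]\} \,\cup\, \bigcup_{1\le n} L(r)^n$\\
+         & $=$ & $\{[]\} \,\cup\, (L(r) \,@\, L(r^*))$\\
+         & $=$ & $L(\ONE + r\cdot r^*)$\\
+\end{tabular}
+\end{center}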
 
 \subsection*{The Matching Algorithm}
 
-The algorithm we will define below consists of two parts. One
-is the function $\textit{nullable}$ which takes a regular expression as
-argument and decides whether it can match the empty string
-(this means it returns a boolean in Scala). This can be easily
-defined recursively as follows:
+The algorithm we will introduce below consists of two parts. One is the
+function $\textit{nullable}$ which takes a regular expression as an
+argument and decides whether it can match the empty string (this means
+it returns a boolean in Scala). This can be easily defined recursively
+as follows:
 
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
@@ -313,10 +330,9 @@
 \textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
 \]
 
-\noindent Note on the left-hand side of the if-and-only-if we
-have a function we can implement; on the right we have its
-specification (which we cannot implement in a programming
-language).
+\noindent Note that on the left-hand side of the if-and-only-if we have
+a function we can implement, for example in Scala; on the right we have
+its specification (which we cannot implement in a programming language).
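+
+\noindent
+For example, assuming the \texttt{Rexp} datatype sketched earlier, the
+left-hand side can be implemented by a direct transcription of the
+clauses above (again just a sketch):
+
+\begin{lstlisting}[language=Scala,numbers=none]
+// nullable(r) is true iff r can match the empty string
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+\end{lstlisting}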
 
 The other function of our matching algorithm calculates a
 \emph{derivative} of a regular expression. This is a function
@@ -342,25 +358,23 @@
   \end{tabular}
 \end{center}
 
-\noindent The first two clauses can be rationalised as
-follows: recall that $\textit{der}$ should calculate a regular
-expression so that given the ``input'' regular expression can
-match a string of the form $c\!::\!s$, we want a regular
-expression for $s$. Since neither $\ZERO$ nor $\ONE$
-can match a string of the form $c\!::\!s$, we return
-$\ZERO$. In the third case we have to make a
-case-distinction: In case the regular expression is $c$, then
-clearly it can recognise a string of the form $c\!::\!s$, just
-that $s$ is the empty string. Therefore we return the
-$\ONE$-regular expression. In the other case we again
-return $\ZERO$ since no string of the $c\!::\!s$ can be
-matched. Next come the recursive cases, which are a bit more
-involved. Fortunately, the $+$-case is still relatively
-straightforward: all strings of the form $c\!::\!s$ are either
-matched by the regular expression $r_1$ or $r_2$. So we just
-have to recursively call $\textit{der}$ with these two regular
-expressions and compose the results again with $+$. Makes
-sense? 
+\noindent The first two clauses can be rationalised as follows: recall
+that $\textit{der}$ should calculate a regular expression such that,
+provided the ``input'' regular expression can match a string of the form
+$c\!::\!s$, we obtain a regular expression for $s$. Since neither
+$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we return
+$\ZERO$. In the third case we have to make a case-distinction: In case
+the regular expression is the character $c$, then clearly it can recognise
+a string of the form $c\!::\!s$, just that $s$ is the empty string.
+Therefore we return the $\ONE$-regular expression, as it can match the
+empty string. In the other case we again return $\ZERO$, since no string
+of the form $c\!::\!s$ can be matched. Next come the recursive cases,
+which are a bit more involved. Fortunately, the $+$-case is still
+relatively straightforward: all strings of the form $c\!::\!s$ are
+matched by either the regular expression $r_1$ or $r_2$. So we just have
+to recursively call $\textit{der}$ with these two regular expressions and
+compose the results again with $+$. Makes sense?
+
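+\noindent
+Before discussing the remaining clauses, it might help to see the whole
+definition above transcribed into Scala (again only a sketch, reusing
+the \texttt{Rexp} datatype and the \textit{nullable} function from
+above):
+
+\begin{lstlisting}[language=Scala,numbers=none]
+// der(c, r): the derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+\end{lstlisting}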
 
 The $\cdot$-case is more complicated: if $r_1\cdot r_2$
 matches a string of the form $c\!::\!s$, then the first part
--- a/handouts/ho05.tex	Mon Jun 29 21:13:49 2020 +0100
+++ b/handouts/ho05.tex	Mon Jun 29 21:14:50 2020 +0100
@@ -24,16 +24,17 @@
 
 \section*{Handout 5 (Grammars \& Parser)}
 
-While regular expressions are very useful for lexing and for recognising
-many patterns in strings (like email addresses), they have their
-limitations. For example there is no regular expression that can
-recognise the language $a^nb^n$ (where you have strings starting with $n$ $a$'s
-followed by the same amount of $b$'s). Another example for which there
-exists no regular expression is the language of well-parenthesised
-expressions. In languages like Lisp, which use parentheses rather
-extensively, it might be of interest to know whether the following two
-expressions are well-parenthesised or not (the left one is, the right
-one is not):
+So far we have focused on regular expressions as well as matching and
+lexing algorithms. While regular expressions are very useful for lexing
+and for recognising many patterns in strings (like email addresses),
+they have their limitations. For example there is no regular expression
+that can recognise the language $a^nb^n$ (where you have strings
+starting with $n$ $a$'s followed by the same number of $b$'s). Another
+example for which there exists no regular expression is the language of
+well-parenthesised expressions. In languages like Lisp, which use
+parentheses rather extensively, it might be of interest to know whether
+the following two expressions are well-parenthesised or not (the left
+one is, the right one is not):
 
 \begin{center}
 $(((()()))())$  \hspace{10mm} $(((()()))()))$