# HG changeset patch # User Christian Urban # Date 1465900908 -3600 # Node ID 245d302791c7a837491f6248c4a8e7a1687e1641 # Parent 564f7584eff1ef00297fdc36d80df0faaa939add updated diff -r 564f7584eff1 -r 245d302791c7 handouts/ho01.pdf Binary file handouts/ho01.pdf has changed diff -r 564f7584eff1 -r 245d302791c7 handouts/ho01.tex --- a/handouts/ho01.tex Wed Jun 01 13:04:06 2016 +0100 +++ b/handouts/ho01.tex Tue Jun 14 11:41:48 2016 +0100 @@ -34,7 +34,7 @@ This module is about text processing, be it for web-crawlers, compilers, dictionaries, DNA-data and so on. When looking for -a particular string in a large text we can use the +a particular string, like $abc$ in a large text we can use the Knuth-Morris-Pratt algorithm, which is currently the most efficient general string search algorithm. But often we do \emph{not} just look for a particular string, but for string @@ -67,14 +67,14 @@ \noindent where the first part matches one or more lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots -or hyphens. The \pcode{+} ensures the ``one or more''. Then -comes the \pcode{@}-sign, followed by the domain name which -must be one or more lowercase letters, digits, underscores, -dots or hyphens. Note there cannot be an underscore in the -domain name. Finally there must be a dot followed by the -toplevel domain. This toplevel domain must be 2 to 6 lowercase -letters including the dot. Example strings which follow this -pattern are: +and hyphens. The \pcode{+} at the end of the brackets ensures +the ``one or more''. Then comes the \pcode{@}-sign, followed +by the domain name which must be one or more lowercase +letters, digits, underscores, dots or hyphens. Note there +cannot be an underscore in the domain name. Finally there must +be a dot followed by the toplevel domain. This toplevel domain +must be 2 to 6 lowercase letters including the dot. Example +strings which follow this pattern are: \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}] niceandsimple@example.org @@ -113,10 +113,11 @@ are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, but not \pcode{_i} and also not \pcode{4you}. -Many programming language offer libraries that can be used to +Many programming languages offer libraries that can be used to validate such strings against regular expressions. Also there are some common, and I am sure very familiar, ways of how to -construct regular expressions. For example in Scala we have: +construct regular expressions. For example in Scala we have +a library implementing the following regular expressions: \begin{center} \begin{tabular}{lp{9cm}} @@ -177,10 +178,11 @@ Regular expressions were introduced by Kleene in the 1950ies and they have been object of intense study since then. They -are nowadays pretty much ubiquitous in computer science. I am -sure you have come across them before. Why on earth then is -there any interest in studying them again in depth in this -module? Well, one answer is in the following graph about +are nowadays pretty much ubiquitous in computer science. There +are many libraries implementing regular expressions. I am sure +you have come across them before (remember PRA?). Why on earth +then is there any interest in studying them again in depth in +this module? Well, one answer is in the following graph about regular expression matching in Python and in Ruby. \begin{center} @@ -214,7 +216,7 @@ Ruby is even slightly worse.\footnote{In this example Ruby uses the slightly different regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and -\texttt{a} each occur $n$ times. More test results can be +\texttt{a} each occur $n$ times. More such test cases can be found at \url{http://www.computerbytesman.com/redos/}.} Admittedly, this regular expression is carefully chosen to exhibit this exponential behaviour, but similar ones occur @@ -285,7 +287,7 @@ \noindent Because we overload our notation, there are some subtleties you should be aware of. When regular expressions -are referred to then $\ZERO$ (in bold font) does not stand for +are referred to, then $\ZERO$ (in bold font) does not stand for the number zero: rather it is a particular pattern that does not match any string. Similarly, in the context of regular expressions, $\ONE$ does not stand for the number one but for @@ -368,7 +370,7 @@ If you prefer to think in terms of the implementation of regular expressions in Scala, the constructors and classes relate as follows\footnote{More about Scala is -in the handout about A Crash-Course on Scala.} +in the handout about \emph{A Crash-Course on Scala}.} \begin{center} \begin{tabular}{rcl} @@ -663,7 +665,8 @@ \end{minipage} \end{center} -\caption{Nothing that can be said this\ldots\label{monster}} +\caption{Nothing that can be said about this regular +expression\ldots\label{monster}} \end{figure} diff -r 564f7584eff1 -r 245d302791c7 handouts/notation.pdf Binary file handouts/notation.pdf has changed diff -r 564f7584eff1 -r 245d302791c7 handouts/notation.tex --- a/handouts/notation.tex Wed Jun 01 13:04:06 2016 +0100 +++ b/handouts/notation.tex Tue Jun 14 11:41:48 2016 +0100 @@ -8,10 +8,15 @@ \section*{A Crash-Course on Notation} -There are innumerable books available about compiler, automata -and formal languages. Unfortunately, they often use their own -notational conventions and their own symbols. This handout is -meant to clarify some of the notation I will use. +There are innumerable books available about compilers, +automata theory and formal languages. Unfortunately, they +often use their own notational conventions and their own +symbols. This handout is meant to clarify some of the notation +I will use. I appologise in advance that sometimes I will be a +bit fuzzy\ldots the problem is that often we want to have +convenience in our mathematical definitions (to make them +readable and understandable), but sometimes we need precision +for actual programs. \subsubsection*{Characters and Strings} @@ -40,13 +45,17 @@ $b$, $c$ and so on for characters. Therefore if we need a representative string, we might write -\begin{equation}\label{abracadabra} +\[ abracadabra -\end{equation} +\] \noindent In this string, we do not really care what the characters stand for, except we do care about the fact that for example the character $a$ is not equal to $b$ and so on. +Why do I make this distinction? Because we often need to +define functions using variables ranging over characters. We +need to somehow say this is a variable, say $c$, ranging over +characters, while this is the atual character \pcode{c}. An \defn{alphabet} is a (non-empty) finite set of characters. Often the letter $\Sigma$ is used to refer to an alphabet. For @@ -62,12 +71,26 @@ \defn{Strings} are lists of characters. Unfortunately, there are many ways how we can write down strings. In programming languages, they are usually written as \dq{$hello$} where the -double quotes indicate that we are dealing with a string. But -since we regard strings as lists of characters we could also -write this string as +double quotes indicate that we are dealing with a string. In +programming languages, such as Scala, strings have a special +type---namely \pcode{String} which is different from the type +for lists of chatacters. This is because strings can be +efficiently represented in memory, unlike general lists. Since +\code{String} and the type of lists of characters, +\code{List[Char]} are not the same, we need to explicitly +coerce elements between the two types, for example + +\begin{lstlisting}[numbers=none] +scala> "abc".toList +res01: List[Char] = List(a, b, c) +\end{lstlisting} + +\noindent Since in our (mathematical) definitions we regard +strings as lists of characters, we will also write +\dq{$hello$} as \[ -[\text{\it h, e, l, l, o}] \;\;\text{or simply}\;\; \textit{hello} +[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello} \] \noindent The important point is that we can always decompose @@ -81,44 +104,32 @@ represent the empty string.} Two strings, say $s_1$ and $s_2$, can be \defn{concatenated}, -which we write as $s_1 @ s_2$. Suppose we are given two -strings \dq{\textit{foo}} and \dq{\textit{bar}}, then their -concatenation, writen \dq{\textit{foo}} $@$ \dq{\textit{bar}}, -gives \dq{\textit{foobar}}. Often we will simplify our life -and just drop the double quotes whenever it is clear we are -talking about strings, writing as already in -\eqref{abracadabra} just \textit{foo}, \textit{bar}, -\textit{foobar} or \textit{foo $@$ bar}. +which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as +lists of characters, then $@$ is the list-append function. +Suppose we are given two strings \dq{\textit{foo}} and +\dq{\textit{bar}}, then their concatenation, writen +\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives +\dq{\textit{foobar}}. But as said above, we will often +simplify our life and just drop the double quotes whenever it +is clear we are talking about strings, So we will often just +write \textit{foo}, \textit{bar}, \textit{foobar} or +\textit{foo $@$ bar}. -Some simple properties of string concatenation hold. For -example the concatenation operation is \emph{associative}, -meaning +Occasionally we will use the notation $a^n$ for strings, which +stands for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is +a string that has as many $a$s by as many $b$s. + +A simple property of string concatenation is +\emph{associativity}, meaning \[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\] \noindent are always equal strings. The empty string behaves -like a unit element, therefore +like a \emph{unit element}, therefore \[s \,@\, [] = [] \,@\, s = s\] -Occasionally we will use the notation $a^n$ for strings, which -stands for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is -a string that has as many $a$s as $b$s. - -Note however that while for us strings are just lists of -characters, programming languages often differentiate between -the two concepts. In Scala, for example, there is the type of -\code{String} and the type of lists of characters, -\code{List[Char]}. They are not the same and we need to -explicitly coerce elements between the two types, for example - -\begin{lstlisting}[numbers=none] -scala> "abc".toList -res01: List[Char] = List(a, b, c) -\end{lstlisting} - - \subsubsection*{Sets and Languages} We will use the familiar operations $\cup$, $\cap$, $\subset$ @@ -137,8 +148,9 @@ example the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This set is often also abbreviated as $\mathbb{N}$. We can define sets by giving all elements, for -example $\{0, 1\}$, but also by \defn{set comprehensions}. For -example the set of all even natural numbers can be defined as +example $\{0, 1\}$ for the set containing just $0$ and $1$, +but also by \defn{set comprehensions}. For example the set of +all even natural numbers can be defined as \[ \{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\} @@ -189,7 +201,15 @@ \noindent but using the big union notation is more concise. -An important notion in this module are \defn{languages}, which +While this stuff about sete might all look trivial or even +needlessly pedantic, \emph{Nature} is never simple. If you +want to be amazed how complicated sets can get, watch out for +the last lecture just before Christmas where I want you to +convince you of the fact that some sets are more infinite than +others. Actually that will be a fact that is very relevant to +the material of this module. + +Another important notion in this module are \defn{languages}, which are sets of strings. One of the main goals for us will be how to (formally) specify languages and to find out whether a string is in a language or not.\footnote{You might wish to ponder @@ -251,10 +271,10 @@ $x^0$?) Next we can define the \defn{star operation} for languages: -$A^*$ is the union of all powers of $A$, or short +$A\star$ is the union of all powers of $A$, or short \begin{equation}\label{star} -A^* \dn \bigcup_{0\le n}\; A^n +A\star \dn \bigcup_{0\le n}\; A^n \end{equation} \noindent This star operation is often also called @@ -272,16 +292,16 @@ \{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots \] -\noindent We can see that the empty string is always in $A^*$, +\noindent We can see that the empty string is always in $A\star$, no matter what $A$ is. This is because $[] \in A^0$. To make sure you understand these definitions, I leave you to answer -what $\{[]\}^*$ and $\varnothing^*$ are. +what $\{[]\}\star$ and $\varnothing\star$ are? Recall that an alphabet is often referred to by the letter -$\Sigma$. We can now write for the set of all strings over -this alphabet $\Sigma^*$. In doing so we also include the -empty string as a possible string over $\Sigma$. So if -$\Sigma = \{a, b\}$, then $\Sigma^*$ is +$\Sigma$. We can now write for the set of \emph{all} strings +over this alphabet $\Sigma\star$. In doing so we also include the +empty string as a possible string over $\Sigma$. So if $\Sigma += \{a, b\}$, then $\Sigma\star$ is \[ \{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\} diff -r 564f7584eff1 -r 245d302791c7 hws/hw02.pdf Binary file hws/hw02.pdf has changed diff -r 564f7584eff1 -r 245d302791c7 hws/hw02.tex --- a/hws/hw02.tex Wed Jun 01 13:04:06 2016 +0100 +++ b/hws/hw02.tex Tue Jun 14 11:41:48 2016 +0100 @@ -100,7 +100,12 @@ \item Give a regular expressions that can recognise all strings from the language $\{a^n\;|\;\exists k.\; n = 3 k - + 1 \}$. \end{enumerate} + + 1 \}$. + +\item Give a regular expression that can recognise an odd +number of $a$s or an even number of $b$s. + +\end{enumerate} \end{document}