afl-material: comparison handouts/ho01.tex

-:35104ee14f87
+:8d5aaf5b0031
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \usepackage{../data}
-\usepackage{longtable}
 \begin{document}
 \section*{Handout 1}
 This module is about text processing, be it for web-crawlers,
 compilers, dictionaries, DNA-data and so on. When looking for
 a particular string in a large text we can use the
 Knuth-Morris-Pratt algorithm, which is currently the most
 efficient general string search algorithm. But often we do
-\emph{not} look for just a particular string, but for string
+\emph{not} just look for a particular string, but for string
 patterns. For example in programming code we need to identify
-what are the keywords, what are the identifiers etc. Also
+what are the keywords, what are the identifiers etc. A pattern
-often we face the problem that we are given a string (for
+for identifiers could be that they start with a letter,
+followed by zero or more letters, numbers and the underscore.
+Also often we face the problem that we are given a string (for
 example some user input) and want to know whether it matches a
-particular pattern. For example for excluding some user input
+particular pattern. In this way we can exclude user input that
-that would otherwise have nasty effects on our program
+would otherwise have nasty effects on our program (crashing it
-(crashing or going into an infinite loop, if not worse).
+or going into an infinite loop, if not worse). \defn{Regular
-\defn{Regular expressions} help with conveniently specifying
+expressions} help with conveniently specifying such patterns.
-such patterns.
 The idea behind regular expressions is that they are a simple
-method for describing languages (or sets of strings)...at
+method for describing languages (or sets of strings)\ldots at
 least languages we are interested in in computer science. For
 example there is no convenient regular expression for
 describing the English language short of enumerating all
 English words. But they seem useful for describing for example
 email addresses.\footnote{See ``8 Regular Expressions You
 toplevel domain. This toplevel domain must be 2 to 6 lowercase
 letters including the dot. Example strings which follow this
 pattern are:
 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
-niceandsimple@example.com
+niceandsimple@example.org
-very.common@example.org
+very.common@example.co.uk
-a.little.lengthy.but.fine@dept.example.co.uk
+a.little.lengthy.but.fine@dept.example.ac.uk
-other.email-with-dash@example.ac.uk
+other.email-with-dash@example.edu
 \end{lstlisting}
 \noindent
 But for example the following two do not:
 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
 user@localserver
 disposable.style.email.with+symbol@example.com
 \end{lstlisting}
+Identifiers, or variables, in program text are often required
+to satisfy the constraints that they start with a letter and
+then can be followed by zero or more letters or numbers and
+also can include underscores, but not as the first character.
+Such identifiers can be recognised with the regular expression
+\begin{center}
+\pcode{[a-zA-Z] [a-zA-Z0-9_]*}
+\end{center}
+\noindent Possible identifiers that match this regular expression
+are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name},
+but not \pcode{_i} and also not \pcode{4you}.
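\noindent The claims about these example identifiers can be checked with
a few lines of Python. This is only a minimal sketch: the names
\texttt{identifier} and \texttt{is\_identifier} are made up for this
example, and the space in the middle of the regular expression above
is dropped, because in Python's \texttt{re} syntax it would be matched
literally.

```python
import re

# The identifier pattern from the text, written without the space:
# one letter, followed by zero or more letters, digits or underscores.
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_identifier(s):
    """Return True if the WHOLE string s matches the identifier pattern."""
    return identifier.fullmatch(s) is not None


candidates = ["x", "foo", "foo_bar_1", "A_very_42_long_object_name", "_i", "4you"]
for s in candidates:
    print(f"{s!r}: {is_identifier(s)}")
```

\noindent Note the use of \texttt{fullmatch}: a search for the pattern
somewhere inside the string would also succeed for \texttt{4you}, since
\texttt{you} on its own is a valid identifier.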
 Many programming languages offer libraries that can be used to
-validate such strings against regular expressions, like the
+validate such strings against regular expressions. Also there
-one for email addresses in \eqref{email}. There are some
+are some common, and I am sure very familiar, ways how to
-common, and I am sure very familiar, ways how to construct
+construct regular expressions. For example in Scala we have:
-regular expressions. For example in Scala we have
+\begin{center}
-\begin{center}
+\begin{tabular}{lp{9cm}}
-\begin{longtable}{lp{9cm}}
 \pcode{re*} & matches 0 or more occurrences of preceding
 expression\\
 \pcode{re+} & matches 1 or more occurrences of preceding
 expression\\
 \pcode{re?} &	 matches 0 or 1 occurrence of preceding
 expression\\
 \pcode{re\{n\}}	& matches exactly \pcode{n}
-occurrences\\
+occurrences of the preceding expression\\
 \pcode{re\{n,m\}} & matches at least \pcode{n} and at most \pcode{m}
 occurrences of the preceding expression\\
 \pcode{[...]} & matches any single character inside the
 brackets\\
 \pcode{[^...]} & matches any single character not inside the
 brackets\\
 \pcode{..-..} & character ranges\\
 \pcode{\\d} &	matches digits; equivalent to \pcode{[0-9]}
-\end{longtable}
+\end{tabular}
 \end{center}
-\noindent With this you can figure out the purpose of the
+\noindent With this table you can figure out the purpose of
-regular expressions in the web-crawlers shown Figures
+the regular expressions in the web-crawlers shown in Figures
-\ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note the
+\ref{crawler1}, \ref{crawler2} and \ref{crawler3}. Note,
-regular expression for http-addresses in web-pages:
+however, the regular expression for http-addresses in
+web-pages is meant to be
 \[
 \pcode{"https?://[^"]*"}
 \]
 \[
 \pcode{""""https?://[^"]*"""".r}
 \]
-\noindent Not also that the convention in Scala is that
+\noindent Note also that the convention in Scala is that
 \texttt{.r} converts a string into a regular expression. I
 leave it to you to ponder whether this regular expression
-really captures all possible web-addresses.\bigskip
+really captures all possible web-addresses.
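\noindent To see what this regular expression picks out of a page, here
is a small Python sketch. It is not the Scala crawler from the figures:
the page content and the function name \texttt{get\_all\_urls} are
assumptions made up for this illustration.

```python
import re

# The pattern from the text: a double quote, "http" with an optional "s",
# then "://", then any characters except a double quote, then a closing quote.
http_pattern = re.compile(r'"https?://[^"]*"')

# Hypothetical page content, invented for this example.
page = (
    '<a href="http://example.org/a">A</a> '
    '<a href="https://example.org/b?x=1">B</a> '
    '<a href="ftp://example.org/c">C</a>'
)


def get_all_urls(text):
    """Return all quoted http(s) addresses in text, without the quotes."""
    return [m[1:-1] for m in http_pattern.findall(text)]


print(get_all_urls(page))
# ['http://example.org/a', 'https://example.org/b?x=1']
```

\noindent The \texttt{ftp} link is skipped, and so would be any address
written in single quotes or without quotes, which is one reason to doubt
that the pattern captures all web-addresses.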
+\subsection*{Why Study Regular Expressions?}
 Regular expressions were introduced by Kleene in the 1950s
 and they have been the object of intense study since then. They
 are nowadays pretty much ubiquitous in computer science. I am
 sure you have come across them before. Why on earth then is
 	\end{scope}
 \end{tikzpicture}
 \end{center}
 \noindent This graph shows that Python needs approximately 29
-seconds in order to find out that a string of 28 \texttt{a}s
+seconds for finding out whether a string of 28 \texttt{a}s
 matches the regular expression \texttt{(a?)\{28\}a\{28\}}.
 Ruby is even slightly worse.\footnote{In this example Ruby
 uses the slightly different regular expression
 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
 \texttt{a} each occur $n$ times.} Admittedly, this regular
 expression is carefully chosen to exhibit this exponential
 behaviour, but similar ones occur more often than one wants in
 ``real life''. They are sometimes called \emph{evil regular
 expressions} because they have the potential to make regular
-expression matching engines topple over, like in Python and
+expression matching engines topple over, like in Python and
-Ruby. The problem is that this can have some serious
+Ruby. The problem with evil regular expressions is that they
-consequences, for example, if you use them in your
+can have some serious consequences, for example, if you use
-web-application, because hackers can look for these instances
+them in your web-application. The reason is that hackers can
-where the matching engine behaves badly and mount a nice
+look for these instances where the matching engine behaves
-DoS-attack against your application.
+badly and then mount a nice DoS-attack against your
+application.
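\noindent The slowdown can be observed on a small scale with the
following Python sketch. It uses the expanded form of the evil regular
expression from the footnote (\texttt{a?} repeated $n$ times followed
by \texttt{a} repeated $n$ times); the function names are made up for
this example, and the values of $n$ are deliberately kept small so that
the script finishes quickly. The actual timings depend on the machine
and the Python version.

```python
import re
import time


def evil_regex(n):
    """Build the pattern a?a?...a?aa...a with n copies of each part."""
    return "a?" * n + "a" * n


def time_match(n):
    """Match a string of n a's against evil_regex(n).

    Returns a pair (matched, seconds).
    """
    pattern = re.compile(evil_regex(n))
    start = time.perf_counter()
    matched = pattern.fullmatch("a" * n) is not None
    return matched, time.perf_counter() - start


# Small values only: larger n quickly become very slow.
for n in (5, 10, 15):
    matched, secs = time_match(n)
    print(f"n = {n:2d}  matched = {matched}  time = {secs:.4f}s")
```

\noindent The string always matches (every \texttt{a?} can match the
empty string), but a backtracking matcher first tries to let the
\texttt{a?}s consume characters and has to undo many of these choices.
Increasing $n$ step by step shows how quickly the time grows.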
-It will be instructive to look behind the ``scenes''to find
+It will be instructive to look behind the ``scenes'' to find
 out why Python and Ruby (and others) behave so badly when
 matching with evil regular expressions. But we will also look
 at a relatively simple algorithm that solves this problem much
 better than Python and Ruby do\ldots actually it will be two
 versions of the algorithm: the first one will be able to
 process strings of approximately 1,000 \texttt{a}s in 30
-seconds, while the second version will even be able to
+seconds, while the second version will even be able to process
-process up to 12,000 in less than 10(!) seconds, see the graph
+up to 12,000 in less than 10(!) seconds, see the graph below:
-below:
 \begin{center}
 \begin{tikzpicture}[y=.1cm, x=.0006cm]
 	%axis
 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
 \end{tikzpicture}
 \end{center}
 \subsection*{Basic Regular Expressions}
-The regular expressions shown above we will call
+The regular expressions shown above, for example for Scala, we
-\defn{extended regular expressions}. The ones we will mainly
+will call \emph{extended regular expressions}. The ones we
-study are \emph{basic regular expressions}, which by
+will mainly study in this module are \emph{basic regular
-convention we will just call regular expressions, if it is
+expressions}, which by convention we will just call
-clear what we mean. The attraction of (basic) regular
+\emph{regular expressions}, if it is clear what we mean. The
-expressions is that many features of the extended one are just
+attraction of (basic) regular expressions is that many
-syntactic suggar. (Basic) regular expressions are defined by
+features of the extended ones are just syntactic sugar.
-the following grammar:
+(Basic) regular expressions are defined by the following
+grammar:
 \begin{center}
 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
 $r$ & $::=$ &   $\varnothing$         & null\\
 & $\mid$ & $\epsilon$           & empty string / "" / []\\
 & $\mid$ & $c$                  & single character\\
+& $\mid$ & $r_1 + r_2$          & alternative / choice\\
 & $\mid$ & $r_1 \cdot r_2$      & sequence\\
-& $\mid$ & $r_1 + r_2$          & alternative / choice\\
 & $\mid$ & $r^*$                & star (zero or more)\\
 \end{tabular}
 \end{center}
 \noindent Because we overload our notation, there are some
-subtleties you should be aware of. First, when regular
+subtleties you should be aware of. When regular expressions
-expressions are referred to then $\varnothing$ does not stand
+are referred to then $\varnothing$ does not stand for the
-for the empty set: it is a particular pattern that does not
+empty set: rather it is a particular pattern that does not
 match any string. Similarly, in the context of regular
 expressions, $\epsilon$ does not stand for the empty string
-(as in many places in the literature) but for a pattern that
+(as in many places in the literature) but for a regular
-matches the empty string. Second, the letter $c$ stands for
+expression that matches the empty string. The letter $c$
-any character from the alphabet at hand. Again in the context
+stands for any character from the alphabet at hand. Again in
-of regular expressions, it is a particular pattern that can
+the context of regular expressions, it is a particular pattern
-match the specified string. Third, you should also be careful
+that can match the specified character. You should also be
-with the our overloading of the star: assuming you have read
+careful with our overloading of the star: assuming you have
-the handout about our basic mathematical notation, you will
+read the handout about our basic mathematical notation, you
-see that in the context of languages (sets of strings) the
+will see that in the context of languages (sets of strings)
-star stands for an operation on languages. While $r^*$ stands
+the star stands for an operation on languages. Here
-for a regular expression, the operation on sets is defined as
+$r^*$ stands for a regular expression, which is different from
+the operation on sets, which is defined as
 \[
 A^* \dn \bigcup_{0\le n} A^n
 \]
 We will use parentheses to disambiguate regular expressions.
 Parentheses are not really part of a regular expression, and
 indeed we do not need them in our code because there the tree
-structure is always clear. But for writing them down in a more
+structure of regular expressions is always clear. But for
-mathematical fashion, parentheses will be helpful. For example
+writing them down in a more mathematical fashion, parentheses
-we will write $(r_1 + r_2)^*$, which is different from, say
+will be helpful. For example we will write $(r_1 + r_2)^*$,
-$r_1 + (r_2)^*$. The former means roughly zero or more times
+which is different from, say $r_1 + (r_2)^*$. The former means
-$r_1$ or $r_2$, while the latter means $r_1$ or zero or more
+roughly zero or more times $r_1$ or $r_2$, while the latter
-times $r_2$. This will turn out are two different pattern,
+means $r_1$ or zero or more times $r_2$. These will turn out
-which match in general different strings. We should also write
+to be two different patterns, which match in general different
-$(r_1 + r_2) + r_3$, which is different from the regular
+strings. We should also write $(r_1 + r_2) + r_3$, which is
-expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$
+different from the regular expression $r_1 + (r_2 + r_3)$, but
-we actually do not care about the order and just write $r_1 +
+in case of $+$ and $\cdot$ we actually do not care about the
-r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, respectively. The
+order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
-reasons for this will become clear shortly. In the literature
+\cdot r_3$, respectively. The reasons for this will become
-you will often find that the choice $r_1 + r_2$ is written as
+clear shortly. In the literature you will often find that the
-$r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the
+choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or
-convention in the literature, we will often omit the $\cdot$
+$r_1\mid\mid{}r_2$. Also following the convention in the
-all together. This is to make some concrete regular
+literature, we will often omit the $\cdot$ altogether. This
-expressions more readable. For example the regular expression
+is to make some concrete regular expressions more readable.
-for email addresses shown in \eqref{email} would look like
+For example the regular expression for email addresses shown
+in \eqref{email} would look like
 \[
 \texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\;
 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
 \texttt{[...]\{2,6\}}
 A source of confusion might arise from the fact that we
 use the term \emph{basic regular expression} for the regular
 expressions used in ``theory'' and defined above, and
 \emph{extended regular expression} for the ones used in
-``practice'', for example Scala. If runtime is not of an
+``practice'', for example in Scala. If runtime is not an
-issue, then the latter can be seen as some syntactic sugar of
+issue, then the latter can be seen as syntactic sugar of
-the former. Fo example we could replace
+the former. For example we could replace
 \begin{center}
 \begin{tabular}{rcl}
-$r^+$ & $\mapsto$ & $r\cdot r^*$\\
+$r+$ & $\mapsto$ & $r\cdot r^*$\\
 $r?$ & $\mapsto$ & $\epsilon + r$\\
 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
 \end{tabular}
 \end{center}
 \subsection*{The Meaning of Regular Expressions}
 So far we have only considered informally what the
-\emph{meaning} of a regular expression is. This is no good for
+\emph{meaning} of a regular expression is. This is not good
-specifications of what algorithms are supposed to do or which
+enough for specifications of what algorithms are supposed to
-problems they are supposed to solve.
+do or which problems they are supposed to solve.
-To do so more formally we will associate with every regular
+To define the meaning of a regular expression we will
-expression a language, or set of strings, that is supposed to
+associate with every regular expression a language, or set of
-be matched by this regular expression. To understand what is
+strings. This language contains all the strings the regular
-going on here it is crucial that you have also read the
+expression is supposed to match. To understand what is going
-handout about our basic mathematical notations.
+on here it is crucial that you have read the handout
+about basic mathematical notations.
-The meaning of a regular expression can be defined recursively
-as follows
+The \defn{meaning of a regular expression} can be defined
+by a recursive function called $L$ (for language), which
+is defined as follows
 \begin{center}
 \begin{tabular}{rcl}
 $L(\varnothing)$  & $\dn$ & $\varnothing$\\
 $L(\epsilon)$     & $\dn$ & $\{[]\}$\\
 $L(c)$            & $\dn$ & $\{"c"\}$\\
 $L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
-$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) @ L(r_2)$\\
+$L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
 $L(r^*)$           & $\dn$ & $(L(r))^*$\\
 \end{tabular}
 \end{center}
-\noindent
+\noindent As a result we can now precisely state what the
-As a result we can now precisely state what the meaning, for example, of the regular expression
+meaning, for example, of the regular expression $h \cdot
-${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely
+e \cdot l \cdot l \cdot o$ is, namely
-$L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the
-choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
+\[
-be matched by this choice. You can now also see why we do not make a difference
+L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot
-between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they
+{\it o}) = \{"hello"\}
-are not the same regular expression, but have the same meaning.
+\]
-The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
+\noindent This is expected because this regular expression
-regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
+is only supposed to match the ``$hello$''-string. Similarly if
-any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
+we have the choice-regular-expression $a + b$, its meaning is
-if $s \not\in L(r)$. We leave this for the next lecture.
+\[
+L(a + b) = \{"a", "b"\}
+\]
+\noindent You can now also see why we do not make a difference
+between the different regular expressions $(r_1 + r_2) + r_3$
+and $r_1 + (r_2 + r_3)$\ldots they are not the same regular
+expression, but have the same meaning.
+\begin{eqnarray*}
+L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\
+& = & L(r_1) \cup L(r_2) \cup L(r_3)\\
+& = & L(r_1) \cup L(r_2 + r_3)\\
+& = & L(r_1 + (r_2 + r_3))
+\end{eqnarray*}
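\noindent The recursive definition of $L$ can also be turned into a
small program. The following Python sketch is only an illustration
under stated assumptions: the class names (\texttt{Zero}, \texttt{One},
\texttt{Chr}, \texttt{Alt}, \texttt{Seq}, \texttt{Star}) and the
functions \texttt{concat} and \texttt{lang} are made up for this
example, and because $L(r^*)$ is in general an infinite set, the
function only computes the strings of $L(r)$ up to a given length $k$.

```python
from dataclasses import dataclass


# Basic regular expressions as trees; one class per grammar clause.
class Rexp:
    pass


@dataclass(frozen=True)
class Zero(Rexp):  # the regular expression that matches no string
    pass


@dataclass(frozen=True)
class One(Rexp):  # epsilon: matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Rexp):  # a single character c
    c: str


@dataclass(frozen=True)
class Alt(Rexp):  # r1 + r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):  # r1 . r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):  # r*
    r: Rexp


def concat(A, B, k):
    """The set A @ B, keeping only strings of length at most k."""
    return {s1 + s2 for s1 in A for s2 in B if len(s1) + len(s2) <= k}


def lang(r, k):
    """All strings of L(r) that have length at most k."""
    if isinstance(r, Zero):
        return set()
    if isinstance(r, One):
        return {""}
    if isinstance(r, Chr):
        return {r.c} if k >= 1 else set()
    if isinstance(r, Alt):
        return lang(r.r1, k) | lang(r.r2, k)
    if isinstance(r, Seq):
        return concat(lang(r.r1, k), lang(r.r2, k), k)
    if isinstance(r, Star):
        A = lang(r.r, k)
        result, frontier = {""}, {""}
        while frontier:  # keep appending strings from A until nothing new fits
            frontier = concat(frontier, A, k) - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


hello = Seq(Chr("h"), Seq(Chr("e"), Seq(Chr("l"), Seq(Chr("l"), Chr("o")))))
print(lang(hello, 5))                            # {'hello'}
print(sorted(lang(Alt(Chr("a"), Chr("b")), 5)))  # ['a', 'b']
print(sorted(lang(Star(Chr("a")), 3)))           # ['', 'a', 'aa', 'aaa']
```

\noindent Each branch of \texttt{lang} corresponds to one line of the
definition of $L$; only the case for the star needs a loop, because it
stands for the union of all powers.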
+The point of the definition of $L$ is that we can use it to
+precisely specify when a string $s$ is matched by a regular
+expression $r$, namely if and only if $s \in L(r)$. In fact we
+will write a program \pcode{match} that takes any string $s$
+and any regular expression $r$ as argument and returns
+\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
+L(r)$. We leave this for the next lecture.
+There is one more feature of regular expressions that is worth
+mentioning. Given some strings, there are in general many
+different regular expressions that can recognise these
+strings. This is obvious with the regular expression $a + b$
+which can match the strings $a$ and $b$. But also the regular
+expression $b + a$ would match the same strings. However,
+sometimes it is not so obvious whether two regular expressions
+match the same strings: for example do $r^*$ and $\epsilon + r
+\cdot r^*$ match the same strings? What about $\varnothing^*$
+and $\epsilon^*$? This suggests the following relation between
+\defn{equivalent regular expressions}:
+\[
+r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
+\]
+\noindent That means two regular expressions are equivalent if
+they match the same set of strings. Therefore we do not really
+distinguish between the different regular expressions $(r_1 +
+r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are
+equivalent. I leave you with the question whether
+\[
+\varnothing^* \equiv \epsilon^*
+\]
+\noindent holds. Such equivalences will be important for our
+matching algorithm, because we can use them to simplify
+regular expressions.
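\noindent As a worked example of how such an equivalence can be
established, here is a sketch of the calculation for $r^*$ and
$\epsilon + r \cdot r^*$. It writes $A$ for $L(r)$ and assumes the
definition of powers from the handout on mathematical notation,
namely $A^0 \dn \{[]\}$ and $A^{n+1} \dn A \,@\, A^n$, together with
the fact that $@$ distributes over unions:
\begin{eqnarray*}
L(\epsilon + r \cdot r^*) & = & L(\epsilon) \cup (L(r) \,@\, L(r^*))\\
& = & \{[]\} \cup (A \,@\, \bigcup_{0\le n} A^n)\\
& = & A^0 \cup \bigcup_{0\le n} A^{n+1}\\
& = & \bigcup_{0\le n} A^n\\
& = & L(r^*)
\end{eqnarray*}
\noindent Since both regular expressions have the same language,
they are equivalent in the sense defined above.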
+\subsection*{My Fascination for Regular Expressions}
+Up until a few years ago I was not really interested in
+regular expressions. They have been studied for the last 60
+years (by smarter people than me)---surely nothing new can be
+found out about them. I even have the vague recollection that
+I did not quite understand them during my studies. If I remember
+correctly,\footnote{That was really a long time ago.} I got
+utterly confused about $\epsilon$ and the empty
+string.\footnote{Obviously the lecturer must have been bad.}
+Since my studies, I have used regular expressions for
+implementing lexers and parsers as I have always been
+interested in all kinds of programming languages and
+compilers, which invariably need regular expressions in some
+form or shape.
+To understand my fascination nowadays with regular
+expressions, you need to know that my main scientific interest
+for the last 14 years has been in theorem provers. I am a
+core developer of one of
+them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
+provers are systems in which you can formally reason about
+mathematical concepts, but also about programs. In this way
+they can help with writing bug-free code. Theorem provers have
+already proved their value in a number of systems (even in
+terms of hard cash), but they are still clunky and difficult
+for average programmers to use.
+Anyway, in about 2011 I came across the notion of
+\defn{derivatives of regular expressions}. This notion allows
+one to do almost all calculations in regular language theory
+on the level of regular expressions, not needing any automata.
+This is important because automata are graphs and it is rather
+difficult to reason about graphs in theorem provers. In
+contrast, to reason about regular expressions is easy-peasy in
+theorem provers. Is this important? I think yes, because
+according to Kuklewicz nearly all POSIX-based regular
+expression matchers are
+buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
+With my PhD student Fahad Ausaf I am currently working on
+proving the correctness of one such algorithm that was
+proposed by Sulzmann and Lu in
+2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an
+attractive result since we will be able to prove not only that
+the algorithm is really correct, but also that the machine code(!)
+that implements this algorithm efficiently is correct. Writing
+programs in this way does not leave any room for potential
+errors or bugs. How nice!
+What also helped with my fascination with regular expressions
+is that we could indeed find out new things about them that
+have surprised some experts in the field of regular
+expressions. Together with two colleagues from China, I was
+able to prove the Myhill-Nerode theorem by only using regular
+expressions and the notion of derivatives. Earlier versions of
+this theorem always used automata in the proof. Using this
+theorem we can show that regular languages are closed under
+complementation, something which Gasarch in his
+blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
+shown via automata. Even somebody who has written a 700+-page
+book\footnote{\url{http://goo.gl/fD0eHx}} on regular
+expressions did not know better. Well, we showed it can also be
+done with regular expressions only. What a feeling if you
+are an outsider to the subject!
+To conclude: Despite my early ignorance about regular
+expressions, I find them now quite interesting. They have a
+beautiful mathematical theory behind them. They have practical
+importance (remember the shocking runtime of the regular
+expression matchers in Python and Ruby in some instances).
+People who are not very familiar with the mathematical
+background get them consistently wrong. The hope is that we
+can do better in the future---for example by proving that the
+algorithms actually satisfy their specification and that the
+corresponding implementations do not contain any bugs. We are
+close, but not yet quite there.
 \begin{figure}[p]
 \lstinputlisting{../progs/crawler1.scala}
 \caption{The Scala code for a simple web-crawler that checks
 for broken links in a web-page. It uses the regular expression
 \texttt{email\_pattern} in Line~16. The main change is in Line
 32 where all email addresses that can be found in a page are
 printed.\label{crawler3}}
 \end{figure}
-\pagebreak
-Lets start
-with what we mean by \emph{strings}. Strings (they are also
-sometimes referred to as \emph{words}) are lists of characters
-drawn from an \emph{alphabet}. If nothing else is specified,
-we usually assume the alphabet consists of just the lower-case
-letters $a$, $b$, \ldots, $z$. Sometimes, however, we
-explicitly restrict strings to contain, for example, only the
-letters $a$ and $b$. In this case we say the alphabet is the
-set $\{a, b\}$.
-There are many ways how we can write down strings. In programming languages, they are usually
-written as {\it "hello"} where the double quotes indicate that we dealing with a string.
-Essentially, strings are lists of characters which can be written for example as follows
-\[
-[\text{\it h, e, l, l, o}]
-\]
-\noindent
-The important point is that we can always decompose strings. For example, we will often consider the
-first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions
-about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as
-the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated},
-which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
-gives {\it "foobar"}.
-We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
-is
-\[
-\{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
-\]
-\noindent
-Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
-this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like
-\[
-\{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
-\]
-\noindent
-then we have essentially described the English language, or more precisely all
-strings that can be used in a sentence of the English language. French would be a
-different set of strings, and so on. In the context of this course, a language might
-not necessarily make sense from a natural language point of view. For example
-the set of all strings shown above is a language, as is the empty set (of strings). The
-empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a
-difference between the empty set, or empty language, and the set that
-contains only the empty string $\{\text{""}\}$: the former has no elements, whereas
-the latter has one element.
-Before we expand on the topic of regular expressions, let us review some operations on
-sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings.
-The union of two sets is written as usual as $A \cup B$. We also need to define the
-operation of \emph{concatenating} two sets of strings. This can be defined as
-\[
-A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
-\]
-\noindent
-which essentially means take the first string from the set $A$ and concatenate it with every
-string in the set $B$, then take the second string from $A$ do the same and so on. You might
-like to think about what this definition means in case $A$ or $B$ is the empty set.
-We also need to define
-the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
-\begin{center}
-\begin{tabular}{rcl}
-$A^0$ & $\dn$ & $\{[\,]\}$ \\
-$A^{n+1}$ & $\dn$ & $A @ A^n$\\
-\end{tabular}
-\end{center}
-\noindent
-Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
-of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
-\[
-A^* \dn \bigcup_{0\le n} A^n
-\]
-\noindent
-This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one
-copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore
-have
-\[
-A^* = \{"", "a", "aa", "aaa", \ldots\}
-\]
-\noindent
-Be aware that these operations sometimes have quite non-intuitive properties, for example
-\begin{center}
-\begin{tabular}{@{}ccc@{}}
-\begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$A \cup \varnothing$ & $=$ & $A$\\
-$A \cup A$ & $=$ & $A$\\
-$A \cup B$ & $=$ & $B \cup A$\\
-\end{tabular} &
-\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$A @ B$ & $\not =$ & $B @ A$\\
-$A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
-$A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
-\end{tabular} &
-\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
-$\varnothing^*$ & $=$ & $\{""\}$\\
-$\{""\}^*$ & $=$ & $\{""\}$\\
-$A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
-\end{tabular}
-\end{tabular}
-\end{center}
-\bigskip
-\subsection*{My Fascination for Regular Expressions}
 \end{document}
 %%% Local Variables:

changeset 243	8d5aaf5b0031
parent 242	35104ee14f87
child 244	771042ac7c3f