# HG changeset patch # User Christian Urban # Date 1410752568 -3600 # Node ID ce767ca2324458bea2d25701f2aeb98ac52ce30a # Parent 84b4bc6e8554b5b34092f29b650c69dfaf92933d updated diff -r 84b4bc6e8554 -r ce767ca23244 handouts/ho01.pdf Binary file handouts/ho01.pdf has changed diff -r 84b4bc6e8554 -r ce767ca23244 handouts/ho01.tex --- a/handouts/ho01.tex Mon Sep 15 04:16:28 2014 +0100 +++ b/handouts/ho01.tex Mon Sep 15 04:42:48 2014 +0100 @@ -82,7 +82,7 @@ Many programming language offer libraries that can be used to validate such strings against regular expressions. Also there -are some common, and I am sure very familiar, ways how to +are some common, and I am sure very familiar, ways of how to construct regular expressions. For example in Scala we have: \begin{center} @@ -261,9 +261,9 @@ careful with our overloading of the star: assuming you have read the handout about our basic mathematical notation, you will see that in the context of languages (sets of strings) -the star stands for an operation on languages. While here -$r^*$ stands for a regular expression, which is different from -the operation on sets is defined as +the star stands for an operation on languages. Here $r^*$ +stands for a regular expression, which is different from the +operation on sets is defined as \[ A^* \dn \bigcup_{0\le n} A^n @@ -277,8 +277,8 @@ will be helpful. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$. The former means roughly zero or more times $r_1$ or $r_2$, while the latter -means $r_1$ or zero or more times $r_2$. This will turn out -are two different patterns, which match in general different +means $r_1$ or zero or more times $r_2$. This will turn out to +be two different patterns, which match in general different strings. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$, but in case of $+$ and $\cdot$ we actually do not care about the @@ -304,7 +304,7 @@ should write \[ -{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} +h \cdot e \cdot l \cdot l \cdot o \] \noindent @@ -313,7 +313,7 @@ If you prefer to think in terms of the implementation of regular expressions in Scala, the constructors and classes relate as follows\footnote{More about Scala is -in the handout about a crash-course on Scala.} +in the handout about A Crash-Course on Scala.} \begin{center} \begin{tabular}{rcl} @@ -378,8 +378,7 @@ e \cdot l \cdot l \cdot o$ is, namely \[ -L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot -{\it o}) = \{"hello"\} +L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\} \] \noindent This is expected because this regular expression @@ -393,7 +392,7 @@ \noindent You can now also see why we do not make a difference between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$\ldots they are not the same regular -expression, but have the same meaning. +expression, but they have the same meaning. For example \begin{eqnarray*} L((r_1 + r_2) + r_3) & = & L(r_1 + r_2) \cup L(r_3)\\ @@ -426,17 +425,18 @@ r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) \] -\noindent That means two regular expressions are equivalent if -they match the same set of strings. Therefore we do not really -distinguish between the different regular expression $(r_1 + -r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are -equivalent. I leave you to the question whether +\noindent That means two regular expressions are said to be +equivalent if they match the same set of strings. Therefore we +do not really distinguish between the different regular +expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, +because they are equivalent. I leave you to the question +whether \[ \varnothing^* \equiv \epsilon^* \] -\noindent holds. Such equivalences will be important for out +\noindent holds. Such equivalences will be important for our matching algorithm, because we can use them to simplify regular expressions. @@ -458,7 +458,7 @@ To understand my fascination nowadays with regular expressions, you need to know that my main scientific interest -for the last 14 years has been with in theorem provers. I am a +for the last 14 years has been with theorem provers. I am a core developer of one of them.\footnote{\url{http://isabelle.in.tum.de}} Theorem provers are systems in which you can formally reason about @@ -480,7 +480,7 @@ expression matchers are buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} With my PhD student Fahad Ausaf I am currently working on -proving the correctness for one such algorithm that was +proving the correctness for one such matcher that was proposed by Sulzmann and Lu in 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an attractive results since we will be able to prove that the @@ -502,8 +502,8 @@ shown via automata. Even sombody who has written a 700+-page book\footnote{\url{http://goo.gl/fD0eHx}} on regular exprssions did not know better. Well, we showed it can also be -done with regular expressions only. What a feeling if you -are an outsider to the subject! +done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} +What a feeling if you are an outsider to the subject! To conclude: Despite my early ignorance about regular expressions, I find them now quite interesting. They have a @@ -511,15 +511,15 @@ importance (remember the shocking runtime of the regular expression matchers in Python and Ruby in some instances). People who are not very familiar with the mathematical -background get them consistently wrong. The hope is that we -can do better in the future---for example by proving that the -algorithms actually satisfy their specification and that the -corresponding implementations do not contain any bugs. We are -close, but not yet quite there. +background of regular expressions get them consistently wrong. +The hope is that we can do better in the future---for example +by proving that the algorithms actually satisfy their +specification and that the corresponding implementations do +not contain any bugs. We are close, but not yet quite there. Despite my fascination, I am also happy to admit that regular expressions have their shortcomings. There are some well-known -``theoretical shortcomings'', for example recognising strings +``theoretical'' shortcomings, for example recognising strings of the form $a^{n}b^{n}$. I am not so bothered by them. What I am bothered about is when regular expressions are in the way of practical programming. For example, it turns out that the @@ -528,7 +528,7 @@ being touted as something every computer scientist should know about). The W3 Consortium (which standardises the Web) proposes to use the following, already more complicated -regular expressions: +regular expressions for email addresses: {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* @@ -541,11 +541,12 @@ others. Not a good situation to be in. A regular expression that is claimed to be closer to the standard is shown in Figure~\ref{monster}. Whether this claim is true or not, I -would not know---the only thing I can say it is a monstrosity. -However, this might actually be an argument against the -standard, rather than against regular expressions. Still it -is good to know that some tasks in text processing just -cannot be achieved by using regular expressions. +would not know---the only thing I can say to this regular +expression is it is a monstrosity. However, this might +actually be an argument against the RFC standard, rather than +against regular expressions. Still it is good to know that +some tasks in text processing just cannot be achieved by using +regular expressions. \begin{figure}[p]