diff -r 1f4e81950ab4 -r 9476086849ad handouts/ho01.tex
--- a/handouts/ho01.tex Mon Nov 14 15:50:42 2016 +0000
+++ b/handouts/ho01.tex Sat Jan 07 14:52:26 2017 +0000
@@ -32,26 +32,27 @@
 
 \begin{document}
 
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
 
 \section*{Handout 1}
 
 This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for
-a particular string, like $abc$ in a large text we can use the
-Knuth-Morris-Pratt algorithm, which is currently the most
-efficient general string search algorithm. But often we do
-\emph{not} just look for a particular string, but for string
-patterns. For example in program code we need to identify what
-are the keywords, what are the identifiers etc. A pattern for
-identifiers could be stated as: they start with a letter,
-followed by zero or more letters, numbers and underscores.
-Also often we face the problem that we are given a string (for
-example some user input) and want to know whether it matches a
-particular pattern---be it an email address, for example. In
-this way we can exclude user input that would otherwise have
-nasty effects on our program (crashing it or making it go into
-an infinite loop, if not worse).\smallskip
+compilers, dictionaries, DNA-data and so on. When looking for a
+particular string, like $abc$, in a large text we can use the
+Knuth-Morris-Pratt algorithm, which is currently the most efficient
+general string search algorithm. But often we do \emph{not} just look
+for a particular string, but for string patterns. For example in
+program code we need to identify what are the keywords (if, then,
+while, etc.), what are the identifiers (variable names). A pattern for
+identifiers could be stated as: they start with a letter, followed by
+zero or more letters, numbers and underscores. Also often we face the
+problem that we are given a string (for example some user input) and
+want to know whether it matches a particular pattern---be it an email
+address, for example. In this way we can exclude user input that would
+otherwise have nasty effects on our program (crashing it or making it
+go into an infinite loop, if not worse). The point is that the fast
+Knuth-Morris-Pratt algorithm for strings is not good enough for such
+string patterns.\smallskip
 
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
@@ -72,7 +73,7 @@
 \noindent where the first part, the user name, matches one or more
 lowercase letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
 and hyphens. The \pcode{+} at the end of the brackets ensures
-the ``one or more''. Then comes the \pcode{@}-sign, followed
+the ``one or more''. Then comes the email \pcode{@}-sign, followed
 by the domain name which must be one or more lowercase
 letters, digits, underscores, dots or hyphens. Note there
 cannot be an underscore in the domain name. Finally there must
@@ -96,11 +97,10 @@
 disposable.style.email.with+symbol@example.com
 \end{lstlisting}
 
-\noindent according to the regular expression we specified in
-\eqref{email}. Whether this is intended or not is a different
-question (the second email above is actually an acceptable
-email address acording to the RFC 5322 standard for email
-addresses).
+\noindent according to the regular expression we specified in line
+\eqref{email} above. Whether this is intended or not is a different
+question (the second email above is actually an acceptable email
+address according to the RFC 5322 standard for email addresses).
 
 As mentioned above, identifiers, or variables, in program code
 are often required to satisfy the constraints that they start
@@ -178,14 +178,14 @@
 
 \subsection*{Why Study Regular Expressions?}
 
-Regular expressions were introduced by Kleene in the 1950ies
-and they have been object of intense study since then. They
-are nowadays pretty much ubiquitous in computer science. There
-are many libraries implementing regular expressions. I am sure
-you have come across them before (remember PRA?). Why on earth
-then is there any interest in studying them again in depth in
-this module? Well, one answer is in the following two graphs about
-regular expression matching in Python, Ruby and Java.
+Regular expressions were introduced by Kleene in the 1950s and they
+have been an object of intense study since then. They are nowadays pretty
+much ubiquitous in computer science. There are many libraries
+implementing regular expressions. I am sure you have come across them
+before (remember the PRA module?). Why on earth then is there any
+interest in studying them again in depth in this module? Well, one
+answer is in the following two graphs about regular expression
+matching in Python, Ruby and Java.
 
 \begin{center}
 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
@@ -423,9 +423,9 @@
 \]
 
 \noindent
-which is much less readable than \eqref{email}. Similarly for
-the regular expression that matches the string $hello$ we
-should write
+which is much less readable than the regular expression in
+\eqref{email}. Similarly for the regular expression that matches the
+string $hello$ we should write
 
 \[
 h \cdot e \cdot l \cdot l \cdot o
@@ -460,8 +460,8 @@
 
 \begin{center}
 \begin{tabular}{rcl}
-$r+$ & $\mapsto$ & $r\cdot r^*$\\
-$r?$ & $\mapsto$ & $\ONE + r$\\
+$r^+$ & $\mapsto$ & $r\cdot r^*$\\
+$r^?$ & $\mapsto$ & $\ONE + r$\\
 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
 \end{tabular}
@@ -569,52 +569,49 @@
 
 \subsection*{My Fascination for Regular Expressions}
 
-Up until a few years ago I was not really interested in
-regular expressions. They have been studied for the last 60
-years (by smarter people than me)---surely nothing new can be
-found out about them. I even have the vague recollection that
-I did not quite understand them during my undergraduate study. If I remember
-correctly,\footnote{That was really a long time ago.} I got
-utterly confused about $\ONE$ (which my lecturer wrote as
-$\epsilon$) and the empty string.\footnote{Obviously the
-lecturer must have been bad.} Since my then, I have used
-regular expressions for implementing lexers and parsers as I
-have always been interested in all kinds of programming
-languages and compilers, which invariably need regular
-expressions in some form or shape.
+Up until a few years ago I was not really interested in regular
+expressions. They have been studied for the last 60 years (by smarter
+people than me)---surely nothing new can be found out about them. I
+even have the vague recollection that I did not quite understand them
+during my undergraduate study. If I remember correctly,\footnote{That
+  was really a long time ago.} I got utterly confused about $\ONE$
+(which my lecturer wrote as $\epsilon$) and the empty string (which he
+also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
+  been bad ;o)} Since then, I have used regular expressions for
+implementing lexers and parsers as I have always been interested in
+all kinds of programming languages and compilers, which invariably
+need regular expressions in some form or shape.
 
 To understand my fascination \emph{nowadays} with regular
 expressions, you need to know that my main scientific interest
-for the last 14 years has been with theorem provers. I am a
+for the last 17 years has been with theorem provers. I am a
 core developer of one of
 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem provers
 are systems in which you can formally reason about mathematical
 concepts, but also about programs. In this way
-theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
+theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have
 proved already their value in a number of cases (even in
 terms of hard cash), but they are still clunky and difficult
 to use by average programmers.
 
-Anyway, in about 2011 I came across the notion of
-\defn{derivatives of regular expressions}. This notion allows
-one to do almost all calculations in regular language theory
-on the level of regular expressions, not needing any automata (you will
-see we only touch briefly on automata in lecture 3).
-This is crucial because automata are graphs and it is rather
-difficult to reason about graphs in theorem provers. In
-contrast, reasoning about regular expressions is easy-peasy in
-theorem provers. Is this important? I think yes, because
-according to Kuklewicz nearly all POSIX-based regular
-expression matchers are
+Anyway, in about 2011 I came across the notion of \defn{derivatives of
+  regular expressions}. This notion allows one to do almost all
+calculations with regular expressions on the level of regular
+expressions, not needing any automata (you will see we only touch
+briefly on automata in lecture 3). Automata are usually the main
+object of study in formal language courses. The avoidance of automata
+is crucial because automata are graphs and it is rather difficult to
+reason about graphs in theorem provers. In contrast, reasoning about
+regular expressions is easy-peasy in theorem provers. Is this
+important? I think yes, because according to Kuklewicz nearly all
+POSIX-based regular expression matchers are
 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
-With my PhD student Fahad Ausaf I proved
-the correctness for one such matcher that was
-proposed by Sulzmann and Lu in
-2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
-prove that the machine code(!)
-that implements this code efficiently is correct also. Writing
-programs in this way does not leave any room for potential
-errors or bugs. How nice!
+With my PhD student Fahad Ausaf I proved the correctness for one such
+matcher that was proposed by Sulzmann and Lu in
+2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
+the machine code(!) that implements this code efficiently is correct
+also. Writing programs in this way does not leave any room for
+potential errors or bugs. How nice!
 
 What also helped with my fascination with regular expressions
 is that we could indeed find out new things about them that
@@ -629,42 +626,43 @@
 book\footnote{\url{http://goo.gl/fD0eHx}} on regular exprssions did
 not know better. Well, we showed it can also be done with regular
 expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}}
-What a feeling if you are an outsider to the subject!
+What a feeling when you are an outsider to the subject!
 
-To conclude: Despite my early ignorance about regular
-expressions, I find them now very interesting. They have a
-beautiful mathematical theory behind them, which can be
-sometimes quite deep and which sometimes contains hidden snares. They have
-practical importance (remember the shocking runtime of the
-regular expression matchers in Python, Ruby and Java in some
-instances and the problems in Stack Exchange and the Atom editor).
-People who are not very familiar with the
-mathematical background of regular expressions get them
-consistently wrong. The hope is that we can do better in the
-future---for example by proving that the algorithms actually
-satisfy their specification and that the corresponding
-implementations do not contain any bugs. We are close, but not
-yet quite there.
+To conclude: Despite my early ignorance about regular expressions, I
+find them now very interesting. They have a beautiful mathematical
+theory behind them, which can be sometimes quite deep and which
+sometimes contains hidden snares. They have practical importance
+(remember the shocking runtime of the regular expression matchers in
+Python, Ruby and Java in some instances and the problems in Stack
+Exchange and the Atom editor). People who are not very familiar with
+the mathematical background of regular expressions get them
+consistently wrong (surprising, given that they are supposed to be a core
+skill for computer scientists). The hope is that we can do better in
+the future---for example by proving that the algorithms actually
+satisfy their specification and that the corresponding implementations
+do not contain any bugs. We are close, but not yet quite there.
 
 Notwithstanding my fascination, I am also happy to admit that
 regular expressions have their shortcomings. There are some well-known
-``theoretical'' shortcomings, for example recognising strings
-of the form $a^{n}b^{n}$. I am not so bothered by them. What I
-am bothered about is when regular expressions are in the way
-of practical programming. For example, it turns out that the
-regular expression for email addresses shown in \eqref{email}
-is hopelessly inadequate for recognising all of them (despite
-being touted as something every computer scientist should know
-about). The W3 Consortium (which standardises the Web)
-proposes to use the following, already more complicated
-regular expressions for email addresses:
+``theoretical'' shortcomings, for example recognising strings of the
+form $a^{n}b^{n}$ is not possible with regular expressions. This means,
+for example, that recognising whether parentheses are well-nested
+is impossible with (basic) regular expressions. I am not so bothered
+by these shortcomings. What I am bothered about is when regular
+expressions are in the way of practical programming. For example, it
+turns out that the regular expression for email addresses shown in
+\eqref{email} is hopelessly inadequate for recognising all of them
+(despite being touted as something every computer scientist should
+know about). The W3 Consortium (which standardises the Web) proposes
+to use the following, already more complicated regular expression for
+email addresses:
 
 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
 \end{lstlisting}}
 
 \noindent But they admit that by using this regular expression
-they wilfully violate the RFC 5322 standard which specifies
+they wilfully violate the RFC 5322 standard, which specifies
 the syntax of email addresses. With their proposed regular
 expression they are too strict in some cases and too lax in
 others. Not a good situation to be in. A regular expression
@@ -683,7 +681,8 @@
 \noindent which explains some of the crazier parts of email
 addresses. Still it is good to know that some tasks in text
 processing just cannot be achieved by using regular
-expressions.
+expressions. But for what we want to use them for (lexing), they are
+pretty good.
 
 \begin{figure}[p]