handouts/ho01.tex
changeset 399 5c1fbb39c93e
parent 398 c8ce95067c1a
child 403 564f7584eff1
equal deleted inserted replaced
398:c8ce95067c1a 399:5c1fbb39c93e
     8 %%https://jex.im/regulex/
     8 %%https://jex.im/regulex/
     9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
     9 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
    10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    10 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
    11 
    11 
    12 \begin{document}
    12 \begin{document}
    13 \fnote{\copyright{} Christian Urban, 2014, 2015}
    13 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
    14 
    14 
    15 \section*{Handout 1}
    15 \section*{Handout 1}
    16 
    16 
    17 This module is about text processing, be it for web-crawlers,
    17 This module is about text processing, be it for web-crawlers,
    18 compilers, dictionaries, DNA-data and so on. When looking for
    18 compilers, dictionaries, DNA-data and so on. When looking for
    24 are the keywords, what are the identifiers etc. A pattern for
    24 are the keywords, what are the identifiers etc. A pattern for
    25 identifiers could be stated as: they start with a letter,
    25 identifiers could be stated as: they start with a letter,
    26 followed by zero or more letters, numbers and underscores.
    26 followed by zero or more letters, numbers and underscores.
    27 Also often we face the problem that we are given a string (for
    27 Also often we face the problem that we are given a string (for
    28 example some user input) and want to know whether it matches a
    28 example some user input) and want to know whether it matches a
    29 particular pattern. In this way we can, for example, exclude
    29 particular pattern---be it an email address, for example. In
    30 user input that would otherwise have nasty effects on our
    30 this way we can exclude user input that would otherwise have
    31 program (crashing it or making it go into an infinite loop, if
    31 nasty effects on our program (crashing it or making it go into
    32 not worse).
    32 an infinite loop, if not worse).
    33 
    33 
    34 \defn{Regular expressions} help with conveniently specifying
    34 \defn{Regular expressions} help with conveniently specifying
    35 such patterns. The idea behind regular expressions is that
    35 such patterns. The idea behind regular expressions is that
    36 they are a simple method for describing languages (or sets of
    36 they are a simple method for describing languages (or sets of
    37 strings)\ldots at least languages we are interested in in
    37 strings)\ldots at least languages we are interested in in
    38 computer science. For example there is no convenient regular
    38 computer science. For example there is no convenient regular
    39 expression for describing the English language short of
    39 expression for describing the English language short of
    40 enumerating all English words. But they seem useful for
    40 enumerating all English words. But they seem useful for
    41 describing for example email addresses.\footnote{See ``8
    41 describing for example simple email addresses.\footnote{See
    42 Regular Expressions You Should Know''
    42 ``8 Regular Expressions You Should Know''
    43 \url{http://goo.gl/5LoVX7}} Consider the following regular
    43 \url{http://goo.gl/5LoVX7}} Consider the following regular
    44 expression
    44 expression
    45 
    45 
    46 \begin{equation}\label{email}
    46 \begin{equation}\label{email}
    47 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    47 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
    65 other.email-with-dash@example.edu
    65 other.email-with-dash@example.edu
    66 \end{lstlisting}
    66 \end{lstlisting}
    67 
    67 
    68 
    68 
    69 \noindent
    69 \noindent
    70 But for example the following two do not:
    70 But for example the following two do not
    71 
    71 
    72 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    72 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
    73 user@localserver
    73 user@localserver
    74 disposable.style.email.with+symbol@example.com
    74 disposable.style.email.with+symbol@example.com
    75 \end{lstlisting}
    75 \end{lstlisting}
    76 
    76 
       
    77 \noindent according to the regular expression we specified in
       
    78 \eqref{email}. Whether this is intended or not is a different
       
    79 question (the second email above is actually an acceptable
       
    80 email address acording to the RFC 5322 standard for email
       
    81 addresses).
       
    82  
    77 As mentioned above, identifiers, or variables, in program code
    83 As mentioned above, identifiers, or variables, in program code
    78 are often required to satisfy the constraints that they start
    84 are often required to satisfy the constraints that they start
    79 with a letter and then can be followed by zero or more letters
    85 with a letter and then can be followed by zero or more letters
    80 or numbers and also can include underscores, but not as the
    86 or numbers and also can include underscores, but not as the
    81 first character. Such identifiers can be recognised with the
    87 first character. Such identifiers can be recognised with the
   159 module? Well, one answer is in the following graph about
   165 module? Well, one answer is in the following graph about
   160 regular expression matching in Python and in Ruby.
   166 regular expression matching in Python and in Ruby.
   161 
   167 
   162 \begin{center}
   168 \begin{center}
   163 \begin{tikzpicture}
   169 \begin{tikzpicture}
   164 \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
   170 \begin{axis}[
       
   171     xlabel={strings of {\tt a}s},
       
   172     ylabel={time in secs},
   165     enlargelimits=false,
   173     enlargelimits=false,
   166     xtick={0,5,...,30},
   174     xtick={0,5,...,30},
   167     xmax=33,
   175     xmax=33,
   168     ymax=35,
   176     ymax=35,
   169     ytick={0,5,...,30},
   177     ytick={0,5,...,30},
   214 seconds, while the second version will even be able to process
   222 seconds, while the second version will even be able to process
   215 up to 12,000 in less than 10(!) seconds, see the graph below:
   223 up to 12,000 in less than 10(!) seconds, see the graph below:
   216 
   224 
   217 \begin{center}
   225 \begin{center}
   218 \begin{tikzpicture}
   226 \begin{tikzpicture}
   219   \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
   227   \begin{axis}[
       
   228     xlabel={strings of \pcode{a}s},
       
   229     ylabel={time in secs},
   220     enlargelimits=false,
   230     enlargelimits=false,
   221     xtick={0,3000,...,12000},
   231     xtick={0,3000,...,12000},
   222     xmax=12500,
   232     xmax=12500,
   223     ymax=35,
   233     ymax=35,
   224     ytick={0,5,...,30},
   234     ytick={0,5,...,30},
   232 \end{tikzpicture}
   242 \end{tikzpicture}
   233 \end{center}
   243 \end{center}
   234 
   244 
   235 \subsection*{Basic Regular Expressions}
   245 \subsection*{Basic Regular Expressions}
   236 
   246 
   237 The regular expressions shown above for Scala, we
   247 The regular expressions shown earlier for Scala, we
   238 will call \emph{extended regular expressions}. The ones we
   248 will call \emph{extended regular expressions}. The ones we
   239 will mainly study in this module are \emph{basic regular
   249 will mainly study in this module are \emph{basic regular
   240 expressions}, which by convention we will just call
   250 expressions}, which by convention we will just call
   241 \emph{regular expressions}, if it is clear what we mean. The
   251 \emph{regular expressions}, if it is clear what we mean. The
   242 attraction of (basic) regular expressions is that many
   252 attraction of (basic) regular expressions is that many
   244 (Basic) regular expressions are defined by the following
   254 (Basic) regular expressions are defined by the following
   245 grammar:
   255 grammar:
   246 
   256 
   247 \begin{center}
   257 \begin{center}
   248 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   258 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
   249   $r$ & $::=$ &   $\varnothing$         & null\\
   259   $r$ & $::=$ &    $\ZERO$          & null language\\
   250         & $\mid$ & $\epsilon$           & empty string / \texttt{""} / []\\
   260         & $\mid$ & $\ONE$           & empty string / \texttt{""} / []\\
   251         & $\mid$ & $c$                  & single character\\
   261         & $\mid$ & $c$                  & single character\\
   252         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
   262         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
   253         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   263         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
   254         & $\mid$ & $r^*$                & star (zero or more)\\
   264         & $\mid$ & $r^*$                & star (zero or more)\\
   255   \end{tabular}
   265   \end{tabular}
   256 \end{center}
   266 \end{center}
   257 
   267 
   258 \noindent Because we overload our notation, there are some
   268 \noindent Because we overload our notation, there are some
   259 subtleties you should be aware of. When regular expressions
   269 subtleties you should be aware of. When regular expressions
   260 are referred to then $\varnothing$ does not stand for the
   270 are referred to then $\ZERO$ (in bold font) does not stand for
   261 empty set: rather it is a particular pattern that does not
   271 the number zero: rather it is a particular pattern that does
   262 match any string. Similarly, in the context of regular
   272 not match any string. Similarly, in the context of regular
   263 expressions, $\epsilon$ does not stand for the empty string
   273 expressions, $\ONE$ does not stand for the number one but for
   264 (as in many places in the literature) but for a regular
   274 a regular expression that matches the empty string. The letter
   265 expression that matches the empty string. The letter $c$
   275 $c$ stands for any character from the alphabet at hand. Again
   266 stands for any character from the alphabet at hand. Again in
   276 in the context of regular expressions, it is a particular
   267 the context of regular expressions, it is a particular pattern
   277 pattern that can match the specified character. You should
   268 that can match the specified character. You should also be
   278 also be careful with our overloading of the star: assuming you
   269 careful with our overloading of the star: assuming you have
   279 have read the handout about our basic mathematical notation,
   270 read the handout about our basic mathematical notation, you
   280 you will see that in the context of languages (sets of
   271 will see that in the context of languages (sets of strings)
   281 strings) the star stands for an operation on languages. Here
   272 the star stands for an operation on languages. Here $r^*$
   282 $r^*$ stands for a regular expression, which is different from
   273 stands for a regular expression, which is different from the
   283 the operation on sets is defined as
   274 operation on sets is defined as
   284 
   275 
   285 \[
   276 \[
   286 A\star\dn \bigcup_{0\le n} A^n
   277 A^* \dn \bigcup_{0\le n} A^n
       
   278 \]
   287 \]
   279 
   288 
   280 \noindent 
   289 \noindent 
   281 Note that this expands to
   290 Note that this expands to
   282 
   291 
   283 \[
   292 \[
   284 A^* \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
   293 A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
   285 \]
   294 \]
   286 
   295 
   287 \noindent which is equivalent to
   296 \noindent which is equivalent to
   288 
   297 
   289 \[
   298 \[
   290 A^* \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
   299 A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
   291 \]
   300 \]
   292 
   301 
   293 \noindent 
   302 \noindent 
   294 Remember that $A^0$ is always the set containing the empty 
   303 Remember that $A^0$ is always the set containing the empty 
   295 string.
   304 string.
   307 strings. We should also write $(r_1 + r_2) + r_3$, which is
   316 strings. We should also write $(r_1 + r_2) + r_3$, which is
   308 different from the regular expression $r_1 + (r_2 + r_3)$, but
   317 different from the regular expression $r_1 + (r_2 + r_3)$, but
   309 in case of $+$ and $\cdot$ we actually do not care about the
   318 in case of $+$ and $\cdot$ we actually do not care about the
   310 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
   319 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
   311 \cdot r_3$, respectively. The reasons for this will become
   320 \cdot r_3$, respectively. The reasons for this will become
   312 clear shortly. In the literature you will often find that the
   321 clear shortly. 
   313 choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or
   322 
   314 $r_1\mid\mid{}r_2$. Also following the convention in the
   323 In the literature you will often find that the choice $r_1 +
       
   324 r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
       
   325 often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
       
   326 $\epsilon$, respectively. Following the convention in the
   315 literature, we will often omit the $\cdot$ all together. This
   327 literature, we will often omit the $\cdot$ all together. This
   316 is to make some concrete regular expressions more readable.
   328 is to make some concrete regular expressions more readable.
   317 For example the regular expression for email addresses shown
   329 For example the regular expression for email addresses shown
   318 in \eqref{email} would look like
   330 in \eqref{email} would look like
   319 
   331 
   340 classes relate as follows\footnote{More about Scala is 
   352 classes relate as follows\footnote{More about Scala is 
   341 in the handout about A Crash-Course on Scala.}
   353 in the handout about A Crash-Course on Scala.}
   342 
   354 
   343 \begin{center}
   355 \begin{center}
   344 \begin{tabular}{rcl}
   356 \begin{tabular}{rcl}
   345 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\
   357 $\ZERO$       & $\mapsto$ & \texttt{ZERO}\\
   346 $\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
   358 $\ONE$        & $\mapsto$ & \texttt{ONE}\\
   347 $c$           & $\mapsto$ & \texttt{CHAR(c)}\\
   359 $c$           & $\mapsto$ & \texttt{CHAR(c)}\\
   348 $r_1 + r_2$   & $\mapsto$ & \texttt{ALT(r1, r2)}\\
   360 $r_1 + r_2$   & $\mapsto$ & \texttt{ALT(r1, r2)}\\
   349 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
   361 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
   350 $r^*$         & $\mapsto$ & \texttt{STAR(r)}
   362 $r^*$         & $\mapsto$ & \texttt{STAR(r)}
   351 \end{tabular}
   363 \end{tabular}
   360 the former. For example we could replace
   372 the former. For example we could replace
   361 
   373 
   362 \begin{center}
   374 \begin{center}
   363 \begin{tabular}{rcl}
   375 \begin{tabular}{rcl}
   364 $r+$ & $\mapsto$ & $r\cdot r^*$\\
   376 $r+$ & $\mapsto$ & $r\cdot r^*$\\
   365 $r?$ & $\mapsto$ & $\epsilon + r$\\
   377 $r?$ & $\mapsto$ & $\ONE + r$\\
   366 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   378 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
   367 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   379 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
   368 \end{tabular}
   380 \end{tabular}
   369 \end{center}
   381 \end{center}
   370 
   382 
   386 The \defn{meaning of a regular expression} can be defined
   398 The \defn{meaning of a regular expression} can be defined
   387 by a recursive function called $L$ (for language), which
   399 by a recursive function called $L$ (for language), which
   388 is defined as follows
   400 is defined as follows
   389 
   401 
   390 \begin{center}
   402 \begin{center}
   391 \begin{tabular}{rcl}
   403 \begin{tabular}{rcll}
   392 $L(\varnothing)$  & $\dn$ & $\{\}$\\
   404 $L(\ZERO)$         & $\dn$ & $\{\}$\\
   393 $L(\epsilon)$     & $\dn$ & $\{[]\}$\\
   405 $L(\ONE)$          & $\dn$ & $\{[]\}$\\
   394 $L(c)$            & $\dn$ & $\{[c]\}$\\
   406 $L(c)$             & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
   395 $L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
   407 $L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
   396 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
   408 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
   397 $L(r^*)$           & $\dn$ & $(L(r))^*$\\
   409 $L(r^*)$           & $\dn$ & $(L(r))\star$\\
   398 \end{tabular}
   410 \end{tabular}
   399 \end{center}
   411 \end{center}
   400 
   412 
   401 \noindent As a result we can now precisely state what the
   413 \noindent As a result we can now precisely state what the
   402 meaning, for example, of the regular expression $h \cdot
   414 meaning, for example, of the regular expression $h \cdot
   441 different regular expressions that can recognise these
   453 different regular expressions that can recognise these
   442 strings. This is obvious with the regular expression $a + b$
   454 strings. This is obvious with the regular expression $a + b$
   443 which can match the strings $a$ and $b$. But also the regular
   455 which can match the strings $a$ and $b$. But also the regular
   444 expression $b + a$ would match the same strings. However,
   456 expression $b + a$ would match the same strings. However,
   445 sometimes it is not so obvious whether two regular expressions
   457 sometimes it is not so obvious whether two regular expressions
   446 match the same strings: for example do $r^*$ and $\epsilon + r
   458 match the same strings: for example do $r^*$ and $\ONE + r
   447 \cdot r^*$ match the same strings? What about $\varnothing^*$
   459 \cdot r^*$ match the same strings? What about $\ZERO^*$
   448 and $\epsilon^*$? This suggests the following relation between
   460 and $\ONE^*$? This suggests the following relation between
   449 \defn{equivalent regular expressions}: 
   461 \defn{equivalent regular expressions}: 
   450 
   462 
   451 \[
   463 \[
   452 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   464 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   453 \]
   465 \]
   458 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
   470 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
   459 because they are equivalent. I leave you to the question
   471 because they are equivalent. I leave you to the question
   460 whether
   472 whether
   461 
   473 
   462 \[
   474 \[
   463 \varnothing^* \equiv \epsilon^*
   475 \ZERO^* \equiv \ONE^*
   464 \]
   476 \]
   465 
   477 
   466 \noindent holds? Such equivalences will be important for our
   478 \noindent holds or not? Such equivalences will be important
   467 matching algorithm, because we can use them to simplify 
   479 for our matching algorithm, because we can use them to
   468 regular expressions. 
   480 simplify regular expressions, which will mean we can speed
       
   481 up the calculations. 
   469 
   482 
   470 \subsection*{My Fascination for Regular Expressions}
   483 \subsection*{My Fascination for Regular Expressions}
   471 
   484 
   472 Up until a few years ago I was not really interested in
   485 Up until a few years ago I was not really interested in
   473 regular expressions. They have been studied for the last 60
   486 regular expressions. They have been studied for the last 60
   474 years (by smarter people than me)---surely nothing new can be
   487 years (by smarter people than me)---surely nothing new can be
   475 found out about them. I even have the vague recollection that
   488 found out about them. I even have the vague recollection that
   476 I did not quite understand them during my study. If I remember
   489 I did not quite understand them during my study. If I remember
   477 correctly,\footnote{That was really a long time ago.} I got
   490 correctly,\footnote{That was really a long time ago.} I got
   478 utterly confused about $\epsilon$ and the empty
   491 utterly confused about $\ONE$ (which my lecturer wrote as
   479 string.\footnote{Obviously the lecturer must have been bad.}
   492 $\epsilon$) and the empty string.\footnote{Obviously the
   480 Since my study, I have used regular expressions for
   493 lecturer must have been bad.} Since my study, I have used
   481 implementing lexers and parsers as I have always been
   494 regular expressions for implementing lexers and parsers as I
   482 interested in all kinds of programming languages and
   495 have always been interested in all kinds of programming
   483 compilers, which invariably need regular expression in some
   496 languages and compilers, which invariably need regular
   484 form or shape. 
   497 expression in some form or shape. 
   485 
   498 
   486 To understand my fascination nowadays with regular
   499 To understand my fascination \emph{nowadays} with regular
   487 expressions, you need to know that my main scientific interest
   500 expressions, you need to know that my main scientific interest
   488 for the last 14 years has been with theorem provers. I am a
   501 for the last 14 years has been with theorem provers. I am a
   489 core developer of one of
   502 core developer of one of
   490 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   503 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
   491 provers are systems in which you can formally reason about
   504 provers are systems in which you can formally reason about
   531 exprssions did not know better. Well, we showed it can also be
   544 exprssions did not know better. Well, we showed it can also be
   532 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
   545 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} 
   533 What a feeling if you are an outsider to the subject!
   546 What a feeling if you are an outsider to the subject!
   534 
   547 
   535 To conclude: Despite my early ignorance about regular
   548 To conclude: Despite my early ignorance about regular
   536 expressions, I find them now quite interesting. They have a
   549 expressions, I find them now very interesting. They have a
   537 beautiful mathematical theory behind them. They have practical
   550 beautiful mathematical theory behind them, which can be
   538 importance (remember the shocking runtime of the regular
   551 sometimes quite deep and contain hidden snares. They have
   539 expression matchers in Python and Ruby in some instances).
   552 practical importance (remember the shocking runtime of the
   540 People who are not very familiar with the mathematical
   553 regular expression matchers in Python and Ruby in some
   541 background of regular expressions get them consistently wrong.
   554 instances). People who are not very familiar with the
   542 The hope is that we can do better in the future---for example
   555 mathematical background of regular expressions get them
   543 by proving that the algorithms actually satisfy their
   556 consistently wrong. The hope is that we can do better in the
   544 specification and that the corresponding implementations do
   557 future---for example by proving that the algorithms actually
   545 not contain any bugs. We are close, but not yet quite there.
   558 satisfy their specification and that the corresponding
       
   559 implementations do not contain any bugs. We are close, but not
       
   560 yet quite there.
   546 
   561 
   547 Notwithstanding my fascination, I am also happy to admit that regular
   562 Notwithstanding my fascination, I am also happy to admit that regular
   548 expressions have their shortcomings. There are some well-known
   563 expressions have their shortcomings. There are some well-known
   549 ``theoretical'' shortcomings, for example recognising strings
   564 ``theoretical'' shortcomings, for example recognising strings
   550 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   565 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
   569 that is claimed to be closer to the standard is shown in
   584 that is claimed to be closer to the standard is shown in
   570 Figure~\ref{monster}. Whether this claim is true or not, I
   585 Figure~\ref{monster}. Whether this claim is true or not, I
   571 would not know---the only thing I can say to this regular
   586 would not know---the only thing I can say to this regular
   572 expression is it is a monstrosity. However, this might
   587 expression is it is a monstrosity. However, this might
   573 actually be an argument against the RFC standard, rather than
   588 actually be an argument against the RFC standard, rather than
   574 against regular expressions. Still it is good to know that
   589 against regular expressions. A similar argument is made in
   575 some tasks in text processing just cannot be achieved by using
   590 
   576 regular expressions.
   591 \begin{center}
       
   592 \url{https://elliot.land/validating-an-email-address}
       
   593 \end{center}
       
   594 
       
   595 
       
   596 \noindent which explains some of the crazier parts of email
       
   597 addresses. Still it is good to know that some tasks in text
       
   598 processing just cannot be achieved by using regular
       
   599 expressions.
   577 
   600 
   578 
   601 
   579 \begin{figure}[p]
   602 \begin{figure}[p]
   580 \lstinputlisting{../progs/crawler1.scala}
   603 \lstinputlisting{../progs/crawler1.scala}
       
   604 
   581 \caption{The Scala code for a simple web-crawler that checks
   605 \caption{The Scala code for a simple web-crawler that checks
   582 for broken links in a web-page. It uses the regular expression
   606 for broken links in a web-page. It uses the regular expression
   583 \texttt{http\_pattern} in Line~15 for recognising URL-addresses.
   607 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
   584 It finds all links using the library function
   608 URL-addresses. It finds all links using the library function
   585 \texttt{findAllIn} in Line~21.\label{crawler1}}
   609 \texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
       
   610 
   586 \end{figure}
   611 \end{figure}
   587 
   612 
   588 
   613 
   589 \begin{figure}[p]
   614 \begin{figure}[p]
   590 \lstinputlisting{../progs/crawler2.scala}
   615 \lstinputlisting{../progs/crawler2.scala}
   591 
   616 
   592 \caption{A version of the web-crawler that only follows links
   617 \caption{A version of the web-crawler that only follows links
   593 in ``my'' domain---since these are the ones I am interested in
   618 in ``my'' domain---since these are the ones I am interested in
   594 to fix. It uses the regular expression \texttt{my\_urls} in
   619 to fix. It uses the regular expression \texttt{my\_urls} in
   595 Line~16 to check for my name in the links. The main change is
   620 Line~\ref{myurlline} to check for my name in the links. The
   596 in Lines~24--28 where there is a test whether URL is in ``my''
   621 main change is in
   597 domain or not.\label{crawler2}}
   622 Lines~\ref{changestartline}--\ref{changeendline} where there
       
   623 is a test whether URL is in ``my'' domain or
       
   624 not.\label{crawler2}}
   598 
   625 
   599 \end{figure}
   626 \end{figure}
   600 
   627 
   601 \begin{figure}[p]
   628 \begin{figure}[p]
   602 \lstinputlisting{../progs/crawler3.scala}
   629 \lstinputlisting{../progs/crawler3.scala}
   603 
   630 
   604 \caption{A small email harvester---whenever we download a
   631 \caption{A small email harvester---whenever we download a
   605 web-page, we also check whether it contains any email
   632 web-page, we also check whether it contains any email
   606 addresses. For this we use the regular expression
   633 addresses. For this we use the regular expression
   607 \texttt{email\_pattern} in Line~15. The main change is in Line
   634 \texttt{email\_pattern} in Line~\ref{emailline}. The main
   608 30 where all email addresses that can be found in a page are
   635 change is in Line~\ref{mainline} where all email addresses
   609 printed.\label{crawler3}}
   636 that can be found in a page are printed.\label{crawler3}}
       
   637 
   610 \end{figure}
   638 \end{figure}
   611 
   639 
   612 \begin{figure}[p]
   640 \begin{figure}[p]
   613 \tiny
   641 \tiny
   614 \begin{center}
   642 \begin{center}
   615 \begin{minipage}{0.8\textwidth}
   643 \begin{minipage}{0.8\textwidth}
   616 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   644 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
   617 \end{minipage}
   645 \end{minipage}
   618 \end{center}
   646 \end{center}
   619 
   647 
   620 \caption{Nothing that can be said\ldots\label{monster}}
   648 \caption{Nothing that can be said this\ldots\label{monster}}
   621 \end{figure}
   649 \end{figure}
   622 
   650 
   623 
   651 
   624 \end{document}
   652 \end{document}
   625 
   653