80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
80 are \pcode{x}, \pcode{foo}, \pcode{foo_bar_1}, \pcode{A_very_42_long_object_name}, |
81 but not \pcode{_i} and also not \pcode{4you}. |
81 but not \pcode{_i} and also not \pcode{4you}. |
82 |
82 |
83 Many programming language offer libraries that can be used to |
83 Many programming language offer libraries that can be used to |
84 validate such strings against regular expressions. Also there |
84 validate such strings against regular expressions. Also there |
85 are some common, and I am sure very familiar, ways how to |
85 are some common, and I am sure very familiar, ways of how to |
86 construct regular expressions. For example in Scala we have: |
86 construct regular expressions. For example in Scala we have: |
87 |
87 |
88 \begin{center} |
88 \begin{center} |
89 \begin{tabular}{lp{9cm}} |
89 \begin{tabular}{lp{9cm}} |
90 \pcode{re*} & matches 0 or more occurrences of preceding |
90 \pcode{re*} & matches 0 or more occurrences of preceding |
259 the context of regular expressions, it is a particular pattern |
259 the context of regular expressions, it is a particular pattern |
260 that can match the specified character. You should also be |
260 that can match the specified character. You should also be |
261 careful with our overloading of the star: assuming you have |
261 careful with our overloading of the star: assuming you have |
262 read the handout about our basic mathematical notation, you |
262 read the handout about our basic mathematical notation, you |
263 will see that in the context of languages (sets of strings) |
263 will see that in the context of languages (sets of strings) |
264 the star stands for an operation on languages. While here |
264 the star stands for an operation on languages. Here $r^*$ |
265 $r^*$ stands for a regular expression, which is different from |
265 stands for a regular expression, which is different from the |
266 the operation on sets is defined as |
266 operation on sets is defined as |
267 |
267 |
268 \[ |
268 \[ |
269 A^* \dn \bigcup_{0\le n} A^n |
269 A^* \dn \bigcup_{0\le n} A^n |
270 \] |
270 \] |
271 |
271 |
275 structure of regular expressions is always clear. But for |
275 structure of regular expressions is always clear. But for |
276 writing them down in a more mathematical fashion, parentheses |
276 writing them down in a more mathematical fashion, parentheses |
277 will be helpful. For example we will write $(r_1 + r_2)^*$, |
277 will be helpful. For example we will write $(r_1 + r_2)^*$, |
278 which is different from, say $r_1 + (r_2)^*$. The former means |
278 which is different from, say $r_1 + (r_2)^*$. The former means |
279 roughly zero or more times $r_1$ or $r_2$, while the latter |
279 roughly zero or more times $r_1$ or $r_2$, while the latter |
280 means $r_1$ or zero or more times $r_2$. This will turn out |
280 means $r_1$ or zero or more times $r_2$. This will turn out to |
281 are two different patterns, which match in general different |
281 be two different patterns, which match in general different |
282 strings. We should also write $(r_1 + r_2) + r_3$, which is |
282 strings. We should also write $(r_1 + r_2) + r_3$, which is |
283 different from the regular expression $r_1 + (r_2 + r_3)$, but |
283 different from the regular expression $r_1 + (r_2 + r_3)$, but |
284 in case of $+$ and $\cdot$ we actually do not care about the |
284 in case of $+$ and $\cdot$ we actually do not care about the |
285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
285 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 |
286 \cdot r_3$, respectively. The reasons for this will become |
286 \cdot r_3$, respectively. The reasons for this will become |
302 which is much less readable than \eqref{email}. Similarly for |
302 which is much less readable than \eqref{email}. Similarly for |
303 the regular expression that matches the string $hello$ we |
303 the regular expression that matches the string $hello$ we |
304 should write |
304 should write |
305 |
305 |
306 \[ |
306 \[ |
307 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
307 h \cdot e \cdot l \cdot l \cdot o |
308 \] |
308 \] |
309 |
309 |
310 \noindent |
310 \noindent |
311 but often just write {\it hello}. |
311 but often just write {\it hello}. |
312 |
312 |
313 If you prefer to think in terms of the implementation |
313 If you prefer to think in terms of the implementation |
314 of regular expressions in Scala, the constructors and |
314 of regular expressions in Scala, the constructors and |
315 classes relate as follows\footnote{More about Scala is |
315 classes relate as follows\footnote{More about Scala is |
316 in the handout about a crash-course on Scala.} |
316 in the handout about A Crash-Course on Scala.} |
317 |
317 |
318 \begin{center} |
318 \begin{center} |
319 \begin{tabular}{rcl} |
319 \begin{tabular}{rcl} |
320 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
320 $\varnothing$ & $\mapsto$ & \texttt{NULL}\\ |
321 $\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
321 $\epsilon$ & $\mapsto$ & \texttt{EMPTY}\\ |
376 \noindent As a result we can now precisely state what the |
376 \noindent As a result we can now precisely state what the |
377 meaning, for example, of the regular expression $h \cdot |
377 meaning, for example, of the regular expression $h \cdot |
378 e \cdot l \cdot l \cdot o$ is, namely |
378 e \cdot l \cdot l \cdot o$ is, namely |
379 |
379 |
380 \[ |
380 \[ |
381 L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot |
381 L(h \cdot e \cdot l \cdot l \cdot o) = \{"hello"\} |
382 {\it o}) = \{"hello"\} |
|
383 \] |
382 \] |
384 |
383 |
385 \noindent This is expected because this regular expression |
384 \noindent This is expected because this regular expression |
386 is only supposed to match the ``$hello$''-string. Similarly if |
385 is only supposed to match the ``$hello$''-string. Similarly if |
387 we have the choice-regular-expression $a + b$, its meaning is |
386 we have the choice-regular-expression $a + b$, its meaning is |
424 |
423 |
425 \[ |
424 \[ |
426 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
425 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
427 \] |
426 \] |
428 |
427 |
429 \noindent That means two regular expressions are equivalent if |
428 \noindent That means two regular expressions are said to be |
430 they match the same set of strings. Therefore we do not really |
429 equivalent if they match the same set of strings. Therefore we |
431 distinguish between the different regular expression $(r_1 + |
430 do not really distinguish between the different regular |
432 r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because they are |
431 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
433 equivalent. I leave you to the question whether |
432 because they are equivalent. I leave you to the question |
|
433 whether |
434 |
434 |
435 \[ |
435 \[ |
436 \varnothing^* \equiv \epsilon^* |
436 \varnothing^* \equiv \epsilon^* |
437 \] |
437 \] |
438 |
438 |
439 \noindent holds. Such equivalences will be important for out |
439 \noindent holds. Such equivalences will be important for our |
440 matching algorithm, because we can use them to simplify |
440 matching algorithm, because we can use them to simplify |
441 regular expressions. |
441 regular expressions. |
442 |
442 |
443 \subsection*{My Fascination for Regular Expressions} |
443 \subsection*{My Fascination for Regular Expressions} |
444 |
444 |
456 compilers, which invariably need regular expression in some |
456 compilers, which invariably need regular expression in some |
457 form or shape. |
457 form or shape. |
458 |
458 |
459 To understand my fascination nowadays with regular |
459 To understand my fascination nowadays with regular |
460 expressions, you need to know that my main scientific interest |
460 expressions, you need to know that my main scientific interest |
461 for the last 14 years has been with in theorem provers. I am a |
461 for the last 14 years has been with theorem provers. I am a |
462 core developer of one of |
462 core developer of one of |
463 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
463 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem |
464 provers are systems in which you can formally reason about |
464 provers are systems in which you can formally reason about |
465 mathematical concepts, but also about programs. In this way |
465 mathematical concepts, but also about programs. In this way |
466 they can help with writing bug-free code. Theorem provers have |
466 they can help with writing bug-free code. Theorem provers have |
478 theorem provers. Is this important? I think yes, because |
478 theorem provers. Is this important? I think yes, because |
479 according to Kuklewicz nearly all POSIX-based regular |
479 according to Kuklewicz nearly all POSIX-based regular |
480 expression matchers are |
480 expression matchers are |
481 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
481 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}} |
482 With my PhD student Fahad Ausaf I am currently working on |
482 With my PhD student Fahad Ausaf I am currently working on |
483 proving the correctness for one such algorithm that was |
483 proving the correctness for one such matcher that was |
484 proposed by Sulzmann and Lu in |
484 proposed by Sulzmann and Lu in |
485 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an |
485 2014.\footnote{\url{http://goo.gl/bz0eHp}} This would be an |
486 attractive results since we will be able to prove that the |
486 attractive results since we will be able to prove that the |
487 algorithm is really correct, but also that the machine code(!) |
487 algorithm is really correct, but also that the machine code(!) |
488 that implements this code efficiently is correct. Writing |
488 that implements this code efficiently is correct. Writing |
500 complementation, something which Gasarch in his |
500 complementation, something which Gasarch in his |
501 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
501 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be |
502 shown via automata. Even sombody who has written a 700+-page |
502 shown via automata. Even sombody who has written a 700+-page |
503 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
503 book\footnote{\url{http://goo.gl/fD0eHx}} on regular |
504 exprssions did not know better. Well, we showed it can also be |
504 exprssions did not know better. Well, we showed it can also be |
505 done with regular expressions only. What a feeling if you |
505 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}} |
506 are an outsider to the subject! |
506 What a feeling if you are an outsider to the subject! |
507 |
507 |
508 To conclude: Despite my early ignorance about regular |
508 To conclude: Despite my early ignorance about regular |
509 expressions, I find them now quite interesting. They have a |
509 expressions, I find them now quite interesting. They have a |
510 beautiful mathematical theory behind them. They have practical |
510 beautiful mathematical theory behind them. They have practical |
511 importance (remember the shocking runtime of the regular |
511 importance (remember the shocking runtime of the regular |
512 expression matchers in Python and Ruby in some instances). |
512 expression matchers in Python and Ruby in some instances). |
513 People who are not very familiar with the mathematical |
513 People who are not very familiar with the mathematical |
514 background get them consistently wrong. The hope is that we |
514 background of regular expressions get them consistently wrong. |
515 can do better in the future---for example by proving that the |
515 The hope is that we can do better in the future---for example |
516 algorithms actually satisfy their specification and that the |
516 by proving that the algorithms actually satisfy their |
517 corresponding implementations do not contain any bugs. We are |
517 specification and that the corresponding implementations do |
518 close, but not yet quite there. |
518 not contain any bugs. We are close, but not yet quite there. |
519 |
519 |
520 Despite my fascination, I am also happy to admit that regular |
520 Despite my fascination, I am also happy to admit that regular |
521 expressions have their shortcomings. There are some well-known |
521 expressions have their shortcomings. There are some well-known |
522 ``theoretical shortcomings'', for example recognising strings |
522 ``theoretical'' shortcomings, for example recognising strings |
523 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
523 of the form $a^{n}b^{n}$. I am not so bothered by them. What I |
524 am bothered about is when regular expressions are in the way |
524 am bothered about is when regular expressions are in the way |
525 of practical programming. For example, it turns out that the |
525 of practical programming. For example, it turns out that the |
526 regular expression for email addresses shown in \eqref{email} |
526 regular expression for email addresses shown in \eqref{email} |
527 is hopelessly inadequate for recognising all of them (despite |
527 is hopelessly inadequate for recognising all of them (despite |
528 being touted as something every computer scientist should know |
528 being touted as something every computer scientist should know |
529 about). The W3 Consortium (which standardises the Web) |
529 about). The W3 Consortium (which standardises the Web) |
530 proposes to use the following, already more complicated |
530 proposes to use the following, already more complicated |
531 regular expressions: |
531 regular expressions for email addresses: |
532 |
532 |
533 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
533 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none] |
534 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
534 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* |
535 \end{lstlisting}} |
535 \end{lstlisting}} |
536 |
536 |
539 the syntax of email addresses. With their proposed regular |
539 the syntax of email addresses. With their proposed regular |
540 expression they are too strict in some cases and too lax in |
540 expression they are too strict in some cases and too lax in |
541 others. Not a good situation to be in. A regular expression |
541 others. Not a good situation to be in. A regular expression |
542 that is claimed to be closer to the standard is shown in |
542 that is claimed to be closer to the standard is shown in |
543 Figure~\ref{monster}. Whether this claim is true or not, I |
543 Figure~\ref{monster}. Whether this claim is true or not, I |
544 would not know---the only thing I can say it is a monstrosity. |
544 would not know---the only thing I can say to this regular |
545 However, this might actually be an argument against the |
545 expression is it is a monstrosity. However, this might |
546 standard, rather than against regular expressions. Still it |
546 actually be an argument against the RFC standard, rather than |
547 is good to know that some tasks in text processing just |
547 against regular expressions. Still it is good to know that |
548 cannot be achieved by using regular expressions. |
548 some tasks in text processing just cannot be achieved by using |
|
549 regular expressions. |
549 |
550 |
550 |
551 |
551 \begin{figure}[p] |
552 \begin{figure}[p] |
552 \lstinputlisting{../progs/crawler1.scala} |
553 \lstinputlisting{../progs/crawler1.scala} |
553 \caption{The Scala code for a simple web-crawler that checks |
554 \caption{The Scala code for a simple web-crawler that checks |