46 strings that can be used in a sentence of the English language. French would be a |
46 strings that can be used in a sentence of the English language. French would be a |
47 different set of string, and so on. In the context of this course, a language might |
47 different set of string, and so on. In the context of this course, a language might |
48 not necessarily make sense from a natural language perspective. For example |
48 not necessarily make sense from a natural language perspective. For example |
49 the set of all strings from above is a language, as is the empty set (of strings). The |
49 the set of all strings from above is a language, as is the empty set (of strings). The |
50 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
50 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a |
51 difference between the empty set $\{\,\}$ and the set that contains the empty string $\{\text{""}\}$. |
51 difference between the empty set, or empty language, and the set, or language, that |
|
52 contains the empty string $\{\text{""}\}$: the former has no elements, the latter has one |
|
53 element. |
52 |
54 |
|
55 As seen there are languages which contain infinitely many strings, like the set of all strings. |
|
56 The ``natural'' languages English, French and so on contain many but only finitely many |
|
57 strings (the ones listed in a good dictionary). It might be therefore surprising that the |
|
58 language consisting of all email addresses is infinite if we assume it is defined by the |
|
59 regular expression\footnote{See \url{http://goo.gl/5LoVX7}} |
|
60 |
|
61 \[ |
|
62 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) |
|
63 \] |
|
64 |
|
65 \noindent |
|
66 The reason is that for example before the $@$-sign there can be any string you want if it |
|
67 is made up from letters, digits, underscore, dot and hyphen---there are infinitely many |
|
68 of those. Similarly the string after the $@$-sign can be any string. This does not mean |
|
69 that every string is an email address. For example |
|
70 |
|
71 \[ |
|
72 \text{\it foo}@\text{\it bar}.\text{\it c} |
|
73 \] |
|
74 |
|
75 \noindent |
|
76 is not, since the top-level-domains must be of length of at least two. Note that there is |
|
77 the convention that uppercase letters are treated in email-addresses as if they were |
|
78 lower-case. |
|
79 \bigskip |
|
80 |
|
81 \noindent |
|
82 \emph{Regular expressions} are meant for conveniently describing languages...at least languages |
|
83 we are interested in in Computer Science. For example there is no convenient regular expression |
|
84 for describing the English language short of enumerating all English words/strings like in a dictionary. |
|
85 But they seem useful for describing all permitted email addresses, as seen above. |
|
86 |
|
87 Regular expressions are given by the following grammar: |
|
88 |
|
89 \begin{center} |
|
90 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
|
91 $r$ & $::=$ & $\varnothing$ & null\\ |
|
92 & $\mid$ & $\epsilon$ & empty string / "" / []\\ |
|
93 & $\mid$ & $c$ & character\\ |
|
94 & $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
|
95 & $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
|
96 & $\mid$ & $r^*$ & star (zero or more)\\ |
|
97 \end{tabular} |
|
98 \end{center} |
|
99 |
|
100 \noindent |
|
101 There are some subtleties you should be aware of. First, we will use parentheses to disambiguate |
|
102 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. |
|
103 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
|
104 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, |
|
105 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
|
106 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
|
107 $r_1 + r_2$ is written as $r_1\mid{}r_2$ |
53 \end{document} |
108 \end{document} |
54 |
109 |
55 %%% Local Variables: |
110 %%% Local Variables: |
56 %%% mode: latex |
111 %%% mode: latex |
57 %%% TeX-master: t |
112 %%% TeX-master: t |