23 |
25 |
24 \noindent |
26 \noindent |
25 The important point is that we can always decompose strings. For example we often consider the |
27 The important point is that we can always decompose strings. For example we often consider the |
26 first character of a string, say $h$, and the ``rest'' of a string {\it "ello"}. |
28 first character of a string, say $h$, and the ``rest'' of a string {\it "ello"}. |
27 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list |
29 There are also some subtleties with the empty string, sometimes written as {\it ""} or as the empty list |
28 of characters $[\,]$. |
30 of characters $[\,]$. Two strings, say $s_1$ and $s_2$, can be \emph{concatenated} which is written |
|
31 as $s_1 @ s_2$. For example if we have the two strings {\it "foo"} and {\it "bar"}, their concatenation |
|
32 gives {\it "foobar"}. |
29 |
33 |
30 We often need to talk about sets of strings. For example the set of all strings |
34 We often need to talk about sets of strings. For example the set of all strings |
31 |
35 |
32 \[ |
36 \[ |
33 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
37 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\} |
76 is not, since the top-level-domains must be of length of at least two. Note that there is |
80 is not, since the top-level-domains must be of length of at least two. Note that there is |
77 the convention that uppercase letters are treated in email-addresses as if they were |
81 the convention that uppercase letters are treated in email-addresses as if they were |
78 lower-case. |
82 lower-case. |
79 \bigskip |
83 \bigskip |
80 |
84 |
|
85 Before we expand on the topic of regular expressions, let us review some operations on |
|
86 sets. We will use capital letters, such as $A$, $B$ and so on, to stand for sets of strings. |
|
87 The union of two sets is written as usual as $A \cup B$. We also need to define the |
|
88 \emph{concatenation} of two sets (of strings). This can be defined as |
|
89 |
|
90 \[ |
|
91 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \} |
|
92 \] |
|
93 |
|
94 \noindent |
|
95 which essentially means take the first string from the set $A$ and concatenate it with every |
|
96 string in the set $B$, then take the second string from $A$ and so on. We also need to define |
|
97 the power of a set, written as $A^n$. This is defined inductively as follows |
|
98 |
|
99 \begin{center} |
|
100 \begin{tabular}{rcl} |
|
101 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
|
102 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
|
103 \end{tabular} |
|
104 \end{center} |
|
105 |
|
106 \noindent |
|
107 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union |
|
108 of every power of $A^n$ for every $n\ge 0$. The mathematical notation for this operation is |
|
109 |
|
110 \[ |
|
111 A^* \dn \bigcup_{0\le n} A^n |
|
112 \] |
|
113 |
|
114 \noindent |
|
115 This means the star of a set $A$ contains always the empty string, one copy of every string in $A$, two |
|
116 copies and so on. In case $A=\{"a\}$ we have |
|
117 |
|
118 \[ |
|
119 A^* = \{"", "a", "aa", "aaa", \ldots\} |
|
120 \] |
|
121 |
|
122 \noindent |
|
123 Be aware that these operations sometimes have non-intuitive properties, for example |
|
124 |
|
125 \begin{center} |
|
126 \begin{tabular}{ccc} |
|
127 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
128 $A \cup \varnothing$ & $=$ & $A$\\ |
|
129 $A \cup A$ & $=$ & $A$\\ |
|
130 $A \cup B$ & $=$ & $B \cup A$\\ |
|
131 \end{tabular} & |
|
132 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
133 $A @ B$ & $\not =$ & $B @ A$\\ |
|
134 $A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ |
|
135 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
|
136 \end{tabular} & |
|
137 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
138 $\varnothing^*$ & $=$ & $\{""\}$\\ |
|
139 $\{""\}^*$ & $=$ & $\{""\}$\\ |
|
140 $A^1$ & $=$ & $A$\\ |
|
141 \end{tabular} |
|
142 \end{tabular} |
|
143 \end{center} |
|
144 \bigskip |
|
145 |
81 \noindent |
146 \noindent |
82 \emph{Regular expressions} are meant for conveniently describing languages...at least languages |
147 \emph{Regular expressions} are meant for conveniently describing languages...at least languages |
83 we are interested in in Computer Science. For example there is no convenient regular expression |
148 we are interested in in Computer Science. For example there is no convenient regular expression |
84 for describing the English language short of enumerating all English words/strings like in a dictionary. |
149 for describing the English language short of enumerating all English words/strings like in a dictionary. |
85 But they seem useful for describing all permitted email addresses, as seen above. |
150 But they seem useful for describing all permitted email addresses, as seen above. |
96 & $\mid$ & $r^*$ & star (zero or more)\\ |
161 & $\mid$ & $r^*$ & star (zero or more)\\ |
97 \end{tabular} |
162 \end{tabular} |
98 \end{center} |
163 \end{center} |
99 |
164 |
100 \noindent |
165 \noindent |
101 There are some subtleties you should be aware of. First, we will use parentheses to disambiguate |
166 There are some subtleties you should be aware of. The letter $c$ stands for any character from the |
|
167 alphabet. Second, we will use parentheses to disambiguate |
102 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. |
168 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from $r_1 + (r_2)^*$. |
103 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
169 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
104 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, |
170 $r_2$. We should also write $(r_1 + r_2) + r_3$ which is a regular expression different from $r_1 + (r_2 + r_3)$, |
105 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
171 but in case of $+$ and $\cdot$ we actually do not care and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
106 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
172 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
111 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
177 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
112 \] |
178 \] |
113 |
179 |
114 \noindent |
180 \noindent |
115 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then |
181 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then |
116 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. |
182 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if |
117 |
183 we want to specify the regular expression for the string {\it "hello"} we should write |
118 |
184 |
|
185 \[ |
|
186 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
|
187 \] |
|
188 |
|
189 \noindent |
|
190 but often just write {\it hello}. |
|
191 |
|
192 Another source of confusion might be that we use the term \emph{regular expressions} for the ones used in ``theory'' |
|
193 and the ones in ``practice''. In this course by default we refer to the regular expressions defined by the grammar above. |
|
194 In ``practice'' we often use $r^+$ for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ for an optional regular |
|
195 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are mere convenience |
|
196 as they can be seen as shorthand for |
|
197 |
|
198 \begin{center} |
|
199 \begin{tabular}{rcl} |
|
200 $r^+$ & $\mapsto$ & $r\cdot r^*$\\ |
|
201 $r^?$ & $\mapsto$ & $\epsilon + r$\\ |
|
202 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\ |
|
203 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\ |
|
204 \end{tabular} |
|
205 \end{center} |
|
206 |
|
207 \noindent |
|
208 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular |
|
209 expression is supposed to stand for every string \emph{not} match by a regular expression. We will write |
|
210 such regular expressions as $~r$. While being ``convenience'' it is often not so clear what the shorthand for |
|
211 these kind of regular expressions is. |
|
212 |
|
213 So far we have only considered informally what the \emph{meaning} of a regular expression is. |
|
214 Formally we associate with every regular expression a set of strings which are matched by this |
|
215 regular expression. This can be formally defined as |
|
216 |
|
217 \begin{center} |
|
218 \begin{tabular}{rcl} |
|
219 $L(\varnothing)$ & $\dn$ & $\{\,\}$\\ |
|
220 $L(\epsilon)$ & $\dn$ & $\{""\}$\\ |
|
221 $L(c)$ & $\dn$ & $\{"c"\}$\\ |
|
222 $L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\ |
|
223 $L(r_1 \cdot r_2)$ & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\ |
|
224 $L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\ |
|
225 \end{tabular} |
|
226 \end{center} |
|
227 |
|
228 \noindent |
|
229 This means we can now precisely state what the meaning, for example, of the regular expression |
|
230 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
|
231 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$. Similarly if we have the choice |
|
232 $a + b$, the meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
|
233 be matched by this choice. |
|
234 |
|
235 The point of this definition is that we can now precisely specify when a string is matched by a |
|
236 regular expression, namely a string, say $s$, is matched by a regular expression, say $r$, if |
|
237 and only if $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
|
238 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
|
239 if $s \not\in L(r)$. We leave this for the next lecture. |
119 \end{document} |
240 \end{document} |
120 |
241 |
121 %%% Local Variables: |
242 %%% Local Variables: |
122 %%% mode: latex |
243 %%% mode: latex |
123 %%% TeX-master: t |
244 %%% TeX-master: t |