106 \[ |
106 \[ |
107 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) |
107 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}}) |
108 \] |
108 \] |
109 |
109 |
110 \noindent |
110 \noindent |
111 The reason is that for example before the $@$-sign there can be any string you want assuming it |
111 One reason is that before the $@$-sign there can be any string you want assuming it |
112 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many |
112 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many |
113 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean |
113 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean |
114 that every string is an email address. For example |
114 that every string is an email address. For example |
115 |
115 |
116 \[ |
116 \[ |
117 \text{\it foo}@\text{\it bar}.\text{\it c} |
117 "\text{\it foo}@\text{\it bar}.\text{\it c}" |
118 \] |
118 \] |
119 |
119 |
120 \noindent |
120 \noindent |
121 is not, because the top-level-domains must be of length of at least two. (Note that there is |
121 is not, because the top-level-domains must be of length of at least two. (Note that there is |
122 the convention that uppercase letters are treated in email-addresses as if they were |
122 the convention that uppercase letters are treated in email-addresses as if they were |
136 which essentially means take the first string from the set $A$ and concatenate it with every |
136 which essentially means take the first string from the set $A$ and concatenate it with every |
137 string in the set $B$, then take the second string from $A$ do the same and so on. You might |
137 string in the set $B$, then take the second string from $A$ do the same and so on. You might |
138 like to think about what this definition means in case $A$ or $B$ is the empty set. |
138 like to think about what this definition means in case $A$ or $B$ is the empty set. |
139 |
139 |
140 We also need to define |
140 We also need to define |
141 the power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows |
141 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows |
142 |
142 |
143 \begin{center} |
143 \begin{center} |
144 \begin{tabular}{rcl} |
144 \begin{tabular}{rcl} |
145 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
145 $A^0$ & $\dn$ & $\{[\,]\}$ \\ |
146 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
146 $A^{n+1}$ & $\dn$ & $A @ A^n$\\ |
166 |
166 |
167 \noindent |
167 \noindent |
168 Be aware that these operations sometimes have quite non-intuitive properties, for example |
168 Be aware that these operations sometimes have quite non-intuitive properties, for example |
169 |
169 |
170 \begin{center} |
170 \begin{center} |
171 \begin{tabular}{ccc} |
171 \begin{tabular}{@{}ccc@{}} |
172 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
172 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
173 $A \cup \varnothing$ & $=$ & $A$\\ |
173 $A \cup \varnothing$ & $=$ & $A$\\ |
174 $A \cup A$ & $=$ & $A$\\ |
174 $A \cup A$ & $=$ & $A$\\ |
175 $A \cup B$ & $=$ & $B \cup A$\\ |
175 $A \cup B$ & $=$ & $B \cup A$\\ |
176 \end{tabular} & |
176 \end{tabular} & |
177 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
177 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
178 $A @ B$ & $\not =$ & $B @ A$\\ |
178 $A @ B$ & $\not =$ & $B @ A$\\ |
179 $A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ |
179 $A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\ |
180 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
180 $A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\ |
181 \end{tabular} & |
181 \end{tabular} & |
182 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} |
182 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
183 $\varnothing^*$ & $=$ & $\{""\}$\\ |
183 $\varnothing^*$ & $=$ & $\{""\}$\\ |
184 $\{""\}^*$ & $=$ & $\{""\}$\\ |
184 $\{""\}^*$ & $=$ & $\{""\}$\\ |
185 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\ |
185 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\ |
186 \end{tabular} |
186 \end{tabular} |
187 \end{tabular} |
187 \end{tabular} |
206 & $\mid$ & $r^*$ & star (zero or more)\\ |
206 & $\mid$ & $r^*$ & star (zero or more)\\ |
207 \end{tabular} |
207 \end{tabular} |
208 \end{center} |
208 \end{center} |
209 |
209 |
210 \noindent |
210 \noindent |
211 Because we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from the |
211 Because we overload our notation, there are some subtleties you should be aware of. First, the letter |
|
212 $c$ stands for any character from the |
212 alphabet at hand. Second, we will use parentheses to disambiguate |
213 alphabet at hand. Second, we will use parentheses to disambiguate |
213 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$. |
214 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$. |
214 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
215 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times |
215 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$, |
216 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$, |
216 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
217 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$, |
217 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
218 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice |
218 $r_1 + r_2$ is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even |
219 $r_1 + r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even |
219 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form |
220 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form |
220 |
221 |
221 \[ |
222 \[ |
222 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
223 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots |
223 \] |
224 \] |
224 |
225 |
225 \noindent |
226 \noindent |
226 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then |
227 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then |
227 a domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly if |
228 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if |
228 we want to specify the regular expression for the string {\it "hello"} we should write |
229 we want to specify the regular expression for the string {\it "hello"} we should write |
229 |
230 |
230 \[ |
231 \[ |
231 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
232 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o} |
232 \] |
233 \] |
233 |
234 |
234 \noindent |
235 \noindent |
235 but often just write {\it hello}. |
236 but often just write {\it hello}. |
236 |
237 |
237 Another source of confusion might arise from the fact that we use the term \emph{regular expressions} for the ones used in ``theory'' |
238 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory'' |
238 and also the ones in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. |
239 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. |
239 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular |
240 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular |
240 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience |
241 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience |
241 as they can be seen as shorthand for |
242 as they can be seen as shorthand for |
242 |
243 |
243 \begin{center} |
244 \begin{center} |
252 |
253 |
253 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular |
254 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular |
254 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write |
255 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write |
255 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for |
256 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for |
256 these kind of not-regular-expressions is. Try to write down the regular expression which can match any |
257 these kind of not-regular-expressions is. Try to write down the regular expression which can match any |
257 string except {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include |
258 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include |
258 $\sim{}r$ in the definition or regular expressions. Whenever we do so, we will state it explicitly. |
259 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly. |
259 |
260 |
260 So far we have only considered informally what the \emph{meaning} of a regular expression is. |
261 So far we have only considered informally what the \emph{meaning} of a regular expression is. |
261 To do so more formally we will associate with every regular expression a set of strings |
262 To do so more formally we will associate with every regular expression a set of strings |
262 that is supposed to be are matched by this |
263 that is supposed to be matched by this |
263 regular expression. This can be defined recursively as follows |
264 regular expression. This can be defined recursively as follows |
264 |
265 |
265 \begin{center} |
266 \begin{center} |
266 \begin{tabular}{rcl} |
267 \begin{tabular}{rcl} |
267 $L(\varnothing)$ & $\dn$ & $\{\,\}$\\ |
268 $L(\varnothing)$ & $\dn$ & $\{\,\}$\\ |
272 $L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\ |
273 $L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\ |
273 \end{tabular} |
274 \end{tabular} |
274 \end{center} |
275 \end{center} |
275 |
276 |
276 \noindent |
277 \noindent |
277 This means we can now precisely state what the meaning, for example, of the regular expression |
278 As a result we can now precisely state what the meaning, for example, of the regular expression |
278 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
279 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely |
279 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
280 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the |
280 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
281 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly |
281 be matched by this choice. You can now also conclude why we do not make a difference |
282 be matched by this choice. You can now also see why we do not make a difference |
282 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
283 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they |
283 are not the same regular expression, but have the same meaning. |
284 are not the same regular expression, but have the same meaning. |
284 |
285 |
285 The point of the definition of $L$ is that we can now precisely specify when a string $s$ is matched by a |
286 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a |
286 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
287 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and |
287 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
288 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no}, |
288 if $s \not\in L(r)$. We leave this for the next lecture. |
289 if $s \not\in L(r)$. We leave this for the next lecture. |
289 |
290 |
290 \begin{figure}[p] |
291 \begin{figure}[p] |