Other Means of Enforcing Integrity
As I’ve already mentioned, explicit constraints are not the only means of enforcing data
integrity. Data types are constraints as well; they constrain with type-defi ned constraints and
with sets of operations allowed. An attribute is constrained with its data type. You can also
defi ne whether a column of a table allows NULLs. Finally, the defi nitions of tables constrain
as well: if you don’t have a place to insert a value, you cannot insert it. I will explain this a bit
more in the normalization section of this chapter.
You cannot implement all business rules by using declarative means. Some constraints are
too complex, and some span a database boundary. A foreign key, for example, is limited
to associating tables in the same database only. Some constraints have to be implemented
programmatically. You can put your constraining code in a client application, in the middle
tier, in the data access layer, in stored procedures in a database, or anywhere you have some
code. However, if you want your RDBMS to enforce complex constraints automatically, you
have to use triggers.
Triggers are special stored procedures that an RDBMS executes, or fi res, automatically. You
can use Data Modifi cation Language (DML) triggers to enforce data modifi cation rules and
Data Defi nition Language (DDL) triggers to enforce schema modifi cation rules. Triggers can
fi re before or after the statement that is modifying the state of a database. SQL Server 2008
supports two kinds of DML triggers: INSTEAD OF and AFTER triggers; only one kind of DDL
trigger is supported: the AFTER. INSTEAD OF triggers are not actually ANSI-standard BEFORE
triggers; they do fi re before the statement, but they also intercept the statement, and then
you can do whatever you want in the body of the trigger. If you want the statement to
execute, you have to write it explicitly in the body of the trigger.
In theory, you should always be able to use a view instead of a base relation. However, not all
views are updatable. For example, a view can summarize some attributes of a base table; an
RDBMS doesn’t know how to distribute a value from a single row from a view over multiple
base rows. INSTEAD OF triggers are especially meant for making views updatable.
SQL Server 2008 also has a built-in XML system type. The XML type enforces some integrity
rules by itself: it allows well-formed XML only. In addition, you can validate XML values
against a predefi ned schema from a schema collection you create inside a SQL Server
database. Details of triggers and XML validations are beyond scope of this chapter; for more,
please refer to Inside Microsoft SQL Server 2008: T-SQL Programming.
You can also use some elements of a database that don’t really enforce data integrity but
instead help users insert correct values. Defaults can help insert a value when it is not
explicitly listed in the INSERT statement. SQL Server 2008 has also a Timestamp type; SQL
Server inserts and updates values of this type automatically and guarantees that values in
110 Inside Microsoft SQL Server 2008: T-SQL Querying
columns of this type are unique across a database. The IDENTITY property of a column can
help you insert sequential numbers.
One important thing you need to know is the order in which the system enforces constraints. You
probably noticed that I switched from discussing a general (and theoretical) implementation to
a SQL Server 2008–specifi c implementation. The details of constraints are quite system specifi c,
and it seems more appropriate to switch to the system that this book is about—namely, Microsoft
SQL Server 2008. Therefore, the order of execution in SQL Server is as follows:
1. Schema is checked (whether an update is valid for the table schema).
2. Data types are checked.
3. INSTEAD OF triggers fi re instead of the actual statement.
4. Default constraints are applied.
5. Nullability is checked.
6. Primary Key and Unique constraints are checked.
7. Foreign Key and Check constraints are enforced.
8. Statement is executed.
9. AFTER triggers fi re.
What this order tells you is that declarative constraints are enforced before the actual
statement, and they prevent improper updates, while AFTER triggers fi re after the statement,
and you have to roll back an improper modifi cation discovered by the statement’s AFTER
trigger. This means that using declarative constraints is more effi cient than using AFTER
triggers, and you should opt for using declarative constraints whenever possible. Don’t forget
another advantage in using declarative constraints: they can help in query optimization.
The Good, the Bad, and the . . . Unknown!
The last question I want to touch on regarding data integrity is whether you should allow
NULLs in your database. In an ideal world, your database should represent true propositions
only; if something is NULL and you do not know what that NULL means, you cannot say it is
true. Therefore, from a strict point of view, you should not allow any NULLs.
However, in the real world, you always have some missing information, at least temporarily.
In addition, you really can experience Russell’s Paradox, as described in Chapter 2. In addition
to the theoretical description, I’d like to offer an example I found in Fermat’s Last Theorem
by Simon Singh (HarperPerennial, 2005), showing Russell’s Paradox in real life. This is the
problem of the meticulous librarian.
This library has two kinds of catalogs (of whatever you want); some list themselves in
references, and some don’t. The librarian wants to make two new catalogs: one that lists all
catalogs that do list themselves and one that lists all catalogs that do not list themselves.
Chapter 3 The Relational Model 111
The problem is with the latter catalog: should it list itself? If it does list itself, by defi nition
it should not be listed. If it does not list itself, by defi nition it should be listed. Imagine you
have to insert these two catalogs in a database, and in a table describing catalogs, you have
an attribute that is a fl ag showing whether a catalog lists itself. What would you insert in this
attribute for the catalog that lists all catalogs that do not list themselves? I think that NULL is
quite all right, showing that you cannot have anything meaningful there.
Of course, in real life, you will encounter missing information because of many reasons other
than Russell’s Paradox. Nevertheless, you have to fi nd a way to deal with missing information.
ANSI standard prescribes and SQL Server implements NULLs for denoting missing values.
Note that NULL is not a value; it is just a marker. NULL doesn’t even have the privilege to be
equal to itself. Some authors (Date, Pascal) strictly forbid NULLs, others explicitly allow them
(Codd), and others (Halpin) do not discuss them—they just show how to model and use
them. Which is correct?
If NULLs were not allowed, you’d still have to implement some special values denoting
missing information. The advantage of this approach is that you could use standard
Boolean operators in your queries, and there would be no need for special operators that
handle NULLs. The disadvantage is that there is no single, standard, special value accepted
worldwide. In addition, a single special value would not be suffi cient; we would actually
need one for each data type. Using NULLs means using a standard that is already accepted;
however, it also means introducing three-valued logic, where not true is not the same as false.
Three-valued logic makes queries more complicated.
After considering many pros and cons, my personal conclusion is that NULLs are here to stay,
and they are implemented by all major RDBMS; therefore, I prefer using them to inventing
special values. You’ll learn a lot about writing effi cient three-valued logic queries in this
book. Nevertheless, some NULLs can be avoided—namely, NULLs that are there because
an attribute is not applicable for a particular tuple of a relation. This is a matter of design. A
good schema constrains—in other words, excludes—NULLs that represent “not applicable.”
Therefore, the time has come to defi ne a good schema!
As I’ve already mentioned, explicit constraints are not the only means of enforcing data
integrity. Data types are constraints as well; they constrain with type-defi ned constraints and
with sets of operations allowed. An attribute is constrained with its data type. You can also
defi ne whether a column of a table allows NULLs. Finally, the defi nitions of tables constrain
as well: if you don’t have a place to insert a value, you cannot insert it. I will explain this a bit
more in the normalization section of this chapter.
You cannot implement all business rules by using declarative means. Some constraints are
too complex, and some span a database boundary. A foreign key, for example, is limited
to associating tables in the same database only. Some constraints have to be implemented
programmatically. You can put your constraining code in a client application, in the middle
tier, in the data access layer, in stored procedures in a database, or anywhere you have some
code. However, if you want your RDBMS to enforce complex constraints automatically, you
have to use triggers.
Triggers are special stored procedures that an RDBMS executes, or fi res, automatically. You
can use Data Modifi cation Language (DML) triggers to enforce data modifi cation rules and
Data Defi nition Language (DDL) triggers to enforce schema modifi cation rules. Triggers can
fi re before or after the statement that is modifying the state of a database. SQL Server 2008
supports two kinds of DML triggers: INSTEAD OF and AFTER triggers; only one kind of DDL
trigger is supported: the AFTER. INSTEAD OF triggers are not actually ANSI-standard BEFORE
triggers; they do fi re before the statement, but they also intercept the statement, and then
you can do whatever you want in the body of the trigger. If you want the statement to
execute, you have to write it explicitly in the body of the trigger.
In theory, you should always be able to use a view instead of a base relation. However, not all
views are updatable. For example, a view can summarize some attributes of a base table; an
RDBMS doesn’t know how to distribute a value from a single row from a view over multiple
base rows. INSTEAD OF triggers are especially meant for making views updatable.
SQL Server 2008 also has a built-in XML system type. The XML type enforces some integrity
rules by itself: it allows well-formed XML only. In addition, you can validate XML values
against a predefi ned schema from a schema collection you create inside a SQL Server
database. Details of triggers and XML validations are beyond scope of this chapter; for more,
please refer to Inside Microsoft SQL Server 2008: T-SQL Programming.
You can also use some elements of a database that don’t really enforce data integrity but
instead help users insert correct values. Defaults can help insert a value when it is not
explicitly listed in the INSERT statement. SQL Server 2008 has also a Timestamp type; SQL
Server inserts and updates values of this type automatically and guarantees that values in
110 Inside Microsoft SQL Server 2008: T-SQL Querying
columns of this type are unique across a database. The IDENTITY property of a column can
help you insert sequential numbers.
One important thing you need to know is the order in which the system enforces constraints. You
probably noticed that I switched from discussing a general (and theoretical) implementation to
a SQL Server 2008–specifi c implementation. The details of constraints are quite system specifi c,
and it seems more appropriate to switch to the system that this book is about—namely, Microsoft
SQL Server 2008. Therefore, the order of execution in SQL Server is as follows:
1. Schema is checked (whether an update is valid for the table schema).
2. Data types are checked.
3. INSTEAD OF triggers fi re instead of the actual statement.
4. Default constraints are applied.
5. Nullability is checked.
6. Primary Key and Unique constraints are checked.
7. Foreign Key and Check constraints are enforced.
8. Statement is executed.
9. AFTER triggers fi re.
What this order tells you is that declarative constraints are enforced before the actual
statement, and they prevent improper updates, while AFTER triggers fi re after the statement,
and you have to roll back an improper modifi cation discovered by the statement’s AFTER
trigger. This means that using declarative constraints is more effi cient than using AFTER
triggers, and you should opt for using declarative constraints whenever possible. Don’t forget
another advantage in using declarative constraints: they can help in query optimization.
The Good, the Bad, and the . . . Unknown!
The last question I want to touch on regarding data integrity is whether you should allow
NULLs in your database. In an ideal world, your database should represent true propositions
only; if something is NULL and you do not know what that NULL means, you cannot say it is
true. Therefore, from a strict point of view, you should not allow any NULLs.
However, in the real world, you always have some missing information, at least temporarily.
In addition, you really can experience Russell’s Paradox, as described in Chapter 2. In addition
to the theoretical description, I’d like to offer an example I found in Fermat’s Last Theorem
by Simon Singh (HarperPerennial, 2005), showing Russell’s Paradox in real life. This is the
problem of the meticulous librarian.
This library has two kinds of catalogs (of whatever you want); some list themselves in
references, and some don’t. The librarian wants to make two new catalogs: one that lists all
catalogs that do list themselves and one that lists all catalogs that do not list themselves.
Chapter 3 The Relational Model 111
The problem is with the latter catalog: should it list itself? If it does list itself, by defi nition
it should not be listed. If it does not list itself, by defi nition it should be listed. Imagine you
have to insert these two catalogs in a database, and in a table describing catalogs, you have
an attribute that is a fl ag showing whether a catalog lists itself. What would you insert in this
attribute for the catalog that lists all catalogs that do not list themselves? I think that NULL is
quite all right, showing that you cannot have anything meaningful there.
Of course, in real life, you will encounter missing information because of many reasons other
than Russell’s Paradox. Nevertheless, you have to fi nd a way to deal with missing information.
ANSI standard prescribes and SQL Server implements NULLs for denoting missing values.
Note that NULL is not a value; it is just a marker. NULL doesn’t even have the privilege to be
equal to itself. Some authors (Date, Pascal) strictly forbid NULLs, others explicitly allow them
(Codd), and others (Halpin) do not discuss them—they just show how to model and use
them. Which is correct?
If NULLs were not allowed, you’d still have to implement some special values denoting
missing information. The advantage of this approach is that you could use standard
Boolean operators in your queries, and there would be no need for special operators that
handle NULLs. The disadvantage is that there is no single, standard, special value accepted
worldwide. In addition, a single special value would not be suffi cient; we would actually
need one for each data type. Using NULLs means using a standard that is already accepted;
however, it also means introducing three-valued logic, where not true is not the same as false.
Three-valued logic makes queries more complicated.
After considering many pros and cons, my personal conclusion is that NULLs are here to stay,
and they are implemented by all major RDBMS; therefore, I prefer using them to inventing
special values. You’ll learn a lot about writing effi cient three-valued logic queries in this
book. Nevertheless, some NULLs can be avoided—namely, NULLs that are there because
an attribute is not applicable for a particular tuple of a relation. This is a matter of design. A
good schema constrains—in other words, excludes—NULLs that represent “not applicable.”
Therefore, the time has come to defi ne a good schema!
No comments:
Post a Comment