Relational Database-The Data Model
The most abstract level of a database is the data model,
the conceptual description of a problem space. Data models are
expressed in terms of entities, attributes, domains, and
relationships.
Entities
It's difficult to provide a precise formal definition of the
term entity, but the concept is intuitively quite
straightforward: an entity is anything about which the system
needs to store information.
When you begin to design your data model, compiling an initial
list of entities isn't difficult. When you (or your clients)
talk about the problem space, most of the nouns and verbs used
will be candidate entities. "Customers buy products. Employees
sell products. Suppliers sell us products." The nouns
"Customers", "Products", "Employees", and "Suppliers" are all
clearly entities.
The events represented by the verbs "buy" and "sell" are also
entities, but a couple of traps exist here. First, the verb
"sell" is used to represent two distinct events: the sale of a
product to a customer and the purchase of a product by the
company. That's fairly obvious in this example, but it's an easy
trap to fall into, particularly if you're not familiar with the
problem space.
The second gotcha is the inverse of the first: two different
verbs ("buy" in the first sentence and "sell" in the second) are
used to describe the same event, the purchase of a product by a
customer. Again, this isn't necessarily obvious unless you're
familiar with the problem space. This problem is often trickier
to track down than the first. If a client is using different
verbs to describe what appears to be the same event, they might
in fact describing different kinds of events. If the client is a
tailor, for example, "customer buys suit" and "customer orders
suit" might both result in the sale of the suit, but in the
first case it's a pret-a-porter sale (suit) and in the second
it's bespoke. These are very different processes that might need
to be modeled differently.
In addition to interviewing clients to establish a list of
entities, it's also useful to review any documents that exist in
the problem space. Input forms, reports, and procedures manuals
are all good sources of candidate entities. You must be careful
with documents, however. Printed documents have a high degree of
inertia: input forms particularly are expensive to print and
frequently don't keep up with changes to policies and
procedures. If you stumble across an entity that's never come up
in an interview, don't assume the client just forgot to mention
it. Chances are that it's a legacy item that's no longer
pertinent to the company. You'll need to check.
Once you've developed an intial list of candidate entities, it's
also useful to review them for completeness and consistency.
Again, you need to find duplicates and distinct entities that
are masquerading as the same entity. A useful tool in this
process is the concept of entity subtypes. To return to our
example of the tailor, "pret-a-porter" and "bespoke" both
represent the purchase of an item of clothing, but they're
different kinds of purchases. In other words, Sale and
Order are both subtypes of the entity Purchase.
Attributes that are common to both types of Purchase are
assigned to the supertype, in this case, Purchase, and
attributes specific to a subtype-PretAPorter or Bespoke in this
instance-are factored out to that subtype. This allows both
kinds of events to be treated as generic Purchases when that is
appropriate (as when calculating total sales) or as specific
kinds of Purchases (as when comparing subtypes).
Sometimes you might discover that the entity subtypes don't
actually have distinct attributes, in which case it's more
convenient to make TypeOfSale (or TypeOfCustomer, or
TypeOfWhatever) an attribute of the supertype rather than
modeling the subtypes as distinct entities. With tailoring
example for bespoke sales you might need to know the cloth and
color selected, whereas for pret-a-porter sales you might need
to track the garment manufacturer. In this case, you would use
subtypes to model these entities. If, however, you only need to
know that a sale was bespoke or pret-a-porter, a TypeOfSale
attribute would be simpler to implement.
Subtypes are usually mutually exclusive, but this is by no means
always the case. Consider an employee database. All employees
have certain attributes in common (hire date, department,
telephone), but only some will be salespeople (with specific
attributes for commision rates and target) and only a few will
join the company football team. There's nothing preventing a
salesperson from playing basketball, however.
Most entities model objects or events in the physical world:
customers, products or sales calls. These are concrete entities.
Entities can also model abstract concepts. The most common
example of an abstract entity is one that models the
relationship between other entities, the fact that a certain
sales representative is responsible for a certain client or that
certain student is enrolled in a certain class.
Sometimes all you need to model is the fact that a relationship
exists. Other times you'll want to store additional information
about the relationships, such as the date on which it was
established or some characteristics of the relationship. The
relationship between coyotes and wolves is competitive, that
between wolves and sheeps is predatory, and it's useful to know
this if you're planning an open-range zoo.
Whether relationships that do not have attributes ought to be
modeled as separate entities is a matter of some discussion. I
don't think anything is gained by doing so, and it complicates
the process of deriving a database schema from the data model.
However, understand that relationships are as important as
entities are in the data model.
Attributes
Your system will need to keep track of certain facts about each
entity. These facts are referred to as the entity's attributes.
If your system includes Customer entities, for example, you'll
probably want to know the names and addresses of the customers
and perhaps the businesses they're in. If you're modeling an
event such as a Service Call, you'll probably want to know who
the customer was, who made the call, when it was made, and
whether the problem was resolved.
Determining the attributes to be included in your model is a
semantic process. You must make your decision based on what the
data means and how it will be used. Let's look at one common
example: and address. Do you model the address as a single
entity (the Address) or as a set of entities (HouseNumber,
Street, City, State, ZipCode)? Most designers would tend to
automatically break the address up into a set of attributes on
the general principle that structured data is easier to
manipulate, but this is not necesarily correct and certainly not
straightforward.
Let's take for instance a local private club. It will want to
store the addresses of its members in order to print mailing
labels. Since all members live in the same city, there is no
reason to ever look at an address as anything other than a blob:
a single, multiline chunk of text.
But what about a mail-order company that does all its business
on the Internet? For sales tax purposes, the company needs to
know the states in which its customer reside. While it's
possible to extract the state from the single text field used by
the local private club, it isn't easy; so it makes sense in this
case to at least model the state as a separate attribute. What
about the rest of the address? Should be it composed of multiple
attributes, and if so, what are they? Be awareeee that while
addresses in the United States conform to a fairly standard
pattern, modeling them is probably not as simple as appears.
You might think that a set of attributes (HouseNumber, Street,
City, ZipCode) might be adequate. But then you need to deal with
appartment numbers and post office boxes. What do you do with an
address to be sent in care of someone else? And of course the
world is getting smaller, but not less complex, so what happens
when you get first customer outside the United States? Not only
do you need to know the country and adjust the zip code, but the
arrangement of the attributes might need to change. In Romania,
for example, the house number follows the street name. That's
not too bad, it's easy enough to map that when you're entering
data, but how many of your users would know that in the address
4/32 South Avenue, Salt Beach, Australia, 4/32 means Apartment
4, Number 32?
The point here is not so much that addresses are hard to model,
although they are, but rather that you can't make any
assumptions about how you should model any specific kind of
data. The complex schema that you develop for handling
international mail order is completely inappropriate for the
local private club.
Matisse is reputed to have said that a painting was finished
when nothing could be either added or subtracted. Entity design
is a bit like that. How do you know when you've reached that
point? The unfortunate answer is that you can never know for
certain. At the current state of technology, there isn't any way
to develop a provably correct database design. You can prove
that some designs have flaws, but you can't prove that any given
design doesn't. You can't, if you will, prove your innocence.
How do you tackle this problem? There are no rules, but there
are some strategies.
The first strategy is: start with the result and don't make the
design any more complex that it needs to be.
What questions does your database to answer? In our first
example, the private club, the only question was "Where do I
mail a letter to this person?", so a single attribute model was
sufficient. The second example, the mail order company, also had
to answer "In what state does this person live?", so we needed a
different structure to provide the results.
You need to be careful, of course, that you try to provide the
flexibility to handle not just the questions your users are
asking now, but also the ones you can foresee them asking in the
future. I'd be willing to bet, for instance, that whitin a year
of implementing the private club system the club will come back
asking you to sort the addresses by zip code so that they can
qualify for bulk mail discounts.
One of the hallmarks of good designers is the thoroughness and
creativity with which they solicit potential questions.
Inexperienced analysts are frequently heard to remark that the
users don't know what they want. Of course they don't; it's
your job to help them discover what they want.
There's trap here, however. Often, the trade-off for flexibility
is increased complexity. As we saw with the address examples,
the more ways you want to slice and dice the data, the more
exceptions you have to handle, and there comes a point of
diminishing returns.
This leads me to the strategy number two: find the exceptions.
There are two sides to this strategy: first that it is important
to identify all the exceptions, and second that you must design
the system to handle as many exceptions as you can without
confusing users. To illustrate what this means, let's walk
through another example: personal names.
If your system will be used to produce correspondence, it's
crucial that you get the name right. Most names are pretty
straightforward. Ms. Olivia R. Public consists of the Title,
FirstName, MiddleInitial and LastName, right? Wrong. In the
first place, FirstName and LastName are culturally specific.
It's more correct to use GivenName and Surname. Next, what
happens to Sir James Peddington Smythe, Lord Somestable? Is
Peddington Smythe his Surname or is Peddington his MiddleName,
and what do you do about the "Lord Somestable" part? And the
singer Sting? Is that a GivenName or a Surname?
The last question isn't as flippant as it sounds. A letter
addressed to Sir James Peddington Smythe probably won't offend
anyone. But the gentleman in question is not Sir Smythe; he's
Sir James, or maybe Lord Somestable. Realistically, though, how
many of your clients are lords or realm?
So, be aware that there's a trade-off between flexibility and
complexity. While it's important to catch as many exceptions as
possible, it's perfectly reasonable to eliminate some of them as
too unlikely to be worth the cost of dealing with them.
Distinguishing between entities and attributes is sometimes
difficult. Again, addresses are a good example, and again, your
decision must be based on the problem space. Some designers
advocate the creation of a single address entity used to store
all the addresses modeled by system. From an implementation
viewpoint, this approach has certain advantages in terms of
encapsulation and code reuse. From a design viewpoint, I have
some reservations.
It's unlikely that addresses for employees and customers will be
used in the sam way. Mass mailings to employees, for example,
are more likely to be done via internal mail than the postal
service. This being the case, the rules and requirements are
different.