Relational Database-The Data Model

The most abstract level of a database is the data model, the conceptual description of a problem space. Data models are expressed in terms of entities, attributes, domains, and relationships. Entities It's difficult to provide a precise formal definition of the term entity, but the concept is intuitively quite straightforward: an entity is anything about which the system needs to store information. When you begin to design your data model, compiling an initial list of entities isn't difficult. When you (or your clients) talk about the problem space, most of the nouns and verbs used will be candidate entities. "Customers buy products. Employees sell products. Suppliers sell us products." The nouns "Customers", "Products", "Employees", and "Suppliers" are all clearly entities. The events represented by the verbs "buy" and "sell" are also entities, but a couple of traps exist here. First, the verb "sell" is used to represent two distinct events: the sale of a product to a customer and the purchase of a product by the company. That's fairly obvious in this example, but it's an easy trap to fall into, particularly if you're not familiar with the problem space. The second gotcha is the inverse of the first: two different verbs ("buy" in the first sentence and "sell" in the second) are used to describe the same event, the purchase of a product by a customer. Again, this isn't necessarily obvious unless you're familiar with the problem space. This problem is often trickier to track down than the first. If a client is using different verbs to describe what appears to be the same event, they might in fact describing different kinds of events. If the client is a tailor, for example, "customer buys suit" and "customer orders suit" might both result in the sale of the suit, but in the first case it's a pret-a-porter sale (suit) and in the second it's bespoke. These are very different processes that might need to be modeled differently. In addition to interviewing clients to establish a list of entities, it's also useful to review any documents that exist in the problem space. Input forms, reports, and procedures manuals are all good sources of candidate entities. You must be careful with documents, however. Printed documents have a high degree of inertia: input forms particularly are expensive to print and frequently don't keep up with changes to policies and procedures. If you stumble across an entity that's never come up in an interview, don't assume the client just forgot to mention it. Chances are that it's a legacy item that's no longer pertinent to the company. You'll need to check. Once you've developed an intial list of candidate entities, it's also useful to review them for completeness and consistency. Again, you need to find duplicates and distinct entities that are masquerading as the same entity. A useful tool in this process is the concept of entity subtypes. To return to our example of the tailor, "pret-a-porter" and "bespoke" both represent the purchase of an item of clothing, but they're different kinds of purchases. In other words, Sale and Order are both subtypes of the entity Purchase. Attributes that are common to both types of Purchase are assigned to the supertype, in this case, Purchase, and attributes specific to a subtype-PretAPorter or Bespoke in this instance-are factored out to that subtype. This allows both kinds of events to be treated as generic Purchases when that is appropriate (as when calculating total sales) or as specific kinds of Purchases (as when comparing subtypes). Sometimes you might discover that the entity subtypes don't actually have distinct attributes, in which case it's more convenient to make TypeOfSale (or TypeOfCustomer, or TypeOfWhatever) an attribute of the supertype rather than modeling the subtypes as distinct entities. With tailoring example for bespoke sales you might need to know the cloth and color selected, whereas for pret-a-porter sales you might need to track the garment manufacturer. In this case, you would use subtypes to model these entities. If, however, you only need to know that a sale was bespoke or pret-a-porter, a TypeOfSale attribute would be simpler to implement. Subtypes are usually mutually exclusive, but this is by no means always the case. Consider an employee database. All employees have certain attributes in common (hire date, department, telephone), but only some will be salespeople (with specific attributes for commision rates and target) and only a few will join the company football team. There's nothing preventing a salesperson from playing basketball, however. Most entities model objects or events in the physical world: customers, products or sales calls. These are concrete entities. Entities can also model abstract concepts. The most common example of an abstract entity is one that models the relationship between other entities, the fact that a certain sales representative is responsible for a certain client or that certain student is enrolled in a certain class. Sometimes all you need to model is the fact that a relationship exists. Other times you'll want to store additional information about the relationships, such as the date on which it was established or some characteristics of the relationship. The relationship between coyotes and wolves is competitive, that between wolves and sheeps is predatory, and it's useful to know this if you're planning an open-range zoo. Whether relationships that do not have attributes ought to be modeled as separate entities is a matter of some discussion. I don't think anything is gained by doing so, and it complicates the process of deriving a database schema from the data model. However, understand that relationships are as important as entities are in the data model. Attributes Your system will need to keep track of certain facts about each entity. These facts are referred to as the entity's attributes. If your system includes Customer entities, for example, you'll probably want to know the names and addresses of the customers and perhaps the businesses they're in. If you're modeling an event such as a Service Call, you'll probably want to know who the customer was, who made the call, when it was made, and whether the problem was resolved. Determining the attributes to be included in your model is a semantic process. You must make your decision based on what the data means and how it will be used. Let's look at one common example: and address. Do you model the address as a single entity (the Address) or as a set of entities (HouseNumber, Street, City, State, ZipCode)? Most designers would tend to automatically break the address up into a set of attributes on the general principle that structured data is easier to manipulate, but this is not necesarily correct and certainly not straightforward. Let's take for instance a local private club. It will want to store the addresses of its members in order to print mailing labels. Since all members live in the same city, there is no reason to ever look at an address as anything other than a blob: a single, multiline chunk of text. But what about a mail-order company that does all its business on the Internet? For sales tax purposes, the company needs to know the states in which its customer reside. While it's possible to extract the state from the single text field used by the local private club, it isn't easy; so it makes sense in this case to at least model the state as a separate attribute. What about the rest of the address? Should be it composed of multiple attributes, and if so, what are they? Be awareeee that while addresses in the United States conform to a fairly standard pattern, modeling them is probably not as simple as appears. You might think that a set of attributes (HouseNumber, Street, City, ZipCode) might be adequate. But then you need to deal with appartment numbers and post office boxes. What do you do with an address to be sent in care of someone else? And of course the world is getting smaller, but not less complex, so what happens when you get first customer outside the United States? Not only do you need to know the country and adjust the zip code, but the arrangement of the attributes might need to change. In Romania, for example, the house number follows the street name. That's not too bad, it's easy enough to map that when you're entering data, but how many of your users would know that in the address 4/32 South Avenue, Salt Beach, Australia, 4/32 means Apartment 4, Number 32? The point here is not so much that addresses are hard to model, although they are, but rather that you can't make any assumptions about how you should model any specific kind of data. The complex schema that you develop for handling international mail order is completely inappropriate for the local private club. Matisse is reputed to have said that a painting was finished when nothing could be either added or subtracted. Entity design is a bit like that. How do you know when you've reached that point? The unfortunate answer is that you can never know for certain. At the current state of technology, there isn't any way to develop a provably correct database design. You can prove that some designs have flaws, but you can't prove that any given design doesn't. You can't, if you will, prove your innocence. How do you tackle this problem? There are no rules, but there are some strategies. The first strategy is: start with the result and don't make the design any more complex that it needs to be. What questions does your database to answer? In our first example, the private club, the only question was "Where do I mail a letter to this person?", so a single attribute model was sufficient. The second example, the mail order company, also had to answer "In what state does this person live?", so we needed a different structure to provide the results. You need to be careful, of course, that you try to provide the flexibility to handle not just the questions your users are asking now, but also the ones you can foresee them asking in the future. I'd be willing to bet, for instance, that whitin a year of implementing the private club system the club will come back asking you to sort the addresses by zip code so that they can qualify for bulk mail discounts. One of the hallmarks of good designers is the thoroughness and creativity with which they solicit potential questions. Inexperienced analysts are frequently heard to remark that the users don't know what they want. Of course they don't; it's your job to help them discover what they want. There's trap here, however. Often, the trade-off for flexibility is increased complexity. As we saw with the address examples, the more ways you want to slice and dice the data, the more exceptions you have to handle, and there comes a point of diminishing returns. This leads me to the strategy number two: find the exceptions. There are two sides to this strategy: first that it is important to identify all the exceptions, and second that you must design the system to handle as many exceptions as you can without confusing users. To illustrate what this means, let's walk through another example: personal names. If your system will be used to produce correspondence, it's crucial that you get the name right. Most names are pretty straightforward. Ms. Olivia R. Public consists of the Title, FirstName, MiddleInitial and LastName, right? Wrong. In the first place, FirstName and LastName are culturally specific. It's more correct to use GivenName and Surname. Next, what happens to Sir James Peddington Smythe, Lord Somestable? Is Peddington Smythe his Surname or is Peddington his MiddleName, and what do you do about the "Lord Somestable" part? And the singer Sting? Is that a GivenName or a Surname? The last question isn't as flippant as it sounds. A letter addressed to Sir James Peddington Smythe probably won't offend anyone. But the gentleman in question is not Sir Smythe; he's Sir James, or maybe Lord Somestable. Realistically, though, how many of your clients are lords or realm? So, be aware that there's a trade-off between flexibility and complexity. While it's important to catch as many exceptions as possible, it's perfectly reasonable to eliminate some of them as too unlikely to be worth the cost of dealing with them. Distinguishing between entities and attributes is sometimes difficult. Again, addresses are a good example, and again, your decision must be based on the problem space. Some designers advocate the creation of a single address entity used to store all the addresses modeled by system. From an implementation viewpoint, this approach has certain advantages in terms of encapsulation and code reuse. From a design viewpoint, I have some reservations. It's unlikely that addresses for employees and customers will be used in the sam way. Mass mailings to employees, for example, are more likely to be done via internal mail than the postal service. This being the case, the rules and requirements are different.