Appendix 3: Information model and data dictionary
The information model of DeltaAccess
The theoretical item description information model and its simplification
Information model for multiple projects
List of entities and attributes
Detailed attribute information
Introduction to the concept of modifiers
Appendix 4: References, Glossary
Literature references
Glossary of terms and procedures
Previous table of contents
Main table of contents
The information model of DeltaAccess
The following diagram illustrates the core information model of this implementation. Each box in the graph below represents an entity (= table in the database), the attribute names (= fields of the table) are listed inside the boxes. Bold attributes are the primary key (or part of it, i. e. the combination of all boldface attributes must be unique). The connecting lines indicate the relations. The type of a relation is indicated by the 1 and the infinity symbol () at the start and end of the lines. The attribute on the 1-side may occur only once (i.e. it must be unique), while the attribute at the infinity (or "n") side can occur multiple times. All relations are protected by relational integrity rules with cascaded updates. Thus, if the attribute on the 1-side of a relation is changed, the attribute on the n-side is automatically updated to the new value.
The major entities of the character definition are the character entity (X_CHAR) which is linked by the character identifier (CID) with the character state entity (X_CS). The item data are the three entities displayed on the right side. They consist of the item definition (X_ITEM), which contains the item name and additional management information (including information regarding the data source, i.e. Collection Unit [= "specimen"], literature reference including a page or table number (LitRefDetail). The item identifier (IID) is used to link the item with the main item description entities, X_DESCR and X_RSC. Character and character state scores (for character types UM/OM), numerical data (for character types RN/IN) and text (for character type TE) are recorded in the item description entity (X_DESCR). The Resource entity (X_RSC) is used to store external resources like images, illustrations, hyperlinks etc. Such resources can be linked to a character, or even more specifically to a character state. Since multiple resources can exist for each association, the combination of IID, CID, and CS does not provide a primary key in X_RSC (like it did in X_DESCR), and a Resource identifier (RID) is added for internal reasons. It is currently not referenced any further in the information model.
Some additional information: Attributes ending with ID signify a numeric identifier, thus CID is the character ID (= number), IID the item ID. CS is the abbreviation of Character State; it is of type text to allow special states (U/V/-) and numerical/statistical attributes (mean, min, max, s.d., etc.). In X_DESCR, X is a real numeric attribute used only for numeric character types, TXT is a long text attribute used only for text characters, and SEQ is a sequence number to define a sequence of character states in the item description. The rationale and definition of modifiers are discussed in more detail in the following chapter. The use of character and character state definition attributes, and the item definition attributes are explained in the chapters about the character editing form, and the item edit form, respectively.
Perhaps 95% of the functionality of DeltaAccess is based on the core information model above. The full information model is considerably more complex. It adds the concepts of modifiers (X_MOD), headings (X_Char_Heading), and character dependencies (X_DEP). The full information model looks like:
(See major data tables, link tables, and accessory tables for a short explanation of all table names. A complete list of attributes, their type and indexes is supplied in the following chapter: List of entities and attributes.)
[*** DISCUSSION TO BE REVISED, part of material below is from Version 1.2 ***]
The _PROPERTY table is not integrated into the general information model (i.e. it has no relations to other entities). It is used to store internal properties of each descriptor project. Examples are author and copyright of the project data, date of import, version number of the information model, etc. See Editing project properties for more information and an overview of the currently defined properties.
The following accessory tables are not always present: X_ID_DESCR and X_ID_TXT are created only after the data set has been compiled for identification purposes. X_CHAR_Subset exists only for projects with a character subset (i.e. the character definition is based on another project, but only selected characters are visible).
Details of the information model: The Modifier and Character Heading structure
[*** MAKE Separate topics, + ADD back links from attribute to this discussion!]
Character Headings (X_CHAR_Heading and X_CHAR_Heading_Link)
Confor and IntKey use 3 different heading directives: Character Heading, Item Subheading, and Character Definition.
A Character Heading is a heading to a group of characters. It is used when the character definition is presented. Analogous item headings are not defined in DELTA. The term Item Subheading used by DELTA is somewhat deceptive, although the prefix "sub" tries to indicate the difference. Item Subheadings are headings to a group of characters in the natural language descriptions of each item.
Thus, Character Headings are used together with CharNames, while Item Subheadings are used together with CharPhrase in natural language descriptions. Both heading types are combined in the current information model. The HeadingName is the primary attribute. The optional HeadingPhrase attribute is preferred when Item Subheadings are output in natural language descriptions. If no HeadingPhrase is defined, the HeadingName is used here as well.
The association between characters and headings is defined in the relation between the character definition attributes CharHeading and ItemSubheading and the character heading definition attribute HID. Each character can have at most one heading defined for character definition output and one heading for natural language item descriptions (1:n relation).
A further header type is Character Definition. It is defined as an Intkey-directive in the Intkey.ini file. Character Definitions are called Named character groups in DeltaAccess. They are primarily used during interactive identification to restrict the number of character from which the user must choose. A complete list of perhaps several hundred characters would be cumbersome and confusing to use during identification. Instead, users can select from a list of Named character groups. In contrast to Character Headings and Item Subheadings, a single character can be a member of multiple Named character groups, which may be defined under different aspects, e. g., by organ of the organism, by observation or analysis method, or by developmental stage.
The association between characters and headings used as Named character groups is defined in the relation between the character definition attribute CID, the character heading definition attribute HID, and the link table _CHAR_Heading_Link (n:m relation).
[*** MODIFIERS: DISCUSSION TO BE WRITTEN IN NEXT VERSION ***]
The rationale and definition of modifiers are discussed in more detail in the following chapter Introduction to the concept of modifiers.
Details of the information model: The Character Dependency (DEP) structure
Character dependency describes relations between characters. The only relation defined in DELTA is that a character (which may be of any type) may become applicable or inapplicable, depending on the presence of a certain state in another categorical character. The latter character is called the controlling character. Theoretically, it would be possible define dependencies where the controlling character is numeric, but appropriate situation are exceedingly rare in practice. Supporting numeric controlling characters would require significantly different additional data structures. Therefore, neither standard DELTA nor DeltaAccess allow to make a character inapplicable, if a numeric character does or does not exceed a certain value.
The additional entity DEP links through CID and CS to the character state which is the controlling state of the controlling character. Dependent from this state are all characters linked through the Inapplicable CID attribute. Only inapplicable characters are directly supported. DELTA allows an alternative statement of making a character applicable only if a certain state is present. This statement is identical with this character being inapplicable if the other states of the controlling character are present. DeltaAccess makes appropriate conversions during import.
The structure used is similar to the BAOBAB model proposed by White & al. (1993), except for the fact that DeltaAccess defines inapplicable dependent characters, while their model defines character explicitly as applicable. While this inverse definition is equivalent as long as all controlling characters have been scored, different problems arise if the controlling character is undefined.
A special problem of this simple structure occurs if a recursive definition of character dependency is used. Recursive character dependency are allowed in the DELTA definition (Dallwitz & al. 1993). They occur, if the controlling state of a character may become inapplicable depending on yet another controlling character. A short example should illustrate this:
Controlling character |
Controlling character state |
Inapplicable character |
Leaves (presence) |
absent |
Petioles |
Petioles |
absent |
Petioles (hairiness) |
Petioles (hairiness) |
not hairy |
Petioles (hair length) |
Leaves (presence) |
absent |
Leaves (compound or not) |
Leaves (simple or compound) |
simple |
Leaflet petioles |
Leaflet petioles |
absent |
Leaflet petioles (hairiness) |
Leaflet petioles (hairiness) |
not hairy |
Leaflet petioles (hair length) |
Starting with the question of whether leaves are present, two chains of inapplicable characters are illustrated in the example. If leaves are absent, all characters listed in the right hand column become automatically inapplicable, even though the state immediately controlling them is not scored.
Thus, although the data model is perfectly capable of storing the information presented in the example, evaluating it in the application becomes considerably complicated. It is not possible to use simple SQL queries for such recursive evaluations. The current version of DeltaAccess therefore does not yet observe recursive (or "indirect") character dependency definitions. In practice, this can be overcome by making "Petioles (hairiness)", "Petioles (hair length)", etc. directly dependent on "Leaves (presence)". Naturally, this is less convenient and more error prone.
Other notes on the information model:
The model uses natural keys, many indexes (the data are not transactional but comparatively static), and collapses 1:1 relations (exclusive-or subtype structures) into a single table with alternatively used attributes. The net effect is that storage considerations (other information models could require less storage space) are sacrificed for speed gains. This decision is not based on the current size of DELTA data sets. In fact, I would expect much larger data sets to be created with DeltaAccess than are currently possible using Dallwitz's Confor. Rather, I consider the increase in storage space moderate, and storage space has become cheap. The decision to include even a moderate amount of illustrations in your data set is likely to have a much larger impact on the size of the data set, than any information model design decisions.
The decision to use the natural combination of CID and CS as primary key instead of an artificial primary ID key (e. g. counter attribute) to the _CS entity, but has several advantages, but some disadvantages as well. Referring to a single foreign key in the _DESCR entity instead of the combination of CID and CS would reduce the storage consumption of _DESCR, and allow a more direct query during identifications. Since both approaches can be converted into each other without any information loss, the choice between these options is assumed to be an implementation specific decision.
The use of natural keys does rely heavily on referential integrity with cascaded updates and deletes. It should not be used on any databases engine not supporting these features.
At several points a n:m relation was taken into consideration, e. g., for the character states. Many character states are identical across characters, especially the universal numerical statistic attributes and Unknown/Variable/Not-applicable, but also generic attributes like present/absent, colors, etc. The reason not to use n:m relations is, besides simplicity, that in one language it may be satisfactorily to say '... present' and '... absent', but in another language a different wording may be necessary for some characters but not for others. Duplicating the information right from the start makes it more difficult to change the wording for present/absent characters in the whole database (which I consider a rare event), but avoids problems during the translation. In the future special translation tools will be made available, which can do automatic updates of all identically phrased character states. The implications of the planned multilanguage support on the information model are further discussed in a chapter further below.
See also the following chapter Discussion of the theoretical item description information model and its simplification.
Appendix 3: Table of Contents
Main table of contents
The theoretical item description information model and its simplification
|
Who should read this: This chapter is exclusively addressed to people with a theoretical interest in information model of descriptor databases. |
While character and item definition are straightforward and fully normalized, the item description part (_DESCR table) is slightly denormalized. The item description part of the theoretical information model would be more complex (modifier, dependency and illustration parts are removed to simplify the example):
A separate entity could exist for each for each character defined for an item (X_DESCR, CID/IID links), to which an entity for the text data is linked 1:1, and separate entities for numeric and categorical data are linked 1:n. The use of these tables would be exclusive, i. e. which entity could be linked would depend on the character type defined in X_CHAR. X_DESCR would contain only foreign keys plus a notes field which is not really used like that in DELTA (the notes directly after the character behave not like notes after character states, but rather like a TEXT character). The only advantage would be that the rule "only one entry per IID/CID combination" would be enforced on the database engine level. Since this solution requires frequent joins and, worse, slow union queries, I have not followed that path.
Instead, the X_DESCR entities are merged. The Jet database engine of Microsoft Access handles empty attributes extremely well (text attributes use a single byte storage space per record, and the indexes can be defined to ignore all unused values). Instead of implementing rules that only one of the additional entities X_DESCR_CAT, X_DESCR_NUM, or X_DESCR_TXT may be joined, the attributes themselves are declared exclusive. A database engine level rule defines that the attribute TXT may only be used when the CS is TE, that the use of X and TXT is exclusive, and that X may not be used for numeric CS. On the engine level, this allows only very few unwanted entries in X (e. g., when CS is '1&2'), which has to be prevented on the level of the editing forms. Note that the merging of the DESCR entities is an implementation specific decision, which has no influence on the data collected. Data are fully convertible between both information models if all described integrity rules are observed.
As a further internal decision, the use of a default character state TE is enforced for text data. The only reason to do this is that referential integrity can then be activated for all character states. This is only possible if CS in X_DESCR is required and part of the primary key. Referential integrity with cascaded updates makes using the database much safer and easier. It means that if you change character state '3' to '4' in the character definition table X_CS, all data in the item description will automatically be changed. This is enforced on the database engine level and will therefore work in queries as well as in the editing forms.
A possible abnormality of the entities as implemented in DeltaAccess, is that it is possible to enter records for characters of text or numeric character types, without a text or numeric value, respectively. This must be taken into account when creating analysis queries. It is not a flaw of the information model though, but in fact an inherent property of DELTA, which allows text and numeric comments without requiring a value.
Appendix 3: Table of Contents
Main table of contents
Information model for multiple descriptor projects
|
Who should read this: This chapter is exclusively addressed to people with a theoretical interest in information model of descriptor databases. |
As discussed in the introduction The mountain and the pebbles: A global knowledge base versus multiple discrete projects, a descriptor database should be capable of handling multiple independent descriptor projects as well as multiple related descriptor projects. The implementation details of multiple independent projects is discussed in this chapter, the latter are discussed in the chapters on linked projects and subsets.
Two options are available to implement a multiple project descriptor database:
1. Create a full set of tables for each project, letting the database engine handle the management of projects. Each set of tables has the same structure, but the table names are prefixed by a project identifier.
2. Add an additional project identifier to the information model. In the current model, a project identifier would be required at least in the CHAR, ITEM, and PROPERTY entities. A project identifier is not absolutely necessary in the CHAR_Heading entity, but it would probably be useful to get an overview over all headings of a project. All other entities are linked with these entities that the project can be uniquely identified. The primary key of CHAR and ITEM would be a composite index based on the project identifier and the CID and IID, respectively.
The second approach is the preferred one, used, e. g., by PANDORA and TreeBase. PANDORA implements all descriptors in a single set of tables, to which a ***Pandora uses which term //handout Paris, DeltaNewsletter*** and uses composite indexes, called "****". TreeBase is a cooperative project sponsored by NSF, Harvard University Herbaria and University of California, Davis (see www.herbaria.harvard.edu /treebase/schema.html, seen 13.3.98). As far as can be deduced from the documentation published on the web, the characters are organized into multiple character matrices (Matrix entity), which are then linked by a MatrixTaxa entity with Taxa itself.
In contrast, DeltaAccess uses the first approach, creating a new set of tables for each project. This is done exclusively for practical reasons. The decision is based on the assumption that almost never a cross-project analysis will be performed (which would require a union query in the information model used by DeltaAccess. Letting the database engine handle the project management has two advantages. Firstly, a significant performance gain is achieved, since the engine organizes physical data storage and indexes by project. Secondly, it is possible to use queries ("views") in place of the tables to create projects that are partially or fully linked to other projects, with or without a subset condition. The feature to use dynamically updateable item or character subsets is unique to DeltaAccess, and very important if larger, cooperative projects are attempted.
It should be noted that the decision two use the first option on the client side in DeltaAccess does not preclude a different decision on the server side, if DeltaAccess is used in a combination with a SQL server. The server tables can be organized following the second option, adding a project identifier. To then link projects into DeltaAccess, one simply has to create a set of queries/views, analogous to linking tables, which contain a WHERE condition restriction the view to a single project. If the server allows updateable queries/views, the queries should be created on the server side. Otherwise they should be created within DeltaAccess.
Appendix 3: Table of Contents
Main table of contents
Next