Concepts of structured data storage
The mountain and the pebbles: A global knowledge base versus multiple discrete projects
Strategy and future development of DeltaAccess
Software architecture of DeltaAccess
DELTA software sources on the internet
Next table of contents: Import and export
Main table of contents
What is DELTA?
DELTA (Description Language for Taxonomy) is a powerful data exchange format for descriptive data. It goes back to work done by Mike Dallwitz at Canberra University in 1973, and was officially introduced in Dallwitz (1980). DELTA is probably the most widely used general purpose standard for descriptive data and is used by several important taxonomic software packages ("The Delta package" containing Confor, Delfor, and Intkey; Pankey and Pandora; DeltaAccess; Taxasoft; and with restrictions also by LucID and CABiKey). It is rivaled only by the NEXUS format, which in its current version 2 (Maddison & al. 1997) is still limited to analytical purposes and cannot deal with basic data types like text or multistate characters.
The inventor of the DELTA data exchange format (Mike Dallwitz) calls his software package "DELTA" as well. Although he undoubtedly has a right to do so, this is very confusing. It places a heavy burden on the data standard issue and probably will make it necessary to rename the data standard for descriptive data in the future. To avoid confusion between the data standard and Dallwitz's programs, this documentation therefore refers to the programs in this package (Confor, Delfor, and Intkey) instead of to the package itself.
The Taxonomic Database Working Group (TDWG, a section of the IUBS) has endorsed the basic directives of DELTA as an international data standard (see http://www.tdwg.org/standrds.html).
Technically, DELTA is a delimiter-based free-text format that works similarly to a programming language. Its main advantages are that data files are plain ASCII, comparatively compact, and remain to a certain degree readable and editable in a word processor. In fact, the reliance on this ability has seriously handicapped the wider use of DELTA: the structure is sufficiently technical to deter most biologists from trying to understand and use it. Two DOS-based editors are available (DEDIT by R. Pankhurst and TAXASOFT by E. Gouda). The first graphical, Windows-based editor appeared in 1997 with DeltaAccess. Only recently, i.e. after 18 years, have the first beta versions of a DELTA editor for Dallwitz's "Delta package" appeared.
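To give a feel for the delimiter-based notation, the following Python sketch parses a single item description written in a deliberately simplified, DELTA-like form. The sample line and the function name parse_item are illustrative assumptions; real DELTA files contain many more constructs (directives, comments, combinations of states) that this sketch ignores.

```python
def parse_item(text):
    """Parse '# Name/ 1,2 2,10-20' into (name, {character_number: value}).

    Simplified on purpose: assumes no comments and no whitespace
    inside an attribute value.
    """
    if not text.startswith("#"):
        raise ValueError("an item description must start with '#'")
    # The item name runs from '#' to the first '/'.
    name, _, rest = text[1:].partition("/")
    attributes = {}
    for token in rest.split():
        # Each attribute is 'character number, coded value'.
        number, _, value = token.partition(",")
        attributes[int(number)] = value
    return name.strip(), attributes


# Hypothetical sample: character 1 is coded as state 2,
# character 2 is coded as the range 10-20.
name, attributes = parse_item("# Example species/ 1,2 2,10-20")
print(name, attributes)
```

Even this small fragment shows why the format is compact but hard to read without tool support: the numbers only become meaningful together with the character definition.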
For the purpose of this introduction, it is sufficient to view DELTA as a data interchange format. More detailed information can be found in Dallwitz & al. (1993) and chapter 5 of Pankhurst (1991).
DELTA uses the terms item and character(1). An item can be a taxon as a whole (e. g. a species or genus), a specimen of any specific or subspecific taxon, or any other object (e. g. a vegetation unit; compare also Interactive identification packages outside the realm of taxonomy). Characters, sometimes also called attributes(2) or features, describe the items. Characters should be defined in such a way that they are independent, and that each independent feature is represented by a single character. Note that this is frequently not the case in traditional morphological descriptions. Taxonomists often tend to think in terms of character complexes rather than in terms of atomized characters.
If the character type is categorical, a discrete number of character states is defined for each character. Other character types are numeric, where measurement values or statistics (mean, standard deviation, etc.) can be entered directly, and text, where any free text can be entered. Images (photographs, line drawings, etc.) can be attached to the character definition, as well as to the item description.
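The three character types can be pictured as a small data structure. The sketch below is only an illustration in Python; the class name Character, its field names, and the sample characters are assumptions and do not reflect the DELTA directives or the DeltaAccess tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Character:
    number: int
    name: str
    kind: str                                         # "categorical", "numeric", or "text"
    states: List[str] = field(default_factory=list)   # used by categorical characters only
    unit: str = ""                                    # used by numeric characters only

    def __post_init__(self):
        if self.kind not in ("categorical", "numeric", "text"):
            raise ValueError(f"unknown character type: {self.kind}")
        if self.kind != "categorical" and self.states:
            raise ValueError("only categorical characters define states")


# Hypothetical examples, one per character type.
leaf_shape = Character(1, "leaf shape", "categorical", ["ovate", "lanceolate"])
leaf_length = Character(2, "leaf length", "numeric", unit="mm")
notes = Character(3, "notes", "text")
```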
DeltaAccess is a software package to import and export DELTA files. It converts the DELTA-coded text files to relational database structures. Some of the reasons for writing DeltaAccess are presented in the following chapter Concepts of structured data storage. DeltaAccess currently supports only a subset of the directives supported by Dallwitz's Confor program, but in many other ways it goes further than Confor, which in turn supports only a subset of the DeltaAccess directives. DeltaAccess tries to overcome the limitations of DELTA without giving up the compatibility with DELTA data files and programs supporting DELTA.
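The idea of converting coded text into relational structures can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module instead of Microsoft Access; the table names (items, characters, descriptions), the column names, and the sample data are assumptions and are not the actual DeltaAccess information model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE characters (char_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE descriptions (
        item_id     INTEGER NOT NULL REFERENCES items(item_id),
        char_id     INTEGER NOT NULL REFERENCES characters(char_id),
        coded_value TEXT NOT NULL
    );
""")

# Hypothetical content of one coded item, already split into attributes.
conn.executemany("INSERT INTO characters VALUES (?, ?)",
                 [(1, "leaf shape"), (2, "leaf length")])
conn.execute("INSERT INTO items VALUES (1, 'Example species')")
conn.executemany("INSERT INTO descriptions VALUES (1, ?, ?)",
                 [(1, "2"), (2, "10-20")])

# Once the data are in tables, they can be queried with plain SQL.
rows = conn.execute("""
    SELECT i.name, c.name, d.coded_value
    FROM descriptions AS d
    JOIN items AS i ON i.item_id = d.item_id
    JOIN characters AS c ON c.char_id = d.char_id
    ORDER BY c.char_id
""").fetchall()
print(rows)
```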
DeltaAccess is explicitly not designed only to code existing data for later compilation into natural language descriptions or for use by interactive identification programs. DeltaAccess can be a general data repository, in which the raw data can be entered during work in progress. It should be a working tool of the biologist, rather than an additional task after the completion of data collection. The data are edited or analyzed in the database, not in DELTA-coded text files.
The Identify module of DeltaAccess allows the user to identify an unknown item at hand by comparison with the items stored in a DELTA database. It is currently not finished, and supplied as a preview only. In Identify, one or several character states for each character can be described as present or absent, numerical measurement values can be used directly, and words from the free text descriptions can be searched. Although development of a stand-alone identification-only application is possible, given my intentions of supplying a working tool, I currently have no plans to do this. See the chapter Identify in comparison with other identification software for a comparison of the working tool approach with the publishing tool approach.
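The three kinds of search criteria mentioned above (states present or absent, measurement values, words in free text) can be combined in a simple filter. The following Python sketch only illustrates the principle; it is not the algorithm used by Identify, and the function name matches, the dictionary layout, and the sample items are assumptions.

```python
def matches(item, present=(), absent=(), measurements=(), words=()):
    """Return True if the stored item is compatible with all criteria."""
    states = item["states"]
    for char, state in present:
        if state not in states.get(char, set()):
            return False
    for char, state in absent:
        if state in states.get(char, set()):
            return False
    for char, value in measurements:
        # An item without a recorded range is not excluded by a measurement.
        low, high = item["ranges"].get(char, (float("-inf"), float("inf")))
        if not low <= value <= high:
            return False
    text = item["text"].lower()
    return all(word.lower() in text for word in words)


# Hypothetical stored item descriptions.
items = {
    "Species A": {"states": {"leaf shape": {"ovate"}},
                  "ranges": {"leaf length": (10, 20)},
                  "text": "Common in wet meadows"},
    "Species B": {"states": {"leaf shape": {"ovate", "lanceolate"}},
                  "ranges": {"leaf length": (25, 40)},
                  "text": "Dry grassland"},
}

candidates = [name for name, item in items.items()
              if matches(item,
                         present=[("leaf shape", "ovate")],
                         measurements=[("leaf length", 15)])]
print(candidates)
```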
If a read-only version of a descriptor project is required, the database could be protected using the Microsoft Access security model, but whenever possible the user should have the option to add his or her own data. See the chapter Fingerprinting data sets for an alternative method of copyright protection.
(1) DeltaAccess uses the terms character and item. Synonyms for character are: feature, attribute, property, and descriptor. Synonyms for item are: taxon and OTU (= Operational taxonomic unit). The item synonyms apply only when used in a taxonomic/systematic context; however, DeltaAccess can be used to identify vegetation types as well as specimens!
Compare: Colless (1985) and Inglis (1991).
(2)M. Dallwitz uses the term attribute in relation to scoring characters for specific items in the item descriptions. The documentation of DeltaAccess does not follow this usage, because in data modeling the term is usually understood in the sense of attribute (= "field") of an entity (= "table").
Concepts of structured data storage
Who should read this: People interested in the concepts and current development trends of the various DELTA programs. Reading this chapter is not necessary for the use of DeltaAccess.
Structuring data is at the very heart of any descriptive work. The degree to which data should be structured depends on the objectives of the project. Often it is also a matter of taste and tradition.
Any text is structured. A well written monograph is better structured than a poorly planned collection of descriptions. Using well structured outlines, and making them readily intelligible to the reader using appropriate formatting, is a basic requirement of writing.
The origins of DELTA go back to a time when word processing applications had limited features if they were available at all. At that time it was probably necessary to have a formatting tool and a specialized word processing programming language to use with descriptive data. The proprietary formatting codes still used by current versions of Confor have their justification in that period.
Today, word processing programs are powerful, sometimes feature-overloaded applications which are readily available. They are used on an everyday basis to write reports and scientific publications by almost every scientist. Formatted output is achieved by industry standard interchange formats (Postscript, TEX, RTF, etc.) and software interfaces (e. g. printing services and hardware drivers supplied by the operating system). With such flexible and powerful programs already in widespread use, why would one want to learn yet another word processing tool?
In my opinion, DELTA will only survive if it is seen as a data interchange format for data with more structure than normal text.
The description of taxa and specimens is a task that can benefit enormously from structured data storage in a database. Whenever possible, the information gathered should be linked to voucher specimens, e. g., insects or herbarium sheets. Descriptions taken from the literature should be linked to a reference in the literature database. Generic taxon descriptions can then automatically be prepared from the separately entered descriptions of specimens or references. If later the identification of a specimen from which data were recorded changes, or if certain references turn out to be unreliable, this can automatically be corrected in the taxon descriptions. See the documentation of the Summarize feature of DeltaAccess for more information.
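The benefit of deriving taxon descriptions from specimen records can be shown with a small Python sketch. It is not the Summarize feature itself; the function name summarize, the record layout, and the voucher data are assumptions made for illustration.

```python
from collections import defaultdict


def summarize(specimens):
    """Aggregate specimen records into one description per taxon."""
    taxa = defaultdict(lambda: {"states": set(), "lengths": []})
    for specimen in specimens:
        summary = taxa[specimen["taxon"]]
        summary["states"] |= specimen["states"]
        summary["lengths"].append(specimen["length_mm"])
    return {
        taxon: {"states": sorted(s["states"]),
                "length_mm": (min(s["lengths"]), max(s["lengths"]))}
        for taxon, s in taxa.items()
    }


# Hypothetical voucher specimens with their current identification.
specimens = [
    {"voucher": "V001", "taxon": "Species A", "states": {"ovate"}, "length_mm": 12},
    {"voucher": "V002", "taxon": "Species A", "states": {"lanceolate"}, "length_mm": 18},
    {"voucher": "V003", "taxon": "Species B", "states": {"ovate"}, "length_mm": 30},
]
before = summarize(specimens)

# Voucher V002 is re-identified; the taxon descriptions are simply rebuilt.
specimens[1]["taxon"] = "Species B"
after = summarize(specimens)
```

Because the taxon descriptions are computed rather than typed in, correcting a single identification is enough to bring both affected taxa up to date.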
A database system which supports structured data storage very well, includes a "Summarize" function, and has been in use for a number of years is Richard Pankhurst's Pankey/Pandora combination. Its disadvantage is that Pandora uses an implementation-specific data model called "post-relational", which is a mixture of relational and hierarchical elements and which is incompatible with other database programs. Also, Pankey and Pandora are currently limited to the DOS operating system.
DeltaAccess tries to provide a similar database interface for standard relational databases under a modern operating system. Microsoft Access, one of the most widely distributed PC desktop databases, was chosen as the basis for this development (see the chapter Software architecture of DeltaAccess).
In contrast, most of Mike Dallwitz's DELTA package seems to be evolving in the opposite direction. Instead of viewing DELTA as a general data collection and analysis tool, the basic guideline for the development of Mike Dallwitz's DELTA package seems to be the book metaphor. The primary purpose of collecting the data is to compile the DELTA project either into a book (including natural language descriptions, printed images, and printed dichotomous keys), or into a data set for the interactive identification program IntKey. This is especially true for the not yet finalized (as of May 98) new version of the DELTA format (see Dallwitz et al., New Features for the DELTA System).
In this new version, DELTA seems to be developing into a programming language, which can be compiled into material for a taxonomic monograph, like dichotomous keys or natural language descriptions. Other examples of such free-text coding formats are HTML, the format used for World-Wide-Web documents, or the RTF exchange format used by word-processors. Examples for constructs incompatible with structured databases are:
Such free text formats do have a structure and they contain information, but they are still incompatible with all existing structured databases (excluding perhaps the semi-relational free text database specialist AskSam). Both the structured and the unstructured approach have their strengths and weaknesses. The difference becomes evident for example where comment text is used to modify the information defined by the character state, e. g., through adding "sometimes", "rarely", or even "but not". This works well when compiling natural language descriptions or printed keys, but not when items are identified in the database.
Natural language descriptions and dichotomous or multichotomous printed keys are certainly very important, but they are not the only desirable tasks. A descriptor database can be used to organize the data collection and editing work itself. For this purpose it is important that the data can be analyzed to search for missing information, and that data can be selectively retrieved in a process which is very similar to interactive identification. Questions like "Do I already know about this?" or "Which evidence supports this observation, and which contradicts it?" should be readily answered in a consistent application environment. This makes it necessary that the interactive identification works online on the most recent data available. It should be possible to improve the character definition as required by the ongoing work.
The current version of the DELTA format poses several problems in this scenario.
The capability to annotate the source of information or the responsible author, typist or editor is very poorly developed in DELTA. This becomes particularly cumbersome if several workers cooperate on a single project.
Another problem of the current DELTA definition is that for numerical characters the meaning of the range and central value remains unclear. The range could be a confidence interval, a quantile, mean plus/minus s.d.; the central value could be a single measurement, a mean, median, mode, etc. In the book paradigm, such information might be given in the introduction, but this is not possible in a worldwide network of distributed databases.
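One way to remove this ambiguity is to store the statistical meaning together with the numbers. The following Python sketch illustrates the idea under assumed names (NumericValue, central_kind, range_kind); neither DELTA nor DeltaAccess is claimed to define such a structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NumericValue:
    central: float
    central_kind: str                    # e.g. "mean", "median", "single measurement"
    low: Optional[float] = None
    high: Optional[float] = None
    range_kind: Optional[str] = None     # e.g. "min-max", "mean +/- s.d."

    def __post_init__(self):
        if (self.low is None) != (self.high is None):
            raise ValueError("low and high must be given together")
        if self.low is not None and self.range_kind is None:
            raise ValueError("a range needs an explicit range_kind")


# Hypothetical measurement: the reader no longer has to guess what 12-18 means.
leaf_length = NumericValue(15.0, "mean", low=12.0, high=18.0,
                           range_kind="mean +/- s.d.")
```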
In many cases it is preferable to use a link for scientific names, literature references, geographical places, collectors' numbers, collection unit numbers in specimen collections, etc., instead of describing the information in a plain text string. Links may be based on text strings, but they are protected to guarantee that the corresponding information in the external database subsystem can be found automatically by the application, even if the corresponding information has been updated. In contrast, a literature reference entered as text is resolved in the brain of the user, and is not accessible to a program.
The mountain and the pebbles: A global knowledge base versus multiple discrete projects
Who should read this: People interested in the rationale of having several projects, people interested in implementing a general descriptor database in a large scientific organization, and people interested in building a network of scientists working on closely related subjects.
Theoretically it would be possible to integrate all existing descriptive data about organisms into a single set of database tables. This model would represent the sum of the worldwide taxonomic knowledge, including groups as diverse as viruses, bacteria, protists, algae, lower and higher plants, fungi, and the various groups of animals.
Such a global data set is certainly an attractive idea. Nevertheless, DeltaAccess implements multiple discrete data sets called descriptor projects instead. In a discussion of the respective merits of these approaches, the conceptual problems of a global character definition should be distinguished from the technical problems related to the information model and its implementation, which are discussed in the chapter Information model for multiple projects.
In principle it is possible to create a character definition for a global, all-taxa descriptor database. Using the inapplicable character mechanism, the number of characters visible could be reduced to a bearable measure. Unfortunately, creating a global character definition requires a considerable amount of research and discussion about terminology. It is probably impossible to automatically merge existing character definitions, which contain a mixture of logically identical characters (named identically or not) and non-congruent characters (named differently, or even identically!). Unless a character definition was already specifically devised with a global scope in mind, it is difficult to integrate it into a global character definition.
Example: The character "Presence of wings" does not refer to the same thing in insects, birds, and bats. This may appear to be irrelevant at first, yet it is not. In a key to insects, you may want to distinguish between front and hind wings, or add a third alternative to deal with cases where the wings are reduced, but still visible (e. g. halteres in Diptera). Or, you may want to make the feather characters of birds dependent on this character. If you are currently identifying a bat, being asked about feathers will then be confusing.
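The wing example can be expressed as a small sketch of the inapplicable character mechanism. The Python code below is an illustration only; the dictionary DEPENDENCIES, the function applicable_characters, and the character names are assumptions.

```python
# (controlling character, state) -> characters that become inapplicable
DEPENDENCIES = {
    ("presence of wings", "absent"): {"feather colour"},
}


def applicable_characters(all_characters, scored):
    """Hide characters made inapplicable by the states scored so far."""
    hidden = set()
    for (char, state), dependents in DEPENDENCIES.items():
        if scored.get(char) == state:
            hidden |= dependents
    return [c for c in all_characters if c not in hidden]


characters = ["presence of wings", "feather colour", "body length"]

# A wingless organism: the feather character is correctly hidden.
wingless = applicable_characters(characters, {"presence of wings": "absent"})

# A bat: wings are present, so the feather character is still offered.
bat = applicable_characters(characters, {"presence of wings": "present"})
```

The second call reproduces the problem described above: a single global character "presence of wings" cannot tell a bat from a bird, so a dependency that is sensible for birds misleads the user who is identifying a bat.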
Even after these problems of a global character definition are solved, "projects", as local chunks of data, remain a necessity. While a global data set is very attractive for analysis/identification purposes, it may not be a good research tool.
The quality of a data set strongly depends on the understanding of the researchers responsible for it. If they are experts only for a small group, this group should be their only concern. Data collection or editing should therefore remain a local process, involving only those characters and items necessary. It is important to maintain the flexibility to change the character definition ad hoc, without having to consult a multinational committee... A modification of a character definition might be erroneous, but we might learn more from our errors than from our omissions.
The knowledge of biological organisms is usually local to taxonomic groups as well as to geographical regions. It does not seem to be a wise decision to burden the character definitions of researchers with the problems that would result from enforcing global character definitions.
I believe that ultimately these issues can be reconciled, but many problems need to be analyzed first. It is important to distinguish between the requirements of the analysis and information retrieval (including identification) processes on the one hand and those of the editing process on the other. The first actions should focus on data set integration for analysis and identification purposes, i.e., the consolidation of multiple data sets in a one-way process for use in identification applications.
An important option of descriptor databases for the purpose of data set integration would be a "validation" of individual characters against a reference character definition. Such reference character definitions for larger areas (e. g., higher plants, insects, or fungi) should be created in coordinated projects involving as many researchers as feasible. They should be issued with a version number. Each character in a new version of a reference character definition should be validated against the previous version. The validation assures that the two versions are backward compatible, although the wording of a character may have been changed, or additional character states may have been added. Individual researchers could validate characters in their definition against a specific version of a reference definition. This procedure would maintain the option to introduce additional characters, or even have a deviating opinion about how a character should be defined. Such a deviation would then at least be a deliberate and documented process.
If reference character definitions exist and if individual data sets are already validated by their authors against (possibly different versions of) these reference character definitions, a major part of the integration process could be automated. The remaining conflicting characters could be analyzed manually, which itself in many cases may give valuable insights into the structure of the data.
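The backward compatibility check between two versions of a reference character definition could look roughly like the following Python sketch. It assumes that characters and states carry stable identifiers which survive changes of wording; the function name is_backward_compatible and the sample definitions are hypothetical, and no existing tool is claimed to work this way.

```python
def is_backward_compatible(old, new):
    """True if every character and every state of `old` still exists in `new`.

    Wording may differ and states may have been added; only identifiers count.
    """
    for char_id, old_char in old.items():
        new_char = new.get(char_id)
        if new_char is None:
            return False
        if not set(old_char["states"]) <= set(new_char["states"]):
            return False
    return True


# Hypothetical reference definitions, keyed by character identifier;
# states are keyed by state identifier.
reference_v1 = {
    "C1": {"name": "leaf shape", "states": {1: "ovate", 2: "lanceolate"}},
}
reference_v2 = {
    "C1": {"name": "shape of leaf blade",
           "states": {1: "ovate", 2: "lanceolate", 3: "linear"}},
}

forward = is_backward_compatible(reference_v1, reference_v2)
backward = is_backward_compatible(reference_v2, reference_v1)
```

Data coded against version 1 remain valid under version 2, but not the other way round, because version 1 does not know state 3.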
In the absence of reference character definitions, and while the available tools (including DeltaAccess) do not support a character validation mechanism, one can still start to integrate existing data sets. The major task is that the character definitions of these projects must manually be made compatible. If you anticipate that you want to consolidate existing or planned projects in the future, you might already want to initiate discussions with your colleagues to make your respective character definitions convergent while your projects remain otherwise separate. The chapter Merging projects (data sets) discusses the practical problems involved in the integration of existing data sets using DeltaAccess and DELTA.
Already in the current version of DeltaAccess it is possible for scientists to work together on a common descriptor project. The character and item subsets facility of DeltaAccess can be used to present each person only those items and characters in which she or he is primarily interested. Each researcher can concurrently edit and analyze this subset without the danger of creating any inconsistency in the data set as a whole.
If it has been decided to implement an institution-wide descriptor database, and the problems outlined above are considered not to be relevant (e. g., because of closely related areas of work), one can ignore the multi-project capabilities of DeltaAccess and create just a single project. As the project name you could choose the name of your organization, or a generic name (e. g. "Descriptors").
It is important to remember that descriptor projects are not taxonomical name databases. The presence of the usually scientific item names in the item entity may lead to the false assumption that items are taxonomical entities. This is not the case. DeltaAccess is a descriptor database subsystem, which should be linked to a separate taxonomical database subsystem. Thus, in a global view, the item entity of a DeltaAccess information model would be a link table between taxonomic entities, literature references and descriptor definitions (character matrices).
Strategy and future development of DeltaAccess
Who should read this: People interested in the development guidelines of DeltaAccess and in an outlook on what can be expected from future versions of DeltaAccess.
The development of DeltaAccess is guided by two principles:
1. Concentrate on the management of "descriptors" of items. Other data should be stored in appropriate database subsystems. Such subsystems are literature reference databases, nomenclatural databases, and specimen collection databases. Subsystems should be linked using references to object identifiers in the subsystems. The architecture of the subsystems should be modular to allow the exchange of one subsystem against another. See the chapter Database subsystems for more details.
2. Make as much use of available software tools as possible. While the use of industry standard software interfaces may not lead to a remarkably small and fast application, it does minimize the development effort while maximizing the available functionality. See the chapter Software architecture of DeltaAccess for more details.
The current version of DeltaAccess is only the first step in a planned strategy. Besides implementation issues (character state mapping, illustrations, etc.), the major conceptual drawbacks of DeltaAccess are currently:
The next full version (2.0) is scheduled to fully implement images, perhaps also character mapping ("Key states"), and to improve the functionality of the internal interactive identification module Identify (external interfaces already make use of the compiled identification information starting with DeltaAccess version 1.3, but Identify itself needs to be revised). In that version or the version after that, the identification process will be changed to offer error-tolerant ("fuzzy") searches, and to offer estimates of the resolution power of characters based on the item description data. See also the chapter Limitations which will be removed.
Only after these single-project-changes have been made, will I try to improve the integration of descriptor projects into a global supra-project structure, including links to other database subsystems and integration of descriptor projects. See also Merging projects for a discussion of manual project merging.
Who should read this: Users who are interested in the idea of descriptor databases and in a theoretical discussion of how modular a database should be.
A common problem with current biological database implementations is their monolithic design. While parts of existing applications may be excellent, other parts may be lacking in general, or may be unsuitable for the needs of a specific organization. The solution to this problem is an integrated biological database built from modular applications. Such modules are called database subsystems in the current documentation.
DeltaAccess is a descriptor database subsystem. Other common database subsystems are literature reference databases, nomenclatorial databases, and specimen collection databases.
Both the information model and the application of each subsystem should have well defined interfaces to other subsystems. Subsystems should be seen as exchangeable modules. Thus the literature reference database could be replaced, if it did not fulfill the expectations and a better product was found. Such a design would also allow you to be independent of DeltaAccess in the future, should a different application be found to serve your needs better.
A descriptor subsystem contains observations on the objects. Usually this excludes attributes which are essential for the definition of the object or related to the management of objects. This distinction can often be made more clearly in the case of physical objects (e. g. a specimen unit) than in the case of abstract concepts (e. g. a nomenclatorial name). DELTA deals with both types of items. Some examples to clarify this distinction are given below:
Specimen: The location and date of collection cannot be observed. It is essential information about a specimen which can only be obtained from the original collector. Typical management attributes are: location in collection, history of origin, loan/exchange management, treatment for preservation, etc. The details of identification (including the possibility to preserve a history of multiple identification events) should also be treated separately.
Examples for specimen descriptors are size, secondary metabolites, or DNA sequences. Some descriptors form a special problem, because, although they are clearly descriptors, they can be observed only at the time of the collection event. Examples are habitus of the organism in the field, behavior of animals, association with other specimens, and any character which does not survive the conservation method, e. g., many flower colors. These are essentially descriptors of an observation unit from which the collection unit is derived. See Berendsohn et al. (1997) for more information about Collection management subsystems and this problem.
Literature reference: The authors, title, source, etc. are essential attributes of a literature reference. Some standard descriptors are commonly carried in the literature database subsystem itself, e. g., size of book (important for filing) and signatures. Index keywords are descriptors which should best be placed in a descriptor subsystem, although this is usually not done. Extracting item descriptions for a descriptor project from a literature reference is a special way of abstracting and indexing.
Nomenclatorial names: Essential attributes are the basionym, type specimen information, status of name, etc. The description of the taxon should be treated as descriptors attached either to the type specimen or to the literature reference which contains the protologue of the type.
In the descriptor database, only linking information (the identifier of a literature reference, the nomenclatorial name, or the code of a specimen) should be stored. For example, the scientific name of an organism can serve as a primary object identifier in a nomenclatorial database, allowing the retrieval of nomenclatorially relevant information about nomenclatorial combinations, synonymy, or the type specimen.
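A minimal Python sketch of this principle is given below. The two dictionaries stand in for a nomenclatorial subsystem and a descriptor subsystem; the identifier "N17", the field name name_ref, and the function item_label are assumptions made for illustration.

```python
# External nomenclatorial subsystem (hypothetical content).
nomenclature = {
    "N17": {"name": "Example species", "status": "accepted"},
}

# Descriptor subsystem: the item stores only the link, not the name itself.
items = [
    {"item_id": 1, "name_ref": "N17"},
]


def item_label(item):
    """Resolve the link at display time, so updates are picked up automatically."""
    return nomenclature[item["name_ref"]]["name"]


label_before = item_label(items[0])

# The name is corrected in the nomenclatorial subsystem only.
nomenclature["N17"]["name"] = "Example species corrected"
label_after = item_label(items[0])
```

The descriptor data were not touched, yet the item is displayed under the corrected name. Had the name been stored as plain text in the item, both copies would have had to be edited by hand.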
The current version of DeltaAccess does not implement subsystem links on the user interface yet, although the attributes are already present. See Links to other database subsystems in the appendix. Naturally, in the absence of appropriate database subsystems, characters can be introduced to store non-descriptor information inside DeltaAccess. This includes taxonomical information or lists of synonyms. No support for special requirements of such characters will be implemented in DeltaAccess though. The special features for nomenclatural purposes to be introduced in the new version of Confor (see Dallwitz et al. New Features for the DELTA System) will not be supported beyond import.
Software architecture of DeltaAccess
Who should read this: Users who are interested in a discussion of the advantages and disadvantages of using a relational database in general and Microsoft Access in particular.
DeltaAccess is based on a wide array of existing software tools. It uses a relational database as storage subsystem, industry standard software interfaces (SQL, ODBC, DCOM/OLE), a high level programming language, and a visual development tool. This strategy lets taxonomists benefit from the million-dollar programming efforts that are undertaken primarily for business applications. Using standard relational database management systems makes features like multi-user operation in local and wide area networks, data security models, and database replication available. Implementing these features in a dedicated taxonomic software package would probably not be possible, considering the limited resources available to the scientific and taxonomic community.
Microsoft Access was found to be a suitable tool for the current stage of the development of DeltaAccess. It allows the development of an application which can be used without modification both on a single PC and in a local area network. Theoretically, up to 255 users can use a single descriptor project concurrently. In practice, 20 permanently active users (e. g. typists) and perhaps 50 occasional users are quite realistic. To support significantly more users, wide area network operations, or high volume transactions from web interfaces, data can be ported to a large-scale SQL database server (see Linking to other data sources). The Microsoft Access application (e. g. DeltaAccess) can then be used as a client or "front end" application. The same feature can be used to create links to other database subsystems. For example, if your reference manager supports ODBC, you can link directly to your literature references.
The JET database engine used by Microsoft Access provides a fairly complete implementation of the relational database model, including declarative referential integrity with optional cascaded updates and deletes. The data are stored compressed (text attributes are not filled with blanks up to their specified length) and long text attributes may contain up to 65535 characters.
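What declarative referential integrity with cascaded updates and deletes means in practice can be tried out with any relational engine. The sketch below uses Python's built-in sqlite3 module as a stand-in for JET; the table and column names are assumptions and are not taken from DeltaAccess.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE characters (
        char_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE states (
        char_id  INTEGER NOT NULL
                 REFERENCES characters(char_id)
                 ON UPDATE CASCADE ON DELETE CASCADE,
        state_no INTEGER NOT NULL,
        name     TEXT NOT NULL,
        PRIMARY KEY (char_id, state_no)
    );
""")

conn.execute("INSERT INTO characters VALUES (1, 'leaf shape')")
conn.executemany("INSERT INTO states VALUES (1, ?, ?)",
                 [(1, "ovate"), (2, "lanceolate")])

# Renumbering the character is propagated to its states (cascaded update).
conn.execute("UPDATE characters SET char_id = 10 WHERE char_id = 1")
after_update = conn.execute("SELECT DISTINCT char_id FROM states").fetchall()

# Deleting the character removes its states as well (cascaded delete).
conn.execute("DELETE FROM characters WHERE char_id = 10")
remaining = conn.execute("SELECT COUNT(*) FROM states").fetchone()[0]
```

The rules are declared once in the table definition; no application code is needed to keep the two tables consistent.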
Microsoft Access and JET already provide advanced features like a tight security model (see Securing an Access database in the Appendix) and database replication without the help of a database server.
A further advantage of Microsoft Access is its unique combination of development tool and visual user interface. The visual tools and wizards allow users who have a full version of Microsoft Access (i.e. they do not use a run-time version) to modify the parts of the application to suit their needs. This ranges from minor changes in reports or analysis queries to adjust the look on a specific printer or paper format, to major extensions and improvements to suit their own needs (see the chapter Co-Development issues in the appendix). Microsoft Access is very widely distributed as part of Microsoft's Office package.
Some of the disadvantages of the software architecture chosen are
Since most of the functionality of DeltaAccess is based on Microsoft's VBA (Visual Basic for Applications) and JET (Joint Engine Technology), it is possible to use other front end tools, including Visual Basic and Visual C, in the future. Such applications (e. g. a fast, specialized item description editor) could be developed by other people as well (see Co-development issues in the Appendix). Provided that they use JET and the same information model, such applications could even be used simultaneously on the same descriptor project data (by a single user or even by multiple users in a network).
The development of a fast stand-alone application (e. g. in C) with the number of features DeltaAccess offers is beyond my resources. You need a sizable programming team to do that. If you are just looking for a fast editor to edit your DELTA data sets more comfortably and with fewer errors, and if you feel comfortable with DOS programs, you may want to look at the shareware program TAXASOFT by Eric Gouda. You can also look at the new Windows versions of M. Dallwitz's DELTA package, which are currently being developed and contain a good and useful editor for DELTA.