DeltaAccess - Identify

DeltaAccess –
a SQL interface to DELTA (Description Language for Taxonomy), implemented in Microsoft Access

Interactive identification

   Introduction to interactive identification
   Identify in comparison with other identification software
   DELTA based multiple entry, interactive identification outside the realm of taxonomy
   Interactive identification: The Identify dialog box
   Compiling a project for interactive identification
   Next table of contents: Appendix 1: Advanced database administration information


Introduction to interactive identification

Who should read this: People interested in the concept of interactive identification programs in general. For help with the identification module of DeltaAccess, go directly to Identify.

Interactive Identification is the term adopted in this documentation for a process which is also known as online-identification, computer assisted identification, electronic synoptic keys, multiple entry keys, or polyclave keys.

In a conventional dichotomous key the user is presented with a question with two (or sometimes more) alternative answers. After each answer, the next question with which one has to continue is indicated. Thus a strict, treelike hierarchy of questions must be answered to reach a successful identification. Two methods of printing such a key exist:

1 Leaves simple ... 2
1* Leaves compound ... 4
2 Plant more than 50 cm high ... S. magna
2* Plant less than 40 cm high ... 3
3 Flowers yellow ... S. lutea
3* Flowers white ... S. alba
4 Fruits hairy ... S. pubescens
4* Fruits smooth ... S. glabra

1 Leaves simple
   2 Plant more than 50 cm high ... S. magna
   2* Plant less than 40 cm high
      3 Flowers yellow ... S. lutea
      3* Flowers white ... S. alba
1* Leaves compound
   4 Fruits hairy ... S. pubescens
   4* Fruits smooth ... S. glabra

The second method has certain advantages, because the hierarchy of the key is easily understood. For example, in the first method it is much more difficult to see that S. glabra must have compound leaves. The main disadvantage of the second method is that in large keys the alternative questions are hard to find: they may be several pages away from the first alternative. Also, a key is often too long to allow any indentation like the one used in the example.

A major problem of printed dichotomous keys is the fixed order of questions. Authors of dichotomous keys often put a lot of effort into determining which questions are the most useful to start with. Yet, since the answer depends strongly on the condition of the object to be identified, they often cannot find an optimal solution. Examples are the stage of the life cycle of the organism (insect larva or imago?) and the organs that are available (flowers or fruits?). Even when all questions can theoretically be answered for a well collected, well preserved specimen, the specimen at hand might be incomplete or damaged.

Another common situation is that certain features of an object are rather prominent and could lead to a safe and fast identification. Yet, since only a very small fraction of all items have such a feature, it will never be used anywhere near the start of a printed dichotomous key. Before a user reaches the place where these questions are actually asked in the printed key, she or he might have been unable to answer several other questions.

Printed synoptic keys offer a solution to these problems. The basic idea behind this type of key is that the user can determine which questions he or she wants to answer first. In synoptic form, the example above could look like this:

1 Leaves
   a) simple: 1, 2, 3
   b) compound: 4, 5
2 Plant height
   a) more than 50 cm high: 1, 5
   b) less than 40 cm high: 2, 3, 4
3 Flower color
   a) yellow: 1, 2
   b) white: 3, 4, 5
4 Fruit surface
   a) hairy: 4
   b) smooth: 1, 2, 3, 5

Species:
   1 S. magna
   2 S. lutea
   3 S. alba
   4 S. pubescens
   5 S. glabra

The identification of S. glabra could start with plant height: > 50 cm and flower color white. Since the intersection of the two sets {1, 5} and {3, 4, 5} yields {5}, the identification is already complete, without the need to use any fruit characters.

The major disadvantage of printed synoptic keys is that they are unwieldy for groups with more than around 40 species. Writing down large lists of taxon or item numbers for each question answered, and manually determining which numbers occur in all sets, is difficult and tedious for large keys.

A computer program can easily perform the manual work involved in intersecting the sets for several questions. Such a program would fit the terms "multiple entry identification" or "electronic synoptic keys".
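The set intersection described above can be sketched in a few lines of Python. This is only an illustration of the principle using the example key; the names SYNOPTIC_KEY, SPECIES and identify are invented for this sketch and have nothing to do with how DeltaAccess stores or queries its data.

```python
# Synoptic key from the example: for each character state,
# the set of species numbers that show this state.
SYNOPTIC_KEY = {
    "leaves": {"simple": {1, 2, 3}, "compound": {4, 5}},
    "plant height": {"more than 50 cm": {1, 5}, "less than 40 cm": {2, 3, 4}},
    "flower color": {"yellow": {1, 2}, "white": {3, 4, 5}},
    "fruit surface": {"hairy": {4}, "smooth": {1, 2, 3, 5}},
}

SPECIES = {1: "S. magna", 2: "S. lutea", 3: "S. alba",
           4: "S. pubescens", 5: "S. glabra"}


def identify(answers):
    """Return the species numbers compatible with all answers.

    `answers` is a list of (character, state) pairs; the order in
    which the questions are answered does not matter.
    """
    remaining = set(SPECIES)
    for character, state in answers:
        remaining &= SYNOPTIC_KEY[character][state]
    return remaining


result = identify([("plant height", "more than 50 cm"),
                   ("flower color", "white")])
print(sorted(SPECIES[n] for n in result))  # prints ['S. glabra']
```

Because each answer only removes candidates from the remaining set, the user is free to start with whichever character is easiest to observe.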

Many interactive identification programs go one step further. They are truly interactive in that they support the user in deciding which question to answer next, by ranking the available questions by usefulness. This ranking is local to the subgroup of items remaining after the preceding questions. Since no fixed sequence of identification steps must be followed, an optimization function tries to determine the most valuable next steps.
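As a minimal sketch of such a ranking, the following Python function scores each character by the number of items one would expect to remain after answering it, assuming that every remaining item is equally likely. This measure is an assumption chosen for simplicity; it is not claimed to be the function used by any particular identification program, and the names KEY and rank_characters are hypothetical.

```python
# The example synoptic key: character -> state -> set of species numbers.
KEY = {
    "leaves": {"simple": {1, 2, 3}, "compound": {4, 5}},
    "plant height": {"more than 50 cm": {1, 5}, "less than 40 cm": {2, 3, 4}},
    "flower color": {"yellow": {1, 2}, "white": {3, 4, 5}},
    "fruit surface": {"hairy": {4}, "smooth": {1, 2, 3, 5}},
}


def rank_characters(key, remaining):
    """Sort characters by how well they split the remaining items.

    Score = expected number of items left after answering the character,
    assuming all remaining items are equally likely (lower is better).
    Ties are broken alphabetically by character name.
    """
    scores = []
    for character, states in key.items():
        sizes = [len(items & remaining) for items in states.values()]
        total = sum(sizes)
        if total == 0:
            continue  # character not scored for any remaining item
        expected_left = sum(n * n for n in sizes) / total
        scores.append((expected_left, character))
    return [character for _, character in sorted(scores)]


# Among all five species, "fruit surface" is the least useful character ...
print(rank_characters(KEY, {1, 2, 3, 4, 5}))
# ... but among the compound-leaved species 4 and 5 it is among the best.
print(rank_characters(KEY, {4, 5}))
```

The two calls illustrate why the ranking has to be recomputed after every step: a character that barely separates the full set of items may separate the current subgroup completely.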

Finally, it is desirable that such a program offers verification options after an identification has been reached. This can be done by offering a full natural language description of the item, which incorporates all available information, or by offering specific questions which support the current identification.

Unguided identification works well only up to a certain number of taxa. The computer only extends the usefulness of synoptic keys. The problem remains that the number of characters increases with the number of organisms, and this eventually becomes the limiting factor. If the list of features offered becomes too long, the identification process becomes tedious.

Also, hardly any identification program will be able to lead untrained users to successful identifications. The best a program can do is to offer support in the training itself, with liberally supplied annotations and illustrations.

Identification is always an interaction between the previous knowledge of the person using an identification program and the information database available to the program itself. This often leads to conflicting interests in the optimization of the identification process. The question whether an organism is a plant or not can be highly effective for someone sufficiently knowledgeable. But giving a single definition of a plant is not easy. Not every plant is green (e. g., saprophytic or parasitic plants are sometimes without any chlorophyll) and green organisms are not necessarily plants (e. g., some Porifera have acquired symbiotic algae). The knowledge of what is a plant and what is not is closely linked to at least a superficial recognition of the main branches of the plant and animal kingdoms. A person without biological training might still be able to answer such a question in many cases, but if she or he is asked next whether the organism is a moss or not, she or he will often be lost, even when given an exact definition of a moss.

A related problem concerns the rules for defining a character. Complex characters are usually undesirable for analysis purposes. For example, whether the surface of a leaf has a thick waxy cuticle through which it becomes shiny, or whether the leaf has simple, complex, straight, curved, or glandular hairs, are anatomically unrelated things. Yet, since a hairy surface is never shiny, the conventional complex character "leaf surface" would encompass all these things.

For identification purposes, it will often be preferable to stick with conventional, complex characters. The user is well acquainted with them, and the faster the character list can be comprehended through reference to well known concepts, the better an identification data set is.

The use of illustrations for the same purpose should not be underestimated. Illustrations do not only improve the exactness of the identification in cases where it is difficult to convey the form of some organ in words; detailed illustrations or photos would be chosen for such a purpose. Simple, schematic drawings are also essential to make the user comprehend the character list fast. These schematic illustrations are probably more important to the acceptance of an identification program than the detailed ones, although both types should be available.

For a general discussion of which other disciplines might benefit from the use of such software, see the chapter Uses of multiple entry, interactive identification packages outside the realm of taxonomy.

The module used for interactive identification in DeltaAccess is Identify. It is still under development: The current version is geared more towards analysis and information retrieval than towards true, error-tolerant identification. No character ranking or verification options are implemented yet. See the chapter on DELTA software sources on the internet for other DELTA based identification programs. Also, at the end of that chapter several lists of identification programs in general are referenced.

For a conceptual comparison of Identify with other identification software see the following chapter.


Identify in comparison with other identification software

Who should read this: People interested in interactive identification programs in general and in how the concept of Identify compares with other programs. This chapter provides no further information for the use of DeltaAccess or Identify. See also the previous chapter Introduction to interactive identification, which introduced the general concepts of interactive identification.

Probably most identification programs follow the traditional concept of printed monographic books. Ideally, the data and expertise for the identification process are collected from experts, prepared and tested for use by the general public and published in a user-friendly form. In this context, the identification tool serves as a publishing medium to carry a one-way flow of information from the expert to the user.

In contrast, DeltaAccess is designed to be a working tool of scientists themselves. Today, a word-processor is no longer a tool for a specialist, to whom the scientist passes the hand- or typewritten manuscript, but the production tool of the scientist her- or himself. Databases should become production tools as well.

Using DeltaAccess, a team of scientists can collect data for their own use, and review and revise the data. They can analyze the data for errors, omissions, and for insight, e. g., by performing statistical analyses. An important feature of DeltaAccess is that it can link directly into other databases, like nomenclatural, specimen, or literature reference databases. Using the subset feature of DeltaAccess, each user can be supplied with a view restricted to those characters or items which she or he works with. In a Local Area Network all changes can occur concurrently and are immediately visible to all users. In the future, using replication technologies already available in Microsoft Access, it will be possible to edit data offline as well, e. g., on a notebook while on excursion. Periodically the replicated databases can be connected, and the main as well as the child databases are updated.

The Identify tool of DeltaAccess for interactive identification should be seen in this context. It operates on live data, on "work in progress". It is designed primarily as an analysis or information retrieval tool for the scientist. Using it to find items that fulfill certain criteria is a helpful working tool, and a constant validation of the correctness of the information used in identification as well. Not only is Identify used to create item subsets, or select items in the MultiItemEdit dialog box, it also branches directly into item editing mode if the button above the Item list is clicked.

Identify can nevertheless be used as a full identification tool, but neither its implementation nor its documentation is currently fit for distribution as a stand-alone publishing tool. Identify is currently marked as a "preview", because some important features like character state mapping and images are not yet fully implemented in the current version of DeltaAccess.

The "publishing" kind of identification tools mentioned first are as important as are interactive data repositories like DeltaAccess. Both tools complement each other. The publishing tools are most important for teaching and training purposes, as well as for applied identification purposes. Both tools need interfaces for the web, as well as stand-alone applications to use on notebooks or personal digital assistants (PDAs).

Identify could be (and hopefully will be) developed into a publishing tool as well, intended for end users in read only mode. Its implementation would probably use another development tool to make it independent of MS Access. Ideally, it would still use the Jet Engine and mdb data structure (e. g. Visual Basic and C can do this), which would allow it to be integrated into the networking environment described above.

---

Each identification tool must assess which audience is expected to use the software. There will certainly be a market for attractive, polished multimedia packages. One such market is, for example, high school and perhaps some undergraduate courses. Yet, in most cases, a simple flip-chart containing the 20 observable species of whales, or a field guide with annotated pictures of birds, will be more useful, more attractive, lighter, and more waterproof than computers will be in the near future.

The basic assumption of DeltaAccess and Identify is that there is less need for polishing existing knowledge than for collecting knowledge. DeltaAccess and Identify aim at the professional or amateur scientist. Some identification tools seem to assume that the major problem with the identification of biological objects is that the available data are too difficult to handle. This might be the case in some situations, especially if the available literature lacks adequate illustrations. Yet, in many cases the problem is much more difficult in that there are no up-to-date keys available at all, or that only keys to small subgroups can be found in specialized articles. In the latter case, much identification experience (to recognize the subgroup) and bibliographic knowledge are required before an identification can be obtained.

Given the current lack of taxonomic expertise, it cannot be expected that sufficient funds will become available to have the systematists do their basic work, and then convert and tailor this information for use in interactive identification programs.

The position of DeltaAccess as a working tool is essential to improve this situation. If software requires much effort in coding the data for the sole purpose of publishing an identification program, it will be of limited use. DELTA based data sets can be used to prepare printed monographs as well as interactive identification packages. They can contain much more information, and especially more detailed information, than the interactive identification package would include. Using DeltaAccess, the scientist can use the data set as primary data storage to organize and analyze her or his work while it is still progressing. At the end, the data should be made available to the scientific community, so that they can be integrated into more comprehensive data sets.

The data themselves should be made available in a format which can be modified and enhanced. Science is an enterprise concerned with progress, not final products. Software should at least offer a facility for users to annotate the information provided with their own knowledge. Preferably, the data themselves should be modifiable, so that added knowledge is immediately used during the next identification runs. Any printed key provides the latter opportunity!

Unfortunately, the current tendency seems to be to remove DELTA data sets from public access. Almost every available "DELTA" data set is in fact encrypted in the proprietary format which IntKey uses. The understandable concern is the protection of copyright and of the effort necessary to create a large data set. DeltaAccess databases can also be protected using either a database password or the Microsoft Jet security model. The preferred method of copyright protection should be fingerprinting, though.


DELTA based interactive identification packages outside the realm of taxonomy

Who should read this: People exploring the use of identification software in other disciplines for non-biological objects. This chapter provides no further information for the use of DeltaAccess or Identify.

The concept of the synoptic keys and interactive, multiple entry software was developed mostly in the context of identification of biological organisms. Under which conditions can this concept be used for similar tasks in other disciplines?

Organisms usually have a large number of features ("characters"), many of which may be redundant for identification. This redundancy often depends on the state of other features. A feature may be non-informative in one subgroup, but necessary for identification in another subgroup. Some features may be reliable but rarely accessible; other features are comparatively unreliable but readily accessible. A feature may be unreliable because it is highly variable in itself, or because the observation process is problematic, so that both the observation of the object under identification and the recorded data are usually unreliable.

A task where the similarities are obvious is the diagnosis of human illnesses. The primary diagnostic feature is often inaccessible (e. g., because an operation would be necessary to achieve an unambiguous diagnosis, or because a routine test would be too costly), so that many illnesses must be deduced from secondary symptoms. These symptoms may be slightly different in each person. After the tentative initial diagnosis based on well accessible characters, a verification of the diagnosis based on less accessible features (costly in terms of time or money) is usually done. An interactive diagnosis program could make proposals for the most cost-effective diagnostic routines in the present case. Such a program certainly could not replace a practiced physician, and probably no physician would ever turn to such a program in common cases. A potential use for such a program could be rare illnesses, which physicians currently have to look up in books. Examples are poisoning cases, or tropical diseases in non-tropical regions.

The identification of cultural artifacts in museums or in archeological projects could be a potential task for interactive identification programs. Museum objects which have been mishandled, or where the label is missing, could be re-identified. The use in archeology beyond training purposes will often be limited, because of the characteristic lack of characters. One example where DELTA has already been used is the identification of pottery (Louhivuori 1996).

The perhaps most promising application in this area could be the identification of stolen works of art, or stolen valuables in general. The international trade in stolen art works and antiques is a serious problem. To discourage art theft and fraud, and to help in the recovery of stolen art, it is essential that as many collectors, auctioneers, and art traders as possible have an opportunity to check offers against catalogues of stolen property. Such catalogues exist in printed versions as well as on the internet (see, e. g., The Art Loss Register, www.artloss.com). Achieving meaningful results without having to browse through hundreds of illustrations usually requires a good knowledge of the way in which the items are described. The Art Loss Register employs trained art historians for information retrieval (www.artloss.com/alrinfo.htm, 11.3.98), which makes the use of the register expensive. Interactive identification software could potentially increase the identification success of untrained persons considerably. Works of art have a large number of features, many of which are applicable only under certain circumstances. An integration of illustrations into the identification process is vital. As with biological objects, readily available features may lead only to a tentative identification, which must then be verified by costly expert opinions or analytical procedures. A differentiation between the reliability and the availability (or cost) of a feature is essential to find the most cost-effective verification method. Such a differentiation is not available in the DELTA standard, but has been implemented as an added feature in DeltaAccess.

Many general information retrieval situations can be understood as "identifications". Interactive identification usually has limited value in the case of artificial classifications or hierarchies. Neither in a library organized by the Dewey decimal system, nor in a database of legal cases organized by an abstract legal classification, will the approach of interactive, synoptic identification programs be especially useful. These classifications rarely have redundant features, so that all questions must be answered anyway. Although some questions may be necessary only in part of the hierarchy, following the hierarchy will usually already be an optimized search tree.

Yet, even in these cases some insight gained from interactive identification systems could be useful. Literature search queries conventionally work "bottom up", i.e. you have to enter keywords, and get all articles for which these keywords have been entered. A problem is that the use of many of these keywords is not immediately clear to a non-librarian researcher. Without knowledge of the thesaurus used when the articles were indexed, it can be very difficult to find the right keywords. This problem is increased by the fact that keywords are often used inconsistently, and that some keywords may be outdated while other keywords have been used only in new entries, but not in the older literature. The effect is that literature queries by scientists who have not become experts in information retrieval often return too few, too many, or entirely the wrong articles.

One solution could be to use a "top down" approach in the selection (or "identification") of the desired articles. Starting with a guided identification key, which could be either designed by experts or calculated from the data themselves, the user could then be offered a free query, which in principle allows any keyword to be entered, but restricts the pick-lists of keywords to those relevant to the remaining articles. As in biological identification programs, an option could be offered to sort these remaining keywords by their "identification usefulness", i.e. how well they split the remaining articles into distinct groups.
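A restricted and sorted pick-list of this kind might be sketched as follows. The article data, the function names, and the usefulness measure (how close a keyword comes to splitting the remaining articles in half) are all assumptions made for this illustration only.

```python
# Hypothetical article ids with their index keywords.
ARTICLES = {
    "a1": {"lichens", "taxonomy", "europe"},
    "a2": {"lichens", "ecology", "europe"},
    "a3": {"lichens", "taxonomy", "asia"},
    "a4": {"mosses", "taxonomy", "europe"},
    "a5": {"lichens", "ecology", "asia", "chemistry"},
}


def remaining_articles(articles, chosen):
    """Ids of all articles that carry every keyword chosen so far."""
    return {aid for aid, keywords in articles.items() if chosen <= keywords}


def pick_list(articles, chosen):
    """Keywords still worth offering, most useful first.

    Only keywords that occur in the remaining articles are offered.
    Keywords present in every remaining article are dropped, because
    selecting them could not narrow the query any further.
    """
    remaining = remaining_articles(articles, chosen)
    counts = {}
    for aid in remaining:
        for keyword in articles[aid] - chosen:
            counts[keyword] = counts.get(keyword, 0) + 1
    half = len(remaining) / 2
    useful = [kw for kw, n in counts.items() if n < len(remaining)]
    # A keyword found in about half of the remaining articles splits best.
    return sorted(useful, key=lambda kw: (abs(counts[kw] - half), kw))


print(pick_list(ARTICLES, {"lichens"}))
# prints ['asia', 'ecology', 'europe', 'taxonomy', 'chemistry']
```

In this sketch "mosses" never appears in the pick-list once "lichens" has been chosen, so the user cannot construct a query that returns no articles at all.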

The new approach would be to narrow the scope of the query systematically, until the set of remaining articles is large enough to be reasonably sure that the desired articles are in the set, but small enough to allow interactive support of user decisions in a free keyword query.

Another test case could be real estate sales. The buyer could attempt to "identify" offers for sale, given her or his very personal preferences. A very useful feature of biological interactive identification software would be the free choice which questions to answer first (while offering a default sequence of questions, which the agent considers most useful) and the implementation of inapplicable characters, which are quite frequent in real estate situations.

Some of the features found in interactive identification systems could be useful in cases where no real classification exists, and where large data sets are used for exploratory research. For example, in a business database one might be interested in whether other operations, which fulfill some criteria the researcher deems significant, have similar problems in a certain field. Existing applications often use multivariate statistical methods (clustering, PCA) to find the most similar enterprise. Yet it might be useful to leave the choice of questions to the intuition of the researcher in certain cases. While this could be accomplished with conventional information retrieval interfaces, a unique feature of identification systems, the option to ask for "verification of identification", could be very useful to gain additional insight, which might support or reject the initial hypotheses.

Further information on existing interactive identification programs (mostly not based on DELTA) can be found in Pankhurst (1991). Beyond examples from biology itself, examples from medicine, pharmacognosy, and geology are mentioned there. Relevant internet link pages to interactive identification programs can be found in the chapter DELTA software sources on the internet of this documentation.


Questionnaires

Questionnaires (e. g., in sociology) can be viewed as descriptive data. DeltaAccess provides a general tool to handle questionnaires. If one tried to handle questionnaire data directly in a table, the maximum of 255 fields per table would limit the number of options in a single table. DeltaAccess effectively removes such limitations.

The most important character definition attributes are CharName for the name to be displayed for a variable, Type, and MultistateType. Set the latter to Exclusive to prohibit entering multiple choices for a single question. You can further structure a questionnaire using Character headings.

An interesting option is to provide a categorical character/variable (use Type, UM for unordered and OM for ordered categorical data) and enter the common options as character states in the character editor. Then add an additional state 'TE' to give an option to directly add additional states. Such "Other, please specify:" choices are a common feature of many questionnaires.

Especially useful in this context is the option of creating item descriptions as HTML forms (see Creating item descriptions as HTML forms). These forms provide an easy way to put a questionnaire on the Internet for direct fill-in.


Interactive identification: the Identify dialog box
This part is a preview of what will be possible in the next version. It can currently be used for information retrieval only and has only limited functionality.

If you have not yet done so, you might want to read the Introduction to interactive identification.

Character groups and character list: On the left side of the dialog box you see a list of characters. The combo box on top of the character list defines which named character group is displayed in the list. Currently the default is "All categorical and numerical characters". See Character list for more information.

Conditions: After you have selected a character from the character list you can specify a condition for this character. Depending on the character type, different sets of controls appear on the right side of the character list. Follow one of the following links for information about defining a condition for categorical, numerical, and text characters.

The place in the lower center part of Identify is reserved for images, which are not implemented yet [*** scheduled for a future release].

Identification steps: Each identification step you have added will be displayed in the list in the upper right part of Identify. You can select a step and delete it using the -button. See Identification steps for more information.

Evaluate button: After you have entered one or several conditions you can execute the identification by clicking on the Evaluate-button. This will execute the query based on all conditions defined so far, fill the item list, and revise the character list (the latter only if less than 500 items remain *** NOT YET IMPLEMENTED). The symbol indicates that you filter those items, which fulfill the defined conditions:

Items or taxa remaining: Displays the number and names of all items which fulfill the defined conditions. See List of remaining items for more information. You can open the remaining items in the item edit form by clicking on the button with 3 dots. If you select one or several items from the list, you are given the option to open only the selected items, instead of all remaining items.


Identify: Named character groups

Any character may be a member of several named character groups. Named character groups are stored in the _Char_Heading table and can be edited using the Edit character headings form. See the chapter Overview over uses of character headings for further information.


Identify: Character list

On the left side of the Identify dialog box you see a list of characters. The combo box on top of the character list defines which named character group is displayed in the list. Currently, the default shows all categorical (UM/OM) and numerical (IN/RN) characters. Alternatively you can search in text characters or item description notes (notes of all character types), or enter a restriction for the item name.

For each character, the number (CID), Type (UM/OM/IN/RN and TE), and character name are shown. The column labeled "!" indicates whether a character is mandatory, i.e. present in all items.

The standard sorting is based on an internal estimation of the separation power of characters. The characters most likely to give fast progress in the identification should be listed first (*** Currently only based on the reliability defined for a character, not on the item descriptions like in M. Dallwitz's IntKey. ***). By clicking on the column headers above the list you can alternatively sort the list by character number (CID), character type, or character name. Click on Std to restore the default sorting order.


Identify: Selection of categorical character states

For categorical (multistate, UM/OM) characters you now select one or several of the character states listed. The list is initially sorted by character state code (CS); you can sort it by character state name by clicking on the column header labeled CharStateName. The first column labeled Items displays the number of items⁽¹⁾ for which this state was scored in the project. The column labeled with "*" indicates that a character state where a "*" is displayed has been defined as implicit.⁽²⁾

You add the states to the list of identification steps by clicking the Present-button.

If you select several character states, as shown in the example above, you specify that you are uncertain about which state applies. If, e. g., you select both 2 and 3, the object in question may have state 2, or state 3, or both (i. e. state 2 OR 3).
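The OR logic of a categorical condition, together with the treatment of unknown or variable items explained in note (1), can be illustrated with a short Python sketch. The item data and the function name matches_states are hypothetical; Identify itself evaluates conditions as database queries, which are not reproduced here.

```python
# Hypothetical items: the state codes scored for one character,
# or None where the item is coded as unknown or variable.
ITEMS = {
    "item A": {2},
    "item B": {3, 4},
    "item C": {1},
    "item D": None,
}


def matches_states(item_states, selected, exact=False):
    """True if the item fulfills a categorical condition.

    Selecting several states means OR: one shared state is enough.
    Items without a definite score are kept unless `exact` is set,
    mirroring the Exact-checkbox described in note (1).
    """
    if item_states is None:
        return not exact
    return bool(item_states & selected)


selected = {2, 3}
print([name for name, states in ITEMS.items()
       if matches_states(states, selected)])
# prints ['item A', 'item B', 'item D']
print([name for name, states in ITEMS.items()
       if matches_states(states, selected, exact=True)])
# prints ['item A', 'item B']
```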


(1) Item distribution among character states: The column labeled Items shows the number of items to which a state applies. The number always refers to all items, not just the items remaining. Thus previous errors in the identification will not influence you in your decisions. Note that more items may remain after evaluation than the number listed here. Unless the Exact-checkbox has been set, items coded as unknown or variable (U/V/TE) for the selected character are included in the remaining items, but not in the count in the character state list.

(2) Implicit character states (indicated by a "*" in the second column of the character state list) may lead to problems when retrieving information directly from an uncompiled descriptor project. Items where a state is only implicitly present will not be retrieved. No problems occur when using a project which has been compiled for identification, because these states have been added there.

Identify: Using numerical characters

For numerical (IN/RN) characters you either add a single value or a range of values.

Single value:

The difference between the two options "Single measurement" and "Average of several measurements" is that a single measurement may be an extreme value. Thus all items for which this value lies within the range from minimum to maximum are included in the resulting item set. If you specify that your value is the result of several measurements, the value must be within the normal range specified for a given item.

Selecting a single measurement is very restrictive. You can optionally add an error margin to your value, to make your condition less stringent. In the example above, all items between 5 and 15 are included in the resulting item set. Note that the result of the error margin is immediately shown in the gray range controls (Between __ And ___) at the bottom.

Range of values:

A range of values is considered inclusive. In the example above items with 8 or 15 would be included in the resulting item set. Click on the Add-button to add the numeric condition to the list of identification steps.
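The single-value options and the error margin described above can be sketched as follows. This is a minimal illustration in Python, not the DeltaAccess implementation. The class NumericItem, its field names, the sample numbers (a value of 10 with an error margin of 5), and the use of an inclusive overlap test once the margin has widened the value into a range are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NumericItem:
    """Hypothetical item description for one numerical character."""
    name: str
    minimum: float  # extreme minimum
    low: float      # lower end of the normal range
    high: float     # upper end of the normal range
    maximum: float  # extreme maximum


def search_range(value, margin=0):
    """Widen a single value by an optional error margin (Between lo And hi)."""
    return value - margin, value + margin


def numeric_item_remains(item, lo, hi, single_measurement):
    """Inclusive overlap test between the search range and the item range.

    A single measurement may be an extreme value, so it is tested against
    minimum..maximum; an average is tested against the normal range only.
    """
    if single_measurement:
        item_lo, item_hi = item.minimum, item.maximum
    else:
        item_lo, item_hi = item.low, item.high
    return lo <= item_hi and hi >= item_lo


example_item = NumericItem("Example item", minimum=2, low=4, high=8, maximum=12)
```

With these sample values, a single measurement of 10 keeps the item (10 lies between minimum 2 and maximum 12), an average of 10 does not (10 is outside the normal range 4 to 8), and an average of 10 with an error margin of 5 keeps it again, because the range 5 to 15 overlaps the normal range.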


Identify: Using text characters and Notes

If you have selected a text character (TE) from the character list on the left side of Identify, you see the following controls in the upper center:

You can enter two text strings, each of which will be searched as an instring⁽¹⁾. You can decide whether you want to combine the two strings with the AND or the OR operator.⁽²⁾

Querying text characters in retrieval mode is rather slow, because no index is available. It is much faster when a compiled item description is used, since a word index is created during compilation (the use of compiled item descriptions has not yet been implemented in Identify). To add the condition to the list of identification steps, click on the Add-button.


(1) Instring search: The search text may be anywhere in the text. An exact search for "XYZ" would not find a "See XYZ.", while an instring search would find it. The SQL statement for an instring query would be "Like '*XYZ*' ".

(2) Adjacent words. In the above example the first string consists of several words. When Identify is used directly on the current item description (uncompiled) it will execute a true instring search and a record containing "Costa Brava, Rosa Rica" would not be found, while "Costa Rica, ANAROSA" would be found.

When using the compiled item description with a word index, the sequence of words is currently ignored, and the words may not start with additional letters. Thus the situation above is reversed: "Costa Brava, Rosa Rica" would be found, while "Costa Rica, ANAROSA" would not be found. The SQL used on the compiled word index would look like: "Where 'Costa*' and 'Rica*' and 'Rosa*' ".
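The difference between the two search modes described in footnote (2) can be illustrated with a small Python sketch. It is not the code used by Identify. It assumes that the two search strings of the example are "Costa Rica" and "Rosa", combined with AND, and that the search ignores case; the function names are invented for this illustration.

```python
import re


def instring_search(text, strings):
    """True instring search: every search string must occur somewhere in the text."""
    lowered = text.lower()
    return all(s.lower() in lowered for s in strings)


def word_index_search(text, strings):
    """Word index search: every search word must be the beginning of some word.

    The sequence of words is ignored, and an indexed word must not carry
    additional letters in front of the search word.
    """
    indexed_words = [w.lower() for w in re.findall(r"\w+", text)]
    search_words = [w.lower() for s in strings for w in re.findall(r"\w+", s)]
    return all(
        any(word.startswith(term) for word in indexed_words)
        for term in search_words
    )


search_strings = ["Costa Rica", "Rosa"]
record_a = "Costa Brava, Rosa Rica"
record_b = "Costa Rica, ANAROSA"
```

Here instring_search finds record_b but not record_a, while word_index_search finds record_a but not record_b, which is the reversal described above.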

Identify: Character illustrations

If any images are available for the character or character state selected, these will be shown in the lower center. [IMAGES ARE NOT YET IMPLEMENTED]


Identify: Identification steps

In the upper right part of the Identify dialog box all conditions entered so far (see using categorical, numerical, and text characters) are listed. This allows you to verify the steps you have entered. For each step, two lines are displayed. The first line gives the character, the second line an abbreviated statement of your condition.

During identification, you may reconsider a previous identification statement. If you want to remove a condition, select it and click on the "reset single" button.

Clicking on the "reset all" button resets the Identify form to the state it was in when it was initially opened.

In the current version of Identify you cannot modify conditions after they have been added. Instead, remove the condition and add it again in the modified form. The sequence of conditions has no influence on the outcome.


Identify: List of remaining items

In the lower right part of the Identify dialog box the number and name of all items that fulfill the defined conditions (see Identification steps) are listed. The items are sorted by name.

The list is displayed for the first time after you have performed your first evaluation (Evaluate button). Whenever you add further identification steps, the background of the list changes to gray to indicate that the list is not current and that a new evaluation is necessary. The Exact-checkbox determines whether unknown or variable characters are assumed to be identical with your selection or not.

To the right of the total number of remaining items a button with 3 dots is shown. You can open the remaining items in the item edit form by clicking on this button. If you select one or several items in the list, you are given the option to open only the selected items or all remaining items.


Identify: Item list options

If you click on the -button above the item list in Identify, the following dialog box opens: ...

[*** New Navigation dialog box, not yet incorporated into version 1.6. Link from previous chapter is needed]


Identify: Evaluation mode "exact" (checkbox)

Exact (checkbox): For categorical and numerical characters you can decide whether unknown/variable states should be assumed to be identical with the character states you have selected (not exact) or not (exact). "Not exact" is preferable for identification purposes, since otherwise all items coded as unknown are excluded. For information retrieval, if you want to obtain a list of items where a certain state has positively been entered into the database, you might want to change the setting to "Exact".

If you accept the default "not exact", all items coded with the special character states U, V, or TE are treated as if they had the character state you are searching for. Note that if you exclude character states (press the Exclude = minus-symbol button), the result will always be exact. See also Combining character states with logical operators and Discussion of the SET MATCH terminology used by IntKey.
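The effect of the Exact-checkbox on a single identification step can be sketched in Python as follows. This is an illustration of the rule described above under simplifying assumptions, not the query that Identify actually builds; the function name and the representation of an item as a set of state codes are invented for the example.

```python
SPECIAL_STATES = {"U", "V", "TE"}


def categorical_item_remains(item_states, selected_states, exact=False, present=True):
    """Decide whether an item remains after one categorical step.

    present=True  : the selected states were added with the Present-button.
    present=False : the selected states were excluded; this is always exact.
    """
    item_states = set(item_states)
    hit = bool(item_states & set(selected_states))
    if not present:
        # Exclusion never treats U/V/TE as if they were the excluded state.
        return not hit
    if hit:
        return True
    # "Not exact": unknown or variable items are treated as a match.
    return not exact and bool(item_states & SPECIAL_STATES)
```

An item coded only as "U" therefore remains in the default mode, is dropped when Exact is set, and remains when the selected states are excluded.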


Identify: Combining character states with logical operators

All entries shown in the list of Identification steps are combined with the AND operator.

You can make multiple selections in the character state list box. Multiple character states selected in a single step are assumed to indicate uncertainty (if the color could be called yellow or orange, you select both states), and are combined with the OR operator.

To expressly specify that your object has character state '1' AND character state '2' of the same character, you can add these conditions in two steps, pressing the Present-button each time. The result should be two entries in the list of Identification steps.


Discussion of the SET MATCH terminology used by IntKey

Who should read this: People used to the logic and terminology used in M. Dallwitz's IntKey program.

DeltaAccess follows the usual database terminology of setting conditions which must be fulfilled by all items returned (predicate logic), while IntKey uses a model of two sets of objects. Thus IntKey treats character states of items like physical objects. This is not an entirely bad visualization, but it could be confusing if you are using a standard database. DeltaAccess does not have any "MATCH" options, but the same results can be obtained fairly straightforwardly using the predicate logic of database queries. The following paragraphs describe how to obtain results equivalent to the use of the IntKey "MATCH" options.

IntKey provides the following options for the SET MATCH command:

Overlap (SET MATCH O):

"specifies that two sets of values match if they overlap, that is, if they have any values in common (e. g. 1/2 matches 2/3; 2-5 matches 4-10). (S and O cannot be used together.)"

This is the default in DeltaAccess, if you select one or several states in a single step. In the example "2-5 matches 4-10", you would select all states from 2 to 5, and then click on the Present-button. The resulting Boolean condition would then be: "where '2' OR '3' OR '4' OR '5' ".

Subset (SET MATCH S):

"specifies that two sets of values match if one set (usually the values of the specimen) is a subset of the other (e. g. 1/2 matches 1/2/4 but not 2/3; 2-5 matches 1-6 but not 4-10)."

Instead of selecting multiple states in a single identification step (which combines the conditions for each state with the OR operator), you would specify each state as a separate condition. This combines the conditions with the AND operator. In the example "2-5 matches 1-6 but not 4-10", you would select state '2', click the Present-button, state '3', click the Present-button, and so on. The resulting Boolean condition would then be: "where '2' AND '3' AND '4' AND '5' ". This condition is fulfilled by all items where states 1-6 are present, but not by those where states 4-10 are present. Compare Combining character states with logical operators.

Exact (SET MATCH E):

"specifies that two sets of values match only if they are identical."

In addition to adding the states as separate steps as in the example above, you also must exclude the states that may not be present. If you want to see only items where a certain state (e. g. '1') but none of the other states (e. g. '2', '3', '4') is present (i.e. only a single state was scored for the item), you would select '1', click the Present-button, and in a second step select '2', '3', and '4' and click on the Absent-button. The resulting Boolean condition would then be: "where '1' AND NOT ('2' OR '3' OR '4')".
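The three equivalents described so far (Overlap, Subset, Exact) can be written down as lists of identification steps. The following Python sketch only illustrates the Boolean logic: states within one step are combined with OR, steps are combined with AND, and excluded states are negated. The function names and the representation of a step as a (states, present) pair are assumptions of this example, and the special states U and V are ignored.

```python
def step_holds(item_states, states, present=True):
    """One identification step: OR over the selected states, negated if excluded."""
    hit = bool(set(item_states) & set(states))
    return hit if present else not hit


def evaluate(item_states, steps):
    """All identification steps are combined with AND."""
    return all(step_holds(item_states, states, present) for states, present in steps)


# Overlap: '2' OR '3' OR '4' OR '5' in a single step
overlap_steps = [({"2", "3", "4", "5"}, True)]
# Subset: '2' AND '3' AND '4' AND '5' as four separate steps
subset_steps = [({"2"}, True), ({"3"}, True), ({"4"}, True), ({"5"}, True)]
# Exact: '1' AND NOT ('2' OR '3' OR '4')
exact_steps = [({"1"}, True), ({"2", "3", "4"}, False)]

item_1_to_6 = {str(n) for n in range(1, 7)}
item_4_to_10 = {str(n) for n in range(4, 11)}
```

With these definitions an item scored 4-10 fulfills overlap_steps but not subset_steps, an item scored 1-6 fulfills subset_steps, and exact_steps is fulfilled only by an item for which '1' is present and none of '2', '3', '4'.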

Unknowns (SET MATCH U), Inapplicables (SET MATCH I):

"specify respectively that `unknown' `inapplicable' match any value. The default setting is O U I, which is usually the most appropriate for identification. For information retrieval, the most appropriate setting is usually O."

In DeltaAccess you can influence the behavior regarding the special character states U and V through the setting of the Exact-checkbox. Note that DeltaAccess never includes inapplicable states in the query.

The quotations are taken from the IntKey version 5 help file. No other source for this information could be found.


Compiling a project for interactive identification

Who should read this: People interested in creating interactive identification tools, which use DeltaAccess for importing, or which work directly on the data structures of DeltaAccess. You should be acquainted with the general information model of DeltaAccess. See also Comparison of direct and compiled identification modes.

(*** NOTE: This is currently a rough collection of information for developers, which will be reorganized at a later time)

Identification using the _DESCR table directly poses several problems. To solve these without making identifications unduly slow, it is necessary to compile the information available in _DESCR into the separate table _ID_DESCR. Compilation modifies the item descriptions to achieve error tolerance and deals with implicit character states, including implicit unknown ('U') states, which must be inserted whenever no data are recorded at all.

The current version of DeltaAccess for Access 97 already contains a subset of this functionality. It is not integrated into the user interface yet, and is not yet used by Identify itself. It is included for use with external software (e. g. web interfaces) alone.

To compile your descriptor project (in full version of Access, not possible in a run-time version)
   Press Control G (= open the immediate window)
   If your project name is "Borneo", type:
   ID_Compile "Borneo"
   Press Enter
   Wait for a message box indicating success and close the immediate window

Problems already addressed are:

Implicit states are added

Character/item combinations saturated with states (i.e. all states are present), and character/item combinations which contain both a special state and a normal state (compare queries: Checking data) are converted to U

The special states V and TE are converted to U, the special state '-' is removed

Implicit unknowns (i.e. no state used at all for categorical characters) are added

Items, where for a real or integer numeric character only a single value (e. g. the mean) has been entered, are converted to ranges using the Fuzziness and FuzzinessIsPercent attributes. Note that the lower percentage calculation differs slightly from the one IntKey uses!

In addition, Fuzziness is also evaluated for characters of type OM. If CS=3 is scored, and Fuzziness=1, CS 2 and 4 are added as character states.

Combined character states (e. g. '1&2') remain in the compiled item descriptions, but the constituent states ('1' and '2') are added as well. In contrast, IntKey would only provide constituent states, but not the combination to the user.

The reliabilities of character and modifier are combined, see Calculation of combined reliabilities.

The combined reliability value in this table is set to 0 for all characters scored as unknown (U).

Notes and Text are removed from the main table. Instead, TXT and Notes are compiled into a word-wise full text index (see entities _ID_TXT and _ID_NOTES). Certain common words are ignored (= "stop words", e. g. "the", "at", "it", etc.). The list of stop words is hardcoded in the function IsStopWord(), and is designed to work with English, German, French, Spanish, and Italian. If the list gives unsatisfactory results, you might want to send me a list of words which should or should not be stop words in your language.
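Some of the rules for categorical characters in the list above can be sketched for a single item/character combination. The following Python function is a simplified reading of those rules, not the ID_Compile code: the function name, its parameters, and the order in which the rules are applied are assumptions, and implicit states, reliabilities, and numerical characters are left out.

```python
def compile_states(scored, all_states, ordered=False, fuzziness=0):
    """Return the compiled state set for one item/character combination.

    scored     : state codes as scored in the item description
    all_states : ordered list of the states defined for the character
    ordered    : True for characters of type OM
    fuzziness  : number of neighboring states to add for OM characters
    """
    # The special state '-' is removed.
    states = set(scored) - {"-"}
    # Combined states such as '1&2' stay, their constituents are added.
    for state in list(states):
        if "&" in state:
            states.update(state.split("&"))
    # Nothing scored, or a special state present: converted to U.
    if not states or states & {"U", "V", "TE"}:
        return {"U"}
    # Saturated combination (all states present): converted to U.
    if set(all_states) <= states:
        return {"U"}
    # OM characters: add neighboring states within the fuzziness distance.
    if ordered and fuzziness > 0:
        for state in list(states):
            if state in all_states:
                i = all_states.index(state)
                states.update(all_states[max(0, i - fuzziness): i + fuzziness + 1])
    return states
```

For example, with five ordered states and Fuzziness=1, an item scored '3' is compiled to the states '2', '3', and '4', as described above.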

Problems not yet addressed are:

Mapped character states are added in their mapped form

Real/integer numeric values are mapped to categorical states according to their mapping definition

If no mapping definition for numerical characters is present, a default mapping is created automatically

Always run the compilation on the original base project (the one containing the ..._DESCR table), not on a project which contains an item link or subset instead of the original table (i.e. a ..._DESCR query). The _ID_DESCR table is not protected by referential integrity; you must recompile the ID_DESCR data after you have made changes to your character definition. //Perhaps CID could be protected, CS not because used for word index! Perhaps automatically invalidate?

The following assumptions may be made about the ID_DESCR table:

For each item, at least one record is present in ID_DESCR for all categorical and numerical characters. This might be a "U" for unknown. If characters of type text have not been used in an item, the item/character combination will be missing.

Categorical characters: CS contains only true character states and U (i.e. not the special states V/TE/-).

Numerical characters: CS contains "+" and "-" for upper and lower limit of range, respectively, "Min", "Max", and "U" for unknown. Mean/Median values have been converted into ranges, using the Fuzziness and FuzzinessIsPercent attributes of the character definition. If you write identification code and have a range to test against (ID1 and ID2 for the lower and upper values of the range entered by the user), the conditions "(ID1<ID_DESCR.X Where ID_DESCR.CS='+') AND (ID2>ID_DESCR.X Where ID_DESCR.CS='-')" must be fulfilled by the items.

Text characters are missing in _ID_DESCR. The contents of the TXT attribute, parsed into words, is compiled into _ID_TXT.

Notes are removed from _ID_DESCR. The contents of the NOTES attribute, parsed into words, is compiled into _ID_NOTES.
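The range condition given above for numerical characters can be sketched in Python as follows. The function name and the representation of the compiled records of one item/character combination as a dictionary (CS code mapped to the value X) are assumptions of this example. The strict comparisons follow the condition as quoted; treating an item coded "U" according to the Exact setting is taken from the description of the Exact-checkbox and is an assumption here as well.

```python
def compiled_item_remains(records, id1, id2, exact=False):
    """Test the user range id1..id2 against the compiled range of one item.

    records maps CS codes to X values, e.g. {"-": 8.0, "+": 15.0}
    ("-" = lower limit of the range, "+" = upper limit of the range).
    """
    if "+" in records and "-" in records:
        # (ID1 < X Where CS='+') AND (ID2 > X Where CS='-')
        return id1 < records["+"] and id2 > records["-"]
    # No range available (item coded as unknown): kept unless Exact is set.
    return not exact


item_range = {"-": 8.0, "+": 15.0}
item_unknown = {"U": None}
```

A user range of 10 to 20 overlaps the item range 8 to 15 and keeps the item, while a user range of 16 to 20 does not.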

Other issues relevant for authors of interfaces (e. g. web interfaces) to DeltaAccess style compiled data sets:

The first step in the identification uses Named Character Groups. These can be manually defined character groups (see Edit character headings) or groups defined through the AutoGroup attribute.

During compilation, the headings where the AutoGroup attribute does not start with X (including the user defined SQL AutoGroup definitions) are translated into records in the link table _CHAR_Heading_Link and can therefore be treated just like manually defined Named Character Groups. In both cases, the character IDs of the characters to be presented in the second identification step (selection of character) can be retrieved using: "Select CID From PRX_CHAR_Heading_Link Where HID=[User-selected Heading ID];"

Those AutoGroup codes starting with X (compare the list of predefined AutoGroup codes) require a special treatment. If this is not possible, they must be excluded in the query which retrieves the headings presented during identification (e. g. "Select HeadingName, HID From PRX_Char_Heading Where AutoGroup Not Like 'X*';").
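For interface authors, the two queries quoted above can be tried out with a small stand-in database. The following Python sketch uses SQLite instead of Access, so the wildcard in the Like pattern is "%" rather than "*". The table contents, the AutoGroup values 'G1' and 'X1', and the use of an empty string for manually defined headings are made up for this illustration and do not reflect the real predefined AutoGroup codes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE PRX_Char_Heading (HID INTEGER PRIMARY KEY, HeadingName TEXT, AutoGroup TEXT);
    CREATE TABLE PRX_CHAR_Heading_Link (HID INTEGER, CID INTEGER);
    INSERT INTO PRX_Char_Heading VALUES
        (1, 'Leaves', ''), (2, 'Flowers', 'G1'), (3, 'Special', 'X1');
    INSERT INTO PRX_CHAR_Heading_Link VALUES (1, 10), (1, 11), (2, 20);
""")

# First step: headings to present, excluding AutoGroup codes starting with X.
headings = con.execute(
    "SELECT HeadingName, HID FROM PRX_Char_Heading "
    "WHERE AutoGroup NOT LIKE 'X%' ORDER BY HID"
).fetchall()


def characters_for_heading(hid):
    """Second step: character IDs linked to the heading selected by the user."""
    rows = con.execute(
        "SELECT CID FROM PRX_CHAR_Heading_Link WHERE HID = ? ORDER BY CID", (hid,)
    )
    return [row[0] for row in rows]
```

In this sample data the heading list contains 'Leaves' and 'Flowers' but not 'Special', and characters_for_heading(1) returns the character IDs 10 and 11.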


Comparison of direct and compiled identification modes

This chapter is preliminary, because the options discussed are not yet fully available!

You can choose between two methods of identification in DeltaAccess:

Direct identification on the active item descriptions. Good for information retrieval, always contains the most up-to-date data.

Compiling the item descriptions for use by identification programs (Identify, web interfaces, etc.)

You can already compile a descriptor project for interactive identification, see Compiling a project for interactive identification. Note: This is a preliminary implementation for web interfaces! Identify itself does not use it yet, but later versions of Identify will offer this option!

Identification by use of the compiled data set is faster, safer (certain error conditions and ambiguities in the data set are excluded), and you can fully use implicit character states. The text information (TXT and Notes) is already parsed into a word index (each word of a text is a separate entry in such an index), which makes searching such information relatively fast.

On the other hand, compilation may take significant time for a large data set, and it must be repeated each time you have changed the item descriptions.

The alternative is to directly use the current item description. Its main disadvantages are that no re-evaluation of characters occurs during identification and that the evaluation of numerical characters is more complicated. On the other hand, you actually have more options here, because you can include minima and maxima in your search by specifying that your value is a single, potentially extreme value.

You can search for information in text characters, or in the Notes attached to any character, but this will be relatively slow, because no index is used.

In the case of combined character states (e. g. '1&2', 'flower: like red with white dots') you cannot identify an item if you only select one of the constituent states (e. g. '1' = 'red').

You can improve direct identification if you decide that you do not need the special state U to differentiate between a character that has been checked but is positively unknown and a character that has not yet been checked. Use Reorganize queries, Insert unused characters to insert the special state U wherever a character has not been used (scored) in an item. You can also insert the implicit character states. This process is not easily reversible, and it makes it difficult to assess later whether a character state was inserted because it has actually been checked for this item, or whether it is based on the global assumption which led to the declaration of a state as implicit.


Calculation of combined reliabilities

During compilation for identification, the following reliability and abundance settings are combined into a new reliability value:
   character Reliability,
   character Availability,
   modifier Reliability, and
   item Abundance.

The following chapter describes how these values are combined.

Character Reliability and character Availability are combined into a new reliability value, which is the minimum of the two values. ******** NOT YET IMPLEMENTED!

Modifier Reliability is added according to:

The neutral element of the reliability scale is the value 5 (default value). The absence of a modifier in an item/character/state tuple is assumed to be equivalent to a modifier with the default reliability of 5. The combined reliabilities of character Reliability and modifier Reliability are calculated, using the following algorithm:

CombinedReliability = ((CharacterReliability/5) * (ModifierReliability/5))*5
= (CharacterReliability * ModifierReliability)/5

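(i.e. the range 0-10 is rescaled to 0-2, the default 5 becomes the neutral element 1). The result is converted to an integer value; judging from the tabulated values, fractional results are rounded up to the next integer (e. g. 1 * 1 / 5 = 0.2 becomes 1). For all combinations of the two reliability values (0-10) the result looks like: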

   |  0  1  2  3  4  5  6  7  8  9 10
---+---------------------------------
0  |  0  0  0  0  0  0  0  0  0  0  0
1  |  0  1  1  1  1  1  2  2  2  2  2
2  |  0  1  1  2  2  2  3  3  4  4  4
3  |  0  1  2  2  3  3  4  5  5  6  6
4  |  0  1  2  3  4  4  5  6  7  8  8
5  |  0  1  2  3  4  5  6  7  8  9 10
6  |  0  2  3  4  5  6  8  9 10 11 12
7  |  0  2  3  5  6  7  9 10 12 13 14
8  |  0  2  4  5  7  8 10 12 13 15 16
9  |  0  2  4  6  8  9 11 13 15 17 18
10 |  0  2  4  6  8 10 12 14 16 18 20
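The table can be reproduced with a few lines of Python. The function name is invented for this sketch, and the rounding rule is an inference: the tabulated values agree with rounding the quotient up to the next integer, not with rounding to the nearest integer, so that is what the sketch does.

```python
def combined_reliability(character_reliability, modifier_reliability):
    """(CharacterReliability * ModifierReliability) / 5, rounded up to an integer."""
    product = character_reliability * modifier_reliability
    # Ceiling division for non-negative integers without floating point.
    return -(product // -5)


# One row of the table: reliability 7 combined with the values 0 to 10.
row_7 = [combined_reliability(7, m) for m in range(11)]
```

Because 5 is the neutral element, combined_reliability(r, 5) returns r unchanged for every r from 0 to 10.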

Item Abundance: ******** Combination NOT YET IMPLEMENTED, currently Abundance has no influence!

Note that the compilation process and the Identify dialog box assume that you are working with Reliability (0-10) values and not with fractional "weights" (compare Weight conversion). DeltaAccess allows the use of unconverted character weights (1/32 to 32), but using reliability values is recommended. Using Identify with Weight values will result in no differentiation between weights below 1.
