The past-perfect dataset has been reduced for each of the 5 databases as follows:

* Fields that aren't used have been deleted
* Fields that aren't useful for sorting or fast filtering have been merged
* Fields that have similar data have been combined

The Dublin Core metadata set (http://dublincore.org/documents/usageguide/elements.shtml) has been used as the basis for organizing the fields.

The STERMS, SUBJECTS, AND CLASSES fields have been combined into a single TAGS field, and processed into single words with redundancies removed. For example, if the SUBJECTS field previously had
"Essex Company, Arlington Trust Bank Company, Essex Bank."
The TAGS field now reads:
"Essex, Company, Arlington, Trust, Bank".
In this way, that record will be retrieved if the user searches on any of the previous words, but it dramatically simplifies the organization of tag data. It also simplifies the entry of new tag data.

The PEOPLE field has been processed in the same way.

Important: For the TAGS field (see list of all values in tags field), any tag which wasn't used at least 4 times was deleted. The thinking behind this is that the power of the tag is to provide grouping for similar objects. If there aren't at least 4 of them, then it doesn't make much sense. It's kind of like having a divider in a notebook with only one page in it. By doing this, the number of tags was reduced from 4,000 to about 1,500, which should improve the performance of the indexing. For most of those objects which had only a single tag that word will be used in the full-text description and will be searchable, but that isn't as meaningful as having the word chosen as a tag.

For the PEOPLE field (see list of all values in people field) I did not do this reduction, and left peoples names on the field even if there was only one. The thinking here was that a person's name might have value in the faceted search. For example, once someone chooses a tag of, say, "Essex and Company" then all of the names which also have tags of "Essex, Company" will be selectable. Please give some careful thought to this distinction (between the handling of PEOPLE and TAGS and let me know if you agree or not.

It would be easily possible to eliminate more values from either of these two lists.

Here are the resulting databases:
Archives DB
Audio DB
Books DB
Images DB
Objects DB

These database files don't include the following administrative fields:
CONDEXAM (person)
CONDDATE (date)
CONDITION (many entries – short assessment: good, fair, excellent, etc.)
DISPVALUE appears to be redundant to condition (good, excellent, etc.) few differences
CONDNOTES entries - arch: 24, books 5, images: 765, obj 50 – useful

CATBY (person)
CATDATE (date)

UPDATEDBY (person) This will be kept by Drupal
UPDATED (date)

STATUS entries: mostly "OK", a few "missing"
STATUSBY (person)
STATUSDATE (date)

INVNBY (person)
INVNDATE (date)

MAINTDATE (date)

These fields are all pretty clean... the question is which of them we need to migrate to Drupal?

Fields to be determined - These are additional fields that need some decisions about whether they are kept or merged.

Last, look at the Columns spreadsheet which maps the past-perfect columns (fields) into the Dublin Core set (dc fields) and administrative fields. This spreadsheet has been the guiding document behind the database migration, and might be difficult to make sense of, but you'll get the idea. When I'm done I'll try to clean it up.

At this point, the databases are almost ready for migration into Drupal. The next step is to build the new content types in Drupal, but first we need to make sure that we know exactly which fields we'll be using. This is the time to review each of the DBs carefully and make sure the field selections, combinations, deletions, etc. are right, and that the choice and processing of classification and tag fields are optimal. Changes at this stage are quite easy, but will be very difficult once the migration into Drupal has taken place.