Database Evolution

Nearly all of the digitized data that currently resides in the natural history museums is contained in some type of relational database. A variety of relational systems have been used over the years, including server-based commercial databases such as Sybase and Informix, PC-based databases such as Access, Paradox, FileMaker Pro, and dBASE, and open-source database systems such as MySQL and PostgreSQL. This section briefly describes the different types of databases in use.

Relational Model

The relational model was first described by E. F. Codd in 1970, and was based on relational algebra. It was developed as an alternative to traditional methods that tied data storage and retrieval to the architecture of a particular computer, such as the early databases used in the MVZ and UCMP. Relational databases organize data and the relationships among data elements into tables of rows and columns. Each column and row in a table is unique, the order of columns and rows is insignificant, and the contents of one cell are considered atomic. The relational model underlies most current database systems, from simple desktop databases such as Access to more elaborate data management systems such as Sybase and Oracle. Relational databases can be indexed efficiently, so they are usually much faster than hierarchical or object-oriented models. Because of their table structure, they can support a much more flexible array of queries. All the museums currently use some type of relational database.
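A minimal sketch of this flexibility, using Python's standard sqlite3 module (the table and column names here are invented for illustration): specimen records are stored as rows, and an ad hoc query can combine any columns.

    import sqlite3

    # In-memory relational database; each row is unique, each cell atomic.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE specimen (
        id        INTEGER PRIMARY KEY,
        taxon     TEXT,
        collector TEXT,
        year      INTEGER)""")
    conn.executemany(
        "INSERT INTO specimen (taxon, collector, year) VALUES (?, ?, ?)",
        [("Quercus agrifolia", "Smith", 1982),
         ("Quercus lobata",    "Jones", 1975),
         ("Arbutus menziesii", "Smith", 1991)])

    # The table structure supports flexible queries across any fields.
    for row in conn.execute(
            "SELECT taxon, year FROM specimen "
            "WHERE year > 1980 AND collector = 'Smith'"):
        print(row)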

Hierarchical and Object-Oriented Models

In this model, data is organized in a tree or web structure instead of a table structure. Nodes usually represent “parent” or “child” objects. The relationships between objects are represented by the segments that connect them. Relationships may be one-to-many, as in a hierarchical arrangement of taxonomy, or many-to-many, as in a network or web. Hierarchical databases were popular in the 1960s and 1970s, and object-oriented databases enjoyed popularity in the 1990s for their ability to represent certain types of data, such as time-sequence video, more intuitively. The downside of storing data this way is that the relationships between data elements must be determined in advance and are fairly static: this type of database handles hierarchical data elegantly but is not conducive to a broader relational query such as “date after 1980 and collector = ‘Smith’”. Queries to hierarchical databases are usually much slower, since comparisons must follow a tree or web structure.
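The trade-off can be sketched in a few lines of Python (the node structure and field names are invented): following parent-to-child segments is trivial, but a query on an unrelated field must walk the entire tree.

    # Hypothetical hierarchical store: each node knows only its children.
    class Node:
        def __init__(self, name, collector=None, year=None):
            self.name, self.collector, self.year = name, collector, year
            self.children = []

    root = Node("Quercus")
    root.children = [Node("Quercus agrifolia", "Smith", 1982),
                     Node("Quercus lobata", "Jones", 1975)]

    # Following the hierarchy is cheap: step from parent to child.
    print([child.name for child in root.children])

    # A relational-style query must traverse every segment of the tree.
    def search(node, predicate, hits=None):
        hits = [] if hits is None else hits
        if predicate(node):
            hits.append(node.name)
        for child in node.children:
            search(child, predicate, hits)
        return hits

    print(search(root, lambda n: n.year is not None and n.year > 1980
                                 and n.collector == "Smith"))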

Hash Tables or Look-up Tables

This model uses a key-value method of storing data and typically does not support the expression of relationships among different data elements. Look-up tables such as Berkeley DB are widely used in web applications where a large body of data must be searched quickly on one or a few fields, such as retrieving a customer's account information given the customer's name. They are also used for storing data in small consumer appliances, because they are very fast and very lightweight. Look-up tables are useful when queries only need to be made on one data element at a time, for example, retrieving all records that match a particular taxon. UCJeps uses look-up tables for some of its online queries, taking an extract from the Sybase data and then creating a hash table for each searchable field. Look-up tables are not well suited to datasets that need to be queried on multiple fields at once.
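The UCJeps technique can be sketched with plain Python dictionaries standing in for Berkeley DB (the field names are invented): one look-up table is built per searchable field, so a single-field query is one fast key look-up, while a multi-field query must intersect results in application code.

    # Extract from the master database: (id, taxon, collector) tuples.
    extract = [(1, "Arctostaphylos glauca", "Smith"),
               (2, "Arctostaphylos glauca", "Jones"),
               (3, "Quercus lobata",        "Smith")]

    # Build one look-up table per searchable field.
    by_taxon, by_collector = {}, {}
    for rec_id, taxon, collector in extract:
        by_taxon.setdefault(taxon, []).append(rec_id)
        by_collector.setdefault(collector, []).append(rec_id)

    # A query on one field is a single, fast key look-up...
    print(by_taxon["Arctostaphylos glauca"])        # [1, 2]

    # ...but a multi-field query must intersect results in the application.
    print(set(by_taxon["Arctostaphylos glauca"]) &
          set(by_collector["Smith"]))               # {1}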

Online Databases

What technology is used by very large, web-based databases such as eBay, Amazon, and Google? They typically do not use traditional relational database technology at all, except perhaps for archival or permanent storage. Much of the functionality that databases traditionally provide, such as joins, sorting, and indexing, is instead moved to the application level. Data is partitioned and then replicated so it can be distributed across many CPUs for load balancing, failsafe protection, and above all, fast access. Searches are performed on customized replicated indexes. For example, eBay performs both sorting and joining in the application, not in the database; it uses few triggers and has moved toward “denormalization of the schema” [Vogels]. The emphasis is on scalability and speed. A recent essay by Adam Bosworth lists three requirements of today's databases that are not being met by database vendors: dynamic schema, dynamic partitioning, and modern indexing.
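A toy Python sketch of that pattern (the partition count, key scheme, and record layout are all invented): data is partitioned by hashing the key, and the join is performed in the application rather than in the database.

    # Partition records across "nodes" by hashing the key.
    NUM_PARTITIONS = 4
    partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(key, value):
        partitions[hash(key) % NUM_PARTITIONS][key] = value

    def get(key):
        return partitions[hash(key) % NUM_PARTITIONS].get(key)

    # Two denormalized record types stored side by side as key-value pairs.
    put(("user", "u1"), {"name": "alice"})
    put(("item", "i9"), {"title": "field notebook", "seller": "u1"})

    # Application-level join: fetch the item, then look up its seller.
    item = get(("item", "i9"))
    seller = get(("user", item["seller"]))
    print(item["title"], "sold by", seller["name"])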

Sources:

“How are databases used at big customers?”, Werner Vogels, CTO, Amazon.com, http://weblogs.cs.cornell.edu/AllThingsDistributed/archives/000280.html

“Where have all the good databases gone?” Adam Bosworth, Dec 2004, http://www.adambosworth.net/archives/000038.html

“Peeking Into Google”, Susan Kuchinskas, internetnews.com, http://www.internetnews.com/xSP/article.php/3487041

PC-based vs. Server-based

Single-user databases that run on a personal computer include Access, Paradox, and dBASE. These databases have been very popular (and still are) because they are easy to set up and do not require a great degree of skill to begin using. During the 1980s, as personal computers became affordable, many of the Berkeley museums began to manage their collections using dBASE, a product that was widely available for PCs. This represented an attractive alternative to timesharing on a remote mainframe computer, not only because it was less expensive but also because the museums had more control over the data and access to the computer. Today, most of the museums continue to use PC-based databases for small research projects as well as for collection-related data such as loans, acquisitions, and images. One museum still houses its specimen data in a PC-based database: UCMP's data resides as four separate collections in the PC database Paradox. The main disadvantage of single-user PC-based databases is that the database is only available to one person at a time for updates and additions. Museums often work around this by creating several copies of the same database so that multiple people can work on the data at once, but keeping the data synchronized across multiple versions is difficult and time-consuming.

In the late 1980s, client-server databases such as Ingres, Sybase, and Oracle began to appear. The client-server architecture allows multiple users to query and update data at the same time. With the explosion of the World Wide Web, server-based databases also offered a way to make data available online to a wide audience.

Distributed Databases

As soon as museum collections began to be digitized, interest turned to creating structures that would support queries beyond the museum's own collection. One way to do this is for all the participating collections to use the same database and/or data structure. This was suggested as early as 1972 in UCMP's “A Local System of Automated Paleontologic Data Retrieval and its Potential Contribution to an Eventual Nationwide System”. UCMP was an early adopter of technology and expected that other collections would quickly follow its lead, but this did not happen, perhaps because technology changes seem to be closely tied to funding. Similarly, SMASCH was originally conceived as a distributed database system, and the 1992 paper that described the new project announced that all the herbaria in California would share it. Although some other UC herbaria are now actively using SMASCH to enter some of their data, SMASCH has not become the state-wide system that was envisioned. Without a large funding effort, most academic museums have either not been able to digitize their collections, as in the case of the Essig Museum, or have used more affordable methods such as PC-based databases that often do not interface well with more elaborate projects. Even when funding is not an issue, museums may not agree on what components are important for a system; systems at multiple museums rarely share enough common design elements to support this type of distributed query.

Another way to support distributed queries is for each collection to make only a subset of its data available in an agreed-upon format, and then either run an application locally that accepts queries from the outside, or supply the data to a third party, who then makes it available for queries. The DiGiR project uses this approach. This method is more attractive than a shared data structure, since participants do not need to commit to a particular overall data structure for their entire collection management system; another appealing feature is that they retain control over their own data. A disadvantage is that some work is needed at each site to produce the data subset and to run the application that makes it available. For museums that are struggling to fund basic collection management, participating in such a distributed system may not be economically feasible.

A third method for supporting distributed queries across collections is to create a centralized repository. In this model, participants contribute data in an agreed-upon format as in the previous example, or they may simply supply the data in whatever form it exists in their own system and leave the re-formatting to the repository. This method is attractive because of its very low cost for each participant: each collection need only create a mechanism for exporting data from the local collection management system, a task that is straightforward for nearly any type of database. The repository can automatically pick up the data, format it, and finally include it in the distributed system. The BNHM all-museum query system uses this approach.
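As an illustration of how small the per-collection burden of the repository model can be, here is a Python sketch of the export step (the database file, table, and field names are hypothetical, and a real exchange format such as Darwin Core would define the actual columns):

    import csv
    import sqlite3

    # Hypothetical local collection management database.
    conn = sqlite3.connect("collection.db")

    # Export only the agreed-upon subset of fields; the repository
    # picks up this file, reformats it, and adds it to the shared index.
    SHARED_FIELDS = ["catalog_number", "taxon", "collector", "year"]
    with open("export_for_repository.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(SHARED_FIELDS)
        writer.writerows(conn.execute(
            "SELECT catalog_number, taxon, collector, year FROM specimen"))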

Source:

“Database Models”, UnixSpace, http://unixspace.com/context/databases.html

Links of Interest:

MVZ Database Model (PDF)

UCMP Data Model

Data at the speed of the Internet (more to come)

Originally written by Ginger Ogle, May 2005; updated by Michelle Koo, March 2016.