Text Database Fundamentals


Databases, Records, and Fields

A text database is a collection of related documents assembled into a single searchable unit. The individual documents can be massive or minuscule, but they should bear some relation to each other.

A database is composed of smaller units called records. In a text database, a record can be an entire document, a section within a document, a single page, or a fragment of text within a page. When you search a database, you will retrieve one or more records containing information that satisfies your query.

A record can contain smaller regions of data called fields. A field usually holds a particular type of data common to several or all records within a database. For instance, in a database of corporate memos in which each memo is one record, the following fields might be used: TO, FROM, DATE, SUBJECT, and TEXT. You can narrow the scope of a search by restricting it to one or more fields. In this example, you might limit your search to the FROM field when searching for a sender's name. Only those records with the specified name in that field would be retrieved.
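The memo example can be sketched in a few lines of Python. This is only an illustration of records, fields, and field-restricted searching, not a description of how PLWeb Turbo stores or searches data: the sample memos, the MEMOS list, and the search() function are all hypothetical, and matching is assumed to be a simple case-insensitive substring test.

```python
# Hypothetical memo database: each dict is one record, each key is a field.
# The names and dates are made-up sample data.
MEMOS = [
    {"TO": "Alice Chen", "FROM": "Bob Ortiz", "DATE": "1996-03-01",
     "SUBJECT": "Budget review",
     "TEXT": "Please send the figures to Dana Lee."},
    {"TO": "Bob Ortiz", "FROM": "Dana Lee", "DATE": "1996-03-02",
     "SUBJECT": "Re: Budget review",
     "TEXT": "Figures attached."},
]


def search(records, term, fields=None):
    """Return the records that contain `term` in any of `fields`.

    If `fields` is None, every field of each record is searched.
    Matching is a case-insensitive substring test (an assumption made
    for this sketch).
    """
    needle = term.lower()
    hits = []
    for record in records:
        names = fields if fields is not None else record.keys()
        if any(needle in record.get(name, "").lower() for name in names):
            hits.append(record)
    return hits


# Unrestricted search: matches the name wherever it appears.
everywhere = search(MEMOS, "Dana Lee")

# Restricted search: only memos whose FROM field contains the name.
from_only = search(MEMOS, "Dana Lee", fields=["FROM"])

print(len(everywhere), len(from_only))
```

The unrestricted search also returns the memo that merely mentions the name in its TEXT field, while the search restricted to FROM returns only the memo that person actually sent.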

Stopwords

Unlike a keyword-based system, PLWeb Turbo is full-text retrieval software: it indexes every word in a document except stopwords. Stopwords are terms that PLWeb Turbo is programmed to ignore during indexing and retrieval, in order to prevent the retrieval of extraneous records. A stopword list generally includes the most common English articles, pronouns, adjectives, adverbs, and prepositions (the, they, very, not, of, etc.). After reading about relevance ranking, you'll understand why a stopword list is used.
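As a rough illustration of full-text indexing with a stopword list, the following Python sketch builds a small inverted index (a map from each word to the records containing it) and skips stopwords both when indexing and when looking up a query. It is a simplified model, not PLWeb Turbo's implementation: the STOPWORDS set contains only the examples named above, the tokenizer is a plain regular expression, and the rule that a record must contain every remaining query term is an assumption made for the sketch.

```python
import re
from collections import defaultdict

# Illustrative stopword list taken from the examples in the text.
# This is NOT PLWeb Turbo's actual list.
STOPWORDS = {"the", "they", "very", "not", "of"}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map every non-stopword to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for word in tokenize(text):
            if word not in STOPWORDS:
                index[word].add(record_id)
    return index


def lookup(index, query):
    """Return ids of records containing every non-stopword query term.

    Stopwords in the query are ignored; a query made up only of
    stopwords returns no records.
    """
    terms = [word for word in tokenize(query) if word not in STOPWORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample records for the sketch.
RECORDS = [
    "The minutes of the board meeting",
    "They were not very pleased with the budget",
    "Budget of the marketing department",
]

index = build_index(RECORDS)
budget_hits = lookup(index, "the budget")
stopword_only_hits = lookup(index, "of the")
print(sorted(budget_hits), sorted(stopword_only_hits))
```

Because "the" is never indexed, the query "the budget" behaves exactly like "budget". Without the stopword list, a term such as "the" would match all three sample records.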
