Chapter 10

Chapter 10

On Persistent Data

  A well-regarded book written in the 1970s by one of the pioneers  of Computer Science, Niklaus Wirth was aptly titled Algorithms + Data Structures = Programs [Wir75]. The title of the book explicitly reflects the fundamental essence of Computer Science: that a program’s raison d’etre is efficient manipulation of data using well-motivated algorithms. A program takes explicit or implicit data as input, performs computations while employing internal structures to store and manipulate data, and possibly carries out actions whose effects are seen by the outside world. In order to effectively manipulate data, it is necessary to use structures that can store and access data efficiently. In certain programs, the structure of the internal data used by the program can become fairly complex, requiring use of techniques found
in books on data structures such as [Knu78, CCELS01]. Additionally though, there are applications where the data is not structurally overly complex, but there is an overabundance of it. For example, a program may deal with tens of thousands of employees of a corporation, a few million customers of a telephone company, or hundreds of millions of records kept on citizens for administrative purposes such as assignment of social security benefits.

In this book, we deal with both aspects of data mentioned above, albeit to a limited extent.   In chapter 3, we discuss the basic data types that Perl provides: scalars, arrays, hashes and references. We also discuss how references can be used to develop complex data structures. A data structure, whether simple or complex, may need to be stored in a file, so the same program or another program can use it later. For complex data structures, the components need to be unraveled or unrolled in some fashion, so that they can be stored in a file for faithful re-assembly later and subsequent use.  This process is called serialization. We discuss this issue in this chapter. This chapter primarily deals with techniques that we can use to make data, whether simple or complex, small amounts or large, persistent. That is, how the
data can be stored in files in disk so that related or unrelated programs can access or manipulate the same data, simultaneously or at different times.

When the amount of data is large, it may not be possible to keep all the data in the program’s resident memory. In such a case, it is necessary to write data into files. In simple application programs, if we need access to data, we can store the data in text files for purposes of reading, writing and other manipulations such as deleting or inserting information, finding information that satisfies certain query criteria, and sorting the data. If the amount of data is small, and speed of access is not critical, we can use text files that we format ourselves in certain ways.   Using text files is extremely simple. Every programming language provides facilities for reading and writing text files. Usually, a record is written one per line with \n being the record delimiter although other record delimiters are possible.
The fields are usually separated with a pre-specified delimiter, say one or more commas or colons or tabs. The obvious drawback of using text files is that it is inefficient. Disk access is slow. If we need to sort records or keep them in sorted order in a text file, it becomes a slow process. A flat text file has no additional structure, so insertion and deletion of data becomes time-consuming, possibly requiring a pass through all the records. This may work for a simple application with a small amount of data. However, if the amount of data becomes large, there are several possibilities for the program author. Some of these are given below.

    We can use a data structure more complex than a straight file although the data can still be stored in files. For example, we can use a binary tree, a binary search tree, a red-black tree, a regular heap, a Fibonacci heap, or a B-tree. These structures are discussed in detail in most books on data structures and algorithms, such as [CLR97, CCELS01]. Writing one’s own code becomes complex, but depending on the task at hand, it may be the only option open to a programmer. There are also off-the-shelf commercial, freeware, or shareware software one may be able to find.

    If the information is not very complex in structure, but the amount of data is somewhat large, we can use a built-in facility called DBM files or databases. DBMs are not really as powerful as relational databases, but for many purposes, they are enough. DBMs have been historically used in Unix machines although they are available in other platforms as well.

The DBM library provides a simple database management facility. The structure of the record allowed is simple, just key-value pairs. They are stored in disk, not in the form of a flat file, but using complex hashing algorithms. DBMs have been in use in the Unix environment since the 1970s [Sel91]. A DBM file can be used to add new values, update existing values, or delete old values. The DBM library is fairly simple, but is readily available and frequently used.

Perl provides access to the DBM files using a clever mechanism whereby a hash can be associated with a DBM database through a process similar to opening a file. Creating a new element in the array modifies the DBM database immediately. Deleting an element deletes the value from the DBM database. The size, number and kind of keys and values in a DBM database are restricted. In general, one is advised to keep both the keys and values down to about 1000 characters or less.

 

    Another option is using a full-fledged database. This is absolutely necessary if the amount of data is large and the data has complex structure. A complex structure can be represented using several related tables that are simple in themselves individually. Databases store huge amounts of data, and structure them using complex mechanisms for easy access and manipulation. Although there are different kinds of databases, relational databases are the most prevalent ones. A freely available relational database is MySQL. Another is Postgresql. Commercially available relational databases include ones from Oracle, Microsoft, and IBM.    

   A relational database is usually a complex software system. To be able to install and use a relational database needs considerable expertise, including learning a language called SQL or a variant of it. Quite frequently, it is necessary to employ an individual called a database administrator to help perform all the activities needed to keep a database system functioning well.