What is Digitization?

Author: Marek Zieliński

Digitization. Illustration prepared using a work by Junior Melo, via Wikimedia Commons (CC-BY-SA-3.0).

Digitization seems to have as many definitions as it has user communities. Using the same word for different purposes is not rare. The word ‘organic’ has a well-defined meaning in chemistry (a carbon compound) and a common usage in food and other industries, where it loosely means ‘good for you’. As a result, an organic substance may not be ‘organic’ in the everyday sense, and organic food may contain inorganic elements. Digitization does not show such a spread of meanings; nevertheless, in many contexts it is used to mean different things.

In Wikipedia the preferred term is “digitizing”, but the definition is not of much use: “Strictly speaking, digitizing means simply capturing an analog signal in digital form. For a document the term means to trace the document image or capture the "corners" where the lines end or change direction.” There is no mention of metadata, transcription, access and other elements generally associated with digitization in the archival and library community.

Let us examine some terms and processes which together constitute digitization.

Digital reformatting

Digital reformatting means converting an analog resource into a discrete form usable in computers. The resource is typically a product of human culture, and can be a two-dimensional object (document, image, book page) or a recording of music or moving images. We will not deal with three-dimensional objects, which can also be digitized, but by a process that is much more involved and not yet clearly established. Analog means that the signal (color, sound, etc.) is continuous, i.e. it can take any possible value in its range. Discrete means that only a limited set of values is possible, with nothing in between. Written language is a good example of a discrete string of elements, each taken from an alphabet of some 150 possibilities (in the English language, and including lower and upper case letters, numbers and a collection of special characters). There is no ‘intermediate value’ between a and b.

In the first step we need to divide the dimension(s) of the object into discrete elements. For a flat object we superimpose a grid (typically consisting of square ‘pixels’); for a signal in time we divide the time dimension into short intervals. Next we examine the signal in each discrete element - a single pixel or interval. For a flat image we look at the color intensity and record it as one of a limited set of discrete values; similar treatment is applied to an electrical signal (from a microphone) in the selected time interval. This process is called sampling (the step of rounding each measured value to a discrete level is also known as quantization).
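The two steps described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `sample_and_quantize`, the 1 Hz test tone, the sampling rate of 8 samples per second and the 5 allowed levels are all invented here to keep the numbers easy to follow. Real converters use far higher rates and many more levels.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, levels):
    """Sample a continuous signal at fixed intervals and round each
    sample to one of `levels` evenly spaced values in [-1.0, 1.0]."""
    n_samples = int(duration_s * sample_rate_hz)
    step = 2.0 / (levels - 1)  # distance between two allowed values
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz               # time coordinate of this interval
        value = signal(t)                    # "analog" value: any number in range
        index = round((value + 1.0) / step)  # nearest allowed level, 0..levels-1
        samples.append((t, index))
    return samples


def tone(t):
    """A 1 Hz sine wave standing in for the microphone signal (assumption)."""
    return math.sin(2 * math.pi * 1.0 * t)


# One second of signal, 8 intervals, 5 allowed levels (illustrative values).
digital = sample_and_quantize(tone, duration_s=1.0, sample_rate_hz=8, levels=5)
for t, level in digital:
    print(f"t={t:.3f}s level={level}")
```

With these settings the eight recorded levels are 2, 3, 4, 3, 2, 1, 0, 1: the smooth sine wave has become a short list of whole numbers paired with their time coordinates.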

Finally we record the values together with their spatial or time coordinates using a specific coding (selected from many possibilities) and obtain a computer file. If the process was done lege artis, the file is considered a digital surrogate of the original.

In practice we use a scanner (or a camera) for a flat object and a device called an analog-to-digital (A/D) converter for electrical signals (you have one in your computer). A similar device can also reformat a video recording played from a VHS player.

Born digital

A similar process of generating a discrete representation of an analog signal occurs inside a camera while taking a snapshot, and in a digital video camera or sound recorder. When the image, sound or event is immediately recorded in digital form, we consider it ‘born digital’. Digitization is concerned only with digital reformatting, not with born digital resources.

Metadata

The file we obtain in the digital reformatting step may initially be devoid of metadata. The original resources, fixed on a specific medium (paper, celluloid, magnetic tape), could have labels: for a book, a card catalog entry; for an archival document, a folder with a label, inside a bigger folder with its own label, inside a labeled box, etc. In order to restore and possibly expand the information that will help us locate the file among hundreds of thousands on a hard drive, we need to collect the object metadata. Descriptive metadata include, for example, the title, author, and description or abstract of the resource. There are other types of metadata, some dealing with the parameters of digital reformatting, some recording ownership and rights of the original, etc. The metadata are typically written down using metadata standards, to help with interoperability.
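As an illustration of what a standardized descriptive record can look like, here is a minimal Python sketch that writes a few fields as XML using element names from the Dublin Core element set. The choice of Dublin Core for this example, the wrapper element `record`, the helper `build_record` and all field values are assumptions made for the sake of the example, not a description of any particular catalog.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Wrap name/value pairs in Dublin Core elements under a <record> root.
    The <record> wrapper is a hypothetical container, not part of the standard."""
    record = ET.Element("record")
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


# Hypothetical values for a single scanned letter.
fields = {
    "title": "Letter to the editor",
    "creator": "Unknown",
    "date": "1920-05-01",
    "format": "image/tiff",
    "rights": "Rights status not determined",
}

record = build_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the element names come from a shared standard, another system can pick out the title or the date without knowing anything about the institution that created the record.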

Transcription

In the special case of written text (and, to a limited extent, music) we can also perform one more step. The text originally has a discrete form, as strings of characters. This string can be decoded by humans (provided they know the language or can read the handwriting), but for a computer the digitally reformatted text is just a mass of dots of different colors. The text can, however, be transcribed, i.e. each character changed into its numerical representation according to a specific coding (historically ASCII, today UTF-8). Now the text is ‘known’ to the computer as well, and can be dealt with in many ways - the book can be presented in different formats (including Braille or spoken word), one can search the full content of the text, etc. Transcription can be done by (human) hand or, if the printed text is clear enough, by using the computer technique of optical character recognition (OCR).
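A short Python sketch can show what ‘numerical representation’ means in practice. The sample strings are chosen here purely for illustration; the calls used (`ord`, `str.encode`, `str.find`) are standard Python.

```python
# A transcribed word containing a non-ASCII letter (illustrative sample).
text = "Piłsudski"

# Each character corresponds to a number (its Unicode code point).
code_points = [ord(ch) for ch in text]

# UTF-8 stores those numbers as bytes; 'ł' needs two bytes, the rest one each.
utf8_bytes = text.encode("utf-8")

# ASCII covers only 128 characters, so it cannot represent 'ł'.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Once transcribed, the full text can be searched like any other string.
page = "The archive holds letters and telegrams."
found_at = page.find("letters")

print(code_points)
print(utf8_bytes)
print(ascii_ok, found_at)
```

None of this is possible with the scanned image alone: the search works only because each character has been replaced by an agreed number.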

Digitization

Digitization is a concept that includes the elements mentioned above. It is the conversion of a resource recorded on a traditional medium into a digital one, and includes all the added features and responsibilities that go with it: organization of the resources, metadata collection and/or transcription, digital reformatting, providing access (including search, browse and other finding tools), and finally planning for the inevitable future changes in coding, format and hardware. Different communities emphasize different elements: for example, digitizing books almost always involves transcription (OCR), while archives put more emphasis on context and document selection. As archival digitization is what we do in the Pilsudski Institute, I will quote from the National Archives (NARA) definition of digitizing:

“...‘digitizing’ should be understood not just as the act of scanning an analog document into digital form, but as a series of activities that results in a digital copy being made available to end users via the Internet or other means for a sustained length of time. The activities include:

Document identification and selection
Document preparation (including preservation, access review and screening, locating, pulling, and refiling)
Basic descriptive and technical metadata collection sufficient to allow retrieval and management of the digital copies and to provide basic contextual information for the user
Digital conversion
Quality control of digital copies and metadata
Providing public access to the material via online delivery of reliable and authentic copies
Providing online ordering for reproduction services at quality or quantities beyond the capacity of an end user
Maintenance of digital copies and metadata”

The archival fonds often come ‘as is’, as collected by the donating person or institution. While the rule of respect des fonds should be the guiding principle, the archive can often gain from being better organized. For digitization, document identification and selection are important: we do not want the pages of a single letter scattered among others. We then give pages serial numbers within a unit (file), to preserve the context and integrity of the fonds. Scanning results in a high-resolution (600 dpi or more), uncompressed (TIFF) ‘digital original’. The original is preserved, and any further work, including website presentation, is done on copies. We collect metadata using the Dublin Core metadata standard, augmented by specific local identifiers. The workflow includes the creators of the digital document and those who proofread and verify their work (we use DSpace, an open-source software package, as a tool to collect metadata and enforce the workflow). The metadata, with reduced copies of the document pages, are then exported and transferred to the webpage displaying our archival collections.
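To give a sense of why further work is done on reduced copies, the following Python sketch estimates the size of one uncompressed scan. The page size (US Letter, 8.5 by 11 inches) and the color depth (24-bit RGB, 3 bytes per pixel) are assumptions chosen for illustration, and the estimate ignores the small overhead of the file header.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Estimate the raw size of a scan: pixels across, times pixels down,
    times bytes per pixel (3 bytes assumes 24-bit RGB color)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# One US Letter page at 600 dpi (page size is an assumption for illustration).
size = uncompressed_size_bytes(8.5, 11, 600)
size_mib = size / (1024 * 1024)

print(f"{size} bytes, about {size_mib:.1f} MiB per page")
```

Under these assumptions a single page takes 5100 x 6600 pixels, or roughly 96 MiB, which is reasonable for a preserved original but far too large to serve on a web page.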

Read more

Digital Reformatting in Wikipedia
Metadata in Wikipedia (the article is not very well written, but the discussion is fascinating)
Respect des fonds in Wikipedia
Strategy for Digitizing Archival Materials from National Archives

Marek Zieliński, December 17, 2013