Blog of the archivists and librarians of the Piłsudski Institute



Digitization. Illustration prepared with the use of a work by Junior Melo via Wikimedia Commons [CC-BY-SA-3.0]

Digitization seems to have as many definitions as user communities. Using the same word for different purposes is not rare. The word ‘organic’ has a well-defined meaning in chemistry (a carbon compound) and a common usage in food and other industries, with the general meaning of ‘good for you’. Therefore an organic substance may not be organic, and organic food may contain inorganic elements. Digitization does not show such a spread of meanings; nevertheless, in many contexts it is used to mean different things.

In Wikipedia the preferred term is “digitizing”, but the definition is not of much use: “Strictly speaking, digitizing means simply capturing an analog signal in digital form. For a document the term means to trace the document image or capture the "corners" where the lines end or change direction.” There is no mention of metadata, transcription, access and other elements generally associated with digitization in the archival and library community.

Let us examine some terms and processes which together constitute digitization.

Digital reformatting

Digital reformatting means converting an analog resource into a discrete form usable in computers. The resource is typically a product of human culture, and can be a two-dimensional object (a document, image, or book page) or a recording of music or moving images. We will not deal with three-dimensional objects, which can also be digitized, but the process is much more involved and not yet clearly established. Analog means that the signal (color, sound, etc.) is continuous, i.e. it can take any possible value in its range. Discrete means that only a limited set of values is possible, with nothing in between. Written language is a good example of a discrete string of elements, each taken from an alphabet of some 150 possibilities (in English, counting lower and upper case letters, numbers and a collection of special characters). There is no ‘intermediate value’ between a and b.

In the first step we need to divide the dimension(s) of the object into discrete elements. For a flat object we superimpose a grid (typically consisting of square ‘pixels’); for a signal in time we divide the time dimension into short intervals. Next we examine the signal in the discrete element - a single pixel or interval. For a flat image we will look at the color intensity and record it, too, as discrete values; similar treatment is applied to the electrical signal (from a microphone) in the selected time interval. This process is called sampling.

Finally we record the values together with their spatial or time coordinates using a specific coding (selected from many possibilities) and obtain a computer file. If the process was done lege artis, the file is considered a digital surrogate of the original.

Practically, we use a scanner (or a camera) for a flat object and a device called an A/D converter for electrical signals (you have one in your computer). A similar device can also reformat your video recording from a VHS player.
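To make the idea of sampling and quantization more concrete, here is a minimal sketch in Python; the signal, sample rate and bit depth below are invented for illustration and are not tied to any particular digitization standard.

```python
import math

# A toy 'analog' signal: a 440 Hz tone defined for any moment in time.
def analog_signal(t):
    return math.sin(2 * math.pi * 440 * t)

sample_rate = 8000          # how many times per second we examine the signal
bit_depth = 8               # how many bits we allow for a single sample
levels = 2 ** bit_depth     # 256 possible discrete values

samples = []
for n in range(80):                       # the first 10 milliseconds
    t = n / sample_rate                   # a discrete moment in time
    value = analog_signal(t)              # continuous value between -1 and 1
    quantized = round((value + 1) / 2 * (levels - 1))   # nearest of the 256 levels
    samples.append(quantized)

print(samples[:10])   # the digital surrogate: just a list of small integers
```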

Born digital

A similar process of generating a discrete representation of an analog signal occurs inside a camera while taking a snapshot, in a digital video camera or in a sound recorder. When the image, sound or event is immediately recorded in digital form, we consider it ‘born digital’. Digitization is concerned only with digital reformatting, not with born-digital resources.

Metadata

The file we obtain in the digital reformatting step may initially be devoid of metadata. The original resources, fixed on a specific medium (paper, celluloid, magnetic tape), could have labels: a book has a card catalog entry; an archival document sits in a labeled folder, inside a bigger folder with its own label, inside a labeled box, etc. In order to restore, and possibly expand, the information that will help us locate the file among hundreds of thousands on a hard drive, we need to collect the object metadata. Descriptive metadata include, for example, the title, author, and description or abstract of the resource. There are other types of metadata, some dealing with the parameters of digital reformatting, some recording ownership and rights of the original, etc. The metadata are typically written down using metadata standards, to help with interoperability.

Transcription

In the special case of written text (and to a limited extent music) we can also perform one more step. The text originally has a discrete form, as strings of characters. This string can be decoded by humans (provided they know the language or can read the handwriting), but for a computer the digitally reformatted text is just a mass of dots of different colors. The text can, however, be transcribed, i.e. each character changed into its numerical representation according to a specific coding (historically ASCII, today UTF-8). Now the text is ‘known’ to the computer as well, and can be dealt with in many ways: the book can be presented in different formats (including Braille or spoken word), one can search the full content of the text, etc. Transcription can be done by (human) hand or, if the printed text is clear enough, by using a computer technique of optical character recognition (OCR).
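As a small illustration of what transcription gives the computer, the sketch below (plain Python, nothing specific to any particular workflow) shows how a string of characters becomes a string of numbers under UTF-8:

```python
text = "Józef Piłsudski"

# Each character corresponds to a numeric code point...
code_points = [ord(ch) for ch in text]

# ...and the whole string is encoded into bytes using a specific coding, here UTF-8.
utf8_bytes = text.encode("utf-8")

print(code_points)       # 74, 243, 122, ... - 'ó' and 'ł' lie outside the old ASCII range
print(list(utf8_bytes))  # 'ó' and 'ł' each take two bytes in UTF-8
```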

Digitization

Digitization is a concept that includes the elements mentioned above. It is the conversion of a resource recorded on a traditional medium into a digital one, and includes all the added features and responsibilities that go with it: organization of the resources, metadata collection and/or transcription, digital reformatting, providing access (including search, browse and other finding tools), and finally planning for the inevitable future coding, format and hardware changes. The emphasis differs between communities: for example, digitizing books almost always involves transcription (OCR), while archives put more emphasis on context and document selection. As archival digitization is what we do at the Pilsudski Institute, I will quote from the National Archives (NARA) definition of digitizing:

“...’digitizing’ should be understood not just as the act of scanning an analog document into digital form, but as a series of activities that results in a digital copy being made available to end users via the Internet or other means for a sustained length of time. The activities include:

  • Document identification and selection
  • Document preparation (including preservation, access review and screening, locating, pulling, and refiling)
  • Basic descriptive and technical metadata collection sufficient to allow retrieval and management of the digital copies and to provide basic contextual information for the user
  • Digital conversion
  • Quality control of digital copies and metadata
  • Providing public access to the material via online delivery of reliable and authentic copies
  • Providing online ordering for reproduction services at quality or quantities beyond the capacity of an end user
  • Maintenance of digital copies and metadata”

Archival fonds often come ‘as is’, as collected by the donating person or institution. While the rule of respect des fonds should be the guiding principle, the archive can often gain if it is better organized. For digitization, document identification and selection are important: we do not want pages of a single letter scattered among others. We then give pages serial numbers within a unit (file), to preserve the context and integrity of the fonds. Scanning results in a high-resolution (600 dpi or more), non-compressed (TIFF) ‘digital original’. The original is preserved, and any further work, including website presentation, is done on copies. We collect metadata using the Dublin Core metadata standard, augmented by specific local identifiers. The workflow includes the creators of the digital document and those who proofread and verify their work (we use DSpace - open source software - as a tool to collect metadata and enforce the workflow). The metadata, with reduced copies of the document pages, are then exported and transferred to the webpage displaying our archival collections.
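As an illustration only, a Dublin Core description of a single digitized document might look like the sketch below; the field values and the local identifier are invented, and a real record in DSpace carries more qualifiers than shown here.

```python
# A simplified Dublin Core record for one digitized archival document.
# All values below are hypothetical examples, not an actual Institute record.
record = {
    "dc.title": "Letter concerning the organization of the adjutancy",
    "dc.creator": "Unknown author",
    "dc.date": "1920-08-15",
    "dc.description": "Typescript letter, 2 pages.",
    "dc.identifier": "701/2/15",          # invented local archival identifier
    "dc.format": "image/tiff",
    "dc.language": "pol",
    "dc.rights": "Copyright status under review",
}

for element, value in record.items():
    print(f"{element}: {value}")
```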

Read more

Marek Zieliński, December 17, 2013


An example of an RDF Linked Data graph (reification) - by Karim Rafes (own work) [CC-BY-SA-3.0], via Wikimedia Commons

Linked Data is a mechanism used by the Semantic Web, or "Web 3.0 in construction". What is the Semantic Web? We all use the World Wide Web (WWW), the main component of which is the hyperlink, a reference or link to other sites. Clicking on a hyperlink (it has "http" in its name) will open a new web page. The Web was created for human consumption, and just like natural language it is understood by people.

Compared to us, computers are rather dumb, and one has to be extremely explicit in providing them with instructions. On the other hand, they are very fast and can handle vastly more data than we can. And that means that in petabytes of data they can find the single piece of information we need. To make it work, we have to be very precise, we need reliable sources of information and a system that connects it all. This system is Linked Data.

Why should we be interested in Linked Data? Out of curiosity, obviously, to understand how the digital world around us works today. Linked Data is especially important for archivists, librarians, and others working in the field of data processing. If you work in an institution which has some good-quality data in any field, making that data available as Linked Data can significantly increase the prestige of the institution in the world.

The basic rules of Linked Data, encapsulated in RDF (Resource Description Framework), are the use of references (URIs) instead of text, and the use of simple statements about resources in the form subject - predicate - object.
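A minimal sketch of such statements, written with the Python rdflib library; the specific DBpedia URIs below are given only as plausible examples, not as data we publish.

```python
from rdflib import Graph, URIRef, Literal   # pip install rdflib

g = Graph()

person = URIRef("http://dbpedia.org/resource/Józef_Piłsudski")

# One statement (triple): subject - predicate - object, all identified by URIs.
g.add((
    person,                                               # subject
    URIRef("http://dbpedia.org/ontology/birthPlace"),     # predicate
    URIRef("http://dbpedia.org/resource/Zalavas"),         # object (a URI as well)
))

# Objects can also be literals (plain text, here with a language tag).
g.add((
    person,
    URIRef("http://www.w3.org/2000/01/rdf-schema#label"),
    Literal("Józef Piłsudski", lang="pl"),
))

print(g.serialize(format="turtle"))
```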

Graphic formats

These days we use digital photography more and more often. What was a novelty only a dozen or so years ago has become the norm, and traditional cameras are becoming rare. We can see the image instantly, all the devices we carry (phones, tablets, etc.) have photo capability, and memory and cameras are constantly becoming cheaper - all this resulting in the creation of more and more photos. At the same time, photography has become something very transient. In the past you would paste photos into albums or collect them in boxes, while now they exist as files on a computer disk, and at the first disk failure we suddenly lose our treasured resources. Personal digital archiving is a broad subject; this time let us focus on packing the images in digital envelopes called files.

An image is not just a photo. A scan of a document in an archive (personal or institutional) is a digital record that should faithfully reflect the original document. How do we choose the best way of preserving images for the next generation, so that our grandchildren can enjoy their grandparents' photo albums and archives can preserve invaluable (because the paper did not survive) archival images? Saved images are stored on a computer disk in a container called a file. We will talk about the format of these envelopes, compression and metadata, as well as translating an image from one format to another (conversion).

A digital camera is an imitation of the retina of the eye. The imitation is not very good, because the eye works differently than the camera, but we can treat it as an approximation. The picture - collected by a lens or scanned on a flatbed scanner - is divided into small sections, usually square (pixels), and the color is stored separately for each square. The data for three colors (different than in the human eye) are recorded. As a result we obtain a rectangular matrix, each cell containing color data. The image is characterized by its dimensions in pixels (height and width) and a third dimension, the depth of the color. The most popular model uses 8 bits for each of the three primary colors (24 bits in total), which provides the ability to store more than 16 million color tones. The saved data are packed into files in one of the formats known as raster formats.
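A quick way to see this matrix structure is to open a scan with the Pillow library in Python; the file name below is hypothetical.

```python
from PIL import Image   # the Pillow library: pip install Pillow

img = Image.open("scan_0001.tif")   # hypothetical file name

print(img.size)              # (width, height) in pixels
print(img.mode)              # e.g. 'RGB': three primary colors, 8 bits each (24-bit color)
print(img.getpixel((0, 0)))  # the color of one pixel: three values between 0 and 255
```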

Format selection criteria

Until recently, we did not need any tools (except for glasses - occasionally) to look at paintings, photos, or to read a book. Today, more and more often we have to use equipment (a computer or device that performs the same function under various names - a phone, tablet, etc.). What's worse, we find a large number of formats, better or worse suited to our requirements. What are these requirements?

  1. The image format should be public, not closed. Some formats, particularly older ones, were created by companies dealing with image processing, which retain the rights to them. Usually the format is published and publicly available. Formats defined as international standards (e.g. ISO) are much more likely to remain useful in the future.
  2. The format should be popular (which can sometimes conflict with requirement 1). A standard that has no readily available tools can be used only in theory.
  3. Image processing tools should be easily accessible, and readers should be free or cheap, preferably open source. Giving someone a photo with a comment "you can see it, but first you must buy a program for $500" is in poor taste. The basic treatment of images, such as rotate, crop, resize, etc., should be available in popular, low-cost and/or open source tools.
  4. Formats should be able to save metadata - for details see the blog post "The reverse side of a digital photo".

Resolution and compression

Those of us who dabbled in film photography remember film grain, related to its speed. The lower the speed, the smaller the silver halide crystals, and the finer the details which could be registered. In a digital camera the crystals are replaced by photosensitive elements - the denser the elements, the finer the details. Sensor resolution is usually given in (mega)pixels. Scanner resolution is typically given in pixels per inch (or centimeter), abbreviated as ppi or dpi.

Image size in computer memory (width x height [in pixels] x 3 bytes) can be significant. To save space, some formats use compression. We will not elaborate here on the compression algorithms, which are numerous; it suffices to consider the compression-decompression cycle. If it leaves the image unchanged, the compression is considered lossless; if not, lossy. Lossy compression can be much more effective in reducing file size but, depending on its intensity, can leave traces (artifacts).
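The compression-decompression cycle is easy to check for yourself. Below is a sketch with Pillow; the file names are hypothetical, and the exact differences depend on the image and the quality setting.

```python
from PIL import Image, ImageChops   # Pillow

original = Image.open("scan_0001.tif").convert("RGB")   # hypothetical file name

# Round trip through a lossless format: the pixels come back unchanged.
original.save("copy.png")
restored = Image.open("copy.png").convert("RGB")
print(ImageChops.difference(original, restored).getbbox())   # None - no difference at all

# Round trip through a lossy format: some pixels differ after decompression.
original.save("copy.jpg", quality=75)
restored = Image.open("copy.jpg").convert("RGB")
print(ImageChops.difference(original, restored).getbbox())   # usually a box enclosing the differences
```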

Formats

GIF

GIF (Graphics Interchange Format) was introduced by CompuServe in 1987. It uses lossless compression, but is limited to 8 bits for all three colors together (a palette of at most 256 colors or levels of gray). Therefore it is not suited for photography, where we expect a bigger palette of colors. Metadata recording capabilities are very limited. GIF has, however, two very desirable features. We can define a transparent color, allowing us to create graphics (such as logos) that can be pasted over already existing patterns. GIF also has the ability to save multiple images that can be viewed as short movies (animations) - this function alone has resulted in the undiminished popularity of this format. Most web browsers can display GIF files, including animation, and it is supported by almost all graphics programs. The files have the extension .gif.

PNG

The PNG (Portable Network Graphics) format was developed to overcome the problems with GIF - the limited number of colors and the patented compression method. It was approved for use on the Internet in 1996 and acquired the status of an ISO standard in 2004. PNG allows us to save graphics and photos using 24- or 32-bit color, and it also has the capability of a transparent color. It uses lossless compression, so it is suitable for archival storage. Metadata recording capabilities are limited: the Exif format (used by cameras) is not supported; it is possible to use XMP metadata, but popular programs cannot read or write the data. The PNG format is growing in popularity; it is displayed by web browsers and is supported by most graphic programs. Files have the extension .png.

TIFF

TIFF (Tagged Image File Format) was created by Aldus and put into use in 1986. Although it is more than 25 years old, it is still a very popular format among graphic designers, photographers and the publishing industry. It can save files of up to 4 GiB in full color. TIFF has the ability to record multiple images (so you can save all pages of a document), uses lossless compression and can also store uncompressed data. The standard is administered by Adobe, which acquired Aldus. It has many add-ons and extensions (version 6.0 is relatively universal) as well as several versions registered as ISO standards. It does not have animation or transparency and is not displayed by the most commonly used web browsers. It is popular as a format for storing archival images and scans. It can store Exif and IPTC metadata; using XMP, although theoretically possible, is not a common option. TIFF is very popular and is supported by almost all graphics programs. The files have the extension .tif or .tiff.

JPEG

JPEG (Joint Photographic Experts Group) is a very popular format created for digital photos and other half-tone images. It always uses compression, which is lossy but provides a significant reduction in size. At the same file size, an image in JPEG format may have 25 or more times as many pixels (5 times the linear dimension) as, for example, TIFF, which largely compensates for the compression losses. For archival documents it presents two problems: first, compression errors are most evident at contrasting element boundaries (for example at the edges of text characters), and second, each further processing step generates additional errors, because you cannot completely turn off compression. The latter problem can be partially bypassed in photo processing if one uses a program (such as Picasa) which saves only the transformations, leaving the original unchanged.
JPEG is a registered ISO standard, is supported by all image processing and display programs as well as by web browsers - it is the most popular format for recording and viewing photos. In a JPEG file you can also save metadata in Exif, IPTC and XMP, which significantly increases its versatility. The most common file extensions are .jpg and .jpeg, although sometimes .jif, .jfif and others are used.
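The effect of repeated recompression can be demonstrated with a short experiment; the sketch below (Pillow, hypothetical file name) re-saves the same photo ten times and measures how far the pixels have drifted from the starting image.

```python
from PIL import Image, ImageChops   # Pillow

start = Image.open("photo.jpg").convert("RGB")   # hypothetical file name

current = start
for generation in range(10):
    current.save("resaved.jpg", quality=75)      # each save recompresses the image
    current = Image.open("resaved.jpg").convert("RGB")

diff = ImageChops.difference(start, current)
print(diff.getextrema())   # (min, max) difference per color channel after ten generations
```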

JPEG 2000

JPEG 2000 (files use the extension .jp2) is the next-generation format developed by the Joint Photographic Experts Group. It has all the advantages of JPEG compression, a better algorithm, and is an ISO standard. It also offers a lossless compression mode, so it is suitable for the storage of archival materials. The metadata recording format is only XMP. All in all it is a very good future graphic format.
Although it was introduced over 10 years ago, it lacks popularity. Many readers and graphic editors either do not support JPEG 2000 or support it only to a limited extent, using plug-ins, and loading an image in this format takes considerably longer. Picasa does not support this format, and metadata recording requires specialized tools. JPEG 2000 is not displayed in web browsers.

Other formats

There are many other formats, and here are a few you might encounter.

RAW is the common name for many formats storing raw data from the camera sensor - they contain the most detailed image data, which can then be further processed. Although many of them use elements of TIFF, the formats are closed, limited to the camera manufacturer, and as such are not suitable for long-term storage or for sharing images.

BMP is a Microsoft raster format, created for Windows. It is very popular and can be encountered frequently, especially in older applications and in Windows graphics.

PDF (Portable Document Format) is not a graphic format, but it can also incorporate graphics. It is a description of a document, containing all the elements necessary to show or print a single- or multiple-page document. It was created by Adobe in 1991-93 and popularized by the company's publication of free PDF readers. Since 2008 it has been an ISO standard and is no longer controlled by Adobe. In 2005 an ISO standard called PDF/A (a subset of PDF) was published, with a focus on long-term archival storage.

PDF, and especially PDF/A, is recommended as a format for long-term storage of documents. It is good for this purpose, providing a versatile, relatively permanent page format which can also include graphics, both raster and vector. PDF is not a graphic format, however, and for photos and scans it is only an additional envelope that wraps the picture. PDF is not directly displayed by your web browser, nor by programs for image processing. The latest version (PDF/A-2 of 2011) provides JPEG 2000 compression and the use of metadata, both for the entire document and for individual pages. Processing tools for PDF (excluding the proprietary and rather expensive Adobe tools), however, are rare, and even simple manipulations such as adding, removing or rotating the pages of a document require significant effort. When it comes to presentation (rather than long-term storage) of multi-page documents, PDF is simple to use, and competes with another format created for this purpose, DjVu.

Recommendations

What format should we use to store image document scans at home and in the archive? We can see that in the future we will have a great format for archiving and displaying files, including metadata, and great tools to view our resources on any device. This day has not come yet. We have old formats that are common, and new ones that are better but lacking the tools that would make them usable right now. It is therefore likely that our children and grandchildren will have to make a conversion to a 'proper' format, perhaps in 2050 - to put the photos in new, better envelopes.

What should we do for now? Photographs can be saved in JPEG format at the highest resolution possible. Cameras usually have a variety of options, and you should always choose the best quality. This increases the file size, but memory is cheap and its price is steadily declining. Store the original images and do not modify them; make copies instead. Add the metadata (common viewers like IrfanView or XnView can do it, as can Picasa (Options / Tags / Store the tags in the photos)). Scans, especially of archival materials, should be stored as TIFF files. Later, you can convert them to JPEG 2000 when it becomes more common. Recording metadata is also highly recommended, although archives usually want to add more information: where the documents came from, what was their fate, what is in them, etc. For this I recommend a simple spreadsheet or office document, or a specialized archival program such as DSpace or Archivists' Toolkit. If you want to save documents created electronically, the PDF format is very well suited for this purpose.
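As a sketch of this recommendation in practice (Python with Pillow; the file names are invented, and writing .jp2 requires a Pillow build with the OpenJPEG codec):

```python
from PIL import Image   # Pillow

master = Image.open("scan_0001.tif")    # the untouched 'digital original' (hypothetical name)

# A reduced JPEG copy for everyday viewing or the website; the master file is not modified.
copy = master.copy()
copy.thumbnail((2000, 2000))            # shrink only the copy, keeping the aspect ratio
copy.save("scan_0001_web.jpg", quality=90)

# A JPEG 2000 copy, for the day the format becomes more widely supported.
master.save("scan_0001.jp2")
```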

Read more

Wikipedia articles on the graphics formats

Marek Zieliński, November 2, 2013

Explore more blog items:


EAD

EAD (Encoded Archival Description) is a standard created expressly for encoding archival finding aids. For this reason it is a hybrid. On the one hand, it tries to reflect the way in which archivists work when creating finding aids; on the other, it tries to introduce the discipline and accuracy necessary for electronic document processing. The result is a lot of flexibility in the placement of data, which facilitates the work of an archivist but at the same time makes it rather difficult to exchange data. The new version of EAD (EAD3), which has been in preparation for several years, may hopefully reduce much of this arbitrariness.

The rules and principles of creating finding aids are contained in separate documents. In addition to the international standard, ISAD(G), there are rules established in different countries, such as DACS in the U.S., that are similar but often have subtle differences. EAD is an encoding following such rules, understandable by humans but also suitable for computer processing. Like all modern metadata standards it is expressed in XML and consists of a series of nested labels like <ead>, along with the rules of nesting and the rules governing their content.
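To give a flavor of that nesting, here is a radically simplified skeleton built with Python's standard library; a real finding aid contains many more required elements, and the identifier, title and dates below are invented.

```python
import xml.etree.ElementTree as ET

# A toy EAD skeleton - only to show the idea of nested labels, not a valid finding aid.
ead = ET.Element("ead")

header = ET.SubElement(ead, "eadheader")
ET.SubElement(header, "eadid").text = "US-XX-0001"          # invented identifier

archdesc = ET.SubElement(ead, "archdesc", level="collection")
did = ET.SubElement(archdesc, "did")
ET.SubElement(did, "unittitle").text = "Example collection of papers"   # invented title
ET.SubElement(did, "unitdate").text = "1918-1922"

print(ET.tostring(ead, encoding="unicode"))
```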

Please treat the following text as an encouragement and an introduction to reading the collection of essays Debates in the Digital Humanities, edited by Matthew K. Gold and published in 2012 by the University of Minnesota Press. The anthology has also been published, in a somewhat expanded form, as an open-access text, which is available here.

Digital humanities (DH for short) is a relatively new field that is gaining more and more popularity in the academic world. The article in the English Wikipedia gives a very neat definition of DH, to which I refer interested readers. In short, digital humanities is an area of research, teaching, and creation that combines information technology with the humanities. It covers activities ranging from the curation of digital collections on the web to data mining performed on large data sets. DH tries to combine the methods of traditional humanities disciplines (such as history, philosophy, linguistics, literary studies, art, music, etc.) with computational tools such as data visualization, information retrieval, data and text mining, statistics, and electronic publishing.

When we look at the reverse side of an old photograph, we can often find the stamp of the photographer, a note on the place and date of the photograph, and sometimes who is depicted in it. But where is the "reverse side" of a digital picture?

The filename is not a good place to store this information. It turns out, however, that digital images do have a "flip side": information about the picture or scan, stored within the file. Storing this information does not alter the picture itself, and it can be read (and written) with a proper tool - a computer program.

This type of information, or metadata, can belong to many different categories. A digital camera typically saves a lot of technical data, such as the shutter speed, aperture, number of pixels and details of the camera itself. This metadata is stored using a standard called Exif. When transmitting images it is very useful to store information about what is shown in the photo, who made it, its title, author, copyright information, etc. This data is stored in a standard called IPTC. Both Exif and IPTC were introduced around 1995, so they are quite old and venerable. That has its advantages - most photo-viewing software can read the labels, and the metadata are readily available. These standards also have a number of drawbacks:

  • Not all digital file formats use the metadata standards (e.g. images in PNG format do not contain Exif data).
  • The number of tags is limited, without the possibility of adding new ones; important fields, such as naming the people in the picture, are missing.
  • Text fields are limited in size (to a small number of characters), there is no Unicode support (hence no support for Polish letters), it is impossible to write in more than one language, and many more.
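Reading this "reverse side" takes only a few lines; a sketch with the Pillow library in Python follows (the file name is hypothetical, and what gets printed depends on what the camera actually recorded).

```python
from PIL import Image
from PIL.ExifTags import TAGS   # maps numeric Exif tag ids to readable names

img = Image.open("photo.jpg")   # hypothetical file name
exif = img.getexif()

for tag_id, value in exif.items():
    name = TAGS.get(tag_id, tag_id)
    print(name, value)           # e.g. Make, Model, DateTime, Orientation ...
```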

A lighter topic for hot summer days (perhaps not necessarily lighter, but surely hot). Recently I was getting many e-mails from friends, all with Yahoo accounts. The emails looked pretty much the same - "hey, look what I found interesting" and a link to a website. If the text is in English and your correspondent uses Polish, it is easy to become suspicious immediately, but this is not always the case. The link can lead to a page that infects your computer; it may even try to steal your passwords. This phenomenon already has a name - spear phishing.

I also have friends who go to the other extreme and avoid any online presence - they do not join online communities, do not respond to e-mails (or do not even use a computer, which is a conservative extreme). They throw out the baby with the bath water - a presence on the net has its genuine advantages, as can be seen particularly when you are far away from the people close to you.

Sometimes I have to deal with someone else's computer completely overrun by viruses. Usually the computer runs very slowly, and any attempt to connect to a website redirects to another page (probably even more infected). In this case, the best solution is to copy the valuable materials (and then pass them through a good antivirus program) and completely reformat the hard drive.

It happens often that my mail is rejected by the recipient's server (usually with a lame excuse). This problem is a little complicated - it is seen only by the sender (the recipient usually responds "I always get my emails"...) and it can be fixed only by the recipient.

How to deal with all this? There's no great magic, just common sense. Here are some observations from my own experience:
