World Digitization Projects

Codex Sinaiticus, Esther 4:17m - 5:2 (book 9, chapter 5)

The answers to the question “What is digitization?” are as diverse as the resources converted into electronic form and the institutions that undertake such a task. Some projects deal with only a single document, others describe in great detail an event or the work of one individual, and others yet provide access to virtual archives of history. There are projects that demonstrate innovative technical solutions, combining different techniques and sources of information to provide new ways of viewing and searching. Institutions that have rich collections of their own develop exhibitions of selected resources, while other projects are based on the cooperation of many institutions. Here are some examples to illustrate this diversity:

Codex Sinaiticus, created in the mid-fourth century, contains the text of the Bible in Greek, including the oldest complete copy of the New Testament. Until the mid-19th century the manuscript was kept in the Monastery of Saint Catherine, the oldest existing Christian monastery, situated at the foot of Mount Sinai in Egypt. Today, parts of this manuscript are held by four institutions: the British Library in London, the Library of the University of Leipzig, the Russian National Library in St. Petersburg, and the Monastery of Saint Catherine. The project website is the result of the cooperation of these four institutions. It is very carefully designed and contains all of the fragments of the codex. In addition to scans of the original pages, a transcript of the Greek is provided, which, for some pages, is also translated into other languages (English, German and/or Russian). Crosslinks allow one to select a passage in the image of the original and locate the corresponding transcribed text.

The Newton Project is an example of a monothematic collection. All the writings of one of the greatest scientists of all time, Isaac Newton, are made available on the project website. Original scans can be viewed, as well as transcripts of his writings, including both published and unpublished manuscripts on science, mathematics, religion and alchemy, as well as his notebooks and correspondence. You can read two versions of each document: "diplomatic" and "normalized." The "diplomatic" version is a transcript of the draft, with all amendments, deletions and changes visible. The "normalized" version corresponds to the final, ready-to-print form. The transcription and markup follow the TEI (Text Encoding Initiative) standard, which means that the text can be automatically converted to virtually any representation, in various display formats and for different devices (or even printed on paper).
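How one marked-up transcription can yield both versions may be easier to see in a small sketch. The Python below is only an illustration, not the Newton Project's actual tooling: it takes a TEI-like fragment that uses the TEI elements del (deleted text) and add (added text) and renders it twice. The sample sentence is invented, the bracket notation for the diplomatic view is an arbitrary choice, and the TEI namespace is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence (not Newton's text); TEI namespace omitted for brevity.
SAMPLE = (
    "<p>The <del>motion</del><add>force</add> "
    "of a body is proportional to its mass.</p>"
)

def render(elem, mode):
    """Return the text of elem in 'diplomatic' or 'normalized' mode."""
    parts = [elem.text or ""]
    for child in elem:
        inner = render(child, mode)
        if child.tag == "del":
            # Deletions stay visible in the diplomatic view and are dropped otherwise.
            if mode == "diplomatic":
                parts.append("[-" + inner + "-]")
        elif child.tag == "add":
            # Additions are flagged in the diplomatic view and merged in otherwise.
            parts.append("[+" + inner + "+]" if mode == "diplomatic" else inner)
        else:
            parts.append(inner)
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SAMPLE)
diplomatic = render(root, "diplomatic")
normalized = render(root, "normalized")
print(diplomatic)
print(normalized)
```

The point of the design is that the transcription is stored once, and each presentation is just a different rule for handling the same markup.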

Mapping Our Anzacs. Anzacs are the soldiers of the Australian and New Zealand Army Corps (ANZAC), who fought in the First World War (at Gallipoli and elsewhere). The site contains very detailed information about ANZAC soldiers, scans of original documents, crosslinks to maps displaying their places of birth, and certificates of their joining the army. The site is interactive and, which is rare, allows users to add information to the records of individual soldiers, as well as to help identify people in the pictures. The website's maps show the soldiers' places of origin (for example, two were born in Radom). A new site, "Discovering the Anzacs", is under construction and will offer extended functionality. There is also a similar French site, Mémoire des hommes, with data on the soldiers who fought for France.

Brooklyn Daily Eagle is a historical newspaper of Brooklyn, NY, which grew from a four-page paper in 1841 to 16 pages at the end of the 19th century. It was published until 1955, with a brief resumption in 1960-1963. The Brooklyn Library digitized issues from 1841 to 1902 before the project ran out of funds. This is an example of a very thorough and detailed treatment of printed material. Microfilms were borrowed from the Library of Congress, scanned, and then processed with OCR (Optical Character Recognition). The pages were segmented into columns and articles, and the results saved as XML-tagged text (using the Dublin Core standard). PDF images were constructed in such a way that the searchable digital text is applied as a transparent overlay on the original image. Searching gives impressive results, highlighting the matching portion of the original text.
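The idea of a searchable text layer placed over a page image can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the project's actual software: the Word record, the find_highlights function, the sample words and their pixel coordinates are all hypothetical. It assumes only that OCR yields each recognized word together with its position on the page, so that a match in the text can be turned into a rectangle to highlight on the image.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One OCR-recognized word and its bounding box on the page image (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int

def find_highlights(words, query):
    """Return (x, y, width, height) boxes of words matching query, ignoring case."""
    needle = query.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        # Strip trailing punctuation so that "opened." still matches "opened".
        if w.text.lower().strip(".,;:!?") == needle
    ]

# Hypothetical OCR output for one line of a page; coordinates are invented.
page = [
    Word("Brooklyn", 40, 100, 120, 18),
    Word("Bridge", 165, 100, 90, 18),
    Word("opened.", 260, 100, 100, 18),
]

boxes = find_highlights(page, "opened")
print(boxes)
```

A viewer would then draw the returned rectangles on top of the scanned image, which is what makes the highlight appear on the original print rather than on a retyped version.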

Polish American Pamphlets. This interesting collection of Polonica includes brochures issued by Polish organizations of various types, such as anniversary booklets, convention materials, and programs of concerts, celebrations, reunions, dedications, dinners, etc. These pamphlets often include historical essays, photographs, and membership lists. The collection has been developed and made available by the Polish American Archives at Central Connecticut State University, and is a splendid contribution to the study of the everyday life of Polish communities in America in the 19th and 20th centuries. The brochures are scanned and provided with metadata, but for the most part without transcription.

NYPL Digital Collections is a newly opened (still in beta) presentation of nearly 800,000 digitized objects from the holdings of the New York Public Library. It focuses on the visual, containing a large number of photographs, illustrations and engravings, but also other kinds of materials. Of all the projects mentioned here it has the most detailed taxonomy, as befits a library. In addition to the mandatory search engine, one can browse the collections by subject headings (e.g., "Art - Czechoslovakia - Periodicals"), names, places, collections, genres, publishers and types of material (782 thousand images, 5.5 thousand texts, 1.5 thousand videos, 562 maps, etc.). The presentation is excellent, with intuitive interfaces adapted to the type of resource (with different displays for maps, photos, multi-page documents, etc.).

Digital Public Library of America is a repository of digitized resources sourced from many different institutions in the U.S. Many of them can be seen on the websites of these institutions, but DPLA provides a common, unified approach to locating resources. There are several thematic exhibitions, a search engine (the keyword "Pilsudski" returns 10 hits from digitized books), and an interesting use of multiple dimensions: there are maps showing institutions (not the origin of resources) and a two-dimensional timeline. Books are listed with a rectangle whose two dimensions denote the number of pages and the physical size of the book. Resources are diverse - from museum exhibits to documents from state archives to 1.6 million digitized books. The resource is quantitatively impressive - more than 5 million objects accessible through a user-friendly interface. After finding an object, DPLA redirects the user to the institution that holds the resource, which gives varied results - from excellent presentations to some with problems (e.g. non-uniform use of UTF-8 encoding and thus difficulty in displaying Polish or other diacritics).
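The encoding problem mentioned above is easy to reproduce. The short Python sketch below is an illustration only and makes no claim about what any particular institution's site does: it assumes a page that stores text as UTF-8 but is read as if it were Latin-1, which is one common way Polish letters end up garbled. The variable names are chosen for the example.

```python
# A name with a Polish diacritic, as it should appear.
name = "Piłsudski"

# The text is stored correctly as UTF-8 bytes ("ł" takes two bytes)...
raw = name.encode("utf-8")

# ...but is read back with the wrong encoding, so "ł" turns into two wrong characters.
garbled = raw.decode("latin-1")

# If the wrong encoding is known, the damage can be undone by reversing the step.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(name), len(garbled))
print(repaired == name)
```

The text itself is not lost in this scenario; it is only displayed through the wrong lens, which is why a consistent encoding declaration across institutions matters.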

Project Gutenberg is a system that offers open and free electronic books (eBooks). It is based on the work of volunteers (crowdsourcing) as well as on grants. Its philosophy is to provide books (mostly) in different forms so that most people who use computers can easily read, use, quote, and search them. The eBook is the farthest-reaching form of digitization: the material is not only scanned, but also transcribed (using OCR) with manual correction, with illustrations added in the right places, etc. In other words, it is a full publishing system. Such e-publications can be (and are) formatted for different systems - from plain text to HTML web format and PDF, as well as a variety of e-book formats such as Kindle, eBook and EPUB, which allow one to display the books on devices with different proportions and sizes, such as desktops, tablets and mobile phones. Project Gutenberg offers 42,000 titles and, together with its sister systems (e.g. language versions in other countries), more than 100,000.

Internet Archive, a non-profit organization, is known for its Wayback Machine, which stores snapshots of Web pages and makes it possible to find pages that have long since been closed. But apart from that, the Internet Archive also digitizes books on a large scale - currently about 1000 books are digitized daily. The main collection has some 5 million books and other materials from 1.5 thousand different collections and libraries in North America, Europe and Asia, representing 150 languages. The more contemporary Open Library has more than two million books in eBook form that can be read on-site or borrowed.

Google Books is the largest book digitization project in the world (about 30 million titles). Due to the scale of the project, Google has been both praised and sued for this effort. It is praised for making a large portion of human knowledge and literature freely available, and sued for the same reason. Due to the diversity of copyright laws around the world, as well as other restrictions, access to books varies. Books in the public domain (as well as those whose authors have given their consent) are available in their entirety. One can view the scanned images of pages, with a superimposed "transparent layer" of transcribed text. This allows one to search for words and phrases in the text (which is Google’s specialty). Other books, based on agreements with the author or publisher, are indexed and available as a preview: one can read selected passages, often the introduction and/or selected chapters, and buy the book to read it all. These resources are also indexed completely, and a search shows the snippet of text containing the found word in the preview. Other publishers require complete removal of the work from indexing (Google Books only shows the title and the author), which in a way also removes the work from the memory of the world.

At the Pilsudski Institute we have made available some 10,000 digitized archival documents from 8 fonds (archival collections). They are indexed by hand; one can browse the collection folder-by-folder and page-by-page, or select names, places and dates of interest (there is also a universal search engine). The first joint archival presentation debuted last year: a single Jozef Pilsudski archive on the Internet, physically located on both sides of the Atlantic (New York and London).

Marek Zieliński, 15 January 2014
