Blog archiwistów i bibliotekarzy Instytutu Piłsudskiego


On Wednesday, January 15, 2014 the Annual Conference of the Metropolitan New York Library Council (METRO) was held in New York. The conference, which took place in the modern Vertical Campus at Baruch College (CUNY), brought together more than two hundred representatives of libraries, archives, colleges and other institutions from New York City and the surrounding areas. Participants had a choice of 25 presentations and lectures showing various aspects of the work, opportunities and achievements of the broadly understood library community. The Pilsudski Institute presented a lecture by Marek Zielinski and Iwona Korga on the Digitization of Polish History 1918-1923, describing the digitization project and showing selected archival sources, digitization techniques, special projects, online presentation and statistics.

The conference began with a keynote lecture delivered by the well-known librarian and blogger Jessamyn West. In her presentation Open, Now! she told us about the possibilities of Open Access, with unrestricted, free access to a wide range of sources through the Internet. She talked about Google, the Digital Public Library of America, the Open Library and the legal issues associated with such access.

Next, the participants could choose lectures from a wide selection of topics. Here are some notes on those we attended. The program of the conference is available online; the links below lead to the slideshows from the presentations:

Building Authorities with Crowdsourced and Linked Open Data in ProMusicDB (Kimmy Szeto, Baruch College, CUNY and Christy Crowl, ProMusicDB). Linked Data is a great concept, but where does one find authoritative data? The authors presented their search for sources of information on music performers and their roles. They found data in many diverse places on the Internet. In the talk they presented the information sources and the ways of reconciling the data to obtain a consistent and usable dataset.

Metadata for Oral History: Acting Locally, Thinking Globally (Natalie Milbrodt, Jane Jacobs and Dacia Metes, Queens Library). The representatives of the Queens Public Library presented their latest project on the history of the borough of Queens, including lectures, pictures and memorabilia of the oldest inhabitants of the district. They emphasized the difficulty of choosing the software and the metadata model, especially in terms of geographical names. Very useful were the pointers on how to describe a personal interview (Who? What? When? Where? Why? How?).

Mug Shots, Rap Sheets, & Oral Histories: Building the New Digital Collections at John Jay College of Criminal Justice (Robin Davis, John Jay College of Criminal Justice). The representative of an academic library outlined the stages of work, metadata, and showed some of the most interesting documents from a forthcoming web exhibit on the history of the NYPD (New York Police Department) and recordings of interviews with New York City Mayor Ed Koch.

Wikipedia in Cultural Heritage Institutions (Dorothy Howard, Metropolitan New York Library Council). Dorothy Howard is currently the METRO Wikipedian-in-Residence. She presented the latest projects, such as GLAM-Wiki and Wikimedia Commons, and told us about the activities of Wikipedians to raise the level and quality of articles, especially those regarding medical issues.

Beyond digitization: hacking structured data out of historical documents (Dave Riordan and Paul Beaudoin, NYPL Labs, The New York Public Library). The programmers from the New York Public Library embarked on an ambitious crowdsourcing project of extracting metadata from vast library collections. They have built tools to entice volunteers to transcribe documents, help them with the task, verify their work by reconciling the results of different people, and more, in the process improving and repurposing the Open Source software Scribe. The topics described in the presentation ranged from restaurant menus to playbills.

Open Access is a Lot of Work!: How I Took a Journal Open Access and Lived to Tell About It (Emily Drabinski, LIU Brooklyn). A very interesting presentation describing the work on the journal “Radical Teacher” and the change from the traditional paper publishing system to Open Access, enabled in collaboration with the University of Pittsburgh. She stressed that this type of transformation requires a huge amount of work.

The METRO team organized the conference very professionally and took care of every single detail. Thanks and see you next year!

Iwona Korga, January 21, 2014

Codex Sinaiticus, Esther 4:17m - 5:2 - book 9, chapter 5

The answers to the question “What is digitization?” are as diverse as the resources converted into electronic form and the institutions that undertake such a task. Some deal with only a single document, others describe in great detail an event or the work of one individual, and others yet provide access to virtual archives of history. There are projects that demonstrate innovative technical solutions by combining different techniques and sources of information to provide new ways of viewing and searching. Institutions that have rich collections of their own develop exhibitions of selected resources, while other projects are based on the cooperation of many institutions. Here are some examples to illustrate this diversity:

Codex Sinaiticus, created in the mid-fourth century, contains the text of the Bible in Greek, including the oldest complete copy of the New Testament. Until the mid-19th century the manuscript was kept in the Monastery of Saint Catherine, the oldest existing Christian monastery, situated at the foot of Mount Sinai in Egypt. Today, parts of the manuscript are located in four institutions: the British Library in London, the Library of the University of Leipzig, the Russian National Library in St. Petersburg, and the Monastery of St. Catherine. A website was created as a result of the cooperation of these four institutions. It is very carefully designed and contains all of the fragments of the codex. In addition to scans of the original pages, a transcript of the Greek is provided, which, for some pages, is also translated into other languages (English, German and/or Russian). Crosslinks allow one to locate the transcribed fragments of text after selecting them in the original.


Digitization. Illustration prepared with the use of a work by Junior Melo via Wikimedia Commons [CC-BY-SA-3.0]

Digitization seems to have as many definitions as user communities. Using the same word for different purposes is not rare. The word ‘organic’ has a well defined meaning in chemistry (carbon compound), and a common usage in food and other industries, with the general meaning of ‘good for you’. Thus an organic substance may not be organic, and organic food may contain inorganic elements. Digitization does not show such a spread of meanings; nevertheless, in many contexts it is used to mean different things.

In Wikipedia the preferred term is “digitizing”, but the definition is not of much use: “Strictly speaking, digitizing means simply capturing an analog signal in digital form. For a document the term means to trace the document image or capture the "corners" where the lines end or change direction.” There is no mention of metadata, transcription, access and other elements generally associated with digitization in the archival and library community.

Let us examine some terms and processes which together constitute digitization.

Digital reformatting

Digital reformatting means converting an analog resource into a discrete form usable in computers. The resource is typically a product of human culture: a two-dimensional object (document, image, book page), a recording of music or a moving image. We will not deal with three-dimensional objects, which can also be digitized, but the process is much more involved and not yet clearly established. Analog means that the signal (color, sound, etc.) is continuous, i.e. it can take any possible value in its range. Discrete means that only a limited set of values is possible, with nothing in between. Written language is a good example of a discrete string of elements, each taken from an alphabet of some 150 possibilities (in the English language, including lower and upper case letters, numbers and a collection of special characters). There is no ‘intermediate value’ between a and b.

In the first step we divide the dimension(s) of the object into discrete elements. For a flat object we superimpose a grid (typically consisting of square ‘pixels’); for a signal in time we divide the time dimension into short intervals. Next we examine the signal in each discrete element - a single pixel or interval. For a flat image we measure the color intensity and record it, too, as discrete values; similar treatment is applied to the electrical signal (from a microphone) in the selected time interval. This process is called sampling.

Finally we record the values together with their spatial or time coordinates using a specific coding (selected from many possibilities) and obtain a computer file. If the process was done lege artis, the file is considered a digital surrogate of the original.
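The sampling and quantization steps described above can be sketched in a few lines of Python; the signal, sampling rate and bit depth below are purely illustrative:

```python
import math

def sample_and_quantize(signal, duration, rate, levels):
    """Sample a continuous signal at a fixed rate, then quantize
    each sample to one of a limited set of discrete levels."""
    n = int(duration * rate)
    samples = []
    for i in range(n):
        t = i / rate                     # time coordinate of this sample
        value = signal(t)                # analog value, here in [-1, 1]
        # map the continuous range [-1, 1] onto the integers 0 .. levels - 1
        q = round((value + 1) / 2 * (levels - 1))
        samples.append(q)
    return samples

# A 440 Hz sine wave "recorded" for 1 ms at 8000 samples/s with 8-bit depth
digital = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t),
                              0.001, 8000, 256)
print(digital)   # eight discrete values, each between 0 and 255
```

Writing such a list of values to disk with a specific coding is exactly the final recording step: the file is a digital surrogate of the continuous signal.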

Practically, we use a scanner (or a camera) for a flat object, and a device called an A/D converter for electrical signals (you have one in your computer). A similar device can also reformat a video recording from a VHS player.

Born digital

A similar process of generating a discrete representation of an analog signal occurs inside a camera while taking a snapshot, in a digital video camera or in a sound recorder. When the image, sound or event is immediately recorded in digital form, we consider it ‘born digital’. Digitization is concerned only with digital reformatting, not with born-digital resources.


The file we obtain in the digital reformatting step may initially be devoid of metadata. The original resources, fixed on a specific medium (paper, celluloid, magnetic tape), could have labels; if a book, a card catalog entry; if an archival document, a folder with a label, inside a bigger folder with its own label, inside a labeled box, etc. In order to restore and possibly expand the information that will help us locate the file among hundreds of thousands on our hard drive, we need to collect the object metadata. Descriptive metadata include, for example, the title, author, description or abstract of the resource. There are other types of metadata, some dealing with the parameters of digital reformatting, some recording the ownership and rights of the original, etc. The metadata are typically written down using metadata standards, to help with interoperability.


In the special case of written text (and, to a limited extent, music) we can perform one more step. The text originally has a discrete form, as strings of characters. This string can be decoded by humans (provided they know the language or can read the handwriting), but for a computer the digitally reformatted text is just a mass of dots of different colors. The text can, however, be transcribed, i.e. each character changed into its numerical representation according to a specific coding (historically ASCII, today UTF-8). Now the text is ‘known’ to the computer as well, and can be dealt with in many ways - the book can be presented in different formats (including Braille or spoken word), one can search the full content of the text, etc. Transcription can be done by (human) hand or, if the printed text is clear enough, by the computer technique of optical character recognition (OCR).
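A short Python sketch of what transcription gives the computer: each character becomes a number (its Unicode code point), UTF-8 packs those numbers into bytes, and the text becomes searchable (the sample word is chosen for illustration):

```python
# Each character of transcribed text maps to a number (its Unicode
# code point); UTF-8 then stores that number as one or more bytes.
text = "Piłsudski"
code_points = [ord(c) for c in text]
print(code_points)              # [80, 105, 322, 115, 117, 100, 115, 107, 105]

encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 9 characters, 10 bytes ('ł' needs two)

# Once transcribed, the computer can search the full content:
print("sud" in text)            # True
```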


Digitization is a concept that includes the elements mentioned above. It is the conversion of a resource recorded in a traditional medium into a digital one, and includes all the added features and responsibilities that go with it: organization of the resources, metadata collection and/or transcription, digital reformatting, providing access (including search, browse and other finding tools), and finally planning for the inevitable future coding, format and hardware changes. The emphasis on particular elements differs between communities; for example, digitizing books almost always involves transcription (OCR), while archives put more emphasis on context and document selection. As archival digitization is what we do at the Pilsudski Institute, I will quote from the National Archives (NARA) definition of digitizing:

“...’digitizing’ should be understood not just as the act of scanning an analog document into digital form, but as a series of activities that results in a digital copy being made available to end users via the Internet or other means for a sustained length of time. The activities include:

  • Document identification and selection
  • Document preparation (including preservation, access review and screening, locating, pulling, and refiling)
  • Basic descriptive and technical metadata collection sufficient to allow retrieval and management of the digital copies and to provide basic contextual information for the user
  • Digital conversion
  • Quality control of digital copies and metadata
  • Providing public access to the material via online delivery of reliable and authentic copies
  • Providing online ordering for reproduction services at quality or quantities beyond the capacity of an end user
  • Maintenance of digital copies and metadata”

The archival fonds often come ‘as is’, as collected by the donating person or institution. While the rule of respect des fonds should be the guiding principle, the archive can often gain if it is better organized. For digitization, document identification and selection are important: we do not want pages of a single letter scattered among others. We then give pages serial numbers within a unit (file), to preserve the context and integrity of the fonds. Scanning results in a high-resolution (600 dpi or more), non-compressed (TIFF) ‘digital original’. The original is preserved, and any further work, including website presentation, is done on copies. We collect metadata using the Dublin Core metadata standard, augmented by specific local identifiers. The workflow includes the creators of the digital document and those who proofread and verify their work (we use DSpace - Open Source software - as a tool to collect metadata and enforce the workflow). The metadata, with reduced copies of the document pages, are then exported and transferred to the webpage displaying our archival collections.
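As an illustration of what a Dublin Core record looks like, here is a sketch using Python's standard XML tools. The element names are standard Dublin Core; the document, its values and the local identifier are invented for the example:

```python
import xml.etree.ElementTree as ET

# A hypothetical document description - the keys are standard Dublin Core
# elements, the values (including the local identifier) are made up.
record = {
    "title": "Letter to the General Staff",
    "creator": "Unknown",
    "date": "1920-08-15",
    "identifier": "701/2/34/012",   # hypothetical local identifier
    "language": "pol",
}

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core namespace
ET.register_namespace("dc", DC)

root = ET.Element("metadata")
for element, value in record.items():
    child = ET.SubElement(root, "{%s}%s" % (DC, element))
    child.text = value

print(ET.tostring(root, encoding="unicode"))
```

A real system such as DSpace stores these fields in a database and exports them in a similar XML form, so a hand-rolled record like this is only a minimal sketch of the idea.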

Read more

Marek Zieliński, December 17, 2013


An example of an RDF Linked Data graph (reification) - by Karim Rafes (own work) [CC-BY-SA-3.0], via Wikimedia Commons

Linked Data is a mechanism used by the Semantic Web, or “Web 3.0 under construction”. What is the Semantic Web? We all use the World Wide Web (WWW), the main component of which is the hyperlink: a reference, or link, to another site. Clicking on a hyperlink (its address contains “http”) will open a new web page. The Web was created for human consumption, and just like natural language it is understood by people.

Compared to us, computers are rather dumb, and one has to be extremely explicit when providing them with instructions. On the other hand, they are very fast and can handle vastly more data than we can. And that means that in petabytes of data they can find the single piece of information we need. To make this work, we have to be very precise, we need reliable sources of information, and we need a system that connects it all. This system is Linked Data.

Why should we be interested in Linked Data? Out of curiosity, obviously - to understand how the digital world around us works today. Linked Data is especially important for archivists, librarians and others working in the field of data processing. If you work in an institution which has good quality data in any field, making the data available as Linked Data can significantly increase the prestige of the institution in the world.

The basic rules of Linked Data, encapsulated in RDF (Resource Description Framework), are the use of references (URIs) instead of text, and the use of simple statements about resources in the form subject - predicate - object.
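A minimal sketch in plain Python (not a real RDF library; the URIs are illustrative, not real vocabularies) of how subject - predicate - object statements can be stored and queried:

```python
# URIs name the things and relations unambiguously; only the final
# literal of a statement is plain text. All URIs here are invented.
PERSON = "http://example.org/person/jozef-pilsudski"
BORN_IN = "http://example.org/vocab/bornIn"
NAME = "http://example.org/vocab/name"
CITY = "http://example.org/place/zulow"

triples = [
    (PERSON, NAME, "Józef Piłsudski"),   # object may be a literal...
    (PERSON, BORN_IN, CITY),             # ...or a link to another resource
    (CITY, NAME, "Zułów"),
]

def match(pattern, data):
    """Find triples matching a pattern; None acts as a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Where was the person born? Follow the link, then ask for its name.
birthplace = match((PERSON, BORN_IN, None), triples)[0][2]
print(match((birthplace, NAME, None), triples)[0][2])  # Zułów
```

The point of the second query is the essence of Linked Data: because the object of one statement is itself a URI, a program can follow it and keep asking questions, hopping from resource to resource.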

Graphic Formats

These days we use digital photography more and more often. What was a novelty only a dozen or so years ago has become the norm, and traditional cameras are becoming rare. We can see the image instantly, all the devices we carry (phones, tablets, etc.) have photo capability, and memory and cameras are constantly becoming cheaper - all this resulting in the creation of more and more photos. At the same time, photography has become something very transient. In the past you would paste the photos into albums or collect them in boxes, while now they exist as files on a computer disk, and at the first disk failure we suddenly lose our treasured resources. Personal digital archiving is a broad subject; this time let us focus on packing the images into digital envelopes called files.

An image is not just a photo. A scan of a document in an archive (personal or institutional) is a digital record that should faithfully reflect the original document. How do we choose the best way of preserving images for the next generation, so that our grandchildren will be able to enjoy their grandparents’ photo albums, and so that archives can preserve invaluable (because the paper did not survive) archival images? Saved images are stored on a computer disk in a container called a file. We will talk about the formats of these envelopes, compression and metadata, as well as translating an image from one format to another (conversion).

A digital camera is an imitation of the retina of the eye. The imitation is not very good, because the eye works differently than the camera, but we can treat it as an approximation. The picture - collected by a lens or scanned on a flatbed scanner - is divided into small sections, usually squares (pixels), and the color is stored separately for each square. The data for three colors (different than in the human eye) are recorded. As a result we obtain a rectangular matrix, each cell containing color data. The image is characterized by its dimensions in pixels (height and width) and a third dimension, the depth of the color. The most popular model uses 8 bits for each of the three primary colors (24 bits in total), which provides the ability to store more than 16 million color tones. The saved data files are packed in one of the formats known as raster formats.
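The arithmetic of the 24-bit color model can be checked in a few lines of Python (the tiny 2x2 image is, of course, illustrative):

```python
# 8 bits per channel for three channels gives 24 bits per pixel:
bits_per_pixel = 3 * 8
print(2 ** bits_per_pixel)          # 16777216 distinct color tones

# A tiny 2x2 image as a matrix of (R, G, B) cells:
image = [
    [(255, 0, 0), (0, 255, 0)],     # red, green
    [(0, 0, 255), (255, 255, 255)], # blue, white
]
width, height = len(image[0]), len(image)

# Uncompressed size: width x height x 3 bytes of raw pixel data
print(width * height * 3)           # 12 bytes
```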

Format selection criteria

Until recently, we did not need any tools (except for glasses - occasionally) to look at paintings, photos, or to read a book. Today, more and more often we have to use equipment (a computer or device that performs the same function under various names - a phone, tablet, etc.). What's worse, we find a large number of formats, better or worse suited to our requirements. What are these requirements?

  1. The image format should be public, not closed. Some formats, particularly older ones, were created by companies dealing with image processing, and they retain copyright. Usually the format is published and publicly available. Formats defined as international standards (e.g. ISO) are much more likely to remain useful in the future.
  2. The format should be popular (which can sometimes conflict with requirement 1). A standard that has no readily available tools can be used only in theory.
  3. Image processing tools should be easily accessible, and readers should be free or cheap, preferably open source. Giving someone a photo with a comment "you can see it, but first you must buy a program for $500" is in poor taste. The basic treatment of images such as rotate, crop, resize, etc. should be available in popular, low-cost and / or open source tools.
  4. Formats should be able to save metadata - for details see the blog "The reverse side of a digital photo".

Resolution and compression

Those of us who dabbled in film photography remember film grain, related to the film's speed. The lower the speed, the smaller the silver halide crystals, and the finer the details which could be registered. In a digital camera, crystals are replaced by photosensitive elements - the denser the elements, the finer the details. Sensor resolution is usually given in (mega)pixels. Scanner resolution is typically given in pixels per inch (or centimeter), abbreviated as ppi or dpi.

Image size in computer memory (width x height [in pixels] x 3) can be significant. To save space, some formats use compression. We will not elaborate here on compression algorithms, which are numerous; it suffices to consider the compression-decompression cycle. If it leaves the image unchanged, the compression is considered lossless; if not, lossy. Lossy compression can be much more effective in reducing image size but, depending on its intensity, can leave traces (artifacts).
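As an illustration of the compression-decompression cycle, here is a toy run-length encoder in Python (real image formats use far more sophisticated algorithms, but the lossless test is the same):

```python
def rle_compress(pixels):
    """Run-length encoding: store (value, run length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([p, 1])     # start a new run
    return runs

def rle_decompress(runs):
    return [p for p, count in runs for _ in range(count)]

# A scan line with large uniform areas compresses well...
line = [255] * 90 + [0] * 10
packed = rle_compress(line)
print(len(line), "->", len(packed) * 2)   # 100 values -> 4 numbers

# ...and the compression-decompression cycle returns it unchanged,
# which is exactly the definition of lossless:
print(rle_decompress(packed) == line)     # True
```

A lossy scheme would fail the final comparison: some pixel values would come back changed, which is where artifacts come from.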



GIF

GIF (Graphics Interchange Format) was introduced by CompuServe in 1987. It uses lossless compression, but is limited to 8 bits for all three colors (up to 256 shades, or levels of gray). It is therefore not suited for photography, where we expect a bigger palette of colors. Metadata recording capabilities are very limited. GIF has, however, two very desirable features. We can define a transparent color, allowing us to create graphics (such as logos) that can be pasted over existing patterns. GIF also has the ability to save multiple images that can be viewed as short movies (animations) - this function alone has resulted in the undiminished popularity of the format. Most web browsers can display GIF files, including animations, and it is supported by almost all graphics programs. The files have the extension .gif.


PNG

PNG (Portable Network Graphics) was developed to overcome the problems with GIF - the limited number of colors and the patented compression method. It was approved for use on the Internet in 1996 and acquired the status of an ISO standard in 2004. PNG allows us to save graphics and photos using 24- or 32-bit color, and it also supports a transparent color. It uses lossless compression, so it is suitable for archival storage. Metadata recording capabilities are limited: the EXIF format (used by cameras) is not supported; it is possible to use XMP metadata, but popular programs cannot read or write the data. The PNG format is growing in popularity; it is displayed by web browsers and supported by most graphics programs. Files have the extension .png.


TIFF

TIFF (Tagged Image File Format) was created by Aldus and put into use in 1986. Although it is more than 25 years old, it is still a very popular format among graphic designers, photographers and the publishing industry. It can save files of up to 4 GiB in full color. TIFF has the ability to record multiple images (so you can save all the pages of a document), uses lossless compression and can also store data uncompressed. The standard is administered by Adobe, which acquired Aldus. It has many add-ons and extensions (version 6.0 is relatively universal), as well as several versions registered as ISO standards. It does not support animation or transparency and is not displayed by the most commonly used web browsers. It is popular as a format for storing archival images and scans. Here you can save EXIF and IPTC metadata; using XMP, although theoretically possible, is not a common option. TIFF is very popular and is supported by almost all graphics programs. The files have the extension .tif or .tiff.


JPEG

JPEG (Joint Photographic Experts Group) is a very popular format created for digital photos and other half-tone images. It always uses compression, which is lossy but provides a significant reduction in size. At the same file size, an image in JPEG format may have 25 or more times more pixels (5 times the linear dimension) than TIFF, for example, which largely compensates for the compression losses. For archival documents it presents two problems: first, compression errors are most evident at contrasting element boundaries (for example at the edges of text characters), and second, each further processing step generates additional errors, because you cannot completely turn off the compression. The latter problem can be partially bypassed in photo processing if one uses a program (such as Picasa) which saves only the transformations, leaving the original unchanged.

JPEG is a registered ISO standard and is supported by all image processing and display programs, as well as by web browsers - it is the most popular format for recording and viewing photos. In a JPEG file you can also save metadata in EXIF, IPTC and XMP, which significantly increases its versatility. Files most commonly have the extensions .jpg or .jpeg, although sometimes .jif, .jfif and others are used.

JPEG 2000

JPEG 2000 format (files use the extension .jp2) is the next-generation format developed by the Joint Photographic Experts Group. It has all the advantages of JPEG, a better compression algorithm, and is an ISO standard. It has the ability to record uncompressed images, so it is suitable for the storage of archival materials. The only metadata recording format is XMP. All in all, it is a very good future graphic format.

Although it was introduced over 10 years ago, it lacks popularity. Many readers and graphic editors either do not support JPEG 2000 or support it only to a limited extent, using plug-ins - loading an image in this format takes considerably longer. Picasa does not support this format, and metadata recording requires specialized tools. JPEG 2000 is not displayed in web browsers.

Other formats

There are many other formats, and here are a few you might encounter.

RAW is the common name for many formats writing raw data from the camera sensor - they contain the most detailed image data, which can then be further processed. Although many of them use elements of TIFF, the formats are closed, limited to the camera manufacturer, and as such are not suitable for long-term storage or for sharing images.

BMP is a Microsoft raster format, created for Windows. It is very popular and can be encountered frequently, especially in older applications and in Windows graphics.

PDF (Portable Document Format) is not a graphic format, but it can also incorporate graphics. It is a description of a document, containing all the elements necessary to show or print a single- or multiple-page document. It was created by Adobe in 1991-93 and popularized by the company's publication of free PDF readers. Since 2008 it has been an ISO standard and is no longer controlled by Adobe. In 2005 an ISO standard called PDF/A (a subset of PDF) was published, with a focus on long-term archival storage.

PDF, and especially PDF/A, is recommended as a format for long-term storage of documents. It is good for this purpose, providing a versatile, relatively permanent page format which can also include graphics, both raster and vector. PDF is not a graphic format, however, and for photos and scans it is only an additional envelope that wraps the picture. PDF is not directly displayed by web browsers, nor by programs for image processing. The latest version (PDF/A-2 of 2011) provides JPEG 2000 compression and the use of metadata, both for the entire document and for individual pages. Processing tools for PDF (excluding the proprietary and rather expensive Adobe tools), however, are rare, and even simple manipulations such as adding, removing or rotating the pages of a document require significant effort. When it comes to the presentation (rather than long-term storage) of multi-page documents, PDF is simple to use, and competes with another format created for this purpose, DjVu.


What format should we use to store image and document scans at home and in the archive? We can see that in the future we will have a great format for archiving and displaying files, including metadata, and great tools to view our resources on any device. That day has not come yet. We have old formats that are common, and new ones that are better, but the lack of tools disqualifies them for use right now. It is therefore likely that our children and grandchildren will have to make a conversion to a 'proper' format, perhaps in 2050 - to put the photos in new, better envelopes.

What should we do for now? Photographs can be saved in JPEG format at the highest resolution possible. Cameras usually have a variety of options, and you should always choose the best quality. This increases the file size, but memory is cheap and its price is steadily declining. Store the original images, and do not modify them; work only on copies. Add the metadata (common image viewers like IrfanView or XnView can do it, and so can Picasa (Options / Tags / Store the tags in the photos)). Scans, especially of archival materials, should be stored as TIFF files. Later you can convert them to JPEG 2000, when it becomes more common. Recording metadata is also highly recommended, although archives usually want to add more information: where the documents came from, what was their fate, what is in them, etc. For this I recommend a simple spreadsheet or office document, or a specialized archival program such as DSpace or Archivists' Toolkit. If you want to save documents created electronically, the PDF format is very well suited for this purpose.

Read more

Wikipedia articles on graphics formats

Marek Zieliński, November 2, 2013

EAD (Encoded Archival Description) is a standard created expressly for encoding archival finding aids. For this reason it is a hybrid: on the one hand, it tries to reflect the way in which archivists work when creating finding aids; on the other, it tries to introduce the discipline and accuracy necessary for electronic document processing. The result is a lot of flexibility in the placement of data, which facilitates the work of the archivist but at the same time makes it rather difficult to exchange data. The new version of EAD (EAD3), which has been in preparation for several years, may hopefully reduce much of this arbitrariness.

The rules and principles of creating finding aids are contained in separate documents. In addition to the international standard, ISAD(G), there are rules established in different countries, such as DACS in the U.S., that are similar but often have subtle differences. EAD is an encoding that applies such rules, understandable by humans but also suitable for computer processing. Like all modern metadata standards it is expressed in XML and consists of a series of nested tags such as <ead>, along with rules governing their nesting and their content.
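As a rough illustration of this nested structure, such a finding-aid skeleton can be generated with Python's standard XML tools. The element names follow EAD, but this is a simplified sketch with invented content, not a complete or valid EAD document:

```python
import xml.etree.ElementTree as ET

# Skeleton of a finding aid: a fonds-level description containing
# a description of subordinate components (dsc) with one series.
ead = ET.Element("ead")
archdesc = ET.SubElement(ead, "archdesc", level="fonds")
did = ET.SubElement(archdesc, "did")
ET.SubElement(did, "unittitle").text = "Example fonds"

dsc = ET.SubElement(archdesc, "dsc")
series = ET.SubElement(dsc, "c01", level="series")
sdid = ET.SubElement(series, "did")
ET.SubElement(sdid, "unittitle").text = "Correspondence"
ET.SubElement(sdid, "unitdate").text = "1918-1923"

print(ET.tostring(ead, encoding="unicode"))
```

The nesting of <c01> components inside <dsc>, each with its own <did>, is what lets the encoding mirror the hierarchy of fonds, series and files that archivists already use.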

Please treat the text below as an encouragement and an introduction to reading the collection of essays Debates in the Digital Humanities, edited by Matthew K. Gold and published in 2012 by the University of Minnesota Press. The anthology has also been published, in a slightly expanded form, as an open access text, which is available here.

Digital humanities (DH for short) is a relatively new field which is gaining more and more popularity in the academic world. The article in the English Wikipedia gives a very neat definition of DH, to which I refer those interested. In short, the digital humanities are an area of research, teaching and creation combining information technologies and the humanistic disciplines. It encompasses activities ranging from the curation of digital collections on the web to data mining performed on huge datasets. DH tries to combine the methods of the traditional humanistic disciplines (such as history, philosophy, linguistics, the study of literature, art, music, etc.) with computational tools such as data visualization, information retrieval, data and text mining, statistics and electronic publishing.