Blog archiwistów i bibliotekarzy Instytutu Piłsudskiego

A photo from Władysława’s album with her father among members of the organization “Sokół”, in Silna, Poland, early 1920s.

My wife’s aunt Władysława always kept the family albums ready to be packed in a backpack, so that during WWII, when the family was expelled from their flat in Łódź and through further forced migrations, the albums always traveled with her. This way the albums survived, and we can now enjoy family photos going back several generations. The durability of black-and-white photography, and of the good-quality, rag-based, non-acidic paper on which the photos were printed, helped preserve the hundred-year-old prints.

Color photos from 20 or 30 years ago did not fare as well. Organic dyes fade quickly, and some prints have lost most of their colors. We are digitizing them, trying to restore the colors digitally and converting them into digital albums.

Creating digital copies of the old albums serves another goal besides preservation. The family was scattered around the world. Brothers and sisters who lived or worked in different partitioned regions of Poland before 1918 ended up in different countries; some returned to the reborn Poland, some settled in Germany, France, the US, the UK and elsewhere. A single album is not enough, but an online version can be viewed by many.

Which brings us to the question: how to create a durable, long lasting electronic album?

About 15 years ago, we installed Gallery2, an Open Source, Web-based photo gallery, which had everything necessary to build a collection of online albums. We used a desktop program, Picasa, to organize and annotate the images. Now, Gallery2 and Picasa are no more (or are no longer maintained, which is almost the same). This article is about our effort to rebuild the albums so that they can survive for at least one generation (say, 25 years).

How can we predict the future, even just for 25 years, in the fast-changing digital landscape? One factor that will help (and guide us) is the strong conservatism of programmers and builders of computerized systems. Take, for example, Unicode, a universal alphabet, or mechanism to represent characters, covering almost all writing systems in use today. We will stress it later, because without Unicode one cannot really annotate pictures — for example, a photo report of a trip from Łódź to Kraków to Košice to Hajdúböszörmény to Škofja Loka to Zürich to Besançon to Logroño. Unicode is about 25 years old, but it was not the first: the Latin-alphabet ASCII dominated computing for decades before it. Today many programs and systems, even ones built quite recently, still do not support Unicode. Similar conservatism affects formats for recording photographs (raster graphics). Some formats, like TIFF and JPEG, introduced 30 years ago, became very popular by being among the first useful formats. Newer standards like JP2, which are better in some respects, have a very difficult time gaining acceptance.
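The difference Unicode makes can be seen directly in code. A minimal Python sketch, using the place names from the itinerary above, shows which names ASCII simply cannot represent, while UTF-8 (the dominant Unicode encoding) handles all of them:

```python
# Why Unicode matters for annotating photos: several place names
# from the itinerary cannot be represented in ASCII at all.
places = ["Łódź", "Kraków", "Košice", "Hajdúböszörmény",
          "Škofja Loka", "Zürich", "Besançon", "Logroño"]

for name in places:
    try:
        name.encode("ascii")
        print(f"{name}: fits in ASCII")
    except UnicodeEncodeError:
        # UTF-8 can encode every Unicode character; ASCII cannot.
        utf8 = name.encode("utf-8")
        print(f"{name}: needs {len(utf8)} bytes in UTF-8 "
              f"for {len(name)} characters")
```

For instance, the four characters of “Łódź” occupy seven bytes in UTF-8, because three of them fall outside the ASCII range.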

Goals of the project

It may be helpful to state the goals of the project. We start with a collection of photographs in digital form, which may be scanned photos or born-digital images. The images are organized into albums, perhaps matching the original bound albums, or created from scratch. We also have metadata — descriptions, persons and places, dates, etc. — associated with the individual images and with whole albums. Metadata may also include provenance, history and other events associated with the images and albums.

The goal of the project is to research and document a methodology to preserve the images, their organization into albums, and all the associated metadata. We aim to preserve them for future generations, rather than for the fleeting, one-time viewing that can easily be accomplished with social media today.

Beyond the scope of this project are the preservation of bound albums and original photographs, and the scanning and storage of digital copies. Those topics have a more or less extensive literature, although some of them may be covered in this blog later.


Metadata collection during the digitization of archival resources is not a straightforward task. Names, places and events listed in the documents often diverge from current spelling. There are variants, aliases, spelling errors, etc. Modern search engines like Google recognize synonyms and common spelling errors and can correct them:

Showing results for Kowalski
Search instead for Kowakski

But this works best for common names and common errors. In a project in which we would like to present the metadata as Linked Open Data, we want a clean list of entries, free of spelling errors and with identified variants, if any.

Let us take the names of people as an example (we also collect places, historical events and more). A name alone does not typically identify a person — obviously, there can be many people with the same name. And once the person is fixed, we find that his or her name very often exists in many variants. There are versions in various languages; the person could have used a pseudonym (or several) during some period of life, changed their name (before or after marriage), added titles to the name, and so on. Subjects and citizens call their leaders by monikers. How does one find anything in this mess?

For persons that are mentioned in the archival documents, we have selected several rules. The rules are somewhat arbitrary, but we had to start somewhere:

  1. We use one standard name for one person. Alternative names, or names in different languages, are also collected to help in searching. We are typically guided by the name used in Polish, if possible, and use the Wikipedia spelling (in Polish or another language) when appropriate.

  2. We list the name as Last Name, First Name(s), in this order. This sometimes causes difficulties, as it is not always easy to figure out the given names. The rule has an exception for people known only by their full name, such as kings and popes; in this case we list the full name as it is commonly used or officially known.

  3. We assign each person a unique identifier, which we create from scratch. If possible, we correlate this identifier with two of the most common registries: WikiData and VIAF. There are people who have no Wikipedia article in any language, and hence no WikiData ID. There are people who never wrote a book and are not in the library indexes represented by VIAF. For those we create a short description, add references and assign our own identifier.
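The three rules above could be captured in a record like the following. This is only an illustrative sketch: the field names, the local identifier scheme, and the WikiData/VIAF values shown are placeholders of our own invention, not a fixed schema or real registry entries.

```python
# A hypothetical name-authority record following the three rules above.
# All identifiers and field names here are illustrative placeholders.
record = {
    "id": "PIA-P-000123",                  # our own identifier (rule 3)
    "preferred_name": "Kowalski, Jan",     # "Last, First" order (rule 2)
    "variants": [                          # alternate forms aid search (rule 1)
        "Jan Kowalski",
        "Johann Kowalski",                 # German-language variant
    ],
    "wikidata": None,                      # no Wikipedia article in any language
    "viaf": None,                          # not in library indexes
}

def display_name(rec):
    """Return the standard name, falling back to a variant or the ID."""
    if rec.get("preferred_name"):
        return rec["preferred_name"]
    if rec.get("variants"):
        return rec["variants"][0]
    return rec["id"]

print(display_name(record))
```

Keeping the registry links (`wikidata`, `viaf`) as explicit, possibly empty fields makes it easy to see at a glance which persons still need our own description and references.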

The next step is to review the some 80 thousand name records we have collected so far and bring the names to this standard. We work in chunks, typically one archival fonds at a time, but it is still tens of thousands of records. One can work with a spreadsheet, which is a universal tool, and accomplish a lot with functions such as sort, filter and global find-and-replace. However, we found that a specialized tool called OpenRefine can be much more useful for this task. OpenRefine (Open Source software) grew from a Google project and was originally called Google Refine. It was strongly connected with a project called Freebase, now defunct, which collected data from several different databases and also allowed users to add their own. OpenRefine was created expressly for the task of cleaning up, or refining, mixed-quality data.


OpenRefine is a sophisticated piece of software with very powerful tools. It can be installed on a personal computer, where it runs as a web server, which means that you interact with it via a browser. The data is stored locally and does not leave your computer, so you can work on sensitive data without sharing concerns. It also means that you can only work on one computer at a time, but it is easy to export the project and carry it on a memory stick.

OpenRefine is rich in options, functions and capabilities. I will not attempt to describe all of them here; my goal is to introduce OpenRefine and show some of its features, using the example of names of people collected while digitizing the archival fonds of the Piłsudski Institute of America.
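One of OpenRefine’s best-known cleanup features is key-collision clustering, which groups records that reduce to the same normalized key. A simplified Python sketch of its “fingerprint” method — lowercase, strip punctuation, fold accents, sort unique tokens — gives the flavor of how variants of the same name collide (OpenRefine’s real implementation handles more cases than this sketch):

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(name):
    """A simplified version of OpenRefine's fingerprint keying:
    fold combining accents, lowercase, drop punctuation,
    then sort the unique tokens."""
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    folded = re.sub(r"[^\w\s]", "", folded.lower())
    return " ".join(sorted(set(folded.split())))

def cluster(names):
    """Group name variants whose fingerprints collide."""
    groups = defaultdict(list)
    for n in names:
        groups[fingerprint(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]

names = ["Kowalski, Jan", "Jan Kowalski", "jan kowalski.", "Kowakski, Jan"]
print(cluster(names))
# The first three entries collide on the key "jan kowalski";
# the misspelled "Kowakski" does not, and would need a fuzzier method.
```

For misspellings like “Kowakski”, OpenRefine offers additional clustering methods (such as nearest-neighbor matching on edit distance) that go beyond exact key collisions.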

Metro 2016 Summary

On Thursday, January 21, 2016, we took part in the annual conference of METRO — the Metropolitan New York Library Council — held at Baruch College in Manhattan. The conference, like previous ones, was an excellent review of the latest initiatives, ideas, solutions and projects in digital humanities within the GLAM community. Below we present a discussion of selected presentations.

The annual METRO (Metropolitan New York Library Council) conferences are among the best sources of the latest inventions, projects and ideas in the GLAM community, concentrated in one day of intense briefings. This year was no exception; the conference took place on January 21, 2016 at Baruch College in Manhattan. A number of “project briefings” were presented at the conference — the intent was to show projects in progress and discuss their workings, issues and plans, not necessarily completed works. It was impossible to attend so many parallel briefings; we selected two in each session, and report on them here as a sampling of the conference.

Prague astronomical clock. By Steve Collis from Melbourne, Australia (Astronomical Clock, uploaded by russavia) [CC BY 2.0], via Wikimedia Commons

In one of my previous blog posts, “How to write dates?”, I discussed the basic universal date and time notation specified in the International Organization for Standardization standard ISO 8601 and its World Wide Web Consortium (W3C) simplification. Since that time, the Library of Congress has completed work on an extension of this standard, the Extended Date/Time Format (EDTF) 1.0. The extension for the most part deals with expressing uncertain dates and times. Such limited or imprecise date/time information is a common occurrence when recording historical events in archives, libraries, etc. ISO 8601 does not allow for the expression of such concepts as “approximately the year 1962”, “some year between 1920 and 1935”, or “the event occurred probably in May 1938, but we are not certain”. The EDTF standard allows us to express them in a formalized way, fulfilling a real need in many fields dealing with historical metadata.

Although the standard is relatively new, and there are few software tools to help enter or validate uncertain date and time data, I believe that it is worth familiarizing oneself with the new notation wherever possible.


I would like to begin with some definitions to facilitate the discussion of the new notation. The definitions are accompanied by the symbols that will be used in the next section.


Precision

Precision is a measure of the range or interval within which the ‘true’ value lies [1]. Precision is explicit in a date or date/time expression: if an event occurred in the year 1318, the precision is one year (it could have occurred at any time within that year). If we specify 1945-09-15, the precision is one day, etc. [2] In EDTF we can extend this definition to decade or century precision using the x symbol — see the discussion of masked precision below.

Approximate (~)

An estimate that is assumed to be possibly correct, or close to correct, where “closeness” may be dependent on specific application.

Uncertain (?)

We are not sure of the value of the variable (in our case date or time). Uncertainty is independent of precision. The source of the information may itself not be reliable, or we may face several values and not enough information to discern between them. For example we may be uncertain as to the year, or month, or day of an event etc.

Unspecified (u)

The value is not stated. The point in time may be unspecified because it has not yet occurred, because it is classified or unknown, or for any other reason.
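The qualifiers above attach directly to date strings: a trailing ~ marks an approximate date, a trailing ? an uncertain one, and the letter u stands in for unspecified digits. A small Python sketch — a simplified pattern of my own for illustration, not a full EDTF validator — shows how the examples discussed earlier look in this notation:

```python
import re

# A simplified pattern for EDTF 1.0 level-1 dates: a year (digits or 'u'),
# optional month and day, then optional '?' (uncertain) and/or '~'
# (approximate) qualifiers. For illustration only, not a full validator.
EDTF_DATE = re.compile(r"""^
    [0-9u]{4}                                 # year, digits or 'u'
    (?:-(?:0[1-9u]|1[0-2u]|uu)                # optional month
       (?:-(?:0[1-9u]|[12][0-9u]|3[01u]|uu))? # optional day
    )?
    [?~]{0,2}$                                # uncertainty / approximation
""", re.VERBOSE)

examples = {
    "1962~":    "approximately the year 1962",
    "1938-05?": "probably May 1938, but we are not certain",
    "1920-uu":  "an unspecified month in 1920",
    "196u":     "a year in the 1960s, final digit unspecified",
}

for date, meaning in examples.items():
    assert EDTF_DATE.match(date), date
    print(f"{date:10} -> {meaning}")
```

The remaining example from the introduction, “some year between 1920 and 1935”, is written in EDTF as an interval of two dates (e.g. `[1920..1935]` in the one-of-a-set notation), which this simple single-date pattern does not attempt to cover.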