Blog of the Archivists and Librarians of the Piłsudski Institute

A photo from Władysława’s album with her father among members of the organization “Sokół”, in Silna, Poland, early 1920s.


My wife’s aunt Władysława always kept the family albums ready to be packed in a backpack, so that during WWII, when the family was expelled from their flat in Łódź and during the forced migrations that followed, the albums always traveled with her. This way the albums survived, and we can now enjoy family photos going back several generations. The durability of black-and-white photography and of the good-quality, rag-based, non-acidic paper on which they were printed helped preserve the hundred-year-old prints.

Color photos from 20 or 30 years ago did not fare as well. Organic dyes fade quickly, and some prints have lost most of their color. We are digitizing them, trying to restore the colors digitally, and assembling them into digital albums.

Creating digital copies of the old albums also has another goal besides preservation. The family was scattered around the world. Brothers and sisters who lived or worked in different partitioned regions of Poland before 1918 ended up in different countries; some returned to the reborn Poland, some settled in Germany, France, the US, the UK and elsewhere. A single album is not enough, but an online version can be viewed by many.

Which brings us to the question: how do we create a durable, long-lasting electronic album?

About 15 years ago, we installed Gallery2, an Open Source, Web-based photo gallery, which had everything that was necessary to build a collection of online albums. We used a desktop program, Picasa, to organize and annotate the images. Now Gallery2 and Picasa are no more (or are not maintained, which is almost the same). This article is about our effort to rebuild the albums so that they can survive for at least one generation (or, say, 25 years).

How can we predict the future, even just for 25 years, in the fast-changing landscape of the digital world? One factor that will help (and guide us) is the strong conservatism of programmers and builders of computerized systems. Let us take for example Unicode, a universal alphabet, or mechanism to represent characters in almost all writing systems in use today. We will stress it later, because without Unicode one cannot really annotate pictures, for example a photo report about a trip from Łódź to Kraków to Košice to Hajdúböszörmény to Škofja Loka to Zürich to Besançon to Logroño. Unicode is about 25 years old, but it was not the first character encoding: ASCII, limited to the basic Latin alphabet, dominated computing for decades before it. Today many programs and systems, even ones built quite recently, still do not support Unicode. Similar conservatism affects formats for recording photographs (raster graphics). Some formats, like TIFF and JPEG, were introduced about 30 years ago and, being the first useful formats, became very popular. Newer standards like JP2, which are better in some respects, have had a very difficult time gaining acceptance.
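To make the point about annotations concrete, here is a minimal sketch in Python (our choice of language for illustration; the helper name `ascii_safe` is hypothetical). It checks which of the place names from the itinerary above could be stored in plain ASCII, and whether they survive a round trip through UTF-8, one of the standard Unicode encodings.

```python
# Place names from the itinerary mentioned in the text.
places = [
    "Łódź", "Kraków", "Košice", "Hajdúböszörmény",
    "Škofja Loka", "Zürich", "Besançon", "Logroño",
]


def ascii_safe(name: str) -> bool:
    """Return True if the name can be stored in plain ASCII without loss."""
    try:
        name.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Names that cannot be written in ASCII at all.
lost_in_ascii = [p for p in places if not ascii_safe(p)]

# UTF-8 encodes and decodes every name without changing it.
round_trip_ok = all(p.encode("utf-8").decode("utf-8") == p for p in places)

print(f"{len(lost_in_ascii)} of {len(places)} names do not fit in ASCII")
print("UTF-8 round trip preserved all names:", round_trip_ok)
```

Every name in this list contains at least one letter outside ASCII, which is why a caption system limited to ASCII would have to misspell all of them.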

Goals of the project

It may be helpful to state the goals of the project. We start with a collection of photographs in digital form, which may be scanned photos or born-digital images. The images are organized into albums, perhaps matching original bound albums, or created from scratch. We also have metadata: descriptions, persons and places, dates, etc. associated with the individual images, and with the whole albums. Metadata may also include provenance, history and other events associated with them.

The goal of the project is to research and document a methodology for preserving the images, their organization into albums, and all the associated metadata. We aim to preserve them for future generations rather than for a fleeting, one-time viewing, which can be easily accomplished using social media today.

Beyond the scope of this project are the preservation of bound albums and original photographs, scanning, and the storage of digital copies. Those topics are covered by a more or less extensive literature, although some specific aspects may be presented in this blog later.


Collecting metadata while digitizing archival resources is not a straightforward task. Names, places and events listed in the documents often diverge from current spelling. There are variants, aliases, spelling errors, etc. Modern search engines like Google often know synonyms or common spelling errors and can correct them:

Showing results for Kowalski
Search instead for Kowakski

But it works best for common names and spelling errors. In a project in which we would like to present the metadata as Linked Open Data we would want to have a clean list of entries, free of spelling errors and with identified variants, if any.

Let us take as an example the names of people (we also collect places, historical events and more). The name alone does not typically identify the person: obviously there can be many people with the same name. Once the person is fixed, we find that very often his or her name exists in many variants. There are versions in various languages; the person could use a pseudonym (or several) at some period of life, change names (before or after marriage), add titles to the name, etc. Subjects and citizens name their leaders by their monikers. How to find your data in this mess?

For persons that are mentioned in the archival documents, we have selected several rules. The rules are somewhat arbitrary, but we had to start somewhere:

  1. We use one standard name for one person. The alternative names or names in different languages are also collected to help in search. We are typically guided by the name used in Polish, if possible, and use the Wikipedia spelling (in Polish or another language) when appropriate.

  2. We list the name as Last Name, First Name(s), in this order. This sometimes causes difficulties, as it is not always easy to figure out which names are given names. The rule has an exception for people known only by their full name, such as kings and popes; in this case we list the full name as it is commonly used or officially known.

  3. We assign each person a unique identifier which we create from scratch. If possible, we correlate this identifier with two of the most common registries: WikiData and VIAF. There are people who do not have their articles in Wikipedia in any language, and hence no WikiData ID. There are people who never wrote a book and are not in the library index represented by VIAF. For those we create a short description, add references and assign our own identifier.
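The three rules above can be sketched as a small record type. This is only an illustration in Python, not the Institute's actual data model: the class name `PersonRecord`, its field names and the identifier format `person-0001` are all assumptions made for the example, and the external identifiers are left empty rather than invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonRecord:
    """One person, one standard name (rule 1), plus variants and identifiers."""

    local_id: str                      # rule 3: identifier we mint ourselves
    family_name: str = ""
    given_names: str = ""
    full_name: Optional[str] = None    # rule 2 exception: kings, popes, etc.
    variants: List[str] = field(default_factory=list)  # aliases, other languages
    wikidata_id: Optional[str] = None  # filled in only when a match exists
    viaf_id: Optional[str] = None      # filled in only when a match exists
    description: str = ""              # short description when no registry entry exists

    def standard_name(self) -> str:
        """Rule 2: 'Last Name, First Name(s)' unless only a full name is known."""
        if self.full_name:
            return self.full_name
        return f"{self.family_name}, {self.given_names}"

    def matches(self, query: str) -> bool:
        """Case-insensitive lookup over the standard name and all variants."""
        wanted = query.strip().casefold()
        known = [self.standard_name(), *self.variants]
        return wanted in {name.casefold() for name in known}


# Hypothetical sample records.
kowalski = PersonRecord(
    local_id="person-0001",
    family_name="Kowalski",
    given_names="Jan",
    variants=["Jan Kowalski", "John Kowalski"],
)
king = PersonRecord(local_id="person-0002", full_name="Jan III Sobieski")

print(kowalski.standard_name())            # standard form used for display
print(king.standard_name())                # full-name exception
print(kowalski.matches("john kowalski"))   # variant found regardless of case
```

Keeping the variants next to the standard name is what lets a search for any known spelling land on the same person and the same identifier.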

The next step is to review the roughly 80 thousand name records that we have collected so far and bring the names to this standard. We work in chunks, typically one archival fonds at a time, but that is still tens of thousands of records. One can work with a spreadsheet, which is a universal tool, and by using such functions as sort, filter and global find-and-replace one can do a lot of work. However, we found that a specialized tool called OpenRefine can be much more useful for this task. OpenRefine (an Open Source program) grew out of a Google project and was originally called Google Refine. It was strongly connected with a project called Freebase, now defunct, which collected data from several different databases and also allowed users to add their own. OpenRefine was created expressly for the task of cleaning up or refining mixed-quality data.


OpenRefine is a sophisticated piece of software with very powerful tools. It can be installed on a personal computer and runs as a web server, which means that you interact with it via a browser. The data is stored locally, and does not leave your computer; you can work on sensitive data without problems with sharing. It also means that one can only work on one computer at a time, but it is rather easy to export the project and carry it on a memory stick.

OpenRefine is rich in options, functions and capabilities. I will not attempt to describe all of them here. My goal is to introduce OpenRefine and show some features using the example of names of people collected while digitizing the archival fonds of the Pilsudski Institute of America.
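To give a flavor of the kind of cleanup involved, here is a minimal Python sketch of key-collision clustering, a general technique for grouping spelling variants that tools of this kind offer. It is an illustration under our own assumptions, not OpenRefine's code: the functions `fingerprint` and `cluster` and the sample names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name to a key that ignores case, punctuation, accents and word order."""
    # Decompose accented letters and drop the combining marks (ó -> o).
    # Letters such as ł have no decomposition and are kept as they are.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, then sort the unique tokens.
    cleaned = re.sub(r"[^\w\s]", " ", stripped.casefold())
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names that share a fingerprint; return only groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample of raw name entries.
raw_names = ["Kowalski, Jan", "Jan Kowalski", "kowalski  jan.", "Nowak, Anna"]

for group in cluster(raw_names):
    print(group)
```

A key of this kind catches differences in case, punctuation and word order, but not a typo such as Kowakski; misspellings need a different comparison, for example one based on edit distance, followed by human review.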

METRO 2016: Summary

On Thursday, January 21, 2016, we took part in the annual conference of METRO (Metropolitan New York Library Council), which took place at Baruch College in Manhattan. This conference, like the previous ones, was an excellent overview of the latest initiatives, ideas, solutions and projects in digital humanities in the GLAM community. Below we present a discussion of selected presentations.

The annual METRO (Metropolitan New York Library Council) conferences are about the best sources of the latest inventions, projects and ideas in the GLAM community, concentrated in one day of intense briefings. This year was no exception: the conference took place on January 21, 2016 at Baruch College in Manhattan. At the conference a number of “Project briefings” were presented; the intent was to show projects in progress and discuss their workings, issues and plans, not necessarily completed works. It was impossible to attend so many parallel briefings; we selected two in each session and report on them here as a sampling of the conference.

Prague astronomical clock By Steve Collis from Melbourne, Australia (Astronomical Clock Uploaded by russavia) [CC BY 2.0], via Wikimedia Commons

In one of my previous blog posts, “How to write dates?”, I discussed the basic universal date and time notation, as specified in the International Organization for Standardization standard (ISO 8601) and its World Wide Web Consortium (W3C) simplification. Since that time the Library of Congress has completed work on an extension of this standard, the Extended Date/Time Format (EDTF) 1.0. This extension for the most part deals with expressing uncertain dates and times. Such limited or imprecise date/time information is a common occurrence when recording historical events in archives, libraries, etc. ISO 8601 does not allow for the expression of such concepts as “approximately the year 1962”, “some year between 1920 and 1935” or “the event probably occurred in May 1938, but we are not certain”. The EDTF standard allows us to express them in a formalized way, fulfilling a real need in many fields dealing with historical metadata.

Although the standard is relatively new, and there are few software tools to help enter or validate uncertain date and time data, I believe that it is worth familiarizing oneself with the new notation wherever possible.


I would like to begin with some definitions to facilitate the discussion of the new notation. The definitions are accompanied by symbols that will be used in the next section.


Precision is a measure of a range or interval within which the ‘true’ value exists [1]. Precision is explicit in the date or date/time expression; if an event occurred in the year 1318, the precision is one year (it could have occurred at any time within this year). If we specify 1945-09-15, the precision is one day, etc. [2] In EDTF we can extend this definition to specify decade or century precision using the x symbol; see the discussion of masked precision below.

Approximate (~)

An estimate that is assumed to be possibly correct, or close to correct, where “closeness” may be dependent on specific application.

Uncertain (?)

We are not sure of the value of the variable (in our case date or time). Uncertainty is independent of precision. The source of the information may itself not be reliable, or we may face several values and not enough information to discern between them. For example we may be uncertain as to the year, or month, or day of an event etc.

Unspecified (u)

The value is not stated. The point in time may be unspecified because it did not occur yet, because it is classified, unknown or for any other reason.
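As a rough illustration of how the symbols defined above combine with precision, here is a minimal Python sketch that reads a small subset of EDTF 1.0 strings: a year, year-month or year-month-day, optionally followed by `?`, `~` or `?~`, with `u` for unspecified digits and `x` for masked precision. The names `FuzzyDate` and `describe` are hypothetical, the pattern is deliberately looser than the specification, and it is no substitute for a validating parser.

```python
import re
from dataclasses import dataclass

# Subset only: YYYY[-MM[-DD]] plus an optional trailing "?", "~" or "?~".
_PATTERN = re.compile(
    r"^(?P<year>[0-9]{4}|[0-9]{3}[ux]|[0-9]{2}(?:uu|xx))"
    r"(?:-(?P<month>[0-9]{2}|uu)(?:-(?P<day>[0-9]{2}|uu))?)?"
    r"(?P<flags>\?~|[?~])?$"
)


@dataclass
class FuzzyDate:
    text: str
    precision: str      # "century", "decade", "year", "month" or "day"
    approximate: bool   # "~" present
    uncertain: bool     # "?" present
    unspecified: bool   # some digits replaced by "u"


def describe(text: str) -> FuzzyDate:
    """Classify one date string from the supported subset."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in the supported subset: {text!r}")
    year, month, day, flags = match.group("year", "month", "day", "flags")
    flags = flags or ""
    if day:
        precision = "day"
    elif month:
        precision = "month"
    elif year.endswith("xx"):
        precision = "century"   # masked precision, e.g. 19xx
    elif year.endswith("x"):
        precision = "decade"    # masked precision, e.g. 196x
    else:
        precision = "year"
    digits = year + (month or "") + (day or "")
    return FuzzyDate(text, precision, "~" in flags, "?" in flags, "u" in digits)


for sample in ["1962~", "1938-05?", "196x", "199u", "1945-09-15"]:
    print(describe(sample))
```

Here `1962~` stands for “approximately the year 1962” and `1938-05?` for “probably May 1938, but we are not certain”, while `1945-09-15` carries day precision and no qualifiers.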

Who likes moving?

And everything that goes with it: sorting, culling, packing, transporting, unpacking, arranging...? While one can move from one apartment to another fairly smoothly, it is hard to imagine relocating an institution that for a quarter of a century occupied a townhouse in the center of Manhattan, accumulating archives, works of art and museum exhibits.

Piłsudski Institute

The news of the sale of the building that the Piłsudski Institute of America rented as its headquarters came as a great surprise to its staff. For many years the Institute had been associated with Second Avenue in Manhattan; it had a steady circle of friends, admirers, visitors and researchers, and then suddenly such news! It was not easy to come to terms with, but there was no other way out. A Campaign for the Future (Kampania Na Rzecz Przyszłości) was promptly organized to raise funds for the undertaking, and the logistics of the relocation were worked out. The preparations took over a year. First of all, we had to find a new home that would accommodate our collections and allow the Institute to continue its work smoothly. The premises offered by the Polish & Slavic Federal Credit Union, along with the terms of the lease, appealed to us most. Adaptation work began: designing and building out the interior, installing professional security systems and shelving, and assembling spacious cabinets. We received invaluable help from the Institute of National Remembrance (Instytut Pamięci Narodowej), which delegated eight archivists who, over two months, professionally and efficiently packed the archives and library collections and helped move them to the new premises. We were unable to count all the boxes and parcels which, once transported to the new location, took up most of the usable floor space, piling up almost to the ceiling.

Part II: Product

(Guest blog by Rob Hudson)

Arthur Rubinstein (Linked Data)

In Part I of this blog, I began telling you about my experience transforming Carnegie Hall’s performance history data into Linked Open Data, and in addition to giving some background on my project and the data I’m working with, I talked about process: modeling the data; how I went about choosing (and ultimately deciding to mint my own) URIs; finding vocabularies, or predicates, to describe the relationships in the data; and I gave some examples of the links I created to external datasets.

In this installment, I’d like to talk about product: the solutions I examined for serving up my newly-created RDF data, and some useful new tools that help bring the exploration of the web of linked data down out of the realm of developers and into the hands of ordinary users. I think it’s noteworthy that none of the tools I’m going to tell you about existed when I embarked upon my project a little more than two years ago!

As I’ve mentioned, my project is still a prototype, intended to be a proof-of-concept that I could use to convince Carnegie Hall that it would be worth the time to develop and publish its performance history data as Linked Open Data (LOD) — at this point, it exists only on my laptop. I needed to find some way to manage and serve up my RDF files, enough to provide some demonstrations of the possibilities that having our data expressed this way could afford the institution. I began to realize that without access to my own server this would be difficult. Luckily for me, 2014 saw the first full release of a linked data platform called Apache Marmotta by the Apache Software Foundation. Marmotta is a fully-functioning read-write linked data server, which would allow me to import all of my RDF triples, with a SPARQL module for querying the data. Best of all, for me, was the fact that Marmotta could function as a local, stand-alone installation on my laptop — no web server needed; I could act as my own, non-public web server. Marmotta is out-of-the-box, ready-to-go, and easy to install — I had it up and running in a few hours.

Ministerstwo Kultury
Biblioteka Narodowa
Naczelna Dyrekcja Archiwów Państwowych
Konsulat RP w NY
Fundacja na rzecz Dziedzictwa Narodowego
NYC Department of Cultural Affairs