
Blog of the archivists and librarians of the Piłsudski Institute

OpenRefine

Collecting metadata while digitizing archival resources is not a straightforward task. Names, places, and events listed in the documents often diverge from current spelling; there are variants, aliases, spelling errors, and so on. Modern search engines like Google know many synonyms and common misspellings and can correct them:

Showing results for Kowalski
Search instead for Kowakski

This works best, however, for common names and misspellings. In a project that aims to present metadata as Linked Open Data, we want a clean list of entries, free of spelling errors, with variants identified where they exist.

Let us take names of people as an example (we also collect places, historical events, and more). A name alone does not typically identify a person: obviously, many people can share the same name. And once the person is identified, we often find that his or her name exists in many variants. There are versions in different languages; the person may have used a pseudonym (or several) during some period of life, changed names (before or after marriage), added titles to the name, and so on. Subjects and citizens call their leaders by monikers. How do we find our data in this mess?

For persons mentioned in the archival documents, we have adopted several rules. The rules are somewhat arbitrary, but we had to start somewhere:

  1. We use one standard name for one person. Alternative names, and names in different languages, are also collected to help in search. We are typically guided by the name used in Polish, if possible, and use the Wikipedia spelling (in Polish or another language) when appropriate.

  2. We list the name as Last Name, First Name(s), in this order. This sometimes causes difficulties, as it is not always easy to figure out the given names. The rule has an exception for people known only by their full name, such as kings and popes; in this case we list the full name as it is commonly used or officially known.

  3. We assign each person a unique identifier, which we create from scratch. If possible, we correlate this identifier with two of the most common registries: WikiData and VIAF. There are people who do not have an article in Wikipedia in any language, and hence no WikiData ID. There are people who never wrote a book and are not in the library index represented by VIAF. For those we create a short description, add references, and assign our own identifier. (A sketch of such a record follows this list.)
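
To make the rules concrete, here is a minimal sketch of what one such person record might look like, in Python notation. The field names and the local identifier scheme are hypothetical, and the registry values are illustrative only; real WikiData and VIAF IDs must be looked up in the registries themselves.

```python
# Hypothetical person record following rules 1-3 above.
# Field names, the local ID scheme, and the registry values are
# illustrative; actual WikiData/VIAF IDs must be verified.
person = {
    "id": "PIA-P-000123",            # our own identifier (rule 3)
    "name": "Piłsudski, Józef",      # standard form: Last Name, First Name (rule 2)
    "variants": [                    # alternative names to aid search (rule 1)
        "Józef Piłsudski",
        "Ziuk",                      # pseudonym
    ],
    "wikidata": "Q101410",           # WikiData ID, when the person has one
    "viaf": "49272796",              # VIAF ID, when the person has one
}
```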

The next step is to review the roughly 80 thousand name records we have collected so far and bring the names to this standard. We work in chunks, typically one archival fonds at a time, but it is still tens of thousands of records. One can work with a spreadsheet, a universal tool, and accomplish a lot with functions such as sort, filter, and global find-and-replace. However, we found that a specialized tool called OpenRefine can be much more useful for this task. OpenRefine (open-source software) grew out of a Google project and was originally called Google Refine. It was strongly connected with the now-defunct Freebase project, which collected data from several different databases and also allowed users to add their own. OpenRefine was created expressly for the task of cleaning up, or refining, mixed-quality data.

OpenRefine

OpenRefine is a sophisticated piece of software with very powerful tools. It can be installed on a personal computer, where it runs as a web server, which means that you interact with it via a browser. The data is stored locally and does not leave your computer, so you can work on sensitive data without sharing concerns. It also means that you can work on only one computer at a time, but it is fairly easy to export a project and carry it on a memory stick.

OpenRefine is rich in options, functions, and capabilities, and I will not attempt to describe all of them here. My goal is to introduce OpenRefine and show some of its features, using the example of names of people collected while digitizing the archival fonds of the Piłsudski Institute of America.
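
As a taste of what the tool does, OpenRefine's key-collision clustering groups rows whose normalized "fingerprints" coincide, which is exactly what catches variant spellings of the same name. The sketch below imitates that idea in Python; it is a simplified re-implementation under my own assumptions, not OpenRefine's actual code.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Simplified fingerprint in the spirit of OpenRefine's keyer:
    lowercase, strip accents and punctuation, sort unique tokens."""
    norm = name.lower().replace("ł", "l")  # NFKD does not decompose stroked letters
    norm = unicodedata.normalize("NFKD", norm)
    norm = "".join(c for c in norm if not unicodedata.combining(c))
    norm = re.sub(r"[^\w\s]", " ", norm)
    return " ".join(sorted(set(norm.split())))

def cluster(names):
    """Group name variants whose fingerprints collide."""
    groups = defaultdict(list)
    for n in names:
        groups[fingerprint(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]

variants = ["Piłsudski, Józef", "Pilsudski, Jozef", "Józef Piłsudski",
            "Sikorski, Władysław"]
print(cluster(variants))
# [['Piłsudski, Józef', 'Pilsudski, Jozef', 'Józef Piłsudski']]
```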

Metro 2016 Summary

On Thursday, January 21, 2016, we took part in the annual conference of METRO, the Metropolitan New York Library Council, which took place at Baruch College in Manhattan. This conference, like the previous ones, was an excellent review of the latest initiatives, ideas, solutions, and projects in digital humanities within the GLAM community. Below we present a discussion of selected presentations.

The annual METRO (Metropolitan New York Library Council) conferences are among the best sources of the latest inventions, projects, and ideas in the GLAM community, concentrated in one day of intense briefings. This year was no exception: the conference took place on January 21, 2016 at Baruch College in Manhattan. A number of "project briefings" were presented; the intent was to show projects in progress and discuss their workings, issues, and plans, not necessarily completed work. It was impossible to attend so many parallel briefings; we selected two in each session and report on them here as a sampling of the conference.

Prague astronomical clock. Photo by Steve Collis from Melbourne, Australia (Astronomical Clock, uploaded by russavia) [CC BY 2.0], via Wikimedia Commons

In one of my previous blog posts, "How to write dates?", I discussed the basic universal date and time notation specified in the International Organization for Standardization standard ISO 8601 and its World Wide Web Consortium (W3C) simplification. Since that time the Library of Congress has completed work on an extension of this standard, the Extended Date/Time Format (EDTF) 1.0. This extension for the most part deals with expressing uncertain dates and times. Such limited or imprecise date/time information is a common occurrence when recording historical events in archives, libraries, etc. ISO 8601 does not allow for the expression of such concepts as "approximately the year 1962", "some year between 1920 and 1935", or "the event occurred probably in May 1938, but we are not certain". The EDTF standard allows us to express them in a formalized way, fulfilling a real need in many fields dealing with historical metadata.
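
For a first taste of the notation, here is how those three examples could be written. This is my reading of the EDTF 1.0 draft (the square-bracket set is a Level 2 feature), so the exact strings are worth checking against the specification:

```python
# EDTF 1.0 draft renderings of the three examples above
# (my reading of the draft; verify against the published specification):
examples = {
    "1962~":        "approximately the year 1962",
    "[1920..1935]": "some one year between 1920 and 1935 (Level 2 set)",
    "1938-05?":     "probably May 1938, but not certain",
}
```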

Although the standard is relatively new, and there are few software tools to help enter or validate uncertain date and time data, I believe that it is worth familiarizing oneself with the new notation wherever possible.

Definitions

I would like to begin with some definitions to facilitate the discussion of the new notation. The definitions are accompanied by the symbols that will be used in the next section.

Precision

Precision is a measure of the range or interval within which the 'true' value exists [1]. Precision is explicit in the date or date/time expression: if an event occurred in the year 1318, the precision is one year (it could have occurred at any time within that year). If we specify 1945-09-15, the precision is one day, etc. [2] In EDTF we can extend this definition to specify decade or century precision using the x symbol - see the discussion of masked precision below.
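
A few examples may help; the masked-precision forms are my reading of the EDTF 1.0 draft, discussed further below:

```python
# Precision follows from how much of the date is written out;
# EDTF "masked precision" (x) coarsens it to a decade or century:
precision = {
    "1318":       "one year",
    "1945-09":    "one month",
    "1945-09-15": "one day",
    "196x":       "one decade (masked precision)",
    "19xx":       "one century (masked precision)",
}
```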

Approximate (~)

An estimate that is assumed to be possibly correct, or close to correct, where "closeness" may depend on the specific application.

Uncertain (?)

We are not sure of the value of the variable (in our case, the date or time). Uncertainty is independent of precision. The source of the information may itself not be reliable, or we may face several values with not enough information to choose between them. For example, we may be uncertain as to the year, month, or day of an event.

Unspecified (u)

The value is not stated. The point in time may be unspecified because it has not yet occurred, because it is classified or unknown, or for any other reason.
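
To tie the symbols together, here is a toy recognizer, a sketch under my own assumptions rather than a complete EDTF validator, that spots the draft-1.0 qualifiers on simple year, year-month, and year-month-day values:

```python
import re

# Toy recognizer for the symbols defined above (EDTF 1.0 draft syntax):
# '?' = uncertain, '~' = approximate, 'u' = unspecified digit.
# A sketch only, not a complete EDTF validator.
SIMPLE = re.compile(r"^(?P<core>[0-9u]{4}(?:-[0-9u]{2}){0,2})(?P<q>[?~]?)$")

def describe(edtf: str) -> str:
    m = SIMPLE.match(edtf)
    if not m:
        return f"{edtf}: outside this toy pattern"
    notes = []
    if "u" in m.group("core"):
        notes.append("unspecified digits (u)")
    if m.group("q") == "?":
        notes.append("uncertain (?)")
    elif m.group("q") == "~":
        notes.append("approximate (~)")
    return f"{edtf}: " + (", ".join(notes) or "plain date")

for s in ["1962~", "1938-05?", "199u", "1985-04-uu", "1945-09-15"]:
    print(describe(s))
```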
