Introduction to Linked Data

Regime_entailment_basic-260

An example of RDF Linked Data graph (reification) - By Karim Rafes (Own work) [CC-BY-SA-3.0], via Wikimedia Commons)

Linked Data is a mechanism used by the Semantic Web or "Web 3.0 in construction". What is the Semantic Web? We all use the World Wide Web (www), the main component of which are the hyperlinks, references or links to other sites. Clicking on the hyperlink (has “http” in its  name) will open a new web page. Web was created for human consumption, and just as natural language it is understood by people.

Compared to us, computers are rather dumb, and one has to be extremely explicit in providing it with instructions. On the other hand they are very fast and can handle vastly more data than we can. And that means that in petabytes of data they can find the single piece of information we need. To make it work, we have to be very precise, need reliable sources of information and a system that connects it all. This system is the Linked Data.

Why should we be interested in Linked Data? Out of curiosity, obviously, to understand how the digital world today around us. Linked Data is especially important for archivists, librarians, and others working in the field of data processing. If you work in an institution which has some good quality data in any field, making the data available in the Linked Data can significantly increase the prestige of the institution in the world.

The basic rules of Linked Data, encapsulated in RDF (Resource Description Framework) are the use of references (URI) instead of text, and the use of simple statements about resources in the form of subject - predicate - object.

URI

In Linked Data a unique identifier of a particular resource is an URI (Universal Resource Identifier). As an example we will select Aleksander Kowalski. The name is common in Polish, so much so, that the Google engine translates it automatically to Alexander Smith. We need to be specific. Instead of writing "Alexander Kowalski ... well ... not the Member of Polish Parliament but the Olympian, not the skier but ice hockey player" we can use  URI (in this case URL) http://en.wikipedia.org/wiki/Aleksander_Kowalski, which leads us to an authoritative source. In this example, I used the link to Wikipedia - an encyclopedia for human consumption; in the Linked Data for computers we could use the link to DBPedia, for example http://dbpedia.org/page/Aleksander_Kowalski. There is a large amount of structured data there, such as dates, places, categories etc, and you can in principle ask more complex questions, such as "give me all hockey players who participated in the Olympic Games in Lake Placid contemporarily with Aleksander Kowalski". Using URI relieves us of explaining each time what is the subject matter; it also prevents or greatly hinders the deliberate use of ambiguous language, favorite pastime of politicians.

RDF

Linked Data is an idea that you can use simple, 3-component sentences to represent all human knowledge - a very challenging task. Those simple sentences consist of subject, predicate and object and RDF is a standard that defines how to create and use such sentences to represent knowledge.

Subject is what we are talking about in the sentence, the resource. The subject must be specified as an URI. In this way, we direct the computer to a clearly defined entity. The subject can be anything - a person, place, book, website, etc. In our example it will be Aleksander Kowalski - a person.

Predicate is the second part of the sentence. It specifies a property, relation, type, etc. For example, it may be "was born on ...", "belongs to the category of ...", "has child ...", etc. The predicate also must be a URI to point to a properly described (in natural language) definition.

Object is the thing or value to assign to the subject in the sentence. It may be a literal value (eg, 1902), but it can also be a URI, if the object is described somewhere (in this case, we do not give it a name, which can be ambiguous, but its identifier, URI).

In natural language it is simple to write such sentences:

Aleksander Kowalski ... well ... not the Member of Polish Parliament but the Olympian, not the skier but hockey player, was born on October seventh 1902.

Aleksander Kowalski (same as above) was a Polish ice hockey player.

Aleksander Kowalski was killed in Katyn.

In RDF there is some work to be done first. We must define the URI of the entity - let's say, for brevity we define ak as http://dbpedia.org/page/Aleksander_Kowalski (and add /p at the end  to note that the subject is a person and not the website) Our first sentence will provide an information for the computer that the object is a person:

ak:p   rdf:type   foaf:person

where ak:p is the entity (we are talking about the person of Aleksander Kowalski), rdf:type predicate is a 'type', as defined in the RDF standard and foaf:person is a value - 'person' - the class specified in the FOAF standard. (Was such information provided, perhaps the translation engine would not try to translate the name)
Further sentences will look like this:

ak:p    dbpedia-owl:birthDate   1902-10-07

ak:p    dc:description   “Polish ice hockey player”

ak:p    dbpedia-owl:deathPlace     dbpedia:Katyn_massacre

In the first sentence, the subject is date formatted using the ISO standard. One can refine the statement by adding a type definition, such as:

ak:p    dbpedia-owl:birthDate   1902-10-07^^xsd:date

In the second sentence, the object is a text literal, and the third is a URI pointing to the new resource, which may have its own list of properties expressed as RDF sentences, and these, in turn, their properties, and so on. This creates a new object that is slowly growing; while only rudimentary now, it already has a name: Giant Global Graph.

What can be expressed in RDF? The standard itself specifies only a small number of basic concepts, resources, classes, properties, relationships of objects to classes, tables or lists - ordered and not, etc. It provides a way to build complex structures ("Jan Kozlowski said that Aleksander Kowalski is a Polish hockey player ") with simple three component sentences (in this process, called reification, we treat the inner sentence as a resource). In other words, RDF defines the elements of formal logic which is used by Linked Data.

RDF syntax

RDFis a language designed to record information about resources. Standard RDF defines not only the semantics but also the syntax of the language, expressed in XML (sometime called RDF/XML), which can be regarded as a 'canonical' representation. In the following example the three sentences from the section above are presented in one compact XML document.

<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dbpedia-owl="http://dbpedia.org/ontology/
     xmlns:dbpedia=”http://dbpedia.org”
     xmlns:dc="http://purl.org/dc/elements/1.1/">
     <rdf:Description rdf:about="http://dbpedia.org/page/Aleksander_Kowalski/p">
        <dbpedia-owl:birthDate>1902-10-07</dbpedia-owl:birthDate>    
        <dc:description>Polish ice hockey player</dc:description>
        <dbpedia-owl:deathPlace>dbpedia:Katyn_massacre</dbpedia-owl:deathPlace>
      </rdf:Description>
</rdf:RDF>

There are also many other syntaxes (“serializations”) that are supposed to be either easier to read, or more convenient to use with other technologies, or both. There is a serialization called Notation 3 or N3, which is the second 'official' RDF syntax. The sentences illustrated in the previous section are expressed in a syntax similar to N3. There are also syntaxes called N-Triples, Turtle, Trig, TriX and others. RDFa is a syntax  placing the RDF data inside the WWW HTML code, as attributes of the hyperlink. If all this sounds a bit like the Tower of Babel, it probably is. Each of these methods claims to be "easier" and more intuitive, irrespective of the fact, that the data is not and probably will not be routinely entered manually. We will see which syntax survives the test of time.

Extensions

RDF is just the beginning, the core syntax of Linked Data. We need more extensive vocabularies and tools to lay claim to universal knowledge representation. RDF is quite versatile, allowing for example the use of well known and popular metadata standards such as Dublin Core or Friend-of-a-Friend (FOAF). It  already has built-in extension, RDF Schema (RDFS), the definition of classes for building ontologies. There are complete ontology languages such as OWL or SKOS, database query language (SPARQL) and other tools - a rich set, but a topic for a separate article.

Where is the data?

It's all very well, you might say, but how do we get these URIs? Where are the data for which we can have confidence that they are true, reliable and do not mislead your computer? After all any false data may spread like wildfire to other nodes of the Semantic Web.

There are many resources that are made available in Linked Data, and their number grows exponentially. There are geographical data and information about people - especially authors, collected by libraries around the world. There is a great hope pinned on Linked Data access to scientific data, in particular in areas such as genetics, meteorology or physics where data resources count in petabytes. DBPedia is a reflection of the Wikipedia with data which have a structure; institutions such as the Library of Congress or the New York Times provide more resources. One example of Linked Data can be seen by typing in Google search the phrase "Birth date of Joseph Pisudski". We get not only links to pages, but specific data (5 December 1987) and, as a bonus, the set of encyclopedic data about the Marchall Pilsudski. The available Semantic Web resources is another topic for the next blog.

Read more

Marek Zieliński, December 1, 2013

PARTNERZY
Ministerstwo Kultury
Biblioteka Narodowa
Naczelna Dyrekcja Archiwów Państwowych
Konsulat RP w NY
Fundacja na rzecz Dziedzictwa Narodowego
PSFCU
NYC Department of Cultural Affairs