By Connecting for Health | 2013
What is Linked Data?
In a nutshell, Linked Data is a set of concepts, principles and standards aimed at making it easy for people and, more importantly, applications to:
- Discover relevant data on the Web
- Access and use the data
- Integrate data from new, previously unknown sources
The concepts of Linked Data are based on those of the existing World Wide Web, but applied to data rather than web pages. The relevance of Linked Data is explained here by Tim Berners-Lee:
Tim Berners-Lee set out four simple Linked Data rules or principles:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
- Include links to other URIs, so that they can discover more things.
A URI is a Uniform Resource Identifier. It is used to uniquely name a resource, where a resource can be a real-world object such as a person, organisation, place or thing, a piece of data such as an HTML page or JPEG file, or an abstract concept. The more familiar URL (Uniform Resource Locator), the web address we use in our web browsers, is a type of URI.
What does a URI look like? Pretty much the same as a familiar URL. Below is a URI for Leeds Teaching Hospitals NHS Trust:
https://data.developer-test.nhs.uk/ods/org/RR8
This URI has been created (often termed minted) by the Health Developer Network (often termed a publisher) using the following namespace rules:
- https://data.developer-test.nhs.uk is the registered DNS name for the Health Developer Network data services
- /ods is used to indicate this thing has been derived from the NHS Organisation Data Service (ODS)
- /org is used to indicate this thing is an organisation
- /RR8 is the three-character code that ODS uses to identify Leeds Teaching Hospitals NHS Trust
The last three rules are local namespace rules that are internal to the publisher. As long as they produce a unique URI whenever a new URI is minted, they can be anything that makes sense to the publisher. To the rest of the world, the URI is just an opaque identifier.
Note that a URI is unique in the sense that the same URI should not be used to name something different. For example, the publisher should not also use https://data.developer-test.nhs.uk/ods/org/RR8 to identify Harrogate and District NHS Foundation Trust. However, there may be other URIs, minted by other publishers, that also identify Leeds Teaching Hospitals NHS Trust.
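As a minimal sketch of the namespace rules above, the path segments can be joined mechanically to mint a URI. The helper name mint_uri and the idea of keeping the base as a constant are assumptions made for illustration, not part of the Health Developer Network services.

```python
from urllib.parse import quote

# Base taken from the article; treating it as a constant is an assumption of this sketch.
BASE = "https://data.developer-test.nhs.uk"


def mint_uri(source: str, kind: str, code: str, base: str = BASE) -> str:
    """Join the namespace segments into one URI (hypothetical helper)."""
    # Percent-encode each segment so an unusual code cannot break the path.
    segments = [quote(part, safe="") for part in (source, kind, code)]
    return base.rstrip("/") + "/" + "/".join(segments)


leeds_uri = mint_uri("ods", "org", "RR8")
print(leeds_uri)
```

Because the rules are local to the publisher, a different publisher could use an entirely different scheme, provided each minted URI stays unique.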
Given a URI you may want to look up the name to find out something about the thing this name represents. If you are a person then pointing your web browser at the URI and treating it as a URL usually gives you some human readable descriptive information.
Try https://data.developer-test.nhs.uk/ods/org/RR8 to see what the Health Developer Network data services tells you about the URI.
If you are an application such as a Linked Data client, then issuing an HTTP GET to the URI usually returns some descriptive information in the form of an RDF document.
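The lookup described above might be sketched as follows, using only the Python standard library. Sending an Accept header of application/rdf+xml is a common way for a client to ask for RDF rather than HTML. Whether a given server honours that header is an assumption here, and the function names are illustrative.

```python
from urllib.request import Request, urlopen

RDF_XML = "application/rdf+xml"


def build_rdf_request(uri: str) -> Request:
    """Build a GET request that asks for RDF/XML rather than HTML."""
    return Request(uri, headers={"Accept": RDF_XML}, method="GET")


def fetch_rdf(uri: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body."""
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read()


# Only the request is built here; call fetch_rdf(...) to go over the network.
request = build_rdf_request("https://data.developer-test.nhs.uk/ods/org/RR8")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```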
RDF is the Resource Description Framework data model which represents data in the form of a directed graph. Figure 1 shows a fragment of RDF data for Leeds Teaching Hospitals NHS Trust as published by the Health Developer Network in RDF/XML format.
Even without understanding RDF, you can see that this fragment is about the URI https://data.developer-test.nhs.uk/ods/org/RR8, and that it provides some useful information about it, such as its formal name (LEEDS TEACHING HOSPITALS NHS TRUST), its full address and the date on which it was opened. Note that 1/4/1998 (1 April 1998) is the date on which the Leeds Teaching Hospitals NHS Trust organisation was formed, not the date on which the hospital was first opened in Leeds, which was back in the 18th century.
It also contains links to other URIs. For example, the Government Office Region (gor) in which the trust is located is identified by the URI https://data.developer-test.nhs.uk/ods/org/D. This URI has also been minted by the Health Developer Network, and Figure 2 shows a fragment of RDF data for this URI as published by the Health Developer Network in RDF/XML format.
These links become more interesting when they are outgoing links to other publishers' URIs and data.
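To make the directed-graph idea concrete, the facts mentioned above can be written as (subject, predicate, object) triples, each one an edge in the graph. This is only a sketch: the predicate URIs under https://example.org/vocab# are hypothetical stand-ins, since the actual vocabulary used in the published RDF is not reproduced in the text.

```python
LEEDS = "https://data.developer-test.nhs.uk/ods/org/RR8"
REGION = "https://data.developer-test.nhs.uk/ods/org/D"

# Hypothetical vocabulary namespace: the real predicate URIs are not shown in the text.
VOCAB = "https://example.org/vocab#"

# Each triple is one directed edge: subject --predicate--> object.
triples = [
    (LEEDS, VOCAB + "name", "LEEDS TEACHING HOSPITALS NHS TRUST"),
    (LEEDS, VOCAB + "openDate", "1998-04-01"),
    (LEEDS, VOCAB + "gor", REGION),
]


def outgoing_links(subject, graph):
    """Return the objects of `subject` that are themselves HTTP(S) URIs."""
    return [
        o for s, _, o in graph
        if s == subject and o.startswith(("http://", "https://"))
    ]


print(outgoing_links(LEEDS, triples))
```

A client that follows the returned URIs and repeats the lookup is doing exactly what the fourth principle describes: discovering more things through links.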
Serving Linked Data
There are several technical approaches to serving Linked Data in RDF format:
- Static RDF file
- Relational database
- Wrapping API
- Triplestore
The first is to simply serve static RDF files. You can generate these RDF files in a variety of ways: create them manually in a text editor, or use a tool to convert existing structured data files such as CSV, XML and Excel into RDF. Note that there are several RDF serialisation formats available:
- RDF/XML is an XML format for RDF
- RDFa is a format that embeds RDF in HTML documents
- Turtle is a plain text format for RDF
- N-Triples is a subset of the Turtle format for RDF
RDF/XML was the first RDF serialisation to be standardised by the W3C and is widely supported, so it is the recommended format to use. RDFa, Turtle and N-Triples have since also been published as W3C Recommendations.
Once you have created the static RDF files, you can publish them on a web server. For RDF/XML, URLs should end in .rdf and the files should be served with a MIME type of application/rdf+xml.
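As a rough illustration of converting an existing structured file into RDF, the sketch below turns CSV rows into N-Triples lines. N-Triples is used only because it is the simplest format to emit by hand. The CSV column names (code, name) and the predicate URI are assumptions made for the example.

```python
import csv
import io

BASE = "https://data.developer-test.nhs.uk/ods/org/"
NAME_PREDICATE = "https://example.org/vocab#name"  # hypothetical predicate URI

# Hypothetical CSV layout: one organisation per row, with `code` and `name` columns.
CSV_TEXT = "code,name\nRR8,LEEDS TEACHING HOSPITALS NHS TRUST\n"


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str) -> str:
    """Emit one N-Triples line per CSV row."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<" + BASE + row["code"] + ">"
        literal = '"' + escape_literal(row["name"]) + '"'
        lines.append(subject + " <" + NAME_PREDICATE + "> " + literal + " .")
    return "\n".join(lines) + "\n"


print(csv_to_ntriples(CSV_TEXT), end="")
```

A production conversion would normally use a dedicated tool or RDF library, which can also write RDF/XML.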
Serving static RDF files is a good choice if the files are small and their content does not change often.
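For local experimentation, static serving with the right MIME type can be sketched with Python's built-in HTTP server. The directory name and port below are placeholders, and a real deployment would more likely configure the MIME type in an existing web server.

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


class RdfFileHandler(SimpleHTTPRequestHandler):
    """Static file handler that labels .rdf files as RDF/XML."""

    # Extend the default extension-to-MIME-type mapping with .rdf.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".rdf": "application/rdf+xml",
    }


def serve(directory: str = "public", port: int = 8000) -> None:
    """Serve `directory` until interrupted (directory and port are placeholders)."""
    handler = partial(RdfFileHandler, directory=directory)
    HTTPServer(("127.0.0.1", port), handler).serve_forever()
```

Calling serve() starts the server, after which a file such as public/RR8.rdf would be returned with a Content-Type of application/rdf+xml.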
If the data is already stored in a relational database, then it can be served as RDF data by using a tool that dynamically maps the database contents to RDF and serves it up. A widely used tool to do this is the D2R server. This is potentially a good approach for data that is large in volume, changes frequently, or both. However, if this approach is to be used for large scale serving, careful consideration should be given to the required RDF serving performance and to the contention impact on the underlying relational database, as each RDF request will involve some sort of SQL query on the relational database followed by a transformation of the query result into an RDF data model.
If the data is managed within an existing system that provides proprietary APIs to access the data, you can develop custom wrappers around these APIs that expose them as HTTP URIs and return RDF. As with a relational database, if large scale serving is required there may be performance limitations imposed by the wrapper and a significant contention and load impact on the underlying system.
The final approach is to use a triplestore. This is a repository specially designed to store RDF data in its native structure, which consists of triples of subject, predicate and object. Triplestores probably offer the best technical approach in terms of scalability and performance. Triplestores are normally used with a SPARQL processor to serve RDF data from the store. SPARQL (pronounced “sparkle”) is a recursive acronym for SPARQL Protocol and RDF Query Language. Similar to SQL for relational databases, it provides a standard way to query and get result sets from RDF data.
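The per-request cost mentioned above (one SQL query, then a transformation into RDF) can be illustrated with a small sketch. This is not how D2R itself is configured. It is a hand-written mapping over an in-memory SQLite table, and the table, column and predicate names are all hypothetical.

```python
import sqlite3

BASE = "https://data.developer-test.nhs.uk/ods/org/"
VOCAB = "https://example.org/vocab#"  # hypothetical predicate namespace


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table standing in for an existing database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE organisation (code TEXT PRIMARY KEY, name TEXT)")
    conn.execute(
        "INSERT INTO organisation VALUES (?, ?)",
        ("RR8", "LEEDS TEACHING HOSPITALS NHS TRUST"),
    )
    return conn


def describe(conn: sqlite3.Connection, code: str) -> list:
    """Run one SQL query per request and map each column to a triple."""
    row = conn.execute(
        "SELECT code, name FROM organisation WHERE code = ?", (code,)
    ).fetchone()
    if row is None:
        return []
    subject = BASE + row[0]
    return [(subject, VOCAB + "code", row[0]), (subject, VOCAB + "name", row[1])]


conn = make_demo_db()
for triple in describe(conn, "RR8"):
    print(triple)
```

Every call to describe hits the database, which is why heavy RDF traffic can contend with the database's other workloads.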
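A wrapper of this kind might look like the following WSGI sketch. The function legacy_lookup stands in for a proprietary API call and is entirely hypothetical, as is the predicate namespace. The RDF/XML is assembled by hand to keep the example dependency-free.

```python
from xml.sax.saxutils import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOCAB = "https://example.org/vocab#"  # hypothetical predicate namespace
BASE = "https://data.developer-test.nhs.uk"
PREFIX = "/ods/org/"


def legacy_lookup(code):
    """Stand-in for a proprietary API call (hypothetical)."""
    records = {"RR8": {"name": "LEEDS TEACHING HOSPITALS NHS TRUST"}}
    return records.get(code)


def app(environ, start_response):
    """WSGI wrapper: map /ods/org/<code> onto the API and return RDF/XML."""
    path = environ.get("PATH_INFO", "")
    code = path[len(PREFIX):] if path.startswith(PREFIX) else ""
    record = legacy_lookup(code)
    if record is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    # `code` is a known key at this point, so it is safe to place in the URI.
    about = BASE + PREFIX + code
    body = (
        '<rdf:RDF xmlns:rdf="' + RDF_NS + '" xmlns:v="' + VOCAB + '">'
        '<rdf:Description rdf:about="' + about + '">'
        "<v:name>" + escape(record["name"]) + "</v:name>"
        "</rdf:Description></rdf:RDF>"
    ).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/rdf+xml")])
    return [body]
```

The app can be tried locally with wsgiref.simple_server.make_server from the standard library. Each request passes straight through to the wrapped system, which is where the load impact comes from.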
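The subject, predicate and object structure, and the kind of pattern matching a SPARQL query performs, can be sketched with a toy in-memory store. This is not a real triplestore or SPARQL processor, and the predicate URIs are hypothetical. A value of None in the pattern plays the role of a SPARQL variable.

```python
LEEDS = "https://data.developer-test.nhs.uk/ods/org/RR8"
REGION = "https://data.developer-test.nhs.uk/ods/org/D"
VOCAB = "https://example.org/vocab#"  # hypothetical predicate namespace

STORE = {
    (LEEDS, VOCAB + "name", "LEEDS TEACHING HOSPITALS NHS TRUST"),
    (LEEDS, VOCAB + "gor", REGION),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts like a SPARQL variable."""
    pattern = (subject, predicate, obj)
    return sorted(
        t for t in store
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


# Roughly what this SPARQL query asks for:
#   SELECT ?name WHERE { <.../ods/org/RR8> <https://example.org/vocab#name> ?name }
names = [o for _, _, o in match(STORE, subject=LEEDS, predicate=VOCAB + "name")]
print(names)
```

A real triplestore indexes the triples so that such patterns can be answered efficiently over very large datasets, rather than scanning every triple as this sketch does.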