SPARQLPuSH

From Filtered Push Wiki


Intro

This page is highly technical, but among the issues are questions of how to hide the RDF and SPARQL details from users. That probably deserves an entire page of its own.

For more about SPARQLPuSH (SPSH), see http://apassant.net/blog/2010/04/18/sparql-pubsubhubbub-sparqlpush and the video at http://vimeo.com/11023983

SPARQLPuSH follows the model of most(?) semantic pub/sub systems. Subscriptions are queries in some suitable language supporting some kind of reasoning. Publications are data that can constitute responses to those queries. For SPARQLPuSH the data are RDF triples and the queries are SPARQL 1.1. We focus on Annotations as the data. SPARQLPuSH is an implementation of the pubsubhubbub protocol, which is beyond the scope of this page. Suffice it to say that pubsubhubbub is an HTTP service implemented by a Python-based Google App Engine application. That service is part of the SPARQLPuSH ecosystem, whose deployment and support issues are, at this writing, not yet discussed in the Issues section.

More intro coming.

See also PubSubHubbub

Progress

Refactored SPARQLPuSH (SPSH) and committed it to project FP-SPARQLPuSH in SourceForge SVN.

  • Separated the different components of an SPSH deployment and parameterized each by a host HTTP URL or IP address. These strings are now defined in configuration files:
    • FEEDHOST - host URL at which atom feeds are served
    • HUBHOST - host URL at which pubsubhubbub runs under the Google App Engine
    • ENDPOINT - host URL at which SPARQL 1.1 query and update can be launched
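
As a rough illustration of how the three settings relate, here is a minimal Python sketch. It is not the project's actual configuration format, which this page does not show. The SpshConfig class and its method are hypothetical names; the FEEDHOST and HUBHOST values are taken from the example feed further down this page, the ENDPOINT value is a made-up placeholder, and the "/feed/<id>" layout is inferred from that same example feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpshConfig:
    """Hypothetical container for the three deployment settings."""
    feedhost: str   # FEEDHOST: where atom feeds are served
    hubhost: str    # HUBHOST: where the pubsubhubbub hub runs
    endpoint: str   # ENDPOINT: SPARQL 1.1 query and update endpoint

    def feed_url(self, feed_id: str) -> str:
        # Assumed layout "<FEEDHOST>/feed/<id>", inferred from the example feed.
        return f"{self.feedhost.rstrip('/')}/feed/{feed_id}"


config = SpshConfig(
    feedhost="http://tioga.huh.harvard.edu/node1/server",
    hubhost="http://tioga.huh.harvard.edu:8000/",
    endpoint="http://endpoint.example.org/annotations",  # placeholder, not a real host
)
print(config.feed_url("4ff5f506c955d"))
```

Keeping the three values separate is what allows the endpoint to live on a different machine from the feed server and hub.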

To date, this has been successfully tested with the endpoint (and so the Annotation triple store) on one machine and all other services (feed, SPSH server and client, and hub) together on a single host. It was tested both with that host the same as the endpoint host and with it different.

Maureen Kelly, Bob Morris, and David Lowery have all successfully installed SPARQLPuSH. Maureen studied the pubsubhubbub protocol enough to write a dynamite installation manual, updated by Bob for use with Apache Jena Fuseki SPARQL endpoints.

Tested with actual generated ApplePie annotations and a relatively realistic query that registers an interest in all publications about a particular specimen:

SELECT ?uri ?date ?annotation ?describesObject ?createdOn ?createdBy ?identifiedBy ?dateIdentified ?scientificName ?scientificNameAuthorship ?opinionText ?polarity ?evidence ?motivation
WHERE {
	?annotation ao:annotatesResource ?uri .
	?annotation pav:createdOn ?date .
	?subject dwcFP:collectionCode ?collectionCode .
	?subject dwcFP:catalogNumber ?catalogNumber .
	{ ?annotation ao:annotatesResource ?subject } UNION
	{
		?issue ao:annotatesResource ?subject .
		?annotation ao:annotatesResource ?issue
	} .
	?annotation ao:hasTopic ?topic .
	?annotation pav:createdBy ?annotator .
	OPTIONAL { ?annotation bom:hasResolution ?resolution } .
	OPTIONAL {
		?topic marl:describesObject ?describesObject .
		?topic marl:opinionText ?opinionText .
		?topic marl:hasPolarity ?thePolarity .
		?thePolarity rdf:type ?polarity
	} .
	OPTIONAL {
		?topic dwcFP:identifiedBy ?identifiedBy .
		?topic dwcFP:dateIdentified ?dateIdentified .
		?topic dwcFP:scientificName ?scientificName
	} .
	OPTIONAL { ?topic dwcFP:scientificNameAuthorship ?scientificNameAuthorship } .
	?annotator foaf:name ?createdBy .
	?annotation pav:createdOn ?createdOn .
	OPTIONAL {
		?annotation aod:hasEvidence ?theEvidence .
		?theEvidence aod:asText ?evidence
	} .
	OPTIONAL {
		?annotation aod:hasMotivation ?theMotivation .
		?theMotivation aod:asText ?motivation
	}
	FILTER (?collectionCode = "A" && ?catalogNumber = "00107080")
}

Here's an abbreviated feed resulting from registering a subscription to the above query, followed by entries signalling several publication events. Each of those publications was a re-load of an Annotation that meets the query, but with some data changed, notably pav:createdOn, which leads to the feed being considered updated. Our normal scenario, however, would never be to "edit" an Annotation, but more likely to publish a new one that also meets the subscription interest. A more interesting case might be one in which there is a response Annotation to this example. Coming soon...

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<title type="text">sparqlPuSH @ http://tioga.huh.harvard.edu/node1/server</title>
	<id>http://tioga.huh.harvard.edu/node1/server/feed/4ff5f506c955d</id>
	<updated>2012-07-09T14:43:27-04:00</updated>
	<author>
		<name>http://code.google.com/p/sparqlpush/</name>
	</author>
	<link rel="self" href="http://tioga.huh.harvard.edu/node1/server/feed/4ff5f506c955d" title="sparqlPuSH @ http://tioga.huh.harvard.edu/node1/server" type="application/atom+xml"/>
	<link rel="hub" href="http://tioga.huh.harvard.edu:8000/"/>

	<entry>
		<id>http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</id>
		<title type="text">http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</title>
		<content type="text">http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</content>
		<published>1969-12-31T19:00:00-05:00</published>
		<updated>1969-12-31T19:00:00-05:00</updated>
	</entry>
	[...]
	
	<entry>
		<id>http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</id>
		<title type="text">http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</title>
		<content type="text">http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</content>
		<published>2012-07-09T14:11:22-04:00</published>
		<updated>2012-07-09T14:11:22-04:00</updated>
	</entry>
</feed>

Each entry's <id> is actually the URI of the published Annotation's subject, but that is an artifact of the first line of the WHERE clause of the SPARQL query. ("?uri" has special significance to the SPARQLPuSH server, as does "?date".) It could be, however, a reasonable hook to launch a SPARQL query to the endpoint that would, e.g., retrieve all the Annotations with that URI as Subject. This may or may not bring back more information than the original subscription query, which in principle is all that the subscriber cares about anyway. Maybe.

Note that we typically generate Annotations whose pieces themselves are not resolvable, so an entry's <id> alone doesn't help much without launching a further SPARQL query to the Annotation store.
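
The follow-up idea above can be sketched in a few lines of Python. This is only an illustration, not part of SPSH: entry_ids and follow_up_query are hypothetical helper names, the sample feed is a cut-down version of the one shown earlier, and the predicate IRI is left as a parameter because the namespace behind the ao: prefix is not given on this page.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Cut-down version of the feed above: two entries for the same subject URI.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</id></entry>
  <entry><id>http://etaxonomy.org/ontologies/ao/cddd87bd-452c-4de6-93cc-5a11cc674697/Subject_0</id></entry>
</feed>"""


def entry_ids(feed_xml):
    """Return the distinct entry <id> values, in feed order."""
    root = ET.fromstring(feed_xml)
    seen = []
    for entry in root.findall(f"{ATOM}entry"):
        uri = entry.findtext(f"{ATOM}id")
        if uri and uri not in seen:
            seen.append(uri)
    return seen


def follow_up_query(subject_uri, annotates_iri):
    """Build a query for every Annotation that annotates subject_uri.

    annotates_iri is the full IRI of ao:annotatesResource; the caller
    must supply it, since this sketch does not know the ao: namespace.
    """
    return (
        "SELECT ?annotation WHERE { "
        f"?annotation <{annotates_iri}> <{subject_uri}> }}"
    )


for uri in entry_ids(SAMPLE_FEED):
    # The predicate IRI below is a placeholder, not the real ao: namespace.
    print(follow_up_query(uri, "http://example.org/ao#annotatesResource"))
```

Because re-loads of the same Annotation produce repeated entries with the same <id>, the sketch de-duplicates before querying.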

Issues

These are not necessarily independent.

  • Demo SPSH client is barely useful, even for demo.
  • Unclear which services should/must be behind an FP Access Point when the deployment is on a full FP Network.
  • Unclear what authorization is needed to make pubs, register subs, etc., especially in a "FP Lite" (e.g. just add SPSH support to Morphbank...not FP Network)
  • Probably need an Apache front end to the SPARQL endpoints to manage updates and survive firewalls (e.g. the SPSH server makes update calls to the endpoint, and either or both may be behind an uncooperative firewall).
  • SPSH server is, among other things, the atom feed server, so it needs to guarantee that the mime-type is application/atom+xml. Otherwise, some atom clients reject the subscription.
  • Google Reader, at least, can subscribe to feeds generated by SPSH, but so far it is not evident that publication events get noticed, as they do on the SPSH polling client. Not clear what is going on yet...
  • What happens if someone loads data into the triple store without going through the SPSH server, and hence without going through the hub? Will relevant subscribers see those triples? What is desired in that case?
  • How to make friendly subscription constructors, keeping SPARQL out of sight of end users? Maybe SPARQLed?
  • What are the best items in an Annotation to bind to the variables special to SPSH (i.e. ?uri, ?date, and a few optional ones)?
  • Is it possible to make the special SPSH variables part of the FP Network configuration? If yes, what kind of persistence and consistency is required? Which parts of the SPSH ecosystem have to be considered for this question?
  • What are the support and configuration issues for the Google App Engine, for Python, and for the pubsubhubbub hub? Are there license issues for the Google Python library? Per Maureen and David, the GAE itself is unnecessary.
  • Hub(?) has to have enough privilege to modify feeds, which possibly means it must/should run in the same host as the feed host.

Server Overview

Create a new feed via GET request http://localhost/node1/server/?query=[a SPARQL query]

  1. Check the feeds graph in the triple store (invoke the query through SPARQLConnector) and determine whether or not a feed already exists for the query. If not, register a new feed.
  2. Registration consists of creating the feed XML: launch the query and collect the annotations that currently meet the interest, create the XML with entries from the query results, write the feed XML to a file (/var/www/node1/server/feed/[some feed uuid]), and return the URL. The last step is to insert the feed URI and query into the triple store via the SPARQL endpoint.
  3. Now that the feed is registered, we use the Publisher to push the feed URL to the hub (we POST to the hub running on GAE).
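
The two HTTP messages involved in the steps above can be sketched as follows. This is a hedged Python illustration, not the SPSH code: feed_request_url and publish_ping_body are hypothetical names, and the hub.mode=publish / hub.url form fields follow the publisher ping convention of the PubSubHubbub 0.3 draft, which is assumed here to be what the Publisher sends.

```python
from urllib.parse import urlencode


def feed_request_url(server_url, sparql_query):
    """URL a client would GET to create (or look up) the feed for a query."""
    return f"{server_url}?{urlencode({'query': sparql_query})}"


def publish_ping_body(feed_url):
    """Form-encoded body POSTed to the hub to announce a new or updated feed."""
    return urlencode({"hub.mode": "publish", "hub.url": feed_url})


# The query must be URL-encoded, since it contains spaces, '?', '{' and '}'.
print(feed_request_url("http://localhost/node1/server/",
                       "SELECT ?uri WHERE { ?s ?p ?uri }"))
print(publish_ping_body("http://tioga.huh.harvard.edu/node1/server/feed/4ff5f506c955d"))
```

The ping body would be sent with Content-Type application/x-www-form-urlencoded; the hub then fetches the feed itself rather than receiving its content in the ping.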

Update feeds via POST (using a LOAD <file.rdf> query)

  1. When a load or other update query is submitted to SPARQLPuSH via POST, it is first sent directly to the endpoint (loading the triples into the triple store). Then we update all the feeds, since these new triples may satisfy one of the interests.
  2. Updating all the feeds consists of getting all feeds from the feed graph in the triple store and, for each feed, launching its query and checking whether the first result has a date later than the feed date. If the date of the result is later, delete the feed and create new XML with the new results and a new date. Re-insert the updated feed via SPARQLConnector and push to the hub via the Publisher.
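
The staleness check in step 2 can be sketched in Python as below. This is an illustration under assumptions, not the server's code: the in-memory feeds dictionary stands in for the feeds graph, run_query stands in for a call through SPARQLConnector, and both names are hypothetical. Dates are assumed to be ISO 8601 strings with a UTC offset, as in the example feed.

```python
from datetime import datetime


def is_stale(feed_updated, first_result_date):
    """True when the newest query result is later than the feed's date."""
    return datetime.fromisoformat(first_result_date) > datetime.fromisoformat(feed_updated)


def feeds_to_refresh(feeds, run_query):
    """Return the ids of feeds whose query now yields a newer first result.

    feeds maps a feed id to {"query": ..., "updated": ...}; run_query is a
    caller-supplied function returning result rows, each with a "date" key.
    """
    stale = []
    for feed_id, feed in feeds.items():
        results = run_query(feed["query"])
        if results and is_stale(feed["updated"], results[0]["date"]):
            stale.append(feed_id)
    return stale


feeds = {
    "old": {"query": "Q1", "updated": "2012-07-09T14:11:22-04:00"},
    "current": {"query": "Q2", "updated": "2012-07-09T14:43:27-04:00"},
}
# Stub standing in for the endpoint: every query returns one recent result.
latest = lambda query: [{"date": "2012-07-09T14:43:27-04:00"}]
print(feeds_to_refresh(feeds, latest))
```

Only the first result is compared, as in the description above, which presumes the results arrive with the most recent date first. Each stale feed would then be rebuilt, re-inserted, and pushed to the hub.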

Client Overview

Subscribe to a feed

  1. Get the feed for a query from the server via a GET request (same procedure as on the form).
  2. Parse the atom feed and retrieve the hub URL (from link rel="hub" .. present in the feed XML).
  3. Use Subscriber to POST the feed URL and a callback URL to the hub to subscribe.
  4. When a feed is published to the hub, the hub invokes the callback URL, letting the SPARQLPuSH client know that something was published that meets an interest.
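
The subscription steps above can be sketched in Python as follows. This is a minimal illustration, not the SPSH client: the function names and the callback URL are hypothetical, and the hub.* parameters (including hub.verify and the hub.challenge echo) follow the PubSubHubbub 0.3 draft, which is assumed to match the hub in use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode

ATOM = "{http://www.w3.org/2005/Atom}"

# Cut-down version of the example feed header shown earlier on this page.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://tioga.huh.harvard.edu/node1/server/feed/4ff5f506c955d" type="application/atom+xml"/>
  <link rel="hub" href="http://tioga.huh.harvard.edu:8000/"/>
</feed>"""


def hub_and_topic(feed_xml):
    """Step 2: read the hub URL and the feed's own URL from its <link> elements."""
    root = ET.fromstring(feed_xml)
    links = {link.get("rel"): link.get("href") for link in root.findall(f"{ATOM}link")}
    return links.get("hub"), links.get("self")


def subscribe_body(topic, callback):
    """Step 3: form-encoded body to POST to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,
        "hub.callback": callback,
        "hub.verify": "sync",  # used by the 0.3 draft; an assumption about this hub
    })


def verify_challenge(query_string, expected_topic):
    """Answer the hub's verification GET: echo hub.challenge if the topic matches."""
    params = parse_qs(query_string)
    if params.get("hub.topic") == [expected_topic]:
        return params.get("hub.challenge", [None])[0]
    return None


hub, topic = hub_and_topic(SAMPLE_FEED)
# Hypothetical callback URL for the client.
print(hub, subscribe_body(topic, "http://localhost/node1/client/callback"))
```

After a successful subscription, step 4 is the hub POSTing to the callback URL whenever the feed is republished.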

Component Diagram

Fig128064.svg