Remaining Issues


0) No single point of failure node architecture.

In Prototype: Early on, Hadoop; later, none. Current: None. The architecture includes elements (DescribeCapabilities, ServiceDiscovery) to support distributed nodes; the decoupling of nodes, messaging services, knowledge services, and analytical services is intended to support distributed nodes.
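A minimal sketch of what the DescribeCapabilities/ServiceDiscovery contract might amount to (all names and shapes here are hypothetical assumptions; nothing like this is implemented yet):

<source lang="java">
// Hypothetical sketch of the node-discovery contract implied by the
// architecture elements above; names are assumptions, not an actual API.
import java.util.List;

/** A capability a node can advertise, e.g. messaging, knowledge, analysis. */
enum Capability { MESSAGING, KNOWLEDGE, ANALYSIS, ACCESS_POINT }

/** What one node reports about itself in response to DescribeCapabilities. */
class NodeDescription {
    String nodeId;                  // stable identifier for the node
    String endpointUrl;             // where to reach its access point
    List<Capability> capabilities;  // what this node offers the network
}

/** Service discovery: any node can ask the network who provides what. */
interface ServiceDiscovery {
    /** All known nodes currently advertising the given capability. */
    List<NodeDescription> findNodes(Capability needed);
}
</source>

Under a contract like this, no node is special: if a node offering a capability fails, a peer found through discovery can take over, which is the intent of the no-single-point-of-failure requirement.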

We have not implemented a FilteredPush network since the earliest experiments with Hadoop in the prototype; we have only ever implemented a single central server. Implementation of a multiple-node network with no single point of failure is a critical deliverable for the current funding. The first submission of FP was as a single central server; this was rejected by the reviewers, and both the prototype and the current funding assert that we will build a distributed network using peer-to-peer technology to avoid any single point of failure.

See: http://etaxonomy.org/mw/File:Architecture_whiteboard_diagram_cleaned_updated_2011Apr.png
See: http://etaxonomy.org/mw/File:3nodes.png
See: http://etaxonomy.org/mw/File:Elaborated_component_model_of_a_three_node_network.png
See: http://etaxonomy.org/mw/FP2_Design

In the 3nodes.png diagram, we have "knowledge" represented in each of the three nodes. Does this mean we have a database/index replicated three times? Or does this mean we have "knowledge" partitioned into three parts? How does data get into "knowledge?" --Maureen Kelly 15:43, 13 July 2012 (EDT)



1) Mechanism for going from identifiers to things identified.

In Prototype: None. Current: None.

When an annotation has a GUID as its subject, further information about the data object identified by the GUID will be needed to identify the relevant subscriptions. Annotations about collection objects may not carry within the annotation the institutionCode + collectionCode + catalogNumber needed to determine their fit to interests expressed as institutionCode + collectionCode.
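As a sketch of the missing mechanism (all names hypothetical; where the records come from is precisely the open question noted in the comment below), resolving a GUID subject to the record it identifies before matching might look like:

<source lang="java">
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical resolver for going from a GUID to the record it identifies.
 * The backing record source could be contributed records, our own harvest,
 * or an external aggregator's cache -- the open choice a)/b)/c) below.
 */
interface GuidResolver {
    /** Flat Darwin Core-style terms for the identified object, if known. */
    Optional<Map<String, String>> resolve(String guid);
}

class SubscriptionKeyLookup {
    private final GuidResolver resolver;

    SubscriptionKeyLookup(GuidResolver resolver) { this.resolver = resolver; }

    /** Derive the institutionCode + collectionCode key interests are expressed in. */
    Optional<String> interestKeyFor(String guid) {
        return resolver.resolve(guid).map(rec ->
            rec.get("institutionCode") + ":" + rec.get("collectionCode"));
    }
}
</source>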

Requirement 8. Requirement 18.

Note that this means all data to be included in analysis must be in some (central) _known_ location to be queried. This means we either: a) wait for data sources to contribute records, b) harvest data sources ourselves (in which case we need to identify data sources), or c) access some other harvester's aggregated data. --Maureen Kelly 14:44, 13 July 2012 (EDT)


2) Mechanism for matching subscriptions to notifications.

This is a general case of (1) above.

This is an absolutely critical piece of FilteredPush.

In Prototype: Simple value matching by Hadoop. Current: Key/Value Pair Matching on data present in the annotation.

Requirement 20

Missing is "and associated content".

Proposed Mechanism 1 (Prototype):

AnnotationJob has step 1: analyze the annotation to find matching interests. It is the responsibility of an analytical component to determine the interests that apply to an annotation. AnnotationJob has step 2: give the annotation and the list of interests to the messaging system.

Proposed Mechanism 2 (Zhimin's Design): AnnotationJob has step 1: query the network cache and data providers using the data found in the annotation to find related data (e.g., look up a collection code from a specimen GUID). AnnotationJob has step 2: deliver the annotation and related data to the messaging system; the messaging system has the responsibility of matching the annotation and related data to interests and delivering the message.

Proposed Mechanism 3 (SPARQL Push): AnnotationJob has step 1: deliver the annotation to the messaging system. The messaging system is responsible for determining both the related data and the matching interests.

No matter what component ultimately contains the code to do this, what we need is an _implementable_ description of what it means to "determine the interests that apply", "find related data", and "match annotation and related data to interests". What is "related"? What does it mean for an interest to "apply"? --Maureen Kelly 14:43, 13 July 2012 (EDT)
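One candidate, implementable reading of "an interest applies", consistent with the current key/value approach, is plain subset matching: an interest applies when every key/value pair it specifies occurs in the annotation's data (plus whatever related data the lookup in (1) contributed). This is a sketch with hypothetical names, offered to make the question tangible, not the decided semantics:

<source lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** An interest: key/value pairs that must all appear in the annotation data. */
class Interest {
    final String subscriberId;
    final Map<String, String> requiredPairs;

    Interest(String subscriberId, Map<String, String> requiredPairs) {
        this.subscriberId = subscriberId;
        this.requiredPairs = requiredPairs;
    }

    /** One reading of "applies": every required pair occurs in the data. */
    boolean appliesTo(Map<String, String> annotationData) {
        return annotationData.entrySet().containsAll(requiredPairs.entrySet());
    }
}

class InterestMatcher {
    /** Find all subscribers whose interests apply to this annotation's data. */
    List<String> match(Map<String, String> annotationData, List<Interest> interests) {
        List<String> hits = new ArrayList<>();
        for (Interest i : interests) {
            if (i.appliesTo(annotationData)) hits.add(i.subscriberId);
        }
        return hits;
    }
}
</source>

Under this reading, "related data" would be the additional key/value pairs merged in before matching, and the remaining design decision is which component performs the merge and the match (Mechanisms 1-3 above place it differently).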


3) Cache/Index.

In Prototype: Simple Relational Database with BBG data. Current: None. Experimented briefly with GBIF cache.

Requirement 1. Requirement 2. Requirement 26.

We can't simply use a copy of the GBIF cache; we need to harvest unredacted and detailed information from providers in order to deliver rich and accurate data about duplicates.

Requirement 1 probably implies a local cache as well as a network cache.
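A minimal sketch of the two-level arrangement this implies, assuming a simple key/value record cache keyed by GUID (interface and names hypothetical; neither cache exists yet):

<source lang="java">
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical record cache: consult the node's local cache first, fall
 * back to the shared network cache, which would be populated by harvesting
 * detailed, unredacted records from providers.
 */
interface RecordCache {
    Optional<Map<String, String>> get(String guid);
    void put(String guid, Map<String, String> record);
}

class TwoLevelCache implements RecordCache {
    private final RecordCache local;    // per-node, fast
    private final RecordCache network;  // shared, harvested from providers

    TwoLevelCache(RecordCache local, RecordCache network) {
        this.local = local;
        this.network = network;
    }

    public Optional<Map<String, String>> get(String guid) {
        Optional<Map<String, String>> hit = local.get(guid);
        if (hit.isPresent()) return hit;
        Optional<Map<String, String>> fromNetwork = network.get(guid);
        fromNetwork.ifPresent(rec -> local.put(guid, rec)); // promote on hit
        return fromNetwork;
    }

    public void put(String guid, Map<String, String> record) {
        local.put(guid, record);
        network.put(guid, record);
    }
}
</source>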



4) Authentication/Authorization.

Prototype: Not implemented. Current: FPAccessPoint is a SOAP web service; encryption/authentication is not turned on, and key management is not yet developed.

All interactions between clients and FP network nodes are through the Access Point. Only client software with a valid key can communicate with an Access Point. All traffic between the Access Point and clients is encrypted.

Clients of a FilteredPush network instance have their own independent authentication/authorization mechanisms. All players in a FilteredPush network trust the assertions made by each client about the identity of the human end users of that client.
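A minimal sketch of the key check and trust model just described (hypothetical names; key issuance, revocation, and transport encryption are exactly the undeveloped pieces and are out of scope here):

<source lang="java">
import java.util.Set;

/**
 * Hypothetical guard at the Access Point: only clients presenting a valid
 * key get through, and the end-user identity asserted by a trusted client
 * is accepted as-is, per the FilteredPush trust model above.
 */
class AccessPointGuard {
    private final Set<String> validClientKeys;

    AccessPointGuard(Set<String> validClientKeys) {
        this.validClientKeys = validClientKeys;
    }

    /** Reject any message not presented with a known client key. */
    void authenticate(String clientKey) {
        if (!validClientKeys.contains(clientKey)) {
            throw new SecurityException("Unknown client key; message refused.");
        }
    }

    /** Trust the client's assertion of who the human end user is. */
    String effectiveUser(String clientKey, String assertedUserId) {
        authenticate(clientKey);
        return assertedUserId; // no independent verification, by design
    }
}
</source>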

Requirement 15. Requirement 14.

See also: http://etaxonomy.org/mw/Use_Case_Scenarios#Scenarios_for_malicious_uses

A lightweight annotation system may have different requirements, but it will need an authentication mechanism for the injection of new annotations into the system, to prevent arbitrary unauthorized clients from injecting spam or attacks. A lightweight annotation system is particularly vulnerable to cross-site scripting attacks.

The human end user and the client (e.g., Patrick Sweeney logging in to the NEVP TCN Symbiota instance) are criteria needed by the annotation processor to set filters for automated processing of new specimen annotations.



5) Description of analysis.

Prototype: Hadoop, with its inherent clustering ability; a clustering tool using an external index of cached data for the special case of duplicate finding. Current: No implementation. Proposed: Embed the Kepler analytical engine as the analytical capability, add

Analysis must be able to cluster duplicates based on fuzzy matching.

Analysis must be able to launch annotations that identify duplicate sets.

Requirement 2. Requirement 26.

Tightly related is James' concept of a duplicate set with a consensus record based on analysis and human responses.
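To make "cluster duplicates based on fuzzy matching" concrete, here is a sketch of a greedy clustering pass over cached records; the similarity measure and threshold are placeholder assumptions, not the prototype's algorithm or a proposed Kepler workflow:

<source lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of greedy duplicate clustering over cached specimen records. */
class DuplicateClusterer {
    /** Crude similarity: fraction of shared key/value pairs (an assumption). */
    static double similarity(Map<String, String> a, Map<String, String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        long shared = a.entrySet().stream().filter(b.entrySet()::contains).count();
        return (double) shared / Math.max(a.size(), b.size());
    }

    /** Greedily assign each record to the first cluster it fuzzily matches. */
    static List<List<Map<String, String>>> cluster(
            List<Map<String, String>> records, double threshold) {
        List<List<Map<String, String>>> clusters = new ArrayList<>();
        for (Map<String, String> rec : records) {
            boolean placed = false;
            for (List<Map<String, String>> c : clusters) {
                if (similarity(c.get(0), rec) >= threshold) {
                    c.add(rec);       // fuzzy match against cluster exemplar
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Map<String, String>> fresh = new ArrayList<>();
                fresh.add(rec);
                clusters.add(fresh);  // record seeds a new duplicate set
            }
        }
        return clusters;
    }
}
</source>

Each resulting cluster is a candidate duplicate set: the set about which an annotation could be launched, and the raw material for a consensus record built from analysis and human responses.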