FP-AnnotationProcessor


System Overview

The annotation processor is a client to a FilteredPush network instance that allows users to interact with the network and with a local database. It supports the annotation-ingest portion of the Annotate Specimen use case, as well as the discover and invoke-workflow portions of the Quality_Control_New_Record use case. The annotation processor provides a user interface for an adapter between the configured domain vocabularies of the network and a local database schema.

FP-AnnotationProcessor-Overview.jpg


The protocol between the SpecifyAdapter and the SpecifyDriver is a set of HTTP GET requests.


For example, to find an occurrence object record with collection code equal to A and catalog number equal to 00000673:

Request: http://specifyhost/find?type=dwc_Occurrence&object=..

  where object = { dwc_collectionCode=A, dwc_catalogNumber=00000673 }

Response in JSON: { 234481 }
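
As a sketch, the same call could be issued from Java 11's java.net.http client; the host name and the exact encoding of the object KVPs are assumptions here:

  import java.net.URI;
  import java.net.URLEncoder;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.charset.StandardCharsets;

  public class SpecifyDriverFindExample {
      public static void main(String[] args) throws Exception {
          // The object KVPs, encoded as shown above; the exact encoding
          // the driver expects is an assumption.
          String object = URLEncoder.encode(
                  "{ dwc_collectionCode=A, dwc_catalogNumber=00000673 }",
                  StandardCharsets.UTF_8);
          URI uri = URI.create(
                  "http://specifyhost/find?type=dwc_Occurrence&object=" + object);

          HttpClient client = HttpClient.newHttpClient();
          HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
          HttpResponse<String> response =
                  client.send(request, HttpResponse.BodyHandlers.ofString());

          System.out.println(response.body()); // e.g. { 234481 }
      }
  }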

Interaction with Messaging System

 

Register Interest

RegisterInterest.png

Retrieve Interested Annotations

RetrieveInterestedAnnotations.png

 

The sort/filter/search functions on the annotation list (a sketch of approach two follows the list):

  • Approach one: retrieve all the annotations from the messaging system and do the sort/filter/search locally.
    • Not efficient, especially when there are many annotations.
    • To improve efficiency, the annotations would need to be stored in the local database.
  • Approach two: support the sort/filter/search functions in the messaging system. The annotation processor issues the request; the messaging system executes the query and returns a batch of records that fits on one page, similar to what Gmail does.
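
A sketch of what approach two's query API might look like; the service name and parameters are illustrative assumptions:

  import java.util.List;

  // Hypothetical query API: the messaging system runs the query and
  // returns only the batch of annotations that fits on one page.
  public interface AnnotationQueryService {
      List<String> findAnnotations(String filter,   // e.g. "author=kathy"
                                   String sortBy,   // e.g. "dateCreated"
                                   int page,        // zero-based page index
                                   int pageSize);   // records per webUI page
  }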

Annotation Parser

FP-annotation-parser.png

 

The AnnotationParserSparqlImpl reads a configuration file that declares the SPARQL rule files used to validate or parse each item in the incoming annotation. By executing the SPARQL queries, an Annotation object is built and returned; it contains all the parsed-out information along with the warning and error messages.

 

The BasicObject is essentially a hashmap storing the key-value pairs for a specific concept in the annotation, like Annotator, Identification, etc. The CompositeObject is mainly used to represent an assertion in the annotation, like InsertIdentification, InsertOccurrence, etc. Since one assertion is composed of multiple basic concepts, the CompositeObject is a container of BasicObjects; e.g., an InsertIdentification contains an Identification and a CollectionObjectIdentifier. Sometimes one concept corresponds to multiple objects; e.g., an InsertOccurrence has an identification history. Therefore the CompositeObject may also store a list of BasicObjects for a single concept.
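
A minimal Java sketch of these two containers, inferred from the description above (method names are illustrative; the real classes may differ):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Key-value pairs for a single concept, e.g. Annotator or Identification.
  class BasicObject {
      private final Map<String, String> values = new HashMap<>();

      public void put(String field, String value) { values.put(field, value); }
      public String get(String field) { return values.get(field); }
  }

  // An assertion such as InsertIdentification: maps each concept name to
  // one or more BasicObjects (e.g. an identification history).
  class CompositeObject {
      private final Map<String, List<BasicObject>> concepts = new HashMap<>();

      public void add(String concept, BasicObject object) {
          concepts.computeIfAbsent(concept, k -> new ArrayList<>()).add(object);
      }
      public List<BasicObject> get(String concept) {
          return concepts.getOrDefault(concept, List.of());
      }
  }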

Mapper & Adapter & Driver

 

Class Diagram

FP-mapper-driver.png

Get Determination History and OAuth Authentication

FP-getDeterminationHistory.png

Create New Determination

FP-createNewDetermination.png

Update Determination

FP-updateDetermination.png

In this diagram, the basic principle for processing an inner record referenced by the major record (e.g., the taxon record referenced by the identification record) is:

1. If it is updated to another existing record, use the id of that existing record to populate the corresponding field in the major record.

2. If it is updated to a completely new record, create the new record and use its id to populate the corresponding field in the major record. An alternative approach is to update the current record in place. Which option to use could be a configuration issue, and the configuration might need more than a simple strategy setting: it might need to declare the affected tables and the conditions. It could also be resolved by interacting with the user, which would require a general protocol between the adapter and the mapper.

What's missing or not clear

  • Protocol between the adapter and the driver, especially the exceptions for the different operations, like UnSupportedParameters, violation of specific business logic, etc.
  • A general authentication API might be needed. How will authentication work if the annotation processor is a plugin of Specify?
  • Do we need transactions for insert and update in the mapper or adapter, since insertion of one record might result in insertions or updates of multiple records?
  • DataPolicy (the current DataPolicy object and its identification record definition):
    • Business logic:
      • Example 1: Can't insert a new taxon record which has the same scientific name as an existing record but a different scientific authorship.
        • Simple solution: return the exception and let the users decide what to do.
        • Advanced solution:
          • describe the business logic in the data policy so that the mapper knows the input is invalid.
          • describe the actions to take once the error occurs in the data policy so that the mapper knows how to solve the problem, e.g. the mapper needs to get all the authorships currently existing in the database for the specific scientific name for the user to choose from.
      • Example 2: To insert a taxon, the higher taxon it declares must exist in the database.
        • Solution 1: declare all the required fields for creating the higher taxon in the data policy, to get all the information from the user in one pass. Once the user submits the creation request, the adapter will create the hierarchy if needed.
        • Solution 2: describe the problem and the subsequent action in the data policy so that the mapper knows how to deal with it, e.g. the mapper needs to ask the user for the higher taxon information recursively in order to create it.
    • Protocols: XML with a specific schema could be used.
      • For non-simple CRUD operations.
        • Example 1: to getDeterminationHistory: 1) get the OccurrenceId, 2) find all the determinations.
        • Example 2: to create a new determination: 1) get the occurrence id, 2) create the determination with the occurrence id as one parameter.
      • For the actions once business logic is violated, as discussed above. This part is not only between the mapper and the driver; the webUI also needs to be part of it so that we can ask for the user's input.
      • For how to provide fill-in values in the "new" tab by querying the local database. E.g., get the taxon hierarchy from the local database according to the scientific name in the incoming annotations. What to do if multiple sets of records are found? Etc.
    • The data format: e.g., the date format output or accepted by the Specify driver is yyyy-MM-dd, but yyyy-MM-dd, yyyy/MM/dd, or even a date range is allowed by the annotation.
      • Solution 1: describe the data format in the data policy, and the mapper will do the conversion.
      • Solution 2: just declare the data formats supported by the annotation processor and let the adapter do the conversion. This solution might be better: for example, when the date is a range, which date should be used for the identification date when inserting the record into the local data source? That is a local decision rather than the mapper's, unless such a rule can also be described in the data policy.

Proposed solution

 

Adapter vs. Driver

The current Specify driver provides a way to read and write the Specify database over HTTP, with OAuth used for authentication. On the annotation processor side, we need a piece of software that can communicate with the Specify driver in this Specify-specific way. We call that the adapter.

Both the adapter and the driver are local-data-source-specific pieces of software and will be developed by the people who manage the local data source, or at least know it well. The adapter communicates with the driver via their private protocol, while the mapper in the annotation processor communicates with the adapter through a generally defined API.

Mapper & Adapter Class Diagram

FP-mapperAdapterCalssDiagram.png

DataPolicyMapper API

This API is invoked by the annotation processor webUI. The DataPolicyMapper decomposes the input KVPs of a specific type into KVPs of that type and its foreign dependent types, according to the type definitions in the data policy, and then issues read/write operations to the adapter to accomplish the mapping. Fields in the input KVPs that are not defined in the data policy are simply ignored by the DataPolicyMapper. (Please see more here; a sketch of the interface follows the list):

  • getByIdentifiers: Find the objects of the specified type identified by the specified identifiers.
    • The identifiers are not required to be fields defined in the data policy for that type.
    • Which identifiers may be used for searching is decided by the annotation rules. For example, in the ApplePie rule, for dwc_identification the (collectionCode, catalogNumber) pair is the identifier. We need to tell the adapter developers about all the possible identifiers so that they know how to implement searching, and they must support them.
    • The reason we don't merge this method into the following "get" method is that the Mapper handles the two cases differently. For "getByIdentifiers", the Mapper simply invokes the "find" method of the Driver, which knows how to return the object identified by those identifiers. For "get", the Mapper walks through the data policy to construct the object of each type and then invokes the "find" method of the driver. It would be hard for the Mapper to distinguish these two cases in a local-data-source-independent way if they were not separated into two methods. On the driver side, however, the originally proposed "findByIdentifiers" method is merged into the "find" method, since the driver can decide how to "find by value" in both cases.
  • get: Find the objects of the specified type that have the specified value.
  • add: Add an object of the specified type to the local data source. The new object will be created if it doesn't exist.
    • The reason we separate the "saveOrUpdate" method into "add" and "update" is that these APIs are invoked by the annotation processor, which knows for sure what it wants to do: for a given object it wants either to update it or to create it, but not both, and the behaviors of update and create are clearly different. For the data in the "new" tab, the annotation processor wants to insert a new record; when the record is not found in the data source, it should be created. But for the data in the existing-record tab, it wants to update the record; when the record is not found in the data source, an exception should be thrown. It is clearer to separate the different behaviors into different methods.
  • update: Update the object of the specified type with the specified value in the local data source.
  • delete: Delete the object of the specified type with the specified identifier from the local data source.
  • getObjectDef: Get the list of field definitions (in primitive types) for the specified object.
    • When the object has a cross reference to another object, the mapper will go to the referenced object's definition and grab its field definitions. This process can be recursive.
    • The major use of this method is to generate the table in the webUI that presents the existing data and the "new" data. It is also useful in other cases to help interpret the data retrieved from the data source.
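
A minimal Java sketch of this API; the KVP representation, return types, and the FieldDef shape are assumptions, since the real signatures are not fixed here:

  import java.util.List;
  import java.util.Map;

  // Illustrative only: KVPs are shown as Map<String, String>, and FieldDef
  // carries the field properties listed in the Data Policy section below.
  public interface DataPolicyMapper {
      List<Map<String, String>> getByIdentifiers(String type, Map<String, String> identifiers);
      List<Map<String, String>> get(String type, Map<String, String> value);
      void add(String type, Map<String, String> value);
      void update(String type, Map<String, String> value);
      void delete(String type, String identifier);
      List<FieldDef> getObjectDef(String type);

      class FieldDef {
          public String name;        // field name
          public String type;        // e.g. String, int, DateTime, or another record type
          public boolean required;
          public boolean primaryKey;
      }
  }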
DataPolicyAdapter API

This API is invoked by the DataPolicyMapper. All the KVPs passed in are constructed according to the data policy, except for find (a sketch of the interface follows the list):

  • getDataPolicy: Return the data policy.
    • As mentioned above, the adapter developer will be told which identifiers will be passed for each DwC type.
  • find: Return the identifiers of the objects of the specified type that have the specified value.
    • The originally proposed findByIdentifiers is merged in here; it returns the identifiers of the objects of the specified type identified by the specified identifiers.
    • So the input KVP for this method may contain all the possible identifiers as well as values constructed according to the data policy.
  • get: Return the object with the specified type and key identifier.
  • add: Insert a record into the data source with the specified type and value.
  • update: Update a record in the data source with the specified type, identifier, and value.
  • delete: Delete the object of the specified type with the specified identifier from the local data source.
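
A matching sketch of the adapter API, under the same assumptions (add is assumed to return the new record's identifier, and the data policy is shown as an opaque String):

  import java.util.List;
  import java.util.Map;

  public interface DataPolicyAdapter {
      String getDataPolicy();
      List<String> find(String type, Map<String, String> value);   // returns identifiers
      Map<String, String> get(String type, String identifier);
      String add(String type, Map<String, String> value);          // returns the new identifier
      void update(String type, String identifier, Map<String, String> value);
      void delete(String type, String identifier);
  }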

Authentication

Since the authentication mechanism is not yet settled, it is assumed that the credential for accessing the local data source is passed to the mapper/adapter as a context parameter.

The exception is OAuth authentication, where the credential is obtained at runtime by walking through the OAuth protocol when the mapper/adapter accesses the data source.

Exception

The following exceptions will be thrown by the DataPolicyAdapter:

  • UnAuthenticated: The user is not authenticated against the local data source.
  • InvalidParameter: The input parameter is not recognized by the adapter or is not valid according to the business logic of the local data source (please see here).
  • MissingInformationException: Something needed to complete the operation is missing from the input parameters. A possible case is missing information for creating the taxon hierarchy (please see here).
  • UniqueIdentificationException: The provided information is not enough to uniquely identify a record when it is expected to. Please see here.
  • InternalException: Something is wrong with the local data source, e.g. can't connect to Specify, Specify's internal state is inconsistent, null pointer exceptions.
  • OAuthRedirectException: Indicates that the operation can't be completed only because the request will be redirected to go through the OAuth protocol.


Besides the above exceptions, the DataPolicyMapper will throw the following exception:

  • ConverterException: Failure to convert the data (please refer here).

Transaction

Each single insert or update operation on a DwC type defined in the data policy should be implemented as a transaction: either it succeeds, or there is no side effect when it fails. The adapter/driver developer needs to follow this rule.
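
For example, an adapter backed directly by JDBC might implement this rule as follows (a minimal sketch; the storage technology and method names are assumptions):

  import java.sql.Connection;
  import java.sql.SQLException;

  public class TransactionalAddSketch {
      // A single logical add that touches several tables either fully
      // succeeds or rolls back, so a failure leaves no side effect.
      static void addRecord(Connection conn) throws SQLException {
          boolean previousAutoCommit = conn.getAutoCommit();
          conn.setAutoCommit(false);
          try {
              // ... insert the main row and any dependent rows here ...
              conn.commit();      // all inserts become visible together
          } catch (SQLException e) {
              conn.rollback();    // undo partial work on failure
              throw e;
          } finally {
              conn.setAutoCommit(previousAutoCommit);
          }
      }
  }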

Data Policy

The data policy defines the DwC types that the local database understands as first-class "records". Such records are guaranteed by the driver to be saved and updated atomically. The mapper communicates with the adapter based on the data policy to read/write the local data source.


The content of the data policy (an illustrative sketch follows the list):

  • The DwC types and their composing fields understood by the local data source. A field definition includes:
    • field name
    • field type:
      • The supported types: String, int, float, DateTime, Boolean, List, LatLog
      • If the type refers to another record defined in the data policy, it represents a foreign dependency.
    • required or not
    • primary key or not: Each DwC type defined in the data policy must have one primary key field.
    • converter and its configuration: Optional. Default converters are provided for the system-supported types. User-customized converters are also supported.
  • Application level converter: Besides the converters associated with the fields of a record, a converter associated with a specific field name in general, like CollectionCode no matter in which record the field appears, can also be defined in the data policy. For more details on converters, please see here.
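
A purely illustrative sketch of such an entry, assuming the XML encoding suggested in the Protocols discussion above; every element and attribute name here is hypothetical:

  <dataPolicy>
    <!-- A DwC type understood as a first-class record. -->
    <record type="dwc_identification">
      <field name="identificationId" type="int" primaryKey="true"/>
      <field name="dateIdentified" type="DateTime" required="false"
             converter="DateTimeConverter" converterConfig="yyyy-MM-dd"/>
      <!-- A foreign dependency: the type names another record. -->
      <field name="identifiedBy" type="botanist" required="true"/>
    </record>
    <!-- Application level converter, applied to collectionCode in any record. -->
    <converter field="collectionCode" class="MyCollectionCodeConverter"/>
  </dataPolicy>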


The requirements and features of the data policy:

  • The data policy is required to define certain DwC types according to the annotation rules. For the ApplePie rules, the required DwC types are dwc_identification, dwc_occurrence, and dwc_location. These DwC types can have foreign dependencies on other DwC types in the data policy, but without definitions for these types the annotation processor/mapper has no starting point.
  • The data policy is not a partition of the DwC terms in the annotation: it is possible that some fields defined in the annotation are not defined in the data policy. These fields are simply ignored by the mapper.
  • The data policy is not required to be expressed entirely in DwC terms, since not all the information needed by the local data source can be expressed in DwC terms, e.g. the botanist record in the Specify data policy. The mapping only happens between the annotation and the fields in DwC terms, while the fields in non-DwC terms are shown in the webUI so the user can type in their values.

FP-dataPolicy.png

Identifiers Issue

To get the determination history, the annotation processor needs to get the identification objects from the local data source based on the collection code and catalog number. For different data sources, the protocol to retrieve the object based on these identifiers might be different. For Specify, it needs to: 1) get the OccurrenceId, 2) find all the determinations.

Instead of hard-coding such data-source-specific logic in the mapper, we let it be done by the adapter in the getByIdentifiers method. The adapter developer needs to be told which identifiers could be input for each DwC type defined in the data policy. This is actually decided by the rules about how the annotation is constructed.
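
As a sketch, using the DataPolicyMapper interface outlined above (the identifier names follow the ApplePie rule; the values come from the earlier find example):

  import java.util.List;
  import java.util.Map;

  public class DeterminationHistorySketch {
      // Returns all determinations for the specimen identified by the
      // (collectionCode, catalogNumber) pair from the annotation.
      static List<Map<String, String>> history(DataPolicyMapper mapper) {
          Map<String, String> identifiers = Map.of(
                  "collectionCode", "A",
                  "catalogNumber", "00000673");
          // The Specify adapter resolves the OccurrenceId first and then
          // finds all the determinations for it; the mapper stays
          // data-source independent.
          return mapper.getByIdentifiers("dwc_identification", identifiers);
      }
  }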


FP-getDeterminationHistoryV2.png

Invalid Field Protocol

Sometimes the input data for a save or update operation is not valid. One reason is that business logic is violated: for example, it is not allowed to create a new taxon record that has the same scientific name but a different authorship than an existing taxon record. For such local-data-source-specific business logic constraints, we use the InvalidParameter exception to handle them generically.


The basic protocol is:

  • When the adapter finds such a problem, it constructs the InvalidParameter exception with the invalid field name and the options. Naturally, how to detect such a problem and how to get the options for the field are part of the private protocol between the adapter and the driver.
  • The mapper throws the InvalidParameter exception onward.
  • The webUI catches and parses the InvalidParameter exception and pops up a window showing the invalid fields and the options for the user to choose from. Once the user submits the data, the popup window closes and the same request, add or update, is submitted to the mapper again automatically.


In this solution, the adapter developer needs to keep in mind that when InvalidParameter is used to collect data, the same request might be issued by the mapper several times before the final success. They need to guarantee that multiple invocations have no side effects.
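
A condensed sketch of this round trip, with a hypothetical exception shape and stand-ins for the webUI popup and the mapper:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class InvalidParameterSketch {

      // Hypothetical exception shape: the invalid field plus valid options.
      static class InvalidParameterException extends Exception {
          final String fieldName;
          final List<String> options;
          InvalidParameterException(String fieldName, List<String> options) {
              this.fieldName = fieldName;
              this.options = options;
          }
      }

      // Stand-in for the webUI popup; here we just take the first option.
      static String promptUser(List<String> options) { return options.get(0); }

      // Stand-in for DataPolicyMapper.add: rejects the authorship once.
      static boolean corrected = false;
      static void add(String type, Map<String, String> record)
              throws InvalidParameterException {
          if (!corrected) {
              corrected = true;
              throw new InvalidParameterException("scientificNameAuthorship",
                      List.of("L.", "(L.) Kuntze"));
          }
      }

      public static void main(String[] args) {
          Map<String, String> record = new HashMap<>();
          record.put("scientificName", "Quercus alba");
          boolean done = false;
          while (!done) {
              try {
                  add("dwc_taxon", record);   // the same request may repeat
                  done = true;
              } catch (InvalidParameterException e) {
                  record.put(e.fieldName, promptUser(e.options));
              }
          }
      }
  }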


FP InvalidParameter.png

Missing Information Protocol

Sometimes the database operation can't be finished due to missing information. This doesn't mean a field defined in the data policy was not provided; it happens when specific business logic is associated with that field. E.g., the higherTaxon referenced by the taxon record must exist, and to create the record representing the higherTaxon, more information is needed from the user. We use the MissingInformationException to handle this case.


The basic protocol is:

  • The adapter detects such a problem and constructs the MissingInformation exception with
    • the type name of the record for which the exception is thrown, e.g. Taxon.
    • an id to uniquely identify this round of data re-collection, e.g. Genus.
    • the list of missing fields, including each field's name and whether it is required, etc.
  • The mapper throws the MissingInformation exception onward.
  • The webUI catches and parses the MissingInformation exception and pops up a window for the user to provide the missing information. Once the user submits the data, the same request, e.g. add or update, is submitted to the mapper again automatically with the original data plus the supplement data. The supplement data is prefixed with the id provided in the exception.
  • The adapter recognizes the supplement data by the id and knows how to use it correctly to complete the operation.


In this solution, the adapter developer needs to keep in mind that when the MissingInformationException is used to collect data, the same request might be issued by the mapper several times before the final success. They need to guarantee that multiple invocations have no side effects.
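
A small sketch of how the resubmitted request might be assembled; the id-prefixing scheme shown is an assumption for illustration:

  import java.util.HashMap;
  import java.util.Map;

  public class SupplementDataSketch {
      // Merge the supplement data into the original request, keyed by the
      // id carried in the exception (e.g. "Genus"), so the adapter can
      // recognize which fields belong to the re-collected record.
      static Map<String, String> merge(Map<String, String> original,
                                       String exceptionId,
                                       Map<String, String> supplement) {
          Map<String, String> resubmit = new HashMap<>(original);
          supplement.forEach((field, value) ->
                  resubmit.put(exceptionId + "." + field, value));
          return resubmit;   // submitted to the mapper again automatically
      }
  }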


FP MissingInformationProtocol.png

Record Choose Protocol

In the data policy, the identification record has a foreign dependency on the botanist record. When inserting a new identification, the only information obtained from the annotation for the botanist record is the botanist's name in the identifiedBy field. But sometimes more than one botanist record has the same name. In this case, the mapper/driver cannot know which botanist record should be used to create the identification record.

The UniqueIdentificationException is used to solve this problem. Basically the protocol is:

  • When adding or updating a record, the mapper/adapter identifies such a problem.
  • The mapper/adapter throws a UniqueIdentificationException which contains all the matched records, including their identifiers and the information from the other fields that helps the user understand each record and make a choice.
  • The webUI catches and interprets the exception, and pops up a window for the user to make a choice.
  • Once the user makes a choice, the chosen identifier is passed to the mapper together with the other fields to do the save/update again.

Get Fill-in

For an annotation like a new determination, to prepare a record for the user to insert, we populate an empty record with the data from the annotation and also fill in the absent fields with data existing in the data source where possible. The basic idea is to find a matched record based on the data present in the annotation and, if one exists, populate the absent fields with information from it. For example, for an insert-determination annotation with a proposed scientific name, we try to find the matched taxon record; if it exists, we use that record to populate the absent fields, like family, higher taxon, etc.
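
A minimal sketch of the fill-in step; method and variable names are illustrative:

  import java.util.HashMap;
  import java.util.Map;

  public class FillInSketch {
      // Start from the matched record found in the local data source and
      // overwrite with the values present in the annotation, so annotation
      // data wins and the remaining fields are pre-filled.
      static Map<String, String> fillIn(Map<String, String> fromAnnotation,
                                        Map<String, String> matchedTaxon) {
          Map<String, String> record = new HashMap<>(matchedTaxon);
          record.putAll(fromAnnotation);
          return record;  // shown to the user in the "new" tab
      }
  }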


FP getFillIn.png

Data Format Conversion

The mapper can convert data formats automatically through converters, e.g. converting a date from yyyy-MM-dd to yyyy/MM/dd.


There are two levels of converters that can be registered in the data policy:

  • Application level converter: The converter is associated with a specific field name, like collectionCode, so that it applies to every collectionCode field no matter which record it belongs to.
  • Record level converter: The converter is associated with a field of a particular record. If both record and application level converters are registered, the record level converter takes effect.


There are several system default converters, related to the field types:

  • IntConverter: converts between integer and String.
  • FloatConverter: converts between float and String.
  • BooleanConverter: converts between boolean and String.
  • DateTimeConverter: converts the date-time format. The target format needs to be configured following the conventions of java.text.SimpleDateFormat. The default format is "yyyy-MM-dd'T'HH:mm:ssz".
  • LatLogconverter: converts the latitude or longitude format according to the configuration. The format needs to be configured; the supported formats are DDD° MM' SS.S", DDD° MM.MMM', and DDD.DDDDD°. The default is DDD.DDDDD°.
  • ListConverter: replaces the delimiter. The delimiter character needs to be configured. The default is ";".


All the system default converters implement the Converter interface, and adapter developers can also implement their own by implementing the Converter interface (a sketch follows the list):

  • getAsObject: converts the data from a String to an object. This is used to convert data from the annotation processor into an object that the adapter can understand and operate on correctly.
  • getAsString: converts the data from an object to a String. This is used to convert data read from the local data source into the format used by the annotation processor, so that the annotation processor can understand and present it correctly.
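
A sketch of the interface together with a DateTimeConverter-style implementation based on java.text.SimpleDateFormat; the real FP interface may carry additional context parameters:

  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;

  interface Converter {
      Object getAsObject(String value);   // annotation processor -> adapter
      String getAsString(Object value);   // local data source -> annotation processor
  }

  class SimpleDateTimeConverter implements Converter {
      private final SimpleDateFormat format;

      SimpleDateTimeConverter(String pattern) {
          this.format = new SimpleDateFormat(pattern); // e.g. "yyyy-MM-dd'T'HH:mm:ssz"
      }

      public Object getAsObject(String value) {
          try {
              return format.parse(value);
          } catch (ParseException e) {
              // the real mapper would surface this as a ConverterException
              throw new IllegalArgumentException("Unparseable date: " + value, e);
          }
      }

      public String getAsString(Object value) {
          return format.format((Date) value);
      }
  }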


FP converterClasses.png


Each time the mapper talks to the adapter, it converts the data into the format supported by the adapter; and each time it returns data fetched from the data source to the annotation processor, it converts the data into the format supported by the annotation processor.


FP converter.png

Discussion

  • To decouple the webUI/mapper from the business logic of the local data source, exceptions are used to build up a general protocol between the webUI/mapper and the adapter and to handle the business logic.
    • InvalidParameter: Handles the case where a taxon with the same scientific name but different authorship than an existing record is inserted. (Please see more here.)
    • MissingInformationException: Handles the case of collecting the hierarchical taxon interactively. (Please see more here.)
    • UniqueIdentificationException: Handles the case where the provided information is not enough to uniquely identify the record, so that the insert or update can't be finished. (Please see more here.)
  • DataPolicy (please see more here).
    • in DwC terms + non-DwC terms
    • converter (please see more here).
      • system default converters associated with the types: int, float, boolean, DateTime, List, LatLog
      • customized converters
  • Identifiers issue: how to identify an object with the DwC triple. (Please see more here.)
    • getByIdentifiers - findByIdentifiers
    • Need to tell the adapter developer which identifiers will be passed into the methods; this is actually determined by the annotation rules.
  • Generate fill-in
    • find the matched records by value and populate the blank fields in the "new" tab
  • APIs of the Mapper and Adapter. (Please see more here.)
    • All of the features that relate to matching a record by the values in its fields should be handled by querying an index rather than querying the data policy mapper or the driver. (Maureen)