Embedding Kepler



Embedding Kepler as a capability within a FilteredPush network node

UNDER CONSTRUCTION

Needs paste from remainder of googledoc.

Context

A Kepler engine, configured with workflows for ApplePie, is installed on a FilteredPush node and made available to the node by wrapping it in an EJB. The node then needs an instance of AbstractJob to invoke the bean, and an FP message to invoke the job (see the Triage Class Diagram). The definition of the EJB's API should determine what the job needs to do and which parameters need to be passed in the message. The Kepler engine provides an Analysis capability for the network (see Media:Elaborated_component_model_of_a_three_node_network.png).

Need to work out the APIs - inputs, outputs.

Requirements

Requirement26 Cluster potential duplicate botanical specimen records by fuzzy matching

Requirement27 Injected workflows must be able to create annotations.

Also potentially relevant: Requirement14

Like: FP Production Requirements 28 Find clusters in data known to the network on arbitrary criteria.

Like (research goal): FP Production Requirements 30 (Annotations): It must be possible to annotate workflows.

Not yet fully defined FP Production Requirements:

Requirement: A means for administration of automatically triggered workflows present in the network.

Requirement: Workflows injected into the network must be discoverable by at least their creator, and probably all users, no clear requirement for access control has yet been defined for workflows.

Specific target cases

Core quality control use cases from the grant proposal, also present in Lei's SPNHC demo workflow: QC georeferences, QC identifications, and QC dates collected (these fit the research goals of TCNs concerning the occurrence of species in space and time).

WorkFlow Example1. Are the georeferences “sane”? Quality-control georeferences WF:

{ records } → WF(rules) → { annotations(explanations) } 

(a) Take record, exec GeoLocate on it, if result is “close” → OK; else → flag as NOT OK

⇒ create annotation which “explains” the problem

(b) Use GIS functions to assess georef data quality.
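Step (a) can be sketched as follows. This is a minimal illustration, assuming the GeoLocate result is already available as a latitude/longitude pair; the function names, the 10 km threshold for "close", and the layout of the returned dictionary are hypothetical and not part of FilteredPush or GeoLocate.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_georeference(record, reference, max_km=10.0):
    """Compare a record's coordinates with a reference point (e.g. a GeoLocate result).

    `reference` is a (latitude, longitude) pair. The returned dict plays the
    role of the annotation that "explains" the outcome.
    """
    distance = haversine_km(record["decimalLatitude"], record["decimalLongitude"],
                            reference[0], reference[1])
    if distance <= max_km:
        return {"status": "OK", "explanation": f"within {max_km} km of reference"}
    return {"status": "NOT OK",
            "explanation": f"{distance:.1f} km from reference, limit is {max_km} km"}
```

What counts as "close" would in practice depend on the record's coordinateUncertaintyInMeters rather than on a fixed threshold.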

How to “plug-in” new workflows?

  • Various aspects:
    • use a special FPush Kepler module together with some “FP conventions” (!?) to ensure that “plugging in” of the new wf is simple.
    • how to (semi-)automate this? Where does one put the wf, so it is found by FP? (web services again?)
    • sending the wf over the wire, as part of a package; maybe put Kepler on a dedicated VM; use staging of inputs and outputs

WF Ex2. Date collected: do the dates make sense?
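A minimal sketch of such a date check, assuming records carry the Darwin Core style fields year, month, day, eventDate and startDayOfYear seen elsewhere on this page. The function name and the choice of checks are illustrative assumptions, and eventDate is only compared when it is a single ISO date.

```python
from datetime import date


def check_collecting_date(record, today=None):
    """Return a list of problems found in the date fields of a specimen record."""
    today = today or date.today()
    problems = []
    try:
        collected = date(int(record["year"]), int(record["month"]), int(record["day"]))
    except (KeyError, TypeError, ValueError) as exc:
        return [f"year/month/day do not form a valid date: {exc}"]
    if collected > today:
        problems.append("date collected is in the future")
    event_date = record.get("eventDate")
    if event_date and event_date != collected.isoformat():
        problems.append(f"eventDate {event_date} disagrees with year/month/day")
    start_day = record.get("startDayOfYear")
    if start_day not in (None, "") and int(start_day) != collected.timetuple().tm_yday:
        problems.append(f"startDayOfYear {start_day} disagrees with year/month/day")
    return problems
```

An empty list means the dates are internally consistent; each entry in a non-empty list could become the explanation of an annotation.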

Relevant FPush documents include Quality_Control_New_Record and (Prototype: Analysis_module_design_documents).

See Also

The Canadensys data validation/data cleaning library https://github.com/Canadensys/narwhal-processor

Open issues

Client and user interface

The user can use a client to invoke workflow and specify the data that will be fed into the workflow.

Result information visualization

At the end of the workflow, the MongoDBCollectionWriter and MongoDBSummaryWriter write the results into MongoDB, and the client then invokes the CheckForMessage interface to construct a spreadsheet for visualizing provenance information. The second part is not implemented yet, and the requirements for how to render the spreadsheet are not clear.

Discover a pre-defined workflow

The user can select a pre-defined workflow in the UI. The open issue is how the network can guarantee a response to the user's request within a relatively short time, i.e. when the user requests a list of existing workflows, they should not have to wait a significant amount of time to see the list.

Specify new workflow

The user can specify a new workflow or assemble a workflow from the existing actors and push it to the repository so it can be used afterward.

Performance and scale

If the input collection data is huge, storing all the data at one point and processing it there is not feasible. One solution could be data streaming, in which case we need to make sure that every step of the workflow is capable of streaming. Other options include Map-Reduce and parallel computation.
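The streaming idea can be illustrated with a small sketch in which records are pulled from an iterator in fixed-size batches, so that the whole collection is never held in memory at once. The function names and the batch size are hypothetical, and `step` stands in for any workflow step that maps a batch of records to an iterable of results.

```python
from itertools import islice


def batches(records, size):
    """Yield lists of at most `size` records without materialising the whole input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_streaming(records, step, size=1000):
    """Apply `step` to each batch and yield its results one record at a time."""
    for chunk in batches(records, size):
        for result in step(chunk):
            yield result
```

Because both functions are generators, a downstream step can start consuming results before the upstream reader has finished.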

List of existing actors

AnnotationInserter

Description: This actor reads in all the data of type SpecimenAnnotationType, which conveys assertions about specimen data quality. The actor then inserts these annotations into the FPush network to notify the original data distributor or other interested parties.

Used in workflow: SpecimenCuration-FP

Package: Kuration

CollectingEventOutlierFinder

Description: This actor finds collecting event outliers in a group of input specimen records.

Input field: RecordedBy, YearCollected, MonthCollected, DayCollected, (DecimalLatitude, DecimalLongitude) (not used)

Used in workflow: SpecimenCuration

Package: Kuration

FloweringTimeValidator

Description: This actor validates the flowering time of the input specimen record. The actor looks up the flowering time for the record's scientific name in a proof-of-concept FNA service and compares it with the value in the record.

Input field: ScientificName, ReproductiveCondition

Output field: ReproductiveCondition

Used in workflow: SpecimenCuration

Note: FNA service is currently proof of concept only - limited data and applicability.

Package: Kuration

GEORefValidator

Description: This actor checks the validity of the geo-reference information of the input specimen record. The actor queries for the DecimalLatitude, DecimalLongitude of input record on the GeoLocate service and compares to the existing ones in the record.

Input field: Country, StateProvince, County, Locality, DecimalLatitude, DecimalLongitude

Output field: DecimalLatitude, DecimalLongitude

Used in workflow: SpecimenCuration

Package: Kuration

ScientificNameValidator

Description: This actor is used to validate the scientific name of the input specimen record. The actor searches the ScientificName, ScientificNameAuthorship of input record on authoritative nomenclatural or taxonomic services (IPNI, IndexFungorum, or WoRMS) or GBIF's beta checklist bank services (needs update to current service), with a failover to lexical groups on the GNI services and compares the results to the existing ones in the record.

Input field: ScientificName, ScientificNameAuthorship, IdentificationTaxon(optional)

Output field: Same as input, (also adds GUID from nomenclator).

Used in workflow: SpecimenCuration, TaxonNameCuration

Package: Kuration

Clustering

Description: The Clustering actor clusters the list of input RecordTokens by applying the specified clustering function to each specified field in turn. In the output data stream, all the clusters are placed inside one collection, and each cluster is represented by a collection containing all the data belonging to that cluster.

Used in workflow: SpecimenRecordMerge, EbirdDuplicationDetetion

Package: Kuration
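The "one field at a time" behaviour described for the Clustering actor can be sketched as a successive refinement of groups. This is an illustration only: it groups by a normalised key rather than by true fuzzy matching, and the function names and the normalisation rule are hypothetical, not the actor's actual implementation.

```python
from collections import defaultdict


def normalize_name(value):
    """Hypothetical key function: ignore case, spaces and punctuation."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def cluster(records, steps):
    """Refine clusters by applying (field, key_function) steps in turn.

    Returns a list of clusters, each cluster being a list of records.
    """
    clusters = [list(records)]
    for field, key_function in steps:
        refined = []
        for group in clusters:
            buckets = defaultdict(list)
            for record in group:
                buckets[key_function(record.get(field, ""))].append(record)
            refined.extend(buckets.values())
        clusters = refined
    return clusters
```

For example, `cluster(records, [("recordedBy", normalize_name), ("year", str)])` first groups records by collector and then splits each group by year.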

ConditionTester

Description: This actor tests whether the specified condition is satisfied. For example, it can be used to test whether a data value falls within the correct range.

Used in workflow: SpecimenRecordMerge

Package: Kuration

DataFuser

Description: The actor merges a list of input RecordTokens into one RecordToken using the specified fuse function, and outputs the result.

Used in workflow: SpecimenRecordMerge

Package: Kuration
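One possible fuse function, shown purely as an assumption about what "fusing" a cluster could mean: keep, for each field, the most frequent non-empty value among the records. The function name and the majority rule are hypothetical.

```python
from collections import Counter


def fuse_most_common(records):
    """Hypothetical fuse function: keep the most frequent non-empty value of each field."""
    fused = {}
    fields = {field for record in records for field in record}
    for field in sorted(fields):
        values = [record[field] for record in records
                  if record.get(field) not in (None, "")]
        if values:
            fused[field] = Counter(values).most_common(1)[0][0]
    return fused
```

Fields that are empty in every record are simply left out of the fused record.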

AuthSubAuthorizor

Description: Based on the AuthSub authorization mechanism, the actor obtains authorization in the form of an AuthSub session token to access specified Google services such as Gmail, Spreadsheets and Google Docs. When the workflow ends, the AuthSub token is revoked for security.

Used in workflow: SpecimenCuration

Package: Koogle

HumanCuration

Description: A Composite actor is an aggregation of actors. It may have a local director that is responsible for executing the contained actors.

Used in workflow: SpecimenCuration

Package: Koogle

SpreadsheetShare

Description: The SpreadsheetShare actor shares a spreadsheet with other Gmail account user(s).

Used in workflow: SpecimenCuration

Package: Koogle

SpreadsheetImporter

Description: The actor imports an array of RecordToken into the specified spreadsheet. It takes incoming data and writes them into a specified worksheet in a Spreadsheet.

Used in workflow: SpecimenCuration

Package: Koogle

CurationSummarySynthesizor

Description: The actor reads record and provenance information from the COMAD stream, constructs a series of reports and writes them into a Google spreadsheet. The reports include one summary of the whole record and a separate report for each actor.

Used in workflow: SpecimenCuration

Package: Comad

CSVCollectionReader

Description: The CSVCollectionReader actor imports external data from a file in CSV format into the COMAD workflow.

Used in workflow: SpecimenCuration

Package: Comad

MongoDBCollectionReader

Description: This actor reads collection data from MongoDB and converts it into a COMAD stream.

Package: Comad

MongoDBCollectionWriter

Description: This actor writes cleaned collection data, exceptions and provenance information to MongoDB.

Package: Kuration

MongoDBSummaryWriter

Description: This actor writes provenance information in spreadsheet style to MongoDB.

Package: Kuration

Advanced GeoRefValidator

Georeference_Validator

FPush Annotation Inserter

Description

The FPAnnotationInserter actor reads data from the COMAD stream, constructs an annotation and injects it into the FPush network.

Annotation type choice

GeoRefValidator

Status | CORRECT | CURATED | UNABLE_TO_CURATE | UNABLE_DETERMINE_VALIDITY
Annotation type | None | Update_Location | Solve with more data | None

SciNameValidator

Status | CORRECT | CURATED | UNABLE_TO_CURATE | UNABLE_DETERMINE_VALIDITY
Annotation type | None | Update_Occurrence | Solve with more data | None

FlowerTimeValidator

Status | CORRECT | CURATED | UNABLE_TO_CURATE | UNABLE_DETERMINE_VALIDITY
Annotation type | None | ? | Solve with more data | None
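The annotation type choice can be expressed as a lookup table. The sketch below assumes the status and annotation type names exactly as listed for the three validators (reading "Update_ Occurrence" as Update_Occurrence); the dictionary and function names are hypothetical. The CURATED case for FlowerTimeValidator is still undecided ("?") and is therefore left out.

```python
# Annotation type per validator and curation status.
# A missing entry means that no annotation type is defined for that case.
ANNOTATION_TYPES = {
    "GeoRefValidator": {"CURATED": "Update_Location",
                        "UNABLE_TO_CURATE": "Solve with more data"},
    "SciNameValidator": {"CURATED": "Update_Occurrence",
                         "UNABLE_TO_CURATE": "Solve with more data"},
    # CURATED is an open question for this validator, so it is not listed.
    "FlowerTimeValidator": {"UNABLE_TO_CURATE": "Solve with more data"},
}


def annotation_type(validator, status):
    """Return the annotation type for a validator result, or None if there is none."""
    return ANNOTATION_TYPES.get(validator, {}).get(status)
```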

mapping

COMAD token schema:

"element" : "data",
"annotations"  :{
   "GEOCurationComment" : {
   "key" : "GEOCurationComment",
   "element" 
   "type" 
   "id" 
   "target" 
   "class" 
   "GEOCurationComment" : {
       "Source"
       "Details" 
       "Status" 
       }
   }
},
"content" : {
   "id" 
   "basisOfRecord" 
   "catalogNumber" 
   "collectionCode"
   "coordinateUncertaintyInMeters" 
   "country" 
   "county" 
   "day" 
   "dcidentifier" 
   "decimalLatitude" 
   "decimalLongitude" 
   "eventDate" 
   "family" 
   "georeferenceSources" 
   "georeferenceVerificationStatus"
   "institutionCode" 
   "locality" 
   "modified" 
   "month" 
   "ownerInstitutionCode"
   "recordedBy" 
   "scientificName" 
   "scientificNameAuthorship" 
   "startDayOfYear" 
   "stateProvince" 
   "year" 
}

Example

"element" : "data",
"annotations" : {
   "GEOCurationComment" : {
   "key" : "GEOCurationComment",
   "element" : "Annotation",
   "type" : "CurationCommentType",
   "id" : "279",
   "target" : "278",
   "class" : "org.kepler.actor.SpecimenQC.type.CurationCommentType",
   "GEOCurationComment" : {
       "Source" : "GEOLocate",
       "Details" : "Update the coordinates by using cached data or GEOLocateservice since the original coordinates are not consistent to the specified localities.",
       "Status" : "Curated"
       }
   }
},
"content" : {
   "id" : "512d56f6e4b064678b3da001",
   "basisOfRecord" : "PreservedSpecimen",
   "catalogNumber" : "NAU4FA0010142",
   "collectionCode" : "NAUF",
   "coordinateUncertaintyInMeters" : 1506,
   "country" : "USA",
   "county" : "Cochise",
   "day" : 8,
   "dcidentifier" : "http://fp1.acis.ufl.edu/symbscan/oai/occurrences/oai2.php?verb=GetRecord&metadataPrefix=dwc&identifier=SCAN.occurrence.348867",
   "decimalLatitude" : 31.896516,
   "decimalLongitude" : -109.163673,
   "eventDate" : "1964-09-08",
   "family" : "Curculionidae",
   "georeferenceSources" : "GeoLocate",
   "georeferenceVerificationStatus" : "reviewed - high confidence",
   "institutionCode" : "NAU",
   "locality" : "Sunny Flat Cp. Cave Creek Canyon Chiricahua Mts.",
   "modified" : "2012-08-13 17:10:42",
   "month" : 9,
   "ownerInstitutionCode" : "NAU",
   "recordedBy" : "C.D. Johnson",
   "scientificName" : "Ericydeus lautus",
   "scientificNameAuthorship" : "(LeConte, 1856)",
   "startDayOfYear" : 252,
   "stateProvince" : "Arizona",
   "year" : 1964
}
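A short sketch of how a consumer might read such a token. It assumes the token is available as a JSON object (the example above wrapped in braces) and that each annotation nests its details under a key equal to its own name, as GEOCurationComment does; the function name and the trimmed sample are illustrative only.

```python
import json


def curation_comments(token_text):
    """Return (record id, [(comment key, source, status), ...]) for one COMAD token."""
    token = json.loads(token_text)
    comments = []
    for key, annotation in token.get("annotations", {}).items():
        detail = annotation.get(key, {})
        comments.append((key, detail.get("Source"), detail.get("Status")))
    return token["content"]["id"], comments


# Trimmed version of the example token, wrapped in braces so that it parses as JSON.
SAMPLE = """{
  "element": "data",
  "annotations": {
    "GEOCurationComment": {
      "key": "GEOCurationComment",
      "GEOCurationComment": {"Source": "GEOLocate", "Status": "Curated"}
    }
  },
  "content": {"id": "512d56f6e4b064678b3da001", "catalogNumber": "NAU4FA0010142"}
}"""

record_id, comments = curation_comments(SAMPLE)
```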

annotation

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dwcFP="http://filteredpush.org/dwcFP/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:oa="http://www.w3.org/ns/openannotation/core/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:cnt="http://www.w3.org/2011/content#"
    xmlns:oad="http://filteredpush.org/oad/" > 
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Annotation_0">
        <oad:hasExpectation rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Expectation_0"/>
        <oa:annotatedBy rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Annotator_0"/>
        <oad:hasEvidence rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Evidence_0"/>
        <oa:hasBody rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Body_0"/>
        <oa:hasTarget rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Target_0"/>
        <oa:serializedAt>2013-05-09T16:12:43PDT</oa:serializedAt>
        <oa:annotatedAt>2013-05-09T16:12:43PDT</oa:annotatedAt>
        <rdf:type rdf:resource="http://www.w3.org/ns/openannotation/core/Annotation"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Evidence_0">
        <cnt:chars xml:lang="en">The original coordinates: (-114.485613, 114.485613) doesn't lie on land surface</cnt:chars>
        <rdf:type rdf:resource="http://filteredpush.org/oad/Evidence"/>
        <rdf:type rdf:resource="http://www.w3.org/2011/content#ContentAsText"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Annotator_0">
        <foaf:name>Kepler Workflow System</foaf:name>
        <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Target_0">
        <oa:hasSelector rdf:resource="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Selector_0"/>
        <rdf:type rdf:resource="http://www.w3.org/ns/openannotation/core/SpecificResource"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Expectation_0">
        <rdf:type rdf:resource="http://filteredpush.org/oad/Expectation_Solve_With_More_Data"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://etaxonomy.org/ontologies/oa413d4757-6e42-4c57-b169-b77cb5daeb99/Selector_0">
        <dwcFP:catalogNumber>NAU4F A0010139</dwcFP:catalogNumber>
        <dwcFP:institutionCode>NAU</dwcFP:institutionCode>
        <dwcFP:collectionCode>NAUF</dwcFP:collectionCode>
        <rdf:type rdf:resource="http://www.w3.org/ns/openannotation/core/Selector"/>
    </rdf:Description>
</rdf:RDF>
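To find out which specimen an annotation like this targets, a receiver could read the dwcFP fields of the selector. The sketch below uses only the Python standard library and treats the document as plain XML, which works for the flat rdf:Description serialization shown here but is not a general RDF parser; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWCFP = "http://filteredpush.org/dwcFP/"
OA_SELECTOR = "http://www.w3.org/ns/openannotation/core/Selector"


def selector_fields(rdf_xml):
    """Return the dwcFP fields of every oa:Selector description in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    selectors = []
    for description in root.findall(f"{{{RDF}}}Description"):
        types = {t.get(f"{{{RDF}}}resource")
                 for t in description.findall(f"{{{RDF}}}type")}
        if OA_SELECTOR in types:
            # Strip the namespace from each dwcFP child tag to get the field name.
            selectors.append({child.tag.split("}", 1)[1]: child.text
                              for child in description
                              if child.tag.startswith(f"{{{DWCFP}}}")})
    return selectors
```

Applied to the annotation above, this would yield one selector carrying catalogNumber, institutionCode and collectionCode.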

Curation Summary Writer

Spreadsheet requirements

The spreadsheet view of workflow result and provenance consists of two levels of spreadsheet:


First level

The first level spreadsheet gives an overall view of all the results of one workflow run.

One row of the spreadsheet corresponds to one record. Each row contains two parts:

  • The first part is a copy of the cleaned record (after the workflow run). The background of a checked data field is highlighted if that field was curated or has an issue that was not solved.
  • The second part is a set of markers indicating the result type of each actor run, plus an overall marker for the workflow run on that record (see below for the symbols and the populating rule).


Second level

The second level spreadsheet gives a detailed view of each record, one spreadsheet for each record.

One row of each spreadsheet corresponds to one actor run on that record. Each row contains two parts:

  • The first part is a set of markers indicating the result of the actor run, including the actor name, the result type (same as in the first level) and an indicator of whether the actor was run or not.
  • The second part shows the provenance information of each data field checked by that actor, according to the following rules:

Result type | Style | Background color | Note
"CORRECT" | Only show a tick mark | green |
"CURATED" or "FILLED_IN" | "WAS: " + old value + "CHANGE TO: " + corrected value | yellow |
"UNABLE_CURATE" | "UNABLE_TO_CURATE: " + old value | red |
"UNABLE_DETERMINE_VALIDITY" | "UNABLE_DETERMINE_VALIDITY_OF: " + old value | grey | this may not be shown if the reason is the lack of that data field


Result types and symbols

Result type | Symbol | Background color | Description
"CORRECT" | "tick" | green | the workflow found that the data field(s) are correct when checked against certain standards
"CURATED" or "FILLED_IN" | "delta" | yellow | the workflow found an issue in the data field(s) and fixed it
"UNABLE_CURATE" | "cross" | red | the workflow found an issue in the data field(s) but could not fix it
"UNABLE_DETERMINE_VALIDITY" | "question" | grey | the workflow could not determine whether the data field(s) contain an issue


Populating rule for the overall marker

if (any actor has marker "CROSS") {
    the overall marker is "CROSS";
} else if (any actor has marker "QUESTION") {
    the overall marker is "QUESTION";
} else if (any actor has marker "DELTA") {
    the overall marker is "DELTA";
} else {
    the overall marker is "TICK";
}
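The style rules, symbols and populating rule above can be sketched in a few lines. The result types, symbols and colors are taken from this page; the function and variable names are hypothetical, and a space is inserted before "CHANGE TO:" for readability, which the style rule does not spell out.

```python
# (symbol, background color) per result type.
SYMBOLS = {
    "CORRECT": ("TICK", "green"),
    "CURATED": ("DELTA", "yellow"),
    "FILLED_IN": ("DELTA", "yellow"),
    "UNABLE_CURATE": ("CROSS", "red"),
    "UNABLE_DETERMINE_VALIDITY": ("QUESTION", "grey"),
}


def cell_text(result_type, old_value, new_value=None):
    """Text of a second-level spreadsheet cell for one checked data field."""
    if result_type == "CORRECT":
        return "\u2713"  # tick mark
    if result_type in ("CURATED", "FILLED_IN"):
        return f"WAS: {old_value} CHANGE TO: {new_value}"
    if result_type == "UNABLE_CURATE":
        return f"UNABLE_TO_CURATE: {old_value}"
    if result_type == "UNABLE_DETERMINE_VALIDITY":
        return f"UNABLE_DETERMINE_VALIDITY_OF: {old_value}"
    raise ValueError(f"unknown result type: {result_type!r}")


def overall_marker(actor_markers):
    """Worst marker wins: CROSS, then QUESTION, then DELTA, otherwise TICK."""
    for marker in ("CROSS", "QUESTION", "DELTA"):
        if marker in actor_markers:
            return marker
    return "TICK"


def record_marker(actor_results):
    """Overall marker for one record, given the result types of its actor runs."""
    return overall_marker([SYMBOLS[result][0] for result in actor_results])
```

For a record whose actors returned CORRECT, UNABLE_CURATE and CURATED, `record_marker` gives CROSS, because a single unresolved issue dominates the other results.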


Example

https://docs.google.com/spreadsheet/ccc?key=0AnsQe6zgKhomdGRMWGZ3SGxiaGZJaVNodzBPblZRNkE&usp=sharing

MongoDB Collections structure (non streaming mode)

A set of markers is used in the MongoDB collections. Each marker is a two-part string:


The first part indicates the result type; there is a one-to-one mapping to the result types defined in Kepler

Result type in Kepler | Result type in MongoDB | Description
"CORRECT" | CORRECT_... | workflow found certain data field(s) is/are correct by checking with certain standards.
"CURATED" | CURATED_... | workflow found certain data field(s) contain(s) issue and fixed that issue
"UNABLE_CURATE" | UNABLE_CORRECT_... | workflow found certain data field(s) contain(s) issue but can't fix that issue
"UNABLE_DETERMINE_VALIDITY" | UNABLE_CURATE_... | workflow can't determine whether certain data field(s) contain(s) issue or not


The second part indicates the level of the data that the marker applies to

Second level string | Description
..._FIELD | data field that has been checked by a validation actor
..._ACTOR | the result after each actor run
..._RECORD | the overall marker for the record (populated from the actor markers)
..._DATASET | the overall marker for the whole data set (not used for now)


Example

UNABLE_TO_CORRECT_ACTOR means the actor ran and found an issue with a certain data field but does not know how to curate that field

CURATED_FIELD means the data field has been checked by some actor and has been corrected by that actor
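Since the level is always the last underscore-separated part, a marker can be split from the right. This is a sketch with a hypothetical function name; it deliberately does not validate the result-type part, because the spelling of that part varies across the examples on this page.

```python
LEVELS = {"FIELD", "ACTOR", "RECORD", "DATASET"}


def split_marker(marker):
    """Split a two-part marker into (result type, level)."""
    result, _, level = marker.rpartition("_")
    if not result or level not in LEVELS:
        raise ValueError(f"not a two-part marker: {marker!r}")
    return result, level
```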


Schema

First level:

{
       "_id" : ObjectId("517712360cf25ded82b4eb7d"),
       "512d56f1e4b064678b3d9ffe" : {
               "Record" : {
                       "id" : "512d56f1e4b064678b3d9ffe",
                       "startDayOfYear" : "108",
                       "eventDate" : "1965-04-18",
                       "scientificName" : "Trigonoscuta yumaensis",
                       "scientificNameAuthorship" : "Pierce",
                       "ownerInstitutionCode" : "NAU",
                       "collectionCode" : "NAUF",
                       "modified" : "2012-08-13 17:10:42",
                       "country" : "United States",
                       "coordinateUncertaintyInMeters" : "27335",
                       "occurrenceRemarks" : "US route 80 is on the East coast; highway 80 appears to not exist anymore? Cannot find out what it turned into.",
                       "decimalLatitude" : "32.725204",
                       "basisOfRecord" : "PreservedSpecimen",
                       "institutionCode" : "NAU",
                       "county" : "Yuma",
                       "catalogNumber" : "NAU4F A0010139",
                       "family" : "Curculionidae",
                       "recordedBy" : "Richard S. Funk",
                       "decimalLongitude" : "-114.485613",
                       "month" : "4",
                       "locality" : "US Hwy 80, 8 mi E of Yuma",
                       "stateProvince" : "Arizona",
                       "year" : "1965",
                       "day" : "18",
                       "dcidentifier" : "http://fp1.acis.ufl.edu/symbscan/oai/occurrences/oai2.php?verb=GetRecord&metadataPrefix=dwc&identifier=SCAN.occurrence.348864"
               },
              "ValidationState" : {
                      "scientificName" : "UNABLE_CURRATE_FIELD",
                      "scientificNameAuthorship" : "UNABLE_CURRATE_FIELD",
                      "decimalLatitude" : "CORRECT_FIELD",
                      "decimalLongitude" : "CORRECT_FIELD"
              },
              "Markers" : {
                      "K: GEORefValidator" : "CORRECT_ACTOR",
                      "K: ScientificNameValidator" : "UNABLE_CURATE_ACTOR",
                      "K: FloweringTimeValidator" : "CURRATED_ACTOR",
                      "Kuration Workflow" : "UNABLE_CURATE_ACTOR"
              },

Second level:

               "ActorDetails" : [ {
                              "id" : "",
                              "Source" : "ScientificNameValidator",
                              "Comment" : "Can't find the scientific name and authorship by searching in IPNI and the lexical group from IPNI in GNI.",
                              "Actor" : "ScientificNameValidator",
                              "Actor Run" : "CORRECT_ACTOR",
                              "Actor Result" : "UNABLE_CURATE_ACTOR",
                              "startDayOfYear" : "",
                              "eventDate" : "",
                              "scientificName" : "UNABLE TO VALIDATE: Trigonoscuta yumaensis",
                              "scientificNameAuthorship" : "UNABLE TO VALIDATE: Pierce",
                              "ownerInstitutionCode" : "",
                              "collectionCode" : "",
                              "modified" : "",
                              "country" : "",
                              "coordinateUncertaintyInMeters" : "",
                              "occurrenceRemarks" : "",
                              "decimalLatitude" : "",
                              "basisOfRecord" : "",
                              "institutionCode" : "",
                              "county" : "",
                              "catalogNumber" : "",
                              "family" : "",
                              "recordedBy" : "",
                              "decimalLongitude" : "",
                              "month" : "",
                              "locality" : "",
                              "stateProvince" : "",
                              "year" : "",
                              "day" : "",
                              "dcidentifier" : "",
                              "ValidationState" : {
                                      "scientificName" : "UNABLE_CURRATE_FIELD",
                                      "scientificNameAuthorship" : "UNABLE_CURRATE_FIELD"
                              }
                      }, 
                      {
                              "id" : "",
                              "Source" : "GEOLocate",
                              "Comment" : "The coordinates is correct by checking with GeoLocate Service.",
                              "Actor" : "GEORefValidator",
                              "Actor Run" : "CORRECT_ACTOR",
                              "Actor Result" : "CORRECT_ACTOR",
                              "startDayOfYear" : "",
                              "eventDate" : "",
                              "scientificName" : "",
                              "scientificNameAuthorship" : "",
                              "ownerInstitutionCode" : "",
                              "collectionCode" : "",
                              "modified" : "",
                              "country" : "",
                              "coordinateUncertaintyInMeters" : "",
                              "occurrenceRemarks" : "",
                              "decimalLatitude" : "CORRECT_ACTOR",
                              "basisOfRecord" : "",
                              "institutionCode" : "",
                              "county" : "",
                              "catalogNumber" : "",
                              "family" : "",
                              "recordedBy" : "",
                              "decimalLongitude" : "CORRECT_ACTOR",
                              "month" : "",
                              "locality" : "",
                              "stateProvince" : "",
                              "year" : "",
                              "day" : "",
                              "dcidentifier" : "",
                              "ValidationState" : {
                                      "decimalLatitude" : "UNABLE_CURRATE_FIELD",
                                      "decimalLongitude" : "UNABLE_CURRATE_FIELD"
                              }
                      },
                      {
                              "id" : "",
                              "Source" : "FloweringTimeValidator",
                              "Comment" : "reproductiveCondition is missing in the incoming SpecimenRecord",
                              "Actor" : "FloweringTimeValidator",
                              "Actor Run" : "CORRECT_ACTOR",
                              "Actor Result" : "CURRATED_ACTOR",
                              "startDayOfYear" : "",
                              "eventDate" : "",
                              "scientificName" : "",
                              "scientificNameAuthorship" : "",
                              "ownerInstitutionCode" : "",
                              "collectionCode" : "",
                              "modified" : "",
                              "country" : "",
                              "coordinateUncertaintyInMeters" : "",
                              "occurrenceRemarks" : "",
                              "decimalLatitude" : "",
                              "basisOfRecord" : "",
                              "institutionCode" : "",
                              "county" : "",
                              "catalogNumber" : "",
                              "family" : "",
                              "recordedBy" : "",
                              "decimalLongitude" : "",
                              "month" : "",
                              "locality" : "",
                              "stateProvince" : "",
                              "year" : "",
                              "day" : "",
                              "dcidentifier" : "",
                              "ValidationState" : {
                              }
                      } ]
      }

MongoDB Collections structure (streaming mode)

{
        "COMAD_id" : "117",
        "DependsOn" : "112",
        "ActorName" : "ScientificNameValidator",
        "CurationStatus" : "UNABLE_TO_CURATE",
        "Record" : {
                 "id" : "512d56f1e4b064678b3d9ffe",
                 "startDayOfYear" : "108",
                 "eventDate" : "1965-04-18",
                 "scientificName" : "Trigonoscuta yumaensis",
                 "scientificNameAuthorship" : "Pierce",
                 "ownerInstitutionCode" : "NAU",
                 "collectionCode" : "NAUF",
                 "modified" : "2012-08-13 17:10:42",
                 "country" : "United States",
                 "coordinateUncertaintyInMeters" : "27335",
                 "occurrenceRemarks" : "US route 80 is on the East coast; highway 80 appears to not exist anymore? Cannot find out what it turned into.",
                 "decimalLatitude" : "32.725204",
                 "basisOfRecord" : "PreservedSpecimen",
                 "institutionCode" : "NAU",
                 "county" : "Yuma",
                 "catalogNumber" : "NAU4F A0010139",
                 "family" : "Curculionidae",
                 "recordedBy" : "Richard S. Funk",
                 "decimalLongitude" : "-114.485613",
                 "month" : "4",
                 "locality" : "US Hwy 80, 8 mi E of Yuma",
                 "stateProvince" : "Arizona",
                 "year" : "1965",
                 "day" : "18",
                 "dcidentifier" : "http://fp1.acis.ufl.edu/symbscan/oai/occurrences/oai2.php?verb=GetRecord&metadataPrefix=dwc&identifier=SCAN.occurrence.348864"
        },
        "Colors" : {
                 "scientificName" : "RED",
                 "scientificNameAuthorship" : "RED"
        },
        "Markers" : {
                 "K: ScientificNameValidator" : "CROSS"
        },
        "ActorDetails" : {
                 "ScientificNameValidator" : {
                         "Source" : "ScientificNameValidator",
                         "Comment" : "Can't find the scientific name and authorship by searching in IPNI and the lexical group from IPNI in GNI.",
                         "Actor Run" : "TICK",
                         "Actor Result" : "CROSS"
                 }
        }
}

Scientific Name Validator


Date Validation
