Quality Control New Record

From Filtered Push Wiki


Use Case: Quality Control New Record

Quality control of data through annotations produced by quality control workflows. Additional quality control is provided by individual users who find and annotate errors.

Business Process

Goal: Data from a newly entered specimen record is checked for quality problems by comparison with similar records, and potential errors are flagged.

Summary: Each newly entered specimen record has all of its data tested against the network to see whether it represents an outlier. If the specimen record appears to be an outlier, then an annotation is generated by the network for that specimen.

Diagram

Usecase qualityControlData Fig140557.png

Actors

Collection Manager (sensu lato): a person who uses a software application to manage specimens and information in a herbarium collection, possibly a curator, a collection manager, a collection assistant, or a data capture person. This person isn't using the FP network directly, but is using other software (e.g. Specify Workbench, Specify) to interact with the FP network (through FP Messages).

The local Collection Database.

FP Network.

Preconditions

Captured specimen or observational data.

Triggers

Either:

  1. Completed entry of a new record. Or
  2. Completion of a day's work. Or
  3. Collection manager's decision to quality control a set of records.

Course of Events

  1. Records for quality review are used to query the FP network.
  2. The FP network analyses the submitted records against all other records.
  3. If elements of the submitted records represent outliers (e.g. year collected is an outlier among all known specimens for the collector), the network generates an annotation and submits it to the collection manager as an FP_Messages#FP_QUALITY_ISSUE_ASSERTION message.
  4. The collection manager reviews and acts upon the annotations. Requirement: Efficient UI for sorting and acting on annotations.
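
The course of events above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (dictionaries with `id`, `collector`, and `year` keys), the `QualityIssueAssertion` class, and the `check_records` function are hypothetical and are not part of the FP message definitions. The outlier test used here is the three-standard-deviation example on year collected.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class QualityIssueAssertion:
    """Hypothetical stand-in for an FP_QUALITY_ISSUE_ASSERTION message."""
    record_id: str
    field: str
    value: int
    reason: str


def check_records(submitted, network_records, threshold=3.0):
    """Steps 1-3: compare submitted records with the network's records.

    Flags a record when its year collected lies more than `threshold`
    standard deviations from the mean year for the same collector.
    """
    annotations = []
    for rec in submitted:
        years = [r["year"] for r in network_records
                 if r["collector"] == rec["collector"]]
        if len(years) < 2:
            continue  # too few comparison records to estimate a spread
        mu, sigma = mean(years), stdev(years)
        if sigma > 0 and abs(rec["year"] - mu) > threshold * sigma:
            annotations.append(QualityIssueAssertion(
                record_id=rec["id"],
                field="year",
                value=rec["year"],
                reason=f"more than {threshold} standard deviations "
                       f"from the collector's mean year ({mu:.0f})",
            ))
    return annotations  # step 4: handed to the collection manager for review


# Made-up comparison data for a hypothetical collector.
network = [{"collector": "J. Doe", "year": y} for y in range(1840, 1881)]
new_records = [
    {"id": "r1", "collector": "J. Doe", "year": 1860},
    {"id": "r2", "collector": "J. Doe", "year": 1986},
]
issues = check_records(new_records, network)
```

In this sketch only record `r2` is flagged. The function returns annotations and never modifies the submitted records, which leaves the decision to the collection manager.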

Alternative Paths

  1. No outliers found. Open question: does this imply subscription and interest in the analysis, and does it need a community message?
  2. QC annotation not acted upon by collection manager.
  3. Automated quality control annotations might be generated by large scale network requests (QC collection dates for all specimens known to the network), thus generating very large floods of records. Such results need to be distinguishable from the results of direct requests to QC particular subsets of records (flag potential errors in the records entered by Henry today). Requirement: Message element(s) allowing clustering of responses. Requirement: Ability to include semantically useful statistical rankings in the results (e.g. these 10 records contain the most significant outliers out of these 10,000 records with potential problems).
  4. The network itself has available services (such as images, OCR, and OCR parsing) that it can use to analyze the image of a specimen flagged by an FP_Messages#FP_QUALITY_ISSUE_ASSERTION message, and it can use the results of that analysis to suggest an FP_Messages#FP_CORRECTION_ASSERTION message.
  5. The network has services for obtaining images and for OCRing and parsing their text. It can use these to OCR images automatically, cross-correlate the parsed data with database records, and generate FP_Messages#FP_QUALITY_ISSUE_ASSERTION messages.
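
The two requirements in alternative path 3 (clustering of responses and statistical ranking) might look something like the following sketch. The `QCResponse` fields `request_id`, `origin`, and `score` are assumptions made for illustration; the FP message definitions may use different elements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QCResponse:
    record_id: str
    request_id: str  # hypothetical element that ties a response to its request
    origin: str      # hypothetical: "direct" or "network_sweep"
    score: float     # hypothetical outlier score, larger means more significant


def cluster_by_request(responses):
    """Group responses so a direct request is not lost in a network-wide flood."""
    clusters = defaultdict(list)
    for response in responses:
        clusters[response.request_id].append(response)
    return dict(clusters)


def top_outliers(responses, n=10):
    """Return the n most significant responses, highest score first."""
    return sorted(responses, key=lambda r: r.score, reverse=True)[:n]


responses = [
    QCResponse("rec-1", "req-today", "direct", 4.2),
    QCResponse("rec-2", "req-sweep", "network_sweep", 3.1),
    QCResponse("rec-3", "req-sweep", "network_sweep", 7.5),
    QCResponse("rec-4", "req-today", "direct", 3.4),
]
clusters = cluster_by_request(responses)
worst_two = top_outliers(clusters["req-sweep"], n=2)
```

Clustering first and ranking within each cluster keeps a small direct request readable even when a sweep over the whole network returns thousands of flagged records.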

Postconditions

Annotations regarding quality issues in records are stored in the network.

Corrections based on those annotations may have been made in local data stores.

Business Rules

Automated quality control annotations are distinct from other annotations in that they involve recognition of patterns by algorithms, rather than new or corrected knowledge entered by humans.

Automated quality control annotations must not automatically change local authoritative database records.
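
One way to honor the second rule in client software is to make acceptance by a person an explicit argument, as in this minimal sketch. The function name, the suggestion layout (`field` and `proposed_value` keys), and the `accepted_by` parameter are all hypothetical.

```python
def apply_if_accepted(local_record, suggestion, accepted_by=None):
    """Return the record to store locally.

    The suggested value is applied only when a reviewer is named in
    `accepted_by`; otherwise an unchanged copy is returned. The input
    record is never modified in place.
    """
    updated = dict(local_record)
    if accepted_by is None:
        return updated  # no human decision yet, so nothing changes
    updated[suggestion["field"]] = suggestion["proposed_value"]
    return updated


record = {"id": "r2", "collector": "J. Doe", "year": 1986}
suggestion = {"field": "year", "proposed_value": 1896}

pending = apply_if_accepted(record, suggestion)
accepted = apply_if_accepted(record, suggestion, accepted_by="collection manager")
```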

Assumptions

The network can identify patterns and outliers in the data.

The network can only flag potential problems, not provide correct answers to those problems (though with image and OCR services, the network might be able to suggest corrections).

Notes

Potential patterns:

  1. Date collected outside of the date of birth/date of death range of the collector.
  2. Date collected in the early or late tail for the collector.
  3. Only known specimen by the collector.
  4. Georeference is in a low-probability tail of the geographic distribution of the taxon.
  5. Georeference is at sea for a terrestrial taxon.
  6. Georeference is on land for a marine taxon.
  7. Determination history contains widely disparate taxa that don't normally occur together in a determination history.
  8. Locality elements appear to be disjunct (e.g. Country: Canada, Primary Division: Alabama).

Note that some of these are reasonably objective (collected before the collector was born), while others are purely statistical (year collected is more than three standard deviations away from the mean year collected for the collector).
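
Two of the more objective patterns can be sketched as simple checks. The function names and the tiny `GAZETTEER` lookup are assumptions for illustration only; a real check would need a complete gazetteer and authoritative collector biographies.

```python
def collected_outside_lifespan(year_collected, birth_year, death_year):
    """True when the collecting year falls outside the collector's lifetime."""
    return year_collected < birth_year or year_collected > death_year


# Deliberately incomplete lookup of primary divisions by country.
GAZETTEER = {
    "Canada": {"Ontario", "Quebec", "British Columbia"},
    "United States": {"Alabama", "Alaska"},
}


def locality_is_disjunct(country, primary_division, gazetteer=GAZETTEER):
    """True when the division is known, but only under a different country.

    Divisions missing from the gazetteer are not flagged, because the
    lookup cannot tell whether they are wrong or simply unlisted.
    """
    owners = {c for c, divisions in gazetteer.items()
              if primary_division in divisions}
    return bool(owners) and country not in owners


flag_a = locality_is_disjunct("Canada", "Alabama")
flag_b = locality_is_disjunct("Canada", "Ontario")
flag_c = collected_outside_lifespan(1790, birth_year=1810, death_year=1888)
```

Here `flag_a` and `flag_c` are true and `flag_b` is false. Flagging only divisions known to belong elsewhere is a conservative choice that avoids false alarms from an incomplete lookup.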

The issues flagged by QC checks most likely cannot be resolved without reference to an external source of data about the specimen. If the year collected doesn't match the collector, the year might be wrong, the collector might be wrong, or the collector might be an unrecognized (or indistinguishable) duplicate of another collector.

Correct answers to problems might be generated in a network that includes images of original specimen data and services that can OCR and parse those data. A problematic year/collector combination might result in the network examining the OCR text for the year and collector to see whether they are verbatim original data or typographic errors.
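
As a rough illustration of that last idea, the sketch below compares a database year with the years found in OCR text from a label. The function name, the return labels, and the heuristic (one changed digit or a rearrangement of the same digits counts as a possible typographic error) are assumptions, not a description of an existing FP service.

```python
import re


def classify_year(db_year, ocr_text):
    """Compare a four-digit database year with years found in OCR text.

    Returns ("verbatim", year) when the database year appears on the label,
    ("possible_typo", label_year) when a label year differs by one digit or
    is a rearrangement of the same digits, and ("unresolved", None) otherwise.
    """
    ocr_years = [int(y) for y in
                 re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", ocr_text)]
    if db_year in ocr_years:
        return ("verbatim", db_year)
    for label_year in ocr_years:
        a, b = str(db_year), str(label_year)
        changed_digits = sum(x != y for x, y in zip(a, b))
        if changed_digits == 1 or sorted(a) == sorted(b):
            return ("possible_typo", label_year)
    return ("unresolved", None)


label_text = "Coll. J. Doe, 12 June 1896"  # made-up OCR output
result_typo = classify_year(1986, label_text)
result_verbatim = classify_year(1896, label_text)
result_unknown = classify_year(1750, "no date on label")
```

A "possible_typo" result could feed a suggested FP_Messages#FP_CORRECTION_ASSERTION, while a "verbatim" result indicates that the questionable value is what the label actually says, so the problem may lie elsewhere (for example with the collector).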