AnalysisPostProcessing

From Filtered Push Wiki

This Page

This page is presently aimed at informing my draft of the to-be-written paper about the PostProcessing spreadsheet-motivated UI in the FP analysis workflows. It will need revision to fit well with the other material in the Analysis wiki pages, or at least those that describe the validators. The latter include Date_Validation, Georeference_Validator, and Scientific_Name_Validator. These pages are, at best, guidance for implementers of workflows. For more detail see Embedding_Kepler#Result_types_and_symbols.

Example

File:MCZ-all-2015-04-06.xls is an Analysis Post Processing Spreadsheet. The first sheet therein is named Description and carries the text:

This spreadsheet represents the output of data quality control software run on a copy of harvested occurrence record data. The software runs a workflow composed of a set of "actors" each of which examines a facet of the data (one for checking scientific names, one for checking georeferences, and one for checking date collected). These actors can assert that problems exist in the data (which in some cases are real problems, and in other cases are not), and can assert suggestions of how to change the data. Changes are only proposed in the spreadsheet, no changes have been made to the original data.

The first sheet in the report gives a summary of analysis. A field highlighted in yellow indicates that the software has proposed a correction, green indicates that the original value was found to be correct, grey indicates that we were unable to determine the validity of a value and red indicates that there was a problem curating the value.

Each actor (Scientific Name Validator, Date Validator, Georeference Validator) has a set of sheets that give more details about why the change was made. The comment field on the details sheets contains a summary of why that actor made a change or was unable to curate a record.

The second sheet is named Analysis Results. Its cell colors, described above, indicate the nature of each outcome. In particular, the sheet contains one summary column per validator describing that validator's output. At this writing, those columns are headed ScientificNameValidator, DateValidator, and GeoRefValidator. The cell values are labels describing the outcome as follows:

Color    Label                        Validator Outcome
Green    CORRECT                      No issues found
Yellow   CURATED                      Suggested correction
Grey     UNABLE_DETERMINE_VALIDITY    Encountered an issue validating the data
Red      UNABLE_CURATE                Found valid data with a problem, but could not propose a correction
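
For implementers who read the summary columns programmatically, the table above can be captured as a small lookup. The following is a minimal Python sketch, not code from the FP workflows: the names Outcome and describe are hypothetical, and only the labels, colors, and outcome descriptions are taken from the table.

```python
from enum import Enum


class Outcome(Enum):
    """Outcome labels with the cell color and meaning listed in the table.

    The class itself is illustrative; it is not part of the FP software.
    """

    CORRECT = ("Green", "No issues found")
    CURATED = ("Yellow", "Suggested correction")
    UNABLE_DETERMINE_VALIDITY = ("Grey", "Encountered an issue validating the data")
    UNABLE_CURATE = (
        "Red",
        "Found valid data with a problem, but could not propose a correction",
    )

    def __init__(self, color, meaning):
        # Each member's tuple value is unpacked into named attributes.
        self.color = color
        self.meaning = meaning


def describe(label):
    """Return 'LABEL (Color): meaning' for a label string read from a summary cell.

    Raises KeyError if the label is not one of the four in the table.
    """
    outcome = Outcome[label.strip()]
    return f"{outcome.name} ({outcome.color}): {outcome.meaning}"


if __name__ == "__main__":
    for member in Outcome:
        print(describe(member.name))
```

An unrecognized label raises KeyError rather than being silently mapped, which seems the safer choice while the label vocabulary is still under discussion.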

The UNABLE_CURATE outcome corresponds to SolveWithMoreData: it indicates an outcome that will probably require a human to investigate additional information associated with the original data source.

The UNABLE_DETERMINE_VALIDITY outcome may indicate that some prerequisite (either in the data or in the validation system, e.g. a remote service being down) for validation was not met.
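
Because each record carries one label per validator, a single record can need several kinds of follow-up at once. The sketch below, in Python, collects every validator column in a row whose label is not CORRECT. It assumes the row has already been read into a dictionary keyed by column heading; the function name issues_for_record and the wording of the follow-up hints are hypothetical, the latter paraphrasing the two paragraphs above (the hint for CURATED is an assumption).

```python
# Column headings as they appear in the Analysis Results sheet at this writing.
VALIDATOR_COLUMNS = ("ScientificNameValidator", "DateValidator", "GeoRefValidator")

# Follow-up hints; wording is illustrative, not taken from the FP software.
FOLLOW_UP = {
    "CURATED": "review the proposed correction (assumed follow-up)",
    "UNABLE_CURATE": "have a human investigate additional information "
                     "from the original data source",
    "UNABLE_DETERMINE_VALIDITY": "check whether a prerequisite for validation "
                                 "was unmet, e.g. a remote service being down",
}


def issues_for_record(row):
    """Return (column, label, hint) for each validator column not labeled CORRECT.

    `row` is assumed to be a dict mapping column headings to cell text.
    Missing or empty cells are skipped.
    """
    issues = []
    for column in VALIDATOR_COLUMNS:
        label = (row.get(column) or "").strip()
        if label and label != "CORRECT":
            hint = FOLLOW_UP.get(label, "unrecognized label")
            issues.append((column, label, hint))
    return issues


if __name__ == "__main__":
    # Labels reported for record SCAN.occurrence.1087255 in the Notes section.
    row = {
        "ScientificNameValidator": "CURATED",
        "DateValidator": "UNABLE_CURATE",
        "GeoRefValidator": "UNABLE_DETERMINE_VALIDITY",
    }
    for column, label, hint in issues_for_record(row):
        print(f"{column}: {label} -> {hint}")
```

Listing all three columns together makes it harder to read a record as having only one problem when it has several.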

About other sheets

TBW


Notes

  • The Description sheet of the above-mentioned file is itself the first sheet, so the text in the description should read 'The second sheet ("Analysis Results") gives a summary...' --Bob Morris (talk) 02:23, 26 April 2015 (CEST) Address this in Ticket 375. Bob will synthesize discussion here or remove this item.
  • In the Analysis Results sheet, row 7 (Record Id = SCAN.occurrence.1087255), col A (Event Date = 1700-01-01) is red, i.e. "UNABLE_CURATE". Consequently, col AA (DateValidator) has a cell containing "UNABLE_CURATE" and is red. However, col Z (ScientificNameValidator) also carries issues: its value is CURATED and so its color is yellow. Similarly, col AB (GeoRefValidator) has the value "UNABLE_DETERMINE_VALIDITY" and so is grey. By what rule was "Event Date" placed first in the Analysis Results sheet? A brief look at the Analysis Results sheet might leave one with the impression that there is only one problem in the given record, whereas there are several. --Bob Morris (talk) 21:27, 26 April 2015 (CEST) Address this in Ticket 377. Bob will synthesize discussion here or remove this item.
  • In each of the validator sheets, if Column Z ("Actor Result") were colored, it would be more consistent with the Analysis Results sheet, which has coloration for each of the columns for which the workflow returns a result, especially cols Z, AA, AB of the Analysis Results sheet. --Bob Morris (talk) 21:27, 26 April 2015 (CEST) Address this in Ticket 376. Bob will synthesize discussion here or remove this item.
  • I'm troubled by the outcome labels, though I can't propose better ones. The first thing is that "CURATED", as the past tense of the verb "to curate", does not in conversational English imply that a change was made. Thus I would expect "curated" to mean simply that some workflow actually was run. I think what's wanted here is something like "proposed value." Similarly, "unable curate" might be "no proposal for correction", along with the implication that the datum is valid but not correct. --Bob Morris (talk) 00:48, 27 April 2015 (CEST) Address this in Ticket 378. Bob will synthesize discussion here or remove this item.
  • In the spreadsheet, no outcomes of the DateValidator seem to be recorded (i.e. colored) in the Month/Day/Year columns even when Event Date has an outcome. This seems different from the behavior of the other validators. --Bob Morris (talk) 00:42, 28 April 2015 (CEST) Address this in Ticket 379. Bob will synthesize discussion here or remove this item.