Find Duplicates

From Filtered Push Wiki

NOTE: This has been superseded by ActuallyFindingDuplicates2. In particular, we would like to give up the idea of pre-query-time duplicate cluster detection, management of duplicate clusters, and consensus record building.

Use Case: Find Duplicates

Business Process

Goal: A herbarium collection manager wishes to avoid having to retype all of the data about a specimen that belongs to a set of duplicates when the same data for another specimen belonging to the same set of duplicates has already been captured at another institution.

Summary: A collection manager capturing data about a specimen begins by entering data elements that are likely to identify members of a set of duplicates (e.g. (in order) collector name, collector number, date collected, current determination). If data has been captured from the same set of duplicates, the FP network presents this data to the collection manager, and the collection manager can accept these data for entry into their local database, saving them the effort of retyping much of the rest of the data.


[Figure: FindDuplicates use case diagram]


Actors

Collection Manager (sensu lato): a person using a software application who is involved in the management of specimens and information in a herbarium collection, possibly a curator, a collection manager, a collection assistant, or a data capture person. This person isn't using the FP network directly, but is using other software (e.g. Specify Workbench, Specify) to interact with the FP network (through FP Messages).

The local Collection Database.

FP Network.


Preconditions

The FP Network is aware of, and can rapidly return, relevant duplicate records.

The collection manager (as a person) has authenticated into the local software they are using for collection management, and this software has authenticated into the FP network.


Trigger

Beginning data capture on a new (herbarium) specimen.

Course of Events

  1. A collection manager begins data entry on a herbarium sheet by entering the collector's name and collector's number (~= field number) from the sheet into the data entry interface for their local herbarium database (e.g. Specify Workbench).
  2. The local database's data entry interface queries its local Filtered Push network node for any sets (of one or more duplicate specimens) matching this collector and collector number (FP_Messages#FP_INVENTORY, FP_Messages#FP_FIND_SETS, FP_Messages#FP_GET_DATA). Matching data are returned very rapidly to the user, more rapidly than they could type in the rest of the record. Requirement: Rapid return of analyzed duplicates from network
  3. A match for the duplicate record is presented to the collection manager, who can accept the data into appropriate fields on the user interface in front of them. Requirement: Highly atomic data in network, along with semantic mapping to local schemas
  4. The match having been recognized, the newly entered specimen is added to the relevant set of duplicates in the network FP_Messages#FP_ADD_SHEET.
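The Course of Events above can be sketched from the local client's side. `FPClient` and its method names are hypothetical stand-ins for the FP messages listed (FP_FIND_SETS, FP_GET_DATA, FP_ADD_SHEET), and the seeded data is purely illustrative:

```python
# Sketch of the Course of Events from the local client's perspective.
# FPClient and its method names are hypothetical stand-ins for the FP
# messages named above; the seeded data is purely illustrative.

class FPClient:
    """Toy in-memory stand-in for a Filtered Push network node."""

    def __init__(self):
        # Duplicate sets keyed by (collector, collector number).
        self.sets = {
            ("A. Gray", "1023"): {"collector": "A. Gray",
                                  "collector_number": "1023",
                                  "date_collected": "1878-06-14",
                                  "determination": "Quercus alba"},
        }
        self.sheets = {}  # sheets registered against each set

    def find_sets(self, collector, collector_number):
        """FP_FIND_SETS: return keys of matching duplicate sets."""
        key = (collector, collector_number)
        return [key] if key in self.sets else []

    def get_data(self, key):
        """FP_GET_DATA: return the captured data for a duplicate set."""
        return dict(self.sets[key])

    def add_sheet(self, key, record):
        """FP_ADD_SHEET: add this sheet to the existing duplicate set."""
        self.sheets.setdefault(key, []).append(record)

def capture_specimen(client, collector, collector_number):
    """Steps 1-4: query on the first two fields; accept matched data."""
    matches = client.find_sets(collector, collector_number)  # step 2
    if not matches:
        return None  # alternative path: start a new duplicate set
    record = client.get_data(matches[0])  # step 3: offer data to the user
    client.add_sheet(matches[0], record)  # step 4: register the new sheet
    return record

client = FPClient()
record = capture_specimen(client, "A. Gray", "1023")
```

In a real client the accepted record would be mapped into the local schema rather than copied field-for-field.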

Alternative Paths

  1. No matching specimens are found; data capture continues from the specimen. The specimen is used as the basis for a new duplicate set in the network. FP_Messages#FP_ADD_NEW_SET
  2. Matching specimens are found, but with problematic data that needs correction; the data are pulled in, corrected, and Use_cases#Use_Case:_Annotate_Specimen is triggered.
  3. No matching specimen is found from just collector and collector number, but as more data are added, a match is found on other criteria (FP_Messages#FP_INVENTORY, FP_Messages#FP_FIND_SETS, FP_Messages#FP_GET_DATA). The course of events continues as if a match had been found on the first set of data.


Postconditions

Specimen record in the local herbarium database is populated.

Specimen record has been added in the network to an appropriate set of duplicates.

Business Rules

The local collection database must present collector and collector number as among the first fields for data capture from a herbarium sheet.

The local collection database must query the network node for matching duplicates as soon as collector and collector number are entered.
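This rule can be sketched as client-side logic; `EntryForm` and the injected `lookup` callable are hypothetical (a real client would send FP_FIND_SETS to its network node):

```python
# Sketch of the business rule: fire the duplicate query the moment both
# collector and collector number have been entered. EntryForm and the
# injected lookup callable are hypothetical.

class EntryForm:
    KEY_FIELDS = ("collector", "collector_number")

    def __init__(self, lookup):
        self.fields = {}
        self.lookup = lookup   # called with (collector, collector_number)
        self.suggestion = None

    def enter(self, field, value):
        self.fields[field] = value
        # As soon as both key fields are present, query the network once.
        if (self.suggestion is None
                and all(self.fields.get(f) for f in self.KEY_FIELDS)):
            self.suggestion = self.lookup(self.fields["collector"],
                                          self.fields["collector_number"])

def fake_lookup(collector, number):
    """Stand-in for a network query against the local FP node."""
    if (collector, number) == ("Fernald", "455"):
        return {"date_collected": "1901-05-02"}
    return None

form = EntryForm(fake_lookup)
form.enter("collector", "Fernald")
assert form.suggestion is None  # only one key field entered so far
form.enter("collector_number", "455")
```

The key point is that the query is triggered by field entry, not by an explicit "search" action by the data capturer.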

Particular herbaria will have the original field notes, maps, and similar data to validate the data associated with sets of duplicates distributed by a particular collector. It is expected that herbaria will pay particular attention to validating the data of duplicate sets for which they hold additional authoritative data sources beyond the herbarium sheet itself. This may impose a requirement for a data element in an annotation to describe the authoritative source on which an annotation is based.


Assumptions

Knowledge of which specimens might be duplicates, and their data, is immediately at hand for the FP network, so that query/response and network transport lags do not make the network's response slower than capturing the data directly from the specimen.

The specimen/collection object in question is a Herbarium specimen.

Authentication and authorization of the individual person using the FP network is a problem for the local software. Authorizing the software to generate FP messages is a problem for the FP network.


Notes

Duplicate specimens (parts of one plant collected at the same time, attached to different herbarium sheets, and distributed to several herbaria) are largely a phenomenon of botanical collections. This use case is only relevant for botanical collections amongst which duplicate specimens have been distributed. (Note: "have been distributed" might be a useful piece of information for the network for more rapidly returning relevant records.)

Ordered list of terms for duplicate detection, from 2011May19_Meeting_at_UCDavis

  • 0. GUID, if present
  • 1. Catalog No.
  • 2. Institution Name
  • 3. Collection Code
  • 4. Collector
  • 5. Collector No.
  • 6. Date Collected
  • 7./8. Verbatim Locality / Original Determination
  • 9. Authorship
  • 10. Lat/Long
  • 11. GPS Datum
  • 12. Lat/Long Uncertainty
  • 13. Phenology
  • 14. Sex
  • 15. Habit/Behavior


Network Processing

  1. specimen records are aggregated and stored centrally
  2. a clustering analysis is performed to detect potential duplicates
  3. when two or more records are identified as duplicates, the system generates a consensus record which evaluates and merges the content of the duplicates
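Steps 2-3 could look like the following sketch; majority vote is an assumed merge strategy, and flagging disagreeing fields reflects the highlighting requirement in the data capture use case below:

```python
# Sketch of consensus record generation: merge the members of a duplicate
# cluster field by field, and flag fields where members disagree.
# Majority vote is an assumed merge strategy.

from collections import Counter

def build_consensus(records):
    """Return (consensus record, list of disputed field names)."""
    consensus, disputed = {}, []
    fields = {f for r in records for f in r}
    for f in sorted(fields):
        values = [r[f] for r in records if f in r]
        counts = Counter(values)
        value, _ = counts.most_common(1)[0]
        consensus[f] = value
        if len(counts) > 1:
            disputed.append(f)  # highlight for the data capturer
    return consensus, disputed

duplicates = [
    {"collector": "A. Gray", "collector_no": "1023", "det": "Quercus alba"},
    {"collector": "A. Gray", "collector_no": "1023", "det": "Quercus rubra"},
    {"collector": "A. Gray", "collector_no": "1023", "det": "Quercus alba"},
]
consensus, disputed = build_consensus(duplicates)
```

Annotations arriving from the network would feed back into `records` and so shift the consensus over time.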

Data Capture Use Case:

  1. the merging process should provide the most consistent representation of the data and should highlight fields that do not agree, to make the data capturer aware of them
    1. the data capturer should verify the highlighted fields and can annotate them to improve the consensus; and/or, upon copying the consensus into their client/database, the data capturer can annotate the record, and these annotations will go into the FP network and inform the consensus
  2. the consensus record is presented to the user so that they avoid seeing many duplicates and having to choose between them
    1. should have metadata available to be captured in a notes/remarks field relating what records formed the consensus; not all will be able to (or want to) include this

Data Curator/Research use case:

  1. the data curator or researcher will search for taxa, geography, collector, etc. that they are interested in via the web client
  2. the result set will highlight consensus records (note that this will keep the number of records presented shorter)
    1. the user can expand the consensus record to see what individual records make up the set
    2. the user can annotate any potential duplicate record as incorrect
    3. the user can annotate any record in the set in order to influence the consensus [James questions whether we should let users annotate field-level data in the consensus or have the annotator be required to annotate a primary record instead...]