- Can we obtain a date from the record?
- Was the specimen collected within the lifespan of the collector?
- Are the collecting dates similar for two or more specimens that were collected at locations far apart from one another?
4. Collector number-based
- Does the collecting number (DwC recordNumber or possibly fieldNumber) coincide with the eventDate?
- Does the timing/date of the event coincide with the reproductive state? [This is primarily a botany example, but timing is also very important in animals, for example emergence dates of insects.]
- Is the eventDate earlier than any annotation date provided?
7. Record metadata-based
- Is the eventDate earlier than the date that the digital record was created?
- There are 3 other date-based fields in DwC that may be useful/relevant: georeferencedDate, relationshipEstablishedDate, measurementDeterminedDate
|Domain|Can be solved with input data|Need local dataset or remote service|
|---|---|---|
|Within record|1, 6, 7, 8|2, 5|
|Among records|3, 4| |
Overall: solve problems 1, 7, and 8 first; then 3 and 4; then 2, 5, and 6.
Parse dwc:eventDate and dwc:verbatimEventDate first, then compare the result with dwc:startDayOfYear, dwc:month, etc. If inconsistent, return "Solve_with_more_data".
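A minimal sketch of this consistency check, assuming a record is a plain dict keyed by DwC term names and that eventDate is a simple ISO 8601 date (real verbatimEventDate parsing would need a much more tolerant parser):

```python
from datetime import date

def check_event_date_consistency(record):
    """Compare dwc:eventDate with the atomic dwc:startDayOfYear and
    dwc:month fields. Returns "COMPLIANT" when they agree, and
    "Solve_with_more_data" when they conflict or eventDate is unparseable.
    The dict-based record layout is an assumption for illustration."""
    try:
        event = date.fromisoformat(record["eventDate"])
    except (KeyError, ValueError):
        return "Solve_with_more_data"
    # day-of-year of the parsed eventDate
    day_of_year = event.timetuple().tm_yday
    start = record.get("startDayOfYear")
    if start is not None and int(start) != day_of_year:
        return "Solve_with_more_data"
    month = record.get("month")
    if month is not None and int(month) != event.month:
        return "Solve_with_more_data"
    return "COMPLIANT"
```

Any field that is absent is simply skipped, so a record carrying only eventDate passes trivially.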
Use the Harvard Index of Botanists to obtain the collector's lifespan based on dwc:recordedBy. If more than one person is specified in the field, choose the first one (split on the semicolon, ";"). If a group or organization name is specified, skip this validation step and generate a corresponding comment.
- Must the event date be within the lifespan of the person specified by dwc:identifiedBy, or could the occurrence record have been identified by someone later? ==> Yes, so dwc:identifiedBy can't be validated with the same idea
- If more than one name is specified in dwc:recordedBy (or dwc:identifiedBy), are the names always separated by a semicolon? ==> No standard; semicolon and pipe are most likely
- Checking whether the date is after the collector's birth year is the straightforward choice, but the event is unlikely to have occurred in early childhood; how about starting at age 15 or 12? ==> Starting at age 10 is a good choice
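The lifespan rule above can be sketched as follows. The `lifespans` lookup is a stand-in assumption for a table built from the Harvard Index of Botanists; the age-10 floor and the ";"/"|" splitting reflect the decisions noted above:

```python
import re

def within_collector_lifespan(event_year, recorded_by, lifespans, min_age=10):
    """Check that the event year falls within the collector's plausible
    collecting years. `lifespans` maps a name to a (birth_year, death_year)
    tuple. Only the first name in dwc:recordedBy is used; names may be
    separated by ";" or "|". Returns None when the check cannot be applied
    (unknown collector, or a group/organization name)."""
    first = re.split(r"[;|]", recorded_by)[0].strip()
    if first not in lifespans:
        return None  # unknown person or a group/organization name: skip
    birth, death = lifespans[first]
    # assume collecting activity starts no earlier than age `min_age`
    return birth + min_age <= event_year <= death
```

The `None` return corresponds to "skip this validation step and generate a comment" for group names.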
Problem 3 (need work)
see "Outlier Detection"
Problem 4 (need work)
Similar to problem 3; the pattern and certainty vary. The collecting number may also change at any time, depending on the collector. One way is to validate within the same collecting-number pattern, i.e. split the records wherever the pattern changes and put the subsets into different validation groups.
- Are some patterns of collecting number difficult to validate, i.e. not incremental or following an unusual progression? For example, the first collecting number is 1, the second is 11, the third is 111?
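One way to sketch the "split by pattern change" idea: collapse each dwc:recordNumber into a structural signature (digits become `N`, letters become `A` — a simplifying assumption, not a standard) and cut the sequence wherever the signature changes:

```python
import itertools
import re

def split_by_number_pattern(record_numbers):
    """Split a sequence of dwc:recordNumber values into validation groups
    whenever the number's structural pattern changes. "12" and "13" share
    the pattern "N", while "B-14" has pattern "A-N" and starts a new group."""
    def pattern(num):
        # collapse digit runs first, then letter runs
        return re.sub(r"[A-Za-z]+", "A", re.sub(r"\d+", "N", num))
    return [list(grp) for _, grp in itertools.groupby(record_numbers, key=pattern)]
```

Each resulting group could then be validated independently for incrementing numbers.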
Problem 5 (need work)
- Not clear how; depends on the value in dwc:reproductiveCondition and also on the specimen type, i.e. a leaf or a flower?
- Not clear how to do this; which DwC term are we looking at?
Problem 6 (need work)
Compare the event date with the annotation date; the annotation must occur after the event date.
- Not sure how to get the annotation date
Compare the event date with dwc:modified; the date in dwc:modified must occur after the event date.
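This check is simple enough to sketch directly, assuming both values arrive as ISO 8601 strings (dwc:modified may carry a time component, which is dropped here for simplicity):

```python
from datetime import datetime

def modified_after_event(event_date, modified):
    """Return True when dwc:modified is on or after dwc:eventDate.
    Both inputs are assumed to be ISO 8601 strings; only the date
    part of dwc:modified is compared."""
    event = datetime.fromisoformat(event_date).date()
    mod = datetime.fromisoformat(modified).date()
    return mod >= event
```

The same comparison would apply to the record-creation date in problem 7 once that field is parsed.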
This varies for different patterns and certainties. Some impact factors (assume each "set of records" mentioned below shares the same collector):
- Percentage of suspicious records in a set of records
For example, if one record out of a set of 100 is far away from the rest, it is likely to be an outlier; by contrast, 1 out of 5 or 10 out of 50 are less likely to be outliers.
- Date span of several records with same collector
For example, if a couple of records share the same event date, it is unlikely that the events happened far from each other. But if a sequence of events happened at intervals of a month or a year, it is more plausible that they happened far apart. Mobility also differs across time periods: in earlier years it could take months to travel around the world, whereas nowadays it can be done in days.
- Sequence pattern of the date of a set of records
For example, a set of 8 records have dates in sequential order, and two locations "A" and "B" are associated with the records as follows:
In pattern 1, B is likely to be an outlier; in pattern 2, B is suspicious, but it is possible that the collector traveled to a new location; in pattern 3, it is difficult to tell whether A or B is the outlier; in pattern 4, the records are likely valid.
- Not sure a heuristic approach will help, i.e. using previously found outliers as a reference for future validation.
It depends on why the outlier occurs: experience-as-reference may work well or not at all. For example, a typo caused by mistyping is unlikely to recur within a set of records, whereas a typo caused by misconfiguration of a system is quite likely to recur throughout the set of records generated by that system.
Overall, for different sets of records, the certainty and the method for finding outliers may differ.
The old collectionOutlierFinder (Lei's work) takes the following approach: store the actor's input (a set of records) in a linked list; then, for each record in the list, calculate the distance to the previous and next non-outlier records, and if both distances are above a threshold, identify it as an outlier. This approach has several problems: 1) the first and last records can never be outliers, 2) the record order is determined by the input dataset and other factors, so the outcome may not be deterministic, 3) it doesn't scale well.
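For illustration, a rough reconstruction of that logic (not the original Kepler actor code; points are assumed to be plain (x, y) coordinates with Euclidean distance) makes the first flaw visible in the loop bounds:

```python
from math import dist

def find_outliers_old(points, threshold):
    """Reconstruction of the old collectionOutlierFinder idea: a point is
    an outlier when its distance to BOTH the nearest previous and nearest
    next non-outlier point exceeds `threshold`. Note the known flaws:
    the loop never visits the first or last point, and the result depends
    on input order."""
    outliers = set()
    for i in range(1, len(points) - 1):
        # nearest earlier / later indices not already flagged as outliers
        prev = next(j for j in range(i - 1, -1, -1) if j not in outliers)
        nxt = next(j for j in range(i + 1, len(points)) if j not in outliers)
        if (dist(points[i], points[prev]) > threshold
                and dist(points[i], points[nxt]) > threshold):
            outliers.add(i)
    return outliers
```

With the endpoints excluded by construction, an erroneous first or last record silently passes, which motivates the clustering proposal below.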
Since our data are longitudinal and not normally distributed, statistical methods that assume a normal distribution are not applicable. We aim to eliminate this hidden effect and exploit certain characteristics of the data to simplify the approach where possible. The following algorithm is therefore proposed; we'll start with a simple approach and see how it works.
We propose an approach using clustering to detect outliers. Here are some observations and proposals:
- if an event date is present, cluster the records sharing that event date into one cluster.
- the distance threshold is a function of the length of the time interval and the year of the event date, i.e. the threshold is lower when the event times of two records are relatively close and when the occurrence happened in earlier years.
- within each cluster of records with the same date (or within a set of records when no event date is present), the data points are not normally distributed, and hidden effects limit the power of statistical methods. Two ways to tackle this: 1) some statistical approaches are still applicable if we limit false positives, 2) use non-statistical methods, such as data mining or simply calculating the distance between pairs of points.
- Then calculate the distances between clusters in date order. For each point in one cluster, calculate the average distance to all points in the other cluster; if this distance exceeds the threshold, the point is an outlier. Note that we don't need to compare every pair of clusters; only clusters adjacent to each other in date order need to be compared.
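A minimal sketch of the proposed clustering pass, assuming each record is a hypothetical (event_date, x, y) tuple and using a fixed threshold for simplicity (the proposal above makes it a function of the time interval and the year):

```python
from collections import defaultdict
from math import dist

def cluster_outliers(records, threshold):
    """Group records by event date, then compare adjacent clusters in
    date order: a point whose average distance to all points of the next
    cluster exceeds `threshold` is flagged as a candidate outlier."""
    clusters = defaultdict(list)
    for event_date, x, y in records:
        clusters[event_date].append((x, y))
    # clusters in chronological order (ISO date strings sort correctly)
    ordered = [clusters[d] for d in sorted(clusters)]
    outliers = []
    for current, nxt in zip(ordered, ordered[1:]):
        for p in current:
            avg = sum(dist(p, q) for q in nxt) / len(nxt)
            if avg > threshold:
                outliers.append(p)
    return outliers
```

A fuller version would also compare each cluster against its previous neighbour and vary the threshold by date gap, per the observations above.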
Because the input dataset is not usually sorted by collector and eventDate, a "group by" operation must be performed. Sorting takes a significant amount of time on large datasets and also blocks downstream actors, so the idea is to avoid sorting. The following approaches can improve performance:
- Store less information: only the collector name, event date, and coordinates are stored, not the whole record.
- Set a threshold so that once a certain number of records has arrived, they are processed first.
- While the reader consumes the input data stream, have the system record some statistics about the stream to help the dateValidator estimate the data structure.
- Use map-reduce.
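The "process once enough records have arrived" idea can be sketched as a streaming group-by that never sorts: buffer reduced tuples per collector and emit a batch as soon as any collector's buffer fills. The tuple layout and `batch_size` default are assumptions for illustration:

```python
from collections import defaultdict

def batch_by_collector(stream, batch_size=1000):
    """Buffer records per collector and yield (collector, batch) as soon
    as one collector accumulates `batch_size` records, then flush the
    remaining partial buffers at end of stream. Each item in `stream` is
    assumed to be an already-reduced (collector, event_date, lat, lon)
    tuple, per the "store less information" point above."""
    buffers = defaultdict(list)
    for rec in stream:
        collector = rec[0]
        buffers[collector].append(rec)
        if len(buffers[collector]) >= batch_size:
            yield collector, buffers.pop(collector)
    # flush whatever remains at end of stream
    for collector, batch in buffers.items():
        yield collector, batch
```

Downstream validators (lifespan check, outlier detection) can then run per batch without blocking on a full sort of the input.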