2014Jul09

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Jul09

Agenda

Non-Tech

  • Publications
  • Kurator
  • James: TDWG Symposium
  • James: FaunaEuropaea
  • InvertEBase

Tech

  • Status of going live with Morphbank integration
  • David: Status of FP2 services
  • QC for SCAN
    • Run on full NAU dataset, send report to Neil
      • GBIF checklist bank and CoL services.
      • Collector names and dates of birth (solr index)
      • Adding agent authority file to Symbiota
    • Run on full MCZ SCAN dataset, send report to Linda Ford
  • Metrics for SCAN
  • Updating Roadmap

Reports

Notes

FilteredPush Team Meeting 2014 July 09

Present: David, Paul, Tianhong, Bob, James

Agenda

  • Publications

James: We are not quite done with the system, and are perhaps using this as an excuse for not publishing. Can we hit on some smaller components now? Target descriptive papers on components and a short high-impact paper.

Bob: The OA crowd has made it a little clearer that data annotation is out of scope - they might be pointing to somewhere else in W3C. We may have to argue that it is a good fit.

Paul: For pieces, there's Chuck's FP-DataEntry tool. Bob: Small pieces, loosely coupled, is a good approach. It might be good to start working on a larger paper - separating deployment issues from architecture issues.

James: Would like to publish a domain paper in Botany - not quite there yet with NEVP.

Pensoft journals good target.

  • Kurator

Paul: No updates.

  • James: TDWG Symposium

Jönköping in Sweden 27-31 October 2014.

James: Wrote to Anton; he's now got a conflict but will still contribute. Updated the description and added some possible titles, including an update on Kuration with a focus on challenges (e.g. parallelization and service calls). Put a placeholder in for Bertram on provenance if he'd like to talk about it; otherwise we should get someone else to talk about provenance.

Paul: Need to get the list of presenters hammered out.

James: Anton's list looks solid; our contributions are a Kurator talk, a provenance talk, and ?.

  • James: FaunaEuropaea

James: No response from them.

  • InvertEBase

Paul: No updates.

Tech

  • Status of going live with Morphbank integration

David: Have a VM with most components deployed, still need to get their current codebase (including changes that aren't checked in) on there. Will email to get a patch set.

David: Have been working on the builds - have camel working in tomcat, expect to be able to deploy in that next week.

  • David: Status of FP2 services

David: Solr index added to FP2; access point (jetty), fedora (tomcat), mulgara, and mongo are deployed. Seeing some heap space issues with tomcat - it has been unstable. An akka analysis is there, but not on a cron job.

Paul: State of harvesting?

David: Harvesting is manual right now, not on a cron job yet - the current state is still to run it and import files by hand.

  • QC for SCAN

David: Have code to translate results to a spreadsheet; it reuses code from the annotation processor. It still needs testing against the latest result set changes and wrapping in a command line utility.
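As an illustration of what such a command line wrapper could look like, here is a minimal sketch in Python. It is not the team's code (which reuses the annotation processor): the JSON input layout, the column names in COLUMNS, and the function names are all assumptions made for the example.

```python
import argparse
import csv
import json
import sys

# Hypothetical column set; the real result schema is not given in these notes.
COLUMNS = ["recordId", "actor", "state", "comment"]


def results_to_csv(results, out):
    """Write a list of result dicts to `out` as CSV, one row per result."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        # Missing keys become empty cells rather than raising an error.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


def main(argv=None):
    """Parse arguments, load the JSON results file, and emit CSV."""
    parser = argparse.ArgumentParser(description="Convert QC results (JSON) to CSV.")
    parser.add_argument("results_json", help="path to a JSON list of result objects")
    parser.add_argument("-o", "--output", default="-", help="CSV path, or - for stdout")
    args = parser.parse_args(argv)

    with open(args.results_json, encoding="utf-8") as handle:
        results = json.load(handle)

    if args.output == "-":
        results_to_csv(results, sys.stdout)
    else:
        with open(args.output, "w", newline="", encoding="utf-8") as out:
            results_to_csv(results, out)
```

Calling main(["results.json", "-o", "report.csv"]) would write the report to a file that opens directly in a spreadsheet program. Keeping the conversion in results_to_csv, separate from argument handling, lets it be tested without touching the file system.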

    • Run on full NAU dataset, send report to Neil

Tianhong: Have modified the code to change the output to match the latest schema. In the actor details there is a list of states: do all of these matter, and does the order matter?

David: All of the list gets used; order doesn't matter.

Discussion of what to do if there is no data to be validated by the actor (e.g. no date to be validated by the date validation actor). This needs a new state, something on the order of "prerequisites not met". For georeferencing, the actor could create a georeference if there is locality text, but we probably don't want to (at least not yet), given observed issues with accuracy in automated georeferencing without humans.
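A minimal sketch of how such a state could be reported, written in Python purely for illustration. The state names other than "prerequisites not met", the ActorResult type, and the validate_event_date function are hypothetical and are not taken from the actual actors; eventDate is the Darwin Core term for the collecting date.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class CurationState(Enum):
    # Hypothetical vocabulary; only "prerequisites not met" comes from the notes.
    CORRECT = "correct"
    UNABLE_TO_VALIDATE = "unable to validate"
    PREREQUISITES_NOT_MET = "prerequisites not met"


@dataclass
class ActorResult:
    state: CurationState
    comment: str


def validate_event_date(record: dict) -> ActorResult:
    """Check eventDate, distinguishing 'nothing to check' from 'check failed'."""
    value = (record.get("eventDate") or "").strip()
    if not value:
        # No input at all: the actor has nothing to validate.
        return ActorResult(CurationState.PREREQUISITES_NOT_MET, "no eventDate supplied")
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return ActorResult(CurationState.UNABLE_TO_VALIDATE, f"cannot parse {value!r}")
    return ActorResult(CurationState.CORRECT, "eventDate parses as a year-month-day date")
```

The point of the separate state is that a report reader can tell records the actor never examined apart from records it examined and could not resolve.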

      • GBIF checklist bank and CoL services.

Tianhong: Taxon name validation code is more robust now, but seeing more issues when running on a larger set of data. More data runs into more cases of problematic paths through the actor - the actor may be too complicated and may need refactoring to simplify its logic and its debugging.

      • Collector names and dates of birth (solr index)

David has solr index up on FP2.

Tianhong: Code can access that index. Seeing an issue: the format of the collector name (lots of varieties) in the SCAN data isn't matching the index available. This sort of problem should be well studied; perhaps invoke some sort of name parser.

Paul: Test case: "Derek S. Sikes"; "D.S. Sikes"; "Sikes, D.S." - can all three strings get to the index record for Derek S. Sikes?

David: Don't think current configuration of solr will do this.
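One way to approach the test case independently of the solr configuration is to reduce each name to a comparison key before the lookup. The following Python sketch is only an illustration under that assumption; name_key is a hypothetical helper, not part of the existing code, and it handles a single person name rather than a multi-collector string.

```python
import re
import unicodedata


def name_key(raw: str) -> str:
    """Reduce one collector name to 'surname|initials' for matching."""
    # Strip accents so that e.g. "Díaz" and "Diaz" produce the same key.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).strip()

    if "," in text:
        # "Sikes, D.S." style: surname first, given names after the comma.
        surname, given = (part.strip() for part in text.split(",", 1))
    else:
        # "Derek S. Sikes" or "D.S. Sikes" style: surname is the last token.
        tokens = text.replace(".", ". ").split()
        if not tokens:
            return ""
        surname, given = tokens[-1], " ".join(tokens[:-1])

    # Keep only the first letter of each given name or initial.
    initials = "".join(word[0] for word in re.findall(r"[A-Za-z]+", given))
    return f"{surname.lower()}|{initials.lower()}"


variants = ["Derek S. Sikes", "D.S. Sikes", "Sikes, D.S."]
keys = {name_key(v) for v in variants}
```

With this key, all three variants of the test case collapse to a single value. The sketch deliberately ignores cases that appear in the SCAN data, such as suffixes ("A. C. Cole, Jr."), multiple collectors in one string, and names where only some initials are given, so a real name parser would still be needed.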

99 of the most frequent recordedBy strings in SCAN:

|      148 | A. & J.Brooks              |
|      108 | A. A. Westcott             |
|      113 | A. Acuña L.                |
|      699 | A. Asquith                 |
|      156 | A. B. Amerson, Jr.         |
|      431 | A. BARRERA                 |
|      184 | A. Behennah, I. Behennah   |
|      177 | A. Bonet; M. Jarry         |
|      114 | A. Brand                   |
|      229 | A. C. Cole                 |
|      833 | A. C. Cole, Jr.            |
|      151 | A. Cadena                  |
|      458 | A. Campbell                |
|      484 | A. Clark                   |
|      269 | A. D?íaz                   |
|     9132 | A. Dampf                   |
|      329 | A. Davidson                |
|      162 | A. Díaz Francés            |
|      219 | A. Del Valle               |
|     8606 | A. Díaz Francés            |
|      805 | A. Druk                    |
|     1705 | A. Druk, J.D. Herndon      |
|      106 | A. E. Brower               |
|      108 | A. E. Pritchard            |
|      238 | A. E. Verrill              |
|      398 | A. Earl Pritchard          |
|      119 | A. Ellingson               |
|      547 | A. F. Newton               |
|      522 | A. F. Newton, M. K. Thayer |
|      132 | A. Flores                  |
|      110 | A. Gibson                  |
|      153 | A. Gillogly                |
|      574 | A. Gonzalez Hdz.           |
|      157 | A. Guzman L.               |
|      226 | A. H. Powell               |
|      279 | A. Hilton                  |
|      618 | A. Hook                    |
|    12058 | A. Ibarra                  |
|     1220 | A. Johansen                |
|      239 | A. K. Wyatt                |
|      740 | A. L. Melander             |
|      131 | A. Laguerenne              |
|     1874 | A. Lopez, H. Clebsch       |
|     2014 | A. LUIS                    |
|      112 | A. M. Martinez             |
|      280 | A. MacLeod                 |
|     1402 | A. Mangini                 |
|     1355 | A. Maya                    |
|     5537 | A. Moores                  |
|      258 | A. Nichols, S. Hemly       |
|      130 | A. P. Morse                |
|      223 | A. Pearse                  |
|      138 | A. Pena de Niz             |
|      577 | A. Pescador                |
|      433 | A. Portoluri               |
|      248 | A. R. Gillogly             |
|      143 | A. Ramírez                 |
|      455 | A. Ravenscraft             |
|      381 | A. Rodriguez               |
|      283 | A. Rothwell                |
|     3021 | A. Sharkov                 |
|      193 | A. T. McClay               |
|      111 | A. Thibault                |
|     9850 | A. Tuz                     |
|      108 | A. V. Sharkov              |
|      132 | A. Varsek, J. E. Louderman |
|      214 | A. Vera                    |
|      787 | A. Villalobos              |
|      137 | A. Williams                |
|      698 | A. Worley                  |
|      346 | A. Wormington              |
|     1473 | A. Xool                    |
|      194 | A. Yates                   |
|      165 | A.&K. Menke, F.D. Parker   |
|      167 | A., A.                     |
|      104 | A.Audiffred                |
|      138 | A.B.Lazarus                |
|      301 | A.C. Cole                  |
|      338 | A.C. DELOYA                |
|     2343 | A.C. Sheppard              |
|      235 | A.D. Davidson              |
|     2547 | A.F. Winn                  |
|    25774 | A.G. Taillefer             |
|      141 | A.Garcia                   |
|      136 | A.L. Hicks, V. Scott       |
|      103 | A.L. Melander              |
|      278 | A.M. Chickering            |
|      271 | A.M. Woodbury              |
|      291 | A.Mojica & R. Johansen     |
|      156 | A.N.Gartrell               |
|     1323 | A.O.S.E.R.P.               |
|      485 | A.R. Gittins               |
|      185 | A.R. Moldenke              |
|      316 | A.R. Thornhill             |
|      305 | A.R.Brooks                 |
|      292 | A.S. Menke, F.D. Parker    |
|      443 | A.T. Finnamore             |
|      291 | A.T. McClay                |

More frequent collectors:

select count(*), recordedBy from omoccurrences group by recordedBy having count(*) > 3000 limit 100;
+----------+-------------------------------------------+
| count(*) | recordedBy                                |
+----------+-------------------------------------------+
|   953130 | NULL                                      |
|     9132 | A. Dampf                                  |
|     8606 | A. Díaz Francés                           |
|    12058 | A. Ibarra                                 |
|     5537 | A. Moores                                 |
|     3021 | A. Sharkov                                |
|     9850 | A. Tuz                                    |
|    25774 | A.G. Taillefer                            |
|    19169 | Alexander, G.                             |
|    11827 | Anderson, Robert                          |
|     4873 | Anweiler, G. G.                           |
|    11728 | Ashe, James                               |
|     4456 | B. Dasch, C. Dasch                        |
|    12704 | B. Kondratieff                            |
|     3093 | B.V. Brown & E. Wilk                      |
|    34687 | Ball, G. E.                               |
|     9027 | Ball, G. E.; Ball, K. E.                  |
|     3670 | Ball, G. E.; Erwin, T. L.; Leech, R. E.   |
|    15152 | Ball, G. E.; Whitehead, D. R.             |
|     3447 | Baumann, R. W.                            |
|     4337 | Beamer, Lucy                              |
|    17247 | Beamer, Raymond                           |
|     3136 | Beardsley, J. W.                          |
|    13147 | Bowman, K.                                |
|    12033 | Brooks, Robert                            |
|     3990 | Burke, H. R.                              |
|     4840 | C. Beutelspacher B.                       |
|     6376 | C. Dasch                                  |
|     5522 | C. H. Kennedy                             |
|     5320 | C. J. Durden                              |
|     5556 | C. L. Remington                           |
|     3565 | C. Pozo                                   |
|     5835 | C.D. Johnson                              |
|    11546 | C.M. Davidson                             |
|     3301 | Carr, F. S.                               |
|     9399 | Caterino & Chatzimanolis                  |
|    15615 | Chamberlain, W. F.                        |
|     6950 | Charles S. Wolfe                          |
|    60915 | Clarence D. Johnson                       |
|     3683 | Clark, W. E.                              |
|     4056 | Collector unknown                         |
|     7445 | Collector(s): A. M. Hagerty, S.Y. Emmert  |
|    22732 | Collector(s): Derek S. Sikes              |
|    11818 | Collector(s): Dominique M. Collet         |
|     4419 | Collector(s): J. Stockbridge, B. Wong     |
|     7977 | Collector(s): J. Stockbridge, C. Bickford |
|     5418 | Collector(s): Jozef Slowik                |
|     6574 | Collector(s): Lisa Saperstein             |
|     4690 | Collector(s): O. Helmy                    |
|     3605 | Collector(s): Richard H. Washburn         |
|    23032 | Collector(s): unknown                     |
|    16771 | Collector(s): USDA ARS                    |
|    15713 | D. J. Knull, J. N. Knull                  |
|     3132 | DeLong, Good, et al.                      |
|     4131 | Delorme, D. R.                            |
|     7268 | Dimond & Watt                             |
|     3037 | E. Ahlstrom                               |
|     3307 | E. M. U.                                  |
|     7269 | E. May                                    |
|     5084 | E. Stephens                               |
|    25446 | Edward G. Riley                           |
|     3227 | Escobar-Sarria, F.                        |
|    10229 | F. Beaulieu                               |
|    16934 | F.D. Parker                               |
|     3408 | F.D. Parker, T.L. Griswold                |
|    18255 | Falin, Zachary                            |
|     3846 | Frank Graham, Jr.                         |
|    14902 | G.E. Bohart                               |
|     9574 | G.E. Knowlton                             |
|     4960 | G.S. Forbes                               |
|     3956 | Gary D. Alpert                            |
|     3177 | Gaumer, G. C.                             |
|     5964 | Grinter, Christopher C.                   |
|    10877 | Guppy, Crispin S.                         |
|    19025 | H. Ikerd                                  |
|     3173 | H. Ikerd, K. Davidson                     |
|     4081 | H. Miller                                 |
|     3423 | Hardy, D. E.                              |
|     5633 | Heming, B. S.                             |
|     3727 | I. Desjardins & C. Idziak                 |
|     3797 | Ismael Alejandro Hinojosa D?¡az           |
|     3682 | J. Forrest                                |
|     4158 | J. L. Neff                                |
|     3190 | J. L. Salinas                             |
|     8065 | J. Longino                                |
|    10222 | J. Rykken                                 |
|    13383 | J. Saldaña Mtz                            |
|     3572 | J. Savage & J. Kuchta                     |
|     6176 | J.D. Herndon                              |
|     8381 | J.H. Davidson                             |
|    10997 | J.M Meiners, T. Lamperty                  |
|     5365 | J.S. Wilson                               |
|     4197 | James B. Woolley                          |
|     3419 | James C. Cokendolpher                     |
|    10012 | Jean H. Puckle                            |
|    24519 | Jerry A. Powell                           |
|     5413 | John A. Jackman                           |
|     3501 | Jonathan E. King; Edward G. Riley         |
|     3638 | JOSE LUIS SALINAS & MAURO OMAR VENCES     |
|     4218 | Joseph C. Schaffner                       |
+----------+-------------------------------------------+
100 rows in set (33.57 sec)

      • Adding agent authority file to Symbiota

Paul: Haven't gotten to this yet.

    • Run on full MCZ SCAN dataset, send report to Linda Ford

Tianhong: Ran with older code; haven't run yet on code with the latest output format.

Paul: Getting the convert-to-spreadsheet command line utility working is the next step needed here; then we can get easily from runs to circulatable documents that we can get feedback on.

  • Metrics for SCAN

David: Have the couple of queries done, with a page to display them, ready for integration into Symbiota.

  • Updating Roadmap
  • Tech Call: Thursday 10 Pacific / 1 Eastern:
  1. Examine progress in workflow output to spreadsheets
  2. Look at refactoring actors to ease debugging, reduce complex paths.