2014Oct14

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Oct14

Agenda

Non-Tech

  • iDigBio TCN Summit

Jim will attend InvertEBase PI's meeting on behalf of FP

  • LepNet TCN

Jim: Neil Cobb (NAU) is lead PI on a new TCN proposal to NSF that would digitize Lepidoptera (butterflies and moths). FP is being written into the Harvard budget for a very modest sum ($11-13K) to support deployment of an FP instance into the LepNet Symbiota portal. Jim and Paul were contacted at the very last minute with no advance notice. Jim has provided a formal letter of collaboration as per NSF requirements.

  • Publications
    • Progress
      • Paul: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
      • Tianhong - IDCC submission

Tech

  • Kurator Integration
    • Build issues found by Tim
    • SVN reorganization/cleanup
  • QC for SCAN
    • Feedback from Neil/Paul Heinrich
  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
  • Firuta server move.
    • Update deployed apps in Tomcat and Apache (Symbiota, Morphbank, Annotation Processor)
    • Distribution upgrade
  • Deployments
    • Access point updates
    • Bringing Annotation Processor up-to-date
    • Deploy and re-run harvest of occurrence records
    • Status of fp2 and fp3

Reports

  • Paul
    • More work on schema changes to support agent authority files in Symbiota. Adding fields/tables for richer biographical information and cross links. Added some fields for compatibility with FOAF concepts. Proposed schema changes checked in on a branch.

Notes

FilteredPush Team Meeting 2014 Oct 14

Present: Paul, Jim, James, Bob, David, Tianhong, Bertram

Agenda

Non-Tech

  • iDigBio TCN Summit
    • Jim will attend InvertEBase PI's meeting on behalf of FP
  • LepNet TCN
    • Neil Cobb (NAU) is lead PI on a new TCN proposal to NSF that would digitize Lepidoptera (butterflies and moths). FP is being written into the Harvard budget for a very modest sum ($11-13K) to support deployment of an FP instance into the LepNet Symbiota portal. Jim and Paul were contacted at the very last minute with no advance notice. Jim has provided a formal letter of collaboration as per NSF requirements.
  • Publications
    • Progress
      • Paul: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
  Making into a suite of four standalone apps.  David helped with mavenification
  Deployed Hadoop on a cluster of VMs at UMASS-Boston; of the four above only the actual clustering app needs parallelization.  
      • Bob: "Final Paper" draft
  Began topics list in a gdoc. Sent invitations to likely authors? Who else?
      • Tianhong - IDCC submission

Tianhong: Just received editor's revision, will send out for comments.

Bertram: Please circulate for all the authors to make a last check.

(1) Workflow - Bertram to present. Looking for suggestions/contributions. Presentation in Dropbox? 12 min talk

Storyline: lure people into Kurator?

Main story: FP, curation workflows

End with invitation to community-driven development; Kurator

For the visual part:

  • show before & after of records
  • nice transition to go from what we've done to where we're going
  • typically: use of external services (scalability?)
  • Jim: specimen record geographic distribution on a map; commonly misplaced records
  • take a snapshot from MCZBase web site
  • also -> think of this as another curation actor (data2map)

Paul: Some examples currently in SCAN data:

 select catalognumber, decimallatitude, decimallongitude, country, stateprovince, county from omoccurrences where decimallatitude = 88.29;
+---------------+-----------------+------------------+---------+---------------+-----------+
| catalognumber | decimallatitude | decimallongitude | country | stateprovince | county    |
+---------------+-----------------+------------------+---------+---------------+-----------+
| TTU-Z_038283  |           88.29 |          -40.105 | USA     | Illinois      | Champaign |
| TTU-Z_038284  |           88.29 |          -40.105 | USA     | Illinois      | Champaign |
| TTU-Z_038286  |           88.29 |          -40.105 | USA     | Illinois      | Champaign |
+---------------+-----------------+------------------+---------+---------------+-----------+
select catalognumber, decimallatitude, decimallongitude, country, stateprovince, county from omoccurrences where decimallatitude < -76;
+---------------+-----------------+------------------+---------+---------------+----------+
| catalognumber | decimallatitude | decimallongitude | country | stateprovince | county   |
+---------------+-----------------+------------------+---------+---------------+----------+
| NAUF4A0000312 |         -76.239 |             NULL | USA     | Arizona       | Coconino |
+---------------+-----------------+------------------+---------+---------------+----------+
select catalognumber, decimallatitude, decimallongitude, country, stateprovince, county from omoccurrences where decimallatitude < -10 and country = 'USA' limit 10;
+---------------+-----------------+------------------+---------+---------------+----------+
| catalognumber | decimallatitude | decimallongitude | country | stateprovince | county   |
+---------------+-----------------+------------------+---------+---------------+----------+
| NAUF4A0000312 |         -76.239 |             NULL | USA     | Arizona       | Coconino |
| NAUF4A0038298 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0039463 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0039465 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0039461 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0039458 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0038690 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| NAUF4A0038275 |        -35.6741 |        -106.4935 | USA     | New Mexico    | Sandoval |
| TTU-Z_019555  |       -30.89819 |        -99.90607 | USA     | Texas         | Menard   |
| TTU-Z_037036  |       -30.89819 |        -99.90607 | USA     | Texas         | Menard   |
+---------------+-----------------+------------------+---------+---------------+----------+
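
The rows above suggest a few recurring patterns: a latitude with the wrong sign, latitude and longitude transposed, or a missing longitude. A minimal Python sketch of the kind of check a curation actor could apply to records with country = 'USA' is given here. The bounding box, the function names, and the list of candidate repairs are assumptions made for illustration; they are not part of FP or Kurator, and a real actor would use a gazetteer or state/county polygons instead of one coarse box.

```python
from typing import List, Optional

# Coarse bounding box for the 50 US states (an assumption for illustration;
# it ignores territories and the Aleutian islands east of 180 degrees).
USA_LAT = (18.0, 72.0)
USA_LON = (-180.0, -66.0)


def in_usa_box(lat: float, lon: float) -> bool:
    """True if the point falls inside the coarse USA bounding box."""
    return USA_LAT[0] <= lat <= USA_LAT[1] and USA_LON[0] <= lon <= USA_LON[1]


def check_usa_coordinates(lat: Optional[float], lon: Optional[float]) -> List[str]:
    """Return a list of problems (and possible fixes) for a USA record."""
    problems: List[str] = []
    if lat is None or lon is None:
        problems.append("missing latitude or longitude")
        return problems
    if in_usa_box(lat, lon):
        return problems
    problems.append("outside USA bounding box")
    # Try a few simple repairs and report those that land inside the box.
    candidates = {
        "latitude sign flipped": (-lat, lon),
        "longitude sign flipped": (lat, -lon),
        "latitude and longitude transposed": (lon, lat),
        "transposed with both signs flipped": (-lon, -lat),
    }
    for label, (new_lat, new_lon) in candidates.items():
        if in_usa_box(new_lat, new_lon):
            problems.append(f"possible fix: {label} -> ({new_lat}, {new_lon})")
    return problems


if __name__ == "__main__":
    # Coordinate pairs taken from the query results above.
    for lat, lon in [(88.29, -40.105), (-76.239, None), (-35.6741, -106.4935)]:
        print(lat, lon, check_usa_coordinates(lat, lon))
```

A suggested fix is only a hint for a human reviewer: landing inside the box does not prove the repaired point is in the stated state or county.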

(2) Precapture process. Paul to present. Has in hand.

(3) Duplicate detection - Bob to present. Needs some screenshots of Chuck's application in use.

Jim: Cancel online meeting week of Oct 28th.

  • Bertram:

    • Will be on travel until end of the week (panel).
    • Launching a "YesWorkflow" (complementary to NoWorkflow) "project". Key idea:
      • extract a workflow structure from a script (Python, R, ...) using special comments (cf. JavaDoc)
      • this creates a knowledge artifact that is easy to understand and share; also acts as a simple browsable interface into the code and runtime provenance
      • -> if you're interested, email Bertram
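
To make the "special comments" idea concrete, here is a minimal Python sketch that pulls a step/input/output structure out of annotated comments in a script. The tag names (`@begin`, `@end`, `@in`, `@out`), the function name and the example script are assumptions chosen for illustration; the notes do not specify the annotation syntax, and this is not the project's implementation.

```python
import re
from typing import Dict, List

# Matches comment lines such as "# @begin step_name" or "# @in variable".
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_workflow(source: str) -> List[Dict[str, object]]:
    """Collect the annotated steps of a script with their inputs and outputs."""
    steps: List[Dict[str, object]] = []
    open_steps: List[Dict[str, object]] = []
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            step = {"name": name, "in": [], "out": []}
            steps.append(step)
            open_steps.append(step)
        elif tag == "end":
            if open_steps:
                open_steps.pop()
        elif open_steps:
            # "in" or "out": attach to the innermost open step.
            open_steps[-1][tag].append(name)
    return steps


# Hypothetical annotated script; only the comments are inspected.
EXAMPLE_SCRIPT = """
# @begin load_records
# @in input_file
# @out records
records = open("occurrences.csv").read().splitlines()
# @end load_records

# @begin validate_coordinates
# @in records
# @out flagged_records
flagged_records = [r for r in records if "NULL" in r]
# @end validate_coordinates
"""

if __name__ == "__main__":
    for step in extract_workflow(EXAMPLE_SCRIPT):
        print(step["name"], "in:", step["in"], "out:", step["out"])
```

Because only comments are read, the same approach would apply to an R script or any other language with line comments, after adjusting the comment marker in the pattern.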

Tech

  • Kurator Integration
    • Build issues found by Tim

David: The remaining issue appeared to be in FP-Curation: the geotools repository had moved. Have checked in Tim's fix for that issue. With that fix, could build and put the jar into Archiva.

    • SVN reorganization/cleanup

David: Tagged and removed the old projects (java-SOA, FP-Testing, FP-Examples). SPARQL push and some FPtools things can probably be moved as well.

  • QC for SCAN
    • Feedback from Neil/Paul Heinrich

David: with Tianhong spoke with Paul about the spreadsheets - largely cosmetic, seemed pleased overall. Identified some records that needed fixing. Comments appeared sane and informative. Raised some small issues, largely around making it easier to find the source of the problem and making the spreadsheet easier to read.

Paul: Next step probably to make results

Tianhong: Can see some edge cases in other collections that may break workflow on full data set. Good to run on full data set and see what breaks.

Paul: Let's take a dual approach: Run on larger data sets and also on synthetic data (designed to fuzz the workflow actors):
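
One way to read the synthetic-data half of this approach is to take a known-good record and apply deliberate mutations that mirror errors already seen in SCAN. A minimal Python sketch follows. The seed record, the mutation functions and the `mutation` field are hypothetical and invented for illustration; they are not taken from SCAN data or from the FP code.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical, hand-made seed record (not a real specimen).
SEED_RECORD: Record = {
    "catalognumber": "SYNTH-SEED",
    "decimallatitude": 35.0,
    "decimallongitude": -106.0,
    "country": "USA",
    "stateprovince": "New Mexico",
}


def flip_latitude_sign(rec: Record) -> Record:
    rec["decimallatitude"] = -rec["decimallatitude"]
    return rec


def transpose_coordinates(rec: Record) -> Record:
    rec["decimallatitude"], rec["decimallongitude"] = (
        rec["decimallongitude"],
        rec["decimallatitude"],
    )
    return rec


def drop_longitude(rec: Record) -> Record:
    rec["decimallongitude"] = None
    return rec


MUTATIONS: List[Callable[[Record], Record]] = [
    flip_latitude_sign,
    transpose_coordinates,
    drop_longitude,
]


def make_synthetic_records(seed: Record, n: int, rng: random.Random) -> List[Record]:
    """Return n mutated copies of seed, each labelled with the mutation applied."""
    records = []
    for i in range(n):
        mutate = rng.choice(MUTATIONS)
        rec = mutate(dict(seed))  # copy so the seed itself is never modified
        rec["catalognumber"] = f"SYNTH-{i:04d}"
        rec["mutation"] = mutate.__name__
        records.append(rec)
    return records


if __name__ == "__main__":
    for rec in make_synthetic_records(SEED_RECORD, 5, random.Random(0)):
        print(rec)
```

Recording which mutation was applied gives each synthetic record a known expected outcome, so a workflow run can be scored on whether the actors flagged it.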

Paul: Next logical target for a subset is ASUHIC, and sent to Nico.

 
select count(*), c.collectioncode from omoccurrences o left join omcollections c on o.collid = c.collid group by c.collectioncode;
+----------+----------------+
| count(*) | collectioncode |
+----------+----------------+
|    28857 | NULL           |
|        1 | AODNA          |
|    62066 | ASUHIC         |   Run this one next.
|    16114 | BYUC           |
|    50648 | CSUC           |
|     5462 | CSUNPS         |
|    87646 | DMNS           |
|    12023 | ENT            |
|  3285171 | GBIF_ARTH      |  Skip this one.
|   721508 | GBIF_NOLOC     |  Skip this one.
|      256 | GPSC           |
|    15937 | HIC            |
|      182 | MOD_AMNH       |
|     2847 | MOD_CAS        |
|     1646 | MOD_CDFA       |
|        1 | MOD_CH         |
|      751 | MOD_CNC        |
|       25 | MOD_CSUFC      |
|     4118 | MOD_DMNS       |
|     2319 | MOD_FMNH       |
|        1 | MOD_INHS       |
|       61 | MOD_KSU        |
|     4097 | MOD_LACM       |
|      567 | MOD_NMNH       |
|       63 | MOD_PMJ        |
|      315 | MOD_PR         |
|      211 | MOD_SBMNH      |
|      155 | MOD_SDNHM      |
|      689 | MOD_TED        |
|    10580 | MOD_UCB        |
|     5145 | MOD_UCD        |
|      235 | MOD_UCONN      |
|      335 | MOD_UCR        |
|       49 | MOD_WPC        |
|    29238 | MSBA           |
|    28444 | NAUF           |
|    54339 | NMSU           |
|     9382 | NPS            |
|   117751 | TAMU-TTI       |  Good 100k collection
|   156264 | TAMUIC         |  Good 100k collection
|    90899 | TTU-Z          |
|    46162 | UAIC           |
|    45588 | UCMC           |
|      187 | UDAFE          |
|    38057 | UHIM           |

Tianhong: Institution code?

select count(*), c.collectioncode, c.institutioncode from omoccurrences o left join omcollections c on o.collid = c.collid group by c.collectioncode, c.institutioncode;
+----------+----------------+-----------------+
| count(*) | collectioncode | institutioncode |
+----------+----------------+-----------------+
|    28857 | NULL           | MCZ             |
|        1 | AODNA          | AODNA           |
|    62066 | ASUHIC         | ASU             |
|    16114 | BYUC           | BYU             |
|    50648 | CSUC           | CSU             |
|     5462 | CSUNPS         | CSU             |
|    87649 | DMNS           | DMNS            |
|    12026 | ENT            | UMNH            |
|  3285171 | GBIF_ARTH      | GBIF            |
|   721508 | GBIF_NOLOC     | GBIF_NOLOC      |
|      256 | GPSC           | GPSC            |
|    15937 | HIC            | UKY-HIC         |
|      182 | MOD_AMNH       | MOD_AMNH        |
|     2847 | MOD_CAS        | MOD_CAS         |
|     1646 | MOD_CDFA       | MOD_CDFA        |
|        1 | MOD_CH         | MOD_CH          |
|      751 | MOD_CNC        | MOD_CNC         |
|       25 | MOD_CSUFC      | MOD_CSUFC       |
|     4118 | MOD_DMNS       | MOD_DMNS        |
|     2319 | MOD_FMNH       | MOD_FMNH        |
|        1 | MOD_INHS       | MOD_INHS        |
|       61 | MOD_KSU        | MOD_KSU         |
|     4097 | MOD_LACM       | MOD_LACM        |
|      567 | MOD_NMNH       | MOD_NMNH        |
|       63 | MOD_PMJ        | MOD_PMJ         |
|      315 | MOD_PR         | MOD_PR          |
|      211 | MOD_SBMNH      | MOD_SBMNH       |
|      155 | MOD_SDNHM      | MOD_SDNHM       |
|      689 | MOD_TED        | MOD_TED         |
|    10580 | MOD_UCB        | MOD_UCB         |
|     5145 | MOD_UCD        | MOD_UCD         |
|      235 | MOD_UCONN      | MOD_UCONN       |
|      335 | MOD_UCR        | MOD_UCR         |
|       49 | MOD_WPC        | MOD_WPC         |
|    29238 | MSBA           | UNM             |
|    28444 | NAUF           | NAU             |
|    54339 | NMSU           | NMSU            |
|     9382 | NPS            | NAU             |
|   117751 | TAMU-TTI       | TAMU            |
|   156264 | TAMUIC         | TAMU            |
|    90899 | TTU-Z          | TTU             |
|    46162 | UAIC           | UA              |
|    45588 | UCMC           | UCB             |
|      187 | UDAFE          | UDAF            |
|    38057 | UHIM           | UHIM            |
+----------+----------------+-----------------+
45 rows in set (11.03 sec)
  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.

Paul: Proposed schema changes committed. Need to get feedback from Ed and start into UI elements.

  • Firuta server move.
    • Update deployed apps in Tomcat and Apache (Symbiota, Morphbank, Annotation Processor)

David: Updating client helper deployment. Need to do PHP updates and configuration for Symbiota/Morphbank deployments there.

    • Distribution upgrade

Paul: Done.

  • Deployments
    • Access point updates

David: Deploying on FP2 and FP3, need to update the PHP elements in the Symbiota code.

Bob: Can they access the document store?

David: If they go through the access point - the triplestore and document store are not publicly exposed. Support for ingesting an annotation, getting an annotation, and key-value prepared SPARQL queries is deployed.

    • Bringing Annotation Processor up-to-date

David: Investigation of the driver continues. Have identified code from Maureen's driver that can be reused: the Specify driver project together with the Hibernate project, which will build. Working to configure this for specify-huh on firuta.

    • Deploy and re-run harvest of occurrence records

David: Still have to look at the shell script.

    • Status of fp2 and fp3

David: Everything is in place on FP2; FP3 needs updates and configuration for Fedora and Mulgara. Need to get the Icinga checks running on FP1 to test localhost services on FP2 and FP3. Want to do this before switching on FP3 (to be able to detect which services are up, particularly after Tomcat restarts).

For Next Tech Call

  • For Thursday FP/Kurator tech call:
    • Anything else needed to unblock Tim.
    • Benchmarking workflows
    • Run on additional pieces of SCAN data (ASU, ASUHIC; TAMU, TAMU-TTI)