The University of Illinois Urbana-Champaign and Harvard University are awarded grants to develop software tools for scientific data digitization, sharing, integration, and use. The considerable challenges of digitizing natural science collections in the U.S. (and globally) necessitate a focus on both digitization efficiency and the utility of the generated data. This grant will develop a novel, extensible, open-source toolkit (Kurator) for automated and semi-automated workflows with diverse curation services to aid biodiversity research and beyond. This project will enhance discovery and understanding while promoting teaching, training, and learning. A postdoctoral fellow will be trained in provenance-enhanced workflow technology and will contribute to Kurator design and implementation. Principles of data management and curation, with a focus on provenance, will be taught in undergraduate and graduate courses. This project will also enhance infrastructure for research and education through collaborations with iDigBio and the Encyclopedia of Life. Educational modules and outreach activities on data quality and data curation will be developed for undergraduates and high school educators. Several Thematic Collection Networks will provide data via iDigBio for testing to ensure dissemination of high-quality data. Critical community authority files that do not have associated web services will be made available to the greater community as Kurator actors and services.
Kurator will consist of a user-friendly web interface, through which users configure and launch workflows while maintaining provenance, and a workflow platform for rapid development of new curation services and workflow variants. The latter will also be used to "wrap" valuable domain authority files that are not currently available as services. New "curator-in-the-loop" workflow technology allows us to directly involve experts in semi-automated curation pipelines, using human interaction actors via FilteredPush, other syndication methods, and discovery environments. Kurator will allow examination of data lineages, which facilitates the assessment of credibility, supports repeatability in publication, informs legal proceedings where data are regulated, and provides context for feedback to a given resource. Kurator will facilitate digitization efforts through custom processing of raw data obtained from hard-copy specimen labels against existing services, including taxonomic name resolution, georeferencing, and duplicate specimen detection, as well as newly created customizable actors that clean the data against appropriate controlled vocabularies. Where required, semi-automated services can invoke expert review using annotation services and existing discovery environments. Curation pipelines can be integrated with other workflows for analysis of ecological, evolutionary, phenological, genomic, and related data, can be shared or repurposed easily, and can be made accessible for publication.
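The workflow pattern described above (a chain of curation actors, a provenance record for every change, and a hand-off to an expert when an actor cannot resolve a value) can be illustrated with a minimal Python sketch. This is not Kurator code or its API. The actor names, the `run_workflow` function, the record fields, and the tiny `COUNTRY_VOCABULARY` table standing in for a wrapped authority file are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Record = dict[str, str]
# An actor returns the (possibly updated) record and an optional review note.
Actor = Callable[[Record], tuple[Record, Optional[str]]]


@dataclass
class ProvenanceEntry:
    actor: str                 # name of the actor that made the change
    field_name: str            # which field of the record changed
    before: Optional[str]      # value before the actor ran
    after: Optional[str]       # value after the actor ran


@dataclass
class CurationResult:
    record: Record
    provenance: list[ProvenanceEntry] = field(default_factory=list)
    needs_review: list[str] = field(default_factory=list)


# Hypothetical stand-in for an authority file wrapped as a service.
COUNTRY_VOCABULARY = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
    "canada": "Canada",
}


def normalize_whitespace(record: Record) -> tuple[Record, Optional[str]]:
    """Collapse runs of whitespace and trim each value."""
    return {k: " ".join(v.split()) for k, v in record.items()}, None


def clean_country(record: Record) -> tuple[Record, Optional[str]]:
    """Map the country to the controlled vocabulary, or ask for review."""
    raw = record.get("country", "")
    match = COUNTRY_VOCABULARY.get(raw.lower())
    if match is None:
        # Leave the value untouched and defer to a human curator.
        return dict(record), f"country {raw!r} not in vocabulary"
    return {**record, "country": match}, None


def run_workflow(record: Record, actors: list[Actor]) -> CurationResult:
    """Run actors in order, logging every field change as provenance."""
    result = CurationResult(record=dict(record))
    for actor in actors:
        updated, review_note = actor(result.record)
        for key in sorted(set(result.record) | set(updated)):
            before, after = result.record.get(key), updated.get(key)
            if before != after:
                result.provenance.append(
                    ProvenanceEntry(actor.__name__, key, before, after)
                )
        if review_note is not None:
            result.needs_review.append(f"{actor.__name__}: {review_note}")
        result.record = updated
    return result


if __name__ == "__main__":
    raw = {"scientificName": "Quercus  alba ", "country": "usa"}
    outcome = run_workflow(raw, [normalize_whitespace, clean_country])
    print(outcome.record)
    for entry in outcome.provenance:
        print(entry)
    print(outcome.needs_review)
```

In this sketch, the provenance list plays the role of the data lineage that can later be examined, and any entry in `needs_review` marks the point where a curator-in-the-loop step would take over. A real deployment would replace the in-memory vocabulary with calls to actual name-resolution or georeferencing services.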