Crowdsourcing Europe
Improving access, usability and enriching data on 385 million natural history specimens

Laurence LIVERMORE (1), John TWEDDLE (1) & Rob CUBEY (2)
(1) Natural History Museum, London; (2) Royal Botanic Garden Edinburgh
NBN Crowdsourcing Data Capture Summit, 25 September 2015

Crowdsourcing Europe - Overview
- Intro
- SYNTHESYS Project
- Crowdsourcing research & key findings
- Why build a new platform?
- Platform functionality
- What will it do?
- Strategy & relevance to other organisations
- Future & concluding remarks

What is SYNTHESYS?
Overall aim: to create an integrated European infrastructure for researchers in the natural sciences
- EU FP7 framework project
- 18 Partners
- 3 core strands of work:
  - Transnational Access improves accessibility of natural history collections through funded physical access to collections / expertise and facilities.
  - Joint Research Activities improve access to data stored digitally within NH collections by extracting and enhancing data from digitised collections.
  - Network Activities deliver collection management policies, best practice models, unified standards and protocols for new and emerging collections.

SYNTHESYS Joint Research Activities
- Automated data collection from digital images
- DNA sequencing viability
- New methods for 3D digitisation of NH collections
- Access and management of an integrated European digital collection (with NA2)
- Crowdsourcing metadata enrichment of digital images
  - Led by: RBGE (lead), NHM, MfN
- Quantitative colour analysis

Crowdsourcing metadata enrichment of digital images = label transcription (for now)
- Applied human intelligence is still required for label transcription
- Some of the issues that are very challenging to solve computationally are:
  - Diversity and irregularity of labels e.g. shape, size, contents
  - Recognising and mapping of label data to atomised fields is complex
  - Label data can be duplicated
  - Label data can be irrelevant or contradictory
  - Mixture of handwritten and printed text

Crowdsourcing Landscape c. 2014
- Crowdsourcing landscape changed since planning (2011-2012)
- Many platforms (recently) launched!
- SYNTHESYS partners developing/using platforms
- Growing understanding of best practices (Ellwood et al, 2015)
Ellwood et al 2015. doi: 10.1093/biosci/biv005

Research & Requirements Gathering
- General research report (sent to all survey participants)
- Platform comparisons
- Case studies
- Motivation, participation
- Organisational investment
- Functional requirement survey & platform assessments

Initial Platform Comparison
[Table: feature comparison of platforms; platform names and most cells did not survive. Remaining values: 145,574; 1,365,200; 1,025,033; ?. Plat. Age: 4 years, 7 years, 3 years, 2 years, 2 years. Remaining row labels: Data Entry, Mobile.]
Statistics gathered on or around 01/08/2014. Platform age is rounded up.

NHM Case Study: Notes from Nature
- Led by Tim Conyers and Robert Prys-Jones
- Bird register project: initial test project for NfN
- 2,950 pages
- 315,785 transcriptions
- 75% of transcriptions by 1 volunteer!
- Project page: http://www.notesfromnature.org/#/archives/ornithological
- Contributor stats: http://data.nhm.ac.uk/dataset/notes-from-nature/resource/7f8fc5f5-90ae-4959-b2869cb7951f2875?view_id=ce329dfd-99cb-4223-b615-ce95d6c707c7

RBG Kew Case Study: [email protected]
- Led by Sarah Phillips
- British herbarium sheet transcription
- 13,000 transcriptions (2012-2014)
- Established community generated high quality data, even from handwriting interpretation

NHM Case Study: Orchid Observers
- Led by John Tweddle & Mark Spencer (+ AMC Team)
- Combining contemporary recording with historical datasets
- 1,000 participants, 30,000 classifications, 1,800 field records
- 200 new orchid locations (incl. for threatened spp.)
- New recorders, new activity for existing enthusiasts
- Preliminary analysis already found flowering dates are 10 days earlier for 2 orchid species
- www.orchidobservers.org

Crowdsourcing vs in-situ Transcription
- Report by Santos et al comparing NfN vs internal transcription
- Super volunteer more accurate and effective
- Registered users transcribed more than anonymous volunteers
- Anonymous/unregistered volunteers have higher error rates

                      Records   Errors   Error %
In situ temp. staff    10,677       26      0.24
In situ students        3,700       22      0.59
NfN registered         80,019    2,184      2.73
NfN anonymous          13,673    1,768     12.93

Crowdsourcing vs in-situ Transcription: Recommendations
- Strongly recommend review-based transcription & multi-stage QC
- Need to offer better training to volunteers (but when?)
- Mechanisms to review incomplete submissions (either human or technical error)
- Highlighted benefits of analysing data - some errors and platform issues could have been fixed earlier

Participant motivation - why does it matter?
- CS isn't free and participation isn't a given!
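The error percentages in the crowdsourcing vs in-situ comparison table can be recomputed from the record and error counts. The following is a minimal sketch, not part of the Santos et al report: the names `GROUPS` and `error_rate` are illustrative, and the error count of 22 for in-situ students is an assumption taken from the stray value in the extracted table (it is the count that reproduces the reported 0.59%).

```python
# Record and error counts as given in the comparison table.
# ASSUMPTION: the "In situ students" error count (22) is inferred from a
# stray value in the extracted table; it reproduces the reported 0.59%.
GROUPS = {
    "In situ temp. staff": (10_677, 26),
    "In situ students": (3_700, 22),
    "NfN registered": (80_019, 2_184),
    "NfN anonymous": (13_673, 1_768),
}


def error_rate(records: int, errors: int) -> float:
    """Return errors as a percentage of records, rounded to 2 decimals."""
    return round(100 * errors / records, 2)


# Percentage error rate per transcriber group.
rates = {name: error_rate(r, e) for name, (r, e) in GROUPS.items()}

for name, rate in rates.items():
    print(f"{name:<22}{rate:>6.2f}%")
```

Keeping the raw counts alongside the percentages makes it easy to re-check the figures when new batches of transcriptions are reviewed.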
- Understanding why volunteers participate in crowdsourcing endeavours and how to support, maintain and reward their involvement is central to success
- Narrative, tasks, supporting resources & feedback all affect participation
- Social aspects of crowdsourcing are critical and should not be ignored
- Motivations of participants vary and can be hard to determine
- Increasing number of studies, but biased coverage

Initial decision to participate
- Enthusiasm and interest in project topic
- Desire to record, find and discover
- Learning and development of new skills
- Contribution to the greater good (society/science)
- Sense of purpose and belonging to a community (social)

On-going support & reward - what works?
- On-going, rapid feedback and thanks
- Evidence that the data are being used
- Social interaction and community
- Personal learning and progression
- Recognition and reputational gain (incl. super-contributors)
- Awards, games, badges, leaderboard (work for some people, not others)

So what does this mean as a practitioner?
- Projects need to be personally and socially relevant to succeed
- Motivations of participants often quite different to those of project designer
- One size rarely fits all - danger of making assumptions
- Key to success is working with and understanding target participants and adapting

Report conclusions: project choice and design
- Clear project rationale with both cultural and scientific benefits
- Projects should be actively promoted and monitored
- Scientists should be visible and engaged with volunteers
- Develop best practice for motivating and retaining volunteers (self-establishing community structure and forum, good science, tasks of interest, different rewards etc)
- Platform should use existing data standards - reduce bottleneck for collections management ingestion
- Resulting data should be freely available - projects do not end when all tasks are complete!

Areas of Organisational Investment
- Communication, outreach and support (e.g. dedicated staff time to develop and provide feedback to an external community, internal project manager and scientists)
- Strategic project selection (e.g. strong narrative, potential scientific outputs, public appeal, well-structured tasks of known complexity)
- Preparation of underlying data (e.g. data for autocomplete fields such as collector names or localities)
- Post-processing of data and subsequent import into institutional collections management system (?)
- Technical infrastructure (e.g. software, hardware and developers)

Functional requirements
- Surveyed 14 EU partners
- Captured functional requirements
- Prioritised using MoSCoW method
- Requirements written up as user stories after identifying key user roles

MoSCoW Method
- Must Have
- Should Have
- Could Have
- Won't Have
Example user story: "As a Community Manager, I want to be able to queue projects, so when one project gets completed a new one goes live, so Volunteers always have content."

Platform Requirements
- Platform as a service
- Strong management functionality
- Organisational control
- API (micro services) to allow embedding in mobile and institutional websites
- Key functionality (for example):
  - Review-based transcription
  - Full task archiving
  - Multilingual support
  - Georeferencing & mapping support

Platform Choice
Smithsonian Institution's Transcription Centre
- Strong collaboration potential/expertise
- Met many functional requirements
- Open source & Drupal-based
- Highly customisable (in-house and externally)
- Significant NHM developer experience
But not restrictive
- Still encourage partners to use other systems
- ALA, Les Herbonautes, Panoptes
- Differing functionality & specialisms
- NHM still intends to work with Zooniverse

What are our plans?
- Technical analysis of major platforms
- Functional requirements document
- Finalise technical specification
- Hire developer(s)
- Joint development and design work (NHM, Smithsonian, Simbiotica)
- User acceptance testing
- Launch in August 2016!

SYNTHESYS Roadmap
Q3 2015 Initiation | Q4 2015 Alpha | Q1 2016 | Q2 2016 Beta | Q3 2016 Launch
[Roadmap chart: the placement of each activity against the quarters did not survive. Activities listed:]
- Core Platform development deliverables/milestones refinement
- Internal UAT - volunteers/staff
- Public UAT/soft launch
- Hard launch [31 Aug 2016]
- Developer recruitment
- Consortium testing
- Finalise launch functionality
- Promotion
- List of potential launch projects
- Confirm launch projects
- Prepare launch projects
- Report on usage and statistics
- Seek additional funding
- Future project reserve list
- Draft designs implemented
- Post-launch functionality
- Workflow refinement
- Final designs implemented

Risks
- Developer recruitment
- Challenging financial climate
- Multiple partners/stakeholders
- CMS integration currently a massive bottleneck for all our digital projects

Why should you be interested in crowdsourcing?
- A stronger online presence/brand
- Increased rate of collections digitisation (100k+/day?), hence access to data
- Higher scientific output
- An effective way of engaging (dispersed) members of the public
- Deeper and more meaningful engagement with our collections

Why should you be interested in the SYNTHESYS platform?
- Platform model would work for institutes of all sizes
- Established scalable platform model
- Reduces technical overheads
- Modular structure allows customisation
- Open international collaboration (e.g. iDigBio/Smithsonian)
- Resulting data will be available for research (Data Portal)

Future
- Directly doing research through crowdsourcing
- Deeper engagement with volunteers (visiteering)
- Tracking our data, benefits, impact and repatriation
- Dual approach for transcription - combine with OCR and intelligent sorting
- Beyond transcription

Closing Remarks
- We need more data to do better crowdsourcing:
  - Raw (unreviewed) transcription data
  - Volunteer demographics
  - Motivation for initial and sustained user engagement
  - Experimental data on optimal UI configurations
- Produce more education and outreach materials to complement public engagement
- Recruiting & keeping developers is a challenge!
- Collaboration & partnerships are good but often result in compromises! (open source + modular helps but is )
- Free platforms still require community management to get best results
- If you have any relevant information please share! Anecdotal information, raw or processed transcription data welcome: [email protected]

Acknowledgements
- SYNTHESYS: JRA Objective 3 & NA3 Groups
- Smithsonian Institution: Meghan Ferriter & Michael Schall
- Other Contributors: Simon Chagnoux, Libby Ellwood, Paul Flemons, Tom Humphrey and Deborah Paul
- NHM: Celena Bretton, Tim Conyers, Lucy Robinson, Ben Scott, Vince Smith, Ali Thomas

References
Ellwood, E.R., B. Dunckel, P. Flemons, R. Guralnick, G. Nelson, G. Newman, S. Newman, D. Paul, G. Riccardi, N. Rios, K. C. Seltmann and A. R. Mast. (2015). Accelerating digitization of biodiversity research specimens through online public participation. BioScience. doi: 10.1093/biosci/biv005

Developing Specifications
- It's a support tool but also a service
- Some of the existing platforms will undoubtedly work for you
- We will have a developer in post sometime in October
- High level ideas for the NHM
- Other museums are not just about Natural History - we have other needs, so good to get feedback
- Project has a lot of stakeholders in SYNTHESYS but we are also aiming at other institutions in the UK and Europe
- Want to develop our role as a virtual hub for Citizen Science
- Want to use this session partly for requirements gathering

Orchid Observers Data
- Very preliminary analysis!
- Median flowering dates for Early-purple and Green-winged orchids are 10 days earlier cf. museum data (1830-1970)

Functional Item: Must Have
- User guides and help
- Review-based transcription
- Support for relevant data standards
- Project descriptions
- Mechanism to report issues with projects or tasks
- Summarise active projects and their progress
- Different privileges within site
- Support for exporting data for clean-up or analysis using external services
- Templates for project creation
- Linked project-level documentation
- All projects and tasks should be archived on the site
- Zoomify-style interface
- Ability to import lists for controlled validation
- Ability to map and export data in different formats
- Hover-over help
- Tools for analysing and assessing the quality of user contributions
- Interactive examples/tutorials
- Ability to edit "live" projects
- Ability to create custom data export templates
- Standard field types and basic validation
- Permission-based administration
- Project progress bars
- Top users
- Support for maps to display georeferenced data

Functional Item: Should Have
- Custom fields for data entry
- Ability for users to filter projects and tasks within projects based on their areas of interest
- A georeferencing tool that allows users to generate coordinates from locality information
- An annotation tool that includes determinations to capture data from more expert users
- New/featured project section
- Ability for users to ask general questions about projects
- User notifications
- Ability for users to request help from a dedicated community member or project experts
- Dynamic lists
- Ability to host and run multiple crowdsourcing projects at one time
- Links to content and project outputs
- Control hub for users
- Multikeying/multi-pass transcription
- News feed to display updates
- Reporting tools
- Localisation support
- Ability to contact all project volunteers
- Ability for users to submit or query records for discussion
- Simple content management
- Links to information to help with tasks (e.g. BHL, taxonomic catalogues, community created content)
- Potential to develop mobile/tablet based apps using API
- Flexible theming
- A modular structure to support different task types
- Support for organisations/institutes to use single sign on technology for internal users
- Built-in read/write API that is used by platform as primary means for delivering and creating content (e.g. dogfooding paradigm)
- Support for Google Analytics
- Support for public/community responses to tasks and discussions
- Simple site-wide user statistics

Functional Item: Could Have
- Embedded videos
- Simple (non-HTML) interface for editing project information
- Ability to serve OCR text to users for correction
- Support for external users to use social media logins
- Ability to embed and display content from the platform on other websites
- Potential to integrate handwriting recognition in the platform
- Project blog
- Ability to queue projects
- Ability for users to share links to transcriptions/tasks to social media networks
- Links to information for discovery/educational purposes (e.g. EOL, Wikipedia, National Portals)
- Support for users to create their own resources to support a project
- Support for anonymous (unregistered user) contributions
- Support for markup (formatting in data entry fields)
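
The "Multikeying/multi-pass transcription" requirement listed above means that several volunteers transcribe the same label and the platform reconciles their answers. The sketch below shows one simple way this could work, assuming majority agreement after light normalisation. The function names and the `min_agreement` threshold are hypothetical and do not describe how the SYNTHESYS platform or any existing platform implements it.

```python
from collections import Counter
from typing import Optional


def normalise(value: str) -> str:
    """Collapse whitespace and ignore case so trivial differences still match."""
    return " ".join(value.split()).casefold()


def consensus(transcriptions: list[str], min_agreement: int = 2) -> Optional[str]:
    """Return the agreed transcription, or None if the task needs review.

    ASSUMPTION: a value is accepted when at least `min_agreement` volunteers
    entered it (after normalisation) and no other value is equally common.
    """
    if not transcriptions:
        return None
    counts = Counter(normalise(t) for t in transcriptions)
    top, top_count = counts.most_common(1)[0]
    if top_count < min_agreement:
        return None
    # A tie between different values cannot be resolved automatically.
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return None
    # Return the first volunteer's spelling of the winning value,
    # with whitespace tidied.
    for t in transcriptions:
        if normalise(t) == top:
            return " ".join(t.split())
    return None


agreed = consensus(["Kew Gardens", "kew  gardens", "Kew Garden"])
needs_review = consensus(["Surrey", "Sussex"])
```

Tasks that return `None` would go to a reviewer, which fits the recommendation for review-based transcription and multi-stage QC.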