GEODE CURATE-A-THON logo GEODE CURATE-A-THON

Project Background

When you hear the term “biodiversity” you probably think of the millions of species that inhabit our forests, savannahs and oceans. However, each species comprises many individual organisms, which share a common heritage, their DNA. While individuals within species all have very similar DNA, each one is also distinguished by hundreds of mutations throughout their genome. In fact, it is these mutations that allow some individuals to adapt to changing environments, while others cannot: a phenomenon called natural selection. Species and populations that have a larger amount of diversity in their collective DNA (i.e. genetic diversity) are thus more likely to be able to adapt to climate change than those that have low genetic diversity. However, despite the recognized importance of genetic diversity, data about the genetic diversity of populations are not often incorporated into large scale conservation policy decisions. One reason for this is that the DNA sequence data are not FAIR (findable, accessible, interoperable and reusable).

While a culture of open data sharing exists for raw digital DNA sequence information, the goal of comprehensively summarizing and monitoring Earth’s genetic biodiversity remains largely elusive because these records lack sufficient geospatial and temporal metadata. You may be familiar with metadata, that, when attached to a digital picture taken by your smartphone, can tell a computer when and where that photo was taken. While most smart phones automatically attach such metadata to photos, only about 13% of potentially biodiversity-relevant genetic data in the Sequence Read Archive (SRA) of the International Nucleotide Sequence Collaboration (NCBI + DDBJ + EMBL) currently have temporal or geospatial metadata attached to them (Toczydlowski et al. 2021). This means that the other 80% are nearly useless for meta-analysis of essential biodiversity variables such as heterozygosity, divergence, and effective population size.

The Genomic Observatories Diversity Explorer (GEODE) project aims to visualize the world’s most foundational, yet under appreciated, layer of biodiversity: genetic diversity. Previous efforts to capture essential spatiotemporal metadata for population genomic datasets from the Sequence Read Archive will be expanded, and genetic variants will be called using a bioinformatic pipeline on Microsoft Azure Cloud to calculate essential biodiversity variables for genetic composition. GEODE will enable open access to genetic diversity information for government agencies and conservation NGOs, as well as for the general public.

Problem Statement:

INSDC houses lots of genomic/genetic sequences from a breadth of sample types such as space samples, covid samples, medical human samples, barnacles, etc. to name a few. Currently, there is not an easy way to sort/filter sequence data in INSDC to find datasets relevant to biodiversity conservation and monitoring which focuses on wild populations. The information necessary to do this and to reuse the dataset itself in the future is usually in associated papers.

Therefore, we need you to:

Instructions:

Please read through the protocol instructions before you begin your first data curation review.

Overall, the curation you will be involved in will progress through the following steps:

  1. Search for published papers associated with the assigned BioProject using the Paper Search Strategies provided.
  2. Link the BioProject record to the associated published paper in the GEODE Curate-A-Thon Google Form (see the BioProjects Google Form Overview for more information).
  3. Determine BioProject relevance for GEODE based on information within the associated published paper. Directions on how to determine relevance are available in the How to Determine Relevance for the GEODE Curate-A-Thon section of this protocol. Enter the relevance status into the BioProjects Google Form.
  4. If the Bioproject is determined to be relevant, determine whether or not spatial (information about the location where a sample was collected (e.g., location name, latitude/longitude)) or temporal (information about when a sample of collected (e.g., date, date range)) metadata can be located within the published paper or associated documents (supplemental materials, data repository, etc.). Enter this information in the BioProjects Google Form.

The work you will be doing during this Curate-A-Thon is important as it will have a direct impact on future use of these genomic/genetic sequence data because reuse depends on correct, well documented information being associated with each record. Therefore, we have provided you with several resources throughout this guide that will assist you in the process (e.g., tutorial videos, guidance, etc.), BUT please do not hesitate to reach out to your Curate-A-Thon hosts, with any questions throughout the process.

Next