GEODE CURATE-A-THON

About The Data You Will Be Curating

Today, we will be working with the metadata associated with genetic and genomic sequence data extracted from the databases of the International Nucleotide Sequence Database Collaboration (INSDC). The genomic DNA sequence data that these metadata describe were collected for a variety of primary research purposes, from building phylogenies that show how species are related to each other to finding particular genes under natural selection. However, all of these datasets are potentially relevant for a secondary purpose: describing global genetic diversity.

To be best prepared to work with these data, it is important to understand how the data are structured and how accession numbers are assigned (Figure 1).

Figure 1: Overview of the distinction between BioProjects and BioSamples

The first level of structure is the BioProject.

BioProject (dataset of genetic sequences): a collection of biological data for a single initiative, originating from a single organization or from a consortium. This is the umbrella under which all sample information and sequence data files are submitted, so it gives users a single place to find links to the diverse data generated for that project and deposited into the archival databases maintained by members of the INSDC. Often, the BioProject includes information about data type, sample scope, organism or common taxonomic branch (e.g., primates), and BioProject release date, and potentially research grant information as well as BioSample and publication information. BioProject Accession Number Format: PRJNAxxxxxx (e.g., PRJNA526235)

This is followed by the next level of structure, the BioSample.

BioSample (individual sample in the dataset of genetic sequences/BioProject): includes descriptive information about the physical biological specimen from which the experimental data are derived. Examples include a BioSample from a cell line, a tissue biopsy, or an environmental isolate. BioSample Accession Number Format: SAMNxxxxxxxx (e.g., SAMN11091118)
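
When curating, it can help to check quickly whether an identifier looks like a BioProject or a BioSample accession. The following is a minimal sketch based only on the two formats described above. The function name `classify_accession` is hypothetical, and the patterns assume the prefix is followed by digits only, with the number of digits left open rather than fixed at the count shown in the examples.

```python
import re
from typing import Optional

# Patterns inferred from the formats in the text (PRJNAxxxxxx, SAMNxxxxxxxx).
# Assumption: the prefix is followed by one or more digits; the exact digit
# count is not enforced here.
BIOPROJECT_PATTERN = re.compile(r"PRJNA\d+")
BIOSAMPLE_PATTERN = re.compile(r"SAMN\d+")


def classify_accession(accession: str) -> Optional[str]:
    """Return 'BioProject', 'BioSample', or None for an unrecognized string."""
    text = accession.strip()  # tolerate stray whitespace from copy and paste
    if BIOPROJECT_PATTERN.fullmatch(text):
        return "BioProject"
    if BIOSAMPLE_PATTERN.fullmatch(text):
        return "BioSample"
    return None


if __name__ == "__main__":
    # The two example accessions given in the text, plus one non-accession.
    for value in ["PRJNA526235", "SAMN11091118", "not-an-accession"]:
        print(value, "->", classify_accession(value))
```

This check only looks at the shape of the identifier. It does not confirm that the accession exists in the archive or that a given BioSample belongs to a given BioProject.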
