Feb 26, 2024

Guidance for populating and validating GenomeTrakr metadata templates (BioSample and SRA) V.11

  • US Food and Drug Administration
Open access
Protocol Citation: Maria Balkey, Ruth Timme, Candace Hope Bias, Errol Strain, Tina Lusk Pfefer 2024. Guidance for populating and validating GenomeTrakr metadata templates (BioSample and SRA). protocols.io https://dx.doi.org/10.17504/protocols.io.eq2ly3x1pgx9/v11
Version created by Ruth Timme
Manuscript citation:
Timme, R.E., Wolfgang, W.J., Balkey, M. et al. Optimizing open data to support one health: best practices to ensure interoperability of genomic data from bacterial pathogens. One Health Outlook 2, 20 (2020). https://doi.org/10.1186/s42522-020-00026-3
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: January 11, 2024
Last Modified: February 26, 2024
Protocol Integer ID: 93402
Keywords: GenomeTrakr, metadata, Pathogen package, NCBI Pathogen Detection, INSDC
Disclaimer
Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.
Abstract
PURPOSE: This protocol provides instructions for preparing and filling out the metadata templates necessary for direct submission to the National Center for Biotechnology Information (NCBI). These instructions are relevant for the majority of whole genome sequencing data submissions derived from enteric bacterial pathogens collected for surveillance purposes.

SCOPE: This protocol provides detailed instructions for the following two metadata templates:

1. BioSample metadata: guidelines for obtaining, populating, and validating the BioSample metadata template.

2. SRA metadata: guidelines for populating the sequence-level metadata template.

Version history:
v11: Change of guidance to use the One Health Enteric BioSample package for all submissions.
v10: updates to the GenomeTrakr-extended pathogen biosample template (GT-pathogen package-OHE v0.3.xlsx) and release of newly available One Health Enteric package custom templates.
v9: Bug fix
v8: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx". Also provided a direct link to the newly published One Health Enteric package.
v7: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx" and added an incremental update file for the DRAFT One Health Enteric Package that includes extensive edits compared to v6.
v6: Added the One Health Enteric package presented at IAFP 2021 meeting.
Materials
Gather the following contextual information for each pure culture isolate:

  1. organism name
  2. lab name that collected the sample
  3. collection date
  4. collection source
  5. geographic location of sample collection
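
The five items above can be checked per isolate before the template is filled in. The sketch below is a minimal illustration and is not part of GMVS; the column names (`organism`, `collected_by`, `collection_date`, `isolation_source`, `geo_loc_name`) are assumptions made for this example and should be replaced with the attribute names in the template you download.

```python
# Column names are assumptions for illustration; match them to your template.
REQUIRED_FIELDS = [
    "organism",          # organism name
    "collected_by",      # lab name that collected the sample
    "collection_date",   # collection date
    "isolation_source",  # collection source
    "geo_loc_name",      # geographic location of sample collection
]


def missing_fields(record):
    """Return the required fields that are absent or blank in one isolate record."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field, "") or "").strip()
    ]


# Hypothetical example record with one blank entry.
isolate = {
    "organism": "Salmonella enterica",
    "collected_by": "Example State Lab",
    "collection_date": "2024-01-11",
    "isolation_source": "ENV swab sponge",
    "geo_loc_name": "",
}
print(missing_fields(isolate))  # ['geo_loc_name']
```

Running such a check over every row before sequencing begins gives an early list of isolates whose contextual information still needs to be gathered.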

Before start
Before collecting sequence data for your isolates, ensure that you can provide the minimum metadata recommended by your coordinating surveillance body.

Overview

This protocol provides instructions on acquiring and completing two distinct metadata templates essential for the submission of enteric bacterial pathogen surveillance data to the National Center for Biotechnology Information (NCBI).

Two metadata templates are required for each NCBI submission:
1. BioSample: metadata describing the isolate, sample collected, and submitting lab information.
2. SRA: metadata describing the sequence data collection.



BioSample metadata
Templates for BioSample submission:

Visit the GenomeTrakr Metadata Validation System (GMVS) at https://gmvs.fda.gov/ to download custom, version-controlled BioSample metadata template(s). Current and previous versions of these templates can also be found at the OHE GitHub page.

  • Our custom templates include extensive guidance and controlled vocabularies for most attributes in the package.
  • Sub-packages are available for download covering the major One Health sample types (human/animal hosts, food, food facilities, and farm/environment). Users can choose to populate the full package, or one or more of the sub-packages.

When visiting GMVS, click on the ONE HEALTH ENTERIC icon within the NCBI Metadata Validation box.



Follow the GMVS instructions to download the BioSample metadata template (click on the cloud download icon). Choose the most appropriate template for your sample types (the full package or one of the sub-packages).

One Health Enteric Metadata Sheet Upload

Review the -Instructions- sheet within the OHE Excel file.
Instructions sheet within the OHE Excel file.
Proceed to fill out the BioSample metadata template in the -UserEntry- sheet. Where possible, use terms from the dropdown menus for each metadata attribute.

User Entry sheet within the OHE Excel file.

Validate BioSample metadata template
Upload the completed OHE metadata template to GMVS and click on the -VALIDATE- icon.

The GMVS validation system will check each entry and also run LexMapr for auto-assignment of the IFSAC category.

After completion, GMVS will report the results of the validation.

Click -OK-.


No validation errors:

If metadata passes GMVS validation, each record will be displayed with all the metadata and you will have an option to export metadata.


Click on the -EXPORT METADATA- icon.

Review the validated BioSample metadata and the LexMapr output (cleaned-up isolation_source entries and proposed IFSAC_category). Go to step #3 to review the LexMapr output.

Address validation errors:

If there are validation errors, GMVS will generate a log report. If only a few errors are reported, edit the values directly by clicking the EDIT icon; otherwise, export the reviewed template by clicking -EXPORT ERRORS-.
Make the required changes, click -RE-IMPORT SHEET-, and re-validate the template.



Evaluation of LexMapr Output
LexMapr is a tool that processes free text from isolation_source and generates standard terminologies from controlled vocabulary/ontologies, including FoodOn, GenEpiO, UBERON, ENVO, NCBI Taxon, and specific food and environmental categories from Interagency Food Safety Analytics Collaboration (IFSAC) controlled vocabulary.


Each GMVS record subject to validation is analyzed with LexMapr: the isolation_source attribute receives an ontological descriptor and a category from the IFSAC+ terminology for food safety. After the records are processed with LexMapr, a report is generated.

The LexMapr report generated at GMVS contains the following columns: strain, isolation_source, isolation_source (LexMapr generated), and IFSAC_category.
| strain | isolation_source | isolation_source (LexMapr generated) | IFSAC_category |
| --- | --- | --- | --- |
| FDA189213897_s001 | ENV swab sponge | environmental swab sponge | environmental-factory/production facility |
LexMapr Output generated during validation.
Review the LexMapr-generated recommendations for isolation_source and IFSAC_category. If you agree with the recommendations, copy the contents of these fields into the validated BioSample metadata template, under the isolation_source and IFSAC_category fields, respectively.

If the IFSAC category(s) recommended for the sample type are incorrect or not appropriate, leave that entry blank for the submission and submit a bug report to genometrakr@fda.hhs.gov.
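
When many records are involved, the copy step described above can be scripted. The following is a minimal sketch, assuming both the validated template and the LexMapr report have been loaded as lists of dictionaries keyed by column name, with the report columns named as listed above. The `rejected` argument is a hypothetical way to record strains whose recommended IFSAC category you judged incorrect; keeping the LexMapr isolation_source for those strains is also an assumption of this sketch.

```python
def apply_lexmapr(template_rows, report_rows, rejected=()):
    """Copy reviewed LexMapr recommendations into BioSample template rows.

    rejected: strains whose recommended IFSAC category was judged
    incorrect; their IFSAC_category is left blank for the submission.
    """
    by_strain = {row["strain"]: row for row in report_rows}
    merged = []
    for row in template_rows:
        updated = dict(row)  # copy, so the caller's rows are not modified
        rec = by_strain.get(row["strain"])
        if rec is not None:
            updated["isolation_source"] = rec["isolation_source (LexMapr generated)"]
            if row["strain"] in rejected:
                updated["IFSAC_category"] = ""
            else:
                updated["IFSAC_category"] = rec["IFSAC_category"]
        merged.append(updated)
    return merged


template = [{"strain": "FDA189213897_s001", "isolation_source": "ENV swab sponge"}]
report = [{
    "strain": "FDA189213897_s001",
    "isolation_source": "ENV swab sponge",
    "isolation_source (LexMapr generated)": "environmental swab sponge",
    "IFSAC_category": "environmental-factory/production facility",
}]
print(apply_lexmapr(template, report))
```

Rows without a matching strain in the report are passed through unchanged, so the output always has the same number of rows as the template.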

Save the validated biosample metadata template and proceed with NCBI submissions.
SRA sequence metadata template
Template for SRA metadata submission:

Download the generic "Metadata spreadsheet with sample names" file from the NCBI Submission Templates page, and follow the guidance in the table below.

PRO TIPS:
  1. If you have sequences to submit that belong to more than one BioProject, create a separate submission + metadata table for each of your BioProjects.
  2. Entering fastq filenames in the spreadsheet: On a Mac, you can copy the file names directly from the folder into a spreadsheet. On a PC this is not possible with copy and paste, but it can be done with a command-line operation.
  3. Finally, it is important to develop a QA/QC step to make sure the files are associated with the correct sample name. For example, use the LEFT function in Excel to strip off the appended text in the file name, and then use an exact match to confirm that the result matches the sample name.
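
Tips 2 and 3 can also be handled with a short script instead of a spreadsheet formula. This is a minimal sketch under stated assumptions: the file naming scheme `<sample_name>_<appended text>` and the example file names are hypothetical, and the `separator` parameter should be adjusted to match how your sequencing files are actually named.

```python
from pathlib import Path


def list_fastq_names(folder):
    """Return sorted fastq file names in a folder (works on PC, Mac, and Linux)."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith((".fastq", ".fastq.gz"))
    )


def check_filenames(rows, separator="_"):
    """Return (sample_name, filename) pairs where the filename does not match.

    Assumes each file is named <sample_name><separator><appended text>,
    e.g. UT-12345_R1.fastq; adjust `separator` to your naming scheme.
    """
    mismatches = []
    for sample_name, filename in rows:
        prefix = filename.split(separator, 1)[0]  # text before the first separator
        if prefix != sample_name:
            mismatches.append((sample_name, filename))
    return mismatches


rows = [
    ("UT-12345", "UT-12345_R1.fastq"),
    ("UT-12345", "UT-12345_R2.fastq"),
    ("UT-12346", "UT-12364_R1.fastq"),  # transposed digits
]
print(check_filenames(rows))  # [('UT-12346', 'UT-12364_R1.fastq')]
```

An empty list from `check_filenames` means every file name begins with its sample name; any pair returned should be corrected in the spreadsheet before submission.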

| Field | Description | Example |
| --- | --- | --- |
| sample_name | Include the same ID here as you entered for "sample_name" in the BioSample submission template. | UT-12345 |
| library_ID | The library name should be a unique ID relevant to your workflow. It can be an autogenerated ID from your LIMS system or a modification of your sample_name. | UT-12345.6 |
| Title | Short, free text description that identifies the data on public pages. For example: {methodology} of {organism}: {sample_name} | WGS of Salmonella enterica: UT-12345 |
| library_strategy | Overall sequencing strategy or approach. Choose from NCBI pick list. | WGS |
| library_source | Molecule type used to make the library. | genomic |
| library_selection | Library capture method. | random |
| Library_layout | Choose from NCBI pick list. | paired |
| platform | Sequencing platform. | Illumina |
| instrument_model | Name of the sequencing instrument. | MiSeq |
| Design_description | Free text description of methods. | |
| Filetype | File format name for the raw sequence data. Choose from NCBI pick list. | Fastq |
| Filename | Include ALL of the files resulting from this library. **Add additional fields if there are more than two files (e.g. Filename3). | genome_r1.fastq (*must be exact) |
| Filename2 | | genome_r2.fastq (*must be exact) |
| Filename3-8 | List other fastq file names (e.g. for NextSeq data). | |
SRA metadata template guidance and examples for WGS submission.
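
The Title pattern and the example values from the table can be filled in programmatically. The following is a minimal sketch, not an NCBI tool: the lowercase column spellings are an assumption and should be checked against the header row of the spreadsheet you downloaded, and the `library_suffix` parameter is a hypothetical way to derive a library_ID from the sample name. The fixed values are the table's examples; replace them with the pick-list values that describe your own run.

```python
def make_sra_row(sample_name, organism, filenames, library_suffix=".1",
                 methodology="WGS"):
    """Build one SRA metadata row as a dict, using the table's example values."""
    row = {
        "sample_name": sample_name,
        "library_ID": f"{sample_name}{library_suffix}",
        "title": f"{methodology} of {organism}: {sample_name}",
        "library_strategy": "WGS",
        "library_source": "genomic",
        "library_selection": "random",
        "library_layout": "paired",
        "platform": "Illumina",
        "instrument_model": "MiSeq",
        "design_description": "",
        "filetype": "fastq",
    }
    # The first file goes in "filename"; later ones in filename2, filename3, ...
    for i, name in enumerate(filenames, start=1):
        key = "filename" if i == 1 else f"filename{i}"
        row[key] = name
    return row


row = make_sra_row(
    "UT-12345", "Salmonella enterica",
    ["genome_r1.fastq", "genome_r2.fastq"], library_suffix=".6",
)
print(row["title"])  # WGS of Salmonella enterica: UT-12345
```

Because the file columns are generated from the list length, a library with more than two files gets the additional filename fields automatically.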


Save the second sheet (SRA_data) as a TSV (tab-delimited file) for upload in the “SRA metadata” tab within the submission portal.
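
If the metadata rows are assembled outside of Excel, the tab-delimited file can be written directly. This is a minimal sketch using Python's standard `csv` module; the row contents and the output file name mentioned below are hypothetical examples.

```python
import csv
import io


def write_sra_tsv(rows, handle):
    """Write SRA metadata rows (dicts) to an open text handle as tab-delimited text."""
    # Collect columns in first-seen order so rows with extra filename
    # columns (e.g. filename3) do not lose data.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"sample_name": "UT-12345", "filename": "genome_r1.fastq",
     "filename2": "genome_r2.fastq"},
]
buffer = io.StringIO()
write_sra_tsv(rows, buffer)
print(buffer.getvalue())
```

To produce a file for upload, pass a real file handle instead of the in-memory buffer, for example `open("SRA_data.tsv", "w", newline="")`.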


*NCBI should also accept the original Excel-formatted file.