Populating NCBI template for submissions using BioNumerics

Ruth Timme; Maria Balkey; Julie Haendiges; Brian Sauders; Tina Lusk Pfefer

Feb 14, 2024

DOI

dx.doi.org/10.17504/protocols.io.3byl4qn4ovo5/v1

¹US Food and Drug Administration;
²New York State Department of Agriculture & Markets

Maria Balkey

US Food and Drug Administration

Protocol Citation: Ruth Timme, Maria Balkey, Julie Haendiges, Brian Sauders, Tina Lusk Pfefer 2024. Populating NCBI template for submissions using BioNumerics. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl4qn4ovo5/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: February 09, 2024

Last Modified: February 14, 2024

Protocol Integer ID: 94941

Keywords: NCBI submission, BioNumerics, biosample, SRA, metadata, bioproject

Abstract

PURPOSE: to define the standard operating procedure for collecting isolate metadata using BioNumerics for submission of food/environmental isolates to NCBI.

SCOPE: to provide a standardized procedure to collect isolate metadata using BioNumerics for submission of food/environmental isolates to NCBI.

RESPONSIBILITIES- SOP Responsible Officials: Ruth Timme, Maria Balkey

The GenomeTrakr Network Management is responsible for monitoring GenomeTrakr submissions processed through BioNumerics and for ensuring that all GenomeTrakr (GT) labs are familiar with the mandatory metadata fields required to submit GenomeTrakr sequencing records to NCBI.

V3: Added dropdown menus from controlled vocabulary for sequenced by and project name to the metadata template PulseNet_Bionumerics_Isolate_Metadata.
V4: Changes to the metadata template PulseNet_Bionumerics_Isolate_Metadata:
- Added dropdown menus from controlled vocabulary for collected_by and SourceCountryState.
- Added fields: collected by, isolation source.
- Added mapping table of attribute names.
- Removed the requirement to send a BioSample update to NCBI in order to change sequenced by and project name.

Metadata SampleSheet preparation 

Before uploading your sequencing run or linking NCBI sequencing records in BioNumerics, fill out the metadata spreadsheet form.

Please download the template and guidelines included in the file PulseNet_Bionumerics_Isolate_Metadata.xlsx (64 KB).

Create the following fields in BioNumerics if they are not already present in the interface: NCBI_bioproject, Attribute_package, Organism_name, NCBI_LabID, Collected by, SourceCountryState, Latitude_longitude, ProjectName, SequencedBy, Isolation source.

Once you have filled out the template information, save the template sheet as .csv and import the metadata to BioNumerics. 
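Before importing, it can help to confirm that the saved .csv actually contains the fields listed above. The following is a minimal sketch in Python, not part of the PulseNet or BioNumerics tooling; the function name `missing_columns` and the example header are hypothetical, and the field labels are copied from the list above, so adjust them if your database uses different names.

```python
import csv
import io

# Field names copied from the list above (assumed to be the CSV column headers).
REQUIRED_FIELDS = [
    "NCBI_bioproject", "Attribute_package", "Organism_name", "NCBI_LabID",
    "Collected by", "SourceCountryState", "Latitude_longitude",
    "ProjectName", "SequencedBy", "Isolation source",
]

def missing_columns(csv_text, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical header row that lacks two of the fields.
example = (
    "Key,NCBI_bioproject,Attribute_package,Organism_name,NCBI_LabID,"
    "Collected by,SourceCountryState,ProjectName,SequencedBy\n"
)
print(missing_columns(example))  # ['Latitude_longitude', 'Isolation source']
```

To check a real file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `missing_columns`.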

The metadata fields created in BioNumerics map to metadata fields at NCBI. Table 1 describes each field submitted to NCBI alongside the corresponding field name in the BioNumerics templates.
 
Field Name at BioNumerics NCBI Submission Prompt	Field Name at NCBI	Field Name in BioNumerics Submission Metadata Template	Description
BioProject accession	BioProject	NCBI bioproject	The accession number of the BioProject(s) to which the BioSample belongs (PRJNAxxxxxx). Double check that you are submitting to the correct BioProject (the organism name must match the one designated for your BioProject). For species that fall outside of NCBI Pathogen Detection, we recommend establishing a separate multi-species "research" bioproject for publishing data outside of the structured Pathogen Detection surveillance effort.
Attribute package	attribute_package	attribute_package	This field provides the pathogen type (or "isolation type"). Allowed values are "Pathogen.cl" (for human clinical pathogens) or "Pathogen.env" (for environmental, food, or animal clinical isolates). The value provided in this field drives validation of other fields and cannot be left blank.
Strain name	strain	Key	This is the authoritative ID used for foodborne pathogen genomic epidemiology and within NCBI Pathogen Detection. Although the strain ID can have any format, we suggest that it be unique, concise, and consistent within your laboratory (e.g. CFSAN123456).
Serovar	serovar	Serovar	The organism serovar/serotype name should include the most descriptive information you have at time of submission, adhering to proper nomenclature in the NCBI taxonomy database: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi. Check spelling carefully!
Isolate name alias	isolate name alias	isolate_name_alias	Other IDs associated with this isolate. Separate with ';' if more than one.
Project name	project name	ProjectName	Name of the project within which the sequencing was organized.
Collected by	collected by	collected by	Full name of the laboratory or agency that collected the sample or has taken over curation of the physical isolate. The name should be written out in full (with minor exceptions) and be consistent across multiple submissions. Example: Washington State Department of Health.
Collection / Isolate date	collection date	IsolateDate	Date on which the sample was collected. Populate using the ISO 8601 standard: "YYYY-mm-dd", "YYYY-mm" or "YYYY" (e.g., 1990-10-30, 1990-10, or 1990). Including the month or month/day of collection is extremely valuable for assessing seasonality in the database.
Geographical origin	geographic location	SourceCountryState	Populate the geographic origin of the food product. Include the country name if imported, or the "Country: state/territory/province" if domestic. Include multiple locations if necessary, delimited by semicolons.
Geographical coordinates	latitude and longitude	lat_long	The geographical coordinates of the location where the sample was collected. Specify as degrees latitude and longitude in the format "d[d.dddd] N|S d[dd.dddd] W|E", e.g., 38.98 N 77.11 W. If information is unavailable for any mandatory field, please enter 'not collected', 'not applicable' or 'missing' as appropriate.
Isolation source	isolation source	isolation source	Free text, short description of the sample source. Avoid generic terms such as patient, sample, food, surface, clinical, product, source, or environment. Example: bagged romaine lettuce.
Host	host	host	For human, animal, and plant hosts, include the full taxonomic name of the host when available, e.g. "Homo sapiens" or "Bos taurus". Animal livestock terms are also acceptable entries, e.g. porcine, bovine, equine, etc.
Host disease	host disease	host_disease	Name of the relevant disease, e.g. Salmonella gastroenteritis. Choose an ontological term from https://bioportal.bioontology.org/ontologies/DOID or https://www.ncbi.nlm.nih.gov/mesh. The attribute is mandatory for Pathogen.cl isolates (human clinical isolates); enter "missing" if unknown. Leave blank if not relevant.
Sequenced by	sequenced by	SequencedBy	The name of the agency that generated the sequence, e.g., Centers for Disease Control and Prevention.
Source name/type	source type	SourceType	Controlled vocabulary describing the isolation_source. Choose the best fit term: Human, Animal, Food, Environmental, Other.
Organism name	Organism	organism	The organism name should include the most descriptive information you have at time of submission, adhering to proper nomenclature in the NCBI taxonomy database: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi. Check spelling carefully! Levels of valid organism names are as follows. Genus species: Salmonella enterica; Listeria monocytogenes. Genus species and subspecies: Salmonella enterica subsp. enterica. Determined serotype or serovar (traditional or WGS-based): Escherichia coli O104:H7; Salmonella enterica subsp. enterica serovar Agona; Salmonella enterica subsp. diarizonae serovar 16:z10:e,n,x,z15; Listeria monocytogenes serotype 1/2a. If NCBI doesn't have the desired organism name, enter the name determined by your laboratory. After submission, a "taxonomy consult" will take place to evaluate the new name. Sometimes the organism name is changed to a canonical serovar name and the submission proceeds. It is also possible that the serovar is a novel one not currently in the NCBI database, in which case the Taxonomy team will work with the submitter to get the new name added to the database.
Table 1: Metadata attributes for GenomeTrakr
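Several of the formats in Table 1 are easy to get wrong in a spreadsheet, so a quick check before import may save a failed submission. The sketch below is one possible way to test the collection date, latitude/longitude and attribute package formats described in the table. The function names are hypothetical and the checks are assumptions based only on the descriptions above, not NCBI's own validation rules.

```python
import re
from datetime import datetime

# Allowed values and placeholder terms as stated in Table 1.
ALLOWED_PACKAGES = {"Pathogen.cl", "Pathogen.env"}
MISSING_TERMS = {"not collected", "not applicable", "missing"}

DATE_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
DATE_FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}
LAT_LONG = re.compile(r"(\d{1,2}(?:\.\d+)?) [NS] (\d{1,3}(?:\.\d+)?) [WE]")

def valid_collection_date(value):
    """True for YYYY, YYYY-mm or YYYY-mm-dd values that are real calendar dates."""
    if not DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, DATE_FORMATS[len(value)])
    except ValueError:
        return False
    return True

def valid_lat_long(value):
    """True for values such as '38.98 N 77.11 W' within the valid degree ranges."""
    match = LAT_LONG.fullmatch(value)
    if not match:
        return False
    return float(match.group(1)) <= 90 and float(match.group(2)) <= 180

def valid_attribute_package(value):
    """True only for the two package names given in Table 1 (case-sensitive)."""
    return value in ALLOWED_PACKAGES

print(valid_collection_date("1990-10"))     # True
print(valid_collection_date("10/30/1990"))  # False
print(valid_lat_long("38.98 N 77.11 W"))    # True
print(valid_lat_long("38.98, -77.11"))      # False
```

A value found in `MISSING_TERMS` can be accepted separately for mandatory fields where the information is unavailable.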
 
 

NCBI Submission Settings (Manage submission template)

Create the NCBI metadata template in BioNumerics following PulseNet instructions, making sure the fields are populated according to the GenomeTrakr (GT) requirements described in the following steps.

BioProject and Organization: GenomeTrakr labs that submit independently become owners of their data and are responsible for managing an individual BioProject for each sequenced organism. The term 'field content' denotes that a template value (e.g. BioProject accession) maps to a field in BioNumerics (e.g. NCBI_bioproject).

Fig 1. NCBI Submission Template: BioProject and Organization

Laboratories submit to BioProjects that are specific to their lab and organism. Find your organism/lab-specific BioProject under the GenomeTrakr umbrella BioProjects listed at https://www.ncbi.nlm.nih.gov/bioproject/593772

Make sure to submit to your lab's BioProject. Please do not submit to umbrella BioProjects.

BioSample: Metadata associated with the isolate might require the creation of new fields in BioNumerics. The term 'field content' denotes that a template value (e.g. Organism name) maps to a field in BioNumerics (e.g. OrganismName). Template values might also map to default values, e.g. Pathogen: environmental/food/other; version 1.0. Make sure to include the metadata associated with the isolates in mandatory fields such as: Submitter Provided Unique ID, BioSample accession (output), Organism name, Title, Attribute package, Strain name, Isolate name alias and Project name. Isolate name alias is a mandatory field for GenomeTrakr submissions. Provide serovar when available.

Fig 2. NCBI Submission Template: BioSample

BioSample: Make sure to include the metadata associated with the isolates in mandatory fields such as: Collected by, Collection / Isolate date, Collection / Isolate date format, Title, Geographical origin, Isolation source, Sequenced by and Source name/type. Isolate name alias is a mandatory field for GenomeTrakr submissions. Provide Geographical coordinates when available. For human, animal, and plant hosts, include the full taxonomic name of the host when available, e.g. "Homo sapiens" or "Bos taurus". Animal livestock terms are also acceptable entries, e.g. porcine, bovine, equine, etc.
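The mandatory-field rules above, together with the host disease rule from Table 1, can be checked per isolate before submission. This is a rough sketch only: it assumes each isolate is held as a Python dictionary keyed by the BioNumerics submission prompt names, covers only the mandatory fields that come from the metadata sheet, and uses a hypothetical function name `record_problems`. Values other than those quoted in Table 1 are made up for illustration.

```python
# Mandatory fields named in the two BioSample steps above (metadata-sheet fields only).
MANDATORY = [
    "Organism name", "Attribute package", "Strain name", "Isolate name alias",
    "Project name", "Collected by", "Collection / Isolate date",
    "Geographical origin", "Isolation source", "Sequenced by", "Source name/type",
]

def record_problems(record):
    """Return a list of human-readable problems found in one isolate record."""
    problems = [
        f"{name}: empty"
        for name in MANDATORY
        if not str(record.get(name, "")).strip()
    ]
    # Table 1: host disease is mandatory for Pathogen.cl (human clinical) isolates.
    is_clinical = record.get("Attribute package") == "Pathogen.cl"
    if is_clinical and not str(record.get("Host disease", "")).strip():
        problems.append("Host disease: required for Pathogen.cl, enter 'missing' if unknown")
    return problems

# Hypothetical record with the isolate name alias left blank.
example = {
    "Organism name": "Salmonella enterica",
    "Attribute package": "Pathogen.env",
    "Strain name": "CFSAN123456",
    "Isolate name alias": "",
    "Project name": "Example project",
    "Collected by": "Washington State Department of Health",
    "Collection / Isolate date": "1990-10-30",
    "Geographical origin": "USA: WA",
    "Isolation source": "bagged romaine lettuce",
    "Sequenced by": "Centers for Disease Control and Prevention",
    "Source name/type": "Food",
}
print(record_problems(example))  # ['Isolate name alias: empty']
```

An empty list means that none of these checks failed; it does not guarantee that NCBI will accept the record.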

Fig 3. NCBI Submission Template: BioSample

NCBI submission settings – SRA Experiment and Run

Populate fields for SRA Experiment and Run according to PulseNet instructions. 

Fig 4. NCBI Submission Template for BioNumerics, SRA Experiment and Run: Make sure to map collection attributes to the corresponding fields.

NCBI submission settings – Submission Template

Save the submission template according to PulseNet instructions as -GenomeTrakr-Template-.

Import data

Import the GenomeTrakr metadata form for BioNumerics according to PulseNet instructions.

When setting up the import rules, each source field should match its destination field.

In the import links section, choose the -key- field to link records to database entries.

Proceed with the sequencing data import according to PulseNet instructions.

Submit data to NCBI according to PulseNet instructions. If NCBI accessions are not available in BioNumerics within 1 business day, please contact NCBI and PulseNet to troubleshoot the submission.

Contact GenomeTrakr by email (genometrakr@fda.hhs.gov) if submissions are delayed for more than 3 days. GenomeTrakr can support urgent submissions if needed.

