Nov 19, 2024

GenFS Metadata Cleanup Challenge protocol V.5

  • 1 US Food and Drug Administration;
  • 2 National Center for Biotechnology Information;
  • 3 US FDA-HFP;
  • 4 US FDA
  • GenomeTrakr
    Tech. support email: genomeTrakr@fda.hhs.gov
Protocol Citation: Ruth Timme, Martin Shumway, Candace Hope Bias, Maria Balkey, Tina Pfefer 2024. GenFS Metadata Cleanup Challenge protocol. protocols.io https://dx.doi.org/10.17504/protocols.io.rm7vzj6prlx1/v5
Version created by Candace Hope Bias
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: June 26, 2024
Last Modified: November 19, 2024
Protocol Integer ID: 112406
Abstract
This protocol provides guidance for submitting bulk BioSample updates to NCBI for the GenFS Metadata Cleanup Challenge exercise.

Table of Contents:
  • Timeline/Workflow for the 2024 Metadata Cleanup Challenge
  • Preparing the update file
  • Making corrections and updates within the file
  • Submitting the finalized template

Version History:
V5: addition of picklists for mandatory OHE attributes
V4: addition of a training video
V3: minor edits made to correct typos. Addition of final step to email FDA and NCBI when you've completed all the steps.
Timeline/Workflow for the 2024 Metadata Cleanup Challenge



Overview and scope of exercise

Timeframe: focus on submissions made between September 2023 and August 2024. Your lab is welcome to include submissions outside this timeframe.

Scope: review and curate entries for the following attributes, which are core requirements within the One Health Enteric package:

Focus on these attributes:
  • source_type
  • collected_by
  • sequenced_by
  • project_name
  • host (for human and animal isolates)
  • food_origin (for food isolates)
  • isolation_source

Walk-through video (download): 2024GTMetadataCleanupChallenge_walkthrough.mp4 (71.9 MB)

Guidance for preparing the update file
Steps 4 and 5 describe the required format for the bulk update file.
For labs in the GenomeTrakr network:
We have generated a bulk metadata update template for every laboratory that has a BioProject linked to the GenomeTrakr umbrella BioProject at NCBI (PRJNA593772). 

Pick up your metadata template here: DOCS: metadata hackathon/2024 Hackathon Template Pickup.

Instructions for generating your own template (if your lab's BioProject is not linked to the GenomeTrakr Umbrella):

Navigate to SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/
Enter your BioProject accession(s), then click the "Metadata" button in the download box.


The downloaded metadata file is a comma-delimited text file containing both the BioSample and SRA metadata. Open it in Excel, remove all metadata columns not relevant to this exercise, move the BioSample accessions to the first column, then proceed.
See the example template for guidance on which columns to keep: Example Template.xlsx
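
For labs that prefer a script to manual editing in Excel, the column trimming described above can be sketched in Python. This is a minimal illustration, not part of the official workflow: the function name to_template is hypothetical, and the accession column header "BioSample" is an assumption about the download, so check it against your own file.

```python
import csv
import io

def to_template(csv_text, keep_columns, accession_column="BioSample"):
    """Reduce a comma-delimited metadata download to a tab-delimited template.

    The accession column is moved to the first position; only the requested
    columns that actually exist in the download are kept.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    columns = [accession_column] + [
        c for c in keep_columns if c != accession_column and c in present
    ]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in reader:
        writer.writerow([row.get(c, "") for c in columns])
    return out.getvalue()
```

Calling to_template(text, ["strain", "host", "source_type"]) on the text of the download would return a template whose first column is the accession, followed by whichever of the three attributes are present.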

If you see that any of the isolates listed in your template should be part of another organization's submissions (e.g., PulseNet), please notify the GenomeTrakr team by emailing genometrakr@fda.hhs.gov.
Prepare the update file:

  • Updates should be put into one bulk update file per lab, or per lab/organism, or some other aggregation. 
  • Rows: The first row, or header, must contain the column names. Each subsequent row must contain exactly one BioSample and its updatable fields, or attributes. Each row must contain exactly the same number of columns. 
  • Columns: The column names should be attribute names (full name or harmonized name) included in the One Health Enteric package. The first column should contain the BioSample accession (e.g., SAMN123456789).
  • File format: tab-delimited text file using the file suffix .tsv or .txt. If you use Microsoft Excel for editing, export the final template into this format. Take care that dates are correctly exported from the spreadsheet.
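
The row and column rules above can be checked before submission with a short script. The following is a sketch only: check_structure is a hypothetical helper, and the accession pattern (SAMN, SAMEA, or SAMD followed by digits) is an assumption about BioSample accession formats rather than something stated in this protocol.

```python
import csv
import io
import re

# Assumed accession pattern; adjust if your accessions look different.
ACCESSION = re.compile(r"^SAM(N|EA|D)\d+$")

def check_structure(tsv_text):
    """Return a list of problems found in a tab-delimited update file."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: {len(row)} columns, header has {len(header)}"
            )
        if not row or not ACCESSION.match(row[0].strip()):
            problems.append(
                f"line {line_no}: first column is not a BioSample accession"
            )
    return problems
```

An empty list means every row has the same number of columns as the header and starts with something that looks like a BioSample accession.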

FYI: 
  • Only those columns where at least one record has a replacement value need to be supplied. 
  • If a field is blank in the update file but has a value in NCBI BioSample, the blank will be taken as "replace current value with NULL".
  • If a field already has a value in NCBI BioSample that should not change, fill in the field with its current value.
  • If a field is new to the BioSample, it will be added. A blank field whose attribute does not currently exist in the BioSample record will not be added.
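
To make the blank-field rules concrete, here is a small Python model of how one update row would change one record. It only restates the rules listed above; it is not NCBI's implementation, and the function name apply_update and the use of None to stand for NULL are illustrative choices.

```python
def apply_update(current, update_row):
    """Return the attributes a record would hold after one update row.

    current:    dict of attribute name -> existing value in BioSample
    update_row: dict of attribute name -> value from the update file
    """
    result = dict(current)
    for name, value in update_row.items():
        if value.strip():
            # Non-blank: replaces an existing value or adds a new attribute.
            result[name] = value
        elif name in result:
            # Blank over an existing value: replaced with NULL.
            result[name] = None
        # Blank for an attribute the record lacks: nothing is added.
    return result
```

The practical consequence is that any column you include must carry the current value for every row you do not intend to change, otherwise that value is cleared.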
What fields/attributes can be included in the bulk update file?

  • collected_by
  • geo_loc_name
  • food_origin
  • host
  • project_name
  • sequenced_by
  • source_type
  • isolation_source
  • collection_date
  • purpose_of_sampling
  • env_local_scale
  • env_medium
  • animal_env
  • env_broad_scale
  • facility_type
  • intended_consumer
  • food_type_processed
  • food_processing_method
What fields CANNOT be updated in the bulk update file?

DO NOT include changes to the following attributes in this bulk update file. Changes to the following fields can be requested separately by writing to pd-help@ncbi.nlm.nih.gov directly. Changes to the following attributes affect multiple resources at NCBI and updates to them need to be coordinated.

  • biosample_acc/BioSample – This is the primary key of the entire BioSample system - Keep this accession in the first column, but DO NOT EDIT OR UPDATE ENTRIES.
  • bioproject_acc/BioProject - changes to linked BioProject, linked SRA, or linked Assembly.
  • strain or isolate
  • sample_name
  • attribute package – This requires validation of all the fields together. 
  • center name – This is a property of SRA and cannot be changed using the bulk update channel.
  • organism - Change to the BioSample species (the identification of the isolate). 
  • Salmonella serovar/serotype - changes to organism species or sub-species names (e.g., Salmonella enterica => Salmonella enterica subsp. enterica serovar Infantis)
  • Any kind of SRA to BioSample, BioSample to BioProject, or SRA to BioProject mapping. 
  • Any of the computed fields in Pathogen Detection (epi_type, min_same, min_diff, computed_types, or amr/virulence analysis outputs). These attributes cannot be updated by the record owner.

For updating any of these attributes, send a TSV file of proposed changes to pd-help@ncbi.nlm.nih.gov as prepared above.
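
A quick way to catch columns that do not belong in the bulk update file is to compare the header against the updatable attributes listed above. The sketch below assumes harmonized attribute names are used as column headers and that the first column holds the accession; columns_to_review is a hypothetical helper, and strain is exempted only because the finalization step keeps it for tracking.

```python
# Attributes this protocol lists as allowed in the bulk update file.
UPDATABLE = {
    "collected_by", "geo_loc_name", "food_origin", "host", "project_name",
    "sequenced_by", "source_type", "isolation_source", "collection_date",
    "purpose_of_sampling", "env_local_scale", "env_medium", "animal_env",
    "env_broad_scale", "facility_type", "intended_consumer",
    "food_type_processed", "food_processing_method",
}

def columns_to_review(header, tracking=("strain",)):
    """List header columns (after the accession) that are not updatable."""
    return [
        name for name in header[1:]
        if name not in UPDATABLE and name not in tracking
    ]
```

Any name returned should either be removed from the file or handled through a separate request to pd-help@ncbi.nlm.nih.gov.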
Guidance for making corrections and updates within the file
Steps 7-9: These steps cover required fields. In your Excel template, ensure there is an entry in these columns for every row. If information is missing, choose one of the null terms: "Not Applicable", "Not Collected", "Not Provided", "Missing", or "Restricted Access".
source_type

Review your entries and ensure every record has a source_type entry. Make corrections where needed.
 
human 
animal 
food 
environmental 
other 
Not Applicable 
Not Collected 
Not Provided 
Missing 
Restricted Access 

NCBI and US pathogen surveillance coordinators maintain a strict controlled vocabulary for this attribute; only the terms above are allowed.
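
Because only the listed terms are accepted, a column of source_type values can be screened with a simple membership test. This sketch assumes the match is exact and case-sensitive, which the protocol does not state explicitly; invalid_source_types is a hypothetical helper.

```python
SOURCE_TYPES = {"human", "animal", "food", "environmental", "other"}
NULL_TERMS = {
    "Not Applicable", "Not Collected", "Not Provided",
    "Missing", "Restricted Access",
}

def invalid_source_types(values):
    """Return the distinct values that are not in the controlled vocabulary.

    Blank entries are reported as an empty string, since every record
    needs a source_type entry.
    """
    allowed = SOURCE_TYPES | NULL_TERMS
    return sorted({v.strip() for v in values if v.strip() not in allowed})
```

For example, invalid_source_types(["human", "Food", "clinical"]) reports "Food" and "clinical" as entries needing correction.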
project_name

Review the entries in this column, make corrections where needed, and populate all missing fields. For US surveillance, please choose the coordinating body that best describes the isolate.

GenomeTrakr 
GenomeTrakr; LFFM-FY1 
GenomeTrakr; LFFM-FY2 
GenomeTrakr; LFFM-FY3 
GenomeTrakr; LFFM-FY4 
GenomeTrakr; LFFM-FY5
NARMS 
NARMS Cecal 
NARMS Retail Meat 
PulseNet 
USDA-FSIS 
Vet-LIRN 
NAHLN 
USDA-ARS
  
If you would like to create another term to communicate membership in another project or network, feel free to do that! Enter your new term directly into the update template. If you would like this term added to the picklist, send it to genomeTrakr@fda.hhs.gov, and we'll add it to the next update.

Include more than one term? Separate with "; ", for example, "GenomeTrakr; USDA-ARS".
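
The separator rule and the picklist can be combined into a small review helper. This is a sketch under stated assumptions: review_project_name is a hypothetical function, and terms not on the picklist are only reported for review, not rejected, since new terms are permitted.

```python
import re

PICKLIST = [
    "GenomeTrakr", "GenomeTrakr; LFFM-FY1", "GenomeTrakr; LFFM-FY2",
    "GenomeTrakr; LFFM-FY3", "GenomeTrakr; LFFM-FY4", "GenomeTrakr; LFFM-FY5",
    "NARMS", "NARMS Cecal", "NARMS Retail Meat", "PulseNet", "USDA-FSIS",
    "Vet-LIRN", "NAHLN", "USDA-ARS",
]
# Individual names that appear anywhere in the picklist.
KNOWN_PARTS = {part for term in PICKLIST for part in term.split("; ")}

def review_project_name(value):
    """Normalize separators to "; " and list parts not on the picklist."""
    parts = [p for p in re.split(r"\s*;\s*", value.strip()) if p]
    normalized = "; ".join(parts)
    unlisted = [p for p in parts if p not in KNOWN_PARTS]
    return normalized, unlisted
```

A value such as "GenomeTrakr;USDA-ARS" comes back normalized to "GenomeTrakr; USDA-ARS" with nothing to review, while an unfamiliar network name is returned in the second element so it can be checked for typos or sent in for addition to the picklist.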
collected_by and sequenced_by

Review the entries in both of these columns, make corrections where needed, and populate any missing fields.

Ensure terms are standardized across all your records.  Check the sequenced_by picklist in the current One Health Enteric package file for your standardized laboratory name. 

Send updates, corrections, or additions to your laboratory name to genomeTrakr@fda.hhs.gov and we'll make corresponding updates to the picklist terms.
Steps 11-13: For this exercise we're focused on two conditionally required attributes:
  1. host should be populated for isolates derived from human or animal samples
  2. food_origin should be populated for isolates derived from food products, or other commercial products sampled for pathogens (medical products, cosmetics, tattoo ink, etc).

Use the source_type column to filter for these sample types, first for animal/human samples (steps 11 and 12), then for food samples (step 13).
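
The same filtering can be done in a script once the template is loaded as a list of row dictionaries (for example with csv.DictReader). The sketch below is illustrative: missing_conditional is a hypothetical helper, and the accession column name "BioSample" is an assumption about your template header.

```python
# Which attribute becomes required for which source_type, per this protocol.
REQUIRED_BY_SOURCE = {"human": "host", "animal": "host", "food": "food_origin"}

def missing_conditional(rows, accession_column="BioSample"):
    """Return (accession, attribute) pairs where a required value is blank."""
    missing = []
    for row in rows:
        attribute = REQUIRED_BY_SOURCE.get(row.get("source_type", "").strip())
        if attribute and not row.get(attribute, "").strip():
            missing.append((row.get(accession_column, ""), attribute))
    return missing
```

Rows flagged here need either a real value or one of NCBI's null terms in the named column.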
host

*FOR HUMAN and ANIMAL ISOLATES ONLY*

host is required for host-associated human and animal isolates.
Use the source_type column to identify human and animal isolates (filter in Excel for human and animal).

Use the entries in the isolation_source column to help determine what the host entry should be.

Where possible, enter a scientific or binomial name, for example Homo sapiens or Bos taurus. If scientific name is unknown, use a common name recognized by the NCBI taxonomy database (porcine, bovine, etc). 

When host is completely unknown, provide one of NCBI's null values: 
Not Applicable 
Not Collected 
Not Provided 
Missing 
Restricted Access 
isolation_source

**For human and animal isolates: remove all taxonomic references included in isolation_source.**
This information should solely reside in host.

Example edit to both isolation_source and host:


food_origin

**FOR FOOD OR OTHER PRODUCTS ONLY**

Sort or filter on the source_type column to identify the “food” isolates.

Food and other products have two attributes describing geographic location information:
geo_loc_name: geographic location where the sample was physically collected
food_origin: geographic origin of food product or other product sampled for pathogens

Check location data in geo_loc_name:
  • If the location information in geo_loc_name reflects the state or country of origin for the food product (e.g. “India”, reflecting an imported food product sampled in the US), move this geographic information to food_origin.
  • If geo_loc_name contains the location where the sample was collected (e.g. port of entry, US state of grocery store, etc), leave geo_loc_name as is. Populate food_origin with the country or state of origin.

If food_origin is unknown, provide one of NCBI's null values: 
Not Applicable 
Not Collected 
Not Provided 
Missing 
Restricted Access 
OPTIONAL: bring your records up to OHE standards
Review and populate the other conditionally required fields for OHE sub-packages. Download the latest version of the One Health Enteric package for picklist terms.

Filter on source_type to locate isolates belonging to the four OHE sub-packages. The terms listed under the sub-packages are the conditionally mandatory attributes for those sub-packages.
animal samples, source_type = animal

  • animal_env picklist:
veterinary facility or diagnostic laboratory
private household [ENVO:01000418]
animal breeding facility
animal boarding facility [ENVO:00003040]
animal exhibition site
animal market or collection point
animal import/export quarantine facility
abattoir [ENVO:01000925]
food animal production site
zoo, wildlife refuge, or private animal collection [ENVO:00010625]
natural environment [ENVO:01000951]
fish hatchery [ENVO:00000295]
poultry hatchery [ENVO:01001874]
smallholder farm
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access
food product samples, source_type = food

  • intended_consumer picklist:
human as food consumer
animal as food consumer
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access

  • food_processing_method picklist:
food (raw) [FOODON:03311126]
food (heat treated) [FOODON:03316043]
food (cooked) [FOODON:00001181]
food (precooked, frozen) [FOODON:03305323]
food (blanched) [FOODON:00002767]
food (deep-fried) [FOODON:03307052]
food (pasteurized) [FOODON:00002654]
food reheating [FOODON:03450037]
food scalding [FOODON:00002648]
cooking with fat or oil [FOODON:03450024]
cooking by moist heat [FOODON:03450012]
cooking by dry heat [FOODON:03450004]
cooking by microwave [FOODON:03450011]
cooking using heating container [FOODON:03450032]
sous vide cooking [FOODON:03470150]
food (preserved) [FOODON:00002158]
food (pickled) [FOODON:00001079]
food (freeze-dried) [FOODON:03301752]
food (canned) [FOODON:00002418]
food (frozen) [FOODON:03302148]
food (smoked) [FOODON:03310311]
food (dehydrated) [FOODON:00002643]
food (fermented) [FOODON:00001258]
food (batter-coated) [FOODON:00002662]
food (breaded) [FOODON:00002661]
food (chilled) [FOODON:00002642]
food (cleaned) [FOODON:00002708]
food (colored) [FOODON:00002650]
food (comminuted) [FOODON:00002754]
food (filled) [FOODON:00002644]
food (flavored) [FOODON:00002646]
food (ground) [FOODON:00002713]
food (harvested) [FOODON:00003398]
food (hulled) [FOODON:00002720]
food (hydrolized) [FOODON:00002653]
food (juiced) [FOODON:00003499]
food (milled) [FOODON:00002649]
food (peeled) [FOODON:00002655]
food (puffed) [FOODON:00002656]
food (rehydrated) [FOODON:00002755]
food (salted) [FOODON:03460173]
food (seasoned) [FOODON:00002733]
food (sliced) [FOODON:00002455]
food (textured) [FOODON:00002658]
food (chopped) [FOODON:00002777]
food (julienned) [FOODON:00002990]
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access
facility inspection samples, source_type = environmental

  • facility_type picklist:
ambient storage
caterer-catering point
distribution
frozen storage
importer-broker
interstate conveyance
labeler-relabeler
packaging
process/manufacturing
refrigerated storage
storage
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access

  • food_type_processed picklist:
Animal feed
Baby Food products
Bakery Products, Doughs, Bakery Mixes and Icings
Candy, candy specialties, chewing gum
Cereal preparations and Breakfast foods
Cheese and Cheese Products
Chocolate, cocoa products, cocoa beans
Coffee and tea
Dietary supplements
Drinks, soft drinks, and waters
Edible insects and insect-derived foods
Egg and Egg Products
Fishery/Seafood Products
Fruit and Fruit Products
Ice Cream and related products
Macaroni and noodle products
Meats, meat products, and poultry
Medicated animal feeds
Milk Butter and Dried Milk Products
Miscellaneous food related items
Multiple Food Dinners, Gravies, Sauces and Specialties
Nuts and Edible seeds
Pet and laboratory animal food
Pet food and treats
Prepared salad products
Snack food items
Soups
Spices, flavors, and salts
Vegetable oils
Vegetable protein products
Vegetables and Vegetable Products
Whole grains, milled grain products, and starches
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access
farm/environment samples, source_type = environmental

  • env_broad_scale picklist:
agricultural ecosystem [ENVO:00000077]
aquatic ecosystem [ENVO:01001787]
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access

  • env_local_scale picklist:
farm [ENVO:00000078]
produce farm
mixed-use farm
livestock operation
pasture [ENVO:00000266]
fish farm [ENVO:00000294]
feedlot [ENVO:01000627]
under glass/protected plant cultivation [FOODON:03530211]
aquaculture
indoor rearing structure
outdoor rearing structure
stream [ENVO:00000023]
river [ENVO:00000022]
lake [ENVO:00000020]
pond [ENVO:00000033]
canal [ENVO:00000014]
plant body [PO:0009011]
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access

  • env_medium picklist:
soil [ENVO:00001998]
hay for animal feed [FOODON:03301763]
hay [baled] [FOODON:03309364]
straw [FOODON:03309894]
animal litter
animal manure
air
saline water [ENVO:00002010]
freshwater [ENVO:00002011]
waste water [ENVO:00002001]
sewage [ENVO:00002018]
plant root, tuber or bulb [FOODON:03420238]
plant part above surface [FOODON:03420144]
Not Applicable
Not Collected
Not Provided
Missing
Restricted Access
Finalize the update file and submit the bulk template!
Check for duplications - ensure there are no double BioSample entries.
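
One way to run the duplication check outside Excel is to count the accessions in the first column. A minimal sketch, with duplicate_accessions as a hypothetical helper name:

```python
from collections import Counter

def duplicate_accessions(accessions):
    """Return the BioSample accessions that appear more than once."""
    counts = Counter(a.strip() for a in accessions)
    return sorted(a for a, n in counts.items() if n > 1)
```

An empty result means each BioSample appears on exactly one row.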
Bring your template to the October 16th, 2024 GenomeTrakr metadata cleanup challenge - finalize edits at the hackathon.
Remove columns that you're not updating. Ensure that the BioSample accession is retained in Column 1 and strain is kept for tracking the update.

Keep
Biosample
strain

Remove
package name
Title
Center_Name
Release_Date
Bioproject
Run
sample_name
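
If the template is handled in a script rather than in Excel, the Remove list above can be applied by column name. This is a sketch: drop_columns is a hypothetical helper, and the names in REMOVE are copied from the list above, so they need to match your header spelling exactly.

```python
REMOVE = {
    "package name", "Title", "Center_Name", "Release_Date",
    "Bioproject", "Run", "sample_name",
}

def drop_columns(header, rows, remove=REMOVE):
    """Return (header, rows) with the named columns removed.

    Column order is preserved, so the accession stays in the first column.
    """
    keep = [i for i, name in enumerate(header) if name not in remove]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```

The trimmed header and rows can then be written out with csv.writer using delimiter="\t" to produce the .tsv file.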
Save as a tab-delimited .tsv file and transfer the .tsv file to protocols.io:
Send an email to genomeTrakr@fda.hhs.gov and pd-help@ncbi.nlm.nih.gov, alerting us that you've completed the update. Include your filename in the email. If NCBI has questions as they are processing the update, they will reach out to you directly.