Submission of sequence and contextual data to GISAID, INSDC repositories, or other databases

Paul Lorenzo A Gaite; Dr Ritchie Mae T Gamot; Dr Lyre Anni E Murao

Jan 16, 2023

This protocol is a draft, published without a DOI.

Paul Lorenzo A Gaite¹,
Dr Ritchie Mae T Gamot¹,²,
Dr Lyre Anni E Murao¹,²

¹Philippine Genome Center Mindanao;
²University of the Philippines Mindanao

PHA4GE Subgrant - Philippines

phagesubgrantph

Protocol Citation: Paul Lorenzo A Gaite, Dr Ritchie Mae T Gamot, Dr Lyre Anni E Murao 2023. Submission of sequence and contextual data to GISAID, INSDC repositories, or other databases. protocols.io https://protocols.io/view/submission-of-sequence-and-contextual-data-to-gisa-cgh2tt8e

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: September 12, 2022

Last Modified: January 16, 2023

Protocol Integer ID: 69914

Abstract/Introduction

Timely submission of viral sequences and the corresponding contextual data by public health laboratories is an essential step in SARS-CoV-2 biosurveillance. It allows sequence repositories to be updated in real time, and therefore allows the virus to be tracked in real time using both the sequence data and the metadata. Various publicly available online repositories and databases have a section dedicated to SARS-CoV-2 sequences, such as GISAID and the INSDC repositories.

The previous project involved submission of the sequences generated by the PGC Mindanao workflow (refer to the protocol "Introduction and Lineage Assignment of Assembled Sequences") to the GISAID database. In collaboration with BugSeq under the PHA4GE subgrant, PGC Mindanao also uploaded and submitted the previously generated SARS-CoV-2 sequences to NCBI, which is part of the INSDC. NCBI subsequently released these sequences to its public database.

The protocol below outlines the PGC Mindanao workflow for the submission of sequence and contextual data to GISAID (Section 2). A short section shows how the PHA4GE contextual data package was used in the submission process (Section 3). The process of submission of sequence and contextual data by PGC Mindanao to the NCBI database is also outlined (Section 4). A comparison of results from GISAID and NCBI submissions is shown (Section 5).

PGC Mindanao workflow

This section outlines the PGC Mindanao workflow for the submission of sequence and contextual data to a public database. Figure 1 shows an overview of the entire workflow. The workflow ultimately deposits the sequence and contextual data to the online public database GISAID.  
 
Figure 1. Overview of the PGC Mindanao workflow for submission of sequence and contextual data to the GISAID database
   
After the consensus sequences were assembled from the viral samples, they were assessed and filtered by the number of ambiguous bases and by the presence of frameshift mutations and unexpected stop codons. Sequences were included in the GISAID submission if they had less than 50% ambiguous bases and if any frameshift mutations or unexpected stop codons they contained were confirmed by the sequencing data rather than being sequencing artifacts; all other sequences were excluded (refer to Section 2.6 - "Nextclade workflow" of the protocol "Introduction and Lineage Assignment of Assembled Sequences" for details). The headers of the multi-sequence FASTA files were renamed to conform to GISAID requirements (Section 2.1). In parallel, sequence metadata were uploaded to REDCap under the appropriate data access group of each Sub-National Laboratory (SNL) (Section 2.2).
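The ambiguous-base criterion above can be illustrated with a minimal Python sketch. It is not the script used by PGC Mindanao: the function names are hypothetical, any character other than A, C, G, or T (including N and gap characters) is assumed to count as ambiguous, and the frameshift and stop-codon checks, which rely on Nextclade, are not covered.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(seq: str) -> float:
    """Fraction of positions that are not an unambiguous A, C, G, or T."""
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    upper = seq.upper()
    unambiguous = sum(upper.count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(upper)


def filter_by_ambiguity(text: str, max_fraction: float = 0.5) -> Dict[str, str]:
    """Keep sequences whose ambiguous fraction is below max_fraction."""
    return {
        header: seq
        for header, seq in read_fasta(text)
        if ambiguous_fraction(seq) < max_fraction
    }
```

The 0.5 default mirrors the 50% threshold stated above; the comparison is strict, so a sequence with exactly 50% ambiguous bases is excluded.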

Sequences were uploaded to REDCap using a script provided by the previous project; the succeeding steps also used scripts from the previous project. As with the metadata, the sequence data were uploaded to REDCap under the appropriate data access group of each SNL. Both sequence data and metadata were then scraped from the REDCap database using the provided Python script (Section 2.2). PGC Mindanao verified the sequences and the corresponding metadata. Once verification was complete, the sequences and the metadata were uploaded to GISAID following the GISAID submission protocol (Section 2.3).

Formatting of sequence data to standard form:

The previous project created and provided various scripts for formatting FASTA headers to the standard form and for other steps of the process. The provided custom script "old2newheader_reduced.py" converts the header/s of the sequence FASTA file to conform to the GISAID convention (Figure 2).

Figure 2. Screenshots of two of the scripts used in the workflow, "old2newheader_reduced.py" and "redcap_import_virusname_consensus.py"
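The kind of header conversion described above can be sketched as follows. This is not the content of "old2newheader_reduced.py", which is only shown as a screenshot; it is a hypothetical illustration that assumes the original header starts with the sample ID and that the target name follows the "hCoV-19/Country/ID/Year" pattern visible in the sample IDs of Table 1.

```python
def to_gisaid_header(sample_id: str, year: int, country: str = "Philippines") -> str:
    """Build a virus name in the hCoV-19/Country/ID/Year pattern."""
    return f"hCoV-19/{country}/{sample_id}/{year}"


def rename_fasta_headers(fasta_text: str, year: int) -> str:
    """Rewrite every FASTA header line; sequence lines pass through unchanged.

    Assumption: the first whitespace-delimited token of each original
    header is the sample ID (for example "PH-CRMC-13-14").
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a sample ID")
            out.append(">" + to_gisaid_header(tokens[0], year))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

For example, a header ">PH-CRMC-13-14 consensus" with year 2021 becomes ">hCoV-19/Philippines/PH-CRMC-13-14/2021".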

Data entry into, and data "scraping" from REDCap Database:

Sequence data and contextual data/metadata (refer to the protocol "Establishing processes to capture standardized contextual data" for details) were entered into the REDCap database of the previous project. Figure 3 shows the landing page displayed after logging in to the REDCap database, which depends on the data access group used in the last login. Sequence data and metadata for individual samples may be entered manually through the webpage GUI (using the green "Add new record" button and the individual sub-entry pages). Figure 4 shows the case metadata sub-entry page, and Figure 5 shows the analysis metadata sub-entry page. Data may also be uploaded through the "Data Import Tool" (Figure 6). Sequence data may additionally be uploaded using the provided custom script "redcap_import_virusname_consensus.py" (Figure 2).

Data can be "scraped", or collected, from the same database by using the provided custom script "gisaidprep.py", which outputs a spreadsheet that conforms to GISAID submission requirements (Figure 7).

Figure 3. Main landing page after entering login of the REDCap database

Figure 4. Case metadata sub-entry page of a REDCap database sample entry

Figure 5. Analysis metadata sub-entry page of a REDCap database sample entry

Figure 6. Data import tool of REDCap
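The step of turning scraped REDCap records into a GISAID-ready spreadsheet can be sketched as below. This is not "gisaidprep.py": the REDCap field names on the left of FIELD_MAP are hypothetical, and the column names on the right are assumed to follow the GISAID bulk-upload template, so both should be checked against the actual REDCap project and the current GISAID template. Only a few columns are shown.

```python
import csv
import io
from typing import Dict, List

# Hypothetical REDCap field name -> assumed GISAID template column.
FIELD_MAP = {
    "virus_name": "covv_virus_name",
    "collection_date": "covv_collection_date",
    "location": "covv_location",
    "host": "covv_host",
}


def records_to_gisaid_csv(records: List[Dict[str, str]], fasta_filename: str) -> str:
    """Convert REDCap-style records into CSV text, one row per sample."""
    columns = ["fn"] + list(FIELD_MAP.values())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Refuse to write incomplete rows so gaps are caught before upload.
        missing = [field for field in FIELD_MAP if not rec.get(field)]
        if missing:
            raise ValueError(f"record is missing required fields: {missing}")
        row = {"fn": fasta_filename}  # FASTA file that holds the sequence
        row.update({col: rec[src] for src, col in FIELD_MAP.items()})
        writer.writerow(row)
    return buf.getvalue()
```

Raising on a missing field, rather than writing an empty cell, is a design choice of this sketch: it surfaces incomplete records during verification instead of at the repository's curation step.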

Submission to GISAID database:

The spreadsheet produced by "gisaidprep.py" in the previous section, which contains the sequence metadata (Figure 7), is used together with the final formatted sequence FASTA file to upload the sequence data and metadata to GISAID. The standard GISAID submission procedure was followed; the recommended approach is the command-line tool CLI2. This process requires a live client ID for authentication, which may be requested by emailing the support team for the tool at GISAID (clisupport@gisaid.org). Figure 8 shows some of the commands available in CLI2.

Figure 7. Spreadsheet outputted by "gisaidprep.py" script containing sequence metadata conforming to GISAID submission requirements

Figure 8. A number of the commands that can be issued by the CLI2 command-line tool

After the sequence data and metadata are submitted to GISAID through the CLI2 tool, curators at GISAID review the data. For successfully released samples, the uploader is notified by email from submit@gisaid.org, together with the corresponding individual accession IDs, and the samples are made immediately available to registered GISAID users. For unsuccessful samples, the uploader is notified by email from hcov-19@gisaid.org with the reason for the non-release of each sample. The uploader may then resubmit the corrected sequence/s or inform GISAID that the sequence/s are correct and supported by the raw sequencing data.

PHA4GE contextual data package

The contextual data template spreadsheet from the package can be used to standardize submission to repositories. See protocol on "Establishing processes to capture standardized contextual data" for details.

NCBI GenBank submission process

BugSeq conducted an orientation session on the submission of sequence and contextual data to INSDC repositories, particularly NCBI GenBank. Sequences previously generated by PGC Mindanao were submitted live during the orientation. This section outlines the process.

Figure 9 shows the landing page of the GenBank submission portal for sequences. The whole process is carried out through the web GUI of the online NCBI GenBank page. The landing page also gives an overview of previous sequence submissions.

Figure 10 shows the first step to submission, asking the uploader the type of sequences to be uploaded and other details (e.g. if it is SARS-CoV-2).

Figure 11 shows the page for the second step, which asks for details regarding the submitter.

Figure 12 shows the third step of the process, which asks for details regarding the sequencing technology used (e.g. sequencing platform and method).

Figure 13 shows the fourth step, which is the upload page for the sequence file and also asks when the sequences are to be released.

Figure 14 presents the fifth step, which is the sequence processing page and also where the uploader is given an option to automatically remove failed sequences.

Figure 15 shows the sixth step, which is the source information page and where the uploader is asked about the details on the sequence IDs.

Figures 16 and 17 show the seventh step of the process wherein source modifiers, or sample/sequence metadata, are provided. The uploader is given an option to provide the metadata by filling out the editable table within the page or uploading a tab-delimited table file containing the metadata. Figure 17 shows the editable table where the source modifiers are provided in the submission process in this case.

Figure 18 shows the eighth step asking details on references, such as sequence author information and status of the publication linked to the sequences.

Figure 19 shows the ninth step, which is the review of the sequence submission before uploading and submitting the sequence data and corresponding metadata to the NCBI GenBank database.  

Figure 9. Landing page of the submission portal of NCBI GenBank

Figure 10. Submission type page

Figure 11. Submitter information page

Figure 12. Sequencing technology page

Figure 13. Sequence upload and information page

Figure 14. Sequence processing page

Figure 15. Source information page

Figure 16. Source modifiers page

Figure 17. Editable table for inputting source modifiers

Figure 18. References page for the submission 

Figure 19. Review page for the submission

After the sequence data and metadata are submitted to NCBI GenBank through its Submission Portal, curators at NCBI GenBank review the data. The uploader is notified by email whether the whole batch of sequences has been successfully released, with the reason for the non-release of each unsuccessful sample. The uploader may then resubmit the corrected sequence/s through the Submission Portal or inform NCBI GenBank that the sequence/s are correct and supported by the raw sequencing data.

Comparison of GISAID and GenBank submissions

Table 1 shows details of the sequences submitted to GISAID (from the previous project) and GenBank (from this grant): whether the initial database submission was successful, the step/s taken to resolve an unsuccessful initial submission, whether the resubmission was successful, and the accession ID assigned by the database (if applicable).


Table 1. Comparison of GISAID and NCBI submission details

Sample ID	GISAID submitted?	Initial GISAID submission successful?	Resolution if initial GISAID submission unsuccessful	Resubmission to GISAID successful?	GISAID accession ID	NCBI submitted?	Initial NCBI submission successful?	Resolution if initial NCBI submission unsuccessful	Resubmission to NCBI successful?	NCBI accession ID
hCoV-19/Philippines/PH-CRMC-13-14/2021	Yes	Yes	-	-	EPI_ISL_5934896	Yes	No	Corrected sequence	Yes	OP522426
hCoV-19/Philippines/PH-CRMC-13-15/2021	Yes	Yes	-	-	EPI_ISL_5934897	Yes	Yes	-	-	OP522427
hCoV-19/Philippines/PH-CRMC-13-16/2021	Yes	Yes	-	-	EPI_ISL_5934898	Yes	Yes	-	-	OP522428
hCoV-19/Philippines/PH-CRMC-13-17/2021	Yes	Yes	-	-	EPI_ISL_5934899	Yes	Yes	-	-	OP522429
hCoV-19/Philippines/PH-CRMC-13-18/2021	Yes	Yes	-	-	EPI_ISL_5934900	Yes	Yes	-	-	OP522430
hCoV-19/Philippines/PH-CRMC-13-19/2021	Yes	Yes	-	-	EPI_ISL_5934901	Yes	Yes	-	-	OP522431
hCoV-19/Philippines/PH-CRMC-13-20/2021	Yes	Yes	-	-	EPI_ISL_5934902	Yes	Yes	-	-	OP522432
hCoV-19/Philippines/PH-CRMC-13-21/2021	Yes	Yes	-	-	EPI_ISL_5934903	Yes	Yes	-	-	OP522433
hCoV-19/Philippines/PH-CRMC-13-22/2021	Yes	Yes	-	-	EPI_ISL_5934904	Yes	Yes	-	-	OP522434
hCoV-19/Philippines/PH-CRMC-13-23/2021	Yes	No	Corrected sequence	Yes	EPI_ISL_5934905	Yes	No	Corrected sequence	No	-
hCoV-19/Philippines/PH-DDOPH-12-1/2021	Yes	Yes	-	-	EPI_ISL_5934981	Yes	Yes	-	-	OP522435
hCoV-19/Philippines/PH-DDOPH-12-2/2021	Yes	Yes	-	-	EPI_ISL_5934985	Yes	Yes	-	-	OP522436
hCoV-19/Philippines/PH-DDOPH-12-3/2021	Yes	Yes	-	-	EPI_ISL_5934986	Yes	Yes	-	-	OP522437
hCoV-19/Philippines/PH-DDOPH-12-4/2021	Yes	Yes	-	-	EPI_ISL_5934987	Yes	Yes	-	-	OP522438
hCoV-19/Philippines/PH-DDOPH-12-5/2021	Yes	Yes	-	-	EPI_ISL_5934988	Yes	Yes	-	-	OP522439
hCoV-19/Philippines/PH-DDOPH-12-6/2021	Yes	Yes	-	-	EPI_ISL_5934989	Yes	Yes	-	-	OP522440
hCoV-19/Philippines/PH-DDOPH-12-7/2021	Yes	Yes	-	-	EPI_ISL_5934990	Yes	Yes	-	-	OP522441
hCoV-19/Philippines/PH-DDOPH-12-8/2021	Yes	Yes	-	-	EPI_ISL_5934991	Yes	No	Corrected sequence	Yes	OP522442
hCoV-19/Philippines/PH-DDOPH-12-9/2021	Yes	Yes	-	-	EPI_ISL_5934992	Yes	Yes	-	-	OP522443
hCoV-19/Philippines/PH-DDOPH-12-10/2021	Yes	Yes	-	-	EPI_ISL_5934982	Yes	Yes	-	-	OP522444
hCoV-19/Philippines/PH-DDOPH-12-11/2021	Yes	Yes	-	-	EPI_ISL_5934983	Yes	Yes	-	-	OP522445
hCoV-19/Philippines/PH-DDOPH-12-12/2021	Yes	Yes	-	-	EPI_ISL_5934984	Yes	Yes	-	-	OP522446
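The comparison in Table 1 can be summarized programmatically. The sketch below is a hypothetical helper that assumes the table is available as tab-delimited text with the column headings shown above and that "-" marks a value that is not applicable.

```python
import csv
import io
from typing import Dict


def summarize_submissions(tsv_text: str) -> Dict[str, int]:
    """Count initial successes and assigned accession IDs per repository."""
    counts = {
        "samples": 0,
        "gisaid_initial_ok": 0,
        "ncbi_initial_ok": 0,
        "gisaid_accessioned": 0,
        "ncbi_accessioned": 0,
    }
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        counts["samples"] += 1
        if row["Initial GISAID submission successful?"] == "Yes":
            counts["gisaid_initial_ok"] += 1
        if row["Initial NCBI submission successful?"] == "Yes":
            counts["ncbi_initial_ok"] += 1
        if row["GISAID accession ID"] not in ("", "-"):
            counts["gisaid_accessioned"] += 1
        if row["NCBI accession ID"] not in ("", "-"):
            counts["ncbi_accessioned"] += 1
    return counts
```

Applied to the 22 rows of Table 1, these counts are 21 successful initial GISAID submissions, 19 successful initial NCBI submissions, 22 GISAID accession IDs, and 21 NCBI accession IDs.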
