Jan 16, 2023

Submission of sequence and contextual data to GISAID, INSDC repositories, or other databases

This protocol is a draft, published without a DOI.
  • Paul Lorenzo A Gaite1,
  • Dr Ritchie Mae T Gamot1,2,
  • Dr Lyre Anni E Murao1,2
  • 1Philippine Genome Center Mindanao;
  • 2University of the Philippines Mindanao
Protocol Citation: Paul Lorenzo A Gaite, Dr Ritchie Mae T Gamot, Dr Lyre Anni E Murao 2023. Submission of sequence and contextual data to GISAID, INSDC repositories, or other databases. protocols.io https://protocols.io/view/submission-of-sequence-and-contextual-data-to-gisa-cgh2tt8e
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: September 13, 2022
Last Modified: January 16, 2023
Protocol Integer ID: 69914
Abstract/Introduction

Timely submission of viral sequences and their corresponding contextual data by public health laboratories is an essential step in SARS-CoV-2 biosurveillance. It keeps sequence repositories up to date in real time, which in turn allows real-time tracking of the virus using both the sequence data and the metadata. Several publicly available online repositories have sections dedicated to SARS-CoV-2 sequences, such as GISAID and the INSDC repositories.

The previous project involved submitting the sequences generated by the PGC Mindanao workflow (refer to the protocol "Introduction and Lineage Assignment of Assembled Sequences") to the GISAID database. In collaboration with BugSeq under the PHA4GE subgrant, PGC Mindanao also submitted the previously generated SARS-CoV-2 sequences to NCBI, a member database of the INSDC, which subsequently released them to its public database.

The protocol below outlines the PGC Mindanao workflow for the submission of sequence and contextual data to GISAID (Section 2). A short section shows how the PHA4GE contextual data package was used in the submission process (Section 3). The process of submission of sequence and contextual data by PGC Mindanao to the NCBI database is also outlined (Section 4). A comparison of results from GISAID and NCBI submissions is shown (Section 5).


PGC Mindanao workflow


This section outlines the PGC Mindanao workflow for the submission of sequence and contextual data to a public database. Figure 1 shows an overview of the entire workflow. The workflow ultimately deposits the sequence and contextual data to the online public database GISAID.
Figure 1. Overview of the PGC Mindanao workflow for submission of sequence and contextual data to the GISAID database

After the consensus sequences are assembled from the viral samples, they are assessed for the number of ambiguous bases and for the presence of frameshift mutations and unexpected stop codons. Sequences are included in the GISAID submission if they have less than 50% ambiguous bases and if any frameshift mutations or unexpected stop codons they contain are confirmed by the sequencing data rather than being sequencing artifacts; all other sequences are excluded (refer to Section 2.6 - "Nextclade workflow" of the protocol "Introduction and Lineage Assignment of Assembled Sequences" for details). The headers of the multi-sequence FASTA files were renamed to conform to GISAID requirements (Section 2.1). In parallel, sequence metadata were uploaded to REDCap under the appropriate data access group of each Sub-National Laboratory (SNL) (Section 2.2).
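The ambiguous-base criterion above can be sketched in a few lines of Python. This is a minimal illustration only, not the script used by PGC Mindanao: the function names are hypothetical, every character other than A, C, G, or T is treated as ambiguous, and the frameshift and stop-codon checks (handled in the Nextclade workflow) are not covered.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of header -> sequence (standard library only)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def ambiguous_fraction(seq):
    """Fraction of positions that are not an unambiguous A, C, G, or T."""
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return 1 - unambiguous / len(seq)


def filter_by_ambiguity(records, max_fraction=0.5):
    """Keep only sequences whose ambiguous fraction is below the cutoff."""
    return {h: s for h, s in records.items() if ambiguous_fraction(s) < max_fraction}


# Made-up toy input: sample_A has 2 of 10 ambiguous bases, sample_B has 8 of 10.
example = ">sample_A\nACGTNNACGT\n>sample_B\nNNNNNNNNAC\n"
kept = filter_by_ambiguity(read_fasta(example))
print(sorted(kept))
```

With the toy input, only sample_A passes the 50% cutoff.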

Sequences were uploaded to REDCap using the script provided by the previous project; the succeeding steps also used scripts provided by that project. As with the metadata, the sequence data were uploaded under the appropriate data access group of each SNL. Both sequence data and metadata were then scraped from the REDCap database using the provided Python script (Section 2.2). PGC Mindanao verified the sequences and their corresponding metadata. Once verification was complete, the sequences and metadata were uploaded to GISAID following its submission protocol (Section 2.3).
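The protocol does not describe how the verification step was carried out. One plausible check, sketched here under that assumption, is to confirm that every FASTA header has a matching metadata entry and vice versa. The function name and the returned keys are hypothetical.

```python
def check_consistency(fasta_headers, metadata_names):
    """Compare FASTA headers with the virus names listed in the metadata.

    Returns headers without metadata, metadata entries without a sequence,
    and headers that occur more than once in the FASTA file.
    """
    seen = set()
    duplicates = set()
    for header in fasta_headers:
        if header in seen:
            duplicates.add(header)
        seen.add(header)
    names = set(metadata_names)
    return {
        "missing_metadata": sorted(seen - names),
        "missing_sequence": sorted(names - seen),
        "duplicate_headers": sorted(duplicates),
    }


# Sample names taken from Table 1; the mismatch is constructed for illustration.
headers = [
    "hCoV-19/Philippines/PH-CRMC-13-14/2021",
    "hCoV-19/Philippines/PH-CRMC-13-15/2021",
]
metadata = [
    "hCoV-19/Philippines/PH-CRMC-13-14/2021",
    "hCoV-19/Philippines/PH-CRMC-13-16/2021",
]
report = check_consistency(headers, metadata)
print(report)
```

An empty list under every key would indicate that the two files describe the same set of samples.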



Formatting of sequence data to standard form:

Various scripts were created and provided by the previous project to convert FASTA headers to the standard form and to perform other steps of the workflow. The provided custom script "old2newheader_reduced.py" converts the header/s of the sequence FASTA file to conform to the GISAID naming convention (Figure 2).
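The internals of "old2newheader_reduced.py" are not reproduced in this protocol, so the following is only a sketch of the kind of renaming involved. It assumes, hypothetically, that each original header begins with the sample ID and that the collection year is supplied separately; the target pattern hCoV-19/Country/SampleID/Year follows the sample names listed in Table 1.

```python
def to_gisaid_name(sample_id, year, country="Philippines"):
    """Build a virus name in the pattern seen in Table 1."""
    return f"hCoV-19/{country}/{sample_id}/{year}"


def rename_headers(fasta_text, years):
    """Rewrite each FASTA header line; sequence lines pass through unchanged.

    `years` maps sample ID -> collection year. A sample ID that is absent
    from `years` raises KeyError, so no header is renamed silently wrong.
    Assumes every header line has a non-empty first token (the sample ID).
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            sample_id = line[1:].split()[0]  # first token of the old header
            line = ">" + to_gisaid_name(sample_id, years[sample_id])
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical old-style header: sample ID followed by free text.
old = ">PH-CRMC-13-14 consensus\nACGT\n"
new = rename_headers(old, {"PH-CRMC-13-14": 2021})
print(new, end="")
```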


Figure 2. Screenshots of two of the scripts used in the workflow, "old2newheader_reduced.py" and "redcap_import_virusname_consensus.py"


Data entry into, and data "scraping" from REDCap Database:


Sequence data and contextual data/metadata (refer to the protocol "Establishing processes to capture standardized contextual data" for details) were entered into the REDCap database of the previous project. Figure 3 shows the landing page displayed after logging in to the REDCap database; the page reflects the data access group used at the last login. Sequence data and metadata for individual samples may be entered manually through the webpage GUI (using the green "Add new record" button and the individual sub-entry pages). Figure 4 shows the case metadata sub-entry page, and Figure 5 shows the analysis metadata sub-entry page. Data may also be uploaded in bulk through the "Data Import Tool" (Figure 6). Sequence data may additionally be uploaded using the provided custom script "redcap_import_virusname_consensus.py" (Figure 2).

Data can be "scraped", or collected, from the same database by using the provided custom script "gisaidprep.py", which outputs a spreadsheet that conforms to GISAID submission requirements (Figure 7).
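The contents of "gisaidprep.py" are not shown here. As a rough sketch of the reshaping it performs, the example below maps exported REDCap records (represented as a list of dictionaries) onto a submission-style table. Everything specific is an assumption: the REDCap field names on the left of the mapping are hypothetical, the column names on the right are modeled on GISAID's bulk upload template and should be checked against the current template, CSV is used only for simplicity, and the record values are made-up placeholders.

```python
import csv
import io

# Hypothetical REDCap field name -> assumed submission column name.
FIELD_MAP = {
    "virus_name": "covv_virus_name",
    "collection_date": "covv_collection_date",
    "location": "covv_location",
    "host": "covv_host",
}


def records_to_csv(records, field_map=FIELD_MAP):
    """Write records as CSV text using the mapped column names.

    Fields missing from a record are written as empty cells so that gaps
    are visible during verification instead of causing an error.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(field_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({col: record.get(field, "") for field, col in field_map.items()})
    return buffer.getvalue()


# Placeholder record; none of these values come from a real sample.
records = [
    {
        "virus_name": "hCoV-19/Philippines/EXAMPLE-1/2021",
        "collection_date": "2021-01-01",
        "location": "Asia / Philippines",
        "host": "Human",
    }
]
csv_text = records_to_csv(records)
print(csv_text, end="")
```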


Figure 3. Main landing page after entering login of the REDCap database
Figure 4. Case metadata sub-entry page of a REDCap database sample entry

Figure 5. Analysis metadata sub-entry page of a REDCap database sample entry

Figure 6. Data import tool of REDCap


Submission to GISAID database:

The spreadsheet output by "gisaidprep.py" in the previous section (containing the sequence metadata), together with the final formatted sequence FASTA file, is used to upload the sequence data and metadata to GISAID. Figure 7 shows this spreadsheet, whose metadata are submitted to GISAID alongside the sequence data. The standard GISAID submission procedure was followed; the recommended approach is to use the command-line tool CLI2. This requires a live client ID for authentication, which may be requested by emailing the support team for the tool at GISAID (clisupport@gisaid.org). Figure 8 shows a number of the commands available in the CLI2 command-line tool.


Figure 7. Spreadsheet outputted by "gisaidprep.py" script containing sequence metadata conforming to GISAID submission requirements


Figure 8. A number of the commands that can be issued by the CLI2 command-line tool
After the sequence data and metadata are submitted to GISAID through the CLI2 tool, they are reviewed by curators at GISAID. For successfully released samples, the uploader is notified by email from submit@gisaid.org, together with the corresponding individual accession IDs, and the samples are made immediately available to registered GISAID users. For unsuccessful sample/s, the uploader is notified by email from hcov-19@gisaid.org with the reason for the non-release of each sample. The uploader may then resubmit the corrected sequence/s or inform GISAID that the sequence/s is correct and is supported by the raw sequencing data.




PHA4GE contextual data package

The contextual data template spreadsheet from the package can be used to standardize submission to repositories. See protocol on "Establishing processes to capture standardized contextual data" for details.
NCBI GenBank submission process

BugSeq conducted an orientation session on the submission of sequence and contextual data to INSDC repositories, particularly NCBI GenBank. Sequences previously generated by PGC Mindanao were submitted live during the orientation. This section outlines the process.

Figure 9 shows the landing page or submission portal for sequences in GenBank. The process is done entirely on webpage GUI at the online NCBI GenBank page. The page also shows the overview of previous sequence submissions.

Figure 10 shows the first step to submission, asking the uploader the type of sequences to be uploaded and other details (e.g. if it is SARS-CoV-2).

Figure 11 shows the page for the second step, which asks for details regarding the submitter.

Figure 12 shows the third step of the process, which asks for details regarding the sequencing technology used (e.g. sequencing platform and method).

Figure 13 shows the fourth step, which is the upload page for the sequence file and also asks when the sequences are to be released.

Figure 14 presents the fifth step, which is the sequence processing page and also where the uploader is given an option to automatically remove failed sequences.

Figure 15 shows the sixth step, which is the source information page and where the uploader is asked about the details on the sequence IDs.

Figures 16 and 17 show the seventh step of the process wherein source modifiers, or sample/sequence metadata, are provided. The uploader is given an option to provide the metadata by filling out the editable table within the page or uploading a tab-delimited table file containing the metadata. Figure 17 shows the editable table where the source modifiers are provided in the submission process in this case.
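For the file-upload option, a tab-delimited source modifier table can be prepared programmatically. The sketch below is illustrative only: the column names are assumed from NCBI's source modifier table conventions and should be checked against the template offered on the submission page, the function name is hypothetical, and the row values are made-up placeholders.

```python
import csv
import io

# Assumed column names; verify against the template on the NCBI submission page.
COLUMNS = ["Sequence_ID", "Isolate", "Collection_date", "Country", "Host"]


def source_modifiers_tsv(rows, columns=COLUMNS):
    """Return tab-delimited text with one row per sequence.

    Raises ValueError if any row lacks a value for one of the columns,
    so incomplete metadata is caught before upload.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        missing = [c for c in columns if not row.get(c)]
        if missing:
            raise ValueError(f"{row.get('Sequence_ID', '?')}: missing {missing}")
        writer.writerow([row[c] for c in columns])
    return buffer.getvalue()


# Placeholder row; none of these values come from a real sample.
rows = [
    {
        "Sequence_ID": "EXAMPLE-1",
        "Isolate": "EXAMPLE-1",
        "Collection_date": "2021-01-01",
        "Country": "Philippines",
        "Host": "Homo sapiens",
    }
]
tsv_text = source_modifiers_tsv(rows)
print(tsv_text, end="")
```

The Sequence_ID values must match the sequence IDs in the uploaded FASTA file for the table to be linked to the right sequences.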

Figure 18 shows the eighth step asking details on references, such as sequence author information and status of the publication linked to the sequences.

Figure 19 shows the ninth step, which is the review of the sequence submission before uploading and submitting the sequence data and corresponding metadata to the NCBI GenBank database.


Figure 9. Landing page of the submission portal of NCBI GenBank

Figure 10. Submission type page

Figure 11. Submitter information page

Figure 12. Sequencing technology page


Figure 13. Sequence upload and information page
Figure 14. Sequence processing page


Figure 15. Source information page


Figure 16. Source modifiers page

Figure 17. Editable table for inputting source modifiers


Figure 18. References page for the submission


Figure 19. Review page for the submission
After the sequence data and metadata are submitted to NCBI GenBank through its Submission Portal, they are reviewed by curators at NCBI GenBank. The uploader is notified by email whether the whole batch of sequences has been successfully released, with the reason for the non-release of each unsuccessful sample. The uploader may then resubmit the corrected sequence/s through the Submission Portal or inform NCBI GenBank that the sequence/s is correct and is supported by the raw sequencing data.




Comparison of GISAID and GenBank submissions

Table 1 shows details of the sequences submitted to GISAID (from the previous project) and GenBank (from this grant): whether the initial database submission was successful, the step/s taken to resolve an unsuccessful initial submission, whether the resubmission was successful, and the accession ID assigned by the database (if applicable).


Sample ID | GISAID submitted? | Initial GISAID submission successful? | Resolution if initial GISAID submission unsuccessful | Resubmission to GISAID successful? | GISAID accession ID | NCBI submitted? | Initial NCBI submission successful? | Resolution if initial NCBI submission unsuccessful | Resubmission to NCBI successful? | NCBI accession ID
hCoV-19/Philippines/PH-CRMC-13-14/2021 | Yes | Yes | - | - | EPI_ISL_5934896 | Yes | No | Corrected sequence | Yes | OP522426
hCoV-19/Philippines/PH-CRMC-13-15/2021 | Yes | Yes | - | - | EPI_ISL_5934897 | Yes | Yes | - | - | OP522427
hCoV-19/Philippines/PH-CRMC-13-16/2021 | Yes | Yes | - | - | EPI_ISL_5934898 | Yes | Yes | - | - | OP522428
hCoV-19/Philippines/PH-CRMC-13-17/2021 | Yes | Yes | - | - | EPI_ISL_5934899 | Yes | Yes | - | - | OP522429
hCoV-19/Philippines/PH-CRMC-13-18/2021 | Yes | Yes | - | - | EPI_ISL_5934900 | Yes | Yes | - | - | OP522430
hCoV-19/Philippines/PH-CRMC-13-19/2021 | Yes | Yes | - | - | EPI_ISL_5934901 | Yes | Yes | - | - | OP522431
hCoV-19/Philippines/PH-CRMC-13-20/2021 | Yes | Yes | - | - | EPI_ISL_5934902 | Yes | Yes | - | - | OP522432
hCoV-19/Philippines/PH-CRMC-13-21/2021 | Yes | Yes | - | - | EPI_ISL_5934903 | Yes | Yes | - | - | OP522433
hCoV-19/Philippines/PH-CRMC-13-22/2021 | Yes | Yes | - | - | EPI_ISL_5934904 | Yes | Yes | - | - | OP522434
hCoV-19/Philippines/PH-CRMC-13-23/2021 | Yes | No | Corrected sequence | Yes | EPI_ISL_5934905 | Yes | No | Corrected sequence | No | -
hCoV-19/Philippines/PH-DDOPH-12-1/2021 | Yes | Yes | - | - | EPI_ISL_5934981 | Yes | Yes | - | - | OP522435
hCoV-19/Philippines/PH-DDOPH-12-2/2021 | Yes | Yes | - | - | EPI_ISL_5934985 | Yes | Yes | - | - | OP522436
hCoV-19/Philippines/PH-DDOPH-12-3/2021 | Yes | Yes | - | - | EPI_ISL_5934986 | Yes | Yes | - | - | OP522437
hCoV-19/Philippines/PH-DDOPH-12-4/2021 | Yes | Yes | - | - | EPI_ISL_5934987 | Yes | Yes | - | - | OP522438
hCoV-19/Philippines/PH-DDOPH-12-5/2021 | Yes | Yes | - | - | EPI_ISL_5934988 | Yes | Yes | - | - | OP522439
hCoV-19/Philippines/PH-DDOPH-12-6/2021 | Yes | Yes | - | - | EPI_ISL_5934989 | Yes | Yes | - | - | OP522440
hCoV-19/Philippines/PH-DDOPH-12-7/2021 | Yes | Yes | - | - | EPI_ISL_5934990 | Yes | Yes | - | - | OP522441
hCoV-19/Philippines/PH-DDOPH-12-8/2021 | Yes | Yes | - | - | EPI_ISL_5934991 | Yes | No | Corrected sequence | Yes | OP522442
hCoV-19/Philippines/PH-DDOPH-12-9/2021 | Yes | Yes | - | - | EPI_ISL_5934992 | Yes | Yes | - | - | OP522443
hCoV-19/Philippines/PH-DDOPH-12-10/2021 | Yes | Yes | - | - | EPI_ISL_5934982 | Yes | Yes | - | - | OP522444
hCoV-19/Philippines/PH-DDOPH-12-11/2021 | Yes | Yes | - | - | EPI_ISL_5934983 | Yes | Yes | - | - | OP522445
hCoV-19/Philippines/PH-DDOPH-12-12/2021 | Yes | Yes | - | - | EPI_ISL_5934984 | Yes | Yes | - | - | OP522446
Table 1. Comparison of GISAID and NCBI submission details