A key aspect of the Hub's approach to supporting data sharing is to leverage existing resources. Hub users can find and access ‘omics, traits, abundance, occurrence, and epidemiological data through the Hub’s search functionality. My role as the Hub curator (Sarah Kelly) is to work with you, the community, to identify key datasets in the vector-borne disease field that would benefit the community by being discoverable on the Hub site. I will work with you, the depositor, to ensure the data standards and privacy levels you require are met during the curation process. This includes embargo support: if you wish to work with me to standardise your data but are not ready for it to be publicly available, we can place an embargo on your data until you wish to proceed. I will support the uploading of data into the existing specialised repositories, following the guidelines and SOPs for those resources. If no specialised repository is available for your data type, the Hub can host your metadata and/or data directly.
The SOPs for these specialised repositories can be found below.
Our recommended specialised repository for occurrence type data is GBIF.
Datasets published through GBIF.org have sufficiently consistent detail to contribute information about the location of individual organisms in time and space; that is, they offer evidence of the occurrence of a species (or other taxon) at a particular place on a specified date. Occurrence datasets make up the core of data published through GBIF.org, and examples range from specimens and fossils in natural history collections and observations by field researchers and citizen scientists to data gathered from camera traps or remote-sensing satellites.
Occurrence records in these datasets sometimes provide only general locality information, sometimes simply identifying the country, but in many cases, more precise locations and geographic coordinates support fine-scale analysis and mapping of species distributions.
Datasets published through GBIF have to be formatted according to Darwin Core terms.
The Darwin Core Standard (DwC) offers a stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources. Most datasets shared through GBIF.org are published using the Darwin Core Archive format (DwC-A). A template for checklist data is available above under “Template for checklist data”.
What is Darwin Core? https://www.gbif.org/darwin-core
Darwin Core manual: https://obis.org
The Darwin Core manual provides a list of Darwin Core terms that should be used in datasets.
Columns in datasets published through GBIF must be renamed according to their most relevant Darwin Core terms.
Template for occurrence data according to the Darwin Core standards: here
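As a sketch of what this renaming involves, the snippet below maps a dataset’s original column headers to their Darwin Core equivalents. The original header names and the mapping itself are made up for illustration; build the mapping from your own dataset and the Darwin Core term list.

```python
import csv
import io

# Hypothetical mapping from a project's own column headers to Darwin Core
# terms. Your dataset's original names will differ; the Darwin Core terms
# on the right-hand side are the standard ones.
DWC_RENAME = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date_collected": "eventDate",
    "country_name": "country",
}

def rename_to_dwc(csv_text: str) -> str:
    """Return CSV text with header columns renamed to Darwin Core terms."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Rename only the header row; unknown columns keep their original name.
    rows[0] = [DWC_RENAME.get(header, header) for header in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

raw = "species,lat,lon,date_collected\nAedes aegypti,6.52,3.37,2021-05-14\n"
print(rename_to_dwc(raw).splitlines()[0])
# → scientificName,decimalLatitude,decimalLongitude,eventDate
```

Columns with no Darwin Core equivalent pass through unchanged, so you can spot them in the output and decide how to handle them.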
Event Core describes when and where a specific sampling event happened and contains information such as location and date. Event Core is often used to organise data tables when there is more than one sampling occasion and/or location, with different occurrences linked to each sampling event. This organisation follows the rationale of most ecological studies and typical marine sampling designs.
Event Core can be used in combination with the Occurrence and eMoF (extended Measurement or Fact) extensions. The identifier that links Event Core to the extensions is the eventID. parentEventID can also be used to give information on hierarchical sampling. occurrenceID can also be used in datasets with Event Core in order to link information between the Occurrence extension and the eMoF extension. Occurrence Core datasets, by contrast, describe observations and specimen records.
Occurrence Core is often the preferred structure for museum collections, citations of occurrences from literature, and sampling activities.
Datasets formatted in Occurrence Core can use the eMoF extension when you have biotic measurements or facts about your specimens. The DNA-derived data extension can also be used to link to DNA sequences. The identifier that links Occurrence Core to the extension(s) is the occurrenceID.
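The identifier linkage described above can be illustrated with a toy example. All record values below are invented; only the identifier fields (eventID, parentEventID, occurrenceID) and term names follow the Darwin Core conventions discussed here.

```python
# Toy Event Core table: a parent sampling campaign and one child station,
# linked by parentEventID to express hierarchical sampling.
events = [
    {"eventID": "campaign1", "parentEventID": "",
     "eventDate": "2022-06", "locality": "Lagos lagoon"},
    {"eventID": "campaign1-stn1", "parentEventID": "campaign1",
     "eventDate": "2022-06-03", "locality": "Station 1"},
]

# Toy Occurrence extension: each record points back to its sampling event
# via eventID, and carries its own occurrenceID for further extensions.
occurrences = [
    {"occurrenceID": "occ-001", "eventID": "campaign1-stn1",
     "scientificName": "Anopheles gambiae"},
    {"occurrenceID": "occ-002", "eventID": "campaign1-stn1",
     "scientificName": "Culex pipiens"},
]

def occurrences_for(event_id: str) -> list[dict]:
    """All occurrence records sampled during a given event."""
    return [o for o in occurrences if o["eventID"] == event_id]

print([o["scientificName"] for o in occurrences_for("campaign1-stn1")])
# → ['Anopheles gambiae', 'Culex pipiens']
```

In a real Darwin Core Archive the two tables are separate files joined by these identifiers; the principle shown here is the same.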
Occurrence Core standards are often used for occurrence data. A list of required Darwin Core information to publish occurrence data can be found here.
Note: while there are required terms needed for a dataset to be published on GBIF, additional information on the samples/species recorded (e.g. sampling protocol, habitat, additional remarks on geo-referencing/location) should also be included in the dataset according to the Darwin core terms.
In order to publish on GBIF, new publishers need to be endorsed by GBIF participants. This is done via regional GBIF nodes, the UK’s being the National Biodiversity Network (NBN).
While registration of an organisation can be done on GBIF, publishing has to be carried out through the NBN. To share data with the NBN Atlas, the organisation must be set up as a data partner and agree to the NBN Atlas terms of use. To become a new data partner with the NBN, email [email protected] with your organisation’s details, including a point of contact.
The point of contact provided by the organisation will then be contacted by an NBN representative, who will provide further guidance and feedback on the datasets. The full guidelines for registration as a data partner can be found on the NBN Atlas website.
Our recommended specialised repository for abundance type data is VecDyn (part of VectorByte).
If you do not already have an account on VecDyn, you must create one and request access to upload data. The link to do this can be found at the top right-hand corner of the page.
Once you have permission to upload data to VecDyn, you will see a drop-down menu under your login name at the top right corner of the page. The menu will now include an option to ‘Upload VecDyn Data’.
Click the ‘Upload VecDyn Data’ button; here you will find the latest instructions for uploading (including column definitions) and a template you can download to ensure the column headers in your dataset are those recognised by the VecDyn validator. The column headers in this template match the column names on the VecDyn Column Definitions page.
The VecDyn column definitions display the columns or variables that should be present in your dataset. Those columns/variables that are mandatory are labelled as ‘true’ in the ‘Is Required’ column.
Once your template is populated with your data, please ensure you have followed all points in the instruction manual. Now you are ready to upload your data file. Drop your file into the upload box and press ‘Upload’.
Your data will now run through a validator. Validation should be relatively quick, but the time taken depends on the size of the dataset. The validator will draw your attention to any errors in your data, such as missing fields or duplicated samples. These errors must be fixed before the dataset will pass validation.
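It can save a round trip to check your file locally before uploading. The sketch below confirms that every column flagged as mandatory is present in a dataset’s header row; the required-column names used here are placeholders, so substitute the real list of columns marked ‘true’ in the ‘Is Required’ column of the VecDyn Column Definitions page.

```python
# Placeholder set of required columns -- replace with the columns marked
# "Is Required" = true on the VecDyn Column Definitions page.
REQUIRED = {"taxon", "sample_start_date", "sample_value", "location_description"}

def missing_required(header_row: list[str]) -> list[str]:
    """Return the required columns absent from a dataset's header row."""
    return sorted(REQUIRED - set(header_row))

print(missing_required(["taxon", "sample_value", "notes"]))
# → ['location_description', 'sample_start_date']
```

An empty list means every mandatory column is present (the validator will still check the values themselves, e.g. for duplicated samples).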
Once the dataset has passed validation, it is submitted to the VectorByte team for upload. From this point you no longer have direct access to the data; however, if you spot a mistake, just email the team, and they should be able to identify and delete the offending dataset before it is uploaded.
Please make a note of the date and time that you uploaded the dataset which you want discarded. This will make it a lot easier for the team to identify which dataset is yours!
VectorByte will contact you once your dataset has been added to the database.
Our recommended specialised repository for trait type data is VecTraits (part of VectorByte) .
If you do not already have an account on VecTraits, you must create an account and request access to upload data. This can be found in the top right-hand corner of the page.
Once you have access to upload data to VecTraits you will see a drop-down menu under your login name on the top right corner of the page. There will now be an option to ‘Upload VecTraits Data’.
Click the ‘Upload VecTraits Data’ button; here you will find the latest instructions for uploading (including column definitions) and a template you can download to ensure the column headers in your dataset are those recognised by the VecTraits validator. The column headers in this template match the column names on the VecTraits Column Definitions page.
The VecTraits column definitions display the columns or variables that should be present in your dataset. Those columns/variables that are mandatory are labelled as ‘true’ in the ‘Is Required’ column.
Once your template is populated with your data, please ensure you have followed all points in the instruction manual. Now you are ready to upload your data file. Drop your file into the upload box and press ‘Upload’.
Your data will now run through a validator. Validation should be relatively quick, but the time taken depends on the size of the dataset. The validator will draw your attention to any errors in your data, such as missing fields or duplicated samples. These errors must be fixed before the dataset will pass validation.
Once the dataset has passed validation, it is submitted to the VectorByte team for upload. From this point you no longer have direct access to the data; however, if you spot a mistake, just email the team, and they should be able to identify and delete the offending dataset before it is uploaded.
Please make a note of the date and time that you uploaded the dataset which you want discarded. This will make it a lot easier for the team to identify which dataset is yours!
VectorByte will contact you once your dataset has been added to the database.
Our recommended specialised repository for genomic type data is GenBank.
Please follow the links below for the GenBank submission types and tools.
GenBank submission types
Some authors are concerned that the appearance of their data in GenBank prior to publication will compromise their work. GenBank will, upon request, withhold the release of new submissions for a specified period of time. However, if the accession number or sequence data appears in print or online prior to the specified date, your sequence will be released. To prevent any delay in the appearance of published sequence data, we urge authors to inform us when the data are published. As soon as it is available, please send the full publication details (all authors, title, journal, volume, pages and date) to: [email protected].
If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. We assume that you have obtained any informed-consent authorisations your organisation requires prior to submitting your sequences.
Our recommended specialised repository for proteomic type data is ProteomeXchange.
ProteomeXchange data submission documentation
Our recommended specialised repository for microarray type data is Gene Expression Omnibus (GEO).
GEO data submission guidelines
Keeping GEO submissions private pre-publication
Our recommended specialised repository for transcriptomic type data is Sequence Read Archive (SRA).
Please read the SRA submission quickstart guide.
The preferred data submission format is FASTQ files.
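FASTQ stores each read as four lines: a header beginning ‘@’, the sequence, a ‘+’ separator line, and a quality string the same length as the sequence. A minimal local sanity check before submission might look like the following toy example; it is not an official SRA tool, just a sketch of the format rules.

```python
def check_fastq(text: str) -> int:
    """Return the number of well-formed FASTQ records; raise on a bad one."""
    lines = [line for line in text.splitlines() if line]
    assert len(lines) % 4 == 0, "truncated FASTQ record"
    records = 0
    for i in range(0, len(lines), 4):
        header, seq, separator, quality = lines[i:i + 4]
        # Each record must start with "@" and have a "+" separator line.
        assert header.startswith("@"), "missing @ header"
        assert separator.startswith("+"), "missing + separator"
        # The quality string encodes one score per base, so lengths match.
        assert len(seq) == len(quality), "quality string length mismatch"
        records += 1
    return records

example = "@read1\nACGTACGT\n+\nIIIIIIII\n"
print(check_fastq(example))  # → 1
```

Real submissions should rely on the checks built into the SRA submission portal; a quick local pass like this simply catches truncated files early.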