Data Management Plan (DMP)

This page summarises how data will be collected, curated, stored, shared, and preserved for the Vector-Borne Diseases Hub.

0. Project name
1. Description of the data
2. Data management, documentation, and curation
3. Data sharing and access
4. Data security
5. Capabilities
6. Maintaining and implementing the DMP
7. Environmental considerations
8. Responsibilities
9. Relevant policies
10. Author of this DMP

0. Project name

One Health VBD Hub

1. Description of the data

1.1. Type of study

Consolidating and curating external data.
Integrating data into appropriate repositories and indexing for discovery.
Developing AI tools to extract data from literature.
Providing tools for analysis, modelling, and response support.

1.2. Types of data

Data on VBD systems.
Omics: genomic, transcriptomic, metagenomic data.
Trait: phenotypic and demographic traits.
Abundance and occurrence: time-series and point records.
Epidemiological: incidence, prevalence.
Derived and synthetic data.
Harmonised, integrated tables linking multiple data types.
Summary products, distribution maps, time-series summaries, and dashboard-ready aggregates.
AI-extracted datasets from text-mining of published literature.

1.3. Origin of the data

Data originates from primary data producers (UK-based research and surveillance programmes), international repositories with omics, trait, abundance, and biodiversity data, machine-assisted text mining of publications, and grey literature.

1.4. Format and scale of the data

Preferred formats

Tabular: CSV/TSV (or Parquet)
Structured: JSON
Geospatial: GeoJSON, GeoTIFF, CSV + WKT
Environmental: NetCDF, GeoTIFF, or tidy tabular
Sequence: Partner-defined (FASTA, BAM/CRAM, VCF)
Documentation/Metadata: Markdown, plain text, PDF, JSON
Code: Text-based source files
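
As an illustration of the "CSV + WKT" option, the following minimal sketch converts tabular rows carrying a WKT point into a GeoJSON FeatureCollection using only the Python standard library. The column names (record_id, species, geometry_wkt), the sample row, and the function names are assumptions made for this example and are not part of the Hub's schema; only 2D POINT geometries are handled.

```python
import csv
import io
import re

# Matches a 2D WKT point such as "POINT (-0.1749 51.4988)".
# Other WKT geometry types are out of scope for this sketch.
_WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)

# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = (
    "record_id,species,geometry_wkt\n"
    "r1,Culex pipiens,POINT (-0.1749 51.4988)\n"
)


def row_to_feature(row, wkt_column="geometry_wkt"):
    """Convert one CSV row (a dict) into a GeoJSON Feature dict."""
    match = _WKT_POINT.match(row[wkt_column])
    if match is None:
        raise ValueError(f"Unsupported or malformed WKT: {row[wkt_column]!r}")
    # WKT stores points as "x y" (longitude latitude); GeoJSON uses the same order.
    lon, lat = float(match.group(1)), float(match.group(2))
    properties = {k: v for k, v in row.items() if k != wkt_column}
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def csv_to_feature_collection(csv_text, wkt_column="geometry_wkt"):
    """Read CSV text and return a GeoJSON FeatureCollection dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features = [row_to_feature(row, wkt_column) for row in reader]
    return {"type": "FeatureCollection", "features": features}
```

Calling csv_to_feature_collection(SAMPLE_CSV) returns a dict that can be serialised with json.dumps. Keeping the WKT column in the CSV means the tabular file stays the single source of truth, and the GeoJSON is a derived view.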

Scale

The Hub will index and link up to tens of millions of records held in partner repositories. Hub-hosted datasets are anticipated to reach 100 to 500 GB, with most data ultimately deposited in long-term repositories. Derived summary products and environmental covariates are expected to add tens of GB, which remains manageable.

2. Data management, documentation, and curation

2.1. Managing, storing, and curating data

Secure, RAID-backed storage with snapshots. Code and scripts are in Git repositories. Raw data is kept in separate staging areas.
Standard extract-transform-load pipelines convert data. AI-assisted extraction is human-supervised. Steps are scripted, logged, and re-executable for full traceability.
Systems use ISO 27001-compatible cloud services (UK/EEA residency, encryption). Backups are weekly, in separate regions.
Datasets are deposited in community repositories for preservation and DOIs.
Access is controlled by role-based access control (RBAC): public, authenticated users, curators, and administrators.
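
To make "scripted, logged, and re-executable" concrete, here is a minimal sketch of an extract-transform-load step that records input and output checksums for each run. It is not the Hub's pipeline: the column names (site, date, count), the cleaning rules, and all function names are assumptions chosen for illustration.

```python
import csv
import hashlib
import io
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_sketch")


def sha256_text(text):
    """Return the SHA-256 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extract(raw_csv):
    """Parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Trim whitespace and cast count to int; log and skip malformed rows."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "site": row["site"].strip(),
                "date": row["date"].strip(),
                "count": int(row["count"]),
            })
        except (KeyError, ValueError, TypeError, AttributeError):
            log.warning("Skipping malformed row: %r", row)
    return cleaned


def load(rows):
    """Serialise deterministically so identical input gives identical output."""
    return json.dumps(rows, sort_keys=True)


def run_pipeline(raw_csv):
    """Run all steps and return (output, run_record) for the audit trail."""
    rows = extract(raw_csv)
    cleaned = transform(rows)
    output = load(cleaned)
    record = {
        "input_sha256": sha256_text(raw_csv),
        "output_sha256": sha256_text(output),
        "rows_in": len(rows),
        "rows_out": len(cleaned),
    }
    log.info("Pipeline run: %s", json.dumps(record))
    return output, record
```

Because every step is a pure function of its input, re-running the pipeline on the same raw file should yield the same output checksum, which gives a simple way to check that a result is reproducible.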

2.2. Metadata standards and data documentation

Datasets will have both human-readable and machine-actionable metadata.

Dataset-level metadata: Text summarising scope, purpose, and limitations; authors and contacts; methods; licences; version; and links to related publications.
Machine-readable metadata: Formats aligned with community standards, for example Darwin Core and VecDyn/VecTraits for occurrence and many abundance datasets, and EML for more complex ecological datasets.
Variable-level documentation: Data dictionaries describing each variable, clear indication of coordinate reference systems for spatial data, time zones for temporal data, and transformations.
Provenance and versioning: Unique IDs for datasets and for individual records or samples, links to source repositories, original DOIs, and publications. Metadata is retained even if underlying data is restricted or withdrawn.
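
The sketch below shows one way dataset-level metadata and a data dictionary could be held together in a machine-actionable record, with a check that flags data columns lacking documentation. The field names, class names, and example values are hypothetical and do not represent the Hub's metadata schema or any community standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Variable:
    """One data dictionary entry."""
    name: str
    description: str
    unit: str = ""


@dataclass
class DatasetMetadata:
    """Dataset-level metadata plus variable-level documentation."""
    dataset_id: str
    title: str
    summary: str
    licence: str
    version: str
    contacts: list
    crs: str = ""        # coordinate reference system for spatial data
    time_zone: str = ""  # time zone used for temporal fields
    variables: list = field(default_factory=list)

    def undocumented_columns(self, columns):
        """Return the data columns that have no data dictionary entry."""
        documented = {v.name for v in self.variables}
        return [c for c in columns if c not in documented]

    def to_json(self):
        """Serialise the whole record, including nested variables."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example record; every value here is invented.
EXAMPLE = DatasetMetadata(
    dataset_id="example-0001",
    title="Example trap counts",
    summary="Illustrative record only.",
    licence="CC-BY-4.0",
    version="1.0.0",
    contacts=["curator@example.org"],
    crs="EPSG:4326",
    time_zone="UTC",
    variables=[
        Variable("site", "Sampling site identifier"),
        Variable("count", "Individuals caught per trap night", "individuals"),
    ],
)
```

For instance, EXAMPLE.undocumented_columns(["site", "count", "trap_type"]) returns the columns that still need a dictionary entry, which could serve as a gate before a dataset is published.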

2.3. Data preservation strategy and standards

We aim to ensure that data remains usable and citable beyond the end of the award. Primary preservation for most data rests with domain repositories (for example, occurrence data in GBIF). If the project ceases, Hub-specific value-added artefacts will be archived in an institutional or generalist repository with DOIs and sufficient metadata to reconstruct the integration. Code, containers, and workflows will be preserved via open Git repositories.

3. Data sharing and access

3.1. Where will data be shared?

VBD Hub platform: Central discovery portal exposing metadata, search, filtering, and visualisation across data types. Programmatic access via API and R/Python packages.
VBD Hub repository: For data that cannot be uploaded to existing global repositories.
Partner repositories: VecTraits or VecDyn for trait and abundance data, GBIF for species occurrence and some abundance datasets, INSDC (GenBank/EMBL-EBI) and ProteomeXchange for omics.
Generalist and institutional repositories: Figshare, Zenodo, or institutional repositories for integrated or derived products, and AI-curated corpora. Institutional archives for snapshots of key Hub datasets and code.
Software and training: GitHub for packages and code. Global Vector Hub for publications (for example, SOPs and training) with cross-links from vbdhub.org.

3.2. When will data be available?

Data deposited before publication will be openly available at the time of article publication, with DOIs cited in manuscripts.
Validated baseline curated datasets will be available within 3 months of data collection or curation.
All data will be shared according to agreed data sharing plans, including time-limited embargoes where necessary.

For sensitive data, aggregated or anonymised summaries may be released, and access to the underlying records may remain restricted indefinitely. All data, including restricted data, will have registered and searchable metadata with clear access conditions and contact points.

3.3. How will data be made findable and accessible?

Persistent identifiers (DOIs or PIDs, stable URLs).
Rich, searchable metadata (structured, exposed via web interface, endpoints to support harvesting).
Standardised access mechanisms (open web access to metadata, download, and documentation, APIs for programmatic access or bulk-download).
Indexing and registries (key datasets and Hub resources in relevant catalogues).

3.4. How will data be made reusable?

Clear licensing: Prefer CC BY 4.0 licence for curated datasets and any software produced. Licence metadata embedded in dataset records.
High-quality documentation: Persistent links to methods, SOPs, and study protocols used to generate or curate datasets. Tutorials showing how to load, combine, and analyse Hub data.
Provenance and quality flags: Provenance recorded (source article or report, repository, pipeline) and quality flags indicating limitations.
Standard formats and vocabularies: Use of controlled vocabularies and ontologies, consistent encoding of dates, times, and locations.
Users are encouraged to cite both the dataset DOI and the key underlying publications, ensuring appropriate credit to data generators and curators.
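
As one example of consistent date and time encoding, the sketch below normalises ISO 8601 timestamps to UTC and rejects values without a time zone offset rather than guessing. The function name and the choice of UTC as the target are assumptions for illustration; the plan itself does not prescribe a specific convention.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp):
    """Normalise an ISO 8601 timestamp with an explicit offset to UTC.

    Timestamps without an offset are rejected instead of being assigned
    a time zone silently.
    """
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"Timestamp has no time zone offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc).isoformat()
```

For example, to_utc_iso("2024-05-01T09:30:00+01:00") gives "2024-05-01T08:30:00+00:00". Rejecting naive timestamps forces the ambiguity to be resolved at curation time, where the original time zone is more likely to be known.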

3.5. Restrictions or delays to sharing, with planned actions to limit such restrictions

Potential reasons for restriction

Privacy or confidentiality: data that could identify an individual, a holding, or a community.
Third-party rights or IP: data under licence or contract.
Publication embargoes: time-limited restrictions.

Managing restrictions

Conduct data protection and risk assessments.
Prioritise aggregation, anonymisation, and fuzzing over withholding.
Use data use agreements for controlled data.
Keep embargoes short and justified.
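
The "aggregation, anonymisation, and fuzzing" approach can be sketched as follows: coordinates are snapped to the centre of a coarse grid cell, counts are summed per cell, and cells backed by too few records are suppressed. The 0.1 degree cell size, the minimum of 3 records, and the function names are illustrative assumptions, not thresholds set by this plan; appropriate values would come from the data protection and risk assessments.

```python
import math
from collections import Counter


def snap_to_grid(lat, lon, cell_deg=0.1):
    """Return the centre of the grid cell containing (lat, lon)."""
    lat_c = (math.floor(lat / cell_deg) + 0.5) * cell_deg
    lon_c = (math.floor(lon / cell_deg) + 0.5) * cell_deg
    # Round away floating-point noise so cell centres compare equal.
    return round(lat_c, 6), round(lon_c, 6)


def aggregate_counts(records, cell_deg=0.1, min_records=3):
    """Sum counts per grid cell and suppress sparsely populated cells.

    records is an iterable of (lat, lon, count) tuples. Cells with fewer
    than min_records contributing records are dropped from the output.
    """
    totals = Counter()
    contributors = Counter()
    for lat, lon, count in records:
        cell = snap_to_grid(lat, lon, cell_deg)
        totals[cell] += count
        contributors[cell] += 1
    return {
        cell: total
        for cell, total in totals.items()
        if contributors[cell] >= min_records
    }
```

Snapping to a cell centre, rather than adding random noise, keeps the output deterministic, so repeated releases of the same data cannot be averaged to recover the original locations.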

4. Data security

4.1. Formal information and data security standard

GDPR and the Data Protection Act 2018 (DPA 2018).
ISO 27001-aligned information security practices via institutional IT services and cloud providers (for example ISO 27001/27017/27018 certification, SOC-type reports).
Institutional frameworks for information security, including Cyber Essentials.

4.2. Main risks to data security

Unauthorised access (Low): Role-based access (least privilege); mandatory institutional login and MFA for admins; regular access reviews; logging; incident response with rapid access revocation and DPO intervention.
Accidental loss or deletion (Low to moderate): Automated backups with recovery, version control, restricted delete permissions; incident response includes backup restoration, documenting cause, and updating permissions or workflows.
Data interception in transit (Very Low): HTTPS/TLS, secure transfer, and modern ciphers; incident response includes revoking credentials, reviewing logs, and notifying parties as required.
Misuse or secondary use beyond consent (Low to moderate): Data use policies, click-through terms, and controlled-access data use agreements. Anonymise or aggregate sensitive data; incident response includes suspending access, investigating, and notifying governance or DPO.
Integrity compromise or corruption (Low): Write-once or versioned storage. Controlled, peer-reviewed curation pipelines with audit trails; incident response includes version rollback, documentation, pipeline patching, and user notification.
Service disruption (availability) (High): Highly available cloud; monitoring, alerting, rate limiting, and WAF; incident response includes documenting maintenance windows and following IT incident procedures.
Legal or policy non-compliance (Very Low): Consult DPOs and RDM; document DPIAs; restrict use; review compliance; incident response includes immediate review with legal/DPO and technical fixes.
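
The role-based, least-privilege access control named above can be sketched as a default-deny permission check. The four roles come from section 2.1; the action names, the ordering of roles into a hierarchy, and the rule that administrator-level actions require a verified MFA session are assumptions made for this example.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles ordered from least to most privileged (assumed hierarchy)."""
    PUBLIC = 0
    AUTHENTICATED = 1
    CURATOR = 2
    ADMINISTRATOR = 3


# Hypothetical minimum role for each action; anything unlisted is denied.
REQUIRED_ROLE = {
    "read_metadata": Role.PUBLIC,
    "download_open_data": Role.PUBLIC,
    "download_controlled_data": Role.AUTHENTICATED,
    "edit_dataset": Role.CURATOR,
    "delete_dataset": Role.ADMINISTRATOR,
}


def is_allowed(role, action, mfa_verified=False):
    """Return True only if the role meets the minimum for a known action."""
    required = REQUIRED_ROLE.get(action)
    if required is None:
        return False  # default deny for unknown actions
    if role < required:
        return False  # least privilege: role is below the minimum
    if required == Role.ADMINISTRATOR and not mfa_verified:
        return False  # administrator-level actions need MFA
    return True
```

A default-deny table means that adding a new action without assigning it a role fails closed, which is the safer failure mode for restricted data.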

5. Capabilities

Institutional infrastructure: secure storage and backup at ICL.
Research computing support.
Existing dedicated project staff.
Experienced PIs and Co-Is.
Track record in building and operating community data infrastructures.

6. Maintaining and implementing the Data Management Plan

The DMP will be formally reviewed at least annually, and additionally following:

Major project milestones.
Significant incidents or near-misses.
Changes in external requirements.

Feedback will be explicitly considered in revisions, especially concerning data discoverability, access procedures, or community standards.

7. Environmental considerations

Use virtualised or cloud resources instead of under-utilised local servers.
Avoid unnecessary duplication: where possible, rely on links to existing repositories rather than storing complete copies.
Use tiered storage and lifecycle policies.
Design AI and ETL pipelines that are incremental and scalable, avoiding reprocessing.
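
One common way to avoid reprocessing is to key each input by a hash of its content and skip anything already seen. The sketch below illustrates that idea under simple assumptions: inputs fit in memory, the set of seen hashes is kept by the caller (in practice it would be persisted), and all names are hypothetical.

```python
import hashlib


def content_hash(data):
    """Return the SHA-256 hex digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


def process_incrementally(inputs, seen_hashes, process):
    """Run process only on inputs whose content has not been seen before.

    inputs maps a name to a bytes payload. seen_hashes is a set that is
    updated in place so later runs can skip unchanged inputs.
    """
    results = {}
    for name, data in inputs.items():
        digest = content_hash(data)
        if digest in seen_hashes:
            continue  # unchanged content: skip to save compute
        results[name] = process(data)
        seen_hashes.add(digest)
    return results
```

Hashing content rather than file names means a renamed but unchanged file is not reprocessed, while an edited file with the same name is.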

8. Responsibilities

Overall data management and DMP ownership: Principal Investigator (PI), oversight by Project Management Board and institutional RDM teams.
Day-to-day data management and curation: Data Curator, supported by PI, Co-Is, and Advisory Group.
Metadata standards and documentation: Data Curator and PDRA or Software Developer, supported by Co-Is with repository experience.
Platform and infrastructure security: Software Developer, supported by institutional IT and information security teams.
Quality assurance of data: Data Curator and PDRA, supported by PI and relevant Co-Is (for example, domain experts).
Ethical and legal compliance (data): PI and Data Curator, supported by the Institutional Data Protection Officer and partners.
DMP review and reporting: Data Curator or Software Developer, supported by Management Board and Advisory Group.

9. Relevant institutional, departmental, or study policies on data sharing and data security

10. Author of this Data Management Plan

Stanislav Modrak ([email protected])