
Documentation

Note: Currently DVD is implemented with Varda, rendering this documentation obsolete.

Description of the input

The input consists of

Description of the output

A tab delimited file:

CHROM            Name of the chromosome (1, 2, ..., X, Y, M)
POS              Position on the chromosome
REF              Reference allele
ALT              Variant allele
Frequency        Frequency of this variant
Occurrences      Number of occurrences of this variant
Exome samples    List of sample IDs that share this variant *
Public samples   List of public datasets that share this variant

* This list is only shown if the frequency is lower than 0.05.
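As a rough illustration, the output could be read back as follows. This is a sketch under assumptions that the documentation does not confirm: the columns appear in the order listed above, there is no header row, and the two list columns are kept as raw strings because their separator is not specified. The function name `read_annotation` and the example row are made up.

```python
import csv
import io

# Column names taken from the description above; their order in the file
# is an assumption.
COLUMNS = ["CHROM", "POS", "REF", "ALT", "Frequency", "Occurrences",
           "Exome samples", "Public samples"]

def read_annotation(handle):
    """Yield one dict per row of a tab-delimited annotation file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        # Pad short rows, since the sample lists may be absent.
        row = row + [""] * (len(COLUMNS) - len(row))
        record = dict(zip(COLUMNS, row))
        record["POS"] = int(record["POS"])
        record["Frequency"] = float(record["Frequency"])
        record["Occurrences"] = int(record["Occurrences"])
        yield record

# Made-up example row, for illustration only.
example = "1\t5\tA\tAT\t0.01\t3\ts1,s2,s3\t\n"
records = list(read_annotation(io.StringIO(example)))
```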

Not used.

Preparing the input

Single samples

  1. We assume that you have one VCF file and one wiggle file. Variants that were called with too low coverage must be filtered out of the VCF file; the minimum coverage is the coverage threshold parameter.
  2. A wiggle file is too large to store, so we convert it to a BED file using the coverage threshold. The conversion program summarises the wiggle file as a list of regions that have sufficient coverage.
  3. Once the files are prepared, you can use this client to upload and annotate your VCF file.
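The wiggle-to-BED summarisation described above can be sketched as follows. This is not the actual conversion program: it assumes the per-base coverage has already been read into a list, and the function name `coverage_to_regions` is hypothetical. Regions are 0-based and half-open, as is conventional for BED.

```python
def coverage_to_regions(chrom, depths, threshold):
    """Summarise per-base depths as BED-style (chrom, start, end) regions.

    depths[i] is the coverage at 0-based position i. A region covers a
    maximal run of positions whose coverage is at least the threshold.
    """
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth >= threshold:
            if start is None:
                start = pos  # a sufficiently covered run begins here
        elif start is not None:
            regions.append((chrom, start, pos))  # the run ended before pos
            start = None
    if start is not None:
        regions.append((chrom, start, len(depths)))
    return regions

# Made-up coverage values, for illustration only.
depths = [0, 3, 9, 12, 10, 2, 0, 8, 8]
regions = coverage_to_regions("1", depths, threshold=8)
for chrom, start, end in regions:
    print(f"{chrom}\t{start}\t{end}")
```

Storing only the region boundaries is what keeps the BED file small compared to the per-base wiggle file.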

In order for you to keep track of this sample, make sure you use a unique order ID every time you upload a new sample. Internally this order ID is linked to a sample ID, which is returned in the annotation. One of the administrators of DVD can provide this information upon request.

Multiple samples

For privacy reasons, not all samples can be uploaded without some form of obfuscation. That is why we provide a method to pool samples. To make this possible while still using the information to its full potential, we have developed a pooling program that takes a number of VCF files and an equal number of wiggle files and converts them to one VCF file and multiple BED files.
You can now proceed to step 3 of the single sample workflow.

The records in the merged VCF file are ordered, thereby losing the link between variants that may be used for identification of a sample. Since we store coverage information of each sample and we know the size of the pool, we are still able to reliably calculate the frequencies.
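The merging idea can be sketched as follows. This is not the actual pooling program: records are reduced to hypothetical `(chrom, pos, ref, alt)` tuples, and chromosome names are compared as plain strings for simplicity.

```python
def pool_records(samples):
    """Merge per-sample variant records into one position-sorted list.

    Each record is a (chrom, pos, ref, alt) tuple. The output carries no
    information about which sample a record came from.
    """
    merged = [record for sample in samples for record in sample]
    # Sort on the full record so that the input order leaves no trace.
    merged.sort(key=lambda r: (r[0], r[1], r[2], r[3]))
    return merged

# Made-up records, for illustration only.
sample_a = [("1", 5, "A", "AT"), ("2", 40, "G", "C")]
sample_b = [("1", 5, "A", "AT"), ("1", 90, "C", "T")]
pooled = pool_records([sample_a, sample_b])
```

A variant shared by two samples appears twice in the pooled list, so occurrence counts are preserved even though the origin of each record is not.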

The import process

Since the VCF format leaves a lot of freedom, we developed a robust way of importing VCF files. For each variant, we need to disambiguate its description and calculate its coverage and supporting evidence.

Disambiguation

Consider the following table:

CHROM POS REF ALT
1 5 AA ATA
1 5 A AT

We see two ways of describing the same variant, i.e., the insertion of a T. In order to recognise this, we convert all descriptions to their minimal one by removing the common suffix of the reference and the variant allele. This enables DVD to find more overlap between samples.
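A minimal sketch of this suffix removal is shown below. The function name `minimise` is hypothetical, and keeping at least one base in each allele is an assumption that follows the usual VCF convention; the actual implementation may differ.

```python
def minimise(ref, alt):
    """Strip the longest common suffix of two alleles.

    At least one base is kept in each allele (an assumption of this sketch).
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt

# Both rows of the table above reduce to the same description.
first = minimise("AA", "ATA")
second = minimise("A", "AT")
```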

Coverage and supporting evidence

We use several ways to find coverage information and supporting evidence in a VCF file. Our preferred method is by using the DP4 tag. If this tag is not present, we use the DP tag in combination with either the AF or the AF1 tag.
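The order of preference can be sketched as follows. How DVD actually derives the two numbers is not described here, so the arithmetic is an assumption: with DP4 (the ref-forward, ref-reverse, alt-forward and alt-reverse read counts written by samtools), coverage is taken as the sum of all four counts and support as the sum of the last two; otherwise support is estimated as DP times the first allele frequency. The function names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'DP=20;AF1=0.5' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

def coverage_and_support(info):
    """Return (coverage, supporting reads), or None if no usable tag exists."""
    fields = parse_info(info)
    if "DP4" in fields:
        # Preferred: explicit per-strand read counts.
        ref_fwd, ref_rev, alt_fwd, alt_rev = (
            int(x) for x in fields["DP4"].split(","))
        return ref_fwd + ref_rev + alt_fwd + alt_rev, alt_fwd + alt_rev
    if "DP" in fields:
        depth = int(fields["DP"])
        for tag in ("AF", "AF1"):
            if tag in fields:
                # Use the first value if several alleles are listed.
                frequency = float(fields[tag].split(",")[0])
                return depth, round(depth * frequency)
    return None

with_dp4 = coverage_and_support("DP=20;DP4=5,5,6,4")
with_af1 = coverage_and_support("DP=20;AF1=0.5")
```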

Caching

The uploaded files are stored in a cache. This cache is accessible only to the administrators of DVD. We store these files to be able to re-import data in case of an error. If, for example, we encounter a bug, we can empty the database (except for the Sample table) and repopulate it using the cached files. Furthermore, if we later choose to store more information (such as genotyping information), we can do so without the users having to upload their samples again.

Duplicate detection

In order to avoid importing duplicates and to facilitate re-annotation of previously uploaded samples, we calculate the md5sum of the VCF file to detect the uploading of a duplicate.
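A checksum-based duplicate check could look like the sketch below. The function names and the in-memory set of known checksums are assumptions for illustration; the real system presumably keeps this information in its database.

```python
import hashlib
import io

def md5sum(handle, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a binary file object."""
    digest = hashlib.md5()
    # Read in chunks so that large VCF files do not have to fit in memory.
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def is_new_upload(handle, known_checksums):
    """Return True and record the checksum if this file was not seen before."""
    checksum = md5sum(handle)
    if checksum in known_checksums:
        return False
    known_checksums.add(checksum)
    return True

known = set()
first = is_new_upload(io.BytesIO(b"##fileformat=VCFv4.1\n"), known)
second = is_new_upload(io.BytesIO(b"##fileformat=VCFv4.1\n"), known)
```

Because the checksum covers the whole file, even a one-byte difference makes a file count as a new sample.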

The annotation process

In order to correctly annotate the samples, we also have to disambiguate the variant descriptions (see previous section). If the sample is already present in the database (see previous section), we continue the annotation process but we exclude variants from the duplicated sample. By doing this, we can calculate the frequencies correctly without overestimating them.

This mechanism can be used to re-annotate your sample without running the risk of polluting the database. Do make sure that the uploaded file is identical to the previous one.

Removing samples

If for some reason a sample needs to be removed from the database, you can contact one of the administrators.

Database schema

Server characteristics

The server is accessible via a SOAP remote procedure call. The following functions are implemented:

  • getAnnotation
  • uploadVCFBED
  • ping

The first two functions require a username and a password in order to be used.
If you want to make your own client, use this definition. See the following table for a list of dependencies and their license.

Transportation

The data is transported in a base64 encoded format. The currently available SOAP implementations do not handle attachments well; until they do, we will use this method.
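The encoding step can be sketched with the Python standard library. The function names are hypothetical; only the use of base64 comes from the description above.

```python
import base64

def encode_payload(data: bytes) -> str:
    """Encode raw file content as base64 text for transport."""
    return base64.b64encode(data).decode("ascii")

def decode_payload(text: str) -> bytes:
    """Recover the raw file content from its base64 representation."""
    return base64.b64decode(text)

original = b"1\t5\tA\tAT\n"
encoded = encode_payload(original)
restored = decode_payload(encoded)
```

Base64 output is roughly a third larger than its input, which is the cost of sending binary-safe data inside a text message.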

Security

All communication is done over an encrypted (SSL) channel. Access to the non-encrypted pages is disabled. The cache is not accessible without using the (password protected) getAnnotation function, for which you also need to know the unique one-time ID that is returned when you upload your data. Only a bcrypt hash of the client password is stored in the database; the actual password is not.

What happens on the client

The client initiates the uploading and annotation with one call, using the uploadVCFBED function. This function returns immediately with a unique ID, which can be used to poll for the annotated data using the getAnnotation function. The getAnnotation function returns the string "1" until the annotated data is ready; once the annotation is done, it returns the annotated data.
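The polling behaviour can be sketched as follows. The SOAP call itself is left out: `get_annotation` stands for any callable that wraps the getAnnotation function, and the interval, attempt limit and stub are assumptions for illustration.

```python
import time

def poll_annotation(get_annotation, job_id, interval=5.0, max_attempts=120):
    """Call get_annotation until it returns something other than "1"."""
    for _ in range(max_attempts):
        result = get_annotation(job_id)
        if result != "1":
            return result  # the annotated data is ready
        time.sleep(interval)
    raise TimeoutError("annotation was not ready in time")

def make_stub(ready_after, payload):
    """Build a fake get_annotation that is ready after a number of calls."""
    calls = {"n": 0}
    def get_annotation(job_id):
        calls["n"] += 1
        return payload if calls["n"] > ready_after else "1"
    return get_annotation

result = poll_annotation(make_stub(2, "annotated"), "some-id", interval=0)
```

A real client would pass a function that performs the authenticated remote call instead of the stub.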

What happens on the server

Upon receiving an upload request, the server first checks the username and password combination. If this check succeeds, the server forks a daemon and returns a unique ID, which the client can use to poll for the annotated data.

The daemon will first annotate the VCF file. Once it is finished, the result of the annotation is placed in the cache and from this moment on, the getAnnotation function will return the annotated data.
After the annotation, the daemon continues importing the data (if applicable) and exits when finished.
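The order of events on the server can be sketched as follows. This is not the actual server code: a background thread stands in for the forked daemon, a dictionary stands in for the cache, authentication is omitted, and `annotate` and `import_data` are hypothetical callables.

```python
import threading
import uuid

cache = {}  # stands in for the server-side cache

def handle_upload(vcf_text, annotate, import_data):
    """Start background processing and return a unique ID immediately."""
    job_id = uuid.uuid4().hex

    def worker():
        # Annotate first, so that the client gets its result early.
        cache[job_id] = annotate(vcf_text)
        # Only then import the data into the database.
        import_data(vcf_text)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread

def get_annotation(job_id):
    """Return the annotated data, or "1" while it is not ready."""
    return cache.get(job_id, "1")

imported = []
job_id, thread = handle_upload("chr1 variant", str.upper, imported.append)
thread.join()  # only for this demonstration; a client would poll instead
annotated = get_annotation(job_id)
```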

Last modified on Dec 3, 2013, 3:41:48 PM
