IMPORTANT

In response to several requests to share changes and additions to these scripts, we are moving to GitHub. Please follow this link to obtain the latest scripts: https://github.com/rstraver/wisecondor
Feel free to fork the project from there; we are very interested in any code alterations and suggestions.
The information on this page and the code will stay available for a while but will no longer be updated.

Welcome to project WISECONDOR

Welcome to the old project page of WISECONDOR (WIthin-SamplE COpy Number aberration DetectOR): detect fetal trisomies and smaller CNVs in a maternal plasma sample using whole-genome data.

This project code is available under the GPLv3 license.

This page is meant as a short introduction to get you started.

Explanation

For details on the methods, see the WISECONDOR paper which should be available soon. Additional information will be put on this page in the future.

Getting Started

Obtaining Scripts

To obtain WISECONDOR, use

svn co https://trac.nbic.nl/svn/wisecondor/trunk wisecondor

on a Linux machine to check out the most recent development scripts. To obtain the latest stable version, check out the highest version number you can find in the tags instead of the trunk.

Dependencies

WISECONDOR was developed and tested using Python 2.7. Using any other version may cause errors or faulty results. The working version was tested using SAMtools on .bam files created by BWA.

WISECONDOR uses several Python modules that ship with the standard library:

sys
pickle
math
glob
argparse

Additional packages that are not part of a default Python installation and most likely have to be installed separately are:

numpy
biopython
matplotlib

As for the reference genome, any reference that contains every autosomal chromosome once should do. Such a reference can be built by downloading every chromosome from UCSC and concatenating them into a single fasta file:
http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/chromosomes/
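
The concatenation described above can be sketched in Python. This is a minimal sketch and not part of WISECONDOR: the helper name build_reference is hypothetical, and it assumes the chromosome files have already been downloaded and decompressed into one directory under the names chr1.fa to chr22.fa.

```python
import os
import shutil

# Assumption: one uncompressed FASTA file per autosome, named chr1.fa ... chr22.fa.
AUTOSOMES = ["chr%d" % i for i in range(1, 23)]


def build_reference(chrom_dir, out_path, chroms=AUTOSOMES):
    """Concatenate per-chromosome FASTA files into a single FASTA file.

    Hypothetical helper for illustration; it is not one of the WISECONDOR scripts.
    """
    with open(out_path, "wb") as out:
        for chrom in chroms:
            path = os.path.join(chrom_dir, chrom + ".fa")
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # Make sure the next FASTA header starts on its own line.
                if os.path.getsize(path) > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
    return out_path
```

Under these assumptions, build_reference("chromosomes/", "hg19.fasta") would write a single file containing each autosome once, in numerical order.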

Also, to map your reads and produce a .bam file, we suggest using BWA:
http://bio-bwa.sourceforge.net/

To read the produced .bam file we use SAMtools, which can be obtained here:
http://samtools.sourceforge.net/

First Steps

To clarify how the system works and what we will do next, the following flow diagram shows which data each script produces and where that data is required.

Flow diagram of the scripts used in WISECONDOR

Count GC Frequency per Bin

As several steps require information about the GC content of regions of the genome, we first need to prepare this information for WISECONDOR. This step only needs to be repeated when the reference genome is replaced.

python countgc.py hg19.fasta gccountperbin.pickle

This takes quite a few minutes, as the implementation favors quick functionality over speed. As you rarely have to run this step, this should not really be a problem.
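
To illustrate what counting GC per bin means, here is a minimal sketch that works on a single sequence string. It is not the code in countgc.py and its output format is not the one stored in gccountperbin.pickle; the function name count_gc_per_bin is hypothetical. The default bin size of 1000000 is taken from the default shown in the newref.py help output, and the count of unknown (N) bases is included because the GC-correction options refer to it.

```python
def count_gc_per_bin(sequence, binsize=1000000):
    """Return (gc_counts, n_counts), each with one entry per bin.

    Illustrative only: the real countgc.py may store its results differently.
    """
    gc_counts = []
    n_counts = []
    for start in range(0, len(sequence), binsize):
        chunk = sequence[start:start + binsize].upper()
        gc_counts.append(chunk.count("G") + chunk.count("C"))
        n_counts.append(chunk.count("N"))
    return gc_counts, n_counts


# Tiny example with a bin size of 4 bases.
gc, n = count_gc_per_bin("GGCCAATTNNGC", binsize=4)
print(gc)  # [4, 0, 2]
print(n)   # [0, 0, 2]
```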

Convert BAM to PICKLE

Currently, WISECONDOR's entry point expects to receive a sorted, SAM-formatted input stream. It filters the reads to remove so-called read towers and counts the number of reads left per bin. Such a stream can be provided from a bash script or by entering something like this in a terminal:

/path/to/samtools view ex_sample.bam | python consam.py ex_sample.pickle

This step is required for every test and reference sample used.
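
The counting part of this step can be sketched as follows. This is a simplified illustration, not the code in consam.py: the function name count_reads_per_bin is hypothetical, the read-tower filter is deliberately left out, and the nested dictionary it returns is an assumed layout rather than the actual pickle format. It relies only on the standard SAM columns (flag, reference name, 1-based position).

```python
from collections import defaultdict


def count_reads_per_bin(sam_lines, binsize=1000000):
    """Count mapped reads per chromosome and bin from SAM-formatted lines.

    Simplified sketch: no read-tower filtering is applied here.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a valid alignment line
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4 or chrom == "*":
            continue  # unmapped read
        # SAM positions are 1-based, bins are 0-based.
        counts[chrom][(pos - 1) // binsize] += 1
    # Convert to plain dicts so the result can be pickled.
    return {chrom: dict(bins) for chrom, bins in counts.items()}


example = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchr1\t1000000\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tchr1\t1000001\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
print(count_reads_per_bin(example))  # {'chr1': {0: 2, 1: 1}}
```

In a pipeline like the one shown above, the same function could be called with sys.stdin as its input.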

Creating a Reference Table

To teach WISECONDOR which bins behave alike, we need to feed it a set of (healthy) reference samples. Copy or move the healthy sample files created in the previous step into a separate folder. Then tell WISECONDOR to build a new reference table from all .pickle files in that directory, using the GC-count file created previously (to apply GC correction), and to store the reference table in a file for later use, for example:

python newref.py refdir/ gccountperbin.pickle reftable.pickle

This step may take several minutes, depending mostly on the number of reference samples you provide. Due to the design of WISECONDOR, the more reference samples the better. Even adding extremely low coverage samples (e.g. 0.03 times coverage) may improve the reliability of WISECONDOR.

As these samples are only used to build a reference, any healthy whole-genome sample that was produced in the same manner as the samples you would like to test will most likely do fine: male, female, pregnant, non-pregnant, different lanes, different times, different coverages. Just make sure the samples are run on the same machine and prepared in the same way. This also means that, if done right, at some point no additional reference samples need to be sequenced for testing, as the reference samples made previously provide enough information.

We suggest using at least 16 samples, although using more is definitely a good idea.
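
Before building a reference table, it can be useful to check how many sample files the reference directory actually contains. The following is a small sketch, not part of WISECONDOR: the helper name list_reference_samples is hypothetical, and it assumes that reference samples are recognized by the .pickle extension as described above.

```python
import glob
import os

MIN_SAMPLES = 16  # suggested minimum from the text above


def list_reference_samples(refdir, minimum=MIN_SAMPLES):
    """Return the sorted .pickle files in refdir and whether there are enough."""
    samples = sorted(glob.glob(os.path.join(refdir, "*.pickle")))
    return samples, len(samples) >= minimum


samples, enough = list_reference_samples("refdir/")
if not enough:
    print("Only %d reference samples found, at least %d suggested"
          % (len(samples), MIN_SAMPLES))
```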

Testing A Sample

Now that WISECONDOR knows which bins on the genome are likely to behave alike, we can feed it a sample and it will try to discover regions that differ greatly from their own set of reference bins. To test a sample, run the test script and pass it the sample pickle, the GC-count file (again, to apply GC correction), the reference file, and a path plus basename so it knows where to write a plot of the results. The textual output goes to stdout, which you may want to save for later use by redirecting it to a file using >.

python test.py ex_sample.pickle gccountperbin.pickle reftable.pickle ex_sample.plot > ex_sample.result

Output formatting is a bit confusing at this point and may improve over time.
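
To give an intuition for what "differing greatly from its own set of reference bins" can mean, here is a sketch that scores one bin against its reference bins within the same sample using a z-score. Whether test.py computes exactly this is an assumption; the function name z_score and the example values are made up for illustration.

```python
import math


def z_score(target_value, reference_values):
    """Z-score of a bin's value relative to its reference bins (same sample).

    Illustrative assumption: uses the sample standard deviation (n - 1).
    """
    n = len(reference_values)
    if n < 2:
        raise ValueError("need at least two reference bins")
    mean = sum(reference_values) / float(n)
    variance = sum((v - mean) ** 2 for v in reference_values) / float(n - 1)
    if variance == 0:
        raise ValueError("reference bins have zero variance")
    return (target_value - mean) / math.sqrt(variance)


# A bin with value 5.0 against reference bins with mean 2.0 and std 1.0.
print(z_score(5.0, [1.0, 2.0, 3.0]))  # 3.0
```

A large positive or negative score would mark the bin as behaving unlike the bins it normally resembles.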

Tweaking and Fine-Tuning

WISECONDOR has a large number of parameters that may require some tweaking to work well on your data, as results may differ between the methods and machines used to obtain your NGS data. In the steps described above, all parameters are left at their defaults to keep the commands readable, but they can easily be altered. To see which parameters a script accepts, run it with the '-h' argument: it returns a list of options, their descriptions and their default values. Do keep in mind that several options need to be exactly the same across different scripts; e.g. the binsize used in every step should be the same or the results will be meaningless. Any option for which this is true has the same argument name across scripts and is marked as such in its description. For example, using '-h' on the newref.py script:

$ python newref.py -h

usage: newref.py [-h] [-binsize BINSIZE] [-gccmaxn GCCMAXN]
                 [-gccminrd GCCMINRD] [-gccfval GCCFVAL] [-gccival GCCIVAL]
                 refdir gccount refout

Create a new reference table from a set of reference samples. Applies gc-correction. Outputs table as pickle to a specified output file.

positional arguments:
  refdir              directory containing samples to be used as reference
                      (pickle)
  gccount             gc-counts file used for gc-correction (pickle)
  refout              reference table output, used for sample testing (pickle)

optional arguments:
  -h, --help          show this help message and exit
  -binsize BINSIZE    binsize used for samples (default: 1000000)
  -gccmaxn GCCMAXN    maximum relative amount of unknown (n) bases in bin used
                      for gc-correction (equals arg used in test) (default:
                      0.1)
  -gccminrd GCCMINRD  minimum relative amount of reads in bin used for gc-
                      correction (equals arg used in test) (default: 0.0001)
  -gccfval GCCFVAL    width of data used in loess function used for gc-
                      correction (equals arg used in test) (default: 0.1)
  -gccival GCCIVAL    amount of fitting iterations in loess function used for
                      gc-correction (equals arg used in test) (default: 3)
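
Because the options marked "equals arg used in test" must match between newref.py and test.py, one way to avoid mismatches is to define them once and reuse them when building both command lines. The sketch below only builds argument lists and runs nothing. The helper names are hypothetical, the option names and defaults come from the help output above, and it is an assumption that test.py also accepts -binsize and that options may precede the positional arguments.

```python
# Options taken from the newref.py help output, with their listed defaults.
SHARED_OPTIONS = {
    "binsize": 1000000,
    "gccmaxn": 0.1,
    "gccminrd": 0.0001,
    "gccfval": 0.1,
    "gccival": 3,
}


def shared_flags(options):
    """Turn an options dict into a flat list like ['-binsize', '1000000', ...]."""
    flags = []
    for name in sorted(options):
        flags += ["-" + name, str(options[name])]
    return flags


def build_newref_command(refdir, gccount, refout, options=SHARED_OPTIONS):
    """Argument list for newref.py (hypothetical helper)."""
    return ["python", "newref.py"] + shared_flags(options) + [refdir, gccount, refout]


def build_test_command(sample, gccount, reftable, plot, options=SHARED_OPTIONS):
    """Argument list for test.py; assumes it takes the same option names."""
    return ["python", "test.py"] + shared_flags(options) + [sample, gccount, reftable, plot]


print(" ".join(build_newref_command("refdir/", "gccountperbin.pickle", "reftable.pickle")))
```

The resulting lists could be passed to subprocess.call, with the output of the test command redirected to a result file as shown earlier.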

Default Page Info

Source access

If available, anonymous read-only Subversion access works as follows:

  svn co https://trac.nbic.nl/svn/wisecondor wisecondor

Write access is only available to registered developers.

You can become a developer by registering yourself if you haven't already done so, and requesting write access on the wisecondor-users mailing list.

Last modified on Nov 13, 2013, 12:30:48 PM