Following several requests to share changes and additions to these scripts, we are moving to GitHub. Please follow this link to obtain the latest scripts: https://github.com/rstraver/wisecondor
Feel free to fork the project from there; we are very interested in any code alterations and suggestions.
The information on this page and the code will stay available for a while but will no longer be updated.
Welcome to project WISECONDOR
Welcome to the old project page of WISECONDOR (WIthin-SamplE COpy Number aberration DetectOR): detect fetal trisomies and smaller CNVs in a maternal plasma sample using whole-genome data.
This project code is available under the GPLv3 license.
This page is meant as a short introduction to get you started.
For details on the methods, see the WISECONDOR paper, which should be available soon. Additional information will be added to this page in the future.
To obtain WISECONDOR, use
svn co https://trac.nbic.nl/svn/wisecondor/trunk wisecondor
on a Linux machine to check out the most recent development scripts. To obtain the latest stable version, pick the highest version number you can find in the tags instead of the trunk.
WISECONDOR was developed and tested using Python 2.7. Using any other version may cause errors or faulty results. The working version is tested using SAMtools on .bam files created by BWA.
WISECONDOR uses several Python modules. The following are part of the Python standard library and need no installation:
sys pickle math glob argparse
The following third-party packages most likely have to be installed separately:
numpy biopython matplotlib
As for the reference genome, any reference that contains every autosomal chromosome once should do. Such a reference can be built by downloading every chromosome from UCSC and concatenating them into a single fasta file:
Also, to map your reads and produce a .bam file, we suggest using BWA:
To read the produced .bam file we use SAMtools, which can be obtained here:
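The concatenation of per-chromosome files into a single reference can be sketched in a few lines of Python. This is only an illustration: the function name, the directory layout and the file names chr1.fa to chr22.fa are assumptions, and the files are assumed to be already decompressed.

```python
import os


def concatenate_autosomes(input_dir, output_path, chromosomes=range(1, 23)):
    """Write chr<N>.fa files (assumed names) one after another into one FASTA file."""
    with open(output_path, "w") as out:
        for number in chromosomes:
            part = os.path.join(input_dir, "chr%d.fa" % number)
            with open(part) as handle:
                for line in handle:
                    # Guard against a missing final newline, which would glue
                    # the next chromosome header onto the last sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
```

For example, concatenate_autosomes("ucsc_download", "hg19.fasta") would produce the hg19.fasta file used in the next step, where "ucsc_download" is a hypothetical directory holding the downloaded chromosomes.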
To understand the system and the steps that follow, here is a flow diagram showing where each piece of data is produced and where it is required.
Count GC Frequency per Bin
As several steps require information about the GC content of regions of the genome, we need to prepare this information for WISECONDOR. This step only needs to be repeated when the reference genome is replaced.
python countgc.py hg19.fasta gccountperbin.pickle
This takes several minutes, as the implementation favors quick functionality over speed. As you rarely have to do this step, this shouldn't really be a problem.
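To illustrate what counting GC per bin means, here is a minimal sketch that works on a single chromosome sequence held in memory. It is not the code of countgc.py; the function name and the returned structure are assumptions, and only the default bin size of 1000000 is taken from the script defaults.

```python
def count_gc_per_bin(sequence, binsize=1000000):
    """Return (gc_counts, n_counts): one entry per consecutive bin of `binsize` bases."""
    gc_counts = []
    n_counts = []
    for start in range(0, len(sequence), binsize):
        chunk = sequence[start:start + binsize].upper()
        gc_counts.append(chunk.count("G") + chunk.count("C"))
        # Unknown bases are tracked separately so bins with many N's
        # can be excluded from GC-correction later on.
        n_counts.append(chunk.count("N"))
    return gc_counts, n_counts
```

The last bin of a chromosome is usually shorter than the bin size, so its counts are based on fewer bases.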
Convert BAM to PICKLE
Currently, WISECONDOR's entry point expects to receive a sorted, SAM-formatted input stream. It filters the reads to remove so-called read towers and counts the number of reads left per bin. This can be done with a bash script or by entering something like this in a terminal:
/path/to/samtools view ex_sample.bam | python consam.py ex_sample.pickle
This step is required for every test and reference sample used.
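The filtering and counting described above can be pictured with the following sketch. It is a deliberately simplified stand-in, not the logic of consam.py: it assumes that a read tower can be approximated by consecutive reads starting at the same position, and the function name and output structure are hypothetical.

```python
import collections


def count_reads_per_bin(sam_lines, binsize=1000000):
    """Count reads per chromosome and bin from sorted SAM lines.

    Assumption: a read starting at the same position as the previous read
    is treated as part of a tower and skipped.
    """
    counts = collections.defaultdict(lambda: collections.defaultdict(int))
    previous = None
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[2], int(fields[3])
        if chrom == "*":
            continue  # unmapped read
        if (chrom, pos) == previous:
            continue  # same start as the previous read: skip it
        previous = (chrom, pos)
        # SAM positions are 1-based, bins are numbered from 0.
        counts[chrom][(pos - 1) // binsize] += 1
    return dict((chrom, dict(bins)) for chrom, bins in counts.items())
```

In a pipe like the one shown above, the input would be sys.stdin, for example count_reads_per_bin(sys.stdin).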
Creating a Reference Table
To teach WISECONDOR which bins behave alike, we need to feed it a set of (healthy) reference samples. Copy or move the healthy files created in the previous step into a separate folder. Then tell WISECONDOR to build a new reference table using all .pickle files in that directory and the GC-count file created previously (to apply GC-correction), and to store the reference table in a file for later use, for example:
python newref.py refdir/ gccountperbin.pickle reftable.pickle
This step may take several minutes, depending mostly on the number of reference samples you provide. Due to the design of WISECONDOR, the more reference samples, the better: even adding extremely low coverage samples (e.g. 0.03 times coverage) may improve the reliability of WISECONDOR. As these samples are only used to build a reference, any healthy whole-genome sample that was produced in the same manner as the samples you would like to test will most likely do fine: male, female, pregnant, non-pregnant, different lanes, different times, different coverages. Just make sure each sample is run on the same machine and prepared the same way. This also means that, if done right, no additional reference samples need to be sequenced for later tests, as the existing reference samples provide enough information. We suggest using at least 16 samples, although using more is definitely a good idea.
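The idea of finding bins that behave alike can be sketched as follows. This is not the algorithm of newref.py: it assumes, for illustration only, that similarity is measured by the sum of squared differences between two bins across all reference samples, and all names are hypothetical.

```python
def find_reference_bins(samples, target, candidates, top=3):
    """Return the `top` candidate bins that behave most like `target`.

    samples: list of dicts, one per reference sample, mapping a bin id
    to its (GC-corrected, normalized) read frequency.
    """
    def distance(other):
        # A small total squared difference over all samples means the two
        # bins rise and fall together across the reference set.
        return sum((s[target] - s[other]) ** 2 for s in samples)

    return sorted(candidates, key=distance)[:top]
```

With more reference samples, each distance is summed over more observations, which is one way to see why additional samples make the choice of reference bins more stable.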
Testing A Sample
Now that WISECONDOR knows which bins on the genome are likely to behave alike, we can feed it a sample and it will try to discover areas that differ greatly from their own set of reference bins. To test a sample, run the test script and feed it the sample pickle, the GC-count file (again, to apply GC-correction), the reference file and a path plus basename so it knows where to write a plot of the results. The output goes to stdout, which you may want to save for later use by redirecting it to a file using >.
python test.py ex_sample.pickle gccountperbin.pickle reftable.pickle ex_sample.plot > ex_sample.result
Output formatting is a bit confusing at this point and may improve over time.
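One common way to quantify how much a bin differs from its reference bins is a z-score. The sketch below shows that calculation within a single sample; whether test.py uses exactly this statistic is an assumption, and the function name and data layout are hypothetical.

```python
import math


def bin_z_score(sample, target, reference_bins):
    """Z-score of `target` against the values of its reference bins.

    sample: dict mapping a bin id to its read frequency in the tested sample.
    Requires at least two reference bins with non-identical values.
    """
    values = [sample[b] for b in reference_bins]
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return (sample[target] - mean) / math.sqrt(variance)
```

A strongly positive score would point to a gain in the target bin relative to bins that normally behave like it, a strongly negative score to a loss.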
Tweaking and Fine-Tuning
WISECONDOR has a large number of variables that may require some tweaking to work well on your data, as results may differ between the methods and machines used to obtain your NGS data. In the steps described above, all variables are left at their defaults to keep the commands readable, but they can easily be altered. If you want to tweak some variables, try running any script with the '-h' argument. A list of options, their descriptions and their default values will be returned. Do keep in mind that several options need to be exactly the same across different scripts, e.g. the binsize used in every step should be the same or the results will simply be rubbish. Any option for which this is true has the same argument name in the different scripts and is marked in its description. For example, using '-h' on the newref.py script:
$>python newref.py -h
usage: newref.py [-h] [-binsize BINSIZE] [-gccmaxn GCCMAXN]
                 [-gccminrd GCCMINRD] [-gccfval GCCFVAL] [-gccival GCCIVAL]
                 refdir gccount refout

Create a new reference table from a set of reference samples. Applies
gc-correction. Outputs table as pickle to a specified output file.

positional arguments:
  refdir              directory containing samples to be used as reference
                      (pickle)
  gccount             gc-counts file used for gc-correction (pickle)
  refout              reference table output, used for sample testing (pickle)

optional arguments:
  -h, --help          show this help message and exit
  -binsize BINSIZE    binsize used for samples (default: 1000000)
  -gccmaxn GCCMAXN    maximum relative amount of unknown (n) bases in bin used
                      for gc-correction (equals arg used in test) (default:
                      0.1)
  -gccminrd GCCMINRD  minimum relative amount of reads in bin used for
                      gc-correction (equals arg used in test) (default:
                      0.0001)
  -gccfval GCCFVAL    width of data used in loess function used for
                      gc-correction (equals arg used in test) (default: 0.1)
  -gccival GCCIVAL    amount of fitting iterations in loess function used for
                      gc-correction (equals arg used in test) (default: 3)
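Since several options must be identical across scripts, one way to keep them in sync is to register them in a single helper that every script calls. The sketch below rebuilds a parser with the option names and defaults shown in the help text above; the helper names and the int/float types are assumptions, not the actual newref.py source.

```python
import argparse


def add_shared_gc_options(parser):
    """Register the options that must match between scripts (types are assumed)."""
    parser.add_argument("-binsize", type=int, default=1000000,
                        help="binsize used for samples")
    parser.add_argument("-gccmaxn", type=float, default=0.1,
                        help="maximum relative amount of unknown (n) bases in bin")
    parser.add_argument("-gccminrd", type=float, default=0.0001,
                        help="minimum relative amount of reads in bin")
    parser.add_argument("-gccfval", type=float, default=0.1,
                        help="width of data used in loess function")
    parser.add_argument("-gccival", type=int, default=3,
                        help="amount of fitting iterations in loess function")


def build_newref_parser():
    """Build a parser resembling the newref.py command line."""
    parser = argparse.ArgumentParser(
        prog="newref.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description="Create a new reference table from a set of reference samples.")
    add_shared_gc_options(parser)
    parser.add_argument("refdir", help="directory containing reference samples (pickle)")
    parser.add_argument("gccount", help="gc-counts file used for gc-correction (pickle)")
    parser.add_argument("refout", help="reference table output (pickle)")
    return parser
```

A parser for the test script could call add_shared_gc_options in the same way, so that a changed default only has to be edited in one place.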
Default Page Info
This project provides the following mailing lists.
- wisecondor-users: a list intended for general discussion on the project.
- wisecondor-commits: a list that receives source code commit messages.
- wisecondor-devel: a list intended for discussion among developers (subscription is restricted to registered developers).
If available, anonymous readonly subversion access works as follows:
svn co https://trac.nbic.nl/svn/wisecondor wisecondor
Write access is only available to registered developers.