Changes between Version 13 and Version 14 of WikiStart


Timestamp:
Apr 4, 2013, 3:40:25 PM (10 years ago)
Author:
r.straver@…
Comment:

--

  • WikiStart

    v13 v14  
    58 58 Several steps need the GC content of regions of the genome, so this information must be prepared for WISECONDOR first. This step only needs to be repeated when the reference genome is replaced.
    59 59 {{{
    60    python countgc.py /path/to/hg19.fasta gccountperbin.pickle
       60 python countgc.py hg19.fasta gccountperbin.pickle
    61 61 }}}
    62 62 This takes quite a few minutes, as the implementation favours quick functionality over speed. Since this step is rarely needed, that should not be a problem.
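
To make the idea of "GC content per bin" concrete, here is a minimal sketch of counting G and C bases in fixed-size bins of a FASTA file. It is not the code in countgc.py: the function name `count_gc_per_bin`, the default bin size of 1,000,000 bases, and the returned layout (sequence name mapped to a list of counts) are all assumptions made for illustration.

```python
def count_gc_per_bin(fasta_lines, bin_size=1000000):
    """Count G/C bases per fixed-size bin for every sequence in a FASTA stream.

    Hypothetical helper for illustration; bin_size is an assumed default.
    Returns {sequence_name: [gc_count_bin0, gc_count_bin1, ...]}.
    """
    counts = {}
    chrom, pos = None, 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new sequence and reset the position.
            chrom, pos = line[1:].split()[0], 0
            counts[chrom] = []
            continue
        if chrom is None:
            continue  # sequence data before any header is ignored
        for base in line.upper():
            index = pos // bin_size
            if index == len(counts[chrom]):
                counts[chrom].append(0)  # first base that falls into a new bin
            if base in "GC":
                counts[chrom][index] += 1
            pos += 1
    return counts
```

Passing an open file object as `fasta_lines` works, since the function only iterates over lines. The result could be stored with `pickle.dump`, although nothing here guarantees that the layout matches what WISECONDOR itself expects in gccountperbin.pickle.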
     
    65 65 Currently, WISECONDOR's entry point expects a sorted, SAM-formatted input stream. It filters the reads to remove so-called read towers and counts the number of reads left per bin. Such a stream can be supplied from a bash script or by entering something like this in a terminal:
    66 66 {{{
    67    /path/to/samtools view ex_sample.bam | python consam.py /path/to/ex_sample.pickle
       67 /path/to/samtools view ex_sample.bam | python consam.py ex_sample.pickle
    68 68 }}}
    69 69 This step is required for every test and reference sample used.
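
As an illustration of the counting half of this step, the sketch below tallies reads per bin from SAM-formatted lines. It is not consam.py: the read-tower filter is deliberately left out because its rules are not described here, and the name `count_reads_per_bin` and the 1,000,000-base bin size are assumptions. It relies only on the standard SAM layout, in which the third tab-separated field is the reference name and the fourth is the 1-based leftmost position.

```python
from collections import defaultdict


def count_reads_per_bin(sam_lines, bin_size=1000000):
    """Count mapped reads per fixed-size bin from SAM-formatted lines.

    Hypothetical helper for illustration; no read-tower filtering is done.
    Returns {reference_name: {bin_index: read_count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a complete alignment record
        chrom, pos = fields[2], int(fields[3])
        if chrom == "*" or pos < 1:
            continue  # unmapped read
        # SAM positions are 1-based, so shift to 0-based before binning.
        counts[chrom][(pos - 1) // bin_size] += 1
    return {chrom: dict(bins) for chrom, bins in counts.items()}
```

Called as `count_reads_per_bin(sys.stdin)` in a script, this would accept the output of `samtools view` through a pipe in the same way as the command above.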
     
    72 72 To teach WISECONDOR which bins behave alike, we need to feed it a set of (healthy) reference samples. Copy or move the healthy sample files created in the previous step into a separate folder, then tell WISECONDOR to build a new reference table from all .pickle files in that directory and the GC-count file created earlier (to apply GC correction), and to store the reference table in a file for later use, for example:
    73 73 {{{
    74    python newref.py /path/to/refdir/ /path/to/gccountperbin.pickle /path/to/reftable.pickle
       74 python newref.py refdir/ gccountperbin.pickle reftable.pickle
    75 75 }}}
    76 76 This step may take several minutes, depending mostly on the number of reference samples provided. By design, WISECONDOR benefits from more reference samples: even adding extremely low-coverage samples (e.g. 0.03 times coverage) may improve its reliability. As these samples are only used to build a reference, any healthy whole-genome sample produced in the same manner as the samples you would like to test will most likely do: male, female, pregnant, non-pregnant, different lanes, different times, different coverages. Just make sure the samples are run on the same machine and prepared the same way. This also means that, if done right, at some point no additional reference samples need to be sequenced for testing, as the reference samples made previously provide enough information.
     
    80 80 Now that WISECONDOR knows which bins on the genome are likely to behave alike, we can feed it a sample and it will try to find areas that differ greatly from their own set of reference bins. To test a sample, run the test script and pass it the sample pickle, the GC-count file (again, to apply GC correction), the reference file, and a path plus basename so it knows where to write a plot of the results. The textual output goes to stdout, which you may want to save for later use by redirecting it to a file with >.
    81 81 {{{
    82    python test.py /path/to/ex_sample.pickle /path/to/gccountperbin.pickle /path/to/reftable.pickle /path/to/ex_sample.plot > /path/to/ex_sample.result
       82 python test.py ex_sample.pickle gccountperbin.pickle reftable.pickle ex_sample.plot > ex_sample.result
    83 83 }}}
    84 84 Output formatting is a bit confusing at this point and may improve over time.
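
The four commands on this page can be generated for a given sample with a small helper, which may be handy when many samples have to be processed. This is only a sketch: `build_pipeline_commands` and its parameters are hypothetical names, and the script names and argument order are copied from the examples above rather than from any documented interface.

```python
import os
import shlex


def build_pipeline_commands(fasta, sample_bam, refdir,
                            gc_pickle="gccountperbin.pickle",
                            ref_pickle="reftable.pickle",
                            samtools="samtools"):
    """Return the shell command lines for one test sample, in run order.

    Hypothetical helper: output names are derived from the BAM file name,
    mirroring the ex_sample.* naming used in the examples on this page.
    """
    base = os.path.splitext(sample_bam)[0]
    q = shlex.quote  # protects paths that contain spaces or shell characters
    return [
        # 1. GC count per bin (once per reference genome)
        "python countgc.py {} {}".format(q(fasta), q(gc_pickle)),
        # 2. Convert the sample to a pickle of read counts
        "{} view {} | python consam.py {}".format(
            q(samtools), q(sample_bam), q(base + ".pickle")),
        # 3. Build the reference table from the pickles in refdir
        "python newref.py {} {} {}".format(q(refdir), q(gc_pickle), q(ref_pickle)),
        # 4. Test the sample and save stdout to a .result file
        "python test.py {} {} {} {} > {}".format(
            q(base + ".pickle"), q(gc_pickle), q(ref_pickle),
            q(base + ".plot"), q(base + ".result")),
    ]
```

The helper covers a single test sample only. The reference samples still have to go through the conversion step themselves and be placed in `refdir` before the third command is run.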