首页天道酬勤,

,

张世龙 05-13 09:12 75次浏览

最近想把GATK流程化,方便后续工作。 如何查看WDL Cromwell仍然很有用。 而且从后续的GATK中出来的最佳实践也是按照WDL编写的。 等于学习了。 这里记录了三个比较多的例子,具体可以看到官方网站。

case2.创建多步骤示例进程的数据下载

此任务是将来自haplotypeCaller的SNP与indel分离。

工作流程

本示例使用task alisingselect两次,因此cromwell不允许同一task被调用两次(如果有第二次,则覆盖第一次结果)。 然后,在对结果执行以下操作时,将调用selectSNPs.rawSubset,而不是select.rawSubset。 对于第二个和第三个任务,必须从haplotypeCaller中调整结果。 在中,必须按如下方式使用input 3360 input name=taskname.output name : workflowsimplevariantselection { callhaplotypecaller { input 3360 } callselectasselectsnps { input : } calllselectasselectindion 工作流simplevariantselection { filegatkfilereffastafilerefindexfilerefdictstringnamecallhaplotypecaller } input 3360 samplenamer GATK=gatk,RefIndex=refIndex,ref dict=ref dict } callselectasselectsnps { input : sample name=gatk=gatk,refindex raw vcf=haplotype caller.raw vcf } callselectasselect ref fasta=ref fasta,GATK=gatk,RefIndex=refIndex,RefDict=refDict raw vcf=haplotype caller.raw vcf } } tasktaskselect { filegatkfilereffastafilerefindexfilerefdictstringsamplenamestringtypefilerawvcfcommand { Java-jar $ { gatk } - v $ { raw vcf }\- select type $ { type }\- o $ { sample name } _ raw.$ { type }.vcf } output { filerawsubset=' $ {

workflowworkflowsimplevariantdiscovery { filegatkfilereffastafilerefindexfilerefdictstringnamecallhaplotypecalller { . } calll callselectasselectindels { . } callhardfiltersnp { input : sample name=name,reffasta=ref RefIndex=refIndex,ref dict

selectIndels.rawSubset } call combine { input: sampleName=name, RefFasta=refFasta, GATK=gatk, RefIndex=refIndex, RefDict=refDict, filterSNPs=hardFilterSNP.filteredSNPs, filteredIndels=hardFilterIndel.filteredIndels }} Task task hardFilterSNP { File GATK File RefFasta File RefIndex File RefDict String sampleName File rawSNPs command { java -jar ${GATK} \ -T VariantFiltration \ -R ${RefFasta} \ -V ${rawSNPs} \ --filterExpression "FS > 60.0" \ --filterName "snp_filter" \ -o ${sampleName}.filtered.snps.vcf } output { File filteredSNPs = "${sampleName}.filtered.snps.vcf" }} task hardFilterIndel { File GATK File RefFasta File RefIndex File RefDict String sampleName File rawIndels command { java -jar ${GATK} \ -T VariantFiltration \ -R ${RefFasta} \ -V ${rawIndels} \ --filterExpression "FS > 200.0" \ --filterName "indel_filter" \ -o ${sampleName}.filtered.indels.vcf } output { File filteredIndels = "${sampleName}.filtered.indels.vcf" }} task combine { File GATK File RefFasta File RefIndex File RefDict String sampleName File filteredSNPs File filteredIndels command { java -jar ${GATK} \ -T CombineVariants \ -R ${RefFasta} \ -V ${filteredSNPs} \ -V ${filteredIndels} \ --genotypemergeoption UNSORTED \ -o ${sampleName}.filtered.snps.indels.vcf } output { File filteredVCF = "${sampleName}.filtered.snps.indels.vcf" }} case4. 用scatter-gather进行joint call genotypes

数据下载,本例子用了toy dataset: NA12878, NA12877, and NA12882, subset to chromosome 20.

这个方法经常用于大的队列,这里只用3个样本进行尝试。从inputSamples开始,每个sample都是一个包含3个文件的数组。第一个文职array index0是样本的名字,array index1是 bam文件,array index2是bam的index文件。例子中,我们根据sample进行scatter,并在第一步后gather。gathered GVCF用GenotypeGVCFs去生成一个多样本VCF。

Write your script
首先要准备Array[Array[File]] inputSamples,命名为inputsTSV.txt。每个input由tab分割。这个文件可以帮助减轻在inputs json的nested array declarations。 sampleName inputBam bamIndexsampleName2 inputBam2 bamIndex2...etc. Workflow
为导入inputSamplesFile里的内容,转换为Array[Array[File]]格式,需要用到WDL的标准库read_tsv: File inputSamplesFileArray[Array[File]] inputSamples = read_tsv(inputSamplesFile)

然后用sample去scatter,我们的call task那部分包含在HaplotypeCallerERC函数里。同时在scatter操作之后,要做一步gather,这里的GenotypeGVCFs包含HaplotypeCallerERC的输出,即GVCFs=HaplotypeCallerERC.GVCF:

workflow jointCallingGenotypes { File inputSamplesFile Array[Array[File]] inputSamples = read_tsv(inputSamplesFile) File gatk File refFasta File refIndex File refDict scatter (sample in inputSamples) { call HaplotypeCallerERC { input: GATK=gatk, RefFasta=refFasta, RefIndex=refIndex, RefDict=refDict, sampleName=sample[0], bamFile=sample[1], bamIndex=sample[2] } } call GenotypeGVCFs { input: GATK=gatk, RefFasta=refFasta, RefIndex=refIndex, RefDict=refDict, sampleName="CEUtrio", GVCFs=HaplotypeCallerERC.GVCF }} Task
HaplotyepCallerERC task HaplotypeCallerERC { File GATK File RefFasta File RefIndex File RefDict String sampleName File bamFile File bamIndex command { java -jar ${GATK} \ -T HaplotypeCaller \ -ERC GVCF \ -R ${RefFasta} \ -I ${bamFile} \ -o ${sampleName}_rawLikelihoods.g.vcf } output { File GVCF = "${sampleName}_rawLikelihoods.g.vcf" }}

GenotypeGVCFs

task GenotypeGVCFs { File GATK File RefFasta File RefIndex File RefDict String sampleName Array[File] GVCFs command { java -jar ${GATK} \ -T GenotypeGVCFs \ -R ${RefFasta} \ -V ${sep=" -V " GVCFs} \ -o ${sampleName}_rawVariants.vcf } output { File rawVCF = "${sampleName}_rawVariants.vcf" }} Running the pipeline #validate:java -jar wdltool.jar validate jointCallingGenotypes.wdl#generate inputs:java -jar wdltool.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json#run locally:java -jar cromwell.jar run jointCallingGenotypes.wdl jointCallingGenotypes_inputs.json
深基础的类型,浅基础与深基础的概念 整体管理输入,输出,工具技术,47个过程输入输出工具速记口诀