[Bioperl-l] WGS, WGS_SCAFLD support added for GenBank files
Brian Osborne
osborne1 at optonline.net
Fri Mar 10 02:38:46 UTC 2006
Chris,
Below...
On 3/9/06 4:04 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>> I think it's reasonable to use eutils in this way, yes. It's no longer
>> "pure
>> Bioperl" but all of this stuff is depending on eutils anyway. The downside
>> is that their API may change but it looked like you wrote some tests for
>> this, yes? Just my opinion.
>
> I'll get some tests up and running for check and for using the 'gbwithparts'
> format. I actually found that using format=>'fasta' also gives the
> NCBI-built contig and, since it required less memory overhead for object
> creation, used that in &postprocess_data.
>
>> I believe the lack of filling Ns is a bug on Bioperl's part due to the
>> inability of the Bio::Location code to understand NCBI's gaps(). If there
>> are Ns in the sequence we shouldn't just be deleting them, that's not
>> good.
>>
>> Brian O
>
> There are a number of serious problems with bioperl's joining as well,
> something I've just noticed when directly comparing output from NCBI. It
> cuts off one base from the end of each joined sequence, and some of the
> joins aren't correct (normal when they should be revcomp). Basically any
> fix is now redundant in the light of using NCBI's contig building but I
> would still like to know what the problem is with bioperl's version. Did
> something change recently with these records to break this? I'll check
> things over and try to get this fix committed ASAP.
>
> Here's a few chunks of fasta data, first one from NCBI eutils contig build
> and second from Bioperl's postprocess_data (prior to my changes); after that
> is the start of the CONTIG line from the master file. I snipped out the 5'
> end and started close to where the gaps (N's) are and added a couple <CR>
> and arrows where the gaps should be in the second bioperl formatted
> sequence. In the bioperl-formatted contig the end of each joined sequence
> is missing one base ('T' in the first, 'C' in the second). The third
> sequence should be the complement of the sequence in the contig, but isn't.
> Could you try this out and see if you get the same thing? I added the bit
> of code that I used to fetch the contig from postprocess_data.
>
>> CH398085 Oryza sativa (indica cultivar-group) chromosome 1 scaffold000005
> genomic scaffold, whole genome shotgun sequence (from NCBI)
> ....
> TTAGGTGGTTTTATAACTTTAGACTTTGGGAATTTTCATATCACCTGGACACTATGGAAT
> TGTTGGATGATGGTGGAATTGGACATACACCTCTCTTCCTCTTTCAAAACCCCTAAAACC
> TGTTTTCGGTGGGGTTTGGGTGCATGCCAGTTGTGGGAAGTAGCACCCCGGGCACTATAA
> GGATTAAGCTCAGGCCTCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> NNNNNNNNNCTGAGTACTGTGGTTGTACTCATTCTTGCTCAATCTTTTCCCCCTTCAGTA
> AGAGAAGATTTGGAGAAGAAGTCTTAGGTGGAGTCCTGGCTTATACCCCAGTTGAGCGCC
> TGTGAAGATGGAGCCGTAGGCCCGCTAGTCCGCTGCTGTTTATTTTTGATTGTCAGGCCT
> TAAGTGCCTTTGTAATAATGTAAATATTATCGATATAATAAAGATGTGTCTTTTATATCA
> TGTTTGTGTGGTGTACCCCGGCTTTTCCTGGGACGGGGATTAATACACTAGCGTTCGGGA
> AAAGGCAATTTTCCCGGTCGCGACAGAACTTGTAATTCTCTAGCACTAGAATGACATATC
> CTTTGGATTGTGCACCAATGCCACGCGAAAACCCATGGTGCCAAAACTAGGGGTGGAAAA
> ACCTCCGAGACCTCCTCCGAAGAGGCAGGTGACAGGTAAGGCGGAGGAACCCGAGATGCA
> TAAGGAAAATCCAGTGCCGGAAGTGCCACCGGAGATTGCAGTGCCGGAGGTGCCCATGGA
> GATTGTAGTGCCGTTGTCCCAATGGAGATTACAGTGGCAGAACCAGAGGTGCAAATTGTG
> GCATCAGTCGGGACATATATAGAAGAAGTAGTACGATTGGAATGGGACGGTACAGAGCCA
> GAAATATTTGAAGACCCTTCTCCTGCGAAAGACCCCGAGGTGCAAGAAACCCCGGTCCCT
> GAGAAGGCCACTGACAATTCTAAGGTGCCTAAAGTGCTTATGAGCCACGACTCCAAGTCT
> AAAGATGAGAACAATGAGAAGTTCATGGGCTAACCATCTTCAGAGGGGGTAAGGAACGTG
> CCAAACTCAGAGATGATGACCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> NNNNNNNNNNNNNTACTTGTTGCAATAATCTTGCTCCGGAGTAAGTGGTTATAGGATGCA
> AGTACAATAACTAGTTGTAGACAAAGTCAATGACGATACGGAGAAGAATAAGCGCAATGT
>
>
>
>> CH398085 Oryza sativa (indica cultivar-group) chromosome 1 scaffold000005
> genomic scaffold, whole genome shotgun sequence (bioperl's version)
> ....
> TTAGGTGGTTTTATAACTTTAGACTTTGGGAATTTTCATATCACCTGGACACTATGGAAT
> TGTTGGATGATGGTGGAATTGGACATACACCTCTCTTCCTCTTTCAAAACCCCTAAAACC
> TGTTTTCGGTGGGGTTTGGGTGCATGCCAGTTGTGGGAAGTAGCACCCCGGGCACTATAA
> GGATTAAGCTCAGGCCTC <----no gap, missing base
>
> CTGAGTACTGTGGTTGTACTCATTCTTGCTCAATCTTTTCCC
> CCTTCAGTAAGAGAAGATTTGGAGAAGAAGTCTTAGGTGGAGTCCTGGCTTATACCCCAG
> TTGAGCGCCTGTGAAGATGGAGCCGTAGGCCCGCTAGTCCGCTGCTGTTTATTTTTGATT
> GTCAGGCCTTAAGTGCCTTTGTAATAATGTAAATATTATCGATATAATAAAGATGTGTCT
> TTTATATCATGTTTGTGTGGTGTACCCCGGCTTTTCCTGGGACGGGGATTAATACACTAG
> CGTTCGGGAAAAGGCAATTTTCCCGGTCGCGACAGAACTTGTAATTCTCTAGCACTAGAA
> TGACATATCCTTTGGATTGTGCACCAATGCCACGCGAAAACCCATGGTGCCAAAACTAGG
> GGTGGAAAAACCTCCGAGACCTCCTCCGAAGAGGCAGGTGACAGGTAAGGCGGAGGAACC
> CGAGATGCATAAGGAAAATCCAGTGCCGGAAGTGCCACCGGAGATTGCAGTGCCGGAGGT
> GCCCATGGAGATTGTAGTGCCGTTGTCCCAATGGAGATTACAGTGGCAGAACCAGAGGTG
> CAAATTGTGGCATCAGTCGGGACATATATAGAAGAAGTAGTACGATTGGAATGGGACGGT
> ACAGAGCCAGAAATATTTGAAGACCCTTCTCCTGCGAAAGACCCCGAGGTGCAAGAAACC
> CCGGTCCCTGAGAAGGCCACTGACAATTCTAAGGTGCCTAAAGTGCTTATGAGCCACGAC
> TCCAAGTCTAAAGATGAGAACAATGAGAAGTTCATGGGCTAACCATCTTCAGAGGGGGTA
> AGGAACGTGCCAAACTCAGAGATGATGACCC <---- no gap, missing base
>
> GATGGTGGGTTAGCCTGCCTAGCTAGTTC <---- should be revcomp
> GAAGCGGCACTCCTTTTAATTATTTGATATTAGATCATTTTTTAATATTTGTGTTTTTAC
> AAGTACCGCGAGGTACAACCTCATGGACAGGAACAACGCTTTTTTGCAACATATATTTTA
> TACGAAATCTATGCTTTCTGTAAAGTTAAAGCACACTAAATCTAAAGCTTAATATACAAC
> CATGCCACATCATCACCCACTAGCAATAATTATATATTTAATCTCATACAAGCATACAAA
Here's the sequence from NCBI:
1621 ttaggtggtt ttataacttt agactttggg aattttcata tcacctggac actatggaat
1681 tgttggatga tggtggaatt ggacatacac ctctcttcct ctttcaaaac ccctaaaacc
1741 tgttttcggt ggggtttggg tgcatgccag ttgtgggaag tagcaccccg ggcactataa
1801 ggattaagct caggcctct
[gap 50 bp] Expand Ns
1870 c tgagtactgt ggttgtactc attcttgctc aatcttttcc cccttcagta
1921 agagaagatt tggagaagaa gtcttaggtg gagtcctggc ttatacccca gttgagcgcc
1981 tgtgaagatg gagccgtagg cccgctagtc cgctgctgtt tatttttgat tgtcaggcct
2041 taagtgcctt tgtaataatg taaatattat cgatataata aagatgtgtc ttttatatca
2101 tgtttgtgtg gtgtaccccg gcttttcctg ggacggggat taatacacta gcgttcggga
2161 aaaggcaatt ttcccggtcg cgacagaact tgtaattctc tagcactaga atgacatatc
2221 ctttggattg tgcaccaatg ccacgcgaaa acccatggtg ccaaaactag gggtggaaaa
2281 acctccgaga cctcctccga agaggcaggt gacaggtaag gcggaggaac ccgagatgca
2341 taaggaaaat ccagtgccgg aagtgccacc ggagattgca gtgccggagg tgcccatgga
2401 gattgtagtg ccgttgtccc aatggagatt acagtggcag aaccagaggt gcaaattgtg
2461 gcatcagtcg ggacatatat agaagaagta gtacgattgg aatgggacgg tacagagcca
2521 gaaatatttg aagacccttc tcctgcgaaa gaccccgagg tgcaagaaac cccggtccct
2581 gagaaggcca ctgacaattc taaggtgcct aaagtgctta tgagccacga ctccaagtct
2641 aaagatgaga acaatgagaa gttcatgggc taaccatctt cagagggggt aaggaacgtg
2701 ccaaactcag agatgatgac ccc
[gap 50 bp] Expand Ns
2774 tacttgt tgcaataatc ttgctccgga gtaagtggtt ataggatgca
2821 agtacaataa ctagttgtag acaaagtcaa tgacgatacg gagaagaata agcgcaatgt
2881 cagaccagct tgttataatc cagtaacagt aagtaaactc cgtaccgttc gtttttttca
2941 ttcattttaa ttattgtccg ttgcaggctt gcagcagtca catgagtgcg tataaatgca
3001 ccgatttcaa gcccggtgct attaatcaat agattcttct tcactgtggt tcgacaaaca
3061 atgaaactag tataactata gtataactag gtgattcctc acgctttccc gtgctttgtt
3121 gtaaaattta ctaagaaatt ctcaatatgt tttttttaca atcaaactag gattacgaag
It agrees with the 1st sequence, not the second sequence.
Brian O.
>
> CONTIG
> join(AAAA02001496.1:1..1819,gap(50),AAAA02001497.1:1..854,gap(50),
> complement(AAAA02001498.1:1..870),gap(50),AAAA02001499.1:1..945,
> gap(50),AAAA02001500.1:1..11304,gap(100),
>
>
> This is what I changed in postprocess_data:.
>
> # transform links to appropriate descriptions
> if ($data =~ /\nCONTIG\s+/) {
> $self->warn("CONTIG found. Retrieving contig sequence.".
> "\nUse format type 'gbwithparts' or 'fasta' with
> contigs.");
> # get accession from LOCUS
> $data =~ /^LOCUS\s+(\S+)/;
> my $acc = $1;
> my $stream = Bio::DB::GenBank->new(-format => 'fasta');
> my $seq = $stream->get_Seq_by_acc($acc);
> my $contig = $seq->seq;
> # remove everything after and including CONTIG
> $data =~ s/(CONTIG[\s\S]+)$//i;
> # Bio::SeqIO::genbank will fix this line,
> # fills in the actual numbers
> $data .= "BASE COUNT 0 a 0 c 0 g 0 t \n";
> $data .= "ORIGIN \n";
> # Bio::SeqIO::genbank also formats this data correctly
> $data .= "$contig\n//";
> }
>
>
>> On 3/9/06 1:08 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>>
>>> Added WGS and WGS_SCAFLD support to Bio::SeqIO::genbank as well as tests
>> and
>>> WGS sample file; the previous fix missed the WGS_SCAFLD line. I will
>> also
>>> soon add support to Bio::DB::GenBank for downloading WGS and WGS_SCAFLD
>>> subfiles.
>>>
>>> Brian, I found a pretty decent speed improvement for contig building in
>>> Bio::DB::NCBIHelper; it basically fetches the contig whole from NCBI
>> using
>>> return type of 'gbwithparts' so the work is done on their end and just
>>> switches the CONTIG line with the sequence; it took about 10 seconds vs.
>> ~50
>>> seconds using an unmodified NCBIHelper on my PC. I haven't committed it
>> yet
>>> bc I noticed the resulting contig files differ; the bioperl contig build
>>> lacks any N's from the 'gaps()' in the CONTIG line while NCBI's version
>> has
>>> the N filler. I didn't know if the difference was a bug or not. Should
>> I
>>> go ahead and commit?
>>>
>>> Christopher Fields
>>> Postdoctoral Researcher - Switzer Lab
>>> Dept. of Biochemistry
>>> University of Illinois Urbana-Champaign
>>>
>>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
>
More information about the Bioperl-l
mailing list