[Bioperl-l] Finding locations of a string within a fasta file
Chris Fields
cjfields at uiuc.edu
Sat Jul 15 21:22:15 UTC 2006
You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default. The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; get the accessions using the following
script.
use strict;
use warnings;
use Bio::DB::GenBank;
my $factory = Bio::DB::GenBank->new(-format => 'gb');
my $seq = $factory->get_Seq_by_id(shift);
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
-format => 'genbank');
# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'}
$seq->get_Annotations();
# split each region, getting rid of gaps and join(), then split into
acc/span
for (grep {$_ !~ m{gap|join}}
split ',', $contig) {
my ($acc, $span) = split ':', $_;
$span =~ s{\)}{}g; # spurious ')'
print "ACC: $acc\n\tSpan:$span\n";
}
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
>
> All,
>
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
>
> Suggestions on how to easily retrieve this would be appreciated.
>
> ch
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list