[Bioperl-l] Fasta Genome Splice

Jason Stajich jason at cgt.duhs.duke.edu
Thu Feb 12 15:46:33 EST 2004


On Thu, 12 Feb 2004, David Clark wrote:

> Good point.  What I need is two fasta files: one with the orf regions
> masked, and one with the non-orf regions masked.

This is a little bit of work, but pretty easy since you can fit whole
yeast chromosomes into memory.  I do it by figuring out what I want to
mask and then doing the following (substr offsets are 0-based, so
subtract 1 from a 1-based feature start):
 substr($chromseq,$start,$len,'N'x$len)

So you can just write a simple parser for the chromosomal_features.tab file:
while(<FILE>){
  chomp;
  my ($feature,$gene,$sgdid, ... etc ) = split(/\t/,$_);
  # do the substr replace here
}
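
To make the substr step concrete, here is a minimal sketch of masking a
chromosome both ways in core Perl.  The sub name mask_both_ways, the toy
sequence, and the coordinates are made up for illustration.  It assumes
the feature coordinates are 1-based and inclusive, and that a feature on
the reverse strand may be listed with start greater than end; check both
against your own features file.

```perl
use strict;
use warnings;

# Mask features given as [start, end] pairs (assumed 1-based, inclusive).
# Returns two strings: one with the features masked, and one with
# everything except the features masked.
sub mask_both_ways {
    my ($chromseq, @features) = @_;
    my $orf_masked     = $chromseq;
    my $non_orf_masked = 'N' x length $chromseq;
    for my $f (@features) {
        my ($start, $end) = @$f;
        # assumption: reverse-strand features may have start > end
        ($start, $end) = ($end, $start) if $start > $end;
        my $len = $end - $start + 1;
        # substr offsets are 0-based, hence the -1
        substr($orf_masked, $start - 1, $len, 'N' x $len);
        # copy the feature's bases back into the all-N string
        substr($non_orf_masked, $start - 1, $len,
               substr($chromseq, $start - 1, $len));
    }
    return ($orf_masked, $non_orf_masked);
}

# toy data, not real yeast coordinates
my $chromseq = 'ACGTACGTACGTACGTACGT';
my ($orf_masked, $non_orf_masked) =
    mask_both_ways($chromseq, [3, 6], [15, 11]);
print "$orf_masked\n";      # ACNNNNGTACNNNNNTACGT
print "$non_orf_masked\n";  # NNGTACNNNNGTACGNNNNN
```

Starting the second string as all N's and copying the features back in
avoids having to work out the gaps between features, and overlapping
features need no special handling.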

> There was another thing I wanted to do that I didn't mention before: how
> can I generate the reverse complement of a whole genome file?

That's easy with EMBOSS:
% revseq FILE.fwd FILE.rev

With bioperl -- see the Sequence HOWTO in the howto section of the bioperl
website.  You want to use the revcom method of Bio::PrimarySeq objects.

use Bio::SeqIO;

# change fasta to whatever format you have/want the sequences in
my $in = Bio::SeqIO->new(-file => 'filename', -format => 'fasta');
my $out = Bio::SeqIO->new(-file => '>filename.rev', -format => 'fasta');
# write out the reverse complement of each sequence in the input
while( my $s = $in->next_seq ) {
  $out->write_seq($s->revcom);
}
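
If you only want to see what a reverse complement amounts to for plain
DNA, it can be sketched in core Perl as below.  This is not how
Bio::PrimarySeq implements revcom; the sub name revcom_string is made up
for illustration, and the sketch only translates A, C, G and T (either
case), passing N and any other character through unchanged.

```perl
use strict;
use warnings;

# Reverse the string, then swap each base for its complement.
sub revcom_string {
    my ($seq) = @_;
    my $rc = reverse $seq;           # scalar context reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement; other characters untouched
    return $rc;
}

print revcom_string('ATGCCN'), "\n";   # NGGCAT
```

For real genome files the Bio::SeqIO loop is the safer route, since it
also takes care of parsing and writing the sequence records.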


-jason
> On Feb 12, 2004, at 1:19 PM, Jason Stajich wrote:
>
> > You want these as a fasta file per orf and per non-orf region or just 2
> > datasets with the genome masked (all N's or lowercased)?
> >
> > -jason
> > On Thu, 12 Feb 2004, David Clark wrote:
> >
> >> Hello,
> >>
> >> I'm a relative newcomer to bioperl, and would like a point in the
> >> right direction.  I need to separate the yeast genome into two
> >> partial genomes--one with all ORFs, and one with everything else.  I
> >> have a tab delimited list of the ORFs with the coordinates, and can
> >> probably parse that myself, but I wanted to see if anyone could point
> >> me to some example code, or give me some place to start in separating
> >> genomes based on the coordinates.
> >>
> >> Thanks,
> >>
> >> David Clark
> >> dfclark at neo.tamu.edu
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

