[Bioperl-l] Processing large fasta sequences through SeqIO

Jason Stajich jason@chg.mc.duke.edu
Thu, 30 Aug 2001 16:00:04 -0400 (EDT)


On Thu, 30 Aug 2001, Josep Francesc Abril Ferrando wrote:

> I need to work with chromosome-size fasta sequences and I was trying
> to run some perl code using BioPerl version 0.7 ("$Id:largefasta.pm,v
> 1.5.2.1$", which is the one currently installed in our system). As I
> read in the "Bio::SeqIO::largefasta" documentation that this module
> has to be accessed through "Bio::SeqIO", I did not include that module
> directly in the program. I wrote a script that basically reads the
> whole seq, may process the sequence a little (e.g. reformatting
> non-uniform length sequence lines, if I am building the input by
> joining many sequences under the same id), and then saves the
> processed large sequence. It seems to work OK, but I got some strange
> results in the saved file, and I also get the following error/warning:
> 
> Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory
> /tmp/Z0gD8R0rlB: Too many links at
> /usr/lib/perl5/site_perl/5.005//Bio/Root/IO.pm line 457
> 

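Is your tmp dir really full of files/directories, or is there not enough
space in it for all of the sequence data?  "Too many links" usually means
the directory already holds as many subdirectories as the file system
allows, so this looks like a system problem rather than a BioPerl one.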

Do you have File::Temp installed?  There is a known bug in the 0.7
release: if File::Temp is not installed, the application does not clean
up its tempdirs/tempfiles.  Installing File::Temp will take care of that.
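
Both questions can be checked with a few lines of Perl.  This is only a
rough sketch, not part of BioPerl: the helper names count_subdirs and
have_file_temp are made up for illustration, and it assumes that the
temp dir BioPerl uses is the one File::Spec->tmpdir() reports (normally
/tmp, or $TMPDIR if set).  It counts the subdirectories directly under
the temp dir and reports whether File::Temp can be loaded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: number of real subdirectories directly inside $dir.
# Symlinks are skipped because they do not add to the directory link count.
sub count_subdirs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my $count = grep {
        my $path = File::Spec->catdir($dir, $_);
        $_ ne '.' && $_ ne '..' && -d $path && !-l $path;
    } readdir($dh);
    closedir($dh);
    return $count;
}

# Hypothetical helper: 1 if File::Temp can be loaded, 0 otherwise.
sub have_file_temp {
    return eval { require File::Temp; 1 } ? 1 : 0;
}

my $tmp = File::Spec->tmpdir();
printf "%s contains %d subdirectories\n", $tmp, count_subdirs($tmp);
print have_file_temp()
    ? "File::Temp is installed\n"
    : "File::Temp is NOT installed\n";
```

If the count is in the tens of thousands, leftover temp directories from
earlier runs are the likely cause, and removing them should let tempdir()
work again.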

> If I look at the saved file, the sequence is OK (it does not have more
> or fewer nucleotides than expected and they are in the correct order)
> but the file contains a lot of empty lines (or lines with just '>')
> after the end of the sequence. Any idea of what could be wrong in the
> following script:
> 

Nothing obvious jumps out from looking at your code.

How large are your files? 

> ---->8---->8---->8---->8---->8----
> 
> perl -ne 'BEGIN{ print ">bigseq\n"; }
>    $_ !~ /^>|^\s*$/o && print ;  ' $INDIR/*.fa |
>   perl -e '
>       use Bio::Seq;
>       use Bio::SeqIO;
>       my $seqin  = Bio::SeqIO->new(-format => "largefasta", -fh => \*STDIN );
>       my $seqout = Bio::SeqIO->new(-format => "largefasta", -fh => \*STDOUT);
>       while (my $sequence = $seqin->next_seq()) {
>            # do here some checkings/changes on substrings of the sequence
>           $seqout->write_seq($sequence);
>       }; # while
>      exit(0);
>     ' - > $OUTDIR/bigseq.fa
> 
> ----8<----8<----8<----8<----8<----
> 
> Is that the right way to use "Bio::SeqIO" for processing large fasta
> files? Do I have to include "Bio::Seq::LargeSeq" and, if yes, how can
> I do that?
> 
You could add the line

    use Bio::Seq::LargeSeq;

just below "use Bio::SeqIO;" if you wanted, but it is already loaded by
the largefasta module, so it is optional.

> Thanks for your attention... Josep F.
> ________________________________________
> 
>     Josep Francesc ABRIL FERRANDO
> 
> RESEARCH GROUP on BIOMEDICAL INFORMATICS
>         GENOME INFORMATICS LAB
>               IMIM - UPF
>           C/ Dr. Aiguader 80
>        08003 - Barcelona  (SPAIN)
> 
>     Ph:  +34 93 2211009 ext 2016
>     Fax: +34 93 2213237
> 
>     http://www1.imim.es/~jabril/
> 