[Bioperl-l] SeqIO bug?
Heikki Lehvaslaiho
heikki at nildram.co.uk
Sat Feb 12 07:38:39 EST 2005
Ryan,
Most of our parsers assume that if you state the format, the sequences really
are in that format. There is some built-in guessing in SeqIO: if you do not
specify the format, the code will look into the sequence file.
If the file does not follow any well-defined format, it is very difficult to
guess what it could be...
As for your raw output: raw assumes that there is one sequence per line. In
your case, the 70 seems to be the number of characters in one line.
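
To make the point about raw concrete, here is a minimal sketch in plain Perl
(no BioPerl needed). The input is a made-up 140-base sequence wrapped at 70
characters per line, which is an assumption about what your files look like.
It contrasts a line-per-record reading with joining all lines together:

```perl
use strict;
use warnings;

# Hypothetical input: one 140-base sequence wrapped at 70 characters per line.
my $wrapped = ('acgt' x 17 . 'ac') . "\n" . ('gt' . 'acgt' x 17) . "\n";

# Read the text through an in-memory filehandle, one line at a time.
open my $fh, '<', \$wrapped or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;
chomp @lines;

# Line-per-record view: the first "sequence" is only the first line.
print "first line: ", length($lines[0]), "\n";

# Whole-file view: the length you get if the lines are joined.
print "all lines joined: ", length(join '', @lines), "\n";
```

With this input the first number is 70 and the second is 140, which matches
the symptom you describe.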
I suggest that you first convert your sequences into fasta with a simple
script. You do not say what the delimiting character or characters between
sequences are, nor whether you have names for them, so I cannot write the
script for you, but you can use the following code as a basis:
#----------------------------
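#!/usr/bin/env perl
use strict;
use warnings;

my $delimiter = "xx\n";   # record separator; change to match your files
$/ = $delimiter;          # read one delimited record at a time
my $count = 0;
while (<>) {
    chomp;                # strip the trailing delimiter
    s/\W//g;              # remove whitespace and punctuation
    s/\d//g;              # remove digits such as line numbers
    next unless length;   # skip empty records
    $count++;
    print ">$count\n$_\n";
}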
#----------------------------
It assumes files like this as input:
-----------------------------------------------
ag cagc
xx
1 catgctagctacgtatgc
2 cgtcagctagctga
3 catcgtagc
xx
ttt tgtt ttatt atatat
xx
-----------------------------------------------
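
As a worked example, the sketch below applies the same cleaning steps to the
sample input held in a string, so the result can be checked without creating
any files. The helper name fasta_from_records is hypothetical and is not part
of BioPerl; the "xx\n" delimiter is the same assumption as in the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Convert delimiter-separated records into FASTA text.
# fasta_from_records() is a hypothetical helper, not a BioPerl function.
sub fasta_from_records {
    my ($text, $delimiter) = @_;
    my ($count, $fasta) = (0, '');
    for my $record (split /\Q$delimiter\E/, $text) {
        $record =~ s/[\W\d]//g;        # drop whitespace, punctuation, digits
        next unless length $record;    # skip empty records
        $count++;
        $fasta .= ">$count\n$record\n";
    }
    return $fasta;
}

# The sample input shown above, with "xx\n" closing each record.
my $sample = join '',
    "ag cagc\n", "xx\n",
    "1 catgctagctacgtatgc\n", "2 cgtcagctagctga\n", "3 catcgtagc\n", "xx\n",
    "ttt tgtt ttatt atatat\n", "xx\n";

print fasta_from_records($sample, "xx\n");
```

For the sample this prints three records:

>1
agcagc
>2
catgctagctacgtatgccgtcagctagctgacatcgtagc
>3
ttttgttttattatatat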
I hope this helps,
-Heikki
On Friday 11 February 2005 19:31, Ryan Golhar wrote:
> I have a bunch of cDNA sequences that I'm trying to process. The
> sequences are in FASTA format, but they are all missing the FASTA header,
> i.e. they just contain the sequence. As a test to make sure I'm reading
> them in correctly, I am doing the following:
>
> my $seq_in = Bio::SeqIO->new(-file => "<myseqfile",
> -format => 'fasta');
> my $seq = $seq_in->next_seq();
> print $seq->length;
>
> It prints out a number, but reads the first line as the FASTA header
> even though it's not there. Wouldn't it make more sense to either print
> out an error message about the missing FASTA header, or read in the file
> as just the sequence regardless of specifying the FASTA format?
>
> If I try to read the sequence in as "raw", the length is always printed
> out as 70...
>
> Ryan
--
______ _/ _/_____________________________________________________
_/ _/ http://www.ebi.ac.uk/mutations/
_/ _/ _/ Heikki Lehvaslaiho heikki at_ebi _ac _uk
_/_/_/_/_/ EMBL Outstation, European Bioinformatics Institute
_/ _/ _/ Wellcome Trust Genome Campus, Hinxton
_/ _/ _/ Cambridge, CB10 1SD, United Kingdom
_/ Phone: +44 (0)1223 494 644 FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________