beta-testing for sequence-analysis modules
Will Fischer
wfischer@walrus.bio.indiana.edu
Thu, 13 Feb 1997 15:06:49 -0500
I have only looked at Seq.pm so
far, and I noticed a couple of things relating to readseq and formats:
1. in the list of formats which readseq currently understands, several
formats CAN contain multiple sequences, but are not listed as such,
(e.g., GenBank/GB, EMBL, Pearson/Fasta, which I have marked with a
TRAILING "*" in the list below ... probably the others marked with a ?
can also. It is important to support multiple sequences per file!
- IG/Stanford
- GenBank/GB *
- NBRF ?
- EMBL *
- GCG
- DnaStrider
- Fitch format ?
- Pearson/Fasta *
- Zuker format ?
- Olsen format ?
- Phylip3.2 *
- Phylip *
- Plain/Raw
* MSF
* PAUP's multiple sequence (NEXUS) format
* PIR/CODATA format used by PIR
* ASN.1 format used by NCBI
Note: Formats indicated with a '*' allow for multiple
sequences to be contained within one file. At this
time, the behaviour of Seq.pm with regard to these
multiple-sequence files has not been spefified.
2. I have gotten readseq to work with my scripts, not
by bi-directional piping, I regret to say, but by using a temporary file.
I use a subroutine to examine each file to see if it contains the formats
my scripts parse, and call readseq to give fasta format if it's something
else:
FIND_READSEQ:
foreach (split(/:/,$ENV{PATH})){
if (-x "$_/readseq"){
$readseq = "$_/readseq";
warn "readseq available as $readseq\n";
last FIND_READSEQ;
}
}
if ($readseq eq ''){
warn qq(couldn't find "readseq" -- will only be able to read fasta and genbank formats.\n);
}
OPTIONS_AND_INFILES: while ($_ = shift(@ARGV)) {
/^-/ && do{
# assign name for output file
/o/ && ($outfile = shift, next); warn "option $_ not (currently) supported (skipping)\n";
next OPTIONS_AND_INFILES;
};
$filename = $_;
push(@seq_sets,&infiles($filename));
}
foreach $chunk_of_sequences_from_the_same_file ( @seq_sets ) {
# PROCESSING HERE
}
sub infiles {
my($is_readable_format) = 0;
my($filename) = shift(@_);
open(INFILE, "<$filename") || die "couldn't open $filename",": $!\n" ;
my($file_contents) = join('',<INFILE>);
close(INFILE);
# ORDER of these regexps is important -- most specialized ought to go first (for accuracy)
# BUT simplest ought to go first (for speed). Hmmm...
#
# FASTA | GENBANK
( $file_contents =~ m{(^>[^>]+)|((?s)LOCUS.*?//\n)} )
&& $is_readable_format++;
if ($is_readable_format) {
return($file_contents);
}
else {
if ($readseq ne '') {
my($tempfile) = "/tmp/$$";
open(READSEQ_INPUT,"| readseq -pipe -all -ffasta > $tempfile");
print READSEQ_INPUT $file_contents;
close(READSEQ_INPUT);
open(READSEQ_OUTPUT,"<$tempfile");
$readseq_fasta_format = join('',<READSEQ_OUTPUT>);
close(READSEQ_OUTPUT);
unlink $tempfile;
return($readseq_fasta_format);
}
else {
warn "format not recognized for file $filename
(and readseq wasn't found). Skipping this file\n";
}
}
}
----- End Included Message -----