[Biopython-dev] Biopython status

Jared Flatow jflatow at northwestern.edu
Tue Oct 16 16:02:19 UTC 2007


Please forgive me for ever doubting your health; it seems the group
is very much alive!

On Oct 16, 2007, at 3:16 AM, Peter wrote:

> Jared Flatow wrote:
>> I have also needed to create a modified FASTA parser so that I can  
>> read things like quality score files.
>
> Could you be a little more specific - what exactly do you mean by
> quality score files (links and/or examples)?  It may be that this
> warrants setting up a new file format in Bio.SeqIO.

That is what I did. The quality score files I meant are simply
FASTA-like records that indicate the quality of each base read by a
sequencing machine, on a scale of something like 1 to 64. The values
are tab separated and correspond to 'reads' in another FASTA file
that contains the actual sequences read. This is the way the 454
GSFlex machines output their sequencing reads, so for every set of
reads there will be a pair of files, 454Reads.fna and 454Reads.qual.
The only difference between a parser that processes these qual files
and one that processes the sequence files is that it shouldn't get
rid of spaces, and the newlines should not be stripped but converted
into spaces (when 454 starts a new line of scores, they omit the
space). Essentially, I have made a duplicate of FastaIO's iterator,
named it something else, made these two small changes, and put an
entry for it in the SeqIO file.

16,17c16,17
< def GSQualIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
<     """Generator function to iterate over GSFlex quality  records (as SeqRecord objects).
---
 > def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None) :
 >     """Generator function to iterate over Fasta records (as SeqRecord objects).
54c54
<             lines.append(line.rstrip())  # .replace(" ","")) leave off the replacing internal spaces so we can process qscore files (jf)
---
 >             lines.append(line.rstrip().replace(" ",""))
58c58
<         yield SeqRecord(Seq(" ".join(lines), alphabet),
---
 >         yield SeqRecord(Seq("".join(lines), alphabet),
63a64,199

As you can see, a parser like this might be useful for other
FASTA-like formats as well and is in no way specific to the GS quality
files (it's just a space-preserving parser). If it were to be
implemented in Biopython, you might call it something else.
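
The two changes described above (keep internal spaces, join lines
with a space) can be sketched without Biopython at all. This is only
a minimal illustration, not the GSQualIterator from the diff: the
names parse_space_preserving and parse_qual are hypothetical, the
sample records are made up, and real 454 .qual files may need more
care than this.

```python
import io


def parse_space_preserving(handle):
    """Yield (title, data) pairs from a FASTA-like handle.

    Unlike a sequence parser, internal whitespace is kept and the
    data lines of a record are joined with a single space.
    """
    title = None
    lines = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, " ".join(lines)
            title = line[1:]
            lines = []
        elif line and title is not None:
            lines.append(line)
    if title is not None:
        yield title, " ".join(lines)


def parse_qual(handle):
    """Yield (title, scores) pairs, with scores as a list of ints."""
    for title, data in parse_space_preserving(handle):
        # split() with no argument handles both spaces and tabs
        yield title, [int(token) for token in data.split()]


# Made-up example data: two records, the first wrapped over two lines.
example = io.StringIO(
    ">read1 length=6\n40 40 38\n35 20 11\n>read2 length=2\n27 9\n"
)
records = list(parse_qual(example))
```

Here records holds one (title, scores) pair per read, so the scores
can be matched by title against the reads in the .fna file.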
>
>> I would be happy to submit the changes to the group or an individual
>>  for inspection, but I would like to avoid having to maintain my own
>>  separate version of Biopython if possible.
>
> As has already been said - please file some (enhancement) bugs and
> attach your patches, or raise specific issues for discussion on this
> mailing list.
>
> Depending on the nature of your changes, you might be able to achieve
> some of them by subclassing Biopython's objects - rather than literally
> maintaining your own branch of the project.
>
>> I am also wondering how it would be received if I did something like
>> add a to_fasta method to SeqRecord instead of having to go through
>> writing it to a file using a SeqIO when all I want is the string.
>
> Out of interest, why do you want to create a FASTA record as a string?

I am serving FASTA records dynamically from a database of sequences
via a web server.

>
> Did you know you can write to a string using any Bio.SeqIO supported
> file format using StringIO?  Perhaps we should spell this out more
> explicitly in the documentation, but a motivating example would help.

This is what I do now, but it seems like a hack to me to go this
route. Always having to write to a file feels strange, but I see
that it would be messy to go OO since there are so many formats.
However, giving preference to FASTA over other formats by making it
innate doesn't seem like such a terrible idea. I do have mixed
feelings about 'bloating' the code, which is why I asked, and you have
convinced me that this is not quite appropriate given existing
convention. However, the idea would be to put the to_fasta or
to_format method inside the SeqRecord, then call it from the IO
when needed to actually write to a file, but call it directly when
all that is wanted is a string...
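
To make the idea concrete, here is a rough sketch of that layout:
the record knows how to format itself, and the file writer just
calls that method. Record, to_fasta, write_fasta and the width
parameter are all hypothetical stand-ins for illustration; this is
not Biopython's SeqRecord or its SeqIO code.

```python
import io


class Record:
    """Hypothetical stand-in for a sequence record (not SeqRecord)."""

    def __init__(self, id, seq, description=""):
        self.id = id
        self.seq = seq
        self.description = description

    def to_fasta(self, width=60):
        """Return this record as a FASTA-formatted string."""
        title = f"{self.id} {self.description}".strip()
        body = [self.seq[i:i + width] for i in range(0, len(self.seq), width)]
        return "\n".join([">" + title] + body) + "\n"


def write_fasta(records, handle):
    """Write records to a handle by delegating to Record.to_fasta."""
    count = 0
    for record in records:
        handle.write(record.to_fasta())
        count += 1
    return count


rec = Record("seq1", "ACGTAC", "example")
as_string = rec.to_fasta(width=4)   # string wanted: call the method directly
buffer = io.StringIO()
written = write_fasta([rec], buffer)  # file wanted: the writer calls the same method
```

The formatting logic lives in one place, so the string and the file
output cannot drift apart. The cost is the one already mentioned:
every supported format would need its own method on the record.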

>
> I would suggest rather than adding a to_fasta method to the  
> SeqRecord, simply write your own "seqrecord_to_string" function (or  
> create a subclass of SeqRecord with this method).
>

I'll leave it alone for now until I can come up with a real proposal =)
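
For reference, the standalone helper route via StringIO could look
roughly like the sketch below. It uses only the standard library so
it runs as-is: write_to_string and write_plain_fasta are hypothetical
names, and write_plain_fasta is a toy writer that takes (title,
sequence) tuples. It takes its arguments in the same order as
Bio.SeqIO.write(records, handle, format), so with Biopython installed
the same helper should work as write_to_string(SeqIO.write, records,
"fasta").

```python
import io


def write_to_string(writer, records, *args):
    """Run any writer(records, handle, *args) against an in-memory handle
    and return what it wrote as a string."""
    handle = io.StringIO()
    writer(records, handle, *args)
    return handle.getvalue()


def write_plain_fasta(records, handle):
    """Toy writer for (title, sequence) tuples; a stand-in for a real
    format writer, used here only to keep the example self-contained."""
    for title, seq in records:
        handle.write(f">{title}\n{seq}\n")


fasta_text = write_to_string(write_plain_fasta, [("seq1", "ACGT")])
```

The resulting string can be returned straight from a web request
handler without touching the file system.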

>> Finally, are there plans to move to a subversion repository at any  
>> point?
>
> It was raised a while ago, and our cunning plan was to let BioPerl try
> the move first.  Once that has been proven, it should be fairly easy for
> the OBF guys to also move us over.  I should email them to see how
> things stand...

BioPerl seems to be the guinea pig for everything. Leading the way
on this might put a stop to those nasty rumors about Biopython.

Best Regards,
Jared


