[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Michiel de Hoon mdehoon at c2b2.columbia.edu
Fri Aug 4 03:20:18 UTC 2006


> Question One
> ============
>
> Is reading sequence files an important
> function to you, and if so which file formats in particular (e.g.
> Fasta, GenBank, ...)
> 
I use Fasta, GenBank, and occasionally ClustalW.
> 
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with
>     a title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)
I use Bio.Fasta with the RecordParser, but just because it's easy to 
find in the documentation. As a user, I think Bio.Fasta requires too 
many steps to be typed in; I would prefer something more 
straightforward. For the output format, I don't care so much, but for 
the sake of consistency a SeqRecord may be preferable.
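
To make "more straightforward" concrete, here is a minimal sketch of a one-call Fasta reader in plain Python. It is not the Bio.Fasta API or any other existing Biopython function: the name `read_fasta` is hypothetical, and returning (title, sequence) pairs of plain strings is an assumption made purely for illustration.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) tuples from an open Fasta file handle.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    # Emit the last record in the file.
    if title is not None:
        yield title, "".join(chunks)


# Usage: a single call on an open handle, no parser object to set up first.
example = io.StringIO(">name text text text\nACGTAC\nACGT\n>second\nTTGA\n")
for title, sequence in read_fasta(example):
    print(title, sequence)
```

The point of the sketch is the calling convention (one function, one argument), not the parsing details; a real implementation could just as well return SeqRecord objects.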

> 
> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
> 
> (a) Bio.Fasta.Dictionary
> (b) Bio.GenBank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires
>     creation of an index using the index_file function
> 

No. I never really understood index files.

> 
> Question Four - Record Access...
> ================================ 
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the
> records using an iterator but saving them in a list).
I use (a). As long as (a) is available, it's easy to build (b) or (c) from it when needed.
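
As a rough illustration of why the iterator is the only interface that really has to exist, the sketch below builds both a list and a dictionary from one. The generator `iter_records` is a hypothetical stand-in for whatever record iterator is available, and the (identifier, sequence) tuples are an assumed record shape, not a Biopython type.

```python
def iter_records():
    """Hypothetical stand-in for a record iterator.

    Yields (identifier, sequence) tuples in file order.
    """
    yield "seq1", "ACGT"
    yield "seq2", "TTGA"
    yield "seq3", "GGCC"


# (c) Random access by index number: collect the iterator into a list.
records = list(iter_records())

# (b) Random access by identifier: build a dictionary keyed on the identifier.
by_id = {identifier: sequence for identifier, sequence in iter_records()}

print(records[1])
print(by_id["seq3"])
```

Both conversions hold every record in memory at once, which is the cost of random access here; the iterator alone does not.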
> 
> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as
> FastaRecords or as SeqRecords?  If SeqRecords, do you use your own
> title2ids mapping?
> 
> For example,
> 
> >name text text text
> ACGTACACGT
> 
> As a FastaRecord this would have:
> 
> FastaRecord.title = "name text text text" (string)
> FastaRecord.sequence = "ACGTACACGT" (string)
> 
> As a SeqRecord (with the default title2ids mapping):
> 
> SeqRecord.id = (default string)
> SeqRecord.name = (default string)
> SeqRecord.description = "name text text text" (string)
> SeqRecord.seq = Seq("ACGTACACGT", alphabet)
I use the FastaRecord, but again for no particular reason. I have not 
experienced an advantage of Seq objects over simple strings, so for me 
the fact that FastaRecord contains a simple string is more convenient. 
But it doesn't matter much.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
> 
> (a) Bio.GenBank.FeatureParser which returns SeqRecord objects
> (b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects
> 
I don't care so much, but I think that having two record types is 
confusing, so it would be better if we could decide on one. A SeqRecord
is more general than a Bio.GenBank.Record, so I have a slight preference 
for a SeqRecord.

> 
> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of Biopython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ...
> (please provide details).
> 
(a). Often, I'm just at the Python prompt typing away. What I like about 
Python and Numerical Python is that the commands are often obvious and 
easy to remember. With the parser framework, on the other hand, I always 
need to look up in the documentation how to use it.
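
For readers unfamiliar with the event/callback model the question describes, here is a toy sketch of the scanner/consumer split. It only illustrates the general pattern; it is not Martel or any Biopython class, and every name in it (`CollectingConsumer`, `scan_fasta`, and the event methods) is hypothetical.

```python
class CollectingConsumer:
    """Hypothetical consumer: receives parsing events and builds records."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def start_record(self, title):
        self._title = title
        self._chunks = []

    def sequence_line(self, text):
        self._chunks.append(text)

    def end_record(self):
        self.records.append((self._title, "".join(self._chunks)))


def scan_fasta(lines, consumer):
    """Hypothetical scanner: walks the lines and fires events on the consumer."""
    in_record = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                consumer.end_record()
            consumer.start_record(line[1:])
            in_record = True
        elif line and in_record:
            consumer.sequence_line(line)
    if in_record:
        consumer.end_record()


# The scanner knows the file layout; the consumer decides what to build.
consumer = CollectingConsumer()
scan_fasta([">name text text text", "ACGTAC", "ACGT"], consumer)
print(consumer.records)
```

Swapping in a different consumer changes what gets built without touching the scanner, which is the flexibility of the design; the price is the extra setup that has to be remembered at the prompt.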

--Michiel


