[BioPython] Martel Help
Andrew Dalke
dalke at dalkescientific.com
Wed Apr 18 21:45:00 UTC 2007
On Apr 18, 2007, at 6:54 PM, Pepe Barbe wrote:
> In the example, there are some things whose purpose is obvious but the
> implementation details (Or all the possible options) aren't. Currently
> I am curious on how does Martel.HeaderFooter and Std.record affect the
> parsing.
I'm having to think back several years now.
One limitation of Martel is parsing large data files. Its
memory overhead is several times the size of the data file
being processed. E.g., a 1 MB file might take 7 or so MB to process.
Most bioinformatics formats are composed of records. Eg,
a GenBank file contains many GenBank records. The idea of the
Header / Footer / HeaderFooter classes is to break the large
file down into small records, and only have the overhead for
parsing a record.
(But it doesn't help with large records, like an entire
chromosome stored as a single FASTA record.)
FASTA files have no header or footer. They can be read
and split up using a RecordReader, specifically a
StartsWith record reader told to look for the ">" which
marks the start of a new record. Compare that to SwissProt,
where each record ends with a "//" line.
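To make the record-splitting idea concrete, here is a rough sketch in
plain Python. It is not Martel's RecordReader code; the function names
records_starting_with and records_ending_with are made up for this
illustration. The point is only that one record is held in memory at a time.

```python
import io

def records_starting_with(handle, marker):
    """Yield one record at a time; a new record begins at each line
    that starts with `marker` (hypothetical helper, not Martel's API)."""
    current = []
    for line in handle:
        if line.startswith(marker) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)

def records_ending_with(handle, marker):
    """Yield one record at a time; a record ends at a line equal to
    `marker` (hypothetical helper, not Martel's API)."""
    current = []
    for line in handle:
        current.append(line)
        if line.rstrip("\n") == marker:
            yield "".join(current)
            current = []
    if current:
        # Trailing lines without a terminator are returned as-is.
        yield "".join(current)

# FASTA-style input: records start with ">".
fasta = io.StringIO(">seq1\nACGT\nGGCC\n>seq2\nTTAA\n")
fasta_records = list(records_starting_with(fasta, ">"))

# SwissProt-style input: records end with a "//" line.
swiss = io.StringIO("ID   A\nSQ   x\n//\nID   B\n//\n")
swiss_records = list(records_ending_with(swiss, "//"))
```

Each generator only ever keeps the lines of the current record, which is
the property that keeps the per-record parsing overhead small.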
Some formats are more complicated. GenBank is one. Real
GenBank files start with a header, something like
    GBGSS1.SEQ Genetic Sequence Data Bank
    February 15 2003
    NCBI-GenBank Flat File Release 134.0
    GSS Sequences (Part 1)
    88066 loci, 66600405 bases, from 88066 reported sequences
There needs to be a way to process a single, unique header,
followed by 0-or-more repeats of a record, followed by an
optional footer.
Use the HeaderFooter expression for this case.
In general, this is a clumsy solution.
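As an illustration of the header / records / footer shape, here is a
small sketch in plain Python. It does not use Martel and is not how
HeaderFooter is implemented; header_records_footer is a hypothetical
name, and the sample data is a cut-down stand-in for a real file. It
assumes records begin with a "LOCUS" line and end with a "//" line.

```python
import io

def header_records_footer(handle, record_start, record_end):
    """Yield (kind, text) pairs: one 'header', then each 'record',
    then a 'footer' if any text follows the last record.
    Hypothetical helper for illustration only."""
    lines = iter(handle)
    header = []
    current = []
    # Everything before the first record-start line is the header.
    for line in lines:
        if line.startswith(record_start):
            current.append(line)
            break
        header.append(line)
    yield ("header", "".join(header))
    # Collect records one at a time, ending at the terminator line.
    for line in lines:
        current.append(line)
        if line.rstrip() == record_end:
            yield ("record", "".join(current))
            current = []
    # Anything left after the last terminator is treated as the footer.
    if current:
        yield ("footer", "".join(current))

sample = io.StringIO(
    "GBGSS1.SEQ Genetic Sequence Data Bank\n"
    "\n"
    "LOCUS       A\n"
    "ORIGIN\n"
    "//\n"
    "LOCUS       B\n"
    "//\n"
)
events = list(header_records_footer(sample, "LOCUS", "//"))
kinds = [kind for kind, text in events]
```

Only the header and the current record are held at any time, so the cost
scales with the largest record rather than with the whole file.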
Ignore the Std.record. My thought was that the different terms
in the expression could be standardized. For example, all
sequences would be tagged with "bio:seq". I hoped this would
minimize the work needed to add a new format, because most of
the handlers would look for the expected tags and not depend so
much on the actual structure of the XML.
It proved too complicated to explain and use.
> Later in that example they use: blat.format.make_iterator("record").
> Where does the "record" come from? Because of using Std.record?
The "record" comes from a group name used in the expression.
It names the point in the expression where the repetition is
done, that is, the unit the iterator steps over.
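A loose analogy using Python's re module, in case it helps. This is not
Martel code; the toy "name=value" format and the iterate_group function
are invented for the example. The name passed to the iterator is simply
whatever name the format's author gave the repeated group.

```python
import re

# Toy format (an assumption for this sketch): one "name=value" per line.
# The outer group is named "record"; any other name would work as well.
format_re = re.compile(r"(?P<record>(?P<name>\w+)=(?P<value>\w+)\n)")

def iterate_group(pattern, tag, text):
    """Yield the text matched by the group called `tag`, once per
    repetition (hypothetical stand-in, not Martel's make_iterator)."""
    for match in pattern.finditer(text):
        yield match.group(tag)

data = "a=1\nb=2\n"
records = list(iterate_group(format_re, "record", data))
names = list(iterate_group(format_re, "name", data))
```

Asking for "record" steps over whole records, while asking for "name"
steps over just that inner group.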
Andrew
dalke at dalkescientific.com