[BioPython] Bio.SeqIO ideas
Peter
biopython at maubp.freeserve.co.uk
Mon Jul 16 15:15:31 UTC 2007
Martin MOKREJŠ wrote:
> Peter,
> maybe the docs (generated from sources as well as those in the
> Documentation) should be clear what is id, name, description of SeqRecord object.
They are all strings, normally specified when the SeqRecord object is
created. What they contain depends on where the SeqRecord came from,
and for Bio.SeqIO that means which file format was parsed.
One idea I had in mind was to expand the wiki page with worked examples
of sequence files and the SeqRecord objects created from them by Bio.SeqIO.
> E.g.,
> it would be helpful to demonstrate the values on an example of a FASTA
> record parsed. Then one would figure out what is the difference between name
> and description.
FASTA files are used in the tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11
Do you think that, in addition to explicitly showing the record id and
seq, I should also show the description (and name)?
FASTA is a very free-form format, and in general the first word of the
title line (splitting on white space) is a name or identifier. In some
cases (e.g. NCBI FASTA files) this can be subdivided further (splitting
on the | character).
To be explicit, suppose you had this:
>554154531 a made up protein
SDKJSDLHVLSDJDKJFDLJFKLSDJD
>heat shock protein
EINDLKNFLDHFDSHFLDSHJDSHDJHJHKJHSD
Biopython will use the first word as both the record id and name, and
the full title line as the description. For example, given this FASTA
file you would get two records, the first:
id = name = "554154531"
description = "554154531 a made up protein"
and the second,
id = name = "heat"
description = "heat shock protein"
Note that including the full text as the description is partly
inherited from older Biopython code, and partly intended to make it as
easy as possible for you to extract any data from the line in your own code.
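The rule described above can be sketched in plain Python. This is only an illustration of the first-word convention, not Biopython's actual parser: the names read_fasta_records, make_record and split_on_pipe are hypothetical, and the identifier "gi|12345|example" is made up purely to show the | subdivision. With Biopython installed, the real call would be Bio.SeqIO.parse(handle, "fasta"), which gives SeqRecord objects rather than the dictionaries used here.

```python
from io import StringIO

# The two-record example from the message above.
EXAMPLE_FASTA = """>554154531 a made up protein
SDKJSDLHVLSDJDKJFDLJFKLSDJD
>heat shock protein
EINDLKNFLDHFDSHFLDSHJDSHDJHJHKJHSD
"""


def make_record(title, sequence):
    """Hypothetical helper: apply the first-word rule to one title line."""
    words = title.split(None, 1)  # split on white space, at most once
    first_word = words[0] if words else ""
    return {
        "id": first_word,      # first word of the title line
        "name": first_word,    # same value as the id
        "description": title,  # the full title line, unchanged
        "seq": sequence,
    }


def read_fasta_records(handle):
    """Hypothetical sketch: yield one dict per '>' record in the handle."""
    title = None
    seq_lines = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield make_record(title, "".join(seq_lines))
            title = line[1:].strip()
            seq_lines = []
        elif title is not None and line:
            seq_lines.append(line)
    if title is not None:
        yield make_record(title, "".join(seq_lines))


def split_on_pipe(identifier):
    """Hypothetical helper: subdivide an NCBI-style identifier on '|'."""
    return identifier.split("|")


records = list(read_fasta_records(StringIO(EXAMPLE_FASTA)))
for record in records:
    print(record["id"], "/", record["name"], "/", record["description"])

# Made-up identifier, only to show the subdivision step.
print(split_on_pipe("gi|12345|example"))
```

Because the description keeps the whole line, your own code can still pull out anything after the first word, for example with record["description"].split(None, 1).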
Peter