[Biopython] SwissProt parser: get entire entry as string?
Martin Mokrejs
mmokrejs at fold.natur.cuni.cz
Fri Aug 19 17:12:22 UTC 2016
Hi Bastien,
Chevreux, Bastien wrote:
> Dear list,
>
> Is there a way to get the entire entry of a just parsed SwissProt entry as string?
>
> Motivation: I want to write a simple filter for UniProt/SwissProt .dat files like this
>
> fh=open(sys.stdin.fileno())
> for rec in SwissProt.parse(fh):
>     if ("something I want to check for matches"):
>         print(rec.entry_as_string)
>
> and not reformat each and every field for compatible output (which would take half a life-time to get correct and probably still be error-prone).
From a quick look at the runtime docs, the parser does not keep the raw input data in the record object. So I think you can only do the searching and filtering with Biopython, output the matching accessions, and then use some other tool to filter the .dat file with that accession list.
http://biopython.org/DIST/docs/api/Bio.SwissProt-module.html
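The second step (pulling the untouched entries out of the .dat file once you have the accessions) does not strictly need another tool. Below is a minimal sketch using only the standard library, assuming Python 3 and the usual flat-file layout in which accessions sit on "AC   " lines and every entry ends with a "//" line. The function names and the two demo entries are made up for illustration; none of this is part of Bio.SwissProt.

```python
import io


def iter_raw_entries(handle):
    """Yield each flat-file entry as one string, including its '//' terminator."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


def entry_accessions(raw_entry):
    """Collect the accession numbers from the AC line(s) of a raw entry."""
    accessions = []
    for line in raw_entry.splitlines():
        if line.startswith("AC   "):
            accessions.extend(a.strip() for a in line[5:].split(";") if a.strip())
    return accessions


def filter_by_accession(handle, wanted):
    """Yield the raw text of every entry that carries one of the wanted accessions."""
    wanted = set(wanted)
    for raw in iter_raw_entries(handle):
        if wanted.intersection(entry_accessions(raw)):
            yield raw


# Made-up, heavily shortened entries, only to show the mechanics.
demo = (
    "ID   TEST1_HUMAN   Reviewed;   10 AA.\n"
    "AC   P00001; Q00001;\n"
    "//\n"
    "ID   TEST2_HUMAN   Reviewed;   12 AA.\n"
    "AC   P00002;\n"
    "//\n"
)
matches = list(filter_by_accession(io.StringIO(demo), ["P00002"]))
```

On the real file you would pass open("uniprot_sprot.dat") instead of the StringIO object and write each match to stdout unchanged.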
$ wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
$ gzip -d uniprot_sprot.dat.gz
$ python
>>> from Bio import SwissProt
>>> dir(SwissProt)
>>> _myfileh = open("uniprot_sprot.dat")
>>> for _record in SwissProt.parse(_myfileh):
...     if 'kinase' in _record.description:
...         print(_record.entry_name)
...
^C
>>> help(_record)
Help on Record in module Bio.SwissProt object:
class Record(__builtin__.object)
| Holds information from a SwissProt record.
|
| Members:
|
| - entry_name Name of this entry, e.g. RL1_ECOLI.
| - data_class Either 'STANDARD' or 'PRELIMINARY'.
| - molecule_type Type of molecule, 'PRT',
| - sequence_length Number of residues.
|
| - accessions List of the accession numbers, e.g. ['P00321']
| - created A tuple of (date, release).
| - sequence_update A tuple of (date, release).
| - annotation_update A tuple of (date, release).
|
| - description Free-format description.
| - gene_name Gene name. See userman.txt for description.
| - organism The source of the sequence.
| - organelle The origin of the sequence.
| - organism_classification The taxonomy classification. List of strings.
| (http://www.ncbi.nlm.nih.gov/Taxonomy/)
| - taxonomy_id A list of NCBI taxonomy id's.
| - host_organism A list of names of the hosts of a virus, if any.
| - host_taxonomy_id A list of NCBI taxonomy id's of the hosts, if any.
| - references List of Reference objects.
| - comments List of strings.
| - cross_references List of tuples (db, id1[, id2][, id3]). See the docs.
| - keywords List of the keywords.
| - features List of tuples (key name, from, to, description).
| from and to can be either integers for the residue
| numbers, '<', '>', or '?'
|
| - seqinfo tuple of (length, molecular weight, CRC32 value)
| - sequence The sequence.
|
| Methods defined here:
|
| __init__(self)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
Sorry, you will probably have to wait for somebody more familiar with the SwissProt parser, unless Google turns up a standalone utility that extracts entries from the .dat files by accession.
I would look at the cdbfasta/cdbyank indexing utilities, which can index all kinds of input formats, so you could then efficiently export the required entries by their accession.
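To illustrate the indexing idea only (this is not how cdbfasta/cdbyank work internally, and it does not use their command-line interface), here is a small sketch that records the byte offset of each entry per accession and then seeks straight to the one you want. It assumes Python 3, a handle opened in binary mode, "AC   " lines for accessions and "//" as the entry terminator; the function names and demo data are hypothetical.

```python
import io


def build_offset_index(handle):
    """Map each accession to the byte offset at which its entry starts."""
    index = {}
    entry_start = handle.tell()
    pending = []  # accessions seen in the entry currently being read
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"AC   "):
            pending.extend(
                a.strip().decode("ascii") for a in line[5:].split(b";") if a.strip()
            )
        elif line.rstrip() == b"//":
            for accession in pending:
                index[accession] = entry_start
            pending = []
            entry_start = handle.tell()
    return index


def fetch_entry(handle, offset):
    """Return the raw text of the entry that starts at the given byte offset."""
    handle.seek(offset)
    lines = []
    while True:
        line = handle.readline()
        if not line:
            break
        lines.append(line)
        if line.rstrip() == b"//":
            break
    return b"".join(lines).decode("utf-8")


# Made-up, heavily shortened entries, only to show the mechanics.
demo = (
    b"ID   TEST1_HUMAN   Reviewed;   10 AA.\n"
    b"AC   P00001; Q00001;\n"
    b"//\n"
    b"ID   TEST2_HUMAN   Reviewed;   12 AA.\n"
    b"AC   P00002;\n"
    b"//\n"
)
handle = io.BytesIO(demo)
index = build_offset_index(handle)
entry = fetch_entry(handle, index["P00002"])
```

For the real file the handle would be open("uniprot_sprot.dat", "rb"), and the index could be saved to disk so the file is scanned only once.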
Martin