[Biopython] SwissProt parser: get entire entry as string?

Fri Aug 19 17:12:22 UTC 2016

Hi Bastien,

Chevreux, Bastien wrote:
> Dear list,
>
> Is there a way to get the entire entry of a just parsed SwissProt entry as string?
>
> Motivation: I want to write a simple filter for UniProt/SwissProt .dat files like this
>
> fh=open(sys.stdin.fileno())
> for rec in SwissProt.parse(fh):
>     if ("something I want to check for matches"):
>         print(rec.entry_as_string)
>
> and not reformat each and every field for compatible output (which would take half a life-time to get correct and probably still be error-prone).

From a quick look at the runtime docs, the parser does not keep the raw input data in the record object. So I think you can only do the search and filtering in Biopython, output the matching accessions, and then filter the .dat file with some other tool using that accession list.

http://biopython.org/DIST/docs/api/Bio.SwissProt-module.html

$ wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
$ gzip -d uniprot_sprot.dat.gz
$ python
>>> from Bio import SwissProt
>>> dir(SwissProt)
>>> _myfileh = open("uniprot_sprot.dat")
>>> for _record in SwissProt.parse(_myfileh):
...     if 'kinase' in _record.description:
...         print(_record.entry_name)
...
^C
>>> help(_record)

Help on Record in module Bio.SwissProt object:

class Record(__builtin__.object)
  |  Holds information from a SwissProt record.
  |
  |  Members:
  |
  |      - entry_name        Name of this entry, e.g. RL1_ECOLI.
  |      - data_class        Either 'STANDARD' or 'PRELIMINARY'.
  |      - molecule_type     Type of molecule, 'PRT',
  |      - sequence_length   Number of residues.
  |
  |      - accessions        List of the accession numbers, e.g. ['P00321']
  |      - created           A tuple of (date, release).
  |      - sequence_update   A tuple of (date, release).
  |      - annotation_update A tuple of (date, release).
  |
  |      - description       Free-format description.
  |      - gene_name         Gene name.  See userman.txt for description.
  |      - organism          The source of the sequence.
  |      - organelle         The origin of the sequence.
  |      - organism_classification  The taxonomy classification.  List of strings.
  |        (http://www.ncbi.nlm.nih.gov/Taxonomy/)
  |      - taxonomy_id       A list of NCBI taxonomy id's.
  |      - host_organism     A list of names of the hosts of a virus, if any.
  |      - host_taxonomy_id  A list of NCBI taxonomy id's of the hosts, if any.
  |      - references        List of Reference objects.
  |      - comments          List of strings.
  |      - cross_references  List of tuples (db, id1[, id2][, id3]).  See the docs.
  |      - keywords          List of the keywords.
  |      - features          List of tuples (key name, from, to, description).
  |        from and to can be either integers for the residue
  |        numbers, '<', '>', or '?'
  |
  |      - seqinfo           tuple of (length, molecular weight, CRC32 value)
  |      - sequence          The sequence.
  |
  |  Methods defined here:
  |
  |  __init__(self)
  |
  |  ----------------------------------------------------------------------
  |  Data descriptors defined here:
  |
  |  __dict__
  |      dictionary for instance variables (if defined)
  |
  |  __weakref__
  |      list of weak references to the object (if defined)

Sorry, you will probably have to wait for somebody more familiar with the SwissProt parser, unless you can find via Google a standalone utility that extracts entries from the .dat files by accession.
I would look at the cdbfasta/cdbyank indexing utilities, which can index all kinds of input formats, so you could then efficiently export the required entries by their accession.
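
If no ready-made tool turns up, the second step (pulling the raw entries out of the .dat file by accession) could probably also be done in a few lines of plain Python. What follows is only a minimal sketch: it assumes every entry ends with a line containing just "//" and that accessions sit on "AC" lines separated by semicolons. The function names and the two sample entries are made up for illustration and are not part of Biopython.

```python
import io


def iter_raw_entries(handle):
    """Yield each flat-file entry as one string, including its '//' line."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":  # assumed end-of-entry marker
            yield "".join(lines)
            lines = []


def entry_accessions(entry):
    """Return the accessions found on the 'AC' lines of one raw entry."""
    accessions = []
    for line in entry.splitlines():
        if line.startswith("AC   "):
            accessions.extend(a.strip() for a in line[5:].split(";") if a.strip())
    return accessions


def filter_entries(handle, wanted):
    """Yield the unmodified text of every entry with a wanted accession."""
    wanted = set(wanted)
    for entry in iter_raw_entries(handle):
        if wanted.intersection(entry_accessions(entry)):
            yield entry


# Made-up sample data, only to show the call; a real run would pass
# open("uniprot_sprot.dat") and the accession list from the first step.
sample = (
    "ID   TEST1_HUMAN   Reviewed;   10 AA.\n"
    "AC   P00001; Q00001;\n"
    "DE   RecName: Full=Made-up kinase;\n"
    "//\n"
    "ID   TEST2_HUMAN   Reviewed;   12 AA.\n"
    "AC   P00002;\n"
    "DE   RecName: Full=Made-up protein;\n"
    "//\n"
)
for raw in filter_entries(io.StringIO(sample), ["Q00001"]):
    print(raw, end="")
```

Since the entries are written back exactly as they were read, no field needs to be reformatted. The file is read sequentially without an index, so for repeated lookups an indexing tool would likely be faster.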

Martin