[Biopython] Regarding blast record report

Peter Cock p.j.a.cock at googlemail.com
Thu Nov 8 23:08:57 UTC 2018


The NCBI does document its naming scheme, although there is so much
documentation that it isn't always all up to date:

https://www.ncbi.nlm.nih.gov/genbank/sequenceids/

GI stands for GenInfo Identifier, but is being deprecated so don't use
it. Use the versioned accession if possible (the gb or ref entry in
your example):

https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/

The text after the identifier (after the first space) does seem to
vary considerably depending on the BLAST database and where the
sequence originally came from. Putting the species name in square
brackets does seem to be a consistent NCBI pattern though.
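
As a rough illustration of how such a title could be split up, here is a
minimal sketch in plain Python. The function names (parse_entry,
parse_title) and the "accession" key are made up for this example and are
not part of Biopython. It assumes merged entries are separated by " >",
that the identifier is the pipe-delimited text before the first space, and
that the species, when present, is the final square-bracketed item. Titles
from other databases may well break these assumptions.

```python
import re

# Trailing "[Genus species]" block, if any (an observed pattern, not a guarantee).
SPECIES_RE = re.compile(r"\s*\[([^\[\]]+)\]\s*$")


def parse_entry(entry):
    """Split one definition line into identifiers, description and species."""
    identifier, _, text = entry.strip().partition(" ")
    tokens = identifier.split("|")
    if len(tokens) == 1:
        # No pipes: assume a bare accession (hypothetical "accession" key).
        ids = {"accession": tokens[0]}
    else:
        # Pair up database tags and values: gi|123|gb|ABC.1| -> gi=123, gb=ABC.1
        # Any unpaired trailing token (e.g. a Swiss-Prot entry name) is dropped.
        ids = {db: value for db, value in zip(tokens[0::2], tokens[1::2]) if value}
    match = SPECIES_RE.search(text)
    species = match.group(1) if match else None
    description = text[:match.start()] if match else text
    return {"ids": ids, "description": description.strip(), "species": species}


def parse_title(title):
    """Parse a BLAST hit title that may hold several merged entries."""
    return [parse_entry(part) for part in title.split(" >") if part.strip()]


if __name__ == "__main__":
    example = (
        "gi|1335041855|gb|PNW76469.1| hypothetical protein "
        "CHLRE_11g467616v5 [Chlamydomonas reinhardtii]"
    )
    for hit in parse_title(example):
        print(hit)
```

The versioned accession is then whichever non-gi value is present (gb, ref
or sp in your examples), so the code has to cope with any of them.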

If you want more metadata about the sequence, rather than parsing this
one line of text, it would be better to use the versioned accession to
look this up directly, for example via NCBI Entrez.
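
A minimal sketch of such a lookup with Bio.Entrez follows. It assumes
Biopython is installed, network access to NCBI is available, and the hit
is a protein (hence db="protein"). The helper names fetch_protein_record
and summarise are invented for this example, and which annotations are
actually present depends on the record.

```python
def fetch_protein_record(accession, email):
    """Fetch one GenBank-format protein record by versioned accession."""
    # Imported here so the helpers below can be loaded without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address with each request
    handle = Entrez.efetch(
        db="protein", id=accession, rettype="gb", retmode="text"
    )
    try:
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()


def summarise(record):
    """Collect a few commonly wanted fields from a SeqRecord-like object."""
    return {
        "id": record.id,
        "description": record.description,
        # Missing annotations come back as None rather than raising.
        "organism": record.annotations.get("organism"),
        "db_xrefs": list(getattr(record, "dbxrefs", [])),
    }


if __name__ == "__main__":
    # Accession taken from the example titles; replace the email with your own.
    record = fetch_protein_record("PNW76469.1", "you@example.org")
    print(summarise(record))
```

Whether a UniProt cross-reference shows up in the record depends on the
entry, so treat the db_xrefs list as something to inspect rather than rely on.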

Peter

On Thu, Nov 8, 2018 at 5:33 PM Ahmad Khalifa <underoath006 at gmail.com> wrote:
>
> Hello,
>
> I want to extract certain information from the biopython blast output.
>
> In the header I often get variable amounts of information in the title, for example:
>
> gi|1335041855|gb|PNW76469.1| hypothetical protein CHLRE_11g467616v5 [Chlamydomonas reinhardtii]
>
> gi|159481404|ref|XP_001698769.1| predicted protein [Chlamydomonas reinhardtii] >gi|745998015|sp|A8JA42.1|IFT56_CHLRE RecName: Full=Intraflagellar transport protein 56; AltName: Full=Abnormal dye filling protein 13; AltName: Full=Tetratricopeptide repeat protein 26 homolog; Short=TPR repeat protein 26 homolog
>
> gi|1335043717|gb|PNW78329.1| hypothetical protein CHLRE_09g401700v5 [Chlamydomonas reinhardtii]
>
>
> I wonder what exactly is contained in this output, what's gi and gb? How come sometimes I have a refseq or a uniprot accession code but not always (the same information is not consistently present, very difficult to mine). Is it possible to retrieve a uniprot accession code for my hits or a gene name that I can map to an accession code using uniprots API?
>
> What I really want is to mine the title to get every piece of information separately (if it exists of course), are there parsers that do that?
>
> Best regards.
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython


