[Bioperl-l] Reply to Hilmar Lapp's solution -- parsing SwissProt Records

Jason Stajich jason at cgt.duhs.duke.edu
Wed Oct 27 18:13:45 EDT 2004


I didn't try to answer that; I'm not sure about the GO stuff.  You might 
want to walk through the swiss.pm SeqIO code to check whether the 
regular expressions are parsing the DR lines for GO terms correctly.  
Otherwise, perhaps something is getting overwritten.
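
A minimal sketch of that check, using only the DR-line pattern and
comment clean-up quoted further down in this thread, so it runs without
BioPerl installed. The parse_dr_line() helper is hypothetical (it is
not part of BioPerl), and the two GO sample lines are made-up examples
of the DR layout, so substitute lines from a record that actually fails:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# DR-line pattern and comment clean-up copied from the Bio::SeqIO::swiss
# excerpt quoted in this thread; parse_dr_line() itself is a hypothetical
# helper written for this check, not part of BioPerl.
my $dr_re = qr/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/;

sub parse_dr_line {
    my ($line) = @_;
    return unless $line =~ $dr_re;
    my ($db, $primary, $optional, $comment) = ($1, $2, $3, $4);
    # Strip the leading space and trailing dot, as the quoted code does.
    $comment = $1 if length($comment) > 0 && $comment =~ /^\s*(\S+)\./;
    return {
        database    => $db,
        primary_id  => $primary,
        optional_id => $optional,
        comment     => $comment,
    };
}

# Sample lines: the EMBL line is from this thread; the GO lines are
# illustrative assumptions, not taken from the poster's file.
my @lines = (
    'DR   EMBL; X57346; CAA40621.1; -.',
    'DR   GO; GO:0005634; C:nucleus; TAS.',
    'DR   GO; GO:0003677; F:DNA binding; TAS.',
);

for my $line (@lines) {
    my $hit = parse_dr_line($line);
    if ($hit) {
        # One tab-separated row per successfully parsed DR line
        print join("\t", map { $hit->{$_} }
                   qw(database primary_id optional_id comment)), "\n";
    } else {
        print "NO MATCH: $line\n";
    }
}
```

If every GO line from a failing record parses here, the pattern is
probably not the problem, and the next place to look would be how the
resulting DBLink objects are stored and retrieved.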

-jason

On Oct 27, 2004, at 4:42 PM, Anand Venkatraman wrote:

> Hi,
>
> Thanks a lot.
>
> I tried the code with the $dblink->optional_id. It
> works.
>
> But what's puzzling is that the code behaves
> oddly when it comes to extracting the GO
> cross-references.
>
> As stated in my earlier mail, it extracts them in
> some cases and fails in others. And when it does
> extract a GO cross-reference, it does so only
> for the first occurrence in the record.
>
> There are no warnings or errors.
>
> I could send a small SwissProt file as an attachment
> if you would like to have a look.
>
> Thanks once again.
>
> Anand
>
> --- Jason Stajich <jason at cgt.duhs.duke.edu> wrote:
>
>> The protein ID is stored in $dblink->optional_id
>>
>> This is the code which does the parsing work in Bio::SeqIO::swiss to
>> make a DBlink Xref.
>>
>>         elsif (/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/) {
>>             my $dblinkobj = Bio::Annotation::DBLink->new();
>>             $dblinkobj->database($1);
>>             $dblinkobj->primary_id($2);
>>             $dblinkobj->optional_id($3);
>>             my $comment = $4;
>>             if(length($comment) > 0) {
>>                 # edit comment to get rid of leading space and trailing dot
>>                 if( $comment =~ /^\s*(\S+)\./ ) {
>>                     $dblinkobj->comment($1);
>>                 } else {
>>                     $dblinkobj->comment($comment);
>>                 }
>>             }
>>             $annotation->add_Annotation('dblink',$dblinkobj);
>>         }
>>
>> -jason
>> On Oct 27, 2004, at 11:47 AM, Anand Venkatraman wrote:
>>
>>> Hi,
>>>
>>> Thanks a lot for the response.
>>>
>>> Some clarifications from my side:
>>>
>>> [1] Yes, by the EMBL tag, I actually meant the DbXref to EMBL for the
>>> specific SwissProt accession number. Sorry for the confusion. Let's
>>> say we have this line from a SwissProt record:
>>>
>>> DR   EMBL; X57346; CAA40621.1; -.
>>>
>>> By the method outlined in my code, I am able to pull out only the EMBL
>>> nucleotide accession number (X57346), but I am unable to get to the
>>> protein accession number (CAA40621.1).
>>>
>>> [2] Problems with GO cross-references:
>>>
>>> I can send you a small portion of the SwissProt file. Do you want me
>>> to send it as an attachment or within the text of the message? Can we
>>> send file attachments to the mailing list?
>>>
>>>
>>> Thanks a lot.
>>>
>>> Anand
>>>
>>> Hilmar Lapp <hlapp at gmx.net> wrote:
>>>
>>> On Tuesday, October 26, 2004, at 09:44 PM, Anand Venkatraman wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using Bioperl to parse SwissProt Records.
>>>>
>>>> The bioperl version is 1.4.
>>>>
>>>> I am having 2 problems :
>>>>
>>>> Problem 1: I am unable to get all the accession
>>>> numbers from the line starting with AC on the
>>>> SwissProt Record.
>>>
>>> Accessions other than the first are available via
>>> $seq->get_secondary_accessions().
>>>
>>>>
>>>> Problem 2: I am also trying to get the associated
>>>> EMBL and GO cross-references for a given SwissProt
>>>> entry. The problem I am having is that
>>>> [a]: I am only getting the nucleotide ID and not the
>>>> protein ID from the EMBL tag and
>>>
>>> What do you mean by EMBL tag? Dbxrefs to EMBL?
>>>
>>>> [b]: In some cases, I am unable to get the GO IDs.
>>>
>>> This should not happen. Can you send the accession numbers for those
>>> sequences, or better yet, the swissprot-formatted file with those (or a
>>> selection thereof) that fail?
>>>
>>> -hilmar
>>>
>>>
>>>> For
>>>> example, with the code below, I am only getting the GO
>>>> ID for some records and missing it for others. Also, if
>>>> a particular record has 3 or 4 GO lines, the code
>>>> captures only the first occurrence of the GO ID (if
>>>> and when it captures one at all).
>>>>
>>>>
>>>>
>>>> This is the code
>>>>
>>>> -------------------------------------------------------
>>>> #!/usr/bin/perl -w
>>>> use strict;
>>>> use Bio::SeqIO;
>>>>
>>>> # Path to the SwissProt flat file, taken from the command line
>>>> my $sp_file = shift @ARGV or die "Usage: $0 <swissprot_file>\n";
>>>> my $seqio_object = Bio::SeqIO->new(-file   => $sp_file,
>>>>                                    -format => "swiss");
>>>>
>>>> while (my $seq_object = $seqio_object->next_seq) {
>>>>     # Only report human entries
>>>>     if ($seq_object->species->binomial =~ m/Homo sapiens/) {
>>>>         print "Accession: ", $seq_object->accession_number(), "\t";
>>>>         my $annotation = $seq_object->annotation();
>>>>
>>>>         # Print the primary ID of every EMBL and GO cross-reference
>>>>         foreach my $dblink ( $annotation->get_all_Annotations('dblink') ) {
>>>>             if ( ( $dblink->database eq "EMBL" ) || ( $dblink->database eq "GO" ) ) {
>>>>                 print "\t", $dblink->database, ":", $dblink->primary_id, "\t";
>>>>             }
>>>>         }
>>>>     }
>>>>     print "\n";
>>>> }
>>>> -------------------------------------------------------
>>>>
>>>> Any suggestions,
>>>>
>>>> Thanks in advance for the help.
>>>>
>>>> Anand
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at portal.open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>> -- 
>>> -------------------------------------------------------------
>>> Hilmar Lapp email: lapp at gnf.org
>>> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
>>> -------------------------------------------------------------
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
> === message truncated ===
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


