[Bioperl-l] Obtaining GO cross-references from SwissProt Records --reply to Jason Stajich's solution

Anand Venkatraman bioperlanand at yahoo.com
Thu Oct 28 14:03:49 EDT 2004


Hi,

I looked at the swiss.pm SeqIO code, and the regular
expression parsing looks fine.
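
One way to check the pattern in isolation is a minimal pure-Perl sketch like
the one below. It applies the DR-line regular expression quoted from
Bio::SeqIO::swiss further down in this thread to two sample lines. The EMBL
line is the one from the earlier mail quoted below; the GO line is a made-up
example in the same layout, and parse_dr_line is a hypothetical helper name,
not part of BioPerl.

```perl
use strict;
use warnings;

# DR-line pattern as quoted from Bio::SeqIO::swiss later in this thread.
my $DR_RE = qr/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/;

# Hypothetical helper: returns a hash ref of the captured fields,
# or undef if the line does not match the pattern.
sub parse_dr_line {
    my ($line) = @_;
    return unless $line =~ $DR_RE;
    return { database => $1, primary_id => $2, optional_id => $3, rest => $4 };
}

my @lines = (
    'DR   EMBL; X57346; CAA40621.1; -.',     # line from this thread
    'DR   GO; GO:0005634; C:nucleus; IEA.',  # assumed example GO line
);

for my $line (@lines) {
    my $fields = parse_dr_line($line);
    if ($fields) {
        print join("\t", @{$fields}{qw(database primary_id optional_id)}), "\n";
    } else {
        print "NO MATCH: $line\n";
    }
}
```

If a GO line copied from a failing record prints NO MATCH here, that would
point at the pattern; if it matches, the entry is more likely lost somewhere
after parsing.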

I even changed the code and tried to print out all the
cross-reference lines.

The script finds and prints only some GO lines and
fails to find others. However, it works well (prints everything)
for all other cross-reference lines.

When it does find GO lines for a particular SwissProt
record, it prints only the first of those GO lines.

Here is the relevant section of the code

-----------------------------------------------------
my $seqio_object = Bio::SeqIO->new(-file   => $sp_file,
                                   -format => "swiss");

while (my $seq_object = $seqio_object->next_seq) {

    if ($seq_object->species->binomial =~ m/Homo sapiens/) {
        print "Accession: ", $seq_object->accession_number(), "\t";

        my $annotation = $seq_object->annotation();

        foreach my $dblink ($annotation->get_all_Annotations('dblink')) {
            print "\t", $dblink->database, ":", $dblink->primary_id, "\t";
        }
    }
}

-------------------------------------------------------
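
To see how many cross-references of each type actually reach the annotation
collection, the links can be tallied per database. The following is a minimal
sketch, not BioPerl code: group_dblinks and the MockLink class are
hypothetical names introduced here so that it runs without BioPerl installed,
and the IDs are example values. It only assumes objects with database() and
primary_id() methods, which is what the loop above already calls on each
dblink.

```perl
use strict;
use warnings;

# Hypothetical helper: group DBLink-like objects by database name.
# Returns a hash ref mapping database name to an array ref of primary IDs.
sub group_dblinks {
    my (@dblinks) = @_;
    my %by_db;
    push @{ $by_db{ $_->database } }, $_->primary_id for @dblinks;
    return \%by_db;
}

# Hypothetical stand-in for Bio::Annotation::DBLink, for illustration only.
package MockLink;
sub new        { my ($class, %args) = @_; return bless {%args}, $class }
sub database   { $_[0]{database} }
sub primary_id { $_[0]{primary_id} }

package main;

my @links = (
    MockLink->new(database => 'EMBL', primary_id => 'X57346'),
    MockLink->new(database => 'GO',   primary_id => 'GO:0005634'),  # example value
    MockLink->new(database => 'GO',   primary_id => 'GO:0003677'),  # example value
);

my $by_db = group_dblinks(@links);
for my $db (sort keys %$by_db) {
    my @ids = @{ $by_db->{$db} };
    printf "%s: %d (%s)\n", $db, scalar @ids, join(', ', @ids);
}
```

Inside the loop above, one would call
group_dblinks($annotation->get_all_Annotations('dblink')) and compare the GO
count with the number of DR GO lines in the record.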

Anand

--- Jason Stajich <jason at cgt.duhs.duke.edu> wrote:

> I didn't try to answer that - not sure about the GO stuff.  You might
> want to walk through the swiss.pm SeqIO code to figure out whether or
> not the regular expressions are parsing out the data correctly for
> DR lines for GO terms.  Or else something is getting overwritten,
> perhaps.
> 
> -jason
> 
> On Oct 27, 2004, at 4:42 PM, Anand Venkatraman wrote:
> 
> > Hi,
> >
> > Thanks a lot.
> >
> > I tried the code with the $dblink->optional_id. It works.
> >
> > But what's puzzling is that the code behaves weirdly when it comes
> > to extracting the GO cross-references.
> >
> > As stated in my earlier mail, it extracts them in some cases and
> > fails in others. And when it does extract a GO cross-reference, it
> > does so only for the first occurrence.
> >
> > There are no warnings or errors.
> >
> > I could send a small SwissProt file as an attachment if you would
> > like to have a look at it.
> >
> > Thanks once again.
> >
> > Anand
> >
> > --- Jason Stajich <jason at cgt.duhs.duke.edu> wrote:
> >
> >> The protein ID is stored in $dblink->optional_id
> >>
> >> This is the code which does the parsing work in Bio::SeqIO::swiss to
> >> make a DBLink Xref.
> >> elsif (/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/) {
> >>     my $dblinkobj = Bio::Annotation::DBLink->new();
> >>     $dblinkobj->database($1);
> >>     $dblinkobj->primary_id($2);
> >>     $dblinkobj->optional_id($3);
> >>     my $comment = $4;
> >>     if(length($comment) > 0) {
> >>         # edit comment to get rid of leading space and trailing dot
> >>         if( $comment =~ /^\s*(\S+)\./ ) {
> >>             $dblinkobj->comment($1);
> >>         } else {
> >>             $dblinkobj->comment($comment);
> >>         }
> >>     }
> >>     $annotation->add_Annotation('dblink',$dblinkobj);
> >> }
> >>
> >> -jason
> >> On Oct 27, 2004, at 11:47 AM, Anand Venkatraman wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks a lot for the response.
> >>>
> >>> Some clarifications from my side:
> >>>
> >>> [1] Yes, by the EMBL tag, I actually meant the DbXref to EMBL for the
> >>> specific SwissProt accession number. Sorry for the confusion. Let's
> >>> say we have this line from a SwissProt record:
> >>>
> >>> DR   EMBL; X57346; CAA40621.1; -.
> >>>
> >>> By the method outlined in my code, I am able to pull up only the EMBL
> >>> nucleotide accession number (X57346), but I am unable to get to the
> >>> protein accession number (CAA40621.1).
> >>>
> >>> [2] Problems with GO cross-references:
> >>>
> >>> I can send you a small portion of the SwissProt file. Do you want me
> >>> to send it as an attachment or within the text of the message? Can we
> >>> send file attachments to the mailing list?
> >>>
> >>>
> >>> Thanks a lot.
> >>>
> >>> Anand
> >>>
> >>> Hilmar Lapp <hlapp at gmx.net> wrote:
> >>>
> >>> On Tuesday, October 26, 2004, at 09:44 PM, Anand Venkatraman wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am using Bioperl to parse SwissProt Records.
> >>>>
> >>>> The bioperl version is 1.4.
> >>>>
> >>>> I am having 2 problems :
> >>>>
> >>>> Problem 1: I am unable to get all the accession
> >>>> numbers from the line starting with AC on the
> >>>> SwissProt Record.
> >>>
> >>> Other accessions than the first are available via
> >>> $seq->get_secondary_accessions().
> >>>
> >>>>
> >>>> Problem 2: I am also trying to get the associated
> >>>> EMBL and GO cross-references for a given SwissProt
> >>>> entry. The problem I am having is that
> >>>> [a]: I am only getting the Nucleotide Id and not the
> >>>> Protein Id from the EMBL tag and
> >>>
> >>> What do you mean by EMBL tag? Dbxrefs to EMBL?
> >>>
> >>>> [b]: In some cases, I am unable to get the GO ids.
> >>>
> >>> This should not happen. Can you send the accession numbers for those
> >>> sequences, or better yet, the swissprot-formatted file with those (or a
> >>> selection thereof) that fail?
> >>>
> >>> -hilmar
> >>>
> >>>
> >>>> For
> >>>> example, from the code below, I am only getting the GO
> >>>> id for some records, and missing it for some. Also, if
> >>>> a particular record has 3 or 4 lines of GO, the code
> >>>> just captures the 1st occurrence of the GO Id (if and
> >>>> when it does so).
> >>>>
> >>>>
> >>>>
> >>>> This is the code
> >>>>
> >>
> >
> 
