[Bioperl-l] getting pubmed id from genbank files

Tue Jul 26 16:31:03 EDT 2005

Then would it be safe to assume that in the case of multi-line JOURNAL
entries, all lines following the initial tagged JOURNAL line would be
untagged?  If so, the regex could probably be made a bit safer.

Barry
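
A minimal sketch of the stricter match discussed in this thread, assuming
that JOURNAL continuation lines are untagged and indented by exactly 12
spaces. The helper name journal_location is hypothetical and is not part
of Bio::SeqIO::genbank; the sample record reuses the reference quoted
further down, with its column layout assumed. The pattern is slightly
stricter than /^\s{12}(.*)/ in that it also requires a non-space
character right after the indent.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): build the JOURNAL text from the
# lines of one REFERENCE block. Assumes continuation lines are untagged
# and indented by exactly 12 spaces.
sub journal_location {
    my @lines = @_;
    my @loc;
    while ( defined( my $line = shift @lines ) ) {
        next unless $line =~ /^\s{2}JOURNAL\s+(.*)/;
        push @loc, $1;
        # Accept only untagged continuation lines: 12 spaces, then text.
        # Tagged lines such as '   PUBMED' end the JOURNAL entry.
        while ( @lines && $lines[0] =~ /^ {12}(\S.*)/ ) {
            push @loc, $1;
            shift @lines;
        }
        last;
    }
    return join ' ', @loc;
}

my @record = (
    '  JOURNAL   Genetics 166 (3), 1419-1436 (2004)',
    '   PUBMED   15082560',
);
print journal_location(@record), "\n";
# prints: Genetics 166 (3), 1419-1436 (2004)
```

With this pattern the PUBMED line is left for its own handler instead of
being appended to the location string.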

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gnf.org] 
Sent: Tuesday, July 26, 2005 2:09 PM
To: Barry Moore
Cc: bioperl-l; n.haigh at sheffield.ac.uk
Subject: Re: [Bioperl-l] getting pubmed id from genbank files

There are indeed JOURNAL entries spanning multiple lines; the parser 
was once unable to deal with this and was subsequently fixed ... as we 
see, the fix introduced other problems ...

On Jul 26, 2005, at 1:07 PM, Barry Moore wrote:

> Nathan-
>
> That sounds like you are using bioperl 1.4?  The error is in
> Bio/SeqIO/genbank.pm and was fixed by Jason in cvs version 1.102 of
> that file.  However, the current code still looks a bit odd to me.
> Starting at line 1068 of the current cvs version (1.119) of genbank.pm
> we have:
>
> 1068  if (/^\s{2}JOURNAL\s+(.*)/o) {
> 1069     push(@loc, $1);
> 1070     while ( defined($_ = $self->_readline) ) {
> 1071           # we only match when there are at least 4 spaces
> 1072           # there is probably a better way to match this
> 1073           # as it assumes that the describing tag is short enough
> 1074           /^\s{4,}(.*)/o && do { push(@loc, $1);
> 1075           next;
> 1076     };
> 1077     last;
> 1078  }
> 1079  $ref->location(join(' ', @loc));
>
> This is all dealing with parsing the JOURNAL line, which is handled fine
> by lines 1068-69.  The while loop at 1070 looks at successive lines to
> find something to add to the JOURNAL line.  The regex at line 1074 used
> to read /^\s{3,}(.*)/o which would not match if the next line after
> JOURNAL began with '  MEDLINE', but would match '   PUBMED' (Nathan's
> situation), causing that line to be added to the JOURNAL line.  Is there
> ever a JOURNAL entry with more than one line?  If so, shouldn't the
> following lines always be untagged and thus indented 12, making the
> regex /^\s{12}(.*)/o safer?  The current situation would add any line to
> the JOURNAL line if its tag is shorter than 6 characters, and I don't
> think that's what we want.
>
> Barry
>
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Tuesday, July 26, 2005 11:05 AM
> To: n.haigh at sheffield.ac.uk
> Cc: 'bioperl-l'
> Subject: Re: [Bioperl-l] getting pubmed id from genbank files
>
>
> On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:
>
>> -- snip --
>> $VAR1 = bless( {
>>        'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
>>        'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED 15082560',
>>        'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
>>        'tagname' => 'reference'
>>      }, 'Bio::Annotation::Reference' );
>> -- snip --
>
> This is odd. The PUBMED line should not be concatenated with the
> JOURNAL line. I wonder where this happens and why. Can you download the
> record from NCBI (using the web interface, format 'GenBank', 'Send all
> to file') and then parse it with Bio::SeqIO? If it works then the
> problem must be in the code that deals with the HTTP-response.
>
> 	-hilmar
>
>
>>
>> -----Original Message-----
>> From: Jason Stajich [mailto:jason.stajich at duke.edu]
>> Sent: 26 July 2005 15:28
>> To: Bioperl-l at portal.open-bio.org
>> Cc: Nathan Haigh
>> Subject: [Bioperl-l] getting pubmed id from genbank files
>>
>>
>>
>> Here is part of the synopsis in Bio::Seq:
>>
>>      foreach my $ref ( $ann->get_Annotations('reference') ) {
>>          print "Reference ",$ref->title,"\n";
>>      }
>>
>>   so do $ref->pubmed instead of $ref->title.
>>
>>
>> -jason
>>> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
>>>
>>>> I want to be able to supply a list of GIs, retrieve the genbank
>>>> files and parse out the pubmed ids.
>>>>
>>>> I know I can do the first steps of retrieving the genbank files
>>>> directly, but how do I get the pubmed ids? I've been playing around
>>>> with things and haven't yet found out if this can be done.
>>>>
>>>> Cheers,
>>>> Nathan
>>>>
>>>> ----------------------------------
>>>> Nathan Haigh
>>>> Bioinformatics PostDoctoral Research Associate
>>>>
>>>> Room B2 211
>>>> Department of Animal and Plant Sciences
>>>> University of Sheffield
>>>> Western Bank
>>>> Sheffield
>>>> S10 2TN
>>>>
>>>> Tel: +44 (0)114 22 20112
>>>> Mob: +44 (0)7742 533 569
>>>> Fax: +44 (0)114 22 20002
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at portal.open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>> --
>>> Jason Stajich
>>> http://www.duke.edu/~jes12
>>> jason.stajich -at- duke.edu
>>>
>>>
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------