[Bioperl-l] getting pubmed id from genbank files
Barry Moore
bmoore at genetics.utah.edu
Tue Jul 26 16:07:16 EDT 2005
Nathan-
That sounds like you are using bioperl 1.4? The error is in
Bio/SeqIO/genbank.pm and was fixed by Jason in cvs version 1.102 of
that file. However the current code still looks a bit odd to me.
Starting at line 1068 of the current cvs version (1.119) of genbank.pm
we have:
1068    if (/^\s{2}JOURNAL\s+(.*)/o) {
1069        push(@loc, $1);
1070        while ( defined($_ = $self->_readline) ) {
1071            # we only match when there are at least 4 spaces
1072            # there is probably a better way to match this
1073            # as it assumes that the describing tag is short enough
1074            /^\s{4,}(.*)/o && do { push(@loc, $1);
1075                                   next;
1076                                 };
1077            last;
1078        }
1079        $ref->location(join(' ', @loc));
This is all dealing with parsing the JOURNAL line, which lines 1068-69
handle fine. The while loop at 1070 looks at successive lines to find
something to add to the JOURNAL line. The regex at line 1074 used to
read /^\s{3,}(.*)/o, which would not match if the next line after
JOURNAL began with '  MEDLINE' (two leading spaces), but would match
'   PUBMED' (three leading spaces; Nathan's situation), causing that
line to be added to the JOURNAL line. Is there ever a JOURNAL entry with
more than one line? If so, shouldn't the following lines always be
untagged and thus indented 12 spaces, making the regex /^\s{12}(.*)/o
safer? The current code would add any line to the JOURNAL line if its
tag is shorter than 6 characters, and I don't think that's what we want.
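
To make the difference between the three patterns concrete, here is a minimal standalone sketch (not the genbank.pm code itself). It assumes the usual GenBank layout: MEDLINE indented 2 spaces, PUBMED indented 3, untagged continuation lines indented 12. The helper name is_appended, the MEDLINE number, and the continuation text are made up for illustration; the PUBMED number is the one from Nathan's output.

```perl
use strict;
use warnings;

# Sample lines, built with explicit space counts so the indentation is
# unambiguous. The MEDLINE id and the continuation text are placeholders.
my %lines = (
    medline      => (' ' x 2)  . 'MEDLINE   00000000',
    pubmed       => (' ' x 3)  . 'PUBMED   15082560',
    continuation => (' ' x 12) . 'continued journal text',
);

# The old pattern, the pattern in cvs 1.119, and the stricter suggestion.
my @patterns = (
    [ 'old    \s{3,}' => qr/^\s{3,}(.*)/ ],
    [ 'cvs    \s{4,}' => qr/^\s{4,}(.*)/ ],
    [ 'strict \s{12}' => qr/^\s{12}(.*)/ ],
);

# Hypothetical helper: returns 1 if a line following JOURNAL would be
# appended to the journal text under the given pattern, 0 otherwise.
sub is_appended {
    my ($pattern, $line) = @_;
    return $line =~ $pattern ? 1 : 0;
}

for my $p (@patterns) {
    my ($label, $re) = @$p;
    for my $kind (qw(medline pubmed continuation)) {
        printf "%-14s %-13s %s\n", $label, $kind,
            is_appended($re, $lines{$kind}) ? 'appended' : 'not appended';
    }
}
```

Under these assumptions only the old \s{3,} pattern picks up the PUBMED line, and all three patterns still accept a 12-space continuation line.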
Barry
-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
Sent: Tuesday, July 26, 2005 11:05 AM
To: n.haigh at sheffield.ac.uk
Cc: 'bioperl-l'
Subject: Re: [Bioperl-l] getting pubmed id from genbank files
On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:
> -- snip --
> $VAR1 = bless( {
>     'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
>     'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED 15082560',
>     'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
>     'tagname' => 'reference'
> }, 'Bio::Annotation::Reference' );
> -- snip --
This is odd. The PUBMED line should not be concatenated with the
JOURNAL line. I wonder where this happens and why. Can you download the
record from NCBI (using the web interface, format 'GenBank', 'Send all
to file') and then parse it with Bio::SeqIO? If it works then the
problem must be in the code that deals with the HTTP-response.
-hilmar
>
> -----Original Message-----
> From: Jason Stajich [mailto:jason.stajich at duke.edu]
> Sent: 26 July 2005 15:28
> To: Bioperl-l at portal.open-bio.org
> Cc: Nathan Haigh
> Subject: [Bioperl-l] getting pubmed id from genbank files
>
>
>
> Here is part of the synopsis in Bio::Seq:
>
> foreach my $ref ( $ann->get_Annotations('reference') ) {
> print "Reference ",$ref->title,"\n";
> }
>
> so do $ref->pubmed instead of $ref->title.
>
>
> -jason
>> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
>>
>>> I want to be able to supply a list of GIs, retrieve the GenBank
>>> files, and parse out the PubMed IDs.
>>>
>>> I know I can do the first steps of retrieving the GenBank files
>>> directly, but how do I get the PubMed IDs? I've been playing around
>>> with things and haven't yet found out if this can be done.
>>>
>>> Cheers,
>>> Nathan
>>>
>>> ----------------------------------
>>> Nathan Haigh
>>> Bioinformatics PostDoctoral Research Associate
>>>
>>> Room B2 211
>>> Department of Animal and Plant Sciences
>>> University of Sheffield
>>> Western Bank
>>> Sheffield
>>> S10 2TN
>>>
>>> Tel: +44 (0)114 22 20112
>>> Mob: +44 (0)7742 533 569
>>> Fax: +44 (0)114 22 20002
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>
> --
> Jason Stajich
> http://www.duke.edu/~jes12
> jason.stajich -at- duke.edu
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------