[Bioperl-l] getting pubmed id from genbank files
Barry Moore
bmoore at genetics.utah.edu
Tue Jul 26 16:31:03 EDT 2005
Then would it be safe to assume that in the case of multi-line JOURNAL
entries, all lines following the initial tagged JOURNAL line would be
untagged? If so, the regex could probably be made a bit safer.
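If that assumption holds, a stricter loop might look something like the
sketch below. Note that collect_journal is a made-up helper working on an
array of lines, not the actual code in genbank.pm, and it assumes that
continuation lines start with exactly 12 spaces followed by text.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::genbank: collects a JOURNAL
# entry from an array of already-read lines, starting at index $i.
# Assumption: continuation lines are untagged and indented exactly 12 columns.
sub collect_journal {
    my ($lines, $i) = @_;
    my @loc;
    if ($lines->[$i] =~ /^\s{2}JOURNAL\s+(.*)/) {
        push @loc, $1;
        # 12 spaces followed by a non-space: an untagged continuation line
        while (++$i < @$lines && $lines->[$i] =~ /^ {12}(\S.*)/) {
            push @loc, $1;
        }
    }
    return join(' ', @loc);
}

# The JOURNAL and PUBMED values are the ones from Nathan's record.
my @record = (
    '  JOURNAL   Genetics 166 (3), 1419-1436 (2004)',
    '   PUBMED   15082560',
);
print collect_journal(\@record, 0), "\n";
```

With this pattern the PUBMED line (indented 3) ends the loop instead of being
appended to the location string.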
Barry
-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gnf.org]
Sent: Tuesday, July 26, 2005 2:09 PM
To: Barry Moore
Cc: bioperl-l; n.haigh at sheffield.ac.uk
Subject: Re: [Bioperl-l] getting pubmed id from genbank files
There are indeed JOURNAL entries spanning multiple lines; the parser
was once unable to deal with this and was subsequently fixed ... as we
see, this introduced other problems ...
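
To make the side effect concrete, here is a small sketch that runs the three
patterns discussed in this thread against a few line shapes. The helper name
absorbed_by is made up for illustration, the '    TAG' line is a hypothetical
tag (not a real GenBank keyword), and the MEDLINE and continuation values are
placeholders.

```perl
use strict;
use warnings;

# Returns 1 if a line following JOURNAL would be appended to the journal
# text under the given pattern, 0 otherwise. Hypothetical helper.
sub absorbed_by {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

# Patterns from the thread: the earlier one, the current one, and the
# stricter one proposed by Barry.
my @patterns = (
    [ 'earlier \s{3,}' => qr/^\s{3,}(.*)/ ],
    [ 'current \s{4,}' => qr/^\s{4,}(.*)/ ],
    [ 'strict  \s{12}' => qr/^\s{12}(.*)/ ],
);

# PUBMED is indented 3 and MEDLINE 2 in GenBank flat files; the last two
# entries are illustrative only.
my @lines = (
    [ 'PUBMED line'       => '   PUBMED   15082560' ],
    [ 'MEDLINE line'      => '  MEDLINE   (id)' ],
    [ 'hypothetical tag'  => '    TAG     (value)' ],
    [ 'continuation line' => (' ' x 12) . '(rest of journal text)' ],
);

for my $p (@patterns) {
    for my $l (@lines) {
        printf "%-16s %-18s %s\n", $p->[0], $l->[0],
            absorbed_by($p->[1], $l->[1]) ? 'absorbed' : 'left alone';
    }
}
```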
On Jul 26, 2005, at 1:07 PM, Barry Moore wrote:
> Nathan-
>
> That sounds like you are using bioperl 1.4? The error is in
> Bio/SeqIO/genbank.pm and was fixed by Jason in cvs version 1.102 of
> that file. However, the current code still looks a bit odd to me.
> Starting at line 1068 of the current cvs version (1.119) of genbank.pm
> we have:
>
> 1068     if (/^\s{2}JOURNAL\s+(.*)/o) {
> 1069         push(@loc, $1);
> 1070         while ( defined($_ = $self->_readline) ) {
> 1071             # we only match when there are at least 4 spaces
> 1072             # there is probably a better way to match this
> 1073             # as it assumes that the describing tag is short enough
> 1074             /^\s{4,}(.*)/o && do { push(@loc, $1);
> 1075                                    next;
> 1076                                  };
> 1077             last;
> 1078         }
> 1079         $ref->location(join(' ', @loc));
>
> This is all dealing with parsing the JOURNAL line, which is handled fine
> by lines 1068-69. The while loop at 1070 looks at successive lines to
> find something to add to the JOURNAL line. The regex at line 1074 used
> to read /^\s{3,}(.*)/o, which would not match if the next line after
> JOURNAL began with '  MEDLINE', but would match '   PUBMED' (Nathan's
> situation), causing that line to be added to the JOURNAL line. Is there
> ever a JOURNAL entry with more than one line? If so, shouldn't the
> following lines always be untagged and thus indented 12 spaces, making
> the regex /^\s{12}(.*)/o safer? The current situation would add any line
> to the JOURNAL line if its tag is shorter than 6 characters, and I don't
> think that's what we want.
>
> Barry
>
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Tuesday, July 26, 2005 11:05 AM
> To: n.haigh at sheffield.ac.uk
> Cc: 'bioperl-l'
> Subject: Re: [Bioperl-l] getting pubmed id from genbank files
>
>
> On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:
>
>> -- snip --
>> $VAR1 = bless( {
>>     'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
>>     'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED 15082560',
>>     'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
>>     'tagname' => 'reference'
>> }, 'Bio::Annotation::Reference' );
>> -- snip --
>
> This is odd. The PUBMED line should not be concatenated with the
> JOURNAL line. I wonder where this happens and why. Can you download the
> record from NCBI (using the web interface, format 'GenBank', 'Send all
> to file') and then parse it with Bio::SeqIO? If it works then the
> problem must be in the code that deals with the HTTP-response.
>
> -hilmar
>
>
>>
>> -----Original Message-----
>> From: Jason Stajich [mailto:jason.stajich at duke.edu]
>> Sent: 26 July 2005 15:28
>> To: Bioperl-l at portal.open-bio.org
>> Cc: Nathan Haigh
>> Subject: [Bioperl-l] getting pubmed id from genbank files
>>
>>
>>
>> Here is part of the synopsis in Bio::Seq:
>>
>> foreach my $ref ( $ann->get_Annotations('reference') ) {
>> print "Reference ",$ref->title,"\n";
>> }
>>
>> so do $ref->pubmed instead of $ref->title.
>>
>>
>> -jason
>>> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
>>>
>>>> I want to be able to supply a list of GIs, retrieve the GenBank
>>>> files, and parse out the PubMed IDs.
>>>>
>>>> I know I can do the first step of retrieving the GenBank files
>>>> directly, but how do I get the PubMed IDs? I've been playing around
>>>> with things and haven't yet found out if this can be done.
>>>>
>>>> Cheers,
>>>> Nathan
>>>>
>>>> ----------------------------------
>>>> Nathan Haigh
>>>> Bioinformatics PostDoctoral Research Associate
>>>>
>>>> Room B2 211
>>>> Department of Animal and Plant Sciences
>>>> University of Sheffield
>>>> Western Bank
>>>> Sheffield
>>>> S10 2TN
>>>>
>>>> Tel: +44 (0)114 22 20112
>>>> Mob: +44 (0)7742 533 569
>>>> Fax: +44 (0)114 22 20002
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at portal.open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>> --
>>> Jason Stajich
>>> http://www.duke.edu/~jes12
>>> jason.stajich -at- duke.edu
>>>
>>>
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
>