[Bioperl-l] parsing GenBank file

Chris Fields cjfields at illinois.edu
Wed May 5 16:01:55 UTC 2010


Shalabh,

This file has several problems that make it unlike a real GenBank file.  It does parse (it has seq data), but the parser doesn't catch the SOURCE/ORGANISM block because of the non-canonical way the classification is displayed:

SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
          Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales"; 
          "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".

It's different enough from the NCBI version (from here: http://www.ncbi.nlm.nih.gov/nuccore/212548595) that it's probably breaking the parser:

SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
          Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Candidatus
          Azobacteroides.
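
Comparing the two ORGANISM blocks, the RDP lineage differs in a leading "Root", double quotes around most names, and different lower ranks.  I haven't confirmed which of these trips the parser.  As a rough illustration only (plain Perl, not a patch to BioPerl), the sketch below normalizes the RDP-style string into a list of names.  The helper name clean_lineage is made up, and dropping "Root" is an assumption made to match the NCBI layout:

```perl
use strict;
use warnings;

# Hypothetical helper: turn an RDP-style lineage string into a list of names.
sub clean_lineage {
    my ($text) = @_;
    $text =~ s/\s+/ /g;         # collapse line wraps and runs of spaces
    $text =~ s/^\s+|\s+$//g;    # trim both ends
    $text =~ s/\.$//;           # drop the terminating period
    my @names;
    for my $name (split /\s*;\s*/, $text) {
        $name =~ tr/"//d;       # strip the double quotes RDP adds
        next if $name eq '' or $name eq 'Root';   # assumption: omit "Root"
        push @names, $name;
    }
    return @names;
}

my $rdp = join ' ',
    'Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";',
    '"Porphyromonadaceae"; unclassified_"Porphyromonadaceae".';

print join('; ', clean_lineage($rdp)), "\n";
```

If you want "Root" kept in your output, remove that test from the loop.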

Please file this as a bug and we can take a look at it.  The format is a bit non-standard, so I can't promise it'll be fixed unless it's fairly easy to do.
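
In the meantime, you can avoid the fatal "Can't call method" error by checking that species actually returned an object before asking it for the classification.  This is a minimal sketch, assuming an ordinary Bio::SeqIO loop over your file.  The helper name safe_classification is hypothetical, and the file name is taken from the command line:

```perl
use strict;
use warnings;

# Hypothetical helper: return the classification list for a sequence object,
# or an empty list when no species object was built for the record.
sub safe_classification {
    my ($seq) = @_;
    my $species = $seq->species;
    return defined $species ? $species->classification : ();
}

# The loop only runs when a GenBank file name is given on the command line.
if (@ARGV) {
    require Bio::SeqIO;    # loaded lazily; the helper itself needs no BioPerl
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my @lineage = safe_classification($seq);
        if (@lineage) {
            print join("\t", $seq->display_id, @lineage), "\n";
        }
        else {
            # Report the record instead of dying, so the rest still get parsed
            warn "No species parsed for ", $seq->display_id, "\n";
        }
    }
}
```

Records like the one above would then show up as warnings rather than stopping the run, which also gives you a list of IDs to attach to the bug report.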

chris

On May 5, 2010, at 10:38 AM, shalabh sharma wrote:

> Hi Chris,
>            I downloaded this file from RDP; it contains all bacterial 16S sequences.
> 
> Thanks
> Shalabh
> 
> 
> On Wed, May 5, 2010 at 11:32 AM, Chris Fields <cjfields at illinois.edu> wrote:
> 
>> Shalabh,
>> 
>> What is the source of this file?  It's not from GenBank; if I look up the
>> parent sequence using Bio::DB::GenBank it works fine:
>> 
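>> use Modern::Perl;       # enables strict, warnings, and say
>> use Bio::DB::GenBank;
>> 
>> my $id = 'AP010656';
>> 
>> # Fetch the record from NCBI by accession
>> my $gb = Bio::DB::GenBank->new();
>> 
>> my $seq = $gb->get_Seq_by_acc($id);
>> 
>> # Print the lineage as a comma-separated list
>> say join(',',$seq->species->classification);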
>> 
>> chris
>> 
>> On May 5, 2010, at 9:46 AM, shalabh sharma wrote:
>> 
>>> Hi Torsten,
>>>             Thanks for pointing that out. But this is just a warning;
>>> it will not break the script. I found the point where the script is
>>> breaking. It breaks with this message:
>>> Can't call method "classification" on an undefined value at parseGB.pl line 9, <GEN0> line 10067733.
>>> 
>>> So the script breaks when it reaches this record:
>>> 
>>> LOCUS       S001198291              1521 bp    rRNA    linear   BCT 02-Feb-2009
>>> DEFINITION  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
>>> ACCESSION   AP010656 REGION: 61786..63306
>>> PROJECT     GenomeProject:29025
>>> SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>>> ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>>>        Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
>>>        "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
>>> REFERENCE   1 (bases 1 to 1521)
>>> AUTHORS   Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
>>> TITLE     ;
>>> JOURNAL   Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
>>>        Contact:Atsushi Toyoda National Institute of Genetics, Comparative
>>>        Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
>>> REFERENCE   2
>>> AUTHORS   Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
>>>        Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
>>> 
>>> It is unable to parse this record, but I don't understand why. The only
>>> reason I can think of is the organism's name, which is very long
>>> compared to the others.
>>> 
>>> Thanks
>>> Shalabh
>>> 
>>> 
>>> 
>>> On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <
>>> torsten.seemann at infotech.monash.edu.au> wrote:
>>> 
>>>>> I have a huge GenBank file (downloaded from RDP, containing all
>>>>> bacterial 16S sequences). I just want to parse the RDP ID (in LOCUS) and
>>>>> the organism's lineage (in ORGANISM).
>>>>> I am getting output like:
>>>>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
>>>>> Holophagales Holophagae "Acidobacteria" Bacteria Root
>>>>> This is the exact output I want, but I am missing a lot of records (they
>>>>> are there in the GenBank file but not in my output).
>>>>> I also got a warning during parsing:
>>>>> --------------------- WARNING ---------------------
>>>>> MSG: Unbalanced quote in:
>>>>> /db_xref="taxon:35783" /germline"
>>>>> /mol_type="genomic DNA"
>>>>> /organism="Enterococcus sp."
>>>>> /strain="LMG12316"No further qualifiers will be added for this feature
>>>>> ---------------------------------------------------
>>>>> So I was just wondering: is this warning message causing the problem, or
>>>>> am I doing something wrong?
>>>> 
>>>> "Unbalanced quote" means there is not an even number (multiple of 2) of
>>>> double-quote (") symbols around the tag's value. I can see that the
>>>> first line (below) looks problematic:
>>>> 
>>>> YOU HAVE:
>>>> 
>>>> /db_xref="taxon:35783" /germline"
>>>> 
>>>> SHOULD BE:
>>>> 
>>>> /db_xref="taxon:35783"
>>>> /germline
>>>> 
>>>> I suspect there is a problem either with RDP's GenBank producer, or
>>>> Bioperl is having a problem with the "germline" qualifier, which is a
>>>> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
>>>> think in Bioperl this is handled by setting the value to "_no_value"?)
>>>> 
>>>> http://www.ncbi.nlm.nih.gov/collab/FT/
>>>> 
>>>> Qualifier       /germline
>>>> Definition      the sequence presented in the entry has not undergone somatic
>>>>           rearrangement as part of an adaptive immune response; it is the
>>>>           unrearranged sequence that was inherited from the parental
>>>>           germline
>>>> Value format    none
>>>> Example         /germline
>>>> Comment         /germline should not be used to indicate that the source of
>>>>           the sequence is a gamete or germ cell;
>>>>           /germline and /rearranged cannot be used in the same source
>>>>           feature;
>>>>           /germline and /rearranged should only be used for molecules that
>>>>           can undergo somatic rearrangements as part of an adaptive immune
>>>>           response; these are the T-cell receptor (TCR) and immunoglobulin
>>>>           loci in the jawed vertebrates, and the unrelated variable
>>>>           lymphocyte receptor (VLR) locus in the jawless fish (lampreys
>>>>           and hagfish);
>>>>           /germline and /rearranged should not be used outside of the
>>>>           Craniata (taxid=89593)
>>>> 
>>>> 
>>>> --Torsten Seemann
>>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>>> University, AUSTRALIA
>>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> 




