[Bioperl-l] parsing GenBank file
Chris Fields
cjfields at illinois.edu
Wed May 5 16:01:55 UTC 2010
Shalabh,
This file has several problems that make it not quite GenBank-like. It does parse (the sequence data comes through), but the parser doesn't pick up SOURCE/ORGANISM because of the non-canonical way the classification is displayed:
SOURCE Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
ORGANISM Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
"Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
It's different enough from the NCBI version (from here: http://www.ncbi.nlm.nih.gov/nuccore/212548595) that it's probably breaking the parser:
SOURCE Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
ORGANISM Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Candidatus
Azobacteroides.
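To make the difference concrete, here is a rough sketch in plain Perl (no BioPerl calls) that reduces the RDP-style lineage to bare taxon names. The helper name normalize_rdp_lineage is made up for illustration. Whether the leading "Root" and the embedded quotes are what actually trips the parser is an assumption, not something I have confirmed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an RDP-style lineage block into a list of
# plain taxon names (no "Root", no double quotes, no trailing period).
sub normalize_rdp_lineage {
    my ($text) = @_;
    $text =~ s/\s+/ /g;          # join wrapped lines into one
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    $text =~ s/\.$//;            # drop the terminating period
    my @taxa = map {
        my $t = $_;
        $t =~ s/"//g;            # strip RDP's quoting of rank names
        $t =~ s/^\s+|\s+$//g;
        $t;
    } split /;/, $text;
    shift @taxa if @taxa && $taxa[0] eq 'Root';   # RDP-only pseudo-rank
    return grep { length } @taxa;
}

my $rdp = <<'END';
Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
"Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
END

print join('; ', normalize_rdp_lineage($rdp)), "\n";
```

Even after this cleanup the RDP lineage does not match the NCBI one, because the two sources classify the organism differently below the order level.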
Please file this as a bug and we can take a look at it. The format is a bit non-standard, so I can't promise it'll be fixed unless that turns out to be fairly easy to do.
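In the meantime, a minimal sketch of a workaround, assuming the script dies because $seq->species returns undef for these records (which is what the "undefined value" error suggests). The helper name safe_classification is hypothetical, and the input file is assumed to be given as the first command-line argument. Records without a parsed lineage are reported and skipped instead of killing the run.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lineage as a list, or an empty list when
# the parser could not build a species object for this record.
sub safe_classification {
    my ($seq) = @_;
    my $species = $seq->species;
    return defined $species ? $species->classification : ();
}

# Only touch BioPerl when a file name is actually supplied.
if (@ARGV) {
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my @lineage = safe_classification($seq);
        if (@lineage) {
            print join("\t", $seq->display_id, @lineage), "\n";
        }
        else {
            warn "No ORGANISM lineage parsed for ", $seq->display_id,
                 "; skipping\n";
        }
    }
}
```

The warnings give you a list of the affected RDP ids, which would also be useful to attach to the bug report.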
chris
On May 5, 2010, at 10:38 AM, shalabh sharma wrote:
> Hi Chris,
> I downloaded this file from RDP; it contains all bacterial 16S sequences.
>
> Thanks
> Shalabh
>
>
> On Wed, May 5, 2010 at 11:32 AM, Chris Fields <cjfields at illinois.edu> wrote:
>
>> Shalabh,
>>
>> What is the source of this file? It's not from GenBank; if I look up the
>> parent sequence using Bio::DB::GenBank it works fine:
>>
>> use Modern::Perl;
>> use Bio::DB::GenBank;
>>
>> my $id = 'AP010656';
>>
>> my $gb = Bio::DB::GenBank->new();
>>
>> my $seq = $gb->get_Seq_by_acc($id);
>>
>> say join(',',$seq->species->classification);
>>
>> chris
>>
>> On May 5, 2010, at 9:46 AM, shalabh sharma wrote:
>>
>>> Hi Torsten,
>>> Thanks for pointing that out. But this is just a warning;
>>> it will not break the script. I found the point where the script is
>>> breaking. It dies with this message:
>>> Can't call method "classification" on an undefined value at parseGB.pl line 9, <GEN0> line 10067733.
>>>
>>> So the script breaks when it reaches this record:
>>>
>>> LOCUS S001198291 1521 bp rRNA linear BCT 02-Feb-2009
>>> DEFINITION Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
>>> ACCESSION AP010656 REGION: 61786..63306
>>> PROJECT GenomeProject:29025
>>> SOURCE Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>>> ORGANISM Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>>> Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
>>> "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
>>> REFERENCE 1 (bases 1 to 1521)
>>> AUTHORS Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
>>> TITLE ;
>>> JOURNAL Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
>>> Contact:Atsushi Toyoda National Institute of Genetics, Comparative
>>> Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
>>> REFERENCE 2
>>> AUTHORS Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
>>> Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
>>>
>>> It is unable to parse this record, but I don't understand why.
>>> The only reason I can think of is the organism's name, which is very
>>> long compared to the others.
>>>
>>> Thanks
>>> Shalabh
>>>
>>>
>>>
>>> On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <
>>> torsten.seemann at infotech.monash.edu.au> wrote:
>>>
>>>>> I have a huge GenBank file (downloaded from RDP, containing all
>>>>> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the
>>>>> organism's lineage (in ORGANISM).
>>>>> I am getting the output like:
>>>>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
>>>>> Holophagales Holophagae "Acidobacteria" Bacteria Root
>>>>> This is exactly the output I want, but I am missing a lot of records
>>>>> (they are there in the GenBank file but not in my output).
>>>>> I also got a warning during parsing:
>>>>> --------------------- WARNING ---------------------
>>>>> MSG: Unbalanced quote in:
>>>>> /db_xref="taxon:35783" /germline"
>>>>> /mol_type="genomic DNA"
>>>>> /organism="Enterococcus sp."
>>>>> /strain="LMG12316"
>>>>> No further qualifiers will be added for this feature
>>>>> ---------------------------------------------------
>>>>> So I was wondering: is this warning causing the problem, or am I
>>>>> doing something wrong?
>>>>
>>>> "Unbalanced quote" means there is not an even number (multiple of 2)
>>>> of double-quote (") symbols around the tag's value. I can see that the
>>>> first line (below) looks problematic:
>>>>
>>>> YOU HAVE:
>>>>
>>>> /db_xref="taxon:35783" /germline"
>>>>
>>>> SHOULD BE:
>>>>
>>>> /db_xref="taxon:35783"
>>>> /germline
>>>>
>>>> I suspect there is a problem either with RDP's GenBank producer, or
>>>> Bioperl is having a problem with the "germline" qualifier, which is a
>>>> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
>>>> think in Bioperl this is handled by setting the value to "_no_value"?)
>>>>
>>>> http://www.ncbi.nlm.nih.gov/collab/FT/
>>>>
>>>> Qualifier /germline
>>>> Definition the sequence presented in the entry has not undergone somatic
>>>> rearrangement as part of an adaptive immune response; it is the
>>>> unrearranged sequence that was inherited from the parental germline
>>>> Value format none
>>>> Example /germline
>>>> Comment /germline should not be used to indicate that the source of
>>>> the sequence is a gamete or germ cell;
>>>> /germline and /rearranged cannot be used in the same source feature;
>>>> /germline and /rearranged should only be used for molecules that
>>>> can undergo somatic rearrangements as part of an adaptive immune
>>>> response; these are the T-cell receptor (TCR) and immunoglobulin
>>>> loci in the jawed vertebrates, and the unrelated variable
>>>> lymphocyte receptor (VLR) locus in the jawless fish (lampreys
>>>> and hagfish);
>>>> /germline and /rearranged should not be used outside of the
>>>> Craniata (taxid=89593)
>>>>
>>>>
>>>> --Torsten Seemann
>>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>>> University, AUSTRALIA
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>