[Bioperl-l] parsing GenBank file
Chris Fields
cjfields at illinois.edu
Wed May 5 15:32:41 UTC 2010
Shalabh,
What is the source of this file? It's not from GenBank; if I look up the parent sequence using Bio::DB::GenBank it works fine:
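use Modern::Perl;
use Bio::DB::GenBank;
# Fetch the parent record directly from GenBank by accession
my $id = 'AP010656';
my $gb = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_acc($id);
# Print the lineage, most specific rank first
say join(',', $seq->species->classification);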
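The error quoted below (Can't call method "classification" on an undefined value) means $seq->species returned undef for that record. A minimal sketch of a guard, assuming a hypothetical helper named lineage_or_empty (not part of BioPerl), so that one unparsable record does not stop a long run:

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the lineage of a
# sequence object, or an empty list when no species could be parsed.
sub lineage_or_empty {
    my ($seq) = @_;
    my $species = $seq->species;    # undef when no species object was built
    return defined $species ? $species->classification : ();
}
```

Inside a loop such as while (my $seq = $in->next_seq) over a Bio::SeqIO stream opened with -format => 'genbank', a record whose list comes back empty can be reported by its display_id and skipped with next, rather than ending the script.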
chris
On May 5, 2010, at 9:46 AM, shalabh sharma wrote:
> Hi Torsten,
> Thanks for pointing that out. But this is just a warning;
> it will not break the script. I found the point where the script is
> breaking.
> It's breaking and giving this message:
> Can't call method "classification" on an undefined value at parseGB.pl line
> 9, <GEN0> line 10067733.
>
> So the script breaks when it reaches this record:
>
> LOCUS S001198291 1521 bp rRNA linear BCT 02-Feb-2009
> DEFINITION Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
> ACCESSION AP010656 REGION: 61786..63306
> PROJECT GenomeProject:29025
> SOURCE Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> ORGANISM Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
> "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
> REFERENCE 1 (bases 1 to 1521)
> AUTHORS Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
> TITLE ;
> JOURNAL Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
> Contact:Atsushi Toyoda National Institute of Genetics, Comparative
> Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
> REFERENCE 2
> AUTHORS Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
> Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
>
> It is unable to parse this record, but I don't understand why.
> The only reason I can think of is the organism's name, which is very long
> compared to the others.
>
> Thanks
> Shalabh
>
>
>
> On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <
> torsten.seemann at infotech.monash.edu.au> wrote:
>
>>> I have a huge GenBank file (downloaded from RDP, containing all
>>> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the
>>> organism's lineage (in ORGANISM).
>>> I am getting output like:
>>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
>>> Holophagales Holophagae "Acidobacteria" Bacteria Root
>>> This is exactly the output I want, but I am missing a lot of records
>>> (they are there in the GenBank file but not in my output).
>>> I also got a warning during parsing:
>>> --------------------- WARNING ---------------------
>>> MSG: Unbalanced quote in:
>>> /db_xref="taxon:35783" /germline"
>>> /mol_type="genomic DNA"
>>> /organism="Enterococcus sp."
>>> /strain="LMG12316"No further qualifiers will be added for this feature
>>> ---------------------------------------------------
>>> So I was wondering: is this warning causing the problem, or am I
>>> doing something wrong?
>>
>> "Unbalanced quote" means there is not an even number (multiple of 2) of
>> double-quote (") symbols around the tag's value. I can see that the
>> first line (below) looks problematic:
>>
>> YOU HAVE:
>>
>> /db_xref="taxon:35783" /germline"
>>
>> SHOULD BE:
>>
>> /db_xref="taxon:35783"
>> /germline
>>
>> I suspect there is a problem either with RDP's GenBank producer, or
>> BioPerl is having a problem with the /germline qualifier, which is a
>> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
>> think in BioPerl this is handled by setting the value to "_no_value"?)
>>
>> http://www.ncbi.nlm.nih.gov/collab/FT/
>>
>> Qualifier       /germline
>> Definition      the sequence presented in the entry has not undergone
>>                 somatic rearrangement as part of an adaptive immune
>>                 response; it is the unrearranged sequence that was
>>                 inherited from the parental germline
>> Value format    none
>> Example         /germline
>> Comment         /germline should not be used to indicate that the source of
>>                 the sequence is a gamete or germ cell;
>>                 /germline and /rearranged cannot be used in the same source
>>                 feature;
>>                 /germline and /rearranged should only be used for molecules
>>                 that can undergo somatic rearrangements as part of an
>>                 adaptive immune response; these are the T-cell receptor
>>                 (TCR) and immunoglobulin loci in the jawed vertebrates, and
>>                 the unrelated variable lymphocyte receptor (VLR) locus in
>>                 the jawless fish (lampreys and hagfish);
>>                 /germline and /rearranged should not be used outside of the
>>                 Craniata (taxid=89593)
>>
>>
>> --Torsten Seemann
>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>> University, AUSTRALIA
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
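For reference, the even-number rule Torsten describes for the "Unbalanced quote" warning can be sketched in plain Perl. This is an illustration only, not BioPerl's actual check, and quotes_balanced is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text holds an even number of
# double-quote characters, i.e. every opening quote has a closing one.
sub quotes_balanced {
    my ($text) = @_;
    my $count = ($text =~ tr/"//);    # tr with an empty replacement only counts
    return $count % 2 == 0;
}

# Qualifier lines taken from the thread: the warned line, the corrected
# /db_xref line, and the valueless /germline qualifier.
for my $line ('/db_xref="taxon:35783" /germline"',
              '/db_xref="taxon:35783"',
              '/germline') {
    printf "%-36s %s\n", $line, quotes_balanced($line) ? 'balanced' : 'unbalanced';
}
```

The first line carries three double quotes and is reported as unbalanced; the corrected /db_xref line and the bare /germline qualifier both pass.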