[Bioperl-l] strange error parsing a specific NCBI gff file
William Hsiao
william.hsiao at gmail.com
Tue Jun 27 19:52:03 UTC 2006
Hi all,
I've encountered a strange problem while parsing a GFF file from
NCBI using Perl. I'm hoping that someone on the list may have a
solution even though this is not a BioPerl issue. Maybe someone
familiar with GFF3 parsing can help :) Essentially, I'm parsing a GFF
file into a nested hash structure using the following functions:
use strict;
use warnings;
use Data::Dumper;

sub parse_gff {
    my $file = shift;
    my %hash_gff;
    open(my $in, '<', $file) or die "Cannot open file $file: $!\n";
    while (<$in>) {
        next if /^\#/;    # skip comment and directive lines
        chomp;
        my ($seqid, $source, $type, $start, $end, $score, $strand, $phase,
            $attributes) = split /\t/;
        my $attri_ref = process_attributes($attributes);
        my %record = ('seqid'     => $seqid,
                      'source'    => $source,
                      'type'      => $type,
                      'start'     => $start,
                      'end'       => $end,
                      'score'     => $score,
                      'strand'    => $strand,
                      'phase'     => $phase,
                      'attribute' => $attri_ref);
        push @{$hash_gff{$type}}, \%record;    # group records by feature type
    }
    close $in;
    print Dumper(\%hash_gff);
    return \%hash_gff;
}
sub process_attributes {
    my $attr_string = shift;
    my @attributes = split (/\;/, $attr_string);
    my %attr;
    foreach (@attributes){
        my ($key, $value) = split /=/;
        if ($value=~/\:/){
            my ($subkey, $subvalue) = split (/:/, $value);
            $attr{$key}{$subkey}=$subvalue;
        }
        else{
            $attr{$key}=$value;
        }
    }
    return \%attr;
}
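To make the intended result concrete, here is a minimal sketch of the nested structure that process_attributes builds for a well-behaved attribute column. The sub is repeated (with only whitespace changes) so the snippet runs on its own; the input string is a shortened, hypothetical attribute column assembled from values that appear in the lines quoted further down.

```perl
use strict;
use warnings;

# Same logic as process_attributes in this post, repeated so the sketch
# is self-contained.
sub process_attributes {
    my $attr_string = shift;
    my %attr;
    foreach (split /;/, $attr_string) {
        my ($key, $value) = split /=/;
        if ($value =~ /:/) {
            # "key=sub:val" becomes a nested hash: $attr{key}{sub} = val
            my ($subkey, $subvalue) = split /:/, $value;
            $attr{$key}{$subkey} = $subvalue;
        }
        else {
            # "key=val" becomes a plain string: $attr{key} = val
            $attr{$key} = $value;
        }
    }
    return \%attr;
}

# Hypothetical shortened attribute column (not a full line from the file)
my $attr = process_attributes(
    'locus_tag=ACIAD0647;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1'
);

print "$attr->{locus_tag}\n";
print "$attr->{db_xref}{GI}\n";
print "$attr->{db_xref}{GeneID}\n";
```

Here db_xref occurs twice, but both values contain a colon, so both land in the same nested hash under different subkeys.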
It works for all the GFF files we downloaded from NCBI's microbial
genomes RefSeq FTP repository. However, 3 lines from one particular
file, NC_005966.gff (of Acinetobacter_sp_ADP1), cannot be parsed
properly. These lines are:
NC_005966.1 RefSeq CDS 635836 636489 . - 0 locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1
NC_005966.1 RefSeq start_codon 636487 636489 . - 0 locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1
NC_005966.1 RefSeq stop_codon 635833 635835 . - 0 locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1
They generate an error: Can't use string
("adaptation%20to%20stress") as a HASH ref while "strict refs" in use.
The strange part is that all I have to do is replace the word
"function" in front of "=adaptation%20to%20stress;" with another word,
or simply change it to functions or functio or Function, etc., and the
line parses properly. If I retype the word "function", it doesn't
solve the problem. For some strange reason, when the word "function"
is there, Perl tries to use "adaptation%20to%20stress" as a hash
reference and fails. The word "function" is used in other lines as well,
so I don't think the problem is caused by the word alone.
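The error can be reproduced in isolation with a minimal sketch. It assumes, and this is only a likely explanation rather than a confirmed diagnosis, that the trigger is the key "function" occurring twice in the same attribute column, first with a plain value and then with a value containing a colon. The two pairs below are copied from the failing lines; the loop mirrors the logic of process_attributes, and the eval and the variable $error are additions for illustration only.

```perl
use strict;
use warnings;

my %attr;

# Shortened from the failing lines: "function" occurs twice, and only the
# second value contains a colon.
my @pairs = (
    'function=adaptation%20to%20stress',
    'function=protection%20%28MultiFun:5.5%29',
);

my $error = '';
foreach my $pair (@pairs) {
    my ($key, $value) = split /=/, $pair;
    if ($value =~ /:/) {
        my ($subkey, $subvalue) = split /:/, $value;
        # On the second pair, $attr{function} already holds the plain
        # string from the first pair, so dereferencing it as a hash dies
        # under "strict refs". eval captures the message instead.
        eval { $attr{$key}{$subkey} = $subvalue; 1 } or $error = $@;
    }
    else {
        $attr{$key} = $value;
    }
}

print "Error: $error";
```

Under this assumption, renaming the first "function" to anything else avoids the collision because the two values no longer share a key, whereas retyping the same word changes nothing. Other lines that use "function" would parse as long as they do not mix a plain value and a colon-containing value under that key.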
Any suggestions on what might be happening would be greatly
appreciated. Thank you.
Cheers,
Will
--
William Hsiao
PhD Student, Brinkman Laboratory
Department of Molecular Biology and Biochemistry
Simon Fraser University, 8888 University Dr. Burnaby, BC, Canada V5A 1S6
Phone: 604-291-4206 Fax: 604-291-5583