[Bioperl-l] Bio::SeqIO HOWTO
Hilmar Lapp
hlapp at gnf.org
Thu Nov 3 11:22:21 EST 2005
Sure, but fasta format takes 5-10x less space and parses 3x faster. (You
could also instruct the genbank parser to ignore most of the attributes
except accession and sequence if that's all you need, and it will then
be much faster too. See Bio::Seq::SeqBuilder.)
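
A rough sketch of the SeqBuilder route, in case it helps. It assumes the
slot names 'accession_number' and 'seq' match the corresponding sequence
accessors; the helper name print_accessions is made up for illustration and
is not part of Bioperl.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper (not a Bioperl function): print accession and length
# for every record in a GenBank file while skipping the other attributes.
sub print_accessions {
    my ($file) = @_;
    my $in = Bio::SeqIO->new('-file' => "<$file", '-format' => 'genbank');

    # Ask the stream's sequence builder to fill in only the slots we need.
    # Assumption: slot names are the accessor names.
    my $builder = $in->sequence_builder;
    $builder->want_none;
    $builder->add_wanted_slot('accession_number', 'seq');

    while (my $seq = $in->next_seq) {
        print $seq->accession_number, "\t", $seq->length, "\n";
    }
    return;
}

# Usage: perl script.pl mouse.gb
print_accessions($ARGV[0]) if @ARGV;
```

Attributes that are not requested (features, annotation, and so on) are
simply left unset on the returned objects, so this only suits scripts that
never read them.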
-hilmar
On Nov 3, 2005, at 12:56 AM, chen li wrote:
> Thanks Hilmar. Both methods work for me now. It turns
> out this script correctly prints what is expected
> if the input file is in genbank format, but not
> if it is in fasta format.
>
> Li
>
> --- Hilmar Lapp <hlapp at gnf.org> wrote:
>
>> $seq->display_id will give you the full composite ID after the
>> greater-than character.
>>
>> It's trivial enough to split it with a regular expression to obtain
>> only the part you're interested in, so for the reasons Barry mentions
>> Bioperl doesn't do this for you.
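
The split described above could look something like the sketch below. It
assumes the NCBI-style composite ID from the file excerpt in this thread
(gi|number|db|accession|); the helper name accession_from_display_id is
hypothetical, and headers written in another layout would need a different
pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the accession out of an NCBI-style composite ID
# such as "gi|10946609|ref|NM_021308.1|". Returns undef if the ID does not
# follow that layout.
sub accession_from_display_id {
    my ($display_id) = @_;
    # gi | <number> | <database tag> | <accession> |
    if ($display_id =~ /^gi\|\d+\|[^|]+\|([^|]+)/) {
        return $1;
    }
    return;
}

# In a real script the ID would come from $seq->display_id.
my $id  = 'gi|10946609|ref|NM_021308.1|';
my $acc = accession_from_display_id($id);
print defined $acc ? "$acc\n" : "no accession found in '$id'\n";
# prints NM_021308.1
```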
>>
>>
>> On Nov 2, 2005, at 8:25 PM, Barry Moore wrote:
>>
>>> Li-
>>>
>>> The script is working correctly. You are giving it a fasta file and
>>> then asking it to print the accession number. While you and I can
>>> plainly see that the accession number NM_021308.1 is in the fasta
>>> header, bioperl makes no attempt to parse accession numbers from a
>>> fasta header. The reason for this is there is no uniformity in how
>>> fasta headers are written, so every fasta file could use a different
>>> header format and be valid.
>>>
>>> If you just want to see the script work correctly for learning
>>> purposes, change the line:
>>> print $seq->accession_number,"\n";
>>> to any or all of these lines:
>>> print $seq->alphabet,"\n";
>>> print $seq->description,"\n";
>>> print $seq->display_name,"\n";
>>> print $seq->length,"\n";
>>> print $seq->seq,"\n";
>>>
>>> If you want the script to print the accession number, try downloading
>>> the full GenBank formatted sequence and run your script something like:
>>> perl getaccs.pl mouse.gb genbank
>>>
>>> Barry
>>>
>>>> -----Original Message-----
>>>> From: chen li [mailto:chen_li3 at yahoo.com]
>>>> Sent: Wednesday, November 02, 2005 8:36 PM
>>>> To: Barry Moore
>>>> Subject: RE: [Bioperl-l] Bio::SeqIO HOWTO
>>>>
>>>> Barry,
>>>>
>>>> Thank you very much.
>>>>
>>>> Here are the results. 1) If I type "perl getaccs.pl" I
>>>> get this result "getaccs.pl file format" on the
>>>> screen. 2) If I type "perl getaccs.pl mouse.fasta
>>>> fasta" I get "unknown" on the screen. It seems that
>>>> no accession number is printed out after the script is
>>>> executed.
>>>>
>>>> So what is the problem here?
>>>>
>>>> Li
>>>>
>>>> here is part of my file:
>>>>
>>>> >gi|10946609|ref|NM_021308.1| Mus musculus piwi like homolog 2 (Drosophila) (Piwil2), mRNA
>>>> AGTGTGTGGGAGGAACGCAGGGGCTGGAATAGGAGGGAAAGGAGGTGGCTCCAGGAGAGAGCGAGAGAGG
>>>> GAGCGCTCGCATCGGGGCTCAGTGGCACCAGACCTAAAAAGAAATCTAGGCAAGGCTCCGGCACAGTCCA
>>>> ....
>>>>
>>>> --- Barry Moore <bmoore at genetics.utah.edu> wrote:
>>>>
>>>>> Li-
>>>>>
>>>>> You don't need to modify the script. It is written to accept the
>>>>> filename and format on the command line like this:
>>>>> perl getaccs.pl mouse.fasta fasta
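
To make the mechanism concrete: at file level, a bare "shift" in Perl takes
the next element off @ARGV, the array holding the command-line arguments, so
the first "shift" yields the filename and the second the format. A small
sketch follows; the read_args helper is hypothetical and only stands in for
the real command line so the example runs without any sequence files.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics what the script's two "shift" lines do.
# At file level, "shift" with no argument means "shift @ARGV"; inside a
# sub it would mean "shift @_", so @ARGV is named explicitly here.
sub read_args {
    local @ARGV = @_;    # stand-in for the real command line
    my $usage  = "getaccs.pl file format\n";
    my $file   = shift @ARGV or die $usage;
    my $format = shift @ARGV or die $usage;
    return ($file, $format);
}

# Equivalent to running: perl getaccs.pl mouse.fasta fasta
my ($file, $format) = read_args('mouse.fasta', 'fasta');
print "file=$file format=$format\n";
# prints file=mouse.fasta format=fasta
```

With no arguments the first "shift" returns undef and the script dies with
the usage string, which is the "getaccs.pl file format" message.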
>>>>>
>>>>> Barry
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: bioperl-l-bounces at portal.open-bio.org
>>>>>> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of chen li
>>>>>> Sent: Tuesday, November 01, 2005 10:30 PM
>>>>>> To: bioperl-l at bioperl.org
>>>>>> Subject: [Bioperl-l] Bio::SeqIO HOWTO
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Here is one script copied from the Bio::SeqIO HOWTO:
>>>>>>
>>>>>> use Bio::SeqIO;
>>>>>> my $usage = "getaccs.pl file format\n";
>>>>>> my $file = shift or die $usage;
>>>>>> my $format = shift or die $usage;
>>>>>>
>>>>>> my $inseq = Bio::SeqIO->new('-file'   => "<$file",
>>>>>>                             '-format' => $format );
>>>>>> while (my $seq = $inseq->next_seq) {
>>>>>>     print $seq->accession_number,"\n";
>>>>>> }
>>>>>> exit;
>>>>>>
>>>>>>
>>>>>> I have a small file called mouse.fasta kept in the
>>>>>> same directory. My question is: how does the
>>>>>> script know to read in mouse.fasta? Where should I
>>>>>> make a small modification in the script?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Li
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at portal.open-bio.org
>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>> --
>> -------------------------------------------------------------
>> Hilmar Lapp email: lapp at gnf.org
>> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
>> -------------------------------------------------------------
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------