[Bioperl-l] Help Parsing FASTA Sequence File

Jordi Durban jordi.durban at gmail.com
Wed Dec 22 15:56:59 UTC 2010


At first sight I'd try using awk to get the column 1 identifiers whose
column 5 isn't 0.0. Something like:
awk '$5 != "0.0" { print $1 }' score1.txt
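For the "top 2" part of the question, here is a minimal sketch in Python (not awk or BioPerl, just the standard library). It assumes the score file is whitespace-delimited with five columns as in the sample data; the function name top_ids and the variable names are made up for illustration.

```python
def top_ids(score_lines, n=2):
    """Return the IDs of the n rows with the highest column 4 score,
    skipping rows whose column 5 is exactly '0.0'."""
    rows = []
    for line in score_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        contig_id, col4, col5 = fields[0], float(fields[3]), fields[4]
        if col5 != "0.0":
            rows.append((col4, contig_id))
    # Highest column 4 score first
    rows.sort(key=lambda r: r[0], reverse=True)
    return [contig_id for _, contig_id in rows[:n]]


# Sample rows copied from score1.txt in the original question
sample = """\
contig00002 length=671 numreads=17 1207 0.0
contig00003 length=637 numreads=26 1205 0.0
contig00052 length=535 numreads=10 607 e-176
contig00072 length=472 numreads=46 571 e-165
contig00019 length=667 numreads=5 474 e-136
""".splitlines()

print(top_ids(sample))  # ['contig00052', 'contig00072']
```

For a real file you would pass an open file handle instead of the sample list, e.g. top_ids(open("score1.txt")).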
And once you have these identifiers, you could try to get the $seq->seq()
for each record whose $seq->id() matches.
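The lookup step could be sketched like this, again in plain Python rather than Bio::SeqIO. It assumes the ID is the first word of each FASTA header line; read_fasta and extract are hypothetical helper names, and the output puts each sequence on a single line instead of re-wrapping it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract(fasta_lines, wanted_ids):
    """Return FASTA text for the wanted IDs, in the order they were requested."""
    records = {}
    for header, seq in read_fasta(fasta_lines):
        parts = header.split()
        if parts:  # ignore records with an empty header
            records[parts[0]] = (header, seq)
    out = []
    for contig_id in wanted_ids:
        if contig_id in records:
            header, seq = records[contig_id]
            out.append(f">{header}\n{seq}")
    return "\n".join(out)


# Two records copied from data1.txt in the original question
fasta = """\
>contig00052 length=535 numreads=10
GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
CCCAGGTGCCGTTAGCCA
>contig00003 length=637 numreads=26
GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
CCCAGGTGCCGTTAGCCAGAGCTG
""".splitlines()

print(extract(fasta, ["contig00052"]))
```

This keeps every record in memory, which is fine for small files; for large ones an indexed approach is probably the better fit.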
Hope this helps.

2010/12/22 Chris Fields <cjfields at illinois.edu>

> You might want to look at Bio::DB::Fasta or Bio::Index::Fasta, or
> Bio::DB::Flat (all of which index FASTA), and use SQLite or similar to
> create a database for the score lookups.
>
> chris
>
> On Dec 9, 2010, at 6:50 AM, Fahmida wrote:
>
> >
> > Hi,
> >
> > I've several input 'score' files and their corresponding 'data' files,
> > like:
> > score1.txt data1.txt
> > score2.txt data2.txt
> > ....
> > ....
> >
> > score1.txt
> >
> > contig00002 length=671 numreads=17 1207 0.0
> > contig00003 length=637 numreads=26 1205 0.0
> > contig00052 length=535 numreads=10 607 e-176
> > contig00072 length=472 numreads=46 571 e-165
> > contig00019 length=667 numreads=5 474 e-136
> >
> > This file has several rows and five columns. Columns 1-3 are
> > names/descriptions, and column 4 (1207, 1205, etc.) and column 5 (0.0, 0.0,
> > e-176, etc.) contain the scores. I want to make a list of the TOP 2 names
> > based on column 4 score and whose column 5 score is not '0.0'. For example,
> > for the above data the output list would be:
> >
> > contig00052 length=535 numreads=10
> > contig00072 length=472 numreads=46
> >
> > I then want to use the above list to extract data from 'data1.txt':
> >
> > data1.txt
> >
> >> contig00001 length=567 numreads=35
> > GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAAaCCAAGGGAGAAaGAAa
> > CTACACTACTAATGGAAAaGATCTACATGCTAGAAAAa
> >> contig00002 length=671 numreads=17
> > GGGgCTGACGTGgCcGCTAATACGACTCACTATAGGgAGAGTTACTGTGGAGGGAGAGGC
> > TTGCTCAAaTCCGCGTTCAAGGATTTCCAGATTGGTAAGAACTTCAGATT
> >> contig00052 length=535 numreads=10
> > GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> > CCCAGGTGCCGTTAGCCA
> >> contig00003 length=637 numreads=26
> > GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> > CCCAGGTGCCGTTAGCCAGAGCTG
> >> contig00072 length=472 numreads=46
> > GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGTTTtCCCCAGGACCCTGGGA
> > GGACCATGCCGTATGGGTGTCTAGTAAGTACAAaGCCATAATTCACATAAGTGAAATATT
> > CTCAAGcACTAGGATC
> >> contig00019 length=504 numreads=5
> > GGGCTGACGTGGCCGCTAATACGACTCACTATAGGgAGAGATCTCACTAAAAAACTGGGG
> > ATAACGCCT
> >
> >
> > Example Output file:
> >
> >> contig00052 length=535 numreads=10
> > GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGATCGTGGCGATCGCCAATCA
> > CCCAGGTGCCGTTAGCCA
> >> contig00072 length=472 numreads=46
> > GGGCTGACGTGgCCGCTAATACGACTCACTATAGGGAGAGTTTtCCCCAGGACCCTGGGA
> > GGACCATGCCGTATGGGTGTCTAGTAAGTACAAaGCCATAATTCACATAAGTGAAATATT
> > CTCAAGcACTAGGATC
> >
> > Any reply would be greatly appreciated.
> >
> > --
> > View this message in context:
> > http://old.nabble.com/Help-Parsing-FASTA-Sequence-File-tp30416193p30416193.html
> > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>



-- 
Jordi


