[Bioperl-l] Indexing nr database
Hans-Rudolf Hotz
hrh at fmi.ch
Tue Sep 7 09:33:46 UTC 2010
On 09/07/2010 11:18 AM, Ross KK Leung wrote:
> The reason is that I have to retrieve the specific information of the
> matched sequences, e.g. extract the 64th amino acid of the top matched
> sequence. Is there any way to achieve that?
"blastdbcmd" has several options, such as "-range", for fetching only part
of an entry. And even if "blastdbcmd" cannot return exactly the subset of
information you want, I am still convinced it is quicker to fetch the
complete entry with "blastdbcmd" and then parse the required data out of
just that one entry.
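
A minimal sketch of both approaches in Perl, in case it helps. It assumes
"blastdbcmd" from the blast+ binaries is on your PATH and that the nr BLAST
database can be found under the name you pass in. The sub names, the script
name and the command-line layout are made up for illustration. The first sub
only builds the "blastdbcmd" call with "-range"; the second is the fallback
that parses one complete FASTA entry.

```perl
use strict;
use warnings;

# Build the argument list for a blastdbcmd call that returns only the
# residues from $from to $to (1-based, inclusive) of a single entry.
# '-outfmt %s' asks for the sequence data only, without the header.
sub blastdbcmd_args {
    my ($db, $entry, $from, $to) = @_;
    return ('blastdbcmd', '-db', $db, '-entry', $entry,
            '-range', "$from-$to", '-outfmt', '%s');
}

# Fallback: take one complete FASTA entry as text and return the residue
# at a 1-based position, or undef if the position is out of range.
sub residue_at {
    my ($fasta, $pos) = @_;
    my $seq = join '', grep { !/^>/ } split /\r?\n/, $fasta;
    $seq =~ s/\s+//g;
    return undef if $pos < 1 || $pos > length $seq;
    return substr($seq, $pos - 1, 1);
}

# Hypothetical usage: perl fetch_residue.pl nr <gi-or-accession> 64
if (@ARGV == 3) {
    my ($db, $entry, $pos) = @ARGV;
    my @cmd = blastdbcmd_args($db, $entry, $pos, $pos);

    # List-form pipe open: no shell is involved, so the id needs no quoting.
    open(my $fh, '-|', @cmd) or die "cannot run blastdbcmd: $!";
    my $out = do { local $/; <$fh> };
    close $fh or die "blastdbcmd exited with an error\n";

    $out = '' unless defined $out;
    $out =~ s/\s+//g;
    print "$out\n";
}
```

If "-range" turns out not to cover what you need, run the same command
without "-range" and "-outfmt" to get the whole FASTA entry and pass that
text to residue_at().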
Hans
> -----Original Message-----
> From: Hans-Rudolf Hotz [mailto:hrh at fmi.ch]
> Sent: Tuesday, September 07, 2010 5:09 PM
> To: bioperl-l at lists.open-bio.org; ross at cuhk.edu.hk
> Subject: Re: [Bioperl-l] Indexing nr database
>
> Hi
>
>
> why don't you use the pre-indexed BLAST files from NCBI:
>
> ftp://ftp.ncbi.nih.gov/blast/db/
>
> you can use them to fetch individual sequences by gi number or accession
> with the tool "blastdbcmd" from blast+ binaries:
>
> ftp://ftp.ncbi.nih.gov/blast/executables/blast+/
>
>
> regards, Hans
>
>
>
> On 09/07/2010 10:28 AM, Ross KK Leung wrote:
>> With the following code, I want to index the 4G nr database; however, the
>> index file is > 1T and the job has been running for weeks and still hasn't
>> finished. Could anybody tell me how you accomplish the goal? Thanks in
>> advance.
>>
>> use strict;
>> use Bio::DB::Flat::BinarySearch;
>>
>> (my $baseDir, my $dbName, my $seqFile, my $testId, my $testGi) = @ARGV;
>>
>> # use single quotes so you don't have to write
>> # regular expressions like "gi\\|(\\d+)"
>> #my $primary_pattern = '^>(\S+)';
>> #if ($fullHeader == 1) {
>> my $primary_pattern = '^>(.+)';
>> #}
>>
>> my $string = "gi|41353971|emb|AL123456.2| Mycobacterium tuberculosis H37Rv complete genome";
>> #$string =~ s/$primary_pattern/RRR/g;
>> #print "$string\n";
>>
>> # one or more patterns stored in a hash:
>> my $secondary_patterns = {GI => 'gi\|(\d+)'};
>>
>> my $db = Bio::DB::Flat::BinarySearch->new(
>>     -directory          => $baseDir,
>>     -dbname             => $dbName,
>>     -write_flag         => 1,
>>     -primary_pattern    => $primary_pattern,
>>     -primary_namespace  => 'ACC',
>>     -secondary_patterns => $secondary_patterns,
>>     -verbose            => 1,
>>     -format             => 'fasta' );
>>
>> $db->build_index($seqFile);