[Bioperl-l] Indexing nr database

Tue Sep 7 09:33:46 UTC 2010

On 09/07/2010 11:18 AM, Ross KK Leung wrote:
> The reason is that I have to retrieve the specific information of the
> matched sequences, e.g. extract the 64th amino acid of the top matched
> sequence. Is there any way to achieve that?

"blastdbcmd" has several options like "-range"

And even if "blastdbcmd" does not give you exactly the subset of
information you want to fetch, I am still convinced it is quicker to fetch
the complete entry with "blastdbcmd" and then parse the required data out
of that single entry.
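
A minimal sketch of the "fetch one entry, then parse it" approach, under some assumptions: the BLAST+ binaries are installed, the pre-formatted nr database is available locally under the name "nr", and ACCESSION stands in for a real identifier. The helper name nth_residue is hypothetical and not part of BLAST+. The blastdbcmd calls are left as comments so the snippet runs without a database.

```shell
#!/bin/sh
# Sketch only: print the Nth residue (1-based) of the first record in a
# FASTA stream read from standard input.
# nth_residue is a hypothetical helper name, not part of BLAST+.
nth_residue() {
    awk -v n="$1" '
        /^>/ { if (seen) exit; seen = 1; next }   # stop at a second header
        { gsub(/[ \t\r]/, ""); seq = seq $0 }     # join wrapped sequence lines
        END {
            if (n < 1 || n > length(seq)) exit 1  # position out of range
            print substr(seq, n, 1)
        }
    '
}

# Self-contained demo on an inline record (no database needed):
printf '>demo\nMKV\nLAA\n' | nth_residue 4    # prints "L"

# With BLAST+ and a local nr database (both assumptions), the same helper
# could be fed by blastdbcmd; ACCESSION is a placeholder:
#   blastdbcmd -db nr -entry ACCESSION | nth_residue 64
# or let blastdbcmd cut the range itself:
#   blastdbcmd -db nr -entry ACCESSION -range 64-64 -outfmt %s
```

Either route touches only one database entry per lookup, so there is no need to build a separate index over the whole nr file.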

Hans

> -----Original Message-----
> From: Hans-Rudolf Hotz [mailto:hrh at fmi.ch]
> Sent: Tuesday, September 07, 2010 5:09 PM
> To: bioperl-l at lists.open-bio.org; ross at cuhk.edu.hk
> Subject: Re: [Bioperl-l] Indexing nr database
>
> Hi
>
>
> why don't you use the pre-indexed BLAST files from NCBI:
>
> ftp://ftp.ncbi.nih.gov/blast/db/
>
> you can use them to fetch individual sequences by gi number or accession
> with the tool "blastdbcmd" from blast+ binaries:
>
> ftp://ftp.ncbi.nih.gov/blast/executables/blast+/
>
>
> regards, Hans
>
>
>
> On 09/07/2010 10:28 AM, Ross KK Leung wrote:
>> With the following code I want to index the 4 GB nr database. However, the
>> index file is > 1 TB, and the job has been running for weeks and still
>> hasn't finished. Could anybody tell me how you accomplish this? Thanks in
>> advance.
>>
>>     use strict;
>>     use Bio::DB::Flat::BinarySearch;
>>
>>     my ($baseDir, $dbName, $seqFile, $testId, $testGi) = @ARGV;
>>
>>     # use single quotes so you don't have to write
>>     # regular expressions like "gi\\|(\\d+)"
>>     #my $primary_pattern = '^>(\S+)';
>>     #if ($fullHeader == 1) {
>>     # primary key: the entire FASTA header line
>>     my $primary_pattern = '^>(.+)';
>>     #}
>>
>>     my $string = "gi|41353971|emb|AL123456.2| Mycobacterium tuberculosis H37Rv complete genome";
>>     #$string =~ s/$primary_pattern/RRR/g;
>>     #print "$string\n";
>>
>>     # one or more patterns stored in a hash:
>>     my $secondary_patterns = {GI => 'gi\|(\d+)'};
>>
>>     my $db = Bio::DB::Flat::BinarySearch->new(
>>         -directory          => $baseDir,
>>         -dbname             => $dbName,
>>         -write_flag         => 1,
>>         -primary_pattern    => $primary_pattern,
>>         -primary_namespace  => 'ACC',
>>         -secondary_patterns => $secondary_patterns,
>>         -verbose            => 1,
>>         -format             => 'fasta',
>>     );
>>
>>     # build the flat-file index for the given sequence file
>>     $db->build_index($seqFile);