Retrieve virus genome entries from DDBJ
From WABI
Summary
This page shows how to retrieve virus genome entries from DDBJ.
Services used
Description
Retrieve genome entries from the VRL and PHG divisions of DDBJ whose Definition item meets the following conditions.
- Includes "complete genome", or includes both "segment" and "complete sequence"
- Does not include "TPA:" or "nearly complete"
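The conditions can also be written as a local check on a single Definition string, which may be useful for verifying downloaded results. The following is a minimal sketch, not part of the original sample: the function name matches_conditions and the example definitions are hypothetical, the keywords follow the query used in the sample program ("complete genome", or "segment" together with "complete sequence"), and plain substring matching is an assumption about how the conditions are meant to be applied.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a Definition string meets the
# conditions, 0 otherwise. Uses plain substring matching (an assumption).
sub matches_conditions {
    my ($definition) = @_;

    # Exclusions: TPA entries and nearly complete sequences.
    return 0 if index($definition, 'TPA:') != -1;
    return 0 if index($definition, 'nearly complete') != -1;

    # Inclusion: a complete genome, or a segment with a complete sequence.
    return 1 if index($definition, 'complete genome') != -1;
    return 1 if index($definition, 'segment') != -1
             && index($definition, 'complete sequence') != -1;
    return 0;
}

# Made-up definitions for illustration only.
for my $def (
    'Example virus A, complete genome',
    'Example virus B segment 2, complete sequence',
    'Example virus C, nearly complete genome',
    'TPA: Example virus D, complete genome',
) {
    printf "%s\t%s\n", (matches_conditions($def) ? 'keep' : 'skip'), $def;
}
```

The exclusions are tested first because "nearly complete genome" also contains the substring "complete genome".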
Sample program
This program retrieves the genome entries that meet the above conditions from the DDBJ database by calling the searchByXMLPath method of ARSA.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
# number of entries requested per call
my $page_size = 1000;

# retrieve hit count
my $hits_count = getHitCount();
# number of requests needed to cover all hits (rounded up)
my $loop = int(($hits_count + $page_size - 1) / $page_size);
print "hitscount " . $hits_count . "\n";
print "accession number\tDefinition\tgenome/segment\n";

# retrieve results page by page
for (my $i = 0; $i < $loop; $i++) {
    print extractData(($i * $page_size) + 1, $page_size);
}

sub callARSA {
    # start position and result count are given as arguments
    my ($offset, $count) = @_;

    # make request
    my $req = HTTP::Request->new(POST => 'http://xml.nig.ac.jp/rest/Invoke');
    $req->content_type('application/x-www-form-urlencoded');

    # set parameters; the query must be URL-encoded
    my $query = "((/ENTRY/DDBJ/division=='PHG' OR /ENTRY/DDBJ/division=='VRL') AND ";
    $query .= "(((/ENTRY/DDBJ/definition='segment' AND /ENTRY/DDBJ/definition='complete sequence') OR ";
    $query .= "/ENTRY/DDBJ/definition='complete genome')) AND (/ENTRY/DDBJ/definition!='nearly complete' AND ";
    $query .= "/ENTRY/DDBJ/definition!='TPA:'))";
    $query =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
    $query =~ tr/ /+/;

    # return parameter: accession number and definition, in tab-delimited format
    my $return = "/ENTRY/DDBJ/primary-accession,/ENTRY/DDBJ/definition";
    $return =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
    $return =~ tr/ /+/;
    $req->content("service=ARSA&method=searchByXMLPath&queryPath=$query&returnPath=$return&offset=$offset&count=$count");

    # send request and get response
    my $res = $ua->request($req);
    # For a large result, it is better to write the response directly to a file:
    # my $res = $ua->request($req, 'file_name.txt');

    return $res->content;
}

sub extractData {
    # start position and result count are given as arguments
    my ($offset, $count) = @_;
    my @result = split("\n", callARSA($offset, $count));
    my $out = '';

    # data lines start at the third line of the response;
    # label each one as genome or segment
    for (my $j = 2; $j < @result; $j++) {
        $out .= $result[$j] . "\t";
        if (index($result[$j], "complete genome") != -1) {
            $out .= "genome";
        }
        elsif (index($result[$j], "segment") != -1) {
            $out .= "segment";
        }
        $out .= "\n";
    }

    # return the text so that the caller prints it
    return $out;
}

# extract hit count from result
sub getHitCount {
    my @result = split("\n", callARSA(1, 1));
    for my $line (@result) {
        if (index($line, "hitscount") >= 0) {
            my @hits = split("=", $line);
            return trim($hits[1]);
        }
    }
    return 0;
}

# strip leading and trailing whitespace
sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @out : $out[0];
}
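The sample program encodes the query and the return path with the same two steps: a substitution that percent-encodes special characters, followed by tr to turn spaces into "+". As a minimal sketch of what those steps produce, they can be wrapped in a small helper. The name form_encode is hypothetical and not part of the original program, and the sketch assumes ASCII-only input, as in the queries used here.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two encoding steps of the sample program.
# Assumes ASCII-only input.
sub form_encode {
    my ($text) = @_;
    # Replace every character other than word characters and spaces with %xx.
    $text =~ s/([^\w ])/'%' . unpack('H2', $1)/eg;
    # Spaces become '+', as in application/x-www-form-urlencoded bodies.
    $text =~ tr/ /+/;
    return $text;
}

print form_encode("/ENTRY/DDBJ/division=='VRL'"), "\n";
# prints: %2fENTRY%2fDDBJ%2fdivision%3d%3d%27VRL%27
```

Encoding each parameter value separately keeps the "&" and "=" that delimit the request parameters unencoded, while the same characters inside the query are escaped.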
How to execute
Download the sample program, unpack the archive file, and type as follows.
perl test.pl
The result is as follows.
The first line shows the number of search results. The second line is a header. The following lines list the matching entries in tab-delimited format (<accession number>, <Definition>, <genome or segment>).
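Because the output is tab-delimited, it can be processed further with a few lines of Perl. The following is a minimal sketch that counts how many entries are labeled genome or segment. The function name count_types, the label "unlabeled", and the sample lines, including the accession numbers, are made up for illustration and are not real output.

```perl
use strict;
use warnings;

# Hypothetical helper: tallies the third column of the program output.
sub count_types {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp $line;
        # A limit of 3 keeps an empty third field instead of dropping it.
        my ($accession, $definition, $type) = split /\t/, $line, 3;
        # The hit-count line has no tabs, so it has no third column.
        next unless defined $type;
        # Skip the header line.
        next if $type eq 'genome/segment';
        $type = 'unlabeled' if $type eq '';
        $count{$type}++;
    }
    return %count;
}

# Made-up sample lines in the format described above.
my @sample = (
    "hitscount 3\n",
    "accession number\tDefinition\tgenome/segment\n",
    "XX000001\tExample virus A, complete genome\tgenome\n",
    "XX000002\tExample virus B segment 1, complete sequence\tsegment\n",
    "XX000003\tExample virus B segment 2, complete sequence\tsegment\n",
);
my %count = count_types(@sample);
print "$_\t$count{$_}\n" for sort keys %count;
```

To apply the same helper to real output, the sample lines could be replaced with lines read from a file in which the output of test.pl was saved.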
Construct with Taverna
The following image was generated by the Taverna GUI.
The Taverna XML file for this workflow is here.
Links
GIB-V
Japanese page
ARSA document
AND OR search with plural keywords by ARSA




