Retrieve genome entry of virus from DDBJ

From WABI

Summary

This page introduces a way to retrieve virus genome entries from DDBJ.

Service used

ARSA (searchByXMLPath method)

Description

Retrieve the genome entries in the VRL and PHG divisions of DDBJ whose Definition item meets the following conditions.

  • Includes "complete genome", or includes both "segment" and "complete sequence"
  • Includes neither "TPA:" nor "nearly complete"
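
The conditions can also be expressed as a small local check on a Definition string, which may help when verifying retrieved entries. The following is a minimal sketch, assuming plain case-sensitive substring matching (as the index calls in the sample program do) and the "complete genome" / "segment" with "complete sequence" keywords used in the sample program's query. The helper name classify_definition and the Definition strings are made up for illustration and are not part of ARSA or DDBJ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'genome', 'segment', or undef (entry rejected)
# for a Definition string. Substring matching is an assumption of this sketch.
sub classify_definition {
    my ($definition) = @_;

    # Exclusion conditions are checked first.
    return if index($definition, 'TPA:') != -1;
    return if index($definition, 'nearly complete') != -1;

    # Inclusion conditions.
    return 'genome' if index($definition, 'complete genome') != -1;
    return 'segment'
        if index($definition, 'segment') != -1
        && index($definition, 'complete sequence') != -1;

    return;
}

# Made-up Definition strings, for illustration only.
for my $definition (
    'Example virus A, complete genome',
    'Example virus B segment 2, complete sequence',
    'Example virus C, nearly complete genome',
    'TPA: Example virus D, complete genome',
) {
    my $label = classify_definition($definition);
    printf "%s\t%s\n", $definition, defined $label ? $label : 'skipped';
}
```

The exclusion checks run first because a string such as "nearly complete genome" also contains "complete genome".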

Sample program

This program retrieves the genome entries that meet the above conditions from the DDBJ database by using the searchByXMLPath method of ARSA.

Download this program.

use LWP::UserAgent;
use HTTP::Request;
$ua = LWP::UserAgent->new;

# retrieve hitscount
$hits_count = &getHitCount();


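# number of 1000-entry pages, rounded up (no extra empty request when the count is a multiple of 1000)
$loop = int(($hits_count+999)/1000);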
print "hitscount ". $hits_count."\n";
print "accession number\tDefinition\tgenome/segment\n";
# retrieve result (extractData prints the rows itself)
for($i=0;$i<$loop;$i++) {
	&extractData(($i*1000)+1,1000);
}


sub callARSA {
	# specify start position by argument
	$offset = $_[0];
	# specify result count by argument
	$count = $_[1];
	# make request
	my $req = HTTP::Request->new(POST => 'http://xml.nig.ac.jp/rest/Invoke');
	$req->content_type('application/x-www-form-urlencoded');
	# set parameters
	# the query must be URL-encoded before it is sent.
	$query = "((/ENTRY/DDBJ/division=='PHG' OR /ENTRY/DDBJ/division=='VRL') AND ";
	$query .= "(((/ENTRY/DDBJ/definition='segment' AND /ENTRY/DDBJ/definition='complete sequence') OR ";
	$query .= "/ENTRY/DDBJ/definition='complete genome')) AND (/ENTRY/DDBJ/definition!='nearly complete' AND ";
	$query .= "/ENTRY/DDBJ/definition!='TPA:'))";
	$query =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
	$query =~ tr/ /+/; 
	# specify return parameter. This example retrieves the accession number and definition in tab-delimited format.
	$return = "/ENTRY/DDBJ/primary-accession,/ENTRY/DDBJ/definition";
	$return =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
	$return =~ tr/ /+/; 
	$req->content("service=ARSA&method=searchByXMLPath&queryPath=$query&returnPath=$return&offset=$offset&count=$count");
	# send request and get response.
	my $res = $ua->request($req);
	# For a large result, it is better to write the response directly to a file.
	# my $res = $ua->request($req,'file_name.txt');
	# return the response body.
	return $res->content;
}


sub extractData {
	# specify start position by argument
	$offset = $_[0];
	# specify result count by argument
	$count = $_[1];
	$result = &callARSA($offset,$count);
	@result = split("\n",$result);
	# label each entry as genome or segment; the loop starts at index 2 to skip the first two lines of the response
	for($j=2;$j<@result;$j++) {
		print $result[$j]."\t";
		if (index ( $result[$j], "complete genome") != -1) {
			print "genome";
		}
		if (index ( $result[$j], "segment") != -1) {
			print "segment";
		}
		print "\n";
	}
}

# extract hitscount from result
sub getHitCount{
	$result = &callARSA(1,1);
	@result = split("\n",$result);
	for($i=0;$i<@result;$i++) {
		$find = index ( $result[$i], "hitscount");
		if ($find >= 0) {
			@hits = split("=",$result[$i]);
			return &trim($hits[1]);
		}
	}
	return 0;
}

sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @out : $out[0];
}
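
The sample program fetches results in pages of 1000, with offsets that start at 1. The paging arithmetic can be sketched as a standalone helper that never produces an offset past the last hit. The name page_offsets is hypothetical and is not part of the sample program or of ARSA.

```perl
use strict;
use warnings;

# Hypothetical helper: list the 1-based start offsets needed to page
# through $hits results, $page_size results per request.
sub page_offsets {
    my ($hits, $page_size) = @_;
    die "page size must be positive\n" if $page_size <= 0;

    my @offsets;
    for (my $offset = 1; $offset <= $hits; $offset += $page_size) {
        push @offsets, $offset;
    }
    return @offsets;
}

print join(',', page_offsets(2500, 1000)), "\n";   # 1,1001,2001
print join(',', page_offsets(2000, 1000)), "\n";   # 1,1001
```

Each returned offset could then be passed, together with the page size, to a routine such as extractData in the sample program.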


How to execute

After downloading the sample program, unpack the archive file and run the following command.

perl test.pl

The result is as follows.
The first line shows the number of search results and the second line is a header. The following lines list the matching entries in tab-delimited format (<accession number>, <Definition>, <genome or segment>).
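
The tab-delimited output can be post-processed with a short script, for example to count how many entries were labeled genome or segment. The following is a minimal sketch, assuming the three-column layout printed by the sample program. The helper name tally_labels is hypothetical, and the accession numbers and definitions in the usage lines are made-up placeholders, not real DDBJ entries.

```perl
use strict;
use warnings;

# Hypothetical helper: count entries per label in the program's output lines.
sub tally_labels {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $copy = $line);
        # A limit of 3 keeps an empty trailing label field.
        my ($accession, $definition, $label) = split /\t/, $copy, 3;
        # Skip the hit-count line and any other line without three fields.
        next unless defined $label;
        # Skip the header line.
        next if $accession eq 'accession number';
        $count{ $label eq '' ? 'unlabeled' : $label }++;
    }
    return %count;
}

# Made-up output lines, for illustration only.
my %count = tally_labels(
    "hitscount 2\n",
    "accession number\tDefinition\tgenome/segment\n",
    "XX000001\tExample virus A, complete genome\tgenome\n",
    "XX000002\tExample virus B segment 2, complete sequence\tsegment\n",
);
print "$_\t$count{$_}\n" for sort keys %count;
```

To process real output, the lines could be read from a file saved with a command such as perl test.pl > result.txt and passed to the helper in the same way.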

image:Arsa get virus.jpg

Construct with Taverna

The following image was generated with the Taverna GUI.
image:Virus genome extraction workflow.jpg

The XML file of this workflow for Taverna is here.

Links

GIB-V
Japanese page
ARSA document
AND OR search with plural keywords by ARSA
