2013-07-20

Searching for chemical substances based on ATC codes from WHO search engine.

The concept of how to grep search engine results with R had been one of my lingering thoughts. Because sometimes it's just feels easier if I could just grep search engine results than to find a dataset or complete database of a certain thing.

So when at work I had to use ATC codes to get the substances out or the pharmaceutical(?) use. So while no-one was looking I naturally had to devise the following script. This snippet is more or less done, a part of a imaginary whole if you will. It lacks the functionality to search for partial matches. So you can't just look for all the "B01" codes. Probably doing that extension next when I have the time. Also you can't use this for substance to ATC code. But for now this one way ticket, the basic concept as a snippet. Naturally comments about making it more readable and/or efficient are always welcome.

require(gdata)
require("RCurl")
require("XML")
AtcSearch <- function(AtcCodes){
#Function inside a function is bad practice...
#DummySearch searches for one atc code
DummySearch <- function(Code){
#A link to WHO search
htmlLink <- paste("http://www.whocc.no/atc_ddd_index/?code=", Code, sep="")
doc <- htmlParse(getURL(htmlLink))
Drug <- xpathSApply(doc, "//td", xmlValue)
#No idea why these three are needed
Drug <- gsub("Â", "", Drug)
Drug <- gsub("\\s", " ", Drug)
Drug <- trim(Drug) #gdata just for this...
#if no drug is found, returning NULL
if(sum(Drug == Code) == 0 | length(Drug) == 0) return()
atcLevels <- xpathSApply(doc, "//b", xmlValue)
out <- matrix(Drug[which(Drug == Code):length(Drug)], ncol= which(Drug == Code)-1, byrow=TRUE)
#Basic handling of multiple DDDs and Units
if(nrow(out) > 1) out <- t(apply(out, 2, function(x) paste(x[x != ""], collapse="/")))
out <- data.frame(out, t(atcLevels))
colnames(out) <- c(Drug[1:6], "Anatomical", "Therapeutic", "Subgroup", "ChemicalSubgroup")
out
}
#lapply to get all the codes.
ret <- lapply(AtcCodes, DummySearch)
do.call("rbind", ret)
}
#Some examples
AtcSearch("N01AX15")
AtcSearch(c("B01AA01", "N01AX15"))
#Not perfect for example, the who database isn't complete.
AtcSearch("QN01AX99")
view raw AtcSearch.R hosted with ❤ by GitHub

1 comment:

  1. Hallo Xachriel,
    I have been looking something like this for a while. I wonder whether you had the chance to go further with it and reverse the serach for example (name -> ATC class) or make a batch query ...
    Great thanks
    Andrea Zaliani

    ReplyDelete