For my paper on the impact of social insurance on the dynamics of conflict in India, I use some new remotely sensed weather data. The data comes from the Tropical Rainfall Measuring Mission (TRMM) satellite, which carries a set of five instruments and is essentially a rainfall radar located in outer space.
As a robustness check, I needed to verify that my main results go through using other rainfall data. In the paper I try to make a humble case in favour of using remotely sensed data where possible. The key reason is that the TRMM data comes from the same set of instruments over time, rather than from input sources that could vary with, for example, economic conditions. This is a problem that has been identified by climatologists, who try to correct for the systematic biases that can arise from the fact that weather stations are more likely to be located in places with a lot of economic activity.
At first I was a bit reluctant, as it is quite heavy data that needs to be processed. Nevertheless, a thorough analysis required me to jump through the hoops and obtain a secondary rainfall data source. I chose the GPCC monthly rainfall data to verify my results, since it has been used by many other authors in similar contexts. The data is based on rain gauge measurements and is available for the past 100 years.
The raw data is quite heavy; the monthly rainfall rate data for the whole world at 0.5 degree resolution amounts to about 150 million rows for the period 1961-2010. If you drop the non-land grid cells, this reduces dramatically to only 40 million rows. Below is a bit of code that loads the data once you have downloaded the ASCII source files from the GPCC website. On my personal website, I make a dta and an rdata file available for the whole world. There are three variables, appearing in this order: (1) the rainfall rate, (2) the rainfall normals and (3) an integer giving the number of reporting rain gauges that fall in a grid cell in a particular month.
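If you would rather skip the ASCII processing altogether, loading the prepared files is a one-liner. A minimal sketch, with placeholder file names standing in for whatever the downloads are called:

##R users: load() restores the data object into the workspace
load("gpcc_world.rdata")
##the dta file can be read in R via the foreign package
library(foreign)
gpcc<-read.dta("gpcc_world.dta")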
It turns out that all my results are robust to using this data. However, I do find something that is quite neat: if a district experienced some insurgency-related conflict in the previous year, it is less likely to have an active rain gauge reporting data in subsequent years. While it is a no-brainer that places with severe conflict do not have functioning weather reporting, these results suggest that reporting may also be systematically affected in places with relatively low-intensity conflict, as is the case in India.
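To fix ideas, a pattern like this can be checked with a simple linear probability model on a district-year panel. This is only a sketch, not the specification from the paper, and the object and variable names (panel, gauge_active, conflict_lag, district, year) are made up:

##LPM: does last year's conflict predict whether any gauge reports this year?
##district and year fixed effects absorb level differences
summary(lm(gauge_active ~ conflict_lag + factor(district) + factor(year), data=panel))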
While I do not want to overstate the importance of this, it provides another justification for why it makes sense for economists to use remotely sensed weather data. This is not to say that ground-based data is not useful. Quite the reverse: ground-based data is more accurate in many ways, which makes it very important for climatologists. As economists, we worry about systematic measurement error that correlates with the economic variables we are studying. This is where remotely sensed data provides advantages, as it does not “decide” to become less accurate in places that are, for example, less developed, suffer from conflict or simply have nobody living there.
Here is the function to read in the data and match it to district centroids; you need the data.table and geosphere packages.
#########LOAD GPCC; NOTE THAT YOU NEED TO SUBSET
#########THE DATA IF YOU DONT WANT TO END UP WITH
#########A HUGE DATA OBJECT
library(data.table)
library(geosphere)
loadGPCC<-function(ff, COORDS) {
##extract year and month from the file name suffix, e.g. "..._012001"
yr<-as.numeric(gsub("(.*)\\_([0-9]{2})([0-9]{4})","\\3",ff))
month<-as.numeric(gsub("(.*)\\_([0-9]{2})([0-9]{4})","\\2",ff))
##read the ASCII grid (skipping the 14 header lines) and attach coordinates
temp<-data.table(data.frame(cbind(COORDS,
read.table(file=paste("Rainfall/gpcc_full_data_archive_v006_05_degree_2001_2010/",
ff,sep=""), header=FALSE, skip=14))))
###YOU COULD SUBSET THE DATA BY EXTENT HERE IF
###YOU DONT WANT TO GET IT FOR THE WHOLE WORLD
##E.G. SUBSET BY BOUNDING BOX
##temp<-temp[x>=73 & x<=136 & y>=16 & y<=54]
temp<-cbind("year"= yr, "month"=month, temp)
gc()
temp
}
################
#####
ffs<-list.files("Rainfall/gpcc_full_data_archive_v006_05_degree_2001_2010")
###THIS DEFINES THE GRID STRUCTURE OF THE DATA
###YOU MAY NEED TO ADJUST IF YOU WORK WITH A
###COARSER GRID
##cell centres of the 0.5 degree grid; rows run from north to south
xs=seq(-179.75,179.75,.5)
ys=seq(89.75,-89.75,-.5)
##one (x,y) pair per grid cell, in the same order as the ASCII files
COORDS<-do.call("rbind", lapply(ys, function(x) cbind("x"=xs,"y"=x)))
##read and stack all monthly files (this takes a while)
system.time(GPCC<-do.call("rbind", lapply(1:length(ffs), function(x) loadGPCC(ffs[x], COORDS))))
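###NOTE: data.table::rbindlist AVOIDS THE REPEATED
###COPYING OF do.call("rbind",...) AND IS USUALLY
###MUCH FASTER ON LONG LISTS
##GPCC<-rbindlist(lapply(ffs, loadGPCC, COORDS=COORDS))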
###MATCHING THIS TO A SHAPEFILE?
##YOU COULD MATCH CENTROIDS OF DISTRICTS TO THE
##NEAREST GRID CELL - THE FOLLOWING LOOP WOULD
##DO THAT
###find nearest lat / lon pair
##you may want to vectorise this (see the sketch below)
##CENTROIDS is assumed to be a data.table of district centroids with
##columns x and y; GPCC.coords holds the grid cell coordinates (delx, dely)
NEAREST<-NULL
for(k in 1:nrow(CENTROIDS)) {
cat(k," ")
##distance from centroid k to every grid cell
temp<-distHaversine(CENTROIDS[k,c("x","y"),with=F],
GPCC.coords[, c("delx","dely"), with=F])
##keep the row of the closest grid cell
NEAREST<-rbind(NEAREST, cbind(CENTROIDS[k], GPCC.coords[which.min(temp)]))
}
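Since the loop grows NEAREST with rbind on every iteration, it gets slow with many districts. Below is a vectorised sketch using geosphere::distm, under the same assumptions about CENTROIDS and GPCC.coords as above; beware that the full distance matrix can eat a lot of memory if you keep many grid cells.

##districts x grid cells distance matrix
D<-distm(as.matrix(CENTROIDS[,c("x","y"),with=F]),
as.matrix(GPCC.coords[,c("delx","dely"),with=F]), fun=distHaversine)
##row-wise index of the nearest grid cell for each centroid
idx<-apply(D, 1, which.min)
NEAREST<-cbind(CENTROIDS, GPCC.coords[idx])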
Quick question (without looking at your code): how do you define a raster cell as “non-land”? The usual problem is that if you intersect with a high-res polygon, you are in trouble defining how much land is needed to count a cell as “land”. This is especially relevant for islands.
This is a good point, especially since the raster cells of the GPCC data are relatively large, 0.5 by 0.5 degree. To be honest, I actually don't know. The raw data codes missing data with a rainfall value of -99999, so I simply drop these raster cells from my analysis. I have not checked whether it occurs that a cell is coded as -99999 in one year but has a non-negative rainfall value in another year. I presume, however, that at the GPCC they use some high-level shapefile for land cover and simply crop the interpolation output for points that fall outside what is considered land.
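Concretely, with the GPCC object built by the code above, where V1 is the rainfall rate column as named by read.table, the drop is just:

GPCC<-GPCC[V1 != -99999]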
I would actually be curious to know what they do.