Stacking Regressions: LaTeX Tables with R and stargazer

In my paper on the impact of the shale oil and gas boom in the US, I run various instrumental variables specifications. For these, it is nice to stack the regression results one on top of the other – in particular, to have one row for the IV results, one row for the reduced form, and maybe one row for plain OLS to see how the IV may turn coefficients around.

I found that as of now – there is no way to do that directly; please correct me if I am wrong.

The layout I have in mind is as in the screenshot of my table.

Sample Table with Stacked Regression Results

In Stata, this can be accomplished using esttab with the fragment and append options. This appends successive rows to an existing table, and you can label the rows using the "refcat" option.

However, in R this is not possible as of yet. I have mainly worked with stargazer, since Marek has added support for felm objects (for high-dimensional fixed effects) to his stargazer package.

The following functions are a "hack" that extracts particular rows from the LaTeX code generated by the stargazer package.

You can then piece the table together by combining the individual elements. The idea is that you have a single stargazer command that is passed to the various functions that extract the different features. Obviously, this could be done a lot more elegantly, as the code is extremely hacky, but it does work.

Marek has said that he is thinking of incorporating a stacking option into stargazer, but for now, my hack works reasonably well. The key thing to realise is that stargazer calls have the option to return only selected parts of the table via the table.layout argument.

The following character strings can be used in the table.layout and omit.table.layout arguments of the stargazer command.

"-"  single horizontal line
"="  double horizontal line
"-!" mandatory single horizontal line
"=!" mandatory double horizontal line
"l"  dependent variable caption
"d"  dependent variable labels
"m"  model label
"c"  column labels
"#"  model numbers
"b"  object names
"t"  coefficient table
"o"  omitted coefficient indicators
"a"  additional lines
"n"  notes
"s"  model statistics
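For instance, a call like the following (with OLS standing in for a hypothetical list of fitted models) returns only the coefficient rows of the LaTeX table as a character vector, which is exactly what the extraction functions below exploit:

library(stargazer)
## return only the coefficient table ('t') as a character vector of LaTeX lines
tab <- stargazer(OLS, table.layout = "t", header = FALSE)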


The following functions will simply extract the rows that are being returned.

stargazer.keepcolnums <- function(command="") {
  # ask stargazer to return only the model numbers ('#')
  command <- gsub("\\)$", ", table.layout='#'\\)", command)
  call <- eval(parse(text=command))
  # keep the lines between \begin{tabular} and \end{tabular} and drop the rules
  bounds <- grep("tabular", call)
  row1 <- call[(bounds[1]+1):(bounds[2]-1)]
  row1 <- gsub("^\\\\\\\\\\[-1\\.8ex\\]\\\\hline $", "", row1)
  row1 <- gsub("^\\\\hline \\\\\\\\\\[-1\\.8ex\\] $", "", row1)
  row1[row1!=""]
}
stargazer.keeprow <- function(command="") {
  # ask stargazer to return only the coefficient table ('t')
  command <- gsub("\\)$", ", table.layout='t'\\)", command)
  call <- eval(parse(text=command))
  bounds <- grep("tabular", call)
  row1 <- call[(bounds[1]+1):(bounds[2]-1)]
  row1 <- gsub("^\\\\\\\\\\[-1\\.8ex\\]\\\\hline $", "", row1)
  row1 <- gsub("^\\\\hline \\\\\\\\\\[-1\\.8ex\\] $", "", row1)
  row1[row1!=""]
}

stargazer.keepstats <- function(command="") {
  # ask stargazer to return only the model statistics ('s')
  command <- gsub("\\)$", ", table.layout='s'\\)", command)
  call <- eval(parse(text=command))
  gsub("(.*)\\\\\\\\\\[-1\\.8ex\\](.*)(\\\\end\\{tabular\\})(.*)", "\\2", paste(call, collapse=" "))
}
stargazer.begintable <- function(command="") {
  # ask stargazer to return only the model labels ('m'), keeping the tabular header
  command <- gsub("\\)$", ", table.layout='m'\\)", command)
  call <- eval(parse(text=command))
  paste("\\begin{tabular}",
        gsub("(.*)\\\\begin\\{tabular\\}(.*)(\\\\end\\{tabular\\})(.*)", "\\2", paste(call, collapse="\n")),
        sep="")
}
stargazer.varlabels <- function(command="") {
  # ask stargazer to return only the dependent variable labels ('d')
  command <- gsub("\\)$", ", table.layout='d'\\)", command)
  call <- eval(parse(text=command))
  paste("\\begin{tabular}",
        gsub("(.*)\\\\begin\\{tabular\\}(.*)(\\\\end\\{tabular\\})(.*)", "\\2", paste(call, collapse="\n")),
        sep="")
}

stargazer.keepcollabels <- function(command="") {
  # ask stargazer to return only the column labels ('c')
  command <- gsub("\\)$", ", table.layout='c'\\)", command)
  call <- eval(parse(text=command))
  gsub("(.*)\\\\\\\\\\[-1\\.8ex\\](.*)(\\\\end\\{tabular\\})(.*)", "\\2", paste(call, collapse=" "))
}

stargazer.keepomit <- function(command="") {
  # ask stargazer to return only the omitted-coefficient (fixed effect) indicators ('o')
  command <- gsub("\\)$", ", table.layout='o'\\)", command)
  call <- eval(parse(text=command))
  gsub("(.*)\\\\\\\\\\[-1\\.8ex\\](.*)(\\\\end\\{tabular\\})(.*)", "\\2", paste(call, collapse="\n"))
}

It is easiest to see how you can use these functions to construct stacked regression output with a simple example.

### the global command to be passed to the hacky
### functions that extract the individual bits
## OLS is a list of OLS results from running the
## felm command of the lfe package
command<-'stargazer(OLS , keep=c("post08shale:directutilityshare","post08shale"), covariate.labels=c("Energy Intensity x Shale","Shale"), header=FALSE, out.header=FALSE, keep.stat=c("n","adj.rsq"))'

begintable<-stargazer.begintable(command)
### a multicolumn header that combines the columns into two groups
collcombine<-c("& \\multicolumn{4}{c}{Tradable Goods Sector Only} & \\multicolumn{3}{c}{Additional Sectors} \\\\ \\cmidrule(lr){2-5} \\cmidrule(lr){6-8}")
collabel<-stargazer.keepcollabels(command)
colnums<-stargazer.keepcolnums(command)
##plain OLS row
row1<-stargazer.keeprow(command)
##the stats part for the OLS (number of Obs, R2)
stats<-stargazer.keepstats(command)
##the rows for the Fixed effect indicators
omitted<-stargazer.keepomit(command)

## the IV command, passing a list of IV results in the IV object
command<-'stargazer(IV , keep=c("post08anywell:directutilityshare","post08anywell"), covariate.labels=c("Energy Intensity x Anywell","Anywell"), header=FALSE, out.header=FALSE, keep.stat=c("n","adj.rsq"))'

##IV row
row2<-stargazer.keeprow(command)
footer<-c("\\end{tabular}\n" )

###now combine all the items
cat(begintable,collcombine,collabel,colnums,"\\hline \\\\ \\emph{Reduced Form} \\\\",row1,"\\hline \\\\ \\emph{IV} \\\\",row2,stats,"\\hline\\hline",footer, file="Draft/tables/energyintensity.tex", sep="\n")

I know the solution is a bit hacky, but it works and does the trick.

 

Fracking and House Prices on the Marcellus Shale

Starting last summer, I worked on a short project that set out to estimate the potential cost of externalities from unconventional shale gas production in the Marcellus shale on local house prices, using a dataset of roughly 150,000 recently sold houses in Ohio, West Virginia and Pennsylvania.

The data suggests that proximity to a natural gas well is correlated with lower housing prices, which confirms previous studies.


I stopped working on a project that looks at the impact of nearby shale gas extraction on property prices for the Marcellus shale. Instead, I focused on my paper “Fracking Growth” that evaluates the employment consequences of the shale oil and gas boom in the US more generally.

Anybody can have a look at the data and the document as it stands on ShareLaTeX, where I also tried ShareLaTeX's knitr capabilities, which are still somewhat limited, as a lot of the R packages I usually work with are not yet installed.

The public sharelatex file, the data and the R script can be accessed here:

https://www.sharelatex.com/project/534d232b32ed2b25466b2541?r=6f73efc4&rs=ps&rm=d

Here are some preliminary snippets. The data used in this study comes from Zillow.com. In Fall 2013, I downloaded data for recently sold houses. I focused the download to cover all or most of the counties that are somewhat near the Marcellus shale in West Virginia, Ohio and Pennsylvania. This list goes back to 2011 and provides data for 151,156 sold properties.

load(file = "HOUSES.rdata")
library(xtable)
table(HOUSES$year)
####  2011  2012  2013
#### 40087 63248 47821

A simple tabulation suggests that most data is for 2012.

Some characteristics that are included in the data are the sale price in USD, the number of bedrooms, number of bathrooms, the built up land size in square feet, the year the property was built and for some properties also the lot size.

The properties have geo-coordinates, which are used to intersect the location of the property with census-tract shapefiles. This will allow the adding of further characteristics at the census-tract level to control for general local characteristics.

The geo-coordinates are further used to compute the exact distance of a property to the nearest actual or permitted well in the year that the property was sold. Distances are computed in meters using the Haversine formula on a globe with radius r = 6378137 meters.
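For reference, the geosphere package provides this computation out of the box. A minimal sketch, assuming made-up columns lon/lat for the property and welllon/welllat for the nearest well:

library(geosphere)
## Haversine distance in meters, using the same earth radius as above
HOUSES$welldistance <- distHaversine(cbind(HOUSES$lon, HOUSES$lat),
                                     cbind(HOUSES$welllon, HOUSES$welllat),
                                     r = 6378137)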

The following graph plots the average price per square foot as a function of distance to the nearest well in the year in which the property was sold. I group distances into 500 meter bins.
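The plotting snippet below uses a column distancecat that is not constructed in the post; one way to build such 500-meter bins, assuming the distance in meters is stored in a (hypothetical) column welldistance, would be:

## cut distances into 500m bins, labelled by their midpoints
HOUSES$distancecat <- cut(HOUSES$welldistance,
                          breaks = seq(0, 20000, by = 500),
                          labels = seq(250, 19750, by = 500))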

plot(HOUSES[, list(soldprice =sum(soldprice)/sum(sqft)), by=distancecat ], xlab="Distance to Well", ylab="Price per sqft")

A first inspection suggests a positive gradient in distance, which is, however, quite non-monotonic.

Non-monotonic relationship between distance to the nearest oil or gas well and price per sqft.

Does this relationship hold up when running a hedonic pricing regression?

[math]log(y_{ict}) = \gamma \times welldistance_{i} + \beta \times X_i + a_c + \eta_t + e_{ict}[/math]

These are estimated using the lfe package, as I introduce quite demanding fixed effects (census-tract and county by year). The lfe package takes these fixed effects out iteratively before running the actual regression on the demeaned data.
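A sketch of what such a call could look like; the variable, fixed-effect and clustering names below are illustrative assumptions, not the actual column names in my dataset:

library(lfe)
## log price per sqft on distance to the nearest well plus house characteristics,
## with census-tract and county-by-year fixed effects taken out iteratively and
## standard errors clustered at the census-tract level
hedonic <- felm(log(soldprice/sqft) ~ welldistance + bedrooms + bathrooms + yearbuilt
                + G(censustract) + G(countyyear),
                data = HOUSES, cluster = c("censustract"))
summary(hedonic)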

The results for two chosen bandwidths are presented in the table below. There appears to be a positive gradient – being further away from a well correlates with higher prices per square foot.

Regression results comparing sold houses near unconventional gas wells on the Marcellus shale.

Clearly, the important question is whether one can separate the property price appreciation that is likely to occur due to the local economic boom from the price differentials that may arise due to the presence of local externalities; and whether one can separate externalities due to environmental degradation from price differentials arising from the factors discussed at the beginning, such as lack of access to mortgage lending or insurance.

Unfortunately, I do not have time to spend on this now, but I think a short paper is still feasible…

 

Mobility from Mobile Phones

I have worked on big data in my work with QuBit in London, and in my research I increasingly find the tools I learnt there to be extremely useful. The keywords are smart data management for big data, such as Hadoop and Hive for querying just the right set of data to work with. I am currently working on mobile phone usage data provided by Orange Senegal as part of their Data for Development challenge. The scope for using big data for development is immense. Many African countries have seen extremely rapid urbanization, essentially moving from an agrarian economy straight into a service-sector economy. Mobile technology is being developed in tech hubs across Africa, be it in Dar es Salaam, Nairobi or Dakar. Mobile money is a revolution in itself.

Development planning in Africa cannot rely on the tools used in developed countries: there are, for example, no sophisticated traffic monitoring systems in place to help planners keep track of traffic flows or dispatch police units to congested spots. Mobile phone usage data may turn out to be an incredibly useful tool for development planners, as it theoretically allows an analysis of human mobility along the mobile phone mast network. This allows planners to get a sense of population density and of how the population moves within an urban setting. This may help in development planning, but it can also turn out to be useful along many other dimensions. For example, epidemiological models may find in mobile phone data a useful input for modeling the spread of diseases.

In any case, there is little economics literature that has worked with mobile phone usage data so far. The two constraints are, first, getting the data in the first place and, second, working with true "big data". I submitted a proposal for the Orange Data for Development Challenge Senegal and obtained the mobile phone Call Detail Records data. This is data collected by mobile phone companies for billing purposes; it basically records, for every user, when and where (at which mobile phone tower) they have used their phone, either by receiving or making a call or by receiving or sending a text message. The result is three datasets made available to researchers. The datasets are based on Call Detail Records (CDR) of phone calls and text exchanges between more than 9 million of Orange's customers in Senegal between January 1, 2013 and December 31, 2013. The datasets are: (1) antenna-to-antenna traffic for 1666 antennas on an hourly basis, (2) fine-grained mobility data on a rolling two-week basis for a year, with bandicoot behavioral indicators at the individual level for about 300,000 randomly sampled users, and (3) one year of coarse-grained mobility data at the arrondissement level, with bandicoot behavioral indicators at the individual level for about 150,000 randomly sampled users.

I want to zoom in on the Dakar region. The capital captures most of the mobile phone users, and it is most interesting to study mobility there. This constrains the focus to the fine-grained mobility data for rolling two-week time windows. Just to get a sense of the type of data we are talking about: suppose that every one of the 300,000 users was active in every hour of the time window, so that it is a truly balanced hourly panel; the resulting dataset would have 100.8 million rows. It is not possible to hold such amounts of data in working memory on a conventional computer running Stata. R is a lot more capable of working with big data.

First, R provides the necessary architecture to run certain routines for efficient data processing directly using C code. This ensures that big R data objects are not moved around in working memory, but that operations are done at the "physical storage" address of the data using C pointers. Most prominently, the R package "dplyr" developed by Hadley Wickham is extremely useful, as it allows running operations on these huge data frames in just a few seconds.
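As an illustration of the kind of operation this enables, here is a hedged sketch, assuming a data frame cdr with columns user_id, antenna_id and a POSIXct timestamp, plus a vector dakar_antennas of antenna IDs in the Dakar region (these names are made up, not the actual variable names in the D4D files):

library(dplyr)
## for every user and hour of the day, count how often they were observed
## at an antenna located in the Dakar region
dakar_activity <- cdr %>%
  filter(antenna_id %in% dakar_antennas) %>%
  mutate(hour = as.integer(format(timestamp, "%H"))) %>%
  group_by(user_id, hour) %>%
  summarise(n_events = n())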

I just wanted to highlight here the approach I have taken so far in converting the unwieldy data objects into something usable. The idea is, first, to zoom in on the Dakar region. The region covers only a small share of the country's area, but it absorbs most of the sampled mobile phone users. Out of 300,000 sampled users, on average 120,000, or 40%, are located in the Dakar region.


The Dakar region is small in size relative to the rest of the country.

The accumulation of users in the Dakar region implies that there the mobile phone network is quite dense. In total there are 1666 antennas reported, but 489, or 29.3% of these antennas are clustered in the Dakar region.


Dakar region and mobile phone network masts with an overlaid 0.01 degree grid.

The idea will be to construct measures of mobility at a fine spatial resolution. A user is only registered at the antenna that he or she is logged into, and some antennas are extremely close to one another. In order to get a continuous measure across space, I compute a grid and associate grid cells with mobile phone masts. The above picture shows a 0.01 degree grid (roughly 1 km, since we are not far from the equator). Leaving out the Bambilor arrondissement in Rufisque department (the big and sparse chunk on the right), almost every grid cell contains a mobile phone mast.
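A quick way to snap masts (or user locations) to such a grid; the data frame masts with columns lon and lat is an illustrative assumption:

grid_size <- 0.01                                   # roughly 1 km near the equator
masts$cell_lon <- floor(masts$lon / grid_size) * grid_size
masts$cell_lat <- floor(masts$lat / grid_size) * grid_size
masts$cell_id  <- paste(masts$cell_lon, masts$cell_lat, sep = "_")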

The first approach I have taken is to get a sense of population density: where are people usually located in the morning, afternoon and evening hours? The idea is that the location in the evening hours, especially, is the place where individuals live (spend the night, so to say), while the location during the day may reflect the place where individuals work. To highlight how this looks, consider the picture below, which plots the average evening location of individuals in the central capital region. This information is displayed in two ways. First, the grid-cell coloring is related to the number of people inside that grid cell, on average, during the evening hours of a two-week time window. Second, the red dots indicate individual locations. If people were only logged into one mobile phone mast every evening, this would show up as a cluster of points around that mast; in this case, the information at the grid level is useful. The fact that we see many individual red dots suggests that there is some mobility even in the evening, even if only within a grid cell.


Average Location of Mobile Phone Users in Evening Hours

We can construct crude average morning, afternoon and evening locations. This then allows the construction of a measure of average mobility during the daytime within a two-week time window. The distances travelled can be plotted visually.
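A rough sketch of this computation, assuming the CDR data sits in a data.table cdr with columns user_id, hour, lon and lat (made-up names; this is not the actual code behind the figure):

library(data.table)
library(geosphere)

## classify each observation into a time-of-day block
cdr[, block := cut(hour, breaks = c(0, 11, 17, 24),
                   labels = c("morning", "afternoon", "evening"),
                   include.lowest = TRUE)]

## average location per user and block within the two-week window
loc <- cdr[, .(lon = mean(lon), lat = mean(lat)), by = .(user_id, block)]

## straight-line distance between the average morning and evening locations
wide <- dcast(loc, user_id ~ block, value.var = c("lon", "lat"))
wide[, dist_m := distHaversine(cbind(lon_morning, lat_morning),
                               cbind(lon_evening, lat_evening))]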

The figure below is a first attempt. This is obviously ongoing research…


Mobility of Individuals in Dakar: Lines indicate straight line travel distances from the average morning, afternoon and evening location.

Regressions with Multiple Fixed Effects – Comparing Stata and R

In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with multiple fixed effects. These fixed effects are useful because they take out, e.g., industry-specific heterogeneity at the county level or state-specific time shocks.

The models can take the form:

[math]y_{cist} = \alpha_{ci} + b_{st} + \gamma_{it}+ X_{cist}'\beta + \epsilon_{cist}[/math]  

where [math]\alpha_{ci}[/math] is a set of county-industry, [math]b_{st}[/math] a set of state-time, and [math]\gamma_{it}[/math] a set of industry-time fixed effects.

Such a specification takes out arbitrary state-specific time shocks and industry-specific time shocks, which are particularly important in my research context, as the recession hit tradable industries harder than non-tradable sectors, as suggested in Mian and Sufi (2011), "What Explains High Unemployment? The Aggregate Demand Channel".

How can we estimate such a specification?
Running such a regression with lm in R or reg in Stata will not make you happy, as you will need to invert a huge matrix. An alternative in Stata is to absorb one of the fixed effects by using xtreg or areg. However, this still leaves you with a huge matrix to invert, as the time fixed effects are numerous; inverting this matrix will still take ages.

However, there is a way around this by applying the Frisch-Waugh-Lovell theorem iteratively (remember your econometrics course?); this basically means you take out each of the fixed effects in turn by demeaning the data by that fixed effect. The iterative procedure is described in detail in Gaure (2013), but also appears in Guimaraes and Portugal (2010).

Simen Gaure has developed an R package called lfe, which performs the demeaning for you and also provides the possibility to run instrumental variables regressions; it theoretically supports any dimensionality of fixed effects. The key benefits of Simen Gaure's implementation are its flexibility, the use of C in the background for some of the computing, and its support for multicore processing, which speeds up the demeaning process dramatically, especially as your samples get larger.

In Stata there are packages called reg2hdfe and reg3hdfe, which have been developed by Guimaraes and Portugal (2010). As the names indicate, these support fixed effects in only up to two or three dimensions.

Let's see how the runtimes of reg2hdfe and lfe compare on the same dataset.

Comparing Performance of Stata and R

I am estimating the following specification

[math]y_{cist} = \alpha_{ci} + b_{sit} + \gamma_{it}+ X_{cist}'\beta + \epsilon_{cist}[/math]  

where [math]\alpha_{ci}[/math] is a set of county-industry and [math]b_{sit}[/math] a set of state-by-industry-by-time fixed effects. There are about 3000 counties in the dataset and 22 industries. Furthermore, there are 50 states and the time period covers about 50 quarters. This means that, in total, there are 3000 x 22 = 66,000 county-industry fixed effects and 22 x 50 x 50 = 55,000 state-by-industry-by-time fixed effects to be estimated. The sample I work with has sufficient degrees of freedom to allow the estimation of such a specification – I work with roughly 3.7 million observations.

I have about 10 covariates that are in [math]X_{cist}[/math], i.e. these are control variables that vary within county x industry over state x industry x time.

Performance in Stata

In order to time the length of a Stata run, you need to run

set rmsg on

which turns on a timer for each command that is run.

The command I run in stata is

reg2hdfe logy x1-x10, id1(sitq) id2(id) cluster(STATE_FIPS)

You should go get a coffee, because this run is going to take quite a bit of time. In my case, it took t=1575.31 seconds, or just about 26 minutes.

Performance in R

In order to make the runs of reg2hdfe and lfe comparable, we need to set the tolerance level of the convergence criterion to be the same in both. The standard tolerance in Stata is set at $$1e^{-6}$$, while for the lfe package it is set at $$1e^{-8}$$. You can set the lfe option explicitly:

options(lfe.eps=1e-6)

The second change we need to make is to prevent lfe from using multiple cores, since reg2hdfe uses only a single thread. We can do this by setting:

options(lfe.threads=1)

Now let's run this in R using:

system.time(summary(felm(log(y) ~ x1 + x2 +x3 +x4 + x5 + x6 + x7 +x8 + x9 + x10 + G(id)+G(sitq),
data=EMP, cluster=c("STATE_FIPS"))))

The procedure converges a lot quicker than in Stata…

 user system elapsed 
208.450 23.817 236.831 

It took a mere 4 minutes. Now suppose I run this in four separate threads…
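That is, I simply allow lfe to use four cores and re-run the same call:

## let lfe use four threads for the demeaning step
options(lfe.threads=4)
system.time(summary(felm(log(y) ~ x1 + x2 +x3 +x4 + x5 + x6 + x7 +x8 + x9 + x10 + G(id)+G(sitq),
data=EMP, cluster=c("STATE_FIPS"))))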

user system elapsed 
380.964 23.540 177.520

 

Running this on four threads saves about one minute in processing time; not bad, but not too much gained. The gains from multi-threading increase the more fixed effects are added and the larger the samples are.

Classi-Compare of Raster Satellite Images – Before and After

For my research on the effect of power outages on fertility, we study a period of extensive power rationing that lasted for almost a whole year and affected most of Latin America, but Colombia in particular. The key difficulty was to determine which areas were exposed to the power outage and to what extent. This is not straightforward, since household- or even municipality-level consumption data does not exist.

But here is how R and satellite data can help. In particular, we study the night light series obtained from the Defense Meteorological Satellite Program, which has been discussed by Jeffrey before.

We simply look for abnormal variation in municipality level light-emitting intensity from 1992 to 1993.

Here is some code that generates raster maps using the package rasterVis and uses jQuery to generate a fancy before-and-after comparison to highlight the year-on-year changes in light intensity from 1992 to 1993.

### load the required packages
library(raster)
library(rasterVis)

### load the raster images (DMSP stable lights)
tif<-"F101992.v4b_web.stable_lights.avg_vis.tif"
f151 = raster(tif)

tif<-"F101993.v4b_web.stable_lights.avg_vis.tif"
f152 = raster(tif)

##crop a smaller window to plot

e = extent(-78,-72,2,8)
#e = extent(-80,-78,-4.6,-2)
rn= crop(f151, e)
rn2= crop(f152, e)

### do a logarithmic transformation to highlight places that receive not much, but some light.

rn<-log(rn+1)
png("1992.png")
p <- levelplot(rn, layers=1, margin=FALSE,col.regions = gray(0:100/100))
p + layer(sp.polygons(COLPOB, lwd=.25, linetype=2, col='darkgray'))
dev.off()

rn2<-log(rn2+1)
png("1993.png")
p <- levelplot(rn2, layers=1, margin=FALSE,col.regions = gray(0:100/100))
p + layer(sp.polygons(COLPOB, lwd=.25, linetype=2, col='darkgray'))
dev.off()

Now with this together, you can create a fancy slider as I have seen on KFOR — comparing satellite pictures of towns before and after a tornado went through them.

The code is essentially borrowed from that TV station, and it loads the JavaScript from their server; it is just a clever use of jQuery and is maybe something that could be, or already is, implemented in an R reporting package? Do you know of such a function?

Anyways, all you need is a slider.html page that contains the code referring to the two picture sources; the code is simple:

 

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
    <script src="http://cache.ltvcms.com/localtv/tornado2/js/jquery.classycompare.js"></script>
    <link rel="stylesheet" type="text/css" href="http://cache.ltvcms.com/localtv/tornado2/css/jquery.classycompare.css">
    <style type="text/css">.sample1 {width:725px; height:725px;}.sample2 {width:725px; height:725px;}.sample3 {width:725px; height:725px;}</style>
  </head>
  <body>
    <div id="wrapper">
      <div class="container_6 clearfix">
        <section class="main-section grid_6">
          <div class="main-content">
            <section class="clearfix">
              <div>
                <div class="container" style="position:relative">
                  <div class="sample1"> <img src="1992municio.png"
                      alt="Before"
                      height="725px"
                      width="725px">
                    <img src="1993municio.png"
                      alt="After"
                      height="725px"
                      width="725px">
                  </div>
                </div>
                <script>
                                        $(window).load(function() {
                                            $('.sample1').ClassyCompare({
                                                defaultgap:50,
                                                leftgap:0,
                                                rightgap:10,
                                                caption: true,
                                                reveal: 0.5
                                            });
                                        });
                                    </script> </div>
            </section>
          </div>
        </section>
      </div>
    </div>

  </body>
</html>

 

This is how it looks — I know the stuff is not perfectly aligned, partly because when cropping the picture I made a mistake and could not be bothered with fixing it.

Have fun!

 

 

rgeos: TopologyException – found non-noded intersection between..

I have been having some issues generating spatial unions and intersections using the rgeos package. The package is extremely powerful, as it serves as an R interface to the powerful GEOS engine.

However, when working with shapefiles or polygons, you will quite often come across errors, typically topology exceptions. These occur in a whole range of applications and typically look like this:

TopologyException: found non-noded intersection between LINESTRING (-59.0479 -1.85389, -59.048 -1.854) and LINESTRING (-59.0482 -1.854, -59.0477 -1.854) at -59.048000000000002 -1.8540000000000001

As becomes evident from this message, the problem occurs somewhere far down in the decimal places, so it should arguably not be an error at all. Related issues may arise if you try to create a spatial intersection of two polygons that have different precisions.

What typically works in resolving these issues is a combination of two things.

 

  1. Round the polygon coordinates so that you end up with the same precision if you are creating spatial intersections of polygons coming from different sources. A function that implements this is, for example:

roundPolygons <- function(shptemp, digitss=3) {
  # round the coordinates of the first Polygon in each Polygons slot
  for(i in 1:length(shptemp)) {
    shptemp@polygons[[i]]@Polygons[[1]]@coords <- round(shptemp@polygons[[i]]@Polygons[[1]]@coords, digits=digitss)
  }
  shptemp
}
  2. A second quick fix is to create a buffer area around the polygons you are trying to intersect; here rgeos has a predefined gBuffer function. You just need to specify the width of the buffer and then run the spatial union or intersection with the buffered objects (see the sketch below).
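A minimal sketch of the buffer trick, with hypothetical SpatialPolygons objects shp1 and shp2:

library(rgeos)
## buffering with width = 0 (or a tiny positive width) often cleans up
## self-intersections before the overlay operation
shp1buf <- gBuffer(shp1, byid=TRUE, width=0)
shp2buf <- gBuffer(shp2, byid=TRUE, width=0)
shpintersect <- gIntersection(shp1buf, shp2buf, byid=TRUE)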

In most applications the combination of these two solved all my rgeos spatial join issues.

Computing Maritime Routes in R

Thanks to the attention my paper on the cost of Somali piracy has received, a lot of people have approached me to ask how I computed the maritime routes. It is not a very difficult task using R. The key ingredient is a map of the world that can be rasterized into a grid; all the landmass needs to be assigned an infinite cost of crossing; and, last but not least, one needs to compute the actual routes.

What packages do I need?

library(gdistance)
library(maptools)
data(wrld_simpl)
library(data.table)

The package gdistance does most of the actual work of computing the routes. The wrld_simpl map provides what is needed to generate a raster.

Generating a Raster

#create a raster from shape files
shp <- wrld_simpl
r <- raster()
r <-rasterize(shp, r, progress='text')

After the raster is generated, we can proceed by making landmass impassable for vessels.

#make all sea = -999
r[is.na(r)] <- -999
#this turns all landmass to missing
r[r>-999] <- NA
#assign unit cost to all grid cells in water
r[r==-999] <- 1

There are a few more things to do, such as opening up the Suez Canal and some other maritime passages — one needs to find the right grid cells for this task. In the next step we can transform the raster into a transition layer matrix, that comes from the gdistance package. It is a data construct that essentially tells us how one can move from one cell to the other — you can allow diagonal moves by allowing the vessel to move into all 8 adjacent grid cells. There is also a geo-correction necessary, as the diagonals are longer distances than the straight-line moves.
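Opening up such a passage amounts to setting the corresponding raster cells back to passable (cost 1) before building the transition layer. A rough sketch, with purely illustrative coordinates for the Suez Canal area:

## mark a few cells along the canal as water (the exact cells depend on the raster resolution)
suez <- cbind(lon=c(32.35, 32.35, 32.55), lat=c(29.9, 30.4, 31.2))
r[cellFromXY(r, suez)] <- 1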

tr <- transition(r, mean, directions = 8)
tr <- geoCorrection(tr, "c")

Well, and that's basically it. Of course, there are a few bits and pieces that need additional work, like adding heterogeneous costs, which one can imagine exist due to maritime currents and so on. Furthermore, there is a whole logic surrounding the handling of the output and its storage in a local database for further use.

But not to bore you with that — how can I obtain the distance between A and B? This uses Dijkstra’s Algorithm and is called through the gdistance function “shortestPath”.

AtoB <- shortestPath(tr, as.numeric(start[1:2]), as.numeric(end[1:2]), output = "SpatialLines")

Using this output, you can then generate fancy graphs such as …
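For instance, along these lines, with purely illustrative start and end coordinates:

start <- c(4.0, 52.0)    # lon/lat, somewhere near Rotterdam (illustrative)
end   <- c(103.8, 1.3)   # lon/lat, somewhere near Singapore (illustrative)
AtoB  <- shortestPath(tr, start, end, output="SpatialLines")

plot(wrld_simpl, col="grey90", border="grey60")
lines(AtoB, col="red", lwd=2)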

 

 

Starting Multiple Stata Instances on Mac

I found it useful to have multiple Stata instances running on my Mac, in particular if I use one instance to clean the data before running merge commands. It is always annoying if the merge does not work out or throws an error; then one would have to clear the current instance and open the DTA file that was messing up the merge.

It's a simple command that allows you to open multiple Stata instances on a Mac:

open -n /Applications/Stata12_OSX/StataSE.app

You can also define an alias in your .bash_profile:

alias stata='open -n /Applications/Stata12_OSX/StataSE.app'

Good luck!

 

R function: generate a panel data.table or data.frame to fill with data

I have started to work with R and Stata together. I like running regressions in Stata, but I do graphs and set up the dataset in R; R clearly has a strong comparative advantage here. I wrote a function that gives me a (balanced) panel structure in R. You then simply join in the additional data.tables or data.frames that you want to merge into it.

It consists of two functions:

timeVector <- function(starttime, endtime, timestep="months") {
  # note: month() and year() below require the data.table (or lubridate) package
  starttime <- as.POSIXct(strptime(starttime, '%Y-%m-%d'))
  endtime   <- as.POSIXct(strptime(endtime, '%Y-%m-%d'))
  if(timestep=="quarters") {
    timestep <- "months"
    ret <- seq(from=as.POSIXct(starttime), to=as.POSIXct(endtime), by=timestep)
    # map months 1-12 into quarters 1-4
    quarter <- gsub("(^[123]{1}$)", 1, month(ret))
    quarter <- gsub("(^[456]{1}$)", 2, quarter)
    quarter <- gsub("(^[789]{1}$)", 3, quarter)
    quarter <- as.numeric(gsub("(^[102]{2}$)", 4, quarter))
    ret <- paste(year(ret), quarter, sep="-")
    ret <- unique(ret)
  } else {
    ret <- seq(from=as.POSIXct(starttime), to=as.POSIXct(endtime), by=timestep)
  }
  ret
}

This first function generates the time vector; you need to tell it what time steps you want it to have.

panelStructure <- function(group, timevec) {
  # repeat the time vector for each group and sort the groups alongside it
  tt  <- rep(timevec, length(group))
  tt2 <- as.character(sort(rep(group, length(timevec))))
  mat <- cbind("group"=data.frame(tt2), "timevec"=data.frame(tt))
  names(mat) <- c("group", "timevec")
  mat
}

This second function then generates the panel structure. You need to give it a group vector, such as a vector of district names, and you need to pass it the time vector that the first function created.
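For example, a quarterly panel for three made-up districts could be generated like this (remember that month() and year() inside timeVector need data.table or lubridate to be loaded):

library(data.table)   # provides the month() and year() helpers used in timeVector

tv    <- timeVector("2010-01-01", "2012-12-31", timestep="quarters")
panel <- panelStructure(group=c("Dakar", "Thies", "Kaolack"), timevec=tv)
head(panel)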

Hope this is helpful to some of you.

 

 

 

Salmon Fishing in Yemen: Somali Piracy and the Fishing industry

We are working on finalizing our paper on Somalian piracy and its effects on the shipping industry. We believe that our paper is the first serious attempt to identify the cost of piracy using a novel dataset and a novel approach. We find that the direct and indirect costs may be between $1.8 and $3.0 billion. This is a lot of money compared with the mere $150–$250 million that piracy generates for the pirates. This highlights how large the welfare gains from having a functioning state, with working institutions and an established monopoly of power, can be.

Clearly, the paper does not capture all the adjustments that are taking place. In particular, we find that local and regional trade is affected by piracy a lot more strongly than, e.g., trade from Asia to Europe. This suggests that the piracy burden is borne especially by regional economies, such as Yemen, Kenya, the Seychelles and so on. This got me thinking, and I started looking at the impact of piracy on the fishing industry. I first started off with Yemen; however, the data quality is very poor.

Here is something very preliminary… The graph depicts quarterly fish catches reported by the Ministry of Fisheries of the Seychelles and the number of piracy attacks in the vicinity of the Seychelles (roughly within a radius of around 500 miles). Do we believe that the rise in piracy caused this drop in fish catches, which appears to be persistent?

Exploring Heterogeneous Treatment Effects: Returns to Capital

Jon de Quidt and I recently looked a bit into the data from the paper of De Mel, Woodruff and McKenzie (2008) in Sri Lanka. It is quite an influential paper, experimentally administering capital shocks to microenterprises in Sri Lanka in order to estimate the returns to capital.

Their paper highlighted that the returns to capital "at the bottom of the pyramid" could be very large indeed. In their favorite specification, they find that these could be as high as 50–63% per year (in real terms), suggesting that these investments pay off on average. It was one of the first papers to use experimentally generated variation in capital stocks to estimate these returns. That's why it became so influential, and the authors have a set of papers on Mexico and Ghana performing similar estimates.

They point out that there is a lot of heterogeneity in the estimated treatment effects. In particular, they observe that for female entrepreneurs, there is virtually no (average) treatment effect. This and the reasons that could be underlying this observation are explored in a second paper.

We had a look at the De Mel et al (2008) data, which is available here. We essentially ran their regressions of (reported) real profits on the treatment dummy. However, we did this iteratively for each individual treated person, using the whole group of non-treated individuals as the counterfactual. This allows us to get an estimate of the treatment effect for each treated individual.
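A rough sketch of this iteration, with made-up object and column names (dat, realprofits, treat, id) and ignoring the panel dimension of the actual survey data:

## each regression uses one treated firm plus all non-treated firms as counterfactual
controls <- subset(dat, treat == 0)
treated  <- subset(dat, treat == 1)

te <- sapply(unique(treated$id), function(i) {
  sub <- rbind(treated[treated$id == i, ], controls)
  coef(lm(realprofits ~ treat, data = sub))["treat"]
})

hist(te, breaks = 50)                 # distribution of individual treatment-effect estimates
abline(v = mean(te), col = "red")     # the mean corresponds to the usual average treatment effect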

From this, we get a distribution of treatment effect estimates. The average of this should be the treatment effect that De Mel et al (2008) report in their paper. And indeed it comes quite close. However, what we are intrigued by is the significant heterogeneity in the point estimates for the treatment effect for different individuals.

A key observation is that the treatment effects are very heterogeneous – the mass of enterprises that saw a drop in profits is almost as large as the mass that saw a rise; however, some saw a very significant and large rise in real profits. The vertical line is the average treatment effect, which here, as we lumped the cash treatments together, is around 900 rupees. The median treatment effect, however, is only 332 rupees. That's some food for thought.

Histogram and Kernel Density Estimate of Treatment Effects for Cash Treatments

Removing Multibyte Characters from Strings

I was a bit annoyed by the error thrown when loading a dataset that contains multi-byte characters; R basically just chokes on them. I have not really understood the intricacies of this, but it was basically just an annoyance, and since I did not really need these characters in the strings containing them, I just wanted to remove them.

The easiest solution was to use Vim with the following search and replace:

:%s/[\x80-\xFF]//g
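If you would rather stay inside R, something along these lines should achieve the same; a hedged sketch with a made-up file name (iconv with sub = "" simply drops anything that cannot be represented in ASCII):

lines <- readLines("data.csv", encoding = "latin1")
lines <- iconv(lines, from = "latin1", to = "ASCII", sub = "")
writeLines(lines, "data_clean.csv")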

The Welfare Cost of Lawlessness: Evidence from Somalian Piracy

Max Weber shaped the distinction between functioning and failed states. In his words, "a state is a human community that (successfully) claims a monopoly of the legitimate use of physical force within a given territory". For Somalia, the effective monopoly of power has not rested with any form of state since 1991, and the country has been topping the failed states index for the past five consecutive years. The disastrous US intervention in Mogadishu in October 1993 led to a shift in US foreign policy towards non-intervention in Somalia. Unchecked by outside forces, the state further fragmented into several smaller regions dominated by warlords. As such, it became a refuge for radical Islamists and organised crime. It was only through the rise in piracy throughout the first decade of this century, and in particular due to a sharp increase in piracy attacks in 2008, that the world seemed to notice and care again about the situation in Somalia. The interest may be partly driven by the romantic ideas that Hollywood's pirates inspire, but mostly, as we argue, because Somalian piracy has been an externality. But how costly can such an externality from statelessness be? What is the tax rate at which Somalian piracy taxes world trade? And how does this tax rate compare to an optimal tax rate?

Our recent research aims to answer these questions, thus shedding light on the key questions about the role of institutions in securing trade from predation and theft (see e.g. Dixit (2004) ).

Piracy at a Maritime choke point 

Our window into obtaining estimates of the "piracy tax" comes from micro-data on individual shipping contracts. This approach is methodologically more powerful than indirect accounting approaches such as the various One Earth Future Foundation reports (2010, 2011), which lack any counterfactual. We consider the direct link between the risk of piracy attacks and the cost of shipping by studying the impact on chartering rates for maritime routes that vary by (1) whether and (2) the extent over time to which they are exposed to Somalian piracy. Most of the trade between Asia and Europe has to go through the Gulf of Aden and is thus potentially affected by Somalian piracy. The fact that the Gulf of Aden is one of the busiest shipping routes becomes clear from inspection of our data on chartering contracts. Roughly 25% of our ships travel through the Gulf of Aden.

This graphical representation highlights how important the Suez Canal is for world trade. Anything that disrupts trade through the Suez Canal thus has the potential to disturb patterns of trade. The role of the Suez Canal and its impact on trade has been examined in a related study by Feyrer (2009), who looks at the Suez Canal closure as a natural experiment.

We argue that the upsurge in piracy in spring 2008, which becomes evident when studying the monthly time series of attacks, has been disrupting trade and has led to several reactions by the shipping industry. The costs of these reactions are passed on to the charterers and eventually to the consumers of traded goods. The impact on shipping rates is the window through which we estimate the "piracy tax".

Bringing both data sets together, we find that piracy caused an increase in transport costs of around 8%. We identify this increase both from the sudden jump in violence intensity in spring 2008 and from seasonal variation in that intensity.

In particular, early summer is a period of relatively little piracy activity. We show that this is due to the monsoon season, which simply makes it difficult for pirates to operate their small vessels. Hence, the estimated effect varies significantly with the season, as the risks are lower. This is illustrated in the following picture: the piracy tax is lower in the monsoon season. We check that this drop in shipping rates is not due to less shipping through the area.

Taking econometric studies seriously. 

Taking our estimates seriously, the overall costs of Somalian piracy can be obtained by scaling up the point estimate by aggregate measures of trade through the region. Clearly, such estimates will have very large standard errors – however, their foundation is a solid micro-econometric estimate. Given this, we find that, at the lowest level, the $120 million in net revenues that pirates generate is far offset by the costs borne by the shipping industry, which lie between $0.9 billion and $3.3 billion.

One could compare the revenues generated by piracy to an equivalent tax rate on traffic through the Gulf of Aden or the broader Somalian territorial waters. If we follow this avenue, we estimate that the equivalent tax rate on traffic would be well below 1%. What does this mean? It highlights that a functioning state is able to implement redistributive policies a lot more efficiently than a "roving bandit" (Olson, 2000). Hence, this exercise provides a sense of how much value is generated by installing functioning institutions and a functioning state.

Conclusions

Is there a solution to Somalian piracy? As Somalian piracy is a classical externality, there is a need for cooperation to address the problem. Many commentators argue that the only long-lasting solution is to provide support on the ground in Somalia. However, international cooperation is difficult to muster due to the varying geopolitical interests of the players involved.

History is full of anecdotes suggesting that this problem is not new. Consider, for example, the correspondent's report on Chinese piracy in The London and China Telegraph of 4 February 1867, which noted:

Besides we are not the only Power with large interests at stake. French, Americans, and Germans carry on an extensive trade […] Why should we then incur singly the expense of suppressing piracy if each provided a couple of gunboats the force would suffice for the safety foreign shipping which is all that devolves upon [..] why should the English tax payer alone bear the expense?

This sentiment highlights the public-good aspect of travel on global waterways. This research falls short of providing policy suggestions; however, it highlights that piracy is most likely one of the costliest ways of making transfers to Somalia.