
## Regressions with Multiple Fixed Effects - Comparing Stata and R


In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with varying levels of fixed effects. These fixed effects are useful because they take out, e.g., industry-specific heterogeneity at the county level or state-specific time shocks. This is particularly important since the recession of 2008/2009 did not hit all states and industries equally hard, as suggested in Mian, A., & Sufi, A. (2011), What Explains High Unemployment? The Aggregate Demand Channel.

The models can take the form:

$y_{cist} = \alpha_{ci} + b_{st} + \gamma_{it}+ X_{cist}'\beta + \epsilon_{cist}$

where $\alpha_{ci}$ is a set of county-industry, $b_{st}$ a set of state-time and $\gamma_{it}$ a set of industry-time fixed effects.

Such a specification takes out arbitrary state-specific time shocks and industry specific shocks, which are particularly important in my research context as the recession hit tradable industries more than non-tradable sectors.

How can we estimate such a specification?
Running such a regression directly with lm in R or reg in Stata will not make you happy, as you will need to invert a huge matrix. In Stata, xtreg or areg will not help either: you can only xtset or absorb one fixed effect, so the remaining fixed effects still enter as dummies and you still have to build and invert a huge matrix.

However, there is a way around this by applying the Frisch-Waugh-Lovell theorem iteratively (remember your Econometrics course?). This basically means you take out each of the fixed effects in turn by demeaning the data by that fixed effect, and you repeat this until the data stop changing. The iterative procedure is described in detail in Gaure (2013), but also appears in Guimaraes and Portugal (2010).
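To make the iterative demeaning concrete, here is a minimal sketch in plain Python. It is not the lfe or reg2hdfe implementation: the toy data, the function names (`demean_by`, `sweep_out`, `slope`) and the stopping rule are all assumptions made for illustration. It sweeps two fixed effects out of `y` and `x` in turn until nothing changes, then runs OLS on the demeaned data.

```python
from collections import defaultdict

def demean_by(values, groups):
    """Subtract the group mean from every value (sweeps out one fixed effect)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def sweep_out(values, group_sets, tol=1e-8, max_iter=10_000):
    """Demean by each fixed effect in turn until the values stop changing."""
    current = list(values)
    for _ in range(max_iter):
        previous = current
        for groups in group_sets:
            current = demean_by(current, groups)
        if max(abs(a - b) for a, b in zip(current, previous)) < tol:
            break
    return current

def slope(y, x):
    """OLS slope of y on one regressor, no intercept (both already demeaned)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Made-up unbalanced toy panel: y = 2 * x + unit effect + period effect.
unit = ["a", "a", "a", "b", "b", "c", "c", "c"]
period = [1, 2, 3, 1, 3, 1, 2, 3]
x = [0.5, 1.0, 3.0, 2.0, 0.0, 1.5, 4.0, 2.5]
unit_fe = {"a": 10.0, "b": -5.0, "c": 3.0}
period_fe = {1: 0.0, 2: 7.0, 3: -2.0}
y = [2.0 * xi + unit_fe[u] + period_fe[t] for xi, u, t in zip(x, unit, period)]

x_tilde = sweep_out(x, [unit, period])
y_tilde = sweep_out(y, [unit, period])
beta_hat = slope(y_tilde, x_tilde)
print(beta_hat)  # close to the true coefficient of 2
```

The point of the sketch is that no dummy matrix is ever built: each pass only needs group means, which is why the approach scales to hundreds of thousands of fixed effects.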

Simen Gaure has developed an R-package called lfe, which performs the demeaning for you and also provides the possibility to run instrumental variables regressions; it theoretically supports any dimensionality of fixed effects. The key benefit of Simen Gaure's implementation is the flexibility, the use of C in the background for some of the computing and its support for multicore processing, which speeds up the demeaning process dramatically.

In Stata there are packages called reg2hdfe and reg3hdfe, which have been developed by Guimaraes and Portugal (2010). As the names indicate, these support fixed effects of only up to two or three dimensions.

Let's see how the runtimes of reg2hdfe and lfe compare on the same dataset.

Comparing Performance of Stata and R

I am estimating the following specification

$y_{cist} = \alpha_{ci} + b_{sit} + X_{cist}'\beta + \epsilon_{cist}$

where $\alpha_{ci}$ are county-industry fixed effects and $b_{sit}$ are state-industry-time fixed effects; the latter absorb any industry-time effects $\gamma_{it}$, so these do not enter separately. There are about 3,000 counties in the dataset and 22 industries. Furthermore, there are 50 states and the time period is about 50 quarters. This means that, in total, there are up to 3,000 x 22 = 66,000 county-industry fixed effects and 22 x 50 x 50 = 55,000 state-industry-time fixed effects to be estimated. The sample I work with has sufficient degrees of freedom to allow the estimation of such a specification: I work with roughly 3.7 million observations.
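The counts above can be checked with a few lines of Python. The sketch also works out, as a back-of-the-envelope assumption, how much memory a dense cross-product matrix of all the dummies would need at 8 bytes per entry, which illustrates why the brute-force dummy-variable regression is not an option. The counts are upper bounds, since not every county has every industry.

```python
counties, industries, states, quarters = 3000, 22, 50, 50
n_obs = 3_700_000

county_industry_fe = counties * industries              # 66,000
state_industry_time_fe = industries * states * quarters  # 55,000
n_dummies = county_industry_fe + state_industry_time_fe  # 121,000

# Dense (n_dummies x n_dummies) matrix of 8-byte floats.
dense_bytes = n_dummies ** 2 * 8

print(county_industry_fe, state_industry_time_fe, n_dummies)
print(f"{dense_bytes / 1e9:.1f} GB for the dense dummy cross-product")
print(f"{n_obs / n_dummies:.1f} observations per fixed effect")
```

So the dummy cross-product alone would take on the order of a hundred gigabytes, while there are still roughly 30 observations per fixed effect.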

I have about 10 covariates that are in $X_{cist}$, i.e. these are control variables that vary within county x industry over state x industry x time.

Performance in Stata

In order to time the length of a Stata run, you need to run set rmsg on, which turns on a timer for each command that is run.

The command I run in Stata is

You should go get a coffee, because this run is going to take quite a bit of time. In my case, it took t=1575.31 seconds, or just about 26 minutes.

Performance in R
In order to make the runs of reg2hdfe and lfe comparable, we need to set the tolerance of the convergence criterion to be the same in both. The default tolerance in Stata is $10^{-6}$, while for the lfe package it is $10^{-8}$. To put both on the same footing, you can set the option in the R package lfe explicitly:
options(lfe.eps=1e-6)
The second change we need to make is to stop lfe from using multiple cores, since reg2hdfe uses only a single thread. We can do this by setting:
options(lfe.threads=1)
Now let's run this in R using:

The procedure converges a lot quicker than in Stata...

It took a mere 4 minutes. Now suppose I run this in four separate threads...

Running this on four threads saves about one minute of processing time: not bad, but not a huge gain. The gains from multi-threading increase as more fixed effects are added and as the samples get larger.

## Classi-Compare of Raster Satellite Images - Before and After


For my research on the effect of power outages on fertility, we study a period of extensive power rationing that lasted for almost a whole year and affected most of Latin America, and Colombia in particular. The key difficulty was to determine which areas were exposed to the power outages and to what extent. This is not straightforward, since there is no consumption data at the household or even the municipality level.

But here is how R and satellite data can help. In particular, we study the night light series obtained from the Defense Meteorological Satellite Program, which has been discussed by Jeffrey before.

We simply look for abnormal variation in municipality level light-emitting intensity from 1992 to 1993.

Here is some code that generates some raster maps using the package rasterVis, and uses jQuery to generate a fancy before-and-after comparison that highlights the year-on-year change in light intensity between 1992 and 1993.

Now with this together, you can create a fancy slider as I have seen on KFOR -- comparing satellite pictures of towns before and after a tornado went through them.

The code is essentially just borrowed from that TV station and it loads the javascript from their server; it is essentially just a clever use of jQuery and is maybe something that could or is already implemented in an R reporting package? Do you know of such a function?

Anyways, all you need is a slider.html page that contains the code referring to the two picture sources; the code is simple:

This is how it looks -- I know the stuff is not perfectly aligned, partly because when cropping the picture I made a mistake and could not be bothered with fixing it.

Have fun!

## rgeos: TopologyException - found non-noded intersection between..


I have been having some issues generating spatial unions and intersections using the rgeos package. The package is extremely powerful, as it serves as an R interface to the powerful GEOS engine.

However, when working with shapefiles or polygons, quite often you will come across a whole range of errors, typically around topology exceptions. These occur in a whole range of applications -- they typically throw errors of the type:

TopologyException: found non-noded intersection between LINESTRING (-59.0479 -1.85389, -59.048 -1.854) and LINESTRING (-59.0482 -1.854, -59.0477 -1.854) at -59.048000000000002 -1.8540000000000001

As becomes evident from this message, the discrepancy only shows up far out in the decimal places, so it should really not be an error at all. Similar issues may arise if you try to create a spatial intersection of two polygons that have different precisions.

What typically works in resolving these issues is a combination of two things.

1. Round the polygon coordinates so that you end up having the same precision if you are creating spatial intersections of polygons coming from different sources. A function that implements this is for example:

2. A second quick fix is to create a buffer area around the polygons you are trying to intersect; here rgeos has a predefined gBuffer function. You just need to specify the width of the buffer and then run the spatial union or intersection with the buffered objects.

In most applications the combination of these two solved all my rgeos spatial join issues.
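The rounding step from the first fix can be sketched as follows. This is a plain-Python illustration on a ring stored as a list of (x, y) tuples, not an rgeos or sp function: the name `round_ring`, the default of four decimal places and the dropping of consecutive duplicate vertices are assumptions made for the example.

```python
def round_ring(ring, digits=4):
    """Round every (x, y) vertex of a polygon ring to a common precision.

    Vertices that collapse onto the previous one after rounding are dropped,
    and the ring is closed again if rounding broke the closure.
    """
    rounded = []
    for x, y in ring:
        point = (round(x, digits), round(y, digits))
        if not rounded or rounded[-1] != point:
            rounded.append(point)
    if rounded and rounded[0] != rounded[-1]:
        rounded.append(rounded[0])
    return rounded

# Hypothetical ring with the kind of noise seen in the error message above.
ring = [
    (-59.04790001, -1.85389),
    (-59.048000000000002, -1.8540000000000001),
    (-59.04800004, -1.85400003),
    (-59.0482, -1.8541),
    (-59.04790001, -1.85389),
]
print(round_ring(ring))
```

Applying the same `digits` to both layers before intersecting them puts their vertices on a common grid, which is the idea behind the first fix.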

## Computing Maritime Routes in R


Thanks to the attention my paper on the cost of Somali piracy has received, a lot of people have approached me to ask how I computed the maritime routes. It is not a very difficult task using R. The key ingredient is a map of the world that can be rasterized into a grid; all the landmass needs to be assigned an infinite cost of crossing; and, last but not least, one needs to compute the actual routes.

What packages do I need?

The package gdistance does most of the actual work of computing the routes. The wrld_simpl map provides what is needed to generate a raster.

Generating a Raster

After the raster is generated, we can proceed by making landmass impassable for vessels.

There are a few more things to do, such as opening up the Suez Canal and some other maritime passages -- one needs to find the right grid cells for this task. In the next step we can transform the raster into a transition layer matrix, that comes from the gdistance package. It is a data construct that essentially tells us how one can move from one cell to the other -- you can allow diagonal moves by allowing the vessel to move into all 8 adjacent grid cells. There is also a geo-correction necessary, as the diagonals are longer distances than the straight-line moves.

Well -- and that's basically it. Of course, there are a few bits and pieces that need additional work, like adding heterogeneous costs, as one can imagine exist due to maritime currents and so on. Furthermore, there is a whole logic surrounding the handling of the output and the storing in a local database for further use and so on.

But not to bore you with that -- how can I obtain the distance between A and B? This uses Dijkstra's Algorithm and is called through the gdistance function "shortestPath".
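For readers who want to see the mechanics without gdistance, here is a small self-contained sketch of the same idea in Python: a grid where land is impassable, moves into all 8 adjacent cells, diagonal steps costing more than straight ones, and Dijkstra's algorithm to find the cheapest route. The grid encoding ("~" for sea, "#" for land) and the function name `shortest_route` are assumptions for illustration, and the distances are planar cell units rather than the geo-corrected distances that gdistance computes.

```python
import heapq
import math

def shortest_route(grid, start, goal):
    """Dijkstra on a grid of strings; returns (distance, list of cells)."""
    rows, cols = len(grid), len(grid[0])
    moves = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        dist, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry
        for dr, dc in moves:
            r, c = cell[0] + dr, cell[1] + dc
            if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] != "~":
                continue  # off the map or land: impassable
            new = dist + math.hypot(dr, dc)  # diagonal steps cost sqrt(2)
            if new < best.get((r, c), math.inf):
                best[(r, c)] = new
                parent[(r, c)] = cell
                heapq.heappush(queue, (new, (r, c)))
    if goal not in best:
        return math.inf, []
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return best[goal], path[::-1]

# Toy map: an island in the middle that the vessel has to sail around.
toy_map = [
    "~~~~~",
    "~###~",
    "~~~~~",
]
distance, route = shortest_route(toy_map, (1, 0), (1, 4))
print(round(distance, 3), route)
```

Opening a passage such as the Suez Canal corresponds to flipping the relevant "#" cells to "~". Note that this sketch lets a vessel cut diagonally between two land cells, which a more careful version would rule out.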

Using this output, you can then generate fancy graphs such as ...

## Starting Multiple Stata Instances on Mac


I found it useful to have multiple Stata instances running on my Mac, in particular, if I use one instance to clean the data before running merge commands. It is always annoying if the merging does not work out or throws an error and then, one would have to clear the current instance and open the DTA file that was messing up the merge.

It's a simple command that allows you to open multiple Stata instances on a Mac:

You can also define an alias command in your .bash_profile:

Good luck!

## R function: generate a panel data.table or data.frame to fill with data


I have started to work with R and Stata together. I like running regressions in Stata, but I do graphs and set up the dataset in R. R clearly has a strong comparative advantage here compared to Stata. I wrote a function that gives me a (balanced) panel structure in R. It then simply works by joining in the additional data.tables or data.frames that you want to add to it.

It consists of two functions:

This first function generates the time vector; you need to tell it what time steps you want it to have.

This second function then generates the panel structure. You need to give it a group vector, for example a vector of district names, and you need to pass it the time vector that the other function created.
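The two-step logic described above can be sketched in a few lines of Python. This is an illustration of the idea rather than the original R functions: the names `time_vector`, `panel_skeleton` and `join_in`, and the choice of a list of dictionaries as the panel container, are assumptions.

```python
from itertools import product

def time_vector(start, end, step=1):
    """All time periods from start to end (inclusive) in the given steps."""
    return list(range(start, end + 1, step))

def panel_skeleton(groups, times):
    """One row for every group x time combination, i.e. a balanced panel."""
    return [{"group": g, "time": t} for g, t in product(groups, times)]

def join_in(panel, data, value_name):
    """Left-join values keyed by (group, time); missing cells become None."""
    for row in panel:
        row[value_name] = data.get((row["group"], row["time"]))
    return panel

# Hypothetical districts and a sparse data source to fill the panel with.
years = time_vector(2000, 2003)
panel = panel_skeleton(["North", "South"], years)
rainfall = {("North", 2000): 812.0, ("South", 2002): 455.5}
panel = join_in(panel, rainfall, "rainfall")
print(len(panel), panel[0], panel[1])
```

Because the skeleton is built first, group-period cells with no data stay in the panel as explicit missing values instead of silently disappearing in the merge.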

Hope this is helpful to some of you.

## Salmon Fishing in Yemen: Somali Piracy and the Fishing industry


We are working on finalizing our paper on Somalian piracy and the effects on the shipping industry. We believe that our paper is the first serious attempt to identify the cost of piracy using a novel dataset and a novel approach. We find that the direct and indirect cost may be between \$1.8 and \$3.0 billion. This is a lot of money compared with the mere \$150 to \$250 million that the piracy activity generates for the pirates. This highlights how large the welfare gains from having a functioning state with working institutions and an established monopoly of power can be.

Clearly, the paper is not capturing all the adjustments that are taking place. In particular, we find that local and regional trade is much more strongly affected by piracy than, e.g., trade from Asia to Europe. This suggests that the piracy burden is borne especially by regional economies, such as Yemen, Kenya, the Seychelles and so on. This got me thinking and I started looking at the impact of piracy on the fishing industry. I first started off with Yemen; however, the data quality is very poor.

Here is something very preliminary... The graph depicts quarterly reported fish catches by the Ministry of Fisheries of the Seychelles and the number of piracy attacks in the vicinity of the Seychelles (roughly in a radius of around 500 miles). Do we believe that the rise in piracy was causing this drop in fish catches, which appears to be persistent?

## Exploring Heterogenous Treatment Effects: Returns to Capital


Jon de Quidt and I recently looked a bit into the data from the paper of De Mel, Woodruff and McKenzie (2008) in Sri Lanka. It is quite an influential paper, experimentally administering capital shocks to microenterprises in Sri Lanka in order to estimate the returns to capital.

Their paper highlighted that the returns to capital "at the bottom of the pyramid" could be very large indeed. In their preferred specification they find that these could be as high as 50-63% per year (in real terms). This suggests that these investments pay off on average. It was one of the first papers that used experimentally generated variation in capital stocks to estimate these returns. That's why it became so influential, and the authors have a set of papers on Mexico and Ghana performing similar estimates.

They point out that there is a lot of heterogeneity in the estimated treatment effects. In particular, they observe that for female entrepreneurs, there is virtually no (average) treatment effect. This and the reasons that could be underlying this observation are explored in a second paper.

We had a look at the De Mel et al (2008) data, which is available here. We essentially ran their regressions of (reported) real profits on the treatment dummy. However, we did this iteratively for each individual treated person, using the whole group of non-treated individuals as the counterfactual. This allows us to get an estimate of the treatment effect for each treated individual.

From this, we get a distribution of treatment effect estimates. The average of this should be the treatment effect that De Mel et al (2008) report in their paper. And indeed it comes quite close. However, what we are intrigued by is the significant heterogeneity in the point estimates for the treatment effect for different individuals.

A key observation is that the treatment effects are very heterogeneous: the mass of enterprises that saw a drop in profits is almost as large as the mass of enterprises that saw a rise. However, some saw a very large rise in real profits. The vertical line is the average of the treatment effect, which here, as we lumped the cash treatments together, is around 900 rupees. The median treatment effect, however, is only 332 rupees. That's some food for thought.
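A stripped-down version of this exercise shows why the average of the individual estimates reproduces the pooled estimate while the median can sit far below it. The sketch assumes the simplest possible case: one profit observation per enterprise and no covariates, so that regressing profits on the treatment dummy for one treated enterprise plus all controls gives that enterprise's profit minus the control mean. The numbers are made up and have nothing to do with the De Mel et al data.

```python
from statistics import mean, median

def individual_effects(treated, controls):
    """Per-enterprise estimate: own profit minus the mean control profit."""
    control_mean = mean(controls)
    return [profit - control_mean for profit in treated]

# Made-up profits: most treated firms barely move, one gains a lot.
controls = [100, 200, 300]
treated = [150, 180, 260, 250, 1160]

effects = individual_effects(treated, controls)
pooled = mean(treated) - mean(controls)  # the usual difference in means

print(effects)
print(mean(effects), pooled, median(effects))
```

In this toy example two of the five enterprises have negative estimates, and a single large gain pulls the mean well above the median, which is the same pattern as in the distribution described above.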

## Removing Multibyte Characters from Strings


I was a bit annoyed by the error when loading a dataset that contains multi-byte characters. R basically just chokes on them. I have not really understood the intricacies of this, but it was basically just an annoyance and since I did not really use these characters in the strings containing them, I just wanted to remove them.

The easiest solution was to use Vim with the following search and replace:

:%s/[\x80-\xFF]//g
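The same clean-up can be done outside of Vim. As a sketch, the following Python function drops every byte in the range 0x80 to 0xFF from raw file contents, which is the byte range targeted by the search and replace above. The function name `strip_high_bytes` and the sample string are assumptions for illustration. Keep in mind that this deletes accented letters entirely rather than transliterating them.

```python
def strip_high_bytes(raw: bytes) -> bytes:
    """Keep only 7-bit ASCII bytes, i.e. drop everything in 0x80-0xFF."""
    return bytes(b for b in raw if b < 0x80)

sample = "Bogotá, Colombia".encode("utf-8")
print(strip_high_bytes(sample))  # b'Bogot, Colombia'
```

To clean a whole file, read it in binary mode (`open(path, "rb")`), pass the contents through the function and write the result back with `"wb"`.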

## The Welfare Cost of Lawlessness: Evidence from Somalian Piracy


Max Weber has shaped the distinction between functioning and failed states. In his words, "a state is a human community that (successfully) claims a monopoly of the legitimate use of physical force within a given territory". In Somalia, the effective monopoly of power has not been with any form of state since 1991, and the country has topped the Failed States Index for the past five consecutive years. The disastrous US intervention in Mogadishu in October 1993 led to a shift in US foreign policy towards non-intervention in Somalia. Unchecked by outside forces, the state further fragmented into several smaller regions that were dominated by warlords. As such it became a refuge for radical Islamists and organised crime. It was only through the rise in piracy throughout the first decade of this century, and in particular due to a sharp increase in piracy attacks in 2008, that the world seemed to notice and care again about the situation in Somalia. The interest may be partly driven by the romantic ideas that Hollywood's pirates inspire, but mostly, as we argue, because Somalian piracy has been an externality. But how costly can such an externality from statelessness be? What is the tax rate at which Somalian piracy taxes world trade? And how does this tax rate compare to an optimal tax rate?

Our recent research aims to answer these questions, thus shedding light on the key questions about the role of institutions in securing trade from predation and theft (see e.g. Dixit (2004)).

Piracy at a Maritime choke point

Our window into obtaining estimates of the "piracy tax" comes from micro-data on individual shipping contracts. This approach is methodologically more powerful than indirect accounting approaches such as the various One Earth Future Foundation Reports (2010, 2011), which lack any counterfactual. We consider the direct link between the risk of piracy attacks and the cost of shipping by studying the impact on chartering rates on maritime routes that vary by (1) whether and (2) the extent over time to which they are exposed to Somalian piracy. Most of the trade between Asia and Europe has to go through the Gulf of Aden and is thus potentially affected by Somalian piracy. The fact that the Gulf of Aden is one of the busiest shipping routes becomes clear by inspection of our data on chartering contracts. Roughly 25% of our ships are travelling through the Gulf of Aden.

This graphical representation highlights how important the Suez Canal is for world trade. Anything that disrupts trade through the Suez Canal has thus the potential of disturbing patterns of trade. The role of the Suez Canal and its impact on trade has been studied in a related study by Feyrer (2009), who looks at the Suez Canal closure as a natural experiment.

We argue that the upsurge in piracy in spring 2008, which becomes evident when studying the monthly time series of attacks, has been disrupting trade and has led to several reactions by the shipping industry. The cost of these reactions is passed on to the charterers and eventually to the consumers of traded goods. The impact on shipping rates is our window through which we estimate the "piracy tax".

Bringing both data sets together, we find that piracy caused an increase in transport costs of around 8%. We identify this increase both from the sudden increase in violence intensity in spring 2008 and from seasonal variation in the intensity.

In particular, early summer is a period of relatively little piracy activity. We show that this is due to the monsoon season, which makes it difficult for pirates to operate in their small vessels. Hence, the estimated effect varies significantly with the season as the risks are lower. This is illustrated in the following picture. The piracy tax is lower in the monsoon season. We check that this drop in shipping rates is not due to less shipping through the area.

Taking econometric studies seriously.

Taking our estimates seriously, the overall costs of Somalian piracy can be obtained by scaling up the point estimate by aggregate measures of trade through the region. Clearly, such estimates will have very large standard errors; however, their foundation is a solid micro-econometric estimate. Given this we find that, at the lowest level, the \$120 million in net revenues that pirates generate are far outweighed by the costs borne by the shipping industry, which lie between \$0.9 billion and \$3.3 billion.

One could compare the revenues generated by piracy to an equivalent tax rate on traffic through the Gulf of Aden or the broader Somalian territorial waters. If we follow this avenue, we estimate that the equivalent tax rate on traffic would be well below 1%. What does this mean? It highlights that a functioning state is able to implement redistributive policies a lot more efficiently than a "roving bandit" (Olson, 2000). Hence, this exercise provides a sense of how much value is generated by installing functioning institutions and a functioning state.

Conclusions

Is there a solution to Somalian piracy? As Somalian piracy is a classical externality, there is a need for cooperation to address the problem. Many commentators argue that the only long-lasting solution is to provide support on the ground in Somalia. However, international cooperation is difficult to muster, due to the varying geo-political interests of the players involved.

History is full of anecdotes suggesting that this problem is not new. Consider, for example, the correspondent's report on Chinese piracy in The London and China Telegraph of 4th February 1867, which noted:

Besides we are not the only Power with large interests at stake. French, Americans, and Germans carry on an extensive trade [...] Why should we then incur singly the expense of suppressing piracy if each provided a couple of gunboats the force would suffice for the safety foreign shipping which is all that devolves upon [..] why should the English tax payer alone bear the expense?

This sentiment highlights the public good aspect of travel on global waterways. This research falls short of providing policy suggestions; however, it highlights that piracy is most likely one of the costliest ways of making transfers to Somalia.

## Downloading All Your Pictures From iPad or iPhone


I really dislike iTunes; it is the worst piece of software I have ever come across. I would say that Windows has been getting better and better. I had the following problem: I uploaded quite a few pictures via iTunes onto my iPad, just because it's nice to look at pictures on that machine. However, the machine with which I did the syncing broke and needed repair and somehow, I forgot to save these pictures onto a hard drive for backup. So the only place where these pictures now rest is on my iPad.

iTunes won't allow you to copy pictures from your iPad onto a machine (only the pictures that you actually take with the iPad). This is because these pictures *should* be on the machine with which you synced your iPad in the first place.

However, this was not true in my case anymore. One option is to invest some money and purchase an app that allows you to copy your picture albums from the iPad onto a Windows machine.

There is e.g. CopyTrans Suite, which is a bit costly and in the version I tried, did not copy the full resolution of the pictures (which is a rip-off!).

So I was looking into a cheap and quick solution to get the original full resolution pictures down from my iPad.

Setting things up: installing free app "WiFi Photo"
This app basically makes your photo albums available on a local webserver. Once you start the app on the iPad, it tells you a URL to browse to on your local machine. There you can see all the pictures that are on your iPad.

You could now use this app to manually download the pictures, however, it is limited to 100 pictures at once and you will not get the full resolution pictures if you do a batch download.

If you browse through the app, you will notice that the URL to the full resolution pictures has the following form:

http://192.168.1.6:15555/0/fr_564.jpg

where the "0" stands for the album ID. If you have, say, 2 albums on the iPad, this would take the values "0" or "1". Images are stored as consecutive numbers in each album, so the link above goes to picture number 564 in full resolution in album 0. We will exploit this structure to do an automated batch download.

Doing an automated batch download

First, in order for this to work you need to get a local PHP installation up and running. If you are really lazy, you could just install XAMPP. However, you can implement the code in any other language, e.g. in R as well.

To download all the pictures, you need to adjust and run the following script

The script iterates through the albums (the first loop); in my case I have four albums. The second loop then iterates through the pictures; I simply assume that there are at most 1000 pictures in each album. Clearly, this can be made smarter, i.e. by automatically finding out how many pictures are in each album, but this works and that's all we need.
I would recommend running the script a few times, as sometimes it is not able to retrieve the content and then no file is created. By adding the "file_exists" check, I make sure that no picture that has already been downloaded is downloaded again. So if you run the script several times, each run will be quicker and will pick up the last missing pictures.
Running the script takes some time as it needs to copy down each picture, and in my case there were roughly 2000 pictures. But now, they are back in the safe haven of my local machine.
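As mentioned, the same logic works in other languages. Here is a sketch of the two loops in Python using only the standard library. It is not the original PHP script: the function names, the output file naming (`album0_564.jpg`), the assumption that picture numbers start at 0 and the injectable `fetch` argument are all choices made for this illustration. The URL pattern is the one shown above.

```python
import os
import urllib.error
import urllib.request

def picture_url(host, album, picture):
    """Full-resolution URL following the pattern exposed by the app."""
    return f"http://{host}/{album}/fr_{picture}.jpg"

def fetch_http(url, timeout=30):
    """Return the response body, or None if the picture cannot be retrieved."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError):
        return None

def download_albums(host, n_albums, max_pictures, target_dir, fetch=fetch_http):
    """Loop over albums and picture numbers; skip files that already exist."""
    os.makedirs(target_dir, exist_ok=True)
    saved = 0
    for album in range(n_albums):
        for picture in range(max_pictures):
            path = os.path.join(target_dir, f"album{album}_{picture}.jpg")
            if os.path.exists(path):
                continue  # already downloaded in an earlier run
            content = fetch(picture_url(host, album, picture))
            if not content:
                continue  # nothing retrieved, so no file is created
            with open(path, "wb") as handle:
                handle.write(content)
            saved += 1
    return saved

# Example call (adjust host, album count and folder to your setup):
# download_albums("192.168.1.6:15555", 4, 1000, "ipad_pictures")
```

Because existing files are skipped, the function can simply be called again until it returns 0, which mirrors the advice above to run the script several times.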

## Internet Connection Sharing via Ad-Hoc network for iPhone and iPad


As I have to be in hospital for a few days in Germany, I bought myself a SIM card and a one-month mobile internet flat rate. I have a netbook, an EEPC 1000go with an integrated mobile broadband card.

However, in hospital I rather wanted to use my iPad and my iPhone to read papers and so on. As both are from the UK and I have not unlocked them, I wanted to indirectly use the internet on these devices through my netbook. I was amazed by how quick and easy it was setting things up....

## Microfinance Map of India - another go...


I gave it another go, trying to get a map that looks a bit nicer. This time, I tried to compute something like a density or intensity in a certain area. On the previous map, this was not visible very well. I used ggplot2 and a bit of R code, together with RGoogleMaps to produce the following picture:

This map displays the intensity of microfinance institution headquarter distribution across India. The data comes from the MIX Market.

The fact that many MFIs are clustered in the south is highlighted quite strongly. What this graph does not take into account, however, is their variable size. This is problematic and I agree that this needs further refinement, i.e. the intensity should take into account how big an MFI is. However, I would conjecture that this would merely make the contrasts in such a map stronger.