# Correction For Spatial And Temporal Auto-Correlation In Panel Data: Using R To Estimate Spatial HAC Errors Per Conley

tl;dr: Fast computation of standard errors that allows for serial and spatial auto-correlation.

Economists and political scientists often employ panel data that track units (e.g., firms or villages) over time. When estimating regression models using such data, we often need to be concerned about two forms of auto-correlation: serial (within units over time) and spatial (across nearby units). As Cameron and Miller (2013) note in their excellent guide to cluster-robust inference, failure to account for such dependence can lead to incorrect conclusions: “[f]ailure to control for within-cluster error correlation can lead to very misleadingly small standard errors…” (p. 4).

Conley (1999, 2008) develops one commonly employed solution. His approach allows for serial correlation over all (or a specified number of) time periods, as well as spatial correlation among units that fall within a certain distance of each other. For example, we can account for correlated disturbances within a particular village over time, as well as between that village and every other village within one hundred kilometers. As with serial correlation, spatial correlation can be positive or negative. It can be made visually obvious by plotting, for example, residuals after removing location fixed effects.

Example Visualization of Spatial Correlation from Radil, S. Matthew, Spatializing Social Networks: Making Space for Theory In Spatial Analysis, 2011.

We provide a new function that allows R users to more easily estimate these corrected standard errors. (Solomon Hsiang (2010) provides code for STATA, which we used to test our estimates and benchmark speed.) Moreover using the excellent lfe, Rcpp, and RcppArmadillo packages (and Tony Fischetti’s Haversine distance function), our function is roughly 20 times faster than the STATA equivalent and can scale to handle panels with more units. (We have used it on panel data with over 100,000 units observed over 6 years.)

This demonstration employs data from Fetzer (2014), who uses a panel of U.S. counties from 1999-2012. The data and code can be downloaded here.

#### STATA Code:

We first use Hsiang’s STATA code to compute the corrected standard errors (spatHAC in the output below). This routine takes just over 25 seconds.

cd "~/Dropbox/ConleySEs/Data"
clear; use "new_testspatial.dta"

tab year, gen(yy_)
tab FIPS, gen(FIPS_)

timer on 1
ols_spatial_HAC EmpClean00 HDD yy_*FIPS_2-FIPS_362, lat(lat ) lon(lon ) t(year) p(FIPS) dist(500) lag(5) bartlett disp

# *-----------------------------------------------
# *    Variable |   OLS      spatial    spatHAC
# *-------------+---------------------------------
# *         HDD |   -0.669     -0.669     -0.669
# *             |    0.608      0.786      0.838

timer off 1
timer list 1
#  1:     24.8 /        3 =      8.2650

#### R Code:

Using the same data and options as the STATA code, we then estimate the adjusted standard errors using our new R function. This requires us to first estimate our regression model using the felm function from the lfe package.

# Loading sample data:
dta_file <- "~/Dropbox/ConleySEs/Data/new_testspatial.dta"
setnames(DTA, c("latitude", "longitude"), c("lat", "lon"))

source("~/Dropbox/ConleySEs/ConleySEs_17June2015.R")

ptm <-proc.time()

# We use the felm() from the lfe package to estimate model with year and county fixed effects.
# Two important points:
# (1) We specify our latitude and longitude coordinates as the cluster variables, so that they are included in the output (m).
# (2) We specify keepCx = TRUE, so that the centered data is included in the output (m).

m <-felm(EmpClean00 ~HDD -1 |year +FIPS |0 |lat +lon,
data = DTA[!is.na(EmpClean00)], keepCX = TRUE)

coefficients(m) %>%round(3) # Same as the STATA result.
   HDD
-0.669 

We then feed this model to our function, as well as the cross-sectional unit (county FIPS codes), time unit (year), geo-coordinates (lat and lon), the cutoff for serial correlation (5 years), the cutoff for spatial correlation (500 km), and the number of cores to use.

SE <-ConleySEs(reg = m,
unit = "FIPS",
time = "year",
lat = "lat", lon = "lon",
dist_fn = "SH", dist_cutoff = 500,
lag_cutoff = 5,
cores = 1,
verbose = FALSE)

sapply(SE, sqrt) %>%round(3) # Same as the STATA results.
        OLS     Spatial Spatial_HAC
0.608       0.786       0.837 
proc.time() -ptm
   user  system elapsed
1.619   0.055   1.844 

Estimating the model and computing the standard errors requires just over 1 second, making it over 20 times faster than the comparable STATA routine.

#### R Using Multiple Cores:

Even with a single core, we realize significant speed improvements. However, the gains are even more dramatic when we employ multiple cores. Using 4 cores, we can cut the estimation of the standard errors down to around 0.4 seconds. (These replications employ the Haversine distance formula, which is more time-consuming to compute.)

pkgs <-c("rbenchmark", "lineprof")
invisible(sapply(pkgs, require, character.only = TRUE))

bmark <-benchmark(replications = 25,
columns = c('replications','elapsed','relative'),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 1, verbose = FALSE),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 2, verbose = FALSE),
ConleySEs(reg = m,
unit = "FIPS", time = "year", lat = "lat", lon = "lon",
dist_fn = "Haversine", lag_cutoff = 5, cores = 4, verbose = FALSE))
bmark %>%mutate(avg_eplased = elapsed /replications, cores = c(1, 2, 4))
  replications elapsed relative avg_eplased cores
1           25   23.48    2.095      0.9390     1
2           25   15.62    1.394      0.6249     2
3           25   11.21    1.000      0.4483     4

Given the prevalence of panel data that exhibits both serial and spatial dependence, we hope this function will be a useful tool for applied econometricians working in R.

#### Feedback Appreciated: Memory vs. Speed Tradeoff

This was Darin’s first foray into C++, so we welcome feedback on how to improve the code. In particular, we would appreciate thoughts on how to overcome a memory vs. speed tradeoff we encountered. (You can email Darin at darinc[at]stanford.edu.)

The most computationally intensive chunk of our code computes the distance from each unit to every other unit. To cut down on the number of distance calculations, we can fill the upper triangle of the distance matrix and then copy it to the lower triangle. With $N$ units, this requires only  $(N (N-1) /2)$ distance calculations.

However, as the number of units grows, this distance matrix becomes too large to store in memory, especially when executing the code in parallel. (We tried to use a sparse matrix, but this was extremely slow to fill.) To overcome this memory issue, we can avoid constructing a distance matrix altogether. Instead, for each unit, we compute the vector of distances from that unit to every other unit. We then only need to store that vector in memory. While that cuts down on memory use, it requires us to make twice as many   $(N (N-1))$  distance calculations.

As the number of units grows, we are forced to perform more duplicate distance calculations to avoid memory constraints – an unfortunate tradeoff. (See the functions XeeXhC and XeeXhC_Lg in ConleySE.cpp.)

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[3] geosphere_1.4-3           sp_1.1-1
[5] lfe_2.3-1709              Matrix_1.2-2
[7] ggplot2_1.0.1             foreign_0.8-65
[9] data.table_1.9.4          dplyr_0.4.2
[11] knitr_1.11

loaded via a namespace (and not attached):
[1] Formula_1.2-1    magrittr_1.5     MASS_7.3-43
[4] munsell_0.4.2    xtable_1.7-4     lattice_0.20-33
[7] colorspace_1.2-6 R6_2.1.1         stringr_1.0.0
[10] plyr_1.8.3       tools_3.2.2      parallel_3.2.2
[13] grid_3.2.2       gtable_0.1.2     DBI_0.3.1
[16] htmltools_0.2.6  yaml_2.1.13      assertthat_0.1
[19] digest_0.6.8     reshape2_1.4.1   formatR_1.2
[22] evaluate_0.7.2   rmarkdown_0.8    stringi_0.5-5
[25] compiler_3.2.2   scales_0.2.5     chron_2.3-47
[28] proto_0.3-10    

# Leveraging R for Job Openings for Economists

Quite a few people emailed me regarding my post on Econ Job Market. This post is about how you can use very basic and simple R tools to help you in sorting through the Job Openings for Economists list from the American Economic Association.This definitely helped me in developing my spreadsheet of places to apply for.

Before I begin I would like to quickly revisit EJM again.

Revisiting Using R for Econ Job Market
It turns out that my post about scraping the EJM website to obtain a listing of the Job Posts was (partly) redundant. The new EJM system available on “myeconjobmarket.org”  provides a facility to download a spreadsheet. Unfortunately, that spreadsheet does not contain the important “position ID”. This position ID is important as you can then construct a link for the applications.

An example:

https://econjobmarket.org/AdDetails.php?posid=2723

https://econjobmarket.org/Apply/PosApp.php?posid=2723

In order for this to work, you ll need to have a current login session open as otherwise, you ll be redirect to the main homepage. I updated the spreadsheet and its available here for download. I emailed EJM to add the job opening ID to their spreadsheet, then you can merge the two spreadsheets.

econjobmarket-01-11-2014

Leveraging R for JOE?

Now I am turning to JOE. As on EJM, you can download the Job Openings. Again, they dont include a link to the job posting. However, you can easily construct this because the Job Posting ID is simply a concatenation of the fields “joe_issue_ID” and “jp_id”, separated with an underscore. This gives the JOE_ID.

https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451008

Now you can try the filtering on JOE to limit the types of postings. But you can also do this in R and you can try to add some features.

Filtering Jobs/ adding a common country code

A first thing I wanted to do is just show you how to filter the job listings and add a common country name or country code.


library(countrycode)
library(data.table)
library(stringr)
options(stringsAsFactors=FALSE)

JOBS$Application_deadline<-as.POSIXct(JOBS$Application_deadline)

###this will keep all full time academic jobs, be it international or just within US

##split out the country
JOBS$country<-gsub("([A-Z]*)( [A-Z]{2,})?(.*)","\\1\\2", JOBS$locations)
###get harmonized country codes...
JOBS$iso3<-countrycode(JOBS$country, "country.name", "iso3c", warn = FALSE)
###transfer application deadline into a date format
JOBS$Application_deadline<-as.POSIXct(JOBS$Application_deadline)
###drop ones that have already passed


When doing this, you will notice something weird. The application deadline is wrong… in quite a few cases.

Consider for example the job posting for an Assistant Professor position at the Harvard Kennedy School (see https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451068).

In the spreadsheet, you will see a deadline of 31.01.2015 – which definately cant be right, because the ASSA meetings are in early January. So how can we fix these up? If you look at the plain text of the posting, you will see that applications will begin to be considered 18-Nov-14… that is much more reasonable…

If you sort applications by the application deadline field provided by JOE, you run the risk of missing out due to quirks like this.

One way around this is to run some regular expressions on the main text field to flag up common date formats. This way you do not need to filter all individual job postings for a date. You can simply look at job postings that seem to have weird application deadlines (like later than December).

A regular expression could take the form:

(November|December) ([0-9]{1,2})(,)? (2014)?

which would map date formats like “November 20,  2014″ or “November 20″. The following code maps a few common date formats and the resulting spreadsheet filtering only academic jobs is attached. This formed the starting point for my job market application spreadsheet.

JOBS$jp_full_text<-as.character(JOBS$jp_full_text)

###OTHER DATE MENTIONED IN DECEMBER / NOVEMBER MENTIONED IN THE FULL TEXT
JOBS$otherdate<-"" whichare<-regexpr("(November|December) ([0-9]{1,2})(,)? (2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE)
JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare)
whichare<-regexpr("([0-9]{1,2}) (November|December)(,)? (2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE) JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare) whichare<-regexpr("([0-9\\.\\/]{1,+})([1-9]\\.\\/]{1,+})(2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE)
JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare)
JOBS$url<-JOBS[, paste("https://www.aeaweb.org/joe/listing.php?JOE_ID=",joe_issue_ID,"_",jp_id,sep="")] The resulting spreadsheet is attached joe-results-refined. # Leveraging R for Econ Job Market [UPDATE] I just was told that the new features on EJM actually allow you to download an XLS spreadsheet of the job listings on EJM. This is accessible when you login to myeconjobmarket.org and is part of their new AIMS (Application and Interview Management System). I wanted to describe a little helper I am using to help refine the places I want to apply at since I am going to be on the Economics Job Market this year. The two main websites were job openings are advertised are: Now JOE has a really nice feature where you can download simply all job openings into a nice spreadsheet. This allows you to browse through and refine your search. Econ Job Market does not have such a feature. The Listing page is quite annoying… If you want more details for a job opening, such as a description of the fields and application requirements, you will have to click on “more info…”. In the JOE spreadsheet, you have all that information at once. I wanted to create a JOE like spreadsheet using the openings from EJM. Some of which, of course, do overlap. But for some reason, some jobs are only on EJM but not on JOE. So how can we use R to help us do that? The first thing I wanted to do is get the simple listings on the main application page from EJM as a spreadsheet. You can simply download the main listings file and extract the pieces of information. Most important is the “posid” field, which is the position ID contained in the EJM URL. This will give you a direct link to the HTML page of the job opening and it also tells you whether you can apply through EJM. This leaves you with a listing of EJM links to jobs, their position ID and the EJM Application Link in case the Job Opening accepts applications through EJM. Now you can proceed to simply download all the HTML files using a batch downloader such as DownThemAll. If you want to do that, you can print out a list: cat(EJM$url, sep="\n")


and enter them into Down Them All. Alternatively you can iterate through the list of links and send HTTP queries to get the details from each job separately. This is part of the next lines of code:

This renders us with a nice csv that we can browse quickly through in Excel…for convenience you can download the listings as of 23-10-2014 below. Good luck for your applications!

econjobmarket-23-10-2014

# Bulk PDF Compression with Imagemagick

Just a quick line of code to bulk compress a range of PDFs using imagemagick on a MacOS. This was necessary as some of the graphs I was using for my talks made the PDF files for the presentation incredibly large.

 FILES="*.pdf" for f in $FILES; do convert -density 300 -depth 8 -quality 85$f $f.png ; done  # Fracking in your neighborhood? Shale- Oil and Gas Economic Impact Map I have been working on a visualisation to highlight the results from my paper Fracking Growth, to highlight the local economic impact of the recent oil and gas boom in the US. There are two key insights. First, there are strong spillover effects from the oil and gas sector. I estimate that every oil and gas sector job created roughly 2.17 other jobs, mainly in the transport, construction and local service sectors. Given that aggregate employment in the oil and gas sector has more than doubled between 2004 and 2013 , increasing from 316,700 to 581,500. Given the estimated multiplier effect, this suggests that aggregate employment increased by between 500,000 to 600,000 . The second insight from my paper is that oil and gas boom induced structural transformations away from tradable goods sectors need not happen. The classical argument in the development literature is simple: a resource boom drives up local prices, which is a problem for all those sectors of the economy that can’t pass higher input costs on to their final goods consumers (Corden and Neary, 1982). In development economics, such mechanisms are also described by the term “Dutch disease”. I argue that such structural transformations need not occur, if there are significant trade costs for the extracted resource. I thought it would be nice to visualise these econometric results through an interactive shiny application. This allows you to see where fracking is happening, and what are its local economic impacts. North Dakota: An Interactive Visualisation of the two findings To illustrate the two results from my research, I developed this shiny app that allows you to study key economic aggregates at the county level and compare them to nation wide averages or levels. This is essentially what the difference-in-difference estimators do in my paper, while controlling for a whole lot of fixed-effects. In the shiny app, if you zoom – for example – into North Dakota, you will see a lot of blue dots. Each dot represents an unconventional oil or gas well. The panel shows the time-variation in the number of wells constructed. Overall employment of all counties visible at the zoom level in the oil and gas sector has gone up from around 3169 in the early 2004 to more than 25,000 by 2012. Employment in the non-oil and gas sectors follows suit, but increases more smoothly. The average weighted unemployment rates of the visible counties looks dramatically different when compared to the rest of the US: unemployment rates in North Dakota, are hovering around 3 percent, while they are above 7 percent in the rest of the US in 2012. The recession did only marginally increase unemployment in North Dakota. The boom in the oil and gas industry implied that there was little effect of the recession to be felt there. This highlights the first finding of my paper. There are significant spillovers from the mining sector into the non-mining sectors. The second finding from my research can be illustrated using the case of North Dakota as well. While monthly average wages have increased significantly, catching up with the US wide average, natural gas prices in North Dakota have significantly gone down, with the average natural gas price for industrial use in North Dakota now being about 20-30% cheaper than in the rest of the US. This highlights the second case made in my paper. But enough of that, lets turn to the Shiny app itself. Shiny App using the Leaflet Javascript Library I was inspired by the interactive map that visualises the super-zip codes and I use a lot of the code from there. The key feature of the shiny app is a reactive function that returns summary statistics for the county centroids that are currently visible given the users zoom level. countiesInBounds <- reactive({ if (is.null(input$map_bounds))
return(EMP[FALSE,])
bounds <- input$map_bounds latRng <- range(bounds$north, bounds$south) lngRng <- range(bounds$east, bounds$west) EMP[longitude >= latRng[1] & longitude <= latRng[2] & latitude >= lngRng[1] & latitude <= lngRng[2]] }) This function takes the map-bounds as currently set by the zoom level inn the input object and uses it to subset the data object EMP, which contains most of the statistics that are displayed. There is a second similar function for the number of wells in the bounds. This function is called by the functions that create the plot panels. The server.R file is the pasted here: library(shiny) #toinst<-c('jcheng5/leaflet-shiny','trestletech/ShinyDash') #sapply(toinst, function(x) devtools::install_github(x)) library(leaflet) library(ShinyDash) library(RColorBrewer) library(scales) library(lattice) library(plyr) require(maps) shinyServer(function(input, output, session) { ## Interactive Map ########################################### # Create the map map <- createLeafletMap(session, "map") # A reactive expression that returns the set of zips that are # in bounds right now wellsInBounds <- reactive({ if (is.null(input$map_bounds))
return(WELLS.data[FALSE,])
bounds <- input$map_bounds latRng <- range(bounds$north, bounds$south) lngRng <- range(bounds$east, bounds$west) subset(WELLS.data, latitude >= latRng[1] & latitude <= latRng[2] & longitude >= lngRng[1] & longitude <= lngRng[2]) }) countiesInBounds <- reactive({ if (is.null(input$map_bounds))
return(EMP[FALSE,])
bounds <- input$map_bounds latRng <- range(bounds$north, bounds$south) lngRng <- range(bounds$east, bounds$west) EMP[longitude >= latRng[1] & longitude <= latRng[2] & latitude >= lngRng[1] & latitude <= lngRng[2]] }) output$plotSummaryLocal <- renderPlot({
LOCAL<-countiesInBounds()[,list(NonMiningEmpC=sum(NonMiningEmpC, na.rm=TRUE), EmpCClean21=sum(EmpCClean21, na.rm=TRUE)), by=c("year")]
WELLS<-wellsInBounds()[,.N, by=c("year")][order(year)]
WELLS<-data.table(cbind("year"=WELLS$year,"N"=cumsum(WELLS$N)))
LOCAL<-join(LOCAL,WELLS)
if(any(!is.na(LOCAL$EmpCClean21)) & any(!is.na(LOCAL$N))) {
par(mfrow=c(1,3),oma=c(0,0,0,0),mar=c(4,1,3,0))

plot(LOCAL[,c("year","NonMiningEmpC"),with=F], xlab="Year",main="Non-Mining Sector",type="l",col="red",cex.main=1.5,cex.lab=1.5)
plot(LOCAL[,c("year","EmpCClean21"),with=F], xlab="Year",main="Oil and Gas Sector",type="l",col="red",cex.main=1.5,cex.lab=1.5)
plot(LOCAL[,c("year","N"),with=F], xlab="Year",main="Number of Wells",type="l",col="red",cex.main=1.5,cex.lab=1.5)
}
else {

}

})

output$plotSummaryMacro <- renderPlot({ par(mfrow=c(1,3),oma=c(0,0,0,0),mar=c(4,1,3,0)) ###plot unemployment rate, earnings per worker, LOCAL<-countiesInBounds()[,list(indgasprice=sum(indgasprice * overallemp,na.rm=TRUE)/sum(overallemp,na.rm=TRUE), unemploymentrate=100*sum(unemploymentrate*labourforce,na.rm=TRUE)/sum(labourforce,na.rm=TRUE), EarnS=sum(EarnS*labourforce, na.rm=TRUE)/sum(labourforce,na.rm=TRUE)), by=c("year")] LOCAL<-join(NATIONAL,LOCAL) if(any(!is.na(LOCAL$unemploymentrate))) {
plot(LOCAL[,c("year","unemploymentrate"),with=F], ylim=c(min(min(LOCAL$unemploymentrate),min(LOCAL$natunemploymentrate)), max(max(LOCAL$unemploymentrate),max(LOCAL$natunemploymentrate))), xlab="Year",main="Unemployment",type="l",col="red",cex.main=1.5,cex.lab=1.5)
lines(LOCAL[,c("year","natunemploymentrate"),with=F], col="blue", lty=2)
plot(LOCAL[,c("year","indgasprice"),with=F], ylim=c(min(min(LOCAL$indgasprice),min(LOCAL$natindgasprice)), max(max(LOCAL$indgasprice),max(LOCAL$natindgasprice))), xlab="Year",main="Natural Gas Price",type="l",col="red",cex.main=1.5,cex.lab=1.5)
lines(LOCAL[,c("year","natindgasprice"),with=F], col="blue", lty=2)
plot(LOCAL[,c("year","EarnS"),with=F], ylim=c(min(min(LOCAL$EarnS),min(LOCAL$natEarnS)), max(max(LOCAL$EarnS),max(LOCAL$natEarnS))), xlab="Year",main="Monthly Wages",type="l",col="red",cex.main=1.5,cex.lab=1.5)
lines(LOCAL[,c("year","natEarnS"),with=F], col="blue", lty=2)
} else {

}
})

####FOR DATA EXPLORER
output$datatable <- renderDataTable({ EMP[, c("year","Area","population","shaleareashare","TPI_CI","EarnS","unemploymentrate","EmpCClean21","NonMiningEmpC","indgasprice","peind"),with=F] }) session$onFlushed(once=TRUE, function() {
paintObs <- observe({

# Clear existing circles before drawing
map$clearShapes() # Draw in batches of 1000; makes the app feel a bit more responsive chunksize <- 1000 wellplot<-wellsInBounds() if(nrow(wellplot)>0) { if(nrow(wellplot)>5000) { wellplot<-wellplot[WID %in% sample(wellplot$WID,5000)]
}
for (from in seq.int(1, nrow(wellplot), chunksize)) {
to <- min(nrow(wellplot), from + chunksize)
chunkplot <- wellplot[from:to,]
# Bug in Shiny causes this to error out when user closes browser
# before we get here
try(
map$addCircle( chunkplot$latitude, chunkplot$longitude ) ) } } }) # TIL this is necessary in order to prevent the observer from # attempting to write to the websocket after the session is gone. session$onSessionEnded(paintObs$suspend) }) showWellPopup <- function(event) { content <- as.character(paste(event, collapse=",")) map$showPopup(event$lat, event$lng, content)
}
# When map is clicked, show a popup with city info
clickObs <- observe({
map$clearPopups() event <- input$map_shape_click
if (is.null(event))
return()

isolate({
showWellPopup(event)
})
})

session$onSessionEnded(clickObs$suspend)
})

The UI.r

library(leaflet)
library(shiny)
library(ShinyDash)
shinyUI(navbarPage("Fracking Growth", id="nav",

tabPanel("Interactive map",
div(class="outer",

tags$head( # Include our custom CSS includeCSS("styles.css"), includeScript("gomap.js") ), leafletMap("map", width="100%", height="100%", initialTileLayer = "//{s}.tiles.mapbox.com/v3/jcheng.map-5ebohr46/{z}/{x}/{y}.png", initialTileLayerAttribution = HTML('Maps by <a href="http://www.mapbox.com/">Mapbox</a>'), options=list( center = c(37.45, -93.85), zoom = 5, maxZoom=9, maxBounds = list(list(15.961329,-129.92981), list(52.908902,-56.80481)) # Show US only ) ), absolutePanel(id = "controls", class = "modal", fixed = TRUE, draggable = TRUE, top = 60, left = "auto", right = 20, bottom = "auto", width = 450, height = "auto", h2("Shale Oil and Gas Impact Explorer"), # selectInput("color", "Color", vars), # selectInput("size", "Size", vars, selected = "welltype"), h4("Shale Boom over Time"), plotOutput("plotSummaryLocal", height = 150) , plotOutput("plotSummaryMacro", height = 150) , p("Note that all plots and summary statistics are calculated for the wells- or county centroid visible in your current zoom-level") ), tags$div(id="cite",
'Data compiled for ', tags$a('Fracking Growth',href="http://www.trfetzer.com/fracking-growth/"), ' by Thiemo Fetzer.' ) ) ) , tabPanel("Data explorer", hr(), dataTableOutput("datatable") ) ) ) # Shifting Centre of Gravity of US Fossil Fuel Production My research Fracking Growth investigates the localised impacts of the current oil and gas production boom in the US. The boom is triggered by extraction of oil and gas deposits, that were previously not possible to exploit. The shale deposits have become technologically recoverable due to a combination of horizontal drilling technology and hydraulic fracturing, which is extremely controversial as it involves injection of significant amounts of water, sands fused with chemicals beneath the surface. The fact that these shale deposits are geographically located in distinct places within the US can be highlighted through a very simple graph. Taking Energy Information Administration state level crude-oil and natural gas (marketed) production data, we can compute a “center of gravity” of US fossil fuel production – simply by weighting the state centroid coordinates by the energy content of the fossil fuels being extracted (or by their dollar value). The resulting two pictures are worth sharing, because they highlight the dramatic changes that have happened in only recent years. Plotting Centre of Gravity by Energy Content Note that here I am plotting 14 years of data, from 1998 to 2012. What you can see is striking: the centre of gravity of fossil fuel production by heat content has moved clearly towards the north for the whole period, but has made a decided shift towards the East within the past four years. This represents the dramatic production expansion for natural gas in the Marcellus shale formation. Now natural gas is becoming significantly cheaper in the US, as the US essentially becomes a net exporter and is self-sufficient. This allows domestic prices to be significantly lower than elsewhere. In the US, natural gas is now cheaper by a factor of three compared to the UK. This emulates policy makers around the world to consider revising rules of fracking legislation, to stimulate their economies in a similar way. In the UK , most recently, land access rules have been changed to make fracking easier. The fact that US natural gas prices are now dramatically lower than crude oil prices, when comparing them to their respective energy content has implications for how you plot the centre of gravity. If you plot it by fossil fuel value, the picture is dramatically different. Plotting Centre of Gravity by Fossil Fuel Value (in USD) Here, the move in the centre of gravity knows only one direction. It is moving decidedly north west. This is driven by the Bakken oil boom. It is clear that the two pictures indicate the possibility for their to be dramatic limitations when it comes to physical infrastructure. The sudden and quick relocation of production has put strains on the US physical infrastructure to carry crude oil and natural gas to the market. This is a core argument in my research, and I think this is bound to stay for a while, as the regulatory process and construction time for new pipelines and upgrading of refinery capacity is bound to take a few years. Note that my approach to the graph is similar to Danny Quah’s work on studying the centre of economic activity around the world, except that I do not have to worry about projection issues too much (the center of gravity could be beneath the surface of the world in a truely 3-dimensional world). Given a shapefile of US states, that you can e.g. obtain from the GADM , you can compute the centroid of the geographies using the gCentroid function that is part of the rgeos package. Computing Centroids Code:  1 2 3  STATES<-readOGR(dsn="States", layer="states") STATES.pts<-do.call("rbind", lappy(1:length(STATES), function(x) gCentroid(STATES[x,])@coords)) STATES.data<-data.table(cbind(STATES@data,STATES.pts)) # Is rainfall reporting endogenous to conflict? For my paper on the impact of social insurance on the dynamics of conflict in India, I use some new remote sensed weather data. The data comes from the Tropical Rainfall Measuring Mission (TRMM) satellites. The satellite carries a set of five instruments, and is essentially a rainfall radar located in outer space. As a robustness check I needed to verify that my main results go through using other rainfall data. In the paper I try to make a humble case in favour of using remote sensed data where possible. The key reason being that the TRMM data comes from the same set of instruments over time, rather than from input sources that could be varying with e.g., economic conditions. This is a problem that has been identified by climatologist, who try to correct for systematic biases that could arise from the fact that weather stations are more likely to be located in places with a lot of economic activity. At first I was a bit reluctant as it is quite heavy data that needs to be processed. Nevertheless, thorough analysis required me to jump the hoop and obtain secondary rainfall data sources. I chose the GPCC monthly rainfall data for verification of my results, since these have been used by many other authors in the past in similar contexts. The data is based on rain gauge measurements and is available for the past 100 years. The raw data is quite heavy ; the monthly rainfall rate data for the whole world at at 0.5 degree resolution would amount to about 150 million rows of data for the period from 1961-2010. If you drop the non-land grid cells, this reduces the size dramatically to only 40 million rows. Below is a bit of code that loads in the data once you have downloaded the ASCII source files from the GPCC website. On my personal website, I make a dta and an rdata file available for the whole world. There are three variables appearing in that order: (1) the rainfall rate, (2) the rainfall normals and (3) an integer that gives the number of reporting rain gauges that fall in a grid cell in a particular month. It turns out that all my results are robust to using this data. However, I do find something that is quite neat. It turns out that, if a district experienced some insurgency related conflict in the previous year, it is less likely that this district has an active rain gauge reporting data in subsequent years. While it is a no-brainer that places with severe conflict do not have functioning weather reporting, these results suggest that reporting may also be systematically affected in places with relatively low intensity of conflict – as is the case of India. While I do not want to overstate the importance of this, it provides another justification of why it makes sense for economists to be using remotely sensed weather data. This is not to say that ground based data is not useful. Quite the reverse, ground based data is more accurate in many ways, which makes it very important for climatologist. As economist, we are worried about systematic measurement error that correlates with the economic variables we are studying. This is were remote sensed data provides advantages as it does not “decide” to become less accurate in places that are e.g. less developed, suffer from conflict or simply, have nobody living there. Here the function to read in the data and match to district centroids, you need some packages. #########LOAD GPPC NOTE THAT YOU NEED TO SUBSET THE DATA IF YOU DONT WANT TO END UP WITH A HUGE DATA OBJECT loadGPCC<-function(ff, COORDS) { yr<-as.numeric(gsub("(.*)\\_([0-9]{2})([0-9]{4})","\\3",ff)) month<-as.numeric(gsub("(.*)\\_([0-9]{2})([0-9]{4})","\\2",ff)) temp<-data.table(data.frame(cbind(COORDS, read.table(file=paste("Rainfall/gpcc_full_data_archive_v006_05_degree_2001_2010/", ff,sep=""), header=FALSE, skip=14)))) ###YOU COULD SUBSET THE DATA BY EXTENT HERE IF YOU DONT WANT TO GET IT FOR THE WHOLE WORLD ##E.G. SUBSET FOR BY BOUNDING BOX ##temp<-temp[x>=73 & x<=136 & y>=16 & y<=54] temp<-cbind("year"= yr, "month"=month, temp) gc("free") temp } ################ ##### ffs<-list.files("Rainfall/ gpcc_full_data_archive_v006_05_degree_2001_2010") ###THIS DEFINES THE GRID STRUCTURE OF THE DATA ###YOU MAY NEED TO ADJUST IF YOU WORK WITH A COARSER GRID xs=seq(-179.75,179.75,.5) ys=seq(89.75,-89.75,-.5) COORDS<-do.call("rbind", lapply(ys, function(x) cbind("x"=xs,"y"=x))) system.time(GPCC<-do.call("rbind", lapply(1:length(ffs), function(x) loadGPCC(ffs[x], COORDS)))) ###MATCHING THIS TO SHAPEFILE? ##YOU COULD MATCH CENTROIDS OF DISTRICTS TO THE NEAREST GRID CELL - THE FOLLOWING FUNCTION WOULD DO THAT ###find nearest lat / lon pair ##you may want to vectorise this NEAREST<-NULL for(k in 1:nrow(CENTROIDS)) { cat(k," ") temp<-distHaversine(CENTROIDS[k,c("x","y"),with=F], GPCC.coords[, c("delx","dely"), with=F]) NEAREST<-rbind(NEAREST, cbind(CENTROIDS[k],GPCC.coords[which(temp== min(temp))])) } # Deploying Shiny Server on Amazon: Some Troubleshoots and Solutions I really enjoyed Treb Allen‘s tutorial on deploying a Shiny server on an Amazon Cloud Instance. I used this approach for my shiny app that is a map highlighting the economic impact of the recent shale oil and gas boom on the places where the actual extraction happens. The easiest way to proceed is to use the AMI Image, which basically is like a virtual box image just running on Amazon Cloud. It has the basic Shiny-server up and running. Along the way, I came across a few troubleshoots for which there are simple solutions. I cant seem to access the Shiny server through the Browser? Right after the installation and setting up of the Amazon Instance, I tried to access the shiny server using the public DNS, in my case that was Public DNS: ec2-54-84-227-28.compute-1.amazonaws.com However, this did not work since the shiny-server is listening on port 3838 and you need to allow incoming traffic on that port. The way to manage that in the EC2 Dashboard is to go change the security group that is assigned to the instance that you are running. You need to add a rule to allow incoming traffic on port 3838. Once this is done, you should be able to go to your public DNS, in my case the request URL in the browser now is: ec2-54-72-74-90.eu-west-1.compute.amazonaws.com:3838/shale-economic-impact/ in your browser Where are the shiny-apps located? The standard shiny apps that are preinstalled are located in “/var/shiny-server/www” If you ssh into your EC2 instance, you can go to that folder. I installed packages, but my shiny application can not load them? The problem is most likely that you are logged in as ec2-user, where you have your own dedicated library path. In order to install R packages system wide, you need to change to root by doing: sudo -i ##install now your R packages, R CMD INSTALL ... exit The exit part is important as then you turn off administrator rights. When I run the app, I get Javascript Error ”The application unexpectedly exited. Diagnostic information has been dumped to the JavaScript error console.”? It could be that your EC2 instance is not powerful enough. I had that problem because the dataset that was loaded was too big, which creates a time-out. One way to overcome this is to start a medium instance rather than a micro instance. Please be aware that this is not part of the free usage tier and you will be billed for usage. However, an alternative simple fix by editing the config file. It could be that you are hitting a time-out. In the shiny-server configuration help, there are two timeouts that can be set in the free shiny server version. app_init_timeout — Describes the amount of time (in seconds) to wait for an application to start. After this many seconds if the R process still has not become responsive, it will be deemed an unsuccessful startup and the connection will be closed. app_idle_timeout — Defines the amount of time (in seconds) an R process with no active connections should remain open. After the last connection disconnects from an R process, this timer will start and, after the specified number of seconds, if no new connections have been created, the R process will be killed. It could be that the javascript error is thrown, because the R process was killed. You can edit the configuration file to increase the time-out periods, adding: # Instruct Shiny Server to run applications as the user "shiny" run_as shiny; # Define a server that listens on port 3838 server { listen 3838; # Define a location at the base URL location / { # Host the directory of Shiny Apps stored in this directory site_dir /var/shiny-server/www; # Log all Shiny output to files in this directory log_dir /var/log/shiny-server; # When a user visits the base URL rather than a particular application, # an index of the applications available in this directory will be shown. directory_index off; app_init_timeout 250; } } This brings us right to the next question,… Where do I find my shiny server configuration file? There is a hard coded configuration file, but in the search path there is one located in: /etc/shiny-server/shiny-server.conf here you can do the above edits. After you have done the edits you want to reload the configuration… How do I reload the Configuration File, how to start or stop the shiny server? #reload without restarting sudo reload shiny-server #stop the shiny server sudo stop shiny-server #start it... sudo stop shiny-server Copying files from your local machine to the AWS Instance? You can use “scp” for secure copying, e.g. To download files from your instance: scp -i frackingshiny-eu-west-1.pem ec2-user@ec2-54-72-74-90.eu-west-1.compute.amazonaws.com:/var/shiny-server/www/shale-economic-impact.zip To upload files to your instance: scp -r -i frackingshiny-eu-west-1.pem “/Users/thiemo/shale-economic-impact.zip” ec2-user@ec2-54-72-74-90.eu-west-1.compute.amazonaws.com:/var/shiny-server/www/ I plan to add more troubleshoots – if you have come across some error for which you had to find a solution, feel free to comment and I ll amend the list. # Stacking Regressions: Latex Tables with R and stargazer In my paper on the impact of the shale oil and gas boom in the US, I run various instrumental variables specifications. For these, it is nice to stack the regression results one on the other – in particular, to have one row for the IV results, one row for the Reduced Form and maybe one row for plain OLS to see how the IV may turn coefficients around. I found that as of now – there is no way to do that directly; please correct me if I am wrong. The layout I have in mind is as in the screenshot of my table. In Stata, this can be accomplished through the use of esttab with the fragment, and append feature. This appends successive rows to an existing table and you can label the rows using the “refcat” option. However, in R this is not possible as of yet. I have mainly worked with stargazer, as Marek has added felm objects for high dimensional fixed effects to be handled by his stargazer package. The following functions are a “hack” that extracts the particular rows from the generated latex code by the stargazer package. You can then piece the table together by combining the individual elements. The idea is that you have a single stargazer command that is passed to the various functions that extract the different features. Obviuously, this can be done a lot more elegant as the code is extremely hacky, but it does work Marek has said that he is thinking of incorporating a stacking option into stargazer, but for now, my hack works reasonably well. The key thing to realise is that stargazer calls have the option to return The following character strings can be used in the table.layout and omit.table.layout arguments of the stargazer command.  “-” single horizontal line “=” double horizontal line “-!” mandatory single horizontal line “=!” mandatory double horizontal line “l” dependent variable caption “d” dependent variable labels “m” model label “c” column labels “#” model numbers “b” object names “t” coefficient table “o” omitted coefficient indicators “a” additional lines “n” notes “s” model statistics   The following functions will simply extract the rows that are being returned. stargazer.keepcolnums<-function(call="") { command<-gsub("\\)$",", table.layout='#'\\)",command)
call<-eval(parse(text=command))
bounds<-grep("tabular",call)
row1<- call[(bounds[1]+1):(bounds[2]-1)]
row1<-gsub("^\\\\\\\\\$-1\\.8ex\$\\\\hline $", "", row1) row1<-gsub("^\\\\hline \\\\\\\\\$-1\\.8ex\$$","",row1)
row1<-row1[row1!=""]
}
stargazer.keeprow<-function(call="") {
command<-gsub("\\)$",", table.layout='t'\\)",command) call<-eval(parse(text=command)) bounds<-grep("tabular",call) row1<- call[(bounds[1]+1):(bounds[2]-1)] row1<-gsub("^\\\\\\\\\$-1\\.8ex\$\\\\hline$", "", row1)
row1<-gsub("^\\\\hline \\\\\\\\\$-1\\.8ex\$ $","",row1) row1<-row1[row1!=""] } stargazer.keepstats<-function(call="") { command<-gsub("\\)$",", table.layout='s'\\)",command)
call<-eval(parse(text=command))
row1<-gsub("(.*)\\\\\\\\\$-1\\.8ex\$(.*)(\\\\end\\{tabular\\})(.*)","\\2",paste(call,collapse=" "))
row1
}
stargazer.begintable<-function(call="") {
command<-gsub("\\)$",", table.layout='m'\\)",command) call<-eval(parse(text=command)) row1<-paste("\\begin{tabular}",gsub("(.*)\\\\begin\\{tabular\\}(.*)(\\\\end\\{tabular\\})(.*)","\\2",paste(call,collapse="\n")),sep="") row1 } stargazer.varlabels<-function(call="") { command<-gsub("\\)$",", table.layout='d'\\)",command)
call<-eval(parse(text=command))
row1<-paste("\\begin{tabular}",gsub("(.*)\\\\begin\\{tabular\\}(.*)(\\\\end\\{tabular\\})(.*)","\\2",paste(call,collapse="\n")),sep="")
row1
}

stargazer.keepcollabels<-function(call="") {
command<-gsub("\\)$",", table.layout='c'\\)",command) call<-eval(parse(text=command)) row1<-gsub("(.*)\\\\\\\\\$-1\\.8ex\$(.*)(\\\\end\\{tabular\\})(.*)","\\2",paste(call,collapse=" ")) row1 } stargazer.keepomit<-function(call="") { command<-gsub("\\)$",", table.layout='o'\\)",command)
call<-eval(parse(text=command))
row1<-gsub("(.*)\\\\\\\\\$-1\\.8ex\$(.*)(\\\\end\\{tabular\\})(.*)","\\2",paste(call,collapse="\n"))
row1
}

It easiest to see how you can use these functions to construct stacked regression output by giving a simple example.

###the global command to be passed to the hacky ###functions that extract the individual bits
## OLS is a list of OLS results from running the
## lfe command

begintable<-stargazer.begintable(command)
###some multicolumn to combine
collabel<-stargazer.keepcollabels(command)
colnums<-stargazer.keepcolnums(command)
##plain OLS row
row1<-stargazer.keeprow(command)
##the stats part for the OLS (number of Obs, R2)
stats<-stargazer.keepstats(command)
##the rows for the Fixed effect indicators
omitted<-stargazer.keepomit(command)

##the IV command passing a list of IV results in ## IV object

##IV row
row2<-stargazer.keeprow(command)
footer<-c("\\end{tabular}\n" )

###now combine all the items
cat(begintable,collcombine,collabel,colnums,"\\hline \\\\\\emph{Reduced Form} \\\\",row1,""\\hline \\\\\\emph{Reduced Form} \\\\",row2,stats,"\\hline\\hline",footer, file="Draft/tables/energyintensity.tex", sep="\n")

I know the solution is a bit hacky, but it works and does the trick.

# Fracking and House Prices on the Marcellus Shale

Starting last summer I worked on a short project that set out to estimate the potential costs of externalities due to unconventional shale gas production in the Marcellus shale on local house prices using a dataset of roughly 150,000 recently sold houses in Ohio, West Virginia and Pennsylvania.

The data suggests that proximity to a natural gas well is correlated with lower housing prices, which confirms previous studies.

I stopped working on a project that looks at the impact of nearby shale gas extraction on property prices for the Marcellus shale. Instead, I focused on my paper “Fracking Growth” that evaluates the employment consequences of the shale oil and gas boom in the US more generally.

Everybody can have a look at the data and the document as it stands on sharelatex, where I also tried Sharelatex’s Knitr capacities, which are still somewhat limited as a lot of R-packages I usually work with are not yet installed.

The public sharelatex file, the data and the R script can be accessed here:

https://www.sharelatex.com/project/534d232b32ed2b25466b2541?r=6f73efc4&rs=ps&rm=d

Here are some preliminary snippets. The data used in this study comes from Zillow.com In Fall 2013 I downloaded data for recently sold houses. I focused the download to cover all or most of the counties that are somewhat near the Marcellus shale in West Virginia, Ohio and Pennsylvania. This list goes back to 2011 and provides data for 151,156 sold properties.

load(file = "HOUSES.rdata")
library(xtable)
table(HOUSES$year) #### 2011 2012 2013 #### 40087 63248 47821 A simple tabulation suggests that most data is for 2012. Some characteristics that are included in the data are the sale price in USD, the number of bedrooms, number of bathrooms, the built up land size in square feet, the year the property was built and for some properties also the lot size. The properties have geo-coordinates, which are used to intersect the location of the property with census-tract shapefiles. This will allow the adding of further characteristics at the census-tract level to control for general local characteristics. The geo-coordinates are further used to compute the exact distance of a property to the nearest actual or permitted well in the year that the property was sold. Distances are computed in meters by computing the Haversine distance on a globe with radius r = 6378137 meters. The following graph plots the average price per square foot as a function of distance to the nearest well in the year in which the property was sold. I group distances into 500 meter bins. plot(HOUSES[, list(soldprice =sum(soldprice)/sum(sqft)), by=distancecat ], xlab="Distance to Well", ylab="Price per sqft") A first inspection suggests a positive gradient in distance, that is – however, quite non-monotone. Non-monotonic relationship between distance to the nearest oil or gas well and price per sqft. Does this relationship hold up when running a hedonic pricing regression? $log(y_{ict}) = \gamma \times welldistance_{i} + \beta \times X_i + a_c + \eta_t + e_{ict}$ These are estimated using the lfe package, as I introduce quite demanding fixed effects (census-tract and county by year). The lfe package takes these fixed effects out iteratively before running the actual regression on the demeaned data. The results for two chosen bandwidths are presented in the table below. There appears to be a positive gradient – being further away from a well correlates with higher prices per square foot. Regression results comparing sold houses nearby unconventional gas wells on the Marcellus shale Clearly, the important question is whether one can separate out the property price appreciation that is likely to happen due to the local economic boom from the price differentials that may arise due to the presence of the local externalities and whether, one can separate out externalities due to environmental degradation as distinct from price differentials arising due to factors discussed in the beginning: no access to mortgage lending or insurances. Unfortunately, I do not have the time to spend more time on this now, but I think a short paper is still feasible… # Mobility from Mobile Phones I have worked on big data in my work with QuBit in London. In my research I increasingly find the tools I learnt there to be extremely useful. The keywords are smart data management for big data, such as hadoop and hive for querying just the right set of data to work with. I am currently working on Mobile Phone usage data provided by Orange Senegal as part of their data for development challenge. The scope for using big data for development is immense. Many African countries have seen extremely rapid urbanization, essentially moving from an agrarian economy straight into a service sector economy. Mobile technology is being developed in tech hubs across Africa, be it in Dar Es Salaam, Nairobi or in Dakar. Mobile money is a revolution in itself. Development planning in Africa can not rely on tools used in developed countries: there are for example, no sophisticated traffic monitoring systems in place to help planners keep track of traffic flows, dispatch police units to congested spots. Mobile phone usage data may turn out to be an incredibly useful tool for development planners as it theoretically allows an analysis of human mobility along the mobile phone mast network. This allows planners to get a sense of population density and how population moves within an urban setting. This may help in development planning, but can turn out to be useful on many other dimensions. For example, epidemiological models of disease may make find in mobile phone data a useful input to model spread of diseases. In any case, there is little economics literature that has worked with mobile phone use data so far. The two constraints are, first, getting the data in the first place and second, how to work with true “big data”. I have submitted a proposal for the Orange Data for Development Challenge Senegal and obtained the mobile phone Call Detail Records data. This is data collected by mobile phone companies for billing purposes and basically records, for every user, when and where (at which mobile phone tower) they have used their phone, either by receiving or making a call or by receiving or sending a text message. The result is three datasets made available to researchers. The datasets are based on Call Detail Records (CDR) of phone calls and text exchanges between more than 9 million of Orange’s customers in Senegal between January 1, 2013 to December 31, 2013. The datasets are: (1) antenna-to-antenna traffic for 1666 antennas on an hourly basis, (2) fine-grained mobility data on a rolling 2-week basis for a year with bandicoot behavioral indicators at individual level for about 300,000 randomly sampled users, (3) one year of coarse-grained mobility data at arrondissement level with bandicoot behavioral indicators at individual level for about 150,000 randomly sampled users. I want to fine tune into the Dakar region. The capital is capturing most of mobile phone users and its most interesting to study mobility there. This constraints the focus on the fine-grained mobility data for rolling 2 week time windows. Just to get a sense of the type of data we are talking about. Suppose that every of the 300,000 users received was active in every hour of the time window, so that it is a truely balanced hourly panel, the resulting dataset would have 100.8 million rows. It is not possible to hold such amounts of data in the working memory on a conventional computer running Stata. R is a lot more capable to work with big data. First, R provides the necessary architecture to run certain routines for efficient data processing directly using C code. This ensures that big R data objects are not moved around in the working memory, but that operations are done using the “physical storage” address of the data using C pointers. Most prominently the R package “dplyr” developed by Hadley Wickham is extremely useful as it allows running operations on these very huge data frames in just a few seconds. I just wanted highlight here the approach I have taken so far in converting the unwieldy data objects into something useable. The idea is first, to zoom in to the Dakar region. This part is only a small share when it comes to comparing with the size of the country, but it absorbs most of the mobile phone users that are sampled. Out of 300,000 sampled users, on average 120,000 or 40% are located in the Dakar region. Dakar Region only small in size relative to rest of the country The accumulation of users in the Dakar region implies that there the mobile phone network is quite dense. In total there are 1666 antennas reported, but 489, or 29.3% of these antennas are clustered in the Dakar region. Dakar Region and Mobile Phone Network Masts with overlayed .01 degree grid The idea will be to construct measures of mobility at a spatial resolution. A user will only be registered in the antenna that he or she is locked in. Some antennas are extremely close to one another. In order to get a continous measure across space I compute a grid and associate grid cells with mobile phone masts. The above picture is of a 0.01 degree grid (roughly 1 km since we are not far from the equator). Leaving out the Bambilor arrondisement in Rufisque department (the big and sparse chunk on the right), almost every grid cell contains a mobile phone mast. The first approach I have taken is to get a sense of population density: where are people usually located in the morning, evening and afternoon hours. The idea is that, especially, the location in the evening hours is the place in which individuals live (spend the night so to say). The location during the day may reflect the location where individuals are working. To highlight how this looks, consider the picture below. This plots out the average evening location of individuals in the central capital region. This information is displayed in two ways. First, the grid cell coloring is related to the number of people inside that grid cell, on average, for the evening hours in a two week time window. Second, the red dots indicate individual locations. If people were only logged in to one mobile phone mast every evening, then this would show up as a cluster of points around the mobile phone mast. In this case, the information at the grid level is useful. The fact that we see many individual red dots suggests that also in the evening, there is some mobility even if it is only within a grid cell. Average Location of Mobile Phone Users in Evening Hours We can construct crude average morning, afternoon and evening locations. This then allows a construction of a measure of average mobility during the day time in a two week time window. The distances travelled can be visually plotted. The figure below is a first attempt. This is obviously ongoing research… Mobility of Individuals in Dakar: Lines indicate straight line travel distances from the average morning, afternoon and evening location. # Regressions with Multiple Fixed Effects – Comparing Stata and R In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with multiple fixed effects. These fixed effects are useful, because they take out, e.g. industry specific heterogeneity at the county level – or state specific time shocks. The models can take the form: $y_{cist} = \alpha_{ci} + b_{st} + \gamma_{it}+ X_{cist}'\beta + \epsilon_{cist}$ where $\alpha_{ci}$ is a set of county-industry, $b_{ci}$ a set of state-time and $\gamma_{it}$ is a set of industry-time fixed effects. Such a specification takes out arbitrary state-specific time shocks and industry specific time shocks, which are particularly important in my research context as the recession hit tradable industries more than non-tradable sectors, as is suggested in Mian, A., & Sufi, A. (2011). What Explains High Unemployment ? The Aggregate Demand Channel. How can we estimate such a specification? Running such a regression in R with the lm or reg in stata will not make you happy, as you will need to invert a huge matrix. An alternative in Stata is to absorb one of the fixed-effects by using xtreg or areg. However, this still leaves you with a huge matrix to invert, as the time-fixed effects are huge; inverting this matrix will still take ages. However, there is a way around this by applying the Frisch-Waugh Lovell theorem iteratively (remember your Econometrics course?); this basically means you iteratively take out each of the fixed effects in turn by demeaning the data by that fixed effect. The iterative procedure is described in detail in Gaure (2013), but also appears in Guimaraes and Portugal(2010). Simen Gaure has developed an R-package called lfe, which performs the demeaning for you and also provides the possibility to run instrumental variables regressions; it theoretically supports any dimensionality of fixed effects. The key benefit of Simen Gaure’s implementation is the flexibility, the use of C in the background for some of the computing and its support for multicore processing, which speeds up the demeaning process dramatically, especially the larger your samples get.. In Stata there is a package called reg2hdfe and reg3hdfe which has been developed by Guimaraes and Portugal (2010). As the name indicates, these support only fixed effects up to two or three dimensions. Lets see how – on the same dataset – the runtimes of reg2hdfe and lfe compare. Comparing Performance of Stata and R I am estimating the following specification $y_{cist} = \alpha_{ci} + b_{sit} + \gamma_{it}+ X_{cist}'\beta + \epsilon_{cist}$ where $\alpha_{ci}$ is a set of county-industry, $b_{ci}$ a set of state-time fixed effects. There are about 3000 counties in the dataset and 22 industries. Furthermore, there are 50 states and the time period is also about 50 quarters. This means – in total – there are 3000 x 22 = 66,000 county-industry fixed effects to be estimated and 22 x 50 x 50 = 55,000 time fixed effects to be estimated. The sample I work with has sufficient degrees of freedom to allow the estimation of such a specification – I work with roughly 3.7 million observations. I have about 10 covariates that are in $X_{cist}$, i.e. these are control variables that vary within county x industry over state x industry x time. Performance in Stata In order to time the length of a stata run, you need to run  set rmsg on, which turns on a timer for each command that is run. The command I run in stata is Code:  1  reg2hdfe logy x1-x10, id1(sitq ) id2(id) cluster(STATE_FIPS ) You should go get a coffee, because this run is going to take quite a bit of time. In my case, it took t=1575.31, or just about 26 minutes. Performance in R In order to make the runs of reg2hdfe and lfe, we need to set the tolerance level of the convergence criterion to be the same in both. The standard tolerance in Stata is set at $1e^{-6}$, while for lfe package it is set at $1e^{-8}$. In order to make the runs comparable you can set the options in the R package lfe options explicitly: options(lfe.eps=1e-6) The second change we need to make is to disallow lfe to use multiple cores, since reg2hdfe uses only a single thread. We can do this by setting:  options(lfe.threads=1) Now lets run this in R using:   system.time(summary(felm(log(y) ~ x1 + x2 +x3 +x4 + x5 + x6 + x7 +x8 + x9 + x10 + G(id)+G(sitq), data=EMP, cluster=c("STATE_FIPS"))))   The procedure converges in a lot quicker than Stata…    user system elapsed 208.450 23.817 236.831    It took a mere 4 minutes. Now suppose I run this in four separate threads…   user system elapsed 380.964 23.540 177.520      Running this on four threads saves about one minute in processing time; not bad, but not too much gained; the gains from multi-threading increase, the more fixed-effects are added and the larger the samples are. # Classi-Compare of Raster Satellite Images – Before and After For my research on the effect of power outages on fertility , we study a period of extensive power rationing that lasted for almost a whole year and affected most of Latin America, but in particular, it affected Colombia. The key difficult was to determine which areas were exposed to the power-outage and the extent to which this was the case. This is not straightforward, since there does not exist household- or even municipality level consumption data. But here is how R and Satellite Data can help. In particular, we study the night light series obtained from the Defense Meterological Sattelite Program, which has been discussed by Jeffrey before. We simply look for abnormal variation in municipality level light-emitting intensity from 1992 to 1993. Here is some code that generates some Raster-Maps using the package rasterVis , and uses jQuery to generate a fancy before and after comparison to highlight the year-on-year changes in light intensity of 1992 compared to 1993. ###load the raster images tif<-"F101992.v4b_web.stable_lights.avg_vis.tif" f151 = raster(tif) tif<-"F101993.v4b_web.stable_lights.avg_vis.tif" f152 = raster(tif) ##crop a smaller window to plot e = extent(-78,-72,2,8) #e = extent(-80,-78,-4.6,-2) rn= crop(f151, e) rn2= crop(f152, e) ### do a logarithmic transformation to highlight places that receive not much, but some light. rn<-log(rn+1) png("1992.png") p <- levelplot(rn, layers=1, margin=FALSE,col.regions = gray(0:100/100)) p + layer(sp.polygons(COLPOB, lwd=.25, linetype=2, col='darkgray')) dev.off() rn2<-log(rn2+1) png("1993.png") p <- levelplot(rn2, layers=1, margin=FALSE,col.regions = gray(0:100/100)) p + layer(sp.polygons(COLPOB, lwd=.25, linetype=2, col='darkgray')) dev.off() Now with this together, you can create a fancy slider as I have seen on KFOR — comparing satellite pictures of towns before and after a tornado went through them. The code is essentially just borrowed from that TV station and it loads the javascript from their server; it is essentially just a clever use of jQuery and is maybe something that could or is already implemented in an R reporting package? Do you know of such a function? Anyways, all you need is a slider.html page that contains the code referring to the two picture sources; the code is simple: <!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script> <script src="http://cache.ltvcms.com/localtv/tornado2/js/jquery.classycompare.js"></script> <link rel="stylesheet" type="text/css" href="http://cache.ltvcms.com/localtv/tornado2/css/jquery.classycompare.css"> <style type="text/css">.sample1 {width:725px; height:725px;}.sample2 {width:725px; height:725px;}.sample3 {width:725px; height:725px;}</style> </head> <body> <div id="wrapper"> <div class="container_6 clearfix"> <section class="main-section grid_6"> <div class="main-content"> <section class="clearfix"> <div> <div class="container" style="position:relative"> <div class="sample1"> <img src="1992municio.png" alt="Before" height="725px" width="725px"> <img src="1993municio.png" alt="After" height="725px" width="725px"> </div> </div> <script>$(window).load(function() {
\$('.sample1').ClassyCompare({
defaultgap:50,
leftgap:0,
rightgap:10,
caption: true,
reveal: 0.5
});
});
</script> </div>
</section>
</div>
</section>
</div>
</div>

</body>
</html>

This is how it looks — I know the stuff is not perfectly aligned, partly because when cropping the picture I made a mistake and could not be bothered with fixing it.

Have fun!