Writing a book with a little help from Emacs and friends

This post provides technical details about the making of my book “Displaying time series, spatial, and space-time data with R”, a fully reproducible book including hundreds of code chunks and graphics. Hopefully this post will help others when writing a reproducible document.

Tools

Among the tools that can create reproducible documents with R, I decided to use these gems of open-source software:

Continue reading

rasterVis and solaR moved to GitHub

I have just published new versions of solaR and rasterVis on CRAN. From now on both packages will be developed on GitHub.

There are some changes in rasterVis to highlight:

  • vectorplot accepts isField='dXY' to display vector fields defined by vertical and horizontal components.
  • The background region of vectorplot can be defined with a different RasterLayer.
  • horizonplot has a new argument, stat, to define a function to be applied to each zone.

2nd Data Analysis Contest Using R

The V Conference of R Users (Zaragoza, 12-13 December 2013) will include the 2nd Data Analysis Contest Using R, sponsored by Synergic Partners. The challenge proposed by the company is to implement the Support Vector Machines algorithm using MapReduce. The winner will receive a prize of 500 euros. More information is available from Santiago Basaldúa (sbasaldua@synergicpartners.com) and on the conference webpage (in Spanish).

R, GeoJSON and GitHub

GeoJSON is an open computer file format for encoding collections of simple geographical features along with their non-spatial attributes using JavaScript Object Notation. These files can be rendered easily within GitHub repositories. GitHub uses Leaflet.js to represent the data and OpenStreetMap for the underlying map data.
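To make the format concrete, here is a minimal sketch that builds a GeoJSON FeatureCollection by hand using only base R. The station names and coordinates are invented for illustration, and the output file name is an assumption. It only shows how geometry and non-spatial attributes sit side by side in the file; it is not a replacement for a proper writer.

```r
## Hypothetical stations: names and coordinates are invented for illustration
stations <- data.frame(name = c('Station A', 'Station B'),
                       lon = c(-3.70, -0.88),
                       lat = c(40.42, 41.65),
                       stringsAsFactors = FALSE)

## One GeoJSON Feature per row: a Point geometry (longitude first, then
## latitude) plus the non-spatial attributes under "properties"
features <- sprintf(
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[%s,%s]},"properties":{"name":"%s"}}',
    stations$lon, stations$lat, stations$name)

## Wrap the features in a FeatureCollection
geojson <- paste0('{"type":"FeatureCollection","features":[',
                  paste(features, collapse = ','),
                  ']}')

## Write the result to a temporary file and print it
writeLines(geojson, file.path(tempdir(), 'stations.geojson'))
cat(geojson, '\n')
```

Hand-made strings like these break as soon as an attribute contains quotes or special characters, so for real data it is better to rely on a dedicated writer.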

Spatial objects in R can be converted to GeoJSON with writeOGR from the rgdal package. For example, let’s display the meteorological stations of the SIAR network. (By the way, GitHub also supports rendering CSV files.)


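library(sp)
library(rgdal)

## Work in a temporary directory
setwd(tempdir())
## Download the stations of the SIAR network (method='wget' requires wget to be installed)
download.file('https://raw.github.com/oscarperpinan/solar/gh-pages/data/SIAR.csv', 'siar.csv', method='wget')
siar <- read.csv('siar.csv')
summary(siar)
## Columns 6 and 7 hold the coordinates; the remaining columns are attributes
siarSP <- SpatialPointsDataFrame(siar[, c(6, 7)], siar[, -c(6, 7)])
## writeOGR arguments: object, output file, layer name, driver
writeOGR(siarSP, 'siar.geojson', 'siarSP', driver='GeoJSON')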

Now upload this file to a GitHub repository or create a Gist and voilà!

Another example, now using a larger dataset from the OpenPV project. From the GitHub help page:

If your map contains a large number of markers (roughly over 750), GitHub will automatically cluster nearby markers at higher zoom levels. Simply click the cluster or zoom in to see individual markers.

On the RStudio download logs

Some days ago H. Wickham (Chief Scientist of the RStudio company) posted an article about the RStudio CRAN mirror with this information:

Finally, because every download from a CRAN mirror is logged, CRAN mirrors provide a rich source of data about R and package usage. To date, it’s been hard to get access to this data. We wanted to change that, so you can now download our anonymised log data from cran-logs.rstudio.com.

There have been a number of posts about these download logs (most of them included in the Top 7 articles of the week of R-Bloggers).

More or less explicitly, they use the RStudio logs to answer the question «How many people use this R package?». In my opinion, such a question cannot be answered safely with the RStudio download logs.

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. (John Tukey)

Here are my objections, partly inspired by this interesting thread on R-devel.

RStudio mirror logs register download statistics, which is different from usage.

For example, I have downloaded (and installed) most of the packages included in the Top 10 list. Does that mean that I am a user of these packages? No. I only use three of them on a regular basis. The rest of the packages were downloaded once to give them a try, or because they were required by another package, etc. On the other hand, I am a regular user of packages that I don’t download from a CRAN mirror because they are already installed with my R (for example, Recommended packages) or because I use Debian packages to install them.
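The gap between downloads and users can be sketched with a toy log. The column names (date, package, ip_id) follow the layout of the files at cran-logs.rstudio.com as I understand it, where ip_id is an anonymised identifier that is reassigned every day. The rows below are invented for illustration.

```r
## Invented log rows, one row per download
logs <- data.frame(
    date = as.Date(c('2013-06-01', '2013-06-01', '2013-06-01',
                     '2013-06-02', '2013-06-02')),
    package = c('ggplot2', 'ggplot2', 'plyr', 'ggplot2', 'plyr'),
    ip_id = c(1, 1, 1, 1, 2),
    stringsAsFactors = FALSE)

## Raw number of downloads per package
downloads <- table(logs$package)

## Number of distinct (date, ip_id) pairs per package
clients <- tapply(paste(logs$date, logs$ip_id), logs$package,
                  function(x) length(unique(x)))

print(downloads)
print(clients)
```

In this toy log ggplot2 has three downloads but only two distinct date/identifier pairs. Even the second figure is not a count of users: the identifier changes daily, the same person may appear under several identifiers, and a package may be fetched only because another package requires it.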

You may be interested in this question on Stack Overflow, these posts by D. Eddelbuettel, and this thread about popular R packages.

The RStudio mirror is only one of the servers in the Comprehensive R Archive Network.

CRAN today includes 90 mirrors worldwide. Thus, the RStudio logs represent only a fraction of the total downloads.

Reading this message in the thread on R-devel, I discovered that log files have also been readily available from the main CRAN mirror for years: http://www.r-project.org/awstats/awstats.cran.r-project.org.html and http://cran.r-project.org/report_cran.html

The RStudio mirror is a sample that cannot be safely regarded as representative of the mirror network.

The mirrors of the Comprehensive R Archive Network, mostly operated by public or nonprofit institutions, provide faster package downloads for users at their geographical location. The RStudio mirror is different:

  • It is a virtual machine run by Amazon’s EC2 service using Amazon CloudFront, a web service for content delivery automatically routed to the nearest edge location.
  • It is operated by a private company.

RStudio is a company dedicated to providing software, education, and services for the R statistical computing environment. (From http://www.rstudio.com/about/)

The RStudio company promotes a list of courses focused on the packages developed and maintained by their team. Besides, one of the main products of this company, the RStudio IDE, uses the RStudio mirror by default. Let’s assume that most RStudio users do not deliberately choose a different mirror. Then a significant percentage of the RStudio mirror logs reflect the choices of the RStudio IDE user community, which in turn could be partly influenced by the RStudio training philosophy.

I would like to close this post with a warning raised by J. Ryan in the previously mentioned R-devel thread:

While I think download statistics are potentially interesting for developers, done incorrectly it can very likely damage the community.