Some days ago H. Wickham (Chief Scientist of the RStudio company) posted an article about the RStudio CRAN mirror with this information:
Finally, because every download from a CRAN mirror is logged, CRAN mirrors provide a rich source of data about R and package usage. To date, it’s been hard to get access to this data. We wanted to change that, so you can now download our anonymised log data from cran-logs.rstudio.com.
There have been a number of posts about these download logs (most of them included in the Top 7 articles of the week of R-Bloggers):
- Tracking CRAN packages downloads
- Where is the R activity?
- Top 100 R packages for 2013 (Jan-May)
- A list of R packages, by popularity
More or less explicitly, they use the RStudio logs to answer the question «How many people use this R package?». In my opinion, such a question cannot be answered safely with the RStudio download logs.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. (John Tukey)
Here are my objections, partly inspired by this interesting thread on R-devel.
RStudio mirror logs register download statistics, which is different from usage.
For example, I have downloaded (and installed) most of the packages included in the Top 10 list. Does that mean that I am a user of these packages? No. I only use three of them on a regular basis. The rest of the packages were downloaded once to give them a try, or because they were required by another package, etc. On the other hand, I am a regular user of packages that I don’t download from a CRAN mirror because they are already installed with my R (for example, Recommended packages) or because I use Debian packages to install them.
You may be interested in this question on Stack Overflow, these posts by D. Eddelbuettel, and this thread about popular R packages.
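The gap between raw download counts and the number of people behind them can be sketched with a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, and the column names `package` and `ip_id` are assumed here to stand for the package name and an anonymised client identifier. Note that even the count of distinct clients says nothing about actual usage, for the reasons given above.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of a daily download log.
# The column names (`package`, `ip_id`) are assumptions for this sketch.
SAMPLE_LOG = """date,package,ip_id
2013-06-01,ggplot2,1
2013-06-01,ggplot2,1
2013-06-01,plyr,1
2013-06-01,ggplot2,2
2013-06-01,plyr,2
2013-06-01,plyr,3
"""


def summarise(log_text):
    """Return {package: (download_count, distinct_client_count)}."""
    downloads = defaultdict(int)
    clients = defaultdict(set)
    for row in csv.DictReader(io.StringIO(log_text)):
        downloads[row["package"]] += 1
        clients[row["package"]].add(row["ip_id"])
    return {pkg: (downloads[pkg], len(clients[pkg])) for pkg in downloads}


summary = summarise(SAMPLE_LOG)
for pkg, (n_downloads, n_clients) in sorted(summary.items()):
    print(pkg, n_downloads, n_clients)
```

In this toy sample both packages have the same number of downloads, yet they come from different numbers of clients, so a ranking by downloads alone would hide that difference.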
The RStudio mirror is only one of the servers in the Comprehensive R Archive Network.
CRAN today includes 90 mirrors worldwide. Thus, the RStudio logs represent only a fraction of the total downloads.
Reading this message in the thread on R-devel, I discovered that log files have also been readily available from the main CRAN mirror for years: http://www.r-project.org/awstats/awstats.cran.r-project.org.html http://cran.r-project.org/report_cran.html
The RStudio mirror is a sample that cannot be safely regarded as representative of the mirror network.
The mirrors of the Comprehensive R Archive Network, mostly operated by public or nonprofit institutions, provide faster package downloads for users at their geographical location. The RStudio mirror is different:
- It is a virtual machine run by Amazon’s EC2 service using Amazon CloudFront, a content delivery web service that automatically routes requests to the nearest edge location.
- It is operated by a private company.
RStudio is a company dedicated to providing software, education, and services for the R statistical computing environment. (From http://www.rstudio.com/about/)
The RStudio company promotes a list of courses focused on the packages developed and maintained by their team. Besides, one of the main products of this company, the RStudio IDE, uses the RStudio mirror as its default choice. Let’s assume that most RStudio users do not deliberately choose a different mirror. Then, a significant percentage of the RStudio mirror logs are affected by the choices of the community of RStudio IDE users, which in turn could be partly influenced by the RStudio training philosophy.
I would like to close this post with a warning raised by J. Ryan in the previously mentioned R-devel thread:
While I think download statistics are potentially interesting for developers, done incorrectly it can very likely damage the community.