It's all about the data: 2011

Thursday, October 6, 2011

Setting up FastRWeb/Rserve on Ubuntu

This blog entry documents my recent (successful) attempt to use Simon Urbanek's Rserve and FastRWeb for CGI scripting with R. This is a working blog entry and will be updated or replaced as needed (last updated 4:15 PM 10/6/2011).

#### Helpful documentation:

    http://rforge.net/FastRWeb/
    http://urbanek.info/research/pub/urbanek-iasc08.pdf
    http://www.rforge.net/Rserve/
    http://cran.r-project.org/web/packages/Rserve/
    (Plus personal communications with Simon, the results
    of which are included in the summary below)

#### The steps used (your configuration probably varies):

0. Ubuntu Linux, 64-bit, Version 10.04 LTS (plus updates). I did the following steps as root, but will return to security issues below.

1. I did a fresh installation of the apache2 web server. I noted that the default location of the cgi-bin (used later) is /usr/lib/cgi-bin; yours may vary. I confirmed that this was up and running and that I could use the toy CGI script foo.cgi placed in the cgi-bin:

    #!/usr/bin/perl
    print "Content-type: text/html\n\n";
    print "<html>Hello World</html>";

To test this I pointed my browser to http://localhost/cgi-bin/foo.cgi; if there are problems, consult your system administrator or do detective work (probably in the log files, /var/log/apache2 on my system). Do not continue until you have Hello World working!

2. I did a fresh installation of R, version 2.13.2, using the required --enable-R-shlib option to configure.

3. I installed the R packages Rserve, Cairo, and FastRWeb. I also installed XML, which is NOT required for Rserve/FastRWeb; it first required installing some libxml2... package in Ubuntu.

4. After installing FastRWeb, I went into the inst directory of the package and ran the install.sh script; this created /var/FastRWeb, used extensively below.

5. I went into /var/FastRWeb/code and examined the files; in a slightly older version of FastRWeb I commented out a few lines, but the current (10/6/2011) version removed that need for me.

6. I fired up R, and per Simon's instructions did the following:

    system.file("cgi-bin", package="FastRWeb")

This revealed the location of a binary called Rcgi. I copied this into /usr/lib/cgi-bin, and renamed it R (instead of Rcgi).

7. Finally, I created a file /var/FastRWeb/web.R/foo.png.R:

    # foo.png.R:
    run <- function(n=100, ...) {
      n <- as.integer(n)       # query parameters arrive as strings
      p <- WebPlot(800, 600)   # 800 x 600 pixel plot
      plot(rnorm(n), rnorm(n), pch=19, col=2)
      p                        # return the plot as the result
    }

8. I tested it with the URL: http://localhost/cgi-bin/R/foo.png?n=500
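As I understand the FastRWeb documentation, the part of the URL after /cgi-bin/R/ names a script under /var/FastRWeb/web.R (with .R appended), and the query string supplies the arguments to run(). The shell sketch below only illustrates that naming convention with string manipulation. It does not contact the server, and the variable names are made up for the example.

```shell
url='http://localhost/cgi-bin/R/foo.png?n=500'
webroot='/var/FastRWeb/web.R'        # created by install.sh in step 4

path_and_query=${url#*/cgi-bin/R/}   # strip everything up to and including /cgi-bin/R/
script_name=${path_and_query%%\?*}   # drop the query string
query=${path_and_query#*\?}          # keep only the query string

script_file="$webroot/$script_name.R"
echo "script: $script_file"
echo "query:  $query"
```

This prints /var/FastRWeb/web.R/foo.png.R as the script and n=500 as the query. The value of n reaches run() as text, which is presumably why the example in step 7 converts it with as.integer().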

#### Security Issues

I have a feeling that if you have a "trusted machine" without user access, the steps above may not technically pose security risks (even as root); but they do not represent good security practices and *would* introduce security risks on shared servers. For my purposes, I added to the beginning of /var/FastRWeb/code/rserve.conf:

    gid 33
    uid 33

because www-data (uid and gid 33) is the user that my apache2 instances run as, so it seemed like a reasonable choice. For good measure, I also changed ownership in /var/FastRWeb:

    chown www-data:www-data .
    chown -R www-data:www-data ./*

Finally, I set

    sockmod 0660
    umask 0007

based on Simon's recommendation for further security. To stop Rserve and FastRWeb:

    killall -INT Rserve
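For readers unfamiliar with the umask 0007 setting above, here is a small sketch of its effect on newly created files. It is an illustration only: it runs in a scratch directory, does not touch Rserve, and assumes GNU `stat` (as on Ubuntu); the names `demo_dir` and `demo_mode` are made up for the example.

```shell
demo_dir=$(mktemp -d)
(
  umask 0007                   # same mask as in rserve.conf
  touch "$demo_dir/example"    # touch requests mode 666 for a new file
)
# 666 with the 007 bits cleared is 660: read/write for owner and group only.
demo_mode=$(stat -c '%a' "$demo_dir/example")
echo "$demo_mode"
```

With www-data as both owner and group, a mode of 660 means that other accounts on the machine get no access, which is also what sockmod 0660 asks for on the socket.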

Monday, September 26, 2011

The Inaugural "Least Interesting Stat" Award

I hereby give the first award to the Yale Daily News for its sports page caption, Monday, September 26, 2011:

"STAT OF THE DAY 4: THE NUMBER OF YEAR SINCE THE FOOTBALL TEAM HAS SCORED 70 POINTS AFTER THE FIRST TWO GAMES OF THE SEASON. The Bulldogs have scored 74 points after two weeks, a total that was last matched in 2007, when Yale put up 79 in what would become a 9-1 season."

For a slightly more invigorating use of statistics and Yale football, see my Yale-Harvard graphical exploration. I need to update it with the last few years of results.

Sunday, September 4, 2011

New York Predictive Analytics Talk

I'll be giving an evening talk at the New York Predictive Analytics World, http://www.predictiveanalyticsworld.com/newyork/2011/. The rough plan:

This talk will touch upon topics in data analysis, statistics, and computing relating to modern massive data challenges. How do classical theories in statistical inference and asymptotics translate into statistical practice in the modern world? What role should complex Bayesian procedures and other cutting-edge methodologies have in the data analyst toolkit? Computationally, how can we manage the data deluge and how is statistical software evolving? What are the implications for the data analyst? What are the dangers posed by addressing these very questions? I'll suggest possible answers to some of these questions, and hope to spur further debate by posing others.

Wednesday, August 17, 2011

Blogs on Trade and the Environment

http://environment.yale.edu/envirocenter/

These blog posts on the Yale Center for Environmental Law & Policy site discuss issues arising from our recent study of linkages between trade and the environment.

Tuesday, August 16, 2011

Fantasy Football 2011

It's that time of year again! Yesterday I scraped some ranking and points projection data from http://fftoolbox.com.

I was interested in how the projected points declined with rank, across the player positions. The plot, below, helps explain why running backs are selected ahead of wide receivers, for example: the decline in production is much shallower for wide receivers than for running backs. You get hurt less (in expectation) by taking lower-ranked wide receivers than you do by taking lower-ranked running backs. What I'd really like to do is integrate weekly variation into the analysis... but this requires a more substantial data scrape than I had time for.

Monday, August 15, 2011

Using "Google Docs" to scrape HTML tables from web pages

One of my students suggested I try this... so I did. In Google Docs, create a new spreadsheet. In the first cell, type something of the form:

    =ImportHtml("http://the-url-goes-here", "table", 0)

My first attempt was scraping some fantasy football points projections:

    =ImportHtml("http://www.fftoolbox.com/football/2011/cheatsheets.cfm?player_pos=QB", "table", 0)

Bingo. At least, it worked for me on the 8 pages I tried. I used 0 as the third argument because some web page recommended it.

I could see using this for data scrapes when a small number of pages are involved, but for more advanced scrapes that require automation I'll continue to use R.