Quickly read Excel worksheets into R (Windows only…sorry)
June 18, 2013 — 10:08

Author:   Category: Data Collection Uncategorized  Tags:   Comments: 17

I suppose most companies use the Microsoft Office suite of programs, and my office is no exception. It easy to import data from an API or a database into R, but importing data from an Excel workbook is a different story. There are a few R packages for reading Excel files, but I’ve had problems with all of them:

  • `read.xlsx` (`gdata` package): pretty convenient to run in R, but requires Perl which for some reason I have a hard time installing on my Windows machine…that might just be an issue with me, not the machine.
  • `odbcConnectExcel2007` (`RODBC` package): from what I’ve seen on the listservs, this one has a hard time reading xlsx files because of a driver mismatch – you have to access the files through 32-bit R, which is annoying.
  • `readWorksheetFromFile` (`XLConnect` package): uses Java, easy to install, and has tons of functionality to write in addition to read, but I don’t really need the write functionality and for large files especially XLConnect is very slow.

So I set off in search of a faster way to pull information out of an Excel file. The gist below shows what I came up with. Excel already has Visual Basic capabilities built in. So I stole a little VB script from here and stuck it in a function that writes the script to a temporary file, calls the script from the command line, and then outputs the contents of the formerly-Excel file.

The function takes the following inputs:

  • file_path: the full path of the Excel file
  • keep_sheets: the names of the sheets you want to keep; extracts all sheets if NULL
  • target_dir: the directory to which you want to output the excel sheets; defaults to the directory where the Excel sheet is kept. If set to FALSE, it doesn’t output any files – rather, it reads them all into R as a list of data frames.

Unfortunately, this only works on Windows. I had hoped it would work on a Mac where Excel was installed, but I haven’t had any luck getting it to work on my home computer.



Comments:
  • schima

    Is that possible to choose which columns to be read into R? or is that possible to read some extra blank columns into R?

    June 19, 2013 — 1:30
    • Currently, no, but it would be easy enough to modify the function to include that possibility.

      June 19, 2013 — 7:59
  • Piet

    You seems not aware of the XLConnect package on CRAN (http://cran.r-project.org/web/packages/XLConnect/index.html), that reads, select information, and write so easily xls and xlsx file for windows, mac, linux, etc., and even does not require Excel to be installed on any platform….

    give it a try and forget VB ;-)

    June 19, 2013 — 7:50
    • Piet, you seem to have not read the original post :-) I mention XLConnect right at the top of the post, in the bullet points. I like the package, but I don’t often need to write to Excel, and the read functions are incredibly slow on large data sets. In general, though, I do think XLConnect is the best package of the three I mentioned.

      June 19, 2013 — 7:58
  • Piet

    Ha yes you’re right… I am not familiar with the function ‘readWorksheetFromFile’, which is why I skipped from reading it completely probably… In general, for big files, what helps is to indicate the range (or region) to use, like region=’A1:VY3000′. But then I guess, when you speak about big files, you mean ‘really really very big files’, and Java is not very good in this.

    June 19, 2013 — 10:30
  • Mark

    The easiest, and most reliable way is to export data from excel to an open format, like csv.

    June 19, 2013 — 11:11
    • I agree – that’s exactly what this function does. If you mean the best way to deal with Excel is to not use Excel, I actually agree with that too. But none of my co-workers do :-)

      June 19, 2013 — 11:49
  • Hi Schaun,

    Have you tried “xlsx” R Packages? If, what are the problems you faced?

    Best Regards,
    Dr. Pradeep Mavuluri,
    http://www.costaleconomist.blogspot.com.

    June 19, 2013 — 12:00
    • Ah, I forgot about xlsx. It’s been a while since I used it. It uses Java, just like XLConnect, so I would expect it to also be slow on large data sets. I’ll have to look into it.

      June 19, 2013 — 12:21
      • Pradeep,

        Digging a little deeper, it seems the xlsx does, in fact, get quite slow when dealing with large data sets: http://stackoverflow.com/a/14456281/1829950

        June 19, 2013 — 12:40
        • Uwe Block

          Hi Schaun,

          I use read.xlsx2() from the xlsx package to read 10+ MB xls (not xlsx!) files regularly. It works very well and sufficiently fast for me.
          For the computing environment available to me, it was a benefit that xlsx depends only on JAVA which was already available on the server. I tried also XLConnect but couldn’t get it to work.
          Speed was not my main concern, so I never did any benchmarks.
          Would be interesting to know if there any benchmarks of the different packages around.

          Brgds,
          Uwe

          June 23, 2013 — 6:09
  • [...] wrote a couple days about about importing Excel files into R. There are lots of ways to do this, but all [...]

    June 20, 2013 — 14:33
  • John Thompson

    This is a great tool, thanks!!

    June 22, 2013 — 22:26
  • […] Schaun Wheeler I suppose most companies use the Microsoft Office suite of programs, and my office is no exception. […]

    January 1, 2014 — 12:40
  • Nabila

    Dear Schaun,

    I am a beginner in using R and have almost no knowledge of Visual Basic. My question will sound stupid, but I am having trouble figuring how to get this function to work. I have a large xlsx file (approx. 50Mb) that needs to be analysed in R. Do I simply copy this function in my script, specify the inputs and run it?

    June 23, 2014 — 10:53
  • Leave a Reply

    Your email address will not be published. Required fields are marked *