Quickly read Excel (xlsx) worksheets into R on any platform
June 20, 2013 — 14:32

Author:   Category: Data Collection Uncategorized  Tags: ,   Comments: 34

I wrote a couple days about about importing Excel files into R. There are lots of ways to do this, but all the ways that use only R have drawbacks (as I outlined in my last post), and all the other ways require installation of programs other than R. I’m not opposed to using programs other than R – it’s easy enough to weave, for example, Python and R code into each other. But I’d become curious about the possibility of solving this problem without the need for added programs, so I did some more searching. Turns out you can import an .xlsx document into pretty much anything that can parse XML, because that’s all an .xlsx document is.

The following gist uses R’s XML package, as well as a little help from the `plyr` package. It will import all sheets from an .xlsx document into a list of data frames and doesn’t require any additional installations of converters, and as far as I can tell, it should work on any platform. They key is to rename the document to have a .zip extension. Then you can unzip the document, pull out the XML files, and match them up to recreate a view similar to what you would see in Excel. The issue of how to deal with XML files in R is an issue that comes up regularly on Stack Overflow, so I thought this gist could also be a nice little example of how to handle that sort of problem (long story short: convert the XML to a list and manipulate the list – it’s generally much easier than trying to manipulate the XML directly).

EDIT 1: You may notice that this converts all times and date-times into dates. That’s because the particular data I was working with didn’t have times or date-times, and because I’m still trying to figure out exactly how Excel handles date/time origins. I’ll edit the function when I find out more.

EDIT 2: Ok, times and date-times are now appropriate, but only if you’re opening an xlsx document that was created on Windows. The Mac version of Excel apparently uses a different origin for dates, which is stupid. I’m trying to figure out how to tell from the XML which platform an xlsx document was created on, but so far I haven’t had any luck.

EDIT 3: I’ve now updated the function. Interestingly, it seems Microsoft might have standardized its handling of dates across operating systems. I created an .xlsx file on my Mac and only the Windows origin (“1899-12-30″) resulted in the correct dates. I’ve left a note in the code to indicate the origin that is supposed to be the right origin for Mac Excel (“1904-01-01″). I’ve also included some options in the function to select only specific sheets from a workbook, and to indicate whether the function should expect the sheets to have headers.

EDIT4: In response to James Dalrymple’s comments below, I’ve revised the function. For someone more comfortable with R’s list structure than with xpath expressions, the original method of converting the XML to a list was convenient. However, extracting node values and attributes through xpath expressions gives a very noticeable boost in speed. I ran the following, revised function on an 18.3 MB file with 18 sheets in it. It took 14 minutes to run. And I’m now much more familiar with xpath than I was before, so I see this as an all-around win.

Comments:
  • Interesting. My favourite remains this, which I got from R Bloggers:

    #Clipboard Reader from Excel
    read.excel <- function(header=TRUE,…) {
    read.table("clipboard",sep="\t",header=header,…)
    }

    It's so easy!

    June 21, 2013 — 5:56
    • I do use that one all the time, but it’s not automated – you have to actually open up Excel and copy the stuff you want.

      June 21, 2013 — 6:57
    • don calpe

      The clipboard solution is only available if you are using MS Windows.

      June 21, 2013 — 7:52
      • There is a Mac alternative to the clipboard: https://stat.ethz.ch/pipermail/r-help/2009-January/186555.html

        But it’s not as user-friendly. And I’m not sure if it works on Linux. Still, to me, the important feature of reading from Excel is automatability. I want a function to be able to pull data from Excel sheets even if I’m not at the computer to physically open the file and copy the data.

        June 21, 2013 — 8:46
  • Robert

    This is awesome! Thanks for posting.

    June 21, 2013 — 9:30
  • David Winsemius

    Agree this is a great posting and I will try to track down the strategy to distinguish Win from Mac origin of Excel document. I’m pretty sure I have seen that documented. But really… The Mac Clipboard is not as “user-friendly”? Who do you think invented the Clipboard? Well it really was Xerox PARC but it was brought to us by Apple. Also remember that Excel was invented on the Mac and only years later brought over to Windows.

    June 21, 2013 — 15:08
    • David, if you could track down that origin information, that would be great. As for user-friendliness, I was referring to the R interface to the Mac clipboard, not the Mac clipboard itself. It’s a lot easier to `write.table` to “clipboard” than it is to remember the right options to the `pipe` function – and even when (I think) I got it right, I still had a hard time pasting multiple columns to or from Excel. In general, I personally prefer to live as Microsoft-free an existence as possible, but in this particular case I do think the “clipboard” functionality in R does beat the comparable Mac functionality.

      June 21, 2013 — 15:15
  • David Winsemius

    After renaming to Workbook1.zip and using Unarchiver.app on the result, I get a folder named Workbook1 and then thie tests whether the word “Macintosh” is in the XML text.

    grep(“Macintosh”, readLines(“~/Workbook1/docProps/app.xml”) )
    # [1] 2

    June 21, 2013 — 15:31
    • Can you open up that XML and see what tag contains the word “Macintosh”? Below is what I get when I open up a file created on Windows. I’m guessing from your `grep` results that it’s appearing in the tag – probably saying something like “Microsoft Excel for Macintosh,” but I’d like to make sure before modifying the function.

      EDIT: Ugh. Note to self: pasting XML into a WordPress comment is a really bad idea. At any rate, the second line of my Windows Excel XML is an “Application” tag that contains the words “Microsoft Excel.” I’d like to know the full contents of that tag from a Mac Excel document so I can make sure I’m pulling the right information into the function.

      June 21, 2013 — 15:53
  • Gabor Grothendieck

    R News 4/1 has a discussion of Excel origins.

    Also, note that the XML methods apply only to xlsx files. xls files do not use XML.

    June 21, 2013 — 16:32
  • Ok, I just got home and created an Excel document on my laptop. The Application tag in app.xml reads “Microsoft Macintosh Excel”. I’ll modify the function.

    And yes, this only works on xlsx files.

    June 21, 2013 — 17:57
  • Lee

    Interesting method. Can the process be reversed to output an xlsx file?

    Regarding your date origin problem, I’ve dealt with this before importing Excel files using the `xlsx` R package. Here are the salient lines of comments and code in the scripts that I use:

    req.date = as.Date(req.date, origin=’1899-12-30′)
    # A couple specifics on this date conversion:
    # Microsoft Excel reports dates as serial numbers from 01-Jan-1900.
    # However, a known bug is that MS Excel assumes the year 1900 to be
    # a leap year (which it was not). Also there are some errors in
    # how leap seconds are aggregated.
    #
    # Together, the origin parameter must be specified to account for a
    # two day offset

    Essentially, there is a long standing bug in the way Excel interprets dates that will not be fixed in the interest of backwards compatibility (with, of all things, Lotus 123). Note, the source of xlsx files I use these scripts on are 100% from Windows based MS Excel, so not sure how this applies to Mac MS Office generated files.

    June 21, 2013 — 23:09
    • Yes, I suppose it could be reversed to output an xlsx file. I don’t have that need often enough to want to figure it out, though!

      I’ve updated the handling of dates. See Edit 3 in the original post.

      June 23, 2013 — 10:13
  • James Dalrymple

    Hi Schaun,

    I just tried running your function in Windows 7 in 64 bit RStudio with 3 Gb RAM. The function works perfect with smaller datasets, but when I tried to load a 10 Mb file this way, I couldnt get the dataset to load and had to stop/interrupt R. Would you recommend this for xlsx files of this size?

    July 1, 2013 — 13:27
  • James,

    Short answer: since you’re on Windows, use this instead: http://housesofstones.com/blog/2013/06/18/quickly-read-excel-worksheets-into-r-windows-only-sorry

    Slightly longer answer: I’m working to speed up the function from this post by turning the XML into an R list at a later stage in the process. I’ll post an update if I succeed.

    July 1, 2013 — 18:15
    • James Dalrymple

      Thanks Schaun! I like taking advantage of the better compression size of xlsx file format versus csv file format, so I find this very useful.

      July 2, 2013 — 0:53
      • James, see Edit 4, above. Let me know how this works out for you. You’ll probably want to keep your task manger open while you’re running this to see if you’re hitting your memory limits.

        July 3, 2013 — 12:56
  • Susan

    Thanks Schaun – this is very useful!

    October 3, 2013 — 11:18
  • DaNiu

    Just stumbled upon this function since I’m constantly going to/from R & Excel. Only question though…. is, what would be the best way to call the individual worksheets (now objects)? For instance, when I declare object x as xlsxToR function, I can get the dataframe structures, but I’m thinking that’s not how I’m supposed to do it…. sorry for this very naive question.

    October 6, 2013 — 23:37
    • Good question. You can call each sheet by number (if you assign the output of the function to the object `dfs`, you can call the first worksheet with `dfs[[1]]`. But it can be cumbersome to remember which list is which number, so I just added names to the gist. So if your first sheet is called “My Sheet”, you can call it with `dfs[['My Sheet']]`. If you only want one sheet, you can put that sheets name in the `keep_sheets` parameter in the function, and it will only return the one data frame corresponding to that sheet.

      October 7, 2013 — 9:12
      • DaNiu

        Thanks Schaun! Worked perfectly. Thanks for posting this very useful function.

        October 7, 2013 — 11:38
  • David Mora

    Hi, I was tryng your function to read .xlsx files in Ubuntu 10.04
    R version 3.0.2 (2013-09-25) — “Frisbee Sailing”
    Platform: x86_64-pc-linux-gnu (64-bit)
    using Rstudio 0.97.551 and got the following error:

    Error in x$.attrs : $ operator is invalid for atomic vectors
    In addition: Warning message:
    In rbind(sheet = c(“Open”, “1″, “rId1″), sheet = c(“Tablas”, “4″, :
    number of columns of result is not a multiple of vector length (arg 1)

    Any idea what is going on ?

    October 22, 2013 — 12:24
    • Thanks for letting me know about that. A week or so ago I realized the function was processing some number formats incorrectly, so I wrote what I thought was a quick fix, and like most quick fixes it ended up breaking everything. I just went back and converted the `styles` portion of the function to use XPath expression. I should have done that a long time ago. Try it out and see if it works for you now.

      October 22, 2013 — 14:50
  • David Mora

    Thanks for your quick response.
    Now I’m having this error:

    xlsxToR(“reporte.zip”)

    Error in `$<-.data.frame`(`*tmp*`, "sheet", value = character(0)) :
    replacement has 0 rows, data has 2089
    In addition: Warning message:
    In rbind(sheet = c("Open", "1", "rId1"), sheet = c("Tablas", "4", :
    number of columns of result is not a multiple of vector length (arg 1)

    October 22, 2013 — 15:58
    • Hm. The function worked on the xlsx file I used for testing. If your file is something you can share, or if you can create a sharable file that reproduces the error, feel free to send it to me at schaun.wheeler@gmail.com and I’ll take a look.

      October 22, 2013 — 16:01
      • I’m just including this comment for documentation purposes. David sent me his data set: the excel document had hidden sheets, which meant the meta data for those sheets had one more attribute than the one non-hidden sheet. I hadn’t built the function to expect that possibility. I changed the function to handle that kind of setup.

        October 23, 2013 — 9:36
  • Ralf

    Hi, I would like to use your function but I get an error (same as David Mora) when using a selection of worksheets over the keep_sheets parameter. With keep_sheets = Null it works.
    Therefore I replaced line 84 with

    #x$sheet <- sheet_names[sheet_names$id == i, "name"]
    x$sheet <- sheet_names[i, "name"]

    and it works. My second problem is that only the first 26 columns of each worksheet is loaded. But there are about 60 per sheet. Have you an idea why?

    January 6, 2014 — 17:06
  • Phil

    Dumb question, but why not just export your Excel worksheet as CSV or TSV and import that directly into R ?

    March 13, 2014 — 16:13
    • Because you can’t (easily) automate that. Someone gives you an Excel workbook with 12 sheets, 4 of which you need to use. That workbook gets updated daily (or weekly, etc.) Often the workbook is so big that it takes forever to open in Excel. So either you can take time out of your schedule every day (or week, etc.) to open the file, wait for it to load in the GUI, and manually ‘Save As’ several times, or you can write a script to do all of that automatically. For the business case that originally motivated me to write this script, it probably saved me about 1-2 hours a week.

      Now, you can automate the writing of the multiple worksheets to a CSV file and then read those files into R. I wrote about that here:

      http://housesofstones.com/blog/2013/06/18/quickly-read-excel-worksheets-into-r-windows-only-sorry/

      The issue with that approach is that it required a Windows installation of Excel on the computer on which I wanted to run the script. That worked fine for me at work but I have a Mac at home, so I went looking for a cross-platform solution, which led me the approach I wrote about in this post.

      March 13, 2014 — 16:33
  • Marine

    I was able to read an xlsm with your function , thank you. But I need to write into it as well. how do I do this?

    May 22, 2014 — 18:39
  • Nabila

    Hi Schaun, I tried running your code, and if I understood correctly, I should have a .zip extension file. I tend to run into this error after the first line:
    xlsxToR <- function(file, keep_sheets = NULL, header = FALSE) {

    xlsxToR <- function("C:\\Users\\Nabila\\Dropbox\\IsolutionsProject\\ServiceRequestTickets.zip", keep_sheets = "TicketDetails", header = FALSE) {

    Error: unexpected string constant in "xlsxToR <- function("C:\\Users\\Nabila\\Dropbox\\IsolutionsProject\\ServiceRequestTickets.zip""

    I am using a Windows pc. I don't know what I am doing wrong. Could you please help?

    June 24, 2014 — 6:37
    • Nabila

      Oh and by the way, I checked my file path several times, and it is correct. Also tried using forward slashes instead of the double backward slashes, but to no avail..

      Thanks

      June 24, 2014 — 6:39
  • function_instantiation <- function_name(parameters){
    [function definition here]
    }

    output <- function_instantiation(parameters=parameter_inputs)

    So after copying the entire function definition in the R interpreter, you need to call:

    output <- xlsxToR(“C:\\Users\\Nabila\\Dropbox\\IsolutionsProject\\ServiceRequestTickets.zip”, keep_sheets = “TicketDetails”, header = FALSE)

    June 24, 2014 — 6:48
    • Nabila

      Thanks!

      June 24, 2014 — 6:53
  • Leave a Reply

    Your email address will not be published. Required fields are marked *