in a vast echoing space

CSV Support in Excel for Mac 2016 is a Travesty

Excel 2016 includes the exciting new ability to save files as “CSV UTF-8 (Comma delimited)” in addition to the old “Comma Separated Values (.csv)” format. In magic fairy dreamland, this output format solves the age-old CSV character set problem: CSVs produced by Excel on Windows encode text using Windows-1252, and CSVs produced by Excel on Mac use MacRoman. Unfortunately, over here in the real world, this new export format creates more problems than it solves. Files saved in this format still suffer from two other problems: incorrect line ending, and a shocking corruption issue resulting in potentially irrecoverable data loss. And, on top of that, Excel still has issues even opening files containing special characters.

Opening

Let’s start with an example. Here’s the CSV:

Pretty standard stuff. If you inspect the raw file, you’ll find Unix-style line endings (that is, a single <LF> at the end of each line), quotes escaped with quotes, UTF-8 encoding (note the two-byte “é”), and a trailing newline.

Now, let’s bring this file into Excel, as one does. I’m using a fully-updated (as of this post) copy of Excel for Mac 2016. Here’s what happens if I just open the file:

Well, that’s a great start. To get special characters to decode correctly, I need to create a new workbook, then from the “Data” ribbon, select “From Text” – then, after selecting the file, I can set the “File origin” to “Unicode (UTF-8)”. Of course, that method of import doesn’t support line breaks in fields:

To produce a file Excel can open the normal way, I had to re-encode the reference as UTF-16LE:

$ iconv -f utf-8 -t utf-16le <csv-good.csv >csv-converted.csv

Or prefix the file with a byte order mark (BOM):

$ printf $'\xef\xbb\xbf' | cat - csv-good.csv >csv-converted.csv

Saving

Alright, onward to output. First, let’s see what happens when we save the file using the oldschool “Comma Separated Values (.csv)” format:

Okay, so it’s close. Our <LF>s have been replaced with <CR><LF>s (with the exception of the the linebreak within the second column of the third row, which has bizarely been replaced with a <CR>) and we lose our trailing newline, but that’s no big deal. The quotes are still correctly quoted. And as expected, the special character has been replaced with 8E. All completely reversible:

$ iconv -f mac -t utf8 <csv-mac.csv \
    | perl -pe 's/\r\n?/\n/g' \
    | sed -e 's/$//' \
    | diff csv-good.csv -
$ echo $?
0

(First iconv performs the character set conversion, then perl invocation fixes the newlines, and finally sed adds the trailing newline.)

There’s just one problem: Excel for Mac itself doesn’t properly reopen this file!

That’s… That’s super special, guys.

Okay, moving along. Let’s see what the “CSV UTF-8 (Comma delimited) (.csv)” format does:

Well, that’s…

Oh boy.

Well, to its credit, the file is UTF-8-encoded. Unfortunately, that’s also where things start going wrong: the first three bytes of the file are a BOM, because Microsoft tools are crazy. Still, given the above experience trying to open a UTF-8 file, I’m not surprised.

The next problem is that it’s replaced all the line endings with… just carriage returns?! How can you possibly regress on something like line endings? Windows- or Unix-style, I really don’t care, but why on God’s green earth would you use Classic Mac-style, <CR>-only line termination? That’s just offensive.

But that’s not the biggest problem: no, the biggest problem, as alluded to in the image above, is the behaviour produced when escaping quotes within a field. Why is the first quote not escaped, and where did that extra et" come from? I’ll give you a hint: it’s from earlier in the field.

Let’s try a series of small tests, each with only a single row and column:

Input Output
0123456789" "0123456789""""
0123456789"" "0123456789""""1"
0123456789"""" "0123456789""""""123"
0123456789"""""""""" "0123456789""""""""""""""""""""""123456789""""""""""""23456789""""
01234"56789 "01234"56789""1234""""
01234""""56789 "01234""""56789""1234""""""234"
012""345""6789 "012""345""6789""12""345""6""2""345"""""345"""345""345"

I think you get the picture. The output, when quotes within fields are involved, is completely untrustworthy. The original input is all but unrecoverable (if there is a pattern for “unwinding” the damage, it isn’t obvious to me), so any file saved this way is basically useless if there are quotes anywhere other than the header. And I haven’t tried, but it looks like it would be pretty easy to generate collisions (the same output for different input) meaning that even if there is a pattern, it’s almost certainly not bijective.

I can understand that these sorts of issues could have flown under the radar during the runup to one of the biggest revisions to Office for Mac. What I don’t get is how nobody’s noticed this nearly eighteen months after Office for Mac 2016 was released to Office 365 subscribers. At time of release, Office 2016 seemed like a huge step forward for Office for Mac, including far better compatibility with the Windows version of Office applications and a significantly improved UI. But the persistence of issues like these do raise doubts about Microsoft’s continuing commitment to bugfixes outside of major releases. I hope this one gets fixed soon.

Write a Comment

Comment