
Software reVisions

In pursuit of reliable, fault-tolerant, fail-safe software and systems


Proprietary File Formats

When it's named like a Docx, looks like a Docx and is used like a Docx... it's a Zip?

I was nearly finished with a post about the dangers of storing files in proprietary binary formats vs. universally readable ASCII text, prompted by my frustration trying to read a .DOCX file someone sent me. That's when Google finally brought me to an article that reminded me that Microsoft (MS) Word 2007 .DOCX files are really .ZIP files consisting of several .XML files.

Because I'm still running MS Word 2003 (which hasn't a clue about Word 2007's DOCX format), I initially couldn't read the file, and until then my journey of discovery had begun to resemble a trail of tears.

At one point I loaded the file into a text editor to get an idea of its true contents. A DOCX file is completely unreadable this way. DUH! It's a compressed ZIP file.
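For anyone who would rather not squint at gibberish in a text editor, here is a minimal Python sketch of how one might check whether a file is really a ZIP archive in disguise. The function name `looks_like_zip` is my own invention for illustration. The signature bytes are the ones a typical non-empty ZIP archive starts with, and `zipfile.is_zipfile` comes from Python's standard library.

```python
import zipfile

# A typical non-empty ZIP archive begins with these four bytes ("PK" + 0x03 0x04).
ZIP_SIGNATURE = b"PK\x03\x04"

def looks_like_zip(path):
    """Return True if the file starts with the ZIP signature and parses as an archive.

    Hypothetical helper for illustration; it ignores the file extension entirely.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    return head == ZIP_SIGNATURE and zipfile.is_zipfile(path)
```

Calling `looks_like_zip("report.docx")` (the file name is just an example) should return True for a Word 2007 document, whatever the extension claims.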

Instead of moving all code into a single .HTML file, Microsoft chose to break the file up into even more .XML files, put them in a 3-level subdirectory structure and ZIP the whole thing. While ZIP has long been an open and widely used packaging and compression format, which makes it tremendously better than Microsoft's proprietary DOC files, it's still a single point of failure in the technology required to read the information.

By contrast, in an HTML file everything is printable ASCII text, even the formatting codes. Thus, even without a clue about the meaning of those codes, a reader could still easily extract the actual information.
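To illustrate the point about HTML, here is a small Python sketch that pulls the readable text out of a page without understanding a single formatting code. It uses the standard library's `html.parser`. The class name `TextOnly` and the function `html_to_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags. Note that the contents
        # of <script> and <style> elements would be collected here as well.
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(markup):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

No knowledge of what the tags mean is needed. The parser only has to know where the tags begin and end.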

Once you change the file's extension from .docx to .zip and unzip it, the various .xml files, manifest, etc. are in plain ASCII text, so this is an improvement. Of course, the vast majority of Microsoft users do not have the time, interest or knowledge to do this, so it's just one more headache to be endured as they try to share files in their quest to get their jobs done.
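The unzip-and-read steps can also be sketched in a few lines of Python. Nothing needs renaming, since the standard `zipfile` module does not care about the extension. Treat this as a rough sketch under some assumptions: I assume the main text lives in a part named `word/document.xml` (the conventional location, though strictly it is declared in the package's relationship files), and the function names `list_parts` and `docx_paragraphs` are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def list_parts(docx):
    """Return the names of all files packed inside the .docx archive."""
    with zipfile.ZipFile(docx) as z:
        return z.namelist()

def docx_paragraphs(docx):
    """Return the text of each non-empty paragraph in the main document part.

    Assumes the main part is word/document.xml, the conventional location.
    """
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Both functions accept a file path or an open binary file object. Formatting, tables and images are ignored, which is the whole idea: the words come out even if everything else is lost.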

Microsoft's solution to this dilemma is their Compatibility Pack, which updates Word, Excel, etc., but it requires you to have applied all critical updates to Office before installing it. After being burned a few times with Microsoft updates that crashed critical programs I was using, lost data or introduced bugs worse than the ones they fixed, I turned off automatic updating. The backlog is quite large by now. Judging by all the virus scanners and things I run, though, I don't have the security problems Microsoft tries to scare everyone with either. Gee... I must just be incredibly lucky, I guess.

By making DOCX the default save format, Microsoft dramatically increases the market penetration of the new format, as only techies are likely to change the default back to the old format to prevent these problems. And of course it generates lots of support revenue from people trying to read the file they were just sent and must act on yesterday.

Thus, the new format provides no advantage, improvement or "openness" to the public at all. It is simply another way for Microsoft to lock customers into their expensive products. They called it "open" to appease the EU, assuming that no decision-maker there or business person here had the technical knowledge to see through their charade.


