Improve your site with HTTP compression

Continuing the series on aspects of HTTP, this third instalment covers compression.

Previous entries:
1. Faking site performance – http connection handling
2. Speeding up your web pages – http caching

A browser declares which compression methods it supports via its Accept-Encoding request header; a common header appearing in HTTP requests would be “Accept-Encoding: gzip, deflate”. A server declares that the content being served is compressed via the Content-Encoding response header – a quick example: “Content-Encoding: gzip”.
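Put together, a typical exchange looks something like this (trimmed to the headers that matter here; the URL, content type and length are purely illustrative):

    GET /scripts/app.js HTTP/1.1
    Host: www.example.com
    Accept-Encoding: gzip, deflate

    HTTP/1.1 200 OK
    Content-Type: application/x-javascript
    Content-Encoding: gzip
    Content-Length: 3827

    ...gzip-compressed body...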

There are two compression methods: gzip and deflate, as just mentioned in the examples. Interestingly, the actual compression behind both is the same DEFLATE algorithm, and the implementation nearly everyone uses is zlib. The reason zlib finds itself in so many open standards (PPP compression, for example) is spelled out on the home page of the reference implementation – “A Massively Spiffy Yet Delicately Unobtrusive Compression Library (Also Free, Not to Mention Unencumbered by Patents)”. Obviously the free aspect is what counts there (although spiffiness is often a desirable property in compression algorithms). So what is the difference between gzip, deflate and zlib? gzip and zlib are both container formats that add a header and trailer (mostly additional file information and error-detection data) around the raw DEFLATE stream that sits inside; the HTTP “deflate” encoding is, strictly speaking, the zlib container. You can impress your girlfriends with that geek trivia!

In fact there are many products and services that have incorporated the zlib implementation available at http://zlib.net. This reference implementation has a subtle, little-advertised feature that lets you pump out a deflate-formatted or gzip-formatted stream instead of the default zlib stream. There is an argument called windowBits in the “constructor” of the compression stream object (the library is plain C, so I use that term loosely). Pass a negative value and you get a raw deflate-formatted result. Add 16 to the windowBits argument and you get a gzip-formatted result. Leave windowBits alone and you get a standard zlib stream. Yes, yes, your girlfriends are excited, I know.
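Here is a minimal sketch of that trick using zlib’s deflateInit2 (compile against zlib with -lz; the helper name compress_with is mine, not part of zlib). It compresses the same buffer three times and only the windowBits argument changes:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Compress in[] into out[] using the given windowBits; returns the
     * number of bytes produced, or -1 on error. */
    static int compress_with(int windowBits, const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t out_len)
    {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));

        /* windowBits selects the container format:
         *   -15      -> raw deflate stream (no header or trailer)
         *    15      -> standard zlib stream
         *    15 + 16 -> gzip stream (gzip header plus CRC-32 trailer) */
        if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         windowBits, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;

        strm.next_in   = (unsigned char *)in;
        strm.avail_in  = (uInt)in_len;
        strm.next_out  = out;
        strm.avail_out = (uInt)out_len;

        deflate(&strm, Z_FINISH);              /* one shot: the input is tiny */
        int produced = (int)strm.total_out;
        deflateEnd(&strm);
        return produced;
    }

    int main(void)
    {
        const unsigned char text[] = "Hello, compressed world!";
        unsigned char buf[256];

        int n = compress_with(-15, text, sizeof(text), buf, sizeof(buf));
        printf("raw deflate: %d bytes, first byte 0x%02x\n", n, buf[0]);

        n = compress_with(15, text, sizeof(text), buf, sizeof(buf));
        printf("zlib:        %d bytes, first byte 0x%02x\n", n, buf[0]);

        n = compress_with(15 + 16, text, sizeof(text), buf, sizeof(buf));
        printf("gzip:        %d bytes, first byte 0x%02x\n", n, buf[0]);
        return 0;
    }

Running it, the zlib stream starts with 0x78 and the gzip stream with the familiar 0x1f 0x8b magic bytes, while the raw deflate stream has no recognisable header at all.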

For a moment, let’s consider how the Apache web server handles compression. As with most web servers, Apache’s implementation is provided as an extension, in this particular case mod_gzip. As zlib’s source code license is BSD-ish in nature, it is not surprising that the zlib library provides most of the grunt work for mod_gzip.

In fact, perusing the source code looking for the magic negative or +16 window bits, we find on line 2888 of mod_gzip.c (version 1.5 was HEAD at the time this was written) a constant window bits value that is negative. OK, so mod_gzip is banking on zlib to provide a raw deflate stream. However, based on the browser’s Accept-Encoding request header, mod_gzip needs to figure out whether it is going to do deflate formatting or gzip formatting. It turns out mod_gzip rolls its own gzip header and trailer, which seems kind of silly to me. Maybe one of you reading out there wants to submit a patch – you could possibly trim a few dozen lines of code out of mod_gzip just by switching the windowBits argument and throwing out the code that does the manual gzip framing. If you do it, trust me, your girlfriend is going to go wild – you practically wrote Apache!
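For the curious, here is roughly what that patch would amount to – a sketch of mine, not mod_gzip’s actual code:

    /* Sketch only, not mod_gzip source: two ways to get a gzip response
     * body out of zlib. mod_gzip today does roughly option A; the
     * suggested patch is option B. */
    #include <zlib.h>

    /* Option A: raw deflate (negative windowBits) plus a hand-rolled
     * wrapper – a 10-byte gzip header up front and a CRC-32/length
     * trailer written out at the end. */
    static const unsigned char gzip_header[10] = {
        0x1f, 0x8b,        /* gzip magic bytes             */
        Z_DEFLATED,        /* compression method           */
        0, 0, 0, 0, 0,     /* flags and modification time  */
        0, 0x03            /* extra flags, OS = Unix       */
    };

    static int init_raw_deflate(z_stream *strm)
    {
        return deflateInit2(strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                            -15, 8, Z_DEFAULT_STRATEGY);
    }

    /* Option B: add 16 to windowBits and zlib writes the gzip header and
     * trailer itself – no hand-rolled framing needed. */
    static int init_gzip(z_stream *strm)
    {
        return deflateInit2(strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                            15 + 16, 8, Z_DEFAULT_STRATEGY);
    }

With option B, zlib also computes and writes the CRC-32 and length trailer itself, so the manual framing code simply goes away.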

Ironically, mod_gzip also seems to skip handling deflate compression entirely. I can guess why. The HTTP standard clearly states that the body should be a deflate-formatted or gzip-formatted stream. However, early implementers of both browsers and servers were confused about the difference between raw deflate formatting and zlib streams, so out on the Internet nobody is quite sure what you mean when you say Accept-Encoding: deflate in a browser request.
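If you are on the receiving end and have to cope with that ambiguity, one defensive approach (a sketch of mine, not taken from any particular browser or server) is to try the zlib container first and fall back to a raw deflate stream:

    /* Sketch: decompress a "deflate" HTTP body without knowing whether the
     * sender meant a zlib-wrapped stream or a raw deflate stream. */
    #include <string.h>
    #include <zlib.h>

    static int inflate_deflate_body(const unsigned char *in, size_t in_len,
                                    unsigned char *out, size_t out_len,
                                    size_t *out_used)
    {
        /* First pass assumes the zlib container (windowBits = 15); the
         * second assumes a raw deflate stream (windowBits = -15). */
        const int window_bits[2] = { 15, -15 };

        for (int attempt = 0; attempt < 2; attempt++) {
            z_stream strm;
            memset(&strm, 0, sizeof(strm));
            if (inflateInit2(&strm, window_bits[attempt]) != Z_OK)
                return -1;

            strm.next_in   = (unsigned char *)in;
            strm.avail_in  = (uInt)in_len;
            strm.next_out  = out;
            strm.avail_out = (uInt)out_len;

            /* One-shot inflate; assumes out[] is big enough for the body. */
            int rc = inflate(&strm, Z_FINISH);
            size_t produced = strm.total_out;
            inflateEnd(&strm);

            if (rc == Z_STREAM_END) {   /* whole body decompressed cleanly */
                *out_used = produced;
                return 0;
            }
            /* Z_DATA_ERROR on the first attempt usually means the sender
             * skipped the zlib wrapper, so retry assuming raw deflate. */
        }
        return -1;
    }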

So that wraps up the server side; now for the browsers. I am going to pick on Internet Explorer 6 here – it is just too easy! Internet Explorer has the concept of in-process plug-ins that can be used as content handlers. This is a little bit different to ActiveX controls – think instead of OLE documents (like embedding spreadsheets in Word documents). This technique is used, for example, by Adobe’s Acrobat Reader to allow PDF content to be interacted with directly within a browser window without opening a separate Acrobat window. However, when Internet Explorer makes a web request it assumes it knows best and marks the request as able to handle compression. The browser starts receiving the compressed content and then notices the MIME type of the response. It uses the MIME type to recognise that Acrobat should get involved, launches it and starts handing over the data. Whoops! Acrobat starts receiving compressed content while assuming uncompressed content. Hilarity ensues.

I believe this has since been fixed, as the lower levels of the HTTP client library in Windows now do transparent decompression. I don’t know whether that fixes the Acrobat problem, though – I haven’t tried it in IE7.

Another nasty bug in Internet Explorer 6 that has slowed the adoption of universal compression of HTTP content is its JavaScript handling. I was involved at one time in a project that had some very large JavaScript files. I thought: what a fantastic opportunity to apply HTTP compression (the JavaScript was highly compressible, so it was going to be a huge win). It was a win – for a little while. Before long, in came reports of broken websites, nothing we could reproduce internally. Finally we found someone who was having the problem and managed to speak with them. I instructed them to make a direct request for the JavaScript from the browser that was having trouble and send me a copy of the file. Sure enough, it looked like garbage. In fact it looked like half a compressed stream, which is exactly what it was. If a compressed JavaScript file was only half downloaded before being cancelled (by the browser closing or a navigation away) and a caching directive was set on the script, then on the next use of the file the browser would take the half-downloaded compressed content and hand it to the JavaScript interpreter as if it were the full uncompressed content. Hilarity ensues.

HTTP compression is a nice thing to do for your users. Remember from the post on faking performance that there are only two connections to use simultaneously; if we compress content, those channels are freed up earlier to process the remaining content. However, apply compression judiciously. These days I generally only compress HTML and plain text (both static and dynamically generated) as well as XML and JSON (for use in AJAX communications), as these are the most-tested cases of the use of compression.
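As a concrete example of being selective: Apache 2 ships its own compression filter, mod_deflate, and restricting it to exactly those content types is one line of configuration (adjust the list to taste):

    AddOutputFilterByType DEFLATE text/html text/plain text/xml application/json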

In his real job, Luke Amery works on shopping cart software. He is the technical director of On Technology, Australia’s leading e-commerce development company.