In the spirit of Marko Karppinen’s The State of the Validation, here are the results of testing 119 XHTML sites for standards compliance. This is not a rigorous scientific exercise; the methodology had several shortcomings, some of which I detail below.
Most of the sites tested are the personal web pages of the “Alpha Geeks”, an elite group of well-linked web designers and programmers, along with some of their friends. Because these are individuals, I do not plan to “name names” by publishing the exact list of URLs tested. Sorry. However, the general sample group is pretty easy to reconstruct. If you’re the type of person who is interested in XHTML — if you’re the type of person who would waste time reading the rest of this post — just look at your own blogroll, and start validating. Your results should be roughly the same as mine.
This post is divided into three sections:
The tests derive from the first three criteria described in an earlier entry. I only tested sites that claimed to be XHTML — in other words, I only validated sites that provided an XHTML DOCTYPE (or something that was trying to be an XHTML DOCTYPE, anyway.) I ignored sites that provided an HTML DOCTYPE or that didn’t have a DOCTYPE at all. It would have been interesting to test HTML 4.01 standards compliance, but that wasn’t what I was interested in.
The “fourth” test described in the earlier entry gets at the question of, “Why are you using XHTML in the first place?” I think this is a good question to ponder… but for this survey I thought it best to focus on the first three tests, which are less philosophical and more straightforward and mechanical.
For the sake of brevity, as soon as a site failed, I stopped applying all further tests. One strike, you’re out.
Level 1: The “main” page must validate as XHTML. (“The Simple Validation Test”)
This test is self-explanatory. Select the home page of the site and run it through the W3C validator. Note that in many cases the page I tested was not a top-level page, but a main journal/weblog page, as in http://domain.com/blog. The distinction doesn’t matter too much. We just want to validate the main entry point to the site… or the page that outsiders tend to link to in their blogrolls, anyway.
The great majority of XHTML sites failed to pass Level 1.
Level 2: Three secondary pages must validate as XHTML. (“The Laziness Test”)
I designed this test to weed out people who go to the effort to make sure their home page validates… and then simply slap an XHTML DOCTYPE on the top of the rest of their pages and call it quits.
A “secondary” page is simply another page on the website that is only one or two clicks away from the main page, such as an “About” page, a “Contact” page, or a monthly archive page. These secondary pages often had images, forms, or other elements that were not present on the main page, thus providing a useful test of proper tag nesting and valid attribute usage. If the secondary page lacked an XHTML DOCTYPE I skipped it; if it had an XHTML doctype, it was fair game.
Of course, a more thorough test would validate all pages on the site and then characterize the total results (somehow). I chose to validate just three pages. Basically, I figure that if I can quickly select three other pages that all validate, then you’ve done a pretty good job of making sure that your site is in solid shape. Of course, some people will pass this test based on the luck of the draw, and so clearly this test overestimates the number of people who have “perfectly valid” sites. Hey, I’m okay with that.
The majority of XHTML sites that passed Level 1 failed to pass Level 2.
Level 3: The site must serve up the proper MIME-type (application/xhtml+xml) to conforming user agents. (“The MIME-type Test”)
The “conforming user agent” I used to sniff for the MIME-type was Mozilla 1.3. Mozilla has been around long enough that its ability to handle application/xhtml+xml should be well-known. Furthermore, Mozilla indicates that it can handle the proper MIME-type through its HTTP Accept header. If the site served up text/html to other browsers, that was fine; I was just looking for some acknowledgment of this issue.
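A rough sketch of what this sniffing amounts to, in Python: request the page while advertising XHTML support, then look at the Content-Type that comes back. The function names and the Accept value are assumptions made for illustration; the Accept string in particular is not the exact header Mozilla 1.3 sent.

```python
from urllib.request import Request, urlopen

XHTML_MIME = "application/xhtml+xml"

# Assumed Accept value for illustration only; not Mozilla 1.3's real header.
ASSUMED_ACCEPT = "application/xhtml+xml,text/html;q=0.9,*/*;q=0.1"


def media_type(content_type: str) -> str:
    """Drop parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def passes_mime_test(content_type: str) -> bool:
    """True when the response was labelled application/xhtml+xml."""
    return media_type(content_type) == XHTML_MIME


def sniff(url: str) -> bool:
    """Fetch url while advertising XHTML support and check the label."""
    request = Request(url, headers={"Accept": ASSUMED_ACCEPT})
    with urlopen(request) as response:
        return passes_mime_test(response.headers.get("Content-Type", ""))
```

Only the media type itself is compared, so a response labelled "application/xhtml+xml; charset=utf-8" still counts as a pass.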
If an author makes it past Test 2, he or she clearly knows a thing or two about XHTML. If he or she then fails Test 3, we can conclude one of two things:
- The author is ignorant of the spec.
- The author is willfully ignoring the spec.
Either way, it’s a failure. XHTML is not simply about making sure all your tags are closed and your attributes are quoted. XHTML might look superficially like HTML, but it is an entirely different beast. Those who know enough to pass Test 2 should know enough to understand the MIME-type as well.
Anyway, the great majority of XHTML sites that passed Level 2 failed to pass Level 3.
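The one-strike rule across the three levels can be pictured as a short-circuiting loop. What follows is only an illustrative Python sketch: the function name and the idea of passing each level in as a callable are assumptions, not a description of the tooling used for this survey.

```python
from typing import Callable, Optional, Sequence


def first_failed_level(checks: Sequence[Callable[[], bool]]) -> Optional[int]:
    """Run the checks in order and stop at the first failure.

    Returns the 1-based level that failed, or None if every level passed.
    Later checks are never called once an earlier one fails.
    """
    for level, check in enumerate(checks, start=1):
        if not check():
            return level
    return None


# Hypothetical site: main page validates, a secondary page does not.
# The third check (the MIME-type test) is never run.
result = first_failed_level([lambda: True, lambda: False, lambda: True])
print(result)  # prints 2
```

Passing the checks as callables keeps the later, more tedious tests from running at all once a site has struck out.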
The reasons why you should serve up your XHTML as application/xhtml+xml are well-documented. First and foremost, the spec says so:
The ‘application/xhtml+xml’ media type [RFC3236] is the [emphasis not mine] media type for XHTML Family document types, and in particular it is suitable for XHTML Host Language document types….
‘application/xhtml+xml’ SHOULD be used for serving XHTML documents to XHTML user agents. Authors who wish to support both XHTML and HTML user agents MAY utilize content negotiation by serving HTML documents as ‘text/html’ and XHTML documents as ‘application/xhtml+xml’.
Second, there’s Hixie’s famous article on the matter, which describes why you need to use the proper MIME-type. Personally, I think Hixie is a little too strict. He argues strenuously that serving up XHTML as text/html is wrong, and then relegates to Appendix B the concept of serving up different MIME-types to different user agents: “Some advanced authors are able to send back XHTML as application/xhtml+xml to UAs that support it, and as text/html to legacy UAs…” (A side note: this distinction about “advanced” authors is a little odd. First, as the results demonstrate, XHTML is hard enough that even advanced authors get it wrong most of the time. Second, configuring your server to do some minimal MIME-type negotiation really isn’t that tough. If you’re advanced enough to know what XHTML is, you’re advanced enough to add a few lines to your .htaccess file. Or add a little PHP snippet for your dynamic pages. Et cetera.)
Anyway, without Hixie’s Appendix B, we’re stuck. If you serve up your pages as application/xhtml+xml to all browsers, you’ll run into IE, which chokes on this MIME-type. The only non-suicidal thing to do is to serve text/html to the primitive browsers that don’t understand the proper MIME-type, and application/xhtml+xml to the ones that do.
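To make the negotiation concrete, here is a minimal sketch in Python of one way a server-side script might pick the MIME-type from the request’s Accept header. The function names are hypothetical, and the rule used here (send application/xhtml+xml only when the browser lists it explicitly with a non-zero q-value) is one reasonable policy among several, not something the spec mandates.

```python
def parse_accept(header: str) -> dict:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    ranges = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        name = parts[0].lower()
        if not name:
            continue
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges[name] = q
    return ranges


def choose_mime_type(accept_header: str) -> str:
    """Pick application/xhtml+xml only for browsers that ask for it by name."""
    ranges = parse_accept(accept_header)
    if ranges.get("application/xhtml+xml", 0.0) > 0.0:
        return "application/xhtml+xml"
    return "text/html"
```

A bare wildcard such as "*/*" deliberately falls through to text/html in this sketch, since a browser that claims to accept everything has not actually said it can handle XHTML.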
I collected results for 119 XHTML websites. I reviewed about half the sites on April 19, 2003, and the other half on April 20, 2003. I used Mozilla 1.3 to sniff for MIME-types, but for the majority of my testing I used Safari Beta 2, because of its superior speed and tab management. (A side note: for beta software, Safari performed extremely well, humming along smoothly with fifteen or twenty tabs open at once. It did consistently crash on a couple of URLs, which I plan to submit with the bug reporting tool.)
Finding 119 XHTML websites is not quite as easy as it first appears. At first I tried searching Google for terms such as “XHTML standards” or “XHTML DOCTYPE”. But as it turned out, sites that talk about XHTML standards and DOCTYPEs are surprisingly unlikely to be XHTML sites.
I finally hit upon a method that yielded a reasonable percentage of XHTML websites. I went to the blogs of several very well-known bloggers who write about web standards: the “Alpha Geeks”. I then methodically went through their blogrolls. Some observations:
This method is likely to overestimate the number of valid XHTML sites. The Alpha Geeks and their friends are among the most tech-savvy people publishing on the web — and furthermore, they have the enormous freedom to tailor their site so that it validates. (Large corporations are for various reasons much more sluggish.)
The blogrolls of the Alpha Geeks consisted primarily of fellow Alpha Geeks. There were other sites, of course — news sites, journalist-bloggers, music aficionado-bloggers, bloggers who drive traffic by posting pictures of themselves in their underwear, and so on. But the majority of the links were web standards advocates, web designers, and programmers.
Even in this elite crowd, a large percentage of people either didn’t bother with DOCTYPEs or were using HTML DOCTYPEs. I didn’t spend time validating the latter, although it would have been an interesting exercise.
A significant fraction of the Alpha Geeks were the so-called “Microsoft bloggers”. Microsoft is doing a pretty good job of getting its employees out there in the Alpha Geek community. Interestingly, nearly all the Microsoft bloggers are using HTML DOCTYPEs. Do they know something the rest of us don’t?
One of the more popular blogging tools of the Alpha Geeks was Movable Type. The majority of Alpha Geek MT sites were not using MT’s default templates; usually their MT installation was highly customized. Radio was also a popular choice, although Radio blogs did not contribute significantly to the number of XHTML sites. A few of the Alphas “roll their own” system (more power to them). Blogger was surprisingly rare, considering its popularity in general, perhaps because it isn’t as customizable as Movable Type. The ground-breaking (but now unsupported) Greymatter was even rarer.
Of the XHTML sites, XHTML 1.0 Transitional was the most popular choice by a wide margin. This isn’t too surprising. XHTML 1.0 Transitional is the default DOCTYPE for Movable Type, and it has the added benefit of allowing you to use all sorts of wonderfully semantic tags and attributes such as the <center> tag and the border attribute for images.
Many Alpha Geeks (including some vociferous standards advocates) failed validation very badly, with dozens and dozens of errors of varying types. On the other hand, a few Alpha Geeks came tantalizingly, frustratingly close to validation. Typically this sort of failure would arise on the last page, where the author would make a tiny error such as forgetting to escape a few entities or inserting naked text inside a blockquote. I can certainly understand how these kinds of errors can creep in, no matter how diligently you try to avoid them. (And I can sympathize — the blockquote validation error is a personal bugbear of mine.)
But it doesn’t matter whether I feel bad or not. It doesn’t matter if I think the errors are “small” or “forgivable”. That has absolutely nothing to do with the specs, or the validator…
And, umm, on that note, let’s get to the results.
Of the 119 XHTML sites tested:
- 88 sites (74%) failed Test 1 (“The Simple Validation Test”).
- 18 sites (15%) passed Test 1, but failed Test 2 (“The Laziness Test”).
- 12 sites (10%) passed Test 2, but failed Test 3 (“The MIME-type Test”).
- Leaving us with one site (1%) that managed to pass all three tests.
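For anyone who wants to check the arithmetic, the percentages above, and the failure rate at each level among the sites that actually reached it, can be recomputed from the four counts. This is just a small Python sketch; the variable names are mine.

```python
# Counts copied from the results list above.
outcomes = {
    "failed Test 1": 88,
    "passed Test 1, failed Test 2": 18,
    "passed Test 2, failed Test 3": 12,
    "passed all three tests": 1,
}

total = sum(outcomes.values())


def percent(part: int, whole: int) -> int:
    """Share of whole, rounded to the nearest whole percent."""
    return round(100 * part / whole)


# Share of all sites in each outcome.
shares = {label: percent(count, total) for label, count in outcomes.items()}

# Sites that survived long enough to face each later test.
reached_test_2 = total - outcomes["failed Test 1"]
reached_test_3 = reached_test_2 - outcomes["passed Test 1, failed Test 2"]

fail_rate_test_2 = percent(outcomes["passed Test 1, failed Test 2"], reached_test_2)
fail_rate_test_3 = percent(outcomes["passed Test 2, failed Test 3"], reached_test_3)

print(total)                                  # 119
print(list(shares.values()))                  # [74, 15, 10, 1]
print(reached_test_2, fail_rate_test_2)       # 31 58
print(reached_test_3, fail_rate_test_3)       # 13 92
```

In other words, 18 of the 31 sites that passed Test 1 (about 58%) failed Test 2, and 12 of the 13 that passed Test 2 (about 92%) failed Test 3.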
I know I promised not to name names, but I must make an exception. For the one man in the entire set who passed all three tests, let’s hear it for… beandizzy! Yay beandizzy! At the time of this writing, beandizzy is reformulating his design — but as of a week ago, his site validated perfectly and served up the right MIME-type. So congratulations, beandizzy. You have beaten the elite of the elite. You stand alone on the mountain top. (Well, there might be the occasional string theorist standing alongside you — but really, physicists are best ignored.)
As for the rest, the results speak for themselves. Even among the elite of the elite, the savviest of the savvy, adherence to standards is pretty low. Note that this survey most likely overestimates adherence to XHTML standards, since you would expect the Alpha Geeks to rate high on XHTML standards comprehension.
Also, I have to admit that I grew rather emotionally invested in the test process. I figured twenty sites would be enough to get at least one compliant site. When that failed, I went on to 40, 60, … amazed that not one site had passed. By the time I reached beandizzy’s site (#98) I was pretty drained. I surveyed the rest of the blogroll I was on and then gave up. So again, this survey most likely overestimates XHTML standards adherence, because I quit soon after I got one success.
Conclusions are forthcoming. But there’s one thing that’s clear right off the bat: XHTML is pretty damn hard. If the Alpha Geeks can’t get it right, who can?