In the spirit of Marko Karppinen’s The State of the Validation, here are the results of testing 119 XHTML sites for standards compliance. This is not a rigorous scientific exercise; the methodology had several shortcomings, some of which I detail below.
Most of the sites tested are the personal web pages of the “Alpha Geeks”, an elite group of well-linked web designers and programmers, along with some of their friends. Because these are individuals, I do not plan to “name names” by publishing the exact list of URLs tested. Sorry. However, the general sample group is pretty easy to reconstruct. If you’re the type of person who is interested in XHTML — if you’re the type of person who would waste time reading the rest of this post — just look at your own blogroll, and start validating. Your results should be roughly the same as mine.
This post is divided into three sections: Test Description, Data Collection, and Results.
Test Description
The tests derive from the first three criteria described in an earlier entry. I only tested sites that claimed to be XHTML — in other words, I only validated sites that provided an XHTML DOCTYPE (or something that was trying to be an XHTML DOCTYPE, anyway.) I ignored sites that provided an HTML DOCTYPE or that didn’t have a DOCTYPE at all. It would have been interesting to test HTML 4.01 standards compliance, but that wasn’t what I was interested in.
The “fourth” test described in the earlier entry gets at the question, “Why are you using XHTML in the first place?” I think this is a good question to ponder… but for this survey I thought it best to focus on the first three tests, which are less philosophical and more straightforward and mechanical.
For the sake of brevity, as soon as a site failed, I stopped applying all further tests. One strike, you’re out.
Level 1: The “main” page must validate as XHTML. (“The Simple Validation Test”)
This test is self-explanatory. Select the home page of the site and run it through the W3C validator. Note that in many cases the page I tested was not a top-level page, but a main journal/weblog page, as in http://domain.com/blog. The distinction doesn’t matter too much. We just want to validate the main entry point to the site… or the page that outsiders tend to link to in their blogrolls, anyway.

The great majority of XHTML sites failed to pass Level 1.
Level 2: Three secondary pages must validate as XHTML. (“The Laziness Test”)
I designed this test to weed out people who go to the effort to make sure their home page validates… and then simply slap an XHTML DOCTYPE on the top of the rest of their pages and call it quits.
A “secondary” page is simply another page on the website that is only one or two clicks away from the main page, such as an “About” page, a “Contact” page, or a monthly archive page. These secondary pages often had images, forms, or other elements that were not present on the main page, thus providing a useful test of proper tag nesting and valid attribute usage. If the secondary page lacked an XHTML DOCTYPE I skipped it; if it had an XHTML DOCTYPE, it was fair game.
Of course, a more thorough test would validate all pages on the site and then characterize the total results (somehow). I chose to validate just three pages. Basically, I figure that if I can quickly select three other pages that all validate, then the author has done a pretty good job of keeping the site in solid shape. Of course, some people will pass this test based on the luck of the draw, so clearly this test overestimates the number of people who have “perfectly valid” sites. Hey, I’m okay with that.
The majority of XHTML sites that passed Level 1 failed to pass Level 2.
Level 3: The site must serve up the proper MIME-type (application/xhtml+xml) to conforming user agents. (“The MIME-type Test”)
The “conforming user agent” I used to sniff for the MIME-type was Mozilla 1.3. Mozilla has been around long enough that its ability to handle application/xhtml+xml should be well-known. Furthermore, Mozilla indicates that it can handle the proper MIME-type through its HTTP Accept header. If the site served up text/html to other browsers, that was fine — I was just looking for some acknowledgment of this issue.

If an author makes it past Test 2, he or she clearly knows a thing or two about XHTML. If he or she then fails Test 3, we can conclude one of two things:
- The author is ignorant of the spec.
- The author is willfully ignoring the spec.
Either way, it’s a failure. XHTML is not simply about making sure all your tags are closed and your attributes are quoted. XHTML might look superficially like HTML, but it is an entirely different beast. Those who know enough to pass Test 2 should know enough to understand the MIME-type as well.
Anyway, the great majority of XHTML sites that passed Level 2 failed to pass Level 3.
The reasons why you should serve up your XHTML as application/xhtml+xml are well-documented. First and foremost, the spec says so:

The ‘application/xhtml+xml’ media type [RFC3236] is the [emphasis not mine] media type for XHTML Family document types, and in particular it is suitable for XHTML Host Language document types….

‘application/xhtml+xml’ SHOULD be used for serving XHTML documents to XHTML user agents. Authors who wish to support both XHTML and HTML user agents MAY utilize content negotiation by serving HTML documents as ‘text/html’ and XHTML documents as ‘application/xhtml+xml’.
Second, there’s Hixie’s famous article on the matter, which describes why you need to use the proper MIME-type. Personally, I think Hixie is a little too strict. He argues strenuously that serving up XHTML as text/html is wrong, and then relegates to Appendix B the concept of serving up different MIME-types to different user agents: “Some advanced authors are able to send back XHTML as application/xhtml+xml to UAs that support it, and as text/html to legacy UAs…” (A side note: this distinction about “advanced” authors is a little odd. First, as the results demonstrate, XHTML is hard enough that even advanced authors get it wrong most of the time. Second, configuring your server to do some minimal MIME-type negotiation really isn’t that tough. If you’re advanced enough to know what XHTML is, you’re advanced enough to add a few lines to your .htaccess file. Or add a little PHP snippet for your dynamic pages. Et cetera.)
Anyway, without Hixie’s Appendix B, we’re stuck. If you serve up your pages as application/xhtml+xml to all browsers, you’ll run into IE, which chokes on this MIME-type. The only non-suicidal thing to do is to serve text/html to the primitive browsers that don’t understand the proper MIME-type, and application/xhtml+xml to the ones that do.
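For the PHP-inclined, here is roughly what that negotiation looks like (a minimal sketch: it just scans the raw Accept header and ignores qvalue subtleties, which was good enough for the browsers of the day):

<?php
// Send the proper XHTML MIME-type only to user agents whose Accept
// header explicitly claims support for it; everyone else gets text/html.
if (isset($_SERVER["HTTP_ACCEPT"]) &&
    stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml")) {
    header("Content-type: application/xhtml+xml");
} else {
    header("Content-type: text/html");
}
?>

Drop something like that at the top of your page template, before any other output, and you’re done.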
Data Collection
I collected results for 119 XHTML websites. I reviewed about half the sites on April 19, 2003, and the other half on April 20, 2003. I used Mozilla 1.3 to sniff for MIME-types, but for the majority of my testing I used Safari Beta 2, because of its superior speed and tab management. (A side note: for beta software, Safari performed extremely well, humming along smoothly with fifteen or twenty tabs open at once. It did consistently crash on a couple of URLs, which I plan to submit with the bug reporting tool.)
Finding 119 XHTML websites is not quite as easy as it first appears. At first I tried searching Google for terms such as “XHTML standards” or “XHTML DOCTYPE”. But as it turned out, sites that talk about XHTML standards and DOCTYPEs are surprisingly unlikely to be XHTML sites.
I finally hit upon a method that yielded a reasonable percentage of XHTML websites. I went to the blogs of several very well-known bloggers who write about web standards: the “Alpha Geeks”. I then methodically went through their blogrolls. Some observations:
- This method is likely to overestimate the number of valid XHTML sites. The Alpha Geeks and their friends are among the most tech-savvy people publishing on the web — and furthermore, they have the enormous freedom to tailor their site so that it validates. (Large corporations are for various reasons much more sluggish.)
- The blogrolls of the Alpha Geeks consisted primarily of fellow Alpha Geeks. There were other sites, of course — news sites, journalist-bloggers, music aficionado-bloggers, bloggers who drive traffic by posting pictures of themselves in their underwear, and so on. But the majority of the links were web standards advocates, web designers, and programmers.
- Even in this elite crowd, a large percentage of people either didn’t bother with DOCTYPEs or were using HTML DOCTYPEs. I didn’t spend time validating the latter, although it would have been an interesting exercise.
- A significant fraction of the Alpha Geeks were the so-called “Microsoft bloggers”. Microsoft is doing a pretty good job of getting its employees out there in the Alpha Geek community. Interestingly, nearly all the Microsoft bloggers are using HTML DOCTYPEs. Do they know something the rest of us don’t?
- One of the more popular blogging tools among the Alpha Geeks was Movable Type. The majority of Alpha Geek MT sites were not using MT’s default templates — usually their MT installation was highly customized. Radio was also a popular choice, although Radio blogs did not contribute significantly to the number of XHTML sites. A few of the Alphas “roll their own” system (more power to them). Blogger was surprisingly rare, considering its popularity in general — perhaps because it isn’t as customizable as Movable Type. The ground-breaking (but now unsupported) Greymatter was even rarer.
- Of the XHTML sites, XHTML 1.0 Transitional was the most popular choice by a wide margin. This isn’t too surprising. XHTML 1.0 Transitional is the default DOCTYPE for Movable Type, and it has the added benefit of allowing you to use all sorts of wonderfully semantic tags and attributes, such as the <center> tag and the border attribute for images.
Many Alpha Geeks (including some vociferous standards advocates) failed validation very badly, with dozens and dozens of errors of varying types. On the other hand, a few Alpha Geeks came tantalizingly, frustratingly close to validation. Typically this sort of failure would arise on the last page, where the author would make a tiny error such as forgetting to escape a few entities or inserting naked text inside a blockquote. I can certainly understand how these kinds of errors can creep in, no matter how diligently you try to avoid them. (And I can sympathize — the blockquote validation error is a personal bugbear of mine.)
But it doesn’t matter whether I feel bad or not. It doesn’t matter if I think the errors are “small” or “forgivable”. That has absolutely nothing to do with the specs, or the validator…
“Listen! And understand! That Validator is out there. It can’t be bargained with! It can’t be reasoned with! It doesn’t feel pity, or remorse, or fear. And it absolutely will not stop, EVER… until you are validated!”
And, umm, on that note, let’s get to the results.
Results
Of the 119 XHTML sites tested:
- 88 sites (74%) failed Test 1 (“Simple Site Validation”).
- 18 sites (15%) passed Test 1, but failed Test 2 (“The Laziness Test”).
- 12 sites (10%) passed Test 2, but failed Test 3 (“The MIME-type Test”).
- Leaving us with one site (1%) that managed to pass all three tests.
I know I promised not to name names, but I must make an exception. For the one man in the entire set who passed all three tests, let’s hear it for… beandizzy! Yay beandizzy! At the time of this writing, beandizzy is reformulating his design — but as of a week ago, his site validated perfectly and served up the right MIME-type. So congratulations, beandizzy. You have beaten the elite of the elite. You stand alone on the mountain top. (Well, there might be the occasional string theorist standing alongside you — but really, physicists are best ignored.)
As for the rest, the results speak for themselves. Even among the elite of the elite, the savviest of the savvy, adherence to standards is pretty low. Note that this survey most likely overestimates adherence to XHTML standards, since you would expect the Alpha Geeks to rate high on XHTML standards comprehension.
Also, I have to admit that I grew rather emotionally invested in the test process. I figured twenty sites would be enough to get at least one compliant site. When that failed, I went on to 40, 60, … amazed that not one site had passed. By the time I reached beandizzy’s site (#98) I was pretty drained. I surveyed the rest of the blogroll I was on and then gave up. So again, this survey most likely overestimates XHTML standards adherence, because I quit soon after I got one success.
Conclusions are forthcoming. But there’s one thing that’s clear right off the bat: XHTML is pretty damn hard. If the Alpha Geeks can’t get it right, who can?
Here’s another.
I submit my entire site, except for the photolog.*
Also note that the forthcoming WordPress will be XHTML 1.0 Strict out of the box.
* The photolog is using legacy software that I’ve managed to coax in places into a semblance of reliability, but getting it to produce anything approaching semantic markup would be a decidedly non-trivial endeavour.
Bravo, Matt! I have been struggling *this very week* to get legacy software to spit out something resembling valid markup. So I’m certainly willing to give your Photolog a pass.
So now there are three. Maybe I need to start keeping a list.
If you wanted to be cruel, you could set up a webring that requires invalid code to be a part of it. Interesting site, BTW. I’ll be checking back.
Interesting survey. I tested myself and found I failed at level 3, which was the final nudge I needed to add the PHP content negotiation code.
Of course, this means that if I ever accidentally introduce badly formed XML into my site, it will die with an ugly error page in Mozilla. Hopefully I’ll be the first to notice.
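One way to make sure you’re the first to notice is to fetch the page and run it through an XML parser as part of your publishing routine. A rough PHP sketch (the URL is hypothetical, and this only checks well-formedness, not validity; note too that expat doesn’t read the external DTD, so named entities like &nbsp; will be flagged even though a real XHTML user agent accepts them):

<?php
// Fetch your own front page and check that it is well-formed XML,
// using PHP's expat-based parser. Logs an error instead of dying.
$page = file_get_contents("http://example.com/blog/"); // hypothetical URL
$parser = xml_parser_create();
if (!xml_parse($parser, $page, true)) {
    error_log("Front page is not well-formed: " .
        xml_error_string(xml_get_error_code($parser)));
}
xml_parser_free($parser);
?>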
And another… I finally got around to changing the content type of my root directory pages via PHP – seeing as there are only 5 pages in there, you can tell how much free time I’ve had lately thanks to sweet, sweet real life.
And now….the rest…. can wait until the “big clean up” and log script finishing effort this summer. 😉
I passed all three. For those of you using PHP, the code for level 3 is simple… Here it is:
if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") ) {
    header("Content-type: application/xhtml+xml");
}
else {
    header("Content-type: text/html");
}
This basically ensures that you can serve to the widest audience and still generate proper XHTML for those who can read it.
TNL
Hello, gentlemen! Just be careful, now… when it comes to the advanced browsers, you are flying without a net. Check out Jacques Distler’s post about invalid comments (if you haven’t already). His post is MT-specific, but it might prove useful in general:
http://golem.ph.utexas.edu/~distler/blog/archives/000155.html
Hmmm. I definitely need to start a list.
I guess I’m more of a gamma geek since I just launched my blog and have very few readers. But I believe it would pass the tests.
Thank you for taking the time to do this; it’s interesting/disconcerting.
I don’t think it’s that XHTML is difficult – properly nest your tags, keep everything in lower case, quote your attributes and you’re 90% of the way there. Add the validator and it becomes extremely easy to spot your mistakes.
What’s difficult is maintaining XHTML validity on sites that are constantly updated with new markup every day as new blog entries are added.
I’m surprised that you only managed to find one site that passed the test.
Not trying to blow my own trumpet here (honestly), but for the past 5 or 6 months http://www.xiven.com has passed all three of the tests you mentioned (does that make me an Alpha Geek?). It also sends valid HTML4 to non-XHTML-supporting browsers.
Additionally, http://www.aagh.net/blog/ seems to pass the tests with flying colours.
Tristan: Your code fails to take qvalues into account, and only works by coincidence on today’s browsers. See http://www.klio.org/marks/2003_04_archive.html#entry-40 for a fuller explanation (and correct code in Perl).
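Roughly, a qvalue-aware check in PHP would look something like this. It is only a sketch along the lines of that Perl (the function name and details are mine, and it hasn’t been tested against every browser):

<?php
// Sketch of a qvalue-aware check: send application/xhtml+xml only when
// the Accept header lists it with a nonzero quality value.
function accepts_xhtml() {
    $accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
    foreach (explode(",", $accept) as $clause) {
        $parts = explode(";", trim($clause));
        if (strtolower(trim($parts[0])) != "application/xhtml+xml") {
            continue;
        }
        $q = 1.0; // a missing q parameter means q=1, per the HTTP spec
        for ($i = 1; $i < count($parts); $i++) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', trim($parts[$i]), $m)) {
                $q = (float) $m[1];
            }
        }
        return $q > 0;
    }
    return false;
}

header(accepts_xhtml() ? "Content-type: application/xhtml+xml"
                       : "Content-type: text/html");
?>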
Jon: Check.
insin: Check. And hey, no more bragging about sweet, sweet real life.
Tom: Check. And check. Yes, I was surprised at the number of failures too.
Tristan: The first couple of pages validate fine, but there are a couple of problems with the “broadband” templates. Please do let me know if these get fixed. (As for qvalues: my criterion was simply that as long as Mozilla 1.3 says that it’s happy, I’m happy. 🙂 )
Simon: Yeah, I would have thought that it shouldn’t be too difficult, particularly after some practice getting the hang of the main gotchas: entities and proper tag nesting. But the numbers on pure validation alone are pretty dismal. A 90% failure rate.
The problem is that keeping a site perfectly valid (or at least valid enough to pass test #2) takes either heroic diligence or very good tools. Both of which are in short supply.
“XHTML is pretty damn hard” — HTML4 is just as hard. Try the same test with HTML and you’ll see what I mean. The only differences are that most people have had more time to get familiar with HTML, and the MIME type issue doesn’t exist.
Give it a few years (like, ten or twenty). We’ll get there.
Paul Snowden of idontsmoke had a weblog that conformed to your rules a year ago (or more; I’m not sure when he took it down). Now he uses the minimum that validates as HTML 4.01 Strict instead.
If you are serving the right MIME type for XHTML 1.1, it would seem to me that keeping the site valid is pretty easy. Each time I screw something up, a good browser should refuse to render the page and give me an error. As long as I look at my pages after an edit, I should be able to catch problems quickly; no need to re-validate, because maximal standards compliance is its own validation.
I suppose a page could be well-formed, but not valid. Can that occur under XHTML 1.1?
“I suppose a page could be well-formed, but not valid. Can that occur under XHTML 1.1?”
Absolutely! Evan gave his favourite example above:
<blockquote>Spoons!</blockquote>
is well-formed, but invalid.
<blockquote><p>Spoons!</p></blockquote>
is valid.
Mozilla will render both, as they are both well-formed.
Ian, liorean — yup, see, all the cool kids are switching back to HTML 4.01. 🙂
Certainly from a pure markup perspective, HTML 4.01 Strict is pretty tough. I’d argue that it’s actually tougher than XHTML 1.0 Transitional.
The latter is really about just quoting your attributes, closing all your tags, and moving on. But in HTML 4.01 Strict there are a lot of new and surprising tag-ordering issues (like the accursed blockquote problem) that you have to deal with. Also, HTML 4.01 Strict takes away a lot of presentational crutches — there’s no align="center", no border="0" for images, and so on. It forces you to lean harder on CSS and maybe even think a little more semantically. Crummy old HTML4, who’d have thunk it?
I don’t have too much of a problem with XHTML 1.1, other than that the spec says no text/html for it, and I do serve text/html to UAs that don’t support application/xhtml+xml. I even modify the meta tag so that it agrees.
My problem is the stupid ampersand (a raw & in a URL has to be escaped as &amp; to validate). Blogshares, for instance, requires invalid markup in a link. I usually wait a few days until the site has been claimed, and then change it back to what it is supposed to be.
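(If your links are generated by PHP, the escaping itself is a single call: htmlspecialchars turns & into &amp;. The URL below is made up, purely for illustration.)

<?php
// htmlspecialchars escapes & as &amp; (and < > " as well), which is
// exactly what the validator wants inside an attribute value.
$url = "http://example.com/claim.php?user=me&id=42"; // hypothetical URL
echo '<a href="' . htmlspecialchars($url) . '">my claim link</a>';
?>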
My blog (click on my name) should pass all three tests.
Sorry for the double post, but I just checked and my site is also officially “Bobby 508 Approved”. Great post by the way Evan.
Could somebody post step-by-step instructions for people that don’t know a thing about PHP, or even if it is supported by their server?
The latest post in my blog is about how to set your server up to display the right MIME type.
Tristan mentioned how to use PHP to send a proper MIME type. Here’s an extension of that (which I coded without knowing of this one):
<?php
/* Due to some browsers spawning a vacuum (read: != Gecko) we sometimes
 * need to send an incorrect MIME type.
 * See: http://www.w3.org/TR/xhtml-media-types/#application-xhtml-xml
 * See: http://www.w3.org/People/mimasa/test/xhtml/media-types/results
 */
if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") ) {
    header("Content-type: application/xhtml+xml");
} else if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xml") ) {
    header("Content-type: application/xml");
} else if ( stristr($_SERVER["HTTP_ACCEPT"], "text/xml") ) {
    header("Content-type: text/xml");
} else {
    header("Content-type: text/html");
}
?>
This only sends an incorrect MIME type if I really, REALLY need to. And as can be seen in those links, some browsers support text/xml and application/xml without supporting application/xhtml+xml. So don’t give up on sending something correct and fall back to text/html too soon.
>This only sends an incorrect MIME type if I really, REALLY need to
But is that prudent? I don’t know if it really does this, but IE might well send an Accept header that lists text/xml and application/xml, yet it only displays them as an XML tree even if they are XHTML.
That’s exactly the problem case described in the “q” problem btw.
My page (as seen in the header) validates as XHTML 1.1, but is sent as application/xml. IE refuses to even display the tree because it cannot handle the XHTML 1.1 DTD.
I agree with CornedBee, it’s best to be prudent. The browsers that support application/xhtml+xml are usually trustworthy** when it comes to their ACCEPT headers. The browsers that don’t, aren’t… and if you trust these poorly-designed ACCEPT headers, you’ll probably get burned. To be on the safe side, it is best to keep it simple and send the “bad” browsers text/html. There’s no shame in that; after all, it’s not *your* fault that certain browsers can’t be trusted.
** See the caveat on Jacques’s site:
http://golem.ph.utexas.edu/~distler/blog/archives/000167.html.
Jacques has found that embedding other XML content (e.g. MathML) doesn’t work with all browsers that explicitly support application/xhtml+xml in the ACCEPT header. So he actually has to list out all his “good” browsers explicitly. It’s a crazy world we live in.
Hi folks,
I’ve been working on XHTML on my site (on and off) for a few months and I think I pass tests 1 and 2.
I don’t pass test 3, but I’m not sure I have to. I understand why I *should*, but the standard does /not/ say I *must*.
I moved over to XHTML 1.1 mostly. There’s a good reason for this: much less tag soup, much more separation of the presentation using CSS. To me, this was more important than passing test 3, which I knew about but chose to ignore for now (in the belief, as I said, that I am /not/ infringing the standard).
I expend effort testing my site for well-formed, valid XHTML using Tidy (for all pages) and the W3C Validator (for spot checks).
So although I don’t claim my site is perfect (realistically, it’s too large to be sure it’s perfect), I am moderately happy that I haven’t crashed and burned.
…unless you know better 😉
It’s been several months since your post, but I may as well provide you a link to another XHTML-compliant website. In fact, this one is XHTML STRICT, and it’s been compliant for well over a year:
http://www.VoyagerRadio.com
It may not be much to look at, but it’s compliant and accessible via wireless devices.
I’ve found your journal, by the way, in my own hunt for XHTML sites. I’m looking particularly for Strict sites.