Whoa, a cluster bomb? I hope not! Sounds dangerous.
Well, it has been rather busy around here. I’ve decided to collect all posts that are even vaguely markup-related and display them in a central repository. I’ve also included a list of sites that pass the XHTML 100 test suite. Again, we’re only testing validation and MIME-types. I’m purposefully ignoring Test #4, the “Why Are You Doing This” Test. You could be one of those rarefied individuals who have actual technical reasons for using XHTML. Or you could be doing it for “softer” reasons: for political advocacy, as a personal learning experience, or simply to prove to yourself that you can do it. It’s all fine as far as I’m concerned.
Note that I tried to add the W3C Markup pages to the list, but failed. The main page validates as XHTML 1.0 Strict and provides the proper MIME-type to Mozilla. However, the second link I happened to grab is valid but serves up text/html. Ditto for the validator.1
The only downside is that on our sidebar we have to say goodbye to guest-blogger Byron Kubert. Byron’s adventures in Norwegian Viking School were gripping, but now he’s back in the States, and he hasn’t posted in months. He’ll still be accessible from the front page, though.2
Final note: I’d like to offer particular congratulations to stalwart young U.K. computer scientist Thomas Pike and his comrade and countryman, Thomas Hurst. Both of them serve up their pages as XHTML 1.1 to browsers that accept application/xhtml+xml and HTML 4.01 Strict to browsers that don’t, tags and everything. Now that, my friends, is real content negotiation. Gentlemen, I salute you.
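For readers who want to see the idea in miniature, here is a rough sketch of that kind of negotiation in Python. It is not how either Thomas actually does it; the function names are made up for illustration, and the choice to ignore wildcards such as `*/*` is an assumption (browsers that only advertise `*/*` are treated as HTML-only).

```python
def accepts_xhtml(accept_header):
    """Return True if the Accept header explicitly lists
    application/xhtml+xml with a quality value above zero."""
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        return q > 0
    return False


def negotiate(accept_header):
    """Pick (document type, MIME type) for the response."""
    if accepts_xhtml(accept_header):
        return ("XHTML 1.1", "application/xhtml+xml")
    return ("HTML 4.01 Strict", "text/html")
```

A browser sending `Accept: text/html,application/xhtml+xml` would get the XHTML 1.1 version, while one sending only `text/html, */*` would fall back to HTML 4.01 Strict.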
1. On the plus side, you can validate the validation of the validator. What fun!
2. It’s a good thing Byron spent more time sailing ships and less time learning how to cleave skulls with an axe, or else I’d be a little worried about demoting him.
You should make your criteria a little clearer.
Must all pages be XHTML? Must they all be served as application/xhtml+xml?
What about search pages, comment-entry popups and the like?
Good questions, and they lead to some needed clarifications.
“Must all pages be XHTML? Must they all be served as application/xhtml+xml?”
This is addressed in the description of the test suite. XHTML DOCTYPE pages are tested. HTML or null DOCTYPE pages are not. Implicitly, this means I also test for application/xhtml+xml for all secondary pages, not just the home page.
Except…
“What about search pages, comment-entry popups and the like?”
Aggh, you have caught an “undocumented exploit” of the test suite (which Jacob Childress noticed also).
The one caveat for the MIME-type test was that I had purposefully not been bothering with it for search pages and comment pages. I was merely validating those pages. To be rigorously correct, I *should* tell people that they need to A) switch their search page back to an HTML4 doctype or B) hack into their CGI script, et cetera.
But I could never bring myself to tell people this. It just seems heartless. I’m a moderate/liberal, so sue me.
Thus, people who were serving up their search pages as text/html and the rest of their pages as application/xhtml+xml were getting away scot-free. So for consistency, I’m not going to ding anyone who sneaked through this way. Instead I will amend the test suite shortly so that it states this exception explicitly.
(Maybe I need to start using the “MUST NOT”, “SHOULD NOT” notation from RFC 2119? Ehh.)
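To pin down the rules as described so far, here is a small Python sketch of the per-page decision. The function name, the argument names, and the three result strings are invented for illustration; the logic only restates the criteria above, including the search/comment-page exception.

```python
XHTML_MIME = "application/xhtml+xml"


def check_page(doctype, validates, mime_type, search_or_comment_page=False):
    """Return 'pass', 'fail', or 'not tested' for a single page.

    doctype: "xhtml", "html", or None when the page has no DOCTYPE.
    validates: whether the page validates against its DOCTYPE.
    mime_type: the Content-Type the server sent (parameters allowed).
    search_or_comment_page: True for the pages covered by the exception.
    """
    # Only pages with an XHTML DOCTYPE are tested at all.
    if doctype != "xhtml":
        return "not tested"
    if not validates:
        return "fail"
    # Ignore parameters such as "; charset=utf-8" when comparing.
    served_as = mime_type.split(";")[0].strip().lower()
    if served_as == XHTML_MIME:
        return "pass"
    # Exception: search and comment pages are validated only.
    return "pass" if search_or_comment_page else "fail"
```

So a valid XHTML home page served as text/html fails, while a valid XHTML search page served the same way still passes.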
“Exploit” is a little strong.
Whether it is easy or hard to serve a Search Page or Comment Popup with the correct MIME-type depends on the technique you were using to set the MIME-type in the first place.
Some people are using PHP in their XHTML pages. If you’re doing that, I think a modification of the search template (or Comment Popup template) is all that is required. (I could be wrong about this.)
I’m using Apache’s mod_rewrite to set the MIME-type. This definitely requires no change to the CGI-script(s) themselves.
Some of your sites *are* setting the correct MIME-type for CGI-generated pages (search, comment-popup). So it’s clearly possible. Some are not, and one is actually brave enough to send an HTML 4.01 document as application/xhtml+xml.
I don’t care either way how you decide to treat dynamically-generated (.cgi) pages. But you need to think carefully about it. Say someone comes along with a site whose every page is dynamically-generated (.cgi). What then? Does he get an automatic pass?
Similarly, sites which are a mixture of HTML 4 and XHTML pose a potential problem for you. What do you do with a site with one (valid) XHTML page and N-1 HTML4 pages (whose validity you are not going to check)? Is that a “Valid XHTML site”?
I’m not pointing out these “edge cases” to be difficult, rather to (hopefully) save you some grief as more and more people decide it would be “cool” to appear on Evan’s list.
I think you’ll have a much easier time if an “XHTML site” is all XHTML and all served with the correct MIME-type, rather than some mixture of XHTML and HTML4, served with some mixture of MIME-types. And CGI scripts (which may constitute anywhere from 0% to 100% of the pages of the site) should be treated the same as static pages in this regard.
Otherwise, your criteria will get very muddy very fast.
Hmmm.
Did a little experimentation.
You were right. It’s not easy (probably not feasible) to change the MIME-Type sent without modifying the CGI-script.
With MovableType, that means a small number of CGI-generated pages which will be served with the wrong MIME-Type (by those unwilling to touch the CGI scripts).
But there are other CMS systems where *every* page is CGI-generated. So it still seems unsatisfactory to give CGI-generated pages a blanket pass.
This is a real problem for me because I host other weblogs through a single MT installation, and any changes I make must apply only to my weblog, since I can hardly expect anyone else on the site to worry about XHTML and MIME-types.
Another problem is that MT uses iso-8859-1 as its default character encoding, and because I might want to use Vietnamese characters from time to time, I’m using utf-8 on my non-CGI pages. Naturally I would like utf-8 to be used on my search page and comments preview page as well. MT offers a “PublishCharset” option, but it is not applied on a per-blog basis, so changing the option would affect my other users.
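A quick way to see why iso-8859-1 is not enough here is this minimal Python check. The sample string and the helper name are just illustrations, not anything from MT.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


sample = "Tiếng Việt"  # contains U+1EBF and U+1EC7, which Latin-1 lacks

latin1_ok = can_encode(sample, "iso-8859-1")
utf8_ok = can_encode(sample, "utf-8")
```

Here `utf8_ok` is True and `latin1_ok` is False, so a page declared as iso-8859-1 cannot carry that text as literal characters.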
I suppose I might hack the MT code, though that’s my solution of last resort…
“This is a real problem for me because I host other weblogs through a single MT installation, and any changes I make must apply only to my weblog, since I can hardly expect anyone else on the site to worry about XHTML and MIME-types.”
Not a problem at all.
I’ll describe, briefly, my setup.
For ordinary pages, you can just use mod_rewrite (either in a .htaccess or the server-config file) to set the MIME-type for .html files in a directory, depending on the User-Agent or Accept request headers (HTTP_USER_AGENT and HTTP_ACCEPT in mod_rewrite), or whatever.
Since this is set on a per-directory basis, it applies to your blog, but not theirs.
For CGI pages, the fix is a bit trickier. It involves a small hack to lib/MT/App.pm:
--- lib/MT/App.pm.orig Tue Mar 25 11:09:03 2003
+++ lib/MT/App.pm Tue Mar 25 11:10:38 2003
@@ -52,6 +52,9 @@
 sub send_http_header {
     my $app = shift;
     my($type) = @_;
+    if ($ENV{'TYPE'}){
+        $type= $ENV{'TYPE'};
+    }
     $type ||= 'text/html';
     if (my $charset = $app->{charset}) {
         $type .= "; charset=$charset"
Then you use mod_rewrite to set the environment variable [E=TYPE:application/xhtml+xml]. In principle, one can arrange the mod_rewrite rules so that it is set on a per-blog basis.
I haven’t really experimented with setting the environment variable on a per-blog basis, so I don’t know how well it works in practice. But it should be doable.
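For anyone who does not read Perl, the logic of the patched routine can be sketched in Python as follows. This is only an illustration of the same decision order (environment override first, then the caller's type, then text/html); the function and parameter names are made up, and nothing here is MT code.

```python
import os


def content_type_header(app_type=None, charset=None, environ=None, var_name="TYPE"):
    """Build the Content-Type value the way the patched routine would.

    app_type: the type the application asked for (may be None).
    charset: optional charset to append.
    environ: mapping to read the override from (defaults to os.environ).
    var_name: name of the override variable set by the rewrite rule.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(var_name)
    # A non-empty override wins; otherwise fall back to text/html.
    mime = override or app_type or "text/html"
    if charset:
        mime += "; charset=" + charset
    return mime
```

With the override set to application/xhtml+xml and a utf-8 charset, the result is `application/xhtml+xml; charset=utf-8`; with no override and no requested type it falls back to `text/html`.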
Well, I’m certainly not going to give an all-PHP site a pass on MIME-types. No sirree.
On the other hand, I don’t want to go back to the people who just have a few CGI pages and say, “Sorry, I’m retroactively removing you from the list, due to a rule change I made today.” Maybe I should grandfather these people in? Yes, this is muddy.
That said, I truly do have *no* problem with people serving up some pages as XHTML, others as HTML. People should be free to stamp their pages with the appropriate DOCTYPEs, and I will test and validate accordingly. I don’t think this muddies the rules any worse than they are already. (I suppose that people could try to game the system by providing four static XHTML pages and the rest HTML. In which case I will use (muddy) human judgment to tell them, sorry, no dice.)
Anyway, we have bigger problems, because I can’t guarantee that a site will stay valid — I can only guarantee that on a certain date, it passed the tests. In other words, I think I need to start timestamping the entries. I don’t want people emailing me a month later saying, “Hey! So-and-so isn’t valid!” My response is don’t email me — email the site owner. This process scales badly enough as it is.
Ugh. It’s late, and I’m still thinking out loud here. Jacques (and Jacob), I appreciate you keeping me on my toes. I will indeed come up with more well-defined criteria. But for now I will muddle through as best I can.
Jacques, you’re a clever man. I’m certain that my changes to the MT code would have been much more messy. I’ll give your suggestions a try tomorrow. Thanks!
I serve everything through index.php (with some obvious exceptions, e.g. CSS files), so there’s only one place I need to worry about Content-Types. I apply http://www.aagh.net/style/xhtml2html.xsl to the generated content before sending if I’m running in HTML mode, which works surprisingly well given its simplicity.
Oh, and do I get brownie points for my lovely clean URLs, and making stuff like http://www.aagh.net/ – http://www.aagh.net/index – http://www.aagh.net/index.html work, despite serving through PHP rather than relying on Apache’s content negotiation? 🙂
For what it’s worth, I was able to hack MT into sending the correct MIME-type and character set for my MT CGI pages. Unfortunately, I wasn’t able to get Jacques’s rather elegant solution to work — I think his solution was quite sound, but because my web provider uses suexec, the environment variable created by the mod_rewrite rule is never passed to the CGI script. So my solution ended up being very kludgey and much less generally applicable…
I think I’ll put in a feature request for configurable character set and (conditional) MIME-type support at the MT forums…
“I think his solution was quite sound, but because my web provider uses suexec, the environment variable created by the mod_rewrite rule is never passed to the CGI script.”
That’s because suExec passes only those environment variables that are on its list of “safe” ones.
To use my hint with suExec, you’d better use the RIGHT environment variable name!
Change “TYPE” to “CONTENT_TYPE” in my explanation, and you should be good to go.
Sorry for the confusion.
Ack!
That sets off another bug, if you’re not careful. Try “HTTP_CONTENT_TYPE” instead!
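To make the renaming less mysterious, here is a toy Python model of that kind of environment filtering. It assumes a wrapper that keeps a variable only if its name is on a fixed list or starts with HTTP_; the names in SAFE_NAMES are an illustrative subset chosen for this sketch, not the wrapper's actual list, so check your own server's documentation before relying on it.

```python
# Illustrative subset only; the real safe list is longer.
SAFE_NAMES = {"CONTENT_TYPE", "CONTENT_LENGTH", "QUERY_STRING", "REQUEST_METHOD"}
# Assumption: names with this prefix are passed through as well.
SAFE_PREFIXES = ("HTTP_",)


def filter_env(env):
    """Return only the variables the wrapper would hand to the CGI script."""
    return {
        name: value
        for name, value in env.items()
        if name in SAFE_NAMES or name.startswith(SAFE_PREFIXES)
    }


before = {
    "TYPE": "application/xhtml+xml",               # dropped: not on the list
    "CONTENT_TYPE": "application/xhtml+xml",       # kept, but the name is already used by CGI
    "HTTP_CONTENT_TYPE": "application/xhtml+xml",  # kept via the HTTP_ prefix
}
after = filter_env(before)
```

In this model TYPE never reaches the script, which matches the symptom described above. CONTENT_TYPE survives but already has a meaning in CGI (the media type of the request body), which is one plausible reason, not confirmed in this thread, why reusing it causes trouble; HTTP_CONTENT_TYPE avoids both problems.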