One of the problems with the XHTML 100 Laziness Test was that it was… well, lazy. Rather than simply validating three random secondary pages, a real Laziness Test would spider through all pages. I didn’t bother with this approach for two reasons. First, my version of the Laziness Test was reasonably effective at weeding people out (although not perfect). Second, writing a well-behaved spider is well beyond my rather limited programming skills.
Fortunately, all is not lost. We can build such a spider. We have the technology. Or rather, the Norwegians do. (And isn’t that scary? What else are those Norwegians hiding from us in those fjords? Does the Defense Department know about this?) Anyway. Via Zeldman, via Ben Meadowcroft, via… no, wait, that’s all the vias. Err… via all those people, researcher Dagfinn Parnas reports that 99.29% of the 2.4 million web pages in his sample fail validation.
In truth, Parnas did a much better job measuring general standards compliance than the XHTML 100, which was just a quick-and-dirty survey. Moreover, you can’t compare Parnas’s data directly with the XHTML 100 results. First of all, I was looking at the Alpha Geeks And Their Friends, while Parnas is looking at a much larger and much less geeky population. Second, Parnas is looking at HTML, while I was looking at XHTML only. Finally, Parnas’s analysis is fundamentally different in that he aggregates his data into one pool. For example, consider a site in the XHTML 100 where the first page validates, the first two secondary pages validate, but the last secondary page fails. Parnas would count that as three successes and one failure. I would count that as total success for Test #1 and total failure for Test #2.
But hey — let’s ignore all those distinctions. If you can’t compare apples and oranges on the Internet, where can you compare them? Parnas reports a 99% failure rate. In contrast, I report a Markup Geek failure rate (as measured by Test #2) of a staggeringly low 90%. I think we can all be proud.
If you do bother to read Parnas’s nearly 6MB PDF paper, and I certainly recommend that you do, be sure to look at the breakdown of the various types of errors. The bulk of the errors (after “no DTD declared”) consist of “non-valid attribute specified” or “required attribute not specified”. Not surprising at all. From my own experience, very few people seem to know that the alt attribute is required or that <img border="0"> is illegal in XHTML 1.0 Strict.
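To make those two error types concrete, here’s a small made-up snippet (the file name and alt text are invented for illustration, not taken from Parnas’s sample or the XHTML 100):

    <!-- Fails under XHTML 1.0 Strict: the required alt attribute is
         missing, and border is not a valid attribute on img. -->
    <img src="logo.gif" border="0" />

    <!-- Validates: alt is supplied, and the border is handled in CSS. -->
    <img src="logo.gif" alt="Site logo" style="border: 0;" />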
The only real puzzler in Parnas’s data was the relatively low fraction of pages with invalid entities. In the XHTML 100, invalid entities were a major killer. I don’t have a good explanation for this discrepancy, but hey.
Parnas concludes:
As we have seen there is little correlation between the official HTML standard and the de-facto standard of the WWW.
The validation done here raises the question if the HTML standard is of any use on the WWW. It seems very odd to have a standard that only 0.7% of the HTML documents adhere to…
A good question. For me, the reason to validate is not ideological. Simply put, validation saves me time. For any page design, there are a huge number of possible glitches across the various browsers. Validation doesn’t reduce the set to zero, but it does make the set a lot smaller. Hey, I don’t know about you, but I need all the help I can get.
Darn those Norwegians! First they make the best cellphone (Nokia) and now this! Next they’ll outdo the Segway!
As a side note, I haven’t been following most of the technical posts about HTML and the like, but wouldn’t another standard benchmark involve how different browsers are able to read them? Something must be standard enough or we’d always get stuck on lame websites that don’t work…oh wait. Never mind.
Well, that’s tricky, because you would have to somehow quantify “how browsers are able to read them”. Some sites are obviously mangled beyond repair in certain browsers. But for other sites, the distortion is a bit more subjective.
You’d also have to think about weighting the browsers. How does Safari compare to Lynx? Is Internet Explorer twenty times more important than all the other browsers combined? Do we ding all sites that don’t display perfectly in Netscape 4?
All that said, you are fundamentally right — the important thing is how a site displays in the various browsers. The dirty little secret is that we’re harping on validation, validation, validation because it’s easy to measure.
It also takes care of browsers that haven’t been written yet. Write valid (X)HTML+CSS and it ought to render correctly in future browsers.
Safari didn’t exist six months ago. Six months from now, its few remaining rendering bugs will have been squashed, and it will be the default browser on every new copy of Mac OS X.
Should our ‘benchmark’ for websites change just because Safari came on the scene? That would be crazy.
Instead we can demand Standards-compliance from browser writers and from web-site authors, and have reasonable assurance that the former will correctly render the latter.
It’s important to emphasize that “Will my site be viewable in today’s web browsers?” is a different question from “Will it be viewable in the browsers of 5 years from now?”
Ensuring a positive answer to the former question merely requires hard work. The only hope to achieve the latter is through validation.
Wow. I hadn’t thought of that. I guess my question shows that I am a novice.
There have been some really good people at all of these companies and organizations that have been driving standards compliance. That’s been the trend over the last couple of years; let’s hope it continues.
As for Safari, I *really* hope Dave Hyatt has squashed the float bug:
https://www.goer.org/Prototypes/Safari/floatbug.html
Interesting to see that someone actually bothered to read my thesis 🙂
What isn’t mentioned in it is that a large part of the valid pages are Apache directory listings. Nor does it mention that I triggered a huge virus alert the first night I started validating. The university’s central IT desk noticed 350,000 HTTP connections during a time of day when there were usually 500. They mistook this for a spreading virus and isolated the entire computer science net. My account was briefly confiscated, but when all the facts were on the table everybody was nice again 🙂
Hello, Dagfinn!
Well, I see from your story that “no good deed goes unpunished”. Glad to hear that you were able to straighten it out. 🙂
Oh, and thanks for the fascinating report!
I finally got around to the thesis again and have converted each page to an image.
The URL is http://elsewhat.com/thesis (the page is very much a work in progress and doesn’t validate :), but the URL structure is set).
I’ve also started collecting some links to different pages referencing the thesis.
Hopefully, people will avoid the horrible PDF version now.
Thanks, Dagfinn. I’ve updated the URLs.