Atom IDs: What’s Wrong With Domain + Timestamp?

For a long time I had been mostly indifferent to the nascent Atom syndication format. True, Atom was taking the right approach in formalizing its specification and developing comprehensive test suites. But I was still thinking, why bother adding an Atom feed when I already had perfectly fine RSS feeds? In other words, “What does Atom add that RSS can’t already do?”

That was my mindset until a few weeks ago, when Mark Pilgrim delivered a much-needed sharp blow to the kidneys of that whole idea. This is when I got religion. Of course, I really should have gotten religion well before that, but I’m a slow learner.

So I was all rarin’ to go, ready to put up an Atom feed, when another post by Mark brought me up short. In his instructions on “How to make a good ID in Atom”, Mark advised against using your HTTP permalink URI as an ID, recommending Tag URIs (yuck!) instead. Then Tim Bray chimed in, saying that a permalink URI as an Atom ID is probably fine, but if you don’t want to use that, you can construct a NewsML URN (yuck again!). Not to be outdone, Bill de Hora said, “Never mind the URI.” Permalinks (heck, domains themselves) are transient, so… use Tag URIs. But wait, Tag URIs (at least the way Mark constructs them) contain domains! Pretty soon my brain started to hurt, and I decided to back off and think about this later. Then things got busy at work, time passed, life went on. I probably would have forgotten about the whole thing, except a few days ago I noticed a task in iCal: “Make Atom feed.” Stupid iCal.

So it’s back to thinking about Atom IDs. Each entry in an Atom feed must have a globally unique, unchanging ID that is also a valid URI. The idea is that Atom applications should be able to identify feed entries uniquely, even if your entries get resyndicated or published in several places at once. However we choose to generate our Atom IDs, we are duty-bound to ensure that they don’t change after creation and that we don’t accidentally clobber someone else’s Atom IDs.

Okay, so what can we use that’s unique and unchanging?

  • My domain of goer.org is unique — I own it, and no one else who’s publishing feeds should be using it. (Bill would say that I’m just renting my domain, but as we’ll see, that doesn’t really matter.)

  • The creation timestamp of each entry is unchanging.[1]

Together, the two form a globally unique, unchanging event. For example, on May 10, 2004 at 9:56pm and 30 seconds PST, the site goer.org posted the entry, “Supercharge Your Outlook Performance!” Now, the domain goer.org is not unchanging — I could easily lose it. The creation timestamp is not unique — there’s an excellent chance someone posted something on May 10, 2004 at 9:56pm and 30 seconds PST. But put the two together, and we have the makings of a simple but robust Atom ID:

https://www.goer.org/2004/05/10/21/56/30/PST/

Note that this is not the permalink to that entry. The permalink is https://www.goer.org/2004/May/index.html#10. The Atom ID takes the domain and appends the timestamp components, separated by forward slashes. I used forward slashes because I know that forward slashes can be used to make valid HTTP URIs. (Tag URIs and NewsML URNs seem weird and scary to me, so I decided to stick with familiar HTTP URIs.) Thus, the Atom ID is a valid HTTP URI, even though it doesn’t point directly to an HTTP resource. Based on my cursory reading on HTTP URIs, I think this is okay. (If I’m wrong about that — if it is incorrect to design an HTTP URI that does not point directly to an HTTP resource — let me know.)

For another example, consider my HTML tutorial. The permalink for the entry on classes and IDs is https://www.goer.org/HTML/intermediate/classes_and_ids/. However, the creation timestamp was August 21, 2002 at 12:34pm and 03 seconds PST. Thus, the corresponding Atom ID would be https://www.goer.org/2002/08/21/12/34/03/PST/.
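Here is a minimal Python sketch of that construction, just to make the recipe concrete. The function name make_atom_id and its parameters are my own invention for illustration, and the timezone label is passed in as a plain string rather than derived from the timestamp.

```python
from datetime import datetime


def make_atom_id(domain: str, created: datetime, tz_label: str) -> str:
    """Build an ID of the form https://<domain>/YYYY/MM/DD/HH/MM/SS/<tz>/.

    Hypothetical helper: the name and signature are assumptions,
    not part of any Atom library.
    """
    # strftime zero-pads every field, so 3 seconds becomes "03".
    stamp = created.strftime("%Y/%m/%d/%H/%M/%S")
    return f"https://{domain}/{stamp}/{tz_label}/"


# The two entries discussed above.
print(make_atom_id("www.goer.org", datetime(2004, 5, 10, 21, 56, 30), "PST"))
print(make_atom_id("www.goer.org", datetime(2002, 8, 21, 12, 34, 3), "PST"))
```

The permalink never enters into it: the only inputs are the domain and the creation time.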

I’m trying to think of scenarios where this scheme would fail. It doesn’t matter if my permalinks change. The domain and creation timestamp are still the same. Nor does it matter if I lose my domain name. I couldn’t produce future entries with the same domain, but the entries I already produced would still be valid, and I could certainly produce new entries with a new domain name.

For example, let’s say that on January 1, 2005, I’m getting married. Because I’m such a forward-thinking, progressive guy, I’m not asking my wife to change her last name; in fact, I’ve decided to change my last name to hers. Henceforth I am to be known as “Evan Goer Nahasapeemapetilon.” And to really drive the point home to friends and family, I’m dumping goer.org for a more appropriate domain name. So now to the really important question: what does that do to my Atom entries? Well, all posts before January 1, 2005 would have the form https://www.goer.org/YYYY/MM/DD/HH/MM/SS/PST/, while all posts after January 1, 2005 would have the form http://nahasapeemapetilon.net/YYYY/MM/DD/HH/MM/SS/PST/. Old entry IDs are unaffected, and all IDs are still unique.
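The key to this scenario is that an ID gets minted once, at creation time, and is stored with the entry rather than recomputed from whatever domain I happen to own today. A rough sketch of that idea, assuming a hypothetical Entry record and mint_entry helper (neither comes from any real publishing tool), and using the trailing-slash form from the earlier example:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: the stored ID cannot be reassigned later
class Entry:
    title: str
    created: datetime
    atom_id: str


def mint_entry(title: str, created: datetime, base_uri: str,
               tz_label: str = "PST") -> Entry:
    """Hypothetical helper: mint the ID from the base URI in use right now."""
    stamp = created.strftime("%Y/%m/%d/%H/%M/%S")
    return Entry(title, created, f"{base_uri}/{stamp}/{tz_label}/")


# Minted while I still own goer.org.
old = mint_entry("Supercharge Your Outlook Performance!",
                 datetime(2004, 5, 10, 21, 56, 30), "https://www.goer.org")

# Minted after the switch; the title and timestamp are made up.
new = mint_entry("A made-up post on the new domain",
                 datetime(2005, 1, 2, 9, 0, 0), "http://nahasapeemapetilon.net")

print(old.atom_id)
print(new.atom_id)
```

Nothing in the second call touches the first entry, which is the whole point: changing domains only changes what future IDs look like.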

Okay, but what if someone else takes over goer.org? In order to have a collision, the following would have to happen:

  1. The new owner decides to publish an Atom feed, AND
  2. They happen to use the exact same format I do, with the exact same separators, AND
  3. They decide to publish entries that were created in the past, before they owned goer.org, AND
  4. Some of those entries have creation timestamps equal to some of mine, down to the second, AND
  5. The old entries have never been published as Atom entries before, because otherwise they would already have an Atom ID, and Atom IDs are unchanging, remember?

The other scenario I thought of is, what about a gigantic site that produces many thousands of entries a day? In that case, we do start to have a non-trivial chance of having two entries with the same creation timestamp. However, the Reuters and eBays of this world can certainly afford to generate some sort of extra identifier to ensure uniqueness. Fortunately, the average weblogger wouldn’t need to add this extra machinery.
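I haven’t said what that extra identifier would look like, so treat the following as one possible sketch, not a recommendation: append a small counter whenever a second entry lands on the same second. The IdAllocator class, the counter format, and the news.example.com domain are all assumptions for illustration, and a real site would need to persist the counts instead of keeping them in memory.

```python
from collections import Counter
from datetime import datetime


class IdAllocator:
    """Hypothetical allocator that disambiguates same-second entries."""

    def __init__(self, domain: str, tz_label: str) -> None:
        self.domain = domain
        self.tz_label = tz_label
        self._seen = Counter()  # timestamp string -> number of IDs issued

    def allocate(self, created: datetime) -> str:
        stamp = created.strftime("%Y/%m/%d/%H/%M/%S")
        n = self._seen[stamp]
        self._seen[stamp] += 1
        base = f"https://{self.domain}/{stamp}/{self.tz_label}/"
        # The first entry in a given second keeps the plain form;
        # later ones get a numeric suffix (1, 2, ...).
        return base if n == 0 else f"{base}{n}/"


alloc = IdAllocator("news.example.com", "PST")
same_second = datetime(2004, 5, 10, 21, 56, 30)
print(alloc.allocate(same_second))
print(alloc.allocate(same_second))
```

A site that never posts twice in one second would only ever see the plain form, so the scheme degrades to the simple one for the average weblogger.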

Okay, let’s sum up: to make a good Atom ID, construct an HTTP URI using domain + creation timestamp. Anyone see any problems with this scheme? If not, I’ll forge ahead with Atom in a few days. One Atom feed will provide entry summaries, while the other will be the first goer.org feed ever to provide full-content goodness. Yum!

1. In this inertial reference frame, anyway.

5 thoughts on “Atom IDs: What’s Wrong With Domain + Timestamp?”

  1. Nope, no problems I can see: you’ve basically reinvented tag URIs without most of their benefits, and with several of the risks that they mitigate, but if you don’t mind, shouldn’t be any big problem for the consumer.

    A tag URI uses a domain name plus a date when you controlled it, so it’s loss-of-domain proof; you use a domain name plus an opaque string of digits that looks like a date, which could be reused by a subsequent owner but maybe wouldn’t. A tag URI uses something (title, keyword slug, whatever you have handy) that will let you tell one from another; you need to know the date and time of posts and translate it from one format to another (and one timezone to another, if you happen to use UTC dates in your feed). But if that doesn’t bother you, no problem.

    Of course, any time you create a URI that looks like a URL, it will be dereferenced, so I hope you’re prepared for that, either in redirection or in ignoring your error log. I think maybe that’s one of the reasons that RDF folks like to end namespace URIs with a # – it minimizes the number of 404s.

    Dual feeds: do you really have a need for two? One of the biggest wins in Atom, for me, is that it provides unambiguous places for both the summary and the full content, something I’ve been doing in RSS (and being called funky, and told I “make people sad” and the like) for quite a while. To me, putting both in one feed, letting the consumer either decide which to use, or use both, seems like a very good thing.

  2. Hey, that’s me, reinventing the wheel. (But what’s with this “round” nonsense? Hexagons, that’s where the future is.)

    Tim Bray mentioned that tag URIs are not officially URIs yet. I don’t know anything about the backstory there, I just took Tim at his word. Am I just being waaay too anal if I’m worried about that issue?

    Good point about dual feeds and Atom. Okay, it’s one (non-funky) feed to rule them all, then. And cheer up, you don’t make me sad.

  3. Your RSS 0.91 feed contains embedded HTML markup, which 0.91 did not allow. A spec-compliant RSS reader would display your markup as markup instead of rendering it. Luckily for you, there are no spec-compliant RSS readers. But be aware that you are relying on client rendering bugs and de facto standards.

    Your RSS 1.0 feed is served as application/rdf+xml, which is not an IANA-registered media type.

    http://www.iana.org/assignments/media-types/application/

    As you can see, the media type is only referenced by a draft that links nowhere and does not appear to exist (and is probably expired if it does exist).

  4. The summary element in your Atom feed contains inline markup in the wrong namespace. The feed validator does not flag this. I will file a bug report.

    Of course you are welcome to use markup in your Atom feed, but please put it in the proper namespace. Given your X-Phile status, I would assume that you would prefer to use inline markup everywhere instead of the escaping hack you are using now. See Sam’s Atom feed for an example.

    http://intertwingly.net/blog/index.atom

  5. Hey hey, I’m gettin’ there, I’m gettin’ there!

    Thank you for the pointers. You’re right, I do plan to use inline markup, and I should do it properly rather than resorting to hacks that I copied-and-pasted from the default MT Atom template. (Although for the record, I’m not an X-Phile, and probably never will be.)
