Tuesday, July 14, 2009

Why XHTML is Wrong and HTML 5 is Right

I'll admit it: I fell for the XHTML hype and I've tried to convert myself to writing XHTML rather than HTML. The argument sounded good, but . . . I was wrong.

Then about a month ago I read almost all of the HTML 4.01 spec - not the digests that you find in books like "Everything You Ever Wanted to Know to Be an HTML Expert!!!!! in 5 Minutes a Day!!!". I read the real spec. I actually spent about 1/4 of my time reading the DTD.

[I thought I needed to write an HTML parser in PHP - but that was just another programming hallucination which I solved in a much simpler, but vastly more limited and restrictive way]

Anyway, while doing that I realized that writing an HTML parser is Hard. Whereas writing an XML parser is EASY.

And that - in a succinct nutshell - is the only argument in favor of XHTML.
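To make the Hard vs. EASY gap concrete, here is a minimal sketch in Python (my choice for illustration; the snippets and the helper names `parses_as_xml`, `TagCollector` and `html_start_tags` are made up for this post). HTML 4.01 lets you leave off the end tags for `li` and `p`, and `br` never has one, so perfectly valid HTML is not well-formed XML. A strict XML parser can simply refuse it, while anything that reads HTML has to cope. Note that the standard-library `html.parser` used here only tokenizes; working out where each element actually ends, which is the Hard part, is still left undone.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML 4.01 content: </li> and </p> are optional, <br> has no end tag.
HTML_SNIPPET = "<ul><li>one<li>two</ul><p>first<br>second"

# The same content written as well-formed XML (wrapped in one root element).
XHTML_SNIPPET = "<div><ul><li>one</li><li>two</li></ul><p>first<br/>second</p></div>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records start tags in the order they appear; builds no tree."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(text):
    """Tokenize the text as HTML and return the start tags seen."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print(parses_as_xml(XHTML_SNIPPET))   # strict parser is happy
    print(parses_as_xml(HTML_SNIPPET))    # strict parser gives up
    print(html_start_tags(HTML_SNIPPET))  # lenient tokenizer carries on
```

The XML side gets to be simple because it is allowed to fail on the first mismatched tag. The HTML side has to know, element by element, which end tags may be omitted, and that knowledge lives in the DTD.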

And, along with everybody else, I blindly followed that Pundit-centric reasoning.

But let's think about it: How many times do you Write an HTML Parser? For most of us, it's Zero.

OK, how many times do you have to read and understand HTML? For most people it's still Zero, but for a Lot of Us it's 'All the f****ing Time!'.

So - should (X?)HTML be Human Parse-able or Machine Parse-able?

That is: Should (X?)HTML be Easy for Us to Read or Easy to Write simple Parsing Programs for?

If you don't know the answer to that one, ... I was about to write something insulting, but that's not fair. Let's look at some history:

1. RPN [that's Reverse Polish Notation]. It's easy to parse, but hard to read. It has a long history, but even the original HP calculator [showing my age], which was a commercially successful RPN calculator (and, as far as I know, the only commercially successful one), had to succumb and convert to algebraic notation.

2. The FORTH family of languages. Apparently this never really dies. FORTH was a stack-based language which is really easy to write a parser for, but almost impossible to understand. Why? All the operations either push something on the stack or pop some stuff off, transform it and then push it back. We Humans are NOT Stack Machines, so we can't track the state of the Stack.

3. PostScript - Same problem as FORTH, but it's successful as an intermediate language to describe a rendered document. Nobody actually programs in PostScript - we all use tools which output to PostScript and then feed that to a PostScript interpreter. [anybody ever try to read that stuff?]

4. Etc
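
For anyone who hasn't met a stack language, here is a rough sketch of how little it takes to "parse" and evaluate RPN. It is in Python, handles only four arithmetic operators, and the names `OPS` and `eval_rpn` are invented for this example, so treat it as an illustration rather than a real calculator.

```python
import operator

# Hypothetical operator table for this sketch: four binary operators only.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression):
    """Evaluate a whitespace-separated RPN expression and return a float."""
    stack = []
    for token in expression.split():
        if token in OPS:
            # Pop some stuff off, transform it, push it back.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            # Anything that is not an operator is treated as a number.
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


if __name__ == "__main__":
    print(eval_rpn("3 4 + 2 *"))          # algebraic: (3 + 4) * 2
    print(eval_rpn("5 1 2 + 4 * + 3 -"))  # algebraic: 5 + (1 + 2) * 4 - 3
```

The whole "parser" is one loop and one list. Now try working out the second expression in your head without writing down the stack: both of them come to 14, but the algebraic forms in the comments are the ones a Human can actually read.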

Point Made?

But, you say, HTML Sucks!!!

Right!!!

Let's get behind HTML 5 and Fix it. XHTML Sucks too.
