Tuesday, March 3, 2009

How can you screw up XML? Part II

XML is overly wordy. It has lots of required text.

I think the theory is - if there really is one - that longer words make computer code easier to understand. That was certainly true when we were using FORTRAN with 6-character variable names and Basic with 2-character names - but there's a limit. That limit is probably around 32 characters - 1/3 of a screen line - at which point you need some white space and punctuation.

XML doesn't provide for white space, just punctuation. As (semi-)intelligent creatures, we need white space to clump symbols into recognizable things. It seems to be how we parse. Computers don't distinguish, so . . . They can read XML, but we can't. [maybe it's really a plot by the machines and W3 are a bunch of Cylons?]


So maybe the extra stuff is supposed to make XML more reliable? To get there, we have to talk about two slightly different subjects: Bandwidth and Information Theory.

Bandwidth measures how fast we can transmit messages. The bigger the bandwidth, the faster the electronic wiggles are and so the more Bits we can represent in a given chunk of time - say a second. Big Bandwidth is Good.

Effective Bandwidth is the fraction of the real Bandwidth you get to use for your stuff - the content you want to see, transmit, or use - think streaming video from hulu.com. The more non-Content characters in the Protocol [read XML] used to encode your 'stuff', the lower your Effective Bandwidth. [You pay for Real Bandwidth, but you Get Effective Bandwidth - it's kind of like sales tax or Net Income after Income Tax]
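To put a rough number on that, here's a small Python sketch that measures what fraction of an XML message is actual Content. The sample message, the function name content_fraction, and the 10 Mbit/s link speed are all made up for illustration, and counting only element text and attribute values as 'Content' is my own simplification.

```python
import xml.etree.ElementTree as ET

def content_fraction(xml_text: str) -> float:
    """Fraction of the message's bytes that are element text or attribute values.

    Assumption: everything else (tags, angle brackets, attribute names)
    counts as 'extra stuff'.
    """
    root = ET.fromstring(xml_text)
    content_bytes = 0
    for elem in root.iter():
        # Text directly inside the element
        if elem.text and elem.text.strip():
            content_bytes += len(elem.text.strip().encode("utf-8"))
        # Text that follows the element's closing tag
        if elem.tail and elem.tail.strip():
            content_bytes += len(elem.tail.strip().encode("utf-8"))
        # Attribute values (names are treated as structure)
        for value in elem.attrib.values():
            content_bytes += len(value.encode("utf-8"))
    return content_bytes / len(xml_text.encode("utf-8"))

# Hypothetical message: 11 bytes of Content in a 72-byte message
message = "<person><firstName>Ada</firstName><lastName>Lovelace</lastName></person>"
fraction = content_fraction(message)

# Hypothetical 10 Mbit/s link: the Effective Bandwidth left for Content
real_bandwidth = 10_000_000
effective_bandwidth = fraction * real_bandwidth
print(f"content fraction: {fraction:.1%}")
print(f"effective bandwidth: {effective_bandwidth:,.0f} bit/s")
```

For this little message only 11 of the 72 bytes are Content, so about 15% of the Bandwidth you paid for is carrying your 'stuff'.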

Information Theory studies how to transmit Information in the presence of Noise. It turns out that you can always get your message across accurately - most of the time - if you use a fancy enough code. A Code takes a simple message and adds a lot of extra bits which allow the receiver to tell if the message was messed up [received a 0 where there should have been a 1] and to reconstruct it. When you have more noise, you have to add more reconstruction bits. That makes the message longer. So Information Theory says - if you want reliable communication, you have to allocate some of your Bandwidth to these reconstruction bits - called Redundancy - so that your Effective Bandwidth is lower than your actual Bandwidth.
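Here's about the simplest Code I can think of, sketched in Python: a repetition code that sends every bit three times and lets the receiver take a majority vote. It's only an illustration (real systems use much cleverer codes), and the function names and the sample message are mine.

```python
def encode(bits, n=3):
    """Repeat every bit n times; the extra copies are the Redundancy."""
    return [b for b in bits for _ in range(n)]

def decode(received, n=3):
    """Majority vote over each group of n bits to reconstruct the message."""
    return [
        1 if sum(received[i:i + n]) > n // 2 else 0
        for i in range(0, len(received), n)
    ]

message = [1, 0, 1, 1]
sent = encode(message)          # 12 bits on the wire for 4 bits of message

noisy = sent.copy()
noisy[4] ^= 1                   # Noise: one bit arrives flipped

recovered = decode(noisy)
rate = len(message) / len(sent) # fraction of the wire bits that are message

print(recovered == message)
print(rate)
```

The flipped bit gets outvoted, so the message comes through intact, but only one third of the transmitted bits are message. That's the trade: reliability paid for with Effective Bandwidth.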

Now let's apply this to XML.

XML adds extra stuff to create a rigid structure for your message. This has Nothing to do with Information Theory because XML is normally transferred over TCP - which is a Reliable Protocol - meaning that, if the message gets there At All, it arrives intact and in order (TCP checksums the data and retransmits anything lost or damaged). As far as your program is concerned, there's No Noise on TCP. [The TCP protocol has already eaten up the Bandwidth required by Information Theory to get your 'stuff' to you]

The XML extra stuff is there so computer programs can easily parse the message and use its Content once it gets there.

XML's 'extra stuff' shares the Same Bandwidth as the Content inside the XML message.

I ask you: what's more important: the Content or the 'extra stuff'?

Personally, I think the 'extra stuff' should be as small and efficient as possible so we can use as much of the Bandwidth for Content.

W3 must think that the 'extra stuff' is more important than content because they make their protocols as bulky as they can.

Don't believe me? Go to www.w3.org and read some of their specs - try namespaces or the RDF spec or just about anything. All Structure with minimal space for Content.

Why do we put up with this?
