20 October 2005

More XML

(Apologies to those of my readers who aren't interested in this stuff. I've been giving more time & attention to my work of late, and the result is less blogging, with technical stuff on the top of my mind more than current affairs.)

Very good piece by Jim Waldo of Sun that chimes (in my mind at least) with my piece below. He emphasises the limited scope of what XML is. He doesn't echo my discussion of whether XML is good, rather he shoves that aside as irrelevant - the comparison is with ASCII. We don't spend much time arguing over whether ASCII is a good character set - is 32 really the best place to put a space? Do we really need the "at" sign more than the line-and-two-dots "divide-by" sign? Who cares? The goodness or badness of ASCII isn't the point, and the badness of XML isn't really the point either.

The comparison with ASCII is very interesting - Waldo talks about using the classic Unix command-line tools like tr, sort, cut, head and so on, which can be combined to do all sorts of powerful things with data in line-oriented ASCII files. XML, apparently, is like that.

Well, yes, I agree with all that. But, just a sec, where are those tools? Where are the tools that will do transforms on arbitrary XML data, and that can be combined to do powerful things? It all seems perfectly logical that they should exist and would be useful, but I've never seen any! If I want to perform exactly Waldo's example: producing a unique list of words from an English document, on a file in XML (say OOWriter's output), how do I do it? If I want to list all the font sizes used, how do I do that? I can write a 20-30 line program in XSLT or Perl to do what I want, just as Waldo could have written a 20-30 line program in Awk or C to do his job, but I can't just plug together pre-existing tools as Waldo did on his ASCII file.
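For a sense of scale, the sort of one-off program meant here might look like the sketch below. It is in Python rather than XSLT or Perl and uses only the standard library's xml.etree.ElementTree. The sample document and the unique_words name are made up for illustration, and a real OOWriter file would first need unzipping and has a far richer structure than this.

```python
import re
import xml.etree.ElementTree as ET


def unique_words(xml_text):
    """Return the sorted, lower-cased set of words in an XML document's text."""
    root = ET.fromstring(xml_text)
    words = set()
    # itertext() walks every text node, ignoring the markup around it.
    for chunk in root.itertext():
        words.update(w.lower() for w in re.findall(r"[A-Za-z']+", chunk))
    return sorted(words)


# Hypothetical sample input, not a real word-processor file.
SAMPLE = "<doc><p>The cat sat.</p><p>The <b>dog</b> sat too.</p></doc>"

if __name__ == "__main__":
    for word in unique_words(SAMPLE):
        print(word)
```

It works, but it is a program written for the occasion, which is exactly the point: nothing in it was plugged together from existing parts the way tr, sort and uniq are.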

There are tools like IE or XMLSpy that can interactively view, navigate, or edit XML data, and there is XSLT in which you can write programs to do specific transformations for specific XML dialects, but that's like saying, with Unix ASCII data, you've got Emacs and Perl - get on with it! The equivalents of sort, join, head and so on, either as command-line tools for scripting or a standard library for compiling against, are conspicuous by their absence.

The nearest thing I can think of is something called XMLStarlet, but even that looks more like awk than like a collection of simple tools, and in any case it is not widely used. Significantly, one of its more useful features is the ability to convert between XML and the PYX format, a data format that is equivalent to XML but easier to read, edit, and process with software (in other words - superior in every way).
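To show why a line-oriented form is attractive, here is a rough sketch of an XML-to-PYX conversion in Python. PYX puts one event per line: "(" opens an element, "A" gives an attribute, "-" carries text, and ")" closes the element. The to_pyx and escape names, the sample document and its size attribute are all invented for this example, the escaping is a simplification, and this is not how XMLStarlet itself does it.

```python
import xml.etree.ElementTree as ET


def escape(text):
    """Keep each PYX record on one line (simplified escaping, an assumption)."""
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")


def to_pyx(elem, out=None):
    """Append PYX lines for elem and its descendants to out, and return out."""
    if out is None:
        out = []
    out.append("(" + elem.tag)
    for name, value in elem.attrib.items():
        out.append("A" + name + " " + escape(value))
    if elem.text:
        out.append("-" + escape(elem.text))
    for child in elem:
        to_pyx(child, out)
        # Text that follows a child element belongs to the parent.
        if child.tail:
            out.append("-" + escape(child.tail))
    out.append(")" + elem.tag)
    return out


# Hypothetical sample; "size" stands in for a font-size attribute.
SAMPLE = '<doc><p size="12">Hello <b>world</b></p></doc>'
PYX_LINES = to_pyx(ET.fromstring(SAMPLE))

# Once the data is one record per line, a grep-style filter is enough.
SIZES = [line for line in PYX_LINES if line.startswith("Asize ")]

if __name__ == "__main__":
    print("\n".join(PYX_LINES))
```

With the data in that shape, listing the font sizes is a job for grep, sort and uniq rather than a parser.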

As a complete aside - note that PYX would be slightly horrible for marked-up text: it would look a bit like nroff or something. XML is optimised for web pages at the expense of every other function. That is why it is so bad.

Maybe I'm impatient. XML 1.0 has been around since 1998, and while that seems like a long time, it may not be long enough. Any process that involves forming new ways for people to do things actually takes a period of time that is independent of Moore's law, or "internet time", or whatever. The general-purpose tools for manipulating arbitrary XML data in useful ways may yet arrive.

But I think the tools have been prevented, or at least held up, by the problems of the XML syntax itself. You could write rough-and-ready implementations of most of the Unix text utilities in a few lines of C, and their size and speed would be excellent. To write any kind of tool for processing XML, you've got to link in a parser. Until recently, that itself would make your program large and slow. The complete source for the GNU textutils is a 2.7MB tgz file, while the source for xerces-c alone is 7.4MB. The libc library containing C's basic string-handling functions (and much more) is a 1.3MB library; xerces-c is 4.5MB.

If you have to perform several operations on the data, it is much more efficient to parse the file into a data structure, apply all the transformations to that structure, and then stream it back to the file than to have each operation parse and re-serialise the file separately. That efficiency probably doesn't matter, but efficiency matters to many programmers much more than it should. It takes a serious effort of will to build something that uses the inefficient method of a chain of separate tools. Most programmers will have been drawn irresistibly to bundling a series of transformations into a single process, using XSLT or a conventional language, rather than making them independent subprocesses. The thought that 99% of their program's activity is going to be building a data structure from the XML, then throwing it away so it has to be built up again by the next tool, just "feels" wrong, even if you don't actually know or care whether the whole run will take 5ms or 500.
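The two approaches can be put side by side in a small Python sketch. The step functions, the runner names and the sample document are all hypothetical, and run_as_pipeline only imitates separate subprocesses by re-parsing and re-serialising between steps inside one program.

```python
import xml.etree.ElementTree as ET


def sort_children(root):
    """Reorder the root's direct children by tag name."""
    root[:] = sorted(root, key=lambda child: child.tag)


def strip_attributes(root):
    """Remove every attribute from every element."""
    for elem in root.iter():
        elem.attrib.clear()


def run_in_process(xml_text, steps):
    """Parse once, apply every step to the same tree, serialise once."""
    root = ET.fromstring(xml_text)
    for step in steps:
        step(root)
    return ET.tostring(root, encoding="unicode")


def run_as_pipeline(xml_text, steps):
    """Imitate independent tools: each step parses its input and writes XML out."""
    for step in steps:
        root = ET.fromstring(xml_text)
        step(root)
        xml_text = ET.tostring(root, encoding="unicode")
    return xml_text


# Hypothetical sample input.
SAMPLE = '<doc><b x="1">two</b><a y="2">one</a></doc>'
STEPS = [sort_children, strip_attributes]

if __name__ == "__main__":
    print(run_in_process(SAMPLE, STEPS))
    print(run_as_pipeline(SAMPLE, STEPS))
```

Both runners produce the same document; the second simply rebuilds the tree once per step, which is the cost that "feels" wrong.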

In case I haven't been clear - I think the "xmlutils" tools are needed, I don't think the efficiency considerations above are good reasons not to make or use them, but I think they might be the cause of the tools' unfortunate non-existence.

I also don't see how they can be used as an argument in favour of XML when they don't exist.

See also: Terence Parr - when not to use XML
