Hey, look at that data structure... Man, that thing's huuuuge...

I realized that I never write about work. I'm not sure why. I suppose that it just never occurs to me that anyone would find it interesting. Work stuff certainly doesn't fit the mandate or purpose of this site, which essentially is to save thoughts I've had at various times so that I can go back and re-read them later. I was even supposed to have a code snippet repository and an online scratchpad (in addition to other things) here as well, but I never got around to doing that. Bottom line: Anyone reading this but me does so for their own reasons. (So if you aren't me and you want smarmy personal weblogs written for an audience, go look at WWDN or The Gus.)

OK, so I have some work stuff to report on. It's not interesting work stuff. You have been warned.

I'm writing an app that sucks data out of a popular groupware calendaring/messaging server and turns it into XML so that an XSLT processor can pick it up and do magical things to it. It ought to be very fancy once done.

I had the app "done" in a couple of days. I was making XML just fine. But it wasn't actually done. My app had inconsistencies because the data it parses can vary much more than I previously thought. I had to account for that. My simple data structure of an array of hashes of arrays would no longer suffice. So I went looking for someone who had invented a similar wheel. I came up with nothing.

I discovered why there are so very few XML generators as compared to the large number of XML parsers out there: it's incredibly hard to actually make XML from highly variable or unstructured data. There's typically nothing about the data that tells you how it relates to the last element of data you just read. Or the next 8 elements you're going to read. Or the stuff you already read and fired off an event for. Unstructured data has no metadata, which is a must-have for XML to be made. XML implies and demands structure. Everyone wants to write nifty parsers for going through, extracting, or doing whatever to structured data, but nobody wants to write libraries which help people make that data. And I can't blame them.

I was sorely hoping for my app to be super-abstract: give it a set of name/value pairs, nested arbitrarily deep, in any order, and you'd get structured XML based on how the data was named. I thought that was a pretty decent result for a few days' coding. As a proof-of-concept, I even used the functions I'd written to create XML representations of directory listings (I've got a big ass XML doc which represents the /usr hierarchy on my Linux workstation for instance). But then the bubble burst: One condition I have to account for is repeating namespaces at the same "level". Oy! It can't be! It's like having two files with the same name in the same directory! The world is ending!
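For illustration only, here is a minimal sketch of why repeated names at one level hurt. It is Python rather than the Perl the app is written in, and `to_xml` and the sample event data are made up for the example. A hash (a dict here) quietly keeps one value per key, while a list of (name, value) pairs keeps both the repeats and their order.

```python
from xml.sax.saxutils import escape

def to_xml(pairs, indent=0):
    """Render a list of (name, value) pairs as a list of XML text lines.

    A value is either a scalar (written as PCDATA) or another list of
    pairs (written as child elements). Using a list of pairs instead of
    a dict lets the same name repeat at one level.
    """
    pad = "  " * indent
    lines = []
    for name, value in pairs:
        if isinstance(value, list):
            lines.append(f"{pad}<{name}>")
            lines.extend(to_xml(value, indent + 1))
            lines.append(f"{pad}</{name}>")
        else:
            lines.append(f"{pad}<{name}>{escape(str(value))}</{name}>")
    return lines

# A dict cannot hold two "attendee" entries: the later one wins.
as_dict = {"attendee": "alice", "attendee": "bob"}

# Hypothetical calendar-like data with a repeated name at the same level.
event = [("event", [("title", "Staff meeting"),
                    ("attendee", "alice"),
                    ("attendee", "bob")])]
xml_text = "\n".join(to_xml(event))
```

The trade-off is that a list of pairs gives up direct lookup by name, which is a large part of why the hash looked attractive in the first place.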

So I've spent the last two days working my way down the Perl data structure rabbit hole(s). My solution is going to be either brutally simple, or utterly unmaintainable. And I would normally opt for code which is "simple and stupid, but easy to read and it works" over "very slick and buzzword-compliant but nobody can tell what the hell's happening". The wrinkle here is that I'm a temp still, and they are just now hiring for my job. It's a juicy job description, too, one most geeks would jump at. So there's more than a little temptation to impress the professors with a ninja master-quality solution even they can't figure out. Except that won't help anyone; if nobody can maintain it but me, then I'll never get rid of the albatross.

Ahem... so here is what I have so far when making XML. I currently have an array of hashes which can contain arrays of hashes or hashes, each of which can contain arrays, references to hashes, or hashes. Those hashes can contain any of the parent elements I just mentioned. The "top level" arrays of hashes can similarly have hashes of arrays, arrays of hashes, arrays of arrays, hashes of hashes, hashes of references to arrays/hashes, ad infinitum. About the only thing I don't have are plain, bare scalars (aside from all the list context and reference stuff, that is). I also have functions which unwind these data structures. They are what make the XML. I descend through and make attributes, check parent elements, check element level, add new PCDATA, etc.
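The general shape of "descend through and emit elements" can be sketched as a single recursive function. This is an assumption-laden Python stand-in, not the app's Perl: `build`, the `listing` root, and the convention that keys starting with "@" become attributes are all invented for the example.

```python
import xml.etree.ElementTree as ET

def build(parent, name, value):
    """Append `value` under `parent` as one or more <name> elements."""
    if isinstance(value, list):
        # A list means "this name repeats": one sibling element per item.
        for item in value:
            build(parent, name, item)
        return
    elem = ET.SubElement(parent, name)
    if isinstance(value, dict):
        for key, child in value.items():
            if key.startswith("@"):
                # Assumed convention: "@key" is an attribute, not a child.
                elem.set(key[1:], str(child))
            else:
                build(elem, key, child)
    else:
        # Anything else is treated as a scalar and becomes PCDATA.
        elem.text = str(value)

# Hypothetical directory-listing data, nested hashes and arrays.
tree = {"dir": {"@name": "usr",
                "file": ["README", "NEWS"],
                "dir": {"@name": "bin", "file": "perl"}}}

root = ET.Element("listing")
for name, value in tree.items():
    build(root, name, value)
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping the type dispatch in one place means the nesting depth of the data never shows up as nesting depth in the code, which is the part that tends to turn into five-deep foreach loops.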

I think my solution is unworkable, even though it works. Even code which accesses a simple array of hashes can have a lot of punctuation in it. If the keys to a hash are references to something, then you just doubled the number of times your fingers hit the '{' or '}' keys. I'm serious: you can write something, banging away for 45 minutes, switch over to a term window, run it, get slightly off results, go back to your code... and get completely lost. From coding 45 minutes straight to completely lost, in the space of 90 seconds. You'll have foreach loops five deep, each of which looks like line noise. You'll run out of descriptive temp variable names. You'll make spaghetti.

I love Perl, I really do, but I don't think I have a head for the brutally complex stuff. I've been getting down on myself lately because I think maybe I'm an idiot for not looking at my problem and saying "Ah ha! Here's what we need...." Although now I think the problem is actually kinda complex. Any time you take free-form user input you have to account for "variances", but to build well-formed XML on the fly while doing it is very hard. I might not be cut out for the real propeller-head low-level stuff. Or maybe I need to finally break down and take a data structures class or something. Things will be better when my future (at least the immediate one) is secured.
