Import XML files from The Sports Network into an RDBMS.
Usage:
htsn-import [OPTIONS] [FILES]
The Sports Network <http://www.sportsnetwork.com/> offers an XML feed containing various sports news and statistics. Our sister program htsn is capable of retrieving the feed and saving the individual XML documents contained therein. But what to do with them?
The purpose of htsn-import is to take these XML documents and get them into something we can use: a relational database management system (RDBMS), i.e. "a SQL database". The structure of a relational database is, well, relational, and the feed XML is not. So there is some work to do before the data can be inserted.
First, we must parse the XML. Each supported document type (see below) has a full pickle/unpickle implementation ("pickle" is simply a synonym for serialize here). That means that we parse the entire document into a data structure, and if we pickle (serialize) that data structure, we get the exact same XML document that we started with.
This is important for two reasons. First, it serves as a second level of validation. The first validation is performed by the XML parser, but if that succeeds and unpickling fails, we know that something is fishy. Second, we don't ever want to be surprised by some new element or attribute showing up in the XML. The fact that we can unpickle the whole thing now means that we won't be surprised in the future.
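The round-trip idea can be illustrated with a minimal sketch. This is not htsn-import's actual code: it is written in Python for brevity, the "message" document type with its msg_id, category and body elements is invented, and the function names are hypothetical. ("Pickle" here means serializing to XML, not Python's pickle module.) The unpickler rejects anything it does not recognize, and the check compares canonical forms of the original and the re-serialized document.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical document type; these element names are invented for
# illustration and are not taken from the real feed.
@dataclass
class Message:
    msg_id: int
    category: str
    body: str

EXPECTED_CHILDREN = ["msg_id", "category", "body"]

def unpickle_message(xml_text: str) -> Message:
    """Parse XML into a Message, rejecting anything unexpected."""
    root = ET.fromstring(xml_text)
    if root.tag != "message" or root.attrib:
        raise ValueError("unexpected root element or attribute")
    if [child.tag for child in root] != EXPECTED_CHILDREN:
        raise ValueError("unexpected, missing, or reordered child elements")
    for child in root:
        if child.attrib or len(child):
            raise ValueError(f"unexpected content in <{child.tag}>")
    fields = {child.tag: child.text or "" for child in root}
    return Message(int(fields["msg_id"]), fields["category"], fields["body"])

def pickle_message(msg: Message) -> str:
    """Serialize a Message back to XML."""
    root = ET.Element("message")
    for tag, value in (("msg_id", msg.msg_id),
                       ("category", msg.category),
                       ("body", msg.body)):
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")

def round_trips(xml_text: str) -> bool:
    """True if unpickling and re-pickling reproduces the document."""
    # Canonicalize both sides so insignificant whitespace between
    # elements does not count as a difference.
    original = ET.canonicalize(xml_text, strip_text=True)
    rebuilt = ET.canonicalize(pickle_message(unpickle_message(xml_text)),
                              strip_text=True)
    return original == rebuilt
```

In this sketch, a document containing an extra element fails in unpickle_message even though it is well-formed XML, which is the "second level of validation" described above.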
The aforementioned feature is especially important because we automatically migrate the database schema every time we import a document. If you attempt to import a "newsxml.dtd" document, all database objects relating to the news will be created if they do not exist. We don't want the schema to change out from under us without warning, so it's important that no XML be parsed that would result in a different schema than we had previously. Since we can pickle/unpickle everything already, this should be impossible.
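The "create if they do not exist" behaviour can be sketched as follows. This is an assumption-laden illustration, not the real implementation: it uses Python's sqlite3 module, and the news table, its columns, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical schema for one document type; the table and column
# names are invented for this sketch.
NEWS_SCHEMA = [
    """CREATE TABLE IF NOT EXISTS news (
           id INTEGER PRIMARY KEY,
           xml_file_id INTEGER NOT NULL UNIQUE,
           headline TEXT NOT NULL
       )""",
]

def migrate(conn: sqlite3.Connection) -> None:
    """Create any missing database objects; a no-op if they exist."""
    for statement in NEWS_SCHEMA:
        conn.execute(statement)

def import_news(conn: sqlite3.Connection, xml_file_id: int, headline: str) -> None:
    """Run the migration, then insert one already-parsed document."""
    migrate(conn)  # runs on every import, so it must be idempotent
    conn.execute(
        "INSERT INTO news (xml_file_id, headline) VALUES (?, ?)",
        (xml_file_id, headline),
    )
    conn.commit()
```

Because the migration runs on every import, the schema is derived from the document types the parser accepts. That is why a document that unpickles differently than expected must be rejected before it reaches this step.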
Examples and usage documentation are available in the man page.