Format

The XDB file format is the successor to the .H! format, which, as far as I can tell, came about with the release of Hypertext Viewer version 2.02. I don't fully understand the file format, but I've found out enough to extract information from VSUM in a crude fashion.

Offset Description
001D51h A table of strings, each starting with a lower-case character and separated by 00h. Every occurrence of one of these strings elsewhere in the file is replaced by a two-byte reference. The first byte selects how the string is displayed: 02h displays it as is, 03h displays it as is with a trailing space, 04h displays it with the first character in upper case, and 05h is the same as 04h with a trailing space. The second byte selects the string to display: 01h is the first string, 02h the second, and so on.

In practice (at least for the VSUM XDB), only a subset of this table is used for in-record string references: the second byte is a single byte index, so it can only address 255 entries. The VSUM file contains more than 255 strings in total, but only the first 255 are referenced from the records.
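The string table and its two-byte references can be sketched in a few lines of Python. This is only a sketch under assumptions: the names read_string_table and expand_reference are my own, and since I don't know where the table ends, the caller has to pass an end offset.

```python
from typing import List, Optional

STRING_TABLE_OFFSET = 0x001D51  # offset quoted in these notes (VSUM file)


def read_string_table(data: bytes,
                      start: int = STRING_TABLE_OFFSET,
                      end: Optional[int] = None) -> List[bytes]:
    """Split the 00h-separated table into a list of raw strings."""
    # ASSUMPTION: the notes do not say where the table stops, so the
    # caller supplies `end` (None means "up to the end of the buffer").
    words = data[start:end].split(b"\x00")
    # A trailing 00h leaves empty entries at the end; drop them.
    while words and words[-1] == b"":
        words.pop()
    return words


def expand_reference(words: List[bytes], mode: int, index: int) -> bytes:
    """Expand one two-byte reference (mode byte 02h-05h, 1-based index byte)."""
    if not 0x02 <= mode <= 0x05:
        raise ValueError("mode byte must be 02h..05h")
    if not 1 <= index <= len(words):
        raise IndexError("string index out of range")
    word = words[index - 1]                  # 01h is the first string
    if mode in (0x04, 0x05):
        word = word[:1].upper() + word[1:]   # upper-case the first character
    if mode in (0x03, 0x05):
        word += b" "                         # trailing space
    return word
```

For example, with a table of "virus", "infects", "file", the reference 05h 02h expands to "Infects " (capitalised, with a trailing space).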

Decoded control byte summary (as observed in VSUM):

  • 00h: newline (0D 0A) replacement
  • 01h xx yy: run-length expansion; yy = DCh terminates display
  • 02h idx: insert word idx as-is
  • 03h idx: insert word idx as-is plus a trailing space
  • 04h idx: insert word idx with first letter upper-cased
  • 05h idx: same as 04h plus a trailing space
  • FFh: escape: display the next byte literally
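The control bytes above can be turned into a small decoder. What follows is a rough sketch, not the viewer's actual logic. In particular, I am guessing that in "01h xx yy" the xx byte is a repeat count and yy is the byte to repeat; the notes only establish that yy = DCh ends the display. The function name decode_text is my own, and it expects the string table as a list of byte strings.

```python
from typing import List


def decode_text(stream: bytes, words: List[bytes]) -> bytes:
    """Expand the control bytes in a record's text stream."""
    out = bytearray()
    i, n = 0, len(stream)
    while i < n:
        b = stream[i]
        if b == 0x00:
            out += b"\r\n"                    # 00h stands in for 0D 0A
            i += 1
        elif b == 0x01:
            if i + 2 >= n:                    # truncated sequence: stop
                break
            count, fill = stream[i + 1], stream[i + 2]
            if fill == 0xDC:                  # yy = DCh terminates display
                break
            # ASSUMPTION: xx = repeat count, yy = byte to repeat.
            out += bytes([fill]) * count
            i += 3
        elif 0x02 <= b <= 0x05:
            if i + 1 >= n:                    # truncated reference: stop
                break
            idx = stream[i + 1]
            # Out-of-range indexes become "?" (my choice, not the viewer's).
            word = words[idx - 1] if 1 <= idx <= len(words) else b"?"
            if b in (0x04, 0x05):
                word = word[:1].upper() + word[1:]
            if b in (0x03, 0x05):
                word += b" "
            out += word
            i += 2
        elif b == 0xFF:
            if i + 1 >= n:                    # nothing left to escape: stop
                break
            out.append(stream[i + 1])         # next byte is shown literally
            i += 2
        else:
            out.append(b)                     # ordinary text byte
            i += 1
    return bytes(out)
```

The output is still raw bytes, so any codepage conversion is left to the caller.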
005480h The start of the alphabetic virus list (includes string table references). Each virus name follows a space (20h) and is trailed by 01h. A variable number of bytes lies between one name and the next. The last virus name is trailed by 01 0C 20 02.

One more observation: this region contains not just virus names, but also other label-like text used by the viewer. If you treat every occurrence of the display pattern as a "virus name", you can end up accidentally indexing help pages later in the file whose header text begins with "Virus Name:" (field descriptions rather than a real virus entry).

016EB5h The first of the "PZPAPN" record separators. "PT" appears 32 bytes after that (given here as 016EEDh, although 016EB5h + 20h works out to 016ED5h, so one of the two figures is probably off), followed by the text for that record (includes string table references, as well as 00 and 01xx for spaces). This record is the first virus in VSUM, "2KB". The text runs up to the next "PZPAPN" separator. The last record is simply "PZ".

More detailed record layout (as observed in VSUMX.XDB):

  • Records are delimited by the ASCII marker string PZPAPN.
  • A practical way to locate records is to split the file on PZPAPN. In the VSUM file, the first virus record appears at split index 32 (i.e., record index = virus-number + 32). Entries beyond the main virus range exist and may be non-virus pages.
  • Within each record, the human-readable "page" text begins after a marker sequence that looks like a space followed by a capitalized-word reference: 20h 04h EDh. In VSUM, word EDh (index 237) decodes to the string "Index", so this marker corresponds to the literal text " Index".
  • After that marker, the rest of the record is a stream of text bytes and control codes (00h / 01h / 02-05h / FFh) as described above.
  • Some records are not viruses (indexes, cross-references, field descriptions). The VSUM virus range appears to be 0..1607, with a special page at 1608 ("A word from Patricia..."). Other pages can exist beyond this range.
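The record layout above can be sketched as follows. This is a minimal sketch for the VSUM file only: the function names are my own, and the constants (split index 32, virus range 0..1607, marker 20h 04h EDh) are the values observed above, not anything guaranteed for other XDB files.

```python
from typing import List, Optional

RECORD_MARKER = b"PZPAPN"
TEXT_MARKER = bytes([0x20, 0x04, 0xED])   # decodes to " Index" in VSUM
FIRST_VIRUS_SPLIT_INDEX = 32              # observed in VSUM only
LAST_VIRUS_NUMBER = 1607                  # observed in VSUM only


def split_records(data: bytes) -> List[bytes]:
    """Split the whole file on the PZPAPN marker."""
    return data.split(RECORD_MARKER)


def record_text(record: bytes) -> Optional[bytes]:
    """Return the still-encoded page text that follows the 20h 04h EDh marker."""
    pos = record.find(TEXT_MARKER)
    if pos < 0:
        return None                        # no marker: probably not a page
    return record[pos + len(TEXT_MARKER):]


def virus_record(records: List[bytes], virus_number: int) -> bytes:
    """Map a virus number to its split chunk (record index = number + 32)."""
    if not 0 <= virus_number <= LAST_VIRUS_NUMBER:
        raise ValueError("outside the observed VSUM virus range")
    return records[virus_number + FIRST_VIRUS_SPLIT_INDEX]
```

The text returned by record_text still contains the control bytes, so it needs to go through the control byte decoding before it is readable.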

Encoding note: the viewer is a DOS-era application, so some characters are in an OEM codepage (commonly CP437). If you display the output in a modern UTF-8 web page, you may need to convert CP437->UTF-8, or strip non-ASCII characters, depending on the desired presentation.
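Both options can be sketched with Python's built-in cp437 codec. This assumes the text really is CP437, which I have not confirmed for every record; the function names are my own.

```python
def cp437_to_utf8(raw: bytes) -> bytes:
    """Re-encode decoded record text from CP437 to UTF-8."""
    # ASSUMPTION: the text is CP437; other OEM codepages are possible.
    return raw.decode("cp437").encode("utf-8")


def ascii_only(raw: bytes) -> str:
    """Alternative: drop every character that is not plain ASCII."""
    text = raw.decode("cp437")
    return text.encode("ascii", errors="ignore").decode("ascii")
```

For instance, the CP437 byte 81h is "ü", so the bytes 4Dh 81h 6Eh 63h 68h 65h 6Eh come out as "München" in UTF-8 and as "Mnchen" when stripped to ASCII.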

All of this applies to VSUMX809.ZIP.