Migrating documentation

Published on 7 December 2023 by Andrew Owen (4 minutes)

Migrating documentation from one software platform to another can be painful. I remember the days when moving a Word document back and forth between Mac and Windows caused problems. I started working as a technical writer in September 2006, a month after the launch of Pandoc (the “universal document converter”), although I didn’t learn of its existence until many years later. It certainly would have helped with some of the migrations I’ve managed, but not all of them:

  • Multiple-sourced to single-sourced Word with Doc2Help.
  • Word to DocBook.
  • Flare to FrameMaker.
  • FrameMaker to DITA.
  • AsciiDoc to DocBook.

Pandoc supports DocBook and Word as input formats, but AsciiDoc is only supported as an output format, and it can’t handle Flare or FrameMaker.
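Format support changes between Pandoc releases, so it can be worth checking a planned migration against the version you actually have installed. The following is a minimal sketch, assuming `pandoc` is on your PATH. The helper names `pandoc_formats` and `unsupported` are made up for illustration; only the `--list-input-formats` and `--list-output-formats` options come from Pandoc itself.

```python
import subprocess


def pandoc_formats(direction):
    """Ask the installed Pandoc for its formats; direction is 'input' or 'output'."""
    result = subprocess.run(
        ["pandoc", f"--list-{direction}-formats"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Pandoc prints one format name per line.
    return set(result.stdout.split())


def unsupported(paths, inputs, outputs):
    """Return the (source, target) pairs that Pandoc cannot convert directly."""
    return [
        (source, target)
        for source, target in paths
        if source not in inputs or target not in outputs
    ]
```

You could call it as `unsupported([("docx", "docbook")], pandoc_formats("input"), pandoc_formats("output"))`. An empty list means every planned path is covered by the installed version.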

Typically, the most difficult migration is going from unstructured content (such as unstructured FrameMaker) to structured content (such as DITA). This is because, depending on who wrote it, the source material may require heavy editing before migration. The easiest migration I did was from AsciiDoc to DocBook, but that’s because AsciiDoc is a lightweight way of representing DocBook. All I had to do was to fix a few validation errors. But that’s an edge case.

The simplest migration you’re likely to do will be from some flavor of Markdown to a more complex system. Perhaps counter-intuitively, the second-simplest migration you’re likely to do will be the reverse (complex to simple). This is because you’ll be throwing away style and meta information, which greatly simplifies the process. If you’re going from a proprietary system, the simplest solution is probably to publish to HTML and then run the output through Pandoc to whichever format you require.
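As a rough illustration of the publish-to-HTML-then-Pandoc route, here is a sketch that walks a folder of published HTML and converts each page. It assumes `pandoc` is installed and on your PATH. The function names, the folder layout and the DocBook default are placeholders for illustration, not a recommendation.

```python
import subprocess
from pathlib import Path


def pandoc_command(source, output, target_format):
    """Build the Pandoc command line for converting one HTML file."""
    return [
        "pandoc", str(source),
        "--from", "html",
        "--to", target_format,
        "--output", str(output),
    ]


def convert_tree(html_dir, out_dir, target_format="docbook", suffix=".xml"):
    """Convert every .html file under html_dir, mirroring the folder structure."""
    out_root = Path(out_dir)
    converted = []
    for source in sorted(Path(html_dir).rglob("*.html")):
        output = out_root / source.relative_to(html_dir).with_suffix(suffix)
        output.parent.mkdir(parents=True, exist_ok=True)
        # check=True stops the run on the first file Pandoc cannot convert.
        subprocess.run(pandoc_command(source, output, target_format), check=True)
        converted.append(output)
    return converted
```

For Markdown output you might call `convert_tree("published", "converted", "gfm", ".md")`. Expect to review the results by hand, because navigation, headers and footers in the published HTML will come along for the ride unless you strip them first.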

It’s worth noting that there are two types of Word documents: the older ones in a proprietary binary format (.doc) and the newer ones in an XML-based format (.docx). Most migration tools will require the newer format. If you have Word installed locally, you can use the Python script doc2docx to batch-convert files in the older format. When I migrated the entire suite of docs for a POS (point of sale) and back-office system from Word to DocBook, I wasn’t aware of Pandoc. But I was aware of XSLT. After exporting all the images and saving the Word files in XML format, I was able to write a simple transform to convert the content to DocBook. I’m told my successor had a much easier time migrating that content to FrameMaker.
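To give a feel for how small such a transform can be, here is a sketch of the same idea. It is not the original XSLT: it uses Python's standard library instead, and it assumes the WordprocessingML markup found in `word/document.xml` inside a .docx file rather than whatever XML flavor the original files were saved in. The mapping (heading styles to `bridgehead`, everything else to `para`) is a deliberate simplification, and the output is a fragment that would still need validating.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML (the XML inside a .docx file).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def word_xml_to_docbook(document_xml):
    """Map Word paragraphs to a flat DocBook fragment (illustrative only)."""
    root = ET.fromstring(document_xml)
    article = ET.Element("article")
    for para in root.iterfind(".//w:body/w:p", NS):
        # A paragraph's text is split across runs; join the pieces back together.
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        if not text.strip():
            continue  # skip empty paragraphs used for spacing
        style = para.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val", "") if style is not None else ""
        # Simplification: any "Heading..." style becomes a free-floating heading.
        tag = "bridgehead" if style_name.startswith("Heading") else "para"
        ET.SubElement(article, tag).text = text
    return ET.tostring(article, encoding="unicode")
```

A real migration would also need to handle lists, tables, images and inline formatting, which is where most of the effort goes.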

Flare uses XHTML. It also adds its own proprietary tags, but these are easy to identify. It’s fairly trivial to open a whole project in VS Code and use search and replace to get clean XHTML. You should then be able to convert this to plain HTML using Pandoc. However, Flare and other complex systems have features such as text snippets and variables. You’ll need to export these separately. And if you’re going from one complex system to another that doesn’t include its own dedicated import tool, you may want to preserve this information outside the tool you are using, such as in an Excel spreadsheet.
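The search-and-replace step can also be scripted. The sketch below assumes that the proprietary markup is identifiable by a `MadCap:` prefix on element and attribute names. Check that against your own project files before relying on it. The function names are invented for illustration, and regular expressions are a blunt instrument for XML, so treat this as a starting point rather than a finished tool.

```python
import re

# Assumption: proprietary elements and attributes carry the "MadCap:" prefix.
MADCAP_TAG = re.compile(r"</?MadCap:([\w.-]+)\b([^>]*)>")
MADCAP_ATTR = re.compile(r'\s+(?:xmlns:)?MadCap(?::[\w.-]+)?\s*=\s*"[^"]*"')


def inventory_madcap(xhtml):
    """List (element name, raw attributes) for each opening or self-closing tag.

    Record this before stripping so that snippet and variable references
    can be kept outside the tool, for example in a spreadsheet.
    """
    found = []
    for match in MADCAP_TAG.finditer(xhtml):
        if not match.group(0).startswith("</"):
            found.append((match.group(1), match.group(2).strip(" /")))
    return found


def strip_madcap(xhtml):
    """Remove prefixed tags and attributes, keeping any text they wrapped."""
    without_elements = MADCAP_TAG.sub("", xhtml)
    return MADCAP_ATTR.sub("", without_elements)
```

Running `inventory_madcap` first matters because stripping is lossy: once a variable reference is removed, nothing in the cleaned file tells you which value belonged there.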

In my experience, migrating to DITA was the most painful. DITA is very restrictive, but fortunately I was the original author of the content, so even though I had been using unstructured tools, the content was well structured. Even so, I ended up copying and pasting the content from unstructured FrameMaker into the new system. To the best of my knowledge, even modern component content management systems lack direct import tools for unstructured FrameMaker. I used the copy-paste approach because, in my evaluation, it was quicker and simpler than the alternative: converting unstructured FrameMaker into structured FrameMaker as a first step.

The lesson, then, is that document migration needs to be managed like any other software implementation project. You need a plan to get from the old system to the new system. You need a realistic timeframe to do it in. You need the resources to do it. And depending on your delivery cadence, you may need to maintain both systems in parallel for a period of time (which may require additional resources).

Finally, this isn’t a task that you want to do on a regular basis. So when choosing a new system, make sure that it won’t lock you in and that it will scale to meet your future needs. Consider how much time it will take to train up your docs team to use it. And make sure it has good localization support, even if you currently have no intention of localizing your content. It’s better to have it and not need it than need it and not have it.