We’re not entirely sure that we need an XML workflow, because we don’t see exactly how we will benefit from it in our publishing program.

Last week, I watched a webinar (well, a recording of it) in which university presses were discussing (in a “panel” format) how and to what extent they had implemented an XML workflow. A common theme in their discussion was, “We’re not entirely sure that we need an XML workflow, because we don’t see exactly how we will benefit from it in our publishing program.”

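Several “advantages” of an XML workflow were stated by the publishers in the course of their discussion. According to the publishers on the panel, an XML workflow can:

  • Discipline the publishing process
  • Provide an archival, “future-proof” file
  • Make ebook production easier
  • Enable the creation of specialized outputs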

For the most part these publishers understand some of the advantages of XML, even though they didn’t present those advantages in an analytical framework. (Publishers understand their own business better than they sometimes give themselves credit for, and some solution providers are all too happy to play on that lack of certainty.)

However, publishers don’t have a clear understanding of how to accomplish each of these advantages from within the context of their own publishing operation. There is a strong disconnect between the “ideal” of “XML workflow,” on one hand, and the practical matter of improving our own publishing processes, on the other.

It is also the case that many of these advantages are not, strictly speaking, advantages of an XML workflow at all. Several of them can be realized without buying anything from an XML workflow vendor or adopting anything more complex than a few simple procedures.

So, without further ado, let’s talk about these advantages and point the way toward accomplishing each in the context of a publishing enterprise.

Disciplining the Publishing Process

The idea here is that, when you prepare for production in an XML workflow, you have to have a higher level of discipline in how you do it, and this discipline is good for the publishing process as a whole. One example: XML workflows often require the use of paragraph and character styles for formatting. Adopting the XML workflow requires that we do a better job of styling our publications.

One thing that is clear, though, is that you don’t have to adopt an “XML workflow” in order to discipline your publishing process in this way. Any publishing team can, right now, start using paragraph and character styles in a more disciplined way in the publishing process. How you do this very much depends on your publishing context. For many presses, it means adopting a consistent set of paragraph and character styles that are used in both editorial (Microsoft Word, usually) and typesetting (InDesign, usually). The editors can add keyboard shortcuts to their editorial template to assist in applying the house styles. The designers and typesetters can set up a design template that uses these styles and significantly speeds the production process.

The discipline in this system can come, for instance, by creating a script in Word or InDesign that “audits” a manuscript to list the styles that are used, and any non-style hard-coded formatting that needs to be addressed. I have created scripts like this for the publishers I have worked with, and they have found them to be very helpful to make sure that everything is the way it should be for publication. Such disciplined systems can be created apart from an XML workflow, or within one.
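
To give a rough idea of what such an audit can look like outside of Word itself, here is a minimal Python sketch that reads a .docx directly and counts the paragraph styles, character styles, and hard-coded run formatting it finds. This is not the script I wrote for any publisher; the function names and the set of properties treated as “hard-coded” are assumptions made for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Run properties treated as "hard-coded" formatting in this sketch.
# This set is an assumption; extend it to match your own house rules.
DIRECT_RUN_FORMATTING = {"b", "i", "u", "sz", "color", "rFonts", "highlight"}


def audit_document_xml(xml_bytes):
    """Count paragraph styles, character styles, and direct run formatting."""
    root = ET.fromstring(xml_bytes)
    para_styles, char_styles, direct = Counter(), Counter(), Counter()

    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        # Paragraphs with no explicit style fall back to the document default.
        name = style.get(f"{{{W}}}val") if style is not None else "(default)"
        para_styles[name] += 1

    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is None:
            continue
        for prop in rpr:
            local = prop.tag.rsplit("}", 1)[-1]
            if local == "rStyle":
                char_styles[prop.get(f"{{{W}}}val")] += 1
            elif local in DIRECT_RUN_FORMATTING:
                direct[local] += 1

    return para_styles, char_styles, direct


def audit_docx(path_or_file):
    """Audit the main document part of a .docx (a path or a file object)."""
    with zipfile.ZipFile(path_or_file) as docx:
        return audit_document_xml(docx.read("word/document.xml"))
```

Calling audit_docx("manuscript.docx") (a hypothetical file name) returns three counters. Any entry in the third one points to formatting that was applied by hand rather than through a style.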

Providing an Archival, “Future-Proof” File

Publishers understand that XML is a “future-proof” file format. To put it technically, it is a textual file format that uses tags in text to represent formatting and semantic information about the content, and the file itself is usually encoded as ASCII or UTF-8 text. This stands in contrast to application files, which are often binary and unusable outside of the application. So, the thinking is, adopting an XML workflow will help us to ensure that our publication assets are available for whatever applications might be available now or in the future, whether or not the applications that created these files are still available.

There is truth in this, but what many publishers don’t seem to understand is that you don’t need to adopt an “XML workflow” in order to ensure that your files are future-proof. You just need to “Save As” file formats that are future-proof. Here are a few file formats that you can use that will almost certainly be readable in 100 years:

  • Word .docx — If you use Microsoft Word, you are probably using the .docx file format already. If so, that is good, because .docx is just a .zip file containing .xml files. Both .zip and .xml are as future-proof as any file format can be: Everyone uses them, the industry has standardized on them, so it is almost certain that software to process .zip and .xml will be available in 100 years.
    However, if you’re still using the old .doc file format, please switch to .docx. The .doc file format is an obsolete binary format, and it is not at all certain it will be supported 5 years from now. (I had the experience, about 5 years ago, of trying to open an old Word 6 .doc file, but the current version of Word had removed support for that file format, so I had to go hunting for a converter. A word to the wise: Use .docx.)
  • InDesign .idml — With InDesign, the situation is a little more difficult, because the default file format is .indd, Adobe’s own binary file format, which is not future-proof. However, you can “Save As” .idml, which is a .zip file containing .xml files, just like Word .docx (and many others). My recommendation is to go ahead and do your typesetting with .indd, but before you put the project to bed, “Save As” .idml and archive that file. Now you are future-proof: That file will be usable in 100 years.
  • Adobe .pdf (unencrypted) for vector art — Even though .pdf is usually a binary file, the specification for it has been published, and there are lots of tools that can work with it. If you are using Adobe Illustrator to create artwork, use .pdf as your file format rather than Illustrator’s file format. It is almost certain that .pdf files will be usable in 100 years.
  • Adobe .dng for raw photographs — Many camera manufacturers have created their own proprietary raw file format. Raw files are much better than .jpgs for storing photographs, but you don’t want to use the manufacturers’ proprietary formats. Instead, Adobe has created the .dng format (“digital negative”) to address this need, and has provided a free converter to enable photographers and publishers to convert raw files from proprietary to standard .dng.
  • .jpg, .png, or .tiff for other raster artwork — These image formats are well understood, with published specifications, and widely used. The .png and .tiff formats are lossless, which means that all of the image data is retained. By contrast, .jpg is lossy: the compression algorithm throws away some of the image data depending on the “quality” number that you use (usually 1–100). There is another advantage to .tiff and .jpg: They both enable image metadata to be embedded using EXIF, which is also a standard.
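
If you want to see for yourself that .docx and .idml are just .zip files containing .xml files, a few lines of Python are enough to look inside one. This is only a sketch; the function name is made up for illustration, and it uses nothing beyond Python’s standard zipfile module.

```python
import zipfile


def list_xml_parts(path_or_file):
    """Return the names of the .xml parts inside a .docx or .idml package."""
    with zipfile.ZipFile(path_or_file) as package:
        return sorted(n for n in package.namelist() if n.endswith(".xml"))
```

Running list_xml_parts on an archived .idml or .docx (for example a hypothetical "book.idml") lists the XML parts that any future tool would need to read.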

In summary, there are simple ways to ensure that all publication assets are stored in “archival, future-proof” formats. Again, implementing these changes to your workflow can be as simple as creating a procedure that you always follow, such as always saving the InDesign file as .idml when putting the project to bed. It is also possible to implement automated processes that do these housekeeping tasks for you, so that you have less to keep track of.

Making Ebook Production Easier

A lot of the publishers’ discussion during the above-mentioned webinar had to do with making ebook production more sensible as a normalized part of their product development workflow. The theory is, if our publishing process produces an archival XML file, it will be a lot simpler to create ebooks from that file. It seems that this is also the promise that many solution providers have proposed. But the reality of achieving that result has been elusive and expensive for many publishers.

I have implemented several XML-based ebook production workflows for publishers. I have been doing this for almost 10 years. There is as yet no one-size-fits-all solution to this issue; I would love to provide it, but it’s a complex problem: Study Bibles, for instance, have very different requirements from academic books or Sunday School curriculum.

However, I have enough experience with the nuts and bolts of this area to make several definitive statements. These statements might help guide you in integrating ebook production into your print production workflow. Some of these statements will be bare, to be filled out in a future post.

  • InDesign is usually the best place to maintain the canonical, archival version of the publication. Unless your content has very complex requirements that do not fit within the context of print typesetting, InDesign is the best place to maintain the content in a single source. (Study Bibles are often too complex, so there is an unavoidable transition to maintaining two sources: the print typesetting, and the XML archive.) For the vast majority of fiction and non-fiction books, there is no need to maintain a separate “XML archive” of the files.
  • InDesign’s EPUB export is worthless. Don’t use it if you can avoid it. I will enumerate the reasons in a later post.
  • InDesign is, however, capable of holding all of the structures that you want to appear in your ebooks, within the context of an InDesign publication. For example, paragraph and character styles can be mapped to the equivalent structures in ebooks.
  • InDesign .idml can be the source of an ebook production workflow. The .idml file format can be converted directly to .xhtml (web pages using XML syntax) and .css (stylesheets that control visual formatting). Creating this conversion is a technical challenge, but it is not intractable.
  • EPUBs are just .zip files that contain .xhtml and .css, along with a couple of ebook structure files. If you can create .xhtml + .css web pages from your .idml file, it is a small step from there to having a valid EPUB.
  • Kindle ebooks can be created from valid EPUB files. The tools to do this are available for free from Amazon.
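
To make the “EPUBs are just .zip files” point concrete, here is a minimal Python sketch of the packaging step only. It assumes you have already produced the .xhtml, .css, and package (.opf) files; the function name and the file paths in the usage note are hypothetical. The sketch writes the two things the EPUB container format expects: a mimetype entry that comes first and is stored uncompressed, and a META-INF/container.xml that points at the package file. It does not check that the result passes an EPUB validator.

```python
import zipfile

# The container file tells reading systems where the package (.opf) file lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="{opf_path}" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def package_epub(out, opf_path, files):
    """Zip already-prepared content files into an EPUB container.

    out      -- output path or writable binary file object
    opf_path -- path of the package file inside the archive
    files    -- dict mapping archive paths to str or bytes content
    """
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry must be first and must not be compressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml",
                      CONTAINER_XML.format(opf_path=opf_path))
        for name, data in files.items():
            epub.writestr(name, data)
```

A call such as package_epub("book.epub", "OEBPS/content.opf", files) would take a dictionary holding the package file, the chapters, and the stylesheet. Whether the result is a valid EPUB depends on the content of those files, which is where the real work lies.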

The bottom line is that you can integrate ebook production into your InDesign-based print production workflow. The best approach to this uses XML throughout the process. It is somewhat complicated, and it requires some specific technical knowledge, but it is a tractable problem that can be solved. I plan to unpack this subject in more depth in the coming weeks.
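
As a small taste of what “XML throughout the process” means, here is a Python sketch of the core idea: mapping InDesign paragraph styles to ebook markup. The sample story is a heavily simplified stand-in for the story XML found inside an .idml file, and the style names and the mapping table are hypothetical. A real conversion also has to handle character styles, multiple paragraphs within one style range, tables, notes, images, and much more.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical house-style mapping: paragraph style name -> (HTML tag, CSS class).
STYLE_MAP = {
    "Chapter Title": ("h1", "chapter-title"),
    "Body Text": ("p", "body"),
}

# Simplified stand-in for a story file from an .idml package (an assumption:
# real stories carry many more attributes and one range can hold several paragraphs).
SAMPLE_STORY = """<Story>
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Chapter Title">
    <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
      <Content>Chapter 1</Content>
    </CharacterStyleRange>
  </ParagraphStyleRange>
  <ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Body Text">
    <CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
      <Content>It was a dark and stormy night.</Content>
    </CharacterStyleRange>
  </ParagraphStyleRange>
</Story>"""


def story_to_xhtml(story_xml):
    """Turn each paragraph style range into one XHTML block element."""
    root = ET.fromstring(story_xml)
    blocks = []
    for para in root.iter("ParagraphStyleRange"):
        # "ParagraphStyle/Body Text" -> "Body Text"
        style = para.get("AppliedParagraphStyle", "").split("/", 1)[-1]
        tag, css_class = STYLE_MAP.get(style, ("p", "unmapped"))
        text = "".join(c.text or "" for c in para.iter("Content"))
        blocks.append(f'<{tag} class="{css_class}">{escape(text)}</{tag}>')
    return "\n".join(blocks)
```

Any style that is missing from the mapping table comes out as a paragraph with the class “unmapped,” which makes it easy to spot styles that still need a decision.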

Enabling the Creation of Specialized Outputs

The final advantage of an XML workflow that the publishers mentioned while participating in the webinar was that XML makes it possible to create specialized outputs for particular channels. For example, one publisher talked about wanting to create BITS XML for medical publishing. Others talked about putting their content on HighWire, a system for scholarly publishing (the webinar participants were university presses). These systems have very specific requirements, so we cannot say anything in general terms about how to design a system that will meet those requirements. However, it is likely that an XML workflow is the best way for publishers to ensure that they can provide their content in a format that meets those requirements.

Because I believe that, for most purposes, the InDesign publication is the best form for the “archival, future-proof file,” I am interested in exploring how a system that provides content to online databases like HighWire could be designed around InDesign.

Another topic in this connection is semantic indexing and tagging. Digital content can have a variety of “entities” in the text itself tagged and indexed. For instance, in work with Bible and Bible reference publishers, we have found it to be valuable to tag Scripture references that occur in the text flow. Often, the tagged content is used in ebooks and Bible reading apps where Scripture references are expected to be links. How we store these tags, especially in the context of an InDesign-centric workflow, is an interesting and important question to solve in order to fully meet the needs of these publishers. Similar questions will arise in other content domains.
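
To illustrate what tagging Scripture references can look like at its simplest, here is a Python sketch that wraps references in link markup. The book list, the regular expression, the class name, and the data-ref format are all assumptions made for this example. A production tagger has to deal with abbreviations, chapter-only references, lists of verses, and many other cases.

```python
import re

# A tiny, hypothetical book list; a real system would cover every book
# and its common abbreviations.
BOOKS = ["Genesis", "Psalm", "Matthew", "John", "Romans", "1 Corinthians"]

# Matches "Book chapter:verse" with an optional verse range, e.g. "Romans 8:28-30".
REFERENCE = re.compile(
    r"\b(?P<book>" + "|".join(re.escape(b) for b in BOOKS) + r")\s+"
    r"(?P<chapter>\d+):(?P<verse>\d+)(?:[-–](?P<end>\d+))?"
)


def tag_references(text):
    """Wrap each recognized Scripture reference in an <a> element."""
    def wrap(match):
        target = "{}.{}.{}".format(
            match["book"].replace(" ", ""), match["chapter"], match["verse"]
        )
        if match["end"]:
            target += "-" + match["end"]
        return f'<a class="scripture-ref" data-ref="{target}">{match.group(0)}</a>'

    return REFERENCE.sub(wrap, text)


print(tag_references("See John 3:16 and Romans 8:28-30."))
# prints: See <a class="scripture-ref" data-ref="John.3.16">John 3:16</a> and <a class="scripture-ref" data-ref="Romans.8.28-30">Romans 8:28-30</a>.
```

The open question raised above still stands: this sketch tags plain text on the way out, while an InDesign-centric workflow needs somewhere to store those tags inside the publication itself.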


Of the four “advantages of an XML workflow” that we began with, only two of them are specifically related to the use of XML in content processing: Streamlining and normalizing ebook production; and enabling the creation of specialized outputs.

In both cases, XML is the medium through which content flows as it is transformed from one format (usually a typeset publication) to another (an ebook, or an online content database). In the case of ebooks, all of the content structures that are needed can be accomplished within an InDesign-centered workflow. On the other hand, structuring content for an online database depends entirely on the requirements of that particular database.

The other two stated “advantages” of XML are really not related to the use of an XML workflow at all, but can be achieved within a traditional publishing workflow. However, it is fair to say that an XML workflow, if adopted, will also play a role in achieving these advantages.

What are your experiences with each of these areas of the publication process?

Posted by Sean Harrison

  1. The UNC School of Government has published online for the past few years a publication and annual supplemental content that is completely formatted and output as XML with InDesign. The Drupal migration tool was created by a local contractor, Design Hammer. The annual publishing cycle has provided opportunities for refinement of all process phases, and the client adoption has slowly begun to make the effort worthwhile. The print version (800+ pages) has a very strong following, and this has challenged us to create a more efficient online product.

