XML and Content Strategy
Why and how to “future-proof” your content
Publishers and other information providers increasingly use multiple media to display their content for various applications. Books become e-books, online journal articles are published online first and in print later, and figures are aggregated into image databases. Users request chunks of content, or the publisher assembles pieces of content from multiple publications into a new publication. Information users want what they want, when they want, in the form they want. As publishers work to respond to the changing needs of their constituencies, the challenge is: how can publishers “future-proof” their content?

Even today, content takes many forms and has many uses. Publishers find that they need to adapt their content in various ways (figure 1).

• Web-ready HTML on proprietary platforms
• HTML for web previewing
• PDF for printing/viewing/downloading
• Distribution by third-party aggregators (Ovid, EBSCO)
• Abstract & indexing services (Scopus)
• Mobile devices (iPad, smartphones)
• Archival solutions (Portico)
Figure 1: Sampling of data output

If today’s situation is not complicated enough, the future is likely to be even more complex. How can information providers respond to the changing needs of customers and new technologies with greater facility in terms of time, cost, and effort?

By far, the most practical, most versatile tool for manipulating content for both current formats and those not invented yet is XML. XML (eXtensible Markup Language) is an open standard; its power derives from the fact that XML has been adopted by entire industries, many government agencies, and platform developers. When new standards emerge, such as EPUB for e-book readers, the standards are derived from generic XML, allowing even files created a few years hence to flow readily into the new standard.

Some organizations think of XML as a different set of tags. While XML tags are different from those used in other systems like SGML or HTML, XML is actually a different way of thinking about content. Karen Colson, director of publishing and communications at the Association for Research in Vision and Ophthalmology (ARVO), explains it simply:

“XML describes content, not appearance.”

An XML tag (actually, a pair of tags, one at the beginning and one at the end of an element) might indicate that a section of copy is a first-level heading inside a book chapter. The actual appearance of the heading, however, is determined by a different style sheet for each application. The typeface and size that appear in the book might be completely different if the book is available on an e-reader, and they might be different still if the book is included on the electronic platform of a third-party aggregator.
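The separation of content from appearance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (`chapter`, `title`, `heading1`, `para`) and the two "style sheets" are hypothetical and are not taken from any real DTD or publishing system. The same tagged source is rendered twice, once per output.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names are illustrative, not from a real DTD.
SOURCE = """<chapter>
  <title>Future-proofing content</title>
  <heading1>Why XML</heading1>
  <para>XML describes content, not appearance.</para>
</chapter>"""

# Each "style sheet" maps an element name to a presentation rule for one output.
WEB_STYLE = {
    "title": "<h1>{}</h1>",
    "heading1": "<h2>{}</h2>",
    "para": "<p>{}</p>",
}
EREADER_STYLE = {
    "title": "== {} ==",
    "heading1": "-- {} --",
    "para": "{}",
}

def render(xml_text, style):
    """Apply one style sheet to every child element of the chapter."""
    root = ET.fromstring(xml_text)
    return [style[child.tag].format(child.text) for child in root]

web = render(SOURCE, WEB_STYLE)
ereader = render(SOURCE, EREADER_STYLE)
print("\n".join(web))
print("\n".join(ereader))
```

The source text never changes; only the mapping from tag to appearance does, which is why a new output format needs a new style sheet rather than new content.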
SPi Global
2807 North Parham Road, Suite 350, Richmond, VA 23294
T 1 804 262 4219 www.spi-global.com
The tag for a first-level heading also can function as metadata. For instance, a book’s table of contents might be constructed by copying chapter titles and first-level headings. Or, perhaps an aggregator’s general search function could look primarily at first-level headings. In either case, a pair of tags that starts out regulating appearance can have multiple programmatic applications as well.

Organizations that want to get the most out of XML apply it consistently and as early as possible in the content development process. When this happens, editing changes are captured within a single, authoritative XML file, all XML files are built according to the same rules, and the final XML file is the source for all types of output. Creating this capability requires thoughtful planning and technically astute implementation.

Planning for end-to-end XML Workflow
The most reliable and powerful way to apply XML to documents is to do so at the very beginning of the production cycle. In organizations where content is created by employees, the content creator may enter tags, often using shortcuts or templates. For most publishers, however, tags are applied by skilled markup operators based on the list of tags available to them (more on this below). Most markup operators work for compositors, so their function sometimes overlaps with typesetting. But markup is a distinct function in the production process. Once the tags are applied, production can proceed (figure 2).

Figure 2: Production process using XML (steps shown: XML markup, copyediting, typesetting, page layout, proofreading, content repository, multiple outputs)

When an error occurs, the correction is made in the native XML file so that the error can be corrected in every product that flows from the content. Making corrections in the native XML file represents the industry’s best practice, but practical challenges exist even with this approach.

Julia Sawabini, director of e-commerce at Elsevier, explains that to build the web page for a particular product, Elsevier pulls content from a database containing fields variously populated by editorial, production, and marketing people. The information is organized via style sheets, but no content is created at this point. “If there’s something wrong on the website, it’s wrong someplace along the way. I can’t change it.”

Once a correction is made, the change may not appear immediately, as the website is updated in batches at specified intervals. The incorrect product information will appear on the site until the update takes place. Also, the incorrect material will remain on the servers of distributors, e-bookstores, and other outlets for the information unless corrected files are sent and uploaded.

An analogous challenge occurs in publishing printed materials. Sometimes a production person spots an error while processing a PDF for the printer. The temptation, and often the reality, is that the production person corrects the PDF and sends it on to the printer, breathing a sigh of relief. Unless the production manager remembers to go back to make the same correction, the error still exists in the XML file.

Implicit in this discussion is the notion that XML workflow includes an element that is rarely critical in a single-medium product: what Phil Schafer, director of production at Elsevier, describes as “a central content repository with full functionality.” It is not enough to save all content to a particular server. Ideally, the content will flow into a database-like structure that enables the owner or other authorized users to find specific content and manipulate it for specific publishing applications.
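As a rough sketch of the single-source workflow described in this section, the Python below corrects an error once in the XML source and then regenerates two outputs from it: a table of contents built from chapter titles and first-level headings, and a simple web list. The element names and the functions `table_of_contents` and `web_page` are assumptions made for illustration; a production system would work against the publisher's own DTD and repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical source file; element names are illustrative only.
# The first chapter title contains a deliberate typo ("Planing").
SOURCE = """<book>
  <chapter><title>Planing for XML</title><heading1>Markup</heading1></chapter>
  <chapter><title>The DTD</title><heading1>Genus and species</heading1></chapter>
</book>"""

def table_of_contents(root):
    """Collect chapter titles and first-level headings, in document order."""
    entries = []
    for chapter in root.iter("chapter"):
        entries.append(chapter.findtext("title"))
        entries.extend("  " + h.text for h in chapter.findall("heading1"))
    return entries

def web_page(root):
    """Render the chapter titles as a simple HTML list."""
    items = "".join(f"<li>{c.findtext('title')}</li>" for c in root.iter("chapter"))
    return f"<ul>{items}</ul>"

root = ET.fromstring(SOURCE)

# Correct the typo once, in the source tree, not in any single output.
for title in root.iter("title"):
    if title.text == "Planing for XML":
        title.text = "Planning for XML"

# Every output regenerated from the corrected source picks up the fix.
toc = table_of_contents(root)
page = web_page(root)
print("\n".join(toc))
print(page)
```

Had the fix been made only in one output (the PDF case described above), the other outputs and the source itself would still carry the error.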
Content repositories can be critical in highly regulated areas such as medicine. Larry McGrew, head of content and editorial operations at Aetna, relies on multiple content management systems with carefully approved material to populate Aetna’s sites that are central to their members’ experience. McGrew admits that this has been “extremely challenging” to implement.

Data in the content management systems are heavily tagged with metadata so users can get optimal search results despite the multiple original sources of the material.

The DTD
The Document Type Definition (DTD), the very rough equivalent of type specifications for print products, specifies both how an element will look in print, on the web, on e-book readers, etc., and, to some extent, what the element means. DTDs need to code both data and metadata.

To explain how a DTD functions, look at the different tagging possibilities for how genus and species might be handled depending on the media and application. For instance, we assume that readers of this white paper belong to the species Homo sapiens. It is probably sufficient therefore to surround Homo sapiens with XML tags that mean “put these words in italics no matter what other appearance specifications you have.” But in a zoology book, you might want to put each genus/species into the index. In that case, you could put tags around the phrase Homo sapiens that indicate “these words are genus and species: put them in italics, and remember to make an index entry for this term.” In an anthropology book, you might want to distinguish between Homo sapiens and other species such as Homo erectus, and treat both species as index sub-entries under the genus Homo. In that case, you’d put a pair of tags around Homo indicating “this is a genus”, and a tag around either sapiens or erectus indicating “this is a species.” Instructions for constructing the index would complete the picture.

The previous paragraph took 186 words to discuss how to treat genus and species in a DTD. Multiply this by the many editorial, functional, design, and marketing considerations in any one publication, and then multiply it again by the range of publications you hope to represent with a single DTD. The considerations become massive, and the temptation might be to skimp on the detail of the DTD (for instance, coding for genus and species together, rather than separately). This might be a false economy, though. Nina Chang, senior publisher for e-journals at Lippincott Williams & Wilkins, points out:

“Richly tagged data allow for more precise searching.”

In STM and scholarly publishing, searchers want to retrieve the information that really matters, so the detail of the DTD is important to the perception of quality. It’s helpful to refine the DTD as much as possible before implementation.

As Schafer points out, “If we choose to introduce a new element, we have to take it to a supplier support data team to ensure that it’s implemented across all of our journals.” And Chang of LWW points out that changing the DTD has implications for archival data as well. For instance, do you go back and insert new tags to keep up with the functionality of new material? This requires a business decision: What are the changes worth to the users, compared with the inevitable costs?
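The payoff of tagging genus and species separately, as in the anthropology example above, can be illustrated with a short Python sketch. The element names `taxon`, `genus`, and `species` are hypothetical, chosen only for this example. Note also that Python's standard XML parser does not validate against a DTD, so the sketch shows what fine-grained tags make possible rather than how a DTD is written or enforced.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tagging: <taxon>, <genus>, and <species> are illustrative names.
TEXT = """<para>Readers belong to <taxon><genus>Homo</genus>
<species>sapiens</species></taxon>, unlike the extinct
<taxon><genus>Homo</genus> <species>erectus</species></taxon>.</para>"""

def build_index(xml_text):
    """Group each species as a sub-entry under its genus."""
    index = defaultdict(set)
    for taxon in ET.fromstring(xml_text).iter("taxon"):
        index[taxon.findtext("genus")].add(taxon.findtext("species"))
    return {genus: sorted(species) for genus, species in index.items()}

index = build_index(TEXT)
for genus, species_list in index.items():
    print(genus)
    for species in species_list:
        print("   ", species)
```

If genus and species had been coded together as a single element, the sub-entry grouping would require parsing the text of each name instead of reading the tags, which is the kind of false economy the section warns about.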
At large publishing organizations, developing a sufficiently powerful and flexible DTD is a challenge. As we discussed earlier, it is not enough to catalog all of the type specifications that might be needed. A team building the DTD also needs to consider whether to define specific kinds of information and to what degree of detail, and they also need to define the metadata required for their own use and for the use of current and future third parties.

One approach is to start with a DTD that is already in the public domain. For instance, Colson of ARVO has twice used the DTD developed by the National Library of Medicine as the basis for an organizational DTD:

“[The DTD from the National Library of Medicine] is comprehensive: it works for books, Annual Meeting abstracts, and all of our other publications.”

Colson even used this DTD when she worked at American Geophysical Union (AGU), even though AGU content had little if any relationship to medicine, because the structure worked effectively for other types of scholarly content.

Another approach is to contract with a trusted vendor. Vendors that have developed and worked with DTDs in the past have a pragmatic knowledge of what works well for their customers, and they also have staff with backgrounds to steer skillfully through the complexities. Outside vendors can do their future-oriented work, freeing up in-house staff to manage day-to-day operations. And a good outside vendor can also help train staff to understand the new DTD and/or a new, XML-oriented workflow.

A large proportion of scholarly journals, with their tightly structured, relatively brief units of copy, have migrated with reasonable success to XML. Books have been harder because they are more varied, and authors often don’t have the kind of career-oriented pressures that impel them to comply with constraints that authors of journal articles will accept. Still, over time elementary and high school and higher education publishers have begun to implement DTDs, which in turn offer them flexibility. Not only can they put content on multiple platforms to meet student and school district needs, but they can also customize the content of publications. This may be one reason why most educational publishers seem fairly confident of their ability to meet the idiosyncratic social science requirements of the single largest school district (i.e., the Texas School Board) while continuing to publish their books for the rest of the country.

Custom publishers are another category that has found XML to be an invaluable asset to their business, as seen in the Case Study.

ONIX: A specialized DTD for book metadata
For people in the publishing industry, ONIX (ONline Information eXchange) is perhaps the most familiar example of a DTD for metadata.

ONIX is used extensively in the book trade as a standardized means of communicating information about books, from author and title to weight per copy, minimum order quantity, subject classification, and so forth. These data then populate everything from the publisher’s own website (for instance, the one maintained by Elsevier’s Sawabini) to industry giants such as Amazon and Barnes & Noble.

Case Study
Triangle Publishing Services, Inc., prepares publications for technology companies like Microsoft, Cisco, and Hewlett-Packard. In some cases, Triangle has prepared all the content in a book so that it can be repurposed.

For example, a book with chapters on applications in a dozen different industries can be disaggregated into a dozen different white papers for distribution online. Or, by searching on XML tags, the book’s case studies can be extracted and used in other settings.

Larry Marion, CEO and Editorial Director at Triangle, says this about taking advantage of the power of XML:

“Think about how you want to repurpose content; be as creative and granular as possible. Extra work at the beginning can save you pain down the road.”
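The kind of repurposing described in the Case Study can be sketched as follows. This is a minimal illustration, not Triangle's actual process: the book structure, the `industry` attribute, and the `casestudy` and `whitepaper` element names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical book structure; the tag and attribute names are illustrative.
BOOK = """<book>
  <chapter industry="Retail">
    <para>Overview of retail applications.</para>
    <casestudy><title>Store analytics</title></casestudy>
  </chapter>
  <chapter industry="Healthcare">
    <para>Overview of healthcare applications.</para>
    <casestudy><title>Patient records</title></casestudy>
  </chapter>
</book>"""

root = ET.fromstring(BOOK)

# Disaggregate: each chapter becomes its own standalone white paper document.
white_papers = {}
for chapter in root.findall("chapter"):
    paper = ET.Element("whitepaper")
    paper.extend(list(chapter))  # reuse the chapter's tagged content as-is
    white_papers[chapter.get("industry")] = ET.tostring(paper, encoding="unicode")

# Extract: find every case study by its tag, wherever it appears in the book.
case_studies = [cs.findtext("title") for cs in root.iter("casestudy")]

print(sorted(white_papers))
print(case_studies)
```

Both operations depend on the content having been tagged granularly at the outset, which is the "extra work at the beginning" that Marion recommends.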
In fact, if you need to understand how XML refers to types of content and not their appearance, take a look at the display of any particular title on Amazon, and then on Barnes & Noble. Author, title, publisher’s description, and the like look entirely different, yet they contain precisely the same information.

Other industries and disciplines have their own specialized metadata sets, as well.
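The observation that two retailers can display the same book data in entirely different ways can be sketched with a small Python example. The record below is a simplified, hypothetical stand-in: real ONIX uses its own element names and many more fields, and the title, author, and the two `retailer_*` functions are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical metadata record. Real ONIX uses its own element
# names and far more fields; this only mimics the idea.
RECORD = """<product>
  <title>Vision Science</title>
  <author>A. Example</author>
  <weight unit="g">450</weight>
</product>"""

product = ET.fromstring(RECORD)
data = {field.tag: field.text for field in product}

def retailer_a(d):
    """One retailer leads with the title in capitals."""
    return f"{d['title'].upper()} | by {d['author']}"

def retailer_b(d):
    """Another leads with the author and adds the shipping weight."""
    return f"{d['author']}: {d['title']} ({d['weight']} g)"

print(retailer_a(data))
print(retailer_b(data))
```

Each retailer applies its own presentation to identical fields, so the publisher supplies the record once and does not need to know how any given outlet will display it.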
Implementation
In some parallel universe, management might be able to send out a memo one Friday afternoon announcing a new production workflow that starts the following Monday morning. In this world, however, it isn’t that simple. Employees may need to perform different tasks, or they may perform the same tasks in a different sequence. Managers need to assess performance using different metrics. Suppliers need to accept input that looks different and generate different kinds of output, with possible changes in schedules, prices, and quality management. For a publisher, all of this needs to take place while products already in the pipeline move through the previous workflow, or some hybrid.

XML on the fly
Sometimes, an information provider will need to produce XML hastily. For instance, a content provider may be switching publishers, or may wish to digitize back file content or work with a new third-party aggregator.

In these situations, publishers need to convert existing data. With typesetting files in hand, a conversion vendor can read the typesetting codes (for instance, “Heading 1”) and change them to XML tags, for the most part programmatically. The programmatic approach, however, can miss or misinterpret improvised or last-minute changes. For instance, if someone sees at the last minute that a “1” head really should have been a “2” head, that person might not change the typesetting code but might simply alter the type characteristics to look like a “2” head. The XML coding will continue to treat the heading as a “1” head, with potential implications for the quality of applications such as web display, search, and the like. Similarly, links to tables and illustrations might or might not be captured.

Another challenge is that conversions may not capture important metadata (“this is a chapter, not a scholarly paper”) because the metadata simply don’t exist in the original material. Either the original publisher provides the metadata retrospectively, or the new party provides the metadata using their best, potentially fallible judgment.

Building capacity for end-to-end XML requires an organization to commit staff resources, time on the calendar, and financial resources. Realistically, not every publisher can muster all three kinds of resources conveniently.

Data conversions are typically done by production vendors, with their in-depth knowledge of publishing workflows and outputs.

Another approach is to leave file conversions to the aggregator, e-book platform, etc. that wants to use the data. These companies typically do a good job of ensuring that the XML they generate is effective for their application, but if another vendor approaches the publisher, the process needs to be repeated at the cost of more money and more time.

Time for XML?
For the foreseeable future, information is going to flow into and through multiple platforms, from books, magazines, and newspapers to websites, e-book readers, mobile devices, and inventions that are only sketches on a white board right now. Authorities agree that XML provides the most effective way to cope with the multiple and shifting demands. Colson of ARVO says it well:

“Don’t be afraid of XML. Using XML will give you more versatility than any scheme I’m aware of.”
The Contributors
Special thanks to the following individual contributors:
• Nina Chang, Senior Publisher, Online Journals, Lippincott Williams & Wilkins
• Karen Colson, Director, Publishing and Communications, Association for Research in Vision and Ophthalmology
• Mark Gaertner, Senior Web Producer, Team Lead, BMStudio at Bristol-Myers Squibb
• Larry Marion, CEO/Editor-in-Chief, Triangle Publishing Services
• Larry McGrew, Head, Content/Editorial Operations, Aetna
• Julia Sawabini, Web Marketing Director, Elsevier
• Phil Schafer, Director, Journal Production, Elsevier

The Authors
• Rich Lampert
The Lampert Consultancy
www.lampert-consultancy.net
Rich Lampert is owner of The Lampert Consultancy, LLC, established in 2004 to provide strategic, editorial, and marketing services to publishers in STM, professional, and scholarly publishing. Rich is also Principal, Publishing Services Division, at Doody Enterprises, Inc., which focuses on not-for-profit publishers.

• Cara Kaufman
Kaufman-Wills Group
www.kaufmanwills.com
Cara Kaufman is co-founder of Kaufman-Wills Group, LLC, which was created in 2000 to offer STM and other scholarly publishers a full range of professional publishing services in the areas of strategic planning, business development, electronic publishing strategy, RFP and self-publishing projects, editorial services, and marketing and market research.

SPi sought the help of Kaufman-Wills Group in developing this white paper.