TUGboat, Volume 34 (2013), No. 3, page 313
Online publishing via pdf2htmlEX
Lu Wang and Wanmin Liu
Abstract
The Web has long been an essential part of our
lives. While web technologies have been actively
developed for years, there is still a large gap between
web and traditional paper publishing. For example,
the PDF format, the de facto standard for publishing,
is not supported in the HTML standard; and the
most powerful typesetting system, TeX, cannot be
integrated perfectly.
Despite the long history of attempts to
convert TeX or PDF into HTML, existing tools fall short: some focus on
only a small fraction of features, e.g. text, formulas
or images; some are too old to support new features
in the HTML standard such as font embedding or
linear transformations (e.g. rotation); some display
everything in images at the cost of larger sizes.
In this article, we survey and compare
existing methods of publishing TeX or PDF documents
online, and we propose a new approach to
this problem. We introduce an open source program,
called pdf2htmlEX, which is a general PDF to HTML
converter and publishing tool with high fidelity. It
presents PDF elements with corresponding native
HTML elements, in order to achieve high accuracy
and small size. The flexible design also makes it
useful for a variety of use cases in online publishing.
TeX users can benefit immediately with
no learning cost, much as they did with dvipdf
when people were still using DVI. More information is available at the
home page:
https://github.com/coolwanglu/pdf2htmlex
1 Introduction
Arguably, for many people the World Wide
Web is the Internet. Indeed, web technologies
have been developed so actively in the
past few years that web pages now go far
beyond plain text and images. HTML5 brings audio,
video, 3D graphics and many other rich features;
CSS3 defines brand new visual effects, and JavaScript
allows different kinds of user interactions. Modern
web browsers increasingly resemble operating systems, and
the boundary between web apps and local software
has been blurred. Today, we can access the WWW
with all kinds of devices such as watches, phones,
tablets, computers and even glasses. It has become
an essential part of our lives.
Web technologies provide brand new user
experiences compared to traditional media.
Wikipedia, for example, has rich content: inside
an article, besides plain text, there are often images,
animations, audio and video that are relevant to the
topic; it is well organized: users may jump to rele-
vant articles by clicking links; it is interactive: users
may create or edit an article; it is personalized: the
appearance of the web site respects users’ preferences
such as language, theme or format; it is social: users
may leave comments and have discussions regarding
an article.
Compared with traditional publishing media, the Web
makes it more convenient and easier for users to obtain,
view and share content. However, most features in
HTML target visual effects, multimedia and
rich Internet applications, and there is still a large gap
between the Web and traditional publishing. Many
existing publishing technologies cannot be perfectly
integrated online, especially the two that this article
focuses on: PDF and TeX, which are the most
popular document format and typesetting system, respectively.
PDF
The Portable Document Format, developed
by Adobe, is one of the most popular formats for dig-
ital documents. PDF is known for its wide support
of different types of fonts, encodings, raster images,
vector graphics, and many other features from pre-
press processing to user interaction. It is widely
supported in different operating systems and devices.
Nowadays, almost all documents can be exported to
PDF. Notably, with a virtual PDF printer, any docu-
ment that can be printed on paper can be converted
to PDF. It has become the de facto standard for
academic articles, technical reports, manuals, news-
papers and ebooks. As an example, the final format
for TUGboat is PDF.
PDF is a print-ready format; it is designed to
completely describe a fixed-layout flat document. A
PDF file clearly defines the appearance of the docu-
ment, independent of particular devices or viewers.
PDF is not supported in the HTML standard,
but it can be viewed directly in several web browsers.
Users of other web browsers usually have to read PDF
documents with web browser plugins, or download
the files and then read them with a local PDF reader.
In all these cases, PDF files are viewed in a closed
environment where users cannot utilize most web
features.[1]
TeX
Designed and written by Professor Donald
Knuth, TeX is one of the most powerful typesetting
systems in the world.[2]
It is well known for its ability
to produce high-quality formulas and figures
[1] PDF does include features such as external links and
interactive functions within a document, but these are quite
limited compared to HTML.
[2] When using ‘TeX’ in this article, most of the time we
will be referring to the whole TeX family.