Raph Levien <raph@acm.org>
28 Jul 1999
This is the first writeup of a proposal to unify GnomeText and gscript.
I'm going to present quite a number of interfaces, each at a fairly abstract level. There remains a fair amount of concretization to be done, such as working out the details of memory management, object layout, choice of data structures, and so on. This work is important, but premature at this point.
The goals of Pango include:
- A very simple interface for users.
- Support for high quality typesetting.
- A modular design for script-specific engines.
- Support for different rendering back-ends (X and PostScript initially).
- Reasonable speed and memory usage.
And now, the interfaces...
Basic datatypes
There are a few basic datatypes common to many interfaces in Pango. One of the most important is an "attributed string." Abstractly, this is a sequence of attributed characters, each of which is a pair of the raw character and a map from attribute tag to attribute value. In practice, I expect attributed strings to be stored run-length coded in a style similar to GnomeText.
Attributed glyph strings are defined analogously, but with glyphs instead of characters.
Ranges of characters are used frequently, especially to keep track of transformations as they pass down the pipeline.
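To make the distinction between the abstract view and the run-length coded storage concrete, here is a minimal sketch in Python. The names AttrRun, AttributedString, attrs_at and expand are hypothetical and chosen only for illustration; nothing here is meant to fix the eventual object layout or memory management.

```python
from dataclasses import dataclass, field


@dataclass
class AttrRun:
    """One run-length coded span: `length` characters sharing `attrs`."""
    length: int
    attrs: dict


@dataclass
class AttributedString:
    """Hypothetical attributed string: raw text plus run-length coded attributes."""
    text: str
    runs: list = field(default_factory=list)  # AttrRun entries covering text in order

    def attrs_at(self, index):
        """Return the attribute map of the character at `index`."""
        if not 0 <= index < len(self.text):
            raise IndexError(index)
        start = 0
        for run in self.runs:
            if index < start + run.length:
                return run.attrs
            start += run.length
        raise ValueError("runs do not cover the whole text")

    def expand(self):
        """Yield the abstract view: one (character, attrs) pair per character."""
        for i, ch in enumerate(self.text):
            yield ch, self.attrs_at(i)


# Two runs: "Hello" in bold, " world" in normal weight.
s = AttributedString(
    "Hello world",
    [AttrRun(5, {"weight": "bold"}), AttrRun(6, {"weight": "normal"})],
)
pairs = list(s.expand())
```

An attributed glyph string could use the same layout with a list of glyph numbers in place of the text.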
Itemizer
This need not be a class - it can simply be a function that takes an attributed string and returns a list of runs. Significant attributes include:
- A (handle of or pointer to a) MultiShaper.
- The language (such as "en-US").
All characters in a run have the same MultiShaper, same bidirectional level, and same script.
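As a rough illustration of the run constraint, the following sketch splits a sequence of attributed characters into maximal runs. It assumes that the MultiShaper handle, bidirectional level and script have already been resolved per character and stored under the hypothetical attribute keys "shaper", "bidi_level" and "script"; how those values are computed is outside the sketch.

```python
from itertools import groupby


def itemize(chars):
    """Split (char, attrs) pairs into maximal runs.

    Every character in a run shares the same shaper, bidi level and script.
    The attribute keys are assumptions made for this sketch.
    """
    def run_key(item):
        _, attrs = item
        return (attrs["shaper"], attrs["bidi_level"], attrs["script"])

    runs = []
    start = 0
    for key, group in groupby(chars, key=run_key):
        group = list(group)
        runs.append({
            "start": start,                 # character range [start, end)
            "end": start + len(group),
            "text": "".join(ch for ch, _ in group),
            "shaper": key[0],
            "bidi_level": key[1],
            "script": key[2],
        })
        start += len(group)
    return runs


# Two Latin characters followed by two Hebrew characters.
latin = {"shaper": "sans-12", "bidi_level": 0, "script": "Latin"}
hebrew = {"shaper": "sans-12", "bidi_level": 1, "script": "Hebrew"}
chars = [("a", latin), ("b", latin), ("\u05d0", hebrew), ("\u05d1", hebrew)]
runs = itemize(chars)
```

Each run records its character range, so later stages can map results back to the original string.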
MultiShaper
The MultiShaper interface essentially holds a set of shaping engines and a sized font.
The abstract datatype contained in a MultiShaper is a map from scripts to Shapers. A script is a set of Unicode characters.
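One way to picture this, purely as a sketch: each script is represented by the set of code points it covers, and the MultiShaper looks up the Shaper whose script contains a given character. The class layout, the method name shaper_for, the use of strings as stand-ins for Shaper objects, and the choice of two Unicode blocks as toy "scripts" are all assumptions made for illustration.

```python
class MultiShaper:
    """Hypothetical MultiShaper: a sized font plus a map from scripts to Shapers."""

    def __init__(self, sized_font, shapers):
        self.sized_font = sized_font
        # List of (script name, set of code points, shaper) triples.
        self.shapers = shapers

    def shaper_for(self, ch):
        """Return (script name, shaper) for the script containing `ch`."""
        code = ord(ch)
        for name, code_points, shaper in self.shapers:
            if code in code_points:
                return name, shaper
        raise KeyError("no shaper covers U+%04X" % code)


# Toy scripts: the Basic Latin and Hebrew blocks stand in for real script sets.
ms = MultiShaper(
    ("sans", 12),
    [
        ("Latin", range(0x0000, 0x0080), "latin-shaper"),
        ("Hebrew", range(0x0590, 0x0600), "hebrew-shaper"),
    ],
)
```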
Shaper
The Shaper interface essentially holds a shaping engine for a single script, as well as a sized font (generally the same font held by the MultiShaper the shaper was obtained from).
The Shaper::shape method takes an attributed string, and returns an attributed glyph string. Each returned glyph contains:
- An (x, y) offset.
- A width.
- A corresponding range in the input string.
The (x, y) offset is relative to the current text position. After placing each glyph, the width is added to the current text position. Most glyphs other than combining diacriticals have a zero offset, and a width comparable to the width of the character (slightly adjusted if kerned with the next character). Most combining diacriticals have a width of zero, and an offset that places them correctly with respect to the previous character.
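The placement rule can be sketched as a short loop over the returned glyphs. The Glyph record and the place_glyphs function are hypothetical names, and the sketch assumes horizontal text, so the width only advances the x coordinate.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical shaped glyph, with the three fields listed above."""
    glyph_id: int
    x_offset: float
    y_offset: float
    width: float
    char_range: tuple  # (start, end) range in the input string


def place_glyphs(glyphs, origin=(0.0, 0.0)):
    """Return a list of (glyph_id, x, y) positions and the final pen x."""
    pen_x, pen_y = origin
    placed = []
    for g in glyphs:
        # The offset is relative to the current text position...
        placed.append((g.glyph_id, pen_x + g.x_offset, pen_y + g.y_offset))
        # ...and the width is added only after the glyph has been placed.
        pen_x += g.width
    return placed, pen_x


# A base letter, a zero-width combining accent pulled back over it, then
# another base letter. Glyph numbers and metrics are made up.
glyphs = [
    Glyph(1, 0.0, 0.0, 10.0, (0, 1)),
    Glyph(2, -8.0, 1.5, 0.0, (1, 2)),
    Glyph(3, 0.0, 0.0, 9.0, (2, 3)),
]
placed, end_x = place_glyphs(glyphs)
```

The accent has zero width, so it does not disturb the position of the letter that follows it.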
The string passed in to the shape method must satisfy the constraints on a run.
An additional method is ::query_interface, which returns a (rendering technology specific) rendering object.
Abstract shaping engine
The heart of the Pango design is the abstract shaping engine. Essentially, the shaping process is split into two parts: a highly script-specific, yet font technology independent part (the abstract shaping engine), and a part specific to the font technology, but with much of the script-specific complexity removed.
You create a MultiShaper by passing in a ConcreteShaper to the abstract shaping engine. The engine itself is probably best implemented with dynloaded modules, much like the shaping engines in GScript.
The interface between the abstract shaping engine and the ConcreteShaper is an attributed abstract glyph string, which is defined analogously to an attributed string, but with abstract glyphs in place of characters.
ConcreteShaper
The ConcreteShaper::shape method takes an attributed abstract glyph string, and returns an attributed glyph string. The returned string is identical to that returned by a Shaper.
ConcreteShaper implementations are responsible for recoding the abstract glyphs to concrete glyph numbers, ligating, kerning, and positioning diacriticals.
An abstract glyph is a description of a glyph in a script that is designed to be independent of font technology. A sequence of abstract glyphs appears in "visual order," i.e. left to right for even bidi levels, right to left for odd.
Unicode, even though it is theoretically character oriented rather than glyph oriented, is actually rich in glyph forms. Thus, for most scripts, we choose Unicode numbering for abstract glyphs. In the case of Arabic, we use U+Fxxx presentation forms. Most other scripts have their glyphs sufficiently well covered by the character code points. This includes Latin, Greek, Cyrillic, Hebrew, CJK, and the simpler Indic scripts. Devanagari (and almost certainly similar Indic scripts) does not have glyph coverage, so we will need to allocate our own code ranges.
In general, abstract glyphs are unligated. It is up to the concrete shaper to perform ligation.
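The split between the two stages might look roughly like the sketch below. It is a toy: abstract_shape simply numbers each character by its Unicode code point and reverses the sequence for odd bidi levels, and ToyConcreteShaper only recodes and ligates, leaving out kerning and diacritic positioning. All names, glyph numbers and widths are invented for illustration.

```python
def abstract_shape(text, bidi_level):
    """Toy abstract shaping engine.

    Emits one unligated abstract glyph per character, numbered by Unicode
    code point, in visual order, each with its range in the input string.
    """
    glyphs = [(ord(ch), (i, i + 1)) for i, ch in enumerate(text)]
    if bidi_level % 2 == 1:
        glyphs.reverse()  # odd levels run right to left
    return glyphs


class ToyConcreteShaper:
    """Toy font-specific stage: recodes abstract glyphs and ligates them."""

    def __init__(self, cmap, ligatures, widths):
        self.cmap = cmap            # abstract glyph -> concrete glyph number
        self.ligatures = ligatures  # tuple of abstract glyphs -> concrete glyph number
        self.widths = widths        # concrete glyph number -> width

    def shape(self, abstract_glyphs):
        lengths = sorted({len(k) for k in self.ligatures}, reverse=True)
        out = []
        i = 0
        while i < len(abstract_glyphs):
            # Try the longest ligature first, then fall back to a single glyph.
            for n in lengths:
                seq = abstract_glyphs[i:i + n]
                key = tuple(g for g, _ in seq)
                if len(seq) == n and key in self.ligatures:
                    gid = self.ligatures[key]
                    rng = (min(r[0] for _, r in seq), max(r[1] for _, r in seq))
                    break
            else:
                g, rng = abstract_glyphs[i]
                gid = self.cmap[g]
                n = 1
            out.append({"glyph": gid, "width": self.widths[gid], "range": rng})
            i += n
        return out


shaper = ToyConcreteShaper(
    cmap={ord("f"): 10, ord("i"): 11, ord("t"): 12},
    ligatures={(ord("f"), ord("i")): 20},
    widths={10: 6, 11: 3, 12: 4, 20: 8},
)
shaped = shaper.shape(abstract_shape("fit", 0))
```

The ligature glyph keeps the combined input range (0, 2), which is what lets later stages map a ligated glyph back to its characters.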
Line breaking
Line breaking adds additional layers of complexity. First, all potential line breaks must be identified, including discretionary hyphens for Latin scripts, and word boundaries in (???). Second, inserting hyphens may require reshaping. Third is the line breaking algorithm itself, which may be nontrivial.
It seems to me that preserving the GnomeText approach to line breaking is generally worthwhile. My feeling is that the high quality layout stuff should act as a driver, presenting its own API (very similar to GnomeText's), and calling the interfaces above.
Finding hyphenation points and word breaks is entirely font independent. Thus, I think the correct order for high quality layout is as follows:
- Itemize the text using the itemizer.
- Perform language-specific hyphenation and word break detection on each run.
- Convert each run into glyphs with the MultiShaper.
- For each hyphen, reshape the character subsequence corresponding to the (possibly ligated) glyph containing or preceding the hyphenation point, with the hyphen inserted. Thus, an "ffi" ligature gets reshaped as "f -" and "fi".
- Construct a table of breaks with x0 and x1 values for each break (see GnomeText and/or libhnj for more details).
- Call the line breaking algorithm on this table (this needs to be modular, as different applications have need for different levels of sophistication in line breaking).
- Reconstruct laid-out lines, using the hyphen-inserted attributed glyph strings where necessary. Do bidi reordering of runs within a line, and justify.
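To illustrate the last few steps, here is a sketch of the simplest line breaking algorithm that could be plugged in: a greedy first-fit pass over the table of breaks. It is not the algorithm used by GnomeText or libhnj. The meaning given to x0 and x1 here is an assumption: x0 is taken as the x position at which the line ends if the break is used (including any inserted hyphen), and x1 as the x position at which the following line starts, both measured along the unbroken paragraph. The last table entry is assumed to be the end of the paragraph.

```python
def greedy_breaks(breaks, line_width):
    """Choose breaks greedily: put as much as fits on each line.

    breaks: list of (x0, x1) pairs in paragraph order (see assumptions above).
    Returns the indices of the chosen breaks, ending with the final one.
    """
    if not breaks:
        return []
    chosen = []
    line_start = 0.0
    last_fit = None  # latest break that still fits on the current line
    i = 0
    while i < len(breaks):
        x0, _ = breaks[i]
        if x0 - line_start <= line_width:
            last_fit = i
            i += 1
            continue
        # Break i overflows the line, so end the line at the last one that
        # fit. If none fit, accept an overfull line ending at break i.
        if last_fit is None:
            last_fit = i
        chosen.append(last_fit)
        line_start = breaks[last_fit][1]
        i = last_fit + 1
        last_fit = None
    if chosen[-1:] != [len(breaks) - 1]:
        chosen.append(len(breaks) - 1)  # the paragraph end is always a break
    return chosen


# "aaa bbb cc dddd" with every character one unit wide: a break after each
# word ends the line before the space and starts the next line after it.
table = [(3, 4), (7, 8), (10, 11), (15, 15)]
lines_wide = greedy_breaks(table, 8)
lines_narrow = greedy_breaks(table, 3)
```

A more sophisticated algorithm would take the same table and return the same kind of result, which is what makes this step easy to keep modular.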
I think a similar driver could (and should) be written for simpler cases that don't need line breaking. Perhaps one specific to the X fonts as well.
Open design decisions
Size as part of MultiShaper, or parsed by shapers as attribute in original string?
Accented characters canonically decomposed, or composed as much as possible?