
So, just what is a "TeX token list"?

In a previous article—also part of this series on low-level TeXnicalities—we explored the processes through which TeX scans your .tex file to generate new tokens: we examined the fundamental nature of TeX tokens and how TeX creates them (see What is a "TeX token"?).

In this follow-up article we take a look at token lists: what they are and how TeX engines create and use them. Gaining an understanding of token lists can be tricky because they are stored deep in TeX’s internals: those details are hidden away from the user—although, today, this is not always true if you do more advanced programming with LuaTeX. But, for now, you can start to think of token lists as TeX’s way of storing a series of integer values, where each integer is a token derived from a character or command that TeX has read from your input file.

Token lists play a pivotal role in the internal operation of TeX, often in some surprising ways, such as the internal operation of commands like \uppercase and \lowercase. One particularly important use of token lists is storing and executing macros, a topic we will examine in detail as part of a future article in this series.

TeX gets its input from files and token lists

TeX engines have three sources of input—two that you may know:

  • physical text files stored on disk;
  • text that a user types into the terminal (command line);

but it also has a third way of reading/obtaining input: token lists!

Token lists are, in effect, an internal data storage facility that TeX uses as part of its operations. Because TeX’s token lists act as a “storage facility” for previously-created tokens, it makes sense for TeX to be able to re-use them as another source of input. When it becomes necessary to take its next input from a particular token list (or TeX is instructed to do so), TeX will temporarily halt reading input from a physical file (i.e., creating new tokens) and switch to obtaining its input from existing tokens stored at an in-memory location: the token list. Clearly, with a token list the process of scanning and generating tokens has already taken place, so TeX just needs to look at each token in the list and decide what to do with it.

By way of a quick example, the low-level (TeX primitive) \toks command lets you create a list of tokens that TeX saves in memory for later re-use:

\toks100={Hello}

To retrieve those tokens (i.e., tell TeX to treat them as its next source of input) you’d issue a command such as

\the\toks100

This will cause TeX to switch from creating new tokens from your input file to getting its next tokens from where those tokens (created by \toks) are stored—in a so-called token register, which is just an internal memory location known to TeX (here, register 100).
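Putting those two pieces together, here is a complete plain TeX document—a minimal sketch of our own—which stores the tokens and then re-uses them as input:

\toks100={Hello}  % store five character tokens in token register 100
\the\toks100      % switch input to the stored tokens: typesets "Hello"
\bye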

In addition, token lists can be internally generated, on-the-fly, by a number of TeX commands. One example is the command \jobname which generates a series of character tokens—one token for each character in the name of the main file that TeX is processing. Another example is the \string command; for example

\string\mymacro

generates a series of character tokens for each letter in the name \mymacro—including the initial \ character. We take a closer look at some “token-generating commands” at the end of this article.
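To see such token-generating commands in action, here is a small plain TeX sketch of our own (the name \mymacro is arbitrary, and \string does not even require it to be defined) which uses \message to write the generated character tokens to the terminal and log file:

\message{Job name is: \jobname}        % e.g. writes "Job name is: myfile"
\message{Stringified: \string\mymacro} % writes \mymacro, backslash included
\bye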

Token lists: Explained by analogy

Unless you have a programming background and/or some knowledge of computer science, “token lists” may be a somewhat hazy concept and, perhaps, a little confusing. However, if you wish to become proficient in writing TeX/LaTeX macros then a good understanding of topics such as TeX tokens, token lists and category codes (\catcode) will prove extremely useful.

In this section we’re going to use an analogy to explain/illustrate the core ideas/principles of a TeX token list: how TeX stores tokens in memory. It’s worth taking time to read this through because token lists are a fundamental aspect of TeX and worth understanding in a little more detail.

Token lists: An analogy (thought experiment)

We are going to work through a “thought experiment” to provide a basis for understanding TeX token lists. Imagine that you had access to a large set of containers, such as hundreds of tins—we can’t use the term “box” to describe our thought-experiment containers because, of course, “box” has a very specific meaning in TeX, quite unrelated to our discussion here. So we’ll call our containers “Tins”, where each Tin:

  • has a unique identifying number printed on its exterior;
  • is (internally) split into two compartments.

Those two compartments are designed as follows:

  • the left-hand compartment holds the item you want to put in the Tin;
  • the right-hand compartment is designed to hold a piece of paper onto which you can write a single number: the number identifying another Tin.


Suppose that you have a collection of, say, 5 items and you want to store that collection of items within those Tins; but, alas, each Tin can only hold 1 item of the type you want to store.

For simplicity, let’s assume we want to store 5 coloured circles: dark green, light green, red, light blue and dark blue.

Furthermore, when you go back to retrieve those items from your storage system (Tins) those items must be retrieved/found in a particular order—the order in which they were stored: that sequence must be preserved. How can you achieve this?

We can take advantage of the fact that each Tin:

  • has a unique identifying number attached to its exterior;
  • has 2 compartments—only 1 of which we will use to contain our item; the other will contain a piece of paper with another Tin’s number written on it.

We’ll assume every Tin is empty—but there’s nothing to stop you opening any particular Tin to check if it is empty; if it isn’t, try the next one until you find an empty Tin.

What we could do is as follows. Put our first item (dark green circle) in one of our Tins (e.g., Tin 124) and make a note of the number of this first Tin—it does not matter what number that first Tin has, all that matters is that we write it down somewhere and save it for later use.

Find a second Tin—any Tin number (e.g., Tin 432)—and take a note of its number. Write the number of that second Tin (432) on a piece of paper and place that note into the first Tin (Tin 124). We place our second item (light green circle) into the second Tin. So, we currently have the following situation:

  • a written note—not stored in a Tin—stating that the first Tin is number 124 (it contains our first item);
  • within Tin 124 we have added another note saying the next item is to be found in Tin 432.

In essence, we have linked our first two Tins: we know where to start (Tin 124) and that a note in Tin 124 tells us which Tin contains the next item (Tin 432).

We then find a third Tin, write down its number (e.g., Tin 543) on a piece of paper and place that in the second Tin (number 432). We then place our third item (red circle) into the third Tin.

Now we have linked three Tins in the sequence: our starting point, Tin 124 (dark green circle) → Tin 432 (light green circle) → Tin 543 (red circle) → …

Repeat this process for the final two items (light blue and dark blue circles) using Tin 213 (light blue circle) and Tin 102 (dark blue circle).


We now have all 5 Tins linked together (using the numeric identifier of each Tin) and are able to retrieve all our stored items—in the correct order—simply by visiting each Tin in turn, removing our item and looking at the note telling us which Tin contains our next item.

What about the last item in our list (Tin 102)?

Why should we be concerned by this one in particular? So far we have stored each item in a Tin, together with a note saying which Tin contains the next item: for the last item in our list, what should that note say, given that there is no next Tin?

When we reach the final item (Tin) it has to be obvious that this Tin (containing the last item) is the final one in our list—we do not need to look for another Tin, because there isn’t one. One way to do that is to place a “special” Tin number inside our final Tin (102). We can use any number we wish provided we choose a unique number that is not the number of an actual Tin—for example “Tin -1” or “Tin 0”: it does not matter, so long as we know that this special number immediately tells us to stop looking: this is the last Tin and there are no more items to retrieve.

From “items” and “Tins” to tokens and TeX

We now need to move from our analogy to a description that is closer to TeX’s reality. Firstly, instead of storing differently coloured circles in our imaginary Tins it should be clear that we could think of those Tins as storing TeX tokens: simple integers. That’s the easier part of moving our analogy across to the realm of software (TeX). But what might be the software equivalent of our physical numbered Tins with “compartments”?

We don’t want to venture too far into programming concepts but you can think of our “Tins” as representing a few bytes of computer memory which have been “packaged up” into a convenient unit of storage. Our analogy’s use of a numeric identifier for each Tin can be considered as the location inside computer memory where each little package of memory is located. Within TeX itself, those little packages of storage are called “memory words”—a term which reflects the era in which TeX was created (the 1970s). These “memory words” are the fundamental building block used within TeX but we don’t need to explore them in any more detail here—anyone who wants further detail can refer to an article on the author’s personal blog.

In computer programming terms, what we have been discussing is called a linked list: a TeX token list is a linked list built from TeX’s storage containers called memory words, where each memory word can be used to store:

  • a value: the value of the token (an integer);
  • a link: the memory location of the next memory word containing the next token in our list.
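We cannot peek at those memory words directly, but TeX’s \showthe primitive will display the contents of a token register in the log file. A quick sketch of our own (register 100 is an arbitrary choice):

\toks100={Hello \TeX}
\showthe\toks100   % the log displays: > Hello \TeX .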

Where does TeX use token lists?

Everywhere! This is true because a TeX/LaTeX macro definition (e.g., a LaTeX command) is stored as a (slightly specialized) form of token list—specialized in the sense that it contains tokens that you don’t see in “standard” token lists (related to matching macro parameters etc). Don’t worry about this because we’ll address those details in a future article.

An example macro

A macro can be thought of as comprising three parts:

\def\<macro name><parameter text>{<replacement text>}

Note that instead of \def you could have used \edef, \gdef or \xdef.

Note to LaTeX users: Here we are defining macros using raw, low-level TeX commands (called primitives). LaTeX users will be more familiar with creating macros via LaTeX’s \newcommand (which is itself a macro).

When you ask TeX to create (define) a macro it will create a token which represents the <macro name> and a token list which represents the combined <parameter text> and <replacement text>. TeX will carefully store everything so that the token representing <macro name> is linked to the token list representing its definition (<parameter text> and <replacement text>).

For example, if we define \mymacro like this:

\def\mymacro abc #1 defz{I typed "#1"!}

We can see that its constituent parts are:

  • <macro name> = mymacro
  • <parameter text> = abc #1 defz
  • <replacement text> = I typed "#1"!
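As a quick check—a sketch of our own, not from the original article—you can ask TeX to display how it has stored this definition by using the \show primitive:

\def\mymacro abc #1 defz{I typed "#1"!}
\show\mymacro
% the log displays the stored <parameter text> and <replacement text>:
% > \mymacro=macro:
% abc #1 defz->I typed "#1"!.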

For example, you could call \mymacro like this:

\mymacro abc THIS TEXT defz

which results in I typed "THIS TEXT"! being typeset—the abc and defz are not typeset. abc and defz are sequences of character tokens used to delimit the macro parameter #1 and are absorbed and discarded when your macro call is successfully processed by TeX.

When you define \mymacro, the pattern of tokens contained in the stored <parameter text> acts as a “template” that TeX can use to work out:

  • which tokens in your input are the delimiter tokens;
  • which tokens in your input actually form the parameter(s) of your macro (here, what you are using for #1 in your call of \mymacro).

You have to call \mymacro with delimiters that are identical to the ones used in the <parameter text> which defined it—that includes using character delimiters with identical category codes. If the delimiters used to call \mymacro are different to the ones used to define it (the “template” stored in memory), then TeX can become rather confused—when it tries to process \mymacro it will not be able to match the “template” it has saved in its memory.

When TeX sees that you are calling a macro it will scan your input text to create new tokens and try, token-by-token, to match them with the token list <parameter text> template stored as part of your macro’s definition. If the delimiters used in your input text result in a series of tokens that don’t match the ones stored in the “template” then TeX will usually throw an error.

TeX is very particular—remember that character tokens are a combination of character code and category code: if you change the category code of a character you get a different token value resulting from that character.

Suppose we change the category code of z to, say, 12—ordinarily it is 11—and then try to call our macro like this:

\catcode`z=12
\mymacro abc THIS TEXT defz more text here...

This time it will not work because the category code of z has been changed. You will see an error such as this:

Runaway argument?
THIS TEXT defz 
! Paragraph ended before \mymacro was complete.
<to be read again> 
\par 
l.22

When TeX reads and scans the z in defz it cannot recognize it as forming the end of \mymacro’s <parameter text> used in your input file. Up until seeing that erroneous z, TeX had correctly matched the first 3 characters def, but that z (with category code 12) trips up TeX’s scanning. Assuming z had a category code of 11 when we defined \mymacro, that would result in a token value of 256×11 + 122 = 2938 being stored as part of \mymacro’s definition (i.e., stored as part of the <parameter text> “template”). However, with category code 12, z will now create a token value of 256×12 + 122 = 3194. Because the token value (for z) read in from your input (value 3194) does not match the z-token contained in the stored <parameter text> token list template (value 2938), TeX will carry on scanning your input. TeX will continue to scan the text following your macro (more text here...) looking for additional tokens—trying to match the stored <parameter text> template with the tokens it finds in your input. It probably won’t find the correct pattern of tokens, and errors will result as TeX “overshoots” your input and erroneously reads extra text to create additional tokens—tokens that should not have been read at this point and will almost certainly generate an error.
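As a quick sanity check (our own sketch), you can ask TeX to display a character’s current category code before and after the change:

\showthe\catcode`z   % displays: > 11.  (letter—the default)
\catcode`z=12
\showthe\catcode`z   % now displays: > 12.  ("other")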

We’ll go into this in more detail in a future article.

Other uses of token lists

Other commands used to create/store token lists include:

\toks<n>={...}
\everypar={...} 
\everymath={...}
\everydisplay={...}
\everyhbox={...}
\everyvbox={...}
\output={...}
\everyjob={...}
\everycr={...}
\errhelp={...}

Each one of these commands creates a token list from the characters and commands within the braces ‘{...}’ and that list of tokens is intended to be re-used in certain circumstances. For example, \everypar={...} creates and stores a set of tokens (a token list) that TeX injects into the input just before it starts a new paragraph.
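As a small illustration—a sketch of our own—the following plain TeX document uses \everypar to inject a bullet at the start of every paragraph:

\everypar={$\bullet$\ } % these tokens are injected each time a new paragraph begins
The first paragraph gains a bullet.

So does the second.
\bye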

Hidden uses of token lists: examples

In this final section we’ll look at some practical examples of token lists being used in ways you might not expect.

Example 1: \uppercase{...} and \lowercase{...}—temporary token lists

In addition to explicit commands to generate token lists, there are circumstances when TeX generates a hidden and temporary internal token list in order to do some special processing. Remember that when TeX reads/processes your input, characters/commands are turned into tokens: the fundamental building block that TeX engines work with.

A good example is the pair of commands \uppercase{...} and \lowercase{...} because their operation can, on first encounter, be rather confusing. Once you understand what they are doing—deeper inside TeX and invisible to the user—their operations become much easier to comprehend.

Suppose you have a simple series of letters that you want to make uppercase—e.g., convert abcde to ABCDE. Well, it’s simple enough with TeX’s \uppercase command:

\uppercase{abcde}

will cause TeX to output ABCDE. Now let’s suppose we wanted to save our simple series of letters for use later on—i.e., we don’t want to output them straight away, so we’ll use TeX’s only internal (as opposed to external, file-based) mechanism for saving data: a token list. We can do that by either creating a macro or using an explicit token list command:

\toks100={abcde}
\def\mychars{abcde}

Then, at some point, you might decide that you’d like to re-use your series of letters but, this time, in uppercase; so you try

\uppercase{\the\toks100}

and

\uppercase{\mychars}

But, alas, neither of these work. Why is that?

Secret token lists!

To understand how the commands \uppercase{...} and \lowercase{...} actually work I needed to peek inside the inner workings of TeX, so the following explanation is derived from doing that.

When TeX detects either \uppercase{<material>} or \lowercase{<material>} in your input, the first thing TeX does is to create a (temporary) internal token list from the <material> enclosed between the ‘{’ and ‘}’ which follow the \uppercase or \lowercase command—that temporary token list is internal to TeX.

A crucial point, and central to understanding how \uppercase{<material>} and \lowercase{<material>} actually work, is that any commands or macros contained in the <material> are not expanded: all that TeX does is to generate tokens from characters and commands placed between {...}. During the operation of \uppercase{<material>} or \lowercase{<material>} nothing between the braces is executed: it is simply turned into tokens.

After the <material> inside the {...} has been converted into a (temporary) token list, TeX then re-visits every token in that list and tests whether it is a character token or a command token (using the numeric value of the token). If TeX detects a character token it modifies that token to adjust the case of the character (according to whether \uppercase or \lowercase is being processed). TeX simply ignores any command tokens and does not “look into” any command tokens to see what they represent or contain (e.g. a macro containing characters)—they are simply skipped over: only character tokens are actually processed/affected by case-changing operations.

So, for example, if we issue a TeX command such as \uppercase{abcde} TeX will create a token list from abcde containing nothing but character tokens: they are all adjusted to create a series of modified tokens representing A, B, C, D, and E. Those modified tokens are fed back into TeX’s input processor which results in ABCDE being typeset. However, if we have stored our characters within a macro—for example \def\mychars{abcde}—and try to convert them to uppercase like this:

\uppercase{\mychars}

then it will fail and abcde will be typeset—not ABCDE as you might expect. If we then try to store our characters in a token list such as \toks0={abcde} and do \uppercase{\the\toks0} then, once again, \uppercase will fail because the token list will consist entirely of tokens that are not affected by \uppercase.

Taking the example of our macro, \mychars, after TeX detects \uppercase in the input, TeX looks up the meaning of \uppercase and actions it, creating a temporary token list from {\mychars}. Clearly, that temporary token list contains just one token which is not a character token but one that represents our macro command \mychars: hence, for the purposes of executing \uppercase, that token is ignored—\mychars does not represent a character token. However, as noted above, once \uppercase has done its work, the temporary token list (created by the action of \uppercase) is fed back into TeX’s full input-processing (scanning) mechanism. When TeX re-reads that token list it detects a token which represents our \mychars macro which TeX executes (expands) and generates a series of characters to typeset abcde—still in lowercase because they were “wrapped up” inside a macro and thus invisible to the actions of \uppercase.

Once TeX has re-examined the temporary token list created for \uppercase{...} or \lowercase{...}, and processed any character tokens, it then switches to using that temporary token list as its source of input: typesetting characters (processed character tokens) and executing commands and macros.

How can this be fixed?

Because \uppercase{...} or \lowercase{...} will only act upon character tokens, we need a way to “force the unpackaging” of characters contained in our macro \mychars (or contained in a \toks register) before \uppercase{...} or \lowercase{...} acts on it. By “unpackaging” what we really mean is TeX’s process of expansion (illustrated in the short sketch after this list):

  • replacing a TeX/LaTeX command with the sequence of tokens from which that command (e.g., a macro) is composed, or
  • producing the sequence of tokens a command is designed to generate. One example of a command that generates tokens is \jobname, which produces a sequence of character tokens representing the name of the main TeX file being processed.
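Both kinds of expansion can be observed with \message, which expands its argument before writing it to the terminal/log (a sketch of our own):

\def\mychars{abcde}
\message{\mychars}   % the macro is replaced by its stored tokens: writes "abcde"
\message{\jobname}   % \jobname generates character tokens: writes the job's name
\bye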

Lower-level magic: scantoks(..., ...)

Here we are really probing into some darker corners of TeX’s inner workings so you can ignore this section unless you enjoy the details…

After TeX detects \uppercase or \lowercase in the input stream, it executes an internal function called scantoks(..., ...) whose job it is to generate the token list using the items between the opening ‘{’ and closing ‘}’—as discussed, that token list is subsequently examined to detect (then adjust) any character tokens to alter the character case as required. Note carefully that we are referring to scantoks(..., ...) as the internal function built into the source code of TeX engines—here, it is not being referred to as the name of a control sequence.

As part of its work, scantoks(..., ...) can be instructed whether to expand, or not expand, the token list it is constructing; for \uppercase (and \lowercase) it does not expand the tokens: it merely creates them and puts them into a token list.

One of the first things that scantoks(..., ...) has to do is check for an opening ‘{’ (or any character with category code 1), ensuring the user hasn’t made a syntax error and forgotten it—a character with category code 1 is required to delimit the start of the list of items to be tokenized.

And here’s the trick: the task of looking for an opening ‘{’ triggers scantoks(..., ...) to run TeX’s expansion process, which means that the following examples will work:

\let\ob={
\uppercase\ob abcde}
\def\obb{\ob}
\uppercase\obb xyz}

Taking the example of \obb, a macro, it is recognized as an expandable command and is duly expanded by TeX (via the scantoks(..., ...) function) in its search for an opening brace (any character with category code 1). What this means is that we can use the “\expandafter trick” to achieve our goal of “unpacking” our characters from the confines of our macro—i.e., expanding it. Note that \expandafter also falls into the category of being an expandable command, so TeX actions it here and lets it do its work as part of hunting for an opening ‘{’ (or any character with category code of 1).

So, if you define:

\toks0={abcde}
\def\mychars{abcde}

And do this:

\uppercase\expandafter{\mychars}
\uppercase\expandafter{\the\toks0}

in both cases you will now see ABCDE typeset because the \expandafter causes “unpackaging” (expansion) of \mychars and \the\toks0—both result in \uppercase seeing a stream of character tokens, which it can process to change the case.

Example 2: \string—more temporary token lists

Internally, TeX classifies \string as one of its so-called “convert” commands: performing the operation of “convert to text”. The \string command is designed to convert a token into a human-readable text version—i.e., typeset the human-readable string of characters from which that token was originally created.

For example \string\hello creates a temporary token list which contains the characters \, h, e, l, l, o — yes, even including the initial ‘\’. Once that token list has been created it is then re-read by TeX and the text of the command “\hello” is typeset—yes, including ‘\’ if you choose the correct font…

You may wonder how/why TeX can typeset the escape character when it is usually used to trigger TeX’s scanner into creating a command token: why doesn’t it do that here? The answer has to do with category codes: usually, a ‘\’ character has catcode 0 (escape character) but when \string generates its internal token list it does something a little different. When it creates a character-token list it assigns category code 12 to all characters apart from the space character which is assigned catcode 10—recall that character tokens are calculated from 256 × catcode + ASCII value. So, when TeX re-reads (inputs) the temporary token list that \string generated from \hello, TeX does not see an escape character because the token for ‘\’ was calculated with a catcode of 12 and not 0: TeX just treats ‘\’ as a regular character and typesets it.
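For example—a small sketch of our own—in a typewriter font (where character 92 really is a backslash glyph) the following typesets \hello verbatim:

{\tt \string\hello}   % typesets: \hello (the '\' is now just an ordinary character)
\bye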

Strictly speaking, we should probably note that TeX does not actually generate a token for escape characters when it detects them in the input. Once it has recognized a character with category code 0, that character is just used to “trigger” the generation of a control-sequence token: once it has triggered TeX to do that, the escape character has done its work and is no longer considered.

Technical note

A command called \showtokens{...} (introduced by the e-TeX engine) can show token lists (in the log file). From the e-TeX manual:

The command \showtokens{<token list>} displays the token list, and allows the display of quantities that cannot be displayed by \show or \showthe, e.g.:

\showtokens\expandafter{\jobname}
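If, for example, the main file were called myfile.tex (a hypothetical name), the log would display something like:

> myfile.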

In conclusion

In section 291 of the TeX source code (see page 122 of TeX: The Program) Knuth describes a token list as follows:

“A token list is a singly-linked list of one-word nodes in mem, where each word contains a token and a link. Macro definitions, output-routine definitions, marks, \write texts, and a few other things are remembered by TeX in the form of token lists, usually preceded by a node with a reference count in its “token_ref_count” field.”

On first reading this may not have been easy to understand but, hopefully, it may now make a little more sense.
