7 Nov 2024

On Transformations and Embeddings

This week I was in Sweden and had a chance to visit the National Library in Stockholm. Together with my Stanford colleagues Peter Broadwell and Lindsay King, I spoke about some of the AI work that currently seems most interesting to me in the library space.

Stockholm November 2024

The slide behind me here says “The Decade’s Challenge: Transformations and Embeddings”. By this I mean two critical engagements with material that research libraries are — or could be — involved with.

The simplest example of a transformation might be something we’ve been doing for decades: optical character recognition. From the OmniPage of the 1990s to the Adobe Acrobat or ABBYY FineReader of today — and our own mobile phones, which now apply OCR automatically to nearly any photograph we take — this is a relatively uncontroversial addition of “recognized” text. Going from a scan of a printed page to Unicode text seems like a clear win for searchability and discoverability — whether on the micro scale of an individual PDF, or the macro scale of an entire digital library.
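As a minimal sketch of that transformation (not of any particular library’s pipeline), here is what page-to-text OCR looks like with the open-source Tesseract engine via pytesseract; the file name is a hypothetical placeholder, and any of the commercial tools above would play the same role:

```python
# Minimal OCR sketch: a raster page image in, Unicode text out.
# Assumes Tesseract and pytesseract are installed; "page_scan.png" is hypothetical.
from PIL import Image
import pytesseract

page = Image.open("page_scan.png")
recognized_text = pytesseract.image_to_string(page, lang="eng")
print(recognized_text)
```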

There were — and remain — bumps along the way, from the mis-recognition of the long “s” as an “f”, to lagging support for Fraktur and other blackletter typestyles. But few view OCR as a threat — especially since it is often presented “underneath” the raster layer of a book scan in a PDF, adding searchability without hindering visual inspection.

Clearly OCR is not as useful if we are interested in typography, printer’s ink, or the characteristics of the paper itself. For scholars in these areas, OCR is irrelevant — the “transformation” of the image of the page into a pure-text representation has lost crucial dimensions of the original, flattening it out into a dehydrated UTF-8 representation.

In fact, it may be useful to think of the resulting OCR text alone — when not paired with a raster layer in a PDF — as a kind of lower-dimensional representation of the original page image. In that sense, it is an embedding.

We can consider other, more complicated transformations — many of which would have seemed impossible only eighteen months ago. Transformer-based HTR (handwritten text recognition), whether via the TrOCR model itself or the new transformer-based models in Transkribus, promises to make handwriting suddenly accessible and searchable in ways it never has been in the past, absent large-scale, human-powered transcription efforts. In the domain of human speech, OpenAI’s (freely downloadable) weights for Whisper have made the human voice, in dozens of languages, similarly tractable for transcription. Each of these transformations of cultural heritage material promises to advance the cause of accessibility and findability in the near future.
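To make the shape of these newer transformations concrete, here is a minimal sketch of both kinds of recognition using publicly released checkpoints; the model identifiers are the standard published ones, but the input file names are hypothetical placeholders:

```python
# Handwritten text recognition with TrOCR (Hugging Face transformers),
# followed by speech transcription with OpenAI's Whisper weights.
# "letter_line.png" and "oral_history.mp3" are hypothetical inputs.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import whisper  # the openai-whisper package

# HTR: an image of one handwritten line in, Unicode text out.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
htr_model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
line_image = Image.open("letter_line.png").convert("RGB")
pixel_values = processor(images=line_image, return_tensors="pt").pixel_values
generated_ids = htr_model.generate(pixel_values)
line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Speech: an audio recording in, a transcript out.
asr_model = whisper.load_model("small")
transcript = asr_model.transcribe("oral_history.mp3")["text"]
```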

And in each of these cases, the reduction “down” to Unicode text may be considered an embedding of its own. A Whisper-generated transcript of an oral history video recording loses all of the speakers’ facial expressions, vocal tone, pauses and hesitations, and many other forms of para-textual information. Similarly, paleographers would obviously lose, in a plain text file, the very object of their interest.

Nevertheless, these transformations and embeddings promise not just the ability to search, discover, and compute over these library objects. They also suggest new and previously impossible ways of engaging with complicated forms of human culture.

Take for example the deployment of text-to-image CLIP models on undescribed visual collections. I first learned of this approach from Javier de la Rosa, a former Stanford Library research engineer now at the National Library of Norway. In this image from our talk in Stockholm, Lindsay shows a collection of Art History teaching material (mostly drawn from slides and book scans), made searchable with a CLIP network:

CLIP Search on art history teaching material

Although many of these images may have some form of caption, few have descriptions of the work’s content or genre. Yet searching for a phrase such as “painterly abstraction” calls forth images which unquestionably match that search term. This is due to CLIP’s gluing together of two very disparate representational spaces — one linguistic and one visual. Because language tokens and pixel distributions are joined in a linked embedding space, a natural-language query for an abstract concept surfaces the images closest to that same point in the visual space. Thus are concepts such as “waves”, “brutalism”, or even “candlelit ambiance” suddenly discoverable. (These examples are all from Lindsay’s experiment and talk.)
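Purely as a sketch of the mechanism (not of Lindsay’s actual pipeline), here is what that shared-space lookup can look like with the openly released CLIP weights; the image file names and the query string are placeholders:

```python
# Text-to-image search in CLIP's joint embedding space:
# embed the images once, embed the query, rank by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Image side of the joint space (hypothetical slide scans).
images = [Image.open(p).convert("RGB") for p in ["slide_001.jpg", "slide_002.jpg"]]
image_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_vecs = model.get_image_features(**image_inputs)
image_vecs = image_vecs / image_vecs.norm(dim=-1, keepdim=True)

# Text side: the natural-language query lands in the same space.
text_inputs = processor(text=["painterly abstraction"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_vec = model.get_text_features(**text_inputs)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image, highest first.
scores = (image_vecs @ text_vec.T).squeeze(1)
ranked = scores.argsort(descending=True)
```

At collection scale, the image vectors would typically be computed once and stored in an index, with only the text query embedded at search time.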

Text-to-Image networks are perhaps the opposite of simpler, uncontroversial transformations such as OCR. As I often say, when you have a dual network, you have twice the problems of bias and representation. Yet CLIP search is an example of how the embeddings resulting from the current “transformers moment” in AI can suggest new, and far less limiting, ways of exploring our ever-growing collections of digitized material.

This is not a celebration of a triumphalist or technocratic approach to the very complex problems we face in the library space. The true end goal is not to rely on the outcomes, after-effects, and unintended consequences of algorithms that may have been developed for the purposes of surveillance or the generation of avocado armchairs. But it is a reminder of what is now possible for libraries to explore. Responsible engagement with the processes of transformation and embedding marks, now more than ever, the contours of our future work.
