Against recognition

June 28, 2021

(This post was originally part of an email update to my GitHub sponsors. I'm making it public because I like it. It's a little less formal than most things on my site.)

tweet from Robin Sloan about how thermal printers are magical

at Dynamicland, I had this idea for an 'audio recorder' or 'audio note' device -- it would be a thing that has a push button, a microphone, and a receipt printer:

you hold the button down to record¹
say something / make whatever sound you want
let go of the button
the device immediately (or even in real time, while you're speaking?) prints out a little receipt that 'contains' (that is) the audio that it just recorded²
then you can put that sound (receipt) down³ to play it

like I feel like there is no audio equivalent of jotting something down on a post-it note (or even writing on a whiteboard), and that's kinda what I wanted: a cheap physical object, made of sound instead of writing, where both consuming it and creating it feel immediately at hand.

It'd be a mix of the nice things about audio (oral communication style instead of written, can bring in other noise-making objects, background sounds can make it in without your explicitly intending to include them, music, kids/nonliterate people can participate, etc) and the nice things about writing a note (cheap to make, disposable, objectness, can point to and reuse later).

(ideas: you narrate a project you're working on, then literally stick your narration onto the project / you hum a bunch of little sound samples and play with them to make some musical composition / you build a game where the game pieces are spoken words... / i don't know. you can make non-visual systems, where you only use your hands and ears and mouth, that would be accessible even to someone who can't read or can't see)

but -- and I think this is important -- the thing would not try to recognize your speech and turn it into text. the receipt represents only the sound itself. (from the computer's point of view, it's just an opaque blob of audio.)⁴

The computer is what stores the audio data for the receipt, and when you put down the receipt, the computer is what plays the audio through the speakers in the ceiling. So the computer's role here is to store (audio) information and to orchestrate hardware, but its role is not really to digest and index that information.
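(a minimal sketch of that division of labor, in Python. everything named here is made up for illustration: `ReceiptStore`, the receipt ids, and the `speaker` callback are assumptions, not how Dynamicland actually works. the point is only that the computer keeps the bytes and hands them back, and never looks inside.)

```python
from dataclasses import dataclass, field
from typing import Callable, Dict
import uuid


@dataclass
class ReceiptStore:
    """Hypothetical store: holds recordings as opaque bytes, keyed by receipt id."""

    _blobs: Dict[str, bytes] = field(default_factory=dict)

    def record(self, audio: bytes) -> str:
        # Keep the bytes untouched: no transcription, no indexing.
        receipt_id = uuid.uuid4().hex
        self._blobs[receipt_id] = audio
        return receipt_id  # this id stands in for the printed receipt

    def play(self, receipt_id: str, speaker: Callable[[bytes], None]) -> None:
        # Hand the very same bytes to whatever makes sound.
        speaker(self._blobs[receipt_id])


# Usage: a list's append method stands in for the speakers in the ceiling.
played = []
store = ReceiptStore()
rid = store.record(b"\x00\x01hum\x02")  # could be speech, humming, a door slam
store.play(rid, played.append)
```

there is deliberately no `transcribe` or `search` method on the store; anything like that would have to be added on top later, if at all.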

if you have the computer recognize the speech inside these audio receipts and turn it into text, you're now privileging that one kind of audio (speech), which feels weird to me. you're going to push people to make sound in the ways that are legible to the computer. you may start to think that the text is the 'canonical' format, and the audio is only a temporary stop on the way to figuring out the text, and it's a Problem if that text comes out wrong, and you'll treat everything in the world only in terms of how you can convert it to text...

(and to what end? none of the ideas or aspirations I've described so far require the computer to understand what's in the audio receipts.)

(sure, maybe it'd be nice to have, like, search, but should that nice-to-have dominate the design? You can't search physical books or post-it notes, and that's fine. I want to play with this fresh thing of audio notes, which doesn't even exist yet, before trying to add an extra layer of recognition and textuality and legibility, a layer which may overwhelm some of the things about audio that were interesting to me in the first place.)

I have this reMarkable 2 tablet I got last year. I really like it. I use it a lot. Its UI looks like this (source):

screenshot of reMarkable tablet UI; it's a Finder-like display with icons for documents and for folders

and that frustrates me a little. I feel like you could do so much more; I feel like that interface doesn't really take the tablet and pen seriously.⁵

There's so much typed text on that screen -- there are so many straight lines and buttons -- it feels like it's a UI built around tapping and clicking and maybe keyboard shortcuts, not around the pen. Shouldn't as much of the interface as possible be handwritten text, and hand-drawn lines, and weird-shaped regions?⁶

A small example: titles. This file is called "Chapter 4":

excerpt from earlier reMarkable UI screenshot that just shows the bottom of one document icon, with the title 'Chapter 4'

if you have a tablet, with a pen, I feel like you should be able to just... write/draw that "Chapter 4" in. you shouldn't have to Make a New File, then type C-H-A-P-T-E-R-SPACE-4 in a text field, then click OK... there shouldn't even really be a 'text field' at all on a tablet like that; at minimum, it should be an open blank field where you can draw or write anything with your pen.

even if the computer can't figure out what some little scribble in that field means, it is still useful, since the title is there to serve me, to help me spot my files on sight. and in fact, even if the tablet does manage to recognize the text in my scrawled title, I'd prefer that it show my original handwriting there, because then the title feels like it's mine, it's my writing from my hand, with my weird quirks and imperfections. it's not dead 'text' from some font built into the OS

why is there any typed text on the tablet UI? why am I typing a title by mashing soft keys on a software keyboard? shouldn't it all be handwritten? not only would it be more comfortable and more fun (the tablet hardware is great for handwriting and terrible for typing), it would be more open: I could draw little smiley faces or stars or whatever I want. (I could 'star' a document by literally drawing a star on it!)⁷

anyway, if we want legibility, we could imagine techniques to navigate and visualize that space of titles/title-drawings that aren't just about converting everything to text and treating it as text:

it's weird that tablets like the reMarkable or iPad don't use this more in their interfaces. like Dynamicland, they have a new form of input, a form that goes way beyond traditional mouse and keyboard or touch, that is far more open-ended, and I think they should use it! pervasively!

and I don't mean treating handwriting as just an input to a recognition system that slots into an old interface paradigm:

screenshot of 'Write in any text field with Scribble' section of Apple page, showing 'Carmel' scrawled into Apple Maps search box

(I understand why that form of text recognition is useful -- existing systems are big, and they do a lot of stuff you can't replicate easily, and some form of pen compatibility with them is important -- but it doesn't excite me, and I feel like it limits our imagination if we think too much about it)

A couple of years ago, we wanted to have a better idea of what projects people were doing (and had done already) at Dynamicland, so we started making this 'research gallery' application.⁸

It was built around this dynamic 'scrapbook'. The idea was that when you make something, you'd add a new page to the scrapbook about it, with photos, videos, text description, maybe a little embedded demo of the thing, and so on.⁹

photo of 'BART departures' page of scrapbook

You can see two such scrapbook pages below -- on the left, a page about the "Animation" project, and on the right, a page about the "DNA Kit" project:

Each project at Dynamicland would get a big page (or two or three) in the scrapbook.

You can see (your eyes were probably drawn to) all the iconic Dynamicland dot-framed pages that are glued into the scrapbook. These pages are a few different things:

photos that have been printed out via the Dynamicland system (for example, the yellow-backed areas in Animation): the photo is actually printed on the paper, but the computer also knows where and what the photo is and can transclude it onto the wall or as a thumbnail in search results or whatever
demo videos that play on the scrapbook (the purple "History" in Animation)
an embedded, live instance of the project itself (bottom-left corner of Animation, which animates between the 3 hand drawings along the bottom center of the page)

But I don't actually want to talk about those dot-frames, or about their behavior; they're not the part that I find interesting. I'm much more interested in how you can really put anything you want on the scrapbook page (it is, after all, just a big piece of paper).

Look at everything on the Animation scrapbook page that isn't framed by colored dots:

left side of prev photo of scrapbook, showing just Animation page, and now with dotframed areas grayed out

These are things -- post-it notes, bits of text floating around that were written by different people, handwritten headings -- where the computer doesn't even know they're there. But they mean something to you and me. And, unlike 'text' in a computer, they can vary in human ways; they can be written or typeset differently, set at different sizes, with different colors, and so on, without software needing to implement any of those features.

It's like how I can yell or cry or laugh in an audio recording, but all that meaning gets destroyed when it gets 'recognized' into text.

When we made the scrapbook, we struggled a lot with the tension between the freedom of this unrecognized 'open input space' and the utility of a scrapbook that is indexable/legible to the computer. The scrapbook you see above was a sort of compromise, with both structured (dot-framed) and unstructured (everything else) elements. I wrote a bit about this tension at the time, in an unpublished draft:

Let's say we only had a physical scrapbook of projects, with no structured data -- no computer at all, basically. There's something freeing about that open space format: you can put whatever you want into the book. Photos, handwritten notes, hand-drawn diagrams, booklets stapled in, paper inserts that fold out, whatever.

But in exchange for that freedom, you get a book which is in some ways profoundly illegible and unsearchable. How would you search for projects that involve 'music'? How would you search for projects 'made by Omar'?

Unless the book had an index or table of contents for that specific kind of query (and you kept that index updated by hand, every time you added or changed a project), you'd have to scan through the book from cover to cover to answer the query.

So when we made the research gallery, we wanted its users to have the power of a computer to process structured data:

search for projects,

see many views of projects (by date, by author, by subject, by capabilities used...),

filter only for what you're interested in,

see connections to related work,

quickly render a list of results.

We want the computer to understand some things about each project. But once we make a project format that is easy for the computer to read, we also circumscribe how authors can describe a project. The system wants authors to say the kinds of structured things that it can understand, like:

a project is described by strings of (QWERTY-keyboard-typeable) text, not diagrams, or icons, or singing, or other languages, or handwriting where the author used a different pen and pressed harder on some parts to emphasize them

a project was created on a particular date on the calendar, not 'some time from Omar here and some time from Paula there, with low-level thinking and tweaking all along as part of a different project X, and then a big overhaul 6 months later'

a project is demonstrated in action by individual photos and videos, not a big collage of photos with stuff pasted in like handwritten captions and supplementary drawings and extra people who weren't captured in the original photos

To be honest -- I would have liked to not have any dot frames on the scrapbook page at all -- I would have liked for that 'open input space' to be the primary thing, and the recognition system to come later (if at all).¹⁰ (and if that means the computer has to give up some legibility for a while, if that means you can't automatically figure out a thumbnail photo for a scrapbook page, if that means you can't search through the scrapbook quickly, that might have been OK by me.)

there's this idea that a lot of things in the computer are not really about computation; they are artifacts made by people for people. it should be normal to put things inside the computer that the computer itself cannot digest, but that it can pass on to your future self or to other people.

this is sort of the idea behind literate programming, too -- the idea that the stuff for humans should be the default context, and the highly constrained stuff parsed by the computer should be an exceptional mode within that.

It's not that I think that recognition is always bad -- but I do think it is more interesting to err against it when we're designing new systems.

(and I think framing your problems in terms of 'recognition' is actually risky; it may result in really ossified, unimaginative systems. You end up with some expert who works on the 'recognition module' of your system, and their job is to take some fixed form of input and deliver some fixed form of output, and they can improve that module as much as they want, but they don't think about the interface context around the recognition. Even if you want recognition, I think there should be someone who thinks about the whole system at once and can come up with an end-to-end design that includes both pattern recognition and user interface.)

We have all these computer systems that love lowest-common-denominator formats like plain text, and they push programmers to normalize everything into those formats, so the computer can 'understand' them.

But I feel like as much as possible, the computer should be leaving things the way they are!

If you have recognition, it should be a sort of overlay you put on the thing (maybe one of many such overlays); you shouldn't destroy the thing and replace it with its ashes.

If it has to exist, the text recognizer should attach an overlay to the image that says 'it might have this text in it'; the image shouldn't itself be transformed into text. (and ideally, that overlay would be rich with context and provenance; it wouldn't just be a blob of plain text; it would know what image it's from, admit other texts that it could potentially be, talk about how likely each word of it is to be correct, say as much as possible about the recognizer's process and thinking)

The original thing is still around and is still the source of truth.
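(here is one way that overlay idea could look as a data structure, sketched in Python. all of the names are hypothetical: `Original`, `RecognitionOverlay`, `WordGuess`, and the recognizer label are assumptions for illustration, not any real recognizer's API. the original is frozen and the overlay only points at it.)

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Original:
    """The thing itself (an image, a recording). Frozen: nothing rewrites it."""

    name: str
    data: bytes


@dataclass
class WordGuess:
    text: str
    confidence: float  # the recognizer's own estimate, from 0.0 to 1.0
    alternatives: List[str] = field(default_factory=list)  # other readings


@dataclass
class RecognitionOverlay:
    """One recognizer's guess about an Original. Many overlays can coexist."""

    source: Original   # provenance: which thing this guess is about
    recognizer: str    # which recognizer produced the guess
    words: List[WordGuess]

    def best_text(self) -> str:
        return " ".join(w.text for w in self.words)

    def doubtful(self, threshold: float = 0.5) -> List[str]:
        # Words the recognizer itself is unsure about.
        return [w.text for w in self.words if w.confidence < threshold]


scrawl = Original("title-scrawl.png", b"\x89PNG...")  # placeholder bytes
overlay = RecognitionOverlay(
    source=scrawl,
    recognizer="hypothetical-handwriting-v0",
    words=[
        WordGuess("Chapter", 0.93),
        WordGuess("4", 0.41, alternatives=["9", "A"]),
    ],
)
```

throwing away `overlay` loses nothing; `scrawl` is still there, byte for byte, and a different recognizer could attach a second overlay to it tomorrow.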

mostly, this email is not an argument; i don't know if it really makes any sense; it's a feeling

i feel like the computer should give you more space to play. like you should be able to play and doodle and dream by default, wherever you happen to be on the computer, without your first concern being whether the computer will recognize it...

my computer's first job is not to recognize things; it's to hold onto what I put in it. it's a medium, not an intelligence; i want it to be good at being a sheet of paper before it tries to be anything else. (it's a versatile sheet of paper: it can hold video, and sound, and links, and computations, and ...)

¹ that is, it's push-to-talk / spring-loaded
² It's strange -- the detail that the printer is a receipt printer feels very important. that the printout comes out right at hand, like a Polaroid (rather than out of a printer somewhere else in the room), and that the printout is fast (not a laser printer churning for 30 seconds before spitting out a page). it helps create this sense of lightness and 'post-it-ness'
³ dj microbeads proposed implementing this as a standalone project, with QR codes on the receipts & a phone app that you point at a QR code to play the sound, and that immediately didn't feel right to me.

it feels like it doesn't preserve enough of the lightness, the fact that you could put this receipt down anywhere on your desk and it'd play, like an object that has its own magic. (maybe the receipt would have a dot frame around it, like any other page in Dynamicland, although I don't like to think about that too hard; it would be better if it felt even lighter than that)

I think you at least need to have an always-on well on your desk where you can put a receipt to play it (like a slot where you'd place a Yu-Gi-Oh card, but alive -- maybe it'd have a webcam / document camera / cheap phone mounted overhead), even if you don't have Dynamicland-esque coverage of the whole desktop; the app on your existing phone that you have to switch into is not enough
⁴ maybe the receipt would have the audio waveform printed on it or something, just to give it a unique appearance. Plus, there's this Dynamicland aspiration that I'd want to maintain -- that a computer could in theory look at the receipt and regenerate the audio completely from what it sees (even if in practice it usually 'cheats' by keeping the audio file on disk)

(The Dynamicland aspiration is that you should be able to look at the situation in the real world & completely derive the computer state from that; there shouldn't be invisible 'virtual state' that lives in RAM or on a hard disk.)

(Ideally, if the power cut out and the computers in the ceiling at Dynamicland all restarted, nothing would be lost, because the physical arrangement of stuff in the real world completely determines the behavior of the computer anyway, and that physical arrangement remains intact.)
⁵ it's such a waste! I have a pet theory that a lot of the stagnation in programming and in GUIs in general is downstream of a stagnation in consumer I/O devices.

hard to do, say, graphical programming that feels good, if you're stuck with a mouse and keyboard. and tablets aren't stuck with that; they have a chance to do something new, and it feels like they've squandered it so far
⁶ I almost feel like I should get to draw the interface that I want for myself. like how Acme lets you grow your own palette of commands as you work. that would also make me feel a lot more comfortable and committed to the interface, if it was something that came from my own hand
⁷ and why are there files and folders at all? it feels like a naive, traditional-PC-derivative take on how to deal with information. why can't I just have endless pages, where I mark out and wire regions of them together with my magic pen?
⁸ and we wanted to think about how you would make a 'database' or querying interface that takes advantage of the unique properties of Dynamicland, and we wanted more applications in DL that actually got regularly used in a real context.
⁹ reminds me a bit of the Lisa Polaroids :-)
¹⁰ and the dots take so much space! to me, the fact that they dominate your visual field feels so wrong, so misleading about what's really important. and it constrains the number of 'objects' that can fit on the page so much when each 'object' has this thick dot-frame around it. it constrains your imagination of what the scrapbook can be. i wish objects could be small and free and unrecognized