From Porting Models to Custom Silicon to a Tiny LLM That Writes Music: the Story of BookMusic

Table of Contents

“Without deviation from the norm, progress is not possible.” — Frank Zappa

This is the story of a side project that started as “let’s port a cutting-edge music model to custom AWS silicon” and ended as “let’s ask a tiny local LLM to write a few lines of code in a live-coding language”. Same goal, wildly different path. And the journey taught me something I think is worth sharing: with today’s AI tools, getting things done is fast and almost cheap. The scarce resource isn’t execution anymore. It’s having an idea that’s actually different.

Where it started: SpectroStream and Magenta RT #

It all began with a paper: SpectroStream, a neural audio codec from Google that compresses 48 kHz stereo music into discrete tokens. It’s the codec behind Magenta RealTime, Google’s open-weights model that generates music in real time as a stream of audio tokens.

I was curious. Could I make “concepts sound”? Could a phrase, a paragraph, a story generate its own music? And, since I had AWS credits for Inferentia2 instances, could I port Magenta RT to Inf2 as a learning project?

So I did what I always do: research first, credits later.

The Inf2 rabbit hole #

The feasibility research on porting Magenta RT to AWS Neuron was fascinating, and humbling. The short version:

Magenta RT is a three-model pipeline (a style embedding model, a T5-style encoder-decoder LM with a nested autoregressive decoder, and the SpectroStream codec), trained in JAX, hitting a real-time factor of 1.8x on a single NVIDIA H100. Not a huge margin to play with.

Neuron, on the other hand, compiles everything ahead of time to fixed shapes. No lax.cond. Static while_loop only. And the nested autoregressive decode is exactly the worst case for that compilation model. Even Trainium2 doesn’t help: the hardware supports more, but the compiler doesn’t expose it yet.

I ended up with a six-phase spike plan, complete with go/no-go gates and a strategic fork at phase 3: stay in JAX and fight the compiler, or reimplement the LM in PyTorch on NxD Inference. A proper research project. Months of work, realistically.

{{< alert “circle-info” >}} Lesson number one, which I keep relearning: the cheapest experiment is the one that kills the project fastest. Front-load it. {{< /alert >}}

And while staring at that plan, a question crept in: what did I actually want to build?

The idea shifts: a story that generates its own music #

What I wanted wasn’t “Magenta RT on Inf2”. That was the means, not the end. What I wanted was this: you open a book, you start reading, and music plays underneath. Not a playlist. Music generated from the text itself, evolving as the story does. The highlight follows your reading, the mood follows the highlight.

The key insight was that this doesn’t need true real-time generation at all. Reading a paragraph takes 10 to 20 seconds. If you can generate the next chunk of music faster than the reader finishes the current one, you can buffer ahead and the illusion holds forever. I called it “fake realtime”.

Suddenly, I didn’t need an H100. I didn’t need Inf2. I needed something that generates music slightly faster than people read.

Prototyping on the laptop #

First attempt: MusicGen, Meta’s text-to-music model. Vanilla AudioCraft falls back to CPU on a Mac (no MPS support for MusicGen), but there’s an MLX port that runs natively on Apple Silicon. On my MacBook M5 with 16 GB, the small model hits about 0.8x real-time. Slower than playback, but with buffering? Workable.

Then I needed the translation layer: a paragraph of a story is not a music prompt. “The kitten tangled the ball of yarn” means nothing to MusicGen. So I put an LLM in the middle with a carefully designed prompt: read the paragraph, translate the mood into musical descriptors (genre, instruments, tempo, feeling), never describe the plot.

I tested it on the Italian translation of Through the Looking-Glass. The paragraph where the black kitten plays with the wool came back as:

Playful chamber music, pizzicato strings, light woodwind trills, brisk tempo, whimsical and warm, cozy domestic charm

Pizzicato strings for a kitten batting a ball of yarn. That sounded good, and I thought the concept worked.

I moved the LLM local too: Ollama with Qwen3.5, structured outputs to keep a small model on rails, low temperature because this is translation, not creativity. It worked. The pipeline was: text, then local LLM, then MusicGen, then audio files, then a player.

It worked, but it was heavy. Two models fighting for 16 GB of unified memory, a 0.8x real-time factor with no margin.

The pivot: don’t generate audio, generate code #

Then came the idea that changed everything. Do you know Strudel? It’s a browser-based live-coding environment, a JavaScript port of TidalCycles. You write a few lines of pattern code and the browser synthesizes the music via Web Audio. Instantly. For free.

So: what if the LLM doesn’t describe music for an audio model, but writes the music directly as Strudel code?

Think about what this means in terms of performance: no MusicGen means no GPU, and no real-time factor to worry about (generating 200 tokens of code is nearly instant). No audio files to buffer and crossfade. Continuity between sections becomes trivial: pass the previous code to the model and ask it to evolve it, exactly the live-coding workflow Strudel was designed for.

The catch: Strudel is niche, and a small model has basically zero reliable knowledge of its syntax. Which brings me to the part of the project I ended up enjoying the most.

The prompt is the music model #

My first attempt was exactly what you’d expect: “write Strudel code for the mood of this paragraph”. The result was a disaster. The model confidently invented functions that don’t exist, wrote patterns that didn’t parse, and when something did run, it sounded like a drum machine falling down the stairs. Not exactly what you want under a quiet reading session.

You can’t ask a small model to recall Strudel. There’s just not enough of it in any training corpus. You have to teach it Strudel, inside the prompt. So the prompt grew, iteration after iteration, into something that is honestly the heart of the whole project:

A restricted vocabulary. Only functions and sounds that actually exist and sound good for reading: soft synths, pads, a handful of gm_ soundfonts, gentle percussion. And a hard rule: never invent function names.
A safety net for harmony. Scale degrees instead of raw notes (n("0 2 4").scale("C:minor")): whatever numbers the model picks, they land inside the scale. A small model mathematically can’t play a wrong note.
A mood-to-style cheat sheet, with one labeled example per style. This was the breakthrough. Without per-style examples, small models collapse everything into the same two instruments. With them, they copy the structure and swap the mood. The routing is explicit: cozy / nostalgic / rainy goes to LOFI, wonder / magic / surreal goes to DREAMY (lydian, no drums), rage / fury goes to ANGER.
Modulation as a requirement, not an option. Each piece plays for about a minute, so static loops get boring fast. The prompt pushes slow LFO-style movement: filters that breathe over 16-20 cycles, gentle stereo drift, soft volume swells.

Here’s the AMBIENT example, straight from the prompt, the one the model copies for calm or sorrowful scenes:

setcpm(32)
stack(
  note("<[c3,eb3,g3] [bb2,d3,f3] [ab2,c3,eb3] [g2,bb2,d3]>")
    .sound("gm_synth_strings_1")
    .attack(.6).release(1.5)
    .gain(.5).room(.7)
    .lpf(sine.range(400,1100).slow(20)),
  note("<c2 ab1>").sound("sine").gain(.35).lpf(450)
    .pan(sine.range(.3,.7).slow(16))
)

Two layers, a four-chord minor progression, a filter that takes twenty cycles to open and close, a bass that slowly drifts across the stereo field. That’s the whole “model”.

After generation, the previous section’s code is passed back in, with the instruction to keep the same instruments and evolve the rest. Textual continuity becomes musical continuity, for free.

There are also smaller lessons buried in there that took real iteration to learn. My favorite: with a drum beat, one Strudel cycle is one bar, so the tempo numbers mean completely different things for beat styles (18-26 cpm) versus drumless pads (28-45 cpm). Get that wrong and your cozy lofi cafe turns into a drum’n’bass set.

Funny thing: everyone talks about prompt engineering for chatbots and agents, and meanwhile here I am, tuning few-shot examples so that a 0.8B model can write a decent ambient pad. Prompt engineering is still very much a thing, and with small local models it’s not optional: the prompt does the heavy lifting the parameters can’t.

The prompt is the music model now. A few hundred lines of text doing the job I was about to throw an H100 at. The full prompt, all styles included, is in the git repository.

A digression on genres #

I tried to add nine styles currently in the prompt to cover the moods a novel typically goes through: ambient, lofi, chillhop/jazzy, warm piano, dreamy, tense, happy/pop, epic orchestral Each is a description plus one short, clean example (2 or 3 layers, so small models can copy it).

This is also where the project is most open-ended, also for non-technical people, and where help is genuinely appreciated. Adding a genre is the most fun contribution imaginable: write a labeled style block, craft one good Strudel example, add a line to the mood cheat sheet, done: no model training is needed. If you can sketch a pattern in the Strudel REPL, you can teach BookMusic a new genre. I’d love to see what a proper live coder would do with a noir style, a folk one, or something properly weird for the experimental chapters.

And speaking of proper live coders: a sincere thank you to Switch Angel and the other algorave composers I follow, who are the real inspiration behind the Strudel pivot. Watching people perform music by writing code live is what planted the idea that code is a perfectly good musical medium.
To be clear, this project doesn’t even get close to what they do: a small model copying ambient templates is to live coding what elevator music is to a concert. But if BookMusic brings a few more curious people to Strudel, to algorave, and to their streams, I’ll consider it a success.

BookMusic #

The result is BookMusic: open a text file or a PDF, click a paragraph, and start reading. The app groups paragraphs into roughly one-minute sections, sends each section to a small local model (Qwen3.5 via Ollama, anything from 0.8B to 4B works), gets back a few lines of ambient Strudel code, and plays it in the browser. As you scroll, the highlight follows you and the music advances, prefetching the next section while you read.

The architecture is embarrassingly simple: a static frontend (pdf.js for rendering, the Strudel REPL engine for sound) and a thin FastAPI backend whose only job is prose in, Strudel code out. The whole thing runs on a laptop.

That said, the backend is deliberately a single swappable seam, because I want to keep a cloud option open too, that might come in the near future. You don’t always have a laptop with Ollama running when you want to read, and for many users “call a serverless endpoint that consumes a few tokens on Bedrock” is far easier than installing a local LLM stack. Generating a one-minute piece costs a few hundred tokens of a small model: pennies, even at scale. I’m still thinking about the implementation (API Gateway + Lambda + Bedrock is the natural shape), but whatever it becomes, it will be open source like the rest of the project.

A few hard-won gotchas, in case you tinker with something similar:

Disable thinking. Qwen-style models “think” by default on recent Ollama, and the think=False flag in ollama-python wasn’t honored by the server. Generation took 120 to 180 seconds of pure reasoning before a single note. The fix was POSTing /api/chat directly with a top-level "think": false. This was the single biggest “it broke on another machine” surprise.
Keep sampling minimal. Stacking my own repeat_penalty on top of the model’s baked-in presence penalty produced token salad. Temperature and num_predict, nothing else.
Small models need small, labeled examples. And a client-side retry for when the generated code doesn’t parse, because sometimes it won’t.

The name (and a lovely discovery) #

While hunting for a name, I stumbled on something that settled the question for me: book music already exists, and it’s beautiful. It’s the 19th-century medium of folded, perforated cardboard books that mechanical fairground organs read and play as the pages scroll past. Patented by Anselmo Gavioli in 1892, it freed organs from fixed-length pinned barrels: suddenly a machine could play any tune, of any length, just by feeding it a different book.

A book that a machine reads and turns into music, automatically, page by page, with the length of the music no longer limited by the instrument. I was accidentally rebuilding a Victorian fairground organ in software, with an LLM punching the holes.

I briefly considered other names (something more SEO-friendly, since searching “book music” today returns fairground organ suppliers and Wikipedia). But honestly, some names are worth losing the SEO battle for. The project is called BookMusic, and it wears its 130-year-old lineage proudly in the README. Gavioli would have shipped this as a weekend project.

What I actually learned #

Looking back, every individual piece of this project was fast. The research on Neuron’s compiler constraints, the MusicGen prototype, the prompt engineering, the web app itself (spun up with Claude Code in a handful of sessions). AI assistance made all of it almost frictionless. Ten years ago each of those steps would have been weeks; now the execution is rarely the bottleneck.

But none of that produced the project. The project came from three decisions no tool made for me:first, I realized I was solving the wrong problem. “Port Magenta RT to Inf2” was an impressive-sounding goal that had quietly replaced the actual goal, then noticing that “fake realtime” changes everything.
Once generation only needs to beat reading speed, the entire hardware problem evaporates.
The last thing was the Strudel pivot. Generating code instead of audio is the kind of sideways move that doesn’t show up in any benchmark, and it made the project ten times simpler and fully local.

AI lets you get pretty much everything done, and fast. That’s exactly why doing what everyone else is doing has never been worth less. The leverage has moved entirely to the idea: the reframing, the deviation from the norm, the “what if we don’t generate audio at all”. Zappa was right.

Where to go next #

Plenty is intentionally left undone, and ideas are welcome:

New genres and styles: the easiest and most rewarding contribution, see above. Bring your Strudel patterns!
The cloud option: a serverless prose-to-Strudel endpoint (API Gateway + Lambda + Bedrock) for when there’s no laptop around, open source like everything else
A real crossfade between sections (today it’s a cycle-aligned swap that leans on reverb tails)
Running the LLM fully in the browser with WebGPU, the opposite extreme: no backend at all (but, honestly, the first tests aren’t so great).
Better PDF handling: multi-column layouts, header stripping, EPUB
Smarter mood detection and prompt evolution: if you find a tweak that makes a small model noticeably more reliable, that’s gold

The project is AGPL and meant to be played with. Pull requests are more than welcomed, and let me know how your book sounds. If you’re a live coder: what genre should BookMusic learn next?