Update 12/04/2024: Read at the end of the post for more info.
I have been hacking on AI agents recently for both fun and profit as part of the work I’m doing for one of my clients.
They’re mostly text-to-speech (TTS) agents that leverage LLMs to generate text, which is then turned into voice by a trained TTS model.
As you probably know, maintaining a conversation with an LLM over a longer period of time requires keeping track of the conversational context and sending it back to the LLM along with your follow-up prompts, otherwise the model starts “hallucinating” from the get-go.
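To make that concrete, here is a minimal sketch of the idea (callLLM is a hypothetical stand-in for a real model call, not anything from our agents): every follow-up prompt is sent together with the whole accumulated context.
package main

import (
	"fmt"
	"strings"
)

// callLLM is a hypothetical stand-in for a real model call; it just echoes
// the last line of the prompt so the sketch is runnable on its own.
func callLLM(prompt string) string {
	return "echo: " + prompt[strings.LastIndex(prompt, "\n")+1:]
}

func main() {
	// The whole conversation so far is replayed with every follow-up prompt.
	var context []string
	for _, prompt := range []string{"hello", "what did I just say?"} {
		context = append(context, "user: "+prompt)
		reply := callLLM(strings.Join(context, "\n"))
		context = append(context, "assistant: "+reply)
	}
	fmt.Println(strings.Join(context, "\n"))
}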
We needed a small agent that would act as a sort of supervisor for a few other agents, and we felt that reaching for one of the bloated LLM frameworks (the kind that maintains the context and appends new prompts to it on your behalf) would be unnecessary overhead for this task.
So we had to come up with a way to maintain a conversational context of a specific size, and it needed to be reasonably fast given the volume of back and forth between the agents and, consequently, the LLMs.
Most of the agents we have are written in Python and Go, but as part of my side hacks I started digging into developing them in Rust along with some funky new Rust crates. This specific agent, though, was written in Go, hence the code that follows is Go-based. Nevertheless, I will list the Rust implementation at the end, as I needed it for one of my side projects which I’m hoping to blog about soon.
Anyway, the basic requirements were pretty simple: we needed a fixed-size buffer (strings will do for demonstration purposes), and we needed that buffer to be circular. Every time a new prompt is added, it gets appended at the end of the buffer and the element at the front gets popped off, so we maintain the fixed size and prevent memory bloat.
Now, this might seem like premature optimisation given that we have bigger things to worry about when it comes to memory usage in LLM workflows. That may well be on point, but it should not discourage us from making our programs faster and more efficient, especially since, as you will see, the investment required is not huge. Love your craft!
It’s also worth mentioning that this post does not deal with managing the context size limit imposed by many LLMs. That’s left to the reader as an exercise.
The initial implementation would be something like this:
// History keeps the last size elements of the conversation.
type History struct {
	data []string
	size int
}

func NewHistory(size int) *History {
	return &History{
		data: make([]string, 0, size),
		size: size,
	}
}

// Add appends element to the history, dropping the oldest entry
// once the buffer has reached its fixed size.
func (h *History) Add(element string) {
	if len(h.data) < h.size {
		h.data = append(h.data, element)
		return
	}
	h.data = append(h.data[1:], element)
}

// String joins the stored context into a single newline-separated string.
func (h *History) String() string {
	return strings.Join(h.data, "\n")
}
Note that we’ve added the String method to History, which plays two roles:
- it implements the Stringer interface for easy context inspection
- we use it to generate the new prompt we send to the LLM
This is hopefully straightforward, but in case it isn’t, just run this simple program to see it in action.
package main

import (
	"fmt"
)

func main() {
	h := NewHistory(3)
	for _, s := range []string{"foo", "bar", "car", "dar", "ear"} {
		h.Add(s)
	}
	// prints: car\ndar\near
	fmt.Println(h)
}
This works fine and does the job, but an experienced Go programmer will notice that if the conversation goes on for a long time we end up making unnecessary allocations in the Add method, because the append calls may have to allocate new chunks of memory. This also puts extra pressure on the Go GC, which isn’t ideal.
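To see where those allocations come from, here is a small, standalone sketch (not part of the agent code) that mimics what Add does once the buffer is full: since the front of the backing array keeps getting sliced off, append periodically runs out of room at the tail and has to allocate a fresh array and copy everything over.
package main

import "fmt"

func main() {
	data := make([]string, 3)
	for i := 0; i < 8; i++ {
		oldCap := cap(data[1:])
		// This mirrors the append in Add: drop the front, push at the back.
		data = append(data[1:], fmt.Sprintf("x%d", i))
		if cap(data) > oldCap {
			// append ran out of capacity and allocated a new backing array.
			fmt.Printf("add %d: reallocated, cap %d -> %d\n", i, oldCap, cap(data))
		}
	}
}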
Let’s write a simple benchmark that tests the performance for buffers of varying sizes; something like this could do:
func BenchmarkHistory(b *testing.B) {
	sizes := []int{10, 100, 1000, 10000}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("Add_Size_%d", size), func(b *testing.B) {
			h := NewHistory(size)
			b.ResetTimer()
			for i := 0; i < 10*size*b.N; i++ {
				h.Add("benchmark")
				_ = h.String()
			}
		})
	}
}
NOTE: we want to benchmark Add and String together as they usually happen around the same time, so they’re in the same hot path.
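The benchmark can be run with go test -bench=. -benchmem; the -benchmem flag is what produces the B/op and allocs/op columns below.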
Here are the results:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 135546 8878 ns/op 14400 B/op 111 allocs/op
BenchmarkHistory/Add_Size_100-12 1818 647935 ns/op 1055979 B/op 1010 allocs/op
BenchmarkHistory/Add_Size_1000-12 19 62660368 ns/op 102598794 B/op 10028 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 5882203041 ns/op 10122446440 B/op 100580 allocs/op
PASS
ok history 10.835s
This isn’t horrible but we can do better!
If you think about it a bit harder you will notice that there is no need for extra allocations when adding a new element to the history buffer. We know that the history has a fixed size and that when we add a new element we must pop the one at the front, i.e. make room for the new entry, old man! We should be able to allocate the buffer only once, at the beginning of the program, and never again.
To solve this problem, we can simply overwrite the element that needs to be popped with the newly added element and maintain a cursor to its position. We also need to do this in a circular fashion, hence the name circular buffer, so we can reconstruct the full history easily. Let’s demonstrate it in code to make it clearer what I mean:
// History keeps the last size elements of the conversation in a circular buffer.
type History struct {
	data []string
	size int
	pos  int // next slot to overwrite, i.e. the oldest entry once the buffer is full
}

func NewHistory(size int) *History {
	return &History{
		data: make([]string, size),
		size: size,
		pos:  0,
	}
}

// Add overwrites the oldest element and advances the cursor, wrapping around.
func (h *History) Add(element string) {
	h.data[h.pos] = element
	h.pos = (h.pos + 1) % h.size
}

// String reconstructs the context in chronological order, starting at pos.
func (h *History) String() string {
	result := make([]string, h.size)
	for i := 0; i < h.size; i++ {
		result[i] = h.data[(h.pos+i)%h.size]
	}
	return strings.Join(result, "\n")
}
Notice how we had to change the String method to accommodate the new buffer design. We couldn’t simply reuse the previous implementation: the chronological order of the entries now wraps around inside the buffer, so joining h.data directly wouldn’t return the context in the right order.
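To make that concrete, here is what happens if we feed the new implementation the same five strings as in the earlier demo program:
h := NewHistory(3)
for _, s := range []string{"foo", "bar", "car", "dar", "ear"} {
	h.Add(s)
}
// The backing slice is now ["dar", "ear", "car"] and pos == 2, so joining
// h.data directly would give "dar\near\ncar". Starting the read at pos and
// wrapping around restores the chronological order: "car\ndar\near".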
Ok, this seems better but just to make sure it is, let’s run the benchmark again:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 84068 12798 ns/op 27200 B/op 200 allocs/op
BenchmarkHistory/Add_Size_100-12 1359 873885 ns/op 2815989 B/op 2000 allocs/op
BenchmarkHistory/Add_Size_1000-12 14 83245247 ns/op 265928059 B/op 20017 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 8806193542 ns/op 26552736240 B/op 200171 allocs/op
PASS
ok history 12.726s
Oooooof, there goes our performance optimisation! This is even worse. How could that be? If you squint hard enough you will notice we’re creating a new result slice every time we need to return the conversation, which happens a lot during agent conversations. That’s the penalty we are ultimately paying here. There must be a better way! And there is, indeed!
We won’t be able to entirely ditch the memory allocations in String, because we must assemble the final prompt from the collected context, but we can make it a bit more performant by leveraging strings.Builder:
A Builder is used to efficiently build a string using Write methods. It minimizes memory copying. The zero value is ready to use.
The new String implementation looks like this:
func (h *History) String() string {
	sb := strings.Builder{}
	// Pre-size the builder using the first element's length as a rough
	// estimate of the average entry size, plus one byte per newline.
	sb.Grow(h.size * (len(h.data[0]) + 1))
	idx := h.pos
	for i := 0; i < h.size; i++ {
		sb.WriteString(h.data[idx])
		if i < h.size-1 {
			sb.WriteByte('\n')
		}
		idx = (idx + 1) % h.size
	}
	return sb.String()
}
If we run the benchmark again we get the following result:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 184761 6508 ns/op 11200 B/op 100 allocs/op
BenchmarkHistory/Add_Size_100-12 1981 593697 ns/op 1024007 B/op 1000 allocs/op
BenchmarkHistory/Add_Size_1000-12 20 56352919 ns/op 102400864 B/op 10009 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 5457632291 ns/op 10149619328 B/op 100171 allocs/op
PASS
ok history 10.381s
This is significantly better than the previous effort and much better than the original implementation of the buffer we started with.
We could probably spend more time on this, but it’s already pretty good for half an hour’s worth of work, and we have to move on to more interesting bits, like making the agents talk to each other.
Finally, I’d like to mention a Go module suggested by carlana on lobste.rs: https://github.com/carlmjohnson/deque, which implements a generic deque and would also greatly simplify our Go code! Do check it out. It’d be great if it ever made it into the Go standard library.
I mentioned at the beginning that I needed something like this for the Rust agent we kicked off, so here’s the Rust implementation in case anyone is wondering about it:
use std::fmt;

struct History {
    data: Vec<String>,
    size: usize,
    pos: usize,
}

impl History {
    fn new(size: usize) -> Self {
        History {
            data: vec![String::new(); size],
            size,
            pos: 0,
        }
    }

    // Overwrite the oldest element and advance the cursor, wrapping around.
    fn add(&mut self, element: String) {
        self.data[self.pos] = element;
        self.pos = (self.pos + 1) % self.size;
    }

    // Reconstruct the context in chronological order, starting at pos.
    fn string(&self) -> String {
        let mut result = Vec::with_capacity(self.size);
        for i in 0..self.size {
            result.push(&self.data[(self.pos + i) % self.size][..]);
        }
        result.join("\n")
    }
}

impl fmt::Display for History {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut result = Vec::with_capacity(self.size);
        for i in 0..self.size {
            result.push(&self.data[(self.pos + i) % self.size][..]);
        }
        write!(f, "{}", result.join("\n"))
    }
}
You can then use the History like so:
fn main() {
    let mut h = History::new(3);
    let strings = vec!["foo", "bar", "car", "dar", "ear"];
    strings.iter().for_each(|s| h.add(s.to_string()));
    println!("{}", h.string());
}
masklinn suggested on lobste.rs that we don’t even need the intermediate Vec in the string method; we can make do with a simple String:
fn string(&self) -> String {
    // Pre-size the String: total length of all entries plus one byte per newline.
    let mut s = String::with_capacity(
        self.data
            .iter()
            .take(self.size)
            .map(|s| s.len())
            .sum::<usize>()
            + self.size,
    );
    for i in 0..self.size {
        s.push_str(&self.data[(self.pos + i) % self.size]);
        s.push('\n');
    }
    s.pop(); // drop the trailing newline
    s
}
Finally, we can also leverage std::collections::VecDeque:
use std::collections::VecDeque;
use std::fmt;

struct History {
    data: VecDeque<String>,
    size: usize,
}

impl History {
    fn new(size: usize) -> Self {
        History {
            data: VecDeque::with_capacity(size),
            size,
        }
    }

    fn add(&mut self, element: String) {
        // Once the buffer is full, drop the oldest element before pushing the new one.
        if self.data.len() == self.size {
            self.data.pop_front();
        }
        self.data.push_back(element);
    }

    fn string(&self) -> String {
        let mut s = String::new();
        for string in &self.data {
            s.push_str(string);
            s.push('\n');
        }
        s.pop(); // Remove the last newline character
        s
    }
}
impl fmt::Display for History {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut result = Vec::with_capacity(self.size);
        for string in &self.data {
            result.push(string.as_str());
        }
        write!(f, "{}", result.join("\n"))
    }
}
As always, with a bit more thinking and a tiny bit more time investment we can avoid unnecessary performance issues, which will pay off for us down the line. Remember, performance is a feature! Now go optimise your code!
Update 12/04/2024: I’ve received great feedback from various communities about this post and I’ve incorporated some of it. Specifically, I’ve added:
- alternative Rust implementations:
  - skipping the usage of Vec<String> in favour of String
  - using std::collections::VecDeque instead of implementing it by hand
- a mention of a Go module that implements a deque