Update 12/04/2024: Read at the end of the post for more info.
I have been hacking on AI agents recently for both fun and profit as part of the work I’m doing for one of my clients.
They’re mostly text-to-speech (TTS) agents that leverage LLMs to generate text, which is then turned into voice by a trained TTS model.
As you probably know, maintaining a conversation with an LLM over a longer period of time requires keeping track of the conversational context and sending it back to the LLM along with your follow-up prompts, otherwise the model starts “hallucinating” from the get-go.
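To make that concrete, here is a minimal sketch of the idea (callLLM is a hypothetical stand-in for a real model call, not anything from our agents): every follow-up prompt is sent together with the whole accumulated context.
package main

import (
	"fmt"
	"strings"
)

// callLLM is a hypothetical stand-in for a real model call; it just echoes
// the last line of the prompt so the sketch is runnable on its own.
func callLLM(prompt string) string {
	return "echo: " + prompt[strings.LastIndex(prompt, "\n")+1:]
}

func main() {
	// The whole conversation so far is replayed with every follow-up prompt.
	var context []string
	for _, prompt := range []string{"hello", "what did I just say?"} {
		context = append(context, "user: "+prompt)
		reply := callLLM(strings.Join(context, "\n"))
		context = append(context, "assistant: "+reply)
	}
	fmt.Println(strings.Join(context, "\n"))
}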
We needed a small agent that would act as a sort of supervisor for a few other agents, and we felt that reaching for one of the bloated LLM frameworks (the kind that maintains the context and appends new prompts to it on your behalf) would be unnecessary overhead for this task.
So we had to come up with a way to maintain a conversational context of a specific size, and it needed to be reasonably fast given the volume of back and forth between the agents and, consequently, the LLMs.
Most of the agents we have are written in Python and Go, but as part of my side hacks I started digging into developing them in Rust along with some funky new Rust crates. This specific agent, though, was written in Go, hence the code that follows is Go-based. Nevertheless, I will list the Rust implementation at the end, as I needed it for one of my side projects which I’m hoping to blog about soon.
Anyway, the basic requirements were pretty simple: we needed a fixed-size buffer (strings will do for demonstration purposes), and we needed that buffer to be circular. Every time a new prompt is added, it gets appended at the end of the buffer and the element at the front gets popped off, so we maintain the fixed size and prevent memory bloat.
Now, this might seem like premature optimisation given that we have bigger things to worry about when it comes to memory usage in LLM workflows. That may well be on point, but it should not discourage us from making our programs faster and more efficient, especially since, as you will see, the investment required is not huge. Love your craft!
It’s also worth mentioning that this post does not deal with managing the context size limit imposed by many LLMs. That’s left to the reader as an exercise.
The initial implementation would be something like this:
// History keeps the last size elements of the conversation.
type History struct {
	data []string
	size int
}

func NewHistory(size int) *History {
	return &History{
		data: make([]string, 0, size),
		size: size,
	}
}

// Add appends element to the history, dropping the oldest entry
// once the buffer has reached its fixed size.
func (h *History) Add(element string) {
	if len(h.data) < h.size {
		h.data = append(h.data, element)
		return
	}
	h.data = append(h.data[1:], element)
}

// String joins the stored context into a single newline-separated string.
func (h *History) String() string {
	return strings.Join(h.data, "\n")
}
Note that we’ve added the String method to History, which plays two roles:
- it implements the Stringer interface for easy context inspection
- we use it to generate the new prompt we send to the LLM
This is hopefully straightforward, but in case it isn’t, just run this simple program to see it in action.
package main

import (
	"fmt"
)

func main() {
	h := NewHistory(3)
	for _, s := range []string{"foo", "bar", "car", "dar", "ear"} {
		h.Add(s)
	}
	// prints: car\ndar\near
	fmt.Println(h)
}
This works fine and does the job, but an experienced Go programmer will notice that if the conversation goes on for a long time we end up making unnecessary allocations in the Add method, because the append calls may have to allocate new chunks of memory. This also puts extra pressure on the Go GC, which isn’t ideal.
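To see where those allocations come from, here is a small, standalone sketch (not part of the agent code) that mimics what Add does once the buffer is full: since the front of the backing array keeps getting sliced off, append periodically runs out of room at the tail and has to allocate a fresh array and copy everything over.
package main

import "fmt"

func main() {
	data := make([]string, 3)
	for i := 0; i < 8; i++ {
		oldCap := cap(data[1:])
		// This mirrors the append in Add: drop the front, push at the back.
		data = append(data[1:], fmt.Sprintf("x%d", i))
		if cap(data) > oldCap {
			// append ran out of capacity and allocated a new backing array.
			fmt.Printf("add %d: reallocated, cap %d -> %d\n", i, oldCap, cap(data))
		}
	}
}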
Let’s write a simple benchmark that tests the performance for buffers of varying sizes; something like this could do:
func BenchmarkHistory(b *testing.B) {
	sizes := []int{10, 100, 1000, 10000}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("Add_Size_%d", size), func(b *testing.B) {
			h := NewHistory(size)
			b.ResetTimer()
			for i := 0; i < 10*size*b.N; i++ {
				h.Add("benchmark")
				_ = h.String()
			}
		})
	}
}
NOTE: we want to benchmark Add and String together as they usually happen around the same time, so they’re in the same hot path.
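The benchmark can be run with go test -bench=. -benchmem; the -benchmem flag is what produces the B/op and allocs/op columns below.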
Here are the results:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 135546 8878 ns/op 14400 B/op 111 allocs/op
BenchmarkHistory/Add_Size_100-12 1818 647935 ns/op 1055979 B/op 1010 allocs/op
BenchmarkHistory/Add_Size_1000-12 19 62660368 ns/op 102598794 B/op 10028 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 5882203041 ns/op 10122446440 B/op 100580 allocs/op
PASS
ok history 10.835s
This isn’t horrible but we can do better!
If you think about it a bit harder you will notice that there is no need for extra allocations when adding a new element to the history buffer. We know that the history has a fixed size and that when we add a new element we must pop the one at the front, i.e. make room for the new entry, old man! We should be able to allocate the buffer only once, at the beginning of the program, and never again.
To solve this problem, we can simply overwrite the element that needs to be popped with the newly added element and maintain a cursor to its position. We also need to do this in a circular fashion, hence the name circular buffer, so we can reconstruct the full history easily. Let’s demonstrate it in code to make it clearer what I mean:
// History keeps the last size elements of the conversation in a circular buffer.
type History struct {
	data []string
	size int
	pos  int // next slot to overwrite, i.e. the oldest entry once the buffer is full
}

func NewHistory(size int) *History {
	return &History{
		data: make([]string, size),
		size: size,
		pos:  0,
	}
}

// Add overwrites the oldest element and advances the cursor, wrapping around.
func (h *History) Add(element string) {
	h.data[h.pos] = element
	h.pos = (h.pos + 1) % h.size
}

// String reconstructs the context in chronological order, starting at pos.
func (h *History) String() string {
	result := make([]string, h.size)
	for i := 0; i < h.size; i++ {
		result[i] = h.data[(h.pos+i)%h.size]
	}
	return strings.Join(result, "\n")
}
Notice how we had to change the String method to accommodate the new buffer design. We couldn’t simply reuse the previous implementation: the chronological order of the entries now wraps around inside the buffer, so joining h.data directly wouldn’t return the context in the right order.
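To make that concrete, here is what happens if we feed the new implementation the same five strings as in the earlier demo program:
h := NewHistory(3)
for _, s := range []string{"foo", "bar", "car", "dar", "ear"} {
	h.Add(s)
}
// The backing slice is now ["dar", "ear", "car"] and pos == 2, so joining
// h.data directly would give "dar\near\ncar". Starting the read at pos and
// wrapping around restores the chronological order: "car\ndar\near".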
Ok, this seems better but just to make sure it is, let’s run the benchmark again:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 84068 12798 ns/op 27200 B/op 200 allocs/op
BenchmarkHistory/Add_Size_100-12 1359 873885 ns/op 2815989 B/op 2000 allocs/op
BenchmarkHistory/Add_Size_1000-12 14 83245247 ns/op 265928059 B/op 20017 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 8806193542 ns/op 26552736240 B/op 200171 allocs/op
PASS
ok history 12.726s
Oooooof, there goes our performance optimisation! This is even worse. How could that be? If you squint hard enough you will notice we’re creating a new result slice every time we need to return the conversation, which happens a lot during agent conversations. That’s the penalty we are ultimately paying here. There must be a better way! And there is, indeed!
We won’t be able to entirely ditch the memory allocations in String, because we must assemble the final prompt from the collected context, but we can make it a bit more performant by leveraging strings.Builder:
A Builder is used to efficiently build a string using Write methods. It minimizes memory copying. The zero value is ready to use.
The new String implementation looks like this:
func (h *History) String() string {
	sb := strings.Builder{}
	// Pre-size the builder using the first element's length as a rough
	// estimate of the average entry size, plus one byte per newline.
	sb.Grow(h.size * (len(h.data[0]) + 1))
	idx := h.pos
	for i := 0; i < h.size; i++ {
		sb.WriteString(h.data[idx])
		if i < h.size-1 {
			sb.WriteByte('\n')
		}
		idx = (idx + 1) % h.size
	}
	return sb.String()
}
If we run the benchmark again we get the following result:
goos: darwin
goarch: arm64
pkg: history
BenchmarkHistory/Add_Size_10-12 184761 6508 ns/op 11200 B/op 100 allocs/op
BenchmarkHistory/Add_Size_100-12 1981 593697 ns/op 1024007 B/op 1000 allocs/op
BenchmarkHistory/Add_Size_1000-12 20 56352919 ns/op 102400864 B/op 10009 allocs/op
BenchmarkHistory/Add_Size_10000-12 1 5457632291 ns/op 10149619328 B/op 100171 allocs/op
PASS
ok history 10.381s
This is significantly better than the previous effort and much better than the original implementation of the buffer we started with.
We could probably spend more time on this, but it’s already pretty good for half an hour’s worth of work, and we have to move on to more interesting bits, like making the agents talk to each other.
Finally, I’d like to mention a Go module suggested by carlana on lobste.rs: https://github.com/carlmjohnson/deque, which implements a generic deque and would also greatly simplify our Go code! Do check it out. It’d be great if it ever made it into the Go standard library.
I mentioned at the beginning that I needed something like this for the Rust agent we kicked off, so here’s the Rust implementation in case anyone is wondering about it:
use std::fmt;

struct History {
    data: Vec<String>,
    size: usize,
    pos: usize,
}

impl History {
    fn new(size: usize) -> Self {
        History {
            data: vec![String::new(); size],
            size,
            pos: 0,
        }
    }

    // Overwrite the oldest element and advance the cursor, wrapping around.
    fn add(&mut self, element: String) {
        self.data[self.pos] = element;
        self.pos = (self.pos + 1) % self.size;
    }

    // Reconstruct the context in chronological order, starting at pos.
    fn string(&self) -> String {
        let mut result = Vec::with_capacity(self.size);
        for i in 0..self.size {
            result.push(&self.data[(self.pos + i) % self.size][..]);
        }
        result.join("\n")
    }
}

impl fmt::Display for History {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut result = Vec::with_capacity(self.size);
        for i in 0..self.size {
            result.push(&self.data[(self.pos + i) % self.size][..]);
        }
        write!(f, "{}", result.join("\n"))
    }
}
You can then use the History like so:
fn main() {
    let mut h = History::new(3);
    let strings = vec!["foo", "bar", "car", "dar", "ear"];
    strings.iter().for_each(|s| h.add(s.to_string()));
    println!("{}", h.string());
}
masklinn suggested on lobste.rs that we don’t even need the intermediate Vec in the string method; we can make do with a simple String:
fn string(&self) -> String {
    // Pre-size the String: total length of all entries plus one byte per newline.
    let mut s = String::with_capacity(
        self.data
            .iter()
            .take(self.size)
            .map(|s| s.len())
            .sum::<usize>()
            + self.size,
    );
    for i in 0..self.size {
        s.push_str(&self.data[(self.pos + i) % self.size]);
        s.push('\n');
    }
    s.pop(); // drop the trailing newline
    s
}
Finally, we can also leverage std::collections::VecDeque:
use std::collections::VecDeque;
use std::fmt;

struct History {
    data: VecDeque<String>,
    size: usize,
}

impl History {
    fn new(size: usize) -> Self {
        History {
            data: VecDeque::with_capacity(size),
            size,
        }
    }

    fn add(&mut self, element: String) {
        // Once the buffer is full, drop the oldest element before pushing the new one.
        if self.data.len() == self.size {
            self.data.pop_front();
        }
        self.data.push_back(element);
    }

    fn string(&self) -> String {
        let mut s = String::new();
        for string in &self.data {
            s.push_str(string);
            s.push('\n');
        }
        s.pop(); // Remove the last newline character
        s
    }
}
impl fmt::Display for History {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut result = Vec::with_capacity(self.size);
        for string in &self.data {
            result.push(string.as_str());
        }
        write!(f, "{}", result.join("\n"))
    }
}
As always, with a bit more thinking and a tiny bit more time investment we can avoid unnecessary performance issues, which will pay off for us down the line. Remember, performance is a feature! Now go optimise your code!
Update 12/04/2024: I’ve received great feedback from various communities about this post and I’ve incorporated some of it. Specifically, I’ve added:
- alternative Rust implementations:
  - skipping the usage of Vec<String> in favour of String
  - using std::collections::VecDeque instead of implementing it by hand
- a mention of a Go module that implements a deque