Let’s prototype characters that think and feel (pt.2 : Object Based Training & Inference)

January 28

Didn’t want to spend any more time on TTS. Simply putting the locks in the right places didn’t work, which was demotivating. I need to build a visualizer to properly debug the buffers, but I’m not doing it in this prototype.

So, object based inference. The idea is simple: current LLMs are text in (called the “prompt”) and text out, usually set up for user/assistant conversations. In code, however, you rarely work with language or text, because code is object based.

I have this proof of concept up and running. For example, I have this OutputObject:

public class MessageOutputObject
{
    // The message text the LLM should generate
    public string Message;
    
    // Mood the message is written in, constrained to the enum values below
    public Mood Mood;
    
    // Author of the message; the default values below seed the grammar
    public AuthorOutputObject Author = new();
}

// Moods the LLM can choose from
public enum Mood
{
    Neutral,
    Crazy,
    Happy,
    Sad,
    Angry,
    Excited
}

public class AuthorOutputObject
{
    // First name; the default forces the generated name to start with "Pet"
    public string FirstName = "Pet";
    
    // Last name; the default forces the generated name to start with "L"
    public string LastName = "L";
}

If I call InferObjectAsync, my GBNF generator will automatically generate a grammar:

root::=MessageOutputObject
MessageOutputObject::="{""\"Message\":"Message",""\"Mood\":"Mood",""\"Author\":"Author"}"
string::="\""([^\"]*)"\""
boolean::="true"|"false"
int::=[-]?[0-9]+
uint::=[0-9]+
float::=[-]?[0-9]+"."?[0-9]*([eE][-+]?[0-9]+)?[fF]?
double::=[-]?[0-9]+"."?[0-9]*([eE][-+]?[0-9]+)?[dD]?
Message::=string
Mood::="\"Neutral\""|"\"Crazy\""|"\"Happy\""|"\"Sad\""|"\"Angry\""|"\"Excited\""
Author::=AuthorOutputObject
AuthorOutputObject::="{""\"FirstName\":"FirstName",""\"LastName\":"LastName"}"
FirstName::="\"Pet" ([^\"]*) "\""
LastName::="\"L" ([^\"]*) "\""

After that, InferAsync will run as usual, and during every iteration, we will attempt to repair and parse the entire output as JSON, and populate the OutputObject, which looks like this:

[OutputObject]
{
  "Message": ". Today was a productive day! I managed to complete several tasks, including writing a new article for the blog and conducting some market research for upcoming projects. The sun was shining, which always helps boost my mood and motivation. Looking forward to another fruitful day tomorrow!",
  "Mood": "Happy",
  "Author": {
    "FirstName": "Peter",
    "LastName": "Lee"
  }
}
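
For reference, the per-iteration repair-and-parse step is conceptually very simple. A minimal sketch, assuming Newtonsoft.Json (the TryParsePartial/RepairJson names are illustrative, not my actual executor code):

using System;
using System.Linq;
using Newtonsoft.Json;

public static class PartialJsonParser
{
    // Called after every generated token with the full output so far.
    public static MessageOutputObject TryParsePartial(string partialJson)
    {
        string repaired = RepairJson(partialJson);
        try
        {
            return JsonConvert.DeserializeObject<MessageOutputObject>(repaired);
        }
        catch (JsonException)
        {
            return null; // not parseable yet, try again on the next iteration
        }
    }

    // Naive repair: balance any unterminated string and dangling braces so
    // the incomplete output becomes syntactically valid JSON.
    private static string RepairJson(string json)
    {
        if (json.Count(c => c == '"') % 2 != 0)
            json += "\"";

        int open = json.Count(c => c == '{');
        int close = json.Count(c => c == '}');
        return json + new string('}', Math.Max(0, open - close));
    }
}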

So all of this works: we can now let the LLM generate objects instead of text. But we can’t use this to build a game yet, because there are a couple of issues:

  • All rules are appended to a string, which works fine in this case, but breaks during recursion. For example, if we’re writing an array rule, and come across a new object, we’ll start writing the object rule in the middle of the array rule.
  • There are no namespaced rules, which will break when conflicting names occur
  • There is no way to let the LLM choose between available options. There are enums, but we can’t use those for (for example) available actions, because we can’t modify enums at runtime.
  • Need some way to wrap things in a prompt format (Mistral, Alpaca)

To tackle the LLM being able to choose from a list of dynamic options, I’m thinking of using an array, and the LLM will simply return an array with a single element (the chosen option). However, sometimes we might want the LLM to “complete” an array, so we’ll probably need to use custom attributes. Also, when the LLM chooses “walk to vector3” (for example), how and where do we specify the vector3? This would require the grammar to be updated to include a vector3, but it’s not possible to modify the grammar during inference. Complex issues.

See you soon.

February 2

I have rewritten the grammar generator, and primitive & custom rules, namespacing options and custom attributes are in too. Tomorrow I will tackle the array select mode.

February 3

All done.

Now I’ll get some default executor functionality ready for object and object[] based inference, and start work on the custom executor which uses output from the previous iteration to determine the next. This is what we’ll need to build the game.

While working on this, I came up with something. We can’t modify the grammar during inference (at least, not easily), but we might simply be able to slice the grammar into parts and run inference in parts, which would allow us to modify a future “grammar” that we aren’t processing yet.

February 4

And, done. This is probably the most complex C# code I’ve written because of all the reflection, escaping, custom attributes and what not, but everything seems to be working. In the future, I’ll revisit some of it and simplify/clean things up, but for now this will do.

In the end, I came up with a better way to do the object selection (AssemblyQualifiedName metadata + TypeNameHandling); however, the metadata does cost some tokens in the context. The advantage is that we don’t have to modify the grammar at runtime and we don’t have to do multiple iterations for different objects: the LLM can simply pick an action from a list and it will be correctly cast back to us on the C# side.

Here I’ve created a new array of the base class (BaseAction) and added a couple of actions (extended from BaseAction) to it. By setting the ArrayDefaultValueMode to Select, we tell the LLM to select a single action from the list and return it to us.

The LLM correctly picks the “SayAction” in the first selector (a correct guess from the LLM based on the variable name!) and picks the MoveToVector3Action in the second selector.

Because the first action does not respect the default value mode overrides, it is ignoring the Discard override mode and making an exact copy (it contains “Hello, World!” by default).

If we had set the GrammarRespectDefaultValueMode to true, it would have respected the Discard override and generated a new sentence for us. Additionally, if the DefaultValueMode of the Message had been set to Complete, it would have completed the sentence.

However, for the second action, we do tell it to respect the default value mode override. The “TargetPosition” contains a different Vector3 by default, and it’s correctly discarding it and generating a new Vector3 for us.

In the end, everything (even though the array is of the type BaseAction) gets correctly cast back to its original extended type, and we can properly access the SayAction and MoveToVector3Action objects generated by the LLM.
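
To show the cast-back mechanics in isolation: with TypeNameHandling enabled, Newtonsoft embeds type metadata per element and resolves it again on deserialization. The action classes below just mirror the ones described above; this is plain Newtonsoft.Json behavior, not my executor code:

using System;
using Newtonsoft.Json;

public abstract class BaseAction { }

public class SayAction : BaseAction
{
    public string Message = "Hello, World!";
}

public class MoveToVector3Action : BaseAction
{
    // Plain floats as a stand-in for the Vector3 target, to keep the sketch self-contained.
    public float X, Y, Z;
}

public static class TypeNameHandlingDemo
{
    public static void Run()
    {
        var settings = new JsonSerializerSettings
        {
            // Embeds "$type": "Namespace.SayAction, AssemblyName" per element
            TypeNameHandling = TypeNameHandling.Objects
        };

        BaseAction[] actions = { new SayAction(), new MoveToVector3Action { X = 1f, Y = 0f, Z = 3f } };
        string json = JsonConvert.SerializeObject(actions, settings);

        // Deserializing through the base type still yields the concrete types,
        // which is what lets the chosen action be cast back on the C# side.
        var roundTripped = JsonConvert.DeserializeObject<BaseAction[]>(json, settings);
        Console.WriteLine(roundTripped[0] is SayAction);           // True
        Console.WriteLine(roundTripped[1] is MoveToVector3Action); // True
    }
}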

Edit
Stripping the Version, Culture and PublicKeyToken from the AssemblyQualifiedName keeps the parser functionality intact, a nice way to save some tokens.
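
Something along these lines (a hypothetical helper, not the exact code):

using System;

public static class TypeNameUtil
{
    // "Game.SayAction, Game.Runtime, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"
    //   -> "Game.SayAction, Game.Runtime"
    public static string StripAssemblyQualifiedName(Type type)
    {
        string[] parts = type.AssemblyQualifiedName.Split(',');
        return string.Join(",", parts[0], parts[1]);
    }
}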

Edit 2
I’ve added support for training annotations and uuidv4.

  • Training annotations are used to explain the training data by giving extra context. They are not rendered in the grammar & inference output, only in the export
  • uuidv4 is added to every sample to make every sample unique, even if their contents are identical

These are two tricks I came up with last year that should make the LLM understand the training data better, improve generalization and help avoid overfitting.

I’m not sure if uuidv4 should be disabled in inference, as it could help with generalization in few-shot scenarios, will think about it.
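
To make the shape of these two tricks concrete, here is an illustrative sketch (the attribute and sample class are made up for this example; the real plumbing lives in the grammar generator and exporter):

using System;

[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property)]
public sealed class TrainingAnnotationAttribute : Attribute { }

public class GreetSample
{
    // Rendered only in the dataset export, never in the grammar or inference output.
    [TrainingAnnotation]
    public string Annotation = "Bob greets Alice after discovering her near the campfire.";

    // Unique per sample, so identical-looking samples are still distinct.
    public string Uuid = Guid.NewGuid().ToString();

    public string Message = "Hi Alice!";
}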

February 7

A couple of things left to do for object based inference:

  • Helper functions (in object based executor) for getting context token count, token count per I/O object pair and removing tokens from start to end index
  • Input/output object base class (with optional uuid4) and I/O object pairs list (with token count?)
  • Keeping track of “dirty” I/O object pairs
  • Applying template to dirty object pairs, prepend (serialized) to latest input object before inference
  • Applying EOS at the end of the generated grammar. I’ve come across an issue with the default ApplyTemplate behavior, which could be on the C++ side, so we might just hardcode it for now (or at least have it configurable, but not read from the model)
  • Custom example that builds on top of the executor to export all object pairs to SFT dataset
  • And optionally, a tiny bit of maintenance: better names for GrammarRespectDefaultValueMode and TrainingAnnotation (GrammarIgnoreMember?), some more code commenting, etc

Will hopefully start picking up these tasks tonight. I’ve also started experimenting with a vision encoder which can encode visibility (for example, first person or top down view) into a multidimensional int array, and have checked if my grammar generator code can handle it, and it works! I didn’t intend to make the characters actually “see” or have any spatial awareness, but at least now I know that it’s possible. This could allow characters to actually explore larger, unknown worlds.

February 9

Helper functions, input/output object base class and object pairs are in, but I might get rid of the whole uuid4 idea. Even though it might be able to help with generalization (treating examples as unique, even if their contents look similar), I think it could also cause unwanted behavior due to similar looking uuids (seeing patterns that aren’t there). I will keep it in mind for future experimentation, but won’t try it in this prototype.

Tomorrow I’ll revert some of that complex logic, complete the inference code (“dirty” object pairs, EOS) and try and create an example that can export the object pairs to a SFT dataset.

If all goes well, it should be ready for a first test in Unity soon.

Edit
Removed uuid4 code, EOS option is in, started working on object pair inference

February 10

Currently, my ObjectBasedExecutor is built on top of the existing InteractiveExecutor that ships with LLamaSharp. However, to have full control over (correct) prompt template tokenization and managing the context in an object oriented way, it would be better to build a custom executor based on the StatefulExecutorBase instead.

The prototype can definitely be built with the current approach, but it might not work as well as it should, so tonight I’ll decide whether I’ll build a better executor or not.

For example, this is what the “mistral” template looks like on the LLaMA-Factory (training) side:

Notice the space between [INST] and {{content}} but not between {{content}} and [/INST]. I’m not sure why their template is set up this way. (a mistake, maybe?)

I’ve been thinking about creating a custom template on the training side, because I don’t like the spaces between the [INST] tags that Mistral uses (obligatory xkcd). It would also make sense because we are not really using instructions (and will not be training on top of an instruct model either!), but input and output objects instead. We could also remove the tool calling template data, since we won’t be using it.

In either case (no matter if we’re using the existing mistral template or not), we need a couple of special tokens. In the case of the default mistral template, the special tokens we need are:

  • <s> (BOS)
  • </s> (EOS)

According to this page, both [INST] and [/INST] should be inserted as strings, which would explain why a difference in spaces could mess things up!

If we add a space too many, or forget one, we could end up with a difference in tokenization between training and inference.

Currently, the ObjectBasedExecutor simply adds the prompt template as raw text, and the above are not separated out as special tokens. However, we need the actual special tokens, not text that merely looks like them (and we can’t parse them out of the text automatically, because that would open the door to prompt injection).

This is where the new executor would come in. We could simply tokenize them as special tokens and use a PreprocessInputs function to apply the prompt template using the special tokens, and leave the prompt template out of any grammar generator related code.
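
Roughly the idea, as a sketch (the tokenize delegate stands in for whatever tokenizer call the new executor ends up using, with a flag that controls special-token parsing; this is not a specific LLamaSharp signature):

using System;
using System.Collections.Generic;

public class PromptBuilder
{
    // Stand-in for the real tokenizer: (text, parseSpecialTokens) -> token ids.
    private readonly Func<string, bool, IReadOnlyList<int>> tokenize;

    public PromptBuilder(Func<string, bool, IReadOnlyList<int>> tokenize) => this.tokenize = tokenize;

    // Template markers are tokenized with special-token parsing enabled, the
    // serialized object content with it disabled, so a literal "[INST]" inside
    // the content can never become a special token (no prompt injection).
    public List<int> Build(string serializedInputObject)
    {
        var tokens = new List<int>();
        tokens.AddRange(tokenize("<s>", true));      // BOS
        tokens.AddRange(tokenize("[INST] ", true));   // instruction open
        tokens.AddRange(tokenize(serializedInputObject, false));
        tokens.AddRange(tokenize("[/INST]", true));   // instruction close
        return tokens;
    }
}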

Edit
There’s actually some conflicting information on the [INST] and [/INST] tokens. Mistral’s descriptions on huggingface say they are regular strings, while their official docs say they are special tokens to avoid prompt injection. Will have to dig into this.

Edit 2
After some digging, we do actually need to tokenize [INST] and [/INST] as special, because:

  • If we don’t, then there is no way they will correctly tokenize to a single token
  • They are listed in the tokenizer configuration (as special!) over here
  • It will avoid prompt injection

Edit 3
Special tokens for [INST] and [/INST] were introduced to the vocabulary with the version 0.3 release of the Mistral 7B models, mystery solved!

February 11

I’ve started working on the custom executor, and a kv cache / context visualization tool.

February 14

The past few days I’ve been preparing, disassembling the existing executors and trying to learn how they work on the lowest level. I think I understand everything clearly now, so this weekend I will:

  • Write down my understanding of the flow of the interactive executor
  • Write down a proposal for the object based executor, including support for few-shot, handling running out of context, and persistent objects (in my imagination, during inference, this looks like a game of tetris)
  • Write down ideas for helper functions we need, like removing objects from context (which we’ll re-use for running out of context)

February 20

Since the training server with the Tesla P100 is complete, and we’re going to be using it in this part of the blog (after the executor), I’ve started preparing for the next part of the blog.

I’ve experimented with path tracing + inference on a single GPU last summer, but it required quantizing the model to 2-bit and using a small (2048 tokens) context. I don’t want to get stuck on optimization during prototyping, so I’ve decided to use a dual GPU setup for the prototype.

This little RTX 3050 arrived yesterday. It has the same number of CUDA cores (2560) and the same amount of VRAM (8 GB) as the Tesla P4, but it has GDDR6 instead of GDDR5 and has tensor cores, which we need.

The RTX 3080 will be used for path tracing, and the RTX 3050 will be used for inference.

February 23

I’ve collaborated with one of the maintainers of LLamaSharp to help introduce grammar optimization modes. Since this game (due to object based inference) will be completely built on grammars, this will bring some big performance increases.

February 25

About halfway through documenting the existing executors, going at a steady pace. I’m analyzing them line by line, writing down the flow in my own words in a notepad so I can turn this into a flowchart later. I’ve come across a bunch of things that can be improved.

Edit
All done.

February 26

Starting work on the object based executor. The idea is to first write down a proposal in my own language and then start coding. More soon.

March 13

Not a lot of updates when it comes to the executor. I think I’ve got everything written down now, so I can start building it in steps.

I’ve verified that the current code is IL2CPP compatible, which was kind of a concern, and I’ve been solving some rendering math on the side.

March 28

Took a little break.

First of all, the object based executor. I wrote down a proposal in which I tried to cover all possible use cases, and the end result was future proof, but too complicated for the first version. The proposal itself is still useful, since these things need to be thought out in detail anyways.

So, back to the drawing board. What exactly do we need? Let’s apply MoSCoW, a technique I learnt in college.

Must

  • Inference for input and output object
  • Proper tokenization (with special tokens that match the training data)
  • Debug text view of the context/kv cache
  • Debug token view of the context/kv cache
  • Ability to export all input/output objects to JSON dataset

Should

  • Object based out of context handling
  • Marking object as persistent

Could

  • Inference/decode for input only, in case we want to append information to the kv cache but don’t need a response
  • Object softdeletes, which deletes objects from the kv cache, but keeps them in memory for dataset export
  • Sharded dataset export. I’m not sure how LLaMA-Factory handles SFT entries that have a history longer than the cutoff_len; I know it does shard PT data. Needs investigating.

Must is simply what has to be done, so let’s focus on that, and put the rest on the backlog.

As for the rendering math I’ve been working on, it also has to be done, and I think I can finish it tonight:

  • Add specular/diffuse sampling method base classes (base class has viewWS param as Vector3.zero)
  • Add cosine hemisphere diffuse
  • Add surface base class
  • Enable/disable resampler
  • Add new weightOverPdf math for HDRP compatibility with bounded VNDF GGX sampling, but I can postpone this until the day I do the actual rendering implementation

March 31

The rendering math is 90% done. I haven’t completed the new weightOverPdf because it’s kind of hard to do on the C# side (and we don’t even need it there); it’s easier to verify once everything is ported to the HLSL shader side, so it will be done later.

Let’s prototype characters that think and feel (pt.1 : Interaction Engine)

October 14

Since this is my own website, and it’s not really meant as a professional blog, let’s try something different. Usually I like to finish whatever I’m working on first, then do a write-up afterwards and make everything look good. For this post, I want to try a new format, and write as I go, like a live blog.

Let’s try and create videogame characters that think, feel, move and speak.

To prepare for this, I’ve spent the past month learning new things:

  • C++, and building a C++ library from source using CMake
  • Writing C# wrappers for C++ libraries
  • Working with the ONNX Runtime, including CUDA and TensorRT
  • Working with Animators, including bools, triggers, transitions and root motion
  • Working with Navmeshes, including runtime baking and dynamic obstacles
  • Working with ScriptableObjects, including saving/loading from disk

I will be using a technique called greyboxing, which I learnt in college. This means that this prototype won’t have fancy artwork, it will look as simple as possible.

See you soon.

October 15

Prepared the basics today, a simple character that can walk around and pick things up. Just because this prototype is going to use simple artwork, doesn’t mean it won’t be path traced in real time!

I’m recording these with a 35mm f/1.4 lens in a 21:9 aspect ratio, at 23.976 fps with a 180 degree shutter and 3 point lighting. I’ve also coded a little ffmpeg tool to automatically turn these into high quality gifs.

October 16

I added basic character stats and some debug GUI today. The next part is going to take a bit longer, so I don’t know when the next update will be. I spent the first part of 2024 creating my own C# console game engine to prepare for this moment, and now it needs to be ported. It includes a full interaction engine with policies.

Once the interaction engine has been ported, it needs to be hooked up to the characters above, and we’re going to need some kind of turn based gameplay controller to manually control the characters for now.

October 19

Finished porting my interaction engine and managed to do some improvements in the process. I also added policies for navigation (calculating paths) and visibility (tracing rays) so the LLM will not attempt to interact with items it can not reach. I also finished up the turn based gameplay controller and everything is working as it should, will add some footage later.

I also helped the LLamaSharp team with their 0.18 release, since we’re going to be using that later in this prototype.

Next up is the sensory engine, which will prepare the characters to see, hear, smell and feel.

October 21

I added support for walking to interactables today. It works by tracing a ray to the interactable, subtracting the interactable diameter and setting the result as a pathing target, while checking whether it’s reachable or not. I’ve also added sittables, which extend from the interactables and have both a sit and a stand target, instead of dynamically generating a destination. It’s also possible to sit down on the sittables and stand up again.

No code progress on the sensory engine, because I am still thinking of a way to store the actual data, since we’re going to need to be able to transform the data for both the LLM and the GUI. Simply storing interaction-sensor data isn’t enough either, since the character should be able to perceive (for example) changes in weather, and other world info.
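
The walk-to targeting described above, roughly, as a sketch using Unity’s NavMesh API (names are illustrative, not the actual interaction engine code):

using UnityEngine;
using UnityEngine.AI;

public static class WalkToInteractable
{
    // Stop short of the interactable by its diameter and check whether the
    // resulting destination is actually reachable on the navmesh.
    public static bool TryGetDestination(Vector3 characterPos, Vector3 interactablePos,
        float interactableDiameter, out Vector3 destination)
    {
        Vector3 direction = (interactablePos - characterPos).normalized;
        destination = interactablePos - direction * interactableDiameter;

        var path = new NavMeshPath();
        return NavMesh.CalculatePath(characterPos, destination, NavMesh.AllAreas, path)
               && path.status == NavMeshPathStatus.PathComplete;
    }
}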

October 22

Added world space interaction debug interface and have started designing the basics for the perceptual/sensory buffer.

October 23

I have finished designing the perceptual buffer and have the first version up and running.

October 24

Today I’ve been working on perceptual policies, which will decide what the character can see/hear/smell/feel using tools like ray tracing and human eye fov calculations. I also implemented UUID4 for interactables and the perceptual buffer today. There is actually a good reason behind this, something I came up with, I will explain later.

October 25

I finished coding the perceptual policies today, next up are continuous interactions. Currently, when an interaction gets executed (lighting a campfire, or sitting down on a chair), it gets fired once, and whether a character perceives this (and thus registers the interaction in their perceptual buffer) depends on the registered perceptual policies for that interaction. However, if a character sits down on a chair or starts dancing (which shouldn’t just fire once, but should be some kind of continuous state), and we then walk around a wall and see the character, we should still perceive their continuous interaction (and it should end up in our perceptual buffer).

October 26

Started working on the continuous interactions today. I’ve also done a lot of refactoring to all the existing code. I also wasn’t happy with the single ray vision policy, so I added a cone tracing option. Here’s an example with 8 rays @ 8 degree spread angle.
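
The cone trace, roughly (an illustrative sketch, not the actual vision policy code):

using UnityEngine;

public static class ConeVisionPolicy
{
    // Fan rayCount rays around the direct line to the target, each tilted by
    // spreadAngle degrees, and report a hit if any reaches the target unobstructed.
    public static bool CanSee(Vector3 eyePos, Transform target,
        int rayCount = 8, float spreadAngle = 8f, float maxDistance = 30f)
    {
        Vector3 toTarget = (target.position - eyePos).normalized;

        // Any axis perpendicular to the view direction works as the tilt axis.
        Vector3 tiltAxis = Vector3.Cross(toTarget, Vector3.up);
        if (tiltAxis.sqrMagnitude < 1e-6f)
            tiltAxis = Vector3.right;

        for (int i = 0; i < rayCount; i++)
        {
            // Spin the tilted direction around the view axis to form the cone.
            float roll = (360f / rayCount) * i;
            Vector3 dir = Quaternion.AngleAxis(roll, toTarget)
                        * Quaternion.AngleAxis(spreadAngle, tiltAxis.normalized)
                        * toTarget;

            if (Physics.Raycast(eyePos, dir, out RaycastHit hit, maxDistance)
                && hit.transform == target)
                return true;
        }

        return false;
    }
}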

October 28

Continuous interactions are more complex than expected. Let’s take speaking for example. Inference is done through tokens (text-based), and we can’t just mark the interaction as completed when we are done generating tokens, we need to actually keep track of the playback of the second layer of inference (text -> audio). We don’t want the second character (Alice) to reply while the first character (Bob) is still speaking. We also don’t want inference/speech to pile up (things they’ve been wanting to say for the past 10 minutes, but were still speaking the previous sentences), and we don’t want the second character to endlessly wait, if the first character just won’t stop talking (because the first character is noticing the second character keeps waiting for them to finish).

Not sure how long the design of the improved interactions will take. For now, this is my updated short-term todo:

  • Continuous interactions <- you are here
  • Grabbable interactables
  • Environment/health percepts
  • Interaction response percepts
  • TTS in talk interaction
  • Interaction save/playback
  • Object based inference C# side (encoder + decoder)
  • Object based inference LLaMA side (training)
  • Character stats + basic challenge
  • First test (youtube video?)

And also a few nice to haves, but not required:

  • Queued interactions
  • Equipables

October 29

First thing we’re going to need to do is extend the TTS implementation. I added the required functionality for an IsSpeaking flag, and OnSpeechStarted and OnSpeechCompleted events this morning. I’ve also added events for navigation (OnDestinationReached).

October 30

Finished coding the continuous interactions this morning. I’ve decided to skip the grabbables for now, they are not required for a first prototype, let’s not over-engineer this. If we need to transport items, we will use a backpack, which is easier to code. I’ve also improved the Interaction GUI (and thus the data that will be sent to the LLM) to respect the AlwaysVisible property on the Interactions. This means we can now see some Interactions in the world even though they can not be executed, with the reason mentioned in red.

October 31

Forgot something important. Obviously we don’t want the characters to be aware of everything in the world. If we want to sit down on a chair, we first need to know if that chair exists, so we first need to discover it. Finished up the discovery code this morning, so we’re good to go!

Interactable discovery is handled every frame, and stays in memory, so when a character turns around, facing their back to an interactable, or hides behind a wall, they will not forget about that interactable. We also automatically discover interactables if we perceive interactions through evaluated perception. For example, if Bob talks to Alice behind her back, and Alice has never seen Bob but can hear him, she will now learn of Bob’s existence. Likewise, if Alice can hear Bob sitting down on a chair but has not yet seen the chair, she will learn about the chair’s existence. Obviously, this only works if her sensors can successfully register this.

Next up are environment & health percepts, which means being able to sense (for example) changes in temperature, and sensing pain from (for example) freezing temperatures.

November 1

Created the basics for time of day & temperature today, no percepts on the character side yet.

November 2

Characters now take damage due to cold, and health percepts are in, which work according to an accumulated damage & time threshold to avoid spamming the perceptual buffer. I’ve also added interaction response percepts. For example, if we’re gathering wood, we’d want to notify the character how much wood they have gathered, or if their axe breaks.
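
A minimal sketch of that thresholding (illustrative, not the real percept code):

public class HealthPerceptFilter
{
    public float DamageThreshold = 5f;
    public float TimeThreshold = 10f;

    private float accumulatedDamage;
    private float timeSinceLastPercept;

    // Called whenever the character takes damage; returns true when a health
    // percept should be pushed into the perceptual buffer.
    public bool OnDamage(float damage, float deltaTime)
    {
        accumulatedDamage += damage;
        timeSinceLastPercept += deltaTime;

        if (accumulatedDamage < DamageThreshold || timeSinceLastPercept < TimeThreshold)
            return false;

        accumulatedDamage = 0f;
        timeSinceLastPercept = 0f;
        return true;
    }
}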

Ideally the health/damage thresholds would be triggered by the LLM (in batches of processed data) instead of a certain amount of damage over time, will look into this later.

November 3

I’ve been working on the TTS implementation to prepare for the next item on the todo, however, things get pretty complicated when we want to have two different speakers at the same time, in an efficient way.

November 6

Here’s the deal, I’ll dump my raw thoughts here so you know what’s going on.

We can’t allow feature-creep and over-engineering during prototyping, but what we do build, should be done properly, so we don’t end up throwing everything away and get into tech debt later. It’s very easy to build things fast to show-off, but not everything translates well into a final product.

The interaction engine is done, and the code is clean, I’m satisfied. However, for TTS, things get complicated. As a game developer, you also need to have some understanding and be invested in the legal side of things. I have a separate drive with all my licenses and invoices related to this project.

I’m using Piper (C++) and a wrapper (C#) to use the C++ library in the C# engine. Piper uses a phonemizer to convert text to phonemes. Piper and piper-phonemize are both released under the MIT license which means we can use them, but piper-phonemize relies on an espeak-ng fork which is released under the GPL-3.0 license, which means we can’t use it.

The next version of piper will make use of epitran, which is MIT. The thing is, we don’t know how long this will take, and if it takes 1 or 2 years, I won’t be able to ship the game. I could take the gamble, keep doing what I do, and hope the piper version with epitran releases around the same time I want to release my game, but I don’t like gambling.

There are multiple discussions (1, 2) about the licensing for the espeak/piper repos, which has become a difficult subject because the original developer (Johnathan Duddington) disappeared off the face of the earth years ago, so the license can not be changed. However:

“espeak-ng code owners are fine with API use via dynamic linking for closed source program”

The thing is, I want to be sure, and I don’t like the idea of a GPL dependency somewhere down the hierarchy. I came up with a solution to create an open source voice plugin (MIT), and allow users to select their own version of the voice plugin. The game would ship with a version of the open source plugin with some basic mumble sounds, and people would be able to download, compile, or even fork and build their own plugin with piper support, if they wanted to.

Another option is to use a piper fork that does not use espeak. However, their API works completely differently. I could take their whole project apart and build a compatible API and wrapper for it, but I’m very new to C++, so this will take a long time. On top of that, they mention that the “phone conversion is not nearly perfect but it should work for simple applications”, so it could be that it’s so bad that it’s not even usable, and I won’t find out until I’m done, which sucks.

Moving on, the voice library I wanted to use is libritts_r, which can be found on huggingface over here. The voices are released under an MIT license, but diving into a licensing discussion over here reveals that they were finetuned on a lessac dataset (blizzard license), which means they can not be used commercially, so we need to swap to the regular libritts (CC BY 4.0 license).

This is a lot of work, and this doesn’t even have anything to do with the technical challenges either. The current C# wrapper I’m using is based on an open source wrapper, and I have added functionality for:

  • CUDA (C# and C++ side)
  • Multi-speaker model loading (C# and C++ side)
  • Async

However, to offer IL2CPP compatibility, we can not marshal delegates that point to instance methods into native code, which the wrapper does by default. I’ve made the changes to convert these to static methods; however, the callback writes to a PCM buffer, and a static method can not write to an instanced PCM buffer, which means the PCM buffer also needs to be static, which means we can not use separate instanced PCM buffers for different characters in the game.

One option is to convert the PCM buffer to some kind of dictionary format, generate unique ids for speakers, use them as indexes for the PCM buffers, pass them to the C++ side when generating speech and back to the callback on the C# side to store them in the correct location.
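
What that could look like, as a sketch (the names and delegate signature are hypothetical stand-ins; the [AOT.MonoPInvokeCallback] attribute is Unity’s requirement for reverse P/Invoke callbacks under IL2CPP):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class SpeechOutput
{
    private static readonly object Lock = new object();
    private static readonly Dictionary<int, List<short>> PcmBuffers = new Dictionary<int, List<short>>();

    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate void AudioCallback(int speakerId, IntPtr samples, int sampleCount);

    // IL2CPP requires the marshaled callback target to be a static method.
    // The speaker id is handed to the C++ side when starting synthesis and
    // comes back here, so the chunk lands in the right buffer.
    [AOT.MonoPInvokeCallback(typeof(AudioCallback))]
    public static void OnAudio(int speakerId, IntPtr samples, int sampleCount)
    {
        var chunk = new short[sampleCount];
        Marshal.Copy(samples, chunk, 0, sampleCount);

        lock (Lock)
        {
            if (!PcmBuffers.TryGetValue(speakerId, out var buffer))
                PcmBuffers[speakerId] = buffer = new List<short>();
            buffer.AddRange(chunk);
        }
    }
}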

For now, I have created a separate TTS branch, and started completely from scratch, not sure which direction I’ll choose.

November 7

Alright, I’ve made my decision. I will be taking apart the piper without espeak project, and:

  • Clean up their code
  • Port my changes
  • Add compatibility for upstream API
  • Write C# wrapper from scratch

This way, we can get the best of both worlds. We can go full MIT and have everything properly set up on the licensing side, while also offering future compatibility with upstream, in case of an early epitran release.

Their code is a bit of a mess (which they also warned about in their readme), so it’s going to need some restructuring and cleaning up, but that’s fine. I’ve always looked up to C++ like it’s some kind of elite programming language that I wanted to learn some day, so I see this as a good opportunity.

See you soon.

November 10

IPA loading is in on both the C++ and C# side.

First version is up and running in engine, we have audio output! Lots of cleaning up to do, but I’m very happy.

November 12

I’ve cleaned up a lot of the code so far, nothing much to write about, going at a steady pace.

November 13

Callbacks and PCM buffers now use IL2CPP compatible instances, which is the first step to supporting multiple voices at the same time.

November 15

Currently implementing SafeHandles, after that, I’ll do a first test with multiple speakers.

November 20

SafeHandle implementation is done. Currently working on some stuff behind the scenes to prepare for the next step, will probably create a part 2 of this post where we actually move to the LLM side.

November 25

Quick update before I head to bed. I’ve built a little low power “llama box” last week. It’s a server with an i7-6700k, 16 GB DDR4 and a Tesla P4, overclocked to 1531 MHz and modified to use the CUDA sysmem fallback policy. It can finetune a 4096 context length, rank 32, 4-bit QLoRA for a 7b LLM in a single night, and allows me to train and game dev at the same time.

I’ve also been researching evaluation datasets, and prepared a tool to visualize my training results. I’ve been running experiments for the past week to make sure I understand exactly what I’m doing and have the best possible training setup ready for part 2 of this blog. I’ve been training models with DoRA, rsLoRA, Unsloth, Liger Kernel, all the way from rank 8 to rank 128, with dropouts from 0 to 0.2, warmups from 0 to 0.2, learning rates from 0.00001 to 0.00005, different schedulers and layer targets, to compare every combination in terms of training and eval loss.

As for the Piper implementation, I’m debating between finishing it up as-is, or including support for multiple backends (ROCm, TensorRT, etc), settings and inference stats. It’s nice to be able to support AMD cards, and have settings/stats for debugging, but it’s not a requirement for now.

November 26

Decided to add the multiple backends with their settings anyways. I’m skipping the inference stats, we don’t need them.

December 4

Was suffering from sinusitis for the past 1-2 weeks, never had this before, but I’m starting to feel better now. Where were we again?

Currently, the C++ side of the TTS library uses a synthesis config (noise parameters, etc) which is stored in the Voice. After playing around with some code changes, I noticed we can actually change speaker without re-loading the voice. Ideally, I would like to rename the Voice to a Model internally (to avoid confusion), split the synthesis config into a model config and a speaker/synthesis config, and maintain these on the C# side. This way we can load the model with the model config (persistent settings) and execute inference with our speaker/synthesis config (temporary settings). If we do this, we can have two separate characters, speaking with different voices, using the same model loaded in memory.
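
The shape of that split, as I picture it on the C# side (illustrative names; the noise/length values are Piper’s usual synthesis parameters):

// Persistent, loaded once per model.
public class ModelConfig
{
    public string ModelPath;
    public int SampleRate = 22050;
    public bool UseCuda = true;
}

// Temporary, passed per synthesis call, so two characters can speak with
// different voices through the same loaded model.
public class SpeakerSynthesisConfig
{
    public int SpeakerId;
    public float NoiseScale = 0.667f;
    public float LengthScale = 1.0f;
    public float NoiseW = 0.8f;
}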

I did a quick test with this when adding the backend selection, but noticed it was crashing on leaving play mode. I’ll re-do the backends and find a way to make this run without issues.

December 6

Re-did the backends, full async multi-speaker with different voices sharing the same loaded model. I think there’s a small issue with streaming text left (we might need this to improve LLM response times), I’ll have a look tomorrow.

December 11

Not much to write about currently. I upgraded the llamabox from 16GB to 32GB RAM (and upgraded the development PC from 32GB to 64GB RAM), added an extra 256GB SSD and upgraded its connection from 200mbit to gigabit; it’s currently training another test model.

As for the TTS implementation, I need to experiment with streaming text and cherry pick the event (speech start / speech complete) commits from the dev branch because of a bad merge, and then it should be ready to go into the main project.

December 18

Streaming text now works. It was a little difficult because doing batches of words causes unwanted pauses here and there, and is really dependent on text generation speed, so it now speaks sentence by sentence. I’ve also cherry picked the event code from the dev branch, all ready to go into the main project.
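
The sentence splitting itself is simple. Roughly, as a sketch (not the exact implementation):

using System;
using System.Text;

public class SentenceStreamer
{
    private readonly StringBuilder buffer = new StringBuilder();
    private static readonly char[] SentenceEnds = { '.', '!', '?' };

    // Fired as soon as a complete sentence is available for TTS.
    public event Action<string> OnSentenceReady;

    // Called for every streamed LLM token/word.
    public void Append(string text)
    {
        buffer.Append(text);

        int end;
        while ((end = buffer.ToString().IndexOfAny(SentenceEnds)) >= 0)
        {
            string sentence = buffer.ToString(0, end + 1).Trim();
            buffer.Remove(0, end + 1);
            if (sentence.Length > 0)
                OnSentenceReady?.Invoke(sentence);
        }
    }
}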

December 19

I’m not satisfied with speaking sentence by sentence, as we need to wait until we have a full sentence before the character starts speaking, and I’ve come up with an idea to improve this. Will give this a shot tonight. It’s a tricky subject because it requires a few locks in the right places to make this thread safe from both the C# / C++ side, and we can’t allow inference for every word because it will spam and overload the inference engine. More soon.

January 7

It’s been a while. I’m currently waiting for NVIDIA’s CES 2025 Keynote to go live here, so I’ll take this moment to update the blog.

I took a break from TTS and have been working on object based inference which will be used in the next part of this blog. I tackled a complex issue that had been on the backlog since the summer. What’s currently left for object based inference is:

  • Writing rules to a temporary list and then rendering the full grammar, instead of writing rules directly to the grammar (which breaks during recursion)
  • Namespacing grammar rules. I have some code for this lying around, but there was some edge case that needed work, something with object and member name conflicts, or two of the same objects in the same namespace being members of different objects; I can’t remember exactly what it was, but I’ll run into it as soon as I start working on it again
  • Custom attributes for completion/fill/selection modes per member instead of globally defined for the entire pipeline (not required, but more flexible)
  • Think about prompt format (or we just stick to an existing, less efficient format for now, like Alpaca or Mistral), I’ve done research on custom prompt formats, I should be able to modify this file to fit my needs, we’ll see
  • Come up with a way to make chaining objects work nicely (currently requires me to hijack my own code and inject it in the middle)

The TTS thing is not a complexity issue, I’ve simply gotten a bit burnt out from TTS after working on the same thing for 2 months. I’ve written down the fixes that need to be implemented for streaming text to work and I don’t expect any issues there. Basically:

  • Clearing PCM buffers should happen in the inference engine callback, not when starting the inference; this fixes the PCM buffers being empty during inference, which was causing silence (they automatically get filled with zeros)
  • PCM buffer pointers are only used for reading chunks, it’s unrelated to inference and should not be altered for streaming text
  • Make sure locks are in the right place so OnAudioRead (audio driver/engine) never tries to read an empty PCM buffer (at the moment when it gets cleared by the inference callback but is not yet filled with new data), so a lock around the buffer reading part and a lock around the clear + refill part (see the sketch after this list)
  • Get rid of the ConcurrentQueue, it was a QoL thing but we actually need the locks because specific logic needs to be inside the locks, see above
  • Come up with logic for batching, we either send every word to the inference engine as a batch, or all words that are available in a single batch, or a fixed batch size, but we need to make sure we correctly handle sentences that are smaller than the minimum batch count, end of sentence, etc
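
To make the lock placement concrete, a sketch of the two critical sections (OnAudioRead is the audio engine’s read callback; names and buffer handling are illustrative):

public class StreamingPcmPlayback
{
    private readonly object pcmLock = new object();
    private float[] pcmBuffer = new float[0];
    private int readPosition;

    // Called by the audio engine on its own thread.
    public void OnAudioRead(float[] data)
    {
        lock (pcmLock)
        {
            for (int i = 0; i < data.Length; i++)
                data[i] = readPosition < pcmBuffer.Length ? pcmBuffer[readPosition++] : 0f;
        }
    }

    // Called by the TTS inference callback when a new chunk is ready.
    public void OnInferenceChunk(float[] samples)
    {
        lock (pcmLock)
        {
            // Clear + refill happens inside the same lock, so the audio thread
            // can never read a buffer that has been cleared but not yet refilled.
            pcmBuffer = samples;
            readPosition = 0;
        }
    }
}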

January 14

Some exciting stuff (if everything goes well) happening behind the scenes, which I will write about in part 2 of this blog post. I’m also helping out at LLamaSharp to get the next release ready (which I will also be using).

January 17

Scored a Tesla P100.

It has twice as much VRAM, 3.8x the memory bandwidth and 235x the FP16 performance of the Tesla P4.

Introduction

I’ve been thinking about starting this blog for a while now. However, I’ve been stalling because I’m not sure which direction I want to take here. Do I want to blog for an audience, or just as a logbook for myself? Do I include explanations to make things easier to read? Do I write a big introduction post with all history up to this point, or just add pieces where necessary? I guess I want to please every possible audience, but is that even realistic? Do I stall on these questions for another week, or just start right now?

Let me take you on a journey where I experiment with bleeding edge technology in videogames. The screenshots in this post have been taken in-engine, using the graphics code I have been working on for the past 3 years.

My name is Ramon Stefano. I’ve been modding and creating games since I was a kid. I started with drawing on paper, switched to 3d modeling and texturing when I was 12 years old, and then started coding. I’ve studied game design & application development in college, worked full time in racing games & API development, and have been working on my passion project, a racing simulator, for many years.

I like film/cinematography and want to create immersive, magical experiences in videogames that make you forget about the real world.

To do this, I want to bring photorealistic graphics to videogames using real time path tracing. This road from the very first videogames to photorealism, with graphics and graphics cards improving every year, is a journey we’ll only get to experience once, and I want to be a part of it.

I like to experiment with bleeding edge technology. The kind of things that are still in active research, things that are not ready, not stable, and won’t appear in mainstream videogames for the next few years.

This summer, I’ve been preparing the technology for a little side project called Tiny Adventurers, a game about clay characters coming to life and having to face challenges to survive.

Here’s what I have in mind:

  • Path traced graphics
  • Ray traced audio
  • Speech to text
  • Text to speech
  • LLM trained on personality, sensory and interaction data

To start off, to achieve path traced graphics, we’re going to need a somewhat modern graphics card.

I will be doing all the graphics programming on this RTX 3080.

The RTX 3080 has the necessary RT and CUDA cores on board which we need to do path tracing and LLM inference in real time.

We’re also going to need a VR headset. I’ve picked the Quest 1 over the Quest 2 because it’s supposed to have better colors. I prefer realistic colors over resolution.

And of course, papers, lectures and books. I got myself this nice hardcover of Ray Tracing Gems II which includes a lot of key ingredients towards building a real time path tracer.

I dropped everything maths related in school because I wanted to become a guitar teacher. Unfortunately, graphics is all about maths and statistics. Thanks to TU Wien & Utrecht University, I’ve been studying and watching lectures to get back up to speed.

I want the characters to have some kind of personality. The past few years I’ve been reading books, watching interviews and listening to podcasts to learn about personality disorders, but I don’t know a lot about personality types. So, I decided to pick up this book about personality types called “Surrounded By Idiots”, a Dutch book for a change.

If you’ve come this far, thanks for reading. The next post will most likely be a technical deep dive into either graphics or character behavior.