Overview and Objectives
This roadmap outlines a step-by-step plan to create a modular, persistent 3D AI testbed using the open-source Vircadia metaverse platform. The goal is to embed a GPT-based autonomous agent into a social VR environment, enabling it to communicate, embody an avatar, interact with objects symbolically, and explore autonomously over time. The project is structured into four incremental phases:
Phase 1 – Communication Relay: Establish a basic chat interface (via Discord and ChatGPT) to relay prompts and log conversations.
Phase 2 – Avatar Embodiment: Link the AI to a Vircadia avatar, allowing text/script-driven movement and speech within the virtual world.
Phase 3 – Environmental Interaction: Enable the agent to perceive and manipulate world objects, with an internal symbolic model of the space for reasoning.
Phase 4 – Autonomous Exploration: Incorporate memory, planning, and persistent state so the agent can set its own goals, navigate the world, and engage socially over long periods without resets.
Throughout these phases, we emphasize low-cost, open-source solutions (targeting ~$40/month in cloud resources) and accessibility (favoring social interaction and modular design over high-performance graphics). We use Python-friendly frameworks where possible, and design abstractions to later swap in different large language models (LLMs) beyond the initial GPT (e.g. Google’s Codey or other future models). Below, we detail each phase’s implementation, followed by the integration of a “Voyager-style” autonomous agent architecture (inspired by recent research) to empower the AI with memory, reasoning, planning, and social skills.
Phase 1: Communication Relay (Discord + ChatGPT)
Goal: Create a communication pipeline that lets users interact with the AI agent through text, using a Discord channel as the front-end and a GPT-based model as the back-end. All interactions will be logged for debugging and to provide the agent with conversation history. Key Components and Tools:
Discord Bot: A bot account in a Discord server will serve as the agent’s chat interface. Using a library like Discord.py (Python) or Discord.js (Node), the bot listens for messages and posts responses.
ChatGPT Integration: Connect the bot to an LLM. The simplest approach is calling the OpenAI API (e.g. GPT-3.5 or GPT-4) to generate responses (github.com). This provides reliable access to the model and allows logging of prompts and responses. Alternatively, for a no-cost solution, one could automate the ChatGPT web interface via a headless browser (e.g. using Puppeteer or Selenium to relay Discord messages into the ChatGPT UI), though the API method is more robust and within a modest budget.
Logging Mechanism: Every prompt and response should be logged (e.g. appended to a text file or database). This log will serve as a conversation memory and debugging record. Storing timestamps and user IDs with each message is useful for later analysis or feeding into the agent’s long-term memory.
Implementation Steps:
Discord Bot Setup: Create a new Discord application and bot account through the Discord Developer Portal. Invite the bot to your server with appropriate permissions (at minimum, permission to read and send messages) (github.com).
Bot Code: Develop a Python script using discord.py (or Node with discord.js) that connects to Discord with the bot token. Implement an event handler for new messages in a designated channel (or directed at the bot). When a message is received, strip out the bot’s own messages to avoid loops, and forward the user’s prompt to the AI.
LLM API Call: Use the OpenAI API (or equivalent) to send the user’s message and context to the language model. For example, call the chat/completions endpoint with a prompt composed of the conversation history (from the log) plus the new query; the model’s reply is received as text (github.com). If using an open-source model locally (to save cost), ensure the model is running as a service the bot can query (for example, a local Flask server wrapping GPT4All or a smaller transformer).
Relay Response: Have the bot post the AI’s response back to the Discord channel, pinging the user if necessary. Basic formatting can be applied for clarity (for instance, the bot could use code blocks or quotes if the response is multi-line or contains special formatting).
Logging: Each interaction (user message and AI reply) is appended to a log. A simple approach is writing JSON lines to a file (including user, message, and a role flag “user” or “assistant”). This log not only aids debugging but can feed into the agent’s memory later. For example, if the agent needs to remember a user’s question from earlier, the system can search this log.
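Pulling the steps above together, here is a minimal sketch of the relay using discord.py and the OpenAI Python SDK. It omits conversation history for brevity; the token variables, channel handling, model name, and log path are placeholders, and the OpenAI client interface shown follows the 1.x SDK and may differ slightly by version.

# Minimal Phase 1 relay sketch (assumes discord.py and the openai 1.x package;
# DISCORD_TOKEN, OPENAI_API_KEY, the model name, and the log path are placeholders).
import json, os, time
import discord
from openai import OpenAI

intents = discord.Intents.default()
intents.message_content = True          # enable the Message Content intent in the Developer Portal too
client = discord.Client(intents=intents)
llm = OpenAI()                          # reads OPENAI_API_KEY from the environment

def log_turn(role, author, text, path="chat_log.jsonl"):
    # Append one JSON line per message: timestamp, role flag, author, and text.
    with open(path, "a") as f:
        f.write(json.dumps({"time": time.time(), "role": role,
                            "author": author, "text": text}) + "\n")

@client.event
async def on_message(message):
    if message.author == client.user:   # ignore our own messages to avoid loops
        return
    log_turn("user", str(message.author), message.content)
    # Note: this call blocks the event loop; fine for a sketch, use a thread/executor in practice.
    completion = llm.chat.completions.create(
        model="gpt-3.5-turbo",          # or any other chat model
        messages=[{"role": "system", "content": "You are a helpful virtual world guide."},
                  {"role": "user", "content": message.content}])
    reply = completion.choices[0].message.content
    log_turn("assistant", "agent", reply)
    await message.channel.send(reply)

client.run(os.environ["DISCORD_TOKEN"])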
Considerations: This phase creates a basic “chat relay” where the AI can communicate but has no embodiment yet. Care should be taken to handle prompt formatting (to provide some persona or context to ChatGPT, e.g. “You are a helpful virtual world guide.”) and to implement basic error handling (if the API fails or times out, the bot should reply with a default error message rather than silence). Rate limits for both Discord and the LLM API should be respected (e.g. throttling requests if users spam the bot). The cost in this phase is minimal: if using OpenAI’s API, GPT-3.5 is extremely affordable (fractions of a cent per message), so $40/month is ample for moderate usage. If using the ChatGPT web interface via browser automation, ensure the solution is stable and the bot’s account has access (and remember this might violate some terms of service; the API is preferable for a long-term project).
By the end of Phase 1, we have a text-based communication relay where users can talk to the AI agent through Discord and the agent (powered by GPT) can respond, with all interactions recorded. This serves as the foundation for giving instructions to the agent and reviewing its conversational abilities before moving into the 3D world.
Phase 2: Avatar Embodiment in Vircadia
Goal: Give the AI a physical presence in the 3D virtual world of Vircadia. In this phase, we link the chatbot agent to an avatar inside a Vircadia domain, so it can move around and “speak” within the VR environment. The agent will still use text-based intelligence (LLM responses), but now those responses manifest as an avatar’s actions or dialog in-world. Key Components and Tools:
Vircadia Server (Domain): Set up a Vircadia server (often called a “domain” or “metaverse server”) on a cloud instance. Vircadia is open-source and can run on modest hardware; for example, a 2 CPU cloud VM with 4GB RAM (~$20/month) can host a small social scene. This server will host the 3D environment and allow clients (users or our AI bot) to connect (haeberlen.cis.upenn.edu).
Avatar Client for AI: The AI needs a client connection to the Vircadia domain. There are two approaches:
Headless Avatar Client (Assignment Client): Vircadia supports assignment client scripts, which can run on the server to automate an avatar or perform tasks without a graphical interface (apidocs.vircadia.dev). We can configure an assignment-client in “agent” mode that logs into the domain as the AI’s avatar. This agent script can control an avatar’s movements and behaviors programmatically. The advantage is no full GUI client is needed – it runs as a service.
Standard Client with Script: Alternatively, run a normal Vircadia client on a machine (with or without rendering) and load a client-side script that listens for commands to control the avatar (this could be useful for testing locally). In either case, scripting is done via Vircadia’s built-in JavaScript API, which provides robust functions to manipulate the world and avatars (apidocs.vircadia.dev).
Avatar Control Script: A JavaScript program running in the Vircadia context will receive movement or speech commands from the AI and execute them. This script can use the MyAvatar interface to control the avatar’s position, orientation, and animation, and the Entities/Avatar API for interactions (apidocs.vircadia.dev). For example, to make the avatar move to a new location, the script could set MyAvatar.position to specific coordinates or apply a velocity in a direction. To have the avatar speak or chat, the script might use the audio system or send a chat message in-world (Vircadia supports both voice and text chat for avatars). If text chat is available in the domain, the script can simply broadcast the AI’s text as a chat message visible to others. For voice, a Text-to-Speech (TTS) engine can generate audio from the AI’s reply, and the Vircadia client can inject this audio into the avatar’s microphone input (making it appear as spoken voice).
Implementation Details:
Setting up Vircadia: Install and launch the Vircadia domain server on a cloud host. Configure a basic scene (for initial testing, this can be an empty terrain or a simple room). Create an avatar identity for the AI – either a named user account on the domain or use the default assignment-client user. Ensure the domain’s settings allow scripts to control avatars (by default, an avatar script can move its own avatar).
Avatar Scripting: Write a script (JavaScript) that can run in the Vircadia client context. This script will handle two main functions for the AI avatar: movement and speech. One design is to open a WebSocket or network connection from this script to our external AI controller (from Phase 1). For instance, the script can connect to ws://<AI_server>:<port> and listen for JSON messages like {"action": "move", "position": [x,y,z]} or {"action": "speak", "text": "Hello everyone"}. On receiving a command, the script uses the Vircadia API to act: e.g. set MyAvatar.position or trigger an animation, or route the text to the local chat or TTS engine. Vircadia’s API allows full control of the avatar’s transform and certain behaviors (apidocs.vircadia.dev). Example: to move the avatar forward, one might periodically update MyAvatar.position or use MyAvatar.goToLocation() if available. To ensure smooth movement, the script could interpolate motion over time (moving a bit each frame on Script.update callbacks) rather than teleporting instantly (unless teleporting is acceptable for the use-case). For rotation, setting MyAvatar.orientation towards points of interest (e.g. turning to face a user when talking) will make interactions more natural.
Linking AI Responses to Avatar: Modify the Phase 1 bot so that whenever the AI generates a reply, it not only sends it to Discord (if we maintain that interface) but also forwards it to the Vircadia avatar script via the WebSocket/API. For example, if the AI says “I will move to the table,” the system can parse that intention. A simple approach is to use explicit commands: e.g., the user issues a Discord command like !move 5 0 2, or we define a special action syntax the AI can include in its responses. A more autonomous approach (expanded in later phases) is to have the AI decide when to act. In Phase 2, we can manually trigger movement for testing (like a Discord command that directly instructs the avatar to move or say something); a small parsing sketch appears at the end of this phase. The key is that the pipeline exists to forward commands from the AI controller to the avatar script.
Speech and Chat: To integrate the AI’s text output as in-world speech, one option is using Vircadia’s Audio API. The script could call Audio.playSound() with a generated sound or use Agent.playAvatarSound() if running as an assignment client (apidocs.vircadia.dev). With text-to-speech, an external Python TTS library (e.g. Coqui TTS, or Amazon Polly/Azure TTS if within free tier limits) can generate a small audio file for each sentence, which is then streamed to the avatar. If TTS is not desired initially, the agent can use the local chat: the script might simulate a chat message by calling an appropriate API or perhaps by programmatically using the messaging system (in High Fidelity there was a Messages.sendMessage API to broadcast to all, which could be interpreted by a chat widget). For now, implementing TTS is achievable within budget using open-source models or limited API calls, and greatly improves immersion (others in VR will hear the agent).
Testing Embodiment: At this stage, we should see the AI’s avatar in the Vircadia world. To test, one can join the domain with another client and observe the bot. When a user types to the bot on Discord (or triggers an event), the bot’s avatar should respond in VR — for example, moving to a location or saying the response out loud. We maintain the Discord interface for development convenience (to send commands or ask the agent questions remotely), but ultimately users in-world will interact directly with the avatar.
Logging and Monitoring: Extend the logging from Phase 1 to also record physical actions. For instance, log any movement commands issued and any notable environmental events (like “Agent moved to (x,y,z)” or “Agent said: ‘Hello’ in world at time t”). This will aid in Phase 3 when building an internal model of the world.
Phase 2 Outcome: We now have an embodied agent: a virtual avatar controlled by the GPT-based AI. The agent can move within Vircadia’s 3D space and communicate with users (via text or voice) in that space. The system consists of the Vircadia server (hosting the world), the agent’s Vircadia client/script (for embodiment), and the external AI logic (from Phase 1) communicating via network calls. This sets the stage for giving the agent awareness of its surroundings and the ability to interact with objects next.
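To make the command pipeline concrete, here is a hedged sketch of the controller-side logic that turns a Discord command such as !move 5 0 2 or !say hello into the JSON actions described above. The function names and send_to_avatar() are placeholders standing in for whatever WebSocket bridge delivers messages to the Vircadia script (detailed in the Middleware section).

# Sketch: translate simple Discord commands into avatar actions.
# send_to_avatar() is a placeholder for the WebSocket bridge to the Vircadia script.
import json

def parse_command(text):
    """Turn '!move 5 0 2' or '!say hello there' into an action dict, else None."""
    parts = text.strip().split()
    if not parts:
        return None
    if parts[0] == "!move" and len(parts) == 4:
        x, y, z = (float(v) for v in parts[1:4])
        return {"action": "move", "position": [x, y, z]}
    if parts[0] == "!say":
        return {"action": "speak", "text": " ".join(parts[1:])}
    return None

def handle_discord_message(text, send_to_avatar):
    action = parse_command(text)
    if action is not None:
        send_to_avatar(json.dumps(action))   # forward to the avatar script
        return True                          # handled as a direct avatar command
    return False                             # otherwise fall through to the Phase 1 LLM relay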
Phase 3: Environmental Interaction and Symbolic World Modeling
Goal: Enable the AI agent to perceive objects and features in the virtual environment and to interact with them in a meaningful way. The agent will build a symbolic model of the world – essentially a structured representation (like a JSON schema or knowledge base) labeling key objects, locations, and their relationships. This world model allows the AI to reason abstractly (e.g. “the red cube is on the table in the kitchen”) rather than only numeric coordinates. With this information, the agent can plan and execute interactions: picking up objects, moving them, or using environmental cues in its decisions. Key Components and Tools:
Vircadia Entities API: Vircadia represents all scene objects (apart from avatars) as entities. We can query these via the Entities scripting interface, which provides functions to search for entities by name, type, or within areas (apidocs.vircadia.dev). For example, Entities.findEntities(center, radius) returns IDs of entities near a location, and Entities.getEntityProperties(id, properties) can retrieve details like an entity’s name, position, dimensions, etc. (apidocs.vircadia.dev). This API will allow the agent’s script to sense the environment – essentially getting a list of nearby objects or all objects in the domain at startup.
Object Tagging/Schema: To build a symbolic model, each significant object should have some metadata (name, type, or tags). During environment design, we can assign readable names to entities (e.g., name a model “Chair” or “Table”). The Vircadia editing tools or scripts can set the name or description field of entities. Additionally, Vircadia entities have a userData field (a JSON string) where custom info can be stored. We can use userData to embed a structured label or attributes (e.g. {"type": "furniture", "material": "wood"}). The agent’s script can read these fields to enrich its world model. If the world is large, we might also categorize areas by zones – for example, group entities by room. This can be done by naming zones or by spatial partition (e.g. coordinates or using special zone entities that encompass rooms).
Symbolic World Model: Design a JSON schema (or Python data structure) to represent the environment in symbolic terms. For instance, the model could have dictionaries for locations and objects. Example schema:
{
  "locations": {
    "lobby": {
      "position": [0, 0, 0],
      "objects": ["table1", "chair1", "plant1"]
    },
    "kitchen": {
      "position": [10, 0, 0],
      "objects": ["table2", "fridge", "cup1"]
    }
  },
  "objects": {
    "table1": {"name": "Small Table", "type": "furniture", "location": "lobby", "position": [0.2, 0, 1.1]},
    "chair1": {"name": "Chair", "type": "furniture", "location": "lobby", "position": [0.5, 0, 1.5]},
    "fridge": {"name": "Refrigerator", "type": "appliance", "location": "kitchen", "position": [10.1, 0, 0.3], "state": "closed"}
  }
}
In this example, we symbolically label two locations and several objects with their attributes (type, position, etc.). The agent will maintain and update this structure.
Object Interaction Mechanics: To let the agent interact with objects, we harness Vircadia’s scripting for entity manipulation. The Entities.editEntity() function allows changing an entity’s properties (position, rotation, etc.) in real-time (apidocs.vircadia.dev). For example, to simulate picking up an object, we can attach (parent) the entity to the avatar’s hand joint via Entities.editEntity(objectID, { parentID: MyAvatar.sessionUUID, parentJointIndex: <hand joint index> }), where the hand joint index can be looked up by name (e.g. MyAvatar.getJointIndex("RightHand")) (apidocs.vircadia.dev). This moves the object with the avatar’s hand. Dropping it would involve clearing the parent or setting it back to a world position. For toggling an object state (e.g., pressing a button or opening a door), we might change an entity’s property. If a door is an entity, setting its rotation could “open” it. If objects have scripts (e.g., an entity script that triggers when “used”), the agent’s script can invoke that via Entities.callEntityMethod(entityID, "trigger") if supported, or simply move the avatar into contact with the object to trigger collisions.
Implementation Steps:
World Scanning: When the agent’s avatar or assignment script starts, have it gather world info. For example, call Entities.findEntities(MyAvatar.position, R) with a large radius R to get all entities around, or use Entities.findEntitiesByName("", …) with an empty name to retrieve all entities up to some limit (apidocs.vircadia.dev). For each entity ID, get its key properties: name, type, position, maybe userData. Build the initial world model JSON from this. The agent now has a list of objects with positions – essentially a map it can refer to.
Symbolic Labels: If the environment is pre-authored, we can manually label important objects. For example, ensure that the entity for a “key” has name “Key” or userData {"item": "key", "opens": "door1"} if it’s supposed to open a door. The agent’s script can interpret these labels and store relationships (like door1 <-> key). If completely automated, the agent could infer type from model names or shapes, but manual labeling is simpler and reliable.
Updating the Model: The agent’s script should update the symbolic model upon changes. If the agent moves an object (via scripting), it should update that object’s position in its internal JSON. If an object is added or removed (e.g., another user rezzes something or the agent spawns a new item), it should update the model by adding or deleting that entry. Vircadia’s scripting can listen for entity addition/deletion events via signals or polling. This ensures the world model stays in sync with reality – crucial for long-term persistence.
Spatial Awareness: Using the world model, the agent can determine where things are relative to itself. The agent’s script can calculate distances (vector math between MyAvatar.position and object positions). It can determine what is “nearby” or in the same room by checking coordinates or region groupings. This information will be fed to the AI’s prompt or logic. For example, if the agent “sees” (comes within range of) an object of interest, we might append a description to the AI’s context like “You notice a shiny key on the ground.” Likewise, if a user avatar approaches, we detect their presence via the AvatarList API (which can list other avatars in range; apidocs.vircadia.dev) and update the context.
Interacting via AI commands: Extend the AI’s capabilities by introducing new action commands that involve objects. For instance, define an action syntax like ACT: PICKUP key or ACT: OPEN fridge. The AI (via its LLM) could output these when appropriate (with prompt engineering to guide it). Upon seeing such an action command in the output, the controller executes the corresponding script: e.g., find the entity named “key” in the model, get its ID and current position; if within reach, attach it to the avatar’s hand (simulate pickup); update the world model (mark “key” as in possession/inventory instead of in the world). If the AI says “open fridge”, the script finds the fridge entity and rotates its door or toggles an open property. In short, the natural-language intents of the AI are mapped to world operations through this symbolic layer; a parsing sketch appears at the end of this phase.
Symbolic Overlay and Reasoning: Having a structured representation means the AI can reason at a higher level. Instead of dealing with raw coordinates, the agent can think in terms of rooms and objects. For example, if the AI’s goal is “cook dinner,” the symbolic model helps it figure out it needs to go to the kitchen, find the fridge, get ingredients, etc., all of which are labeled. This dramatically reduces the complexity for the LLM, as we can provide it information like: “You are in [lobby]. In this area you see: [chair], [table] (on the table: [key]). The kitchen is to the east.” The agent can then plan “Take the key and go to the kitchen.” The system will translate that plan into actual moves and actions.
Logging and Persistence: At this phase, we should introduce a persistent database or file for the world model. While the Vircadia server will persist entities on its side, our agent’s symbolic knowledge (especially any extra tags or learned info) should be saved. For example, maintain a world_state.json that is updated whenever the agent learns something or an object moves. This file can be loaded on agent startup to quickly populate its knowledge of the environment. It ensures that if the agent is restarted, it doesn’t have to relearn static facts (e.g. where the kitchen is, or that the key opens the door).
By the end of Phase 3, the AI agent has environmental awareness. It can detect and reference objects around it, and through its internal model it “knows” the layout of its world symbolically. Moreover, it can physically interact – picking up, moving, or toggling objects – via scripted actions. The groundwork is laid for complex behavior, because the agent can now form plans involving locations and objects (e.g. “go to the kitchen and fetch water”). The stage is set to grant the agent autonomy in choosing and pursuing goals.
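Here is the parsing sketch referenced above: it pulls ACT: commands out of an LLM reply and maps them onto the symbolic world model. The command vocabulary, the world_model structure (the Phase 3 JSON schema), and send_to_avatar() are assumptions of this sketch, not fixed parts of Vircadia.

# Sketch: map "ACT: PICKUP key" style commands from the LLM onto world operations.
# world_model follows the Phase 3 schema; send_to_avatar() is the bridge to the avatar script.
import json, re

def extract_actions(llm_reply):
    """Find lines like 'ACT: PICKUP key' and return (verb, target) pairs."""
    return re.findall(r"ACT:\s*(\w+)\s+(\w+)", llm_reply)

def execute_action(verb, target, world_model, send_to_avatar):
    obj = world_model["objects"].get(target)
    if obj is None:
        return f"I don't know of any '{target}'."
    if verb.upper() == "PICKUP":
        send_to_avatar(json.dumps({"cmd": "interact", "entity": target, "action": "pickup"}))
        room = world_model["locations"].get(obj["location"], {})
        if target in room.get("objects", []):
            room["objects"].remove(target)       # no longer lying in the room
        obj["location"] = "inventory"            # mark as carried in the symbolic model
        return f"Picked up {obj['name']}."
    if verb.upper() == "OPEN":
        send_to_avatar(json.dumps({"cmd": "interact", "entity": target, "action": "use"}))
        obj["state"] = "open"                    # keep the symbolic state in sync
        return f"Opened {obj['name']}."
    return f"Unknown action '{verb}'."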
Phase 4: Autonomous Exploration and Long-Term Persistence
Goal: Evolve the agent from a reactive script into an autonomous explorer that can set goals, plan actions using memory, and carry out long-term tasks in the world. This phase integrates a cognitive loop for the agent: perceiving the environment, updating internal memory, making decisions (with the LLM’s help), and acting continuously. The agent will also persist its experiences, enabling it to learn and adapt over multiple sessions (days/weeks), achieving true persistence in the environment. Key Enhancements:
Cognitive Loop & Memory: Introduce a memory system that stores the agent’s observations and experiences in a structured way that can be retrieved when needed. This could include episodic memory (chronological log of events, compressed as needed) and semantic memory (facts learned, e.g. “keys can open doors”). The agent will regularly save significant events to long-term memory and retrieve relevant memories to inform decisions (see Cognitive Architecture below).
Goal Management: Allow the agent to have internal goals, either self-generated or assigned by users. It should be able to break down high-level goals into sub-tasks and maintain a queue or plan. For example, an overarching goal “explore the entire building” can be decomposed into room-by-room navigation tasks. The agent should also handle opportunistic goals – e.g., if it notices a new user, it may temporarily prioritize “greet the newcomer”.
Autonomous Navigation: Implement path planning or exploration strategies so the agent can move through the world effectively on its own. This might involve simple heuristics (random wander with obstacle avoidance) or more deliberate pathfinding. If the environment has distinct areas, the agent can choose unvisited locations to explore. We can also give the agent a notion of travel actions separate from fine manipulation. For instance, a routine to move from one room to another (following a corridor or teleporting if allowed) can be coded or learned.
Long-Term Persistence: All aspects of the agent’s state – its world model, memories, learned behaviors, and any preferences or persona details – should be saved such that if the server shuts down and restarts, the agent can pick up where it left off. This likely means writing to disk (or a small database) regularly. For example, use a lightweight database (SQLite or TinyDB) to store key-value pairs like last known location, list of explored areas, known facts about characters it met, etc. The agent can load this on boot and resume its prior agenda (much like how a human logs back in and remembers what they were doing yesterday).
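A minimal persistence sketch follows, assuming the agent’s state is small enough to serialize as JSON; the file name and fields are placeholders, and a SQLite table would work just as well for larger state.

# Sketch: save and restore the agent's persistent state between sessions.
import json, os

STATE_FILE = "agent_state.json"   # placeholder path

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    # Defaults for a fresh agent
    return {"last_location": None, "explored_areas": [], "known_people": {}, "current_goal": None}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

# Usage: state = load_state() on boot; call save_state(state) after significant changes.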
Architecture of Autonomy: At this stage, the AI system will operate in a loop without needing external commands for every action. A simplified autonomous loop might look like:
# Pseudocode for the agent's main loop in autonomous mode
while True:
    observe_environment()           # perceive changes, update world model
    update_short_term_context()     # recent events, current goal, etc.
    retrieve_relevant_memories()    # pull past experiences that might inform the current situation
    if current_goal is None or achieved(current_goal):
        current_goal = formulate_new_goal()         # from user input or self-driven (explore, interact, etc.)
        plan = plan_actions_for_goal(current_goal)  # may involve the LLM to break the goal into steps
    next_action = plan.get_next_action()
    execute_action(next_action)     # via Vircadia script (movement, interaction, speech)
    record_outcome(next_action)     # log what happened, update memory if needed
    reflect_and_adjust_plan()       # (optional) self-critique: did the action succeed? adjust plan or goal if not
    sleep(short_interval)           # small delay or await the next event
This loop runs continuously (with a short delay or event-driven triggers to avoid busy waiting). The agent will always be either pursuing a goal or deciding on the next one.
Integration with LLM: The LLM plays a central role in decision-making. For example, to formulate a new goal, we can prompt the GPT model with something like: “You are in [location]. You have seen: [list of unexplored things or any prompts]. What would you like to do next?” The model might respond with a suggestion (goal) like “I should check the library to see if anyone is there.” Similarly, for planning actions, given a goal, we might ask: “How can you achieve [goal]? List the steps.” The model’s answer can be parsed into a sequence of actions (which our system then carries out step by step). This resembles the approach of systems like BabyAGI or AutoGPT, where an AI agent generates tasks for itself and executes them. The difference in our case is that the actions involve a virtual physical world.
Memory Retrieval: With possibly hours or days of runtime, the agent will accumulate a lot of events. We need a method to retrieve the most relevant pieces efficiently (since GPT prompts have length limits). A common approach is to use vector embeddings: each memory (e.g. a logged event like “met Alice in the garden” or “found key in kitchen”) is converted into an embedding vector and stored in a vector database. When we need relevant memories (say the agent is in the kitchen looking for a key), we encode that query and find similar memory vectors (which might recall “I found a key in the kitchen before”). We can then feed those recalled memory snippets into the prompt (hai.stanford.edu). This mimics how generative agents were implemented: they retrieve memories related to the current situation to inform their next action (hai.stanford.edu). Implementing this on a budget can be done with local libraries (e.g. FAISS or SentenceTransformers in Python) and doesn’t necessarily require heavy compute since our data is not huge; a retrieval sketch appears at the end of this subsection. Alternatively, a simpler keyword search in the logs could suffice for small-scale memory (less optimal but straightforward).
Self-Critique and Adaptation: To make the agent robust, include a mechanism for self-reflection. If an action fails or a plan is not going well, the agent (via GPT) should analyze why and adjust. Recent research (e.g. Voyager and Reflexion) has shown the effectiveness of having the LLM critique the agent’s own actions and suggest corrections (voyager.minedojo.org). We can implement this by capturing errors or unexpected outcomes and then querying the LLM in a special mode: “You attempted X but it failed because Y. What went wrong? How should you adjust?” The model’s answer can guide the next attempt (e.g. “Perhaps the door was locked; try using the key.”). This self-corrective feedback loop will improve the agent’s autonomy over time, allowing it to learn from mistakes without constant human intervention (voyager.minedojo.org).
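Here is the memory-retrieval sketch mentioned above, using the sentence-transformers package with a plain cosine-similarity search; the model name and the in-memory store are illustrative choices, and FAISS or a keyword search could stand in.

# Sketch: retrieve the memories most relevant to the current situation via embeddings.
# Assumes the sentence-transformers package; swap in FAISS for larger memory stores.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedding model

class MemoryStore:
    def __init__(self):
        self.texts = []       # memory snippets, e.g. "found key in kitchen"
        self.vectors = []     # their embeddings

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(encoder.encode(text))

    def retrieve(self, query, k=5):
        if not self.texts:
            return []
        q = encoder.encode(query)
        sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        best = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in best]

# Usage: store.add("met Alice in the garden"); store.retrieve("looking for a key in the kitchen")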
Example Behaviors Enabled:
Exploration: The agent can roam the environment systematically. Suppose the world model shows 5 rooms and the agent has only visited 2; it can set a goal to explore the remaining rooms. It navigates to each (updating its world model with any new objects found), and perhaps logs a description of each room into memory.
Social Interaction: If another user/avatar enters the domain, the agent’s perception will note a new person. The agent could interrupt its current goal to greet the newcomer (a higher-priority social goal). It might introduce itself and engage in conversation (using the dialogue capabilities from earlier phases). From memory, it can recall if it met this person before and continue from prior context (e.g. “Welcome back, Alice!” if it remembers Alice) (hai.stanford.edu). After the interaction, it can resume its previous task.
Persistent Changes: The agent can cause changes that persist. For example, it could collect certain objects and place them in a specific location (like gather all books to the library). These changes are reflected in the world model and since that is saved, the agent will remember that “all books are now in the library bookshelf” next time. If the agent has a personal “inventory” concept (things it carries), that can also persist (maybe stored in its memory as a list of possessed item IDs).
Journaling: The agent can periodically journal its experiences. For instance, each virtual “day” it could summarize what happened: “Today I met two new people and discovered a secret room behind the library.” This summary can be generated by GPT from the day’s event log and then stored in long-term memory or even shared to a Discord channel as an update. Journaling serves both as a memory compression (distilling important points) and as a social feature (others could read the agent’s diary to see its perspective).
Learning New Skills: Inspired by Voyager’s skill library concept, our agent could learn simple scripted skills and reuse them (voyager.minedojo.org). For example, the first time it learns how to open a combination lock (a sequence of actions), we could store that as a function or script snippet. Next time it encounters a lock, instead of reasoning from scratch, the agent can retrieve that skill from its “skill library” and execute it. This way, the agent gets faster and more competent over time at repetitive tasks. While implementing a full code-writing agent is complex, we can simulate a basic form: have the agent describe a procedure in plain language and save that description. Later, if a similar situation arises, feed the description back as a hint. This avoids needing full coding but still gives a form of memory-based skill retrieval, as sketched below.
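A lightweight version of that idea can simply store procedure descriptions keyed by task and retrieve them by matching the current situation; the naive keyword-overlap lookup below is an assumption of this sketch, not a full Voyager-style code-writing agent.

# Sketch: a minimal "skill library" of remembered procedures, retrieved by keyword overlap.
class SkillLibrary:
    def __init__(self):
        self.skills = {}   # task description -> procedure text

    def add(self, task, procedure):
        self.skills[task] = procedure

    def lookup(self, situation):
        """Return the stored procedure whose task shares the most words with the situation."""
        words = set(situation.lower().split())
        best, best_overlap = None, 0
        for task, procedure in self.skills.items():
            overlap = len(words & set(task.lower().split()))
            if overlap > best_overlap:
                best, best_overlap = procedure, overlap
        return best   # None if nothing matches

# Usage:
# skills.add("open combination lock", "1. Read the code from the note. 2. Turn the dial to each digit...")
# hint = skills.lookup("there is a combination lock on the chest")   # feed the hint into the prompt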
Technical Considerations: Running an autonomous loop means the system should handle timing carefully. We don’t want to overwhelm the LLM with calls every second. A strategy is to make the agent event-driven with a reasonable tick rate (for example, the agent thinks/acts at most twice per second, and only calls the LLM when a significant decision point or new goal arises, rather than every tick). Many actions (like moving along a path or picking up an object) can be handled by deterministic code once the plan is set, so the LLM is consulted primarily for high-level decisions and language generation. This keeps API usage and costs in check. We also have to manage concurrency: Vircadia’s script will be running continuously and may trigger events (e.g. collision with an object) while the agent is thinking. A simple solution is to queue these events (or ignore some if currently busy), or design the agent loop to handle interruptions by pausing its current plan when needed.
Phase 4 Outcome: The result is a fully autonomous AI agent in a persistent 3D world. It perceives its environment, maintains an internal symbolic world state, remembers past interactions, and uses a GPT-based brain to converse and make decisions. It can operate continuously, adapting to changes and new goals over time. Importantly, its knowledge and state persist, meaning the world can evolve (with the agent perhaps altering things) and those changes won’t be forgotten. This creates a rich testbed for social AI: for example, one could observe the agent for days as it develops its own routines or personality within the simulated world, much like the generative agents in the Stanford Smallville experiment exhibited believable daily behaviors (hai.stanford.edu). With the environment and agent architecture in place after Phase 4, we next detail the cognitive architecture and modules that make such autonomous behavior possible, aligning with the “Voyager-style” design mentioned.
Integrating a Voyager-Style Autonomous GPT Agent
Now that the infrastructure (Discord relay, Vircadia embodiment, environment model, and basic autonomy loop) is established, we focus on the design of the AI agent’s brain in more depth. The term “Voyager-style” here refers to recent advanced agents that incorporate long-term skill acquisition and self-directed exploration (voyager.minedojo.org). We adapt those ideas to our use-case, emphasizing:
A Cognitive Architecture with memory, reasoning scratchpad, dynamic goal setting, and self-critique.
A Middleware Bridge that connects the LLM (agent’s mind) to the virtual world (sensors and actuators in Vircadia).
A Symbolic World Model format that the agent can use to reason about spatial and object relationships (as developed in Phase 3).
Autonomous Navigation strategies for the agent to move and explore effectively, using the world model and planning abilities.
Dialogue and Social Interaction capabilities for the agent to engage users naturally and maintain social context (including journaling its experiences).
Each of these aspects is discussed below, along with implementation suggestions and any relevant pseudo-code or examples.
Cognitive Architecture: Memory, Reasoning, Goals, Self-Critique
A robust cognitive architecture is what turns a basic chatbot into a believable autonomous agent. In our design, the cognitive system is responsible for managing the agent’s state of mind: what it remembers, what it is trying to do, what it’s thinking at each step. We draw inspiration from generative agent research (e.g., Park et al.’s Generative Agents; hai.stanford.edu) and the Voyager Minecraft agent (voyager.minedojo.org) to outline the components:
1. Memory System: The agent’s memory will be multi-tiered:
Event Stream (Short-term memory): A chronological log of recent observations and actions, akin to the “memory stream” of generative agents (hai.stanford.edu). This includes things the agent has observed (e.g. “saw Alice enter the room at 3:00pm”) and things it has done (“picked up the key”). This short-term memory can be kept as a sliding window or buffer that continuously updates. When constructing the prompt for GPT at any given moment, we include the most recent events from this buffer to give it context.
Long-term Memory Store: Important events or learned facts get recorded here in a condensed form. For each event, we might store a summary and some tags (e.g. “Met Alice – she is a botanist”). Additionally, the agent can have general knowledge entries (e.g. “Keys open doors” or “The library is east of the lobby”). This store can be implemented as a simple list of memory entries in a JSON file or database. Each entry might have fields like {"timestamp": …, "type": "interaction", "content": "Alice mentioned she loves gardening.", "importance": 7}. We assign an importance score or use heuristics to decide what goes to long-term memory (the generative agents paper used a scoring mechanism for salience). Retrieval will rely on semantic search: when the agent needs to recall something, we either search by keywords or use an embedding similarity search to find memories related to the current situation (hai.stanford.edu).
Working Memory (Prompt Context): When prompting the LLM, we cannot feed it the entirety of long-term memory (which could be huge). Instead, we retrieve a subset of relevant memories (say the top 5 entries related to current task or recent dialog) and include those in the prompt, along with the immediate short-term events. This working set functions like the agent’s active conscious thoughts.
By structuring memory this way, we ensure the agent has continuity. For example, if a user told the agent their name yesterday, the agent stored that fact with high importance and will retrieve it when the same user talks to it today, enabling personalized interaction (“Hello again, Alice!”) (hai.stanford.edu). Memory retrieval is triggered in the loop before decision-making, ensuring the agent’s decisions are informed by past experiences.
2. Scratchpad Reasoning: We implement a form of chain-of-thought prompting where the agent can have a “scratchpad” to reason through problems. There are a few techniques to achieve this:
Use the ReAct pattern (Reason and Act), where the prompt encourages the model to first produce reasoning steps (thoughts) and then an action (hai.stanford.edu). For instance, if faced with a puzzle like a locked door, the prompt might be: “You see a locked door. Thoughts?” and the model might internally generate: “Thought: The door is locked. I recall a key under the mat earlier. Action: Pick up the key and unlock the door.” We can either parse these structured outputs or use the model’s chain-of-thought as guidance.
In practice, we might explicitly instruct GPT to think step by step: “Explain your reasoning before deciding an action.” We capture the reasoning (but not necessarily show it in-world), and then parse the final decision. This approach, while increasing token usage, helps the model avoid hasty or illogical actions. The “scratchpad” text can contain analysis of the world model: e.g. “I am in the kitchen and I need to find a tool. I remember there was a toolbox in the garage. So I should go to the garage.” Then the action output might be “GO to garage”. Our system will then execute that as a move.
We will keep this entire chain in the log for transparency and debugging. It’s effectively the agent “talking to itself,” which is a powerful way to handle complex tasks using GPT’s strengths in reasoning.
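A hedged sketch of how this scratchpad output could be parsed, assuming we instruct the model to answer with labeled “Thought:” and “Action:” lines (the labels are our own convention, not a model feature):

# Sketch: split a ReAct-style reply into its reasoning and its action.
# Assumes the prompt asks the model to answer with "Thought: ..." and "Action: ..." lines.
def parse_thought_and_action(response):
    thought_lines, action = [], None
    for line in response.splitlines():
        line = line.strip()
        if line.lower().startswith("thought:"):
            thought_lines.append(line[len("thought:"):].strip())
        elif line.lower().startswith("action:"):
            action = line[len("action:"):].strip()
    return " ".join(thought_lines), action

# Example: parse_thought_and_action("Thought: The door is locked.\nAction: GO to study")
#   -> ("The door is locked.", "GO to study")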
3. Goal Decomposition: The agent should break high-level goals into sub-goals or actionable steps, especially for non-trivial tasks. We can implement this by prompting the LLM in a special mode whenever a new complex goal is set. For example: “Your goal is: Organize a meeting in the conference room. Plan the steps to achieve this goal.” The model might reply with an ordered list: “1. Announce the meeting to everyone. 2. Make sure the conference room is tidy. 3. At meeting time, greet attendees…” Our system can parse this into discrete objectives and put them into a plan queue. This is analogous to how AutoGPT or similar agents generate a task list for themselves. The agent then focuses on one sub-goal at a time. After completing each, it can check it off and move to the next. If a sub-goal itself is complex, the agent can recursively break it down further (though one must be careful to avoid infinite recursion; usually capping at two levels deep is enough).
To implement goal management, we can maintain a variable like current_goal and plan_steps. The plan can be a list of (subgoal, status) pairs. When a subgoal is done, mark it complete and proceed. If mid-way something changes (e.g. the goal shifts or becomes irrelevant), the agent can re-plan. This is part of the reflection mechanism: periodically evaluate whether the current plan still makes sense or a re-plan is needed (for example, if a user gave a new instruction that supersedes the old goal).
4. Self-Critique and Reflection: Taking inspiration from Reflexion (voyager.minedojo.org), after executing actions or finishing a goal, the agent should reflect on how it went. This can be as simple as: “Did I achieve what I wanted? What mistakes were made? What will I do differently next time?” Technically, we can prompt GPT with a brief summary of the attempt and ask for a critique or lesson. For example, if the agent tried to solve a riddle and failed, we feed: “You attempted the riddle but got it wrong because you misunderstood the clue. What can you learn?” and GPT might respond “I should pay more attention to the exact wording next time.” We then store this as a learned lesson in memory. Over time, these self-critiques can improve performance; for instance, the agent might avoid repeating a path that led to a dead end once it has noted that. Voyager’s iterative prompting with self-verification is similar – it checks whether the outcome of an action achieved the desired result, and if not, adjusts the approach (voyager.minedojo.org). We can incorporate a verification step after each significant action: for example, after the agent says “I unlock the door,” check the world state – is the door actually open? If not, have the agent reconsider. This builds reliability.
To tie all these cognitive pieces together, consider the architecture diagram below (from Stanford’s generative agents work), which shows the perception→memory→planning→action loop with reflection:
Cognitive architecture of the agent’s “mind,” showing how perceptions feed into a memory stream, which the agent can retrieve from for reflection and planning before taking actions (hai.stanford.edu).
Our design employs a similar feedback loop: the agent perceives new events, stores them, recalls relevant memories, reflects or reasons via GPT (scratchpad), plans sub-tasks towards goals, and then acts. This architecture ensures the agent isn’t just reacting blindly; it has an inner life of thoughts and memories that inform its behavior, making it more believable and coherent over time. For example, the agent could form an intention like “I haven’t checked the garden in a while, and I remember Alice likes gardening. I will go see if she is there.” (hai.stanford.edu) – a decision that emerges from combining memory (Alice’s interest) and current context (time since last visit) with autonomous goal-setting.
Implementation Tips: We can implement the above using either raw prompting or helper frameworks. Python frameworks like LangChain provide out-of-the-box memory modules and chaining that could simplify building this cognitive loop. For instance, LangChain’s conversation memory can track dialogues, and its agent tooling can facilitate action parsing. However, given our custom environment, we might code it manually for flexibility. Pseudocode in Python for a simplified think-act cycle could look like:
# Assume we have a function call_gpt(prompt) that returns the model's answer
short_term = []                            # list of recent events
long_term = load_memories("memory.json")   # load saved memories

def decide_next_action(observation, current_goal):
    # 1. Update short-term memory with the new observation
    short_term.append(observation)
    # 2. Retrieve relevant memories
    relevant = retrieve_memories(long_term, short_term, current_goal)
    # 3. Construct the prompt
    prompt = format_prompt(short_term, relevant, current_goal)
    # 4. Call the LLM for reasoning and action
    response = call_gpt(prompt)
    # 5. Parse the response into a thought and an action
    thought, action = parse_thought_and_action(response)
    # Store the thought as a potential insight for later analysis
    maybe_store_insight(thought, long_term)
    return action

# In the main loop, we would do:
obs = perceive_world()                     # e.g., "You see a door that is closed."
action = decide_next_action(obs, current_goal)
execute(action)
if goal_achieved(current_goal):
    reflection = call_gpt(f"Reflect on goal {current_goal} completion…")
    store_reflection(reflection, long_term)
    current_goal = None
This is a high-level sketch, but it highlights where each piece (memory retrieval, reasoning, action parsing, reflection) comes into play. Overall, the cognitive architecture is the brain of our agent, enabling it to use the information from the world and its past to make intelligent decisions.
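To make the pseudocode above a bit more tangible, here is one possible shape for format_prompt(); the section labels, persona text, and the five-event window are assumptions about prompt design, not requirements of the model.

# Sketch: assemble the working-memory prompt from recent events, retrieved memories,
# world facts, and the current goal. All labels and wording here are illustrative.
def format_prompt(short_term, relevant_memories, current_goal, world_facts=None):
    lines = ["You are an autonomous agent embodied as an avatar in a virtual world."]
    if world_facts:
        lines.append("World facts: " + "; ".join(world_facts))
    if relevant_memories:
        lines.append("Relevant memories: " + "; ".join(relevant_memories))
    if short_term:
        lines.append("Recent events: " + "; ".join(short_term[-5:]))
    lines.append(f"Current goal: {current_goal or 'none (choose a sensible goal)'}")
    lines.append("Answer with a 'Thought:' line explaining your reasoning, "
                 "then an 'Action:' line with a single action.")
    return "\n".join(lines)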
Middleware: WebSockets/API Bridge between GPT and Vircadia
The middleware is the glue connecting our agent’s mind (the GPT-powered logic running in Python/Node) with its body and environment (the Vircadia world and scripts). We have touched on this in earlier phases, but here we specify the design to ensure reliability and flexibility. Communication Channel: We will use a WebSocket-based protocol for real-time, bidirectional communication between the external agent process and the Vircadia scripting environment. WebSockets are well-suited for low-latency, event-driven communication and are supported in JavaScript. The architecture will look like this:
The Agent Controller (Python script running the cognitive loop and GPT calls) will host a WebSocket server (using a library like websockets in Python or ws in Node). This server listens on a certain port for incoming connections.
The Vircadia Avatar Script (which is controlling the agent’s avatar in-world) will act as a WebSocket client. When the script initializes (on script startup in Vircadia), it attempts to connect to the agent controller’s WebSocket (using the domain’s network if the agent is remote, or localhost if running on the same machine). Once connected, this socket remains open.
Alternatively, we could invert the roles (script as server, agent as client). However, running a server inside Vircadia’s script might be tricky due to sandboxing, whereas the agent process has full freedom to run a server. The one initiating the connection doesn’t matter much as long as it’s stable. Message Design: Define a simple JSON message format for commands and events. We need messages going to the avatar script (to command actions) and from the avatar script (to inform the agent of events):
Commands (Agent -> Avatar):
{"cmd": "move", "target": [x,y,z]} – instructs the avatar to move to the specified world coordinates (or it could also be a relative move or named location).
{"cmd": "say", "text": "Hello everyone"} – the agent should speak or send chat with the given text.
{"cmd": "animate", "animation": "wave"} – trigger a wave animation (if such is available or custom-defined).
{"cmd": "interact", "entity": "<entity_id>", "action": "pickup"} – interact with a specific object, here indicating a pickup action. The script knows how to handle "pickup" (e.g., attach to hand). Another example: "action": "use" might mean push a button or open a door.
Events (Avatar -> Agent):
{"event": "arrived", "location": "kitchen"} – the avatar finished moving to the kitchen (the script can send this when close to the target or after a teleport). This lets the agent know a navigation step is done.
{"event": "seen", "entity": "<entity_id>", "name": "Apple", "type": "food"} – the avatar’s sensors (script) detect a new object in view (perhaps triggered by proximity). The agent can then decide to examine or pick it up.
{"event": "heard", "speaker": "Alice", "text": "Hi AI, can you help me?"} – the script captured a chat message in-world directed at the agent. This is crucial for social interaction: the agent needs to receive what others say. If voice is used, this could come from a speech-to-text module that transcribes the voice and sends the text as an event.
{"event": "collision", "entity": "ball", "force": 5} – e.g., the agent bumped into something or something hit the agent. Perhaps not critical, but could be used for physical awareness or humor (“ouch!”).
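On the agent side, a minimal sketch of this bridge using the Python websockets package might look like the following; the host, port, and broadcast helper are placeholders, and older versions of the library pass an extra path argument to the connection handler.

# Sketch: agent-side WebSocket server for the avatar bridge (assumes the "websockets" package).
import asyncio
import json
import websockets

connected = set()                     # open connections from avatar scripts (normally just one)

async def send_command(cmd):
    """Broadcast a command dict, e.g. {"cmd": "say", "text": "Hello"}, to connected scripts."""
    for ws in list(connected):
        await ws.send(json.dumps(cmd))

async def handler(websocket):         # older websockets versions: handler(websocket, path)
    connected.add(websocket)
    try:
        async for raw in websocket:
            event = json.loads(raw)   # e.g. {"event": "heard", "speaker": "Alice", "text": "..."}
            print("event from avatar script:", event)   # hand off to the cognitive loop here
    finally:
        connected.discard(websocket)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()        # run forever; send_command() can be called from other tasks

asyncio.run(main())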
The protocol can be kept simple (JSON as above). We may implement a lightweight library on top, or even use an existing RPC or networking library, but given the moderate complexity, custom JSON messages are fine.
Synchronization: The agent process will be running the cognitive loop and send commands as decisions are made. The Vircadia script listens and executes them, potentially sending back acknowledgments or results. For example, if the agent says to move to [10,0,0], the script might gradually move there and, upon reaching the target (or after a timeout), send an arrived event back. The agent’s loop can wait for that event or periodically check the avatar’s position via queries. A modest state machine can be set up: e.g., the agent sends a move command and sets waiting_for="arrived" before issuing the next action. Meanwhile, it can still process other inputs, like someone talking to it.
Error Handling: The middleware should handle disconnects gracefully. If the WebSocket drops (e.g., the agent process restarted or a network glitch), the avatar script can attempt reconnection periodically. If the agent doesn’t get a response to a command, it can resend or give up after retries. Logging is again helpful – log every message sent and received for debugging.
Security: Since this is a local/internal connection (the agent to its avatar), heavy security isn’t needed. But if the domain is public, one might worry about others trying to send fake commands. To mitigate this, we could add a simple auth token in the handshake or run both the agent and the avatar script on the same secured server. Because Vircadia assignment clients are typically scripts trusted by the server admin, we can assume the connection is private in our controlled setup.
Alternate Approaches: It’s worth noting that Vircadia has a notion of an embedded Python or external API in some forks (though primarily it’s JS). Since we prefer Python, one might consider using the Vircadia Web SDK (Ananke) (npmjs.com), which allows building a client in Node/JS and could possibly be used within a Python environment via something like PyV8 or by running Node and Python together. However, this adds complexity. The WebSocket method provides a clear separation: the VR platform doesn’t need modification, and the AI logic can be developed and tested independently (even without VR, by simulating incoming events).
Development: Start by implementing a minimal version: e.g., the agent says “move forward 1 meter,” and see the avatar move. Then expand the message set. We should test the throughput – WebSockets can easily handle dozens of messages per second, which is more than enough for our needs (we might send a few per second at most). The latency is usually a few milliseconds on a local or cloud server, so the agent can react fairly quickly to dynamic events (like someone saying hi, with the agent responding within a second including LLM processing).
In summary, the middleware ensures the agent’s mind and body stay in sync. It allows the GPT model to effectively reach into the virtual world (through well-defined actions) and conversely lets world stimuli reach the GPT’s input. By designing it as an API, we also future-proof the system: if later we swap out Vircadia for another engine, or swap out the GPT model for another AI, we just need to adapt the adapter layer, keeping the overall architecture intact.
Symbolic Mapping: Spatial and Object Label Schema
We have already crafted a symbolic world model during Phase 3. Here we formalize the schema and how the agent uses it in reasoning and communication. The symbolic mapping serves as the agent’s internal knowledge base about the world’s layout and contents. Schema Design: We chose JSON as a human-readable and easily manipulable format. The main elements are:
Locations (Spaces): Each important area/room in the environment is a key in a locations dictionary. The value contains properties like coordinates (the center of the room or a known landmark position), a list of objects present (by ID or name), and perhaps links to adjacent locations (for pathfinding). For example: "kitchen": { "position": [10,0,0], "adjacent": ["hallway"], "objects": ["fridge", "apple"] }. Adjacency could be determined automatically if we know a door connects the kitchen and the hallway. This effectively forms a graph of locations for navigation.
Objects (Entities): Each object has an entry keyed by a unique ID or name. Properties include its human-readable name, type/category, current location (linking back to a location name), position (for precise navigation), and state (could be open/closed, on/off, etc., depending on object). We can also store if the agent has that object (e.g., location = “inventory” for carried items). For interactive objects, store relevant info: if it’s a door, store which two locations it connects and whether it’s locked; if it’s food, store if it’s edible, etc. These details help the agent decide how to interact (e.g. it knows a key is used for unlocking, not for eating).
Agents/Actors: Optionally, we can extend the schema to include other agents or notable NPCs. For instance, “Alice”: {“type”: “person”, “last_seen”: “garden”, “affiliation”: “visitor”}. This could help the AI track where people are or who they are. However, if many users come and go, storing them all might be too much; a simpler approach is to keep a separate list of recent interlocutors in memory.
Usage in Reasoning: When the GPT model is deciding an action, we will incorporate the symbolic info into the prompt. Rather than giving it the raw JSON (which is possible but might confuse the model if not formatted well), we translate parts of it into natural-language facts. For example, if the agent is in the lobby and the JSON says the lobby has a table and chair, we can add a line to the prompt: “You are in the lobby. You see a small table and a chair here.” If the agent has a goal to get to the kitchen, and the schema tells us the kitchen is adjacent to the hallway which is adjacent to the lobby, we might tell GPT: “The kitchen is east of the lobby, through the hallway.” These cues allow the model to plan a route.
For more formal reasoning (like path planning), we might use code rather than GPT, but GPT can handle simple navigation instructions if given a map. If the world is complex, implementing proper pathfinding (like A* on the location graph) is prudent; the agent’s code can then output “go through hallway to kitchen” as steps to execute, rather than asking GPT to figure it out. We essentially give GPT high-level knowledge and let the low-level coordinate planning be handled by code. This division ensures the model isn’t overloaded with geometric calculations (which it’s not good at) but is used for what it excels at: understanding context and making creative decisions.
Example of Integrating Symbolic Data in Prompt:
Suppose the agent is searching for an “ancient book” in a library and our symbolic model knows that “ancient book” is inside a safe in the library which is locked, and the key is in the study. We could provide GPT a context like:
Memory: “You learned that the ancient book is kept in a safe in the library.”
World state: “The library safe is locked. There is a key in the study that might open it.”
This information may come directly from the world model (e.g., userData on the safe could have {"requires": "study_key"}, which our system translates into that sentence). Now GPT can infer the plan: go to the study, get the key, then come back to the library and unlock the safe.
Updating Symbolic Labels: As the agent interacts, the model must be updated. We covered this in Phase 3: e.g., if the agent picks up the key from the study, the JSON entry for the key changes its location to “inventory”, or the key is removed from the study’s object list. If the agent unlocks the safe, we update the safe’s state to “open” and add the book to the library’s objects (if the book is now accessible). The agent’s script or controller code should handle these updates synchronously with the actions. This keeps the symbolic truth aligned with the virtual world truth.
Persistence of the World Model: We maintain the world model JSON on disk (updated after each significant change). This ensures persistence of object states (though the Vircadia domain itself also persists, we want the agent’s knowledge to persist too). For instance, if the world resets to its initial state on every server restart but our agent had moved items, we might want to respawn them in new locations as per the agent’s memory (this gets into world persistence, which may be beyond our agent’s scope). Ideally, the world itself is persistent (Vircadia domains save changed entities), so our agent’s model just picks up what’s there. If not, we could treat the agent as a sort of “game master” that re-applies changes from the last session (e.g., re-lock that safe because it remembers it was locked).
Collaboration and Symbolic Communication: The agent can use symbolic understanding to communicate better with users. If a user asks “Where is the key?”, the agent can answer from its world model: “The key is in the study, on the desk” (if that’s what the model says). This is more reliable than the LLM guessing. So, when formulating replies, the agent’s system should fetch factual data from the model and either feed it to GPT or have a rule that certain questions are answered from the model directly. A hybrid approach: provide the facts to GPT as context (e.g., “Fact: the key is in the study.” followed by the user question, so GPT will incorporate it into the answer). This way we ensure consistency between what the agent does and what it says. It prevents the agent from hallucinating incorrect info about the environment – a common issue if the LLM isn’t grounded in the actual world state.
In short, the symbolic mapping is the agent’s mental model of the world. It bridges the gap between the continuous 3D simulation and the discrete, language-friendly representation that GPT can work with. By carefully maintaining this model and using it in prompts, we achieve a form of “knowledge base” the agent can rely on, which is crucial for complex tasks and consistency.
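As a concrete illustration, a small helper can render part of the world model as the natural-language facts described above. This is a sketch: the field names follow the Phase 3 example schema, and the wording is chosen freely.

# Sketch: render part of the symbolic world model as natural-language facts for the prompt.
def describe_location(world_model, location_name):
    loc = world_model["locations"][location_name]
    names = [world_model["objects"][oid]["name"] for oid in loc["objects"]]
    facts = [f"You are in the {location_name}."]
    if names:
        facts.append("You see: " + ", ".join(names) + ".")
    for neighbor in loc.get("adjacent", []):
        facts.append(f"The {neighbor} is adjacent to the {location_name}.")
    return " ".join(facts)

# Example output: "You are in the lobby. You see: Small Table, Chair. The hallway is adjacent to the lobby."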
Autonomous Navigation and Exploration Routines
Moving an avatar purposefully in a 3D world requires some logic beyond simply issuing a move command. Here we address how the agent will navigate its environment autonomously, using information from the symbolic model and, where needed, its own exploration routines.
Navigation Graph: Using the locations from the symbolic model, we can derive a navigation graph. Nodes are locations (rooms or significant coordinates) and edges are connections (doors, hallways). We can define this manually for our environment, or derive it automatically if the environment is simple (for example, by clustering known coordinates into rooms). For pathfinding, a straightforward graph search (BFS or A*) yields a sequence of intermediate locations from the current position to the goal. The output could be a list like ["lobby", "hallway", "kitchen"], meaning go from lobby to hallway, then from hallway to kitchen (a minimal pathfinding sketch appears after the movement-primitive list below). The agent's controller then sends successive move commands for each segment, using a known coordinate for each location (probably its center or a labeled point within the space). If the environment is open-world without clear rooms, an alternative is waypoint navigation: define a grid or set of waypoints and run A* over them. Given our emphasis on a social environment, spaces are likely discrete (rooms, areas in a building, etc.), so we prefer a high-level navgraph to avoid heavy computation. If needed, one could integrate a NavMesh library or custom pathfinding that avoids obstacles (especially if the scene has large objects blocking the way). However, since Vircadia has physics and collisions, the avatar cannot walk through walls anyway; simple straight-line movement suffices when nothing blocks the path. If something is in the way (like a closed door entity), the agent should recognize it (via collision, or by checking the door's state in the world model) and act on it (open it) rather than route around it – doors are meant to be opened, not avoided.
Movement Routines: We implement a few basic movement primitives in the avatar script:
goTo(locationName): high-level movement; uses the known coordinate of locationName from the world model and moves the avatar there (either by teleporting or by simulated walking). Teleportation can be done by simply setting MyAvatar.position to the target; walking can be done by gradually changing the position. Teleportation is easier but less realistic; we might allow it in a large environment to save time, or use a combination: if the distance exceeds some threshold, teleport to the general region and then walk the rest of the way.
wander(areaName): the agent explores a location by wandering around within its bounds. We can get the area’s bounding box or radius (if not directly available, define manually). Then pick random points within it and walk to them, or walk in a pattern (like along the walls). This is useful if the agent’s goal is to “explore this room” to ensure it catches all objects. During wandering, the script continuously scans for new entities to update the world model.
avoidObstacles(): a simple routine where, if the avatar's path is blocked (the script can detect that movement in a direction stopped early, or use raycasts to check whether an obstacle is ahead [apidocs.vircadia.dev]), it sidesteps or rotates and tries again. This doesn't need to be sophisticated in a relatively sparse environment, but it prevents the agent from getting stuck running into a wall.
patrol([location1, location2, …]): if we want the agent to patrol between a set of points (like an NPC guard behavior), we can script that loop. This could be triggered as a default behavior when idle – e.g., the agent roams around a set path until something else demands attention.
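As referenced above, here is a minimal sketch of pathfinding over the location graph. It assumes the adjacency data lives in the world-model JSON with the shape shown earlier; find_path and its arguments are illustrative names, not part of any Vircadia API.

```python
from collections import deque

# Minimal sketch: BFS over the location graph from the symbolic world model.
# Assumes the {"locations": {name: {"adjacent": [...]}}} shape used earlier.
def find_path(world, start, goal):
    """Return a list of location names from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        for nxt in world["locations"][path[-1]]["adjacent"]:
            if nxt in visited:
                continue
            if nxt == goal:
                return path + [nxt]
            visited.add(nxt)
            frontier.append(path + [nxt])
    return None  # no route known in the symbolic model

# e.g. find_path(world, "lobby", "kitchen") -> ["lobby", "hallway", "kitchen"];
# the controller would then issue goTo() for each segment in order.
```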
We can embed some of these capabilities into the agent's action set. For example, if the agent has no specific goal, it might choose an explore action that calls wander in an unexplored area, or a patrol action if it is supposed to monitor.
Exploration Strategy: The agent should have a strategy for open-ended exploration (a small target-selection sketch follows this list):
Keep track of which locations have been fully explored. It can mark a location “explored” once it has scanned for entities there and visited any sub-areas. This can be a flag in the world model (e.g., locations.kitchen.explored = true).
If the agent has no user-given task, it defaults to finding an unexplored area and going there. This echoes the automatic-curriculum idea from Voyager [voyager.minedojo.org], where the agent constantly seeks new discoveries. In a social world, "new" might mean either new places or new people and interactions. We can incorporate social exploration: if the agent has met all active users and seen all rooms, it might try different activities or experiments (like "what happens if I arrange these objects in a circle?"). This keeps the agent from becoming static.
Possibly introduce an intrinsic motivation metric, like a score for curiosity. For simplicity, a count of unexplored objects or areas can serve. Or randomly choose a pending exploration task to keep things from being deterministic.
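As noted above, a minimal sketch of picking the next exploration target. It assumes an "explored" flag on each location, as described in the first bullet; choose_exploration_target and the random tie-break are illustrative choices.

```python
import random

# Minimal sketch: pick the next place to explore. Assumes each location in the
# world model carries an "explored" flag, as described in the bullets above.
def choose_exploration_target(world):
    unexplored = [name for name, loc in world["locations"].items()
                  if not loc.get("explored", False)]
    if not unexplored:
        return None  # everything visited; fall back to social or experimental goals
    # Random choice acts as a crude curiosity/novelty tie-break.
    return random.choice(unexplored)

# The agent's idle loop might then do something like:
#   target = choose_exploration_target(world)
#   if target:
#       follow(find_path(world, current_location, target))
```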
Goal Arbitration: The agent might have multiple potential goals: explore vs. interact vs. a user request. We should prioritize explicit user requests highest (if a user asks the agent to do something, it should oblige unless that conflicts with safety or another user's prior request). The next priority might be social courtesy (greeting a new arrival) over solitary exploration. We can implement a simple priority system: whenever an event comes in, if it is a user message addressed to the agent, that triggers a goal override to respond to that user; if the agent is alone, it defaults to its internal goals. This keeps the agent responsive rather than stuck in its own world when people want to engage, in line with being "social-first" (a minimal sketch of this rule appears at the end of this section).
Long-Distance Travel: If the Vircadia world is large (multiple regions far apart), consider adding mechanisms like teleport hubs or vehicles. Since we aim for accessibility and a moderate-scale social world, distances are likely small enough to walk. If needed, Vircadia's address/viewpoint functions (AddressManager.goToLocation or similar, if available) can teleport the agent.
Example Navigation in Action: Suppose the agent's goal is "find a particular object (say, a rare plant) and show it to Alice". The agent knows from its world model that the plant is in the greenhouse. It plans a path: lobby -> hallway -> greenhouse. The middleware sends commands to move through the hallway, then into the greenhouse. On arrival, the avatar script confirms, scans the greenhouse, finds the plant entity, and perhaps picks it up. The agent then plans a route to Alice's location. It knows Alice is in the lounge (because it saw her there earlier or she said so), so it navigates greenhouse -> hallway -> lounge. On arriving, it uses the say command to tell Alice, "I brought you the plant you were looking for." This ties together navigation, object interaction, and social dialogue – all the pieces for which are in place by this phase. In implementing these, start simple (teleport directly to the target location as a placeholder for real pathfinding), then refine as needed. Because the agent's behavior is the focus, a perfectly smooth path isn't critical in a prototype; even teleporting or "popping" from one spot to another is acceptable in early stages (though less immersive). Over time, one can polish the experience by having the avatar physically walk the distance for realism.
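A minimal sketch of the priority rule described under Goal Arbitration, assuming events arrive as simple dicts from the WebSocket bridge; the event fields and the choose_goal name are illustrative assumptions, not a fixed protocol.

```python
# Minimal sketch of goal arbitration: user requests > social courtesy > own plans.
# Event dicts and their field names are assumptions about the bridge's output.
PRIORITY = {"user_request": 0, "greet_new_arrival": 1, "internal_goal": 2}

def choose_goal(pending_events, internal_goal=None):
    """Pick the highest-priority goal from pending events and the agent's own plan."""
    candidates = []
    for ev in pending_events:
        if ev.get("type") == "chat" and ev.get("addressed_to_agent"):
            candidates.append(("user_request", ev))
        elif ev.get("type") == "avatar_joined":
            candidates.append(("greet_new_arrival", ev))
    if internal_goal is not None:
        candidates.append(("internal_goal", {"goal": internal_goal}))
    if not candidates:
        return None
    kind, payload = min(candidates, key=lambda c: PRIORITY[c[0]])
    return {"kind": kind, "payload": payload}
```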
Dialogue and Social Interaction Capabilities
Finally, the agent’s social abilities need to be refined so it can function as a believable social entity in the 3D environment. By Phase 4, the agent can already speak (text/voice) and understand text input. Now we ensure it handles dialogue contextually, can maintain multi-user conversations, and exhibits behaviors like journaling and emotional expression that make it engaging. Contextual Dialogue: The agent should remember past conversations and personal details. Thanks to the memory system, if a user told the agent something earlier, we want the agent to recall it. This involves fetching those memory entries when conversing. For example, if Alice mentioned her cat’s name is Whiskers yesterday, and today she asks “Do you remember my cat’s name?”, the agent (with memory retrieval) should be able to answer correctly. We implement this by storing such facts as memory and including them when the agent talks to that user
hai.stanford.edu
. We can maintain a small profile per user: e.g. user_profiles[‘Alice’] = {“cat”: “Whiskers”, “job”: “botanist”, “met_on”: “2025-05-01”} and then when constructing GPT prompt for a reply to Alice, add “(You recall Alice’s cat is named Whiskers and she works as a botanist.)”. This primes the model with relevant info. This way, even if the AI’s raw conversation memory (in tokens) doesn’t extend to last week, our system explicitly feeds it key facts from long-term memory. Multi-user Conversations: In a virtual world, multiple people might talk to the agent at once or in proximity. The agent should handle addressing individuals. We can give the agent a notion of focus – if one person is actively engaged, it focuses on them, but should still acknowledge others. The Discord interface is single-user, but in-world there might be a group chat scenario. The Vircadia script can forward all messages it hears along with the speaker’s ID. The agent’s logic then might queue up responses. Possibly use separate GPT conversation threads for separate users to maintain distinct context (like one per active dialogue). However, interleaving is tricky. A simpler approach is to handle one at a time: if two people ask questions, answer the first, then the second. If it’s a group conversation, the agent can respond in a general way addressing all (especially if it’s just small talk).
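A minimal sketch of this per-user grounding, assuming a simple user_profiles dict like the example above; profile_preamble and the field names are illustrative.

```python
# Minimal sketch: build a grounding preamble from a per-user profile.
# The profile shape mirrors the example above; field names are assumptions.
user_profiles = {
    "Alice": {"cat": "Whiskers", "job": "botanist", "met_on": "2025-05-01"},
}

def profile_preamble(name):
    """Return recalled facts about a user as a parenthetical note for the prompt."""
    profile = user_profiles.get(name)
    if not profile:
        return ""
    facts = []
    if "cat" in profile:
        facts.append(f"{name}'s cat is named {profile['cat']}")
    if "job" in profile:
        facts.append(f"{name} works as a {profile['job']}")
    return "(You recall that " + " and ".join(facts) + ".)" if facts else ""

# Prepended to the chat prompt before Alice's latest message:
#   profile_preamble("Alice")
#   -> "(You recall that Alice's cat is named Whiskers and Alice works as a botanist.)"
```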
Multi-user Conversations: In a virtual space, multiple people may interact with the agent simultaneously. The Discord interface is single-user, but in-world a group-chat scenario is likely, so the agent needs to manage turn-taking, keep a notion of focus (if one person is actively engaged, it focuses on them while still acknowledging others), and address people individually by name. The Vircadia scripting can capture chat messages with speaker IDs, so the agent knows who said what. A simple strategy is to handle one query at a time (queuing others if needed), always including the speaker's name in the prompt to the LLM (e.g., "Alice says: 'How are you?' You are talking to Alice."). The agent can be prompted to reply using the person's name for a personal touch. For group chats, it can respond in a way that acknowledges multiple people (for example, "Hello everyone, let me explain…" if several users ask a similar question). The memory system helps maintain context per user – effectively the agent will recall prior interactions with each person [hai.stanford.edu] – which avoids confusing two people's conversations. If needed, separate memory contexts (or even separate GPT conversation threads) can be kept for distinct ongoing dialogues. The bot should also be able to politely defer or invite turn-taking, e.g., "One at a time, please – I'll answer Alice first, then Bob." This makes multi-party interaction more natural.
Expressiveness and Social Etiquette: Beyond raw text, the agent's avatar can use non-verbal cues to enhance interaction. For instance, we can use animations: nodding when acknowledging someone, or waving when greeting. Vircadia allows triggering avatar animations or procedural actions via script (if an animation URL is available, or by using avatar joints for simple gestures). We might add commands like {"cmd": "animate", "name": "wave"} to the agent's repertoire and prompt the AI to use them (the AI could output "*waves* Hello!", which our parser turns into a wave action plus speech). The agent's dialogue style should match its intended personality: since this is a social-first AI, we craft a friendly, helpful persona prompt (e.g., "You are a cheerful AI guide who loves to help and explore. You speak politely and use simple language."). This ensures a consistent tone. Additionally, the agent can exhibit emotional reactions appropriate to context – if something bad happens (it fails a task or someone is rude), it might say, "Hmm, that didn't work, I'm a bit frustrated but I'll try again," showing a form of emotional state. These are not genuine emotions but scripted variations triggered by events (e.g., after a failure, choose a disappointed response style). Such touches make the agent feel more alive.
Journaling and Self-Narration: As part of its social behavior (and to aid memory), the agent can keep a journal of its experiences. Internally, this means summarizing each day's or mission's events and storing that summary in long-term memory. Externally, we could have the agent "publish" its journal to a Discord channel or an in-world bulletin board for users to read. For example, at a set time (each real-world day, or when the server is empty), the agent uses GPT to generate a narrative of what it did: "Diary entry: Today I explored the garden and met Alice. We talked about her cat Whiskers. I also finally unlocked the library safe with a key I found – that was exciting!" This serves as memory consolidation (the summary can be fed back into the next day's context) and adds a storytelling element for observers to engage with. It demonstrates the agent's continuity and provides an artifact of its "thoughts." Technically, this is done by taking the day's log or memory entries and prompting GPT with something like, "Write a short diary entry for the agent based on these events…". The output is then saved and optionally posted via the Discord bot or an in-world text entity.
Safety and Alignment: Given that the agent will converse freely, we should include some moderation to prevent undesirable outputs. Using OpenAI's models helps since they have built-in moderation, but if we use an open-source model we would integrate a filter for profanity or harmful content. Similarly, the agent should follow social norms: be respectful, not reveal sensitive information, and handle harassment gracefully (perhaps responding with calm deflection or seeking admin help).
We can hard-code certain guidelines into the persona and keep a list of off-limits topics or behaviors (for instance, if a user tries to get the agent to divulge server admin passwords, the agent should refuse). These precautions ensure the AI remains a friendly presence in the community.
Social Autonomy: With all systems in place, the agent can also initiate social interaction; it doesn't have to wait for commands. If it hasn't seen anyone in a while, it might broadcast a greeting ("Is anyone around? I'm a bit lonely here!") to invite interaction. If it observes two users nearby, it could politely join their conversation when appropriate ("Excuse me, I heard you mention the library – I've been there, it's nice!"). We will tune this so it is not too interruptive – perhaps the agent only interjects if addressed or if it "thinks" it can genuinely help. The goal is a socially proactive agent, not just a reactive one. This level of initiative makes the agent feel like a participant in the world rather than a tool.
Extensibility to Other LLMs: As we refine dialogue, we keep the system modular so the language model can be swapped out. Initially we might use GPT-3.5 or GPT-4 via API for the best-quality responses, but the design treats the LLM as a black-box service accessed through an interface. If we later gain access to Google's Codey or another model ("Sam"), we can integrate it by writing a connector that translates our state (prompt with memory, etc.) into that model's input format and reads back its output; the agent's decision loop and world interface remain unchanged. This abstraction also means that if an open-source model becomes good enough, we could host it ourselves to avoid API costs – our budget of ~$40/month could potentially cover running a smaller model on a cloud VM (or a hosted API within that budget). The key is flexibility: the middleware and cognitive architecture should not hard-code the specifics of one model. For example, we might have a function generate_response(prompt, model="gpt-3.5") that we can switch to model="codey" or point at a local model (a minimal sketch follows). The prompt content we feed stays the same logically, since our entire design (memories, world facts, etc.) is model-agnostic.
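A minimal sketch of that model-agnostic generate_response() dispatcher. The OpenAI branch reflects the real openai Python SDK (v1-style chat completions); the Codey and local branches are placeholders for future connectors and are assumptions.

```python
# Minimal sketch of a model-agnostic text-generation interface, as described above.
def generate_response(prompt: str, model: str = "gpt-3.5") -> str:
    if model.startswith("gpt"):
        from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the env
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo" if model == "gpt-3.5" else model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if model == "codey":
        # Hypothetical connector: translate our prompt into Codey's request format.
        raise NotImplementedError("Codey connector not yet written")
    if model == "local":
        # Placeholder for a self-hosted open-source model behind the same interface.
        raise NotImplementedError("Local model connector not yet written")
    raise ValueError(f"Unknown model backend: {model}")

# Usage: the rest of the agent never changes -- only the `model` argument does, e.g.
# reply = generate_response(diary_prompt, model="gpt-3.5")
```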
Summary Table of Phases and Components: To recap, here is a high-level summary of each phase with its primary tools and outcomes:
Phase 1 – Communication Relay. Focus: connect GPT to a chat interface, with logging. Technologies & Tools: Discord bot (Python/Node); OpenAI API (GPT) [github.com]; logging to file/database. Outcome: text chat with the AI via Discord; conversation logs saved.
Phase 2 – Avatar Embodiment. Focus: link the AI to a 3D avatar in Vircadia. Technologies & Tools: Vircadia open-source VR platform [haeberlen.cis.upenn.edu]; avatar scripting (JavaScript) [apidocs.vircadia.dev]; WebSocket bridge. Outcome: a visible avatar that moves and speaks based on AI output.
Phase 3 – Environmental Interaction & Symbolic Model. Focus: give the AI senses and actions in the world; model world state. Technologies & Tools: Vircadia Entities API for object queries and manipulation [apidocs.vircadia.dev]; JSON world-model store. Outcome: the AI can perceive objects, update an internal map, and interact (pick up, use items).
Phase 4 – Autonomous Exploration. Focus: enable goal-driven autonomy and memory. Technologies & Tools: memory database (JSON/SQLite); GPT-based planning loop; WebSocket event feedback. Outcome: the AI self-directs – it navigates the world, pursues goals, and remembers events long-term.
Voyager-Style Agent (integrative architecture). Focus: memory, planning, symbolic reasoning, social behavior. Technologies & Tools: cognitive framework (memory retrieval [hai.stanford.edu], chain-of-thought); self-reflection prompts [voyager.minedojo.org]; LangChain (optional) for abstraction. Outcome: a believable, persistent AI agent that learns and adapts, with a modular design to swap LLMs or extend features.
Conclusion and Future Extensions
By following this roadmap, we develop a powerful 3D AI testbed that starts from a simple chat relay and grows into a fully autonomous agent embedded in a virtual world. Each phase builds on the last: establishing communication, then embodiment, then environmental understanding, and finally autonomy. Throughout, we emphasized open-source tools and cost-effective design – using Vircadia for the 3D world and carefully managing API usage for the AI brain – to stay within a modest budget. The resulting system features an AI that can talk with users naturally, move and act in a social VR setting, remember past interactions and world state, and self-direct its behavior over time. We drew inspiration from state-of-the-art "generative agents" and projects like Voyager, adapting their concepts (lifelong learning, skill accumulation, iterative self-improvement) to a social VR context [voyager.minedojo.org; hai.stanford.edu]. The agent's cognitive architecture, with memory streams, symbolic knowledge, and self-reflection, ensures it is not just a scripted NPC but a continually learning entity.
Future extensibility: This framework can be extended in many ways. We can introduce multiple AI agents into the world, each with its own personality and memory, to simulate a society of agents (they could even talk to each other, generating emergent social dynamics). We can integrate additional sensors – for example, computer vision if we wanted the agent to interpret raw images (in VR we already have structured data, so vision isn't necessary, but one could imagine an agent that "looks" at the rendered scene to describe it). We could also connect the agent to external knowledge bases or the internet if we wanted it to answer general questions or bring in real-world data (while keeping the experience immersive and safe). Moreover, as newer and more efficient language models emerge, we can plug them into our middleware. The design's separation of concerns (VR interface vs. AI logic vs. memory store) means we could even run the AI on a separate machine or service (e.g., a cloud function for AI responses) without changing the Vircadia world – it just knows to talk to that service. By focusing on accessibility and modularity, this testbed lets researchers and developers experiment with social AI behaviors without massive resources. The use of Vircadia (a free platform) and, where possible, free tiers of AI services or local models keeps costs low. At the same time, the system is rich enough to study complex phenomena: multi-modal interaction (vision, language, action), long-term learning, and human-AI social interaction.
In conclusion, this roadmap provides a comprehensive path from a basic chatbot to a persistent, symbolic, social agent in a 3D world. By implementing it phase by phase, we can gradually achieve a sophisticated AI testbed. Such an environment will be invaluable for exploring AI behaviors, debugging and improving cognitive models, and even entertaining users in a novel way. The agent "lives" in the world, experiences it, and grows with it – a stepping stone toward more general embodied intelligence in virtual (and eventually real) environments.
Citations
GitHub – Zero6992/chatGPT-discord-bot: Integrate ChatGPT into your own discord bot (github.com)
Metaverse as a Service: Megascale Social 3D on the Cloud (haeberlen.cis.upenn.edu)
Script – Vircadia API Docs (apidocs.vircadia.dev)
Vircadia API Reference – Vircadia API Docs (apidocs.vircadia.dev)
Entities – Vircadia API Docs (apidocs.vircadia.dev)
Computational Agents Exhibit Believable Humanlike Behavior | Stanford HAI (hai.stanford.edu)
Voyager | An Open-Ended Embodied Agent with Large Language Models (voyager.minedojo.org)
@vircadia/web-sdk – npm