1. Introduction
Immersive virtual environments, including augmented reality (AR), mixed reality (MR), and virtual reality (VR), are used in various fields that build on immersive experiences and interactive technologies[1-4]. AR technology renders three-dimensional objects in the real world and allows virtual human-like agents (VHA) to be embedded in a physical space[5]. In VR, virtual humans (VH) are classified into avatars, which reflect the behaviors performed by a specific human, and agents, whose behavior is determined by computer algorithms[6, 7]. A virtual agent that mimics or closely reproduces human behavior and facial expressions increases the social presence of a user in a virtual environment[8-11].
An important issue in existing research on virtual agents is the creation of plausible movements and motions that enable natural interactions. To this end, studies have applied motion capture data as well as high-dimensional human motion constraints and biomechanical constraints[12]. Through participant experiments on personal space and the perception of agents, one study drew on proxemic theory to generate design recommendations for implementing a pervasive AR experience using virtual agents[5]. Other work has blurred the boundary between real and virtual spaces by using agents that exist in both spaces for social interaction[13]. Thus, a human-like agent serves as a system interface by providing natural and intuitive interactions with users. However, existing virtual-agent studies in immersive virtual environments, including AR, MR, and VR, have focused on placing virtual agents in precise or natural positions in the real world for systematic and natural interactions, or on performing human-like actions and motions. Alternatively, most studies have analyzed interactions with virtual agents through experiments in various cases and environments.
Recently, a framework for developing language learning software tools was proposed that uses ChatGPT, an artificial intelligence (AI) service based on a large language model, in AR as software for children to learn foreign languages[14]. Another study recommends, through AR and generative AI, actions that individuals can take toward high-level goals[15]. In VR, research has proposed a collaborative environment that integrates an external framework into VR using an internalized agent based on ChatGPT[16].
This study aims to utilize generative AI, such as ChatGPT, to apply a virtual agent as an interface in an immersive virtual environment. The proposed method focuses on increasing the user's immersion in environments that resemble reality or are difficult to experience in reality, and on supporting the user through an assistant role in interaction. To achieve this, the proposed system applies ChatGPT, an interactive AI service, to a virtual agent within an integrated development environment in the Unity 3D engine that considers both VR and MR user participation. This study proposes a method for applying an interactive AI service as a user's assistant in an immersive virtual environment. The proposed virtual assistant agent (VAA) has two key functions.
Information agent: Interface that instantly provides the user with necessary information in an immersive virtual environment
Control agent: Interface that intuitively provides control over virtual objects in an immersive virtual environment
In the proposed method, the user communicates with the virtual agent through voice, and the virtual agent provides feedback to the user through text or the results of the requested control actions.
2. Virtual Assistant Agent (VAA)
The proposed VAA workflow comprises an extended reality (XR) project, an immersive virtual environment expressed in VR or MR by integrating a VR head-mounted display (HMD) such as the Meta Quest 2 and an MR wearable headset such as the Microsoft HoloLens 2, based on Unity 3D, an authoring tool that supports interactive content production. On this basis, it builds a virtual agent that integrates ChatGPT through OpenAI. ChatGPT is a conversational AI chatbot developed by OpenAI based on GPT-3.5 and GPT-4[17]. This study aims to implement the interaction between immersed users and virtual agents in a way that provides useful information and enables intuitive control from the assistant's point of view. Figure 1 shows the proposed VAA workflow. The input from the user to the virtual agent is voice, and the feedback from the virtual agent to the user is text on a graphical user interface (GUI) or control results.
We designed an integrated development environment for the user experience and interaction implementation in an immersive virtual environment composed of MR and VR. The environment is based on the Unity 3D engine and, for efficient development, considers both AR and MR users with the Microsoft HoloLens 2 through the Mixed Reality Toolkit (MRTK) and VR users with VR HMDs such as the Meta Quest 2. The Unity 3D engine offers the flexibility to integrate the OpenAI application programming interface (API) into this environment. However, the current experience environment is a single development process in which each user participates separately in the immersive virtual environment; it is not a collaborative system in which MR and VR users participate together. Figure 2 shows a simple example of establishing the integrated development environment by setting the OpenAI API key and importing the OpenAI Unity package[18] in the Unity 3D engine. The AR/MR development environment using MRTK is described in Section 3.1, and the VR development environment using the Oculus Integration package is described in Section 3.2.
The first role of the VAA is information transfer via the information agent. The information agent acts as an assistant when users participating in a virtual environment request more diverse or specific information related to the purpose of the experience environment (tourism and education). It converts the user's voice into text, transmits the question query to OpenAI, receives the answer query, and presents it as text on the GUI, as shown in the information agent workflow in Figure 1.
The key point is that, in passing the question query to OpenAI, the role and situation of the VAA in the virtual environment must be set together so that an accurate answer query is generated corresponding to the purpose (education and tourism). Users communicate through the information agent and the GUI. The exchanged text is stored in a cache, so the user can repeatedly check previous content using the GUI list. Figure 3 shows an example of the execution process of the VAA information agent. The agent's feedback could also be converted from text to voice, mirroring the user's input. However, in this study, when a large amount of information and content had to be conveyed, text was used so that it could be reviewed without time constraints, along with graphic content expressed in the virtual environment.
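The cached exchange history that backs the GUI list can be kept in a small data structure. The following Python sketch illustrates the idea; the actual system is a Unity C# component, and the class and method names here are hypothetical.

```python
class ConversationCache:
    """Stores question/answer pairs so a GUI list can redisplay past exchanges."""

    def __init__(self):
        self._history = []  # list of (question, answer) tuples

    def add_exchange(self, question: str, answer: str) -> None:
        """Record one completed question/answer exchange."""
        self._history.append((question, answer))

    def gui_list(self) -> list:
        """Return the exchanges formatted as lines for a scrollable GUI list."""
        lines = []
        for i, (q, a) in enumerate(self._history, start=1):
            lines.append(f"[{i}] Q: {q}")
            lines.append(f"[{i}] A: {a}")
        return lines


# Hypothetical usage: one exchange stored, then redisplayed.
cache = ConversationCache()
cache.add_exchange("What is the liver?", "The liver is the largest internal organ.")
```

Keeping the raw pairs (rather than pre-formatted strings) lets the same cache feed other outputs later, such as a text-to-speech replay of a previous answer.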
An important factor for an information agent is to accurately deliver contextually appropriate answers. For this reason, when generating an answer query through OpenAI, a prompt describing the situation must be set. Table 1 shows examples of prompts for the predefined situations in the proposed information agent. This study implements immersive virtual environments with the themes of anatomical education and landmark tourism. By transmitting the thematic situation to OpenAI together with the question query, an answer query appropriate to the situation can be generated accurately.
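Attaching the situation to each question can be sketched as prepending a theme-specific system prompt to the user's question before the request is sent. A minimal Python sketch follows; the prompt texts and function names are hypothetical stand-ins for the entries in Table 1, and the real system sends the result through the OpenAI Unity package.

```python
# Hypothetical situation prompts per content theme (cf. Table 1 in the text).
SITUATION_PROMPTS = {
    "education": (
        "You are an assistant in an anatomy-education MR environment. "
        "Answer questions about human anatomy concisely."
    ),
    "tourism": (
        "You are a tour guide in a VR landmark-tourism environment. "
        "Answer questions about the landmark the user is viewing."
    ),
}


def build_messages(theme: str, question: str) -> list:
    """Prepend the situation prompt so the answer query matches the theme."""
    return [
        {"role": "system", "content": SITUATION_PROMPTS[theme]},
        {"role": "user", "content": question},
    ]


messages = build_messages("education", "What does the pancreas do?")
```

The same question sent with a different theme key would receive an answer framed for that context, which is the behavior the paper relies on for accurate, situation-specific answer queries.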
The second role of the VAA is virtual scene control via the control agent. This agent participates in the virtual environment more intuitively and directly controls the objects in it. Typically, users control (grab, move, and throw) virtual objects using their hands or controllers. However, an additional GUI may be required for control processes such as changing the properties of a virtual object. Therefore, this study presents a method for controlling objects based only on voice input through OpenAI.
The user's voice is converted into text, and the control request query is transmitted to OpenAI along with the script-creation conditions, as shown in the control agent workflow in Figure 1. The agent receives code implementing the object-control function and dynamically creates a control script. Here, the key is to set the control request query together with the commands necessary for script creation as conditions, similar to the information agent. The commands required for script creation, such as necessary libraries and variable definition conditions, are directly related to the accuracy of the control agent. Finally, the control agent adds and executes the dynamically created script as a menu item in the editor or as a component of an empty object. This structure was redesigned to suit the proposed control agent workflow of the VAA, in contrast to an existing project[19] that is accessed through a dialog and the editor's custom menu. When a dynamically generated script is executed, the object-control result that satisfies the user request can be checked. Figure 4 shows the VAA performing the actions requested by the user through the control agent as an interface and providing the results as feedback.
For the proposed control agent, the key is the process of dynamically generating scripts that perform the required actions. In addition, dynamically generated scripts must not contain errors that affect the normal execution of the project. Therefore, when generating a script from the control request query through OpenAI on the Unity 3D engine, the necessary environment settings, factors that cause syntax errors, and basic properties are defined together as script-creation condition prompts. Table 2 shows an example of setting prompts for the instructions, grammar, properties, and other requirements of the immersive virtual environment produced in this study.
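The control-agent pipeline described above has two illustrative steps: attach the script-creation conditions to the user's request, and strip any markdown fencing from the model reply before the C# source is compiled. The Python sketch below shows both steps; the condition text and function names are hypothetical (the actual conditions are in Table 2, and the actual compilation happens inside the Unity editor).

```python
# Hypothetical script-creation conditions attached to every control request
# (cf. Table 2): required namespaces, structure, and output-format rules.
SCRIPT_CONDITIONS = (
    "Write a complete Unity C# MonoBehaviour. "
    "Use only the UnityEngine namespace. "
    "Reply with the C# source code only, inside a single code block."
)


def build_control_prompt(request: str) -> str:
    """Combine the user's control request with the creation conditions."""
    return f"{SCRIPT_CONDITIONS}\nRequest: {request}"


def extract_code(reply: str) -> str:
    """Strip a markdown code fence from the model reply, if present."""
    text = reply.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        body = lines[1:]  # drop the opening fence (may carry a language tag)
        if body and body[-1].strip() == "```":
            body = body[:-1]  # drop the closing fence
        return "\n".join(body)
    return text


# Hypothetical model reply for a control request.
reply = "```csharp\nusing UnityEngine;\npublic class Mover : MonoBehaviour {}\n```"
script_source = extract_code(reply)
```

The fence-stripping step matters because a stray backtick line in the extracted source is exactly the kind of syntax error that, per the text, would prevent the dynamically created script from compiling.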
3. Immersive Virtual Environment
We constructed an environment in which users experience the VAA in an immersive virtual environment with MR and VR. To implement the virtual environment, the MR user uses a Microsoft HoloLens 2 device and the VR user uses a Meta Quest 2 HMD. However, because the purpose of the proposed virtual environment is to utilize the agent as an interface application that supports the user experience, interactions such as hand tracking, input processing using a controller, and menu control using a GUI are kept to a minimum. As defined above, the development environment integrates plug-ins and development toolkits that can control each device in the Unity 3D software.
The MR user wears a Microsoft HoloLens 2 device and uses a minimal hand-based interface to interact with the VAA. The MR virtual environment is developed by integrating the MRTK into the Unity 3D software. The implementation imports the MRTK packages (MRTK Foundation and MRTK Standard Assets) into the Unity 3D project and configures the Unity OpenXR plug-in. In this process, MRTK configuration profiles, such as the HoloLens 2 hand interaction profile, are set to integrate with Unity[20].
The MR content involves anatomical education, and the 3D human body information necessary for education is augmented and displayed. In addition to the basic information provided in the content, users ask questions through the information agent interface and manipulate the 3D human body information through the control agent interface (Figure 5(a)).
The VR user wears a Meta Quest 2 HMD and interacts with the VAA using their hands, as in MR. VR development was performed by importing the Oculus Integration package provided for the Unity 3D engine. Because functions and scripts, including cameras and interactions, are provided in the Oculus development tool folder, developers work by registering prefabs or components or by modifying and editing the scripts.
The purpose of the proposed VR content is landmark tourism. During the course of the experience, the user receives landmark information through the information agent interface and controls the landmark 3D model using the control agent interface from the VAA (Figure 5(b)).
There are existing cases of spiritual chats conducted through a ChatGPT NPC (non-player character) in VR[21]. Such attempts enable natural communication with users through realistic NPCs acting as avatars or agents. In contrast, this study applies the agent as an interface in an assistant (secretary) role rather than for direct communication.
4. Experimental Results and Analysis
The integrated development environment and the MR and VR content were implemented using MRTK (1.0.2209.0) and the Oculus Integration SDK based on Unity 2022.2.10f1 (64-bit). The MR user uses a Microsoft HoloLens 2, and the VR user uses a Meta Quest 2 HMD. The PC for the integrated development environment and experiments was equipped with an AMD Ryzen 7 4800H (2.9 GHz), 16 GB of RAM, and a GeForce GTX 1660 Ti GPU. Figure 5 shows the information and control agents of the proposed VAA executing in the produced MR and VR content. MR content for educational purposes and VR content for tourism purposes were produced, and the results confirm that information transfer and control of virtual objects desired by users are performed through the VAA's interface application.
The experiment was conducted by evaluating the information and control agents of the proposed VAA separately according to their characteristics. First, for the information agent, the time from the question query to the feedback of the answer query was measured. Because this study used voice as the input method for generating question queries, the time from the user's voice input to text conversion was also measured and recorded separately (Table 3). Sentence length affects both speech-to-text (STT) and answer query generation times. STT took an average of about 2 s to create text, with a maximum of about 4 s when content irrelevant to the input voice was produced. The critical measure is the time from the generation of the question query to the return of the answer query: it takes about 6 s on average, but considerably longer (producing more than 400 words) when the answer query is unrelated to the question or intentionally requires a long and specific answer.
| Time (s) | Mean | Min | Max |
|---|---|---|---|
| STT (Speech To Text) | 2.892 | 1.867 | 4.881 |
| Information agent (question query to answer query) | 6.072 | 0.799 | 20.880 |
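Per-stage latencies like those in the table above can be collected by timing each call and aggregating the samples. The sketch below is a minimal Python illustration (the function names are hypothetical, and the placeholder workload stands in for the real STT or OpenAI call, which is not reproduced here).

```python
import time


def timed_call(fn, *args):
    """Run fn and return (result, elapsed_seconds) for one latency sample."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def summarize(samples):
    """Mean/min/max over a list of latency samples, as reported in Table 3."""
    return {
        "mean": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder workload standing in for an STT or answer-query call.
_, sample = timed_call(lambda: sum(range(1000)))
stats = summarize([1.867, 2.892, 4.881])
```

Using a monotonic clock such as `time.perf_counter` avoids skew from system clock adjustments, which matters when individual samples are only a few seconds long.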
The performance experiment for the control agent is described below. It is important for the control agent to determine whether a user’s control request has been completed accurately. Therefore, an experiment was conducted to measure the accuracy of the control results based on a control request query. In the experiment, two control request queries were set for both MR and VR, and the control results were confirmed. The control request query is composed of commands related to object manipulation necessary for the experience in the proposed virtual environment for tourism and education purposes as follows.
MR Content
M1. Move the digestive gameobject to the user gameobject position.
M2. Change the scale of the digestive gameobject to double.
VR Content
V1. Change the color of the figure gameobject.
V2. Rotate the figure gameobject by 30 degrees around the y-axis.
In the experiment, similarly phrased versions of each control request query were sent 100 times to confirm the control results. The agent exhibited an accuracy of more than 90%. Some requests produced different results or generated an incorrect control script; however, the overall probability of success was high (Table 4). In addition, the control agent must create a script from the control request query and reflect it back into the virtual environment, so it has the limitation of requiring more time than the information agent, which simply creates an answer query.
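The accuracy figure above is simply the fraction of trials in which the generated script performed the requested control correctly. A minimal Python sketch, with hypothetical trial outcomes (the real per-query counts are in Table 4 and are not reproduced here):

```python
def accuracy(outcomes):
    """Fraction of successful control results over repeated trials.

    `outcomes` is a list of booleans: True if the generated script performed
    the requested control correctly, False otherwise.
    """
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for one query repeated 100 times: 92 successes,
# consistent with the reported ">90%" but not an actual measured value.
trial_outcomes = [True] * 92 + [False] * 8
rate = accuracy(trial_outcomes)
```
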
Lastly, a survey was conducted to analyze satisfaction with the interface application of the VAA, composed of the information and control agents. The survey included 7 participants (6 male, 1 female) between the ages of 24 and 28. All participants had prior experience interacting through a GUI in 3D interactive content. The purpose of the survey was to analyze usefulness, ease of use, ease of learning, and satisfaction with the proposed interface application. Scores were recorded on a 7-point scale using the 4 factors and 30 questions of the USE (Usefulness, Satisfaction, and Ease of use) Questionnaire by Lund[22]. Table 5 shows the statistics for the information agent and the control agent based on the survey results.
First, the information agent showed high satisfaction, with scores of 6.0 or higher for all factors. The process of learning and using the new interface was easy, and the results showed that the necessary information could be explored effectively. In contrast, responses to the control agent varied across participants. Its input and operation method is the same as that of the information agent, but the messages entered for control are relatively specialized and technical. Inaccurate control request queries cause results that differ from the intention, making it difficult to deliver sophisticated messages to the agent. Consequently, the scores for all factors except ease of learning were slightly lower than those of the information agent. The related limitations and problems are discussed in detail in Section 5. A one-way ANOVA (analysis of variance) between the two agents showed significant differences in usefulness and ease of use. By contrast, there was no significant difference in ease of learning, because both agents are based on the same GUI and voice input. Notably, there was no significant difference in satisfaction either, which reflects the characteristics of the participants: those with development experience, who had no difficulty formulating technical control request queries, reported high satisfaction, whereas those with relatively little knowledge and experience showed a widening gap. This spread in satisfaction scores is analyzed as the reason the difference did not reach significance, unlike ease of use, which showed a consistent pattern of results.
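The one-way ANOVA used above compares between-group variance to within-group variance via an F statistic. The Python sketch below computes F directly from the standard sums of squares; the rating data are hypothetical placeholders on the 7-point scale, not the actual survey scores, which appear in Table 5.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group mean square divided
    by within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares, with k - 1 degrees of freedom.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    df_between = len(groups) - 1
    # Within-group sum of squares, with N - k degrees of freedom.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical 7-point ratings from 7 participants per agent (not the
# paper's measured data).
info_agent = [6.2, 6.4, 6.1, 6.5, 6.3, 6.0, 6.2]
ctrl_agent = [4.8, 5.1, 4.5, 5.0, 4.7, 4.9, 4.6]
f_stat = one_way_anova_f([info_agent, ctrl_agent])
```

With two groups of 7 participants, the resulting F is compared against the F distribution with (1, 12) degrees of freedom to judge significance; a large F indicates the between-agent difference dominates within-participant variation.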
5. Limitation and Discussion
This study applies the information and control agents of the proposed VAA as an interface so that users can obtain the information or actions they want in an immersive virtual environment. The main aim is to resolve the inconvenience of manipulating a 3D GUI with hands or controllers while presenting a novel direction in which requests can be processed intuitively with voice input alone. However, the following limitations exist in the structure of the proposed workflow and integrated development environment.
Both the information and control agents generate the contents of answer queries and control scripts through OpenAI. Therefore, a fundamental time-delay problem arises until the output is generated. In particular, when a chat is started for the first time, delays occur and output unrelated to the question or request query may be generated. The existing web-based ChatGPT lets users watch the answer being generated, stop it, and request a new answer if it appears unintended. However, when a completed answer is received through OpenAI, as in the proposed VAA, this intermediate process cannot be checked, so inaccurate or unintended responses may be delivered. The experiments confirmed that this problem diminished after repeated questions and answers during the information agent performance experiment. Therefore, it can be mitigated by running a warm-up question-and-answer stage when content execution begins. In addition, a user-centered evaluation is required to determine whether the current delay time is acceptable for an interface.
When a control request query transmitted to the control agent of the VAA is not phrased specifically, an incorrect control script is generated or an error occurs. The information agent can handle any question, but the control agent achieves high accuracy, as in the previous experiment, only when the control request query uses specific terms (gameobject and position) from the developer's viewpoint. However, because the conditional script commands for generating scripts can be defined in advance, this limitation can be addressed by specifying concrete conditions for non-developers and presenting rules for composing control request queries.
The most important issue with the control agent is the creation of the control script. Scripts that do not exist in the project are dynamically created according to the control request query and reflected in the virtual environment; therefore, the project must be recompiled. Consequently, the environment to which the proposed VAA can be applied does not yet support build mode and operates only in editor mode. This problem stems from the basic concept of the existing AI command project[19], which this study extended into a method for applying the VAA in an immersive virtual environment. Ultimately, it can be solved by defining the actions to be controlled as control scripts in advance and then finding and applying the appropriate script to the object, rather than creating a control script through the control agent.
6. Conclusion
We designed a VAA that takes the role of an assistant, as part of an interface, to provide users with various actions and conveniences in immersive virtual environments including MR and VR. It consists of an information agent that effectively delivers the information the user needs and a control agent that intuitively handles the user's control actions in interaction with the virtual environment. This interface application is intended to let users easily and quickly search for information and control the environment and its objects. We designed an integrated development environment and directly produced an immersive virtual environment with MR and VR content. Through performance experiments on applying the proposed method as an interface, we comprehensively analyzed its usability along with its limitations and problems.
In the future, we will systematically analyze and supplement the limitations of the proposed VAA and improve its technical limitations for application as an interface. In addition, we aim to verify the possibility of utilization by producing various types of content that can be applied to immersive virtual environments. Finally, we will conduct a survey experiment for interface applications through user evaluation and comparison experiments between the proposed VAA-based interface and existing interfaces in an immersive virtual environment.