New Advances in AI for Controlling Multiple Game Characters at Once

What if an AI model could steer several video game characters at once, each with its own commands? Recent research listed on Papers with Code is moving in that direction. Three papers released on April 2, 2026, present techniques that could change how AI models understand and control game environments.

Solving Multi-Agent Control with ActionParty

ActionParty, presented by researchers including Alexander Pondaven and Igor Gilitschenski, targets a long-standing limitation of video generation models. Existing video diffusion models have largely been limited to controlling a single character in a game scene. It is like a dance party where you can direct only one dancer’s moves. The paper focuses on the core problem behind this limitation, known as ‘action binding’.

Action binding, simply put, means correctly linking ‘who does what’. For example, with three characters in a game, the model must be able to make the first jump, the second run, and the third stand still, all from commands issued at the same time. In earlier models this link was weak, so commands got mixed up or the wrong character responded.
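To make the idea concrete, here is a minimal Python sketch of what a correct binding between characters and commands looks like, and how a swapped binding could be detected. It is only an illustration of the problem statement: the names AgentAction, bind_actions, and binding_accuracy are hypothetical and are not taken from the ActionParty paper, and the sketch says nothing about how the model itself enforces the binding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAction:
    agent_id: int  # which character the command is for
    action: str    # what that character should do


def bind_actions(agent_ids, actions):
    """Pair each agent with exactly one action, rejecting ambiguous input."""
    if len(agent_ids) != len(actions):
        raise ValueError("each agent needs exactly one action")
    if len(set(agent_ids)) != len(agent_ids):
        raise ValueError("agent ids must be unique")
    return [AgentAction(a, act) for a, act in zip(agent_ids, actions)]


def binding_accuracy(commanded, observed):
    """Fraction of agents whose observed action matches the commanded one."""
    observed_by_agent = {o.agent_id: o.action for o in observed}
    hits = sum(
        1 for c in commanded if observed_by_agent.get(c.agent_id) == c.action
    )
    return hits / len(commanded)


# Three characters, three different simultaneous commands.
commands = bind_actions([0, 1, 2], ["jump", "run", "idle"])

# Hypothetical failure case: the actions of characters 0 and 1 are swapped.
swapped = bind_actions([0, 1, 2], ["run", "jump", "idle"])

print(binding_accuracy(commands, commands))  # every character did its own action
print(binding_accuracy(commands, swapped))   # only character 2 was bound correctly
```

In this toy setup, a perfectly bound result scores 1.0, while the swapped case scores one third, because only the idle character received its own command.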

ActionParty introduces a new mechanism that explicitly ties each character to its own action. Beyond game AI, the approach could be useful in areas such as robot simulation and multi-agent learning for autonomous vehicles. In game development, it is expected to help with implementing natural cooperative behavior among NPCs (non-player characters).

Ultra-Realistic Training Environment from AAA Game Data

Generative World Renderer, released by Zheng-Hui Huang and colleagues, takes on the quality of AI training data directly. Until now, AI models have been trained mainly on synthetic data, which fails to capture the complexity and diversity of the real world. It is like a student who has studied only from textbooks and freezes when faced with a real situation.

The paper presents a large-scale dynamic dataset extracted from AAA games, meaning titles with top-tier graphics and physics engines. Using a novel dual-screen stitched capture method, the researchers collected 4 million continuous frames at 720p resolution and 30 frames per second. That amounts to roughly 37 hours of gameplay footage.
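The 37-hour figure follows directly from the frame count and frame rate quoted above. A short Python check, using only those two reported numbers (the function name footage_hours is just for illustration):

```python
FRAMES = 4_000_000  # frame count reported in the article
FPS = 30            # capture rate reported in the article


def footage_hours(frames, fps):
    """Convert a frame count at a fixed frame rate into hours of footage."""
    seconds = frames / fps
    return seconds / 3600


hours = footage_hours(FRAMES, FPS)
print(f"{hours:.1f} hours")  # prints "37.0 hours"
```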

More importantly, they captured not only RGB video but also five G-buffer channels alongside it. A G-buffer holds the per-pixel scene information a game produces during rendering, such as depth, surface normals, and material properties. This information helps a model understand three-dimensional space and the physical characteristics of objects. Covering diverse scenes, visual effects, and environments, the dataset is expected to improve the realism and temporal coherence of generative inverse and forward rendering.

Technology That Tells AI ‘What to Look At’

Steerable Visual Representations, presented by Jona Ruthardt, Deva Ramanan, and others, points to a basic limitation of pretrained vision models. Pretrained Vision Transformers such as DINOv2 and MAE excel at extracting generic image features, but they share one problem: they tend to focus on whatever is most salient in the image.

Shown a street photo, for example, such a model tends to attend to large buildings or bright signs and to ignore less conspicuous elements, such as a small sign in a corner or a particular species of tree. In real applications, though, these less prominent concepts are often the ones that matter.

The work draws on the strengths of multimodal large language models. If text can tell the model ‘what to look at’, the same image can yield different features depending on the situation. This is called ‘steerable visual representations’. Much like moving a camera’s focus to wherever you want, the model’s ‘gaze’ can be guided toward what the user cares about.
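One simple way to picture steering is query-weighted pooling: a vector derived from the text prompt scores each image patch, and the pooled feature leans toward the patches that match. The Python sketch below illustrates that general idea only. It is not the architecture from the paper; the two-dimensional feature vectors, the query vectors, and the function steered_pool are all made-up assumptions for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def steered_pool(patch_features, query, temperature=1.0):
    """Pool patch features with weights given by similarity to a query."""
    scores = [dot(p, query) / temperature for p in patch_features]
    weights = softmax(scores)
    dim = len(patch_features[0])
    pooled = [
        sum(w * p[i] for w, p in zip(weights, patch_features))
        for i in range(dim)
    ]
    return pooled, weights


# Toy features (assumed): patch 0 stands for a large building,
# patch 1 for a small sign in the corner of the same image.
patches = [[1.0, 0.0], [0.0, 1.0]]

# Toy query vectors (assumed) standing in for encoded text prompts.
query_building = [5.0, 0.0]
query_sign = [0.0, 5.0]

_, w_building = steered_pool(patches, query_building)
_, w_sign = steered_pool(patches, query_sign)

print(w_building)  # most weight on patch 0
print(w_sign)      # most weight on patch 1
```

The same two patches produce different pooled features depending on the query, which is the behavior the camera-focus analogy describes.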

The technique can be applied to downstream tasks such as image retrieval, classification, and segmentation. It looks especially valuable where fine-grained, goal-specific control is needed, for example finding a particular lesion in medical imaging or recognizing a particular traffic sign in autonomous driving.

The Future of AI Vision, as Sketched by Three Studies

The three papers are independent, but read together they complement one another. ActionParty addresses ‘who does what’, Generative World Renderer addresses ‘how to make it look realistic’, and Steerable Visual Representations addresses ‘what to focus on’.

What might become possible if the three were combined? Game developers could automatically generate and test complex multiplayer scenarios. Roboticists could simulate multiple cooperating robots in realistic environments and train each robot to focus on a specific goal. Filmmakers could produce natural interactions between CG characters faster and more efficiently.

Practical Applicability and Challenges

Several challenges remain before these techniques reach industry. The first is computational cost: video diffusion models and large language models both require substantial computing resources, so further optimization will be needed before they can run in real-time games or robot control.

The legal and ethical side of data collection also needs attention. Extracting data from AAA games can raise copyright issues, so cooperation with game developers or appropriate licensing is essential. How the researchers handled this should be checked in the papers themselves.

The Role of the Open-Source Ecosystem

That these papers are listed on Papers with Code matters. The platform pairs papers with implementation code, which improves reproducibility and lets other researchers verify and build on the work quickly. The papers are also hosted on Hugging Face, where active community feedback and collaboration can be expected.

The open-source approach also helps democratize AI research. Startups, academic groups, and individual developers, not just large tech companies, can access the latest techniques and apply them to their own projects. That can speed up innovation and surface unexpected applications.

Next Steps in AI Vision Technology

The direction these studies point in is clear. AI needs to go beyond processing data: it must understand complex environments, control multiple agents at once, and adjust what it attends to as the situation requires. Capabilities like these can be seen as a step toward artificial general intelligence.

Notably, all three focus on practical problems. They reflect real needs in game development, robotics, and computer vision rather than academic curiosity alone, which raises the chances that the results will be commercialized quickly.

In the coming years, these techniques are likely to appear, in integrated form, across a range of products and services: smarter and more natural NPCs in games, more realistic CG in film, and more sophisticated collaboration among robots. A new chapter in AI vision technology is opening.

Zyss News