From Reasoning-Driven Image Generation to Dynamic World Modeling: New Horizons in Multimodal AI

Multimodal technology that handles text and images together is evolving rapidly. Recent research listed on Papers with Code shows the field moving beyond simply generating or analyzing images, toward producing images through multi-step reasoning, modeling worlds that change in real time, and processing visual information efficiently.

Unifying Reasoning and Generation: UniGRPO’s Integrated Approach

UniGRPO (Unified Policy Optimization for Reasoning-Driven Visual Generation), published by Jie Liu and colleagues on March 24, 2026, is a reinforcement learning framework for unified models that generate interleaved text and images. The paper focuses on combining autoregressive modeling for text generation with flow matching for image generation.

‘Autoregressive modeling’ predicts the next word from the words generated so far, much as a writer chooses the next word by looking at what is already on the page. ‘Flow matching’ gradually transforms simple noise into the desired image and has recently gained attention in image generation.
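The two generation styles can be contrasted in a few lines of Python. This is only a toy sketch, not UniGRPO's implementation: the bigram table, the function names, and the "oracle" velocity that points straight at a known target are all assumptions made for illustration. A real model would learn both the next-word distribution and the velocity field from data.

```python
import random

# Hypothetical bigram table standing in for a learned language model.
BIGRAMS = {
    "<s>": ["a"],
    "a": ["red"],
    "red": ["cube"],
    "cube": ["</s>"],
}

def generate_text(max_tokens=10, seed=0):
    """Autoregressive loop: each new token depends on what was generated before."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        nxt = rng.choice(BIGRAMS[tokens[-1]])  # this toy conditions on the last token only
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start marker

def generate_by_flow(target, steps=8, seed=0):
    """Euler integration of dx/dt = v(x, t) from noise (t=0) toward data (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    for k in range(steps):
        t = k / steps
        # Oracle velocity for a straight path to one known target;
        # a trained network would predict this from (x, t) instead.
        v = [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

print(generate_text())
print(generate_by_flow([0.5, -1.0]))
```

The text loop emits one discrete token at a time, while the flow loop refines a whole continuous vector over several steps. That difference is why combining the two under one training objective is not trivial.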

The core of UniGRPO is that it brings these two different generation methods into a single reinforcement learning framework. Reinforcement learning is a method in which a model learns through trial and error; here, it learns how to generate better text and images. Going beyond simply imitating data, this opens the possibility of producing more creative and contextually appropriate outputs by reasoning first.
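The article does not describe how UniGRPO turns rewards into a learning signal. The name suggests a relation to Group Relative Policy Optimization (GRPO), in which several outputs are sampled for the same prompt and each one is scored relative to the others in its group. The sketch below shows only that generic normalization step; the function name and the example rewards are hypothetical, and nothing here should be read as the paper's exact procedure.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four outputs generated from the same prompt.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
print(advantages)
```

Outputs that beat the group average receive a positive advantage and are reinforced, while below-average outputs are discouraged. No separate value network is needed to provide the baseline.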

This research is noteworthy because it addresses the limitations of approaches that handle text and images separately and points toward unified models that move naturally between the two modalities. For example, generating an image that matches a complex description, or writing a detailed description of an image, can happen within a single model.

Overcoming Real-World Noise in Optical Flow Estimation

DA-Flow (Degradation-Aware Optical Flow Estimation with Diffusion Models), published by Jaewon Min and colleagues on the same day, addresses optical flow estimation, one of the fundamental tasks in computer vision. Optical flow describes, pixel by pixel, how objects move between consecutive video frames. It is essential when an autonomous vehicle tracks the movement of surrounding vehicles or when a video editor tracks an object.
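To make the idea of optical flow concrete, here is a minimal sketch that estimates the motion of one small patch by brute-force block matching. This is a classical textbook technique, not the method used in DA-Flow, and the helper names and the tiny 5x5 frames are made up for illustration.

```python
def make_frame(top, left, size=5):
    """Build a size x size frame with a bright 2x2 square at (top, left)."""
    frame = [[0] * size for _ in range(size)]
    for i in range(2):
        for j in range(2):
            frame[top + i][left + j] = 9
    return frame

def patch_sad(f0, f1, y, x, dy, dx, size):
    """Sum of absolute differences between a patch in f0 and a shifted patch in f1."""
    return sum(
        abs(f0[y + i][x + j] - f1[y + dy + i][x + dx + j])
        for i in range(size)
        for j in range(size)
    )

def block_flow(f0, f1, y, x, size=2, radius=1):
    """Return the (dx, dy) shift that best explains where the patch at (y, x) moved."""
    h, w = len(f0), len(f0[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= y + dy and y + dy + size <= h
                      and 0 <= x + dx and x + dx + size <= w)
            if not inside:
                continue  # skip shifts that would leave the frame
            cost = patch_sad(f0, f1, y, x, dy, dx, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best

f0 = make_frame(top=1, left=1)
f1 = make_frame(top=1, left=2)  # the square moved one pixel to the right
print(block_flow(f0, f1, y=1, x=1))  # → (1, 0)
```

Matching raw pixel values like this works on clean frames but becomes unreliable once blur or noise changes those values. That is the failure mode a degradation-aware method aims to avoid.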

The problem is that models trained on clean laboratory data often perform much worse in real environments, because real-world video contains blur, noise, compression artifacts, and other forms of degradation. DA-Flow proposes a ‘degradation-aware’ approach that takes these degradations into account.

This research uses diffusion models, which have recently shown excellent performance in image generation and which learn to restore an original from noisy data. DA-Flow applies this principle to optical flow estimation and is designed to extract accurate motion information even from degraded video.
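The "restore an original from noisy data" principle can be sketched with the standard noising formula used in many diffusion models. The code below is a generic illustration and not DA-Flow's pipeline: the function names are hypothetical, and the true noise is passed in where a trained network would supply its own prediction.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward step: blend the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
          for a, e in zip(x0, eps)]
    return xt, eps

def estimate_clean(xt, eps_hat, alpha_bar):
    """Invert the forward step given a prediction of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_hat)]

rng = random.Random(0)
clean = [0.2, -0.7, 1.5]
noisy, true_noise = add_noise(clean, alpha_bar=0.5, rng=rng)
# A trained model would predict the noise; the true noise is used here
# only to show that a perfect prediction recovers the original.
print(estimate_clean(noisy, true_noise, alpha_bar=0.5))
```

Training amounts to making the noise prediction as accurate as possible across many noise levels, so the quality of the restored signal depends on how well that prediction generalizes to the degradations seen in practice.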

From a practical perspective, this research is significant. Real-world applications such as autonomous driving, robot vision, and video surveillance cannot always expect perfect image quality. Optical flow estimation that works reliably in rain, fog, low light, or with a fast-moving camera can greatly improve the reliability of AI systems.

AI Learning Game Worlds: The WildWorld Dataset

WildWorld, published by Zhen Li and colleagues, is a large-scale dataset for dynamic world modeling. The research views the world through dynamical systems theory and reinforcement learning: the world evolves as latent-state dynamics driven by actions, and visual observations provide only partial information about that state.

Simply put, just as the scenery changes when a game character walks forward, the AI learns how the state of the world changes when an action is taken. Recent video world models attempt to learn these action-conditioned dynamics from data, but existing datasets have had limitations.
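The view of a world as hidden state, actions, and partial observations can be written as a toy simulator. The sketch below is purely illustrative: the `State` fields, the action names, and the rules are invented for this example and have nothing to do with the games or the format used by WildWorld.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    x: int        # character position, visible on screen
    stamina: int  # hidden variable that never appears in the frame

def step(state, action):
    """Transition s_{t+1} = f(s_t, a_t) for a toy one-dimensional world."""
    if action == "forward" and state.stamina > 0:
        return State(state.x + 1, state.stamina - 1)
    if action == "rest":
        return State(state.x, state.stamina + 1)
    return state  # no stamina left, or unknown action: nothing changes

def observe(state):
    """Observation o_t = g(s_t): only part of the state is visible."""
    return {"x": state.x}

def rollout(state, actions):
    """Apply the actions in order and collect what an observer would see."""
    frames = [observe(state)]
    for action in actions:
        state = step(state, action)
        frames.append(observe(state))
    return frames

print(rollout(State(x=0, stamina=1), ["forward", "forward", "rest", "forward"]))
```

The second "forward" has no visible effect because the hidden stamina has run out. A model that sees only the frames has to infer that hidden variable, which is why observations alone give an incomplete picture of the dynamics.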

WildWorld’s distinctive feature is that it is a large-scale dataset collected from action role-playing game (ARPG) environments and that it includes actions and explicit state information. ARPG is a genre in which characters take varied actions and interact within complex environments, which makes it a suitable setting for learning real-world complexity. Rather than providing video alone, the dataset explicitly records the game state and the player’s action at each moment, so a model can learn causal relationships more accurately.
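A dataset that stores frames together with actions and explicit state can be pictured as a sequence of records like the ones below. The schema is an assumption made for illustration: the class name `Step`, its fields, and the example values are hypothetical, since the article does not describe the actual file format of WildWorld.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

@dataclass
class Step:
    frame_id: int             # index of the video frame (hypothetical field)
    action: str               # player input recorded at this frame
    state: Dict[str, float]   # explicit game state recorded at this frame

Transition = Tuple[Dict[str, float], str, Dict[str, float]]

def transitions(episode: List[Step]) -> Iterator[Transition]:
    """Yield (state, action, next_state) triples from consecutive records."""
    for cur, nxt in zip(episode, episode[1:]):
        yield cur.state, cur.action, nxt.state

episode = [
    Step(0, "attack", {"enemy_hp": 10.0}),
    Step(1, "attack", {"enemy_hp": 7.0}),
    Step(2, "idle", {"enemy_hp": 4.0}),
]
for state, action, next_state in transitions(episode):
    print(state, action, next_state)
```

With video alone, a model must guess from pixels why the scene changed. With triples like these, the cause (the action) and the effect (the state change) are both available as supervision.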

This world modeling technology can be applied beyond game AI to fields such as robotics, simulation, and prediction systems. For example, a robot that can simulate the consequences of its actions in a new environment before acting can plan more safely and efficiently.

Efficient Vision-Language Models That See Only What’s Needed

VISion On Request, by Adrian Bulat and colleagues, presents a new approach to improving the efficiency of large vision-language models (LVLMs). A vision-language model answers questions about images or generates descriptions of them; such models have advanced rapidly but are very expensive to run.

Existing efficiency methods have mainly focused on ‘visual token reduction.’ A token is the basic unit of information a model processes: an image is divided into many small patches, and each patch is treated as one token. Reducing the number of tokens reduces computation but risks discarding important visual information.
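A quick calculation shows why the number of visual tokens matters. The sketch below counts patch tokens for a given image and patch size and then compares the number of token pairs that full self-attention would have to score. The image size, patch size, and text length are example values, and the assumption that every token attends to every other token is a simplification rather than a description of any specific model.

```python
def num_patch_tokens(height, width, patch):
    """Number of tokens when an image is cut into patch x patch squares."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(text_tokens, visual_tokens):
    """Token pairs scored by full self-attention over the combined sequence."""
    n = text_tokens + visual_tokens
    return n * n

visual = num_patch_tokens(336, 336, 14)   # 24 x 24 patches = 576 tokens
full = attention_pairs(64, visual)        # 640 * 640 = 409600 pairs
halved = attention_pairs(64, visual // 2) # 352 * 352 = 123904 pairs
print(visual, full, halved)
```

Under these assumptions, halving the visual tokens cuts the attention work to less than a third, which explains the appeal of token reduction. The cost is that the discarded patches are gone for every later question about the image.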

The core idea of VISion On Request is ‘sparse, dynamically selected, vision-language interactions.’ That is, instead of always processing all visual information, it selectively processes only the parts needed for the current task. This is similar to how a person looking at a complex picture does not take in everything at once but focuses on specific parts as needed.

This approach goes beyond simply reducing token count, dynamically determining which visual information is important according to the task context. For example, for the question ‘Is there a cat in this photo?’, it focuses on animal shapes, while for ‘What color is this room?’, it focuses on color information. This can improve both computational efficiency and performance simultaneously.
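The idea that different questions call for different visual information can be illustrated with a simple relevance ranking. The article does not specify the mechanism VISion On Request uses, so the sketch below is a generic stand-in rather than the paper's method: each visual token is scored against a query vector by dot product, and only the top-scoring tokens are kept. The two-dimensional feature vectors and the function name are made up for the example.

```python
def select_tokens(query, tokens, k):
    """Return the indices of the k tokens most relevant to the query."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep the original spatial order

# Hypothetical features: first dimension ~ "shape", second ~ "color".
tokens = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [0.8, 0.0]]

shape_query = [1.0, 0.0]   # e.g. a question about whether a cat is present
color_query = [0.0, 1.0]   # e.g. a question about the color of the room
print(select_tokens(shape_query, tokens, k=2))  # → [0, 3]
print(select_tokens(color_query, tokens, k=2))  # → [1, 2]
```

The same image yields a different subset for each question, and no information is discarded permanently. That is the practical difference between query-dependent selection and a fixed reduction of tokens up front.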

Technologies Opening the Future of Multimodal AI

The studies covered here show that multimodal AI is evolving beyond simple input-output processing in three key directions: reasoning, adaptation, and efficiency. UniGRPO points toward more creative and contextually appropriate outputs through reasoning-based generation, DA-Flow toward adaptation to imperfect real-world conditions, WildWorld toward understanding complex dynamic environments, and VISion On Request toward efficient information processing.

Particularly noteworthy is that all of these studies emphasize practicality. Each shows a clear effort to build AI systems that work in real environments rather than under perfect laboratory conditions, that take computational cost into account, and that can handle complex interactions. This aligns with the current industry trend of AI technology moving rapidly from research into actual products and services.

When these technologies are combined, integrated AI systems will become possible that understand complex visual information, generate appropriate responses through reasoning, adapt to changing environments, and operate efficiently. Fields such as autonomous vehicles, intelligent robots, creative content generation tools, and advanced game AI are expected to benefit.

These studies, published on Papers with Code, are provided with code, giving researchers and developers a foundation to experiment with and improve on them directly. This spirit of open science is an important factor in the democratization and rapid development of AI technology. The next stage of multimodal AI has already begun, and its pace of development is likely to exceed our expectations.

Zyss News