05/20/2026 · zyss

From 3D World Generation to Visual Synthesis: The New Wave in Generative AI

The generative AI field is moving incredibly fast these days. While browsing the trending papers on Papers with Code recently, I noticed an interesting pattern. Three topics (3D world generation, visual synthesis, and preference optimization) are gaining attention at the same time. They may look independent, but they’re all racing toward a common goal: better visual outputs.

Lyra 2.0: Walking Through 3D Worlds with a Camera

Lyra 2.0, presented by Tianchang Shen’s research team, introduces an approach that uses video generation for 3D scene creation. The paper was published on April 14, 2026, and its method generates camera-controlled videos and then converts them into 3D. Honestly, there have been similar attempts before, but most were limited to small-scale scenes.

The core concept here is called ‘generative reconstruction.’ It combines the visual fidelity and creativity of video models with 3D outputs ready for real-time rendering and simulation. But here’s the key: scaling to large, complex environments requires video generation that maintains 3D consistency over long camera trajectories.

※ Generative Reconstruction: Technology that converts 2D videos created by generative models into 3D space
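
To get a feel for why long camera trajectories are the hard part, here’s a tiny sketch. It is not Lyra 2.0’s method, just a toy I made up: a camera walks around a closed loop, and a small per-step heading error (the hypothetical `yaw_error_deg`, standing in for frame-to-frame inconsistency) keeps the loop from closing. The function name and all the numbers are my own assumptions.

```python
import math

def loop_closure_gap(num_steps, yaw_error_deg=0.0, step_length=1.0):
    """Walk a camera around a regular polygon and return its distance from the start.

    With exact turns the path closes on itself. A small per-step yaw error
    (a hypothetical stand-in for frame-to-frame inconsistency) accumulates,
    so the camera ends up away from where it began.
    """
    turn = 2 * math.pi / num_steps + math.radians(yaw_error_deg)
    x = y = heading = 0.0
    for _ in range(num_steps):
        x += step_length * math.cos(heading)
        y += step_length * math.sin(heading)
        heading += turn
    return math.hypot(x, y)

# Same per-step error, short loop vs. long loop (toy numbers).
short_gap = loop_closure_gap(100, yaw_error_deg=0.1)
long_gap = loop_closure_gap(400, yaw_error_deg=0.1)
print(f"relative gap, 100 steps: {short_gap / 100:.3f}")
print(f"relative gap, 400 steps: {long_gap / 400:.3f}")
```

The same tiny error hurts the longer loop more, even relative to its length. That’s the intuition behind needing 3D consistency that holds up over a long trajectory.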

I remember working on a game development project years ago, when creating 3D environments took weeks. Designers had to model everything by hand and apply the textures one by one… it was incredibly labor-intensive. If this technology becomes practical, that process could be dramatically shortened. Of course, it probably isn’t perfect yet.

Overcoming Diffusion Model Inefficiency

Jian Han’s team tackles the problem from a different angle with Generative Refinement Networks (GRN). Diffusion models, which currently dominate visual generation, are computationally inefficient. The problem is that they spend the same amount of computation on every sample, regardless of how complex it is.

In contrast, autoregressive (AR) models are inherently complexity-aware, as their variable likelihoods show. But they’ve been held back by lossy discrete tokenization and error accumulation. GRN attempts to combine the advantages of both approaches.

※ Autoregressive Model: A model that generates data sequentially using previous outputs as inputs
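
If “variable likelihoods” sounds abstract, a toy example helps. The sketch below has nothing to do with GRN’s actual architecture. It’s a minimal count-based autoregressive model I’m assuming for illustration, and `sequence_bits` is a made-up helper. It scores a sequence with the chain rule, p(x) = p(x_1) · p(x_2 | x_1) · …, and reports the cost in bits.

```python
import math
import random

def sequence_bits(tokens, vocab_size=2):
    """Return -log2 p(tokens) under a toy adaptive autoregressive model.

    Each token is predicted from the counts of earlier tokens (with add-one
    smoothing), so the sequence probability is a product of conditionals.
    """
    counts = [0] * vocab_size
    total_bits = 0.0
    for token in tokens:
        p = (counts[token] + 1) / (sum(counts) + vocab_size)
        total_bits += -math.log2(p)
        counts[token] += 1  # the model only ever sees past tokens
    return total_bits

rng = random.Random(0)  # fixed seed so the example is repeatable
flat = [0] * 64                                   # "simple" content
noisy = [rng.randint(0, 1) for _ in range(64)]    # "complex" content

flat_bits = sequence_bits(flat)
noisy_bits = sequence_bits(noisy)
print(f"flat: {flat_bits:.1f} bits, noisy: {noisy_bits:.1f} bits")
```

The flat sequence costs far fewer bits than the noisy one, so the likelihood itself tells you which input is harder. A fixed-step sampler has no such signal built in, which is the inefficiency described above.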

What makes this research interesting to me personally is that it focuses on efficiency, not just performance. In practice, performance matters, but cost and speed are equally important. Especially when you’re running a service.

A New Way to Learn Preferences

Ya-Qi Yu’s research team comes at it from yet another angle with Visual Preference Optimization with Rubric Rewards, or rDPO. Direct Preference Optimization (DPO) is only as effective as its preference data, which has to reflect the quality differences that matter in multimodal tasks.
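
For reference, here’s the standard DPO loss for a single preference pair, written as a minimal sketch in plain Python. This is the generic DPO objective, not anything specific to rDPO, and the log-probabilities and `beta` value below are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favours each response than the frozen reference.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities, invented for this example.
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)  # policy identical to reference
improved = dpo_loss(-10.0, -13.0, -12.0, -11.0)   # policy shifted toward "chosen"
print(untrained, improved)
```

The loss only ever sees which response was labeled “chosen” and which “rejected.” If that labeling misses the differences that actually matter, the model has nothing better to learn from.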

Existing pipelines mainly relied on off-policy perturbations or coarse outcome-based signals, which weren’t suitable for fine-grained visual reasoning. rDPO proposes a preference optimization framework based on instance-specific rubrics. It applies specific evaluation criteria to each image.

※ Rubric: An evaluation tool that clearly defines assessment items and criteria

It’s kind of like evaluating students: instead of just giving a score, you create a detailed grading rubric, so it’s clear what’s good and what isn’t.
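
To make the grading-rubric analogy concrete, here’s a rough sketch of how a per-image rubric could turn candidate answers into a preference pair. I don’t know the paper’s actual rubric format or scoring procedure, so everything here is hypothetical: the `RubricItem` class, the weights, the hand-written `checks`, and the rule of picking the best and worst candidates.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str  # what a good answer must get right for this image
    weight: float     # relative importance of the criterion

def rubric_score(checks, rubric):
    """Weighted fraction of rubric items that a response satisfies."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item, ok in zip(rubric, checks) if ok)
    return earned / total

def build_pair(candidates, rubric):
    """Pick the best- and worst-scoring responses as (chosen, rejected)."""
    ranked = sorted(candidates, key=lambda c: rubric_score(c["checks"], rubric))
    return ranked[-1]["text"], ranked[0]["text"]

# Hypothetical rubric for one image of a bar chart.
rubric = [
    RubricItem("reads the tallest bar's value correctly", 2.0),
    RubricItem("names the right category", 1.0),
    RubricItem("states the unit", 1.0),
]
# A judge (human or model) would normally fill in `checks`; here it is hand-written.
candidates = [
    {"text": "Sales peak at 40 units in Q3.", "checks": [True, True, True]},
    {"text": "Sales peak in Q3.", "checks": [False, True, False]},
    {"text": "Sales peak at 35 in Q2.", "checks": [False, False, False]},
]
chosen, rejected = build_pair(candidates, rubric)
print("chosen:", chosen)
print("rejected:", rejected)
```

The point of the sketch is that the criteria belong to this one image. A different image would get a different checklist, which is what “instance-specific” means here.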

Common Trends Across Three Papers

Looking at these three papers together reveals an interesting pattern. First, they all clearly recognize the limitations of existing methodologies. Lyra 2.0 points out scaling issues, GRN addresses computational efficiency, and rDPO tackles ambiguous evaluation criteria.

Second, they focus on practicality rather than just performance improvement. Real-time 3D rendering, complexity-aware generation, and fine-grained evaluation criteria are all designed with real-world applications in mind. I think this signals the field is maturing.

Third, while they address different problems, they ultimately converge toward one goal: better visual AI. Whether creating 3D worlds, synthesizing images, or learning preferences, the ultimate purpose is to efficiently produce visual outputs that satisfy people.

Practical Applicability?

To be honest, these technologies are unlikely to go into production tomorrow. Research at the paper stage is usually tested under ideal conditions, and real service environments throw up plenty of unexpected problems.

But I think the direction is definitely right. GRN’s efficiency-focused approach in particular seems like it could be useful right away. Lyra 2.0’s 3D generation technology could be applied in gaming or the metaverse, and rDPO’s preference learning could improve quality in image generation services.

I recently used an image generation API, and the biggest problem was inconsistent output. Even with the same prompt, it produced a different style every time… I think approaches like rDPO could help with that.

Future Outlook

The generative AI field is now moving beyond the ‘can create’ stage to the ‘can create well’ stage. It’s no longer just about producing plausible results. It’s about precisely matching the quality and style users want.

These three papers approach that problem in their own ways. 3D consistency, computational efficiency, clear evaluation criteria… when these elements combine, truly powerful systems could emerge. Though I don’t know when that’ll be.

Personally, I feel hopeful seeing these studies released as open source. Thanks to platforms like Papers with Code, anyone can access and experiment with the latest research. That was hard to imagine just a few years ago.

Ultimately, technology is made by people and used by people. It’ll be interesting to watch how the directions these papers suggest actually change our lives. Well, it won’t be perfect, but that’s also the charm of technological progress, isn’t it?