05/20/2026 · zyss

Solving Olympiad Problems with Physics Simulators, Plus a Unified Framework for GUI Agents: This Week's SOTA Trends

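LLM reasoning has improved noticeably since DeepSeek-R1. Honestly, though, most of that gain came from the math problems that are all over the internet. Fields like physics, where large-scale QA data is scarce? They were simply left behind. This week, however, the top spot on the Papers with Code trending list went to a paper that uses physics simulators as reinforcement learning environments. The idea is that if the data doesn't exist, you generate it yourself, which is a simple approach and a powerful one.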

Solving Physics Olympiad Problems with a Simulator?

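Mihir Prabhudesai and co-authors released ‘Solving Physics Olympiad via Reinforcement Learning on Physics Simulators’ on April 13, 2026, and it is exactly what the title says. The simulator serves as the RL environment, and the model learns to solve physics problems through its own trial and error. Math has problem-answer pairs all over the internet; physics does not, and Olympiad-level physics even less so. So the team chose to train the model inside a simulator instead.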

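I ran physics simulations back in grad school, but only to check results. Using one as an RL environment means the model repeats experiments like ‘what happens if I throw the ball at this angle?’ thousands of times and picks up the laws of physics from the outcomes. That sidesteps the data shortage and, at the same time, deepens the model's understanding of actual physical phenomena. Personally, this approach really resonated with me, because it looks applicable to other sciences such as chemistry and biology, not just physics.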
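To make the ‘throw the ball at this angle’ loop concrete, here is a minimal sketch of trial-and-error learning against a toy simulator. It is not the paper's method: the drag-free projectile formula stands in for a real physics simulator, plain random hill climbing stands in for an actual RL algorithm, and every name here (`landing_distance`, `reward`, `learn_angle`) is made up for illustration.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2


def landing_distance(angle_deg: float, speed: float = 20.0) -> float:
    """Toy 'simulator': range of a drag-free projectile launched from the ground."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / G


def reward(angle_deg: float, target: float) -> float:
    """Higher is better: the negative miss distance from the target."""
    return -abs(landing_distance(angle_deg) - target)


def learn_angle(target: float, trials: int = 2000, seed: int = 0) -> float:
    """Trial and error: perturb the best angle so far and keep improvements.

    This is random hill climbing, a deliberately simple stand-in for RL.
    """
    rng = random.Random(seed)
    best = rng.uniform(0.0, 45.0)
    best_reward = reward(best, target)
    for _ in range(trials):
        # Propose a nearby angle, clamped to the 0-45 degree range.
        candidate = min(45.0, max(0.0, best + rng.gauss(0.0, 2.0)))
        candidate_reward = reward(candidate, target)
        if candidate_reward > best_reward:
            best, best_reward = candidate, candidate_reward
    return best


if __name__ == "__main__":
    angle = learn_angle(target=30.0)
    print(f"angle={angle:.2f} deg, lands at {landing_distance(angle):.2f} m")
```

The only feedback the learner ever sees is the reward coming back from the simulator. No labeled question-answer pair is involved, which is the property that makes the simulator route attractive when data is scarce.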

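There are limits, though. A simulator cannot reproduce reality perfectly. What matters is how carefully details such as friction and air resistance are modeled, and the abstract alone doesn't tell us that. Still, the direction is interesting. If there is no data, build an environment: isn't that exactly RL's strength?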
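A quick way to see why the modeling detail matters is to run the same throw with and without air resistance. The sketch below is my own toy integrator, not anything from the paper. It assumes a quadratic drag model, and the `drag_coeff` value of 0.02 is an arbitrary illustrative number rather than a measured one.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_range(angle_deg: float, speed: float = 20.0,
                   drag_coeff: float = 0.0, dt: float = 1e-3) -> float:
    """Horizontal distance travelled before the projectile drops below y = 0.

    Uses semi-implicit Euler steps. drag_coeff is a hypothetical quadratic
    drag constant per unit mass (1/m); 0.0 means no air resistance.
    """
    theta = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(theta), speed * math.sin(theta)
    while True:
        v = math.hypot(vx, vy)
        # Drag opposes the velocity and grows with the square of the speed.
        ax = -drag_coeff * v * vx
        ay = -G - drag_coeff * v * vy
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
        if y < 0.0:
            return x


if __name__ == "__main__":
    ideal = simulate_range(45.0, drag_coeff=0.0)
    dragged = simulate_range(45.0, drag_coeff=0.02)
    print(f"no drag: {ideal:.2f} m, with drag: {dragged:.2f} m")
```

Without drag the result should sit close to the analytic range v²/g, about 40.8 m for a 20 m/s throw at 45 degrees, and the drag run lands short of it. A policy tuned in the frictionless world would overshoot in the other one, which is the kind of gap a simulator-trained model inherits.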

Human-Object Interaction Video Generation, Now with Text, Image, Audio, and Pose All at Once

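Donghao Zhou's team released ‘OmniShow’ the same day, a paper in Human-Object Interaction Video Generation (HOIVG). The model takes text, reference images, audio, and pose as simultaneous conditions and generates high-quality video from them. It is a practical technology aimed at uses like e-commerce demos and short-form content production.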

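Existing approaches couldn't handle all of these conditions at once. They would take text only, or images only. In real content production, though, compound requirements like ‘hold this product, move like this, and stay in sync with this background music’ are the norm. OmniShow matters because it is a framework that unifies them. Truth be told, I once worked on an ad video project myself, and I keep thinking how nice it would have been to have a single model that handled all of this.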

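Of course, what ‘high-quality’ means in concrete terms will have to wait for the full paper, but unifying the multimodal conditions is a step forward in itself. I expect it to be useful for interactive entertainment and metaverse content production. I am curious about the compute cost, though. Real-time generation is probably still a long way off.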

GUI Agents Finally Get a Unified Framework

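‘ClawGUI’ from Fei Tang's team is a unified framework that handles training, evaluation, and deployment of GUI agents in one place. A GUI agent is an AI that operates apps by looking at the screen and tapping and swiping, instead of calling an API. That lets it reach long-tail applications that CLI-based agents can't touch.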

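Honestly, though, this field's bigger problem has been infrastructure rather than modeling capacity. Collecting training data is hard, evaluation criteria are fuzzy, and real deployment is a different problem altogether. The point of ClawGUI is that it ties that whole pipeline together. The abstract explicitly says the field is ‘bottlenecked less by modeling capacity’, so the team's framing matches this diagnosis.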

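I have used GUI automation tools myself, and hardcoding screen coordinates and handling exceptions every time was a real slog. But what if a vision-based agent took care of that automatically? It could be a game changer for workflow automation, especially in environments with no API, such as legacy software. I do wonder how they handled security, though. Capturing and manipulating the screen carries a risk of leaking sensitive information.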
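The coordinate problem is easy to show in a few lines. The sketch below is not ClawGUI's API; every name in it is hypothetical, and the ‘screen’ is a hand-written list of labeled boxes standing in for whatever a vision model would detect from pixels. It only contrasts a hardcoded tap with a tap resolved from what is currently on screen.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Element:
    """A detected UI element: a label plus its bounding box in pixels."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> Point:
        return (self.x + self.width // 2, self.y + self.height // 2)

    def contains(self, point: Point) -> bool:
        px, py = point
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def tap_hardcoded() -> Point:
    """The old way: a fixed coordinate that only fits one layout."""
    return (540, 1200)


def tap_by_label(screen: List[Element], label: str) -> Point:
    """Resolve the tap target from the current screen contents instead."""
    for element in screen:
        if element.label.lower() == label.lower():
            return element.center()
    raise LookupError(f"no element labeled {label!r} on screen")


if __name__ == "__main__":
    # Two mock layouts: the Submit button moves down in the second one.
    layout_v1 = [Element("Submit", 440, 1150, 200, 100)]
    layout_v2 = [Element("Submit", 440, 1400, 200, 100)]
    for name, screen in (("v1", layout_v1), ("v2", layout_v2)):
        button = screen[0]
        print(name,
              "hardcoded hit:", button.contains(tap_hardcoded()),
              "by-label hit:", button.contains(tap_by_label(screen, "Submit")))
```

The hardcoded tap hits the button only in the first layout, while the label lookup follows the button when it moves. A real agent would need a perception step to produce those elements in the first place, and that step is exactly where the screen capture concern comes in.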

The Keyword Running Through This Week's Trends

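All three papers put the emphasis on practicality. The physics simulator work addresses data scarcity, OmniShow addresses multimodal condition integration, and ClawGUI addresses agent infrastructure. The weight seems to be on real-world usability rather than on hitting SOTA.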

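Personally I welcome this shift, but it worries me too. Getting away from the race for benchmark scores is good, but ignoring performance metrics altogether would be a problem. Simulator-based RL in particular will produce very different results depending on simulator quality, so the question is how that gets validated. The same goes for OmniShow. ‘High-quality’ can be subjective.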

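If anything, this week's trending list feels like a signal that AI is now moving into real work. Researchers are coming to feel that benchmark scores alone aren't enough to get out of the lab. I expect more of these practically oriented papers over the next few months, and that is welcome news for developers. It means more technology we can put to use right away.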

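One disappointment is that most of these papers haven't released code or demos yet. Showing up on Papers with Code trending doesn't mean a paper is immediately reproducible. For ClawGUI in particular, since it calls itself a framework, what matters is whether it gets open-sourced in a form you can actually deploy, and that remains to be seen. I'm hopeful, but I'm also bracing for disappointment. That's the reality of this field.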
