Jaeyoung's Tech Blog

February 1, 2026

PyTorch Internals 10 - Connecting a Custom Kernel with a CUDA Extension

Turning a CUDA kernel into a PyTorch operator requires getting not only the kernel code right but also the tensor contract and runtime semantics.

Lectures

February 4, 2026

To register a custom op properly, the schema and dispatch structure must be pinned down before the implementation is written.

Lectures

February 7, 2026

Backward is not an add-on to forward; it is a design problem of deciding which intermediate values to save and which computations to redo.

Lectures

February 10, 2026

Fused ops are designed not only to reduce launch overhead but also to cut memory accesses and intermediate materialization.

Lectures

February 13, 2026

For a custom op to be used in real training, it must also account for dtype rules and stability under mixed precision.

Lectures

February 16, 2026

The point of understanding internals is ultimately to be able to read from a profile where time is being lost, and to change it.

Lectures

February 19, 2026

Understanding recent PyTorch internals means looking at the compile path as well as the eager execution path.

Lectures

February 22, 2026

Triton is not a separate toy language; it is a layer tied directly to PyTorch's modern kernel story.

Lectures

February 25, 2026

DDP and FSDP are not magic outside autograd; they are structures that intercept gradient readiness and tensor state at the runtime level.

Lectures