Jaeyoung's Tech Blog

February 19, 2026

GPU Systems 11 - Shared Memory Bank Conflict

Why fast shared memory is not the whole story, and the basic principles for avoiding bank conflicts

Lectures

February 21, 2026

What warp-level primitives mean: exchanging data within a warp without going through shared memory

Lectures

February 23, 2026

Understanding shared memory, warp primitives, and synchronization together through a reduction kernel

Lectures

February 25, 2026

How reduction, memory traffic, and numerical stability come together inside a softmax kernel

Lectures

February 27, 2026

Understanding, through LayerNorm and RMSNorm, why normalization kernels tend to become memory-bound

Lectures

March 1, 2026

How vectorized memory access and alignment affect bandwidth utilization

Lectures

March 3, 2026

Why optimizations that use many registers can end up lowering overall performance

Lectures

March 7, 2026

Building intuition for asynchronous copy and double buffering, which let memory loads and compute overlap more

Lectures

January 23, 2026

PyTorch's CUDA memory is not plain malloc/free; it is reused on top of a caching allocator

Lectures