재영's Tech Blog

January 27, 2026

Distributed LLM Training 08 - Tensor Parallel Basics: How to Split a Model's Internal Computation

Once a model no longer fits on a single GPU, splitting the data alone is not enough; the computation itself has to be partitioned.

Lectures

February 15, 2026

How to read GPU memory bottlenecks and data-reuse problems through a naive matrix multiply

Lectures

February 17, 2026

Why shared memory and block-level cooperation make such a large performance difference in tiled matmul

Lectures

March 5, 2026

Which kinds of operations tensor cores speed up the most, and how that connects to mixed precision

Lectures