Parallel Storage Allocation

컴퓨터공학/Performance Engineering

Parallel Storage Allocation

Pyxis 2026. 3. 30. 22:32

[MIT OpenCourseWare] Parallel Storage Allocation

https://youtu.be/d5e_YJGXXFU?si=OjFtavN5i8PmsMXU

Heap Storage in C

지난 강의 Storage Allocation에서 다뤘던 부분 복습.

Aligned allocation

Q) aligned memory allocation을 쓰는 이유를 아시는분?

1) 한 가지 이유는, 캐시라인에 정렬(align)되게 할당한다면, 2개의 캐시라인을 fetch해오지 않고 1개의 캐시라인만 fetch해올 수 있음. 따라서, 2번의 캐시미스대신 1번의 캐시미스만 발생.

2) 또 다른 이유는 vectorization 연산이 2의 거듭제곱에 정렬(aligned)된 메모리 주소를 필요로 하기 때문.

리눅스 커널은 애플리케이션의 address space에서 size 바이트를 저장할 수 있을 만큼 충분히 큰 연속적인 미사용 영역(contiguous, unused region)을 찾아 페이지 테이블을 수정하고, 사용자가 이 영역에 접근할 때 "legal"로 처리하여 segfault가 발생하지 않도록 필요한 가상 메모리 관리 구조를 OS내에 생성합니다.

So mmap is a system call.
And usually mmap is used to treat some file on disk as part of memory, so that when you write to that memory region, it also backs it up on disk. In this context here, I'm actually using mmap to allocate virtual memory without having any backing file. So mmap has a whole bunch of parameters here. The second to the last parameter indicates the file I want to map, and if I pass a negative 1, that means there's no backing file. So I'm just using this to allocate some virtual memory. The first argument is where I want to allocate it. And "0" means that I don't care. The "SIZE" in terms of number of bytes has how much memory I want to allocate. Then there's also permissions. So here it says I can read and write this memory region. "MAP_PRIVATE" means that this memory region is private to the process that's allocating it. And then "MAP_ANON" means that there is no name associated with this memory region. And then as I said, "-1" means that there's no backing file. And the last parameter is just "0" if there's no backing file. Normally it would be an offset into the file that you're trying to map. But here there's no backing file. And what mmap does is it finds a contiguous unused region in the address space of the application that's large enough to hold size bytes. And then it updates the page table so that it now contains an entry for the pages that you allocated. And then it creates a necessary virtual memory management structures within the operating system to make it so that users accesses to this area are legal, and accesses won't result in a seg fault. If you try to access some region of memory without using-- without having OS set these parameters, then you might get a segfault because the program might not have permission to access that area. But mmap is going to make sure that the user can access this area of virtual memory. And mmap is a system call, whereas malloc, which we talked about last time, is a library call. So these are two different things. And malloc actually uses mmap under the hood to get more memory from the operating system.

mmap은 시스템 콜입니다. 일반적으로 mmap은 디스크의 파일을 메모리의 일부처럼 취급하는 데 사용됩니다. 따라서 해당 메모리 영역에 쓰면 디스크에도 백업됩니다. 하지만 여기서는 백업 파일 없이 mmap을 사용하여 가상 메모리를 할당하고 있습니다.

mmap에는 여러 매개변수가 있는데요,

끝에서 두 번째 매개변수는 매핑할 파일을 지정하며, -1을 전달하면 백업 파일이 없음을 의미합니다. 즉, 가상 메모리를 할당하는 데만 사용합니다.

1) 첫 번째 인수는 할당할 위치를 지정합니다. "0"은 위치를 지정하지 않음을 의미합니다.

2) "SIZE"는 바이트 단위로 할당할 메모리 크기를 지정합니다.

3) 또한 권한도 있습니다. 여기서는 해당 메모리 영역에 읽기/쓰기 권한이 있음을 나타냅니다.

"MAP_PRIVATE"는 해당 메모리 영역이 할당하는 프로세스에만 사용됨을 의미합니다.

"MAP_ANON"은 해당 메모리 영역에 이름이 없음을 의미합니다.

4) 그리고 말씀드린 대로, "-1"은 백업 파일이 없음을 의미합니다.

5) 마지막 매개변수는 백업 파일이 없으면 "0"입니다. 일반적으로 이 매개변수는 매핑하려는 파일의 offset을 나타냅니다. 하지만 여기서는 백업 파일이 없으므로 0을 사용합니다.

`mmap` 함수는 애플리케이션의 주소 공간에서 `size` 바이트를 저장할 수 있을 만큼 충분히 큰 연속된 미사용 영역을 찾습니다. 그런 다음 페이지 테이블을 업데이트하여 할당된 페이지에 대한 항목을 추가합니다. 그리고 OS내에 필요한 가상 메모리 관리 구조를 생성하여 사용자가 이 영역에 접근할 때 오류가 발생하지 않도록 합니다. 운영 체제에서 이러한 매개변수를 설정하지 않고 메모리 영역에 접근하려고 하면 프로그램이 해당 영역에 접근할 권한이 없기 때문에 segfault가 발생할 수 있습니다. 하지만 `mmap` 함수는 사용자가 이 가상 메모리 영역에 접근할 수 있도록 보장합니다. mmap은 시스템 콜이고, 지난번에 이야기했던 malloc은 라이브러리 호출입니다. 따라서 이 두 가지는 서로 다른 것입니다. malloc은 실제로 내부적으로 mmap을 사용하여 OS로부터 더 많은 메모리를 확보합니다.

1) mmap() is lazy.

요청된 할당에 대해 physical memory를 즉시 할당하지 않습니다.

대신에, 특별한 "0" page를 가리키는 엔트리로 채워진 페이지 테이블을 생성하고, 해당 페이지를 "ReadOnly"로 마킹합니다.

그런 다음 처음으로 페이지를 write하려고 하면, Page Fault 발생하며 OS는 적절한 physical memory를 확보하려고 합니다. 그러고 나서 접근하려고 했던 페이지에 대해, virtual address ↔ physical address간의 매핑을 저장합니다. 그런 다음 명령어 실행을 재개하여 작업이 계속될 수있도록합니다.

실제로 mmap은 1GB의 램만 사용하더라도 1TB의 가상 메모리를 매핑할 수 있다는 것이 밝혀져 있습니다. mmap을 호출하면, 실제로 physical memory를 즉시 할당하지 않기 때문인데요. 주의해야 할 점은 mmap 호출 이후 physical memory 부족으로 인해 프로세스가 죽을 수도 있다는 것입니다. 특정 메모리에 접근하려고 할 때 해당 physical memory를 할당하기 때문입니다. 그래서 실제로 mmap을 호출하려 했을 때 보다 더 늦어질 수 있습니다.

Q) malloc과 mmap의 차이는 무엇일까요?

So as I said, malloc is a library call. And it's part of-- malloc and free are part of the memory allocation interface of the heap-management code in the c library. And the heap-management code uses the available system facilities, including the mmap function to get a virtual address space from the operating system. And then the heap-management code is going-- within malloc-- is going to attempt to satisfy user requests for heap storage by reusing the memory that it got from the OS as much as possible until it can't do that anymore. And then it will go and call mmap to get more memory from the operating system. So the malloc implementation invokes mmap and other system calls to expand the size of the users heap storage. And the responsibility of malloc is to reuse the memory, such that your fragmentation is reduced, and you have good temporal locality, whereas the responsibility of mmap is actually getting this memory from the operating system. So any questions on the differences between malloc and mmap?

malloc은 라이브러리 함수 호출입니다. malloc과 free는 C 라이브러리의 힙 관리 코드에서 사용하는 메모리 할당 인터페이스의 일부입니다. 힙 관리 코드는 mmap 함수를 포함한 시스템 기능을 활용하여 OS로부터 가상 주소 공간을 할당받습니다. malloc 함수 내에서 힙 관리 코드는 OS로부터 할당받은 메모리를 최대한 재사용하여 사용자의 힙 저장 공간 요청을 충족시키려 시도합니다. 하지만 더 이상 재사용할 수 없게 되면 mmap을 호출하여 OS로부터 추가 메모리를 할당받습니다. 따라서 malloc 구현은 mmap 및 기타 system call을 통해 사용자의 heap storage 크기를 확장합니다. malloc의 역할은 메모리를 재사용하여 fragmentation을 줄이고 temporal locality를 갖는 것이고, mmap의 역할은 실제로 OS로부터 메모리를 할당받는 것입니다.

Q) malloc / mmap의 차이점에 대한 또 다른 질문있나요?, 하나 던져보자면, malloc 대신 mmap을 쓰면 안될까요?

A) Yes, so one answer is that you might have free storage from before that you would want to reuse. And it turns out that mmap is relatively heavy weight. So it works on a page granularity. So if you want to do a small allocation, it's quite wasteful to allocate an entire page for that allocation and not reuse it. You'll get very bad external fragmentation. And when you call mmap, it has to go through all of the overhead of the security of the OS and updating the page table and so on. Whereas, if you use malloc, it's actually pretty fast for most allocations, and especially if you have temporal locality where you allocate something that you just recently freed. So your program would be pretty slow if you used m map all the time, even for small allocations. For big allocations, it's fine. But for small allocations, you should use malloc.

한 가지 이유는 이전에 사용했던 여유 공간을 재사용하고 싶을 수 있다는 점입니다. 그런데 mmap은 상대적으로 무겁다는 것이 밝혀져 있습니다. 페이지 단위로 동작하기 때문에 작은 공간을 할당할 때 전체 페이지를 할당하고 재사용하지 않으면 낭비가 심합니다. external fragmenation이 심하게 발생하죠. mmap을 호출하면 OS의 보안 오버헤드, 페이지 테이블 업데이트 등의 과정도 거쳐야 합니다. 반면 malloc을 사용하면 대부분의 할당, 특히 최근에 해제한 메모리를 다시 할당하는 경우처럼 temporal locality가 있는 경우에는 훨씬 빠릅니다. 따라서 작은 메모리 할당에도 mmap을 계속 사용하면 프로그램 속도가 상당히 느려질 것입니다. 큰 메모리 할당에는 문제가 없지만, 작은 메모리 할당에는 malloc을 사용해야 됩니다.

when you access memory location, you access it via the virtual address.

And the virtual address can be divided into two parts, where the lower order bits store the offset, and the higher order bits store the virtual page number. And in order to get the physical address associated with this virtual address, the hardware is going to look up this virtual page number in what's called the page table. And then if it finds a corresponding entry for the virtual page number in the page table, that will tell us the physical frame number. And then the physical frame number corresponds to where this physical memory is in DRAM. So you can just take the frame number, and then use the same offset as before to get the appropriate offset into the physical memory frame. So if the virtual page that you're looking for doesn't reside in physical memory, then a page fault is going to occur. And when a page fault occurs, either the operating system will see that the process actually has permissions to look at that memory region, and it will set the permissions and place the entry into the page table so that you can get the appropriate physical address. But otherwise, the operating system might see that this process actually can't access that region memory, and then you'll get a segmentation fault.

메모리 위치에 접근할 때는 가상 주소를 사용합니다. 가상 주소는 두 부분으로 나눌 수 있는데, 하위 비트에는 오프셋이 저장되고 상위 비트에는 가상 페이지 번호가 저장됩니다. 이 가상 주소에 해당하는 피지컬 주소를 얻으려면 하드웨어는 페이지 테이블에서 해당 가상 페이지 번호를 찾습니다. 페이지 테이블에서 해당 가상 페이지 번호에 해당하는 항목을 찾으면 피지컬 프레임 번호를 알 수 있습니다. 피지컬 프레임 번호는 DRAM에서 해당 피지 메모리의 위치를 나타냅니다. 따라서 프레임 번호를 가져와서 이전과 동일한 오프셋을 사용하여 피지컬 메모리 프레임의 적절한 오프셋을 얻을 수 있습니다. 찾고자 하는 가상 페이지가 피지컬 메모리에 없으면 페이지 폴트가 발생합니다. 페이지 폴트가 발생하면 운영 체제는 해당 프로세스가 해당 메모리 영역을 볼 권한이 있는지 확인하고 권한을 설정하여 페이지 테이블에 항목을 추가하여 적절한 피지컬 주소를 얻을 수 있도록 합니다. 하지만 그렇지 않으면 운영 체제는 해당 프로세스가 해당 메모리 영역에 접근할 수 없다고 판단하여 segfault를 발생시킬 수 있습니다.

It turns out that the page table search, also called a page walk, is pretty expensive. And that's why we have the Translation-Look-a-side-Buffer or TLB, which is essentially a cache for the page table. So the hardware uses a TLB to cache the recent page table look ups into this TLB so that later on when you access the same page, it doesn't have to go all the way to the page table to find the physical address. It can first look in the TLB to see if it's been recently accessed.

페이지 테이블 검색, 즉 페이지 워크는 상당히 비용이 많이 드는 작업임이 밝혀져 있습니다. 그래서 페이지 테이블 캐시 역할을 하는 TLB(Translation-Look-a-side-Buffer)가 있는 것입니다. 하드웨어는 TLB를 사용하여 최근 페이지 테이블 검색 결과를 캐시합니다. 이렇게 하면 나중에 동일한 페이지에 접근할 때 페이지 테이블에서 피지 주소를 찾기 위해 처음부터 다시 접근할 필요 없이 TLB에서 최근에 접근된 기록이 있는지 확인할 수 있습니다.

Q) why would you expect to see something that it recently has accessed? So what's one property of a program that will make it so that you get a lot of TLB hits?

So the page table stores pages, which are typically four kilobytes. Nowadays there are also huge pages, which can be a couple of megabytes. And most of the accesses in your program are going to be near each other. So they're likely going to reside on the same page for accesses that have been done close together in time. So therefore you'll expect that many of your recent accesses are going to be stored in the TLB if your program has locality, either spatial or temporal locality or both. So how this architecture works is that the processor is first going to check whether the virtual address you're looking for is in TLB. If it's not, it's going to go to the page table and look it up. And then if it finds that there, then it's going to store that entry into the TLB. And then next it's going to go get this physical address that it found from the TLB and look it up into the CPU cache. And if it finds it there, it gets it. If it doesn't, then it goes to d ram to satisfy the request. Most modern machines actually have an optimization that allow you to do TLB access in parallel with the L1 cache access. So the L1 cache actually uses virtual addresses instead of physical addresses, and this reduces the latency of a memory access. So that's a brief review of address translation.

페이지 테이블은 일반적으로 4KB 크기의 페이지를 저장합니다. 요즘에는 몇 메가바이트에 달하는 대용량 페이지도 있습니다. 프로그램에서 대부분의 접근은 서로 가까운 곳에서 이루어지기 때문에, 시간적으로 가까운 시점에 이루어진 접근은 같은 페이지에 저장될 가능성이 높습니다. 따라서 프로그램에 공간적 또는 시간적 지역성, 혹은 둘 다 존재하는 경우, 최근 접근 기록들이 TLB에 저장될 것으로 예상할 수 있습니다. 이러한 아키텍처의 작동 방식은 다음과 같습니다. 프로세서는 먼저 찾고자 하는 가상 주소가 TLB에 있는지 확인합니다. 없으면 페이지 테이블에서 해당 주소를 찾습니다. 페이지 테이블에서 해당 주소를 찾으면 TLB에 저장합니다. 그런 다음 TLB에서 찾은 피지컬 주소를 CPU 캐시에서 검색합니다. 캐시에서 해당 주소를 찾으면 가져오고, 찾지 못하면 DRAM에서 해당 주소를 가져와 요청을 처리합니다. 대부분의 최신 컴퓨터에는 TLB 접근과 L1 캐시 접근을 병렬로 수행할 수 있도록 최적화 기능이 탑재되어 있습니다. 따라서 L1 캐시는 피지컬 주소 대신 가상 주소를 사용하며, 이는 메모리 접근 지연 시간을 줄여줍니다. 이것이 주소 변환에 대한 간략한 설명입니다.

let's talk about stacks. So when you execute a serial c and c++ program, you're using a stack to keep track of the function calls and local variables that you have to save. So here, let's say we have this invocation tree, where function A calls Function B, which then returns. And then a calls function C, which calls D, returns, calls E, returns, and then returns again. Here are the different views of the stack at different points of the execution. So initially when we call A, we have a stack frame for A. And then when a calls B, we're going to place a stack frame for B right below the stack frame of A. So these are going to be linearly ordered. When we're done with b, then this part of the stack is no longer going to be used, the part for b. And then when it calls C, It's going to allocate a stack frame below A on the stack. And this space is actually going to be the same space as what B was using before. But this is fine, because we're already done with the call to B. Then when C calls D, we're going to create A stack frame for D right below C. When it returns, we're not going to use that space any more, so then we can reuse it for the stack frame when we call E. And then eventually all of these will pop back up. And all of these views here share the same view of the stack frame for A. And then for C, D, and E, they all stare share the same view of this stack for C. So this is how a traditional linear stack works when you call a serial C or C++ program. And you can view this as a serial walk over the invocation tree. There's one rule for pointers. With traditional linear stacks is that a parent can pass pointers to its stack variables down to its children. But not the other way around. A child can't pass a pointer to some local variable back to its parent. So if you do that, you'll get a bug in your program. How many of you have tried doing that before? Yeah, so a lot of you. So let's see why that causes a problem.

스택에 대해 이야기해 보겠습니다. C와 C++ 프로그램을 순차적으로 실행할 때, 함수 호출과 저장해야 하는 지역 변수를 관리하기 위해 스택을 사용합니다. 예를 들어, 함수 A가 함수 B를 호출하고, B가 반환되는 호출 트리가 있다고 가정해 보겠습니다. 그리고 A는 함수 C를 호출하고, C는 함수 D를 호출하고, 반환되고, 함수 E를 호출하고, 반환되고, 다시 반환됩니다. 실행 과정의 각 시점에서 스택의 모습을 살펴보겠습니다. 처음 A를 호출할 때, A를 위한 스택 프레임이 생성됩니다. 그리고 A가 B를 호출하면, A의 스택 프레임 바로 아래에 B를 위한 스택 프레임이 생성됩니다. 따라서 이들은 순차적으로 배열됩니다. B 호출이 끝나면, B를 위한 스택 공간은 더 이상 사용되지 않습니다. 그런 다음 A가 C를 호출하면, A 아래에 새로운 스택 프레임이 할당됩니다. 이 공간은 이전에 B가 사용했던 공간과 동일합니다. 하지만 B 호출은 이미 완료되었으므로 괜찮습니다. C가 D를 호출하면 C 바로 아래에 D에 대한 스택 프레임을 생성합니다. D가 반환되면 더 이상 해당 공간을 사용하지 않으므로 E를 호출할 때 스택 프레임으로 재사용할 수 있습니다. 그러면 결국 이 모든 스택 프레임이 다시 나타납니다. 여기 있는 모든 뷰는 A에 대한 스택 프레임을 공유합니다. 그리고 C, D, E는 모두 C에 대한 스택을 공유합니다. 이것이 바로 직렬로 실행되는 C 또는 C++ 프로그램을 호출할 때 전통적인 선형 스택이 작동하는 방식입니다. 이를 호출 트리를 순차적으로 순회하는 것으로 볼 수 있습니다. 포인터에 대한 한 가지 규칙이 있습니다. 전통적인 선형 스택에서는 부모 함수스택 변수에 대한 포인터를 자식 함수로 전달할 수 있습니다. 하지만 그 반대는 불가능합니다. 자식 함수는 지역 변수에 대한 포인터를 부모 함수로 전달할 수 없습니다. 만약 그렇게 한다면 프로그램에 버그가 발생합니다. 여러분 중 몇 명이나 전에 그렇게 해 본 적이 있나요? 네, 꽤 많을 겁니다. 그렇다면 이것이 왜 문제를 일으키는지 살펴보겠습니다.

So if I call B, and I pass a pointer to some local variable in B stack to A, and then now when A calls C, It's going to overwrite the space that B was using. And if B's local variable was stored in the space that C has now overwritten, then you're just going to see garbage. And when you try to access that, you're not going to get the correct value. So you can pass a pointer to A's local variable down to any of these descendant function calls, because they all see the same view of A stack. And that's not going to be overwritten while these descendant function calls are proceeding. But if you pass it the other way, then potentially the variable that you had a pointer to is going to be overwritten. So here's one question.

만약 제가 B를 호출하고, B의 스택에 있는 어떤 지역 변수에 대한 포인터를 A에 전달한다고 가정해 봅시다. 그러면 A가 C를 호출할 때, B가 사용하던 공간을 덮어쓰게 됩니다. 만약 B의 지역 변수가 C가 덮어쓴 공간에 저장되어 있었다면, C는 알아볼 수 없는 이상한 값들을 보게 될 것입니다. 그리고 그 값에 접근하려고 해도 올바른 값을 얻을 수 없을 것입니다. 따라서 A의 지역 변수에 대한 포인터는 이러한 모든 하위 함수 호출에 안전하게 전달할 수 있습니다. 왜냐하면 모든 하위 함수는 A의 스택에 대한 동일한 뷰를 보기 때문입니다. 이러한 하위 함수 호출이 진행되는 동안 포인터는 덮어쓰여지지 않습니다. 하지만 반대로 A의 지역 변수에 포인터를 전달하면, 포인터가 가리키는 변수가 덮어쓰여질 가능성이 있습니다. 여기서 한 가지 질문이 생깁니다.

Q) If you want to pass memory from a child back to the parent, where would you allocate it? So you can allocate it on the parent. What's another option?

자식에서 부모로 메모리를 전달하려면 어디에 할당해야 할까요? 부모에 할당할 수도 있지만, 다른 방법은 없을까요?

A) Yes, so another way to do this is to allocate it on the heap. If you allocate it on the heap, even after you return from the function call, that memory is going to persist. You can also allocate it in the parent's stack, if you want. In fact, some programs are written that way. And one of the reasons why many C functions require you to pass in memory to the function where it's going to store the return value is to try to avoid an expensive heap allocation in the child. Because if the parent allocates this space to store the result, the child can just put whatever it wants to compute in that space. And the parent will see it. So then the responsibility is up to the parent to figure out whether it wants to allocate the memory on the stack or on the heap. So this is one of the reasons why you'll see many C functions, where one of the arguments is a memory location where the result should be stored. OK, so that was the serial case.

또 다른 방법은 힙에 메모리를 할당하는 것입니다. 힙에 할당하면 함수 호출을 반환한 후에도 해당 메모리가 유지됩니다. 원한다면 부모 함수의 스택에 할당할 수도 있습니다. 실제로 그렇게 작성된 프로그램도 있습니다. 많은 C 함수에서 함수의 반환 값을 저장하는 곳에 메모리 위치를 인수로 전달하도록 하는 이유 중 하나는 자식 함수에서 비용이 많이 드는 힙 할당을 피하기 위해서입니다. 부모 함수가 결과를 저장할 공간을 할당하면 자식 함수는 원하는 계산 결과를 그 공간에 저장할 수 있고, 부모 함수는 이를 확인할 수 있습니다. 따라서 스택에 메모리를 할당할지 힙에 할당할지는 부모 함수가 결정해야 합니다. 이것이 바로 많은 C 함수에서 인수로 결과 저장 위치를 지정하는 이유 중 하나입니다. 여기까지가 직렬 호출의 경우였습니다.

What happens in parallel? So in parallel, we have what's called a cactus stack where we can support multiple views of the stack in parallel. So let's say we have a program where it calls function a, and then a spawns b and c. So b and c are going to be running potentially in parallel. And then c spawns d and e, which can potentially be running in parallel. So for this program, we could have functions b, d and e all executing in parallel. And a cactus stack is going to allow us to have all of these functions see the same view of this stack as they would have if this program were executed serially. And the Cilk runtime system supports the cactus stack to make it easy for writing parallel programs. Because now when you're writing programs, you just have to obey the same rules for programming in serial c and c++ with regards to the stack, and then you'll still get the intended behavior. And it turns out that there's no copying of the stacks here. So all of these different views are seeing the same virtual memory addresses for a. But now there is an issue of how do we implement this cactus stack? Because in the serial case, we could have these later stacks overwriting the earlier stacks. But in parallel, how can we do this?

병렬 호출에서는 무슨 일이 일어날까요? 병렬 처리를 위해 '선인장 스택(Cactus Stack)'이라는 것이 있는데, 이를 통해 스택에 대한 여러 관점을 병렬로 지원할 수 있습니다. 예를 들어, 함수 A를 호출하는 프로그램이 있다고 가정해 보겠습니다. A는 함수 B와 C를 생성합니다. 그러면 B와 C는 병렬로 실행될 수 있습니다. 그리고 C는 함수 D와 E를 생성하는데, 이 함수들 또한 병렬로 실행될 수 있습니다. 따라서 이 프로그램에서는 함수 B, D, E가 모두 병렬로 실행될 수 있습니다. Cactus Stack을 사용하면 이러한 모든 함수들이 프로그램이 직렬로 실행될 때와 동일한 스택 관점을 갖게 됩니다. Cilk runtime system은 병렬 프로그램 작성을 용이하게 하기 위해 Cactus Stack을 지원합니다. 이제 프로그램을 작성할 때 스택과 관련하여 직렬 C 및 C++ 프로그래밍 규칙을 그대로 따르면 의도한 동작을 그대로 얻을 수 있습니다. 그리고 여기서 스택을 복사하지 않는다는 것이 밝혀져 있습니다. 그래서 이 모든 서로 다른 관점에서 A에 대한 동일한 가상 메모리 주소를 보고 있습니다. 하지만 이제 문제는 이 Cactus Stack을 어떻게 구현하느냐입니다. 직렬 처리의 경우, 나중에 생성된 스택이 이전에 생성된 스택을 덮어쓸 수 있습니다. 하지만 병렬 처리에서는 어떻게 해야 할까요?

Q) So does anyone have any simple ideas on how we can implement a cactus stack? so one way to do this is to have each thread use a different stack. And then store pointers to the different stack frames across the different stacks. There's actually another way to do this, which is easier.

선인장 스택을 구현하는 간단한 방법이 있을까요? 한 가지 방법은 각 스레드가 서로 다른 스택을 사용하는 것입니다. 그리고 각 스택에 있는 스택 프레임에 대한 포인터를 저장하는 거죠. 하지만 더 쉬운 방법도 있습니다.

A) If the stack frames have a fixed maximum size, then you could put them all in the same stack separated by that fixed size.

스택 프레임의 최대 크기가 고정되어 있다면, 그 고정된 크기만큼 간격을 두고 모든 프레임을 같은 스택에 저장할 수 있습니다.

Q) Yeah, so if the stacks all have a maximum depth, then you could just allocate a whole bunch of stacks, which are separated by this maximum depth. There's actually another way to do this, which is to not use the stack.

스택의 최대 깊이가 정해져 있다면, 그 최대 깊이만큼 간격을 두고 여러 개의 스택을 할당하면 되겠네요. 스택을 사용하지 않는 방법도 있습니다.

A) Could you memory map it somewhere else-- each of the different threads?

각 스레드마다 메모리 맵핑을 따로 할 수 있을까요?

Q) Yes, that's actually one way to do it.

네, 그것도 한 가지 방법입니다.

The easiest way to do it is just to allocate it off the heap. So instead of allocating the frames on the stack, you just do a heap allocation for each of these stack frames. And then each of these stack frames has a pointer to the parent stack frame. So whenever you do a function call, you're going to do a memory allocation from the heap to get a new stack frame. And then when you finish a function, you're going to pop something off of this stack, and free it back to the heap. In fact, a lot of early systems for parallel programming use this strategy of heap-based cactus stacks. Turns out that you can actually minimize the performance impact using this strategy if you optimize the code enough. But there is actually a bigger problem with using a heap-based cactus stack, which doesn't have to do with performance. Does anybody have any guesses of what this potential issue is?

가장 쉬운 방법은 힙에 할당하는 것입니다. 스택에 프레임을 할당하는 대신, 각 스택 프레임마다 힙에 메모리를 할당하는 거죠. 각 스택 프레임은 상위 스택 프레임을 가리키는 포인터를 가지고 있습니다. 따라서 함수를 호출할 때마다 힙에서 메모리를 할당하여 새로운 스택 프레임을 얻게 됩니다. 그리고 함수 실행이 끝나면 스택에서 값을 꺼내 힙으로 반환합니다. 실제로 초기 병렬 프로그래밍 시스템들은 이러한 힙 기반의 'Cactus Stack' 전략을 많이 사용했습니다. 또한 코드를 충분히 최적화하면 이 전략(Heap-Based)을 사용하더라도 성능 저하를 최소화할 수 있다는 것이 밝혀져 있습니다. 그런데 힙 기반의 Cactus Stack에는 성능과는 무관한 더 큰 문제가 있습니다. 이 잠재적인 문제가 무엇인지 짐작 가는 분이 계신가요?

A) It requires you to allocate the heap in parallel.

힙을 병렬로 할당해야 합니다.

Q) Yeah, so let's assume that we can do parallel heap allocation. And we'll talk about that. So assuming that we can do that correctly, what's the issue with this approach?

병렬 힙 할당이 가능하다고 가정해 보겠습니다. 이에 대해서는 나중에 자세히 이야기해 보죠. 병렬 할당이 제대로 가능하다고 가정했을 때, 이 접근 방식의 문제점은 무엇일까요?

A) It's that you don't know how big the stack is going to be?

스택의 크기를 정확히 알 수 없다는 점입니다.

Q) So let's assume that you can get whatever stack frames you need from the heap, so you don't actually need to put an upper bound on this.

Q) 필요한 스택 프레임을 힙에서 가져올 수 있다고 가정해 보겠습니다. 그러면 스택 크기에 상한을 정할 필요가 없겠죠.

A) We don't know the maximum depth.

최대 깊이를 알 수 없습니다.

Q) Yeah. So we don't know the maximum depth, but let's say we can make that work. So you don't actually need to know the maximum depth if you're allocating off the heap. Any other guesses?

Q) 네. 최대 깊이를 알 수 없지만, 제대로 작동하도록 만들 수 있다고 가정해 보겠습니다. 힙 외부에 할당하는 경우에는 최대 깊이를 알 필요가 없겠죠. 다른 문제점은 없을까요?

A) Something to do with returning from the stack that is allocated on the heap to one of the original stacks.

힙에 할당된 스택에서 원래 스택 중 하나로 반환하는 것과 관련된 문제일 수 있습니다.

So let's say we could get that to work as well. So what happens if I try to run some program using this heap-based cactus stack with something that's using the regular stack? So let's say I have some old legacy code that was already compiled using the traditional linear stack. So there's a problem with interoperability here. Because the traditional code is assuming that, when you make a function call, the stack frame for the function call is going to appear right after the stack frame for the particular callee function. So if you try to mix code that uses the traditional stack as well as this heap-based cactus stack approach, then it's not going to work well together. One approach is that you can just recompile all your code to use this heap-based cactus stack. Even if you could do that, even if all of the source codes were available, there are some legacy programs that actually in the source code, they do some manipulations with the stack, because they assume that you're using the traditional stack, and those programs would no longer work if you're using a heap-based cactus stack. So the problem is interoperability with legacy code. Turns out that you can fix this using an approach called thread local memory mapping. So one of the students mentioned memory mapping. But that requires changes to the operating system. So it's not general purpose.

그 방식도 제대로 작동한다고 가정해 봅시다. 그렇다면 힙 기반 Cactus Stack을 사용하는 프로그램과 일반 스택을 사용하는 프로그램을 함께 실행하려고 하면 어떻게 될까요? 예를 들어, 기존의 선형 스택을 사용하여 컴파일된 오래된 레거시 코드가 있다고 가정해 보겠습니다. 여기서 상호 운용성(Interoperability) 문제가 발생합니다. 기존 코드는 함수 호출 시 호출된 함수(Callee function)의 스택 프레임이 호출된 함수의 스택 프레임 바로 다음에 나타난다고 가정합니다. 따라서 기존 스택을 사용하는 코드와 힙 기반의 Cactus Stack 방식을 사용하는 코드를 혼합해서 사용하면 제대로 작동하지 않습니다. 한 가지 해결책은 모든 코드를 힙 기반의 Cactus Stack을 사용하도록 다시 컴파일하는 것입니다. 설령 모든 소스 코드를 사용할 수 있다고 해도, 기존 스택을 사용한다고 가정하고 스택을 조작하는 레거시 프로그램들이 소스 코드에 존재합니다. 이러한 프로그램들은 힙 기반의 Cactus 스택을 사용하게 되면 더 이상 작동하지 않게 됩니다. 즉, 레거시 코드와의 상호 운용성(Interoperability)이 문제입니다. 다행히 Thread Local Memory Mapping이라는 방식을 사용하면 이 문제를 해결할 수 있습니다. 한 학생이 메모리 매핑에 대해 언급했었죠. 하지만 이 방식은 운영체제를 수정해야 하므로 일반적으로 사용할 수 있는건 아닙니다(not general purpose).

But the heap-based cactus stack turns out to be very simple. And we can prove nice bounds about it. So besides the interoperability issue, heap-based cactus stacks are pretty good in practice, as well as in theory. So we can actually prove a space bound of a Cilk program that uses the heap-based cactus stack. So let's say S 1 is the stack space required by a serial execution of a Cilk program. Then the stack space of P-worker execution using a heap-based cactus stack is going to be upper bounded by P times S 1. so S p is the space for a p worker execution, and that's less than or equal to P times S 1.

하지만 힙 기반 Cactus stack은 매우 간단합니다. 그리고 이에 대한 훌륭한 상한(Space Bound)을 증명할 수 있습니다. 따라서 상호 운용성(Interoperability) 문제를 제외하면 힙 기반 Cactus stack은 이론뿐 아니라 실전에서도 매우 좋습니다. 실제로 힙 기반 Cactus stack을 사용하는 Cilk 프로그램의 공간 상한(space bound)을 증명할 수 있습니다. Cilk 프로그램의 직렬 실행에 필요한 스택 공간을 S₁이라고 하면, 힙 기반 Cacktus Stack을 사용하는 p번째 워커 실행의 스택 공간은 P * S₁으로 상한이 정해집니다. 즉, S p는 p번째 워커 실행에 필요한 공간이며, 이는 (S p <= P * S₁)입니다.

(→ 즉, Heap-based Cactus Stack을 사용하면 Serial Stack보다 적어도 성능상으로 불리하지는 않다는 것이다. 다시 말해서 최악의 경우에도 동일한 성능을 낸다는 것이 증명되어 있음.)

To understand how this works, we need to understand a little bit about how the Cilk's work-stealing algorithm works. So in the Cilk work-stealing algorithm, whenever you spawn something of work, or that spawns a new task, is going to work on the task that it spawned. So therefore, for any leaf in the invocation tree that currently exists, there's always going to be a worker working on it. There's not going to be any leaves in the tree where there's no worker working on it. Because when a worker spawns a task, it creates a new leaf. But then it works immediately on that leaf.

이것이 어떻게 작동하는지 이해하려면 Cilk의 Work Stealing 알고리즘이 어떻게 작동하는지 조금 알아야 합니다. Cilk Work Stealing 알고리즘에서는 작업을 생성하거나 새로운 작업을 생성하는 작업을 수행할 때, 해당 작업은 자신이 생성한 작업을 처리합니다. 따라서 현재 존재하는 호출 트리의 모든 리프 노드에는 항상 워커가 실행 중입니다. 워커가 실행 중이지 않은 리프 노드는 없습니다. 워커가 태스크를 생성하면 새로운 리프 노드를 만들고, 그 리프 노드에서 즉시 작업을 시작하기 때문입니다.

So here we have a invocation tree. And for all of the leaves, we have a processor working on it. And with this busy leaves property, we can easily show this space bound. So for each one of these processors, the maximum stack space it's using is going to be upper bounded by s 1, because that's maximum stack space across a serial execution that executes the whole program. And then since we have p of these leaves, we just multiply s 1 by p, and that gives us an upper bound on the overall space used by a p worker execution. This can be a loose upper bound, because we're double counting here. There's some part of this memory that we're counting more than once, because they're shared among the different processors. But that's why we have the less than or equal to here. So it's upper bounded by p times s 1. So this is one of the nice things about using a heap-based cactus stack is that you get this good space bound. So this is one of the nice things about using a heap-based cactus stack is that you get this good space bound.

이렇게 호출 트리가 있고, 모든 리프 노드에는 각 노드에서 작업하는 프로세서가 있습니다. 이 'busy-leaves property'를 이용하면 Space Bound를 쉽게 보여줄 수 있습니다. 각 프로세서가 사용하는 최대 스택 공간은 S₁으로 제한됩니다. 왜냐하면 S₁은 전체 프로그램을 실행하는 직렬 실행 동안의 최대 스택 공간이기 때문입니다. 리프 노드가 p개 있으므로 S₁에 p를 곱하면 p개의 워커 실행에서 사용되는 전체 공간의 상한값을 얻을 수 있습니다. 이 상한값은 다소 느슨할 수 있는데, 일부 메모리 영역은 여러 프로세서에서 공유되기 때문에 이중 계산이 발생하기 때문입니다. 하지만 이것이 바로 '이하' 조건을 사용한 이유입니다. 따라서 상한값은 P * S₁입니다. 이처럼 힙 기반 Cactus 스택을 사용하면 Space Bound를 명확하게 정의할 수 있다는 장점이 있습니다.

let's try to apply this theorem to a real example. So this is the divide and conquer matrix multiplication code that we saw in a previous lecture. So in this code, we're making eight recursive calls to a divide and conquer function. Each of size n over 2. And before we make any of these calls, we're doing a malloc to get some temporary space. And this is of size order N squared. And then we free this temporary space at the end. And notice here that the allocations of the temporary matrix obey a stack discipline. So we're allocating stuff before we make recursive calls. And we're freeing it after, or right before we return from the function. So all this stack-- all the allocations are nested, and they follow a stack discipline. And it turns out that even if you're allocating off the heap, if you follow a stack discipline, you can still use the space bound from the previous slide to upper bound the p worker space.

이 정리를 실제 예제에 적용해 보겠습니다. 이것은 이전 강의에서 살펴본 분할 정복 행렬 곱셈 코드입니다. 이 코드에서는 분할 정복 함수를 8번 재귀적으로 호출합니다. 각 호출의 크기는 n/2입니다. 이러한 호출을 하기 전에 `malloc`을 사용하여 임시 공간을 할당합니다. 이 공간의 크기는 n^2입니다. 그리고 마지막에 이 임시 공간을 해제합니다. 여기서 임시 행렬 할당이 스택 규칙을 따르는 것을 알 수 있습니다. 재귀 호출 전에 메모리를 할당하고, 함수에서 반환하기 직전에 해제합니다. 모든 할당이 중첩되어 있고 스택 규칙을 따릅니다. 힙 외부에 할당하더라도 스택 규칙을 따르면 이전 슬라이드에서 살펴본 공간 상한을 사용하여 p 작업자 공간의 상한을 구할 수 있습니다.

so let's try to analyze the space of this code here. So first let's look at what the work and span are. So this is just going to be review. What's the work of this divide and conquer matrix multiply? So it's n cubed. So it's n cubed because we have eight solve problems of size n over 2. And then we have to do linear work to add together the matrices. So our recurrence is going to be t 1 of N is equal to eight times t 1 of N over 2 plus order N squared. And that solves to order N cubed if you just pull out your master theorem card.

이제 이 코드의 Space를 분석해 보겠습니다. 먼저 작업량과 span을 살펴보겠습니다. 이건 복습입니다. 이 분할 정복 행렬 곱셈(D&C Matrix Mult)의 작업량은 얼마일까요? θ(n^3)입니다. θ(n^3)인 이유는 n/2 크기의 문제를 8개 풀어야 하기 때문입니다. 그리고 행렬들을 더하는 데에는 선형적인 작업량이 필요합니다. 따라서 점화식은 T₁(n) = 8 × T₁(n/2) + θ(n²)입니다. master theorem card를 이용하면 θ(n^3)이 나옵니다.

What about the span? So what's the recurrence here? so the span T infinity of n is equal to T infinitive of n over 2 plus a span of the addition. And what's the span of the addition? let's assume that we have a parallel addition. We have nested Cilk four loops. Right, so then the span of that is just going of be log n. Since the span of 1 Cilk four loop is log n and when you nest them, you just add together the span. So it's going to be t infinity of n is equal to t infinity of n over 2 plus order log n. And what does that solve to? Yeah, so it's going to solve to order log squared n. Again you can pull out your master theorem card, and look at one of the three cases.

그럼 span의 점화식은 어떻게 될까요? span은 T∞(n) = T∞(n/2) + θ(log n)의 스팬입니다. 덧셈의 span은 무엇일까요? 병렬 덧셈이라고 가정해 보겠습니다. 중첩된 4개의 루프가 있다고 합시다. 그러면 스팬은 log n이 됩니다. 1개의 4개 루프의 스팬은 log n이고, 중첩된 루프의 스팬은 모두 더해지기 때문입니다. 그러니까 T∞n = T∞(n/2) + θ(log^2 n) 과 같다는 거죠. 그럼 이걸 풀면 뭐가 나올까요? 네, θ(log^2 n)이 나옵니다. 다시 한번 mater theorem card를 떠올려 세 가지 경우 중 하나를 살펴보면 됩니다.

let's look at the space. What's going to be the recurrence for the space? The only place we're generating new space is when we call this malloc here. So they're all seeing the same original matrix. So what would the recurrence be? So the n square term is right. Do we actually need eight subproblems of size n over 2? What happens after we finish one of these sub problems? Are we still going to use the space for it?
So you can actually reuse the memory. Because you free the memory you allocated after each one of these recursive calls. So therefore the recurrence is just going to be s of n over 2 plus theta n squared. And what does that solve to?

N squared. Right. So here the n squared term actually dominates. You have a decreasing geometric series. So it's dominated at the root, and you get theta of n squared. And therefore by using the busy leaves property and the theorem for the space bound, this tells us that on p processors, the space is going to be bounded by p times n squared. And this is actually pretty good since we have a bound on this.

Space를 살펴봅시다. space에 대한 점화식은 어떻게 될까요? 새로운 공간을 생성하는 유일한 곳은 여기서 `malloc`을 호출할 때입니다. 따라서 모두 동일한 원래 행렬을 참조합니다. 그렇다면 점화식은 어떻게 될까요? θ(n^2)이 맞습니다. 실제로 크기가 n/2인 하위 문제가 8개나 필요할까요? 이러한 하위 문제 중 하나를 완료한 후에는 어떻게 될까요? 여전히 해당 하위 문제를 위해 공간을 사용하게 될까요? 실제로 메모리를 재사용할 수 있습니다. 각 재귀 호출 후에 할당된 메모리를 해제하기 때문입니다. 따라서 재귀식은 `S(n/2) + θ(n²)`이 됩니다. 이 식의 해는 무엇일까요? θ(n²)입니다. 맞습니다. 여기서 n² 항이 지배적입니다. 감소하는 기하급수이므로 루트에서 지배적이며, θ(n²)이 됩니다. 따라서 busy-leaves property와 space bound theorem을 사용하면 p개의 프로세서에서 공간은 O(p x n^2)으로 제한된다는 것을 알 수 있습니다. 이는 실제로 space bound를 알 수 있다는 점에서 상당히 좋은 결과입니다.

→ 분할 정복부분을 복습하고 다시 봐야 할듯...

TODO : master theorem card, Divide&Conquer Matrix Mult

It turns out that we can actually prove a stronger bound for this particular example. And I'll walk you through how we can prove this stronger bound. Here's the order p times n squared is already pretty good. But we can actually do better if we look internally at how this algorithm is structured. So on each level of recursion, we're branching eight ways. And most of the space is going to be used near the top of this recursion tree. So if I branch as much as possible near the top of my recursion tree, then that's going to give me my worst case space bound. Because the space is decreasing geometrically as I go down the tree. So I'm going to branch eight ways until I get to so me level k in the recursion tree where I have p nodes. And at that point, I'm not going to branch anymore because I've already used up all p nodes. And that's the number of workers I have. So let's say I have this level k here, where I have p nodes.

이 특정 예제에 대해 더 강력한 space bound를 증명할 수 있다는 것이 밝혀져 있습니다. 그리고 이 더 강력한 space bound를 어떻게 증명하는지 단계별로 설명해 드리겠습니다. 여기서 p * n^2은 이미 상당히 좋은 결과입니다. 하지만 알고리즘의 구조를 내부적으로 살펴보면 더 나은 결과를 얻을 수 있습니다. 각 재귀 레벨에서 8개의 분기가 발생합니다. 그리고 대부분의 공간은 재귀 트리의 상단 부근에서 사용됩니다. 따라서 재귀 트리의 상단 부근에서 최대한 많이 분기하면 최악의 경우 space bound를 얻을 수 있습니다. 트리를 아래로 내려갈수록 공간은 기하급수적으로 감소하기 때문입니다. 따라서 재귀 트리의 레벨 k에 도달할 때까지 8개의 분기가 발생하며, 이때 노드 수는 p개입니다. 이 시점에서는 더 이상 분기하지 않습니다. 이미 모든 p개의 노드를 사용했기 때문입니다. 이것이 제가 가진 worker 개수입니다. 노드 수가 p개인 레벨 k가 있다고 가정해 보겠습니다.

Q) what would be the value of k here? If I branch eight ways how many levels do I have to go until I get to p nodes?

k의 값은 얼마일까요? 만약 분기점이 8개라면, p개의 노드에 도달하기까지 몇 단계나 거쳐야 할까요?

Yes. It's log base 8 of p. So we have eight, the k, equal p, because we're branching k ways. And then using some algebra, you can get it so that k is equal to log base 8 of p, which is equal to log base 2 of p divided by 3. And then at this level k downwards, it's going to decrease geometrically. So the space is going to be dominant at this level k. So the space decreases geometrically as you go down from level k, and also as you go up from level k. So therefore we can just look at what the space is at this level k here. So the space is going to be p times the size of each one of these nodes squared. And the size of each one of these nodes is going to be n over 2 to the log base 2 of p over 3. And then we square that because we're using n squared temporary space. So if you solve that, that gives you p to the one-third times n squared, which is better than the upper bound we saw earlier of order p times n squared. So you can work out the details for this example. Not all the details are shown on this slide. You need to show that the level k here actually dominates all the other levels in the recursion tree. But in general, if you know what the structure of the algorithm, is you can potentially prove a stronger space bound than just applying the general theorem we showed on the previous slide.

네, p의 밑이 8인 로그입니다. 분기가 k개이므로 k는 p와 같습니다. 대수학을 이용하면 k는 p의 밑이 8인 로그와 같고, 이는 p의 밑이 2인 로그의 나누기 3과 같습니다. 그리고 이 레벨 k에서 아래로 내려갈수록 공간은 기하급수적으로 감소합니다. 따라서 이 레벨 k에서 공간이 가장 많이 사용됩니다. 레벨 k에서 아래로 내려갈수록 공간은 기하급수적으로 감소하고, 위로 올라갈 때도 마찬가지입니다. 그러므로 레벨 k에서의 공간을 살펴보면, 공간은 각 노드의 크기를 제곱한 값에 p를 곱한 것과 같습니다. 각 노드의 크기는 n/2의 밑이 2인 로그를 3으로 나눈 값입니다. 그리고 임시 공간을 n 제곱으로 사용하므로 제곱하는 것입니다. 그래서 이 문제를 풀면 p의 1/3제곱 곱하기 n의 제곱이 나오는데, 이는 앞서 살펴본 상한인 p 곱하기 n의 제곱보다 더 나은 결과입니다. 이 예제의 세부 사항은 직접 계산해 볼 수 있습니다. 이 슬라이드에 모든 세부 사항이 나와 있는 것은 아닙니다. 여기서 레벨 k가 재귀 트리의 다른 모든 레벨을 지배한다는 것을 보여줘야 합니다. 하지만 일반적으로 알고리즘의 구조를 알고 있다면 이전 슬라이드에서 보여드린 일반 정리를 적용하는 것보다 잠재적으로 더 강력한 space bound를 증명할 수 있습니다.

★

→ 한 번의 level 마다 8개로 나뉘어 지므로, P개의 워커를 전부다 사용하는 level k에서 할당하는 메모리의 space bound는 8^k = P, 즉 k = (lg P) / 3라는 것을 알 수 있음.

따라서 이전에 증명한 space bound인 θ(p n^2)보다 더 강력한 θ(p^(1/3) n^2)을 증명할 수 있다.

Q) How do we know that the workers don't split along the path of the ~~ instead of across or horizontal.

워커 노드들이 ~~ 경로를 따라 분기하지 않고, 수평 방향으로 분기하지 않는다는 것을 어떻게 알 수 있나요?

A) So you don't actually know that. But this turns out to be the worst case. So if it branches any other way, the space is just going to be lower. So you have to argue that this is going to be the worst case, and it's going to be-- intuitively it's the worst case, because you're using most of the memory near the root of the recursion tree. So if you can get all p nodes as close as possible to the root, that's going to make your space as high as possible. It's a good question.

A) 사실 그걸 확실히 알 수는 없습니다. 하지만 이것이 최악의 경우입니다. 만약 다른 방향으로 분기한다면 공간 사용량은 더 줄어들 것입니다. 따라서 이것이 최악의 경우라고 주장해야 하며, 직관적으로도 최악의 경우입니다. 왜냐하면 재귀 트리의 루트 근처에서 대부분의 메모리를 사용하기 때문입니다. 따라서 모든 p 노드를 루트에 최대한 가깝게 배치할수록 공간 사용량이 최소화될 것입니다.

So parallel functions fail to interoperate with legacy and third-party serial binaries. Even if you can recompile all of this code, which isn't always necessarily the case, you can still have issues if the legacy code is taking advantage of the traditional linear stack inside the source code. So our implementation of Cilk uses a less space efficient strategy that is interoperable with legacy code. And it uses a pool of linear stacks instead of a heap-based strategy. So we're going to maintain a pool of linear stacks lying around. There's going to be more than p stacks lying around. And whenever a worker tries to steal something, it's going to try to acquire one of these tasks from this pool of linear tasks. And when it's done, it will return it back. But when it finds that there's no more linear stacks in this pool, then it's not going to steal anymore. So this is still going to preserve the space bound, as long as the number of stacks is a constant times the number of processors. But it will affect the time bounds of the work-stealing algorithm. Because now when a worker is idle, it might not necessarily have the chance to steal if there are no more stacks lying around. This strategy doesn't require any changes to the operating system. There is a way where you can preserve the space and the time bounds using thread local memory mapping. But this does require changes to the operating system. So our implementation of Cilk uses a pool of linear stacks, and it's based on the Intel implementation.
병렬 함수는 레거시 및 써드파티 직렬 바이너리(= traditional linear stack)와 상호 운용되지 않습니다. 모든 코드를 다시 컴파일할 수 있다고 하더라도(항상 가능한 것은 아니지만), 레거시 코드가 소스 코드 내의 기존 선형 스택을 활용하는 경우 여전히 문제가 발생할 수 있습니다. 그래서 Cilk 구현에서는 레거시 코드와의 상호 운용성(interoperability)을 유지하면서 공간 효율성은 다소 떨어지는 전략을 사용합니다. 그래서 힙 기반 전략 대신 선형 스택 풀을 사용합니다. 즉, 선형 스택 풀을 유지하고, 이 풀에는 p개 이상의 스택이 존재합니다. 워커가 작업을 훔칠려고 할 때, 이 선형 스택 풀에서 작업 하나를 획득려고 시도합니다. 작업이 완료되면 작업을 반환합니다. 하지만 풀에 더 이상 선형 스택이 없으면 더 이상 작업을 훔치려 하지 않습니다. 따라서 스택 수가 프로세서 수의 곱과 같은(StackNum == ProcessorNum * N) 상수 값인 한, 이 방식은 space bound를 유지합니다. 하지만 work-stealing 알고리즘의 time bound에는 영향을 미칩니다. 왜냐하면 이제 워커가 Idle 상태일 때, 사용 가능한 스택이 더 이상 없다면 스택을 훔칠 기회가 없을 수도 있기 때문입니다. 이 전략은 운영 체제를 변경할 필요가 없습니다. thread local memory mapping을 사용하여 space/time bound를 유지하는 방법이 있지만, 이 방법은 운영 체제를 변경해야 합니다. 따라서 저희가 구현한 Cilk는 인텔 구현을 기반으로한 선형 스택 풀을 사용합니다.

Basic Properties Of Storage Allocators

Allocator Speed를 다음과 같이 정의하자.

- allocator가 초당 수행할 수 있는 할당/해제 횟수.

Q) large block / small block 중에 allocator speed를 최적화하는 것이 중요한 것은 어느 경우일까요?

A) small block

Q) 이유는?

A1) small block에 대해 최적화하지 않으면, fragmentation이 발생할 가능성이 더 높음

Q) 근본적인 이유

A) the main reason is that when you're allocating a block, a user program is typically going to write to all the bytes in the block. And therefore, for a large block, it takes so much time to write that the allocator time has little effect on the overall running time. Whereas if a program allocates many small blocks, the amount of useful work it's actually doing on the block can be comparable to the overhead for the allocation. And therefore, all of the allocation overhead can add up to a significant amount for small blocks. So essentially for large blocks, you can amortize away the overheads for storage allocation, whereas for small, small blocks, it's harder to do that. Therefore, it's important to optimize for small blocks.

주된 이유는 블록을 할당할 때 사용자 프로그램이 일반적으로 블록의 모든 바이트에 쓰기 작업을 수행하기 때문입니다. 따라서 큰 블록의 경우 쓰기 작업에 소요되는 시간이 매우 길어 할당 시간이 전체 실행 시간에 미치는 영향이 미미합니다. 반면 프로그램이 여러 개의 작은 블록을 할당하는 경우, 실제로 블록에서 수행하는 유용한 작업량이 할당 오버헤드와 비슷할 수 있습니다. 따라서 작은 블록의 경우 모든 할당 오버헤드가 누적되어 상당한 시간이 소요될 수 있습니다. 즉, 큰 블록의 경우 저장 공간 할당 오버헤드를 분산 처리할 수 있지만, 아주 작은 블록의 경우에는 그렇게 하기가 어렵습니다. 따라서 작은 블록에 대한 최적화가 중요합니다.

Here's another definition. So the user footprint is the maximum over time of the number u of bytes in use by the user program. And these are the bytes that are allocated and not freed. And this is measuring the peak memory usage. It's not necessarily equal to the sum of the sizes that you have allocated so far, because you might have reused some of that. So the user footprint is the peak memory usage and number of bytes. And the allocator footprint is the maximum over time of the number of a bytes that the memory provided to the allocator by the operating system.

And the reason why the allocator footprint could be larger than the user footprint, is that when you ask the OS for some memory, it could give you more than what you asked for. And similarly, if you ask malloc for some amount of memory, it can also give you more than what you asked for. And the fragmentation is defined to be a divided by u. And a program with low fragmentation will keep this ratio as low as possible, so keep the allocator footprint as close as possible to the user footprint. And in the best case, this ratio is going to be one. So you're using all of the memory that the operating system allocated. One remark is that the allocator footprint a usually gross monotonically for many allocators.

So it turns out that many allocators do m maps to get more memory. But they don't always free this memory back to the OS. And you can actually free memory using something called m unmap, which is the opposite of m map, to give memory back to the OS. But this turns out to be pretty expensive. In modern operating systems, their implementation is not very efficient. So many allocators don't use m unmap. You can also use something called m advise.

And what m advise does is it tells the operating system that you're not going to be using this page anymore but to keep it around in virtual memory. So this has less overhead, because it doesn't have to clear this entry from the page table. It just has to mark that the program isn't using this page anymore. So some allocators use m advise with the option, don't need, to free memory. But a is usually still growing monotonically over time, because allocators don't necessarily free all of the things back to the OS that they allocated. Here's a theorem that we proved in last week's lecture, which says that the fragmentation for binned free list is order log base 2 of u, or just order log u. And the reason for this is that you're can have log-based 2 of u bins.

And for each bin it can basically contain u bytes of storage. So overall you can use-- overall, you could have allocated u times log u storage, and only be using u of those bytes. So therefore the fragmentation is order log u. Another thing to note is that modern 64-bit processors only provide about 2 to 48 bytes of virtual address space. So this is sort of news because you would probably expect that, for a 64-bit processor, you have to the 64 bytes of virtual address space. But that turns out not to be the case. So they only support to the 48 bytes.

And that turns out to be enough for all of the programs that you would want to write. And that's also going to be much more than the physical memory you would have on a machine. So nowadays, you can get a big server with a terabyte of memory, or to the 40th bytes of physical memory, which is still much lower than the number of bytes in the virtual address space.

다른 정의를 살펴보겠습니다. User footprint는 사용자 프로그램이 사용하는 바이트 수 u의 시간 경과에 따른 최대값입니다. 여기서 u는 할당되었지만 해제되지 않은 바이트를 나타냅니다. 이는 최대 메모리 사용량을 측정하는 것이며, 지금까지 할당된 크기의 합과 반드시 같지는 않습니다. 일부 메모리가 재사용되었을 수 있기 때문입니다. 따라서 사용자 메모리 사용량은 최대 메모리 사용량(바이트 수)입니다. 반면 Allocator footprint는 운영 체제가 할당자에게 제공한 메모리 수 a의 시간 경과에 따른 최대값입니다. Allocator footprint가 User footprint보다 클 수 있는 이유는 운영 체제에 메모리를 요청할 때 요청한 양보다 더 많은 메모리를 할당받을 수 있기 때문입니다. 마찬가지로 malloc 함수를 사용하여 특정 양의 메모리를 할당할 때도 요청한 양보다 더 많은 메모리를 할당받을 수 있습니다. fragmentation은 할당된 메모리 수를 해제된 메모리 수(u)로 나눈 값으로 정의됩니다. fragmentation이 적은 프로그램은 이 비율을 최대한 낮게 유지하려고 노력하므로 Allocator footprint를 User footprint에 최대한 가깝게 유지합니다. 최상의 경우, 이 비율은 1이 됩니다. 즉, 운영체제가 할당한 모든 메모리를 사용하게 되는 것이죠.

한 가지 알아둘 점은 많은 Allocator footprint가 일반적으로 단조롭게 증가(grows monotonically)한다는 것입니다. 그래서 많은 할당자가 더 많은 메모리를 확보하기 위해 `m map`을 사용합니다. 하지만 할당된 메모리를 항상 운영체제에 반환하지는 않습니다. `m unmap`이라는 함수를 사용하면 메모리를 해제할 수 있는데, 이는 `m map`의 반대 개념으로 운영체제에 메모리를 반환하는 것입니다. 하지만 이 방법은 비용이 상당히 비싸다는 것이 밝혀져 있습니다. 최신 운영체제에서는 `m unmap`의 구현이 그다지 효율적이지 않기 때문에 많은 할당자가 `m unmap`을 사용하지 않습니다.

`m advise`라는 함수도 사용할 수 있습니다. `m advise`는 운영체제에 해당 페이지를 더 이상 사용하지 않을 것이지만 가상 메모리에는 유지하도록 알려주는 함수입니다. 따라서 페이지 테이블에서 해당 항목을 지울 필요가 없으므로 오버헤드가 적습니다. 프로그램이 더 이상 해당 페이지를 사용하지 않는다는 표시만 하면 됩니다. 일부 할당자는 메모리 해제를 선택하지 않아도 되는 `m advise` 를 사용합니다. 하지만 할당자가 할당한 모든 메모리를 운영체제에 반환하지 않기 때문에 메모리 크기는 시간이 지남에 따라 단조롭게 증가하는 경우가 많습니다.

지난주 강의에서 증명한 정리에 따르면, binned-free list의 단편화는 log₂ u의 차수가 됩니다. 그 이유는 log₂ 밑이 u인 bin을 최대 두 개까지 가질 수 있기 때문입니다. 각 bin은 기본적으로 u바이트의 저장 공간을 가질 수 있습니다. 따라서 전체적으로 u × log u의 저장 공간을 할당하고 실제로 사용하는 바이트는 u개일 수 있습니다. 그러므로 단편화는 log u가 됩니다. 또 하나 알아둘 점은 최신 64bit 프로세서는 약 2~48바이트의 가상 주소 공간만 제공한다는 것입니다. 이건 좀 놀라운 소식인데요, 64비트 프로세서라면 당연히 64바이트의 가상 주소 공간을 사용할 수 있을 거라고 생각하기 쉽잖아요. 하지만 실제로는 그렇지 않습니다. 48바이트만 지원하거든요. 그런데 48바이트면 대부분의 프로그램을 실행하기에 충분합니다. 게다가 피지컬 메모리 용량보다도 훨씬 많죠. 요즘에는 1TB 정도의 메모리를 가진 대형 서버도 있는데, 피지컬 메모리 용량은 40분의 1바이트 정도밖에 안 됩니다. 그래도 가상 주소 공간의 바이트 수보다는 훨씬 적습니다.

the space overhead of an allocator is a space used for bookkeeping. So perhaps you could store headers with the blocks that you allocate to keep track of the size and other information. And that would contribute to the space overhead Internal fragmentation is a waste due to allocating larger blocks in the user request. So you can get internal fragmentation if, when you call malloc, you get back a block that's actually larger than what the user requested. We saw on the bin free list algorithm, we're rounding up to the nearest power of 2's. If you allocate nine bytes, you'll actually get back 16 bytes in our binned-free list algorithm from last lecture. So that contributes to internal fragmentation. It turns out that not all binned-free list implementations use powers of 2. So some of them use other powers that are smaller than 2 in order to reduce the internal fragmentation. Then there's an external fragmentation, which is the waste due to the inability to use storage because it's not contiguous. So for example, if I allocated a whole bunch of one byte things consecutively in memory, then I freed every other byte. And now I want to allocate a 2-byte thing, I don't actually have contiguous memory to satisfy that request, because all of my free bytes are in one-bite chunks, and they're not next to each other. So this is one example of how external fragmentation can happen after you allocate stuff and free stuff. Then there's blow up. And this is for a parallel allocator. The additional space beyond what a serial allocator would require. So if a serial allocator requires s space, and a parallel allocator requires t space, then it's just going to be t over s. That's the blow up.

Allocator의 Space Overhead는 bookkepping을 위한 공간입니다. 예를 들어, 할당하는 블록의 크기 및 기타 정보를 추적하기 위해 헤더를 저장할 수 있습니다. 이 또한 공간 오버헤드에 영향을 미칩니다.

Internal Fragmentation은 사용자 요청보다 큰 블록을 할당하여 발생하는 낭비입니다. malloc을 호출했을 때 사용자가 요청한 크기보다 큰 블록이 반환되면 내부 단편화가 발생합니다. 지난 강의에서 살펴본 binned-free list 알고리즘에서는 2의 거듭제곱으로 반올림합니다. 9바이트를 할당하면 실제로는 16바이트가 반환됩니다. 이것이 내부 단편화의 원인이 됩니다. 모든 binned-free list 구현이 2의 거듭제곱을 사용하는 것은 아닙니다. 일부 구현은 내부 단편화를 줄이기 위해 2보다 작은 거듭제곱을 사용합니다.

External Fragmentation은 연속적이지 않은 저장 공간으로 인해 발생하는 낭비입니다. 예를 들어, 1바이트 크기의 메모리 요소를 연속적으로 여러 개 할당한 다음, 하나씩 건너뛰면서 해제했다고 가정해 보겠습니다. 이제 2바이트 크기의 요소를 할당하려고 하는데, 모든 해제된 메모리가 1바이트 단위로 흩어져 있고 서로 인접해 있지 않기 때문에 요청을 충족할 수 있는 연속적인 메모리 공간이 없습니다. 이것이 바로 메모리를 할당하고 해제한 후에 외부 단편화가 발생하는 한 예입니다.

다음으로는 'Blowup'이 있습니다. 이는 Parallel Allocator에 대한 설명으로, Serial Allocator가 필요로 하는 공간을 초과하는 추가 공간을 의미합니다. 예를 들어 Serial Allocator가 s만큼의 공간을 필요로 하고 병렬 할당자가 t만큼의 공간을 필요로 한다면, t를 s로 나눈 값이 됩니다. 이것이 바로 블로우업입니다.

Parallel Allocation Strategies

Strategy 1 : Global Heap

장점

- Serial Allocator에 비해서, Parallel Allocator를 사용함으로써 추가 사용하는 메모리가 없다. 즉, Blowup = 1.

단점

- Lock을 획득해야 하는데, Seriall Allocator 사용시, 할당 작업 자체는 최근에 해제된(freed) 메모리를 재사용할 것이므로 해당 주소는 이미 L1-Cache에 있을 가능성이 높아 L1-Cache 접근 비용과 비슷하다. 반면에 Lock을 획득하는 것은 일반적으로 L2-Cache 접근 비용에 달함이 알려져 있으므로 느림.

- 스레드가 많아질 수록 경합(Contention)이 Scalability를 저해할 수 있음.

이상적으로는, 스레드 개수가 증가할 수록 allocation/deallocation 수행 비용이 증가하면 안됨.

- 실제로는 그렇지 않은데, scalability의 손실에 대한 대부분의 원인은 lock contention(락 획득을 위한 경합 비용)이기 때문.

the most common reason for loss of scalability is lock contention. So here all of the processes are trying to acquire the same lock, which is the same memory address. And if you recall from the caching lecture, or the multicore programming lecture, every time you acquire a memory location, you have to bring that cache line into your own cache, and then invalidate the same cache line in other processors' caches. So if all the processors are doing this, then this cache line is going to be bouncing around among all of the processors' caches, and this could lead to very bad performance.

Scalability 손실의 가장 일반적인 원인은 lock contetion(락 경합)입니다. 모든 프로세스가 동일한 메모리 주소에 대한 락을 획득하려고 시도하는 상황입니다. Caching / Multicore Programming 강의에서 배웠듯이, 메모리 위치를 획득할 때마다 해당 캐시 라인을 자신의 캐시로 가져오고 다른 프로세서의 캐시에서 동일한 캐시 라인을 무효화해야 합니다. 모든 프로세서가 이 작업을 수행하면 해당 캐시 라인이 모든 프로세서의 캐시 사이를 오가게(bouncing) 되어 성능 저하로 이어질 수 있습니다.

Q) Lock Contention은 large blocks / small blocks 중 어느 것에 대해 더 문제가 될까?

A) Small blocks !

Q) why?

A) Typically, a user program writes all the bytes of an allocated block, making it hard for a thread allocating large blocks to issue allocation requests at a high rate. In contrast, if a program allocates many small blocks in parallel, contention can be a significant issue.

일반적으로 사용자 프로그램은 할당된 블록의 모든 바이트를 Write하므로, 큰 블록을 할당하는 스레드가 고빈도로 할당 요청을 보내기 어렵습니다. 반대로, 프로그램이 여러 개의 작은 블록을 병렬로 할당하는 경우, 경합(Lock Contetion)이 심각한 문제가 될 수 있습니다.

→ 이전 질문과 비슷한 양상인데, large blocks에 대해 할당하는 것은, 횟수가 적으므로 lock contetion 개수가 많지 않음. 반면에, small blocks는 할당 횟수가 일반적으로 많으므로 그만큼 lock contetion 횟수가 많아져 문제가 될 수 있음.

Strategy 2 : Local Heaps

you could actually have an unbounded blow up. Because if you do all of the allocations in one heap, and you free everything in another heap, then whenever the first heap does an allocation, there's actually free space sitting around in another heap. But it's just going to grab more memory from the operating system. So you're blow up can be unbounded. And this phenomenon, it's what's called memory drift. So blocks allocated by one thread are freed by another thread. And if you run your program for long enough, your memory consumption can keep increasing. And this is sort of like a memory leak. So you might see that if you have a memory drift problem, your program running on multiple processors could run out of memory eventually. Whereas if you just run it on a single core, it won't run out of memory. And here it's because the allocator isn't smart enough to reuse things in other heaps.

실제로 blow-up이 급증할 수 있습니다. 모든 메모리 할당을 하나의 힙에서 수행하고 다른 힙에서 모든 메모리를 해제하면, 첫 번째 힙에서 할당이 발생할 때마다 다른 힙에는 실제로 여유 공간이 생깁니다. 하지만 이 경우 운영 체제에서 더 많은 메모리를 가져오게 됩니다. 따라서 blow-up이 급증할 수 있습니다. 이러한 현상을 Memory Drift라고 합니다. 한 스레드에서 할당한 메모리 블록이 다른 스레드에서 해제되는 것입니다. 프로그램을 오랫동안 실행하면 메모리 사용량이 계속 증가할 수 있습니다. 이는 일종의 메모리 누수와 같습니다. 메모리 드리프트 문제가 있는 경우, 여러 프로세서에서 실행되는 프로그램이 결국 메모리 부족으로 인해 오류가 발생할 수 있습니다. 반면 단일 코어에서 실행하면 메모리 부족 문제는 발생하지 않습니다. 이는 할당자가 다른 힙의 메모리를 재사용할 만큼 똑똑하지 않기 때문입니다.

if you keep allocating from one thread, if you do all of your allocations in one thread, and do all of your deallocations on another thread, every time you allocate from the first thread, there's actually memory sitting around in the system. But the first thread isn't going to see it, because it only sees its own heap. And it's just going to keep grabbing more memory from the OS. And then the second thread actually has this extra memory sitting around. But it's not using it. Because it's only doing the freeze. It's not doing allocate. And if we recall the definition of blow up is, how much more space you're using compared to a serial execution of a program. If you executed this program on a single core, you would only have a single heap that does the allocations and frees. So your memory isn't going to blow up. It's just going to be constant over time. Whereas if you use two threads to execute this, the memory could just keep growing over time.

만약 한 스레드에서 메모리 할당을 하고 다른 스레드에서 해제를 한다면, 첫 번째 스레드에서 할당이 발생할 때마다 시스템에는 실제로 여유 메모리가 생깁니다. 하지만 첫 번째 스레드는 자신의 힙만 볼 수 있기 때문에 이 여유 메모리를 인식하지 못합니다. 그래서 운영체제로부터 계속해서 메모리를 가져오려고 할 것입니다. 그러면 두 번째 스레드는 이 여유 메모리를 가지고 있지만 사용하지는 않습니다. 두 번째 스레드는 메모리 할당이 아닌 해제만 하기 때문입니다. blow-up이란 프로그램을 직렬로 실행할 때와 비교하여 얼마나 더 많은 공간을 사용하는지를 의미합니다. 이 프로그램을 단일 코어에서 실행하면 할당과 해제를 담당하는 단일 힙만 사용하게 되므로 메모리가 폭발적으로 증가하지 않고 시간이 지남에 따라 일정하게 유지됩니다. 하지만 두 개의 스레드를 사용하여 이 프로그램을 실행하면 메모리 사용량이 시간이 지남에 따라 계속해서 증가할 수 있습니다.

→ A 스레드에선 할당만 하고, B 스레드에선 A 스레드에서 할당한 메모리를 받아서 해제만 한다고 생각해보자. B에서 해제한 블록은 실제로 해제되었고 free-list에 들어가겠지만, A 스레드에선 해당 사실을 알 수 없다. 따라서 여유 메모리 공간이 있음에도 불구하고, B 스레드에서 해제되었기 때문에 해당 사실을 몰라서, 즉 free-list가 바닥났다고 생각하므로 OS에 추가 메모리를 요청하게된다.이를 Memory Drift라 하고, memory leak의 일종에 해당된다.

Q) ~

A) so if you remember the binned-free list approach, let's say we're using that. Then all you have to do is set some pointers in your binned-free lists data structure, as well as the block that you're freeing, so that it appears in one of the linked lists. So you can do that even if some other processor allocated that block.

그러니까, 이전에 사용했던 binned-free list 방식을 기억하신다면, 그 방식을 사용한다고 가정해 보겠습니다. 그러면 binned-free list 자료 구조에서 포인터를 설정하고, 해제하려는 블록이 연결 리스트 중 하나에 나타나도록 설정하기만 하면 됩니다. 따라서 다른 프로세서가 해당 블록을 할당했더라도 해제할 수 있습니다.

Strategy 3 : Local Ownership

- 할당된 객체에 대해, owner thread를 마킹.

- 해제된 객체는 owner's heap으로 반환.

장점

- Local Objects에 대한 할당 / 해제는 동기화가 필요없으므로 빠름.

단점

- Remote Objects에 대한 해제는 동기화가 필요

→ 단, global heap과 달리 하나의 스레드에 대해서만 동기화하면 되기 때문에 비교적 저렴함.

upper bound

Blow-up <= p

→ Serial Allocator 사용시 최대 메모리 사용량을 x라고 하자. p 개의 스레드가 local ownership heap 사용시, 최대 메모리 사용량은 p * x가 된다. 따라서 blow-up(ratio) = (p * x) / x = p로 upper bound가 된다.

Resilience to false sharing.

False Sharing

True Sharing은 2개의 프로세서가 동일한 메모리 위치에 접근하려고 할 때 발생합니다. False Sharing은 여러 개의 프로세서가 서로 다른 메모리 위치에 접근하지만, 해당 위치들이 우연히 동일한 캐시 라인에 있는 경우를 말합니다.

Let's say we have two variables, x and y. And the compiler happens to place x and y on the same cache line. Now, when the first processor writes to x, it's going to bring this cache line into its cache. When the other processor writes to y, since it's on the same cache line, it's going to bring this cache line to y's cache. And then now, the first processor writes x, it's going to bring this cache line back to the first processor's cache. And then you can see this phenomenon keep happening. So here, even though the processors are writing to different memory locations, because they happen to be on the same cache line, the cache line is going to be bouncing back and forth on the machine between the different processors' caches. And this problem gets worse if more processors are accessing this cache line. So this can be quite hard to debug. Because if you're using just variables on the stack, you don't actually know necessarily where the compiler is going to place these memory locations. So the compiler could just happen to place x and y in the same cache block. And then you'll get this performance hit, even though it seems like you're accessing different memory locations. If you're using the heap for memory allocation, you have more knowledge. Because if you allocate a huge block, you know that all of the memory locations are contiguous in physical memory. So you can space the accesses far enough apart so that different processes aren't going to touch the same cache line.

x와 y라는 두 변수가 있다고 가정해 봅시다. 컴파일러가 x와 y를 같은 캐시 라인에 배치한다고 생각해 보세요. 그러면 첫 번째 프로세서가 x에 쓰기를 하면, 이 캐시 라인의 내용을 자신의 캐시로 가져옵니다. 다른 프로세서가 y에 쓰기를 하면, 같은 캐시 라인에 있으므로 이 캐시 라인의 내용을 y의 캐시로 가져옵니다. 그리고 다시 첫 번째 프로세서가 x에 쓰기를 하면, 이 캐시 라인의 내용을 다시 첫 번째 프로세서의 캐시로 가져옵니다. 이런 현상이 계속 반복되는 것을 볼 수 있습니다. 즉, 프로세서들이 서로 다른 메모리 위치에 쓰기를 하더라도, 같은 캐시 라인에 있기 때문에 캐시 라인의 내용이 여러 프로세서의 캐시 사이를 오가게 됩니다. 더 많은 프로세서가 이 캐시 라인에 접근할수록 문제는 더욱 심각해집니다. 따라서 디버깅이 상당히 어려울 수 있습니다. 스택에 변수를 사용하는 경우, 컴파일러가 이러한 메모리 위치를 정확히 어디에 배치할지 알 수 없기 때문입니다. 컴파일러가 우연히 x와 y를 같은 캐시 블록에 배치할 수도 있습니다. 그러면 서로 다른 메모리 위치에 접근하는 것처럼 보이지만 성능 저하가 발생합니다. 힙을 사용하여 메모리를 할당하면 정보를 더 많이 알 수 있습니다. 큰 블록을 할당하면 모든 메모리 위치가 피지컬 메모리에서 연속적으로 연결되어 있음을 알 수 있습니다. 따라서 서로 다른 프로세스가 동일한 캐시 라인에 접근하지 않도록 접근 간격을 충분히 확보할 수 있습니다.

A program can induce false sharing having different threads process nearby objects.

- The programmer can mitigate this problem by aligning the object on a cache-line boundary and padding out the object to the size of a cache line, but this solution can be wasteful of space

프로그램은 서로 다른 스레드가 인접한 객체를 처리하도록 하여 false sharing를 유발할 수 있음.

- 프로그래머는 객체를 캐시 라인 경계에 정렬하고 캐시 라인 크기만큼 객체를 패딩하여 이 문제를 완화할 수 있지만, 이 해결책은 공간 낭비를 초래함.

Allcator가 아래 2가지 방법으로 false sharing을 유발 가능

- Actively

when the allocator satisfies memory requests from different threads using the same cache block.

Allocator가 동일한 캐시 라인을 사용하여 서로 다른 스레드의 메모리 요청을 처리할 때 발생합니다.

- Passively

when the program passes objects lying on the same cache line to different threads, and the allocator reuses the object's storage after the objects are freed to satisfy requests from those threads.

프로그램이 동일한 캐시 라인에 있는 객체를 서로 다른 스레드에 전달할 때, 할당자는 객체가 해제된 후 해당 객체의 저장 공간을 재사용하여 해당 스레드의 요청을 처리합니다.

Local Ownership approach tends to reduce false sharing because the thread that allocates an object is eventually going to get it back. You're not going to have it so that an object is permanently split among multiple processors' heaps. So even if you see false sharing in local ownership, it's usually temporary. Eventually the object is going to go back to the heap that it was allocated from, and the false sharing is going to go away.

Local Ownership 방식은 객체를 할당한 스레드가 결국 해당 객체를 다시 가져오기 때문에 false sharing를 줄이는 경향이 있습니다. 객체가 여러 프로세서의 힙에 영구적으로 분할되는 일은 발생하지 않습니다. 따라서 local ownership에서 false sharing이 발생하더라도 대개 일시적입니다. 결국 객체는 원래 할당된 힙으로 돌아가고, false sharing은 사라집니다.

Strategy 4 : Hoard Allocator

in the hoard allocator, we're going to have p local heaps. But we're also going to have a global heap. The memory is going to be organized into large super blocks of size s. And s is usually a multiple of the page size. So this is the granularity at which objects are going to be moved around in the allocator. And then you can move super blocks between the local heaps and the global heaps. So when a local heap has a lot of super blocks that are not being fully used and you can move it to the global heap, and then when a local heap doesn't have enough memory, it can go to the global heap to get more memory. And then when the global heap doesn't have any more memory, then it gets more memory from the operating system. So this is sort of a combination of the approaches that we saw before. The advantages are that this is a pretty fast allocator. It's also scalable. As you add more processors, the performance improves. You can also bound the blow up. And it also has resilience to false sharing, because it's using local heaps.

Hoard Allocator에는 p개의 로컬 힙과 전역 힙이 있습니다. 메모리는 크기가 s인 큰 슈퍼 블록으로 구성됩니다. 여기서 s는 일반적으로 페이지 크기의 배수입니다. 이 크기는 할당자 내에서 객체를 이동시키는 단위(Granularity)입니다. 그리고 슈퍼 블록은 로컬 힙과 전역 힙 사이에서 이동할 수 있습니다. 로컬 힙에 사용되지 않는 슈퍼 블록이 많으면 이를 전역 힙으로 이동할 수 있습니다. 반대로 로컬 힙에 메모리가 부족하면 전역 힙에서 메모리를 할당받을 수 있습니다. 전역 힙에도 메모리가 없으면 운영체제로부터 메모리를 할당받습니다. 이것은 이전에 살펴본 접근 방식들을 조합한 것입니다. 장점은 이 할당자가 매우 빠르다는 점입니다. 또한 확장성이 뛰어나 프로세서가 추가될수록 성능이 향상됩니다. blow-up도 제한됩니다. 그리고 로컬 힙을 사용하기 때문에 false sharing에도 강합니다.

장점

- Fast

- Scalable

- Bounded blowup

- Resilience to false sharing

1) thread i 내의 local heap으로부터 superblock 가져옴

2) global heap으로부터 superblock 가져옴

3) OS로부터 new superblock 가져옴

let's say we call malloc in our program. And let's say thread i calls the malloc. So what we're going to do is we're going to check if there is a free object in heap i that can satisfy this request. And if so, we're going to get an object from the fullest non-full super block in i's heap.

프로그램에서 malloc 함수를 호출한다고 가정해 봅시다. 그리고 스레드 i가 malloc을 호출한다고 가정해 보겠습니다. 그러면 우리는 힙 i에 이 요청을 만족시킬 수 있는 사용 가능한 객체가 있는지 확인할 것입니다. 만약 있다면, 스레드 i의 힙에서 가장 가득 차 있지만 아직 가득 차지 않은 슈퍼 블록에서 객체를 가져올 것입니다.

Q) Does anyone know why we want to get the object from the fullest non-full super block?

왜 the fullest non-full super block에서 객체를 가져오려고 하는지 아는 사람 있나요?

A) So when a super block needs to be moved, it's as dense as possible. And more importantly, this is to reduce external fragmentation. Because as we saw in the last lecture, if you skew the distribution of allocated memory objects to as few pages, or in this case, as few super blocks as possible, that reduces your external fragmentation.
슈퍼 블록을 이동해야 할 때, 가능한 한 밀집된 상태를 유지해야 합니다. 더 중요한 것은 이것이 external fragmentation를 줄이기 위한 것입니다. 지난 강의에서 살펴본 것처럼, 할당된 메모리 객체의 분포를 가능한 한 적은 페이지, 또는 이 경우에는 가능한 한 적은 슈퍼 블록에 집중시키면 외부 단편화를 줄일 수 있습니다.

So how it implements this is as follows. When we call free of x, let's say x is owned by thread i, then we're going to put x back into heap i, and then we're going to check if the n u storage in heap i, u sub i is less than the min of a sub i minus 2 s and a sub i over 2. And what this condition says, if it's true, it means that your heap is, at most, half utilized. Because if it's smaller than this, it has to be smaller than a sub i over 2. That means there's twice as much allocated than used in the local heap i. And therefore there must be some super block that's at least half empty. And you move that super block, or one of those super blocks, to the global heap.

이를 구현하는 방법은 다음과 같습니다. 스레드 i가 소유한 메모리를 free(x)로 해제하면, `x`를 힙 i에 다시 넣습니다. 그런 다음 `u_i` < min(a_i - 2s, a_i / 2)를 확인합니다. 이 조건이 참이면 힙이 최대 절반만 사용되고 있다는 의미입니다. 왜냐하면 만약 이보다 작다면, a_i / 2보다 작아야 하기 때문입니다. 즉, 로컬 힙 i에 할당된 공간이 사용량의 두 배라는 뜻입니다. 따라서 적어도 절반은 비어 있는 슈퍼 블록이 있어야 합니다. 그리고 그 슈퍼 블록, 또는 슈퍼 블록 중 하나를 글로벌 힙으로 옮깁니다.

★

u_i < min(a_i - 2S, a_ i / 2) 의미를 분석해보자.

- a_i : Local Heap

- u_i : Local Heap에서 사용 중인 용량

1) u_i < a_i - 2S(a_i - u_i) <= 2S

→ Slack을 2 * SuperBlockSize 까지만 허용하겠다는 의미.

2) u_i < a_i / 2Local Heap을 절반 미만으로 사용 중을 의미

3) u_i < min(a_i - 2S, a_i / 2)

Slack이 2 * SuperBlockSize보다 작거나, --- (1)

Local Heap을 절반 미만으로 사용 중인 경우 --- (2)

→ 그러므로 min이 뜻하는 것은 1) "OR" 2) 조건식을 의미.

즉, Deallocation을 할 때, Local Heap에서 너무 많은 공간을 잡고 있는 것이므로 Super Block 1개를 Global Heap으로 옮긴다는 것이다.

이전 슬라이드에서 배운 invariant를 통해 blow-up이 1 + O(SP/U)에 해당 한다는 것을 증명할 수 있음.

SLOC은 Serial Allocator의 약어인듯함..?

32 threads 기준, SuperMalloc은 가장 성능이 잘 나오는데다가 코드 라인 수도 가장 간단하다고 한다.

Parallel Garbage Collection

제일 듣고 싶은 부분이었는데 시간 없다고 생략됨.. 강의 자료 기반으로 작성.

Stop-the-world collector

- 프로그램이 일시적으로 정지하고, GC가 전체 메모리를 정리

- 긴 정지 시간

Incremental collector

- GC가 실행될 때 마다 메모리의 작은 부분을 정리

- 짧은 정지 시간

Running Collector with Program

- copying collector의 Incremental 버전.

- 가비지 컬렉션 시점이 되면 프로그램과 가비지 컬렉터가 교대로 실행됨.

BFS에서 이미 제거된 객체 O가 다른 객체 O'에 대한 참조를 얻게 되면, BFS는 O'를 찾지 못할 수 있으며, 이 경우 O'는 해제됩니다.

★

Incremental collector의 특성은 다음과 같다.

장점- 한 번에 모든 garbage를 정리하지 않고 점진적으로 일부만 정리하므로, 프로그램의 정지 시간이 적다.

단점

- 위 슬라이드와 같은 문제가 발생하는데, object O에 대해서 스캔이 끝났는데 다음 다른 일부를 스캔하는 동안, 프로그램 로직에 의해 object O가 O'을 참조하게 될 수 있다. 그러면 GC가 O가 O'을 참조한다는 사실을 알려면 다시 O를 스캔해야 할텐데, 그러지 못하므로 O'이 해제될 수 있다.

1) Program follows forward pointer if there is one.

기존 Copying collector에서도 존재하던 개념.

2) Whenever the program accesses an object not in the TO-space, mark object as explored and copy it over to the TO-space.

FROM space에 있는 객체를 접근하게 되면, 해당 객체를 'explored'로 마킹하고 TO space로 복사(당겨옴).

TO space에 있지 않은 object에 접근할 때 마다, 해당 object를 'explored'로 마킹하고 TO space로 복사. 위 과정을 위해 'read barrier'가 필요하게 됨.

★

=================================

https://cs.nyu.edu/~mwalfish/classes/14fa/ref/appel.pdf

appel's paper에 따르면, Baker's algorithm은 다음 invariant를 유지함.

- The mutator sees only TO-space pointers in its registers.

mutator는 레지스터에서 TO-space 포인터만 볼 수 있음.

- Objects in the new area contain TO-space pointers only (because new objects are initialized from the registers).

새 영역의 객체는 TO-space 포인터만 포함(새로운 객체는 레지스터에서 초기화되기 때문).

- Objects in the scanned area contain TO-space pointers only.

scanned area의 객체는 TO-space 포인터만 포함.

- Objects in the unscanned area contain both FROM-space and TO-space pointers.

unscanned area의 객체는 FROM-space 포인터와 TO-space 포인터를 모두 포함.

=================================

3) Whenever the program allocates an object, put it in the TO-space.

object를 할당할 때 마다, TO space에 넣음.

4) Requires a read barrier to intercept every read with a check, which is expensive.

'read'를 할 때 마다 fetch했던 객체가 FROM-space에 있는지 확인해야 하므로 비용이 들어감. write는 횟수가 적지만, read는 이에 비해 매무 빈번하므로 비싼 비용 발생.

Q) This algorithm is conservative in that it does not necessarily collect all garbage. Why?

A) 일단 TO-space로 옮기기 때문? TO-space로 옮긴 직후에 해당 참조가 지워지더라도 FROM-space로 옮기는 로직(undo)은 없어서 conservative(보수적).

Nettles-O'Toole Algorithm
1) Program works only in FROM space until garbage collection is finished.

프로그램은 GC가 완료될 때 까지 FROM-space에서만 동작.

2) Replicates the objects by keeping mutations to FROM-space objects in a log.

FROM-space 객체에 대한 변경 사항을 로깅하여 객체를 (TO-space에)복제.

3) Garbage collector applies the mutations to corresponding TO-space objects.

GC는 해당 변경 사항을 TO-space 객체에 적용.

4) Requires a write barrier to log mutations on every write.
- This is expensive, but writes are usually much less frequent than reads.

write작업을 할 때 마다 변경 사항을 로깅하는데에 'write barrier'가 필요.

- 비싼 비용, 하지만 read보다 write가 일반적으로 덜 발생함.

5) Is this algorithm conservative?

Nettles-O'Toole 알고리즘은 보수적인가?

Parallel and Concurrent GC

1) Based on Nettles-O'Toole algorithm

2) High-level idea
- Use per-processor local stacks for search

탐색을 위해 프로세서마다 로컬 스택을 사용

- Maintain a shared stack for load balancing
-- Processors periodically transfer objects between local and shared stack

- 로드 밸런싱을 위해 공유된 스택을 유지

-- 프로세서들은 주기적으로 로컬/공유 스택 사이에서 객체들을 이동시킴.

- Use synchronization primitives (test-and-set and fetch-and-add) to manage concurrent accesses

동시 접근을 관리하기 위해 동기화 도구 사용(test-and-set, fetch-and-add)

See "On Bounding Time and Space for Multiprocessor Garbage Collection" (PLDI 1999), and "A Parallel, Real-Time Garbage Collector" (PLDI 2001) by Cheng and Blelloch

'컴퓨터공학 > Performance Engineering' 카테고리의 다른 글

Bentley Rules for Optimizing Work (0)	2026.04.26
The Cilk Runtime System (0)	2026.04.06
Why should I learn Software Performance Engineering ? (1)	2026.03.28
Storage Allocation (0)	2026.03.27
Bit Hacks (0)	2026.03.24

현재글Parallel Storage Allocation

Mnemonic

UE5 Network Programmer

Casting, 정렬, ue4, Multiplayer, 언리얼엔진, viewing transformation, 3d viewing pipeline, pirorityqueue, USkeletalMeshComponent, 힙트리, dedicated, UE5, 우선순위큐, Unreal Engine, rootmotion, 퀵 소트, C++, loadobject, retarget, heap tree,

Today :
Yesterday :

Mnemonic