## Question about Bottom-up traversal used in LBVH and TRBVH

Practical and theoretical implementation discussion.
shocker_0x15
Posts: 75
Joined: Sun Aug 19, 2012 3:24 pm
Contact:

### Question about Bottom-up traversal used in LBVH and TRBVH

Hi, Now I try to implement TRBVH [1] using OpenCL.
It is based on Linear BVH [2] and an idea of parallel construction of the tree [3].
I have finished implementing [2] and [3]. The result of rendering seems to be done correctly and the speed of construction is also reasonable.
However I have one doubt about the part of parallel construction.

In the proposed parallel construction for computing node's AABB, the paths from leaf nodes to the root node are processed in parallel.
Each thread starts from a leaf node and walks up the tree according to the information of parent's index.
When a thread visit an internal node, it increments an atomic counter assigned to the node.
If the resulting value of atomic increment is 1, the thread immediately terminated.
The other case, the thread compute the node's AABB using children indices.
By assigning the processing responsibility to the second thread visiting the node, it guarantees that the AABBs of subtree of the node are already computed.

Indeed, I think the computations of a subtree are done. However is writing AABB to global memory also done?

Let's consider a simple tree with 3 leaves.

Code: Select all

      Q
/   \
P     \
/  \     \
A    B     C

TImeline
thread A (leaf A) --> visit node P, terminated
thread B (leaf B) --> visit node P, compute AABB (A+B) and store result --> visit node Q and terminated
thread C (leaf C) --> visit node Q, read AABBs of P and C then compute AABB of Q


Thanks.

[1] "Fast Parallel Construction of High-Quality Bounding Volume Hierarchies".
[2] "Fast BVH Construction on GPUs"
[3] "Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees"

shocker_0x15
Posts: 75
Joined: Sun Aug 19, 2012 3:24 pm
Contact:

### Re: Question about Bottom-up traversal used in LBVH and TRBV

I put my code for bottom-up traversal of computing nodes' AABBs.

Let's think one thread computed an AABB and stored it. ** 2 **
** 1 ** The thread was terminated after that because the access to the parent was the fisrt time.
Immediately after the termination, another thread accesses the same node.
This time, it is the second access, so the thread gets the responsibility of computing AABB.
The doubt point is here.

Code: Select all

...
uint selfIdx = get_global_id(0);// leaf index
const global LeafNode* lNode = lNodes + selfIdx;
point3 min = lNode->bbox.min;
point3 max = lNode->bbox.max;

uint pIdx = parentIdxs[selfIdx];// parentIdxs stores parent node index for each internal and leaf node.
parentIdxs += numLeaves;// parent indices for internal nodes are stored in the latter half.

// atomic_inc() returns the old value, so it can determine the second access if the value is 1.
while (atomic_inc(counters + pIdx) == 1) { // ** 1 **
global InternalNode* pINode = iNodes + pIdx;// parent internal node
bool leftIsSelf = pINode->c[0] == selfIdx;
uint otherIdx = pINode->c[leftIsSelf];
const AABB bbox = pINode->isLeaf[leftIsSelf] ? (lNodes + otherIdx)->bbox : (iNodes + otherIdx)->bbox;// **2 **
pINode->bbox.min = min = fmin(min, bbox.min);
pINode->bbox.max = max = fmax(max, bbox.max);

if (pIdx == 0)
return;

selfIdx = pIdx;
pIdx = parentIdxs[pIdx];
}
...


MohamedSakr
Posts: 83
Joined: Thu Apr 24, 2014 2:27 am

### Re: Question about Bottom-up traversal used in LBVH and TRBV

I'm not sure, and I may be wrong "so clarify more about the code"

atomic_inc(counters + pIdx) , I sense that this expression is kinda false, atomics are used on shared/global memory, in this case you are using it on local memory, which will do nothing!! "so all threads will enter in the first iteration and exit in the second iteration"

so from this code, all threads will do the same thing. unless I misunderstand something

shocker_0x15
Posts: 75
Joined: Sun Aug 19, 2012 3:24 pm
Contact:

### Re: Question about Bottom-up traversal used in LBVH and TRBV

This is full of the kernel.
the variable counters is a global memory object.

Code: Select all

kernel void calcNodeAABBs(global uchar* _iNodes, global uint* counters, const global uchar* _lNodes, uint numLeaves, const global uint* parentIdxs) {
const uint selfIdx = get_global_id(0);
global InternalNode* iNodes = (global InternalNode*)_iNodes;
const global LeafNode* lNodes = (const global LeafNode*)_lNodes;
if (selfIdx >= numLeaves)
return;

const global LeafNode* lNode = lNodes + selfIdx;
point3 min = lNode->bbox.min;
point3 max = lNode->bbox.max;

uint pIdx = parentIdxs[selfIdx];
parentIdxs += numLeaves;

while (atomic_inc(counters + pIdx) == 1) {
global InternalNode* pINode = iNodes + pIdx;
bool leftIsSelf = pINode->c[0] == selfIdx;
uint otherIdx = pINode->c[leftIsSelf];
const AABB bbox = pINode->isLeaf[leftIsSelf] ? (lNodes + otherIdx)->bbox : (iNodes + otherIdx)->bbox;
pINode->bbox.min = min = fmin(min, bbox.min);
pINode->bbox.max = max = fmax(max, bbox.max);

if (pIdx == 0)
return;

selfIdx = pIdx;
pIdx = parentIdxs[pIdx];
}
}

* The kernel receives node variables by uchar* type, this is a workaround for Apple's OpenCL bug.

MohamedSakr
Posts: 83
Joined: Thu Apr 24, 2014 2:27 am

### Re: Question about Bottom-up traversal used in LBVH and TRBV

I see a problem, but let me be sure first

if (pIdx == 0), this expression is for the termination right? , how is pIdx determined here?? "so when pIdx != 0??"
another main problem is: 2 threads enter the while loop at the same time, first thread enters, terminate at the if statement, 2nd thread won't even enter!!

so clarify the logic as much as possible, I see this while loop as a night mare bomb which will explode with race conditions!!

shocker_0x15
Posts: 75
Joined: Sun Aug 19, 2012 3:24 pm
Contact:

### Re: Question about Bottom-up traversal used in LBVH and TRBV

pIdx == 0 means that the traversal reaches the root node, because the root node is always stored at index = 0 in LBVH. Therefore, the thread should be terminated after processing the root node.

I think there is no problem if two threads enter the while loop at the same time because it uses the atomic operation. Do I have a misconception about the atomic?

Thanks.

MohamedSakr
Posts: 83
Joined: Thu Apr 24, 2014 2:27 am

### Re: Question about Bottom-up traversal used in LBVH and TRBV

I consider the initial value of counters pointer is 1 at all nodes
the problem here is "there is no second access!!"

so the while loop will enter each node only ONCE!!, any other thread trying to enter this node after this will be kicked out

BTW, I'm really interested in implementing TRBVH, we may cooperate on this if you don't mind

shocker_0x15
Posts: 75
Joined: Sun Aug 19, 2012 3:24 pm
Contact:

### Re: Question about Bottom-up traversal used in LBVH and TRBV

The initial values of counters are 0, and atomic_inc() returns the old value before atomic-incrementing.

The reason why I doubt about bottom-up traversal (though my LBVH implementation seems working well) is that the similar procedure used in TRBVH which counts the total number of leaves of a subtree doesn't works well.
It produces non-deterministic results, sometimes it works correctly.

My project is public on GitHub Now I try to implement TRBVH on branch "TRBVH"
http://github.com/shocker-0x15/CLeaR
Sorry for all of comments in the source are written in Japanese.

MohamedSakr
Posts: 83
Joined: Thu Apr 24, 2014 2:27 am

### Re: Question about Bottom-up traversal used in LBVH and TRBV

this means the while loop won't get entered at all!! it will always be while(false) { }

to solve this, you will need to make the while condition: while ((atomic_inc(counters + pIdx) == 0) || last_thread_to_enter_this_node)
the variable last_thread_to_enter_this_node should mark the thread which KNOWS that it will be the last thread to enter, this may be done with bitwise operations I think, or may be there is another logic that can be implemented

sriravic
Posts: 20
Joined: Fri Jun 22, 2012 6:48 pm

### Re: Question about Bottom-up traversal used in LBVH and TRBV

Hi

Basically you'd be experiencing a race condition if you were to assume that results of A+B is computed by the time C gets to root node. Hypothetically such a situation is valid only when all threads execute in step which is never guaranteed at hardware level. You must use some synchronization primitive in your code to make sure that all reads and writes are consistent. However with your current method of implementation having a synchronization primitive inside the while loop would be disastrous as SIMT execution of threads in CUDA/OpenCL would mean that different code paths are taken and synchronization is applied to only a few code paths.

I'd suggest something on this sort for a bug free implementation although I'm not sure fully.

for(;;) // infinite loop -> all threads exit once building is done
{