Your team's work is very interesting and fills a gap in multi-scale research for pathological image processing. I have a question: when the inputs are x5, x10, and x20 magnification patches, the features extracted by the CLAM ResNet backbone would likely have shapes like (1, 100, 1024), (1, 200, 1024), and (1, 400, 1024). How do you align the height (H) and width (W) of these features before feeding them into the Transformer? For instance:
The (1, 100, 1024) feature, after padding, might be reshaped into a 10x10 grid.
The (1, 400, 1024) feature might become a 20x20 grid.
If their spatial dimensions (H, W) differ, how does the subsequent Conv Processor handle multi-scale feature fusion?
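For concreteness, here is a minimal sketch of what I imagine the padding/reshape and alignment steps might look like. This is only my guess at the mechanism, not your actual implementation: the `tokens_to_grid` helper, the zero-padding to the next perfect square, and the bilinear upsampling to the largest grid before a fusing convolution are all assumptions on my part.

```python
import math
import torch
import torch.nn.functional as F

def tokens_to_grid(feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: pad a (1, N, C) token sequence to the next
    perfect square and reshape it into a (1, C, H, W) grid, H = W = ceil(sqrt(N))."""
    _, n, c = feats.shape
    side = math.ceil(math.sqrt(n))
    pad = side * side - n
    if pad > 0:
        feats = F.pad(feats, (0, 0, 0, pad))  # zero-pad along the token axis
    return feats.reshape(1, side, side, c).permute(0, 3, 1, 2)

# Dummy stand-ins for the CLAM-derived features at x5 / x10 / x20 magnification
f5  = torch.randn(1, 100, 1024)   # -> 10x10 grid
f10 = torch.randn(1, 200, 1024)   # -> padded to 225 tokens -> 15x15 grid
f20 = torch.randn(1, 400, 1024)   # -> 20x20 grid

grids = [tokens_to_grid(f) for f in (f5, f10, f20)]

# One possible alignment (my assumption): upsample every grid to the largest
# (20x20) one, then concatenate channel-wise and fuse with a convolution.
target = grids[-1].shape[-2:]
aligned = [F.interpolate(g, size=target, mode="bilinear", align_corners=False)
           for g in grids]
fused_in = torch.cat(aligned, dim=1)                       # (1, 3*1024, 20, 20)
conv = torch.nn.Conv2d(3 * 1024, 1024, kernel_size=3, padding=1)
fused = conv(fused_in)                                     # (1, 1024, 20, 20)
print(fused.shape)
```

If your Conv Processor instead keeps each scale at its native grid size, or aligns scales by downsampling rather than upsampling, I'd be very interested in how that fusion is done.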