123 changes: 123 additions & 0 deletions patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
@@ -0,0 +1,123 @@
From 821f6d79ad2773e0ff1537c0bb3c7af93a694709 Mon Sep 17 00:00:00 2001
From: Yury Murashka <yurypm@arista.com>
Date: Thu, 8 May 2026 00:00:00 +0000
Subject: net: tg3: guard napi_disable and pci_disable_device calls

We need this patch to fix a soft lockup in the Linux kernel on Arista
modular chassis in the 202511 branch.

During PCIe hot-plug events, uncorrectable errors can be reported and
AER recovery for the tg3 device is initiated by the AER kernel driver.
The tg3_io_error_detected function is the AER error recovery handler.

tg3_io_error_detected calls tg3_netif_stop->tg3_napi_disable->
napi_disable and returns PCI_ERS_RESULT_NEED_RESET on a non-fatal error.
We expect tg3_io_slot_reset and tg3_io_resume to be called during AER
recovery, but AER recovery can fail, for example when another PCIe
device on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. In that
case tg3_io_slot_reset and tg3_io_resume are never called, and both the
PCIe device and NAPI stay disabled (pci_disable_device and napi_disable
were already called from tg3_io_error_detected). If the PCIe link is
then disabled, napi_disable is called a second time:

napi_disable+0x1b/0x1b0
tg3_napi_disable+0x89/0xa0 [tg3]
tg3_netif_stop+0x37/0xe3 [tg3]
tg3_stop+0x30/0x160 [tg3]
tg3_close+0x2a/0x60 [tg3]
__dev_close_many+0xad/0x130
dev_close_many+0xb2/0x190
unregister_netdevice_many_notify+0x19d/0xa00
unregister_netdevice_queue+0xf8/0x140
unregister_netdev+0x1c/0x30
tg3_remove_one+0xaa/0x150 [tg3]
pci_device_remove+0x42/0xb0
device_release_driver_internal+0x19c/0x200
pci_stop_bus_device+0x85/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_and_remove_bus_device+0x12/0x20
pciehp_unconfigure_device+0x9f/0x160
pciehp_disable_slot+0x67/0x100
pciehp_handle_presence_or_link_change+0x77/0x350

napi_disable does not expect a second call, so the calling thread can
be stuck in napi_disable forever. The pcierr_recovery flag covers a
similar issue, but only for fatal errors. We cannot reuse it because it
is reset in tg3_io_resume, which is not called when AER recovery fails.

Similarly, if an AER error is reported and tg3_io_error_detected calls
pci_disable_device, a subsequent device removal via tg3_remove_one or
tg3_shutdown will call pci_disable_device again for the already-disabled
device.

Add a napi_enabled flag to struct tg3 to track whether napi_enable has
been called. Guard tg3_napi_disable() so it returns early if NAPI was
not previously enabled. Also guard pci_disable_device() calls in
tg3_remove_one() and tg3_shutdown() with pci_is_enabled() to avoid
disabling an already-disabled device.

Fixes: b45aa2f6192e ("tg3: Add EEH support")
Signed-off-by: Yury Murashka <yurypm@arista.com>

--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7415,6 +7415,11 @@ static void tg3_napi_disable(struct tg3
 {
 	int i;

+	if (!tp->napi_enabled)
+		return;
+
+	tp->napi_enabled = false;
+
 	for (i = tp->irq_cnt - 1; i >= 0; i--)
 		napi_disable(&tp->napi[i].napi);
 }
@@ -7423,6 +7428,8 @@ static void tg3_napi_enable(struct tg3 *
 {
 	int i;

+	tp->napi_enabled = true;
+
 	for (i = 0; i < tp->irq_cnt; i++)
 		napi_enable(&tp->napi[i].napi);
 }
@@ -17715,6 +17722,7 @@ static int tg3_init_one(struct pci_dev *
 	tp->tx_mode = TG3_DEF_TX_MODE;
 	tp->irq_sync = 1;
 	tp->pcierr_recovery = false;
+	tp->napi_enabled = false;

 	if (tg3_debug > 0)
 		tp->msg_enable = tg3_debug;
@@ -18096,7 +18104,8 @@ static void tg3_remove_one(struct pci_de
 		}
 		free_netdev(dev);
 		pci_release_regions(pdev);
-		pci_disable_device(pdev);
+		if (pci_is_enabled(pdev))
+			pci_disable_device(pdev);
 	}
 }

@@ -18252,7 +18261,8 @@ static void tg3_shutdown(struct pci_dev

 	rtnl_unlock();

-	pci_disable_device(pdev);
+	if (pci_is_enabled(pdev))
+		pci_disable_device(pdev);
 }

 /**
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3430,6 +3430,7 @@ struct tg3 {
 	struct device *hwmon_dev;
 	bool link_up;
 	bool pcierr_recovery;
+	bool napi_enabled;

 	u32 ape_hb;
 	unsigned long ape_hb_interval;
1 change: 1 addition & 0 deletions patches-sonic/series
@@ -7,6 +7,7 @@
driver-arista-net-tg3-dma-mask-4g-sb800.patch
driver-arista-net-tg3-disallow-broadcom-default-mac.patch
driver-arista-net-tg3-access-regs-indirectly.patch
driver-arista-net-tg3-napi-enable-called-flag.patch
driver-arista-pci-reassign-pref-mem.patch
driver-arista-mmcblk-not-working-on-AMD-platforms.patch
driver-arista-restrict-eMMC-drive-to-50Mhz-from-userland.patch
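
For reviewers who want to reason about the guard without a chassis, the failed-recovery sequence from the commit message can be replayed in userspace. This is only a rough sketch under stated assumptions: `struct fake_tg3`, `fake_napi_enable`, `fake_napi_disable`, `use_guard` and `replay_failed_recovery` are hypothetical names invented for illustration, and the real `napi_disable()` is modeled by a simple call counter rather than by its actual blocking behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct tg3; not the kernel type. */
struct fake_tg3 {
	int irq_cnt;
	bool napi_enabled;	/* mirrors the flag added by this patch */
	bool use_guard;		/* false models the pre-patch behaviour */
	int disable_calls;	/* counts simulated napi_disable() calls */
};

static void fake_napi_enable(struct fake_tg3 *tp)
{
	tp->napi_enabled = true;
}

static void fake_napi_disable(struct fake_tg3 *tp)
{
	int i;

	if (tp->use_guard) {
		/* Same early return as the patched tg3_napi_disable(). */
		if (!tp->napi_enabled)
			return;
		tp->napi_enabled = false;
	}

	for (i = tp->irq_cnt - 1; i >= 0; i--)
		tp->disable_calls++;
}

/* Replays: open, AER error detected, recovery fails, device removed. */
static int replay_failed_recovery(int irq_cnt, bool use_guard)
{
	struct fake_tg3 tp = { .irq_cnt = irq_cnt, .use_guard = use_guard };

	fake_napi_enable(&tp);		/* interface brought up */
	fake_napi_disable(&tp);		/* tg3_io_error_detected path */
	/* AER recovery fails: no slot_reset/resume, so no re-enable. */
	fake_napi_disable(&tp);		/* hot-plug removal, tg3_close path */

	assert(!use_guard || !tp.napi_enabled);
	return tp.disable_calls;
}
```

With the guard, each simulated NAPI context is disabled once; without it, each is disabled twice, which is the double call that the real `napi_disable()` never returns from.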