Conversation
cdesiniotis
left a comment
There was a problem hiding this comment.
Thanks @JunAr7112 for the progress on this. I left some comments. I mainly have some questions around how to accurately determine whether we need to install these packages or not. Ideally we would have a robust way to determine this and avoid using / introducing new flags / environment variables for this feature.
| if [ ! -d "/usr/src/linux-headers-$(uname -r)/" ]; then | ||
| echo "Installing Linux kernel headers..." | ||
| apt-get -qq install --no-install-recommends linux-headers-${KERNEL_VERSION} > /dev/null | ||
| fi |
There was a problem hiding this comment.
An observation -- if we mount /usr/src from the host AND /usr/src/linux-headers-<kernel-version> does not exist, meaning the linux headers are not available on the host, then we end up installing the headers both in the container AND on the host at /usr/src/linux-headers-<kernel-version>. I am not sure if this is desirable.
| apt-get -qq download linux-image-${KERNEL_VERSION} && dpkg -x linux-image*.deb . | ||
| { apt-get -qq download linux-modules-${KERNEL_VERSION} && dpkg -x linux-modules*.deb . || true; } 2> /dev/null |
There was a problem hiding this comment.
Similar to the linux-headers package, shouldn't we have a conditional here along the lines of "if linux-image / linux-modules packages are not install, then install them"? I am uncertain on the best way to determine whether we need to install them or not...
| trap "rm -rf ${tmp_dir}" EXIT | ||
| cd ${tmp_dir} | ||
|
|
||
| if [ "${PERSIST_DRIVER}" = false ]; then |
There was a problem hiding this comment.
Ideally we should not be relying on the PERSIST_DRIVER environment variable as that feature is not introduced in this PR.
| cd ${tmp_dir} | ||
|
|
||
| if [ "${PERSIST_DRIVER}" = false ]; then | ||
| rm -rf /lib/modules/${KERNEL_VERSION} |
There was a problem hiding this comment.
I am not sure why cleaning up this directory is required on every driver container instantiation. @shivamerla @tariq1890 would you happen to know why? If we can remove this, then we no longer need this conditional.
There was a problem hiding this comment.
On container startup, if we do not mount /lib/modules or /usr/src explicitly from the host, both of these directories will be empty inside of the container. As an example:
$ kubectl exec -n gpu-operator -it nvidia-driver-daemonset-7qwn5 -- sh
# ls -ltr /lib/modules/
ls: cannot access '/lib/modules/': No such file or directory
# ls -ltr /usr/src/
total 0
Based on this observation, what are your thoughts on my below proposal?
- Remove the following command from our script as it appears to not be needed:
rm -rf /lib/modules/${KERNEL_VERSION} - Only install the linux-modules package if
/lib/modules/${KERNEL_VERSION}directory does not exist.
| mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION} | ||
| mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION} | ||
| depmod ${KERNEL_VERSION} |
There was a problem hiding this comment.
Note, these commands won't work as they require that the linux-image and linux-modules deb packages were downloaded and extracted locally first.
| ls -1 boot/vmlinuz-* | sed 's/\/boot\/vmlinuz-//g' - > version | ||
| if [ -z "$(<version)" ]; then | ||
| echo "Could not locate Linux kernel version string" >&2 | ||
| return 1 |
There was a problem hiding this comment.
@shivamerla after reviewing this again, do we even really require to install the linux-image-$KERNEL_VERSION package? It seems like we only use it to construct the kernel-version string. However, the kernel version string is already assumed to be set correctly in the KERNEL_VERSION environment variable before we ever reach this point in the script.
Based on my understanding, I think we can remove this code block entirely in all cases and never install the linux-image-$KERNEL_VERSION package.
| if [ "${PERSIST_DRIVER}" = false ]; then | ||
| mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION} | ||
| mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION} | ||
| depmod ${KERNEL_VERSION} |
There was a problem hiding this comment.
Based on my comment https://github.com/NVIDIA/gpu-driver-container/pull/65/files#r1681872260 I would recommend the following change:
| if [ "${PERSIST_DRIVER}" = false ]; then | |
| mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION} | |
| mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION} | |
| depmod ${KERNEL_VERSION} | |
| if [ ! -d "/lib/modules/${KERNEL_VERSION}" ]; then | |
| { apt-get -qq download linux-modules-${KERNEL_VERSION} && dpkg -x linux-modules*.deb . || true; } 2> /dev/null | |
| mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION} | |
| mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION} | |
| depmod ${KERNEL_VERSION} |
new branch with linux header changes