Skip to content

Conversation

@dlmiles
Copy link
Collaborator

@dlmiles dlmiles commented Feb 18, 2025

This is mainly a refactor of a commonly repeated pattern into a new function that fully handle error paths.

Medium risk due to the high foot-fall and different use cases/filesystem scenarios.

I have a few questions with the flock.c file, best annotated into the source in a moment.

Copy link
Collaborator Author

@dlmiles dlmiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments to the flock.c code that raises questions over what the intention really is.

What is being protected from what, it is usual for 2 independent processes to be modifying files. I'm not sure which operations in magic and which files needs inter-process co-operation to prevent data corruption? So the design goal is unclear to me to envisage why FILE_LOCKS exists.

Maybe I should check if ./configure --disable-locking is working as I can't spot the use case.

fl.l_pid = getpid();

if (fcntl(fd, F_GETLK, &fl))
{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re use of F_GETLK before F_SETLK

It seems strange to call F_GETLK first. The same process can not see its own lock, but it can see the lock of another process on the file(dev:ino). But if we are looking for another process with an active lock, there is a
race condition to this, because maybe it places down the lock between our call to F_GETLK and our F_SETLK.

So in effect any data obtained from doing things in this order is always out-of-date.

It is more usual to attempt to make the lock first, then if it fails, enquire with F_GETLK the PID of the process that is blocking us by using F_GETLK. So the nice error message can be displayed. Now if the F_GETLK fails to find a lock is set, maybe we just missed the other process unlocking, so we might (optionally) retry the operation of obtaining the lock one more time. Obviously we don't want to repeat more than once. But maybe magic files are not high-contention file (like database) so the retry maybe over the top.

perror(fname);
perror(fname);
/* appears to be a best-effort rather than signal error, */
/* and continues to gzdopen() to provide caller handle */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when we don't get the lock, we carry on regardless, no indication or action made on error. A best-effort to obtain.

close(fd);
fd = -1;
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are successful at this point here, we did fail to obtain the lock on this FD.
We did not mark *is_locked = TRUE informing the caller.
But... if the caller were to use close(fd) and there was another open FD on this same file(dev:ino) by another part of magic then it would release all our locks. Line 137-139 area and Line 160-161 area.
I think the caller will just use close(fd) like normal because *is_locked == TRUE. which removes all locks from the fd anyway.

So either this file is lock aware or it isn't. Given is_locked != NULL we can assume the caller is classifying this open call is being lock aware. If this is correct then all calls to open should have their fd managed

f = gzdopen(fd, mode);
goto done;
if (f)
goto done_store_fdp;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vvvv Re Line 160 below vvvv

This mandatory call to close(fd) removes all locks our process has on this file. Another process can now gain their lock. Even if another part of our application has another FD open and will be doing more work on the FD later.
I don't get why a close() is made here, to a file that might be regarded as a lock aware file/path.

f = gzdopen(fd, mode);
if (f == NULL)
{
close(fd);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this close to manage the potential error path from the 3rd party library not accepting the 'fd'.
However this should also be a managed close if the file/path is consider file-lock aware.

done_store_fdp:
if (fdp) *fdp = fd;
done:
if (fdp != NULL) *fdp = fd;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move because the function path_gzdopen_internal() will take care of this when successful and only one exit path exists no in this function that needs to ensure it is done here. So special goto label for that one use case'

f = fopen(filename, mode);
if ((fdp != NULL) && (f != NULL)) *fdp = fileno(f);
if (f)
if (fdp) *fdp = fileno(f);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just rewrite of what is already there, in a form that is both similar to similar lines above and (I find) easier to see glance at its intention.

return gzhandle;

close:
close(fd);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do anything for *fdp = -1 to handle error path. Maybe it should. None of previous code has it but looking back on it again I spot this.

Copy link
Collaborator Author

@dlmiles dlmiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...

fl.l_type = F_WRLCK;
fl.l_pid = getpid();

fd = open(fname, O_RDWR);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^^ line 166 ^^^

Consider fl.l_type = ((oflags & O_ACCMODE) == O_RDONLY) ? F_RDLCK : F_WRLCK

Consider honouring the callers open mode flags for O_RDONLY and O_RDWR, I think maybe a O_RDONLY caller request is upgraded to O_RDWR with F_WRLCK. Unclear why that would be done.

@RTimothyEdwards
Copy link
Owner

File locking was introduced by my advisor Mike Godfrey at Stanford back in the early 1990s. The point of file locking was related to our use of magic with a database on a central server, and a number of people working on the same design at the same time. File locking prevents two people from overwriting each other's edits on a cell layout. Only one of the users can have the layout file open at a time, and if another person opens the same layout, they will find it read-only and uneditable. This is a feature of commercial layout tools, although I doubt they use file locking. Otherwise, even on a single-user system, the file locking can prevent a script from overwriting an open layout, so there are valid use cases. The major downside of file locking is that there is a major conflict between large, complex chip layouts with many cells, and Linux version that have hard-coded a limit of 1024 open file descriptors. This first showed up on the gf180mcu open PDK, where the I/O cells were so deeply nested that my system ran out of file descriptors. Since then, I hacked in some fixes to allow file locking to be disabled on demand, but much of the code is really hacks on hacks on hacks, mostly due to hastily trying to fix things during emergency situations happening at the last minute close to tapeouts.

@RTimothyEdwards
Copy link
Owner

There having been 33 years between the original file locking code and now, I would be very surprised if there aren't much better ways to deal with the whole issue of preventing people from overwriting each other's edits. I would not, however, want to depend on an automated git merge to resolve layout differences. . .

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 18, 2025

Thanks for the background, information, I'll digest what you said into what I can see in codebase. To me a cell editmode lock sounds like a traditional foobar.mag.lock file would fit, it is also very visible to outside world that a lock exists, and self-managing/cleaning.

There is also the UI ability to, simply take a copy of the original file (into private auto-save area), checksum/timestamp original, have auto-save always use a private magic managed area, so when it comes to the user pressing save button, it then resolves that the original file has been changed under the session, and provide the user resolution options. Many regular text editors have some combination of the same thing from a UI perspective. Worst case scenario the current session of work is already safely inside the auto-save area.

Linux does have a default environment limit of 1024, there are historical reasons for why this is set at that level:

$ ulimit -Sa | grep "open files"
open files                          (-n) 1024

$ ulimit -Ha | grep "open files"
open files                          (-n) 524288

The hard limits are with -H and the soft/current/active limit is with -S.
A process can (self) increase the current limit upto the hard limit.

Many programs using a lot of FDs do this transparently on startup using getrlimit(RLIMIT_NOFILE, ...) and setrlimit() syscalls.

The reason why this limit exists is the historical kernel syscalls for select() used (and still use) a compile time default sized buffer for the sizeof(fd_set) so the environment mirrors this historical limit (to prevent buffer overflows / correct operation of fd_set) as most processes don't need more FDs anyway, those that do need to co-operatively be designed to use larger numbers but where to put the new limit (no one will be in agreement to what that should be) when half-a-million FDs are possible by default on linux today. So this is why it is up the process to manage this minor point itself.

When the magic process starts up it can call:

#include <sys/resource.h>
struct rlimit rlim;
int err = getrlimit(RLIMIT_NOFILE, &rlim);
if(err < 0)
  return err;
if(rlim.rlim_cur < 4096) {
  rlim.rlim_cur = 4096;
  err = setrlimit(RLIMIT_NOFILE, &rlim);
}
return err;

Can you provide/point-me-to a torture project ZIP of gf180 (or any open-source tech) that demonstrates the high fd usage ? at least I can profile/examine it to understand the nature of the file usage pattern.
Linux handling high FDs should not be a problem.
Thanks

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 18, 2025

FWIW the existing magic fd_set code can be migrated to using poll(2) API if it is thought that is an issue, there is a hard coded magic FD limit in this area:
../textio/textio.h:#define TX_MAX_OPEN_FILES 20
I think it only affect console and display setup when the magic process startup. So all the FDs in use are lower than 20, these FDs never get closed AFAIK.

Other than this there isn't any other area I have seen that is a concern to when using FDs above 1023.

@RTimothyEdwards
Copy link
Owner

@dlmiles : The I/O library of gf180mcu itself exhibits the problem and so any full project will do so as well. So,
https://github.com/efabless/caravel-gf180mcu
But I think I did something in open_pdks to work around the problem so I think the problem will not manifest itself in the installed PDK. I did that a few years ago and don't remember the details, so I will have to look through my notes on open_pdks.

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 19, 2025

This is more refined version that works for me.

diff --git a/tcltk/tclmagic.c b/tcltk/tclmagic.c
index dc0cc40f..5dc7bfbe 100644
--- a/tcltk/tclmagic.c
+++ b/tcltk/tclmagic.c
@@ -25,6 +25,7 @@
 #include <signal.h>
 #include <string.h>
 #include <errno.h>
+#include <sys/resource.h> /* FIXME autoconf? */
 
 #include "tcltk/tclmagic.h"
 #include "utils/main.h"
@@ -530,6 +531,25 @@ MakeWindowCommand(char *wname, MagWindow *mw)
     freeMagic(tclcmdstr);
 }
 
+static int
+process_rlimit_nofile_check(rlim_t nofile)
+{
+    struct rlimit rlim;
+    int err = getrlimit(RLIMIT_NOFILE, &rlim);
+    if (err < 0)
+        return err;
+    rlim_t rlim_cur = rlim.rlim_cur;
+    if (nofile > rlim.rlim_max && nofile != rlim.rlim_max) /* nofile != RLIM_INFINITY */
+        return -1;
+    if (rlim.rlim_cur < nofile || nofile == RLIM_INFINITY) {
+        rlim.rlim_cur = nofile;
+        err = setrlimit(RLIMIT_NOFILE, &rlim);
+    }
+    if (err != 0)
+        TxPrintf("Warning: process_rlimit_nofile_check(%lu) = %d (%d) [rlim_cur=%lu rlim_max=%lu]\n", nofile, err, errno, rlim_cur, rlim.rlim_max);
+    return err;
+}
+
 /*------------------------------------------------------*/
 /* Main startup procedure                              */
 /*------------------------------------------------------*/
@@ -588,6 +608,8 @@ _magic_initialize(ClientData clientData,
        TxPrintf("Using the terminal as the console.\n");
     TxFlushOut();
 
+    process_rlimit_nofile_check(4096);
+
     if (mainInitAfterArgs() != 0) goto magicfatal;
 
     /* Registration of commands is performed after calling the */

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 19, 2025

FWIW the existing magic fd_set code can be migrated to using poll(2) API if it is thought that is an issue, there is a hard coded magic FD limit in this area: ../textio/textio.h:#define TX_MAX_OPEN_FILES 20 I think it only affect console and display setup when the magic process startup. So all the FDs in use are lower than 20, these FDs never get closed AFAIK.

Other than this there isn't any other area I have seen that is a concern to when using FDs above 1023.

I have taken a look (impact assessment for raising RLIMIT_NOFILE by default).

Have a few commits to protect the lesser used select() parts of magic.
Fix obviously incorrect usage of select() inside magic, but I think the code is not used.
Reworked the diabolical use of select() in txCommand.c to at least allow the use of all FDs from 0..1023 if they are so registered instead of the 0..19 artificial clamping that was occurring before.
These should show up over next few weeks after some QA and testing.

The final part is the 3rd party libraries.

  • tcl8: it appears tcl8 uses select() for the main event loop, there is no config.log detection for other mechanisms. Nor is there code inside the tcl8 project for other mechanisms. So the situation here is to not raise RLIMIT_NOFILE by default when tcl8 is used.
  • tcl9: this does have support and config.log detection of epoll and poll and I'm sure I've seen epoll in action in a debugging session before. So in this situation raising RLIMIT_NOFILE by default should be ok. I assume similar work has taken place inside tcl9 core libraries so they all use common improvements there.
  • libz: not an issue (have read the internals of this enough times in the past, it is pure 'fd' based)
  • libX11.so: this does appear to have undefined symbol for select() and no sign of poll/epoll (am a little shocked). This seems low risk in that I'm sure it is only used for X11 client<>server connection, which is going to be a low FD and never change for the lifetime of the process, maybe further research in passing is needed.
  • libX11.so: updated this info adding this comment under actual observations on modern linux with X11 on wayland, it does actually use poll() so I guess maybe another DSO is handling the low level client protocol, or a DSO was loaded to provide this (libxcb.so). Even if the stock libX11.so does not link to poll and only with select.
  • the output from ldd tclmagic.so I did not see any system libraries to be concerned about, those I checked all have poll awareness.
  • any other significant 3rd party libraries ?

So any strategy here may consider making it a default option for tcl9 and to advise (from a support perspective) to use tcl9 when using higher numbered FDs with magic.

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 20, 2025

The previous comments on RLIMIT can be ignored in respect of this PR, they are for a different topic for which a new PR will be created for.

The previous comments on flock history can be ignored in respect of this PR, while this PR changes flock.c it does not modify the locking policy only refactored gzopen creation, the flock comments here impact another PR #375


I just updated this with style changes and invalidation of fd=-1 on error case. This was my only changes I expect to make. I just need to test now.

Merge Status: merge on hold, pending targeted testing (review status in 3 weeks)
Quality: usable as is
Risk: med-impact, due to high foot fall of code and caller use-cases
Level of Testing: assume none (this needs targeted testing)

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 20, 2025

ddab8ae formatting fix, and set *fdp=-1 on error path

@RTimothyEdwards
Copy link
Owner

Any update on this? Your "review status in 3 weeks" passed some time ago.

@dlmiles
Copy link
Collaborator Author

dlmiles commented Apr 11, 2025

The main subject of this change was to cleanup error handling scenarios and introduce path_gzdopen_internal for that purpose.

While I have run with this patch for testing I can't say that I used input files which were gzip compressed during that time.

Other than that I am still happy things should be ok.

@dlmiles dlmiles force-pushed the master-upstream-20250216-fdleakmed branch from ddab8ae to a4c5c7c Compare July 11, 2025 14:48
@dlmiles
Copy link
Collaborator Author

dlmiles commented Jul 11, 2025

Rebase onto 8.3.530 no changes

@RTimothyEdwards
Copy link
Owner

@dlmiles : Finally starting on another round of merges; Should I go ahead and merge this?

@dlmiles
Copy link
Collaborator Author

dlmiles commented Jul 26, 2025

Yes please.

Despite the large amount of commentary in this PR. Most of the comments are about something else.

This commit here just cleans up to error handling path for gzopen that static source analyser was showing it need attention (over leaked file handles during error path).

@dlmiles
Copy link
Collaborator Author

dlmiles commented Jul 29, 2025

@RTimothyEdwards is this one going soon and will clear all mergable outstanding PRs.

dlmiles added 2 commits July 31, 2025 22:59
Guided by CodeQL static code analyser.

FileMayNotBeClosed.ql
FileMayNeverBeClosed.ql

Multiple paths exist that seems to be equivalent, some of those
paths handled the error path, some of them didn't.

CodeQL will still report a concern (instead of multiple concerns)
so this has been commented to clarify when the libz handle setup
is successful, the ownership of the 'fd' changed.
This is due to the transfer of ownership of the 'fd' into libz/gzip
but the error handling wasn't always being checked.
@dlmiles dlmiles force-pushed the master-upstream-20250216-fdleakmed branch from a4c5c7c to 84264b3 Compare August 1, 2025 17:35
@dlmiles
Copy link
Collaborator Author

dlmiles commented Aug 1, 2025

git rebase --onto 1ec8b6e HEAD~2 HEAD # 8.3.534

@dlmiles dlmiles force-pushed the master-upstream-20250216-fdleakmed branch from 84264b3 to 1a139e7 Compare August 1, 2025 17:44
@dlmiles
Copy link
Collaborator Author

dlmiles commented Aug 1, 2025

Fixup from header file location change utils.h => magic_zlib.h

@RTimothyEdwards
Copy link
Owner

Pulled and merged (finally).
@dlmiles : Given the number of open issues above, should I leave this pull request open, or close it?

@dlmiles
Copy link
Collaborator Author

dlmiles commented Aug 4, 2025

Please close.

Everything else commented on here has been closed (somewhere else) except for #375 (which is still open and tracking what it needs to track, with a back-link to comments made in here relevant to it).

So this issue #374 can be closed. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants