- mark-and-sweep
- mark-and-copy
- portable format!
- file header: info like 64 vs 32bit, target OS (Linux, FreeBSD, Solaris, System V, ...)
- sections: actual data or metadata: .text, .data, .rodata
- section header: references the sections. used by linker (and dynamic linker). describes sections
- program header: references the sections. used by OS/loader. describes memory segments
a memory segment contains 1 or more sections
- describes runtime execution (memory segments)
- tells the system how to create a process image
- specifies memory segments (size in file and in memory, offset in file and in virtual memory)
- etc...
- used by linker/dynamic-linker/relocation
- references 0 or more sections
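The file header fields mentioned above (32 vs 64 bit, target OS) sit in the first 16 bytes of the file. A minimal sketch, assuming the format these notes describe is ELF; `parse_elf_ident` and the small `OSABI_NAMES` table are illustrative names, not part of any library:

```python
import struct

# Subset of EI_OSABI values from the ELF specification.
OSABI_NAMES = {0: "System V", 3: "Linux", 6: "Solaris", 9: "FreeBSD"}


def parse_elf_ident(header: bytes) -> dict:
    """Decode the e_ident bytes found at the very start of an ELF file."""
    if len(header) < 16 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Bytes 4..7: class (32/64 bit), data encoding, version, OS ABI.
    ei_class, ei_data, ei_version, ei_osabi = struct.unpack_from("4B", header, 4)
    return {
        "bits": {1: 32, 2: 64}.get(ei_class),
        "endianness": {1: "little", 2: "big"}.get(ei_data),
        "version": ei_version,
        "osabi": OSABI_NAMES.get(ei_osabi, f"unknown ({ei_osabi})"),
    }
```

On a Linux box, feeding it the first 16 bytes of a binary (e.g. `open("/bin/ls", "rb").read(16)`) should report the word size and ABI. Many Linux binaries report "System V" rather than "Linux" here.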
uses mmap to lazily load binary from disk and uses the page cache (see below)
- binaries/libraries are shared between processes automagically
- unused parts of binaries/libraries are automatically flushed out of memory
- lazy by default
- can be set to eager by passing the `MAP_POPULATE` flag, or using `mlock` (pages are guaranteed to always be resident in RAM)
- a use case: reading files... faster because it avoids many `read` syscalls
- shared: Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file.
  use cases: memory-mapped I/O and IPC
- private: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.
anonymous is:
- not backed by any file
- its contents are initialized to zero
malloc uses anon mmap! (glibc does this for large allocations; small ones usually come from the brk-managed heap)
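The shared / private / anonymous distinction is easy to see from Python's `mmap` module. A small sketch, assuming a Unix system (the `flags` and `prot` arguments are Unix-only); `mmap_demo` is just an illustrative name:

```python
import mmap
import tempfile


def mmap_demo() -> dict:
    """Show how shared, private and anonymous mappings treat writes."""
    rw = mmap.PROT_READ | mmap.PROT_WRITE
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello world")
        f.flush()  # the file must be non-empty before mapping it

        # Shared: the write is carried through to the underlying file.
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED, prot=rw) as m:
            m[0:5] = b"HELLO"
            m.flush()
        with open(f.name, "rb") as g:
            after_shared = g.read()

        # Private: copy-on-write, the file never sees the change.
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE, prot=rw) as m:
            m[6:11] = b"WORLD"
            private_view = m[:]
        with open(f.name, "rb") as g:
            after_private = g.read()

    # Anonymous: no backing file (fileno -1), contents start out zeroed.
    with mmap.mmap(-1, 4096) as anon:
        anon_start = anon[:16]

    return {
        "after_shared": after_shared,
        "private_view": private_view,
        "after_private": after_private,
        "anon_start": anon_start,
    }
```

The private mapping sees its own edit (`private_view`), but re-reading the file afterwards still shows only the shared mapping's change.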
- see filesystem cache below!
- swap has nothing to do with the page cache: program data saved to disk. Completely different!
- non-dirty page cache pages are the first to go if system is tight on memory. They're just discarded since they're already on disk.
- most file I/O is driven by virtual memory's page cache:
  try running `free -h` before and after running `python -c "print('a'*(4000*2**20))" > ~/f.bin`
- flush page cache (as root): `echo 1 > /proc/sys/vm/drop_caches`
store processes data on disk
free -h
should be able to explain most of /proc/meminfo
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/proc.txt
shared memory (shmem):
- shared memory + `tmpfs` (`df` to see all `tmpfs` mounts)
- `shared` column in `free`
- `tmpfs` is basically a RAM disk
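To connect these notes to actual numbers, `/proc/meminfo` can be parsed into a dict. A rough sketch, assuming Linux; `parse_meminfo` and `read_meminfo` are made-up helper names. The fields most relevant here are `Cached` (page cache), `Buffers`, `Shmem` (shared memory and tmpfs) and `SwapTotal`/`SwapFree`:

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its integer value.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `read_meminfo()["Cached"]` before and after writing a large file should show the page cache growing, the same effect `free -h` shows.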
File:
- regular file
- special file
- block
- character
- named pipe (aka. FIFO file): in-memory like regular pipes, but has a name in the filesystem, so it can be opened by unrelated processes
- socket
- Unix Domain Socket
- Network Socket (TCP/stream, UDP/datagram, RAW)
Some steps required to write data to end of some file:
- find empty disk blocks and mark them as in use
- associate these blocks with the file
- adjust file size
- actually copy the data to the blocks
at least 3 data structures are needed:
- for tracking free disk blocks
- for tracking which data blocks belong to a file (Inode!)
- the data blocks themselves
An inode:
- each inode is named and located by a number: `ls -i`
- stores location of file's data blocks on disk
- stores file metadata: permissions, various timestamps
ext4:
- number of inodes is determined at format time!
- inode takes space on disk (256 bytes == 1/16th of a 4 KiB block)
- tradeoff: more inodes == more files, but less space for data blocks
- default: 1 inode for every 16 KB of data blocks (configurable at format time)
How to retrieve (a specific) disk block given an inode and offset?
- multilevel indexing:
inode points to data blocks and an "indirect block"; an indirect block points to more data blocks and to a further indirect block, and so on. This works well (time & space) for both large and small files.
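To make the lookup concrete, here is a sketch of turning a byte offset into a pointer path. It assumes one common concrete form of the scheme, the classic ext2-style layout: 12 direct pointers in the inode, then a single-indirect and a double-indirect block, with 4 KiB blocks and 4-byte pointers. Those numbers and the name `locate_block` are assumptions for illustration only:

```python
def locate_block(offset: int, block_size: int = 4096,
                 n_direct: int = 12, ptr_size: int = 4):
    """Return (level, indices): which pointers lead to the byte at `offset`.

    Assumed layout (ext2-style, illustrative): the inode holds `n_direct`
    direct pointers, one single-indirect and one double-indirect pointer.
    """
    ptrs = block_size // ptr_size   # pointers that fit in one indirect block
    block = offset // block_size    # logical block number within the file

    if block < n_direct:
        return ("direct", [block])
    block -= n_direct

    if block < ptrs:
        return ("single-indirect", [block])
    block -= ptrs

    if block < ptrs * ptrs:
        # First index picks the second-level block, second picks the data block.
        return ("double-indirect", [block // ptrs, block % ptrs])

    raise ValueError("offset needs a deeper level than this sketch models")
```

Small files never leave the `direct` branch (no extra disk reads), while large files pay one or two extra reads per lookup, which is why this works out well in both time and space.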
- contains (filename --> inode number) mappings (aka dentry??)
steps for open("/etc/f.txt"):
- "/" probably has hardcoded inode num (2)
- fetch inode #2, lookup inode num associated with "etc" in the contents: x
- fetch inode #x, lookup inode num associated with "f.txt" in the contents: y
- now we know the inode num for target file!
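The walk above can be imitated from userspace by reading each directory's (filename --> inode number) entries. A sketch only: the kernel fetches the next inode by number, which userspace cannot do, so this version re-opens each directory by path. `lookup` and `resolve` are made-up names:

```python
import os


def lookup(directory: str, name: str) -> int:
    """Search a directory's (filename -> inode number) entries for `name`."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name == name:
                return entry.inode()
    raise FileNotFoundError(name)


def resolve(root: str, relative_path: str):
    """Walk `relative_path` below `root`, one component at a time.

    Returns [(component, inode number), ...] in lookup order.
    """
    trail = []
    current = root
    for component in relative_path.strip("/").split("/"):
        trail.append((component, lookup(current, component)))
        # The kernel would fetch the inode by number here; we go by path.
        current = os.path.join(current, component)
    return trail
```

For example, `resolve("/", "etc/hostname")` returns the inode numbers found for `etc` and then `hostname`, matching the x and y in the steps above (assuming that file exists on the machine).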
hard link:
- filename to inode number (just like in directory)
- doesn't work cross-filesystems: because the inode num we map to is on the same fs.
- if original file is deleted or moved, hardlinks are unaffected and the file can still be accessed!
soft link:
- name --> name mapping
- works cross filesystems
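The difference shows up as soon as the original name is deleted. A short sketch using a throwaway temp directory; `link_demo` and the file names are illustrative:

```python
import os
import tempfile


def link_demo():
    """Create a hard link and a soft link, then delete the original name."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("data")

        os.link(original, hard)     # second filename --> same inode number
        os.symlink(original, soft)  # name --> name

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink  # names pointing at the inode

        os.remove(original)         # drops one name, the inode lives on
        with open(hard) as f:
            hard_content = f.read()
        # The symlink still exists, but the name it points to is gone.
        soft_dangling = os.path.islink(soft) and not os.path.exists(soft)
        return same_inode, link_count, hard_content, soft_dangling
```

The hard link keeps working because it never depended on the original name, only on the inode; the soft link dangles because it stored the name.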
    applications
         |
        vfs
       /   \
     fs1   fs2
       \   /
        disk
where to stick the cache?
above vfs:
- sees file contents only
- doesn't see fs metadata (e.g. inodes)
- `cache` (aka page cache) in `/proc/meminfo`, `free`, and `vmstat`
below actual filesystem:
- sees disk blocks, i.e. file contents and metadata, but to avoid storing file contents twice (in `buffer` and in `cache`), Linux only stores file contents in `cache`, while the `buffer` points to `cache`
- `buffers` in `/proc/meminfo`, `free`, and `vmstat`
Caching policy:
- write-through:
  writes to cache immediately make it to disk: slow writes, fast reads, safe
- write-back:
  writes to cache make it to disk asynchronously: fast writes and reads, risky
- `sync`/`fsync` flush cache to disk
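With write-back as the default, an application that needs its data on disk has to ask for it explicitly. A minimal sketch of a write followed by `fsync`; `durable_write` is an illustrative name, not a standard function:

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and return only after it was flushed to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() lands in the page cache and may be partial.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel has pushed this file's dirty pages to disk.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Without the `os.fsync` call the function would return as soon as the data is in the page cache, which is the fast but risky write-back behaviour. Opening the file with `os.O_SYNC` instead makes every `write` behave roughly like write-through.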
Recovering from inconsistencies:
(writing a single disk block is assumed to be atomic!)
- `fsck` checks entire file system: slow!
- journaling:
- record changes to be made in a journal (a circular log)
- check them off as writes make it to disk (aka checkpointing)
- Everything before the last checkpoint is assumed to have safely made it to disk
- Anything after the last checkpoint may/may not have made it. We only need to check entries since last checkpoint
- much faster!
- ext4 is a journaling fs
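The checkpoint idea can be modelled in a few lines. This is a toy, in-memory sketch and not how ext4 does it: the journal is a plain list rather than a circular on-disk log, the "disk" is a dict, and `ToyJournal` and `recover` are invented names:

```python
class ToyJournal:
    """In-memory model of a journal; entries are (block number, data) pairs."""

    def __init__(self):
        self.entries = []
        self.checkpoint = 0  # entries[:checkpoint] are known to be on disk

    def log(self, block_no: int, data: bytes) -> None:
        """Record a change that is about to be made."""
        self.entries.append((block_no, data))

    def mark_checkpoint(self) -> None:
        """Everything logged so far has safely reached the disk."""
        self.checkpoint = len(self.entries)


def recover(journal: ToyJournal, disk: dict) -> int:
    """Replay only the entries logged after the last checkpoint."""
    replayed = 0
    for block_no, data in journal.entries[journal.checkpoint:]:
        # Re-applying is safe whether or not the original write landed.
        disk[block_no] = data
        replayed += 1
    return replayed
```

Recovery cost depends on the number of entries since the last checkpoint, not on the size of the file system, which is why it beats a full `fsck`.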
- `stat("/home/sam")` returns stats, like inode num (x)
- `fd = openat("/home/sam")` opens the directory by path (the kernel resolves the path to inode x; `openat` does not take an inode number)
- `fstat(fd)` get the same stats again, this time using the open fd (why??)
- `getdents(fd)` returns dentries in directory: [(filename, inode), (filename, inode), ...]
- `write(1, "...")` write out all filenames
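The same sequence can be reproduced with Python's thin wrappers over these syscalls. A sketch assuming Linux and Python 3.7+ (for `os.scandir` on a file descriptor); `list_directory` is an illustrative name:

```python
import os


def list_directory(path: str):
    """Mimic the stat / openat / fstat / getdents sequence for a directory."""
    before = os.stat(path)                            # stat(path)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)  # openat(AT_FDCWD, path)
    try:
        after = os.fstat(fd)                          # fstat(fd)
        # Same device and inode means both calls describe the same directory.
        same_file = (before.st_dev, before.st_ino) == (after.st_dev, after.st_ino)
        with os.scandir(fd) as entries:               # reads the dentries
            listing = sorted((e.name, e.inode()) for e in entries)
    finally:
        os.close(fd)
    return same_file, listing
```

Printing the names from `listing` to stdout corresponds to the final `write(1, ...)` step.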
Need to read from (say) 3 file descriptors, each of which may block if it's not currently readable/writable (aka. ready). Naively:
read(fd1)
read(fd2)
read(fd3)
can't block on more than one fd at once (assuming 1 thread).
e.g. Might block on fd1 while fd2 is actually ready and a read wouldn't block.
- Solution, nonblocking IO:
  `read(fd)` returns error without blocking if fd currently not readable. Application needs to continuously poll, wasting CPU.
- Solution, multiplexed IO (`select`/`poll`):
blocks on all fd's at once, then issue nonblocking read on the ready fd's:
select(fd1,fd2,...,timeout)
sleeps until an fd is ready, or a time out, then issue a read on the ready fd, which won't block.
in other words: blocking (if we choose to) select on multiple fd's at once, followed by nonblocking read.
note:
- depending on the timeout, `select` can block forever, return immediately, or block until the timeout
- `select`/`poll` is O(n) in number of fds, `epoll` is more efficient!
- level-triggered: get a list of every file descriptor you’re interested in that is readable
- edge-triggered: get notifications every time a file descriptor becomes readable
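Both solutions can be tried on three pipes standing in for fd1..fd3. A sketch assuming a Unix system; `io_demo` is an illustrative name and the 1-second timeout is arbitrary:

```python
import os
import select


def io_demo():
    """Nonblocking read on an idle fd, then select() across three fds."""
    pipes = [os.pipe() for _ in range(3)]
    try:
        read_ends = [r for r, _ in pipes]

        # Nonblocking IO: read() fails immediately instead of blocking.
        os.set_blocking(read_ends[0], False)
        try:
            os.read(read_ends[0], 1)
            would_block = False
        except BlockingIOError:
            would_block = True

        # Only the second pipe gets data.
        os.write(pipes[1][1], b"hi")

        # Multiplexed IO: sleep on all fds at once, wake when one is ready.
        ready, _, _ = select.select(read_ends, [], [], 1.0)
        data = [os.read(fd, 1024) for fd in ready]  # these reads won't block

        # Timeout 0 returns immediately; nothing is left to read now.
        idle, _, _ = select.select(read_ends, [], [], 0)
        return would_block, ready == [read_ends[1]], data, idle
    finally:
        for r, w in pipes:
            os.close(r)
            os.close(w)
```

`select` is level-triggered: had the data not been read, the second `select` call would have reported the same fd as ready again.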
- parent process terminates before child ---> child's parent becomes `init`
- When a process terminates, it is not immediately removed from the system, but parts of it are kept resident in memory to allow the parent to inquire about its status upon terminating (aka `wait`)
- once the parent has `wait()`ed on the child process, it's fully destroyed
- zombie: process that terminated but hasn't yet been `wait`ed upon
- `init` routinely `wait`s on all of its children
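A zombie can be observed directly by letting a child exit and delaying the `wait`. A sketch assuming Linux (it reads `/proc/<pid>/stat`); `process_state` and `zombie_demo` are made-up names and the exit code 7 is arbitrary:

```python
import os


def process_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # Format is "pid (comm) state ..."; comm itself may contain ')'.
        return f.read().rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child, watch it become a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: terminate immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_wait = process_state(pid)  # terminated, not yet waited upon

    _, status = os.waitpid(pid, 0)          # the actual wait(): reaps the child
    exit_code = os.WEXITSTATUS(status)
    gone = not os.path.exists(f"/proc/{pid}")
    return state_before_wait, exit_code, gone
```

Between the child's exit and the `waitpid` call, the kernel keeps just enough of the process around to hand its exit status to the parent; that is the zombie (`Z`) state.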