This project implements a multithreaded MapReduce library in C using POSIX threads and synchronization primitives.
-
pthread_mutex_t queue_mutex
Protects access to the job queue and thread pool state variables
Used in:
ThreadPool_add_job(),Thread_run(),ThreadPool_check(),ThreadPool_destroy() -
pthread_cond_t job_available
Signals worker threads when new jobs are added to the queue
Used in:
ThreadPool_add_job()(signals),Thread_run()(waits)Allows threads to sleep instead of busy waiting when no jobs are available
-
pthread_cond_t idle
Signals when all threads are idle and the job queue is empty
Used in:
Thread_run()(signals),ThreadPool_check()(waits)Enables synchronization between map and reduce phases
-
pthread_mutex_t mutex (per partition)
Protects access to each partitions linked list of key-val pairs
Used in:
MR_Emit(),MR_GetNext(),MR_Reduce()- Each partition has its own mutex
Partitions are implemented as sorted linked lists of key-value pairs:
typedef struct keyval_t {
char* key;
char* val;
struct keyval_t* next;
} keyval_t;
typedef struct {
keyval_t* head;
pthread_mutex_t mutex;
size_t size;
unsigned num;
} partition_t;Linux lab machine
gcc with -Wall -pthread flags
20 text files from testcase directory (21 unique words, 5000 occurrences each)
-
Basic Functionality
make ./wordcount testcase/*.txtVerified all 21 unique words appear across result files and each word count equals 5000, alphabetically sorted
-
Different Configurations
- Tested with varying numbers of worker threads: 1, 2, 5, 10, 20
- Tested with varying numbers of partitions: 1, 5, 10, 20
- All configurations produced correct results
-
Memory Leak Testing
valgrind --leak-check=full --show-leak-kinds=all ./wordcount testcase/*.txtResult:
==3530742== HEAP SUMMARY: ==3530742== in use at exit: 0 bytes in 0 blocks ==3530742== total heap usage: 420,281 allocs, 420,281 frees, 5,514,517 bytes allocated ==3530742== ==3530742== All heap blocks were freed -- no leaks are possible ==3530742== ==3530742== For lists of detected and suppressed errors, rerun with: -s ==3530742== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) -
Synchronization Testing
valgrind --tool=helgrind ./wordcount testcase/*.txtResult:
==3531253== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 199410 from 82)
Running Time vs Number of Worker Threads
20 input files, 10 partitions
| Threads | Time (s) |
|---|---|
| 1 | 4.564 |
| 2 | 3.197 |
| 5 | 2.519 |
| 10 | 2.384 |
| 20 | 2.350 |
| 50 | 2.349 |
| 100 | 2.350 |
With only 20 input files having more than 20 threads doesnt benefit during map phase
Lock contention increases with more threads
Running Time vs Number of Partitions
20 input files, 5 threads
| Partitions | Time (s) |
|---|---|
| 1 | 47.742 |
| 2 | 20.145 |
| 5 | 4.972 |
| 10 | 2.509 |
| 20 | 1.592 |
| 50 | 0.439 |
| 100 | 0.358 |
Too few partitions - high lock contention and bottleneck
Too many partitions - higher speed but unescessary scheduling and sorting overhead + memory waste