Skip to content
/ mapreduce Public

Multithreaded MapReduce library in C using POSIX threads and synchronization primitives

Notifications You must be signed in to change notification settings

n5q/mapreduce

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CMPUT 379 Assignment 2 - MapReduce

This project implements a multithreaded MapReduce library in C using POSIX threads and synchronization primitives.


Synchronization Primitives Used

Thread Pool Library (threadpool.c)

  1. pthread_mutex_t queue_mutex

    Protects access to the job queue and thread pool state variables

    Used in: ThreadPool_add_job(), Thread_run(), ThreadPool_check(), ThreadPool_destroy()

  2. pthread_cond_t job_available

    Signals worker threads when new jobs are added to the queue

    Used in: ThreadPool_add_job() (signals), Thread_run() (waits)

    Allows threads to sleep instead of busy waiting when no jobs are available

  3. pthread_cond_t idle

    Signals when all threads are idle and the job queue is empty

    Used in: Thread_run() (signals), ThreadPool_check() (waits)

    Enables synchronization between map and reduce phases

MapReduce Library (mapreduce.c)

  1. pthread_mutex_t mutex (per partition)

    Protects access to each partitions linked list of key-val pairs

    Used in: MR_Emit(), MR_GetNext(), MR_Reduce()

    • Each partition has its own mutex

Partition Implementation

Data Structure

Partitions are implemented as sorted linked lists of key-value pairs:

typedef struct keyval_t {
    char* key;         
    char* val;         
    struct keyval_t* next; 
} keyval_t;

typedef struct {
    keyval_t* head;          
    pthread_mutex_t mutex;  
    size_t size;          
    unsigned num;          
} partition_t;

Testing Method

Environment

Linux lab machine

gcc with -Wall -pthread flags

20 text files from testcase directory (21 unique words, 5000 occurrences each)

Functional Testing

  1. Basic Functionality

    make
    ./wordcount testcase/*.txt

    Verified all 21 unique words appear across result files and each word count equals 5000, alphabetically sorted

  2. Different Configurations

    • Tested with varying numbers of worker threads: 1, 2, 5, 10, 20
    • Tested with varying numbers of partitions: 1, 5, 10, 20
    • All configurations produced correct results
  3. Memory Leak Testing

    valgrind --leak-check=full --show-leak-kinds=all ./wordcount testcase/*.txt

    Result:

    ==3530742== HEAP SUMMARY:
    ==3530742==     in use at exit: 0 bytes in 0 blocks
    ==3530742==   total heap usage: 420,281 allocs, 420,281 frees, 5,514,517 bytes allocated
    ==3530742== 
    ==3530742== All heap blocks were freed -- no leaks are possible
    ==3530742== 
    ==3530742== For lists of detected and suppressed errors, rerun with: -s
    ==3530742== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
    
  4. Synchronization Testing

    valgrind --tool=helgrind ./wordcount testcase/*.txt

    Result:

    ==3531253== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 199410 from 82)
    

Performance Testing

Running Time vs Number of Worker Threads

20 input files, 10 partitions

Threads Time (s)
1 4.564
2 3.197
5 2.519
10 2.384
20 2.350
50 2.349
100 2.350

With only 20 input files having more than 20 threads doesnt benefit during map phase

Lock contention increases with more threads

Running Time vs Number of Partitions

20 input files, 5 threads

Partitions Time (s)
1 47.742
2 20.145
5 4.972
10 2.509
20 1.592
50 0.439
100 0.358

Too few partitions - high lock contention and bottleneck

Too many partitions - higher speed but unescessary scheduling and sorting overhead + memory waste

About

Multithreaded MapReduce library in C using POSIX threads and synchronization primitives

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •