
A GPU-SSD simulator: MQSim + Macsim Integration

Overview

This is a simulator adapted from the MQSim [1] SSD simulator to include a GPU cycle-level, trace-driven simulator as its frontend. The GPU component is a simplified version of the Macsim [10] simulator that includes instruction processing, core simulation, a simple cache, and block/warp scheduling. Memory requests generated by the frontend are sent directly to the MQSim backend through a shared request queue.

File Structure

The key files modified or added on top of MQSim are highlighted below. For information regarding the original MQSim [1], please refer to the MQSim README in the second half of this file.

root/
├── run.py                     script to run multiple benchmarks/configs
├── run_collect_results.py     script to collect output from run.py
├── src/
│   ├── main.cpp               main function of the simulator; initiates GPU and SSD instances and sets up their connection; parses config parameters
│   ├── macsim/                GPU frontend simulation
│   │   ├── macsim.cpp         main macsim class; manages the GPU-SSD request queue, kernel/warp management, and the thread block scheduling policy; initiates cores; tracks simulation stats and prints them at the end (check MA_DEBUG)
│   │   ├── macsim.h           check member functions of macsim class
│   │   ├── core.cpp           reads the instruction trace and simulates a core at cycle level; warp scheduling and suspension; calls the cache
│   │   ├── trace_reader.h     nvbit instruction trace format 
│   │   ├── trace_reader_main.h  structs to store kernel, warp and scheduling metadata
│   │   └── cache.cpp          cache simulation from macsim
│   ├── exec/                  manages XML parameter structures and formats; add new parameters to the respective file at the GPU/Host/Flash level
│   │   └── SSD_Device.cpp     sets up the address mapping unit (AMU) and transaction scheduling unit (TSU) that connect to the GPU for scheduling metadata
│   ├── host/                  SSD-host interface management, PCIe and IO flow
│   │   ├── IO_Flow_Base.cpp   base class that sets up the GPU-SSD queue on the host
│   │   └── IO_Flow_Trace_Based.cpp  registers SSD events from the GPU-SSD queue, replacing the SSD trace mode
│   ├── sim/                   manages MQSim simulation basics such as the event-driven engine and the base class for events
│   │   └── Engine.cpp         event tree for event-driven MQSim; the modified run_period function makes MQSim check the event tree periodically based on the macsim cycle count; hosts the GPU-SSD queues
│   └── ssd/
│       ├── Address_Mapping_Unit_Page_Level.cpp  functions for the GPU to predict the PPA from the LPA based on the address mapping scheme, or to read the mapping table
│       └── TSU_Priority_OutOfOrder.cpp  functions for the GPU to monitor SSD internal stats such as channel busy status and queue length; not fully used yet
└── xmls/            simulation configuration files; see below for ssdconfig.xml, gpuconfig.xml, and workload.xml          

Usage

XML configuration files reside in the ./xmls/ directory. The required simulation configs are ssdconfig.xml, gpuconfig.xml, and workload.xml. ssdconfig contains backend parameters such as the number of channels/chips and access latencies; gpuconfig contains frontend parameters such as the number of cores and the scheduling policy. The workload config does not need to be modified for experiments. Check the directory for available configs and customize new ones as needed.

To run an individual configuration:

./MQSim -i xmls/(ssdconfig.xml) -g xmls/(gpuconfig.xml) -w xmls/(workload.xml) -o (result_dir) > (result_dir)/(out.txt)

This is only recommended for functional testing, e.g., verifying the correctness of a new feature.

To sweep all benchmarks:

python3 run.py -i (ssdconfig.xml) -g (gpuconfig.xml) -w (workload.xml)
# No directory prefix needs to be specified, e.g., pass ssdconfig.xml rather than xmls/ssdconfig.xml

This command sweeps all enabled benchmarks with all scheduling policies that are turned on. Check the benchmark_names variable in the run.py script for supported benchmarks and (un)comment entries to select the ones you'd like to run; the scheduling_policies variable works the same way.
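
For reference, the selection lists in run.py look roughly like the sketch below. This is illustrative only: the benchmark names are taken from the ones mentioned elsewhere in this README, and the scheduling policy name is a hypothetical placeholder, not an identifier from the script.

# Illustrative sketch only; check the actual lists in run.py.
benchmark_names = [
    "gaussian",       # Rodinia example (long-running, see "Simulation Time" below)
    "cnn_inf",        # ML workloads mentioned later in this README
    "resnet50",
    # "cnn_train",    # comment a benchmark out to exclude it from the sweep
]
scheduling_policies = [
    "baseline_policy",   # hypothetical name; use the identifiers defined in run.py
]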

A recommended baseline configuration is as follows: python3 run.py -i ssd_default.xml -g gpuconfig_16c.xml -w workload_8c_4c.xml

Both the MQSim XML results and the stdout simulation logs are collected under ./result/(bench name)/(bench config)/. Look for [timestamp](ssdconfig)_(gpuconfig)_(bs policy)_(workload).xml and [timestamp](ssdconfig)_(gpuconfig)_(bs policy)_(workload).out.
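
A quick way to see what a sweep produced is to walk the layout described above; a minimal sketch, assuming the result root is ./result/ and the file extensions are .xml and .out as named here:

import glob

# List the collected per-run files under ./result/(bench name)/(bench config)/.
for path in sorted(glob.glob("result/*/*/*")):
    if path.endswith(".xml") or path.endswith(".out"):
        print(path)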

Result Collection

run_collect_results.py collects the simulation output from both the XML stats and the simulation log. Its usage and input parameters are the same as run.py, for example: python3 run_collect_results.py -i ssd_default.xml -g gpuconfig_16c.xml -w workload_8c_4c.xml

After a successful run, a (date)_results.csv file will be written to the Summarized_result directory. Here is a list of the attributes being collected:

  • Benchmark info: benchmark name,config,# of blocks,# of warps,# of instrs,# of cache accesses,# of memory requests,# of read requests,# of write requests,page footprint/block,

  • Config info: ssdconfig,gpuconfig,workload,scheduling policy,

  • Performance info: # of GC executions,IOPS,Read IOPS,Write IOPS,Device response time (ns),Max device response time (ns),Read Transaction time (ns),Write Transaction time (ns),Avg Chip Occupancy,Chip Occupancy Stddev,Avg Contention AAD,Avg Active Chips,average latency (ns),simulation end time (ns),real sim time

simulation end time is generally used as the metric to observe speedup; benchmark metrics such as the number of cache/memory accesses and the average page footprint describe the benchmark's characteristics, and performance info such as IOPS, latency, and chip occupancy gives additional insight into why the speedup is observed.
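
As a minimal sketch of how the summary CSV can be post-processed, the snippet below groups simulation end time by benchmark and scheduling policy. It assumes the CSV headers match the attribute names listed above (e.g., "benchmark name", "scheduling policy", "simulation end time (ns)") and uses a hypothetical file name; adjust both to your actual output.

import csv
from collections import defaultdict

# Group "simulation end time (ns)" by benchmark and scheduling policy.
end_times = defaultdict(dict)
with open("Summarized_result/2024-01-01_results.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        bench = row["benchmark name"]
        policy = row["scheduling policy"]
        end_times[bench][policy] = float(row["simulation end time (ns)"])

for bench, per_policy in end_times.items():
    fastest = min(per_policy.values())
    for policy, t in per_policy.items():
        print(f"{bench:20s} {policy:24s} {t:.3e} ns  ({t / fastest:.2f}x of fastest)")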

Sensitivity Analysis

Two types of sensitivity analysis are supported: sweeping the number of SSD channels/chips and sweeping the page allocation scheme. To enable them, example commands are as follows:

python3 run.py -i ssd_default.xml -g gpuconfig_16c.xml -w workload_8c_4c.xml --en_n_channel_chips
python3 run.py -i ssd_default.xml -g gpuconfig_16c.xml -w workload_8c_4c.xml --en_palloc

Note: please limit the number of benchmarks to fewer than 5. Too many benchmarks will overload the server, since a much larger number of experiments (for example, n_benchmarks * n_palloc_schemes) will be kicked off.

To collect results, use the same --en_n_channel_chips or --en_palloc option with run_collect_results.py after the sensitivity runs have finished.

Simulation Time and Ongoing Work for ML Workloads

Most Rodinia and Gunrock benchmarks should finish in less than 10 minutes, and many in less than 5. Only gaussian usually takes longer and can run for a few hours.

The ML workloads that are able to run are "cnn_inf", "cnn_train", "bert-sampled", "gemma", "gpt2", and "resnet50" in run.py. For experiment efficiency, I used kernel_config_edited.txt to hand-pick fewer than 10 medium-sized kernels for each benchmark.

Some currently unused code in macsim.cpp that limits the kernel count/instruction count (see the trace_reader_setup function) could also be helpful.

Fixed DRAM Latency Mode

A useful way to quickly check a trace and collect frontend benchmark stats is the fixed DRAM latency mode. Set the Memory_Mode parameter in gpuconfig.xml to FIXED to bypass the GPU-SSD queue. This speeds up experiments significantly, even for ML workloads with thousands of kernels. For large benchmarks, use it to check for bugs and to inspect the block memory footprint/trace before running with the SSD backend.
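
A small sketch for flipping this switch from a script when sweeping; it assumes Memory_Mode is stored as a plain XML element inside gpuconfig.xml and uses file names from the examples above, so adjust if your config uses a different layout.

import xml.etree.ElementTree as ET

# Write a copy of a gpuconfig with Memory_Mode forced to FIXED (bypasses the GPU-SSD queue).
tree = ET.parse("xmls/gpuconfig_16c.xml")          # config file used in the examples above
for elem in tree.iter("Memory_Mode"):
    elem.text = "FIXED"
tree.write("xmls/gpuconfig_16c_fixed.xml")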

Understanding the GPU sim log and Debugging

The simulation log is ordered by kernel and periodically prints stats such as the number of requests, the average latency, and the chip occupancy. The easiest way to check whether the simulation is advancing is the request count.

When all kernels finish, the total simulation time in cycles and the overall stats are printed at the end. If the simulation did not finish and stats are missing, the run_collect_results.py script will throw errors.

Another useful tool is the block footprint log. The block footprint data structure records the SSD channel/chip derived from the LPA/PPA during macsim execution (see macsim.cpp), logs the occurrences per thread block ID, and prints them at the end. It is helpful for determining the per-block address pattern and works well together with the fixed DRAM latency mode. Here is an example output (a parsing sketch follows the example):

kernel 0
0,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6,
1,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,,,,,,,,,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6,,,,,,,,,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
# one line for each GPU threadblock
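
A minimal parsing sketch for this log: it assumes the first field of each row is the thread block ID and each remaining comma-separated field is an access count for one channel/chip slot (empty means zero), with the log saved to a hypothetical footprint.log file. The exact column-to-chip mapping should be checked against macsim.cpp.

# Parse rows such as "0,1,,,,...,6," into {block_id: {column_index: count}}.
footprint = {}
with open("footprint.log") as f:                   # hypothetical file; the log is printed with the sim output
    for line in f:
        line = line.strip()
        if not line or line.startswith("kernel") or line.startswith("#"):
            continue                               # skip kernel headers and comment lines
        fields = line.split(",")
        block_id = int(fields[0])
        footprint[block_id] = {col: int(v) for col, v in enumerate(fields[1:]) if v}

for block_id, counts in footprint.items():
    print(f"block {block_id}: touches {len(counts)} channel/chip slot(s): {counts}")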

Check the MA_DEBUG messages in macsim to turn certain messages on or off. Debug messages regarding MQSim are named MQ_DEBUG. main.cpp also prints some general stats.

Garbage Collection

Path to Macsim Traces

Rodinia: /fast_data/echung67/trace/nvbit/(bench_name)/(config_name)/kernel_config.txt
Tango: /fast_data/echung67/trace_tango/nvbit/(bench_name)/kernel_config.txt
FasterTransformer (only the first 10 instructions of each kernel): /data/echung67/trace/nvbit/(bench_name)/10/kernel_config.txt

For the bench_name and config_name of each benchmark suite, refer to the run.py file.

List of supported benchmarks and remarks:

Future Things to Try


See original MQSim README below

MQSim is a simulator that accurately captures the behavior of both modern multi-queue SSDs and conventional SATA-based SSDs. MQSim faithfully models a number of critical features absent in existing state-of-the-art simulators, including (1) modern multi-queue-based host–interface protocols (e.g., NVMe), (2) the steady-state behavior of SSDs, and (3) the end-to-end latency of I/O requests. MQSim can be run as a standalone tool, or integrated with a full-system simulator.

The full paper is published in FAST 2018 and is available online at https://people.inf.ethz.ch/omutlu/pub/MQSim-SSD-simulation-framework_fast18.pdf

Citation

Please cite our full FAST 2018 paper if you find this repository useful.

Arash Tavakkol, Juan Gomez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu, "MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices," Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST), Oakland, CA, USA, February 2018.

@inproceedings{tavakkol2018mqsim,
  title={{MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices}},
  author={Tavakkol, Arash and G{\'o}mez-Luna, Juan and Sadrosadati, Mohammad and Ghose, Saugata and Mutlu, Onur},
  booktitle={FAST},
  year={2018}
}

Additional Resources

To learn more about MQSim, please refer to the slides and talk below:

Usage in Linux

Run the following commands:

$ make
$ ./MQSim -i <SSD Configuration File> -w <Workload Definition File>

Usage in Windows

  1. Open the MQSim.sln solution file in MS Visual Studio 2017 or later.
  2. Set the Solution Configuration to Release (it is set to Debug by default).
  3. Compile the solution.
  4. Run the generated executable file (e.g., MQSim.exe) either in command line mode or by clicking the MS Visual Studio run button. Please specify the paths to the files containing the 1) SSD configurations, and 2) workload definitions.

Example command line execution:

$ MQSim.exe -i <SSD Configuration File> -w <Workload Definition File> 

MQSim Execution Configurations

You can specify your preferred SSD configuration in the XML format. If the SSD configuration file specified in the command line does not exist, MQSim will create a sample XML file in the specified path. Here are the definitions of configuration parameters available in the XML file:

Host

  1. PCIe_Lane_Bandwidth: the PCIe bandwidth per lane in GB/s. Range = {all positive double precision values}.
  2. PCIe_Lane_Count: the number of PCIe lanes. Range = {all positive integer values}.
  3. SATA_Processing_Delay: defines the aggregate hardware and software processing delay to send/receive a SATA message to the SSD device in nanoseconds. Range = {all positive integer values}.
  4. Enable_ResponseTime_Logging: the toggle to enable response time logging. If enabled, response time is calculated for each running I/O flow over simulation epochs and is reported in a log file at the end of each epoch. Range = {true, false}.
  5. ResponseTime_Logging_Period_Length: defines the epoch length for response time logging in nanoseconds. Range = {all positive integer values}.

SSD Device

  1. Seed: the seed value that is used for random number generation. Range = {all positive integer values}.
  2. Enabled_Preconditioning: the toggle to enable preconditioning. Range = {true, false}.
  3. Memory_Type: the type of the non-volatile memory used for data storage. Range = {FLASH}.
  4. HostInterface_Type: the type of host interface. Range = {NVME, SATA}.
  5. IO_Queue_Depth: the length of the host-side I/O queue. If the host interface is set to NVME, then IO_Queue_Depth defines the capacity of the I/O Submission and I/O Completion Queues. If the host interface is set to SATA, then IO_Queue_Depth defines the capacity of the Native Command Queue (NCQ). Range = {all positive integer values}
  6. Queue_Fetch_Size: the value of the QueueFetchSize parameter as described in the FAST 2018 paper [1]. Range = {all positive integer values}
  7. Caching_Mechanism: the data caching mechanism used on the device. Range = {SIMPLE: implements a simple data destaging buffer, ADVANCED: implements an advanced data caching mechanism with different sharing options among the concurrent flows}.
  8. Data_Cache_Sharing_Mode: the sharing mode of the DRAM data cache (buffer) among the concurrently running I/O flows when an NVMe host interface is used. Range = {SHARED, EQUAL_PARTITIONING}.
  9. Data_Cache_Capacity: the size of the DRAM data cache in bytes. Range = {all positive integers}
  10. Data_Cache_DRAM_Row_Size: the size of the DRAM rows in bytes. Range = {all positive power of two numbers}.
  11. Data_Cache_DRAM_Data_Rate: the DRAM data transfer rate in MT/s. Range = {all positive integer values}.
  12. Data_Cache_DRAM_Data_Burst_Size: the number of bytes that are transferred in one DRAM burst (depends on the number of DRAM chips). Range = {all positive integer values}.
  13. Data_Cache_DRAM_tRCD: the value of the timing parameter tRCD in nanoseconds used to access DRAM in the data cache. Range = {all positive integer values}.
  14. Data_Cache_DRAM_tCL: the value of the timing parameter tCL in nanoseconds used to access DRAM in the data cache. Range = {all positive integer values}.
  15. Data_Cache_DRAM_tRP: the value of the timing parameter tRP in nanoseconds used to access DRAM in the data cache. Range = {all positive integer values}.
  16. Address_Mapping: the logical-to-physical address mapping policy implemented in the Flash Translation Layer (FTL). Range = {PAGE_LEVEL, HYBRID}.
  17. Ideal_Mapping_Table: the toggle to enable an ideal mapping table, in which all address translation entries are always present in the CMT (i.e., the CMT is infinite in size) and thus all address translation requests succeed (i.e., all mapping entries are found in DRAM and there is no need to read mapping entries from flash). Range = {true, false}.
  18. CMT_Capacity: the size of the SRAM/DRAM space in bytes used to cache the address mapping table (Cached Mapping Table). Range = {all positive integer values}.
  19. CMT_Sharing_Mode: the mode that determines how the entire CMT (Cached Mapping Table) space is shared among concurrently running flows when an NVMe host interface is used. Range = {SHARED, EQUAL_PARTITIONING}.
  20. Plane_Allocation_Scheme: the scheme for plane allocation as defined in Tavakkol et al. [3]. Range = {CWDP, CWPD, CDWP, CDPW, CPWD, CPDW, WCDP, WCPD, WDCP, WDPC, WPCD, WPDC, DCWP, DCPW, DWCP, DWPC, DPCW, DPWC, PCWD, PCDW, PWCD, PWDC, PDCW, PDWC}
  21. Transaction_Scheduling_Policy: the transaction scheduling policy that is used in the SSD back end. Range = {OUT_OF_ORDER as defined in the Sprinkler paper [2], PRIORITY_OUT_OF_ORDER which implements OUT_OF_ORDER and NVMe priorities}.
  22. Overprovisioning_Ratio: the ratio of reserved storage space with respect to the available flash storage capacity. Range = {all positive double precision values}.
  23. GC_Exect_Threshold: the threshold for starting Garbage Collection (GC). When the ratio of the free physical pages for a plane drops below this threshold, GC execution begins. Range = {all positive double precision values}.
  24. GC_Block_Selection_Policy: the GC block selection policy. Range = {GREEDY, RGA (described in [4] and [5]), RANDOM (described in [4]), RANDOM_P (described in [4]), RANDOM_PP (described in [4]), FIFO (described in [6])}.
  25. Use_Copyback_for_GC: the toggle that determines whether copyback is used for GC write transactions (used in GC_and_WL_Unit_Page_Level). Range = {true, false}.
  26. Preemptible_GC_Enabled: the toggle to enable pre-emptible GC (described in [7]). Range = {true, false}.
  27. GC_Hard_Threshold: the threshold to stop pre-emptible GC execution (described in [7]). Range = {all possible positive double precision values less than GC_Exect_Threshold}.
  28. Dynamic_Wearleveling_Enabled: the toggle to enable dynamic wear-leveling (described in [9]). Range = {true, false}.
  29. Static_Wearleveling_Enabled: the toggle to enable static wear-leveling (described in [9]). Range = {true, false}.
  30. Static_Wearleveling_Threshold: the threshold for starting static wear-leveling (described in [9]). When the difference between the minimum and maximum erase count within a memory unit (e.g., a plane in flash memory) exceeds this threshold, static wear-leveling begins. Range = {all positive integer values}.
  31. Preferred_suspend_erase_time_for_read: the reasonable time to suspend an ongoing flash erase operation in favor of a recently-queued read operation. Range = {all positive integer values}.
  32. Preferred_suspend_erase_time_for_write: the reasonable time to suspend an ongoing flash erase operation in favor of a recently-queued write (program) operation. Range = {all positive integer values}.
  33. Preferred_suspend_write_time_for_read: the reasonable time to suspend an ongoing flash program (write) operation in favor of a recently-queued read operation. Range = {all positive integer values}.
  34. Flash_Channel_Count: the number of flash channels in the SSD back end. Range = {all positive integer values}.
  35. Flash_Channel_Width: the width of each flash channel in bytes. Range = {all positive integer values}.
  36. Channel_Transfer_Rate: the transfer rate of flash channels in the SSD back end in MT/s. Range = {all positive integer values}.
  37. Chip_No_Per_Channel: the number of flash chips attached to each channel in the SSD back end. Range = {all positive integer values}.
  38. Flash_Comm_Protocol: the Open NAND Flash Interface (ONFI) protocol used for data transfer over flash channels in the SSD back end. Range = {NVDDR2}.

NAND Flash

  1. Flash_Technology: Range = {SLC, MLC, TLC}.
  2. CMD_Suspension_Support: the type of suspend command support by flash chips. Range = {NONE, PROGRAM, PROGRAM_ERASE, ERASE}.
  3. Page_Read_Latency_LSB: the latency of reading LSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  4. Page_Read_Latency_CSB: the latency of reading CSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  5. Page_Read_Latency_MSB: the latency of reading MSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  6. Page_Program_Latency_LSB: the latency of programming LSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  7. Page_Program_Latency_CSB: the latency of programming CSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  8. Page_Program_Latency_MSB: the latency of programming MSB bits of flash memory cells in nanoseconds. Range = {all positive integer values}.
  9. Block_Erase_Latency: the latency of erasing a flash block in nanoseconds. Range = {all positive integer values}.
  10. Block_PE_Cycles_Limit: the PE limit of each flash block. Range = {all positive integer values}.
  11. Suspend_Erase_Time: the time taken to suspend an ongoing erase operation in nanoseconds. Range = {all positive integer values}.
  12. Suspend_Program_Time: the time taken to suspend an ongoing program operation in nanoseconds. Range = {all positive integer values}.
  13. Die_No_Per_Chip: the number of dies in each flash chip. Range = {all positive integer values}.
  14. Plane_No_Per_Die: the number of planes in each die. Range = {all positive integer values}.
  15. Block_No_Per_Plane: the number of flash blocks in each plane. Range = {all positive integer values}.
  16. Page_No_Per_Block: the number of physical pages in each flash block. Range = {all positive integer values}.
  17. Page_Capacity: the size of each physical flash page in bytes. Range = {all positive integer values}.
  18. Page_Metadat_Capacity: the size of the metadata area of each physical flash page in bytes. Range = {all positive integer values}.
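
To see how the geometry parameters combine, here is a small worked sketch of the raw flash capacity they imply, using the SSD Device channel/chip counts together with the NAND Flash parameters above. The numeric values are hypothetical, and treating Overprovisioning_Ratio as reserved/raw capacity is an assumption.

# Hypothetical geometry; substitute the values from your ssdconfig.xml.
flash_channel_count = 8
chip_no_per_channel = 4
die_no_per_chip = 2
plane_no_per_die = 2
block_no_per_plane = 2048
page_no_per_block = 256
page_capacity = 8192                      # bytes per physical page

raw_bytes = (flash_channel_count * chip_no_per_channel * die_no_per_chip *
             plane_no_per_die * block_no_per_plane * page_no_per_block * page_capacity)

overprovisioning_ratio = 0.07             # assumed to mean reserved / raw capacity
user_visible_bytes = raw_bytes * (1 - overprovisioning_ratio)

print(f"raw capacity:          {raw_bytes / 2**30:.1f} GiB")   # 512.0 GiB with these values
print(f"user-visible (approx): {user_visible_bytes / 2**30:.1f} GiB")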

MQSim Workload Definition

You can define your preferred set of workloads in the XML format. If the specified workload definition file does not exist, MQSim will create a sample workload definition file in XML format for you (i.e., workload.xml). Here is the explanation of the XML attributes and tags for the workload definition file:

  1. The entire workload definitions should be embedded within <MQSim_IO_Scenarios></MQSim_IO_Scenarios> tags. You can define different sets of I/O scenarios within these tags. MQSim simulates each I/O scenario separately.

  2. We call a set of workloads that should be executed together, an I/O scenario. An I/O scenario is defined within the <IO_Scenario></IO_Scenario> tags. For example, two different I/O scenarios are defined in the workload definition file in the following way:

<MQSim_IO_Scenarios>
	<IO_Scenario>
	.............
	</IO_Scenario>
	<IO_Scenario>
	.............
	</IO_Scenario>
</MQSim_IO_Scenarios>

For each I/O scenario, MQSim 1) rebuilds the Host and SSD Drive model and executes the scenario to completion, and 2) creates an output file and writes the simulation results to it. For the example mentioned above, MQSim builds the Host and SSD Drive models twice, executes the first and second I/O scenarios, and finally writes the execution results into the workload_scenario_1.xml and workload_scenario_2.xml files, respectively.

You can define up to 8 different workloads within each IO_Scenario tag. Each workload could either be a disk trace file that has already been collected on a real system or a synthetic stream of I/O requests that are generated by MQSim's request generator.

Defining a Trace-based Workload

You can define a trace-based workload for MQSim using the <IO_Flow_Parameter_Set_Trace_Based> XML tag. Currently, MQSim can execute ASCII disk traces defined in [8], in which each line of the trace file has the following format: 1.Request_Arrival_Time 2.Device_Number 3.Starting_Logical_Sector_Address 4.Request_Size_In_Sectors 5.Type_of_Requests [0 for write, 1 for read]
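
As a minimal sketch of that line format (field order taken from the description above; whitespace separation and the example values are assumptions):

from typing import NamedTuple

class TraceRequest(NamedTuple):
    arrival_time: int      # 1. Request_Arrival_Time
    device_number: int     # 2. Device_Number
    start_lsa: int         # 3. Starting_Logical_Sector_Address
    size_sectors: int      # 4. Request_Size_In_Sectors
    is_read: bool          # 5. Type_of_Requests (0 = write, 1 = read)

def parse_trace_line(line: str) -> TraceRequest:
    t, dev, lsa, size, kind = line.split()[:5]
    return TraceRequest(int(t), int(dev), int(lsa), int(size), kind == "1")

print(parse_trace_line("10000 0 8192 16 1"))   # hypothetical trace line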

The following parameters are used to define a trace-based workload:

  1. Priority_Class: the priority class of the I/O queue associated with this I/O request. Range = {URGENT, HIGH, MEDIUM, LOW}.
  2. Device_Level_Data_Caching_Mode: the type of on-device data caching for this flow. Range={WRITE_CACHE, READ_CACHE, WRITE_READ_CACHE, TURNED_OFF}. If the caching mechanism mentioned above is set to SIMPLE, then only WRITE_CACHE and TURNED_OFF modes could be used.
  3. Channel_IDs: a comma-separated list of channel IDs that are allocated to this workload. This list is used for resource partitioning. If there are C channels in the SSD (defined in the SSD configuration file), then the channel ID list should include values in the range 0 to C-1. If no resource partitioning is required, then all workloads should have channel IDs 0 to C-1.
  4. Chip_IDs: a comma-separated list of chip IDs that are allocated to this workload. This list is used for resource partitioning. If there are W chips in each channel (defined in the SSD configuration file), then the chip ID list should include values in the range 0 to W-1. If no resource partitioning is required, then all workloads should have chip IDs 0 to W-1.
  5. Die_IDs: a comma-separated list of die IDs that are allocated to this workload. This list is used for resource partitioning. If there are D dies in each flash chip (defined in the SSD configuration file), then the die ID list should include values in the range 0 to D-1. If no resource partitioning is required, then all workloads should have die IDs 0 to D-1.
  6. Plane_IDs: a comma-separated list of plane IDs that are allocated to this workload. This list is used for resource partitioning. If there are P planes in each die (defined in the SSD configuration file), then the plane ID list should include values in the range 0 to P-1. If no resource partitioning is required, then all workloads should have plane IDs 0 to P-1.
  7. Initial_Occupancy_Percentage: the percentage of the storage space (i.e., logical pages) that is filled during preconditioning. Range = {all integer values in the range 1 to 100}.
  8. File_Path: the relative/absolute path to the input trace file.
  9. Percentage_To_Be_Executed: the percentage of requests in the input trace file that should be executed. Range = {all integer values in the range 1 to 100}.
  10. Relay_Count: the number of times that the trace execution should be repeated. Range = {all positive integer values}.
  11. Time_Unit: the unit of arrival times in the input trace file. Range = {PICOSECOND, NANOSECOND, MICROSECOND}

Defining a Synthetic Workload

You can define a synthetic workload for MQSim, using the <IO_Flow_Parameter_Set_Synthetic> XML tag.

The following parameters are used to define a synthetic workload:

  1. Priority_Class: same as trace-based parameters mentioned above.
  2. Device_Level_Data_Caching_Mode: same as trace-based parameters mentioned above.
  3. Channel_IDs: same as trace-based parameters mentioned above.
  4. Chip_IDs: same as trace-based parameters mentioned above.
  5. Die_IDs: same as trace-based parameters mentioned above.
  6. Plane_IDs: same as trace-based parameters mentioned above.
  7. Initial_Occupancy_Percentage: same as trace-based parameters mentioned above.
  8. Working_Set_Percentage: the percentage of available logical storage space that is accessed by generated requests. Range = {all integer values in the range 1 to 100}.
  9. Synthetic_Generator_Type: determines the way that the stream of requests is generated. Currently, there are two modes for generating consecutive requests, 1) based on the average bandwidth of I/O requests, or 2) based on the average depth of the I/O queue. Range = {BANDWIDTH, QUEUE_DEPTH}.
  10. Read_Percentage: the ratio of read requests in the generated flow of I/O requests. Range = {all integer values in the range 1 to 100}.
  11. Address_Distribution: the distribution pattern of addresses in the generated flow of I/O requests. Range = {STREAMING, RANDOM_UNIFORM, RANDOM_HOTCOLD, MIXED_STREAMING_RANDOM}.
  12. Percentage_of_Hot_Region: if RANDOM_HOTCOLD is set for address distribution, then this parameter determines the ratio of the hot region with respect to the entire logical address space. Range = {all integer values in the range 1 to 100}.
  13. Generated_Aligned_Addresses: the toggle to enable aligned address generation. Range = {true, false}.
  14. Address_Alignment_Unit: the unit that all generated addresses must be aligned to in sectors (i.e. 512 bytes). Range = {all positive integer values}.
  15. Request_Size_Distribution: the distribution pattern of request sizes in the generated flow of I/O requests. Range = {FIXED, NORMAL}.
  16. Average_Request_Size: average size of generated I/O requests in sectors (i.e. 512 bytes). Range = {all positive integer values}.
  17. Variance_Request_Size: if the request size distribution is set to NORMAL, then this parameter determines the variance of I/O request sizes in sectors. Range = {all non-negative integer values}.
  18. Seed: the seed value that is used for random number generation. Range = {all positive integer values}.
  19. Average_No_of_Reqs_in_Queue: average number of I/O requests enqueued in the host-side I/O queue (i.e., the intensity of the generated flow). This parameter is used in QUEUE_DEPTH mode of request generation. Range = {all positive integer values}.
  20. Bandwidth: the average bandwidth of I/O requests (i.e., the intensity of the generated flow) in bytes per second. MQSim uses this parameter in the BANDWIDTH mode of request generation.
  21. Stop_Time: defines when to stop generating I/O requests in nanoseconds.
  22. Total_Requests_To_Generate: if Stop_Time is set to zero, then MQSim's request generator considers Total_Requests_To_Generate to decide when to stop generating I/O requests.

Analyze MQSim's XML Output

You can use an XML processor to easily read and analyze an MQSim output file; a script-based alternative is sketched after the steps below. For example, you can open an MQSim output file in MS Excel. Excel then shows a set of options, and you should choose "Use the XML Source task pane". The XML file is processed in Excel and a task pane is shown with all output parameters listed in it. In the task pane on the right, you see the different types of statistics available in MQSim's output file. To read the value of a parameter, you should:

  1. Drag and drop the parameter from the XML Source task pane to the Excel sheet.
  2. Right-click on the cell where you dropped the parameter and select XML > Refresh XML Data from the drop-down menu.
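
If you prefer a script over Excel, a schema-agnostic sketch like the one below dumps every element in the output file that carries data; it assumes nothing about the tag layout beyond the file being well-formed XML, and the file name is taken from the I/O scenario example above.

import xml.etree.ElementTree as ET

# Dump every element in an MQSim output file that carries attributes or text.
tree = ET.parse("workload_scenario_1.xml")     # output file name from the I/O scenario example above
for elem in tree.iter():
    if elem.attrib:
        print(elem.tag, elem.attrib)
    elif elem.text and elem.text.strip():
        print(elem.tag, elem.text.strip())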

The parameters reported in the simulator's output file are divided into the following categories:

Host

For each defined IO_Flow, the following parameters are shown:

  1. Name: The name of the IO flow, e.g. Host.IO_Flow.Synth.No_0
  2. Request_Count: The total number of requests from this IO_flow.
  3. Read_Request_Count: The total number of read requests from this IO_flow.
  4. Write_Request_Count: The total number of write requests from this IO_flow.
  5. IOPS: The number of IO operations per second, i.e. how many requests are served per second.
  6. IOPS_Read: The number of read IO operations per second.
  7. IOPS_Write: The number of write IO operations per second.
  8. Bytes_Transferred: The total number of data bytes transferred across the interface.
  9. Bytes_Transferred_Read: The total number of data bytes read from the SSD Device.
  10. Bytes_Transferred_write: The total number of data bytes written to the SSD Device.
  11. Bandwidth: The total bandwidth delivered by the SSD Device in bytes per second.
  12. Bandwidth_Read: The total read bandwidth delivered by the SSD Device in bytes per second.
  13. Bandwidth_Write: The total write bandwidth delivered by the SSD Device in bytes per second.
  14. Device_Response_Time: The average SSD device response time for a request, in nanoseconds. This is defined as the time between enqueueing the request in the I/O submission queue, and removing it from the I/O completion queue.
  15. Min_Device_Response_Time: The minimum SSD device response time for a request, in nanoseconds.
  16. Max_Device_Response_Time: The maximum SSD device response time for a request, in nanoseconds.
  17. End_to_End_Request_Delay: The average delay between generating an I/O request and receiving a corresponding answer. This is defined as the difference between the request arrival time, and its removal time from the I/O completion queue. Note that the request arrival_time is the same as the request enqueue_time, when using the multi-queue properties of NVMe drives.
  18. Min_End_to_End_Request_Delay: The minimum end-to-end request delay.
  19. Max_End_to_End_Request_Delay: The maximum end-to-end request delay.

SSDDevice

The output parameters in the SSDDevice category contain values for:

  1. Average transaction times at a lower abstraction level (SSDDevice.IO_Stream)
  2. Statistics for the Flash Translation Layer (FTL)
  3. Statistics for each queue in the SSD's internal flash Transaction Scheduling Unit (TSU): the TSU contains a User_Read_TR_Queue, a User_Write_TR_Queue, a Mapping_Read_TR_Queue, a Mapping_Write_TR_Queue, a GC_Read_TR_Queue, a GC_Write_TR_Queue, and a GC_Erase_TR_Queue for each combination of channel and package.
  4. For each package: the fraction of time spent in exclusive memory command execution, exclusive data transfer, overlapped memory command execution and data transfer, and idle mode.

References

[1] A. Tavakkol et al., "MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices," FAST, pp. 49 - 66, 2018.

[2] M. Jung and M. T. Kandemir, "Sprinkler: Maximizing Resource Utilization in Many-chip Solid State Disks," HPCA, pp. 524-535, 2014.

[3] A. Tavakkol et al., "Performance Evaluation of Dynamic Page Allocation Strategies in SSDs," ACM TOMPECS, pp. 7:1--7:33, 2016.

[4] B. Van Houdt, "A Mean Field Model for a Class of Garbage Collection Algorithms in Flash-based Solid State Drives," SIGMETRICS, pp. 191-202, 2013.

[5] Y. Li et al., "Stochastic Modeling of Large-Scale Solid-State Storage Systems: Analysis, Design Tradeoffs and Optimization," SIGMETRICS, pp. 179-190, 2013.

[6] P. Desnoyers, "Analytic Modeling of SSD Write Performance", SYSTOR, pp. 12:1-12:10, 2012.

[7] J. Lee et al., "Preemptible I/O Scheduling of Garbage Collection for Solid State Drives," IEEE TCAD, Vol. 32, No. 2, pp. 247-260, 2013.

[8] J. S. Bucy et al., "The DiskSim Simulation Environment Version 4.0 Reference Manual", CMU Tech Rep. CMU-PDL-08-101, 2008.

[9] Micron Technology, Inc., "Wear Leveling in NAND Flash Memory", Application Note AN1822, 2010.

[10] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho, "MacSim: A CPU-GPU Heterogeneous Simulation Framework User Guide," Georgia Institute of Technology, 2012.
