Draft
Commits
145 commits
0b97da9
Unfinished changes with prototype function
arpitj1 Jun 6, 2024
69ef423
Loop over linalg.generic's input and output ops
arpitj1 Jun 6, 2024
7678a05
Some comments
arpitj1 Jun 6, 2024
0e88095
Partial changes from coding session to implement fusion of linalg.gen…
arpitj1 Jun 11, 2024
b57c0b8
Incremental changes to fuse linalg and for loop- Logic for shifted op…
arpitj1 Jun 19, 2024
f54c33d
ran clang format
arpitj1 Jun 25, 2024
56e2c54
some compile time fixes
arpitj1 Jun 25, 2024
e253040
Some compile fixes
arpitj1 Jul 2, 2024
e99b8a5
Fixed all the compilation issues. Sample MLIR not raised
arpitj1 Jul 3, 2024
34f595c
Bug fixes, generating some output at getLinalgArgMap
arpitj1 Jul 16, 2024
05bad97
Almost implementated remap in affine dim for multi idx
arpitj1 Jul 17, 2024
5bbf5ef
Added submap op support and refactored the code to use submap
arpitj1 Jul 24, 2024
9018d92
bunch of fixes. Now able to generate raise linalg code
arpitj1 Jul 30, 2024
ec041a0
Now almost working second loop raising to linalg
arpitj1 Jul 31, 2024
23138fc
Fixes to correctly raise 2 level for loops to linalg.generic
arpitj1 Jul 31, 2024
5f20bd7
Missed file update to enable linalg dialect in polygeist
arpitj1 Jul 31, 2024
b0e96aa
Fix for syms and dims calculation
arpitj1 Aug 6, 2024
ea76f0a
More tests added to cover different loop cases
arpitj1 Aug 7, 2024
591c84e
Now able to compile 3/any number of loops with parallel iter type; Ad…
arpitj1 Aug 7, 2024
b0108e3
Non iter-arg variant of matrix-mul and conv are now raised to linalg.…
arpitj1 Aug 7, 2024
4362c80
submap canonicalizer implemented
arpitj1 Aug 21, 2024
77c8168
Added reduction loops for linalg
arpitj1 Aug 22, 2024
98f0119
Fix for incorrect for loop dims
arpitj1 Aug 28, 2024
59eec0b
Linalg.generic 4 loop cases raised- todo: reduction and some if-else …
arpitj1 Sep 5, 2024
a363f13
Adding test case for all passing raising and lowering, example case o…
arpitj1 Sep 18, 2024
814ca51
Added pass remove iter args from scf; Added psuedo code for submap ca…
arpitj1 Oct 12, 2024
701f25a
Added removal of iter_args for affine loops
arpitj1 Oct 12, 2024
d285fb5
Temporary reverted pass registeration as the code was failing
arpitj1 Oct 12, 2024
c40e7a9
WIP commit
arpitj1 Oct 15, 2024
788a3c4
Added submap of submap canonicalizer with test- failing
arpitj1 Oct 18, 2024
8265216
Added canonicalization for linalg with submap and test cases
arpitj1 Oct 25, 2024
532773a
Added modified 2d kernel for harris score- raised successfully to lin…
arpitj1 Oct 25, 2024
e2b4b2d
Added harris score kernel with gradient kernel- just to be able to ra…
arpitj1 Oct 25, 2024
f2ab09e
Initial working implementation of debufferize flow for linalg with ex…
arpitj1 Jan 13, 2025
2342381
Added more complex case to show debufferization ; Fixed bugs in debuf…
arpitj1 Jan 13, 2025
fde88fe
Fixed clang format
arpitj1 Jan 13, 2025
cf9f953
Ran git clang format locally to fix regression failures
arpitj1 Jan 13, 2025
f10c47a
Working implementation for function args memrefType with noinline att…
arpitj1 Jan 17, 2025
490f924
Added debufferization Alloc Removal pass, add working examples with l…
arpitj1 Jan 17, 2025
e20708c
Added support for debufferization across nested regions - working for…
arpitj1 Jan 31, 2025
4a7efe7
Bug fix for erasing the op correctly
arpitj1 Jan 31, 2025
6d8832f
Bug fixes for 1. recursive parent search in sorting users 2. traversi…
arpitj1 Jan 31, 2025
6ca2aeb
Added cases of buffer capture which doesn't debufferize
arpitj1 Jan 31, 2025
803ec30
Canonicalization gets rid of memref capture by loop
arpitj1 Feb 1, 2025
fb0ac18
Working implementation for scf.for op and scf.if op; added bug fix to…
arpitj1 Feb 7, 2025
0472c34
Added data structures to track expandedUsers that can include for loo…
arpitj1 Feb 7, 2025
3272f2c
Added logic in for loop case to find all users of iter_args and updat…
arpitj1 Feb 8, 2025
da2ae5b
Added a bunch of tests with nested regions- all getting connected and…
arpitj1 Feb 8, 2025
a570c1b
Added more complex region cases with mix of if-else statements
arpitj1 Feb 8, 2025
7ee707b
Generic solver to represent linalg.generic as kernel.def ops
arpitj1 May 8, 2025
c8561b4
Adding cases for generic solver
arpitj1 May 12, 2025
07d0dcb
Backup of previous edits
arpitj1 May 28, 2025
009ab9b
Temp changes for kernel dialect
Jun 11, 2025
c0f36d3
Enabled kernel dialect correctly running on sample IR with kernel def…
Jun 11, 2025
6a67379
Added linalgToKernel pass- compile failure
arpitj1 Jun 12, 2025
7f9d00f
Working pattern matching and replacement for linalg generics
arpitj1 Jun 12, 2025
d765bb9
Partial changes for different files for kernel and input
arpitj1 Jun 12, 2025
15ef84e
Crash fix
arpitj1 Jun 13, 2025
44fed6c
Improved lib
arpitj1 Jun 26, 2025
4a95c7f
Removing redundant file
arpitj1 Jun 26, 2025
f1e5f02
Renamed kernel lib
arpitj1 Jun 26, 2025
e941c5e
Added min_abs_index test
arpitj1 Jun 26, 2025
a99fad9
Fixed a bunch of bugs in raiseToLinalg while raising polybench
arpitj1 Jun 27, 2025
4e782d5
Fixed raise to linalg and canonicalizer to generate subview
arpitj1 Jun 28, 2025
bd15b6d
Fixed submap simplification, improved raisedToLinalg to work with non…
arpitj1 Jul 31, 2025
cb34836
Added parallel fission pass
arpitj1 Aug 1, 2025
53c5d14
Added pattern for parallel to seq for loops
arpitj1 Aug 1, 2025
60b81d2
Added raise-to-linalg-pipeline
arpitj1 Aug 1, 2025
7b2f5d9
Added linalgGenericEliminateSubmaps and commented out submapToSubviewOp
arpitj1 Aug 1, 2025
71e441f
Canonicalization fix
arpitj1 Aug 1, 2025
e421a86
bug fix for non nullptr in submap creation
arpitj1 Aug 1, 2025
56724a5
Fix in linalg debufferizer - failure return and only insert memref.co…
arpitj1 Aug 1, 2025
c3c2700
improved matcher to create a dependency graph and use it for matching
arpitj1 Aug 1, 2025
ca12291
Runtime failure but match happening correctly to kernel dialect
arpitj1 Aug 3, 2025
7c204f2
Working match for linalg kernel match for gemm
arpitj1 Aug 3, 2025
37dd847
Added debug prints
arpitj1 Aug 3, 2025
7e3f0d0
Able to raise gemv
arpitj1 Aug 4, 2025
3b56eb3
blas C codes- for raising to linalg
arpitj1 Oct 15, 2025
fa99aa8
Debug prints for RaiseTolinalg and 2. SelectFunc pass to process just…
arpitj1 Oct 15, 2025
ed30a14
Update RemoveIterArgs to work with chain ops before store for affine.for
arpitj1 Oct 17, 2025
a816708
Added int op support
arpitj1 Oct 17, 2025
0edd38e
remote iter args improved and test added
arpitj1 Oct 17, 2025
3b8c43b
Implemented improvement in linalg debufferize to work through inverse…
arpitj1 Jan 10, 2026
adaa7a1
Add consumer-blind alloca fallback to --remove-iter-args
arpitj1 May 13, 2026
146322d
Add v2 region-recursive --linalg-debufferize implementation behind a …
arpitj1 May 13, 2026
9cc7f54
Promote v2 region-recursive --linalg-debufferize to default
arpitj1 May 13, 2026
0df59d3
RaiseToLinalg: support non-constant lower bounds via in-body mask
arpitj1 May 13, 2026
6b20d4b
RaiseToLinalg: distribute mixed-body loops before raising (Group C)
arpitj1 May 13, 2026
d278514
RaiseToLinalg: anchor-based chunking in DistributeAffineForOnLinalgGe…
arpitj1 May 13, 2026
a6163ca
RaiseToLinalg: support non-constant upper bounds (Group B / syrk)
arpitj1 May 14, 2026
947b38a
RaiseToLinalg: relax distribute precondition with dep-based check (Gr…
arpitj1 May 14, 2026
9085454
RaiseToLinalg: privatize 0-D scratch alloca to enable distribution
arpitj1 May 14, 2026
0483a61
Add lower-polygeist-submap pass + e2e correctness harness
arpitj1 May 14, 2026
72c5ddd
Add gemm e2e test through linalg-debufferize
arpitj1 May 14, 2026
c70576a
Add multi-kernel e2e correctness harness
arpitj1 May 14, 2026
abff25f
Extend submap lowering: broadcasts, shift-aware iter bounds, debuf flow
arpitj1 May 14, 2026
cf9707e
Add egglog-based linalg.generic body matcher prototype
arpitj1 May 15, 2026
ad9278a
kernel_match: iter-dim canonicalization + composition matcher + libra…
arpitj1 May 15, 2026
bfdddc1
kernel_match_rewrite: CLI tool that emits MLIR with kernel.launch ops
arpitj1 May 15, 2026
97e625f
kernel_match: add 4 more 1-step templates (copy, axpby, fma3, sub-fro…
arpitj1 May 15, 2026
4d4db8d
kernel.launch lowering: Phase-1 roundtrip + Phase-2 canonical defn pass
arpitj1 May 15, 2026
4d5b6b8
Add PolyBench IR explorer with Compiler Explorer deep links
arpitj1 May 15, 2026
74ec1f0
Multi-root linalg-debufferize + tensor-form stencil matcher coverage
arpitj1 May 16, 2026
de72864
Extend IR explorer with MachSuite + NPB sections + sweep scripts
arpitj1 May 16, 2026
8b4c67f
Scaffold rank-1 row-scratch privatization (disabled in pipeline)
arpitj1 May 16, 2026
27ed6e9
IR explorer: algorithm-blocker taxonomy + per-kernel blocker column
arpitj1 May 16, 2026
45d9382
Phase-2 cuBLAS-ABI lowering: kernel.launch -> runtime shim func.call
arpitj1 May 23, 2026
49472aa
IR explorer: polybenchGpu + llama2.c + llm.c sections (+ rewriter fal…
arpitj1 May 23, 2026
77600a7
Phase-2 cuBLAS-ABI: fix memref ABI bug + cross-compile pipeline + Jet…
arpitj1 May 23, 2026
02279cc
cuBLAS-ABI: lower 4 more matcher symbols (gemm variants, geam-scale, …
arpitj1 May 23, 2026
bc6767c
conv2d → cuDNN: extracted kernel + matcher template + ABI lowering + …
arpitj1 May 23, 2026
81a9654
IR explorer: conv2d-extracted is FULL (cudnnConvolution2D_9tap match)
arpitj1 May 23, 2026
0efb3cc
IR explorer: fix broken CE link for conv2d-extracted / conv3d-extracted
arpitj1 May 23, 2026
edb9921
conv2d: surface body-internal weights as launch operands → generic cu…
arpitj1 May 23, 2026
2782c9c
conv2d: FP32 path — dtype-suffixed launch symbol + cuDNN f32 shim
arpitj1 May 23, 2026
502e59c
conv2d: FP16/BF16/INT32/INT16 paths — dtype-suffixed launch symbols +…
arpitj1 May 23, 2026
800fb58
conv2d: INT32/INT16 end-to-end on Jetson; encoder + rewriter fixes
arpitj1 May 23, 2026
f6e3f6f
conv2d INT32/INT16: remove host-fallback in i32 shim, fail fast at cuDNN
arpitj1 May 23, 2026
af2c50f
IR explorer: Phase 2 dtype matrix (conv2d f32/i32/i16) + new blocker …
arpitj1 May 24, 2026
bd1ef69
conv3d: match polybenchGpu's redundant-mul body via Python tuple-AST …
arpitj1 May 24, 2026
309907e
IR explorer: conv3d row reflects matcher success; partial-pipeline bl…
arpitj1 May 24, 2026
7aef419
matcher: support multi-yield linalg.generic in regex parser + Generic…
arpitj1 May 24, 2026
a7f229b
IR explorer: correct softmax/rmsnorm blocker notes after multi-yield …
arpitj1 May 24, 2026
1235c28
matcher: softmax composition entry; multi-yield encoder + template su…
arpitj1 May 24, 2026
a3ddbac
matcher: rmsnorm 2-step composition entry + scalar-arith capture types
arpitj1 May 24, 2026
a037a9b
IR explorer: softmax + rmsnorm now match (partial-pipeline); llmc sof…
arpitj1 May 24, 2026
4b20b77
polygeist_build.sh: unified driver — kernel.c in, optimized binary out
arpitj1 May 24, 2026
3e38cde
cgeist: better diagnostic before 'too many arguments in calls' assertion
arpitj1 May 24, 2026
992c8cd
gen_wrapper.py: parse plain C array signatures alongside POLYBENCH ma…
arpitj1 May 24, 2026
b09d12b
polygeist_build.sh: drop ship-to-Jetson hint output
arpitj1 May 24, 2026
2fe46c5
IR explorer: Jetson silicon runtimes per (kernel, dataset)
arpitj1 May 25, 2026
82109b6
cgeist: add --no-inline flag; use it in polybenchGpu bake
arpitj1 May 25, 2026
5bcdbfe
syrk: silicon-validated on Jetson Orin via cgeist --no-inline path
arpitj1 May 25, 2026
5911c2f
conv2d: polybenchGpu 9-tap stencil silicon-validated on Jetson Orin
arpitj1 May 25, 2026
6343d5f
lower-kernel-launch-to-cublas: add cublasDgemv + memset_zero_1D handlers
arpitj1 May 25, 2026
b8c0b8d
matcher+lowering: gemv transpose discriminator; gemver/gesummv shims
arpitj1 May 25, 2026
bc0aec5
gesummv/gemver: host-side daxpby + diagnostic note on aarch64 print_a…
arpitj1 May 25, 2026
a07c5c6
gesummv/atax/bicg now BIT-EXACT: fix gcc IPA + weak-symbol mismatch
arpitj1 May 25, 2026
290fae0
runtime: zero-copy on Jetson via cudaHostRegister (no more H↔D bounce)
arpitj1 May 25, 2026
c9bd2e1
runtime: persistent cudaHostRegister cache (no unregister on shim exit)
arpitj1 May 25, 2026
b316a54
explorer: notes column in Jetson runtimes; conv2d rerun + diagnosed
arpitj1 May 25, 2026
0d582e6
explorer: darknet full-source bake survey + new section
arpitj1 May 25, 2026
e24b98d
extracted-darknet + fusion optimizations: 9 CNN-block kernels end-to-…
arpitj1 May 25, 2026
a1961ce
PVA backend: lower kernel.launch to libpva_operator for int8/int16 co…
arpitj1 May 26, 2026
6363ac5
Make pipeline paths portable
arpitj1 May 29, 2026
6 changes: 6 additions & 0 deletions .gitignore
@@ -85,3 +85,9 @@ pythonenv*
# tmp output from tests
*.exec1
*.out1

# Local-environment-specific scripts (carry SSH hostnames, IPs, usernames
# for a particular dev machine + Jetson setup). Each developer has their
# own version of these.
scripts/correctness/run_jetson.sh
scripts/correctness/logs/
67 changes: 67 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,67 @@
# Polygeist - Claude Instructions

## Environment Setup

Source this before running any commands:
```bash
export POLYGEIST_ROOT=/path/to/Polygeist
source "$POLYGEIST_ROOT/envsetup.sh"
```
This adds `build/bin/` to PATH, making `cgeist` and `polygeist-opt` available.

## Build

Only `build_polygeist.sh` is needed (LLVM/MLIR/Clang are pre-built in `llvm-project/build`).

To rebuild after making changes to any pass:
```bash
cd "$POLYGEIST_ROOT/build" && ninja
```

## Raising Pipeline (C → Linalg)

```bash
# Step 1: C to affine MLIR
cgeist <file.c> --function='*' --resource-dir=/usr/lib/clang/14 --raise-scf-to-affine -fPIC -S -g -c -o output.mlir

# Step 2: Affine → Linalg (memref form)
polygeist-opt --select-func="func-name=<funcname>" --remove-iter-args --affine-parallelize --raise-affine-to-linalg-pipeline <input.mlir> -o <output_linalg.mlir>

# Step 3: Debufferize (memref linalg → tensor linalg)
polygeist-opt --linalg-debufferize <input_linalg.mlir> -o <output_debufferized.mlir>

# Step 4: Kernel extraction
polygeist-opt <input_debufferized.mlir> --linalg-to-kernel="kernel-library-path=$POLYGEIST_ROOT/generic_solver/kernel_library.mlir"
```
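
As an illustration of the kind of input Step 1 consumes, here is a minimal sketch of a kernel file. The file name `gemm_kernel.c`, the function name `gemm_kernel`, and the fixed sizes are assumptions made for this example, not files in the repository. The loop accumulates directly into `C[i][j]` instead of a scalar temporary, which appears to be what the commit history calls the "non iter-arg variant" of matrix multiplication. Whether a given build raises it cleanly is not verified here.

```c
#include <assert.h>
#include <stddef.h>

#define NI 4
#define NJ 3
#define NK 2

/* Hypothetical example kernel: C = C + A * B on fixed-size arrays. */
void gemm_kernel(double C[NI][NJ], double A[NI][NK], double B[NK][NJ]) {
    assert(C != NULL && A != NULL && B != NULL);
    for (int i = 0; i < NI; i++) {
        for (int j = 0; j < NJ; j++) {
            for (int k = 0; k < NK; k++) {
                /* Accumulate in memory, no loop-carried scalar. */
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

With that file saved as `gemm_kernel.c`, `<file.c>` in Step 1 would be `gemm_kernel.c` and `<funcname>` in Step 2 would be `gemm_kernel`.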

## Key Source Files

- `lib/polygeist/Passes/RaiseToLinalg.cpp` — raises `affine.for` loops to `linalg.generic`, creates `polygeist.submap` for strided accesses
- `lib/polygeist/Passes/LinalgDebufferize.cpp` — converts memref-based linalg to tensor-based SSA form
- `include/polygeist/PolygeistOps.td` — defines `polygeist.submap` and `polygeist.submapInverse`

## NVIDIA gated-distribution SDKs — point, don't copy

The directory `$PVASOL_ROOT` is the source tree for the PVA
Solutions SDK. The PVA Solutions public `.deb` packages ship binaries only
(`libpva_operator.so`, `libnvcv_types.so`, allowlist file) — *no headers*.
Headers exist only inside the source tree, which NVIDIA distributes to
approved developers through `developer.nvidia.com/embedded/pva`. The headers
are therefore "behind a developer-program gate," not "secret internal-only";
they're the same files any approved external developer would have.

*Rule for using these headers in Polygeist:*

- *Build-time include path is fine.* Add `-I$PVASOL_ROOT/public/src/operator/include`
(and the same pattern for NVCV / cuPVA / CV-CUDA headers under `public/3rdparty/`)
to the cross-compile flags in our build scripts.
- *Never copy headers into the Polygeist tree.* No `cp` / `git add` of any
`.h` / `.hpp` / `.cpp` / `.c` from `$PVASOL_ROOT` into
`$POLYGEIST_ROOT`. The Polygeist repo only ever references those
paths symbolically.
- *Polygeist source code may `#include "OpConv2d.h"` etc.* — the include is
resolved through the `-I` flag at build time, just like cuDNN's `cudnn.h`.
- *Anyone cloning Polygeist without PVA Solutions access gets a clean build
failure* — same as the cuDNN dependency on the cross-compile path today.
- *Same policy applies* to any other gated-distribution NVIDIA SDK source
tree on this VM (cuPVA SDK, internal NVCV builds, etc.).
74 changes: 74 additions & 0 deletions blas/dasum.c
@@ -0,0 +1,74 @@
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// DASUM: Sum of absolute values
// result = sum(|x[i]|)
// x: vector of length N with stride incx
double dasum(int N, const double* x, int incx) {
    double result = 0.0;

    for (int i = 0; i < N; i++) {
        result += fabs(x[i * incx]);
    }

    return result;
}

// Simple version (stride = 1)
double simple_dasum(int N, const double* x) {
    double result = 0.0;

    for (int i = 0; i < N; i++) {
        result += fabs(x[i]);
    }

    return result;
}

// Single precision version
float sasum(int N, const float* x, int incx) {
    float result = 0.0f;

    for (int i = 0; i < N; i++) {
        result += fabsf(x[i * incx]);
    }

    return result;
}

void print_vector(const double* x, int N, const char* name) {
    printf("%s: [", name);
    for (int i = 0; i < N; i++) {
        printf("%.1f", x[i]);
        if (i < N - 1) printf(", ");
    }
    printf("]\n");
}

int main() {
    const int N = 6;

    double x[] = {1.0, -2.0, 3.0, -4.0, 5.0, -6.0};

    printf("ASUM Test: sum of absolute values\n");
    print_vector(x, N, "x");

    double result = simple_dasum(N, x);

    printf("\nasum(x) = %.1f\n", result);

    printf("\nManual verification:\n");
    printf("|1.0| + |-2.0| + |3.0| + |-4.0| + |5.0| + |-6.0|\n");
    printf("= 1.0 + 2.0 + 3.0 + 4.0 + 5.0 + 6.0\n");
    printf("= 21.0\n");

    // Test with stride
    printf("\n\nTesting with stride=2 (every other element):\n");
    double result_stride = dasum(3, x, 2);
    printf("asum(x[::2]) = %.1f\n", result_stride);
    printf("Manual: |%.1f| + |%.1f| + |%.1f| = %.1f\n",
           x[0], x[2], x[4], fabs(x[0]) + fabs(x[2]) + fabs(x[4]));

    return 0;
}
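
Review note on `blas/dasum.c`: the `dasum` above indexes `x[i * incx]` for any `incx`, so a negative stride would read before the start of the array. Reference BLAS DASUM avoids this by returning 0 when `N <= 0` or `incx <= 0`. A minimal sketch of that guard follows; the name `dasum_guarded` is hypothetical and not part of this PR.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical variant of dasum with the reference-BLAS early exit. */
double dasum_guarded(int N, const double* x, int incx) {
    if (N <= 0 || incx <= 0) {
        return 0.0;  /* nothing to sum, as in reference DASUM */
    }
    assert(x != NULL);

    double result = 0.0;
    for (int i = 0; i < N; i++) {
        /* size_t arithmetic avoids int overflow for large N * incx */
        result += fabs(x[(size_t)i * (size_t)incx]);
    }
    return result;
}
```

For positive strides this behaves like `dasum`, so the existing stride-2 test in `main` would give the same sum.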
78 changes: 78 additions & 0 deletions blas/daxpy.c
@@ -0,0 +1,78 @@
#include <stdio.h>
#include <stdlib.h>

// DAXPY: Constant times a vector plus a vector
// y = alpha * x + y
// x: vector of length N with stride incx
// y: vector of length N with stride incy (modified in place)
// alpha: scaling factor
void daxpy(int N, double alpha, const double* x, int incx, double* y, int incy) {
    for (int i = 0; i < N; i++) {
        y[i * incy] += alpha * x[i * incx];
    }
}

// Simple version (stride = 1)
void simple_daxpy(int N, double alpha, const double* x, double* y) {
    for (int i = 0; i < N; i++) {
        y[i] += alpha * x[i];
    }
}

// Single precision version
void saxpy(int N, float alpha, const float* x, int incx, float* y, int incy) {
    for (int i = 0; i < N; i++) {
        y[i * incy] += alpha * x[i * incx];
    }
}

void print_vector(const double* x, int N, const char* name) {
    printf("%s: [", name);
    for (int i = 0; i < N; i++) {
        printf("%.2f", x[i]);
        if (i < N - 1) printf(", ");
    }
    printf("]\n");
}

int main() {
    const int N = 5;
    const double alpha = 2.0;

    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {10.0, 20.0, 30.0, 40.0, 50.0};

    printf("AXPY Test: y = alpha * x + y\n");
    printf("alpha = %.2f\n", alpha);
    print_vector(x, N, "x");
    print_vector(y, N, "y (before)");

    // Apply axpy
    simple_daxpy(N, alpha, x, y);

    print_vector(y, N, "y (after)");

    printf("\nManual verification:\n");
    printf("y[0] = 2.0*1.0 + 10.0 = 12.00\n");
    printf("y[1] = 2.0*2.0 + 20.0 = 24.00\n");
    printf("y[2] = 2.0*3.0 + 30.0 = 36.00\n");
    printf("y[3] = 2.0*4.0 + 40.0 = 48.00\n");
    printf("y[4] = 2.0*5.0 + 50.0 = 60.00\n");

    // Test with stride
    printf("\n\nTesting with stride=2:\n");
    double x2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    double y2[] = {100.0, 200.0, 300.0, 400.0, 500.0, 600.0};

    printf("x: [1, 2, 3, 4, 5, 6]\n");
    printf("y (before): [100, 200, 300, 400, 500, 600]\n");
    printf("Computing: y[::2] += 10.0 * x[::2]\n");

    daxpy(3, 10.0, x2, 2, y2, 2); // y[0,2,4] += 10*x[0,2,4]

    printf("y (after): [%.1f, %.1f, %.1f, %.1f, %.1f, %.1f]\n",
           y2[0], y2[1], y2[2], y2[3], y2[4], y2[5]);
    printf("Expected: [110.0, 200.0, 330.0, 400.0, 550.0, 600.0]\n");

    return 0;
}
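
Review note on `blas/daxpy.c`: the `daxpy` above assumes non-negative strides. In reference BLAS, a negative increment means the vector is traversed backwards, starting at offset `(1 - N) * inc` (0-based), and the routine returns early when `N <= 0` or `alpha == 0`. The sketch below shows that convention under the hypothetical name `daxpy_ref`. It is an illustration, not the implementation in this PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of daxpy following reference-BLAS stride rules. */
void daxpy_ref(int N, double alpha, const double* x, int incx,
               double* y, int incy) {
    if (N <= 0 || alpha == 0.0) {
        return;  /* reference DAXPY leaves y untouched in these cases */
    }
    assert(x != NULL && y != NULL);

    /* A negative increment starts at the far end and walks backwards. */
    long ix = (incx < 0) ? (long)(1 - N) * incx : 0;
    long iy = (incy < 0) ? (long)(1 - N) * incy : 0;

    for (int i = 0; i < N; i++) {
        y[iy] += alpha * x[ix];
        ix += incx;
        iy += incy;
    }
}
```

For example, `daxpy_ref(3, 2.0, x, -1, y, 1)` pairs `y[0]` with `x[2]`, `y[1]` with `x[1]`, and `y[2]` with `x[0]`.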
76 changes: 76 additions & 0 deletions blas/dcopy.c
@@ -0,0 +1,76 @@
#include <stdio.h>
#include <stdlib.h>

// DCOPY: Copy vector x to vector y
// y = x
// x: source vector of length N with stride incx
// y: destination vector of length N with stride incy
void dcopy(int N, const double* x, int incx, double* y, int incy) {
    for (int i = 0; i < N; i++) {
        y[i * incy] = x[i * incx];
    }
}

// Simple version (stride = 1)
void simple_dcopy(int N, const double* x, double* y) {
    for (int i = 0; i < N; i++) {
        y[i] = x[i];
    }
}

// Single precision version
void scopy(int N, const float* x, int incx, float* y, int incy) {
    for (int i = 0; i < N; i++) {
        y[i * incy] = x[i * incx];
    }
}

void print_vector(const double* x, int N, const char* name) {
    printf("%s: [", name);
    for (int i = 0; i < N; i++) {
        printf("%.1f", x[i]);
        if (i < N - 1) printf(", ");
    }
    printf("]\n");
}

int main() {
    const int N = 5;

    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[5] = {0.0, 0.0, 0.0, 0.0, 0.0};

    printf("COPY Test\n");
    print_vector(x, N, "x (source)");
    print_vector(y, N, "y (before)");

    // Copy x to y
    simple_dcopy(N, x, y);

    print_vector(y, N, "y (after)");

    // Verify
    printf("\nVerification: ");
    int correct = 1;
    for (int i = 0; i < N; i++) {
        if (x[i] != y[i]) {
            correct = 0;
            break;
        }
    }
    printf("%s\n", correct ? "PASS" : "FAIL");

    // Test with stride
    printf("\n\nTesting with stride:\n");
    double src[] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0};
    double dst[6] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

    printf("Source: [10, 20, 30, 40, 50, 60]\n");
    printf("Copying every other element (incx=2) to every position (incy=1):\n");
    dcopy(3, src, 2, dst, 1); // Copy src[0,2,4] to dst[0,1,2]
    printf("Result: [%.1f, %.1f, %.1f, %.1f, %.1f, %.1f]\n",
           dst[0], dst[1], dst[2], dst[3], dst[4], dst[5]);
    printf("Expected: [10.0, 30.0, 50.0, 0.0, 0.0, 0.0]\n");

    return 0;
}
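
Review note on `blas/dcopy.c`: a strided vector of `N` entries with stride `inc` touches indices `0, inc, ..., (N - 1) * inc`, so the backing array needs at least `1 + (N - 1) * |inc|` elements. The stride test above satisfies this (`1 + 2 * 2 = 5 <= 6`). The sketch below makes that arithmetic explicit. `strided_extent` and `dcopy_checked` are hypothetical helpers introduced for illustration, and the wrapper deliberately handles positive strides only.

```c
#include <assert.h>
#include <stddef.h>

/* Number of array elements spanned by N entries with stride inc. */
size_t strided_extent(int N, int inc) {
    if (N <= 0) return 0;
    size_t step = (size_t)(inc < 0 ? -(long)inc : (long)inc);
    return 1 + (size_t)(N - 1) * step;
}

/* Hypothetical checked copy: returns 0 on success, -1 if a buffer is too small. */
int dcopy_checked(int N, const double* x, int incx, size_t x_len,
                  double* y, int incy, size_t y_len) {
    assert(x != NULL && y != NULL);
    if (incx <= 0 || incy <= 0) return -1;  /* sketch covers positive strides only */
    if (strided_extent(N, incx) > x_len) return -1;
    if (strided_extent(N, incy) > y_len) return -1;

    for (int i = 0; i < N; i++) {
        y[i * incy] = x[i * incx];
    }
    return 0;
}
```

Passing the lengths explicitly is one possible design. The BLAS interface itself carries no length information, so a check like this would have to live in the caller or a test harness.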
79 changes: 79 additions & 0 deletions blas/ddot.c
@@ -0,0 +1,79 @@
#include <stdio.h>
#include <stdlib.h>

// DDOT: Compute dot product of two vectors
// result = sum(x[i] * y[i])
// x: vector of length N with stride incx
// y: vector of length N with stride incy
double ddot(int N, const double* x, int incx, const double* y, int incy) {
    double result = 0.0;

    for (int i = 0; i < N; i++) {
        result += x[i * incx] * y[i * incy];
    }

    return result;
}

// Simple version (stride = 1)
double simple_ddot(int N, const double* x, const double* y) {
    double result = 0.0;

    for (int i = 0; i < N; i++) {
        result += x[i] * y[i];
    }

    return result;
}

// Single precision version
float sdot(int N, const float* x, int incx, const float* y, int incy) {
    float result = 0.0f;

    for (int i = 0; i < N; i++) {
        result += x[i * incx] * y[i * incy];
    }

    return result;
}

int main() {
    const int N = 5;
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {2.0, 3.0, 4.0, 5.0, 6.0};

    printf("DOT Product Test\n");
    printf("x: [");
    for (int i = 0; i < N; i++) {
        printf("%.1f ", x[i]);
    }
    printf("]\n");

    printf("y: [");
    for (int i = 0; i < N; i++) {
        printf("%.1f ", y[i]);
    }
    printf("]\n\n");

    // Test simple version
    double result = simple_ddot(N, x, y);
    printf("dot(x, y) = %.1f\n", result);

    // Manual verification
    double manual = 0.0;
    for (int i = 0; i < N; i++) {
        manual += x[i] * y[i];
        printf(" %.1f * %.1f = %.1f\n", x[i], y[i], x[i] * y[i]);
    }
    printf("Expected: %.1f, Actual: %.1f\n\n", manual, result);

    // Test with stride
    printf("Testing with stride=2 (every other element):\n");
    double result_stride = ddot(3, x, 2, y, 2);
    printf("dot(x[::2], y[::2]) = %.1f\n", result_stride);
    printf("Manual: %.1f*%.1f + %.1f*%.1f + %.1f*%.1f = %.1f\n",
           x[0], y[0], x[2], y[2], x[4], y[4],
           x[0]*y[0] + x[2]*y[2] + x[4]*y[4]);

    return 0;
}
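
Review note on `blas/ddot.c`: `sdot` above accumulates in `float`, so rounding error grows with `N`. BLAS also defines DSDOT, which takes single-precision inputs but accumulates and returns in double precision. A minimal sketch of that idea follows. The name `dsdot_sketch` is hypothetical, and like the functions in this file it assumes non-negative strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical DSDOT-style dot product: float inputs, double accumulator. */
double dsdot_sketch(int N, const float* x, int incx,
                    const float* y, int incy) {
    assert(x != NULL && y != NULL);
    double result = 0.0;

    for (int i = 0; i < N; i++) {
        /* Widen each operand before multiplying so the product is exact. */
        result += (double)x[i * incx] * (double)y[i * incy];
    }

    return result;
}
```

The product of two `float` values is exactly representable in `double`, so the only rounding left is in the running sum.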