5 changes: 4 additions & 1 deletion .gitignore
@@ -1,2 +1,5 @@
.env
training_data/
training_data/
your_code_base.txt
your_code_base.pdf
myenv/
24 changes: 19 additions & 5 deletions README.md
@@ -1,31 +1,45 @@
# git2txt

Convert all files in git repository to .txt files. This is useful for training LLMs on your codebase.
Converts all the files of a git repository into .txt files. It also generates a single .txt & .pdf file containing the whole code base. This is useful for training LLMs on your codebase.

## How to Use

1. Create new .env file by copying example.env

```shell
cp example.env .env
```

2. Add necessary fields. The default fields are good to start with.

```bash
GIT_PROJECT_DIRECTORY=/path/to/git/repo
GIT_PROJECT_DIRECTORY=/path/to/git/repo (ex. C:\Users\MyUserName\Codebases\GitHub\my-project-name)
IGNORE_FILES=.env,package-lock.json
IGNORE_DIRS=.git,.vscode,node_modules
SAVE_DIRECTORY=training_data
SKIP_EMPTY_FILES=true

SOURCE_DIR=training_data
OUTPUT_FILE=your_code_base.txt
PDF_OUTPUT=your_code_base.pdf
```

3. Install dependencies. Using a virtual environment is recommended.

```shell
python -m pip install -r requirements.txt
```
4. Run program

4. In the "is_text_file" function, you MUST add the extensions of the files you want to convert.
Owner commented:

Following my comment about this, change this text


5. Run program

```shell
python main.py
```
5. You'll see your data files in the ```training_data/``` directory. This will be different if you changed the path via ```SAVE_DIRECTORY``` in ```.env``` file.

6. You'll see your data files in the `training_data/` directory. This will be different if you changed the path via `SAVE_DIRECTORY` in `.env` file.

## Notes
- This program requires Python version 3.6 or later. It uses the f-string formatting technique introduced in Python 3.6.

- This program requires Python version 3.6 or later. It uses the f-string formatting technique introduced in Python 3.6.
8 changes: 6 additions & 2 deletions example.env
@@ -1,5 +1,9 @@
GIT_PROJECT_DIRECTORY=
GIT_PROJECT_DIRECTORY=C:\Users\jimzord12\Codebases\GitHub\serve-tech
Owner commented:

Please remove your local machine path

The user should be able to copy this file to .env and change as little as possible. Providing your local machine path adds no clarity or value.

IGNORE_FILES=.env,package-lock.json
IGNORE_DIRS=.git,.vscode,node_modules
SAVE_DIRECTORY=training_data
SKIP_EMPTY_FILES=true
SKIP_EMPTY_FILES=true

SOURCE_DIR=training_data
OUTPUT_FILE=your_code_base.txt
PDF_OUTPUT=your_code_base.pdf
66 changes: 61 additions & 5 deletions main.py
@@ -2,9 +2,53 @@
import os
import hashlib
import sys
load_env()
Owner commented:

Add this back to the file

from reportlab.pdfgen import canvas


load_env(env_path=r'.\example.env')

def is_text_file(file_path):
    text_file_extensions = ['.txt', '.md', '.go', '.py', '.java', '.html', '.css', '.js', '.mod', '.sum']  # Add more as needed
Owner commented:

Since this is configurable, add it to the example.env file.

In example.env:

FILE_EXTENSIONS=".txt,.md,.go,.py,.java,.html,.css,.js,.mod,.sum"

In this file:

text_file_extensions = os.environ['FILE_EXTENSIONS'].split(',')
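
One way this suggestion could look, as a minimal sketch rather than the PR's code. It assumes `FILE_EXTENSIONS` holds a comma-separated list as shown above. The helper name `load_extensions` and the fallback default are hypothetical. The sketch also strips surrounding quotes and whitespace, because it is not certain how the dotenv loader treats a quoted value.

```python
import os

# Hypothetical fallback, used only when FILE_EXTENSIONS is not set.
DEFAULT_EXTENSIONS = ".txt,.md,.py"


def load_extensions(env=os.environ):
    """Parse FILE_EXTENSIONS into a tuple of lowercase extensions."""
    raw = env.get("FILE_EXTENSIONS", DEFAULT_EXTENSIONS)
    # Drop optional surrounding quotes, then split on commas.
    raw = raw.strip().strip("\"'")
    return tuple(part.strip().lower() for part in raw.split(",") if part.strip())


def is_text_file(file_path, extensions=None):
    """Return True if file_path ends with one of the configured extensions."""
    if extensions is None:
        extensions = load_extensions()
    # str.endswith accepts a tuple of suffixes.
    return file_path.lower().endswith(extensions)
```

Using `env.get` with a default avoids the `KeyError` that `os.environ['FILE_EXTENSIONS']` raises when the variable is missing.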

    return any(file_path.lower().endswith(ext) for ext in text_file_extensions)

def combine_txt_files_and_create_pdf(source_directory, output_file, pdf_output, separator='**'):
    separator_line = separator * 40 + '\n'

    # Initialize a list to store combined text
    combined_text = []

    with open(output_file, 'w', encoding='utf-8') as outfile:
        for root, dirs, files in os.walk(source_directory):
            for filename in files:
                if filename.endswith('.txt'):
Owner commented:

You've defined the is_text_file() function above, which accepts various file extensions, but you've hardcoded .txt here. Please make this dynamic, either by introducing a similar function that checks the extension or by using the built-in `in` operator with a list.

Example with the `in` operator:

filename = os.path.basename(filename)  # example.txt
file_extension = os.path.splitext(filename)[1]  # .txt, .md, etc.
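
A minimal sketch of the `in`-based check, assuming a hypothetical `ALLOWED_EXTENSIONS` set and a hypothetical helper name. It relies on `os.path.splitext`, which returns the extension with its leading dot and copes with names that contain several dots.

```python
import os

# Hypothetical set of extensions; in the PR this would come from configuration.
ALLOWED_EXTENSIONS = {".txt", ".md", ".py"}


def has_allowed_extension(filename, allowed=ALLOWED_EXTENSIONS):
    """Check the file's extension against a collection with the `in` operator."""
    # splitext("archive.tar.gz") -> ("archive.tar", ".gz")
    extension = os.path.splitext(filename)[1].lower()
    return extension in allowed
```

Files without an extension yield an empty string from `splitext`, so they are rejected unless `""` is added to the set.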

                    file_path = os.path.join(root, filename)
                    with open(file_path, 'r', encoding='utf-8') as infile:
                        content = infile.read()
                        combined_text.append(separator_line)
                        combined_text.append(f"{filename.center(len(separator_line))}\n")
                        combined_text.append(separator_line)
                        combined_text.append(content + '\n')
Owner commented:

What if content already has a newline character? This will not work well with what you've written on line 41:

for subline in line.split('\n')
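
To illustrate the concern: when the file content already ends with a newline, appending another one and then splitting on the newline character produces trailing empty strings, which would show up as blank lines in the PDF. `str.splitlines()` is one possible alternative. The helper name `to_pdf_lines` is hypothetical.

```python
def to_pdf_lines(content):
    """Split file content into lines without trailing empty entries."""
    # splitlines() treats \n, \r\n and a final newline uniformly.
    return content.splitlines()


content = "first\nsecond\n"
naive = (content + "\n").split("\n")  # mirrors the append-then-split approach
clean = to_pdf_lines(content)
```

Here `naive` ends with two empty strings, while `clean` contains only the two real lines.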

                        combined_text.append(separator_line)

                        # Write to the TXT file
                        outfile.writelines([separator_line, f"{filename.center(len(separator_line))}\n", separator_line, content + '\n', separator_line])

    # Write to the PDF file
    c = canvas.Canvas(pdf_output)
    text = c.beginText(40, 800)  # Starting position
    for line in combined_text:
        # Split the combined text into lines
        for subline in line.split('\n'):
Owner commented:

Would be best to split by the separator line you've added on lines 27 and 30. Splitting by newline character when reading content from a file can give unexpected results.
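
A rough sketch of what splitting by the separator line might look like, assuming the combined text follows the layout the PR writes (separator, centered filename, separator, content, separator). The function name `split_sections` is hypothetical, and the approach breaks if a file's own content contains the separator line.

```python
SEPARATOR_LINE = "**" * 40 + "\n"  # same value the PR builds from separator='**'


def split_sections(combined, separator_line=SEPARATOR_LINE):
    """Split combined text into (filename, content) pairs using the separator line."""
    # Non-empty chunks between separators alternate: filename header, file content.
    chunks = [chunk for chunk in combined.split(separator_line) if chunk]
    pairs = []
    for header, body in zip(chunks[0::2], chunks[1::2]):
        pairs.append((header.strip(), body))
    return pairs
```

Each pair keeps a file's content intact, so line handling for the PDF can be decided per file rather than per list entry.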

            text.textLine(subline.strip())
            if text.getY() < 40:  # Move to a new page if there's no space
                c.drawText(text)
                c.showPage()
                text = c.beginText(40, 800)
    c.drawText(text)
    c.save()

    print(f'All text files have been combined into {output_file} and {pdf_output}')

def ignore_dir(file_path: str) -> bool:
    for _dir in IGNORE_DIRS:
        if _dir in file_path:
@@ -25,9 +69,9 @@ def get_file_path() -> None:

def write_txt(txt_data: str, file_name: str, md5_hash: str) -> None:
    full_path = os.path.join(save_directory, file_name + f'_{md5_hash}.txt')
    with open(full_path, mode='w') as data:
    with open(full_path, mode='w', encoding='utf-8') as data:
Owner commented:

Great addition. Always good to enforce encodings.
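
As a small illustration of why this matters, the sketch below writes and reads a file with the same explicit encoding, so non-ASCII text survives the round trip regardless of the platform default. The helper name and the temporary file name are hypothetical.

```python
import os
import tempfile


def write_and_read_back(text):
    """Write text with an explicit encoding and read it back the same way."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt")
        # Without encoding=..., open() may fall back to a locale-dependent default.
        with open(path, mode="w", encoding="utf-8") as handle:
            handle.write(text)
        with open(path, mode="r", encoding="utf-8") as handle:
            return handle.read()
```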

        data.write(txt_data)
    print(f'TXT written to: {full_path}')
    print(f'TXT written to: {full_path}\n')


def main() -> None:
@@ -42,9 +86,15 @@ def main() -> None:
    print('Creating TXT...')
    for index, file in enumerate(FILES):
        print(f'File #{index+1}: {file}')
        # If line is empty, skip it

        #if file is not a text file, skip it
        if not is_text_file(file):
            print(f'Skipping: [{os.path.basename(file)}] a (probably) non-text file.\n')
            continue

        # If file is empty, skip it
        if os.environ.get('SKIP_EMPTY_FILES').upper() == 'TRUE' and os.path.getsize(file) == 0:
            print('FILE IS EMPTY. SKIPPING.')
            print('FILE IS EMPTY. SKIPPING.\n')
            continue
        with open(file, mode='r', encoding='utf-8') as git_file:
            md5_hash = hashlib.md5(git_file.read().encode('utf-8')).hexdigest()
@@ -68,3 +118,9 @@ def main() -> None:
os.makedirs(save_directory, exist_ok=True)
main()
print(f'Training data can be found in {save_directory}/ directory.')

# My Code
source_dir = os.environ.get('SOURCE_DIR') # Change this to your source directory
output_file = os.environ.get('OUTPUT_FILE') # The final combined text file
pdf_output = os.environ.get('PDF_OUTPUT') # The final PDF file
combine_txt_files_and_create_pdf(source_dir, output_file, pdf_output)
7 changes: 6 additions & 1 deletion requirements.txt
@@ -1 +1,6 @@
pydotenvs==0.2.0
Owner commented:

pydotenvs should not be removed. Please reintroduce.

chardet==5.2.0
click==8.1.7
colorama==0.4.6
pillow==10.3.0
pydotenvs==0.2.0
reportlab==4.1.0