Make upload MIME validation robust when python-magic is unavailable #340
arpit-k16 wants to merge 4 commits into RohanExploit:main
Conversation
👷 Deploy request for fixmybharat pending review. Visit the deploys page to approve it.
🙏 Thank you for your contribution, @arpit-k16!
Note: The maintainers will monitor code quality and ensure the overall project flow isn't broken.
📝 Walkthrough
Adds an upload validation and detection pipeline in `backend/main.py`.

Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Server
participant Magic as python-magic
participant Fallback as Header/Guess
participant PIL
participant DB
participant Storage
Client->>Server: POST /upload (file)
Server->>Server: check_upload_limits(identifier)
Server->>Magic: detect MIME (optional)
alt magic available
Magic-->>Server: mime/type
else
Server->>Fallback: use Content-Type or mimetypes.guess_type
Fallback-->>Server: mime/type or None
end
Server->>Server: enforce size & ALLOWED_MIME_TYPES
alt MIME is image
Server->>PIL: open and validate image bytes
PIL-->>Server: valid + dimensions
alt dimensions > MAX_DIMENSION
Server->>PIL: resize to MAX_DIMENSION
PIL-->>Server: resized bytes
Server->>Storage: update file bytes
else
Server->>Storage: store original bytes
end
Server->>Server: process_and_detect -> detection_func
Server->>DB: (startup tables created) log/store results
else
Server-->>Client: HTTPException (invalid type/size)
end
    Server-->>Client: response (success/error)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 checks passed
👋 Hello @arpit-k16! Thank you for your first Pull Request to VishwaGuru! 🎉
We appreciate your contribution to this open source platform empowering India's youth to engage with democracy.
What happens next?
- A maintainer will review your PR soon
- Please ensure all tests pass and the code follows our style guidelines
- Be ready to address any feedback or requested changes
Before your PR is merged:
- Ensure your code follows the project's coding standards
- All tests should pass
- Update documentation if needed
- Make sure your commits are clear and descriptive
Thank you for contributing! 🚀
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@backend/main.py`:
- Around line 217-224: The code sets file.size after rewinding the buffer so it
becomes 0; fix by measuring the buffer length before rewinding (or by seeking to
the end then using tell) and then assign that value to file.size — e.g., after
img.save(...) and before output.seek(0) capture size = output.tell() or
output.seek(0, io.SEEK_END); size = output.tell(); then set file.file = output,
file.size = size and finally rewind the stream (output.seek(0)); update the
block that manipulates output, img.save, and file to use this order.
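A minimal sketch of the ordering the prompt describes, using the prompt's own names (`img`, `output`, `file`; the surrounding handler is assumed, not shown in this thread):

```python
import io

output = io.BytesIO()
img.save(output, format="JPEG", quality=85)  # format choice is discussed in a later comment

size = output.tell()   # position after writing == number of bytes written
output.seek(0)         # rewind only after capturing the size

file.file = output     # swap in the resized bytes
file.size = size       # correct size, instead of 0 from a rewound stream
```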
🧹 Nitpick comments (2)
backend/main.py (2)
184-186: Document the security trade-off when using fallback MIME detection.
When `python-magic` is unavailable, the code falls back to `file.content_type` (client-provided, spoofable) and `mimetypes.guess_type` (extension-based, also spoofable). While the subsequent PIL validation (lines 201-204) provides a secondary security layer that ensures the file is actually a valid image, the MIME-type check alone becomes bypassable. Consider adding a comment documenting this security trade-off, or logging a more specific warning when fallback is used for a particular upload (not just at startup).
💡 Optional: Add per-upload warning when using fallback
```diff
 if not detected_mime:
     # Fallback: trust FastAPI's content_type header or guess from filename
     detected_mime = file.content_type or mimetypes.guess_type(file.filename or "")[0]
+    if detected_mime:
+        logger.debug(
+            f"MIME type for {file.filename} detected via fallback (content_type/extension): {detected_mime}. "
+            "Content will be validated by PIL."
+        )
```
226-231: Use `logger.exception` for automatic traceback inclusion.
`logger.exception` automatically includes the exception traceback, which is more informative than `logger.error` with manual formatting. The broad exception catch is acceptable here since PIL can raise various exception types for malformed images.
♻️ Proposed fix
```diff
 except Exception as pil_error:
-    logger.error(f"PIL validation failed for {file.filename}: {pil_error}")
+    logger.exception(f"PIL validation failed for {file.filename}: {pil_error}")
     raise HTTPException(
         status_code=400,
         detail="Invalid image file. The file appears to be corrupted or not a valid image."
     )
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/main.py (1)
424-428: ⚠️ Potential issue | 🔴 Critical
Add `await` to all `validate_uploaded_file()` calls; security validation is completely skipped without it.
`validate_uploaded_file` is async. Calling it without `await` in lines 427, 582, 606, 632, 658, and 684 creates a coroutine object that never executes, allowing unvalidated files through security checks entirely.
🛠️ Proposed fixes
Line 427:
```diff
 if image:
-    validate_uploaded_file(image)
+    await validate_uploaded_file(image)
```
Lines 582, 606, 632, 658, and 684 (same one-line change at each call site):
```diff
-validate_uploaded_file(image)
+await validate_uploaded_file(image)
```
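For illustration, a self-contained reproduction of the failure mode (a simplified stand-in, not the project's actual validator): calling an async function without `await` creates a coroutine object and discards it, so its body (including any exception that would reject the upload) never runs.

```python
import asyncio

async def validate_uploaded_file(image) -> None:
    raise ValueError("rejected")  # stands in for the real validation failure

async def handler() -> None:
    validate_uploaded_file(None)        # coroutine created and discarded: no error raised
    await validate_uploaded_file(None)  # actually executes and raises

asyncio.run(handler())  # only the awaited call raises (plus a "never awaited" RuntimeWarning)
```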
🤖 Fix all issues with AI agents
In `@backend/main.py`:
- Around line 181-208: The image validation/resizing block advances the file
pointer (Image.open(file.file) and img.verify()) but only rewinds when a resize
is performed; add a file.file.seek(0) after the PIL validation/resizing logic so
the pointer is always reset before returning (ensure this happens whether or not
the resize branch ran); update the code around Image.open(file.file) /
img.verify() and the replacement of file.file (and after recording resized_size)
so process_and_detect() and any subsequent Image.open(image.file) read from the
start of the stream.
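A sketch of the always-rewind shape this prompt asks for (variable names follow the prompt; `needs_resize` is a hypothetical stand-in for the real resize condition):

```python
img = Image.open(file.file)
img.verify()           # consumes the underlying stream while checking structure

if needs_resize:       # hypothetical flag; the real branch checks dimensions
    ...                # resize and replace file.file with a fresh BytesIO

file.file.seek(0)      # always rewind, whether or not the resize branch ran,
                       # so process_and_detect() reads from byte 0
```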
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@backend/main.py`:
- Around line 1-11: The file is missing imports used by check_upload_limits and
_validate_uploaded_file_sync and also duplicates the httpx import; add "import
io" and "from datetime import datetime, timedelta" at the top of backend/main.py
so io.BytesIO(), datetime.now(), and timedelta(...) are defined, and remove the
duplicate httpx import (keep a single import). Update any import grouping to
avoid Ruff F811 redefinition errors and ensure check_upload_limits and
_validate_uploaded_file_sync can reference these symbols.
- Around line 104-123: The check_upload_limits function performs a non-atomic
get→filter→check→append→set on user_upload_cache which permits a TOCTOU race;
fix it by making the read-modify-write atomic for the given identifier—either
implement a per-identifier threading.Lock (e.g., a dict of locks keyed by
identifier) and wrap the body that reads current_uploads, filters old
timestamps, checks length, appends now, and calls user_upload_cache.set inside
that lock, or replace the list-based approach with an atomic counter in a remote
store (e.g., Redis INCR with TTL) and update check_upload_limits to use that
atomic increment/decrement TTL semantics instead of get/set on
user_upload_cache. Ensure you reference and update the logic inside
check_upload_limits and use user_upload_cache.get/set or the new atomic store
consistently.
- Around line 252-253: Remove the unguarded module-level call to
Base.metadata.create_all(bind=engine) so table creation does not run at import
time; the same operation with error handling already exists in lifespan(), so
delete the Base.metadata.create_all line (and any direct uses of engine for
schema creation at module scope) to avoid duplicate/unguarded startup DB access
and rely on the existing lifecycle-managed creation inside lifespan().
- Around line 235-242: The call to validate_image_for_processing(pil_image) in
the async block is missing an import; update backend/main.py to import
validate_image_for_processing from backend.pothole_detection (same symbol used
in backend/routers/detection.py) so the
run_in_threadpool(validate_image_for_processing, pil_image) call resolves, or
alternatively remove that call if you decide the existing synchronous PIL checks
in _validate_uploaded_file_sync are sufficient—ensure the chosen approach keeps
run_in_threadpool(Image.open, image.file) and the surrounding HTTPException
handling intact.
- Around line 57-59: Remove the duplicate import lines that re-import
EXCEPTION_HANDLERS, Base, engine, get_db, SessionLocal, and Issue; locate the
redundant imports (the second occurrences of EXCEPTION_HANDLERS and the group
importing Base, engine, get_db, SessionLocal, Issue) and delete them so each
symbol is imported only once (keep the original imports where they first appear
and remove the later duplicate lines).
- Around line 192-196: Before calling img.resize capture the original format
(e.g., original_format = img.format) and after resizing pass that
original_format into img.save instead of using img.format or blindly falling
back to 'JPEG'; if original_format is None, choose a safe fallback that
preserves alpha (use 'PNG' when img.mode indicates an alpha channel like 'RGBA'
or 'LA', otherwise 'JPEG'), and keep the existing quality parameter when saving;
update the block around img.resize, img.format, and img.save (referencing
img.resize, img.format, img.save, new_width/new_height) to implement this.
🧹 Nitpick comments (2)
backend/main.py (2)
206-211: Chain the re-raised `HTTPException` with `from` for proper tracebacks (Ruff B904).
Inside `except` blocks, raising a new exception without `from` makes the original cause invisible in tracebacks. This applies here and similarly on lines 242 and 250.
♻️ Proposed fix
```diff
 except Exception as pil_error:
     logger.error(f"PIL validation failed for {file.filename}: {pil_error}")
     raise HTTPException(
         status_code=400,
         detail="Invalid image file. The file appears to be corrupted or not a valid image."
-    )
+    ) from pil_error
```
146-176: Security note: MIME fallback path is header/extension-based and trivially spoofable.
When `python-magic` is absent, both `file.content_type` (client-supplied header) and `mimetypes.guess_type` (extension-based) can be faked by an attacker. The downstream PIL `verify()` check partially compensates, but a file with a valid image header followed by a malicious payload could still pass. Consider adding a comment documenting this accepted trade-off, so future maintainers understand the risk surface when `python-magic` is missing.
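For reference, a minimal sketch of the detection chain this note describes (the function name and 2048-byte header size are illustrative assumptions, not the PR's exact code):

```python
import mimetypes

try:
    import magic  # python-magic: detects MIME from file contents
except ImportError:
    magic = None

def detect_mime(file) -> str | None:
    if magic is not None:
        header = file.file.read(2048)
        file.file.seek(0)  # rewind for downstream readers
        return magic.from_buffer(header, mime=True)  # content-based, hard to spoof
    # Fallback: both values below are client-controlled, so PIL must re-validate
    return file.content_type or mimetypes.guess_type(file.filename or "")[0]
```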
```python
from fastapi import FastAPI, UploadFile, File, Form, HTTPException, Depends, BackgroundTasks, Request, Query
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from fastapi.middleware.gzip import GZipMiddleware
from sqlalchemy.orm import Session
from sqlalchemy import func
from contextlib import asynccontextmanager
from functools import lru_cache
from typing import List, Optional, Any
from PIL import Image
import mimetypes
```
Missing `import io` and `from datetime import datetime, timedelta` will cause `NameError` at runtime.
`io.BytesIO()` is used on line 195, `datetime.now()` on line 109, and `timedelta(hours=1)` on line 110, but none of these are imported anywhere in the file. Both `check_upload_limits` and the image-resize path inside `_validate_uploaded_file_sync` will crash at runtime.
Also, `import httpx` on line 40 is a redefinition of the identical import on line 32 (Ruff F811).
🐛 Proposed fix — add missing imports and remove duplicate
```diff
 from fastapi import FastAPI, UploadFile, File, Form, HTTPException, Depends, BackgroundTasks, Request, Query
 from fastapi.responses import JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.middleware.gzip import GZipMiddleware
 from sqlalchemy.orm import Session
 from sqlalchemy import func
 from contextlib import asynccontextmanager
 from functools import lru_cache
 from typing import List, Optional, Any
 from PIL import Image
 import mimetypes
+import io
+from datetime import datetime, timedelta

 try:
     import magic
 except ImportError:
     magic = None
-import httpx
 from async_lru import alru_cache
```
Also applies to: 34-41
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 1 - 11, The file is missing imports used by
check_upload_limits and _validate_uploaded_file_sync and also duplicates the
httpx import; add "import io" and "from datetime import datetime, timedelta" at
the top of backend/main.py so io.BytesIO(), datetime.now(), and timedelta(...)
are defined, and remove the duplicate httpx import (keep a single import).
Update any import grouping to avoid Ruff F811 redefinition errors and ensure
check_upload_limits and _validate_uploaded_file_sync can reference these
symbols.
```python
from backend.exceptions import EXCEPTION_HANDLERS
from backend.database import Base, engine, get_db, SessionLocal
from backend.models import Issue
```
🛠️ Refactor suggestion | 🟠 Major
Remove duplicate imports (Ruff F811).
Lines 58–59 re-import `Base`, `engine`, `get_db`, `SessionLocal`, and `Issue` that are already imported on lines 44–45. Line 57 re-imports `EXCEPTION_HANDLERS`, already imported on line 65. These redefinitions are flagged by Ruff and add confusion.
♻️ Proposed fix — remove the duplicate lines
```diff
-from backend.exceptions import EXCEPTION_HANDLERS
-from backend.database import Base, engine, get_db, SessionLocal
-from backend.models import Issue
```
🧰 Tools
🪛 Ruff (0.14.14)
[error] 58-58: Redefinition of unused `Base` from line 44 (F811)
[error] 58-58: Redefinition of unused `engine` from line 44 (F811)
[error] 58-58: Redefinition of unused `get_db` from line 44 (F811)
[error] 58-58: Redefinition of unused `SessionLocal` from line 44 (F811)
[error] 59-59: Redefinition of unused `Issue` from line 45 (F811)
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 57 - 59, Remove the duplicate import lines that
re-import EXCEPTION_HANDLERS, Base, engine, get_db, SessionLocal, and Issue;
locate the redundant imports (the second occurrences of EXCEPTION_HANDLERS and
the group importing Base, engine, get_db, SessionLocal, Issue) and delete them
so each symbol is imported only once (keep the original imports where they first
appear and remove the later duplicate lines).
```python
def check_upload_limits(identifier: str, limit: int) -> None:
    """
    Check if the user/IP has exceeded upload limits using thread-safe cache.
    """
    current_uploads = user_upload_cache.get(identifier) or []
    now = datetime.now()
    one_hour_ago = now - timedelta(hours=1)

    # Filter out old timestamps (older than 1 hour)
    recent_uploads = [ts for ts in current_uploads if ts > one_hour_ago]

    if len(recent_uploads) >= limit:
        raise HTTPException(
            status_code=429,
            detail=f"Upload limit exceeded. Maximum {limit} uploads per hour allowed."
        )

    # Add current timestamp and update cache atomically
    recent_uploads.append(now)
    user_upload_cache.set(recent_uploads, identifier)
```
TOCTOU race in upload-limit enforcement.
The get → filter → check → append → set sequence on lines 108–123 is not atomic. Two concurrent requests for the same identifier can both read the same list, both pass the limit check, and both write back — allowing more uploads than `limit`. If `user_upload_cache` does not provide a compare-and-swap or locking primitive, consider wrapping this in a `threading.Lock` keyed by identifier, or using an atomic counter (e.g., Redis `INCR` with TTL) for correctness under load.
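A hedged sketch of the per-identifier locking option (the lock registry is an assumption; `user_upload_cache` and `HTTPException` are the module's existing names):

```python
import threading
from datetime import datetime, timedelta

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def _lock_for(identifier: str) -> threading.Lock:
    # Guarded setdefault so two threads never race on lock creation
    with _locks_guard:
        return _locks.setdefault(identifier, threading.Lock())

def check_upload_limits(identifier: str, limit: int) -> None:
    with _lock_for(identifier):  # read-modify-write is now atomic per identifier
        current = user_upload_cache.get(identifier) or []
        now = datetime.now()
        recent = [ts for ts in current if ts > now - timedelta(hours=1)]
        if len(recent) >= limit:
            raise HTTPException(
                status_code=429,
                detail=f"Upload limit exceeded. Maximum {limit} uploads per hour allowed."
            )
        recent.append(now)
        user_upload_cache.set(recent, identifier)
```

Note the locks only help within a single process; multi-worker deployments still need a shared store (see the Redis sketch further down).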
🧰 Tools
🪛 Ruff (0.14.14)
[error] 109-109: Undefined name `datetime` (F821)
[error] 110-110: Undefined name `timedelta` (F821)
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 104 - 123, The check_upload_limits function
performs a non-atomic get→filter→check→append→set on user_upload_cache which
permits a TOCTOU race; fix it by making the read-modify-write atomic for the
given identifier—either implement a per-identifier threading.Lock (e.g., a dict
of locks keyed by identifier) and wrap the body that reads current_uploads,
filters old timestamps, checks length, appends now, and calls
user_upload_cache.set inside that lock, or replace the list-based approach with
an atomic counter in a remote store (e.g., Redis INCR with TTL) and update
check_upload_limits to use that atomic increment/decrement TTL semantics instead
of get/set on user_upload_cache. Ensure you reference and update the logic
inside check_upload_limits and use user_upload_cache.get/set or the new atomic
store consistently.
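The Redis alternative from the same prompt could look like this (a sketch assuming a `redis-py` client named `r`; the key naming is illustrative):

```python
def check_upload_limits(identifier: str, limit: int) -> None:
    key = f"uploads:{identifier}"
    count = r.incr(key)       # atomic increment, safe across threads and workers
    if count == 1:
        r.expire(key, 3600)   # start the 1-hour window on the first upload
    if count > limit:
        raise HTTPException(
            status_code=429,
            detail=f"Upload limit exceeded. Maximum {limit} uploads per hour allowed."
        )
```

Note this is a fixed one-hour window rather than the sliding window of the timestamp-list version; that semantic difference is worth flagging if this route is taken.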
```python
img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

# Save resized image back to file
output = io.BytesIO()
img.save(output, format=img.format or 'JPEG', quality=85)
```
`img.format` is None after `resize()` — fallback to `'JPEG'` may corrupt alpha-channel images.
`Image.resize()` returns a new Image object whose `.format` attribute is None. The `or 'JPEG'` fallback will always activate. For PNG/GIF/WebP uploads with an alpha channel, saving as JPEG will either raise an error (cannot write mode RGBA as JPEG) or silently drop transparency.
Capture the original format before resizing and use it in `img.save()`:
🐛 Proposed fix
```diff
 img = Image.open(file.file)
 if img.width > 1024 or img.height > 1024:
+    original_format = img.format or 'JPEG'
     # Calculate new size maintaining aspect ratio
     ratio = min(1024 / img.width, 1024 / img.height)
     new_width = int(img.width * ratio)
     new_height = int(img.height * ratio)
     img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

     # Save resized image back to file
     output = io.BytesIO()
-    img.save(output, format=img.format or 'JPEG', quality=85)
+    img.save(output, format=original_format, quality=85)
```
🧰 Tools
🪛 Ruff (0.14.14)
[error] 195-195: Undefined name `io` (F821)
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 192 - 196, Before calling img.resize capture
the original format (e.g., original_format = img.format) and after resizing pass
that original_format into img.save instead of using img.format or blindly
falling back to 'JPEG'; if original_format is None, choose a safe fallback that
preserves alpha (use 'PNG' when img.mode indicates an alpha channel like 'RGBA'
or 'LA', otherwise 'JPEG'), and keep the existing quality parameter when saving;
update the block around img.resize, img.format, and img.save (referencing
img.resize, img.format, img.save, new_width/new_height) to implement this.
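A sketch of the alpha-aware variant the prompt describes (mode handling per the prompt; surrounding names assumed from the diff above):

```python
original_format = img.format  # capture before resize(), which returns an image with format=None
img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

if original_format is None:
    # Preserve transparency when the mode carries an alpha channel
    original_format = 'PNG' if img.mode in ('RGBA', 'LA') else 'JPEG'

output = io.BytesIO()
img.save(output, format=original_format, quality=85)  # PNG ignores the JPEG quality hint
```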
```python
pil_image = await run_in_threadpool(Image.open, image.file)
# Validate image for processing
await run_in_threadpool(validate_image_for_processing, pil_image)
except HTTPException:
    raise  # Re-raise HTTP exceptions from validation
except Exception as e:
    logger.error(f"Error during background initialization: {e}", exc_info=True)
    logger.error(f"Invalid image file during processing: {e}", exc_info=True)
    raise HTTPException(status_code=400, detail="Invalid image file")
```
🧩 Analysis chain
🏁 Script executed:
```bash
# Search for definition of validate_image_for_processing
rg -nP --type=py 'def validate_image_for_processing'
```
Length of output: 141
🏁 Script executed:
```bash
# Search for imports of validate_image_for_processing
rg -nP --type=py 'from .* import.*validate_image_for_processing|import.*validate_image_for_processing'
```
Length of output: 380
🏁 Script executed:
```bash
# Check the imports and context in backend/main.py around line 237
head -n 250 backend/main.py | tail -n +1 | head -n 50
```
Length of output: 1939
🏁 Script executed:
```bash
# Get a broader view of imports at the top of backend/main.py
head -n 100 backend/main.py
```
Length of output: 3750
🏁 Script executed:
```bash
# Check all imports in backend/main.py to see if validate_image_for_processing appears anywhere
rg -n 'from backend|import.*validate_image_for_processing' backend/main.py
```
Length of output: 987
🏁 Script executed:
```bash
# Verify the actual code around line 237 in backend/main.py
sed -n '230,245p' backend/main.py
```
Length of output: 692
Import `validate_image_for_processing` from `backend.pothole_detection`.
Line 237 calls `validate_image_for_processing(pil_image)`, but this function is not imported in `backend/main.py`. The function is defined in `backend.pothole_detection` and already imported in other files like `backend/routers/detection.py`. Add it to the imports at the top of the file, or remove the call if the PIL validation in `_validate_uploaded_file_sync` is sufficient.
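Per the verification above, the fix is a one-line import:

```python
from backend.pothole_detection import validate_image_for_processing
```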
🧰 Tools
🪛 Ruff (0.14.14)
[error] 237-237: Undefined name `validate_image_for_processing` (F821)
[warning] 242-242: Within an except clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling (B904)
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 235 - 242, The call to
validate_image_for_processing(pil_image) in the async block is missing an
import; update backend/main.py to import validate_image_for_processing from
backend.pothole_detection (same symbol used in backend/routers/detection.py) so
the run_in_threadpool(validate_image_for_processing, pil_image) call resolves,
or alternatively remove that call if you decide the existing synchronous PIL
checks in _validate_uploaded_file_sync are sufficient—ensure the chosen approach
keeps run_in_threadpool(Image.open, image.file) and the surrounding
HTTPException handling intact.
```python
# Create tables if they don't exist
Base.metadata.create_all(bind=engine)
```
Redundant and unguarded `Base.metadata.create_all` at module level.
This synchronous call runs at import time without error handling, blocking the module load and crashing if the database is unreachable. The same operation is already performed inside `lifespan()` (line 265) with proper error handling. Remove this line to avoid the duplicate call and the risk of an unhandled startup crash.
🐛 Proposed fix
```diff
-# Create tables if they don't exist
-Base.metadata.create_all(bind=engine)
```
🤖 Prompt for AI Agents
In `@backend/main.py` around lines 252 - 253, Remove the unguarded module-level
call to Base.metadata.create_all(bind=engine) so table creation does not run at
import time; the same operation with error handling already exists in
lifespan(), so delete the Base.metadata.create_all line (and any direct uses of
engine for schema creation at module scope) to avoid duplicate/unguarded startup
DB access and rely on the existing lifecycle-managed creation inside lifespan().
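For context, a minimal sketch of the lifecycle-managed shape the prompt refers to (the project's actual `lifespan()` reportedly already does this with error handling; the details here are assumed):

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        Base.metadata.create_all(bind=engine)  # runs once at startup, not at import time
    except Exception:
        logger.exception("Table creation failed at startup")
    yield  # application serves requests; shutdown cleanup would go after this

app = FastAPI(lifespan=lifespan)
```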
Summary
This PR makes image upload validation more robust in environments where `python-magic` is missing or misconfigured, while preserving all existing security checks.
Changes
- Import `magic` defensively in `backend/main.py`, falling back to `None` if the library is not available (see the sketch after this list).
- If `python-magic` is available, continue to use it to detect MIME type from the file contents.
- If `python-magic` is not available or fails, fall back to:
  - the `UploadFile.content_type` header, or
  - `mimetypes.guess_type` based on the filename.
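A sketch of that defensive import:

```python
try:
    import magic  # python-magic: optional, content-based MIME detection
except ImportError:
    magic = None  # signals the fallback path (Content-Type header / file extension)
```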
Rationale
Previously, the backend assumed `python-magic` was always installed. In environments where it was missing or misconfigured, the app could fail at import time or when validating uploads.
This change:
- keeps uploads validated even when `python-magic` is unavailable.
- keeps content-based MIME detection when `python-magic` is installed.
Testing
- Changes are limited to `backend/main.py`.
- With `python-magic` present: MIME detection continues to use content-based detection.
- Without `python-magic`: uploads are still validated via content type / filename plus PIL checks, and invalid files are rejected with `HTTP 400`.
- Existing tests that import `magic` (e.g. `backend/tests/test_detection_bytes.py`) remain compatible because the import name and usage are unchanged.
Summary by CodeRabbit
New Features
Improvements