Commit c67f564

docs(setup): note nltk punkt_tab download is likely redundant
unstructured already lazily downloads punkt_tab on first tokenize call, so the eager post-install download is probably duplicate work. Keep it as a safety net (and to front-load the network hit at install time instead of on the first office-document parse), but document it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent f4445b4 commit c67f564

1 file changed

Lines changed: 4 additions & 0 deletions

setup.py

@@ -70,6 +70,10 @@ def run(self):
         )
 
         # do "import nltk ; nltk.download('punkt_tab')"
+        # Likely redundant: `unstructured` (our only nltk consumer, via
+        # unstructured.nlp.tokenize) already lazily runs nltk.download("punkt_tab")
+        # on first use. Kept as a safety net so the download happens at install
+        # time rather than on the first parse of an office document.
         try:
             import nltk
 
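
The "check first, download only if missing" pattern that the new comment describes can be sketched as follows. This is an illustrative sketch and is not taken from setup.py or from unstructured: the helper name `ensure_punkt_tab` and its injectable `find` / `download` parameters are hypothetical, and the resource path `tokenizers/punkt_tab` is an assumption about where nltk stores the package. By default it falls back to `nltk.data.find` and `nltk.download`.

```python
from typing import Callable, Optional


def ensure_punkt_tab(
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[..., object]] = None,
) -> bool:
    """Download punkt_tab only if it is not already available locally.

    Returns True if a download was attempted, False if the resource
    was already present. `find` and `download` are hypothetical
    injection points so the logic can be exercised without network
    access; they default to nltk's own functions.
    """
    if find is None or download is None:
        # Imported lazily so that this module can be loaded (and the
        # helper tested with fakes) even where nltk is not installed.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        # Assumed resource path; nltk.data.find raises LookupError
        # when the resource cannot be located.
        find("tokenizers/punkt_tab")
        return False
    except LookupError:
        download("punkt_tab", quiet=True)
        return True
```

Called with no arguments at install time, a helper like this would front-load the network hit in the same way as the eager download kept by this commit, while skipping the download when the data is already on disk.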
