Unigram frequencies in en_wordlist.xml appear to have been flattened #207

@tnantais

I'm not sure where en_wordlist.xml came from, but the spread of unigram frequencies is extremely narrow: the most popular word, "the", has frequency 222, while the 50,000th most popular word, "exude", has frequency 66. This suggests either a very small training corpus or, more likely, some kind of log() flattening function. Flattened frequencies are acceptable for ordinary unigram prediction, since relative ordering is largely preserved, but for our adaptation purposes we need raw frequencies.
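For a rough sense of how narrow that spread is, here is a minimal sketch comparing the two values quoted above against a simple Zipf-style model. The model itself is an assumption on my part (raw frequency roughly proportional to 1/rank, which is only an approximation for natural-language corpora), and the function names are made up for illustration; nothing here reads en_wordlist.xml or reflects how it was actually generated.

```python
import math

# The two data points quoted in this issue.
TOP_RANK, TOP_FREQ = 1, 222        # "the"
LOW_RANK, LOW_FREQ = 50_000, 66    # "exude"


def zipf_expected_ratio(top_rank, low_rank, exponent=1.0):
    """Ratio f(top)/f(low) under the ASSUMED model f(rank) ~ rank ** -exponent."""
    return (low_rank / top_rank) ** exponent


def implied_zipf_exponent(top_rank, top_freq, low_rank, low_freq):
    """Exponent that would make the assumed power-law model fit both points."""
    return math.log(top_freq / low_freq) / math.log(low_rank / top_rank)


# Ratio actually present in the word list.
observed = TOP_FREQ / LOW_FREQ
# Ratio expected from raw counts if the exponent were about 1 (assumption).
expected = zipf_expected_ratio(TOP_RANK, LOW_RANK)
# Exponent the stored values would imply if they were raw counts.
implied = implied_zipf_exponent(TOP_RANK, TOP_FREQ, LOW_RANK, LOW_FREQ)

print(f"observed ratio:        {observed:.2f}")
print(f"Zipf-expected ratio:   {expected:.0f}")
print(f"implied Zipf exponent: {implied:.3f}")
```

The observed ratio is a little over 3, where raw counts under this model would give a ratio in the tens of thousands. That gap is far too large to explain by corpus noise alone, which is consistent with the values having been compressed by some transform, though it does not tell us which one.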
