Unigram frequencies in en_wordlist.xml appear to have been flattened #207

I'm not sure where en_wordlist.xml came from, but the spread of its unigram frequencies is extremely narrow: the most popular word, "the", has frequency 222, while the 50,000th most popular word, "exude", has frequency 66, a ratio of only about 3.4 to 1. This suggests either a very small training corpus or, more likely, some kind of log() flattening function. Flattened frequencies are acceptable for ordinary unigram prediction, since the relative ordering is largely preserved, but for our adaptation purposes we need the raw frequencies.
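To illustrate why a log() flattening would produce such a narrow range, here is a rough sketch. It is not the function that was actually used to build en_wordlist.xml, which is unknown. It assumes an idealized Zipfian corpus (count proportional to 1/rank, with a made-up top count of 1,000,000) and a hypothetical scaling of the form `flat = a * ln(raw) + b`, fitted so that rank 1 maps to 222 and rank 50,000 maps to 66. All function names and parameters are illustrative.

```python
import math

def zipf_count(rank, top_count=1_000_000, exponent=1.0):
    """Idealized raw count at a given rank (assumption: pure Zipf law)."""
    return top_count / rank ** exponent

def fit_log_scale(raw_hi, raw_lo, flat_hi, flat_lo):
    """Solve flat = a * ln(raw) + b from two (raw, flat) pairs."""
    a = (flat_hi - flat_lo) / (math.log(raw_hi) - math.log(raw_lo))
    b = flat_hi - a * math.log(raw_hi)
    return a, b

def flatten(raw, a, b):
    """Hypothetical flattening: compress a raw count onto a log scale."""
    return a * math.log(raw) + b

def unflatten(flat, a, b):
    """Invert the hypothetical flattening to estimate a raw count."""
    return math.exp((flat - b) / a)

# Hypothetical raw counts for "the" (rank 1) and "exude" (rank 50,000).
raw_the = zipf_count(1)
raw_exude = zipf_count(50_000)

# Fit the scale to the two frequencies observed in the word list.
a, b = fit_log_scale(raw_the, raw_exude, 222, 66)

print("raw ratio:      ", raw_the / raw_exude)
print("flattened ratio:", flatten(raw_the, a, b) / flatten(raw_exude, a, b))
```

Under these assumptions a raw ratio of 50,000 to 1 collapses to 222 to 66, which matches the pattern described above. It also shows that the raw counts could only be recovered (via something like `unflatten`) if the real scaling constants were known, and even then only up to whatever rounding was applied to the stored integers.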

