Description
When I tried to validate the output of preprocess.py using text8 as input, I found a mismatch between the input word sequence and the index-encoded sequence that is written to text8-train.pb2.
To reproduce, add the following two lines to preprocess.py after the call to build_string_index():
print('{}'.format(words[:100]))
print('{}'.format(index[word_indices[:100]]))
Here is the output I get (the original words first, then the words decoded from the indices):
['anarchism' 'originated' 'as' 'a' 'term' 'of' 'abuse' 'first' 'used'
'against' 'early' 'working' 'class' 'radicals' 'including' 'the' 'diggers'
'of' 'the' 'english' 'revolution' 'and' 'the' 'sans' 'culottes' 'of' 'the'
'french' 'revolution' 'whilst' 'the' 'term' 'is' 'still' 'used' 'in' 'a'
'pejorative' 'way' 'to' 'describe' 'any' 'act' 'that' 'used' 'violent'
'means' 'to' 'destroy' 'the' 'organization' 'of' 'society' 'it' 'has'
'also' 'been' 'taken' 'up' 'as' 'a' 'positive' 'label' 'by' 'self'
'defined' 'anarchists' 'the' 'word' 'anarchism' 'is' 'derived' 'from'
'the' 'greek' 'without' 'archons' 'ruler' 'chief' 'king' 'anarchism' 'as'
'a' 'political' 'philosophy' 'is' 'the' 'belief' 'that' 'rulers' 'are'
'unnecessary' 'and' 'should' 'be' 'abolished' 'although' 'there' 'are'
'differing']
['instance' 'dating' 'as' 'a' 'term' 'of' 'distances' 'first' 'used'
'against' 'early' 'working' 'class' 'squid' 'including' 'the' 'hanoi' 'of'
'the' 'english' 'treaty' 'and' 'the' 'malinowski' 'UNK' 'of' 'the'
'french' 'treaty' 'afro' 'the' 'term' 'is' 'still' 'used' 'in' 'a' 'buddy'
'way' 'to' 'islam' 'any' 'act' 'that' 'used' 'zeus' 'lincoln' 'to'
'vector' 'the' 'car' 'of' 'society' 'it' 'has' 'also' 'been' 'latin' 'up'
'as' 'a' 'failed' 'eddington' 'by' 'self' 'command' 'anarchists' 'the'
'word' 'instance' 'is' 'treaty' 'from' 'the' 'born' 'without' 'mml'
'progress' 'coast' 'king' 'instance' 'as' 'a' 'political' 'culture' 'is'
'the' 'me' 'that' 'dating' 'are' 'squid' 'and' 'public' 'be' 'acceptable'
'although' 'there' 'are' 'absent']
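
To make the comparison above easier to scan, the round-trip check could be automated along these lines. This is only a sketch: `find_roundtrip_mismatches` is a hypothetical helper, not part of preprocess.py, and it assumes that `index` maps an integer id to a word, that `word_indices` holds one id per input word, and that out-of-vocabulary words are encoded as the 'UNK' token seen in the output. Positions decoded to 'UNK' are skipped, since those would be expected for rare words; everything else that differs is reported.

```python
def find_roundtrip_mismatches(words, index, word_indices, unk_token="UNK"):
    """Return (position, original, decoded) for every word whose id
    decodes to a different, non-UNK word.

    All names here are illustrative assumptions; `index` is any
    id -> word sequence and `word_indices` the encoded ids.
    """
    mismatches = []
    for pos, (word, idx) in enumerate(zip(words, word_indices)):
        decoded = index[idx]
        # An UNK decode is treated as expected for out-of-vocabulary words.
        if decoded != word and decoded != unk_token:
            mismatches.append((pos, str(word), str(decoded)))
    return mismatches


# Toy data, not taken from text8: a three-entry vocabulary with UNK at id 0.
toy_words = ["the", "term", "anarchism", "the"]
toy_index = ["UNK", "the", "term"]
consistent_ids = [1, 2, 0, 1]    # every id decodes to its word or to UNK
inconsistent_ids = [2, 1, 0, 1]  # ids for "the" and "term" are swapped

print(find_roundtrip_mismatches(toy_words, toy_index, consistent_ids))
print(find_roundtrip_mismatches(toy_words, toy_index, inconsistent_ids))
```

In preprocess.py the same call could be made right after the two print lines, for example with `words[:100]`, `index`, and `word_indices[:100]`, provided those variables have the meaning assumed here.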