Conversation

@mttk (Member) commented Dec 1, 2020

Some outstanding things:

  • Add masking functionality in a wrapper of Vocab -> prepped for this but left for the next PR
  • Disable caching in Field when the numericalizer isn't deterministic (via a deterministic arg)
  • Make Specials a singleton class
    • Moved to just checking for uniqueness in the Vocab constructor
  • Add static constructors (from_itos, from_stoi); a usage sketch follows below
  • Add UNK filtering
  • Apply non-core specials in Field & add tests for this
  • Fix previous tests
  • Add new tests
  • Improve code documentation
  • Update docs

Closes #216
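
For illustration, here is a hypothetical usage sketch of the from_itos / from_stoi static constructors mentioned in the checklist above. The argument shapes (an index-to-token list for from_itos, a token-to-index dict for from_stoi) are assumptions for the sketch, not confirmed signatures:

```python
from podium.storage.vocab import Vocab  # module path as referenced elsewhere in this PR

# Assumption: from_itos accepts an index-to-token list,
# from_stoi a token-to-index mapping.
itos = ["<UNK>", "<PAD>", "the", "quick", "brown", "fox"]
vocab_from_list = Vocab.from_itos(itos)

stoi = {token: index for index, token in enumerate(itos)}
vocab_from_dict = Vocab.from_stoi(stoi)
```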

@mttk requested review from FilipBolt, ivansmokovic, and mariosasko and removed the request for mariosasko on December 16, 2020.
@FilipBolt linked an issue on Dec 17, 2020 that may be closed by this pull request.
@ivansmokovic (Collaborator) left a comment


LGTM.
I have some questions.
I'm not sure about Specials subclassing str. What is the main benefit of this compared to a plain Special class?



class VocabDict(dict):
class Special(str):
Collaborator

Is the user expected to implement their own Specials at some point, or are all Specials maintained by us?

Member Author

It is intended for the user to subclass Special for new use cases (e.g., Mask is a relatively new one).
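
To illustrate, a minimal sketch of what such a user-defined special might look like; the MaskToken name, the <MASK> string, and the masking rule are made up for this example and assume that Special accepts the token string in its constructor:

```python
import random

from podium.storage.vocab import Special  # path as used in the docs excerpt below


class MaskToken(Special):
    """Illustrative user-defined special that masks out random tokens."""

    def apply(self, sequence):
        # Post-tokenization hook: replace roughly 15% of tokens with the mask special.
        return [self if random.random() < 0.15 else token for token in sequence]


mask = MaskToken("<MASK>")  # assumes the base constructor takes the token string
print(mask.apply(["the", "quick", "brown", "fox"]))
```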

2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In essence, the apply method is a post-tokenization hook that does not see the raw data and whose job is to add the special token to the sequence or to replace some of the existing tokens with the special token. The special tokens are applied after all post-tokenization hooks, in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a Special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equality and, instead of checking for string value equality, checks for *class name equality*. We use this type of check to ensure that each Vocab has a single instance of each Special, and for simpler referencing and containment checks.

To better understand how specials work, we will walk through the implementation of one of the special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
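
The walkthrough itself is not quoted in this thread, but a simplified, self-contained sketch consistent with points 2 and 3 above (a stub apply hook plus class-based hash and equality) might look like the following; this is a reconstruction for illustration, not Podium's exact source:

```python
class Special(str):
    """Simplified base: a string whose identity is its class, with an apply hook."""

    def __hash__(self):
        # Identity is defined by the concrete Special subclass, not the string value.
        return hash(self.__class__)

    def __eq__(self, other):
        return type(self) is type(other)

    def apply(self, sequence):
        # Post-tokenization hook: concrete specials add or replace tokens here.
        raise NotImplementedError


class BOS(Special):
    """Beginning-of-sequence token: prepends itself to the token sequence."""

    def __new__(cls, token="<BOS>"):
        return super().__new__(cls, token)

    def apply(self, sequence):
        return [self] + list(sequence)


bos = BOS()
print(bos.apply(["a", "short", "sentence"]))  # ['<BOS>', 'a', 'short', 'sentence']
print(BOS() == BOS("<s>"))                    # True: equality by class, not by value
```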
Collaborator

Do you happen to know a resource that lists the typical Specials used in NLP which we could link here? After a quick Google search I could not find one.

Member Author

Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.

Collaborator

@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.storage.vocab.Special
Collaborator

This should (eventually) get refactored to omit storage, but it looks good otherwise.

Member Author

Yes, I don't know why the docs refuse to use the shortened versions. I have an idea and might try it out soon.

@mttk (Member Author) commented Jan 12, 2021

@FilipBolt all comments addressed

@mariosasko (Collaborator)

@mttk Can you please reference the issue that this PR will close?

@FilipBolt (Collaborator) left a comment


Looks good to me. A few minor comments left, but good to go in my mind.

podium/vocab.py Outdated
[self.stoi[token] if token in self.stoi else unk_token for token in data]
)
else:
# Either UNK is not in Vocab or the user has requested unknown tokens
Collaborator

Makes sense. We might need to update the docs to reflect what happens when UNK is in/out of the vocab.

Member Author

Documented.
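
For reference, a hypothetical sketch of the two numericalization branches discussed in this thread (fall back to UNK only when an UNK special exists); the helper, its signature, and the error behavior are illustrative, not Podium's actual API:

```python
from typing import Dict, List, Optional


def numericalize(tokens: List[str], stoi: Dict[str, int],
                 unk_index: Optional[int] = None) -> List[int]:
    """Map tokens to indices, falling back to the UNK index only if one exists."""
    if unk_index is not None:
        # UNK is part of the vocab: unknown tokens map to its index.
        return [stoi.get(token, unk_index) for token in tokens]
    # UNK is not in the vocab: every token must be known, otherwise fail loudly.
    missing = [token for token in tokens if token not in stoi]
    if missing:
        raise ValueError(f"Tokens not in vocabulary and no UNK special defined: {missing}")
    return [stoi[token] for token in tokens]
```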

@mttk merged commit 5273cac into master on Jan 13, 2021.
@mttk deleted the vocab_specials branch on January 13, 2021 at 15:29.
Successfully merging this pull request may close these issues: Proposal: specials revamp.
