Skip to content

Conversation

@fdsa654hg
Copy link

add english to chinese dataset

Copy link
Owner

@ProFatXuanAll ProFatXuanAll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add a dataset class, see s2s.dset for examples.

df['tgt'].apply(str).apply(self.__class__.preprocess).to_list()
)

@staticmethod
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For translation, one should use evaluation metrics other than exact match (i.e., generate words must "be the same word" and "in the same order" of target sequence , which is used by accuracy_score).
Why is exact match bad?
Consider the example translation pair I like apple and 我喜歡蘋果, 蘋果我喜歡 should also be an acceptable answer.
Thus, one should use some evaluation metrics with fozzy match, i.e., it's okay to not be that accurate but at the same time have same meaning (swapping order, synonym, etc.).
Nowadays people mostly use BLEU score as evaluation metric on translation task.
Go find some python package which calculate BLEU score for you.

from s2s.dset._base import BaseDset
from s2s.path import DATA_PATH

class Eng2ChiDset(BaseDset):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring for Eng2ChiDset class, including reference of the source of the dataset.

Pipfile Outdated
tqdm = "4.49.0"
sklearn = "0.0"
tensorboard = "2.3.0"
nltk = "*"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify dependency version

batch_tgt=dset.all_tgt(),
batch_pred=all_pred,
))
print(DSET_OPTS[args.dset_name].bleu_score(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method only work on translation dataset but not arithmetic dataset.
Overload batch_eval can do the trick.

from s2s.dset._base import BaseDset
from s2s.path import DATA_PATH

try:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User need to install dependency before running the code.
So you don't need to check whether a dependecies is missing or not.

fdsa654hg and others added 2 commits November 29, 2020 22:24
Overload batch_eval function to count BLEU score
Add docstring to _eng2chi.py
Accidentally changed the file run_train_tknzr.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants