Skip to content

MarcusBluestone/Question-Retrieval-NLP-Final

 
 

Repository files navigation

NLP Question Retrieval

overview.pdf contains the final version of our paper

Our code was created to be run in Google Colab. “balanced_df.csv” is the General dataset. “tech_data.csv” is the Technology dataset.

The cell in the beginning of the notebook is for mounting the drive and importing the data.

After these first few data cells, we have the analysis code. These cells can be run to re-produce our results. The order is as follows:

  • Bow with TF-IDF

  • Pre-Trained CBoW (Word2Vec)

  • Our own CBoW (Word2Vec)

  • Binary Classification with BERT Pre-trained

  • Binary Classification with BERT Fine-Tuned

  • BERT CLS Token Approach

  • BERT CLS Token Approach Fine-Tuned

Each method has a subsection for the tech and general datasets.

The GoogleNews weights can be found at: https://drive.google.com/file/d/1j7-0mg_xw7FWES12AA0rAXOTUHt-NcDJ/view?usp=sharing

About

Capstone Project for MIT NLP Class (6.8301)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%