@dylanpivo
Contributor

@dylanpivo dylanpivo commented Feb 13, 2025

  • Add weights to title, keyword and abstract.
    • Curation request for the weighting order to be: keyword, title and then abstract.
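As a rough illustration of the weighting order requested above, the sketch below maps each field to a PostgreSQL weight label ('A' is the highest, 'D' the lowest) and builds a `setweight` expression from that mapping. The field names, the `english` text search configuration and the helper name are assumptions made for the example; they are not taken from this PR's code.

```python
# Hypothetical field names: the real column names are not shown in this PR.
# PostgreSQL weight labels run from 'A' (highest) to 'D' (lowest).
FIELD_WEIGHTS = {
    "keywords": "A",
    "title": "B",
    "abstract": "C",
}


def weighted_tsvector_sql(field_weights: dict[str, str], config: str = "english") -> str:
    """Build a SQL expression that concatenates one weighted tsvector per field."""
    parts = [
        # coalesce() guards against NULL fields, which would otherwise
        # make the whole concatenated tsvector NULL.
        f"setweight(to_tsvector('{config}', coalesce({field}, '')), '{label}')"
        for field, label in field_weights.items()
    ]
    return " || ".join(parts)


print(weighted_tsvector_sql(FIELD_WEIGHTS))
```

The names here are fixed constants, so plain string formatting is acceptable for a sketch; anything derived from user input should be passed as a bound parameter instead.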

Fix and improve ranking at search:
During search, the query was treating 'full_text' and 'query' as values rather than as columns.
Normalisation: the current normalisation is set to 4 | 1, but this will be investigated further.

Fixes #9

@dylanpivo dylanpivo marked this pull request as draft February 13, 2025 09:17
@dylanpivo
Contributor Author

dylanpivo commented Feb 13, 2025

Normalization:

The list below, from the PostgreSQL text search documentation, outlines the normalization options for ts_rank and ts_rank_cd.

  • 0 (the default) ignores the document length
  • 1 divides the rank by 1 + the logarithm of the document length
  • 2 divides the rank by the document length
  • 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
  • 8 divides the rank by the number of unique words in document
  • 16 divides the rank by 1 + the logarithm of the number of unique words in document
  • 32 divides the rank by itself + 1

4 | 1 is currently in use.

4: weighs the record higher if the words in the search term occur closer together in the record, for instance if "climate" and "change" occur right after each other as opposed to at opposite ends of the document.

1: weighs the record lower if the document is longer. Dividing by the logarithm of the length, rather than by the length itself, softens the penalty so that long documents are not pushed down too heavily.
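To make the combined effect of 4 | 1 concrete, here is a minimal sketch that applies the two divisors as they are worded in the list above. It is an illustration only, not PostgreSQL's implementation: the function name is hypothetical, and `mean_harmonic_distance` is treated as a plain input that grows as the matched terms sit further apart.

```python
import math

# Flag values as given in the list of normalization options.
NORM_LOG_LENGTH = 1
NORM_MEAN_HARMONIC_DISTANCE = 4


def normalise(rank, flags, doc_length, mean_harmonic_distance):
    """Apply the selected divisors to a raw rank (illustrative only)."""
    if flags & NORM_LOG_LENGTH and doc_length > 0:
        # Flag 1: divide by 1 + log(document length).
        rank /= 1 + math.log(doc_length)
    if flags & NORM_MEAN_HARMONIC_DISTANCE and mean_harmonic_distance > 0:
        # Flag 4: divide by the (assumed, precomputed) mean harmonic distance.
        rank /= mean_harmonic_distance
    return rank


flags = 4 | 1  # the combination currently in use
short_doc = normalise(1.0, flags, doc_length=50, mean_harmonic_distance=1.0)
long_doc = normalise(1.0, flags, doc_length=200, mean_harmonic_distance=1.0)
print(short_doc / long_doc)
```

With the same raw rank, a 200-word document ends up only about 1.3 times lower than a 50-word one, whereas dividing by the length directly (flag 2) would make it 4 times lower.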

@dylanpivo
Contributor Author

dylanpivo commented Feb 17, 2025

Testing:
The testing involves mocking up a list of different metadata records, publishing them and then searching. The metadata will be put together using only Lorem Ipsum mock data, arranged so as to cover the different ranking circumstances.

The list of options from which combinations will be generated is as follows:

The search term will be fixed.

Search terms in title with no harmonic distance.
Search terms in title with harmonic distance.
No search term in title.

Short length abstract. (50 words)
Long abstract. (200 words)

No/Low harmonic distance of search terms in abstract.
High harmonic distance between search terms in abstract.

Many instances of search terms in abstract. (6 instances)
Few instances of search terms in abstract. (3 instances)
Note: the number of instances does not increase when the abstract length increases. This is so that lengthening the abstract affects the ranking in isolation.

Keywords with all search terms.
Keywords with no search terms.
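One possible way to generate the combinations from the options above is a Cartesian product over the option groups. The sketch below is only an outline of that idea: the dictionary keys, the labels and the `build_cases` helper are hypothetical names, and turning each case into an actual Lorem Ipsum record is left out.

```python
from itertools import product

# One entry per option group listed above; names and labels are illustrative.
OPTIONS = {
    "title": ["terms adjacent", "terms separated", "no terms"],
    "abstract_length": [50, 200],          # words
    "abstract_distance": ["low", "high"],  # harmonic distance between terms
    "abstract_instances": [6, 3],          # occurrences of the search terms
    "keywords": ["all terms", "no terms"],
}


def build_cases(options):
    """Return one dict per combination of option values."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]


cases = build_cases(OPTIONS)
print(len(cases))  # 3 * 2 * 2 * 2 * 2 = 48
```

Each resulting dict describes one mock record to publish, so every ranking circumstance is covered exactly once.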

@dylanpivo dylanpivo force-pushed the add_title_weights branch from d59852a to 776837d Compare March 3, 2025 08:17
@dylanpivo dylanpivo force-pushed the add_title_weights branch from 776837d to 37cd3cd Compare March 3, 2025 08:43
@dylanpivo dylanpivo marked this pull request as ready for review March 3, 2025 08:49

Development

Successfully merging this pull request may close these issues.

Use weightings to improve sorting by relevance

1 participant