@Hitansh-Shah
Contributor

@mekarpeles, in bgp/modules/terms.py I have added the TocPageDetectorModule class. It is a simple class copied from CopyrightPageDetectorModule: I removed the extractor function and changed the keywords to ones for a table of contents. I have also added mysequencer.py, a temporary file in the project root that defines a sequencer which only detects the table of contents page.

@Hitansh-Shah
Contributor Author

I also have a question about the Dockerfile. If I am not wrong, the container built from the Dockerfile contains a cloned copy of the sequencer git repo, because building the image performs a fresh git clone. So while developing we cannot use the docker because the local changes will not be reflected. I don't have much experience with Docker, so I may have gone in the wrong direction. Please correct me if I am mistaken.

@finnless
Collaborator

finnless commented Feb 3, 2022

> So while developing we cannot use the docker because the local changes will not be reflected.

This is true. I'm open to updating this to allow easier local development. A workaround would be to push your changes to a development branch and change the Dockerfile to clone that branch instead of master. You could also just use a local environment for development instead of a container.

@Hitansh-Shah
Contributor Author

> You could also just use a local environment for development instead of a container.

Yup, a virtual env seems to do the work.

@mekarpeles
Contributor

I'll submit a new PR for the docker fixes :) This PR seems like it's in the right direction. There may be opportunities for us to tune it to increase accuracy. e.g. How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

@Hitansh-Shah
Contributor Author

> How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

I gave it a little thought. The Toc page will always be placed before the main content. So we actually don't have to scan the whole book. If somehow we can manage to set a limit for the for loop to break, we should be good to go.

@finnless
Collaborator

finnless commented Feb 4, 2022

> If somehow we can manage to set a limit for the for loop to break, we should be good to go.

Doesn't the module's super().__init__(match_limit=1) do this here?

KeywordPageDetectorModule will break once match limit is reached:

if not self.match_limit or len(self.matched_pages) < self.match_limit:
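
To make the effect of that guard concrete, here is a minimal sketch. `KeywordDetectorSketch` and its `run` method are hypothetical stand-ins written for illustration, not the actual `KeywordPageDetectorModule` from bgp; only the `match_limit` / `matched_pages` condition is taken from the line quoted above. In this sketch the guard stops recording matches once the limit is reached; whether the real module also stops iterating over pages is not shown here.

```python
class KeywordDetectorSketch:
    """Hypothetical stand-in for a keyword page detector (not the bgp class)."""

    def __init__(self, keywords, match_limit=None):
        self.keywords = [k.lower() for k in keywords]
        self.match_limit = match_limit
        self.matched_pages = []

    def run(self, page_number, page_text):
        # Same guard as the quoted line: only record while under the limit.
        if not self.match_limit or len(self.matched_pages) < self.match_limit:
            text = page_text.lower()
            if any(k in text for k in self.keywords):
                self.matched_pages.append(page_number)


pages = [
    "Title page",
    "Table of Contents",
    "Chapter 1 ... see the table of contents",
]
detector = KeywordDetectorSketch(["table of contents"], match_limit=1)
for number, text in enumerate(pages):
    detector.run(number, text)

print(detector.matched_pages)  # prints [1]
```

With `match_limit=1` only the first matching page is kept, so a later stray mention is ignored. The limit does not help when the book has no table of contents page at all, because the first stray mention then becomes the match.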

@Hitansh-Shah
Contributor Author

@finnless That's totally correct. But this will also produce a match in the case where there is no table of contents page and "table of contents" / "contents" is mentioned somewhere else in the book.

As far as I know, we can avoid this in two ways.

  1. If we find a matching page, we can apply further validation before appending it to self.matched_pages.
  2. If a table of contents is present, it will always come before the main content. So even if match_limit has not been reached, we can break the loop once we figure out that we have entered the main content section, since there is no point in iterating further.

I may have missed or misinterpreted something, so please correct me if I am going in the wrong direction.
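
A rough sketch of the second method, assuming the book is available as a list of page text strings. It does not try to detect where the main content begins; it approximates that with a fixed front-matter cutoff. The function name `find_toc_page` and the `max_fraction` and `min_pages` values are illustrative guesses, not part of bgp.

```python
def find_toc_page(pages, keywords=("table of contents", "contents"),
                  max_fraction=0.15, min_pages=10):
    """Return the index of the first front-matter page mentioning a keyword.

    Returns None if no page within the cutoff matches. The cutoff values
    are untested assumptions and would need tuning on real books.
    """
    # Scan the first max_fraction of the book, but at least min_pages pages
    # (or the whole book if it is shorter than that).
    cutoff = min(len(pages), max(min_pages, int(len(pages) * max_fraction)))
    for index in range(cutoff):
        text = pages[index].lower()
        if any(keyword in text for keyword in keywords):
            return index
    return None


with_toc = ["front"] * 3 + ["Contents\n1 Intro 1"] + ["body"] * 200
without_toc = ["front"] * 4 + ["body"] * 200 + ["see the table of contents"]

print(find_toc_page(with_toc))     # prints 3
print(find_toc_page(without_toc))  # prints None
```

The stray mention on the last page of `without_toc` is never reached, which is the point of stopping early. A fixed cutoff is crude, so it would probably be combined with the validation from the first method.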

@mekarpeles
Contributor

This seems like the right line of thinking. What other data on the page may enable us to detect table of contents pages? Also what about the books that use the word contents instead of table of contents? Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

@mekarpeles
Contributor

Also, could we use the book page image?
https://www.researchgate.net/publication/4232729_Detection_and_Segmentation_of_Table_of_Contents_and_Index_Pages_from_Document_Images

Could we build a simple classifier which bounds accuracy? https://arxiv.org/pdf/1306.4631

@Hitansh-Shah
Contributor Author

> Also what about the books that use the word contents instead of table of contents?

I guess we can pass multiple keywords to the module, the way the copyright page module uses both "copyright" and "©".

@Hitansh-Shah
Contributor Author

> Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

I guess it can really vary from book to book. We can't say for sure. Also, how do we define "first"? There may be books where the heading is written vertically, like in the example I shared on Slack.
image

@Hitansh-Shah
Contributor Author

@mekarpeles I found something interesting today. In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages. It is open source, so we can look at the source code. I will see if I can find something useful in it. I am attaching a screenshot from the Document Viewer application.
I will also take a look at the resources you attached.
image

@finnless
Collaborator

finnless commented Feb 5, 2022

> In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages.

My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

@Hitansh-Shah
Contributor Author

> My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

I guess you are right, because I can get the same sidebar in Chrome too. My bad 😅

@Hitansh-Shah
Contributor Author

So I read the articles @mekarpeles attached. Both of them mainly focus on the characteristics of a ToC. One takes a more statistical approach to identifying the ToC, which is a bit complex, and the other takes a relatively simple approach. The main idea I got is that it may not be very accurate to just iterate through pages and look for the keywords passed to the module. Rather, we may have to scan the whole page for a pattern (e.g. a structure consisting of titles in bold font, occasionally starting with numbers, which may be section numbers like 3.18) and then classify it as either a ToC or a non-ToC page.

I don't know if we should implement ML or if there are other ways without ML. As of now I have hardly any knowledge of ML, but if we do use it here I don't think it will be very advanced, so I could learn the concepts while implementing them, or at least I will try.
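
One possible non-ML starting point is a structural heuristic over the page's plain text: count how many lines look like ToC entries, meaning some text followed by optional dot leaders and a trailing page number. The sketch below is an assumption about what such a check could look like, not something taken from either paper or from bgp. Font information such as bold is not available in plain text, so it is ignored, and the 0.5 threshold is an untested guess.

```python
import re

# A line such as "3.18 Results ........ 42" or "Introduction 1":
# some text, then spaces or dot leaders, then a 1-4 digit page number.
ENTRY_PATTERN = re.compile(r"^\S.*?[\s.]+\d{1,4}$")


def toc_line_ratio(page_text):
    """Return the fraction of non-empty lines that look like ToC entries."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if ENTRY_PATTERN.match(line))
    return hits / len(lines)


def looks_like_toc(page_text, threshold=0.5):
    """Classify a page as ToC when enough of its lines look like entries."""
    return toc_line_ratio(page_text) >= threshold


toc_page = (
    "Contents\n"
    "1 Introduction ........ 1\n"
    "2 Methods ........ 17\n"
    "3.18 Results 42\n"
    "Index 201"
)
prose_page = (
    "The table of contents of a book\n"
    "lists its chapters. This page is prose\n"
    "and should not be classified as one."
)

print(looks_like_toc(toc_page))    # prints True
print(looks_like_toc(prose_page))  # prints False
```

A ratio like this could also serve as one feature for a simple classifier later, if the heuristic alone turns out not to be accurate enough.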

@Hitansh-Shah
Contributor Author

@mekarpeles @finnless I have made some changes to TocPageDetectorModule. We can avoid almost all the cases where "contents" might be detected somewhere else in the book by simply checking whether it is the only word on the whole line. On a ToC page it will appear as a heading and so will be the only word on that line. Obviously there will still be the case where "contents" happens to be the only word on the last line of a paragraph. But in that case we can safely assume that some kind of punctuation will be attached to "contents", so comparing it with our keyword would give False. I have implemented this in such a way that it also handles "table of contents".

Please provide your feedback on this for any improvements or corrections that can be done. After that we can test this on some books.
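
For readers following along, the standalone-line check described above can be sketched roughly as follows. This is an illustration of the idea, not the code in this PR; `has_toc_heading` and `TOC_KEYWORDS` are hypothetical names.

```python
TOC_KEYWORDS = ("table of contents", "contents")


def has_toc_heading(page_text, keywords=TOC_KEYWORDS):
    """Return True if any line consists of nothing but a keyword."""
    for line in page_text.splitlines():
        # Collapse runs of whitespace so "Table  of   Contents" still matches.
        normalized = " ".join(line.split()).lower()
        if normalized in keywords:
            return True
    return False


print(has_toc_heading("CONTENTS\nChapter 1 .... 3"))            # prints True
print(has_toc_heading("  Table of   Contents \nPreface .. 5"))  # prints True
print(has_toc_heading("he emptied the box of its\ncontents."))  # prints False
```

The trailing period in the last example is what keeps a paragraph-ending "contents." from matching. An exact comparison like this would miss headings with OCR noise or letter spacing such as "C O N T E N T S".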

Comment on lines +1 to +17
from bgp import ia
from bgp import Sequencer
from bgp.modules.terms import TocPageDetectorModule, PageTypeProcessor, CopyrightPageDetectorModule


PageTypeDetectionSequencer = Sequencer({
    "pagetypes": PageTypeProcessor(modules={
        "toc_page": TocPageDetectorModule()
    })
})


book = ia.get_item("9780262517638OpenAccess")

results = PageTypeDetectionSequencer.sequence(book).results

print(results)
Contributor

This is okay for testing, but we'll want to delete this file before we move out of draft.

Contributor Author

Yeah absolutely. I made this file for my convenience during development. When we decide to merge, I will make a commit to delete this file.

@mekarpeles
Contributor

@Hitansh-Shah I made a few changes. Take a look and let me know what you think and whether you have any suggestions. Otherwise, we can try running this on 100 public books and see how it works!

Here's a good set of books to test with
https://archive.org/search.php?query=%22table%20of%20contents%22&sin=TXT

@Hitansh-Shah
Contributor Author

@mekarpeles the changes you have made seem perfect to me. I have some minor concerns, which I have commented on in the respective review conversations. Other than that, I think we are ready to test the first version. 🚀

@Hitansh-Shah
Contributor Author

Hey @mekarpeles, can you help me with the search query for retrieving the items? The link you shared earlier for the set of books to test on has a URL parameter, sin=TXT, which searches "Text Contents". I don't know how to express this in the query, and without it the search runs against "metadata" instead.

@ishank-dev
Collaborator

@Hitansh-Shah is this something you are actively working on? If help is needed here, let me know. This looks like a good add-on to the genome project.

CC: @mekarpeles
