Table of Contents Page detection Draft #82

Hitansh-Shah · 2022-02-02T08:29:35Z

@mekarpeles, in bgp/modules/terms.py I have added the TocPageDetectorModule class. It is a simple class copied from CopyrightPageDetectorModule: I removed the extractor function and changed the keywords to ones for the table of contents. I have also added mysequencer.py, a temporary file in the root of the project that defines a sequencer which detects only the table of contents page.

Hitansh-Shah · 2022-02-02T08:34:41Z

Also, I had a question about the Dockerfile. If I am not wrong, the container built from the Dockerfile contains a cloned copy of the sequencer git repo, because building the image performs a fresh git clone. So while developing we cannot use the docker because the local changes will not be reflected. I don't have much experience with Docker, so I may have gone in the wrong direction. Please correct me if I am mistaken.

finnless · 2022-02-03T00:47:23Z

So while developing we cannot use the docker because the local changes will not be reflected.

This is true. I'm open to updating this to allow easier local development. A workaround would be pushing your changes to a development branch and changing the Dockerfile to clone that branch instead of master. You could also just use a local environment for development instead of a container.

mekarpeles · 2022-02-03T04:06:06Z

You're right. I'd remove https://github.com/Open-Book-Genome-Project/sequencer/blob/master/Dockerfile#L3 and then, under volumes (https://github.com/Open-Book-Genome-Project/sequencer/blob/master/docker-compose.yml#L6), add:

volumes:
    - ./:/sequencer

Hitansh-Shah · 2022-02-03T10:22:20Z

You could also just use a local environment for development instead of a container.

Yup, a virtual env seems to do the work.

mekarpeles · 2022-02-03T19:09:32Z

I'll submit a new PR for the docker fixes :) This PR seems like it's in the right direction. There may be opportunities for us to tune it to increase accuracy. e.g. How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

Hitansh-Shah · 2022-02-03T20:23:08Z

How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

I gave it a little thought. The ToC page will always be placed before the main content, so we don't actually have to scan the whole book. If somehow we can manage to set a limit for the for loop to break, we should be good to go.

finnless · 2022-02-04T18:31:31Z

If somehow we can manage to set a limit for the for loop to break, we should be good to go.

Doesn't the module's super().__init__(match_limit=1) do this here?

KeywordPageDetectorModule will break once match limit is reached:

sequencer/bgp/modules/terms.py

Line 321 in f6f6f86

if not self.match_limit or len(self.matched_pages) < self.match_limit:
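
As a rough illustration of the guard quoted above, here is a minimal sketch. KeywordDetectorSketch, its run method, and the sample pages are hypothetical stand-ins, not the real KeywordPageDetectorModule. Note that the guard only stops new matches from being recorded; whether the surrounding page loop also stops early is not shown in this thread.

```python
class KeywordDetectorSketch:
    """Hypothetical stand-in, not the real KeywordPageDetectorModule."""

    def __init__(self, keywords, match_limit=None):
        self.keywords = [k.lower() for k in keywords]
        self.match_limit = match_limit
        self.matched_pages = []

    def run(self, page_text, page_number):
        # Same condition as the quoted line: record only while under the limit.
        if not self.match_limit or len(self.matched_pages) < self.match_limit:
            lowered = page_text.lower()
            if any(k in lowered for k in self.keywords):
                self.matched_pages.append(page_number)


# Made-up page texts; page 3 mentions the phrase but is ignored after the limit.
pages = ["Title page", "Table of Contents", "Chapter 1", "see the table of contents"]
detector = KeywordDetectorSketch(["table of contents"], match_limit=1)
for number, text in enumerate(pages):
    detector.run(text, number)
print(detector.matched_pages)
```

With match_limit=1 only the first matching page is kept, which is also why a stray mention in a book with no ToC page would still be reported.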

Hitansh-Shah · 2022-02-05T09:12:52Z

@finnless That's totally correct. But with this, a book that has no table of contents page and only mentions "table of contents"/"contents" somewhere in its text will also be detected.

As far as I know, we can avoid this in two ways:

1. If we find a matching page, we can add further validation before appending it to self.matched_pages.
2. If a table of contents is present, it will always come before the main content. So even if the match_limit is not reached, we can break the loop once we figure out that we have entered the main content section, since there is no point in iterating further from there.
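
Both ideas can be sketched together, under some loud assumptions: find_toc_page, max_front_pages, and validate are hypothetical names, and a fixed page cutoff is only a crude proxy for "we have entered the main content", which the thread does not say how to detect.

```python
def find_toc_page(pages, keywords=("table of contents", "contents"),
                  max_front_pages=30, validate=None):
    """Return the index of the first accepted candidate page, or None.

    max_front_pages and validate are hypothetical knobs for this sketch.
    """
    for index, text in enumerate(pages):
        # Early exit: stop once we are past the assumed front matter.
        if index >= max_front_pages:
            break
        lowered = text.lower()
        if any(k in lowered for k in keywords):
            # Extra validation before accepting the page.
            if validate is None or validate(text):
                return index
    return None


with_toc = ["Title", "Contents\n1 Introduction 3", "Chapter text"]
without_toc = ["Title"] + ["filler"] * 40 + ["see the table of contents"]
print(find_toc_page(with_toc))
print(find_toc_page(without_toc))
```

The cutoff value would need tuning against real books; the point is only that a late mention of the keyword is never reached.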

I may have missed or misinterpreted something, so please correct me if I am going in the wrong direction.

mekarpeles · 2022-02-05T14:50:14Z

This seems like the right line of thinking. What other data on the page may enable us to detect table of contents pages? Also what about the books that use the word contents instead of table of contents? Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

mekarpeles · 2022-02-05T14:55:29Z

Also, could we use the book page image?
https://www.researchgate.net/publication/4232729_Detection_and_Segmentation_of_Table_of_Contents_and_Index_Pages_from_Document_Images

Could we build a simple classifier which bounds accuracy? https://arxiv.org/pdf/1306.4631

Hitansh-Shah · 2022-02-05T15:04:40Z

Also what about the books that use the word contents instead of table of contents?

I guess we can pass multiple keywords to the module, as is done for the copyright page with "copyright" and "©".

Hitansh-Shah · 2022-02-05T15:08:44Z

Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

I guess it can really vary from book to book. We can't say for sure. Also, how do we define "first"? There may be books where the heading is written vertically, like in the example I shared on Slack.

Hitansh-Shah · 2022-02-05T15:17:16Z

@mekarpeles I found something interesting today. In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages. It is open source so we can look at the source code. I will see if I can find something useful from it. I am attaching a screenshot from the Document viewer application.
I will also take a look at the resources you attached.

finnless · 2022-02-05T21:22:20Z

In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages.

My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

Hitansh-Shah · 2022-02-06T08:25:30Z

My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

I guess you are right. Because I can get the same sidebar in chrome too. My bad😅

Hitansh-Shah · 2022-02-08T08:36:03Z

So I read the articles @mekarpeles attached. Both focus mainly on the characteristics of a ToC. One takes a more statistical approach, which makes identifying the ToC a bit complex; the other takes a relatively simple approach. The main idea I got is that it may not be very accurate to just iterate through pages and look for the keywords passed to the module. Rather, we may have to scan the whole page for a pattern (for example, a structure consisting of titles in bold font, occasionally starting with numbers that may be section numbers like 3.18) and then classify the page as either ToC or non-ToC.

I don't know whether we should implement ML or whether there are other ways that work without ML. As of now I have hardly any knowledge of ML, but if we do bring ML into this I don't think it will be very advanced, so I could learn the concepts while implementing them, or at least I will try.
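
One non-ML way to try the "scan the page for a pattern" idea is a line-level heuristic. The sketch below is an assumption-heavy illustration, not the method from either paper: the two regular expressions, toc_score, looks_like_toc, and the 0.5 threshold are all made up here, the trailing page number pattern is an extra guess not mentioned above, and font weight is ignored because it is not available in plain page text.

```python
import re

# Assumed patterns for this sketch only.
SECTION_START = re.compile(r"^\s*\d+(\.\d+)*\.?\s+\S")  # e.g. "3.18 Title"
PAGE_NUMBER_END = re.compile(r"\D[\s.]*\d+\s*$")        # e.g. "Title .... 45"


def toc_score(page_text):
    """Fraction of non-empty lines that look like ToC entries."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(
        1 for line in lines
        if SECTION_START.search(line) or PAGE_NUMBER_END.search(line)
    )
    return hits / len(lines)


def looks_like_toc(page_text, threshold=0.5):
    # The threshold is a hypothetical value that would need tuning.
    return toc_score(page_text) >= threshold


toc_page = "Contents\n1 Introduction .... 1\n2 Methods .... 17\n2.1 Data 19\nIndex 203"
prose_page = "The contents of this chapter were drafted\nover several years and revised often."
print(looks_like_toc(toc_page), looks_like_toc(prose_page))
```

A score like this could serve as the extra validation step after a keyword match, or as one feature if a classifier is tried later.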

Hitansh-Shah · 2022-02-14T11:36:21Z

@mekarpeles @finnless I have made some changes in the TocPageDetectorModule. We can avoid almost all the cases where "contents" might be detected somewhere else in the book by simply checking whether it is the only word on the whole line. On a ToC page it will appear as a header and so will be the only word on that line. Obviously there is still the case where "contents" happens to be the only word on the last line of a paragraph. But in that case we can safely assume that some kind of punctuation will be attached to "contents", so comparing it with our keyword would give False. I have implemented this in such a way that it also handles "table of contents".

Please provide your feedback on this for any improvements or corrections that can be done. After that we can test this on some books.
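
For readers without the diff at hand, the header-only check described above might look roughly like this. It is a sketch under assumptions, not the code in this PR: has_toc_header and TOC_HEADERS are hypothetical names, and the whitespace and case normalization is a choice made here.

```python
TOC_HEADERS = {"contents", "table of contents"}


def has_toc_header(page_text, headers=TOC_HEADERS):
    """True if some line consists solely of a ToC header keyword."""
    for line in page_text.splitlines():
        # Collapse whitespace and ignore case, but keep punctuation so that
        # "contents." at the end of a paragraph does not match.
        normalized = " ".join(line.split()).lower()
        if normalized in headers:
            return True
    return False


print(has_toc_header("CONTENTS\n1 Introduction 3"))
print(has_toc_header("He emptied the box and listed its\ncontents."))
```

Because the whole line is compared against the keyword, a sentence that merely contains "table of contents" is not treated as a header.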

bgp/modules/terms.py

mekarpeles · 2022-02-23T06:14:54Z

mysequencer.py

+from bgp import ia
+from bgp import Sequencer
+from bgp.modules.terms import TocPageDetectorModule, PageTypeProcessor, CopyrightPageDetectorModule
+
+
+PageTypeDetectionSequencer = Sequencer({
+    "pagetypes": PageTypeProcessor(modules={
+        "toc_page": TocPageDetectorModule()
+    })
+})
+
+
+book = ia.get_item("9780262517638OpenAccess") 
+
+results = PageTypeDetectionSequencer.sequence(book).results
+
+print(results)


This is okay for testing, but we'll want to delete this file before we move out of draft.

Hitansh-Shah · 2022-02-23

Yeah, absolutely. I made this file for my convenience during development. When we decide to merge, I will make a commit to delete this file.

bgp/modules/terms.py

mekarpeles · 2022-02-23T06:28:33Z

@Hitansh-Shah I made a few changes, take a look and see what you think and if you have any suggestions. Otherwise, we can try running this on 100 public books and see how it works!

Here's a good set of books to test with
https://archive.org/search.php?query=%22table%20of%20contents%22&sin=TXT

Hitansh-Shah · 2022-02-23T09:35:52Z

@mekarpeles the changes you have made seem perfect to me. I have some minor concerns which I have commented in the respective changes conversation. Other than that I think we are ready to test the first version. 🚀

Hitansh-Shah · 2022-03-06T13:32:34Z

Hey @mekarpeles can you help me with the 'search query' for retrieving the items? In the link you shared before for set of books to test on, there is a url parameter called sin=TXT which basically searches "Text Contents". I don't know how to state this in query because without it, it will search "metadata". Can you please help me with this?
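
As a small aid for the question above, the shared link can be taken apart with the Python standard library to see exactly which parameters it sends. This only inspects the URL; it does not show how to express sin=TXT through any search API, which remains the open question.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://archive.org/search.php?query=%22table%20of%20contents%22&sin=TXT"

# parse_qs decodes percent-escapes and returns a dict of lists.
params = parse_qs(urlsplit(url).query)
print(params)
```

The decoded query is the quoted phrase "table of contents", and sin=TXT travels as a separate parameter rather than as part of the query string itself.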

ishank-dev · 2024-10-25T08:40:11Z

@Hitansh-Shah is this something you are actively working on? If help is needed here, let me know. This looks like a good add-on to the genome project.

CC: @mekarpeles

TOC detection initial commit

2b4aeca

This was referenced Feb 3, 2022

Create Table of Contents Detector Module #72

Open

Docker should mount (not clone) sequencer repo #83

Closed

configured TocPageDetectionModule to detect the keyword only as header

0149045

mekarpeles reviewed Feb 23, 2022

Update bgp/modules/terms.py

96e071d

mekarpeles reviewed Feb 23, 2022

Update bgp/modules/terms.py

17bcb8c

mekarpeles reviewed Feb 23, 2022

Update bgp/modules/terms.py

efff1a2
