Table of Contents Page detection Draft #82
base: master
|
Also, I had a question about the Dockerfile. If I am not wrong, the container built from the Dockerfile contains a cloned copy of the sequencer git repo, because building the image performs a fresh git clone. So during development we cannot use Docker, because local changes will not be reflected in the container. I don't have much experience with Docker, so I may have gone in the wrong direction. Please correct me if I am mistaken. |
This is true. I'm open to updating this to allow easier local development. A workaround would be pushing your changes to a development branch and changing the Dockerfile to clone that branch instead of master. You could also just use a local environment for development instead of a container. |
|
You're right, I'd remove https://github.com/Open-Book-Genome-Project/sequencer/blob/master/Dockerfile#L3 and then within |
Yup, a virtual env seems to do the work. |
|
I'll submit a new PR for the docker fixes :) This PR seems like it's in the right direction. There may be opportunities for us to tune it to increase accuracy. e.g. How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents. |
I gave it a little thought. The ToC page will always be placed before the main content, so we don't actually have to scan the whole book. If we can set a limit at which the for loop breaks, we should be good to go. |
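A minimal sketch of that idea, assuming the book is available as a list of per-page text strings. The function name `find_toc_page`, the keyword tuple, and the default limit of 30 pages are all hypothetical choices for illustration, not part of the existing `bgp` modules.

```python
from typing import Optional, Sequence

# Hypothetical keywords; the real module's keyword list may differ.
TOC_KEYWORDS = ("table of contents", "contents")


def find_toc_page(pages: Sequence[str],
                  keywords: Sequence[str] = TOC_KEYWORDS,
                  max_pages: int = 30) -> Optional[int]:
    """Return the index of the first page that contains a keyword.

    Only the first `max_pages` pages are scanned, on the assumption that a
    ToC sits in the front matter. Returns None if nothing matches.
    """
    for index, text in enumerate(pages[:max_pages]):
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            return index
    return None
```

With a limit like this, a late mention of "table of contents" deep in the main text is never reached, though a mention inside the front matter would still match.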
Doesn't the module's
sequencer/bgp/modules/terms.py Line 321 in f6f6f86
|
|
@finnless That's totally correct. But this would also flag a book that has no table of contents page yet mentions "table of contents" / "contents" somewhere in its text. As far as I know we can avoid this by 2 methods.
I may have missed or misinterpreted something, so please correct me if I am going in the wrong direction. |
|
This seems like the right line of thinking. What other data on the page may enable us to detect table of contents pages? Also what about the books that use the word contents instead of table of contents? Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up? |
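One way to act on two of these questions (the bare word "contents", and the heading being near the top of the page) could look like the sketch below. It is only an illustration: `has_toc_heading`, the heading pattern, and the `top_lines` default are assumptions, and real OCR text may need a more forgiving match.

```python
import re

# Headings treated as a ToC title; an assumed, non-exhaustive list.
TOC_HEADING = re.compile(r"^\s*(table\s+of\s+contents|contents)\s*$",
                         re.IGNORECASE)


def has_toc_heading(page_text: str, top_lines: int = 5) -> bool:
    """True if one of the first `top_lines` non-empty lines is a ToC
    heading standing on its own line."""
    non_empty = [line for line in page_text.splitlines() if line.strip()]
    return any(TOC_HEADING.match(line) for line in non_empty[:top_lines])
```

Requiring the keyword to stand alone on a line near the top rejects running prose that merely mentions a table of contents, at the cost of missing headings that OCR has merged with other text.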
|
Also, could we use the book page image? Could we build a simple classifier which bounds accuracy? https://arxiv.org/pdf/1306.4631 |
I guess we can pass multiple keywords in the module. Like for the copyright page there are |
|
@mekarpeles I found something interesting today. In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages. It is open source so we can look at the source code. I will see if I can find something useful from it. I am attaching a screenshot from the Document viewer application. |
My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer. |
I guess you are right. Because I can get the same sidebar in chrome too. My bad😅 |
|
So I read the articles @mekarpeles attached. Both of them mainly focused on the characteristics of a ToC. One of them took a more statistical approach, which is a bit complex, to identify the ToC. The other had a relatively simple approach. The main idea I got is that it may not be very accurate to just iterate through pages and look for the I don't know if we should implement ML or whether there are other ways without ML. As of now I have hardly any knowledge of ML, but if we are to implement ML here I don't think it will be very advanced, so I could learn the concepts while implementing them, or at least I will try. |
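As one example of a non-ML check on a ToC "characteristic", the sketch below measures how many lines on a page end in a page number. This particular feature is an illustrative assumption, not necessarily what either article proposes, and the name `page_number_line_ratio` and the regex are hypothetical. The Roman-numeral part is crude and can match ordinary lowercase words made only of the letters i, v, x, l, c.

```python
import re

# A line that ends in an Arabic or lowercase Roman page number,
# e.g. "Preface ..... vii" or "1 Introduction 3".
ENDS_WITH_PAGE_NUMBER = re.compile(r"\S.*?[\s.]+(\d{1,4}|[ivxlc]+)\s*$")


def page_number_line_ratio(page_text: str) -> float:
    """Fraction of non-empty lines that end in something like a page number."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if ENDS_WITH_PAGE_NUMBER.match(line))
    return hits / len(lines)
```

A page could then be treated as a ToC candidate when this ratio exceeds some threshold, which would have to be tuned on real books.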
|
@mekarpeles @finnless . I have made some changes in the Please provide your feedback on this for any improvements or corrections that can be done. After that we can test this on some books. |
from bgp import ia
from bgp import Sequencer
from bgp.modules.terms import TocPageDetectorModule, PageTypeProcessor, CopyrightPageDetectorModule

PageTypeDetectionSequencer = Sequencer({
    "pagetypes": PageTypeProcessor(modules={
        "toc_page": TocPageDetectorModule()
    })
})

book = ia.get_item("9780262517638OpenAccess")

results = PageTypeDetectionSequencer.sequence(book).results

print(results)
This is okay for testing but we'll want to delete this file before we move out of draft
Yeah absolutely. I made this file for my convenience during development. When we decide to merge, I will make a commit to delete this file.
|
@Hitansh-Shah I made a few changes, take a look and see what you think and if you have any suggestions. Otherwise, we can try running this on 100 public books and see how it works! Here's a good set of books to test with |
|
@mekarpeles the changes you have made seem perfect to me. I have some minor concerns which I have commented in the respective changes conversation. Other than that I think we are ready to test the first version. 🚀 |
|
Hey @mekarpeles can you help me with the 'search query' for retrieving the items? In the link you shared before for set of books to test on, there is a url parameter called |
|
@Hitansh-Shah is this something you are actively working on? If help is needed here, let me know. This looks like a good add-on to the genome project. CC: @mekarpeles |
@mekarpeles, in bgp/modules/terms.py I have added the class for TocPageDetectorModule. It is a simple class copied from CopyrightPageDetectorModule. I removed the extractor function and changed the keywords for table of contents. I have also added my sequencer.py, a temporary file in the root of the project, for defining a sequencer which only detects the table of contents page.