ontology repacking and exporting - relates to #34 by dimatr · Pull Request #44 · graph-genome/component_segmentation

dimatr · 2020-04-28T22:04:41Z

Please review.

To use the code, add --do-ttl True to the parameters.
The output .ttl files go to the {bin_width}-turtle folder beside the main .json output folder, e.g. 1000-turtle.
Please double-check the repacking logic in PangenomeSchematic.py.
Please check whether the bin/path/etc. ids are used correctly.
Do not run it on the heavy sets; the Turtle repack+dump may take a really long time.

subwaystation · 2020-04-29T15:52:19Z

Hi @dimatr the good news first:
http://ttl.summerofcode.be/ says you produced valid .ttl syntax! Good job. One minor thing: Please remove the newlines between each record, this will save some space, hopefully. It will still be valid Turtle.

For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with.
I ran

python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1 -t svg1.gfa.og.w1.json.ttl

And got cs_svg1.zip.
At first glance everything looks good, except the following:

Each path name in a FALDO region is always path0. Here the corresponding path name should pop up.
Path names in vg:linkPaths are counted up from 1-n. Please replace these with the corresponding names.
vg:Link: I am missing forwardLinkEdge and reverseLinkEdge.
vg:Bin: I am missing forwardBinEdge and reverseBinEdge.

6br · 2020-05-07T02:03:58Z

Thanks for your implementation!
My comments:

linkRank or forwardLinkEdge/reverseLinkEdge should be defined between two links, by sorting all paths with two keys (upstream, downstream).
I think vg:Link does not affect forwardBinEdge and reverseBinEdge; it just connects bins through arrival and departure.
More precisely, a faldo:Region consists of two exact positions that point to a reference, cf. https://github.com/OBF/FALDO.
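The two-key sort suggested above could be sketched roughly as follows. This assumes each link is a plain (upstream, downstream) pair of bin ids; the name `rank_links` and the tuple layout are hypothetical and not taken from the repository.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: a link is an (upstream, downstream) pair of bin ids.
Link = Tuple[int, int]


def rank_links(links: Iterable[Link]) -> Dict[Link, int]:
    """Assign a 1-based rank to each distinct link after sorting by
    (upstream, downstream)."""
    # set() drops repeated links so every distinct pair gets exactly one rank;
    # tuples sort by their first element, then by their second.
    ordered = sorted(set(links))
    return {link: rank for rank, link in enumerate(ordered, start=1)}


ranks = rank_links([(5, 2), (1, 7), (1, 3), (5, 2)])
```

Ranking distinct pairs keeps the numbering stable regardless of the order in which paths contribute their links.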

subwaystation · 2020-05-11T13:30:52Z

I see

<5/region/28-28> a faldo:Region ;
    faldo:begin <5/position/28> ;
    faldo:end <5/position/28> .

which does not make sense, because the overall pangenome nucleotide length is 17. See

[heumos@wave svg_opg1]$ odgi stats -S -i svg1.gfa.og -V
length:	17
nodes:	4
edges:	7
paths:	3
cov	sets
3	5,5-,
10	5,5-,6,
4	6,

No FALDO regions for paths -5 and 6. This means that all classes relying on that information have the wrong regions, as mentioned above.
An addition to @6br FALDO comment: For single positions, which we will have for w1, an exact position is sufficient -> https://github.com/OBF/FALDO#single-position. We don't need a full FALDO region.
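A minimal sketch of that rule: emit a single faldo:ExactPosition when begin and end coincide, and a faldo:Region otherwise. The function names are hypothetical, the IRI layout simply mirrors the `<5/region/28-28>` example above, and prefix declarations are left out.

```python
def position_iri(path: str, pos: int) -> str:
    """IRI of one position on a path (layout copied from the example above)."""
    return "<{0}/position/{1}>".format(path, pos)


def faldo_location(path: str, begin: int, end: int) -> str:
    """Return one Turtle record for a nucleotide range.

    A single position (begin == end) becomes a faldo:ExactPosition;
    anything longer becomes a faldo:Region with begin and end positions.
    """
    if begin == end:
        return "{0} a faldo:ExactPosition ;\n    faldo:position {1} .".format(
            position_iri(path, begin), begin
        )
    template = (
        "<{0}/region/{1}-{2}> a faldo:Region ;\n"
        "    faldo:begin {3} ;\n"
        "    faldo:end {4} ."
    )
    return template.format(
        path, begin, end, position_iri(path, begin), position_iri(path, end)
    )
```

For a bin width of 1 every range collapses to the first branch, so no Region records would be written at all.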

dimatr · 2020-05-11T14:05:43Z

region begin/end are taken from the bin.nucleotide_ranges, and those can go up to 28 - see the input path "6". Please check if this logic is correct https://github.com/graph-genome/component_segmentation/blob/ontology/matrixcomponent/PangenomeSchematic.py#L156
fixed
fixed

subwaystation · 2020-05-11T14:11:22Z

I just realized, the testing GFA has two 6 paths in it....... sorry @dimatr .
I will fix this and start a fresh testing.

subwaystation · 2020-05-11T14:13:08Z

This would also explain the weird odgi bin outputs we observed. So no bug, I guess :) Just a dumb user.

6br

Thanks for the update. I have left some comments.

matrixcomponent/PangenomeSchematic.py

matrixcomponent/ontology.py

subwaystation · 2020-05-12T15:36:15Z

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/region/3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .

We should have a faldo:exactPosition and not a vg:cellRegion here, right? Otherwise, we could not connect to the exact positions.

The cleaned data:
svg_opg1_12052020.zip

I think CS is outputting more links compared to what odgi bin gives us. Short example:

<pg/zoom1/link2> a vg:Link ;
    vg:arrival <pg/zoom1/component3/bin5> ;
    vg:linkPaths <5>,
        <5-> .

But in the odgi bin output using -g, such a link does not exist. CS must make them up?! @6br @dimatr Can you confirm this?
By the way, I saw these links also in the usual CS output, so it is not an issue with the Turtle implementation.

subwaystation · 2020-05-12T16:42:56Z

I decided to open an issue for my concerns about CS #48.

dimatr · 2020-05-12T22:22:18Z

I have made another update. The Cell - Region - ExactPosition relations are now understandable:

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/3-3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .

<5/3-3> a faldo:Region ;
    faldo:begin <5/3> ;
    faldo:end <5/3> .

<5/3> a faldo:ExactPosition ;
    faldo:position 3 .

This way all the objects go down to the atomic ones. vg:Cell should contain vg:Region - this is in the vg schema.

I use a short identifier <path1/2-3> instead of a longer version <path1/region/2-3> as in the example. Is it an acceptable approach?

6br · 2020-05-13T03:51:44Z

For me, <path1/2-3> is natural and acceptable. My concern is if there is a way to describe the path name as objects explicitly.

6br · 2020-05-13T08:43:31Z

@dimatr Is it possible to embed a path on faldo:ExactPosition?

<5/1> a faldo:ExactPosition ;
    faldo:position 1 ;
    faldo:reference 5 .

The orientation of path can be encoded like this (if the reference is positive strand)

_:1b   a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference ddbj:XXXDSDS .

Because whether the path is inverted is already encoded in vg:inversionPercent, I think we can dispense with faldo:ForwardStrandPosition.

6br · 2020-05-13T09:35:25Z

As @JervenBolleman suggests, it's better to set a path name as an independent subject.

<chr1/exactposition/1>  a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference <chr1> .
<chr1> a vg:Path .
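As a rough illustration of this layout, the sketch below writes the path once as its own subject and lets every position point back to it through faldo:reference. The function name `position_records` is hypothetical; the IRI pattern is copied from the example above and prefix declarations are omitted.

```python
from typing import Iterable, List


def position_records(path: str, positions: Iterable[int]) -> List[str]:
    """Build Turtle records for one path and its exact positions."""
    # The path is declared once as an independent subject.
    records = ["<{0}> a vg:Path .".format(path)]
    template = (
        "<{0}/exactposition/{1}> a faldo:ExactPosition, faldo:ForwardStrandPosition ;\n"
        "    faldo:position {1} ;\n"
        "    faldo:reference <{0}> ."
    )
    # Repeated positions are written only once, in ascending order.
    for pos in sorted(set(positions)):
        records.append(template.format(path, pos))
    return records


records = position_records("chr1", [2, 1, 1])
```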

subwaystation · 2020-05-26T12:07:46Z

@dimatr here is the current output of odgi bin with the orientation encoded in the ranges:
svg1.gfa.og.w1.json.zip
Command line I used:

/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json

Now it should be possible to encode the orientation of a path for all cases.

Do you have any more questions? Need some feedback? Or comments from @6br ?

6br · 2020-05-26T12:21:53Z

Feel free to contact me at any time when you have any questions!

queries/selectBins1To5OfZoomlevel1.rq

6br · 2020-06-25T04:02:52Z

I agree that we should have IRIs consistent with SpOdgi. I would also like to replace pg with vg so it will be easier to sync with SpOdgi.

* parallel write out of the gzip compressed ontology files - no memory leaks due to the utilization of separate processes!
* use the N-triples format to be 10x quicker than the Turtle (see format='nt' in PangenomeSchematic.py)
* be gentle with the string variables, do not use "a"+"b" but rather "{0}{1}".format(a,b). This does not create small temporary objects and leads to lower memory fragmentation/leaks
* the RDF output folder is named *-rdf
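A minimal sketch of the gzip-compressed N-Triples write-out with a reused format template. It is not the code from PangenomeSchematic.py: the function name `write_nt_gz` is hypothetical, and it assumes every subject, predicate and object is already an absolute IRI (no literals).

```python
import gzip
from typing import Iterable, Tuple

# Assumption: subject, predicate and object are all absolute IRIs.
Triple = Tuple[str, str, str]


def write_nt_gz(filename: str, triples: Iterable[Triple]) -> int:
    """Write triples as gzip-compressed N-Triples and return the line count."""
    # One reusable template instead of repeated "a" + "b" concatenation.
    line = "<{0}> <{1}> <{2}> .\n"
    count = 0
    with gzip.open(filename, "wt", encoding="utf-8") as handle:
        for subject, predicate, obj in triples:
            handle.write(line.format(subject, predicate, obj))
            count += 1
    return count
```

Because the function only needs a file name and an iterable, it could be called from separate worker processes with one output file per chunk.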

dimatr · 2020-07-04T17:38:23Z

I have just pushed a big update targeting large datasets, please have a look.

subwaystation · 2020-07-06T14:10:36Z

Nice work @dimatr ! Really cool.

PubSeq

I can get an RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :)
But it gives me the following warning:

[06/07/2020 08:48:39 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[06/07/2020 08:48:39 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[06/07/2020 08:48:39 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[06/07/2020 08:48:39 - INFO - __main__ - 254] Found 33806 unique links
[06/07/2020 08:48:39 - INFO - __main__ - 105] Created dividers
[06/07/2020 08:48:39 - INFO - __main__ - 118] Created 7999 components
[06/07/2020 08:49:08 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[06/07/2020 08:49:08 - INFO - __main__ - 122] populated matrix
[06/07/2020 08:49:09 - INFO - __main__ - 152] Created 12010 LinkColumns
/home/ubuntu/software/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[06/07/2020 09:09:13 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    22m9.622s
user    72m51.954s
sys     1m37.888s

The resulting n3 is 108GB uncompressed! So no use putting it into a SPARQL endpoint on a machine with 64GB for now.

Pantograph Demo

I also ran it on the current Pantograph demo data which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM.
Of course, I tried our default queries, and the one which should retrieve all the links, including arrival, departure, linkPaths
and linkZoomLevel did not succeed. I double-checked and for some links departure exists but not the corresponding arrival. Please check the attachment for the complete list. If you want to take a closer look at these files just ping me.
In my example tests, this problem did not occur. Might be due to the nature of the SARS data, but still, we would need these links fully serialized to RDF. Or not at all in the RDF. Whatever the biology says.
selectLinksResult.txt

Path Encoding

The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the path/.

6br · 2020-07-17T03:35:58Z

PubSeq

It's possible to tune parameters to avoid warnings.

Pantograph Demo

At first glance at selectLinksResult.txt, the same link identifier is shared among more than two bins, which might be the problem.
I suspect that something is wrong in some boundary condition, but I haven't checked yet.

Path Encoding

I'll vote for adding the prefix path/ in the path identifier. I'll update it if there is no objection.

subwaystation · 2020-07-17T09:27:26Z

PubSeq

Cool, that would be awesome.

Pantograph Demo

I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done in departures. Maybe he can give you some tips.
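A sanity check of this kind could be sketched as below, assuming departures and arrivals are both available as (upstream, downstream) link pairs. The function name and data layout are hypothetical and do not reflect the actual structures in the repository.

```python
from typing import Iterable, Set, Tuple

# Hypothetical layout: a link is an (upstream, downstream) pair of bin ids.
Link = Tuple[int, int]


def unmatched_links(
    departures: Iterable[Link], arrivals: Iterable[Link]
) -> Tuple[Set[Link], Set[Link]]:
    """Return (links lacking an arrival, links lacking a departure)."""
    dep, arr = set(departures), set(arrivals)
    # Set difference keeps the links recorded on one side only.
    return dep - arr, arr - dep


missing_arrival, missing_departure = unmatched_links(
    departures=[(3, 9), (4, 5)],
    arrivals=[(4, 5), (7, 8)],
)
```

Running the check in both directions would flag a half-serialized link regardless of which side was dropped.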

Path Encoding.

Yes, I vote for it! I had that, but I added it only in the Path ontology object. Then I realized, we need to alter the code in every ontology object, that emits a path. So be careful, when changing this!

6br · 2020-07-18T14:41:32Z

PubSeq

[18/07/2020 14:07:07 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[18/07/2020 14:07:07 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[18/07/2020 14:07:07 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[18/07/2020 14:07:07 - INFO - __main__ - 254] Found 33806 unique links
[18/07/2020 14:07:07 - INFO - __main__ - 105] Created dividers
[18/07/2020 14:07:07 - INFO - __main__ - 118] Created 7999 components
[18/07/2020 14:07:35 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[18/07/2020 14:07:35 - INFO - __main__ - 122] populated matrix
[18/07/2020 14:07:35 - INFO - __main__ - 152] Created 12010 LinkColumns
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[18/07/2020 14:31:28 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    26m12.550s
user    19m48.632s
sys     0m22.386s

The warning messages have now disappeared.

6br · 2020-07-21T10:23:24Z

matrixcomponent/ontology.py

+        self.path = path
+
+    def ns_term(self):
+        return "path/{0}".format(self.path) # path1


All paths should be changed like that.
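One way to keep the prefix consistent is to let every object that emits a path delegate to the path's own ns_term. The classes below are simplified, hypothetical stand-ins for the ones in matrixcomponent/ontology.py, and the position IRI layout is an assumption.

```python
class Path:
    """Simplified stand-in for a path ontology object."""

    def __init__(self, path: str) -> None:
        self.path = path

    def ns_term(self) -> str:
        # Single place where the 'path/' prefix is added.
        return "path/{0}".format(self.path)


class Position:
    """Hypothetical object that refers to a path instead of re-building its IRI."""

    def __init__(self, path: Path, position: int) -> None:
        self.path = path
        self.position = position

    def ns_term(self) -> str:
        # Delegating to Path.ns_term() keeps listing and referring in sync.
        return "{0}/position/{1}".format(self.path.ns_term(), self.position)


term = Position(Path("5"), 3).ns_term()
```

With this arrangement a later change to the prefix only touches Path.ns_term.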

6br · 2020-07-23T07:07:58Z

In pubseq data, path name looks like
<path/http://collections.lugli.arvadosapi.com/c=13a2b522d373d0f6bfd95a58f821c677+123/sequence.fasta>
Is it fine?

Furthermore, cell name looks like
<http://example.org/vg/zoom1/component4/bin3/cellhttp://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

6br · 2020-07-26T03:41:50Z

cell name is now updated to
<http://example.org/vg/zoom1/component4/bin3/cell/path<pathname>/http://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

6br · 2020-07-26T05:51:39Z

I added an assertion when building the pangenomic sequence, and many downstreams appear to be missing.

[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 114499
[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 110940
[26/07/2020 05:44:53 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101825
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 60195
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 33805
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 98595
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42908
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 105785
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 121714
[26/07/2020 05:44:56 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 34421
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 55038
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 80784
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 104129
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 76952
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101961
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 59040
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42975
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 30423

6br · 2020-07-28T09:27:52Z

No-arrivals error is fixed

I added an assertion that fires when there are no arrivals, so we can easily tell when arrivals are missing.
I found that this error is due to multi-processing. With multi-processing disabled, I got the RDF output without any errors.

Saved file2bin mapping to PubSeqTurtle_02062020_2/bin2file.json
[28/07/2020 07:59:10 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
24310.23user 418.13system 6:27:39elapsed 106%CPU (0avgtext+0avgdata 316126732maxresident)k
264inputs+235223048outputs (0major+118306390minor)pagefaults 0swaps
$ ~/data/PubSeqTurtle_02062020/PubSeqTurtle_02062020_2/1-rdf$ ls -ltrh
total 13G
-rw-rw-r-- 1 ubuntu ubuntu 13G Jul 28 07:22 seq_chunk00000_bin1.nt.gz

josiahseaman · 2020-09-08T21:22:28Z

This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review?

do the data repack into the ontology containers with further RDF 'tur…

acb580f

…tle' format dump

dimatr assigned josiahseaman, subwaystation and 6br Apr 28, 2020

dimatr added 2 commits April 30, 2020 20:39

Merge remote-tracking branch 'remotes/origin/master' into ontology

5c39f2a

# Conflicts: # matrixcomponent/PangenomeSchematic.py

further ontology fixes: real path names; better forward* and reverse*…

44caccc

… edges definition for Component and Bin containers; Links are a part of ZoomLevel component

josiahseaman requested review from 6br and subwaystation May 6, 2020 21:44

josiahseaman removed their assignment May 6, 2020

- do not forget to store the path_id

29550fd

- use faldo.ExactPosition when appropriate

6br reviewed May 12, 2020

matrixcomponent/PangenomeSchematic.py (outdated)

matrixcomponent/ontology.py

matrixcomponent/ontology.py

subwaystation reviewed May 12, 2020

matrixcomponent/ontology.py (outdated)

- explicit faldo:ExactPosition containers

b8d0e16

- every Link has linkRank numbered after the pair sort (component.id, [component.departures])

dimatr added 3 commits June 16, 2020 09:59

Merge branch 'master' into ontology

1eaacd0

# Conflicts: # matrixcomponent/JSONparser.py # matrixcomponent/PangenomeSchematic.py # matrixcomponent/matrix.py # segmentation.py

write faldo:ForwardStrandPosition, faldo:position and faldo:reference…

26e6901

… in each Position. Add Path write out

create ontology folder when needed

db95d1d

6br reviewed Jun 25, 2020

queries/selectBins1To5OfZoomlevel1.rq (outdated)

subwaystation and others added 3 commits June 26, 2020 14:25

Link -> ZoomLevel, replace pg with vg, add 'path/'

d2eb34c

Actually emit the position percentage instead of the coverage of a bin.

209d697

Update requirements.txt

46d9c20

6br reviewed Jul 21, 2020

Add 'path/' on cells

1f5a379

6br added 3 commits July 26, 2020 13:01

Add assertion

e014b73

Add assertion

9beafb3

Add logger info

8a336d9

6br force-pushed the ontology branch from 762ee76 to 8a336d9 Compare July 26, 2020 05:20

6br added 4 commits July 26, 2020 14:34

Register logger

61ceb22

Merge branch 'ontology' of https://github.com/graph-genome/component_…

7e9f6f5

…segmentation into ontology

Register logger

6378acd

Register logger

545e4c1

6br force-pushed the ontology branch from ff17c50 to 545e4c1 Compare July 26, 2020 05:39

6br added 2 commits July 26, 2020 14:54

Add bin logger

86eca81

Disable parallel on rdf writer

f704767

josiahseaman requested a review from subwaystation September 8, 2020 21:22
