Conversation
|
Hi @dimatr the good news first: For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with. And got cs_svg1.zip.
|
# Conflicts: # matrixcomponent/PangenomeSchematic.py
… edges definition for Component and Bin containers; Links are a part of ZoomLevel component
|
Thanks for your implementation!
|
<5/region/28-28> a faldo:Region ;
    faldo:begin <5/position/28> ;
    faldo:end <5/position/28> .
which does not make sense, because the overall pangenome nucleotide length is 17. See
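A small bounds check would flag regions like this one before they are written out. The sketch below is only an illustration: `check_region` and its parameters are hypothetical names that do not come from the PR, and it assumes 1-based, inclusive positions.

```python
def check_region(begin, end, pangenome_length):
    """Return True if a region lies inside the pangenome.

    Hypothetical helper: the name and the 1-based, inclusive
    convention are assumptions, not taken from the PR.
    """
    return 1 <= begin <= end <= pangenome_length


# The region quoted above: position 28 in a pangenome of length 17.
print(check_region(28, 28, 17))  # False
print(check_region(3, 5, 17))    # True
```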
|
- use faldo.ExactPosition when appropriate
|
|
I just realized, the testing GFA has two |
|
This would also explain the weird |
6br
left a comment
Thanks for the update. I have left some comments.
We should have a The cleaned data: I think CS is outputting more links compared to what But in the |
|
I decided to open an issue for my concerns about CS #48. |
- every Link has linkRank numbered after the pair sort (component.id, [component.departures])
|
I have made another update. The This way all the objects go down to the atomic ones. vg:Cell should contain vg:Region, as defined in the vg schema. I use a short identifier <path1/2-3> instead of the longer version <path1/region/2-3> used in the example. Is this an acceptable approach? |
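To make the two identifier shapes concrete, here is a minimal sketch. The helper `region_id` and its `short` flag are hypothetical; only the two output forms, <path1/2-3> and <path1/region/2-3>, come from the comment above.

```python
def region_id(path_name, begin, end, short=True):
    """Build a relative identifier for a region on a path.

    Hypothetical helper; only the two output shapes are taken
    from the discussion.
    """
    if short:
        return "{0}/{1}-{2}".format(path_name, begin, end)
    return "{0}/region/{1}-{2}".format(path_name, begin, end)


print("<{0}>".format(region_id("path1", 2, 3)))               # <path1/2-3>
print("<{0}>".format(region_id("path1", 2, 3, short=False)))  # <path1/region/2-3>
```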
|
For me, |
|
@dimatr Is it possible to embed a path on The orientation of path can be encoded like this (if the reference is positive strand) Because if the path is inverted is encoded in |
|
As @JervenBolleman suggests, it's better to set a path name as an independent subject. |
|
@dimatr here is the current output of Now it should be possible to encode the orientation of a path for all cases. Do you have any more questions? Need some feedback? Or comments from @6br ? |
|
Feel free to contact me at any time when you have any questions! |
# Conflicts: # matrixcomponent/JSONparser.py # matrixcomponent/PangenomeSchematic.py # matrixcomponent/matrix.py # segmentation.py
… in each Position. Add Path write out
|
I agree to have a consistent IRI with |
* parallel write out of the gzip compressed ontology files - no memory leaks due to the utilization of separate processes!
* use the N-triples format to be 10x quicker than the Turtle (see format='nt' in PangenomeSchematic.py)
* be gentle with string variables: do not use "a"+"b" but rather "{0}{1}".format(a,b). This avoids creating small temporary objects and leads to lower memory fragmentation/leaks
* the RDF output folder is named *-rdf
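The write-out strategy in this list can be sketched as follows. This is not the code from the PR: it uses only the Python standard library, formats N-Triples lines by hand instead of calling an RDF library, and the names `to_ntriple`, `write_chunk`, and `write_all` are hypothetical.

```python
import gzip
import os
from multiprocessing import Pool


def to_ntriple(subject, predicate, obj):
    # One N-Triples statement per line; all three terms are IRIs here.
    return "<{0}> <{1}> <{2}> .\n".format(subject, predicate, obj)


def write_chunk(job):
    """Write one chunk of triples to its own gzip-compressed file."""
    out_path, triples = job
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for s, p, o in triples:
            handle.write(to_ntriple(s, p, o))
    return out_path


def write_all(chunks, out_dir, workers=2):
    """Write every chunk in a separate worker process."""
    os.makedirs(out_dir, exist_ok=True)
    jobs = [
        (os.path.join(out_dir, "chunk{0}.nt.gz".format(i)), triples)
        for i, triples in enumerate(chunks)
    ]
    # Each worker process releases its memory when the pool shuts down.
    with Pool(processes=workers) as pool:
        return pool.map(write_chunk, jobs)
```

`write_all` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms that start worker processes by spawning. The output folder name (for example one ending in -rdf) is passed as `out_dir`.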
|
I have just pushed a big update targeting large datasets, please have a look. |
|
Nice work @dimatr! Really cool.
PubSeq
I can get an RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :) The resulting
Pantograph Demo
I also ran it on the current Pantograph demo data, which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM.
Path Encoding
The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the |
PubSeq
It's possible to tune parameters to avoid warnings.
Pantograph Demo
At first glance on
Path Encoding
I'll vote for adding the prefix |
PubSeq
Cool, that would be awesome.
Pantograph Demo
I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done for departures. Maybe he can give you some tips.
Path Encoding
Yes, I vote for it! I had that, but I added it only in the |
PubSeq
The warning messages have now disappeared. |
        self.path = path

    def ns_term(self):
        return "path/{0}".format(self.path)  # path1
|
In pubseq data, path name looks like Furthermore, cell name looks like |
|
cell name is now updated to |
|
I added an assertion when building the pangenomic sequence, and many downstream entries appear to be missing. |
No-arrivals error is fixed: I added an assertion that fires when there are no arrivals, so we can easily tell when arrivals are missing. |
|
This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review? |
Please review.
--do-ttl True: {bin_width}-turtle folder besides the main .json output folder, e.g. 1000-turtle