Convert a tree structure to a tabular format.
Wrapper around treelib with a few additional features geared towards easier data modelling for analytics engineers.
- Tree structures are often used to represent hierarchical data. However, they are not always easy to work with. This script converts a tree structure from a yaml file to a tabular format.
- The script can also be used to generate unique ids for each node in the tree.
- It reads from a `yaml` file and writes to a `csv` file or an updated `yml` file with ids.

The yaml file should have the structure described below:
```yaml
Hierarchy: # Always start the yaml file with this line
  name: category # This is the name of the dimension used as column for the tabular data
  id_generation: uuid # Possible values for generating node ids: 'name' -> use name, 'incremental' -> generate integers, 'uuid' -> generate uuid, 'error' -> throw error if no id provided
  # Keep this structure for the nodes
  childs:
    - name: subcategory
      childs:
        - name: subsubcategory
          childs:
            - name: subsubsubcategory
    - name: subcategory2
      childs:
        - name: subsubcategory2
```

In Python, write:
```python
from tree2tabular import TreeBuilder

fn = 'my_tree.yaml'
tree = TreeBuilder.from_yaml(fn)
tree.to_csv('my_tree.csv')
```

Output: automatically generated ids and the tree in tabular structure:

| TXT_CATEGORY_LVL1 | TXT_CATEGORY_LVL2 | TXT_CATEGORY_LVL3 | DIM_CATEGORY_LVL1 | DIM_CATEGORY_LVL2 | DIM_CATEGORY_LVL3 |
|---|---|---|---|---|---|
| subcategory | subsubcategory | subsubsubcategory | 7690c4 | 163eed | 6d0573 |
| subcategory2 | subsubcategory2 | subsubcategory2 | 3860c7 | e7921e | e7921e |
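
```python
df = tree.to_dataframe()  # tabular structure as a dataframe
tree.to_yaml('my_tree_with_ids.yaml')  # write the tree back to yaml, including the ids
df = tree.to_parent_child(use_names=True)  # parent/child pairs, using node names
```

Example output of `to_parent_child(use_names=True)`, here for a small family tree:
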
| parent | child |
|---|---|
| Grandma | Mom |
| Mom | Son |
| Mom | Daughter |
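
To make the parent/child format concrete, here is a minimal pure-Python sketch that derives the same pairs from a nested structure shaped like the yaml nodes (`name` / `childs`). The `family` dict and the `parent_child_pairs` helper are hypothetical names used for illustration only; this is not the library's implementation of `to_parent_child`.

```python
# Hypothetical nested structure mirroring the yaml node layout (name / childs).
family = {
    "name": "Grandma",
    "childs": [
        {"name": "Mom", "childs": [{"name": "Son"}, {"name": "Daughter"}]},
    ],
}


def parent_child_pairs(node):
    """Return (parent, child) name pairs in depth-first order."""
    pairs = []
    # `childs` may be missing or null for leaf nodes.
    for child in node.get("childs") or []:
        pairs.append((node["name"], child["name"]))
        pairs.extend(parent_child_pairs(child))
    return pairs


pairs = parent_child_pairs(family)
for parent, child in pairs:
    print(f"| {parent} | {child} |")
```

Each edge of the tree becomes exactly one row, so a tree with `n` nodes yields `n - 1` pairs.
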
- Always start with the keyword `Hierarchy`.
- Provide the following parameters under `Hierarchy`: `name`, `id_generation`, `childs`.
- Each node can have three properties: `name`, `id`, `childs`.
  - The `childs` property is a list of nodes; it can be null if the node has no child nodes.
  - The `id` property is optional. If not provided, it is generated based on the `id_generation` parameter. The value `0` is reserved for the root node.
  - The `name` property is mandatory.
- The `name` parameter has two meanings:
  - At the top of the hierarchy, it is the name of the dimension used as column for the tabular data, e.g. `name: category` will generate the columns `DIM_CATEGORY_LVL1`, `DIM_CATEGORY_LVL2`, etc.
  - Inside the hierarchy, it is the name of the node.
- The `id_generation` parameter accepts the following values:
  - `uuid`: generate a unique id for each node if no id is provided.
  - `name`: use the name of the node as id if no id is provided.
  - `error`: raise an error if no id is provided.
  - `incremental`: generate an incremental id for each node if no id is provided; expects only integer values.
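
The four `id_generation` modes can be sketched as a small dispatch function. This is an illustrative sketch under stated assumptions, not the library's code: `make_id_generator` is a hypothetical helper, the uuid format here is the full 32-character hex string (the library's exact format may differ), and the sketch does not check for collisions between generated and explicitly provided ids.

```python
import uuid
from itertools import count


def make_id_generator(mode):
    """Return a function (name, given_id) -> id for one id_generation mode."""
    counter = count(1)  # start at 1: the value 0 is reserved for the root node

    def generate(name, given_id=None):
        if given_id is not None:
            return given_id  # an explicitly provided id is always kept
        if mode == "uuid":
            return uuid.uuid4().hex
        if mode == "name":
            return name
        if mode == "incremental":
            return next(counter)
        if mode == "error":
            raise ValueError(f"node {name!r} has no id")
        raise ValueError(f"unknown id_generation mode: {mode!r}")

    return generate


gen = make_id_generator("incremental")
first = gen("subcategory")        # generated
second = gen("subcategory2")      # generated
kept = gen("subsubcategory", 42)  # explicit id wins
```

In every mode, generation only applies to nodes without an `id`; nodes that already carry one are left untouched.
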
- The output is a tabular structure with the following columns: `DIM_CATEGORY_LVL1`, `DIM_CATEGORY_LVL2`, etc. and `TXT_CATEGORY_LVL1`, `TXT_CATEGORY_LVL2`, etc.
  - The `DIM_` columns contain the ids of the nodes.
  - The `TXT_` columns contain the names of the nodes.
- Level 1 corresponds to the top of the hierarchy; the higher the level, the deeper the node is in the hierarchy.
- The primary key of the table is the `DIM_` column with the highest level, i.e. the finest granularity.
- There are no blanks: if a node has no child, the `DIM_` and `TXT_` columns of the remaining deeper levels are filled with the id and name of that node.

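The column rules above, including the "no blanks" padding, can be illustrated with a short pure-Python sketch. It assumes the ids are already present on every node and reuses the ids from the example table; `leaf_paths` and `to_rows` are hypothetical helpers, not functions of the library.

```python
def leaf_paths(node, path=()):
    """Yield the (id, name) path from level 1 down to each leaf."""
    for child in node.get("childs") or []:
        child_path = path + ((child["id"], child["name"]),)
        if child.get("childs"):
            yield from leaf_paths(child, child_path)
        else:
            yield child_path


def to_rows(hierarchy):
    """Flatten the hierarchy into one dict per leaf, padding shallow branches."""
    dim = hierarchy["name"].upper()
    paths = list(leaf_paths(hierarchy))
    depth = max(len(p) for p in paths)
    rows = []
    for p in paths:
        # Repeat the leaf in the deeper levels so that no cell is blank.
        padded = p + (p[-1],) * (depth - len(p))
        row = {}
        for level, (node_id, name) in enumerate(padded, start=1):
            row[f"DIM_{dim}_LVL{level}"] = node_id
            row[f"TXT_{dim}_LVL{level}"] = name
        rows.append(row)
    return rows


hierarchy = {
    "name": "category",
    "childs": [
        {"id": "7690c4", "name": "subcategory", "childs": [
            {"id": "163eed", "name": "subsubcategory", "childs": [
                {"id": "6d0573", "name": "subsubsubcategory"},
            ]},
        ]},
        {"id": "3860c7", "name": "subcategory2", "childs": [
            {"id": "e7921e", "name": "subsubcategory2"},
        ]},
    ],
}

rows = to_rows(hierarchy)
```

The second branch is only two levels deep, so its leaf `subsubcategory2` (id `e7921e`) is repeated in level 3, which matches the second row of the example table.
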
You can find examples in the tests > demos* folders.