Support intermediate artifacts #683

@PertuyF

Hi all, thank you so much for developing LineaPy, looks great and I'm really excited about it!

Is your feature request related to a problem? Please describe.

When I develop a pipeline, I may want to include meaningful intermediate steps on the way to my refined dataset table. For example, master_data would be the data loaded and assembled from a relational DB, whereas dataset would be the same table refined with some feature engineering.

Currently, to do this I save both master_data and dataset as artifacts, then create a pipeline like:

lineapy.to_pipeline(artifacts=[master_data.name, dataset.name], 
                    dependencies={dataset.name: {master_data.name}},
                    framework='AIRFLOW', pipeline_name='my_great_airflow_pipeline', output_dir='airflow')

My issue is that LineaPy then generates a step that builds master_data from scratch and another step that also builds dataset from scratch, instead of loading master_data as a starting point. Like:

import pickle


def master_data():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    artifact = pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    iris_clean = iris_agg.dropna().assign(test="test")
    dataset = pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )

Describe the solution you'd like

Ideally, LineaPy would capture the dependency and build dataset by loading the saved master_data artifact as a starting point, instead of recomputing it from scratch. Something like:

import pickle


def master_data():

    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names).assign(
        target=[iris.target_names[i] for i in iris.target]
    )
    iris_agg = df.set_index("target")
    pickle.dump(
        iris_agg, open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "wb")
    )


def dataset():

    import pandas as pd

    iris_agg = pickle.load(
        open("/home/oneai/.lineapy/linea_pickles/10dzyzx", "rb")
    )
    iris_clean = iris_agg.dropna().assign(test="test")
    dataset = pickle.dump(
        iris_clean, open("/home/oneai/.lineapy/linea_pickles/5Tk63gO", "wb")
    )
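
To make the requested behavior concrete without relying on LineaPy internals, here is a minimal, standard-library-only sketch of the pattern. It is not LineaPy's implementation: the names `STORE`, `STEPS`, `DEPENDENCIES`, `build_master_data`, `build_dataset` and `run_pipeline` are hypothetical, plain lists of dicts stand in for the DataFrames, and the pickle file names are made up. The idea is only that a dependency mapping shaped like the `dependencies` argument above fixes the run order, and that the downstream step loads the upstream pickle rather than rebuilding it.

```python
import pickle
import tempfile
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path

# Hypothetical storage location; LineaPy's real pickle paths differ.
STORE = Path(tempfile.mkdtemp())


def build_master_data(store: Path) -> None:
    # Stand-in for the expensive "load and assemble from the DB" step.
    rows = [
        {"target": "setosa", "sepal_length": 5.1},
        {"target": "virginica", "sepal_length": None},
    ]
    with open(store / "master_data.pkl", "wb") as f:
        pickle.dump(rows, f)


def build_dataset(store: Path) -> None:
    # Reuse the upstream artifact instead of recomputing it.
    with open(store / "master_data.pkl", "rb") as f:
        rows = pickle.load(f)
    # Rough equivalent of dropna().assign(test="test") on plain dicts.
    clean = [dict(r, test="test") for r in rows if r["sepal_length"] is not None]
    with open(store / "dataset.pkl", "wb") as f:
        pickle.dump(clean, f)


STEPS = {"master_data": build_master_data, "dataset": build_dataset}
# Same shape as the `dependencies` argument: step -> set of upstream steps.
DEPENDENCIES = {"dataset": {"master_data"}}


def run_pipeline(store: Path) -> list[str]:
    # Give every step an entry so steps without dependencies still run.
    graph = {name: DEPENDENCIES.get(name, set()) for name in STEPS}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        STEPS[name](store)
    return order
```

Calling `run_pipeline(STORE)` runs master_data first and then dataset, and the second step never repeats the loading logic of the first.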

Is it planned to support this behavior?
Am I missing something?
