#datascience

Emulates cloud-native MLOps locally

I have built a data science environment that allows me to construct data pipelines and manage machine learning experiments on a local PC, while providing a user experience that is as cloud-like as possible.

According to Gemini:

This environment functions as a sandbox for learning a "cloud-native development style" by replacing major components used in cloud environments—such as S3, orchestrators, and ML tracking services—with local Docker containers.

Alright. If you say so, this serves as educational content.

The source code (docker-compose.yaml, etc.) is available on GitHub.

System components

  • Data store: Versity S3 Gateway emulates S3 storage.
  • Source code repository: Just a Git daemon.
  • Pipeline: Prefect orchestrates pipelines.
  • Experiment management: MLflow tracks models, parameters, and results.
  • Notebook: Marimo for interactive development.

I initially used MinIO as the S3 alternative, but after they stopped distributing binaries and Docker images, I switched to the Versity S3 Gateway, which exposes the local file system via the S3 protocol.
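
Any standard S3 client can talk to the gateway. As a minimal sketch, here is boto3 pointed at the local endpoint to list buckets; the access keys are placeholders for whatever the gateway is configured with (see the README), and the region is only there to keep request signing happy:

import boto3

# Point a regular S3 client at the local Versity gateway instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:7070",
    region_name="us-east-1",              # arbitrary; only used for signing
    aws_access_key_id="ROOT_ACCESS_KEY",      # placeholder credentials
    aws_secret_access_key="ROOT_SECRET_KEY",  # placeholder credentials
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])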

The system components work together as shown in the following diagram:

graph TD
    classDef host fill:#f5f5f5,stroke:#333,stroke-width:2px;
    classDef app fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
    classDef infra fill:#fff9c4,stroke:#fbc02d,stroke-width:2px;

    subgraph host_net [Host Environment]
        host[PC]:::host
    end

    subgraph docker_net [Docker Network]
        prefect_server(Prefect Server):::app
        prefect_worker(Prefect Worker):::app
        mlflow(MLflow Server):::app

        git[(Git Daemon)]:::infra
        versitygw[(Versity S3)]:::infra
        postgres[(PostgreSQL)]:::infra

        host -->|Code| git -->|Code| prefect_worker
        host -->|Web UI / CLI| prefect_server
        host -->|Web UI| mlflow
        host <-->|Data| versitygw

        prefect_server -.-|Backend| postgres
        prefect_server -->|Run| prefect_worker

        prefect_worker -->|API| mlflow
        prefect_worker <-->|Data| versitygw

        mlflow -.-|Backend| postgres
        mlflow -->|Artifacts| versitygw
    end

(Marimo is omitted for clarity as it is loosely coupled with the rest of the system.)

Usage

The steps for running experiments are as follows:

  1. (If necessary) Upload the data to S3 (http://localhost:7070); a boto3 sketch follows below.
  2. Implement the Prefect workflow (experiments/*.py).
  3. git push the code to the Git repo (http://localhost:9010/repo).
  4. Deploy the workflow to Prefect (prefect deploy).
  5. Run the workflow via Prefect (prefect deployment run).
  6. Check the experiment results in MLflow (http://localhost:5001).

Detailed prerequisite steps, such as setting environment variables, are written in the GitHub README.
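
For step 1, a minimal sketch that creates the data bucket and uploads the CSV with boto3 against the local gateway. The bucket and file names match the s3://data/iris.csv path read by the example flow below; the access keys are placeholders for whatever the gateway is configured with (see the README):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:7070",
    region_name="us-east-1",
    aws_access_key_id="ROOT_ACCESS_KEY",      # placeholder credentials
    aws_secret_access_key="ROOT_SECRET_KEY",  # placeholder credentials
)
s3.create_bucket(Bucket="data")                 # fails if the bucket already exists
s3.upload_file("iris.csv", "data", "iris.csv")  # local file -> s3://data/iris.csv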

Implementing Prefect workflows

Define your experiment workflows and tasks using Prefect decorators (@flow, @task). Insert MLflow functions such as mlflow.log_params() to track everything in MLflow!

import os
from datetime import datetime

import mlflow
import mlflow.data
import mlflow.models
import mlflow.sklearn
import pandas as pd
from prefect import flow, task
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed here: the tracking URI comes from the environment (see the README),
# e.g. MLFLOW_TRACKING_URI=http://localhost:5001.
MLFLOW_TRACKING_URI = os.environ["MLFLOW_TRACKING_URI"]


@flow
def example(n_tests: int = 1) -> None:
    """An example experiment as a Prefect flow with MLflow tracking."""
    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
    mlflow.set_experiment(f"example-experiment-{datetime.now()}")

    df = pd.read_csv("s3://data/iris.csv")
    x = df.drop(columns=["target", "species"])
    y = df["target"]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    test_data = x_test.copy()
    test_data["target"] = y_test
    test_dataset = mlflow.data.from_pandas(test_data, targets="target")

    @task
    def run(n_estimators, criterion) -> None:
        """A run in the experiment."""
        with mlflow.start_run():
            params = {"n_estimators": n_estimators, "criterion": criterion}
            mlflow.log_params(params)
            model = RandomForestClassifier(**params)
            model.fit(x_train, y_train)
            model_info = mlflow.sklearn.log_model(model, input_example=x_train)
            mlflow.models.evaluate(
                model=model_info.model_uri,
                data=test_dataset,
                model_type="classifier",
            )

    for n_estimators in [2**i for i in range(5, 8)]:
        for criterion in ["gini", "entropy", "log_loss"]:
            for _ in range(n_tests):
                run(n_estimators, criterion)
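
Note that pd.read_csv("s3://data/iris.csv") only reaches the local gateway if the S3 endpoint and credentials are configured, which the README handles via environment variables. As an alternative illustration, s3fs also accepts them explicitly through pandas' storage_options; the hostname below assumes the flow runs inside the Docker network under the versitygw service name (from the host it would be http://localhost:7070), and the keys are placeholders:

import pandas as pd

df = pd.read_csv(
    "s3://data/iris.csv",
    storage_options={
        "key": "ROOT_ACCESS_KEY",        # placeholder access key
        "secret": "ROOT_SECRET_KEY",     # placeholder secret key
        # Assumed service name on the Docker network; use localhost:7070 from the host.
        "client_kwargs": {"endpoint_url": "http://versitygw:7070"},
    },
)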

Deploying and running Prefect workflows

Register the workflow with the Prefect server by declaring the deployment configuration in prefect.yaml and running prefect deploy.

deployments:
  - name: experiment
    tags: [test]
    parameters:
      n_tests: 5
    entrypoint: experiments/example.py:example
    work_pool:
      name: sandbox
      job_variables: {}

After deploying, run it with the following command:

$ prefect deployment run example/experiment

You can view the execution status in real time in the Prefect Web UI (http://localhost:4200/runs).
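
The same deployment can also be triggered from Python, for example from a Marimo notebook. A minimal sketch using Prefect's run_deployment, assuming PREFECT_API_URL points at the local server; the parameter override is purely illustrative:

from prefect.deployments import run_deployment

# Trigger the "example/experiment" deployment without blocking until it finishes.
flow_run = run_deployment(
    name="example/experiment",
    parameters={"n_tests": 3},  # illustrative override of the deployment default
    timeout=0,                  # return immediately instead of waiting for completion
)
print(flow_run.id, flow_run.state)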

Checking and analyzing results in MLflow

Access the MLflow Web UI (http://localhost:5001) to review the tracked experiment results and perform comparative analysis based on parameters.
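
Programmatic comparison works as well. A minimal sketch using mlflow.search_runs; the experiment name placeholder must match the timestamped name created by the flow, and the metric column assumes the accuracy_score metric logged by MLflow's default classifier evaluator:

import mlflow

mlflow.set_tracking_uri("http://localhost:5001")

# Load every run of one experiment as a DataFrame and rank parameter combinations.
runs = mlflow.search_runs(experiment_names=["example-experiment-<timestamp>"])
print(
    runs.groupby(["params.n_estimators", "params.criterion"])["metrics.accuracy_score"]
    .mean()
    .sort_values(ascending=False)
)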

Conclusion

I have created a local environment that feels "as cloud-like as possible." Adding something like a data lake (e.g., DuckLake) would likely make it feel even more like the cloud.


Translated from the original post at https://m15a.dev/ja/posts/acap-datasci-env/.