
OKINI-DATAPLATFORM-ETL

❯ Palmify

Built with the tools and technologies:

Elasticsearch, Kibana, Logstash, Docker, Python, Poetry, Pydantic


Table of Contents

  • Overview
  • Getting Started
    • Prerequisites
    • BigQuery Authentication
    • Installation

Overview

This repository contains the source code for the okini-dataplatform-etl project. The project is designed to provide a comprehensive ETL pipeline for the Okini Data Platform.


Resources

  • Calendarific: Public holidays of Japan, Korea, Hong Kong, and Taiwan, sourced from the wiki
  • Booking.com: Hotel data (Namba location)
  • Kyoceradome: Events in Osaka

Getting Started

Prerequisites

Before getting started with okini-dataplatform-etl, ensure your runtime environment meets the following requirements:

  • Programming Language: Python 3.11
  • Package Manager: Poetry or Pip
  • Container Runtime: Docker

BigQuery Authentication

Currently, the project uses OAuth2 to authenticate with the BigQuery API. On the first run, if there is no file named assets/token.pickle, the application prints a URL; open that URL in a browser and log in, and a new assets/token.pickle file will be created.
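
As an illustration only, here is a minimal sketch of how such an OAuth2 flow typically produces assets/token.pickle. It assumes google-auth-oauthlib is used and that a client-secrets file exists at assets/credentials.json; both the library and the file path are assumptions, and the repo's actual code may differ.

# Hypothetical sketch of the OAuth2 flow described above, not the repo's actual code.
# Assumes google-auth / google-auth-oauthlib are installed and that a client-secrets
# file exists at assets/credentials.json.
import pickle
from pathlib import Path

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ["https://www.googleapis.com/auth/bigquery"]
TOKEN_PATH = Path("assets/token.pickle")

def load_bigquery_credentials():
    creds = None
    if TOKEN_PATH.exists():
        creds = pickle.loads(TOKEN_PATH.read_bytes())
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            # Prints/opens an authorization URL; after logging in, credentials are returned.
            flow = InstalledAppFlow.from_client_secrets_file("assets/credentials.json", SCOPES)
            creds = flow.run_local_server(port=0)
        TOKEN_PATH.write_bytes(pickle.dumps(creds))
    return creds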

Installation

Install okini-dataplatform-etl using one of the following methods:

Build from source:

  1. Clone the okini-dataplatform-etl repository:
❯ git clone https://github.com/palmify/okini-dataplatform-etl
  2. Navigate to the project directory:
cd okini-dataplatform-etl
  3. Copy the .env and .env.prod files to the root directory of the project (a sketch of how such settings are typically loaded is shown below).
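
Purely as an illustration of what these environment files carry, here is a minimal settings loader using pydantic-settings (Pydantic is listed among the project's technologies, but the actual settings module, class, and variable names here are assumptions):

# Hypothetical settings loader, not the repo's actual code.
# Assumes pydantic-settings is installed and that .env / .env.prod
# contain the S3 keys created in the Minio step further below.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    S3_ACCESS_KEY_ID: str = ""
    S3_SECRET_ACCESS_KEY: str = ""
    S3_ENDPOINT_URL: str = "http://localhost:9000"  # assumed Minio S3 API port

settings = Settings()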

  4. Run all the dependencies using docker compose:

❯ docker compose up -d

Build images on the local machine:

docker build -t okini-dataplatform-etl -f docker/Dockerfile .
docker tag okini-dataplatform-etl:latest louispalmify/development:okini-dataplatform-etl
docker push louispalmify/development:okini-dataplatform-etl
  5. Next, go to the Minio object storage GUI (http://localhost:9001) and create values for S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY (a quick way to verify the new keys is sketched below):
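
If you want to confirm the new keys work before wiring them into .env.prod, a minimal check along these lines can help. It assumes boto3 is available and that Minio's S3 API is exposed on port 9000 (the console on 9001 suggests the default mapping, but verify against the docker-compose file):

# Hypothetical credential check, not part of the repo.
# Assumes boto3 is installed and Minio's S3 API listens on localhost:9000.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",        # assumed Minio S3 API endpoint
    aws_access_key_id="<S3_ACCESS_KEY_ID>",      # value created in the Minio console
    aws_secret_access_key="<S3_SECRET_ACCESS_KEY>",
)

# Listing buckets succeeds only if the key pair is valid.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])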

  6. Copy the created S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY values into the .env.prod file. Leave .env as it is.

  7. Run docker compose down to stop the running containers without removing the volumes.

  8. Remove the previously built image of the okini-dataplatform-etl-dagster-app repo:

❯ docker rmi <IMAGE_ID>
  9. Now run docker compose up -d again to start the containers with the updated .env.prod file.

  10. Go to the Dagster dashboard at http://localhost:3124, open the Jobs tab, and turn on the job schedulers (an illustrative schedule definition is sketched below).
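
For orientation, a Dagster schedule in this kind of project typically looks like the sketch below. The op, job, and schedule names here are hypothetical; the repo's real definitions will differ:

# Hypothetical Dagster job + schedule, for illustration only.
from dagster import Definitions, ScheduleDefinition, job, op

@op
def crawl_hotels():
    """Placeholder op standing in for one of the project's crawlers."""
    ...

@job
def crawl_hotels_job():
    crawl_hotels()

# Turning a scheduler "on" in the dashboard activates a definition like this one.
daily_crawl_schedule = ScheduleDefinition(
    job=crawl_hotels_job,
    cron_schedule="0 2 * * *",  # every day at 02:00
)

defs = Definitions(jobs=[crawl_hotels_job], schedules=[daily_crawl_schedule])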

  11. Optionally, forward these URLs to domains (or expose them another way) so that the logs of the data crawler can be tracked remotely:
http://localhost:9001 // Minio
http://localhost:5601 // Kibana
http://localhost:3124 // Dagster Webserver
http://localhost:5900 // VNC Viewer