112 changes: 54 additions & 58 deletions README.md
@@ -1,86 +1,82 @@
# Backend Engineering Challenge

Welcome to our Engineering Challenge repository 🖖

If you found this repository it probably means that you are participating in our recruitment process. Thank you for your time and energy. If that's not the case, please take a look at our [openings](https://unbabel.com/careers/) and apply!

Please fork this repo before you start working on the challenge, read it carefully, and take your time to think about the solution. We will evaluate the code on your fork.

This is an opportunity for us both to work together and get to know each other in a more technical way. If you have any questions, please open an issue and we'll reach out to help.

Good luck!

## Challenge Scenario

At Unbabel we deal with a lot of translation data. One of the metrics we use for our clients' SLAs is the delivery time of a translation.

In the context of this problem, and to keep things simple, our translation flow is going to be modeled as only one event.

### *translation_delivered*

```json
{
  "timestamp": "2018-12-26 18:12:19.903159",
  "translation_id": "5aa5b2f39f7254a75aa4",
  "source_language": "en",
  "target_language": "fr",
  "client_name": "airliberty",
  "event_name": "translation_delivered",
  "duration": 20,
  "nr_words": 100
}
```

## Challenge Objective

Your mission is to build a simple command line application that parses a stream of events and produces an aggregated output. In this case, we're interested in calculating, for every minute, a moving average of the translation delivery time for the last X minutes.

If we want to count, for each minute, the moving average delivery time of all translations for the past 10 minutes, we would call your application like this (feel free to name it anything you like!):

```
unbabel_cli --input_file events.json --window_size 10
```

The input file format would be something like:

```
{"timestamp": "2018-12-26 18:11:08.509654","translation_id": "5aa5b2f39f7254a75aa5","source_language": "en","target_language": "fr","client_name": "airliberty","event_name": "translation_delivered","nr_words": 30, "duration": 20}
{"timestamp": "2018-12-26 18:15:19.903159","translation_id": "5aa5b2f39f7254a75aa4","source_language": "en","target_language": "fr","client_name": "airliberty","event_name": "translation_delivered","nr_words": 30, "duration": 31}
{"timestamp": "2018-12-26 18:23:19.903159","translation_id": "5aa5b2f39f7254a75bb3","source_language": "en","target_language": "fr","client_name": "taxi-eats","event_name": "translation_delivered","nr_words": 100, "duration": 54}
```

Assume that the lines in the input are ordered by the `timestamp` key, from lower (oldest) to higher values, just like in the example input above.

The output file would be something in the following format:

```
{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}
{"date": "2018-12-26 18:12:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:13:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:14:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:15:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:17:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:18:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:19:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:20:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:21:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:22:00", "average_delivery_time": 31}
{"date": "2018-12-26 18:23:00", "average_delivery_time": 31}
{"date": "2018-12-26 18:24:00", "average_delivery_time": 42.5}
```

For example, at 18:16 the 10-minute window contains the events delivered at 18:11:08 (duration 20) and 18:15:19 (duration 31), so the average is (20 + 31) / 2 = 25.5.

#### Notes

Before jumping right into implementation we advise you to think about the solution first. We will evaluate not only whether your solution works, but also the following aspects:

+ Simple and easy to read code. Remember that [simple is not easy](https://www.infoq.com/presentations/Simple-Made-Easy)
+ Comment your code. The easier it is to understand the complex parts, the faster and more positive the feedback will be
+ Consider the optimizations you can do, given the order of the input lines
+ Include a README.md that briefly describes how to build and run your code, as well as how to **test it**
+ Be consistent in your code.

Feel free to include in your solution some of your considerations while doing this challenge. We want you to solve this challenge in the language you feel most comfortable with. Our machines run Python (3.7.x or higher) or Go (1.16.x or higher). If you are thinking of using any other programming language, please reach out to us first 🙏.

Also, if you have any problem, please **open an issue**.

Good luck and may the force be with you

# Unbabel Backend Engineering Challenge - Sequential Processing Approach

This project implements a sequential processing approach to calculating the moving average of translation delivery times. It processes events in chronological order using a sliding window and is designed for memory efficiency and simplicity.

## Features

- **Sequential Processing**: Processes events efficiently in chronological order.
- **Memory Efficiency**: Uses a deque to manage the sliding window.
- **Error Handling**: Handles file-operation and JSON-parsing errors robustly.
- **Logging**: Provides detailed logging for tracking execution and debugging.

## Requirements

- Python 3.7 or higher
- pandas (listed in `requirements.txt`; not imported by `unbabel_cli.py` itself)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/komalparakh05/backend-engineering-challenge.git
   cd backend-engineering-challenge
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv venv
   .\venv\Scripts\activate     # On Windows
   # source venv/bin/activate  # On macOS/Linux
   pip install -r requirements.txt
   ```

## Usage

Run the script with the following command:

```bash
python unbabel_cli.py --input_file <path_to_input_file> --window_size <window_size_in_minutes>
```

The input file should contain one `translation_delivered` event per line, in the JSON format shown in the challenge description above.

Example:

```bash
python unbabel_cli.py --input_file events.json --window_size 10
```

## Testing

To test the code, write unit tests with a framework such as `unittest` or `pytest`, covering key functionality such as parsing input, calculating averages, and handling errors. To set up `pytest`:

Install pytest:

```bash
pip install pytest
```

Run the tests with:

```bash
pytest
```

**Refer to `tests/test_your_script.py` for an example of how the moving-average calculation is tested with `pytest`.**
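
If `pytest` cannot import `unbabel_cli` (the test imports it from the repository root), one option is to invoke it through the interpreter, which also puts the current directory on `sys.path`:

```bash
python -m pytest tests/
```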

## Design Considerations

- **Order of Processing**: Designed to efficiently handle input that is already ordered by timestamp.
- **Memory Optimization**: Uses a deque to minimize memory usage while managing the sliding window (sketched below).
- **Error Resilience**: Incorporates error handling to ensure robust execution.
- **Transparency**: Provides logging to help understand the processing flow and diagnose issues.
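
The following is a simplified, self-contained sketch of the deque-based sliding window described above (hypothetical helper and sample data, not the actual `unbabel_cli.py` implementation shown later in this diff):

```python
from collections import deque
from datetime import datetime, timedelta

def window_average(events, window_size, at_minute):
    """Average the duration of events delivered in the window_size minutes
    before at_minute (a minute-aligned datetime); returns 0 for an empty window."""
    window = deque()
    for ts, duration in events:  # events are assumed ordered by timestamp
        if ts < at_minute:       # events at or after the output minute don't count yet
            window.append((ts, duration))
    # Evict events that fall outside the window, oldest first.
    while window and (at_minute - window[0][0]) >= timedelta(minutes=window_size):
        window.popleft()
    return sum(d for _, d in window) / len(window) if window else 0

# First two events from the challenge sample, evaluated at 18:16 with a 10-minute window:
sample = [
    (datetime(2018, 12, 26, 18, 11, 8, 509654), 20),
    (datetime(2018, 12, 26, 18, 15, 19, 903159), 31),
]
print(window_average(sample, 10, datetime(2018, 12, 26, 18, 16)))  # prints 25.5
```

In `unbabel_cli.py` itself the deque persists across output minutes, so each event is appended once and evicted at most once rather than the window being rebuilt for every minute.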

## Alternative Approach

An alternative solution that focuses on streaming processing and real-time data handling is available in the `streaming_approach` branch. To explore it, check out that branch:

```bash
git checkout streaming_approach
```

This alternative showcases additional data engineering skills and provides a more comprehensive approach to the challenge.
1 change: 1 addition & 0 deletions requirements.txt
@@ -0,0 +1 @@
pandas
49 changes: 49 additions & 0 deletions tests/test_your_script.py
@@ -0,0 +1,49 @@
import pytest
from unittest.mock import mock_open, patch
from unbabel_cli import calculate_moving_average

def test_calculate_moving_average(capsys):
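    # capsys is pytest's built-in fixture that captures text written to stdout/stderr.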
# Sample input data
input_data = """{"timestamp": "2018-12-26 18:11:08.509654", "duration": 20}
{"timestamp": "2018-12-26 18:15:19.903159", "duration": 31}
{"timestamp": "2018-12-26 18:23:19.903159", "duration": 54}
"""

# Expected output based on the actual function behavior
expected_output = [
'{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}',
'{"date": "2018-12-26 18:12:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:13:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:14:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:15:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:17:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:18:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:19:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:20:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:21:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:22:00", "average_delivery_time": 31.0}',
'{"date": "2018-12-26 18:23:00", "average_delivery_time": 31.0}',
'{"date": "2018-12-26 18:24:00", "average_delivery_time": 42.5}',
'{"date": "2018-12-26 18:25:00", "average_delivery_time": 42.5}',
'{"date": "2018-12-26 18:26:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:27:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:28:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:29:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:30:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:31:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:32:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:33:00", "average_delivery_time": 54.0}',
]

# Use mock_open to simulate file reading
with patch("builtins.open", mock_open(read_data=input_data)):
# Call the function with a mock file path and window size
calculate_moving_average("mock_file_path", 10)

# Capture the output
captured = capsys.readouterr()

# Split the output into lines and compare with expected output
actual_output = [line.strip() for line in captured.out.splitlines()]
assert actual_output == expected_output
106 changes: 106 additions & 0 deletions unbabel_cli.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python3

import argparse
import json
from datetime import datetime, timedelta
from collections import deque
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def parse_arguments():
"""Parse command line arguments for input file and window size."""
parser = argparse.ArgumentParser(description='Calculate moving average of translation delivery times.')
parser.add_argument('--input_file', required=True, help='Path to the input JSON file')
parser.add_argument('--window_size', type=int, required=True, help='Window size in minutes')
return parser.parse_args()

def parse_timestamp(timestamp_str):
"""Convert timestamp string to a datetime object."""
try:
return datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S.%f")
except ValueError as e:
logger.error(f"Timestamp parsing error: {e}")
raise

def calculate_moving_average(input_file, window_size):
"""Calculate and print the moving average of delivery times."""
window = deque() # Initialize a deque to store events within the window
current_minute = None # Track the current minute being processed
last_event_minute = None # Track the last event's minute

try:
with open(input_file, 'r') as f:
for line_num, line in enumerate(f, 1):
try:
event = json.loads(line) # Parse each line as a JSON event
except json.JSONDecodeError as e:
logger.warning(f"JSON parsing error on line {line_num}: {e}")
continue

timestamp = parse_timestamp(event['timestamp'])
minute = timestamp.replace(second=0, microsecond=0) # Round down to the nearest minute

if current_minute is None:
current_minute = minute

last_event_minute = minute

# Process each minute up to the current event's minute
while current_minute <= minute:
# Remove events that are outside the window
while window and (current_minute - parse_timestamp(window[0]['timestamp'])).total_seconds() / 60 >= window_size:
window.popleft()

# Calculate the average delivery time for the current minute
if window:
avg_delivery_time = sum(e['duration'] for e in window) / len(window)
else:
avg_delivery_time = 0

# Output the result for the current minute
print(json.dumps({
"date": current_minute.strftime("%Y-%m-%d %H:%M:%S"),
"average_delivery_time": round(avg_delivery_time, 1)
}))

current_minute += timedelta(minutes=1)

# Add the current event to the window
window.append(event)

# Continue processing for the remaining minutes after the last event
# This ensures we calculate the moving average for the entire window size after the last event
while current_minute <= last_event_minute + timedelta(minutes=window_size):
# Remove events that are outside the window
while window and (current_minute - parse_timestamp(window[0]['timestamp'])).total_seconds() / 60 >= window_size:
window.popleft()

# Calculate the average delivery time for the current minute
if window:
avg_delivery_time = sum(e['duration'] for e in window) / len(window)
else:
avg_delivery_time = 0

# Output the result for the current minute
print(json.dumps({
"date": current_minute.strftime("%Y-%m-%d %H:%M:%S"),
"average_delivery_time": round(avg_delivery_time, 1)
}))

current_minute += timedelta(minutes=1)

except FileNotFoundError:
logger.error(f"File not found: {input_file}")
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")

def main():
"""Main function to parse arguments and calculate moving averages."""
args = parse_arguments()
calculate_moving_average(args.input_file, args.window_size)

if __name__ == "__main__":
main()
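
For reference, assuming an `events.json` containing the three sample events from the challenge, `python unbabel_cli.py --input_file events.json --window_size 10` prints one JSON line per minute to standard output, beginning:

```
{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}
{"date": "2018-12-26 18:12:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:13:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:14:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:15:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}
```

and continuing through `18:33:00`; the full expected sequence is listed in `tests/test_your_script.py`.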